Comment to U.S. Copyright Office on Data Provenance and Copyright

Mahari, Robert; Shayne, Longpre; Donewald, Lisette; Polozov, Alan; Pentland, Alex 'Sandy'; Lipsitz, Ari

dc.contributor.author	Mahari, Robert
dc.contributor.author	Shayne, Longpre
dc.contributor.author	Donewald, Lisette
dc.contributor.author	Polozov, Alan
dc.contributor.author	Pentland, Alex 'Sandy'
dc.contributor.author	Lipsitz, Ari
dc.date.accessioned	2024-04-17T17:35:01Z
dc.date.available	2024-04-17T17:35:01Z
dc.date.issued	2023-11-01
dc.identifier.uri	https://hdl.handle.net/1721.1/154171
dc.description.abstract	Scholars have paid much attention to the copying of raw data to train and develop machine learning models. Many have argued that such use of raw data, derived either directly from the internet or from a dataset, is protected under fair use, such that the owners of the original works may not succeed in a claim for copyright infringement. We refer to such compilations of data derived from another source, and repurposed for machine learning, as unsupervised datasets. Less attention, however, has been paid to supervised datasets, which we define as datasets containing data created for the sole purpose of training machine learning models (mainly for finetuning and alignment). Supervised datasets likely contain copyrightable contributions from the dataset creators in the form of annotations. To the extent that dataset creators have copyright interests in their supervised datasets, model developers must rely on either fair use or a license in order to avoid infringing the work of dataset creators. However, we argue that the unauthorized use of supervised datasets is unlikely to be protected by fair use. Whereas the use of unsupervised data for training machine learning models is distinct from the original purpose of the unsupervised data, the unauthorized use of supervised datasets for training machine learning models is identical to their original purpose. Fair use would therefore likely not apply to the annotations, labels, and curated comments in supervised datasets. For this reason, having a valid license to a supervised dataset may be particularly critical. Unfortunately, our recent research has found that the licenses attached to publicly available supervised datasets are often imprecise, inaccurate, or missing altogether. Model developers may be exposing themselves to an unknown amount of liability. We argue that this is a problem that needs to be addressed and propose a tool that might serve as a launching point for ensuring license transparency.	en_US
dc.language.iso	en_US	en_US
dc.publisher	U.S. Copyright Office	en_US
dc.rights	Attribution-NonCommercial-NoDerivs 3.0 United States	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/3.0/us/	*
dc.subject	Computational Law	en_US
dc.subject	Data Provenance	en_US
dc.subject	Fair Use	en_US
dc.subject	Copyright Law	en_US
dc.subject	Large Language Models	en_US
dc.subject	Generative AI	en_US
dc.subject	AI Regulation	en_US
dc.subject	Regulation by Design	en_US
dc.title	Comment to U.S. Copyright Office on Data Provenance and Copyright	en_US
dc.type	Other	en_US

Files in this item

Name: license_rdf
Size: 811 bytes
Format: application/rdf+xml

Name: COLC-2023-0006-9063_attachment_1 ...
Size: 346.1 KB
Format: PDF

This item appears in the following Collection(s)

Connection Science