Identifying Expression Fingerprints using Linguistic Information
This thesis presents a technology to complement taxation-based policy proposals aimed at addressing the digital copyright problem. Theapproach presented facilitates identification of intellectual propertyusing expression fingerprints. Copyright law protects expression of content. Recognizing literaryworks for copyright protection requires identification of theexpression of their content. The expression fingerprints described inthis thesis use a novel set of linguistic features that capture boththe content presented in documents and the manner of expression usedin conveying this content. These fingerprints consist of bothsyntactic and semantic elements of language. Examples of thesyntactic elements of expression include structures of embedding andembedded verb phrases. The semantic elements of expression consist ofhigh-level, broad semantic categories. Syntactic and semantic elements of expression enable generation ofmodels that correctly identify books and their paraphrases 82% of thetime, providing a significant (approximately 18%) improvement over modelsthat use tfidf-weighted keywords. The performance of models builtwith these features is also better than models created with standardfeatures used in stylometry (e.g., function words), which yield anaccuracy of 62%.In the non-digital world, copyright holders collect revenues bycontrolling distribution of their works. Current approaches to thedigital copyright problem attempt to provide copyright holders withthe same kind of control over distribution by employing Digital RightsManagement (DRM) systems. However, DRM systems also enable copyrightholders to control and limit fair use, to inhibit others' speech, andto collect private information about individual users of digitalworks.Digital tracking technologies enable alternate solutions to thedigital copyright problem; some of these solutions can protectcreative incentives of copyright holders in the absence of controlover distribution of works. Expression fingerprints facilitatedigital tracking even when literary works are DRM- and watermark-free,and even when they are paraphrased. As such, they enable meteringpopularity of works and make practicable solutions that encouragelarge-scale dissemination and unrestricted use of digital works andthat protect the revenues of copyright holders, for example throughtaxation-based revenue collection and distribution systems, withoutimposing limits on distribution.
Massachusetts Institute of Technology Computer Science and Artificial Intelligence Laboratory
AI, natural language processing, syntactic information, content, expression