Show simple item record

dc.contributor.advisor: Peter Szolovits. (en_US)
dc.contributor.author: Katirai, Hooman (en_US)
dc.contributor.other: Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science. (en_US)
dc.date.accessioned: 2006-11-07T12:45:15Z
dc.date.available: 2006-11-07T12:45:15Z
dc.date.copyright: 2006 (en_US)
dc.date.issued: 2006 (en_US)
dc.identifier.uri: http://hdl.handle.net/1721.1/34526
dc.description: Thesis (S.M.)--Massachusetts Institute of Technology, Engineering Systems Division, Technology and Policy Program; and, Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006. (en_US)
dc.description: Includes bibliographical references (leaves 85-86). (en_US)
dc.description.abstract: Privacy laws are an important facet of our society. But they can also serve as formidable barriers to medical research. The same laws that prevent casual disclosure of medical data have also made it difficult for researchers to access the information they need to conduct research into the causes of disease. But it is possible to overcome some of these legal barriers through technology. The US law known as HIPAA, for example, allows medical records to be released to researchers without patient consent if the records are provably anonymized prior to their disclosure. It is not enough for records to be seemingly anonymous. For example, one researcher estimates that 87.1% of the US population can be uniquely identified by the combination of their ZIP code, gender, and date of birth - fields that most people would consider anonymous. One promising technique for provably anonymizing records is called k-anonymity. It modifies each record so that it matches k other individuals in a population - where k is an arbitrary parameter. This is achieved, for example, by changing specific information such as a date of birth to a less specific counterpart such as a year of birth. (en_US)
dc.description.abstract: (cont.) Previous studies have shown that achieving k-anonymity while minimizing information loss is an NP-hard problem; thus a brute-force search is out of the question for most real-world data sets. In this thesis, we present an open source Java toolkit that seeks to anonymize data while minimizing information loss. It uses an optimization framework and methods typically used to attack NP-hard problems, including greedy search and clustering strategies. To test the toolkit, a number of previously unpublished algorithms and information loss metrics have been implemented. These algorithms and measures are then empirically evaluated using a data set consisting of 1000 real patient medical records taken from a local hospital. The theoretical contributions of this work include: (1) A new threat model for privacy that allows an adversary's capabilities to be modeled using a formalism called a virtual attack database. (2) Rationally defensible information loss measures - we show that previously published information loss measures are difficult to defend because they fall prey to what is known as the "weighted indexing problem." To remedy this problem we propose a number of information loss measures that are in principle more attractive than previously published measures. (en_US)
dc.description.abstract: (cont.) (3) A demonstration that suppression and generalization - two concepts that were previously thought to be distinct - are in fact the same thing, insofar as each generalization can be represented by a suppression and vice versa. (4) A demonstration that Domain Generalization Hierarchies can be harvested to assist the construction of a Bayesian network to measure information loss. (5) A database can be thought of as a sub-sample of a population. We outline a technique that allows one to predict k-anonymity in a population. This allows us, under some conditions, to release records that match fewer than k individuals in a database while still achieving k-anonymity against an adversary according to some probability and confidence interval. While we have chosen to focus our thesis on the anonymization of medical records, our methodologies, toolkit, and command line tools are equally applicable to any tabular data, such as the data one finds in relational databases - the most common type of database today. (en_US)
dc.description.statementofresponsibility: by Hooman Katirai. (en_US)
dc.format.extent: 86 leaves (en_US)
dc.format.extent: 14904672 bytes
dc.format.extent: 14904307 bytes
dc.format.mimetype: application/pdf
dc.format.mimetype: application/pdf
dc.language.iso: eng (en_US)
dc.publisher: Massachusetts Institute of Technology (en_US)
dc.rights: M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. (en_US)
dc.rights.uri: http://dspace.mit.edu/handle/1721.1/7582
dc.subject: Technology and Policy Program. (en_US)
dc.subject: Electrical Engineering and Computer Science. (en_US)
dc.title: A theory and toolkit for the mathematics of privacy : methods for anonymizing data while minimizing information loss (en_US)
dc.type: Thesis (en_US)
dc.description.degree: S.M. (en_US)
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
dc.contributor.department: Massachusetts Institute of Technology. Engineering Systems Division
dc.contributor.department: Technology and Policy Program
dc.identifier.oclc: 70902079 (en_US)
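
The abstract above describes k-anonymity being achieved by generalization, for example replacing a date of birth with a year of birth. The Java sketch below illustrates that idea under simple assumptions (records stored as String arrays, quasi-identifiers given as column indices, ISO-formatted dates); the class and method names are illustrative only and are not taken from the thesis toolkit.

    import java.util.*;

    public class KAnonymityCheck {

        // Count how many records share each quasi-identifier combination.
        static Map<String, Integer> equivalenceClassSizes(List<String[]> records, int[] quasiIds) {
            Map<String, Integer> counts = new HashMap<>();
            for (String[] row : records) {
                StringBuilder key = new StringBuilder();
                for (int col : quasiIds) {
                    key.append(row[col]).append('|');
                }
                counts.merge(key.toString(), 1, Integer::sum);
            }
            return counts;
        }

        // A table is k-anonymous when every equivalence class has at least k records.
        static boolean isKAnonymous(List<String[]> records, int[] quasiIds, int k) {
            for (int size : equivalenceClassSizes(records, quasiIds).values()) {
                if (size < k) {
                    return false;
                }
            }
            return true;
        }

        // The generalization named in the abstract: date of birth -> year of birth.
        // Assumes ISO dates such as "1978-05-12".
        static void generalizeDateToYear(List<String[]> records, int dobColumn) {
            for (String[] row : records) {
                row[dobColumn] = row[dobColumn].substring(0, 4);
            }
        }
    }

With these helpers, one would apply generalizeDateToYear (or a similar step for another column) and then call isKAnonymous to confirm that every quasi-identifier combination now occurs at least k times.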
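The abstract also notes that achieving k-anonymity while minimizing information loss is NP-hard and that the toolkit attacks the problem with methods such as greedy search. The sketch below shows a generic greedy loop of that kind; the Generalization interface, its informationLoss method, and the reuse of isKAnonymous from the previous sketch are assumptions made for illustration, not the toolkit's actual design.

    import java.util.*;

    public class GreedyAnonymizer {

        // A candidate generalization step, e.g. "replace date of birth with year of birth".
        interface Generalization {
            void apply(List<String[]> records);
            double informationLoss(List<String[]> records); // penalty charged if this step is applied
        }

        // Greedily apply the cheapest remaining generalization until the table is
        // k-anonymous or no candidates remain; returns whether k-anonymity was reached.
        static boolean anonymize(List<String[]> records, int[] quasiIds, int k,
                                 List<Generalization> candidates) {
            List<Generalization> remaining = new ArrayList<>(candidates);
            while (!KAnonymityCheck.isKAnonymous(records, quasiIds, k) && !remaining.isEmpty()) {
                Generalization best = Collections.min(remaining,
                        Comparator.comparingDouble((Generalization g) -> g.informationLoss(records)));
                best.apply(records);
                remaining.remove(best);
            }
            return KAnonymityCheck.isKAnonymous(records, quasiIds, k);
        }
    }

A greedy loop like this does not guarantee a minimum-loss solution, which is consistent with the abstract's point that exact optimization is intractable for most real-world data sets.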
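Contribution (3) states that each generalization can be represented by a suppression and vice versa. The toy example below illustrates one direction under the assumption that a date of birth is stored as separate year/month/day components: generalizing the date to a year is the same operation as suppressing the month and day cells. The '*' suppression marker and the field layout are assumptions of this sketch.

    public class GeneralizationAsSuppression {

        static final String SUPPRESSED = "*";

        // Generalization: replace a full date of birth with a year of birth.
        static String generalizeToYear(String dateOfBirth) {      // e.g. "1978-05-12"
            return dateOfBirth.substring(0, 4);                   // -> "1978"
        }

        // The same operation expressed as a suppression over components.
        static String[] suppressMonthAndDay(String[] dobComponents) {            // {"1978", "05", "12"}
            return new String[] { dobComponents[0], SUPPRESSED, SUPPRESSED };    // -> {"1978", "*", "*"}
        }
    }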

