Show simple item record

dc.contributor.advisor: Peter Szolovits. (en_US)
dc.contributor.author: Katirai, Hooman (en_US)
dc.contributor.other: Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science. (en_US)
dc.date.accessioned: 2006-11-07T12:45:15Z
dc.date.available: 2006-11-07T12:45:15Z
dc.date.copyright: 2006 (en_US)
dc.date.issued: 2006 (en_US)
dc.identifier.uri: http://hdl.handle.net/1721.1/34526
dc.description: Thesis (S.M.)--Massachusetts Institute of Technology, Engineering Systems Division, Technology and Policy Program; and, Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006. (en_US)
dc.description: Includes bibliographical references (leaves 85-86). (en_US)
dc.description.abstract: Privacy laws are an important facet of our society. But they can also serve as formidable barriers to medical research. The same laws that prevent casual disclosure of medical data have also made it difficult for researchers to access the information they need to conduct research into the causes of disease. But it is possible to overcome some of these legal barriers through technology. The US law known as HIPAA, for example, allows medical records to be released to researchers without patient consent if the records are provably anonymized prior to their disclosure. It is not enough for records to be seemingly anonymous. For example, one researcher estimates that 87.1% of the US population can be uniquely identified by the combination of their ZIP code, gender, and date of birth - fields that most people would consider anonymous. One promising technique for provably anonymizing records is called k-anonymity. It modifies each record so that it matches k other individuals in a population - where k is an arbitrary parameter. This is achieved, for example, by changing specific information such as a date of birth to a less specific counterpart such as a year of birth. (en_US)
dc.description.abstract: (cont.) Previous studies have shown that achieving k-anonymity while minimizing information loss is an NP-hard problem; thus a brute-force search is out of the question for most real-world data sets. In this thesis, we present an open source Java toolkit that seeks to anonymize data while minimizing information loss. It uses an optimization framework and methods typically used to attack NP-hard problems, including greedy search and clustering strategies. To test the toolkit, a number of previously unpublished algorithms and information loss metrics have been implemented. These algorithms and measures are then empirically evaluated using a data set consisting of 1000 real patient medical records taken from a local hospital. The theoretical contributions of this work include: (1) A new threat model for privacy that allows an adversary's capabilities to be modeled using a formalism called a virtual attack database. (2) Rationally defensible information loss measures - we show that previously published information loss measures are difficult to defend because they fall prey to what is known as the "weighted indexing problem." To remedy this problem we propose a number of information loss measures that are in principle more attractive than previously published measures. (en_US)
dc.description.abstract: (cont.) (3) A demonstration that suppression and generalization - two concepts that were previously thought to be distinct - are in fact the same thing, insofar as each generalization can be represented by a suppression and vice versa. (4) A demonstration that Domain Generalization Hierarchies can be harvested to assist the construction of a Bayesian network to measure information loss. (5) A database can be thought of as a sub-sample of a population. We outline a technique that allows one to predict k-anonymity in a population. This allows us, under some conditions, to release records that match fewer than k individuals in a database while still achieving k-anonymity against an adversary according to some probability and confidence interval. While we have chosen to focus our thesis on the anonymization of medical records, our methodologies, toolkit, and command line tools are equally applicable to any tabular data, such as the data one finds in relational databases - the most common type of database today. (en_US)
dc.description.statementofresponsibility: by Hooman Katirai. (en_US)
dc.format.extent: 86 leaves (en_US)
dc.format.extent: 14904672 bytes
dc.format.extent: 14904307 bytes
dc.format.mimetype: application/pdf
dc.format.mimetype: application/pdf
dc.language.iso: eng (en_US)
dc.publisher: Massachusetts Institute of Technology (en_US)
dc.rights: M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. (en_US)
dc.rights.uri: http://dspace.mit.edu/handle/1721.1/7582
dc.subject: Technology and Policy Program. (en_US)
dc.subject: Electrical Engineering and Computer Science. (en_US)
dc.title: A theory and toolkit for the mathematics of privacy : methods for anonymizing data while minimizing information loss (en_US)
dc.type: Thesis (en_US)
dc.description.degree: S.M. (en_US)
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
dc.contributor.department: Massachusetts Institute of Technology. Engineering Systems Division
dc.contributor.department: Technology and Policy Program
dc.identifier.oclc: 70902079 (en_US)
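
The abstract above describes k-anonymity being achieved by generalization, for example replacing a date of birth with a year of birth. The Java sketch below illustrates that idea under simple assumptions (records stored as String arrays, quasi-identifiers given as column indices, ISO-formatted dates); the class and method names are illustrative only and are not taken from the thesis toolkit.

    import java.util.*;

    public class KAnonymityCheck {

        // Count how many records share each quasi-identifier combination.
        static Map<String, Integer> equivalenceClassSizes(List<String[]> records, int[] quasiIds) {
            Map<String, Integer> counts = new HashMap<>();
            for (String[] row : records) {
                StringBuilder key = new StringBuilder();
                for (int col : quasiIds) {
                    key.append(row[col]).append('|');
                }
                counts.merge(key.toString(), 1, Integer::sum);
            }
            return counts;
        }

        // A table is k-anonymous when every equivalence class has at least k records.
        static boolean isKAnonymous(List<String[]> records, int[] quasiIds, int k) {
            for (int size : equivalenceClassSizes(records, quasiIds).values()) {
                if (size < k) {
                    return false;
                }
            }
            return true;
        }

        // The generalization named in the abstract: date of birth -> year of birth.
        // Assumes ISO dates such as "1978-05-12".
        static void generalizeDateToYear(List<String[]> records, int dobColumn) {
            for (String[] row : records) {
                row[dobColumn] = row[dobColumn].substring(0, 4);
            }
        }
    }

With these helpers, one would apply generalizeDateToYear (or a similar step for another column) and then call isKAnonymous to confirm that every quasi-identifier combination now occurs at least k times.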
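The abstract also notes that achieving k-anonymity while minimizing information loss is NP-hard and that the toolkit attacks the problem with methods such as greedy search. The sketch below shows a generic greedy loop of that kind; the Generalization interface, its informationLoss method, and the reuse of isKAnonymous from the previous sketch are assumptions made for illustration, not the toolkit's actual design.

    import java.util.*;

    public class GreedyAnonymizer {

        // A candidate generalization step, e.g. "replace date of birth with year of birth".
        interface Generalization {
            void apply(List<String[]> records);
            double informationLoss(List<String[]> records); // penalty charged if this step is applied
        }

        // Greedily apply the cheapest remaining generalization until the table is
        // k-anonymous or no candidates remain; returns whether k-anonymity was reached.
        static boolean anonymize(List<String[]> records, int[] quasiIds, int k,
                                 List<Generalization> candidates) {
            List<Generalization> remaining = new ArrayList<>(candidates);
            while (!KAnonymityCheck.isKAnonymous(records, quasiIds, k) && !remaining.isEmpty()) {
                Generalization best = Collections.min(remaining,
                        Comparator.comparingDouble((Generalization g) -> g.informationLoss(records)));
                best.apply(records);
                remaining.remove(best);
            }
            return KAnonymityCheck.isKAnonymous(records, quasiIds, k);
        }
    }

A greedy loop like this does not guarantee a minimum-loss solution, which is consistent with the abstract's point that exact optimization is intractable for most real-world data sets.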
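Contribution (3) states that each generalization can be represented by a suppression and vice versa. The toy example below illustrates one direction under the assumption that a date of birth is stored as separate year/month/day components: generalizing the date to a year is the same operation as suppressing the month and day cells. The '*' suppression marker and the field layout are assumptions of this sketch.

    public class GeneralizationAsSuppression {

        static final String SUPPRESSED = "*";

        // Generalization: replace a full date of birth with a year of birth.
        static String generalizeToYear(String dateOfBirth) {      // e.g. "1978-05-12"
            return dateOfBirth.substring(0, 4);                   // -> "1978"
        }

        // The same operation expressed as a suppression over components.
        static String[] suppressMonthAndDay(String[] dobComponents) {            // {"1978", "05", "12"}
            return new String[] { dobComponents[0], SUPPRESSED, SUPPRESSED };    // -> {"1978", "*", "*"}
        }
    }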

