De-identification of patient notes with recurrent neural networks

Dernoncourt, Franck; Lee, Ji Young; Uzuner, Ozlem; Szolovits, Peter

dc.contributor.author	Dernoncourt, Franck
dc.contributor.author	Lee, Ji Young
dc.contributor.author	Uzuner, Ozlem
dc.contributor.author	Szolovits, Peter
dc.date.accessioned	2017-08-29T19:37:29Z
dc.date.available	2017-08-29T19:37:29Z
dc.date.issued	2016-12
dc.date.submitted	2016-09
dc.identifier.issn	1067-5027
dc.identifier.issn	1527-974X
dc.identifier.uri	http://hdl.handle.net/1721.1/111064
dc.description.abstract	Objective: Patient notes in electronic health records (EHRs) may contain critical information for medical investigations. However, the vast majority of medical investigators can only access de-identified notes, in order to protect the confidentiality of patients. In the United States, the Health Insurance Portability and Accountability Act (HIPAA) defines 18 types of protected health information that needs to be removed to de-identify patient notes. Manual de-identification is impractical given the size of electronic health record databases, the limited number of researchers with access to non-de-identified notes, and the frequent mistakes of human annotators. A reliable automated de-identification system would consequently be of high value. Materials and Methods: We introduce the first de-identification system based on artificial neural networks (ANNs), which requires no handcrafted features or rules, unlike existing systems. We compare the performance of the system with state-of-the-art systems on two datasets: the i2b2 2014 de-identification challenge dataset, which is the largest publicly available de-identification dataset, and the MIMIC de-identification dataset, which we assembled and is twice as large as the i2b2 2014 dataset. Results: Our ANN model outperforms the state-of-the-art systems. It yields an F1-score of 97.85 on the i2b2 2014 dataset, with a recall of 97.38 and a precision of 98.32, and an F1-score of 99.23 on the MIMIC de-identification dataset, with a recall of 99.25 and a precision of 99.21. Conclusion: Our findings support the use of ANNs for de-identification of patient notes, as they show better performance than previously published systems while requiring no manual feature engineering.	en_US
dc.language.iso	en_US
dc.publisher	BMJ Publishing Group	en_US
dc.relation.isversionof	http://dx.doi.org/10.1093/jamia/ocw156	en_US
dc.rights	Creative Commons Attribution-Noncommercial-Share Alike	en_US
dc.rights.uri	http://creativecommons.org/licenses/by-nc-sa/4.0/	en_US
dc.source	arXiv	en_US
dc.title	De-identification of patient notes with recurrent neural networks	en_US
dc.type	Article	en_US
dc.identifier.citation	Dernoncourt, Franck et al. “De-Identification of Patient Notes with Recurrent Neural Networks.” Journal of the American Medical Informatics Association (December 2016): 596–606 © 2016 The Authors	en_US
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science	en_US
dc.contributor.mitauthor	Dernoncourt, Franck
dc.contributor.mitauthor	Lee, Ji Young
dc.contributor.mitauthor	Szolovits, Peter
dc.relation.journal	Journal of the American Medical Informatics Association	en_US
dc.eprint.version	Original manuscript	en_US
dc.type.uri	http://purl.org/eprint/type/JournalArticle	en_US
eprint.status	http://purl.org/eprint/status/NonPeerReviewed	en_US
dspace.orderedauthors	Dernoncourt, Franck; Lee, Ji Young; Uzuner, Ozlem; Szolovits, Peter	en_US
dspace.embargo.terms	N	en_US
dc.identifier.orcid	https://orcid.org/0000-0002-1119-1346
dc.identifier.orcid	https://orcid.org/0000-0001-6887-0924
dc.identifier.orcid	https://orcid.org/0000-0001-8411-6403
mit.license	OPEN_ACCESS_POLICY	en_US

Files in this item

Name: Szolovits_De-identification.pdf
Size: 363.8Kb
Format: PDF

This item appears in the following Collection(s)

MIT Open Access Articles
