dc.contributor.author: Bao, Yujia
dc.contributor.author: Deng, Zhengyi
dc.contributor.author: Wang, Yan
dc.contributor.author: Kim, Heeyoon
dc.contributor.author: Armengol, Victor Diego
dc.contributor.author: Acevedo, Francisco
dc.contributor.author: Ouardaoui, Nofal
dc.contributor.author: Wang, Cathy
dc.contributor.author: Parmigiani, Giovanni
dc.contributor.author: Barzilay, Regina
dc.contributor.author: Braun, Danielle
dc.contributor.author: Hughes, Kevin S.
dc.date.accessioned: 2022-02-09T16:08:57Z
dc.date.available: 2021-10-27T20:35:25Z
dc.date.available: 2022-02-09T16:08:57Z
dc.date.issued: 2019-12
dc.identifier.issn: 2473-4276
dc.identifier.uri: https://hdl.handle.net/1721.1/136447.2
dc.description.abstract: © 2019 by American Society of Clinical Oncology. PURPOSE: The medical literature relevant to germline genetics is growing exponentially. Clinicians need tools that help to monitor and prioritize the literature to understand the clinical implications of pathogenic genetic variants. We developed and evaluated two machine learning models to classify abstracts as relevant to the penetrance (risk of cancer for germline mutation carriers) or prevalence of germline genetic mutations. MATERIALS AND METHODS: We conducted literature searches in PubMed and retrieved paper titles and abstracts to create an annotated data set for training and evaluating the two machine learning classification models. Our first model is a support vector machine (SVM) which learns a linear decision rule on the basis of the bag-of-ngrams representation of each title and abstract. Our second model is a convolutional neural network (CNN) which learns a complex nonlinear decision rule on the basis of the raw title and abstract. We evaluated the performance of the two models on the classification of papers as relevant to penetrance or prevalence. RESULTS: For penetrance classification, we annotated 3,740 paper titles and abstracts and evaluated the two models using 10-fold cross-validation. The SVM model achieved 88.93% accuracy (percentage of papers that were correctly classified), whereas the CNN model achieved 88.53% accuracy. For prevalence classification, we annotated 3,753 paper titles and abstracts. The SVM model achieved 88.92% accuracy and the CNN model achieved 88.52% accuracy. CONCLUSION: Our models achieve high accuracy in classifying abstracts as relevant to penetrance or prevalence. By facilitating literature review, this tool could help clinicians and researchers keep abreast of the burgeoning knowledge of gene–cancer associations and keep the knowledge bases for clinical decision support tools up to date. [en_US]
dc.language.iso: en
dc.publisher: American Society of Clinical Oncology (ASCO) [en_US]
dc.relation.isversionof: http://dx.doi.org/10.1200/cci.19.00042 [en_US]
dc.rights: Creative Commons Attribution-Noncommercial-Share Alike [en_US]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-sa/4.0/ [en_US]
dc.source: arXiv [en_US]
dc.subject: General Medicine [en_US]
dc.title: Using Machine Learning and Natural Language Processing to Review and Classify the Medical Literature on Cancer Susceptibility Genes [en_US]
dc.type: Article [en_US]
dc.contributor.department: Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
dc.relation.journal: JCO Clinical Cancer Informatics [en_US]
dc.eprint.version: Original manuscript [en_US]
dc.type.uri: http://purl.org/eprint/type/JournalArticle [en_US]
eprint.status: http://purl.org/eprint/status/NonPeerReviewed [en_US]
dc.date.updated: 2020-12-01T16:55:48Z
dspace.orderedauthors: Bao, Y; Deng, Z; Wang, Y; Kim, H; Armengol, VD; Acevedo, F; Ouardaoui, N; Wang, C; Parmigiani, G; Barzilay, R; Braun, D; Hughes, KS [en_US]
dspace.date.submission: 2020-12-01T16:55:52Z
mit.journal.volume: 3 [en_US]
mit.journal.issue: 3 [en_US]
mit.license: OPEN_ACCESS_POLICY
mit.metadata.status: Authority Work Needed [en_US]

