
dc.contributor.advisor: Peter Szolovits
dc.contributor.author: McDermott, Matthew B. A. (Matthew Brian Andrew)
dc.contributor.other: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
dc.date.accessioned: 2019-07-17T20:59:28Z
dc.date.available: 2019-07-17T20:59:28Z
dc.date.copyright: 2019
dc.date.issued: 2019
dc.identifier.uri: https://hdl.handle.net/1721.1/121738
dc.description: Thesis: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2019
dc.description: Cataloged from PDF version of thesis.
dc.description: Includes bibliographical references (pages 57-62).
dc.description.abstract: Gene expression data holds the potential to offer deep, physiological insights about the dynamic state of a cell, beyond the static coding of the genome alone. I believe that realizing this potential requires specialized machine learning methods capable of using underlying biological structure, but the development of such models is hampered by the lack of an empirical methodological foundation, including published benchmarks and well-characterized baselines. In this work, we lay that foundation by profiling a battery of classifiers against newly defined, biologically motivated classification tasks on multiple L1000 gene expression datasets. In addition, on our smallest dataset, a privately produced L1000 corpus, we profile per-subject generalizability to provide a novel assessment of performance that is lost in many typical analyses. We compare traditional classifiers, including feed-forward artificial neural networks (FF-ANNs), linear methods, random forests, decision trees, and k-nearest-neighbor classifiers, as well as graph convolutional neural networks (GCNNs), which augment learning via prior biological domain knowledge. We find that GCNNs offer performance improvements given sufficient data, excelling at all tasks on our largest dataset. On smaller datasets, FF-ANNs offer the greatest performance. Linear models significantly underperform on all dataset scales, but offer the best per-subject generalizability. Ultimately, these results suggest that structured models such as GCNNs can represent a new direction of focus for the field as our scale of data continues to increase.
dc.description.statementofresponsibility: by Matthew B. A. McDermott
dc.format.extent: 62 pages
dc.language.iso: eng
dc.publisher: Massachusetts Institute of Technology
dc.rights: MIT theses are protected by copyright. They may be viewed, downloaded, or printed from this source, but further reproduction or distribution in any format is prohibited without written permission.
dc.rights.uri: http://dspace.mit.edu/handle/1721.1/7582
dc.subject: Electrical Engineering and Computer Science
dc.title: Deep learning benchmarks on L1000 gene expression data
dc.type: Thesis
dc.description.degree: S.M.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
dc.identifier.oclc: 1102050364
dc.description.collection: S.M. Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science
dspace.imported: 2019-07-17T20:59:25Z
mit.thesis.degree: Master
mit.thesis.department: EECS

