Short tandem repeat (STR) profile authentication via machine learning techniques

Shcherbina, Anna

dc.contributor.advisor	Anthony Lapadula and Manolis Kellis.	en_US
dc.contributor.author	Shcherbina, Anna	en_US
dc.contributor.other	Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science.	en_US
dc.date.accessioned	2013-02-14T15:39:05Z
dc.date.available	2013-02-14T15:39:05Z
dc.date.copyright	2012	en_US
dc.date.issued	2012	en_US
dc.identifier.uri	http://hdl.handle.net/1721.1/77020
dc.description	Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012.	en_US
dc.description	Cataloged from PDF version of thesis.	en_US
dc.description	Includes bibliographical references (p. 169-171).	en_US
dc.description.abstract	Short tandem repeat (STR) DNA profiles have multiple uses in forensic analysis, kinship identification, and human biometrics. However, as biotechnology progresses, there is a growing concern that STR profiles can be created using standard laboratory techniques such as whole genome amplification and molecular cloning. Such technologies can be used to synthesize any STR profile without the need for a physical sample, only knowledge of the desired genetic sequence. Therefore, to preserve the credibility of DNA as a forensic tool, it is imperative to develop means to authenticate STR profiles. The leading technique in the field, methylation analysis, is accurate but also expensive, time-consuming, and degrades the forensic sample so that further analysis is not possible. The realm of machine learning offers techniques to address the need for more effective STR profile authentication. In this work, a set of features was identified at both the channel and profile levels of STR electropherograms. A number of supervised and unsupervised machine learning algorithms were then used to predict whether a given STR electropherogram was authentic or synthesized by laboratory techniques. With the aid of the LNKnet machine learning toolkit, various classifiers were trained with the default set of parameters and the full set of features to quantify their baseline performance. Particular emphasis was placed on detecting profiles generated by Whole Genome Amplification (WGA). A greedy forward-backward search algorithm was implemented to determine the most useful subset of features from the initial group. Though the set of optimal feature values varied by classifier, a trend was observed indicating that the inter-locus imbalance error, stutter count, and range of peak widths for a profile were particularly useful features. These were selected by over two thirds of the classifiers.
The signal-to-noise ratio was also a useful feature, selected by seven out of 16 classifiers. The selected features were in turn used to tune the parameters of machine learning algorithms and to compare their performance. From a set of 16 initial classifiers, the K-nearest neighbors, condensed K-nearest neighbors, multi-layer perceptron, Parzen window, and support vector machine classifiers achieved the best performance. These classification algorithms all attained error rates, defined as the percentage of profiles misclassified, of approximately ten percent, with the highest-performing classifier achieving an error rate of less than eight percent. Overall, the classifiers performed well at detecting artificial profiles but had more difficulty accurately distinguishing natural profiles. There were many false positives for the artificial class, since profiles in this category took on a greater range of feature values. Finally, preliminary steps were taken to form classifier committees. However, combining the top-performing classifiers via a majority vote did not significantly improve performance. The results of this work demonstrate the feasibility of a completely software-based approach to profile authentication. They confirm that machine learning techniques are a useful tool to trigger further investigation of profile authenticity via more expensive approaches.	en_US
dc.description.statementofresponsibility	by Anna Shcherbina.	en_US
dc.format.extent	171 p	en_US
dc.language.iso	eng	en_US
dc.publisher	Massachusetts Institute of Technology	en_US
dc.rights	M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission.	en_US
dc.rights.uri	http://dspace.mit.edu/handle/1721.1/7582	en_US
dc.subject	Electrical Engineering and Computer Science.	en_US
dc.title	Short tandem repeat (STR) profile authentication via machine learning techniques	en_US
dc.title.alternative	Short tandem repeat profile authentication via machine learning techniques	en_US
dc.title.alternative	STR profile authentication via machine learning techniques	en_US
dc.type	Thesis	en_US
dc.description.degree	M.Eng.	en_US
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
dc.identifier.oclc	825770402	en_US
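
Illustrative note on the abstract: the greedy forward-backward feature search it mentions can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the thesis's implementation. The function name, the error callback, the toy scoring rule, and the feature labels are all hypothetical; the labels are merely named after features the abstract lists. The thesis itself used the LNKnet toolkit, which is not used here.

```python
from typing import Callable, FrozenSet, Iterable


def greedy_forward_backward(
    features: Iterable[str],
    error_of: Callable[[FrozenSet[str]], float],
) -> FrozenSet[str]:
    """Greedy forward-backward feature selection (illustrative sketch).

    error_of(subset) is a hypothetical callback that returns the error rate
    of some classifier trained using only the features in `subset`.
    """
    all_features = frozenset(features)
    selected: FrozenSet[str] = frozenset()
    best = error_of(selected)
    improved = True
    while improved:
        improved = False
        # Forward step: add the single feature that lowers the error the most.
        trials = [(error_of(selected | {f}), f) for f in sorted(all_features - selected)]
        if trials:
            err, f = min(trials)
            if err < best:
                selected, best, improved = selected | {f}, err, True
        # Backward step: drop the single feature whose removal lowers the error the most.
        trials = [(error_of(selected - {f}), f) for f in sorted(selected)]
        if trials:
            err, f = min(trials)
            if err < best:
                selected, best, improved = selected - {f}, err, True
    return selected


# Hypothetical feature labels; the scoring rule below is a toy stand-in,
# not data or results from the thesis.
USEFUL = frozenset({"inter_locus_imbalance", "stutter_count", "peak_width_range"})


def toy_error(subset: FrozenSet[str]) -> float:
    # Error falls with each useful feature and rises slightly with each other one.
    return 0.5 - 0.1 * len(subset & USEFUL) + 0.02 * len(subset - USEFUL)


chosen = greedy_forward_backward(USEFUL | {"uninformative_feature"}, toy_error)
```

In practice the callback would wrap a cross-validated classifier error. The loop stops once neither adding nor removing a single feature strictly reduces the error, so it always terminates, though like any greedy search it may settle on a subset that is only locally best.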

Files in this item

Name: 825770402-MIT.pdf
Size: 24.84Mb
Format: PDF
Description: Full printable version


This item appears in the following Collection(s)

Graduate Theses
