Biomedical data retrieval utilizing textual data in a gene expression database by Richard Lu, MD.

Research and Teaching Output of the MIT Community

dc.contributor.advisor Ronilda Lacson. en_US
dc.contributor.author Lu, Richard, M.D en_US
dc.contributor.other Harvard University--MIT Division of Health Sciences and Technology. en_US
dc.date.accessioned 2010-09-03T18:36:13Z
dc.date.available 2010-09-03T18:36:13Z
dc.date.copyright 2010 en_US
dc.date.issued 2010 en_US
dc.description Thesis (S.M.)--Harvard-MIT Division of Health Sciences and Technology, 2010. en_US
dc.description Cataloged from PDF version of thesis. en_US
dc.description Includes bibliographical references (p. 68-74). en_US
dc.description.abstract Background: The commoditization of high-throughput gene expression sequencing and microarrays has led to a proliferation in the amount of both genomic and clinical data that is available. Descriptive textual information deposited with gene expression data in the Gene Expression Omnibus (GEO) is an underutilized resource because the textual information is unstructured and difficult to query. Rendering this information in a structured format using standard medical terms would facilitate better searching and data reuse. Such a procedure would significantly increase the clinical utility of biomedical data repositories. Methods: The thesis is divided into two sections. The first section compares how well four medical terminologies were able to represent textual information deposited in GEO. The second section implements free-text search and faceted search and evaluates how well they are able to answer clinical queries with varying levels of complexity. Part I: 120 samples were randomly extracted from samples deposited in the GEO database from six clinical domains: breast cancer, colon cancer, rheumatoid arthritis (RA), systemic lupus erythematosus (SLE), type I diabetes mellitus (IDDM), and asthma. These samples had previously been annotated manually, and structured textual information was obtained in a tag:value format. The data were mapped to four different controlled terminologies: NCI Thesaurus, MeSH, SNOMED-CT, and ICD-10. The samples were assigned a score on a three-point scale based on how well the terminology was able to represent the descriptive textual information. Part II: Faceted and free-text search tools were implemented, with 300 GEO samples included for querying. Eight natural language search questions were selected randomly from scientific journals. Academic researchers were recruited and asked to use the faceted and free-text search tools to locate samples matching the question criteria. Precision, recall, F-score, and search time were compared and analyzed for both free-text and faceted search. Results: The results show that the NCI Thesaurus consistently ranked as the most comprehensive terminology across all domains, while ICD-10 consistently ranked as the least comprehensive. Using the NCI Thesaurus to augment the faceted search tool, each researcher was able to reach 100% precision and recall (F-score 1.0) for each of the eight search questions. Using free-text search, test users averaged 22.8% precision, 60.7% recall, and an F-score of 0.282. The mean search times per question using faceted search and free-text search were 116.7 seconds and 138.4 seconds, respectively. The difference in search time was not statistically significant (p=0.734). However, paired t-test analysis showed a statistically significant difference between the two search strategies with respect to precision (p=0.001), recall (p=0.042), and F-score (p<0.001). Conclusion: This work demonstrates that biomedical terms included in a gene expression database can be adequately expressed using the NCI Thesaurus. It also shows that faceted searching using a controlled terminology is superior to conventional free-text searching when answering queries of varying levels of complexity. en_US
dc.format.extent 76 p. en_US
dc.language.iso eng en_US
dc.publisher Massachusetts Institute of Technology en_US
dc.rights M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. en_US
dc.rights.uri en_US
dc.subject Harvard University--MIT Division of Health Sciences and Technology. en_US
dc.title Biomedical data retrieval utilizing textual data in a gene expression database by Richard Lu, MD. en_US
dc.type Thesis en_US
dc.description.degree S.M. en_US
dc.contributor.department Harvard University--MIT Division of Health Sciences and Technology. en_US
dc.identifier.oclc 656269955 en_US
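
The evaluation described in the abstract compares faceted search over tag:value annotations with free-text search, scoring each by precision, recall, and F-score. The following is a minimal Python sketch of that comparison, not the thesis implementation. The sample IDs, tags, and values are invented for illustration, and the function names (faceted_search, free_text_search, precision_recall_f) are hypothetical.

```python
from typing import Dict, Set, Tuple

# Hypothetical samples annotated in tag:value form (all IDs and values are made up).
SAMPLES: Dict[str, Dict[str, str]] = {
    "S1": {"disease": "colon cancer", "tissue": "colon"},
    "S2": {"disease": "colon cancer", "tissue": "colon"},
    "S3": {"disease": "breast cancer", "tissue": "breast",
           "note": "no history of colon cancer"},
    "S4": {"disease": "asthma", "tissue": "blood",
           "note": "sibling has colon cancer"},
}


def faceted_search(samples: Dict[str, Dict[str, str]], **facets: str) -> Set[str]:
    """Return IDs of samples whose tags match every requested facet exactly."""
    return {
        sid for sid, tags in samples.items()
        if all(tags.get(tag) == value for tag, value in facets.items())
    }


def free_text_search(samples: Dict[str, Dict[str, str]], query: str) -> Set[str]:
    """Return IDs of samples where the query appears as a substring of any value."""
    q = query.lower()
    return {
        sid for sid, tags in samples.items()
        if any(q in value.lower() for value in tags.values())
    }


def precision_recall_f(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float, float]:
    """Compute precision, recall, and F-score (harmonic mean of the two)."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


if __name__ == "__main__":
    relevant = {"S1", "S2"}  # assumed gold standard for "colon cancer samples"
    print("faceted:  ", precision_recall_f(
        faceted_search(SAMPLES, disease="colon cancer"), relevant))
    print("free-text:", precision_recall_f(
        free_text_search(SAMPLES, "colon cancer"), relevant))
```

In this toy data, free-text search also returns samples that merely mention "colon cancer" in a note, which lowers its precision while recall stays high. When scores are averaged across questions, the mean of per-question F-scores generally differs from the F-score computed from mean precision and mean recall.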

Files in this item

Name Size Format Description
656269955-MIT.pdf 6.540Mb PDF Full printable version
