An exploratory analysis of large health cohort study using Bayesian networks

Author(s)

Shen, Delin

Other Contributors

Harvard University--MIT Division of Health Sciences and Technology.

Advisor

Peter Szolovits.

Terms of use

M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. http://dspace.mit.edu/handle/1721.1/34478 http://dspace.mit.edu/handle/1721.1/7582

Abstract

Large health cohort studies are among the most effective ways of studying the causes, treatments, and outcomes of diseases by systematically collecting a wide range of data over long periods. The wealth of data in such studies may yield important results in addition to the already numerous findings, especially when subjected to newer analytical methods. Bayesian Networks (BN) provide a relatively new method of representing uncertain relationships among variables, using the tools of probability and graph theory, and have been widely used in analyzing dependencies and the interplay between variables. We used BN to perform an exploratory analysis on a rich collection of data from one large health cohort study, the Nurses' Health Study (NHS), with a focus on breast cancer. We explored the NHS data using BN to look for breast cancer risk factors, including a group of Single Nucleotide Polymorphisms (SNPs). We found no association between the SNPs and breast cancer, but found a dependency between clomid and breast cancer. We evaluated clomid as a potential risk factor after matching on age and number of children. Our results showed that clomid was associated with an increased risk of estrogen receptor positive breast cancer (odds ratio 1.52, 95% CI 1.11-2.09) and a decreased risk of estrogen receptor negative breast cancer (odds ratio 0.46, 95% CI 0.22-0.97).
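For readers unfamiliar with the "odds ratio, 95% CI" notation used above, the following is a minimal sketch of how such a figure can be computed from a 2x2 exposure table. It uses the simple unmatched Wald (log-scale) interval, which is an assumption made here for illustration; the thesis matched on age and number of children, so its actual estimation procedure may well differ. The function name `odds_ratio_ci` and the counts in the usage line are hypothetical and are not taken from the Nurses' Health Study.

```python
import math


def odds_ratio_ci(exposed_cases, exposed_controls,
                  unexposed_cases, unexposed_controls, z=1.96):
    """Return (odds ratio, lower bound, upper bound) for a 2x2 table.

    Uses the unmatched Wald interval on the log-odds scale; z=1.96
    corresponds to an approximate 95% confidence level.
    All four cell counts must be positive.
    """
    a, b, c, d = (exposed_cases, exposed_controls,
                  unexposed_cases, unexposed_controls)
    if min(a, b, c, d) <= 0:
        raise ValueError("all cell counts must be positive")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(odds ratio) for an unmatched 2x2 table.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, for illustration only.
or_, lo, hi = odds_ratio_ci(20, 10, 10, 20)
```

An interval that excludes 1, as both intervals reported in the abstract do, indicates an association that is statistically significant at roughly the 5% level.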

We developed breast cancer risk models using BN. We trained the models on 75% of the data and evaluated them on the remaining 25%. Because of the clinical importance of predicting risks for Estrogen Receptor positive and Progesterone Receptor positive breast cancer, we focused on this specific type of breast cancer to predict two-year, four-year, and six-year risks. The concordance statistics of the predictions on the test sets are 0.70 (95% CI: 0.67-0.74), 0.68 (95% CI: 0.64-0.72), and 0.66 (95% CI: 0.62-0.69) for the two-, four-, and six-year models, respectively. We also evaluated the calibration of the models and applied a filter to their output, based on Agglomerative Information Bottleneck clustering, to improve the linear relationship between predicted and observed risks without sacrificing much discrimination performance.
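The concordance statistic reported above measures discrimination: the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen non-case, with ties counted as one half. The sketch below computes it by direct pairwise comparison. It is an illustration under that standard definition, not the thesis's own code; the function name `concordance` and the toy inputs are hypothetical.

```python
def concordance(predicted_risks, outcomes):
    """Concordance (c) statistic for binary outcomes.

    predicted_risks: sequence of predicted probabilities or scores.
    outcomes: sequence of 0/1 labels (1 = developed the disease).
    Compares every case with every non-case; a pair is concordant
    when the case has the higher predicted risk, and ties count 0.5.
    """
    cases = [r for r, y in zip(predicted_risks, outcomes) if y == 1]
    non_cases = [r for r, y in zip(predicted_risks, outcomes) if y == 0]
    if not cases or not non_cases:
        raise ValueError("need at least one case and one non-case")
    score = 0.0
    for case_risk in cases:
        for non_case_risk in non_cases:
            if case_risk > non_case_risk:
                score += 1.0
            elif case_risk == non_case_risk:
                score += 0.5
    return score / (len(cases) * len(non_cases))


# Toy data, for illustration only: three of the four pairs are concordant.
c_stat = concordance([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0])
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of cases from non-cases. The pairwise loop is quadratic in the number of subjects, so a rank-based computation would be preferable for a cohort of realistic size.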

Description

Thesis (Ph. D.)--Harvard-MIT Division of Health Sciences and Technology, 2006.

Includes bibliographical references (p. 91-98).

Date issued

2006

URI

http://dspace.mit.edu/handle/1721.1/34478
http://hdl.handle.net/1721.1/34478

Department

Harvard University--MIT Division of Health Sciences and Technology

Publisher

Massachusetts Institute of Technology

Keywords

Harvard University--MIT Division of Health Sciences and Technology.

Collections

Doctoral Theses