Physiological time series retrieval and prediction with locality-sensitive hashing

Kim, Yongwook Bryce

Author(s)

Kim, Yongwook Bryce

DownloadFull printable version (14.84Mb)

Other Contributors

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.

Advisor

Una-May O'Reilly.

Terms of use

MIT theses are protected by copyright. They may be viewed, downloaded, or printed from this source but further reproduction or distribution in any format is prohibited without written permission. http://dspace.mit.edu/handle/1721.1/7582

Metadata

Show full item record

Abstract

The amount of time series data collected in the medical community has recently been exploding due to widespread affordable sensors and storage devices. However, while the massive repositories of such physiological time series data provide enormous opportunities for machine learning to make significant impacts, they are largely under-utilized due to their granular detail, overwhelming size, and lack of proper tools. Besides scale, fast yet accurate processing of physiological waveform data is desired in medical practice, especially in time-critical settings such as the intensive care unit (ICU). Efficiently leveraging these massive datasets is a key challenge that, when resolved, will support a new paradigm of scientific discovery and operational innovation in medicine. In this thesis, we develop highly efficient similarity-based methods that make it practical to search massive physiological time series repositories to rapidly identify waveforms similar to those from a given individual. We call this concept "patients with trajectories like mine." Our goal is to exploit rapid similar waveform retrieval to enable critical event prediction in the ICU setting. In order to achieve this goal, we propose to apply locality-sensitive hashing (LSH), which supports a very fast approximate nearest neighbor search in high dimensions. We empirically demonstrate that LSH based retrieval and prediction methods vastly speed up querying time while sacrificing only a trivial amount of accuracy as a cost. Despite being fast and accurate, the generic LSH has two shortcomings. First, it is capable of utilizing only one similarity measure at a time. To overcome this limit, we introduce Stratified LSH (SLSH) which finds similarity among the data from a more integrated perspective by employing multiple distance metrics in one framework. SLSH is essentially a dual-level hierarchical LSH where each LSH layer is associated with a distinct distance metric capturing a unique facet of similarity. The second shortcoming and the main bottleneck of the generic LSH is that it involves exhaustive distance calculations as a subroutine when short-listing the candidate set to find the final nearest neighbors. To surmount this, we propose Collision Frequency LSH (CFLSH) which short-lists the candidate set by simply counting the frequency of collision based on the key idea that the more frequently an element and a query collide across multiple LSH hash tables, the more similar they are. We show that with SLSH and CFLSH, we improve the efficiency of LSH in terms of both prediction accuracy and querying speed. We demonstrate our proposed methods on a mean arterial blood pressure dataset extracted from the MIMIC II database in the context of predicting acute hypotensive episodes in ICU. To examine the generality of our methods with respect to scaling, we validate our methods on datasets with various dimensions and item counts.

Description

Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2017.

Cataloged from PDF version of thesis.

Includes bibliographical references (pages 131-143).

Date issued

2017

URI

http://hdl.handle.net/1721.1/111903

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Publisher

Massachusetts Institute of Technology

Keywords

Electrical Engineering and Computer Science.

Collections

Doctoral Theses