DSpace@MIT

Detecting cognitive impairment from spoken language

Author(s)
Alhanai, Tuka (Tuka Waddah Talib Ali Al Hanai)
Download
1124075112-MIT.pdf (39.95 MB)
Other Contributors
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.
Advisor
James R. Glass.
Terms of use
MIT theses are protected by copyright. They may be viewed, downloaded, or printed from this source but further reproduction or distribution in any format is prohibited without written permission. http://dspace.mit.edu/handle/1721.1/7582
Abstract
Dementia comes second only to spinal cord injuries in terms of its debilitating effects, which range from memory loss to physical disability. The standard approach to evaluating cognitive conditions is the neuropsychological exam, conducted via in-person interviews to measure memory, thinking, language, and motor skills. Work is ongoing to determine biomarkers of cognitive impairment, yet one modality that has been relatively underexplored is speech. Speech has the advantage of being easy to record, and it contains the majority of the information transmitted during neuropsychological exams. To determine the viability of speech-based biomarkers, we utilize data from the Framingham Heart Study, which contains hour-long audio recordings of neuropsychological exams for over 5,000 individuals. The data is representative of a population and of the real-world prevalence of cognitive conditions (3-4%). We first explore modeling cognitive impairment from a relatively small set of 92 subjects with complete information on audio, transcripts, and speaker turns. We then loosen these constraints by modeling with only a fraction of the audio (~2-3 minutes), with speaker segments defined through text-based diarization. We next apply this diarization method to extract audio features from all 7,000+ recordings (most of which have no transcripts) to model cognitive impairment (AUC 0.83, spec. 78%, sens. 79%). Finally, we eliminate the need for feature engineering by training a neural network to learn higher-order representations from filterbank features (AUC 0.85, spec. 81%, sens. 82%). Our speech models exhibit strong performance, comparable to the baseline demographic model (AUC 0.85, spec. 93%, sens. 65%). Further analysis shows that our neural network model automatically learns to detect specific speech activity that clusters according to: a pause followed by the onset of speech, short bursts of speech, speech activity in high-frequency spectral energy bands, and silence.
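The abstract reports each model's performance as AUC together with specificity and sensitivity at a chosen operating point. As a minimal sketch (not the thesis code), the following Python snippet shows how such metrics might be computed for a binary cognitive-impairment classifier; the synthetic data, feature shift, and logistic-regression model are all hypothetical stand-ins, with the ~4% positive prevalence mirroring the population rate the abstract cites.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
y = (rng.random(n) < 0.04).astype(int)            # ~4% impaired, per the abstract
X = rng.normal(size=(n, 20)) + y[:, None] * 0.8   # hypothetical speech features

# Stratified split preserves the low prevalence in the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

auc = roc_auc_score(y_te, scores)                 # threshold-independent
pred = (scores >= 0.5).astype(int)                # one arbitrary operating point
tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
print(f"AUC {auc:.2f}  spec. {tn / (tn + fp):.0%}  sens. {tp / (tp + fn):.0%}")

Note that at such low prevalence the threshold choice trades specificity against sensitivity sharply, which is why the abstract quotes both alongside the AUC.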
Description
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2019
 
Cataloged from PDF version of thesis.
 
Includes bibliographical references (pages 141-165).
 
Date issued
2019
URI
https://hdl.handle.net/1721.1/122724
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology
Keywords
Electrical Engineering and Computer Science.

Collections
  • Doctoral Theses
