Comparison of Natural Language Processing Models for Depression Detection in Chatbot Dialogues
Author(s)
Belser, Christian Alexander
Advisor
Fletcher, Richard Ribon
Abstract
Depression is a major challenge in the world today and a leading cause of disability. In the US, a recent study showed that approximately 36 million adults had at least one major depressive episode, including some with severe impairment [1]. However, approximately two-thirds of all depression cases are never diagnosed [2], largely due to a shortage of trained mental health professionals as well as a lingering cultural stigma that often prevents afflicted people from seeking professional care. To address this need, there is emerging interest in using computer algorithms to automatically screen for depression, which could be deployed widely to the public via clinical websites and mobile apps. Within this field, Dr. Fletcher's group at MIT develops mobile platforms that support mental wellness and psychotherapy, including tools that screen for mental health disorders and refer people to treatment. As part of this work, this thesis compares three distinct Natural Language Processing (NLP) models used to screen for depression. I revised and updated three state-of-the-art model families to screen for depression in individuals: (1) bi-directional gated recurrent unit (BGRU) models, (2) hierarchical attention networks (HAN), and (3) long-sequence Transformer models. All models were trained and tested on a common standard clinical dataset (DAIC-WOZ) derived from clinical patient interviews. After optimization, and after exploring several variants of each model type, the following results were obtained: BGRU (accuracy=0.71, precision=0.65, recall=0.63, F1-score=0.64, MCC=0.20); HAN (accuracy=0.77, precision=0.76, recall=0.77, F1-score=0.76, MCC=0.46); Transformer (accuracy=0.77, precision=0.76, recall=0.77, F1-score=0.76, MCC=0.43). In addition to model performance, I also compare the different categories of models based on computational resources and input token size. Finally, I discuss the future evolution of these models and provide recommendations for specific use cases.
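For readers unfamiliar with the first model family, the sketch below shows a minimal bidirectional-GRU text classifier in PyTorch. It is purely illustrative: the vocabulary size, embedding and hidden dimensions, and single-layer configuration are placeholder assumptions for this example, not the architecture or hyperparameters used in the thesis.

```python
import torch
import torch.nn as nn

class BGRUClassifier(nn.Module):
    """Minimal bidirectional-GRU (BGRU) text classifier (illustrative only)."""

    def __init__(self, vocab_size=30000, embed_dim=128, hidden_dim=64, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Forward and backward final states are concatenated, hence 2 * hidden_dim
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer-encoded transcript tokens
        embedded = self.embedding(token_ids)
        # hidden: (num_directions=2, batch, hidden_dim) final states
        _, hidden = self.gru(embedded)
        # Summarize the sequence by concatenating forward and backward states
        summary = torch.cat([hidden[0], hidden[1]], dim=-1)
        return self.classifier(summary)  # (batch, num_classes) logits

# Dummy forward pass: a batch of two 50-token sequences
model = BGRUClassifier()
logits = model(torch.randint(1, 30000, (2, 50)))
print(logits.shape)  # torch.Size([2, 2])
```

The five metrics reported in the abstract can be computed with scikit-learn; this is a hedged example with hypothetical labels, as the thesis does not state which library was used:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef)

# Hypothetical binary labels: 1 = screened positive for depression
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(f"accuracy={accuracy_score(y_true, y_pred):.2f}, "
      f"precision={precision_score(y_true, y_pred):.2f}, "
      f"recall={recall_score(y_true, y_pred):.2f}, "
      f"F1-score={f1_score(y_true, y_pred):.2f}, "
      f"MCC={matthews_corrcoef(y_true, y_pred):.2f}")
```

Unlike accuracy and F1-score, the Matthews correlation coefficient (MCC) accounts for all four confusion-matrix cells, which is why the abstract's MCC values separate the models more sharply than their otherwise identical accuracy and F1 scores.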
Date issued
2023-09
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology