MIT Libraries, DSpace@MIT

Open Intent Generation Through Unsupervised Semantic Clustering of Task-Oriented Dialog

Author(s)
Wagner, Julia N.
Advisor
Glass, James
Terms of use
In Copyright - Educational Use Permitted. Copyright MIT. http://rightsstatements.org/page/InC-EDU/1.0/
Abstract
In recent years, task-oriented dialog systems have emerged as a strong area of interest in natural language processing research and industry. However, the scarcity of complex, sufficiently annotated training data remains a bottleneck for the development of more advanced, domain-agnostic chatbots. Creating intents for new datasets in novel domains requires extensive time and manual effort from experts. This thesis analyzes a two-stage approach, unsupervised semantic clustering followed by intent generation, in which each component offers multiple interchangeable methods that can be adapted to the dataset. We examine various pre-trained embeddings, scoring objectives for the number of clusters, unsupervised clustering algorithms, intent generation techniques, and utterance tokenization schemes. We then run experiments with these combinations on three datasets: SNIPS, MultiWOZ, and real-world chat data. This is followed by quantitative metric-based evaluation and in-depth qualitative cluster-based evaluation. We show the benefits of bigram frequency intent generation as datasets become more irregular, and we confirm the success of universal sentence encoder embeddings with K-Means clustering. Additionally, our examination of real-world data underlines the importance of fine-grained utterance tokenization and shows promise for the feasibility of applying research methods to unpublished data. Altogether, this thesis provides a comprehensive analysis of how well the components of the two-stage pipeline support open intent discovery across a variety of dataset characteristics, offering alternative solutions where beneficial for real-world applications. This gives insight into the optimal configuration for automatically generating a novel dialog training dataset from unstructured, unlabeled chat utterances. The code for this thesis can be found at https://github.com/jnwagner53/dialog-intent-generation.
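As a rough illustration of the second stage, bigram frequency intent generation might look like the sketch below. It assumes the utterances have already been grouped into clusters by the first stage. The function names (`tokenize`, `bigram_intent`), the stopword list, and the example utterances are hypothetical and chosen for illustration only; the thesis implementation lives in the linked repository and may differ in tokenization and scoring.

```python
import re
from collections import Counter

# Hypothetical stopword list, for illustration only (not from the thesis).
STOPWORDS = {"a", "an", "the", "to", "in", "for", "me", "i",
             "please", "can", "you", "of", "on", "is", "at"}


def tokenize(utterance):
    """Lowercase, keep alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", utterance.lower())
    return [t for t in tokens if t not in STOPWORDS]


def bigram_intent(utterances):
    """Label a cluster with its most frequent bigram, e.g. 'book_table'."""
    counts = Counter()
    for utterance in utterances:
        tokens = tokenize(utterance)
        # Count adjacent token pairs after stopword removal.
        counts.update(zip(tokens, tokens[1:]))
    if not counts:
        return "unknown"
    (first, second), _ = counts.most_common(1)[0]
    return f"{first}_{second}"


# Toy clusters standing in for the output of the clustering stage.
clusters = {
    0: ["Book a table for two",
        "Can you book a table in the centre",
        "Please book a table at 7pm"],
    1: ["Play some jazz music",
        "Play jazz music please",
        "I want to play jazz music"],
}

labels = {cid: bigram_intent(utts) for cid, utts in clusters.items()}
print(labels)  # {0: 'book_table', 1: 'jazz_music'}
```

Removing stopwords before forming bigrams is a design choice in this sketch: it lets "book a table" yield the pair (book, table), at the cost of treating words as adjacent when they were not. The second cluster also shows a limitation of pure frequency: the most common bigram is the object phrase "jazz music" rather than a verb-object pair.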
Date issued
2022-05
URI
https://hdl.handle.net/1721.1/144901
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses