Show simple item record

dc.contributor.advisorGlass, James
dc.contributor.authorWagner, Julia N.
dc.date.accessioned2022-08-29T16:19:44Z
dc.date.available2022-08-29T16:19:44Z
dc.date.issued2022-05
dc.date.submitted2022-06-30T15:52:18.667Z
dc.identifier.urihttps://hdl.handle.net/1721.1/144901
dc.description.abstractThe natural language processing field has seen task-oriented dialog systems emerge as a strong area of interest in research and industry over the past years. However, the limited existence of complex and sufficiently annotated training data still places a bottleneck on the development of more advanced, domain-agnostic chatbots. Novel domains require extensive time and manual effort from experts when creating intents for new datasets to support dialog systems. This thesis analyzes a two-staged unsupervised semantic clustering and intent generation approach with multiple dataset adaptive interchangeable methods. We examine various pre-trained embeddings, scoring objectives for the number of clusters, unsupervised clustering algorithms, intent generation techniques, and utterance tokenization schemes. We then run experiments with these combinations on three datasets: SNIPS, MultiWOZ, and real-world chat data. This is followed by quantitative metric and in-depth qualitative cluster-based evaluation. We show the benefits of bigram frequency intent generation as datasets increase irregularity and confirm the success of the universal sentence encoder embeddings with K-Means clustering. Additionally, our examination of real-world data underlines the importance of fine-grained utterance tokenization and gives promise to the feasibility of research methods on unpublished data. Altogether, this thesis provides a comprehensive analysis covering the abilities of the the two-stage pipeline components to support open intent discovery for a variety of dataset characteristics, offering alternative solutions where beneficial for real-world applications. This gives insight to the optimal configuration to automatically generate a novel dialog training dataset from unstructured, unlabeled chat utterances. The code for this thesis can be found at https://github.com/jnwagner53/dialog-intent-generation.
dc.publisherMassachusetts Institute of Technology
dc.rightsIn Copyright - Educational Use Permitted
dc.rightsCopyright MIT
dc.rights.urihttp://rightsstatements.org/page/InC-EDU/1.0/
dc.titleOpen Intent Generation Through Unsupervised Semantic Clustering of Task-Oriented Dialog
dc.typeThesis
dc.description.degreeM.Eng.
dc.contributor.departmentMassachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degreeMaster
thesis.degree.nameMaster of Engineering in Electrical Engineering and Computer Science


Files in this item

Thumbnail

This item appears in the following Collection(s)

Show simple item record