Open Intent Generation Through Unsupervised Semantic Clustering of Task-Oriented Dialog

Wagner, Julia N.

dc.contributor.advisor	Glass, James
dc.contributor.author	Wagner, Julia N.
dc.date.accessioned	2022-08-29T16:19:44Z
dc.date.available	2022-08-29T16:19:44Z
dc.date.issued	2022-05
dc.date.submitted	2022-06-30T15:52:18.667Z
dc.identifier.uri	https://hdl.handle.net/1721.1/144901
dc.description.abstract	Task-oriented dialog systems have emerged as a strong area of interest in natural language processing research and industry in recent years. However, the scarcity of complex, sufficiently annotated training data remains a bottleneck for developing more advanced, domain-agnostic chatbots. Creating intents for new datasets in novel domains requires extensive time and manual effort from experts. This thesis analyzes a two-stage approach to unsupervised semantic clustering and intent generation, built from multiple interchangeable, dataset-adaptive methods. We examine various pre-trained embeddings, scoring objectives for selecting the number of clusters, unsupervised clustering algorithms, intent generation techniques, and utterance tokenization schemes. We then run experiments with these combinations on three datasets: SNIPS, MultiWOZ, and real-world chat data, followed by quantitative metric-based and in-depth qualitative cluster-based evaluation. We show the benefits of bigram-frequency intent generation as datasets become more irregular and confirm the success of Universal Sentence Encoder embeddings with K-Means clustering. Additionally, our examination of real-world data underlines the importance of fine-grained utterance tokenization and shows promise for applying research methods to unpublished data. Altogether, this thesis provides a comprehensive analysis of how well the components of the two-stage pipeline support open intent discovery across a variety of dataset characteristics, offering alternative solutions where beneficial for real-world applications. This gives insight into the optimal configuration for automatically generating a novel dialog training dataset from unstructured, unlabeled chat utterances. The code for this thesis can be found at https://github.com/jnwagner53/dialog-intent-generation.
dc.publisher	Massachusetts Institute of Technology
dc.rights	In Copyright - Educational Use Permitted
dc.rights	Copyright MIT
dc.rights.uri	http://rightsstatements.org/page/InC-EDU/1.0/
dc.title	Open Intent Generation Through Unsupervised Semantic Clustering of Task-Oriented Dialog
dc.type	Thesis
dc.description.degree	M.Eng.
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree	Master
thesis.degree.name	Master of Engineering in Electrical Engineering and Computer Science

Files in this item

Name: wagner-jnwagner-meng-eecs-2022 ...
Size: 32.68Mb
Format: PDF
Description: Thesis PDF

This item appears in the following Collection(s)

Graduate Theses