dc.contributor.advisor | Glass, James | |
dc.contributor.author | Wagner, Julia N. | |
dc.date.accessioned | 2022-08-29T16:19:44Z | |
dc.date.available | 2022-08-29T16:19:44Z | |
dc.date.issued | 2022-05 | |
dc.date.submitted | 2022-06-30T15:52:18.667Z | |
dc.identifier.uri | https://hdl.handle.net/1721.1/144901 | |
dc.description.abstract | The natural language processing field has seen task-oriented dialog systems emerge as a strong area of interest in research and industry over the past years. However, the scarcity of complex, sufficiently annotated training data remains a bottleneck for the development of more advanced, domain-agnostic chatbots. Novel domains require extensive time and manual effort from experts to create intents for new datasets that support dialog systems. This thesis analyzes a two-stage unsupervised semantic clustering and intent generation approach with multiple interchangeable, dataset-adaptive methods. We examine various pre-trained embeddings, scoring objectives for the number of clusters, unsupervised clustering algorithms, intent generation techniques, and utterance tokenization schemes. We then run experiments with these combinations on three datasets: SNIPS, MultiWOZ, and real-world chat data. This is followed by an evaluation using quantitative metrics and in-depth qualitative cluster-based analysis. We show the benefits of bigram frequency intent generation as datasets become more irregular and confirm the success of the universal sentence encoder embeddings with K-Means clustering. Additionally, our examination of real-world data underlines the importance of fine-grained utterance tokenization and shows promise for the feasibility of applying research methods to unpublished data. Altogether, this thesis provides a comprehensive analysis of the ability of the two-stage pipeline components to support open intent discovery for a variety of dataset characteristics, offering alternative solutions where beneficial for real-world applications. This gives insight into the optimal configuration for automatically generating a novel dialog training dataset from unstructured, unlabeled chat utterances. The code for this thesis can be found at https://github.com/jnwagner53/dialog-intent-generation. | |
dc.publisher | Massachusetts Institute of Technology | |
dc.rights | In Copyright - Educational Use Permitted | |
dc.rights | Copyright MIT | |
dc.rights.uri | http://rightsstatements.org/page/InC-EDU/1.0/ | |
dc.title | Open Intent Generation Through Unsupervised Semantic Clustering of Task-Oriented Dialog | |
dc.type | Thesis | |
dc.description.degree | M.Eng. | |
dc.contributor.department | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science | |
mit.thesis.degree | Master | |
thesis.degree.name | Master of Engineering in Electrical Engineering and Computer Science | |