Notice
This is not the latest version of this item. The latest version can be found at: https://dspace.mit.edu/handle/1721.1/137062.2
Approximating interactive human evaluation with self-play for open-domain dialog systems
| dc.contributor.author | Ghandeharioun, A | |
| dc.contributor.author | Shen, JH | |
| dc.contributor.author | Jaques, N | |
| dc.contributor.author | Ferguson, C | |
| dc.contributor.author | Jones, N | |
| dc.contributor.author | Lapedriza, A | |
| dc.contributor.author | Picard, R | |
| dc.date.accessioned | 2021-11-02T12:16:27Z | |
| dc.date.available | 2021-11-02T12:16:27Z | |
| dc.date.submitted | 2019 | |
| dc.identifier.uri | https://hdl.handle.net/1721.1/137062 | |
| dc.description.abstract | © 2019 Neural information processing systems foundation. All rights reserved. Building an open-domain conversational agent is a challenging problem. Current evaluation methods, mostly post-hoc judgments of static conversations, do not capture conversation quality in a realistic interactive context. In this paper, we investigate interactive human evaluation and provide evidence for its necessity; we then introduce a novel, model-agnostic, and dataset-agnostic method to approximate it. In particular, we propose a self-play scenario where the dialog system talks to itself, and we calculate a combination of proxies such as sentiment and semantic coherence on the conversation trajectory. We show that this metric is capable of capturing the human-rated quality of a dialog model better than any automated metric known to date, achieving a significant Pearson correlation (r > .7, p < .05). To investigate the strengths of this novel metric and interactive evaluation in comparison to state-of-the-art metrics and human evaluation of static conversations, we perform extended experiments with a set of models, including several that make novel improvements to recent hierarchical dialog generation architectures through sentiment and semantic knowledge distillation on the utterance level. Finally, we open-source the interactive evaluation platform we built and the dataset we collected to allow researchers to efficiently deploy and evaluate dialog models. | en_US |
| dc.language.iso | en | |
| dc.relation.isversionof | https://proceedings.neurips.cc/paper/2019/file/fc9812127bf09c7bd29ad6723c683fb5-Paper.pdf | en_US |
| dc.rights | Article is made available in accordance with the publisher's policy and may be subject to US copyright law. Please refer to the publisher's site for terms of use. | en_US |
| dc.source | Neural Information Processing Systems (NIPS) | en_US |
| dc.title | Approximating interactive human evaluation with self-play for open-domain dialog systems | en_US |
| dc.type | Article | en_US |
| dc.identifier.citation | Ghandeharioun, A, Shen, JH, Jaques, N, Ferguson, C, Jones, N et al. "Approximating interactive human evaluation with self-play for open-domain dialog systems." Advances in Neural Information Processing Systems, 32. | |
| dc.relation.journal | Advances in Neural Information Processing Systems | en_US |
| dc.eprint.version | Final published version | en_US |
| dc.type.uri | http://purl.org/eprint/type/ConferencePaper | en_US |
| eprint.status | http://purl.org/eprint/status/NonPeerReviewed | en_US |
| dc.date.updated | 2021-07-06T13:41:07Z | |
| dspace.orderedauthors | Ghandeharioun, A; Shen, JH; Jaques, N; Ferguson, C; Jones, N; Lapedriza, A; Picard, R | en_US |
| dspace.date.submission | 2021-07-06T13:41:10Z | |
| mit.journal.volume | 32 | en_US |
| mit.license | PUBLISHER_POLICY | |
| mit.metadata.status | Authority Work and Publication Information Needed | en_US |
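The abstract describes a self-play evaluation scheme: the dialog system talks to itself, and proxy scores such as sentiment and semantic coherence are computed over the resulting conversation trajectory and correlated (Pearson) against human ratings. The sketch below illustrates only the general shape of that pipeline with toy stand-ins: a tiny positive-word lexicon for sentiment, word-overlap for semantic coherence, an echo-style placeholder bot, and arbitrary weights. None of these stand-ins are the paper's actual estimators or models.

```python
import math

def self_play(model, seed_utterance, turns=6):
    """Let a dialog model respond to its own last utterance, yielding a trajectory."""
    conversation = [seed_utterance]
    for _ in range(turns - 1):
        conversation.append(model(conversation[-1]))
    return conversation

# Toy sentiment proxy: fraction of words drawn from a small positive lexicon.
POSITIVE = {"great", "good", "love", "nice", "happy", "fun"}

def sentiment_score(utterance):
    words = utterance.lower().split()
    return sum(w in POSITIVE for w in words) / max(len(words), 1)

def coherence_score(prev, curr):
    """Toy semantic coherence: Jaccard word overlap between consecutive turns."""
    a, b = set(prev.lower().split()), set(curr.lower().split())
    return len(a & b) / max(len(a | b), 1)

def hybrid_metric(conversation, w_sent=0.5, w_coh=0.5):
    """Combine per-turn proxies into one score for the whole trajectory."""
    sent = sum(sentiment_score(u) for u in conversation) / len(conversation)
    coh = sum(coherence_score(p, c) for p, c in zip(conversation, conversation[1:]))
    coh /= max(len(conversation) - 1, 1)
    return w_sent * sent + w_coh * coh

def pearson(xs, ys):
    """Pearson correlation, used to validate the metric against human ratings."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Trivial echo-style "model" standing in for a real dialog system.
bot = lambda utt: "that sounds great , tell me more about " + utt.split()[-1]
trajectory = self_play(bot, "i love hiking in the mountains")
score = hybrid_metric(trajectory)
```

In the paper's actual setup the proxies are learned/model-based rather than lexicon- or overlap-based, and the reported r > .7 correlation is against interactive human ratings collected on their open-sourced platform; the weights and turn counts here are placeholders.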
