How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

Chia‐Wei Liu, Ryan Lowe, Iulian Vlad Serban, Mike Noseworthy, Laurent Charlin, Joëlle Pineau

Type: Article

Publication Date: 2016-01-01

Citations: 936

DOI: https://doi.org/10.18653/v1/d16-1230

Abstract

We investigate evaluation metrics for dialogue response generation systems where supervised labels, such as task completion, are not available. Recent works in response generation have adopted metrics from machine translation to compare a model's generated response to a single target response. We show that these metrics correlate very weakly with human judgements in the non-technical Twitter domain, and not at all in the technical Ubuntu domain. We provide quantitative and qualitative results highlighting specific weaknesses in existing metrics, and provide recommendations for future development of better automatic evaluation metrics for dialogue systems.

Locations

arXiv (Cornell University) - View - PDF
eScholarship@McGill (McGill) - View - PDF
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing - View - PDF

Similar Works

Action	Title	Year	Authors
+	How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation	2016	Chia‐Wei Liu Ryan Lowe Iulian Vlad Serban Michael D. Noseworthy Laurent Charlin Joëlle Pineau
+	Relevance of Unsupervised Metrics in Task-Oriented Dialogue for Evaluating Natural Language Generation	2017	Shikhar Sharma Layla El Asri Hannes Schulz Jérémie Zumer
+	Evaluating Dialogue Generation Systems via Response Selection	2020	Shiki Sato Reina Akama Hiroki Ouchi Jun Suzuki Kentaro Inui
+	Designing Precise and Robust Dialogue Response Evaluators	2020	Tianyu Zhao Divesh Lala Tatsuya Kawahara
+	Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses	2017	Ryan Lowe Michael D. Noseworthy Iulian Vlad Serban Nicolas Angelard-Gontier Yoshua Bengio Joëlle Pineau
+	Speaker Sensitive Response Evaluation Model	2020	JinYeong Bak Alice Oh
+ PDF Chat	On the Benchmarking of LLMs for Open-Domain Dialogue Evaluation	2024	John Mendonça Alon Lavie Isabel Trancoso
+	Evaluate What You Can't Evaluate: Unassessable Quality for Generated Response	2023	Yongkang Liu Feng Shi Daling Wang Yifei Zhang Hinrich Schütze
+	Efficient Task-Oriented Dialogue Systems with Response Selection as an Auxiliary Task	2022	Radostin Cholakov Todor Kolev
+	On the Use of Linguistic Features for the Evaluation of Generative Dialogue Systems	2021	Ian Berlot-Attwell Frank Rudzicz
+	Automatic Evaluation and Moderation of Open-domain Dialogue Systems	2021	Chen Zhang João Sedoc Luis Fernando D’Haro Rafael E. Banchs Alexander I. Rudnicky
+ PDF Chat	PairEval: Open-domain Dialogue Evaluation with Pairwise Comparison	2024	ChaeHun Park Minseok Choi Dohyun Lee Jaegul Choo
+	Rethinking Response Evaluation from Interlocutor's Eye for Open-Domain Dialogue Systems	2024	Yuma Tsuta Naoki Yoshinaga Shoetsu Sato Masashi Toyoda
+	How To Evaluate Your Dialogue System: Probe Tasks as an Alternative for Token-level Evaluation Metrics	2020	Prasanna Parthasarathi Joëlle Pineau Sarath Chandar

Works That Cite This (485)

Action	Title	Year	Authors
+	Open-Domain Conversational Agents: Current Progress, Open Problems, and Future Directions	2020	Stephen Roller Y-Lan Boureau Jason Weston Antoine Bordes Emily Dinan Angela Fan David Gunning Da Young Ju Margaret Li Spencer Poff
+ PDF Chat	CoreGen: Contextualized Code Representation Learning for Commit Message Generation	2021	Lun Yiu Nie Cuiyun Gao Zhicong Zhong Wai Lam Yang Liu Zenglin Xu
+ PDF Chat	SPBERTQA: A Two-Stage Question Answering System Based on Sentence Transformers for Medical Texts	2022	Nhung Thi-Hong Nguyen Phuong Phan-Dieu Ha Luan Thanh Nguyen Kiet Van Nguyen Ngan Luu-Thuy Nguyen
+	TCNN: Triple Convolutional Neural Network Models for Retrieval-based Question Answering System in E-commerce	2020	Shuangyong Song Chao Wang
+ PDF Chat	A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues	2017	Iulian Vlad Serban Alessandro Sordoni Ryan Lowe Laurent Charlin Joëlle Pineau Aaron Courville Yoshua Bengio
+ PDF Chat	ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning	2019	Maarten Sap Ronan Le Bras Emily Allaway Chandra Bhagavatula Nicholas Lourie Hannah Rashkin Brendan Roof Noah A. Smith Yejin Choi
+	Retrieval Augmentation Reduces Hallucination in Conversation	2021	Kurt Shuster Spencer Poff Moya Chen Douwe Kiela Jason Weston
+ PDF Chat	Tell Me About Yourself	2020	Ziang Xiao Michelle X. Zhou Q. Vera Liao Gloria Mark Changyan Chi Wenxi Chen Huahai Yang
+ PDF Chat	Deep Reinforcement Learning for Chatbots Using Clustered Actions and Human-Likeness Rewards	2019	Heriberto Cuayáhuitl Donghyeon Lee Seonghan Ryu Sung‐Ja Choi In-Chul Hwang Jihie Kim
+	Understanding and Improving the Exemplar-based Generation for Open-domain Conversation	2022	Seungju Han Beomsu Kim Seokjun Seo Enkhbayar Erdenee Buru Chang

Works Cited by This (16)

Action	Title	Year	Authors
+ PDF Chat	A Neural Network Approach to Context-Sensitive Generation of Conversational Responses	2015	Alessandro Sordoni Michel Galley Michael Auli Chris Brockett Yangfeng Ji Margaret Mitchell Jian‐Yun Nie Jianfeng Gao Bill Dolan
+	A Neural Conversational Model	2015	Oriol Vinyals Quoc V. Le
+	Generating Sequences With Recurrent Neural Networks	2013	Alex Graves
+ PDF Chat	Semantically Conditioned LSTM-based Natural Language Generation for Spoken Dialogue Systems	2015	Tsung-Hsien Wen Milica Gašić Nikola Mrkšić Pei-Hao Su David Vandyke Steve Young
+	A Diversity-Promoting Objective Function for Neural Conversation Models	2015	Jiwei Li Michel Galley Chris Brockett Jianfeng Gao Bill Dolan
+	PARADISE: A Framework for Evaluating Spoken Dialogue Agents	1997	Marilyn Walker Diane Litman Candace Kamm Alicia Abella
+	deltaBLEU: A Discriminative Metric for Generation Tasks with Intrinsically Diverse Targets	2015	Michel Galley Chris Brockett Alessandro Sordoni Yangfeng Ji Michael Auli Chris Quirk Margaret Mitchell Jianfeng Gao Bill Dolan
+	The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems	2015	Ryan Lowe Nissan Pow Iulian Vlad Serban Joëlle Pineau
+ PDF Chat	Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models	2016	Iulian Vlad Serban Alessandro Sordoni Yoshua Bengio Aaron Courville Joëlle Pineau
+ PDF Chat	A Diversity-Promoting Objective Function for Neural Conversation Models	2016	Jiwei Li Michel Galley Chris Brockett Jianfeng Gao Bill Dolan