How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

Type: Article

Publication Date: 2016-01-01

Citations: 936

DOI: https://doi.org/10.18653/v1/d16-1230

Abstract

We investigate evaluation metrics for dialogue response generation systems where supervised labels, such as task completion, are not available. Recent works in response generation have adopted metrics from machine translation to compare a model's generated response to a single target response. We show that these metrics correlate very weakly with human judgements in the non-technical Twitter domain, and not at all in the technical Ubuntu domain. We provide quantitative and qualitative results highlighting specific weaknesses in existing metrics, and provide recommendations for future development of better automatic evaluation metrics for dialogue systems.

Locations

  • arXiv (Cornell University)
  • eScholarship@McGill (McGill)
  • Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

Similar Works

Action Title Year Authors
+ How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation 2016 Chia-Wei Liu
Ryan Lowe
Iulian Vlad Serban
Michael D. Noseworthy
Laurent Charlin
Joëlle Pineau
+ Relevance of Unsupervised Metrics in Task-Oriented Dialogue for Evaluating Natural Language Generation 2017 Shikhar Sharma
Layla El Asri
Hannes Schulz
Jérémie Zumer
+ Evaluating Dialogue Generation Systems via Response Selection 2020 Shiki Sato
Reina Akama
Hiroki Ouchi
Jun Suzuki
Kentaro Inui
+ Designing Precise and Robust Dialogue Response Evaluators 2020 Tianyu Zhao
Divesh Lala
Tatsuya Kawahara
+ Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses 2017 Ryan Lowe
Michael D. Noseworthy
Iulian Vlad Serban
Nicolas Angelard-Gontier
Yoshua Bengio
Joëlle Pineau
+ Speaker Sensitive Response Evaluation Model 2020 JinYeong Bak
Alice Oh
+ On the Benchmarking of LLMs for Open-Domain Dialogue Evaluation 2024 John Mendonça
Alon Lavie
Isabel Trancoso
+ Evaluate What You Can't Evaluate: Unassessable Quality for Generated Response 2023 Yongkang Liu
Feng Shi
Daling Wang
Yifei Zhang
Hinrich Schütze
+ Efficient Task-Oriented Dialogue Systems with Response Selection as an Auxiliary Task 2022 Radostin Cholakov
Todor Kolev
+ On the Use of Linguistic Features for the Evaluation of Generative Dialogue Systems 2021 Ian Berlot-Attwell
Frank Rudzicz
+ Automatic Evaluation and Moderation of Open-domain Dialogue Systems 2021 Chen Zhang
João Sedoc
Luis Fernando D'Haro
Rafael E. Banchs
Alexander I. Rudnicky
+ PairEval: Open-domain Dialogue Evaluation with Pairwise Comparison 2024 ChaeHun Park
Minseok Choi
Dohyun Lee
Jaegul Choo
+ Rethinking Response Evaluation from Interlocutor's Eye for Open-Domain Dialogue Systems 2024 Yuma Tsuta
Naoki Yoshinaga
Shoetsu Sato
Masashi Toyoda
+ How To Evaluate Your Dialogue System: Probe Tasks as an Alternative for Token-level Evaluation Metrics 2020 Prasanna Parthasarathi
Joëlle Pineau
Sarath Chandar

Works That Cite This (485)

Action Title Year Authors
+ Open-Domain Conversational Agents: Current Progress, Open Problems, and Future Directions 2020 Stephen Roller
Y-Lan Boureau
Jason Weston
Antoine Bordes
Emily Dinan
Angela Fan
David Gunning
Da Young Ju
Margaret Li
Spencer Poff
+ CoreGen: Contextualized Code Representation Learning for Commit Message Generation 2021 Lun Yiu Nie
Cuiyun Gao
Zhicong Zhong
Wai Lam
Yang Liu
Zenglin Xu
+ SPBERTQA: A Two-Stage Question Answering System Based on Sentence Transformers for Medical Texts 2022 Nhung Thi-Hong Nguyen
Phuong Phan-Dieu Ha
Luan Thanh Nguyen
Kiet Van Nguyen
Ngan Luu-Thuy Nguyen
+ TCNN: Triple Convolutional Neural Network Models for Retrieval-based Question Answering System in E-commerce 2020 Shuangyong Song
Chao Wang
+ A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues 2017 Iulian Vlad Serban
Alessandro Sordoni
Ryan Lowe
Laurent Charlin
Joëlle Pineau
Aaron Courville
Yoshua Bengio
+ ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning 2019 Maarten Sap
Ronan Le Bras
Emily Allaway
Chandra Bhagavatula
Nicholas Lourie
Hannah Rashkin
Brendan Roof
Noah A. Smith
Yejin Choi
+ Retrieval Augmentation Reduces Hallucination in Conversation 2021 Kurt Shuster
Spencer Poff
Moya Chen
Douwe Kiela
Jason Weston
+ Tell Me About Yourself 2020 Ziang Xiao
Michelle X. Zhou
Q. Vera Liao
Gloria Mark
Changyan Chi
Wenxi Chen
Huahai Yang
+ Deep Reinforcement Learning for Chatbots Using Clustered Actions and Human-Likeness Rewards 2019 Heriberto Cuayáhuitl
Donghyeon Lee
Seonghan Ryu
Sung-Ja Choi
In-Chul Hwang
Jihie Kim
+ Understanding and Improving the Exemplar-based Generation for Open-domain Conversation 2022 Seungju Han
Beomsu Kim
Seokjun Seo
Enkhbayar Erdenee
Buru Chang

Works Cited by This (16)

Action Title Year Authors
+ A Neural Network Approach to Context-Sensitive Generation of Conversational Responses 2015 Alessandro Sordoni
Michel Galley
Michael Auli
Chris Brockett
Yangfeng Ji
Margaret Mitchell
Jian-Yun Nie
Jianfeng Gao
Bill Dolan
+ A Neural Conversational Model 2015 Oriol Vinyals
Quoc V. Le
+ Generating Sequences With Recurrent Neural Networks 2013 Alex Graves
+ Semantically Conditioned LSTM-based Natural Language Generation for Spoken Dialogue Systems 2015 Tsung-Hsien Wen
Milica Gašić
Nikola Mrkšić
Pei-Hao Su
David Vandyke
Steve Young
+ A Diversity-Promoting Objective Function for Neural Conversation Models 2015 Jiwei Li
Michel Galley
Chris Brockett
Jianfeng Gao
Bill Dolan
+ PARADISE: A Framework for Evaluating Spoken Dialogue Agents 1997 Marilyn Walker
Diane Litman
Candace Kamm
Alicia Abella
+ deltaBLEU: A Discriminative Metric for Generation Tasks with Intrinsically Diverse Targets 2015 Michel Galley
Chris Brockett
Alessandro Sordoni
Yangfeng Ji
Michael Auli
Chris Quirk
Margaret Mitchell
Jianfeng Gao
Bill Dolan
+ The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems 2015 Ryan Lowe
Nissan Pow
Iulian Vlad Serban
Joëlle Pineau
+ Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models 2016 Iulian Vlad Serban
Alessandro Sordoni
Yoshua Bengio
Aaron Courville
Joëlle Pineau
+ A Diversity-Promoting Objective Function for Neural Conversation Models 2016 Jiwei Li
Michel Galley
Chris Brockett
Jianfeng Gao
Bill Dolan