Beyond Leaderboards: A survey of methods for revealing weaknesses in Natural Language Inference data and models

Type: Preprint

Publication Date: 2020-01-01

Citations: 17

DOI: https://doi.org/10.48550/arxiv.2005.14709

Locations

  • arXiv (Cornell University)
  • DataCite API

Similar Works

Title | Year | Authors
Language Model Behavior: A Comprehensive Survey | 2023 | Tyler A. Chang, Benjamin Bergen
Self-Improving-Leaderboard(SIL): A Call for Real-World Centric Natural Language Processing Leaderboards | 2023 | Chanjun Park, Hyeonseok Moon, Seolhwa Lee, Jaehyung Seo, Sugyeong Eo, Heuiseok Lim
Recent Advances in Natural Language Inference: A Survey of Benchmarks, Resources, and Approaches | 2019 | Shane Storks, Qiaozi Gao, Joyce Chai
Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey | 2024 | Philipp Mondorf, Barbara Plank
Statistically Profiling Biases in Natural Language Reasoning Datasets and Models | 2023 | Shanshan Huang, Kenny Q. Zhu
MIRAGE: Evaluating and Explaining Inductive Reasoning Process in Language Models | 2024 | Jiachun Li, Pengfei Cao, Zhou Jin, Chen Yu-bo, Kang Liu, Jun Zhao
A Survey on Measuring and Mitigating Reasoning Shortcuts in Machine Reading Comprehension | 2022 | Xanh Ho, Johannes Mario Meissner, Saku Sugawara, Akiko Aizawa
Are Emergent Abilities in Large Language Models just In-Context Learning? | 2023 | Sheng Lu, Irina Bigoulaeva, Rachneet Sachdeva, Harish Tayyar Madabushi, Iryna Gurevych
LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios | 2024 | Xiaodong Wu, Minhao Wang, Yichen Liu, Xiaoming Shi, Yan He, Xingyu Lu, Xiashi Zhu, Wei Zhang
What BERT Is Not: Lessons from a New Suite of Psycholinguistic Diagnostics for Language Models | 2020 | Allyson Ettinger
Efficient Methods for Natural Language Processing: A Survey | 2022 | Marcos Treviso, Tianchu Ji, Ji-Ung Lee, Betty van Aken, Qingqing Cao, Manuel R. Ciosici, Michael Hassid, Kenneth Heafield, Sara Hooker, Pedro H. Martins
Lost in Inference: Rediscovering the Role of Natural Language Inference for Large Language Models | 2024 | Lovish Madaan, David Esiobu, Pontus Stenetorp, Barbara Plank, Dieuwke Hupkes
OLMES: A Standard for Language Model Evaluations | 2024 | Yuling Gu, Oyvind Tafjord, Bailey Kuehl, Dany Haddad, Jesse Dodge, Hannaneh Hajishirzi
Causal Evaluation of Language Models | 2024 | Sirui Chen, Bo Peng, Meiqi Chen, Ruiqi Wang, Mengying Xu, Xingyu Zeng, Rui Zhao, Shengjie Zhao, Yu Qiao, Chaochao Lu
Efficient Methods for Natural Language Processing: A Survey | 2023 | Marcos Treviso, Ji-Ung Lee, Tianchu Ji, Betty van Aken, Qingqing Cao, Manuel R. Ciosici, Michael Hassid, Kenneth Heafield, Sara Hooker, Colin Raffel
Reliable, Reproducible, and Really Fast Leaderboards with Evalica | 2024 | Dmitry Ustalov
Bias in Large Language Models: Origin, Evaluation, and Mitigation | 2024 | Yufei Guo, Muzhe Guo, Juntao Su, Zhou Yang, Mengqiu Zhu, Hongfei Li, Mengyang Qiu, Shuo Shuo Liu
Instruction Finetuning for Leaderboard Generation from Empirical AI Research | 2024 | Salomon Kabongo, Jennifer D’Souza

Works That Cite This (9)

Title | Year | Authors
Data and its (dis)contents: A survey of dataset development and use in machine learning research | 2021 | Amandalynne Paullada, Inioluwa Deborah Raji, Emily M. Bender, Emily Denton, Alex Hanna
The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics | 2021 | Sebastian Gehrmann, Tosin Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Anuoluwapo Aremu, Antoine Bosselut, Khyathi Raghavi Chandu, Miruna Clinciu, Dipanjan Das, Kaustubh Dhole
The Tail Wagging the Dog: Dataset Construction Biases of Social Bias Benchmarks | 2023 | Nikil Selvam, Sunipa Dev, Daniel Khashabi, Tushar Khot, Kai-Wei Chang
Semantics Altering Modifications for Evaluating Comprehension in Machine Reading | 2021 | Viktor Schlegel, Goran Nenadić, Riza Batista-Navarro
QA Dataset Explosion: A Taxonomy of NLP Resources for Question Answering and Reading Comprehension | 2022 | Anna Rogers, Matt Gardner, Isabelle Augenstein
Discovering Highly Influential Shortcut Reasoning: An Automated Template-Free Approach | 2023 | Daichi Haraguchi, Kiyoaki Shirai, Naoya Inoue, Natthawut Kertkeidkachorn
Data and its (dis)contents: A survey of dataset development and use in machine learning research | 2020 | Amandalynne Paullada, Inioluwa Deborah Raji, Emily M. Bender, Emily Denton, Alex Hanna
QA Dataset Explosion: A Taxonomy of NLP Resources for Question Answering and Reading Comprehension | 2021 | Anna Rogers, Matt Gardner, Isabelle Augenstein