Benchmarking Hallucination in Large Language Models based on Unanswerable Math Word Problem

Type: Preprint

Publication Date: 2024-03-06

Citations: 0

DOI: https://doi.org/10.48550/arxiv.2403.03558

Abstract

Large language models (LLMs) are highly effective across a wide range of natural language processing (NLP) tasks. However, in ambiguous contexts they are prone to producing unreliable conjectures, a phenomenon known as hallucination. This paper presents a new method for evaluating LLM hallucination in question answering (QA) based on unanswerable math word problems (MWPs). To support this approach, we develop a dataset called Unanswerable Math Word Problem (UMWP), comprising 5,200 questions across five categories. We also design an evaluation methodology that combines text similarity with mathematical expression detection to determine whether an LLM considers a question unanswerable. Extensive experiments on 31 LLMs, including GPT-3, InstructGPT, LLaMA, and Claude, demonstrate that in-context learning and reinforcement learning from human feedback (RLHF) training significantly enhance a model's ability to avoid hallucination. We show that utilizing MWPs is a reliable and effective approach to assessing hallucination. Our code and data are available at https://github.com/Yuki-Asuuna/UMWP.
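The unanswerability check described in the abstract can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the refusal templates, the 0.6 similarity threshold, and the expression regex are all assumptions made for demonstration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical refusal templates; the paper's actual template set differs.
UNANSWERABLE_TEMPLATES = [
    "the question is unanswerable",
    "there is not enough information to solve this problem",
    "the problem cannot be solved as stated",
]

# Matches a numeric/mathematical expression such as "42", "3.5 + 2", or "6 = 6".
MATH_EXPR = re.compile(r"[-+]?\d+(?:\.\d+)?(?:\s*[-+*/=]\s*[-+]?\d+(?:\.\d+)?)*")


def max_template_similarity(answer: str) -> float:
    """Highest character-level similarity between the answer and any template."""
    answer = answer.lower()
    return max(SequenceMatcher(None, answer, t).ratio()
               for t in UNANSWERABLE_TEMPLATES)


def flags_unanswerable(answer: str, sim_threshold: float = 0.6) -> bool:
    """Heuristic: the answer resembles a refusal template AND contains no
    concrete mathematical expression, i.e. the model declined to answer."""
    has_math = MATH_EXPR.search(answer) is not None
    return max_template_similarity(answer) >= sim_threshold and not has_math


print(flags_unanswerable("The question is unanswerable: no speed is given."))  # True
print(flags_unanswerable("The answer is 42."))  # False
```

In this sketch, a response counts as recognizing unanswerability only when both signals agree: it reads like a refusal and offers no numeric result. Treating either signal alone as decisive would misclassify hedged answers that still end with a number.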

Locations

  • arXiv (Cornell University)

Similar Works

  • Large Language Models Are Unconscious of Unreasonability in Math Problems (2024): Jingyuan Ma, Damai Dai, Zhifang Sui
  • When Not to Answer: Evaluating Prompts on GPT Models for Effective Abstention in Unanswerable Math Word Problems (2024): Asir Saadat, Tasmia Binte Sogir, Md. Ahasanul Arefin Chowdhury, Saddam Al Aziz
  • NLPBench: Evaluating Large Language Models on Solving NLP Problems (2023): Linxin Song, Jieyu Zhang, Lechao Cheng, Pengyuan Zhou, Tianyi Zhou, Irene Li
  • SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models (2024): Dian Yu, Baolin Peng, Ye Tian, Linfeng Song, Haitao Mi, Yu Dong
  • An Independent Evaluation of ChatGPT on Mathematical Word Problems (MWP) (2023): Paulo Shakarian, Abhinav Koyyalamudi, Noel Ngu, Lakshmivihari Mareedu
  • AGIBench: A Multi-granularity, Multimodal, Human-referenced, Auto-scoring Benchmark for Large Language Models (2023): Fei Tang, Wanling Gao, Luzhou Peng, Jianfeng Zhan
  • Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange (2024): Ankit Satpute, Noah Gießing, André Greiner-Petter, Moritz Schubotz, Olaf Teschke, Akiko Aizawa, Béla Gipp
  • Compound-QA: A Benchmark for Evaluating LLMs on Compound Questions (2024): Y. Thomas Hou, Yiqi Luo, Zhiwen Ruan, Hongru Wang, Weifeng Ge, Yun Chen, Guanhua Chen
  • ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection (2024): Yibo Yan, Shen Wang, Jiahao Huo, Hang Li, Boyan Li, Jiamin Su, Xiong Gao, Yifan Zhang, Tianlong Xu, Zhendong Chu
  • Template-Driven LLM-Paraphrased Framework for Tabular Math Word Problem Generation (2024): Xi Kang, Zimu Wang, Xiao-Bo Jin, Weimin Wang, Kaizhu Huang, Qiufeng Wang
  • Fine-grained Hallucination Detection and Mitigation in Language Model Mathematical Reasoning (2024): Ruosen Li, Ziming Luo, Xinya Du
  • Shifting Attention to Relevance: Towards the Uncertainty Estimation of Large Language Models (2023): Jinhao Duan, Hao Cheng, Shiqi Wang, Chenan Wang, Alex Zavalny, Renjing Xu, Bhavya Kailkhura, Kaidi Xu
  • Augmenting Math Word Problems via Iterative Question Composing (2024): Haoxiong Liu, Andrew Chi-Chih Yao
  • LLaMa-SciQ: An Educational Chatbot for Answering Science MCQ (2024): Marc-Antoine Allard, Matin Ansaripour, Maria Yuffa, Paul Teiletche
  • Evaluating Open-Domain Question Answering in the Era of Large Language Models (2023): Ehsan Kamalloo, Nouha Dziri, Charles L. A. Clarke, Davood Rafiei
  • Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation (2025): Sadegh Mahdavi, Muchen Li, Kaiwen Liu, Christos Thrampoulidis, Leonid Sigal, Renjie Liao
  • Uncertainty Estimation of Large Language Models in Medical Question Answering (2024): Jiaxin Wu, Yizhou Yu, Hong-Yu Zhou
  • Benchmarking Large Language Models via Random Variables (2025): Zijin Hong, H. Wu, Dongdong Su, Junnan Dong, Yilin Xiao, Yujing Zhang, Zhu Wang, Feiran Huang, Linyi Li, Hongxia Yang
  • Math Multiple Choice Question Generation via Human-Large Language Model Collaboration (2024): Jaewook Lee, Digory Smith, Simon Woodhead, Andrew Lan

Works That Cite This (0)


Works Cited by This (0)
