Benchmarking Hallucination in Large Language Models based on Unanswerable Math Word Problem

Type: Preprint

Publication Date: 2024-03-06

Citations: 0

DOI: https://doi.org/10.48550/arxiv.2403.03558

Abstract

Large language models (LLMs) are highly effective across a wide range of natural language processing (NLP) tasks. However, in ambiguous contexts they are prone to producing unreliable conjectures, a phenomenon known as hallucination. This paper presents a new method for evaluating LLM hallucination in question answering (QA) based on unanswerable math word problems (MWPs). To support this approach, we develop a dataset called Unanswerable Math Word Problem (UMWP), comprising 5,200 questions across five categories. We also design an evaluation methodology that combines text similarity with mathematical expression detection to determine whether an LLM considers a question unanswerable. Extensive experiments on 31 LLMs, including GPT-3, InstructGPT, LLaMA, and Claude, demonstrate that in-context learning and reinforcement learning from human feedback (RLHF) training significantly enhance a model's ability to avoid hallucination. We show that utilizing MWPs is a reliable and effective approach to assessing hallucination. Our code and data are available at https://github.com/Yuki-Asuuna/UMWP.
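The unanswerability check described in the abstract can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the refusal templates, the 0.6 similarity threshold, and the expression regex are all assumptions made for demonstration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical refusal templates; the paper's actual template set differs.
UNANSWERABLE_TEMPLATES = [
    "the question is unanswerable",
    "there is not enough information to solve this problem",
    "the problem cannot be solved as stated",
]

# Matches a numeric/mathematical expression such as "42", "3.5 + 2", or "6 = 6".
MATH_EXPR = re.compile(r"[-+]?\d+(?:\.\d+)?(?:\s*[-+*/=]\s*[-+]?\d+(?:\.\d+)?)*")


def max_template_similarity(answer: str) -> float:
    """Highest character-level similarity between the answer and any template."""
    answer = answer.lower()
    return max(SequenceMatcher(None, answer, t).ratio()
               for t in UNANSWERABLE_TEMPLATES)


def flags_unanswerable(answer: str, sim_threshold: float = 0.6) -> bool:
    """Heuristic: the answer resembles a refusal template AND contains no
    concrete mathematical expression, i.e. the model declined to answer."""
    has_math = MATH_EXPR.search(answer) is not None
    return max_template_similarity(answer) >= sim_threshold and not has_math


print(flags_unanswerable("The question is unanswerable: no speed is given."))  # True
print(flags_unanswerable("The answer is 42."))  # False
```

In this sketch, a response counts as recognizing unanswerability only when both signals agree: it reads like a refusal and offers no numeric result. Treating either signal alone as decisive would misclassify hedged answers that still end with a number.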

Locations

  • arXiv (Cornell University)

Similar Works

  • Large Language Models Are Unconscious of Unreasonability in Math Problems (2024): Jingyuan Ma, Damai Dai, Zhifang Sui
  • When Not to Answer: Evaluating Prompts on GPT Models for Effective Abstention in Unanswerable Math Word Problems (2024): Asir Saadat, Tasmia Binte Sogir, Md. Ahasanul Arefin Chowdhury, Saddam Al Aziz
  • NLPBench: Evaluating Large Language Models on Solving NLP Problems (2023): Linxin Song, Jieyu Zhang, Lechao Cheng, Pengyuan Zhou, Tianyi Zhou, Irene Li
  • SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models (2024): Dian Yu, Baolin Peng, Ye Tian, Linfeng Song, Haitao Mi, Yu Dong
  • An Independent Evaluation of ChatGPT on Mathematical Word Problems (MWP) (2023): Paulo Shakarian, Abhinav Koyyalamudi, Noel Ngu, Lakshmivihari Mareedu
  • AGIBench: A Multi-granularity, Multimodal, Human-referenced, Auto-scoring Benchmark for Large Language Models (2023): Fei Tang, Wanling Gao, Luzhou Peng, Jianfeng Zhan
  • Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange (2024): Ankit Satpute, Noah Gießing, André Greiner-Petter, Moritz Schubotz, Olaf Teschke, Akiko Aizawa, Béla Gipp
  • Compound-QA: A Benchmark for Evaluating LLMs on Compound Questions (2024): Y. Thomas Hou, Yiqi Luo, Zhiwen Ruan, Hongru Wang, Weifeng Ge, Yun Chen, Guanhua Chen
  • ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection (2024): Yibo Yan, Shen Wang, Jiahao Huo, Hang Li, Boyan Li, Jiamin Su, Xiong Gao, Yifan Zhang, Tianlong Xu, Zhendong Chu
  • Template-Driven LLM-Paraphrased Framework for Tabular Math Word Problem Generation (2024): Xi Kang, Zimu Wang, Xiao-Bo Jin, Weimin Wang, Kaizhu Huang, Qiufeng Wang
  • Fine-grained Hallucination Detection and Mitigation in Language Model Mathematical Reasoning (2024): Ruosen Li, Ziming Luo, Xinya Du
  • Shifting Attention to Relevance: Towards the Uncertainty Estimation of Large Language Models (2023): Jinhao Duan, Hao Cheng, Shiqi Wang, Chenan Wang, Alex Zavalny, Renjing Xu, Bhavya Kailkhura, Kaidi Xu
  • Augmenting Math Word Problems via Iterative Question Composing (2024): Haoxiong Liu, Andrew Chi-Chih Yao
  • LLaMa-SciQ: An Educational Chatbot for Answering Science MCQ (2024): Marc-Antoine Allard, Matin Ansaripour, Maria Yuffa, Paul Teiletche
  • Evaluating Open-Domain Question Answering in the Era of Large Language Models (2023): Ehsan Kamalloo, Nouha Dziri, Charles L. A. Clarke, Davood Rafiei
  • Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation (2025): Sadegh Mahdavi, Muchen Li, Kaiwen Liu, Christos Thrampoulidis, Leonid Sigal, Renjie Liao
  • Uncertainty Estimation of Large Language Models in Medical Question Answering (2024): Jiaxin Wu, Yizhou Yu, Hong-Yu Zhou
  • Benchmarking Large Language Models via Random Variables (2025): Zijin Hong, H. Wu, Dongdong Su, Junnan Dong, Yilin Xiao, Yujing Zhang, Zhu Wang, Feiran Huang, Linyi Li, Hongxia Yang
  • Math Multiple Choice Question Generation via Human-Large Language Model Collaboration (2024): Jaewook Lee, Digory Smith, Simon Woodhead, Andrew Lan

Works That Cite This (0)


Works Cited by This (0)
