Evaluating language models for mathematics through interactions

Type: Article

Publication Date: 2024-06-03

Citations: 8

DOI: https://doi.org/10.1073/pnas.2318124121

Abstract

There is much excitement about the opportunity to harness the power of large language models (LLMs) when building problem-solving assistants. However, the standard methodology of evaluating LLMs relies on static pairs of inputs and outputs; this is insufficient for making an informed decision about which LLMs are best to use in an interactive setting, and how that varies by setting. Static assessment therefore limits how we understand language model capabilities. We introduce CheckMate, an adaptable prototype platform for humans to interact with and evaluate LLMs. We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics, with a mixed cohort of participants from undergraduate students to professors of mathematics. We release the resulting interaction and rating dataset, MathConverse. By analyzing MathConverse, we derive a taxonomy of human query behaviors and uncover that despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness in LLM generations, among other findings. Further, we garner a more granular understanding of GPT-4's mathematical problem-solving through a series of case studies contributed by experienced mathematicians. We conclude with actionable takeaways for ML practitioners and mathematicians: models that communicate uncertainty, respond well to user corrections, and can provide a concise rationale for their recommendations may constitute better assistants. Humans should inspect LLM output carefully given their current shortcomings and potential for surprising fallibility.
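
The contrast the abstract draws between static input-output scoring and interactive evaluation can be made concrete with a small sketch. The Python below is a minimal, hypothetical rating loop in the spirit of CheckMate, not the platform's actual implementation: every name here (`RatedTurn`, `InteractionTrace`, `run_interactive_evaluation`) and the 0-6 rating scales are illustrative assumptions. It shows only the core idea: each model reply is rated in the context of an evolving conversation, producing per-turn records analogous to those collected in MathConverse, with correctness and perceived helpfulness recorded separately so that divergence between the two can surface.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RatedTurn:
    """One query/reply exchange plus the participant's ratings (hypothetical schema)."""
    user_query: str
    model_reply: str
    correctness: int   # assumed 0-6 scale; the study's exact scales may differ
    helpfulness: int   # perceived helpfulness, rated by the participant


@dataclass
class InteractionTrace:
    """A full multi-turn session with one model on one problem."""
    model_name: str
    problem: str
    turns: List[RatedTurn] = field(default_factory=list)


def run_interactive_evaluation(
    model_name: str,
    problem: str,
    generate: Callable[[List[dict]], str],   # any chat-style model backend
    get_user_query: Callable[[], str],       # returns "" to end the session
    get_rating: Callable[[str], int],        # prompts the participant for a score
) -> InteractionTrace:
    """Collect a multi-turn interaction with per-turn ratings.

    Unlike static evaluation (one fixed input, one scored output), each reply
    is rated in context, so later queries can react to earlier mistakes --
    the interactive setting the abstract argues static benchmarks miss.
    """
    trace = InteractionTrace(model_name=model_name, problem=problem)
    history = [{"role": "system", "content": f"Assist with: {problem}"}]
    while True:
        query = get_user_query()
        if not query:
            break
        history.append({"role": "user", "content": query})
        reply = generate(history)
        history.append({"role": "assistant", "content": reply})
        trace.turns.append(RatedTurn(
            user_query=query,
            model_reply=reply,
            correctness=get_rating("mathematical correctness (0-6)"),
            helpfulness=get_rating("perceived helpfulness (0-6)"),
        ))
    return trace


if __name__ == "__main__":
    # Toy demonstration with canned callables (no real model attached).
    queries = iter(["Prove that the sum of two even integers is even.", ""])
    trace = run_interactive_evaluation(
        model_name="toy-model",
        problem="Parity of integer sums",
        generate=lambda h: "Let a = 2m and b = 2n; then a + b = 2(m + n), which is even.",
        get_user_query=lambda: next(queries),
        get_rating=lambda prompt: 6,
    )
    print(len(trace.turns), "rated turn(s) collected")
```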

Locations

  • Proceedings of the National Academy of Sciences
  • PubMed Central
  • arXiv (Cornell University)
  • Apollo (University of Cambridge)
  • PubMed

Similar Works

  • Evaluating Language Models for Mathematics through Interactions (2023). Katherine M. Collins, Albert Q. Jiang, Simon Frieder, Lionel Wong, Miri Zilka, Umang Bhatt, Thomas Lukasiewicz, Yuhuai Wu, Joshua B. Tenenbaum, William E. Hart
  • Mathematical Capabilities of ChatGPT (2023). Simon Frieder, Luca Pinchetti, Ryan‐Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Petersen, Alexis Chevalier, Julius Berner
  • Assessing the Impact of Prompting, Persona, and Chain of Thought Methods on ChatGPT's Arithmetic Capabilities (2023). Yuhao Chen, Chloe Wong, Hanwen Yang, Juan Aguenza, Sai Bhujangari, B. C. Vu, Xun Lei, Amisha Prasad, Manny Fluss, Eric Phuong
  • Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange (2024). Ankit Satpute, Noah Gießing, André Greiner-Petter, Moritz Schubotz, Olaf Teschke, Akiko Aizawa, Béla Gipp
  • LLaMa-SciQ: An Educational Chatbot for Answering Science MCQ (2024). Marc-Antoine Allard, Matin Ansaripour, Maria Yuffa, Paul Teiletche
  • TALM: Tool Augmented Language Models (2022). Aaron Parisi, Yao Zhao, Noah Fiedel
  • Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision (2024). Zhiheng Xi, Dingwen Yang, Jinyan Huang, Jiafu Tang, Guanyu Li, Yiwen Ding, Wei He, Boyang Hong, S. Do, Wenyu Zhan
  • Learning gain differences between ChatGPT and human tutor generated algebra hints (2023). Zachary A. Pardos, Shreya Bhandari
  • Can ChatGPT Defend its Belief in Truth? Evaluating LLM Reasoning via Debate (2023). Boshi Wang, Xiang Yue, Huan Sun
  • MATHSENSEI: A Tool-Augmented Large Language Model for Mathematical Reasoning (2024). Debrup Das, Debopriyo Banerjee, Somak Aditya, Ashish Kulkarni
  • TuringQ: Benchmarking AI Comprehension in Theory of Computation (2024). Pardis Sadat Zahraei, Ehsaneddin Asgari
  • Unreflected Acceptance – Investigating the Negative Consequences of ChatGPT-Assisted Problem Solving in Physics Education (2024). Lars Krupp, Steffen Steinert, Maximilian Kiefer-Emmanouilidis, Karina E. Avila, Paul Lukowicz, Jochen KĂŒhn, Stefan KĂŒchemann, Jakob Karolus
  • When Not to Answer: Evaluating Prompts on GPT Models for Effective Abstention in Unanswerable Math Word Problems (2024). Asir Saadat, Tasmia Binte Sogir, Md. Ahasanul Arefin Chowdhury, Saddam Al Aziz
  • Can LLMs replace Neil deGrasse Tyson? Evaluating the Reliability of LLMs as Science Communicators (2024). Prasoon Bajpai, N. C. Chatterjee, Subhabrata Dutta, Tanmoy Chakraborty
  • UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models (2025). Xin Xu, Jiaxin Zhang, Tianhao Chen, Zitong Chao, Jie Hu, Can Yang
  • Paraphrase and Solve: Exploring and Exploiting the Impact of Surface Form on Mathematical Reasoning in Large Language Models (2024). Yue Zhou, Yada Zhu, Diego Antognini, Yoon Kim, Yang Zhang
  • From Good to Great: Improving Math Reasoning with Tool-Augmented Interleaf Prompting (2024). Nuo Chen, Hongguang Li, Baoyuan Wang, Jia Li
  • BEATS: Optimizing LLM Mathematical Capabilities with BackVerify and Adaptive Disambiguate based Efficient Tree Search (2024). Linzhuang Sun, Hao Liang, Jingxuan Wei, Bihui Yu, Conghui He, Zenan Zhou, Wentao Zhang
  • Evaluating Mathematical Reasoning Beyond Accuracy (2024). Shijie Xia, Xuefeng Li, Yixin Liu, Tongshuang Wu, Pengfei Liu

Works Cited by This (16)

  • Explainable machine learning in deployment (2020). Umang Bhatt, Alice Xiang, Shubham Sharma, Adrian Weller, Ankur Taly, Yunhan Jia, Joydeep Ghosh, Ruchir Puri, José M. F. Moura, Peter Eckersley
  • Learning to Complement Humans (2020). Bryan Wilder, Eric Horvitz, Ece Kamar
  • Uncertainty as a Form of Transparency: Measuring, Communicating, and Using Uncertainty (2021). Umang Bhatt, Javier Antorán, Yunfeng Zhang, Q. Vera Liao, Prasanna Sattigeri, Riccardo Fogliato, Gabrielle Gauthier Melançon, Ranganath Krishnan, Jason Stanley, Omesh Tickoo
  • Acquisition of chess knowledge in AlphaZero (2022). Thomas McGrath, Andrei Kapishnikov, Nenad TomaĆĄev, Adam Pearce, Martin Wattenberg, Demis Hassabis, Been Kim, Ulrich Paquet, Vladimir Kramnik
  • Advancing mathematics by guiding human intuition with AI (2021). Alex Davies, Petar Veličković, Lars Buesing, Sam Blackwell, Daniel Zheng, Nenad TomaĆĄev, Richard Tanburn, Peter Battaglia, Charles Blundell, AndrĂĄs JuhĂĄsz
  • CoAuthor: Designing a Human-AI Collaborative Writing Dataset for Exploring Language Model Capabilities (2022). Mina Lee, Percy Liang, Qian Yang
  • Locating and Editing Factual Associations in GPT (2022). Kevin Meng, David Bau, Alex Andonian, Yonatan Belinkov
  • Language Models are Few-Shot Learners (2020). T. B. Brown, Benjamin F. Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell
  • Mathematical Capabilities of ChatGPT (2023). Simon Frieder, Luca Pinchetti, Ryan‐Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Petersen, Alexis Chevalier, Julius Berner
  • Co-Writing Screenplays and Theatre Scripts with Language Models: Evaluation by Industry Professionals (2023). Piotr Mirowski, Kory W. Mathewson, Jaylen Pittman, Richard Evans