Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision

Type: Preprint

Publication Date: 2024-11-25

Citations: 0

DOI: https://doi.org/10.48550/arxiv.2411.16579

Abstract

Training large language models (LLMs) to spend more time thinking and reflecting before responding is crucial for effectively solving complex reasoning tasks in fields such as science, coding, and mathematics. However, the effectiveness of mechanisms like self-reflection and self-correction depends on the model's capacity to accurately assess its own performance, which can be limited by factors such as initial accuracy, question difficulty, and the lack of external feedback. In this paper, we delve into a two-player paradigm that separates the roles of the reasoning and critique models, where the critique model provides step-level feedback to supervise the reasoning (actor) model at both test time and training time. We first propose AutoMathCritique, an automated and scalable framework for collecting critique data, resulting in a dataset of 76,321 responses paired with step-level feedback. Fine-tuning language models on this dataset enables them to generate natural-language feedback for mathematical reasoning. We demonstrate that the critique models consistently improve the actor's performance on difficult queries at test time, especially when inference-time computation is scaled up. Motivated by these findings, we introduce critique-based supervision into the actor's self-training process and propose a critique-in-the-loop self-improvement method. Experiments show that the method improves the actor's exploration efficiency and solution diversity, especially on challenging queries, leading to a stronger reasoning model. Lastly, we take a preliminary step toward training self-talk reasoning models via critique supervision and showcase their potential. Our code and datasets are available at https://mathcritique.github.io/.
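The two-player test-time setup described in the abstract can be sketched as a simple refinement loop: the actor proposes an answer, the critique model returns step-level feedback, and the actor revises until the critic accepts or a round budget runs out. The `toy_actor`/`toy_critic` stand-ins and the stopping rule below are illustrative assumptions, not the paper's method; in the paper both players are fine-tuned LLMs exchanging natural-language feedback.

```python
def run_with_critique(actor, critic, question, max_rounds=3):
    """Two-player test-time loop: the actor answers, the critic returns
    step-level feedback, and the actor revises until the critic is
    satisfied or the round budget is exhausted."""
    feedback = None
    answer = None
    for _ in range(max_rounds):
        answer = actor(question, feedback)
        ok, feedback = critic(question, answer)
        if ok:
            break
    return answer

# Toy stand-ins: a real deployment would call two language models.
def toy_actor(question, feedback):
    # First attempt is wrong; a revision guided by feedback is right.
    return "4" if feedback else "5"

def toy_critic(question, answer):
    correct = (answer == "4")
    return correct, None if correct else "Step 2: 2 + 2 is 4, not 5."

print(run_with_critique(toy_actor, toy_critic, "What is 2 + 2?"))
```

With the toy players, the first round produces "5", the critic's step-level feedback triggers one revision, and the loop exits with "4"; scaling `max_rounds` is one way to spend more inference-time computation, as the abstract discusses.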

Locations

  • arXiv (Cornell University)

Similar Works

  • CriticBench: Benchmarking LLMs for Critique-Correct Reasoning (2024): Zicheng Lin, Zhibin Gou, Liang Tian, Ruilin Luo, Haowei Liu, Yujiu Yang
  • Small Language Models Need Strong Verifiers to Self-Correct Reasoning (2024): Yunxiang Zhang, Muhammad Khalifa, Lajanugen Logeswaran, Jaekyeom Kim, Moontae Lee, Honglak Lee, Lu Wang
  • Learn Beyond The Answer: Training Language Models with Reflection for Mathematical Reasoning (2024): Zhihan Zhang, Zhenwen Liang, Wenhao Yu, Dian Yu, Mengzhao Jia, Dong Yu, Meng Jiang
  • RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques (2025): Zhimin Tang, Zhao Li, Zuo Xiao, Tian Ding, Ruoyu Sun, Benyou Wang, Dayiheng Liu, Fei Huang, Tianyu Liu, Bowen Yu
  • Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback (2025): Yen-Ting Lin, Di Jin, Tengyu Xu, Tianhao Wu, Sainbayar Sukhbaatar, Zhu Chen, Yun Feng He, Yun-Nung Chen, Jason Weston, Yuandong Tian
  • Critique Ability of Large Language Models (2023): Liangchen Luo, Lin Zi, Yinxiao Liu, Lei Shu, Yun Zhu, Jingbo Shang, Lei Meng
  • Optimizing Language Model's Reasoning Abilities with Weak Supervision (2024): Yongqi Tong, Sizhe Wang, Dawei Li, Y.H. Wang, Simeng Han, Lin Zi, Chengsong Huang, Jiaxin Huang, Jingbo Shang
  • Large Language Models Can Self-Improve (2022): Jiaxin Huang, Shixiang Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, Jiawei Han
  • Large Language Models Can Self-Improve (2023): Jiaxin Huang, Shixiang Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, Jiawei Han
  • Large Language Models Can Self-Correct with Minimal Effort (2024): Zhenyu Wu, Qingkai Zeng, Zhihan Zhang, Zhaoxuan Tan, Chao Shen, Meng Jiang
  • Evaluating Mathematical Reasoning Beyond Accuracy (2024): Shijie Xia, Xuefeng Li, Yixin Liu, Tongshuang Wu, Pengfei Liu
  • RCOT: Detecting and Rectifying Factual Inconsistency in Reasoning by Reversing Chain-of-Thought (2023): Tianci Xue, Ziqi Wang, Zhenhailong Wang, Chi Han, Pengfei Yu, Heng Ji
  • Learning from Committee: Reasoning Distillation from a Mixture of Teachers with Peer-Review (2024): Zhuochun Li, Yuelyu Ji, Rui Meng, Daqing He
  • S³c-Math: Spontaneous Step-level Self-correction Makes Large Language Models Better Mathematical Reasoners (2024): Yuchen Yan, Jin Jiang, Yang Liu, Yixin Cao, Xin Xu, Mengdi Zhang, Xunliang Cai, Jian Q. Shao
  • Learning From Mistakes Makes LLM Better Reasoner (2023): Shengnan An, Zexiong Ma, Zeqi Lin, Nanning Zheng, Jian-Guang Lou, Weizhu Chen
  • Beyond Imitation: Learning Key Reasoning Steps from Dual Chain-of-Thoughts in Reasoning Distillation (2024): Chengwei Dai, Kun Li, Wei Zhou, Songlin Hu
  • CHAMP: A Competition-level Dataset for Fine-Grained Analyses of LLMs' Mathematical Reasoning Capabilities (2024): Yujun Mao, Yoon Kim, Yilun Zhou
  • Critical-Questions-of-Thought: Steering LLM reasoning with Argumentative Querying (2024): Federico Castagna, Isabel Sassoon, Simon Parsons
  • LLMs Can Find Mathematical Reasoning Mistakes by Pedagogical Chain-of-Thought (2024): Zhuoxuan Jiang, Haoyuan Peng, Shanshan Feng, Fan Li, Dongsheng Li

Works That Cite This (0)


Works Cited by This (0)
