The Disconnect Between AI Benchmarks and Math Research
Evaluating AI systems on their ability to be a mathematical copilot
Current AI systems boast impressive scores on mathematical benchmarks. Yet when confronted with the questions mathematicians actually ask in their daily research, these same systems often struggle, and don't even realize they are struggling.
As explored in "The Cultural Divide between Mathematics and AI", there's a significant gap between the AI and mathematics communities. That gap is what has motivated me to build tools to help with research, especially around paper search and AI question answering.
Special thanks to everyone who submitted questions and provided feedback on answers! With well over a thousand questions asked in just the past week, some interesting patterns are emerging, and I want to present some preliminary results.
For those new here, please sign up, join our chat, or reach out with any feedback.
The Unique Challenge of Mathematical Research Questions
Mathematical research presents fundamental challenges that make it a particularly stringent test of genuine intelligence rather than pattern recognition.
- The Abstraction Gap: In mathematics, the tools you need to solve a problem often look entirely different from the original problem, so semantic search is insufficient.
- A request to "build an optimal order router" requires understanding Multi-Armed Bandit algorithms.
- The breakthrough in "ranking web pages" came through eigenvalue analysis of adjacency graphs.
- Solving "high-dimensional sphere packing" requires mastery of Leech Lattices and Modular Functions.
- Finding when to "stop evaluating options" leads to the Secretary Problem.
- The Deep Historical Context: Mathematics is cumulative knowledge spanning centuries:
- Sometimes you have to track a thread back a hundred years (and across several languages) before finding the right reference.
- Notation evolves dramatically between eras and subfields, requiring translation between mathematical "dialects".
- The cost of missing a key historical reference isn't merely inefficiency—it can represent years of redundant work or false directions.
- The Critical Nuance Factor: Mathematics is exquisitely sensitive to small variations: equations that look nearly identical can require entirely different solution approaches (see the illustrative examples after this list).
- The Benchmark Reality Gap: Current mathematical benchmarks fundamentally misrepresent research mathematics:
- Benchmarks typically test algorithmic solving of known problem types with established solution methods.
- Actual research questions demand identifying which area of mathematics is relevant in the first place.
- Research advances often come from seeing unexpected connections across disparate mathematical domains.
- The "known-unknown" nature of benchmark problems versus the "unknown-unknown" nature of research creates a fundamental evaluation mismatch.
Real-World Examples: Where AI Systems Struggle
Let's examine a few examples that illustrate the challenges AI systems face.
Question: Does there exist a divergent series that converges on every subset of N with arithmetic density 0?
This question came up in a Reddit thread, where a user quickly explained that there isn't: for any divergent series one can iteratively construct a subset of density 0 on which the series still diverges.
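As a concrete special case (an illustration of why the answer is no, not the construction from the thread): the harmonic series diverges, the primes have density zero, and yet the series still diverges along the primes.

```latex
% Harmonic series: divergent, and it also diverges along the density-zero set of primes.
\sum_{n \ge 1} \frac{1}{n} = \infty, \qquad
\lim_{x \to \infty} \frac{\#\{p \le x : p \ \text{prime}\}}{x} = 0, \qquad
\sum_{p \ \text{prime}} \frac{1}{p} = \infty .
```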
Sky-T1 and O1 correctly said there isn't, while all other models claimed there is and gave a faulty construction.
This demonstrates how even relatively straightforward questions can reveal gaps in AI systems' reasoning.
Question: Can nowhere vanishing entire functions have a non-trivial linear dependency?
This was resolved in the negative by Borel in 1897, and there are good explanations in Lang's Introduction to Complex Hyperbolic Spaces and in books on Nevanlinna theory such as those by Rubel and by Kodaira.
Interestingly, none of the AI systems (including Sugaku) provided the fully correct answer, although Sky-T1 and GPT-4.5 mentioned relevant concepts like Wronskians and growth rates that appear in the proof. This highlights how even well-established mathematical results can be challenging for AI systems when they involve specialized areas of mathematics.
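For readers who want the key tool, here is the Wronskian that shows up in the proof (a standard fact recalled for convenience, not a quotation from the sources above): for functions holomorphic on a connected domain, the Wronskian vanishes identically if and only if the functions are linearly dependent over C.

```latex
% Wronskian of f_1, ..., f_n; for functions holomorphic on a connected domain,
% W \equiv 0 if and only if the f_i are linearly dependent over C.
W(f_1,\dots,f_n)(z) \;=\; \det\begin{pmatrix}
f_1(z) & \cdots & f_n(z)\\
f_1'(z) & \cdots & f_n'(z)\\
\vdots & & \vdots\\
f_1^{(n-1)}(z) & \cdots & f_n^{(n-1)}(z)
\end{pmatrix}.
```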
Question: What are the latest results on the Selberg Class Degree Conjecture? (also asked as: What is the Selberg Class Degree Conjecture and what is currently known about it?)
This is part of my own research area. The conjecture states that a certain invariant attached to each element of the Selberg class (its degree) is always a non-negative integer.
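For context, here is the rough shape of the definition (recalled from memory; see the literature for the precise axioms): each F in the Selberg class satisfies a functional equation with gamma factors, and the degree is read off from those factors.

```latex
% Functional equation for F in the Selberg class (Q > 0, \lambda_j > 0, \mathrm{Re}\,\mu_j \ge 0, |\omega| = 1):
%   \Phi(s) = Q^s \prod_{j=1}^{r} \Gamma(\lambda_j s + \mu_j)\, F(s), \qquad \Phi(s) = \omega\, \overline{\Phi(1 - \bar{s})}.
d_F \;=\; 2 \sum_{j=1}^{r} \lambda_j ,
\qquad
\text{Degree Conjecture: } d_F \in \mathbb{Z}_{\ge 0} \ \text{for every } F .
```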
Most AI systems provided partially correct but flawed answers, with common problems including:
- Incorrectly focusing only on classifying elements of natural degree while missing the critical question of whether non-integer degrees (such as 1<d<2) can exist
- Making false claims about the state of research (such as claiming degree 2 has been fully classified)
- Inventing non-existent papers or results
- Fabricating computational approaches or evidence
- Creating fictional connections to other mathematical problems
Sugaku's answer correctly summarized the known results, including the significant 2011 proof by Kaczorowski and Perelli that the degree cannot be strictly between 1 and 2, demonstrating a more accurate grasp of the research landscape.
What Mathematicians Actually Ask: Question Types
Using an LLM to analyze and classify the questions submitted to Sugaku, we've found that mathematicians primarily seek help with finding references and with applications of mathematics to other fields. (A minimal sketch of such a classification step follows the table below.)
Question Type | % |
---|---|
Find Relevant Resources | 24 |
Application to X | 14 |
Off-Topic / Not Math | 12 |
Ask about Specific Person | 10 |
Explain Concept/Definition | 9 |
Calculate/Compute | 8 |
Ask about Specific Paper | 7 |
Proof Assistance | 6 |
Problem Solving | 6 |
Research Suggestion/Direction | 2 |
Teaching Advice | 2 |
Website support | 1 |
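As an aside on methodology, a classification step like this can be set up with a short prompt. The sketch below assumes the OpenAI Python SDK (v1-style client) and an illustrative model name; it is not the production pipeline behind the table.

```python
# Minimal sketch: classify a submitted question into one of the fixed categories
# from the table above. Assumes the OpenAI Python SDK (v1-style client); the
# model name and prompt wording are illustrative, not the production setup.
from openai import OpenAI

CATEGORIES = [
    "Find Relevant Resources", "Application to X", "Off-Topic / Not Math",
    "Ask about Specific Person", "Explain Concept/Definition", "Calculate/Compute",
    "Ask about Specific Paper", "Proof Assistance", "Problem Solving",
    "Research Suggestion/Direction", "Teaching Advice", "Website support",
]

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def classify_question(question: str, model: str = "gpt-4o-mini") -> str:
    """Return exactly one category label for a submitted question."""
    prompt = (
        "Classify the following question into exactly one of these categories:\n"
        + "\n".join(f"- {c}" for c in CATEGORIES)
        + "\n\nQuestion:\n" + question
        + "\n\nAnswer with the category name only."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    label = response.choices[0].message.content.strip()
    # Fall back to the off-topic bucket if the model answers outside the list.
    return label if label in CATEGORIES else "Off-Topic / Not Math"


if __name__ == "__main__":
    print(classify_question("Where can I find results on gaps between primes?"))
```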
The Evaluation Paradox: Human Judgment vs. AI Judgment
To evaluate system performance, we've gathered feedback through two methods: direct user votes and LLM-based evaluations. The results reveal a striking disconnect.
Human Evaluation
When mathematicians vote on answer quality (+1 or -1), the rankings show Sugaku-MA1 in the lead, followed closely by DeepSeek R1 and O1.
Note: This is NOT a fair comparison since some of these questions were used to help train and calibrate Sugaku. I will update as more results come in.
model | vote score |
---|---|
sugaku-ma1 | 0.48 |
deepseek-r1 | 0.44 |
o1 | 0.43 |
o3-mini | 0.38 |
gemini-2-pro | 0.36 |
o1-mini | 0.35 |
claude-37-sonnet | 0.26 |
gpt-4o | 0.22 |
gpt-4_5 | 0.19 |
sky-t1 | 0.14 |
gemini-2-flash | 0.08 |
claude-3-5-haiku | -0.19 |
claude-3-5-sonnet | -0.19 |
claude-3-opus | -0.26 |
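For reference, here is one way such per-model scores can be aggregated (a minimal sketch assuming a votes log with columns `model` and `vote` taking values +1 or -1; not necessarily the exact computation behind the table):

```python
# Minimal sketch: aggregate per-model vote scores from a log of (model, vote)
# rows, where vote is +1 or -1. The CSV path and column names are illustrative.
import pandas as pd

votes = pd.read_csv("votes.csv")  # columns: model, vote  (vote in {+1, -1})
scores = (
    votes.groupby("model")["vote"]
    .mean()                        # average of +1/-1 votes per model
    .sort_values(ascending=False)
    .round(2)
)
print(scores)
```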
LLM Evaluation
When we used AI judges to evaluate the same answers, we discovered that each AI system consistently preferred answers from its own brand of models, regardless of actual mathematical correctness. Most tellingly, in cases where all AI systems provided incorrect answers, LLM judges still expressed high confidence in the responses from systems similar to themselves, while human experts correctly identified all answers as flawed.
Aside from this self-preference, O1 ranks consistently high and Sugaku ranks consistently low across the LLM judges.
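Before the per-judge tables, here is a minimal sketch of what an LLM-as-judge call with a scoring rubric looks like. It uses the OpenAI SDK as a stand-in for whichever judge is queried (the Gemini and Claude judges would use their own clients), and the prompt wording, model name, and rubric dimensions are illustrative rather than the exact setup behind the tables.

```python
# Minimal sketch: ask a judge model to score an answer on rubric dimensions like
# those in the tables below. Uses the OpenAI SDK as a stand-in for the judge;
# prompt wording, model name, and rubric are illustrative.
import json
from openai import OpenAI

client = OpenAI()

RUBRIC = ["accuracy", "relevance", "completeness", "overall"]


def judge_answer(question: str, answer: str, judge_model: str = "gpt-4o") -> dict:
    """Return rubric scores (1-10) for a single question/answer pair."""
    prompt = (
        "You are grading an answer to a mathematics research question.\n"
        f"Question:\n{question}\n\nAnswer:\n{answer}\n\n"
        "Score the answer from 1 to 10 on each of: "
        + ", ".join(RUBRIC)
        + '. Respond with a JSON object only, e.g. {"accuracy": 7, ...}.'
    )
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```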
Gemini as a Judge
Gemini rates Gemini 2 Pro highest by a wide margin.
model | accuracy | relevance | completeness | overall |
---|---|---|---|---|
gemini-2-pro | 9.1 | 9.4 | 9.1 | 9.1 |
gpt-4_5 | 8.9 | 9.4 | 8.3 | 8.6 |
o1 | 8.9 | 9.3 | 8.2 | 8.5 |
deepseek-r1 | 8.7 | 9.2 | 8.4 | 8.5 |
gemini-2-flash | 8.6 | 9.1 | 8.3 | 8.4 |
o1-mini | 8.5 | 9.0 | 8.1 | 8.3 |
o3-mini | 8.3 | 8.8 | 7.3 | 7.8 |
gpt-4o | 8.1 | 8.7 | 7.0 | 7.6 |
claude-3-5-sonnet | 8.0 | 8.6 | 7.0 | 7.4 |
sky-t1 | 7.8 | 8.4 | 6.9 | 7.2 |
claude-37-sonnet | 7.9 | 8.5 | 6.5 | 7.2 |
claude-3-opus | 7.6 | 8.2 | 6.6 | 7.0 |
sugaku-ma1 | 7.5 | 7.6 | 5.6 | 6.4 |
claude-3-5-haiku | 7.2 | 8.0 | 5.7 | 6.3 |
GPT as a Judge
GPT-4o rates GPT-4.5 highest, followed by GPT-4o itself and O1.
model | accuracy | relevance | understandability | succinctness | overall |
---|---|---|---|---|---|
gpt-4_5 | 8.3 | 8.4 | 8.2 | 7.7 | 8.2 |
gpt-4o | 8.0 | 8.0 | 8.4 | 8.1 | 8.1 |
o1 | 8.2 | 8.3 | 8.2 | 7.7 | 8.1 |
gemini-2-pro | 8.5 | 8.6 | 8.0 | 7.1 | 8.0 |
o1-mini | 8.0 | 8.0 | 8.1 | 7.6 | 7.9 |
claude-37-sonnet | 7.6 | 7.8 | 8.2 | 8.0 | 7.9 |
claude-3-5-sonnet | 7.6 | 7.7 | 8.2 | 7.9 | 7.8 |
o3-mini | 7.7 | 7.8 | 8.1 | 7.7 | 7.8 |
gemini-2-flash | 7.9 | 8.1 | 7.9 | 7.2 | 7.8 |
claude-3-opus | 7.5 | 7.6 | 8.0 | 7.7 | 7.7 |
claude-3-5-haiku | 7.1 | 7.3 | 8.0 | 8.0 | 7.6 |
sky-t1 | 7.2 | 7.4 | 7.9 | 7.6 | 7.5 |
deepseek-r1 | 7.8 | 7.9 | 7.3 | 6.2 | 7.3 |
sugaku-ma1 | 6.6 | 6.6 | 7.0 | 6.6 | 6.7 |
Claude as a Judge
Claude rates O1 and O3-mini highest, followed by Claude 3.7 Sonnet and Claude 3.5 Sonnet.
model | accuracy | relevance | understandability | succinctness | overall |
---|---|---|---|---|---|
o1 | 9.0 | 9.2 | 8.9 | 8.0 | 8.8 |
o3-mini | 8.3 | 8.6 | 8.8 | 8.3 | 8.5 |
claude-37-sonnet | 7.8 | 8.3 | 8.7 | 9.1 | 8.5 |
claude-3-5-sonnet | 7.7 | 8.3 | 8.7 | 9.1 | 8.4 |
gpt-4_5 | 8.3 | 8.9 | 8.4 | 7.9 | 8.4 |
gpt-4o | 7.7 | 8.2 | 8.7 | 8.7 | 8.3 |
gemini-2-pro | 8.8 | 9.1 | 8.8 | 6.3 | 8.2 |
deepseek-r1 | 8.3 | 8.8 | 8.5 | 7.4 | 8.2 |
o1-mini | 8.3 | 8.7 | 8.7 | 6.8 | 8.1 |
gemini-2-flash | 8.1 | 8.6 | 8.6 | 6.8 | 8.0 |
claude-3-5-haiku | 6.9 | 7.7 | 8.3 | 9.2 | 8.0 |
claude-3-opus | 7.1 | 7.7 | 8.4 | 8.7 | 8.0 |
sky-t1 | 7.0 | 7.6 | 8.1 | 7.7 | 7.6 |
sugaku-ma1 | 5.7 | 5.9 | 6.9 | 7.2 | 6.4 |
Sugaku's Approach: Building for Real Mathematical Understanding
The above challenges are especially pronounced for mathematical research questions, but they also show up in many other intellectual endeavors. I'm building Sugaku to directly address them:
- Reference-aware architecture: Designed to efficiently locate and integrate relevant mathematical literature spanning centuries
- Concept mapping across notation systems: Recognizes equivalent concepts expressed in different notational conventions
- Fine-tuned on actual research questions: Trained on the types of questions mathematicians genuinely ask, not artificial benchmarks
- Expert-guided development: Incorporating direct feedback from research mathematicians across specialties
- Mathematical consistency verification: Built-in mechanisms to check logical consistency and proof validity, and acknowledge uncertainty.
You can go to sugaku.net to test this out yourself with your own questions, browse and investigate papers, sign up to get a custom newsfeed of papers, track your projects and collaborators, and generate new ideas.
Whether or not you currently use AI tools to help with your research, I would love to hear from you. You can join our chat or reach out with any feedback.