The Disconnect Between AI Benchmarks and Math Research

Evaluating AI systems on their ability to be a mathematical copilot

Current AI systems boast impressive scores on mathematical benchmarks. Yet when confronted with the questions mathematicians actually ask in their daily research, these same systems often struggle, and don't even realize they are struggling.

As explored in "The Cultural Divide between Mathematics and AI", there's a significant gap between the AI and math worlds. In response, I have been building tools to help with research, especially around paper search and AI question answering.

Special thanks to everyone who submitted questions and provided feedback on answers! With well over a thousand questions asked in just the past week, we are seeing some interesting patterns and I wanted to present some preliminary results.

For those new here, please sign up, join our chat, or reach out with any feedback.

The Unique Challenge of Mathematical Research Questions

Mathematical research presents fundamental challenges that make it a particularly stringent test of genuine intelligence rather than pattern recognition.

  1. The Abstraction Gap: In mathematics, the tools you need to solve a problem often look entirely different from the original problem, so semantic search is insufficient.
    1. A request to "build an optimal order router" requires understanding Multi-Armed Bandit algorithms.
    2. The breakthrough in "ranking web pages" came through eigenvalue analysis of the web link graph's adjacency matrix.
    3. Solving "high-dimensional sphere packing" requires mastery of Leech Lattices and Modular Functions.
    4. Finding when to "stop evaluating options" leads to the Secretary Problem.
  2. The Deep Historical Context: Mathematics is cumulative knowledge spanning centuries:
    1. Sometimes you have to track a thread back a hundred years (and across several languages) before finding the right reference.
    2. Notation evolves dramatically between eras and subfields, requiring translation between mathematical "dialects".
    3. The cost of missing a key historical reference isn't merely inefficiency—it can represent years of redundant work or false directions.
  3. The Critical Nuance Factor: Mathematics is exquisitely sensitive to small variations. Consider these seemingly similar equations, each requiring entirely different solution approaches (a short code sketch for the first case follows this list):
    1. x²-y²=n (straightforward factorization)
    2. x²+y²=n (requires Gaussian integer theory and Cornacchia's Algorithm)
    3. x²-2y²=n (requires Pell's equation techniques and continued fractions)
    4. x²-4y²=n (returns to straightforward factorization)
    5. x²-y³=n (involves Mordell curves and Catalan's conjecture)
  4. The Benchmark Reality Gap: Current mathematical benchmarks fundamentally misrepresent research mathematics:
    1. Benchmarks typically test algorithmic solving of known problem types with established solution methods.
    2. Actual research questions demand identifying which area of mathematics is relevant in the first place.
    3. Research advances often come from seeing unexpected connections across disparate mathematical domains.
    4. The "known-unknown" nature of benchmark problems versus the "unknown-unknown" nature of research creates a fundamental evaluation mismatch.

Real-World Examples: Where AI Systems Struggle

Let's examine a few examples that illustrate the challenges AI systems face.

Question: Does there exist a divergent series which converges on every subset of N with arithmetic density 0?

This question came up in a Reddit thread, where a user quickly explained that there isn't: given any divergent series, one can iteratively construct a subset of the indices with arithmetic density 0 on which the series still diverges.
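As a hedged illustration (not necessarily the construction from the thread), here is one standard way to build such a subset for a series of positive terms: split the indices into dyadic blocks and keep only a shrinking fraction of the largest terms in each block.

```python
import math

def sparse_divergent_indices(a, num_blocks=18):
    """Given the terms a(n) > 0 of a divergent series, build an index set of
    small density on which the series still diverges.  In dyadic block k we
    keep only (roughly) a 1/S_k fraction of the largest terms, where S_k is
    the running total of block sums.  The kept fraction tends to 0 as the
    blocks grow (so the index set has density 0 in the limit), while each
    kept block contributes at least s_k / S_k, whose total diverges whenever
    the original series does (Abel-Dini)."""
    kept = []
    running = 0.0
    for k in range(1, num_blocks + 1):
        lo, hi = 2 ** (k - 1), 2 ** k            # dyadic block [lo, hi)
        block = [(a(n), n) for n in range(lo, hi)]
        s_k = sum(v for v, _ in block)
        running += s_k
        frac = min(1.0, 1.0 / running)           # shrinks to 0, but slowly
        m = max(1, math.ceil(frac * len(block)))
        block.sort(reverse=True)                 # keep the m largest terms
        kept.extend(n for _, n in block[:m])
    return sorted(kept)

# For the harmonic series: a crude density estimate of the kept set,
# and the (slowly but unboundedly growing) subseries sum over it.
S = sparse_divergent_indices(lambda n: 1.0 / n)
print(len(S) / S[-1], sum(1.0 / n for n in S))
```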

Sky-T1 and O1 correctly said there isn't, while all other models claimed such a series exists and gave a faulty construction.

This demonstrates how even relatively straightforward questions can reveal gaps in AI systems' reasoning.

Question: Can nowhere vanishing entire functions have a non-trivial linear dependency?

This was resolved in the negative by Borel in 1897, and there's a great explanation in Lang's Introduction to Complex Hyperbolic Spaces as well as in some books on Nevanlinna theory, such as those by Rubel and by Kodaira.

Interestingly, none of the AI systems (including Sugaku) provided the fully correct answer, although Sky-T1 and GPT-4.5 mentioned relevant concepts like Wronskians and growth rates that appear in the proof. This highlights how even well-established mathematical results can be challenging for AI systems when they involve specialized areas of mathematics.
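For reference, here is the statement as I understand it (a paraphrase; the exact formulation varies between the sources above, so check them before citing):

```latex
\textbf{Theorem (Borel, 1897).}
Let $f_1, \dots, f_n$ be nowhere vanishing entire functions such that no ratio
$f_i / f_j$ with $i \neq j$ is constant. Then $f_1, \dots, f_n$ are linearly
independent over $\mathbb{C}$. Equivalently, any non-trivial relation
$c_1 f_1 + \cdots + c_n f_n = 0$ forces some $f_i$ to be a constant multiple
of another $f_j$.
```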

Question: What are the latest results on the Selberg Class Degree Conjecture? (also asked as: What is the Selberg Class Degree Conjecture and what is currently known about it?)

This is close to my own research area. The conjecture states that a certain invariant attached to each element of the Selberg class, its degree, is always a non-negative integer.
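For readers outside analytic number theory, here is a paraphrase of the standard definitions (see a survey by Kaczorowski and Perelli for the precise axioms). Every element F of the Selberg class satisfies a functional equation of the shape

```latex
\[
  \Phi(s) = Q^{s} \prod_{j=1}^{k} \Gamma(\lambda_j s + \mu_j)\, F(s),
  \qquad
  \Phi(s) = \omega\, \overline{\Phi(1 - \bar{s})},
\]
\[
  Q > 0, \quad \lambda_j > 0, \quad \Re(\mu_j) \ge 0, \quad |\omega| = 1,
  \qquad\text{with degree}\qquad
  d_F = 2 \sum_{j=1}^{k} \lambda_j .
\]
```

The degree conjecture is the statement that d_F is a non-negative integer for every F; the gamma-factor data (Q, λ_j, μ_j) is not unique, but the degree is an invariant of F.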

Most AI systems provided partially correct but flawed answers, with common problems including:

  • Incorrectly focusing only on classifying elements of integer degree while missing the critical question of whether non-integer degrees (such as 1 < d < 2) can exist
  • Making false claims about the state of research (such as claiming degree 2 has been fully classified)
  • Inventing non-existent papers or results
  • Fabricating computational approaches or evidence
  • Creating fictional connections to other mathematical problems

Sugaku's answer correctly summarized the known results, including the significant 2011 proof by Kaczorowski and Perelli that the degree cannot be strictly between 1 and 2, demonstrating a more accurate grasp of the research landscape.

What Mathematicians Actually Ask: Question Types

Using an LLM to analyze and classify the questions submitted to Sugaku, we found that mathematicians most often ask for help finding relevant references, followed by questions about applying mathematics to areas outside it.

Question Type %
Find Relevant Resources 24
Application to X 14
Off-Topic / Not Math 12
Ask about Specific Person 10
Explain Concept/Definition 9
Calculate/Compute 8
Ask about Specific Paper 7
Proof Assistance 6
Problem Solving 6
Research Suggestion/Direction 2
Teaching Advice 2
Website support 1
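The categories above were assigned by an LLM. As a rough, hypothetical illustration of what such a single-label classification call can look like (the actual prompt and model behind this table are not described here, so treat every detail below as an assumption):

```python
from openai import OpenAI  # assumes the OpenAI Python SDK (openai>=1.0)

CATEGORIES = [
    "Find Relevant Resources", "Application to X", "Off-Topic / Not Math",
    "Ask about Specific Person", "Explain Concept/Definition", "Calculate/Compute",
    "Ask about Specific Paper", "Proof Assistance", "Problem Solving",
    "Research Suggestion/Direction", "Teaching Advice", "Website support",
]

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_question(question: str, model: str = "gpt-4o-mini") -> str:
    """Hypothetical single-label classification call, not the real pipeline."""
    prompt = (
        "Classify the following question into exactly one of these categories:\n"
        + "\n".join(f"- {c}" for c in CATEGORIES)
        + "\n\nReply with the category name only.\n\nQuestion: " + question
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```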

The Evaluation Paradox: Human Judgment vs. AI Judgment

To evaluate system performance, we've gathered feedback through two methods: direct user votes and LLM-based evaluations. The results reveal a striking disconnect.

Human Evaluation

When mathematicians vote on answer quality (+1 or -1), the rankings show Sugaku-MA1 in the lead, followed closely by DeepSeek R1 and O1.

Note: This is NOT a fair comparison since some of these questions were used to help train and calibrate Sugaku. I will update as more results come in.

model average vote
sugaku-ma1 0.48
deepseek-r1 0.44
o1 0.43
o3-mini 0.38
gemini-2-pro 0.36
o1-mini 0.35
claude-37-sonnet 0.26
gpt-4o 0.22
gpt-4_5 0.19
sky-t1 0.14
gemini-2-flash 0.08
claude-3-5-haiku -0.19
claude-3-5-sonnet -0.19
claude-3-opus -0.26

LLM Evaluation

When we used AI judges to evaluate the same answers, we discovered that each AI system consistently preferred answers from its own brand of models, regardless of actual mathematical correctness. Most tellingly, in cases where all AI systems provided incorrect answers, LLM judges still expressed high confidence in the responses from systems similar to themselves, while human experts correctly identified all answers as flawed.

Aside from this self-preference, O1 ranks consistently high and Sugaku ranks consistently low across all three judges.
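For readers unfamiliar with the setup, the sketch below shows what an LLM-as-judge scoring call can look like. It is a hypothetical illustration only: the actual prompts, rubric, and models used for the tables below are not described in this post, so every name and parameter here is an assumption.

```python
import json
from openai import OpenAI  # assumes the OpenAI Python SDK (openai>=1.0)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = (
    "You are grading an answer to a mathematics research question. "
    "Score it from 1-10 for accuracy, relevance, and completeness, and reply "
    'with JSON only, e.g. {"accuracy": 7, "relevance": 9, "completeness": 6}.'
)

def judge_answer(question: str, answer: str, judge_model: str = "gpt-4o") -> dict:
    """Hypothetical LLM-as-judge call; not the pipeline used for these tables."""
    response = client.chat.completions.create(
        model=judge_model,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question:\n{question}\n\nAnswer:\n{answer}"},
        ],
    )
    # Assumes the judge complies with the JSON-only instruction.
    return json.loads(response.choices[0].message.content)
```

Note that nothing in a rubric like this anchors the judge to ground truth, which is one way the self-preference effect described above can slip in.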

Gemini as a Judge

Gemini likes Gemini 2 Pro by far.

model accuracy relevance completeness overall
gemini-2-pro 9.1 9.4 9.1 9.1
gpt-4_5 8.9 9.4 8.3 8.6
o1 8.9 9.3 8.2 8.5
deepseek-r1 8.7 9.2 8.4 8.5
gemini-2-flash 8.6 9.1 8.3 8.4
o1-mini 8.5 9.0 8.1 8.3
o3-mini 8.3 8.8 7.3 7.8
gpt-4o 8.1 8.7 7.0 7.6
claude-3-5-sonnet 8.0 8.6 7.0 7.4
sky-t1 7.8 8.4 6.9 7.2
claude-37-sonnet 7.9 8.5 6.5 7.2
claude-3-opus 7.6 8.2 6.6 7.0
sugaku-ma1 7.5 7.6 5.6 6.4
claude-3-5-haiku 7.2 8.0 5.7 6.3

GPT as a Judge

GPT-4o likes GPT-4.5, followed by GPT-4o itself and O1.

model accuracy relevance understandability succinctness overall
gpt-4_5 8.3 8.4 8.2 7.7 8.2
gpt-4o 8.0 8.0 8.4 8.1 8.1
o1 8.2 8.3 8.2 7.7 8.1
gemini-2-pro 8.5 8.6 8.0 7.1 8.0
o1-mini 8.0 8.0 8.1 7.6 7.9
claude-37-sonnet 7.6 7.8 8.2 8.0 7.9
claude-3-5-sonnet 7.6 7.7 8.2 7.9 7.8
o3-mini 7.7 7.8 8.1 7.7 7.8
gemini-2-flash 7.9 8.1 7.9 7.2 7.8
claude-3-opus 7.5 7.6 8.0 7.7 7.7
claude-3-5-haiku 7.1 7.3 8.0 8.0 7.6
sky-t1 7.2 7.4 7.9 7.6 7.5
deepseek-r1 7.8 7.9 7.3 6.2 7.3
sugaku-ma1 6.6 6.6 7.0 6.6 6.7

Claude as a Judge

Claude likes O1 and O3-mini, followed by Claude 3.7 Sonnet and Claude 3.5 Sonnet.

model accuracy relevance understandability succinctness overall
o1 9.0 9.2 8.9 8.0 8.8
o3-mini 8.3 8.6 8.8 8.3 8.5
claude-37-sonnet 7.8 8.3 8.7 9.1 8.5
claude-3-5-sonnet 7.7 8.3 8.7 9.1 8.4
gpt-4_5 8.3 8.9 8.4 7.9 8.4
gpt-4o 7.7 8.2 8.7 8.7 8.3
gemini-2-pro 8.8 9.1 8.8 6.3 8.2
deepseek-r1 8.3 8.8 8.5 7.4 8.2
o1-mini 8.3 8.7 8.7 6.8 8.1
gemini-2-flash 8.1 8.6 8.6 6.8 8.0
claude-3-5-haiku 6.9 7.7 8.3 9.2 8.0
claude-3-opus 7.1 7.7 8.4 8.7 8.0
sky-t1 7.0 7.6 8.1 7.7 7.6
sugaku-ma1 5.7 5.9 6.9 7.2 6.4

Sugaku's Approach: Building for Real Mathematical Understanding

The above challenges are especially acute for mathematical research questions, but they also show up in many other intellectual endeavors. I'm building Sugaku to directly address them:

  1. Reference-aware architecture: Designed to efficiently locate and integrate relevant mathematical literature spanning centuries
  2. Concept mapping across notation systems: Recognizes equivalent concepts expressed in different notational conventions
  3. Fine-tuned on actual research questions: Trained on the types of questions mathematicians genuinely ask, not artificial benchmarks
  4. Expert-guided development: Incorporating direct feedback from research mathematicians across specialties
  5. Mathematical consistency verification: Built-in mechanisms to check logical consistency and proof validity, and acknowledge uncertainty.

You can go to sugaku.net to test this out yourself: ask your own questions, browse and investigate papers, sign up to get a custom newsfeed of papers, track your projects and collaborators, and generate new ideas.

Whether or not you currently use AI tools to help with your research, I would love to hear from you: join our chat or reach out with any feedback.