Measuring short-form factuality in large language models

We present SimpleQA, a benchmark that evaluates the ability of language models to answer short, fact-seeking questions. We prioritized two properties in designing this eval. First, SimpleQA is challenging, as it is adversarially collected against GPT-4 responses. Second, responses are easy to grade, because questions are created such that there …