
👾 Benchmarks in Artificial Intelligence: Measuring, Comparing, Understanding

Exploring AI benchmarks: How we measure, compare, and challenge the intelligence of today's leading models.

The rapid development of large language models (LLMs) such as GPT-4o, Claude 3.5, or Gemini 2.0 has brought a central question into focus: How do you measure progress in artificial intelligence? This is where benchmarks come into play. They serve as a yardstick for comparing the performance of different models. But not all benchmarks are the same: some test pure factual knowledge, others test logical reasoning, creative problem solving, or even approaches to artificial general intelligence (AGI). In this article, we take a detailed look at the most important benchmarks, what sets them apart, and what criticism they attract.

[Image: Epoch.ai]

The Evolution of Benchmarks

At the beginning of AI development, simple tests such as word translation or syntactic analyses were sufficient to determine progress. Later came more complex benchmarks such as GLUE and SuperGLUE, which measure language comprehension and reasoning ability. But with today's models, which are already capable of generating extensive texts and answering complex questions, more sophisticated tests have become necessary.

As a rule, results are reported as "pass@1". Pass@1 is a metric for evaluating the performance of generative AI models, particularly in code generation and question-answering systems. The value describes the probability that the first generated answer is correct. It is a top-1 accuracy metric and indicates how often a model provides the correct solution on its first attempt, without being allowed multiple tries (a short sketch of how it is computed follows the two notes below).

Pass@1 is particularly crucial for code generation

In benchmarks such as HumanEval or MBPP, a high Pass@1 value is an indicator of reliable AI code generation. OpenAI, DeepMind and Meta often use Pass@1 to evaluate LLMs for software development.

Pass@1 can be misleading for language models

For general language models (e.g. GPT-4, Claude, Gemini), there is often more than one correct answer, and a "good" answer can still be scored as wrong if it does not exactly match the expected solution.
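To make the metric concrete, here is a minimal sketch of the standard unbiased pass@k estimator popularized by the HumanEval paper: for a task where n samples were generated and c of them are correct, pass@k = 1 − C(n−c, k)/C(n, k), which for k = 1 reduces to c/n. The per-task counts in the example are invented for illustration.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n = samples generated for a task,
    c = number of correct samples. Returns the probability that at
    least one of k randomly drawn samples is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# pass@1 reduces to the fraction of correct samples, c / n.
# Hypothetical per-task (n, c) counts, purely for illustration:
tasks = [(10, 7), (10, 2), (10, 0)]
pass1 = sum(pass_at_k(n, c, 1) for n, c in tasks) / len(tasks)
print(f"pass@1 = {pass1:.2f}")  # pass@1 = 0.30
```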

Important Benchmarks and Their Special Features

1. ARC-AGI (Abstraction and Reasoning Corpus for AGI)

  • Developed to test a kind of “human logic” in machines.

  • Inspired by Raven's Progressive Matrices and other visual abstraction tasks.

  • Particularly important because it doesn't just test memorized knowledge, but requires pattern recognition and strategic thinking.

  • Current winner: OpenAI's o3

The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) is a unique benchmark designed to measure the progress of AI systems towards general intelligence. Unlike many other tests, ARC-AGI focuses on evaluating a system's ability to efficiently learn new skills and respond appropriately to unfamiliar situations. This is achieved through tasks that require human cognitive processes such as object permanence, goal orientation, counting, and geometric intuition.
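To give a feel for what these tasks look like, here is a small hand-made illustration of the ARC-AGI task format: each task is a JSON object with a few "train" input/output grid pairs and one or more "test" pairs, where grids are rows of integers 0-9 encoding colors. The grids and the hidden rule below are invented for illustration; real tasks come from the public ARC-AGI repository.

```python
# A toy task in the ARC-AGI JSON format: grids are lists of rows of
# integers 0-9 (colour codes). All values below are invented.
toy_task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [
        {"input": [[3, 0], [0, 3]], "output": [[0, 3], [3, 0]]},
    ],
}

def solve(grid):
    """Hypothetical solver: the hidden rule in this toy task is
    'mirror each row horizontally' (here: swap the two columns)."""
    return [list(reversed(row)) for row in grid]

# A task counts as solved only if every test output matches exactly.
print(all(solve(pair["input"]) == pair["output"] for pair in toy_task["test"]))  # True
```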

The developers of the ARC-AGI benchmark did not expect LLMs to saturate the benchmark so quickly; they assumed it would take at least another twelve months before a model reached 85%. OpenAI, however, has shown that development is moving faster than most of us assumed, and the benchmark is now considered saturated. The last stronghold of (typically) human performance has fallen: o3 scores 87.5% versus a human baseline of 85%.

2. GPQA (A Graduate-Level Google-Proof Q&A Benchmark)

  • Focuses on graduate-level questions in biology, physics, and chemistry, written by domain experts.

  • Offers little to no explicit cues, so a model must demonstrate real domain knowledge and contextual understanding.

  • Excellent for testing how well an AI can combine facts from different domains.

  • PhD-level experts achieve an average of 74% within their own field; validators working outside their field, however, only reach around 34% on average.

Google-proof: “In this work, we are interested in questions where the ground truth is not available to non-experts using easily-found internet resources, since we require that questions be hard and Google-proof in order to be suitable for scalable oversight experiments.” (GPQA ArXiv)

The graduate-level Google-Proof Q&A Benchmark (GPQA) is a challenging dataset designed to evaluate the capabilities of LLMs and scalable oversight methods. It consists of 448 multiple-choice questions in the fields of biology, physics, and chemistry, created by domain experts. These questions are designed to be challenging even for experts who hold or are pursuing doctorates in the respective subject areas; such experts achieve an accuracy rate of 74%. Remarkably, highly qualified non-experts with unrestricted internet access and an average of over 30 minutes of research time per question only achieve an accuracy rate of 34%, which underlines how "Google-proof" the benchmark is.

The benchmark was saturated by OpenAI's o3-mini high, which achieved an average of 77%; the full o3 already reaches 87.7%.

"Experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web" (GPQA)

3. MMLU (Massive Multitask Language Understanding)

  • Contains questions from 57 different subject areas and disciplines, including history, physics, medicine, and law.

  • Particularly exciting because it tests a wide range of specialist knowledge and is therefore reminiscent of university examinations.

  • Criticized because it can sometimes be easily mastered with mere memorization.

Massive Multitask Language Understanding (MMLU) is a benchmark designed to evaluate the capabilities of language models across a wide range of topics. It comprises approximately 16,000 multiple-choice questions from 57 academic subjects, including mathematics, philosophy, law, and medicine. The goal is to test both the world knowledge and problem-solving abilities of models.

"While many existing benchmarks focus on specific tasks or understanding 'common sense', MMLU goes beyond this by confronting models with complex questions from a variety of disciplines" (ainewsdaily.com)
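Since MMLU is a multiple-choice benchmark spread over 57 subjects, its headline number is essentially an accuracy average across subjects. The following toy sketch, with made-up answers and only two subjects, shows how such a macro-averaged score can be computed; it is an illustration, not an official MMLU harness.

```python
from collections import defaultdict

# Hypothetical per-question results: (subject, model_answer, correct_answer).
results = [
    ("physics", "B", "B"), ("physics", "C", "A"),
    ("law", "D", "D"), ("law", "A", "A"),
]

per_subject = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
for subject, predicted, gold in results:
    per_subject[subject][0] += predicted == gold
    per_subject[subject][1] += 1

# MMLU scores are commonly reported as the average accuracy over subjects,
# so small subjects carry the same weight as large ones.
macro = sum(correct / total for correct, total in per_subject.values()) / len(per_subject)
print(f"macro accuracy: {macro:.2f}")  # 0.75 for this toy data
```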

4. AIME (American Invitational Mathematics Examination)

  • Developed for evaluating the mathematical abilities of LLMs.

  • Particularly important because mathematics requires abstract and rule-based logic.

  • Shows whether a model possesses real reasoning abilities, not just language patterns.

  • o3 scored 96.7% in AIME 2024 with only one incorrect answer and achieved 87.7% in GPQA Diamond, outperforming human experts.

AIME (American Invitational Mathematics Examination) is a challenging mathematics competition originally designed for talented high school students in the United States. In recent years, however, AIME has also established itself as an important benchmark for evaluating the mathematical abilities of AI models.

5. HELM (Holistic Evaluation of Language Models)

  • Takes into account not only performance, but also bias, robustness, and ethical issues.

  • Particularly innovative because it measures fairness and safety in addition to accuracy.

  • Provides a more comprehensive view of a model's strengths and weaknesses.

The Holistic Evaluation of Language Models (HELM) is a comprehensive benchmark developed by Stanford's Center for Research on Foundation Models (CRFM). The aim of HELM is to improve the transparency of language models by evaluating them across a variety of scenarios and metrics. This enables a deeper understanding of their capabilities, limitations, and potential risks.
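Conceptually, HELM scores every model on a grid of scenarios times metrics rather than on a single accuracy number. The toy sketch below illustrates that idea with two invented scenarios and invented metric values; it is not HELM's actual implementation.

```python
# Toy illustration of HELM's "scenarios x metrics" idea: each model is
# scored on several metrics per scenario, not on accuracy alone.
# All scenario names and numbers below are invented.
scores = {
    "question_answering": {"accuracy": 0.81, "robustness": 0.74, "fairness": 0.69},
    "summarization":      {"accuracy": 0.66, "robustness": 0.71, "fairness": 0.73},
}

metrics = sorted({metric for row in scores.values() for metric in row})
for metric in metrics:
    average = sum(row[metric] for row in scores.values()) / len(scores)
    print(f"{metric:>10}: {average:.2f}")
```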

HELM is showing the first signs of saturation in some areas (such as safety scores), but its modular architecture and continuous expansion make it significantly more resistant to obsolescence than older benchmarks. The developers emphasize that the current version v1.0 is just the beginning of systematic safety evaluations.

6. MATH Benchmark

  • Designed for complex mathematical tasks.

  • Includes questions at school and university level.

  • Evaluates the ability of a model to understand and correctly apply mathematical logic.

  • Leading models such as OpenAI o1 now achieve 94.8% accuracy on the MATH dataset; the new o3-mini high even reaches 97.9%.

The MATH benchmark is a dataset specifically for evaluating the mathematical abilities of artificial intelligence, especially language models. It was developed to measure and compare the performance of AI systems in solving complex mathematical problems.

The MATH benchmark is largely saturated for state-of-the-art models, which has prompted the development of more demanding successors such as FrontierMath.

7. FrontierMath

  • Designed to test state-of-the-art AI systems in math.

  • Contains highly complex problems that go beyond standardized school math.

  • Shows whether a model is suitable for scientific and engineering applications.

  • Developed with the help of 60 scientists worldwide, including Fields Medalists, to create extremely challenging math problems.

The FrontierMath Benchmark is an important tool for evaluating the advanced mathematical abilities of AI systems. It was developed to test the limits of mathematical thinking and problem-solving abilities of AI.

The tasks included in the benchmark are designed in such a way that they take even experienced mathematicians several hours or even days to solve. Current AI systems, including advanced models such as GPT-4 and Gemini, successfully solve fewer than 2% of these tasks.

FrontierMath has not yet been saturated. Nevertheless, OpenAI surprised observers with a score of 25.2% when o3 was run with high inference compute, a result the benchmark's developers had not expected for at least another year.

8. Humanity’s Last Exam

  • A benchmark developed to measure whether AI can compete with humans in exams.

  • Contains expert-written, exam-style questions from more than 100 academic disciplines.

  • Exciting because it assesses the real performance of AI in standardized test environments.

What makes HLE special is its exceptional difficulty and broad range of topics. With 3,000 challenging questions contributed by nearly 1,000 subject-matter experts from over 500 institutions worldwide, the test covers over 100 academic disciplines, including math, humanities, and science.

Another outstanding feature of the HLE is its multimedia approach: about 10% of the questions require both text and image comprehension, while the remaining 90% are text-based. This diversity ensures that the models are tested in different contexts.

So far, OpenAI's new research agent “Deep Research” has achieved the best result of 26.6%. OpenAI writes:

On Humanity’s Last Exam, a recently released evaluation that tests AI across a broad range of subjects on expert-level questions, the model powering deep research scores a new high at 26.6% accuracy. This test consists of over 3,000 multiple choice and short answer questions across more than 100 subjects from linguistics to rocket science, classics to ecology. Compared to OpenAI o1, the largest gains appeared in chemistry, humanities and social sciences, and mathematics. The model powering deep research showcased a human-like approach by effectively seeking out specialized information when necessary.

9. SWE-bench (Software Engineering Benchmark)

  • Measures the ability of AI models to solve software engineering tasks.

  • Includes programming tasks, code analysis and debugging tests.

  • Important as AI is increasingly used in software development.

  • Here, too, o3 is by far the best LLM, currently achieving 71.7%.

The AI SWE benchmark, best known as SWE-bench, is a major tool for evaluating the ability of AI models to solve real-world software problems. Unlike traditional programming benchmarks, which are often based on isolated code snippets, SWE-bench provides AI systems with actual GitHub issues from popular open-source Python projects. The task is to analyze the entire code base and make appropriate changes to fix the described problems. The effectiveness of these solutions is verified by existing unit tests, which ensure that the proposed changes both solve the specific problem and maintain the integrity of the rest of the code.
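The core of this evaluation protocol can be sketched in a few lines: clone the repository, apply the model's patch, and check whether the project's unit tests pass. The sketch below is a simplified illustration, not the official SWE-bench harness, which additionally pins repositories to specific commits and distinguishes tests that must newly pass from tests that must keep passing; the repository URL and test command are placeholders.

```python
import subprocess
import shutil
import tempfile

def evaluate_patch(repo_url: str, model_patch: str, test_cmd: list[str]) -> bool:
    """Simplified SWE-bench-style check: apply the model's patch to a fresh
    clone and report whether the project's unit tests pass."""
    workdir = tempfile.mkdtemp()
    try:
        subprocess.run(["git", "clone", "--depth", "1", repo_url, workdir], check=True)
        # Apply the patch the model generated for the GitHub issue.
        subprocess.run(["git", "apply", "-"], input=model_patch.encode(),
                       cwd=workdir, check=True)
        # The fix only counts if the existing test suite succeeds.
        return subprocess.run(test_cmd, cwd=workdir).returncode == 0
    except subprocess.CalledProcessError:
        return False
    finally:
        shutil.rmtree(workdir, ignore_errors=True)

# Hypothetical usage:
# solved = evaluate_patch("https://github.com/example/project.git",
#                         patch_text, ["pytest", "-q"])
```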

SWE-bench Verified remains an effective progress indicator, as even leading models are still unable to solve >50% of the issues. The saturation threshold will only be reached when several models consistently achieve >90%.

10. SWE-Lancer (Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?)

  • Evaluates the capabilities of advanced language models in the area of freelance software development 

  • It includes over 1,400 real tasks from the Upwork platform, worth a total of $1 million.

  • These tasks range from simple bug fixes for $50 to complex feature implementations worth $32,000.

  • Developed by OpenAI; currently the newest benchmark for measuring agentic capabilities.

OpenAI's new SWE Lancer benchmark is particularly noteworthy because it assesses the performance of AI models based on real freelance software development tasks. The benchmark includes over 1,400 tasks from the Upwork platform with a total value of $1 million. These tasks range from simple bug fixes to complex feature implementations, thus reflecting a broad spectrum of real-world software development challenges.

The models are evaluated using end-to-end tests that have been triple-verified by experienced software engineers. For management-style tasks, the model's choices are compared with the original decisions made by the hired engineering managers. The results show that current models are not yet able to successfully complete the majority of the tasks.

Criticism of Benchmarks

Although benchmarks are essential for progress, they also have weaknesses:

  1. Optimization for benchmarks: Models are trained in such a way that they perform particularly well on certain benchmarks, without this having any significance in real-world applications.

  2. Lack of generalization ability: A model that performs excellently in MMLU may still have problems in open conversations.

  3. Bias and distortions: Many benchmarks are based on Western concepts of intelligence and language, which leads to disadvantages for people from other cultural backgrounds.

  4. Insufficient measurement of creativity: Most benchmarks assess factual knowledge and logic, but not creativity or innovative problem solving.

  5. Potential leaks from training data: Many benchmarks consist of publicly available datasets or exam questions that may already have been included in the models' training data. In that case a model is not demonstrating genuine problem-solving ability, but merely reproducing what it has already seen (a simple contamination check is sketched below).
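One simple way to probe for such leaks is an n-gram overlap check between benchmark questions and a sample of the training corpus, similar in spirit to the 13-gram decontamination check described for GPT-3. The sketch below is a toy version of that idea; the corpus, question list, and the choice of n = 8 are illustrative assumptions, not a full decontamination pipeline.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Word n-grams of a text, lower-cased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(question: str, training_docs: list[str], n: int = 8) -> bool:
    """Flag a benchmark question if any of its n-grams also occurs
    in a training document."""
    question_grams = ngrams(question, n)
    return any(question_grams & ngrams(doc, n) for doc in training_docs)

# Hypothetical usage:
# leaked = [q for q in benchmark_questions if is_contaminated(q, corpus_sample)]
```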

As a result, OpenAI scientist Noam Brown recently suggested paying less attention to benchmarks in general and more to the price-performance ratio, a much more practical yardstick.

Conclusion

Benchmarks are the backbone of AI evaluation, but they are not an absolute measure of intelligence. They provide a guide, not a conclusive verdict, and a combination of different tests is needed to truly understand AI. The way forward lies in more dynamic benchmarks that capture not only knowledge retrieval but also creative and adaptive thinking. Noam Brown's recent proposal focuses on cost and performance, which will probably become an increasingly important consideration: the further LLMs develop, the larger they become, and although the focus is shifting from pre-training to inference, cost remains a major factor in deployment as well.

The race for the best AI will continue, and with it the search for better benchmarks. The question remains: when will there be a test that a model not only passes but also understands?

Ready for more content from Kim Isenberg? Subscribe to FF Daily for free!

Kim Isenberg

Kim studied sociology and law at a university in Germany and has been impressed by technology in general for many years. Since the breakthrough of OpenAI's ChatGPT, Kim has been trying to scientifically examine the influence of artificial intelligence on our society.

