Benchmarks

Rank the models — and the benchmarks themselves.

The meta-insight: Most teams trust benchmark scores blindly, but benchmarks vary wildly in quality. The Benchmark² framework grades the benchmarks themselves on three axes: ranking consistency with peers (CBRC), ability to separate strong from weak models (Discriminability), and capability alignment (do stronger models actually win?). Across 15 popular benchmarks it finds many are weak signals, and that a curated 35% subset preserves evaluation fidelity.
Source: Benchmark²: Systematic Evaluation of LLM Benchmarks (arXiv 2601.03986)
Benchmark Quality Scores from Benchmark²

Higher BQS = more trustworthy. A benchmark can be popular yet a poor signal (low discriminability or rank-inconsistent). Sort and judge before you trust a number.

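| Benchmark | Domain | Consistency | Discrim. | Alignment | Quality (BQS) |
|---|---|---|---|---|---|
| AIME 2024 | Mathematics | 52 | 74 | 85 | 0.79 |
| OmniMath | Mathematics | 76 | 79 | 61 | 0.75 |
| OlympiadBench | Mathematics | 75 | 76 | 61 | 0.73 |
| ARC | General | 79 | 11 | 87 | 0.65 |
| BBH | General | 75 | 25 | 66 | 0.60 |
| IFEval | Knowledge | 75 | 23 | 63 | 0.58 |
| DROP | General | 71 | 20 | 61 | 0.56 |
| EQ-Bench | Knowledge | 75 | 27 | 53 | 0.56 |
| AMC 22-24 | Mathematics | 70 | 36 | 46 | 0.55 |
| MATH-500 | Mathematics | 70 | 16 | 62 | 0.55 |
| IFBench | Knowledge | 71 | 31 | 51 | 0.55 |
| SuperGPQA | Knowledge | 79 | 34 | 43 | 0.55 |
| CommonsenseQA | General | 75 | 17 | 57 | 0.54 |
| MMLU-Pro | Knowledge | 65 | 40 | 36 | 0.51 |
| SIQA | General | 73 | 17 | 23 | 0.40 |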
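The "sort and judge" advice can be sketched in a few lines. This is a minimal illustration using the rows listed above, read as three two-digit axis scores followed by the BQS. The cutoff of 25 for "low discriminability" is an assumption chosen for the example, not a threshold from Benchmark².

```python
from typing import List, NamedTuple


class BenchmarkQuality(NamedTuple):
    name: str
    domain: str
    consistency: int
    discriminability: int
    alignment: int
    bqs: float


# Rows copied from the Benchmark Quality Scores listing above.
ROWS = [
    BenchmarkQuality("AIME 2024", "Mathematics", 52, 74, 85, 0.79),
    BenchmarkQuality("OmniMath", "Mathematics", 76, 79, 61, 0.75),
    BenchmarkQuality("OlympiadBench", "Mathematics", 75, 76, 61, 0.73),
    BenchmarkQuality("ARC", "General", 79, 11, 87, 0.65),
    BenchmarkQuality("BBH", "General", 75, 25, 66, 0.60),
    BenchmarkQuality("IFEval", "Knowledge", 75, 23, 63, 0.58),
    BenchmarkQuality("DROP", "General", 71, 20, 61, 0.56),
    BenchmarkQuality("EQ-Bench", "Knowledge", 75, 27, 53, 0.56),
    BenchmarkQuality("AMC 22-24", "Mathematics", 70, 36, 46, 0.55),
    BenchmarkQuality("MATH-500", "Mathematics", 70, 16, 62, 0.55),
    BenchmarkQuality("IFBench", "Knowledge", 71, 31, 51, 0.55),
    BenchmarkQuality("SuperGPQA", "Knowledge", 79, 34, 43, 0.55),
    BenchmarkQuality("CommonsenseQA", "General", 75, 17, 57, 0.54),
    BenchmarkQuality("MMLU-Pro", "Knowledge", 65, 40, 36, 0.51),
    BenchmarkQuality("SIQA", "General", 73, 17, 23, 0.40),
]


def rank_by_bqs(rows: List[BenchmarkQuality]) -> List[BenchmarkQuality]:
    """Return rows ordered from most to least trustworthy (highest BQS first)."""
    return sorted(rows, key=lambda r: r.bqs, reverse=True)


def weak_signals(rows: List[BenchmarkQuality], min_discriminability: int = 25) -> List[str]:
    """Names of benchmarks whose discriminability falls below the cutoff.

    The default cutoff is an illustrative assumption, not a published value.
    """
    return [r.name for r in rows if r.discriminability < min_discriminability]


if __name__ == "__main__":
    for row in rank_by_bqs(ROWS)[:3]:
        print(f"{row.name}: BQS {row.bqs:.2f}")
    print("Low-discriminability benchmarks:", ", ".join(weak_signals(ROWS)))
```

Note how ARC ranks fourth on BQS yet has the lowest discriminability in the list, which is the "popular yet a poor signal" case the text warns about.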
How to get the latest results

RECOMMENDED ARCHITECTURE for an always-current SMB model-picker (as of 2026-06-23):

(1) PRIMARY automated feed: Artificial Analysis Data API (https://artificialanalysis.ai/api/v2, x-api-key header). Poll /data/llms/models and /language/models/free on a daily cron; this gives you the headline Intelligence Index + live blended price + output speed/latency (rolling 72h) for ~356 models, which is the core 'capability vs cost vs speed' table SMB buyers need. Upgrade to Pro/Commercial for sub-benchmark breakdowns and redistribution rights.
(2) FRONTIER CAPABILITY feed (legally redistributable): Epoch AI. `pip install epochai` or fetch CSVs from epoch.ai/benchmarks/use-this-data on a daily/weekly cron. CC-BY license means you can surface FrontierMath, GPQA Diamond, AIME, SWE-bench Verified and the Epoch Capabilities Index directly in your product with attribution.
(3) HUMAN-PREFERENCE feed: pull lmarena-ai/leaderboard-dataset 'latest' split from Hugging Face via the `datasets` library on a weekly schedule (no official API exists).
(4) CODING feeds: scrape swebench.com (or the Steel.dev mirror) for SWE-bench Verified/Pro, pull the Aider polyglot YAML from the Aider-AI/aider GitHub repo raw, and pull LiveBench from its GitHub/HF datasets.
(5) AGENTIC/TOOL-USE: BFCL leaderboard (gorilla.cs.berkeley.edu) and Tau-bench/Terminal-Bench results (also surfaced in Artificial Analysis's agentic index).
(6) OPEN-WEIGHT freshness: Hugging Face Trending API + community leaderboards (the Open LLM Leaderboard v1/v2 is RETIRED/archived as of 2026; do not wire it as a live feed).

PRACTICAL STACK: One daily cron hitting Artificial Analysis (API, authed) + Epoch (CSV/Python, CC-BY) covers ~90% of buyer-facing needs with proper APIs and clean licensing; layer weekly HF-dataset pulls (LMArena) and targeted scrapes (SWE-bench, Aider, LiveBench, BFCL) for the rest.
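The daily/weekly layering described above could be wired up roughly as follows. This is a sketch under stated assumptions: the base URL, the `/data/llms/models` path and the `x-api-key` header come from the text, while the feed names, the `FEEDS` layout, the weekly cadence for the scrapes and both helper functions are hypothetical. The request is only constructed, never sent, and the response schema is not assumed.

```python
import urllib.request
from datetime import date
from typing import Dict, List

AA_BASE_URL = "https://artificialanalysis.ai/api/v2"  # from the text above

# Hypothetical feed registry. Cadences follow the text where it states one;
# "weekly" for the targeted scrapes is an assumption.
FEEDS: Dict[str, str] = {
    "artificial_analysis": "daily",
    "epoch_ai": "daily",
    "lmarena_hf_dataset": "weekly",
    "coding_and_tool_use_scrapes": "weekly",
}


def due_feeds(today: date, weekly_weekday: int = 0) -> List[str]:
    """Feeds to refresh today: daily ones always, weekly ones on one weekday.

    weekly_weekday uses date.weekday() numbering (0 = Monday).
    """
    return [
        name
        for name, cadence in FEEDS.items()
        if cadence == "daily" or today.weekday() == weekly_weekday
    ]


def build_models_request(api_key: str, path: str = "/data/llms/models") -> urllib.request.Request:
    """Build (but do not send) an authenticated GET request for the models feed."""
    return urllib.request.Request(
        AA_BASE_URL + path,
        headers={"x-api-key": api_key},
        method="GET",
    )


if __name__ == "__main__":
    request = build_models_request("YOUR_API_KEY")
    print(request.get_method(), request.full_url)
    print("Due today:", due_feeds(date.today()))
```

A cron job would call `due_feeds` once per day and dispatch each returned name to its own fetcher; keeping the registry as data makes it easy to retire a feed, as happened with the Open LLM Leaderboard.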
KEY CAVEATS to encode in the product: MMLU and AIME 2025 are saturated (drop from frontier comparisons); MMLU-Pro near-saturated (~83-90% cluster); prefer GPQA Diamond, FrontierMath, HLE, SWE-bench Pro, Tau-bench and RULER/MRCR for top-end discrimination; show a composite headline score (AA Intelligence Index or Epoch ECI) with per-domain drill-down; treat vendor-self-reported numbers and crowd Elo as lower-trust signals to cross-check against self-run benchmarks; and note the 'effective context is ~60-70% of advertised window' reality from RULER/MRCR when displaying context-window specs.
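The context-window caveat is simple arithmetic and can be shown next to the advertised figure. A minimal sketch, assuming the ~60-70% band quoted above applies uniformly; the function name and the integer-percent parameters are illustrative choices, and real effective capacity varies by model and task.

```python
from typing import Tuple


def effective_context_range(
    advertised_tokens: int, low_pct: int = 60, high_pct: int = 70
) -> Tuple[int, int]:
    """Estimate the usable context band from an advertised window size.

    Uses integer percentages to avoid floating-point rounding surprises.
    """
    if advertised_tokens < 0:
        raise ValueError("advertised_tokens must be non-negative")
    return (
        advertised_tokens * low_pct // 100,
        advertised_tokens * high_pct // 100,
    )


if __name__ == "__main__":
    low, high = effective_context_range(128_000)
    print(f"Advertised 128,000 tokens; plan for roughly {low:,} to {high:,}")
```

For a 128,000-token window this gives a planning band of 76,800 to 89,600 tokens.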

Live leaderboards & their APIs
| Leaderboard (update cadence) | Measures | API | Cost | How to pull latest |
|---|---|---|---|---|
| Continuous: speed/price metrics refresh from a rolling 72h window; new models and eval results added within days of release. | Composite 'Intelligence Index' (v4.x, built from ~10 independent evals spanning reasoning, coding, math, agentic, knowledge) plus real-world cost (blended $/M tokens), output speed (tokens/sec) and latency (TTFT). The single best one-stop view balancing capability vs price vs speed, exactly the intelligence/speed/cost tradeoff SMB buyers care about. | API | Free tier: ~100-1000 requests/day, public language models, headline indices and input/output token prices only. Pro tier: model-level detail, full blended pricing, percentiles. Commercial: provider data, time-series, raw measurements, redistribution rights (negotiated/paid). | Poll /data/llms/models or /language/models on a daily cron with x-api-key; diff against last snapshot to detect new models and score changes. This is the recommended programmatic backbone for a model-picker product. |
| Continuously as votes accumulate; leaderboard snapshots published roughly weekly. The HF dataset has a 'latest' split refreshed on publish. | Human-preference Elo (Bradley-Terry) from crowdsourced blind pairwise battles: 'which answer do real users prefer.' Captures subjective quality / vibes that static benchmarks miss. Multiple arenas: Text, Vision, WebDev/Code, Search, plus a new Agent Arena (launched June 2026) measuring real agentic behavior (retries, steerability, downloads). | scrape | Free (data is open via HF dataset; voting/site is free). | Pull the lmarena-ai/leaderboard-dataset 'latest' split from Hugging Face on a schedule (e.g. via huggingface_hub or datasets in Python), or scrape the fboulnois CSV release. No auth needed. |
| Monthly question refresh; leaderboard updated as new models are run. | Contamination-limited objective benchmark across 6 categories: math, coding, reasoning, data analysis, instruction-following, language comprehension. Verifiable ground-truth answers (no LLM judge), good for trustworthy capability ranking. | scrape | Free / open-source (MIT-style); you pay only your own API inference costs if you run it. | Clone the repo or pull livebench HF datasets; or scrape the leaderboard. For latest scores without running it, scrape the site table monthly. |
| Leaderboard updated continuously as labs/agents submit; new variants (Pro, Multimodal, Multilingual) added periodically. | Real-world agentic software engineering: resolve real GitHub issues by generating patches that pass the repo's hidden tests (Docker-executed). Verified = 500 human-validated solvable issues. SWE-bench Pro = harder, contamination-resistant variant. The gold-standard coding-agent benchmark. | scrape | Free / open-source. Running it incurs significant compute + API costs. | Scrape swebench.com or the Steel.dev/llm-stats mirrors; or pull the HF dataset and run the harness. For a product, scraping a mirror that already aggregates submissions is most practical. |
| Updated as new models are benchmarked (community + maintainers); refreshed within days/weeks of major releases. | Practical code-editing skill: 225 hard Exercism exercises across C++, Go, Java, JavaScript, Python, Rust. Composite = correctness x adherence to the requested diff/edit format. Strong real-world signal for 'can this model reliably edit code in a tool.' | scrape | Free / open-source. | Pull the leaderboard data file from the aider GitHub repo (raw YAML) on a schedule; cleaner than scraping HTML. |
| Periodic batch releases per leaderboard version. NOTE: HELM entered maintenance mode on June 1, 2026, so expect slower/fewer new frontier-model additions going forward. | Holistic, multi-scenario academic evaluation. Sub-leaderboards: HELM Capabilities, HELM Safety, plus domain ones (MedHELM, etc.). Emphasizes transparency and reproducibility with full prompt-level logs. | scrape | Free / open. Running HELM yourself costs your own inference spend. | Use the crfm-helm package to fetch/parse published run results, or download the JSON result files referenced by the leaderboard pages. Treat as a periodic (not real-time) reference. |
| Continuously updated as new models are evaluated; new benchmark tiers added over time. | Curated, rigorously-run results for the hardest frontier benchmarks: FrontierMath (Tiers 1-3 and Tier 4), GPQA Diamond, MATH Level 5, Mock AIME, SWE-bench Verified, plus an Epoch Capabilities Index (ECI). Best source for hard-math/science frontier discrimination. | API | Free: Creative Commons Attribution license (free to use, redistribute, reproduce with credit). This makes it uniquely friendly for embedding in a commercial product. | Use the epochai Python client or fetch the CSVs on a schedule. CC-BY licensing means you can legally surface this data in your product with attribution. |
| No longer updated (retired). Archived snapshots remain available. | Historically: standardized open-weight model ranking (v2 used IFEval, BBH, MATH-Hard, GPQA, MuSR, MMLU-Pro). NOTE: officially RETIRED (frozen/archived); no longer updated. | scrape | Free. | Do NOT rely on this for current data. For open-weight freshness, use HF 'trending models' (huggingface.co/models?sort=trending) and topic-specific community leaderboards instead. |
| Updated on major model releases (roughly continuous/weekly). | Human-readable aggregator comparing GPT/Claude/Gemini/Llama/DeepSeek/Qwen/Kimi across reasoning, coding, math, multilingual, plus price and speed. Separate Open LLM and 'Best LLM for Coding' (SWE-bench, LiveCodeBench, Aider, BFCL) views. Deliberately uses non-saturated benchmarks. | scrape | Free to view. | Scrape the page periodically, or use it as a human-curated sanity check rather than an automated feed. |
| Frequent (often daily/weekly) as they track new releases. | Aggregators that consolidate 300+ models across many benchmarks (MMLU-Pro, GPQA, SWE-bench, AIME, LiveBench, Aider, BFCL, long-context) plus price/speed/context. Useful for one-stop scraping and cross-checking. | scrape | Free to view. | Scrape targeted benchmark pages as a fallback when a primary source lacks an API. Treat vendor-reported numbers with caution (self-reported, not independently run). |
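Several of the pull strategies above come down to "fetch, then diff against the last snapshot". The sketch below shows that diff step in isolation. It assumes each snapshot has already been reduced to a mapping from model name to a single headline score; the model names are hypothetical, and no particular API response schema is implied.

```python
from typing import Dict, List, Tuple, TypedDict


class SnapshotDiff(TypedDict):
    new: List[str]
    removed: List[str]
    changed: Dict[str, Tuple[float, float]]


def diff_snapshots(
    previous: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 1e-9,
) -> SnapshotDiff:
    """Compare two {model: score} snapshots.

    Returns models that appeared, models that disappeared, and models whose
    score moved by more than `tolerance`, as (old, new) pairs.
    """
    shared = set(previous) & set(current)
    return {
        "new": sorted(set(current) - set(previous)),
        "removed": sorted(set(previous) - set(current)),
        "changed": {
            model: (previous[model], current[model])
            for model in sorted(shared)
            if abs(current[model] - previous[model]) > tolerance
        },
    }


if __name__ == "__main__":
    # Hypothetical model names and scores, for illustration only.
    yesterday = {"model-a": 50.0, "model-b": 60.0}
    today = {"model-a": 52.5, "model-c": 70.0}
    print(diff_snapshots(yesterday, today))
```

Persisting each day's snapshot (for example as a dated JSON file) and running this diff is enough to drive "new model" and "score changed" notifications without re-rendering the whole catalog.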
Key benchmarks by domain
| Benchmark | Domain | Measures | Saturation |
|---|---|---|---|
| SWE-bench Verified | coding (agentic) | Resolving real GitHub issues with patches that pass hidden tests (pass@1 resolved %). | Approaching saturation at the very top: frontier models reported ~88-95% mid-2026 (e.g. Claude Opus 4.x ~88%, some newer Claude variants 95%+), with contamination/test-design caveats. Still discriminating in the mid-range; SWE-bench Pro is the harder successor for headroom. |
| LiveCodeBench | coding | Contamination-free competitive-programming / code-generation across time-windowed problems. | Not saturated; time-windowing keeps it fresh. Good for ranking code generation when SWE-bench is too agentic/expensive. |
| Aider Polyglot | coding (editing) | Multi-language code editing with strict edit-format compliance (composite of correctness x format adherence). | Mid-high; top model ~0.88 (GPT-5 class) with broad spread below (~0.58 average). Still discriminates well, especially for tool-integration readiness. |
| AIME 2025 / 2026 | math | Advanced high-school olympiad math (competition problems), exact-answer scored. | SATURATED at the frontier: GPT-5-class models reported ~100% in 2026. No longer discriminating for top models; use only for mid-tier or as a floor check. |
| FrontierMath (Tiers 1-3 and Tier 4) | math (frontier) | Hundreds of original, expert-crafted research-level math problems across modern mathematics; Tier 4 is the hardest expansion set. | NOT saturated and the best math discriminator. Rapid 2026 gains (e.g. Claude Fable 5 ~87% Tiers 1-3, ~88% Tier 4) but still the frontier yardstick. Run/hosted by Epoch AI. |
| GPQA Diamond | reasoning / science | Google-proof graduate-level science MCQs (physics, chem, bio) requiring genuine reasoning. | Approaching saturation at the very top but still produces meaningful ~15-point spreads in the ~60-90% band; widely cited as the most trusted reasoning discriminator in 2026. |
| MMLU-Pro | knowledge | Harder 10-option multitask knowledge across 14 subjects (successor to MMLU). | NEAR-SATURATED: top models cluster ~83-90% (Gemini 3 Pro ~90%, Claude Opus 4.x ~89%) with little top-end discrimination. Useful as a knowledge floor, not a frontier separator. |
| Humanity's Last Exam (HLE) | reasoning (hardest) | Extremely hard expert-level multi-domain questions designed to resist saturation. | NOT saturated: designed as the hardest broad reasoning test; large headroom remains. Best single 'how smart at the limit' signal. |
| IFEval | instruction-following | Verifiable instruction-following (format/length/keyword constraints) with programmatic checking. | Largely saturated for frontier models (high-90s); still useful for catching smaller/cheaper models that miss constraints. IFEval-FC extends it to function-calling format adherence. |
| BFCL v4 (Berkeley Function Calling Leaderboard) | agentic / tool-use | Accuracy of function/tool calling: single, parallel, multi-turn, and (v4) holistic agentic evaluation via AST checking. | Not saturated for the harder multi-turn/agentic categories; strong signal for tool-use reliability. Key for agent/RAG product decisions. |
| Tau-bench / Tau2-bench | agentic (tool-use, multi-turn) | Realistic multi-turn agent tasks (retail/airline/telecom domains) requiring tool use under policies. | NOT saturated: hard, realistic agentic tasks with clear top-model spread. Excellent for ranking agent reliability. |
| Terminal-Bench 2.0 | agentic (computer/terminal use) | Completing real tasks in a terminal/computer environment end-to-end. | NOT saturated; meaningful spread. Good signal for autonomous computer-use agents. |
| RULER | long-context | NVIDIA synthetic suite: 13 tasks across 4 categories at 4K-128K tokens testing retrieval + reasoning over context. | Not saturated for effective long context; reveals that effective capacity is typically only ~60-70% of advertised window. Key reality check on context-window marketing. |
| MRCR v2 (Multi-Round Coreference/Context Resolution) | long-context | Multi-round coreference + entity tracking under long context (e.g. 64K, 8-needle). | Not saturated; strong discriminator for genuine long-context comprehension beyond simple needle retrieval. |
| NIAH-2 / Needle-in-a-Haystack (updated) | long-context | Retrieval of planted facts ('needles') across very long contexts. | Basic single-needle is largely solved/saturated; multi-needle and reasoning variants still discriminate. Use updated multi-needle versions only. |
| MMLU-Pro / GPQA as composite inputs | knowledge + reasoning | Frequently rolled into composite indices (Artificial Analysis Intelligence Index, Epoch Capabilities Index, HELM Capabilities). | Composites mitigate single-benchmark saturation by blending non-saturated evals; the right approach for a buyer-facing single score. |
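The idea of a composite that blends only non-saturated evals can be illustrated with a weighted mean that skips benchmarks flagged as saturated. This is a hypothetical sketch: it is not the methodology of the Artificial Analysis Intelligence Index, the Epoch Capabilities Index or HELM, and the scores in the usage example are invented for illustration.

```python
from typing import Dict, Optional, Set


def composite_score(
    scores: Dict[str, float],
    saturated: Set[str],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted mean of benchmark scores, ignoring saturated benchmarks.

    Benchmarks without an explicit weight default to 1.0.
    """
    usable = {name: s for name, s in scores.items() if name not in saturated}
    if not usable:
        raise ValueError("no non-saturated benchmarks to combine")
    weights = weights or {}
    total_weight = sum(weights.get(name, 1.0) for name in usable)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(s * weights.get(name, 1.0) for name, s in usable.items())
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Invented scores on a 0-100 scale, for illustration only.
    model_scores = {"GPQA Diamond": 80.0, "HLE": 40.0, "AIME 2025": 100.0}
    flagged = {"AIME 2025", "MMLU"}  # marked saturated in the listing above
    print(composite_score(model_scores, flagged))
```

Dropping the saturated entry keeps a near-perfect AIME score from compressing the differences that the harder benchmarks still reveal.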

A field guide to what's out there, grouped by capability. Match the benchmark family to what you're actually shipping.

Knowledge & Language Understanding (11)
MMLU: General knowledge across 57 subjects (STEM → social science)
ARC: Grade-school science questions needing logical deduction
GLUE / SuperGLUE: Broad language-understanding task suites (SuperGLUE = harder)
Natural Questions: Real Google queries answered from Wikipedia

Reasoning Capabilities (7)
GSM8K: 8.5K grade-school math problems, multi-step solving
BIG-Bench Hard: Hardest BIG-Bench tasks requiring multi-step reasoning
AGIEval: Human standardized tests (GRE, GMAT, SAT, LSAT)
RACE: Exam reading-comprehension questions

Multi-Turn Conversations (2)
MT-Bench: Multi-turn dialogue quality for chat assistants
QuAC: 100K question-answer pairs in dialogue context

Grounding & Summarization (4)
Grounding / abstractive summarization: Faithful condensation without hallucination

Content Moderation & Safety (4)
TruthfulQA: Resistance to common false beliefs & biases
ToxiGen: Implicit hate-speech detection on minority-targeted text
HHH: Helpful / honest / harmless alignment

Coding Capabilities (3)
HumanEval: Function-level code generation accuracy (pass@k)
CodeXGLUE: Multi-task code understanding & generation

LLM-Assisted Evaluation (4)
LLM-as-judge (GPT-4 class): Using strong LLMs to score outputs vs human preference
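The pass@k metric listed under HumanEval is usually computed with the unbiased estimator introduced alongside that benchmark: generate n samples per problem, count the c that pass the tests, and estimate the chance that at least one of k randomly drawn samples passes as 1 - C(n-c, k) / C(n, k). A minimal per-problem sketch follows; averaging over problems and the sampling setup are left out.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total samples generated, c: samples that passed, k: draw size.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 10 samples, 3 of which pass: compare pass@1 with pass@5.
    print(round(pass_at_k(10, 3, 1), 4))
    print(round(pass_at_k(10, 3, 5), 4))
```

With k = 1 the estimate reduces to the plain pass rate c / n, which is a quick sanity check when comparing numbers reported at different k.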