Benchmarks · LLM Switchboard

The meta-insight: Most teams trust benchmark scores blindly, but benchmarks vary widely in quality. The Benchmark² framework grades the benchmarks themselves on three axes: ranking consistency with peer benchmarks (Cross-Benchmark Ranking Consistency, CBRC), ability to separate strong from weak models (Discriminability), and capability alignment (do stronger models actually win?). Across 15 popular benchmarks it finds that many are weak signals, and that a curated 35% subset preserves evaluation fidelity. Source: Benchmark²: Systematic Evaluation of LLM Benchmarks (arXiv 2601.03986).

Benchmark Quality Scores from Benchmark²

Higher BQS = more trustworthy. A benchmark can be popular yet a poor signal (low discriminability or rank-inconsistent). Sort and judge before you trust a number.

Benchmark	Domain	Consistency	Discrim.	Alignment	Quality (BQS)
AIME 2024	Mathematics	52	74	85	0.79
OmniMath	Mathematics	76	79	61	0.75
OlympiadBench	Mathematics	75	76	61	0.73
ARC	General	79	11	87	0.65
BBH	General	75	25	66	0.60
IFEval	Knowledge	75	23	63	0.58
DROP	General	71	20	61	0.56
EQ-Bench	Knowledge	75	27	53	0.56
AMC 22-24	Mathematics	70	36	46	0.55
MATH-500	Mathematics	70	16	62	0.55
IFBench	Knowledge	71	31	51	0.55
SuperGPQA	Knowledge	79	34	43	0.55
CommonsenseQA	General	75	17	57	0.54
MMLU-Pro	Knowledge	65	40	36	0.51
SIQA	General	73	17	23	0.40
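
The "sort and judge" advice above can be sketched in a few lines of Python. This is a minimal illustration, not part of the Benchmark² framework: the rows are copied from the table, but the `BenchmarkScore` class, the `weak_signals` helper, and both cut-off values are hypothetical and should be tuned to your own tolerance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkScore:
    name: str
    domain: str
    consistency: int
    discriminability: int
    alignment: int
    bqs: float


# A handful of rows copied from the table above (not the full set).
ROWS = [
    BenchmarkScore("AIME 2024", "Mathematics", 52, 74, 85, 0.79),
    BenchmarkScore("OmniMath", "Mathematics", 76, 79, 61, 0.75),
    BenchmarkScore("ARC", "General", 79, 11, 87, 0.65),
    BenchmarkScore("MMLU-Pro", "Knowledge", 65, 40, 36, 0.51),
    BenchmarkScore("SIQA", "General", 73, 17, 23, 0.40),
]


def weak_signals(rows, min_discriminability=30, min_bqs=0.55):
    """Names of benchmarks failing either threshold, lowest BQS first.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    flagged = [
        r for r in rows
        if r.discriminability < min_discriminability or r.bqs < min_bqs
    ]
    return [r.name for r in sorted(flagged, key=lambda r: r.bqs)]


print(weak_signals(ROWS))  # ['SIQA', 'MMLU-Pro', 'ARC']
```

Note that ARC is flagged despite a respectable BQS: its discriminability of 11 means it barely separates models, which is the "popular yet poor signal" case described above.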

How to get the latest results

RECOMMENDED ARCHITECTURE for an always-current SMB model-picker (as of 2026-06-23):
(1) PRIMARY automated feed: Artificial Analysis Data API (https://artificialanalysis.ai/api/v2, x-api-key header). Poll /data/llms/models and /language/models/free on a daily cron; this gives you the headline Intelligence Index + live blended price + output speed/latency (rolling 72h) for ~356 models, which is the core 'capability vs cost vs speed' table SMB buyers need. Upgrade to Pro/Commercial for sub-benchmark breakdowns and redistribution rights.
(2) FRONTIER CAPABILITY feed (legally redistributable): Epoch AI. Run `pip install epochai` or fetch CSVs from epoch.ai/benchmarks/use-this-data on a daily/weekly cron. The CC-BY license means you can surface FrontierMath, GPQA Diamond, AIME, SWE-bench Verified and the Epoch Capabilities Index directly in your product with attribution.
(3) HUMAN-PREFERENCE feed: pull the lmarena-ai/leaderboard-dataset 'latest' split from Hugging Face via the `datasets` library on a weekly schedule (no official API exists).
(4) CODING feeds: scrape swebench.com (or the Steel.dev mirror) for SWE-bench Verified/Pro, pull the Aider polyglot YAML from the Aider-AI/aider GitHub repo raw, and pull LiveBench from its GitHub/HF datasets.
(5) AGENTIC/TOOL-USE: BFCL leaderboard (gorilla.cs.berkeley.edu) and Tau-bench/Terminal-Bench results (also surfaced in Artificial Analysis's agentic index).
(6) OPEN-WEIGHT freshness: Hugging Face Trending API + community leaderboards. The Open LLM Leaderboard v1/v2 is RETIRED/archived as of 2026; do not wire it as a live feed.
PRACTICAL STACK: One daily cron hitting Artificial Analysis (API, authed) + Epoch (CSV/Python, CC-BY) covers ~90% of buyer-facing needs with proper APIs and clean licensing; layer weekly HF-dataset pulls (LMArena) and targeted scrapes (SWE-bench, Aider, LiveBench, BFCL) for the rest.
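
As a rough sketch of the primary feed, the snippet below builds an authenticated request against the base URL and header named above, using only the Python standard library. The helper names `build_request` and `fetch_json` are hypothetical, and the response schema is not described in this section, so the code returns the decoded JSON untouched rather than assuming any field names.

```python
import json
import urllib.request

# Base URL and header name are taken from the architecture notes above.
BASE_URL = "https://artificialanalysis.ai/api/v2"


def build_request(path: str, api_key: str) -> urllib.request.Request:
    """Build an authenticated GET request for one endpoint path."""
    return urllib.request.Request(
        BASE_URL + path,
        headers={"x-api-key": api_key, "Accept": "application/json"},
        method="GET",
    )


def fetch_json(path: str, api_key: str, timeout: float = 30.0):
    """Fetch and decode one endpoint.

    The payload shape is an unknown here; inspect it before relying
    on any particular keys.
    """
    request = build_request(path, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A daily cron job could call `fetch_json("/data/llms/models", key)` with a key read from the environment and write the result to a dated file, which keeps each day's snapshot available for later comparison.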
KEY CAVEATS to encode in the product: MMLU and AIME 2025 are saturated, so drop them from frontier comparisons. MMLU-Pro is near-saturated (top models cluster at ~83-90%). Prefer GPQA Diamond, FrontierMath, HLE, SWE-bench Pro, Tau-bench and RULER/MRCR for top-end discrimination. Show a composite headline score (AA Intelligence Index or Epoch ECI) with per-domain drill-down. Treat vendor-self-reported numbers and crowd Elo as lower-trust signals to cross-check against self-run benchmarks. When displaying context-window specs, note the RULER/MRCR finding that effective context is ~60-70% of the advertised window.
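
The context-window caveat is simple arithmetic, and a product could show it next to the advertised figure. The sketch below applies the ~60-70% rule of thumb quoted above; the function name and the integer-percentage parameters are illustrative choices, and the rule itself is a rough heuristic rather than a per-model measurement.

```python
def effective_context_range(advertised_tokens: int,
                            low_pct: int = 60,
                            high_pct: int = 70) -> tuple[int, int]:
    """Estimate the usable context window as a (low, high) token range.

    Uses integer percentages to avoid floating-point rounding surprises.
    """
    if advertised_tokens <= 0:
        raise ValueError("advertised_tokens must be positive")
    if not 0 <= low_pct <= high_pct <= 100:
        raise ValueError("percentages must satisfy 0 <= low <= high <= 100")
    return (advertised_tokens * low_pct // 100,
            advertised_tokens * high_pct // 100)


print(effective_context_range(128_000))  # (76800, 89600)
```

Under this heuristic, a model advertised at 128K tokens would be displayed with an effective range of roughly 77K to 90K tokens.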

Artificial Analysis Data API (Free tier) · Free tier (rate-limited); paid Pro/Commercial for depth + redistribution.
Epoch AI Benchmarking data (CSV + epochai Python client) · Free (Creative Commons Attribution).
LMArena leaderboard dataset on Hugging Face · Free.
Aider Polyglot leaderboard data file (GitHub raw) · Free / open-source.
LiveBench (GitHub + HF datasets) · Free / open-source (you pay your own inference if self-running).
SWE-bench official + Steel.dev / llm-stats mirrors · Free.
Hugging Face Trending models + community leaderboards · Free.
BFCL (Berkeley Function Calling Leaderboard) + Gorilla repo · Free / open-source.

Live leaderboards & their APIs

Leaderboard	Updates	Measures	API	Cost	How to pull latest
Artificial Analysis	Continuous: speed/price metrics refresh from a rolling 72h window; new models and eval results added within days of release.	Composite 'Intelligence Index' (v4.x, built from ~10 independent evals spanning reasoning, coding, math, agentic, knowledge) plus real-world cost (blended $/M tokens), output speed (tokens/sec) and latency (TTFT). The single best one-stop view balancing capability vs price vs speed, which is exactly the intelligence/speed/cost tradeoff SMB buyers care about.	API	Free tier: ~100-1000 requests/day, public language models, headline indices and input/output token prices only. Pro tier: model-level detail, full blended pricing, percentiles. Commercial: provider data, time-series, raw measurements, redistribution rights (negotiated/paid).	Poll /data/llms/models or /language/models on a daily cron with x-api-key; diff against last snapshot to detect new models and score changes. This is the recommended programmatic backbone for a model-picker product.
LMArena (formerly Chatbot Arena / LMSYS)	Continuously as votes accumulate; leaderboard snapshots published roughly weekly. The HF dataset has a 'latest' split refreshed on publish.	Human-preference Elo (Bradley-Terry) from crowdsourced blind pairwise battles: 'which answer do real users prefer.' Captures subjective quality / vibes that static benchmarks miss. Multiple arenas: Text, Vision, WebDev/Code, Search, plus a new Agent Arena (launched June 2026) measuring real agentic behavior (retries, steerability, downloads).	scrape	Free (data is open via HF dataset; voting/site is free).	Pull the lmarena-ai/leaderboard-dataset 'latest' split from Hugging Face on a schedule (e.g. via huggingface_hub or datasets in Python), or scrape the fboulnois CSV release. No auth needed.
LiveBench	Monthly question refresh; leaderboard updated as new models are run.	Contamination-limited objective benchmark across 6 categories: math, coding, reasoning, data analysis, instruction-following, language comprehension. Verifiable ground-truth answers (no LLM judge), good for trustworthy capability ranking.	scrape	Free / open-source (MIT-style); you pay only your own API inference costs if you run it.	Clone the repo or pull livebench HF datasets; or scrape the leaderboard. For latest scores without running it, scrape the site table monthly.
SWE-bench / SWE-bench Verified (+ SWE-bench Pro)	Leaderboard updated continuously as labs/agents submit; new variants (Pro, Multimodal, Multilingual) added periodically.	Real-world agentic software engineering: resolve real GitHub issues by generating patches that pass the repo's hidden tests (Docker-executed). Verified = 500 human-validated solvable issues. SWE-bench Pro = harder, contamination-resistant variant. The gold-standard coding-agent benchmark.	scrape	Free / open-source. Running it incurs significant compute + API costs.	Scrape swebench.com or the Steel.dev/llm-stats mirrors; or pull the HF dataset and run the harness. For a product, scraping a mirror that already aggregates submissions is most practical.
Aider Polyglot Leaderboard	Updated as new models are benchmarked (community + maintainers); refreshed within days/weeks of major releases.	Practical code-editing skill: 225 hard Exercism exercises across C++, Go, Java, JavaScript, Python, Rust. Composite = correctness x adherence to the requested diff/edit format. Strong real-world signal for 'can this model reliably edit code in a tool.'	scrape	Free / open-source.	Pull the leaderboard data file from the aider GitHub repo (raw YAML) on a schedule; this is cleaner than scraping HTML.
Stanford HELM	Periodic batch releases per leaderboard version. NOTE: HELM entered maintenance mode on June 1, 2026; expect slower/fewer new frontier-model additions going forward.	Holistic, multi-scenario academic evaluation. Sub-leaderboards: HELM Capabilities, HELM Safety, plus domain ones (MedHELM, etc.). Emphasizes transparency and reproducibility with full prompt-level logs.	scrape	Free / open. Running HELM yourself costs your own inference spend.	Use the crfm-helm package to fetch/parse published run results, or download the JSON result files referenced by the leaderboard pages. Treat as a periodic (not real-time) reference.
Epoch AI Benchmarking Hub	Continuously updated as new models are evaluated; new benchmark tiers added over time.	Curated, rigorously-run results for the hardest frontier benchmarks: FrontierMath (Tiers 1-3 and Tier 4), GPQA Diamond, MATH Level 5, Mock AIME, SWE-bench Verified, plus an Epoch Capabilities Index (ECI). Best source for hard-math/science frontier discrimination.	API	Free: Creative Commons Attribution license (free to use, redistribute, reproduce with credit). This makes it uniquely friendly for embedding in a commercial product.	Use the epochai Python client or fetch the CSVs on a schedule. CC-BY licensing means you can legally surface this data in your product with attribution.
Hugging Face Open LLM Leaderboard (v2) + community leaderboards	No longer updated (retired). Archived snapshots remain available.	Historically: standardized open-weight model ranking (v2 used IFEval, BBH, MATH-Hard, GPQA, MuSR, MMLU-Pro). NOTE: officially RETIRED (frozen/archived; no longer updated).	scrape	Free.	Do NOT rely on this for current data. For open-weight freshness, use HF 'trending models' (huggingface.co/models?sort=trending) and topic-specific community leaderboards instead.
Vellum LLM Leaderboard	Updated on major model releases (roughly continuous/weekly).	Human-readable aggregator comparing GPT/Claude/Gemini/Llama/DeepSeek/Qwen/Kimi across reasoning, coding, math, multilingual, plus price and speed. Separate Open LLM and 'Best LLM for Coding' (SWE-bench, LiveCodeBench, Aider, BFCL) views. Deliberately uses non-saturated benchmarks.	scrape	Free to view.	Scrape the page periodically, or use it as a human-curated sanity check rather than an automated feed.
llm-stats.com / BenchLM / LM Council / DemandSphere (secondary aggregators)	Frequent (often daily/weekly) as they track new releases.	Aggregators that consolidate 300+ models across many benchmarks (MMLU-Pro, GPQA, SWE-bench, AIME, LiveBench, Aider, BFCL, long-context) plus price/speed/context. Useful for one-stop scraping and cross-checking.	scrape	Free to view.	Scrape targeted benchmark pages as a fallback when a primary source lacks an API. Treat vendor-reported numbers with caution (self-reported, not independently run).
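
The "diff against last snapshot" step recommended for the Artificial Analysis feed might look like the sketch below. It assumes each day's payload has already been flattened into a `{model_id: score}` dictionary; that flattening, the `diff_snapshots` name, and the model ids in the example are all hypothetical, since the actual response schema is not given here.

```python
def diff_snapshots(previous: dict, current: dict, tolerance: float = 1e-9):
    """Compare two {model_id: score} snapshots.

    Returns (new_models, removed_models, changed), where `changed`
    maps model_id -> (old_score, new_score) for scores that moved
    by more than `tolerance`.
    """
    new_models = sorted(current.keys() - previous.keys())
    removed_models = sorted(previous.keys() - current.keys())
    changed = {
        model: (previous[model], current[model])
        for model in sorted(previous.keys() & current.keys())
        if abs(current[model] - previous[model]) > tolerance
    }
    return new_models, removed_models, changed


# Made-up ids and scores, purely for illustration.
yesterday = {"model-a": 61.0, "model-b": 48.5, "model-c": 39.0}
today = {"model-a": 61.0, "model-b": 50.0, "model-d": 55.2}

print(diff_snapshots(yesterday, today))
# (['model-d'], ['model-c'], {'model-b': (48.5, 50.0)})
```

Reporting removed models separately is a deliberate choice: a model disappearing from a feed often signals a deprecation that buyers should hear about.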

Key benchmarks by domain

Benchmark	Domain	Measures	Saturation
SWE-bench Verified	coding (agentic)	Resolving real GitHub issues with patches that pass hidden tests (pass@1 resolved %).	Approaching saturation at the very top — frontier models reported ~88-95% mid-2026 (e.g. Claude Opus 4.x ~88%, some newer Claude variants 95%+), with contamination/test-design caveats. Still discriminating in the mid-range; SWE-bench Pro is the harder successor for headroom.
LiveCodeBench	coding	Contamination-free competitive-programming / code-generation across time-windowed problems.	Not saturated; time-windowing keeps it fresh. Good for ranking code generation when SWE-bench is too agentic/expensive.
Aider Polyglot	coding (editing)	Multi-language code editing with strict edit-format compliance (composite of correctness x format adherence).	Mid-high; top model ~0.88 (GPT-5 class) with broad spread below (~0.58 average). Still discriminates well, especially for tool-integration readiness.
AIME 2025 / 2026	math	Advanced high-school olympiad math (competition problems), exact-answer scored.	SATURATED at the frontier — GPT-5-class models reported ~100% in 2026. No longer discriminating for top models; use only for mid-tier or as a floor check.
FrontierMath (Tiers 1-3 and Tier 4)	math (frontier)	Hundreds of original, expert-crafted research-level math problems across modern mathematics; Tier 4 is the hardest expansion set.	NOT saturated and the best math discriminator. Rapid 2026 gains (e.g. Claude Fable 5 ~87% Tiers 1-3, ~88% Tier 4) but still the frontier yardstick. Run/hosted by Epoch AI.
GPQA Diamond	reasoning / science	Google-proof graduate-level science MCQs (physics, chem, bio) requiring genuine reasoning.	Approaching saturation at the very top but still produces meaningful ~15-point spreads in the ~60-90% band — widely cited as the most trusted reasoning discriminator in 2026.
MMLU-Pro	knowledge	Harder 10-option multitask knowledge across 14 subjects (successor to MMLU).	NEAR-SATURATED — top models cluster ~83-90% (Gemini 3 Pro ~90%, Claude Opus 4.x ~89%) with little top-end discrimination. Useful as a knowledge floor, not a frontier separator.
Humanity's Last Exam (HLE)	reasoning (hardest)	Extremely hard expert-level multi-domain questions designed to resist saturation.	NOT saturated — designed as the hardest broad reasoning test; large headroom remains. Best single 'how smart at the limit' signal.
IFEval	instruction-following	Verifiable instruction-following (format/length/keyword constraints) with programmatic checking.	Largely saturated for frontier models (high-90s); still useful for catching smaller/cheaper models that miss constraints. IFEval-FC extends it to function-calling format adherence.
BFCL v4 (Berkeley Function Calling Leaderboard)	agentic / tool-use	Accuracy of function/tool calling — single, parallel, multi-turn, and (v4) holistic agentic evaluation via AST checking.	Not saturated for the harder multi-turn/agentic categories; strong signal for tool-use reliability. Key for agent/RAG product decisions.
Tau-bench / Tau2-bench	agentic (tool-use, multi-turn)	Realistic multi-turn agent tasks (retail/airline/telecom domains) requiring tool use under policies.	NOT saturated — hard, realistic agentic tasks with clear top-model spread. Excellent for ranking agent reliability.
Terminal-Bench 2.0	agentic (computer/terminal use)	Completing real tasks in a terminal/computer environment end-to-end.	NOT saturated; meaningful spread. Good signal for autonomous computer-use agents.
RULER	long-context	NVIDIA synthetic suite: 13 tasks across 4 categories (retrieval, multi-hop tracing, aggregation, question answering), evaluated at 4K-128K tokens to test retrieval + reasoning over context.	Not saturated for effective long context; it reveals that effective capacity is typically only ~60-70% of the advertised window. Key reality check on context-window marketing.
MRCR v2 (Multi-Round Coreference Resolution)	long-context	Multi-round coreference + entity tracking under long context (e.g. 64K, 8-needle).	Not saturated; strong discriminator for genuine long-context comprehension beyond simple needle retrieval.
NIAH-2 / Needle-in-a-Haystack (updated)	long-context	Retrieval of planted facts ('needles') across very long contexts.	Basic single-needle is largely solved/saturated; multi-needle and reasoning variants still discriminate. Use updated multi-needle versions only.
MMLU-Pro / GPQA as composite inputs	knowledge + reasoning	Frequently rolled into composite indices (Artificial Analysis Intelligence Index, Epoch Capabilities Index, HELM Capabilities).	Composites mitigate single-benchmark saturation by blending non-saturated evals — the right approach for a buyer-facing single score.
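
To make the composite idea concrete, here is a deliberately simple sketch: an equal-weight mean over benchmarks that are not flagged as saturated. This is not how the Artificial Analysis Intelligence Index or the Epoch Capabilities Index is computed; the equal weighting, the `composite_score` name, and the example scores are all assumptions for illustration. The saturated set follows the caveats stated earlier (MMLU and AIME 2025).

```python
# Benchmarks this page treats as saturated at the frontier.
SATURATED = frozenset({"MMLU", "AIME 2025"})


def composite_score(scores: dict, saturated=SATURATED):
    """Equal-weight mean of 0-100 scores over non-saturated benchmarks.

    Returns None when no usable benchmark remains, so the caller can
    show "insufficient data" instead of a misleading number.
    """
    usable = [s for name, s in scores.items() if name not in saturated]
    if not usable:
        return None
    return sum(usable) / len(usable)


# Hypothetical scores for one model.
example = {
    "GPQA Diamond": 80.0,
    "HLE": 30.0,
    "Tau-bench": 70.0,
    "AIME 2025": 100.0,  # excluded: saturated
}
print(composite_score(example))  # 60.0
```

Dropping the saturated score matters here: including the 100.0 would lift the mean to 70.0 without reflecting any real difference between frontier models.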

The benchmark map (leobeeson/llm_benchmarks)

A field guide to what's out there, grouped by capability. Match the benchmark family to what you're actually shipping.

Knowledge & Language Understanding

MMLU

General knowledge across 57 subjects (STEM → social science)

ARC

Grade-school science questions needing logical deduction

GLUE / SuperGLUE

Broad language-understanding task suites (SuperGLUE = harder)

Natural Questions

Real Google queries answered from Wikipedia

Reasoning Capabilities

GSM8K

8.5K grade-school math problems, multi-step solving

BIG-Bench Hard

Hardest BIG-Bench tasks requiring multi-step reasoning

AGIEval

Human standardized tests (GRE, GMAT, SAT, LSAT)

RACE

Exam reading-comprehension questions

Multi-Turn Conversations

MT-Bench

Multi-turn dialogue quality for chat assistants

QuAC

100K question-answer pairs in dialogue context

Grounding & Summarization

Grounding / abstractive summarization

Faithful condensation without hallucination

Content Moderation & Safety

TruthfulQA

Resistance to common false beliefs & biases

ToxiGen

Implicit hate-speech detection on minority-targeted text

HHH

Helpful / honest / harmless alignment

Coding Capabilities

HumanEval

Function-level code generation accuracy (pass@k)

CodeXGLUE

Multi-task code understanding & generation
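
For readers unfamiliar with pass@k, the sketch below shows the unbiased estimator commonly used with HumanEval-style benchmarks: generate n samples per problem, count the c that pass the tests, and estimate the chance that at least one of k randomly drawn samples passes as 1 - C(n-c, k) / C(n, k). The source does not say which variant a given leaderboard reports, so treat this as the standard formulation rather than a description of any specific harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed, k: draw budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failures exist, so any k draws include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 samples, 3 correct: pass@1 is 0.3, pass@5 is about 0.917.
print(round(pass_at_k(10, 3, 1), 3))
print(round(pass_at_k(10, 3, 5), 3))
```

The benchmark-level score is then the mean of this value over all problems.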

LLM-Assisted Evaluation

LLM-as-judge (GPT-4 class)

Using strong LLMs to score outputs vs human preference