Higher BQS = more trustworthy. A benchmark can be popular yet a poor signal (low discriminability or rank-inconsistent). Sort and judge before you trust a number.
| Benchmark | Domain | Consistency | Discrim. | Alignment | Quality (BQS) |
|---|---|---|---|---|---|
| AIME 2024 | Mathematics | | | | |
| OmniMath | Mathematics | | | | |
| OlympiadBench | Mathematics | | | | |
| ARC | General | | | | |
| BBH | General | | | | |
| IFEval | Knowledge | | | | |
| DROP | General | | | | |
| EQ-Bench | Knowledge | | | | |
| AMC 22-24 | Mathematics | | | | |
| MATH-500 | Mathematics | | | | |
| IFBench | Knowledge | | | | |
| SuperGPQA | Knowledge | | | | |
| CommonsenseQA | General | | | | |
| MMLU-Pro | Knowledge | | | | |
| SIQA | General | | | | |
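
This section does not define how the Consistency and Discrim. columns are computed. As a rough illustration only, the sketch below treats rank consistency as the Kendall tau between a benchmark's model ranking and a reference ranking, and discriminability as the spread (population standard deviation) of scores. Both definitions, the function names, and the model names in the usage note are assumptions for illustration, not the BQS method.

```python
from itertools import combinations
from statistics import pstdev


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two {model: score} dicts over their shared models."""
    models = sorted(set(scores_a) & set(scores_b))
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        # Positive product: both sources order this pair of models the same way.
        sign = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / pairs if pairs else 0.0


def discriminability(scores):
    """Spread of scores; a value near zero means the benchmark barely separates models."""
    return pstdev(scores.values())
```

Under these assumptions, a benchmark whose scores order `model-a > model-b > model-c` exactly as the reference does gets a tau of 1.0, and a benchmark on which every model scores the same gets a discriminability of 0.0.
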

RECOMMENDED ARCHITECTURE for an always-current SMB model-picker (as of 2026-06-23):

(1) PRIMARY automated feed: Artificial Analysis Data API (https://artificialanalysis.ai/api/v2, x-api-key header). Poll /data/llms/models and /language/models/free on a daily cron. This gives you the headline Intelligence Index, live blended price, and output speed/latency (rolling 72h) for ~356 models, which is the core 'capability vs cost vs speed' table SMB buyers need. Upgrade to Pro/Commercial for sub-benchmark breakdowns and redistribution rights.

(2) FRONTIER CAPABILITY feed (legally redistributable): Epoch AI. `pip install epochai` or fetch CSVs from epoch.ai/benchmarks/use-this-data on a daily/weekly cron. The CC-BY license means you can surface FrontierMath, GPQA Diamond, AIME, SWE-bench Verified and the Epoch Capabilities Index directly in your product with attribution.

(3) HUMAN-PREFERENCE feed: pull the lmarena-ai/leaderboard-dataset 'latest' split from Hugging Face via the `datasets` library on a weekly schedule (no official API exists).

(4) CODING feeds: scrape swebench.com (or the Steel.dev mirror) for SWE-bench Verified/Pro, pull the raw Aider polyglot YAML from the Aider-AI/aider GitHub repo, and pull LiveBench from its GitHub/HF datasets.

(5) AGENTIC/TOOL-USE: BFCL leaderboard (gorilla.cs.berkeley.edu) and Tau-bench/Terminal-Bench results (also surfaced in Artificial Analysis's agentic index).

(6) OPEN-WEIGHT freshness: Hugging Face Trending API plus community leaderboards. The Open LLM Leaderboard v1/v2 is RETIRED/archived as of 2026; do not wire it as a live feed.

PRACTICAL STACK: One daily cron hitting Artificial Analysis (API, authed) plus Epoch (CSV/Python, CC-BY) covers ~90% of buyer-facing needs with proper APIs and clean licensing. Layer weekly HF-dataset pulls (LMArena) and targeted scrapes (SWE-bench, Aider, LiveBench, BFCL) for the rest.
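
The polling cadences described above can be written down as a small feed registry that a daily cron job consults. This is a minimal sketch: the `Feed` class, the `due_feeds` helper, and the feed names are hypothetical. The text fixes the cadence for Artificial Analysis (daily) and LMArena (weekly); it allows daily or weekly for Epoch, and the weekly cadence for the scrape feeds is an assumption.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Feed:
    name: str
    method: str      # "api", "csv", "hf-dataset", "raw-file", or "scrape"
    every_days: int  # minimum number of days between pulls


# Cadences follow the architecture above; scrape cadences are assumed weekly.
FEEDS = [
    Feed("artificial-analysis", "api", 1),
    Feed("epoch-ai", "csv", 1),
    Feed("lmarena", "hf-dataset", 7),
    Feed("swe-bench", "scrape", 7),
    Feed("aider-polyglot", "raw-file", 7),
    Feed("livebench", "hf-dataset", 7),
    Feed("bfcl", "scrape", 7),
]


def due_feeds(today, last_run):
    """Return names of feeds that were never pulled or whose interval has elapsed."""
    due = []
    for feed in FEEDS:
        previous = last_run.get(feed.name)
        if previous is None or today - previous >= timedelta(days=feed.every_days):
            due.append(feed.name)
    return due
```

A single daily job can call `due_feeds(date.today(), last_run)` and dispatch only the feeds that are due, so the weekly pulls ride on the same cron entry as the daily ones.
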

KEY CAVEATS to encode in the product: MMLU and AIME 2025 are saturated, so drop them from frontier comparisons. MMLU-Pro is near-saturated (top models cluster at ~83-90%). Prefer GPQA Diamond, FrontierMath, HLE, SWE-bench Pro, Tau-bench and RULER/MRCR for top-end discrimination. Show a composite headline score (AA Intelligence Index or Epoch ECI) with per-domain drill-down. Treat vendor-self-reported numbers and crowd Elo as lower-trust signals to cross-check against self-run benchmarks. When displaying context-window specs, note the RULER/MRCR finding that effective context is ~60-70% of the advertised window.
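
Two of these caveats reduce to simple display logic. The sketch below is illustrative only: the function names and the source-type labels are hypothetical, and the 60-70% band is the rule of thumb quoted above rather than a measured value for any particular model.

```python
# Source types the caveats above call lower-trust (labels are hypothetical).
LOWER_TRUST_SOURCES = {"vendor_self_reported", "crowd_elo"}


def effective_context(advertised_tokens, low_pct=60, high_pct=70):
    """Return the (low, high) effective-context estimate in tokens.

    Integer arithmetic avoids floating-point rounding surprises.
    """
    return (advertised_tokens * low_pct // 100,
            advertised_tokens * high_pct // 100)


def trust_label(source_type):
    """Flag scores that should be cross-checked against self-run benchmarks."""
    return "lower-trust" if source_type in LOWER_TRUST_SOURCES else "standard"
```

For a model advertised at 128,000 tokens, `effective_context(128_000)` gives `(76800, 89600)`, which a product page could show next to the advertised figure.
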
| Leaderboard | Update cadence | Measures | API | Cost | How to pull latest |
|---|---|---|---|---|---|
| Artificial Analysis | Continuous: speed/price metrics refresh from a rolling 72h window; new models and eval results added within days of release. | Composite 'Intelligence Index' (v4.x, built from ~10 independent evals spanning reasoning, coding, math, agentic, knowledge) plus real-world cost (blended $/M tokens), output speed (tokens/sec) and latency (TTFT). The single best one-stop view balancing capability vs price vs speed, which is exactly the intelligence/speed/cost tradeoff SMB buyers care about. | API | Free tier: ~100-1000 requests/day, public language models, headline indices and input/output token prices only. Pro tier: model-level detail, full blended pricing, percentiles. Commercial: provider data, time-series, raw measurements, redistribution rights (negotiated/paid). | Poll /data/llms/models or /language/models on a daily cron with x-api-key; diff against last snapshot to detect new models and score changes. This is the recommended programmatic backbone for a model-picker product. |
| LMArena | Continuously as votes accumulate; leaderboard snapshots published roughly weekly. The HF dataset has a 'latest' split refreshed on publish. | Human-preference Elo (Bradley-Terry) from crowdsourced blind pairwise battles: 'which answer do real users prefer.' Captures subjective quality / vibes that static benchmarks miss. Multiple arenas: Text, Vision, WebDev/Code, Search, plus a new Agent Arena (launched June 2026) measuring real agentic behavior (retries, steerability, downloads). | scrape | Free (data is open via HF dataset; voting/site is free). | Pull the lmarena-ai/leaderboard-dataset 'latest' split from Hugging Face on a schedule (e.g. via huggingface_hub or datasets in Python), or scrape the fboulnois CSV release. No auth needed. |
| LiveBench | Monthly question refresh; leaderboard updated as new models are run. | Contamination-limited objective benchmark across 6 categories: math, coding, reasoning, data analysis, instruction-following, language comprehension. Verifiable ground-truth answers (no LLM judge), good for trustworthy capability ranking. | scrape | Free / open-source (MIT-style); you pay only your own API inference costs if you run it. | Clone the repo or pull livebench HF datasets; or scrape the leaderboard. For latest scores without running it, scrape the site table monthly. |
| SWE-bench | Leaderboard updated continuously as labs/agents submit; new variants (Pro, Multimodal, Multilingual) added periodically. | Real-world agentic software engineering: resolve real GitHub issues by generating patches that pass the repo's hidden tests (Docker-executed). Verified = 500 human-validated solvable issues. SWE-bench Pro = harder, contamination-resistant variant. The gold-standard coding-agent benchmark. | scrape | Free / open-source. Running it incurs significant compute + API costs. | Scrape swebench.com or the Steel.dev/llm-stats mirrors; or pull the HF dataset and run the harness. For a product, scraping a mirror that already aggregates submissions is most practical. |
| Aider Polyglot | Updated as new models are benchmarked (community + maintainers); refreshed within days/weeks of major releases. | Practical code-editing skill: 225 hard Exercism exercises across C++, Go, Java, JavaScript, Python, Rust. Composite = correctness x adherence to the requested diff/edit format. Strong real-world signal for 'can this model reliably edit code in a tool.' | scrape | Free / open-source. | Pull the leaderboard data file from the aider GitHub repo (raw YAML) on a schedule; this is cleaner than scraping HTML. |
| HELM | Periodic batch releases per leaderboard version. NOTE: HELM entered maintenance mode on June 1, 2026, so expect slower/fewer new frontier-model additions going forward. | Holistic, multi-scenario academic evaluation. Sub-leaderboards: HELM Capabilities, HELM Safety, plus domain ones (MedHELM, etc.). Emphasizes transparency and reproducibility with full prompt-level logs. | scrape | Free / open. Running HELM yourself costs your own inference spend. | Use the crfm-helm package to fetch/parse published run results, or download the JSON result files referenced by the leaderboard pages. Treat as a periodic (not real-time) reference. |
| Epoch AI | Continuously updated as new models are evaluated; new benchmark tiers added over time. | Curated, rigorously-run results for the hardest frontier benchmarks: FrontierMath (Tiers 1-3 and Tier 4), GPQA Diamond, MATH Level 5, Mock AIME, SWE-bench Verified, plus an Epoch Capabilities Index (ECI). Best source for hard-math/science frontier discrimination. | API | Free: Creative Commons Attribution license (free to use, redistribute, reproduce with credit). This makes it uniquely friendly for embedding in a commercial product. | Use the epochai Python client or fetch the CSVs on a schedule. CC-BY licensing means you can legally surface this data in your product with attribution. |
| Open LLM Leaderboard (Hugging Face) | No longer updated (retired). Archived snapshots remain available. | Historically: standardized open-weight model ranking (v2 used IFEval, BBH, MATH-Hard, GPQA, MuSR, MMLU-Pro). NOTE: officially RETIRED (frozen/archived); no longer updated. | scrape | Free. | Do NOT rely on this for current data. For open-weight freshness, use HF 'trending models' (huggingface.co/models?sort=trending) and topic-specific community leaderboards instead. |
| | Updated on major model releases (roughly continuous/weekly). | Human-readable aggregator comparing GPT/Claude/Gemini/Llama/DeepSeek/Qwen/Kimi across reasoning, coding, math, multilingual, plus price and speed. Separate Open LLM and 'Best LLM for Coding' (SWE-bench, LiveCodeBench, Aider, BFCL) views. Deliberately uses non-saturated benchmarks. | scrape | Free to view. | Scrape the page periodically, or use it as a human-curated sanity check rather than an automated feed. |
| | Frequent (often daily/weekly) as they track new releases. | Aggregators that consolidate 300+ models across many benchmarks (MMLU-Pro, GPQA, SWE-bench, AIME, LiveBench, Aider, BFCL, long-context) plus price/speed/context. Useful for one-stop scraping and cross-checking. | scrape | Free to view. | Scrape targeted benchmark pages as a fallback when a primary source lacks an API. Treat vendor-reported numbers with caution (self-reported, not independently run). |
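
The "poll with x-api-key, then diff against the last snapshot" step from the table can be sketched as follows. The base URL, path, and header name come from the text above. Everything else is an assumption: the response schema is not described here, so the sketch takes snapshots that have already been flattened into `{model_name: score}` dicts, and the helper names are hypothetical.

```python
import urllib.request

BASE_URL = "https://artificialanalysis.ai/api/v2"


def build_request(path, api_key):
    """Build (but do not send) an authenticated GET request for one endpoint."""
    return urllib.request.Request(BASE_URL + path, headers={"x-api-key": api_key})


def diff_snapshots(previous, current, tolerance=0.0):
    """Compare two {model: score} snapshots.

    Returns (new_models, changed), where changed maps a model to its
    (old_score, new_score) pair when the difference exceeds tolerance.
    """
    new_models = sorted(set(current) - set(previous))
    changed = {
        model: (previous[model], current[model])
        for model in sorted(set(current) & set(previous))
        if abs(current[model] - previous[model]) > tolerance
    }
    return new_models, changed
```

In a daily job, the request would be sent with `urllib.request.urlopen`, the JSON body parsed and flattened according to whatever schema the API actually returns, and the result passed to `diff_snapshots` along with yesterday's stored snapshot.
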
| Benchmark | Domain | Measures | Saturation |
|---|---|---|---|
| SWE-bench Verified | coding (agentic) | Resolving real GitHub issues with patches that pass hidden tests (pass@1 resolved %). | Approaching saturation at the very top: frontier models reported ~88-95% mid-2026 (e.g. Claude Opus 4.x ~88%, some newer Claude variants 95%+), with contamination/test-design caveats. Still discriminating in the mid-range; SWE-bench Pro is the harder successor for headroom. |
| LiveCodeBench | coding | Contamination-free competitive-programming / code-generation across time-windowed problems. | Not saturated; time-windowing keeps it fresh. Good for ranking code generation when SWE-bench is too agentic/expensive. |
| Aider Polyglot | coding (editing) | Multi-language code editing with strict edit-format compliance (composite of correctness x format adherence). | Mid-high; top model ~0.88 (GPT-5 class) with broad spread below (~0.58 average). Still discriminates well, especially for tool-integration readiness. |
| AIME 2025 / 2026 | math | Advanced high-school olympiad math (competition problems), exact-answer scored. | SATURATED at the frontier: GPT-5-class models reported ~100% in 2026. No longer discriminating for top models; use only for mid-tier or as a floor check. |
| FrontierMath (Tiers 1-3 and Tier 4) | math (frontier) | Hundreds of original, expert-crafted research-level math problems across modern mathematics; Tier 4 is the hardest expansion set. | NOT saturated and the best math discriminator. Rapid 2026 gains (e.g. Claude Fable 5 ~87% Tiers 1-3, ~88% Tier 4) but still the frontier yardstick. Run/hosted by Epoch AI. |
| GPQA Diamond | reasoning / science | Google-proof graduate-level science MCQs (physics, chemistry, biology) requiring genuine reasoning. | Approaching saturation at the very top but still produces meaningful ~15-point spreads in the ~60-90% band; widely cited as the most trusted reasoning discriminator in 2026. |
| MMLU-Pro | knowledge | Harder 10-option multitask knowledge across 14 subjects (successor to MMLU). | NEAR-SATURATED: top models cluster ~83-90% (Gemini 3 Pro ~90%, Claude Opus 4.x ~89%) with little top-end discrimination. Useful as a knowledge floor, not a frontier separator. |
| Humanity's Last Exam (HLE) | reasoning (hardest) | Extremely hard expert-level multi-domain questions designed to resist saturation. | NOT saturated: designed as the hardest broad reasoning test; large headroom remains. Best single 'how smart at the limit' signal. |
| IFEval | instruction-following | Verifiable instruction-following (format/length/keyword constraints) with programmatic checking. | Largely saturated for frontier models (high-90s); still useful for catching smaller/cheaper models that miss constraints. IFEval-FC extends it to function-calling format adherence. |
| BFCL v4 (Berkeley Function Calling Leaderboard) | agentic / tool-use | Accuracy of function/tool calling (single, parallel, multi-turn, and, in v4, holistic agentic evaluation) via AST checking. | Not saturated for the harder multi-turn/agentic categories; strong signal for tool-use reliability. Key for agent/RAG product decisions. |
| Tau-bench / Tau2-bench | agentic (tool-use, multi-turn) | Realistic multi-turn agent tasks (retail/airline/telecom domains) requiring tool use under policies. | NOT saturated: hard, realistic agentic tasks with clear top-model spread. Excellent for ranking agent reliability. |
| Terminal-Bench 2.0 | agentic (computer/terminal use) | Completing real tasks in a terminal/computer environment end-to-end. | NOT saturated; meaningful spread. Good signal for autonomous computer-use agents. |
| RULER | long-context | NVIDIA synthetic suite: 13 tasks across 4 categories at 4K-128K tokens, testing retrieval + reasoning over context. | Not saturated for effective long context; reveals that effective capacity is typically only ~60-70% of advertised window. Key reality check on context-window marketing. |
| MRCR v2 (Multi-Round Coreference Resolution) | long-context | Multi-round coreference + entity tracking under long context (e.g. 64K, 8-needle). | Not saturated; strong discriminator for genuine long-context comprehension beyond simple needle retrieval. |
| NIAH-2 / Needle-in-a-Haystack (updated) | long-context | Retrieval of planted facts ('needles') across very long contexts. | Basic single-needle is largely solved/saturated; multi-needle and reasoning variants still discriminate. Use updated multi-needle versions only. |
| MMLU-Pro / GPQA as composite inputs | knowledge + reasoning | Frequently rolled into composite indices (Artificial Analysis Intelligence Index, Epoch Capabilities Index, HELM Capabilities). | Composites mitigate single-benchmark saturation by blending non-saturated evals, which is the right approach for a buyer-facing single score. |
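
The idea of a composite that blends only non-saturated evals, with a per-domain drill-down, can be sketched as below. This is not how the Artificial Analysis Intelligence Index or the Epoch Capabilities Index is computed. The equal weighting, the `composite` function, and the scores in the usage note are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def composite(results, saturated=frozenset()):
    """Blend benchmark scores into a headline number plus per-domain means.

    results: iterable of (benchmark, domain, score) with scores on a 0-100 scale.
    Benchmarks listed in `saturated` are skipped so they cannot mask differences.
    """
    by_domain = defaultdict(list)
    for benchmark, domain, score in results:
        if benchmark not in saturated:
            by_domain[domain].append(score)
    drill_down = {domain: mean(scores) for domain, scores in by_domain.items()}
    # Headline is the unweighted mean of domain means (an assumed weighting).
    headline = mean(list(drill_down.values())) if drill_down else None
    return headline, drill_down
```

With made-up scores of 80 and 40 on two reasoning benchmarks, 50 on FrontierMath, and 100 on a saturated AIME 2025 that is excluded, the drill-down is 60 for reasoning and 50 for math, and the headline is 55.
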
A field guide to what's out there, grouped by capability. Match the benchmark family to what you're actually shipping.