QuantSightBench - Prediction Interval Benchmark

Leaderboard

Results under the agentic setting at high reasoning effort on 1,000 questions. Sort by Coverage (closer to 90% target is better) or Mean Log IS (lower is better).

#	Model	Provider	Coverage ▲	Mean Log IS

Metrics Glossary

Metric	Description	Ideal
`Coverage`	Fraction of questions where the ground truth falls within the predicted interval. A well-calibrated model producing 90% intervals should achieve empirical coverage close to 90%. The leaderboard is sorted by coverage by default.	0.90 (= 90%)
`Mean Log IS (MLIS)`	Mean of the Winkler interval score applied to log-transformed values. The Winkler score is a proper scoring rule that jointly penalizes interval width and miscoverage — narrow intervals that miss the target are penalized proportionally to the miss distance. The log transform normalizes across scales, preventing large-magnitude questions from dominating the aggregate. A model with low MLIS produces intervals that are both sharp (narrow) and well-calibrated.	Lower is better

Data Details

QuantSightBench builds on the OpenForecast pipeline and filters to 1,000 quantitative forecasting questions whose resolution depends on uncertain future outcomes — excluding items answerable via straightforward calculation or factual recall.

1,000 Forecasting questions

~320K News articles in corpus

8 Domains covered

512 Tokens per chunk

Domain Distribution

The 1,000 questions span 8 domains. Labels were assigned by GPT-4.1-mini based on question title and background.

Domain	Count	%
Business, Finance & Technology	258	25.8
Politics & Geopolitics	185	18.5
Infrastructure & Transport	122	12.2
Culture & Entertainment	113	11.3
Crime & Legal	101	10.1
Sports	96	9.6
Science, Health & Environment	70	7.0
Education	55	5.5
Total	1,000	100.0

Leakage Prevention

Numerical forecasting benefits from a natural temporal separation between training data and evaluation targets. Background articles are drawn from January–August 2025, while evaluated questions have resolution dates between September 2025 and January 2026. All evaluated models have a documented knowledge cutoff prior to September 2025.

Retrieval Corpus

The ~320,000-article corpus spans diverse international sources including Forbes, CNN, Hindustan Times, Deutsche Welle, and Irish Times. Articles are de-duplicated, chunked into 512-token segments, and embedded with OpenAI's text-embedding-3-large. At evaluation time the most relevant chunks are retrieved and provided to the model as context.

Example Questions

Representative questions from the benchmark.

Question	Background	Answer
What will be the annual silver supply deficit (in million ounces) in 2025?	The silver market faces structural deficits when demand exceeds supply, impacting prices and industrial availability.	117.6
What is the annual climate finance commitment agreed upon by developed nations for developing countries by 2035?	Developed nations have pledged financial support to help developing countries transition to clean energy and adapt to climate change.	300
How many fatalities were confirmed in the Air India Flight AI 171 crash by October 2025?	Air India Flight AI 171 crashed shortly after takeoff from Ahmedabad, leading to a major aviation disaster investigation.	260

Key Findings

Results under the agentic setting reveal systematic patterns in how LLMs handle numerical uncertainty quantification.

Systematic Overconfidence

No model reaches the 90% target. Gemini 3.1 Pro leads at 79.1% coverage; DeepSeek v3.2 trails at 61.5%. The consistent shortfall indicates intervals that are too narrow for the stated confidence level.

Coverage Degrades with Scale

Coverage exceeds 80% for values in the 1–10 range but falls below 65% once magnitudes reach 100K+. Mean Log IS spikes at the high end, reaching 23.7 for Gemini 3 Pro — scale awareness is a core bottleneck.

Iteration Signals Difficulty

Aggregate coverage decreases monotonically with retrieval iterations, from ~86% at one iteration to ~65% at five. Rather than hurting forecasts, more iterations flag harder questions the agent cannot confidently resolve.

Reasoning Effort Helps Unevenly

Higher effort improves both coverage and sharpness overall. Opus 4.5 gains the most (65.4% → 72.6% coverage; MLIS 7.45 → 6.74), while already-performant GPT-5.1 sees little benefit.

Prompt Setting Matters

Agentic retrieval beats background-context, which beats zero-shot. Gains are largest for open-weight models like GLM-4.7 and DeepSeek v3.2, narrowing the gap with frontier systems.

Confidence Specification is Essential

Removing the 90% confidence instruction drops GPT-5.4 from 75.3% → 68.2% coverage (MLIS 7.32 → 11.44) and Opus 4.6 from 73.6% → 67.0%. Explicit targets are required for calibrated intervals.

Coverage vs. Mean Log IS

Each point is a model under the agentic setting. The ideal model sits in the top-left (high coverage, low score). GPT-5.1 and Gemini 3.1 Pro come closest.

Coverage by Model

Empirical coverage of 90% prediction intervals. The dashed line marks the 90% target. No model reaches it.

Deep Dive Analysis

Detailed analysis across reasoning effort levels, prompt settings, and calibration behavior.

Reasoning Effort Effect

Coverage and mean log IS across low, medium, and high reasoning effort under the background-context setting. Opus 4.5 benefits the most (coverage 65.4% → 72.6%), Sonnet 4.5 improves more modestly, and GPT-5.1 shows minimal gains — consistent with the paper's finding that extended reasoning helps when models are not yet saturated. Gemini 3 Pro was only tested at low and high.

Coverage (%)

Mean Log IS (lower = better)

Prompt Settings Comparison

Coverage and mean log IS across three prompt configurations: zero-shot (no context), background-context (with relevant grounding), and agentic (retrieval over a fixed news corpus). Background context generally improves over zero-shot; agentic retrieval improves further, with the most pronounced gains for open-weight models such as GLM-4.7 and DeepSeek v3.2. Frontier models show smaller relative improvements from agentic retrieval.

Coverage (%)

Mean Log IS (lower = better)

Calibration Across Confidence Levels

Target vs. actual coverage at 80%, 90%, and 95% under the background-context setting. The dashed diagonal is perfect calibration. All three models fall below it, and the gap widens at higher targets — models widen intervals in response to higher confidence requests but not enough to reach the target. MLIS improves for GPT-5.1 and Opus 4.5 as the widening offsets miscoverage penalties; Gemini 3 Pro shows the opposite, suggesting less efficient widening.

Target vs. Actual Coverage

Mean Log IS by Target Coverage

Effect of Confidence Level Specification

Removing the 90% confidence instruction from the agentic prompt forces models to choose interval width implicitly. Both coverage and sharpness degrade substantially, indicating that explicit confidence targets serve as an important conditioning signal.

Confidence Instruction	GPT-5.4 Coverage	GPT-5.4 MLIS	Opus 4.6 Coverage	Opus 4.6 MLIS
None (inherent)	68.24%	11.44	66.99%	10.71
90% specified	75.33%	7.32	73.60%	7.00

Methodology

1

Dataset

1,000 resolved numerical forecasting questions built on the OpenForecast pipeline, spanning 8 domains (business & finance, politics, infrastructure, culture, crime, sports, science & health, education). Background articles are drawn from Jan–Aug 2025; resolution dates fall between Sep 2025 and Jan 2026 to prevent leakage.

2

Task

Each model receives a forecasting question and must produce a prediction interval at a specified confidence level — a lower and upper bound containing the true value with stated probability. Default evaluations use the 90% level.

3

Scoring

We use the Winkler Interval Score (Gneiting & Raftery, 2007), a proper scoring rule that jointly penalizes interval width and miscoverage. Because numerical quantities span many scales, we apply it to log-transformed values, yielding the Mean Log Interval Score (MLIS).

4

Configurations

Models are evaluated across three prompt settings (zero-shot, background-context, agentic retrieval over a 320K news-article corpus), three reasoning-effort levels (low/medium/high), and three target confidence levels (80%, 90%, 95%). The main leaderboard reports the agentic setting at high effort.

About

QuantSightBench is a benchmark for evaluating the numerical forecasting capabilities of large language models. Unlike benchmarks that evaluate forecasting through binary or multiple-choice outcomes, QuantSightBench assesses whether models can produce well-calibrated prediction intervals for uncertain quantitative outcomes — a format that is both closer to how people naturally reason about uncertainty and a demanding test of scale awareness and calibration.

The benchmark uses real-world numerical forecasting questions with known resolutions, drawn from the OpenForecast pipeline. Background articles are restricted to Jan–Aug 2025 and resolution dates fall between Sep 2025 and Jan 2026 to prevent data leakage.

Team

Jeremy Qin^1,2,3 and Maksym Andriushchenko^1,2,3

¹ELLIS Institute Tübingen ²Max Planck Institute for Intelligent Systems ³Tübingen AI Center

GitHub Repository Paper