What Is Calibration in Forecasting? A Data-Driven Explainer
Last Updated: March 4, 2026
A forecaster is calibrated when their predicted probabilities match observed frequencies. If a forecaster says “70% likely” across one hundred different predictions, roughly seventy of those events should actually occur. Calibration is the most rigorous standard for evaluating whether probability estimates are meaningful, and it separates useful forecasts from confident guesses.
How Is Calibration Measured?
Calibration is assessed by grouping predictions into probability buckets and comparing the average predicted probability in each bucket against the actual frequency of outcomes. A forecaster who assigns 90% probability to fifty events is well-calibrated if approximately forty-five of those events occur.
The standard visualization is a calibration curve (also called a reliability diagram). The x-axis shows predicted probability, the y-axis shows observed frequency of outcomes, and a perfectly calibrated forecaster falls along the diagonal:
| Predicted Probability | Ideal Outcome Rate | Overconfident Example | Underconfident Example |
|---|---|---|---|
| 10% | 10% | 5% (too few happen) | 18% (too many happen) |
| 30% | 30% | 15% | 42% |
| 50% | 50% | 35% | 60% |
| 70% | 70% | 55% | 78% |
| 90% | 90% | 75% | 95% |
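The bucketing procedure behind a table like this can be sketched in a few lines of Python (the function name and default bucket width are illustrative, not any platform's published methodology):

```python
from collections import defaultdict

def calibration_table(predictions, outcomes, bucket_width=0.1):
    """Group (probability, outcome) pairs into fixed-width buckets and
    report mean predicted probability vs. observed outcome rate."""
    n_buckets = round(1 / bucket_width)
    buckets = defaultdict(list)
    for p, o in zip(predictions, outcomes):
        # Clamp p = 1.0 into the top bucket.
        b = min(int(p / bucket_width), n_buckets - 1)
        buckets[b].append((p, o))
    rows = []
    for b in sorted(buckets):
        pairs = buckets[b]
        mean_pred = sum(p for p, _ in pairs) / len(pairs)
        obs_rate = sum(o for _, o in pairs) / len(pairs)
        rows.append((mean_pred, obs_rate, len(pairs)))
    return rows  # one (predicted, observed, count) row per non-empty bucket
```

A well-calibrated forecaster's rows sit close to the diagonal (mean predicted probability roughly equal to observed rate); systematic gaps in one direction indicate over- or underconfidence.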
Overconfidence means extreme probabilities are assigned too frequently — the forecaster says 90% but the event only happens 75% of the time. Underconfidence means probabilities cluster toward the center — the forecaster says 60% when the true rate is 80%. Most human forecasters exhibit overconfidence, assigning extreme probabilities more often than their accuracy justifies.
Calibration requires volume. A single prediction cannot be evaluated for calibration. You need hundreds or thousands of forecasts across different probability levels to produce a meaningful calibration curve. This is one reason why platform-level data is more informative than any individual forecaster’s track record.
What Is the Brier Score and Why Does It Matter?
The Brier score is the most widely used metric for evaluating probabilistic forecasts. The formula is straightforward:
Brier Score = (1/N) × Σᵢ (forecastᵢ − outcomeᵢ)²

where forecastᵢ is the predicted probability (0 to 1) and outcomeᵢ is 1 if the event occurred and 0 if it did not. The sum runs over all N predictions.
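In Python, the formula is only a few lines (the sample forecasts below are made up for illustration):

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between predicted probabilities (0..1)
    and binary outcomes (1 = happened, 0 = did not)."""
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must align")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

# Hedging everything at 50% yields the random baseline of 0.25:
print(brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))  # 0.25
```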
Key properties of the Brier score:
- Range: 0 to 1. Lower is better.
- Perfect score: 0 (every prediction was 100% for events that happened and 0% for events that did not).
- Worst score: 1 (every prediction was 100% confident in the wrong outcome).
- Random baseline: 0.25 (assigning 50% to everything). Any Brier score below 0.25 beats coin-flipping.
- Climatological baseline: Using the historical base rate for every prediction. Beating this baseline means the forecaster adds value beyond simple frequency knowledge.
The Brier score can be decomposed into three components: calibration (do probabilities match frequencies?), resolution (do predictions vary meaningfully, or does the forecaster just say 50% for everything?), and uncertainty (how inherently unpredictable are the events?). A forecaster with good calibration and good resolution — meaning they assign diverse, accurate probabilities — will have a low Brier score.
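This three-part split is the Murphy decomposition, and it can be computed directly. The sketch below groups observations by unique forecast value, which makes the identity Brier = reliability − resolution + uncertainty hold exactly (here "reliability" is the calibration term, where lower is better):

```python
from collections import defaultdict

def murphy_decomposition(forecasts, outcomes):
    """Split the Brier score into reliability (calibration error, lower is
    better), resolution (higher is better), and uncertainty (fixed by the
    events), satisfying: Brier = reliability - resolution + uncertainty."""
    n = len(forecasts)
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)  # group outcomes by unique forecast value
    reliability = sum(
        len(os) * (f - sum(os) / len(os)) ** 2 for f, os in groups.items()
    ) / n
    resolution = sum(
        len(os) * (sum(os) / len(os) - base_rate) ** 2 for os in groups.values()
    ) / n
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty
```

A forecaster who always predicts the base rate incurs zero reliability penalty but also zero resolution, so their Brier score equals the uncertainty term; a strong forecaster combines near-zero reliability with high resolution.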
For a broader look at how forecasting accuracy plays out across platforms, see our prediction market accuracy analysis.
How Calibrated Are Prediction Markets?
Prediction markets aggregate the beliefs of many participants into a single price, and the resulting calibration is generally strong — particularly on liquid markets.
Academic research on the Iowa Electronic Markets (operating since 1988) found that market prices closely tracked actual election outcomes, outperforming major polls in the majority of head-to-head comparisons. More recent data from Polymarket and Kalshi confirms the pattern: high-volume markets produce well-calibrated prices.
Our dataset shows a clear relationship between liquidity and calibration quality. Markets with daily volume exceeding $10,000 exhibit significantly tighter calibration curves than thin markets. A contract trading at $0.70 on a market with $500,000 in volume reflects genuine information aggregation. The same price on a market with $200 in total volume may be driven by one or two participants and carries far less predictive weight.
The Odds Reference dashboard tracks prices across multiple platforms. When the same event trades on Polymarket, Kalshi, and community platforms like Metaculus, cross-platform price convergence is itself a calibration signal — agreement across independent participant pools strengthens confidence in the implied probability.
Community forecasting platforms offer an interesting calibration comparison. Metaculus, which uses no real money, has demonstrated calibration comparable to financial prediction markets on overlapping question sets. This challenges the assumption that financial incentives are strictly necessary for good calibration, though the topic remains actively debated in the research literature.
What Are Superforecasters and Why Do They Matter?
The term “superforecaster” comes from Philip Tetlock’s research, particularly the Good Judgment Project, a multi-year study funded by IARPA (Intelligence Advanced Research Projects Activity). The project identified individuals who consistently outperformed both chance and professional intelligence analysts on geopolitical forecasting questions.
Superforecasters share several measurable traits:
- Granularity. They assign precise probabilities (67% rather than “likely”) and adjust them frequently as new information arrives.
- Intellectual humility. They treat their own estimates as hypotheses to be updated, not positions to be defended.
- Base-rate awareness. They start with the historical frequency of similar events and adjust from there, rather than reasoning from narratives alone.
- Active updating. They monitor evidence continuously and revise forecasts in small increments.
Tetlock’s research demonstrated that structured probabilistic thinking — the kind embodied by calibration-focused forecasting — produces measurably better results than traditional expert judgment, which tends toward overconfidence and narrative-driven reasoning.
Prediction markets incorporate superforecaster-like behavior structurally: participants with better information or better models make more money, and their trades move prices proportionally to their capital and conviction. The market price, in theory, reflects the beliefs of the best-informed participants weighted by their willingness to back those beliefs with money.
How Can You Improve Your Own Calibration?
Calibration is a skill that improves with deliberate practice. Several evidence-based approaches work:
Track your predictions. Record every probability estimate you make and the eventual outcome. Over time, group your predictions into buckets and check whether your 70% predictions happen about 70% of the time. Without tracking, self-assessment is unreliable.
Use reference classes. Before estimating the probability of a specific event, ask: “How often do events like this happen historically?” Starting from the base rate and adjusting based on specific evidence consistently produces better-calibrated estimates than starting from intuition.
Seek disconfirming evidence. Actively look for reasons your estimate might be wrong. Overconfidence typically stems from anchoring on confirming evidence and underweighting contradictory signals.
Practice on calibration trainers. Several online tools present trivia questions and ask you to assign confidence levels, then score your calibration in real time. These build the habit of mapping internal confidence to accurate probability estimates.
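A minimal way to put the tracking advice into practice is a small log that records each estimate and, once enough predictions have resolved, reports each confidence bucket's hit rate. The class name and report format here are invented for illustration:

```python
from collections import defaultdict

class PredictionLog:
    """Record probability estimates and resolved outcomes, then check
    whether each confidence bucket's hit rate matches its label."""

    def __init__(self):
        self.records = []  # (probability, outcome) pairs

    def record(self, probability, happened):
        self.records.append((probability, 1 if happened else 0))

    def report(self, bucket_width=0.1):
        buckets = defaultdict(list)
        for p, o in self.records:
            b = min(int(p / bucket_width), round(1 / bucket_width) - 1)
            buckets[b].append((p, o))
        lines = []
        for b in sorted(buckets):
            pairs = buckets[b]
            said = sum(p for p, _ in pairs) / len(pairs)
            rate = sum(o for _, o in pairs) / len(pairs)
            lines.append(f"said {said:.0%}, happened {rate:.0%} (n={len(pairs)})")
        return lines
```

The point of the report is the gap, not the absolute numbers: if your "said 70%" bucket keeps landing at 55%, you are overconfident at that level and should shade those estimates down.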
The prediction market glossary defines key terms used throughout calibration and forecasting analysis.
Key Takeaways
- Calibration means your probability estimates match reality: 70% predictions should come true approximately 70% of the time
- The Brier score (mean squared error of forecasts) is the standard metric — scores below 0.25 beat random guessing, and decomposition reveals calibration versus resolution
- High-liquidity prediction markets demonstrate strong calibration; thin markets are noticeably less reliable
- Superforecasters, identified through Tetlock’s research, outperform experts by using precise probabilities, base rates, and frequent updating
- Track your own predictions systematically to build calibration as a measurable, improvable skill