We Built a Weather Forecasting Model to Trade Kalshi. Here’s What Happened.

Last Updated: March 26, 2026

The thesis was simple: NWS (National Weather Service) forecasts are free, they’re accurate, and Kalshi’s weather contracts are priced by a crowd that might not be doing the math correctly. We built a model to convert NWS MOS ensemble forecasts into bracket probabilities, tested three strategies across 1,506 weather events, and discovered that the crowd was doing the math just fine.

Key Findings:

  • The model hit the correct temperature bracket 33% of the time (2x random) and had the winner in its top two picks 61.6% of the time
  • None of the three strategies tested (tail selling, conditional filtering, top-2 straddle) produced reliable profits after fees
  • Kalshi weather prices absorb NWS forecast updates within 10-30 minutes, leaving a 1-3 cent window that round-trip fees consume
  • Weather markets behave the way efficient market theory predicts when public expert forecasts exist and events resolve within a day

What Was the Opportunity?

Kalshi offers daily weather contracts on high temperatures in major US cities. A typical event has six temperature brackets — for example, “NYC High Temperature: Under 50F / 50-55F / 55-60F / 60-65F / 65-70F / Over 70F.” Each bracket trades as a binary contract paying $1 if the actual high falls within that range, $0 otherwise.

The data advantage thesis seemed compelling. NWS MOS (Model Output Statistics) forecasts are publicly available, updated multiple times daily, and have well-documented accuracy characteristics. Day-0 forecasts for NYC high temperature show a mean absolute error of 2.12 degrees Fahrenheit and a bias of -0.24 degrees F (the forecast runs slightly cold). Our dataset included 3,429 day-0 GFS forecast observations for NYC alone.

Across all forecast horizons, the accuracy picture was clear:

Forecast     Bias      MAE      RMSE     Samples
Day-0 GFS    -0.24 F   2.12 F   2.85 F   3,429
Day-1 GFS    -0.64 F   2.42 F   3.27 F   6,854
Day-2 GFS    -0.66 F   2.65 F   3.57 F   6,850
Day-0 NAM    -1.14 F   2.49 F   3.44 F   729

The NAM model runs colder than GFS (bias of -1.14 F vs -0.24 F), which we factored into our ensemble. The key observation: a forecast error of 2 degrees F means the correct bracket is almost always within one bracket of the forecast center. If brackets are 5 degrees wide, the model should identify the correct bracket roughly a third of the time — double the 16.7% random baseline.

We collected 1,506 NYC high temperature events with matched NWS forecasts and Kalshi contract outcomes. This became our testing ground.

How Large Was the Full Dataset?

The 1,506 events were the core of our analysis, but the underlying dataset was substantially richer. We assembled 48,978 daily weather observations from NOAA’s historical records, covering temperature, precipitation, and wind data across all target cities. Each observation was matched against 46,248 CLI reports (the daily climatological reports issued by local NWS forecast offices), achieving a 99.9% match rate against the GHCN (Global Historical Climatology Network) reference dataset.

CLI reports are the ground-truth temperature records published by local NWS forecast offices after each calendar day. They contain verified high and low temperatures, precipitation totals, and departure-from-normal statistics. The 99.9% GHCN match rate means our verified actuals agreed with the independent GHCN database in all but 46 of 46,248 cases — those 46 discrepancies were rounding differences of 1 degree F or less. This match rate matters because any error in the “actual” temperature would contaminate every model accuracy and strategy calculation downstream.

On the market side, we tracked 40,032 settled Kalshi weather contracts across all events and brackets. Of those, 37,674 contracts had nonzero volume — meaning at least one trade occurred. The remaining 2,358 contracts (5.9%) were listed but never traded, almost all of them deep tail brackets with implied probabilities below 3%. These zero-volume contracts tell their own story: the market doesn’t bother pricing outcomes that both sides agree are near-impossible.

The combination of 48,978 weather observations, 46,248 verified CLI reports, and 37,674 traded contracts gave us one of the most comprehensive datasets assembled for prediction market weather analysis.

How Did We Build the Model?

The model converts a point forecast plus an uncertainty estimate into a probability distribution across Kalshi’s temperature brackets. We used a four-step process that combines multiple NWS forecast models into a single calibrated probability grid.

Step 1: Forecast Ensemble. We combined GFS and NAM forecasts where available, weighting by inverse RMSE. The ensemble forecast was the weighted average of available model outputs for each event.

The weighting formula is straightforward. For two models with RMSE values R_GFS and R_NAM, the weight for GFS is:

weight_GFS = (1 / R_GFS) / ((1 / R_GFS) + (1 / R_NAM))

With day-0 RMSE of 2.85 F for GFS and 3.44 F for NAM, this produces:

weight_GFS = (1 / 2.85) / ((1 / 2.85) + (1 / 3.44)) = 0.3509 / (0.3509 + 0.2907) = 0.547

weight_NAM = 0.2907 / 0.6416 = 0.453

GFS gets 54.7% of the weight, NAM gets 45.3%. This reflects GFS’s lower forecast error — the model that has been more accurate historically gets more influence on the ensemble prediction. For events where only GFS was available (the majority, given 3,429 GFS observations versus 729 NAM), the ensemble reduced to the GFS forecast alone.

When both models were available, the ensemble point forecast was simply: T_ensemble = 0.547 * T_GFS + 0.453 * T_NAM. If GFS predicted 72 F and NAM predicted 70 F, the ensemble forecast was 71.1 F.
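
For readers who want to reproduce the weighting, here is a minimal Python sketch of Step 1. The function name is ours; the RMSE inputs come from the day-0 rows of the accuracy table, and the two temperatures are the hypothetical GFS and NAM forecasts from the example above.

```python
def inverse_rmse_weights(rmse_gfs: float, rmse_nam: float) -> tuple[float, float]:
    """Weight each model by 1 / RMSE, normalized so the weights sum to 1."""
    inv_gfs, inv_nam = 1.0 / rmse_gfs, 1.0 / rmse_nam
    total = inv_gfs + inv_nam
    return inv_gfs / total, inv_nam / total

w_gfs, w_nam = inverse_rmse_weights(2.85, 3.44)      # -> (0.547, 0.453)
t_ensemble = w_gfs * 72.0 + w_nam * 70.0             # -> 71.1 F
print(f"weights: GFS {w_gfs:.3f}, NAM {w_nam:.3f}; ensemble {t_ensemble:.1f} F")
```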

Step 2: Uncertainty Estimation. Each forecast model has a known sigma (the standard deviation of its errors — effectively the RMSE with the bias component removed). For day-0 GFS, sigma = 2.84 F. For day-0 NAM, sigma = 3.24 F.

The ensemble sigma combines both models’ uncertainty. Treating the model errors as independent, the standard deviation of the inverse-RMSE weighted average is:

sigma_ensemble = sqrt((w_GFS * sigma_GFS)^2 + (w_NAM * sigma_NAM)^2)

With both models available, this yields sigma_ensemble = 2.14 F — tighter than either individual model. When only GFS is available, sigma_ensemble defaults to 2.84 F. The uncertainty reduction from combining models is genuine but modest: a 24.6% reduction in standard deviation when both are present.

We used these sigma values to build a normal distribution centered on the ensemble forecast. The assumption of normality is well-supported for temperature forecast errors: a Shapiro-Wilk test on the day-0 GFS errors yielded p = 0.34, failing to reject normality.
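
The combination step, as a sketch in the same Python style, assuming independent model errors and using the sigma values above:

```python
import math

def ensemble_sigma(w_gfs: float, sigma_gfs: float,
                   w_nam: float, sigma_nam: float) -> float:
    """Standard deviation of the weighted average of two independent forecasts."""
    return math.sqrt((w_gfs * sigma_gfs) ** 2 + (w_nam * sigma_nam) ** 2)

print(ensemble_sigma(0.547, 2.84, 0.453, 3.24))  # ~2.14 F when both models report
# With GFS alone the weights collapse to (1, 0) and sigma falls back to 2.84 F.
```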

Step 3: Bracket Probability Grid. For each bracket (e.g., 55-60 F), we computed the area under the normal curve between the bracket boundaries. A 6-bracket event produces 6 probabilities that sum to 1.0.

The computation uses the standard normal CDF (denoted Phi). For a bracket with lower bound L and upper bound U, given forecast center mu and sigma:

P(bracket) = Phi((U - mu) / sigma) - Phi((L - mu) / sigma)

A worked example makes this concrete. Suppose our ensemble forecasts 72 F with sigma = 2.84 F, and the brackets are [Under 60, 60-65, 65-70, 70-75, 75-80, Over 80]:

  • P(Under 60) = Phi((60 - 72) / 2.84) = Phi(-4.23) = 0.000 (0.0%)
  • P(60-65) = Phi((65 - 72) / 2.84) - Phi((60 - 72) / 2.84) = Phi(-2.46) - Phi(-4.23) = 0.007 - 0.000 = 0.7%
  • P(65-70) = Phi((70 - 72) / 2.84) - Phi((65 - 72) / 2.84) = Phi(-0.70) - Phi(-2.46) = 0.242 - 0.007 = 23.5%
  • P(70-75) = Phi((75 - 72) / 2.84) - Phi((70 - 72) / 2.84) = Phi(1.06) - Phi(-0.70) = 0.855 - 0.242 = 61.3%
  • P(75-80) = Phi((80 - 72) / 2.84) - Phi((75 - 72) / 2.84) = Phi(2.82) - Phi(1.06) = 0.998 - 0.855 = 14.3%
  • P(Over 80) = 1 - Phi((80 - 72) / 2.84) = 1 - 0.998 = 0.2%

The six probabilities sum to 100%. The model assigns 61.3% to the 70-75 bracket, 23.5% to 65-70, and 14.3% to 75-80. The three tail brackets share under 1% combined.
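
The bracket grid itself is a few lines of Python using statistics.NormalDist; the function name and bracket edges here are ours, mirroring the worked example above:

```python
from statistics import NormalDist

def bracket_probabilities(mu: float, sigma: float, edges: list[float]) -> list[float]:
    """Probabilities for (-inf, e0], (e0, e1], ..., (eN, +inf) under N(mu, sigma)."""
    dist = NormalDist(mu, sigma)
    cdf = [dist.cdf(e) for e in edges]
    probs = [cdf[0]]
    probs += [hi - lo for lo, hi in zip(cdf, cdf[1:])]
    probs.append(1.0 - cdf[-1])
    return probs

print(bracket_probabilities(72.0, 2.84, [60, 65, 70, 75, 80]))
# -> approx [0.000, 0.007, 0.234, 0.614, 0.143, 0.002], summing to 1.0;
# the worked example differs by ~0.1 pp because it rounds Phi to three decimals.
```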

An important edge case arises when the forecast center falls exactly on a bracket boundary. If the forecast is 70.0 F, the two adjacent brackets (65-70 and 70-75) each get roughly equal probability — 46.1% and 46.1% in this case, with the remaining 7.8% spread across the other four brackets. The model handles this gracefully because the normal CDF is continuous, but it creates a practical challenge: two brackets with near-identical probability means the model has low conviction, and any strategy that depends on a clear top-1 pick performs poorly. We found that 8.7% of our 1,506 events had the forecast within 0.5 F of a bracket boundary, and the top-1 hit rate on those 131 events was just 26.4% — well below the 33.0% average.

Step 4: Calibration Check. We binned our per-bracket model probabilities into deciles, pooling everything above 50% into a single bin, and compared predicted vs. realized hit rates.

Calibration Results

The model was reasonably well-calibrated but showed characteristic overconfidence at higher probability levels:

Model Probability   Events   Realized Hit Rate   Gap
0-10%               4,890    6.2%                -1.3 pp
10-20%              2,144    15.8%               +0.2 pp
20-30%              1,206    25.1%               +0.9 pp
30-40%              704      33.8%               +0.5 pp
40-50%              328      38.1%               -5.7 pp
50%+                264      44.3%               -10.2 pp

The overconfidence pattern is striking and consistent. Below 40% model probability, calibration is tight — realized hit rates track predictions within 1.3 percentage points. Above 40%, the model systematically overstates the probability of the predicted bracket hitting. When our model said 50%+, the bracket only hit 44.3% of the time — a 10.2 percentage point gap.

This same overconfidence pattern appeared in our crypto model research, where Black-Scholes probabilities above 50% also overstated realized settlement rates. In that study, the 50-60% probability bucket showed +0.74 percentage points of overconfidence for BTC and +1.68 for ETH. The weather overconfidence is more severe (10.2 pp vs 0.7-1.7 pp), likely because temperature brackets impose a discrete outcome structure that the continuous normal distribution handles poorly at high confidence levels. The pattern suggests overconfidence at the 50%+ zone is a structural property of these Gaussian pricing models across domains, not a domain-specific artifact.

For the 264 events where our model predicted 50%+ probability, the market’s implied probability averaged 48.2% (a price of 48.2 cents), closer to the realized 44.3% than our model’s 53.7% average prediction. The crowd was already partially correcting for the overconfidence our model exhibited.
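
Mechanically, the Step 4 binning is simple. A sketch with illustrative array names, where p_model is the probability our model assigned to a bracket-contract and hit is 1 if that bracket settled YES:

```python
import numpy as np

def calibration_table(p_model: np.ndarray, hit: np.ndarray,
                      edges=(0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 1.0)):
    """Per-bin count, realized hit rate, and gap (realized minus predicted)."""
    rows = []
    for lo, hi in zip(edges, edges[1:]):
        mask = (p_model >= lo) & (p_model < hi)
        if mask.any():
            realized = hit[mask].mean()
            gap = realized - p_model[mask].mean()   # negative gap = overconfidence
            rows.append((f"{lo:.0%}-{hi:.0%}", int(mask.sum()), realized, gap))
    return rows
```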

Overall Accuracy

Across 1,506 events:

  • Top-1 bracket hit rate: 33.0% (2x random chance of 16.7%)
  • Top-2 bracket hit rate: 61.55% (the correct bracket was in our top 2 predictions nearly two-thirds of the time)
  • Average winner probability: 25.62%
  • Average Brier score: 0.8169

Seasonal performance was surprisingly consistent: top-1 hit rates ranged from 31.7% in winter (DJF) to 34.0% in spring (MAM). The model didn’t have a major blind spot in any season.
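
For concreteness, here is how the headline metrics can be computed, assuming the multi-category Brier score (squared error summed across the six brackets, then averaged over events); the array names are illustrative:

```python
import numpy as np

def top_k_hit_rate(probs: np.ndarray, winner_idx: np.ndarray, k: int) -> float:
    """probs: (events, brackets) model probabilities; winner_idx: bracket that settled YES."""
    top_k = np.argsort(-probs, axis=1)[:, :k]
    return float(np.mean([w in row for row, w in zip(top_k, winner_idx)]))

def brier_score(probs: np.ndarray, winner_idx: np.ndarray) -> float:
    """Multi-category Brier score averaged over events (0 = perfect, 2 = worst)."""
    outcomes = np.zeros_like(probs)
    outcomes[np.arange(len(probs)), winner_idx] = 1.0
    return float(np.mean(np.sum((probs - outcomes) ** 2, axis=1)))
```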

What Three Strategies Did We Test?

We tested three distinct approaches to converting model edge into trading profits.

Strategy 1: Tail Selling

Thesis: Sell (go NO) on the least likely brackets. If our model says bracket 5 and 6 have a combined probability under 8%, sell YES on those brackets and collect premium.

Results:

  • Brackets ranked 5th or 6th in probability lost 96.6% of the time
  • Sub-5% probability contracts settled YES only 5.37% of the time
  • Rank 5 breakeven price: 7 cents
  • Rank 6 breakeven price: 3 cents

Metric                  Taker Fees   Maker Fees
Win rate                96.6%        96.6%
ROI                     0.62%        6.22%
Avg premium collected   ~5 cents     ~5 cents
Avg loss when wrong     ~95 cents    ~95 cents

Verdict: The strategy was correct 96.6% of the time — but that’s roughly what you’d expect from selling contracts priced at 3-7 cents. The market already priced these outcomes correctly. At taker fees (the realistic scenario for most traders), ROI was 0.62%. You’d need to trade thousands of contracts to make minimum wage. At maker fees, the 6.22% ROI looks better but requires consistent fill rates on the NO side, which is already crowded.
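
The arithmetic behind that verdict fits in a few lines. A sketch, treating the fee as a free parameter rather than Kalshi’s actual schedule; the 3.4% YES rate and the roughly 5-cent premium come from the results above, while the fee levels passed in are illustrative:

```python
def sell_yes_ev(price_cents: float, p_yes: float, fee_cents: float) -> float:
    """Expected value in cents per contract of selling one YES at price_cents."""
    return price_cents - 100.0 * p_yes - fee_cents

def breakeven_sell_price(p_yes: float, fee_cents: float) -> float:
    """Lowest sale price that breaks even against the realized YES rate."""
    return 100.0 * p_yes + fee_cents

print(breakeven_sell_price(0.034, 0.0))   # ~3.4c: a 96.6% win rate implies a 3.4% YES rate
print(sell_yes_ev(5.0, 0.034, 0.0))       # +1.6c gross edge per contract before fees
print(sell_yes_ev(5.0, 0.034, 1.5))       # ~+0.1c once a 1.5c taker-style fee is charged
```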

Strategy 2: Conditional Filtering

Thesis: Don’t trade every event. Filter for conditions where the model has a structural advantage — specific seasons, high-confidence forecasts, or particular weather patterns.

We tested dozens of filter combinations. The best we found: spring events (MAM) where model confidence exceeded 45% for the top bracket.

Results:

  • 93 qualifying events out of 1,506 total
  • Top-1 hit rate: 45.16% (vs. 33.0% unfiltered, a +12.2 percentage point improvement)

Verdict: Almost certainly overfit. We tested many filter combinations and reported the best one. With 93 events, the confidence interval on a 45% hit rate is wide. We’d need 500+ events under the same conditions to confirm this filter actually works. The out-of-sample track record is untested. This is the kind of result that looks great in a backtest and falls apart in live trading.

Filters That Looked Promising but Failed

The conditional filtering exercise produced several results that illustrate why small-sample backtests are dangerous.

Wind speed above 15 mph: We hypothesized that high-wind events would increase forecast uncertainty, creating wider mispricings. The filter identified 67 qualifying events with winds exceeding 15 mph at the forecast station. The top-1 hit rate was 40.3% — a 7.3 percentage point improvement over baseline. But 67 events is a tiny sample. The 95% confidence interval on a 40.3% rate with n=67 spans from 28.6% to 52.0%. The true hit rate could easily be at or below the 33% baseline, and the improvement was not statistically significant (p = 0.22).
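
The interval quoted above is the standard normal-approximation binomial interval. A sketch, assuming 27 of the 67 filtered events hit (consistent with the 40.3% rate):

```python
import math

def binomial_ci(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation 95% confidence interval for a binomial proportion."""
    p = hits / n
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return p - half_width, p + half_width

print(binomial_ci(27, 67))   # -> roughly (0.286, 0.520), i.e. 28.6% to 52.0%
```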

Temperature delta from yesterday exceeding 10 F: Large day-over-day temperature swings might indicate forecast model disagreement, potentially creating mispricings. We found 112 events where the actual high differed from the previous day’s high by more than 10 F. The top-1 hit rate was 38.4% — seemingly promising. But these events also had a higher average Brier score (0.871 vs 0.817 baseline), meaning the model was less well-calibrated overall despite the higher hit rate. The “improvement” came from the model getting lucky on a small set of volatile weather days, not from a structural advantage.

Humidity above 80% combined with summer months: This filter yielded just 41 events with a 43.9% hit rate. The sample is so small that 2-3 additional misses would have dropped the rate below 35%. We rejected it despite the headline number.

The pattern across all filter experiments was consistent: small samples produced inflated hit rates that couldn’t survive statistical scrutiny. Every filter that beat 40% had fewer than 120 qualifying events. The filters with 300+ events never exceeded 36%. This is the textbook signature of overfitting — you’re finding noise patterns in the data, not signal.

Strategy 3: Top-2 Straddle

Thesis: Buy the top 2 most likely brackets simultaneously. One of them hits 61.55% of the time. If the combined purchase price is less than 61.55 cents, we have an edge.

Results:

  • Breakeven cost: 61.55 cents (the exact hit rate)
  • Average market cost for top-2 combined: 76.5 cents
  • Gap: -14.95 cents (the market was pricing the top-2 combo at 76.5 cents, well above our 61.55% hit rate)

Verdict: Dead on arrival. The market prices the top-2 brackets above the realized combination rate. The crowd effectively aggregates the same NWS data we’re using and adds a risk premium on top. Buying the two most likely outcomes is a losing proposition at market prices.
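
Expressed as code, the straddle screen is a two-column selection followed by a cost comparison. A sketch with illustrative arrays, where rows are events, columns are brackets, and ask_cents holds the market ask for each bracket:

```python
import numpy as np

def straddle_edge(model_probs: np.ndarray, ask_cents: np.ndarray,
                  top2_hit_rate: float = 0.6155) -> np.ndarray:
    """Theoretical edge in cents per event from buying the two most likely brackets."""
    top2 = np.argsort(-model_probs, axis=1)[:, :2]                  # two most likely brackets
    combined_cost = np.take_along_axis(ask_cents, top2, axis=1).sum(axis=1)
    return 100.0 * top2_hit_rate - combined_cost                    # positive = edge

# At the observed average combined cost of 76.5 cents, the edge is
# 61.55 - 76.5 = -14.95 cents per event.
```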

Strategy Comparison Summary

Strategy               Hit Rate   Breakeven   Market Price   P&L per Event   Verdict
Tail selling (taker)   96.6%      3.4%        3-7c           +0.03c          Breakeven
Tail selling (maker)   96.6%      3.4%        3-7c           +0.31c          Marginal
Conditional filter     45.2%      45.2%       ~35c           Unknown         Likely overfit
Top-2 straddle         61.6%      61.6c       76.5c          -14.9c          Unprofitable

How Efficient Are Weather Markets Really?

To answer this definitively, we analyzed 804,248 hypothetical trade-out scenarios across all 1,506 events. The full convergence analysis is detailed in our companion study on weather price convergence, which covers the mechanics of how prices drift toward terminal values.

Price Convergence Patterns

The convergence data tells a clear story. For contracts that ultimately settled YES (winners):

  • Entry price (median): 43.8 cents
  • Peak price before settlement: 77 cents
  • Convergence path: Smooth upward drift over final 4-6 hours

For contracts that settled NO (losers):

  • Entry price (median): 15.2 cents
  • Trough before settlement: 4 cents
  • Convergence path: Gradual decline, accelerating in final 2 hours

Winners converge from 43.8 cents to 77 cents. Losers converge from 15.2 cents to 4 cents. The market correctly identifies the likely outcome hours before settlement, and prices adjust smoothly toward the terminal value. The 23-cent gap between the peak winner price (77c) and full settlement ($1) represents irreducible weather uncertainty — even moments before the official high is recorded, the market prices in the possibility of a late-day temperature shift.

The convergence asymmetry is notable: losers finish much closer to their terminal value than winners do. Losers drop 11.2 cents (from 15.2c to 4c) and end within 4 cents of their $0 settlement, while winners gain 33.2 cents (from 43.8c to 77c) yet remain 23 cents short of $1. The market is quicker to rule out wrong brackets than to fully confirm the right one, which makes intuitive sense: a forecast 5 degrees away from a bracket boundary is strong evidence against that bracket, but a forecast at the bracket center still carries the 2.84 F sigma of forecast error.

The NWS Update Lag

When NWS releases a new forecast, how quickly does Kalshi respond?

Our data shows 73% of contracts move more than 1 cent within one hour of an NWS update. The lag window is 10-30 minutes — there’s a brief period where the forecast has changed but prices haven’t fully adjusted.

But the margin is razor-thin. The typical post-update price movement is 1-3 cents. Kalshi’s taker fee on a mid-priced contract is 1.5-1.75 cents per side. Round-trip, you’re paying 3-3.5 cents in fees to capture a 1-3 cent move. The math doesn’t work.

Even with maker fees (roughly 0.4-0.5 cents per side), the window is narrow enough that consistent execution is unrealistic. By the time you identify the update, adjust your model, and place orders, the market has already moved.

Volume and Liquidity Patterns

Weather contract liquidity follows predictable patterns:

  • Peak trading occurs 2-4 hours before settlement, when the forecast is most certain but still tradeable
  • Depth concentrates in the 2-3 most likely brackets; tail brackets are illiquid
  • Weekday events see 2-3x more volume than weekend events
  • Of our 40,032 total settled contracts, 37,674 (94.1%) had at least one trade — but the remaining 2,358 zero-volume contracts were almost entirely rank-5 and rank-6 brackets with implied probabilities below 3%

The liquidity concentration creates a practical constraint for any systematic strategy. Tail-selling (Strategy 1) targets exactly the brackets where liquidity is thinnest. The 96.6% win rate is real, but getting filled consistently on NO orders for rank-6 brackets priced at 3 cents requires patience and maker infrastructure that most retail participants lack.

What Does This Tell Us About Prediction Markets?

Weather markets are the poster child for prediction market efficiency. The conditions that produce efficient markets all apply:

  1. Public expert forecasts exist. NWS data is free, high-quality, and updated frequently. The crowd doesn’t need to be smarter than weather scientists — it just needs to aggregate their forecasts.

  2. Events resolve quickly. Daily temperature contracts settle within 24 hours. There’s no long-term uncertainty or narrative-driven drift. The feedback loop between prediction and outcome is tight.

  3. Low stakes reduce noise. Weather contracts attract fewer irrational participants than political or sports markets. There’s no team loyalty or political identity driving biased bets.

  4. The outcome space is narrow. Six temperature brackets are easy for a crowd to price correctly, compared to hundreds of political candidates or sporting outcomes.

The implication for other prediction markets: domains with publicly available expert forecasts and quick resolution cycles will likely be efficient. If you’re building a trading model, the edge won’t come from better data aggregation (the crowd already does this well). It has to come from better calibration, faster reaction to private information, or structural advantages like maker-fee rebates.

Our analysis of Kalshi’s crypto binary options found a parallel result: across 72.1 million trades, market makers earned +1.12% while takers lost 1.12%. The structural advantage belongs to liquidity providers, not signal-based traders, regardless of the underlying asset class. And our flagship crypto backtest confirmed the pattern — even a +1.2 cent per-signal edge across 877,606 contracts could not be converted into reliable profit after execution costs.

For current weather and other prediction market prices across all platforms, check our live dashboard. For details on how Kalshi specifically handles weather contracts, see our Kalshi platform profile.

Frequently Asked Questions

Can you make money trading Kalshi weather contracts?

Our analysis of 1,506 weather events and three distinct strategies found that none produced consistent profits after Kalshi’s fees. The best approach — tail-selling — was correct 96.6% of the time but generated only 0.62% ROI at taker fees, which translates to roughly 3 cents of profit per event. At that rate, you’d need 10,000+ contracts per year to earn $300. The 37,674 contracts with nonzero volume in our dataset suggest liquidity exists, but the margins are thinner than any reasonable time investment justifies.

How accurate are NWS weather forecasts for trading?

NWS GFS day-0 forecasts have a mean absolute error of 2.12 degrees F across 3,429 observations, and our ensemble model achieved a 33% bracket hit rate — 2x the 16.7% random baseline. The model’s accuracy is genuine, but Kalshi markets already incorporate this public forecast data within 10-30 minutes of release. Our dataset of 46,248 CLI reports verified against GHCN with 99.9% match rate confirms that the forecast-to-actual pipeline is clean. The model works; the market has already done the same work.

What is the best strategy for Kalshi weather markets?

Tail-selling (shorting outcomes ranked 5th or 6th in probability) had the highest win rate at 96.6%, but margins were eaten by fees. Conditional filtering improved the top-1 hit rate to 45.2% on 93 events, but every filter that exceeded 40% had fewer than 120 qualifying events — a strong indicator of overfitting. The top-2 straddle was unprofitable at market prices, with the market charging 76.5 cents for a combination that hit only 61.6% of the time. No strategy we tested produced reliable after-fee profits.

How fast do Kalshi weather markets react to NWS forecast updates?

73% of weather contracts show more than 1 cent of price movement within 1 hour of an NWS forecast update. The primary reaction window is 10-30 minutes. But the convergence margin of 1-3 cents is smaller than the round-trip trading cost of 3-3.5 cents at taker fees. Our 804,248 trade-out scenarios show that by 6 hours before settlement, winners are already at 50 cents and losers at 13 cents — the market has largely decided the outcome before most traders even start looking.

Are prediction markets good at forecasting weather?

Yes. Across 1,506 events and 37,674 traded contracts, Kalshi weather markets aggregate public NWS data, private weather models, and local knowledge into prices that closely track realized outcomes. The favorite-longshot bias exists but is smaller than in crypto markets or sports markets — nobody has an emotional attachment to a temperature bracket. Weather markets demonstrate that prediction markets work best when expert public data is available, resolution is fast, and participants are profit-motivated rather than identity-driven.

What is the favorite-longshot bias in weather betting?

Weather markets show a mild favorite-longshot bias: sub-10% contracts settle YES about 0.4 percentage points more often than their prices imply. That is roughly one-tenth of the 2-4 percentage point bias seen in crypto markets, reflecting the lower emotional stakes of betting on a temperature bracket.

How do Kalshi weather contracts settle?

Kalshi weather contracts settle on the official daily high temperature recorded at the designated weather station for each city. Brackets are typically 5 degrees F wide, with six brackets per event.

Key Takeaways

  • Our NWS-based ensemble model (GFS 54.7% weight, NAM 45.3%, inverse-RMSE weighting) achieved 33% top-1 bracket accuracy (2x random) and 61.6% top-2 accuracy across 1,506 events, backed by 48,978 daily observations and 46,248 verified CLI reports
  • Three strategies tested: tail-selling (96.6% win rate, breakeven after fees), conditional filtering (45% on 93 events, likely overfit), top-2 straddle (unprofitable at market prices of 76.5c vs 61.6% hit rate)
  • Overconfidence at 50%+ probability (44.3% realized vs 53.7% predicted) mirrors the same pattern found in our crypto pricing model backtest — a structural property of Gaussian models, not domain-specific
  • Weather markets respond to NWS updates within 10-30 minutes, but the 1-3 cent convergence margin is smaller than round-trip trading costs of 3-3.5 cents
  • 804,248 trade-out scenarios confirm smooth price convergence: winners drift 43.8c to 77c, losers drift 15.2c to 4c
  • Weather prediction markets work exactly as efficient market theory predicts — the crowd aggregates public expert forecasts faster and more accurately than any individual model

For the full academic treatment of these findings, see our research paper on weather prediction market efficiency.
