Prediction Markets · research

We Backtested Three Pricing Models on 877,000 Crypto Contracts. None of Them Made Money.

Last Updated: March 29, 2026

We built a crypto pricing model, backtested it on 877,606 settled Kalshi contracts, got a +221% return, and celebrated. Then we paper-traded it and lost 29.5%. The discrepancy led us to discover three bugs that had inflated our results. After fixing them, the edge dropped from +6.7 cents to +1.2 cents per signal. We rebuilt the model, tested two alternatives, and arrived at a conclusion we didn’t want: the crypto binary options market is approximately efficient.

This is the story of how backtest hygiene saved us from deploying a broken model. Every number in this article comes from the Becker (2026) dataset of 72.1 million trades across 877,606 settled contracts.

What Dataset Did We Use?

The Becker (2026) dataset is the most comprehensive publicly available record of Kalshi crypto contract trading. It covers 877,606 settled range contracts across 5 assets, 72.1 million individual trades, and 385 trading days of continuous data. From this, we extracted 28,496 qualifying signals — instances where our model identified a mispricing large enough to warrant a trade.

The qualification criteria: model fair value diverged from market price by more than the expected taker fee, the contract had at least one trade (proving it was actively quoted), and sufficient time remained before settlement for the pricing model to be meaningful.

Our previous research had used a sample of 99 contracts. The Becker dataset was 287x larger. The jump in sample size exposed problems we never would have found on the small dataset.

The asset breakdown of our 28,496 signals:

| Asset | Signals | Avg P&L | Win Rate |
|-------|---------|---------|----------|
| BTC   | 18,865  | +0.4c   | 68.6%    |
| ETH   | 9,612   | +2.6c   | 42.5%    |
| DOGE  | 696     | +1.7c   | 50.7%    |
| XRP   | 44      | +9.9c   | 59.1%    |

ETH drove the majority of the edge. BTC signals were plentiful but thin. XRP and DOGE had too few signals for statistical confidence.

How Does the Black-Scholes Model Price Crypto Range Contracts?

The standard Black-Scholes model computes the probability that a price will finish above a given strike at expiry. For a binary call paying $1 if spot > strike, the fair value equals N(d2), where d2 incorporates the log-moneyness, variance, and time to expiry. We adapted this for Kalshi’s specific settlement mechanism using the CF Benchmarks TWAP: 60 one-second samples in the final minute, trimmed mean excluding the top and bottom 12. The combined variance reduction factor is 0.554.

Our model applies this as a volatility multiplier that transitions linearly from no adjustment (5 minutes to expiry) to full adjustment (under 1 minute). The rationale: far from expiry, the TWAP averaging hasn’t started, so raw volatility applies. Near expiry, the averaging compresses the effective price range.
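
As a sketch of the pricing logic above (the function names, the zero-drift assumption, and the 0.80 annualized volatility input are ours for illustration, not the production model):

```python
import math

TWAP_VARIANCE_FACTOR = 0.554  # variance reduction from the 60-sample trimmed-mean TWAP

def twap_vol_multiplier(minutes_to_expiry: float) -> float:
    """Blend linearly from no adjustment (>= 5 min) to full adjustment (<= 1 min)."""
    if minutes_to_expiry >= 5.0:
        return 1.0
    if minutes_to_expiry <= 1.0:
        return math.sqrt(TWAP_VARIANCE_FACTOR)
    w = (5.0 - minutes_to_expiry) / 4.0  # 0 at 5 minutes, 1 at 1 minute
    return (1.0 - w) + w * math.sqrt(TWAP_VARIANCE_FACTOR)

def binary_call_fair_value(spot, strike, annual_vol, minutes_to_expiry):
    """P(spot_T > strike) = N(d2) under zero-drift Black-Scholes."""
    t = minutes_to_expiry / (365.0 * 24.0 * 60.0)  # years
    vol = annual_vol * twap_vol_multiplier(minutes_to_expiry)
    d2 = (math.log(spot / strike) - 0.5 * vol * vol * t) / (vol * math.sqrt(t))
    return 0.5 * (1.0 + math.erf(d2 / math.sqrt(2.0)))  # standard normal CDF

# An in-the-money contract a few minutes from expiry should price well above 0.5:
p = binary_call_fair_value(spot=95_500, strike=95_000, annual_vol=0.80, minutes_to_expiry=4)
```

The multiplier applies to volatility (the square root of variance), which is why sqrt(0.554) appears rather than 0.554 itself.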

The initial backtest results looked spectacular. Across our signal set, the model identified consistent mispricings and generated a simulated return of +221% on a $5,000 starting bankroll over 385 days.

We were suspicious. +221% on a systematic strategy with no human discretion is the kind of result that usually means something is wrong. So we paper-traded it.

The Three Bugs That Changed Everything

Paper trading produced a -29.5% return over 37 days. The model was systematically entering positions that immediately moved against it. Something in the backtest wasn’t matching reality. We spent two weeks debugging. Here’s what we found.

Bug 1: Subtraction Order — The Silent Killer

The fair value computation had a reversed subtraction: strike - spot instead of spot - strike in the log term. What this meant in practice: when Bitcoin was trading at $95,500 and the strike was $95,000, the function should have computed ln(95500/95000) — a positive value indicating the spot price was above the strike. Instead, it computed the equivalent of ln(95000/95500) — a negative value. This negative log-moneyness flowed through the normal CDF, producing probabilities below 0.5 for contracts that were in the money and should have been priced above 0.5.

The function was saying “there’s a 35% chance BTC finishes above $95,000” when it was already at $95,500, just 4 minutes from settlement. The answer should have been closer to 85%.
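
The sign flip is easy to see in isolation (a toy reproduction, not the production code):

```python
import math

def log_moneyness(spot: float, strike: float) -> float:
    """Correct: log(spot/strike) = log(spot) - log(strike); positive in the money."""
    return math.log(spot / strike)

def log_moneyness_buggy(spot: float, strike: float) -> float:
    """The bug: operands swapped, equivalent to log(strike) - log(spot)."""
    return math.log(strike / spot)

correct = log_moneyness(95_500, 95_000)      # positive: spot is above the strike
buggy = log_moneyness_buggy(95_500, 95_000)  # same magnitude, wrong sign
```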

The insidious part: our isotonic regression calibration step compensated for this error. Isotonic regression is a standard calibration technique in machine learning. It takes a set of predicted probabilities and a set of observed outcomes, then fits a monotonically non-decreasing function that maps predictions to calibrated probabilities. If the model predicts 0.30 and the event occurs 0.70 of the time, isotonic regression learns to map 0.30 to 0.70.

This is exactly what happened with our broken model. The reversed subtraction produced systematically inverted probabilities — high when they should be low, low when they should be high. Isotonic regression detected this pattern in the training data and learned to flip the signal back. The calibrated output looked correct because the calibration routine was doing double duty: correcting both the model’s natural biases AND the implementation bug.
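
A synthetic demonstration with scikit-learn's `IsotonicRegression` (the data below is simulated, not the Becker dataset): deliberately invert a set of true probabilities, then let in-sample calibration quietly repair the damage.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
true_p = rng.uniform(0.05, 0.95, size=20_000)            # true event probabilities
outcomes = (rng.uniform(size=true_p.size) < true_p).astype(float)

broken_pred = 1.0 - true_p  # the sign-flipped model: high where it should be low

# increasing="auto" lets the fit choose a decreasing map, which is how a
# calibration step can "learn to flip the signal back".
iso = IsotonicRegression(increasing="auto", out_of_bounds="clip")
calibrated = iso.fit_transform(broken_pred, outcomes)

brier_broken = np.mean((broken_pred - outcomes) ** 2)     # terrible raw model
brier_calibrated = np.mean((calibrated - outcomes) ** 2)  # looks healthy in-sample
```

In-sample, the calibrated Brier score looks fine even though the underlying model is inverted; nothing in that number warns you the fix will not transfer out-of-sample.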

Why is this so dangerous? Because the backtest used in-sample calibration. The calibration table was built from the same 385-day period it was evaluated on. In-sample, the correction was near-perfect. Out-of-sample — which is what paper trading represents — the calibration table was built on a different volatility regime, different price levels, and different market microstructure. The compensating error no longer compensated.

This pattern should alarm anyone doing quantitative research. A standard, well-regarded ML technique (isotonic regression) produced plausible-looking results from a fundamentally broken model. The backtest Brier score was 0.0118 — entirely reasonable. The return curve was smooth and upward-sloping. Nothing in the standard diagnostic toolkit flagged the problem. Only live forward-testing, where the calibration table couldn’t perfectly compensate, revealed the error.

The lesson is severe: if your model requires heavy calibration to produce sensible outputs, you should treat that as a diagnostic signal, not a solution. A well-specified model should produce reasonable probabilities before calibration. Calibration should be a refinement, not a rescue operation. When we fixed the subtraction order, the uncalibrated model immediately produced probabilities in the correct range. The calibration step barely changed them — exactly what you want to see.

Impact: Changed the sign of the model’s signal for approximately 40% of contracts.

Bug 2: Time-to-Expiry Units — The 7.7x Amplifier

The model expected time-to-expiry in hours; we were passing it minutes. For a contract 30 minutes from expiry, the value 30 was read as 30 hours (1,800 minutes), a 60x inflation of the time horizon. Volatility scales with the square root of time, so this is a sqrt(60) ≈ 7.7x error in the volatility input.
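
The arithmetic of the amplification:

```python
import math

minutes_to_expiry = 30
correct_scale = math.sqrt(minutes_to_expiry)     # volatility scale for 30 minutes
buggy_scale = math.sqrt(minutes_to_expiry * 60)  # "30" read as hours = 1,800 minutes
ratio = buggy_scale / correct_scale              # sqrt(60) ≈ 7.75x
```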

We discovered this bug through a specific pattern in the paper-trading P&L: the model was consistently buying out-of-the-money contracts in the final hour before settlement and losing on nearly all of them. A contract 3 strikes away from the current price with 20 minutes left should have a near-zero probability. Our model was pricing it at 15-20 cents because it thought there were 20 hours of volatility remaining.

This caused the model to massively overestimate the probability of large moves. Contracts that should have been priced at 5 cents (5% probability) were modeled at 15-20 cents. Combined with Bug 1’s sign inversion and the calibration correction, the net effect was a model that appeared to identify edge where none existed.

Impact: 7.7x overestimate of effective volatility, causing systematic overpricing of OTM contracts.

Bug 3: Bracket Interpretation — The $50 Offset

Kalshi’s API returns a b_value field for each range contract. We assumed this was the floor of the bracket (the lower boundary). It’s actually the center.

For BTC with $100 bracket width, a b_value of 95,000 means the bracket covers $94,950 to $95,050, not $95,000 to $95,100. The $50 offset seems small, but for contracts near the money it shifts the probability calculation by several percentage points.
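
The correct and incorrect readings side by side (helper names are ours, not the Kalshi API's):

```python
def bracket_bounds(b_value: float, width: float) -> tuple[float, float]:
    """Correct reading: b_value is the CENTER of the bracket."""
    half = width / 2.0
    return (b_value - half, b_value + half)

def bracket_bounds_wrong(b_value: float, width: float) -> tuple[float, float]:
    """Our original misreading: b_value as the floor."""
    return (b_value, b_value + width)

lo, hi = bracket_bounds(95_000, 100)  # (94950.0, 95050.0)
```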

We verified this directly from the Kalshi API documentation and by checking settlement outcomes against bracket boundaries. 14 contracts that settled differently than our model predicted were all explained by the center-vs-floor misinterpretation.

Impact: Shifted every BTC probability by 0.5-3 percentage points depending on proximity to current price.

How Paper Trading Exposed All Three

The revealing pattern in the paper-trading data was a cluster of 11 consecutive losses on BTC contracts in the 85-95 cent range. Our model bought YES at 87-92 cents, and every single one settled at $0. A correctly-specified model should almost never buy a contract at 90 cents that settles worthless — the implied 90% probability should correspond to near-certain outcomes. Eleven in a row was impossible under correct pricing.

We pulled those 11 contracts and manually computed fair values by hand. The hand calculations disagreed with the model by 40-60 percentage points. That discrepancy was large enough to immediately rule out calibration drift or volatility misestimation. Something was structurally wrong with the computation itself. From there, finding the three bugs took 8 days of line-by-line auditing.

Before and After

| Metric | Before Fixes | After Fixes |
|--------|--------------|-------------|
| Backtest return (385d) | +221% | +7.0% |
| Avg edge per signal | +6.7c | +1.2c |
| Win rate | 74.2% | 59.6% |
| Paper trading return | -29.5% | Not retested |
| Sharpe ratio (daily) | 8.1 | 4.5 |

The edge dropped from +6.7 cents to +1.2 cents. The +221% return became +7.0% (still positive, but a fundamentally different proposition). The daily Sharpe ratio fell from 8.1 to 4.5. The win rate dropped from an implausible 74% to a more realistic 60%.

The 95% confidence interval on the corrected +1.2 cent edge: [+1.15c, +1.25c] (t-stat 106.47, p < 0.0001). The edge is statistically significant. It’s just not economically meaningful after execution costs.

Timeline of Discovery

The arc from euphoria to reality took approximately 14 weeks. Here is the chronological sequence.

Weeks 1-3: Model Build. We implemented the Black-Scholes adaptation with TWAP variance adjustment, calibrated on the first 200 days of the Becker dataset. Initial calibration looked clean. The model produced a Brier score of 0.0118 in-sample, which we took as validation. 28,496 qualifying signals were identified.

Weeks 4-5: Backtest Euphoria. The full 385-day backtest returned +221% on a $5,000 starting bankroll, with a daily Sharpe ratio of 8.1. We wrote internal documentation describing the strategy as “production-ready.” We were wrong.

Weeks 6-10: Paper Trading. We launched the model in paper-trading mode, submitting simulated orders at live Kalshi prices without committing capital. By day 12, the cumulative return was -8.3%. By day 25, it was -21.7%. By day 37, it reached -29.5%. The P&L curve was monotonically downward — not a single week was positive. The divergence from the backtest was too large and too consistent to be explained by regime change or bad luck. A 250-percentage-point gap between backtest (+221%) and live (-29.5%) demands a structural explanation.

Weeks 10-12: Debugging. We spent 14 days isolating the problems. The 11 consecutive BTC losses in the 85-95 cent range were the key diagnostic. Hand-computing fair values for those contracts revealed the subtraction-order bug on day 4 of debugging. The time-unit bug emerged on day 8 when we noticed the model’s volatility estimates were 7-8x higher than realized vol. The bracket interpretation bug was the last to fall, discovered on day 11 by cross-referencing settlement prices against bracket boundaries.

Weeks 12-14: Clean Rebuild. We fixed all three bugs, rebuilt the calibration from scratch, and re-ran the full 385-day backtest. The +221% became +7.0%. The +6.7 cent edge became +1.2 cents. We then built and tested the probability cap and Kou jump-diffusion alternatives. All three models converged on the same conclusion: the edge is real, tiny, and economically unviable after execution costs.

Does a Probability Cap Fix the Overconfidence?

Model 2 was a pragmatic patch: cap the model’s maximum probability at 0.45 for range contracts. The logic: Black-Scholes with log-normal tails systematically overestimates the probability of the most likely bracket. Crypto returns have fat tails (kurtosis 32.9 on hourly returns), and the log-normal assumption breaks down for near-the-money contracts. Empirically, realized bracket frequency plateaus at roughly 50% regardless of model confidence above 45%.

We saw the same overconfidence pattern in our weather model research: when our NWS-based model predicted 50%+ probability, the bracket hit only 44.3% of the time. The 20-40% range was most accurate in both domains — a consistent finding across 1,506 weather events and 28,496 crypto signals.

Results with the 0.45 cap:

| Metric | Uncapped B-S | Capped B-S |
|--------|--------------|------------|
| Signals | 28,484 | 28,484 |
| Win rate | 59.8% | 59.6% |
| Taker P&L (avg) | +1.15c | +1.2c |
| Total taker P&L | +$328 | +$350 |
| Maker P&L (avg) | +1.80c | +1.80c |

The cap barely changes the results. The reason: most actionable signals already fall in the 0-40% probability range where the cap doesn’t bind. The overconfident 50%+ zone generates few signals because market prices are already close to model prices there. The cap is theoretically justified but practically irrelevant.

Can Jump-Diffusion Do Better?

Model 3 was the Kou double-exponential jump-diffusion model. Unlike Black-Scholes, Kou explicitly models sudden price jumps with exponentially distributed magnitudes. The intuition: crypto prices don’t just drift along a smooth random walk — they jump. A model that captures jumps should price tail events more accurately.

We calibrated the Kou parameters on 5,152 hourly crypto returns:

  • Jump intensity (lambda): Average 2.3 jumps per day
  • Jump size (up/down means): Calibrated per asset, asymmetric
  • Kurtosis of hourly returns: 32.9 (vs. 3.0 for a normal distribution)

What Does a Kurtosis of 32.9 Actually Mean?

The kurtosis of 32.9 deserves extended discussion because it is an extraordinary number. A normal distribution has a kurtosis of 3.0. Equity returns — already considered fat-tailed — typically show kurtosis between 5 and 10 on hourly data. The S&P 500 at 1-hour intervals runs around 6-8, depending on the period. BTC hourly returns at 32.9 are in a different universe.

Kurtosis measures the weight of the tails relative to the center of a distribution. A kurtosis of 32.9 means the distribution produces extreme observations far more often than a normal distribution would predict. Under a normal distribution, roughly 0.27% of observations fall beyond 3 standard deviations (about 1 in 370 hours, or roughly once every 15 days). With a kurtosis of 32.9, approximately 2.1% of hourly observations exceed 3 standard deviations — that is 1 in every 48 hours, or roughly every 2 days. BTC produces an “extreme” hourly move 8x more often than a normal model expects.

At 4 standard deviations, the disparity is even more dramatic. A normal distribution predicts a 4-sigma hourly move once every 31,560 hours (roughly 3.6 years). BTC produces one approximately every 12 days. These are the moves that bankrupt traders who size their positions using Gaussian assumptions. In practical terms, if BTC’s hourly volatility is 0.4%, a 4-sigma move is a 1.6% swing in 60 minutes — roughly $1,500 at a $95,000 price level. A trader who assumed normal tails and sized for a once-in-4-years event would face this scenario 30 times per year.
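
The normal-tail baselines quoted above can be reproduced from the complementary error function alone:

```python
import math

def norm_sf(z: float) -> float:
    """Standard normal survival function P(Z > z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

beyond_3sigma = 2 * norm_sf(3.0)      # two-sided: ≈ 0.27% of observations
hours_per_3sigma = 1 / beyond_3sigma  # ≈ 370 hours, roughly 15 days
hours_per_4sigma = 1 / norm_sf(4.0)   # one-sided: ≈ 31,600 hours, roughly 3.6 years
```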

What Counts as a “Jump”?

The Kou model separates price movements into two components: continuous diffusion (the smooth random walk) and discrete jumps (sudden, discontinuous price changes). The jump intensity of 2.3 per day means the model identifies roughly one jump event every 10.4 hours.

A “jump” in the Kou framework is not just any large move. It is a move that cannot be explained by the continuous diffusion component alone — a price change that is too large, too sudden, or too discontinuous to be generated by the Brownian motion portion of the model. When BTC drops $800 in 3 minutes on a liquidation cascade, that is a jump. When it drifts $200 over 45 minutes of normal trading, that is diffusion.

The distinction matters for pricing because jumps and diffusion contribute differently to the probability of finishing inside a bracket. Diffusion is symmetric and predictable — it spreads probability mass gradually across nearby brackets. Jumps create spikes of probability at distant brackets. A model that ignores jumps underestimates the probability that BTC lands in a bracket 5-6 strikes away from the current price.
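
A minimal Kou-style simulator makes the decomposition concrete. The parameters below are assumptions chosen only to illustrate the mechanism (the 2.3 jumps/day intensity comes from the calibration above; the jump-size and diffusion values are ours), yet even this sketch produces hourly-return kurtosis far beyond the Gaussian 3.0.

```python
import numpy as np

def simulate_kou_returns(n_hours, sigma_h, lam_per_hour, p_up, eta_up, eta_dn, seed=0):
    """Hourly log-returns: Gaussian diffusion plus double-exponential jumps."""
    rng = np.random.default_rng(seed)
    diffusion = rng.normal(0.0, sigma_h, size=n_hours)
    n_jumps = rng.poisson(lam_per_hour, size=n_hours)  # jump count per hour
    jumps = np.zeros(n_hours)
    for i, k in enumerate(n_jumps):
        for _ in range(k):
            if rng.uniform() < p_up:
                jumps[i] += rng.exponential(1.0 / eta_up)  # mean up-jump 1/eta_up
            else:
                jumps[i] -= rng.exponential(1.0 / eta_dn)  # mean down-jump 1/eta_dn
    return diffusion + jumps

# 2.3 jumps/day ≈ 0.096 per hour; 0.4% hourly diffusion vol; asymmetric jump sizes.
r = simulate_kou_returns(5_152, sigma_h=0.004, lam_per_hour=2.3 / 24,
                         p_up=0.5, eta_up=80.0, eta_dn=60.0)
x = r - r.mean()
kurt = np.mean(x**4) / np.mean(x**2) ** 2  # 3.0 for a pure Gaussian; much larger here
```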

Does the Complexity Pay Off?

| Metric | Black-Scholes | Kou Jump-Diffusion |
|--------|---------------|--------------------|
| Signals | 28,484 | 28,684 |
| Win rate | 59.8% | 57.4% |
| Taker P&L (avg) | +1.15c | +1.38c |
| Total taker P&L | +$328 | +$395 |
| Maker P&L (avg) | +1.80c | +1.96c |
| Zero-fee P&L (avg) | +2.80c | +2.96c |

Kou produces marginally better results: +1.38 cents vs +1.15 cents per signal at taker fees, +$395 vs +$328 total P&L. The win rate is lower (57.4% vs 59.8%) but the average win is larger, consistent with better tail pricing.

The improvement is real but small: +0.23 cents per signal. Over 28,684 signals, that is $67 of additional profit. The implementation cost of Kou is substantial — calibrating 5 parameters (diffusion volatility, jump intensity, up-jump mean, down-jump mean, up-jump probability) versus 1 parameter for Black-Scholes (volatility). The Kou calibration requires maximum-likelihood estimation on the hourly return series, which is computationally expensive and sensitive to the sample period.

The market is already pricing in fat tails — not perfectly, but well enough that a more sophisticated model captures only 0.23 additional cents per signal. At zero fees, Kou shows +2.96 cents vs Black-Scholes at +2.80 cents. The models are converging toward the same conclusion: the market misprices contracts by roughly 3 cents on average, and fees consume most of that edge.

Why Doesn’t 89% Directional Accuracy Equal Profit?

One of the most counterintuitive findings from our research: our model predicts the direction of price movement with 89.2% accuracy on far-out-of-the-money contracts, yet loses money on those contracts. The explanation lies in the asymmetric payoff structure of binary options.

For contracts priced under 10 cents (low probability events), the model correctly predicts “this won’t happen” roughly 89% of the time. But “correctly predicting NO” on a 5-cent contract earns you 5 cents. Incorrectly predicting NO (the event happens) costs you 95 cents.

The math: 89.2% * $0.05 - 10.8% * $0.95 = $0.0446 - $0.1026 = -$0.058 per contract. High accuracy, negative P&L.
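
The same expected-value arithmetic, spelled out:

```python
acc = 0.892                        # directional accuracy on sub-10c contracts
win, loss = 0.05, 0.95             # payoff for a correct vs incorrect NO at 95c
ev = acc * win - (1 - acc) * loss  # per-contract expectation: about -5.8 cents
```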

Meanwhile, the model generates its largest “edge” signals in the 20-50% probability zone — exactly where accuracy is worst. In this zone, the model’s direction call drops to 33.8% win rate. The signals are louder but less reliable.

This is the fundamental challenge of binary option trading: accuracy and profitability are decoupled. A model can be highly directionally accurate (89%+) and still unprofitable, or modestly accurate (60%) and profitable. What matters is the calibration between confidence and realized probability, not the raw hit rate.

What Happens When You Try to Execute?

Even the +1.2 cent per-signal edge overstates the realistic opportunity because it ignores execution constraints. The gap between theoretical P&L and realizable P&L is where most quantitative strategies die, and this one is no exception. We modeled 3 execution scenarios across the full 28,496 signal set and found that realistic friction consumes the entire edge.

Zero Exit Liquidity

68% of contracts with model-identified edge showed zero exit liquidity. Meaning: you could enter (buy at market), but you couldn’t sell before settlement. Your only exit was binary settlement — either $1 or $0. This is not a quirk of our signal selection — it reflects the fundamental structure of Kalshi’s crypto order book, where the 40x NO-side depth asymmetry means counterparties for closing YES positions simply do not exist on most contracts.

For a trade-out strategy (buy when mispriced, sell when the market corrects), this is fatal. We simulated 804,248 trade-out scenarios — the same methodology we used for our weather convergence analysis. Trade-out simulation starting with $5,000:

| Capture Rate | Final Value | Return | Max Drawdown |
|--------------|-------------|--------|--------------|
| 10% | $7.42 | -99.9% | 99.9% |
| 20% | $9.46 | -99.8% | 99.8% |
| 40% | -$13.81 | -100.3% | 100.3% |

These numbers look absurd. A -99.9% return seems like it must contain an error. It doesn’t. Here is what is happening at a 10% capture rate.

A 10% capture rate means you successfully exit 10% of your positions before settlement at a profit and hold the remaining 90% to binary settlement. But the 10% you exit are not randomly selected. They are the easy ones — the positions that moved in your favor quickly enough for a counterparty to appear on the other side. The remaining 90% are positions where the market didn’t move toward your model’s fair value, or actively moved against you.

You are systematically keeping the losers and selling the winners. This is adverse selection in its purest form. The orders you can fill are the ones you shouldn’t want to. The contracts where the market agrees with your model (and thus provides exit liquidity) are the same contracts where your model’s edge was smallest or nonexistent. The contracts where your model identified genuine mispricing are the ones with no counterparty willing to take the other side.

A 40% capture rate produces a return below -100% because the strategy goes negative and continues trading on margin in the simulation. The losses compound faster than the occasional wins can offset them.
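
A deliberately simplified model of this adverse selection (all numbers illustrative, not the paper's 804,248-scenario simulation): give every position a genuine 3-cent edge, but allow early exits only among eventual winners.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
entry = 0.30                    # buy YES at 30 cents
p_true = 0.33                   # a genuine 3-cent edge on every position
settle = (rng.uniform(size=n) < p_true).astype(float)

hold_pnl = (settle - entry).mean()  # hold everything to settlement: about +3c

# "10% capture": exits appear only on positions already moving your way,
# i.e. among eventual winners, and you take them at a 5-cent profit.
capture = np.zeros(n, dtype=bool)
winners = np.flatnonzero(settle == 1.0)
capture[rng.choice(winners, size=n // 10, replace=False)] = True
trade_out_pnl = np.where(capture, 0.05, settle - entry).mean()  # negative
```

Swapping a 70-cent settlement win for a 5-cent early exit on a tenth of the book is enough to push a positive-edge strategy underwater, which is the mechanism behind the table above.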

Trade-out doesn’t work. Period. The only viable strategy is hold-to-settlement, which produced +$350 on 28,496 signals (+1.2 cents average).

Volume-Edge Correlation

The correlation between model-identified edge and contract volume is 0.849. The biggest mispricings exist in the least liquid contracts. This is expected (market efficiency is a function of attention) but it imposes a hard capacity ceiling.

This 0.849 correlation is the quantitative expression of a well-known market microstructure phenomenon. Liquid contracts attract sophisticated participants — the same algorithmic market makers who maintain 40x NO-side depth on Kalshi’s crypto books, as we documented in our market structure article. Those participants eliminate mispricings. Illiquid contracts, by definition, have fewer participants and less price discovery. That is where the model finds edge. But illiquid contracts are illiquid — you can’t trade size without moving the price against yourself.

Based on realistic fill rates of 3.6-8.8% of signals and entry slippage of 1.5-1.6%, the capacity ceiling for a single model-based trader is $5,000-$25,000. Above that, your orders start moving the market against you, and the 1.2-cent edge disappears. To put this in perspective: even at the $25,000 upper bound and a generous 8.8% fill rate, you are executing roughly 2,508 trades per year at +1.2 cents each — $30 of total profit. The strategy is technically positive-expectation and practically worthless.

Slippage Budget

| Component | Cost |
|-----------|------|
| Taker fee (avg) | 1.58c |
| Entry slippage | 1.5c |
| Exit (settlement) | 0.0c |
| Total round-trip | 3.08c |

Against a gross edge of approximately 2.8-3.0 cents (zero-fee model P&L), the 3.08 cents in execution costs leaves essentially nothing. At maker fees (0.4-0.5 cents), the round-trip drops to approximately 2.0 cents, leaving 0.8-1.0 cents of net edge. The math works — barely. But maker fills require posting resting orders and waiting for counterparties to cross the spread. On contracts with zero exit liquidity (68% of our signal set), there is no guarantee your resting order will be filled within the window where your model identifies edge. You need infrastructure to monitor thousands of contracts simultaneously, post and cancel orders in sub-second latency, and manage risk across 188 strikes per event. Most individual traders don’t have this, and the $30 annual profit doesn’t justify building it.
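
The budget arithmetic, using the table's numbers and the maker-fee midpoint:

```python
gross_edge = 0.0280                      # zero-fee model P&L, low end of 2.8-3.0c
taker_round_trip = 0.0158 + 0.015 + 0.0  # taker fee + entry slippage + settlement exit
maker_round_trip = 0.0045 + 0.015 + 0.0  # maker fee midpoint (0.4-0.5c) instead

net_taker = gross_edge - taker_round_trip  # below zero: the edge is gone
net_maker = gross_edge - maker_round_trip  # a fraction of a cent survives
```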

Frequently Asked Questions

Can Black-Scholes price crypto binary options?

Black-Scholes with TWAP adjustment produces well-calibrated probabilities for crypto range contracts. Our model achieved a Brier score of 0.0103 across 5.48 million observations, which indicates strong probabilistic calibration. The problem is not accuracy — it’s profitability. The average edge after Kalshi taker fees is +1.2 cents per contract. On 28,496 signals over 385 days, that totals $350. After accounting for execution constraints (68% zero exit liquidity, 0.849 volume-edge correlation), the realistic return is near zero. The model works as a probability estimator. It fails as a trading strategy.

Why did the backtest initially show +221% returns?

Three implementation bugs compounded to inflate results by roughly 30x. Bug 1 reversed the subtraction order in the log-moneyness calculation, producing inverted probabilities that isotonic regression calibration masked in-sample. Bug 2 passed time-to-expiry in minutes to a function that expected hours, causing a 7.7x overestimate of effective volatility. Bug 3 interpreted Kalshi’s bracket b-value as the floor instead of the center, shifting every BTC probability by 0.5-3 percentage points. Paper trading at -29.5% over 37 days exposed all three by removing the in-sample calibration safety net. After fixes, the +221% became +7.0%.

Does the Kou jump-diffusion model work better than Black-Scholes for crypto?

Marginally. Kou produced +1.38 cents per signal versus Black-Scholes at +1.15 cents, an improvement of $67 in total P&L across 28,684 signals. The Kou model was calibrated on 5,152 hourly returns showing a kurtosis of 32.9 — far beyond the normal distribution’s 3.0 or typical equity returns of 5-10. Despite this extreme fat-tailedness, the market already prices in most of the jump risk. The additional 0.23 cents per signal barely justifies the implementation complexity of calibrating 5 parameters versus Black-Scholes’ single volatility parameter.

What is the maximum capacity for crypto range trading on Kalshi?

The capacity ceiling is $5,000-$25,000 for a single model-based trader. 68% of contracts with model-identified edge showed zero exit liquidity, and the volume-edge correlation of 0.849 means the best opportunities are in the least liquid contracts. Fill rates on qualifying signals range from 3.6-8.8%, with entry slippage of 1.5-1.6 cents. Above $25,000 in deployed capital, market impact from your own orders erodes the 1.2-cent edge entirely. For context, Kalshi processes over $60 million in daily crypto volume, but that volume concentrates in the most liquid contracts — precisely where model edge is smallest.

How do you detect bugs in trading backtests?

Paper trading is the definitive bug detector. Our model showed +221% in backtests but -29.5% in paper trading — a divergence too large for regime change or bad luck to explain. The specific diagnostic was 11 consecutive losses on BTC contracts priced in the 85-95 cent range. Hand-computing fair values for those 11 contracts revealed that the model’s probabilities disagreed with manual calculations by 40-60 percentage points, pointing to a structural computation error rather than parameter misestimation. The general principle: always validate backtests with live forward testing before committing capital. If your backtest requires heavy calibration to produce sensible results, treat that as a warning sign.

What Did We Actually Learn?

Backtest Integrity Requires Adversarial Review

The three bugs we found would have been invisible without paper trading. Each one was individually subtle:

  • A sign error masked by calibration
  • A units error in a function nobody double-checked
  • An API field interpretation that seemed obvious but was wrong

Together, they produced a +221% return that looked plausible. The calibration step is particularly dangerous: it can compensate for systematic errors in ways that make the backtest look correct while the underlying model is broken. If your model needs heavy calibration to produce reasonable outputs, that’s a red flag, not a feature.

Calibration Can Mask Misspecification

This is the most important methodological lesson. Our calibration routine (mapping model probabilities to realized frequencies via isotonic regression) produced good-looking results even when the underlying model had three significant bugs. The calibration was doing the heavy lifting while the model was providing noise.

After bug fixes, the model produced well-calibrated probabilities without calibration. The raw Black-Scholes output (with TWAP adjustment) had a Brier score of 0.0103 across 5.48 million observations. When the model is correctly specified, calibration is a minor refinement, not a fundamental correction. When calibration is making a big difference, check your model.

Market Efficiency Is Real

Three models. 28,496 signals. 385 days. The maximum edge any model found was +1.4 cents per signal (Kou at taker fees). After execution costs, the realistic return is near zero.

This isn’t a statement about modeling capability. Our models are well-calibrated and statistically validated. The market is simply good at pricing five-minute and hourly crypto contracts. The 40x NO-side depth we documented in our market structure article reflects professional market makers running their own models. Beating them by more than the fee spread is the challenge, and we couldn’t do it reliably.

The crypto binary options market on Kalshi is roughly as efficient as the weather market we analyzed in our weather model research. Different domains, same conclusion: when public data is available and resolution is fast, prediction markets converge on efficient prices. Our weather convergence analysis confirmed the same dynamic across 804,248 trade-out scenarios — winners converge smoothly, losers decline predictably, and the margin for exploitation is razor-thin.

Monitor current market efficiency across all prediction market platforms on our SIGNAL dashboard.

Key Takeaways

  • Initial +221% backtest return collapsed to +7% after fixing three implementation bugs — paper trading at -29.5% over 37 days exposed the errors
  • Three models tested: Black-Scholes (+1.2c/signal), capped B-S (+1.2c), Kou jump-diffusion (+1.4c) — all marginal after taker fees
  • The three bugs: reversed subtraction (masked by isotonic regression calibration), minutes passed as hours for time-to-expiry (7.7x vol error), bracket center vs floor interpretation
  • BTC hourly return kurtosis of 32.9 produces 3-sigma moves every 2 days (8x more than normal), yet the market already prices in most of this fat-tail risk
  • 68% of contracts had zero exit liquidity; trade-out strategies suffer from pure adverse selection (keeping losers, selling winners)
  • Capacity ceiling is $5K-$25K before market impact erodes the edge; volume-edge correlation of 0.849 means the best signals are in the least liquid contracts
  • Calibration that compensates for model bugs is the most dangerous form of overfitting — if removing calibration destroys your results, the model is broken

For the full academic treatment of these findings, see our research paper on crypto binary option efficiency.

Frequently Asked Questions

Can you beat Kalshi crypto options with a pricing model?

We tested three models across 28,496 signals. The maximum edge was +1.4 cents per signal (Kou jump-diffusion at taker fees). After execution costs and the $5K-$25K capacity ceiling, realistic returns are near zero. The market is approximately efficient.

What is the Kou jump-diffusion model?

The Kou model adds discrete price jumps with double-exponential size distribution to standard geometric Brownian motion. It captures fat tails better than Black-Scholes. We calibrated it on 5,152 hourly BTC returns showing kurtosis of 32.9, but it only improved edge by 0.23 cents per signal over Black-Scholes.

How common are backtesting bugs?

We found three simultaneous bugs that inflated our backtest from +1.2 cents to +6.7 cents per signal (+221% simulated return). Each bug was individually subtle — a sign error, a units mismatch, a field interpretation error. Isotonic calibration masked all three by compensating for the systematic errors. Paper trading at -29.5% exposed the problems.

Are prediction markets efficient?

Our evidence says yes, approximately. Three models across 877,000 crypto contracts and 1,506 weather events converge on the same conclusion: fees eat the edge. The market incorporates public information within 10-30 minutes and prices range contracts within 1-3 cents of model fair value.

What is isotonic calibration and why is it dangerous?

Isotonic regression maps model probabilities to realized frequencies using historical data. It's dangerous because it can compensate for implementation bugs, making a broken model look correct. Our calibration masked three bugs simultaneously. If removing calibration destroys your results, the model is likely misspecified.