Beta Calibration vs Isotonic: Why TrapStats Switched

This is a technical post. If you're here for greyhound picks you can safely skip it; if you're here because you want to know how the sausage is actually made (and why we changed the sausage recipe twice), read on.

All performance figures below are simulated/paper results; past performance does not guarantee future results. This is research and education, not betting advice.

The problem: a narrow-input calibration arm

The Denis predictor produces a raw probability from a LightGBM ensemble. Before that probability becomes an EV input, it passes through a calibrator: a function, fit on a held-out set, that maps raw scores to observed win frequencies.

The first version of TrapStats' Denis used an isotonic regression calibrator, fit on the full validation set (~7500 rows). Isotonic regression fits a non-parametric, monotonic step function; it's the textbook choice and works well on most problems.

On Denis, it produced a disaster.

What happened the first time

The Denis selection arm only ever selects the top pick per race, and top-pick raw probabilities are concentrated in a narrow range — roughly 0.30 to 0.50. The isotonic calibrator, fit on the full validation distribution (which spans 0.0 to 1.0), learned a fine-grained mapping in the populated buckets and then applied out_of_bounds="clip" for inputs outside its training range.

Result: long-shot dogs the model rated at 0.05 raw got clipped up to the constant minimum of the calibrator's effective range — around 0.18 to 0.20 — at prices of 30 to 50 to 1. EV exploded:

ev_win = 0.20 × 50 − 1 = +9.00

The system saw +900% EV on a 50/1 long shot and faithfully recorded the selection. Multiply that by 100+ similar candidates over 75 minutes and we had a phantom-EV machine. We rolled it back the same evening.
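
The arithmetic is easy to reproduce. A minimal sketch, assuming the price in the formula is a multiplier on a unit stake; the function name `ev_win` mirrors the formula but is hypothetical as code:

```python
def ev_win(prob: float, price: float) -> float:
    """Expected value per unit staked: prob * price - 1."""
    return prob * price - 1.0


# With the calibrated probability stuck at a floor of 0.20,
# EV is driven entirely by the price.
for price in (30, 40, 50):
    print(f"price {price}: EV {ev_win(0.20, price):+.2f}")
```

Because the probability no longer falls as the price rises, every long shot looks better than the one before it.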

(See memory: feedback_denis_calibration for the longer post-mortem.)

Why isotonic broke on this input

Isotonic regression fits a monotonic step function on the training distribution. Its outputs are well-behaved inside that distribution. But:

Outside the training range, it has no signal — and out_of_bounds="clip" is a sensible default that becomes catastrophic when paired with high-payoff long-shot selections.
On narrow ranges, isotonic collapses to fewer effective bins. If 7000 of 7500 training points sit in [0.20, 0.55] and you're selecting picks in [0.30, 0.50], the calibrator's output range in your operating zone is even narrower — almost a constant. Multiplying a constant by price gives EV that grows linearly with price. That's the long-shot bias mechanism.
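
Both effects can be reproduced without any library. The sketch below is a pure-Python stand-in for an isotonic fit (pool-adjacent-violators plus clipped interpolation), not the production calibrator. The bucketed win rates are synthetic, chosen only so that the fitted floor lands near the 0.18 mentioned earlier, and the names `pav_fit` and `predict_clip` are hypothetical.

```python
from bisect import bisect_left


def pav_fit(xs, ys):
    """Pool-adjacent-violators on points with distinct x values.

    Returns the sorted x knots and the non-decreasing fitted values.
    """
    pairs = sorted(zip(xs, ys))
    blocks = []  # each block is [sum_of_y, count]
    for _, y in pairs:
        blocks.append([float(y), 1])
        # Merge backwards while the previous block's mean exceeds this one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return [x for x, _ in pairs], fitted


def predict_clip(knots_x, knots_y, x):
    """Interpolate between knots; inputs outside the fitted range are clipped."""
    x = min(max(x, knots_x[0]), knots_x[-1])
    i = bisect_left(knots_x, x)
    if knots_x[i] == x:
        return knots_y[i]
    x0, x1 = knots_x[i - 1], knots_x[i]
    y0, y1 = knots_y[i - 1], knots_y[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Synthetic bucketed win rates on raw scores 0.20..0.55 (illustrative only).
raw_scores = [0.20 + 0.01 * i for i in range(36)]
win_rates = [0.19 + 0.008 * i + (0.01 if i % 2 else -0.01) for i in range(36)]
knots_x, knots_y = pav_fit(raw_scores, win_rates)

# A raw 0.05 sits far below the fitted range, so it gets the lowest fitted value.
floor = predict_clip(knots_x, knots_y, 0.05)
for price in (30, 40, 50):
    print(f"price {price}: calibrated {floor:.3f}, EV {floor * price - 1:+.2f}")
```

Any raw score below the lowest knot receives the same calibrated value, so the EV printed in the loop depends only on the price.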

Why Beta calibration handles it

A Beta calibrator is parametric: it fits a single three-parameter logistic model on log(x) and log(1 − x):

P_calibrated(x) = sigmoid(a · log(x) + b · log(1 − x) + c)

Three numbers (a, b, c) trained on the same data. Properties:

Smooth everywhere, and monotonic whenever a ≥ 0 and b ≤ 0 (with the signs as written above). No "out of bounds" behavior: the function is defined on all of (0, 1), so nothing is ever clipped to a constant.
Doesn't overfit narrow ranges. With only three free parameters, it can't collapse to a constant in a narrow bucket the way isotonic does — there's nowhere for it to overfit to.
Theoretically grounded. Beta calibration is derived by assuming the raw scores within each class follow a Beta distribution; for tree-ensemble scores that assumption is often roughly right.

We swapped isotonic for Beta in ml/models/calibration.py:BetaCalibrator, fit on the same val set, deployed it, and the phantom-EV problem disappeared.
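
For readers who want to see the shape of such a fit, here is a minimal sketch of the three-parameter form given earlier, fit by projected gradient descent on log loss. It is not the code in ml/models/calibration.py. The function names, learning rate, step count, and synthetic data are all assumptions made for illustration, and the projection (a ≥ 0, b ≤ 0) is one simple way to guarantee a non-decreasing map under this sign convention.

```python
import math
import random


def _clip(x, eps=1e-6):
    # Keep x strictly inside (0, 1) so both logs are finite.
    return min(max(x, eps), 1.0 - eps)


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def beta_calibrate(x, a, b, c):
    """sigmoid(a*log(x) + b*log(1-x) + c)."""
    x = _clip(x)
    return _sigmoid(a * math.log(x) + b * math.log(1.0 - x) + c)


def fit_beta(xs, ys, lr=0.5, steps=300):
    """Fit (a, b, c) by projected gradient descent on mean log loss."""
    a, b, c = 1.0, -1.0, 0.0  # starts at the identity map
    n = len(xs)
    feats = [(math.log(_clip(x)), math.log(1.0 - _clip(x))) for x in xs]
    for _ in range(steps):
        ga = gb = gc = 0.0
        for (lx, l1x), y in zip(feats, ys):
            err = _sigmoid(a * lx + b * l1x + c) - y
            ga += err * lx
            gb += err * l1x
            gc += err
        # Projection step: a >= 0 and b <= 0 keeps the map non-decreasing.
        a = max(a - lr * ga / n, 0.0)
        b = min(b - lr * gb / n, 0.0)
        c = c - lr * gc / n
    return a, b, c


# Synthetic, overconfident raw scores: the true win rate is x ** 1.5.
rng = random.Random(0)
xs = [rng.uniform(0.02, 0.98) for _ in range(1000)]
ys = [1 if rng.random() < x ** 1.5 else 0 for x in xs]
a, b, c = fit_beta(xs, ys)

for x in (0.05, 0.30, 0.50):
    print(f"raw {x:.2f} -> calibrated {beta_calibrate(x, a, b, c):.3f}")
```

With a = 1, b = −1, c = 0 the function is the identity, which is why the fit starts there: the calibrator only moves away from the raw scores as far as the data pushes it.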

A second failure, a deeper lesson

The second Denis disaster was not a calibrator-shape problem — it was a fit-distribution problem. Even with isotonic replaced, we discovered later that the calibrator was being fit on the full val population (~7500 entries, full prob distribution) while being applied at the per-race top-pick level (a selection-conditioned slice with prob_win concentrated in 0.30–0.50).
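
The mismatch is easier to see once the slice is written down. A minimal sketch of selecting the top pick per race from a set of rows; the `(race_id, raw_prob, won)` layout and the function name are assumptions for illustration, not the project's schema:

```python
def top_pick_slice(rows):
    """Keep only the highest-raw-probability runner in each race."""
    best = {}
    for race_id, raw_prob, won in rows:
        if race_id not in best or raw_prob > best[race_id][1]:
            best[race_id] = (race_id, raw_prob, won)
    return list(best.values())


# Made-up rows: two races, three runners each.
rows = [
    ("r1", 0.42, 1), ("r1", 0.20, 0), ("r1", 0.05, 0),
    ("r2", 0.35, 0), ("r2", 0.30, 1), ("r2", 0.08, 0),
]
picks = top_pick_slice(rows)

full_probs = [p for _, p, _ in rows]
pick_probs = [p for _, p, _ in picks]
print("full range:    ", min(full_probs), "to", max(full_probs))
print("top-pick range:", min(pick_probs), "to", max(pick_probs))
```

A calibrator fit on `rows` sees the whole range of scores; a calibrator applied to `picks` only ever sees the top of it.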

The calibrator added +0.04 to +0.08 in exactly that band. Live Brier jumped to 0.217 against 0.153 on validation. The win rate in the 30%+ bucket was 27% against a predicted 41%. Same phantom-EV problem, different root cause.
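
Both diagnostics are a few lines each. A minimal sketch with hypothetical function names and made-up inputs (the four picks below are not TrapStats data):

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def bucket_reliability(probs, outcomes, lo, hi=1.0):
    """Return (count, mean predicted, observed win rate) for lo <= p <= hi."""
    picked = [(p, y) for p, y in zip(probs, outcomes) if lo <= p <= hi]
    if not picked:
        return 0, None, None
    n = len(picked)
    mean_pred = sum(p for p, _ in picked) / n
    win_rate = sum(y for _, y in picked) / n
    return n, mean_pred, win_rate


# Made-up picks: predicted probabilities and whether each one won.
probs = [0.41, 0.38, 0.45, 0.40]
outcomes = [0, 1, 0, 0]

print("Brier:", round(brier_score(probs, outcomes), 4))
print("30%+ bucket:", bucket_reliability(probs, outcomes, 0.30))
```

A gap between the mean predicted probability and the observed win rate in a bucket is the same signal as the 41% versus 27% figure above, just measured on a different sample.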

The fix this time wasn't another calibrator change — it was the shadow calibrator infrastructure: a second calibrator fit only on the selection arm's live outcomes, logged side-by-side with the production calibrator but never used to make selections. We watch it. If the live-arm calibrator's Brier improves on production's, and realized ROI lines up with predicted, then we cut over.
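
In outline, the shadow arrangement looks something like the sketch below. Every name here is hypothetical, the two calibrators are stand-in functions, and the "select when EV is above a threshold" rule is an assumption made so the example has something to decide. The point it illustrates is only that the shadow output is recorded and never consulted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShadowRecord:
    pick_id: str
    raw_prob: float
    production_prob: float  # the only probability used for selection
    shadow_prob: float      # logged for comparison, never used to select


def score_pick(pick_id, raw_prob, production_cal, shadow_cal, price,
               ev_threshold=0.0):
    """Log both calibrated probabilities; decide on the production one only."""
    record = ShadowRecord(
        pick_id=pick_id,
        raw_prob=raw_prob,
        production_prob=production_cal(raw_prob),
        shadow_prob=shadow_cal(raw_prob),
    )
    selected = record.production_prob * price - 1.0 > ev_threshold
    return record, selected


def production_cal(p):
    # Stand-in: inflates the band, like the miscalibration described above.
    return min(p + 0.06, 1.0)


def shadow_cal(p):
    # Stand-in for a calibrator fit on live selection-arm outcomes.
    return max(p - 0.04, 0.0)


record, selected = score_pick("example-pick", 0.40, production_cal, shadow_cal,
                              price=3.0)
print(record, "selected:", selected)
```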

(See project_denis_calibration_explainability_2026_05_29 memory for the full overhaul detail.)

Lessons that generalised

Three things we now treat as absolute rules:

Beta calibration for narrow-input ranges, never isotonic.
Fit the calibrator on the same conditional distribution it will be applied to. Selection arms need selection-arm calibrators.
Shadow before cut. Never deploy a calibration change to the live selection arm without ≥7 days of shadow data showing improvement and realized-vs-predicted ROI within ~5pp.
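
The third rule is mechanical enough to write as a gate. A minimal sketch, assuming "improvement" means a lower Brier score and that ROI figures are expressed as fractions (so 5pp is 0.05); the function name and argument names are hypothetical:

```python
def ready_to_cut_over(shadow_days, production_brier, shadow_brier,
                      realized_roi, predicted_roi,
                      min_days=7, max_roi_gap=0.05):
    """True only if all three shadow-before-cut conditions hold."""
    enough_data = shadow_days >= min_days
    shadow_is_better = shadow_brier < production_brier
    roi_lines_up = abs(realized_roi - predicted_roi) <= max_roi_gap
    return enough_data and shadow_is_better and roi_lines_up


# Illustrative numbers only.
print(ready_to_cut_over(7, 0.217, 0.190, realized_roi=0.02, predicted_roi=0.05))
print(ready_to_cut_over(3, 0.217, 0.190, realized_roi=0.02, predicted_roi=0.05))
```

The second call fails on the day count alone, however good the other two numbers look.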

These rules were paid for in simulated units. They live in feedback_denis_calibration in the project memory, and every new calibration proposal has to clear them.

What you see today

If you visit a Denis pick on /denis and expand the "Why this pick" panel, you'll see three rows in the Shadow block:

The production prob (current calibrator output).
The shadow prob (live-arm Beta calibrator output).
The segment-prior prob (per-track+distance+grade base rate).

The panel itself is diagnostic only. The selection uses the production prob, but you can see what the alternatives would say, side by side, and decide for yourself whether the model is in consensus or out on a limb.

That's the whole point: probability calibration is hard, the failure modes are sharp, and the honest move is to expose every layer of the reasoning rather than hide it behind a single number.