Why Raw Sigmoid Scores Beat Isotonic Calibration in Crypto ML Pipelines

Most ML tutorials will tell you to calibrate your classifier's probabilities before putting them in production. "Calibrated probabilities are more meaningful," they say. "A 0.75 confidence should mean 75% win rate." Makes sense. Sklearn has CalibratedClassifierCV ready to go. You wrap your model, apply isotonic regression, and your probabilities are now trustworthy.

Do not do this for crypto trading models. It will break your signal.

This post explains exactly why, with data from our V2 ensemble running live on real orders since February 18, 2026. We will cover the counterintuitive truth about AUC, why isotonic regression eats your tail signal, and the one exception where raw sigmoid scores can also mislead you.

The AUC 0.50 Trap

Here is a line from our V2 walk-forward evaluation on a recent retrain:

Combined OOS AUC BUY: 0.535
Combined OOS AUC SELL: 0.498

By every textbook metric, that SELL model is useless. 0.498 is, if anything, slightly worse than a coin flip. Delete the model, rework the features, start over.

Except the SELL signal was profitable in paper trading at a 60% win rate. How?

Because AUC measures ordering across the entire score distribution, and trading only uses the tails. AUC is the probability that the model ranks a randomly chosen positive above a randomly chosen negative, so an AUC of 0.50 means it does that no better than chance. But in production we do not ask "rank these two." We ask "of the 5% of samples scoring above 0.80, what fraction are positive?"

Those are different questions. A model can have trash bulk ordering and excellent tail behavior simultaneously.
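To make the two questions concrete, here is a minimal sketch that computes both on the same arrays. The helper names `rank_auc` and `tail_win_rate` are hypothetical (not from our pipeline), and the toy data is made up purely to show the effect.

```python
import numpy as np

def rank_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Uses O(n_pos * n_neg) memory,
    which is fine for a sketch but not for millions of rows.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def tail_win_rate(scores, labels, threshold=0.80):
    """Win rate and sample count among scores at or above the threshold."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    mask = scores >= threshold
    n = int(mask.sum())
    rate = float(labels[mask].mean()) if n else float("nan")
    return rate, n

# Toy data (made up): the bulk is mis-ordered, the tail is clean.
scores = np.array([0.52, 0.55, 0.58, 0.61, 0.64, 0.67, 0.85, 0.91])
labels = np.array([1,    1,    1,    0,    0,    0,    1,    1])

print(rank_auc(scores, labels))       # ordering over the whole distribution
print(tail_win_rate(scores, labels))  # what the 0.80 gate actually sees
```

On this toy data the AUC is 0.40, yet both samples above 0.80 are wins. Real data will be far less clean, but the two numbers answer different questions in the same way.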

Here is what our sigmoid score distribution looked like on that 0.498-AUC model:

sigmoid_bin   count    win_rate
0.50-0.60     8421     48.2%
0.60-0.70     2104     51.6%
0.70-0.80      612     56.8%
0.80-0.90      189     61.4%
0.90-1.00       47     70.2%

The model is correctly identifying high-confidence wins. It is just wrong about everything else, and the bulk of "everything else" drags the AUC toward 0.50.
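A table like the one above can be produced with a few lines of NumPy. This is a sketch under assumptions: `win_rate_by_bin` is a hypothetical helper, `labels` is assumed to be 1 for a win and 0 for a loss, and the bin edges simply mirror the table.

```python
import numpy as np

def win_rate_by_bin(scores, labels, edges=(0.5, 0.6, 0.7, 0.8, 0.9, 1.0)):
    """Return one (low, high, count, win_rate) tuple per score bin.

    Bins are half-open [low, high), except the last one, which also
    includes its upper edge so a score of exactly 1.0 is counted.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    rows = []
    last_index = len(edges) - 2
    for i, (low, high) in enumerate(zip(edges[:-1], edges[1:])):
        upper_ok = (scores <= high) if i == last_index else (scores < high)
        mask = (scores >= low) & upper_ok
        n = int(mask.sum())
        rate = float(labels[mask].mean()) if n else float("nan")
        rows.append((low, high, n, rate))
    return rows

# Usage with made-up values:
for low, high, n, rate in win_rate_by_bin([0.55, 0.58, 0.85], [0, 1, 1]):
    print(f"{low:.2f}-{high:.2f}  {n:5d}  {rate:.1%}")
```

Always read the count column next to the win rate: a 70% win rate on 47 samples is much weaker evidence than the same rate on 4,700.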

What Isotonic Calibration Actually Does

Isotonic regression fits a monotonic step function to your predictions so that the output becomes a calibrated probability. It looks like this:

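from sklearn.isotonic import IsotonicRegression

# raw_sigmoid_scores, labels: 1-D arrays from a held-out calibration set
iso = IsotonicRegression(out_of_bounds="clip")  # clip scores outside the fitted range
iso.fit(raw_sigmoid_scores, labels)
calibrated = iso.predict(raw_sigmoid_scores)  # non-decreasing in the raw score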

The problem is that isotonic regression learns its mapping from a finite calibration set, where high scores are rare and their observed win rates are noisy. On that set, it observes something like:

Scores in [0.80, 0.90] won 58% of the time
Scores in [0.90, 1.00] won 62% of the time

It then compresses scores from 0.80-1.00 into 0.58-0.62. Mathematically, your probabilities are now "calibrated." Practically, you have just flattened the tail that contained all your trading signal.

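When you go to production and apply your threshold of 0.80, you find that nothing scores above 0.80 anymore. You have to drop your threshold to around 0.58 to get any signal. And because the fitted function is a step function, every raw score that lands on the same step now carries the same calibrated value: a raw 0.81 and a raw 0.97 become indistinguishable, and any lower raw scores pooled into that step are mixed into your high-confidence bucket.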
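The flattening is easy to reproduce on a handful of points. The sketch below uses a made-up calibration set (eight scores, noisy outcomes in the tail) and scikit-learn's `IsotonicRegression`; it illustrates the mechanism and is not our production data.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Toy calibration set (made up): outcomes in the tail are noisy.
raw = np.array([0.55, 0.62, 0.71, 0.81, 0.84, 0.88, 0.92, 0.97])
won = np.array([0,    0,    1,    1,    0,    1,    1,    0])

iso = IsotonicRegression(out_of_bounds="clip")
calibrated = iso.fit(raw, won).predict(raw)

# Isotonic regression must stay non-decreasing, so it pools the six
# scores from 0.71 to 0.97 into a single step at their mean, 4/6.
n_raw_pass = int((raw >= 0.80).sum())
n_cal_pass = int((calibrated >= 0.80).sum())

print(np.round(calibrated, 3))
print(n_raw_pass, n_cal_pass)
```

Five raw scores clear the 0.80 gate; after calibration none do, and every score from 0.71 upward reports the same value of about 0.667. The calibrated number is a fair estimate of the win rate on that step, but it can no longer separate a 0.97 from a 0.71.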

The Production Fix: Raw Sigmoid from Logits

Tree-based gradient boosters with a logistic objective produce raw log-odds (logits) internally, not probabilities. Most libraries will helpfully sigmoid them for you when you call predict_proba. Do not let them. Pull the raw logits and sigmoid them yourself, so you know the sigmoid is applied exactly once:

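import numpy as np

def raw_sigmoid(model, X):
    # raw_score=True is LightGBM's keyword for returning raw margins (logits)
    raw_logits = model.predict(X, raw_score=True)
    # Apply the sigmoid exactly once
    return 1.0 / (1.0 + np.exp(-raw_logits))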

In SQL (our production signal query looks like this):

SELECT *
FROM bot_order_signals bos
WHERE 1.0 / (1.0 + exp(-bos.score)) >= 0.80
  AND bos.side = 'BUY'
ORDER BY bos.score DESC
LIMIT 10;

We store the raw logit. We compute sigmoid in the query at read time. We never trust the isotonic score, which is stored alongside purely for diagnostic comparison.
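Because the sigmoid is strictly increasing, a threshold on the sigmoid score is equivalent to a threshold on the raw logit itself. The sketch below shows that equivalence; `logit` and `passes_gate` are hypothetical helper names, and this is an illustration of the math rather than a description of our query planner.

```python
import math

def logit(p):
    """Inverse of the sigmoid: the raw score whose sigmoid equals p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))

def sigmoid(z):
    """Plain sigmoid; fine for moderate z, overflows for z below about -709."""
    return 1.0 / (1.0 + math.exp(-z))

def passes_gate(raw_logit, threshold=0.80):
    """Same decision as sigmoid(raw_logit) >= threshold, up to
    floating-point rounding exactly at the boundary."""
    return raw_logit >= logit(threshold)

print(logit(0.80))  # ln(4), about 1.386
print(logit(0.90))  # ln(9), about 2.197
```

In other words, gating at 0.80 is the same as requiring a stored logit of at least ln(4), and gating at 0.90 is the same as requiring at least ln(9). Comparing the stored column against a precomputed constant is one way to try a new threshold without recomputing anything per row.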

The Double-Sigmoid Bug

One specific gotcha we hit in February: our original V2 code called predict_proba() and then applied a sigmoid on top of it. The result was a double-sigmoid. Because predict_proba() output already lies in [0, 1], the second sigmoid squeezes every score into roughly [0.50, 0.73]: a prediction that should have been 0.85 comes out near 0.70, and nothing can ever reach a 0.80 threshold.

The fix was to use raw classifier scores (logits) and apply sigmoid exactly once:

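# output_margin=True is XGBoost's keyword; LightGBM spells it raw_score=True
raw_logit = model.predict(X, output_margin=True)
score = 1.0 / (1.0 + np.exp(-raw_logit))  # sigmoid applied exactly once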

Check your pipeline for this. If your production scores look "too close to 0.5," double-sigmoid is a likely culprit.
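A quick sanity check follows from the arithmetic: a sigmoid applied to a value that is already a probability can only return something between sigmoid(0) = 0.5 and sigmoid(1), which is about 0.731. The helper below, `looks_double_sigmoided`, is a hypothetical heuristic built on that fact, not a definitive detector.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

# Upper bound of any double-sigmoided score: sigmoid(1.0), about 0.731.
DOUBLE_SIGMOID_MAX = float(sigmoid(1.0))

def looks_double_sigmoided(scores, min_samples=100):
    """Heuristic: True if a reasonably large batch of scores never leaves
    the [0.5, sigmoid(1)] band that a double sigmoid is confined to."""
    scores = np.asarray(scores, dtype=float)
    if scores.size < min_samples:
        return False  # too few samples to say anything
    return bool(scores.min() >= 0.5 and scores.max() <= DOUBLE_SIGMOID_MAX)

# Synthetic logits, for illustration only.
logits = np.linspace(-4.0, 4.0, 200)
once = sigmoid(logits)   # correct: spans most of (0, 1)
twice = sigmoid(once)    # the bug: trapped between 0.5 and 0.731

print(looks_double_sigmoided(once), looks_double_sigmoided(twice))
```

A healthy batch can in principle sit inside that band by coincidence, so treat a positive result as a prompt to read the scoring code, not as proof.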

The Exception: Raw SELL Signal Can Invert

Here is the thing we did not expect. On two consecutive retrains in late March 2026, our raw SELL sigmoid scores inverted at high thresholds. Here is what we saw:

SELL raw sigmoid threshold    win_rate (OOS eval)
>= 0.70                       42%
>= 0.80                       37%
>= 0.90                       30%

The higher the model's confidence, the worse its accuracy. The model was anti-predicting.

But in paper trading on live data, the same model at the same threshold showed 55-60% win rates. The inversion was period-specific, not a fundamental property of the model.

Our conclusion: trust paper trading over eval-set OOS results when they disagree, especially for minority-side predictions. Eval sets are small. Paper trading sets are much larger and capture live market conditions. If eval says "inverted" but paper trading says "55% WR with 200+ trades," trust the paper trading number.
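One way to put a number on "eval sets are small" is a confidence interval on each win rate. The sketch below uses the Wilson score interval; the function name `wilson_interval` and the counts are hypothetical, chosen only to contrast a small eval tail with a 200-trade paper run.

```python
import math

def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a win rate (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Hypothetical counts, for illustration only.
eval_lo, eval_hi = wilson_interval(wins=12, n=40)      # 30% on a small eval tail
paper_lo, paper_hi = wilson_interval(wins=110, n=200)  # 55% over 200 trades

print(f"eval:  {eval_lo:.3f} to {eval_hi:.3f}")
print(f"paper: {paper_lo:.3f} to {paper_hi:.3f}")
```

With these made-up counts the eval interval is roughly twice as wide as the paper trading one. The interval only captures sampling noise; it says nothing about whether the eval period itself was unrepresentative, which is the other half of the argument above.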

Practical Checklist

When building a crypto ML pipeline:

Compute sigmoid from raw logits, exactly once. Do not call predict_proba and then sigmoid again.
Bin your scores and look at per-bin win rates. If AUC is 0.50 but the top bin wins 65%, you still have a useful model.
Do not isotonic-calibrate before gating. Threshold on raw sigmoid. Use isotonic only if you need calibrated probabilities for downstream math (position sizing, Kelly criterion, etc.), and even then apply it after the threshold gate, not before.
Track eval OOS and paper trading separately. They can disagree, and when they do, paper trading is usually the one to trust.
Store raw logits in your database. Compute sigmoid at read time. This lets you experiment with different thresholds without re-running predictions.

At APIndicators, every prediction we serve to API clients uses this raw-sigmoid-on-logits approach. Our production threshold is 0.80 for V2 and 0.90 for V3 paper trading. See /docs for the prediction endpoint schema or /pricing for plan details.