Most retail crypto prediction systems run a single model, tune it aggressively on historical data, and push it to production. Then they watch in confusion as the backtest's 68% win rate collapses to 51% in live trading. We have been down that road. V1 of APIndicators was a single LightGBM classifier. It shipped in late 2025 and it worked until it did not.
V2 is different. It is a three-model ensemble of LightGBM, XGBoost, and CatBoost, each trained independently on walk-forward folds, weighted by out-of-sample AUC, and gated behind a raw sigmoid threshold of 0.80, plus a positive expected value filter on the SELL side. V2 went live with real orders on February 18, 2026. As of today we have 973 real trades: a 59.0% win rate on the BUY side with an average of +0.319% per trade, and a 50.3% win rate on the SELL side.
This post walks through the full architecture, the reasoning behind each decision, and the production numbers we are seeing.
Why Three Models, Not One
Every gradient boosting library has systematic biases:
- LightGBM grows leaf-wise. It converges fast and tends to produce sharp decision boundaries. Great at capturing local patterns, occasionally overfits minority regions.
- XGBoost grows level-wise with heavy regularization. Smoother predictions, more conservative tail behavior.
- CatBoost uses ordered boosting and oblivious trees. Handles categorical interactions natively and produces well-calibrated probabilities by default.
Trained on identical features and labels, these three will disagree on roughly 15-20% of high-confidence predictions. Those disagreements are information. When all three agree that a candle is likely to hit +2% before -2%, you trade. When they split, you stay out.
The requirement for this to work: the models must make different errors. Three copies of LightGBM with different seeds do not qualify. Three architecturally distinct libraries do.
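One way to check that requirement is to measure how often the models disagree on rows where at least one of them is confident. The sketch below assumes each model's scores are already probabilities in [0, 1]. The function name, the reuse of 0.80 as the cutoff, and the toy scores are illustrative and not taken from the production code.

```python
import numpy as np

def high_confidence_disagreement(scores: dict, threshold: float = 0.80) -> float:
    """Share of high-confidence rows where the models do not all agree.

    A row counts as high-confidence if at least one model scores it at or
    above `threshold`; the models agree on it only if all of them do.
    """
    stacked = np.vstack([np.asarray(s, dtype=float) for s in scores.values()])
    above = stacked >= threshold            # shape: (n_models, n_rows)
    any_confident = above.any(axis=0)
    all_confident = above.all(axis=0)
    if not any_confident.any():
        return float("nan")                 # no high-confidence rows to compare
    return float((any_confident & ~all_confident).sum() / any_confident.sum())

# Toy scores for five candles (made up for illustration).
toy_scores = {
    "lgbm":     [0.91, 0.85, 0.40, 0.83, 0.20],
    "xgb":      [0.88, 0.70, 0.35, 0.81, 0.25],
    "catboost": [0.90, 0.82, 0.30, 0.86, 0.15],
}
rate = high_confidence_disagreement(toy_scores)
print(f"disagreement on high-confidence rows: {rate:.0%}")
```

A rate near zero would suggest the models are making the same calls and the ensemble adds little beyond a single model.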
Walk-Forward Training: Why Not a Single Train/Test Split
Crypto markets are non-stationary. The data-generating process in March is not the same as in November. A single train/test split trains on old regimes and tests on adjacent data, which overstates performance.
We use four walk-forward folds. Each fold trains on an expanding window and tests on the next 5-day block. This simulates how the model would have behaved in production if it had been retrained before each block on all the data available at that point.
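```python
def walk_forward_folds(df, n_splits=4, test_days=5):
    # Sort chronologically so every test block lies strictly after its training data.
    # The returned indices refer to this sorted frame, so sort before using them.
    df = df.sort_values("timestamp").reset_index(drop=True)
    # Approximate number of rows in a block of `test_days` days.
    n_days = df["timestamp"].dt.date.nunique()
    fold_size = int(len(df) * test_days / n_days)
    if fold_size == 0 or fold_size * n_splits >= len(df):
        raise ValueError("not enough data for the requested folds")
    folds = []
    for i in range(n_splits):
        test_start = len(df) - fold_size * (n_splits - i)
        test_end = test_start + fold_size
        train_idx = df.index[:test_start]  # expanding training window
        test_idx = df.index[test_start:test_end]
        folds.append((train_idx, test_idx))
    return folds
```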
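The same expanding-window layout can also be produced with scikit-learn's `TimeSeriesSplit`, which takes a fixed `test_size` in rows. This is a minimal sketch, assuming the number of rows per test block is known up front and the rows are already in chronological order. `expanding_folds` and its defaults are illustrative names, not part of the pipeline described here.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

def expanding_folds(n_rows: int, n_splits: int = 4, test_size: int = 10):
    """Return (train_idx, test_idx) pairs with an expanding training window.

    The last n_splits * test_size rows become the test blocks; each training
    set is every row that comes before its test block.
    """
    splitter = TimeSeriesSplit(n_splits=n_splits, test_size=test_size)
    placeholder = np.zeros((n_rows, 1))  # only the row count matters here
    return list(splitter.split(placeholder))

folds = expanding_folds(n_rows=100, n_splits=4, test_size=10)
for train_idx, test_idx in folds:
    print(f"train rows 0..{train_idx[-1]}, test rows {test_idx[0]}..{test_idx[-1]}")
```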
Each fold produces an AUC on its out-of-sample test set. We average those four AUCs to get the combined OOS AUC per model. This is the number we trust — not the calibration-set AUC, which is optimistic.
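The averaging step can be sketched as follows. It assumes one result dict per fold, keyed by model name with an "auc" entry; `combined_oos_auc` is a hypothetical helper name and the AUC values are made up for illustration.

```python
import numpy as np

def combined_oos_auc(fold_results: list) -> dict:
    """Average each model's out-of-sample AUC across folds.

    `fold_results` holds one dict per fold, shaped like
    {"lgbm": {"auc": 0.61}, "xgb": {"auc": 0.59}, ...}.
    A NaN in any fold propagates to that model's average on purpose,
    so downstream code can detect it.
    """
    names = list(fold_results[0].keys())
    return {
        name: float(np.mean([fold[name]["auc"] for fold in fold_results]))
        for name in names
    }

# Illustrative fold AUCs, not production numbers.
example_folds = [
    {"lgbm": {"auc": 0.60}, "xgb": {"auc": 0.58}},
    {"lgbm": {"auc": 0.62}, "xgb": {"auc": 0.56}},
    {"lgbm": {"auc": 0.58}, "xgb": {"auc": 0.60}},
    {"lgbm": {"auc": 0.60}, "xgb": {"auc": 0.58}},
]
oos = combined_oos_auc(example_folds)
```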
Training All Three Models on the Same Fold
```python
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.metrics import roc_auc_score

def train_fold(X_train, y_train, X_test, y_test):
    # Fresh, unfitted instances on every call so nothing carries over between folds.
    models = {
        "lgbm": LGBMClassifier(n_estimators=800, learning_rate=0.03, max_depth=7),
        "xgb": XGBClassifier(n_estimators=800, learning_rate=0.03, max_depth=6),
        "catboost": CatBoostClassifier(iterations=800, learning_rate=0.03, depth=6, verbose=0),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        # AUC needs a continuous score; predict() would return hard 0/1 labels.
        raw = model.predict_proba(X_test)[:, 1]
        auc = roc_auc_score(y_test, raw)
        results[name] = {"model": model, "auc": auc, "scores": raw}
    return results
```
One training run produces 12 models (3 algorithms x 4 folds) and 12 AUCs. We then collapse to 3 final models by training each algorithm on 100% of the data, using the fold AUCs only for weighting.
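The final refit can be sketched as below. It assumes each algorithm is supplied as a zero-argument factory returning a fresh estimator with a scikit-learn style `fit` method. `fit_final_models` and `model_factories` are hypothetical names introduced for this example.

```python
def fit_final_models(model_factories: dict, X, y, weights: dict) -> dict:
    """Refit each algorithm on all rows and pair it with its ensemble weight.

    `model_factories` maps a model name to a zero-argument callable that
    returns an unfitted estimator, for example
    `lambda: LGBMClassifier(n_estimators=800)`.
    `weights` maps the same names to weights derived from the fold AUCs.
    Returns {name: (fitted_model, weight)}.
    """
    fitted = {}
    for name, make_model in model_factories.items():
        model = make_model()
        model.fit(X, y)  # 100% of the data, no held-out fold
        fitted[name] = (model, weights[name])
    return fitted
```

The fold models themselves are discarded in this sketch; only their AUCs survive, through `weights`.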
Combining Scores: OOS-AUC Weighted Ensemble
The naive approach is simple averaging. We tried it. It underperforms a weighted average because not all three models contribute equally in every market regime.
Our production weighting uses the combined OOS AUC of each model:
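```python
import numpy as np

def compute_ensemble_weights(oos_aucs: dict) -> dict:
    aucs = np.array(list(oos_aucs.values()), dtype=float)
    # Fall back to equal weights if any AUC is NaN or all AUCs are zero.
    if np.any(np.isnan(aucs)) or aucs.sum() == 0:
        n = len(oos_aucs)
        return {name: 1.0 / n for name in oos_aucs}
    # Weight by edge over random (AUC - 0.5); the floor keeps every weight positive.
    shifted = np.maximum(aucs - 0.5, 0.001)
    weights = shifted / shifted.sum()
    return dict(zip(oos_aucs.keys(), weights))

def ensemble_predict(models_and_weights, X):
    combined = np.zeros(len(X))
    for name, (model, weight) in models_and_weights.items():
        # predict() on a classifier returns 0/1 labels, and sigmoid(1) is about
        # 0.73, which can never clear a 0.80 gate. With the default binary
        # log-loss objective, predict_proba[:, 1] is the sigmoid of the raw margin.
        sigmoid = model.predict_proba(X)[:, 1]
        combined += sigmoid * weight
    return combined
```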
Two details matter here:
- Shift by 0.5 before weighting. AUC of 0.51 is almost nothing. AUC of 0.65 is strong. The gap between them should dominate the weighting, not be diluted by the baseline 0.5.
- NaN fallback to 1/N weights. Tiny test sets occasionally produce NaN AUCs. If any AUC is NaN, fall back to equal weighting rather than letting NaN poison the whole ensemble.
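To see how much the 0.5 shift changes the weights, here is a small worked example comparing it with plain normalization of the raw AUCs. The three AUC values are made up for illustration.

```python
import numpy as np

aucs = {"lgbm": 0.51, "xgb": 0.58, "catboost": 0.65}  # illustrative values
values = np.array(list(aucs.values()))

naive = values / values.sum()               # normalize the raw AUCs
shifted = np.maximum(values - 0.5, 0.001)   # edge over random, with a small floor
edge_weighted = shifted / shifted.sum()

for name, n, e in zip(aucs, naive, edge_weighted):
    print(f"{name:9s} naive={n:.3f} shifted={e:.3f}")
# lgbm      naive=0.293 shifted=0.042
# xgb       naive=0.333 shifted=0.333
# catboost  naive=0.374 shifted=0.625
```

Without the shift, the near-random model still gets almost 30% of the vote; with it, the weight drops to about 4%.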
The Production Gate: Sigmoid 0.80 + EV Filter
After the ensemble produces a combined sigmoid score, we apply two gates:
```python
def production_signal(ensemble_score, expected_value, side):
    threshold = 0.80
    if side == "BUY":
        return ensemble_score >= threshold
    else:
        return ensemble_score >= threshold and expected_value > 0
```
The EV > 0 filter for SELL trades was added on February 27, 2026 after two weeks of paper trading showed that SELL signals with negative expected value systematically underperformed. BUY trades ignore the EV filter because the BUY regressor in V2 has a mean EV of -1.27% (it systematically underestimates BUY outcomes), so filtering by EV > 0 would kill the BUY side entirely.
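A filter like this can be checked against a trade log before it is adopted. The sketch below buckets trades by side and by the sign of the predicted EV, then reports the win rate per bucket. The field names (`side`, `expected_value`, `pnl`) and the sample rows are hypothetical and are not taken from the APIndicators logs.

```python
from collections import defaultdict

def win_rate_by_ev_sign(trades: list) -> dict:
    """Win rate per (side, EV sign) bucket.

    Each trade is a dict with hypothetical keys: "side" ("BUY" or "SELL"),
    "expected_value" (predicted return) and "pnl" (realized return).
    A trade counts as a win when its realized return is positive.
    """
    wins = defaultdict(int)
    totals = defaultdict(int)
    for trade in trades:
        ev_sign = "EV>0" if trade["expected_value"] > 0 else "EV<=0"
        bucket = (trade["side"], ev_sign)
        totals[bucket] += 1
        wins[bucket] += int(trade["pnl"] > 0)
    return {bucket: wins[bucket] / totals[bucket] for bucket in totals}

# Synthetic rows for illustration only.
sample_trades = [
    {"side": "SELL", "expected_value": 0.4, "pnl": 1.2},
    {"side": "SELL", "expected_value": -0.3, "pnl": -0.8},
    {"side": "SELL", "expected_value": -0.1, "pnl": 0.5},
    {"side": "BUY", "expected_value": -1.0, "pnl": 0.9},
]
rates = win_rate_by_ev_sign(sample_trades)
```

If the EV<=0 bucket for a side does not clearly trail the EV>0 bucket, the filter is probably not earning its place on that side.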
What V2 Actually Looks Like in Production
Live since February 18, 2026, maximum 10 simultaneous open orders, 10-minute signal refresh, 1-hour trading interval. Current numbers:
- Total trades: 973
- BUY win rate: 59.0% (avg +0.319% per trade)
- SELL win rate: 50.3% (flat)
- Overall: 54.6%
The SELL side is the hard problem. Our walk-forward evaluations show the SELL signal inverting at high raw sigmoid thresholds in some folds: the model starts anti-predicting. We suspect this is regime-specific, and our March retrains show SELL slowly recovering. For now, V2 carries both sides because removing SELL would cut order volume in half, and the EV filter keeps the worst SELLs out.
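Inversion of this kind shows up when the win rate is tabulated by score threshold: a healthy signal wins more often as the threshold rises, while an inverting one wins less often. This is a minimal sketch, assuming ensemble scores in [0, 1] and binary win/loss outcomes. The function name and the toy data are illustrative.

```python
import numpy as np

def win_rate_by_threshold(scores, outcomes, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Win rate of the signals that clear each threshold.

    `scores` are ensemble scores in [0, 1]; `outcomes` are 1 for a winning
    trade and 0 for a losing one. Returns {threshold: (n_signals, win_rate)},
    with NaN as the win rate when no signal clears the threshold.
    """
    scores = np.asarray(scores, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    table = {}
    for th in thresholds:
        mask = scores >= th
        n = int(mask.sum())
        win_rate = float(outcomes[mask].mean()) if n else float("nan")
        table[th] = (n, win_rate)
    return table

# Toy data where the highest-scoring signals lose, i.e. an inverted signal.
toy = win_rate_by_threshold(
    scores=[0.55, 0.65, 0.75, 0.85, 0.95],
    outcomes=[1, 1, 1, 0, 0],
)
for th, (n, wr) in toy.items():
    print(f"threshold {th:.1f}: {n} signals, win rate {wr:.2f}")
```

Running this per fold and per side makes it easier to tell whether an inversion is confined to a few folds or is persistent.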
Key Takeaways
- Three different gradient boosting libraries, weighted by walk-forward OOS AUC, outperform any single model.
- Always use raw sigmoid scores, never isotonic calibration (we have a whole post on this).
- Build a NaN fallback into your ensemble weighting. Walk-forward folds with tiny test sets will break naive weighting code.
- Track real-money metrics separately from paper trading metrics. They diverge in ways you do not expect.
- Add filters like EV > 0 only after you have production data to justify them. Adding filters up front adds complexity without evidence.
You can hit the live V2 ensemble via the APIndicators prediction endpoint — 50+ indicators, 470+ pairs, ML predictions updated every 10 minutes. See /pricing for plans or /docs for the full API reference.