Chapter 5 — ML & ensemble integration¶
Guided time: 6–8 hours
Prerequisites: Chapter 4 — Engine
Next: Chapter 6 — Quantum & sentiment
1. Purpose¶
Machine learning here is not a Kaggle leaderboard exercise—it is a probability supplier for a trading policy. This chapter explains the contract between ml/ensemble_model.py and engine/signal_generator.py, how missing models behave, and how to safely plug in trained classifiers.
Disclaimer: Teaching integration does not guarantee profitable trading. Stationarity, fees, slippage, and regime change dominate live results.
2. Objectives¶
- State the predict_proba batch contract (input shape, output shape, class order).
- Configure XGB_MODEL_PATH / ENSEMBLE_MODEL_PATH (or repo default paths) to load joblib models.
- Explain SafeModel in ml/model_loader.py and why it exists.
- Compare the ensemble_model path with xgb_model.py (alternative wiring; may need load_xgb completion in your fork).
3. Contract: what SignalGenerator expects¶
SignalGenerator.generate calls (conceptually):
P = ensemble_predict_proba([feature_row])
Where:
- Input is a list of rows; each row is a list of floats (one bar → one row typical).
- Output is a list of rows; each row has three probabilities aligned to [SELL, BUY, HOLD] after normalization inside the generator’s logic.
If the list is empty or malformed, the generator falls back to HOLD with zero confidence—safe default.
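The contract can be exercised with a minimal sketch. The 8-feature dimension and the stand-in function below are placeholders, not the repo's actual code; match the dimension to your _to_feature_vector:

```python
# Minimal contract check: list of rows in, list of probability triples out.
feature_row = [0.0] * 8  # placeholder dimension

def neutral_predict_proba(X):
    """Stand-in for ensemble_predict_proba when no model is loaded."""
    return [[1 / 3, 1 / 3, 1 / 3] for _ in X]

P = neutral_predict_proba([feature_row])
assert len(P) == 1                   # one row in, one row out
assert len(P[0]) == 3                # [SELL, BUY, HOLD]
assert abs(sum(P[0]) - 1.0) < 1e-9   # normalized
```

A malformed or empty P should never reach the policy; the generator's HOLD fallback covers that case.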
4. Ensemble class (lazy singleton pattern)¶
ml/ensemble_model.py defines:
- Ensemble — loads a model from disk via load_model(path) when a path is provided.
- predict_proba(X) — for each feature row:
  - If a model is loaded, call model.predict_proba([feats]) and map two-class or three-class outputs into normalized triples.
  - If no model, emit [1/3, 1/3, 1/3] per row so the pipeline still runs (smoke tests rely on this).
Pedagogy point: distinguish “no model” (explicit neutral distribution) from “model predicts uncertainty” (learned probabilities).
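The lazy-loading and neutral-fallback behavior can be sketched as follows. The load_model stub and the two-class-to-triple mapping are assumptions for illustration; check the actual logic in ml/ensemble_model.py:

```python
import os

def load_model(path):
    """Stub standing in for the repo's loader; assumed to joblib-load a file."""
    import joblib
    return joblib.load(path)

class Ensemble:
    """Sketch of the ensemble wrapper; mapping details are illustrative."""

    def __init__(self, path=None):
        self.model = None
        if path and os.path.exists(path):
            self.model = load_model(path)

    def predict_proba(self, X):
        if self.model is None:
            # Explicit neutral distribution: pipeline still runs without a model.
            return [[1 / 3, 1 / 3, 1 / 3] for _ in X]
        out = []
        for feats in X:
            p = list(self.model.predict_proba([feats])[0])
            if len(p) == 2:
                # Assumed mapping: [down, up] -> [SELL, BUY, HOLD=0].
                p = [p[0], p[1], 0.0]
            s = sum(p) or 1.0
            out.append([v / s for v in p])  # renormalize to a valid triple
        return out
```

Note how the no-model branch is an explicit design choice, not an error path.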
5. SafeModel wrapper¶
ml/model_loader.py wraps raw sklearn-like estimators:
- If predict_proba exists, use it.
- If not, synthesize probabilities from predict outputs (including scalar mappings).
This reduces crashes when experimenting with heterogeneous model types—at the cost of semantic ambiguity if you feed the wrong estimator. Review synthesized paths before live trading.
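A sketch of the wrapper idea, with an illustrative synthesis rule (the real rule in ml/model_loader.py may differ; the 0.9 mass on the predicted class is an assumption):

```python
class SafeModel:
    """Wrap an estimator so callers can always request probabilities."""

    def __init__(self, estimator, n_classes=3):
        self.estimator = estimator
        self.n_classes = n_classes

    def predict_proba(self, X):
        if hasattr(self.estimator, "predict_proba"):
            return self.estimator.predict_proba(X)
        # Synthesize: 0.9 on the predicted class, the rest spread evenly.
        rows = []
        for label in self.estimator.predict(X):
            row = [0.1 / (self.n_classes - 1)] * self.n_classes
            row[int(label) % self.n_classes] = 0.9
            rows.append(row)
        return rows
```

The synthesized rows sum to 1.0, but they encode arbitrary confidence, which is exactly the semantic ambiguity warned about above.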
6. Bringing your own model¶
Training (outside this repo)¶
Typical pattern:
- Build a labeled dataset aligned with your feature definition in signal_generator.py (or replace features deliberately).
- Train a classifier with predict_proba (two-class is common).
- joblib.dump(model, "models/xgb.pkl") (path illustrative).
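The steps above can be sketched end to end. Everything here is illustrative: the 8-feature dimension, the toy labels, and the model choice are assumptions, and a temp directory stands in for the repo's models/ path:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))      # 500 bars x 8 features (placeholder dimension)
y = (X[:, 0] > 0).astype(int)      # toy binary label: 0 = down, 1 = up

# Any estimator exposing predict_proba satisfies the contract.
model = LogisticRegression().fit(X, y)
assert model.predict_proba(X[:1]).shape == (1, 2)

# "models/xgb.pkl" in the repo is illustrative; a temp dir keeps this runnable.
path = os.path.join(tempfile.mkdtemp(), "xgb.pkl")
joblib.dump(model, path)
```

Point XGB_MODEL_PATH at the dumped file to wire it in.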
Wiring¶
Set env vars or place files where ensemble_model._default_model_path() searches:
- XGB_MODEL_PATH
- ENSEMBLE_MODEL_PATH
- models/xgb.pkl
- models/ensemble.pkl
Restart the process after changes—module-level singletons cache the ensemble.
7. Evaluation mindset (lightweight)¶
Before live deployment, at minimum:
- Calibration — raw probabilities may be miscalibrated; consider Platt scaling or isotonic regression if you rely on thresholds.
- Leakage — ensure labels do not peek into the future relative to features.
- Fees — policy thresholds should be interpreted net of costs if you simulate P&L.
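The calibration point can be sketched with scikit-learn. The dataset, model, and isotonic choice below are illustrative, not a recommendation:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 8))
y = (X[:, 0] + 0.5 * rng.normal(size=600) > 0).astype(int)

# Wrap a possibly miscalibrated model; cross-validated isotonic regression
# remaps its scores toward empirical frequencies.
raw = RandomForestClassifier(n_estimators=50, random_state=0)
calibrated = CalibratedClassifierCV(raw, method="isotonic", cv=3).fit(X, y)

p = calibrated.predict_proba(X[:5])
assert p.shape == (5, 2)
```

If your policy thresholds at, say, 0.6, calibration determines whether "0.6" actually means roughly 60% hit rate.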
This course does not implement a full backtester; add a chapter if you productize one.
8. Labs¶
Lab 5.1 — Neutral path (30 min)¶
Run smoke without any model file. Confirm triple [~0.33, ~0.33, ~0.33] appears in logs.
Lab 5.2 — Stub model (120+ min)¶
Train a trivial sklearn model on random data only to exercise plumbing:
- Same feature dimension as _to_feature_vector.
- Save via joblib.
- Point the env path to it.
- Re-run smoke; observe whether BUY/SELL ever appears (may still be rare).
Lab 5.3 — Contract test (45 min)¶
Write a unit test (in tests/ or personal folder) that mocks predict_proba to return an extreme BUY vector and asserts SignalGenerator returns BUY with high confidence.
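A self-contained starting point for this lab, using unittest.mock. The argmax-over-[SELL, BUY, HOLD] decision rule and the 8-feature row are assumptions; in your real test, call SignalGenerator itself instead of the inline rule:

```python
from unittest.mock import MagicMock

# Mock the ensemble to return an extreme BUY vector.
ensemble = MagicMock()
ensemble.predict_proba.return_value = [[0.05, 0.90, 0.05]]

CLASSES = ["SELL", "BUY", "HOLD"]  # order assumed; verify against the generator
probs = ensemble.predict_proba([[0.0] * 8])[0]
signal = CLASSES[max(range(len(probs)), key=probs.__getitem__)]
confidence = max(probs)

assert signal == "BUY"
assert confidence >= 0.9
ensemble.predict_proba.assert_called_once()
```

The value of this test is the contract pin: if someone reorders classes or changes the output shape, it fails loudly.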
9. Exercises¶
- Why is batching predict_proba([row]) preferred over per-row import side effects?
- What breaks if the class order were [BUY, SELL, HOLD] but the code assumes [SELL, BUY, HOLD]?
- Locate _to_feature_vector and list the numeric features it constructs.
10. Notebook¶
notebooks/01_feature_vector_and_proba.ipynb:
- Reimplement _to_feature_vector in a cell.
- Plot histograms of dummy probabilities under neutral vs synthetic BUY-skewed model outputs.
11. Summary¶
The ML layer’s job is to emit probabilities the policy can threshold. The policy’s job is to stay safe when ML is absent or wrong. Chapter 6 adds optional quantum portfolio shaping and external sentiment feeds on top of this foundation.