# DOLPHIN-NAUTILUS: E2E Master Validation Plan
# "From Champion Backtest to Production Fidelity"

**Authored**: 2026-03-07
**Authority**: Post-MIG7 production readiness gate. No live capital until this plan completes green.
**Principle**: Every phase ends in a written, dated, signed-off result. No skipping forward on "probably fine."
**Numeric fidelity target**: Trade-by-trade log identity to full float64 precision where deterministic.
Stochastic components (OB live data, ExF timing jitter) are isolated and accounted for explicitly.

---

## Prerequisites: Before Any Phase Begins

```bash
# All daemons stopped. Clean state.
# Docker stack healthy:
docker ps   # hazelcast:5701, hazelcast-mc:8080, prefect:4200 all Up

# Activate venv. ALL commands below assume this:
source "/c/Users/Lenovo/Documents/- Siloqy/Scripts/activate"
cd "/c/Users/Lenovo/Documents/- DOLPHIN NG HD HCM TSF Predict"
```

---

## PHASE 0: Blue/Green Audit

**Goal**: Confirm blue and green configs are identical where they should be, and differ
only where intentionally different (direction, IMap names, log dirs).

### AUDIT-1: Config structural diff

```bash
# Quoted heredoc: the shell passes the Python source through untouched
# (inside double quotes, an interactive bash would try to history-expand `!r`).
python - <<'PY'
import yaml

with open('prod/configs/blue.yml') as f:
    blue = yaml.safe_load(f)
with open('prod/configs/green.yml') as f:
    green = yaml.safe_load(f)

EXPECTED_DIFFS = {'strategy_name', 'direction'}
HZ_DIFFS = {'imap_state', 'imap_pnl'}
LOG_DIFFS = {'log_dir'}

def flatten(d, prefix=''):
    """Flatten nested dicts into dotted keys, e.g. hazelcast.imap_state."""
    out = {}
    for k, v in d.items():
        key = f'{prefix}.{k}' if prefix else k
        if isinstance(v, dict):
            out.update(flatten(v, key))
        else:
            out[key] = v
    return out

fb, fg = flatten(blue), flatten(green)
all_keys = set(fb) | set(fg)
diffs = {k: (fb.get(k), fg.get(k)) for k in all_keys if fb.get(k) != fg.get(k)}

print('=== Config diffs (blue vs green) ===')
for k, (b, g) in sorted(diffs.items()):
    # Substring match on the dotted key, so nested keys are recognised too.
    expected = any(x in k for x in EXPECTED_DIFFS | HZ_DIFFS | LOG_DIFFS)
    tag = '[OK]' if expected else '[*** UNEXPECTED ***]'
    print(f'  {tag} {k}: blue={b!r} green={g!r}')
PY
```

**Pass**: Only `strategy_name`, `direction`, `hazelcast.imap_state`, `hazelcast.imap_pnl`,
`paper_trade.log_dir` differ. Any other diff = fix before proceeding.

### AUDIT-2: Engine param identity

Both configs must have an identical engine section except where a difference is intentional.
Specifically verify `fixed_tp_pct=0.0095`, `abs_max_leverage=6.0`, `fraction=0.20`,
`max_hold_bars=120`, `vel_div_threshold=-0.02`. These are the champion params:
any deviation from blue in green's engine section is a bug.

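The AUDIT-2 check can be scripted along these lines. This is a minimal sketch, not the project's own tooling: it assumes the parameters sit in a flat `engine` section of each parsed config, and the function name `audit_engine_params` and the `allowed_diffs` argument are hypothetical.

```python
# Champion engine parameters as listed in AUDIT-2.
CHAMPION_PARAMS = {
    "fixed_tp_pct": 0.0095,
    "abs_max_leverage": 6.0,
    "fraction": 0.20,
    "max_hold_bars": 120,
    "vel_div_threshold": -0.02,
}


def audit_engine_params(blue, green, section="engine", allowed_diffs=()):
    """Return a list of problems; an empty list means the audit passes.

    `blue` and `green` are parsed config dicts. The `section` name is an
    assumption about where the engine parameters live in the YAML.
    """
    problems = []
    blue_engine = blue.get(section, {})
    green_engine = green.get(section, {})

    # 1. Both configs must carry the champion values.
    for name, expected in CHAMPION_PARAMS.items():
        for label, engine in (("blue", blue_engine), ("green", green_engine)):
            actual = engine.get(name)
            if actual != expected:
                problems.append(
                    f"{label}.{section}.{name}: expected {expected!r}, got {actual!r}"
                )

    # 2. Every remaining key must match, unless explicitly allowed to differ.
    for name in sorted(set(blue_engine) | set(green_engine)):
        if name in CHAMPION_PARAMS or name in allowed_diffs:
            continue
        if blue_engine.get(name) != green_engine.get(name):
            problems.append(
                f"{section}.{name}: blue={blue_engine.get(name)!r} "
                f"green={green_engine.get(name)!r}"
            )
    return problems
```

Feed it the two dicts returned by `yaml.safe_load` in AUDIT-1 and treat any returned line as a blocker.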
### AUDIT-3: Code path symmetry

Verify `paper_trade_flow.py` routes `direction_val=1` for green and `direction_val=-1`
for blue. Verify `dolphin_actor.py` does the same. Verify both write to their respective
IMap (`DOLPHIN_PNL_BLUE` vs `DOLPHIN_PNL_GREEN`).

**AUDIT GATE**: All 3 checks green → sign off with date. Then proceed to REGRESSION.

---

## PHASE 1: Full Regression

**Goal**: Clean slate. Every existing test passes. No regressions from MIG7 work.

```bash
python -m pytest ci/ -v --tb=short 2>&1 | tee run_logs/regression_$(date +%Y%m%d_%H%M%S).log
```

**Expected**: 14/14 tests green (test_13×6 + test_14×3 + test_15×1 + test_16×4).
**Also run** the original 5 CI layers:

```bash
bash ci/run_ci.sh 2>&1 | tee run_logs/ci_full_$(date +%Y%m%d_%H%M%S).log
```

Fix any failures before proceeding. Zero tolerance.

---

## PHASE 2: ALGOx Series, Pre/Post MIG Numeric Parity

**Goal**: Prove the production NDAlphaEngine produces numerically identical results to
the pre-MIG champion backtest. Trade by trade. Bar by bar. Float by float.

**The guarantee**: NDAlphaEngine uses `seed=42` → deterministic numba PRNG. Given
identical input data in identical order, output must be bit-for-bit identical for all
non-stochastic paths (OB=MockOBProvider, ExF=static, no live HZ).

### ALGO-1: Capture Pre-MIG Reference

Run the original champion test to produce the definitive reference log:

```bash
python nautilus_dolphin/test_pf_dynamic_beta_validate.py \
    2>&1 | tee run_logs/PREMIG_REFERENCE_$(date +%Y%m%d_%H%M%S).log
```

This produces:
- `run_logs/trades_YYYYMMDD_HHMMSS.csv`, trade-by-trade: asset, direction, entry_bar,
  exit_bar, entry_price, exit_price, pnl_pct, pnl_absolute, leverage, exit_reason
- `run_logs/daily_YYYYMMDD_HHMMSS.csv`, per-date: capital, pnl, trades, boost, beta, mc_status
- `run_logs/summary_YYYYMMDD_HHMMSS.json`, aggregate: ROI, PF, DD, Sharpe, WR, Trades

**Expected aggregate** (champion, frozen):
ROI=+44.89%, PF=1.123, DD=14.95%, Sharpe=2.50, WR=49.3%, Trades=2128

If the pre-MIG test no longer produces this, stop. Something has regressed in the engine.
Restore from backup before proceeding.

**Label these files**: `PREMIG_REFERENCE_*`. Do not overwrite them.

### ALGO-2: Post-MIG Engine Parity (Batch Mode, No HZ)

Create `ci/test_algo2_postmig_parity.py`.

This test runs the SAME 55-day dataset (Dec31–Feb25, vbt_cache_klines parquets)
through `NDAlphaEngine` via the production `paper_trade_flow.py` code path, but with:
- HZ disabled (no client connection: use `--no-hz` flag or mock HZ)
- MockOBProvider (same as pre-MIG, static 62% fill, -0.09 imbalance bias)
- ExF disabled (no live fetch: use static zero vector as pre-MIG did)
- `seed=42`, all params from `blue.yml`

Then compare output trade CSV against `PREMIG_REFERENCE_trades_*.csv`:

```python
# Comparison logic: every trade must match.
# pre_trades / post_trades are lists of dict rows with numeric columns already parsed.
for i, (pre, post) in enumerate(zip(pre_trades, post_trades)):
    assert pre['asset'] == post['asset'], f"Trade {i}: asset mismatch"
    assert pre['direction'] == post['direction'], f"Trade {i}: direction mismatch"
    assert pre['entry_bar'] == post['entry_bar'], f"Trade {i}: entry_bar mismatch"
    assert pre['exit_bar'] == post['exit_bar'], f"Trade {i}: exit_bar mismatch"
    assert abs(pre['entry_price'] - post['entry_price']) < 1e-9, f"Trade {i}: entry_price mismatch"
    assert abs(pre['pnl_pct'] - post['pnl_pct']) < 1e-9, f"Trade {i}: pnl_pct mismatch"
    assert abs(pre['leverage'] - post['leverage']) < 1e-9, f"Trade {i}: leverage mismatch"
    assert pre['exit_reason'] == post['exit_reason'], f"Trade {i}: exit_reason mismatch"
# zip() stops at the shorter list, so the count check catches missing or extra trades.
assert len(pre_trades) == len(post_trades), f"Trade count mismatch: {len(pre_trades)} vs {len(post_trades)}"
```

**Pass**: All 2128 trades match to 1e-9 precision. Zero divergence.

**If divergence found**: Binary search the 55-day window to find the first diverging trade.
Read that date's bar-level state log to identify the cause. Fix before proceeding.

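When both trade logs are already on disk, a single pass over them is enough to locate the first diverging trade, and bisecting the date window only becomes necessary when runs have to be repeated. The helper below is a sketch under that assumption: `first_divergence` is a hypothetical name, and the field groups mirror the columns compared in ALGO-2.

```python
import math

# Columns compared in ALGO-2, grouped by how they are compared.
EXACT_FIELDS = ("asset", "direction", "entry_bar", "exit_bar", "exit_reason")
FLOAT_FIELDS = ("entry_price", "pnl_pct", "leverage")


def first_divergence(pre_trades, post_trades, tol=1e-9):
    """Return (index, field, pre_value, post_value) for the first mismatch.

    Returns None when every paired trade matches and the counts agree.
    """
    for i, (pre, post) in enumerate(zip(pre_trades, post_trades)):
        for field in EXACT_FIELDS:
            if pre[field] != post[field]:
                return i, field, pre[field], post[field]
        for field in FLOAT_FIELDS:
            a, b = float(pre[field]), float(post[field])
            # Absolute tolerance only; NaN never compares close, so it is reported.
            if not math.isclose(a, b, rel_tol=0.0, abs_tol=tol):
                return i, field, a, b
    if len(pre_trades) != len(post_trades):
        # All paired trades matched; one log simply has extra trades at the end.
        i = min(len(pre_trades), len(post_trades))
        return i, "<trade count>", len(pre_trades), len(post_trades)
    return None
```

The returned index points at the trade whose `entry_bar` and date identify which day's bar-level state log to open first.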
### ALGO-3: Sub-Day ACB Path Parity

Run the same 55-day dataset WITH ACB listener active but no boost changes arriving
(no `acb_processor_service` running → `_pending_acb` stays None throughout).
Output must be identical to ALGO-2. This confirms the ACB listener path is truly
inert when no boost events arrive.

```python
assert result == algo2_result  # exact dict comparison
```

### ALGO-4: Full Stack Parity (HZ+Prefect Active, MockOB, Static ExF)

Start HZ. Start Prefect. Run paper_trade_flow.py for the 55-day window in replay mode
(historical parquets, not live data). MockOBProvider. ExF from static file (not live fetch).

Output must match ALGO-2 exactly. This confirms HZ state persistence, posture reads,
and IMap writes do NOT alter the algo computation path.

**This is the critical gate**: if HZ introduces any non-determinism into the engine,
it shows up here.

### ALGO-5: Bar-Level State Log Comparison

Instrument `esf_alpha_orchestrator.py` to optionally emit a per-bar state log:

```
bar_idx | vel_div | vol_regime_ok | position_open | regime_size_mult | boost | beta | action
```

Run pre-MIG reference and post-MIG batch on the same date. Compare bar-by-bar.
Every numeric field must match to float64 precision.

**This is the flint-512 resolution check.** If ALGO-2 passes but this fails on a
specific field, that field has a divergence the aggregate metrics hid.

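A float64-exact comparison has to look at bit patterns rather than tolerances. The sketch below assumes both bar logs are loaded as lists of dict rows keyed by the column names shown in the state-log header; `compare_bar_logs` and `float_bits` are hypothetical helper names. It also assumes floats were written with `repr()` (or `float.hex()`), since a fixed-width format such as `%.6f` would discard the bits this check is meant to compare.

```python
import struct

# Column groups taken from the per-bar state log header.
NUMERIC_FIELDS = ("vel_div", "regime_size_mult", "boost", "beta")
EXACT_FIELDS = ("bar_idx", "vol_regime_ok", "position_open", "action")


def float_bits(value):
    """Return the raw IEEE-754 float64 bit pattern of `value` as an int."""
    return struct.unpack("<Q", struct.pack("<d", float(value)))[0]


def compare_bar_logs(pre_rows, post_rows):
    """Return a list of (row_index, field, pre_value, post_value) mismatches."""
    mismatches = []
    if len(pre_rows) != len(post_rows):
        mismatches.append((None, "<row count>", len(pre_rows), len(post_rows)))
    for i, (pre, post) in enumerate(zip(pre_rows, post_rows)):
        for field in EXACT_FIELDS:
            if pre[field] != post[field]:
                mismatches.append((i, field, pre[field], post[field]))
        for field in NUMERIC_FIELDS:
            # Bitwise equality: distinguishes 0.0 from -0.0 and treats
            # identical NaN payloads as equal, unlike `==`.
            if float_bits(pre[field]) != float_bits(post[field]):
                mismatches.append((i, field, pre[field], post[field]))
    return mismatches
```

Unlike the 1e-9 tolerance used for trades in ALGO-2, this reports a difference in the last bit, which is the resolution ALGO-5 asks for.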
**ALGO GATE**: ALGO-2 through ALGO-5 all green → algo is certified production-identical.
Document with date, trade count, first/last trade ID, aggregate metrics.

---

## PHASE 3: PREFLIGHTx Series, Systemic Reliability

**Goal**: Find everything that can go wrong before it goes wrong with real capital.
No network/infra simulation: pure systemic/concurrency/logic bugs.

### PREFLIGHT-1: Concurrent ACB + Execution Race Stress

Spawn 50 threads simultaneously calling `engine.update_acb_boost()` with random values
while the main thread runs `process_day()`. Verify:
- No crash, no deadlock
- Final `position` state is consistent (not half-closed, not double-closed)
- `_pending_acb` mechanism absorbs all concurrent writes safely

```python
from concurrent.futures import ThreadPoolExecutor
from random import random

# Run 1000 iterations. Any assertion failure = race condition confirmed.
for _ in range(1000):
    engine = NDAlphaEngine(seed=42, ...)  # remaining constructor args elided
    # ... inject position ...
    with ThreadPoolExecutor(max_workers=50) as ex:
        futures = [ex.submit(engine.update_acb_boost, random(), random()) for _ in range(50)]
        engine.process_day(...)  # runs while the 50 boost updates are in flight
    for f in futures:
        f.result()  # re-raise any exception swallowed by a worker thread
    assert engine.position is None or engine.position.asset in valid_assets
```

### PREFLIGHT-2: Daemon Restart Mid-Day

While paper_trade_flow.py is mid-execution (historical replay, fast clock):
1. Kill `acb_processor_service` → verify engine falls back to last known boost, does not crash
2. Kill HZ → verify `paper_trade_flow` falls back to JSONL ledger, does not crash, resumes
3. Kill and restart `system_watchdog_service` → verify posture snaps back to APEX after restart
4. Kill and restart HZ → verify client reconnects, IMap state survives (HZ persistence)

Each kill/restart is a separate PREFLIGHT-2.N sub-test with a pass/fail log entry.

### PREFLIGHT-3: `_processed_dates` Set Growth

Run a simulated 795-day replay through `DolphinActor.on_bar()` (mocked bars, no real HZ).
Verify `_processed_dates` does not grow unboundedly. It should be cleared on `on_stop()`
and not accumulate across sessions.

If it grows to 795 entries and is never cleared: add `self._processed_dates.clear()` to
`on_stop()` and document as a found bug.

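The shape of this check can be sketched against a stand-in. `ActorStub` below is hypothetical: it mimics only the once-per-date bookkeeping and makes no claim about how `DolphinActor` is actually implemented. In the real test, `DolphinActor` with mocked bars takes its place.

```python
from datetime import date, timedelta


class ActorStub:
    """Hypothetical stand-in for the per-date bookkeeping under test."""

    def __init__(self):
        self._processed_dates = set()

    def on_bar(self, bar_date):
        # Do the once-per-day work only the first time a date is seen.
        if bar_date in self._processed_dates:
            return False
        self._processed_dates.add(bar_date)
        return True

    def on_stop(self):
        # The fix proposed in PREFLIGHT-3: drop the set when the session ends.
        self._processed_dates.clear()


def replay(actor, start, n_days, bars_per_day=3):
    """Feed `bars_per_day` bars for each of `n_days` consecutive dates."""
    for offset in range(n_days):
        day = start + timedelta(days=offset)
        for _ in range(bars_per_day):
            actor.on_bar(day)


actor = ActorStub()
replay(actor, date(2024, 1, 1), 795)
entries_in_session = len(actor._processed_dates)  # one entry per replayed date
actor.on_stop()
entries_after_stop = len(actor._processed_dates)  # must be zero after on_stop()
```

If the real actor still holds entries after `on_stop()`, that is the found bug to document.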
### PREFLIGHT-4: Capital Ledger Consistency Under HZ Failure

Run 10 days of paper trading. On day 5, simulate HZ write failure (mock `imap.put` to raise).
Verify:
- JSONL fallback ledger was written on days 1-4
- Day 6 resumes from JSONL ledger with correct capital
- No capital double-counting or reset to 25k

### PREFLIGHT-5: Posture Hysteresis Under Rapid Oscillation

Write a test that rapidly alternates `DOLPHIN_SAFETY` between APEX and HIBERNATE 100 times
per second while `paper_trade_flow.py` reads it. Verify:
- No partial posture state (half APEX half HIBERNATE)
- No trade entered and immediately force-exited due to posture flip
- Hysteresis thresholds in `survival_stack.py` absorb the noise

### PREFLIGHT-6: Survival Stack Rm Boundary Conditions

Feed the survival stack exact boundary inputs (Cat1=0.0, Cat2=0.0, Cat3=1.0, Cat4=0.0, Cat5=0.0)
and verify Rm multiplier matches the analytic formula exactly. Then feed all-zero (APEX expected)
and all-one (HIBERNATE expected). Verify posture transitions at exact threshold values.

### PREFLIGHT-7: Memory Leak Over Extended Replay

Run a 795-bar (1 day, full bar count) simulation 1000 times in a loop. Sample RSS before
and after. Growth > 50 MB = memory leak. Candidate sites: `_price_histories` trim logic,
`trade_history` list accumulation, HZ map handle cache in `ShardedFeatureStore`.

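The loop-and-sample harness could look like the sketch below. It uses the standard library's `tracemalloc` as a stand-in for RSS, which is an assumption worth stating loudly: `tracemalloc` only sees allocations made through Python's allocator after tracing starts, so memory held by native or numba-compiled code will not show up, and a process-level RSS reading is still needed for the 50 MB criterion. `heap_growth_bytes` and `run_once` are hypothetical names.

```python
import gc
import tracemalloc


def heap_growth_bytes(run_once, iterations=1000, warmup=10):
    """Return the change in traced Python heap size across `iterations` calls."""
    for _ in range(warmup):
        run_once()  # let caches, lazy imports and JIT compilation settle first
    gc.collect()
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    for _ in range(iterations):
        run_once()
    gc.collect()  # drop collectable garbage so only retained memory counts
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before


# Example: a deliberately leaky callable that retains ~10 kB per call.
_retained = []


def leaky_day():
    _retained.append(bytearray(10_000))


leak = heap_growth_bytes(leaky_day, iterations=100, warmup=0)
```

Pass a zero-argument callable that runs one full simulated day. If the growth is large, `tracemalloc.take_snapshot()` taken before and after the loop can help point at the allocating lines.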
### PREFLIGHT-8: Seeded RNG Determinism Under Reset

Call `engine.reset()` and re-run the same date. Verify output is bit-for-bit identical
to the first run. The numba PRNG must re-seed correctly on reset.

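One way to state "bit-for-bit identical" in a test is to hash the raw float64 bytes of each run's output and compare digests. The sketch below demonstrates the pattern on a stand-in: `SeededEngineStub` is hypothetical and uses Python's `random.Random`, not numba. For the real engine, keep in mind that numba's generator for jitted code is seeded separately from NumPy's global one, so the re-seed in `reset()` has to run inside compiled code to take effect.

```python
import hashlib
import random
import struct


def fingerprint(values):
    """Hash a sequence of floats by their exact float64 byte patterns."""
    digest = hashlib.sha256()
    for value in values:
        digest.update(struct.pack("<d", float(value)))
    return digest.hexdigest()


class SeededEngineStub:
    """Hypothetical stand-in: reset() must restore the PRNG to its seeded state."""

    def __init__(self, seed=42):
        self._seed = seed
        self._rng = random.Random(seed)

    def reset(self):
        # Re-create the generator so the next run replays the same stream.
        self._rng = random.Random(self._seed)

    def run_day(self, n_bars=100):
        return [self._rng.gauss(0.0, 1.0) for _ in range(n_bars)]


engine = SeededEngineStub(seed=42)
first = fingerprint(engine.run_day())
engine.reset()
second = fingerprint(engine.run_day())   # same digest as `first` if reset works
no_reset = fingerprint(engine.run_day())  # continues the stream, so it differs
```

Comparing digests keeps the failure message short; on a mismatch, fall back to a field-by-field comparison to find the first differing value.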
**PREFLIGHT GATE**: All 8 series pass with zero failures across all iterations.
Document each with date, iteration count, pass/fail, any bugs found and fixed.

---

## PHASE 4: VBT Integration Verification

**Goal**: Confirm `dolphin_vbt_real.py` (the original VBT vectorized backtest) remains
fully operational under the production environment and produces identical results to
its own historical champion run.

### VBT-1: VBT Standalone Parity

```bash
python nautilus_dolphin/dolphin_vbt_real.py --mode backtest --dates 55day \
    2>&1 | tee run_logs/VBT_STANDALONE_$(date +%Y%m%d_%H%M%S).log
```

Compare aggregate metrics against the known VBT champion. VBT and NDAlphaEngine should
agree within float accumulation tolerance (not bit-perfect, since the execution paths
differ, but with metrics within 0.5% of each other).

### VBT-2: VBT Under Prefect Scheduling

Wrap a VBT backtest run as a Prefect flow (or verify it can be triggered from a flow).
Confirm it reads from `vbt_cache_klines` parquets correctly and writes results to
`DOLPHIN_STATE_BLUE` IMap.

### VBT-3: Parquet Cache Freshness

Verify `vbt_cache_klines/` has contiguous parquets from 2024-01-01 to yesterday.
Any gap = data pipeline issue to fix before live trading.

```python
from pathlib import Path

import pandas as pd

# Assumes one parquet per day, named YYYY-MM-DD.parquet.
dates = sorted(f.stem for f in Path('vbt_cache_klines').glob('20*.parquet'))
# End the expected range at yesterday (UTC): today's file is not expected yet.
yesterday = (pd.Timestamp.now(tz='UTC') - pd.Timedelta(days=1)).date()
expected = pd.date_range('2024-01-01', yesterday, freq='D').strftime('%Y-%m-%d').tolist()
missing = set(expected) - set(dates)
print(f"Missing dates: {sorted(missing)}")
```

**VBT GATE**: VBT standalone matches champion metrics, Prefect integration runs,
parquet cache contiguous.

---

## PHASE 5: Final E2E Paper Trade (The Climax)

**Goal**: One complete live paper trading day under full production stack.
Everything real except capital.

### Setup

1. Start all daemons:
   ```bash
   python prod/acb_processor_service.py &
   python prod/system_watchdog_service.py &
   python external_factors/ob_stream_service.py &
   ```
2. Confirm Prefect `mc_forewarner_flow` scheduled and healthy
3. Confirm HZ MC console shows all IMaps healthy (port 8080)
4. Confirm `DOLPHIN_SAFETY` = `{"posture": "APEX", ...}`

### Instrumentation

Before running, enable bar-level state logging in `paper_trade_flow.py`:
- Every bar: `bar_idx, vel_div, vol_regime_ok, posture, boost, beta, position_open, action`
- Every trade entry: full entry record (identical schema to pre-MIG reference)
- Every trade exit: full exit record + exit reason
- End of day: capital, pnl, trades, mc_status, acb_boost, exf_snapshot

Output files:
```
paper_logs/blue/E2E_FINAL_YYYYMMDD_bars.csv      # bar-level state
paper_logs/blue/E2E_FINAL_YYYYMMDD_trades.csv    # trade-by-trade
paper_logs/blue/E2E_FINAL_YYYYMMDD_summary.json  # daily aggregate
```

### The Run

```bash
python prod/paper_trade_flow.py --config prod/configs/blue.yml \
    --date $(date +%Y-%m-%d) \
    --instrument-full \
    2>&1 | tee run_logs/E2E_FINAL_$(date +%Y%m%d_%H%M%S).log
```

### Post-Run Comparison

Compare `E2E_FINAL_*_trades.csv` against the nearest-date pre-MIG trade log:
- Exit reasons distribution should match historical norms (86% MAX_HOLD, ~10% FIXED_TP, ~4% STOP_LOSS)
- WR should be in the 55-65% historical range for this market regime
- Per-trade leverage values should be in the 1x-6x range
- No `SUBDAY_ACB_NORMALIZATION` exits unless boost genuinely dropped intraday

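The checks in this list reduce to a few aggregates over the trade rows. The sketch below assumes each row is a dict with `exit_reason`, `pnl_pct`, and `leverage` keys, as in the trade CSV schema from ALGO-1; `summarize_trades` and `check_run` are hypothetical names, and counting a win as `pnl_pct > 0` is an assumption about how WR is defined.

```python
from collections import Counter


def summarize_trades(trades):
    """Return exit-reason shares, win rate, and leverage bounds for a trade list."""
    if not trades:
        raise ValueError("no trades to summarize")
    n = len(trades)
    reasons = Counter(t["exit_reason"] for t in trades)
    leverages = [float(t["leverage"]) for t in trades]
    wins = sum(1 for t in trades if float(t["pnl_pct"]) > 0.0)  # assumed WR definition
    return {
        "exit_share": {reason: count / n for reason, count in reasons.items()},
        "win_rate": wins / n,
        "min_leverage": min(leverages),
        "max_leverage": max(leverages),
    }


def check_run(summary, min_leverage=1.0, max_leverage=6.0):
    """Return a list of findings that need a human look; empty means nothing flagged."""
    findings = []
    if summary["min_leverage"] < min_leverage or summary["max_leverage"] > max_leverage:
        findings.append(
            f"leverage outside {min_leverage:g}x-{max_leverage:g}x: "
            f"[{summary['min_leverage']:g}, {summary['max_leverage']:g}]"
        )
    if "SUBDAY_ACB_NORMALIZATION" in summary["exit_share"]:
        findings.append("SUBDAY_ACB_NORMALIZATION exits present: confirm boost dropped intraday")
    return findings
```

A single day yields few trades, so the exit-reason shares and win rate are best read next to the historical figures as a sanity check, not as hard thresholds.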
**Pass criteria**: No crashes. Trades produced. All metrics within historical distribution.
Bar-level state log shows correct posture enforcement, boost injection, and capital accumulation.

---

## Sign-Off Checklist

```
[ ] AUDIT: blue/green config diff — only expected diffs found
[ ] REGRESSION: 14/14 CI tests green
[ ] ALGO-1: Pre-MIG reference captured, ROI=+44.89%, Trades=2128
[ ] ALGO-2: Post-MIG batch parity, all 2128 trades match to 1e-9
[ ] ALGO-3: ACB inert path identical to ALGO-2
[ ] ALGO-4: Full HZ+Prefect stack identical to ALGO-2
[ ] ALGO-5: Bar-level state log identical field by field
[ ] PREFLIGHT-1 through -8: all passed, bugs found+fixed documented
[ ] VBT-1: VBT champion metrics reproduced
[ ] VBT-2: VBT Prefect integration runs
[ ] VBT-3: Parquet cache contiguous
[ ] E2E FINAL: Live paper day completed, trades produced, metrics within historical range

Only after all boxes checked: consider 30-day continuous paper trading.
Only after 30-day paper validation: consider live capital.
```

---

*The algo has been built carefully. This plan exists to prove it.
Trust the process. Fix what breaks. Ship what holds.* 🐬