DOLPHIN/prod/docs/E2E_MASTER_PLAN.md
hjnormey 01c19662cb initial: import DOLPHIN baseline 2026-04-21 from dolphinng5_predict working tree
Includes core prod + GREEN/BLUE subsystems:
- prod/ (BLUE harness, configs, scripts, docs)
- nautilus_dolphin/ (GREEN Nautilus-native impl + dvae/ preserved)
- adaptive_exit/ (AEM engine + models/bucket_assignments.pkl)
- Observability/ (EsoF advisor, TUI, dashboards)
- external_factors/ (EsoF producer)
- mc_forewarning_qlabs_fork/ (MC regime/envelope)

Excludes runtime caches, logs, backups, and reproducible artifacts per .gitignore.
2026-04-21 16:58:38 +02:00

# DOLPHIN-NAUTILUS — E2E Master Validation Plan
# "From Champion Backtest to Production Fidelity"
**Authored**: 2026-03-07
**Authority**: Post-MIG7 production readiness gate. No live capital until this plan completes green.
**Principle**: Every phase ends in a written, dated, signed-off result. No skipping forward on "probably fine."
**Numeric fidelity target**: Trade-by-trade log identity to full float64 precision where deterministic.
Stochastic components (OB live data, ExF timing jitter) are isolated and accounted for explicitly.
---
## Prerequisites — Before Any Phase Begins
```bash
# All daemons stopped. Clean state.
# Docker stack healthy:
docker ps # hazelcast:5701, hazelcast-mc:8080, prefect:4200 all Up
# Activate venv — ALL commands below assume this:
source "/c/Users/Lenovo/Documents/- Siloqy/Scripts/activate"
cd "/c/Users/Lenovo/Documents/- DOLPHIN NG HD HCM TSF Predict"
```
---
## PHASE 0 — Blue/Green Audit
**Goal**: Confirm blue and green configs are identical where they should be, and differ
only where intentionally different (direction, IMap names, log dirs).
### AUDIT-1: Config structural diff
```bash
python -c "
import yaml

blue = yaml.safe_load(open('prod/configs/blue.yml'))
green = yaml.safe_load(open('prod/configs/green.yml'))

# Key names that are allowed to differ between blue and green.
EXPECTED_DIFFS = {'strategy_name', 'direction'}
HZ_DIFFS = {'imap_state', 'imap_pnl'}
LOG_DIFFS = {'log_dir'}

def flatten(d, prefix=''):
    # Collapse nested dicts into dotted keys, e.g. hazelcast.imap_state.
    out = {}
    for k, v in d.items():
        key = f'{prefix}.{k}' if prefix else k
        if isinstance(v, dict):
            out.update(flatten(v, key))
        else:
            out[key] = v
    return out

fb, fg = flatten(blue), flatten(green)
all_keys = set(fb) | set(fg)
diffs = {k: (fb.get(k), fg.get(k)) for k in all_keys if fb.get(k) != fg.get(k)}

print('=== Config diffs (blue vs green) ===')
for k, (b, g) in sorted(diffs.items()):
    expected = any(x in k for x in EXPECTED_DIFFS | HZ_DIFFS | LOG_DIFFS)
    tag = '[OK]' if expected else '[*** UNEXPECTED ***]'
    # repr() is called explicitly so an interactive bash does not try history expansion.
    print(f' {tag} {k}: blue={repr(b)} green={repr(g)}')
"
```
**Pass**: Only `strategy_name`, `direction`, `hazelcast.imap_state`, `hazelcast.imap_pnl`,
`paper_trade.log_dir` differ. Any other diff = fix before proceeding.
### AUDIT-2: Engine param identity
Both configs must have identical engine section except where intentional.
Specifically verify `fixed_tp_pct=0.0095`, `abs_max_leverage=6.0`, `fraction=0.20`,
`max_hold_bars=120`, `vel_div_threshold=-0.02`. These are the champion params —
any deviation from blue in green's engine section is a bug.
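
AUDIT-2 can be scripted in the same spirit as AUDIT-1. The following is a minimal sketch, not the project's actual check: it assumes the parameters live under a top-level `engine` key in both YAML files, and `audit_engine` is a hypothetical helper name.

```python
import math

# Champion engine parameters as listed in AUDIT-2.
CHAMPION_ENGINE = {
    "fixed_tp_pct": 0.0095,
    "abs_max_leverage": 6.0,
    "fraction": 0.20,
    "max_hold_bars": 120,
    "vel_div_threshold": -0.02,
}

def audit_engine(blue_cfg, green_cfg, section="engine"):
    """Return a list of problems; an empty list means the audit passes.

    Assumption: both configs keep engine params under cfg[section].
    """
    problems = []
    for name, cfg in (("blue", blue_cfg), ("green", green_cfg)):
        engine = cfg.get(section, {})
        for key, want in CHAMPION_ENGINE.items():
            got = engine.get(key)
            ok = isinstance(got, (int, float)) and math.isclose(
                got, want, rel_tol=0.0, abs_tol=1e-12
            )
            if not ok:
                problems.append(f"{name}.{section}.{key}: expected {want!r}, got {got!r}")
    # Beyond the champion values, the two engine sections must agree key for key.
    b, g = blue_cfg.get(section, {}), green_cfg.get(section, {})
    for key in sorted(set(b) | set(g)):
        if b.get(key) != g.get(key):
            problems.append(f"{section}.{key}: blue={b.get(key)!r} green={g.get(key)!r}")
    return problems
```

Feed it the two dicts returned by `yaml.safe_load` in AUDIT-1 and treat any non-empty result as a failed audit.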
### AUDIT-3: Code path symmetry
Verify `paper_trade_flow.py` routes `direction_val=1` for green and `direction_val=-1`
for blue. Verify `dolphin_actor.py` does the same. Verify each colour writes only to its own
IMap (`DOLPHIN_PNL_GREEN` for green, `DOLPHIN_PNL_BLUE` for blue).
**AUDIT GATE**: All 3 checks green → sign off with date. Then proceed to REGRESSION.
---
## PHASE 1 — Full Regression
**Goal**: Clean slate. Every existing test passes. No regressions from MIG7 work.
```bash
python -m pytest ci/ -v --tb=short 2>&1 | tee run_logs/regression_$(date +%Y%m%d_%H%M%S).log
```
**Expected**: 14/14 tests green (test_13×6 + test_14×3 + test_15×1 + test_16×4).
**Also run** the original 5 CI layers:
```bash
bash ci/run_ci.sh 2>&1 | tee run_logs/ci_full_$(date +%Y%m%d_%H%M%S).log
```
Fix any failures before proceeding. Zero tolerance.
---
## PHASE 2 — ALGOx Series: Pre/Post MIG Numeric Parity
**Goal**: Prove the production NDAlphaEngine produces numerically identical results to
the pre-MIG champion backtest. Trade by trade. Bar by bar. Float by float.
**The guarantee**: NDAlphaEngine uses `seed=42` → deterministic numba PRNG. Given
identical input data in identical order, output must be bit-for-bit identical for all
non-stochastic paths (OB=MockOBProvider, ExF=static, no live HZ).
### ALGO-1: Capture Pre-MIG Reference
Run the original champion test to produce the definitive reference log:
```bash
python nautilus_dolphin/test_pf_dynamic_beta_validate.py \
2>&1 | tee run_logs/PREMIG_REFERENCE_$(date +%Y%m%d_%H%M%S).log
```
This produces:
- `run_logs/trades_YYYYMMDD_HHMMSS.csv` — trade-by-trade: asset, direction, entry_bar,
exit_bar, entry_price, exit_price, pnl_pct, pnl_absolute, leverage, exit_reason
- `run_logs/daily_YYYYMMDD_HHMMSS.csv` — per-date: capital, pnl, trades, boost, beta, mc_status
- `run_logs/summary_YYYYMMDD_HHMMSS.json` — aggregate: ROI, PF, DD, Sharpe, WR, Trades
**Expected aggregate** (champion, frozen):
ROI=+44.89%, PF=1.123, DD=14.95%, Sharpe=2.50, WR=49.3%, Trades=2128
If the pre-MIG test no longer produces this, stop. Something has regressed in the engine.
Restore from backup before proceeding.
**Label these files**: `PREMIG_REFERENCE_*` — do not overwrite.
### ALGO-2: Post-MIG Engine Parity (Batch Mode, No HZ)
Create `ci/test_algo2_postmig_parity.py`:
This test runs the SAME 55-day dataset (Dec 31 to Feb 25, `vbt_cache_klines` parquets)
through `NDAlphaEngine` via the production `paper_trade_flow.py` code path, but with:
- HZ disabled (no client connection — use `--no-hz` flag or mock HZ)
- MockOBProvider (same as pre-MIG, static 62% fill, -0.09 imbalance bias)
- ExF disabled (no live fetch — use static zero vector as pre-MIG did)
- `seed=42`, all params from `blue.yml`
Then compare output trade CSV against `PREMIG_REFERENCE_trades_*.csv`:
```python
# Comparison logic: every trade must match.
# Check the count first, because zip() silently stops at the shorter list.
assert len(pre_trades) == len(post_trades), f"Trade count mismatch: {len(pre_trades)} vs {len(post_trades)}"
for i, (pre, post) in enumerate(zip(pre_trades, post_trades)):
    # Discrete fields must be exactly equal.
    assert pre['asset'] == post['asset'], f"Trade {i}: asset mismatch"
    assert pre['direction'] == post['direction'], f"Trade {i}: direction mismatch"
    assert pre['entry_bar'] == post['entry_bar'], f"Trade {i}: entry_bar mismatch"
    assert pre['exit_bar'] == post['exit_bar'], f"Trade {i}: exit_bar mismatch"
    # Numeric fields must agree to within 1e-9 absolute.
    assert abs(pre['entry_price'] - post['entry_price']) < 1e-9, f"Trade {i}: entry_price mismatch"
    assert abs(pre['pnl_pct'] - post['pnl_pct']) < 1e-9, f"Trade {i}: pnl_pct mismatch"
    assert abs(pre['leverage'] - post['leverage']) < 1e-9, f"Trade {i}: leverage mismatch"
    assert pre['exit_reason'] == post['exit_reason'], f"Trade {i}: exit_reason mismatch"
```
**Pass**: All 2128 trades match: discrete fields exactly, numeric fields to within 1e-9 absolute. Zero divergence.
**If divergence found**: Binary search the 55-day window to find the first diverging trade.
Read that date's bar-level state log to identify the cause. Fix before proceeding.
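
The binary search can be sketched as below. This is an illustration under stated assumptions, not existing project code: `first_diverging_day` and the `prefix_matches` callback are hypothetical names, and the callback is assumed to replay only the first `k` days and report whether the resulting trades equal the reference. Determinism supplies the monotonic property the search relies on: once a prefix diverges, every longer prefix diverges too.

```python
def first_diverging_day(n_days, prefix_matches):
    """Return the 0-based index of the first diverging day, or None.

    prefix_matches(k) -> bool: True when replaying the first k days
    reproduces the reference exactly (hypothetical callback).
    """
    if prefix_matches(n_days):
        return None          # the whole window matches, nothing to find
    lo, hi = 0, n_days       # invariant: prefix lo matches, prefix hi does not
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if prefix_matches(mid):
            lo = mid
        else:
            hi = mid
    return hi - 1            # the day that turned a matching prefix into a mismatch
```

For the 55-day window this needs at most seven replays (one full run plus six halvings) instead of one per day.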
### ALGO-3: Sub-Day ACB Path Parity
Run the same 55-day dataset WITH ACB listener active but no boost changes arriving
(no `acb_processor_service` running → `_pending_acb` stays None throughout).
Output must be identical to ALGO-2. This confirms the ACB listener path is truly
inert when no boost events arrive.
```python
assert result == algo2_result # exact dict comparison
```
### ALGO-4: Full Stack Parity (HZ+Prefect Active, MockOB, Static ExF)
Start HZ. Start Prefect. Run paper_trade_flow.py for the 55-day window in replay mode
(historical parquets, not live data). MockOBProvider. ExF from static file (not live fetch).
Output must match ALGO-2 exactly. This confirms HZ state persistence, posture reads,
and IMap writes do NOT alter the algo computation path.
**This is the critical gate**: if HZ introduces any non-determinism into the engine,
it shows up here.
### ALGO-5: Bar-Level State Log Comparison
Instrument `esf_alpha_orchestrator.py` to optionally emit a per-bar state log:
```
bar_idx | vel_div | vol_regime_ok | position_open | regime_size_mult | boost | beta | action
```
Run pre-MIG reference and post-MIG batch on the same date. Compare bar-by-bar.
Every numeric field must match to float64 precision.
**This is the flint-512 resolution check.** If ALGO-2 passes but this fails on a
specific field, that field has a divergence the aggregate metrics hid.
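
A field-by-field comparison could look like the sketch below. It assumes the state log is written as CSV with the column names above and that floats are written with Python's default formatting (which round-trips float64 exactly); `diff_bar_logs` is a hypothetical helper, not something that exists in the repo.

```python
import csv
import io

NUMERIC_FIELDS = ("vel_div", "regime_size_mult", "boost", "beta")
EXACT_FIELDS = ("bar_idx", "vol_regime_ok", "position_open", "action")

def diff_bar_logs(pre_csv, post_csv):
    """Yield (row_number, field, pre_value, post_value) for every mismatch."""
    pre_rows = list(csv.DictReader(io.StringIO(pre_csv)))
    post_rows = list(csv.DictReader(io.StringIO(post_csv)))
    if len(pre_rows) != len(post_rows):
        shorter = min(len(pre_rows), len(post_rows))
        yield (shorter, "<row count>", len(pre_rows), len(post_rows))
    for i, (pre, post) in enumerate(zip(pre_rows, post_rows)):
        for field in EXACT_FIELDS:
            if pre[field] != post[field]:
                yield (i, field, pre[field], post[field])
        for field in NUMERIC_FIELDS:
            # float.hex() exposes the exact bit pattern, so 0.1 + 0.2 and 0.3 differ here.
            a, b = float(pre[field]), float(post[field])
            if a.hex() != b.hex():
                yield (i, field, a, b)
```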
**ALGO GATE**: ALGO-2 through ALGO-5 all green → algo is certified production-identical.
Document with date, trade count, first/last trade ID, aggregate metrics.
---
## PHASE 3 — PREFLIGHTx Series: Systemic Reliability
**Goal**: Find everything that can go wrong before it goes wrong with real capital.
No network/infra simulation — pure systemic/concurrency/logic bugs.
### PREFLIGHT-1: Concurrent ACB + Execution Race Stress
Spawn 50 threads simultaneously calling `engine.update_acb_boost()` with random values
while the main thread runs `process_day()`. Verify:
- No crash, no deadlock
- Final `position` state is consistent (not half-closed, not double-closed)
- `_pending_acb` mechanism absorbs all concurrent writes safely
```python
# Run 1000 iterations. Any assertion failure = race condition confirmed.
for _ in range(1000):
engine = NDAlphaEngine(seed=42, ...)
# ... inject position ...
with ThreadPoolExecutor(max_workers=50) as ex:
futures = [ex.submit(engine.update_acb_boost, random(), random()) for _ in range(50)]
engine.process_day(...) # concurrent
assert engine.position is None or engine.position.asset in valid_assets
```
### PREFLIGHT-2: Daemon Restart Mid-Day
While paper_trade_flow.py is mid-execution (historical replay, fast clock):
1. Kill `acb_processor_service` → verify engine falls back to last known boost, does not crash
2. Kill HZ → verify `paper_trade_flow` falls back to JSONL ledger, does not crash, resumes
3. Kill and restart `system_watchdog_service` → verify posture snaps back to APEX after restart
4. Kill and restart HZ → verify client reconnects, IMap state survives (HZ persistence)
Each kill/restart is a separate PREFLIGHT-2.N sub-test with a pass/fail log entry.
### PREFLIGHT-3: `_processed_dates` Set Growth
Run a simulated 795-day replay through `DolphinActor.on_bar()` (mocked bars, no real HZ).
Verify `_processed_dates` does not grow unboundedly. It should be cleared on `on_stop()`
and not accumulate across sessions.
If it grows to 795 entries and is never cleared: add `self._processed_dates.clear()` to
`on_stop()` and document as a found bug.
### PREFLIGHT-4: Capital Ledger Consistency Under HZ Failure
Run 10 days of paper trading. On day 5, simulate HZ write failure (mock `imap.put` to throw).
Verify:
- JSONL fallback ledger was written on days 1-4
- Day 6 resumes from JSONL ledger with correct capital
- No capital double-counting or reset to 25k
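
The shape of this test can be sketched with `unittest.mock`. Everything project-specific here is an assumption made for illustration: `persist_capital`, `load_capital`, the `"capital"` key, and the ledger line format are hypothetical stand-ins, not the real `paper_trade_flow.py` API.

```python
import json
import tempfile
from pathlib import Path
from unittest.mock import MagicMock

INITIAL_CAPITAL = 25_000.0  # the "reset to 25k" value the test guards against

def persist_capital(imap, ledger, day, capital):
    """Hypothetical writer: append to the JSONL ledger first, then try HZ."""
    with ledger.open("a") as fh:
        fh.write(json.dumps({"day": day, "capital": capital}) + "\n")
    try:
        imap.put("capital", capital)
    except Exception:
        pass  # HZ down: the ledger line above remains the source of truth

def load_capital(ledger):
    """Resume from the last ledger line, or from initial capital if none exists."""
    if not ledger.exists():
        return INITIAL_CAPITAL
    lines = ledger.read_text().splitlines()
    return json.loads(lines[-1])["capital"] if lines else INITIAL_CAPITAL

def run_ten_days(ledger, fail_on_day=5, daily_pnl=100.0):
    imap = MagicMock()
    for day in range(1, 11):
        # Simulate the HZ outage on one day only.
        imap.put.side_effect = RuntimeError("HZ down") if day == fail_on_day else None
        capital = load_capital(ledger) + daily_pnl
        persist_capital(imap, ledger, day, capital)
    return load_capital(ledger)

with tempfile.TemporaryDirectory() as tmp:
    final_capital = run_ten_days(Path(tmp) / "ledger.jsonl")
```

With a fixed 100.0 daily PnL the final capital must be exactly initial capital plus ten increments; any other value points to a reset or a double count around the failed day.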
### PREFLIGHT-5: Posture Hysteresis Under Rapid Oscillation
Write a test that rapidly alternates `DOLPHIN_SAFETY` between APEX and HIBERNATE 100 times
per second while `paper_trade_flow.py` reads it. Verify:
- No partial posture state (half APEX half HIBERNATE)
- No trade entered and immediately force-exited due to posture flip
- Hysteresis thresholds in `survival_stack.py` absorb the noise
### PREFLIGHT-6: Survival Stack Rm Boundary Conditions
Feed the survival stack exact boundary inputs (Cat1=0.0, Cat2=0.0, Cat3=1.0, Cat4=0.0, Cat5=0.0)
and verify Rm multiplier matches the analytic formula exactly. Then feed all-zero (APEX expected)
and all-one (HIBERNATE expected). Verify posture transitions at exact threshold values.
### PREFLIGHT-7: Memory Leak Over Extended Replay
Run a 795-bar (1 day, full bar count) simulation 1000 times in a loop. Sample RSS before
and after. Growth > 50 MB = memory leak. Candidate sites: `_price_histories` trim logic,
`trade_history` list accumulation, HZ map handle cache in `ShardedFeatureStore`.
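
A measurement harness might look like the sketch below. Note the substitution: it uses the standard library's `tracemalloc`, which tracks Python heap allocations rather than process RSS, so it will not see leaks inside native code. For true RSS, `psutil.Process().memory_info().rss` is one option if `psutil` is installed. `heap_growth_bytes` and `run_once` are hypothetical names.

```python
import gc
import tracemalloc

LEAK_THRESHOLD = 50 * 1024 * 1024   # 50 MB, as in PREFLIGHT-7

def heap_growth_bytes(run_once, iterations=1000):
    """Run run_once() repeatedly and return net Python heap growth in bytes."""
    run_once()                      # warm-up so one-time caches are not counted
    gc.collect()
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    for _ in range(iterations):
        run_once()
    gc.collect()                    # drop garbage so only retained objects count
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before
```

Pass a zero-argument callable that runs one simulated day and compare the result against `LEAK_THRESHOLD`.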
### PREFLIGHT-8: Seeded RNG Determinism Under Reset
Call `engine.reset()` and re-run the same date. Verify output is bit-for-bit identical
to the first run. The numba PRNG must re-seed correctly on reset.
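
The test shape, sketched with a toy stand-in: `ToyEngine` models only the seed/reset contract and uses the standard library's `random.Random` in place of the numba PRNG, so it illustrates the assertion, not the real engine.

```python
import random
import struct

class ToyEngine:
    """Stand-in for the real engine: only the seed/reset contract is modelled."""

    def __init__(self, seed=42):
        self._seed = seed
        self.reset()

    def reset(self):
        # Re-seeding here is the behaviour PREFLIGHT-8 checks for.
        self._rng = random.Random(self._seed)

    def process_day(self, n_bars):
        return [self._rng.gauss(0.0, 1.0) for _ in range(n_bars)]

def bits(values):
    """Pack floats to raw IEEE-754 bytes so equality means bit-for-bit identity."""
    return struct.pack(f"<{len(values)}d", *values)

engine = ToyEngine(seed=42)
first = bits(engine.process_day(795))
engine.reset()
second = bits(engine.process_day(795))   # must equal `first`
third = bits(engine.process_day(795))    # no reset in between: must differ
```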
**PREFLIGHT GATE**: All 8 series pass with zero failures across all iterations.
Document each with date, iteration count, pass/fail, any bugs found and fixed.
---
## PHASE 4 — VBT Integration Verification
**Goal**: Confirm `dolphin_vbt_real.py` (the original VBT vectorized backtest) remains
fully operational under the production environment and produces identical results to
its own historical champion run.
### VBT-1: VBT Standalone Parity
```bash
python nautilus_dolphin/dolphin_vbt_real.py --mode backtest --dates 55day \
2>&1 | tee run_logs/VBT_STANDALONE_$(date +%Y%m%d_%H%M%S).log
```
Compare aggregate metrics against the known VBT champion. VBT and NDAlphaEngine should
agree within float accumulation tolerance (not bit-perfect — different execution paths —
but metrics within 0.5% of each other).
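
The 0.5% tolerance can be checked mechanically. A minimal sketch, assuming both runs expose their aggregates as plain dicts of floats; `metrics_out_of_tolerance` is a hypothetical helper name.

```python
import math

def metrics_out_of_tolerance(vbt, nd, rel_tol=0.005):
    """Return the sorted names of metrics that are missing or differ by more than rel_tol."""
    return [
        name
        for name in sorted(set(vbt) | set(nd))
        if name not in vbt
        or name not in nd
        or not math.isclose(vbt[name], nd[name], rel_tol=rel_tol)
    ]
```

An empty list means VBT and NDAlphaEngine agree on every metric. `math.isclose` measures the difference relative to the larger of the two magnitudes, which is slightly more lenient than measuring against the champion value alone.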
### VBT-2: VBT Under Prefect Scheduling
Wrap a VBT backtest run as a Prefect flow (or verify it can be triggered from a flow).
Confirm it reads from `vbt_cache_klines` parquets correctly and writes results to
`DOLPHIN_STATE_BLUE` IMap.
### VBT-3: Parquet Cache Freshness
Verify `vbt_cache_klines/` has contiguous parquets from 2024-01-01 to yesterday.
Any gap = data pipeline issue to fix before live trading.
```python
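from pathlib import Path
import pandas as pd

# Assumes one file per day named YYYY-MM-DD.parquet.
dates = sorted(f.stem for f in Path('vbt_cache_klines').glob('20*.parquet'))
# The range ends at yesterday (UTC); today's file may legitimately not exist yet.
yesterday = pd.Timestamp.now(tz='UTC').tz_localize(None).normalize() - pd.Timedelta(days=1)
expected = pd.date_range('2024-01-01', yesterday, freq='D').strftime('%Y-%m-%d').tolist()
missing = set(expected) - set(dates)
print(f"Missing dates: {sorted(missing)}")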
```
**VBT GATE**: VBT standalone matches champion metrics, Prefect integration runs,
parquet cache contiguous.
---
## PHASE 5 — Final E2E Paper Trade (The Climax)
**Goal**: One complete live paper trading day under full production stack.
Everything real except capital.
### Setup
1. Start all daemons:
```bash
python prod/acb_processor_service.py &
python prod/system_watchdog_service.py &
python external_factors/ob_stream_service.py &
```
2. Confirm Prefect `mc_forewarner_flow` scheduled and healthy
3. Confirm HZ MC console shows all IMaps healthy (port 8080)
4. Confirm `DOLPHIN_SAFETY` = `{"posture": "APEX", ...}`
### Instrumentation
Before running, enable bar-level state logging in `paper_trade_flow.py`:
- Every bar: `bar_idx, vel_div, vol_regime_ok, posture, boost, beta, position_open, action`
- Every trade entry: full entry record (identical schema to pre-MIG reference)
- Every trade exit: full exit record + exit reason
- End of day: capital, pnl, trades, mc_status, acb_boost, exf_snapshot
Output files:
```
paper_logs/blue/E2E_FINAL_YYYYMMDD_bars.csv # bar-level state
paper_logs/blue/E2E_FINAL_YYYYMMDD_trades.csv # trade-by-trade
paper_logs/blue/E2E_FINAL_YYYYMMDD_summary.json # daily aggregate
```
### The Run
```bash
python prod/paper_trade_flow.py --config prod/configs/blue.yml \
--date $(date +%Y-%m-%d) \
--instrument-full \
2>&1 | tee run_logs/E2E_FINAL_$(date +%Y%m%d_%H%M%S).log
```
### Post-Run Comparison
Compare `E2E_FINAL_*_trades.csv` against the nearest-date pre-MIG trade log:
- Exit reasons distribution should match historical norms (86% MAX_HOLD, ~10% FIXED_TP, ~4% STOP_LOSS)
- WR should be in the 55-65% historical range for this market regime
- Per-trade leverage values should be in the 1x-6x range
- No `SUBDAY_ACB_NORMALIZATION` exits unless boost genuinely dropped intraday
**Pass criteria**: No crashes. Trades produced. All metrics within historical distribution.
Bar-level state log shows correct posture enforcement, boost injection, and capital accumulation.
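
The distribution checks can be computed directly from the trades file. A rough sketch, assuming the E2E trades CSV carries the `exit_reason`, `leverage`, and `pnl_pct` columns of the pre-MIG reference schema; `summarize_trades` is a hypothetical helper, and the thresholds to compare against are the ones listed above.

```python
import csv
import io
from collections import Counter

def summarize_trades(trades_csv):
    """Return exit-reason shares, win rate, and leverage bounds from trades CSV text."""
    rows = list(csv.DictReader(io.StringIO(trades_csv)))
    if not rows:
        raise ValueError("no trades produced")   # itself a failed pass criterion
    n = len(rows)
    reasons = Counter(r["exit_reason"] for r in rows)
    leverage = [float(r["leverage"]) for r in rows]
    return {
        "exit_share": {k: v / n for k, v in reasons.items()},
        "win_rate": sum(float(r["pnl_pct"]) > 0 for r in rows) / n,
        "min_leverage": min(leverage),
        "max_leverage": max(leverage),
    }
```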
---
## Sign-Off Checklist
```
[ ] AUDIT: blue/green config diff — only expected diffs found
[ ] REGRESSION: 14/14 CI tests green
[ ] ALGO-1: Pre-MIG reference captured, ROI=+44.89%, Trades=2128
[ ] ALGO-2: Post-MIG batch parity, all 2128 trades match to 1e-9
[ ] ALGO-3: ACB inert path identical to ALGO-2
[ ] ALGO-4: Full HZ+Prefect stack identical to ALGO-2
[ ] ALGO-5: Bar-level state log identical field by field
[ ] PREFLIGHT-1 through -8: all passed, bugs found+fixed documented
[ ] VBT-1: VBT champion metrics reproduced
[ ] VBT-2: VBT Prefect integration runs
[ ] VBT-3: Parquet cache contiguous
[ ] E2E FINAL: Live paper day completed, trades produced, metrics within historical range
Only after all boxes checked: consider 30-day continuous paper trading.
Only after 30-day paper validation: consider live capital.
```
---
*The algo has been built carefully. This plan exists to prove it.
Trust the process. Fix what breaks. Ship what holds.* 🐬