DOLPHIN/prod/docs/E2E_MASTER_PLAN.md
hjnormey 01c19662cb initial: import DOLPHIN baseline 2026-04-21 from dolphinng5_predict working tree
Includes core prod + GREEN/BLUE subsystems:
- prod/ (BLUE harness, configs, scripts, docs)
- nautilus_dolphin/ (GREEN Nautilus-native impl + dvae/ preserved)
- adaptive_exit/ (AEM engine + models/bucket_assignments.pkl)
- Observability/ (EsoF advisor, TUI, dashboards)
- external_factors/ (EsoF producer)
- mc_forewarning_qlabs_fork/ (MC regime/envelope)

Excludes runtime caches, logs, backups, and reproducible artifacts per .gitignore.
2026-04-21 16:58:38 +02:00


DOLPHIN-NAUTILUS — E2E Master Validation Plan

"From Champion Backtest to Production Fidelity"

Authored: 2026-03-07
Authority: Post-MIG7 production readiness gate. No live capital until this plan completes green.
Principle: Every phase ends in a written, dated, signed-off result. No skipping forward on "probably fine."
Numeric fidelity target: Trade-by-trade log identity to full float64 precision where deterministic. Stochastic components (OB live data, ExF timing jitter) are isolated and accounted for explicitly.


Prerequisites — Before Any Phase Begins

# All daemons stopped. Clean state.
# Docker stack healthy:
docker ps  # hazelcast:5701, hazelcast-mc:8080, prefect:4200 all Up

# Activate venv — ALL commands below assume this:
source "/c/Users/Lenovo/Documents/- Siloqy/Scripts/activate"
cd "/c/Users/Lenovo/Documents/- DOLPHIN NG HD HCM TSF Predict"

PHASE 0 — Blue/Green Audit

Goal: Confirm blue and green configs are identical where they should be, and differ only where intentionally different (direction, IMap names, log dirs).

AUDIT-1: Config structural diff

python -c "
import yaml
blue = yaml.safe_load(open('prod/configs/blue.yml'))
green = yaml.safe_load(open('prod/configs/green.yml'))

EXPECTED_DIFFS = {'strategy_name', 'direction'}
HZ_DIFFS = {'imap_state', 'imap_pnl'}
LOG_DIFFS = {'log_dir'}

def flatten(d, prefix=''):
    out = {}
    for k, v in d.items():
        key = f'{prefix}.{k}' if prefix else k
        if isinstance(v, dict):
            out.update(flatten(v, key))
        else:
            out[key] = v
    return out

fb, fg = flatten(blue), flatten(green)
all_keys = set(fb) | set(fg)
diffs = {k: (fb.get(k), fg.get(k)) for k in all_keys if fb.get(k) != fg.get(k)}

print('=== Config diffs (blue vs green) ===')
for k, (b, g) in sorted(diffs.items()):
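    # Match on the leaf key name; a substring test would also accept unrelated
    # keys that merely contain e.g. 'direction'.
    expected = k.split('.')[-1] in (EXPECTED_DIFFS | HZ_DIFFS | LOG_DIFFS)
    tag = '[OK]' if expected else '[*** UNEXPECTED ***]'
    # repr() instead of the r-conversion flag: its exclamation mark can trigger
    # history expansion when this runs inside double quotes in interactive bash.
    print(f'  {tag} {k}: blue={repr(b)}  green={repr(g)}')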
"

Pass: Only strategy_name, direction, hazelcast.imap_state, hazelcast.imap_pnl, paper_trade.log_dir differ. Any other diff = fix before proceeding.

AUDIT-2: Engine param identity

Both configs must have an identical engine section except where intentionally different. Specifically, verify fixed_tp_pct=0.0095, abs_max_leverage=6.0, fraction=0.20, max_hold_bars=120, vel_div_threshold=-0.02. These are the champion params, so any deviation from blue in green's engine section is a bug.
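A minimal sketch of this check, assuming each parsed config keeps these parameters under a top-level `engine` key. The key name and the helper `engine_param_issues` are illustrative, not taken from the codebase.

```python
# Champion engine parameters, copied from the AUDIT-2 text.
CHAMPION_ENGINE = {
    "fixed_tp_pct": 0.0095,
    "abs_max_leverage": 6.0,
    "fraction": 0.20,
    "max_hold_bars": 120,
    "vel_div_threshold": -0.02,
}


def engine_param_issues(blue_cfg, green_cfg, engine_key="engine"):
    """Return human-readable problems; an empty list means the check passes.

    Assumes both configs are dicts (e.g. from yaml.safe_load) and that the
    engine parameters live under `engine_key` (hypothetical layout).
    """
    issues = []
    blue = blue_cfg.get(engine_key, {})
    green = green_cfg.get(engine_key, {})
    # Both sides must carry the exact champion values.
    for name, want in CHAMPION_ENGINE.items():
        for label, section in (("blue", blue), ("green", green)):
            got = section.get(name)
            if got != want:
                issues.append(f"{label}.{name}: expected {want!r}, got {got!r}")
    # Any other engine key that differs between blue and green is suspect.
    for name in sorted(set(blue) | set(green)):
        if name not in CHAMPION_ENGINE and blue.get(name) != green.get(name):
            issues.append(f"{name}: blue={blue.get(name)!r} green={green.get(name)!r}")
    return issues
```

An empty return value is the pass condition; anything else gets listed in the audit log and fixed before sign-off.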

AUDIT-3: Code path symmetry

Verify paper_trade_flow.py routes direction_val=1 for green and direction_val=-1 for blue. Verify dolphin_actor.py does the same. Verify both write to their respective IMap (DOLPHIN_PNL_BLUE vs DOLPHIN_PNL_GREEN).

AUDIT GATE: All 3 checks green → sign off with date. Then proceed to REGRESSION.


PHASE 1 — Full Regression

Goal: Clean slate. Every existing test passes. No regressions from MIG7 work.

python -m pytest ci/ -v --tb=short 2>&1 | tee run_logs/regression_$(date +%Y%m%d_%H%M%S).log

Expected: 14/14 tests green (test_13×6 + test_14×3 + test_15×1 + test_16×4). Also run the original 5 CI layers:

bash ci/run_ci.sh 2>&1 | tee run_logs/ci_full_$(date +%Y%m%d_%H%M%S).log

Fix any failures before proceeding. Zero tolerance.


PHASE 2 — ALGOx Series: Pre/Post MIG Numeric Parity

Goal: Prove the production NDAlphaEngine produces numerically identical results to the pre-MIG champion backtest. Trade by trade. Bar by bar. Float by float.

The guarantee: NDAlphaEngine uses seed=42 → deterministic numba PRNG. Given identical input data in identical order, output must be bit-for-bit identical for all non-stochastic paths (OB=MockOBProvider, ExF=static, no live HZ).

ALGO-1: Capture Pre-MIG Reference

Run the original champion test to produce the definitive reference log:

python nautilus_dolphin/test_pf_dynamic_beta_validate.py \
    2>&1 | tee run_logs/PREMIG_REFERENCE_$(date +%Y%m%d_%H%M%S).log

This produces:

  • run_logs/trades_YYYYMMDD_HHMMSS.csv — trade-by-trade: asset, direction, entry_bar, exit_bar, entry_price, exit_price, pnl_pct, pnl_absolute, leverage, exit_reason
  • run_logs/daily_YYYYMMDD_HHMMSS.csv — per-date: capital, pnl, trades, boost, beta, mc_status
  • run_logs/summary_YYYYMMDD_HHMMSS.json — aggregate: ROI, PF, DD, Sharpe, WR, Trades

Expected aggregate (champion, frozen): ROI=+44.89%, PF=1.123, DD=14.95%, Sharpe=2.50, WR=49.3%, Trades=2128

If the pre-MIG test no longer produces this, stop. Something has regressed in the engine. Restore from backup before proceeding.
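The stop condition can be made mechanical. A sketch, assuming the summary JSON uses the metric names listed above as keys and stores percentages as plain numbers (both are assumptions about the file schema; `champion_mismatches` and `load_summary` are hypothetical helpers). Tolerances are half a unit of the last printed digit, since the frozen values are rounded for display.

```python
import json

# Frozen champion aggregate: metric -> (expected value, absolute tolerance).
CHAMPION = {
    "ROI": (44.89, 0.005),
    "PF": (1.123, 0.0005),
    "DD": (14.95, 0.005),
    "Sharpe": (2.50, 0.005),
    "WR": (49.3, 0.05),
    "Trades": (2128, 0),
}


def champion_mismatches(summary):
    """Return {metric: (expected, actual)} for every metric out of tolerance."""
    bad = {}
    for key, (want, tol) in CHAMPION.items():
        got = summary.get(key)
        if got is None or abs(got - want) > tol:
            bad[key] = (want, got)
    return bad


def load_summary(path):
    """Read a summary_*.json file into a dict."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

A non-empty result from `champion_mismatches(load_summary(path))` means the reference run has drifted and the plan stops here.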

Label these files: PREMIG_REFERENCE_* — do not overwrite.

ALGO-2: Post-MIG Engine Parity (Batch Mode, No HZ)

Create ci/test_algo2_postmig_parity.py:

This test runs the SAME 55-day dataset (Dec 31 to Feb 25, vbt_cache_klines parquets) through NDAlphaEngine via the production paper_trade_flow.py code path, but with:

  • HZ disabled (no client connection — use --no-hz flag or mock HZ)
  • MockOBProvider (same as pre-MIG, static 62% fill, -0.09 imbalance bias)
  • ExF disabled (no live fetch — use static zero vector as pre-MIG did)
  • seed=42, all params from blue.yml

Then compare output trade CSV against PREMIG_REFERENCE_trades_*.csv:

# Comparison logic — every trade must match:
for i, (pre, post) in enumerate(zip(pre_trades, post_trades)):
    assert pre['asset']        == post['asset'],        f"Trade {i}: asset mismatch"
    assert pre['direction']    == post['direction'],     f"Trade {i}: direction mismatch"
    assert pre['entry_bar']    == post['entry_bar'],     f"Trade {i}: entry_bar mismatch"
    assert pre['exit_bar']     == post['exit_bar'],      f"Trade {i}: exit_bar mismatch"
    assert abs(pre['entry_price'] - post['entry_price']) < 1e-9, f"Trade {i}: entry_price mismatch"
    # exit_price and pnl_absolute are also columns of the reference CSV
    assert abs(pre['exit_price'] - post['exit_price']) < 1e-9, f"Trade {i}: exit_price mismatch"
    assert abs(pre['pnl_pct']  - post['pnl_pct'])  < 1e-9, f"Trade {i}: pnl_pct mismatch"
    assert abs(pre['pnl_absolute'] - post['pnl_absolute']) < 1e-9, f"Trade {i}: pnl_absolute mismatch"
    assert abs(pre['leverage'] - post['leverage']) < 1e-9, f"Trade {i}: leverage mismatch"
    assert pre['exit_reason']  == post['exit_reason'],   f"Trade {i}: exit_reason mismatch"
# zip() stops at the shorter list, so the count is checked separately
assert len(pre_trades) == len(post_trades), f"Trade count mismatch: {len(pre_trades)} vs {len(post_trades)}"

Pass: All 2128 trades match to 1e-9 precision. Zero divergence.

If divergence found: Binary search the 55-day window to find the first diverging trade. Read that date's bar-level state log to identify the cause. Fix before proceeding.
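Once both trade logs exist, one linear pass is enough to locate the first diverging trade and the field that differs. This sketch does that rather than a binary search over date windows. It assumes trades are dicts with numeric fields already parsed to float, and `first_divergence` is a hypothetical helper.

```python
EXACT_FIELDS = ("asset", "direction", "entry_bar", "exit_bar", "exit_reason")
FLOAT_FIELDS = ("entry_price", "pnl_pct", "leverage")


def first_divergence(pre_trades, post_trades, tol=1e-9):
    """Return (index, field) of the first mismatch, or None if the logs agree.

    A length mismatch is reported at the index where the shorter log ends,
    with the field name "<missing>".
    """
    for i, (pre, post) in enumerate(zip(pre_trades, post_trades)):
        for field in EXACT_FIELDS:
            if pre[field] != post[field]:
                return i, field
        for field in FLOAT_FIELDS:
            # Same criterion as the parity assertions: |a - b| < tol passes.
            if abs(pre[field] - post[field]) >= tol:
                return i, field
    if len(pre_trades) != len(post_trades):
        return min(len(pre_trades), len(post_trades)), "<missing>"
    return None
```

The returned index gives the trade whose date's bar-level state log should be read first.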

ALGO-3: Sub-Day ACB Path Parity

Run the same 55-day dataset WITH ACB listener active but no boost changes arriving (no acb_processor_service running → _pending_acb stays None throughout). Output must be identical to ALGO-2. This confirms the ACB listener path is truly inert when no boost events arrive.

assert result == algo2_result  # exact dict comparison

ALGO-4: Full Stack Parity (HZ+Prefect Active, MockOB, Static ExF)

Start HZ. Start Prefect. Run paper_trade_flow.py for the 55-day window in replay mode (historical parquets, not live data). MockOBProvider. ExF from static file (not live fetch).

Output must match ALGO-2 exactly. This confirms HZ state persistence, posture reads, and IMap writes do NOT alter the algo computation path.

This is the critical gate: if HZ introduces any non-determinism into the engine, it shows up here.

ALGO-5: Bar-Level State Log Comparison

Instrument esf_alpha_orchestrator.py to optionally emit a per-bar state log:

bar_idx | vel_div | vol_regime_ok | position_open | regime_size_mult | boost | beta | action

Run pre-MIG reference and post-MIG batch on the same date. Compare bar-by-bar. Every numeric field must match to float64 precision.

This is the flint-512 resolution check. If ALGO-2 passes but this fails on a specific field, that field has a divergence the aggregate metrics hid.
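A sketch of a float64-exact comparison, assuming each log row is a dict keyed by the column names above and that vel_div, regime_size_mult, boost and beta are the float-valued columns (an assumption; adjust the tuple as needed). It compares raw bit patterns, so unlike `==` it distinguishes 0.0 from -0.0 and treats identical NaN payloads as equal.

```python
import struct

# Columns assumed to hold float64 values (hypothetical selection).
NUMERIC_FIELDS = ("vel_div", "regime_size_mult", "boost", "beta")


def f64_bits(x):
    """Raw IEEE-754 bit pattern of a float, as an int."""
    return struct.unpack("<Q", struct.pack("<d", float(x)))[0]


def bar_mismatches(pre_rows, post_rows, fields=NUMERIC_FIELDS):
    """Yield (bar_idx, field, pre_value, post_value) wherever the bits differ."""
    for pre, post in zip(pre_rows, post_rows):
        for field in fields:
            if f64_bits(pre[field]) != f64_bits(post[field]):
                yield pre["bar_idx"], field, pre[field], post[field]
```

If the state log is written as text, emitting floats with `repr()` keeps the round trip exact in Python 3, so the comparison stays meaningful after reloading from CSV.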

ALGO GATE: ALGO-2 through ALGO-5 all green → algo is certified production-identical. Document with date, trade count, first/last trade ID, aggregate metrics.


PHASE 3 — PREFLIGHTx Series: Systemic Reliability

Goal: Find everything that can go wrong before it goes wrong with real capital. No network/infra simulation — pure systemic/concurrency/logic bugs.

PREFLIGHT-1: Concurrent ACB + Execution Race Stress

Spawn 50 threads simultaneously calling engine.update_acb_boost() with random values while the main thread runs process_day(). Verify:

  • No crash, no deadlock
  • Final position state is consistent (not half-closed, not double-closed)
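  • _pending_acb mechanism absorbs all concurrent writes safely

from concurrent.futures import ThreadPoolExecutor
from random import random

# Run 1000 iterations. Any assertion failure = race condition confirmed.
for _ in range(1000):
    engine = NDAlphaEngine(seed=42, ...)
    # ... inject position ...
    with ThreadPoolExecutor(max_workers=50) as ex:
        futures = [ex.submit(engine.update_acb_boost, random(), random()) for _ in range(50)]
        engine.process_day(...)  # runs concurrently with the 50 boost updates
    # Leaving the with-block waits for all workers; result() re-raises any
    # exception a worker hit, which would otherwise be silently dropped.
    for f in futures:
        f.result()
    assert engine.position is None or engine.position.asset in valid_assets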
PREFLIGHT-2: Daemon Restart Mid-Day

While paper_trade_flow.py is mid-execution (historical replay, fast clock):

  1. Kill acb_processor_service → verify engine falls back to last known boost, does not crash
  2. Kill HZ → verify paper_trade_flow falls back to JSONL ledger, does not crash, resumes
  3. Kill and restart system_watchdog_service → verify posture snaps back to APEX after restart
  4. Kill and restart HZ → verify client reconnects, IMap state survives (HZ persistence)

Each kill/restart is a separate PREFLIGHT-2.N sub-test with a pass/fail log entry.

PREFLIGHT-3: _processed_dates Set Growth

Run a simulated 795-day replay through DolphinActor.on_bar() (mocked bars, no real HZ). Verify _processed_dates does not grow unboundedly. It should be cleared on on_stop() and not accumulate across sessions.

If it grows to 795 entries and is never cleared: add self._processed_dates.clear() to on_stop() and document as a found bug.
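The shape of this test, sketched with a stand-in class. `ReplayActorStub` and `replay` are hypothetical and model only the date-tracking set, not the real DolphinActor.

```python
from datetime import date, timedelta


class ReplayActorStub:
    """Hypothetical stand-in for DolphinActor, tracking only processed dates."""

    def __init__(self):
        self._processed_dates = set()

    def on_bar(self, bar_date):
        # Returns True only for the first bar seen on a given date.
        if bar_date in self._processed_dates:
            return False
        self._processed_dates.add(bar_date)
        return True

    def on_stop(self):
        # The proposed fix: drop per-session state so it cannot accumulate.
        self._processed_dates.clear()


def replay(actor, start, n_days):
    """Feed one mocked bar per day and return the peak size of the set."""
    peak = 0
    for i in range(n_days):
        actor.on_bar(start + timedelta(days=i))
        peak = max(peak, len(actor._processed_dates))
    return peak
```

Within one session the set is expected to reach one entry per replayed day; the assertion that matters is that it is empty after `on_stop()` and that a second session starts from zero.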

PREFLIGHT-4: Capital Ledger Consistency Under HZ Failure

Run 10 days of paper trading. On day 5, simulate HZ write failure (mock imap.put to throw). Verify:

  • JSONL fallback ledger was written on days 1-4
  • Day 6 resumes from JSONL ledger with correct capital
  • No capital double-counting or reset to 25k
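A sketch of how the HZ failure can be injected with `unittest.mock`. The helpers `record_capital` and `resume_capital`, the record schema, the "capital" map key and the ledger-first write order are all assumptions for illustration; the 25,000 default mirrors the starting capital implied by the last bullet.

```python
import json
from unittest.mock import MagicMock


def record_capital(imap, ledger_path, day, capital):
    """Append to the JSONL ledger first, then attempt the IMap write.

    Returns True if the IMap write succeeded, False if it raised.
    """
    with open(ledger_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps({"day": day, "capital": capital}) + "\n")
    try:
        imap.put("capital", capital)
        return True
    except Exception:
        return False


def resume_capital(ledger_path, default=25_000.0):
    """Return the last capital in the ledger, or `default` if none exists."""
    last = default
    try:
        with open(ledger_path, encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    last = json.loads(line)["capital"]
    except FileNotFoundError:
        pass
    return last


def failing_imap_from(mock_imap):
    """Make every subsequent put() on the mock raise, simulating HZ down."""
    mock_imap.put.side_effect = RuntimeError("simulated HZ write failure")
    return mock_imap
```

In the test, a `MagicMock()` stands in for the IMap for days 1 to 4, `failing_imap_from` is applied on day 5, and the day-6 starting capital is read back through `resume_capital`.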

PREFLIGHT-5: Posture Hysteresis Under Rapid Oscillation

Write a test that rapidly alternates DOLPHIN_SAFETY between APEX and HIBERNATE 100 times per second while paper_trade_flow.py reads it. Verify:

  • No partial posture state (half APEX half HIBERNATE)
  • No trade entered and immediately force-exited due to posture flip
  • Hysteresis thresholds in survival_stack.py absorb the noise

PREFLIGHT-6: Survival Stack Rm Boundary Conditions

Feed the survival stack exact boundary inputs (Cat1=0.0, Cat2=0.0, Cat3=1.0, Cat4=0.0, Cat5=0.0) and verify Rm multiplier matches the analytic formula exactly. Then feed all-zero (APEX expected) and all-one (HIBERNATE expected). Verify posture transitions at exact threshold values.

PREFLIGHT-7: Memory Leak Over Extended Replay

Run a 795-bar (1 day, full bar count) simulation 1000 times in a loop. Sample RSS before and after. Growth > 50 MB = memory leak. Candidate sites: _price_histories trim logic, trade_history list accumulation, HZ map handle cache in ShardedFeatureStore.
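A sketch of the measurement loop using the standard-library `tracemalloc` module. Note the substitution: `tracemalloc` reports Python-heap allocations, not process RSS, so it will not see native allocations (for example inside numba-compiled code). It helps localise a leak, while the 50 MB gate itself still needs an RSS reading. `heap_growth_bytes` is a hypothetical helper.

```python
import tracemalloc


def heap_growth_bytes(run_once, iterations=1000, warmup=10):
    """Net Python-heap growth, in bytes, across `iterations` calls to run_once."""
    for _ in range(warmup):
        run_once()  # let caches and lazy imports settle before measuring
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    for _ in range(iterations):
        run_once()
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before
```

Here `run_once` would be a zero-argument callable wrapping one 795-bar simulation; growth that scales with `iterations` points at one of the candidate accumulation sites.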

PREFLIGHT-8: Seeded RNG Determinism Under Reset

Call engine.reset() and re-run the same date. Verify output is bit-for-bit identical to the first run. The numba PRNG must re-seed correctly on reset.
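The test shape, sketched with a stand-in. `SeededEngineStub` is hypothetical and uses Python's `random.Random` in place of the numba PRNG, so it shows only the re-seed-on-reset contract, not the engine itself. Hashing the raw float64 bytes makes "identical digest" mean bit-for-bit identical output.

```python
import hashlib
import random
import struct


class SeededEngineStub:
    """Hypothetical stand-in for NDAlphaEngine: only the seeding behaviour."""

    def __init__(self, seed=42):
        self._seed = seed
        self.reset()

    def reset(self):
        # Re-create the generator so a reset replays the same random stream.
        self._rng = random.Random(self._seed)

    def process_day(self, n_bars=795):
        return [self._rng.random() for _ in range(n_bars)]


def digest(values):
    """SHA-256 over the raw float64 bytes of a list of floats."""
    return hashlib.sha256(struct.pack(f"<{len(values)}d", *values)).hexdigest()
```

Against the real engine the same pattern applies: digest the first run, call `reset()`, rerun the same date, and require equal digests.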

PREFLIGHT GATE: All 8 series pass with zero failures across all iterations. Document each with date, iteration count, pass/fail, any bugs found and fixed.


PHASE 4 — VBT Integration Verification

Goal: Confirm dolphin_vbt_real.py (the original VBT vectorized backtest) remains fully operational under the production environment and produces identical results to its own historical champion run.

VBT-1: VBT Standalone Parity

python nautilus_dolphin/dolphin_vbt_real.py --mode backtest --dates 55day \
    2>&1 | tee run_logs/VBT_STANDALONE_$(date +%Y%m%d_%H%M%S).log

Compare aggregate metrics against the known VBT champion. VBT and NDAlphaEngine should agree within float accumulation tolerance (not bit-perfect — different execution paths — but metrics within 0.5% of each other).
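A sketch of the tolerance check, reading "within 0.5%" as a relative difference between the two values. That reading is an assumption; for metrics that are themselves percentages it could instead mean percentage points. `metrics_outside_tolerance` is a hypothetical helper.

```python
import math


def metrics_outside_tolerance(vbt, nd, rel_tol=0.005):
    """Return {metric: (vbt_value, nd_value)} for metrics differing by > rel_tol.

    A metric present on only one side is always reported.
    """
    out = {}
    for key in sorted(set(vbt) | set(nd)):
        a, b = vbt.get(key), nd.get(key)
        if a is None or b is None or not math.isclose(a, b, rel_tol=rel_tol):
            out[key] = (a, b)
    return out
```

An empty dict is the pass condition for VBT-1.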

VBT-2: VBT Under Prefect Scheduling

Wrap a VBT backtest run as a Prefect flow (or verify it can be triggered from a flow). Confirm it reads from vbt_cache_klines parquets correctly and writes results to DOLPHIN_STATE_BLUE IMap.

VBT-3: Parquet Cache Freshness

Verify vbt_cache_klines/ has contiguous parquets from 2024-01-01 to yesterday. Any gap = data pipeline issue to fix before live trading.

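from datetime import datetime, timedelta, timezone
from pathlib import Path
import pandas as pd

# Assumes one file per day named YYYY-MM-DD.parquet.
dates = sorted(f.stem for f in Path('vbt_cache_klines').glob('20*.parquet'))
# The cache must reach yesterday (UTC); today's file may not exist yet.
yesterday = datetime.now(timezone.utc).date() - timedelta(days=1)
expected = pd.date_range('2024-01-01', yesterday, freq='D').strftime('%Y-%m-%d').tolist()
missing = set(expected) - set(dates)
print(f"Missing dates: {sorted(missing)}")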

VBT GATE: VBT standalone matches champion metrics, Prefect integration runs, parquet cache contiguous.


PHASE 5 — Final E2E Paper Trade (The Climax)

Goal: One complete live paper trading day under full production stack. Everything real except capital.

Setup

  1. Start all daemons:
    python prod/acb_processor_service.py &
    python prod/system_watchdog_service.py &
    python external_factors/ob_stream_service.py &
    
  2. Confirm Prefect mc_forewarner_flow scheduled and healthy
  3. Confirm HZ MC console shows all IMaps healthy (port 8080)
  4. Confirm DOLPHIN_SAFETY = {"posture": "APEX", ...}

Instrumentation

Before running, enable bar-level state logging in paper_trade_flow.py:

  • Every bar: bar_idx, vel_div, vol_regime_ok, posture, boost, beta, position_open, action
  • Every trade entry: full entry record (identical schema to pre-MIG reference)
  • Every trade exit: full exit record + exit reason
  • End of day: capital, pnl, trades, mc_status, acb_boost, exf_snapshot

Output files:

paper_logs/blue/E2E_FINAL_YYYYMMDD_bars.csv      # bar-level state
paper_logs/blue/E2E_FINAL_YYYYMMDD_trades.csv    # trade-by-trade
paper_logs/blue/E2E_FINAL_YYYYMMDD_summary.json  # daily aggregate

The Run

python prod/paper_trade_flow.py --config prod/configs/blue.yml \
    --date $(date +%Y-%m-%d) \
    --instrument-full \
    2>&1 | tee run_logs/E2E_FINAL_$(date +%Y%m%d_%H%M%S).log

Post-Run Comparison

Compare E2E_FINAL_*_trades.csv against the nearest-date pre-MIG trade log:

  • Exit reasons distribution should match historical norms (86% MAX_HOLD, ~10% FIXED_TP, ~4% STOP_LOSS)
  • WR should be in the 55-65% historical range for this market regime
  • Per-trade leverage values should be in the 1x-6x range
  • No SUBDAY_ACB_NORMALIZATION exits unless boost genuinely dropped intraday
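These checks reduce to a few aggregates over the trade log. A sketch, assuming trades are dicts with `exit_reason`, `pnl_pct` and `leverage` fields as in the reference schema, and counting a win as `pnl_pct > 0` (an assumption about how WR is defined). `post_run_stats` is a hypothetical helper.

```python
from collections import Counter


def post_run_stats(trades):
    """Summarise a trade list: exit-reason shares, win rate, leverage range."""
    n = len(trades)
    if n == 0:
        raise ValueError("no trades to summarise")
    reasons = Counter(t["exit_reason"] for t in trades)
    leverages = [float(t["leverage"]) for t in trades]
    return {
        "exit_share": {reason: count / n for reason, count in reasons.items()},
        "win_rate": sum(float(t["pnl_pct"]) > 0 for t in trades) / n,
        "min_leverage": min(leverages),
        "max_leverage": max(leverages),
    }
```

A single day yields few trades, so the shares will be noisy; the leverage bounds and the absence of unexpected exit reasons are the sharper checks.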

Pass criteria: No crashes. Trades produced. All metrics within historical distribution. Bar-level state log shows correct posture enforcement, boost injection, and capital accumulation.


Sign-Off Checklist

[ ] AUDIT: blue/green config diff — only expected diffs found
[ ] REGRESSION: 14/14 CI tests green
[ ] ALGO-1: Pre-MIG reference captured, ROI=+44.89%, Trades=2128
[ ] ALGO-2: Post-MIG batch parity, all 2128 trades match to 1e-9
[ ] ALGO-3: ACB inert path identical to ALGO-2
[ ] ALGO-4: Full HZ+Prefect stack identical to ALGO-2
[ ] ALGO-5: Bar-level state log identical field by field
[ ] PREFLIGHT-1 through -8: all passed, bugs found+fixed documented
[ ] VBT-1: VBT champion metrics reproduced
[ ] VBT-2: VBT Prefect integration runs
[ ] VBT-3: Parquet cache contiguous
[ ] E2E FINAL: Live paper day completed, trades produced, metrics within historical range

Only after all boxes checked: consider 30-day continuous paper trading.
Only after 30-day paper validation: consider live capital.

The algo has been built carefully. This plan exists to prove it. Trust the process. Fix what breaks. Ship what holds. 🐬