DOLPHIN/prod/docs/PRODUCTION_BRINGUP_MASTER_PLAN.md
hjnormey 01c19662cb initial: import DOLPHIN baseline 2026-04-21 from dolphinng5_predict working tree
Includes core prod + GREEN/BLUE subsystems:
- prod/ (BLUE harness, configs, scripts, docs)
- nautilus_dolphin/ (GREEN Nautilus-native impl + dvae/ preserved)
- adaptive_exit/ (AEM engine + models/bucket_assignments.pkl)
- Observability/ (EsoF advisor, TUI, dashboards)
- external_factors/ (EsoF producer)
- mc_forewarning_qlabs_fork/ (MC regime/envelope)

Excludes runtime caches, logs, backups, and reproducible artifacts per .gitignore.
2026-04-21 16:58:38 +02:00


DOLPHIN-NAUTILUS — Production Bringup Master Plan

"From Batch Paper Trading to Hyper-Reactive Memory/Compute Layer Live Algo"

Authored: 2026-03-06
Authority: Synthesizes NAUTILUS-DOLPHIN Prod System Spec (17 pages), LAYER_BRINGUP_PLAN.md, BRINGUP_GUIDE.md, and full champion research state.
Champion baseline (supersedes spec targets): ROI=+44.89%, PF=1.123, DD=14.95%, Sharpe=2.50, WR=49.3% (55-day, abs_max_lev=6.0).
Spec note: the PDF spec targets (ROI>35%, Sharpe>2.0) predate the latest research. The current champion already exceeds them; those floors hold as CI regression gates only.

Principle: The system must be FUNCTIONAL at every phase boundary. Never leave a partially broken state. Each phase ends with a green CI gate.

Deferred (later MIG steps): Linux RT kernel, DPDK kernel bypass, TLA+ formal spec, Rocq/Coq proofs. These are asymptotic perfection items, not blockers for live trading.


DAL Reliability Mapping (DO-178C adaptation)

| DAL | Component | Failure consequence | Required gate |
|-----|-----------|---------------------|---------------|
| A | Kill-switch, capital ledger | Total loss, uncontrolled exposure | Hardware + software interlock |
| B | MC-Forewarner, ACB v6 | Excessive drawdown (>20%) | CI regression + integration test |
| C | Alpha signal (vel_div, IRP) | Missed trades or false signals | Unit + smoke test |
| D | EsoF, DVOL environmental | Suboptimal sizing | Integration test (optional) |
| E | Backfill, dashboards | Observability loss only | Best-effort |

Architecture Target (End State — Post MIG7)

[ARB512 Scanner] ──► eigenvalues/YYYY-MM-DD/*.json
                              │
                    [Prefect SITARA — orchestration layer]
                    ├── ExF fetcher flow (macro data, daily)
                    ├── EsoF calculator flow (daily)
                    ├── MC-Forewarner flow (4-hourly)
                    └── Watchdog flow (10s heartbeat)
                              │
                    [Hazelcast IMDG — hot feature store]
                    ├── DOLPHIN_FEATURES IMap (per-asset, Near Cache)
                    ├── DOLPHIN_STATE_BLUE/GREEN IMap (capital, drawdown)
                    ├── DOLPHIN_SAFETY AtomicReference (posture, kill switch)
                    └── ACB EntryProcessor (atomic boost update)
                              │
                    [Nautilus-Trader — execution core (Rust)]
                    ├── NautilusActor ←→ NDAlphaEngine
                    ├── AsyncDataEngine (bar subscription)
                    └── Binance Futures adapter (live orders)
                              │
                    [Survival Stack — 5 categories × 4 postures]
                    Cat1:Invariants → Cat2:Structural → Cat3:Micro
                    → Cat4:Environmental → Cat5:CapitalStress
                    → Rm multiplier → APEX/STALKER/TURTLE/HIBERNATE

MIG0 — Current State Verification (Baseline Gate)

Goal: Confirm the existing batch paper trading system is fully operational and CI-clean before any migration work begins. Never build on a broken foundation.

Current state:

  • Docker stack: Hazelcast 5.3 (port 5701), HZ-MC (port 8080), Prefect Server (port 4200)
  • Prefect worker running on dolphin pool, deployment dolphin-paper-blue scheduled daily 00:05 UTC
  • paper_trade_flow.py loads JSON scan files, computes vel_div, runs NDAlphaEngine SHORT-only
  • Capital NOT persisted (restarts at 25k each day — KNOWN LIMITATION)
  • OB = MockOBProvider (static 62% fill, -0.09 imbalance bias)
  • No graceful degradation, no posture management
  • CI: 5 layers, 24/24 tests passing

MIG0 Verification Steps

Step MIG0.1 — CI green

cd "/c/Users/Lenovo/Documents/- DOLPHIN NG HD HCM TSF Predict"
source "/c/Users/Lenovo/Documents/- Siloqy/Scripts/activate"
bash ci/run_ci.sh

PASS criteria:

  • All 24 tests pass (layers 1-5)
  • Exit code 0
  • Layer 3 regression: PF >= 1.08, WR >= 42%, trades >= 5 on 10-day VBT window
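The Layer 3 floor can be encoded as bare assertions over the backtest summary. This is a sketch: the summary dict and its key names (`pf`, `wr`, `trades`) are assumptions, not the CI's actual schema.

```python
def assert_layer3_regression(summary: dict) -> None:
    """Gate a 10-day VBT backtest summary against the regression floors.

    Key names ('pf', 'wr', 'trades') are illustrative placeholders.
    """
    assert summary['pf'] >= 1.08, f"PF regression: {summary['pf']:.3f} < 1.08"
    assert summary['wr'] >= 0.42, f"WR regression: {summary['wr']:.1%} < 42%"
    assert summary['trades'] >= 5, f"Too few trades: {summary['trades']} < 5"
```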

Step MIG0.2 — Infrastructure health

docker compose -f prod/docker-compose.yml ps
# ASSERT: hazelcast, hz-mc, prefect-server all "running" (not "restarting")

curl -s http://localhost:4200/api/health | python -c "import sys,json; d=json.load(sys.stdin); assert d['status']=='ok', d"
# ASSERT: Prefect API healthy

python -c "import hazelcast; c=hazelcast.HazelcastClient(); c.shutdown(); print('HZ OK')"
# ASSERT: prints "HZ OK" with no exception

Step MIG0.3 — Manual paper run

source "/c/Users/Lenovo/Documents/- Siloqy/Scripts/activate"
PREFECT_API_URL=http://localhost:4200/api \
python prod/paper_trade_flow.py --date $(date +%Y-%m-%d) --config prod/configs/blue.yml
# ASSERT: prints "vel_div range=[...]", prints "total_trades=N" where N > 0
# ASSERT: HZ IMap DOLPHIN_PNL_BLUE contains today's entry

FAIL criteria for MIG0: Any CI failure, any container in restart loop, zero trades on valid scan date.

MIG0 GATE: CI 24/24 + all 3 infra checks green. Only then proceed to MIG1.


MIG1 — Prefect SITARA: All Subsystems as Flows + State Persistence

Goal: Separate the "slow-thinking" (macro, orchestration) from the "fast-doing" (engine, execution). All support subsystems run as independent Prefect flows with retry logic. Capital persists across daily runs.

Spec reference: Sec IV (Prefect SITARA), "slow-thinking / fast-doing separation."

Why now: State persistence eliminates the #1 known limitation (restarts at 25k daily). Subsystem flows give observability + retry without coupling to the trading flow.

MIG1.1 — Capital State Persistence (DAL-A)

What to build: At flow start, restore capital from HZ. At flow end, write capital + drawdown + session summary back to HZ. If HZ unavailable, fall back to local JSON ledger.

File to modify: prod/paper_trade_flow.py

Implementation pattern (add to flow body):

# ---- Restore capital ----
STATE_KEY = f"state_{strategy_name}_{date_str}"
state = {}  # default so the persist step below is safe even if restore fails
try:
    raw = imap_state.get(STATE_KEY) or imap_state.get('latest') or '{}'
    state = json.loads(raw)
    if state.get('strategy') == strategy_name and state.get('capital', 0) > 0:
        engine.capital = float(state['capital'])
        engine.initial_capital = float(state['capital'])
        logger.info(f"[STATE] Restored capital={engine.capital:.2f} from HZ")
except Exception as e:
    logger.warning(f"[STATE] HZ restore failed: {e} — using config capital")

# ---- Persist capital at end ----
try:
    new_state = {
        'strategy': strategy_name, 'capital': engine.capital,
        'date': date_str, 'pnl': day_result['pnl'], 'trades': day_result['trades'],
        'peak_capital': max(engine.capital, state.get('peak_capital', engine.capital)),
        'drawdown': 1.0 - engine.capital / max(engine.capital, state.get('peak_capital', engine.capital)),
    }
    imap_state.put('latest', json.dumps(new_state))
    imap_state.put(STATE_KEY, json.dumps(new_state))
except Exception as e:
    logger.error(f"[STATE] HZ persist failed: {e}")
    # Fallback: write to local JSON ledger
    ledger_path = Path(LOG_DIR) / f"state_ledger_{strategy_name}.jsonl"
    with open(ledger_path, 'a') as f:
        f.write(json.dumps(new_state) + '\n')

Test assertions (add to ci/test_06_state_persistence.py):

def test_hz_state_roundtrip():
    """Capital persists to HZ and is readable back."""
    import hazelcast, json
    c = hazelcast.HazelcastClient()
    m = c.get_map('DOLPHIN_STATE_BLUE').blocking()
    test_state = {'strategy': 'blue', 'capital': 27500.0, 'date': '2026-01-15', 'trades': 42}
    m.put('test_roundtrip', json.dumps(test_state))
    read_back = json.loads(m.get('test_roundtrip'))
    assert read_back['capital'] == 27500.0
    assert read_back['trades'] == 42
    m.remove('test_roundtrip')
    c.shutdown()

def test_capital_restoration_on_flow_start():
    """If HZ has prior state, engine.capital is set correctly."""
    # This tests the restore logic in isolation (mock HZ IMap)
    from unittest.mock import MagicMock
    import json
    stored = {'strategy': 'blue', 'capital': 28000.0}
    imap = MagicMock()
    imap.get = MagicMock(return_value=json.dumps(stored))
    # ... instantiate engine, run restore logic, assert engine.capital == 28000.0
    # (see ci/test_06_state_persistence.py for full implementation)

PASS criteria: Capital from day N is used as starting capital for day N+1. If HZ unavailable, local ledger file written. No crash if ledger file missing.

MIG1.2 — ExF Fetcher Flow

What to build: A standalone Prefect flow exf_fetcher_flow.py that fetches all 14 ExF indicators (FRED, Deribit, F&G, etc.) and writes results to HZ IMap DOLPHIN_FEATURES under key exf_latest.

File to create: prod/exf_fetcher_flow.py

Key design points:

  • Runs daily at 23:00 UTC (before paper trade at 00:05 UTC next day)
  • Uses existing external_factors/ modules
  • Writes {indicator_name: value, 'timestamp': iso_str, 'date': YYYY-MM-DD} to HZ
  • If fetch fails for any indicator: log warning, write None for that indicator, do NOT crash
  • Separate task per indicator family (FRED, Deribit, F&G) for retry isolation

Flow skeleton:

@flow(name="exf-fetcher")
def exf_fetcher_flow(date_str: str = None):
    date_str = date_str or datetime.now(timezone.utc).strftime('%Y-%m-%d')
    results = {}
    results.update(fetch_fred_indicators(date_str))       # task
    results.update(fetch_deribit_funding(date_str))       # task
    results.update(fetch_fear_and_greed(date_str))        # task
    write_exf_to_hz(date_str, results)                    # task
    return results
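One way to structure an indicator-family task so a single failed indicator degrades to None instead of failing the flow. This is a sketch: `fred_fetch` is a stand-in for the real `external_factors` helper, and the indicator names are a subset of the list above. In `prod/exf_fetcher_flow.py` the function would carry `@task(retries=3, retry_delay_seconds=60)` for retry isolation.

```python
def fred_fetch(series: str, date_str: str) -> float:
    """Stand-in for the real external_factors FRED helper (assumption)."""
    raise NotImplementedError("wire to external_factors in prod")

# In prod/exf_fetcher_flow.py this would be decorated:
#   @task(retries=3, retry_delay_seconds=60)
def fetch_fred_indicators(date_str: str) -> dict:
    """Fetch the FRED family; a failed indicator becomes None, never an exception."""
    results = {}
    for name in ('claims', 'us10y', 'ycurve', 'm2'):
        try:
            results[name] = fred_fetch(name, date_str)
        except Exception:
            results[name] = None  # partial failure must not crash the flow
    return results
```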

Test assertions (add to ci/test_07_exf_flow.py):

def test_exf_flow_runs_without_crash():
    """ExF flow completes even if some APIs fail (returns partial results)."""
    result = exf_fetcher_flow(date_str='2026-01-15')
    assert isinstance(result, dict)
    # Core FRED indicators that were working:
    # claims, us10y, ycurve, stables, m2, hashrate, usdc, vol24
    # At least half should be present (some APIs may be down)
    non_none = sum(1 for v in result.values() if v is not None)
    assert non_none >= 4, f"Too many ExF indicators failed: {result}"

def test_exf_hz_write():
    """ExF results are readable from HZ after flow runs."""
    import hazelcast, json
    c = hazelcast.HazelcastClient()
    m = c.get_map('DOLPHIN_FEATURES').blocking()
    val = m.get('exf_latest')
    if val is None:
        pytest.skip("ExF flow has not run yet")
    data = json.loads(val)
    assert 'timestamp' in data
    assert 'date' in data
    c.shutdown()

PASS criteria: Flow completes (exit 0) even with partial API failures. Results written to HZ. paper_trade_flow.py reads ExF from HZ (not from disk NPZ fallback) on next run.

MIG1.3 — MC-Forewarner as Prefect Flow

What to build: Wrap mc_forewarning_service.py daemon as a Prefect flow that runs every 4 hours and writes its state to HZ IMap DOLPHIN_FEATURES key mc_forewarner_latest.

File to create: prod/mc_forewarner_flow.py

Key design:

  • Schedule: Cron("0 */4 * * *") (every 4 hours)
  • Runs DolphinForewarner with current champion params
  • Writes {'status': 'GREEN'|'ORANGE'|'RED', 'catastrophic_prob': float, 'envelope_score': float, 'timestamp': iso} to HZ
  • paper_trade_flow.py reads MC state from HZ (already does this via staleness check)
  • Add staleness gate: if MC timestamp > 6 hours old, treat as ORANGE (structural degradation Cat 2)
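The staleness gate can be a small pure function. A minimal sketch, assuming the MC state dict carries an ISO-8601 `timestamp` as described above; only a fresh-looking GREEN is degraded (ORANGE and RED are already at least as conservative).

```python
from datetime import datetime, timezone, timedelta

STALE_AFTER_HOURS = 6.0  # per the Cat-2 staleness rule

def get_effective_mc_status(state: dict, now: datetime = None) -> str:
    """Degrade a stale MC-Forewarner GREEN to ORANGE (sketch of the staleness gate)."""
    if not state or 'timestamp' not in state:
        return 'ORANGE'  # missing data is itself a structural degradation
    now = now or datetime.now(timezone.utc)
    ts = datetime.fromisoformat(state['timestamp'])
    age_hours = (now - ts).total_seconds() / 3600.0
    if age_hours > STALE_AFTER_HOURS and state['status'] == 'GREEN':
        return 'ORANGE'
    return state['status']
```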

Test assertions (add to ci/test_08_mc_flow.py):

def test_mc_forewarner_flow_runs():
    """MC-Forewarner flow produces a valid status."""
    result = mc_forewarner_flow()
    assert result['status'] in ('GREEN', 'ORANGE', 'RED')
    assert 0.0 <= result['catastrophic_prob'] <= 1.0
    assert 'timestamp' in result

def test_mc_staleness_gate():
    """MC state older than 6 hours is treated as ORANGE, not GREEN."""
    from datetime import timedelta
    stale_ts = (datetime.now(timezone.utc) - timedelta(hours=7)).isoformat()
    stale_state = {'status': 'GREEN', 'timestamp': stale_ts}
    effective = get_effective_mc_status(stale_state)
    assert effective == 'ORANGE', "Stale MC should degrade to ORANGE"

PASS criteria: MC flow runs on schedule, writes to HZ, paper_trade_flow.py correctly reads MC status. Staleness detected and status degraded to ORANGE after 6h.

MIG1.4 — Watchdog Flow

What to build: A Prefect flow watchdog_flow.py that runs every 10 minutes (not 10 seconds — Windows Prefect scheduling granularity), checks all system components, and writes DOLPHIN_SYSTEM_HEALTH to HZ.

Checks performed:

  • HZ cluster quorum (>= 1 node alive)
  • Prefect worker responsive
  • Scan data freshness (latest scan date <= 2 days ago)
  • Paper log freshness (last JSONL entry <= 2 days old)
  • Docker containers running

Flow skeleton:

@flow(name="watchdog")
def watchdog_flow():
    checks = {
        'hz': check_hz_quorum(),         # task
        'prefect': check_prefect_api(),  # task
        'scans': check_scan_freshness(), # task
        'logs': check_log_freshness(),   # task
    }
    overall = 'GREEN' if all(v == 'OK' for v in checks.values()) else 'DEGRADED'
    health = {
        **checks,
        'timestamp': datetime.now(timezone.utc).isoformat(),
        'overall': overall,
    }
    write_hz_health(health)  # task
    if overall == 'DEGRADED':
        logger.warning(f"[WATCHDOG] System degraded: {health}")
    return health

Test assertions (ci/test_09_watchdog.py):

def test_watchdog_detects_all_ok():
    result = watchdog_flow()
    assert result['overall'] in ('GREEN', 'DEGRADED')
    assert 'timestamp' in result
    # At minimum, HZ and Prefect should be OK in test environment
    assert result['hz'] == 'OK'
    assert result['prefect'] == 'OK'

def test_watchdog_writes_to_hz():
    import hazelcast, json
    watchdog_flow()
    c = hazelcast.HazelcastClient()
    m = c.get_map('DOLPHIN_SYSTEM_HEALTH').blocking()
    h = json.loads(m.get('latest'))
    assert h['overall'] in ('GREEN', 'DEGRADED')
    c.shutdown()

PASS criteria: Watchdog runs on schedule, writes health to HZ, operator can see system status at HZ-MC UI without reading logs.

MIG1 GATE

All of the following must pass before MIG2:

bash ci/run_ci.sh  # original 24 tests
pytest ci/test_06_state_persistence.py ci/test_07_exf_flow.py ci/test_08_mc_flow.py ci/test_09_watchdog.py -v

PASS criteria: 24 + 8 (new) = 32 tests green. Capital from prior day visible in HZ after manual paper run. MC-Forewarner status readable from HZ. Watchdog health GREEN.


MIG2 — Hazelcast IMDG: Feature Store + Live OB + Entry Processors

Goal: Replace file-based feature passing with a sub-millisecond in-memory feature store. Enable atomic ACB state updates. Replace MockOBProvider with live Binance WebSocket OB data.

Spec reference: Sec III (Hazelcast IMDG — DOLPHIN_FEATURES, Near Cache, Jet, Entry Processors).

Architecture: "Engine Room" — hot feature state that the trading engine reads without network overhead via Near Cache. The engine reads features, not files.

MIG2.1 — DOLPHIN_FEATURES IMap + Near Cache

What to build: Schema for the HZ feature store and Near Cache configuration.

IMap key schema:

DOLPHIN_FEATURES:
  "exf_latest"          → JSON dict: {indicator_name: value, timestamp, date}
  "mc_forewarner_latest" → JSON dict: {status, catastrophic_prob, envelope_score, timestamp}
  "acb_state"           → JSON dict: {boost, beta, w750_threshold, p60, last_date}
  "vol_regime"          → JSON dict: {vol_p60: float, current_vol: float, regime_ok: bool, timestamp}
  "asset_{SYMBOL}_ob"   → JSON dict: {imbalance, fill_prob, depth_quality, agreement, timestamp}
  "scan_latest"         → JSON dict: {date, vel_div_mean, vel_div_min, asset_count, timestamp}
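Every value in DOLPHIN_FEATURES is a JSON string carrying a timestamp. A small helper pair (a sketch, not existing code) keeps writers and readers on that convention; `features_map` is any object with the blocking IMap `put`/`get` interface.

```python
import json
from datetime import datetime, timezone

def put_feature(features_map, key: str, payload: dict) -> None:
    """Write a feature dict as a JSON string, stamping it with UTC time."""
    payload = {**payload, 'timestamp': datetime.now(timezone.utc).isoformat()}
    features_map.put(key, json.dumps(payload))

def get_feature(features_map, key: str, default=None):
    """Read a feature back into a dict; default if never written."""
    raw = features_map.get(key)
    return json.loads(raw) if raw is not None else default
```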

Near Cache configuration (add to HZ client init in paper_trade_flow.py):

client = hazelcast.HazelcastClient(
    cluster_members=["localhost:5701"],
    near_caches={
        "DOLPHIN_FEATURES": {
            "invalidate_on_change": True,
            "time_to_live_seconds": 300,   # 5 min TTL
            "max_idle_seconds": 60,
            "eviction_policy": "LRU",
            "max_size": 5000,
        }
    }
)

Test assertions (ci/test_10_hz_feature_store.py):

def test_near_cache_read_latency():
    """Near Cache reads complete in <1ms after first warm read."""
    import time, hazelcast, json
    c = hazelcast.HazelcastClient(near_caches={"DOLPHIN_FEATURES": {...}})
    m = c.get_map('DOLPHIN_FEATURES').blocking()
    m.put('test_nc', json.dumps({'val': 42}))
    m.get('test_nc')  # warm the cache
    t0 = time.perf_counter()
    for _ in range(100):
        m.get('test_nc')
    elapsed_per_call = (time.perf_counter() - t0) / 100
    assert elapsed_per_call < 0.001, f"Near Cache too slow: {elapsed_per_call*1000:.2f}ms"
    c.shutdown()

def test_feature_store_schema():
    """All required keys are writable and readable in correct schema."""
    import hazelcast, json
    c = hazelcast.HazelcastClient()
    m = c.get_map('DOLPHIN_FEATURES').blocking()
    for key in ['exf_latest', 'mc_forewarner_latest', 'acb_state', 'vol_regime']:
        m.put(f'test_{key}', json.dumps({'test': True, 'timestamp': '2026-01-01T00:00:00Z'}))
        val = m.get(f'test_{key}')
        assert val is not None
        m.remove(f'test_{key}')
    c.shutdown()

MIG2.2 — ACB Entry Processor (Atomic State Update)

What to build: An Entry Processor that updates the ACB boost atomically in HZ without a full read-modify-write round trip. Critical for sub-day ACB updates when new scan bars arrive.

# prod/hz_entry_processors.py
import hazelcast

class ACBBoostUpdateProcessor(hazelcast.serialization.api.IdentifiedDataSerializable):
    """Atomically update ACB boost + beta in DOLPHIN_FEATURES without read-write round trip.

    NOTE: entry processors execute on the Hazelcast member (JVM). The Python client
    only serializes the parameters; a matching Java class with the same
    FACTORY_ID/CLASS_ID must be registered on the cluster. The process() method
    below documents the intended server-side logic.
    """
    FACTORY_ID = 1
    CLASS_ID = 1

    def __init__(self, new_boost=None, new_beta=None, date_str=None):
        self.new_boost = new_boost
        self.new_beta = new_beta
        self.date_str = date_str

    def process(self, entry):
        import json
        current = json.loads(entry.value or '{}')
        if self.new_boost is not None:
            current['boost'] = self.new_boost
        if self.new_beta is not None:
            current['beta'] = self.new_beta
        current['last_updated'] = self.date_str
        entry.set_value(json.dumps(current))

    def write_data(self, object_data_output):
        object_data_output.write_float(self.new_boost or 0.0)
        object_data_output.write_float(self.new_beta or 0.0)
        object_data_output.write_utf(self.date_str or '')

    def read_data(self, object_data_input):
        self.new_boost = object_data_input.read_float()
        self.new_beta = object_data_input.read_float()
        self.date_str = object_data_input.read_utf()

    def get_factory_id(self): return self.FACTORY_ID
    def get_class_id(self): return self.CLASS_ID

Test assertions:

def test_acb_entry_processor_atomic():
    """Entry processor updates ACB state without race condition."""
    import hazelcast, json
    c = hazelcast.HazelcastClient()
    m = c.get_map('DOLPHIN_FEATURES').blocking()
    m.put('acb_state', json.dumps({'boost': 1.0, 'beta': 0.5}))
    processor = ACBBoostUpdateProcessor(new_boost=1.35, new_beta=0.7, date_str='2026-01-15')
    m.execute_on_key('acb_state', processor)
    result = json.loads(m.get('acb_state'))
    assert result['boost'] == 1.35
    assert result['beta'] == 0.7
    c.shutdown()

MIG2.3 — Live OB: Replace MockOBProvider

What to build: Wire ob_stream_service.py WebSocket feed into paper_trade_flow.py to replace MockOBProvider. OB features written to HZ per-asset under asset_{SYMBOL}_ob.

Implementation:

  1. ob_stream_service.py already verified live on Binance Futures WebSocket
  2. Start OB service as a background thread or separate Prefect flow at run start
  3. OB service writes per-asset OB snapshot to HZ every 5 seconds
  4. run_engine_day task reads OB from HZ Near Cache instead of MockOBProvider
  5. Graceful fallback: if asset OB data missing or stale (>30s), use neutral values (imbalance=0, fill_prob=0.5)

OB data schema in HZ:

ob_snapshot = {
    'imbalance': float,       # (bid_vol - ask_vol) / (bid_vol + ask_vol), range [-1, 1]
    'fill_prob': float,       # maker fill probability, range [0, 1]
    'depth_quality': float,   # normalized depth, range [0, 1]
    'agreement': float,       # OB trend agreement, range [-1, 1]
    'timestamp': iso_str,     # when this snapshot was taken
    'stale': bool,            # True if >30s since last update
}
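The `HZOBProvider` referenced in step 4 could look like the following sketch. The neutral-fallback values come from step 5; the map interface mirrors a blocking hazelcast IMap `get()`, and the class name matches the one used later in MIG2.4, but this is an illustrative implementation, not the shipped one.

```python
import json
from datetime import datetime, timezone, timedelta

NEUTRAL_OB = {'imbalance': 0.0, 'fill_prob': 0.5, 'depth_quality': 0.5,
              'agreement': 0.0, 'stale': True}

class HZOBProvider:
    """Serve per-asset OB snapshots from DOLPHIN_FEATURES, with neutral fallback."""

    def __init__(self, features_map, staleness_threshold_sec: float = 30.0):
        self.features = features_map
        self.staleness_threshold_sec = staleness_threshold_sec

    def get_ob(self, symbol: str) -> dict:
        raw = self.features.get(f'asset_{symbol}_ob')
        if raw is None:
            return dict(NEUTRAL_OB)  # no data yet → neutral
        snap = json.loads(raw)
        ts = datetime.fromisoformat(snap['timestamp'])
        age = (datetime.now(timezone.utc) - ts).total_seconds()
        if age > self.staleness_threshold_sec:
            return dict(NEUTRAL_OB)  # stale → neutral, per the fallback rule
        return snap
```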

Test assertions (ci/test_11_live_ob.py):

def test_ob_stream_connects():
    """OB stream service connects to Binance Futures WebSocket without error."""
    # Start OB service for BTCUSDT only, run for 10 seconds, check HZ for data
    import threading, time, hazelcast, json
    from ob_stream_service import OBStreamService
    c = hazelcast.HazelcastClient()
    m = c.get_map('DOLPHIN_FEATURES').blocking()
    svc = OBStreamService(symbols=['BTCUSDT'], hz_map=m)
    t = threading.Thread(target=svc.start, daemon=True)
    t.start()
    time.sleep(10)
    svc.stop()
    val = m.get('asset_BTCUSDT_ob')
    assert val is not None, "OB data not written to HZ after 10s"
    data = json.loads(val)
    assert -1.0 <= data['imbalance'] <= 1.0
    assert 0.0 <= data['fill_prob'] <= 1.0
    c.shutdown()

def test_ob_stale_fallback():
    """Engine uses neutral OB values when OB data is stale."""
    # Inject stale OB snapshot, verify engine uses fallback (imbalance=0, fill_prob=0.5)
    ...
    assert ob_features.imbalance == 0.0
    assert ob_features.fill_prob == 0.5

PASS criteria: |imbalance| < 0.3 on typical market conditions (confirmed in spec). Live OB replaces Mock. Expected result: 10-15% reduction in daily P&L variance (per 55-day OB validation: σ² reduced 15.35%).

MIG2.4 — paper_trade_flow.py reads all features from HZ

What to build: Refactor paper_trade_flow.py so run_engine_day reads ExF, MC state, OB, vol regime all from DOLPHIN_FEATURES HZ IMap (via Near Cache) instead of computing or loading from disk.

@task(persist_result=False)
def run_engine_day(date_str, scan_df, pt_cfg, strategy_name):
    client = hazelcast.HazelcastClient(near_caches={"DOLPHIN_FEATURES": {...}})
    features = client.get_map('DOLPHIN_FEATURES').blocking()

    # Read from HZ instead of computing inline
    mc_raw = features.get('mc_forewarner_latest')
    mc_status = json.loads(mc_raw)['status'] if mc_raw else 'GREEN'
    mc_status = get_effective_mc_status(mc_status, mc_raw)  # staleness check

    vol_raw = features.get('vol_regime')
    vol_ok = json.loads(vol_raw)['regime_ok'] if vol_raw else True

    # Pass OB provider backed by HZ
    ob_provider = HZOBProvider(features, staleness_threshold_sec=30)
    engine.set_ob_provider(ob_provider)
    ...

MIG2 GATE

bash ci/run_ci.sh
pytest ci/test_10_hz_feature_store.py ci/test_11_live_ob.py -v

PASS criteria: 32 + 4 (new) = 36 tests green. Live OB data flowing to HZ. Engine reads all features from HZ. No MockOBProvider in paper_trade_flow.py. Capital persisted day-over-day (verify manually over 3 consecutive days).


MIG3 — Survival Stack: Graceful Degradation (Control Theory)

Goal: Replace binary "up/down" thinking with a continuous, multiplicative risk controller. The system degrades gracefully under component failure rather than stopping or operating at full risk.

Spec reference: Sec VI (Control Theory Survival Stack), 5 categories, 4 postures, hysteresis.

Design: All 5 category multipliers multiply together to produce a final Rm (risk multiplier). Rm then maps to one of 4 operational postures. Hysteresis prevents rapid posture oscillation.

MIG3.1 — 5-Category Risk Multiplier (Rm)

File to create: nautilus_dolphin/nautilus_dolphin/nautilus/survival_stack.py

Rm = Cat1 × Cat2 × Cat3 × Cat4 × Cat5

Cat1 — Invariants (binary kill, <10ms response):
  Input: HZ quorum status + Nautilus heartbeat
  Rule: if HZ_nodes < 1 OR heartbeat_age > 30s → Rm1 = 0.0 (HIBERNATE)
  Else: Rm1 = 1.0

Cat2 — Structural (MC-Forewarner staleness + status):
  Input: MC status (GREEN/ORANGE/RED) + timestamp age
  GREEN, fresh → Rm2 = 1.0
  ORANGE, fresh → Rm2 = 0.5
  RED, fresh → Rm2 = 0.1 (exits only)
  Any status, stale (>6h) → base value decays exponentially in hours past the 6h budget
  Rule: Rm2 = base_rm2 × exp(-max(0, staleness_hours - 6) / 3.0)

Cat3 — Microstructure (OB jitter/depth):
  Input: OB depth_quality + fill_prob + imbalance stability
  OB healthy (depth_quality >= 0.7 AND fill_prob >= 0.7) → Rm3 = 1.0
  OB jittery (fill_prob < 0.3 or depth_quality < 0.3) → Rm3 = 0.3 (passive quoting only)
  OB stale (>30s) → Rm3 = 0.5
  Rule: Rm3 = clip(0.3 + 1.75 * (min(depth_quality, fill_prob) - 0.3), 0.3, 1.0)

Cat4 — Environmental (DVOL spike):
  Input: DVOL (Deribit BTC implied vol 30-day)
  Baseline DVOL (no spike) → Rm4 = 1.0
  DVOL spike detected (>2σ above 30-day mean) → Rm4 drops to 0.3 immediately (fast attack)
  Recovery: Rm4 recovers toward 1.0 over ~60 minutes (slow exponential recovery, τ ≈ 20 min)
  Rule: impulse-decay — Rm4 = 0.3 + (1.0 - 0.3) * (1 - exp(-t_since_spike_min / 20))

Cat5 — Capital Stress (sigmoid on drawdown):
  Input: current_drawdown = 1 - capital / peak_capital
  Rule: Rm5 = 1 / (1 + exp(20 * (drawdown - 0.12)))
  Effect: Rm5 ≈ 0.92 at DD=0, ≈ 0.80 at DD=5%, ≈ 0.5 at DD=12%, ≈ 0.17 at DD=20%
  No cliff — continuous degradation as DD increases

Final: Rm = Rm1 × Rm2 × Rm3 × Rm4 × Rm5
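The five category rules above can be sketched as one pure function. This is illustrative, not the shipped `survival_stack.py`: the constants (6h staleness budget, τ = 20 min, sigmoid steepness 20) are the ones stated above and should be treated as tunables.

```python
import math

def compute_rm(hz_nodes, heartbeat_age_s,
               mc_status, mc_staleness_hours,
               ob_depth_quality, ob_fill_prob, ob_stale,
               dvol_spike, t_since_spike_min,
               drawdown):
    """Sketch of the 5-category risk multiplier; returns (Rm, per-category breakdown)."""
    # Cat1 — invariants: binary kill
    cat1 = 0.0 if (hz_nodes < 1 or heartbeat_age_s > 30) else 1.0
    # Cat2 — structural: MC status with exponential decay past the 6h staleness budget
    base2 = {'GREEN': 1.0, 'ORANGE': 0.5, 'RED': 0.1}[mc_status]
    cat2 = base2 * math.exp(-max(0.0, mc_staleness_hours - 6.0) / 3.0)
    # Cat3 — microstructure: continuous in OB quality, floored at passive-only
    if ob_stale:
        cat3 = 0.5
    else:
        cat3 = min(1.0, max(0.3, 0.3 + 1.75 * (min(ob_depth_quality, ob_fill_prob) - 0.3)))
    # Cat4 — environmental: impulse drop on DVOL spike, exponential recovery
    cat4 = 0.3 + 0.7 * (1.0 - math.exp(-t_since_spike_min / 20.0)) if dvol_spike else 1.0
    # Cat5 — capital stress: sigmoid centered at 12% drawdown
    cat5 = 1.0 / (1.0 + math.exp(20.0 * (drawdown - 0.12)))
    breakdown = {'Cat1': cat1, 'Cat2': cat2, 'Cat3': cat3, 'Cat4': cat4, 'Cat5': cat5}
    return cat1 * cat2 * cat3 * cat4 * cat5, breakdown
```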

MIG3.2 — 4 Operational Postures + Hysteresis

Rm → Posture mapping (with hysteresis deadband):
  Rm >= 0.85 → APEX    (abs_max_lev=6x, aggressive, full signal)
  Rm >= 0.40 → STALKER (abs_max_lev=2x, limit orders only)
  Rm >= 0.10 → TURTLE  (passive only, existing positions exit, no new entries)
  Rm <  0.10 → HIBERNATE (all-stop: close all positions, no new signals)

Hysteresis (Schmitt trigger):
  To DOWNGRADE (e.g., APEX → STALKER): threshold crossed + sustained for 2 consecutive checks
  To UPGRADE (e.g., STALKER → APEX): threshold exceeded + sustained for 5 consecutive checks
  Purpose: prevent rapid posture oscillation on noisy Rm boundary

Rm written to HZ DOLPHIN_SAFETY AtomicReference:
  {'posture': 'APEX'|'STALKER'|'TURTLE'|'HIBERNATE', 'Rm': float, 'timestamp': iso,
   'breakdown': {'Cat1': float, 'Cat2': float, 'Cat3': float, 'Cat4': float, 'Cat5': float}}
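The Schmitt-trigger behaviour can be isolated in a small posture tracker. A sketch using the thresholds and hysteresis counts above; class and method names are illustrative, not the shipped API.

```python
THRESHOLDS = [(0.85, 'APEX'), (0.40, 'STALKER'), (0.10, 'TURTLE')]
RANK = {'HIBERNATE': 0, 'TURTLE': 1, 'STALKER': 2, 'APEX': 3}

def raw_posture(rm: float) -> str:
    """Map Rm to a posture without hysteresis."""
    for threshold, name in THRESHOLDS:
        if rm >= threshold:
            return name
    return 'HIBERNATE'

class PostureTracker:
    """Hold the current posture; downgrade after 2 sustained checks, upgrade after 5."""

    def __init__(self, hysteresis_down: int = 2, hysteresis_up: int = 5, start: str = 'APEX'):
        self.posture = start
        self.down_n, self.up_n = hysteresis_down, hysteresis_up
        self._candidate, self._streak = start, 0

    def update(self, rm: float) -> str:
        target = raw_posture(rm)
        if target == self.posture:
            self._candidate, self._streak = self.posture, 0  # boundary noise resets
            return self.posture
        if target == self._candidate:
            self._streak += 1
        else:
            self._candidate, self._streak = target, 1
        needed = self.down_n if RANK[target] < RANK[self.posture] else self.up_n
        if self._streak >= needed:
            self.posture, self._streak = target, 0
        return self.posture
```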

MIG3.3 — Integration into paper_trade_flow.py

run_engine_day reads posture from HZ before any engine action:

safety_ref = client.cp_subsystem.get_atomic_reference('DOLPHIN_SAFETY').blocking()
safety_state = json.loads(safety_ref.get() or '{}')
posture = safety_state.get('posture', 'APEX')
Rm = safety_state.get('Rm', 1.0)

if posture == 'HIBERNATE':
    logger.critical("[POSTURE] HIBERNATE — no trades today")
    return {'pnl': 0.0, 'trades': 0, 'posture': 'HIBERNATE'}

# Apply Rm to abs_max_leverage
effective_max_lev = pt_cfg['abs_max_leverage'] * Rm
engine.abs_max_leverage = max(1.0, effective_max_lev)

if posture == 'STALKER':
    engine.abs_max_leverage = min(engine.abs_max_leverage, 2.0)
elif posture == 'TURTLE':
    # No new entries — only manage existing positions
    engine.accept_new_entries = False

Test assertions (ci/test_12_survival_stack.py):

def test_rm_calculation_all_green():
    """All-green conditions → Rm near 1.0 (capped by the Cat5 sigmoid), posture = APEX."""
    ss = SurvivalStack(...)
    Rm, breakdown = ss.compute_rm(
        hz_nodes=1, heartbeat_age_s=1.0,
        mc_status='GREEN', mc_staleness_hours=0.5,
        ob_depth_quality=0.9, ob_fill_prob=0.8, ob_stale=False,
        dvol_spike=False, t_since_spike_min=999,
        drawdown=0.03,
    )
    # The Cat5 sigmoid gives ~0.86 at DD=3%, so Rm lands just above the APEX floor
    assert Rm >= 0.85, f"Expected APEX-range Rm, got {Rm}"
    assert breakdown['Cat1'] == 1.0
    assert breakdown['Cat5'] >= 0.85

def test_rm_hz_down_triggers_hibernate():
    """HZ quorum=0 → Cat1=0 → Rm=0 → HIBERNATE."""
    ss = SurvivalStack(...)
    Rm, _ = ss.compute_rm(hz_nodes=0, ...)
    assert Rm == 0.0
    assert ss.get_posture(Rm) == 'HIBERNATE'

def test_rm_drawdown_sigmoid():
    """Drawdown 12% → Rm5 ≈ 0.5."""
    ss = SurvivalStack(...)
    Rm5 = ss._cat5_capital_stress(drawdown=0.12)
    assert 0.4 <= Rm5 <= 0.6, f"Sigmoid expected ~0.5 at DD=12%, got {Rm5}"

def test_rm_dvol_spike_impulse_decay():
    """DVOL spike → Cat4=0.3. After 60min → Cat4≈1.0."""
    ss = SurvivalStack(...)
    assert ss._cat4_dvol(dvol_spike=True, t_since_spike_min=0) == pytest.approx(0.3, abs=0.05)
    assert ss._cat4_dvol(dvol_spike=True, t_since_spike_min=60) >= 0.9

def test_hysteresis_prevents_oscillation():
    """Rm oscillating at boundary does not cause rapid posture flips."""
    ss = SurvivalStack(hysteresis_down=2, hysteresis_up=5)
    postures = []
    for Rm in [0.84, 0.86, 0.84, 0.86, 0.84]:  # oscillating around APEX/STALKER boundary
        postures.append(ss.update_posture(Rm))
    # Should NOT oscillate — hysteresis holds the prior posture
    assert len(set(postures)) == 1, f"Hysteresis failed — postures: {postures}"

def test_posture_written_to_hz():
    """Posture and Rm are written to HZ DOLPHIN_SAFETY AtomicReference."""
    import hazelcast, json
    ss = SurvivalStack(...)
    Rm, _ = ss.compute_rm(...)
    ss.write_to_hz(Rm)
    c = hazelcast.HazelcastClient()
    ref = c.cp_subsystem.get_atomic_reference('DOLPHIN_SAFETY').blocking()
    state = json.loads(ref.get())
    assert state['posture'] in ('APEX', 'STALKER', 'TURTLE', 'HIBERNATE')
    assert 0.0 <= state['Rm'] <= 1.0
    c.shutdown()

PASS criteria: 36 + 6 = 42 tests green. Survival stack integrates into paper_trade_flow.py. Manual test: kill Hazelcast container → HIBERNATE triggers → restart HZ → system recovers to APEX within 2 check cycles.

MIG3 GATE

bash ci/run_ci.sh
pytest ci/test_12_survival_stack.py -v

Also verify manually:

  • Simulate MC-Forewarner returning RED → STALKER posture, max_lev=2x
  • Simulate drawdown 15% in ledger → Rm5 ≈ 0.35, posture degrades
  • System recovers gracefully when conditions improve (hysteresis up threshold met)

MIG4 — Nautilus-Trader Integration: Rust Execution Core

Goal: Replace the Python paper trading loop with Nautilus-Trader as the execution engine. NDAlphaEngine becomes a Nautilus Actor. Binance Futures orders routed through Nautilus adapter. This achieves true event-driven, sub-millisecond execution.

Spec reference: Sec V (Nautilus-Trader — Actor model, AsyncDataEngine, Rust networking, zero-copy Arrow).

Why Nautilus: Rust core, zero-copy Arrow data transport, proper Actor isolation, production-grade risk management. The Python engine (paper_trade_flow.py) was always a stepping stone.

MIG4.1 — NautilusActor Wrapper

Prereq: pip install nautilus_trader>=1.224 in Siloqy venv.

File to create: nautilus_dolphin/nautilus_dolphin/nautilus/nautilus_actor.py

Key design:

  • NautilusActor wraps NDAlphaEngine
  • Subscribes to bar data (5-second OHLCV bars for all 50 assets)
  • On each bar: updates eigenvalue features from HZ Near Cache
  • On each scan completion (5-minute window): calls engine.process_bar()
  • Orders submitted via Nautilus OrderFactory → Binance Futures adapter
  • Actor reads posture from HZ DOLPHIN_SAFETY before each order submission
from nautilus_trader.trading.actor import Actor
from nautilus_trader.model.data import Bar, BarType
from nautilus_trader.model.orders import MarketOrder, LimitOrder
from nautilus_trader.common.clock import LiveClock
from nautilus_trader.core.message import Event

class DolphinActor(Actor):
    def __init__(self, engine: NDAlphaEngine, hz_features_map, config):
        super().__init__(config)
        self.engine = engine
        self.hz = hz_features_map
        self._bar_buffer = {}  # symbol → list of bars

    def on_start(self):
        # Subscribe to 5s bars for all assets
        for symbol in self.engine.asset_columns:
            bar_type = BarType.from_str(f"{symbol}.BINANCE-5-SECOND-LAST-EXTERNAL")
            self.subscribe_bars(bar_type)

    def on_bar(self, bar: Bar):
        symbol = bar.bar_type.instrument_id.symbol.value
        self._bar_buffer.setdefault(symbol, []).append(bar)
        if self._should_process(bar):
            self._run_engine_on_bar_batch()

    def _run_engine_on_bar_batch(self):
        # Posture snapshot mirrored into the HZ feature map (Near Cache backed)
        posture_raw = self.hz.get('DOLPHIN_SAFETY')
        safety = json.loads(posture_raw) if posture_raw else {}
        posture = safety.get('posture', 'APEX')
        if posture == 'HIBERNATE':
            return
        signals = self.engine.process_bar_batch(self._bar_buffer, Rm=safety.get('Rm', 1.0))
        for signal in signals:
            self._submit_order(signal, posture)

    def _submit_order(self, signal, posture):
        if posture == 'TURTLE':
            return  # No new entries in TURTLE
        if posture == 'STALKER':
            # Passive entry: maker-preferred limit order
            order = self.order_factory.limit(
                instrument_id=signal.instrument_id,
                order_side=signal.side,
                quantity=signal.quantity,
                price=signal.limit_price,
            )
        else:
            order = self.order_factory.market(
                instrument_id=signal.instrument_id,
                order_side=signal.side,
                quantity=signal.quantity,
            )
        self.submit_order(order)

MIG4.2 — Docker: Add Nautilus Container

File to modify: prod/docker-compose.yml

Add Nautilus-Trader container (or run as sidecar process):

services:
  dolphin-actor:
    image: nautechsystems/nautilus_trader:latest
    volumes:
      - ../nautilus_dolphin:/app/nautilus_dolphin:ro
      - ../vbt_cache:/app/vbt_cache:ro
    environment:
      - HZ_CLUSTER=hazelcast:5701
      - BINANCE_API_KEY=${BINANCE_API_KEY}
      - BINANCE_API_SECRET=${BINANCE_API_SECRET}
      - TRADING_MODE=paper  # paper = no real orders
    depends_on:
      - hazelcast
    restart: unless-stopped

For paper trading: use Nautilus Backtest Engine or SimulatedExchange (no real orders). For live: swap to BinanceFuturesDataClient + BinanceFuturesExecutionClient.

MIG4.3 — Zero-copy Arrow: HZ → Nautilus

What to build: Eigenvalue scan DataFrames passed from Prefect scanner flow → HZ → Nautilus Actor using Apache Arrow IPC (zero-copy).

# Scanner writes Arrow record batch to HZ
import pandas as pd
import pyarrow as pa
import hazelcast

schema = pa.schema([
    ('symbol', pa.string()),
    ('vel_div', pa.float64()),
    ('lambda_max_w50', pa.float64()),
    ('lambda_max_w150', pa.float64()),
    ('instability', pa.float64()),
    ('timestamp', pa.int64()),
])

def write_scan_to_hz(df: pd.DataFrame, hz_map):
    table = pa.Table.from_pandas(df, schema=schema)
    sink = pa.BufferOutputStream()
    writer = pa.ipc.new_file(sink, table.schema)
    writer.write_table(table)
    writer.close()
    arrow_bytes = sink.getvalue().to_pybytes()
    hz_map.put('scan_arrow_latest', arrow_bytes)

# Nautilus Actor reads Arrow from HZ
def read_scan_from_hz(hz_map) -> pd.DataFrame:
    raw = hz_map.get('scan_arrow_latest')
    if raw is None:
        return None
    reader = pa.ipc.open_file(pa.py_buffer(raw))
    return reader.read_all().to_pandas()
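The tests below construct a MockHZMap, which is not defined in this snippet; a minimal dict-backed sketch covering only the put/get surface that write_scan_to_hz / read_scan_from_hz actually use:

```python
class MockHZMap:
    """Minimal dict-backed stand-in for a blocking Hazelcast IMap (tests only)."""

    def __init__(self):
        self._store = {}

    def put(self, key, value):
        self._store[key] = value

    def get(self, key):
        # Hazelcast IMap.get returns None for a missing key, so mirror that
        return self._store.get(key)
```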

Test assertions (ci/test_13_nautilus_integration.py):

def test_dolphin_actor_initializes():
    """DolphinActor can be constructed with NDAlphaEngine and HZ map."""
    from nautilus_dolphin.nautilus.nautilus_actor import DolphinActor
    engine = build_test_engine()
    actor = DolphinActor(engine=engine, hz_features_map=MockHZMap(), config={})
    assert actor is not None
    assert actor.engine is engine

def test_arrow_hz_roundtrip():
    """Scan DataFrame → Arrow IPC → HZ → Arrow IPC → DataFrame is lossless."""
    import pandas as pd
    # Frame must carry every column declared in the Arrow schema,
    # or Table.from_pandas(df, schema=schema) raises on the missing fields
    df = pd.DataFrame({
        'symbol': ['BTCUSDT', 'ETHUSDT'],
        'vel_div': [-0.03, -0.01],
        'lambda_max_w50': [1.2, 0.9],
        'lambda_max_w150': [1.5, 1.0],
        'instability': [0.4, 0.2],
        'timestamp': [1736899200, 1736899201],
    })
    hz = MockHZMap()
    write_scan_to_hz(df, hz)
    df2 = read_scan_from_hz(hz)
    pd.testing.assert_frame_equal(df, df2)

def test_actor_respects_hibernate_posture():
    """DolphinActor does not submit orders when posture=HIBERNATE."""
    actor = DolphinActor(...)
    actor._posture_override = 'HIBERNATE'
    signals = actor._run_engine_on_bar_batch()
    assert signals == [] or signals is None

def test_nautilus_paper_run_no_crash():
    """NautilusTrader BacktestEngine with DolphinActor runs 1 day without crash."""
    import pandas as pd
    from nautilus_trader.backtest.engine import BacktestEngine, BacktestEngineConfig
    engine = BacktestEngine(config=BacktestEngineConfig(trader_id="DOLPHIN-001"))
    actor = DolphinActor(...)
    engine.add_actor(actor)
    engine.run(start=pd.Timestamp('2026-01-15'), end=pd.Timestamp('2026-01-16'))
    # ASSERT: runs without exception

PASS criteria: 42 + 4 = 46 tests green. DolphinActor processes one backtest day without crash. Arrow IPC roundtrip lossless. HIBERNATE posture prevents order submission.

MIG4 GATE

Manual integration test:

# Start Nautilus actor in paper mode for one day
python -m nautilus_dolphin.nautilus.run_papertrade --date 2026-01-15 --posture APEX
# ASSERT: trades > 0 logged, no crashes, capital > 0 at end
# ASSERT: orders visible in Nautilus portfolio summary

Full CI gate:

bash ci/run_ci.sh
pytest ci/test_13_nautilus_integration.py -v

MIG5 — LONG System Activation: Green Deployment

Goal: Activate bidirectional trading (SHORT + LONG) on the green deployment. Requires LONG validation result from b79rt78uv to confirm PF > 1.05 on 795-day klines.

Spec reference: LAYER_BRINGUP_PLAN.md Layer 7, green.yml config.

Prerequisites:

  • b79rt78uv result: LONG PF > 1.05 on 795-day klines, WR > 42%
  • Regime detector built: identifies when LONG conditions are active
  • Capital arbiter: assigns SHORT_weight + LONG_weight per day (sum = 1.0)

MIG5.1 — Validate LONG Result

When b79rt78uv completes, verify:

# Expected assertions from test_pf_klines_2y_long.py:
assert long_pf > 1.05        # Minimum viable LONG
assert long_wr > 0.40        # 40% win rate minimum
assert long_roi > 0.0        # Net positive over 795 days
assert long_max_dd < 0.30    # Drawdown bounded
assert long_trades > 100     # Sufficient sample size

If LONG fails (PF < 1.05): green.yml stays SHORT-only. Do not activate LONG. Research continues.
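For context, the PF/WR figures asserted above derive from the trade ledger in the standard way; a minimal sketch (the `{'pnl': ...}` trade shape is an assumption for illustration, not the ledger's actual schema):

```python
def long_metrics(trades):
    """PF / WR / net aggregates from a list of {'pnl': float} closed trades."""
    gains = sum(t['pnl'] for t in trades if t['pnl'] > 0)
    losses = -sum(t['pnl'] for t in trades if t['pnl'] < 0)
    wins = sum(1 for t in trades if t['pnl'] > 0)
    return {
        'pf': gains / losses if losses > 0 else float('inf'),  # profit factor
        'wr': wins / len(trades) if trades else 0.0,           # win rate
        'net': gains - losses,
        'trades': len(trades),
    }
```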

MIG5.2 — Regime Arbiter

What to build: capital_arbiter.py — determines SHORT_weight vs LONG_weight each day based on regime state.

class CapitalArbiter:
    def get_weights(self, date_str, features) -> dict:
        """
        Returns {'short': float, 'long': float} summing to 1.0.
        Based on: vel_div direction, BTC trend, ExF signals.
        """
        vel_div_mean = features.get('vel_div_mean', 0.0)
        btc_7bar_return = features.get('btc_7bar_return', 0.0)

        if vel_div_mean < -0.02 and btc_7bar_return < 0:
            # Strong structural breakdown — favor SHORT
            return {'short': 0.7, 'long': 0.3}
        elif vel_div_mean > 0.02 and btc_7bar_return > 0:
            # Strong recovery — favor LONG
            return {'short': 0.3, 'long': 0.7}
        else:
            # Neutral — equal weight
            return {'short': 0.5, 'long': 0.5}

MIG5.3 — green.yml and green deployment

Update prod/configs/green.yml:

direction: bidirectional  # was: short_only
long_vel_div_threshold: 0.02
long_extreme_threshold: 0.05
capital_arbiter: equal_weight  # or: regime_weighted

Register green deployment in Prefect:

PREFECT_API_URL=http://localhost:4200/api \
python -c "
from prod.paper_trade_flow import dolphin_paper_trade
dolphin_paper_trade.to_deployment(
    name='dolphin-paper-green',
    cron='10 0 * * *',  # 00:10 UTC (5 min after blue)
    parameters={'config': 'prod/configs/green.yml'},
).apply()
"

Test assertions (ci/test_14_long_system.py):

def test_long_system_requires_validation():
    """green.yml direction=bidirectional is only set after LONG PF > 1.05."""
    import yaml
    with open('prod/configs/green.yml') as f:
        cfg = yaml.safe_load(f)
    if cfg.get('direction') == 'bidirectional':
        # If bidirectional is set, LONG validation must have passed
        assert cfg.get('long_vel_div_threshold', 0) > 0, "LONG threshold not set"
        assert cfg.get('long_extreme_threshold', 0) > 0, "LONG extreme threshold not set"

def test_capital_arbiter_weights_sum_to_one():
    arb = CapitalArbiter()
    for scenario in [
        {'vel_div_mean': -0.05, 'btc_7bar_return': -0.01},
        {'vel_div_mean': +0.05, 'btc_7bar_return': +0.01},
        {'vel_div_mean': 0.0, 'btc_7bar_return': 0.0},
    ]:
        w = arb.get_weights('2026-01-15', scenario)
        assert abs(w['short'] + w['long'] - 1.0) < 1e-6
        assert w['short'] > 0 and w['long'] > 0

def test_green_engine_fires_long_trades():
    """Green deployment engine fires LONG trades on LONG signal days."""
    # Use a scan date where vel_div > 0.02 (LONG signal)
    # ASSERT: engine produces trades with direction=+1
    ...

PASS criteria: 46 + 3 = 49 tests green. Green deployment running alongside blue. Capital arbiter weights summing to 1.0. Both SHORT and LONG trades logged.


MIG6 — Hazelcast Jet: Reactive ACB Stream Processing

Goal: Replace batch ACB preload (once daily) with reactive sub-day ACB that updates on each new scan bar. HZ Jet pipeline processes eigenvalue stream, updates ACB state atomically via Entry Processor. Sub-day ACB enables adverse-turn exits within the trading day.

Spec reference: Sec III (Hazelcast Jet stream processing), Phase MIG6.

Impact: Per 55-day research, sub-day ACBv6 has +3-4% ROI potential. Currently not implemented in ND engine path.

MIG6.1 — Jet Pipeline Design

[ARB512 Scanner writes JSON]
    → [File watcher (Prefect sensor flow)]
    → [Publishes scan to HZ Jet Topic "dolphin.scan.bars"]
    → [Jet pipeline: eigenvalue processor]
    → [Computes vel_div, update volatility, update ACB boost]
    → [ACBBoostUpdateProcessor (Entry Processor) → DOLPHIN_FEATURES "acb_state"]
    → [Nautilus Actor reads updated ACB state via Near Cache]

MIG6.2 — Scan File Watcher Prefect Flow

File to create: prod/scan_watcher_flow.py

@flow(name="scan-watcher")
def scan_watcher_flow():
    """Polls the eigenvalues dir for new scan files. Publishes to HZ Jet topic."""
    import glob, time
    last_seen = set()
    while True:
        current = set(glob.glob(f"{SCANS_DIR}/*/*.json"))
        for f in sorted(current - last_seen):
            publish_scan_to_jet(f)  # task
        last_seen = current
        time.sleep(5)
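publish_scan_to_jet is referenced above but not shown; a minimal sketch with the topic injected so it can be tested without a cluster (this signature is an assumption — the production task would obtain the 'dolphin.scan.bars' ITopic from the HZ client itself):

```python
import json

def publish_scan_to_jet(path, topic):
    """Read one scan JSON file and publish its payload to a Jet topic.

    `topic` is any object exposing .publish(obj), e.g. a blocking
    Hazelcast ITopic for 'dolphin.scan.bars'.
    """
    with open(path) as f:
        scan = json.load(f)
    topic.publish(scan)
    return scan
```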

MIG6.3 — Sub-day ACB Adverse-Turn Exits

When the ACB boost drops by more than 0.2 (absolute boost units) within a day → signal a potential adverse regime turn. The engine checks for open SHORT positions and triggers an early exit (subject to OB quality).

def on_acb_state_update(old_acb_state, new_acb_state, engine):
    """Called by Jet processor when ACB state updates."""
    boost_drop = old_acb_state['boost'] - new_acb_state['boost']
    if boost_drop > 0.2 and engine.has_open_positions():
        # Adverse turn signal: boost dropped significantly
        ob_quality = get_ob_quality()
        if ob_quality > 0.5:
            engine.request_orderly_exit()  # maker fill preferred
        else:
            engine.request_duress_exit()   # bypass OB wait, market order
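The branch logic above can be exercised offline with test doubles (FakeEngine and the injected ob_quality argument are stand-ins for this sketch; the production handler calls get_ob_quality() itself):

```python
class FakeEngine:
    """Test double recording which exit path the handler chose."""
    def __init__(self, open_positions=True):
        self._open = open_positions
        self.exit_mode = None
    def has_open_positions(self):
        return self._open
    def request_orderly_exit(self):
        self.exit_mode = 'orderly'
    def request_duress_exit(self):
        self.exit_mode = 'duress'

def on_acb_state_update(old_acb_state, new_acb_state, engine, ob_quality):
    """Same branch logic as the handler above, with ob_quality injected."""
    boost_drop = old_acb_state['boost'] - new_acb_state['boost']
    if boost_drop > 0.2 and engine.has_open_positions():
        if ob_quality > 0.5:
            engine.request_orderly_exit()   # maker fill preferred
        else:
            engine.request_duress_exit()    # bypass OB wait, market order
```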

Test assertions (ci/test_15_jet_pipeline.py):

def test_jet_topic_publish():
    """Scan file published to HZ Jet topic is received by subscriber."""
    import hazelcast, time
    c = hazelcast.HazelcastClient()
    topic = c.get_topic('dolphin.scan.bars').blocking()
    received = []
    topic.add_message_listener(lambda msg: received.append(msg.message_object))
    topic.publish({'vel_div': -0.03, 'timestamp': time.time()})
    time.sleep(0.1)
    assert len(received) == 1
    assert received[0]['vel_div'] == -0.03
    c.shutdown()

def test_acb_entry_processor_subday():
    """ACB Entry Processor updates boost atomically from Jet pipeline."""
    # Simulate mid-day ACB update: boost drops from 1.3 to 0.9
    processor = ACBBoostUpdateProcessor(new_boost=0.9, date_str='2026-01-15')
    hz_map.execute_on_key('acb_state', processor)
    updated = json.loads(hz_map.get('acb_state'))
    assert updated['boost'] == 0.9

def test_adverse_turn_triggers_exit():
    """Boost drop >0.2x with open positions triggers exit request."""
    engine = build_test_engine_with_open_position()
    old_state = {'boost': 1.3, 'beta': 0.7}
    new_state = {'boost': 1.0, 'beta': 0.5}
    on_acb_state_update(old_state, new_state, engine)
    assert engine.exit_requested, "Adverse turn should trigger exit"

PASS criteria: 49 + 3 = 52 tests green. Sub-day ACB updating on new scan files. Adverse-turn exit fires on simulated boost drop. Jet pipeline end-to-end test with mock scanner.

MIG6 GATE

bash ci/run_ci.sh
pytest ci/test_15_jet_pipeline.py -v
# ASSERT: 52 tests green

Operational check: Start scanner, watch HZ-MC topic dashboard, verify scan events appear in dolphin.scan.bars topic within 10s of each new JSON file.


MIG7 — Multi-Asset Scaling: 50 → 400 Assets

Goal: Scale from 50 to 400 assets while maintaining performance. Current memory footprint limits scaling. Distribute feature store across sharded HZ IMap. Multi-market capability.

Spec reference: Phase MIG7, MEMORY.md ("PROVEN better: higher returns + signal fidelity in tests. Blocked by RAM — optimize memory footprint FIRST, then scale").

Prerequisite (HARD): RAM optimization before scaling. Profile current 50-asset memory footprint first.

MIG7.1 — Memory Footprint Analysis

# Profile current memory usage
python -c "
import tracemalloc, sys
tracemalloc.start()
# ... run engine on 50 assets for 1 day ...
snapshot = tracemalloc.take_snapshot()
stats = snapshot.statistics('lineno')
for s in stats[:20]:
    print(s)
"
# ASSERT: identify top memory consumers
# TARGET: < 4GB for 50 assets (< 32GB for 400 assets)

Known memory hotspots (probable):

  • _price_histories: rolling price buffer per asset × bar count
  • VBT parquet cache: 55 days × 50 assets × ~5k bars each
  • ACB: p60 threshold storage (per day, per asset)
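One low-risk mitigation for the _price_histories hotspot is a hard-capped ring buffer, so retained bars per asset are bounded regardless of run length; a sketch using collections.deque (the 300-bar cap is illustrative, not a tuned value):

```python
from collections import deque

class BoundedPriceHistory:
    """Per-asset rolling price buffer with a hard cap on retained bars."""

    def __init__(self, max_bars=300):
        # deque with maxlen evicts the oldest bar automatically on append
        self._buf = deque(maxlen=max_bars)

    def append(self, price):
        self._buf.append(price)

    def window(self):
        """Current window, oldest first."""
        return list(self._buf)
```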

MIG7.2 — Sharded IMap for 400-Asset Feature Store

# Shard by asset group (10 shards × 40 assets each).
# NOTE: builtin hash() is salted per process (PYTHONHASHSEED), so it would
# route the same symbol to different shards on different workers.
# Use a stable hash instead:
import zlib

def get_shard_map_name(symbol: str) -> str:
    shard = zlib.crc32(symbol.encode()) % 10
    return f"DOLPHIN_FEATURES_SHARD_{shard:02d}"

# Each shard has its own Near Cache
near_cache_config = {f"DOLPHIN_FEATURES_SHARD_{i:02d}": {...} for i in range(10)}

MIG7.3 — Distributed Worker Pool

HZ IMDG + Prefect external workers on multiple machines (or Docker replicas):

  • Worker 1: assets 0-99 (BTCUSDT group)
  • Worker 2: assets 100-199
  • Worker 3: assets 200-299
  • Worker 4: assets 300-399

Capital arbiter aggregates signals from all workers before order submission.
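A minimal sketch of that aggregation step (the signal dict shape and the max_total cap are assumptions for illustration; the real arbiter would also apply the SHORT/LONG weights):

```python
def aggregate_worker_signals(worker_signals, max_total=50):
    """Merge per-worker signal batches, strongest first, capped at max_total.

    worker_signals: list of per-worker lists of {'symbol', 'strength'} dicts.
    """
    merged = [s for batch in worker_signals for s in batch]
    merged.sort(key=lambda s: s['strength'], reverse=True)
    return merged[:max_total]
```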

Test assertions (ci/test_16_scaling.py):

def test_memory_footprint_50_assets():
    """50-asset engine uses < 4GB RAM."""
    import tracemalloc
    tracemalloc.start()
    run_engine_50_assets_1_day()
    _, peak = tracemalloc.get_traced_memory()
    assert peak < 4 * 1024**3, f"Memory too high: {peak/1024**3:.1f}GB"

def test_sharded_imap_read_write():
    """Feature store sharding: all 400 symbols writable and readable."""
    c = hazelcast.HazelcastClient()
    for i, symbol in enumerate(all_400_symbols):
        map_name = get_shard_map_name(symbol)
        m = c.get_map(map_name).blocking()
        m.put(f"vel_div_{symbol}", -0.03)
        assert m.get(f"vel_div_{symbol}") == -0.03
    c.shutdown()

def test_400_asset_engine_no_crash():
    """Engine processes 1 day with 400 assets without crash or OOM."""
    engine = build_400_asset_engine()
    result = engine.process_day('2026-01-15', df_400_assets, ...)
    assert result['trades'] > 0
    assert result['capital'] > 0

PASS criteria: 52 + 3 = 55 tests green. 400-asset engine processes one day. Memory < 32GB (if available). Sharded IMap round-trip working.


CI Test Suite — Cumulative Summary

| MIG Phase | New Tests | Cumulative Total | Key Assertion |
|---|---|---|---|
| MIG0 (baseline) | 24 | 24 | CI gate green, infra healthy |
| MIG1 (SITARA flows) | 8 | 32 | Capital persists, MC/ExF flows running |
| MIG2 (HZ feature store) | 4 | 36 | Near Cache <1ms, live OB flowing |
| MIG3 (survival stack) | 6 | 42 | Rm correct, postures fire, hysteresis holds |
| MIG4 (Nautilus) | 4 | 46 | Actor initializes, HIBERNATE blocks orders |
| MIG5 (LONG system) | 3 | 49 | LONG PF>1.05, arbiter weights sum=1 |
| MIG6 (Jet reactive) | 3 | 52 | Jet topic live, Entry Processor atomic, adverse-turn fires |
| MIG7 (scaling) | 3 | 55 | Memory <4GB/50-asset, shard read-write, 400-asset no crash |

Full CI gate at each phase boundary:

bash ci/run_ci.sh  # original 24 always must pass
pytest ci/ -v --ignore=ci/test_03_regression.py  # fast suite
pytest ci/test_03_regression.py  # regression (slower, run before prod push only)

Regression Floors (Phase Gate Minima)

These floors apply at EVERY phase gate. If a phase change causes any floor to be breached, STOP and investigate before proceeding.

| Metric | Floor | Champion (current best) | Notes |
|---|---|---|---|
| PF (10-day VBT) | >= 1.08 | 1.123 | 55-day window: 1.123 |
| WR (10-day VBT) | >= 42% | 49.3% | Champion WR |
| ROI (10-day) | >= -5% | +44.89% (55d) | Any 10-day window >= -5% |
| Trades (10-day) | >= 5 | ~380 (55d avg 7/day) | Not a dead system |
| Max DD (55d) | < 20% | 14.95% | Don't exceed DD spec target |
| Sharpe (55d) | > 1.5 | 2.50 | Don't regress below spec target |
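These floors can be encoded directly in the CI gate script; a minimal sketch (the metric dict key names are assumptions about the benchmark output, not an existing interface):

```python
# Phase-gate floors from the table above
FLOORS = {
    'pf_10d': 1.08, 'wr_10d': 0.42, 'roi_10d': -0.05,
    'trades_10d': 5, 'max_dd_55d_lt': 0.20, 'sharpe_55d_gt': 1.5,
}

def check_floors(metrics):
    """Return the list of breached floor names; an empty list means the gate passes."""
    breaches = []
    if metrics['pf_10d'] < FLOORS['pf_10d']: breaches.append('pf_10d')
    if metrics['wr_10d'] < FLOORS['wr_10d']: breaches.append('wr_10d')
    if metrics['roi_10d'] < FLOORS['roi_10d']: breaches.append('roi_10d')
    if metrics['trades_10d'] < FLOORS['trades_10d']: breaches.append('trades_10d')
    if metrics['max_dd_55d'] >= FLOORS['max_dd_55d_lt']: breaches.append('max_dd_55d')
    if metrics['sharpe_55d'] <= FLOORS['sharpe_55d_gt']: breaches.append('sharpe_55d')
    return breaches
```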

Open Items (Research Queue, Not Blocking MIG1-3)

These are noted here so they don't fall through the cracks, but they MUST NOT block forward migration:

  1. TP sweep: Apply 95bps to test_pf_dynamic_beta_validate.py ENGINE_KWARGS (still uses 0.0099). Low-risk, 10-min change. Do before next benchmark run.

  2. VOL gate EWMA: 5-bar EWMA before p60 gate (smooths noisy vol_ok). Minor improvement, not a blocker.

  3. Sub-day ACB adverse-turn exits (full implementation): Architecture documented in MEMORY.md Dynamic Exit Manager section. Prototype search in legacy standalone engine tests before building.

  4. Regime fragility sensing (Feb06-08 problem): HD Disentangled VAE on eigenvalue data + ExF conditioning. Long-term research. Does not block MIG1-4.

  5. MC-Forewarner live wiring verification: Mechanical exit/reduce execution on RED/ORANGE (currently only affects sizing, not execution). Must verify real-money path before live trading.

  6. 1m calibration sweep (b1ahez7tq): max_hold × abs_max_lev grid. When complete, update blue.yml if improvement found.

  7. EsoF multi-year backfill: Needed for N>6 tail events. N=6 currently insufficient for production. Backfiller script exists but needs multi-year klines data.


Operational Runbook — Standing Procedure

Daily check (takes 2 min)

# 1. Check Prefect UI for last run result
open http://localhost:4200  # check DOLPHIN-PAPER-BLUE last run status

# 2. Check HZ for today's P&L
python -c "
import hazelcast, json
c = hazelcast.HazelcastClient()
m = c.get_map('DOLPHIN_PNL_BLUE').blocking()
keys = sorted(m.key_set())
if keys:
    print(json.loads(m.get(keys[-1])))
c.shutdown()
"

# 3. Check survival stack posture
python -c "
import hazelcast, json
c = hazelcast.HazelcastClient()
ref = c.get_cp_subsystem().get_atomic_reference('DOLPHIN_SAFETY').blocking()
print(json.loads(ref.get() or '{}'))
c.shutdown()
"

Before any push to prod/blue or prod/green

bash ci/run_ci.sh --fast  # <60s, blocks push if fails (pre-push hook does this automatically)

Recovery from HIBERNATE posture

# 1. Diagnose: which Cat is failing?
python -c "from survival_stack import SurvivalStack; print(SurvivalStack().diagnose())"

# 2. Fix the underlying issue (restart HZ if Cat1, wait for MC if Cat2, etc.)

# 3. Survival stack auto-recovers after 5 consecutive checks above threshold
# Or manual override (EMERGENCY ONLY):
python -c "
import hazelcast, json
c = hazelcast.HazelcastClient()
ref = c.get_cp_subsystem().get_atomic_reference('DOLPHIN_SAFETY').blocking()
ref.set(json.dumps({'posture': 'APEX', 'Rm': 1.0, 'override': True}))
c.shutdown()
print('Manual override set to APEX')
"

Quick-Reference Phase Summary

| Phase | Deliverable | Duration est. | Functional system? |
|---|---|---|---|
| MIG0 | CI 24/24 green, infra verified | Done | YES (batch paper trading) |
| MIG1 | State persistence + subsystem flows | 2-3 sessions | YES + capital compounds |
| MIG2 | HZ feature store + live OB | 3-4 sessions | YES + real OB signal |
| MIG3 | Survival stack + postures | 2-3 sessions | YES + graceful degradation |
| MIG4 | Nautilus-Trader execution | 4-6 sessions | YES + Rust core |
| MIG5 | LONG system (GREEN deployment) | 1-2 sessions | YES + bidirectional |
| MIG6 | HZ Jet reactive ACB | 3-4 sessions | YES + sub-day ACB |
| MIG7 | 400-asset scaling | 4-6 sessions | YES + full scale |

The system is always functional. Every phase boundary = working system + passing CI. No dark periods.