# DOLPHIN Architectural Changes Specification

## Session: 2026-03-25 - Multi-Speed Event-Driven Architecture

**Version**: 1.0
**Author**: Kimi Code CLI Agent
**Status**: DEPLOYED (Production)
**Related Files**: `SYSTEM_BIBLE.md` (updated to v3.1)

---

## Executive Summary

This specification documents the architectural transformation of the DOLPHIN trading system from a **batch-oriented, single-worker Prefect architecture** to a **multi-speed, event-driven, multi-worker architecture** with proper resource isolation and self-healing capabilities.

### Key Changes

1. **Multi-Pool Prefect Architecture** - Separate work pools per frequency layer
2. **Event-Driven Nautilus Trader** - Hz listener for millisecond-latency trading
3. **Enhanced MHS v2** - Full 5-sensor monitoring with per-subsystem health tracking
4. **Systemd-Based Service Management** - Resource-constrained, auto-restarting services
5. **Concurrency Safety** - Prevents process explosion (root cause of the 2026-03-24 outage)

---
## 1. Architecture Overview

### 1.1 Previous Architecture (Pre-Change)

```
┌─────────────────────────────────────────┐
│ Single Work Pool: "dolphin"             │
│ Single Prefect Worker (unlimited)       │
│                                         │
│ Problems:                               │
│ - No concurrency limits                 │
│ - 60+ prefect.engine zombies            │
│ - Resource exhaustion                   │
│ - Kernel deadlock on SMB hang           │
└─────────────────────────────────────────┘
```

### 1.2 New Architecture (Post-Change)

```
┌─────────────────────────────────────────────────────────────────────┐
│ LAYER 1: ULTRA-LOW LATENCY (<1ms)                                   │
│ ├─ Work Pool: dolphin-obf (planned)                                 │
│ ├─ Service: dolphin-nautilus-trader.service                         │
│ └─ Pattern: Hz Entry Listener → Immediate Signal → Trade            │
├─────────────────────────────────────────────────────────────────────┤
│ LAYER 2: FAST POLLING (1s-10s)                                      │
│ ├─ Work Pool: dolphin-scan                                          │
│ ├─ Service: Scan Bridge (direct/systemd hybrid)                     │
│ └─ Pattern: File watcher → Hz push                                  │
├─────────────────────────────────────────────────────────────────────┤
│ LAYER 3: SCHEDULED INDICATORS (varied)                              │
│ ├─ Work Pool: dolphin-extf-indicators (planned)                     │
│ ├─ Services: Per-indicator flows (funding, DVOL, F&G, etc.)         │
│ └─ Pattern: Individual schedules → Hz push                          │
├─────────────────────────────────────────────────────────────────────┤
│ LAYER 4: HEALTH MONITORING (~5s)                                    │
│ ├─ Service: meta_health_daemon.service                              │
│ └─ Pattern: 5-sensor monitoring → Recovery actions                  │
├─────────────────────────────────────────────────────────────────────┤
│ LAYER 5: DAILY BATCH                                                │
│ ├─ Work Pool: dolphin (existing)                                    │
│ └─ Pattern: Scheduled backtests, paper trades                       │
└─────────────────────────────────────────────────────────────────────┘
```

---
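The Layer 2 "File watcher → Hz push" pattern depends on mtime-based change detection, so that a restarted NG6 rewriting the same Arrow file is still noticed. A minimal sketch of that detection loop follows; the names `mtime_changed`, `watch_and_push`, and the `push_to_hz` callback are hypothetical stand-ins for the real Scan Bridge internals, not its actual code:

```python
import os
import time

def mtime_changed(path: str, last_mtime: float) -> tuple[bool, float]:
    """Return (changed, new_mtime). A missing file counts as unchanged."""
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return False, last_mtime
    # Any difference counts, not just "newer": an NG6 restart can rewrite
    # the file with an earlier mtime, and that is still a new scan.
    return (mtime != last_mtime), mtime

def watch_and_push(path: str, push_to_hz, poll_s: float = 5.0) -> None:
    """Poll `path` every poll_s seconds; call push_to_hz(path) on change."""
    last = 0.0
    while True:
        changed, last = mtime_changed(path, last)
        if changed:
            push_to_hz(path)  # hypothetical: serialize scan into DOLPHIN_FEATURES
        time.sleep(poll_s)
```

The 5-second poll matches the Scan Bridge idle interval; everything downstream of Hz is event-driven.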
## 2. Detailed Component Specifications

### 2.1 Scan Bridge Service

**Purpose**: Watch Arrow scan files from DolphinNG6, push to Hazelcast

**File**: `prod/scan_bridge_prefect_flow.py`

**Deployment**:

```bash
prefect deploy scan_bridge_prefect_flow.py:scan_bridge_flow \
  --name scan-bridge --pool dolphin
```

**Key Configuration**:

- Concurrency limit: 1 (per deployment)
- Work pool concurrency: 1
- Poll interval: 5s when idle
- File mtime-based detection (handles NG6 restarts)

**Hz Output**:

- Map: `DOLPHIN_FEATURES`
- Key: `latest_eigen_scan`
- Fields: `scan_number`, `assets`, `asset_prices`, `timestamp`, `bridge_ts`, `bridge_source`

**Current Status**: Running directly (PID 158929) due to Prefect worker issues

---

### 2.2 Nautilus Event-Driven Trader

**Purpose**: Event-driven paper/live trading with millisecond latency

**File**: `prod/nautilus_event_trader.py`
**Service**: `/etc/systemd/system/dolphin-nautilus-trader.service`

**Architecture Pattern**:

```python
import json

# Hz Entry Listener (not polling!)
features_map.add_entry_listener(
    include_value=True,           # required so event.value carries the scan
    key='latest_eigen_scan',
    updated_func=on_scan_update,  # Called on every new scan
    added_func=on_scan_update
)

# Signal computation in callback
def on_scan_update(event):
    scan = json.loads(event.value)
    signal = compute_signal(scan, ob_data, extf_data)
    if signal.valid:
        execute_trade(signal)
```

**Resource Limits** (systemd):

```ini
MemoryMax=2G
CPUQuota=200%
TasksMax=50
```

**Hz Integration**:

- Input: `DOLPHIN_FEATURES["latest_eigen_scan"]`
- Input: `DOLPHIN_FEATURES["ob_features_latest"]` (planned)
- Input: `DOLPHIN_FEATURES["exf_latest"]` (planned)
- Output: `DOLPHIN_PNL_BLUE[YYYY-MM-DD]`
- Output: `DOLPHIN_STATE_BLUE["latest_nautilus"]`

**Current Status**: Active (PID 159402), waiting for NG6 scans

---

### 2.3 Meta Health Service v2 (MHS)

**Purpose**: Comprehensive system health monitoring with automated recovery

**File**: `prod/meta_health_daemon_v2.py`
**Service**: `/etc/systemd/system/meta_health_daemon.service`

#### 2.3.1 Five-Sensor Model

| Sensor | Name | Description | Thresholds |
|--------|------|-------------|------------|
| M1 | Process Integrity | Critical processes running | 0.0=missing, 1.0=all present |
| M2 | Heartbeat Freshness | Hz heartbeat recency | >60s=0.0, >30s=0.5, <30s=1.0 |
| M3 | Data Freshness | Hz data source timestamps | >120s=dead, >30s=stale, <30s=fresh |
| M4 | Control Plane | Port connectivity | Hz+Prefect ports |
| M5 | Data Coherence | Data integrity & posture validity | Valid ranges, enums |

#### 2.3.2 Monitored Subsystems

```python
HZ_DATA_SOURCES = {
    "scan":   ("DOLPHIN_FEATURES",   "latest_eigen_scan",  "bridge_ts"),
    "obf":    ("DOLPHIN_FEATURES",   "ob_features_latest", "_pushed_at"),
    "extf":   ("DOLPHIN_FEATURES",   "exf_latest",         "_pushed_at"),
    "esof":   ("DOLPHIN_FEATURES",   "esof_latest",        "_pushed_at"),
    "safety": ("DOLPHIN_SAFETY",     "latest",             "ts"),
    "state":  ("DOLPHIN_STATE_BLUE", "latest_nautilus",    "updated_at"),
}
```

#### 2.3.3 Rm_meta Calculation

```python
rm_meta = M1 * M2 * M3 * M4 * M5
```

Status mapping:

- `rm_meta > 0.8`: GREEN
- `rm_meta > 0.5`: DEGRADED
- `rm_meta > 0.2`: CRITICAL
- `rm_meta <= 0.2`: DEAD → Recovery actions triggered

#### 2.3.4 Recovery Actions

When `status == "DEAD"`:

1. Check M4 (Control Plane) → Restart Hz/Prefect if needed
2. Check M1 (Processes) → Restart missing services
3. Trigger deployment runs for Prefect-managed flows
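The multiplicative Rm_meta formula and the status bands above can be expressed directly. This is a sketch of the documented mapping, not the daemon's actual code; the function name `rm_meta_status` is hypothetical:

```python
def rm_meta_status(m1: float, m2: float, m3: float,
                   m4: float, m5: float) -> tuple[float, str]:
    """Combine the five sensors multiplicatively and map to a status band.

    Because the combination is a product, any single sensor at 0.0 drives
    rm_meta to 0.0 (DEAD) regardless of the other four.
    """
    rm_meta = m1 * m2 * m3 * m4 * m5
    if rm_meta > 0.8:
        status = "GREEN"
    elif rm_meta > 0.5:
        status = "DEGRADED"
    elif rm_meta > 0.2:
        status = "CRITICAL"
    else:
        status = "DEAD"  # triggers the recovery actions listed above
    return rm_meta, status
```

Note the design consequence: a fully healthy system with one dead subsystem (e.g. M3 = 0.0 while NG6 is down) reports DEAD overall, which is exactly the behavior observed in the current status lines.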
**Current Status**: Active (PID 160052), monitoring; reports DEAD while NG6 is down

---

### 2.4 Systemd Service Specifications

#### 2.4.1 dolphin-nautilus-trader.service

```ini
[Unit]
Description=DOLPHIN Nautilus Event-Driven Trader
After=network.target hazelcast.service

[Service]
Type=simple
User=root
WorkingDirectory=/mnt/dolphinng5_predict/prod
Environment="PATH=/home/dolphin/siloqy_env/bin:/usr/local/bin:/usr/bin:/bin"
Environment="PYTHONPATH=/mnt/dolphinng5_predict:/mnt/dolphinng5_predict/nautilus_dolphin"
ExecStart=/home/dolphin/siloqy_env/bin/python3 nautilus_event_trader.py
Restart=always
RestartSec=5
StartLimitInterval=60s
StartLimitBurst=3

# Resource Limits (Critical!)
MemoryMax=2G
CPUQuota=200%
TasksMax=50

StandardOutput=append:/tmp/nautilus_trader.log
StandardError=append:/tmp/nautilus_trader.log

[Install]
WantedBy=multi-user.target
```

#### 2.4.2 meta_health_daemon.service

```ini
[Unit]
Description=Meta Health Daemon - Watchdog of Watchdogs
After=network.target hazelcast.service

[Service]
Type=simple
User=root
WorkingDirectory=/mnt/dolphinng5_predict/prod
Environment="PREFECT_API_URL=http://localhost:4200/api"
ExecStart=/home/dolphin/siloqy_env/bin/python meta_health_daemon_v2.py
Restart=always
RestartSec=5
StandardOutput=append:/mnt/dolphinng5_predict/run_logs/meta_health.log
```

#### 2.4.3 dolphin-prefect-worker.service

```ini
[Unit]
Description=DOLPHIN Prefect Worker
After=network.target hazelcast.service

[Service]
Type=simple
User=root
WorkingDirectory=/mnt/dolphinng5_predict/prod
Environment="PATH=/home/dolphin/siloqy_env/bin:/usr/local/bin:/usr/bin:/bin"
Environment="PREFECT_API_URL=http://localhost:4200/api"
ExecStart=/home/dolphin/siloqy_env/bin/prefect worker start --pool dolphin
Restart=always
RestartSec=10
StandardOutput=append:/tmp/prefect_worker.log
```

---
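The limits configured in these units can be verified on the running host with `systemctl show <unit> -p <property>`, which prints `KEY=VALUE` lines (note that systemd reports `MemoryMax` in bytes and `CPUQuota` as `CPUQuotaPerSecUSec`). A small sketch, with the helper names `parse_systemctl_show` and `service_limits` being hypothetical:

```python
import subprocess

def parse_systemctl_show(output: str) -> dict[str, str]:
    """Parse `systemctl show` KEY=VALUE lines into a dict."""
    props = {}
    for line in output.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            props[key] = value
    return props

def service_limits(unit: str) -> dict[str, str]:
    """Query the effective resource limits of a systemd unit."""
    out = subprocess.run(
        ["systemctl", "show", unit,
         "-p", "MemoryMax", "-p", "TasksMax", "-p", "CPUQuotaPerSecUSec"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_systemctl_show(out)
```

Usage would be `service_limits("dolphin-nautilus-trader.service")`; a `MemoryMax` of `infinity` would indicate the limit was dropped, e.g. after an edit without `daemon-reload`.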
## 3. Data Flow Specifications

### 3.1 Scan-to-Trade Latency Path

```
┌─────────────┐     ┌──────────────┐     ┌───────────────┐     ┌────────────┐
│ DolphinNG6  │────▶│ Arrow File   │────▶│ Scan Bridge   │────▶│ Hz         │
│ (Windows)   │     │ (SMB mount)  │     │ (5s poll)     │     │ DOLPHIN_   │
│             │     │              │     │               │     │ FEATURES   │
└─────────────┘     └──────────────┘     └───────────────┘     └─────┬──────┘
                                                                     │
                              ┌──────────────────────────────────────┘
                              │ Entry Listener (event-driven)
                              ▼
                      ┌───────────────┐     ┌──────────────┐
                      │ Nautilus      │────▶│ Trade Exec   │
                      │ Event Trader  │     │ (Paper/Live) │
                      │ (<1ms latency)│     │              │
                      └───────────────┘     └──────────────┘
```

**Target Latency**: < 10ms from NG6 scan to trade execution

### 3.2 Hz Data Schema Updates

#### DOLPHIN_FEATURES["latest_eigen_scan"]

```json
{
  "scan_number": 8634,
  "timestamp": "2026-03-25T10:30:00Z",
  "assets": ["BTCUSDT", "ETHUSDT", ...],
  "asset_prices": {...},
  "eigenvalues": [...],
  "bridge_ts": "2026-03-25T10:30:01.123456+00:00",
  "bridge_source": "scan_bridge_prefect"
}
```

#### DOLPHIN_META_HEALTH["latest"] (NEW)

```json
{
  "rm_meta": 0.0,
  "status": "DEAD",
  "m1_proc": 0.0,
  "m2_heartbeat": 0.0,
  "m3_data_freshness": 0.0,
  "m4_control_plane": 1.0,
  "m5_coherence": 0.0,
  "subsystem_health": {
    "processes": {...},
    "data_sources": {...}
  },
  "timestamp": "2026-03-25T10:30:00+00:00"
}
```

---
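Given the ISO-8601 `bridge_ts` field, any consumer can measure staleness of the latest scan and reproduce the M3 bands from 2.3.1 (fresh < 30s, stale < 120s, dead beyond). A sketch; the function names are hypothetical, and the 0.5 score for "stale" is an assumption by analogy with the documented M2 scoring:

```python
from datetime import datetime, timezone
from typing import Optional

def scan_age_seconds(bridge_ts: str, now: Optional[datetime] = None) -> float:
    """Age of the latest scan: from its ISO-8601 bridge_ts to now (UTC)."""
    pushed = datetime.fromisoformat(bridge_ts)
    now = now or datetime.now(timezone.utc)
    return (now - pushed).total_seconds()

def m3_freshness(age_s: float) -> float:
    """M3 sensor value per the 2.3.1 thresholds (0.5 for stale assumed)."""
    if age_s > 120:
        return 0.0   # dead
    if age_s > 30:
        return 0.5   # stale
    return 1.0       # fresh
```

For example, `m3_freshness(scan_age_seconds(scan["bridge_ts"]))` gives the same verdict MHS would reach for the `scan` subsystem.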
## 4. Safety Mechanisms

### 4.1 Concurrency Controls

| Level | Mechanism | Value | Purpose |
|-------|-----------|-------|---------|
| Work Pool | `concurrency_limit` | 1 | Only 1 flow run per pool |
| Deployment | `prefect concurrency-limit` | 1 | Tag-based limit |
| Systemd | `TasksMax` | 50 | Max processes per service |
| Systemd | `MemoryMax` | 2G | OOM protection |

### 4.2 Recovery Procedures

**Scenario 1: Process Death**

```
M1 drops → systemd Restart=always → process restarts → M1 recovers
```

**Scenario 2: Data Staleness**

```
M3 drops (no NG6 data) → Status=DEGRADED → Wait for NG6 restart
(No automatic action - data source is external)
```

**Scenario 3: Control Plane Failure**

```
M4 drops → MHS triggers → systemctl restart hazelcast
```

**Scenario 4: System Deadlock (2026-03-24 Incident)**

```
Prefect worker spawns 60+ processes → Resource exhaustion → Kernel deadlock
Fix: Concurrency limits + systemd TasksMax prevent spawn loop
```

---

## 5. Known Issues & Limitations

### 5.1 Prefect Worker Issue

**Symptom**: Flow runs stuck in "Late" state; the worker does not pick them up
**Workaround**: Services run directly via systemd (Scan Bridge, Nautilus Trader)
**Root Cause**: Unknown - possibly a paused pool state or a worker polling issue
**Future Fix**: Investigate the Prefect work pool `status` field; the pool may need to be recreated

### 5.2 NG6 Dependency

**Current State**: All services report DEAD (expected) because no scan data is arriving
**Recovery**: Automatic once NG6 restarts - the scan bridge will detect new files, Hz will update, and Nautilus will trade

### 5.3 OBF/ExtF/EsoF Not Running

**Status**: Services defined but not yet started
**Action Required**: Start individually after NG6 recovery

---
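The 2026-03-24 spawn loop (Scenario 4 above) surfaced as dozens of `prefect.engine` processes, so an early-warning check is simply a process count. A stdlib-only sketch for Linux hosts, scanning `/proc` directly; the alert threshold of 10 is a hypothetical value, not a documented limit:

```python
import os

def count_matching_processes(needle: str) -> int:
    """Count live processes whose command line contains `needle` (Linux /proc)."""
    count = 0
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            with open(f"/proc/{pid}/cmdline", "rb") as f:
                cmdline = f.read().replace(b"\x00", b" ").decode(errors="replace")
        except OSError:
            continue  # process exited while we were scanning
        if needle in cmdline:
            count += 1
    return count

if __name__ == "__main__":
    n = count_matching_processes("prefect.engine")
    if n > 10:  # hypothetical alert threshold
        print(f"WARNING: {n} prefect.engine processes - possible spawn loop")
```

Such a check could feed M1 or a future M-sensor, catching a runaway worker before `TasksMax` is exhausted.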
## 6. Operational Commands

### 6.1 Service Management

```bash
# Check all services
systemctl status dolphin-* meta_health_daemon

# View logs
journalctl -u dolphin-nautilus-trader -f
tail -f /tmp/nautilus_trader.log
tail -f /mnt/dolphinng5_predict/run_logs/meta_health.log

# Restart services
systemctl restart dolphin-nautilus-trader
systemctl restart meta_health_daemon
```

### 6.2 Health Checks

```bash
# Check Hz data
cd /mnt/dolphinng5_predict/prod
source /home/dolphin/siloqy_env/bin/activate
python3 -c "
import hazelcast, json
client = hazelcast.HazelcastClient(cluster_name='dolphin')
features = client.get_map('DOLPHIN_FEATURES').blocking()
scan = json.loads(features.get('latest_eigen_scan') or '{}')
print(f'Scan #{scan.get(\"scan_number\", \"N/A\")}')
client.shutdown()
"

# Check MHS status
cat /mnt/dolphinng5_predict/run_logs/meta_health.json
```

### 6.3 Prefect Operations

```bash
export PREFECT_API_URL="http://localhost:4200/api"

# Check work pools
prefect work-pool ls

# Check deployments
prefect deployment ls

# Manual trigger (for testing)
prefect deployment run scan-bridge-flow/scan-bridge
```

---

## 7. File Locations

| Component | File Path |
|-----------|-----------|
| Scan Bridge Flow | `/mnt/dolphinng5_predict/prod/scan_bridge_prefect_flow.py` |
| Nautilus Trader | `/mnt/dolphinng5_predict/prod/nautilus_event_trader.py` |
| MHS v2 | `/mnt/dolphinng5_predict/prod/meta_health_daemon_v2.py` |
| Prefect Worker Service | `/etc/systemd/system/dolphin-prefect-worker.service` |
| Nautilus Trader Service | `/etc/systemd/system/dolphin-nautilus-trader.service` |
| MHS Service | `/etc/systemd/system/meta_health_daemon.service` |
| Health Logs | `/mnt/dolphinng5_predict/run_logs/meta_health.log` |
| Trader Logs | `/tmp/nautilus_trader.log` |
| Prefect Worker Logs | `/tmp/prefect_worker.log` |
| Hz Data | `DOLPHIN_FEATURES`, `DOLPHIN_SAFETY`, `DOLPHIN_META_HEALTH` |

---
## 8. Appendix: Version History

| Version | Date | Changes |
|---------|------|---------|
| 1.0 | 2026-03-25 | Initial multi-speed architecture spec |

---

## 9. Sign-Off

**Implementation**: Complete
**Testing**: In Progress (waiting for NG6 restart)
**Documentation**: Complete
**Next Review**: Post-NG6-recovery validation

**Agent**: Kimi Code CLI
**Session**: 2026-03-25
**Status**: PRODUCTION DEPLOYED