DOLPHIN Architectural Changes Specification

Session: 2026-03-25 - Multi-Speed Event-Driven Architecture

Version: 1.0
Author: Kimi Code CLI Agent
Status: DEPLOYED (Production)
Related Files: SYSTEM_BIBLE.md (updated to v3.1)


Executive Summary

This specification documents the architectural transformation of the DOLPHIN trading system from a batch-oriented, single-worker Prefect architecture to a multi-speed, event-driven, multi-worker architecture with proper resource isolation and self-healing capabilities.

Key Changes

  1. Multi-Pool Prefect Architecture - Separate work pools per frequency layer
  2. Event-Driven Nautilus Trader - Hz listener for millisecond-latency trading
  3. Enhanced MHS v2 - Full 5-sensor monitoring with per-subsystem health tracking
  4. Systemd-Based Service Management - Resource-constrained, auto-restarting services
  5. Concurrency Safety - Prevents process explosion (root cause of 2026-03-24 outage)

1. Architecture Overview

1.1 Previous Architecture (Pre-Change)

┌─────────────────────────────────────────┐
│  Single Work Pool: "dolphin"            │
│  Single Prefect Worker (unlimited)      │
│                                         │
│  Problems:                              │
│  - No concurrency limits                │
│  - 60+ prefect.engine zombies           │
│  - Resource exhaustion                  │
│  - Kernel deadlock on SMB hang          │
└─────────────────────────────────────────┘

1.2 New Architecture (Post-Change)

┌─────────────────────────────────────────────────────────────────────┐
│ LAYER 1: ULTRA-LOW LATENCY (<1ms)                                   │
│ ├─ Work Pool: dolphin-obf (planned)                                │
│ ├─ Service: dolphin-nautilus-trader.service                        │
│ └─ Pattern: Hz Entry Listener → Immediate Signal → Trade           │
├─────────────────────────────────────────────────────────────────────┤
│ LAYER 2: FAST POLLING (1s-10s)                                      │
│ ├─ Work Pool: dolphin-scan                                         │
│ ├─ Service: Scan Bridge (direct/systemd hybrid)                    │
│ └─ Pattern: File watcher → Hz push                                 │
├─────────────────────────────────────────────────────────────────────┤
│ LAYER 3: SCHEDULED INDICATORS (varied)                              │
│ ├─ Work Pool: dolphin-extf-indicators (planned)                    │
│ ├─ Services: Per-indicator flows (funding, DVOL, F&G, etc.)        │
│ └─ Pattern: Individual schedules → Hz push                         │
├─────────────────────────────────────────────────────────────────────┤
│ LAYER 4: HEALTH MONITORING (~5s)                                    │
│ ├─ Service: meta_health_daemon.service                             │
│ └─ Pattern: 5-sensor monitoring → Recovery actions                 │
├─────────────────────────────────────────────────────────────────────┤
│ LAYER 5: DAILY BATCH                                                │
│ ├─ Work Pool: dolphin (existing)                                   │
│ └─ Pattern: Scheduled backtests, paper trades                      │
└─────────────────────────────────────────────────────────────────────┘

2. Detailed Component Specifications

2.1 Scan Bridge Service

Purpose: Watch Arrow scan files from DolphinNG6, push to Hazelcast

File: prod/scan_bridge_prefect_flow.py

Deployment:

prefect deploy scan_bridge_prefect_flow.py:scan_bridge_flow \
  --name scan-bridge --pool dolphin

Key Configuration:

  • Concurrency limit: 1 (per deployment)
  • Work pool concurrency: 1
  • Poll interval: 5s when idle
  • File mtime-based detection (handles NG6 restarts)

Hz Output:

  • Map: DOLPHIN_FEATURES
  • Key: latest_eigen_scan
  • Fields: scan_number, assets, asset_prices, timestamp, bridge_ts, bridge_source

Current Status: Running directly (PID 158929) due to Prefect worker issues
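
The bridge's mtime-based detection can be sketched as a single iteration of the watch loop. This is a minimal illustration, not the production code: the `push` callable is injected (in production it would be the Hz `DOLPHIN_FEATURES` map's `put`), and the file is read as JSON here purely to keep the sketch self-contained, where the real bridge reads Arrow files.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def watch_scan_file(path, push, last_mtime=0.0):
    """One iteration of an mtime-based watch loop (illustrative sketch).

    If `path`'s mtime has advanced past `last_mtime`, read the scan,
    stamp it, and hand it to the injected push(key, value) callable;
    return the current mtime so the caller can loop. mtime detection
    survives NG6 restarts: a rewritten file always carries a newer
    mtime even if its scan_number resets.
    """
    p = Path(path)
    if not p.exists():
        return last_mtime  # NG6 down or file not yet written
    mtime = p.stat().st_mtime
    if mtime <= last_mtime:
        return last_mtime  # nothing new; caller sleeps ~5s and retries
    payload = json.loads(p.read_text())  # real bridge reads Arrow, not JSON
    payload["bridge_ts"] = datetime.now(timezone.utc).isoformat()
    payload["bridge_source"] = "scan_bridge_prefect"
    push("latest_eigen_scan", json.dumps(payload))
    return mtime
```

The caller sleeps for the 5s idle interval between iterations and carries the returned mtime forward.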


2.2 Nautilus Event-Driven Trader

Purpose: Event-driven paper/live trading with millisecond latency

File: prod/nautilus_event_trader.py

Service: /etc/systemd/system/dolphin-nautilus-trader.service

Architecture Pattern:

# Signal computation runs in the callback
def on_scan_update(event):
    scan = json.loads(event.value)
    signal = compute_signal(scan, ob_data, extf_data)
    if signal.valid:
        execute_trade(signal)

# Hz entry listener (event-driven, not polling).
# include_value=True is required; without it, event.value is None.
features_map.add_entry_listener(
    key='latest_eigen_scan',
    include_value=True,
    updated_func=on_scan_update,  # fired on every new scan
    added_func=on_scan_update     # fired when the key first appears
)

Resource Limits (systemd):

MemoryMax=2G
CPUQuota=200%
TasksMax=50

Hz Integration:

  • Input: DOLPHIN_FEATURES["latest_eigen_scan"]
  • Input: DOLPHIN_FEATURES["ob_features_latest"] (planned)
  • Input: DOLPHIN_FEATURES["exf_latest"] (planned)
  • Output: DOLPHIN_PNL_BLUE[YYYY-MM-DD]
  • Output: DOLPHIN_STATE_BLUE["latest_nautilus"]

Current Status: Active (PID 159402), waiting for NG6 scans


2.3 Meta Health Service v2 (MHS)

Purpose: Comprehensive system health monitoring with automated recovery

File: prod/meta_health_daemon_v2.py

Service: /etc/systemd/system/meta_health_daemon.service

2.3.1 Five-Sensor Model

Sensor | Name                | Description                        | Thresholds
M1     | Process Integrity   | Critical processes running         | 0.0 = missing, 1.0 = all present
M2     | Heartbeat Freshness | Hz heartbeat recency               | >60s = 0.0, >30s = 0.5, <30s = 1.0
M3     | Data Freshness      | Hz data-source timestamps          | >120s = dead, >30s = stale, <30s = fresh
M4     | Control Plane       | Port connectivity                  | Hz + Prefect ports
M5     | Data Coherence      | Data integrity & posture validity  | Valid ranges, enums

2.3.2 Monitored Subsystems

HZ_DATA_SOURCES = {
    "scan": ("DOLPHIN_FEATURES", "latest_eigen_scan", "bridge_ts"),
    "obf": ("DOLPHIN_FEATURES", "ob_features_latest", "_pushed_at"),
    "extf": ("DOLPHIN_FEATURES", "exf_latest", "_pushed_at"),
    "esof": ("DOLPHIN_FEATURES", "esof_latest", "_pushed_at"),
    "safety": ("DOLPHIN_SAFETY", "latest", "ts"),
    "state": ("DOLPHIN_STATE_BLUE", "latest_nautilus", "updated_at"),
}
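
Each subsystem's timestamp field feeds M3. A per-source scoring function, using the M3 thresholds from the sensor table, might look like the following sketch. The numeric 0.5 for "stale" mirrors M2's scoring and is an assumption; the table only names the bands.

```python
from datetime import datetime, timezone

def freshness_score(ts_iso, now=None):
    """Score one Hz data source for M3: <30s fresh (1.0),
    30-120s stale (0.5, assumed value), >120s dead (0.0).
    Timestamps are ISO-8601 with offset, as the bridge writes them."""
    now = now or datetime.now(timezone.utc)
    age_s = (now - datetime.fromisoformat(ts_iso)).total_seconds()
    if age_s < 30:
        return 1.0
    if age_s <= 120:
        return 0.5
    return 0.0
```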

2.3.3 Rm_meta Calculation

rm_meta = M1 * M2 * M3 * M4 * M5

Status Mapping:
- rm_meta > 0.8: GREEN
- rm_meta > 0.5: DEGRADED
- rm_meta > 0.2: CRITICAL
- rm_meta <= 0.2: DEAD → recovery actions triggered
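
The calculation and status mapping above reduce to a few lines; the function names here are illustrative, not the daemon's actual API:

```python
def rm_meta(m1, m2, m3, m4, m5):
    """Multiplicative composition: any single fully-dead sensor (0.0)
    drives the whole score to 0.0 regardless of the others."""
    return m1 * m2 * m3 * m4 * m5

def meta_status(rm):
    """Map rm_meta onto the status bands above."""
    if rm > 0.8:
        return "GREEN"
    if rm > 0.5:
        return "DEGRADED"
    if rm > 0.2:
        return "CRITICAL"
    return "DEAD"  # recovery actions trigger here
```

The multiplicative form is the design point: a healthy average cannot mask one dead sensor, which an additive mean would.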

2.3.4 Recovery Actions

When status == "DEAD":

  1. Check M4 (Control Plane) → Restart Hz/Prefect if needed
  2. Check M1 (Processes) → Restart missing services
  3. Trigger deployment runs for Prefect-managed flows
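
The ordered checks above can be sketched as a dry-run planner that returns commands instead of executing them (the real daemon shells out via systemctl; the sensor-to-unit mapping shown is illustrative):

```python
def plan_recovery(sensors):
    """Map DEAD-state sensor readings to ordered recovery commands,
    following steps 1-3 above. Returns the commands rather than
    running them; call this only when status == "DEAD"."""
    actions = []
    # Step 1: control plane first -- nothing else works without Hz/Prefect.
    if sensors.get("m4_control_plane", 1.0) < 1.0:
        actions.append("systemctl restart hazelcast")
    # Step 2: restart any missing critical services.
    if sensors.get("m1_proc", 1.0) < 1.0:
        actions.append("systemctl restart dolphin-nautilus-trader")
    # Step 3: re-kick Prefect-managed flows.
    actions.append("prefect deployment run scan-bridge-flow/scan-bridge")
    return actions
```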

Current Status: Active (PID 160052); currently reports DEAD because NG6 is down


2.4 Systemd Service Specifications

2.4.1 dolphin-nautilus-trader.service

[Unit]
Description=DOLPHIN Nautilus Event-Driven Trader
After=network.target hazelcast.service

[Service]
Type=simple
User=root
WorkingDirectory=/mnt/dolphinng5_predict/prod
Environment="PATH=/home/dolphin/siloqy_env/bin:/usr/local/bin:/usr/bin:/bin"
Environment="PYTHONPATH=/mnt/dolphinng5_predict:/mnt/dolphinng5_predict/nautilus_dolphin"

ExecStart=/home/dolphin/siloqy_env/bin/python3 nautilus_event_trader.py

Restart=always
RestartSec=5
StartLimitInterval=60s
StartLimitBurst=3

# Resource Limits (Critical!)
MemoryMax=2G
CPUQuota=200%
TasksMax=50

StandardOutput=append:/tmp/nautilus_trader.log
StandardError=append:/tmp/nautilus_trader.log

[Install]
WantedBy=multi-user.target

2.4.2 meta_health_daemon.service

[Unit]
Description=Meta Health Daemon - Watchdog of Watchdogs
After=network.target hazelcast.service

[Service]
Type=simple
User=root
WorkingDirectory=/mnt/dolphinng5_predict/prod
Environment="PREFECT_API_URL=http://localhost:4200/api"
ExecStart=/home/dolphin/siloqy_env/bin/python meta_health_daemon_v2.py
Restart=always
RestartSec=5
StandardOutput=append:/mnt/dolphinng5_predict/run_logs/meta_health.log

2.4.3 dolphin-prefect-worker.service

[Unit]
Description=DOLPHIN Prefect Worker
After=network.target hazelcast.service

[Service]
Type=simple
User=root
WorkingDirectory=/mnt/dolphinng5_predict/prod
Environment="PATH=/home/dolphin/siloqy_env/bin:/usr/local/bin:/usr/bin:/bin"
Environment="PREFECT_API_URL=http://localhost:4200/api"
ExecStart=/home/dolphin/siloqy_env/bin/prefect worker start --pool dolphin
Restart=always
RestartSec=10
StandardOutput=append:/tmp/prefect_worker.log

3. Data Flow Specifications

3.1 Scan-to-Trade Latency Path

┌─────────────┐     ┌──────────────┐     ┌───────────────┐     ┌────────────┐
│ DolphinNG6  │────▶│ Arrow File   │────▶│ Scan Bridge   │────▶│ Hz         │
│ (Windows)   │     │ (SMB mount)  │     │ (5s poll)     │     │ DOLPHIN_   │
│             │     │              │     │               │     │ FEATURES   │
└─────────────┘     └──────────────┘     └───────────────┘     └─────┬──────┘
                                                                     │
                              ┌──────────────────────────────────────┘
                              │ Entry Listener (event-driven)
                              ▼
                       ┌───────────────┐     ┌──────────────┐
                       │ Nautilus      │────▶│ Trade Exec   │
                       │ Event Trader  │     │ (Paper/Live) │
                       │ (<1ms latency)│     │              │
                       └───────────────┘     └──────────────┘

Target Latency: < 10ms from Hz scan update to trade execution (the bridge's 5s idle poll bounds the file-to-Hz leg)

3.2 Hz Data Schema Updates

DOLPHIN_FEATURES["latest_eigen_scan"]

{
  "scan_number": 8634,
  "timestamp": "2026-03-25T10:30:00Z",
  "assets": ["BTCUSDT", "ETHUSDT", ...],
  "asset_prices": {...},
  "eigenvalues": [...],
  "bridge_ts": "2026-03-25T10:30:01.123456+00:00",
  "bridge_source": "scan_bridge_prefect"
}
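
A minimal M5-style coherence check against this schema might look as follows. This is a sketch of the idea only: field presence plus basic range checks, where the production M5 also validates posture enums, which are not shown here.

```python
REQUIRED_FIELDS = ("scan_number", "timestamp", "assets",
                   "asset_prices", "bridge_ts", "bridge_source")

def scan_is_coherent(scan):
    """Check the latest_eigen_scan payload: required fields present,
    non-negative integer scan_number, non-empty asset list, and
    every priced symbol appearing in the asset list."""
    if not all(f in scan for f in REQUIRED_FIELDS):
        return False
    if not isinstance(scan["scan_number"], int) or scan["scan_number"] < 0:
        return False
    if not scan["assets"]:
        return False
    return set(scan["asset_prices"]).issubset(scan["assets"])
```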

DOLPHIN_META_HEALTH["latest"] (NEW)

{
  "rm_meta": 0.0,
  "status": "DEAD",
  "m1_proc": 0.0,
  "m2_heartbeat": 0.0,
  "m3_data_freshness": 0.0,
  "m4_control_plane": 1.0,
  "m5_coherence": 0.0,
  "subsystem_health": {
    "processes": {...},
    "data_sources": {...}
  },
  "timestamp": "2026-03-25T10:30:00+00:00"
}

4. Safety Mechanisms

4.1 Concurrency Controls

Level      | Mechanism                 | Value | Purpose
Work Pool  | concurrency_limit         | 1     | Only 1 flow run per pool
Deployment | prefect concurrency-limit | 1     | Tag-based limit
Systemd    | TasksMax                  | 50    | Max processes per service
Systemd    | MemoryMax                 | 2G    | OOM protection

4.2 Recovery Procedures

Scenario 1: Process Death

M1 drops → systemd Restart=always → process restarts → M1 recovers

Scenario 2: Data Staleness

M3 drops (no NG6 data) → Status=DEGRADED → Wait for NG6 restart
(No automatic action - data source is external)

Scenario 3: Control Plane Failure

M4 drops → MHS triggers → systemctl restart hazelcast

Scenario 4: System Deadlock (2026-03-24 Incident)

Prefect worker spawns 60+ processes → Resource exhaustion → Kernel deadlock
Fix: Concurrency limits + systemd TasksMax prevent spawn loop

5. Known Issues & Limitations

5.1 Prefect Worker Issue

Symptom: Flow runs stuck in "Late" state, worker not picking up

Workaround: Services run directly via systemd (Scan Bridge, Nautilus Trader)

Root Cause: Unknown - possibly pool paused state or worker polling issue

Future Fix: Investigate Prefect work pool status field, may need to recreate pool

5.2 NG6 Dependency

Current State: All services DEAD (expected) due to no scan data

Recovery: Automatic when NG6 restarts - scan bridge will detect files, Hz will update, Nautilus will trade

5.3 OBF/ExtF/EsoF Not Running

Status: Services defined but not yet started

Action Required: Start individually after NG6 recovery


6. Operational Commands

6.1 Service Management

# Check all services
systemctl status dolphin-* meta_health_daemon

# View logs
journalctl -u dolphin-nautilus-trader -f
tail -f /tmp/nautilus_trader.log
tail -f /mnt/dolphinng5_predict/run_logs/meta_health.log

# Restart services
systemctl restart dolphin-nautilus-trader
systemctl restart meta_health_daemon

6.2 Health Checks

# Check Hz data
cd /mnt/dolphinng5_predict/prod
source /home/dolphin/siloqy_env/bin/activate
python3 -c "
import hazelcast, json
client = hazelcast.HazelcastClient(cluster_name='dolphin')
features = client.get_map('DOLPHIN_FEATURES').blocking()
scan = json.loads(features.get('latest_eigen_scan') or '{}')
print(f'Scan #{scan.get(\"scan_number\", \"N/A\")}')
client.shutdown()
"

# Check MHS status
cat /mnt/dolphinng5_predict/run_logs/meta_health.json

6.3 Prefect Operations

export PREFECT_API_URL="http://localhost:4200/api"

# Check work pools
prefect work-pool ls

# Check deployments
prefect deployment ls

# Manual trigger (for testing)
prefect deployment run scan-bridge-flow/scan-bridge

7. File Locations

Component               | File Path
Scan Bridge Flow        | /mnt/dolphinng5_predict/prod/scan_bridge_prefect_flow.py
Nautilus Trader         | /mnt/dolphinng5_predict/prod/nautilus_event_trader.py
MHS v2                  | /mnt/dolphinng5_predict/prod/meta_health_daemon_v2.py
Prefect Worker Service  | /etc/systemd/system/dolphin-prefect-worker.service
Nautilus Trader Service | /etc/systemd/system/dolphin-nautilus-trader.service
MHS Service             | /etc/systemd/system/meta_health_daemon.service
Health Logs             | /mnt/dolphinng5_predict/run_logs/meta_health.log
Trader Logs             | /tmp/nautilus_trader.log
Prefect Worker Logs     | /tmp/prefect_worker.log
Hz Data                 | DOLPHIN_FEATURES, DOLPHIN_SAFETY, DOLPHIN_META_HEALTH

8. Appendix: Version History

Version | Date       | Changes
1.0     | 2026-03-25 | Initial multi-speed architecture spec

9. Sign-Off

Implementation: Complete
Testing: In Progress (waiting for NG6 restart)
Documentation: Complete
Next Review: Post-NG6-recovery validation

Agent: Kimi Code CLI
Session: 2026-03-25
Status: PRODUCTION DEPLOYED