DOLPHIN Architectural Changes Specification
Session: 2026-03-25 - Multi-Speed Event-Driven Architecture
Version: 1.0
Author: Kimi Code CLI Agent
Status: DEPLOYED (Production)
Related Files: SYSTEM_BIBLE.md (updated to v3.1)
Executive Summary
This specification documents the architectural transformation of the DOLPHIN trading system from a batch-oriented, single-worker Prefect architecture to a multi-speed, event-driven, multi-worker architecture with proper resource isolation and self-healing capabilities.
Key Changes
- Multi-Pool Prefect Architecture - Separate work pools per frequency layer
- Event-Driven Nautilus Trader - Hazelcast (Hz) entry listener for millisecond-latency trading
- Enhanced MHS v2 - Full 5-sensor monitoring with per-subsystem health tracking
- Systemd-Based Service Management - Resource-constrained, auto-restarting services
- Concurrency Safety - Prevents process explosion (root cause of 2026-03-24 outage)
1. Architecture Overview
1.1 Previous Architecture (Pre-Change)
┌─────────────────────────────────────────┐
│  Single Work Pool: "dolphin"            │
│  Single Prefect Worker (unlimited)      │
│                                         │
│  Problems:                              │
│  - No concurrency limits                │
│  - 60+ prefect.engine zombies           │
│  - Resource exhaustion                  │
│  - Kernel deadlock on SMB hang          │
└─────────────────────────────────────────┘
1.2 New Architecture (Post-Change)
┌─────────────────────────────────────────────────────────────────────┐
│  LAYER 1: ULTRA-LOW LATENCY (<1ms)                                  │
│  ├─ Work Pool: dolphin-obf (planned)                                │
│  ├─ Service: dolphin-nautilus-trader.service                        │
│  └─ Pattern: Hz Entry Listener → Immediate Signal → Trade           │
├─────────────────────────────────────────────────────────────────────┤
│  LAYER 2: FAST POLLING (1s-10s)                                     │
│  ├─ Work Pool: dolphin-scan                                         │
│  ├─ Service: Scan Bridge (direct/systemd hybrid)                    │
│  └─ Pattern: File watcher → Hz push                                 │
├─────────────────────────────────────────────────────────────────────┤
│  LAYER 3: SCHEDULED INDICATORS (varied)                             │
│  ├─ Work Pool: dolphin-extf-indicators (planned)                    │
│  ├─ Services: Per-indicator flows (funding, DVOL, F&G, etc.)        │
│  └─ Pattern: Individual schedules → Hz push                         │
├─────────────────────────────────────────────────────────────────────┤
│  LAYER 4: HEALTH MONITORING (~5s)                                   │
│  ├─ Service: meta_health_daemon.service                             │
│  └─ Pattern: 5-sensor monitoring → Recovery actions                 │
├─────────────────────────────────────────────────────────────────────┤
│  LAYER 5: DAILY BATCH                                               │
│  ├─ Work Pool: dolphin (existing)                                   │
│  └─ Pattern: Scheduled backtests, paper trades                      │
└─────────────────────────────────────────────────────────────────────┘
2. Detailed Component Specifications
2.1 Scan Bridge Service
Purpose: Watch Arrow scan files from DolphinNG6, push to Hazelcast
File: prod/scan_bridge_prefect_flow.py
Deployment:
prefect deploy scan_bridge_prefect_flow.py:scan_bridge_flow \
    --name scan-bridge --pool dolphin
Key Configuration:
- Concurrency limit: 1 (per deployment)
- Work pool concurrency: 1
- Poll interval: 5s when idle
- File mtime-based detection (handles NG6 restarts)
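The mtime-based detection listed above could look roughly like the following. This is a minimal sketch, not the code in scan_bridge_prefect_flow.py: the function name `find_new_scan`, the `*.arrow` glob, and the flat directory layout are assumptions made for illustration. Comparing modification times rather than names means a file is picked up whenever it is newer than the last one processed, whatever its file name says.

```python
from pathlib import Path
from typing import Optional, Tuple


def find_new_scan(scan_dir: Path, last_mtime: float) -> Tuple[Optional[Path], float]:
    """Return the newest *.arrow file modified after last_mtime.

    If nothing is newer, return (None, last_mtime) so the caller can
    sleep for the poll interval and try again.
    """
    newest: Optional[Path] = None
    newest_mtime = last_mtime
    for path in scan_dir.glob("*.arrow"):  # hypothetical file pattern
        mtime = path.stat().st_mtime
        if mtime > newest_mtime:
            newest, newest_mtime = path, mtime
    return newest, newest_mtime
```

A caller would keep the returned mtime between iterations and sleep for the 5s poll interval whenever the first element is `None`.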
Hz Output:
- Map: DOLPHIN_FEATURES
- Key: latest_eigen_scan
- Fields: scan_number, assets, asset_prices, timestamp, bridge_ts, bridge_source
Current Status: Running directly (PID 158929) due to Prefect worker issues
2.2 Nautilus Event-Driven Trader
Purpose: Event-driven paper/live trading with millisecond latency
File: prod/nautilus_event_trader.py
Service: /etc/systemd/system/dolphin-nautilus-trader.service
Architecture Pattern:
import json

# Signal computation happens inside the callback
def on_scan_update(event):
    scan = json.loads(event.value)
    signal = compute_signal(scan, ob_data, extf_data)
    if signal.valid:
        execute_trade(signal)

# Hz Entry Listener (not polling!)
features_map.add_entry_listener(
    include_value=True,           # required, otherwise event.value is None
    key='latest_eigen_scan',
    added_func=on_scan_update,    # Called when the key is first written
    updated_func=on_scan_update,  # Called on every new scan
)
Resource Limits (systemd):
MemoryMax=2G
CPUQuota=200%
TasksMax=50
Hz Integration:
- Input: DOLPHIN_FEATURES["latest_eigen_scan"]
- Input: DOLPHIN_FEATURES["ob_features_latest"] (planned)
- Input: DOLPHIN_FEATURES["exf_latest"] (planned)
- Output: DOLPHIN_PNL_BLUE[YYYY-MM-DD]
- Output: DOLPHIN_STATE_BLUE["latest_nautilus"]
Current Status: Active (PID 159402), waiting for NG6 scans
2.3 Meta Health Service v2 (MHS)
Purpose: Comprehensive system health monitoring with automated recovery
File: prod/meta_health_daemon_v2.py
Service: /etc/systemd/system/meta_health_daemon.service
2.3.1 Five-Sensor Model
| Sensor | Name | Description | Thresholds |
|---|---|---|---|
| M1 | Process Integrity | Critical processes running | 0.0=missing, 1.0=all present |
| M2 | Heartbeat Freshness | Hz heartbeat recency | >60s=0.0, >30s=0.5, <30s=1.0 |
| M3 | Data Freshness | Hz data source timestamps | >120s=dead, >30s=stale, <30s=fresh |
| M4 | Control Plane | Port connectivity | Hz+Prefect ports |
| M5 | Data Coherence | Data integrity & posture validity | Valid ranges, enums |
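The M2 and M3 thresholds in the table can be read as simple step functions of a timestamp's age. The sketch below shows one such reading; the function names are hypothetical, and treating an age of exactly 30s as fresh is an assumption, since the table leaves that boundary open.

```python
def heartbeat_score(age_s: float) -> float:
    """M2: map heartbeat age in seconds to a score, per the table."""
    if age_s > 60:
        return 0.0
    if age_s > 30:
        return 0.5
    return 1.0  # exactly 30s counts as fresh here (assumption)


def classify_freshness(age_s: float) -> str:
    """M3: label a data source by the age of its timestamp, per the table."""
    if age_s > 120:
        return "dead"
    if age_s > 30:
        return "stale"
    return "fresh"
```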
2.3.2 Monitored Subsystems
HZ_DATA_SOURCES = {
"scan": ("DOLPHIN_FEATURES", "latest_eigen_scan", "bridge_ts"),
"obf": ("DOLPHIN_FEATURES", "ob_features_latest", "_pushed_at"),
"extf": ("DOLPHIN_FEATURES", "exf_latest", "_pushed_at"),
"esof": ("DOLPHIN_FEATURES", "esof_latest", "_pushed_at"),
"safety": ("DOLPHIN_SAFETY", "latest", "ts"),
"state": ("DOLPHIN_STATE_BLUE", "latest_nautilus", "updated_at"),
}
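Each entry is a (map name, key, timestamp field) triple, so a freshness check only needs to read one value per source and compare its timestamp with the current time. The following is a sketch of that lookup under stated assumptions: plain dictionaries stand in for the Hazelcast maps, values are JSON strings, and every timestamp field is an ISO 8601 string (the `bridge_ts` example in this document is; the other fields are assumed to be). The function name `source_ages` is hypothetical.

```python
import json
from datetime import datetime, timezone
from typing import Dict, Mapping, Optional, Tuple

# Two entries copied from HZ_DATA_SOURCES above, to keep the sketch short.
SOURCES: Dict[str, Tuple[str, str, str]] = {
    "scan": ("DOLPHIN_FEATURES", "latest_eigen_scan", "bridge_ts"),
    "obf": ("DOLPHIN_FEATURES", "ob_features_latest", "_pushed_at"),
}


def source_ages(
    maps: Mapping[str, Mapping[str, str]],
    sources: Mapping[str, Tuple[str, str, str]],
    now: datetime,
) -> Dict[str, Optional[float]]:
    """Return the age in seconds per source, or None if missing or unparseable."""
    ages: Dict[str, Optional[float]] = {}
    for name, (map_name, key, ts_field) in sources.items():
        raw = maps.get(map_name, {}).get(key)
        try:
            ts = datetime.fromisoformat(json.loads(raw)[ts_field])
        except (KeyError, TypeError, ValueError):
            ages[name] = None  # absent key, bad JSON, or bad timestamp
            continue
        if ts.tzinfo is None:
            ts = ts.replace(tzinfo=timezone.utc)  # assume UTC when no offset is given
        ages[name] = (now - ts).total_seconds()
    return ages
```

Returning `None` for an absent source keeps "never written" distinguishable from "written long ago", which may matter for subsystems that are defined but not yet started.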
2.3.3 Rm_meta Calculation
rm_meta = M1 * M2 * M3 * M4 * M5
Status Mapping:
- rm_meta > 0.8: GREEN
- rm_meta > 0.5: DEGRADED
- rm_meta > 0.2: CRITICAL
- rm_meta <= 0.2: DEAD → Recovery actions triggered
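The product and the status mapping translate directly into two small functions. This is a sketch with hypothetical function names; only the formula and the thresholds come from this section.

```python
from math import prod


def rm_meta(m1: float, m2: float, m3: float, m4: float, m5: float) -> float:
    """Combine the five sensor scores multiplicatively."""
    return prod((m1, m2, m3, m4, m5))


def status_for(rm: float) -> str:
    """Map rm_meta to a status using the strict thresholds listed above."""
    if rm > 0.8:
        return "GREEN"
    if rm > 0.5:
        return "DEGRADED"
    if rm > 0.2:
        return "CRITICAL"
    return "DEAD"
```

Because the score is a product, a single sensor at 0.0 forces DEAD regardless of the others, and one sensor at 0.5 with the rest at 1.0 yields exactly 0.5, which maps to CRITICAL rather than DEGRADED.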
2.3.4 Recovery Actions
When status == "DEAD":
- Check M4 (Control Plane) → Restart Hz/Prefect if needed
- Check M1 (Processes) → Restart missing services
- Trigger deployment runs for Prefect-managed flows
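One way to picture the recovery sequence is as a function that turns the sensor readings into an ordered list of commands without running them. The sketch below is illustrative only and is not taken from meta_health_daemon_v2.py: the function name, its parameters, and the `< 1.0` trigger conditions are assumptions, and it restarts only Hazelcast on an M4 failure because M4 alone does not say which port is down.

```python
from typing import List, Sequence


def plan_recovery(
    status: str,
    m1: float,
    m4: float,
    missing_services: Sequence[str],
    prefect_deployments: Sequence[str],
) -> List[List[str]]:
    """Return the commands to run, in order, as argv lists (nothing is executed)."""
    if status != "DEAD":
        return []
    actions: List[List[str]] = []
    if m4 < 1.0:  # control plane first: nothing else works without Hz
        actions.append(["systemctl", "restart", "hazelcast"])
    if m1 < 1.0:  # then any critical process that is not running
        for service in missing_services:
            actions.append(["systemctl", "restart", service])
    for deployment in prefect_deployments:  # finally re-trigger Prefect-managed flows
        actions.append(["prefect", "deployment", "run", deployment])
    return actions
```

Separating planning from execution keeps the decision logic testable and makes it easy to log the intended actions before handing them to something like `subprocess.run`.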
Current Status: Active (PID 160052); currently reporting DEAD because NG6 is down
2.4 Systemd Service Specifications
2.4.1 dolphin-nautilus-trader.service
[Unit]
Description=DOLPHIN Nautilus Event-Driven Trader
After=network.target hazelcast.service
[Service]
Type=simple
User=root
WorkingDirectory=/mnt/dolphinng5_predict/prod
Environment="PATH=/home/dolphin/siloqy_env/bin:/usr/local/bin:/usr/bin:/bin"
Environment="PYTHONPATH=/mnt/dolphinng5_predict:/mnt/dolphinng5_predict/nautilus_dolphin"
ExecStart=/home/dolphin/siloqy_env/bin/python3 nautilus_event_trader.py
Restart=always
RestartSec=5
StartLimitInterval=60s
StartLimitBurst=3
# Resource Limits (Critical!)
MemoryMax=2G
CPUQuota=200%
TasksMax=50
StandardOutput=append:/tmp/nautilus_trader.log
StandardError=append:/tmp/nautilus_trader.log
[Install]
WantedBy=multi-user.target
2.4.2 meta_health_daemon.service
[Unit]
Description=Meta Health Daemon - Watchdog of Watchdogs
After=network.target hazelcast.service
[Service]
Type=simple
User=root
WorkingDirectory=/mnt/dolphinng5_predict/prod
Environment="PREFECT_API_URL=http://localhost:4200/api"
ExecStart=/home/dolphin/siloqy_env/bin/python meta_health_daemon_v2.py
Restart=always
RestartSec=5
StandardOutput=append:/mnt/dolphinng5_predict/run_logs/meta_health.log
2.4.3 dolphin-prefect-worker.service
[Unit]
Description=DOLPHIN Prefect Worker
After=network.target hazelcast.service
[Service]
Type=simple
User=root
WorkingDirectory=/mnt/dolphinng5_predict/prod
Environment="PATH=/home/dolphin/siloqy_env/bin:/usr/local/bin:/usr/bin:/bin"
Environment="PREFECT_API_URL=http://localhost:4200/api"
ExecStart=/home/dolphin/siloqy_env/bin/prefect worker start --pool dolphin
Restart=always
RestartSec=10
StandardOutput=append:/tmp/prefect_worker.log
3. Data Flow Specifications
3.1 Scan-to-Trade Latency Path
┌─────────────┐     ┌──────────────┐     ┌───────────────┐     ┌────────────┐
│ DolphinNG6  │────▶│ Arrow File   │────▶│ Scan Bridge   │────▶│ Hz         │
│ (Windows)   │     │ (SMB mount)  │     │ (5s poll)     │     │ DOLPHIN_   │
│             │     │              │     │               │     │ FEATURES   │
└─────────────┘     └──────────────┘     └───────────────┘     └─────┬──────┘
                                                                     │
                              ┌──────────────────────────────────────┘
                              │ Entry Listener (event-driven)
                              ▼
                      ┌───────────────┐     ┌──────────────┐
                      │ Nautilus      │────▶│ Trade Exec   │
                      │ Event Trader  │     │ (Paper/Live) │
                      │ (<1ms latency)│     │              │
                      └───────────────┘     └──────────────┘
Target Latency: < 10ms from NG6 scan to trade execution, not counting the Scan Bridge poll wait (up to 5s when idle, see 2.1)
3.2 Hz Data Schema Updates
DOLPHIN_FEATURES["latest_eigen_scan"]
{
"scan_number": 8634,
"timestamp": "2026-03-25T10:30:00Z",
"assets": ["BTCUSDT", "ETHUSDT", ...],
"asset_prices": {...},
"eigenvalues": [...],
"bridge_ts": "2026-03-25T10:30:01.123456+00:00",
"bridge_source": "scan_bridge_prefect"
}
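The two bridge fields are metadata added on top of the scan contents. A minimal sketch of how such a payload might be stamped before it is written to the map is shown below; the function name `build_scan_payload` and its parameters are hypothetical, and only the field names and the timestamp format follow the example above.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_scan_payload(
    scan: Dict[str, Any],
    source: str = "scan_bridge_prefect",
    now: Optional[datetime] = None,
) -> str:
    """Return the scan as a JSON string with bridge_ts and bridge_source added."""
    stamped = dict(scan)  # copy so the caller's dict is left untouched
    stamped["bridge_ts"] = (now or datetime.now(timezone.utc)).isoformat()
    stamped["bridge_source"] = source
    return json.dumps(stamped)
```

Using a timezone-aware UTC datetime is what produces the `+00:00` suffix seen in `bridge_ts`.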
DOLPHIN_META_HEALTH["latest"] (NEW)
{
"rm_meta": 0.0,
"status": "DEAD",
"m1_proc": 0.0,
"m2_heartbeat": 0.0,
"m3_data_freshness": 0.0,
"m4_control_plane": 1.0,
"m5_coherence": 0.0,
"subsystem_health": {
"processes": {...},
"data_sources": {...}
},
"timestamp": "2026-03-25T10:30:00+00:00"
}
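A consumer of this record can cross-check it against the rm_meta formula from Section 2.3.3. The snippet below is a hypothetical consumer-side check, not part of the daemon; the function name and the tolerance are assumptions, while the field names come from the schema above.

```python
import json
import math

SENSOR_FIELDS = (
    "m1_proc",
    "m2_heartbeat",
    "m3_data_freshness",
    "m4_control_plane",
    "m5_coherence",
)


def payload_is_consistent(raw: str) -> bool:
    """True if rm_meta equals the product of the five sensor fields."""
    doc = json.loads(raw)
    expected = math.prod(doc[field] for field in SENSOR_FIELDS)
    return math.isclose(doc["rm_meta"], expected, abs_tol=1e-9)
```

For the example record, four sensors are 0.0, so the product is 0.0 and matches the reported rm_meta.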
4. Safety Mechanisms
4.1 Concurrency Controls
| Level | Mechanism | Value | Purpose |
|---|---|---|---|
| Work Pool | concurrency_limit | 1 | Only 1 flow run per pool |
| Deployment | prefect concurrency-limit | 1 | Tag-based limit |
| Systemd | TasksMax | 50 | Max processes per service |
| Systemd | MemoryMax | 2G | OOM protection |
4.2 Recovery Procedures
Scenario 1: Process Death
M1 drops → systemd Restart=always → process restarts → M1 recovers
Scenario 2: Data Staleness
M3 drops (no NG6 data) → Status degrades, reaching DEAD once M3 hits 0.0 (rm_meta is a product) → Wait for NG6 restart
(No automatic action - data source is external)
Scenario 3: Control Plane Failure
M4 drops → MHS triggers → systemctl restart hazelcast
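The M4 sensor is described as port connectivity, which can be probed with a plain TCP connect. The following is a sketch under assumptions: the function names are hypothetical, and scoring M4 as the fraction of reachable ports is one possible choice rather than the documented behaviour.

```python
import socket
from typing import Sequence


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def control_plane_score(ports: Sequence[int], host: str = "127.0.0.1") -> float:
    """Fraction of the given ports that accept connections (0.0 if none are given)."""
    if not ports:
        return 0.0
    reachable = sum(port_open(host, port) for port in ports)
    return reachable / len(ports)
```

For this deployment the call might be `control_plane_score([5701, 4200])`, where 4200 is the Prefect API port used elsewhere in this document and 5701 is Hazelcast's default member port, assumed here to be unchanged.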
Scenario 4: System Deadlock (2026-03-24 Incident)
Prefect worker spawns 60+ processes → Resource exhaustion → Kernel deadlock
Fix: Concurrency limits + systemd TasksMax prevent spawn loop
5. Known Issues & Limitations
5.1 Prefect Worker Issue
Symptom: Flow runs stuck in "Late" state, worker not picking up
Workaround: Services run directly via systemd (Scan Bridge, Nautilus Trader)
Root Cause: Unknown - possibly pool paused state or worker polling issue
Future Fix: Investigate the Prefect work pool status field; the pool may need to be recreated
5.2 NG6 Dependency
Current State: MHS reports status DEAD (expected) because no scan data is arriving
Recovery: Automatic when NG6 restarts - scan bridge will detect files, Hz will update, Nautilus will trade
5.3 OBF/ExtF/EsoF Not Running
Status: Services defined but not yet started
Action Required: Start individually after NG6 recovery
6. Operational Commands
6.1 Service Management
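# Check all services (quote the glob so the shell does not expand it)
systemctl status 'dolphin-*' meta_health_daemon
# View logs (service output is appended to the files below; journalctl shows systemd's own unit messages)
journalctl -u dolphin-nautilus-trader -f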
tail -f /tmp/nautilus_trader.log
tail -f /mnt/dolphinng5_predict/run_logs/meta_health.log
# Restart services
systemctl restart dolphin-nautilus-trader
systemctl restart meta_health_daemon
6.2 Health Checks
# Check Hz data
cd /mnt/dolphinng5_predict/prod
source /home/dolphin/siloqy_env/bin/activate
python3 -c "
import hazelcast, json
client = hazelcast.HazelcastClient(cluster_name='dolphin')
features = client.get_map('DOLPHIN_FEATURES').blocking()
scan = json.loads(features.get('latest_eigen_scan') or '{}')
print(f'Scan #{scan.get(\"scan_number\", \"N/A\")}')
client.shutdown()
"
# Check MHS status
cat /mnt/dolphinng5_predict/run_logs/meta_health.json
6.3 Prefect Operations
export PREFECT_API_URL="http://localhost:4200/api"
# Check work pools
prefect work-pool ls
# Check deployments
prefect deployment ls
# Manual trigger (for testing)
prefect deployment run scan-bridge-flow/scan-bridge
7. File Locations
| Component | File Path |
|---|---|
| Scan Bridge Flow | /mnt/dolphinng5_predict/prod/scan_bridge_prefect_flow.py |
| Nautilus Trader | /mnt/dolphinng5_predict/prod/nautilus_event_trader.py |
| MHS v2 | /mnt/dolphinng5_predict/prod/meta_health_daemon_v2.py |
| Prefect Worker Service | /etc/systemd/system/dolphin-prefect-worker.service |
| Nautilus Trader Service | /etc/systemd/system/dolphin-nautilus-trader.service |
| MHS Service | /etc/systemd/system/meta_health_daemon.service |
| Health Logs | /mnt/dolphinng5_predict/run_logs/meta_health.log |
| Trader Logs | /tmp/nautilus_trader.log |
| Prefect Worker Logs | /tmp/prefect_worker.log |
| Hz Data | DOLPHIN_FEATURES, DOLPHIN_SAFETY, DOLPHIN_META_HEALTH |
8. Appendix: Version History
| Version | Date | Changes |
|---|---|---|
| 1.0 | 2026-03-25 | Initial multi-speed architecture spec |
9. Sign-Off
Implementation: Complete
Testing: In Progress (waiting for NG6 restart)
Documentation: Complete
Next Review: Post-NG6-recovery validation
Agent: Kimi Code CLI
Session: 2026-03-25
Status: PRODUCTION DEPLOYED