# DOLPHIN Architectural Changes Specification
## Session: 2026-03-25 - Multi-Speed Event-Driven Architecture

**Version**: 1.0
**Author**: Kimi Code CLI Agent
**Status**: DEPLOYED (Production)
**Related Files**: `SYSTEM_BIBLE.md` (updated to v3.1)

---

## Executive Summary

This specification documents the architectural transformation of the DOLPHIN trading system from a **batch-oriented, single-worker Prefect architecture** to a **multi-speed, event-driven, multi-worker architecture** with resource isolation and self-healing capabilities.

### Key Changes
1. **Multi-Pool Prefect Architecture** - Separate work pools per frequency layer
2. **Event-Driven Nautilus Trader** - Hazelcast (Hz) entry listener for millisecond-latency trading
3. **Enhanced MHS v2** - Full 5-sensor monitoring with per-subsystem health tracking
4. **Systemd-Based Service Management** - Resource-constrained, auto-restarting services
5. **Concurrency Safety** - Prevents process explosion (root cause of the 2026-03-24 outage)

---

## 1. Architecture Overview

### 1.1 Previous Architecture (Pre-Change)
```
┌─────────────────────────────────────────┐
│  Single Work Pool: "dolphin"            │
│  Single Prefect Worker (unlimited)      │
│                                         │
│  Problems:                              │
│  - No concurrency limits                │
│  - 60+ prefect.engine zombies           │
│  - Resource exhaustion                  │
│  - Kernel deadlock on SMB hang          │
└─────────────────────────────────────────┘
```

### 1.2 New Architecture (Post-Change)
```
┌─────────────────────────────────────────────────────────────────────┐
│  LAYER 1: ULTRA-LOW LATENCY (<1ms)                                  │
│  ├─ Work Pool: dolphin-obf (planned)                                │
│  ├─ Service: dolphin-nautilus-trader.service                        │
│  └─ Pattern: Hz Entry Listener → Immediate Signal → Trade           │
├─────────────────────────────────────────────────────────────────────┤
│  LAYER 2: FAST POLLING (1s-10s)                                     │
│  ├─ Work Pool: dolphin-scan                                         │
│  ├─ Service: Scan Bridge (direct/systemd hybrid)                    │
│  └─ Pattern: File watcher → Hz push                                 │
├─────────────────────────────────────────────────────────────────────┤
│  LAYER 3: SCHEDULED INDICATORS (varied)                             │
│  ├─ Work Pool: dolphin-extf-indicators (planned)                    │
│  ├─ Services: Per-indicator flows (funding, DVOL, F&G, etc.)        │
│  └─ Pattern: Individual schedules → Hz push                         │
├─────────────────────────────────────────────────────────────────────┤
│  LAYER 4: HEALTH MONITORING (~5s)                                   │
│  ├─ Service: meta_health_daemon.service                             │
│  └─ Pattern: 5-sensor monitoring → Recovery actions                 │
├─────────────────────────────────────────────────────────────────────┤
│  LAYER 5: DAILY BATCH                                               │
│  ├─ Work Pool: dolphin (existing)                                   │
│  └─ Pattern: Scheduled backtests, paper trades                      │
└─────────────────────────────────────────────────────────────────────┘
```

---

## 2. Detailed Component Specifications

### 2.1 Scan Bridge Service

**Purpose**: Watch Arrow scan files from DolphinNG6, push to Hazelcast

**File**: `prod/scan_bridge_prefect_flow.py`

**Deployment**:
```bash
prefect deploy scan_bridge_prefect_flow.py:scan_bridge_flow \
    --name scan-bridge --pool dolphin
```

**Key Configuration**:
- Concurrency limit: 1 (per deployment)
- Work pool concurrency: 1
- Poll interval: 5s when idle
- File mtime-based detection (handles NG6 restarts)

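The mtime-based detection listed above can be sketched as follows. This is a minimal illustration, not the code in `scan_bridge_prefect_flow.py`: the function name `find_new_scan`, the `*.arrow` glob pattern, and the directory layout are assumptions. Comparing modification times rather than scan numbers is one way to stay correct if NG6 restarts and its numbering begins again.

```python
import os
from pathlib import Path
from typing import Optional, Tuple


def find_new_scan(scan_dir: Path, last_mtime: float) -> Optional[Tuple[Path, float]]:
    """Return (path, mtime) of the newest *.arrow file newer than last_mtime.

    Hypothetical helper: the real bridge may track state differently.
    Returns None when no file is newer than the last one processed.
    """
    newest: Optional[Path] = None
    newest_mtime = last_mtime
    for path in scan_dir.glob("*.arrow"):
        mtime = path.stat().st_mtime
        if mtime > newest_mtime:
            newest, newest_mtime = path, mtime
    if newest is None:
        return None
    return newest, newest_mtime
```

A polling loop would call this every 5s while idle, remember the returned mtime, and push the file's contents to Hazelcast only when a result is returned.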
**Hz Output**:
- Map: `DOLPHIN_FEATURES`
- Key: `latest_eigen_scan`
- Fields: scan_number, assets, asset_prices, timestamp, bridge_ts, bridge_source

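As a rough sketch of how the bridge fields might be attached before the push, the helper below stamps a scan dict with `bridge_ts` and `bridge_source` and serializes it to JSON. The function name `build_scan_payload` and the injectable `now` parameter are assumptions made for illustration; the `bridge_source` value is taken from the schema example in Section 3.2.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_scan_payload(scan: Dict[str, Any], now: Optional[datetime] = None) -> str:
    """Attach bridge metadata to a scan dict and serialize it to JSON.

    Hypothetical helper; the actual flow may add further fields.
    """
    now = now or datetime.now(timezone.utc)
    payload = dict(scan)  # copy so the caller's dict is not mutated
    payload["bridge_ts"] = now.isoformat()
    payload["bridge_source"] = "scan_bridge_prefect"
    return json.dumps(payload)


# With a connected Hazelcast client, the push would look roughly like:
#   features = client.get_map("DOLPHIN_FEATURES").blocking()
#   features.put("latest_eigen_scan", build_scan_payload(scan))
```

Storing the value as a JSON string matches the consumer side, which calls `json.loads` on the map entry.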
**Current Status**: Running directly (PID 158929) due to Prefect worker issues

---

### 2.2 Nautilus Event-Driven Trader

**Purpose**: Event-driven paper/live trading with millisecond latency

**File**: `prod/nautilus_event_trader.py`

**Service**: `/etc/systemd/system/dolphin-nautilus-trader.service`

**Architecture Pattern**:
```python
import json

# Signal computation in callback.
# compute_signal, execute_trade, ob_data and extf_data are assumed to be
# defined elsewhere in the module.
def on_scan_update(event):
    scan = json.loads(event.value)
    signal = compute_signal(scan, ob_data, extf_data)
    if signal.valid:
        execute_trade(signal)

# Hz Entry Listener (not polling!)
features_map.add_entry_listener(
    include_value=True,            # without this, event.value is None
    key='latest_eigen_scan',
    updated_func=on_scan_update,   # Called on every new scan
    added_func=on_scan_update,
)
```

**Resource Limits** (systemd):
```ini
MemoryMax=2G
CPUQuota=200%
TasksMax=50
```

**Hz Integration**:
- Input: `DOLPHIN_FEATURES["latest_eigen_scan"]`
- Input: `DOLPHIN_FEATURES["ob_features_latest"]` (planned)
- Input: `DOLPHIN_FEATURES["exf_latest"]` (planned)
- Output: `DOLPHIN_PNL_BLUE[YYYY-MM-DD]`
- Output: `DOLPHIN_STATE_BLUE["latest_nautilus"]`

**Current Status**: Active (PID 159402), waiting for NG6 scans

---

### 2.3 Meta Health Service v2 (MHS)

**Purpose**: Comprehensive system health monitoring with automated recovery

**File**: `prod/meta_health_daemon_v2.py`

**Service**: `/etc/systemd/system/meta_health_daemon.service`

#### 2.3.1 Five-Sensor Model

| Sensor | Name | Description | Thresholds |
|--------|------|-------------|------------|
| M1 | Process Integrity | Critical processes running | 0.0=missing, 1.0=all present |
| M2 | Heartbeat Freshness | Hz heartbeat recency | >60s=0.0, >30s=0.5, ≤30s=1.0 |
| M3 | Data Freshness | Hz data source timestamps | >120s=dead, >30s=stale, ≤30s=fresh |
| M4 | Control Plane | Port connectivity | Hz+Prefect ports |
| M5 | Data Coherence | Data integrity & posture validity | Valid ranges, enums |

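The M2 thresholds in the table can be read as a simple step function. The sketch below is one plausible encoding, assuming the heartbeat age is already available in seconds and that an age of exactly 30s counts as healthy; the function name `heartbeat_score` is hypothetical.

```python
def heartbeat_score(age_seconds: float) -> float:
    """Map heartbeat age to an M2 score using the table's thresholds.

    Assumption: boundaries are strict, so exactly 60s scores 0.5
    and exactly 30s scores 1.0.
    """
    if age_seconds > 60:
        return 0.0
    if age_seconds > 30:
        return 0.5
    return 1.0
```

For example, `heartbeat_score(45)` falls in the middle band and yields `0.5`.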
#### 2.3.2 Monitored Subsystems

```python
HZ_DATA_SOURCES = {
    "scan":   ("DOLPHIN_FEATURES", "latest_eigen_scan", "bridge_ts"),
    "obf":    ("DOLPHIN_FEATURES", "ob_features_latest", "_pushed_at"),
    "extf":   ("DOLPHIN_FEATURES", "exf_latest", "_pushed_at"),
    "esof":   ("DOLPHIN_FEATURES", "esof_latest", "_pushed_at"),
    "safety": ("DOLPHIN_SAFETY", "latest", "ts"),
    "state":  ("DOLPHIN_STATE_BLUE", "latest_nautilus", "updated_at"),
}
```

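Each tuple above names a map, a key, and the timestamp field inside the stored record. A freshness check over one such record might look like the sketch below. It assumes the record is a JSON string and that the timestamp is either an ISO 8601 string (as `bridge_ts` is in Section 3.2) or a Unix epoch number; the formats of the other fields are not stated in this document, and `classify_freshness` is a hypothetical name.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def classify_freshness(raw: Optional[str], ts_field: str, now: datetime) -> str:
    """Classify a record as 'fresh', 'stale' or 'dead' from its timestamp field."""
    if raw is None:
        return "dead"  # key missing from the map
    ts = json.loads(raw).get(ts_field)
    if ts is None:
        return "dead"  # record has no timestamp
    if isinstance(ts, (int, float)):
        stamp = datetime.fromtimestamp(ts, tz=timezone.utc)
    else:
        # Normalize a trailing 'Z', which older Python versions do not parse.
        stamp = datetime.fromisoformat(str(ts).replace("Z", "+00:00"))
        if stamp.tzinfo is None:
            stamp = stamp.replace(tzinfo=timezone.utc)  # assume UTC
    age = (now - stamp).total_seconds()
    if age > 120:
        return "dead"
    if age > 30:
        return "stale"
    return "fresh"
```

The thresholds are the M3 values from the five-sensor table; passing `now` explicitly keeps the function easy to test.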
#### 2.3.3 Rm_meta Calculation

```python
rm_meta = M1 * M2 * M3 * M4 * M5

# Status Mapping:
# - rm_meta > 0.8:  GREEN
# - rm_meta > 0.5:  DEGRADED
# - rm_meta > 0.2:  CRITICAL
# - rm_meta <= 0.2: DEAD → Recovery actions triggered
```

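To make the mapping concrete, here is a small runnable sketch of the product and the status thresholds. The function names `compute_rm_meta` and `status_for` are illustrative and not taken from `meta_health_daemon_v2.py`.

```python
from math import prod
from typing import Sequence


def compute_rm_meta(sensors: Sequence[float]) -> float:
    """Multiply the five sensor scores (M1..M5) into a single health value."""
    return prod(sensors)


def status_for(rm_meta: float) -> str:
    """Map rm_meta to a status label using the thresholds listed above."""
    if rm_meta > 0.8:
        return "GREEN"
    if rm_meta > 0.5:
        return "DEGRADED"
    if rm_meta > 0.2:
        return "CRITICAL"
    return "DEAD"
```

Because the score is a product, any sensor at 0.0 forces `DEAD` regardless of the others, and a single sensor at 0.5 with the rest at 1.0 gives exactly 0.5, which maps to `CRITICAL` under the strict `> 0.5` comparison.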
#### 2.3.4 Recovery Actions

When `status == "DEAD"`:
1. Check M4 (Control Plane) → Restart Hz/Prefect if needed
2. Check M1 (Processes) → Restart missing services
3. Trigger deployment runs for Prefect-managed flows

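The control-plane check in step 1 amounts to probing TCP ports. A minimal sketch follows, assuming Hazelcast listens on its default member port 5701 and the Prefect API on port 4200 (the latter matches the `PREFECT_API_URL` used in this document). The names `port_open` and `probe_control_plane` are hypothetical, and how MHS turns the results into an M4 score is not specified here.

```python
import socket
from typing import Dict, Tuple


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def probe_control_plane(endpoints: Dict[str, Tuple[str, int]]) -> Dict[str, bool]:
    """Probe each named endpoint and report its reachability."""
    return {name: port_open(host, port) for name, (host, port) in endpoints.items()}


# Assumed endpoints; adjust to the actual deployment.
CONTROL_PLANE = {
    "hazelcast": ("localhost", 5701),
    "prefect": ("localhost", 4200),
}
```

A recovery routine could restart only the services whose entry in the returned dict is `False`, then move on to the process check in step 2.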
**Current Status**: Active (PID 160052), monitoring (currently DEAD due to NG6 down)

---

### 2.4 Systemd Service Specifications

#### 2.4.1 dolphin-nautilus-trader.service

```ini
[Unit]
Description=DOLPHIN Nautilus Event-Driven Trader
After=network.target hazelcast.service

[Service]
Type=simple
User=root
WorkingDirectory=/mnt/dolphinng5_predict/prod
Environment="PATH=/home/dolphin/siloqy_env/bin:/usr/local/bin:/usr/bin:/bin"
Environment="PYTHONPATH=/mnt/dolphinng5_predict:/mnt/dolphinng5_predict/nautilus_dolphin"

ExecStart=/home/dolphin/siloqy_env/bin/python3 nautilus_event_trader.py

Restart=always
RestartSec=5
StartLimitInterval=60s
StartLimitBurst=3

# Resource Limits (Critical!)
MemoryMax=2G
CPUQuota=200%
TasksMax=50

StandardOutput=append:/tmp/nautilus_trader.log
StandardError=append:/tmp/nautilus_trader.log

[Install]
WantedBy=multi-user.target
```

#### 2.4.2 meta_health_daemon.service

```ini
[Unit]
Description=Meta Health Daemon - Watchdog of Watchdogs
After=network.target hazelcast.service

[Service]
Type=simple
User=root
WorkingDirectory=/mnt/dolphinng5_predict/prod
Environment="PREFECT_API_URL=http://localhost:4200/api"
ExecStart=/home/dolphin/siloqy_env/bin/python meta_health_daemon_v2.py
Restart=always
RestartSec=5
StandardOutput=append:/mnt/dolphinng5_predict/run_logs/meta_health.log
```

#### 2.4.3 dolphin-prefect-worker.service

```ini
[Unit]
Description=DOLPHIN Prefect Worker
After=network.target hazelcast.service

[Service]
Type=simple
User=root
WorkingDirectory=/mnt/dolphinng5_predict/prod
Environment="PATH=/home/dolphin/siloqy_env/bin:/usr/local/bin:/usr/bin:/bin"
Environment="PREFECT_API_URL=http://localhost:4200/api"
ExecStart=/home/dolphin/siloqy_env/bin/prefect worker start --pool dolphin
Restart=always
RestartSec=10
StandardOutput=append:/tmp/prefect_worker.log
```

---

## 3. Data Flow Specifications

### 3.1 Scan-to-Trade Latency Path

```
┌─────────────┐     ┌──────────────┐     ┌───────────────┐     ┌────────────┐
│ DolphinNG6  │────▶│ Arrow File   │────▶│ Scan Bridge   │────▶│ Hz         │
│ (Windows)   │     │ (SMB mount)  │     │ (5s poll)     │     │ DOLPHIN_   │
│             │     │              │     │               │     │ FEATURES   │
└─────────────┘     └──────────────┘     └───────────────┘     └─────┬──────┘
                                                                     │
                              ┌──────────────────────────────────────┘
                              │ Entry Listener (event-driven)
                              ▼
                      ┌───────────────┐     ┌──────────────┐
                      │ Nautilus      │────▶│ Trade Exec   │
                      │ Event Trader  │     │ (Paper/Live) │
                      │ (<1ms latency)│     │              │
                      └───────────────┘     └──────────────┘
```

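**Target Latency**: < 10ms from the Hz scan update to trade execution. End-to-end latency from the NG6 scan write is additionally bounded by the Scan Bridge poll interval (5s when idle).
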
### 3.2 Hz Data Schema Updates

#### DOLPHIN_FEATURES["latest_eigen_scan"]
```json
{
  "scan_number": 8634,
  "timestamp": "2026-03-25T10:30:00Z",
  "assets": ["BTCUSDT", "ETHUSDT", ...],
  "asset_prices": {...},
  "eigenvalues": [...],
  "bridge_ts": "2026-03-25T10:30:01.123456+00:00",
  "bridge_source": "scan_bridge_prefect"
}
```

#### DOLPHIN_META_HEALTH["latest"] (NEW)
```json
{
  "rm_meta": 0.0,
  "status": "DEAD",
  "m1_proc": 0.0,
  "m2_heartbeat": 0.0,
  "m3_data_freshness": 0.0,
  "m4_control_plane": 1.0,
  "m5_coherence": 0.0,
  "subsystem_health": {
    "processes": {...},
    "data_sources": {...}
  },
  "timestamp": "2026-03-25T10:30:00+00:00"
}
```

---

## 4. Safety Mechanisms

### 4.1 Concurrency Controls

| Level | Mechanism | Value | Purpose |
|-------|-----------|-------|---------|
| Work Pool | `concurrency_limit` | 1 | Only 1 flow run per pool |
| Deployment | `prefect concurrency-limit` | 1 | Tag-based limit |
| Systemd | `TasksMax` | 50 | Max processes per service |
| Systemd | `MemoryMax` | 2G | OOM protection |

### 4.2 Recovery Procedures

**Scenario 1: Process Death**
```
M1 drops → systemd Restart=always → process restarts → M1 recovers
```

**Scenario 2: Data Staleness**
```
M3 drops (no NG6 data) → Status degrades (DEAD once M3 reaches 0.0, since rm_meta is a product) → Wait for NG6 restart
(No automatic fix for the data itself - data source is external)
```

**Scenario 3: Control Plane Failure**
```
M4 drops → MHS triggers → systemctl restart hazelcast
```

**Scenario 4: System Deadlock (2026-03-24 Incident)**
```
Prefect worker spawns 60+ processes → Resource exhaustion → Kernel deadlock
Fix: Concurrency limits + systemd TasksMax prevent spawn loop
```

---

## 5. Known Issues & Limitations

### 5.1 Prefect Worker Issue

**Symptom**: Flow runs are stuck in the "Late" state; the worker does not pick them up

**Workaround**: Services run directly via systemd (Scan Bridge, Nautilus Trader)

**Root Cause**: Unknown - possibly a paused pool or a worker polling issue

**Future Fix**: Investigate the Prefect work pool `status` field; the pool may need to be recreated

### 5.2 NG6 Dependency

**Current State**: All services DEAD (expected) due to no scan data

**Recovery**: Automatic when NG6 restarts - the scan bridge will detect files, Hz will update, and Nautilus will trade

### 5.3 OBF/ExtF/EsoF Not Running

**Status**: Services defined but not yet started

**Action Required**: Start each service individually after NG6 recovery

---

## 6. Operational Commands

### 6.1 Service Management

```bash
# Check all services
systemctl status dolphin-* meta_health_daemon

# View logs
journalctl -u dolphin-nautilus-trader -f
tail -f /tmp/nautilus_trader.log
tail -f /mnt/dolphinng5_predict/run_logs/meta_health.log

# Restart services
systemctl restart dolphin-nautilus-trader
systemctl restart meta_health_daemon
```

### 6.2 Health Checks

```bash
# Check Hz data
cd /mnt/dolphinng5_predict/prod
source /home/dolphin/siloqy_env/bin/activate
python3 -c "
import hazelcast, json
client = hazelcast.HazelcastClient(cluster_name='dolphin')
features = client.get_map('DOLPHIN_FEATURES').blocking()
scan = json.loads(features.get('latest_eigen_scan') or '{}')
print(f'Scan #{scan.get(\"scan_number\", \"N/A\")}')
client.shutdown()
"

# Check MHS status
cat /mnt/dolphinng5_predict/run_logs/meta_health.json
```

### 6.3 Prefect Operations

```bash
export PREFECT_API_URL="http://localhost:4200/api"

# Check work pools
prefect work-pool ls

# Check deployments
prefect deployment ls

# Manual trigger (for testing)
prefect deployment run scan-bridge-flow/scan-bridge
```

---

## 7. File Locations

| Component | File Path |
|-----------|-----------|
| Scan Bridge Flow | `/mnt/dolphinng5_predict/prod/scan_bridge_prefect_flow.py` |
| Nautilus Trader | `/mnt/dolphinng5_predict/prod/nautilus_event_trader.py` |
| MHS v2 | `/mnt/dolphinng5_predict/prod/meta_health_daemon_v2.py` |
| Prefect Worker Service | `/etc/systemd/system/dolphin-prefect-worker.service` |
| Nautilus Trader Service | `/etc/systemd/system/dolphin-nautilus-trader.service` |
| MHS Service | `/etc/systemd/system/meta_health_daemon.service` |
| Health Logs | `/mnt/dolphinng5_predict/run_logs/meta_health.log` |
| Trader Logs | `/tmp/nautilus_trader.log` |
| Prefect Worker Logs | `/tmp/prefect_worker.log` |
| Hz Data | `DOLPHIN_FEATURES`, `DOLPHIN_SAFETY`, `DOLPHIN_META_HEALTH` |

---

## 8. Appendix: Version History

| Version | Date | Changes |
|---------|------|---------|
| 1.0 | 2026-03-25 | Initial multi-speed architecture spec |

---

## 9. Sign-Off

**Implementation**: Complete
**Testing**: In Progress (waiting for NG6 restart)
**Documentation**: Complete
**Next Review**: Post-NG6-recovery validation

**Agent**: Kimi Code CLI
**Session**: 2026-03-25
**Status**: PRODUCTION DEPLOYED