Files
DOLPHIN/prod/AGENT_READ_ARCHITECTURAL_CHANGES_SPEC.md

480 lines
16 KiB
Markdown
Raw Normal View History

# DOLPHIN Architectural Changes Specification
## Session: 2026-03-25 - Multi-Speed Event-Driven Architecture
**Version**: 1.0
**Author**: Kimi Code CLI Agent
**Status**: DEPLOYED (Production)
**Related Files**: `SYSTEM_BIBLE.md` (updated to v3.1)
---
## Executive Summary
This specification documents the architectural transformation of the DOLPHIN trading system from a **batch-oriented, single-worker Prefect architecture** to a **multi-speed, event-driven, multi-worker architecture** with proper resource isolation and self-healing capabilities.
### Key Changes
1. **Multi-Pool Prefect Architecture** - Separate work pools per frequency layer
2. **Event-Driven Nautilus Trader** - Hz listener for millisecond-latency trading
3. **Enhanced MHS v2** - Full 5-sensor monitoring with per-subsystem health tracking
4. **Systemd-Based Service Management** - Resource-constrained, auto-restarting services
5. **Concurrency Safety** - Prevents process explosion (root cause of 2026-03-24 outage)
---
## 1. Architecture Overview
### 1.1 Previous Architecture (Pre-Change)
```
┌─────────────────────────────────────────┐
│ Single Work Pool: "dolphin" │
│ Single Prefect Worker (unlimited) │
│ │
│ Problems: │
│ - No concurrency limits │
│ - 60+ prefect.engine zombies │
│ - Resource exhaustion │
│ - Kernel deadlock on SMB hang │
└─────────────────────────────────────────┘
```
### 1.2 New Architecture (Post-Change)
```
┌─────────────────────────────────────────────────────────────────────┐
│ LAYER 1: ULTRA-LOW LATENCY (<1ms)
│ ├─ Work Pool: dolphin-obf (planned) │
│ ├─ Service: dolphin-nautilus-trader.service │
│ └─ Pattern: Hz Entry Listener → Immediate Signal → Trade │
├─────────────────────────────────────────────────────────────────────┤
│ LAYER 2: FAST POLLING (1s-10s) │
│ ├─ Work Pool: dolphin-scan │
│ ├─ Service: Scan Bridge (direct/systemd hybrid) │
│ └─ Pattern: File watcher → Hz push │
├─────────────────────────────────────────────────────────────────────┤
│ LAYER 3: SCHEDULED INDICATORS (varied) │
│ ├─ Work Pool: dolphin-extf-indicators (planned) │
│ ├─ Services: Per-indicator flows (funding, DVOL, F&G, etc.) │
│ └─ Pattern: Individual schedules → Hz push │
├─────────────────────────────────────────────────────────────────────┤
│ LAYER 4: HEALTH MONITORING (~5s) │
│ ├─ Service: meta_health_daemon.service │
│ └─ Pattern: 5-sensor monitoring → Recovery actions │
├─────────────────────────────────────────────────────────────────────┤
│ LAYER 5: DAILY BATCH │
│ ├─ Work Pool: dolphin (existing) │
│ └─ Pattern: Scheduled backtests, paper trades │
└─────────────────────────────────────────────────────────────────────┘
```
---
## 2. Detailed Component Specifications
### 2.1 Scan Bridge Service
**Purpose**: Watch Arrow scan files from DolphinNG6, push to Hazelcast
**File**: `prod/scan_bridge_prefect_flow.py`
**Deployment**:
```bash
prefect deploy scan_bridge_prefect_flow.py:scan_bridge_flow \
--name scan-bridge --pool dolphin
```
**Key Configuration**:
- Concurrency limit: 1 (per deployment)
- Work pool concurrency: 1
- Poll interval: 5s when idle
- File mtime-based detection (handles NG6 restarts)
**Hz Output**:
- Map: `DOLPHIN_FEATURES`
- Key: `latest_eigen_scan`
- Fields: scan_number, assets, asset_prices, timestamp, bridge_ts, bridge_source
**Current Status**: Running directly (PID 158929) due to Prefect worker issues
---
### 2.2 Nautilus Event-Driven Trader
**Purpose**: Event-driven paper/live trading with millisecond latency
**File**: `prod/nautilus_event_trader.py`
**Service**: `/etc/systemd/system/dolphin-nautilus-trader.service`
**Architecture Pattern**:
```python
# Hz Entry Listener (not polling!)
features_map.add_entry_listener(
key='latest_eigen_scan',
updated_func=on_scan_update, # Called on every new scan
added_func=on_scan_update
)
# Signal computation in callback
def on_scan_update(event):
scan = json.loads(event.value)
signal = compute_signal(scan, ob_data, extf_data)
if signal.valid:
execute_trade(signal)
```
**Resource Limits** (systemd):
```ini
MemoryMax=2G
CPUQuota=200%
TasksMax=50
```
**Hz Integration**:
- Input: `DOLPHIN_FEATURES["latest_eigen_scan"]`
- Input: `DOLPHIN_FEATURES["ob_features_latest"]` (planned)
- Input: `DOLPHIN_FEATURES["exf_latest"]` (planned)
- Output: `DOLPHIN_PNL_BLUE[YYYY-MM-DD]`
- Output: `DOLPHIN_STATE_BLUE["latest_nautilus"]`
**Current Status**: Active (PID 159402), waiting for NG6 scans
---
### 2.3 Meta Health Service v2 (MHS)
**Purpose**: Comprehensive system health monitoring with automated recovery
**File**: `prod/meta_health_daemon_v2.py`
**Service**: `/etc/systemd/system/meta_health_daemon.service`
#### 2.3.1 Five-Sensor Model
| Sensor | Name | Description | Thresholds |
|--------|------|-------------|------------|
| M1 | Process Integrity | Critical processes running | 0.0=missing, 1.0=all present |
| M2 | Heartbeat Freshness | Hz heartbeat recency | >60s=0.0, >30s=0.5, <30s=1.0 |
| M3 | Data Freshness | Hz data source timestamps | >120s=dead, >30s=stale, <30s=fresh |
| M4 | Control Plane | Port connectivity | Hz+Prefect ports |
| M5 | Data Coherence | Data integrity & posture validity | Valid ranges, enums |
#### 2.3.2 Monitored Subsystems
```python
HZ_DATA_SOURCES = {
"scan": ("DOLPHIN_FEATURES", "latest_eigen_scan", "bridge_ts"),
"obf": ("DOLPHIN_FEATURES", "ob_features_latest", "_pushed_at"),
"extf": ("DOLPHIN_FEATURES", "exf_latest", "_pushed_at"),
"esof": ("DOLPHIN_FEATURES", "esof_latest", "_pushed_at"),
"safety": ("DOLPHIN_SAFETY", "latest", "ts"),
"state": ("DOLPHIN_STATE_BLUE", "latest_nautilus", "updated_at"),
}
```
#### 2.3.3 Rm_meta Calculation
```python
rm_meta = M1 * M2 * M3 * M4 * M5
Status Mapping:
- rm_meta > 0.8: GREEN
- rm_meta > 0.5: DEGRADED
- rm_meta > 0.2: CRITICAL
- rm_meta <= 0.2: DEAD → Recovery actions triggered
```
#### 2.3.4 Recovery Actions
When `status == "DEAD"`:
1. Check M4 (Control Plane) → Restart Hz/Prefect if needed
2. Check M1 (Processes) → Restart missing services
3. Trigger deployment runs for Prefect-managed flows
**Current Status**: Active (PID 160052), monitoring (currently DEAD due to NG6 down)
---
### 2.4 Systemd Service Specifications
#### 2.4.1 dolphin-nautilus-trader.service
```ini
[Unit]
Description=DOLPHIN Nautilus Event-Driven Trader
After=network.target hazelcast.service
[Service]
Type=simple
User=root
WorkingDirectory=/mnt/dolphinng5_predict/prod
Environment="PATH=/home/dolphin/siloqy_env/bin:/usr/local/bin:/usr/bin:/bin"
Environment="PYTHONPATH=/mnt/dolphinng5_predict:/mnt/dolphinng5_predict/nautilus_dolphin"
ExecStart=/home/dolphin/siloqy_env/bin/python3 nautilus_event_trader.py
Restart=always
RestartSec=5
StartLimitInterval=60s
StartLimitBurst=3
# Resource Limits (Critical!)
MemoryMax=2G
CPUQuota=200%
TasksMax=50
StandardOutput=append:/tmp/nautilus_trader.log
StandardError=append:/tmp/nautilus_trader.log
[Install]
WantedBy=multi-user.target
```
#### 2.4.2 meta_health_daemon.service
```ini
[Unit]
Description=Meta Health Daemon - Watchdog of Watchdogs
After=network.target hazelcast.service
[Service]
Type=simple
User=root
WorkingDirectory=/mnt/dolphinng5_predict/prod
Environment="PREFECT_API_URL=http://localhost:4200/api"
ExecStart=/home/dolphin/siloqy_env/bin/python meta_health_daemon_v2.py
Restart=always
RestartSec=5
StandardOutput=append:/mnt/dolphinng5_predict/run_logs/meta_health.log
```
#### 2.4.3 dolphin-prefect-worker.service
```ini
[Unit]
Description=DOLPHIN Prefect Worker
After=network.target hazelcast.service
[Service]
Type=simple
User=root
WorkingDirectory=/mnt/dolphinng5_predict/prod
Environment="PATH=/home/dolphin/siloqy_env/bin:/usr/local/bin:/usr/bin:/bin"
Environment="PREFECT_API_URL=http://localhost:4200/api"
ExecStart=/home/dolphin/siloqy_env/bin/prefect worker start --pool dolphin
Restart=always
RestartSec=10
StandardOutput=append:/tmp/prefect_worker.log
```
---
## 3. Data Flow Specifications
### 3.1 Scan-to-Trade Latency Path
```
┌─────────────┐ ┌──────────────┐ ┌───────────────┐ ┌────────────┐
│ DolphinNG6 │────▶│ Arrow File │────▶│ Scan Bridge │────▶│ Hz │
│ (Windows) │ │ (SMB mount) │ │ (5s poll) │ │ DOLPHIN_ │
│ │ │ │ │ │ │ FEATURES │
└─────────────┘ └──────────────┘ └───────────────┘ └─────┬──────┘
┌──────────────────────────────────────┘
│ Entry Listener (event-driven)
┌───────────────┐ ┌──────────────┐
│ Nautilus │────▶│ Trade Exec │
│ Event Trader │ │ (Paper/Live) │
│ (<1ms latency)
└───────────────┘ └──────────────┘
```
**Target Latency**: < 10ms from NG6 scan to trade execution
### 3.2 Hz Data Schema Updates
#### DOLPHIN_FEATURES["latest_eigen_scan"]
```json
{
"scan_number": 8634,
"timestamp": "2026-03-25T10:30:00Z",
"assets": ["BTCUSDT", "ETHUSDT", ...],
"asset_prices": {...},
"eigenvalues": [...],
"bridge_ts": "2026-03-25T10:30:01.123456+00:00",
"bridge_source": "scan_bridge_prefect"
}
```
#### DOLPHIN_META_HEALTH["latest"] (NEW)
```json
{
"rm_meta": 0.0,
"status": "DEAD",
"m1_proc": 0.0,
"m2_heartbeat": 0.0,
"m3_data_freshness": 0.0,
"m4_control_plane": 1.0,
"m5_coherence": 0.0,
"subsystem_health": {
"processes": {...},
"data_sources": {...}
},
"timestamp": "2026-03-25T10:30:00+00:00"
}
```
---
## 4. Safety Mechanisms
### 4.1 Concurrency Controls
| Level | Mechanism | Value | Purpose |
|-------|-----------|-------|---------|
| Work Pool | `concurrency_limit` | 1 | Only 1 flow run per pool |
| Deployment | `prefect concurrency-limit` | 1 | Tag-based limit |
| Systemd | `TasksMax` | 50 | Max processes per service |
| Systemd | `MemoryMax` | 2G | OOM protection |
### 4.2 Recovery Procedures
**Scenario 1: Process Death**
```
M1 drops → systemd Restart=always → process restarts → M1 recovers
```
**Scenario 2: Data Staleness**
```
M3 drops (no NG6 data) → Status=DEGRADED → Wait for NG6 restart
(No automatic action - data source is external)
```
**Scenario 3: Control Plane Failure**
```
M4 drops → MHS triggers → systemctl restart hazelcast
```
**Scenario 4: System Deadlock (2026-03-24 Incident)**
```
Prefect worker spawns 60+ processes → Resource exhaustion → Kernel deadlock
Fix: Concurrency limits + systemd TasksMax prevent spawn loop
```
---
## 5. Known Issues & Limitations
### 5.1 Prefect Worker Issue
**Symptom**: Flow runs stuck in "Late" state, worker not picking up
**Workaround**: Services run directly via systemd (Scan Bridge, Nautilus Trader)
**Root Cause**: Unknown - possibly pool paused state or worker polling issue
**Future Fix**: Investigate Prefect work pool `status` field, may need to recreate pool
### 5.2 NG6 Dependency
**Current State**: All services DEAD (expected) due to no scan data
**Recovery**: Automatic when NG6 restarts - scan bridge will detect files, Hz will update, Nautilus will trade
### 5.3 OBF/ExtF/EsoF Not Running
**Status**: Services defined but not yet started
**Action Required**: Start individually after NG6 recovery
---
## 6. Operational Commands
### 6.1 Service Management
```bash
# Check all services
systemctl status dolphin-* meta_health_daemon
# View logs
journalctl -u dolphin-nautilus-trader -f
tail -f /tmp/nautilus_trader.log
tail -f /mnt/dolphinng5_predict/run_logs/meta_health.log
# Restart services
systemctl restart dolphin-nautilus-trader
systemctl restart meta_health_daemon
```
### 6.2 Health Checks
```bash
# Check Hz data
cd /mnt/dolphinng5_predict/prod
source /home/dolphin/siloqy_env/bin/activate
python3 -c "
import hazelcast, json
client = hazelcast.HazelcastClient(cluster_name='dolphin')
features = client.get_map('DOLPHIN_FEATURES').blocking()
scan = json.loads(features.get('latest_eigen_scan') or '{}')
print(f'Scan #{scan.get(\"scan_number\", \"N/A\")}')
client.shutdown()
"
# Check MHS status
cat /mnt/dolphinng5_predict/run_logs/meta_health.json
```
### 6.3 Prefect Operations
```bash
export PREFECT_API_URL="http://localhost:4200/api"
# Check work pools
prefect work-pool ls
# Check deployments
prefect deployment ls
# Manual trigger (for testing)
prefect deployment run scan-bridge-flow/scan-bridge
```
---
## 7. File Locations
| Component | File Path |
|-----------|-----------|
| Scan Bridge Flow | `/mnt/dolphinng5_predict/prod/scan_bridge_prefect_flow.py` |
| Nautilus Trader | `/mnt/dolphinng5_predict/prod/nautilus_event_trader.py` |
| MHS v2 | `/mnt/dolphinng5_predict/prod/meta_health_daemon_v2.py` |
| Prefect Worker Service | `/etc/systemd/system/dolphin-prefect-worker.service` |
| Nautilus Trader Service | `/etc/systemd/system/dolphin-nautilus-trader.service` |
| MHS Service | `/etc/systemd/system/meta_health_daemon.service` |
| Health Logs | `/mnt/dolphinng5_predict/run_logs/meta_health.log` |
| Trader Logs | `/tmp/nautilus_trader.log` |
| Prefect Worker Logs | `/tmp/prefect_worker.log` |
| Hz Data | `DOLPHIN_FEATURES`, `DOLPHIN_SAFETY`, `DOLPHIN_META_HEALTH` |
---
## 8. Appendix: Version History
| Version | Date | Changes |
|---------|------|---------|
| 1.0 | 2026-03-25 | Initial multi-speed architecture spec |
---
## 9. Sign-Off
**Implementation**: Complete
**Testing**: In Progress (waiting for NG6 restart)
**Documentation**: Complete
**Next Review**: Post-NG6-recovery validation
**Agent**: Kimi Code CLI
**Session**: 2026-03-25
**Status**: PRODUCTION DEPLOYED