SYSTEM BIBLE v7.1: OOM recovery runbook + supervisord safety rules

Add §16.10 corrected daemon start sequence (supervisord NOT auto-started on boot),
§16.12 critical supervisord.conf rules (no /tmp paths, OBF starvation → BLUE freeze,
pre-restart position check), §16.13 OOM recovery runbook with exact commands.

Incident context (2026-06-08):
- Previous agent set nautilus_trader to /tmp/blue_runtime_mirror/ — broken after OOM reboot
- OBF died during BLUE run, degraded gate for 285+ bars, BLUE stuck in RETRACT on LTCUSDT
- Fix: revert supervisord.conf to /mnt canonical paths, restart supervisord

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Codex
2026-06-08 13:25:50 +02:00
parent 8f57f4d855
commit 130c466cc9


@@ -1362,19 +1362,83 @@ supervisorctl -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.con
### 16.10 Daemon Start Sequence
**IMPORTANT**: supervisord has NO systemd unit, so it is NOT auto-started on reboot.
After any reboot or OOM kill, supervisord must be started manually (step 2 below).
```bash
# 1. Verify Hazelcast/Prefect are running (systemd-managed, survive reboots)
systemctl status dolphin-prefect-worker
# 2. Start supervisord (MUST export DOLPHIN_LOG_ROOT — used by logfile= directives)
mkdir -p /tmp/dolphin_logs/supervisor /tmp/dolphin_logs/trader
DOLPHIN_LOG_ROOT=/tmp/dolphin_logs supervisord \
-c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf
# dolphin_data group (OBF, ACB, MHS, exf, maras, esof) starts automatically
# 3. Verify data pipeline is up
supervisorctl -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf status
# 4. Start BLUE (manual — autostart=false; only start after verifying BingX position state)
supervisorctl -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf \
start dolphin:nautilus_trader
# 5. Prefect deployments run on schedule (daily):
# paper_trade_flow.py ← 00:05 UTC
# nautilus_prefect_flow ← 00:10 UTC
```
Previous sequence (removed by this commit; it wrongly assumed supervisord auto-starts on boot):
```
1. docker-compose up -d          ← Hazelcast 5701, ManCenter 8080, Prefect 4200
2. supervisord (auto)            ← starts dolphin_data group automatically on boot
   └── exf_fetcher, acb_processor, obf_universe, meta_health start in parallel
3. (Manual when needed):
   supervisorctl start dolphin:nautilus_trader   ← HZ entry listener
   supervisorctl start dolphin:scan_bridge       ← when DolphinNG6 active
4. Prefect deployments (daily, scheduled):
   paper_trade_flow.py           ← 00:05 UTC
   nautilus_prefect_flow.py      ← 00:10 UTC
   mc_forewarner_flow.py         ← daily
```
### 16.12 CRITICAL — supervisord.conf Safety Rules
**RULE 1 — Never use /tmp paths for trader binaries.**
`/tmp` is writable and survives reboots on this host (it is not a tmpfs). However, directories
created by agents (e.g. `/tmp/blue_runtime_mirror/`) may be manually cleaned or never
recreated after an OOM kill, leaving supervisord unable to start the process.
**Canonical paths for all trader programs MUST reference `/mnt/dolphinng5_predict/`.**
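Rule 1 can be checked mechanically before supervisord is (re)started. The following is a minimal sketch, not part of the repository: `find_tmp_paths` is a hypothetical helper that lists `command=`, `directory=` and `environment=` lines referencing `/tmp/`. It deliberately ignores `logfile=` directives, since logs are expected under `DOLPHIN_LOG_ROOT=/tmp/dolphin_logs`.

```bash
#!/usr/bin/env bash
# Hypothetical lint helper (an assumption, not an existing script).
# Prints "lineno:line" for every command=/directory=/environment= line in the
# given supervisord config that references a /tmp/ path.
# Exit status follows grep: 0 if at least one offending line was found, 1 if none.
find_tmp_paths() {
    local conf="$1"
    grep -nE '^[[:space:]]*(command|directory|environment)[[:space:]]*=.*/tmp/' "$conf"
}

# Example (not executed here):
#   find_tmp_paths /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf \
#       && echo "RULE 1 violated: fix the paths above before starting supervisord"
```

Any output is a line to revert to a `/mnt/dolphinng5_predict/` path; no output means the checked directives are clean.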
**RULE 2 — nautilus_trader correct config (BLUE live mainnet):**
```ini
command=/home/dolphin/siloqy_env/bin/python3 /mnt/dolphinng5_predict/prod/nautilus_event_trader.py
directory=/mnt/dolphinng5_predict/prod
environment=PYTHONPATH="/mnt/dolphinng5_predict:/mnt/dolphinng5_predict/nautilus_dolphin:/mnt/dolphinng5_predict/prod",DOLPHIN_LOCAL_RUNTIME_ROOT="/mnt/dolphinng5_predict",...
```
**RULE 3 — OBF starvation → BLUE freeze.**
If `obf_universe` dies and is not restarted, BLUE logs
`"OBF step_live: no snapshots for N consecutive bars — OBF gate degraded to random"`.
After ~60 bars (~10 min) without OBF the survival stack degrades to TURTLE/HIBERNATE,
blocking new ENTERs. The existing open position stays open in RETRACT. Fix: restart
supervisord (which brings OBF up), then restart `dolphin:nautilus_trader`.
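The degraded state in Rule 3 is visible in the trader log before BLUE freezes. The sketch below counts recent occurrences of the warning quoted above. `obf_degraded_count` is a hypothetical helper, the 200-line window is an arbitrary assumption, and the log path in the usage comment is the one this runbook uses elsewhere.

```bash
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not an existing script).
# Prints how many of the last N lines (default 200) of a log contain the
# "OBF gate degraded" warning.
obf_degraded_count() {
    local log="$1" lines="${2:-200}"
    # grep -c prints 0 and exits 1 when nothing matches; `|| true` keeps that
    # exit status from aborting callers that run under `set -e`.
    tail -n "$lines" "$log" | grep -c 'OBF gate degraded' || true
}

# Example (not executed here):
#   n=$(obf_degraded_count /tmp/nautilus_trader.log)
#   [ "$n" -gt 0 ] && echo "OBF gate degraded in $n recent log lines: check obf_universe"
```

A non-zero count suggests checking `obf_universe` in `supervisorctl status` before the ~60-bar threshold is reached.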
**RULE 4 — Before restarting BLUE after a gap, check for open positions.**
If BLUE was in RETRACT when it died, there MAY be an open position on live BingX mainnet.
Check `/tmp/dolphin_capital_checkpoint.json` (last capital) and `/tmp/nautilus_trader.log`
(last V7 decision) for context, but verify directly on BingX before restarting.
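Rule 4 can be enforced with a small gate in front of the start command. This is a sketch under stated assumptions: `pre_restart_gate` is a hypothetical function, and the `CONFIRMED` keyword is an arbitrary choice. It only displays the local context files named above and waits for the operator; it does not query BingX itself.

```bash
#!/usr/bin/env bash
# Hypothetical pre-restart gate (an assumption, not an existing script).
# Shows local context, then succeeds only if the operator types CONFIRMED
# after verifying the position state on BingX by hand.
pre_restart_gate() {
    local checkpoint="${1:-/tmp/dolphin_capital_checkpoint.json}"
    local log="${2:-/tmp/nautilus_trader.log}"

    # Local files are context only; they may be stale or missing after an OOM kill.
    if [ -r "$checkpoint" ]; then
        cat "$checkpoint" >&2
    else
        echo "no checkpoint at $checkpoint" >&2
    fi
    if [ -r "$log" ]; then
        tail -n 5 "$log" >&2
    else
        echo "no log at $log" >&2
    fi

    printf 'Type CONFIRMED once the BingX position state is verified: ' >&2
    local answer=""
    IFS= read -r answer || true
    [ "$answer" = "CONFIRMED" ]
}

# Example (not executed here):
#   pre_restart_gate && supervisorctl \
#       -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf \
#       start dolphin:nautilus_trader
```

Any answer other than the exact keyword leaves BLUE stopped, which is the safe default after a gap.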
### 16.13 OOM Recovery Runbook (post-reboot)
```bash
# Confirm nothing is running
supervisorctl -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf status 2>&1 || echo "supervisord down — need to start"
# Check BLUE's last known state
tail -5 /tmp/nautilus_trader.log
cat /tmp/dolphin_capital_checkpoint.json
# Restart supervisord (data pipeline only — do NOT auto-start BLUE)
mkdir -p /tmp/dolphin_logs/supervisor /tmp/dolphin_logs/trader
DOLPHIN_LOG_ROOT=/tmp/dolphin_logs supervisord \
-c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf
# Give services 15s to reach RUNNING state
sleep 15
supervisorctl -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf status
# Verify OBF is connected (look for "subscribed" in first 30 lines of log)
head -30 /tmp/dolphin_logs/supervisor/obf_universe-error.log
# Only then, start BLUE after manually confirming BingX position state
supervisorctl -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf \
start dolphin:nautilus_trader
```
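The fixed `sleep 15` in the runbook could be replaced by polling until the data pipeline reports RUNNING. The sketch below makes several assumptions: `wait_until`, `group_all_running` and `data_pipeline_up` are hypothetical helpers, the group name `dolphin_data` is taken from the comments in §16.10, and the parser expects the usual `name STATE description` layout of `supervisorctl status`.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (assumptions, not existing scripts).

# Retry a command every INTERVAL seconds until it succeeds or TIMEOUT seconds pass.
wait_until() {
    local timeout="$1" interval="$2"
    shift 2
    local deadline=$(( SECONDS + timeout ))
    while true; do
        if "$@"; then return 0; fi
        if [ "$SECONDS" -ge "$deadline" ]; then return 1; fi
        sleep "$interval"
    done
}

# Reads `supervisorctl status` output on stdin. Succeeds only if at least one
# process of the given group is listed and every one of them is RUNNING.
group_all_running() {
    local group="$1"
    awk -v g="$group:" '
        index($1, g) == 1 { seen++; if ($2 != "RUNNING") bad++ }
        END { exit !(seen > 0 && bad == 0) }
    '
}

# Wrapper so the check can be passed to wait_until as a single command.
data_pipeline_up() {
    supervisorctl -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf status \
        | group_all_running dolphin_data
}

# Example (not executed here): poll for up to 30 s instead of sleeping 15 s.
#   wait_until 30 2 data_pipeline_up || echo "data pipeline not RUNNING after 30s"
```

Only the `dolphin_data` group is checked because `dolphin:nautilus_trader` is expected to stay STOPPED until the position check is done.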
### 16.11 Monitoring Endpoints