SYSTEM BIBLE v7.1: OOM recovery runbook + supervisord safety rules
Add §16.10 corrected daemon start sequence (supervisord NOT auto-started on boot),
§16.12 critical supervisord.conf rules (no /tmp paths, OBF starvation → BLUE freeze,
pre-restart position check), §16.13 OOM recovery runbook with exact commands.

Incident context (2026-06-08):
- Previous agent set nautilus_trader to /tmp/blue_runtime_mirror/ (broken after OOM reboot)
- OBF died during BLUE run, degraded gate for 285+ bars, BLUE stuck in RETRACT on LTCUSDT
- Fix: revert supervisord.conf to /mnt canonical paths, restart supervisord

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -1362,19 +1362,83 @@ supervisorctl -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.con

### 16.10 Daemon Start Sequence

**IMPORTANT**: supervisord has NO systemd unit, so it is NOT auto-started on reboot.
After any reboot or OOM kill, supervisord must be started manually (step 2 below).

```bash
# 1. Verify Hazelcast/Prefect are running (systemd-managed, survive reboots)
systemctl status dolphin-prefect-worker

# 2. Start supervisord (MUST export DOLPHIN_LOG_ROOT; used by logfile= directives)
mkdir -p /tmp/dolphin_logs/supervisor /tmp/dolphin_logs/trader
DOLPHIN_LOG_ROOT=/tmp/dolphin_logs supervisord \
    -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf
# dolphin_data group (OBF, ACB, MHS, exf, maras, esof) starts automatically

# 3. Verify data pipeline is up
supervisorctl -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf status

# 4. Start BLUE (manual: autostart=false; only start after verifying BingX position state)
supervisorctl -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf \
    start dolphin:nautilus_trader

# 5. Prefect deployments run on schedule (daily):
#   paper_trade_flow.py       ← 00:05 UTC
#   nautilus_prefect_flow     ← 00:10 UTC
```
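
To avoid retyping the long `-c` path in steps 3 and 4, a small wrapper can help. This is only a convenience sketch: `sctl` and `DOLPHIN_SUPERVISOR_CONF` are hypothetical names, not part of the existing tooling.

```bash
# Hypothetical convenience wrapper (not part of the existing tooling).
# The default path is the canonical config used throughout this section.
DOLPHIN_SUPERVISOR_CONF="${DOLPHIN_SUPERVISOR_CONF:-/mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf}"

# Forward all arguments to supervisorctl with the canonical config.
sctl() {
    supervisorctl -c "$DOLPHIN_SUPERVISOR_CONF" "$@"
}
```

With this loaded in the shell, `sctl status` and `sctl start dolphin:nautilus_trader` are equivalent to the full commands in the sequence above.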

1. docker-compose up -d          ← Hazelcast 5701, ManCenter 8080, Prefect 4200
2. supervisord (auto)            ← starts dolphin_data group automatically on boot
   └── exf_fetcher, acb_processor, obf_universe, meta_health start in parallel

3. (Manual when needed):
   supervisorctl start dolphin:nautilus_trader    ← HZ entry listener
   supervisorctl start dolphin:scan_bridge        ← when DolphinNG6 active

### 16.12 CRITICAL: supervisord.conf Safety Rules

4. Prefect deployments (daily, scheduled):
   paper_trade_flow.py        ← 00:05 UTC
   nautilus_prefect_flow.py   ← 00:10 UTC
   mc_forewarner_flow.py      ← daily

**RULE 1: Never use /tmp paths for trader binaries.**
`/tmp` is not a tmpfs on this host, so its contents survive reboots. However, directories
created by agents (e.g. `/tmp/blue_runtime_mirror/`) may be cleaned up manually or never
recreated after an OOM kill, leaving supervisord unable to start the process.
**Canonical paths for all trader programs MUST reference `/mnt/dolphinng5_predict/`.**
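
Rule 1 can be checked mechanically before (re)starting supervisord. The sketch below assumes the config uses plain `command=` and `directory=` keys; `find_tmp_paths` is a hypothetical helper name, and other keys (for example `environment=`) are deliberately not inspected.

```bash
# Hypothetical helper: print (with line numbers) every command= or directory=
# line whose value contains a path starting at /tmp. Exit status is 0 if at
# least one such line exists and non-zero otherwise (grep semantics).
find_tmp_paths() {
    # $1: path to a supervisord config file
    grep -nE '^[[:space:]]*(command|directory)[[:space:]]*=(.*[[:space:]"])?/tmp(/|[[:space:]]|$)' "$1"
}
```

For example, `find_tmp_paths /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf && echo "RULE 1 violation"` prints the offending lines followed by the warning, and prints nothing when the config is clean.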

**RULE 2: nautilus_trader correct config (BLUE live mainnet):**
```ini
command=/home/dolphin/siloqy_env/bin/python3 /mnt/dolphinng5_predict/prod/nautilus_event_trader.py
directory=/mnt/dolphinng5_predict/prod
environment=PYTHONPATH="/mnt/dolphinng5_predict:/mnt/dolphinng5_predict/nautilus_dolphin:/mnt/dolphinng5_predict/prod",DOLPHIN_LOCAL_RUNTIME_ROOT="/mnt/dolphinng5_predict",...
```

**RULE 3: OBF starvation → BLUE freeze.**
If `obf_universe` dies and is not restarted, BLUE logs
`"OBF step_live: no snapshots for N consecutive bars — OBF gate degraded to random"`.
After ~60 bars (~10 min) without OBF the survival stack degrades to TURTLE/HIBERNATE,
blocking new ENTERs. The existing open position stays open in RETRACT. Fix: restart
supervisord (which brings OBF up), then restart `dolphin:nautilus_trader`.
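
One way to spot this condition early is to look for the degraded-gate message in the tail of the trader log. A minimal sketch, assuming the log lives at `/tmp/nautilus_trader.log` (the path used elsewhere in this section) and that the message text matches the quote above; `obf_gate_degraded` is a hypothetical helper.

```bash
# Hypothetical helper: succeed (exit 0) if the degraded-gate message appears
# in the last N lines of the given trader log.
obf_gate_degraded() {
    # $1: trader log path; $2: number of trailing lines to inspect (default 50)
    tail -n "${2:-50}" "$1" | grep -q 'OBF gate degraded to random'
}
```

A call such as `obf_gate_degraded /tmp/nautilus_trader.log && echo "OBF gate degraded: check obf_universe"` could run from a cron job or a monitoring loop, well before the ~60-bar threshold is reached.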

**RULE 4: Before restarting BLUE after a gap, check for open positions.**
If BLUE was in RETRACT when it died, there MAY be an open position on live BingX mainnet.
Check `/tmp/dolphin_capital_checkpoint.json` (last capital) and `/tmp/nautilus_trader.log`
(last V7 decision) for context, but verify directly on BingX before restarting.

### 16.13 OOM Recovery Runbook (post-reboot)

```bash
# Confirm nothing is running
supervisorctl -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf status 2>&1 || echo "supervisord down; need to start"

# Check BLUE's last known state
tail -5 /tmp/nautilus_trader.log
cat /tmp/dolphin_capital_checkpoint.json

# Restart supervisord (data pipeline only; do NOT auto-start BLUE)
mkdir -p /tmp/dolphin_logs/supervisor /tmp/dolphin_logs/trader
DOLPHIN_LOG_ROOT=/tmp/dolphin_logs supervisord \
    -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf

# Give services 15s to reach RUNNING state
sleep 15
supervisorctl -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf status

# Verify OBF is connected (look for "subscribed" in first 30 lines of log)
head -30 /tmp/dolphin_logs/supervisor/obf_universe-error.log

# Only then, start BLUE after manually confirming BingX position state
supervisorctl -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf \
    start dolphin:nautilus_trader
```
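
The fixed `sleep 15` in the runbook can be replaced by polling. The sketch below assumes that `supervisorctl status` prints one `group:name STATE ...` line per process and that the data pipeline runs under the `dolphin_data` group; all function names are hypothetical.

```bash
CONF=/mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf

# Prints the status table; split out so it can be swapped for a stub.
data_group_status() {
    supervisorctl -c "$CONF" status 2>&1
}

# Lists dolphin_data processes that are not RUNNING. If no dolphin_data line
# is present at all (e.g. supervisord is down), prints a marker instead so
# that an empty result always means "everything is up".
pending_data_processes() {
    data_group_status | awk '
        $1 ~ /^dolphin_data:/ { seen = 1; if ($2 != "RUNNING") print $1 }
        END { if (!seen) print "(no dolphin_data processes listed)" }'
}

# Polls every 2 s until nothing is pending; fails after $1 seconds (default 60).
wait_for_data_group() {
    local timeout="${1:-60}"
    local waited=0
    local pending
    while pending=$(pending_data_processes); [ -n "$pending" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "still not RUNNING after ${timeout}s: $pending" >&2
            return 1
        fi
        sleep 2
        waited=$((waited + 2))
    done
}
```

Running `wait_for_data_group 60 || exit 1` in place of the sleep stops the runbook early if the data pipeline does not come up. BLUE is still started manually, and only after the BingX position check.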

### 16.11 Monitoring Endpoints