SYSTEM BIBLE v7.1: OOM recovery runbook + supervisord safety rules

Add §16.10 corrected daemon start sequence (supervisord NOT auto-started on boot), §16.12 critical supervisord.conf rules (no /tmp paths, OBF starvation → BLUE freeze, pre-restart position check), §16.13 OOM recovery runbook with exact commands.

Incident context (2026-06-08):
- Previous agent set nautilus_trader to /tmp/blue_runtime_mirror/ (broken after OOM reboot)
- OBF died during BLUE run, degraded gate for 285+ bars, BLUE stuck in RETRACT on LTCUSDT
- Fix: revert supervisord.conf to /mnt canonical paths, restart supervisord

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 13:25:50 +02:00
parent 8f57f4d855
commit 130c466cc9
1 changed files with 74 additions and 10 deletions
--- a/prod/docs/SYSTEM_BIBLE.md
+++ b/prod/docs/SYSTEM_BIBLE.md
@@ -1362,19 +1362,83 @@ supervisorctl -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.con
 ### 16.10 Daemon Start Sequence
 **IMPORTANT**: supervisord has NO systemd unit — it is NOT auto-started on reboot.
 After any reboot or OOM kill, supervisord must be started manually (step 2 below).
 ```bash
 # 1. Verify Hazelcast/Prefect are running (systemd-managed, survive reboots)
 systemctl status dolphin-prefect-worker
 # 2. Start supervisord (MUST export DOLPHIN_LOG_ROOT — used by logfile= directives)
 mkdir -p /tmp/dolphin_logs/supervisor /tmp/dolphin_logs/trader
 DOLPHIN_LOG_ROOT=/tmp/dolphin_logs supervisord \
    -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf
 # dolphin_data group (OBF, ACB, MHS, exf, maras, esof) starts automatically
 # 3. Verify data pipeline is up
 supervisorctl -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf status
 # 4. Start BLUE (manual — autostart=false; only start after verifying BingX position state)
 supervisorctl -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf \
    start dolphin:nautilus_trader
 # 5. Prefect deployments run on schedule (daily):
 #    paper_trade_flow.py   ← 00:05 UTC
 #    nautilus_prefect_flow ← 00:10 UTC
 ```
-1. docker-compose up -d          ← Hazelcast 5701, ManCenter 8080, Prefect 4200
-2. supervisord (auto)            ← starts dolphin_data group automatically on boot
-   └── exf_fetcher, acb_processor, obf_universe, meta_health start in parallel
-3. (Manual when needed):
-   supervisorctl start dolphin:nautilus_trader   ← HZ entry listener
-   supervisorctl start dolphin:scan_bridge       ← when DolphinNG6 active
-4. Prefect deployments (daily, scheduled):
-   paper_trade_flow.py           ← 00:05 UTC
-   nautilus_prefect_flow.py      ← 00:10 UTC
-   mc_forewarner_flow.py         ← daily
+### 16.12 CRITICAL — supervisord.conf Safety Rules
+**RULE 1 — Never use /tmp paths for trader binaries.**
+`/tmp` on this host is an ordinary writable directory, not a tmpfs, so its contents survive
+reboots. Even so, directories created by agents (e.g. `/tmp/blue_runtime_mirror/`) may be
+cleaned up manually or never recreated after an OOM kill, leaving supervisord unable to start the process.
 **Canonical paths for all trader programs MUST reference `/mnt/dolphinng5_predict/`.**
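 RULE 1 can be checked mechanically before supervisord is (re)started. The following is a
 minimal sketch, not part of the repository: `find_tmp_paths` is a hypothetical helper, and it
 assumes that only `command=` and `directory=` entries matter, since the log directories in
 this runbook intentionally live under `/tmp/dolphin_logs`.
 ```bash
 # Hypothetical helper (assumption, not in the repo): flag command=/directory= entries under /tmp.
 # Prints offending lines with their line numbers and returns 1 if any are found, 0 otherwise.
 find_tmp_paths() {
     conf="$1"
     # Only command= and directory= keys are inspected, so logfile= entries under
     # /tmp/dolphin_logs are not flagged. /tmp must begin a path token to count.
     if grep -nE '^[[:space:]]*(command|directory)[[:space:]]*=(.*[[:space:]])?/tmp(/|[[:space:]]|$)' "$conf"; then
         return 1
     fi
     return 0
 }
 # Usage (config path taken from this document):
 #   find_tmp_paths /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf
 ```
 A non-zero return means at least one program would launch from `/tmp`; quoted paths and
 `environment=` values are not covered by this sketch.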
 **RULE 2 — nautilus_trader correct config (BLUE live mainnet):**
 ```ini
 command=/home/dolphin/siloqy_env/bin/python3 /mnt/dolphinng5_predict/prod/nautilus_event_trader.py
 directory=/mnt/dolphinng5_predict/prod
 environment=PYTHONPATH="/mnt/dolphinng5_predict:/mnt/dolphinng5_predict/nautilus_dolphin:/mnt/dolphinng5_predict/prod",DOLPHIN_LOCAL_RUNTIME_ROOT="/mnt/dolphinng5_predict",...
 ```
 **RULE 3 — OBF starvation → BLUE freeze.**
 If `obf_universe` dies and is not restarted, BLUE logs
 `"OBF step_live: no snapshots for N consecutive bars — OBF gate degraded to random"`.
 After ~60 bars (~10 min) without OBF the survival stack degrades to TURTLE/HIBERNATE,
 blocking new ENTERs. The existing open position stays open in RETRACT. Fix: restart
 supervisord (which brings OBF back up), then restart `dolphin:nautilus_trader` once the
 RULE 4 position check below has passed.
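 To catch OBF starvation before BLUE is restarted, the `supervisorctl status` output can be
 tested for a RUNNING `obf_universe`. This is a sketch under assumptions: `is_running` is a
 hypothetical helper, and it relies on the usual status layout of name, state, description,
 with an optional `group:` prefix on the name.
 ```bash
 # Hypothetical helper (assumption): reads `supervisorctl status` output on stdin and
 # succeeds only if the named program is listed in state RUNNING.
 is_running() {
     awk -v prog="$1" '
         { n = split($1, parts, ":"); name = parts[n] }   # drop optional "group:" prefix
         name == prog && $2 == "RUNNING" { found = 1 }
         END { exit found ? 0 : 1 }
     '
 }
 # Usage (config path taken from this document):
 #   supervisorctl -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf status \
 #       | is_running obf_universe || echo "OBF is not RUNNING: do not start BLUE"
 ```
 Reading from stdin keeps the check independent of where the status text comes from, so it can
 also be run against a saved status dump.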
 **RULE 4 — Before restarting BLUE after a gap, check for open positions.**
 If BLUE was in RETRACT when it died, there MAY be an open position on live BingX mainnet.
 Check `/tmp/dolphin_capital_checkpoint.json` (last capital) and `/tmp/nautilus_trader.log`
 (last V7 decision) for context, but verify directly on BingX before restarting.
 ### 16.13 OOM Recovery Runbook (post-reboot)
 ```bash
 # Confirm nothing is running
 supervisorctl -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf status 2>&1 || echo "supervisord down — need to start"
 # Check BLUE's last known state
 tail -5 /tmp/nautilus_trader.log
 cat /tmp/dolphin_capital_checkpoint.json
 # Restart supervisord (data pipeline only — do NOT auto-start BLUE)
 mkdir -p /tmp/dolphin_logs/supervisor /tmp/dolphin_logs/trader
 DOLPHIN_LOG_ROOT=/tmp/dolphin_logs supervisord \
    -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf
 # Give services 15s to reach RUNNING state
 sleep 15
 supervisorctl -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf status
 # Verify OBF is connected (look for "subscribed" in first 30 lines of log)
 head -30 /tmp/dolphin_logs/supervisor/obf_universe-error.log
 # Only then, start BLUE after manually confirming BingX position state
 supervisorctl -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf \
    start dolphin:nautilus_trader
 ```
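 The manual "look for subscribed" step in the runbook above can be turned into a pass/fail
 check. A minimal sketch, assuming the marker string is exactly `subscribed` and that it shows
 up within the first 30 lines as described; `obf_subscribed` is a hypothetical name.
 ```bash
 # Hypothetical helper (assumption): succeed only if "subscribed" appears in the
 # first 30 lines of the given log file.
 obf_subscribed() {
     log="$1"
     [ -r "$log" ] || return 1          # a missing or unreadable log counts as not connected
     head -n 30 "$log" | grep -q 'subscribed'
 }
 # Usage (log path taken from this document):
 #   obf_subscribed /tmp/dolphin_logs/supervisor/obf_universe-error.log \
 #       || echo "OBF not subscribed yet: hold off on starting BLUE"
 ```
 A failed check only means the marker was not found; the BingX position state still has to be
 confirmed by hand before BLUE is started.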
 ### 16.11 Monitoring Endpoints