From 130c466cc9b83ed4e4668ece7582b77a0a61d117 Mon Sep 17 00:00:00 2001
From: Codex
Date: Mon, 8 Jun 2026 13:25:50 +0200
Subject: [PATCH] SYSTEM BIBLE v7.1: OOM recovery runbook + supervisord safety rules
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add §16.10 corrected daemon start sequence (supervisord NOT auto-started on boot),
§16.12 critical supervisord.conf rules (no /tmp paths, OBF starvation → BLUE freeze,
pre-restart position check), §16.13 OOM recovery runbook with exact commands.

Incident context (2026-06-08):
- Previous agent set nautilus_trader to /tmp/blue_runtime_mirror/ — broken after OOM reboot
- OBF died during BLUE run, degraded gate for 285+ bars, BLUE stuck in RETRACT on LTCUSDT
- Fix: revert supervisord.conf to /mnt canonical paths, restart supervisord

Co-Authored-By: Claude Sonnet 4.6
---
 prod/docs/SYSTEM_BIBLE.md | 84 ++++++++++++++++++++++++++++++++++-----
 1 file changed, 74 insertions(+), 10 deletions(-)

diff --git a/prod/docs/SYSTEM_BIBLE.md b/prod/docs/SYSTEM_BIBLE.md
index 6111820..cf32919 100644
--- a/prod/docs/SYSTEM_BIBLE.md
+++ b/prod/docs/SYSTEM_BIBLE.md
@@ -1362,19 +1362,83 @@ supervisorctl -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.con
 
 ### 16.10 Daemon Start Sequence
 
+**IMPORTANT**: supervisord has NO systemd unit — it is NOT auto-started on reboot.
+After any reboot or OOM kill, supervisord must be started manually (step 2 below).
+
+```bash
+# 1. Verify Hazelcast/Prefect are running (systemd-managed, survive reboots)
+systemctl status dolphin-prefect-worker
+
+# 2. Start supervisord (MUST export DOLPHIN_LOG_ROOT — used by logfile= directives)
+mkdir -p /tmp/dolphin_logs/supervisor /tmp/dolphin_logs/trader
+DOLPHIN_LOG_ROOT=/tmp/dolphin_logs supervisord \
+  -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf
+# dolphin_data group (OBF, ACB, MHS, exf, maras, esof) starts automatically
+
+# 3. Verify data pipeline is up
+supervisorctl -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf status
+
+# 4. Start BLUE (manual — autostart=false; only start after verifying BingX position state)
+supervisorctl -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf \
+  start dolphin:nautilus_trader
+
+# 5. Prefect deployments run on schedule (daily):
+#   paper_trade_flow.py ← 00:05 UTC
+#   nautilus_prefect_flow ← 00:10 UTC
 ```
-1. docker-compose up -d ← Hazelcast 5701, ManCenter 8080, Prefect 4200
-2. supervisord (auto) ← starts dolphin_data group automatically on boot
-   └── exf_fetcher, acb_processor, obf_universe, meta_health start in parallel
 
-3. (Manual when needed):
-   supervisorctl start dolphin:nautilus_trader ← HZ entry listener
-   supervisorctl start dolphin:scan_bridge ← when DolphinNG6 active
+### 16.12 CRITICAL — supervisord.conf Safety Rules
 
-4. Prefect deployments (daily, scheduled):
-   paper_trade_flow.py ← 00:05 UTC
-   nautilus_prefect_flow.py ← 00:10 UTC
-   mc_forewarner_flow.py ← daily
+**RULE 1 — Never use /tmp paths for trader binaries.**
+`/tmp` is writable but survives reboots on this host (not a tmpfs). However, directories
+created by agents (e.g. `/tmp/blue_runtime_mirror/`) may be manually cleaned or never
+recreated after an OOM kill, leaving supervisord unable to start the process.
+**Canonical paths for all trader programs MUST reference `/mnt/dolphinng5_predict/`.**
+
+**RULE 2 — nautilus_trader correct config (BLUE live mainnet):**
+```ini
+command=/home/dolphin/siloqy_env/bin/python3 /mnt/dolphinng5_predict/prod/nautilus_event_trader.py
+directory=/mnt/dolphinng5_predict/prod
+environment=PYTHONPATH="/mnt/dolphinng5_predict:/mnt/dolphinng5_predict/nautilus_dolphin:/mnt/dolphinng5_predict/prod",DOLPHIN_LOCAL_RUNTIME_ROOT="/mnt/dolphinng5_predict",...
+```
+
+**RULE 3 — OBF starvation → BLUE freeze.**
+If `obf_universe` dies and is not restarted, BLUE logs
+`"OBF step_live: no snapshots for N consecutive bars — OBF gate degraded to random"`.
+After ~60 bars (~10 min) without OBF the survival stack degrades to TURTLE/HIBERNATE,
+blocking new ENTERs. The existing open position stays open in RETRACT. Fix: restart
+supervisord (which brings OBF up), then restart `dolphin:nautilus_trader`.
+
+**RULE 4 — Before restarting BLUE after a gap, check for open positions.**
+If BLUE was in RETRACT when it died, there MAY be an open position on live BingX mainnet.
+Check `/tmp/dolphin_capital_checkpoint.json` (last capital) and `/tmp/nautilus_trader.log`
+(last V7 decision) for context, but verify directly on BingX before restarting.
+
+### 16.13 OOM Recovery Runbook (post-reboot)
+
+```bash
+# Confirm nothing is running
+supervisorctl -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf status 2>&1 || echo "supervisord down — need to start"
+
+# Check BLUE's last known state
+tail -5 /tmp/nautilus_trader.log
+cat /tmp/dolphin_capital_checkpoint.json
+
+# Restart supervisord (data pipeline only — do NOT auto-start BLUE)
+mkdir -p /tmp/dolphin_logs/supervisor /tmp/dolphin_logs/trader
+DOLPHIN_LOG_ROOT=/tmp/dolphin_logs supervisord \
+  -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf
+
+# Give services 15s to reach RUNNING state
+sleep 15
+supervisorctl -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf status
+
+# Verify OBF is connected (look for "subscribed" in first 30 lines of log)
+head -30 /tmp/dolphin_logs/supervisor/obf_universe-error.log
+
+# Only then, start BLUE after manually confirming BingX position state
+supervisorctl -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf \
+  start dolphin:nautilus_trader
 ```
 
 ### 16.11 Monitoring Endpoints