OPS: supervisord systemd watchdog + controlled-bringup startup script

Adds dolphin-supervisord.service (installed + enabled) and dolphin_startup_check.sh:
- ExecStartPre waits for HZ (CRITICAL/blocks), CH+Prefect (WARN/degraded-ok)
- Logs to /tmp/dolphin_logs/startup.log + run_logs/dolphin_startup_<date>.log
- Writes machine-readable /tmp/dolphin_logs/startup_status.json on every start
- nautilus_trader remains autostart=false — BLUE must be started manually

SYSTEM BIBLE bumped to v7.1; §16.10 updated, §16.14 added.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -1,7 +1,7 @@
 # DOLPHIN-NAUTILUS SYSTEM BIBLE
 ## Doctrinal Reference — As Running 2026-04-05
 
-**Version**: v7.0 — PINK DITAv2 Fee Accounting Fix + Orphan Prevention (2026-06-08)
+**Version**: v7.1 — supervisord systemd watchdog + controlled-bringup startup (2026-06-08)
 **Previous version**: v6.0 — NG8 Linux Scanner + TUI v3 Live Observability + Test Footer CI (2026-04-05)
 **Previous version**: v4.1 — Multi-Speed Event-Driven Architecture (2026-03-25)
 **CI gate (Nautilus)**: 46/46 tests green
@@ -1362,12 +1362,24 @@ supervisorctl -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.con
 
 ### 16.10 Daemon Start Sequence
 
-**IMPORTANT**: supervisord has NO systemd unit — it is NOT auto-started on reboot.
-After any reboot or OOM kill, supervisord must be started manually (step 2 below).
+**`supervisord` is now auto-started on reboot via `dolphin-supervisord.service` (systemd).**
+See §16.14 for watchdog details. Manual start is still the correct method for the initial
+activation or when you need to bypass the watchdog.
 
 ```bash
-# 1. Verify Hazelcast/Prefect are running (systemd-managed, survive reboots)
-systemctl status dolphin-prefect-worker
+# --- Automatic (post-reboot, systemd handles it) ---
+# dolphin-supervisord.service runs dolphin_startup_check.sh as ExecStartPre:
+# • Waits for dolphin-hazelcast container healthy (CRITICAL — blocks if not up)
+# • Waits for dolphin-clickhouse (WARN — ch_writer spools on failure)
+# • Waits for dolphin-prefect (WARN — exf_fetcher degrades)
+# • Writes /tmp/dolphin_logs/startup_status.json + /tmp/dolphin_logs/startup.log
+# Then starts supervisord. dolphin_data group (OBF, ACB, MHS, exf, maras, esof)
+# auto-starts. nautilus_trader (BLUE) does NOT auto-start — must be manual.
+
+# --- Manual (if you need to start/restart supervisord yourself) ---
+
+# 1. Verify Hazelcast/ClickHouse containers are running
+docker ps --filter name=dolphin-hazelcast --filter name=dolphin-clickhouse
 
 # 2. Start supervisord (MUST export DOLPHIN_LOG_ROOT — used by logfile= directives)
 mkdir -p /tmp/dolphin_logs/supervisor /tmp/dolphin_logs/trader
@@ -1378,7 +1390,7 @@ DOLPHIN_LOG_ROOT=/tmp/dolphin_logs supervisord \
 # 3. Verify data pipeline is up
 supervisorctl -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf status
 
-# 4. Start BLUE (manual — autostart=false; only start after verifying BingX position state)
+# 4. Start BLUE (manual — autostart=false; only start after verifying exchange position state)
 supervisorctl -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf \
 start dolphin:nautilus_trader
 
@@ -1441,6 +1453,65 @@ supervisorctl -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.con
 start dolphin:nautilus_trader
 ```
 
+### 16.14 supervisord Systemd Watchdog
+
+**Unit:** `/etc/systemd/system/dolphin-supervisord.service`
+**Pre-start script:** `/mnt/dolphinng5_predict/prod/supervisor/dolphin_startup_check.sh`
+
+#### What it does
+
+| Phase | Action |
+|---|---|
+| **ExecStartPre 1** | `mkdir -p /tmp/dolphin_logs/supervisor /tmp/dolphin_logs/trader` |
+| **ExecStartPre 2** | Runs `dolphin_startup_check.sh` — checks HZ (CRITICAL), CH (WARN), Prefect (WARN) |
+| **ExecStart** | `supervisord -c dolphin-supervisord.conf` |
+| **ExecStop** | `supervisorctl shutdown` |
+| **Restart** | `on-failure`, 30 s delay — only fires on crash/OOM-kill, never on clean stop |
+| **TimeoutStartSec** | 300 s — enough for bringup-check loops (HZ=90 s, CH=60 s, Prefect=45 s) |
+
+#### Startup check logic (dolphin_startup_check.sh)
+
+1. Checks Docker daemon. If unavailable, skips container checks, falls back to TCP.
+2. **Hazelcast** — container must reach `running` + `healthy` within 90 s, then TCP:5701
+confirmed. If HZ is not up: **script exits 1 → systemd refuses to start supervisord**.
+3. **ClickHouse** — container running + TCP:8123 within 60 s. Failure = WARN only.
+4. **Prefect** — container running + TCP:4200 within 45 s. Failure = WARN only.
+5. Writes `/tmp/dolphin_logs/startup.log` (same dir as all supervisor logs).
+6. Writes `/mnt/dolphinng5_predict/run_logs/dolphin_startup_<date>.log` (persistent).
+7. Writes `/tmp/dolphin_logs/startup_status.json` (machine-readable, for BLUE to check).
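
The CRITICAL/WARN tiering in steps 2 to 4 can be sketched as a small bash helper. This is an illustration under stated assumptions, not the contents of `dolphin_startup_check.sh`: the function names `wait_for_tcp`, `check_service`, and `run_startup_checks` are hypothetical, the services are assumed to listen on 127.0.0.1, and only the TCP probe is shown (the Docker `running`/`healthy` checks and the log/JSON writes are left out). Ports and timeouts are the ones listed above.

```bash
#!/usr/bin/env bash
# Sketch only: names and probe method are illustrative assumptions,
# not the real dolphin_startup_check.sh.

# Poll host:port about once per second until it accepts a TCP connection
# or timeout_s seconds have elapsed. Returns 0 on success, 1 on timeout.
wait_for_tcp() {
    local host=$1 port=$2 timeout_s=$3
    local deadline=$(( SECONDS + timeout_s ))
    while (( SECONDS < deadline )); do
        # bash's /dev/tcp pseudo-path opens a TCP connection on redirect;
        # the subshell closes the descriptor again when it exits.
        if (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; then
            return 0
        fi
        sleep 1
    done
    return 1
}

# Run one check at a given severity. A CRITICAL failure returns 1 so the
# caller can abort; a WARN failure is reported and returns 0.
check_service() {
    local name=$1 severity=$2 host=$3 port=$4 timeout_s=$5
    if wait_for_tcp "$host" "$port" "$timeout_s"; then
        echo "OK ${name} tcp:${port}"
        return 0
    fi
    echo "${severity} ${name} tcp:${port} not reachable after ${timeout_s}s"
    [[ $severity != CRITICAL ]]
}

# Hazelcast blocks startup; ClickHouse and Prefect only warn.
run_startup_checks() {
    check_service hazelcast  CRITICAL 127.0.0.1 5701 90 || return 1
    check_service clickhouse WARN     127.0.0.1 8123 60
    check_service prefect    WARN     127.0.0.1 4200 45
    return 0
}
```

Called as `run_startup_checks || exit 1` from a pre-start script, a Hazelcast timeout produces a non-zero exit, which is what makes systemd abort the unit start; the two WARN checks never change the exit code. A connect attempt to an unreachable (rather than refusing) host can block for longer than one second, so the timeouts in this sketch are approximate.
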
+
+#### Key safety rule
+
+`nautilus_trader` is `autostart=false` in supervisord.conf. **The systemd watchdog will
+NOT auto-start BLUE.** BLUE must always be started manually after verifying exchange
+position state. The watchdog only manages supervisord and the data pipeline.
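
One way to confirm this after a reboot is to read the state column of `supervisorctl status`. The helper below is a sketch: `program_state` is a hypothetical name, and it assumes the usual `name STATE description` layout of `supervisorctl status` output, with the program addressed as `dolphin:nautilus_trader` as in the start command in §16.10.

```bash
#!/usr/bin/env bash
# Sketch only: program_state is a hypothetical helper, not part of the repo.
# Reads `supervisorctl status` output on stdin and prints the state column
# (second field) for the named program. Exits non-zero if the program is
# not listed at all.
program_state() {
    local program=$1
    awk -v name="$program" '
        $1 == name { print $2; found = 1 }
        END        { exit !found }
    '
}

# Example (run on the host, after boot):
#   supervisorctl -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf status \
#       | program_state dolphin:nautilus_trader
# Anything other than a stopped state right after a reboot would mean BLUE
# was started by something other than an operator.
```
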
+
+#### Management commands
+
+```bash
+# Status
+systemctl status dolphin-supervisord.service
+
+# Start (e.g. first boot or after manual stop)
+systemctl start dolphin-supervisord.service
+# Startup log: /tmp/dolphin_logs/startup.log (also journalctl -u dolphin-supervisord)
+
+# Stop (does NOT restart — clean exit)
+systemctl stop dolphin-supervisord.service
+
+# Disable watchdog (maintenance)
+systemctl disable dolphin-supervisord.service
+
+# Re-enable
+systemctl enable dolphin-supervisord.service
+
+# View startup check log
+cat /tmp/dolphin_logs/startup.log
+journalctl -u dolphin-supervisord.service -n 100
+```
+
+---
+
 ### 16.11 Monitoring Endpoints
 
 | Service | URL / Command |