# ClickHouse Observability Layer
- **Deployed:** 2026-04-06
- **CH Version:** 24.3-alpine
- **Ports:** HTTP :8123, Native :9000
- **OTel Collector:** OTLP gRPC :4317 / HTTP :4318
- **Play UI:** http://100.105.170.6:8123/play

---
## Architecture
```
Dolphin services → ch_put() → ch_writer.py (async batch) → dolphin-clickhouse:8123
NG7 laptop → ng_otel_writer.py (OTel SDK) → dolphin-otelcol:4317 → dolphin-clickhouse
/proc poller → system_stats_service.py → dolphin.system_stats
supervisord → supervisord_ch_listener.py (eventlistener) → dolphin.supervisord_state
```
All writes are **fire-and-forget**: `ch_writer` batches rows in a background thread and silently drops them when its queue is full, so the OBF hot loop (100ms) is never blocked.
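
The fire-and-forget behaviour can be sketched as a bounded queue drained by a background thread. This is a minimal illustration under assumptions, not the actual `ch_writer.py`: the class name `FireAndForgetWriter`, the `flush` callback, and the queue and batch sizes are all hypothetical.

```python
import queue
import threading


class FireAndForgetWriter:
    """Hypothetical stand-in for ch_writer: bounded queue + background flusher."""

    def __init__(self, flush, max_queue=10_000, batch_size=500, interval_s=0.2):
        self._q = queue.Queue(maxsize=max_queue)
        self._flush = flush              # callable(table, rows), e.g. an HTTP insert
        self._batch_size = batch_size
        self._interval_s = interval_s
        self.dropped = 0                 # rows discarded because the queue was full
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def put(self, table, row):
        """Never blocks the caller: a full queue means the row is dropped."""
        try:
            self._q.put_nowait((table, row))
        except queue.Full:
            self.dropped += 1

    def _drain(self):
        """Pull up to batch_size rows, group them by table, flush each group."""
        batches = {}
        for _ in range(self._batch_size):
            try:
                table, row = self._q.get_nowait()
            except queue.Empty:
                break
            batches.setdefault(table, []).append(row)
        for table, rows in batches.items():
            try:
                self._flush(table, rows)
            except Exception:
                pass                     # fire-and-forget: failed inserts are not retried
        return bool(batches)

    def _run(self):
        while not self._stop.is_set():
            if not self._drain():
                self._stop.wait(self._interval_s)

    def close(self):
        """Stop the thread, then flush whatever is still queued."""
        self._stop.set()
        self._thread.join()
        while self._drain():
            pass
```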
---
## Tables
| Table | Source | Rate | Retention |
|---|---|---|---|
| `eigen_scans` | nautilus_event_trader | ~8/min | 10yr |
| `posture_events` | meta_health_service_v3 | few/day | forever |
| `acb_state` | acb_processor_service | ~5/day | forever |
| `daily_pnl` | paper_trade_flow | 1/day | forever |
| `trade_events` | DolphinActor (pending) | ~40/day | 10yr |
| `obf_universe` | obf_universe_service | 540/min | forever |
| `obf_fast_intrade` | DolphinActor (pending) | 100ms×assets | 5yr |
| `exf_data` | exf_fetcher_flow | ~1/min | forever |
| `meta_health` | meta_health_service_v3 | ~1/10s | forever |
| `account_events` | DolphinActor (pending) | rare | forever |
| `supervisord_state` | supervisord_ch_listener | push+60s poll | forever |
| `system_stats` | system_stats_service | 1/30s | forever |

The OTel tables (`otel_logs`, `otel_traces`, `otel_metrics_*`) are auto-created by the collector for NG7 instrumentation.

---
## Distributed Trace ID
`scan_uuid` (UUIDv7) is the causal trace root across all tables:
```
eigen_scans.scan_uuid ← NG7 generates one per scan
├── obf_fast_intrade.scan_uuid (100ms OBF while in-trade)
├── trade_events.scan_uuid (entry + exit rows)
└── posture_events.scan_uuid (if scan triggered posture re-eval)
```
**NG7 migration:** replace `uuid.uuid4()` with `uuid7()` from `ch_writer.py`. The result has the same String format, so it is a drop-in change.
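
For orientation, a UUIDv7 carries a 48-bit Unix millisecond timestamp in its leading bits, so its string form sorts by creation time. The sketch below follows the RFC 9562 bit layout; `uuid7_sketch` is a hypothetical name, and the real `uuid7()` in `ch_writer.py` may be implemented differently.

```python
import os
import time
import uuid


def uuid7_sketch(unix_ms=None):
    """Build a UUIDv7 string: 48-bit Unix ms timestamp followed by random bits."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    value = (unix_ms & 0xFFFFFFFFFFFF) << 80           # timestamp in the top 48 bits
    value |= int.from_bytes(os.urandom(10), "big")     # 80 random bits below it
    value = (value & ~(0xF << 76)) | (0x7 << 76)       # version nibble = 7
    value = (value & ~(0x3 << 62)) | (0x2 << 62)       # variant bits = 10
    return str(uuid.UUID(int=value))


# Same 36-character format as str(uuid.uuid4()), which is what makes it drop-in.
scan_uuid = uuid7_sketch()
```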
---
## Key Queries (CH Play)
```sql
-- Current system state
SELECT * FROM dolphin.v_current_posture;
-- Scan latency last hour
SELECT * FROM dolphin.v_scan_latency_1h;
-- Trade summary last 30 days
SELECT * FROM dolphin.v_trade_summary_30d;
-- Process health
SELECT * FROM dolphin.v_process_health;
-- System resources (5min buckets, last hour)
SELECT * FROM dolphin.v_system_stats_1h ORDER BY bucket;
-- Full causal chain for a scan
SELECT event_type, ts, detail, value1, value2
FROM dolphin.v_scan_causal_chain
WHERE trace_id = '<scan_uuid>'
ORDER BY ts;
-- Scans that preceded losing trades
SELECT e.scan_number, e.vel_div, t.asset, t.pnl, t.exit_reason
FROM dolphin.trade_events t
JOIN dolphin.eigen_scans e ON e.scan_uuid = t.scan_uuid
WHERE t.pnl < 0 AND t.exit_price > 0
ORDER BY t.pnl ASC LIMIT 20;
```
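
Outside the Play UI, the same queries can be sent to the HTTP port. The sketch below builds a request for the causal-chain query using ClickHouse's HTTP query parameters (`{name:Type}` in the SQL, `param_name` in the URL) instead of pasting the UUID into the string. The helper names `build_request` and `run_query` are hypothetical, and the host and credentials are left as caller-supplied arguments.

```python
import json
import urllib.parse
import urllib.request

CAUSAL_CHAIN_SQL = """
SELECT event_type, ts, detail, value1, value2
FROM dolphin.v_scan_causal_chain
WHERE trace_id = {trace_id:String}
ORDER BY ts
"""


def build_request(base_url, sql, params, user, password):
    """Build a POST request for the ClickHouse HTTP interface (no network I/O)."""
    query = {"default_format": "JSONEachRow"}
    # Each {name:Type} placeholder in the SQL is filled from a param_<name> URL argument.
    query.update({f"param_{k}": str(v) for k, v in params.items()})
    url = f"{base_url.rstrip('/')}/?{urllib.parse.urlencode(query)}"
    headers = {"X-ClickHouse-User": user, "X-ClickHouse-Key": password}
    return urllib.request.Request(url, data=sql.encode("utf-8"), headers=headers, method="POST")


def run_query(request, timeout_s=10):
    """Send the request and decode one JSON object per returned row."""
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return [json.loads(line) for line in resp.read().decode("utf-8").splitlines() if line]


# Usage (requires a reachable server):
# rows = run_query(build_request("http://<host>:8123", CAUSAL_CHAIN_SQL,
#                                {"trace_id": "<scan_uuid>"}, "<user>", "<password>"))
```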
---
## Files
| File | Purpose |
|---|---|
| `prod/ch_writer.py` | Shared singleton — `from ch_writer import ch_put, ts_us, uuid7` |
| `prod/system_stats_service.py` | /proc poller, runs under supervisord:system_stats |
| `prod/supervisord_ch_listener.py` | supervisord eventlistener |
| `prod/ng_otel_writer.py` (on NG7) | OTel drop-in for remote machines |
| `prod/clickhouse/config.xml` | CH server config (40% RAM cap, async_insert) |
| `prod/clickhouse/users.xml` | dolphin user, wait_for_async_insert=0 |
| `prod/otelcol/config.yaml` | OTel Collector → dolphin.otel_* |
| `/root/ch-setup/schema.sql` | Full DDL — idempotent, re-runnable |
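
As background for `supervisord_ch_listener.py`: a supervisord eventlistener talks to supervisord over stdin/stdout, announcing `READY`, reading a header line whose `len` field gives the payload size, and acknowledging with `RESULT`. The sketch below shows only that handshake and is not the actual listener; the `handle` callback is hypothetical and would be where a `ch_put("supervisord_state", ...)` call goes.

```python
import sys


def parse_tokens(line):
    """Parse 'key:value key:value' as used in supervisord event headers and payloads."""
    return dict(token.split(":", 1) for token in line.split())


def listen(handle, stdin=sys.stdin, stdout=sys.stdout):
    """Event listener loop: announce READY, read one event, acknowledge it."""
    while True:
        stdout.write("READY\n")
        stdout.flush()
        header_line = stdin.readline()
        if not header_line:
            break                                   # supervisord closed the pipe
        header = parse_tokens(header_line)
        # 'len' counts bytes; for ASCII payloads that equals the character count.
        payload = stdin.read(int(header["len"]))
        # Process-state payloads are a single token line; ignore any trailing body.
        handle(header, parse_tokens(payload.split("\n", 1)[0]))
        stdout.write("RESULT 2\nOK")
        stdout.flush()
```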
---
## Credentials
- User: `dolphin` / `dolphin_ch_2026`
- OTel DSN: `http://dolphin_uptrace_token@100.105.170.6:14318/1` (only relevant if Uptrace is ever deployed)
---
## Pending (when DolphinActor is wired)
- `trade_events`: add `ch_put("trade_events", {...})` at entry and at exit
- `obf_fast_intrade`: add a `ch_put` call in the OBF 100ms tick, only when `n_open_positions > 0`
- `account_events`: write from the STARTUP/SHUTDOWN/END_DAY hooks
- `daily_pnl`: write the end-of-day row from paper_trade_flow / nautilus_prefect_flow
- See `prod/service_integration.py` for exact copy-paste snippets