DOLPHIN/prod/docs/CLICKHOUSE_OBSERVABILITY.md
hjnormey 01c19662cb initial: import DOLPHIN baseline 2026-04-21 from dolphinng5_predict working tree
Includes core prod + GREEN/BLUE subsystems:
- prod/ (BLUE harness, configs, scripts, docs)
- nautilus_dolphin/ (GREEN Nautilus-native impl + dvae/ preserved)
- adaptive_exit/ (AEM engine + models/bucket_assignments.pkl)
- Observability/ (EsoF advisor, TUI, dashboards)
- external_factors/ (EsoF producer)
- mc_forewarning_qlabs_fork/ (MC regime/envelope)

Excludes runtime caches, logs, backups, and reproducible artifacts per .gitignore.
2026-04-21 16:58:38 +02:00


# ClickHouse Observability Layer
**Deployed:** 2026-04-06
**CH Version:** 24.3-alpine
**Ports:** HTTP :8123, Native :9000
**OTel Collector:** OTLP gRPC :4317 / HTTP :4318
**Play UI:** http://100.105.170.6:8123/play
---
## Architecture
```
Dolphin services → ch_put() → ch_writer.py (async batch) → dolphin-clickhouse:8123
NG7 laptop → ng_otel_writer.py (OTel SDK) → dolphin-otelcol:4317 → dolphin-clickhouse
/proc poller → system_stats_service.py → dolphin.system_stats
supervisord → supervisord_ch_listener.py (eventlistener) → dolphin.supervisord_state
```
All writes are **fire-and-forget**: `ch_writer` batches rows in a background thread and silently drops them when the queue is full, so the OBF hot loop (100 ms) is never blocked.
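
The fire-and-forget pattern described above can be sketched roughly as follows. This is not the actual `ch_writer.py` code: the class name `FireAndForgetWriter`, its parameters, and the `flush_fn` callback are all hypothetical, chosen only to show a bounded queue whose producers never block and whose failures never reach the caller.

```python
import queue
import threading


class FireAndForgetWriter:
    """Hypothetical stand-in for ch_writer: bounded queue plus background flusher."""

    def __init__(self, flush_fn, max_queue=10_000, batch_size=500, flush_interval=1.0):
        self._q = queue.Queue(maxsize=max_queue)
        self._flush_fn = flush_fn            # called as flush_fn(table, rows)
        self._batch_size = batch_size
        self._flush_interval = flush_interval
        self.dropped = 0                     # approximate counter; drops are never raised
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def put(self, table, row):
        """Never blocks: when the queue is full the row is dropped silently."""
        try:
            self._q.put_nowait((table, row))
        except queue.Full:
            self.dropped += 1

    def _flush_once(self):
        """Drain up to batch_size rows, group them by table, flush each group."""
        batches = {}
        for _ in range(self._batch_size):
            try:
                table, row = self._q.get_nowait()
            except queue.Empty:
                break
            batches.setdefault(table, []).append(row)
        for table, rows in batches.items():
            try:
                self._flush_fn(table, rows)
            except Exception:
                pass  # a failed insert must not propagate to producers
        return bool(batches)

    def _run(self):
        while not self._stop.is_set():
            if not self._flush_once():
                # Nothing queued: sleep, but wake immediately on close().
                self._stop.wait(self._flush_interval)

    def close(self):
        """Stop the background thread, then flush whatever is still queued."""
        self._stop.set()
        if self._thread.is_alive():
            self._thread.join()
        while self._flush_once():
            pass
```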
---
## Tables
| Table | Source | Rate | Retention |
|---|---|---|---|
| `eigen_scans` | nautilus_event_trader | ~8/min | 10yr |
| `posture_events` | meta_health_service_v3 | few/day | forever |
| `acb_state` | acb_processor_service | ~5/day | forever |
| `daily_pnl` | paper_trade_flow | 1/day | forever |
| `trade_events` | DolphinActor (pending) | ~40/day | 10yr |
| `obf_universe` | obf_universe_service | 540/min | forever |
| `obf_fast_intrade` | DolphinActor (pending) | 100ms×assets | 5yr |
| `exf_data` | exf_fetcher_flow | ~1/min | forever |
| `meta_health` | meta_health_service_v3 | ~1/10s | forever |
| `account_events` | DolphinActor (pending) | rare | forever |
| `supervisord_state` | supervisord_ch_listener | push+60s poll | forever |
| `system_stats` | system_stats_service | 1/30s | forever |
OTel tables (`otel_logs`, `otel_traces`, `otel_metrics_*`) are auto-created by the collector for NG7 instrumentation.
---
## Distributed Trace ID
`scan_uuid` (UUIDv7) is the causal trace root across all tables:
```
eigen_scans.scan_uuid ← NG7 generates one per scan
├── obf_fast_intrade.scan_uuid (100ms OBF while in-trade)
├── trade_events.scan_uuid (entry + exit rows)
└── posture_events.scan_uuid (if scan triggered posture re-eval)
```
**NG7 migration:** replace `uuid.uuid4()` with `uuid7()` from `ch_writer.py`. The String format is unchanged, so it is a drop-in swap.
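
For readers unfamiliar with UUIDv7, here is a minimal sketch of how such an ID can be built, following the RFC 9562 bit layout (48-bit millisecond timestamp, version, random bits, variant). The function name `uuid7_sketch` is hypothetical and this is not the `uuid7()` shipped in `ch_writer.py`; it only illustrates why the result keeps the same 36-character string form as `uuid4()` while sorting by creation time.

```python
import os
import time
import uuid


def uuid7_sketch(ts_ms=None):
    """Build a UUIDv7-style string: time-ordered prefix, random suffix."""
    if ts_ms is None:
        ts_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")    # 80 random bits
    rand_a = rand >> 68                             # top 12 bits
    rand_b = rand & ((1 << 62) - 1)                 # low 62 bits
    value = (
        ((ts_ms & ((1 << 48) - 1)) << 80)           # 48-bit Unix timestamp in ms
        | (0x7 << 76)                               # version 7
        | (rand_a << 64)
        | (0b10 << 62)                              # RFC variant bits
        | rand_b
    )
    return str(uuid.UUID(int=value))


# Same textual shape as str(uuid.uuid4()), so a String column needs no change.
example_id = uuid7_sketch()
```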
---
## Key Queries (CH Play)
```sql
-- Current system state
SELECT * FROM dolphin.v_current_posture;
-- Scan latency last hour
SELECT * FROM dolphin.v_scan_latency_1h;
-- Trade summary last 30 days
SELECT * FROM dolphin.v_trade_summary_30d;
-- Process health
SELECT * FROM dolphin.v_process_health;
-- System resources (5min buckets, last hour)
SELECT * FROM dolphin.v_system_stats_1h ORDER BY bucket;
-- Full causal chain for a scan
SELECT event_type, ts, detail, value1, value2
FROM dolphin.v_scan_causal_chain
WHERE trace_id = '<scan_uuid>'
ORDER BY ts;
-- Scans that preceded losing trades
SELECT e.scan_number, e.vel_div, t.asset, t.pnl, t.exit_reason
FROM dolphin.trade_events t
JOIN dolphin.eigen_scans e ON e.scan_uuid = t.scan_uuid
WHERE t.pnl < 0 AND t.exit_price > 0
ORDER BY t.pnl ASC LIMIT 20;
```
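
Outside the Play UI, the causal-chain query can also be issued over the ClickHouse HTTP interface with a server-side query parameter instead of string interpolation. The sketch below only builds the request; the function name `build_causal_chain_request` is hypothetical, and whether the `dolphin` user is allowed to query over HTTP from your host is an assumption.

```python
import urllib.parse
import urllib.request

# {trace_id:String} is filled in by ClickHouse from the param_trace_id URL parameter.
CAUSAL_CHAIN_SQL = """
SELECT event_type, ts, detail, value1, value2
FROM dolphin.v_scan_causal_chain
WHERE trace_id = {trace_id:String}
ORDER BY ts
"""


def build_causal_chain_request(base_url, user, password, scan_uuid):
    """Return a POST request for the causal chain of one scan_uuid."""
    params = urllib.parse.urlencode(
        {"param_trace_id": scan_uuid, "default_format": "JSONEachRow"}
    )
    return urllib.request.Request(
        url=f"{base_url}/?{params}",
        data=CAUSAL_CHAIN_SQL.encode("utf-8"),
        headers={"X-ClickHouse-User": user, "X-ClickHouse-Key": password},
        method="POST",
    )


# Usage (not executed here):
#   req = build_causal_chain_request("http://<host>:8123", "<user>", "<password>", "<scan_uuid>")
#   with urllib.request.urlopen(req, timeout=5) as resp:
#       rows = [line for line in resp.read().decode().splitlines() if line]
```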
---
## Files
| File | Purpose |
|---|---|
| `prod/ch_writer.py` | Shared singleton — `from ch_writer import ch_put, ts_us, uuid7` |
| `prod/system_stats_service.py` | /proc poller, runs under supervisord:system_stats |
| `prod/supervisord_ch_listener.py` | supervisord eventlistener |
| `prod/ng_otel_writer.py` (on NG7) | OTel drop-in for remote machines |
| `prod/clickhouse/config.xml` | CH server config (40% RAM cap, async_insert) |
| `prod/clickhouse/users.xml` | dolphin user, wait_for_async_insert=0 |
| `prod/otelcol/config.yaml` | OTel Collector → dolphin.otel_* |
| `/root/ch-setup/schema.sql` | Full DDL — idempotent, re-runnable |
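
As background for `supervisord_ch_listener.py`, the supervisord eventlistener protocol can be sketched as below: the listener announces `READY`, reads a header line of `key:value` tokens, reads `len` characters of payload, then acknowledges with a `RESULT` block. This is not the production listener; `listen`, `parse_tokens`, and the `on_event` callback are hypothetical names, and the sketch only handles payloads that are plain `key:value` tokens (as in `PROCESS_STATE_*` and `TICK_*` events).

```python
import sys


def parse_tokens(line):
    """Parse 'key:value key:value ...' as used in event headers and payloads."""
    return dict(tok.split(":", 1) for tok in line.split())


def listen(stdin, stdout, on_event):
    """Run the READY / event / RESULT loop until stdin is closed.

    on_event(eventname, payload_dict) must not write to stdout, because
    stdout is the protocol channel back to supervisord.
    """
    while True:
        stdout.write("READY\n")
        stdout.flush()
        header_line = stdin.readline()
        if not header_line:
            break                                   # supervisord closed the pipe
        headers = parse_tokens(header_line)
        # len counts bytes; for ASCII payloads characters and bytes coincide.
        payload = stdin.read(int(headers["len"]))
        try:
            on_event(headers["eventname"], parse_tokens(payload))
        finally:
            stdout.write("RESULT 2\nOK")            # always acknowledge the event
            stdout.flush()


# Usage (not executed here): listen(sys.stdin, sys.stdout, handler), where
# handler would forward the state change to ClickHouse.
```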
---
## Credentials
- User: `dolphin` / `dolphin_ch_2026`
- OTel DSN: `http://dolphin_uptrace_token@100.105.170.6:14318/1` (only relevant if Uptrace is ever deployed)
---
## Pending (when DolphinActor is wired)
- `trade_events`: add `ch_put("trade_events", {...})` at entry and exit
- `obf_fast_intrade`: add a `ch_put` call in the OBF 100 ms tick (only when `n_open_positions > 0`)
- `account_events`: emit from the STARTUP/SHUTDOWN/END_DAY hooks
- `daily_pnl`: write at end of day in paper_trade_flow / nautilus_prefect_flow
- See `prod/service_integration.py` for exact copy-paste snippets