# DOLPHIN Paper Trading — Production Bringup Guide
**Purpose**: Step-by-step ops guide for standing up the Prefect + Hazelcast paper trading stack.
**Audience**: Operations agent or junior dev. No research decisions required.
**State as of**: 2026-03-06
**Assumes**: Windows 11, Docker Desktop installed, Siloqy venv exists at `C:\Users\Lenovo\Documents\- Siloqy\`

---
## Architecture Overview
```
[ARB512 Scanner] ─► eigenvalues/YYYY-MM-DD/ ─► [paper_trade_flow.py]
                                                         |
                                             [NDAlphaEngine (Python)]
                                                         |
                                          ┌──────────────┴──────────────┐
                                  [Hazelcast IMap]            [paper_logs/*.jsonl]
                                          |
                                 [Prefect UI :4200]
                                  [HZ-MC UI :8080]
```
**Components:**
- `docker-compose.yml`: Hazelcast 5.3 (port 5701) + HZ Management Center (port 8080) + Prefect Server (port 4200)
- `paper_trade_flow.py`: Prefect flow, runs daily at 00:05 UTC
- `configs/blue.yml`: Champion SHORT config (frozen, production)
- `configs/green.yml`: Bidirectional config (STATUS: PENDING — LONG validation still in progress)
- Python venv: `C:\Users\Lenovo\Documents\- Siloqy\`
**Data flow**: Prefect triggers daily → reads yesterday's Arrow/NPZ scans from eigenvalues dir → NDAlphaEngine processes → writes P&L to Hazelcast IMap + local JSONL log.

---
## Step 1: Prerequisites Check
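Open a Git Bash terminal. The `/c/Users/...` paths used throughout this guide are Git Bash syntax; PowerShell is only needed for the Option B snippet in Step 3.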
```bash
# 1a. Verify Docker Desktop is installed
docker --version
# Expected: Docker version 29.x.x
# 1b. Verify Python venv
"/c/Users/Lenovo/Documents/- Siloqy/Scripts/python.exe" --version
# Expected: Python 3.11.x or 3.12.x
# 1c. Verify working directories exist
ls "/c/Users/Lenovo/Documents/- DOLPHIN NG HD HCM TSF Predict/prod/"
# Expected: configs/ docker-compose.yml paper_trade_flow.py BRINGUP_GUIDE.md
ls "/c/Users/Lenovo/Documents/- DOLPHIN NG HD HCM TSF Predict/prod/configs/"
# Expected: blue.yml green.yml
```
---
## Step 2: Install Python Dependencies
Run once. Takes ~2-5 minutes.
```bash
"/c/Users/Lenovo/Documents/- Siloqy/Scripts/pip.exe" install \
hazelcast-python-client \
prefect \
pyyaml \
pyarrow \
numpy \
pandas
```
**Verify:**
```bash
"/c/Users/Lenovo/Documents/- Siloqy/Scripts/python.exe" -c "import hazelcast; import prefect; import yaml; print('OK')"
```
---
## Step 3: Start Docker Desktop
Docker Desktop must be running before starting containers.
**Option A (GUI):** Double-click Docker Desktop from Start menu. Wait for the whale icon in the system tray to stop animating (~30-60 seconds).
**Option B (command):**
```powershell
Start-Process "C:\Program Files\Docker\Docker\Docker Desktop.exe"
# Wait ~60 seconds, then verify:
docker ps
```
**Verify Docker is ready:**
```bash
docker info | grep "Server Version"
# Expected: Server Version: NN.x.x (any version line means the daemon is responding)
```
---
## Step 4: Start the Infrastructure Stack
```bash
cd "/c/Users/Lenovo/Documents/- DOLPHIN NG HD HCM TSF Predict/prod"
docker compose up -d
```
**Expected output:**
```
[+] Running 3/3
- Container dolphin-hazelcast Started
- Container dolphin-hazelcast-mc Started
- Container dolphin-prefect Started
```
**Verify all containers healthy:**
```bash
docker compose ps
# All 3 should show "healthy" or "running"
```
**Wait ~30 seconds for Hazelcast to initialize, then verify:**
```bash
curl http://localhost:5701/hazelcast/health/ready
# Expected: {"message":"Hazelcast is ready!"}
curl http://localhost:4200/api/health
# Expected: true
```
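The fixed 30-second wait can also be replaced by polling the two health URLs until both answer. The following is a minimal sketch that assumes only the URLs from the curl checks above; `http_ok` and `wait_until_ready` are hypothetical helper names, not part of the stack.

```python
import time
import urllib.error
import urllib.request

# Health endpoints taken from the curl checks in this step.
HEALTH_URLS = [
    "http://localhost:5701/hazelcast/health/ready",
    "http://localhost:4200/api/health",
]


def http_ok(url, timeout=3.0):
    """Return True if the URL answers with HTTP 200, False on any network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(urls, check=http_ok, attempts=30, delay_s=2.0, sleep=time.sleep):
    """Poll the URLs until all pass `check`, or give up after `attempts` rounds.

    Returns the URLs that never became ready (an empty list means success).
    """
    pending = list(urls)
    for _ in range(attempts):
        pending = [u for u in pending if not check(u)]
        if not pending:
            return []
        sleep(delay_s)
    return pending


# Usage (needs the containers from this step to be starting up):
#   failed = wait_until_ready(HEALTH_URLS)
#   print("all services ready" if not failed else f"not ready: {failed}")
```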
**UIs:**
- Prefect UI: http://localhost:4200
- Hazelcast MC: http://localhost:8080
- Default cluster: `dolphin` (auto-connects to hazelcast:5701)
---
## Step 5: Register Prefect Deployments
Run once to register the blue and green scheduled deployments.
```bash
cd "/c/Users/Lenovo/Documents/- DOLPHIN NG HD HCM TSF Predict/prod"
"/c/Users/Lenovo/Documents/- Siloqy/Scripts/python.exe" paper_trade_flow.py --register
```
**Expected output:**
```
Registered: dolphin-paper-blue
Registered: dolphin-paper-green
```
**Verify in Prefect UI:** http://localhost:4200 → Deployments → should show 2 deployments with CronSchedule "5 0 * * *".
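For orientation, the cron string `5 0 * * *` means minute 5 of hour 0, every day. The sketch below shows how a trigger time relates to the scan date being processed, using the "reads yesterday's scans" rule from the data-flow description. The function names are illustrative and do not come from `paper_trade_flow.py`.

```python
from datetime import datetime, time, timedelta, timezone


def next_run_utc(now):
    """Next 00:05 UTC trigger strictly after `now` (a timezone-aware datetime)."""
    now = now.astimezone(timezone.utc)
    candidate = datetime.combine(now.date(), time(0, 5), tzinfo=timezone.utc)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


def scan_date_for_run(run_time):
    """The flow processes the previous UTC day relative to its trigger time."""
    return run_time.astimezone(timezone.utc).date() - timedelta(days=1)
```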
---
## Step 6: Start the Prefect Worker
The Prefect worker polls for scheduled runs. Run in a separate terminal (keep it open, or run as a service).
```bash
"/c/Users/Lenovo/Documents/- Siloqy/Scripts/prefect.exe" worker start --pool "dolphin"
```
**OR** (if `prefect.exe` is missing from the venv's `Scripts` directory):
```bash
"/c/Users/Lenovo/Documents/- Siloqy/Scripts/python.exe" -m prefect worker start --pool "dolphin"
```
Leave this terminal running. It will pick up the 00:05 UTC scheduled runs.

---
## Step 7: Manual Test Run
Before relying on the schedule, test with a known good date (a date that has scan data).
```bash
cd "/c/Users/Lenovo/Documents/- DOLPHIN NG HD HCM TSF Predict/prod"
"/c/Users/Lenovo/Documents/- Siloqy/Scripts/python.exe" paper_trade_flow.py \
--date 2026-03-05 \
--config configs/blue.yml
```
**Expected output (abbreviated):**
```
=== BLUE paper trade: 2026-03-05 ===
Loaded N scans for 2026-03-05 | cols=XX
2026-03-05: PnL=+XX.XX T=X boost=1.XXx MC=OK
HZ write OK → DOLPHIN_PNL_BLUE[2026-03-05]
=== DONE: blue 2026-03-05 | PnL=+XX.XX | Capital=25,XXX.XX ===
```
**Verify data written to Hazelcast:**
- Open http://localhost:8080 → Maps → DOLPHIN_PNL_BLUE → should contain entry for 2026-03-05
**Verify log file written:**
```bash
ls "/c/Users/Lenovo/Documents/- DOLPHIN NG HD HCM TSF Predict/prod/paper_logs/blue/"
cat "/c/Users/Lenovo/Documents/- DOLPHIN NG HD HCM TSF Predict/prod/paper_logs/blue/paper_pnl_2026-03.jsonl"
```
---
## Step 8: Scan Data Source Verification
The flow reads scan files from:
```
C:\Users\Lenovo\Documents\- Dolphin NG HD (NG3)\correlation_arb512\eigenvalues\YYYY-MM-DD\
```
Each date directory should contain `scan_*__Indicators.npz` or `scan_*.arrow` files.
```bash
ls "/c/Users/Lenovo/Documents/- Dolphin NG HD (NG3)/correlation_arb512/eigenvalues/" | tail -5
# Expected: recent date directories like 2026-03-05, 2026-03-04, etc.
ls "/c/Users/Lenovo/Documents/- Dolphin NG HD (NG3)/correlation_arb512/eigenvalues/2026-03-05/"
# Expected: scan_NNNN__Indicators.npz files
```
If a date directory is missing, the flow logs a warning and writes pnl=0 for that day (non-critical).
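A pre-flight check along these lines can tell you in advance whether a date will produce a zero-P&L day. This is a sketch under the assumption that only the two file patterns named above matter; `find_scan_files` is a hypothetical helper, and the flow's own loader is not shown in this guide.

```python
from pathlib import Path


def find_scan_files(eigen_root, day):
    """Return sorted scan files for `day` (YYYY-MM-DD); empty list if none exist.

    Matches the two patterns named above: scan_*__Indicators.npz and scan_*.arrow.
    """
    day_dir = Path(eigen_root) / day
    if not day_dir.is_dir():
        return []  # missing directory: the flow would log a warning and book pnl=0
    files = list(day_dir.glob("scan_*__Indicators.npz")) + list(day_dir.glob("scan_*.arrow"))
    return sorted(files)
```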
---
## Step 9: Daily Operations
**Normal daily flow (automated):**
1. ARB512 scanner (extended_main.py) writes scans to eigenvalues/YYYY-MM-DD/ throughout the day
2. At 00:05 UTC, Prefect triggers dolphin-paper-blue and dolphin-paper-green
3. Each flow reads yesterday's scans, runs the engine, writes to HZ + JSONL log
4. Monitor via Prefect UI and HZ-MC
**Check today's run result:**
```bash
# Latest P&L log entry:
tail -1 "/c/Users/Lenovo/Documents/- DOLPHIN NG HD HCM TSF Predict/prod/paper_logs/blue/paper_pnl_$(date +%Y-%m).jsonl"
```
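The same check can be done from Python, which also parses the entry. This is a minimal sketch: `monthly_log_path` and `last_entry` are hypothetical helpers, the month is taken in UTC (the shell `date` command above uses local time), and no particular field names inside the log entries are assumed.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def monthly_log_path(log_dir, when=None):
    """Path of the monthly log file, e.g. paper_pnl_2026-03.jsonl."""
    when = when or datetime.now(timezone.utc)
    return Path(log_dir) / f"paper_pnl_{when:%Y-%m}.jsonl"


def last_entry(path):
    """Parse the last non-empty line of a JSONL file; None if missing or empty."""
    path = Path(path)
    if not path.exists():
        return None
    lines = [ln for ln in path.read_text(encoding="utf-8").splitlines() if ln.strip()]
    return json.loads(lines[-1]) if lines else None
```

Calling `last_entry(monthly_log_path("paper_logs/blue"))` from the `prod` directory returns the most recent record as a dict, or `None` if the month's file has not been created yet.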
**Check HZ state:**
- http://localhost:8080 → Maps → DOLPHIN_STATE_BLUE → key "latest"
- Should show: `{"capital": XXXXX, "strategy": "blue", "last_date": "YYYY-MM-DD", ...}`
---
## Step 10: Restart After Reboot
After Windows restarts:
```bash
# 1. Start Docker Desktop (GUI or command — see Step 3)
# 2. Restart containers
cd "/c/Users/Lenovo/Documents/- DOLPHIN NG HD HCM TSF Predict/prod"
docker compose up -d
# 3. Restart Prefect worker (in a dedicated terminal)
"/c/Users/Lenovo/Documents/- Siloqy/Scripts/python.exe" -m prefect worker start --pool "dolphin"
```
Deployments and HZ data persist (docker volumes: hz_data, prefect_data).

---
## Troubleshooting
### "No scan dir for YYYY-MM-DD"
- The ARB512 scanner may not have run for that date
- Check: `ls "/c/Users/Lenovo/Documents/- Dolphin NG HD (NG3)/correlation_arb512/eigenvalues/YYYY-MM-DD/"`
- Non-critical: flow logs pnl=0 and continues
### "HZ write failed (not critical)"
- Hazelcast container not running or not yet healthy
- Run: `docker compose ps` → check dolphin-hazelcast shows "healthy"
- Run: `docker compose restart hazelcast`
### "ModuleNotFoundError: No module named 'hazelcast'"
- Dependencies not installed in Siloqy venv
- Rerun Step 2
### "error during connect: open //./pipe/dockerDesktopLinuxEngine"
- Docker Desktop not running
- Start Docker Desktop (see Step 3), wait 60 seconds, retry
### Prefect worker not picking up runs
- Verify worker is running with `--pool "dolphin"` (matches work_queue_name in deployments)
- Check Prefect UI → Work Pools → should show "dolphin" pool as online
### Green deployment errors on bidirectional config
- Green is PENDING LONG validation. If `direction: bidirectional` causes engine errors, temporarily set `direction: short_only` in `green.yml` until the LONG system is validated.
---
## Key File Locations
| File | Path |
|---|---|
| Prefect flow | `prod/paper_trade_flow.py` |
| Blue config | `prod/configs/blue.yml` |
| Green config | `prod/configs/green.yml` |
| Docker stack | `prod/docker-compose.yml` |
| Blue P&L logs | `prod/paper_logs/blue/paper_pnl_YYYY-MM.jsonl` |
| Green P&L logs | `prod/paper_logs/green/paper_pnl_YYYY-MM.jsonl` |
| Scan data source | `C:\Users\Lenovo\Documents\- Dolphin NG HD (NG3)\correlation_arb512\eigenvalues\` |
| NDAlphaEngine | `HCM\nautilus_dolphin\nautilus_dolphin\nautilus\esf_alpha_orchestrator.py` |
| MC-Forewarner models | `HCM\nautilus_dolphin\mc_results\models\` |
---
## Current Status (2026-03-06)
| Item | Status |
|---|---|
| Docker stack | Built — needs Docker Desktop running |
| Python deps (HZ + Prefect) | Installing (pip background job) |
| Blue config | Frozen champion SHORT — ready |
| Green config | PENDING — LONG validation running (b79rt78uv) |
| Prefect deployments | Not yet registered (run Step 5 after deps install) |
| Manual test run | Not yet done (run Step 7) |
| vol_p60 calibration | Hardcoded 0.000099 (pre-calibrated from 55-day window) — acceptable |
| Engine state persistence | Implemented — engine capital and open positions serialize to Hazelcast STATE IMap |
### Engine State Persistence
The NDAlphaEngine is instantiated fresh during each daily Prefect run, but its internal state is loaded from the Hazelcast `DOLPHIN_STATE_BLUE`/`GREEN` maps. Both `capital` and any active `position` spanning midnight are accurately tracked and restored.
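The load-and-restore cycle can be pictured as a JSON round trip. The sketch below is an assumption about the shape of the stored state, not the engine's actual code: only `capital`, `strategy` and `last_date` appear in this guide, `position` is modelled as an optional dict, and a real entry with extra keys would need a more tolerant loader.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class EngineState:
    """Hypothetical snapshot of what survives between daily runs."""
    capital: float
    strategy: str
    last_date: str
    position: Optional[dict] = None  # open position carried across midnight, if any


def dump_state(state):
    """Serialize to the JSON string that would be stored under the "latest" key."""
    return json.dumps(asdict(state))


def load_state(raw, default_capital, strategy):
    """Restore state; fall back to a fresh state when the map has no entry yet."""
    if raw is None:
        return EngineState(capital=default_capital, strategy=strategy, last_date="")
    return EngineState(**json.loads(raw))
```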
**Impact for paper trading**: P&L and cumulative capital growth track correctly across days.

---

*Guide written 2026-03-08. Status updated.*

---
## Appendix D: Live Operations Monitoring — DEV "Realized Slippage"
**Purpose**: Track whether ExF latency (~10ms) is causing unacceptable fill slippage vs backtest assumptions.
### Background
- Backtest friction assumptions: **8-10 bps** round-trip (2bps entry + 2bps exit + fees)
- ExF latency-induced drift: **~0.055 bps** (normal vol), **~0.17 bps** (high vol events)
- Current Python implementation is sufficient (latency << friction assumptions)
### Metric Definition
```python
realized_slippage_bps = abs(fill_price - signal_price) / signal_price * 10000
```
### Monitoring Thresholds
| Threshold | Action |
|-----------|--------|
| **< 2 bps** | Nominal: within backtest assumptions |
| **2-5 bps** | ⚠️ Watch — approaching friction limits |
| **> 5 bps** | 🚨 **ALERT** — investigate latency/market impact issues |
### Implementation Notes
- Log `signal_price` (price at signal generation) vs `fill_price` (actual execution)
- Track per-trade slippage in paper_logs
- Alert if 24h moving average exceeds 5 bps
- If consistently > 5 bps → escalate to Java/Chronicle Queue port for <100μs latency
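The metric, the threshold table and the 24h alert rule fit in a few lines. This is a sketch only: the function names and the plain-list input are assumptions, and the window is whatever per-trade values the caller collected over the last 24 hours.

```python
from statistics import fmean


def realized_slippage_bps(signal_price, fill_price):
    """Absolute fill-vs-signal deviation in basis points (metric definition above)."""
    if signal_price <= 0:
        raise ValueError("signal_price must be positive")
    return abs(fill_price - signal_price) / signal_price * 10000


def classify(slippage_bps):
    """Map a slippage value onto the threshold table: nominal / watch / alert."""
    if slippage_bps < 2:
        return "nominal"
    if slippage_bps <= 5:
        return "watch"
    return "alert"


def should_alert(slippages_24h, limit_bps=5.0):
    """True when the mean slippage over the supplied 24h window exceeds the limit."""
    return bool(slippages_24h) and fmean(slippages_24h) > limit_bps
```

For example, a signal at 100.00 filled at 100.03 gives roughly 3 bps, which falls in the "watch" band.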
### TODO
- [ ] Add slippage tracking to `paper_trade_flow.py` trade logging
- [ ] Create Grafana/Prefect alert for slippage > 5 bps
- [ ] Document slippage post-trade analysis pipeline
---
*Last updated: 2026-03-17*

---
## Appendix E: External Factors (ExF) System v2.0
**Date**: 2026-03-17
**Purpose**: Complete production guide for the External Factors real-time data pipeline
**Components**: `exf_fetcher_flow.py`, `exf_persistence.py`, `exf_integrity_monitor.py`
### Architecture
```
┌─────────────────────────────────────────────────────────────────────────────┐
│                        EXTERNAL FACTORS SYSTEM v2.0                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐     │
│  │  Data Providers  │     │  Data Providers  │     │  Data Providers  │     │
│  │    (Binance)     │     │    (Deribit)     │     │   (FRED/Macro)   │     │
│  │  - funding_btc   │     │  - dvol_btc      │     │  - vix           │     │
│  │  - basis         │     │  - dvol_eth      │     │  - dxy           │     │
│  │  - spread        │     │  - fund_dbt_btc  │     │  - us10y         │     │
│  │  - imbal_*       │     │                  │     │                  │     │
│  └────────┬─────────┘     └────────┬─────────┘     └────────┬─────────┘     │
│           │                        │                        │               │
│           └────────────────────────┼────────────────────────┘               │
│                                    ▼                                        │
│  ┌──────────────────────────────────────────────────────────────────┐       │
│  │                RealTimeExFService (28 indicators)                │       │
│  │  - Per-indicator async polling at native rate                    │       │
│  │  - Rate limiting per provider (Binance 20/s, FRED 2/s, etc)      │       │
│  │  - In-memory cache with <1ms read latency                        │       │
│  │  - Daily history rotation for lag support                        │       │
│  └────────────────────────────────┬─────────────────────────────────┘       │
│                                   │                                         │
│           ┌───────────────────────┼───────────────────────┐                 │
│           │                       │                       │                 │
│           ▼                       ▼                       ▼                 │
│  ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐        │
│  │    HOT PATH     │     │  OFF HOT PATH   │     │   MONITORING    │        │
│  │ (0.5s interval) │     │ (5 min interval)│     │ (60s interval)  │        │
│  │                 │     │                 │     │                 │        │
│  │ Hazelcast       │     │ Disk Persistence│     │ Integrity Check │        │
│  │ DOLPHIN_FEATURES│     │ NPZ Format      │     │ HZ vs Disk      │        │
│  │ ['exf_latest']  │     │ /mnt/ng6_data/  │     │ Staleness Check │        │
│  │                 │     │ eigenvalues/    │     │ ACB Validation  │        │
│  │ Instant access  │     │ Durability      │     │ Alert on drift  │        │
│  │ for Alpha Engine│     │ for Backtests   │     │                 │        │
│  └─────────────────┘     └─────────────────┘     └─────────────────┘        │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
### Component Reference
| Component | File | Purpose | Update Rate |
|-----------|------|---------|-------------|
| RealTimeExFService | `realtime_exf_service.py` | Fetches 28 indicators from 8 providers | Per-indicator native rate |
| ExF Fetcher Flow | `exf_fetcher_flow.py` | Prefect flow orchestrating HZ push | 0.5s (500ms) |
| ExF Persistence | `exf_persistence.py` | Disk writer (NPZ format) | 5 minutes |
| ExF Integrity Monitor | `exf_integrity_monitor.py` | Data validation & alerts | 60 seconds |
### Indicators (28 Total)
| Category | Indicators | Count |
|----------|-----------|-------|
| **Binance Derivatives** | funding_btc, funding_eth, oi_btc, oi_eth, ls_btc, ls_eth, ls_top, taker, basis | 9 |
| **Microstructure** | imbal_btc, imbal_eth, spread | 3 |
| **Deribit** | dvol_btc, dvol_eth, fund_dbt_btc, fund_dbt_eth | 4 |
| **Macro (FRED)** | vix, dxy, us10y, sp500, fedfunds | 5 |
| **Sentiment** | fng | 1 |
| **On-chain** | hashrate | 1 |
| **DeFi** | tvl | 1 |
| **Liquidations** | liq_vol_24h, liq_long_ratio, liq_z_score, liq_percentile | 4 |
### ACB-Critical Indicators (9 Required for _acb_ready=True)
These indicators **MUST** be present and fresh for the Adaptive Circuit Breaker to function:
```python
ACB_KEYS = [
"funding_btc", "funding_eth", # Binance funding rates
"dvol_btc", "dvol_eth", # Deribit volatility indices
"fng", # Fear & Greed
"vix", # VIX (market fear)
"ls_btc", # Long/Short ratio
"taker", # Taker buy/sell ratio
"oi_btc", # Open interest
]
```
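A presence check over these nine keys might look like the sketch below. It covers only the "present and numeric" half of the requirement; freshness is not checked here, and `missing_acb_keys` and `acb_ready` are illustrative names rather than functions from the ExF service.

```python
import math

ACB_KEYS = (
    "funding_btc", "funding_eth", "dvol_btc", "dvol_eth",
    "fng", "vix", "ls_btc", "taker", "oi_btc",
)


def missing_acb_keys(snapshot):
    """ACB keys that are absent, None, non-numeric, or NaN in the snapshot dict."""
    missing = []
    for key in ACB_KEYS:
        value = snapshot.get(key)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or math.isnan(value):
            missing.append(key)
    return missing


def acb_ready(snapshot):
    """True only when all nine ACB-critical indicators carry usable values."""
    return not missing_acb_keys(snapshot)
```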
### Data Flow
1. **Fetch**: `RealTimeExFService` polls each provider at native rate
2. **Cache**: Values stored in memory with staleness tracking
3. **HZ Push** (every 0.5s): Hot path to Hazelcast for Alpha Engine
4. **Persistence** (every 5min): Background flush to NPZ on disk
5. **Integrity Check** (every 60s): Validate HZ vs disk consistency
### File Locations (Linux)
| Data Type | Path |
|-----------|------|
| Persistence root | `/mnt/ng6_data/eigenvalues/` |
| Daily directory | `/mnt/ng6_data/eigenvalues/{YYYY-MM-DD}/` |
| ExF snapshots | `/mnt/ng6_data/eigenvalues/{YYYY-MM-DD}/extf_snapshot_{timestamp}__Indicators.npz` |
| Checksum files | `/mnt/ng6_data/eigenvalues/{YYYY-MM-DD}/extf_snapshot_{timestamp}__Indicators.npz.sha256` |
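The `.sha256` sidecar files allow a snapshot to be verified before it is used in a backtest. The sketch below assumes the sidecar holds the hex digest as its first whitespace-separated token (the layout `sha256sum` produces); the actual format written by `exf_persistence.py` is not documented here, and the helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hex SHA-256 digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum(path):
    """Write `<file>.sha256` next to the file and return the sidecar path."""
    path = Path(path)
    sidecar = path.with_name(path.name + ".sha256")
    sidecar.write_text(sha256_of(path) + "\n", encoding="ascii")
    return sidecar


def verify_checksum(path):
    """True if the sidecar exists and its first token matches the file's digest."""
    path = Path(path)
    sidecar = path.with_name(path.name + ".sha256")
    if not sidecar.exists():
        return False
    recorded = sidecar.read_text(encoding="ascii").split()
    return bool(recorded) and recorded[0] == sha256_of(path)
```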
### NPZ File Format
```python
{
# Metadata (JSON string in _metadata array)
"_metadata": json.dumps({
"_timestamp_utc": "2026-03-17T12:00:00+00:00",
"_version": "1.0",
"_service": "ExFPersistence",
"_staleness_s": json.dumps({"basis": 0.2, "funding_btc": 3260.0, ...}),
}),
# Numeric indicators (each as float64 array)
"basis": np.array([0.01178]),
"spread": np.array([0.00143]),
"funding_btc": np.array([7.53e-06]),
"vix": np.array([24.06]),
...
}
```
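A round trip through this layout with NumPy could look like the following. It is a sketch, not the code in `exf_persistence.py`: `save_snapshot` and `load_snapshot` are hypothetical names, and the reader accepts `_metadata` stored either as a 0-d or a 1-element string array because the guide does not pin that detail down.

```python
import json

import numpy as np


def save_snapshot(path, indicators, metadata):
    """Write indicators as 1-element float64 arrays plus a JSON `_metadata` string."""
    arrays = {name: np.array([value], dtype=np.float64) for name, value in indicators.items()}
    arrays["_metadata"] = np.array(json.dumps(metadata))
    np.savez(path, **arrays)  # numpy appends ".npz" if the file name lacks it


def load_snapshot(path):
    """Return (indicators, metadata) from a snapshot in the layout shown above."""
    with np.load(path, allow_pickle=False) as npz:
        metadata = json.loads(str(npz["_metadata"].reshape(-1)[0]))
        indicators = {k: float(npz[k][0]) for k in npz.files if k != "_metadata"}
    return indicators, metadata
```

Loading with `allow_pickle=False` is a deliberate choice in this sketch: the file only needs plain numeric and string arrays, so there is no reason to permit pickled objects from a network share.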
### Running the ExF System
#### Option 1: Standalone (Development/Testing)
```bash
cd /root/extf_docs
# Test mode (no persistence, no monitoring)
python exf_fetcher_flow.py --no-persist --no-monitor --warmup 15
# With persistence (production)
python exf_fetcher_flow.py --warmup 30
# Run integration tests
python test_exf_integration.py --duration 30 --test all
```
#### Option 2: Prefect Deployment (Production)
```bash
# Deploy to Prefect
cd /mnt/dolphinng5_predict/prod
prefect deployment build exf_fetcher_flow.py:exf_fetcher_flow \
--name "exf-live" \
--pool dolphin \
--cron "*/5 * * * *" # Or run continuously
# Start worker
prefect worker start --pool dolphin
```
### Monitoring & Alerting
#### Health Status
The integrity monitor exposes health status via `get_health_status()`:
```python
{
"timestamp": "2026-03-17T12:00:00+00:00",
"overall": "healthy", # healthy | degraded | critical
"hz_connected": True,
"persist_connected": True,
"indicators_present": 28,
"indicators_expected": 28,
"acb_ready": True,
"stale_count": 2,
"alerts_active": 0,
}
```
#### Alert Thresholds
| Condition | Severity | Action |
|-----------|----------|--------|
| ACB-critical indicator missing | **CRITICAL** | Alpha engine may fail |
| Hazelcast disconnected | **CRITICAL** | Real-time data unavailable |
| Indicator stale > 120s | **WARNING** | Check provider API |
| HZ/disk divergence > 3 indicators | **WARNING** | Investigate sync issue |
| Overall health = degraded | **WARNING** | Monitor closely |
| Overall health = critical | **CRITICAL** | Page on-call engineer |
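The first four rows of the table can be expressed as a small rule function. This is a sketch under stated assumptions: `evaluate_alerts` and `worst_severity` are hypothetical, per-indicator staleness and the HZ/disk divergence count are passed in separately because the status dict shown above does not carry them, and the real monitor may weigh these conditions differently.

```python
def evaluate_alerts(status, staleness_s=None, diverged=0, stale_limit_s=120.0):
    """Translate monitor inputs into (severity, message) pairs per the table above."""
    alerts = []
    if not status.get("hz_connected", False):
        alerts.append(("CRITICAL", "Hazelcast disconnected"))
    if not status.get("acb_ready", False):
        alerts.append(("CRITICAL", "ACB-critical indicator missing"))
    for name, age in sorted((staleness_s or {}).items()):
        if age > stale_limit_s:
            alerts.append(("WARNING", f"{name} stale for {age:.0f}s"))
    if diverged > 3:
        alerts.append(("WARNING", f"HZ/disk divergence on {diverged} indicators"))
    return alerts


def worst_severity(alerts):
    """'CRITICAL' outranks 'WARNING'; None when there are no alerts."""
    severities = {sev for sev, _ in alerts}
    if "CRITICAL" in severities:
        return "CRITICAL"
    return "WARNING" if "WARNING" in severities else None
```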
### Troubleshooting
#### Issue: `_acb_ready=False`
**Symptoms**: Health check shows `acb_ready: False`
**Diagnosis**: One or more ACB-critical indicators are missing
```bash
# Check which indicators are missing
python3 << 'EOF'
import hazelcast, json
client = hazelcast.HazelcastClient(cluster_name='dolphin', cluster_members=['localhost:5701'])
raw = client.get_map("DOLPHIN_FEATURES").get("exf_latest").result()
data = json.loads(raw) if raw else {}  # key is empty until the fetcher has pushed at least once
acb_keys = ["funding_btc", "funding_eth", "dvol_btc", "dvol_eth", "fng", "vix", "ls_btc", "taker", "oi_btc"]
missing = [k for k in acb_keys if data.get(k) is None or data[k] != data[k]]  # absent, null, or NaN (NaN != NaN)
print(f"Missing ACB indicators: {missing}")
print(f"Present: {[k for k in acb_keys if k not in missing]}")
client.shutdown()
EOF
```
**Common Causes**:
- Deribit API down (dvol_btc, dvol_eth)
- Alternative.me API down (fng)
- FRED API key expired (vix)
**Fix**: Check provider status, verify API keys in `realtime_exf_service.py`

---
#### Issue: No disk persistence
**Symptoms**: `files_written: 0` in persistence stats
**Diagnosis**:
```bash
# Check mount
ls -la /mnt/ng6_data/eigenvalues/
# Check permissions
touch /mnt/ng6_data/eigenvalues/write_test && rm /mnt/ng6_data/eigenvalues/write_test
# Check disk space
df -h /mnt/ng6_data/
```
**Fix**:
```bash
# Remount if needed
sudo mount -t cifs //100.119.158.61/DolphinNG6_Data /mnt/ng6_data -o credentials=/root/.dolphin_creds
```
---
#### Issue: High staleness
**Symptoms**: Staleness > 120s for critical indicators
**Diagnosis**:
```bash
# Check fetcher process
ps aux | grep exf_fetcher
# Check logs
journalctl -u exf-fetcher -n 100
# Manual fetch test
curl -s "https://fapi.binance.com/fapi/v1/premiumIndex?symbol=BTCUSDT" | head -c 200
curl -s "https://www.deribit.com/api/v2/public/get_volatility_index_data?currency=BTC&resolution=3600&count=1" | head -c 200
```
**Fix**: Restart fetcher, check network connectivity, verify API rate limits not exceeded
### TODO (Future Enhancements)
- [ ] **Expand indicators**: Add 50+ additional indicators from CoinMetrics, Glassnode, etc.
- [ ] **Fix dead indicators**: Repair broken parsers (see `DEAD_INDICATORS` in service)
- [ ] **Adaptive lag**: Switch from uniform lag=1 to per-indicator optimal lags (needs 80+ days data)
- [ ] **Intra-day ACB**: Move from daily to continuous ACB calculation
- [ ] **Arrow format**: Dual output NPZ + Arrow for better performance
- [ ] **Redundancy**: Multiple provider failover for critical indicators
### Data Retention
| Data Type | Retention | Cleanup |
|-----------|-----------|---------|
| Hazelcast cache | Real-time only (no history) | N/A |
| Disk snapshots (NPZ) | 7 days | Automatic |
| Logs | 30 days | Manual/Logrotate |
| Backfill data | Permanent | Never |
---
*Last updated: 2026-03-17*