Includes core prod + GREEN/BLUE subsystems: - prod/ (BLUE harness, configs, scripts, docs) - nautilus_dolphin/ (GREEN Nautilus-native impl + dvae/ preserved) - adaptive_exit/ (AEM engine + models/bucket_assignments.pkl) - Observability/ (EsoF advisor, TUI, dashboards) - external_factors/ (EsoF producer) - mc_forewarning_qlabs_fork/ (MC regime/envelope) Excludes runtime caches, logs, backups, and reproducible artifacts per .gitignore.
DOLPHIN NG5 - 5 Year / 10 Year Klines Dataset Builder
Quick Summary
| Aspect | Details |
|---|---|
| Current State | 796 days of data (2021-06-15 to 2026-03-05) |
| Gap | 929 missing days (2021-06-16 to 2023-12-31) |
| Target | 5-year dataset: 2021-01-01 to 2026-03-05 (1,890 calendar days; the commands below start at 2021-07-01, which yields ~1,710 files) |
| Disk Required | 150 GB free for 5-year, 400 GB for 10-year |
| Your Disk | 166 GB free ✅ (sufficient for 5-year) |
| Runtime | 10-18 hours for 5-year backfill |
Pre-Flight Status ✅
Disk Space
Free: 166.4 GB / Total: 951.6 GB
Status: SUFFICIENT for 5-year extension
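The disk check above can be reproduced by hand before starting a long run. The following is a minimal sketch, not the logic of the --preflight option itself: the function names are hypothetical, the thresholds are the figures from the summary table, and 1 GB is assumed to mean 10^9 bytes.

```python
import shutil

# Thresholds taken from the Quick Summary table (assumed to be decimal GB).
REQUIRED_GB = {"5y": 150, "10y": 400}


def free_gb(path: str = ".") -> float:
    """Free space, in GB, on the volume that holds `path`."""
    return shutil.disk_usage(path).free / 1e9


def can_run(target: str, available_gb: float) -> bool:
    """True if `available_gb` meets the requirement for `target` ('5y' or '10y')."""
    return available_gb >= REQUIRED_GB[target]


if __name__ == "__main__":
    available = free_gb(".")
    for target, needed in REQUIRED_GB.items():
        verdict = "SUFFICIENT" if can_run(target, available) else "INSUFFICIENT"
        print(f"{target}: need {needed} GB, have {available:.1f} GB -> {verdict}")
```

Point `free_gb` at the drive that will hold klines_cache, since that directory is where most of the space goes.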
Current Data Coverage
Parquet files: 796
Parquet range: 2021-06-15 to 2026-03-05
By year:
2021: 1 days ← Only 1 day!
2024: 366 days ← Complete
2025: 365 days ← Complete
2026: 64 days ← Partial
Arrow directories: 796 (matches parquet)
Klines cache: 0.54 GB (small - mostly fetched)
The Gap
Missing: 2021-06-16 to 2023-12-31 (929 days)
This span, from mid-June 2021 through the end of 2023, is what needs backfilling
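The day counts can be checked with inclusive date arithmetic. This short sketch (the helper name is illustrative) confirms the 929-day gap, and also the 914 days covered when the fetch starts at 2021-07-01, as the manual commands in this guide do.

```python
from datetime import date


def inclusive_days(start: date, end: date) -> int:
    """Number of calendar days from start to end, counting both endpoints."""
    return (end - start).days + 1


# Full gap between the single 2021 file and the start of 2024.
gap = inclusive_days(date(2021, 6, 16), date(2023, 12, 31))

# Range actually fetched when starting at the recommended 2021-07-01.
backfill = inclusive_days(date(2021, 7, 1), date(2023, 12, 31))

print(gap)       # 929
print(backfill)  # 914
```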
How to Run
Option 1: Python Control Script (Recommended)
# Step 0: Review the plan
python klines_backfill_5y_10y.py --plan
# Step 1: Run pre-flight checks
python klines_backfill_5y_10y.py --preflight
# Step 2: Run complete 5-year backfill (ALL PHASES)
# ⚠️ This takes 10-18 hours! Run in a persistent session.
python klines_backfill_5y_10y.py --full-5y
# OR run step by step:
python klines_backfill_5y_10y.py --backfill-5y # Fetch + Compute (8-16 hours)
python klines_backfill_5y_10y.py --convert # Convert to Parquet (30-60 min)
python klines_backfill_5y_10y.py --validate # Validate output (5-10 min)
Option 2: Batch Script (Windows)
# Run the batch file (double-click or run in CMD)
run_5y_klines_backfill.bat
Option 3: Manual Commands
# PHASE 1: Fetch klines (6-12 hours)
cd "C:\Users\Lenovo\Documents\- Dolphin NG Backfill"
python historical_klines_backfiller.py --fetch --start 2021-07-01 --end 2023-12-31
# PHASE 2: Compute eigenvalues (2-4 hours)
python historical_klines_backfiller.py --compute --start 2021-07-01 --end 2023-12-31
# PHASE 3: Convert to Parquet (30-60 minutes)
cd "C:\Users\Lenovo\Documents\- DOLPHIN NG HD HCM TSF Predict"
python ng5_arrow_to_vbt_cache.py --all
# PHASE 4: Validate
python klines_backfill_5y_10y.py --validate
What Each Phase Does
Phase 1: Fetch Klines (6-12 hours)
- Downloads 1-minute OHLCV from Binance public API
- 50 symbols × 914 days = ~45,700 symbol-days
- Rate limited to 1100 req/min (under Binance 1200 limit)
- Cached to klines_cache/{symbol}/{YYYY-MM-DD}.parquet
- Idempotent: Already-fetched dates are skipped
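The idempotent behaviour amounts to checking the cache path before each download. The sketch below illustrates that idea under assumptions: `cache_path` and `pending_jobs` are hypothetical helpers, not functions from historical_klines_backfiller.py, and only the path layout comes from this guide.

```python
from datetime import date, timedelta
from pathlib import Path


def cache_path(root: Path, symbol: str, day: date) -> Path:
    """Cache location for one symbol-day: {root}/{symbol}/{YYYY-MM-DD}.parquet."""
    return root / symbol / f"{day.isoformat()}.parquet"


def pending_jobs(root: Path, symbols: list[str], start: date, end: date):
    """List (symbol, day) pairs in [start, end] that have no cached file yet."""
    jobs = []
    day = start
    while day <= end:
        for symbol in symbols:
            if not cache_path(root, symbol, day).exists():
                jobs.append((symbol, day))
        day += timedelta(days=1)
    return jobs
```

Because the job list is rebuilt from what is on disk, an interrupted run picks up only the symbol-days that are still missing.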
Phase 2: Compute Eigenvalues (2-4 hours)
- Reads cached klines
- Computes rolling correlation eigenvalues:
- w50, w150, w300, w750 windows (1-minute bars)
- Velocities, instabilities, vel_div
- Writes Arrow files: arrow_klines/{date}/scan_{N:06d}_kbf_{HHMM}.arrow
- Idempotent: Already-processed dates are skipped
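To make "rolling correlation eigenvalues" concrete, here is a dependency-free sketch that computes the largest eigenvalue of the correlation matrix over a sliding window of bars. It is an illustration only: the function names are hypothetical, power iteration is a stand-in for whatever solver the backfiller uses (a real pipeline would more likely call something like numpy.linalg.eigvalsh), and the velocity, instability and vel_div features are not reproduced because this guide does not define them.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def correlation_matrix(series):
    """Square matrix of pairwise correlations, with ones on the diagonal."""
    k = len(series)
    return [[1.0 if i == j else pearson(series[i], series[j]) for j in range(k)]
            for i in range(k)]


def largest_eigenvalue(matrix, iterations=100):
    """Largest eigenvalue by power iteration (matrix assumed symmetric PSD)."""
    k = len(matrix)
    # Non-uniform start vector, so it is unlikely to be orthogonal to the
    # dominant eigenvector (a heuristic, not a guarantee).
    v = [float(i + 1) for i in range(k)]
    norm = math.sqrt(sum(c * c for c in v))
    v = [c / norm for c in v]
    lam = 0.0
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(k)) for i in range(k)]
        lam = math.sqrt(sum(c * c for c in w))
        if lam == 0:
            return 0.0
        v = [c / lam for c in w]
    return lam


def rolling_top_eigenvalue(returns, window):
    """One value per bar once `window` bars are available.

    `returns` is a list of per-symbol return series of equal length.
    """
    n = len(returns[0])
    out = []
    for end in range(window, n + 1):
        window_slices = [s[end - window:end] for s in returns]
        out.append(largest_eigenvalue(correlation_matrix(window_slices)))
    return out
```

With 1-minute bars, calling `rolling_top_eigenvalue(returns, 50)` would correspond to the w50 window; when all symbols move together the value approaches the number of symbols, and it falls toward 1 as they decouple.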
Phase 3: Convert to Parquet (30-60 minutes)
- Reads Arrow files
- Converts to VBT cache format
- Output: vbt_cache_klines/{YYYY-MM-DD}.parquet
- Idempotent: Already-converted dates are skipped
Phase 4: Validation (5-10 minutes)
- Counts total parquet files
- Checks date range coverage
- Validates sample files have valid data
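The date-coverage part of validation can be approximated from file names alone. The following sketch assumes the {YYYY-MM-DD}.parquet naming shown for vbt_cache_klines; `missing_dates` is a hypothetical helper and does not inspect file contents, so it complements the --validate option rather than replacing it.

```python
from datetime import date, timedelta


def missing_dates(filenames, start: date, end: date):
    """Dates in [start, end] with no matching '{YYYY-MM-DD}.<ext>' file name."""
    present = set()
    for name in filenames:
        stem = name.rsplit(".", 1)[0]
        try:
            present.add(date.fromisoformat(stem))
        except ValueError:
            continue  # not a dated file; ignore it
    missing = []
    day = start
    while day <= end:
        if day not in present:
            missing.append(day)
        day += timedelta(days=1)
    return missing
```

For example, `missing_dates(os.listdir("vbt_cache_klines"), date(2021, 7, 1), date(2026, 3, 5))` should return an empty list after a complete backfill.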
Important Notes
⏱️ Very Long Runtime
- Total: 10-18 hours for 5-year backfill
- Phase 1 (fetch) is the bottleneck - depends on Binance API rate limits
- Run in a persistent session (tmux on Linux, a persistent CMD window on Windows)
- Safe to interrupt: The script is idempotent, just re-run to resume
💾 Disk Management
- klines_cache grows to ~100-150 GB during fetch
- Can be deleted after conversion to free space
- arrow_klines intermediate: ~20 GB
- Final parquets: ~3 GB additional
📊 Symbol Coverage by Year
| Period | Expected Coverage | Notes |
|---|---|---|
| 2021-07+ | ~40-50 symbols | Most major alts listed |
| 2021-01 to 06 | ~10-20 symbols | Sparse, many not listed |
| 2020 | ~5-10 symbols | Only majors (BTC, ETH, BNB) |
| 2019 | ~5 symbols | Very sparse |
| 2017-2018 | 3-5 symbols | Only BTC, ETH, BNB |
⚠️ Binance Launch Date
- Binance launched in July 2017
- Data before 2017-07-01 simply doesn't exist
- Recommended start: 2021-07-01 (reliable coverage)
Expected Output
After successful 5-year backfill:
vbt_cache_klines/
├── 2021-07-01.parquet ← NEW
├── 2021-07-02.parquet ← NEW
├── ... (914 new files)
├── 2023-12-31.parquet ← NEW
├── 2024-01-01.parquet ← existing
├── ... (existing files)
└── 2026-03-05.parquet ← existing
Total: ~1,710 parquets spanning 2021-07-01 to 2026-03-05
Troubleshooting
"Disk full" during fetch
# Stop the script (Ctrl-C), then:
# Option 1: Delete klines_cache for completed dates
# Option 2: Free up space elsewhere
# Then re-run - it will resume from where it stopped
"Rate limited" errors
- The script handles this automatically (sleeps 60s)
- If persistent, wait an hour and re-run
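The automatic handling described above is a fixed-delay retry. A minimal sketch of that pattern follows; `RateLimited` and `call_with_retry` are hypothetical names, the 60-second wait is the figure quoted above, and the retry count is an assumption.

```python
import time


class RateLimited(Exception):
    """Hypothetical error raised by a fetch function when the API throttles it."""


def call_with_retry(fetch, retries=3, wait_seconds=60.0, sleep=time.sleep):
    """Call fetch(); on RateLimited, wait and try again, up to `retries` extra times."""
    for attempt in range(retries + 1):
        try:
            return fetch()
        except RateLimited:
            if attempt == retries:
                raise  # still throttled after all retries; let the caller decide
            sleep(wait_seconds)
```

The `sleep` parameter is injectable so the behaviour can be exercised in tests without actually waiting a minute per retry.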
Missing symbols for early dates
- Expected behavior: Many alts weren't listed before 2021
- The eigenvalue computation handles this (uses available subset)
- Documented in the final report
Script crashes on specific date
# Re-run with --date to skip problematic date
python historical_klines_backfiller.py --date 2022-06-15
Post-Backfill Cleanup (Optional)
After validation passes, you can reclaim disk space:
# Delete klines_cache (raw OHLCV) - 100-150 GB
rmdir /s "C:\Users\Lenovo\Documents\- Dolphin NG Backfill\klines_cache"
# Delete arrow_klines intermediate - 20 GB
rmdir /s "C:\Users\Lenovo\Documents\- Dolphin NG Backfill\backfilled_data\arrow_klines"
# Keep only vbt_cache_klines/ (final output)
⚠️ Only delete after validating the parquets!
Validation Checklist
After running, verify:
- Total parquets: about 1,710 files (914 new + 796 existing)
- Date range: 2021-07-01 to 2026-03-05
- No gaps in the backfilled period (2021-07-01 to 2023-12-31)
- Sample files have valid vel_div values (non-zero std)
- BTCUSDT price column present in all files
Run: python klines_backfill_5y_10y.py --validate
Summary of Commands
# FULL AUTOMATED RUN (recommended)
python klines_backfill_5y_10y.py --full-5y
# OR STEP BY STEP
python klines_backfill_5y_10y.py --preflight # Check first
python klines_backfill_5y_10y.py --backfill-5y # Fetch + Compute
python klines_backfill_5y_10y.py --convert # To Parquet
python klines_backfill_5y_10y.py --validate # Verify
Ready to run? Start with python klines_backfill_5y_10y.py --plan to confirm, then run python klines_backfill_5y_10y.py --full-5y.