DOLPHIN/prod/docs/KLINES_5Y_10Y_DATASET_README.md
hjnormey 01c19662cb initial: import DOLPHIN baseline 2026-04-21 from dolphinng5_predict working tree
2026-04-21 16:58:38 +02:00


DOLPHIN NG5 - 5 Year / 10 Year Klines Dataset Builder

Quick Summary

| Aspect | Details |
|---|---|
| Current State | 796 days of data (2021-06-15 to 2026-03-05) |
| Gap | 929 missing days (2021-06-16 to 2023-12-31) |
| Target | 5-year dataset: 2021-01-01 to 2026-03-05 (~1,826 days) |
| Disk Required | 150 GB free for 5-year, 400 GB for 10-year |
| Your Disk | 166 GB free (sufficient for 5-year) |
| Runtime | 10-18 hours for 5-year backfill |

Pre-Flight Status

Disk Space

Free: 166.4 GB / Total: 951.6 GB
Status: SUFFICIENT for 5-year extension

Current Data Coverage

Parquet files: 796
Parquet range: 2021-06-15 to 2026-03-05
By year:
  2021: 1 day    ← Only 1 day!
  2024: 366 days ← Complete
  2025: 365 days ← Complete
  2026: 64 days  ← Partial

Arrow directories: 796 (matches parquet)
Klines cache: 0.54 GB (small - mostly fetched)

The Gap

Missing: 2021-06-16 to 2023-12-31 (929 days)
This covers the second half of 2021 plus all of 2022 and 2023, and is the period that needs backfilling
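
The day counts quoted in this README can be checked with inclusive date arithmetic. A minimal sketch follows; the helper name `inclusive_days` is illustrative and not part of the backfill scripts.

```python
from datetime import date


def inclusive_days(start: date, end: date) -> int:
    """Number of calendar days from start to end, counting both endpoints."""
    return (end - start).days + 1


# Full gap between the lone 2021 file and the first 2024 file.
gap_days = inclusive_days(date(2021, 6, 16), date(2023, 12, 31))

# Window actually fetched when starting from the recommended 2021-07-01.
backfill_days = inclusive_days(date(2021, 7, 1), date(2023, 12, 31))

print(gap_days, backfill_days)  # 929 914
```

The 15-day difference is 2021-06-16 through 2021-06-30, which the recommended start date leaves unfilled.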

How to Run

Option 1: Automated Script (Recommended)

# Step 0: Review the plan
python klines_backfill_5y_10y.py --plan

# Step 1: Run pre-flight checks
python klines_backfill_5y_10y.py --preflight

# Step 2: Run complete 5-year backfill (ALL PHASES)
# ⚠️  This takes 10-18 hours! Run in a persistent session.
python klines_backfill_5y_10y.py --full-5y

# OR run step by step:
python klines_backfill_5y_10y.py --backfill-5y   # Fetch + Compute (8-16 hours)
python klines_backfill_5y_10y.py --convert       # Convert to Parquet (30-60 min)
python klines_backfill_5y_10y.py --validate      # Validate output (5-10 min)
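
If you prefer to drive the step-by-step path from Python, the sketch below runs the four documented flags in order and stops at the first failure. It assumes `klines_backfill_5y_10y.py` is in the current directory and exits with a non-zero code on failure, which this README does not confirm. `build_commands` and `run_steps` are illustrative names.

```python
import subprocess
import sys

SCRIPT = "klines_backfill_5y_10y.py"
# Flags exactly as listed in this README, in execution order.
STEPS = ["--preflight", "--backfill-5y", "--convert", "--validate"]


def build_commands(script: str = SCRIPT) -> list[list[str]]:
    """One argv list per step, using the interpreter running this code."""
    return [[sys.executable, script, flag] for flag in STEPS]


def run_steps(script: str = SCRIPT) -> None:
    """Run each step in order; a non-zero exit raises CalledProcessError."""
    for cmd in build_commands(script):
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)

# Call run_steps() from a persistent session to start the pipeline.
```

Because each phase is described as idempotent, re-running `run_steps()` after a failure should pick up where the failed step stopped.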

Option 2: Batch Script (Windows)

# Run the batch file (double-click or run in CMD)
run_5y_klines_backfill.bat

Option 3: Manual Commands

# PHASE 1: Fetch klines (6-12 hours)
cd "C:\Users\Lenovo\Documents\- Dolphin NG Backfill"
python historical_klines_backfiller.py --fetch --start 2021-07-01 --end 2023-12-31

# PHASE 2: Compute eigenvalues (2-4 hours)
python historical_klines_backfiller.py --compute --start 2021-07-01 --end 2023-12-31

# PHASE 3: Convert to Parquet (30-60 minutes)
cd "C:\Users\Lenovo\Documents\- DOLPHIN NG HD HCM TSF Predict"
python ng5_arrow_to_vbt_cache.py --all

# PHASE 4: Validate
python klines_backfill_5y_10y.py --validate

What Each Phase Does

Phase 1: Fetch Klines (6-12 hours)

  • Downloads 1-minute OHLCV from Binance public API
  • 50 symbols × 914 days = ~45,700 symbol-days
  • Rate limited to 1100 req/min (under Binance 1200 limit)
  • Cached to klines_cache/{symbol}/{YYYY-MM-DD}.parquet
  • Idempotent: Already-fetched dates are skipped
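
The idempotent skip can be sketched as a check against the cache layout `klines_cache/{symbol}/{YYYY-MM-DD}.parquet`. This illustrates the idea and is not the backfiller's actual code. `cache_path` and `pending_tasks` are hypothetical helpers, and the sketch treats any existing file as complete, whereas a real implementation might also reject truncated files.

```python
from datetime import date, timedelta
from pathlib import Path


def daterange(start: date, end: date):
    """Yield every date from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def cache_path(root, symbol: str, day: date) -> Path:
    """Cache location for one symbol-day, following the README's layout."""
    return Path(root) / symbol / f"{day.isoformat()}.parquet"


def pending_tasks(root, symbols, start: date, end: date):
    """List the (symbol, day) pairs whose cache file does not exist yet."""
    return [
        (symbol, day)
        for day in daterange(start, end)
        for symbol in symbols
        if not cache_path(root, symbol, day).exists()
    ]
```

Calling `pending_tasks("klines_cache", symbols, date(2021, 7, 1), date(2023, 12, 31))` after an interrupted run would return only the symbol-days still to fetch.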

Phase 2: Compute Eigenvalues (2-4 hours)

  • Reads cached klines
  • Computes rolling correlation eigenvalues:
    • w50, w150, w300, w750 windows (1-minute bars)
    • Velocities, instabilities, vel_div
  • Writes Arrow files: arrow_klines/{date}/scan_{N:06d}_kbf_{HHMM}.arrow
  • Idempotent: Already-processed dates are skipped
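
To make "rolling correlation eigenvalues" concrete, here is a dependency-free sketch that tracks the largest eigenvalue of the symbol correlation matrix over a sliding window of bars. Several things are assumptions, because this README does not specify them: the inputs are per-symbol return series, only the largest eigenvalue is tracked, and power iteration stands in for a proper eigen-solver (in practice `numpy.linalg.eigvalsh` would give the full spectrum). All function names are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def correlation_matrix(series):
    """Square matrix of pairwise correlations with ones on the diagonal."""
    k = len(series)
    return [[1.0 if i == j else pearson(series[i], series[j]) for j in range(k)]
            for i in range(k)]


def largest_eigenvalue(matrix, iterations=200):
    """Power iteration; adequate for a symmetric positive semi-definite matrix."""
    k = len(matrix)
    v = [float(i + 1) for i in range(k)]  # non-uniform starting vector
    norm = math.sqrt(sum(c * c for c in v))
    v = [c / norm for c in v]
    lam = 0.0
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(k)) for i in range(k)]
        lam = math.sqrt(sum(c * c for c in w))
        if lam == 0:
            return 0.0
        v = [c / lam for c in w]
    return lam


def rolling_top_eigenvalue(returns_by_symbol, window):
    """Largest correlation eigenvalue for each full window of bars."""
    n_bars = min(len(r) for r in returns_by_symbol)
    out = []
    for end in range(window, n_bars + 1):
        chunk = [r[end - window:end] for r in returns_by_symbol]
        out.append(largest_eigenvalue(correlation_matrix(chunk)))
    return out
```

With `window=50` this corresponds to the w50 setting on 1-minute bars. For k symbols that move in lockstep the value approaches k, and for uncorrelated symbols it stays near 1.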

Phase 3: Convert to Parquet (30-60 minutes)

  • Reads Arrow files
  • Converts to VBT cache format
  • Output: vbt_cache_klines/{YYYY-MM-DD}.parquet
  • Idempotent: Already-converted dates are skipped

Phase 4: Validation (5-10 minutes)

  • Counts total parquet files
  • Checks date range coverage
  • Validates sample files have valid data
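
The date-coverage check can be sketched as a comparison between the expected calendar and the dates parsed from the parquet filenames. This is an illustration under the assumption that output files are named `YYYY-MM-DD.parquet` as shown for Phase 3; `dates_from_filenames` and `missing_dates` are hypothetical helpers, not the validator's own functions.

```python
from datetime import date, timedelta


def dates_from_filenames(names):
    """Parse 'YYYY-MM-DD.parquet' names into a set of dates; ignore other files."""
    found = set()
    for name in names:
        stem, _, ext = name.rpartition(".")
        if ext != "parquet":
            continue
        try:
            found.add(date.fromisoformat(stem))
        except ValueError:
            continue  # a parquet file whose name is not a date
    return found


def missing_dates(names, start: date, end: date):
    """Return every date in [start, end] that has no matching parquet file."""
    have = dates_from_filenames(names)
    day, missing = start, []
    while day <= end:
        if day not in have:
            missing.append(day)
        day += timedelta(days=1)
    return missing
```

Passing `os.listdir("vbt_cache_klines")` as `names` with the target start and end dates gives an empty list when coverage is complete.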

Important Notes

⏱️ Very Long Runtime

  • Total: 10-18 hours for 5-year backfill
  • Phase 1 (fetch) is the bottleneck - depends on Binance API rate limits
  • Run in a persistent session (tmux on Linux, a CMD window that stays open on Windows)
  • Safe to interrupt: the script is idempotent, so re-running it resumes where it stopped

💾 Disk Management

  • klines_cache grows to ~100-150 GB during fetch
  • Can be deleted after conversion to free space
  • arrow_klines intermediate: ~20 GB
  • Final parquets: ~3 GB additional
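
Adding up the estimates in this list gives a rough peak footprint if nothing is deleted until the end. The sketch below does that sum and reads the current free space with `shutil.disk_usage`. The constant names are illustrative, and treating the README's "GB" as GiB is an assumption.

```python
import shutil

GB = 1024 ** 3  # assumption: the README's "GB" figures are treated as GiB

# Size estimates (low, high) in GB, taken from the bullets in this section.
KLINES_CACHE = (100, 150)
ARROW_KLINES = (20, 20)
FINAL_PARQUET = (3, 3)


def peak_usage_gb():
    """Worst case: all three stages exist on disk at the same time."""
    parts = (KLINES_CACHE, ARROW_KLINES, FINAL_PARQUET)
    return sum(p[0] for p in parts), sum(p[1] for p in parts)


def free_gb(path="."):
    """Free space on the volume holding `path`."""
    return shutil.disk_usage(path).free / GB


low, high = peak_usage_gb()
print(f"Estimated peak: {low}-{high} GB, free now: {free_gb():.1f} GB")
```

The range works out to 123-173 GB. The upper end is above the 166 GB reported free, so if the cache grows toward the high estimate, the "Disk full" steps under Troubleshooting apply.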

📊 Symbol Coverage by Year

| Period | Expected Coverage | Notes |
|---|---|---|
| 2021-07+ | ~40-50 symbols | Most major alts listed |
| 2021-01 to 06 | ~10-20 symbols | Sparse, many not listed |
| 2020 | ~5-10 symbols | Only majors (BTC, ETH, BNB) |
| 2019 | ~5 symbols | Very sparse |
| 2017-2018 | 3-5 symbols | Only BTC, ETH, BNB |

⚠️ Binance Launch Date

  • Binance launched in July 2017
  • Data before 2017-07-01 simply doesn't exist
  • Recommended start: 2021-07-01 (reliable coverage)

Expected Output

After successful 5-year backfill:

vbt_cache_klines/
├── 2021-07-01.parquet  ← NEW
├── 2021-07-02.parquet  ← NEW
├── ... (914 new files)
├── 2023-12-31.parquet  ← NEW
├── 2024-01-01.parquet  ← existing
├── ... (existing files)
└── 2026-03-05.parquet  ← existing

Total: ~1,710 parquets spanning 2021-07-01 to 2026-03-05

Troubleshooting

"Disk full" during fetch

# Stop the script (Ctrl-C), then:
# Option 1: Delete klines_cache for completed dates
# Option 2: Free up space elsewhere
# Then re-run - it will resume from where it stopped

"Rate limited" errors

  • The script handles this automatically (sleeps 60s)
  • If persistent, wait an hour and re-run
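
The sleep-and-retry behaviour described here can be sketched as a small wrapper. The script's real retry logic is not shown in this README, so everything except the 60-second pause is an assumption: `RateLimited` is a hypothetical exception that `fetch` would raise when the API signals rate limiting (assumed to be an HTTP 429 response), and `fetch_with_backoff` is an illustrative name.

```python
import time


class RateLimited(Exception):
    """Hypothetical error raised by `fetch` when the API reports rate limiting."""


def fetch_with_backoff(fetch, max_retries=5, pause_s=60.0, sleep=time.sleep):
    """Call fetch(); on RateLimited, wait pause_s seconds and try again."""
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except RateLimited:
            if attempt == max_retries:
                raise  # give up; the caller can re-run later
            sleep(pause_s)
```

The `sleep` parameter is injectable so the wrapper can be exercised in tests without actually waiting a minute per retry.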

Missing symbols for early dates

  • Expected behavior: Many alts weren't listed before 2021
  • The eigenvalue computation handles this (uses available subset)
  • Documented in the final report

Script crashes on specific date

# Re-run with --date to skip problematic date
python historical_klines_backfiller.py --date 2022-06-15

Post-Backfill Cleanup (Optional)

After validation passes, you can reclaim disk space:

# Delete klines_cache (raw OHLCV) - 100-150 GB
rmdir /s "C:\Users\Lenovo\Documents\- Dolphin NG Backfill\klines_cache"

# Delete arrow_klines intermediate - 20 GB
rmdir /s "C:\Users\Lenovo\Documents\- Dolphin NG Backfill\backfilled_data\arrow_klines"

# Keep only vbt_cache_klines/ (final output)

⚠️ Only delete after validating the parquets!


Validation Checklist

After running, verify:

  • Total parquets: ~1,700+ files
  • Date range: 2021-07-01 to 2026-03-05
  • No gaps in 2022-2023 period
  • Sample files have valid vel_div values (non-zero std)
  • BTCUSDT price column present in all files

Run: python klines_backfill_5y_10y.py --validate


Summary of Commands

# FULL AUTOMATED RUN (recommended)
python klines_backfill_5y_10y.py --full-5y

# OR STEP BY STEP
python klines_backfill_5y_10y.py --preflight     # Check first
python klines_backfill_5y_10y.py --backfill-5y   # Fetch + Compute
python klines_backfill_5y_10y.py --convert       # To Parquet
python klines_backfill_5y_10y.py --validate      # Verify

Ready to run? Start with python klines_backfill_5y_10y.py --plan to confirm, then run python klines_backfill_5y_10y.py --full-5y.