# DOLPHIN NG5 - 5 Year / 10 Year Klines Dataset Builder

## Quick Summary

| Aspect | Details |
|--------|---------|
| **Current State** | 796 days of data (2021-06-15 to 2026-03-05) |
| **Gap** | 929 missing days (2021-06-16 to 2023-12-31) |
| **Target** | 5-year dataset: 2021-01-01 to 2026-03-05 (~1,826 days) |
| **Disk Required** | 150 GB free for 5-year, 400 GB for 10-year |
| **Your Disk** | 166 GB free ✅ (sufficient for 5-year) |
| **Runtime** | 10-18 hours for 5-year backfill |

---

## Pre-Flight Status ✅

### Disk Space

```
Free: 166.4 GB / Total: 951.6 GB
Status: SUFFICIENT for 5-year extension
```

### Current Data Coverage

```
Parquet files: 796
Parquet range: 2021-06-15 to 2026-03-05

By year:
  2021: 1 day    ← Only 1 day!
  2024: 366 days ← Complete
  2025: 365 days ← Complete
  2026: 64 days  ← Partial

Arrow directories: 796 (matches parquet)
Klines cache: 0.54 GB (small - mostly fetched)
```

### The Gap

```
Missing: 2021-06-16 to 2023-12-31 (929 days)
This covers the second half of 2021 and all of 2022-2023, which needs backfilling
```

---

## How to Run

### Option 1: Python Control Script (Recommended)

```bash
# Step 0: Review the plan
python klines_backfill_5y_10y.py --plan

# Step 1: Run pre-flight checks
python klines_backfill_5y_10y.py --preflight

# Step 2: Run complete 5-year backfill (ALL PHASES)
# ⚠️ This takes 10-18 hours! Run in a persistent session.
python klines_backfill_5y_10y.py --full-5y

# OR run step by step:
python klines_backfill_5y_10y.py --backfill-5y   # Fetch + Compute (8-16 hours)
python klines_backfill_5y_10y.py --convert       # Convert to Parquet (30-60 min)
python klines_backfill_5y_10y.py --validate      # Validate output (5-10 min)
```

### Option 2: Batch Script (Windows)

```bash
# Run the batch file (double-click or run in CMD)
run_5y_klines_backfill.bat
```

### Option 3: Manual Commands

```bash
# PHASE 1: Fetch klines (6-12 hours)
cd "C:\Users\Lenovo\Documents\- Dolphin NG Backfill"
python historical_klines_backfiller.py --fetch --start 2021-07-01 --end 2023-12-31

# PHASE 2: Compute eigenvalues (2-4 hours)
python historical_klines_backfiller.py --compute --start 2021-07-01 --end 2023-12-31

# PHASE 3: Convert to Parquet (30-60 minutes)
cd "C:\Users\Lenovo\Documents\- DOLPHIN NG HD HCM TSF Predict"
python ng5_arrow_to_vbt_cache.py --all

# PHASE 4: Validate
python klines_backfill_5y_10y.py --validate
```

---

## What Each Phase Does

### Phase 1: Fetch Klines (6-12 hours)

- Downloads 1-minute OHLCV from Binance public API
- 50 symbols × 914 days = ~45,700 symbol-days
- Rate limited to 1100 req/min (under Binance 1200 limit)
- Cached to `klines_cache/{symbol}/{YYYY-MM-DD}.parquet`
- **Idempotent**: Already-fetched dates are skipped

### Phase 2: Compute Eigenvalues (2-4 hours)

- Reads cached klines
- Computes rolling correlation eigenvalues:
  - w50, w150, w300, w750 windows (1-minute bars)
  - Velocities, instabilities, vel_div
- Writes Arrow files: `arrow_klines/{date}/scan_{N:06d}_kbf_{HHMM}.arrow`
- **Idempotent**: Already-processed dates are skipped

### Phase 3: Convert to Parquet (30-60 minutes)

- Reads Arrow files
- Converts to VBT cache format
- Output: `vbt_cache_klines/{YYYY-MM-DD}.parquet`
- **Idempotent**: Already-converted dates are skipped

### Phase 4: Validation (5-10 minutes)

- Counts total parquet files
- Checks date range coverage
- Validates sample files have valid data

---

## Important Notes

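Every phase above is described as idempotent, which is what makes an interrupted run resumable. The sketch below illustrates one way such a skip-if-exists check could look for Phase 1. It is not taken from `historical_klines_backfiller.py`: the function names `date_range` and `pending_symbol_days` are hypothetical, and the only detail borrowed from this document is the cache layout `klines_cache/{symbol}/{YYYY-MM-DD}.parquet`.

```python
from datetime import date, timedelta
from pathlib import Path


def date_range(start: date, end: date):
    """Yield every calendar day from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def pending_symbol_days(cache_root: Path, symbols, start: date, end: date):
    """Return (symbol, day) pairs that still have no cache file.

    Assumption: the layout is cache_root/{symbol}/{YYYY-MM-DD}.parquet,
    and a file that exists counts as done, so it is skipped on a re-run.
    """
    todo = []
    for symbol in symbols:
        for day in date_range(start, end):
            target = cache_root / symbol / f"{day.isoformat()}.parquet"
            if not target.exists():
                todo.append((symbol, day))
    return todo
```

With this approach a re-run only touches the pairs returned by `pending_symbol_days`. One design caveat: an existence check alone would treat a half-written file from an interrupted run as complete, so a real implementation would probably write to a temporary name and rename on success.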
### ⏱️ Very Long Runtime

- **Total: 10-18 hours** for 5-year backfill
- **Phase 1 (fetch) is the bottleneck** - depends on Binance API rate limits
- Run in a persistent session (TMUX on Linux, persistent CMD on Windows)
- **Safe to interrupt**: The script is idempotent, just re-run to resume

### 💾 Disk Management

- **klines_cache** grows to ~100-150 GB during fetch
  - Can be deleted after conversion to free space
- **arrow_klines** intermediate: ~20 GB
- **Final parquets**: ~3 GB additional

### 📊 Symbol Coverage by Year

| Period | Expected Coverage | Notes |
|--------|------------------|-------|
| 2021-07+ | ~40-50 symbols | Most major alts listed |
| 2021-01 to 06 | ~10-20 symbols | Sparse, many not listed |
| 2020 | ~5-10 symbols | Only majors (BTC, ETH, BNB) |
| 2019 | ~5 symbols | Very sparse |
| 2017-2018 | 3-5 symbols | Only BTC, ETH, BNB |

### ⚠️ Binance Launch Date

- Binance launched in **July 2017**
- Data before 2017-07-01 simply doesn't exist
- Recommended start: **2021-07-01** (reliable coverage)

---

## Expected Output

After successful 5-year backfill:

```
vbt_cache_klines/
├── 2021-07-01.parquet ← NEW
├── 2021-07-02.parquet ← NEW
├── ... (914 new files)
├── 2023-12-31.parquet ← NEW
├── 2024-01-01.parquet ← existing
├── ... (existing files)
└── 2026-03-05.parquet ← existing

Total: ~1,710 parquets spanning 2021-07-01 to 2026-03-05
```

---

## Troubleshooting

### "Disk full" during fetch

```bash
# Stop the script (Ctrl-C), then:
# Option 1: Delete klines_cache for completed dates
# Option 2: Free up space elsewhere
# Then re-run - it will resume from where it stopped
```

### "Rate limited" errors

- The script handles this automatically (sleeps 60s)
- If persistent, wait an hour and re-run

### Missing symbols for early dates

- **Expected behavior**: Many alts weren't listed before 2021
- The eigenvalue computation handles this (uses available subset)
- Documented in the final report

### Script crashes on specific date

```bash
# Re-run with --date to skip problematic date
python historical_klines_backfiller.py --date 2022-06-15
```

---

## Post-Backfill Cleanup (Optional)

After validation passes, you can reclaim disk space:

```bash
# Delete klines_cache (raw OHLCV) - 100-150 GB
rmdir /s "C:\Users\Lenovo\Documents\- Dolphin NG Backfill\klines_cache"

# Delete arrow_klines intermediate - 20 GB
rmdir /s "C:\Users\Lenovo\Documents\- Dolphin NG Backfill\backfilled_data\arrow_klines"

# Keep only vbt_cache_klines/ (final output)
```

⚠️ **Only delete after validating the parquets!**

---

## Validation Checklist

After running, verify:

- [ ] Total parquets: ~1,700+ files
- [ ] Date range: 2021-07-01 to 2026-03-05
- [ ] No gaps in 2022-2023 period
- [ ] Sample files have valid vel_div values (non-zero std)
- [ ] BTCUSDT price column present in all files

Run: `python klines_backfill_5y_10y.py --validate`

---

## Summary of Commands

```bash
# FULL AUTOMATED RUN (recommended)
python klines_backfill_5y_10y.py --full-5y

# OR STEP BY STEP
python klines_backfill_5y_10y.py --preflight     # Check first
python klines_backfill_5y_10y.py --backfill-5y   # Fetch + Compute
python klines_backfill_5y_10y.py --convert       # To Parquet
python klines_backfill_5y_10y.py --validate      # Verify
```

**Ready to run?** Start with `python klines_backfill_5y_10y.py --plan` to confirm, then run `python klines_backfill_5y_10y.py --full-5y`.
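
For an independent spot check of the "date range" and "no gaps" items in the validation checklist, a small standalone script can list the calendar days that have no parquet file. This is only a sketch, not the logic behind `--validate`: the helper names `dates_from_filenames` and `missing_dates` are hypothetical, and it assumes the `vbt_cache_klines/{YYYY-MM-DD}.parquet` naming described in Phase 3 and that it is run from the directory containing `vbt_cache_klines/`.

```python
from datetime import date, timedelta
from pathlib import Path


def dates_from_filenames(names):
    """Parse 'YYYY-MM-DD.parquet' names into a set of dates; ignore anything else."""
    found = set()
    for name in names:
        stem, _, ext = name.rpartition(".")
        if ext != "parquet":
            continue
        try:
            found.add(date.fromisoformat(stem))
        except ValueError:
            continue  # not a date-named file
    return found


def missing_dates(present, start, end):
    """Return every day in [start, end] that is absent from `present`, in order."""
    gaps = []
    day = start
    while day <= end:
        if day not in present:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps


if __name__ == "__main__":
    # Assumption: the output directory sits in the current working directory.
    cache_dir = Path("vbt_cache_klines")
    names = [p.name for p in cache_dir.glob("*.parquet")] if cache_dir.is_dir() else []
    gaps = missing_dates(dates_from_filenames(names), date(2021, 7, 1), date(2026, 3, 5))
    print(f"{len(names)} parquet files found, {len(gaps)} days missing in range")
    for day in gaps[:20]:
        print("  missing:", day.isoformat())
```

The range 2021-07-01 to 2026-03-05 contains 1,709 calendar days, so a complete backfill should report zero missing days; the ~1,710 total quoted above also counts the existing 2021-06-15 file, which falls outside this range.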