# DOLPHIN NG5 - 5 Year / 10 Year Klines Dataset Builder
## Quick Summary
| Aspect | Details |
|--------|---------|
| **Current State** | 796 days of data (2021-06-15 to 2026-03-05) |
| **Gap** | 929 missing days (2021-06-16 to 2023-12-31) |
| **Target** | 5-year dataset: 2021-07-01 to 2026-03-05 (~1,710 days) |
| **Disk Required** | 150 GB free for 5-year, 400 GB for 10-year |
| **Your Disk** | 166 GB free ✅ (sufficient for 5-year) |
| **Runtime** | 10-18 hours for 5-year backfill |
---
## Pre-Flight Status ✅
### Disk Space
```
Free: 166.4 GB / Total: 951.6 GB
Status: SUFFICIENT for 5-year extension
```
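
The same check can be reproduced with the standard library. This is a minimal sketch, not the logic of `klines_backfill_5y_10y.py --preflight`: the thresholds come from the Quick Summary table, while the function name `has_enough_space` and the choice of 1 GB = 10^9 bytes are assumptions made for illustration.

```python
import shutil

# Thresholds from the Quick Summary table (assumption: 1 GB = 10**9 bytes).
REQUIRED_GB = {"5y": 150, "10y": 400}


def has_enough_space(free_bytes: int, target: str = "5y") -> bool:
    """Return True if free_bytes covers the requirement for the given target."""
    return free_bytes / 1e9 >= REQUIRED_GB[target]


if __name__ == "__main__":
    usage = shutil.disk_usage(".")  # drive that holds the current directory
    print(f"Free: {usage.free / 1e9:.1f} GB / Total: {usage.total / 1e9:.1f} GB")
    verdict = "SUFFICIENT" if has_enough_space(usage.free, "5y") else "INSUFFICIENT"
    print(f"Status: {verdict} for 5-year extension")
```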
### Current Data Coverage
```
Parquet files: 796
Parquet range: 2021-06-15 to 2026-03-05
By year:
2021: 1 days ← Only 1 day!
2024: 366 days ← Complete
2025: 365 days ← Complete
2026: 64 days ← Partial
Arrow directories: 796 (matches parquet)
Klines cache: 0.54 GB (small - mostly fetched)
```
### The Gap
```
Missing: 2021-06-16 to 2023-12-31 (929 days)
This is the mid-2021 through 2023 period that needs backfilling
```
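
The day counts can be checked with plain date arithmetic. The sketch below compares the full gap with the 2021-07-01 start date used by the commands in "How to Run"; the helper name `inclusive_days` is illustrative.

```python
from datetime import date


def inclusive_days(start: date, end: date) -> int:
    """Number of calendar days from start to end, counting both endpoints."""
    return (end - start).days + 1


gap_days = inclusive_days(date(2021, 6, 16), date(2023, 12, 31))
backfill_days = inclusive_days(date(2021, 7, 1), date(2023, 12, 31))

print(gap_days)                  # 929 (the full gap)
print(backfill_days)             # 914 (what a 2021-07-01 start actually fetches)
print(gap_days - backfill_days)  # 15  (2021-06-16 to 2021-06-30 stays unfilled)
```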
---
## How to Run
### Option 1: Python Control Script (Recommended)
```bash
# Step 0: Review the plan
python klines_backfill_5y_10y.py --plan
# Step 1: Run pre-flight checks
python klines_backfill_5y_10y.py --preflight
# Step 2: Run complete 5-year backfill (ALL PHASES)
# ⚠️ This takes 10-18 hours! Run in a persistent session.
python klines_backfill_5y_10y.py --full-5y
# OR run step by step:
python klines_backfill_5y_10y.py --backfill-5y # Fetch + Compute (8-16 hours)
python klines_backfill_5y_10y.py --convert # Convert to Parquet (30-60 min)
python klines_backfill_5y_10y.py --validate # Validate output (5-10 min)
```
### Option 2: Batch Script (Windows)
```bat
REM Run the batch file (double-click or run in CMD)
run_5y_klines_backfill.bat
```
### Option 3: Manual Commands
```bash
# PHASE 1: Fetch klines (6-12 hours)
cd "C:\Users\Lenovo\Documents\- Dolphin NG Backfill"
python historical_klines_backfiller.py --fetch --start 2021-07-01 --end 2023-12-31
# PHASE 2: Compute eigenvalues (2-4 hours)
python historical_klines_backfiller.py --compute --start 2021-07-01 --end 2023-12-31
# PHASE 3: Convert to Parquet (30-60 minutes)
cd "C:\Users\Lenovo\Documents\- DOLPHIN NG HD HCM TSF Predict"
python ng5_arrow_to_vbt_cache.py --all
# PHASE 4: Validate
python klines_backfill_5y_10y.py --validate
```
---
## What Each Phase Does
### Phase 1: Fetch Klines (6-12 hours)
- Downloads 1-minute OHLCV from Binance public API
- 50 symbols × 914 days = ~45,700 symbol-days
- Rate limited to 1100 req/min (under Binance 1200 limit)
- Cached to `klines_cache/{symbol}/{YYYY-MM-DD}.parquet`
- **Idempotent**: Already-fetched dates are skipped
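
The fetch plan above can be sketched as follows. This is an illustration under stated assumptions, not the code of `historical_klines_backfiller.py`: the function names are hypothetical, and the only external fact relied on is that Binance's `/api/v3/klines` endpoint returns at most 1000 bars per request, so one day of 1-minute bars (1440) needs at least two requests. The sketch only plans the work and makes no network calls.

```python
from datetime import date, datetime, timedelta, timezone
from pathlib import Path
from typing import Iterator

MAX_BARS_PER_REQUEST = 1000  # documented maximum for the klines `limit` parameter
MS_PER_MINUTE = 60_000
BARS_PER_DAY = 1440


def cache_path(root: Path, symbol: str, day: date) -> Path:
    """Cache layout from the list above: klines_cache/{symbol}/{YYYY-MM-DD}.parquet."""
    return root / symbol / f"{day.isoformat()}.parquet"


def request_windows(day: date) -> list[tuple[int, int]]:
    """Split one UTC day into (startTime, endTime) millisecond windows of <= 1000 bars."""
    midnight = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    start = int(midnight.timestamp() * 1000)
    end = start + BARS_PER_DAY * MS_PER_MINUTE  # exclusive end of the day
    windows = []
    while start < end:
        stop = min(start + MAX_BARS_PER_REQUEST * MS_PER_MINUTE, end)
        windows.append((start, stop - 1))  # stop - 1 keeps adjacent windows from overlapping
        start = stop
    return windows


def pending_days(root: Path, symbol: str, first: date, last: date) -> Iterator[date]:
    """Idempotency: yield only the days that have no cached file yet."""
    day = first
    while day <= last:
        if not cache_path(root, symbol, day).exists():
            yield day
        day += timedelta(days=1)
```

Checking for the cached file before fetching is what lets an interrupted run resume without downloading the same symbol-day twice.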
### Phase 2: Compute Eigenvalues (2-4 hours)
- Reads cached klines
- Computes rolling correlation eigenvalues:
  - w50, w150, w300, w750 windows (1-minute bars)
  - Velocities, instabilities, vel_div
- Writes Arrow files: `arrow_klines/{date}/scan_{N:06d}_kbf_{HHMM}.arrow`
- **Idempotent**: Already-processed dates are skipped
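
To make "rolling correlation eigenvalues" concrete, here is a minimal NumPy sketch. It is an assumption-laden illustration, not the backfiller's algorithm: this README does not say which eigenvalues are stored, whether correlations are taken on returns or prices, or how velocities, instabilities, and `vel_div` are defined. The sketch only computes the largest eigenvalue of the correlation matrix of per-symbol returns over a trailing window, and the function name `rolling_top_eigenvalue` is hypothetical.

```python
import numpy as np

WINDOWS = (50, 150, 300, 750)  # window lengths in 1-minute bars, from the list above


def rolling_top_eigenvalue(returns: np.ndarray, window: int) -> np.ndarray:
    """Largest correlation-matrix eigenvalue over a trailing window.

    returns: array of shape (n_bars, n_symbols) with n_symbols >= 2.
    Output:  array of shape (n_bars,), NaN until the first full window.
    A symbol with constant returns inside a window gives NaN correlations,
    so real data would need that case filtered out first.
    """
    n_bars = returns.shape[0]
    out = np.full(n_bars, np.nan)
    for end in range(window, n_bars + 1):
        # Columns are symbols, so rowvar=False.
        corr = np.corrcoef(returns[end - window:end], rowvar=False)
        # eigvalsh returns eigenvalues of a symmetric matrix in ascending order.
        out[end - 1] = np.linalg.eigvalsh(corr)[-1]
    return out
```

When all symbols move together, the largest eigenvalue approaches the number of symbols; when they move independently, it stays close to 1.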
### Phase 3: Convert to Parquet (30-60 minutes)
- Reads Arrow files
- Converts to VBT cache format
- Output: `vbt_cache_klines/{YYYY-MM-DD}.parquet`
- **Idempotent**: Already-converted dates are skipped
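
The skip-if-converted behaviour can be pictured as a set difference between the two directories. A minimal sketch, assuming the `{date}` directory names under `arrow_klines/` use the same `YYYY-MM-DD` form as the parquet filenames; the function name is hypothetical and this is not the code of `ng5_arrow_to_vbt_cache.py`.

```python
from pathlib import Path


def dates_needing_conversion(arrow_root: Path, parquet_root: Path) -> list[str]:
    """Dates that have an Arrow directory but no converted parquet file yet."""
    arrow_dates = {p.name for p in arrow_root.iterdir() if p.is_dir()}
    converted = {p.stem for p in parquet_root.glob("*.parquet")}
    return sorted(arrow_dates - converted)


if __name__ == "__main__":
    todo = dates_needing_conversion(Path("arrow_klines"), Path("vbt_cache_klines"))
    print(f"{len(todo)} dates still to convert")
```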
### Phase 4: Validation (5-10 minutes)
- Counts total parquet files
- Checks date range coverage
- Validates sample files have valid data
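
The date-coverage check can be approximated independently of the validator. A minimal sketch, assuming the output files follow the `vbt_cache_klines/{YYYY-MM-DD}.parquet` naming shown above; `missing_dates` is a hypothetical helper, not part of `klines_backfill_5y_10y.py`.

```python
from datetime import date, timedelta
from pathlib import Path


def missing_dates(parquet_dir: Path, first: date, last: date) -> list[date]:
    """Return every date in [first, last] that has no {YYYY-MM-DD}.parquet file."""
    present = set()
    for path in parquet_dir.glob("*.parquet"):
        try:
            present.add(date.fromisoformat(path.stem))
        except ValueError:
            continue  # ignore files that are not named after a date
    n_days = (last - first).days + 1
    expected = (first + timedelta(days=i) for i in range(n_days))
    return [day for day in expected if day not in present]


if __name__ == "__main__":
    gaps = missing_dates(Path("vbt_cache_klines"), date(2021, 7, 1), date(2026, 3, 5))
    print(f"{len(gaps)} missing dates")
    for day in gaps[:10]:
        print("  ", day.isoformat())
```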
---
## Important Notes
### ⏱️ Very Long Runtime
- **Total: 10-18 hours** for 5-year backfill
- **Phase 1 (fetch) is the bottleneck** - depends on Binance API rate limits
- Run in a persistent session (tmux on Linux, a CMD window that stays open on Windows)
- **Safe to interrupt**: The script is idempotent, just re-run to resume
### 💾 Disk Management
- **klines_cache** grows to ~100-150 GB during fetch
- Can be deleted after conversion to free space
- **arrow_klines** intermediate: ~20 GB
- **Final parquets**: ~3 GB additional
### 📊 Symbol Coverage by Year
| Period | Expected Coverage | Notes |
|--------|------------------|-------|
| 2021-07+ | ~40-50 symbols | Most major alts listed |
| 2021-01 to 06 | ~10-20 symbols | Sparse, many not listed |
| 2020 | ~5-10 symbols | Only majors (BTC, ETH, BNB) |
| 2019 | ~5 symbols | Very sparse |
| 2017-2018 | 3-5 symbols | Only BTC, ETH, BNB |
### ⚠️ Binance Launch Date
- Binance launched in **July 2017**
- Data before 2017-07-01 simply doesn't exist
- Recommended start: **2021-07-01** (reliable coverage)
---
## Expected Output
After successful 5-year backfill:
```
vbt_cache_klines/
├── 2021-07-01.parquet ← NEW
├── 2021-07-02.parquet ← NEW
├── ... (914 new files)
├── 2023-12-31.parquet ← NEW
├── 2024-01-01.parquet ← existing
├── ... (existing files)
└── 2026-03-05.parquet ← existing
Total: ~1,710 parquets spanning 2021-07-01 to 2026-03-05
```
---
## Troubleshooting
### "Disk full" during fetch
```bash
# Stop the script (Ctrl-C), then:
# Option 1: Delete klines_cache for completed dates
# Option 2: Free up space elsewhere
# Then re-run - it will resume from where it stopped
```
### "Rate limited" errors
- The script handles this automatically (sleeps 60s)
- If persistent, wait an hour and re-run
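
The sleep-and-retry behaviour can be sketched as below. This is an illustration, not the script's implementation: Binance signals a rate-limit violation with HTTP 429, and the 60-second wait matches the figure above, but the function name, the retry cap, and the injectable `opener`/`sleep` parameters are assumptions added so the logic can be exercised without network access.

```python
import time
import urllib.error
import urllib.request

RATE_LIMITED = 429  # HTTP status Binance returns when the request limit is exceeded


def fetch_with_backoff(url, opener=urllib.request.urlopen, sleep=time.sleep,
                       wait_seconds=60, max_retries=5) -> bytes:
    """Fetch url, sleeping wait_seconds after each 429, up to max_retries retries."""
    for attempt in range(max_retries + 1):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Re-raise anything that is not a rate limit, or a rate limit on the last attempt.
            if err.code != RATE_LIMITED or attempt == max_retries:
                raise
            sleep(wait_seconds)
    raise RuntimeError("unreachable: the loop always returns or raises")
```

Capping the retries keeps a persistent rate limit from looping forever; at that point the advice above applies: wait an hour and re-run.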
### Missing symbols for early dates
- **Expected behavior**: Many alts weren't listed before 2021
- The eigenvalue computation handles this (uses available subset)
- Documented in the final report
### Script crashes on specific date
```bash
# Re-run with --date to skip problematic date
python historical_klines_backfiller.py --date 2022-06-15
```
---
## Post-Backfill Cleanup (Optional)
After validation passes, you can reclaim disk space:
```bat
REM Delete klines_cache (raw OHLCV) - 100-150 GB
rmdir /s "C:\Users\Lenovo\Documents\- Dolphin NG Backfill\klines_cache"
REM Delete arrow_klines intermediate - 20 GB
rmdir /s "C:\Users\Lenovo\Documents\- Dolphin NG Backfill\backfilled_data\arrow_klines"
REM Keep only vbt_cache_klines/ (final output)
```
⚠️ **Only delete after validating the parquets!**

---
## Validation Checklist
After running, verify:
- [ ] Total parquets: ~1,700+ files
- [ ] Date range: 2021-07-01 to 2026-03-05
- [ ] No gaps between 2021-07-01 and 2023-12-31 (the backfilled period)
- [ ] Sample files have valid vel_div values (non-zero std)
- [ ] BTCUSDT price column present in all files

Run: `python klines_backfill_5y_10y.py --validate`

---
## Summary of Commands
```bash
# FULL AUTOMATED RUN (recommended)
python klines_backfill_5y_10y.py --full-5y
# OR STEP BY STEP
python klines_backfill_5y_10y.py --preflight # Check first
python klines_backfill_5y_10y.py --backfill-5y # Fetch + Compute
python klines_backfill_5y_10y.py --convert # To Parquet
python klines_backfill_5y_10y.py --validate # Verify
```
**Ready to run?** Start with `python klines_backfill_5y_10y.py --plan` to confirm, then run `python klines_backfill_5y_10y.py --full-5y`.