# DOLPHIN NG5 - 5 Year / 10 Year Klines Dataset Builder

## Quick Summary

| Aspect | Details |
|--------|---------|
| **Current State** | 796 days of data (2021-06-15 to 2026-03-05, not contiguous) |
| **Gap** | 929 missing days (2021-06-16 to 2023-12-31) |
| **Target** | 5-year dataset: nominally 2021-01-01 to 2026-03-05 (5 years is ~1,826 days); the recommended backfill starts at 2021-07-01 (~1,710 days in total) |
| **Disk Required** | 150 GB free for 5-year, 400 GB for 10-year |
| **Your Disk** | 166 GB free ✅ (sufficient for 5-year) |
| **Runtime** | 10-18 hours for 5-year backfill |

---

## Pre-Flight Status ✅

### Disk Space
```
Free: 166.4 GB / Total: 951.6 GB
Status: SUFFICIENT for 5-year extension
```
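
A similar disk check can be reproduced with the standard library. This is a minimal sketch, not the logic behind `--preflight`: the helper names and the 1 GB = 1024³ bytes convention are assumptions, while the 150 GB and 400 GB thresholds come from the summary table.

```python
import shutil

BYTES_PER_GB = 1024 ** 3  # assumption: the report may use 10**9 instead


def free_gb(path: str = ".") -> float:
    """Free space on the volume that holds `path`, in GB."""
    return shutil.disk_usage(path).free / BYTES_PER_GB


def enough_space(free: float, required: float) -> bool:
    """True when the free space covers the required budget (both in GB)."""
    return free >= required


# Thresholds taken from the summary table.
REQUIRED_5Y_GB = 150
REQUIRED_10Y_GB = 400

current_free = free_gb(".")
print(f"Free: {current_free:.1f} GB")
print("5-year OK:", enough_space(current_free, REQUIRED_5Y_GB))
print("10-year OK:", enough_space(current_free, REQUIRED_10Y_GB))
```

With the 166.4 GB reported above, the 5-year check passes and the 10-year check does not.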

### Current Data Coverage
```
Parquet files: 796
Parquet range: 2021-06-15 to 2026-03-05
By year:
  2021: 1 day ← Only 1 day!
  2024: 366 days ← Complete
  2025: 365 days ← Complete
  2026: 64 days ← Partial

Arrow directories: 796 (matches parquet)
Klines cache: 0.54 GB (small - mostly fetched)
```

### The Gap
```
Missing: 2021-06-16 to 2023-12-31 (929 days)
This covers the rest of 2021 plus all of 2022-2023, and needs backfilling
```
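
The day counts can be checked with plain date arithmetic. The sketch below counts both endpoints and compares the full gap with the 2021-07-01 start used by the backfill commands; the function name is illustrative.

```python
from datetime import date


def inclusive_days(start: date, end: date) -> int:
    """Calendar days from start to end, counting both endpoints."""
    return (end - start).days + 1


# Full gap reported by the coverage scan.
full_gap = inclusive_days(date(2021, 6, 16), date(2023, 12, 31))

# Window actually requested by the backfill commands (start 2021-07-01).
backfill_window = inclusive_days(date(2021, 7, 1), date(2023, 12, 31))

print(full_gap, backfill_window, full_gap - backfill_window)
```

The difference is the 15 days from 2021-06-16 to 2021-06-30, which fall inside the gap but outside the recommended backfill window.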

---

## How to Run

### Option 1: Python Control Script (Recommended)

```bash
# Step 0: Review the plan
python klines_backfill_5y_10y.py --plan

# Step 1: Run pre-flight checks
python klines_backfill_5y_10y.py --preflight

# Step 2: Run complete 5-year backfill (ALL PHASES)
# ⚠️ This takes 10-18 hours! Run in a persistent session.
python klines_backfill_5y_10y.py --full-5y

# OR run step by step:
python klines_backfill_5y_10y.py --backfill-5y # Fetch + Compute (8-16 hours)
python klines_backfill_5y_10y.py --convert # Convert to Parquet (30-60 min)
python klines_backfill_5y_10y.py --validate # Validate output (5-10 min)
```

### Option 2: Batch Script (Windows)

```bash
# Run the batch file (double-click or run in CMD)
run_5y_klines_backfill.bat
```

### Option 3: Manual Commands

```bash
# PHASE 1: Fetch klines (6-12 hours)
cd "C:\Users\Lenovo\Documents\- Dolphin NG Backfill"
python historical_klines_backfiller.py --fetch --start 2021-07-01 --end 2023-12-31

# PHASE 2: Compute eigenvalues (2-4 hours)
python historical_klines_backfiller.py --compute --start 2021-07-01 --end 2023-12-31

# PHASE 3: Convert to Parquet (30-60 minutes)
cd "C:\Users\Lenovo\Documents\- DOLPHIN NG HD HCM TSF Predict"
python ng5_arrow_to_vbt_cache.py --all

# PHASE 4: Validate
python klines_backfill_5y_10y.py --validate
```

---

## What Each Phase Does

### Phase 1: Fetch Klines (6-12 hours)
- Downloads 1-minute OHLCV from the Binance public API
- 50 symbols × 914 days = 45,700 symbol-days
- Rate limited to 1,100 req/min (under the Binance limit of 1,200)
- Cached to `klines_cache/{symbol}/{YYYY-MM-DD}.parquet`
- **Idempotent**: Already-fetched dates are skipped
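
The idempotent skip can be pictured as a check against the cache layout above. This is a hedged sketch rather than the code in `historical_klines_backfiller.py`: `pending_dates` and `MIN_INTERVAL_S` are hypothetical names, and only the path pattern and the 1,100 req/min figure come from this document.

```python
import tempfile
from datetime import date, timedelta
from pathlib import Path


def pending_dates(cache_root, symbol, start, end):
    """Dates in [start, end] that have no cached parquet file for `symbol`."""
    pending = []
    day = start
    while day <= end:
        cached = Path(cache_root) / symbol / f"{day.isoformat()}.parquet"
        if not cached.exists():
            pending.append(day)
        day += timedelta(days=1)
    return pending


# Spacing between requests that keeps the rate at 1,100 requests per minute.
MIN_INTERVAL_S = 60 / 1100

# Demo: one of three dates is already cached, so only two remain to fetch.
with tempfile.TemporaryDirectory() as root:
    symbol_dir = Path(root) / "BTCUSDT"
    symbol_dir.mkdir()
    (symbol_dir / "2021-07-02.parquet").touch()
    todo = pending_dates(root, "BTCUSDT", date(2021, 7, 1), date(2021, 7, 3))

print([d.isoformat() for d in todo])
```

Because the decision depends only on which files exist, an interrupted run can simply be restarted and will pick up the remaining dates.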

### Phase 2: Compute Eigenvalues (2-4 hours)
- Reads cached klines
- Computes rolling correlation eigenvalues:
  - w50, w150, w300, w750 windows (1-minute bars)
  - Velocities, instabilities, vel_div
- Writes Arrow files: `arrow_klines/{date}/scan_{N:06d}_kbf_{HHMM}.arrow`
- **Idempotent**: Already-processed dates are skipped
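
To make "rolling correlation eigenvalues" concrete, the sketch below computes the largest eigenvalue of the correlation matrix over a sliding window. It is an illustration under assumptions, not the script's method: which eigenvalues the script keeps and how it derives velocities or `vel_div` is not stated here, all function names are hypothetical, and power iteration in pure Python is used only to keep the example dependency-free (a real pipeline would more likely call a linear-algebra library).

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix of equal-length return series.

    Assumes every series varies inside the window (non-zero variance).
    """
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([x - mean for x in col])
    norms = [math.sqrt(sum(x * x for x in c)) for c in centered]
    k = len(columns)
    return [
        [sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
         for j in range(k)]
        for i in range(k)
    ]


def largest_eigenvalue(matrix, iterations=500):
    """Dominant eigenvalue of a symmetric PSD matrix via power iteration."""
    k = len(matrix)
    # Non-uniform start; it must not be orthogonal to the dominant eigenvector.
    v = [1.0 + 0.1 * i for i in range(k)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(k)) for i in range(k)]
        eigenvalue = sum(a * b for a, b in zip(v, w))  # Rayleigh quotient
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return eigenvalue


def rolling_largest_eigenvalue(columns, window):
    """Largest correlation eigenvalue for each full window of `window` bars."""
    n = len(columns[0])
    return [
        largest_eigenvalue(correlation_matrix([c[end - window:end] for c in columns]))
        for end in range(window, n + 1)
    ]


# Demo: three perfectly correlated series, so the largest eigenvalue is 3.
a = [0.01, -0.02, 0.03, 0.00, 0.02, -0.01]
b = [2 * x for x in a]
c = [x + 0.005 for x in a]
rolling = rolling_largest_eigenvalue([a, b, c], window=4)
print(rolling)
```

For k series the largest eigenvalue ranges from 1 (uncorrelated) to k (perfectly correlated), which is why it is a common summary of how tightly a basket of symbols moves together.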

### Phase 3: Convert to Parquet (30-60 minutes)
- Reads Arrow files
- Converts to VBT cache format
- Output: `vbt_cache_klines/{YYYY-MM-DD}.parquet`
- **Idempotent**: Already-converted dates are skipped

### Phase 4: Validation (5-10 minutes)
- Counts total parquet files
- Checks date range coverage
- Checks that sample files contain valid data
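
The date-coverage check can be approximated from file names alone. The following is a sketch under the assumption that each output file is named `{YYYY-MM-DD}.parquet` as in Phase 3; `missing_dates` is a hypothetical helper, not part of `klines_backfill_5y_10y.py`.

```python
from datetime import date, timedelta

SUFFIX = ".parquet"


def missing_dates(filenames, start, end):
    """Dates in [start, end] that have no `{YYYY-MM-DD}.parquet` file."""
    present = set()
    for name in filenames:
        if not name.endswith(SUFFIX):
            continue  # ignore unrelated files
        try:
            present.add(date.fromisoformat(name[: -len(SUFFIX)]))
        except ValueError:
            continue  # ignore parquet files that are not named by date
    missing = []
    day = start
    while day <= end:
        if day not in present:
            missing.append(day)
        day += timedelta(days=1)
    return missing


# Demo with a hand-written listing; in practice the names would come from
# something like os.listdir("vbt_cache_klines").
names = ["2021-07-01.parquet", "2021-07-03.parquet", "notes.txt"]
gaps = missing_dates(names, date(2021, 7, 1), date(2021, 7, 3))
print([d.isoformat() for d in gaps])
```

An empty result for the range 2021-07-01 to 2026-03-05 would indicate that the backfilled and existing files join without gaps.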

---

## Important Notes

### ⏱️ Very Long Runtime
- **Total: 10-18 hours** for 5-year backfill
- **Phase 1 (fetch) is the bottleneck** - depends on Binance API rate limits
- Run in a persistent session (tmux on Linux, persistent CMD on Windows)
- **Safe to interrupt**: The script is idempotent; re-run it to resume

### 💾 Disk Management
- **klines_cache** grows to ~100-150 GB during fetch
  - Can be deleted after conversion to free space
- **arrow_klines** intermediate: ~20 GB
- **Final parquets**: ~3 GB additional

### 📊 Symbol Coverage by Year
| Period | Expected Coverage | Notes |
|--------|------------------|-------|
| 2021-07+ | ~40-50 symbols | Most major alts listed |
| 2021-01 to 06 | ~10-20 symbols | Sparse, many not listed |
| 2020 | ~5-10 symbols | Only majors (BTC, ETH, BNB) |
| 2019 | ~5 symbols | Very sparse |
| 2017-2018 | 3-5 symbols | Only BTC, ETH, BNB |

### ⚠️ Binance Launch Date
- Binance launched in **July 2017**
- Data before 2017-07-01 simply doesn't exist
- Recommended start: **2021-07-01** (reliable coverage)

---

## Expected Output

After a successful 5-year backfill:
```
vbt_cache_klines/
├── 2021-07-01.parquet ← NEW
├── 2021-07-02.parquet ← NEW
├── ... (914 new files)
├── 2023-12-31.parquet ← NEW
├── 2024-01-01.parquet ← existing
├── ... (existing files)
└── 2026-03-05.parquet ← existing

Total: ~1,710 parquets (914 new + 796 existing), contiguous from 2021-07-01 to 2026-03-05
```

---

## Troubleshooting

### "Disk full" during fetch
```bash
# Stop the script (Ctrl-C), then:
# Option 1: Delete klines_cache for completed dates
# Option 2: Free up space elsewhere
# Then re-run - it will resume from where it stopped
```

### "Rate limited" errors
- The script handles this automatically (sleeps 60s)
- If persistent, wait an hour and re-run
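
The sleep-and-retry behaviour can be sketched as follows. This is an assumption about the general shape, not the script's actual code: `RateLimited` and `call_with_backoff` are hypothetical names, and only the 60-second wait is taken from the note above.

```python
import time


class RateLimited(Exception):
    """Hypothetical error for a rate-limit response (for example HTTP 429)."""


def call_with_backoff(request, max_attempts=3, wait_s=60.0, sleep=time.sleep):
    """Call `request`; on RateLimited, sleep and retry up to `max_attempts` times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except RateLimited:
            if attempt == max_attempts:
                raise  # give up; the caller can re-run later
            sleep(wait_s)


# Demo with a fake request that is rate limited twice, then succeeds.
# A recording stub replaces time.sleep so the demo does not actually wait.
waits = []
attempts = {"count": 0}


def fake_request():
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise RateLimited()
    return "ok"


result = call_with_backoff(fake_request, sleep=waits.append)
print(result, waits)
```

Passing the sleep function as a parameter keeps the retry logic testable without real delays.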

### Missing symbols for early dates
- **Expected behavior**: Many alts weren't listed before 2021
- The eigenvalue computation handles this (uses available subset)
- Documented in the final report

### Script crashes on specific date
```bash
# Re-run with --date to skip problematic date
python historical_klines_backfiller.py --date 2022-06-15
```

---

## Post-Backfill Cleanup (Optional)

After validation passes, you can reclaim disk space:

```bat
REM Delete klines_cache (raw OHLCV) - 100-150 GB
rmdir /s "C:\Users\Lenovo\Documents\- Dolphin NG Backfill\klines_cache"

REM Delete arrow_klines intermediate - 20 GB
rmdir /s "C:\Users\Lenovo\Documents\- Dolphin NG Backfill\backfilled_data\arrow_klines"

REM Keep only vbt_cache_klines/ (final output)
```

⚠️ **Only delete after validating the parquets!**

---

## Validation Checklist

After running, verify:
- [ ] Total parquets: ~1,700+ files
- [ ] Date range: 2021-07-01 to 2026-03-05
- [ ] No gaps in 2022-2023 period
- [ ] Sample files have valid vel_div values (non-zero std)
- [ ] BTCUSDT price column present in all files

Run: `python klines_backfill_5y_10y.py --validate`

---

## Summary of Commands

```bash
# FULL AUTOMATED RUN (recommended)
python klines_backfill_5y_10y.py --full-5y

# OR STEP BY STEP
python klines_backfill_5y_10y.py --preflight # Check first
python klines_backfill_5y_10y.py --backfill-5y # Fetch + Compute
python klines_backfill_5y_10y.py --convert # To Parquet
python klines_backfill_5y_10y.py --validate # Verify
```

**Ready to run?** Start with `python klines_backfill_5y_10y.py --plan` to confirm, then run `python klines_backfill_5y_10y.py --full-5y`.