initial: import DOLPHIN baseline 2026-04-21 from dolphinng5_predict working tree

Includes core prod + GREEN/BLUE subsystems:
- prod/ (BLUE harness, configs, scripts, docs)
- nautilus_dolphin/ (GREEN Nautilus-native impl + dvae/ preserved)
- adaptive_exit/ (AEM engine + models/bucket_assignments.pkl)
- Observability/ (EsoF advisor, TUI, dashboards)
- external_factors/ (EsoF producer)
- mc_forewarning_qlabs_fork/ (MC regime/envelope)

Excludes runtime caches, logs, backups, and reproducible artifacts per .gitignore.
243
prod/docs/KLINES_5Y_10Y_DATASET_README.md
Executable file
@@ -0,0 +1,243 @@
# DOLPHIN NG5 - 5 Year / 10 Year Klines Dataset Builder

## Quick Summary

| Aspect | Details |
|--------|---------|
| **Current State** | 796 days of data (2021-06-15 to 2026-03-05) |
| **Gap** | 929 missing days (2021-06-16 to 2023-12-31) |
| **Target** | 5-year dataset: 2021-01-01 to 2026-03-05 (1,890 calendar days; the commands below start at the recommended 2021-07-01) |
| **Disk Required** | 150 GB free for 5-year, 400 GB for 10-year |
| **Your Disk** | 166 GB free ✅ (sufficient for 5-year) |
| **Runtime** | 10-18 hours for 5-year backfill |


---

## Pre-Flight Status ✅

### Disk Space
```
Free:  166.4 GB / Total: 951.6 GB
Status: SUFFICIENT for 5-year extension
```

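
The free-space figure can be re-checked right before a long run. This is a minimal sketch using Python's standard `shutil.disk_usage`; the helper name `has_enough_space` is hypothetical (it is not part of `klines_backfill_5y_10y.py`), and it assumes the GB figures in this README are binary gigabytes.

```python
import shutil

GIB = 1024 ** 3  # assumption: "GB" in this README means 2**30 bytes


def has_enough_space(path: str, required_gb: float) -> bool:
    """Return True if the volume holding `path` has at least `required_gb` free."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free / GIB >= required_gb
```

For example, `has_enough_space(".", 150)` would mirror the 5-year requirement listed in the Quick Summary.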

### Current Data Coverage
```
Parquet files: 796
Parquet range: 2021-06-15 to 2026-03-05
By year:
  2021: 1 days    ← Only 1 day!
  2024: 366 days  ← Complete
  2025: 365 days  ← Complete
  2026: 64 days   ← Partial

Arrow directories: 796 (matches parquet)
Klines cache: 0.54 GB (small - mostly fetched)
```

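
A per-year breakdown like this one can be reproduced from file names alone. The sketch below assumes the cache directory holds files named `YYYY-MM-DD.parquet` (the layout used for `vbt_cache_klines/`); `days_per_year` is a hypothetical helper, not a function of the DOLPHIN scripts.

```python
from collections import Counter
from datetime import date
from pathlib import Path


def days_per_year(cache_dir: str) -> dict:
    """Count daily parquet files per calendar year, based on the file name only."""
    counts = Counter()
    for f in Path(cache_dir).glob("*.parquet"):
        try:
            day = date.fromisoformat(f.stem)  # expects YYYY-MM-DD
        except ValueError:
            continue  # ignore files that are not named after a date
        counts[day.year] += 1
    return dict(sorted(counts.items()))
```

A year that is absent from the result has no files at all, which is how 2022 and 2023 show up as missing.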

### The Gap
```
Missing: 2021-06-16 to 2023-12-31 (929 days)
This is the rest of 2021 plus all of 2022-2023, which needs backfilling
```

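
The day counts in this README can be checked with plain date arithmetic. The sketch below only uses dates stated in this document; `inclusive_days` is a hypothetical helper name.

```python
from datetime import date


def inclusive_days(start: date, end: date) -> int:
    """Number of calendar days from start to end, counting both endpoints."""
    return (end - start).days + 1


gap = inclusive_days(date(2021, 6, 16), date(2023, 12, 31))
backfill = inclusive_days(date(2021, 7, 1), date(2023, 12, 31))
print(gap, backfill, gap - backfill)  # prints: 929 914 15
```

The 15-day difference is 2021-06-16 to 2021-06-30: the commands in this README start at 2021-07-01, so those days are not part of the 914-day backfill.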

---

## How to Run

### Option 1: Python Control Script (Recommended)

```bash
# Step 0: Review the plan
python klines_backfill_5y_10y.py --plan

# Step 1: Run pre-flight checks
python klines_backfill_5y_10y.py --preflight

# Step 2: Run complete 5-year backfill (ALL PHASES)
# ⚠️ This takes 10-18 hours! Run in a persistent session.
python klines_backfill_5y_10y.py --full-5y

# OR run step by step:
python klines_backfill_5y_10y.py --backfill-5y   # Fetch + Compute (8-16 hours)
python klines_backfill_5y_10y.py --convert       # Convert to Parquet (30-60 min)
python klines_backfill_5y_10y.py --validate      # Validate output (5-10 min)
```

### Option 2: Batch Script (Windows)

```bat
REM Run the batch file (double-click it or run it from CMD)
run_5y_klines_backfill.bat
```

### Option 3: Manual Commands

```bash
# PHASE 1: Fetch klines (6-12 hours)
cd "C:\Users\Lenovo\Documents\- Dolphin NG Backfill"
python historical_klines_backfiller.py --fetch --start 2021-07-01 --end 2023-12-31

# PHASE 2: Compute eigenvalues (2-4 hours)
python historical_klines_backfiller.py --compute --start 2021-07-01 --end 2023-12-31

# PHASE 3: Convert to Parquet (30-60 minutes)
cd "C:\Users\Lenovo\Documents\- DOLPHIN NG HD HCM TSF Predict"
python ng5_arrow_to_vbt_cache.py --all

# PHASE 4: Validate
python klines_backfill_5y_10y.py --validate
```

---

## What Each Phase Does

### Phase 1: Fetch Klines (6-12 hours)
- Downloads 1-minute OHLCV from the Binance public API
- 50 symbols × 914 days = 45,700 symbol-days
- Rate limited to 1100 req/min (under the Binance limit of 1200)
- Cached to `klines_cache/{symbol}/{YYYY-MM-DD}.parquet`
- **Idempotent**: Already-fetched dates are skipped

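
The idempotent skip can be pictured as a check against the cache layout. This is a minimal sketch, assuming the `klines_cache/{symbol}/{YYYY-MM-DD}.parquet` layout above; `cache_path` and `pending_symbol_days` are hypothetical names, and the real backfiller may decide what to skip differently.

```python
from datetime import date, timedelta
from pathlib import Path


def cache_path(root: Path, symbol: str, day: date) -> Path:
    """Build klines_cache/{symbol}/{YYYY-MM-DD}.parquet under `root`."""
    return root / symbol / f"{day.isoformat()}.parquet"


def pending_symbol_days(root: Path, symbols, start: date, end: date):
    """List (symbol, day) pairs in [start, end] that have no cached file yet."""
    pending = []
    day = start
    while day <= end:
        for symbol in symbols:
            if not cache_path(root, symbol, day).exists():
                pending.append((symbol, day))
        day += timedelta(days=1)
    return pending
```

Re-running after an interruption then only touches the pairs still returned by `pending_symbol_days`, which is what makes a resume cheap.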

### Phase 2: Compute Eigenvalues (2-4 hours)
- Reads cached klines
- Computes rolling correlation eigenvalues:
  - w50, w150, w300, w750 windows (1-minute bars)
  - Velocities, instabilities, vel_div
- Writes Arrow files: `arrow_klines/{date}/scan_{N:06d}_kbf_{HHMM}.arrow`
- **Idempotent**: Already-processed dates are skipped

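
To illustrate what a rolling correlation eigenvalue is, here is a dependency-free sketch that computes only the largest eigenvalue of the cross-symbol correlation matrix per window, using power iteration. It is not the DOLPHIN implementation: the use of per-bar returns as input, all function names, and the handling of flat series are assumptions, and the velocity, instability, and `vel_div` features are not covered because this README does not define them.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a flat series is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(series):
    """Symmetric correlation matrix for a list of per-symbol sequences."""
    k = len(series)
    return [[1.0 if i == j else pearson(series[i], series[j]) for j in range(k)]
            for i in range(k)]


def _matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def largest_eigenvalue(matrix, iterations=200):
    """Power iteration; a correlation matrix is positive semi-definite,
    so the dominant eigenvalue is also the largest one."""
    rng = random.Random(0)  # fixed seed keeps the result reproducible
    v = [rng.random() + 0.5 for _ in matrix]
    for _ in range(iterations):
        w = _matvec(matrix, v)
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            return 0.0
        v = [c / norm for c in w]
    # Rayleigh quotient of the (unit-length) converged vector
    return sum(a * b for a, b in zip(v, _matvec(matrix, v)))


def rolling_top_eigenvalue(returns, window):
    """Largest correlation eigenvalue for every full window of `window` bars.

    `returns` is a list of per-symbol sequences of equal length.
    """
    n = len(returns[0])
    return [
        largest_eigenvalue(correlation_matrix([r[end - window:end] for r in returns]))
        for end in range(window, n + 1)
    ]
```

With `k` symbols the largest eigenvalue lies between 1 (no correlation) and `k` (all symbols move in lockstep). Power iteration converges slowly when the top two eigenvalues are close, so a production version would more likely call a linear algebra library.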

### Phase 3: Convert to Parquet (30-60 minutes)
- Reads Arrow files
- Converts to VBT cache format
- Output: `vbt_cache_klines/{YYYY-MM-DD}.parquet`
- **Idempotent**: Already-converted dates are skipped

### Phase 4: Validation (5-10 minutes)
- Counts total parquet files
- Checks date range coverage
- Validates sample files have valid data

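
The date-range coverage check could look roughly like the sketch below. It assumes one `YYYY-MM-DD.parquet` file per day in the output directory; `missing_dates` is a hypothetical helper, and the actual `--validate` logic is not shown in this README.

```python
from datetime import date, timedelta
from pathlib import Path


def missing_dates(cache_dir: str, start: date, end: date):
    """Return every day in [start, end] that has no YYYY-MM-DD.parquet file."""
    present = set()
    for f in Path(cache_dir).glob("*.parquet"):
        try:
            present.add(date.fromisoformat(f.stem))
        except ValueError:
            continue  # not a date-named file
    missing = []
    day = start
    while day <= end:
        if day not in present:
            missing.append(day)
        day += timedelta(days=1)
    return missing
```

An empty list for `missing_dates("vbt_cache_klines", date(2021, 7, 1), date(2026, 3, 5))` would mean the range is contiguous.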

---

## Important Notes

### ⏱️ Very Long Runtime
- **Total: 10-18 hours** for 5-year backfill
- **Phase 1 (fetch) is the bottleneck** - depends on Binance API rate limits
- Run in a persistent session (tmux on Linux, a persistent CMD window on Windows)
- **Safe to interrupt**: The script is idempotent, so just re-run it to resume

### 💾 Disk Management
- **klines_cache** grows to ~100-150 GB during fetch
  - Can be deleted after conversion to free space
- **arrow_klines** intermediate: ~20 GB
- **Final parquets**: ~3 GB additional

### 📊 Symbol Coverage by Year
| Period | Expected Coverage | Notes |
|--------|------------------|-------|
| 2021-07+ | ~40-50 symbols | Most major alts listed |
| 2021-01 to 06 | ~10-20 symbols | Sparse, many not listed |
| 2020 | ~5-10 symbols | Only majors (BTC, ETH, BNB) |
| 2019 | ~5 symbols | Very sparse |
| 2017-2018 | ~3-5 symbols | Only BTC, ETH, BNB |

### ⚠️ Binance Launch Date
- Binance launched in **July 2017**
- Data before 2017-07-01 simply doesn't exist
- Recommended start: **2021-07-01** (reliable coverage)

---

## Expected Output

After successful 5-year backfill:
```
vbt_cache_klines/
├── 2021-07-01.parquet    ← NEW
├── 2021-07-02.parquet    ← NEW
├── ... (914 new files)
├── 2023-12-31.parquet    ← NEW
├── 2024-01-01.parquet    ← existing
├── ... (existing files)
└── 2026-03-05.parquet    ← existing

Total: ~1,710 parquets spanning 2021-07-01 to 2026-03-05
```

---

## Troubleshooting

### "Disk full" during fetch
```bash
# Stop the script (Ctrl-C), then:
# Option 1: Delete klines_cache for completed dates
# Option 2: Free up space elsewhere
# Then re-run - it will resume from where it stopped
```

### "Rate limited" errors
- The script handles this automatically (sleeps 60s)
- If persistent, wait an hour and re-run

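
The sleep-and-retry behaviour can be sketched as a small wrapper. This is an illustration only: `RateLimited` and `call_with_backoff` are hypothetical names, and how `historical_klines_backfiller.py` actually detects a rate limit (for example an HTTP 429 response) is an assumption.

```python
import time


class RateLimited(Exception):
    """Hypothetical marker raised by a fetch function when the API rate-limits it."""


def call_with_backoff(fetch, retries=5, wait_seconds=60.0, sleep=time.sleep):
    """Call `fetch()`; on RateLimited, wait `wait_seconds` and try again.

    The last failure is re-raised once `retries` attempts are used up.
    `sleep` is injectable so the wrapper can be tested without waiting.
    """
    for attempt in range(retries):
        try:
            return fetch()
        except RateLimited:
            if attempt == retries - 1:
                raise
            sleep(wait_seconds)
```

Capping the number of retries matches the advice above: if the limit persists, stop and re-run later rather than looping indefinitely.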

### Missing symbols for early dates
- **Expected behavior**: Many alts weren't listed before 2021
- The eigenvalue computation handles this (uses available subset)
- Documented in the final report

### Script crashes on specific date
```bash
# Re-run with --date to skip problematic date
python historical_klines_backfiller.py --date 2022-06-15
```

---

## Post-Backfill Cleanup (Optional)

After validation passes, you can reclaim disk space:

```bat
REM Delete klines_cache (raw OHLCV) - 100-150 GB
rmdir /s "C:\Users\Lenovo\Documents\- Dolphin NG Backfill\klines_cache"

REM Delete arrow_klines intermediate - 20 GB
rmdir /s "C:\Users\Lenovo\Documents\- Dolphin NG Backfill\backfilled_data\arrow_klines"

REM Keep only vbt_cache_klines/ (final output)
```

⚠️ **Only delete after validating the parquets!**

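
Before deleting anything, it may help to measure how much space each directory actually holds. The sketch below walks a directory tree with the standard library; `dir_size_gb` is a hypothetical helper and reports binary gigabytes.

```python
import os


def dir_size_gb(path: str) -> float:
    """Total size of all regular files under `path`, in GiB."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):  # do not follow symlinks
                total += os.path.getsize(file_path)
    return total / 1024 ** 3
```

Comparing the result for `klines_cache` and `arrow_klines` against the estimates above shows how much space the cleanup would reclaim.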

---

## Validation Checklist

After running, verify:
- [ ] Total parquets: ~1,700+ files
- [ ] Date range: 2021-07-01 to 2026-03-05
- [ ] No gaps in 2022-2023 period
- [ ] Sample files have valid vel_div values (non-zero std)
- [ ] BTCUSDT price column present in all files

Run: `python klines_backfill_5y_10y.py --validate`

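
The "non-zero std" item can be expressed as a small predicate over a column of values. This is a sketch under assumptions: `has_valid_spread` is a hypothetical name, loading the `vel_div` column from a parquet file is left out, and ignoring non-finite values (rather than failing on them) is a design choice made here, not something this README specifies.

```python
import math
import statistics


def has_valid_spread(values) -> bool:
    """True if at least two finite values exist and they are not all identical."""
    finite = [v for v in values if math.isfinite(v)]
    return len(finite) >= 2 and statistics.pstdev(finite) > 0
```

A file whose `vel_div` column is constant or entirely NaN would fail this check and deserve a closer look.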

---

## Summary of Commands

```bash
# FULL AUTOMATED RUN (recommended)
python klines_backfill_5y_10y.py --full-5y

# OR STEP BY STEP
python klines_backfill_5y_10y.py --preflight      # Check first
python klines_backfill_5y_10y.py --backfill-5y    # Fetch + Compute
python klines_backfill_5y_10y.py --convert        # To Parquet
python klines_backfill_5y_10y.py --validate       # Verify
```

**Ready to run?** Start with `python klines_backfill_5y_10y.py --plan` to confirm, then run `python klines_backfill_5y_10y.py --full-5y`.