# EsoF — Esoteric Factors: Current State & Research Findings
**As of: 2026-04-20 | Trade sample: 588 clean alpha trades (2026-03-31 → 2026-04-20) | Backtest: 2155 trades (2025-12-31 → 2026-02-26)**
---
## 1. What "EsoF" Actually Refers To (Disambiguation)
The name "EsoF" (Esoteric Factors) attaches to **two entirely separate systems** in the Dolphin codebase. Do not conflate them.
### 1A. The Hazard Multiplier (`set_esoteric_hazard_multiplier`)
Located in `esf_alpha_orchestrator.py`. Modulates `base_max_leverage` downward:
```
effective_base = base_max_leverage × (1.0 - hazard_mult × factor)
```
**Current gold spec**: `hazard_mult = 0.0` permanently. With the multiplier at zero, `effective_base` always equals `base_max_leverage`; the parameter exists in the engine but is inert.
- Gold backtest ran with `hazard_mult=0.0`.
- **Do not change this** without running a full backtest comparison.
- The `esof_prefect_flow.py` computes astrological factors and pushes them to HZ, but **nothing in the trading engine reads or consumes this output**. The flow is dormant as an engine input.
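
The inertness of the hazard multiplier can be checked directly from the formula above. The snippet below is a minimal sketch, not the engine code: the function name `effective_base_leverage` and the 9.0 leverage value are illustrative assumptions.

```python
def effective_base_leverage(base_max_leverage: float,
                            hazard_mult: float,
                            factor: float) -> float:
    """Hazard formula from this section (hypothetical helper name)."""
    return base_max_leverage * (1.0 - hazard_mult * factor)


# With the gold-spec hazard_mult = 0.0, the factor value has no effect.
for factor in (0.0, 0.5, 1.0):
    assert effective_base_leverage(9.0, 0.0, factor) == 9.0

# A non-zero multiplier would reduce leverage: 10 x (1 - 0.5 x 0.4) = 8.
reduced = effective_base_leverage(10.0, 0.5, 0.4)
```

Any non-zero `hazard_mult` changes sizing, which is why the note above asks for a full backtest comparison first.
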
### 1B. The Advisory System (`Observability/esof_advisor.py`)
A standalone advisory layer — **not wired into BLUE**. Built from 637 live trades. Computes session/DoW/slot/liq_hour expectancy and publishes an advisory score every 15 seconds to HZ and CH.
---
## 2. MarketIndicators — `external_factors/esoteric_factors_service.py`
The `MarketIndicators` class computes several temporal signals used by the advisory layer.
### 2.1 Regions Table
| Region | Population (M) | Liq Weight | Major centers |
|---------------|----------------|------------|---------------|
| Americas | 1,000 | 0.35 | NYSE, CME |
| EMEA | 2,200 | 0.30 | LSE, Frankfurt, ECB |
| South_Asia | 1,400 | 0.05 | BSE, NSE |
| East_Asia | 1,600 | 0.20 | TSE, HKEX, SGX |
| Oceania_SEA | 800 | 0.10 | ASX, SGX |
### 2.2 Computed Signals
| Method | Returns | Notes |
|--------|---------|-------|
| `get_weighted_times(now)` | `(pop_hour, liq_hour)` | Circular weighted average using sin/cos of each region's local hour |
| `get_liquidity_session(now)` | session string | Step function on UTC hour |
| `get_regional_times(now)` | dict per region | local_hour + is_tradfi_open flag |
| `is_tradfi_open(now)` | bool | Weekday 0–4, hour 9–17 local |
| `get_moon_phase(now)` | phase + illumination | Via astropy (ephem backend) |
| `is_mercury_retrograde(now)` | bool | Hardcoded period list |
| `get_fibonacci_time(now)` | strength float | Distance to nearest Fibonacci minute |
| `get_market_cycle_position(now)` | 0.0–1.0 | BTC halving 4-year cycle reference |
### 2.3 Weighted Hour Properties
- **pop_weighted_hour**: Population-weighted centroid ≈ UTC + 4.21h (South_Asia + East_Asia heavily weighted). Rotates strongly with East_Asian trading day opening.
- **liq_weighted_hour**: Liquidity-weighted centroid ≈ UTC + 0.98h (Americas 35% dominant). **Nearly linear monotone with UTC** — adds granularity but does not reveal fundamentally different patterns from raw UTC sessions.
- **Fallback** (if astropy not installed): `pop ≈ (UTC + 4.21) % 24`, `liq ≈ (UTC + 0.98) % 24`
- **astropy 7.2.0** is installed in siloqy_env (installed 2026-04-19).
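
The sin/cos averaging mentioned for `get_weighted_times` can be sketched as a weighted circular mean. This is an illustration under assumptions, not the `MarketIndicators` implementation: the helper names `circular_weighted_hour` and `fallback_hours` are hypothetical, and only the liquidity weights and the two fallback offsets come from this document.

```python
import math

# Liquidity weights from the Regions table (section 2.1).
LIQ_WEIGHTS = {
    "Americas": 0.35, "EMEA": 0.30, "South_Asia": 0.05,
    "East_Asia": 0.20, "Oceania_SEA": 0.10,
}


def circular_weighted_hour(local_hours: dict, weights: dict) -> float:
    """Weighted circular mean of hours on a 24h clock, in [0, 24)."""
    sin_sum = cos_sum = 0.0
    for region, hour in local_hours.items():
        angle = 2.0 * math.pi * hour / 24.0
        sin_sum += weights[region] * math.sin(angle)
        cos_sum += weights[region] * math.cos(angle)
    hour = (math.atan2(sin_sum, cos_sum) * 24.0 / (2.0 * math.pi)) % 24.0
    # Guard against float rounding that can yield exactly 24.0.
    return 0.0 if hour >= 24.0 else hour


def fallback_hours(utc_hour: float) -> tuple:
    """Fixed-offset approximation quoted in the Fallback bullet."""
    return ((utc_hour + 4.21) % 24.0, (utc_hour + 0.98) % 24.0)


# Wrap-around case: 23h and 01h average to midnight, not to noon.
wrapped = circular_weighted_hour({"a": 23.0, "b": 1.0}, {"a": 0.5, "b": 0.5})
```

A plain arithmetic mean of 23h and 01h would give 12h, which is why the average is taken on the circle.
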
---
## 3. Trade Analysis — 637 Trades (2026-03-31 → 2026-04-19)
**Baseline**: WR = 43.7%, net = +$172.45 across all 637 trades.
### 3.1 Session Expectancy
| Session | Trades | WR% | Net PnL | Avg/trade |
|---------|--------|-----|---------|-----------|
| **LONDON_MORNING** (08–13h UTC) | 111 | **47.7%** | **+$4,133** | +$37.23 |
| **ASIA_PACIFIC** (00–08h UTC) | 182 | 46.7% | +$1,600 | +$8.79 |
| **LN_NY_OVERLAP** (13–17h UTC) | 147 | 45.6% | -$895 | -$6.09 |
| **LOW_LIQUIDITY** (21–24h UTC) | 71 | 39.4% | -$809 | -$11.40 |
| **NY_AFTERNOON** (17–21h UTC) | 127 | **35.4%** | **-$3,857** | -$30.37 |
**NY_AFTERNOON is a systematic loser across all days.** LONDON_MORNING is the cleanest positive session.
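
The session step function can be sketched from the UTC boundaries in the table above (sessions starting at 00h, 08h, 13h, 17h and 21h). This is a hedged illustration: `classify_session` is a hypothetical name, and treating each boundary hour as belonging to the later session (half-open intervals) is an assumption about how `get_liquidity_session` behaves.

```python
# (start_hour, end_hour, session) with half-open intervals [start, end).
SESSION_BOUNDS = [
    (0, 8, "ASIA_PACIFIC"),
    (8, 13, "LONDON_MORNING"),
    (13, 17, "LN_NY_OVERLAP"),
    (17, 21, "NY_AFTERNOON"),
    (21, 24, "LOW_LIQUIDITY"),
]


def classify_session(utc_hour: int) -> str:
    """Map a UTC hour (0-23) to its session label."""
    for start, end, name in SESSION_BOUNDS:
        if start <= utc_hour < end:
            return name
    raise ValueError(f"utc_hour out of range: {utc_hour}")


sessions = [classify_session(h) for h in (0, 8, 16, 17, 23)]
```
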
### 3.2 Day-of-Week Expectancy
| DoW | Trades | WR% | Net PnL | Avg/trade |
|-----|--------|-----|---------|-----------|
| Mon | 81 | **27.2%** | -$1,054 | -$13.01 |
| Tue | 77 | **54.5%** | +$3,824 | +$49.66 |
| Wed | 98 | 43.9% | -$385 | -$3.93 |
| Thu | 115 | 44.3% | -$4,017 | -$34.93 |
| Fri | 106 | 39.6% | -$1,968 | -$18.57 |
| Sat | 82 | 43.9% | +$43 | +$0.53 |
| Sun | 78 | **53.8%** | +$3,730 | +$47.82 |
**Monday is the worst trading day** (WR 27.2%; avoid). **Thursday has the largest net loss despite a near-baseline WR** (most of the damage comes from the LN_NY_OVERLAP cell). **Tuesday and Sunday are positive outliers.**
### 3.3 Liquidity-Hour Expectancy (3h Buckets, liq_hour ≈ UTC + 0.98h)
| liq_hour bucket | Trades | WR% | Net PnL | Avg/trade | Approx UTC |
|-----------------|--------|-----|---------|-----------|------------|
| 0–3h | 70 | 51.4% | +$1,466 | +$20.9 | 23–2h |
| 3–6h | 73 | 46.6% | -$1,166 | -$16.0 | 2–5h |
| 6–9h | 62 | 41.9% | +$1,026 | +$16.5 | 5–8h |
| 9–12h | 65 | 43.1% | +$476 | +$7.3 | 8–11h |
| **12–15h** | **84** | **52.4%** | **+$3,532** | **+$42.0** | **11–14h ★ BEST** |
| 15–18h | 113 | 43.4% | -$770 | -$6.8 | 14–17h |
| 18–21h | 99 | **35.4%** | **-$2,846** | **-$28.8** | 17–20h ✗ WORST |
| 21–24h | 72 | 36.1% | -$1,545 | -$21.5 | 20–23h |
liq 12–15h (EMEA afternoon + US open) is the standout best bucket. liq 18–21h mirrors NY_AFTERNOON perfectly and is the worst.
### 3.4 DoW × Session Heatmap — Notable Cells
Notable cells from the 5×7 session × DoW grid (cells with n < 5 are omitted for lack of data):
| DoW × Session | Trades | WR% | Net PnL | Label |
|---------------|--------|-----|---------|-------|
| **Sun × LONDON_MORNING** | 13 | **85.0%** | +$2,153 | ★ BEST CELL |
| **Sun × LN_NY_OVERLAP** | 24 | **75.0%** | +$2,110 | 2nd best |
| **Tue × ASIA_PACIFIC** | 27 | 67.0% | +$2,522 | 3rd |
| **Tue × LN_NY_OVERLAP** | 18 | 56.0% | +$2,260 | 4th |
| **Sun × NY_AFTERNOON** | 17 | **6.0%** | -$1,025 | ✗ WORST CELL |
| Mon × ASIA_PACIFIC | 21 | 19.0% | -$411 | avoid |
| **Thu × LN_NY_OVERLAP** | 27 | 41.0% | **-$3,310** | ✗ CATASTROPHIC |
**Sun NY_AFTERNOON (6% WR) is a near-perfect inverse signal.** Thu LN_NY_OVERLAP has enough trades (27) to be considered reliable — biggest single-cell loss in the dataset.
### 3.5 15-Minute Slot Highlights (n ≥ 5)
Top positive slots by avg_pnl (n ≥ 5):
| Slot | n | WR% | Net | Avg/trade |
|------|---|-----|-----|-----------|
| 15:00 | 10 | 70.0% | +$2,266 | +$226.58 ★ |
| 11:30 | 8 | 87.5% | +$1,075 | +$134.32 |
| 1:30 | 10 | 50.0% | +$1,607 | +$160.67 |
| 13:45 | 10 | 70.0% | +$1,082 | +$108.21 |
| 1:45 | 5 | 80.0% | +$459 | +$91.75 |
Top negative slots:
| Slot | n | WR% | Net | Avg/trade |
|------|---|-----|-----|-----------|
| 5:45 | 5 | 40.0% | -$1,665 | -$333.05 ★ |
| 2:15 | 5 | 0.0% | -$852 | -$170.31 |
| 16:30 | 4 | 25.0% | -$2,024 | -$506.01 (n<5) |
| 12:45 | 6 | 16.7% | -$1,178 | -$196.35 |
| 18:00 | 6 | 16.7% | -$1,596 | -$265.93 |
**Caveat on slots**: Many 15m slots have n = 4–10. Most are noise at current sample size. Weight slot_score low (10%) in composite.
---
## 4. Advisory Scoring Model
### 4.1 Score Formula
```
sess_score = (sess_wr - 43.7) / 20.0          # normalized [-1, +1]
liq_score  = (liq_wr  - 43.7) / 20.0
dow_score  = (dow_wr  - 43.7) / 20.0
slot_score = (slot_wr - 43.7) / 20.0          # if n≥5, else 0.0
cell_bonus = (cell_wr - 43.7) / 100.0 × 0.3   # ±0.30 max
advisory_score = liq_score×0.30 + sess_score×0.25 + dow_score×0.30
               + slot_score×0.10 + cell_bonus×0.05
advisory_score = clamp(advisory_score, -1.0, +1.0)
# Mercury retrograde: additional -0.05 penalty
if mercury_retrograde:
    advisory_score = max(-1.0, advisory_score - 0.05)
```
Denominator 20.0 chosen because observed WR range across all factors is ≈ ±20pp from baseline.
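
The formula above translates almost line for line into Python. The sketch below is not the code in `esof_advisor.py`: the function signature is hypothetical, and treating a missing DoW×Session cell as a zero bonus is an assumption. The example inputs are the Sunday 10:00 UTC values from section 3, taking the liq 9–12h bucket on the basis of liq_hour ≈ UTC + 0.98h.

```python
BASELINE_WR = 43.7  # baseline win rate (%) from section 3


def _norm(wr_pct: float) -> float:
    """Distance from baseline WR, scaled by the 20pp denominator."""
    return (wr_pct - BASELINE_WR) / 20.0


def advisory_score(sess_wr, liq_wr, dow_wr, slot_wr=None, slot_n=0,
                   cell_wr=None, mercury_retrograde=False) -> float:
    sess_score = _norm(sess_wr)
    liq_score = _norm(liq_wr)
    dow_score = _norm(dow_wr)
    # Slots with fewer than 5 trades contribute nothing.
    slot_score = _norm(slot_wr) if slot_wr is not None and slot_n >= 5 else 0.0
    # Assumption: a cell without data contributes a zero bonus.
    cell_bonus = (cell_wr - BASELINE_WR) / 100.0 * 0.3 if cell_wr is not None else 0.0
    score = (liq_score * 0.30 + sess_score * 0.25 + dow_score * 0.30
             + slot_score * 0.10 + cell_bonus * 0.05)
    score = max(-1.0, min(1.0, score))
    if mercury_retrograde:
        score = max(-1.0, score - 0.05)
    return score


# Sunday 10:00 UTC: LONDON_MORNING 47.7, liq 9-12h 43.1, Sun 53.8, Sun x LDN 85.0
sunday_london = advisory_score(sess_wr=47.7, liq_wr=43.1, dow_wr=53.8, cell_wr=85.0)
```

With these inputs the score comes out near +0.20, driven mostly by the Sunday DoW term; the 85% cell contributes under 0.01 because of its 5% weight.
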
### 4.2 Labels
| Score range | Label |
|-------------|-------|
| > +0.25 | `FAVORABLE` |
| > +0.05 | `MILD_POSITIVE` |
| > -0.05 | `NEUTRAL` |
| > -0.25 | `MILD_NEGATIVE` |
| ≤ -0.25 | `UNFAVORABLE` |
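
The label table maps onto a simple threshold cascade. A minimal sketch follows; the function name `advisory_label` is hypothetical and the strict `>` comparisons are taken from the table as written.

```python
def advisory_label(score: float) -> str:
    """Map an advisory score to its label using the thresholds above."""
    if score > 0.25:
        return "FAVORABLE"
    if score > 0.05:
        return "MILD_POSITIVE"
    if score > -0.05:
        return "NEUTRAL"
    if score > -0.25:
        return "MILD_NEGATIVE"
    return "UNFAVORABLE"


labels = [advisory_label(s) for s in (0.30, 0.10, 0.0, -0.10, -0.30)]
```

Scores that land exactly on a boundary fall into the lower label, so `0.25` is `MILD_POSITIVE` and `-0.25` is `UNFAVORABLE`.
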
### 4.3 Weight Rationale
- **liq_hour (30%)**: More granular than session (3h vs 4h buckets, continuous). Captures EMEA-pm/US-open sweet spot cleanly.
- **DoW (30%)**: Strongest calendar factor in the data. Mon–Thu split is statistically robust (n=77–115).
- **Session (25%)**: Corroborates liq_hour. LONDON_MORNING/NY_AFTERNOON signal strong.
- **Slot 15m (10%)**: Useful signal but most slots have n < 10. Low weight appropriate until more data.
- **Cell DoW×Session (5%)**: Sun×LDN 85% WR is real but n=13 — kept at 5% to avoid overfitting.
---
## 5. Files Inventory
| File | Purpose | Status |
|------|---------|--------|
| `Observability/esof_advisor.py` | Advisory daemon + importable `get_advisory()` | Active, v2 |
| `Observability/dolphin_status.py` | Status panel — reads `esof_advisor_latest` from HZ | Wired (reads only) |
| `external_factors/esoteric_factors_service.py` | `MarketIndicators` — real weighted hours, moon, mercury | Source of truth |
| `external_factors/esof_prefect_flow.py` | Pushes astro data to HZ | Dormant (nothing consumes it) |
| `prod/tests/test_esof_advisor.py` | 55-test suite (9 classes) | All passing (28s) |
| CH: `dolphin.esof_advisory` | Time-series advisory archive | Active, 90-day TTL |
### CH Table Schema
```sql
CREATE TABLE IF NOT EXISTS dolphin.esof_advisory (
    ts                  DateTime64(3, 'UTC'),
    dow                 UInt8,
    dow_name            LowCardinality(String),
    hour_utc            UInt8,
    slot_15m            String,
    session             LowCardinality(String),
    moon_illumination   Float32,
    moon_phase          LowCardinality(String),
    mercury_retrograde  UInt8,
    pop_weighted_hour   Float32,
    liq_weighted_hour   Float32,
    market_cycle_pos    Float32,
    fib_strength        Float32,
    slot_wr_pct         Float32,
    slot_net_pnl        Float32,
    session_wr_pct      Float32,
    session_net_pnl     Float32,
    dow_wr_pct          Float32,
    dow_net_pnl         Float32,
    advisory_score      Float32,
    advisory_label      LowCardinality(String)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(ts)
ORDER BY ts
TTL toDateTime(ts) + toIntervalDay(90);
```
---
## 6. HZ Integration
- **Key**: `DOLPHIN_FEATURES['esof_advisor_latest']`
- **Format**: JSON string (all fields from `compute_esof()` return dict)
- **Write cadence**: Every 15 seconds by daemon; CH every 5 minutes
- **Reading** (in `dolphin_status.py`):
```python
esof = _get(hz, "DOLPHIN_FEATURES", "esof_advisor_latest")
```
Falls back to `"(start esof_advisor.py for advisory)"` when absent.
---
## 7. Starting the Daemon
```bash
source /home/dolphin/siloqy_env/bin/activate
python Observability/esof_advisor.py
# Options:
# --once compute once and exit
# --interval N seconds between updates (default 15)
# --no-hz skip HZ write
# --no-ch skip CH write
```
Daemon PID on last start: 2417597 (2026-04-19).
---
## 8. Test Suite — `prod/tests/test_esof_advisor.py`
55 tests, 9 classes, all passing (28.36s run, 2026-04-19).
| Class | Tests | What it covers |
|-------|-------|----------------|
| `TestComputeEsofSchema` | 5 | All required keys present, score in [-1,+1], labels valid |
| `TestSessionClassification` | 5 | Boundary conditions for all 5 sessions |
| `TestWeightedHours` | 4 | Pop/liq hour in [0,24), ordering, monotone liq |
| `TestAdvisoryScoring` | 7 | Best/worst cell ordering, Mon<Tue, Sun>Mon, NY_AFT negative |
| `TestExpectancyTables` | 6 | Table integrity: all WR in [0,100], net aligned with WR |
| `TestMoonApproximation` | 4 | Phase labels, new moon Apr 17, full moon Apr 2, illumination range |
| `TestPublicAPI` | 3 | `get_advisory()` returns same schema, `--once` flag, daemon args |
| `TestHZIntegration` | 8 | HZ write/read roundtrip (skipped if HZ unavailable) |
| `TestCHIntegration` | 13 | CH insert/query/TTL (skipped if CH unavailable) |
Key test fixtures used:
| Fixture | datetime UTC | Why |
|---------|-------------|-----|
| `sun_london` | Sun 10:00 | Best expected cell (WR 85%) |
| `thu_ovlp` | Thu 15:00 | Thu OVLP catastrophic cell |
| `sun_ny` | Sun 18:00 | Sun NY_AFT 6% WR inverse signal |
| `mon_asia` | Mon 03:00 | Mon worst day |
| `tue_asia` | Tue 03:00 | Tue vs Mon comparison |
| `midday_win` | Tue 12:30 | liq 12–15h best bucket |
---
## 9. Known Limitations and Research Notes
### 9.1 DoW × Slot Interaction (not modeled)
The current model treats DoW and Slot as **independent factors** (additive). This is incorrect in at least one known case: slot 15:00 has WR=70% overall (the best slot by avg_pnl), but Thursday 15:00 is known to be catastrophic in context (Thu×LN_NY_OVERLAP cell = -$3,310). The additive model would give Thu 15:00 a *positive* slot score (+1.32) while the DoW/cell scores pull it negative — net result is weakly positive, which understates the risk.
**Future work**: Model DoW×Slot joint distribution when n ≥ 10 per cell (requires ~2,000 more trades).
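
The Thursday 15:00 case can be worked through with the section 4.1 formula and the section 3 tables. This is an illustrative calculation rather than daemon output, and it assumes the liq 15–18h bucket applies (liq_hour ≈ UTC + 0.98h).

```python
BASELINE_WR = 43.7


def norm(wr_pct: float) -> float:
    return (wr_pct - BASELINE_WR) / 20.0


# Thursday 15:00 UTC inputs, all taken from section 3.
slot_score = norm(70.0)   # slot 15:00, n = 10
dow_score = norm(44.3)    # Thu
sess_score = norm(45.6)   # LN_NY_OVERLAP
liq_score = norm(43.4)    # liq 15-18h bucket
cell_bonus = (41.0 - BASELINE_WR) / 100.0 * 0.3  # Thu x LN_NY_OVERLAP

thu_1500 = (liq_score * 0.30 + sess_score * 0.25 + dow_score * 0.30
            + slot_score * 0.10 + cell_bonus * 0.05)

# Share of the total that comes from the slot term alone.
slot_contribution = slot_score * 0.10
```

The slot term supplies roughly 0.13 of a total near +0.16, while the cell term is about -0.0004. Because the model is additive and uses win rates only, the -$3,310 net loss of the Thu × LN_NY_OVERLAP cell barely registers.
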
### 9.2 Sample Size Caveats
| Factor | Min cell n | Confidence |
|--------|-----------|------------|
| Session | 71 (LOW_LIQ) | High |
| DoW | 77 (Tue) | High |
| liq_hour 3h | 62 (6-9h) | Medium-High |
| DoW×Session | 13 (Sun×LDN) | Medium |
| Slot 15m | 4–19 | Low–Medium |
Rules of thumb: session + DoW patterns are reliable. Slot patterns are directional hints only until n ≥ 30.
### 9.3 Mercury Retrograde
Most recent period: 2026-03-07 → 2026-03-30 (ended). Next: 2026-06-29 → 2026-07-23.
The -0.05 penalty is arbitrary (no empirical basis from the 637 trades — not enough retrograde trades). Retain as a conservative prior.
### 9.4 Fibonacci Time
`fib_strength = 1.0 - min(dist_to_nearest_fib_minute / 30.0, 1.0)`
Currently **not incorporated into the advisory score** (computed but not weighted). No evidence from trade data. Track in CH for future regression.
### 9.5 Market Cycle Position
BTC halving reference: 2024-04-19. Current position: `(days_since % 1461) / 1461.0`. As of 2026-04-19, `days_since` = 730, so the position is 730/1461 ≈ 0.50 (two years post-halving, mid-cycle). Not in advisory score; tracked only.
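
The cycle-position formula can be evaluated as written with the standard library. A small sketch, where the function name `market_cycle_position` mirrors the method name in section 2.2 but the body is an assumption based only on the formula quoted here:

```python
from datetime import date

HALVING_REF = date(2024, 4, 19)  # halving reference date from this section
CYCLE_DAYS = 1461                # four years including one leap day


def market_cycle_position(today: date) -> float:
    """Fraction of the 4-year cycle elapsed since the reference halving."""
    days_since = (today - HALVING_REF).days
    return (days_since % CYCLE_DAYS) / CYCLE_DAYS


days_since = (date(2026, 4, 19) - HALVING_REF).days
position = market_cycle_position(date(2026, 4, 19))
```

For 2026-04-19 this gives 730 elapsed days and a position just under 0.50.
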
### 9.6 tradfi_open Flags
`MarketIndicators.get_regional_times()` returns `is_tradfi_open` per region. This signal is not yet used in scoring. Hypothesis: periods when 2+ major TradFi regions are simultaneously open may have better fill quality. Wire and test once more data exists.
---
## 10. Future Wiring Into BLUE Engine
**DO NOT wire until validated with more data.** The following describes the intended integration, NOT current state.
### Proposed gating logic (research phase):
```python
# In esf_alpha_orchestrator._try_entry() (FUTURE ONLY, not current behavior)
advisory = get_advisory()  # from esof_advisor.py
if advisory["advisory_label"] == "UNFAVORABLE":
    # Option A: skip the entry entirely
    return None
    # Option B (alternative to A): keep the entry but halve the sizing
    # size_mult *= 0.5
```
### Preconditions before wiring:
1. Accumulate ≥ 1,500 trades across all sessions/DoW (currently 637)
2. DoW×Slot interaction modeled or explicitly neutralized
3. NY_AFTERNOON pattern holds on next 500 trades (current WR=35.4% robust across all 127 trades, so likely durable)
4. Backtest: filter UNFAVORABLE periods → measure ROI uplift vs full universe
5. Unit test: advisory gate does not block >20% of entry opportunities
### Suggested first gate (lowest risk):
Block entries when **all three** hold simultaneously:
- `dow in (0, 3)` (Mon or Thu)
- `session == "NY_AFTERNOON"`
- `advisory_score < -0.25`
This is the intersection of the three worst factors, blocking the highest-conviction negative cells only.
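
The three-way condition can be written as a pure predicate. A minimal sketch for research use; `first_gate_blocks` is a hypothetical name and nothing here is wired into the engine.

```python
def first_gate_blocks(dow: int, session: str, advisory_score: float) -> bool:
    """True only when all three conditions of the suggested first gate hold.

    dow follows Python's weekday() convention: 0 = Mon, 3 = Thu.
    """
    return (
        dow in (0, 3)
        and session == "NY_AFTERNOON"
        and advisory_score < -0.25
    )


blocked = first_gate_blocks(3, "NY_AFTERNOON", -0.30)      # Thu, all three hold
not_blocked = first_gate_blocks(1, "NY_AFTERNOON", -0.30)  # Tue, DoW fails
```

Keeping the gate free of I/O makes it easy to replay against historical trades before any wiring decision.
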
---
## 11. Update Cadence
Refresh `SLOT_STATS`, `SESSION_STATS`, `DOW_STATS`, `LIQ_HOUR_STATS`, `DOW_SESSION_STATS` in `esof_advisor.py` from fresh CH aggregates, using queries such as:
```sql
-- Pull fresh session stats from CH:
SELECT session,
count() as trades,
round(100.0 * countIf(pnl > 0) / count(), 1) as wr_pct,
round(sum(pnl), 2) as net_pnl,
round(avg(pnl), 2) as avg_pnl
FROM dolphin.trade_events
WHERE strategy = 'blue'
GROUP BY session
ORDER BY session;
-- DoW stats:
SELECT toDayOfWeek(ts) - 1 as dow, -- 0=Mon in Python weekday()
count(), round(100*countIf(pnl>0)/count(),1), round(sum(pnl),2), round(avg(pnl),2)
FROM dolphin.trade_events WHERE strategy='blue'
GROUP BY dow ORDER BY dow;
-- 15m slot stats (n>=5):
-- '%R' renders 24-hour HH:MM. '%M' is avoided because recent ClickHouse
-- versions can interpret it as the month name rather than the minute.
SELECT slot_15m, count(), round(100*countIf(pnl>0)/count(),1), round(sum(pnl),2), round(avg(pnl),2)
FROM (
    SELECT toStartOfFifteenMinutes(ts) as slot_ts,
           formatDateTime(slot_ts, '%R') as slot_15m,
           pnl
    FROM dolphin.trade_events WHERE strategy='blue'
)
GROUP BY slot_15m HAVING count() >= 5
ORDER BY slot_15m;
```
Suggested refresh: when cumulative trade count crosses 1000, 1500, 2000.
---
## 12. Gate Strategy Empirical Testing — 2026-04-20
### 12.1 Test Infrastructure
Four new files created:
| File | Purpose |
|------|---------|
| `Observability/esof_gate.py` | Pure gate strategy functions (no I/O). `GateResult` dataclass: action, lev_mult, reason, s6_mult, irp_params |
| `prod/tests/test_esof_gate_strategies.py` | CH-based strategy simulation + 39 unit tests, all passing |
| `prod/tests/test_esof_overfit_guard.py` | 24 industry-standard overfitting avoidance tests (6 intentionally fail — guard working) |
| `prod/tests/run_esof_backtest_sim.py` | 56-day gold-engine simulation over vbt_cache parquets |
### 12.2 Clean Alpha Exit Definition
For all strategy testing, only **FIXED_TP** and **MAX_HOLD** exits are counted. Excluded:
- `HIBERNATE_HALT` — forced position close, not alpha signal
- `SUBDAY_ACB_NORMALIZATION` — control-plane forced, not alpha-driven
This reduces the 588-trade raw CH dataset to **549 clean alpha trades**.
### 12.3 Strategies Tested (A–F)
| ID | Strategy | Mechanism |
|----|----------|-----------|
| A | `LEV_SCALE` | Scale leverage by advisory score: FAVORABLE→1.2×, MILD_POS→1.0×, NEUTRAL→0.8×, MILD_NEG→0.6×, UNFAVORABLE→0.5× |
| B | `HARD_BLOCK` | Block entry when `advisory_label == "UNFAVORABLE"` |
| C | `DOW_BLOCK` | Block when `dow in (0, 3)` (Mon, Thu) |
| D | `SESSION_BLOCK` | Block when `session == "NY_AFTERNOON"` |
| E | `COMBINED` | Block when UNFAVORABLE **or** (Mon/Thu **and** NY_AFTERNOON) |
| F | `S6_BUCKET` | Per-bucket sizing multipliers keyed by EsoF label (5 labels × 7 buckets). Widened FAVORABLE, zeroed UNFAVORABLE buckets |
Counterfactual PnL methodology: `cf_pnl = actual_pnl × lev_mult` (linear scaling; valid only for FIXED_TP and MAX_HOLD exits where leverage scales linearly with PnL).
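
The linear counterfactual can be sketched for Strategy A using the multipliers from the table above. This is an illustration, not the code in `test_esof_gate_strategies.py`: the trade-dict layout and the function name are assumptions. The sample rows are the three UNFAVORABLE trades logged on 2026-04-20 in section 12.9.

```python
# Strategy A leverage multipliers, keyed by advisory label.
LEV_SCALE = {
    "FAVORABLE": 1.2, "MILD_POSITIVE": 1.0, "NEUTRAL": 0.8,
    "MILD_NEGATIVE": 0.6, "UNFAVORABLE": 0.5,
}
CLEAN_EXITS = {"FIXED_TP", "MAX_HOLD"}  # clean alpha exits (section 12.2)


def counterfactual_pnl(trades: list) -> float:
    """Sum of actual_pnl x lev_mult over clean-exit trades only."""
    total = 0.0
    for trade in trades:
        if trade["exit"] not in CLEAN_EXITS:
            continue  # forced exits are excluded from strategy testing
        total += trade["pnl"] * LEV_SCALE[trade["label"]]
    return total


live_trades = [
    {"symbol": "XRPUSDT", "pnl": -91.0, "exit": "MAX_HOLD", "label": "UNFAVORABLE"},
    {"symbol": "TRXUSDT", "pnl": -109.0, "exit": "MAX_HOLD", "label": "UNFAVORABLE"},
    {"symbol": "CELRUSDT", "pnl": -355.0, "exit": "MAX_HOLD", "label": "UNFAVORABLE"},
]
cf_total = counterfactual_pnl(live_trades)
```

Halving leverage on these three trades halves the combined loss, from -$555 to -$277.50.
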
---
### 12.4 Posture Clarification — BLUE Is Effectively APEX-Only
User confirmed, code verified. Live BLUE posture distribution from CH:
```
APEX: 586 trades (99.8%)
STALKER: 1 trade (0.2%)
TURTLE: 0
HIBERNATE: 0
```
`dolphin_actor.py` reads posture from HZ `DOLPHIN_SAFETY`. STALKER applies a 2.0× leverage ceiling but does not block entries. TURTLE/HIBERNATE set `regime_dd_halt = True` (blocks entries for the day) — but these states occur essentially never in the current deployment window.
**Implication**: The live CH trade session/DoW distribution is NOT shaped by posture transitions. The session distribution is a genuine trading behavior signal.
---
### 12.5 56-Day Gold Backtest — Why It Is Invalid for EsoF Session Analysis
`run_esof_backtest_sim.py` ran the gold-spec `LiquidationGuardEngine` over 56 vbt_cache parquet days (2025-12-31 → 2026-02-26). Gold match: **2155 trades, ROI=+190.19%** (confirming engine correctness).
Session distribution in backtest:
| Session | n | % |
|---------|---|---|
| ASIA_PACIFIC | 2120 | **98.4%** |
| All others | 35 | 1.6% |
**Root cause**: vbt_cache parquets are 10-second bars (~8208 bars/day). Engine lookback (~100 bars) completes in **~17 minutes** from midnight. Entries fire at ~00:17 UTC (hour 0 = ASIA_PACIFIC). Single-position-per-asset plus MAX_HOLD=125 bars (~21 min) means 98% of all trades fire within the first hour of the day, before ASIA_PACIFIC ends at 08:00 UTC.
Confirmed by direct inspection: `entry_ts.hour == 0` for 2108/2155 trades.
**Consequence**: The backtest's EsoF label distribution is:
| Label | n | Share |
|-------|---|------|
| NEUTRAL | 863 | 40.0% |
| MILD_POSITIVE | 668 | 31.0% |
| MILD_NEGATIVE | 326 | 15.1% |
| UNFAVORABLE | 297 | 13.8% |
| **FAVORABLE** | **1** | **0.05%** |
With FAVORABLE=1 and score range [-0.33, +0.27] (mean≈0), there is no variance for EsoF signals to differentiate. Walk-forward test: top-quartile score WR=50.2% vs bottom=52.0% — zero predictive power. This is expected when the entire dataset is concentrated at midnight/ASIA_PACIFIC.
**The 56-day backtest is NOT a valid testbed for EsoF strategy evaluation.** It is valid only for confirming gold-engine mechanics (which it does: ROI=+190.19%).
The live CH 549-trade dataset (spread across all sessions/DoW) is the primary evidence base.
**Critical caveat on the backtest ROI itself**: because all trades fire at ~00:17 UTC, the backtest is testing "midnight-only BLUE" — not live BLUE. Live BLUE hour-0 entry performance: WR=55%, avg_pnl=-$3.92 (negative avg). The backtest +190.19% ≈ live gold +189.48% is numerically consistent, but this coincidence could mask canceling biases. The backtest validates that the vel_div signal produces positive EV and that engine mechanics are consistent; it does NOT validate the exact ROI figure under live intraday conditions. The backtest cannot account for the intraday session/DoW effects that EsoF is designed to capture — this is precisely the limitation that motivated the EsoF project in the first place.
---
### 12.6 CH-Based Strategy Results (549 Clean Alpha Trades)
Baseline: WR=47.4%, Net=+$3,103
| Strategy | T_exec | T_blk | CF Net | ΔPnL |
|----------|--------|-------|--------|------|
| A: LEV_SCALE | 549 | 0 | +$3,971 | **+$868** |
| B: HARD_BLOCK | 490 | 59 | +$5,922 | **+$2,819** |
| C: DOW_BLOCK | 375 | 174 | +$3,561 | +$458 |
| D: SESSION_BLOCK | 422 | 127 | +$6,960 | **+$3,857** |
| E: COMBINED | 340 | 209 | +$7,085 | **+$3,982** |
Note: Strategy F (S6_BUCKET) is separately treated in §12.7.
---
### 12.7 FAVORABLE vs UNFAVORABLE — Statistical Evidence
From 588 CH trades (all clean exits), EsoF label performance:
| Label | n | WR% | Net PnL | Avg/trade |
|-------|---|-----|---------|-----------|
| FAVORABLE | 84 | **78.6%** | +$11,889 | +$141.54 |
| MILD_POSITIVE | 190 | 55.8% | +$1,620 | +$8.53 |
| NEUTRAL | 93 | 24.7% | -$5,574 | -$59.94 |
| MILD_NEGATIVE | 162 | 42.6% | -$1,937 | -$11.96 |
| UNFAVORABLE | 59 | **28.8%** | -$2,819 | -$47.78 |
**FAVORABLE vs UNFAVORABLE statistical test:**
| Metric | Value |
|--------|-------|
| FAVORABLE wins/losses | 66 / 18 |
| UNFAVORABLE wins/losses | 17 / 42 |
| Odds ratio | **9.06×** |
| Cohen's h | **1.046** (large, threshold ≥ 0.80) |
| χ² (df=1) | **35.23** (p < 0.0001; critical value at p<0.001 = 10.83) |
**This is statistically robust.** The FAVORABLE/UNFAVORABLE split is not noise at n=143 (84 FAVORABLE + 59 UNFAVORABLE trades).
Strategy A on UNFAVORABLE at 0.5× leverage: saves ~$1,409 vs actual -$2,819.
Hard block of UNFAVORABLE: saves $2,819 (full elimination of the negative label bucket).
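
The three statistics in the table can be reproduced from the win/loss counts with the standard library alone. The sketch below uses the Pearson χ² for a 2×2 table without continuity correction, which is the variant that matches the reported 35.23; whether the test suite computes it the same way is an assumption.

```python
import math

# Win/loss counts from the table above.
fav_w, fav_l = 66, 18   # FAVORABLE
unf_w, unf_l = 17, 42   # UNFAVORABLE
n = fav_w + fav_l + unf_w + unf_l

# Odds ratio: odds of a win under FAVORABLE vs under UNFAVORABLE.
odds_ratio = (fav_w / fav_l) / (unf_w / unf_l)

# Cohen's h: difference of arcsine-transformed win rates.
p_fav = fav_w / (fav_w + fav_l)
p_unf = unf_w / (unf_w + unf_l)
cohens_h = 2 * math.asin(math.sqrt(p_fav)) - 2 * math.asin(math.sqrt(p_unf))

# Pearson chi-square for a 2x2 table, no continuity correction.
chi2 = (n * (fav_w * unf_l - fav_l * unf_w) ** 2
        / ((fav_w + fav_l) * (unf_w + unf_l) * (fav_w + unf_w) * (fav_l + unf_l)))
```

The counts sum to 143 trades across the two labels.
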
---
### 12.8 The NEUTRAL Label Anomaly
NEUTRAL (score between -0.05 and +0.05) shows WR=24.7% — worse than UNFAVORABLE (28.8%). This is counterintuitive.
Investigation:
- All 93 NEUTRAL trades are from **April 2026** (the current month)
- NEUTRAL ASIA_PACIFIC subset: WR=14.7% (n=34)
- Score range: -0.048 to +0.049
**Interpretation**: A score near zero does NOT mean "safe middle ground." It means the positive and negative calendar signals are **canceling each other** — signal conflict. In the current April 2026 market regime, that conflict is associated with the worst outcomes. "Mixed signals = proceed with caution" is the correct read.
This is not a scoring bug. The advisory score near 0 should be treated with the same caution as MILD_NEGATIVE, not as a neutral baseline. Consider re-labeling NEUTRAL to "UNCLEAR" in future documentation to avoid miscommunication.
Month breakdown of labels:
| Month | FAVORABLE | MILD_POS | NEUTRAL | MILD_NEG | UNFAVORABLE |
|-------|-----------|----------|---------|----------|-------------|
| 2026-03 | 7 | 4 | 0 | 0 | 0 |
| 2026-04 | 77 | 186 | 93 | 162 | 59 |
March data is sparse (11 trades). The full analysis is effectively April 2026.
---
### 12.9 Live Real-Time Validation — 2026-04-20
Three trades observed in-session, all during `advisory_label = "UNFAVORABLE"` (Monday × LONDON_MORNING, 08:45–09:40 UTC):
```
XRPUSDT ep:1.412 lev:9.00x pnl:-$91 exit:MAX_HOLD bars:125 08:45 UTC
TRXUSDT ep:0.3295 lev:9.00x pnl:-$109 exit:MAX_HOLD bars:125 09:15 UTC
CELRUSDT ep:0.002548 lev:9.00x pnl:-$355 exit:MAX_HOLD bars:125 09:40 UTC
```
Combined actual loss: **-$555**
At Strategy A (0.5× on UNFAVORABLE): counterfactual loss ≈ **-$277** (saves $278)
At Strategy B (hard block): **$0 loss** (saves $555)
This is consistent with UNFAVORABLE WR=28.8% and avg=-$47.78. Three MAX_HOLD losses in a row during a confirmed UNFAVORABLE window is the expected behavior, not an anomaly.
---
### 12.10 Overfitting Guard Summary
`prod/tests/test_esof_overfit_guard.py` — 24 tests, 9 classes.
From the 549-trade CH dataset:
| Test | Result | Verdict |
|------|--------|---------|
| NY_AFT permutation p-value | 0.035 | Significant (p<0.05) |
| NY_AFT net PnL 95% CI | [-$6,459, -$655] | Net loser, CI excludes 0 |
| NY_AFT Cohen's h | 0.089 | Trivial — loss is magnitude, not WR |
| Monday permutation p-value | 0.226 | Underpowered (n=34 in H1) |
| Walk-forward score→WR | Top-Q H2 WR=73.5% vs Bot=35.3% | **Strong** |
| FAVORABLE vs UNFAVORABLE χ² | 35.23 | p < 0.0001 |
6 tests intentionally fail (the guard is working — they flag genuine limitations):
- Bonferroni z-scores on per-cell WR do not clear threshold at n=549
- Bootstrap CI on NY_AFT WR overlaps baseline WR
- Cohen's h for NY_AFT WR is trivial (loss is from outlier magnitude trades)
These are not bugs. They represent real data limitations. Do not patch them to pass.
---
### 12.11 Recommendation (as of 2026-04-20)
**Wire Strategy A (LEV_SCALE) as the first live gate.** Rationale:
1. χ²=35.23 (p<0.0001) on FAVORABLE/UNFAVORABLE is robust at current sample size
2. Cohen's h=1.046 is a large effect — not a marginal signal
3. Strategy A is soft (leverage reduction, no hard blocks) — runs BLUE ungated by default, calibrates EsoF tables from all trades
4. Live 2026-04-20 observation (3 UNFAVORABLE MAX_HOLD losses) confirms the signal in real time
**Do NOT wire hard block (Strategy B/D/E) yet.** The walk-forward WR separation for NEUTRAL and MILD_NEGATIVE is not yet confirmed robust. Hard blocks increase regime sensitivity.
**Feedback loop protocol** (must not be violated):
- Always run BLUE **ungated** for base signal collection
- EsoF calibration tables (`SESSION_STATS`, `DOW_STATS`, etc.) updated ONLY from ungated trades
- Gate evaluated on out-of-sample ungated data — never feed gated trades back into calibration
- If Strategy A is wired: evaluate its counterfactual on ungated trades only, not on the leverage-adjusted subset
**Preconditions to upgrade to Strategy B (hard block):**
1. n ≥ 1,000 clean alpha trades with UNFAVORABLE label
2. UNFAVORABLE WR remains ≤ 35% at the new n
3. Walk-forward on separate 90-day window confirms WR separation
4. No regime break identified (e.g., FAVORABLE WR degrading to <60% would trigger review)