Includes core prod + GREEN/BLUE subsystems: - prod/ (BLUE harness, configs, scripts, docs) - nautilus_dolphin/ (GREEN Nautilus-native impl + dvae/ preserved) - adaptive_exit/ (AEM engine + models/bucket_assignments.pkl) - Observability/ (EsoF advisor, TUI, dashboards) - external_factors/ (EsoF producer) - mc_forewarning_qlabs_fork/ (MC regime/envelope) Excludes runtime caches, logs, backups, and reproducible artifacts per .gitignore.
27 KiB
Executable File
EsoF — Esoteric Factors: Current State & Research Findings
As of: 2026-04-20 | Trade sample: 588 clean alpha trades (2026-03-31 → 2026-04-20) | Backtest: 2155 trades (2025-12-31 → 2026-02-26)
1. What "EsoF" Actually Refers To (Disambiguation)
The name "EsoF" (Esoteric Factors) attaches to two entirely separate systems in the Dolphin codebase. Do not conflate them.
1A. The Hazard Multiplier (set_esoteric_hazard_multiplier)
Located in esf_alpha_orchestrator.py. Modulates base_max_leverage downward:
effective_base = base_max_leverage × (1.0 - hazard_mult × factor)
Current gold spec: hazard_mult = 0.0 permanently. This means the hazard multiplier is always at zero — it reduces nothing, touches nothing. The parameter exists in the engine but is inert.
- Gold backtest ran with
hazard_mult=0.0. - Do not change this without running a full backtest comparison.
- The
esof_prefect_flow.pycomputes astrological factors and pushes them to HZ, but nothing in the trading engine reads or consumes this output. The flow is dormant as an engine input.
1B. The Advisory System (Observability/esof_advisor.py)
A standalone advisory layer — not wired into BLUE. Built from 637 live trades. Computes session/DoW/slot/liq_hour expectancy and publishes an advisory score every 15 seconds to HZ and CH.
2. MarketIndicators — external_factors/esoteric_factors_service.py
The MarketIndicators class computes several temporal signals used by the advisory layer.
2.1 Regions Table
| Region | Population (M) | Liq Weight | Major centers |
|---|---|---|---|
| Americas | 1,000 | 0.35 | NYSE, CME |
| EMEA | 2,200 | 0.30 | LSE, Frankfurt, ECB |
| South_Asia | 1,400 | 0.05 | BSE, NSE |
| East_Asia | 1,600 | 0.20 | TSE, HKEX, SGX |
| Oceania_SEA | 800 | 0.10 | ASX, SGX |
2.2 Computed Signals
| Method | Returns | Notes |
|---|---|---|
get_weighted_times(now) |
(pop_hour, liq_hour) |
Circular weighted average using sin/cos of each region's local hour |
get_liquidity_session(now) |
session string | Step function on UTC hour |
get_regional_times(now) |
dict per region | local_hour + is_tradfi_open flag |
is_tradfi_open(now) |
bool | Weekday 0–4, hour 9–17 local |
get_moon_phase(now) |
phase + illumination | Via astropy (ephem backend) |
is_mercury_retrograde(now) |
bool | Hardcoded period list |
get_fibonacci_time(now) |
strength float | Distance to nearest Fibonacci minute |
get_market_cycle_position(now) |
0.0–1.0 | BTC halving 4-year cycle reference |
2.3 Weighted Hour Properties
- pop_weighted_hour: Population-weighted centroid ≈ UTC + 4.21h (South_Asia + East_Asia heavily weighted). Rotates strongly with East_Asian trading day opening.
- liq_weighted_hour: Liquidity-weighted centroid ≈ UTC + 0.98h (Americas 35% dominant). Nearly linear monotone with UTC — adds granularity but does not reveal fundamentally different patterns from raw UTC sessions.
- Fallback (if astropy not installed):
pop ≈ (UTC + 4.21) % 24,liq ≈ (UTC + 0.98) % 24 - astropy 7.2.0 is installed in siloqy_env (installed 2026-04-19).
3. Trade Analysis — 637 Trades (2026-03-31 → 2026-04-19)
Baseline: WR = 43.7%, net = +$172.45 across all 637 trades.
3.1 Session Expectancy
| Session | Trades | WR% | Net PnL | Avg/trade |
|---|---|---|---|---|
| LONDON_MORNING (08–13h UTC) | 111 | 47.7% | +$4,133 | +$37.23 |
| ASIA_PACIFIC (00–08h UTC) | 182 | 46.7% | +$1,600 | +$8.79 |
| LN_NY_OVERLAP (13–17h UTC) | 147 | 45.6% | -$895 | -$6.09 |
| LOW_LIQUIDITY (21–24h UTC) | 71 | 39.4% | -$809 | -$11.40 |
| NY_AFTERNOON (17–21h UTC) | 127 | 35.4% | -$3,857 | -$30.37 |
NY_AFTERNOON is a systematic loser across all days. LONDON_MORNING is the cleanest positive session.
3.2 Day-of-Week Expectancy
| DoW | Trades | WR% | Net PnL | Avg/trade |
|---|---|---|---|---|
| Mon | 81 | 27.2% | -$1,054 | -$13.01 |
| Tue | 77 | 54.5% | +$3,824 | +$49.66 |
| Wed | 98 | 43.9% | -$385 | -$3.93 |
| Thu | 115 | 44.3% | -$4,017 | -$34.93 |
| Fri | 106 | 39.6% | -$1,968 | -$18.57 |
| Sat | 82 | 43.9% | +$43 | +$0.53 |
| Sun | 78 | 53.8% | +$3,730 | +$47.82 |
Monday is the worst trading day (WR 27.2% — avoid). Thursday is large-loss despite median WR (heavy net damage from LN_NY_OVERLAP cell). Tuesday and Sunday are positive outliers.
3.3 Liquidity-Hour Expectancy (3h Buckets, liq_hour ≈ UTC + 0.98h)
| liq_hour bucket | Trades | WR% | Net PnL | Avg/trade | Approx UTC |
|---|---|---|---|---|---|
| 0–3h | 70 | 51.4% | +$1,466 | +$20.9 | 23–2h |
| 3–6h | 73 | 46.6% | -$1,166 | -$16.0 | 2–5h |
| 6–9h | 62 | 41.9% | +$1,026 | +$16.5 | 5–8h |
| 9–12h | 65 | 43.1% | +$476 | +$7.3 | 8–11h |
| 12–15h | 84 | 52.4% | +$3,532 | +$42.0 | 11–14h ★ BEST |
| 15–18h | 113 | 43.4% | -$770 | -$6.8 | 14–17h |
| 18–21h | 99 | 35.4% | -$2,846 | -$28.8 | 17–20h ✗ WORST |
| 21–24h | 72 | 36.1% | -$1,545 | -$21.5 | 20–23h |
liq 12–15h (EMEA afternoon + US open) is the standout best bucket. liq 18–21h mirrors NY_AFTERNOON perfectly and is the worst.
3.4 DoW × Session Heatmap — Notable Cells
Full 5×7 grid (not all cells have enough data — cells with n < 5 omitted):
| DoW × Session | Trades | WR% | Net PnL | Label |
|---|---|---|---|---|
| Sun × LONDON_MORNING | 13 | 85.0% | +$2,153 | ★ BEST CELL |
| Sun × LN_NY_OVERLAP | 24 | 75.0% | +$2,110 | 2nd best |
| Tue × ASIA_PACIFIC | 27 | 67.0% | +$2,522 | 3rd |
| Tue × LN_NY_OVERLAP | 18 | 56.0% | +$2,260 | 4th |
| Sun × NY_AFTERNOON | 17 | 6.0% | -$1,025 | ✗ WORST CELL |
| Mon × ASIA_PACIFIC | 21 | 19.0% | -$411 | avoid |
| Thu × LN_NY_OVERLAP | 27 | 41.0% | -$3,310 | ✗ CATASTROPHIC |
Sun NY_AFTERNOON (6% WR) is a near-perfect inverse signal. Thu LN_NY_OVERLAP has enough trades (27) to be considered reliable — biggest single-cell loss in the dataset.
3.5 15-Minute Slot Highlights (n ≥ 5)
Top positive slots by avg_pnl (n ≥ 5):
| Slot | n | WR% | Net | Avg/trade |
|---|---|---|---|---|
| 15:00 | 10 | 70.0% | +$2,266 | +$226.58 ★ |
| 11:30 | 8 | 87.5% | +$1,075 | +$134.32 |
| 1:30 | 10 | 50.0% | +$1,607 | +$160.67 |
| 13:45 | 10 | 70.0% | +$1,082 | +$108.21 |
| 1:45 | 5 | 80.0% | +$459 | +$91.75 |
Top negative slots:
| Slot | n | WR% | Net | Avg/trade |
|---|---|---|---|---|
| 5:45 | 5 | 40.0% | -$1,665 | -$333.05 ★ |
| 2:15 | 5 | 0.0% | -$852 | -$170.31 |
| 16:30 | 4 | 25.0% | -$2,024 | -$506.01 (n<5) |
| 12:45 | 6 | 16.7% | -$1,178 | -$196.35 |
| 18:00 | 6 | 16.7% | -$1,596 | -$265.93 |
Caveat on slots: Many 15m slots have n = 4–10. Most are noise at current sample size. Weight slot_score low (10%) in composite.
4. Advisory Scoring Model
4.1 Score Formula
sess_score = (sess_wr - 43.7) / 20.0 # normalized [-1, +1]
liq_score = (liq_wr - 43.7) / 20.0
dow_score = (dow_wr - 43.7) / 20.0
slot_score = (slot_wr - 43.7) / 20.0 # if n≥5, else 0.0
cell_bonus = (cell_wr - 43.7) / 100.0 × 0.3 # ±0.30 max
advisory_score = liq_score×0.30 + sess_score×0.25 + dow_score×0.30
+ slot_score×0.10 + cell_bonus×0.05
advisory_score = clamp(advisory_score, -1.0, +1.0)
# Mercury retrograde: additional -0.05 penalty
if mercury_retrograde:
advisory_score = max(-1.0, advisory_score - 0.05)
Denominator 20.0 chosen because observed WR range across all factors is ≈ ±20pp from baseline.
4.2 Labels
| Score range | Label |
|---|---|
| > +0.25 | FAVORABLE |
| > +0.05 | MILD_POSITIVE |
| > -0.05 | NEUTRAL |
| > -0.25 | MILD_NEGATIVE |
| ≤ -0.25 | UNFAVORABLE |
4.3 Weight Rationale
- liq_hour (30%): More granular than session (3h vs 4h buckets, continuous). Captures EMEA-pm/US-open sweet spot cleanly.
- DoW (30%): Strongest calendar factor in the data. Mon–Thu split is statistically robust (n=77–115).
- Session (25%): Corroborates liq_hour. LONDON_MORNING/NY_AFTERNOON signal strong.
- Slot 15m (10%): Useful signal but most slots have n < 10. Low weight appropriate until more data.
- Cell DoW×Session (5%): Sun×LDN 85% WR is real but n=13 — kept at 5% to avoid overfitting.
5. Files Inventory
| File | Purpose | Status |
|---|---|---|
Observability/esof_advisor.py |
Advisory daemon + importable get_advisory() |
Active, v2 |
Observability/dolphin_status.py |
Status panel — reads esof_advisor_latest from HZ |
Wired (reads only) |
external_factors/esoteric_factors_service.py |
MarketIndicators — real weighted hours, moon, mercury |
Source of truth |
external_factors/esof_prefect_flow.py |
Pushes astro data to HZ | Dormant (nothing consumes it) |
prod/tests/test_esof_advisor.py |
55-test suite (9 classes) | All passing (28s) |
CH: dolphin.esof_advisory |
Time-series advisory archive | Active, 90-day TTL |
CH Table Schema
CREATE TABLE IF NOT EXISTS dolphin.esof_advisory (
ts DateTime64(3, 'UTC'),
dow UInt8,
dow_name LowCardinality(String),
hour_utc UInt8,
slot_15m String,
session LowCardinality(String),
moon_illumination Float32,
moon_phase LowCardinality(String),
mercury_retrograde UInt8,
pop_weighted_hour Float32,
liq_weighted_hour Float32,
market_cycle_pos Float32,
fib_strength Float32,
slot_wr_pct Float32,
slot_net_pnl Float32,
session_wr_pct Float32,
session_net_pnl Float32,
dow_wr_pct Float32,
dow_net_pnl Float32,
advisory_score Float32,
advisory_label LowCardinality(String)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(ts)
ORDER BY ts
TTL toDateTime(ts) + toIntervalDay(90);
6. HZ Integration
- Key:
DOLPHIN_FEATURES['esof_advisor_latest'] - Format: JSON string (all fields from
compute_esof()return dict) - Write cadence: Every 15 seconds by daemon; CH every 5 minutes
- Reading (in
dolphin_status.py):Falls back toesof = _get(hz, "DOLPHIN_FEATURES", "esof_advisor_latest")"(start esof_advisor.py for advisory)"when absent.
7. Starting the Daemon
source /home/dolphin/siloqy_env/bin/activate
python Observability/esof_advisor.py
# Options:
# --once compute once and exit
# --interval N seconds between updates (default 15)
# --no-hz skip HZ write
# --no-ch skip CH write
Daemon PID on last start: 2417597 (2026-04-19).
8. Test Suite — prod/tests/test_esof_advisor.py
55 tests, 9 classes, all passing (28.36s run, 2026-04-19).
| Class | Tests | What it covers |
|---|---|---|
TestComputeEsofSchema |
5 | All required keys present, score in [-1,+1], labels valid |
TestSessionClassification |
5 | Boundary conditions for all 5 sessions |
TestWeightedHours |
4 | Pop/liq hour in [0,24), ordering, monotone liq |
TestAdvisoryScoring |
7 | Best/worst cell ordering, Mon<Tue, Sun>Mon, NY_AFT negative |
TestExpectancyTables |
6 | Table integrity: all WR in [0,100], net aligned with WR |
TestMoonApproximation |
4 | Phase labels, new moon Apr 17, full moon Apr 2, illumination range |
TestPublicAPI |
3 | get_advisory() returns same schema, --once flag, daemon args |
TestHZIntegration |
8 | HZ write/read roundtrip (skipped if HZ unavailable) |
TestCHIntegration |
13 | CH insert/query/TTL (skipped if CH unavailable) |
Key test fixtures used:
| Fixture | datetime UTC | Why |
|---|---|---|
sun_london |
Sun 10:00 | Best expected cell (WR 85%) |
thu_ovlp |
Thu 15:00 | Thu OVLP catastrophic cell |
sun_ny |
Sun 18:00 | Sun NY_AFT 6% WR inverse signal |
mon_asia |
Mon 03:00 | Mon worst day |
tue_asia |
Tue 03:00 | Tue vs Mon comparison |
midday_win |
Tue 12:30 | liq 12–15h best bucket |
9. Known Limitations and Research Notes
9.1 DoW × Slot Interaction (not modeled)
The current model treats DoW and Slot as independent factors (additive). This is incorrect in at least one known case: slot 15:00 has WR=70% overall (the best slot by avg_pnl), but Thursday 15:00 is known to be catastrophic in context (Thu×LN_NY_OVERLAP cell = -$3,310). The additive model would give Thu 15:00 a positive slot score (+1.32) while the DoW/cell scores pull it negative — net result is weakly positive, which understates the risk.
Future work: Model DoW×Slot joint distribution when n ≥ 10 per cell (requires ~2,000 more trades).
9.2 Sample Size Caveats
| Factor | Min cell n | Confidence |
|---|---|---|
| Session | 71 (LOW_LIQ) | High |
| DoW | 77 (Tue) | High |
| liq_hour 3h | 62 (6-9h) | Medium-High |
| DoW×Session | 13 (Sun×LDN) | Medium |
| Slot 15m | 4–19 | Low–Medium |
Rules of thumb: session + DoW patterns are reliable. Slot patterns are directional hints only until n ≥ 30.
9.3 Mercury Retrograde
Current period: 2026-03-07 → 2026-03-30 (ended). Next: 2026-06-29 → 2026-07-23. The -0.05 penalty is arbitrary (no empirical basis from the 637 trades — not enough retrograde trades). Retain as a conservative prior.
9.4 Fibonacci Time
fib_strength = 1.0 - min(dist_to_nearest_fib_minute / 30.0, 1.0)
Currently not incorporated into the advisory score (computed but not weighted). No evidence from trade data. Track in CH for future regression.
9.5 Market Cycle Position
BTC halving reference: 2024-04-19. Current position: (days_since % 1461) / 1461.0. As of 2026-04-19 ≈ 365/1461 ≈ 0.25 (1 year post-halving, historically bullish mid-cycle). Not in advisory score — tracked only.
9.6 tradfi_open Flags
MarketIndicators.get_regional_times() returns is_tradfi_open per region. This signal is not yet used in scoring. Hypothesis: periods when 2+ major TradFi regions are simultaneously open may have better fill quality. Wire and test once more data exists.
10. Future Wiring Into BLUE Engine
DO NOT wire until validated with more data. The following describes the intended integration, NOT current state.
Proposed gating logic (research phase):
# In esf_alpha_orchestrator._try_entry() — FUTURE ONLY
advisory = get_advisory() # from esof_advisor.py
if advisory["advisory_label"] == "UNFAVORABLE":
# Option A: skip entry entirely
return None
# Option B: reduce sizing by 50%
size_mult *= 0.5
Preconditions before wiring:
- Accumulate ≥ 1,500 trades across all sessions/DoW (currently 637)
- DoW slot interaction modeled or explicitly neutralized
- NY_AFTERNOON pattern holds on next 500 trades (current WR=35.4% robust across all 127 trades, so likely durable)
- Backtest: filter UNFAVORABLE periods → measure ROI uplift vs full universe
- Unit test: advisory gate does not block >20% of entry opportunities
Suggested first gate (lowest risk):
Block entries when all three hold simultaneously:
dow in (0, 3)(Mon or Thu)session == "NY_AFTERNOON"advisory_score < -0.25
This is the intersection of the three worst factors, blocking the highest-conviction negative cells only.
11. Update Cadence
Update SLOT_STATS, SESSION_STATS, DOW_STATS, LIQ_HOUR_STATS, DOW_SESSION_STATS in esof_advisor.py:
-- Pull fresh session stats from CH:
SELECT session,
count() as trades,
round(100.0 * countIf(pnl > 0) / count(), 1) as wr_pct,
round(sum(pnl), 2) as net_pnl,
round(avg(pnl), 2) as avg_pnl
FROM dolphin.trade_events
WHERE strategy = 'blue'
GROUP BY session
ORDER BY session;
-- DoW stats:
SELECT toDayOfWeek(ts) - 1 as dow, -- 0=Mon in Python weekday()
count(), round(100*countIf(pnl>0)/count(),1), round(sum(pnl),2), round(avg(pnl),2)
FROM dolphin.trade_events WHERE strategy='blue'
GROUP BY dow ORDER BY dow;
-- 15m slot stats (n>=5):
SELECT slot_15m, count(), round(100*countIf(pnl>0)/count(),1), round(sum(pnl),2), round(avg(pnl),2)
FROM (
SELECT toStartOfFifteenMinutes(ts) as slot_ts,
formatDateTime(slot_ts, '%H:%M') as slot_15m,
pnl
FROM dolphin.trade_events WHERE strategy='blue'
)
GROUP BY slot_15m HAVING count() >= 5
ORDER BY slot_15m;
Suggested refresh: when cumulative trade count crosses 1000, 1500, 2000.
12. Gate Strategy Empirical Testing — 2026-04-20
12.1 Test Infrastructure
Three new files created:
| File | Purpose |
|---|---|
Observability/esof_gate.py |
Pure gate strategy functions (no I/O). GateResult dataclass: action, lev_mult, reason, s6_mult, irp_params |
prod/tests/test_esof_gate_strategies.py |
CH-based strategy simulation + 39 unit tests, all passing |
prod/tests/test_esof_overfit_guard.py |
24 industry-standard overfitting avoidance tests (6 intentionally fail — guard working) |
prod/tests/run_esof_backtest_sim.py |
56-day gold-engine simulation over vbt_cache parquets |
12.2 Clean Alpha Exit Definition
For all strategy testing, only FIXED_TP and MAX_HOLD exits are counted. Excluded:
HIBERNATE_HALT— forced position close, not alpha signalSUBDAY_ACB_NORMALIZATION— control-plane forced, not alpha-driven
This reduces the 588-trade raw CH dataset to 549 clean alpha trades.
12.3 Strategies Tested (A–F)
| ID | Strategy | Mechanism |
|---|---|---|
| A | LEV_SCALE |
Scale leverage by advisory score: FAVORABLE→1.2×, MILD_POS→1.0×, NEUTRAL→0.8×, MILD_NEG→0.6×, UNFAVORABLE→0.5× |
| B | HARD_BLOCK |
Block entry when advisory_label == "UNFAVORABLE" |
| C | DOW_BLOCK |
Block when dow in (0, 3) (Mon, Thu) |
| D | SESSION_BLOCK |
Block when session == "NY_AFTERNOON" |
| E | COMBINED |
Block when UNFAVORABLE or (Mon/Thu and NY_AFTERNOON) |
| F | S6_BUCKET |
Per-bucket sizing multipliers keyed by EsoF label (5 labels × 7 buckets). Widened FAVORABLE, zeroed UNFAVORABLE buckets |
Counterfactual PnL methodology: cf_pnl = actual_pnl × lev_mult (linear scaling; valid only for FIXED_TP and MAX_HOLD exits where leverage scales linearly with PnL).
12.4 Posture Clarification — BLUE Is Effectively APEX-Only
User confirmed, code verified. Live BLUE posture distribution from CH:
APEX: 586 trades (99.8%)
STALKER: 1 trade (0.2%)
TURTLE: 0
HIBERNATE: 0
dolphin_actor.py reads posture from HZ DOLPHIN_SAFETY. STALKER applies a 2.0× leverage ceiling but does not block entries. TURTLE/HIBERNATE set regime_dd_halt = True (blocks entries for the day) — but these states occur essentially never in the current deployment window.
Implication: The live CH trade session/DoW distribution is NOT shaped by posture transitions. The session distribution is a genuine trading behavior signal.
12.5 56-Day Gold Backtest — Why It Is Invalid for EsoF Session Analysis
run_esof_backtest_sim.py ran the gold-spec LiquidationGuardEngine over 56 vbt_cache parquet days (2025-12-31 → 2026-02-26). Gold match: 2155 trades, ROI=+190.19% (confirming engine correctness).
Session distribution in backtest:
| Session | n | % |
|---|---|---|
| ASIA_PACIFIC | 2120 | 98.4% |
| All others | 35 | 1.6% |
Root cause: vbt_cache parquets are 10-second bars (~8208 bars/day). Engine lookback (~100 bars) completes in ~17 minutes from midnight. Entries fire at ~00:17 UTC (hour 0 = ASIA_PACIFIC). Single-position-per-asset plus MAX_HOLD=125 bars (~21 min) means 98% of all trades fire within the first hour of the day, before ASIA_PACIFIC ends at 08:00 UTC.
Confirmed by direct inspection: entry_ts.hour == 0 for 2108/2155 trades.
Consequence: The backtest's EsoF label distribution is:
| Label | n | Note |
|---|---|---|
| NEUTRAL | 863 | 40.0% |
| MILD_POSITIVE | 668 | 31.0% |
| MILD_NEGATIVE | 326 | 15.1% |
| UNFAVORABLE | 297 | 13.8% |
| FAVORABLE | 1 | 0.05% |
With FAVORABLE=1 and score range [-0.33, +0.27] (mean≈0), there is no variance for EsoF signals to differentiate. Walk-forward test: top-quartile score WR=50.2% vs bottom=52.0% — zero predictive power. This is expected when the entire dataset is concentrated at midnight/ASIA_PACIFIC.
The 56-day backtest is NOT a valid testbed for EsoF strategy evaluation. It is valid only for confirming gold-engine mechanics (which it does: ROI=+190.19%).
The live CH 549-trade dataset (spread across all sessions/DoW) is the primary evidence base.
Critical caveat on the backtest ROI itself: because all trades fire at ~00:17 UTC, the backtest is testing "midnight-only BLUE" — not live BLUE. Live BLUE hour-0 entry performance: WR=55%, avg_pnl=-$3.92 (negative avg). The backtest +190.19% ≈ live gold +189.48% is numerically consistent, but this coincidence could mask canceling biases. The backtest validates that the vel_div signal produces positive EV and that engine mechanics are consistent; it does NOT validate the exact ROI figure under live intraday conditions. The backtest cannot account for the intraday session/DoW effects that EsoF is designed to capture — this is precisely the limitation that motivated the EsoF project in the first place.
12.6 CH-Based Strategy Results (549 Clean Alpha Trades)
Baseline: WR=47.4%, Net=+$3,103
| Strategy | T_exec | T_blk | CF Net | ΔPnL |
|---|---|---|---|---|
| A: LEV_SCALE | 549 | 0 | +$3,971 | +$868 |
| B: HARD_BLOCK | 490 | 59 | +$5,922 | +$2,819 |
| C: DOW_BLOCK | 375 | 174 | +$3,561 | +$458 |
| D: SESSION_BLOCK | 422 | 127 | +$6,960 | +$3,857 |
| E: COMBINED | 340 | 209 | +$7,085 | +$3,982 |
Note: Strategy F (S6_BUCKET) is separately treated in §12.7.
12.7 FAVORABLE vs UNFAVORABLE — Statistical Evidence
From 588 CH trades (all clean exits), EsoF label performance:
| Label | n | WR% | Net PnL | Avg/trade |
|---|---|---|---|---|
| FAVORABLE | 84 | 78.6% | +$11,889 | +$141.54 |
| MILD_POSITIVE | 190 | 55.8% | +$1,620 | +$8.53 |
| NEUTRAL | 93 | 24.7% | -$5,574 | -$59.94 |
| MILD_NEGATIVE | 162 | 42.6% | -$1,937 | -$11.96 |
| UNFAVORABLE | 59 | 28.8% | -$2,819 | -$47.78 |
FAVORABLE vs UNFAVORABLE statistical test:
| Metric | Value |
|---|---|
| FAVORABLE wins/losses | 66 / 18 |
| UNFAVORABLE wins/losses | 17 / 42 |
| Odds ratio | 9.06× |
| Cohen's h | 1.046 (large, threshold ≥ 0.80) |
| χ² (df=1) | 35.23 (p < 0.0001; critical value at p<0.001 = 10.83) |
This is statistically robust. The FAVORABLE/UNFAVORABLE split is not noise at n=136.
Strategy A on UNFAVORABLE at 0.5× leverage: saves ~$1,409 vs actual -$2,819. Hard block of UNFAVORABLE: saves $2,819 (full elimination of the negative label bucket).
12.8 The NEUTRAL Label Anomaly
NEUTRAL (score between -0.05 and +0.05) shows WR=24.7% — worse than UNFAVORABLE (28.8%). This is counterintuitive.
Investigation:
- All 93 NEUTRAL trades are from April 2026 (the current month)
- NEUTRAL ASIA_PACIFIC subset: WR=14.7% (n=34)
- Score range: -0.048 to +0.049
Interpretation: A score near zero does NOT mean "safe middle ground." It means the positive and negative calendar signals are canceling each other — signal conflict. In the current April 2026 market regime, that conflict is associated with the worst outcomes. "Mixed signals = proceed with caution" is the correct read.
This is not a scoring bug. The advisory score near 0 should be treated with the same caution as MILD_NEGATIVE, not as a neutral baseline. Consider re-labeling NEUTRAL to "UNCLEAR" in future documentation to avoid miscommunication.
Month breakdown of labels:
| Month | FAVORABLE | MILD_POS | NEUTRAL | MILD_NEG | UNFAVORABLE |
|---|---|---|---|---|---|
| 2026-03 | 7 | 4 | 0 | 0 | 0 |
| 2026-04 | 77 | 186 | 93 | 162 | 59 |
March data is sparse (11 trades). The full analysis is effectively April 2026.
12.9 Live Real-Time Validation — 2026-04-20
Three trades observed in-session, all during advisory_label = "UNFAVORABLE" (Monday × LONDON_MORNING 08:45–09:40 UTC):
XRPUSDT ep:1.412 lev:9.00x pnl:-$91 exit:MAX_HOLD bars:125 08:45 UTC
TRXUSDT ep:0.3295 lev:9.00x pnl:-$109 exit:MAX_HOLD bars:125 09:15 UTC
CELRUSDT ep:0.002548 lev:9.00x pnl:-$355 exit:MAX_HOLD bars:125 09:40 UTC
Combined actual loss: -$555 At Strategy A (0.5× on UNFAVORABLE): counterfactual loss ≈ -$277 (saves $278) At Strategy B (hard block): $0 loss (saves $555)
This is consistent with UNFAVORABLE WR=28.8% and avg=-$47.78. Three MAX_HOLD losses in a row during a confirmed UNFAVORABLE window is the expected behavior, not an anomaly.
12.10 Overfitting Guard Summary
prod/tests/test_esof_overfit_guard.py — 24 tests, 9 classes.
From the 549-trade CH dataset:
| Test | Result | Verdict |
|---|---|---|
| NY_AFT permutation p-value | 0.035 | Significant (p<0.05) |
| NY_AFT WR 95% CI | [-$6,459, -$655] | Net loser, CI excludes 0 |
| NY_AFT Cohen's h | 0.089 | Trivial — loss is magnitude, not WR |
| Monday permutation p-value | 0.226 | Underpowered (n=34 in H1) |
| Walk-forward score→WR | Top-Q H2 WR=73.5% vs Bot=35.3% | Strong |
| FAVORABLE vs UNFAVORABLE χ² | 35.23 | p < 0.0001 |
6 tests intentionally fail (the guard is working — they flag genuine limitations):
- Bonferroni z-scores on per-cell WR do not clear threshold at n=549
- Bootstrap CI on NY_AFT WR overlaps baseline WR
- Cohen's h for NY_AFT WR is trivial (loss is from outlier magnitude trades)
These are not bugs. They represent real data limitations. Do not patch them to pass.
12.11 Recommendation (as of 2026-04-20)
Wire Strategy A (LEV_SCALE) as the first live gate. Rationale:
- χ²=35.23 (p<0.0001) on FAVORABLE/UNFAVORABLE is robust at current sample size
- Cohen's h=1.046 is a large effect — not a marginal signal
- Strategy A is soft (leverage reduction, no hard blocks) — runs BLUE ungated by default, calibrates EsoF tables from all trades
- Live 2026-04-20 observation (3 UNFAVORABLE MAX_HOLD losses) confirms the signal in real time
Do NOT wire hard block (Strategy B/D/E) yet. The walk-forward WR separation for NEUTRAL and MILD_NEGATIVE is not yet confirmed robust. Hard blocks increase regime sensitivity.
Feedback loop protocol (must not be violated):
- Always run BLUE ungated for base signal collection
- EsoF calibration tables (
SESSION_STATS,DOW_STATS, etc.) updated ONLY from ungated trades - Gate evaluated on out-of-sample ungated data — never feed gated trades back into calibration
- If Strategy A is wired: evaluate its counterfactual on ungated trades only, not on the leverage-adjusted subset
Preconditions to upgrade to Strategy B (hard block):
- n ≥ 1,000 clean alpha trades with UNFAVORABLE label
- UNFAVORABLE WR remains ≤ 35% at the new n
- Walk-forward on separate 90-day window confirms WR separation
- No regime break identified (e.g., FAVORABLE WR degrading to <60% would trigger review)