DOLPHIN/prod/docs/EsoF_BLUE_IMPLEMENTATION_CURR_AND_RESEARCH.md
hjnormey 01c19662cb initial: import DOLPHIN baseline 2026-04-21 from dolphinng5_predict working tree
Includes core prod + GREEN/BLUE subsystems:
- prod/ (BLUE harness, configs, scripts, docs)
- nautilus_dolphin/ (GREEN Nautilus-native impl + dvae/ preserved)
- adaptive_exit/ (AEM engine + models/bucket_assignments.pkl)
- Observability/ (EsoF advisor, TUI, dashboards)
- external_factors/ (EsoF producer)
- mc_forewarning_qlabs_fork/ (MC regime/envelope)

Excludes runtime caches, logs, backups, and reproducible artifacts per .gitignore.
2026-04-21 16:58:38 +02:00


EsoF — Esoteric Factors: Current State & Research Findings

As of: 2026-04-20 | Trade sample: 588 clean alpha trades (2026-03-31 → 2026-04-20) | Backtest: 2155 trades (2025-12-31 → 2026-02-26)


1. What "EsoF" Actually Refers To (Disambiguation)

The name "EsoF" (Esoteric Factors) attaches to two entirely separate systems in the Dolphin codebase. Do not conflate them.

1A. The Hazard Multiplier (set_esoteric_hazard_multiplier)

Located in esf_alpha_orchestrator.py. Modulates base_max_leverage downward:

effective_base = base_max_leverage × (1.0 - hazard_mult × factor)

Current gold spec: hazard_mult = 0.0 permanently. With the multiplier at zero, effective_base always equals base_max_leverage: the parameter exists in the engine but is inert.

  • Gold backtest ran with hazard_mult=0.0.
  • Do not change this without running a full backtest comparison.
  • The esof_prefect_flow.py computes astrological factors and pushes them to HZ, but nothing in the trading engine reads or consumes this output. The flow is dormant as an engine input.
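A minimal sketch of the leverage formula above, to make the inert behaviour concrete. The function name and the sample `factor` value are hypothetical; only the formula itself comes from this document.

```python
def effective_base_leverage(base_max_leverage: float,
                            hazard_mult: float,
                            factor: float) -> float:
    """effective_base = base_max_leverage * (1.0 - hazard_mult * factor)."""
    return base_max_leverage * (1.0 - hazard_mult * factor)


# With the gold spec (hazard_mult = 0.0) the factor value is irrelevant:
# the base leverage passes through unchanged.
gold = effective_base_leverage(9.0, hazard_mult=0.0, factor=0.7)

# A non-zero multiplier (NOT the gold spec) would scale leverage down.
reduced = effective_base_leverage(9.0, hazard_mult=0.5, factor=0.4)
```

Any non-zero `hazard_mult` changes sizing, which is why a full backtest comparison is required before touching it.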

1B. The Advisory System (Observability/esof_advisor.py)

A standalone advisory layer, not wired into BLUE. Built from 637 live trades. Computes session/DoW/slot/liq_hour expectancy and publishes an advisory score to HZ every 15 seconds and to CH every 5 minutes.


2. MarketIndicators — external_factors/esoteric_factors_service.py

The MarketIndicators class computes several temporal signals used by the advisory layer.

2.1 Regions Table

Region Population (M) Liq Weight Major centers
Americas 1,000 0.35 NYSE, CME
EMEA 2,200 0.30 LSE, Frankfurt, ECB
South_Asia 1,400 0.05 BSE, NSE
East_Asia 1,600 0.20 TSE, HKEX, SGX
Oceania_SEA 800 0.10 ASX, SGX

2.2 Computed Signals

Method Returns Notes
get_weighted_times(now) (pop_hour, liq_hour) Circular weighted average using sin/cos of each region's local hour
get_liquidity_session(now) session string Step function on UTC hour
get_regional_times(now) dict per region local_hour + is_tradfi_open flag
is_tradfi_open(now) bool Weekday 0-4, hour 9-17 local
get_moon_phase(now) phase + illumination Via astropy (ephem backend)
is_mercury_retrograde(now) bool Hardcoded period list
get_fibonacci_time(now) strength float Distance to nearest Fibonacci minute
get_market_cycle_position(now) 0.0-1.0 BTC halving 4-year cycle reference

2.3 Weighted Hour Properties

  • pop_weighted_hour: Population-weighted centroid ≈ UTC + 4.21h (South_Asia + East_Asia heavily weighted). Rotates strongly with East_Asian trading day opening.
  • liq_weighted_hour: Liquidity-weighted centroid ≈ UTC + 0.98h (Americas 35% dominant). Nearly linear monotone with UTC — adds granularity but does not reveal fundamentally different patterns from raw UTC sessions.
  • Fallback (if astropy not installed): pop ≈ (UTC + 4.21) % 24, liq ≈ (UTC + 0.98) % 24
  • astropy 7.2.0 is installed in siloqy_env (installed 2026-04-19).
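The circular weighted average behind the weighted hours can be sketched as follows. This is an illustration of the sin/cos technique only, not the code in esoteric_factors_service.py; the function names are hypothetical, and the per-region local hours are inputs because the regional UTC offsets are not listed in this document.

```python
import math


def circular_weighted_hour(local_hours, weights):
    """Weighted circular mean of clock hours, returned in [0, 24).

    Each hour is mapped to an angle on a 24h circle, so 23h and 1h
    average to midnight rather than to noon.
    """
    x = sum(w * math.cos(2 * math.pi * h / 24.0)
            for h, w in zip(local_hours, weights))
    y = sum(w * math.sin(2 * math.pi * h / 24.0)
            for h, w in zip(local_hours, weights))
    angle = math.atan2(y, x)
    return (angle * 24.0 / (2 * math.pi)) % 24.0


def fallback_hours(utc_hour):
    """Fixed-offset fallback quoted above: (pop_hour, liq_hour)."""
    return (utc_hour + 4.21) % 24.0, (utc_hour + 0.98) % 24.0


# Two equally weighted regions at 23h and 1h local time centre on midnight.
midnight = circular_weighted_hour([23.0, 1.0], [0.5, 0.5])
pop_hour, liq_hour = fallback_hours(23.5)
```

A plain arithmetic mean of 23 and 1 would give 12, which is the failure mode the circular form avoids.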

3. Trade Analysis — 637 Trades (2026-03-31 → 2026-04-19)

Baseline: WR = 43.7%, net = +$172.45 across all 637 trades.

3.1 Session Expectancy

Session Trades WR% Net PnL Avg/trade
LONDON_MORNING (08-13h UTC) 111 47.7% +$4,133 +$37.23
ASIA_PACIFIC (00-08h UTC) 182 46.7% +$1,600 +$8.79
LN_NY_OVERLAP (13-17h UTC) 147 45.6% -$895 -$6.09
LOW_LIQUIDITY (21-24h UTC) 71 39.4% -$809 -$11.40
NY_AFTERNOON (17-21h UTC) 127 35.4% -$3,857 -$30.37

NY_AFTERNOON is a systematic loser across all days. LONDON_MORNING is the cleanest positive session.
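The session boundaries in the table above amount to a step function on the UTC hour. A minimal sketch, assuming half-open intervals (lower bound inclusive); `classify_session` is a hypothetical name and the real get_liquidity_session() may treat edges differently.

```python
def classify_session(utc_hour: int) -> str:
    """Map a UTC hour (0-23) to the session names used in section 3.1."""
    if not 0 <= utc_hour < 24:
        raise ValueError(f"utc_hour out of range: {utc_hour}")
    if utc_hour < 8:
        return "ASIA_PACIFIC"
    if utc_hour < 13:
        return "LONDON_MORNING"
    if utc_hour < 17:
        return "LN_NY_OVERLAP"
    if utc_hour < 21:
        return "NY_AFTERNOON"
    return "LOW_LIQUIDITY"


sessions = [classify_session(h) for h in (3, 10, 15, 18, 22)]
```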

3.2 Day-of-Week Expectancy

DoW Trades WR% Net PnL Avg/trade
Mon 81 27.2% -$1,054 -$13.01
Tue 77 54.5% +$3,824 +$49.66
Wed 98 43.9% -$385 -$3.93
Thu 115 44.3% -$4,017 -$34.93
Fri 106 39.6% -$1,968 -$18.57
Sat 82 43.9% +$43 +$0.53
Sun 78 53.8% +$3,730 +$47.82

Monday has the lowest win rate (27.2%, avoid). Thursday has the largest net loss (-$4,017) despite a near-baseline WR; most of the damage comes from the Thu × LN_NY_OVERLAP cell (-$3,310). Tuesday and Sunday are positive outliers.

3.3 Liquidity-Hour Expectancy (3h Buckets, liq_hour ≈ UTC + 0.98h)

liq_hour bucket Trades WR% Net PnL Avg/trade Approx UTC
0-3h 70 51.4% +$1,466 +$20.9 23-2h
3-6h 73 46.6% -$1,166 -$16.0 2-5h
6-9h 62 41.9% +$1,026 +$16.5 5-8h
9-12h 65 43.1% +$476 +$7.3 8-11h
12-15h 84 52.4% +$3,532 +$42.0 11-14h ★ BEST
15-18h 113 43.4% -$770 -$6.8 14-17h
18-21h 99 35.4% -$2,846 -$28.8 17-20h ✗ WORST
21-24h 72 36.1% -$1,545 -$21.5 20-23h

liq 12-15h (EMEA afternoon + US open) is the standout best bucket. liq 18-21h mirrors NY_AFTERNOON almost exactly and is the worst.

3.4 DoW × Session Heatmap — Notable Cells

Full 5×7 grid (not all cells have enough data — cells with n < 5 omitted):

DoW × Session Trades WR% Net PnL Label
Sun × LONDON_MORNING 13 85.0% +$2,153 ★ BEST CELL
Sun × LN_NY_OVERLAP 24 75.0% +$2,110 2nd best
Tue × ASIA_PACIFIC 27 67.0% +$2,522 3rd
Tue × LN_NY_OVERLAP 18 56.0% +$2,260 4th
Sun × NY_AFTERNOON 17 6.0% -$1,025 ✗ WORST CELL
Mon × ASIA_PACIFIC 21 19.0% -$411 avoid
Thu × LN_NY_OVERLAP 27 41.0% -$3,310 ✗ CATASTROPHIC

Sun NY_AFTERNOON (6% WR) is a near-perfect inverse signal. Thu LN_NY_OVERLAP has enough trades (27) to be considered reliable — biggest single-cell loss in the dataset.

3.5 15-Minute Slot Highlights (n ≥ 5)

Top positive slots by avg_pnl (n ≥ 5):

Slot n WR% Net Avg/trade
15:00 10 70.0% +$2,266 +$226.58 ★
11:30 8 87.5% +$1,075 +$134.32
1:30 10 50.0% +$1,607 +$160.67
13:45 10 70.0% +$1,082 +$108.21
1:45 5 80.0% +$459 +$91.75

Top negative slots:

Slot n WR% Net Avg/trade
5:45 5 40.0% -$1,665 -$333.05 ★
2:15 5 0.0% -$852 -$170.31
16:30 4 25.0% -$2,024 -$506.01 (n<5)
12:45 6 16.7% -$1,178 -$196.35
18:00 6 16.7% -$1,596 -$265.93

Caveat on slots: Many 15m slots have n = 4-10. Most are noise at current sample size. Weight slot_score low (10%) in composite.


4. Advisory Scoring Model

4.1 Score Formula

sess_score  = (sess_wr  - 43.7) / 20.0    # roughly [-1, +1]; individual factors are not clamped
liq_score   = (liq_wr   - 43.7) / 20.0
dow_score   = (dow_wr   - 43.7) / 20.0
slot_score  = (slot_wr  - 43.7) / 20.0    # if n≥5, else 0.0
cell_bonus  = (cell_wr  - 43.7) / 100.0 × 0.3   # about [-0.13, +0.17] for WR in [0, 100]

advisory_score = liq_score×0.30 + sess_score×0.25 + dow_score×0.30
               + slot_score×0.10 + cell_bonus×0.05

advisory_score = clamp(advisory_score, -1.0, +1.0)

# Mercury retrograde: additional -0.05 penalty
if mercury_retrograde:
    advisory_score = max(-1.0, advisory_score - 0.05)

Denominator 20.0 chosen because observed WR range across all factors is ≈ ±20pp from baseline.
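The formula can be written as a small runnable sketch. The function and constant names are hypothetical rather than the actual esof_advisor.py internals. The example inputs are the Thu 15:00 UTC win rates from the section 3 tables, assuming 15:00 UTC falls in the 15 to 18h liq_hour bucket (liq_hour ≈ UTC + 0.98h).

```python
BASELINE_WR = 43.7  # baseline win rate (%) across the 637 trades


def wr_score(wr_pct: float, denom: float = 20.0) -> float:
    """Distance from baseline WR, scaled by the ~20pp observed range."""
    return (wr_pct - BASELINE_WR) / denom


def advisory_score(liq_wr, sess_wr, dow_wr, slot_wr, slot_n, cell_wr,
                   mercury_retrograde=False):
    """Weighted composite of the factor scores, clamped to [-1, +1]."""
    slot_score = wr_score(slot_wr) if slot_n >= 5 else 0.0
    cell_bonus = (cell_wr - BASELINE_WR) / 100.0 * 0.3
    raw = (wr_score(liq_wr) * 0.30 + wr_score(sess_wr) * 0.25
           + wr_score(dow_wr) * 0.30 + slot_score * 0.10
           + cell_bonus * 0.05)
    score = max(-1.0, min(1.0, raw))
    if mercury_retrograde:
        # Flat penalty, floored at -1.0.
        score = max(-1.0, score - 0.05)
    return score


# Thu 15:00 UTC: liq 15-18h (43.4), LN_NY_OVERLAP (45.6), Thu (44.3),
# slot 15:00 (70.0, n=10), cell Thu x LN_NY_OVERLAP (41.0).
thu_1500 = advisory_score(43.4, 45.6, 44.3, 70.0, 10, 41.0)
```

Under these inputs the composite lands at roughly +0.16, driven almost entirely by the slot term.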

4.2 Labels

Score range Label
> +0.25 FAVORABLE
> +0.05 MILD_POSITIVE
> -0.05 NEUTRAL
> -0.25 MILD_NEGATIVE
≤ -0.25 UNFAVORABLE
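The label table reads as a cascade of strict lower bounds. A minimal sketch, with a hypothetical function name:

```python
def advisory_label(score: float) -> str:
    """Map an advisory score to its label using the thresholds above."""
    if score > 0.25:
        return "FAVORABLE"
    if score > 0.05:
        return "MILD_POSITIVE"
    if score > -0.05:
        return "NEUTRAL"
    if score > -0.25:
        return "MILD_NEGATIVE"
    return "UNFAVORABLE"


labels = [advisory_label(s) for s in (0.30, 0.10, 0.0, -0.10, -0.30)]
```

Because every bound is strict, a score of exactly +0.25 is MILD_POSITIVE and exactly -0.25 is UNFAVORABLE.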

4.3 Weight Rationale

  • liq_hour (30%): More granular than session (3h vs 4h buckets, continuous). Captures EMEA-pm/US-open sweet spot cleanly.
  • DoW (30%): Strongest calendar factor in the data. Mon-Thu split is statistically robust (n=77-115).
  • Session (25%): Corroborates liq_hour. LONDON_MORNING/NY_AFTERNOON signal strong.
  • Slot 15m (10%): Useful signal but most slots have n < 10. Low weight appropriate until more data.
  • Cell DoW×Session (5%): Sun×LDN 85% WR is real but n=13 — kept at 5% to avoid overfitting.

5. Files Inventory

File Purpose Status
Observability/esof_advisor.py Advisory daemon + importable get_advisory() Active, v2
Observability/dolphin_status.py Status panel — reads esof_advisor_latest from HZ Wired (reads only)
external_factors/esoteric_factors_service.py MarketIndicators — real weighted hours, moon, mercury Source of truth
external_factors/esof_prefect_flow.py Pushes astro data to HZ Dormant (nothing consumes it)
prod/tests/test_esof_advisor.py 55-test suite (9 classes) All passing (28s)
CH: dolphin.esof_advisory Time-series advisory archive Active, 90-day TTL

CH Table Schema

CREATE TABLE IF NOT EXISTS dolphin.esof_advisory (
  ts                  DateTime64(3, 'UTC'),
  dow                 UInt8,
  dow_name            LowCardinality(String),
  hour_utc            UInt8,
  slot_15m            String,
  session             LowCardinality(String),
  moon_illumination   Float32,
  moon_phase          LowCardinality(String),
  mercury_retrograde  UInt8,
  pop_weighted_hour   Float32,
  liq_weighted_hour   Float32,
  market_cycle_pos    Float32,
  fib_strength        Float32,
  slot_wr_pct         Float32,
  slot_net_pnl        Float32,
  session_wr_pct      Float32,
  session_net_pnl     Float32,
  dow_wr_pct          Float32,
  dow_net_pnl         Float32,
  advisory_score      Float32,
  advisory_label      LowCardinality(String)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(ts)
ORDER BY ts
TTL toDateTime(ts) + toIntervalDay(90);

6. HZ Integration

  • Key: DOLPHIN_FEATURES['esof_advisor_latest']
  • Format: JSON string (all fields from compute_esof() return dict)
  • Write cadence: Every 15 seconds by daemon; CH every 5 minutes
  • Reading (in dolphin_status.py):
    esof = _get(hz, "DOLPHIN_FEATURES", "esof_advisor_latest")
    
    Falls back to "(start esof_advisor.py for advisory)" when absent.

7. Starting the Daemon

source /home/dolphin/siloqy_env/bin/activate
python Observability/esof_advisor.py
# Options:
#   --once        compute once and exit
#   --interval N  seconds between updates (default 15)
#   --no-hz       skip HZ write
#   --no-ch       skip CH write

Daemon PID on last start: 2417597 (2026-04-19).


8. Test Suite — prod/tests/test_esof_advisor.py

55 tests, 9 classes, all passing (28.36s run, 2026-04-19).

Class Tests What it covers
TestComputeEsofSchema 5 All required keys present, score in [-1,+1], labels valid
TestSessionClassification 5 Boundary conditions for all 5 sessions
TestWeightedHours 4 Pop/liq hour in [0,24), ordering, monotone liq
TestAdvisoryScoring 7 Best/worst cell ordering, Mon<Tue, Sun>Mon, NY_AFT negative
TestExpectancyTables 6 Table integrity: all WR in [0,100], net aligned with WR
TestMoonApproximation 4 Phase labels, new moon Apr 17, full moon Apr 2, illumination range
TestPublicAPI 3 get_advisory() returns same schema, --once flag, daemon args
TestHZIntegration 8 HZ write/read roundtrip (skipped if HZ unavailable)
TestCHIntegration 13 CH insert/query/TTL (skipped if CH unavailable)

Key test fixtures used:

Fixture datetime UTC Why
sun_london Sun 10:00 Best expected cell (WR 85%)
thu_ovlp Thu 15:00 Thu OVLP catastrophic cell
sun_ny Sun 18:00 Sun NY_AFT 6% WR inverse signal
mon_asia Mon 03:00 Mon worst day
tue_asia Tue 03:00 Tue vs Mon comparison
midday_win Tue 12:30 liq 12-15h best bucket

9. Known Limitations and Research Notes

9.1 DoW × Slot Interaction (not modeled)

The current model treats DoW and Slot as independent factors (additive). This is incorrect in at least one known case: slot 15:00 has WR=70% overall (the best slot by avg_pnl), but Thursday 15:00 is known to be catastrophic in context (Thu×LN_NY_OVERLAP cell = -$3,310). The additive model would give Thu 15:00 a positive slot score (+1.32) while the DoW/cell scores pull it negative — net result is weakly positive, which understates the risk.

Future work: Model DoW×Slot joint distribution when n ≥ 10 per cell (requires ~2,000 more trades).

9.2 Sample Size Caveats

Factor Min cell n Confidence
Session 71 (LOW_LIQ) High
DoW 77 (Tue) High
liq_hour 3h 62 (6-9h) Medium-High
DoW×Session 13 (Sun×LDN) Medium
Slot 15m 4-19 Low-Medium

Rules of thumb: session + DoW patterns are reliable. Slot patterns are directional hints only until n ≥ 30.

9.3 Mercury Retrograde

Current period: 2026-03-07 → 2026-03-30 (ended). Next: 2026-06-29 → 2026-07-23. The -0.05 penalty is arbitrary (no empirical basis from the 637 trades — not enough retrograde trades). Retain as a conservative prior.

9.4 Fibonacci Time

fib_strength = 1.0 - min(dist_to_nearest_fib_minute / 30.0, 1.0)

Currently not incorporated into the advisory score (computed but not weighted). No evidence from trade data. Track in CH for future regression.

9.5 Market Cycle Position

BTC halving reference: 2024-04-19. Current position: (days_since % 1461) / 1461.0. As of 2026-04-19 ≈ 730/1461 ≈ 0.50 (2 years post-halving, mid-cycle). Not in advisory score; tracked only.
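The cycle-position arithmetic can be checked with a short sketch. The function and constant names are hypothetical; the reference date and the 1461-day cycle length are the values stated above.

```python
from datetime import date

HALVING_REF = date(2024, 4, 19)  # reference date stated in this section
CYCLE_DAYS = 1461                # 4 years including one leap day


def market_cycle_position(today: date) -> float:
    """Fraction of the 4-year cycle elapsed since the reference, in [0, 1)."""
    days_since = (today - HALVING_REF).days
    return (days_since % CYCLE_DAYS) / CYCLE_DAYS


days_elapsed = (date(2026, 4, 19) - HALVING_REF).days
position = market_cycle_position(date(2026, 4, 19))
```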

9.6 tradfi_open Flags

MarketIndicators.get_regional_times() returns is_tradfi_open per region. This signal is not yet used in scoring. Hypothesis: periods when 2+ major TradFi regions are simultaneously open may have better fill quality. Wire and test once more data exists.


10. Future Wiring Into BLUE Engine

DO NOT wire until validated with more data. The following describes the intended integration, NOT current state.

Proposed gating logic (research phase):

# In esf_alpha_orchestrator._try_entry(): FUTURE ONLY, pick ONE option
advisory = get_advisory()   # from esof_advisor.py
if advisory["advisory_label"] == "UNFAVORABLE":
    # Option A: skip entry entirely
    return None
    # Option B (alternative to A, unreachable while A is active):
    # reduce sizing by 50% instead of returning
    # size_mult *= 0.5

Preconditions before wiring:

  1. Accumulate ≥ 1,500 trades across all sessions/DoW (currently 637)
  2. DoW slot interaction modeled or explicitly neutralized
  3. NY_AFTERNOON pattern holds on next 500 trades (current WR=35.4% robust across all 127 trades, so likely durable)
  4. Backtest: filter UNFAVORABLE periods → measure ROI uplift vs full universe
  5. Unit test: advisory gate does not block >20% of entry opportunities

Suggested first gate (lowest risk):

Block entries when all three hold simultaneously:

  • dow in (0, 3) (Mon or Thu)
  • session == "NY_AFTERNOON"
  • advisory_score < -0.25

This is the intersection of the three worst factors, blocking the highest-conviction negative cells only.
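The three-condition gate can be expressed as a single pure predicate. A minimal sketch with a hypothetical function name, using the Python weekday() convention (0 = Mon, 3 = Thu):

```python
def first_gate_blocks(dow: int, session: str, advisory_score: float) -> bool:
    """True only when all three conditions of the suggested gate hold."""
    return (dow in (0, 3)
            and session == "NY_AFTERNOON"
            and advisory_score < -0.25)


blocked = first_gate_blocks(3, "NY_AFTERNOON", -0.31)      # all three hold
allowed = first_gate_blocks(3, "LN_NY_OVERLAP", -0.31)     # wrong session
```

Keeping the gate free of I/O makes it easy to unit test the precondition that it does not block more than 20% of entry opportunities.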


11. Update Cadence

Update SLOT_STATS, SESSION_STATS, DOW_STATS, LIQ_HOUR_STATS, DOW_SESSION_STATS in esof_advisor.py:

-- Pull fresh session stats from CH:
SELECT session,
       count() as trades,
       round(100.0 * countIf(pnl > 0) / count(), 1) as wr_pct,
       round(sum(pnl), 2) as net_pnl,
       round(avg(pnl), 2) as avg_pnl
FROM dolphin.trade_events
WHERE strategy = 'blue'
GROUP BY session
ORDER BY session;

-- DoW stats:
SELECT toDayOfWeek(ts) - 1 as dow,  -- 0=Mon in Python weekday()
       count(), round(100*countIf(pnl>0)/count(),1), round(sum(pnl),2), round(avg(pnl),2)
FROM dolphin.trade_events WHERE strategy='blue'
GROUP BY dow ORDER BY dow;

-- 15m slot stats (n>=5):
SELECT slot_15m, count(), round(100*countIf(pnl>0)/count(),1), round(sum(pnl),2), round(avg(pnl),2)
FROM (
  SELECT toStartOfFifteenMinutes(ts) as slot_ts,
         formatDateTime(slot_ts, '%H:%M') as slot_15m,
         pnl
  FROM dolphin.trade_events WHERE strategy='blue'
)
GROUP BY slot_15m HAVING count() >= 5
ORDER BY slot_15m;

Suggested refresh: when cumulative trade count crosses 1000, 1500, 2000.


12. Gate Strategy Empirical Testing — 2026-04-20

12.1 Test Infrastructure

Four new files created:

File Purpose
Observability/esof_gate.py Pure gate strategy functions (no I/O). GateResult dataclass: action, lev_mult, reason, s6_mult, irp_params
prod/tests/test_esof_gate_strategies.py CH-based strategy simulation + 39 unit tests, all passing
prod/tests/test_esof_overfit_guard.py 24 industry-standard overfitting avoidance tests (6 intentionally fail — guard working)
prod/tests/run_esof_backtest_sim.py 56-day gold-engine simulation over vbt_cache parquets

12.2 Clean Alpha Exit Definition

For all strategy testing, only FIXED_TP and MAX_HOLD exits are counted. Excluded:

  • HIBERNATE_HALT — forced position close, not alpha signal
  • SUBDAY_ACB_NORMALIZATION — control-plane forced, not alpha-driven

This reduces the 588-trade raw CH dataset to 549 clean alpha trades.

12.3 Strategies Tested (A-F)

ID Strategy Mechanism
A LEV_SCALE Scale leverage by advisory score: FAVORABLE→1.2×, MILD_POS→1.0×, NEUTRAL→0.8×, MILD_NEG→0.6×, UNFAVORABLE→0.5×
B HARD_BLOCK Block entry when advisory_label == "UNFAVORABLE"
C DOW_BLOCK Block when dow in (0, 3) (Mon, Thu)
D SESSION_BLOCK Block when session == "NY_AFTERNOON"
E COMBINED Block when UNFAVORABLE or (Mon/Thu and NY_AFTERNOON)
F S6_BUCKET Per-bucket sizing multipliers keyed by EsoF label (5 labels × 7 buckets). Widened FAVORABLE, zeroed UNFAVORABLE buckets

Counterfactual PnL methodology: cf_pnl = actual_pnl × lev_mult (linear scaling; valid only for FIXED_TP and MAX_HOLD exits, where PnL scales linearly with leverage).
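Strategies A to E and the counterfactual formula can be sketched together. This is an illustration, not the code in esof_gate.py: the function names and the trade dict shape are hypothetical, a blocked entry is modelled as lev_mult = 0.0, and the sample trades use the PnL values of the three live trades recorded in section 12.9.

```python
LEV_SCALE = {
    "FAVORABLE": 1.2, "MILD_POSITIVE": 1.0, "NEUTRAL": 0.8,
    "MILD_NEGATIVE": 0.6, "UNFAVORABLE": 0.5,
}


def lev_mult(strategy: str, label: str, dow: int, session: str) -> float:
    """Leverage multiplier for one trade; 0.0 means the entry is blocked."""
    unfav = label == "UNFAVORABLE"
    mon_thu = dow in (0, 3)
    ny_aft = session == "NY_AFTERNOON"
    if strategy == "A":
        return LEV_SCALE[label]
    rules = {
        "B": unfav,
        "C": mon_thu,
        "D": ny_aft,
        "E": unfav or (mon_thu and ny_aft),
    }
    if strategy not in rules:
        raise ValueError(f"unknown strategy: {strategy}")
    return 0.0 if rules[strategy] else 1.0


def counterfactual_pnl(trades, strategy: str) -> float:
    """Sum of cf_pnl = actual_pnl * lev_mult over the trades."""
    return sum(t["pnl"] * lev_mult(strategy, t["label"], t["dow"], t["session"])
               for t in trades)


# Monday x LONDON_MORNING, all labelled UNFAVORABLE.
live = [{"pnl": p, "label": "UNFAVORABLE", "dow": 0, "session": "LONDON_MORNING"}
        for p in (-91.0, -109.0, -355.0)]
cf_a = counterfactual_pnl(live, "A")
cf_d = counterfactual_pnl(live, "D")
```

Strategy F is omitted because its per-bucket multipliers are not listed here.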


12.4 Posture Clarification — BLUE Is Effectively APEX-Only

User confirmed, code verified. Live BLUE posture distribution from CH:

APEX:    586 trades (99.8%)
STALKER:   1 trade  (0.2%)
TURTLE:    0
HIBERNATE: 0

dolphin_actor.py reads posture from HZ DOLPHIN_SAFETY. STALKER applies a 2.0× leverage ceiling but does not block entries. TURTLE/HIBERNATE set regime_dd_halt = True (blocks entries for the day) — but these states occur essentially never in the current deployment window.

Implication: The live CH trade session/DoW distribution is NOT shaped by posture transitions. The session distribution is a genuine trading behavior signal.


12.5 56-Day Gold Backtest — Why It Is Invalid for EsoF Session Analysis

run_esof_backtest_sim.py ran the gold-spec LiquidationGuardEngine over 56 vbt_cache parquet days (2025-12-31 → 2026-02-26). Gold match: 2155 trades, ROI=+190.19% (confirming engine correctness).

Session distribution in backtest:

Session n %
ASIA_PACIFIC 2120 98.4%
All others 35 1.6%

Root cause: vbt_cache parquets are 10-second bars (~8208 bars/day). Engine lookback (~100 bars) completes in ~17 minutes from midnight. Entries fire at ~00:17 UTC (hour 0 = ASIA_PACIFIC). Single-position-per-asset plus MAX_HOLD=125 bars (~21 min) means 98% of all trades fire within the first hour of the day, before ASIA_PACIFIC ends at 08:00 UTC.

Confirmed by direct inspection: entry_ts.hour == 0 for 2108/2155 trades.
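The timing figures in the root-cause paragraph follow directly from the bar size. A short arithmetic check, with hypothetical names:

```python
BAR_SECONDS = 10  # vbt_cache parquet bar size


def bars_to_minutes(n_bars: int) -> float:
    """Wall-clock minutes covered by n_bars bars."""
    return n_bars * BAR_SECONDS / 60.0


lookback_min = bars_to_minutes(100)   # engine lookback, about 16.7 min
max_hold_min = bars_to_minutes(125)   # MAX_HOLD, about 20.8 min
hour0_share = 2108 / 2155             # share of entries with entry_ts.hour == 0
```

Lookback plus one full hold is under 40 minutes, which is consistent with about 98% of entries landing in hour 0.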

Consequence: The backtest's EsoF label distribution is:

Label n Share
NEUTRAL 863 40.0%
MILD_POSITIVE 668 31.0%
MILD_NEGATIVE 326 15.1%
UNFAVORABLE 297 13.8%
FAVORABLE 1 0.05%

With FAVORABLE=1 and score range [-0.33, +0.27] (mean≈0), there is no variance for EsoF signals to differentiate. Walk-forward test: top-quartile score WR=50.2% vs bottom=52.0% — zero predictive power. This is expected when the entire dataset is concentrated at midnight/ASIA_PACIFIC.

The 56-day backtest is NOT a valid testbed for EsoF strategy evaluation. It is valid only for confirming gold-engine mechanics (which it does: ROI=+190.19%).

The live CH 549-trade dataset (spread across all sessions/DoW) is the primary evidence base.

Critical caveat on the backtest ROI itself: because all trades fire at ~00:17 UTC, the backtest is testing "midnight-only BLUE" — not live BLUE. Live BLUE hour-0 entry performance: WR=55%, avg_pnl=-$3.92 (negative avg). The backtest +190.19% ≈ live gold +189.48% is numerically consistent, but this coincidence could mask canceling biases. The backtest validates that the vel_div signal produces positive EV and that engine mechanics are consistent; it does NOT validate the exact ROI figure under live intraday conditions. The backtest cannot account for the intraday session/DoW effects that EsoF is designed to capture — this is precisely the limitation that motivated the EsoF project in the first place.


12.6 CH-Based Strategy Results (549 Clean Alpha Trades)

Baseline: WR=47.4%, Net=+$3,103

Strategy T_exec T_blk CF Net ΔPnL
A: LEV_SCALE 549 0 +$3,971 +$868
B: HARD_BLOCK 490 59 +$5,922 +$2,819
C: DOW_BLOCK 375 174 +$3,561 +$458
D: SESSION_BLOCK 422 127 +$6,960 +$3,857
E: COMBINED 340 209 +$7,085 +$3,982

Note: Strategy F (S6_BUCKET) is separately treated in §12.7.


12.7 FAVORABLE vs UNFAVORABLE — Statistical Evidence

From 588 CH trades (all clean exits), EsoF label performance:

Label n WR% Net PnL Avg/trade
FAVORABLE 84 78.6% +$11,889 +$141.54
MILD_POSITIVE 190 55.8% +$1,620 +$8.53
NEUTRAL 93 24.7% -$5,574 -$59.94
MILD_NEGATIVE 162 42.6% -$1,937 -$11.96
UNFAVORABLE 59 28.8% -$2,819 -$47.78

FAVORABLE vs UNFAVORABLE statistical test:

Metric Value
FAVORABLE wins/losses 66 / 18
UNFAVORABLE wins/losses 17 / 42
Odds ratio 9.06×
Cohen's h 1.046 (large, threshold ≥ 0.80)
χ² (df=1) 35.23 (p < 0.0001; critical value at p<0.001 = 10.83)

This is statistically robust. The FAVORABLE/UNFAVORABLE split is not noise at n=143 (84 FAVORABLE + 59 UNFAVORABLE trades).

Strategy A on UNFAVORABLE at 0.5× leverage: saves ~$1,409 vs actual -$2,819. Hard block of UNFAVORABLE: saves $2,819 (full elimination of the negative label bucket).
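The three statistics in the table can be reproduced from the win/loss counts with the standard library alone. A minimal sketch, assuming the χ² is the uncorrected Pearson statistic for a 2×2 table (no continuity correction); the function name is hypothetical.

```python
import math


def two_by_two_stats(a: int, b: int, c: int, d: int):
    """Odds ratio, Cohen's h and Pearson chi-square for [[a, b], [c, d]].

    Rows are groups, columns are (wins, losses).
    """
    n = a + b + c + d
    odds_ratio = (a * d) / (b * c)
    p1, p2 = a / (a + b), c / (c + d)
    cohens_h = 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return odds_ratio, cohens_h, chi2


# FAVORABLE 66 wins / 18 losses, UNFAVORABLE 17 wins / 42 losses.
odds_ratio, cohens_h, chi2 = two_by_two_stats(66, 18, 17, 42)
```

With these counts the results agree with the table to the precision shown.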


12.8 The NEUTRAL Label Anomaly

NEUTRAL (score between -0.05 and +0.05) shows WR=24.7% — worse than UNFAVORABLE (28.8%). This is counterintuitive.

Investigation:

  • All 93 NEUTRAL trades are from April 2026 (the current month)
  • NEUTRAL ASIA_PACIFIC subset: WR=14.7% (n=34)
  • Score range: -0.048 to +0.049

Interpretation: A score near zero does NOT mean "safe middle ground." It means the positive and negative calendar signals are canceling each other — signal conflict. In the current April 2026 market regime, that conflict is associated with the worst outcomes. "Mixed signals = proceed with caution" is the correct read.

This is not a scoring bug. The advisory score near 0 should be treated with the same caution as MILD_NEGATIVE, not as a neutral baseline. Consider re-labeling NEUTRAL to "UNCLEAR" in future documentation to avoid miscommunication.

Month breakdown of labels:

Month FAVORABLE MILD_POS NEUTRAL MILD_NEG UNFAVORABLE
2026-03 7 4 0 0 0
2026-04 77 186 93 162 59

March data is sparse (11 trades). The full analysis is effectively April 2026.


12.9 Live Real-Time Validation — 2026-04-20

Three trades observed in-session, all during advisory_label = "UNFAVORABLE" (Monday × LONDON_MORNING 08:45-09:40 UTC):

XRPUSDT   ep:1.412   lev:9.00x  pnl:-$91    exit:MAX_HOLD  bars:125  08:45 UTC
TRXUSDT   ep:0.3295  lev:9.00x  pnl:-$109   exit:MAX_HOLD  bars:125  09:15 UTC
CELRUSDT  ep:0.002548 lev:9.00x pnl:-$355   exit:MAX_HOLD  bars:125  09:40 UTC

  • Combined actual loss: -$555
  • At Strategy A (0.5× on UNFAVORABLE): counterfactual loss ≈ -$277 (saves $278)
  • At Strategy B (hard block): $0 loss (saves $555)

This is consistent with UNFAVORABLE WR=28.8% and avg=-$47.78. Three MAX_HOLD losses in a row during a confirmed UNFAVORABLE window is the expected behavior, not an anomaly.


12.10 Overfitting Guard Summary

prod/tests/test_esof_overfit_guard.py — 24 tests, 9 classes.

From the 549-trade CH dataset:

Test Result Verdict
NY_AFT permutation p-value 0.035 Significant (p<0.05)
NY_AFT net PnL 95% CI [-$6,459, -$655] Net loser, CI excludes 0
NY_AFT Cohen's h 0.089 Trivial — loss is magnitude, not WR
Monday permutation p-value 0.226 Underpowered (n=34 in H1)
Walk-forward score→WR Top-Q H2 WR=73.5% vs Bot=35.3% Strong
FAVORABLE vs UNFAVORABLE χ² 35.23 p < 0.0001

6 tests intentionally fail (the guard is working — they flag genuine limitations):

  • Bonferroni z-scores on per-cell WR do not clear threshold at n=549
  • Bootstrap CI on NY_AFT WR overlaps baseline WR
  • Cohen's h for NY_AFT WR is trivial (loss is from outlier magnitude trades)

These are not bugs. They represent real data limitations. Do not patch them to pass.
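For readers unfamiliar with the permutation tests listed above, a generic one-sided version is sketched below. This is an assumption about the general technique, not the implementation in test_esof_overfit_guard.py: the statistic (difference in mean PnL), the one-sided direction, and all names are hypothetical, and the data is synthetic.

```python
import random


def _mean(xs):
    return sum(xs) / len(xs)


def permutation_p_value(group, rest, n_perm=5000, seed=0):
    """One-sided p-value that `group` has a lower mean than `rest`.

    Labels are shuffled n_perm times; the p-value is the share of
    shuffles whose mean difference is at least as negative as observed.
    """
    rng = random.Random(seed)
    observed = _mean(group) - _mean(rest)
    pooled = list(group) + list(rest)
    k = len(group)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if _mean(pooled[:k]) - _mean(pooled[k:]) <= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Synthetic PnL: a clearly worse group vs. an indistinguishable one.
p_worse = permutation_p_value([-50.0] * 20, [10.0] * 80)
p_same = permutation_p_value([1.0, -1.0] * 10, [1.0, -1.0] * 40)
```

A test of this kind makes no distributional assumption, which suits PnL samples dominated by a few large-magnitude trades.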


12.11 Recommendation (as of 2026-04-20)

Wire Strategy A (LEV_SCALE) as the first live gate. Rationale:

  1. χ²=35.23 (p<0.0001) on FAVORABLE/UNFAVORABLE is robust at current sample size
  2. Cohen's h=1.046 is a large effect — not a marginal signal
  3. Strategy A is soft (leverage reduction, no hard blocks) — runs BLUE ungated by default, calibrates EsoF tables from all trades
  4. Live 2026-04-20 observation (3 UNFAVORABLE MAX_HOLD losses) confirms the signal in real time

Do NOT wire hard block (Strategy B/D/E) yet. The walk-forward WR separation for NEUTRAL and MILD_NEGATIVE is not yet confirmed robust. Hard blocks increase regime sensitivity.

Feedback loop protocol (must not be violated):

  • Always run BLUE ungated for base signal collection
  • EsoF calibration tables (SESSION_STATS, DOW_STATS, etc.) updated ONLY from ungated trades
  • Gate evaluated on out-of-sample ungated data — never feed gated trades back into calibration
  • If Strategy A is wired: evaluate its counterfactual on ungated trades only, not on the leverage-adjusted subset

Preconditions to upgrade to Strategy B (hard block):

  1. n ≥ 1,000 clean alpha trades with UNFAVORABLE label
  2. UNFAVORABLE WR remains ≤ 35% at the new n
  3. Walk-forward on separate 90-day window confirms WR separation
  4. No regime break identified (e.g., FAVORABLE WR degrading to <60% would trigger review)