hjnormey 01c19662cb initial: import DOLPHIN baseline 2026-04-21 from dolphinng5_predict working tree

Includes core prod + GREEN/BLUE subsystems:
- prod/ (BLUE harness, configs, scripts, docs)
- nautilus_dolphin/ (GREEN Nautilus-native impl + dvae/ preserved)
- adaptive_exit/ (AEM engine + models/bucket_assignments.pkl)
- Observability/ (EsoF advisor, TUI, dashboards)
- external_factors/ (EsoF producer)
- mc_forewarning_qlabs_fork/ (MC regime/envelope)

Excludes runtime caches, logs, backups, and reproducible artifacts per .gitignore.

2026-04-21 16:58:38 +02:00


EsoF — Esoteric Factors: Current State & Research Findings

As of: 2026-04-20 | Trade sample: 588 clean alpha trades (2026-03-31 → 2026-04-20) | Backtest: 2155 trades (2025-12-31 → 2026-02-26)

1. What "EsoF" Actually Refers To (Disambiguation)

The name "EsoF" (Esoteric Factors) attaches to two entirely separate systems in the Dolphin codebase. Do not conflate them.

1A. The Hazard Multiplier (`set_esoteric_hazard_multiplier`)

Located in esf_alpha_orchestrator.py. Modulates base_max_leverage downward:

effective_base = base_max_leverage × (1.0 - hazard_mult × factor)

Current gold spec: hazard_mult = 0.0 permanently, so effective_base always equals base_max_leverage regardless of factor. The parameter exists in the engine but is inert.
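To make the inertness concrete, here is a minimal sketch of the formula above. The function name and the 9.0 base leverage are illustrative assumptions, not taken from esf_alpha_orchestrator.py.

```python
def effective_base_leverage(base_max_leverage: float,
                            hazard_mult: float,
                            factor: float) -> float:
    """Apply the hazard-multiplier formula quoted above (hypothetical helper)."""
    return base_max_leverage * (1.0 - hazard_mult * factor)


# With the gold-spec hazard_mult = 0.0, the factor value is irrelevant.
for factor in (0.0, 0.5, 1.0):
    assert effective_base_leverage(9.0, 0.0, factor) == 9.0

# A non-zero multiplier (NOT the gold spec) would reduce leverage.
print(effective_base_leverage(10.0, 0.5, 0.4))
```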

Gold backtest ran with hazard_mult=0.0.
Do not change this without running a full backtest comparison.
The esof_prefect_flow.py computes astrological factors and pushes them to HZ, but nothing in the trading engine reads or consumes this output. The flow is dormant as an engine input.

1B. The Advisory System (`Observability/esof_advisor.py`)

A standalone advisory layer — not wired into BLUE. Built from 637 live trades. Computes session/DoW/slot/liq_hour expectancy and publishes an advisory score every 15 seconds to HZ and CH.

2. MarketIndicators — `external_factors/esoteric_factors_service.py`

The MarketIndicators class computes several temporal signals used by the advisory layer.

2.1 Regions Table

Region	Population (M)	Liq Weight	Major centers
Americas	1,000	0.35	NYSE, CME
EMEA	2,200	0.30	LSE, Frankfurt, ECB
South_Asia	1,400	0.05	BSE, NSE
East_Asia	1,600	0.20	TSE, HKEX, SGX
Oceania_SEA	800	0.10	ASX, SGX

2.2 Computed Signals

Method	Returns	Notes
`get_weighted_times(now)`	`(pop_hour, liq_hour)`	Circular weighted average using sin/cos of each region's local hour
`get_liquidity_session(now)`	session string	Step function on UTC hour
`get_regional_times(now)`	dict per region	local_hour + is_tradfi_open flag
`is_tradfi_open(now)`	bool	Weekday 0–4, hour 9–17 local
`get_moon_phase(now)`	phase + illumination	Via astropy (ephem backend)
`is_mercury_retrograde(now)`	bool	Hardcoded period list
`get_fibonacci_time(now)`	strength float	Distance to nearest Fibonacci minute
`get_market_cycle_position(now)`	0.0–1.0	BTC halving 4-year cycle reference

2.3 Weighted Hour Properties

pop_weighted_hour: Population-weighted centroid ≈ UTC + 4.21h (South_Asia + East_Asia heavily weighted). Rotates strongly with East_Asian trading day opening.
liq_weighted_hour: Liquidity-weighted centroid ≈ UTC + 0.98h (Americas 35% dominant). Nearly linear monotone with UTC — adds granularity but does not reveal fundamentally different patterns from raw UTC sessions.
Fallback (if astropy not installed): pop ≈ (UTC + 4.21) % 24, liq ≈ (UTC + 0.98) % 24
astropy 7.2.0 is installed in siloqy_env (installed 2026-04-19).
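The circular weighted average behind get_weighted_times can be sketched as follows. This is a generic illustration of the sin/cos technique and of the fallback offsets quoted above; the function names are hypothetical and the real MarketIndicators code may differ.

```python
import math


def circular_weighted_hour(local_hours, weights):
    """Weighted circular mean of clock hours, returned in [0, 24)."""
    s = sum(w * math.sin(2 * math.pi * h / 24.0)
            for h, w in zip(local_hours, weights))
    c = sum(w * math.cos(2 * math.pi * h / 24.0)
            for h, w in zip(local_hours, weights))
    angle = math.atan2(s, c)  # mean direction on the 24h clock face
    return (angle * 24.0 / (2 * math.pi)) % 24.0


def fallback_weighted_hours(utc_hour):
    """Fallback offsets from the text: (pop_hour, liq_hour)."""
    return ((utc_hour + 4.21) % 24.0, (utc_hour + 0.98) % 24.0)


# 23:00 and 01:00 average to midnight on a circle, not to noon.
midnight = circular_weighted_hour([23.0, 1.0], [0.5, 0.5])
print(min(midnight, 24.0 - midnight))  # distance from 00:00, effectively zero
```

A plain arithmetic mean of 23 and 1 would give 12, which is why the hours are averaged as angles.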

3. Trade Analysis — 637 Trades (2026-03-31 → 2026-04-19)

Baseline: WR = 43.7%, net = +$172.45 across all 637 trades.

3.1 Session Expectancy

Session	Trades	WR%	Net PnL	Avg/trade
LONDON_MORNING (08–13h UTC)	111	47.7%	+$4,133	+$37.23
ASIA_PACIFIC (00–08h UTC)	182	46.7%	+$1,600	+$8.79
LN_NY_OVERLAP (13–17h UTC)	147	45.6%	-$895	-$6.09
LOW_LIQUIDITY (21–24h UTC)	71	39.4%	-$809	-$11.40
NY_AFTERNOON (17–21h UTC)	127	35.4%	-$3,857	-$30.37

NY_AFTERNOON is a systematic loser across all days. LONDON_MORNING is the cleanest positive session.
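The session boundaries in the table amount to a step function on the UTC hour. A minimal sketch, assuming half-open intervals (start inclusive, end exclusive); the function name is hypothetical and is not the signature of get_liquidity_session.

```python
def liquidity_session(utc_hour: int) -> str:
    """Map a UTC hour (0-23) to the session names used in section 3.1."""
    if not 0 <= utc_hour <= 23:
        raise ValueError("utc_hour must be in 0..23")
    if utc_hour < 8:
        return "ASIA_PACIFIC"      # 00-08h UTC
    if utc_hour < 13:
        return "LONDON_MORNING"    # 08-13h UTC
    if utc_hour < 17:
        return "LN_NY_OVERLAP"     # 13-17h UTC
    if utc_hour < 21:
        return "NY_AFTERNOON"      # 17-21h UTC
    return "LOW_LIQUIDITY"         # 21-24h UTC


for hour in (3, 10, 15, 18, 23):
    print(hour, liquidity_session(hour))
```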

3.2 Day-of-Week Expectancy

DoW	Trades	WR%	Net PnL	Avg/trade
Mon	81	27.2%	-$1,054	-$13.01
Tue	77	54.5%	+$3,824	+$49.66
Wed	98	43.9%	-$385	-$3.93
Thu	115	44.3%	-$4,017	-$34.93
Fri	106	39.6%	-$1,968	-$18.57
Sat	82	43.9%	+$43	+$0.53
Sun	78	53.8%	+$3,730	+$47.82

Monday has the worst WR (27.2%; avoid). Thursday has a near-average WR (44.3%) but the largest net loss (-$4,017), most of it from the Thu × LN_NY_OVERLAP cell (-$3,310). Tuesday and Sunday are positive outliers.

3.3 Liquidity-Hour Expectancy (3h Buckets, liq_hour ≈ UTC + 0.98h)

liq_hour bucket	Trades	WR%	Net PnL	Avg/trade	Approx UTC
0–3h	70	51.4%	+$1,466	+$20.9	23–2h
3–6h	73	46.6%	-$1,166	-$16.0	2–5h
6–9h	62	41.9%	+$1,026	+$16.5	5–8h
9–12h	65	43.1%	+$476	+$7.3	8–11h
12–15h	84	52.4%	+$3,532	+$42.0	11–14h ★ BEST
15–18h	113	43.4%	-$770	-$6.8	14–17h
18–21h	99	35.4%	-$2,846	-$28.8	17–20h ✗ WORST
21–24h	72	36.1%	-$1,545	-$21.5	20–23h

liq 12–15h (EMEA afternoon + US open) is the standout best bucket. liq 18–21h mirrors NY_AFTERNOON perfectly and is the worst.

3.4 DoW × Session Heatmap — Notable Cells

Notable cells from the 5×7 (session × DoW) grid; cells with n < 5 are omitted:

DoW × Session	Trades	WR%	Net PnL	Label
Sun × LONDON_MORNING	13	85.0%	+$2,153	★ BEST CELL
Sun × LN_NY_OVERLAP	24	75.0%	+$2,110	2nd best
Tue × ASIA_PACIFIC	27	67.0%	+$2,522	3rd
Tue × LN_NY_OVERLAP	18	56.0%	+$2,260	4th
Sun × NY_AFTERNOON	17	6.0%	-$1,025	✗ WORST CELL
Mon × ASIA_PACIFIC	21	19.0%	-$411	avoid
Thu × LN_NY_OVERLAP	27	41.0%	-$3,310	✗ CATASTROPHIC

Sun NY_AFTERNOON (6% WR) is a near-perfect inverse signal. Thu LN_NY_OVERLAP has enough trades (27) to be considered reliable — biggest single-cell loss in the dataset.

3.5 15-Minute Slot Highlights (n ≥ 5)

Top positive slots by avg_pnl (n ≥ 5):

Slot	n	WR%	Net	Avg/trade
15:00	10	70.0%	+$2,266	+$226.58 ★
1:30	10	50.0%	+$1,607	+$160.67
11:30	8	87.5%	+$1,075	+$134.32
13:45	10	70.0%	+$1,082	+$108.21
1:45	5	80.0%	+$459	+$91.75

Top negative slots:

Slot	n	WR%	Net	Avg/trade
5:45	5	40.0%	-$1,665	-$333.05 ★
2:15	5	0.0%	-$852	-$170.31
16:30	4	25.0%	-$2,024	-$506.01 (n<5)
12:45	6	16.7%	-$1,178	-$196.35
18:00	6	16.7%	-$1,596	-$265.93

Caveat on slots: Many 15m slots have n = 4–10. Most are noise at current sample size. Weight slot_score low (10%) in composite.
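The n ≥ 5 rule can be sketched as a guard around the slot lookup. This is an illustration only: the helper names, the zero-padded "HH:MM" key format, and the two-row excerpt of the slot table are assumptions, not the contents of SLOT_STATS in esof_advisor.py.

```python
from datetime import datetime, timezone

BASELINE_WR = 43.7   # baseline WR% from section 3
MIN_SLOT_N = 5       # slots with fewer trades contribute nothing

# Excerpt of the tables above: slot -> (n, WR%). Hypothetical structure.
SLOT_STATS_EXCERPT = {"15:00": (10, 70.0), "16:30": (4, 25.0)}


def slot_key(ts: datetime) -> str:
    """Floor a timezone-aware timestamp to its 15-minute UTC slot."""
    ts = ts.astimezone(timezone.utc)
    return f"{ts.hour:02d}:{(ts.minute // 15) * 15:02d}"


def slot_score(key: str, stats=SLOT_STATS_EXCERPT) -> float:
    """Normalized slot score, or 0.0 when the slot is unknown or thin."""
    n, wr = stats.get(key, (0, BASELINE_WR))
    if n < MIN_SLOT_N:
        return 0.0
    return (wr - BASELINE_WR) / 20.0


ts = datetime(2026, 4, 16, 15, 7, tzinfo=timezone.utc)
print(slot_key(ts), slot_score(slot_key(ts)))
print("16:30", slot_score("16:30"))  # n=4, so the slot is ignored
```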

4. Advisory Scoring Model

4.1 Score Formula

sess_score  = (sess_wr  - 43.7) / 20.0    # ≈ [-1, +1] for WR within ±20pp of baseline (not clamped per factor)
liq_score   = (liq_wr   - 43.7) / 20.0
dow_score   = (dow_wr   - 43.7) / 20.0
slot_score  = (slot_wr  - 43.7) / 20.0    # if n≥5, else 0.0
cell_bonus  = (cell_wr  - 43.7) / 100.0 × 0.3   # within [-0.13, +0.17] for WR in [0, 100]

advisory_score = liq_score×0.30 + sess_score×0.25 + dow_score×0.30
               + slot_score×0.10 + cell_bonus×0.05

advisory_score = clamp(advisory_score, -1.0, +1.0)

# Mercury retrograde: additional -0.05 penalty
if mercury_retrograde:
    advisory_score = max(-1.0, advisory_score - 0.05)

Denominator 20.0 chosen because observed WR range across all factors is ≈ ±20pp from baseline.
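The formula can be written as a small runnable function. This is a sketch of the arithmetic only, not the code in esof_advisor.py; treating a missing slot or cell as a zero contribution is an assumption. The example uses the Sun 10:00 UTC inputs from section 3 (liq 9–12h bucket 43.1%, LONDON_MORNING 47.7%, Sunday 53.8%, Sun × LONDON_MORNING cell 85.0%).

```python
BASELINE_WR = 43.7


def advisory_score(liq_wr, sess_wr, dow_wr, slot_wr=None, cell_wr=None,
                   mercury_retrograde=False):
    """Weighted composite of WR deviations from baseline, clamped to [-1, +1]."""
    def norm(wr):
        return (wr - BASELINE_WR) / 20.0

    liq_score = norm(liq_wr)
    sess_score = norm(sess_wr)
    dow_score = norm(dow_wr)
    # Assumption: a slot with n < 5 or an omitted cell is passed as None.
    slot_score = norm(slot_wr) if slot_wr is not None else 0.0
    cell_bonus = ((cell_wr - BASELINE_WR) / 100.0 * 0.3
                  if cell_wr is not None else 0.0)

    score = (liq_score * 0.30 + sess_score * 0.25 + dow_score * 0.30
             + slot_score * 0.10 + cell_bonus * 0.05)
    score = max(-1.0, min(1.0, score))
    if mercury_retrograde:
        score = max(-1.0, score - 0.05)
    return score


# Sun 10:00 UTC, no qualifying 15m slot.
print(round(advisory_score(43.1, 47.7, 53.8, cell_wr=85.0), 4))
```

Note how little the 85% cell moves the result: its bonus is scaled by 0.3 and then weighted at 5%.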

4.2 Labels

Score range	Label
> +0.25	`FAVORABLE`
> +0.05	`MILD_POSITIVE`
> -0.05	`NEUTRAL`
> -0.25	`MILD_NEGATIVE`
≤ -0.25	`UNFAVORABLE`
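The label thresholds translate directly into a chain of strict comparisons. A minimal sketch with a hypothetical function name; boundary values fall into the lower label, as the table implies.

```python
def advisory_label(score: float) -> str:
    """Map an advisory score to the labels in section 4.2."""
    if score > 0.25:
        return "FAVORABLE"
    if score > 0.05:
        return "MILD_POSITIVE"
    if score > -0.05:
        return "NEUTRAL"
    if score > -0.25:
        return "MILD_NEGATIVE"
    return "UNFAVORABLE"


for s in (0.30, 0.25, 0.0, -0.05, -0.25):
    print(s, advisory_label(s))
```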

4.3 Weight Rationale

liq_hour (30%): More granular than session (uniform 3h buckets vs sessions of 3–8h, continuous). Captures EMEA-pm/US-open sweet spot cleanly.
DoW (30%): Strongest calendar factor in the data. Mon–Thu split is statistically robust (n=77–115).
Session (25%): Corroborates liq_hour. LONDON_MORNING/NY_AFTERNOON signal strong.
Slot 15m (10%): Useful signal but most slots have n < 10. Low weight appropriate until more data.
Cell DoW×Session (5%): Sun×LDN 85% WR is real but n=13 — kept at 5% to avoid overfitting.

5. Files Inventory

File	Purpose	Status
`Observability/esof_advisor.py`	Advisory daemon + importable `get_advisory()`	Active, v2
`Observability/dolphin_status.py`	Status panel — reads `esof_advisor_latest` from HZ	Wired (reads only)
`external_factors/esoteric_factors_service.py`	`MarketIndicators` — real weighted hours, moon, mercury	Source of truth
`external_factors/esof_prefect_flow.py`	Pushes astro data to HZ	Dormant (nothing consumes it)
`prod/tests/test_esof_advisor.py`	55-test suite (9 classes)	All passing (28s)
CH: `dolphin.esof_advisory`	Time-series advisory archive	Active, 90-day TTL

CH Table Schema

CREATE TABLE IF NOT EXISTS dolphin.esof_advisory (
  ts                  DateTime64(3, 'UTC'),
  dow                 UInt8,
  dow_name            LowCardinality(String),
  hour_utc            UInt8,
  slot_15m            String,
  session             LowCardinality(String),
  moon_illumination   Float32,
  moon_phase          LowCardinality(String),
  mercury_retrograde  UInt8,
  pop_weighted_hour   Float32,
  liq_weighted_hour   Float32,
  market_cycle_pos    Float32,
  fib_strength        Float32,
  slot_wr_pct         Float32,
  slot_net_pnl        Float32,
  session_wr_pct      Float32,
  session_net_pnl     Float32,
  dow_wr_pct          Float32,
  dow_net_pnl         Float32,
  advisory_score      Float32,
  advisory_label      LowCardinality(String)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(ts)
ORDER BY ts
TTL toDateTime(ts) + toIntervalDay(90);

6. HZ Integration

Key: DOLPHIN_FEATURES['esof_advisor_latest']
Format: JSON string (all fields from compute_esof() return dict)
Write cadence: Every 15 seconds by daemon; CH every 5 minutes
Reading (in dolphin_status.py):
```
esof = _get(hz, "DOLPHIN_FEATURES", "esof_advisor_latest")
```
Falls back to "(start esof_advisor.py for advisory)" when absent.

7. Starting the Daemon

source /home/dolphin/siloqy_env/bin/activate
python Observability/esof_advisor.py
# Options:
#   --once        compute once and exit
#   --interval N  seconds between updates (default 15)
#   --no-hz       skip HZ write
#   --no-ch       skip CH write

Daemon PID on last start: 2417597 (2026-04-19).

8. Test Suite — `prod/tests/test_esof_advisor.py`

55 tests, 9 classes, all passing (28.36s run, 2026-04-19).

Class	Tests	What it covers
`TestComputeEsofSchema`	5	All required keys present, score in [-1,+1], labels valid
`TestSessionClassification`	5	Boundary conditions for all 5 sessions
`TestWeightedHours`	4	Pop/liq hour in [0,24), ordering, monotone liq
`TestAdvisoryScoring`	7	Best/worst cell ordering, Mon<Tue, Sun>Mon, NY_AFT negative
`TestExpectancyTables`	6	Table integrity: all WR in [0,100], net aligned with WR
`TestMoonApproximation`	4	Phase labels, new moon Apr 17, full moon Apr 2, illumination range
`TestPublicAPI`	3	`get_advisory()` returns same schema, `--once` flag, daemon args
`TestHZIntegration`	8	HZ write/read roundtrip (skipped if HZ unavailable)
`TestCHIntegration`	13	CH insert/query/TTL (skipped if CH unavailable)

Key test fixtures used:

Fixture	datetime UTC	Why
`sun_london`	Sun 10:00	Best expected cell (WR 85%)
`thu_ovlp`	Thu 15:00	Thu OVLP catastrophic cell
`sun_ny`	Sun 18:00	Sun NY_AFT 6% WR inverse signal
`mon_asia`	Mon 03:00	Mon worst day
`tue_asia`	Tue 03:00	Tue vs Mon comparison
`midday_win`	Tue 12:30	liq 12–15h best bucket

9. Known Limitations and Research Notes

9.1 DoW × Slot Interaction (not modeled)

The current model treats DoW and Slot as independent, additive factors. This is incorrect in at least one known case: slot 15:00 has WR=70% overall (the best slot by avg_pnl), but Thursday 15:00 falls in the catastrophic Thu×LN_NY_OVERLAP cell (-$3,310). The additive model gives Thu 15:00 a slot score of +1.32, while the terms that should offset it barely move: Thursday's WR (44.3%) sits near baseline, so dow_score is only about +0.03, and the cell bonus carries 5% weight. The net result is weakly positive, which understates the risk.

Future work: Model DoW×Slot joint distribution when n ≥ 10 per cell (requires ~2,000 more trades).
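The Thu 15:00 case can be worked through term by term. A sketch using the section 4.1 weights and WR values copied from the section 3 tables; picking the 15–18h liq bucket assumes liq_hour ≈ UTC + 0.98h, and the variable names are illustrative.

```python
BASELINE_WR = 43.7

# WR% inputs for Thu 15:00 UTC, read from the section 3 tables.
wr = {"liq": 43.4,    # liq_hour 15-18h bucket
      "sess": 45.6,   # LN_NY_OVERLAP
      "dow": 44.3,    # Thursday
      "slot": 70.0}   # 15:00 slot (n=10)
cell_wr = 41.0        # Thu x LN_NY_OVERLAP

weights = {"liq": 0.30, "sess": 0.25, "dow": 0.30, "slot": 0.10}

# Weighted contribution of each additive term.
contrib = {k: (wr[k] - BASELINE_WR) / 20.0 * weights[k] for k in wr}
contrib["cell"] = (cell_wr - BASELINE_WR) / 100.0 * 0.3 * 0.05

total = sum(contrib.values())
for name, value in contrib.items():
    print(f"{name:5s} {value:+.4f}")
print(f"total {total:+.4f}")
```

The slot term accounts for most of the total, and the cell term is too small to offset it. That is the interaction the additive model misses.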

9.2 Sample Size Caveats

Factor	Min cell n	Confidence
Session	71 (LOW_LIQ)	High
DoW	77 (Tue)	High
liq_hour 3h	62 (6-9h)	Medium-High
DoW×Session	13 (Sun×LDN)	Medium
Slot 15m	4–19	Low–Medium

Rules of thumb: session + DoW patterns are reliable. Slot patterns are directional hints only until n ≥ 30.

9.3 Mercury Retrograde

Most recent period: 2026-03-07 → 2026-03-30 (ended). Next: 2026-06-29 → 2026-07-23. The -0.05 penalty is arbitrary: the 637-trade sample starts on 2026-03-31, after the last retrograde period ended, so it provides no empirical basis. Retain as a conservative prior.

9.4 Fibonacci Time

fib_strength = 1.0 - min(dist_to_nearest_fib_minute / 30.0, 1.0)

Currently not incorporated into the advisory score (computed but not weighted). No evidence from trade data. Track in CH for future regression.

9.5 Market Cycle Position

BTC halving reference: 2024-04-19. Current position: (days_since % 1461) / 1461.0. As of 2026-04-19, days_since = 730, so the position is 730/1461 ≈ 0.50 (two years post-halving, the midpoint of the 4-year cycle). Not in advisory score; tracked only.
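The cycle position is simple date arithmetic. A minimal sketch with a hypothetical function name, using the reference date and the 1461-day cycle length given above.

```python
from datetime import date

HALVING_REF = date(2024, 4, 19)   # reference date from the text
CYCLE_DAYS = 1461                 # 4 years including one leap day


def market_cycle_position(today: date) -> float:
    """Fraction of the 4-year cycle elapsed since the halving reference."""
    days_since = (today - HALVING_REF).days
    return (days_since % CYCLE_DAYS) / CYCLE_DAYS


d = date(2026, 4, 19)
print((d - HALVING_REF).days, round(market_cycle_position(d), 3))
```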

9.6 tradfi_open Flags

MarketIndicators.get_regional_times() returns is_tradfi_open per region. This signal is not yet used in scoring. Hypothesis: periods when 2+ major TradFi regions are simultaneously open may have better fill quality. Wire and test once more data exists.

10. Future Wiring Into BLUE Engine

DO NOT wire until validated with more data. The following describes the intended integration, NOT current state.

Proposed gating logic (research phase):

# In esf_alpha_orchestrator._try_entry() (FUTURE ONLY)
advisory = get_advisory()   # from esof_advisor.py
if advisory["advisory_label"] == "UNFAVORABLE":
    # Option A: skip entry entirely
    return None
    # Option B (alternative to A, not in addition): reduce sizing by 50%
    # size_mult *= 0.5

Preconditions before wiring:

Accumulate ≥ 1,500 trades across all sessions/DoW (currently 637)
DoW slot interaction modeled or explicitly neutralized
NY_AFTERNOON pattern holds on next 500 trades (current WR=35.4% robust across all 127 trades, so likely durable)
Backtest: filter UNFAVORABLE periods → measure ROI uplift vs full universe
Unit test: advisory gate does not block >20% of entry opportunities

Suggested first gate (lowest risk):

Block entries when all three hold simultaneously:

dow in (0, 3) (Mon or Thu)
session == "NY_AFTERNOON"
advisory_score < -0.25

This is the intersection of the three worst factors, blocking the highest-conviction negative cells only.
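The suggested first gate is a three-way conjunction, which can be sketched as a pure predicate. The function name is hypothetical; dow follows Python's weekday() convention (0 = Mon), as in the list above.

```python
def first_gate_blocks(dow: int, session: str, advisory_score: float) -> bool:
    """True only when all three conditions of the suggested first gate hold."""
    return (dow in (0, 3)                      # Mon or Thu
            and session == "NY_AFTERNOON"
            and advisory_score < -0.25)


print(first_gate_blocks(3, "NY_AFTERNOON", -0.31))    # all three hold
print(first_gate_blocks(3, "LN_NY_OVERLAP", -0.31))   # wrong session
print(first_gate_blocks(1, "NY_AFTERNOON", -0.31))    # Tuesday
```

Because the conditions are ANDed, any single favorable factor lets the entry through, which keeps the gate narrow.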

11. Update Cadence

Update SLOT_STATS, SESSION_STATS, DOW_STATS, LIQ_HOUR_STATS, DOW_SESSION_STATS in esof_advisor.py:

-- Pull fresh session stats from CH:
SELECT session,
       count() as trades,
       round(100.0 * countIf(pnl > 0) / count(), 1) as wr_pct,
       round(sum(pnl), 2) as net_pnl,
       round(avg(pnl), 2) as avg_pnl
FROM dolphin.trade_events
WHERE strategy = 'blue'
GROUP BY session
ORDER BY session;

-- DoW stats:
SELECT toDayOfWeek(ts) - 1 as dow,  -- 0=Mon in Python weekday()
       count(), round(100*countIf(pnl>0)/count(),1), round(sum(pnl),2), round(avg(pnl),2)
FROM dolphin.trade_events WHERE strategy='blue'
GROUP BY dow ORDER BY dow;

-- 15m slot stats (n>=5):
SELECT slot_15m, count(), round(100*countIf(pnl>0)/count(),1), round(sum(pnl),2), round(avg(pnl),2)
FROM (
  SELECT toStartOfFifteenMinutes(ts) as slot_ts,
         formatDateTime(slot_ts, '%H:%i') as slot_15m,  -- %i = minute; on ClickHouse 23.4+ %M is the month name by default
         pnl
  FROM dolphin.trade_events WHERE strategy='blue'
)
GROUP BY slot_15m HAVING count() >= 5
ORDER BY slot_15m;

Suggested refresh: when cumulative trade count crosses 1000, 1500, 2000.

12. Gate Strategy Empirical Testing — 2026-04-20

12.1 Test Infrastructure

Four new files created:

File	Purpose
`Observability/esof_gate.py`	Pure gate strategy functions (no I/O). `GateResult` dataclass: action, lev_mult, reason, s6_mult, irp_params
`prod/tests/test_esof_gate_strategies.py`	CH-based strategy simulation + 39 unit tests, all passing
`prod/tests/test_esof_overfit_guard.py`	24 industry-standard overfitting avoidance tests (6 intentionally fail — guard working)
`prod/tests/run_esof_backtest_sim.py`	56-day gold-engine simulation over vbt_cache parquets

12.2 Clean Alpha Exit Definition

For all strategy testing, only FIXED_TP and MAX_HOLD exits are counted. Excluded:

HIBERNATE_HALT — forced position close, not alpha signal
SUBDAY_ACB_NORMALIZATION — control-plane forced, not alpha-driven

This reduces the 588-trade raw CH dataset to 549 clean alpha trades.

12.3 Strategies Tested (A–F)

ID	Strategy	Mechanism
A	`LEV_SCALE`	Scale leverage by advisory score: FAVORABLE→1.2×, MILD_POS→1.0×, NEUTRAL→0.8×, MILD_NEG→0.6×, UNFAVORABLE→0.5×
B	`HARD_BLOCK`	Block entry when `advisory_label == "UNFAVORABLE"`
C	`DOW_BLOCK`	Block when `dow in (0, 3)` (Mon, Thu)
D	`SESSION_BLOCK`	Block when `session == "NY_AFTERNOON"`
E	`COMBINED`	Block when UNFAVORABLE or (Mon/Thu and NY_AFTERNOON)
F	`S6_BUCKET`	Per-bucket sizing multipliers keyed by EsoF label (5 labels × 7 buckets). Widened FAVORABLE, zeroed UNFAVORABLE buckets

Counterfactual PnL methodology: cf_pnl = actual_pnl × lev_mult (linear scaling; valid only for FIXED_TP and MAX_HOLD exits, where PnL scales linearly with leverage).
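A minimal sketch of the Strategy A counterfactual, using the multipliers from the table above. The helper name and the choice to return None for non-clean exits are assumptions; the three sample PnL values are the UNFAVORABLE MAX_HOLD trades reported in §12.9.

```python
# Strategy A multipliers keyed by advisory label (from the table above).
LEV_MULT = {
    "FAVORABLE": 1.2,
    "MILD_POSITIVE": 1.0,
    "NEUTRAL": 0.8,
    "MILD_NEGATIVE": 0.6,
    "UNFAVORABLE": 0.5,
}
CLEAN_EXITS = {"FIXED_TP", "MAX_HOLD"}


def counterfactual_pnl(actual_pnl, label, exit_reason):
    """cf_pnl = actual_pnl * lev_mult, defined only for clean alpha exits."""
    if exit_reason not in CLEAN_EXITS:
        return None  # linear scaling is not assumed to hold here
    return actual_pnl * LEV_MULT[label]


trades = [(-91.0, "UNFAVORABLE", "MAX_HOLD"),
          (-109.0, "UNFAVORABLE", "MAX_HOLD"),
          (-355.0, "UNFAVORABLE", "MAX_HOLD")]
cf_total = sum(counterfactual_pnl(*t) for t in trades)
print(cf_total)
```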

12.4 Posture Clarification — BLUE Is Effectively APEX-Only

User confirmed, code verified. Live BLUE posture distribution from CH:

APEX:    586 trades (99.8%)
STALKER:   1 trade  (0.2%)
TURTLE:    0
HIBERNATE: 0

dolphin_actor.py reads posture from HZ DOLPHIN_SAFETY. STALKER applies a 2.0× leverage ceiling but does not block entries. TURTLE/HIBERNATE set regime_dd_halt = True (blocks entries for the day) — but these states occur essentially never in the current deployment window.

Implication: The live CH trade session/DoW distribution is NOT shaped by posture transitions. The session distribution is a genuine trading behavior signal.

12.5 56-Day Gold Backtest — Why It Is Invalid for EsoF Session Analysis

run_esof_backtest_sim.py ran the gold-spec LiquidationGuardEngine over 56 vbt_cache parquet days (2025-12-31 → 2026-02-26). Gold match: 2155 trades, ROI=+190.19% (confirming engine correctness).

Session distribution in backtest:

Session	n	%
ASIA_PACIFIC	2120	98.4%
All others	35	1.6%

Root cause: vbt_cache parquets are 10-second bars (~8208 bars/day). Engine lookback (~100 bars) completes in ~17 minutes from midnight. Entries fire at ~00:17 UTC (hour 0 = ASIA_PACIFIC). Single-position-per-asset plus MAX_HOLD=125 bars (~21 min) means 98% of all trades fire within the first hour of the day, before ASIA_PACIFIC ends at 08:00 UTC.

Confirmed by direct inspection: entry_ts.hour == 0 for 2108/2155 trades.
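The timing figures follow from the bar size. A short arithmetic check using only numbers quoted above (10-second bars, ~100-bar lookback, MAX_HOLD=125 bars, 2108 of 2155 entries in hour 0); the variable names are illustrative.

```python
BAR_SECONDS = 10
LOOKBACK_BARS = 100      # approximate, per the text
MAX_HOLD_BARS = 125

lookback_min = LOOKBACK_BARS * BAR_SECONDS / 60.0   # first possible entry
max_hold_min = MAX_HOLD_BARS * BAR_SECONDS / 60.0   # longest hold
hour0_share = 2108 / 2155                           # entries with hour == 0

print(f"lookback  ~{lookback_min:.1f} min after midnight")
print(f"max hold  ~{max_hold_min:.1f} min")
print(f"hour-0 entries {100 * hour0_share:.1f}%")
```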

Consequence: The backtest's EsoF label distribution is:

Label	n	%
NEUTRAL	863	40.0%
MILD_POSITIVE	668	31.0%
MILD_NEGATIVE	326	15.1%
UNFAVORABLE	297	13.8%
FAVORABLE	1	0.05%

With FAVORABLE=1 and score range [-0.33, +0.27] (mean≈0), there is no variance for EsoF signals to differentiate. Walk-forward test: top-quartile score WR=50.2% vs bottom=52.0% — zero predictive power. This is expected when the entire dataset is concentrated at midnight/ASIA_PACIFIC.

The 56-day backtest is NOT a valid testbed for EsoF strategy evaluation. It is valid only for confirming gold-engine mechanics (which it does: ROI=+190.19%).

The live CH 549-trade dataset (spread across all sessions/DoW) is the primary evidence base.

Critical caveat on the backtest ROI itself: because all trades fire at ~00:17 UTC, the backtest is testing "midnight-only BLUE" — not live BLUE. Live BLUE hour-0 entry performance: WR=55%, avg_pnl=-$3.92 (negative avg). The backtest +190.19% ≈ live gold +189.48% is numerically consistent, but this coincidence could mask canceling biases. The backtest validates that the vel_div signal produces positive EV and that engine mechanics are consistent; it does NOT validate the exact ROI figure under live intraday conditions. The backtest cannot account for the intraday session/DoW effects that EsoF is designed to capture — this is precisely the limitation that motivated the EsoF project in the first place.

12.6 CH-Based Strategy Results (549 Clean Alpha Trades)

Baseline: WR=47.4%, Net=+$3,103

Strategy	T_exec	T_blk	CF Net	ΔPnL
A: LEV_SCALE	549	0	+$3,971	+$868
B: HARD_BLOCK	490	59	+$5,922	+$2,819
C: DOW_BLOCK	375	174	+$3,561	+$458
D: SESSION_BLOCK	422	127	+$6,960	+$3,857
E: COMBINED	340	209	+$7,085	+$3,982

Note: Strategy F (S6_BUCKET) is separately treated in §12.7.

12.7 FAVORABLE vs UNFAVORABLE — Statistical Evidence

From 588 CH trades (all clean exits), EsoF label performance:

Label	n	WR%	Net PnL	Avg/trade
FAVORABLE	84	78.6%	+$11,889	+$141.54
MILD_POSITIVE	190	55.8%	+$1,620	+$8.53
NEUTRAL	93	24.7%	-$5,574	-$59.94
MILD_NEGATIVE	162	42.6%	-$1,937	-$11.96
UNFAVORABLE	59	28.8%	-$2,819	-$47.78

FAVORABLE vs UNFAVORABLE statistical test:

Metric	Value
FAVORABLE wins/losses	66 / 18
UNFAVORABLE wins/losses	17 / 42
Odds ratio	9.06×
Cohen's h	1.046 (large, threshold ≥ 0.80)
χ² (df=1)	35.23 (p < 0.0001; critical value at p<0.001 = 10.83)

This is statistically robust. The FAVORABLE/UNFAVORABLE split is not noise at n=143 (84 + 59).
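The three statistics can be reproduced from the win/loss counts alone. A sketch with a hypothetical function name, assuming the χ² is the plain Pearson statistic for a 2×2 table without continuity correction.

```python
import math


def two_by_two_stats(wins_a, losses_a, wins_b, losses_b):
    """Odds ratio, Cohen's h, and Pearson chi-square for a 2x2 win/loss table."""
    n = wins_a + losses_a + wins_b + losses_b
    odds_ratio = (wins_a / losses_a) / (wins_b / losses_b)
    p_a = wins_a / (wins_a + losses_a)
    p_b = wins_b / (wins_b + losses_b)
    cohens_h = 2 * math.asin(math.sqrt(p_a)) - 2 * math.asin(math.sqrt(p_b))
    chi2 = (n * (wins_a * losses_b - losses_a * wins_b) ** 2
            / ((wins_a + losses_a) * (wins_b + losses_b)
               * (wins_a + wins_b) * (losses_a + losses_b)))
    return n, odds_ratio, cohens_h, chi2


# FAVORABLE 66/18 vs UNFAVORABLE 17/42, from the table above.
n, odds_ratio, cohens_h, chi2 = two_by_two_stats(66, 18, 17, 42)
print(n, round(odds_ratio, 2), round(cohens_h, 3), round(chi2, 2))
```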

Strategy A on UNFAVORABLE at 0.5× leverage: saves ~$1,409 vs actual -$2,819. Hard block of UNFAVORABLE: saves $2,819 (full elimination of the negative label bucket).

12.8 The NEUTRAL Label Anomaly

NEUTRAL (score between -0.05 and +0.05) shows WR=24.7% — worse than UNFAVORABLE (28.8%). This is counterintuitive.

Investigation:

All 93 NEUTRAL trades are from April 2026 (the current month)
NEUTRAL ASIA_PACIFIC subset: WR=14.7% (n=34)
Score range: -0.048 to +0.049

Interpretation: A score near zero does NOT mean "safe middle ground." It means the positive and negative calendar signals are canceling each other — signal conflict. In the current April 2026 market regime, that conflict is associated with the worst outcomes. "Mixed signals = proceed with caution" is the correct read.

This is not a scoring bug. The advisory score near 0 should be treated with the same caution as MILD_NEGATIVE, not as a neutral baseline. Consider re-labeling NEUTRAL to "UNCLEAR" in future documentation to avoid miscommunication.

Month breakdown of labels:

Month	FAVORABLE	MILD_POS	NEUTRAL	MILD_NEG	UNFAVORABLE
2026-03	7	4	0	0	0
2026-04	77	186	93	162	59

March data is sparse (11 trades). The full analysis is effectively April 2026.

12.9 Live Real-Time Validation — 2026-04-20

Three trades observed in-session, all during advisory_label = "UNFAVORABLE" (Monday × LONDON_MORNING 08:45–09:40 UTC):

XRPUSDT   ep:1.412   lev:9.00x  pnl:-$91    exit:MAX_HOLD  bars:125  08:45 UTC
TRXUSDT   ep:0.3295  lev:9.00x  pnl:-$109   exit:MAX_HOLD  bars:125  09:15 UTC
CELRUSDT  ep:0.002548 lev:9.00x pnl:-$355   exit:MAX_HOLD  bars:125  09:40 UTC

Combined actual loss: -$555
At Strategy A (0.5× on UNFAVORABLE): counterfactual loss ≈ -$277 (saves $278)
At Strategy B (hard block): $0 loss (saves $555)

This is consistent with UNFAVORABLE WR=28.8% and avg=-$47.78. Three MAX_HOLD losses in a row during a confirmed UNFAVORABLE window is the expected behavior, not an anomaly.

12.10 Overfitting Guard Summary

prod/tests/test_esof_overfit_guard.py — 24 tests, 9 classes.

From the 549-trade CH dataset:

Test	Result	Verdict
NY_AFT permutation p-value	0.035	Significant (p<0.05)
NY_AFT net PnL 95% CI	[-$6,459, -$655]	Net loser, CI excludes 0
NY_AFT Cohen's h	0.089	Trivial — loss is magnitude, not WR
Monday permutation p-value	0.226	Underpowered (n=34 in H1)
Walk-forward score→WR	Top-Q H2 WR=73.5% vs Bot=35.3%	Strong
FAVORABLE vs UNFAVORABLE χ²	35.23	p < 0.0001

6 tests intentionally fail (the guard is working — they flag genuine limitations):

Bonferroni z-scores on per-cell WR do not clear threshold at n=549
Bootstrap CI on NY_AFT WR overlaps baseline WR
Cohen's h for NY_AFT WR is trivial (loss is from outlier magnitude trades)

These are not bugs. They represent real data limitations. Do not patch them to pass.
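For readers unfamiliar with the permutation p-values in the table, here is a generic sketch of the technique. It is not the code in test_esof_overfit_guard.py: the statistic (mean PnL of the group), the one-sided direction, and the synthetic numbers are all assumptions for illustration.

```python
import random


def permutation_p_value(group, rest, n_perm=2000, seed=0):
    """One-sided p-value: share of random relabellings whose group mean
    is at least as low as the observed group mean."""
    rng = random.Random(seed)
    pooled = list(group) + list(rest)
    k = len(group)
    observed = sum(group) / k
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if sum(pooled[:k]) / k <= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_perm + 1)


# Synthetic PnL values, NOT trade data: a clearly worse group vs the rest.
bad_session = [-50.0] * 10
other = [10.0] * 40
print(permutation_p_value(bad_session, other))
```

A small p-value means random relabelling rarely produces a group this bad. It says nothing about effect size, which is why the table also reports Cohen's h.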

12.11 Recommendation (as of 2026-04-20)

Wire Strategy A (LEV_SCALE) as the first live gate. Rationale:

χ²=35.23 (p<0.0001) on FAVORABLE/UNFAVORABLE is robust at current sample size
Cohen's h=1.046 is a large effect — not a marginal signal
Strategy A is soft (leverage reduction, no hard blocks) — runs BLUE ungated by default, calibrates EsoF tables from all trades
Live 2026-04-20 observation (3 UNFAVORABLE MAX_HOLD losses) confirms the signal in real time

Do NOT wire hard block (Strategy B/D/E) yet. The walk-forward WR separation for NEUTRAL and MILD_NEGATIVE is not yet confirmed robust. Hard blocks increase regime sensitivity.

Feedback loop protocol (must not be violated):

Always run BLUE ungated for base signal collection
EsoF calibration tables (SESSION_STATS, DOW_STATS, etc.) updated ONLY from ungated trades
Gate evaluated on out-of-sample ungated data — never feed gated trades back into calibration
If Strategy A is wired: evaluate its counterfactual on ungated trades only, not on the leverage-adjusted subset

Preconditions to upgrade to Strategy B (hard block):

n ≥ 1,000 clean alpha trades with UNFAVORABLE label
UNFAVORABLE WR remains ≤ 35% at the new n
Walk-forward on separate 90-day window confirms WR separation
No regime break identified (e.g., FAVORABLE WR degrading to <60% would trigger review)
