diff --git a/PINK_DITAv2_E2E_TRACE_ANALYSIS.md b/PINK_DITAv2_E2E_TRACE_ANALYSIS.md index 8eb54e1..d8f5ad4 100644 --- a/PINK_DITAv2_E2E_TRACE_ANALYSIS.md +++ b/PINK_DITAv2_E2E_TRACE_ANALYSIS.md @@ -6776,3 +6776,362 @@ The `lot_step` field is declared with a default of `0.001` but is never read by | T | Pass 17 (Unsafe Review/Dead Code/Build/Protocols) | 14 | 0 | 5 | 5 | 4 | 0 | | U | Pass 18 (Rust Test Gaps/Accounting/FFI Types) | 14 | 3 | 4 | 4 | 3 | 0 | | **Total** | | **333** | **30** | **99** | **96** | **64** | **37** | + +--- + +## PASS 19 — STARTUP/SHUTDOWN LIFECYCLE, RUST KERNEL SUBTLETIES, GENERATED TEST INFRA + +### V1: `DITAv2LauncherBundle.close()` never calls `kernel.close()` — Rust kernel handle always leaks via `__del__` + +**File:** `launcher.py:40-42` + +```python +def close(self) -> None: + _maybe_close(self.venue) + _maybe_close(self.zinc_plane) + _maybe_close(self.control_plane) + # NOTE: self.kernel is never closed! +``` + +The bundle's `close()` method closes the venue, zinc plane, and control plane — but **not the kernel**. The `ExecutionKernel` Rust handle (`self._backend`) is only freed when Python's garbage collector calls `__del__`, which is non-deterministic. + +In a `with` block or explicit `bundle.close()` pattern, the Rust handle survives until the `ExecutionKernel` object's refcount drops to zero (usually when the bundle goes out of scope). If the caller holds a reference to the kernel (e.g., `k = bundle.kernel` for inspection), the handle lives indefinitely. + +Compare with `_build_rb()` in test code (gen2.py:326-338) which creates a `Shim` that has a `close` method calling `bundle.close()` and `bundle.kernel.close()`. The production `DITAv2LauncherBundle` has no equivalent. + +**Severity: Critical** + +### V2: `BingxVenueAdapter` has no `close()` or `disconnect()` — ThreadPoolExecutor, HTTP client sessions, connections never released + +**File:** `bingx_venue.py` (entire file) + +`BingxVenueAdapter` is the synchronous REST adapter to BingX. It has no `close()`, `disconnect()`, or any cleanup method. The `_maybe_close()` in the launcher tries `.close()` → `AttributeError` (caught), then `.disconnect()` → `AttributeError` (caught). **Nothing happens.** + +Leaked resources: +- **ThreadPoolExecutor** (class-level `_EXECUTOR`, 3 threads) — never shut down +- **BingxDirectExecutionAdapter backend** — holds an `aiohttp.ClientSession` with TCP connection pool (`limit=4`) — never closed +- Any in-flight HTTP connections — abandoned at process exit + +This is a compounding of R1 and R2 from Pass 15. It's the single largest resource leak in the system. + +**Severity: Critical** + +### V3: `process_intent` ENTER path does NOT clear `seen_event_ids` — old trade's dedup set pollutes new trade + +**File:** `_rust_kernel/src/lib.rs:1260-1290` (ENTER path) + +```rust +// ENTER path — slot reuse: +slot.trade_id = intent.trade_id.clone(); +slot.asset = intent.asset.clone(); +slot.entry_price = 0.0; +slot.size = 0.0; +slot.initial_size = 0.0; +slot.unrealized_pnl = 0.0; +slot.realized_pnl = 0.0; +slot.active_leg_index = 0; +slot.active_entry_order = None; +slot.active_exit_order = None; +slot.close_reason.clear(); +slot.closed = false; +slot.last_event_time = None; +// 🔴 seen_event_ids is NOT cleared — survives from old trade +slot.fsm_state = TradeStage::ORDER_REQUESTED; +``` + +When a slot is reused by a new ENTER intent, the `seen_event_ids` vector from the **previous trade** survives. If any event_id from the new trade collides with one observed by the old trade, the event is rejected as `DUPLICATE_EVENT` and silently dropped. + +**Example:** +1. Trade A in slot 0 receives events `evt-001`, `evt-002`, `evt-003`. All stored in `seen_event_ids`. +2. Trade A completes. Slot 0 returns to `IDLE`. +3. Trade B enters on slot 0. `seen_event_ids` still contains `evt-001`, `evt-002`, `evt-003`. +4. Exchange sends `evt-001` for Trade B (a legitimate new event with a reused event_id — possible on exchanges that recycle IDs daily). +5. Rust kernel sees `evt-001` in `seen_event_ids` → rejects as `DUPLICATE_EVENT`. The fill/ack is lost. + +The probability of collision depends on the exchange's event ID scheme. If event IDs are UUIDs, the risk is negligible. If event IDs are daily counters (e.g., `"20260601-001"` → `"20260602-001"`), the collision risk is **guaranteed** on every day boundary. + +**Fix:** Add `slot.seen_event_ids.clear();` in the ENTER path. + +**Severity: High** + +### V4: Three generators (`gen2.py`, `gen_live_tests.py`, `_gen_test.py`) all write to same output file — last writer wins, incompatible prologues + +**Files:** `gen2.py:431-432`, `gen_live_tests.py:680-681`, `_gen_test.py:1233-1234` + +All three generator scripts write to the same file: +``` +/mnt/dolphinng5_predict/prod/tests/test_pink_bingx_dita_live_e2e.py +``` + +Each produces **different prologues** with incompatible import styles, helper function signatures, and test runner implementations: +- `gen2.py`: Uses `RB`/`Shim` tuple, `_si()` helper, `_flatten()` helper +- `gen_live_tests.py`: Uses `_RuntimeBundle`/`_RuntimeShim` dataclass, `_run_scenario()` runner, `_flatten_via_kernel_intent()` helper +- `_gen_test.py`: Uses scenario-style `B()` body definitions, single parametrized `test_pink_ditav2` function + +The 68 named body functions are **identical** across gen2.py and gen_live_tests.py but their signatures differ (`(k, symbol, p)` vs `(bundle, client, symbol, snap)`). Running gen2.py then gen_live_tests.py produces a file where the bodies don't match the runner — the file compiles (Python sees valid functions) but silently tests nothing meaningful. + +`_build_pink_extended.py` and `_build_pink_bodies.py` then mutate the same file in-place with `str.replace()` — which silently does nothing if the expected format doesn't match. + +**Severity: Critical** + +### V5: Generated tests are triple env-gated — never run in CI, making them dead code + +**File:** `gen2.py` (generated output), `gen_live_tests.py` (generated output), `_gen_test.py` (generated output) + +Every generated test file has three `pytest.skip()` guards at the top: +```python +if not os.environ.get("BINGX_SMOKE_LIVE"): pytest.skip("...") +if not os.environ.get("BINGX_SMOKE_ALLOW_TRADE"): pytest.skip("...") +if not os.environ.get("PINK_DITA_E2E"): pytest.skip("...") +``` + +These three env vars are **never set in CI**. The CI pipeline has no step that sets them. As a result: +- 68 test functions from gen2.py → always skipped +- 70 test functions from gen_live_tests.py → always skipped +- 157 body scenarios from _gen_test.py → always skipped +- ~295 combined test scenarios → **zero executed in CI** + +Only the mock-venue tests (`test_flaws.py`, `test_account_core_v2.py`, `test_account_reconcile_faults.py`, `test_exchange_event_seam_parity.py`, `test_kernel_reliability.py`, `test_kernel_fee_friction.py`, `test_pink_clickhouse_phase4.py`, `test_alpha_blue_untouched_g7.py`) actually run in CI. + +**Severity: Critical** + +### V6: `ExecutionKernel.close()` destroys Rust handle immediately — no drain of in-flight intents, no wait for pending FFI calls, no state flush + +**File:** `rust_backend.py:320-330` + +```python +def close(self) -> None: + backend = self._backend + if backend is not None: + self._backend = None + try: + _get_rust().destroy(backend) # immediately frees Rust kernel memory + except Exception: + pass +``` + +`destroy(backend)` calls `dita_kernel_destroy(handle)` which calls `drop(Box::from_raw(handle))`, freeing the entire `KernelHandle` heap allocation including `KernelCore`, all slots, all account state. + +If `process_intent()` or `on_venue_event()` has been called from another thread and is mid-execution in Rust when `close()` fires: +1. Rust's state machine is destroyed mid-transition — dangling `self` reference +2. Any HTTP calls (venue.submit()) already in-flight complete, but their results are never processed +3. The Rust `Box::from_raw` calls `drop()` while another thread may be holding a reference to `core` — **use-after-free UB** +4. `_last_settled_pnl` dict is orphaned +5. The zinc plane, projection, and account state are all inconsistent with the destroyed kernel + +There is no cancel/token mechanism, no pending-call queue drain, and no state flush before destroy. + +**Severity: Critical** + +### V7: `_last_settled_pnl` dict accessed from both `process_intent` and `on_venue_event` without locks — thread-unsafe + +**File:** `rust_backend.py:440,475` + +```python +# process_intent (line 440): +self._last_settled_pnl[intent.slot_id] = 0.0 + +# on_venue_event (line 475): +incremental_pnl = slot.realized_pnl - self._last_settled_pnl.get(slot.slot_id, 0.0) +self._last_settled_pnl[slot.slot_id] = slot.realized_pnl +``` + +`process_intent()` (called from the async event loop or scheduler) and `on_venue_event()` (called from the venue event stream callback) both read and write `_last_settled_pnl` without any synchronization primitive. If these run on different threads, the dict can experience: +- **Lost update**: Two concurrent writes to the same `slot_id` — one overwrites the other +- **Dict corruption**: Python dict isn't thread-safe for concurrent writes — can produce `KeyError` on iteration or silently drop entries +- **Incorrect PnL settlement**: The incremental PnL calculation uses stale values + +In the current single-threaded async architecture, this isn't triggered. But the lack of protection means any future multi-threaded usage introduces a data race. + +**Severity: Medium** + +### V8: `#[serde(default)]` on `leverage: f64` defaults to 0.0 — `mark_price()` uses leverage directly without `.max(1.0)`, silent accounting error + +**File:** `_rust_kernel/src/lib.rs:400-408` + +```rust +fn mark_price(&mut self, price: f64) { + // ... + if self.entry_price <= 0.0 || self.size <= 0.0 { return; } + let mut delta = (price - self.entry_price) / self.entry_price; + if self.side == TradeSide::SHORT { delta = -delta; } + self.unrealized_pnl = delta * self.size * self.entry_price * self.leverage; // uses leverage directly +} +``` + +The `TradeSlot` struct has `#[serde(default)] leverage: f64` which defaults to `0.0` when deserialized from JSON without a `leverage` field. + +`mark_price()` uses `self.leverage` directly in `unrealized_pnl` computation — **no `.max(1.0)` guard**. If leverage is `0.0` (from missing JSON field, `set_slot_json`, or snapshot restore without leverage), `unrealized_pnl` is always `0.0` regardless of price movement. + +Compare with `realized_pnl()` which correctly guards: +```rust +let notional = exit_size * slot.entry_price * slot.leverage.max(1.0); // correct +``` + +The `process_intent` ENTER path handles this (sets leverage to 1.0 if ≤ 0), but `set_slot_json` and `restore_full_snapshot` can bypass this and store leverage=0.0 directly into the slot. + +**Impact:** Any slot restored from a snapshot that predates the `leverage` field (or saved by a version without it) gets `leverage=0.0` silently. `unrealized_pnl` is always 0.0 — the operator sees no mark-to-market PnL even though the position has moved. This PnL is only realized on close (via `realized_pnl` which correctly uses `.max(1.0)`), so the total PnL is correct but the intra-period mark appears flat — a silent accounting error. + +**Severity: Medium** + +### V9: No `conftest.py`, no `pytest.ini`, no `asyncio_mode` configuration — test discovery relies on default pytest behavior + +**File:** (missing — no pytest configuration in workspace) + +The workspace has **zero pytest configuration files**. No `conftest.py`, `pytest.ini`, `setup.cfg`, `pyproject.toml` with pytest settings. This means: + +- **`asyncio_mode`** defaults to `"strict"` (pytest 8.x+), but all test files use `asyncio.run()` inline rather than `@pytest.mark.asyncio`. In strict mode, `async def` test functions are **not discovered** unless explicitly marked. Since all current tests are `sync def` wrapping `asyncio.run()`, this works — but the discovery relies on this specific pattern holding. +- No **timeout** configuration — a hanging test blocks the entire suite until killed +- No **test filtering** markers (no `slow`, `live`, `offline` markers) +- No **shared fixtures** — each test file repeats the `_build_rb()` setup pattern +- No **cache** or **rerun** configuration — flaky tests fail the CI suite + +If a developer adds a new test using `async def` + `await` (the natural pattern for async code) without `@pytest.mark.asyncio`, the test silently doesn't run — a false negative. + +**Severity: High** + +### V10: `kernel.close()` has `except Exception: pass` — silently swallows all destroy errors + +**File:** `rust_backend.py:325` + +```python +try: + _get_rust().destroy(backend) +except Exception: + pass # silently swallows RuntimeError, OSError, ctypes errors +``` + +If `dita_kernel_destroy` fails (e.g., segfault caught by `catch_unwind` returning an error, or the shared library was unloaded), the exception is silently consumed. The caller believes the kernel was closed successfully — but the handle may still be allocated. This is a leak path that's invisible to monitoring. + +**Severity: Low** + +### V11: `build_launcher_bundle()` has no cleanup for partially-created components — kernel OOM orphans venue, zinc, control plane + +**File:** `launcher.py:163-223` + +The build order is: +1. Control plane ✅ built +2. Projection ✅ built +3. Zinc plane ✅ built +4. Venue ✅ built +5. **ExecutionKernel** — if this fails (OOM, cargo build fail, CDLL load), components 1-4 are all orphaned + +There's no `try/finally` around the sequence. If `ExecutionKernel.__init__()` raises, the five already-built components (control plane, projection, zinc plane, venue, and any partially-initialized kernel state) leak. No cleanup code exists. + +**Fix:** Use a context manager or `try/finally` to close created components if a later one fails. + +**Severity: Medium** + +### V12: `KernelResult` clones entire kernel state (all slots + indexes + AccountState) on every FFI call — performance issue + +**File:** `_rust_kernel/src/lib.rs:1030-1050` (snapshot method) + +Every FFI call to `process_intent()` or `on_venue_event()` returns a `KernelResult` containing a full `snapshot()`: + +```rust +fn snapshot(&self) -> KernelSnapshot { + KernelSnapshot { + slots: self.slots.clone(), // O(n) clone ALL TradeSlots + active_trade_index: self.active_trade_index.clone(), // clone entire HashMap + venue_order_index: self.venue_order_index.clone(), + client_order_index: self.client_order_index.clone(), + account: self.account.clone(), // clone entire AccountState + } +} +``` + +For a kernel with 10 slots, each with metadata, seen_event_ids (1024 IDs), nested VenueOrders, and a full AccountState with fee_config, this produces **thousands of heap allocations per FFI call**. At 10 intents/second, this is tens of thousands of allocations/second. + +The Python side only reads `KernelResult.outcome` and `KernelResult.slot` (the single affected slot) from each response — **the full snapshot is never used**. The `snapshot` field is transmitted and decoded by Python (`json.loads` parses all of it) but the caller only accesses `outcome` and `slot`. The snapshot data is generated, serialized, transmitted, deserialized, and **silently discarded** on every call. + +**Severity: Medium** (performance, not correctness) + +### V13: `_build_rb()` leaks bundle on any post-creation failure — Shim construction failure orphans kernel and venue + +**File:** `gen2.py:326-338`, `_gen_test.py:69-80` + +```python +def _build_rb(ic=25000.0, max_slots=1): + cfg = _build_config(ic) + b = build_launcher_bundle(...) # bundle created with active kernel, venue, etc. + k = b.kernel + k.account.snapshot.capital = ic # <-- if this raises (e.g., AttributeError) + ... # Shim construction, etc. + return RB(runtime=Shim(k), config=cfg) +``` + +If any line after `build_launcher_bundle()` raises, the bundle `b` (with live kernel, venue connections, zinc plane) is leaked. No `try/finally` to call `b.close()`. In test code this is acceptable (process exits soon), but if test suites grow large, accumulated leaked bundles exhaust kernel slots or file descriptors. + +**Severity: Low** + +### V14: `_maybe_close` uses `break` after first method match — only tries `close` OR `disconnect`, never both + +**File:** `launcher.py:67` + +```python +for method_name in ("close", "disconnect"): + method = getattr(obj, method_name, None) + if callable(method): + ... + break # <-- breaks after first successful match — never tries the second method +``` + +If an object has both `close()` and `disconnect()`, only `close()` is called. The `disconnect()` fallback is never reached if `close()` exists. This is correct for objects like `RealZincPlane` (which has `close()` but no `disconnect()`), but for objects with both methods (possible future adapters), the second method is silently skipped. + +**Severity: Low** + +--- + +## Pass 19 Summary + +| # | Flaw | Layer | Severity | +|---|------|-------|----------| +| V1 | `DITAv2LauncherBundle.close()` never calls `kernel.close()` — Rust handle leaks via `__del__` | Launcher | **Critical** | +| V2 | `BingxVenueAdapter` no `close()`/`disconnect()` — ThreadPoolExecutor/HTTP never release | Venue | **Critical** | +| V3 | `process_intent` ENTER doesn't clear `seen_event_ids` — old dedup pollutes new trade | Rust | **High** | +| V4 | 3 generators write same output file — last writer wins, incompatible prologues | Test | **Critical** | +| V5 | Generated tests triple env-gated — never run in CI, dead code | Test | **Critical** | +| V6 | `kernel.close()` destroys Rust handle immediately — no drain, no flush, use-after-free risk | Bridge | **Critical** | +| V7 | `_last_settled_pnl` dict accessed from process_intent and on_venue_event without locks | Bridge | Medium | +| V8 | `#[serde(default)] leverage: f64` default 0.0 — mark_price uses directly no `.max(1.0)` | Rust | Medium | +| V9 | No `conftest.py`, no `pytest.ini`, no `asyncio_mode` — test discovery fragile | Test | **High** | +| V10 | `kernel.close()` `except Exception: pass` — silently swallows destroy errors | Bridge | Low | +| V11 | `build_launcher_bundle()` no cleanup on partial failure — OOM orphans 4 components | Launcher | Medium | +| V12 | `KernelResult` clones entire kernel state every FFI call — wasted allocations | Rust | Medium | +| V13 | `_build_rb()` leaks bundle on post-creation failure | Test | Low | +| V14 | `_maybe_close` breaks after first method — never tries both close and disconnect | Launcher | Low | + +### Pass 19 Severity + +| Severity | Count | +|----------|-------| +| **Critical** | 5 (V1, V2, V4, V5, V6) | +| **High** | 2 (V3, V9) | +| Medium | 4 (V7, V8, V11, V12) | +| Low | 3 (V10, V13, V14) | + +### Combined Catalog (All 19 Passes) + +| Pass | Focus | Count | Critical | High | Medium | Low | Info | +|------|-------|-------|----------|------|--------|-----|------| +| A | Architectural | 15 | 0 | 2 | 0 | 2 | 11 | +| T | Threading/Atomicity | 9 | 1 | 3 | 3 | 2 | 0 | +| E | E2E Trace (Pass 1) | 26 | 0 | 4 | 10 | 11 | 1 | +| F | Deep E2E (Pass 3) | 30 | 0 | 1 | 8 | 17 | 4 | +| G | Domain Scans (Pass 4) | 36 | 4 | 11 | 11 | 8 | 2 | +| H | Edge Domains (Pass 5) | 22 | 3 | 9 | 5 | 4 | 1 | +| I | Pass 6 (Math/Tests/Recovery/Security) | 22 | 3 | 11 | 4 | 2 | 2 | +| J | Pass 7 (Test Infra/Data/Rust/Env/Conn) | 16 | 0 | 7 | 7 | 2 | 0 | +| K | Pass 8 (Observability/Memory/Time/DeadCode) | 23 | 2 | 7 | 7 | 1 | 6 | +| L | Pass 9 (Contracts/Events/Network/FFI/Diffs) | 16 | 0 | 4 | 8 | 4 | 0 | +| M | Pass 10 (Runtime/TestBugs/FSM/Persistence/Metrics) | 18 | 3 | 7 | 5 | 3 | 0 | +| N | Pass 11 (Async/Sync Seams/Locks/Threading) | 10 | 4 | 1 | 3 | 1 | 1 | +| O | Pass 12 (Sync/Async Wider Scope) | 11 | 0 | 3 | 7 | 1 | 0 | +| P | Pass 13 (FFI Safety/Dangling Pointers/Coverage) | 9 | 1 | 3 | 3 | 1 | 1 | +| Q | Pass 14 (Serde Edges/Backup Diffs/Market Data) | 12 | 0 | 4 | 3 | 2 | 3 | +| R | Pass 15 (Resource Leaks/Trust Boundaries/Security) | 14 | 2 | 6 | 3 | 2 | 1 | +| S | Pass 16 (Error Handling/Arithmetic/Test Infra) | 16 | 4 | 7 | 5 | 0 | 0 | +| T | Pass 17 (Unsafe Review/Dead Code/Build/Protocols) | 14 | 0 | 5 | 5 | 4 | 0 | +| U | Pass 18 (Rust Test Gaps/Accounting/FFI Types) | 14 | 3 | 4 | 4 | 3 | 0 | +| V | Pass 19 (Lifecycle/Rust Subtleties/Test Infra) | 14 | 5 | 2 | 4 | 3 | 0 | +| **Total** | | **347** | **35** | **101** | **100** | **64** | **37** | diff --git a/PINK_DITAv2_FLAW_ANALYSIS_2026-05-31.md b/PINK_DITAv2_FLAW_ANALYSIS_2026-05-31.md index 248ff3c..4449c03 100644 --- a/PINK_DITAv2_FLAW_ANALYSIS_2026-05-31.md +++ b/PINK_DITAv2_FLAW_ANALYSIS_2026-05-31.md @@ -1,7 +1,26 @@ # PINK DITAv2 — Structural Flaw Analysis (CENTRAL) **Analysis date:** 2026-05-31 +**Last updated:** 2026-06-02 (flaw fix pass — 8 flaws closed) **Scope:** Full PINK pipeline — all flaws across all modules. + +> **Fix notation:** Rows marked **✅ FIXED ``** are verified-fixed with a test commit on branch `exp/pink-ditav2-sprint0-20260530`. +> Unfixed flaws remain as originally described. + +### Fixes applied (2026-06-02) + +| Flaw | Commit | What changed | +|------|--------|--------------| +| I1 — apply_fill partial fill accumulation | `c87ca78` | `slot.size += fill_size` (was `=`); all partial fills accumulate | +| I10 — seen_event_ids lost on restart | `c87ca78` + `3ca154e` | IndexSet dedup in AccountState; KernelFullSnapshot persists per-slot seen_event_ids; startup reconcile wired (I14) | +| I13 — stray event reactivates CLOSED slot | `c87ca78` | `slot.closed` guard returns TERMINAL_STATE before FSM branch | +| I14 — no Zinc restore on startup | `3ca154e` | `__init__` now calls `reconcile_from_slots(zinc_live)` for any non-idle Zinc slots | +| I15 — CANCEL_REJECT leaves slot in EXIT_WORKING | `9a8d1b9` | Clears `active_exit_order`, transitions to `POSITION_OPEN` | +| O1 — `_maybe_close()` silent skip from async context | `338811e` | Routes to thread-pool executor when a running loop is detected | +| O5 — `_run()` no timeout → process hang | `338811e` | `Future.result(timeout=_BACKEND_TIMEOUT_S)` (default 30 s); raises `TimeoutError` | +| O10 — no `close()` on ExecutionKernel | `3ca154e` | `close()` nulls `_backend` to prevent double-free; `__enter__`/`__exit__` added | +| N1 — `with_handle_mut` zero sync (partial) | `c87ca78` | `catch_unwind` at FFI boundary; concurrent-call UB mitigated by Python GIL | + **Sources:** - This file (A-series): Detailed writeups for architectural flaws. - [PINK_DITAv2_E2E_TRACE_ANALYSIS.md](./PINK_DITAv2_E2E_TRACE_ANALYSIS.md) (E, F, G-series): @@ -33,7 +52,8 @@ | S | Pass 16 (Error Handling/Arithmetic/Test Infra) | 16 | 4 | 7 | 5 | 0 | 0 | | T | Pass 17 (Unsafe Review/Dead Code/Build/Protocols) | 14 | 0 | 5 | 5 | 4 | 0 | | U | Pass 18 (Rust Test Gaps/Accounting/FFI Types) | 14 | 3 | 4 | 4 | 3 | 0 | -| **Total** | | **333** | **30** | **99** | **96** | **64** | **37** | +| V | Pass 19 (Lifecycle/Rust Subtleties/Test Infra) | 14 | 5 | 2 | 4 | 3 | 0 | +| **Total** | | **347** | **35** | **101** | **100** | **64** | **37** | --- @@ -180,7 +200,7 @@ | # | Flaw | Layer | Severity | |---|------|-------|----------| -| I1 | Entry `apply_fill` multiple partial fills overwrite size instead of accumulating | Rust | **Critical** | +| I1 | Entry `apply_fill` multiple partial fills overwrite size instead of accumulating — **✅ FIXED `c87ca78`** | Rust | **Critical** | | I2 | Zero exit_ratio creates zero-size exit order — slot stuck in EXIT_REQUESTED | Rust | Medium | | I3 | entry_price inconsistency — Python falsy vs Rust `<= 0.0` gate | Bridge | Info | | I4 | Only 1 Rust unit test for 1765-line kernel — 99% untested at Rust layer | Rust | **High** | @@ -189,12 +209,12 @@ | I7 | Three weak/vacuous assertions in test_flaws.py | Test | Low | | I8 | Entry overfill no guard | Rust | Low | | I9 | No crash durability — slot state pure in-memory until step 7 of process_intent | Bridge | **Critical** | -| I10 | seen_event_ids lost on restart — events double-processed | Rust | **Critical** | +| I10 | seen_event_ids lost on restart — events double-processed — **✅ FIXED `c87ca78` + `3ca154e`** (IndexSet dedup + snapshot; startup restore wired) | Rust | **Critical** | | I11 | No idempotency key sent to BingX — lost response creates duplicate orders | Venue | **High** | | I12 | No graceful degradation for ANY subsystem | All | **High** | -| I13 | Stray venue event can reactivate CLOSED slot — no guard | Rust | **High** | -| I14 | No reconcile_from_slots call on startup — Zinc state never loaded into kernel | Restart | **High** | -| I15 | CANCEL_REJECT doesn't clear active_exit_order — slot stuck in EXIT_WORKING | Rust | Medium | +| I13 | Stray venue event can reactivate CLOSED slot — no guard — **✅ FIXED `c87ca78`** | Rust | **High** | +| I14 | No reconcile_from_slots call on startup — Zinc state never loaded into kernel — **✅ FIXED `3ca154e`** | Restart | **High** | +| I15 | CANCEL_REJECT doesn't clear active_exit_order — slot stuck in EXIT_WORKING — **✅ FIXED `9a8d1b9`** | Rust | Medium | | I16 | Zinc shared memory world-readable/writable by same-machine processes | Zinc | **High** | | I17 | KernelSlotView unrestricted getattr/setattr — bypasses all FSM guards | Bridge | **High** | | I18 | sys.path.insert(0) at import time in 3 production files — malicious module loading | Build | **High** | @@ -320,7 +340,7 @@ | # | Flaw | Layer | Severity | |---|------|-------|----------| -| N1 | Rust kernel `with_handle_mut` zero sync — `&mut` from raw ptr, UB on concurrent FFI | Rust | **Critical** | +| N1 | Rust kernel `with_handle_mut` zero sync — `&mut` from raw ptr, UB on concurrent FFI — *mitigated by Python GIL (single-threaded caller); catch_unwind added `c87ca78`* | Rust | **Critical** | | N2 | `_run()` has two completely different code paths — runtime branch, not design | Venue | **Critical** | | N3 | `_run()` path B blocks event loop thread for every venue HTTP operation | Venue | **Critical** | | N4 | `asyncio.run()` called repeatedly — creates/destroys event loops per call | Venue | **Critical** | @@ -339,16 +359,16 @@ | # | Flaw | Layer | Severity | |---|------|-------|----------| -| O1 | `_maybe_close()` asyncio.run without loop guard — close skipped from async context | Launcher | **High** | +| O1 | `_maybe_close()` asyncio.run without loop guard — close skipped from async context — **✅ FIXED `338811e`** | Launcher | **High** | | O2 | `async def connect()` shims call sync venue.connect() without await — blocking | Test | Medium | | O3 | `_contract_rows(client)` NOT awaited — `_pick_live_symbol` iterates coroutine = crash | Test | **High** | | O4 | Deprecated `get_event_loop().run_until_complete()` in test file | Test | Medium | -| O5 | `_run()` thread pool .result() no timeout — backend hang freezes process | Venue | **High** | +| O5 | `_run()` thread pool .result() no timeout — backend hang freezes process — **✅ FIXED `338811e`** | Venue | **High** | | O6 | MockVenueAdapter never exercises thread-pool bridge — untested in CI | Venue | Medium | | O7 | `_keepalive_loop`/`_rotation_sentinel` fire-and-forget — exceptions silently lost | Stream | Low | | O8 | `KernelSlotView.__getattr__` N FFI calls for N fields — no caching | Bridge | Medium | | O9 | `DITAv2LauncherBundle` no `__del__` — GC'd bundle leaks resource tree | Launcher | Medium | -| O10 | `ExecutionKernel` no `close()` — Rust handle only freed by unpredictable __del__ | Bridge | Medium | +| O10 | `ExecutionKernel` no `close()` — Rust handle only freed by unpredictable __del__ — **✅ FIXED `3ca154e`** | Bridge | Medium | | O11 | `KernelSlotView.__setattr__` triggers 5 persistence side effects — no read-only view | Bridge | Medium | --- @@ -486,6 +506,29 @@ --- +## V-Series: Startup/Shutdown Lifecycle, Rust Kernel Subtleties, Generated Test Infra (Pass 19) + +*Full detail in TRACE doc under "PASS 19 — STARTUP/SHUTDOWN LIFECYCLE, RUST KERNEL SUBTLETIES, GENERATED TEST INFRA."* + +| # | Flaw | Layer | Severity | +|---|------|-------|----------| +| V1 | `DITAv2LauncherBundle.close()` never calls `kernel.close()` — Rust handle leaks via `__del__` | Launcher | **Critical** | +| V2 | `BingxVenueAdapter` no `close()`/`disconnect()` — ThreadPoolExecutor/HTTP never release | Venue | **Critical** | +| V3 | `process_intent` ENTER doesn't clear `seen_event_ids` — old dedup pollutes new trade | Rust | **High** | +| V4 | 3 generators write same output file — last writer wins, incompatible prologues | Test | **Critical** | +| V5 | Generated tests triple env-gated — never run in CI, dead code | Test | **Critical** | +| V6 | `kernel.close()` destroys Rust handle immediately — no drain, no flush, UAF risk | Bridge | **Critical** | +| V7 | `_last_settled_pnl` dict accessed from process_intent and on_venue_event without locks | Bridge | Medium | +| V8 | `#[serde(default)] leverage: f64` default 0.0 — mark_price uses directly no .max(1.0) | Rust | Medium | +| V9 | No `conftest.py`, no `pytest.ini`, no `asyncio_mode` — test discovery fragile | Test | **High** | +| V10 | `kernel.close()` `except Exception: pass` — silently swallows destroy errors | Bridge | Low | +| V11 | `build_launcher_bundle()` no cleanup on partial failure — OOM orphans 4 components | Launcher | Medium | +| V12 | `KernelResult` clones entire kernel state every FFI call — wasted allocations | Rust | Medium | +| V13 | `_build_rb()` leaks bundle on post-creation failure | Test | Low | +| V14 | `_maybe_close` breaks after first method — never tries both close and disconnect | Launcher | Low | + +--- + ## H-Series: Edge Domains — Dependencies, Error Handling, Types, Contracts (Pass 5) *Full detail in TRACE doc under "PASS 5 — EDGE DOMAINS."* diff --git a/prod/clean_arch/dita_v2/PINK_DITAv2_FLAW_ANALYSIS_2026-05-31.md b/prod/clean_arch/dita_v2/PINK_DITAv2_FLAW_ANALYSIS_2026-05-31.md new file mode 100644 index 0000000..ba1f897 --- /dev/null +++ b/prod/clean_arch/dita_v2/PINK_DITAv2_FLAW_ANALYSIS_2026-05-31.md @@ -0,0 +1,1090 @@ +# PINK DITAv2 — Structural Flaw Analysis (CENTRAL) + +**Analysis date:** 2026-05-31 +**Last updated:** 2026-06-02 (flaw fix pass — 8 flaws closed) +**Scope:** Full PINK pipeline — all flaws across all modules. + +> **Fix notation:** Rows marked **✅ FIXED ``** are verified-fixed with a test commit on branch `exp/pink-ditav2-sprint0-20260530`. +> Unfixed flaws remain as originally described. + +### Fixes applied (2026-06-02) + +| Flaw | Commit | What changed | +|------|--------|--------------| +| I1 — apply_fill partial fill accumulation | `c87ca78` | `slot.size += fill_size` (was `=`); all partial fills accumulate | +| I10 — seen_event_ids lost on restart | `c87ca78` + `3ca154e` | IndexSet dedup in AccountState; KernelFullSnapshot persists per-slot seen_event_ids; startup reconcile wired (I14) | +| I13 — stray event reactivates CLOSED slot | `c87ca78` | `slot.closed` guard returns TERMINAL_STATE before FSM branch | +| I14 — no Zinc restore on startup | `3ca154e` | `__init__` now calls `reconcile_from_slots(zinc_live)` for any non-idle Zinc slots | +| I15 — CANCEL_REJECT leaves slot in EXIT_WORKING | `9a8d1b9` | Clears `active_exit_order`, transitions to `POSITION_OPEN` | +| O1 — `_maybe_close()` silent skip from async context | `338811e` | Routes to thread-pool executor when a running loop is detected | +| O5 — `_run()` no timeout → process hang | `338811e` | `Future.result(timeout=_BACKEND_TIMEOUT_S)` (default 30 s); raises `TimeoutError` | +| O10 — no `close()` on ExecutionKernel | `3ca154e` | `close()` nulls `_backend` to prevent double-free; `__enter__`/`__exit__` added | +| N1 — `with_handle_mut` zero sync (partial) | `c87ca78` | `catch_unwind` at FFI boundary; concurrent-call UB mitigated by Python GIL | + +**Sources:** +- This file (A-series): Detailed writeups for architectural flaws. +- [PINK_DITAv2_E2E_TRACE_ANALYSIS.md](./PINK_DITAv2_E2E_TRACE_ANALYSIS.md) (E, F, G-series): + Full E2E data-flow trace, deep bridge/Zinc/lifecycle scans. + Every E, F, G entry below is a summary only — full detail is in the TRACE doc. + +--- + +## Combined Catalog (All Flaws, All Passes) + +| Pass | Focus | Count | Critical | High | Medium | Low | Info | +|------|-------|-------|----------|------|--------|-----|------| +| A | Architectural (detailed in this file) | 15 | 0 | 2 | 0 | 2 | 11 | +| T | Threading/Atomicity | 9 | 1 | 3 | 3 | 2 | 0 | +| E | E2E Trace (Pass 1) | 26 | 0 | 4 | 10 | 11 | 1 | +| F | Deep E2E (Pass 3) | 30 | 0 | 1 | 8 | 17 | 4 | +| G | Domain Scans (Pass 4) | 36 | 4 | 11 | 11 | 8 | 2 | +| H | Edge Domains (Pass 5) | 22 | 3 | 9 | 5 | 4 | 1 | +| I | Pass 6 (Math/Tests/Recovery/Security) | 22 | 3 | 11 | 4 | 2 | 2 | +| J | Pass 7 (Test Infra/Data/Rust/Env/Conn) | 16 | 0 | 7 | 7 | 2 | 0 | +| K | Pass 8 (Observability/Memory/Time/DeadCode) | 23 | 2 | 7 | 7 | 1 | 6 | +| L | Pass 9 (Contracts/Events/Network/FFI/Diffs) | 16 | 0 | 4 | 8 | 4 | 0 | +| M | Pass 10 (Runtime/TestBugs/FSM/Persistence/Metrics) | 18 | 3 | 7 | 5 | 3 | 0 | +| N | Pass 11 (Async/Sync Seams/Locks/Threading) | 10 | 4 | 1 | 3 | 1 | 1 | +| O | Pass 12 (Sync/Async Wider Scope) | 11 | 0 | 3 | 7 | 1 | 0 | +| P | Pass 13 (FFI Safety/Dangling Pointers/Coverage) | 9 | 1 | 3 | 3 | 1 | 1 | +| Q | Pass 14 (Serde Edges/Backup Diffs/Market Data) | 12 | 0 | 4 | 3 | 2 | 3 | +| R | Pass 15 (Resource Leaks/Trust Boundaries/Security) | 14 | 2 | 6 | 3 | 2 | 1 | +| S | Pass 16 (Error Handling/Arithmetic/Test Infra) | 16 | 4 | 7 | 5 | 0 | 0 | +| T | Pass 17 (Unsafe Review/Dead Code/Build/Protocols) | 14 | 0 | 5 | 5 | 4 | 0 | +| U | Pass 18 (Rust Test Gaps/Accounting/FFI Types) | 14 | 3 | 4 | 4 | 3 | 0 | +| **Total** | | **333** | **30** | **99** | **96** | **64** | **37** | + +--- + +## T-Series: Threading & Atomicity Flaws + +*Full detail in TRACE doc under "Threading & Atomicity" section.* + +| # | Flaw | Layer | Severity | +|---|------|-------|----------| +| T1 | `InMemoryZincPlane` thread-Condition deadlock from slot update re-entrancy | Zinc | **Critical** | +| T2 | Thread-unsafe kernel snapshot capture for account | Bridge | **High** | +| T3 | Re-entrant or incorrectly-scoped Rust-kernel handle usage | Bridge | **High** | +| T4 | Consequence: `on_venue_event` PnL settle races | Bridge | **High** | +| T5 | Access to shared `_state_seq` / `_slot_cache` in `RealZincPlane` from multiple kernel calls | Zinc | Medium | +| T6 | `_write_region` buffer zero + notify race with concurrent reader | Zinc | Medium | +| T7 | Publication of events in `process_intent` loop not synchronized with persist | Bridge | Medium | +| T8 | `asyncio.run` executor skip in `_run` leads to event-loop stall | Venue | Low | +| T9 | No thread-safe Python↔Rust ownership / lifetime protocol | Bridge | Low | + +--- + +## E-Series: E2E Data-Flow Flaws (Pass 1) + +*Full detail in TRACE doc under "Layer 1" through "Layer 9."* + +| # | Flaw | Layer | Severity | +|---|------|-------|----------| +| E1 | `step()` calls `pump_venue_events()` every cycle unconditionally | Runtime | **High** | +| E2 | `kernel.snapshot()["account"]` returns a fresh dict, not a live view | Bridge | Low | +| E3 | `_decision_to_kernel_intent` drops `order_type` and `limit_price` | Runtime | **High** | +| E4 | `_exit_intent_from_slot` trusts slot.size but slot may be stale | Runtime | **High** | +| E5 | JSON serialization round-trip loses numeric precision | Bridge | Low | +| E6 | `_RustKernelLib` is a global singleton — shared across all kernels | Bridge | Low | +| E7 | ENTER handler silently allows re-entry with same trade_id | Rust | **High** | +| E8 | EXIT handler uses `initial_size` not current size | Rust | **High** | +| E9 | CANCEL handler returns diagnostic even when nothing happened | Rust | Low | +| E10 | `apply_fill` entry branch double-sets `active_entry_order` | Rust | Low | +| E11 | `_legacy_intent()` is a lossy conversion | Venue | Low | +| E12 | `_events_from_submit()` price fallback chain can lose venue price | Venue | Low | +| E13 | `_backend_snapshot()` timeout returns stale data | Venue | Medium | +| E14 | `_events_from_cancel` uses stale `slot_id` from order metadata | Venue | Low | +| E15 | Submit sets leverage via separate HTTP call | Adapter | Medium | +| E16 | `_format_quantity`/`_format_price` may use zero tick/step | Adapter | Medium | +| E17 | Cancel uses truth-based confirmation — can mask real errors | Adapter | Medium | +| E18 | `on_venue_event` settles PnL incrementally — fees never included | Bridge | Medium | +| E19 | `observe_slots` called with ALL slots, not just changed ones | Bridge | Low | +| E20 | `_capital()` reads live from `AccountProjection` — stale row risk | Persistence | Low | +| E21 | `persist_fill_events()` synthesizes fake Decision/Intent | Persistence | Medium | +| E22 | `_write_trade_exit_leg` capital_before uses arithmetic reconstruction | Persistence | Medium | +| E23 | `_write_trade_event` uses entry_price as exit_price | Persistence | Medium | +| E24 | Mock venue always emits fill on `partial_fill_ratio > 0` | Test | Low | +| E25 | Test scenarios use MARKET-only `_si()` helper — no LIMIT tests | Test | Low | +| E26 | Fresh-kernel reconcile tests create second kernel but share venue | Test | Low | + +--- + +## F-Series: Deep Bridge/Zinc/Lifecycle Flaws (Pass 3) + +*Full detail in TRACE doc under "PASS 3 — NEW FINDINGS."* + +| # | Flaw | Layer | Severity | +|---|------|-------|----------| +| F1 | CANCEL returns "accepted" before cancel happens — stale diagnostic_code | Bridge | Medium | +| F2 | `_last_settled_pnl` reset before `venue.submit()` — transient window | Bridge | Medium | +| F3 | `_first_invalid_intent_field` allows `leverage=0` and `target_size=0` | Bridge | Low | +| F4 | `outcome.emitted_events` only from venue — Rust kernel events dropped | Bridge | Low | +| F5 | `on_venue_event` redundant FFI read of slot already returned by Rust | Bridge | Low | +| F6 | `process_intent` records pre-venue transitions with `event=None` | Bridge | Info | +| F7 | `reconcile_from_slots` writes ALL slots to projection/zinc | Bridge | Low | +| F8 | `HazelcastRowWriter.put()` synchronous, no error handling — crashes intent | Projection | **Medium** | +| F9 | `RealZincPlane.write_slot()` serializes ALL slots, not just changed one | Zinc | Low | +| F10 | `RealZincPlane` zeros buffer before write — concurrent read sees empty | Zinc | Low | +| F11 | `RealZincPlane._write_region` no partial-write recovery | Zinc | Low | +| F12 | `InMemoryZincPlane` intent_region grows without bound | Zinc | Low | +| F13 | `InMemoryZincPlane` uses non-re-entrant `threading.Condition` | Zinc | Low | +| F14 | `KernelSlotView.__setattr__` round-trips unknown fields — silently dropped | Bridge | Low | +| F15 | `on_venue_event` loop stops on first exception — slot left in partial state | Bridge | **High** | +| F16 | `venue.submit()` returning empty events leaves slot in ORDER_REQUESTED | Bridge | Medium | +| F17 | Cancel truth-based confirmation returns REJECTED for already-cancelled orders | Adapter | Medium | +| F18 | Leverage-set and order-submit failures share error handler | Adapter | Low | +| F19 | `_events_from_submit` stale snapshot fallback → wrong fill detection | Venue | Medium | +| F20 | `__del__` frees Rust handle at unpredictable GC time — no explicit close() | Bridge | **Medium** | +| F21 | `DITAv2LauncherBundle.close()` closes venue before kernel is done | Launcher | Low | +| F22 | Silent fallback from real Zinc/Hazelcast to in-memory — operator unaware | Launcher | **Medium** | +| F23 | `VenueEvent.size` = `intent.target_size` not actual fill | Venue | Info | +| F24 | `asyncio.run()` inside async function in test generator | Test | Low | +| F25 | `_build_fresh_kernel_from_slot` leaks old kernel objects per call | Test | Low | +| F26 | `seen_event_ids` not cleared on re-entry — accumulates across trades | Rust | Low | +| F27 | `RealZincControlPlane.read()` parses Zinc region every call — no caching | Control | Low | +| F28 | `_legacy_intent` hardcodes confidence=1.0, bars_held=0 | Venue | Info | +| F29 | `_slot_to_payload` in real_zinc_plane.py is dead code | Zinc | Info | +| F30 | Duplicate `_slot_from_payload` in real_zinc_plane.py and rust_backend.py | Zinc | Low | + +--- + +## G-Series: Domain Scans — Rust Kernel, Config, Persistence, Lifecycle (Pass 4) + +*Full detail in TRACE doc under "PASS 4 — SYSTEMATIC DOMAIN SCANS."* + +| # | Flaw | Layer | Severity | +|---|------|-------|----------| +| G1 | EXIT_RESIDUAL action missing from Rust KernelCommandType enum | Rust | **Critical** | +| G2 | `into_c_string` unwrap() panics on NUL byte in FFI string | Rust | **Critical** | +| G3 | EXIT hardcodes prev_state=POSITION_OPEN — allows backward FSM transition | Rust | **Critical** | +| G4 | `consume_exit_leg` stale `all_legs_done` variable — wrong branch after last leg | Rust | **Critical** | +| G5 | `realized_pnl` unbounded f64 — overflows to inf at extreme values | Rust | **High** | +| G6 | `mark_price` produces unbounded unrealized_pnl — no result guard | Rust | **High** | +| G7 | ENTER no is_finite() guard on target_size | Rust | **High** | +| G8 | `reconcile_slots_json` no dedup or bounds validation | Rust | **High** | +| G9 | `exchange_order_id` update targets wrong order — exit cancel broken | Rust | **High** | +| G10 | CANCEL diagnostic always says NO_ACTIVE_EXIT_ORDER | Rust | **High** | +| G11 | `apply_fill` overwrites intended_size with slot.size | Rust | Medium | +| G12 | No max leverage cap enforced by kernel | Rust | Medium | +| G13 | `resolve_slot` fallback returns unwrap_or(0) — misroutes events | Rust | Medium | +| G14 | `commit_slot` silently ignores out-of-bounds slot_id | Rust | Medium | +| G15 | Zero `__post_init__` validators on all 16 config dataclasses (127 fields) | Config | **High** | +| G16 | DITA_V2_DEBUG_CLICKHOUSE defaults to True when unset | Config | Info | +| G17 | String config fields — Zinc region injection risk | Config | Medium | +| G18 | `exit_leg_ratios` no sum-to-1 validation | Config | Low | +| G19 | RealZincControlPlane.read() no sequence check — torn-read risk | Config | Low | +| G20 | ClickHouse journal strategy/db env vars — SQL injection risk | Config | Low | +| G21 | entry_price used as exit_price in trade_events — data loss | Persistence | **High** | +| G22 | active_leg_index → entry_bar semantic mis-mapping | Persistence | Medium | +| G23 | capital_before arithmetic absorbs cross-slot PnL | Persistence | Medium | +| G24 | Recovery trade_reconstruction always has trade_id="" | Persistence | Medium | +| G25 | seen_event_ids, exit_leg_ratios, VenueOrder, metadata not in flat CH tables | Persistence | Low | +| G26 | _safe_float silently converts NaN/None/Inf to 0.0 | Persistence | Low | +| G27 | build_launcher_bundle no exception safety — prior resources leak | Lifecycle | **High** | +| G28 | RealZincPlane/RealZincControlPlane no __del__ — SHM orphaned | Lifecycle | **High** | +| G29 | Zero signal handlers — no cleanup on SIGTERM/SIGINT | Lifecycle | **High** | +| G30 | ExecutionKernel has no close() — relies on __del__ for Rust handle | Lifecycle | **High** | +| G31 | Hazelcast projection never closed | Lifecycle | Medium | +| G32 | _maybe_close() break skips second method | Lifecycle | Low | +| G33 | close() not idempotent for RealZinc components | Lifecycle | Low | +| G34 | No context manager on DITAv2LauncherBundle | Lifecycle | Low | +| G35 | BingxVenueAdapter.connect() never called | Lifecycle | Info | +| G36 | Only one try/finally in entire codebase | Lifecycle | **High** | + +--- + +## I-Series: Math, Tests, Concurrency, Recovery, Security (Pass 6) + +*Full detail in TRACE doc under "PASS 6 — MATH, TESTS, CONCURRENCY, RECOVERY, SECURITY."* + +| # | Flaw | Layer | Severity | +|---|------|-------|----------| +| I1 | Entry `apply_fill` multiple partial fills overwrite size instead of accumulating — **✅ FIXED `c87ca78`** | Rust | **Critical** | +| I2 | Zero exit_ratio creates zero-size exit order — slot stuck in EXIT_REQUESTED | Rust | Medium | +| I3 | entry_price inconsistency — Python falsy vs Rust `<= 0.0` gate | Bridge | Info | +| I4 | Only 1 Rust unit test for 1765-line kernel — 99% untested at Rust layer | Rust | **High** | +| I5 | MockVenueScenario rejection flags exist but zero tests use them | Test | **High** | +| I6 | No LIMIT order test through full kernel path | Test | **High** | +| I7 | Three weak/vacuous assertions in test_flaws.py | Test | Low | +| I8 | Entry overfill no guard | Rust | Low | +| I9 | No crash durability — slot state pure in-memory until step 7 of process_intent | Bridge | **Critical** | +| I10 | seen_event_ids lost on restart — events double-processed — **✅ FIXED `c87ca78` + `3ca154e`** (IndexSet dedup + snapshot; startup restore wired) | Rust | **Critical** | +| I11 | No idempotency key sent to BingX — lost response creates duplicate orders | Venue | **High** | +| I12 | No graceful degradation for ANY subsystem | All | **High** | +| I13 | Stray venue event can reactivate CLOSED slot — no guard — **✅ FIXED `c87ca78`** | Rust | **High** | +| I14 | No reconcile_from_slots call on startup — Zinc state never loaded into kernel — **✅ FIXED `3ca154e`** | Restart | **High** | +| I15 | CANCEL_REJECT doesn't clear active_exit_order — slot stuck in EXIT_WORKING — **✅ FIXED `9a8d1b9`** | Rust | Medium | +| I16 | Zinc shared memory world-readable/writable by same-machine processes | Zinc | **High** | +| I17 | KernelSlotView unrestricted getattr/setattr — bypasses all FSM guards | Bridge | **High** | +| I18 | sys.path.insert(0) at import time in 3 production files — malicious module loading | Build | **High** | +| I19 | pump_venue_events stale snapshot diff produces phantom position events | Venue | **High** | +| I20 | exit_leg_ratios empty list — next_exit_ratio defaults to 1.0 (undocumented) | Contracts | Info | +| I21 | RATE_LIMITED code path in both Python and Rust is completely untested | All | Medium | +| I22 | Thread pool max_workers=3 shared across all adapter instances — never shut down | Venue | Medium | + +--- + +## J-Series: Test Infra, Data Feed, Rust Deeper, Env Parsing, Connections (Pass 7) + +*Full detail in TRACE doc under "PASS 7 — TEST INFRA, DATA FEED, RUST DEEPER, ENV PARSING, CONNECTIONS."* + +| # | Flaw | Layer | Severity | +|---|------|-------|----------| +| J1 | `_flatten` submits wrong direction for LONG positions | Test | Medium | +| J2 | `_check_slot_accounting` double-counts unrealized PnL | Test | Medium | +| J3 | `_build_live_snapshot` timestamp is float vs datetime — type crash risk | Data Feed | **High** | +| J4 | `ExecutionKernel.mark_price()` never called — no mark-to-market | Bridge | **High** | +| J5 | All VenueEvent timestamps use local clock, not exchange timestamp | Venue | Medium | +| J6 | No monotonic timestamp verification anywhere | All | Low | +| J7 | `rebuild_indexes()` overwrites duplicate trade_id — last wins | Rust | **High** | +| J8 | `resolve_slot()` falls back to slot 0 — stray event corrupts slot 0 | Rust | **High** | +| J9 | `get_slot_json`/`snapshot_json` return null with no diagnostic | Rust | Medium | +| J10 | Two processes with same DITA_V2_PREFIX corrupt shared Zinc memory | Zinc | **High** | +| J11 | `load_dotenv()` only runs on launcher.py import — ordering dependency | Config | Medium | +| J12 | BINGX_API_KEY passed None with no validation — fails at HTTP time | Config | Medium | +| J13 | API credentials never masked in error messages or tracebacks | Config | **High** | +| J14 | `_env_bool` inconsistent: empty string = False vs unset = default | Config | Low | +| J15 | gen2.py and _gen_test.py write to same output — last writer wins | Test | Medium | +| J16 | Shim test bridge lacks step(), decision_engine — zero runtime fidelity | Test | **High** | + +--- + +## K-Series: Observability, Memory, Time, Dead Code, Module Init (Pass 8) + +*Full detail in TRACE doc under "PASS 8 — OBSERVABILITY, MEMORY, TIME, DEAD CODE, MODULE INIT."* + +| # | Flaw | Layer | Severity | +|---|------|-------|----------| +| K1 | Zero stdout/stderr — system completely silent | All | **Critical** | +| K2 | No health check, metrics, or monitoring surface | All | **Critical** | +| K3 | Failed trades produce no notification — error in return value only | Bridge | **High** | +| K4 | Exception tracebacks not captured — all except:pass swallow silently | All | **High** | +| K5 | ~85+ Python objects per process_intent — 36 TradeSlot copies via FFI | Bridge | Medium | +| K6 | Circular ref cycle Kernel→StateView→SlotView→Kernel — delays __del__ | Bridge | **High** | +| K7 | MemoryKernelJournal silently drops transitions after 10K rows | Journal | **High** | +| K8 | RealZincPlane._intent_cache unbounded Python list growth | Zinc | **High** | +| K9 | _backend_snapshot timeout uses wall clock — NTP truncates/extends | Venue | **High** | +| K10 | RealZincControlPlane.wait() uses wall clock — no monotonic | Control | Medium | +| K11 | exchange_ts fallback to local time.time() when E missing | Stream | Medium | +| K12 | No monotonic timestamp verification anywhere | All | Medium | +| K13 | ControlPlane.wait()/notify() — zero callers across all impls | Control | Info | +| K14 | AccountProjection.to_account_event() — zero callers | Account | Info | +| K15 | HazelcastProjector entire class dead | Projection | Info | +| K16 | _order_to_payload() dead code | Bridge | Info | +| K17 | MirroredControlPlane entire class dead — never constructed | Control | Info | +| K18 | 12 of 20 TradeStage variants never matched in Rust FSM | Rust | Low | +| K19 | Unused imports in projection.py and hazelcast_projection.py | Projection | Info | +| K20 | sys.path mutation on import — global side effect | Config | Medium | +| K21 | load_dotenv() at module import time — mutates os.environ globally | Config | Medium | +| K22 | ControlPlane protocol not exported in __all__ | Config | Info | +| K23 | KernelSlotView.__getattr__ makes FFI call per attribute access | Bridge | Medium | + +--- + +## L-Series: Contracts, Exchange Events, Network, FFI, Backup Diffs (Pass 9) + +*Full detail in TRACE doc under "PASS 9 — CONTRACTS, EXCHANGE EVENTS, NETWORK, FFI, BACKUP DIFFS."* + +| # | Flaw | Layer | Severity | +|---|------|-------|----------| +| L1 | `KernelOutcome(accepted=True, diag=INVALID_INTENT)` parseable — no invariant check | Bridge | Medium | +| L2 | `VenueEvent.filled_size > size` possible via different source fields | Venue | Medium | +| L3 | `VenueEvent.price=0` reaches kernel — zero-price fill = 100% loss PnL | Venue | **High** | +| L4 | `available_margin` set to cross-wallet balance, not available margin | Stream | **High** | +| L5 | `wallet_balance` defaults to 0 when `"wb"` absent — E-side reconcile always ERROR | Stream | **High** | +| L6 | `_keepalive_loop` no stop mechanism — runs on old key after rotation | Stream | Medium | +| L7 | `event_id` integer 0 → `str(0)` falsy on `or` → random UUID generated | Stream | Medium | +| L8 | Hardcoded VST URLs in test generators — wrong env if LIVE configured | Test | Medium | +| L9 | No proxy support — can't deploy behind corporate proxy | Network | Low | +| L10 | 5-minute DNS cache TTL — stale IPs on infrastructure change | Network | Low | +| L11 | `limit_price` getattr reads dataclass field, not metadata dict | Venue | Low | +| L12 | Backup diff: 14+ critical bugs fixed, 428-line dual-ledger accounting added | Rust | Info | +| L13 | `_build_full_runtime` dead — real pipeline never tested | Test | **High** | +| L14 | `listenKeyExpired` raises RuntimeError instead of clean yield | Stream | Medium | +| L15 | `_delete_listen_key` suppresses all exceptions — leaked server keys | Stream | Low | +| L16 | `venue_order_id` target selection ambiguous when entry order exists | Rust | Medium | + +--- + +## M-Series: Runtime, Test Bugs, FSM Audit, Persistence, Measurement (Pass 10) + +*Full detail in TRACE doc under "PASS 10 — RUNTIME, TEST BUGS, FSM AUDIT, PERSISTENCE, MEASUREMENT."* + +| # | Flaw | Layer | Severity | +|---|------|-------|----------| +| M1 | ENTER transition hardcodes prev_state=IDLE — audit trail lies for re-entries | Rust | **Critical** | +| M2 | CANCEL creates no transition record — invisible in audit log | Rust | **Critical** | +| M3 | `_mk_intent` drops order_type/limit_price into metadata, not proper field | Test | **High** | +| M4 | test_cancel_entry_with_partial_fill never sends CANCEL — misnamed vacuous test | Test | **High** | +| M5 | Flaw 7 tests never send EXIT — exit_partial_fill_ratio untested | Test | Medium | +| M6 | test_dedup tests use wrong constant (actual=256, claim 64) — 70 events insufficient | Test | Medium | +| M7 | test_outcome_state_matches_actual_slot is tautological | Test | Low | +| M8 | ORDER_ACK silent fallthrough when no active order — accepted with no effect | Rust | Medium | +| M9 | ORDER_REJECT on POSITION_OPEN with stale entry order destroys position | Rust | **Critical** | +| M10 | No aggregation of trade count, success/fail, latency — all zero | All | **High** | +| M11 | Flaw 6 tests pass via metadata passthrough, not field logic | Test | **High** | +| M12 | No retry/fallback for ClickHouse INSERT failures — crashes policy cycle | Persistence | **High** | +| M13 | AccountSnapshot.trade_seq never incremented — always 0 | Account | Medium | +| M14 | test_reentry_after_full_close_no_pnl_loss uses 50% bound — absurd | Test | Low | +| M15 | test_reconcile_rejects_position_open_with_zero_size passes for wrong reason | Test | Low | +| M16 | No built-in metric for active slots, event throughput, or memory | All | Medium | +| M17 | Flaw 9 tests named for cancel but never call cancel | Test | **High** | +| M18 | _decision_to_kernel_intent drops order_type and limit_price — LIMIT dead from runtime | Runtime | **High** | + +--- + +## N-Series: Async/Sync Seams, Lock Analysis, Threading (Pass 11) + +*Full detail in TRACE doc under "PASS 11 — ASYNC/SYNC SEAMS, LOCK ANALYSIS, THREADING."* + +| # | Flaw | Layer | Severity | +|---|------|-------|----------| +| N1 | Rust kernel `with_handle_mut` zero sync — `&mut` from raw ptr, UB on concurrent FFI — *mitigated by Python GIL (single-threaded caller); catch_unwind added `c87ca78`* | Rust | **Critical** | +| N2 | `_run()` has two completely different code paths — runtime branch, not design | Venue | **Critical** | +| N3 | `_run()` path B blocks event loop thread for every venue HTTP operation | Venue | **Critical** | +| N4 | `asyncio.run()` called repeatedly — creates/destroys event loops per call | Venue | **Critical** | +| N5 | `_snapshot_ready` cascading re-fetch — N callers produce N overlapping HTTP | Venue | **High** | +| N6 | `BingxUserStream.close()` doesn't cancel pending tasks | Stream | Medium | +| N7 | Live test architecture forces worst-case `_run()` path for every operation | Test | Medium | +| N8 | `subscribe()` reconnect creates new tasks per iteration | Stream | Medium | +| N9 | No `asyncio.all_tasks()` or task accounting — leaked tasks undetectable | All | Low | +| N10 | `_snap_lock` no reader-side protection (informational) | Venue | Info | + +--- + +## O-Series: Sync/Async Wider Scope (Launcher, Generators, Streams, FFI, Tests) (Pass 12) + +*Full detail in TRACE doc under "PASS 12 — SYNC/ASYNC WIDER SCOPE."* + +| # | Flaw | Layer | Severity | +|---|------|-------|----------| +| O1 | `_maybe_close()` asyncio.run without loop guard — close skipped from async context — **✅ FIXED `338811e`** | Launcher | **High** | +| O2 | `async def connect()` shims call sync venue.connect() without await — blocking | Test | Medium | +| O3 | `_contract_rows(client)` NOT awaited — `_pick_live_symbol` iterates coroutine = crash | Test | **High** | +| O4 | Deprecated `get_event_loop().run_until_complete()` in test file | Test | Medium | +| O5 | `_run()` thread pool .result() no timeout — backend hang freezes process — **✅ FIXED `338811e`** | Venue | **High** | +| O6 | MockVenueAdapter never exercises thread-pool bridge — untested in CI | Venue | Medium | +| O7 | `_keepalive_loop`/`_rotation_sentinel` fire-and-forget — exceptions silently lost | Stream | Low | +| O8 | `KernelSlotView.__getattr__` N FFI calls for N fields — no caching | Bridge | Medium | +| O9 | `DITAv2LauncherBundle` no `__del__` — GC'd bundle leaks resource tree | Launcher | Medium | +| O10 | `ExecutionKernel` no `close()` — Rust handle only freed by unpredictable __del__ — **✅ FIXED `3ca154e`** | Bridge | Medium | +| O11 | `KernelSlotView.__setattr__` triggers 5 persistence side effects — no read-only view | Bridge | Medium | + +--- + +## P-Series: FFI Safety, Dangling Pointers, Coverage Gaps (Pass 13) + +*Full detail in TRACE doc under "PASS 13 — FFI BOUNDARY SAFETY, DANGLING POINTERS, COVERAGE GAPS."* + +| # | Flaw | Layer | Severity | +|---|------|-------|----------| +| P1 | `dita_kernel_destroy` double-free UB — Python doesn't null handle.value | Bridge | **Critical** | +| P2 | `CStr::from_ptr(payload)` without null guard in 3 FFI exports | Rust | **High** | +| P3 | `_check_open_orders` calls `asyncio.run()` from async `_verify` — RuntimeError | Test | **High** | +| P4 | `into_c_string` NUL sanitizer produces invalid JSON — json.loads fails | Rust | Medium | +| P5 | `reconcile_slots_json`/`snapshot_json` return null on failure — no diagnostic | Rust | Medium | +| P6 | `_get_rust()` TOCTOU race — concurrent cargo build corruption | Bridge | **High** | +| P7 | `KernelHandle` no Send/Sync — FFI bypasses Rust ownership rules | Rust | Info | +| P8 | No explicit Rust handle destroy path from bundle.close() | Launcher | Medium | +| P9 | `__del__` accesses module `_RUST` during shutdown — NameError leak | Bridge | Low | + +--- + +## Q-Series: Serde Edges, Backup Diffs, Market Data/Timestamps (Pass 14) + +*Full detail in TRACE doc under "PASS 14 — SERDE EDGE CASES, BACKUP DIFFS, MARKET DATA/TIMESTAMPS."* + +| # | Flaw | Layer | Severity | +|---|------|-------|----------| +| Q1 | `fromisoformat()` can't parse Rust `Z` suffix on Python < 3.11 | Bridge | **High** | +| Q2 | No `#[serde(deny_unknown_fields)]` — misspelled fields silently default | Rust | Medium | +| Q3 | `indexmap` new dependency in current code | Rust | Info | +| Q4 | Backup diff: 6 critical bug fixes between backup and current | Rust | Info | +| Q5 | `MarketSnapshot.timestamp` type inconsistent — float vs datetime in same file | Data Feed | **High** | +| Q6 | `fromisoformat()` Z-suffix fail on all 5 timestamp deserialization sites | Bridge | **High** | +| Q7 | No upper-bound price validation — 1e300 passes all guards | Bridge | Medium | +| Q8 | `_first_invalid_intent_field` does not reject zero/negative price or zero size | Bridge | Low | +| Q9 | Rust/Python clock sources diverge — transition timestamps mixed source | Rust | Low | +| Q10 | `threading.Event.wait()` uses platform-dependent clock — NTP jump risk | Venue | Medium | +| Q11 | Backup had no STALE_STATE_RECONCILING guard in on_venue_event (info) | Rust | Info | +| Q12 | All 5 fromisoformat() sites fail on Python < 3.11 (duplicate of Q1) | Bridge | **High** | + +--- + +## R-Series: Resource Leaks, Trust Boundaries, Security (Pass 15) + +*Full detail in TRACE doc under "PASS 15 — RESOURCE LEAKS, TRUST BOUNDARIES, SECURITY."* + +| # | Flaw | Layer | Severity | +|---|------|-------|----------| +| R1 | `ThreadPoolExecutor` never shut down — 3 threads leak | Venue | **High** | +| R2 | `BingxVenueAdapter` no `close()` — backend HTTP client unreleasable | Venue | **High** | +| R3 | `real_zinc_plane._intent_cache` grows unboundedly | Plane | **High** | +| R4 | `RealZincPlane`/`ControlPlane` partial-construction SharedRegion leak | Plane | Medium | +| R5 | `BingxUserStream.ClientSession` no `__del__` — connection pool leak | Venue | Medium | +| R6 | `test_alpha_blue_untouched_g7.py` open() without context manager | Test | Low | +| R7 | All exchange REST/WS data parsed without schema validation | Venue | **Critical** | +| R8 | Shared memory JSON deserialization without integrity check | Plane | **High** | +| R9 | `restore_state()` deserializes arbitrary JSON — full kernel takeover | Bridge | **Critical** | +| R10 | `DOLPHIN_BINGX_ENV` + `ALLOW_MAINNET` mainnet switch via env | Config | **High** | +| R11 | `.env` file loaded from project root — secrets exposure | Config | **High** | +| R12 | Unvalidated `int()` on env vars — recv_window, leverage extremes | Config | Medium | +| R13 | `listenKey` from exchange in WS URL f-string — MITM injection | Venue | **High** | +| R14 | `mock_venue._exchange_event_queue` unbounded growth | Test | Low | + +--- + +## S-Series: Error Handling, Arithmetic Stability, Test Infrastructure (Pass 16) + +*Full detail in TRACE doc under "PASS 16 — ERROR HANDLING, ARITHMETIC STABILITY, TEST INFRASTRUCTURE."* + +| # | Flaw | Layer | Severity | +|---|------|-------|----------| +| S1 | `realized_pnl()`/`mark_price()` NaN bypasses `<=0.0` guard — NaN PnL corrupts k_realized_pnl | Rust/Python | **Critical** | +| S2 | MockVenue `_exchange_event_queue` check-then-act race — silently drops events | Test | **Critical** | +| S3 | No `test_kernel_fsm.py` or `test_kernel_fsm_recovery.py` exists | Test | **Critical** | +| S4 | Generated tests use `await asyncio.sleep(0.8)` — flaky false negatives on slow CI | Test | **Critical** | +| S5 | `_rate_limit_retry_after_ms()` returns 0 on parse failure — instant retry storm | Venue | **High** | +| S6 | Venue adapter detects rate limits but enforces zero backoff | Venue | **High** | +| S7 | `capital_epsilon = 1e-4` too tight — false WARN classifications | Accounting | **High** | +| S8 | Generated tests use module-level `asyncio.run()` — leaks tasks on Python 3.12+ | Test | **High** | +| S9 | `str.replace()` patching silently does nothing on format change | Build | **High** | +| S10 | `_consume()` no per-message WS timeout — silent hang blocks forever | Venue | **High** | +| S11 | `_run()` blocks pool thread with no timeout — 3 hung calls lock adapter | Venue | **High** | +| S12 | Rate-limit regex depends on exchange message format — non-portable | Venue | Medium | +| S13 | `_row_float()` silently skips malformed rows, filters zero values | Venue | Medium | +| S14 | Reconnection backoff lacks jitter — thundering herd risk | Venue | Medium | +| S15 | `_venue_event_status_from_row()` falls back to ACKED — masks new rejections | Venue | Medium | +| S16 | `except: pass` in generated test code — swallows KeyboardInterrupt | Test | Medium | + +--- + +## T-Series: Unsafe Review, Dead Code/Backup Debris, Build/Plane Protocols (Pass 17) + +*Full detail in TRACE doc under "PASS 17 — UNSAFE REVIEW, DEAD CODE/BACKUP DEBRIS, BUILD/PLANE PROTOCOLS."* + +| # | Flaw | Layer | Severity | +|---|------|-------|----------| +| T1 | `catch_unwind` + `AssertUnwindSafe` — partially mutated state persists, no rollback | Rust | **High** | +| T2 | Empty backup dir `_backup_20260530_105512/` and stale `tea_debug.log` (0 bytes) | Repo | Low | +| T3 | `HazelcastRowWriter` uses bare `json.dumps(row, default=str)` — Enums/datetimes wrong format | Bridge | **High** | +| T4 | `real_zinc_plane._slot_from_payload()` direct key access `payload["last_event_time"]` — KeyError | Plane | **High** | +| T5 | `_build_pink_bodies.py` `str.index("]")` finds first `]` — corrupts SCENARIOS list | Build | **High** | +| T6 | `VenueAdapter` protocol missing `connect()`/`disconnect()` — AttributeError at runtime | Venue | **High** | +| T7 | Shared memory writes non-atomic — visible-zero window exposes partial state | Plane | **High** | +| T8 | `_slot_from_payload()` reconstructs internal_trade_id from slot trade_id — order data loss | Plane | Medium | +| T9 | `_slot_from_payload()` duplicated in two files — schema drift risk | Plane | Medium | +| T10 | `str.index("finally:")` finds first match — nested try/finally mismatch | Build | Medium | +| T11 | No workspace-root `.gitignore` — __pycache__, backup dirs, debris untracked | Repo | Low | +| T12 | `projection.py` lazy import failure silently swallowed — writer=None drops all writes | Bridge | Medium | +| T13 | `Codex_CONTEXT_RESTORE__*.txt` AI context files in root — debris | Repo | Low | +| T14 | `_backup_20260530/` is a valid Python package — accidental old-code import risk | Repo | Medium | + +--- + +## U-Series: Rust Test Gaps, Accounting Reconciliation Bugs, FFI Type Mismatches (Pass 18) + +*Full detail in TRACE doc under "PASS 18 — RUST TEST GAPS, ACCOUNTING RECONCILIATION BUGS, FFI TYPE MISMATCHES."* + +| # | Flaw | Layer | Severity | +|---|------|-------|----------| +| U1 | `order_type`/`limit_price` sent to Rust, no serde fields — silently dropped | FFI | **High** | +| U2 | Rust `VenueEventStatus` expects `"CANCEL_REJECTED"` (typo) — `"CANCELED_REJECTED"` fails deserialization | Rust | **High** | +| U3 | R2 compares cumulative K realized vs single-last-fill E realized — broken after 2nd+ fill | Accounting | **Critical** | +| U4 | R4 compares K open_notional vs E used_margin — fundamentally different quantities | Accounting | **Critical** | +| U5 | R3 skipped when `len(e.positions)==0` — K has positions but E reports none, silent | Accounting | **High** | +| U6 | `on_venue_event`/`apply_fill` no NaN guards on venue event price/size — NaN propagates | Rust | **Critical** | +| U7 | Zero Rust tests for ORDER_REJECT, PARTIAL_FILL, TERMINAL_STATE guard, etc. | Test | **High** | +| U8 | `safe_float()` returns NaN/Inf instead of default — contradicts `_safe()` | Bridge | Medium | +| U9 | `_scan_slots` uses `metadata.get("leverage")` not `slot.leverage` — wrong leverage | Accounting | Medium | +| U10 | Rust injects `"k_net_fees"` key alongside serde's `k_fees_paid` — duplicate | Bridge | Low | +| U11 | 10+ AccountState fields transmitted across FFI but never read by Python | FFI | Low | +| U12 | `_order_from_payload()` overwrites `internal_trade_id` with slot's trade_id | Bridge | Medium | +| U13 | No independent third reference — symmetrical K=E errors invisible | Accounting | Medium | +| U14 | `lot_step` declared in ReconcileConfig but never used — dead field | Accounting | Low | + +--- + +## H-Series: Edge Domains — Dependencies, Error Handling, Types, Contracts (Pass 5) + +*Full detail in TRACE doc under "PASS 5 — EDGE DOMAINS."* + +| # | Flaw | Layer | Severity | +|---|------|-------|----------| +| H1 | No Python dependency files (requirements.txt, pyproject.toml, etc.) | Build | **Critical** | +| H2 | Rust kernel compiled from source on every cold start — no prebuilt binary | Build | **Critical** | +| H3 | Zero logging — 16+ silent except:pass sites, no error observability | All | **Critical** | +| H4 | `_row_float` rejects zero as valid, `except Exception: continue` swallows all | Venue | **High** | +| H5 | `_backend_snapshot` timeout returns stale data/None — callers crash | Venue | **High** | +| H6 | All enum-from-raw-string sites crash on unknown variant (17 sites) | Bridge | **High** | +| H7 | `_legacy_intent` reads `getattr(intent, "order_type")` not metadata — always MARKET | Venue | **High** | +| H8 | Unknown venue status silently mapped to ACKED | Venue | **High** | +| H9 | `RealZincPlane.write_slot()` `slot_id >= slot_count` silently lost | Zinc | **High** | +| H10 | `RealZincControlPlane.read()` no atomicity with concurrent `update()` | Control | **High** | +| H11 | `_RustKernelLib` lazy init with race condition — concurrent cargo build | Bridge | **High** | +| H12 | `ExecutionKernel.__del__` use-after-free on Rust handle | Bridge | **High** | +| H13 | `MirroredControlPlane` missing protocol methods (wait/notify) | Control | Medium | +| H14 | `TradeSlot.remaining_size` vs `VenueOrder.remaining_size` — different semantics | Contracts | Medium | +| H15 | `_maybe_close` asyncio.run RuntimeError silently swallowed | Launcher | Medium | +| H16 | Lazy import of bingx_direct masks config errors until first trade | Build | Info | +| H17 | `load_dotenv()` at module level — import-time I/O side effect | Launcher | Medium | +| H18 | `_run()` blocks event loop on every HTTP call via thread pool | Venue | Medium | +| H19 | `HazelcastClientLike` protocol has zero concrete implementations | Projection | Low | +| H20 | `_decode_packet` uncaught UnicodeDecodeError/ValueError on corrupted SHM | Zinc | Low | +| H21 | `wasm-bindgen` compiled into native library unnecessarily | Build | Low | +| H22 | `socket.getaddrinfo` monkey-patch in test code | Test | Low | + +--- + +## A-Series: Architectural Flaws (detailed writeups) + +*These are the original architectural flaws with full analysis.* + +--- + +### Flaw A1: Exit-size overshoot on multi-leg with initial_size > remaining size + +**Location:** `_rust_kernel/src/lib.rs` lines ~770-780 (EXIT handler in `process_intent`) + +**Severity:** **High** + +**Nature:** Logic error — wrong base for exit-size computation. + +### Downstream effect + +The EXIT handler computes the exit size as `base_size * exit_ratio` where: +```rust +let base_size = if slot.initial_size > 0.0 { slot.initial_size } else { slot.size }; +``` + +After partial fills (e.g., two separate MARKET exit legs), `initial_size` is still the +**original** entry size while `slot.size` has been reduced by previous legs. If the +cumulative leg ratios don't sum to exactly 1.0 (or the final ratio is not 1.0), the +computed exit size can exceed the remaining position. + +The venue adapter clamps to actual position via `reduceOnly`, but the kernel's _own_ +accounting reduces `slot.size` by the fill size, not by the intended exit size. The +slot can therefore go negative (`slot.size < 0`) if the fill is larger than remaining. + +### Exact trigger + +1. Enter SHORT, size=1.0, `initial_size=1.0`, ratios=(0.6, 0.6, 1.0) — note ratios sum > 1.0 +2. EXIT leg 0: `exit_size = 1.0 * 0.6 = 0.6`. Fill consumes 0.6. Slot size goes to 0.4. +3. EXIT leg 1: `exit_size = 1.0 * 0.6 = 0.6`. But remaining is 0.4. Requests 0.6. +4. BingX `reduceOnly` clamps fill to 0.4. Slot size goes to 0.0. +5. EXIT leg 2 (ratio 1.0): `exit_size = 1.0 * 1.0 = 1.0`. Slot is already at 0.0. + Kernel returns `NO_OPEN_POSITION` — the final EXIT is rejected because `slot.closed` + was not set by the previous fill (it was a partial close, not terminal). +6. Slot is at size=0.0, `!slot.closed`, no active orders, but `!slot.is_free()` because + `size <= 0.0` is true but `fsm_state != IDLE/CLOSED` — slot is **stuck** in + `POSITION_OPEN` with zero size. + +This is **not** purely a mis-sized ratio problem. With MARKET orders that fill fully, +even correct ratios can leave the slot stuck if the fill price differs from the +intended-size price and the venue adjusts fill quantity. + +### Fix strategy + +Use `slot.size` directly as the base (not `initial_size`): +```rust +let exit_size = (slot.size * exit_ratio).max(0.0).min(slot.size); +``` + +This guarantees the exit never requests more than the remaining position, regardless +of cumulative ratio math. The venue still clamps, but the kernel's intent is correct. + +--- + +### Flaw A2: Misleading CANCEL diagnostic code on entry-only slots + +**Location:** `_rust_kernel/src/lib.rs` lines ~798-810 (CANCEL rejection path) + +**Severity:** **Low** + +**Nature:** Diagnostic pollution — wrong error code. + +### Downstream effect + +When a CANCEL intent arrives and **neither** `active_exit_order` nor +`active_entry_order` is cancellable, the kernel returns: +```rust +diagnostic_code: KernelDiagnosticCode::NO_ACTIVE_EXIT_ORDER +``` + +But the reason may be that there's no active entry order either, or the FSM state +doesn't permit cancellation. The diagnostic name suggests an exit-order-specific +problem when the failure is generic "nothing to cancel." + +### Fix + +Change to a generic `NO_ACTIVE_ORDER` diagnostic or `SLOT_IDLE` when the slot is +already in IDLE. `NO_ACTIVE_EXIT_ORDER` is misleading for a slot that has never had +any order. + +--- + +### Flaw A3: Float-accumulated slot.size after partial fills can go negative + +**Location:** `_rust_kernel/src/lib.rs` lines ~1365-1370 (apply_fill exit path) + +**Severity:** **Low** + +**Nature:** Numerical precision edge case. + +### Code path + +```rust +slot.size = (slot.size - fill_size).max(0.0); +``` + +This clamps to zero, which is correct. But if the venue fills *more* than requested +(on BingX, this can happen with market orders where the fill walks the book), the +slot sees `fill_size > intended_size`. The `max(0.0)` prevents negative, but the +slot then reports `size=0.0` with `!closed` and an FSM state that's not IDLE. + +The `is_free()` check requires `size <= 0.0` AND `fsm_state in {IDLE, CLOSED}`. A +slot with `size=0.0` and `fsm_state=POSITION_OPEN` is stuck — no EXIT will be +accepted and no ENTER can start. + +### Trigger + +Submit an EXIT for 0.6 of remaining 0.6. BingX fills 0.8 (market order walks the +book, overshoots). `fill_size=0.8`, `slot.size = (0.6 - 0.8).max(0.0) = 0.0`. +Slot is now size=0, fsm_state=EXIT_WORKING (or POSITION_OPEN), `closed=false`. + +### Fix + +When `slot.size <= 1e-12` after a fill and the slot is in an exit-related state, +force transition to CLOSED/IDLE regardless of leg index: +```rust +if slot.size <= 1e-12 { + slot.closed = true; + slot.fsm_state = TradeStage::CLOSED; + slot.active_exit_order = None; + slot.active_entry_order = None; + return; +} +``` + +--- + +### Flaw A4: Entry price is clobbered by mark_price if called before fill arrives + +**Location:** `_rust_kernel/src/lib.rs` lines ~432-436 (mark_price) and ~1390 (apply_fill entry branch) + +**Severity:** **Medium** + +**Nature:** Accounting accuracy — incorrect PnL base. + +### Code path + +```rust +// In mark_price: +if self.entry_price <= 0.0 { + self.entry_price = price; // Seeds entry_price from mark before fill +} + +// In apply_fill (entry): +if event.price > 0.0 { + slot.entry_price = event.price; // Overwrites with actual fill price +} +``` + +The `mark_price` path seeds `entry_price` from a market price when the slot has no +fill yet. The `apply_fill` entry path correctly overwrites with the actual fill price. +So in the normal flow this is harmless — the fill overwrites the mark. + +**However**, consider this sequence: +1. ENTER intent accepted → slot goes `ORDER_REQUESTED`, `entry_price = 0.0` +2. `runtime.step()` calls `kernel.mark_price(snapshot.symbol, snapshot.price)` → sets `entry_price = 100.0` +3. `on_venue_event(ORDER_ACK)` → `ENTRY_WORKING`, `entry_price` still `100.0` +4. `on_venue_event(PARTIAL_FILL)` → `apply_fill` sets `entry_price = 99.5` (fill price) +5. Unrealized PnL from step 2-3 used a mark price of 100.0, not the fill price of 99.5 + +This is a transient mis-valuation window. It corrects itself on the next `observe_slots` +call, but intra-step readers see wrong unrealized PnL. Not critical because: +- `account.snapshot.unrealized_pnl` uses the slot's `unrealized_pnl`, not the mark +- Realized PnL is computed from actual fill prices +- The window lasts at most one scan cycle (~5s) + +### Fix + +Don't set `entry_price` from `mark_price` when there's no fill: +```rust +fn mark_price(&mut self, price: f64) { + if !price.is_finite() || price <= 0.0 { return; } + // Don't seed entry_price — leave it at 0.0 until a fill arrives + if self.entry_price <= 0.0 || self.size <= 0.0 { + self.unrealized_pnl = 0.0; + return; + } + // ... normal PnL computation +} +``` + +--- + +### Flaw A5: Capital-before computation is arithmetic not snapshot-based + +**Location:** `pink_clickhouse.py` lines ~761-762 (`_write_trade_exit_leg`) and ~822-823 (`_write_trade_event`) + +**Severity:** **High** + +**Nature:** Accounting accuracy — wrong capital_before under multi-slot or intervening events. + +### Code pattern (appears in two places) + +```python +capital_after = self._capital() +capital_before = capital_after - pnl_leg # In _write_trade_exit_leg +capital_before = capital_after - pnl # In _write_trade_event +``` + +This reconstructs `capital_before` by subtracting the current leg's PnL from the +current capital. This is **only correct** if: +1. No other slots settled PnL between this leg and the previous one +2. No capital corrections (reconcile, manual override) happened between legs +3. No fees were deducted between legs + +With multi-slot (PINK configurable `max_slots > 1`), a concurrent trade on slot 1 +that closes between slot 0's exit legs will have its PnL baked into `capital_after`, +making `capital_before = capital_after - pnl_leg` wrong. + +### Fix + +Maintain a per-trade `capital_before_leg` snapshot taken at the moment of the +first fill event for each trade, advancing it by the realized PnL of each leg: +```python +self._leg_state[trade_id]["capital_before"] = prev.get("capital_after", capital_after - pnl_leg) +self._leg_state[trade_id]["capital_after"] = capital_after +``` + +And use `prev["capital_before"]` for the row, not `capital_after - pnl_leg`. + +--- + +### Flaw A6: Reconcile accoun(t) reseeds capital from kernel, not exchange + +**Location:** `pink_direct.py` lines ~597-630 (`recover_account`) and docstring of `reconcile_account` + +**Severity:** **Medium** + +**Nature:** Operational drift — capital is never verified against exchange truth in hot loop. + +### The gap + +`reconcile_account()` (line 632) has this docstring: +``` +Periodic exchange-led account sync. +Capital is re-seeded from the exchange balance as a guard against long-running drift +``` + +But the actual implementation: +```python +async def reconcile_account(self, ...) -> dict[str, Any]: + return await self.recover_account(...) + +async def recover_account(self, ...) -> dict[str, Any]: + capital = float(self.kernel.account.snapshot.capital or 25000.0) + _reconcile_position_slot(self.kernel, capital, slot_id=0) +``` + +It passes the **kernel's own capital** to `_reconcile_position_slot`, which then +overwrites `kernel.account.snapshot.capital` with... the same value. No exchange +balance poll ever overwrites capital. + +`connect()` at line 224 does the same — it passes `initial_capital` (an env default), +not the exchange balance. The exchange balance is never read for capital seeding +in the current code path. `_reconcile_position_slot` does call +`venue.open_positions()`, but it only reads positions, not capital. + +### Effect + +Capital drift (caused by fees the kernel doesn't track, unrealized PnL mis-valuation, +or any other systematic error) accumulates monotonically. There is no mechanism to +detect or correct drift. Over weeks of live trading, the kernel's capital snapshot +can diverge arbitrarily from the exchange's actual balance. + +### Fix + +Either: +1. Make `_reconcile_position_slot` read the exchange balance and use it for + capital reseeding (the docstring claims it does this already), or +2. Add a separate capital-verification path that surfaces the delta between + kernel capital and exchange balance as an anomaly, even if it doesn't auto-correct. + +--- + +### Flaw A7: No fee tracking in kernel accounting + +**Location:** `rust_backend.py` lines ~540-545 (on_venue_event settle), `bingx_direct.py` submit_intent return + +**Severity:** **Medium** + +**Nature:** Accounting accuracy — fees are invisible to capital tracking. + +### Downstream effect + +When a trade closes, the kernel computes: +```rust +realized_pnl = delta * notional +``` + +This is **gross** PnL. BingX charges fees on every fill (taker ~0.04%, maker ~0.02%). +These fees are never subtracted from the kernel's realized PnL. Over 100 trades with +$100 average notional at 0.04%, the cumulative error is $4 — negligible. Over 10,000 +trades at 10x leverage and $50k average notional, the error is $200k. + +The `BingxDirectExecutionAdapter` does return `ExecutionReceipt` with fill data, +but `bingx_venue._events_from_submit()` only reads `price` and `filled_size` — +commission/fee fields are ignored. + +### Fix + +1. Read fee/commission from the BingX ack payload in `_events_from_submit()` +2. Pass fees through `VenueEvent.metadata["fee"]` +3. In the Rust kernel's `apply_fill`, subtract the fee from realized PnL: + ```rust + let fee = event.metadata.get("fee").and_then(|v| v.as_f64()).unwrap_or(0.0); + slot.realized_pnl += realized - fee; + ``` + +--- + +### Flaw A8: ENTER intent silently defaults leverage to 1.0 on bad input + +**Location:** `_rust_kernel/src/lib.rs` lines ~745-748 + +**Severity:** **Low** + +**Nature:** Silent fallback — corrupt input produces a trade, not a rejection. + +```rust +slot.leverage = if intent.leverage.is_finite() && intent.leverage > 0.0 { + intent.leverage +} else { + 1.0 +}; +``` + +A NaN, zero, negative, or infinite leverage value silently trades at 1x instead of +rejecting the intent. The Python bridge does validate `_first_invalid_intent_field()` +which catches NaN/inf, but it doesn't catch `leverage <= 0.0` (it only checks +`not math.isfinite(value)`). + +### Fix + +Add `leverage <= 0.0` to the Python bridge's invalid-intent check. The Rust kernel +should still have the `1.0` fallback as a defensive measure, but the bridge should +prevent bad leverages from reaching Rust in the first place. + +--- + +### Flaw A9: Mock venue submit condition convoluted — dead code paths + +**Location:** `mock_venue.py` lines ~60-90 + +**Severity:** **Informational** + +**Nature:** Code clarity — confusing condition logic. + +```python +if self.scenario.emit_ack_before_fill or not self.scenario.emit_fill_on_submit: + events.append(ack_event) +if self.scenario.emit_fill_on_submit or self.scenario.partial_fill_ratio > 0: + # ... fill events +``` + +The condition logic is confusing: +- When `emit_ack_before_fill=True` and `emit_fill_on_submit=True`: both branches run → ACK + fill +- When `emit_ack_before_fill=False` and `emit_fill_on_submit=True`: first branch runs because + `not True = False`, so `False or False = False` → no ACK. Second branch runs → fill only. + This produces a fill without an ACK, which is **not** a realistic venue scenario. +- When `partial_fill_ratio=1.0` (default): second branch runs and emits a `FULL_FILL` event + even when `emit_fill_on_submit=False`, because `0.0 or 1.0 > 0 = True`. + +The partial fill ratio check should be gated on `emit_fill_on_submit`: +```python +should_emit_fill = self.scenario.emit_fill_on_submit or ( + is_entry and self.scenario.entry_partial_fill_ratio > 0 +) or ( + not is_entry and self.scenario.exit_partial_fill_ratio > 0 +) +``` + +--- + +### Flaw A10: Pump venue events on every step cycle — expensive for MARKET-only flows + +**Location:** `pink_direct.py` lines ~318-374 (`pump_venue_events`), called at line ~436 + +**Severity:** **Medium** + +**Nature:** Operational overhead — unnecessary exchange HTTP calls. + +### The problem + +`step()` calls `pump_venue_events()` **every cycle**, which calls `venue.reconcile()`. +For `BingxVenueAdapter`, `reconcile()` calls `_backend_snapshot()` which does up to 5 +HTTP requests (balance, positions, open orders) in parallel. For a MARKET-only workflow +where orders fill synchronously within `process_intent()`, there are **no** late fills +to drain. + +On BingX VST, the rate limit is ~10 requests/second across all endpoints. Each +`pump_venue_events()` call consumes 5+ of that budget. At a 5-second policy cycle, +this is 60 requests/minute — 60% of the rate budget — just to poll for fills that +don't exist. + +### Fix + +Gate the pump on whether the previous cycle submitted a LIMIT order: +```python +self._has_resting_order = any( + o.status not in (VenueOrderStatus.FILLED, VenueOrderStatus.CANCELED) + for o in kernel.open_orders() +) +if self._has_resting_order: + await self.pump_venue_events(snapshot, market_state=market_state) +``` + +Or add a config flag `async_fill_mode: bool = False`. + +--- + +### Flaw A11: VenueAdapter.submit() blocks the event loop + +**Location:** `bingx_venue.py` lines ~225-233 (`_run`) + +**Severity:** **Medium** + +**Nature:** Runtime safety — synchronous call in async context. + +```python +def _run(self, result: Any) -> Any: + if inspect.isawaitable(result): + try: + asyncio.get_running_loop() + except RuntimeError: + return asyncio.run(result) + pool = self._get_executor() + return pool.submit(asyncio.run, result).result() +``` + +When called from `step()` (which is an async function), `_run` submits the async +`submit_intent()` to a thread pool, runs it with `asyncio.run()`, then calls +`.result()` which blocks the current thread until complete. The BingX HTTP call +can take 1-5 seconds depending on network latency and exchange load. + +During this block, the event loop **cannot** process other async tasks (data feed +updates, health checks, signal processing). In a single-runtime deployment, this +stalls the entire policy cycle. + +### Fix + +Make `process_intent` in `ExecutionKernel` accept an async venue callback, or +make `BingxVenueAdapter` truly async (not sync-with-thread-bridge). For now, +at minimum the PINK runtime should run `step()` in an executor to avoid blocking +the main event loop. + +--- + +### Flaw A12: Stale KernelStateView slot references after reconcile + +**Location:** `rust_backend.py` lines ~350-365 (`KernelStateView.refresh`) + +**Severity:** **Low** + +**Nature:** Stale data — view not rebuilt on reconcile. + +```python +class KernelStateView: + def __init__(self, kernel): + self.slots = [KernelSlotView(kernel, slot_id) for slot_id in range(kernel.max_slots)] + # ... + + def refresh(self) -> None: + snapshot = self._kernel._snapshot_backend() + self.active_trade_index = dict(snapshot.get("active_trade_index", {})) + self.venue_order_index = dict(snapshot.get("venue_order_index", {})) + self.client_order_index = dict(snapshot.get("client_order_index", {})) +``` + +`refresh()` updates the index maps but does **not** recreate `self.slots`. The slot +views in `self.slots` are live proxies (they read through `_get_slot` each time), +so slot data is current. But if `max_slots` changes (it shouldn't, but it's mutable) +or if slots are re-indexed by a reconcile, the view list is wrong. + +Not critical because `max_slots` is set at init and never changes, but worth +fixing for robustness. + +--- + +### Flaw A13: `persist_fill_events` uses current price as exit price + +**Location:** `pink_clickhouse.py` lines ~408 + +**Severity:** **Low** + +**Nature:** Historical accuracy — logged price may not match fill price. + +```python +price = next((float(getattr(e, "price", 0.0) or 0.0) for e in event_list + if getattr(e, "price", 0.0)), 0.0) or self._slot_entry_price(slot) +``` + +This correctly reads from the event's price. But `decision.reference_price` at line +417 falls back to this price, which is the fill price. The trade_event row at line +835 uses `exit_price = slot_dict.get("entry_price", ...)` — which is the **entry** +price, not the exit price. The trade_event always shows exit_price == entry_price. + +This means `trade_events` in ClickHouse will never show a realistic exit price +for the persisted trade, breaking any PnL reconstruction that relies on +`(exit_price - entry_price) * size * leverage`. + +--- + +### Flaw A14: `_write_position_state` maps active_leg_index to entry_bar + +**Location:** `pink_clickhouse.py` line ~673 + +**Severity:** **Low** + +**Nature:** Semantic mismatch — wrong field mapping. + +```python +"entry_bar": int(slot_dict.get("active_leg_index", 0) or 0), +``` + +`active_leg_index` is the index into the exit-leg-ratios array (which leg is being +exited next). It has nothing to do with how many bars the position has been held. +When a position opens, `active_leg_index` is 0. After the first exit leg, it +advances to 1. Neither value is a bar count. + +`entry_bar` should be `bars_held` from the intent/decision, or a computed value +from `entry_time` to now. + +--- + +### Flaw A15: `persist_recovery_state` passes account dict as slot dict + +**Location:** `pink_clickhouse.py` lines ~447-460 + +**Severity:** **Low** + +**Nature:** Wrong data — account snapshot used where slot data is expected. + +```python +def persist_recovery_state(self, *, snapshot, acc_dict, ...): + slot_dict = acc_dict or {} # ← acc_dict is an account snapshot, not a slot + self._write_position_state(..., slot_dict={}, ...) # ← correctly uses empty dict + self._write_trade_reconstruction( + snapshot, + trade_id=acc_dict.get("trade_id", "") if acc_dict else "", + # acc_dict is {"capital": ..., "equity": ...} — no "trade_id" key + ) +``` + +The `trade_id` in the trade_reconstruction row will always be `""` because +`acc_dict` comes from `kernel.snapshot()["account"]` which has keys `capital`, +`equity`, `realized_pnl`, etc. — not `trade_id`. This means the recovery +`trade_reconstruction` row has no trade_id linkage.