PINK: E2E trace analysis — Pass 19 lifecycle/Rust subtleties/test infra (V1-V14)

Nineteenth pass: DITAv2LauncherBundle.close() never calls kernel.close() Rust handle leaks via __del__ (V1 Critical), BingxVenueAdapter no close/disconnect ThreadPoolExecutor/HTTP never release (V2 Critical), 3 generators write same output file last writer wins incompatible prologues (V4 Critical), generated tests triple env-gated never run in CI dead code (V5 Critical), kernel.close() destroys Rust handle immediately no drain no flush UAF risk (V6 Critical), process_intent ENTER doesn't clear seen_event_ids old dedup pollutes new trade (V3 High), no conftest/pytest.ini/asyncio_mode test discovery fragile (V9 High), #[serde(default)] leverage:0.0 mark_price no .max(1.0) silent accounting error (V8 Medium). 347 total flaws across 19 passes. Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
2026-06-02 16:34:58 +02:00
parent 94078ee8fe
commit ded4b59891
3 changed files with 1502 additions and 10 deletions
--- a/PINK_DITAv2_E2E_TRACE_ANALYSIS.md
+++ b/PINK_DITAv2_E2E_TRACE_ANALYSIS.md
@@ -6776,3 +6776,362 @@ The `lot_step` field is declared with a default of `0.001` but is never read by
 | T | Pass 17 (Unsafe Review/Dead Code/Build/Protocols) | 14 | 0 | 5 | 5 | 4 | 0 |
 | U | Pass 18 (Rust Test Gaps/Accounting/FFI Types) | 14 | 3 | 4 | 4 | 3 | 0 |
 | **Total** | | **333** | **30** | **99** | **96** | **64** | **37** |
+
+---
+
+## PASS 19 — STARTUP/SHUTDOWN LIFECYCLE, RUST KERNEL SUBTLETIES, GENERATED TEST INFRA
+
+### V1: `DITAv2LauncherBundle.close()` never calls `kernel.close()` — Rust kernel handle always leaks via `__del__`
+
+**File:** `launcher.py:40-42`
+
+```python
+def close(self) -> None:
+    _maybe_close(self.venue)
+    _maybe_close(self.zinc_plane)
+    _maybe_close(self.control_plane)
+    # NOTE: self.kernel is never closed!
+```
+
+The bundle's `close()` method closes the venue, zinc plane, and control plane — but **not the kernel**. The `ExecutionKernel` Rust handle (`self._backend`) is only freed when Python's garbage collector calls `__del__`, which is non-deterministic.
+
+In a `with` block or explicit `bundle.close()` pattern, the Rust handle survives until the `ExecutionKernel` object's refcount drops to zero (usually when the bundle goes out of scope). If the caller holds a reference to the kernel (e.g., `k = bundle.kernel` for inspection), the handle lives indefinitely.
+
+Compare with `_build_rb()` in test code (gen2.py:326-338) which creates a `Shim` that has a `close` method calling `bundle.close()` and `bundle.kernel.close()`. The production `DITAv2LauncherBundle` has no equivalent.
+
+**Severity: Critical**
+
+### V2: `BingxVenueAdapter` has no `close()` or `disconnect()` — ThreadPoolExecutor, HTTP client sessions, connections never released
+
+**File:** `bingx_venue.py` (entire file)
+
+`BingxVenueAdapter` is the synchronous REST adapter to BingX. It has no `close()`, `disconnect()`, or any cleanup method. The `_maybe_close()` in the launcher tries `.close()` → `AttributeError` (caught), then `.disconnect()` → `AttributeError` (caught). **Nothing happens.**
+
+Leaked resources:
+- **ThreadPoolExecutor** (class-level `_EXECUTOR`, 3 threads) — never shut down
+- **BingxDirectExecutionAdapter backend** — holds an `aiohttp.ClientSession` with TCP connection pool (`limit=4`) — never closed
+- Any in-flight HTTP connections — abandoned at process exit
+
+This is a compounding of R1 and R2 from Pass 15. It's the single largest resource leak in the system.
+
+**Severity: Critical**
+
+### V3: `process_intent` ENTER path does NOT clear `seen_event_ids` — old trade's dedup set pollutes new trade
+
+**File:** `_rust_kernel/src/lib.rs:1260-1290` (ENTER path)
+
+```rust
+// ENTER path — slot reuse:
+slot.trade_id = intent.trade_id.clone();
+slot.asset = intent.asset.clone();
+slot.entry_price = 0.0;
+slot.size = 0.0;
+slot.initial_size = 0.0;
+slot.unrealized_pnl = 0.0;
+slot.realized_pnl = 0.0;
+slot.active_leg_index = 0;
+slot.active_entry_order = None;
+slot.active_exit_order = None;
+slot.close_reason.clear();
+slot.closed = false;
+slot.last_event_time = None;
+// 🔴 seen_event_ids is NOT cleared — survives from old trade
+slot.fsm_state = TradeStage::ORDER_REQUESTED;
+```
+
+When a slot is reused by a new ENTER intent, the `seen_event_ids` vector from the **previous trade** survives. If any event_id from the new trade collides with one observed by the old trade, the event is rejected as `DUPLICATE_EVENT` and silently dropped.
+
+**Example:**
+1. Trade A in slot 0 receives events `evt-001`, `evt-002`, `evt-003`. All stored in `seen_event_ids`.
+2. Trade A completes. Slot 0 returns to `IDLE`.
+3. Trade B enters on slot 0. `seen_event_ids` still contains `evt-001`, `evt-002`, `evt-003`.
+4. Exchange sends `evt-001` for Trade B (a legitimate new event with a reused event_id — possible on exchanges that recycle IDs daily).
+5. Rust kernel sees `evt-001` in `seen_event_ids` → rejects as `DUPLICATE_EVENT`. The fill/ack is lost.
+
+The probability of collision depends on the exchange's event ID scheme. If event IDs are UUIDs, the risk is negligible. If event IDs are daily counters (e.g., `"20260601-001"` → `"20260602-001"`), the collision risk is **guaranteed** on every day boundary.
+
+**Fix:** Add `slot.seen_event_ids.clear();` in the ENTER path.
+
+**Severity: High**
+
+### V4: Three generators (`gen2.py`, `gen_live_tests.py`, `_gen_test.py`) all write to same output file — last writer wins, incompatible prologues
+
+**Files:** `gen2.py:431-432`, `gen_live_tests.py:680-681`, `_gen_test.py:1233-1234`
+
+All three generator scripts write to the same file:
+```
+/mnt/dolphinng5_predict/prod/tests/test_pink_bingx_dita_live_e2e.py
+```
+
+Each produces **different prologues** with incompatible import styles, helper function signatures, and test runner implementations:
+- `gen2.py`: Uses `RB`/`Shim` tuple, `_si()` helper, `_flatten()` helper
+- `gen_live_tests.py`: Uses `_RuntimeBundle`/`_RuntimeShim` dataclass, `_run_scenario()` runner, `_flatten_via_kernel_intent()` helper
+- `_gen_test.py`: Uses scenario-style `B()` body definitions, single parametrized `test_pink_ditav2` function
+
+The 68 named body functions are **identical** across gen2.py and gen_live_tests.py but their signatures differ (`(k, symbol, p)` vs `(bundle, client, symbol, snap)`). Running gen2.py then gen_live_tests.py produces a file where the bodies don't match the runner — the file compiles (Python sees valid functions) but silently tests nothing meaningful.
+
+`_build_pink_extended.py` and `_build_pink_bodies.py` then mutate the same file in-place with `str.replace()` — which silently does nothing if the expected format doesn't match.
+
+**Severity: Critical**
+
+### V5: Generated tests are triple env-gated — never run in CI, making them dead code
+
+**File:** `gen2.py` (generated output), `gen_live_tests.py` (generated output), `_gen_test.py` (generated output)
+
+Every generated test file has three `pytest.skip()` guards at the top:
+```python
+if not os.environ.get("BINGX_SMOKE_LIVE"): pytest.skip("...")
+if not os.environ.get("BINGX_SMOKE_ALLOW_TRADE"): pytest.skip("...")
+if not os.environ.get("PINK_DITA_E2E"): pytest.skip("...")
+```
+
+These three env vars are **never set in CI**. The CI pipeline has no step that sets them. As a result:
+- 68 test functions from gen2.py → always skipped
+- 70 test functions from gen_live_tests.py → always skipped
+- 157 body scenarios from _gen_test.py → always skipped
+- ~295 combined test scenarios → **zero executed in CI**
+
+Only the mock-venue tests (`test_flaws.py`, `test_account_core_v2.py`, `test_account_reconcile_faults.py`, `test_exchange_event_seam_parity.py`, `test_kernel_reliability.py`, `test_kernel_fee_friction.py`, `test_pink_clickhouse_phase4.py`, `test_alpha_blue_untouched_g7.py`) actually run in CI.
+
+**Severity: Critical**
+
+### V6: `ExecutionKernel.close()` destroys Rust handle immediately — no drain of in-flight intents, no wait for pending FFI calls, no state flush
+
+**File:** `rust_backend.py:320-330`
+
+```python
+def close(self) -> None:
+    backend = self._backend
+    if backend is not None:
+        self._backend = None
+        try:
+            _get_rust().destroy(backend)  # immediately frees Rust kernel memory
+        except Exception:
+            pass
+```
+
+`destroy(backend)` calls `dita_kernel_destroy(handle)` which calls `drop(Box::from_raw(handle))`, freeing the entire `KernelHandle` heap allocation including `KernelCore`, all slots, all account state.
+
+If `process_intent()` or `on_venue_event()` has been called from another thread and is mid-execution in Rust when `close()` fires:
+1. Rust's state machine is destroyed mid-transition — dangling `self` reference
+2. Any HTTP calls (venue.submit()) already in-flight complete, but their results are never processed
+3. The Rust `Box::from_raw` calls `drop()` while another thread may be holding a reference to `core` — **use-after-free UB**
+4. `_last_settled_pnl` dict is orphaned
+5. The zinc plane, projection, and account state are all inconsistent with the destroyed kernel
+
+There is no cancel/token mechanism, no pending-call queue drain, and no state flush before destroy.
+
+**Severity: Critical**
+
+### V7: `_last_settled_pnl` dict accessed from both `process_intent` and `on_venue_event` without locks — thread-unsafe
+
+**File:** `rust_backend.py:440,475`
+
+```python
+# process_intent (line 440):
+self._last_settled_pnl[intent.slot_id] = 0.0
+
+# on_venue_event (line 475):
+incremental_pnl = slot.realized_pnl - self._last_settled_pnl.get(slot.slot_id, 0.0)
+self._last_settled_pnl[slot.slot_id] = slot.realized_pnl
+```
+
+`process_intent()` (called from the async event loop or scheduler) and `on_venue_event()` (called from the venue event stream callback) both read and write `_last_settled_pnl` without any synchronization primitive. If these run on different threads, the dict can experience:
+- **Lost update**: Two concurrent writes to the same `slot_id` — one overwrites the other
+- **Dict corruption**: Python dict isn't thread-safe for concurrent writes — can produce `KeyError` on iteration or silently drop entries
+- **Incorrect PnL settlement**: The incremental PnL calculation uses stale values
+
+In the current single-threaded async architecture, this isn't triggered. But the lack of protection means any future multi-threaded usage introduces a data race.
+
+**Severity: Medium**
+
+### V8: `#[serde(default)]` on `leverage: f64` defaults to 0.0 — `mark_price()` uses leverage directly without `.max(1.0)`, silent accounting error
+
+**File:** `_rust_kernel/src/lib.rs:400-408`
+
+```rust
+fn mark_price(&mut self, price: f64) {
+    // ...
+    if self.entry_price <= 0.0 || self.size <= 0.0 { return; }
+    let mut delta = (price - self.entry_price) / self.entry_price;
+    if self.side == TradeSide::SHORT { delta = -delta; }
+    self.unrealized_pnl = delta * self.size * self.entry_price * self.leverage;  // uses leverage directly
+}
+```
+
+The `TradeSlot` struct has `#[serde(default)] leverage: f64` which defaults to `0.0` when deserialized from JSON without a `leverage` field.
+
+`mark_price()` uses `self.leverage` directly in `unrealized_pnl` computation — **no `.max(1.0)` guard**. If leverage is `0.0` (from missing JSON field, `set_slot_json`, or snapshot restore without leverage), `unrealized_pnl` is always `0.0` regardless of price movement.
+
+Compare with `realized_pnl()` which correctly guards:
+```rust
+let notional = exit_size * slot.entry_price * slot.leverage.max(1.0);  // correct
+```
+
+The `process_intent` ENTER path handles this (sets leverage to 1.0 if ≤ 0), but `set_slot_json` and `restore_full_snapshot` can bypass this and store leverage=0.0 directly into the slot.
+
+**Impact:** Any slot restored from a snapshot that predates the `leverage` field (or saved by a version without it) gets `leverage=0.0` silently. `unrealized_pnl` is always 0.0 — the operator sees no mark-to-market PnL even though the position has moved. This PnL is only realized on close (via `realized_pnl` which correctly uses `.max(1.0)`), so the total PnL is correct but the intra-period mark appears flat — a silent accounting error.
+
+**Severity: Medium**
+
+### V9: No `conftest.py`, no `pytest.ini`, no `asyncio_mode` configuration — test discovery relies on default pytest behavior
+
+**File:** (missing — no pytest configuration in workspace)
+
+The workspace has **zero pytest configuration files**. No `conftest.py`, `pytest.ini`, `setup.cfg`, `pyproject.toml` with pytest settings. This means:
+
+- **`asyncio_mode`** defaults to `"strict"` (pytest 8.x+), but all test files use `asyncio.run()` inline rather than `@pytest.mark.asyncio`. In strict mode, `async def` test functions are **not discovered** unless explicitly marked. Since all current tests are `sync def` wrapping `asyncio.run()`, this works — but the discovery relies on this specific pattern holding.
+- No **timeout** configuration — a hanging test blocks the entire suite until killed
+- No **test filtering** markers (no `slow`, `live`, `offline` markers)
+- No **shared fixtures** — each test file repeats the `_build_rb()` setup pattern
+- No **cache** or **rerun** configuration — flaky tests fail the CI suite
+
+If a developer adds a new test using `async def` + `await` (the natural pattern for async code) without `@pytest.mark.asyncio`, the test silently doesn't run — a false negative.
+
+**Severity: High**
+
+### V10: `kernel.close()` has `except Exception: pass` — silently swallows all destroy errors
+
+**File:** `rust_backend.py:325`
+
+```python
+try:
+    _get_rust().destroy(backend)
+except Exception:
+    pass  # silently swallows RuntimeError, OSError, ctypes errors
+```
+
+If `dita_kernel_destroy` fails (e.g., segfault caught by `catch_unwind` returning an error, or the shared library was unloaded), the exception is silently consumed. The caller believes the kernel was closed successfully — but the handle may still be allocated. This is a leak path that's invisible to monitoring.
+
+**Severity: Low**
+
+### V11: `build_launcher_bundle()` has no cleanup for partially-created components — kernel OOM orphans venue, zinc, control plane
+
+**File:** `launcher.py:163-223`
+
+The build order is:
+1. Control plane ✅ built
+2. Projection ✅ built
+3. Zinc plane ✅ built
+4. Venue ✅ built
+5. **ExecutionKernel** — if this fails (OOM, cargo build fail, CDLL load), components 1-4 are all orphaned
+
+There's no `try/finally` around the sequence. If `ExecutionKernel.__init__()` raises, the five already-built components (control plane, projection, zinc plane, venue, and any partially-initialized kernel state) leak. No cleanup code exists.
+
+**Fix:** Use a context manager or `try/finally` to close created components if a later one fails.
+
+**Severity: Medium**
+
+### V12: `KernelResult` clones entire kernel state (all slots + indexes + AccountState) on every FFI call — performance issue
+
+**File:** `_rust_kernel/src/lib.rs:1030-1050` (snapshot method)
+
+Every FFI call to `process_intent()` or `on_venue_event()` returns a `KernelResult` containing a full `snapshot()`:
+
+```rust
+fn snapshot(&self) -> KernelSnapshot {
+    KernelSnapshot {
+        slots: self.slots.clone(),                           // O(n) clone ALL TradeSlots
+        active_trade_index: self.active_trade_index.clone(), // clone entire HashMap
+        venue_order_index: self.venue_order_index.clone(),
+        client_order_index: self.client_order_index.clone(),
+        account: self.account.clone(),                       // clone entire AccountState
+    }
+}
+```
+
+For a kernel with 10 slots, each with metadata, seen_event_ids (1024 IDs), nested VenueOrders, and a full AccountState with fee_config, this produces **thousands of heap allocations per FFI call**. At 10 intents/second, this is tens of thousands of allocations/second.
+
+The Python side only reads `KernelResult.outcome` and `KernelResult.slot` (the single affected slot) from each response — **the full snapshot is never used**. The `snapshot` field is transmitted and decoded by Python (`json.loads` parses all of it) but the caller only accesses `outcome` and `slot`. The snapshot data is generated, serialized, transmitted, deserialized, and **silently discarded** on every call.
+
+**Severity: Medium** (performance, not correctness)
+
+### V13: `_build_rb()` leaks bundle on any post-creation failure — Shim construction failure orphans kernel and venue
+
+**File:** `gen2.py:326-338`, `_gen_test.py:69-80`
+
+```python
+def _build_rb(ic=25000.0, max_slots=1):
+    cfg = _build_config(ic)
+    b = build_launcher_bundle(...)  # bundle created with active kernel, venue, etc.
+    k = b.kernel
+    k.account.snapshot.capital = ic  # <-- if this raises (e.g., AttributeError)
+    ...  # Shim construction, etc.
+    return RB(runtime=Shim(k), config=cfg)
+```
+
+If any line after `build_launcher_bundle()` raises, the bundle `b` (with live kernel, venue connections, zinc plane) is leaked. No `try/finally` to call `b.close()`. In test code this is acceptable (process exits soon), but if test suites grow large, accumulated leaked bundles exhaust kernel slots or file descriptors.
+
+**Severity: Low**
+
+### V14: `_maybe_close` uses `break` after first method match — only tries `close` OR `disconnect`, never both
+
+**File:** `launcher.py:67`
+
+```python
+for method_name in ("close", "disconnect"):
+    method = getattr(obj, method_name, None)
+    if callable(method):
+        ...
+        break  # <-- breaks after first successful match — never tries the second method
+```
+
+If an object has both `close()` and `disconnect()`, only `close()` is called. The `disconnect()` fallback is never reached if `close()` exists. This is correct for objects like `RealZincPlane` (which has `close()` but no `disconnect()`), but for objects with both methods (possible future adapters), the second method is silently skipped.
+
+**Severity: Low**
+
+---
+
+## Pass 19 Summary
+
+| # | Flaw | Layer | Severity |
+|---|------|-------|----------|
+| V1 | `DITAv2LauncherBundle.close()` never calls `kernel.close()` — Rust handle leaks via `__del__` | Launcher | **Critical** |
+| V2 | `BingxVenueAdapter` no `close()`/`disconnect()` — ThreadPoolExecutor/HTTP never release | Venue | **Critical** |
+| V3 | `process_intent` ENTER doesn't clear `seen_event_ids` — old dedup pollutes new trade | Rust | **High** |
+| V4 | 3 generators write same output file — last writer wins, incompatible prologues | Test | **Critical** |
+| V5 | Generated tests triple env-gated — never run in CI, dead code | Test | **Critical** |
+| V6 | `kernel.close()` destroys Rust handle immediately — no drain, no flush, use-after-free risk | Bridge | **Critical** |
+| V7 | `_last_settled_pnl` dict accessed from process_intent and on_venue_event without locks | Bridge | Medium |
+| V8 | `#[serde(default)] leverage: f64` default 0.0 — mark_price uses directly no `.max(1.0)` | Rust | Medium |
+| V9 | No `conftest.py`, no `pytest.ini`, no `asyncio_mode` — test discovery fragile | Test | **High** |
+| V10 | `kernel.close()` `except Exception: pass` — silently swallows destroy errors | Bridge | Low |
+| V11 | `build_launcher_bundle()` no cleanup on partial failure — OOM orphans 4 components | Launcher | Medium |
+| V12 | `KernelResult` clones entire kernel state every FFI call — wasted allocations | Rust | Medium |
+| V13 | `_build_rb()` leaks bundle on post-creation failure | Test | Low |
+| V14 | `_maybe_close` breaks after first method — never tries both close and disconnect | Launcher | Low |
+
+### Pass 19 Severity
+
+| Severity | Count |
+|----------|-------|
+| **Critical** | 5 (V1, V2, V4, V5, V6) |
+| **High** | 2 (V3, V9) |
+| Medium | 4 (V7, V8, V11, V12) |
+| Low | 3 (V10, V13, V14) |
+
+### Combined Catalog (All 19 Passes)
+
+| Pass | Focus | Count | Critical | High | Medium | Low | Info |
+|------|-------|-------|----------|------|--------|-----|------|
+| A | Architectural | 15 | 0 | 2 | 0 | 2 | 11 |
+| T | Threading/Atomicity | 9 | 1 | 3 | 3 | 2 | 0 |
+| E | E2E Trace (Pass 1) | 26 | 0 | 4 | 10 | 11 | 1 |
+| F | Deep E2E (Pass 3) | 30 | 0 | 1 | 8 | 17 | 4 |
+| G | Domain Scans (Pass 4) | 36 | 4 | 11 | 11 | 8 | 2 |
+| H | Edge Domains (Pass 5) | 22 | 3 | 9 | 5 | 4 | 1 |
+| I | Pass 6 (Math/Tests/Recovery/Security) | 22 | 3 | 11 | 4 | 2 | 2 |
+| J | Pass 7 (Test Infra/Data/Rust/Env/Conn) | 16 | 0 | 7 | 7 | 2 | 0 |
+| K | Pass 8 (Observability/Memory/Time/DeadCode) | 23 | 2 | 7 | 7 | 1 | 6 |
+| L | Pass 9 (Contracts/Events/Network/FFI/Diffs) | 16 | 0 | 4 | 8 | 4 | 0 |
+| M | Pass 10 (Runtime/TestBugs/FSM/Persistence/Metrics) | 18 | 3 | 7 | 5 | 3 | 0 |
+| N | Pass 11 (Async/Sync Seams/Locks/Threading) | 10 | 4 | 1 | 3 | 1 | 1 |
+| O | Pass 12 (Sync/Async Wider Scope) | 11 | 0 | 3 | 7 | 1 | 0 |
+| P | Pass 13 (FFI Safety/Dangling Pointers/Coverage) | 9 | 1 | 3 | 3 | 1 | 1 |
+| Q | Pass 14 (Serde Edges/Backup Diffs/Market Data) | 12 | 0 | 4 | 3 | 2 | 3 |
+| R | Pass 15 (Resource Leaks/Trust Boundaries/Security) | 14 | 2 | 6 | 3 | 2 | 1 |
+| S | Pass 16 (Error Handling/Arithmetic/Test Infra) | 16 | 4 | 7 | 5 | 0 | 0 |
+| T | Pass 17 (Unsafe Review/Dead Code/Build/Protocols) | 14 | 0 | 5 | 5 | 4 | 0 |
+| U | Pass 18 (Rust Test Gaps/Accounting/FFI Types) | 14 | 3 | 4 | 4 | 3 | 0 |
+| V | Pass 19 (Lifecycle/Rust Subtleties/Test Infra) | 14 | 5 | 2 | 4 | 3 | 0 |
+| **Total** | | **347** | **35** | **101** | **100** | **64** | **37** |