PINK: E2E trace analysis — Pass 11 async/sync seams/locks/threading (N1-N10)

Eleventh pass: Rust kernel with_handle_mut has zero synchronization — &mut KernelCore from raw pointer with no Mutex, concurrent FFI calls cause UB (N1 Critical), _run() has two completely different code paths depending on event loop state (N2 Critical), path B blocks event loop thread for every HTTP operation (N3 Critical), asyncio.run() called repeatedly creating destroying event loops per call (N4 Critical), _snapshot_ready Event cascading re-fetch — N callers produce N overlapping HTTP calls (N5 High). 243 total. Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
2026-06-02 08:00:50 +02:00
parent 81fe1d6d25
commit 24034416e0
2 changed files with 261 additions and 1 deletions
--- a/PINK_DITAv2_E2E_TRACE_ANALYSIS.md
+++ b/PINK_DITAv2_E2E_TRACE_ANALYSIS.md
@@ -4307,3 +4307,243 @@ Even if the `DecisionEngine` produced a LIMIT decision with a limit price, the r
 | L | Pass 9 (Contracts/Events/Network/FFI/Diffs) | 16 | 0 | 4 | 8 | 4 | 0 |
 | M | Pass 10 (Runtime/TestBugs/FSM/Persistence/Metrics) | 18 | 3 | 7 | 5 | 3 | 0 |
 | **Total** | | **233** | **16** | **66** | **68** | **56** | **27** |
 ---
 ## PASS 11 — ASYNC/SYNC SEAMS, LOCK ANALYSIS, THREADING
 ### N1: Rust kernel `with_handle_mut` has zero synchronization — `&mut KernelCore` from raw pointer, UB on concurrent FFI
 **File:** `_rust_kernel/src/lib.rs:2042`
 ```rust
 fn with_handle_mut<F, R>(handle: *mut KernelHandle, f: F) -> Result<R, String>
 where
    F: FnOnce(&mut KernelCore) -> Result<R, String>,
 {
    // Safety: single-threaded; caller holds exclusive access for the duration.
    let core = unsafe { &mut (*handle).core };   // raw ptr → &mut
 ```
 The comment says "single-threaded" but provides **zero enforcement** — no `Mutex`, no `RwLock`, no atomic flag, no thread-local constraints, no `!Send`/`!Sync` marker on `KernelCore`. The `unsafe` block converts a raw pointer to a `&mut` reference, which under Rust's aliasing rules must be **exclusive** — two simultaneous `&mut` references to the same data is **undefined behavior** (data race, torn reads, LLVM miscompilation).
 The `ctypes` FFI mechanism releases the GIL during the Rust call (`Py_BEGIN_ALLOW_THREADS`/`Py_END_ALLOW_THREADS`). Two Python threads can call any two `dita_kernel_*` functions simultaneously — one in `process_intent` (writing slot state), another in `snapshot_json` (reading). Both produce `&mut KernelCore`. This is a **compiler-level UB**, not just a logical race.
 **Trigger scenario:** Thread A calls `process_intent()` (ENTRY fill → mutates slot). Thread B calls `on_venue_event()` (exit fill → mutates slot). The GIL is released during both Rust FFI calls. Both get `&mut KernelCore`. The Rust compiler can reorder, elide, or speculate any memory operation. Slot data becomes corrupted, PnL doubles, or the process segfaults.
 **Severity: Critical** — undefined behavior, no enforcement, no mitigation.
 ### N2: `_run()` has two completely different code paths depending on event loop state — runtime branch, not design decision
 **File:** `bingx_venue.py:225-238`
 ```python
 def _run(self, result):
    if inspect.isawaitable(result):
        try:
            asyncio.get_running_loop()
        except RuntimeError:
            return asyncio.run(result)           # Path A: no loop → direct run
        pool = self._get_executor()
        return pool.submit(asyncio.run, result).result()  # Path B: loop → pool + block
    return result
 ```
 Path A (no event loop running): `asyncio.run(result)` — creates a new event loop, runs the coroutine, closes it. All on the same thread. Correct for sync contexts.
 Path B (event loop running): `pool.submit(asyncio.run, result).result()` — submits to a 3-thread pool, each worker creates yet ANOTHER event loop via `asyncio.run()`, then blocks the calling thread with `.result()`.
 The `asyncio.get_running_loop()` check is a **runtime probe** — the code doesn't know from its design whether it's in an async context. Same logical operation (run a coroutine), two completely different implementations. Path B is a documented anti-pattern (creating/destroying event loops per call), Path A is correct.
 This is the root cause of the entire async/sync seam problem — the architecture never committed to being async or sync.
 **Severity: Critical**
 ### N3: `_run()` Path B blocks the event loop thread for every venue HTTP operation
 **File:** `bingx_venue.py:236`
 ```python
 return pool.submit(asyncio.run, result).result()  # BLOCKS calling thread
 ```
 When called from within a running event loop (all live tests, any async deployment), `.result()` **blocks the event loop thread** until the thread pool worker completes. During this block:
 - No WS messages can be received from the `BingxUserStream`
 - No keepalive tasks can run
 - No timer-based events can fire
 - The event loop is **stuck**
 If the thread pool is exhausted (3 concurrent HTTP calls — e.g., `_backend_snapshot` from `submit()` which calls it twice plus `cancel()` which calls it three times), the 4th call blocks at `.result()` **indefinitely** — the work item is queued but no worker is free. This is a **stuck-process scenario** where the entire system freezes.
 The event loop thread is blocked on `.result()`, which means it cannot process the WS events that might contain the fill for the order it just submitted. If the exchange fills instantly, the WS message arrives before `.result()` returns — the WS data sits in the kernel's TCP receive buffer, unprocessed, until `process_intent` completes and the event loop can schedule the WS reader again. This delay can cause stale fills, missed state transitions, or WS timeouts.
 **Severity: Critical**
 ### N4: `asyncio.run()` called repeatedly inside thread pool — creates/destroys event loops per call, documented anti-pattern
 **File:** `bingx_venue.py:236`
 ```python
 return pool.submit(asyncio.run, result).result()
 ```
 Each call to `asyncio.run()` creates a new `SelectorEventLoop`, runs it, then closes it. Doing this repeatedly for every HTTP call is a documented CPython anti-pattern:
 - Each loop allocation costs memory (selector, callbacks, timeout queue)
 - Each loop destruction leaves loop-internal objects for GC
 - Over many calls (hundreds of trades), this creates GC pressure and memory fragmentation
 - The `asyncio.run()` documentation explicitly says "don't call this repeatedly" — use a long-lived loop
 Path A (no event loop) has the same issue — `asyncio.run()` is called per-`_run()` invocation.
 The total cost: each `process_intent()` may call `_run()` 3-4 times (`_backend_snapshot` ×2 + `submit_intent` + optionally `cancel`). Each `_run()` creates/destroys an event loop. With 10 trades/min, that's 30-40 event loop creations/destructions per minute.
 **Severity: Critical**
 ### N5: `_snapshot_ready` Event cascading re-fetch — N concurrent callers produce N overlapping HTTP calls
 **File:** `bingx_venue.py:258-274`
 ```python
 def _backend_snapshot(self, ...):
    if not self._snapshot_ready.wait(timeout=timeout_ms / 1000.0):
        return self._last_snapshot  # stale
    self._snapshot_ready.clear()
    try:
        snapshot = self._call_backend("refresh_state", ...)  # HTTP call
    except Exception:
        self._snapshot_ready.set()
        raise
    with self._snap_lock:
        self._last_snapshot = snapshot
    self._snapshot_ready.set()
 ```
 When `_snapshot_ready.set()` fires at the end, ALL threads waiting on `.wait()` wake up. Each one proceeds to `clear()` and start a **new** HTTP call — even though a fresh snapshot was just written. With N concurrent callers to `_backend_snapshot`, this produces N overlapping `refresh_state` HTTP calls instead of N-1 callers reading the just-received result.
 On BingX VST (rate limit ~10 req/s), 3 overlapping `refresh_state` calls (each doing 5 parallel sub-requests) burns 15 of the 10 req/s budget. The calls overlap and cascade, wasting rate-limit capacity with redundant work.
 **Severity: High**
 ### N6: `BingxUserStream.close()` does not cancel pending tasks — keepalive/rotation tasks continue after close
 **File:** `bingx_user_stream.py:160-169`
 ```python
 async def close(self) -> None:
    self._closed.set()
    if self._session is not None and not self._session.closed:
        await self._session.close()
 ```
 `close()` sets the `_closed` event and closes the aiohttp session. It does **not** cancel the `keepalive_task` or `rotation_task` created inside `subscribe()`. These tasks are only cancelled in the `finally` block of `subscribe()`. If `close()` is called while nobody is iterating the `subscribe()` generator (or if iteration is blocked in `_consume()`), those tasks **keep running** until:
 - The event loop shuts down (automatic task cancellation)
 - The subscribe generator is garbage collected
 - An exception occurs in the WS reader
 During this window, the keepalive loop continues sending PUT requests to the (now potentially deleted) listen key. The rotation task continues its 23h50m sleep. Both are zombie tasks with no cleanup path.
 **Severity: Medium**
 ### N7: Live test architecture forces worst-case `_run()` behavior for every operation
 **File:** `gen_live_tests.py`, `gen2.py`, `_gen_test.py` (all test generators)
 The live tests use this pattern:
 ```python
 def test_pink_ditav2_xxx(_live_client) -> None:
    ...
    result = asyncio.run(_run_scenario(bundle, _live_client, body_fn, name, ic))
 ```
 Each test is a **synchronous** function that calls `asyncio.run()`. Inside the resulting event loop, every call to `k.process_intent()` triggers **Path B** of `_run()` — the pool-submit-`.result()` path. The test architecture forces the architecture's slowest, most thread-expensive code path for every single intent.
 Every HTTP call: creates a new event loop on a pool thread → blocks the main event loop thread → blocks WS processing → wastes pool slots. Even for trivial mock-venue tests that don't need HTTP at all, the architecture still goes through the same `_run()` → pool → `.result()` path because the mock venue also returns awaitables.
 **Severity: Medium**
 ### N8: `BingxUserStream subscribe()` creates new tasks on every reconnect — rapid reconnect causes task churn
 **File:** `bingx_user_stream.py:100-120`
 ```python
 async def subscribe(self):
    while not self._closed.is_set():
        ...
        keepalive_task = asyncio.create_task(self._keepalive_loop(listen_key))
        rotation_task = asyncio.create_task(self._rotation_sentinel())
        ...
        async for event in self._consume(listen_key, rotation_task):
            yield event
 ```
 Each iteration of the reconnect loop creates new `keepalive_task` and `rotation_task`, then cancels the previous ones in the `finally` block. If the connection drops every few seconds (unstable WS), tasks are created and cancelled in rapid succession. Cancellation races with task creation — a task can be cancelled before its first `await`, which changes its state machine.
 Also: no rate limiting on the reconnect loop beyond the `delay_ms` exponential backoff. If the WS repeatedly fails immediately after connection, the loop creates/destroys tasks in a tight cycle.
 **Severity: Medium**
 ### N9: No `asyncio.all_tasks()` or task accounting anywhere — leaked tasks undetectable
 No code in the entire workspace calls `asyncio.all_tasks()` or maintains a task registry. If a task is leaked (cancellation not propagated, generator not cleaned up), there is:
 - No way to detect it programmatically
 - No warning log
 - No metrics
 - No `__del__` fallback
 Combined with N6 (tasks not cancelled on close) and N8 (task churn on reconnect), leaked tasks accumulate silently. Each leaked task holds references to its coroutine frame, which may hold references to `aiohttp.ClientSession`, websocket connections, and other resources.
 **Severity: Low**
 ### N10: `_snap_lock` / `_snapshot_ready` pattern has no reader-side protection on `_last_snapshot`
 **File:** `bingx_venue.py:258-274`
 The `_snap_lock` protects `_last_snapshot` only during writes (line 269-271). The fallback path (timeout at line 260-262) also reads `_last_snapshot` under `_snap_lock`. But the `_call_backend` call at line 266 is **outside** the lock — the snapshot is fetched without holding `_snap_lock`, which is correct (don't hold a lock across HTTP). However, the time between releasing the lock and reacquiring it for the write (line 269) means another thread could also be writing `_last_snapshot` concurrently. The `_snap_lock` ensures only one write at a time, but the `_last_snapshot` can still be overwritten between threads — this is the intended behavior (last writer wins for staleness purposes, not a correctness bug).
 **Severity: Informational**
 ---
 ## Pass 11 Summary
 | # | Flaw | Layer | Severity |
 |---|------|-------|----------|
 | N1 | Rust kernel `with_handle_mut` zero synchronization — `&mut` from raw ptr, UB on concurrent FFI | Rust | **Critical** |
 | N2 | `_run()` has two completely different code paths — runtime branch, not design decision | Venue | **Critical** |
 | N3 | `_run()` path B blocks event loop thread for every venue HTTP operation | Venue | **Critical** |
 | N4 | `asyncio.run()` called repeatedly — creates/destroys event loops per call, documented anti-pattern | Venue | **Critical** |
 | N5 | `_snapshot_ready` cascading re-fetch — N callers produce N overlapping HTTP calls | Venue | **High** |
 | N6 | `BingxUserStream.close()` doesn't cancel pending tasks — zombie keepalive/rotation after close | Stream | Medium |
 | N7 | Live test architecture forces worst-case `_run()` path for every operation | Test | Medium |
 | N8 | `subscribe()` reconnect creates new tasks per iteration — rapid reconnect causes task churn | Stream | Medium |
 | N9 | No `asyncio.all_tasks()` or task accounting — leaked tasks undetectable | All | Low |
 | N10 | `_snap_lock`/`_snapshot_ready` no reader-side protection (informational) | Venue | Info |
 ### Pass 11 Severity
 | Severity | Count |
 |----------|-------|
 | **Critical** | 4 (N1, N2, N3, N4) |
 | **High** | 1 (N5) |
 | Medium | 3 (N6, N7, N8) |
 | Low | 1 (N9) |
 | Info | 1 (N10) |
 ### Combined Catalog (All 11 Passes)
 | Pass | Focus | Count | Critical | High | Medium | Low | Info |
 |------|-------|-------|----------|------|--------|-----|------|
 | A | Architectural | 15 | 0 | 2 | 0 | 2 | 11 |
 | T | Threading/Atomicity | 9 | 1 | 3 | 3 | 2 | 0 |
 | E | E2E Trace (Pass 1) | 26 | 0 | 4 | 10 | 11 | 1 |
 | F | Deep E2E (Pass 3) | 30 | 0 | 1 | 8 | 17 | 4 |
 | G | Domain Scans (Pass 4) | 36 | 4 | 11 | 11 | 8 | 2 |
 | H | Edge Domains (Pass 5) | 22 | 3 | 9 | 5 | 4 | 1 |
 | I | Pass 6 (Math/Tests/Recovery/Security) | 22 | 3 | 11 | 4 | 2 | 2 |
 | J | Pass 7 (Test Infra/Data/Rust/Env/Conn) | 16 | 0 | 7 | 7 | 2 | 0 |
 | K | Pass 8 (Observability/Memory/Time/DeadCode) | 23 | 2 | 7 | 7 | 1 | 6 |
 | L | Pass 9 (Contracts/Events/Network/FFI/Diffs) | 16 | 0 | 4 | 8 | 4 | 0 |
 | M | Pass 10 (Runtime/TestBugs/FSM/Persistence/Metrics) | 18 | 3 | 7 | 5 | 3 | 0 |
 | N | Pass 11 (Async/Sync Seams/Locks/Threading) | 10 | 4 | 1 | 3 | 1 | 1 |
 | **Total** | | **243** | **20** | **67** | **70** | **58** | **28** |
--- a/PINK_DITAv2_FLAW_ANALYSIS_2026-05-31.md
+++ b/PINK_DITAv2_FLAW_ANALYSIS_2026-05-31.md
@@ -25,7 +25,8 @@
 | K | Pass 8 (Observability/Memory/Time/DeadCode) | 23 | 2 | 7 | 7 | 1 | 6 |
 | L | Pass 9 (Contracts/Events/Network/FFI/Diffs) | 16 | 0 | 4 | 8 | 4 | 0 |
 | M | Pass 10 (Runtime/TestBugs/FSM/Persistence/Metrics) | 18 | 3 | 7 | 5 | 3 | 0 |
-| **Total** | | **233** | **16** | **66** | **68** | **56** | **27** |
+| N | Pass 11 (Async/Sync Seams/Locks/Threading) | 10 | 4 | 1 | 3 | 1 | 1 |
 | **Total** | | **243** | **20** | **67** | **70** | **58** | **28** |
 ---
@@ -306,6 +307,25 @@
 ---
 ## N-Series: Async/Sync Seams, Lock Analysis, Threading (Pass 11)
 *Full detail in TRACE doc under "PASS 11 — ASYNC/SYNC SEAMS, LOCK ANALYSIS, THREADING."*
 | # | Flaw | Layer | Severity |
 |---|------|-------|----------|
 | N1 | Rust kernel `with_handle_mut` zero sync — `&mut` from raw ptr, UB on concurrent FFI | Rust | **Critical** |
 | N2 | `_run()` has two completely different code paths — runtime branch, not design | Venue | **Critical** |
 | N3 | `_run()` path B blocks event loop thread for every venue HTTP operation | Venue | **Critical** |
 | N4 | `asyncio.run()` called repeatedly — creates/destroys event loops per call | Venue | **Critical** |
 | N5 | `_snapshot_ready` cascading re-fetch — N callers produce N overlapping HTTP | Venue | **High** |
 | N6 | `BingxUserStream.close()` doesn't cancel pending tasks | Stream | Medium |
 | N7 | Live test architecture forces worst-case `_run()` path for every operation | Test | Medium |
 | N8 | `subscribe()` reconnect creates new tasks per iteration | Stream | Medium |
 | N9 | No `asyncio.all_tasks()` or task accounting — leaked tasks undetectable | All | Low |
 | N10 | `_snap_lock` no reader-side protection (informational) | Venue | Info |
 ---
 ## H-Series: Edge Domains — Dependencies, Error Handling, Types, Contracts (Pass 5)
 *Full detail in TRACE doc under "PASS 5 — EDGE DOMAINS."*