Codex 09db2e694b PINK: E2E trace analysis — Pass 21 rust build/deps/python packaging/shared mem (X1-X14)

Twenty-first pass: no ABI compatibility check on Rust .so load stale binary
corrupts silently (X1 Critical), real_zinc_plane _write_region zeroes entire
buffer before write visible all-zero window (X2 Critical), no requirements.txt
setup.py pyproject.toml zero Python dependency declarations (X3 Critical),
RealZincControlPlane.update() no thread lock concurrent calls corrupt seq and
shared memory (X4 High), libc declared in Cargo.toml never used dead dependency
(X5 High), 5 test files hardcoded sys.path.insert non-portable (X6 High),
_decode_packet no try/except on json.loads partial body read crashes reader (X7
High), ExchangeEvent not exported from __init__.py package API inconsistency (X8
High), RealZincPlane and RealZincControlPlane collide on {prefix}_control region
name (X10 Medium). 375 total flaws across 21 passes.

Co-authored-by: CommandCodeBot <noreply@commandcode.ai>

2026-06-02 18:04:33 +02:00

PINK DITAv2 — End-to-End Trace & Flaw Analysis

Analysis date: 2026-05-31
Method: Full-trace static analysis covering every file, every data path, and every boundary crossing in the PINK execution pipeline. No test execution.
System scope: 34 active source files, ~12,000 lines across Rust kernel, Python bridge, venue adapter, runtime, and persistence.

Central flaw registry: PINK_DITAv2_FLAW_ANALYSIS_2026-05-31.md contains the combined catalog of all 116 flaws (A, T, E, F, G series) with severity distribution and cross-references. This file provides the deep E2E trace context — read the central registry for the master list.

E2E Data Flow (One Call)

Every E2E path in the PINK system traces through this sequence. Each numbered step below is a site where data crosses a module boundary and can be lost, mangled, or misinterpreted.

PinkDirectRuntime.step()                    # R1: policy cycle entry
  ├─ pump_venue_events()                    # R2: drain async fills
  ├─ kernel.snapshot()["account"]           # R3: read capital
  ├─ kernel.slot(0)                         # R4: read slot state
  ├─ decision_engine.decide()               # R5: policy-layer ENTER/EXIT
  ├─ intent_engine.plan()                   # R6: intent sizing
  ├─ _decision_to_kernel_intent()           # R7: Decision → KernelIntent
  ├─ kernel.process_intent(kernel_intent)   # R8: KERNEL BOUNDARY
  │   ├─ rust_backend._intent_to_payload()  # R8a: KernelIntent → JSON
  │   ├─ _RustKernelLib.process_intent()    # R8b: JSON → C FFI
  │   │   └─ Rust process_intent()          # R8c: FSM mutates TradeSlot
  │   ├─ venue.submit(intent)               # R9: VENUE BOUNDARY
  │   │   ├─ bingx_venue._legacy_intent()   # R9a: KernelIntent → LegacyIntent
  │   │   ├─ BingxDirectExecutionAdapter    # R9b: HTTP POST /trade/order
  │   │   │   .submit_intent()
  │   │   └─ bingx_venue._events_from_submit() # R9c: receipt → VenueEvent[]
  │   └─ on_venue_event(event)              # R10: FEEDBACK BOUNDARY
  │       ├─ _RustKernelLib → Rust FSM      # R10a: C FFI → FSM transition
  │       ├─ account.settle(delta)          # R10b: incremental PnL settlement
  │       └─ persistence writes             # R10c: ClickHouse / Zinc / HZ
  ├─ kernel.snapshot()["account"]           # R11: read final capital
  └─ persistence.persist_step()             # R12: PERSISTENCE BOUNDARY

Layer 1: Policy Cycle Entry (pink_direct.py:422)

E1: `step()` calls `pump_venue_events()` every cycle unconditionally

pink_direct.py:436

await self.pump_venue_events(snapshot, market_state=market_state)

This is called before reading slot/account state for the policy decision. The pump calls venue.reconcile(), which for BingxVenueAdapter issues up to 5 HTTP requests (balance, positions, open orders, plus history if include_history).

For MARKET-only workflows, no resting orders exist, so reconcile() returns empty events every time, yet the HTTP calls still happen. On BingX VST with a ~10 req/s limit and a 5s policy cycle, 5 requests per cycle burns 1 req/s (the 3-request case with include_history=False burns 0.6 req/s) just to learn "nothing changed." Add the actual trade HTTP calls, and the budget is tight.

Flaw: E1 — unconditional exchange poll wastes rate limit. Already documented as A10, but worse when traced E2E: each pump_venue_events calls venue.reconcile() → _backend_snapshot() → parallel asyncio.gather of 3 HTTP GETs. The _refresh_exchange_state at bingx_direct.py:281-352 always fetches balance + positions + openOrders concurrently. Even when include_history=False (which it is for the pump), that's 3 HTTP calls every policy cycle regardless of whether any orders are resting.

Severity: Medium. Wasteful but not destructive on testnet.
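The rate arithmetic and one possible mitigation can be sketched as follows. This is a minimal illustration, not the runtime's code: `poll_rate`, `should_reconcile`, and the `max_idle_seconds` safety interval are hypothetical names introduced here.

```python
def poll_rate(requests_per_cycle: int, cycle_seconds: float) -> float:
    """Requests per second consumed by polling once per policy cycle."""
    return requests_per_cycle / cycle_seconds


def should_reconcile(resting_orders: int,
                     seconds_since_last: float,
                     max_idle_seconds: float = 30.0) -> bool:
    """Hypothetical gate: poll when orders are resting, else only occasionally.

    The periodic fallback keeps a safety net for state the runtime does not
    know about (for example, a position changed outside the process).
    """
    if resting_orders > 0:
        return True
    return seconds_since_last >= max_idle_seconds


if __name__ == "__main__":
    print(poll_rate(3, 5.0))          # three GETs per 5 s cycle
    print(poll_rate(5, 5.0))          # five requests per 5 s cycle
    print(should_reconcile(0, 5.0))   # MARKET-only, polled 5 s ago
```

Whether a 30 s idle interval is acceptable depends on how quickly out-of-band changes must be noticed; the value is an assumption.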

E2: `kernel.snapshot()["account"]` returns a fresh dict, not a live view

pink_direct.py:437

acc = self.kernel.snapshot()["account"]

ExecutionKernel.snapshot() at rust_backend.py:740-752 builds a dict from kernel state at call time. The decision/intent engines then consume this snapshot. Between the snapshot and process_intent() (line 523), another caller (or the same runtime in a concurrent cycle) could advance the kernel state, making the decision based on stale capital.

Flaw: E2 — TOCTOU between capital snapshot and intent execution. The context.capital read at line 437 is used at line 523 for the ENTER safety guard (_unsafe_entry_reason) and possibly by the decision/intent engines. If capital changes between these two points (e.g. an async fill arrives via a concurrent test-HTTP path), the guard uses stale capital.

Severity: Low in single-threaded deployment. Critical under concurrency.
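One common way to detect this class of TOCTOU is to stamp the snapshot with a version counter and re-check it before acting. The sketch below is a toy model under that assumption; `ToyAccount`, `take_snapshot`, and `snapshot_is_fresh` are hypothetical and do not exist in the PINK codebase.

```python
from dataclasses import dataclass


@dataclass
class ToyAccount:
    """Toy stand-in for the account projection (hypothetical)."""
    capital: float
    version: int = 0

    def settle(self, delta: float) -> None:
        # Every mutation bumps the version so readers can detect staleness.
        self.capital += delta
        self.version += 1


def take_snapshot(account: ToyAccount) -> dict:
    """Return a detached copy of the state plus the version it was read at."""
    return {"capital": account.capital, "version": account.version}


def snapshot_is_fresh(account: ToyAccount, snap: dict) -> bool:
    """True only if nothing settled since the snapshot was taken."""
    return account.version == snap["version"]


if __name__ == "__main__":
    acc = ToyAccount(capital=1000.0)
    snap = take_snapshot(acc)
    acc.settle(-50.0)  # a fill arrives between snapshot and execution
    print(snap["capital"], acc.capital, snapshot_is_fresh(acc, snap))
```

A guard that finds the snapshot stale can either re-read and re-decide or reject the intent; which is appropriate is a policy choice this trace does not settle.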

Layer 2: Decision/Intent Bridging (pink_direct.py:79-115)

E3: `_decision_to_kernel_intent` drops `order_type` and `limit_price`

pink_direct.py:79-115

def _decision_to_kernel_intent(decision, intent, slot_id=0):
    return KernelIntent(
        ...
        # order_type and limit_price are NOT SET here
    )

KernelIntent has order_type="MARKET" and limit_price=0.0 as defaults, so MARKET orders work correctly. But the runtime never sets these fields from the policy layer. If decision or intent ever carries order_type or limit_price, it's silently dropped because the bridge doesn't map them.

Flaw: E3 — LIMIT support in runtime is dead code. The order_type/limit_price fields in KernelIntent and the LIMIT payload building in bingx_direct.py lines 384-398 are unreachable from the runtime. The only path that can set them is direct KernelIntent(...) construction in tests (_build_pink_bodies.py style scenarios). The _decision_to_kernel_intent bridge must be patched when a policy engine needs to emit LIMIT orders.

Severity: Medium. Blocks any production path to LIMIT orders.
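A bridge that forwards these two fields could look roughly like the sketch below. It assumes the policy objects expose `order_type` and `limit_price` as plain attributes, which the trace does not confirm; `ToyKernelIntent` and `bridge_order_fields` are hypothetical stand-ins, with defaults copied from the KernelIntent defaults described above.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass(frozen=True)
class ToyKernelIntent:
    """Only the two fields under discussion (hypothetical stand-in)."""
    order_type: str = "MARKET"
    limit_price: float = 0.0


def bridge_order_fields(decision, intent) -> ToyKernelIntent:
    """Prefer the intent's values, then the decision's, then the defaults."""
    order_type = (getattr(intent, "order_type", None)
                  or getattr(decision, "order_type", None)
                  or "MARKET")
    raw_price = (getattr(intent, "limit_price", None)
                 or getattr(decision, "limit_price", None)
                 or 0.0)
    order_type = str(order_type).upper()
    limit_price = float(raw_price)
    # Fail loudly rather than silently degrading a LIMIT into a bad order.
    if order_type == "LIMIT" and limit_price <= 0.0:
        raise ValueError("LIMIT order requires a positive limit_price")
    return ToyKernelIntent(order_type=order_type, limit_price=limit_price)


if __name__ == "__main__":
    plain = bridge_order_fields(SimpleNamespace(), SimpleNamespace())
    limit = bridge_order_fields(
        SimpleNamespace(),
        SimpleNamespace(order_type="limit", limit_price=0.0835),
    )
    print(plain, limit)
```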

E4: `_exit_intent_from_slot` trusts slot.size but slot may be stale

pink_direct.py:398-420

def _exit_intent_from_slot(self, kernel_intent):
    try:
        slot_size = float(self.kernel.slot(int(kernel_intent.slot_id)).size or 0.0)
    except Exception:
        slot_size = 0.0
    ...
    exit_size = min(policy_size, slot_size) if policy_ok else slot_size

Reads slot.size fresh from the Rust kernel at call time, then uses it to cap the exit size. Between this read and the process_intent call that actually executes the EXIT (line 523), the slot can be modified by pump_venue_events (line 436) or a concurrent cycle. If a partial fill arrived between the slot read and the EXIT, the exit size could be wrong.

Flaw: E4 — TOCTOU between exit sizing and exit execution. Same class as E2 but for exit size rather than capital. If the pump drained a partial fill between R4 (slot read) and R8 (process_intent), the EXIT requests a size based on pre-pump remaining size. The kernel caps it at actual remaining, so this is self-correcting — but the intent payload has wrong metadata.

Severity: Low. Self-correcting at kernel level.

Layer 3: Kernel Bridge — Rust FSM Entry (rust_backend.py)

E5: JSON serialization round-trip loses numeric precision

rust_backend.py:460-485 (_intent_to_payload)

KernelIntent fields like reference_price, target_size, leverage are Python floats. They're serialized to JSON text, sent through C FFI, parsed by serde_json into Rust f64, then serialized back to JSON, parsed by Python json.loads(). Each serialization step can introduce precision loss:

# Python float → JSON: 0.1 → "0.1" → Rust f64: 0.10000000000000000555
# Rust f64 → JSON: → serde_json may print "0.10000000000000001"
# Python json.loads → 0.10000000000000001

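For prices (TRXUSDT at ~$0.08), a 1e-16 relative error is negligible. For PnL accumulation the error grows roughly linearly with trade count: thousands of round-trips at ~1e-16 relative error each give on the order of 1e-13 relative drift, which at 9x leverage is still far below a cent for any realistic account size. The |Δcapital − realized| < 1e-9 assertion in tests would catch gross errors but not drift below that threshold.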

Flaw: E5 — JSON serialization precision drift over long runs. Severity: Low. Not a practical concern for the current deployment scale.
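The Python half of the round-trip can be checked directly. The sketch below only exercises `json.dumps` and `json.loads` from the standard library; it says nothing about the Rust half, whose exactness depends on how serde_json is built and is not established by this trace.

```python
import json
import random


def roundtrip(value: float) -> float:
    """Serialize a float to JSON text and parse it back."""
    return json.loads(json.dumps(value))


def count_mismatches(samples: int = 10_000, seed: int = 0) -> int:
    """Count finite floats in a price-like range that change after a round-trip."""
    rng = random.Random(seed)
    mismatches = 0
    for _ in range(samples):
        value = rng.uniform(1e-6, 1e6)
        if roundtrip(value) != value:
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    print(json.dumps(0.1), roundtrip(0.1) == 0.1, count_mismatches())
```

CPython emits the shortest decimal string that parses back to the same double, so any drift in the full pipeline would have to originate on the other side of the FFI boundary.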

E6: `_RustKernelLib` is a global singleton — shared across all kernels

rust_backend.py:40-45

_RUST: _RustKernelLib | None = None

def _get_rust() -> _RustKernelLib:
    global _RUST
    if _RUST is None:
        _RUST = _RustKernelLib()
    return _RUST

The _RustKernelLib singleton loads the .so shared library once and provides FFI functions. Each ExecutionKernel instance gets its own KernelHandle via _get_rust().create(max_slots). The FFI functions take the handle as the first argument, so multiple kernels are isolated at the Rust level.

However, the singleton means ALL kernels share the same ctypes function pointer table. If a second kernel is created and the first is destroyed, KernelHandle of the first becomes a dangling pointer. Calling any FFI function on the destroyed kernel's handle is use-after-free.

Flaw: E6 — No protection against use-after-free on kernel destroy. Already documented as T7. Worth re-emphasizing in the E2E trace because the test infrastructure creates and destroys kernels frequently (fresh-kernel reconcile tests, each _build_rb() call in scenario wrappers).

Severity: High. Use-after-free in C FFI is memory corruption.
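A Python-side guard can turn use-after-destroy into an exception before the raw pointer ever reaches the FFI. The sketch below is a generic pattern, not the existing KernelHandle; `GuardedHandle` and `ClosedHandleError` are hypothetical names, and the `destroy` callback stands in for whatever FFI destructor the library exposes.

```python
from typing import Any, Callable


class ClosedHandleError(RuntimeError):
    """Raised when a destroyed handle is used again."""


class GuardedHandle:
    """Wraps a raw handle and refuses access after close()."""

    def __init__(self, raw: Any, destroy: Callable[[Any], None]) -> None:
        self._raw = raw
        self._destroy = destroy
        self._closed = False

    @property
    def raw(self) -> Any:
        # Every FFI call site reads .raw, so the check cannot be bypassed
        # without reaching into the private attribute.
        if self._closed:
            raise ClosedHandleError("kernel handle used after destroy")
        return self._raw

    def close(self) -> None:
        # Idempotent: a second close() must not double-free.
        if not self._closed:
            self._closed = True
            self._destroy(self._raw)


if __name__ == "__main__":
    destroyed = []
    handle = GuardedHandle(42, destroyed.append)
    handle.close()
    handle.close()
    print(destroyed)
```

This does not protect against copies of the raw pointer held elsewhere; it only covers call sites that go through the wrapper.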

Layer 4: Rust Kernel FSM (lib.rs:728)

E7: ENTER handler silently allows re-entry with same trade_id

lib.rs:740-745

if !slot.is_free() && !slot.trade_id.is_empty() && slot.trade_id != intent.trade_id {
    return SLOT_BUSY;
}

If slot.trade_id == intent.trade_id, the ENTER is accepted even if the slot is not free (e.g., POSITION_OPEN with an active position). This is by design — it lets the same trade_id re-enter after the slot was partially reconciled or restored from a snapshot. But it also means:

1. EXIT sets slot.closed=true and transitions to CLOSED
2. A new ENTER with the same trade_id re-enters the CLOSED slot
3. The slot resets slot.closed=false, slot.size=0.0, slot.initial_size=0.0
4. Kernel now thinks the trade is new, but the Rust indexes still have the old trade_id pointing to slot 0

Downstream effect: After a re-entry with the same trade_id, the active_trade_index[trade_id] still correctly points to slot 0. But the old VenueOrder in client_order_index and venue_order_index is still present until the new entry fills and creates new orders. A reconcile event addressed to the old venue_client_id could stomp on the new trade.

Flaw: E7 — Re-entry with same trade_id leaves stale index entries. Severity: Low. The rebuild_indexes() call in commit_slot() rebuilds from scratch, so stale entries are cleared on the first write.

E8: EXIT handler uses `initial_size` not `current size`

lib.rs:770-775

let exit_ratio = slot.next_exit_ratio();
let base_size = if slot.initial_size > 0.0 { slot.initial_size } else { slot.size };
let exit_size = (base_size * exit_ratio).max(0.0);

Already documented as A1. In the E2E trace, this is the single most impactful execution flaw. A concrete scenario:

1. Enter with size=1.0, initial_size=1.0, exit_leg_ratios=(0.5, 0.5, 1.0).
2. EXIT leg 0 requests 1.0 * 0.5 = 0.5. Slot size goes to 0.5.
3. EXIT leg 1 requests 1.0 * 0.5 = 0.5. Slot size goes to 0.0 and active_leg_index advances to 2. Because exit_leg_ratios.len() is 3, all_legs_done = (2 >= 3) = false, so the slot stays at POSITION_OPEN with size=0.0 and closed=false.
4. EXIT leg 2 (ratio 1.0) would request 1.0 * 1.0 = 1.0, but the slot is at 0.0. The EXIT guard slot.is_free() || slot.closed || size <= 0.0 fires on the size check and returns NO_OPEN_POSITION. Meanwhile slot.is_free() is false, because fsm_state=POSITION_OPEN is not in {IDLE, CLOSED} and the size is not part of that check, so the ENTER guard !slot.is_free() also blocks re-entry.
5. Slot is stuck forever. No operation can advance it.

Severity: High. Concrete, reproducible, and not caught by any test.
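The scenario can be replayed in a small Python model of the guards as described in this section. The model is an approximation built from the quoted snippets, not a port of lib.rs: `ToySlot`, `exit_leg`, the string state names, and the rule that the slot only closes when all legs are done are assumptions.

```python
from dataclasses import dataclass


@dataclass
class ToySlot:
    """Toy model of the slot fields involved in exit sizing."""
    size: float
    initial_size: float
    ratios: tuple
    leg: int = 0
    state: str = "POSITION_OPEN"
    closed: bool = False


def is_free(slot: ToySlot) -> bool:
    return slot.state in {"IDLE", "CLOSED"}


def exit_leg(slot: ToySlot) -> str:
    """Apply one EXIT, sizing from initial_size as the quoted Rust code does."""
    if is_free(slot) or slot.closed or slot.size <= 0.0:
        return "NO_OPEN_POSITION"
    base = slot.initial_size if slot.initial_size > 0.0 else slot.size
    requested = max(base * slot.ratios[slot.leg], 0.0)
    slot.size -= min(requested, slot.size)  # kernel caps at remaining size
    slot.leg += 1
    if slot.leg >= len(slot.ratios):        # all_legs_done
        slot.state = "CLOSED"
        slot.closed = True
    return "ACCEPTED"


if __name__ == "__main__":
    slot = ToySlot(size=1.0, initial_size=1.0, ratios=(0.5, 0.5, 1.0))
    print([exit_leg(slot) for _ in range(3)], slot)
```

In this model the third EXIT is rejected while the slot is still POSITION_OPEN with size 0.0, matching the stuck state traced above.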

E9: CANCEL handler returns diagnostic even when nothing happened

lib.rs:795-810

if matches!(intent.action, KernelCommandType::CANCEL) {
    let has_cancellable_exit = slot.active_exit_order.is_some();
    let has_cancellable_entry = slot.active_entry_order.is_some()
        && matches!(slot.fsm_state, ENTRY_WORKING | ORDER_REQUESTED | ORDER_SENT | IDLE);
    if !has_cancellable_exit && !has_cancellable_entry {
        return KernelResult {
            outcome: KernelOutcome {
                accepted: false,
                diagnostic_code: NO_ACTIVE_EXIT_ORDER,
                ...
            },
            ...
        };
    }
    return KernelResult {
        outcome: KernelOutcome {
            accepted: true,
            ...
        },
        ...
    };
}

Two issues:

When neither is cancellable, the diagnostic is NO_ACTIVE_EXIT_ORDER even if the actual reason is "no active entry order either" or "slot is already IDLE". The diagnostic is misleading.
When at least one IS cancellable, the Rust kernel returns accepted=true but does not mutate the slot at all — it returns immediately with the slot as-is. The actual cancel (HTTP call + FSM transition) happens in the Python bridge. The Rust kernel's "accept" just means "yes you may try to cancel this" — not "the cancel is complete."

This disconnect means: if the Python bridge's venue.cancel() fails (HTTP error), the Rust kernel has already returned accepted=true for a cancel that never happened. The caller sees accepted=true but the slot state hasn't changed.

Flaw: E9 — Rust CANCEL "accepts" before Python actually cancels. Severity: Medium. The outcome.accepted boolean is misleading for CANCEL.

E10: `apply_fill` entry branch double-sets `active_entry_order`

lib.rs:1330-1390

// First set — at the top of the entry branch:
slot.active_entry_order = Some(VenueOrder {
    ...
    filled_size: fill_size,
    status: if partial { PARTIALLY_FILLED } else { FILLED },
    ...
});

// ... then later for full fill:
if !partial {
    slot.fsm_state = TradeStage::POSITION_OPEN;
    slot.active_entry_order = Some(VenueOrder {  // SECOND SET
        ...
        filled_size: slot.size,    // uses updated slot.size
        ...
    });
}

The entry branch sets active_entry_order at the top with filled_size from the event, then for a FULL_FILL, sets it again with filled_size = slot.size (which may have been updated by slot.initial_size = fill_size above). The first VenueOrder's intended_size is from the event, the second uses slot.size. Both are correct in isolation, but the double-write is wasteful.

More importantly, for a PARTIAL_FILL entry, the first set is the ONLY set. If a second PARTIAL_FILL arrives for the same order, the entry branch at line 1334 checks slot.active_entry_order.is_some() which is true (set by the first partial), but the FSM state is ENTRY_WORKING (also set by first partial). The condition at line 1334-1338 matches ENTRY_WORKING, so the second partial enters the entry branch again. But fill_size is the event's filled_size — the total filled, not the incremental amount.

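Candidate flaw: E10 (second PARTIAL_FILL on entry overwrites rather than accumulates). The trace that follows finds the size handling correct and narrows the actual defect to the entry price, recorded as E10a.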

let fill_size = if event.filled_size > 0.0 {
    event.filled_size      // ← TOTAL filled, not incremental
} else {
    event.size
}.max(0.0);

slot.active_entry_order = Some(VenueOrder {
    ...
    filled_size: fill_size,  // ← overwrites previous filled_size
    ...
});

slot.initial_size = slot.initial_size.max(fill_size);  // ← OK, uses max
slot.size = fill_size;  // ← OVERWRITES previous size with total

On a RESTING LIMIT entry that partially fills in two events:

Event 1: filled_size=0.3 → slot.size=0.3, entry_order.filled_size=0.3
Event 2: filled_size=0.7 → slot.size=0.7, entry_order.filled_size=0.7

The filled_size on the VenueOrder correctly reflects cumulative fill (0.7), and slot.size moves from 0.3 to 0.7, an increment of 0.4. That is correct as long as the venue sends cumulative filled_size rather than incremental, which is what bingx_venue._events_from_submit() does at line ~480:

filled_size = _row_float(ack_row, "executedQty", ...)

This reads executedQty which on BingX IS cumulative. So the second event's filled_size=0.7 means "total filled across all fills = 0.7." The kernel sets slot.size = 0.7 which is the total position size. This is correct.

But the second fill event has slot.entry_price overwritten by the new fill's price. If the first fill was at 0.0834 and the second at 0.0836, the slot's entry_price becomes 0.0836 — losing the blended average. For a LIMIT entry with two partial fills at different prices, the entry_price in the slot is the price of the LAST fill, not the VWAP.

Flaw: E10a — Entry price on multi-partial entry is last-fill, not VWAP. Severity: Low. Unrealized PnL computation uses this price. Error is small for tight spreads.
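For comparison, a blended entry price can be derived from cumulative fill events as sketched below. The sketch assumes each event carries the cumulative filled size and the price of that event's incremental fill, as described above; `vwap_from_cumulative` is a hypothetical helper, not kernel code.

```python
from typing import Iterable, Tuple


def vwap_from_cumulative(fills: Iterable[Tuple[float, float]]) -> float:
    """Volume-weighted average price from (cumulative_size, fill_price) events."""
    previous = 0.0
    notional = 0.0
    for cumulative, price in fills:
        increment = cumulative - previous
        if increment < 0.0:
            raise ValueError("cumulative filled size must not decrease")
        notional += increment * price
        previous = cumulative
    return notional / previous if previous > 0.0 else 0.0


if __name__ == "__main__":
    # The two partial fills from the example: 0.3 @ 0.0834, then 0.4 more @ 0.0836.
    blended = vwap_from_cumulative([(0.3, 0.0834), (0.7, 0.0836)])
    print(blended)  # lies between the two fill prices
```

With these inputs the blended price is (0.3 × 0.0834 + 0.4 × 0.0836) / 0.7 ≈ 0.083514, versus the last-fill value of 0.0836 that the slot records.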

Layer 5: Venue Adapter Boundary (bingx_venue.py)

E11: `_legacy_intent()` is a lossy conversion

bingx_venue.py:270-285

@staticmethod
def _legacy_intent(intent: KernelIntent) -> LegacyIntent:
    action = LegacyDecisionAction.ENTER if intent.action == E.ENTER else ...
    side = LegacyTradeSide.SHORT if intent.side == TS.SHORT else ...
    metadata = dict(intent.metadata)
    metadata["_order_type"] = getattr(intent, "order_type", "MARKET")
    metadata["_limit_price"] = float(getattr(intent, "limit_price", 0.0) or 0.0)
    return LegacyIntent(
        timestamp=intent.timestamp,
        trade_id=intent.trade_id,
        decision_id=intent.intent_id,
        asset=intent.asset,
        action=action,
        side=side,
        reason=intent.reason,
        target_size=float(intent.target_size),
        leverage=float(intent.leverage),
        reference_price=float(intent.reference_price),
        confidence=1.0,           # ← HARDCODED
        bars_held=0,              # ← HARDCODED
        exit_leg_ratios=tuple(intent.exit_leg_ratios or (1.0,)),
        metadata=metadata,
    )

confidence is always 1.0 and bars_held is always 0. The LegacyIntent carries these to BingxDirectExecutionAdapter.submit_intent() which ignores them (it only reads asset, side, action, target_size, leverage, and metadata). So the hardcoded values don't affect execution — but they affect the ExecutionReceipt and any downstream consumers that might read receipt.confidence.

Flaw: E11 — Lossy conversion with hardcoded metadata. Severity: Informational. No downstream consumer reads these fields.

E12: `_events_from_submit()` price fallback chain can lose venue price

bingx_venue.py:375-400 (_events_from_submit)

base_event = VenueEvent(
    ...
    price=safe_float(getattr(receipt, "price", 0.0), 0.0),
    ...
)

# ... later for fill event:
fill_price = safe_float(
    _row_float(ack_row, "avgPrice", "ap", "price", "lastFillPrice",
               default=getattr(receipt, "price", 0.0)),
    0.0
)

The fill price is read from ack_row (the HTTP response dict) first, falling back to receipt.price (the ExecutionReceipt field). The ExecutionReceipt price comes from bingx_direct.py:434:

fill_price = 0.0
for key in ("avgPrice", "avgFilledPrice", "price", "lastFillPrice", "tradePrice"):
    try: value = float(ack_row.get(key) or 0.0)
    except: value = 0.0
    if value > 0: fill_price = value; break
if fill_price <= 0 and self._state is not None:
    fill_price = next((float(...)) for ... in self._state.open_positions.values() ...)

So the price flows: BingX HTTP ack → ack_row[key] → receipt.price → _events_from_submit() → fill_price in VenueEvent.

If ack_row has no price field AND self._state.open_positions has no matching position (e.g., first fill on a new entry), fill_price stays 0.0. The kernel's apply_fill at lib.rs:1397 checks if event.price > 0.0 before setting entry_price — so a zero fill price leaves entry_price at 0.0. This means:

The slot's entry_price stays 0.0
realized_pnl() at lib.rs:662 checks if slot.entry_price <= 0.0 → returns 0.0
PnL is never computed for this fill
Capital never settles

This is very unlikely on BingX VST, which always returns avgPrice in order acknowledgements. But on any venue that doesn't, PnL is silently zeroed.

Flaw: E12 — Zero fill price → zero entry_price → zero PnL. Severity: Medium. Silent PnL loss if venue returns no price.
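One way to make the failure visible is to treat a missing price as an error at the adapter boundary instead of passing 0.0 downstream. The sketch below is illustrative only: `first_positive_price`, `require_fill_price`, and `MissingFillPrice` are hypothetical, and the key list is copied from the excerpt above.

```python
PRICE_KEYS = ("avgPrice", "avgFilledPrice", "price", "lastFillPrice", "tradePrice")


class MissingFillPrice(ValueError):
    """Raised when no positive fill price can be found."""


def first_positive_price(row: dict, keys=PRICE_KEYS, fallback: float = 0.0) -> float:
    """Return the first positive numeric value among keys, else the fallback."""
    for key in keys:
        try:
            value = float(row.get(key) or 0.0)
        except (TypeError, ValueError):
            value = 0.0
        if value > 0.0:
            return value
    return fallback if fallback > 0.0 else 0.0


def require_fill_price(row: dict, keys=PRICE_KEYS, fallback: float = 0.0) -> float:
    """Like first_positive_price, but refuses to return a zero price."""
    price = first_positive_price(row, keys, fallback)
    if price <= 0.0:
        raise MissingFillPrice(f"no positive price in keys {keys}")
    return price


if __name__ == "__main__":
    print(require_fill_price({"avgPrice": "0.0835"}))
    print(first_positive_price({"avgPrice": "", "price": "abc"}))
```

Raising is only one option; emitting the fill with an explicit "price unknown" marker and deferring settlement would also avoid silently zeroed PnL.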

E13: `_backend_snapshot()` timeout returns stale data

bingx_venue.py:290-320

def _backend_snapshot(self, *, include_history=False, timeout_ms=5000.0):
    if not self._snapshot_ready.wait(timeout=timeout_ms / 1000.0):
        with self._snap_lock:
            return self._last_snapshot  # ← STALE DATA

If the previous snapshot fetch is still in-flight when a new caller arrives, the timeout returns self._last_snapshot — which could be seconds or minutes old. The caller (e.g., submit()) then uses this stale snapshot to compute _filled_size_from_snapshots() — potentially comparing stale "before" data with fresh "after" data, producing a wrong delta.

Flaw: E13 — Stale snapshot fallback causes wrong fill-size detection. Severity: Medium. The _filled_size_from_snapshots diff can be wrong.
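A fallback snapshot becomes safer to use if it carries the time it was taken, so the caller can refuse data beyond an age limit. The following is a sketch under that assumption; `StampedSnapshot`, `stamp`, `is_usable`, and the 2-second limit are hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StampedSnapshot:
    """Snapshot payload plus the monotonic time at which it was captured."""
    data: dict
    taken_at: float


def stamp(data: dict) -> StampedSnapshot:
    """Attach the current monotonic clock reading to a snapshot payload."""
    return StampedSnapshot(data=data, taken_at=time.monotonic())


def is_usable(snapshot: Optional[StampedSnapshot],
              now: float,
              max_age_s: float = 2.0) -> bool:
    """True if a snapshot exists and is no older than max_age_s."""
    return snapshot is not None and (now - snapshot.taken_at) <= max_age_s


if __name__ == "__main__":
    snap = stamp({"positions": []})
    print(is_usable(snap, time.monotonic()))
```

A caller computing a before/after delta would skip the diff, or retry, when either side fails the age check.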

E14: `_events_from_cancel` uses stale `slot_id` from order metadata

bingx_venue.py:485-510

VenueEvent(
    ...
    slot_id=int(order.metadata.get("slot_id", 0) or 0),
    ...
)

The slot_id in the CANCEL event comes from the VenueOrder.metadata which was set when the order was created (in Rust FSM's process_intent or on_venue_event). If the slot was re-assigned or the kernel's slot count changed since order creation, this slot_id is wrong. The Rust kernel's resolve_slot() at lib.rs:610-624 would use the event's slot_id (the stale one) and find the wrong slot.

Flaw: E14 — Cancel event carries stale slot_id from order creation. Severity: Low. Slots are stable and never renumbered.

Layer 6: BingX Direct Adapter (bingx_direct.py)

E15: Submit sets leverage via separate HTTP call

bingx_direct.py:376-379

await self._client.signed_post(
    "/openApi/swap/v2/trade/leverage",
    {"symbol": symbol, "side": "BOTH", "leverage": leverage},
)

This is a POST to set exchange leverage before each order. If this call fails (rate limit, network error), the exception at line 417 sets status = "RATE_LIMITED" and returns a rejection — the order is NOT submitted. But the error handling at line 417 catches BingxHttpError for the leverage call AND the order call with the same handler. If the leverage call fails with a non-rate-limit error (e.g., 400 Bad Request for invalid symbol), the status is "REJECTED" and no order is placed. This is correct behavior — but the error message doesn't distinguish "leverage set failed" from "order submission failed."

Flaw: E15 — Leverage-set failure and order failure share error handler. Severity: Low. Correct behavior, poor diagnostics.

E16: `_format_quantity` and `_format_price` use `_instrument_step`/`_instrument_tick` — both may be zero

bingx_direct.py:234-268

def _instrument_step(self, asset):
    instrument = self._resolve_instrument(asset)
    if instrument is not None:
        try: return Decimal(str(instrument.size_increment.as_decimal()))
        except: pass
    return Decimal("0.001")  # fallback

def _format_quantity(self, asset, quantity):
    step = self._instrument_step(asset)
    if step <= 0:
        return str(max(0.0, quantity))
    ...

If _resolve_instrument returns None (asset not in provider), step=0.001 and tick=0.01. These defaults are correct for most USDT perpetuals on BingX VST, but may be wrong for non-standard symbols. The format functions still produce a valid string — just possibly with wrong precision.

More concerning: _resolve_instrument at line 211-226 tries three lookup strategies and iterates all instruments on the third. This iteration is O(n) in the number of instruments and happens on EVERY submit_intent() call. With 540 instruments, this is ~0.5ms, which is acceptable. But _instrument_step and _instrument_tick each call _resolve_instrument independently, and _instrument_venue_symbol at line 358 calls it again, so submit_intent() resolves the instrument three times (quantity, price, venue symbol). That is up to three full-instrument-list iterations per order.

Flaw: E16 — Instrument resolution called 3x per order with O(n) scan. Severity: Low. Performance, not correctness.
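Memoizing the resolution per asset removes the repeated scans. The sketch below shows the idea with a plain dictionary; `InstrumentCache` is a hypothetical wrapper, and the `resolve` callable stands in for the adapter's lookup.

```python
from typing import Any, Callable, Dict


class InstrumentCache:
    """Caches the result of an expensive per-asset lookup."""

    def __init__(self, resolve: Callable[[str], Any]) -> None:
        self._resolve = resolve
        self._cache: Dict[str, Any] = {}

    def get(self, asset: str) -> Any:
        # Membership test (not truthiness) so a None result is cached too.
        if asset not in self._cache:
            self._cache[asset] = self._resolve(asset)
        return self._cache[asset]

    def clear(self) -> None:
        """Drop cached entries, e.g. after the instrument list is refreshed."""
        self._cache.clear()


if __name__ == "__main__":
    calls = []

    def slow_resolve(asset: str) -> dict:
        calls.append(asset)  # record each real lookup
        return {"asset": asset, "step": "0.001"}

    cache = InstrumentCache(slow_resolve)
    for _ in range(3):       # quantity, price, venue symbol
        cache.get("TRXUSDT")
    print(len(calls))
```

The cache must be invalidated whenever the provider reloads instruments, otherwise a changed step or tick size would go unnoticed.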

E17: Cancel uses truth-based confirmation — can mask real errors

bingx_direct.py:474-498

still_open = True
try:
    oo = await self._client.signed_get("/openApi/swap/v2/trade/openOrders", ...)
    ...
    still_open = (venue_order_id in ids) if venue_order_id else (venue_client_id in cids)
except Exception:
    still_open = None

if still_open is False:
    return {"status": "CANCELED", ...}
if str(delete_resp.get("status", "")).upper() in {"CANCELED", "CANCELLED", "SUCCESS", "OK"}:
    return {"status": "CANCELED", ...}
return {"status": delete_resp.get("status", "REJECTED"), ...}

The cancel logic:

1. DELETE the order on BingX
2. GET open orders to verify
3. If the order is no longer open, return CANCELED
4. If the DELETE response says CANCELED, return CANCELED
5. Otherwise return REJECTED

If step 2's GET fails (network error, rate limit), still_open=None. Then step 4 checks the DELETE response. If the DELETE also returned an error (e.g., "order not found" because it was already cancelled by another caller), status is "ERROR" or "not found" — neither matches "CANCELED". The cancel is reported as REJECTED even though the order IS cancelled.

The bingx_venue._events_from_cancel() then emits CANCEL_REJECT instead of CANCEL_ACK. The Rust kernel handles CANCEL_REJECT at lib.rs:1218:

KernelEventKind::CANCEL_REJECT => {
    if slot.fsm_state == TradeStage::EXIT_WORKING {
        slot.fsm_state = TradeStage::EXIT_WORKING;  // no-op
    }
    diagnostic_code = KernelDiagnosticCode::CANCEL_REJECTED;
}

The slot stays in its current state (e.g., EXIT_WORKING) with no active order (the exchange has no record of it). The slot is stuck until a manual reconcile.

Flaw: E17 — Cancel can return false REJECTED for already-cancelled orders. Severity: Medium. Leads to stuck slot requiring manual intervention.
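The decision logic can be made explicit with a third outcome for the case where verification failed and the DELETE response is inconclusive. This is a sketch of one possible classification, not the adapter's code: `classify_cancel` and the "UNKNOWN" result are hypothetical, while the accepted status strings are taken from the excerpt above.

```python
from typing import Optional

CONFIRMING_STATUSES = {"CANCELED", "CANCELLED", "SUCCESS", "OK"}


def classify_cancel(still_open: Optional[bool], delete_status: Optional[str]) -> str:
    """Classify a cancel attempt.

    still_open: True/False from the open-orders check, None if that check failed.
    delete_status: status string from the DELETE response, if any.
    """
    if still_open is False:
        return "CANCELED"      # verified gone, regardless of DELETE text
    if str(delete_status or "").upper() in CONFIRMING_STATUSES:
        return "CANCELED"      # venue confirmed the cancel
    if still_open is None:
        return "UNKNOWN"       # could not verify: retry instead of rejecting
    return "REJECTED"          # verified still open and not confirmed


if __name__ == "__main__":
    print(classify_cancel(None, "ERROR"))
    print(classify_cancel(False, "ERROR"))
    print(classify_cancel(True, "ERROR"))
```

Mapping "UNKNOWN" to a retry of the open-orders check, rather than to CANCEL_REJECT, would avoid the stuck-slot path described above.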

Layer 7: Fill Feedback Loop (rust_backend.py on_venue_event)

E18: `on_venue_event` settles PnL incrementally — but fees are never included

rust_backend.py:530-545

incremental_pnl = slot.realized_pnl - self._last_settled_pnl.get(slot.slot_id, 0.0)
if abs(incremental_pnl) > 1e-12:
    self.account.settle(incremental_pnl)
    self._last_settled_pnl[slot.slot_id] = slot.realized_pnl

The Rust kernel's apply_fill computes realized PnL as:

let realized = Self::realized_pnl(slot, event.price, fill_size);
slot.realized_pnl += realized;

No fee subtraction. No commission reading from the event. The VenueEvent could carry fee data via metadata["fee"] or raw_payload["commission"], but the Rust kernel doesn't read it and the Python bridge doesn't extract it.

Over the 142 live test scenarios on VST (where fees are 0 or negligible), this is invisible. On live mainnet with exchange fees of 0.02-0.04%, the error grows with every fill in proportion to traded notional and has no upper bound.

Flaw: E18 — PnL settlement ignores fees. Already documented as A7. In the E2E trace, the gap is specifically here: VenueEvent.price is used for realized_pnl() but VenueEvent.metadata (which could carry commission from the venue) is never read.

Severity: Medium (grows with trade volume).
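The size of the gap is simple arithmetic. The sketch below uses the 0.02-0.04% fee range quoted above; the notional per fill and the number of fills are made-up inputs, and `fee_drag` and `settle_net` are hypothetical helpers.

```python
def fee_drag(notional_per_fill: float, fee_rate: float, fills: int) -> float:
    """Total fees paid, i.e. how far fee-blind capital drifts from reality."""
    return notional_per_fill * fee_rate * fills


def settle_net(gross_pnl: float, fee: float) -> float:
    """PnL to settle once the fee for the fill is subtracted."""
    return gross_pnl - fee


if __name__ == "__main__":
    # Hypothetical: 1,000 round trips (2,000 fills) at 1,000 USDT notional each.
    low = fee_drag(1000.0, 0.0002, 2000)    # 0.02% per fill
    high = fee_drag(1000.0, 0.0004, 2000)   # 0.04% per fill
    print(round(low, 6), round(high, 6))
```

Under these assumed inputs the tracked capital would overstate the exchange balance by roughly 400 to 800 USDT.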

E19: `observe_slots` called with ALL slots, not just changed ones

rust_backend.py:538-545

slots = [self._get_slot(i) for i in range(self.max_slots)]
self.account.observe_slots(slots)

Every on_venue_event call re-reads ALL slots from the Rust kernel (N FFI calls) and calls observe_slots with the full list. With max_slots=10, this is 10 FFI round-trips per venue event. Each round-trip serializes a TradeSlot to JSON, passes through C FFI, parses on the Rust side, serializes the result, passes back, and parses on the Python side. For a multi-leg EXIT that produces 3 venue events (ACK + PARTIAL + FULL), that's 3 × 10 = 30 slot reads per process_intent call.

Flaw: E19 — Full-slot-list read on every event is N×FFI overhead. Severity: Low (performance). Not a correctness issue.

Layer 8: Persistence Boundary (pink_clickhouse.py)

E20: `_capital()` reads live from `AccountProjection` — stale row risk

pink_clickhouse.py:199-200

def _capital(self) -> float:
    return float(self.account.snapshot.capital or 0.0)

Every row writer calls _capital() at write time to get the current capital. But persist_result() is called AFTER kernel.process_intent() returns — at which point the account has already been settled. The account_events, position_state, and trade_events rows all record the SAME capital value (the post-settle value). capital_before is then reconstructed by subtracting PnL (already documented as A5).

The effect: all ClickHouse rows for a single process_intent() call show identical capital / account_capital / portfolio_capital values, because they're all written within the same Python call stack with no intervening events. This is correct for single-threaded operation — all rows reflect POST-trade state. But it means ClickHouse querying for "capital before trade" must use capital_after - pnl, which is the wrong formula under multi-slot.

Flaw: E20 — All persistence rows write post-trade capital, not pre-trade. Already documented as A5 from the capital_before angle.

Severity: High for multi-slot accounting reconstruction.

E21: `persist_fill_events()` synthesizes fake Decision/Intent

pink_clickhouse.py:383-435

def persist_fill_events(self, *, snapshot, events, slot_dict, market_state):
    ...
    decision = Decision(
        timestamp=ts, decision_id=trade_id or "async", asset=asset,
        action=action, side=side, reason="ASYNC_FILL",
        confidence=0.0, velocity_divergence=0.0, irp_alignment=0.0,
        reference_price=price, target_size=cur_size, leverage=leverage,
        ...
    )
    intent = Intent(
        timestamp=ts, trade_id=trade_id, decision_id=trade_id or "async",
        ...
    )

The async fill pump (called by pump_venue_events) constructs fake Decision/Intent objects because there's no real policy decision backing an async fill — it just arrived from the exchange. These synthetic objects have:

decision_id = trade_id (or "async" if trade_id is empty)
decision_id and trade_id are the same string
confidence=0.0, velocity_divergence=0.0, irp_alignment=0.0
target_size = cur_size (the remaining size after the fill, not the size that was filled)

These are written to policy_events, trade_reconstruction, and trade_events with the same row shapes as real policy-driven fills. Any ClickHouse query that joins policy_events to trade_events on decision_id will find matching rows (both set to trade_id), but the policy_events row's target_size is the POST-fill size, not the pre-fill size. A replay system that reconstructs position from policy_events → trade_reconstruction would see incorrect sizing.

Flaw: E21 — Async fill persistence uses synthetic decision with wrong data. Severity: Medium. Misleading historical records.

E22: `_write_trade_exit_leg` capital_before uses arithmetic reconstruction

pink_clickhouse.py:761-762

capital_after = self._capital()
capital_before = capital_after - pnl_leg

Already documented as A5. In the E2E trace, the specific path is:

Slot 0 exit leg fills → _capital() returns capital AFTER settlement (because the kernel's on_venue_event already called account.settle)
capital_before = capital_after - pnl_leg reconstructs pre-leg capital

If slot 1 also settled between the leg fill and the persistence write (possible in a multi-threaded or concurrent scenario), capital_after includes slot 1's PnL, and capital_before is wrong by exactly slot 1's contribution.

Severity: High for multi-slot.
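The size of the error can be worked through with made-up numbers. The sketch below is illustrative only: the capital and PnL figures are assumptions, and `reconstruct_capital_before` is a hypothetical stand-in for the subtraction at pink_clickhouse.py:761-762.

```python
def reconstruct_capital_before(capital_after: float, pnl_leg: float) -> float:
    """Mirror the arithmetic reconstruction: capital_after - pnl_leg."""
    return capital_after - pnl_leg

# Hypothetical figures: the account starts at 25,000 and slot 0's leg realizes +100.
true_capital_before = 25_000.0
pnl_slot0 = 100.0
pnl_slot1 = 50.0  # settles between slot 0's fill and the persistence write

# Single-slot case: capital is read right after slot 0 settles.
capital_read_single = true_capital_before + pnl_slot0
single = reconstruct_capital_before(capital_read_single, pnl_slot0)

# Multi-slot case: slot 1 has also settled by the time capital is read.
capital_read_multi = true_capital_before + pnl_slot0 + pnl_slot1
multi = reconstruct_capital_before(capital_read_multi, pnl_slot0)

error = multi - true_capital_before  # equals slot 1's contribution
```

In the single-slot case the reconstruction recovers the true value; in the multi-slot case it is off by slot 1's PnL, which is why capturing capital before settlement would be more robust than subtracting afterwards.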

E23: `_write_trade_event` uses `slot_dict.get("entry_price")` as exit_price

pink_clickhouse.py:813-815

entry_price = _safe_float(slot_dict.get("entry_price", 0.0), ...)
exit_price = _safe_float(slot_dict.get("entry_price", 0.0), ...)  # ← SAME FIELD

Already documented as A13. The exit_price is set to entry_price from the same slot dict field. The BingX ack payload does contain the fill price, but it's not propagated to the slot dict's entry_price for exit fills — the slot's entry_price is set during entry fill and remains unchanged during exit. The exit fill price is only on the VenueEvent, which is not passed through to _write_trade_event.

The trade_events row in ClickHouse always shows exit_price == entry_price, making PnL reconstruction from (exit_price - entry_price) × size × lev impossible. The pnl field IS correct (it's slot.realized_pnl), but only the summary is accurate — the component prices are wrong.

Severity: Low. pnl is correct, only the decomposed price is wrong.

Layer 9: Test Infrastructure

E24: `MockVenueAdapter.submit()` always emits fill on `partial_fill_ratio > 0`

mock_venue.py:60-90

if self.scenario.emit_fill_on_submit or self.scenario.partial_fill_ratio > 0:
    fill_ratio = max(0.0, min(1.0, float(effective_ratio)))
    ...
    if is_entry:
        effective_ratio = self.scenario.entry_partial_fill_ratio if \
            self.scenario.entry_partial_fill_ratio != 1.0 else \
            self.scenario.partial_fill_ratio
    else:
        effective_ratio = self.scenario.exit_partial_fill_ratio ...

The default MockVenueScenario() has partial_fill_ratio=1.0. So every submit() call on a default mock emits a FULL_FILL event immediately. This means mock-venue tests always test the "order fills instantly" path — they never test resting orders, partial fills, or async fills.

Any test that relies on the mock venue is testing a subset of real venue behavior. The mock never produces:

DELAYED fills (fill arrives on a later reconcile() call)
PARTIAL fills with subsequent fills
Partial fills during entry (entry fills partially, then more later)
Mixed entry/exit partial behavior

Flaw: E24 — Mock venue always fills synchronously — never tests async path. Severity: Medium. The pump_venue_events() path has never been exercised with the mock venue.
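One way to cover the missing cases is a mock that acknowledges on submit and releases fills later. The following is a minimal sketch, not the project's mock: `FakeEvent` and `DelayedFillMock` are hypothetical names, and the event kinds are assumed to match the ACK / PARTIAL_FILL / FULL_FILL vocabulary used in this analysis.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FakeEvent:
    """Hypothetical stand-in for VenueEvent; only the fields used here."""
    kind: str            # "ACK", "PARTIAL_FILL" or "FULL_FILL"
    order_id: str
    filled_size: float = 0.0

@dataclass
class DelayedFillMock:
    """Mock venue whose submit() only ACKs; fills arrive on reconcile()."""
    fill_schedule: List[float]  # fraction of the order filled per reconcile()
    _pending: List[FakeEvent] = field(default_factory=list)

    def submit(self, order_id: str, size: float) -> List[FakeEvent]:
        last_index = len(self.fill_schedule) - 1
        for i, ratio in enumerate(self.fill_schedule):
            kind = "FULL_FILL" if i == last_index else "PARTIAL_FILL"
            self._pending.append(FakeEvent(kind, order_id, size * ratio))
        return [FakeEvent("ACK", order_id)]  # no fill on submit

    def reconcile(self) -> List[FakeEvent]:
        """Release at most one queued fill per call, oldest first."""
        return [self._pending.pop(0)] if self._pending else []
```

With `fill_schedule=[0.25, 0.75]`, a test can assert that the slot stays in a working state after submit, moves on the first reconcile, and completes on the second.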

E25: Test scenarios use MARKET-only `_si()` helper — no LIMIT tests

gen_live_tests.py and _gen_test.py

The _si() helper constructs a KernelIntent with order_type="MARKET" and limit_price=0.0 (the defaults). All 157 live test scenarios use _si(). The 3 "LIMIT" scenarios (limit_does_not_fill, limit_immediate_fill) use reference_price=0.0 and target_size=-0.001 respectively — they test intent validation, not actual LIMIT order submission.

There is zero live-test coverage of:

Submitting a LIMIT order that rests on the book
A resting LIMIT being cancelled
A resting LIMIT receiving a partial fill then a subsequent fill
An async fill arriving via pump_venue_events()

The Rust kernel's PARTIAL_FILL event handling and the Python bridge's on_venue_event + incremental settle + async pump have never been exercised on a live exchange.

Flaw: E25 — Zero live tests for LIMIT/resting/async-fill paths. Severity: High. The partial-fill code path is untested in production.

E26: Fresh-kernel reconcile does not restore venue state

gen_live_tests.py (fresh_kernel_reconcile_entry body)

fresh = _build_fresh_kernel_from_slot(slot_data, ic=cb)
k2 = fresh.runtime.kernel

The _build_fresh_kernel_from_slot function creates a new PinkDirectRuntime with a new ExecutionKernel. But the venue adapter is shared or re-created with the same BingX backend. Two kernels making concurrent HTTP calls to BingX through shared or separate venue adapters is exactly the multi-threaded scenario that triggers T1 (Rust kernel UB) — except the tests are sequential, not concurrent, so they don't trigger it.

The fresh kernel does NOT restore the venue state (open orders, positions). The fresh kernel has a blank venue adapter state — it can't know about previous LIMIT orders resting on the exchange. This is correct for MARKET-only tests (no resting orders) but would fail for LIMIT tests.

Flaw: E26 — Fresh-kernel reconcile doesn't restore venue state. Severity: Medium (would break LIMIT scenarios).

Summary: Critical E2E Flaw Chain

The most dangerous E2E scenario is a LIMIT order with partial fills on a live exchange:

1. Policy emits LIMIT ENTER                       [E3: can't happen — bridge drops order_type]
2. KernelIntent with order_type="LIMIT"            [dead code path from step 1]
3. bingx_direct.submit_intent builds LIMIT payload [works if reached]
4. BingX accepts LIMIT, returns ACK with no fill   [VenueEvent.price may be 0]
5. FSM transitions to ENTRY_WORKING                [correct]
6. RESTING LIMIT sits on book                      [no further kernel events]
7. Next policy cycle: pump_venue_events()           [E1: expensive HTTP calls]
8. Reconciled venue has no fill events              [nothing to drain]
9. Repeated cycles with no progress                 [wasteful but safe]
10. Eventually BingX fills partially               [VenueEvent arrives]
11. apply_fill PARTIAL_FILL entry branch runs       [E10: entry_price = last fill, not VWAP]
12. on_venue_event settles incremental PnL          [E18: fees not included]
13. persistence writes                              [E20/E21/E22/E23: wrong capital_before, exit_price]
14. Remaining LIMIT still rests on book             [continues to step 7]
15. Eventually full fill or cancel                  [E17: cancel can return false REJECTED]

None of steps 4-15 have live test coverage.

Complete Flaw Catalog (All Layers)

#	Flaw	Layer	Step	Severity
E1	Unconditional pump_venue_events wastes rate limit	Runtime	R2	Medium
E2	TOCTOU between capital snapshot and intent	Runtime	R3→R8	Medium
E3	Runtime bridge drops order_type/limit_price	Bridging	R7	Medium
E4	TOCTOU between exit sizing and execution	Runtime	R8	Low
E5	JSON precision drift over long runs	Bridge	R8a→R8c	Low
E6	Global FFI singleton no guard vs use-after-free	Bridge	R8b	High
E7	Same-trade-id re-entry leaves stale index entries	Rust	R8c	Low
E8	EXIT uses initial_size not remaining size	Rust	R8c	High
E9	CANCEL "accepted" before cancel actually happens	Rust	R8c	Medium
E10	Entry price on multi-partial fill = last fill, not VWAP	Rust	R10a	Low
E11	_legacy_intent hardcodes confidence/bars_held	Venue	R9a	Info
E12	Zero fill price → zero PnL	Venue	R9c	Medium
E13	Stale snapshot fallback causes wrong fill delta	Venue	R9c	Medium
E14	Cancel event carries stale slot_id	Venue	R9c	Low
E15	Leverage-set failure and order failure share handler	Adapter	R9b	Low
E16	Instrument resolution 3x per order, O(n) scan	Adapter	R9b	Low
E17	Cancel returns false REJECTED for already-cancelled	Adapter	R9b	Medium
E18	PnL settlement ignores fees	Bridge	R10b	Medium
E19	Full-slot-list read on every event = N×FFI overhead	Bridge	R10b	Low
E20	All persistence rows write post-trade capital	Persistence	R12	High
E21	Async fill uses synthetic Decision with wrong size	Persistence	R12	Medium
E22	capital_before arithmetic reconstruction wrong	Persistence	R12	High
E23	trade_events exit_price = entry_price	Persistence	R12	Low
E24	Mock venue always fills synchronously	Test	—	Medium
E25	Zero live tests for LIMIT/async-fill paths	Test	—	High
E26	Fresh-kernel reconcile doesn't restore venue	Test	—	Medium

Total: 26 E2E flaws (5 High, 11 Medium, 9 Low, 1 Info)

The five High-severity flaws in the E2E trace (E20 and E22 are listed together):

E6: Global FFI singleton + __del__ use-after-free — memory corruption risk
E8: Exit-size overshoot — slot can get stuck (A1)
E20/E22: Post-trade capital in all persistence rows + arithmetic capital_before — ClickHouse records are misleading for accounting
E25: No LIMIT/async-fill test coverage — partial-fill path is production code with zero live validation

PASS 3 — NEW FINDINGS (Deepest E2E Trace)

F1: `process_intent` CANCEL returns "accepted" before the cancel happens — caller gets wrong `outcome.state`

File: rust_backend.py:595-614

The CANCEL path:

Calls self.venue.cancel(order) → HTTP DELETE → returns VenueEvent[]
For each event, calls self.on_venue_event(event) → Rust FSM transition
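Assembles final_outcome from two sources: accepted and diagnostic_code come from the Rust kernel's pre-venue-event outcome, while state comes from the slot as re-read after the venue events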

outcome = _outcome_from_payload(result["outcome"])  # Rust CANCEL accepts (slot NOT mutated yet)
# ... venue.cancel() ...
# ... on_venue_event() for each event (now slot IS mutated) ...
final_slot = self._get_slot(outcome.slot_id)         # Re-reads post-mutation state
final_outcome = KernelOutcome(
    accepted=outcome.accepted,        # TRUE — from Rust's pre-event accept
    state=final_slot.fsm_state,       # IDLE — from post-event state
    diagnostic_code=outcome.diagnostic_code,  # "OK" — from Rust's pre-event accept
)

For ENTER/EXIT, the same pattern exists — the Rust kernel's outcome is pre-venue. But for CANCEL the disconnect is worst: Rust returns accepted=true with the slot still in ENTRY_WORKING, and only the subsequent on_venue_event(CANCEL_ACK) transitions to IDLE.

Fix: The diagnostic code should be reconciled with the actual venue outcome, not taken from the pre-venue Rust outcome.
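A reconciliation step along these lines could be sketched as follows. This is an assumption-laden illustration: `reconcile_cancel_outcome` and the diagnostic strings other than "OK" are hypothetical, and the event kinds are taken from the CANCEL_ACK / CANCEL_REJECT names used elsewhere in this analysis.

```python
from typing import List, Tuple

def reconcile_cancel_outcome(pre_accepted: bool,
                             venue_event_kinds: List[str]) -> Tuple[bool, str]:
    """Derive (accepted, diagnostic_code) from what the venue reported."""
    if not pre_accepted:
        return False, "KERNEL_REJECTED"        # kernel refused before the venue call
    if "CANCEL_ACK" in venue_event_kinds:
        return True, "OK"                      # venue confirmed the cancel
    if "CANCEL_REJECT" in venue_event_kinds:
        return False, "VENUE_CANCEL_REJECTED"  # venue refused the cancel
    return False, "VENUE_NO_RESPONSE"          # nothing came back at all
```

The point of the sketch is only that accepted and diagnostic_code are computed after the venue events are known, so they cannot disagree with the post-event state.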

Severity: Medium

F2: `_last_settled_pnl` reset before `venue.submit()` — transient window

File: rust_backend.py:597-604

if intent.action == KernelCommandType.ENTER and outcome.accepted:
    self._last_settled_pnl[intent.slot_id] = 0.0   # reset HERE
# ... venue.submit() called below ...

If venue.submit() fails (HTTP error, rate limit), the ENTER was accepted by the Rust FSM but no venue order was placed. The slot is stuck in ORDER_REQUESTED. If the caller retries the same ENTER, _last_settled_pnl is 0.0 from the first attempt — correct for a new trade.

Real risk: If the previous trade on this slot had realized PnL that was never settled (impossible with incremental settle, but hypothetically), resetting to 0.0 loses that PnL. In practice, incremental settle makes this safe.

Severity: Medium (retry-safe, but exposes slot-stall)

F3: `_first_invalid_intent_field` allows `leverage=0` and `target_size=0`

File: rust_backend.py:295-316

The guard catches NaN/Inf and negative target_size. Does NOT catch:

leverage=0 or negative (Rust silently falls back to 1.0)
target_size=0 (submits zero-quantity order to BingX)
reference_price=0 (mark_price ignores non-positive)
limit_price=0 with order_type="LIMIT" (BingX rejects price=0)

The zero-target-size case: a direct process_intent(EXIT, target_size=0.0) computes exit_size = 0, submits MARKET order with quantity=0 to BingX, which may return an error or silent no-op.

Severity: Low (runtime's _exit_intent_from_slot prevents for EXIT; direct kernel API users can trigger it)
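A stricter guard might look like the sketch below. `IntentFields` is a hypothetical subset of KernelIntent and `first_invalid_field` is not the project's `_first_invalid_intent_field`. Treating a zero target_size as invalid for ENTER and EXIT is an assumption, and reference_price is only checked for finiteness because the kernel is described above as ignoring non-positive values.

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class IntentFields:
    """Hypothetical subset of KernelIntent used for this sketch."""
    action: str = "ENTER"
    order_type: str = "MARKET"
    target_size: float = 0.0
    leverage: float = 1.0
    reference_price: float = 0.0
    limit_price: float = 0.0

def first_invalid_field(intent: IntentFields) -> Optional[str]:
    """Return the name of the first invalid field, or None if all pass."""
    for name in ("target_size", "leverage", "reference_price", "limit_price"):
        if not math.isfinite(getattr(intent, name)):
            return name                       # NaN / Inf anywhere is rejected
    if intent.action in ("ENTER", "EXIT") and intent.target_size <= 0.0:
        return "target_size"                  # zero-quantity orders never reach the venue
    if intent.leverage <= 0.0:
        return "leverage"                     # no silent fallback to 1.0
    if intent.order_type == "LIMIT" and intent.limit_price <= 0.0:
        return "limit_price"                  # LIMIT needs a positive price
    return None
```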

F4: `outcome.emitted_events` only contains venue events — Rust kernel's events silently dropped

File: rust_backend.py:641-652

final_outcome = KernelOutcome(
    emitted_events=tuple(emitted_events),  # only from venue.submit()
)

The Rust kernel's KernelOutcome struct has emitted_events — currently always empty because the Rust FSM never sets it. If a future change adds Rust-side event emission, those events are silently dropped: final_outcome only uses the Python-side list.

Severity: Low (no Rust-emitted events exist today)

F5: `on_venue_event` does redundant FFI read of slot already returned by Rust

File: rust_backend.py:698-706

def on_venue_event(self, event):
    result = _get_rust().on_venue_event(...)
    outcome = _outcome_from_payload(result["outcome"])
    slot_payload = result.get("slot")
    slot = _slot_from_payload(slot_payload) if slot_payload else self._get_slot(...)
    # ...
    current = self._get_slot(slot.slot_id)  # REDUNDANT — slot already has this data!
    self.projection.write_slot(current)

Line 706 re-reads current from the backend even though slot (from the Rust result) already has the exact same data. Each redundant FFI read is JSON serialize → C FFI → Rust serialize → C FFI → Python parse — ~100μs. With 2-3 events per process_intent and 10 slots, ~3ms wasted per cycle.

Severity: Low (performance)

F6: `_record_transitions` in `process_intent` records pre-venue transitions with `event=None`

File: rust_backend.py:708, 650

# process_intent line 650:
self._record_transitions(outcome.transitions, final_slot, None)  # event=None

# on_venue_event line 708:
self._record_transitions(outcome.transitions, slot, event)  # event attached

Venue-event transitions ARE recorded individually inside each on_venue_event call (line 708). The journal has all transitions. But the pre-venue transitions (from Rust FSM before venue call) have event=None attached — no event context for the journal reader.

Severity: Informational (diagnostic inconvenience only)

F7: `reconcile_from_slots` writes ALL slots to projection/zinc, not just reconciled ones

File: rust_backend.py:718-733

for current in slots:          # iterates ALL max_slots
    self.projection.write_slot(current)   # writes unchanged slots too
    self.zinc_plane.write_slot(current)

After reconcile, ALL slots are written to projection and Zinc, even if the reconcile only modified one slot. Slots 1-9 are serialized and written with their unchanged state. Wasteful but harmless.

Also: Rust kernel's reconcile_slots_json silently ignores slot_id out of range — no error returned. Caller sees accepted=true even if no slots were reconciled.

Severity: Low

F8: `HazelcastRowWriter.put()` is synchronous with no error handling — Hazelcast failure crashes the intent

File: hazelcast_projection.py:30-48

class HazelcastRowWriter:
    def __call__(self, name, row):
        if name.endswith("trade_events"):
            self.client.get_topic(name).publish(json.dumps(row, ...))
            return
        self.client.get_map(name).put(key, json_safe(row))  # synchronous, no try/except

No try/except. Hazelcast put() is synchronous — blocks until the cluster acknowledges. If Hazelcast is down, under load, or partitioned, this:

Blocks the calling thread (which holds the Rust kernel handle — no other operation can proceed)
Raises an exception that propagates through _set_slot() → process_intent() → crashes the entire intent

Severity: Medium (Hazelcast failure in hot path stalls execution)

F9: `RealZincPlane.write_slot()` serializes ALL slots, not just the changed one

File: real_zinc_plane.py:205-212

def write_slot(self, slot):
    with self._lock:
        self._slot_cache[int(slot.slot_id)] = slot
        payload = {"slots": [self._slot_cache[key].to_dict() for key in range(self._slot_count)]}
        self._write_region(self.state_region, self._state_seq, payload)

Every single-slot write serializes ALL slot_count slots (default 10) to JSON. With VenueOrder metadata, each slot payload can be ~1-5KB → 10-50KB per write. This is written to Zinc shared memory on every process_intent() and on_venue_event() call.

InMemoryZincPlane does NOT have this problem — it only stores the one slot.

Severity: Low (performance + Zinc shared-memory capacity waste)

F10: `RealZincPlane.write_slot` zeros buffer before write — concurrent read sees empty data

File: real_zinc_plane.py:255-263

def _write_region(self, region, seq, payload):
    buf = region.as_buffer()
    view = memoryview(buf)
    view[:] = b"\x00" * len(view)     # Zeros the buffer
    view[: len(packet)] = packet       # Writes packet
    region.notify()

Between the zero and the write, any concurrent reader sees zeros or a truncated packet. _decode_packet checks size <= len(buf) - 16 — a partially-written packet fails validation and returns {}. The reader (e.g., another thread calling read_slots()) gets an empty result.

Window is microseconds but it exists. No version guard — reader always returns whatever is in the region.

Severity: Low (brief window, no corruption — just empty results)
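A version guard can be sketched as a seqlock: the writer makes the sequence number odd while it writes and even when it is done, and the reader rejects anything it observed mid-write. The code below is a single-process illustration on a bytearray, not the Zinc implementation. The 16-byte header layout is an assumption, and a real shared-memory version would also need the memory-ordering guarantees that plain Python does not provide.

```python
import json
import struct
from typing import Optional

HEADER = struct.Struct("<QQ")  # (sequence, body length); assumed layout

def write_region(buf: bytearray, payload: dict) -> None:
    """Seqlock-style write: odd sequence while writing, even when stable."""
    seq, _ = HEADER.unpack_from(buf, 0)
    body = json.dumps(payload).encode("utf-8")
    if len(body) > len(buf) - HEADER.size:
        raise ValueError("payload does not fit in region")
    writing = seq | 1                                 # next odd value
    HEADER.pack_into(buf, 0, writing, 0)              # mark write in progress
    buf[HEADER.size:HEADER.size + len(body)] = body   # no zeroing; old tail is ignored
    HEADER.pack_into(buf, 0, writing + 1, len(body))  # publish length last

def read_region(buf: bytearray) -> Optional[dict]:
    """Return the payload, or None if a write was in progress."""
    seq_before, size = HEADER.unpack_from(buf, 0)
    if seq_before % 2 == 1:
        return None                                   # writer is mid-update
    body = bytes(buf[HEADER.size:HEADER.size + size])
    seq_after, _ = HEADER.unpack_from(buf, 0)
    if seq_after != seq_before:
        return None                                   # torn read, caller retries
    return json.loads(body) if size else {}
```

A reader that gets None can retry instead of treating an empty dict as "no slots", which removes the ambiguity between "empty region" and "write in progress".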

F11: `RealZincPlane._write_region` has no partial-write recovery

File: real_zinc_plane.py:255-263

If _encode_packet raises (JSON serialization error), the method raises before writing — region retains previous content. Safe.

If view[:] = b"\x00" fails (memory error), the region is partially zeroed. Not recoverable. No fallback.

Severity: Low (memory errors are extremely rare)

F12: `InMemoryZincPlane` intent_region grows without bound

File: zinc_plane.py:83-85

def publish_intent(self, intent):
    self.intent_region.append(intent)   # unbounded growth

self.intent_region is List[KernelIntent] — grows on every publish_intent call. Over thousands of policy cycles, this grows without bound.

RealZincPlane.publish_intent() limits to last 512 entries in shared memory, but its self._intent_cache (in-memory) also grows without bound.

Severity: Low (memory leak — ~MB/day)
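Bounding the in-memory list is a small change. The sketch below uses `collections.deque` with `maxlen`; `BoundedIntentLog` is a hypothetical name, and reusing 512 as the cap simply mirrors the shared-memory limit mentioned above.

```python
from collections import deque

MAX_INTENTS = 512  # assumed cap, mirroring the shared-memory limit

class BoundedIntentLog:
    """Hypothetical replacement for the unbounded intent list."""

    def __init__(self, maxlen: int = MAX_INTENTS) -> None:
        self.intent_region = deque(maxlen=maxlen)

    def publish_intent(self, intent) -> None:
        # deque discards the oldest entry automatically once maxlen is reached
        self.intent_region.append(intent)
```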

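F13: `InMemoryZincPlane` `threading.Condition` is re-entrant only while it keeps its default lock

File: zinc_plane.py:41-43

_signal: threading.Condition = field(default_factory=threading.Condition)

threading.Condition() constructed without arguments wraps a threading.RLock, so with this default factory a same-thread call back into publish_intent while the condition's lock is held does not deadlock. The re-entrancy disappears if the field is ever built around a plain threading.Lock (for example threading.Condition(threading.Lock())): any such callback would then deadlock.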

Severity: Low (no current code path triggers this, but it's a landmine)
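Whether re-entry deadlocks depends on the lock that the Condition wraps, and this can be checked directly against the standard library. The snippet below is a standalone check, not project code. It uses a non-blocking acquire so that the non-re-entrant case reports False instead of hanging.

```python
import threading

default_cond = threading.Condition()               # no argument: wraps an RLock
lock_cond = threading.Condition(threading.Lock())  # explicit plain Lock

with default_cond:
    # Same thread asks for the lock again without blocking.
    default_reentered = default_cond.acquire(blocking=False)
    if default_reentered:
        default_cond.release()

with lock_cond:
    lock_reentered = lock_cond.acquire(blocking=False)
    if lock_reentered:
        lock_cond.release()
```

Here `default_reentered` is True and `lock_reentered` is False, so the risk depends on how the Condition is constructed rather than on threading.Condition itself.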

F14: `KernelSlotView.__setattr__` round-trips unknown fields through Rust, which silently drops them

File: rust_backend.py:370-395

If a new field is added to Python's TradeSlot that Rust's TradeSlot doesn't know about, slot.to_dict() includes it. _set_slot serializes to JSON and sends it to Rust, which deserializes with #[serde(default)]. serde ignores unknown fields unless #[serde(deny_unknown_fields)] is set, so the extra field is silently dropped and the round-trip loses data without warning.

The reverse: if Rust adds a field that Python doesn't know about, _slot_from_payload ignores unknown keys. Also silently dropped.

Severity: Low (fields must be added to both sides atomically; no guard)

F15: `on_venue_event` loop in `process_intent` stops on first exception — slot left in partial state

File: rust_backend.py:599-610

for event in emitted_events:
    evt_outcome = self.on_venue_event(event)  # NO TRY/EXCEPT

If self.on_venue_event(event) raises (FFI error, null pointer, OOM), the loop stops. Events after the failing event are never processed. The slot is in a partial state — some events applied, some not.

Concrete scenario: ACK arrives first → applied. FULL_FILL arrives second → FFI error, exception raised. Slot is stuck in ENTRY_WORKING with size=0. Next process_intent(EXIT) returns NO_OPEN_POSITION. No recovery path exists.

Severity: High — single exception during fill feedback leaves slot unrecoverable. Zero defense in depth.
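A first layer of defense is to catch the failure and hand the unapplied tail back to the caller, so the slot can be flagged for reconcile instead of being left silently half-applied. The sketch below is hypothetical: `apply_events` is not an existing function, and events are plain strings for brevity.

```python
from typing import Callable, List, Optional, Tuple

def apply_events(
    events: List[str],
    handler: Callable[[str], None],
) -> Tuple[List[str], List[str], Optional[str]]:
    """Apply events in order; on failure return the unapplied tail and the error."""
    applied: List[str] = []
    for index, event in enumerate(events):
        try:
            handler(event)
        except Exception as exc:  # broad on purpose: FFI failures vary in type
            return applied, list(events[index:]), repr(exc)
        applied.append(event)
    return applied, [], None
```

Processing stops at the first failure to preserve event order, but the caller now knows exactly which events are pending and can retry them or mark the slot as needing reconcile.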

F16: `venue.submit()` returning empty events leaves slot in `ORDER_REQUESTED`

File: rust_backend.py:599-610

If venue.submit() returns [] (venue rejected order with no response, or internal error), the for loop doesn't run. No on_venue_event is called. Slot stays in Rust's pre-venue state (ORDER_REQUESTED).

final_outcome has accepted=true, state=ORDER_REQUESTED, emitted_events=[]. Caller sees "successful" but no exchange order exists. Slot stuck in ORDER_REQUESTED until pump_venue_events() or manual reconcile.

Severity: Medium — silent slot stall with no error indication.

F17: Cancel truth-based confirmation returns `REJECTED` for already-cancelled orders on GET failure

File: bingx_direct.py:474-498

try:
    oo = await self._client.signed_get("/openApi/swap/v2/trade/openOrders", ...)
    still_open = (venue_order_id in ids)
except Exception:
    still_open = None  # GET failed

if still_open is False:
    return {"status": "CANCELED", ...}
# still_open is None (GET failed) or True (order still on book)
# Falls through to DELETE response check

If the DELETE succeeded but the verification GET failed (network blip, rate limit on the verification endpoint), still_open=None. The code then checks the DELETE response. If the DELETE returned an ambiguous error (e.g., "order not found" because it was already cancelled by another path), the status is "ERROR" — reported as REJECTED even though the order IS cancelled.

The bingx_venue._events_from_cancel() emits CANCEL_REJECT. The Rust FSM handles CANCEL_REJECT as a no-op — slot stays in EXIT_WORKING with no active order. Stuck until pump_venue_events() or manual reconcile.

Severity: Medium. The cancel result needs three states instead of two: "definitely cancelled," "probably cancelled," and "definitely not cancelled."
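The three outcomes could be modelled explicitly, as in the sketch below. `CancelResult` and `classify_cancel` are hypothetical names. Treating a successful DELETE as sufficient when the verification GET fails is an assumption about BingX behaviour, not something the source confirms.

```python
from enum import Enum
from typing import Optional

class CancelResult(Enum):
    CANCELLED = "definitely cancelled"
    UNKNOWN = "probably cancelled"
    STILL_OPEN = "definitely not cancelled"

def classify_cancel(still_open: Optional[bool], delete_ok: bool) -> CancelResult:
    """Combine the verification GET (still_open) with the DELETE response."""
    if still_open is False:
        return CancelResult.CANCELLED   # order absent from openOrders
    if still_open is True:
        return CancelResult.STILL_OPEN  # order confirmed on the book
    # Verification GET failed: rely on DELETE, but never report a hard reject.
    return CancelResult.CANCELLED if delete_ok else CancelResult.UNKNOWN
```

An UNKNOWN result would map to "schedule a reconcile" rather than CANCEL_REJECT, so the slot is not left in EXIT_WORKING on the strength of an ambiguous error.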

F18: Leverage-set failure and order-submission failure share one handler

File: bingx_direct.py:376-417

await self._client.signed_post("/openApi/swap/v2/trade/leverage", ...)  # step A
# ...
ack_payload = await self._client.signed_post("/openApi/swap/v2/trade/order", payload)  # step B

If step A fails (400 for invalid symbol), the exception handler at line 417 catches BingxHttpError and returns REJECTED. No way for the caller to know whether the leverage set failed or the order submission failed — both go through the same handler. The error message just says "REJECTED."

Also: if step A succeeds and step B fails, leverage was changed on the exchange but no order was placed. System state unchanged (leverage changes don't affect capital), but diagnostics are poor.

Severity: Low (correct behavior, poor diagnostics)

F19: `_events_from_submit` stale snapshot fallback → wrong fill detection

File: bingx_venue.py:375-400

_filled_size_from_snapshots() diffs position quantity before and after submit. The "before" snapshot comes from _backend_snapshot() which can return stale data (E13). A stale "before" against a fresh "after" produces a wrong diff — could be negative, zero, or larger than reality.

This wrong diff propagates to emitted_events — the PARTIAL_FILL or FULL_FILL event has wrong filled_size. The Rust kernel's apply_fill uses this wrong filled_size to set slot.size. Capital settles on the wrong delta.

Severity: Medium — wrong fill size propagates to kernel state and PnL.

F20: `__del__` frees Rust handle at unpredictable GC time, with no explicit `close()`

File: rust_backend.py:558-566

def __del__(self):
    backend = getattr(self, "_backend", None)
    if backend is not None:
        try: _get_rust().destroy(backend)
        except: pass

ExecutionKernel has no close() method. The Rust KernelHandle is only freed by __del__, which CPython runs at an unpredictable time, on whichever thread drops the last reference or triggers a garbage-collection pass (there is no dedicated GC thread). If any code holds a stale reference to self._backend, the pointer dangles when the kernel is GC'd.

DITAv2LauncherBundle.close() calls _maybe_close on venue, zinc, and control plane — but NOT on kernel (which has no close() or disconnect()). The kernel is leaked until GC.

Severity: Medium — reliance on __del__ for critical C resource cleanup.
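Deterministic cleanup usually means an idempotent close() plus context-manager support, with __del__ kept only as a fallback. The sketch below is a hypothetical wrapper, not ExecutionKernel itself; the `destroy` callable stands in for `_get_rust().destroy`.

```python
class KernelHandle:
    """Hypothetical wrapper showing deterministic release of a native handle."""

    def __init__(self, backend, destroy):
        self._destroy = destroy  # stand-in for _get_rust().destroy
        self._backend = backend

    def close(self) -> None:
        backend = getattr(self, "_backend", None)
        self._backend = None     # cleared first, so a second call is a no-op
        if backend is not None:
            self._destroy(backend)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.close()

    def __del__(self):
        self.close()             # last-resort fallback only
```

With this shape, a bundle-level close() can release the kernel in a defined order relative to the venue and Zinc plane instead of waiting for garbage collection.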

F21: `DITAv2LauncherBundle.close()` closes venue before kernel is done with it

File: launcher.py:90-95

def close(self):
    _maybe_close(self.venue)       # Closes HTTP client
    _maybe_close(self.zinc_plane)  # Closes Zinc regions

If the kernel is mid-process_intent in another thread (hypothetical — single-threaded in practice), venue.submit() would fail because the HTTP client is already closed. No ordering enforcement.

Severity: Low (single-threaded deployment)

F22: Silent fallback from real Zinc/Hazelcast to in-memory on error — operator unaware

File: control.py:210-217, launcher.py:175-185, projection.py:30-40

def build_control_plane(...):
    if real_requested:
        try:
            return RealZincControlPlane(...)
        except Exception:
            pass  # SILENT — operator never knows
    return ZincControlPlane(snapshot=snapshot)

Three places have this pattern. An operator who configures DITA_V2_ZINC=REAL and Zinc isn't available gets in-memory storage without any warning, error, or log. The ZincPlane protocol has no introspection method to check if it's real or in-memory.

The same applies to Hazelcast projection and the venue adapter.

Severity: Medium — configuration errors are silently masked.
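The fallback itself can stay; what is missing is a signal to the operator. A minimal sketch, assuming a standard `logging` setup: `build_with_fallback` and the logger name are hypothetical, and the returned mode string is one possible way to give callers the introspection the protocol lacks.

```python
import logging

logger = logging.getLogger("dita.control")  # logger name is an assumption

def build_with_fallback(real_requested, build_real, build_in_memory):
    """Try the real backend; log loudly before falling back to in-memory."""
    if real_requested:
        try:
            return build_real(), "real"
        except Exception:
            logger.warning(
                "real backend requested but unavailable; using in-memory fallback",
                exc_info=True,
            )
    return build_in_memory(), "in_memory"
```

A stricter variant would re-raise when the real backend was explicitly requested, turning the configuration error into a startup failure.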

F23: `VenueEvent.size` = `intent.target_size` not actual fill — wrong for multi-leg EXIT

File: bingx_venue.py:410-420

base_event = VenueEvent(
    size=float(intent.target_size or 0.0),  # target, not fill
)

For an EXIT leg, intent.target_size is the intended exit size. The ACK event's size reflects the target, not the actual fill. For fully-filled MARKET orders, target == fill so it's invisible. For partially-filled LIMIT orders, size on the ACK is wrong.

The fill event later has filled_size from the venue's executedQty, so the downstream kernel uses the correct fill size. The ACK's size is unused by the kernel (the kernel uses filled_size for PnL computation).

Severity: Informational (unused by kernel)

F24: `asyncio.run()` inside async function in test generator — nested event loops

File: _build_pink_extended.py:75-81

def _check_open_orders(c, vs):
    r = __import__('asyncio').run(c._request_json("GET", ...))

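The helper is used by test bodies that are async. asyncio.run() does not nest: when it is called on a thread that already has a running event loop it raises RuntimeError ("asyncio.run() cannot be called from a running event loop"), as the Python documentation states. The helper therefore only works when it is reached from synchronous code or from a separate thread, which is presumably the case wherever it succeeds today. Any call made directly from an async test body on pytest's asyncio loop would fail instead of running a nested loop.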

Severity: Low (works in practice)
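The behaviour of asyncio.run() in both call contexts can be checked with a few lines of standard-library code. This is a standalone illustration; `fetch`, `sync_helper`, and `async_test_body` are hypothetical names, not the generator's functions.

```python
import asyncio

async def fetch():
    return "ok"

def sync_helper():
    # Same shape as the helper above: asyncio.run() inside a plain function.
    coro = fetch()
    try:
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoids a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"

async def async_test_body():
    return sync_helper()  # called while an event loop is already running

from_sync = sync_helper()                     # no loop running: succeeds
from_async = asyncio.run(async_test_body())   # loop running: RuntimeError path
```

Making the helper itself `async def` and awaiting the request avoids the problem in both contexts.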

F25: `_build_fresh_kernel_from_slot` leaks old kernel objects per call

File: _build_pink_extended.py:95-108

def _build_fresh_kernel_from_slot(slot_data, ic=25000.0):
    cfg = _build_config(ic)
    b = build_launcher_bundle(venue_mode="BINGX", ...)  # NEW bundle, OLD not closed
    k = b.kernel
    return RB(runtime=Shim(k), config=cfg)

Each call creates a new launcher bundle (new kernel, new Rust handle, new HTTP client, new Zinc plane) without closing the old one. Called 4 times across the fresh-kernel test bodies. Leaks ~50MB per call (Rust lib, HTTP connections).

Severity: Low (test infrastructure only)

F26: `seen_event_ids` not cleared on re-entry — event IDs accumulate across trades

File: lib.rs:672-683

When a slot re-enters (new ENTER after previous EXIT), the Rust kernel resets most fields (lib.rs:740-765) but does NOT clear seen_event_ids. The new trade inherits the previous trade's event history up to MAX_SEEN_EVENT_IDS (256). After 256 events across multiple trades, old IDs are drained.

For MARKET trading (2-4 events per trade), the 256-entry window takes roughly 64-128 trades to turn over. For LIMIT trading (many partial fills), it could be 5-10 trades.

Fix: slot.seen_event_ids.clear() on ENTER.

Severity: Low (event ID collision across trades is astronomically unlikely)

F27: `RealZincControlPlane.read()` parses Zinc region every call — no caching

File: real_control_plane.py:88-94

def read(self):
    payload = _decode_packet(self.region.as_buffer())  # JSON parse every call
    control = payload.get("control")
    self._snapshot = KernelControlSnapshot(**control)   # reconstruct every call
    return self._snapshot

Called by ExecutionKernel.control property on every process_intent(). Each call re-constructs a KernelControlSnapshot from dict — allocating new objects for every field. ~50μs per call. A simple cached-until-modified pattern would eliminate all parses between writes.

Severity: Low (performance)
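The cached-until-modified pattern can be sketched as below. It assumes the region exposes a sequence number that is cheap to read without parsing the body. `CachedControlReader`, `read_seq`, and `decode` are hypothetical names.

```python
class CachedControlReader:
    """Hypothetical reader that re-parses only when the sequence changes."""

    def __init__(self, read_seq, decode):
        self._read_seq = read_seq  # cheap: returns the region's sequence number
        self._decode = decode      # expensive: JSON parse + snapshot rebuild
        self._cached_seq = None
        self._snapshot = None

    def read(self):
        seq = self._read_seq()
        if seq != self._cached_seq:
            self._snapshot = self._decode()
            self._cached_seq = seq
        return self._snapshot
```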

F28: `_legacy_intent` hardcodes `confidence=1.0` and `bars_held=0`

File: bingx_venue.py:270-285

These fields are in LegacyIntent but unused by submit_intent() (which only reads asset, side, action, target_size, leverage, metadata). The downstream ClickHouse rows use the policy-layer Intent, not LegacyIntent, so the hardcoded values don't reach persistence.

Only propagates through the venue adapter's internal chain. No consumer reads them today.

Severity: Informational

F29: `_slot_to_payload` in `real_zinc_plane.py` is dead code

File: real_zinc_plane.py:57-59

def _slot_to_payload(slot):
    data = slot.to_dict()
    return data

Defined, never called anywhere in the file. All slot serialization calls slot.to_dict() directly.

Severity: Informational

F30: Duplicate `_slot_from_payload` in `real_zinc_plane.py` and `rust_backend.py`

File: real_zinc_plane.py:62-112, rust_backend.py:270-310

Two nearly identical implementations. The real_zinc_plane version manually constructs VenueOrder objects (lines 63-88) with different defaults (e.g., fallback to slot size if intended_size missing). The rust_backend version delegates to _order_from_payload with all-default fallbacks.

If fields are added to TradeSlot or VenueOrder, both must be updated.

Severity: Low (code duplication risk)

Complete Flaw Catalog

All-Passes Combined

Family	Focus	Count	Critical	High	Medium	Low	Info
A	Architectural (old 13, now superseded)	15	0	2	0	2	11
T	Threading/Atomicity	9	1	3	3	2	0
E	E2E Trace (Pass 1)	26	0	5	11	9	1
F	Deep E2E (Pass 3)	30	0	1	8	17	4
Total		80	1	11	22	30	16

Most Dangerous Single Flaw: F15

An exception in on_venue_event() during the fill-feedback loop stops the chain mid-apply. The ACK applied but the FILL didn't. Slot in ENTRY_WORKING with no position. No retry mechanism, no recovery path. The slot is stuck forever until manual intervention. Zero defense in depth — no try/except, no undo, no validation that the slot reached a consistent state.

This is the single highest-impact E2E flaw because it requires no concurrency, no race condition, no unusual market conditions — just a transient FFI error during normal operation.

PASS 4 — SYSTEMATIC DOMAIN SCANS (Config, Rust, Persistence, Lifecycle)

Rust Kernel — Numeric & FSM Invariants

G1: EXIT_RESIDUAL action is entirely missing from Rust KernelCommandType

File: _rust_kernel/src/lib.rs

string_enum! {
    enum KernelCommandType {
        ENTER, EXIT, MARK_PRICE, RECONCILE, CONTROL, CANCEL,
    }
}

Six variants. No EXIT_RESIDUAL. If any caller submits an intent with action = "EXIT_RESIDUAL", the string_enum deserializer fails — serde returns INVALID_INTENT_PARSE. Even if deserialization worked, there's no branch to handle residual-position cleanup. Any position with remaining size after partial exit legs has no way to trigger a clean-up exit via the intent system.

The Python KernelCommandType enum (contracts.py) does have EXIT_RESIDUAL, translated to "EXIT_RESIDUAL" string by _intent_to_payload. This string hits Rust's string_enum → parse error → INVALID_INTENT_PARSE.

Fix: Add EXIT_RESIDUAL variant to Rust enum + match arm that skips the NO_OPEN_POSITION guard for residual-sized positions.

Severity: Critical

G2: `into_c_string` uses `unwrap()` — panics on interior NUL byte

File: _rust_kernel/src/lib.rs:1477

fn into_c_string(value: &str) -> *mut c_char {
    CString::new(value).unwrap().into_raw()
}

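CString::new() returns Err if the string contains a NUL ('\0') byte. .unwrap() panics at the C FFI boundary. If any string passed to into_c_string contains a NUL byte, this panics the entire process. Note that serde_json escapes control characters inside JSON strings (a NUL is emitted as \u0000), so serde_json::to_string() output should not contain a raw NUL by itself; the exposure is any caller that passes a string not produced by serde_json, or a future change that does.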

Triggered by every FFI call that returns a string:

dita_kernel_process_intent_json
dita_kernel_on_venue_event_json
dita_kernel_reconcile_slots_json
dita_kernel_snapshot_json
dita_kernel_get_slot_json

Fix: Replace .unwrap() with a fallible conversion such as CString::new(value).map(CString::into_raw).unwrap_or(ptr::null_mut()), or feed the error through invalid_intent_cstring.

Severity: Critical

G3: `process_intent` EXIT hardcodes `prev_state = POSITION_OPEN` unconditionally

File: _rust_kernel/src/lib.rs:842-890

slot.fsm_state = TradeStage::EXIT_REQUESTED;        // unconditional override
let transition = self.transition(
    &slot,
    TradeStage::POSITION_OPEN,                        // always POSITION_OPEN
    slot.fsm_state.clone(),
    "EXIT_INTENT",
);

Three problems:

(a) Transition prev_state is a lie. If the slot was in EXIT_WORKING, EXIT_SENT, EXIT_REQUESTED, or POSITION_PARTIALLY_CLOSED, the transition record says POSITION_OPEN — wrong.

(b) Backward transition. If the slot is EXIT_WORKING and a new EXIT intent arrives, fsm_state is set to EXIT_REQUESTED — a backward transition from EXIT_WORKING → EXIT_REQUESTED. This corrupts the FSM.

(c) No state guard. EXIT should only be allowed from POSITION_OPEN, EXIT_WORKING (for additional legs), or POSITION_PARTIALLY_CLOSED. Currently any state that passes !is_free() && !closed && size > 0 can transition to EXIT_REQUESTED.

Fix: Check actual FSM state before allowing EXIT, log actual prev_state, guard against backward transitions.

Severity: Critical
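The guard described in the fix can be sketched as a lookup that runs before the state is mutated. The sketch is in Python for brevity (the kernel itself is Rust). The state names come from the text above; `EXIT_ALLOWED_FROM` and `request_exit` are hypothetical names, and the allowed set simply follows point (c).

```python
from enum import Enum
from typing import Tuple


class TradeStage(Enum):
    POSITION_OPEN = "POSITION_OPEN"
    POSITION_PARTIALLY_CLOSED = "POSITION_PARTIALLY_CLOSED"
    EXIT_REQUESTED = "EXIT_REQUESTED"
    EXIT_SENT = "EXIT_SENT"
    EXIT_WORKING = "EXIT_WORKING"


# Hypothetical guard set, taken from point (c) above.
EXIT_ALLOWED_FROM = {
    TradeStage.POSITION_OPEN,
    TradeStage.EXIT_WORKING,
    TradeStage.POSITION_PARTIALLY_CLOSED,
}


def request_exit(current: TradeStage) -> Tuple[TradeStage, TradeStage]:
    """Return (prev_state, next_state) for an EXIT intent, or raise."""
    if current not in EXIT_ALLOWED_FROM:
        raise ValueError(f"EXIT not allowed from {current.value}")
    # An exit already in flight keeps its state instead of moving backward.
    if current is TradeStage.EXIT_WORKING:
        return current, current
    return current, TradeStage.EXIT_REQUESTED
```

The returned prev_state is always the state the slot was actually in, so the transition record cannot claim POSITION_OPEN when that was not the case.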

G4: `consume_exit_leg` advances beyond last valid index — stale `all_legs_done` variable

File: _rust_kernel/src/lib.rs:1420-1435

let all_legs_done = slot.active_leg_index >= slot.exit_leg_ratios.len(); // (A)
let should_close = (slot.size <= 1e-12 || (!partial && all_legs_done));  // (B)

if !partial {
    slot.consume_exit_leg();  // (C) — advances active_leg_index POST (A)
}

if should_close && slot.size <= 1e-12 {         // (D) — close
} else if !partial && !all_legs_done {           // (E) — stale! uses (A) not post-advance index

On the last leg (active_leg_index = len - 1):

(A): all_legs_done = false (pre-advance)
(C): advances to len (exhausted)
(E): !partial && !false = true → enters POSITION_OPEN instead of examining should_close with post-advance index

The all_legs_done variable is captured before consume_exit_leg advances the index. Branch (E) should use the post-advance index to correctly detect exhaustion.

After exhaustion, next_exit_ratio() returns 1.0 (out-of-bounds unwrap_or(1.0)) — silently tries to exit remaining size as 100% instead of detecting completion.

Severity: Critical
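A minimal Python model of the ordering problem, assuming a simplified slot with only the cursor and the ratio list (`Slot`, `legs_exhausted`, and `after_full_leg_fill` are hypothetical names). The point is that the exhaustion check is evaluated after the cursor advances rather than cached before it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Slot:
    """Hypothetical stand-in for the Rust TradeSlot."""
    exit_leg_ratios: List[float] = field(default_factory=list)
    active_leg_index: int = 0

    def legs_exhausted(self) -> bool:
        return self.active_leg_index >= len(self.exit_leg_ratios)

    def consume_exit_leg(self) -> None:
        # Never advance past len(exit_leg_ratios).
        if not self.legs_exhausted():
            self.active_leg_index += 1


def after_full_leg_fill(slot: Slot) -> str:
    """Advance the cursor first, then branch on a fresh exhaustion check."""
    slot.consume_exit_leg()
    return "ALL_LEGS_DONE" if slot.legs_exhausted() else "POSITION_OPEN"
```

With two legs and the cursor on the last one, this returns ALL_LEGS_DONE, whereas a check captured before the advance would still report that legs remain.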

G5: `realized_pnl` uses unbounded f64 — overflows to inf at extreme values

File: _rust_kernel/src/lib.rs:648-656

let notional = exit_size * slot.entry_price * slot.leverage.max(1.0);
delta * notional

No is_finite() check on intermediate products. At exit_price=1e200, entry_price=1e-200: delta = (1e200 - 1e-200) / 1e-200 ≈ 1e400 → inf. The resulting inf is stored in slot.realized_pnl, corrupting all future PnL tracking.

Subnormals: entry_price=5e-324 (subnormal) causes division to produce inf for modest exit prices on some platforms.

Fix: Add is_finite() guards on both prices and cap intermediate products.

Severity: High
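The overflow is easy to reproduce with ordinary IEEE 754 doubles. The following Python sketch mirrors the formula quoted above and shows one possible shape for the guard; `checked_realized_pnl` is a hypothetical name, and returning None on a non-finite result is an assumed policy, not the kernel's.

```python
import math
from typing import Optional


def checked_realized_pnl(
    entry_price: float, exit_price: float, exit_size: float, leverage: float
) -> Optional[float]:
    """Return the PnL, or None if any input or the result is non-finite."""
    inputs = (entry_price, exit_price, exit_size, leverage)
    if not all(math.isfinite(x) for x in inputs) or entry_price <= 0.0:
        return None
    delta = (exit_price - entry_price) / entry_price
    notional = exit_size * entry_price * max(leverage, 1.0)
    pnl = delta * notional
    # The inputs were finite, but the products can still overflow to inf.
    return pnl if math.isfinite(pnl) else None
```

For entry_price=1e-200 and exit_price=1e200 the intermediate delta is already inf, so the function returns None instead of storing inf.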

G6: `mark_price` produces unbounded `unrealized_pnl`

File: _rust_kernel/src/lib.rs:384-399

self.unrealized_pnl = delta * self.size * self.entry_price * self.leverage;
// No is_finite() check on result

If any of delta, size, entry_price, or leverage is extreme, the product overflows to inf. No result guard. inf stored in unrealized_pnl forever. Capped only by the price <= 0.0 guard on input — no guard on the computation chain.

Also: at line 388, mark_price sets self.entry_price = price whenever entry_price <= 0.0. For a fresh slot this is correct: a stale-zero entry_price is replaced by the current market price on the first mark_price call after open. But if the slot is reused (re-entry without resetting entry_price), the old entry price from the prior trade bleeds into unrealized PnL.

Severity: High

G7: `process_intent` ENTER — no `is_finite()` guard on `target_size`

File: _rust_kernel/src/lib.rs:806-807

intended_size: intent.target_size.max(0.0),

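f64::INFINITY.max(0.0) returns inf, so an infinite target_size passes straight through. f64::NAN.max(0.0) returns 0.0 (Rust's f64::max returns the non-NaN operand), so a NaN target_size is silently coerced to a zero-size order instead of being rejected. serde_json rejects bare NaN and Infinity tokens by default because they are not valid JSON, so the exposure depends on how the payload is decoded; the kernel itself performs no finiteness check. If the Python-side _first_invalid_intent_field guard is bypassed (F3, which allows these through) and such a value reaches the kernel, inf propagates into intended_size in VenueOrder, corrupting all fill calculations.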

Similarly, reference_price is never validated for finiteness before being stored in VenueOrder.metadata.

Severity: High

G8: `reconcile_slots_json` — no dedup or bounds validation

File: _rust_kernel/src/lib.rs:1668-1675

for slot in slots {
    if slot.slot_id < core.slots.len() {
        core.slots[slot.slot_id] = slot.clone();
    }
}

Two slots with the same slot_id: the second overwrites the first silently. A slot with slot_id >= core.slots.len(): silently dropped — no error, no diagnostic. Caller sees accepted=true even if some/all slots were not applied.

Severity: High
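One way to make the outcome visible to the caller is to classify every incoming slot before applying any of them. This Python sketch works on slot ids only; `plan_reconcile` and the shape of the returned dict are assumptions for illustration.

```python
from typing import Dict, Iterable, List


def plan_reconcile(slot_ids: Iterable[int], capacity: int) -> Dict[str, object]:
    """Classify incoming slot ids instead of silently dropping or overwriting."""
    applied: List[int] = []
    duplicates: List[int] = []
    out_of_range: List[int] = []
    seen = set()
    for slot_id in slot_ids:
        if not 0 <= slot_id < capacity:
            out_of_range.append(slot_id)
        elif slot_id in seen:
            duplicates.append(slot_id)
        else:
            seen.add(slot_id)
            applied.append(slot_id)
    return {
        "accepted": not duplicates and not out_of_range,
        "applied": applied,
        "duplicates": duplicates,
        "out_of_range": out_of_range,
    }
```

A caller that receives accepted=False can then decide whether to retry, reject the whole batch, or log the skipped ids.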

G9: `exchange_order_id` propagation uses wrong order target

File: _rust_kernel/src/lib.rs:1110-1125

let target = if slot.active_entry_order.is_some() {
    slot.active_entry_order.as_mut()
} else {
    slot.active_exit_order.as_mut()
};

If an entry order exists (even if fully filled) and an exit fill event arrives, the code updates the entry order's venue_order_id instead of the exit order's. The exit order's venue_order_id stays empty. Any subsequent CANCEL intent on the exit order fails because active_exit_order.venue_order_id is empty — the venue can't match the cancel.

Fix: Disambiguate by matching venue_client_id, or clear active_entry_order when entry is complete.

Severity: High
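The first option in the fix, matching on venue_client_id, might look like the following Python sketch. `Order` and `pick_target` are hypothetical names; only the venue_client_id and venue_order_id field names are taken from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Order:
    venue_client_id: str
    venue_order_id: str = ""


def pick_target(
    entry: Optional[Order], exit_: Optional[Order], event_client_id: str
) -> Optional[Order]:
    """Return the order the event belongs to, or None if neither matches."""
    for order in (entry, exit_):
        if order is not None and order.venue_client_id == event_client_id:
            return order
    return None
```

Returning None for an unmatched event gives the caller a chance to emit a diagnostic rather than updating the wrong order.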

G10: CANCEL diagnostic code says NO_ACTIVE_EXIT_ORDER for entry cancel too

File: _rust_kernel/src/lib.rs:966-1005

if !has_cancellable_exit && !has_cancellable_entry {
    return KernelResult {
        diagnostic_code: KernelDiagnosticCode::NO_ACTIVE_EXIT_ORDER, // always says exit
        details: json!({"reason": "NO_ACTIVE_EXIT_ORDER"}),
    };
}

When neither exit nor entry is cancellable, the diagnostic returns NO_ACTIVE_EXIT_ORDER regardless of which order was the target. If the user wanted to cancel an entry order that's not in a cancellable state, the diagnostic is misleading.

Fix: Separate diagnostic codes: NO_ACTIVE_EXIT_ORDER, NO_ACTIVE_ENTRY_ORDER, ENTRY_NOT_CANCELLABLE.

Severity: High

G11: `apply_fill` entry-fill overwrites `active_entry_order.intended_size` with `slot.size`

File: _rust_kernel/src/lib.rs:1363-1377

On FULL_FILL entry, slot.active_entry_order is entirely replaced with a new VenueOrder where intended_size = slot.size (the fill amount) instead of the original intended size. The original intended size (which could be larger than fill size for partial fills) is lost.

If a duplicate fill event arrives (dedup fails due to missing event_id), the second fill would use slot.size as the basis for further fills — wrong values.

Severity: Medium

G12: `leverage` unbounded after `is_finite()` — no maximum cap

File: _rust_kernel/src/lib.rs:778

slot.leverage = if intent.leverage.is_finite() && intent.leverage > 0.0 {
    intent.leverage  // 1e100 accepted here
} else { 1.0 };

leverage = 1e100 passes is_finite(). Feeds into realized_pnl() as slot.leverage.max(1.0) = 1e100, producing notional = exit_size * entry_price * 1e100. mark_price multiplies by the same leverage, so both realized and unrealized PnL can become arbitrarily large.

No maximum leverage cap enforced anywhere — the exchange-level cap (DOLPHIN_BINGX_EXCHANGE_LEVERAGE_CAP) exists in BingxExecClientConfig but is never passed to the Rust kernel.

Severity: Medium

G13: `resolve_slot` fallback returns `unwrap_or(0)` — can misroute events

File: _rust_kernel/src/lib.rs:623

self.slots.first().map(|slot| slot.slot_id).unwrap_or(0)

When no slot matches the event (slot_id out of range or all slot filters fail), returns slot_id of the first slot (which may be 0 or any value). No diagnostic emitted — caller sees slot state change with no idea the event was misrouted.

Severity: Medium

G14: `commit_slot` silently ignores out-of-bounds slot_id

File: _rust_kernel/src/lib.rs:595-600

fn commit_slot(&mut self, slot: TradeSlot) {
    if slot.slot_id < self.slots.len() {
        self.slots[slot_id] = slot;
    }
    // else: silently dropped — no error returned
}

Mutations to out-of-bounds slot are silently discarded. Can happen if slot.slot_id is corrupted via set_slot_from_json causing index mismatch between slot.slot_id and the actual slot position.

Severity: Medium

Configuration & Validation Chain

G15: Zero `__post_init__` validators on all config dataclasses

Every config dataclass in the system has zero field-level validation:

Dataclass	Fields	Validators
`KernelControlSnapshot`	16	0
`ControlUpdate`	16	0
`KernelIntent`	19	0
`TradeSlot`	22	0
`VenueOrder`	8	0
`VenueEvent`	18	0
`KernelTransition`	11	0
`KernelOutcome`	8	0
`AccountSnapshot`	9	0
Total	127	0

The only validation in the entire chain:

_first_invalid_intent_field() — finiteness guard at Python→Rust FFI boundary (not a dataclass validator)
Rust leverage = if is_finite && > 0.0 { val } else { 1.0 } — post-hoc clamp
Rust KernelCore::new(max_slots.max(1)) — floor only, no ceiling
launcher.py:143: max(1, int(...)) for active_slot_limit — floor only

No __post_init__ exists anywhere. No bounds check on any field except the two floor-only guards.

Severity: High
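For illustration, a field-level validator on a dataclass could take the following form. `SizedIntent` is a hypothetical two-field subset of KernelIntent, and `MAX_LEVERAGE` is an assumed ceiling chosen only for the example.

```python
import math
from dataclasses import dataclass

MAX_LEVERAGE = 100.0  # assumed ceiling, not taken from the source


@dataclass(frozen=True)
class SizedIntent:
    """Hypothetical subset of KernelIntent with bounds checks."""
    target_size: float
    leverage: float = 1.0

    def __post_init__(self) -> None:
        if not math.isfinite(self.target_size) or self.target_size < 0.0:
            raise ValueError(f"target_size must be finite and >= 0: {self.target_size!r}")
        if not math.isfinite(self.leverage) or not 0.0 < self.leverage <= MAX_LEVERAGE:
            raise ValueError(f"leverage must be in (0, {MAX_LEVERAGE}]: {self.leverage!r}")
```

Because the check runs in the constructor, an invalid object can never exist, which removes the need for every downstream consumer to re-validate.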

G16: `DITA_V2_DEBUG_CLICKHOUSE` defaults to `True` when env var is unset

File: launcher.py:133

debug = _env_bool("DITA_V2_DEBUG_CLICKHOUSE", True)

_env_bool (launcher.py:75) returns default when the env var is unset. So debug = True by default. Every runtime writes debug traces to ClickHouse by default. DITA_V2_DEBUG_CLICKHOUSE=False is required to disable it.

This is not a bug per se, but it means debug ClickHouse writes are on by default, adding ~10 ClickHouse insertions per process_intent call (every transition + position state + trade event) that most production deployments may not want.

Severity: Informational

G17: String config fields have no charset/length validation — Zinc region injection risk

File: control.py:31-53, real_zinc_plane.py:30

runtime_namespace, strategy_namespace, event_namespace, actor_name, exec_venue, data_venue, ledger_authority are all free-form strings with no validation. They're used as:

Zinc shared memory region names: self.prefix + "." + namespace + "." + kind — an attacker-controlled namespace could collide with other processes' Zinc regions
ClickHouse table names: DOLPHIN_BINGX_JOURNAL_STRATEGY is used as a table suffix — SQL injection risk in ClickHouse journal
Hazelcast map names: Same injection risk via event_namespace

Severity: Medium
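A conservative allow-list is usually enough for names that end up inside region, table, or map identifiers. The sketch below assumes ASCII letters, digits, and underscores with a 64-character limit; both the character set and the limit are illustrative choices, and `validate_namespace` is a hypothetical helper.

```python
import re

# Assumed policy: 1 to 64 characters, ASCII letters, digits, underscore.
_NAMESPACE_RE = re.compile(r"[A-Za-z0-9_]{1,64}")


def validate_namespace(value: str) -> str:
    """Return value unchanged if it is a safe identifier, else raise."""
    if not isinstance(value, str) or _NAMESPACE_RE.fullmatch(value) is None:
        raise ValueError(f"unsafe namespace: {value!r}")
    return value
```

Excluding the dot matters here because the region name is built by joining prefix, namespace, and kind with dots.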

G18: `exit_leg_ratios` no sum-to-1 validation

KernelIntent.exit_leg_ratios and TradeSlot.exit_leg_ratios are tuple/list of floats. No validator ensures they sum to approximately 1.0. Ratios summing to 0.5 leave the position partially closed forever (residual can't be exited because next_exit_ratio() returns 1.0 after exhaustion, exiting 100% of remaining — which may exceed the intended residual).

Severity: Low
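A sum-to-1 check is a few lines. This sketch assumes the ratios are fractions of the original position that must all be positive; `validate_exit_leg_ratios` and the tolerance are hypothetical.

```python
import math
from typing import Sequence, Tuple


def validate_exit_leg_ratios(ratios: Sequence[float], tol: float = 1e-9) -> Tuple[float, ...]:
    """Return the ratios as a tuple if they are positive and sum to about 1."""
    values = tuple(float(r) for r in ratios)
    if not values:
        raise ValueError("exit_leg_ratios must not be empty")
    if any(not math.isfinite(r) or r <= 0.0 for r in values):
        raise ValueError(f"exit_leg_ratios must be finite and > 0: {values}")
    if not math.isclose(sum(values), 1.0, abs_tol=tol):
        raise ValueError(f"exit_leg_ratios sum to {sum(values)}, expected 1.0")
    return values
```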

G19: `RealZincControlPlane.read()` has no sequence check — torn-read risk

File: real_control_plane.py:88-94

def read(self):
    payload = _decode_packet(self.region.as_buffer())
    control = payload.get("control")
    if not isinstance(control, dict):
        return self._snapshot
    self._snapshot = KernelControlSnapshot(**control)
    return self._snapshot

The binary packet has a 64-bit sequence number but read() never checks it. Between the zero-write and packet-write in _write_region, a reader sees an empty buffer → _decode_packet fails → falls back to self._snapshot (stale). Between the packet-write and struct.pack header (order depends on implementation), a reader sees a partial write with wrong size → _decode_packet fails.

No checksum on the wire format: struct.pack("!QQ", seq, len) + json_bytes. A torn write produces garbage that json.loads may or may not parse successfully.

Severity: Low

G20: `DOLPHIN_BINGX_JOURNAL_STRATEGY`/`_DB` — ClickHouse SQL injection risk

File: launcher.py:202-203

"DOLPHIN_BINGX_JOURNAL_STRATEGY": os.environ.get("DOLPHIN_BINGX_JOURNAL_STRATEGY", ""),
"DOLPHIN_BINGX_JOURNAL_DB": os.environ.get("DOLPHIN_BINGX_JOURNAL_DB", ""),

These are used as ClickHouse table and database name suffixes in pink_clickhouse.py. An attacker who can set env vars can inject SQL via semicolons or quotes in the table name. ClickHouse supports INSERT INTO db.table FORMAT JSONEachRow — a table name like positions; DROP TABLE ...; could be destructive.

Severity: Low (requires env var control, which implies broader access)

Persistence Schema Alignment

G21: `entry_price` used as `exit_price` in `trade_events` — data loss

File: pink_clickhouse.py (outside workspace)

The _write_trade_event function maps entry_price from slot.to_dict() to both the entry_price and exit_price columns. The actual exit fill price (available on the VenueEvent object) is never written to the exit_price column.

Result: Every trade_events row has exit_price == entry_price. The exit_price column is a dead column — always contains the entry price, never the actual fill.

Severity: High — data loss to DB for the most important trade metric.

G22: `active_leg_index` → `entry_bar` semantic mis-mapping

File: pink_clickhouse.py (outside workspace)

"entry_bar": int(slot_dict.get("active_leg_index", 0) or 0),

active_leg_index tracks the exit-leg-ratios cursor (which leg of a multi-leg exit we're on), not a bar count. The value is 0 at position open and 1 after the first exit leg; neither represents bars held. The entry_bar column stores the wrong concept.

Severity: Medium — column contains semantically meaningless data.

G23: `capital_before` arithmetic reconstruction absorbs cross-slot PnL

File: pink_clickhouse.py (outside workspace)

capital_before = capital_after - pnl_leg

capital_before is reconstructed by subtracting the current leg's PnL from the current capital. In a multi-slot system, other slots' PnL changes between legs are absorbed into capital_before. The column is always wrong in multi-slot scenarios because capital_after reflects total PnL from all slots, not just the leg being recorded.

Severity: Medium — wrong capital_before for multi-slot trading.

G24: Recovery `trade_reconstruction` always has `trade_id=""`

File: pink_clickhouse.py (outside workspace)

The persist_recovery_state function passes kernel.snapshot()["account"] (an account dict with keys capital, equity, realized_pnl, ...) where a slot dict is expected. The trade_id key does not exist on the account dict. The recovery_state row always has trade_id="".

Severity: Medium — recovery data is not associable with any trade.

G25: `seen_event_ids`, `exit_leg_ratios`, `VenueOrder`, `metadata` not in flat ClickHouse tables

These fields are:

Present on the Python TradeSlot ✅
Transmitted through Zinc shared memory ✅
Stored in Hazelcast ✅
Stored in ClickHouse dita_kernel_debug (full JSON) ✅
NOT extracted into main ClickHouse flat tables position_state, trade_events, trade_exit_legs ❌

Data exists at the source, travels through the pipeline, hits the debug journal — but is lost in the main analytical tables.

Severity: Low (data exists in debug journal if needed for reconstruction)

G26: `_safe_float` silently converts NaN/None/Inf to 0.0

File: utils.py:15

def _safe_float(v, default=0.0):
    try:
        f = float(v)
        if not math.isfinite(f):
            return default
        return f
    except (TypeError, ValueError, OverflowError):
        return default

Used in multiple ClickHouse writers. Silently converts NaN/Inf/parsing errors to 0.0. No diagnostic emitted when a non-finite value reaches the persistence layer — data silently zeroed.

Severity: Low (safe default but silent corruption)
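A variant that keeps the safe default but reports each coercion could look like this. The logger name, the `name` parameter, and `safe_float_logged` itself are hypothetical; the conversion logic follows the excerpt above.

```python
import logging
import math

log = logging.getLogger("pink.persistence")  # hypothetical logger name


def safe_float_logged(v, default=0.0, *, name="?"):
    """Like _safe_float, but emits a warning whenever the default is used."""
    try:
        f = float(v)
    except (TypeError, ValueError, OverflowError):
        log.warning("field %s: %r is not numeric, storing %r", name, v, default)
        return default
    if not math.isfinite(f):
        log.warning("field %s: non-finite value %r, storing %r", name, v, default)
        return default
    return f
```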

Lifecycle & Resource Management

G27: `build_launcher_bundle` has no exception safety — prior resources leak

File: launcher.py:264-300

def build_launcher_bundle(...):
    control_plane = _build_control_plane(...)
    projection = build_projection(...)
    zinc_plane = _build_zinc_plane(...)
    venue = _build_venue(...)
    kernel = ExecutionKernel(...)  # ← if THIS fails, everything above leaks

If any step after the first raises, all previously built resources leak:

RealZincPlane created → _build_venue() fails → 3 shared memory regions orphaned
RealZincControlPlane created → _build_zinc_plane() fails → 1 shared memory region orphaned
BingxVenueAdapter created → ExecutionKernel.__init__() fails → HTTP connection leaked

No try/finally anywhere in the builder. The init order is also optimized for forward construction, not backward cleanup.

Severity: High — shared memory leak on any build failure.
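contextlib.ExitStack is one standard way to get rollback-on-failure in a multi-step builder. The sketch below is generic: `build_bundle` and the `factories` argument are hypothetical, and each built object is assumed to expose close().

```python
from contextlib import ExitStack


def build_bundle(factories):
    """Build resources in order; close the ones already built if a step fails.

    `factories` is a hypothetical list of zero-argument callables, each
    returning an object with a close() method.
    """
    with ExitStack() as stack:
        built = []
        for factory in factories:
            resource = factory()
            stack.callback(resource.close)
            built.append(resource)
        # Success: hand the cleanup callbacks to the caller.
        return built, stack.pop_all()
```

On failure the already-built resources are closed in reverse order; on success the caller receives a second ExitStack and calls its close() at shutdown.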

G28: `RealZincPlane` and `RealZincControlPlane` have no `__del__`

When close() is not called (exception in builder, forgotten cleanup, GC during shutdown), the shared memory regions opened by RealZincPlane (3 regions) and RealZincControlPlane (1 region) are orphaned on the OS. They persist in /dev/shm/ (or platform equivalent) until system reboot.

Python's __del__ is unreliable (not called on SIGKILL, not called if the object is part of a cycle without a GC run), but its absence means even normal garbage collection can't clean up.

Severity: High — shared memory leaks.

G29: Zero signal handlers — no cleanup on SIGTERM/SIGINT

$ grep -rn "signal\|SIGTERM\|SIGINT\|atexit" *.py  # ZERO matches

When SIGTERM or SIGINT arrives:

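SIGTERM takes the OS default action and terminates the process immediately, with no finally blocks or atexit hooks run; SIGINT raises KeyboardInterrupt in the main thread, which unwinds the stack but triggers no resource cleanup here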
No DITAv2LauncherBundle.close() is called
No ExecutionKernel.__del__ is called (CPython may run GC on normal exit but not reliably)
All shared memory (RealZincPlane, RealZincControlPlane) is orphaned
In-flight BingX HTTP calls are interrupted mid-stream
Rust kernel handle is leaked

Severity: High
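A minimal handler that routes both signals through a cleanup callable is sketched below. `close_fn` stands in for something like the bundle's close(); the function names and the 128 + signum exit code convention are assumptions for the example.

```python
import signal
import sys


def make_shutdown_handler(close_fn):
    """Build a handler that runs close_fn and then exits the process."""
    def _handler(signum, _frame):
        try:
            close_fn()
        finally:
            sys.exit(128 + int(signum))
    return _handler


def install_shutdown_handlers(close_fn):
    """Register the handler for SIGTERM and SIGINT (main thread only)."""
    handler = make_shutdown_handler(close_fn)
    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, handler)
    return handler
```

Because sys.exit raises SystemExit, finally blocks and atexit hooks further up the stack also get a chance to run. None of this helps against SIGKILL, which cannot be caught.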

G30: `ExecutionKernel` has no `close()`, relies on `__del__` for Rust handle cleanup

ExecutionKernel has a __del__ that calls _get_rust().destroy(backend) but no close() method. DITAv2LauncherBundle.close() never touches the kernel, so the Rust handle is freed only when GC runs, at an unpredictable time.

If any code holds a stale _backend pointer, the handle dangles when GC runs. If __del__ is suppressed (e.g., during interpreter shutdown with cyclic references), the Rust handle leaks permanently.

Fix: Add close() to ExecutionKernel, call it from DITAv2LauncherBundle.close().

Severity: High

G31: `projection` (Hazelcast) never closed

build_projection() returns a HazelcastProjection which holds a Hazelcast client connection. No close() or disconnect() method exists on the projection, projector, or row writer. DITAv2LauncherBundle.close() doesn't touch the projection. The Hazelcast client connection leaks on shutdown.

Severity: Medium

G32: `_maybe_close()` only calls the first method found — `break` skips the second

File: launcher.py:233-243

for method_name in ("close", "disconnect"):
    method = getattr(obj, method_name, None)
    if method is None:
        continue
    try:
        result = method()
    except TypeError:
        continue
    if inspect.isawaitable(result):
        try:
            asyncio.run(result)
        except RuntimeError:
            pass
    break  # ← ONLY calls the FIRST found method, never both

If an object has both close() and disconnect(), only close() is called. disconnect() is silently skipped. Also: asyncio.run(result) silently swallows RuntimeError when a running event loop exists — the coroutine is never executed.

Currently no object has both, but the pattern is fragile.

Severity: Low

G33: `close()` is not idempotent for RealZinc components

RealZincPlane.close() and RealZincControlPlane.close() call their Zinc region's close() method. If called twice, the second call operates on an already-closed region and will likely crash inside Zinc's shared memory code.

References are not nulled after close: DITAv2LauncherBundle.close() calls _maybe_close(), which leaves self.venue, self.zinc_plane, and self.control_plane pointing at the closed objects. Double close() is unsafe.

Severity: Low
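Idempotent close is typically achieved by swapping the reference out before releasing it. `IdempotentCloser` below is a hypothetical wrapper around any object with close(), shown only to illustrate the pattern.

```python
import threading


class IdempotentCloser:
    """Hypothetical wrapper: the first close() releases, later calls are no-ops."""

    def __init__(self, region):
        self._region = region
        self._lock = threading.Lock()

    def close(self) -> None:
        # Take ownership of the reference under the lock, then release outside it.
        with self._lock:
            region, self._region = self._region, None
        if region is not None:
            region.close()
```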

G34: No context manager on `DITAv2LauncherBundle`

DITAv2LauncherBundle has no __enter__/__exit__. Users must manually call close(). No with pattern exists anywhere in the source for lifecycle management. No __del__ fallback on the bundle either.

Severity: Low (ergonomic, not a leak source if caller follows the pattern)

G35: `BingxVenueAdapter.connect()` exists but is never called by the launcher

BingxDirectExecutionAdapter has a connect() method that initializes the lifetime HTTP client. BingxVenueAdapter has connect() that calls _call_backend("connect"). Neither is called in build_launcher_bundle() or _build_venue(). If the adapter's submit_intent() relies on a connected client, it initializes lazily, and the explicit connect path is dead code.

Severity: Informational

G36: Only one `try/finally` in the entire codebase

The only try/finally is _RustKernelLib._take_string() (rust_backend.py:140-143) which frees the Rust C string. All other resource management uses try/except with no finally.

No cleanup is guaranteed on exception:

build_launcher_bundle() — no cleanup on failure
process_intent() — no cleanup of partial slot state on venue event exception
on_venue_event() — no cleanup on FFI failure
_set_slot() — no cleanup on projection or Zinc write failure

Severity: High (across all layers)

Pass 4 Summary

#	Flaw	Layer	Severity
G1	EXIT_RESIDUAL action missing from Rust KernelCommandType	Rust	Critical
G2	`into_c_string` unwrap() panics on NUL byte	Rust	Critical
G3	EXIT hardcodes prev_state=POSITION_OPEN, allows backward FSM transition	Rust	Critical
G4	`consume_exit_leg` stale `all_legs_done` variable — wrong branch after last leg	Rust	Critical
G5	`realized_pnl` unbounded f64 overflow to inf	Rust	High
G6	`mark_price` unbounded unrealized_pnl — no result guard	Rust	High
G7	ENTER no is_finite() guard on target_size	Rust	High
G8	`reconcile_slots_json` no dedup or bounds validation	Rust	High
G9	`exchange_order_id` update targets wrong order — exit cancel broken	Rust	High
G10	CANCEL diagnostic always says NO_ACTIVE_EXIT_ORDER	Rust	High
G11	`apply_fill` overwrites intended_size with slot.size	Rust	Medium
G12	No max leverage cap enforced by kernel	Rust	Medium
G13	`resolve_slot` fallback returns unwrap_or(0) — misroutes events	Rust	Medium
G14	`commit_slot` silently ignores out-of-bounds slot_id	Rust	Medium
G15	Zero `__post_init__` validators on all config dataclasses	Config	High
G16	DITA_V2_DEBUG_CLICKHOUSE defaults to True when unset	Config	Info
G17	String config fields — Zinc region injection risk	Config	Medium
G18	`exit_leg_ratios` no sum-to-1 validation	Config	Low
G19	RealZincControlPlane.read() no sequence check — torn-read risk	Config	Low
G20	ClickHouse journal strategy/db env vars — SQL injection risk	Config	Low
G21	entry_price used as exit_price in trade_events — data loss	Persistence	High
G22	active_leg_index → entry_bar semantic mis-mapping	Persistence	Medium
G23	capital_before arithmetic absorbs cross-slot PnL	Persistence	Medium
G24	Recovery trade_reconstruction always has trade_id=""	Persistence	Medium
G25	seen_event_ids, exit_leg_ratios, VenueOrder, metadata not in flat CH tables	Persistence	Low
G26	_safe_float silently converts NaN/None/Inf to 0.0	Persistence	Low
G27	build_launcher_bundle no exception safety — prior resources leak	Lifecycle	High
G28	RealZincPlane/RealZincControlPlane no __del__, SHM orphaned	Lifecycle	High
G29	Zero signal handlers — no cleanup on SIGTERM/SIGINT	Lifecycle	High
G30	ExecutionKernel has no close(), relies on __del__ for Rust handle	Lifecycle	High
G31	Hazelcast projection never closed	Lifecycle	Medium
G32	_maybe_close() break skips second method	Lifecycle	Low
G33	close() not idempotent for RealZinc components	Lifecycle	Low
G34	No context manager on DITAv2LauncherBundle	Lifecycle	Low
G35	BingxVenueAdapter.connect() never called	Lifecycle	Info
G36	Only one try/finally in entire codebase	Lifecycle	High

Pass 4 Severity Distribution

Severity	Count
Critical	4 (G1, G2, G3, G4)
High	13 (G5-G10, G15, G21, G27, G28, G29, G30, G36)
Medium	9 (G11-G14, G17, G22, G23, G24, G31)
Low	8 (G18, G19, G20, G25, G26, G32, G33, G34)
Info	2 (G16, G35)

Combined Catalog (All 4 Passes)

Pass	Focus	Count	Critical	High	Medium	Low	Info
A	Architectural	15	0	2	0	2	11
T	Threading/Atomicity	9	1	3	3	2	0
E	E2E Trace	26	0	4	10	11	1
F	Deep E2E (Pass 3)	30	0	1	8	17	4
G	Domain Scans (Pass 4)	36	4	13	9	8	2
Total		116	5	23	30	40	18

PASS 5 — EDGE DOMAINS (Dependencies, Error Handling, Types, Contracts)

H1: No Python dependency declaration files exist in workspace

Files: workspace root

Zero requirements.txt, setup.py, setup.cfg, pyproject.toml, Pipfile, or poetry.lock anywhere. All Python package dependencies are entirely implicit — determined by what's installed in the runtime environment. No reproducible installs, no version pinning, no audit trail.

The Rust side does have Cargo.toml + Cargo.lock — but all 4 direct Rust deps use open ranges ("0.4", "0.2", "1", "1").

Severity: Critical

H2: Rust kernel compiled from source on every cold start via subprocess

File: rust_backend.py:60-72

def _ensure_library() -> Path:
    path = _library_path()
    if not path.exists():
        _build_library()  # cargo build --release
    return path

def _build_library():
    subprocess.run(
        ["cargo", "build", "--release", ...],
        check=True,        # no timeout!
    )

First load takes 3-10 minutes (Rust compilation). Requires Rust toolchain in production. subprocess.run() has no timeout= — if cargo hangs (network, disk, lock contention), the Python process hangs indefinitely. No prebuilt binary distribution.

Severity: Critical

H3: Zero logging — every swallowed error is invisible

The entire codebase has zero use of Python's logging module, print(), or warnings.warn() for error reporting. Every except: pass, except Exception: pass, and return default silently discards the error. There is no mechanism to detect, alert, or diagnose production failures.

All try/except: pass sites found:

#	File:Line	What's Hidden
1	`bingx_venue.py:51`	`float()` conversion failure on any API field value
2	`bingx_venue.py:133`	regex match failure in rate-limit parsing
3	`bingx_venue.py:136`	int/float conversion of retry_after
4	`bingx_venue.py:325`	slot lookup failure during cancel asset resolution
5	`bingx_venue.py:350`	BingXHttpError in cancel — network error looks like rejection
6	`control.py:213`	RealZincControlPlane construction failure
7	`launcher.py:187`	RealZincPlane construction failure
8	`launcher.py:119`	malformed env var for active_slot_limit
9	`launcher.py:243`	asyncio.run() RuntimeError in _maybe_close
10	`launcher.py:277`	RealZincControlPlane fallback in build_control_plane
11	`real_control_plane.py:97`	region.wait() exception — timeout and error both return False
12	`real_control_plane.py:112`	region.notify() exception — writer thinks broadcast succeeded
13	`real_zinc_plane.py:31`	Zinc SharedRegion import failure
14	`projection.py:87`	HazelcastRowWriter import failure
15	`rust_backend.py:102`	__del__ exception in Rust kernel destroy
16	`bingx_venue.py:55`	`_row_float` tries 5+ key fallbacks, each failing silently

Severity: Critical

H4: `_row_float` rejects zero as a valid value — `or` pattern treats 0 as missing

File: bingx_venue.py:47-55

def _row_float(row, *keys, default=0.0):
    for key in keys:
        try:
            value = float(row.get(key) or 0.0)  # `or 0.0` treats 0 as missing
        except Exception:
            continue
        if value == value and value not in (float("inf"), float("-inf")) and value != 0.0:
            return value                         # explicitly rejects 0.0
    return default

Two bugs: (a) except Exception: continue swallows ALL conversion errors, and (b) value != 0.0 explicitly rejects zero as a valid return value. A legitimate zero price, zero filled quantity, or zero position amount causes _row_float to skip that key and search further. If ALL keys return 0, the default 0.0 is returned — indistinguishable from "none of the keys existed."

Called by every single BingX API response parser: _position_qty(), _position_price(), _venue_order_from_row(), _event_from_row(), _fill_event_from_row(), _events_from_submit(), _events_from_cancel(), _filled_size_from_snapshots(). None of them can tell whether a returned 0.0 is a real zero or a missing value.

Severity: High
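A version that separates "missing" from "zero" can return None when no key yields a usable number. `row_float` below is a hypothetical replacement sketched under that assumption; callers would have to handle the None case explicitly.

```python
import math
from typing import Any, Mapping, Optional


def row_float(row: Mapping[str, Any], *keys: str) -> Optional[float]:
    """Return the first finite value found under keys, or None if none exists.

    Zero is a valid result; None means no key held a usable number.
    """
    for key in keys:
        raw = row.get(key)
        if raw is None or raw == "":
            continue
        try:
            value = float(raw)
        except (TypeError, ValueError, OverflowError):
            continue
        if math.isfinite(value):
            return value
    return None
```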

H5: `_backend_snapshot` timeout returns stale data with no signal to callers

File: bingx_venue.py:242-251

def _backend_snapshot(self, *, timeout_ms=5000.0):
    if not self._snapshot_ready.wait(timeout=timeout_ms / 1000.0):
        with self._snap_lock:
            return self._last_snapshot    # STALE — could be hours old

When the snapshot-fetch condition times out, returns self._last_snapshot — initialized to None and only updated on successful fetches. First timeout returns None. All callers (cancel(), open_orders(), open_positions(), reconcile(), submit()) access .open_orders, .open_positions immediately — crash with AttributeError: 'NoneType' object has no attribute 'open_orders'.

Even after the first fetch succeeds, subsequent timeouts return the last-good snapshot which could be arbitrarily stale. No caller timestamps, version-checks, or requests a refresh.

Severity: High
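Staleness can be made explicit by stamping each snapshot and refusing to serve one that is missing or too old. `SnapshotCache`, `SnapshotUnavailable`, and the 5-second budget are assumptions for this sketch.

```python
import time


class SnapshotUnavailable(RuntimeError):
    """Raised when no snapshot exists or the cached one is too old."""


class SnapshotCache:
    def __init__(self, max_age_s: float = 5.0):  # assumed freshness budget
        self.max_age_s = max_age_s
        self._value = None
        self._stamp = 0.0

    def store(self, value, now=None) -> None:
        self._value = value
        self._stamp = time.monotonic() if now is None else now

    def get(self, now=None):
        now = time.monotonic() if now is None else now
        if self._value is None:
            raise SnapshotUnavailable("no snapshot has been fetched yet")
        age = now - self._stamp
        if age > self.max_age_s:
            raise SnapshotUnavailable(f"snapshot is {age:.1f}s old")
        return self._value
```

Raising a dedicated exception replaces the AttributeError on None with an error that callers can catch and act on.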

H6: All enum-from-raw-string sites crash on unknown value — zero fallback

Files: rust_backend.py:250-386, real_zinc_plane.py:70-106

Every site that reconstructs a Python enum from a string received from the Rust kernel:

side=TradeSide(str(payload.get("side", TradeSide.FLAT.value)))
status=VenueOrderStatus(str(payload.get("status", VenueOrderStatus.NEW.value)))
fsm_state=TradeStage(str(payload.get("fsm_state", TradeStage.IDLE.value)))
kind=KernelEventKind(str(row.get("kind", KernelEventKind.ORDER_ACK.value)))

If the Rust kernel introduces a new enum variant (e.g., TradeStage::ENTRY_REJECTED) not in the Python TradeStage enum, TradeStage("ENTRY_REJECTED") raises ValueError with zero fallback. Crashes _outcome_from_payload() and takes down the kernel's event processing loop.

17 sites total across rust_backend.py and real_zinc_plane.py. No try/except, no mapping, no fallback on any of them.

Severity: High
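A single helper can cover all of these sites. In the sketch below, `parse_enum` is a hypothetical helper and `UNKNOWN` is a hypothetical sentinel member that the real enums would need to gain; the TradeStage shown is a trimmed stand-in.

```python
from enum import Enum
from typing import Type, TypeVar

E = TypeVar("E", bound=Enum)


class TradeStage(Enum):
    IDLE = "IDLE"
    POSITION_OPEN = "POSITION_OPEN"
    UNKNOWN = "UNKNOWN"  # hypothetical sentinel, not in the source enum


def parse_enum(enum_cls: Type[E], raw: object, fallback: E) -> E:
    """Convert raw to enum_cls, returning fallback for unrecognized values."""
    try:
        return enum_cls(str(raw))
    except ValueError:
        return fallback
```

Whether an unknown variant should map to a sentinel or be rejected with a diagnostic is a policy decision; the sketch only shows how to avoid an unhandled ValueError.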

H7: `_legacy_intent` reads `getattr(intent, "order_type", "MARKET")` — always defaults to MARKET

File: bingx_venue.py:282-285

metadata["_order_type"] = getattr(intent, "order_type", "MARKET")
metadata["_limit_price"] = float(getattr(intent, "limit_price", 0.0) or 0.0)

order_type and limit_price are NOT fields on KernelIntent (contracts.py). They only exist in intent.metadata as metadata["order_type"] if set by the caller. getattr(intent, "order_type", "MARKET") checks the dataclass field — not the metadata dict — so it ALWAYS returns "MARKET".

Even when the PINK runtime produces a LIMIT intent (LIMIT_DECISION → metadata["order_type"] = "LIMIT"), the legacy adapter converts it to MARKET because it reads the wrong source. Every LIMIT order is submitted as MARKET.

Similarly, limit_price is always 0.0 — any limit price from the metadata dict is lost.

Severity: High
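Reading from the metadata dict instead of the dataclass attributes could be sketched as follows. `Intent` is a hypothetical stand-in for KernelIntent, the "limit_price" metadata key is an assumption by analogy with "order_type", and rejecting a LIMIT intent without a price is an assumed policy.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass
class Intent:
    """Hypothetical stand-in for KernelIntent: order fields live in metadata."""
    metadata: Dict[str, Any] = field(default_factory=dict)


def order_params(intent: Intent) -> Tuple[str, float]:
    """Return (order_type, limit_price) taken from intent.metadata."""
    meta = intent.metadata or {}
    order_type = str(meta.get("order_type", "MARKET")).upper()
    limit_price = float(meta.get("limit_price") or 0.0)
    if order_type == "LIMIT" and limit_price <= 0.0:
        raise ValueError("LIMIT intent without a positive limit_price")
    return order_type, limit_price
```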

H8: `_venue_event_status_from_row` silently maps unknown venue status to ACKED

File: bingx_venue.py:83-96

def _venue_event_status_from_row(status: str) -> VenueEventStatus:
    normalized = _normalize_status(status)
    # ... checks known statuses ...
    return VenueEventStatus.ACKED  # fallthrough for anything unknown

If BingX introduces a new status ("SUSPENDED", "PENDING_CANCEL", "EXPIRED"), it doesn't match any known mapping and silently returns ACKED. The kernel treats a suspended/cancelled/expired order as acknowledged — dangerous misclassification.

Severity: High

H9: `RealZincPlane.write_slot()` — slot written to `slot_id >= slot_count` is invisible

File: real_zinc_plane.py:206-210

def write_slot(self, slot):
    with self._lock:
        self._slot_cache[int(slot.slot_id)] = slot
        payload = {"slots": [self._slot_cache[key].to_dict() for key in range(self._slot_count)]}

_slot_cache is a plain dict — accepts any key. But read_slots() only reads 0..slot_count-1. Writing to slot_id >= slot_count stores the slot in the cache but it's never serialized or read back. No error.

Severity: High

H10: `RealZincControlPlane.read()` has no atomicity with concurrent `update()`

File: real_control_plane.py:70-77

_write_region() zero-fills the buffer then writes the packet. If read() interleaves between zero-fill and write, it sees a partially-zeroed buffer → _decode_packet returns {} → returns stale self._snapshot with no observable error. No lock, no sequence check, no atomic read.

The same bug exists in RealZincPlane.read_slots() (real_zinc_plane.py:220-230) — reads shared memory while a concurrent write_slot() is in progress.

Severity: High

H11: `_RustKernelLib` lazily initialized with race condition

File: rust_backend.py:187-190

_RUST: _RustKernelLib | None = None

def _get_rust():
    global _RUST
    if _RUST is None:
        _RUST = _RustKernelLib()  # no lock — two threads can both create
    return _RUST

No threading lock. Two concurrent calls to _get_rust() (possible via BingxVenueAdapter's thread pool) can create two _RustKernelLib objects. The _RustKernelLib() constructor runs _ensure_library() which runs subprocess.run(["cargo", "build", ...], check=True) — concurrent cargo build can corrupt the build directory.

Severity: High

H12: `ExecutionKernel.__del__` can deadlock or use-after-free

File: `rust_backend.py:527-531`

def __del__(self):
    backend = getattr(self, "_backend", None)
    if backend is not None:
        try:
            _get_rust().destroy(backend)  # accesses module singleton
        except Exception:
            pass

_get_rust() accesses the module-level _RUST singleton, which may already be destroyed if the module's garbage collection runs before the instance's. The destroy call happens outside any lock — one thread's destructor could destroy the Rust kernel while another thread is still using it. Use-after-free.

Severity: High
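
A common alternative to relying on `__del__` is an explicit, idempotent `close()` plus context-manager support. The sketch below assumes a hypothetical `KernelHandle` wrapper whose `destroy` callable stands in for the FFI destructor. It guarantees at most one destroy call per handle, but it does not by itself stop another thread from using the handle mid-call.

```python
import threading


class KernelHandle:
    """Hypothetical wrapper; `destroy` stands in for the FFI destructor."""

    def __init__(self, backend, destroy):
        self._backend = backend
        self._destroy = destroy
        self._lock = threading.Lock()

    def close(self):
        # Swap the handle out under the lock so only one caller sees it.
        with self._lock:
            backend, self._backend = self._backend, None
        if backend is not None:
            self._destroy(backend)

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()
        return False
```

Usage would be `with KernelHandle(backend, destroy) as handle: ...`, so teardown happens at a known point instead of during garbage collection.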

H13: `MirroredControlPlane` missing protocol methods

File: `control.py:171-184`

ControlPlane protocol defines wait() and notify(). MirroredControlPlane inherits from nothing and only implements read(), update(), and mirror(). Calling plane.wait() on a MirroredControlPlane raises AttributeError.

Severity: Medium

H14: `TradeSlot.remaining_size()` and `VenueOrder.remaining_size()` — same name, different semantics

Files: `contracts.py:207-208`, `contracts.py:143-145`

# TradeSlot:
def remaining_size(self) -> float:
    return max(0.0, float(self.size))  # open position size

# VenueOrder:
def remaining_size(self) -> float:
    return max(0.0, self.intended_size - self.filled_size)  # unfilled order qty

Same method name, completely different semantics. TradeSlot.remaining_size() returns the current open position size. VenueOrder.remaining_size() returns the unfilled order quantity. A caller that uses slot.remaining_size() to check whether an order is fully filled gets the position size instead, which moves with entries and exits rather than with how much of a given order is still unfilled.

Severity: Medium

H15: `_maybe_close()` — `asyncio.run()` RuntimeError silently swallowed for coroutines

File: `launcher.py:233-243`

if inspect.isawaitable(result):
    try:
        asyncio.run(result)
    except RuntimeError:
        pass  # SILENT — coroutine never executed

When _maybe_close is called from an async context (which it is: DITAv2LauncherBundle.close() is used in async test code), asyncio.run() raises RuntimeError("asyncio.run() cannot be called from a running event loop"). The exception is swallowed, the coroutine is never awaited, and the close/disconnect never happens.

Also: break after calling the first found method means if an object has both close() and disconnect(), disconnect() is never called.

Severity: Medium
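
A sketch of how H15 could be handled: check for a running loop first, schedule the awaitable when one exists, and call every close-style method rather than stopping at the first. `maybe_close` and `_await` are illustrative names, not the launcher's actual helpers.

```python
import asyncio
import inspect


async def _await(awaitable):
    # asyncio.run() needs a coroutine, so wrap arbitrary awaitables.
    return await awaitable


def maybe_close(obj):
    """Call close() and disconnect() if present; never drop an awaitable."""
    scheduled = []
    for name in ("close", "disconnect"):
        method = getattr(obj, name, None)
        if not callable(method):
            continue
        result = method()
        if not inspect.isawaitable(result):
            continue
        try:
            asyncio.get_running_loop()
        except RuntimeError:
            asyncio.run(_await(result))          # no loop: run to completion
        else:
            scheduled.append(asyncio.ensure_future(result))  # loop running
    return scheduled  # async callers should await these
```

In async code the caller is expected to `await asyncio.gather(*maybe_close(obj))`; in sync code the returned list is empty because everything already ran.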

H16: `_build_launcher_bundle` imports `BingxDirectExecutionAdapter` inside function — import-time side effect is safe but lazy loading masks errors

File: `launcher.py:254`

def _build_venue(...):
    from prod.clean_arch.adapters.bingx_direct import BingxDirectExecutionAdapter

Import inside function — safe, lazy, no side effects. But if the bingx_direct module has an import error (missing dependency, version mismatch), it only surfaces at bundle construction time, not at process start. A misconfigured production deployment would fail on the first trade, not on boot.

Severity: Informational

H17: `load_dotenv()` at module level — import-time filesystem I/O and env mutation

File: `launcher.py:49-51`

load_dotenv(PROJECT_ROOT / ".env")  # executes on module import

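Runs on every import of launcher.py: it reads the filesystem and mutates the process environment. This is hard to control in tests. Assuming this is python-dotenv's load_dotenv with its default override=False, variables already set by test setup are kept, but every other key in .env is injected at import time, so a test cannot simulate a missing variable without patching the loader before import. Also: if .env doesn't exist, load_dotenv() silently does nothing, so missing config is invisible.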

Severity: Medium

H18: `_run()` in `BingxVenueAdapter` — `asyncio.run()` thread-pool bridge blocks on every call

File: `bingx_venue.py:225-233`

def _run(self, result):
    if inspect.isawaitable(result):
        try:
            asyncio.get_running_loop()
        except RuntimeError:
            return asyncio.run(result)
        pool = self._get_executor()
        return pool.submit(asyncio.run, result).result()  # BLOCKS

Every call to _run() that receives an awaitable blocks the calling thread via .result(). The BingX HTTP call inside submit_intent() can take 1-5 seconds. During this block, the event loop cannot process other tasks. In a single-runtime deployment, this stalls the entire policy cycle.

Severity: Medium

H19: `HazelcastClientLike` protocol has zero concrete implementations in workspace

File: `hazelcast_projection.py:13-15`

class HazelcastClientLike(Protocol):
    def get_map(self, name: str): ...
    def get_topic(self, name: str): ...

Used as a type hint. No code in the workspace creates an object that satisfies this protocol. The Hazelcast client comes from an external package. If the external API changes, the protocol silently drifts — no compilation check.

Severity: Low

H20: `_decode_packet` in RealZinc — no bound check on `size` beyond `> len(buf)-16`

Files: `real_control_plane.py:50-52`, `real_zinc_plane.py:70-81`

seq, size = struct.unpack_from("!QQ", buf, 0)
if size <= 0 or size > len(buf) - 16:
    return {}
payload = bytes(buf[16 : 16 + size]).decode("utf-8")  # can raise UnicodeDecodeError
out = json.loads(payload)  # can raise ValueError

If shared memory contains a corrupted size field within bounds, .decode() or json.loads() raises — uncaught by callers. A single corrupted byte in shared memory crashes the kernel.

Severity: Low
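
A defensive variant of the decode step for H20, sketched as a standalone function. It relies on the fact that both `UnicodeDecodeError` and `json.JSONDecodeError` are subclasses of `ValueError`. The function name and the "return `{}` on any malformed input" policy are assumptions that mirror the existing out-of-bounds behaviour.

```python
import json
import struct


def decode_packet(buf):
    """Return the decoded payload dict, or {} for any malformed region."""
    if len(buf) < 16:
        return {}
    _seq, size = struct.unpack_from("!QQ", buf, 0)
    if size == 0 or size > len(buf) - 16:
        return {}
    try:
        out = json.loads(bytes(buf[16:16 + size]).decode("utf-8"))
    except ValueError:  # UnicodeDecodeError and JSONDecodeError both land here
        return {}
    return out if isinstance(out, dict) else {}
```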

H21: All Rust crate features enabled by default — `wasm-bindgen` compiled into native shared library

File: _rust_kernel/Cargo.toml, transitive through chrono → iana-time-zone → js-sys → wasm-bindgen

The Rust kernel is a native .so/.dylib but chrono's iana-time-zone pulls in js-sys and wasm-bindgen (WebAssembly support) even on native Linux. Larger binary, longer compile times. cc crate pulled in for iana-time-zone-haiku which only compiles on Haiku OS.

Severity: Low

H22: `socket.getaddrinfo` monkey-patch in test generator code

File: `gen2.py:295-298`

Monkey-patches Python stdlib socket.getaddrinfo to force IPv4 as a workaround for IPv6 resolution failure in the deployment environment. If copied to production code, would break IPv6 connectivity.

Severity: Low

Pass 5 Summary

#	Flaw	Layer	Severity
H1	No Python dependency files (requirements.txt, pyproject.toml, etc.)	Build	Critical
H2	Rust kernel compiled from source on every cold start — no prebuilt binary	Build	Critical
H3	Zero logging — 16+ silent except:pass sites, no error observability	All	Critical
H4	`_row_float` rejects zero as valid, `except Exception: continue` swallows all	Venue	High
H5	`_backend_snapshot` timeout returns stale data/None — callers crash	Venue	High
H6	All enum-from-raw-string sites crash on unknown variant (17 sites)	Bridge	High
H7	`_legacy_intent` reads `getattr(intent, "order_type")` not metadata — always MARKET	Venue	High
H8	Unknown venue status silently mapped to ACKED	Venue	High
H9	`RealZincPlane.write_slot()` `slot_id >= slot_count` silently lost	Zinc	High
H10	`RealZincControlPlane.read()` no atomicity with concurrent `update()`	Control	High
H11	`_RustKernelLib` lazy init with race condition — concurrent cargo build	Bridge	High
H12	`ExecutionKernel.__del__` use-after-free on Rust handle	Bridge	High
H13	`MirroredControlPlane` missing protocol methods (wait/notify)	Control	Medium
H14	`TradeSlot.remaining_size` vs `VenueOrder.remaining_size` — different semantics	Contracts	Medium
H15	`_maybe_close` asyncio.run RuntimeError silently swallowed	Launcher	Medium
H16	Lazy import of bingx_direct masks config errors until first trade	Build	Info
H17	`load_dotenv()` at module level — import-time I/O side effect	Launcher	Medium
H18	`_run()` blocks event loop on every HTTP call via thread pool	Venue	Medium
H19	`HazelcastClientLike` protocol has zero concrete implementations	Projection	Low
H20	`_decode_packet` uncaught UnicodeDecodeError/ValueError on corrupted SHM	Zinc	Low
H21	`wasm-bindgen` compiled into native library unnecessarily	Build	Low
H22	`socket.getaddrinfo` monkey-patch in test code	Test	Low

Pass 5 Severity Distribution

Severity	Count
Critical	3 (H1, H2, H3)
High	9 (H4-H12)
Medium	5 (H13, H14, H15, H17, H18)
Low	4 (H19, H20, H21, H22)
Info	1 (H16)

Combined Catalog (All 5 Passes)

Pass	Focus	Count	Critical	High	Medium	Low	Info
A	Architectural	15	0	2	0	2	11
T	Threading/Atomicity	9	1	3	3	2	0
E	E2E Trace (Pass 1)	26	0	4	10	11	1
F	Deep E2E (Pass 3)	30	0	1	8	17	4
G	Domain Scans (Pass 4)	36	4	11	11	8	2
H	Edge Domains (Pass 5)	22	3	9	5	4	1
Total		138	8	30	37	44	19

PASS 6 — MATH, TESTS, CONCURRENCY, RECOVERY, SECURITY

I1: Entry `apply_fill` sets `slot.size = fill_size` — multiple partial fills overwrite instead of accumulating

File: _rust_kernel/src/lib.rs:798

// Entry fill path in apply_fill:
slot.size = fill_size;          // DIRECT ASSIGNMENT
slot.initial_size = slot.initial_size.max(fill_size);  // max, not sum

If a single entry order receives multiple partial fills (e.g., LIMIT order on the book):

Fill #1: fill_size = 0.5 → slot.size = 0.5, initial_size = max(0, 0.5) = 0.5
Fill #2: fill_size = 0.3 → slot.size = 0.3, initial_size = max(0.5, 0.3) = 0.5

After both fills, the actual position is 0.8 but slot.size reports 0.3. The position is under-counted by 0.5 — 62.5% error.

The exit path correctly does slot.size = (slot.size - fill_size).max(0.0) (subtractive). The entry path should accumulate: slot.size += fill_size.

This only manifests with LIMIT orders that receive multiple partial fills over time, a scenario entirely absent from tests (I6).

Severity: Critical
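
The arithmetic in I1 can be reproduced with a small Python model of the two update rules. This is a model of the described Rust behaviour, not the kernel code, and both function names are made up for the illustration.

```python
def entry_fills_overwrite(fills):
    """Model of the current rule: assign size, take max for initial_size."""
    size = 0.0
    initial_size = 0.0
    for fill_size in fills:
        size = fill_size
        initial_size = max(initial_size, fill_size)
    return size, initial_size


def entry_fills_accumulate(fills):
    """Model of the proposed rule: sum fills into both fields."""
    size = 0.0
    for fill_size in fills:
        size += fill_size
    return size, size


fills = [0.5, 0.3]
reported, _ = entry_fills_overwrite(fills)
actual, _ = entry_fills_accumulate(fills)
undercount = (actual - reported) / actual  # fraction of position not reported
```

With the fills from the example, `reported` is 0.3, `actual` is 0.8 up to float rounding, and `undercount` is 0.625.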

I2: `exit_ratio = 0.0` creates zero-size exit order — slot stuck in EXIT_REQUESTED

File: _rust_kernel/src/lib.rs:467-469

let exit_ratio = slot.next_exit_ratio();         // returns 0.0 from exit_leg_ratios=[0.0, ...]
let base_size = if slot.initial_size > 0.0 { ... } else { slot.size };
let exit_size = (base_size * exit_ratio).max(0.0); // = 0.0

When exit_leg_ratios contains 0.0 in any position, exit_size = 0.0. The zero-size exit order is submitted to the venue (intended_size = 0). On the fill side, realized_pnl() returns 0.0 (guarded by exit_size <= 0.0), and slot.size is unchanged. The slot stays in EXIT_REQUESTED with no means to advance — the leg is consumed but nothing happened. Subsequent exits may eventually handle this, but the zero-size leg is a wasted FSM transition that leaves the slot in a confusing intermediate state.

Also: NaN in exit_leg_ratios (from clamp(0.0, 1.0) not guarding NaN, though serde_json rejects NaN) would produce the same zero-size exit behavior.

Severity: Medium
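
One way to keep zero or NaN ratios out of the FSM is to validate `exit_leg_ratios` where the slot is configured. The helper below is a hypothetical sketch. The accepted range of (0, 1] is an assumption, and empty input is passed through unchanged.

```python
import math


def validate_exit_leg_ratios(ratios):
    """Return ratios as a tuple of floats, rejecting 0, NaN, inf, and out-of-range values."""
    cleaned = tuple(float(r) for r in ratios)
    for r in cleaned:
        if not math.isfinite(r) or not 0.0 < r <= 1.0:
            raise ValueError(f"invalid exit leg ratio: {r!r}")
    return cleaned
```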

I3: `entry_price` inconsistency — Python uses falsy check, Rust uses `<= 0.0`

File: contracts.py:88-98 (Python), _rust_kernel/src/lib.rs:227-228 (Rust)

# Python TradeSlot.mark_price():
self.entry_price = self.entry_price or price   # falsy — keeps -0.5, 0.0 replaced

# Rust TradeSlot::mark_price():
if self.entry_price <= 0.0 { self.entry_price = price; }  // catches -0.5, replaces it

If entry_price is negative (possible only via set_slot_json direct injection — not from normal trading), Python keeps it and computes unrealized_pnl with wrong sign. Rust replaces it. The Python-side mark_price is only called from ExecutionKernel.mark_price() in rust_backend.py:LOW-1, which never writes back to the Rust kernel — so the Python-side calculation is purely local and the inconsistency has no effect on the Rust kernel's canonical state. However, the observe_slots call after mark_price re-reads from the Rust kernel, which recomputes PnL correctly. The Python-side mark_price is effectively wasted computation that never feeds back.

Severity: Informational

I4: No Rust unit tests for 99% of kernel functionality

File: _rust_kernel/src/lib.rs:1731-1765

Only 1 Rust test exists: enter_then_ack_fill — creates a 2-slot kernel, submits ENTER, sends ACK, asserts state transitions.

Not tested in Rust:

EXIT, CANCEL, MARK_PRICE, RECONCILE, CONTROL actions
Any FILL event (PARTIAL, FULL)
CANCEL_ACK, CANCEL_REJECT, ORDER_REJECT
RATE_LIMITED handling
Multi-leg exits
consume_exit_leg edge cases
realized_pnl() formula with boundary values
mark_price() with extreme values
resolve_slot() fallback path
reconcile_slots_json dedup/overflow
Any C FFI boundary function
Any serde deserialization failure
Null pointer handling

No #[cfg(test)] module exists — the single test is inline. No Rust integration tests (tests/ directory).

Severity: High

I5: `MockVenueScenario` rejection flags exist but zero tests use them

File: mock_venue.py:23-35

@dataclass
class MockVenueScenario:
    reject_entries: bool = False
    reject_exits: bool = False
    cancel_reject: bool = False

Three boolean flags to simulate venue rejection of orders. Not a single test in test_flaws.py sets any of them to True. The ORDER_REJECT handler in the Rust kernel's on_venue_event exists (lib.rs lines ~1440-1460) but is never exercised by any test.

Similarly, entry_partial_fill_ratio and exit_partial_fill_ratio exist on MockVenueScenario but only one test (test_cancel_entry_with_partial_fill) uses partial fills at all — and it only checks size > 0, not the full capital-accrual chain.

Severity: High

I6: No LIMIT order test through the full kernel path

The test suite has zero LIMIT orders. The Rust kernel has no reachable LIMIT-specific logic either: all orders are treated as MARKET. The generated live tests have limit_does_not_fill and limit_immediate_fill scenario placeholders, but:

limit_does_not_fill uses reference_price=0.0 (not a real LIMIT order)
limit_immediate_fill uses target_size=-0.001 (negative size → clamped to 0.0)

Neither scenario actually submits a LIMIT order with order_type="LIMIT" and a non-zero limit_price. The _legacy_intent bug (H7) would convert any LIMIT attempt to MARKET anyway.

The only LIMIT-related code is the Rust kernel's if intent.order_type == "LIMIT" branches (lib.rs:503, 1584), which are dead code in practice: KernelIntent doesn't have an order_type field that serde would populate.

Severity: High

I7: Three weak/vacuous assertions in `test_flaws.py`

File: test_flaws.py

Line 512: assert order.metadata.get("asset") is not None or order.metadata.get("slot_id") is not None — mock venue always sets both, this can never fail.
Line 700: test_pnl_warning_on_unsettled_reentry — titled to assert a warning is raised but only checks r.accepted. Never checks diagnostic_code or verifies the warning was issued.
Line 318: assert slot.active_entry_order is None or slot.active_entry_order.status == VenueOrderStatus.FILLED — the or allows two different scenarios to pass, reducing diagnostic power.

Severity: Low

I8: `slot.size = fill_size` entry overfill no guard

File: _rust_kernel/src/lib.rs:798

Already noted in I1 — entry fill sets slot.size directly to fill_size. Unlike exit fill which has (slot.size - fill_size).max(0.0), there's no guard against entry overfill (venue fills more than the intended order size). For MARKET orders this is fine (one fill per order), but for LIMIT orders with multiple partial fills, the accumulated fill could exceed initial_size.

Severity: Low (only relevant with LIMIT + partial fills, which don't exist in the codebase)

I9: No crash durability — slot state is pure in-memory until step 7 of process_intent

File: rust_backend.py:470-560

The process_intent sequence:

1. validate → 2. Rust FSM → 3. venue.submit() → 4. on_venue_event() → 5. projection → 6. zinc_plane

If the process crashes between steps 2-5, the slot state accumulated in the Rust kernel's in-memory KernelCore is completely lost. The Rust kernel has no WAL, no journal, no persistent store. On restart, ExecutionKernel.__init__ creates a fresh KernelCore with all slots IDLE.

The crash between step 3 and step 5 is the most dangerous: the exchange has an open order/position, but the kernel has no record of it. On restart:

The Rust kernel sees slot.fsm_state = IDLE
The Zinc slot cache may or may not have the pre-crash state (depends on timing)
No code on restart loads Zinc state back into the Rust kernel (I14)
The exchange order lives until it fills (unexpected position) or is manually cancelled

Concrete example: venue.submit() sends a POST to BingX and the order is placed. The HTTP response arrives, and on_venue_event(ORDER_ACK) transitions the slot to ENTRY_WORKING. The process then crashes between returning from on_venue_event and zinc_plane.write_slot(). On restart the slot is IDLE, there is no active entry order, and _last_settled_pnl is reset, while the exchange still has a live working entry order. The fresh kernel does not know that order exists, so the next process_intent(ENTER) is not rejected with SLOT_BUSY: the kernel sees the slot as IDLE and allows a new ENTER. When the old order fills on the exchange, the result is a double position.

Severity: Critical

I10: `seen_event_ids` lost on restart — events replayed after restart are double-processed

File: _rust_kernel/src/lib.rs:672-683

seen_event_ids is per-slot and per-KernelCore instance, held purely in process memory. On restart with a fresh KernelCore, every slot has seen_event_ids = Vec::new(). If events are replayed (from pump_venue_events() calling venue.reconcile(), which re-fetches exchange state):

Original run: order fills → FULL_FILL with event_id = "EV-00000042" → processed, slot → POSITION_OPEN
Crash
Restart: fresh KernelCore, seen_event_ids empty
pump_venue_events() fetches same exchange state → new VenueEvent objects with new event IDs (adapter's _event_seq resets)
Rust kernel sees these as novel events — processes them again
Position is double-booked, PnL double-settled

The bingx_venue._event_seq is an instance-level itertools.count() starting from 1. On adapter restart, it resets — so the new event IDs won't match the old ones anyway. Dedup is fundamentally impossible across restarts.

Severity: Critical

I11: No idempotency key (`newClientOrderId`) sent to BingX

File: bingx_venue.py:282-285, bingx_direct.py (external)

BingX supports newClientOrderId for order idempotency — sending the same ID twice returns the original order status instead of creating a duplicate. The DITAv2 kernel passes intent.intent_id as decision_id to the legacy adapter, but there's no guarantee this maps to newClientOrderId in the BingX payload.

If the HTTP POST to /trade/order times out before the response is read:

The order was placed on the exchange
_call_backend raises a BingxHttpError (or similar network exception)
process_intent() propagates the exception — no retry
Next cycle: caller may retry with a new intent_id
Second POST creates a second order on the exchange — duplicate position

Without a client-order-id that persists across retries, the system can create duplicate orders on network timeouts. The exchange has no way to deduplicate.

Severity: High
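
A sketch of a deterministic client order id for I11, derived from the intent id so that a retry of the same intent produces the same key. The `dita` prefix, the 24-character digest length, and the function name are illustrative assumptions, and any length or character limits BingX imposes on newClientOrderId would need checking against its documentation.

```python
import hashlib


def client_order_id(intent_id, prefix="dita"):
    """Stable id: the same intent_id always maps to the same string."""
    digest = hashlib.sha256(intent_id.encode("utf-8")).hexdigest()[:24]
    return f"{prefix}-{digest}"
```

This only helps if the retry reuses the original intent_id. As the list above notes, a caller that mints a new intent_id on retry defeats the key.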

I12: No graceful degradation for ANY subsystem

Every subsystem failure mode examined:

Subsystem	Failure	Current behavior
Zinc SHM init	Corrupted region, OOM	Silent fallback to InMemoryZincPlane (no operator signal)
Zinc SHM write	Region overflow, write error	Unhandled exception → kernel crashes
Hazelcast write	Cluster unavailable	`.put()` raises → unhandled exception → kernel crashes
ClickHouse journal	Sink failure	Exception propagates (no try/except in callers)
BingX HTTP	Timeout, rate limit	Exception or REJECTED → slot stuck in ORDER_REQUESTED
Rust kernel	Null pointer from FFI	`_take_string` raises RuntimeError → kernel crash
Memory pressure	OOM	Process killed by the OS kernel; no signal handlers are installed

No subsystem has a graceful degradation path. No circuit breaker, no retry queue, no fallback to log-only mode, no offline/cached trading mode. Every failure (except the two init-time silent fallbacks) crashes the current kernel operation.

Severity: High

I13: Stray venue event can reactivate a CLOSED slot — no guard

File: _rust_kernel/src/lib.rs:625+

The on_venue_event function has no guard for closed slots:

fn on_venue_event(&mut self, event: VenueEvent) -> KernelResult {
    // ... resolve slot, check duplicates ...
    // NO: if slot.closed { return ... }
    let prev_state = slot.fsm_state.clone();
    match event.kind {
        SOME_EVENT_KIND => { /* transitions regardless of closed state */ }
    }
}

If a stray venue event arrives for a CLOSED slot:

ORDER_ACK → sets ENTRY_WORKING — slot re-opens from CLOSED
FULL_FILL → apply_fill runs → slot.size = fill_size, fsm_state = POSITION_OPEN
ORDER_REJECT → clears trade_id, asset, sets IDLE — actually benign reset

A CLOSED slot should be a terminal state that rejects all events. Currently only CANCEL_ACK leaves a closed slot untouched: ORDER_REJECT resets it to IDLE, and ORDER_ACK or FULL_FILL can revive a dead position.

Severity: High
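
The missing guard for I13 can be illustrated with a small Python model of the event handler. The `Slot` dataclass, the string state names, and the `SLOT_CLOSED` diagnostic are simplifications introduced here. The real handler is Rust and has more event kinds.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    fsm_state: str = "IDLE"
    closed: bool = False
    size: float = 0.0


def on_venue_event(slot, kind, fill_size=0.0):
    """Model handler: a closed slot is terminal and ignores every event."""
    if slot.closed:
        return "SLOT_CLOSED"  # hypothetical diagnostic code
    if kind == "ORDER_ACK":
        slot.fsm_state = "ENTRY_WORKING"
    elif kind == "FULL_FILL":
        slot.size = fill_size
        slot.fsm_state = "POSITION_OPEN"
    return "OK"
```

Returning a distinct diagnostic keeps the stray event visible to operators while leaving the slot untouched.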

I14: No `reconcile_from_slots` call on startup — Zinc state never loaded into Rust kernel

Files: rust_backend.py:435-465 (init), real_zinc_plane.py:95-115 (init)

On restart:

1. RealZincPlane.__init__ reads state from Zinc shared memory into _slot_cache
2. ExecutionKernel.__init__ creates a fresh KernelCore (all slots IDLE)
3. KernelStateView(self) reads from the fresh kernel
4. account.observe_slots([self._get_slot(i) for i in range(max_slots)]) sees all slots IDLE

Steps 3 and 4 read from the Rust kernel, NOT from Zinc. The Zinc _slot_cache populated in step 1 is never loaded into the Rust kernel. The reconcile_on_restart flag exists in KernelControlSnapshot (default True) but is never checked anywhere in ExecutionKernel.__init__ or the launcher.

The system always starts with a blank state even when durable shared memory state exists.

Severity: High

I15: CANCEL_REJECT doesn't clear `active_exit_order` — slot stuck in EXIT_WORKING

File: _rust_kernel/src/lib.rs:1165-1175

KernelEventKind::CANCEL_REJECT => {
    if slot.fsm_state == TradeStage::EXIT_WORKING {
        // stays EXIT_WORKING — no state transition
        // active_exit_order remains attached
    }
    diagnostic_code = KernelDiagnosticCode::CANCEL_REJECTED;
}

When the exchange rejects a cancel (typically because the order was already filled or no longer exists), the slot stays in EXIT_WORKING with active_exit_order still attached. Every subsequent CANCEL attempt hits the same path — the exchange returns "order not found," the kernel sees CANCEL_REJECT, and the slot is stuck forever.

If the order was already filled (CANCEL_REJECT means "can't cancel, no longer open"), the slot should check the actual position size and potentially transition to POSITION_OPEN or CLOSED depending on fill status.

Severity: Medium

I16: Zinc shared memory — world-readable/writable by same-machine processes

Files: real_control_plane.py, real_zinc_plane.py

The Zinc shared memory regions are created with these names:

self.region_name = f"{base}_intent"       # e.g., "dita_v2_intent"
self.state_name = f"{base}_state"          # "dita_v2_state"
self.control_name = f"{base}_control"      # "dita_v2_control"

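Region names are predictable (prefix defaults to "dita_v2"). The SharedRegion uses POSIX shm_open, so the effective permissions depend on the mode passed at creation and on the process umask (typically 0644 or 0600). Any process on the same machine that those permissions admit (at minimum, every process running as the same user) can: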

Read: Open the region → as_buffer() → _decode_packet() → read all slot state, PnL, open orders, control settings
Write: Open the region → forge a packet (struct.pack("!QQ", seq, len) + json_bytes) → overwrite slot state, inject fake intents, modify control plane

No access control, no encryption, no integrity check (HMAC/signature) on the wire format. The sequence number is the only ordering mechanism, and it's trivially predictable.

Severity: High
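
As a sketch of the integrity check I16 says is missing, a packet can carry an HMAC-SHA256 tag over the header and body. The `seal` and `open_sealed` names, the trailing-tag layout, and the existence of a shared secret are all assumptions. This adds tamper detection only: it does not hide the contents from readers, and key distribution is out of scope.

```python
import hashlib
import hmac
import json
import struct

TAG_LEN = 32  # SHA-256 digest size in bytes


def seal(seq, payload, key):
    """Build header + body + HMAC tag; layout is hypothetical."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    header = struct.pack("!QQ", seq, len(body))
    tag = hmac.new(key, header + body, hashlib.sha256).digest()
    return header + body + tag


def open_sealed(packet, key):
    """Return (seq, payload) if the tag verifies, else None."""
    if len(packet) < 16 + TAG_LEN:
        return None
    seq, size = struct.unpack_from("!QQ", packet, 0)
    end = 16 + size
    if end + TAG_LEN > len(packet):
        return None
    expected = hmac.new(key, packet[:end], hashlib.sha256).digest()
    if not hmac.compare_digest(expected, packet[end:end + TAG_LEN]):
        return None
    return seq, json.loads(packet[16:end].decode("utf-8"))
```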

I17: `KernelSlotView` exposes full slot state via unrestricted `getattr`/`setattr`

File: rust_backend.py:411-460

class KernelSlotView:
    def __getattr__(self, name):
        slot = self._snapshot()
        return getattr(slot, name)         # read ANY field

    def __setattr__(self, name, value):
        slot = self._snapshot()            # fetch the current slot (line missing from the excerpt)
        setattr(slot, name, value)
        self._kernel._set_slot(slot)       # write ANY field, bypassing the FSM

Any code with a KernelSlotView reference can:

Read all slot fields: trade_id, size, entry_price, unrealized_pnl, realized_pnl, seen_event_ids, metadata
Write all slot fields: slot_view.realized_pnl = -9999999 — directly manipulates PnL figures flowing into capital settlement

The _set_slot call writes through to the Rust kernel without any FSM validation. The entire kernel state is exposed through mutable Python objects with zero access control.

Severity: High
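
A read-only view is one way to remove the I17 write path. In the sketch below, `ReadOnlySlotView` and its `snapshot_fn` argument are hypothetical. Reads still pass through to the latest snapshot, while every attribute assignment is refused.

```python
class ReadOnlySlotView:
    """Hypothetical replacement view: reads pass through, writes are refused."""

    __slots__ = ("_snapshot_fn",)

    def __init__(self, snapshot_fn):
        # Bypass our own __setattr__ for the single internal field.
        object.__setattr__(self, "_snapshot_fn", snapshot_fn)

    def __getattr__(self, name):
        # Only called for names not found on the view itself.
        return getattr(self._snapshot_fn(), name)

    def __setattr__(self, name, value):
        raise AttributeError(
            f"slot field {name!r} is read-only; submit an intent instead"
        )
```

State changes would then have to go through the FSM entry points rather than direct field writes.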

I18: `sys.path.insert(0, ...)` at import time in two production files and three test/generator files

Files: real_control_plane.py:14, real_zinc_plane.py:22, test_flaws.py:13, _build_pink_bodies.py:2, _gen_test.py:3

# real_control_plane.py, real_zinc_plane.py — at MODULE LEVEL:
sys.path.insert(0, str(_ZINC_ADAPTER_PATH))

# test_flaws.py, _build_pink_bodies.py, _gen_test.py — at MODULE LEVEL:
sys.path.insert(0, '/mnt/dolphinng5_predict')

sys.path.insert(0, ...) gives the injected path highest import priority. An attacker with filesystem write access to the inserted path can create a malicious module that shadows a legitimate import (e.g., zinc.py, utils.py, typing.py). When any subsequent from X import Y runs, the attacker's module loads with the full privileges of the kernel process.

The production files use a relative path resolution (Path(__file__).resolve().parents[3] / "zinc" / "adapters" / "python"), while the test files use a hardcoded absolute path ('/mnt/dolphinng5_predict'). Both patterns are dangerous.

Severity: High

I19: `pump_venue_events` re-fetches exchange state that can produce phantom position events

File: bingx_venue.py:395-415

reconcile() calls _backend_snapshot() which fetches current positions and open orders from the exchange. The _events_from_snapshot method diff-s the current snapshot against the last-known snapshot to produce events:

def _events_from_snapshot(self, before, after):
    for symbol, current_pos in after.open_positions.items():
        prev_pos = before.open_positions.get(symbol)
        if current_pos and (not prev_pos or abs(prev_pos.position_amount) < 1e-12):
            # This looks like a new position — emit event

If before is stale (from _backend_snapshot timeout), the diff can produce spurious events. A position that existed before the crash is absent from the stale snapshot → the diff sees it as "new" → emits an entry fill event → Rust kernel processes it as a fresh enter → double position. This compounds with I10 (seen_event_ids lost on restart).

Severity: High

I20: `exit_leg_ratios` no guard against empty list — `next_exit_ratio` returns 1.0

File: contracts.py:196-198

def next_exit_ratio(self) -> float:
    if self.active_leg_index < len(self.exit_leg_ratios):
        return self.exit_leg_ratios[self.active_leg_index]
    return 1.0

If exit_leg_ratios is empty (the dataclass default of (1.0,) normally prevents this, but nothing rejects an explicitly supplied empty sequence), next_exit_ratio() returns 1.0. This is the same as "exit everything": consume_exit_leg then advances active_leg_index to min(1, 1) = 1, and all_legs_done = active_leg_index >= exit_leg_ratios.len() → 1 >= 0 = true, so the slot closes. The empty-ratios edge case is silently handled by the 1.0 fallback (unwrap_or(1.0) on the Rust side), which happens to be correct but is undocumented.

Severity: Informational

I21: No test for rate-limited events — `RATE_LIMITED` kernel path is dead code

File: _rust_kernel/src/lib.rs (event handler), mock_venue.py (MockVenueScenario has no rate_limit flag)

The Rust kernel has a handler for KernelEventKind::RATE_LIMITED (lib.rs lines ~1480-1500). The event flows through the Python bridge's process_intent() rate-limit detection (rust_backend.py:585-593). But MockVenueScenario has no flag to emit rate-limited events. The only path to trigger RATE_LIMITED is from the real BingX adapter — which requires live exchange connectivity.

The entire RATE_LIMITED code path — in both Python and Rust — is untested in CI. Any bug in this path only surfaces in production under rate-limit conditions.

Severity: Medium

I22: Thread pool for `_run` — `max_workers=3` shared across ALL adapter instances

File: `bingx_venue.py:236-245`

@classmethod
def _get_executor(cls):
    if cls._EXECUTOR is None:
        with cls._EXECUTOR_LOCK:
            if cls._EXECUTOR is None:
                cls._EXECUTOR = ThreadPoolExecutor(max_workers=3, ...)
    return cls._EXECUTOR

Class-level singleton: all BingxVenueAdapter instances share the same 3-thread pool. With the runtime's step() calling submit() (1 thread) + _backend_snapshot (potentially another thread for open orders) + cancel() (1 thread in parallel), all 3 threads are consumed. A fourth concurrent call waits in the executor queue while its caller blocks at .result() until a worker frees up, which freezes the entire event loop for that duration.

The pool is never shut down. If a BingxVenueAdapter is destroyed, the threads remain running (zombie workers). No close()/disconnect() path shuts down the executor.

Severity: Medium
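
A sketch of the missing shutdown path for I22. `VenueAdapter` is a stand-in for the real adapter, and `shutdown_executor` is a name introduced here. It uses only standard `concurrent.futures` calls.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class VenueAdapter:
    """Stand-in for the real adapter; only the executor handling is shown."""

    _EXECUTOR = None
    _EXECUTOR_LOCK = threading.Lock()

    @classmethod
    def _get_executor(cls):
        with cls._EXECUTOR_LOCK:
            if cls._EXECUTOR is None:
                cls._EXECUTOR = ThreadPoolExecutor(max_workers=3)
            return cls._EXECUTOR

    @classmethod
    def shutdown_executor(cls, wait=True):
        # Detach the pool under the lock, then stop its workers.
        with cls._EXECUTOR_LOCK:
            pool, cls._EXECUTOR = cls._EXECUTOR, None
        if pool is not None:
            pool.shutdown(wait=wait)
```

Calling `shutdown_executor()` from the launcher's close path would stop the worker threads. The next `_get_executor()` call lazily creates a fresh pool.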

Pass 6 Summary

#	Flaw	Layer	Severity
I1	Entry `apply_fill` multiple partial fills overwrite size instead of accumulating	Rust	Critical
I2	Zero exit_ratio creates zero-size exit order — slot stuck in EXIT_REQUESTED	Rust	Medium
I3	entry_price inconsistency — Python falsy vs Rust `<= 0.0` gate	Bridge	Info
I4	Only 1 Rust unit test for 1765-line kernel — 99% untested at Rust layer	Rust	High
I5	MockVenueScenario rejection flags exist but zero tests use them	Test	High
I6	No LIMIT order test through full kernel path	Test	High
I7	Three weak/vacuous assertions in test_flaws.py	Test	Low
I8	Entry overfill no guard	Rust	Low
I9	No crash durability — slot state pure in-memory until step 7 of process_intent	Bridge	Critical
I10	seen_event_ids lost on restart — events double-processed	Rust	Critical
I11	No idempotency key sent to BingX — lost response creates duplicate orders	Venue	High
I12	No graceful degradation for ANY subsystem	All	High
I13	Stray venue event can reactivate CLOSED slot — no guard	Rust	High
I14	No reconcile_from_slots call on startup — Zinc state never loaded into kernel	Restart	High
I15	CANCEL_REJECT doesn't clear active_exit_order — slot stuck in EXIT_WORKING	Rust	Medium
I16	Zinc shared memory world-readable/writable by same-machine processes	Zinc	High
I17	KernelSlotView unrestricted getattr/setattr — bypasses all FSM guards	Bridge	High
I18	sys.path.insert(0) at import time in 2 production and 3 test/generator files (malicious module loading)	Build	High
I19	pump_venue_events stale snapshot diff produces phantom position events	Venue	High
I20	exit_leg_ratios empty list — next_exit_ratio defaults to 1.0 (undocumented)	Contracts	Info
I21	RATE_LIMITED code path in both Python and Rust is completely untested	All	Medium
I22	Thread pool max_workers=3 shared across all adapter instances — never shut down	Venue	Medium

Pass 6 Severity Distribution

Severity	Count
Critical	3 (I1, I9, I10)
High	11 (I4, I5, I6, I11, I12, I13, I14, I16, I17, I18, I19)
Medium	4 (I2, I15, I21, I22)
Low	2 (I7, I8)
Info	2 (I3, I20)

Combined Catalog (All 6 Passes)

Pass	Focus	Count	Critical	High	Medium	Low	Info
A	Architectural	15	0	2	0	2	11
T	Threading/Atomicity	9	1	3	3	2	0
E	E2E Trace (Pass 1)	26	0	4	10	11	1
F	Deep E2E (Pass 3)	30	0	1	8	17	4
G	Domain Scans (Pass 4)	36	4	11	11	8	2
H	Edge Domains (Pass 5)	22	3	9	5	4	1
I	Pass 6 (Math/Tests/Recovery/Security)	22	3	11	4	2	2
Total		160	11	41	41	46	21

PASS 7 — TEST INFRA, DATA FEED, RUST DEEPER, ENV PARSING, CONNECTIONS

J1: Test `_flatten` helper submits wrong direction for LONG positions

File: _build_pink_extended.py (patch), gen2.py:399-412, gen_live_tests.py:155-169

Every instance of _flatten in the codebase submits a SHORT exit regardless of the actual position direction:

def _flatten(k, symbol, price, label):
    _exit(k, symbol, price, slot_id=0)    # _exit creates a SHORT exit
    # ... if still not free, tries LONG exit

_exit calls _si(k, EXIT, ..., "SHORT", ...). If the open position is LONG, this SHORT exit is actually an enter short — a new position opening, not a flatten. Only after the first attempt fails does it try a LONG exit. This can double the position instead of flattening it.

No test in the suite has ever hit this because no test before the _verify step has an open position with the wrong direction — but the code is fundamentally wrong: it assumes all positions are SHORT.
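
A direction-aware variant might read the side from the slot instead of hard-coding it. The sketch below is only an illustration: `OpenSlot`, its `side` and `size` fields, and `exit_direction` are hypothetical names, and it assumes (as the description above implies) that an exit must carry the direction of the position being closed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OpenSlot:
    """Hypothetical stand-in for the slot fields a flatten helper would read."""
    side: str      # "LONG" or "SHORT"; "" when the slot is free
    size: float    # remaining open size


def exit_direction(slot: OpenSlot) -> Optional[str]:
    """Return the direction label to pass to the exit helper, or None if flat.

    The label is read from the slot, so a LONG position gets a LONG exit
    and a SHORT position gets a SHORT exit.
    """
    if slot.size <= 0 or slot.side not in ("LONG", "SHORT"):
        return None
    return slot.side
```

A `_flatten` built on this would issue at most one exit per call and skip the call entirely when the function returns `None`.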

Severity: Medium

J2: Test `_check_slot_accounting` double-counts unrealized PnL

File: _build_pink_extended.py (patched into generated file)

total_rp = sum(k.slot(i).realized_pnl for i in range(k.max_slots))
total_up = sum(k.slot(i).unrealized_pnl for i in range(k.max_slots))
expected = start_cap + total_rp + total_up
actual = k.account.snapshot.capital
assert abs(actual - expected) < 0.01

The assertion checks the identity capital = start_cap + Σrealized_pnl + Σunrealized_pnl. That identity holds only if the kernel's capital figure already includes unrealized PnL, and it does not: account.snapshot.capital is updated by account.settle(realized_pnl), which adds realized PnL only. The correct identity is therefore capital = start_cap + Σrealized_pnl, and the test adds unrealized PnL on top of it.

With expected = actual + Σunrealized_pnl, the assertion fails whenever the summed unrealized PnL of open positions differs from zero by 0.01 or more. The test passes only when unrealized PnL ≈ 0: for closed positions, or when mark_price has not been called, which is always the case (see J4).

The assertion would therefore produce false failures for tests with open positions. It has not triggered only because mark_price is never called, so slot.unrealized_pnl is always 0. This is a silent near-miss.
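
The two identities can be kept apart explicitly. The following is a minimal sketch with hypothetical helper names (`expected_capital`, `expected_equity`); it assumes, as described above, that capital tracks realized PnL only, and it models equity as capital plus unrealized PnL.

```python
import math


def expected_capital(start_cap, realized_pnls):
    """Capital as settle() is described to maintain it: start plus realized PnL."""
    return start_cap + sum(realized_pnls)


def expected_equity(start_cap, realized_pnls, unrealized_pnls):
    """Equity: capital plus the mark-to-market value of open positions."""
    return expected_capital(start_cap, realized_pnls) + sum(unrealized_pnls)


# Worked example: one closed trade (+12.5) and one open trade marked at +3.0.
capital = expected_capital(1000.0, [12.5, 0.0])
equity = expected_equity(1000.0, [12.5, 0.0], [0.0, 3.0])
assert math.isclose(capital, 1012.5)   # what the capital assertion should compare against
assert math.isclose(equity, 1015.5)    # a separate quantity, not the capital figure
```

Under this reading, the test would compare `k.account.snapshot.capital` against the first quantity and, if it wants to check mark-to-market, assert the second one against a separately reported equity value.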

Severity: Medium

J3: `_build_live_snapshot` uses `time.time()` (float) as timestamp — downstream expects datetime

File: gen_live_tests.py:81

def _build_live_snapshot(client, symbol, interval=None):
    # ...
    return MarketSnapshot(
        timestamp=time.time(),   # ← float (Unix epoch seconds)
        ...
    )

By contrast, gen2.py:352 and _gen_test.py:138 correctly use datetime.now(timezone.utc) (a timezone-aware datetime). If any downstream code calls .isoformat() or .strftime() on the snapshot's timestamp, it crashes with AttributeError: 'float' object has no attribute 'isoformat'.

This function is used by the newer _run harness in the generated live-test file. Whether the crash manifests depends on what MarketSnapshot and PinkDirectRuntime.step() do with the timestamp field.
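
One way to remove the type mismatch is to normalize at the boundary. The helper below is a sketch, not existing code: `to_utc_datetime` is a hypothetical name, and treating naive datetimes as UTC is an assumption that would need checking against the callers.

```python
from datetime import datetime, timezone


def to_utc_datetime(ts):
    """Coerce epoch seconds or a datetime into a timezone-aware UTC datetime."""
    if isinstance(ts, datetime):
        if ts.tzinfo is None:
            # Assumption: naive values are already expressed in UTC.
            return ts.replace(tzinfo=timezone.utc)
        return ts.astimezone(timezone.utc)
    # Anything else is treated as Unix epoch seconds (int or float).
    return datetime.fromtimestamp(float(ts), tz=timezone.utc)
```

With this in place, `_build_live_snapshot` could pass `to_utc_datetime(time.time())`, and downstream `.isoformat()` calls would behave the same as for the other generators.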

Severity: High

J4: `ExecutionKernel.mark_price()` exists but is never called — no periodic mark-to-market

File: rust_backend.py:667-672

def mark_price(self, asset: str, price: float) -> None:
    for slot in self.state.slots:
        if slot.asset == asset and slot.is_open():
            slot.mark_price(price)
    self.account.observe_slots(...)

This method exists on ExecutionKernel but has zero callers in the entire codebase. Unrealized PnL is never updated outside of process_intent and on_venue_event (which only compute realized PnL). The slot.unrealized_pnl field stays at its initial value (0) unless mark_price is called externally.

The AccountProjection.observe_slots() (account.py:53-66) reads slot.unrealized_pnl and reports it — but since nothing ever updates it, unrealized PnL is always 0 in the account snapshot.

This means the unrealized PnL reported by kernel.snapshot()["account"]["unrealized_pnl"] is always zero for open positions: the system has no live mark-to-market.

Severity: High

J5: All VenueEvent timestamps use local machine clock, not exchange timestamp

File: bingx_venue.py (7 locations)

Every VenueEvent constructed in the venue adapter uses the local machine's clock:

VenueEvent(
    timestamp=datetime.now(timezone.utc),  # local clock, not exchange
    ...
)

This includes:

_events_from_submit() (lines 370, 390) — with getattr(receipt, "timestamp", ...) fallback that still uses local clock
_events_from_cancel() (lines 455, 480)
_event_from_row() (line 546)
_fill_event_from_row() (line 570)

The exchange's HTTP response includes timestamps (transactTime, updateTime) that are authoritative. These are available in the raw response dict (stored in raw_payload) but are never extracted as the event timestamp. Clock skew between the local machine and the exchange is invisible — event timestamps may be ahead of or behind exchange time.
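
Extracting the exchange timestamp could look roughly like the sketch below. The field names `transactTime` and `updateTime` come from the paragraph above; that they hold epoch milliseconds is an assumption, and `exchange_timestamp` is a hypothetical helper, not part of the adapter.

```python
from datetime import datetime, timezone


def exchange_timestamp(raw_payload, fallback=None):
    """Return (timestamp, source), where source is "exchange" or "local".

    Assumes the exchange fields carry epoch milliseconds.
    """
    for key in ("transactTime", "updateTime"):
        value = raw_payload.get(key)
        if value in (None, ""):
            continue
        try:
            millis = int(value)
        except (TypeError, ValueError):
            continue
        if millis > 0:
            return datetime.fromtimestamp(millis / 1000.0, tz=timezone.utc), "exchange"
    # No usable exchange field: fall back to the local clock and say so.
    return (fallback or datetime.now(timezone.utc)), "local"
```

Returning the source alongside the value keeps the clock origin visible to consumers, so events stamped from the local clock can be told apart from exchange-stamped ones.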

Severity: Medium

J6: No monotonic timestamp verification anywhere in the system

No code path in the entire codebase checks whether a new timestamp is >= the previous one for the same asset/slot:

process_intent() — no comparison between intent timestamp and slot's last_event_time
on_venue_event() — no check that event timestamp >= previous events
TradeSlot.last_event_time is stored but never validated for monotonicity
VenueEvent timestamps from pump_venue_events() are never compared with event history

With NTP clock adjustments or VM clock drift, timestamps can go backwards; daylight saving transitions have the same effect wherever naive local-time datetimes are used. The system has no detection or guard.
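
A per-key monotonicity check is small. The sketch below uses a hypothetical `MonotonicGuard` class keyed by asset or slot; whether a regression should be rejected, logged, or only counted is a policy decision that the analysis does not settle.

```python
class MonotonicGuard:
    """Track the last timestamp seen per key and flag regressions."""

    def __init__(self):
        self._last = {}

    def observe(self, key, ts):
        """Return True if ts is >= the previous timestamp for key.

        A regressing timestamp returns False and does not move the
        high-water mark, so later events are still compared against
        the newest accepted value.
        """
        prev = self._last.get(key)
        if prev is not None and ts < prev:
            return False
        self._last[key] = ts
        return True
```

Any comparable timestamp type works (floats or timezone-aware datetimes), as long as one key is not fed a mix of the two.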

Severity: Low

J7: `rebuild_indexes()` silently overwrites duplicate `trade_id` — last slot wins, first becomes invisible

File: _rust_kernel/src/lib.rs:571-596

fn rebuild_indexes(&mut self) {
    for slot in &self.slots {
        if !slot.trade_id.is_empty() {
            self.active_trade_index.insert(slot.trade_id.clone(), slot.slot_id);
            // ↑ HashMap::insert overwrites — no duplicate check
        }
    }
}

If two slots happen to have the same trade_id (not prevented by any invariant check), the index maps to the last slot with that trade_id. The first slot becomes invisible to resolve_slot()'s trade_id-based fallback. Any venue event for that trade_id with an unspecified or negative slot_id always resolves to the last slot.

The process_intent ENTER handler checks slot.trade_id != intent.trade_id to prevent overwriting a different trade on the same slot — but there's no global uniqueness check across all slots.

Severity: High

J8: `resolve_slot()` falls back to slot 0 when all indexes miss — stray event corrupts slot 0

File: _rust_kernel/src/lib.rs:606-622

fn resolve_slot(&self, event: &VenueEvent) -> usize {
    // ... try by slot_id, trade_id, venue_order_id, client_order_id ...
    self.slots.first().map(|slot| slot.slot_id).unwrap_or(0)
}

When a venue event has:

slot_id = -1 (negative — can't be used as usize)
Empty trade_id (trade not found on new kernel after restart)
Empty venue_order_id and venue_client_id

...the event is routed to slot 0 regardless of which slot it was intended for. If slot 0 is in the middle of a trade, the stray event (e.g., a stale ORDER_ACK from a pre-crash order) overwrites slot 0's state. Combined with I10 (seen_event_ids lost on restart), this is a concrete crash-recovery failure path.

Severity: High

J9: `dita_kernel_get_slot_json` and `dita_kernel_snapshot_json` return null with no diagnostic

File: _rust_kernel/src/lib.rs (FFI exports)

The intent/event processing paths (process_intent_json, on_venue_event_json) have two layers of error handling — parse errors produce a structured invalid_intent_cstring() diagnostic JSON, and serialization errors also produce diagnostics.

But the slot/snapshot read functions return bare null pointers:

// dita_kernel_get_slot_json (line 1608):
Err(_) => ptr::null_mut()    // ← no diagnostic

// dita_kernel_snapshot_json (line 1765):
Err(_) => ptr::null_mut()    // ← no diagnostic

The Python caller (_RustKernelLib.get_slot_json, line 164) checks if not raw: raise IndexError(...) — so null is caught, but the IndexError provides no detail about why it failed. If snapshot() returns null (serialization failure with f64 NaN/Inf in some slot), the Python code gets a bare IndexError or RuntimeError with no diagnostic.

Severity: Medium

J10: Two processes with same `DITA_V2_PREFIX` corrupt shared Zinc memory

File: real_zinc_plane.py:79-82, launcher.py:302

# launcher.py:
resolved_prefix = (prefix or os.environ.get("DITA_V2_PREFIX", "dita_v2")).strip() or "dita_v2"

# real_zinc_plane.py:
self.intent_name = f"{base}_intent"   # e.g., "dita_v2_intent"
self.state_name = f"{base}_state"      # "dita_v2_state"
self.control_name = f"{base}_control"  # "dita_v2_control"

Two processes on the same machine with the same prefix will:

Attach to the same named shared memory regions
Overwrite each other's slot state, intents, and control settings
Race on concurrent writes — last writer wins with no coordination
One process's create=True conflicts with another's — SharedRegion.create() may fail

There is no prefix uniqueness validation, no PID suffix, no UUID, no lock file, no access control. The prefix defaults to "dita_v2" — trivially guessable.
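
One low-cost mitigation is to make the default prefix unique per launch. The sketch below assumes a hypothetical `unique_prefix` helper; it does not address access control, and any cooperating process would then need to be told the generated prefix explicitly.

```python
import os
import uuid


def unique_prefix(base: str = "dita_v2") -> str:
    """Append the PID and a short random token to the base prefix.

    The token makes an accidental collision between two launches very
    unlikely; it is not a security boundary.
    """
    token = uuid.uuid4().hex[:8]
    return f"{base}_{os.getpid()}_{token}"
```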

Severity: High

J11: `load_dotenv()` only runs when `launcher.py` is imported — env vars unset for other module paths

File: launcher.py:49-51, control.py:205, projection.py:71

# launcher.py (at module level):
load_dotenv(PROJECT_ROOT / ".env")  # only runs on `import launcher`

# control.py (at function call time):
raw = os.environ.get("DITA_V2_CONTROL_PLANE")  # reads env var

# projection.py (at function call time):
raw = os.environ.get("DITA_V2_HAZELCAST")      # reads env var

If any code imports from .control import build_control_plane directly (without first importing launcher.py), load_dotenv() has not run. The .env file is never loaded. Env vars that should have been set from .env are absent.

This creates an ordering dependency: module import order determines whether config files are loaded. Different import paths can produce different runtime behavior.

Severity: Medium

J12: `BINGX_API_KEY`/`BINGX_SECRET_KEY` passed as `None` with no validation — fails at HTTP time

File: launcher.py:195-196

api_key=os.environ.get("BINGX_API_KEY"),      # None if unset
secret_key=os.environ.get("BINGX_SECRET_KEY"), # None if unset

When keys are unset, None is passed to BingxExecClientConfig and then to the HTTP client. No validation occurs at config/build time. The system:

Successfully builds a full DITAv2LauncherBundle with empty keys
Creates an ExecutionKernel
The first trade's venue.submit() call sends an HTTP request to BingX with empty auth
BingX returns 401 — cryptic "signature verification failed" error

This is a late failure — the operator has no indication of misconfiguration until the first trade attempt. Fast failure at launcher time would catch this.

Also: gen_live_tests.py:116-117 and gen2.py:320 use bracket access os.environ["BINGX_API_KEY"] which crashes with KeyError if the var is missing — an inconsistent pattern (crash immediately vs fail at HTTP time).
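
A fail-fast check at launcher time could be as small as the sketch below. `require_env` and `MissingCredentialsError` are hypothetical names; the error message lists variable names only, never values.

```python
import os


class MissingCredentialsError(RuntimeError):
    """Raised at build time when required environment variables are absent."""


def require_env(names, environ=None):
    """Return {name: value} for all names, or raise listing the missing ones."""
    env = os.environ if environ is None else environ
    values = {name: (env.get(name) or "").strip() for name in names}
    missing = sorted(name for name, value in values.items() if not value)
    if missing:
        raise MissingCredentialsError(
            "missing required environment variables: " + ", ".join(missing)
        )
    return values
```

Using one helper in both the launcher and the test generators would also remove the inconsistency between `os.environ.get(...)` and `os.environ[...]`.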

Severity: Medium

J13: API credentials never masked in error messages or tracebacks

File: launcher.py:195-196, bingx_venue.py (through config object)

Credentials flow through:

os.environ.get("BINGX_API_KEY") → BingxExecClientConfig(api_key=...)
BingxExecClientConfig → BingxDirectExecutionAdapter.__init__(config)
Config object stored as Python attribute — accessible via repr(), str(), error tracebacks

No code masks, redacts, truncates, or otherwise protects the API key or secret key. If an exception propagates and the traceback includes the config object (through local variables, frame inspection, or exception chaining), the credentials are exposed in logs.

The generated live-test code also embeds credentials literally:

client = BingxHttpClient(api_key="<ACTUAL_KEY>", secret="<ACTUAL_SECRET>", ...)

When test files are checked into version control (even temporarily), credentials are at risk.
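
Keeping secrets out of `repr()` is one partial mitigation. The sketch below uses a hypothetical `VenueCredentials` dataclass, not the real `BingxExecClientConfig`; it only covers `repr()`/`str()` output and does not protect against code that reads the attributes directly.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class VenueCredentials:
    """Hypothetical credential holder whose repr never includes the secrets."""
    api_key: str = field(repr=False)
    secret_key: str = field(repr=False)

    def masked_key(self) -> str:
        """Short, log-safe identifier: last four characters of the API key."""
        if len(self.api_key) > 8:
            return "***" + self.api_key[-4:]
        return "***"
```

For example, `repr(VenueCredentials("abcdefghijkl", "supersecret"))` yields `VenueCredentials()`, so a config object that embeds it can appear in a traceback without printing either value.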

Severity: High

J14: `_env_bool` treats empty-string var as `False` while unset returns `default` — inconsistent

File: launcher.py:84-88

def _env_bool(name: str, default: bool = False) -> bool:
    raw = os.environ.get(name)
    if raw is None:
        return default           # unset → uses caller's default
    return str(raw).strip().lower() in {"1", "true", "yes", "on"}
    # empty/whitespace → "" → False

Env Var State	`_env_bool(name, True)` returns
Unset (key absent)	`True` (caller's default)
Set to `""` (empty)	`False` (empty not in truthy set)
Set to `" "` (whitespace)	`False`

Setting DITA_V2_DEBUG_CLICKHOUSE="" (intending "don't set, use default") actually forces it to False, overriding the default. And setting DITA_V2_DEBUG_CLICKHOUSE=" " (whitespace accidentally) does the same. The operator would need to know that empty and whitespace are treated as explicit falsy values, not as "unset."
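
A variant that treats empty and whitespace-only values as unset might look like the sketch below. The falsy set and the decision to raise on unrecognized values are assumptions, not behavior of the current `_env_bool`.

```python
import os

_TRUTHY = {"1", "true", "yes", "on"}
_FALSY = {"0", "false", "no", "off"}   # assumption: explicit falsy spellings


def env_bool(name, default=False, environ=None):
    """Parse a boolean env var; empty or whitespace-only counts as unset."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None or not raw.strip():
        return default
    value = raw.strip().lower()
    if value in _TRUTHY:
        return True
    if value in _FALSY:
        return False
    raise ValueError(f"{name} has an unrecognized boolean value: {raw!r}")
```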

Severity: Low

J15: `gen2.py` and `_gen_test.py` both write to the same output file — last writer wins

Files: gen2.py and _gen_test.py

Both generators write to:

OUTPUT = "/mnt/dolphinng5_predict/prod/tests/test_pink_bingx_dita_live_e2e.py"

_gen_test.py is more complete (includes _inspect_outcome, _assert_accepted, _check_slot_accounting, _build_fresh_kernel_from_slot helpers). gen2.py is simpler (no helpers). The last file to execute determines what the test file contains.

If gen2.py runs last, the helpers from _build_pink_extended.py and _build_pink_bodies.py are lost — their patches to the generated file become stale updates to a now-overwritten file. The _check_slot_accounting assertions in 14 body functions silently become dead code.

Severity: Medium

J16: Shim test bridge has no `step()`, `decision_engine`, `intent_engine` — zero fidelity to production runtime

File: (generated in _build_rb sections across all test generators)

class Shim:
    def __init__(self, k): self.kernel = k
    async def connect(self, ic=0): self.kernel.venue.connect()
    async def disconnect(self):
        try: self.kernel.venue.disconnect()
        except: pass

The Shim provides none of PinkDirectRuntime's capabilities:

No step() method — tests call k.process_intent() directly
No data_feed — tests must provide prices manually
No decision_engine — tests construct intents manually
No intent_engine — no intent sizing/validation
No lifecycle beyond connect/disconnect

The test suite effectively tests ExecutionKernel in isolation, not the full runtime pipeline. Any bug in the decision→intent→kernel→fill→persist chain that passes through step() is invisible to these tests.

Severity: High

Pass 7 Summary

#	Flaw	Layer	Severity
J1	`_flatten` submits wrong direction for LONG positions	Test	Medium
J2	`_check_slot_accounting` double-counts unrealized PnL	Test	Medium
J3	`_build_live_snapshot` timestamp is float vs datetime — type crash risk	Data Feed	High
J4	`ExecutionKernel.mark_price()` never called — no mark-to-market	Bridge	High
J5	All VenueEvent timestamps use local clock, not exchange timestamp	Venue	Medium
J6	No monotonic timestamp verification anywhere	All	Low
J7	`rebuild_indexes()` overwrites duplicate trade_id — last wins, first invisible	Rust	High
J8	`resolve_slot()` falls back to slot 0 — stray event corrupts slot 0	Rust	High
J9	`get_slot_json`/`snapshot_json` return null with no diagnostic	Rust	Medium
J10	Two processes with same DITA_V2_PREFIX corrupt shared Zinc memory	Zinc	High
J11	`load_dotenv()` only runs on launcher.py import — ordering dependency	Config	Medium
J12	BINGX_API_KEY passed None with no validation — fails at HTTP time	Config	Medium
J13	API credentials never masked in error messages or tracebacks	Config	High
J14	`_env_bool` inconsistent: empty string = False vs unset = default	Config	Low
J15	gen2.py and _gen_test.py write to same output — last writer wins	Test	Medium
J16	Shim test bridge lacks step(), decision_engine — zero runtime fidelity	Test	High

Pass 7 Severity

Severity	Count
High	7 (J3, J4, J7, J8, J10, J13, J16)
Medium	7 (J1, J2, J5, J9, J11, J12, J15)
Low	2 (J6, J14)

Combined Catalog (All 7 Passes)

Pass	Focus	Count	Critical	High	Medium	Low	Info
A	Architectural	15	0	2	0	2	11
T	Threading/Atomicity	9	1	3	3	2	0
E	E2E Trace (Pass 1)	26	0	4	10	11	1
F	Deep E2E (Pass 3)	30	0	1	8	17	4
G	Domain Scans (Pass 4)	36	4	11	11	8	2
H	Edge Domains (Pass 5)	22	3	9	5	4	1
I	Pass 6 (Math/Tests/Recovery/Security)	22	3	11	4	2	2
J	Pass 7 (Test Infra/Data/Rust/Env/Conn)	16	0	7	7	2	0
Total		176	11	48	48	48	21

PASS 8 — OBSERVABILITY, MEMORY, TIME, DEAD CODE, MODULE INIT

K1: Zero stdout/stderr output — system is completely silent

No production code path emits any stdout or stderr output. Zero print(), zero logging output, zero warnings.warn(). The system runs with zero operator-visible evidence of being alive. If Hazelcast and ClickHouse are both disabled, the system is a black box — no logs, no metrics, no health checks, no output of any kind.

logging is imported in exactly one file (bingx_user_stream.py) with no root logger configuration anywhere. Even those logging calls produce no output without logging.basicConfig().

Severity: Critical

K2: No health check endpoint, no metrics, no monitoring surface

There are zero:

HTTP health check endpoints (/health, /ready)
Prometheus metrics endpoints
Statsd/Graphite reporters
Periodic heartbeats
Liveness/readiness probes
Process manager integration (no systemd unit, no supervisor config, no container healthcheck)

The only monitoring surface is programmatic — calling kernel.snapshot() from Python code with access to the same ExecutionKernel instance. For cross-process monitoring, the operator must write custom code to read Zinc shared memory regions and parse the undocumented JSON packets.

Severity: Critical

K3: Failed trades produce no notification — error exists only in return value

process_intent() returns KernelOutcome(accepted=False, diagnostic_code=..., details=...) but:

No log line is written for the failure
No stdout/stderr output
The failure is not persisted to any durable store (unless debug_clickhouse is enabled and sink is configured)
If the caller (strategy/algo) doesn't inspect the return value, the failure is completely invisible
There is no alert mechanism, no error counter, no dead-letter queue

Severity: High

K4: Exception tracebacks not captured in production — all `except:` blocks swallow silently

Every except Exception: pass and except Exception: continue in the codebase discards the full Python traceback. There is no logging infrastructure to capture it. When an exception occurs:

launcher.py:187: RealZincPlane init failure → traceback lost
rust_backend.py:102: __del__ exception → traceback lost
bingx_venue.py:51: _row_float conversion failure → traceback lost
bingx_venue.py:325: slot lookup failure → traceback lost
bingx_venue.py:350: cancel HTTP error → traceback lost
control.py:213: control plane fallback → traceback lost
All real_control_plane.py try/except blocks → traceback lost

The only exception information that survives is the final exception message in BingxHttpError (converted to a dict) and Rust kernel diagnostic codes (structured JSON). Full Python tracebacks are invisible.
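
Where a fallback value is still wanted, the handler can record the traceback before returning it. The sketch below is a simplified, hypothetical single-key `row_float` (the real `_row_float` takes several keys), and the logger name is an assumption; output still depends on logging being configured somewhere in the process.

```python
import logging

logger = logging.getLogger("dita_v2.venue")  # hypothetical logger name


def row_float(row, key, default=0.0):
    """Parse row[key] as float; on failure, log the traceback and return default."""
    try:
        return float(row[key])
    except (KeyError, TypeError, ValueError):
        # exc_info=True attaches the full traceback to the log record.
        logger.warning(
            "could not parse %r from row; using default %r", key, default, exc_info=True
        )
        return default
```

Narrowing `except Exception` to the expected exception types, as done here, also stops unrelated errors from being absorbed by the fallback.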

Severity: High

K5: ~85+ Python objects allocated per `process_intent()` call — 36 TradeSlot copies via JSON round-trip

Every _get_slot() call does a full JSON serialization (Rust) → C FFI → JSON parse (Python) → new TradeSlot dataclass. A single ENTER intent with 2 venue events results in approximately:

36 TradeSlot instances from repeated _get_slot() calls (state refresh, observe_slots, projection writes)
4 VenueEvent instances
3 KernelOutcome instances
~30 dicts for serialization payloads
~4 KernelTransition instances

No caching exists — every _get_slot() call goes through the full FFI round-trip. Multiple calls within the same process_intent() invocation fetch the same slot data multiple times.

Severity: Medium

K6: Circular reference cycle `Kernel` → `StateView` → `SlotView` → `Kernel` — prevents refcount GC

# KernelStateView and KernelSlotView both hold strong references:
self._kernel = kernel     # strong reference

This forms ExecutionKernel → state → slots[]._kernel → ExecutionKernel. Python's refcounting cannot free this cycle — it depends on the generational GC. The __del__ method on ExecutionKernel (which destroys the Rust KernelHandle) fires at an unpredictable time, potentially long after the last explicit reference to the kernel is dropped.
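
Breaking the cycle with a weak back-reference is one option. The sketch below uses hypothetical `Kernel` and `SlotView` classes to show the shape; it is not the bridge's actual code.

```python
import weakref


class SlotView:
    """View that refers back to its kernel without keeping it alive."""

    def __init__(self, kernel, slot_id):
        self._kernel_ref = weakref.ref(kernel)   # weak: no reference cycle
        self.slot_id = slot_id

    @property
    def kernel(self):
        kernel = self._kernel_ref()
        if kernel is None:
            raise ReferenceError("kernel has been released")
        return kernel


class Kernel:
    def __init__(self, max_slots):
        self.slots = [SlotView(self, i) for i in range(max_slots)]
```

With only weak back-references, dropping the last strong reference to the kernel lets CPython's reference counting release it immediately, so a `__del__` that frees a native handle runs at a predictable point. An explicit `close()` method would be more robust still.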

Severity: High

K7: `MemoryKernelJournal` silently drops transitions after 10,000 rows — no warning, no rollover

def record(self, row):
    if len(self.rows) < self.capture_limit:  # capture_limit = 10,000
        self.rows.append(dict(row))
    # else: silently no-op — every subsequent transition is lost

After 10,000 transitions, record() becomes a no-op. No error, no warning, no FIFO eviction, no rollover to disk. In a production system with 10+ transitions per trade and 100+ trades/day, the journal dies in ~10 days. At that point, all field debugging/troubleshooting capability is silently lost.

Each row holds a full slot.to_dict() (~1 KB) plus event/control payloads. The 10,000 rows retain ~10-15 MB permanently.
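
A ring buffer with a drop counter keeps memory bounded while making the loss visible. The sketch below is a hypothetical `RingJournal`; note that it changes which rows survive (the newest instead of the oldest), which may or may not suit post-mortem debugging.

```python
from collections import deque


class RingJournal:
    """Keep the most recent rows and count how many were evicted."""

    def __init__(self, capture_limit=10_000):
        self.rows = deque(maxlen=capture_limit)
        self.dropped = 0

    def record(self, row):
        if len(self.rows) == self.rows.maxlen:
            # The deque is full: appending evicts the oldest row.
            self.dropped += 1
        self.rows.append(dict(row))
```

Exposing `dropped` in a snapshot or log line would let an operator notice when journal history is being lost.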

Severity: High

K8: `RealZincPlane._intent_cache` Python list unbounded — only shared memory write is bounded

# real_zinc_plane.py:189-191
self._intent_cache.append(row)
self._write_region(self.intent_region, self._intent_seq, {"items": self._intent_cache[-512:]})

The shared memory write limits to the last 512 entries. But the Python _intent_cache list grows unbounded — every intent ever published remains in memory forever. After 1M intents: ~1M dict objects, ~500 MB+ of Python memory.

Note: InMemoryZincPlane.intent_region has the same unbounded growth (already documented as F12).

Severity: High

K9: `_backend_snapshot` timeout uses wall-clock `threading.Event.wait()` — NTP can truncate/extend

def _backend_snapshot(self, *, timeout_ms=5000.0):
    if not self._snapshot_ready.wait(timeout=timeout_ms / 1000.0):  # wall clock!
        return self._last_snapshot  # stale data

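threading.Event.wait(timeout) can be affected by wall-clock changes, depending on interpreter and platform: CPython on Linux historically timed lock waits against CLOCK_REALTIME, and only CPython 3.11+ built against glibc 2.30+ uses the monotonic clock. Where the wall clock applies and NTP adjusts it: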

Forward (e.g., +2 seconds): the timeout is truncated to ~3 seconds — spurious timeout, stale snapshot returned
Backward (e.g., -2 seconds): the timeout extends to ~7 seconds — caller blocks longer than expected

The correct pattern is time.monotonic() with a deadline loop — which InMemoryControlPlane.wait() already uses correctly. The _backend_snapshot timeout is the single highest-impact site because it controls whether the venue adapter returns fresh or stale exchange state.
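
A deadline loop of the kind referred to above can be sketched as follows. `wait_with_deadline` and the `poll_s` interval are hypothetical; short polling bounds the effect of any clock-sensitive wait to a single slice.

```python
import threading
import time


def wait_with_deadline(event, timeout_s, poll_s=0.05):
    """Wait for event until a time.monotonic() deadline; return True if set."""
    deadline = time.monotonic() + timeout_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return event.is_set()
        # Wait in short slices; the overall budget is tracked monotonically.
        if event.wait(min(poll_s, remaining)):
            return True
```

The trade-off is a few extra wakeups per second in exchange for a total wait that does not depend on wall-clock adjustments.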

Severity: High

K10: `RealZincControlPlane.wait()` uses wall clock — no monotonic guarantee

# real_control_plane.py:126-130
def wait(self, timeout_ms=1000):
    try:
        return bool(self.region.wait(timeout_ms))  # wall clock
    except Exception:
        return False

The SharedRegion.wait() implementation (external) uses wall clock. Same NTP sensitivity as K9, though lower impact (controls shared memory synchronization, not exchange data freshness).

Severity: Medium

K11: `exchange_ts` falls back to local `time.time()` when exchange timestamp `E` is missing

# bingx_user_stream.py:278
ts = int(frame.get("E") or time.time() * 1000)  # local clock fallback

When the exchange's WebSocket frame lacks the E (event time) field, the code substitutes the local machine's wall clock. Two problems:

Local clock may differ from exchange clock by seconds or minutes (VM drift)
time.time() is wall-clock — subject to NTP backward jumps

Events that lack E will have timestamps from a different clock source than events that have E. This creates ordering paradoxes in any downstream consumer that sorts by timestamp.

Severity: Medium

K12: No monotonic timestamp verification anywhere in the system

Zero code paths check whether timestamps progress forward:

process_intent() — no comparison between intent timestamp and slot's last_event_time
on_venue_event() — no check that event timestamp >= previous events
AccountProjectionV2._build() — no monotonicity check on ReconcileResult.ts
Rust kernel — last_event_time = Some(event.timestamp) stored but never validated

NTP backward jumps, clock skew, or VM migration all can produce decreasing timestamps. The system has no detection, no guard, no warning log.

Severity: Medium

K13: `ControlPlane.wait()` and `notify()` have zero callers across all implementations — dead protocol surface

The ControlPlane protocol defines wait(timeout_ms=1000) and notify(). Both are implemented by InMemoryControlPlane, ZincControlPlane, and RealZincControlPlane. But zero callers exist in production code:

ExecutionKernel never calls self.control_plane.wait() or .notify()
launcher.py never calls them
No test exercises them

Combined with the protocol methods having real implementations (with monotonic clock logic in InMemoryControlPlane), this is ~40 lines of dead-but-maintained code.

Similarly: all 7 ZincPlane wait/notify methods (wait_on_intent, notify_intent, wait_on_state, notify_state, wait_on_control, notify_control, read_slots) have zero callers — dead protocol surface.

Severity: Informational

K14: `AccountProjection.to_account_event()` has zero callers

# account.py:86
def to_account_event(self, metadata=None):
    ...

Defined, never called anywhere in production code or tests. Dead code.

Severity: Informational

K15: `HazelcastProjector` entire class dead — zero callers

# hazelcast_projection.py:18-48
class HazelcastProjector:
    def publish_slot(self, slot): ...
    def publish_event(self, event_type, payload): ...

Both methods have zero callers anywhere in the codebase. The class can never be constructed from any production code path. The actively-used projection class is HazelcastRowWriter.

Severity: Informational

K16: `_order_to_payload()` dead code

# rust_backend.py:220
def _order_to_payload(order):
    ...

Defined, never called. Serializing a VenueOrder to dict is done inline in TradeSlot.to_dict() (contracts.py:127-134), not via this function.

Severity: Informational

K17: `MirroredControlPlane` entire class dead — never constructed

# control.py:171-184
class MirroredControlPlane:
    def __init__(self, inner, mirror_sink=None): ...

build_control_plane() never returns a MirroredControlPlane. The class can only be constructed if someone explicitly instantiates it — no code path does. Similarly, KernelJournal protocol is never used as a type annotation outside journal.py.

Severity: Informational

K18: 12 of 20 `TradeStage` variants never matched in Rust FSM logic

Defined in the Rust string_enum! but never matched in any process_intent or on_venue_event arm: DECISION_CREATED, INTENT_CREATED, ORDER_SENT, ORDER_ACKED, ORDER_REJECTED, POSITION_OPENED, EXIT_SENT, EXIT_ACKED, EXIT_REJECTED, POSITION_PARTIALLY_CLOSED, POSITION_CLOSED, TRADE_TERMINAL_WRITTEN

Only 8 variants are used in FSM logic: IDLE, ORDER_REQUESTED, ENTRY_WORKING, POSITION_OPEN, EXIT_REQUESTED, EXIT_WORKING, CLOSED, STALE_STATE_RECONCILING. The other 12 are serialization-only: they exist in the enum but the kernel never transitions a slot to them.

Severity: Low

K19: Unused imports in `projection.py` and `hazelcast_projection.py`

projection.py imports AccountProjection, TradeStage, datetime, Iterable, List — none used. hazelcast_projection.py imports KernelTransition, TradeSlot, KernelControlSnapshot, _transition_row — none used.

These are carryovers from earlier code versions. They add no runtime cost (import is cached after first load) but indicate stale code structure.

Severity: Informational

K20: `sys.path` mutation on import — importing the package appends Zinc path globally

Both real_control_plane.py:13-15 and real_zinc_plane.py:22-24 do:

if _ZINC_ADAPTER_PATH.exists() and str(_ZINC_ADAPTER_PATH) not in sys.path:
    sys.path.append(str(_ZINC_ADAPTER_PATH))

This fires at module import time as a side effect of importing __init__.py (through the chain: __init__ → launcher → real_control_plane/real_zinc_plane). It modifies the process-global sys.path, which persists for the entire process lifetime. If the Zinc adapter path shadows or conflicts with other modules, the consequences are global and hard to debug.

Severity: Medium

K21: `load_dotenv()` runs at module import time — mutates `os.environ` as side effect

# launcher.py:49-51 (at module level)
PROJECT_ROOT = Path(__file__).resolve().parents[3]
load_dotenv(PROJECT_ROOT / ".env")

This fires on import launcher (which happens via __init__.py). Mutates os.environ process-globally. Tests that need to set specific env vars must import launcher first to get .env loaded, then override — or the .env values win. Also: if .env doesn't exist, load_dotenv() silently does nothing, and the import dependency shifts — importing the package may or may not load .env depending on filesystem state.

Severity: Medium

K22: `ControlPlane` protocol not in `__init__.py` `__all__`

# __init__.py (__all__)
"ControlPlane" not in __all__  # ← hidden from star imports

from prod.clean_arch.dita_v2 import * exports 44 names but does NOT include ControlPlane (the main interface type). Concrete implementations (InMemoryControlPlane, RealZincControlPlane, etc.) are all exported. The protocol class itself is hidden.

Severity: Informational

K23: `KernelSlotView.__getattr__` makes a ctypes call per attribute access, with no caching

# rust_backend.py:422-426
def __getattr__(self, name):
    slot = self._snapshot()              # FFI round-trip every time
    if hasattr(slot, name):
        return getattr(slot, name)
    raise AttributeError(name)

Every attribute access on a KernelSlotView (e.g., slot.size, slot.fsm_state, slot.trade_id) does a full JSON round-trip to the Rust kernel. The _snapshot() method calls self._kernel._get_slot(self._slot_id) which calls _get_rust().get_slot_json() → Rust serializes slot to JSON → Python parses → creates new TradeSlot → attribute is read from the new object.

Accessing 5 fields on a KernelSlotView does 5 FFI round-trips. There is no caching of the deserialized TradeSlot between accesses.
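
A view that fetches once and serves later reads from the cached object could look like the sketch below. `CachedSlotView` and its `fetch` callable are hypothetical; callers would need to call `refresh()` whenever the kernel state may have changed, which is the cost of caching.

```python
class CachedSlotView:
    """Serve attribute reads from one fetched slot object until refresh()."""

    def __init__(self, fetch):
        self._fetch = fetch      # callable returning a fresh slot object
        self._cached = None

    def refresh(self):
        """Perform one fetch (one FFI round-trip in the real bridge)."""
        self._cached = self._fetch()
        return self._cached

    def __getattr__(self, name):
        # Called only for names not found through normal lookup.
        if name.startswith("_"):
            raise AttributeError(name)   # avoid recursion on private names
        cached = self.__dict__.get("_cached")
        if cached is None:
            cached = self.refresh()
        return getattr(cached, name)
```

Reading five fields then costs one fetch instead of five, at the price of possibly stale values between refreshes.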

Severity: Medium

Pass 8 Summary

#	Flaw	Layer	Severity
K1	Zero stdout/stderr — system completely silent	All	Critical
K2	No health check, metrics, or monitoring surface	All	Critical
K3	Failed trades produce no notification — error in return value only	Bridge	High
K4	Exception tracebacks not captured — all except:pass swallow silently	All	High
K5	~85+ Python objects per process_intent — 36 TradeSlot copies via FFI	Bridge	Medium
K6	Circular ref cycle Kernel→StateView→SlotView→Kernel delays __del__	Bridge	High
K7	MemoryKernelJournal silently drops transitions after 10K rows	Journal	High
K8	RealZincPlane._intent_cache unbounded Python list growth	Zinc	High
K9	_backend_snapshot timeout uses wall clock — NTP truncates/extends	Venue	High
K10	RealZincControlPlane.wait() uses wall clock — no monotonic	Control	Medium
K11	exchange_ts fallback to local time.time() when E missing	Stream	Medium
K12	No monotonic timestamp verification anywhere	All	Medium
K13	ControlPlane.wait()/notify() — zero callers across all impls	Control	Info
K14	AccountProjection.to_account_event() — zero callers	Account	Info
K15	HazelcastProjector entire class dead	Projection	Info
K16	_order_to_payload() dead code	Bridge	Info
K17	MirroredControlPlane entire class dead — never constructed	Control	Info
K18	12 of 20 TradeStage variants never matched in Rust FSM	Rust	Low
K19	Unused imports in projection.py and hazelcast_projection.py	Projection	Info
K20	sys.path mutation on import — global side effect	Config	Medium
K21	load_dotenv() at module import time — mutates os.environ globally	Config	Medium
K22	ControlPlane protocol not exported in __all__	Config	Info
K23	KernelSlotView.__getattr__ makes FFI call per attribute access	Bridge	Medium

Pass 8 Severity

Severity	Count
Critical	2 (K1, K2)
High	6 (K3, K4, K6, K7, K8, K9)
Medium	7 (K5, K10, K11, K12, K20, K21, K23)
Low	1 (K18)
Info	7 (K13, K14, K15, K16, K17, K19, K22)

Combined Catalog (All 8 Passes)

Pass	Focus	Count	Critical	High	Medium	Low	Info
A	Architectural	15	0	2	0	2	11
T	Threading/Atomicity	9	1	3	3	2	0
E	E2E Trace (Pass 1)	26	0	4	10	11	1
F	Deep E2E (Pass 3)	30	0	1	8	17	4
G	Domain Scans (Pass 4)	36	4	11	11	8	2
H	Edge Domains (Pass 5)	22	3	9	5	4	1
I	Pass 6 (Math/Tests/Recovery/Security)	22	3	11	4	2	2
J	Pass 7 (Test Infra/Data/Rust/Env/Conn)	16	0	7	7	2	0
K	Pass 8 (Observability/Memory/Time/DeadCode)	23	2	6	7	1	7
Total		199	13	54	55	49	28

PASS 9 — CONTRACTS, EXCHANGE EVENTS, NETWORK, FFI, BACKUP DIFFS

L1: `KernelOutcome(accepted=True, diagnostic_code=INVALID_INTENT)` is parseable — no invariant check

File: rust_backend.py:388-402

def _outcome_from_payload(payload):
    return KernelOutcome(
        accepted=bool(payload.get("accepted", False)),
        diagnostic_code=KernelDiagnosticCode(str(payload.get("diagnostic_code", "OK"))),
    )

No validation that accepted=True implies diagnostic_code=OK, or that accepted=False implies a non-OK diagnostic code. If the Rust kernel ever returns contradictory values (e.g., {"accepted": true, "diagnostic_code": "INVALID_INTENT"}), Python silently accepts them. The default for both KernelOutcome.diagnostic_code and _outcome_from_payload fallback is OK — an accepted=False with no explicit diagnostic_code would silently show OK.

Similarly, KernelTransition has no FSM validation — any (prev_state, next_state) pair is accepted, even impossible transitions like IDLE → POSITION_CLOSED.

Severity: Medium
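The missing invariant can be sketched as follows. This is a minimal illustration, not the project's code: `KernelOutcome` and `KernelDiagnosticCode` are simplified stand-ins limited to the names mentioned in this finding, and `outcome_from_payload_checked` is a hypothetical helper name. It assumes the invariant is exactly the one stated above (accepted if and only if the code is OK).

```python
from dataclasses import dataclass
from enum import Enum


class KernelDiagnosticCode(str, Enum):
    # Stand-in enum: only the two codes named in this finding.
    OK = "OK"
    INVALID_INTENT = "INVALID_INTENT"


@dataclass(frozen=True)
class KernelOutcome:
    # Stand-in dataclass: only the two fields under discussion.
    accepted: bool
    diagnostic_code: KernelDiagnosticCode


def outcome_from_payload_checked(payload: dict) -> KernelOutcome:
    """Parse a kernel payload and reject contradictory field combinations."""
    if "accepted" not in payload or "diagnostic_code" not in payload:
        # Refuse to default: a missing code must not silently become OK.
        raise ValueError(f"incomplete kernel payload: {payload!r}")
    accepted = bool(payload["accepted"])
    code = KernelDiagnosticCode(str(payload["diagnostic_code"]))
    # Invariant from the text: accepted is True exactly when the code is OK.
    if accepted != (code is KernelDiagnosticCode.OK):
        raise ValueError(
            f"contradictory outcome: accepted={accepted}, code={code.value}"
        )
    return KernelOutcome(accepted=accepted, diagnostic_code=code)
```

Raising at the bridge keeps a contradictory payload from reaching the runtime; whether to raise or to downgrade to a rejected outcome is a design choice this sketch does not settle.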

L2: `VenueEvent.filled_size > VenueEvent.size` possible — `_fill_event_from_row` uses different source fields

File: bingx_venue.py:530-531

size=abs(_row_float(row, "executedQty", "z", "lastFilledQty", default=0.0)),
filled_size=abs(_row_float(row, "lastFilledQty", "l", "z", default=0.0)),

size comes from executedQty (cumulative) while filled_size comes from lastFilledQty (incremental). If lastFilledQty > executedQty (exchange-side rounding, partial fill of a partially-cancelled order), filled_size > size. The Rust kernel's apply_fill uses event.filled_size for PnL and position adjustment — an oversized fill could over-count position reduction.

Also: VenueOrder.filled_size > intended_size possible via _venue_order_from_row() (line 157-163) when the exchange reports executedQty > origQty.

Severity: Medium

L3: `VenueEvent.price=0` can reach the kernel from multiple paths

File: bingx_venue.py:495 (via _row_float default 0.0), mock_venue.py:180 (via 0.0 when reference_price=0), rust_backend.py:411 (via outcome default 0.0)

The Rust kernel's realized_pnl() guards against entry_price <= 0.0 and exit_size <= 0.0, but exit_price=0 in a fill event produces delta = (0 - entry) / entry = -1.0. For LONG: PnL = -1.0 * notional → -100% of position. A zero-price fill event would register as a total loss.

The mark_price() function guards against price <= 0, so unrealized PnL is safe. But realized PnL from a zero-price fill is not guarded.

Severity: High
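The arithmetic above can be reproduced with a small stand-alone model. The functions below are a simplified sketch of the LONG formula as described in this finding, not the Rust implementation; the 1,000 USDT notional is a made-up figure, and the guarded variant is only one possible mitigation.

```python
def realized_pnl_long(entry_price: float, exit_price: float, notional: float) -> float:
    """Simplified LONG realized PnL following the formula described above.

    Guards entry_price <= 0 as the finding says the kernel does, but leaves
    exit_price unguarded on purpose to reproduce the flaw.
    """
    if entry_price <= 0.0:
        return 0.0
    delta = (exit_price - entry_price) / entry_price
    return delta * notional


def realized_pnl_long_guarded(entry_price: float, exit_price: float, notional: float) -> float:
    """Same formula, but a non-positive exit price is rejected instead of booked."""
    if exit_price <= 0.0:
        raise ValueError("fill event carries a non-positive price")
    return realized_pnl_long(entry_price, exit_price, notional)


# A zero-price fill on a 1,000 USDT notional books the whole notional as a loss.
loss = realized_pnl_long(entry_price=100.0, exit_price=0.0, notional=1000.0)
print(loss)  # -1000.0
```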

L4: `BingxUserStream` — `available_margin` set to `cw` (cross wallet balance) instead of `crossWalletBalance - usedMargin`

File: bingx_user_stream.py:336

available_margin=cw   # cw = cross wallet balance, NOT available margin

In BingX's ACCOUNT_UPDATE frame, "cw" is the cross wallet balance (total equity), not the available margin. Available margin = crossWalletBalance - usedMargin. The ExchangeEvent.available_margin field receives the wrong value. This flows into the dual-ledger accounting's EBlock.available_margin — if used for reconcile rules, the exchange-side available_margin is overstated.

Severity: High

L5: `BingxUserStream` — `wallet_balance` silently defaults to 0 when `"wb"` is absent

File: bingx_user_stream.py:334

wallet = _safe_float(usdt_bal.get("wb") or usdt_bal.get("walletBalance"))

If neither "wb" nor "walletBalance" exists in the USDT balance object (possible for some account types or frame formats), the or expression evaluates to None and _safe_float(None) returns 0.0. The exchange wallet balance is silently zeroed, making the E-side of the dual-ledger reconciliation see wallet_balance=0 when the actual balance is positive. Whenever that happens, the reconcile status is ERROR (R1: capital >> 0 vs wallet=0).

Severity: High

L6: `BingxUserStream` — `_keepalive_loop` has no stop mechanism — runs forever on old listen key after rotation

File: bingx_user_stream.py:394-405

async def _keepalive_loop(self, listen_key):
    while True:
        await asyncio.sleep(self._keepalive_secs)
        await self._http.signed_put_raw(...)

The keepalive loop is an asyncio.Task with no stop signal. When the 24h rotation creates a new listen key, the old keepalive task keeps sending PUT requests to the old (now-deleted) listen key indefinitely. BingX returns errors for keepalive on deleted keys — these errors are suppressed by with suppress(Exception) in the delete path but NOT in the keepalive path. The keepalive loop's errors are unhandled.

Severity: Medium
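A stop signal for the loop could look like the sketch below. It assumes the keepalive request is passed in as an awaitable callable (`send_keepalive`, a stand-in for the signed PUT) and uses an `asyncio.Event` that the rotation code would set before starting a loop for the new key. The names are illustrative, not the project's API.

```python
import asyncio


async def keepalive_loop(send_keepalive, interval_secs: float, stop: asyncio.Event) -> int:
    """Send keepalives until `stop` is set; returns how many were sent."""
    sent = 0
    while not stop.is_set():
        try:
            # Sleep for one interval, but wake immediately if stop is set.
            await asyncio.wait_for(stop.wait(), timeout=interval_secs)
        except asyncio.TimeoutError:
            pass
        if stop.is_set():
            break
        try:
            await send_keepalive()
            sent += 1
        except Exception as exc:
            # Report the failure and keep the loop alive instead of dying silently.
            print(f"keepalive failed: {exc!r}")
    return sent


async def demo() -> int:
    stop = asyncio.Event()

    async def fake_put() -> None:
        # Stand-in for the signed PUT on the listen key.
        return None

    task = asyncio.create_task(keepalive_loop(fake_put, 0.01, stop))
    await asyncio.sleep(0.05)
    stop.set()  # rotation: stop the old key's loop before starting a new one
    return await task
```

Cancelling the task with `task.cancel()` at rotation time would also work; the event-based variant lets the loop finish its current request before exiting.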

L7: `BingxUserStream` `event_id`: `frame.get("i")` can be integer 0, which is falsy in the `or` chain before `str()` runs, so a random UUID is generated

File: bingx_user_stream.py:283

event_id = str(frame.get("i") or frame.get("event_id") or uuid.uuid4().hex)

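If frame.get("i") returns integer 0 (valid event ID in some BingX frames), the value is falsy, so the or chain skips it before str() is ever applied and falls through to uuid.uuid4().hex, losing the real event ID. (str(0) would give "0", which is truthy, but the conversion happens only after the or chain has already picked a value.) Event dedup downstream sees a random UUID instead of the exchange's ID.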

Severity: Medium
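The truthiness problem and one possible fix can be shown side by side. `event_id_buggy` mirrors the expression quoted above; `event_id_fixed` is a hypothetical replacement that treats only a missing value (None) as absent.

```python
import uuid


def event_id_buggy(frame: dict) -> str:
    # Original pattern: `or` tests truthiness, so an integer 0 is skipped
    # before str() is ever applied.
    return str(frame.get("i") or frame.get("event_id") or uuid.uuid4().hex)


def event_id_fixed(frame: dict) -> str:
    # Treat only a missing (None) value as absent; 0 is a valid ID.
    for key in ("i", "event_id"):
        value = frame.get(key)
        if value is not None:
            return str(value)
    return uuid.uuid4().hex


frame = {"i": 0}
print(event_id_fixed(frame))         # 0
print(event_id_buggy(frame) == "0")  # False: a random UUID hex was generated
```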

L8: BingX test URLs hardcoded in test generators — wrong environment if system targets LIVE

Files: gen_live_tests.py:70,77, gen2.py:135

"https://open-api-vst.bingx.com/openApi/swap/v2/user/positions"
"https://open-api-vst.bingx.com/openApi/swap/v2/quote/price"

Hardcoded vst (testnet) URLs. The production launcher.py path selects VST vs LIVE via BingxEnvironment and DOLPHIN_BINGX_ENV, but the test generators hardcode VST. If the system is configured for LIVE and these tests run, they hit the wrong exchange environment.

Severity: Medium

L9: No proxy support — cannot be deployed behind corporate proxy

No code parses HTTP_PROXY, HTTPS_PROXY, SOCKS_PROXY or passes proxy configuration to aiohttp.TCPConnector or ClientSession. The aiohttp.ClientSession in bingx_user_stream.py is created without any proxy parameter. Deployment behind a corporate proxy or SOCKS proxy requires code changes.

Severity: Low (deployment constraint, not a correctness bug)

L10: 5-minute DNS cache TTL in WebSocket adapter — stale IPs on infrastructure change

File: bingx_user_stream.py:425

aiohttp.TCPConnector(limit=4, ttl_dns_cache=300)  # 300 seconds = 5 minutes

aiohttp caches DNS results for 5 minutes (ttl_dns_cache=300). If BingX changes server IPs during an infrastructure migration or failover, new connections made through an existing connector can keep using the stale IPs for up to 5 minutes. The connector is recreated on each WS reconnect, so a reconnect starts with an empty cache and resolves fresh addresses. The exposure is therefore limited to the period in which the system keeps the same connector alive without reconnecting: during that period, DNS changes go undetected for up to 5 minutes.

Severity: Low

L11: `getattr(intent, "limit_price", 0.0)` reads from dataclass field, not metadata dict; a metadata-only `limit_price` resolves to 0.0

File: bingx_venue.py:267

metadata["_limit_price"] = float(getattr(intent, "limit_price", 0.0) or 0.0)

intent.limit_price is a real field on KernelIntent (contracts.py:257, added in this version; default 0.0). If the policy layer sets intent.limit_price = 10.50, getattr(intent, "limit_price", 0.0) returns 10.50 and float(10.50) → 10.50, so the field path is correct. The or 0.0 is evaluated inside float(), so it only guards against a None field value (None or 0.0 → 0.0).

The gap is in where the value is read from: the _legacy_intent function (identical to H7) reads the dataclass field and never checks intent.metadata.get("limit_price"). If any caller passes limit_price via the metadata dict only, the field keeps its 0.0 default and the limit price is lost.

Severity: Low

L12: Backup diff — Rust kernel added 428 lines including entire dual-ledger accounting, 14+ bug fixes

Comparing the backup rust_kernel_src/lib.rs (1614 lines) against current _rust_kernel/src/lib.rs (2042 lines) reveals:

Bugs fixed between backup and current:

CANCEL now works on entry orders (backup only checked exit orders)
Partial fills now accumulate (backup overwrote filled_size)
Stale venue events on closed slots now rejected (TERMINAL_STATE guard, I13 fix)
CANCEL_ACK properly resets entry orders to IDLE
EXIT transition captures actual prev_state instead of hardcoded POSITION_OPEN
into_c_string sanitizes NUL bytes instead of panicking (G2 fix)
Null-string FFI returns diagnostic JSON instead of null pointer
invalid_intent_cstring() helper returns structured diagnostics
Reconcile validates slot invariants before applying

New Rust features:

AccountState dual-ledger struct with K-value vs E-fact reconcile rules
on_account_event() FFI for account-level events
set_seed_capital() FFI
INVALID_INTENT diagnostic code

Critical finding: The backup still has the entry-fill size overwrite bug (I1), the backward EXIT prev_state bug (G3), and the CANCEL-only-exit-order bug (G10). These were all fixed in the current code. The backup represents a pre-fix state that would double-settle PnL on partial fills.

Severity: Informational

L13: `_build_full_runtime` in gen_live_tests.py is never called — dead code

File: gen_live_tests.py:148-161

def _build_full_runtime(initial_capital):
    # Creates HazelcastDataFeed, DecisionEngine, IntentEngine, PinkDirectRuntime
    # ... but never called by any test function

This function wires the full production pipeline: HazelcastDataFeed + PinkDirectRuntime + DecisionEngine + IntentEngine. But every test function calls _build_runtime_bundle() instead, which returns a _RuntimeShim with zero fidelity (J16). The real PinkDirectRuntime — with step(), data feed, decision engine, intent engine — is never instantiated in any test.

Also: hz_client=build_projection(...) passes a HazelcastProjection (write-side wrapper) where a Hazelcast client object should go — type mismatch.

Severity: High

L14: `BingxUserStream` — `listenKeyExpired` raises RuntimeError instead of clean return — triggers full reconnect

File: bingx_user_stream.py:273

if frame.get("e") == "listenKeyExpired":
    raise RuntimeError("listenKeyExpired")

When the exchange sends listenKeyExpired, the code raises RuntimeError inside the _consume() async generator. This propagates to the outer subscribe() loop's try/except, which treats it as a connection failure — delays, creates a new listen key, reconnects. The proper behavior is to yield an ExchangeEvent(kind=RECONNECTED) and return cleanly, letting the caller handle the rotation without backoff delay.

Severity: Medium

L15: `BingxUserStream` — `_delete_listen_key` suppresses all exceptions — leaked keys on auth failures

File: bingx_user_stream.py:413-416

async def _delete_listen_key(self, listen_key):
    with suppress(Exception):
        await self._http.signed_delete_raw(...)

If the DELETE call fails (invalid signature, expired key, network error), the exception is swallowed. The old listen key remains active on BingX, wasting server resources. Over days of operation with unhandled auth failures, leaked listen keys accumulate server-side.

Severity: Low

L16: Backup diff — `venue_order_id` propagation logic has ambiguous target selection

File: _rust_kernel/src/lib.rs:1110-1125 (current code)

if !event.venue_order_id.is_empty() {
    let target = if slot.active_entry_order.is_some() {
        slot.active_entry_order.as_mut()
    } else {
        slot.active_exit_order.as_mut()
    };

If an entry order exists (even if fully filled and the slot is in POSITION_OPEN), ANY incoming event's venue_order_id propagates to the entry order — even if the event is for the exit order. The active_entry_order status might be FILLED but it's still Some(...), so the exit event's ID goes to the wrong order.

Severity: Medium

Pass 9 Summary

#	Flaw	Layer	Severity
L1	`KernelOutcome(accepted=True, diag=INVALID_INTENT)` parseable — no invariant check	Bridge	Medium
L2	`VenueEvent.filled_size > size` possible via different source fields	Venue	Medium
L3	`VenueEvent.price=0` reaches kernel — zero-price fill = 100% loss PnL	Venue	High
L4	`available_margin` set to cross-wallet balance, not available margin	Stream	High
L5	`wallet_balance` defaults to 0 when `"wb"` absent — E-side reconcile always ERROR	Stream	High
L6	`_keepalive_loop` no stop mechanism — runs on old key after rotation	Stream	Medium
L7	`event_id` integer 0 is falsy on `or` before `str()` runs → random UUID generated	Stream	Medium
L8	Hardcoded VST URLs in test generators — wrong env if LIVE configured	Test	Medium
L9	No proxy support — can't deploy behind corporate proxy	Network	Low
L10	5-minute DNS cache TTL — stale IPs on infrastructure change	Network	Low
L11	`limit_price` getattr reads dataclass field, not metadata dict	Venue	Low
L12	Backup diff: 14+ critical bugs fixed, 428-line dual-ledger accounting added	Rust	Info
L13	`_build_full_runtime` dead — real pipeline never tested	Test	High
L14	`listenKeyExpired` raises RuntimeError instead of clean yield	Stream	Medium
L15	`_delete_listen_key` suppresses all exceptions — leaked server keys	Stream	Low
L16	`venue_order_id` target selection ambiguous when entry order exists	Rust	Medium

Pass 9 Severity

Severity	Count
High	4 (L3, L4, L5, L13)
Medium	7 (L1, L2, L6, L7, L8, L14, L16)
Low	4 (L9, L10, L11, L15)
Info	1 (L12)

Combined Catalog (All 9 Passes)

Pass	Focus	Count	Critical	High	Medium	Low	Info
A	Architectural	15	0	2	0	2	11
T	Threading/Atomicity	9	1	3	3	2	0
E	E2E Trace (Pass 1)	26	0	4	10	11	1
F	Deep E2E (Pass 3)	30	0	1	8	17	4
G	Domain Scans (Pass 4)	36	4	11	11	8	2
H	Edge Domains (Pass 5)	22	3	9	5	4	1
I	Pass 6 (Math/Tests/Recovery/Security)	22	3	11	4	2	2
J	Pass 7 (Test Infra/Data/Rust/Env/Conn)	16	0	7	7	2	0
K	Pass 8 (Observability/Memory/Time/DeadCode)	23	2	6	7	1	7
L	Pass 9 (Contracts/Events/Network/FFI/Diffs)	16	0	4	7	4	1
Total		215	13	58	62	53	29

PASS 10 — RUNTIME, TEST BUGS, FSM AUDIT, PERSISTENCE, MEASUREMENT

M1: ENTER transition hardcodes `prev_state = IDLE` — every non-IDLE entry corrupts the audit trail

File: _rust_kernel/src/lib.rs:1117

let transition = self.transition(
    &slot,
    TradeStage::IDLE,           // HARDCODED — lies about actual prev_state
    slot.fsm_state.clone(),
    "ENTER_INTENT",
);

When a slot is entered from CLOSED (re-entry) or from any other state that passed the is_free() or same-trade bypass, the transition record claims prev_state = IDLE. This is always wrong unless the slot was genuinely IDLE. Every ENTER transition in the journal for a re-entered slot or a slot coming from CLOSED records an impossible transition (CLOSED → ORDER_REQUESTED recorded as IDLE → ORDER_REQUESTED).

This corrupts any downstream FSM analysis, journal audit, or trade-lifecycle reconstruction that relies on accurate prev_state values.

Severity: Critical

M2: CANCEL intent creates no transition record — invisible in audit log

File: _rust_kernel/src/lib.rs:1287-1305

The CANCEL branch in process_intent returns a KernelResult with no call to self.transition(). Every other intent (ENTER, EXIT, MARK_PRICE, RECONCILE) records a transition. CANCEL operations — including accepted cancels — are invisible in the transition audit log.

Additionally, CANCEL returns accepted = true but never mutates the slot's fsm_state. The slot stays in whatever state it was in. The caller sees accepted = true with no visible effect.

Severity: Critical

M3: `_mk_intent` test helper drops `order_type`/`limit_price` into `metadata` instead of setting proper fields

File: test_flaws.py:43

def _mk_intent(action, trade_id="t1", size=0.001, price=100.0, slot_id=0, **kw):
    return KernelIntent(
        ...
        metadata=kw,   # order_type="LIMIT" goes into metadata dict, not the dataclass field!
    )

KernelIntent has dedicated fields order_type: str = "MARKET" and limit_price: float = 0.0 (contracts.py:274-275), but _mk_intent passes **kw as metadata=kw. So _mk_intent(order_type="LIMIT") produces intent.order_type == "MARKET" (the default) while intent.metadata["order_type"] == "LIMIT".

The Flaw 6 tests in test_flaws.py that verify order_type/limit_price preservation through _legacy_intent pass for the wrong reason — they check legacy.metadata.get("order_type") which finds the value in the passthrough metadata, not because _legacy_intent correctly reads intent.order_type. If the production code changes and the test helper isn't fixed, the tests silently become false positives.

Severity: High
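One way to repair the helper is to route keyword arguments that name a dataclass field to that field and send only the remainder to metadata. The sketch below uses a cut-down stand-in for `KernelIntent` (the real dataclass has more fields) and a hypothetical helper name `mk_intent`.

```python
from dataclasses import dataclass, field, fields


@dataclass
class KernelIntent:
    # Stand-in with only the fields relevant to this finding.
    action: str
    trade_id: str = "t1"
    order_type: str = "MARKET"
    limit_price: float = 0.0
    metadata: dict = field(default_factory=dict)


# Every dataclass field except metadata itself may be set via keyword.
_INTENT_FIELDS = {f.name for f in fields(KernelIntent)} - {"metadata"}


def mk_intent(action: str, trade_id: str = "t1", **kw) -> KernelIntent:
    """Route keyword args that name a dataclass field to that field."""
    field_kw = {k: v for k, v in kw.items() if k in _INTENT_FIELDS}
    extra = {k: v for k, v in kw.items() if k not in _INTENT_FIELDS}
    return KernelIntent(action=action, trade_id=trade_id, metadata=extra, **field_kw)


intent = mk_intent("ENTER", order_type="LIMIT", limit_price=10.5, note="demo")
print(intent.order_type, intent.limit_price, intent.metadata)
# LIMIT 10.5 {'note': 'demo'}
```

With a helper like this, a test that asserts on `intent.order_type` exercises the field path rather than the metadata passthrough.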

M4: `test_cancel_entry_with_partial_fill` never sends a CANCEL — misnamed vacuous test

File: test_flaws.py:161-172

def test_cancel_entry_with_partial_fill(self):
    k = _fresh_kernel(scenario=MockVenueScenario(partial_fill_ratio=0.5))
    k.process_intent(_mk_intent(action=E.ENTER, trade_id="ce4", size=0.002))
    slot_after = k._get_slot(0)
    assert slot_after.size > 0, "Should have partial fill"

Named "Cancel entry with partial fill," belongs to TestFlaw1EntryCancel — but no CANCEL intent is ever sent. It only verifies that a partial fill occurred. The test is completely vacuous for its stated purpose.

The same pattern affects Flaw 9 tests — test_cancel_uses_slot_asset_not_trade_id and test_mock_venue_cancel_event_has_asset both have "cancel" in their names but never call any cancel function.

Severity: High

M5: Flaw 7 tests (`test_entry_exit_different_ratios`, `test_per_action_type_ratios`) never send EXIT

File: test_flaws.py Flaw 7 test class

Both tests set exit_partial_fill_ratio on the mock venue scenario but only ever process an ENTER intent. The exit_partial_fill_ratio is configured but never exercised. The tests verify entry partial fill behavior only — they don't test what their titles and class name claim.

Severity: Medium

M6: `test_dedup_window_accepts_many_events` uses wrong constant: actual=256, flaw claims 64; only 70 events sent

File: test_flaws.py:536-555

The Flaw 10 tests reference a 64-event dedup window, but the actual Rust constant is MAX_SEEN_EVENT_IDS = 256 (lib.rs:8). The test sends 70 events and asserts >= 70. Since 70 < 256, no eviction occurs. The test passes trivially regardless of whether the old-64-bound flaw exists. To meaningfully test eviction, >256 events would be needed.

Similarly, test_dedup_eviction_does_not_accept_old_event sends only 70 events then checks for dedup — with a 256-entry window, the first event is never evicted. The test verifies basic dedup (non-evicted), not eviction behavior.

Severity: Medium
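The difference between 70 and more than 256 events can be checked against a small model of a bounded dedup window. The model assumes first-in-first-out eviction, which this report does not confirm for the Rust kernel; only the capacity of 256 comes from the text above.

```python
from collections import deque

MAX_SEEN_EVENT_IDS = 256  # value cited above from lib.rs


class DedupWindow:
    """FIFO-bounded set of seen event IDs (eviction policy is an assumption)."""

    def __init__(self, capacity: int = MAX_SEEN_EVENT_IDS) -> None:
        self._order = deque()
        self._seen = set()
        self._capacity = capacity

    def accept(self, event_id: str) -> bool:
        """Return True for a fresh ID, False for a duplicate still in the window."""
        if event_id in self._seen:
            return False
        if len(self._order) >= self._capacity:
            self._seen.discard(self._order.popleft())  # evict the oldest ID
        self._order.append(event_id)
        self._seen.add(event_id)
        return True


def replay_first_after(n_events: int) -> bool:
    """Send n distinct events, then replay the first; True means it was re-accepted."""
    window = DedupWindow()
    for i in range(n_events):
        window.accept(f"evt-{i}")
    return window.accept("evt-0")


print(replay_first_after(70))   # False: no eviction, plain dedup only
print(replay_first_after(257))  # True: evt-0 was evicted and is accepted again
```

Under this model an eviction test needs at least 257 distinct events before replaying the first one.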

M7: `test_outcome_state_matches_actual_slot` is tautological — compares value with itself

File: test_flaws.py:200-210

result = k.process_intent(_mk_intent(action=E.ENTER, trade_id="oc1"))
slot = k._get_slot(0)
assert result.state == slot.fsm_state  # failure message elided

result.state is set from final_slot.fsm_state (which comes from self._get_slot(outcome.slot_id) inside process_intent). The test then calls k._get_slot(0) again. Both read from the same Rust backend — they must be equal by construction. This test proves nothing; it's a tautology.

Severity: Low

M8: ORDER_ACK silent fallthrough when no active order — accepts event with no effect

File: _rust_kernel/src/lib.rs:1476-1498

When on_venue_event receives an ORDER_ACK for a slot with neither active_entry_order nor active_exit_order (shouldn't happen normally, but possible after a reconcile or race), the match arm executes no branch. The state is unchanged, diagnostic_code stays OK, and accepted = true. The event is silently accepted with no effect — no diagnostic, no warning.

The same bug exists for CANCEL_ACK (line 1545): if no matching active order exists, the event is silently accepted with no state change and OK diagnostic.

Severity: Medium

M9: ORDER_REJECT on POSITION_OPEN with stale entry order destroys the position

File: _rust_kernel/src/lib.rs:1499-1530

KernelEventKind::ORDER_REJECT => {
    if slot.active_entry_order.is_some() && slot.fsm_state != TradeStage::POSITION_OPEN {
        // clear entry, wipe trade data, set IDLE
    } else if slot.active_exit_order.is_some() {
        // clear exit order only, set POSITION_OPEN
    } else {
        // no match — reset to IDLE
    }
}

If a slot is in POSITION_OPEN (position active) but active_entry_order is still Some (stale — didn't get cleared on fill), the entry-reject guard fsm_state != POSITION_OPEN prevents the entry path. It falls to the exit check. If no exit order, the final else branch fires — resetting the slot to IDLE and destroying the open position and all trade data.

Severity: Critical
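The fallthrough can be traced with a Python transcription of the three branches. `Slot` is a simplified stand-in, the exact fields wiped by the final branch are an assumption based on the description above, and `on_order_reject_guarded` shows one possible guard rather than the project's fix.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slot:
    # Simplified stand-in for the Rust slot; states are plain strings here.
    fsm_state: str
    active_entry_order: Optional[str] = None
    active_exit_order: Optional[str] = None
    size: float = 0.0


def on_order_reject(slot: Slot) -> str:
    """Python transcription of the three ORDER_REJECT branches shown above."""
    if slot.active_entry_order is not None and slot.fsm_state != "POSITION_OPEN":
        slot.active_entry_order, slot.size, slot.fsm_state = None, 0.0, "IDLE"
        return "entry_rejected"
    if slot.active_exit_order is not None:
        slot.active_exit_order, slot.fsm_state = None, "POSITION_OPEN"
        return "exit_rejected"
    # Fallback branch: wipes the slot even when a position is open.
    slot.active_entry_order, slot.size, slot.fsm_state = None, 0.0, "IDLE"
    return "fallback_reset"


def on_order_reject_guarded(slot: Slot) -> str:
    """One possible guard (an assumption): never reset an open position."""
    if slot.fsm_state == "POSITION_OPEN" and slot.active_exit_order is None:
        slot.active_entry_order = None  # drop the stale entry order only
        return "stale_entry_cleared"
    return on_order_reject(slot)


slot = Slot("POSITION_OPEN", active_entry_order="stale-entry", size=0.002)
print(on_order_reject(slot), slot.fsm_state, slot.size)
# fallback_reset IDLE 0.0
```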

M10: No aggregation of any metric — trade count, success/fail, latency all zero

File: entire codebase

The following metrics are completely impossible to obtain from the current system:

Metric	Why unavailable
Total trades processed	`trade_seq` declared on `AccountSnapshot` but never incremented anywhere
Succeeded vs failed trades	No aggregation of `KernelDiagnosticCode` outcomes
PnL per individual trade	`slot.realized_pnl` is overwritten on slot reuse — no per-trade persistence
Slippage (fill vs intended price)	Data exists transiently but no computed metric
API calls per minute	No call counters anywhere in the venue adapter
`process_intent` latency	Zero timing instrumentation — no `time.monotonic()` in kernel path
Process memory usage	No memory tracking of any kind
Deduplicated vs fresh event count	Dedup detection exists but is never counted

The AccountSnapshot.trade_seq field (account.py:27) is declared as trade_seq: int = 0 but never assigned — no code path ever sets it above 0. It's a dead field.

Severity: High

M11: Flaw 6 tests pass via metadata passthrough, not via `_legacy_intent` field logic

File: test_flaws.py Flaw 6 tests

The two Flaw 6 tests verify that _legacy_intent preserves order_type and limit_price. They pass because _mk_intent(order_type="LIMIT") puts the value into intent.metadata, and _legacy_intent copies intent.metadata into legacy.metadata verbatim. The tests check legacy.metadata.get("order_type") which finds the value in the passthrough — not because _legacy_intent reads intent.order_type correctly.

_legacy_intent actually reads getattr(intent, "order_type", "MARKET") which returns "MARKET" (the default, since _mk_intent put it in metadata not the field), and sets legacy.metadata["_order_type"] = "MARKET". The assertion passes via the wrong code path. If _legacy_intent stopped reading intent.order_type entirely, the tests would still pass as long as intent.metadata is passed through.

Severity: High

M12: No retry or fallback for ClickHouse INSERT failures

Evidence across all persistence paths: every sink(table, row) call in pink_clickhouse.py is unprotected. If ClickHouse is unreachable, slow, or returns an error, the exception propagates unhandled through persist_step() → step(). No retry, no backoff, no fallback, no queue, no error reporting to anomaly_events.

This means a transient ClickHouse outage (common in cloud deployments) crashes the entire policy cycle. The slot state in the Rust kernel may be lost as the exception unwinds.

Severity: High
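A retry-plus-spill wrapper around the sink could look like the sketch below. It assumes only that the sink is a callable taking `(table, row)`, as described above; `ResilientSink`, its parameters, and the in-memory queue are hypothetical, and a production version would also need to report failures (for example to anomaly_events).

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple

Row = dict
Sink = Callable[[str, Row], None]


class ResilientSink:
    """Wrap a sink(table, row) callable with bounded retries and a spill queue."""

    def __init__(self, sink: Sink, attempts: int = 3, base_delay: float = 0.05,
                 max_pending: int = 10_000) -> None:
        self._sink = sink
        self._attempts = attempts
        self._base_delay = base_delay
        # Bounded so a long outage cannot grow memory without limit.
        self.pending: Deque[Tuple[str, Row]] = deque(maxlen=max_pending)

    def write(self, table: str, row: Row) -> bool:
        """Return True if the row was inserted, False if it was queued."""
        for attempt in range(self._attempts):
            try:
                self._sink(table, row)
                return True
            except Exception:
                if attempt + 1 < self._attempts:
                    # Exponential backoff between attempts.
                    time.sleep(self._base_delay * (2 ** attempt))
        self.pending.append((table, row))  # keep the row; do not raise into step()
        return False

    def flush(self) -> int:
        """Retry each queued row once; return how many were written."""
        written = 0
        for _ in range(len(self.pending)):
            table, row = self.pending.popleft()
            if self.write(table, row):
                written += 1
        return written
```

The blocking `time.sleep` keeps the sketch short; inside a latency-sensitive policy cycle the retries would more plausibly run on a background worker.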

M13: `AccountSnapshot.trade_seq` declared but never incremented — dead field

File: account.py:27

@dataclass
class AccountSnapshot:
    ...
    trade_seq: int = 0

This field is part of the AccountSnapshot dataclass. It's initialized to 0 and never assigned or incremented anywhere in the entire codebase. Every snapshot from kernel.snapshot()["account"] returns trade_seq: 0. Despite being a standard field in every persistence row, it's always 0 — making it impossible to order trades chronologically by sequence number from any persisted data.

Severity: Medium

M14: `test_reentry_after_full_close_no_pnl_loss` uses absurdly loose 50% bound

File: test_flaws.py:686-706

assert abs(cap_after_second - cap_before) < cap_before * 0.5

Allows a 50% capital deviation (12,500 USDT on 25,000). The actual PnL from the test's tiny trades (~0.02 USDT) is orders of magnitude smaller. A bug that silently leaked 10,000 USDT of PnL would pass this test. The bound provides no meaningful verification.

Also: the test never checks diagnostic_code for the warning it claims to test (already documented as I7 weakness).

Severity: Low
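The effect of the bound can be checked numerically. The sketch compares the 50% check quoted above with an absolute tolerance around the expected PnL; the 0.01 USDT tolerance and the helper names are illustrative choices, not values from the test suite.

```python
import math


def loose_check(cap_before: float, cap_after: float) -> bool:
    # The bound quoted above: any deviation under 50% of capital passes.
    return abs(cap_after - cap_before) < cap_before * 0.5


def tight_check(cap_before: float, cap_after: float, expected_pnl: float,
                abs_tol: float = 0.01) -> bool:
    # Compare against the PnL the test's trades should produce, within abs_tol USDT.
    return math.isclose(cap_after, cap_before + expected_pnl,
                        rel_tol=0.0, abs_tol=abs_tol)


cap_before = 25_000.0
leaked = cap_before - 10_000.0  # hypothetical bug: 10,000 USDT vanishes
print(loose_check(cap_before, leaked))                     # True: the bug passes
print(tight_check(cap_before, leaked, expected_pnl=0.02))  # False: the bug is caught
```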

M15: `test_reconcile_rejects_position_open_with_zero_size` passes even if reconcile silently ignores bad data

File: test_flaws.py:568-585

result = k.reconcile_from_slots([bad_slot])
slot = k._get_slot(0)
assert slot.fsm_state != TradeStage.POSITION_OPEN or slot.size > 0

The assertion was true before calling reconcile (slot starts IDLE with size=0). The test never checks result.accepted == False or verifies the diagnostic code. If reconcile_from_slots silently ignores the bad slot and returns accepted=True, the test still passes — it only proves the slot wasn't in POSITION_OPEN after reconcile, which was already true.

The same structural weakness exists in test_reconcile_rejects_idle_with_nonzero_size.

Severity: Low

M16: No built-in metric for active slot count, event throughput, or memory usage

The following operational metrics cannot be obtained without writing custom code:

Active slot count: len([s for s in kernel.state.slots if not s.is_free()]) — requires Python access to the ExecutionKernel object. No active_slot_count property exists.
Total event count: No counter. The journal tracks individual transitions but there's no total_events_processed: int anywhere.
Memory usage: No tracemalloc, no psutil, no RSS polling. Nothing.
Runtime uptime: No start_time or uptime() method anywhere.

Severity: Medium
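A minimal in-process counter object would cover event totals, outcome counts, latency, and uptime from the list above. Everything here is a hypothetical sketch: `RuntimeMetrics`, `timed_call`, and `active_slot_count` do not exist in the codebase, and the only project-specific assumption is that slots expose `is_free()` as in the expression quoted above.

```python
import time
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class RuntimeMetrics:
    """In-process counters; names are illustrative, not an existing API."""
    started_at: float = field(default_factory=time.monotonic)
    events_total: int = 0
    outcomes: Counter = field(default_factory=Counter)
    latency_total_s: float = 0.0

    def observe(self, diagnostic_code: str, elapsed_s: float) -> None:
        self.events_total += 1
        self.outcomes[diagnostic_code] += 1
        self.latency_total_s += elapsed_s

    def uptime_s(self) -> float:
        return time.monotonic() - self.started_at

    def mean_latency_s(self) -> float:
        return self.latency_total_s / self.events_total if self.events_total else 0.0


def timed_call(metrics: RuntimeMetrics, fn, *args):
    """Run fn(*args) under the monotonic clock; fn must return a diagnostic code."""
    start = time.monotonic()
    code = fn(*args)
    metrics.observe(code, time.monotonic() - start)
    return code


def active_slot_count(slots) -> int:
    # Same predicate as the expression quoted above, wrapped as a helper.
    return sum(1 for s in slots if not s.is_free())
```

Using `time.monotonic()` rather than wall-clock time keeps latency and uptime figures unaffected by clock adjustments.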

M17: M4 duplicate — test_cancel_uses_slot_asset_not_trade_id and test_mock_venue_cancel_event_has_asset never call cancel

File: test_flaws.py Flaw 9 class

Both tests verify that an entry order's metadata contains an asset key. They never call scenario.cancel() or k.process_intent(action=CANCEL). Despite their names and class (TestFlaw9CancelSymbolFallback), they test metadata preservation on entry, not cancel behavior.

Severity: High

M18: `_decision_to_kernel_intent` drops `order_type` and `limit_price` — LIMIT orders unreachable from the runtime

File: pink_direct.py:79-115 (inferred from E2E trace)

The bridge function converts a Decision to a KernelIntent. It sets timestamp, intent_id, trade_id, asset, side, action, reference_price, target_size, leverage, exit_leg_ratios, reason, and metadata. It does NOT set order_type or limit_price — both default to "MARKET" and 0.0.

Even if the DecisionEngine produced a LIMIT decision with a limit price, the runtime has no path to express it. The entire LIMIT-order pipeline is dead code from the runtime — LIMIT orders can only be set via direct KernelIntent(...) construction in tests, which is itself broken (M3).

Severity: High

Pass 10 Summary

#	Flaw	Layer	Severity
M1	ENTER transition hardcodes prev_state=IDLE — audit trail lies for re-entries	Rust	Critical
M2	CANCEL creates no transition record — invisible in audit log	Rust	Critical
M3	`_mk_intent` drops order_type/limit_price into metadata, not proper field	Test	High
M4	test_cancel_entry_with_partial_fill never sends CANCEL — misnamed vacuous test	Test	High
M5	Flaw 7 tests never send EXIT — exit_partial_fill_ratio untested	Test	Medium
M6	test_dedup tests use wrong constant (actual=256, claim 64) — 70 events insufficient	Test	Medium
M7	test_outcome_state_matches_actual_slot is tautological	Test	Low
M8	ORDER_ACK silent fallthrough when no active order — accepted with no effect	Rust	Medium
M9	ORDER_REJECT on POSITION_OPEN with stale entry order destroys position	Rust	Critical
M10	No aggregation of trade count, success/fail, latency — all zero	All	High
M11	Flaw 6 tests pass via metadata passthrough, not field logic	Test	High
M12	No retry/fallback for ClickHouse INSERT failures — crashes policy cycle	Persistence	High
M13	AccountSnapshot.trade_seq never incremented — always 0	Account	Medium
M14	test_reentry_after_full_close_no_pnl_loss uses 50% bound — absurd	Test	Low
M15	test_reconcile_rejects_position_open_with_zero_size passes for wrong reason	Test	Low
M16	No built-in metric for active slots, event throughput, or memory	All	Medium
M17	Flaw 9 tests named for cancel but never call cancel	Test	High
M18	_decision_to_kernel_intent drops order_type and limit_price — LIMIT dead from runtime	Runtime	High

Pass 10 Severity

Severity	Count
Critical	3 (M1, M2, M9)
High	7 (M3, M4, M10, M11, M12, M17, M18)
Medium	5 (M5, M6, M8, M13, M16)
Low	3 (M7, M14, M15)

Combined Catalog (All 10 Passes)

Pass	Focus	Count	Critical	High	Medium	Low	Info
A	Architectural	15	0	2	0	2	11
T	Threading/Atomicity	9	1	3	3	2	0
E	E2E Trace (Pass 1)	26	0	4	10	11	1
F	Deep E2E (Pass 3)	30	0	1	8	17	4
G	Domain Scans (Pass 4)	36	4	11	11	8	2
H	Edge Domains (Pass 5)	22	3	9	5	4	1
I	Pass 6 (Math/Tests/Recovery/Security)	22	3	11	4	2	2
J	Pass 7 (Test Infra/Data/Rust/Env/Conn)	16	0	7	7	2	0
K	Pass 8 (Observability/Memory/Time/DeadCode)	23	2	6	7	1	7
L	Pass 9 (Contracts/Events/Network/FFI/Diffs)	16	0	4	7	4	1
M	Pass 10 (Runtime/TestBugs/FSM/Persistence/Metrics)	18	3	7	5	3	0
Total		233	16	65	67	56	29

PASS 11 — ASYNC/SYNC SEAMS, LOCK ANALYSIS, THREADING

N1: Rust kernel `with_handle_mut` has zero synchronization — `&mut KernelCore` from raw pointer, UB on concurrent FFI

File: _rust_kernel/src/lib.rs:2042

fn with_handle_mut<F, R>(handle: *mut KernelHandle, f: F) -> Result<R, String>
where
    F: FnOnce(&mut KernelCore) -> Result<R, String>,
{
    // Safety: single-threaded; caller holds exclusive access for the duration.
    let core = unsafe { &mut (*handle).core };   // raw ptr → &mut

The comment says "single-threaded" but provides zero enforcement: no Mutex, no RwLock, no atomic flag, no thread-local constraints, no !Send/!Sync marker on KernelCore. The unsafe block converts a raw pointer to a &mut reference, which under Rust's aliasing rules must be exclusive. Two simultaneous &mut references to the same data are undefined behavior (data race, torn reads, LLVM miscompilation).

The ctypes FFI mechanism releases the GIL during the Rust call (Py_BEGIN_ALLOW_THREADS/Py_END_ALLOW_THREADS). Two Python threads can call any two dita_kernel_* functions simultaneously — one in process_intent (writing slot state), another in snapshot_json (reading). Both produce &mut KernelCore. This is a compiler-level UB, not just a logical race.

Trigger scenario: Thread A calls process_intent() (ENTRY fill → mutates slot). Thread B calls on_venue_event() (exit fill → mutates slot). The GIL is released during both Rust FFI calls. Both get &mut KernelCore. The Rust compiler can reorder, elide, or speculate any memory operation. Slot data becomes corrupted, PnL doubles, or the process segfaults.

Severity: Critical — undefined behavior, no enforcement, no mitigation.
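Until the Rust side enforces exclusivity itself, the Python bridge could serialize every FFI call behind one lock. The sketch below is a stop-gap under stated assumptions: `SerializedKernel` is a hypothetical wrapper, `FakeBackend` stands in for the ctypes-backed kernel, and the lock protects only callers that actually go through the wrapper.

```python
import threading
import time


class SerializedKernel:
    """Route every backend call through a single lock."""

    def __init__(self, backend) -> None:
        self._backend = backend
        self._lock = threading.Lock()

    def call(self, method: str, *args, **kwargs):
        # Only one thread at a time may be inside the backend.
        with self._lock:
            return getattr(self._backend, method)(*args, **kwargs)


class FakeBackend:
    """Stand-in for the FFI kernel; records how many callers overlap."""

    def __init__(self) -> None:
        self.inside = 0
        self.max_inside = 0

    def process_intent(self, payload):
        self.inside += 1
        self.max_inside = max(self.max_inside, self.inside)
        time.sleep(0.001)  # simulates the window where the GIL is released
        self.inside -= 1
        return payload


def hammer(kernel: SerializedKernel, threads: int = 8, calls: int = 5) -> None:
    """Call process_intent from several threads at once."""
    def worker() -> None:
        for i in range(calls):
            kernel.call("process_intent", {"i": i})

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
```

A Mutex around KernelCore on the Rust side would be the stronger fix, since it holds regardless of which language or thread makes the call.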

N2: `_run()` has two completely different code paths depending on event loop state — runtime branch, not design decision

File: bingx_venue.py:225-238

def _run(self, result):
    if inspect.isawaitable(result):
        try:
            asyncio.get_running_loop()
        except RuntimeError:
            return asyncio.run(result)           # Path A: no loop → direct run
        pool = self._get_executor()
        return pool.submit(asyncio.run, result).result()  # Path B: loop → pool + block
    return result

Path A (no event loop running): asyncio.run(result) — creates a new event loop, runs the coroutine, closes it. All on the same thread. Correct for sync contexts.

Path B (event loop running): pool.submit(asyncio.run, result).result() — submits to a 3-thread pool, each worker creates yet ANOTHER event loop via asyncio.run(), then blocks the calling thread with .result().

The asyncio.get_running_loop() check is a runtime probe — the code doesn't know from its design whether it's in an async context. Same logical operation (run a coroutine), two completely different implementations. Path B is a documented anti-pattern (creating/destroying event loops per call), Path A is correct.

This is the root cause of the entire async/sync seam problem — the architecture never committed to being async or sync.

Severity: Critical

N3: `_run()` Path B blocks the event loop thread for every venue HTTP operation

File: bingx_venue.py:236

return pool.submit(asyncio.run, result).result()  # BLOCKS calling thread

When called from within a running event loop (all live tests, any async deployment), .result() blocks the event loop thread until the thread pool worker completes. During this block:

No WS messages can be received from the BingxUserStream
No keepalive tasks can run
No timer-based events can fire
The event loop is stuck

If the thread pool is exhausted (3 concurrent HTTP calls, e.g., _backend_snapshot from submit(), which calls it twice, plus cancel(), which calls it three times), the 4th call blocks at .result() until a worker frees up: the work item is queued but no worker is available. If the busy workers never return, this becomes a stuck-process scenario where the entire system freezes.

The event loop thread is blocked on .result(), which means it cannot process the WS events that might contain the fill for the order it just submitted. If the exchange fills instantly, the WS message arrives before .result() returns — the WS data sits in the kernel's TCP receive buffer, unprocessed, until process_intent completes and the event loop can schedule the WS reader again. This delay can cause stale fills, missed state transitions, or WS timeouts.

Severity: Critical
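The blocking effect can be reproduced in isolation. The sketch below uses hypothetical names (`slow_worker`, `blocking_bridge`, `awaiting_bridge`). It schedules a callback on the running loop, then waits for a pool worker either with `.result()` or with `await loop.run_in_executor(...)`. The worker reports what the loop managed to run while it was busy.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def slow_worker(ticks):
    time.sleep(0.05)     # stands in for the HTTP round-trip
    return list(ticks)   # what the loop ran while the worker was busy


async def blocking_bridge():
    ticks = []
    loop = asyncio.get_running_loop()
    loop.call_soon(ticks.append, "tick")  # runnable as soon as the loop gets control
    with ThreadPoolExecutor(max_workers=1) as pool:
        # .result() blocks the loop thread, so the callback cannot run yet.
        seen = pool.submit(slow_worker, ticks).result()
    return seen


async def awaiting_bridge():
    ticks = []
    loop = asyncio.get_running_loop()
    loop.call_soon(ticks.append, "tick")
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Awaiting yields to the loop, so the callback runs during the wait.
        seen = await loop.run_in_executor(pool, slow_worker, ticks)
    return seen
```

Whether the venue adapter can expose an awaitable seam like this is a design question the trace does not settle; the sketch only shows the difference in loop behavior.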

N4: `asyncio.run()` called repeatedly inside thread pool — creates/destroys event loops per call, documented anti-pattern

File: bingx_venue.py:236

return pool.submit(asyncio.run, result).result()

Each call to asyncio.run() creates a new SelectorEventLoop, runs it, then closes it. Doing this repeatedly for every HTTP call is a documented CPython anti-pattern:

Each loop allocation costs memory (selector, callbacks, timeout queue)
Each loop destruction leaves loop-internal objects for GC
Over many calls (hundreds of trades), this creates GC pressure and memory fragmentation
The asyncio.run() documentation says the function "should ideally only be called once", as the main entry point of a program; a long-lived loop avoids the repeated setup and teardown

Path A (no event loop) has the same issue — asyncio.run() is called per-_run() invocation.

The total cost: each process_intent() may call _run() 3-4 times (_backend_snapshot ×2 + submit_intent + optionally cancel). Each _run() creates/destroys an event loop. With 10 trades/min, that's 30-40 event loop creations/destructions per minute.

Severity: Critical
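A minimal sketch of the long-lived-loop alternative, assuming a dedicated background thread is acceptable. `LoopThread` and `current_loop_id` are hypothetical names. One event loop is created once and reused for every coroutine via `asyncio.run_coroutine_threadsafe`.

```python
import asyncio
import threading


class LoopThread:
    """Hypothetical helper: one background event loop reused for every call."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro, timeout=None):
        # Schedule on the shared loop and wait from the caller's thread.
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout)

    def close(self):
        # Stop the loop from its own thread, then release its resources.
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def current_loop_id():
    return id(asyncio.get_running_loop())
```

Calling `run()` from a thread that is itself running an event loop still blocks that loop (the N3 problem), so this addresses the per-call loop churn only.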

N5: `_snapshot_ready` Event cascading re-fetch — N concurrent callers produce N overlapping HTTP calls

File: bingx_venue.py:258-274

def _backend_snapshot(self, ...):
    if not self._snapshot_ready.wait(timeout=timeout_ms / 1000.0):
        return self._last_snapshot  # stale
    self._snapshot_ready.clear()
    try:
        snapshot = self._call_backend("refresh_state", ...)  # HTTP call
    except Exception:
        self._snapshot_ready.set()
        raise
    with self._snap_lock:
        self._last_snapshot = snapshot
    self._snapshot_ready.set()

When _snapshot_ready.set() fires at the end, ALL threads waiting on .wait() wake up. Each one proceeds to clear() and start a new HTTP call — even though a fresh snapshot was just written. With N concurrent callers to _backend_snapshot, this produces N overlapping refresh_state HTTP calls instead of N-1 callers reading the just-received result.

On BingX VST (rate limit ~10 req/s), 3 overlapping refresh_state calls (each doing 5 parallel sub-requests) issue 15 requests against a budget of roughly 10 per second. The calls overlap and cascade, wasting rate-limit capacity on redundant work.

Severity: High
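One way to let concurrent callers share a single refresh is a "single-flight" gate. The sketch below is an illustration under assumptions, not the adapter's code: `SingleFlight`, `fetch`, and `demo_fetch_count` are hypothetical. The first caller becomes the leader and performs the fetch; callers that arrive while it is in flight wait and reuse its result.

```python
import threading
import time


class SingleFlight:
    """Hypothetical helper: concurrent callers share one in-flight fetch."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._lock = threading.Lock()
        self._flight = None  # record of the fetch currently in progress

    def get(self):
        with self._lock:
            flight = self._flight
            leader = flight is None
            if leader:
                flight = self._flight = {
                    "done": threading.Event(), "value": None, "error": None,
                }
        if leader:
            try:
                flight["value"] = self._fetch()
            except Exception as exc:
                flight["error"] = exc
            finally:
                with self._lock:
                    self._flight = None   # the next caller starts a new fetch
                flight["done"].set()      # wake every follower
        else:
            flight["done"].wait()
        if flight["error"] is not None:
            raise flight["error"]
        return flight["value"]


def demo_fetch_count(callers=4):
    calls = []

    def fetch():
        calls.append(1)
        time.sleep(0.2)  # stands in for the refresh_state round-trip
        return {"seq": len(calls)}

    gate = SingleFlight(fetch)
    results = []
    threads = [
        threading.Thread(target=lambda: results.append(gate.get()))
        for _ in range(callers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return len(calls), results
```

Followers receive the leader's result or its exception; no staleness policy is implied beyond that.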

N6: `BingxUserStream.close()` does not cancel pending tasks — keepalive/rotation tasks continue after close

File: bingx_user_stream.py:160-169

async def close(self) -> None:
    self._closed.set()
    if self._session is not None and not self._session.closed:
        await self._session.close()

close() sets the _closed event and closes the aiohttp session. It does not cancel the keepalive_task or rotation_task created inside subscribe(). These tasks are only cancelled in the finally block of subscribe(). If close() is called while nobody is iterating the subscribe() generator (or if iteration is blocked in _consume()), those tasks keep running until:

The event loop shuts down (automatic task cancellation)
The subscribe generator is garbage collected
An exception occurs in the WS reader

During this window, the keepalive loop continues sending PUT requests to the (now potentially deleted) listen key. The rotation task continues its 23h50m sleep. Both are zombie tasks with no cleanup path.

Severity: Medium
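A sketch of a close() that owns its background tasks, assuming the stream can track them in a set. `StreamSketch`, `spawn`, and `demo_close` are hypothetical names; the task names mirror the ones used in the source.

```python
import asyncio


class StreamSketch:
    """Hypothetical stand-in for a stream that owns background tasks."""

    def __init__(self):
        self._closed = asyncio.Event()
        self._tasks = set()

    def spawn(self, coro, name):
        task = asyncio.create_task(coro, name=name)
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)  # drop finished tasks
        return task

    async def close(self):
        self._closed.set()
        tasks = list(self._tasks)
        for task in tasks:
            task.cancel()
        # Wait for cancellation to finish; return_exceptions keeps
        # CancelledError from propagating out of close().
        await asyncio.gather(*tasks, return_exceptions=True)


async def demo_close():
    stream = StreamSketch()
    keepalive = stream.spawn(asyncio.sleep(3600), "lk_keepalive")
    rotation = stream.spawn(asyncio.sleep(3600), "lk_rotation")
    await asyncio.sleep(0)  # let both tasks start
    await stream.close()
    return keepalive.cancelled(), rotation.cancelled(), len(stream._tasks)
```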

N7: Live test architecture forces worst-case `_run()` behavior for every operation

File: gen_live_tests.py, gen2.py, _gen_test.py (all test generators)

The live tests use this pattern:

def test_pink_ditav2_xxx(_live_client) -> None:
    ...
    result = asyncio.run(_run_scenario(bundle, _live_client, body_fn, name, ic))

Each test is a synchronous function that calls asyncio.run(). Inside the resulting event loop, every call to k.process_intent() triggers Path B of _run() — the pool-submit-.result() path. The test architecture forces the architecture's slowest, most thread-expensive code path for every single intent.

Every HTTP call: creates a new event loop on a pool thread → blocks the main event loop thread → blocks WS processing → wastes pool slots. Even for trivial mock-venue tests that don't need HTTP at all, the architecture still goes through the same _run() → pool → .result() path because the mock venue also returns awaitables.

Severity: Medium

N8: `BingxUserStream subscribe()` creates new tasks on every reconnect — rapid reconnect causes task churn

File: bingx_user_stream.py:100-120

async def subscribe(self):
    while not self._closed.is_set():
        ...
        keepalive_task = asyncio.create_task(self._keepalive_loop(listen_key))
        rotation_task = asyncio.create_task(self._rotation_sentinel())
        ...
        async for event in self._consume(listen_key, rotation_task):
            yield event

Each iteration of the reconnect loop creates new keepalive_task and rotation_task, then cancels the previous ones in the finally block. If the connection drops every few seconds (unstable WS), tasks are created and cancelled in rapid succession. Cancellation races with task creation: a task cancelled before its first step never enters its coroutine body, so any setup or cleanup written inside that body never runs.

Also: no rate limiting on the reconnect loop beyond the delay_ms exponential backoff. If the WS repeatedly fails immediately after connection, the loop creates/destroys tasks in a tight cycle.

Severity: Medium

N9: No `asyncio.all_tasks()` or task accounting anywhere — leaked tasks undetectable

No code in the entire workspace calls asyncio.all_tasks() or maintains a task registry. If a task is leaked (cancellation not propagated, generator not cleaned up), there is:

No way to detect it programmatically
No warning log
No metrics
No __del__ fallback

Combined with N6 (tasks not cancelled on close) and N8 (task churn on reconnect), leaked tasks accumulate silently. Each leaked task holds references to its coroutine frame, which may hold references to aiohttp.ClientSession, websocket connections, and other resources.

Severity: Low
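Leaked tasks can at least be made visible with `asyncio.all_tasks()`. The sketch below is a hypothetical diagnostic (`pending_task_names` and `demo_leak_report` are illustrative names) that lists tasks still pending on the running loop, excluding the caller.

```python
import asyncio


async def pending_task_names():
    """Return names of tasks still pending on this loop, excluding the caller."""
    current = asyncio.current_task()
    return sorted(
        task.get_name()
        for task in asyncio.all_tasks()
        if task is not current and not task.done()
    )


async def demo_leak_report():
    leak = asyncio.create_task(asyncio.sleep(3600), name="lk_keepalive")
    await asyncio.sleep(0)                  # let the task start
    before = await pending_task_names()
    leak.cancel()
    await asyncio.gather(leak, return_exceptions=True)
    after = await pending_task_names()
    return before, after
```

A check like this could run at shutdown or in test teardown and log whatever it finds.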

N10: `_snap_lock` / `_snapshot_ready` pattern has no reader-side protection on `_last_snapshot`

File: bingx_venue.py:258-274

The _snap_lock protects _last_snapshot only during writes (line 269-271). The fallback path (timeout at line 260-262) also reads _last_snapshot under _snap_lock. But the _call_backend call at line 266 is outside the lock — the snapshot is fetched without holding _snap_lock, which is correct (don't hold a lock across HTTP). However, the time between releasing the lock and reacquiring it for the write (line 269) means another thread could also be writing _last_snapshot concurrently. The _snap_lock ensures only one write at a time, but the _last_snapshot can still be overwritten between threads — this is the intended behavior (last writer wins for staleness purposes, not a correctness bug).

Severity: Informational

Pass 11 Summary

#	Flaw	Layer	Severity
N1	Rust kernel `with_handle_mut` zero synchronization — `&mut` from raw ptr, UB on concurrent FFI	Rust	Critical
N2	`_run()` has two completely different code paths — runtime branch, not design decision	Venue	Critical
N3	`_run()` path B blocks event loop thread for every venue HTTP operation	Venue	Critical
N4	`asyncio.run()` called repeatedly — creates/destroys event loops per call, documented anti-pattern	Venue	Critical
N5	`_snapshot_ready` cascading re-fetch — N callers produce N overlapping HTTP calls	Venue	High
N6	`BingxUserStream.close()` doesn't cancel pending tasks — zombie keepalive/rotation after close	Stream	Medium
N7	Live test architecture forces worst-case `_run()` path for every operation	Test	Medium
N8	`subscribe()` reconnect creates new tasks per iteration — rapid reconnect causes task churn	Stream	Medium
N9	No `asyncio.all_tasks()` or task accounting — leaked tasks undetectable	All	Low
N10	`_snap_lock`/`_snapshot_ready` no reader-side protection (informational)	Venue	Info

Pass 11 Severity

Severity	Count
Critical	4 (N1, N2, N3, N4)
High	1 (N5)
Medium	3 (N6, N7, N8)
Low	1 (N9)
Info	1 (N10)

Combined Catalog (All 11 Passes)

Pass	Focus	Count	Critical	High	Medium	Low	Info
A	Architectural	15	0	2	0	2	11
T	Threading/Atomicity	9	1	3	3	2	0
E	E2E Trace (Pass 1)	26	0	4	10	11	1
F	Deep E2E (Pass 3)	30	0	1	8	17	4
G	Domain Scans (Pass 4)	36	4	11	11	8	2
H	Edge Domains (Pass 5)	22	3	9	5	4	1
I	Pass 6 (Math/Tests/Recovery/Security)	22	3	11	4	2	2
J	Pass 7 (Test Infra/Data/Rust/Env/Conn)	16	0	7	7	2	0
K	Pass 8 (Observability/Memory/Time/DeadCode)	23	2	7	7	1	6
L	Pass 9 (Contracts/Events/Network/FFI/Diffs)	16	0	4	8	4	0
M	Pass 10 (Runtime/TestBugs/FSM/Persistence/Metrics)	18	3	7	5	3	0
N	Pass 11 (Async/Sync Seams/Locks/Threading)	10	4	1	3	1	1
Total		243	20	67	71	57	28

PASS 12 — SYNC/ASYNC WIDER SCOPE (launcher, generators, streams, FFI, tests)

O1: `_maybe_close()` calls `asyncio.run()` without checking for a running event loop — close/disconnect silently skipped

File: launcher.py:270-274

def _maybe_close(obj):
    ...
    if inspect.isawaitable(result):
        try:
            asyncio.run(result)
        except RuntimeError:
            pass  # SILENT — coroutine never executed

When _maybe_close() is called from any context that already has a running event loop (which includes all async tests, any async def main() orchestrator, or any code path that imports and runs DITAv2LauncherBundle inside an async context), asyncio.run(result) raises RuntimeError: asyncio.run() cannot be called from a running event loop. The except RuntimeError: pass swallows it — the close/disconnect method never executes.

Affected resources when called from async context:

RealZincPlane.close() — never called → 3 shared memory regions leaked
RealZincControlPlane.close() — never called → 1 shared memory region leaked
BingxVenueAdapter has neither close() nor disconnect() — N/A
InMemoryZincPlane has no close — N/A

The DITAv2LauncherBundle.close() method calls _maybe_close(self.venue), _maybe_close(self.zinc_plane), _maybe_close(self.control_plane) — if any of these have async close/disconnect methods, they're all silently skipped when called from async context.

This means: in any async deployment (which is the only deployment pattern — tests, and presumably production via asyncio.run() at top level), shared memory regions are never explicitly closed. They rely on process exit cleanup.

Severity: High
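A loop-aware variant of the helper, sketched under the assumption that close() returns a coroutine when it is async. `maybe_close` here is an illustrative rewrite and `AsyncResource` is a hypothetical test double. Instead of swallowing the RuntimeError, it schedules the coroutine when a loop is already running and hands the resulting task back to the caller.

```python
import asyncio
import inspect


def maybe_close(obj):
    """Sketch: close obj; return a Task when called from a running loop."""
    closer = getattr(obj, "close", None)
    if closer is None:
        return None
    result = closer()
    if not inspect.isawaitable(result):
        return None
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        asyncio.run(result)  # no loop in this thread: run to completion
        return None
    # A loop is already running: schedule the close instead of dropping it.
    return asyncio.ensure_future(result)


class AsyncResource:
    def __init__(self):
        self.closed = False

    async def close(self):
        self.closed = True


def demo_sync():
    resource = AsyncResource()
    maybe_close(resource)
    return resource.closed


async def demo_async():
    resource = AsyncResource()
    task = maybe_close(resource)
    await task  # the caller is responsible for awaiting the returned task
    return resource.closed
```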

O2: `async def connect()` shims in all test generators call sync `venue.connect()` without `await` — misleading pattern

Files: gen_live_tests.py:143-146, gen2.py:332-333, _gen_test.py:70 (via Shim/Shim pattern)

# All three test harnesses have this pattern:
async def connect(self, initial_capital=0):
    self.kernel.venue.connect()          # sync method, no await

BingxVenueAdapter.connect() (bingx_venue.py:301) is a sync def that returns bool. It internally calls self._run(result()) which under a running event loop submits to the thread pool and blocks with .result(). The async def connect() wrapper is misleading — it's async but immediately calls a sync method that will block the event loop for the HTTP round-trip duration.

The caller's perspective: await runtime.connect() should yield the event loop. Instead, it blocks until the BingX HTTP call inside connect() completes (via _run()'s thread pool path).

Severity: Medium

O3: `gen_live_tests.py:171` — `_contract_rows(client)` NOT awaited in `async def _pick_live_symbol` — TypeError on first iteration

File: gen_live_tests.py:171

async def _pick_live_symbol(client):
    rows = _contract_rows(client)  # MISSING await! _contract_rows is async def
    ...
    pos_rows = [r for r in rows if ...]

_contract_rows is async def (line 69). Without await, rows is a coroutine object, not the actual data. The subsequent iteration for r in rows then raises TypeError: 'coroutine' object is not iterable. This holds for every Python version with native coroutines, not only 3.12+. The discarded coroutine also triggers a RuntimeWarning that it was never awaited.

This function is called from _run_scenario (line 260) and _run_pink_live_roundtrip (line 297). If either path reaches _pick_live_symbol, it crashes with TypeError. This bug may not have manifested in practice if the code paths that call _pick_live_symbol are rarely exercised or if the test generator's output file hasn't been regenerated recently.

Severity: High
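The failure mode is easy to reproduce with stand-in functions. The names and symbol values below are hypothetical and only mirror the shape of the code in question.

```python
import asyncio


async def contract_rows():
    # Stand-in for the real async fetch (illustrative data).
    return [{"symbol": "BTC-USDT"}, {"symbol": "ETH-USDT"}]


async def pick_without_await():
    rows = contract_rows()  # coroutine object, not a list
    try:
        return [row["symbol"] for row in rows]
    except TypeError as exc:
        rows.close()  # avoid the "never awaited" RuntimeWarning
        return type(exc).__name__


async def pick_with_await():
    rows = await contract_rows()
    return [row["symbol"] for row in rows]
```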

O4: `test_exchange_event_seam_parity.py` uses deprecated `asyncio.get_event_loop().run_until_complete()`

File: test_exchange_event_seam_parity.py:243,264

snap = asyncio.get_event_loop().run_until_complete(mock.account_snapshot())  # line 243
asyncio.get_event_loop().run_until_complete(asyncio.wait_for(_collect(), timeout=2.0))  # line 264

asyncio.get_event_loop() is deprecated for this use: since Python 3.12 it emits a DeprecationWarning when no current event loop is set, and Python 3.14 raises RuntimeError in that case. On older versions, if no event loop exists at call time, it creates a new loop and sets it as the current event loop, which can cause subtle issues when multiple event loops are active. The modern pattern is asyncio.run().

These are the only two places in the workspace that use the deprecated get_event_loop().run_until_complete() pattern.

Severity: Medium

O5: `_run()` thread pool has no timeout on `.result()` — if backend hangs, calling thread hangs forever

File: bingx_venue.py:236

return pool.submit(asyncio.run, result).result()  # NO timeout

concurrent.futures.Future.result() has an optional timeout parameter. None is set here. If the thread pool worker hangs (e.g., the asyncio.run() call in the worker gets stuck on a never-responding HTTP request, a deadlocked coroutine, or an infinite loop), the calling thread blocks forever on .result().

If the calling thread is the event loop thread (Path B), the entire event loop is frozen indefinitely. No WS messages, no keepalive tasks, no timer events. The system is completely dead.

The _backend_snapshot() method has a 5-second timeout for its threading.Event.wait(), but the actual _call_backend("refresh_state", ...) that runs inside the thread pool has no timeout. The HTTP client (BingxHttpClient) may have its own default timeout (aiohttp's default ClientTimeout allows 5 minutes in total), but there's no fallback if it hangs beyond that.

Severity: High
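A sketch of a bounded bridge, assuming the caller can pass a coroutine factory. `run_with_deadline`, `hung_backend`, `healthy_backend`, and `demo_timeout` are hypothetical names. `asyncio.wait_for` bounds the coroutine inside the worker, and `Future.result(timeout=...)` bounds the caller's wait as a backstop.

```python
import asyncio
import concurrent.futures


def run_with_deadline(pool, make_coro, seconds):
    """Sketch: bound both the coroutine and the caller's wait."""

    def worker():
        # wait_for cancels the coroutine at the deadline, freeing the worker.
        return asyncio.run(asyncio.wait_for(make_coro(), timeout=seconds))

    future = pool.submit(worker)
    # Slightly longer outer timeout: a backstop if the worker itself wedges.
    return future.result(timeout=seconds + 1.0)


async def hung_backend():
    await asyncio.sleep(3600)  # stands in for a request that never completes


async def healthy_backend():
    return "ok"


def demo_timeout():
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        ok = run_with_deadline(pool, healthy_backend, 0.5)
        try:
            run_with_deadline(pool, hung_backend, 0.05)
        except (asyncio.TimeoutError, concurrent.futures.TimeoutError):
            return ok, "timed out"
    return ok, "completed"
```

The outer timeout alone would unblock the caller but leave the worker occupied, which is why the inner `wait_for` is included.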

O6: MockVenueAdapter never exercises the thread-pool bridge — all CI tests use mock venue, bridge untested

Files: mock_venue.py vs bingx_venue.py

MockVenueAdapter.submit() is pure sync — it does return self._events_from_submit(...) with no awaitables, no thread pools. BingxVenueAdapter.submit() is a sync-bridge that goes through _run() → pool.submit(asyncio.run, ...).result().

All 35+ tests in test_flaws.py use MockVenueAdapter. All generated live tests use BingxVenueAdapter but are rarely executed (require live exchange credentials and API key env vars). The thread-pool bridge — including:

Thread creation and lifecycle
asyncio.run() inside pool workers
Event loop per HTTP call
Thread pool exhaustion handling
Exception propagation through .result()

— is never exercised in CI. If the bridge has a bug (e.g., the asyncio.run() inside the pool worker corrupts shared state, or thread-safety issues in aiohttp), it surfaces only in production.

Severity: Medium

O7: `BingxUserStream._keepalive_loop` and `_rotation_sentinel` are fire-and-forget tasks — unhandled exceptions silently lost

File: bingx_user_stream.py:105-112

keepalive_task = asyncio.create_task(self._keepalive_loop(listen_key), name="lk_keepalive")
rotation_task = asyncio.create_task(self._rotation_sentinel(), name="lk_rotation")

Both are created with create_task() and tracked for later cancellation, but not supervised during normal operation. If _keepalive_loop raises an exception that its internal try/except does not catch (e.g., an asyncio.CancelledError variant, or a RuntimeError from the HTTP layer), the exception is stored in the Task object. If .result() or .exception() is never called on that Task, asyncio logs "Task exception was never retrieved" once the Task is garbage collected. That is a log message only, with no structured error handling.

_rotation_sentinel has no exception handling in its body — it just does await asyncio.sleep(secs) and returns. It can't raise an exception unless the event loop is shut down during its sleep (in which case CancelledError is raised, which is properly handled in the finally block).

Severity: Low

O8: `KernelSlotView.__getattr__` makes a ctypes call per attribute — each read triggers Rust FFI and is not cached

File: rust_backend.py:422-426

def __getattr__(self, name: str) -> Any:
    slot = self._snapshot()   # FFI call → Rust serialize → JSON parse → TradeSlot
    if hasattr(slot, name):
        return getattr(slot, name)
    raise AttributeError(name)

Every attribute access on a KernelSlotView — including slot.size, slot.fsm_state, slot.trade_id, slot.active_entry_order, etc. — does a full JSON round-trip to the Rust kernel:

Python calls _get_rust().get_slot_json(self._backend, slot_id)
ctypes calls Rust dita_kernel_get_slot_json
Rust serializes the entire TradeSlot to a JSON string
ctypes returns the C string pointer
Python calls _take_string(raw) → text.decode("utf-8")
Python calls json.loads(text) → dict
_slot_from_payload(dict) → new TradeSlot dataclass
getattr(slot, name) → read the one field from the new object

Accessing 5 fields on a KernelSlotView (e.g., slot.size, slot.fsm_state, slot.entry_price, slot.active_entry_order, slot.trade_id) does 5 FFI round-trips. The deserialized TradeSlot is created and immediately discarded for each access.

The _snapshot() method (line 435) calls self._kernel._get_slot(self._slot_id) which does the full FFI round-trip. There is no caching of the deserialized TradeSlot between successive accesses. This is an N+1 performance issue — accessing N fields costs N FFI calls instead of 1.

Severity: Medium
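The N+1 cost can be illustrated with a counting stand-in. All names below (`CountingKernel`, `PerAttributeView`, `read_fields_once`) and the slot field values are hypothetical. The first view mirrors the per-attribute round-trip; the helper takes one snapshot and reads every field from it.

```python
class CountingKernel:
    """Hypothetical stand-in that counts simulated FFI round-trips."""

    def __init__(self):
        self.ffi_calls = 0

    def get_slot(self, slot_id):
        self.ffi_calls += 1
        return {"size": 1.5, "fsm_state": "OPEN", "trade_id": "t-1"}


class PerAttributeView:
    """Mirrors the flawed shape: every attribute read fetches a new snapshot."""

    def __init__(self, kernel, slot_id):
        self._kernel = kernel
        self._slot_id = slot_id

    def __getattr__(self, name):
        # Only reached for names not found on the instance itself.
        slot = self._kernel.get_slot(self._slot_id)
        try:
            return slot[name]
        except KeyError:
            raise AttributeError(name) from None


def read_fields_once(kernel, slot_id, names):
    """Take one snapshot and read every requested field from it."""
    slot = kernel.get_slot(slot_id)
    return {name: slot[name] for name in names}
```

A single snapshot also gives the caller a consistent view: all fields come from the same kernel state.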

O9: `DITAv2LauncherBundle` has no `__del__` — bundle that's garbage collected leaks its entire resource tree

File: launcher.py:64-95

@dataclass
class DITAv2LauncherBundle:
    kernel: ExecutionKernel
    control_plane: ControlPlane
    projection: HazelcastProjection
    zinc_plane: ZincPlane
    venue: VenueAdapter

    def close(self) -> None:
        _maybe_close(self.venue)
        _maybe_close(self.zinc_plane)
        _maybe_close(self.control_plane)

No __del__ method. If a bundle is garbage collected without an explicit close() call:

The Rust kernel's KernelHandle is freed by ExecutionKernel.__del__ (if GC runs)
If RealZincPlane was in use, its close() is never called → 3 shared memory regions leaked
If RealZincControlPlane was in use, its close() is never called → 1 shared memory region leaked
The projection (Hazelcast) client connection is never closed
The venue adapter's thread pool executor is never shut down

If the bundle is created and dropped in a loop (e.g., per-test setup/teardown), shared memory regions accumulate until the system runs out of /dev/shm/ space.

Severity: Medium

O10: ExecutionKernel has no `close()` — `__del__` is the only cleanup path for the Rust handle

File: rust_backend.py:519-525

def __del__(self) -> None:
    backend = getattr(self, "_backend", None)
    if backend is not None:
        try:
            _get_rust().destroy(backend)
        except Exception:
            pass

No close() method exists on ExecutionKernel. The DITAv2LauncherBundle.close() doesn't touch the kernel (it calls _maybe_close on venue, zinc_plane, and control_plane only). The Rust _backend handle is only freed when __del__ runs during garbage collection.

If the kernel is part of a reference cycle (K3/K6 — Kernel → KernelStateView → KernelSlotView → Kernel), __del__ may be delayed indefinitely until the cycle GC runs. During that delay, the Rust KernelHandle is alive but unreachable — its memory is leaked until GC.

Severity: Medium
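A sketch of an explicit, idempotent close() with context-manager support. `KernelSketch` is a hypothetical stand-in, and the `destroy` callable represents the native destroy call. `__del__` remains only as a fallback.

```python
class KernelSketch:
    """Hypothetical kernel wrapper with an explicit, idempotent close()."""

    def __init__(self, destroy):
        self._destroy = destroy    # stands in for the native destroy call
        self._backend = object()   # stands in for the native handle

    def close(self):
        # Swap the handle out first so a second close() is a no-op.
        backend, self._backend = self._backend, None
        if backend is not None:
            self._destroy(backend)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()

    def __del__(self):
        # Fallback only; close() or the with-statement is the main path.
        try:
            self.close()
        except Exception:
            pass
```

With this shape, a bundle-level close() could call the kernel's close() alongside the venue and plane cleanup.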

O11: `KernelSlotView.__setattr__` triggers 5 side effects including durable writes — undocumented

File: rust_backend.py:428-453

def __setattr__(self, name: str, value: Any) -> None:
    ...
    slot = self._snapshot()
    setattr(slot, name, value)
    self._kernel._set_slot(slot)  # triggers: Rust FFI write + state refresh
                                 #           + account.observe_slots
                                 #           + projection.write_slot
                                 #           + zinc_plane.write_slot

Setting any attribute on a KernelSlotView — even something trivial like slot.some_metadata_field = "test" — triggers 5 side effects: Rust FFI write to the kernel, KernelStateView.refresh(), account.observe_slots(), projection.write_slot(), and zinc_plane.write_slot(). The method name __setattr__ gives no indication that setting a field triggers durable writes across multiple persistence layers.

There is no read-only view that prevents accidental mutation. Any code that holds a KernelSlotView reference and assigns a field bypasses all FSM guards and directly mutates the Rust kernel state.

Severity: Medium

Pass 12 Summary

#	Flaw	Layer	Severity
O1	`_maybe_close()` asyncio.run without loop guard — close/disconnect silently skipped from async context	Launcher	High
O2	`async def connect()` shims call sync `venue.connect()` without await — blocking pattern	Test	Medium
O3	`_contract_rows(client)` NOT awaited in `_pick_live_symbol` — TypeError on coroutine iteration	Test	High
O4	`test_exchange_event_seam_parity.py` uses deprecated `get_event_loop().run_until_complete()`	Test	Medium
O5	`_run()` thread pool `.result()` has no timeout — backend hang freezes process indefinitely	Venue	High
O6	MockVenueAdapter never exercises thread-pool bridge — bridge untested in CI	Venue	Medium
O7	`_keepalive_loop`/`_rotation_sentinel` fire-and-forget tasks — exceptions silently lost	Stream	Low
O8	`KernelSlotView.__getattr__` makes N FFI calls for N attribute accesses — no caching	Bridge	Medium
O9	`DITAv2LauncherBundle` no `__del__` — GC'd bundle leaks entire resource tree	Launcher	Medium
O10	`ExecutionKernel` no `close()` — Rust handle only freed by unpredictable `__del__`	Bridge	Medium
O11	`KernelSlotView.__setattr__` triggers 5 persistence side effects — read-only view missing	Bridge	Medium

Pass 12 Severity

Severity	Count
High	3 (O1, O3, O5)
Medium	7 (O2, O4, O6, O8, O9, O10, O11)
Low	1 (O7)

Combined Catalog (All 12 Passes)

Pass	Focus	Count	Critical	High	Medium	Low	Info
A	Architectural	15	0	2	0	2	11
T	Threading/Atomicity	9	1	3	3	2	0
E	E2E Trace (Pass 1)	26	0	4	10	11	1
F	Deep E2E (Pass 3)	30	0	1	8	17	4
G	Domain Scans (Pass 4)	36	4	11	11	8	2
H	Edge Domains (Pass 5)	22	3	9	5	4	1
I	Pass 6 (Math/Tests/Recovery/Security)	22	3	11	4	2	2
J	Pass 7 (Test Infra/Data/Rust/Env/Conn)	16	0	7	7	2	0
K	Pass 8 (Observability/Memory/Time/DeadCode)	23	2	7	7	1	6
L	Pass 9 (Contracts/Events/Network/FFI/Diffs)	16	0	4	8	4	0
M	Pass 10 (Runtime/TestBugs/FSM/Persistence/Metrics)	18	3	7	5	3	0
N	Pass 11 (Async/Sync Seams/Locks/Threading)	10	4	1	3	1	1
O	Pass 12 (Sync/Async Wider Scope)	11	0	3	7	1	0
Total		254	20	70	78	58	28

PASS 13 — FFI BOUNDARY SAFETY, DANGLING POINTERS, COVERAGE GAPS

P1: `dita_kernel_destroy` double-free UB — Python does not null `handle.value` after destroy

File: rust_backend.py:145-148, _rust_kernel/src/lib.rs:2081-2088

# Python destroy():
def destroy(self, handle):
    if handle and handle.value:
        self.lib.dita_kernel_destroy(handle)  # handle.value NOT nulled

// Rust dita_kernel_destroy:
pub extern "C" fn dita_kernel_destroy(handle: *mut KernelHandle) {
    if !handle.is_null() {
        unsafe { drop(Box::from_raw(handle)); }
    }
}

If destroy() is called twice on the same handle:

First call: Box::from_raw(handle) frees the memory. Python's handle.value still points to the old (now dangling) memory.
Second call: handle and handle.value is True (dangling but non-null). Passes to Rust.
Rust: !handle.is_null() is True. Box::from_raw(handle) on a dangling pointer is undefined behavior — heap corruption, use-after-free, or silent data corruption.

Trigger scenarios:

ExecutionKernel.__del__ calls destroy() if _backend is non-null. If user code also calls destroy() explicitly (no such code today, but the method is public), double-free.
If __del__ runs during interpreter shutdown after the _RustKernelLib CDLL object is partially finalized, the self.lib attribute might be None → AttributeError rather than double-free, but if the CDLL is still alive, double-free.
Test code that creates/destroys kernels in a loop (fresh_kernel pattern) could trigger double-destroy if two Python objects end up holding the same handle. GC alone does not finalize one object twice: since PEP 442, CPython calls __del__ at most once per object, even inside reference cycles.

Fix: Python destroy() should set handle.value = None after calling, or use a _destroyed flag.

Severity: Critical — undefined behavior on any double-destroy path.
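A sketch of the suggested fix at the ctypes layer, assuming the handle is a `ctypes.c_void_p`. `destroy_once` and `native_destroy` are hypothetical names. The handle is nulled immediately after the native call so a second call returns early.

```python
import ctypes


def destroy_once(native_destroy, handle):
    """Sketch: call the native destructor at most once per handle object."""
    if handle is None or not handle.value:
        return False
    native_destroy(handle)
    handle.value = None  # later calls see a null handle and return early
    return True
```

This protects a single Python handle object only. If the raw pointer value has been copied into another ctypes object, that copy still dangles.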

P2: `CStr::from_ptr(payload)` without null guard in multiple FFI exports

File: _rust_kernel/src/lib.rs — dita_kernel_set_exchange_config_json, dita_kernel_calibrate_fee_json, dita_kernel_on_account_event_json

pub extern "C" fn dita_kernel_set_exchange_config_json(handle: *mut KernelHandle, payload: *const c_char) -> i32 {
    let payload = unsafe { CStr::from_ptr(payload) };  // NO NULL CHECK — UB if payload is null
    let payload_str = payload.to_str().map_err(|_| -1)?;
    ...
}

Three FFI functions call CStr::from_ptr(payload) directly on the raw *const c_char parameter without checking for null first. If a null pointer is passed (from Python ctypes passing None, or from a bug in a future caller), this reads from memory address 0 — segfault or undefined behavior.

The existing helper cstr_to_string() (line 1500) correctly checks for null:

fn cstr_to_string(ptr: *const c_char) -> Result<String, String> {
    if ptr.is_null() { return Err("NULL_POINTER".to_string()); }
    unsafe { CStr::from_ptr(ptr) }
        .to_str().map(|s| s.to_string()).map_err(|e| e.to_string())
}

But these three FFI functions bypass it. Only dita_kernel_set_exchange_config_json is called from Python; calibrate_fee and on_account_event are newer functions.

Fix: Use cstr_to_string() or add an explicit if payload.is_null() { return -1; } guard.

Severity: High — null pointer dereference on any call with a null payload.
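As defense in depth on the calling side, a Python wrapper can refuse to hand a null payload to the FFI at all (ctypes passes `None` as a NULL pointer for `c_char_p` arguments). `encode_payload` is a hypothetical helper; it complements the Rust-side guard rather than replacing it.

```python
import json


def encode_payload(payload):
    """Sketch: validate a payload before it crosses the FFI boundary."""
    if payload is None:
        raise ValueError("payload must not be None")
    # Strings are passed through; other values are serialized as JSON.
    text = payload if isinstance(payload, str) else json.dumps(payload)
    data = text.encode("utf-8")
    if b"\x00" in data:
        # A raw NUL would truncate the C string on the native side.
        raise ValueError("payload must not contain NUL bytes")
    return data
```

Note that `json.dumps` escapes control characters, so a NUL inside a dict value is serialized as an escape sequence and passes the check; only a pre-built string with a raw NUL is rejected.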

P3: `_check_open_orders` calls `asyncio.run()` from within async `_verify()` — RuntimeError in live test execution

File: _gen_test.py:104, _build_pink_extended.py:75-78

def _check_open_orders(c, vs):
    r = __import__('asyncio').run(c._request_json("GET", "/openApi/swap/v2/trade/openOrders", ...))

This is a sync def that calls asyncio.run(). It is called from _verify() which is async def (inside the generated test file). When _verify runs inside asyncio.run(_run(...)), there is a running event loop. _check_open_orders calls asyncio.run(...) which detects the running loop and raises RuntimeError: asyncio.run() cannot be called from a running event loop.

The same pattern exists in _build_pink_extended.py:75-78 in the patched version of _check_open_orders.

Fix: Make _check_open_orders async def and use await instead of asyncio.run().

Severity: High — any live test that calls _verify (which all live tests do via _run) will crash.
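The failure and the suggested fix can be shown side by side with a stand-in client. `FakeClient`, its response shape, and the function names are hypothetical; the request path is the one quoted above.

```python
import asyncio

OPEN_ORDERS_PATH = "/openApi/swap/v2/trade/openOrders"


class FakeClient:
    """Hypothetical stand-in for the live HTTP client."""

    async def _request_json(self, method, path):
        return {"orders": []}


def check_open_orders_sync(client):
    # Mirrors the flawed shape: asyncio.run() inside a sync helper.
    coro = client._request_json("GET", OPEN_ORDERS_PATH)
    try:
        return asyncio.run(coro)
    except RuntimeError:
        coro.close()  # tidy up the coroutine that never ran
        raise


async def check_open_orders(client):
    # Fixed shape: an async helper that awaits on the caller's loop.
    return await client._request_json("GET", OPEN_ORDERS_PATH)


async def verify_broken(client):
    try:
        return check_open_orders_sync(client)
    except RuntimeError as exc:
        return type(exc).__name__


async def verify_fixed(client):
    return await check_open_orders(client)
```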

P4: `into_c_string` replaces NUL bytes with `"\\u0000"` — produces invalid JSON

File: _rust_kernel/src/lib.rs:2006-2013

fn into_c_string(value: &str) -> *mut c_char {
    match CString::new(value) {
        Ok(cs) => cs.into_raw(),
        Err(_) => {
            let sanitized = value.replace('\0', "\\u0000");  // literal backslash-u-0-0-0-0
            CString::new(sanitized).unwrap_or_else(|_| CString::new("").unwrap()).into_raw()
        }
    }
}

When a string contains an interior NUL byte (\0), into_c_string replaces it with the 6-character ASCII sequence backslash, u, 0, 0, 0, 0 (written "\\u0000" in Rust source). If this string is a JSON payload, which it always is for process_intent_json, on_venue_event_json, etc., the sanitized string is not valid JSON. Python's json.loads() in _take_string receives invalid JSON and raises json.JSONDecodeError.

This is a data integrity issue: a NUL byte in an intent field (which shouldn't happen in normal use but could come from a malformed exchange response) causes the entire intent to fail with a JSONDecodeError rather than a clean INVALID_INTENT diagnostic.

Note: The NUL-byte panic (G2) was fixed by adding this sanitizer, but the sanitizer produces invalid JSON, trading a crash for a different failure mode.

Fix: Strip NUL bytes entirely (.replace('\0', "")) before JSON construction, or reject the intent with invalid_intent_cstring if NUL bytes are detected.

Severity: Medium

P5: `reconcile_slots_json` returns null on serialize failure — inconsistent with intent/venue error paths

File: _rust_kernel/src/lib.rs:2258

// reconcile_slots_json unwrap_or:
.with_handle_mut(handle, |core| ...)
.unwrap_or(ptr::null_mut())  // returns null — NO diagnostic

When reconcile_slots_json encounters a parse or serialize failure, it returns ptr::null_mut(). Python's _take_string raises RuntimeError("Rust kernel returned null string") — an unhandled exception.

Compare with process_intent_json and on_venue_event_json which use:

.map_err(|e| invalid_intent_cstring("INVALID_INTENT_PARSE", &e))
.unwrap_or_else(|ptr| ptr)  // returns structured diagnostic JSON

The reconcile and snapshot paths return bare null — no diagnostic, no structured error, no way for the Python side to distinguish "parse error" from "serialize error" from "null handle."

The same issue affects dita_kernel_snapshot_json (line 2269).

Severity: Medium

P6: `_get_rust()` TOCTOU race on first call — concurrent threads both see `_RUST is None`

File: rust_backend.py:271-275

def _get_rust():
    global _RUST
    if _RUST is None:
        _RUST = _RustKernelLib()  # two threads can both enter here
    return _RUST

Two threads calling _get_rust() simultaneously on first access both see _RUST is None. Both enter the if block. Both call _RustKernelLib() which:

Calls _ensure_library() which runs subprocess.run(["cargo", "build", "--release", ...], check=True) — two concurrent cargo builds can corrupt the build directory.
Calls ctypes.CDLL(path) — loads the shared library twice. The second CDLL object is assigned to _RUST (overwriting the first), which is then GC'd, but the Rust library's global state may have been initialized twice.

Fix: Use a module-level lock.

Severity: High
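A minimal sketch of the module-level lock using double-checked initialization. `FakeRustLib`, `_LOADS`, and `demo_concurrent_init` are hypothetical; the fake constructor stands in for the build-and-load step.

```python
import threading
import time

_RUST = None
_RUST_LOCK = threading.Lock()
_LOADS = []  # records how many times the fake library was constructed


class FakeRustLib:
    """Hypothetical stand-in for the expensive build-and-load step."""

    def __init__(self):
        _LOADS.append(1)
        time.sleep(0.05)


def get_rust():
    global _RUST
    if _RUST is None:              # fast path once initialized
        with _RUST_LOCK:
            if _RUST is None:      # re-check: another thread may have won
                _RUST = FakeRustLib()
    return _RUST


def demo_concurrent_init(threads=8):
    seen = []
    workers = [
        threading.Thread(target=lambda: seen.append(get_rust()))
        for _ in range(threads)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return len(_LOADS), len({id(lib) for lib in seen})
```

The lock covers threads within one process only; two processes starting at once could still run concurrent builds.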

P7: `KernelHandle` has no `!Send`/`!Sync` — but ctypes FFI bypasses all Rust ownership rules

File: _rust_kernel/src/lib.rs

KernelHandle and KernelCore have no explicit unsafe impl Send or unsafe impl Sync, and no negative markers either. Send and Sync are auto traits: the compiler implements them for a struct whenever every field is Send/Sync. HashMap<String, Value> qualifies (serde_json::Value is both Send and Sync), so unless some other field opts out (a raw pointer or an Rc, for example), both types are automatically Send and Sync.

The real issue: even if Rust correctly determined KernelHandle is !Send and !Sync, the *mut KernelHandle pointer passed across FFI has no type-system enforcement. Python's ctypes calls dita_kernel_process_intent_json(handle, ...) which immediately converts the raw pointer to &mut KernelCore via unsafe { &mut (*handle).core }. The Rust compiler cannot enforce ownership rules across the FFI boundary.

This means: the Rust kernel's thread-safety design relies entirely on the Python side never calling FFI from multiple threads simultaneously. There is no mechanism in either language to enforce this.

Severity: Informational — documenting the existing design constraint (already covered in N1, but worth noting the Send/Sync aspect).

P8: `dita_kernel_destroy` not called from bundle close — no explicit Rust handle cleanup path

Files: launcher.py:83-95, rust_backend.py

DITAv2LauncherBundle.close() calls:

_maybe_close(self.venue)
_maybe_close(self.zinc_plane)
_maybe_close(self.control_plane)

It does not call anything on self.kernel. ExecutionKernel has no close() method (O10). The Rust _backend handle is only freed when __del__ runs during garbage collection.

The bundle holds self.kernel as a strong reference. As long as the bundle is alive, the kernel is alive. When the bundle is GC'd (or goes out of scope), the kernel's refcount may drop to zero, triggering __del__. But if the kernel has a reference cycle (K6: Kernel → StateView → SlotView → Kernel), __del__ is delayed until the GC cycle.

Fix: Add ExecutionKernel.close() and call it from DITAv2LauncherBundle.close().
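
One possible shape for such a close() is sketched here. The class is a reduced stand-in that models only the handle lifecycle, and the `destroy` callable is an assumption standing in for the call into dita_kernel_destroy.

```python
class ExecutionKernel:
    """Reduced stand-in; only the handle lifecycle is modeled."""

    def __init__(self, backend, destroy):
        self._backend = backend    # opaque handle from the Rust side
        self._destroy = destroy    # callable that frees the handle (assumed)

    def close(self):
        # Swap the handle out first so a second close() or a later
        # __del__ cannot free it twice.
        backend = getattr(self, "_backend", None)
        self._backend = None
        if backend is not None:
            self._destroy(backend)

    def __del__(self):
        self.close()
```

Because close() is idempotent, the bundle can call it unconditionally and `__del__` remains only as a safety net.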

Severity: Medium

P9: `ExecutionKernel.__del__` accesses module-level `_RUST`, which can raise NameError during shutdown

File: rust_backend.py:519-525

def __del__(self):
    backend = getattr(self, "_backend", None)
    if backend is not None:
        try:
            _get_rust().destroy(backend)   # accesses module-level _RUST
        except Exception:
            pass

During Python interpreter shutdown, the interpreter clears module globals before calling __del__ on remaining objects. If ExecutionKernel survives to shutdown, _get_rust() accesses the module-level _RUST variable which may have already been set to None (module globals cleared). This raises TypeError: 'NoneType' object is not callable (when _RUST is None and _get_rust() tries to call _RustKernelLib()), or NameError if the variable itself has been deleted from the module namespace.

The except Exception: pass catches this, but the Rust handle is never destroyed — it leaks.

Severity: Low — caught by except, only at shutdown, Rust kernel handle is lost but process is exiting.

P10: `_check_open_orders` in `_gen_test.py` has redundant `asyncio.run` already covered in P3 — different location, same pattern

Already covered in P3. Same root cause in _build_pink_extended.py:75-78. No additional finding.

Pass 13 Summary

#	Flaw	Layer	Severity
P1	`dita_kernel_destroy` double-free UB — Python doesn't null handle.value	Bridge	Critical
P2	`CStr::from_ptr(payload)` without null guard in 3 FFI exports	Rust	High
P3	`_check_open_orders` calls `asyncio.run()` from async `_verify` — RuntimeError	Test	High
P4	`into_c_string` NUL sanitizer produces invalid JSON — json.loads fails	Rust	Medium
P5	`reconcile_slots_json`/`snapshot_json` return null on failure — no diagnostic	Rust	Medium
P6	`_get_rust()` TOCTOU race — concurrent cargo build corruption	Bridge	High
P7	`KernelHandle` no Send/Sync — FFI bypasses Rust ownership rules	Rust	Info
P8	No explicit Rust handle destroy path from bundle.close()	Launcher	Medium
P9	`__del__` accesses module `_RUST` during shutdown — NameError leak	Bridge	Low

Pass 13 Severity

Severity	Count
Critical	1 (P1)
High	3 (P2, P3, P6)
Medium	3 (P4, P5, P8)
Low	1 (P9)
Info	1 (P7)

Combined Catalog (All 13 Passes)

Pass	Focus	Count	Critical	High	Medium	Low	Info
A	Architectural	15	0	2	0	2	11
T	Threading/Atomicity	9	1	3	3	2	0
E	E2E Trace (Pass 1)	26	0	4	10	11	1
F	Deep E2E (Pass 3)	30	0	1	8	17	4
G	Domain Scans (Pass 4)	36	4	11	11	8	2
H	Edge Domains (Pass 5)	22	3	9	5	4	1
I	Pass 6 (Math/Tests/Recovery/Security)	22	3	11	4	2	2
J	Pass 7 (Test Infra/Data/Rust/Env/Conn)	16	0	7	7	2	0
K	Pass 8 (Observability/Memory/Time/DeadCode)	23	2	7	7	1	6
L	Pass 9 (Contracts/Events/Network/FFI/Diffs)	16	0	4	8	4	0
M	Pass 10 (Runtime/TestBugs/FSM/Persistence/Metrics)	18	3	7	5	3	0
N	Pass 11 (Async/Sync Seams/Locks/Threading)	10	4	1	3	1	1
O	Pass 12 (Sync/Async Wider Scope)	11	0	3	7	1	0
P	Pass 13 (FFI Safety/Dangling Pointers/Coverage)	9	1	3	3	1	1
Total		263	21	73	81	59	29

PASS 14 — SERDE EDGE CASES, BACKUP DIFFS, MARKET DATA/TIMESTAMPS

Q1: `datetime.fromisoformat()` in Python < 3.11 cannot parse Rust `Z`-suffix timestamps

Files: rust_backend.py:215,241,260, real_zinc_plane.py:95,122

entry_time=datetime.fromisoformat(payload["entry_time"]) if payload.get("entry_time") else None

When Rust's chrono::DateTime<Utc> serializes a timestamp via serde, it produces the RFC 3339 format with Z suffix: "2026-05-31T12:00:00Z". Python's datetime.fromisoformat() cannot parse the Z suffix before Python 3.11; support was added in 3.11, when fromisoformat() was extended to accept most ISO 8601 formats. On Python 3.10 and earlier, fromisoformat("2026-05-31T12:00:00Z") raises ValueError.

This affects every _slot_from_payload() and _event_to_payload() round-trip. The Rust kernel's entry_time, last_event_time, and all VenueEvent/KernelTransition timestamps are serialized with Z. If the Python runtime is < 3.11, every deserialization of any timestamp from the Rust kernel will crash with ValueError.

The code uses str | None annotations (PEP 604 union syntax), which suggests a Python 3.10+ target, although with from __future__ import annotations those annotations also import cleanly on earlier versions. If the environment is 3.10 or older, every FFI call that returns a Z-suffixed timestamp will fail.

Severity: High

Q2: No `#[serde(deny_unknown_fields)]` on any Rust struct — misspelled field names cause silent no-op

File: _rust_kernel/src/lib.rs — all structs

None of the Rust kernel's serializable structs (KernelIntent, VenueEvent, TradeSlot, KernelOutcome, AccountState) use #[serde(deny_unknown_fields)]. When the Python side sends JSON with a misspelled field (e.g., "slotid" instead of "slot_id", "tradeid" instead of "trade_id"), serde silently ignores it. The struct is deserialized with the default value or Option::None for that field.

For required fields without #[serde(default)], a missing field causes an error (serde returns Err). But for optional/defaulted fields, a typo produces a silent no-op — the field value from the Python side is silently dropped, and the default value is used instead. No error, no warning.

Trigger example: if Python sends "slot_id" and the Rust struct expects "slot_id", the value arrives. If Python sends "slotid" instead, or someone adds a new field to KernelIntent in Python that the Rust struct does not have yet, the field is dropped entirely and the round-trip silently loses data.

Severity: Medium

Q3: `indexmap` dependency added with `features = ["serde"]` — new transitive dependency chain

File: _rust_kernel/Cargo.toml (current vs backup)

indexmap = { version = "2", features = ["serde"] }

This is a new dependency in the current code (not in backup). indexmap provides IndexMap<K, V, S> and IndexSet<T, S>: hash tables that preserve insertion order by storing entries in a Vec alongside a hash index. It adds transitive crates (hashbrown, equivalent, etc.) to the build; the exact count depends on the resolved versions and is best read from Cargo.lock. Used for AccountState::seen_account_event_ids: IndexSet<String>, the account-level dedup set that supports ordered iteration and LRU eviction at 1024 entries.

Not a bug, but a significant increase in the dependency graph. The IndexSet with serde feature enables seen_account_event_ids to be serialized in KernelFullSnapshot for crash recovery. The LRU eviction at 1024 entries means account-level event dedup survives across save/restore cycles.

Severity: Informational

Q4: Backup vs current — `on_venue_event` TERMINAL_STATE guard and `venue_order_id` propagation are the largest functional changes

Files: _rust_kernel/src/lib.rs (current), _backup_20260530/rust_kernel_src/lib.rs (backup)

Comparing the two Rust kernels, the current version adds:

TERMINAL_STATE guard (~28 lines): Prevents stale venue events from reactivating closed slots. Backup had no guard — a FULL_FILL arriving on a CLOSED slot would re-open the position.
Venue order ID propagation (~20 lines): Before entering the FSM match block, the current kernel enriches the working order with venue_order_id and venue_client_id from incoming events. Essential for LIMIT order cancel tracking. Backup had no such enrichment.
CANCEL_ACK entry-order handling (~20 lines): Backup only handled exit-order cancellation. Current correctly resets entry-order state to IDLE on CANCEL_ACK, clearing trade_id, asset, side, size, and PnL.
apply_fill incremental accumulation (~15 lines changed): Backup overwrote filled_size on every fill. Current accumulates prev_filled + fill_size. This is a critical fix — without it, multiple partial fills would report only the latest fill size.
with_handle_mut catch_unwind guard (~15 lines): Backup had no panic protection at the FFI boundary. Current wraps every FFI entry in catch_unwind. If Rust panics, the guard catches it and returns an error result instead of unwinding across the FFI boundary (which is UB).
process_intent CAPITAL_FROZEN exit-early guard: Added before the main FSM logic — if capital is frozen, all intents return CAPITAL_FROZEN diagnostic.

These are bug fixes on top of the backup version — the backup represents a pre-fix state with ~6 serious bugs that have since been corrected.

Severity: Informational

Q5: `MarketSnapshot.timestamp` type is inconsistent — `time.time()` float vs `datetime` in the same file

File: gen_live_tests.py:82 vs gen_live_tests.py:169

# _build_live_snapshot (line 82):
MarketSnapshot(timestamp=time.time(), ...)    # float

# _snap helper (line 169):
MarketSnapshot(timestamp=datetime.now(timezone.utc), ...)  # datetime

Both construct MarketSnapshot in the same file. One for the _build_live_snapshot path (used in _run_pink_live_roundtrip), one for the _snap helper path (used in _run_pink_live_recovery and _run_scenario). Any code that reads snap.timestamp must handle both float and datetime — or crashes with AttributeError trying to call .isoformat() on a float.

This is a type mismatch in the same test infrastructure. Depending on which test path executes, the operator sees different timestamp types and may not notice the inconsistency.

Severity: High

Q6: `datetime.fromisoformat()` cannot parse Rust `Z`-suffix timestamps on Python 3.10 — same root cause as Q1, applies to all serialized Rust timestamps

Same analysis as Q1. This is pervasive — every VenueEvent.timestamp, KernelTransition.timestamp, TradeSlot.entry_time, and TradeSlot.last_event_time deserialized from the Rust kernel will crash on Python < 3.11. The fix is either to upgrade to Python 3.11+, or to add a str.replace("Z", "+00:00") before calling fromisoformat().

All 5 call sites in rust_backend.py and real_zinc_plane.py are affected.
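
A small helper along these lines could be shared by all five call sites. It is a sketch: the name `parse_rfc3339` is hypothetical, and it handles only the trailing Z. If the Rust side emits fractional seconds with more than six digits, older Python versions would need additional truncation that is not shown here.

```python
from datetime import datetime, timezone


def parse_rfc3339(text: str) -> datetime:
    """Parse an ISO 8601 / RFC 3339 timestamp, accepting a trailing 'Z'."""
    if text.endswith(("Z", "z")):
        # fromisoformat() before Python 3.11 only understands numeric offsets.
        text = text[:-1] + "+00:00"
    return datetime.fromisoformat(text)


parsed = parse_rfc3339("2026-05-31T12:00:00Z")
```

The result is timezone-aware, so it compares correctly against values produced by datetime.now(timezone.utc).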

Severity: High

Q7: No upper-bound price validation — `reference_price = 1e300` passes all guards

File: rust_backend.py:390, _rust_kernel/src/lib.rs mark_price

The Python-side _first_invalid_intent_field() checks math.isfinite(value) for reference_price. A value of 1e300 passes (it's finite). The Rust-side mark_price() checks !price.is_finite() || price <= 0.0 — 1e300 passes.

When this extreme price is used in realized_pnl():

let notional = exit_size * slot.entry_price * slot.leverage.max(1.0);
delta * notional

With entry_price = 1e300 and a modest exit_size = 0.001, notional = 1e297 — which is within f64 range (max ~1.8e308). But delta = (exit - 1e300) / 1e300 ≈ -1.0 (for exit=0), and PnL = -1.0 * 1e297 = -1e297 — a completely nonsensical loss number that corrupts the account.

No upper bound exists on any price field in the system. There's no configurable MAX_PRICE or per-market sanity check.
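
The arithmetic can be reproduced in a few lines of Python. This is a transcription of the two Rust lines quoted above plus the delta expression from the prose, not the kernel source itself; `MAX_PRICE` and `price_is_sane` are hypothetical names, and the ceiling value is an arbitrary placeholder.

```python
import math


def realized_pnl(entry_price, exit_price, exit_size, leverage=1.0):
    # Python transcription for illustration; the delta expression follows
    # the prose description rather than the actual kernel code.
    delta = (exit_price - entry_price) / entry_price
    notional = exit_size * entry_price * max(leverage, 1.0)
    return delta * notional


MAX_PRICE = 1e9  # hypothetical sanity ceiling, not a value from the codebase


def price_is_sane(price, max_price=MAX_PRICE):
    """Finite, strictly positive, and below a configured ceiling."""
    return math.isfinite(price) and 0.0 < price <= max_price


pnl = realized_pnl(entry_price=1e300, exit_price=0.0, exit_size=0.001)
```

Here `pnl` comes out at roughly -1e297, while `price_is_sane(1e300)` is False, which is the kind of guard the current checks lack.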

Severity: Medium

Q8: `_first_invalid_intent_field()` does not reject `reference_price <= 0` or `target_size == 0`

File: rust_backend.py:395-410

scalar_checks = (
    ("target_size", float(intent.target_size if intent.target_size is not None else 0.0)),
    ("reference_price", float(intent.reference_price if intent.reference_price is not None else 0.0)),
    ("leverage", float(intent.leverage if intent.leverage is not None else 0.0)),
    ("limit_price", float(getattr(intent, "limit_price", 0.0) or 0.0)),
)
for name, value in scalar_checks:
    if not math.isfinite(value):
        return (name, value)
# Then only checks target_size < 0:
size = float(intent.target_size if intent.target_size is not None else 0.0)
if size < 0.0:
    return ("target_size", size)

The guard catches NaN/Inf and negative target_size. It does NOT catch:

reference_price = 0 (valid zero-price? No — price should never be zero)
reference_price < 0 (negative price — should never happen)
target_size = 0 (zero-quantity order — waste of a process_intent call)
leverage = 0 (Rust silently falls back to 1.0)

A reference_price = 0 passes through, and the Rust kernel's mark_price silently skips it (returns early on price <= 0.0). The intent is processed as if no price was provided.
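
A stricter variant might look like the sketch below. It assumes that zero size and zero leverage are never legitimate, which the trace does not establish for every intent kind (cancels, for instance), and it leaves limit_price out because zero may be a valid value for market orders. The function name is hypothetical.

```python
import math
from types import SimpleNamespace


def first_invalid_intent_field(intent):
    """Return (name, value) for the first bad field, or None if all pass."""
    checks = (
        ("target_size", intent.target_size),
        ("reference_price", intent.reference_price),
        ("leverage", intent.leverage),
    )
    for name, raw in checks:
        value = float(raw if raw is not None else 0.0)
        # Reject NaN/Inf as before, and additionally zero or negative values.
        if not math.isfinite(value) or value <= 0.0:
            return (name, value)
    return None


ok = SimpleNamespace(target_size=1.0, reference_price=100.0, leverage=2.0)
zero_price = SimpleNamespace(target_size=1.0, reference_price=0.0, leverage=2.0)
```

With this check a zero price is rejected on the Python side instead of being silently skipped by mark_price.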

Severity: Low

Q9: Rust `Utc::now()` and Python `datetime.now(timezone.utc)` timestamps can diverge within the same process

File: _rust_kernel/src/lib.rs (transition timestamp), Python bingx_venue.py (event timestamp)

When the kernel processes an on_venue_event:

The event carries timestamp = Python's datetime.now(timezone.utc)
The kernel's transition() method uses event.timestamp if present, else falls back to Utc::now() (Rust side)

Both sides ultimately read the same system clock (Python's datetime.now() and chrono's Utc::now() both query the OS wall clock), so timestamps taken at nearly the same instant should agree to within microseconds. The issue is an architectural asymmetry rather than clock skew: some transitions carry Python-sourced timestamps and others carry Rust-sourced timestamps, taken at different points in the call.

A specific case: the TERMINAL_STATE guard (Q4 item 1) records its transition using event.timestamp (Python source). The CANCEL branch in process_intent is a counter-example of a different kind, because it creates no transition at all (flaw M2). Where transitions do exist, they mix clock sources.

Severity: Low

Q10: `threading.Event.wait(timeout)` uses platform-dependent clock — CLOCK_REALTIME on some platforms, affected by NTP jumps

File: bingx_venue.py:259

if not self._snapshot_ready.wait(timeout=timeout_ms / 1000.0):
    return self._last_snapshot  # stale data

threading.Event.wait(timeout) is implemented differently across platforms and Python versions. On some platforms (notably older glibc), pthread_cond_timedwait uses CLOCK_REALTIME (wall clock). If an NTP correction jumps the wall clock forward by even 1 second during the wait, the timeout expires 1 second early. If it jumps backward, the wait extends by 1 second.

The _backend_snapshot method is the single most important timeout in the system — it controls whether the venue adapter returns fresh or stale exchange state. A premature timeout (NTP forward jump) causes a stale snapshot to be used for order submission, potentially causing wrong position sizing or duplicate orders.

Fix: Use time.monotonic() with a deadline loop around Event.wait() — exactly what InMemoryControlPlane.wait() already does correctly (control.py:131-138).
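
A deadline loop of that kind could be sketched as follows. The helper name and the `slice_s` parameter are assumptions; the implementation in control.py was not reproduced in this trace and may differ.

```python
import threading
import time


def wait_monotonic(event: threading.Event, timeout_s: float, slice_s: float = 0.05) -> bool:
    """Wait for `event` up to `timeout_s`, measured with time.monotonic()."""
    deadline = time.monotonic() + timeout_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0.0:
            return event.is_set()
        # A slice that returns early (for example after a forward wall-clock
        # jump) just loops again; the overall deadline stays monotonic.
        if event.wait(min(remaining, slice_s)):
            return True
```

A backward wall-clock jump can still stretch a single slice on affected platforms, so this bounds premature expiry rather than eliminating every clock effect.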

Severity: Medium

Q11: Backup `_on_venue_event` had no `STALE_STATE_RECONCILING` guard — current added it

File: _backup_20260530/rust_kernel_src/lib.rs vs current _rust_kernel/src/lib.rs

The backup on_venue_event only had a STALE_STATE_RECONCILING check on process_intent (reconcile branch). The current version also checks it in on_venue_event — when the slot is in STALE_STATE_RECONCILING, only RECONCILE events are accepted. All other event kinds return STALE_STATE_RECONCILE diagnostic.

This is a safety improvement — prevents stray fills/acks from modifying a slot that's being reconciled.

Severity: Informational

Q12: 5 of 5 timestamp deserialization sites use `datetime.fromisoformat()` — all fail on Python < 3.11 with Rust `Z` suffix

Covered in Q1 and Q6. Listing all sites: rust_backend.py:215,241,260 and real_zinc_plane.py:95,122. Same root cause.

Severity: High

Pass 14 Summary

#	Flaw	Layer	Severity
Q1	`fromisoformat()` can't parse Rust `Z` suffix on Python < 3.11 — crashes every timestamp deserialization	Bridge	High
Q2	No `#[serde(deny_unknown_fields)]` — misspelled fields silently default	Rust	Medium
Q3	`indexmap` new dependency in current code (informational)	Rust	Info
Q4	Backup diff: 6 critical bug fixes between backup and current (informational)	Rust	Info
Q5	`MarketSnapshot.timestamp` type inconsistent — float vs datetime in same file	Data Feed	High
Q6	`fromisoformat()` Z-suffix fail on all 5 timestamp deserialization sites	Bridge	High
Q7	No upper-bound price validation — 1e300 passes all guards	Bridge	Medium
Q8	`_first_invalid_intent_field` does not reject zero/negative price or zero size	Bridge	Low
Q9	Rust/Python clock sources diverge — transition timestamps mixed source	Rust	Low
Q10	`threading.Event.wait()` uses platform-dependent clock — NTP jump risk	Venue	Medium
Q11	Backup had no `STALE_STATE_RECONCILING` guard in `on_venue_event` (info)	Rust	Info
Q12	All 5 `fromisoformat()` sites fail on Python < 3.11 (duplicate of Q1)	Bridge	High

Pass 14 Severity

Severity	Count
High	4 (Q1, Q5, Q6, Q12)
Medium	3 (Q2, Q7, Q10)
Low	2 (Q8, Q9)
Info	3 (Q3, Q4, Q11)

Combined Catalog (All 14 Passes)

Pass	Focus	Count	Critical	High	Medium	Low	Info
A	Architectural	15	0	2	0	2	11
T	Threading/Atomicity	9	1	3	3	2	0
E	E2E Trace (Pass 1)	26	0	4	10	11	1
F	Deep E2E (Pass 3)	30	0	1	8	17	4
G	Domain Scans (Pass 4)	36	4	11	11	8	2
H	Edge Domains (Pass 5)	22	3	9	5	4	1
I	Pass 6 (Math/Tests/Recovery/Security)	22	3	11	4	2	2
J	Pass 7 (Test Infra/Data/Rust/Env/Conn)	16	0	7	7	2	0
K	Pass 8 (Observability/Memory/Time/DeadCode)	23	2	7	7	1	6
L	Pass 9 (Contracts/Events/Network/FFI/Diffs)	16	0	4	8	4	0
M	Pass 10 (Runtime/TestBugs/FSM/Persistence/Metrics)	18	3	7	5	3	0
N	Pass 11 (Async/Sync Seams/Locks/Threading)	10	4	1	3	1	1
O	Pass 12 (Sync/Async Wider Scope)	11	0	3	7	1	0
P	Pass 13 (FFI Safety/Dangling Pointers/Coverage)	9	1	3	3	1	1
Q	Pass 14 (Serde Edges/Backup Diffs/Market Data)	12	0	4	3	2	3
Total		275	21	77	84	61	32

PASS 15 — RESOURCE LEAKS, TRUST BOUNDARIES, SECURITY

R1: `BingxVenueAdapter` `ThreadPoolExecutor` never shut down — 3 threads leak for process lifetime

File: bingx_venue.py:194-208

class BingxVenueAdapter(VenueAdapter):
    _EXECUTOR: concurrent.futures.ThreadPoolExecutor | None = None
    _EXECUTOR_LOCK: threading.Lock = threading.Lock()

    @classmethod
    def _get_executor(cls) -> concurrent.futures.ThreadPoolExecutor:
        if cls._EXECUTOR is None:
            with cls._EXECUTOR_LOCK:
                if cls._EXECUTOR is None:
                    cls._EXECUTOR = concurrent.futures.ThreadPoolExecutor(
                        max_workers=3, thread_name_prefix="bingx_adapter",
                    )
        return cls._EXECUTOR

The ThreadPoolExecutor is a class-level singleton with no shutdown path. No close() method, no atexit handler, no classmethod for cleanup. The 3 worker threads persist for the entire process lifetime.

CPython's ThreadPoolExecutor defines no __del__. Worker threads are instead joined by an exit hook that concurrent.futures.thread registers at import time, so an executor that was never shut down is only torn down at interpreter exit, where the process waits for already-queued work items to finish. Since _EXECUTOR is a class variable, nothing releases it earlier.

Trigger: Every call to BingxVenueAdapter.submit() (line 371), cancel() (line 435), or snapshot() (line 509) submits to this executor. The pool is capped at 3 workers, so the cost does not grow with uptime, but those threads and their stacks (typically up to 8 MB reserved each on Linux) are held for the life of the process with no clean reclamation path.
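
A shutdown path could be added with a second classmethod, sketched here on a reduced stand-in class. `Adapter` and `shutdown_executor` are hypothetical names; the lock and executor attributes mirror the ones quoted above.

```python
import concurrent.futures
import threading


class Adapter:
    """Reduced stand-in for the venue adapter; only executor handling is shown."""

    _EXECUTOR = None
    _EXECUTOR_LOCK = threading.Lock()

    @classmethod
    def _get_executor(cls):
        with cls._EXECUTOR_LOCK:
            if cls._EXECUTOR is None:
                cls._EXECUTOR = concurrent.futures.ThreadPoolExecutor(
                    max_workers=3, thread_name_prefix="bingx_adapter"
                )
            return cls._EXECUTOR

    @classmethod
    def shutdown_executor(cls, wait=True):
        # Detach under the lock so a later _get_executor() builds a fresh pool.
        with cls._EXECUTOR_LOCK:
            executor, cls._EXECUTOR = cls._EXECUTOR, None
        if executor is not None:
            executor.shutdown(wait=wait)
```

The launcher's close path could then call `Adapter.shutdown_executor()` once the last adapter instance is released.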

Severity: High

R2: `BingxVenueAdapter` has no `close()` method — backend `BingxDirectExecutionAdapter` HTTP client unreleasable

File: bingx_venue.py (no close method), venue.py (protocol), launcher.py:262-264

The VenueAdapter protocol (venue.py) does not define close() or disconnect(). BingxVenueAdapter holds a _backend: BingxDirectExecutionAdapter which in turn holds an HTTP client session. The launcher's _maybe_close() tries .close() then .disconnect() but gets AttributeError (caught by except Exception: pass).

This means:

The BingxDirectExecutionAdapter's aiohttp.ClientSession is never closed
The underlying TCPConnector connection pool remains open
Any HTTP keep-alive connections to BingX remain open until OS timeout
No clean teardown path exists at any level of the venue stack

The BingxUserStream (WebSocket handler) has a close() method and proper task cancellation. The venue adapter (the synchronous REST path) has none. Asymmetric design.

Severity: High

R3: `real_zinc_plane._intent_cache` grows unboundedly — memory proportional to total lifetime intents

File: real_zinc_plane.py:157,202

# Line 157 (__init__):
self._intent_cache: List[Dict[str, Any]] = []

# Line 201-203 (publish_intent):
self._intent_cache.append(row)
self._write_region(self.intent_region, self._intent_seq, {"items": self._intent_cache[-512:]})

Every call to publish_intent() appends one dict to _intent_cache. Only the last 512 items are written to shared memory, but the cache list itself is never trimmed. Over a 24-hour session at 1 intent/second, this grows to 86,400 dicts — approximately 50-100 MB of memory for the cache alone (each dict contains timestamp, intent_id, asset, side, size, price, etc.).

The -512 slice on write is a dead giveaway that the developer knew only the last 512 items were relevant — but forgot to trim the source list. The fix is self._intent_cache = self._intent_cache[-512:] after the append.
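
An alternative to re-slicing the list is a bounded deque, sketched below. The `publish` function and `INTENT_CACHE_LIMIT` are illustrative names; the real method also writes the payload to shared memory, which is omitted here.

```python
from collections import deque

INTENT_CACHE_LIMIT = 512  # matches the slice size used on write

intent_cache = deque(maxlen=INTENT_CACHE_LIMIT)


def publish(row):
    intent_cache.append(row)               # oldest entry is dropped once full
    return {"items": list(intent_cache)}   # payload that would be written


for i in range(1000):
    payload = publish({"intent_id": i})
```

After 1000 publishes the cache still holds 512 rows, so memory stays constant regardless of session length.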

Compare with: account.py's seen_account_event_ids (Rust side, capped at 1024 via IndexSet LRU), and journal.py's MemoryKernelJournal (capped at 10,000). Every other cache in the system has a bound; this one doesn't.

Severity: High

R4: `RealZincPlane`/`RealZincControlPlane` partial-construction leak — `SharedRegion` never cleaned up on init failure

File: real_zinc_plane.py:161-176, real_control_plane.py:72-83

# real_zinc_plane.py __init__:
self.intent_region = SharedRegion.create(f"{prefix}_intent", 65536)
self.state_region = SharedRegion.create(f"{prefix}_state", 65536)    # if this fails...
self.control_region = SharedRegion.create(f"{prefix}_control", 4096)  # ...or this

If SharedRegion.create() succeeds for intent_region but fails for state_region (e.g., out of shared memory, permission denied, name collision), the constructor raises. The already-created intent_region has no cleanup path — close() is never called because the caller never gets a valid object reference. The shared memory segment leaks until the OS cleans it on process exit (or reboot on some systems).

Same pattern in RealZincControlPlane.__init__ with 2 regions.
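
One way to make the constructor unwind cleanly is contextlib.ExitStack, sketched here. `Region` is a hypothetical stand-in for SharedRegion with an injected failure switch, and `Plane` models only the region setup.

```python
from contextlib import ExitStack


class Region:
    """Hypothetical stand-in for SharedRegion; records which regions were closed."""

    closed = []

    def __init__(self, name):
        self.name = name

    @classmethod
    def create(cls, name, size, fail=False):
        if fail:
            raise OSError(f"cannot create {name}")
        return cls(name)

    def close(self):
        Region.closed.append(self.name)


class Plane:
    def __init__(self, prefix, fail_on=None):
        with ExitStack() as stack:
            for suffix, size in (("intent", 65536), ("state", 65536), ("control", 4096)):
                region = Region.create(f"{prefix}_{suffix}", size, fail=(suffix == fail_on))
                stack.callback(region.close)   # undone on failure, or later by close()
                setattr(self, f"{suffix}_region", region)
            # All regions exist: move the cleanups out so they do not run now.
            self._regions = stack.pop_all()

    def close(self):
        self._regions.close()
```

If any create() call raises, the regions opened so far are closed in reverse order before the exception reaches the caller.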

Severity: Medium

R5: `BingxUserStream` `ClientSession` has no `__del__` fallback, so the connection pool leaks if `close()` is not called

File: bingx_user_stream.py:229-230,433-436

async def close(self) -> None:
    self._closed.set()
    if self._session is not None and not self._session.closed:
        await self._session.close()

The aiohttp.ClientSession (created in _get_session() with TCPConnector(limit=4)) is only closed when close() is explicitly called. There is no __del__, __aenter__, or __aexit__. If a caller abandons the BingxUserStream object without calling close() — or if close() is never called because an exception occurs before the call — the TCP connection pool (4 connections to BingX) leaks.

During the reconnect loop (subscribe()), if self._session.closed is detected between retries, a new session and connector are created — the old connector's connections are released by ClientSession.close() which is called in the retry path. So the reconnect path itself is clean. But the top-level cleanup depends entirely on external discipline.

Severity: Medium

R6: `test_alpha_blue_untouched_g7.py` — two `open()` calls without context manager, file descriptors leak

File: test_alpha_blue_untouched_g7.py:31,63

src = open("/mnt/dolphinng5_predict/prod/clean_arch/dita_v2/gen2.py").read()  # line 31
src = open(full).read()                                                        # line 63

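Both open a file and chain .read() to load the contents, but neither closes the file handle explicitly. CPython's reference counting usually closes it as soon as the temporary file object is dropped, but that is an implementation detail: on other interpreters, or if the object is kept alive by a traceback, the descriptor stays open until garbage collection, and Python emits a ResourceWarning. In a large test suite this can approach the descriptor ulimit (commonly 1024 on Linux, 256 on macOS).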

Severity: Low (test code, non-production)

R7: All exchange REST/WS data parsed without schema validation — exchange controls all field values

Files: bingx_venue.py:60-74,80-88,96-121,151-186, bingx_user_stream.py:267-379

The system has a single trust boundary for exchange data: all BingX REST API responses and WebSocket frames are parsed without schema validation. Key entry points:

bingx_user_stream.py:267: json.loads(text) on raw WebSocket frame — any valid JSON structure is accepted. No schema validation before field access with .get().
bingx_venue.py:60-74: _row_text() extracts string values from exchange response dicts with no sanitization beyond .strip().
bingx_venue.py:80-88: _row_float() — catches ValueError on float parse but does not filter NaN/Inf (these pass float() fine).
bingx_venue.py:297-301: _rate_limit_retry_after_ms() parses exchange error message with re.search(r"unblocked after (\d+)", msg) — exchange controls the error message content.
bingx_venue.py:338-340: cancel() exception handler includes str(exc) directly in the response event dict.

An exchange sending crafted responses could inject:

Arbitrary strings into reason/msg fields (propagated to journal/ClickHouse)
Non-numeric values that fail float() only on consumption
Enormous lists in snapshot responses (OOM risk — no size limit on snapshot.open_orders iteration)
NaN/Inf in price/size fields (pass through float() — Rust kernel is_finite() check on kernel side catches some but not all)
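
A validation layer at this boundary could start with helpers like the ones sketched below. The names `strict_float` and `bounded_list`, and the default limits, are assumptions for illustration rather than values taken from the codebase.

```python
import math


def strict_float(row, key, *, minimum=0.0, maximum=1e12):
    """Return a finite float within [minimum, maximum] from `row`, else None."""
    try:
        value = float(row.get(key))
    except (TypeError, ValueError):
        return None
    # float() accepts "nan" and "inf", so finiteness is checked separately.
    if not math.isfinite(value) or not (minimum <= value <= maximum):
        return None
    return value


def bounded_list(items, limit=1000):
    """Refuse oversized or non-list payloads instead of iterating them."""
    if not isinstance(items, list) or len(items) > limit:
        raise ValueError("unexpected list payload")
    return items
```

Rejecting bad values where they enter keeps NaN, Inf, and oversized lists away from the journal and the kernel, instead of relying on downstream checks.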

Severity: Critical — exchange controls all inbound data with no schema validation, and data flows to ClickHouse journaling and the Rust kernel memory.

R8: Shared memory JSON deserialization without integrity check

Files: real_zinc_plane.py:127-128, real_control_plane.py:60-61

# real_zinc_plane.py
def _decode_packet(self, payload: bytes) -> dict:
    return json.loads(payload)

# real_control_plane.py
payload = region.read()
if payload:
    data = json.loads(payload)

Both the Zinc plane and control plane deserialize JSON from shared memory without any integrity check (no HMAC, no checksum, no signature). Any process with access to the shared memory segment (under /dev/shm on Linux, where access is governed by the mode the segment was created with) can:

Inject arbitrary KernelIntent objects — the control plane reads intents from intent_region and dispatches them to process_intent(). An attacker could submit fake intents with malicious parameters.
Inject fake events into the event stream via the control plane region.
Corrupt slot state via the state region.

The shared memory segments are named by the DITA_V2_PREFIX env var (default dita_v2). On a shared system, any process running as the same user can read/write these segments.
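
An integrity tag could be added with the standard hmac module, as in this sketch. The packet layout (32-byte tag followed by the JSON body) and the function names are assumptions, and key distribution is out of scope here. A same-user attacker who can read the key is not stopped by this, and replay of old packets is not prevented.

```python
import hashlib
import hmac
import json

TAG_LEN = hashlib.sha256().digest_size  # 32 bytes


def encode_packet(obj, key: bytes) -> bytes:
    body = json.dumps(obj, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, body, hashlib.sha256).digest() + body


def decode_packet(payload: bytes, key: bytes):
    tag, body = payload[:TAG_LEN], payload[TAG_LEN:]
    expected = hmac.new(key, body, hashlib.sha256).digest()
    # Constant-time comparison; also rejects truncated or torn payloads.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("packet failed integrity check")
    return json.loads(body)
```

A side benefit is that a partially written region fails the tag check cleanly instead of reaching json.loads with a truncated body.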

Severity: High

R9: `restore_state()` deserializes arbitrary JSON into full kernel state — no provenance tracking

File: rust_backend.py:293-296 (Python), _rust_kernel/src/lib.rs:934-968 (Rust)

# Python:
def restore_state(self, json_str: str) -> bool:
    result = self.lib.dita_kernel_restore_state_json(self._backend, json_str.encode("utf-8"))
    return result == 0

# Rust:
pub extern "C" fn dita_kernel_restore_state_json(handle: *mut KernelHandle, payload: *const c_char) -> i32 {
    let payload = ...; // parse to KernelFullSnapshot
    let core = unsafe { &mut (*handle).core };
    core.restore_full_snapshot(&payload)  // overwrites ALL kernel state
}

dita_kernel_restore_state_json overwrites the entire kernel state — all slots, account balances, fee configuration, seen_event_ids, and capital_frozen flag — from a single JSON string. The method is public on ExecutionKernel.restore_state() with no authentication, no authorization check, and no call stack validation.

The JSON string can come from:

The DITAv2LauncherBundle (restart path)
A file read from disk
An attacker who gains access to the Python runtime (e.g., via shared memory injection, R8)

Once restored, the kernel accepts the state as truth. There is no restore_state counter or version chain to prevent replay of old snapshots.

Severity: Critical

R10: `DOLPHIN_BINGX_ENV` + `DOLPHIN_BINGX_ALLOW_MAINNET` — mainnet switch via env var

File: launcher.py:189-190

DOLPHIN_BINGX_ENV = os.environ.get("DOLPHIN_BINGX_ENV", "VST")  # VST = testnet
DOLPHIN_BINGX_ALLOW_MAINNET = os.environ.get("DOLPHIN_BINGX_ALLOW_MAINNET", "false").lower() in ("true", "1", "yes")

Setting DOLPHIN_BINGX_ENV=LIVE + DOLPHIN_BINGX_ALLOW_MAINNET=true switches from testnet to production mainnet BingX. The DOLPHIN_BINGX_ALLOW_MAINNET check exists specifically as a safety gate, but both are attacker-controlled env vars with the same provenance as all other env config.

An attacker with access to set env vars (container breakout, CI/CD injection, shared hosting) could:

Redirect all trades to mainnet
Use real capital instead of testnet funds
Cost real money on every trade

Severity: High

R11: `.env` file loaded from project root — secrets exposure risk

File: launcher.py:23,51

from dotenv import load_dotenv
...
load_dotenv(PROJECT_ROOT / ".env")

The .env file is loaded from PROJECT_ROOT, which is Path(__file__).resolve().parents[3] — three directories up from the launcher file. On a shared development machine or CI runner, this file is:

World-readable if not explicitly chmod'd (default umask creates files 644)
Accessible to any process running as the same user
Often committed to version control accidentally (no .gitignore guarantee)
Visible in Docker layer history if included in the build context

The .env file contains BINGX_API_KEY and BINGX_SECRET_KEY — the exchange credentials. On a shared system, every user with read access can extract these keys.

Severity: High

R12: Unvalidated `int()` on env vars — `DOLPHIN_BINGX_RECV_WINDOW_MS` could accept extreme values

File: launcher.py:191-193

recv_window_ms = int(os.environ.get("DOLPHIN_BINGX_RECV_WINDOW_MS", "5000"))
default_leverage = int(os.environ.get("DOLPHIN_BINGX_DEFAULT_LEVERAGE", "1"))
exchange_leverage_cap = int(os.environ.get("DOLPHIN_BINGX_EXCHANGE_LEVERAGE_CAP", "3"))

Three env vars are directly passed to int() with only a string default — no bounds checking. An attacker setting DOLPHIN_BINGX_RECV_WINDOW_MS=2147483647 could set the exchange recv window to ~24 days, allowing replay attacks on signed requests. An attacker setting DOLPHIN_BINGX_EXCHANGE_LEVERAGE_CAP=1000 could allow 1000x leverage on the exchange.
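
Bounds checking at load time could be handled by a small helper such as the following sketch. The name `env_int` and the bounds used in the example call are hypothetical; suitable limits depend on what the exchange actually permits.

```python
import os


def env_int(name, default, *, lo, hi, environ=os.environ):
    """Read an integer env var and reject values outside [lo, hi]."""
    raw = environ.get(name)
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(f"{name}={raw!r} is not an integer") from None
    if not lo <= value <= hi:
        raise ValueError(f"{name}={value} outside [{lo}, {hi}]")
    return value


# Example bounds only; the real limits are a policy decision.
recv_window_ms = env_int("DOLPHIN_BINGX_RECV_WINDOW_MS", 5000, lo=1, hi=60000, environ={})
```

Failing fast at startup turns a silent misconfiguration into an explicit error before any order is signed.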

Severity: Medium

R13: `BingxUserStream` `listenKey` from exchange response used in WebSocket URL — MITM injection surface

File: bingx_user_stream.py:230,398

# Line 230:
url = f"{self._ws_url}?listenKey={listen_key}"

# Line 398:
listen_key = resp.get("listenKey", "")  # from exchange POST /openApi/user/auth/userDataStream

The listenKey comes from the BingX REST API response (POST /openApi/user/auth/userDataStream). It is used directly in the WebSocket connection URL with no encoding or validation. The listenKey is a short opaque string (looks like a UUID), but:

If an attacker can MITM the REST response (DNS spoofing, proxy, etc.), they control the listenKey value
A malicious listenKey with URL metacharacters (&, =, #) could inject query parameters into the WebSocket URL
The listenKey is BingX's session authentication mechanism — once an attacker controls it, they can hijack the user data stream

The fix is urllib.parse.urlencode({"listenKey": listen_key}) but the current code uses an f-string.
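
A sketch of that fix, combined with a format check, follows. The accepted character set in `_LISTEN_KEY_RE` is an assumption about what a listenKey looks like, and `build_ws_url` is a hypothetical helper name.

```python
import re
from urllib.parse import urlencode

# Assumed shape of a listenKey; adjust to the format the exchange documents.
_LISTEN_KEY_RE = re.compile(r"[A-Za-z0-9_-]{1,128}")


def build_ws_url(ws_url: str, listen_key: str) -> str:
    if not _LISTEN_KEY_RE.fullmatch(listen_key):
        raise ValueError("unexpected listenKey format")
    # urlencode percent-escapes metacharacters such as & and =.
    return f"{ws_url}?{urlencode({'listenKey': listen_key})}"


encoded = urlencode({"listenKey": "a&b=c"})
```

Encoding alone prevents parameter injection; the format check additionally rejects values that could never be a genuine key.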

Severity: High

R14: `mock_venue._exchange_event_queue` unbounded growth — event enqueue rate can exceed consumption rate

File: mock_venue.py:220,230

# _queue_exchange_event:
self._exchange_event_queue.append(event)

# subscribe (generator):
if self._exchange_event_queue:
    yield self._exchange_event_queue.pop(0)

The mock venue's event queue is consumed one event at a time via a generator in subscribe(). If queue_exchange_event() is called faster than the consumer calls next() on the generator (which happens on every _on_exchange_event() callback), the list grows without bound. In test scenarios with rapid-fire events, this can exhaust memory.

Not a production risk (mock is test-only), but the unbounded growth pattern is worth noting.

Severity: Low (test code)

Pass 15 Summary

#	Flaw	Layer	Severity
R1	`ThreadPoolExecutor` never shut down — 3 threads leak	Venue	High
R2	`BingxVenueAdapter` no `close()` — backend HTTP client unreleasable	Venue	High
R3	`real_zinc_plane._intent_cache` grows unboundedly	Plane	High
R4	`RealZincPlane`/`ControlPlane` partial-construction SharedRegion leak	Plane	Medium
R5	`BingxUserStream.ClientSession` no `__del__` — connection pool leak	Venue	Medium
R6	`test_alpha_blue_untouched_g7.py` open() without context manager	Test	Low
R7	All exchange REST/WS data parsed without schema validation	Venue	Critical
R8	Shared memory JSON deserialization without integrity check	Plane	High
R9	`restore_state()` deserializes arbitrary JSON — full kernel takeover	Bridge	Critical
R10	`DOLPHIN_BINGX_ENV` + `ALLOW_MAINNET` mainnet switch via env	Config	High
R11	`.env` file loaded from project root — secrets exposure	Config	High
R12	Unvalidated `int()` on env vars — recv_window, leverage extremes	Config	Medium
R13	`listenKey` from exchange in WS URL f-string — MITM injection	Venue	High
R14	`mock_venue._exchange_event_queue` unbounded growth	Test	Low

Pass 15 Severity

Severity	Count
Critical	2 (R7, R9)
High	7 (R1, R2, R3, R8, R10, R11, R13)
Medium	3 (R4, R5, R12)
Low	2 (R6, R14)

Combined Catalog (All 15 Passes)

Pass	Focus	Count	Critical	High	Medium	Low	Info
A	Architectural	15	0	2	0	2	11
T	Threading/Atomicity	9	1	3	3	2	0
E	E2E Trace (Pass 1)	26	0	4	10	11	1
F	Deep E2E (Pass 3)	30	0	1	8	17	4
G	Domain Scans (Pass 4)	36	4	11	11	8	2
H	Edge Domains (Pass 5)	22	3	9	5	4	1
I	Pass 6 (Math/Tests/Recovery/Security)	22	3	11	4	2	2
J	Pass 7 (Test Infra/Data/Rust/Env/Conn)	16	0	7	7	2	0
K	Pass 8 (Observability/Memory/Time/DeadCode)	23	2	7	7	1	6
L	Pass 9 (Contracts/Events/Network/FFI/Diffs)	16	0	4	8	4	0
M	Pass 10 (Runtime/TestBugs/FSM/Persistence/Metrics)	18	3	7	5	3	0
N	Pass 11 (Async/Sync Seams/Locks/Threading)	10	4	1	3	1	1
O	Pass 12 (Sync/Async Wider Scope)	11	0	3	7	1	0
P	Pass 13 (FFI Safety/Dangling Pointers/Coverage)	9	1	3	3	1	1
Q	Pass 14 (Serde Edges/Backup Diffs/Market Data)	12	0	4	3	2	3
R	Pass 15 (Resource Leaks/Trust Boundaries/Security)	14	2	7	3	2	0
Total		289	23	84	87	63	32

PASS 16 — ERROR HANDLING, ARITHMETIC STABILITY, TEST INFRASTRUCTURE

S1: `realized_pnl()` and `mark_price()` NaN bypasses `<= 0.0` guard — NaN PnL corrupts `k_realized_pnl`

Files: _rust_kernel/src/lib.rs:1121-1129 (realized_pnl), _rust_kernel/src/lib.rs:400-408 (mark_price), contracts.py:193-200 (Python mark_price)

IEEE 754 semantics: every ordered or equality comparison involving NaN returns false (only != returns true). The guard if slot.entry_price <= 0.0 || exit_size <= 0.0 therefore evaluates to false for NaN, so a NaN entry_price or exit_size passes through silently.

Rust realized_pnl() (line 1121-1129):

if slot.entry_price <= 0.0 || exit_size <= 0.0 { return 0.0; }  // NaN passes
let mut delta = (exit_price - slot.entry_price) / slot.entry_price;  // NaN / NaN = NaN
delta * notional  // NaN → corrupts k_realized_pnl via +=

Rust mark_price() (line 400-408):

if self.entry_price <= 0.0 || self.size <= 0.0 { return; }  // NaN passes
self.unrealized_pnl = delta * self.size * self.entry_price * self.leverage;  // stores NaN

Python mark_price() (contracts.py:193-200): Same pattern — if self.entry_price <= 0 or self.size <= 0 passes NaN, produces NaN PnL.

Once NaN enters k_realized_pnl or unrealized_pnl, every subsequent arithmetic operation propagates NaN: k_capital, available_capital, margin checks, reconcile deltas. The kernel enters a dead state where all financial computations produce NaN.

Trigger paths for NaN entry_price:

set_slot_json() bypasses process_intent — can set arbitrary slot fields
An INVALID_ORDER_PARSE event that produces entry_price = NaN in exchange data
A 0.0 / 0.0 in a prior computation (extremely unlikely but theoretically possible); a nonzero value divided by zero yields ±inf rather than NaN, and +inf also passes the <= 0.0 guard

Fix: Add a finiteness check ahead of the <= 0.0 test in all three sites: !x.is_finite() || x <= 0.0 in Rust, not math.isfinite(x) or x <= 0 in Python.
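
The bypass and the guard can be reproduced in a few lines of Python. The PnL formula below is a simplified stand-in for the kernel's (no leverage, no side), and both function names are hypothetical.

```python
import math


def realized_pnl_unguarded(entry_price: float, exit_price: float, exit_size: float) -> float:
    # NaN makes both comparisons False, so the early return is skipped.
    if entry_price <= 0.0 or exit_size <= 0.0:
        return 0.0
    delta = (exit_price - entry_price) / entry_price
    return delta * exit_size * entry_price


def realized_pnl_guarded(entry_price: float, exit_price: float, exit_size: float) -> float:
    # Reject NaN and +/-inf before the sign check.
    if not all(math.isfinite(x) for x in (entry_price, exit_price, exit_size)):
        return 0.0
    if entry_price <= 0.0 or exit_size <= 0.0:
        return 0.0
    delta = (exit_price - entry_price) / entry_price
    return delta * exit_size * entry_price


nan = float("nan")
leaked = realized_pnl_unguarded(nan, 100.0, 1.0)  # NaN reaches the caller
blocked = realized_pnl_guarded(nan, 100.0, 1.0)   # stopped at the guard
```

Returning 0.0 mirrors the existing early return. Whether a non-finite input should instead raise or flag the slot for intervention is a policy decision this sketch does not settle.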

Severity: Critical

S2: MockVenue `_exchange_event_queue` property has check-then-act race — silently drops events

File: mock_venue.py:228-232

@property
def _exchange_event_queue(self) -> list:
    if not hasattr(self, "_exeq"):
        object.__setattr__(self, "_exeq", [])
    return self._exeq

hasattr + object.__setattr__ is a classic TOCTOU race. If queue_exchange_event() (called from sync test code) and subscribe() (async generator started on event loop thread) interleave:

Thread A calls hasattr → False
Thread B calls hasattr → False
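Thread A calls object.__setattr__ → creates _exeq = [] and returns that list
Thread B calls object.__setattr__ → replaces _exeq with a second empty list
Thread A appends its event to the list it already holds, which is no longer reachable through self._exeq
Thread B returns the new list, so Thread A's append is invisible to every later reader

The same list is then subject to list.append() and list.pop(0) without synchronization. The emptiness check and the pop(0) in subscribe() are separate steps, so with more than one consumer a pop(0) can land on an empty list, raise IndexError, and crash the subscriber.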

Fix: Use threading.Lock around queue access, or use collections.deque with self._exeq initialized in __post_init__.
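
A minimal sketch of the lock-plus-deque variant. `EventQueue` is a hypothetical stand-in for the property on the mock venue; the point is that the container is created once in the constructor and that check and pop happen under one lock.

```python
import threading
from collections import deque


class EventQueue:
    """Queue created eagerly, with check-and-pop done atomically."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._items = deque()  # created once, never lazily replaced

    def put(self, event) -> None:
        with self._lock:
            self._items.append(event)

    def pop_or_none(self):
        # Emptiness check and pop share the lock, so no IndexError is possible.
        with self._lock:
            return self._items.popleft() if self._items else None


queue = EventQueue()
producers = [
    threading.Thread(target=lambda: [queue.put(("evt", i)) for i in range(1000)])
    for _ in range(4)
]
for t in producers:
    t.start()
for t in producers:
    t.join()

drained = 0
while queue.pop_or_none() is not None:
    drained += 1
```

`deque.popleft()` is also O(1), whereas `list.pop(0)` shifts every remaining element.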

Severity: Critical

S3: No FSM-specific test files — `test_kernel_fsm.py` and `test_kernel_fsm_recovery.py` do not exist

Files: missing — no test_kernel_fsm.py or test_kernel_fsm_recovery.py anywhere in workspace

The kernel's FSM is the core of the system, with states IDLE → ORDER_REQUESTED → ORDER_SENT → ENTRY_WORKING → POSITION_OPEN → EXIT_REQUESTED → EXIT_WORKING → CLOSED and additional states STALE_STATE_RECONCILING, INTERVENTION_REQUIRED, TRADE_TERMINAL_WRITTEN.

Missing transition coverage:

Transition	Status
`IDLE → ENTRY_WORKING` via ENTER intent	✅ tested incidentally via test_flaws
`ENTRY_WORKING → POSITION_OPEN` via fill	✅ tested incidentally
`POSITION_OPEN → EXIT_REQUESTED` via EXIT	✅ partial
`POSITION_OPEN → IDLE` via exit cancel	⚠️ single test only
`POSITION_OPEN → POSITION_OPEN` via partial exit	❌ NOT tested
`STALE_STATE_RECONCILING → *`	❌ NOT tested
`INTERVENTION_REQUIRED → *`	❌ NOT tested
`TRADE_TERMINAL_WRITTEN → *`	❌ NOT tested
All error transitions (RATE_LIMITED, INVALID_INTENT)	❌ NOT tested
FSM timeout transitions	❌ NOT tested (no timer exists)
Concurrent intent processing (two EXIT intents same slot)	❌ NOT tested

The only FSM testing is incidental through test_flaws.py — which tests specific flaw behaviors, not FSM correctness.

Severity: Critical

S4: Generated tests use `await asyncio.sleep(0.8)` assuming fast mock venue — flaky false negatives on slow CI

Files: _gen_test.py (all generated bodies), gen2.py (all generated bodies), gen_live_tests.py (all generated bodies)

Every generated test body follows this pattern:

r = _si(k, E.ENTER, tid, sym, "LONG", p, 0.001); await asyncio.sleep(0.8)
r = _si(k, E.EXIT, tid, sym, "FLAT", p, 0.001); await asyncio.sleep(0.8)
r = _si(k, E.CANCEL, tid, sym, "FLAT", p, 0.001); await asyncio.sleep(0.8)

The 0.8 second sleep assumes the mock venue fills, cancels, and processes in <0.8s. On a loaded CI system (with virtualization, resource contention), the mock venue may take longer. The test then:

Submits EXIT on a slot still in ENTRY_WORKING, and the intent is rejected with SLOT_BUSY
Checks r.accepted (or the generated assertions) and gets False
Fails because the sleep was too short, not because the system is buggy

This is a timing-dependent false negative pattern. The mock venue processes events synchronously in subscribe(), which is called from _on_exchange_event_callback, which in turn is triggered by intent.apply(). In tests, DITAv2LauncherBundle._run() calls intent.apply(), which calls process_intent, which calls _on_exchange_event_callback. If the venue has not yet yielded the fill event from subscribe(), the slot is not updated.

The fix is to await an event condition (e.g., slot.fsm_state == POSITION_OPEN) instead of using sleep.
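
One way to express the condition wait, as a sketch. `wait_until` is a hypothetical helper, and the dictionary below stands in for a slot whose FSM state changes asynchronously.

```python
import asyncio
import time


async def wait_until(predicate, timeout: float = 5.0, interval: float = 0.01) -> None:
    """Poll `predicate` until it is true or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not reached before timeout")
        await asyncio.sleep(interval)


async def demo() -> str:
    slot = {"fsm_state": "ENTRY_WORKING"}  # stand-in for a kernel slot

    async def late_fill() -> None:
        await asyncio.sleep(0.05)
        slot["fsm_state"] = "POSITION_OPEN"

    task = asyncio.create_task(late_fill())
    await wait_until(lambda: slot["fsm_state"] == "POSITION_OPEN", timeout=2.0)
    await task
    return slot["fsm_state"]


final_state = asyncio.run(demo())
```

The test then finishes as soon as the condition holds and fails with a clear timeout when it never does, instead of always paying a fixed 0.8 s.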

Severity: Critical

S5: `bingx_venue._rate_limit_retry_after_ms()` returns 0 on any parse failure — instant retry with no backoff

File: bingx_venue.py:169-184

@staticmethod
def _rate_limit_retry_after_ms(msg: str) -> int:
    try:
        # Checks multiple response fields for retry-after hint
        ...
        m = re.search(r"unblocked after (\d+)", msg)  # regex on exchange error message
        if m: return max(0, int(float(m.group(1))))  # integer parse
        return 0  # no retry-after found → default to 0
    except Exception:  # catches ANY failure in the try block
        return 0  # returns 0 = INSTANT RETRY

If the regex fails, int() fails, or any other exception occurs, the function returns 0 — meaning "retry immediately." This defeats the purpose of rate-limit detection. Every parse failure produces a retry storm rather than a safe default (e.g., 5000 ms).

Specific failure paths:

Exchange returns a new/bilingual rate-limit message format → regex misses → returns 0
int(float(raw_retry)) on a non-numeric string → ValueError → caught → returns 0
float() on a value with locale-specific decimal (e.g., European ,) → ValueError → returns 0

Fix: Default to a safe backoff (e.g., 5000 ms) in the except block. Log the parse failure for debugging.
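
A sketch of the safer default, under two assumptions: the function is only called for responses already classified as rate-limited, and the captured number is a delay in milliseconds (as the existing code treats it). The function and logger names are hypothetical.

```python
import logging
import re

log = logging.getLogger("rate_limit_sketch")

DEFAULT_BACKOFF_MS = 5000  # safe default suggested in the fix above
_UNBLOCK_RE = re.compile(r"unblocked after (\d+)")


def retry_after_ms(msg, default_ms: int = DEFAULT_BACKOFF_MS) -> int:
    """Return the hinted delay, or a conservative default when no hint parses."""
    match = _UNBLOCK_RE.search(msg) if isinstance(msg, str) else None
    if match is None:
        # Log instead of swallowing, so a changed message format is visible.
        log.warning("no retry-after hint in %r; backing off %d ms", msg, default_ms)
        return default_ms
    return int(match.group(1))
```

The broad `except Exception` is no longer needed: `\d+` guarantees that `int()` succeeds, and non-string input is handled explicitly.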

Severity: High

S6: Venue adapter detects rate limits but enforces zero backoff — retry storm reaches exchange

File: bingx_venue.py:384-386,471

When submit() or cancel() receives a rate-limited response, the adapter:

Extracts retry_after_ms from the response ✅
Tags the event with RATE_LIMITED status and retry_after_ms ✅
Returns the event to the kernel, which marks it retryable:true ✅
Does NOT enforce the backoff delay ❌ — the caller must decide when to retry

If the caller (the algo or scheduler) ignores retry_after_ms and resubmits immediately, the adapter does not block or queue the request. The rate-limited request reaches the exchange again, potentially getting another 429, which wastes bandwidth and exchange quota.

No circuit breaker, no request queue, no automatic backoff at the venue adapter level. The adapter is purely passive — it reports rate limits but does not enforce them.

Fix: Add a _last_rate_limit_time and _last_rate_limit_delay on the adapter. If a request arrives before last_rate_limit_time + retry_after_ms, queue it or return RATE_LIMITED immediately without calling the exchange.
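
The gate described in the fix can be sketched as a small object owned by the adapter. `RateLimitGate` and its method names are hypothetical; the clock is injectable so that the behavior can be tested without sleeping.

```python
import math
import time


class RateLimitGate:
    """Remembers the latest rate-limit deadline and reports the remaining wait."""

    def __init__(self, clock=time.monotonic) -> None:
        self._clock = clock
        self._blocked_until = 0.0

    def note_rate_limit(self, retry_after_ms: int) -> None:
        # Keep the later deadline if several responses overlap.
        deadline = self._clock() + retry_after_ms / 1000.0
        self._blocked_until = max(self._blocked_until, deadline)

    def remaining_ms(self) -> int:
        return max(0, math.ceil((self._blocked_until - self._clock()) * 1000))


now = [100.0]  # fake clock, in seconds
gate = RateLimitGate(clock=lambda: now[0])
gate.note_rate_limit(5000)
wait_at_start = gate.remaining_ms()
now[0] = 103.0
wait_after_3s = gate.remaining_ms()
now[0] = 106.0
wait_after_6s = gate.remaining_ms()
```

Before calling the exchange, submit() or cancel() would check `remaining_ms()` and return a RATE_LIMITED event locally while it is nonzero.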

Severity: High

S7: `capital_epsilon = 1e-4` (0.0001 USDT) too tight for f64 precision — false WARN classifications

File: account.py:224

capital_epsilon: float = 1e-4  # 0.0001 USDT — extremely tight

At 25k USDT capital, f64 carries ~15-16 significant decimal digits. The unit in the last place (ULP) at 25k is ~3.6e-12 USDT, so a single operation rounds by at most half of that (~1.8e-12 USDT), far below 1e-4. Accrual operations (100+ PnL additions) accumulate roughly sqrt(N) × ULP ≈ 3.6e-11 USDT, still far below 1e-4.

The problem is not f64 rounding in isolation: the R1 and R2 reconcile deltas compare k.capital vs e.wallet_balance, which come from different computation paths (kernel fold vs exchange aggregation). With different rounding behaviors, the delta can exceed 1e-4 even on perfectly correct state. The abs(k.capital - e.wallet_balance) < 1e-4 test produces WARN on the third or fourth fill at typical sizes.

At $1M capital, ULP is ~1.2e-10 USDT and the per-operation rounding error ~5.8e-11 USDT, about six orders of magnitude below 1e-4. Here too the threshold is crossed by the divergence between the two computation paths, which aggregated across 100 fills can exceed 1e-4, rather than by f64 precision itself.

Fix: Increase to at least 1e-3 (0.001 USDT) or make it configurable per-asset.
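
The f64 side of the argument can be checked directly with `math.ulp` (Python 3.9+). The helper below is a hypothetical illustration; its worst-case sum assumes every operation rounds by a full half-ULP in the same direction and that capital stays within the same power-of-two range.

```python
import math

EPSILON = 1e-4  # capital_epsilon from account.py


def rounding_budget(capital: float, operations: int) -> dict:
    """Compare pure f64 rounding at `capital` against the epsilon."""
    ulp = math.ulp(capital)
    return {
        "ulp": ulp,
        "per_op_max": ulp / 2,                    # round-to-nearest bound
        "worst_case_sum": operations * ulp / 2,   # pessimistic, all same sign
        "epsilon_over_ulp": EPSILON / ulp,
    }


small = rounding_budget(25_000.0, 100)
large = rounding_budget(1_000_000.0, 100)
```

Even the pessimistic sum stays many orders of magnitude below 1e-4, which supports reading the WARNs as a disagreement between the two computation paths rather than as float noise.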

Severity: High

S8: Generated tests use module-level `asyncio.run()` — leaks pending tasks on Python 3.12+

Files: test_flaws.py (all test functions), test_exchange_event_seam_parity.py (all test functions), all generated test files

Each test function calls asyncio.run() to execute async kernel operations within a sync test:

def test_something(self):
    asyncio.run(self._run_test())  # creates event loop, runs, closes

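asyncio.run() creates a new event loop on every call, cancels whatever tasks are still pending when the main coroutine returns, and then closes the loop. A task that the test started but never awaited (e.g., a timeout that is not properly awaited) is therefore cancelled mid-flight rather than completed, and any object that stays bound to the closed loop (a session, a lock, a future cached on the kernel) fails on the next call with errors such as RuntimeError: Event loop is closed. The message RuntimeError: asyncio.run() cannot be called from a running event loop is raised when asyncio.run() is invoked while a loop is already running in the same thread, which happens as soon as one of these tests is executed from inside an async runner.

All test functions call asyncio.run() directly, so every test pays for a fresh loop and nothing verifies that the previous loop shut down cleanly before the next test starts.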

Fix: Use pytest-asyncio with @pytest.mark.asyncio and async def test_method, or add try/finally with task cancellation.

Severity: High

S9: `_build_pink_extended.py` and `_build_pink_bodies.py` use `str.replace()` patching — silently does nothing on format change

Files: _build_pink_extended.py (all), _build_pink_bodies.py (all)

Both scripts modify test_pink_bingx_dita_live_e2e.py in-place using str.replace() and str.find() index math:

content = content.replace(old_imports, new_imports)
content = content.replace(old_build, new_build)
idx = content.find(old_body)
content = content[:idx] + new_body + content[idx+len(old_body):]

If the generated file's whitespace or ordering changes (e.g., Python version updates, import sorting), str.replace() silently does nothing — the old string is not found, so no replacement occurs. The file is written back unchanged. Since this is a build-time preprocess step, there's no test that validates the patched output.

The index-based insertion is even more fragile. str.find() returns -1 when old_body is not found, so content[:idx] keeps everything except the last character and content[idx+len(old_body):] starts again near the top of the file. The output is the original file, then new_body, then most of the file a second time, with no error raised.

Fix: Parse the generated file as AST and insert/modify nodes, or use a template engine with well-defined insertion points.
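
Short of a full AST rewrite, a checked replacement already turns the silent no-op into a build failure. A minimal sketch, with `replace_once` as a hypothetical helper name:

```python
def replace_once(content: str, old: str, new: str) -> str:
    """Replace `old` with `new`, failing loudly unless it occurs exactly once."""
    count = content.count(old)
    if count != 1:
        raise ValueError(f"expected exactly one match for patch anchor, found {count}")
    return content.replace(old, new, 1)


generated = "import a\nimport b\n\ndef build():\n    pass\n"
patched = replace_once(generated, "import b\n", "import b\nimport c\n")

# The unchecked find() pattern, for contrast: a missing anchor yields idx == -1.
missing_idx = generated.find("import zzz\n")
```

The same helper covers the index-based splice, because a replacement of a unique anchor never needs manual offset arithmetic.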

Severity: High

S10: `bingx_user_stream._consume()` has no per-message timeout — silent WS hang blocks forever

File: bingx_user_stream.py:251-270

async def _consume(self, ws: aiohttp.ClientWebSocketResponse) -> AsyncIterator[dict]:
    async for msg in ws:  # no timeout on individual message read
        if msg.type == aiohttp.WSMsgType.TEXT:
            yield json.loads(msg.data)
        elif msg.type == aiohttp.WSMsgType.CLOSED:
            break

The async for msg in ws: loop blocks until the next message arrives. If the WebSocket connection silently drops (no CLOSE frame, no TCP RST), the loop blocks until the TCP keepalive timeout — which can be 2 hours on some Linux configurations.

No application-level heartbeat, no ping/pong timer, no asyncio.wait_for() wrapper. The BingX listenKey keepalive (every 30 min) is HTTP-based, not WS-based, so it doesn't detect a WS-level silence.

Fix: Wrap each message read in asyncio.wait_for(..., timeout=60) (the async for statement as a whole cannot be given a per-message timeout), or implement WS ping/pong.
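
A per-message timeout can be sketched around any async iterator, without depending on aiohttp. `iter_with_timeout` is a hypothetical helper and `stalled()` simulates a socket that goes silent after one message.

```python
import asyncio


async def iter_with_timeout(source, timeout: float):
    """Yield items from `source`, raising asyncio.TimeoutError if one takes too long."""
    iterator = source.__aiter__()
    while True:
        try:
            item = await asyncio.wait_for(iterator.__anext__(), timeout)
        except StopAsyncIteration:
            return
        yield item


async def stalled():
    yield "first"
    await asyncio.sleep(3600)  # simulates a silently dropped connection
    yield "never"


async def demo() -> list:
    seen = []
    try:
        async for msg in iter_with_timeout(stalled(), timeout=0.05):
            seen.append(msg)
    except asyncio.TimeoutError:
        seen.append("timeout")  # the caller would reconnect here
    return seen


result = asyncio.run(demo())
```

The timeout is a liveness bound, not a protocol heartbeat: it only fires when nothing at all arrives, so it should be longer than the quietest legitimate interval on the stream.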

Severity: High

S11: `bingx_venue._run()` blocks `ThreadPoolExecutor` thread with no timeout — backend HTTP hang freezes the adapter

File: bingx_venue.py:202-209

def _run(self, result: Awaitable) -> Any:
    loop = ...
    if loop is None:  # no running loop
        return asyncio.run(result)  # blocks until HTTP completes or TCP timeout
    else:
        pool = self._get_executor()
        fut = pool.submit(asyncio.run, result)
        return fut.result()  # BLOCKS FOREVER — no timeout argument

fut.result() with no timeout argument blocks until the future completes. If the HTTP call hangs (BingX server never responds, TCP half-open), the ThreadPoolExecutor thread blocks indefinitely. Since the pool has only 3 threads, 3 hung HTTP calls consume all worker threads, and all subsequent adapter operations (submit, cancel, snapshot) hang forever because no threads are available.

This is a partial-DoS on the adapter. If the BingX API becomes unresponsive, the adapter locks up completely.

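Fix: Use fut.result(timeout=30) and handle TimeoutError with a fallback event. The timeout only unblocks the caller; the worker thread stays occupied until the underlying HTTP call returns, so the HTTP client needs its own request timeout as well.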
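
A sketch of the bounded wait, using a sleeping function in place of the HTTP call. `call_with_deadline` is a hypothetical helper; returning `None` stands in for whatever fallback event the adapter would emit.

```python
import concurrent.futures
import time


def call_with_deadline(pool, fn, timeout_s: float):
    """Run `fn` on the pool, giving up on the wait after `timeout_s` seconds."""
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # cancel() only helps if the call has not started; a running call
        # keeps its worker thread until it returns on its own.
        future.cancel()
        return None


def slow_call() -> str:
    time.sleep(0.3)  # stands in for a hung HTTP request
    return "late"


with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
    timed_out = call_with_deadline(pool, slow_call, timeout_s=0.05)
    succeeded = call_with_deadline(pool, lambda: "ok", timeout_s=2.0)
```

The second call has to wait for the first worker to finish, which illustrates why a caller-side timeout alone does not restore pool capacity.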

Severity: High

S12: `bingx_venue._rate_limit_retry_after_ms()` regex depends on exchange error message format — non-portable to other exchanges

File: bingx_venue.py:176

m = re.search(r"unblocked after (\d+)", msg)

The regex looks for the English phrase "unblocked after <number>" in the exchange error message. If BingX changes their message format, localizes messages (Chinese exchange — could return Chinese text), or updates the wording, the regex silently returns 0 (caught by except Exception).

Additionally, the phrase "unblocked" is specific to BingX's rate-limit error wording. If the adapter is later extended to support other exchanges, this regex needs to be parameterized.

Fix: Prefer the numeric retry-after header field (response header Retry-After) rather than parsing the error message body.

Severity: Medium

S13: `bingx_venue._row_float()` silently skips malformed rows — missing fields produce silent continue

File: bingx_venue.py:51-56

@staticmethod
def _row_float(row: dict, keys: tuple[str, ...]) -> float:
    for k in keys:
        v = row.get(k)
        if v is not None and v != 0.0:  # also filters out 0.0 values!
            try:
                return float(v)
            except Exception:
                continue  # silently skip; next key tried
    return 0.0

Two issues:

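v != 0.0 filters out legitimate zero values. If an exchange response has a numeric "origQty": 0 for a cancelled order, _row_float skips it and tries the next key, which may be a different field with a non-matching value. A string "0" is not equal to 0.0 in Python, so it passes the check and is returned as 0.0; the two encodings of the same value therefore behave differently.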
except Exception: continue silently skips ValueError, TypeError, and any other parsing error. No log, no diagnostic. A corrupted exchange response produces 0.0 with no trace.
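
A sketch of a variant that treats zero as a value and reports what it cannot parse. The function and logger names are hypothetical, and the sketch assumes that a present zero should win over later keys; if falling through on zero is intentional for some key tuples, that intent needs to be explicit at the call site.

```python
import logging

log = logging.getLogger("row_float_sketch")


def row_float(row: dict, keys: tuple, default: float = 0.0) -> float:
    """Return the first parseable value among `keys`; zero counts as a value."""
    for key in keys:
        value = row.get(key)
        if value is None:
            continue
        try:
            return float(value)
        except (TypeError, ValueError):
            # Leave a trace so a malformed exchange row can be diagnosed.
            log.warning("unparseable %s=%r in exchange row", key, value)
    return default
```

Narrowing the handler to `TypeError` and `ValueError` keeps unrelated failures visible instead of converting them into 0.0.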

Severity: Medium

S14: `bingx_user_stream` reconnection backoff lacks jitter — thundering herd when multiple clients reconnect simultaneously

File: bingx_user_stream.py:133-138

delay_ms = min(self._reconnect_delay_ms * 2, self._reconnect_max_ms)

Pure exponential backoff with no jitter. If multiple BingxUserStream instances (for different symbols or accounts) disconnect simultaneously (e.g., BingX WS maintenance), their reconnection attempts synchronize. Each retries at exactly the same intervals, creating a thundering herd against the BingX WebSocket endpoint.

Fix: Add random jitter: delay_ms = min(base * 2, max_ms) * (0.5 + random.random()).
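
Note that the formula above can exceed max_ms, since the multiplier ranges up to 1.5. A sketch that keeps the jittered delay inside the cap, drawing from the upper half of the current window; `next_delay_ms` is a hypothetical helper name.

```python
import random


def next_delay_ms(current_ms: float, max_ms: float, rng=random) -> float:
    """Double the delay up to max_ms, then pick a point in the upper half of it."""
    ceiling = min(current_ms * 2, max_ms)
    return rng.uniform(ceiling / 2, ceiling)


rng = random.Random(0)  # seeded only to make the sample reproducible
samples = [next_delay_ms(20_000, 30_000, rng) for _ in range(1000)]
```

Keeping a nonzero lower bound preserves some backoff on every attempt, while the random spread desynchronizes clients that disconnected at the same moment.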

Severity: Medium

S15: `_venue_event_status_from_row()` falls back to ACKED for unrecognized statuses — masks new rejection types

File: bingx_venue.py:85-101

@staticmethod
def _venue_event_status_from_row(row: dict) -> VenueEventStatus:
    status = (row.get("status") or "").strip().upper()
    if status == "NEW": return VenueEventStatus.ACKED
    elif status == "CANCELED": return VenueEventStatus.CANCELED
    elif status == "FILLED": return VenueEventStatus.FILLED
    elif status == "PARTIALLY_FILLED": return VenueEventStatus.PARTIALLY_FILLED
    elif status == "REJECTED": return VenueEventStatus.REJECTED
    elif status in ("EXPIRED", "EXPIRED") : return VenueEventStatus.EXPIRED
    else:
        return VenueEventStatus.ACKED  # fallback — unknown → ACKED (dangerous!)

If BingX introduces a new status (e.g., "DEACTIVATED", "PENDING_CANCEL", "SUSPENDED"), it maps to ACKED — which the kernel interprets as "order acknowledged by exchange and working." This could cause the kernel to believe an order is active when it's actually suspended, leading to:

No cancel sent (kernel thinks order is working and waiting for fill)
Premature exit intent submission (order not actually active)
Incorrect slot FSM state

The fallback should be REJECTED (conservative — assume the worst) or should log a warning and escalate.

Fix: Change fallback to VenueEventStatus.REJECTED or log an error for unknown statuses.
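
A table-driven sketch of the conservative fallback. The enum below is a stand-in that mirrors only the members named in this section; the real `VenueEventStatus` lives in the codebase and may differ.

```python
import logging
from enum import Enum

log = logging.getLogger("venue_status_sketch")


class VenueEventStatus(Enum):  # stand-in for the real enum
    ACKED = "ACKED"
    CANCELED = "CANCELED"
    FILLED = "FILLED"
    PARTIALLY_FILLED = "PARTIALLY_FILLED"
    REJECTED = "REJECTED"
    EXPIRED = "EXPIRED"


_STATUS_MAP = {
    "NEW": VenueEventStatus.ACKED,
    "CANCELED": VenueEventStatus.CANCELED,
    "FILLED": VenueEventStatus.FILLED,
    "PARTIALLY_FILLED": VenueEventStatus.PARTIALLY_FILLED,
    "REJECTED": VenueEventStatus.REJECTED,
    "EXPIRED": VenueEventStatus.EXPIRED,
}


def status_from_row(row: dict) -> VenueEventStatus:
    raw = str(row.get("status") or "").strip().upper()
    try:
        return _STATUS_MAP[raw]
    except KeyError:
        # Unknown status: report it and assume the order is not working.
        log.error("unrecognized venue status %r; treating as REJECTED", raw)
        return VenueEventStatus.REJECTED
```

A lookup table also makes the set of recognized statuses reviewable in one place, which the if/elif chain does not.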

Severity: Medium

S16: `gen2.py` generates `except: pass` in test code — swallows KeyboardInterrupt and SystemExit

File: gen2.py:335 (embedded in generated test template)

try:
    bundle.close()
except:
    pass  # bare except — catches KeyboardInterrupt, SystemExit

The generated test files contain bare except: pass blocks in cleanup code. This catches KeyboardInterrupt and SystemExit, so a Ctrl+C that arrives during cleanup is swallowed and the test suite keeps running. In the worst case the process must be killed with SIGKILL.

Same pattern in _build_pink_extended.py templates and other generated test builders.

Fix: Use except Exception: in generated code.

Severity: Medium

Pass 16 Summary

#	Flaw	Layer	Severity
S1	`realized_pnl()`/`mark_price()` NaN bypasses `<=0.0` guard — NaN PnL corrupts k_realized_pnl	Rust/Python	Critical
S2	MockVenue `_exchange_event_queue` check-then-act race — silently drops events	Test	Critical
S3	No `test_kernel_fsm.py` or `test_kernel_fsm_recovery.py` exists	Test	Critical
S4	Generated tests use `await asyncio.sleep(0.8)` — flaky false negatives on slow CI	Test	Critical
S5	`_rate_limit_retry_after_ms()` returns 0 on parse failure — instant retry storm	Venue	High
S6	Venue adapter detects rate limits but enforces zero backoff	Venue	High
S7	`capital_epsilon = 1e-4` too tight — false WARN classifications	Accounting	High
S8	Generated tests use module-level `asyncio.run()` — leaks tasks on Python 3.12+	Test	High
S9	`str.replace()` patching silently does nothing on format change	Build	High
S10	`_consume()` no per-message WS timeout — silent hang blocks forever	Venue	High
S11	`_run()` blocks pool thread with no timeout — 3 hung calls lock adapter	Venue	High
S12	Rate-limit regex depends on exchange message format — non-portable	Venue	Medium
S13	`_row_float()` silently skips malformed rows, filters zero values	Venue	Medium
S14	Reconnection backoff lacks jitter — thundering herd risk	Venue	Medium
S15	`_venue_event_status_from_row()` falls back to ACKED — masks new rejections	Venue	Medium
S16	`except: pass` in generated test code — swallows KeyboardInterrupt	Test	Medium

Pass 16 Severity

Severity	Count
Critical	4 (S1, S2, S3, S4)
High	7 (S5, S6, S7, S8, S9, S10, S11)
Medium	5 (S12, S13, S14, S15, S16)

Combined Catalog (All 16 Passes)

Pass	Focus	Count	Critical	High	Medium	Low	Info
A	Architectural	15	0	2	0	2	11
T	Threading/Atomicity	9	1	3	3	2	0
E	E2E Trace (Pass 1)	26	0	4	10	11	1
F	Deep E2E (Pass 3)	30	0	1	8	17	4
G	Domain Scans (Pass 4)	36	4	11	11	8	2
H	Edge Domains (Pass 5)	22	3	9	5	4	1
I	Pass 6 (Math/Tests/Recovery/Security)	22	3	11	4	2	2
J	Pass 7 (Test Infra/Data/Rust/Env/Conn)	16	0	7	7	2	0
K	Pass 8 (Observability/Memory/Time/DeadCode)	23	2	7	7	1	6
L	Pass 9 (Contracts/Events/Network/FFI/Diffs)	16	0	4	8	4	0
M	Pass 10 (Runtime/TestBugs/FSM/Persistence/Metrics)	18	3	7	5	3	0
N	Pass 11 (Async/Sync Seams/Locks/Threading)	10	4	1	3	1	1
O	Pass 12 (Sync/Async Wider Scope)	11	0	3	7	1	0
P	Pass 13 (FFI Safety/Dangling Pointers/Coverage)	9	1	3	3	1	1
Q	Pass 14 (Serde Edges/Backup Diffs/Market Data)	12	0	4	3	2	3
R	Pass 15 (Resource Leaks/Trust Boundaries/Security)	14	2	7	3	2	0
S	Pass 16 (Error Handling/Arithmetic/Test Infra)	16	4	7	5	0	0
Total		305	27	91	92	63	32

PASS 17 — UNSAFE REVIEW, DEAD CODE/BACKUP DEBRIS, BUILD/PLANE PROTOCOLS

T1: `catch_unwind` + `AssertUnwindSafe` on `&mut KernelCore` — partially mutated heap state persists after caught panic, no rollback

File: _rust_kernel/src/lib.rs:2057-2071

fn with_handle_mut<F, R>(handle: *mut KernelHandle, f: F) -> Result<R, String>
where F: FnOnce(&mut KernelCore) -> Result<R, String>,
{
    let core = unsafe { &mut (*handle).core };
    match std::panic::catch_unwind(std::panic::AssertUnwindSafe(|| f(core))) {
        Ok(result) => result,
        Err(panic_payload) => {
            let msg = ...;
            eprintln!("[KERNEL PANIC caught at FFI boundary] {msg}");
            Err(msg)  // Partially mutated KernelCore still live in heap Box<KernelHandle>
        }
    }
}

catch_unwind prevents Rust panics from unwinding across the FFI boundary (which would be UB). But the KernelCore behind the raw pointer is mutated in-place on the heap. When a panic occurs mid-mutation:

f(core) calls some kernel function like process_intent() or apply_fill()
The function panics partway through — e.g., k_realized_pnl was incremented but event_seq was not bumped; slots[i] was replaced but rebuild_indexes() was not called
catch_unwind catches the panic, returns Err(msg) to the Python caller
The KernelCore on the heap retains the partially applied state
The next FFI call operates on this corrupted state — k_capital = seed + realized_pnl - fees_paid is computed with mismatched values
The code comment acknowledges this: "the slot/account mutation that panicked may be partially applied"
The mitigations (reconcile WARN/ERROR → capital frozen) only work if the corruption is detectable — if the panic corrupts slot.seen_event_ids such that dedup fails, duplicate fills can process

AssertUnwindSafe on &mut KernelCore is sound for memory safety (after panic, the reference is still valid, just the data is inconsistent — no use-after-free, no double-free). But it is logically unsound — data invariants are violated, and the recovery path relies on a downstream reconcile to detect the issue, which may not catch all corruption patterns.

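Trigger paths: Any panic inside process_intent(), on_venue_event(), reconcile_slots(), apply_fill(), or save_full_snapshot() while mutating KernelCore. A panic inside HashMap::insert() (rare: for example from a panicking Hash or Eq implementation or a capacity overflow; allocation failure normally aborts the process instead of unwinding) would leave the map in an unspecified, though memory-safe, state.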

Severity: High

T2: Empty backup directory `_backup_20260530_105512/` and stale `tea_debug.log` (0 bytes)

Files: _backup_20260530_105512/ (empty directory), tea_debug.log (0 bytes)

_backup_20260530_105512/ is a completely empty directory — zero files. Its sibling _backup_20260530/ contains 22 source files and a rust_kernel_src/ subdirectory. The _105512 variant was created during an earlier backup attempt but never populated.

tea_debug.log is a 0-byte empty file in the workspace root. No code writes to it. It's a stale artifact — likely a log file that was opened but never written to, or a debugging aid that was never used.

Both should be deleted to avoid confusion.

Severity: Low

T3: `HazelcastRowWriter.__call__` uses bare `json.dumps(row, default=str)` — Enums and datetimes serialize as Python `str()` representations

File: hazelcast_projection.py:60-63

def __call__(self, name: str, row: dict[str, Any]) -> None:
    if name.endswith("trade_events"):
        self.client.get_topic(name).publish(
            json.dumps(row, ensure_ascii=False, sort_keys=True, default=str)
        )

The default=str fallback serializes Enum values as "TradeSide.SHORT" (Python's str() format) instead of "SHORT" (the .value). Datetimes become "2026-01-01 00:00:00" (Python str() format) instead of "2026-01-01T00:00:00+00:00" (ISO 8601). Downstream Hazelcast consumers expecting standard formats get unexpected strings.

Compare with HazelcastProjector.publish_event() (line 38) which correctly uses json_safe(payload) before json.dumps():

self.writer(self.trade_events_topic, json_safe(row))  # uses json_safe() first

The inconsistency: HazelcastProjector correctly serializes via json_safe(), but HazelcastRowWriter.__call__ (used directly elsewhere) does not. Any code path that calls HazelcastRowWriter directly — rather than through HazelcastProjector — produces malformed output.
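
The difference is easy to reproduce. In this sketch `TradeSide` is a plain `Enum` stand-in and `json_default` is a hypothetical replacement for `default=str`; the project's own `json_safe()` may handle more types than this.

```python
import json
from datetime import datetime, timezone
from enum import Enum


class TradeSide(Enum):  # stand-in; assumes a plain Enum without a str mixin
    SHORT = "SHORT"


def json_default(obj):
    """Serialize Enums by value and datetimes as ISO 8601; reject anything else."""
    if isinstance(obj, Enum):
        return obj.value
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"not JSON serializable: {type(obj).__name__}")


row = {"side": TradeSide.SHORT, "ts": datetime(2026, 1, 1, tzinfo=timezone.utc)}
loose = json.loads(json.dumps(row, default=str))
strict = json.loads(json.dumps(row, default=json_default))
```

Raising `TypeError` for unknown types keeps unexpected objects from being published as arbitrary `str()` output.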

Severity: High

T4: `real_zinc_plane._slot_from_payload()` uses `payload["last_event_time"]` direct key access — crashes with `KeyError` if key missing

File: real_zinc_plane.py:116,133

entry_time=datetime.fromisoformat(payload["entry_time"]) if payload.get("entry_time") else None,
# ... yet at line 133:
last_event_time=datetime.fromisoformat(payload["last_event_time"])  # NO .get() guard!

Line 116 uses payload.get("entry_time") — correct. Line 133 uses payload["last_event_time"] — missing .get(), crashes with KeyError if the key is absent.

Compare with rust_backend.py:396-402 (the equivalent function):

entry_time=datetime.fromisoformat(payload["entry_time"]) if payload.get("entry_time") else None,
last_event_time=datetime.fromisoformat(payload["last_event_time"]) if payload.get("last_event_time") else None,

Both fields use .get() in rust_backend.py. The real_zinc_plane.py version has a copy-paste error where the guard on last_event_time was omitted. If any slot is deserialized via the shared memory path (RealZincPlane) and lacks a last_event_time (e.g., a fresh slot that hasn't received a venue event yet), this crashes.
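
Routing both fields through one helper removes the chance of guarding one and forgetting the other. A minimal sketch, with `opt_datetime` as a hypothetical name:

```python
from datetime import datetime
from typing import Optional


def opt_datetime(payload: dict, key: str) -> Optional[datetime]:
    """Parse an optional ISO 8601 field; a missing or empty value yields None."""
    raw = payload.get(key)
    return datetime.fromisoformat(raw) if raw else None


fresh_slot = {"entry_time": "2026-01-01T00:00:00+00:00"}  # no last_event_time yet
entry_time = opt_datetime(fresh_slot, "entry_time")
last_event_time = opt_datetime(fresh_slot, "last_event_time")
```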

Severity: High

T5: `_build_pink_bodies.py` uses `str.index("]")` to find SCENARIOS list close bracket — corrupts list if any entry contains `]`

File: _build_pink_bodies.py:214

close_bracket = with_bodies.index("]", scenarios_open)
final = with_bodies[:close_bracket] + "\n" + param_block + "\n" + with_bodies[close_bracket:]

str.index("]") finds the first ] character after scenarios_open. If any SCENARIOS entry contains a ] inside a string literal or a nested structure (e.g., a format string, an index expression such as x[0], or a nested list), the split lands inside the entry, truncating it and injecting the new param_block mid-entry.

The resulting file is syntactically incorrect only if the truncation produces unparseable code. If it happens to produce valid (but semantically wrong) code, the build succeeds with silently corrupted test data.

Fix: Use ast module to parse the list, or count bracket depth.
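
A sketch of the ast-based lookup, assuming SCENARIOS is a plain top-level assignment (annotated assignments are not handled here). `list_close_offset` is a hypothetical helper that returns the offset of the bracket closing the list itself.

```python
import ast


def list_close_offset(source: str, name: str) -> int:
    """Offset of the ']' that closes the top-level list assigned to `name`."""
    for node in ast.parse(source).body:
        if not (isinstance(node, ast.Assign) and isinstance(node.value, ast.List)):
            continue
        if not any(isinstance(t, ast.Name) and t.id == name for t in node.targets):
            continue
        lines = source.splitlines(keepends=True)
        last = lines[node.value.end_lineno - 1]
        # end_col_offset counts UTF-8 bytes and points just past the ']'.
        col = len(last.encode("utf-8")[: node.value.end_col_offset].decode("utf-8"))
        return sum(len(line) for line in lines[: node.value.end_lineno - 1]) + col - 1
    raise ValueError(f"no top-level list named {name!r}")


sample = 'SCENARIOS = [\n    ("a", "x[0]"),\n    ("b", [1, 2]),\n]\nOTHER = 1\n'
naive = sample.index("]", sample.index("SCENARIOS"))  # lands inside "x[0]"
exact = list_close_offset(sample, "SCENARIOS")
```

Parsing also fails loudly on a syntactically broken input file, which the string search never does.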

Severity: High

T6: `VenueAdapter` protocol missing `connect()`/`disconnect()` — `AttributeError` at runtime

File: venue.py (protocol), _build_pink_extended.py:31-32 (caller)

# _build_pink_extended.py — Shim class:
async def connect(self, initial_capital=0):
    self.kernel.venue.connect()  # assumes VenueAdapter has connect()

async def disconnect(self):
    try:
        self.kernel.venue.disconnect()  # assumes VenueAdapter has disconnect()
    except:
        pass

VenueAdapter (defined in venue.py as a Protocol) defines submit(), cancel(), snapshot(), subscribe(), open_positions(), and reconcile() — but not connect() or disconnect().

MockVenueAdapter has both methods (mock_venue.py:160-166). BingxVenueAdapter does not have them — calling connect() on a BingxVenueAdapter raises AttributeError.

The Shim class in _build_pink_extended.py is used for live-test infrastructure. If a live test runs with a venue that lacks connect()/disconnect(), the AttributeError from disconnect() is swallowed by the bare except: pass, but the one from connect() propagates uncaught.

Fix: Add connect()/disconnect() to the VenueAdapter protocol, or add them as no-ops on BingxVenueAdapter.
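
Both options can be sketched with `typing.Protocol`. All class names below are hypothetical stand-ins; the real `VenueAdapter` protocol carries more methods than the two shown.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class VenueLifecycle(Protocol):
    """Lifecycle half of the adapter contract (illustrative only)."""

    def connect(self) -> None: ...
    def disconnect(self) -> None: ...


class NoOpLifecycle:
    """Mixin for adapters that need no explicit connection step."""

    def connect(self) -> None:
        return None

    def disconnect(self) -> None:
        return None


class RestOnlyAdapter(NoOpLifecycle):  # stand-in for an adapter like BingxVenueAdapter
    pass


class BareAdapter:  # adapter that forgot the lifecycle methods
    pass


conforms = isinstance(RestOnlyAdapter(), VenueLifecycle)
missing = isinstance(BareAdapter(), VenueLifecycle)
```

With the methods in the protocol, a static type checker flags a non-conforming adapter before any live test reaches connect().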

Severity: High

T7: `real_control_plane.py` and `real_zinc_plane.py` shared memory writes are non-atomic — reader sees partial state

Files: real_control_plane.py:110-114, real_zinc_plane.py:252-253

# real_control_plane.py _write_region:
view[:len(packet)] = packet                      # writes new packet
if len(view) > len(packet):
    view[len(packet):] = b"\x00" * (len(view) - len(packet))  # zeroes tail

# real_zinc_plane.py _write_region:
view[:] = b"\x00" * len(view)                    # full zero (visible-zero window)
view[:len(packet)] = packet                       # writes packet

Both implementations write the shared memory buffer in multiple non-atomic operations. A reader process that reads between these operations sees:

real_control_plane.py: The new header with stale tail from a previous larger packet → _decode_packet() may return stale data or parse failure
real_zinc_plane.py: All zeros → _decode_packet() returns {} (empty dict) or parse failure

The visible-zero window in real_zinc_plane.py is particularly dangerous — if a reader reads the zeroed buffer, all slot states appear empty, which could trigger a spurious reconcile or incorrect position tracking.

Fix: Either:

Write the packet in a single slice assignment where the shared memory size supports it (this narrows the window, but a multi-byte copy is still not atomic with respect to a concurrent reader)
Use a sequence number in the header that the reader validates (sequence odd while writing, even when complete)
Use an explicit "writing" flag byte set before and cleared after the write
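
The sequence-number option can be sketched as follows. This is a minimal single-writer illustration over a plain bytearray, not the project's _write_region: the 8-byte header layout and the names write_packet and read_packet are assumptions made for the example, and real shared memory would additionally need whatever store-ordering guarantees the platform requires.

```python
import struct

HEADER = struct.Struct("<II")  # (sequence, payload length): assumed layout
MASK = 0xFFFFFFFF              # keep the sequence inside 32 bits


def write_packet(buf: bytearray, packet: bytes) -> None:
    """Single-writer update: sequence is odd while writing, even when complete."""
    if HEADER.size + len(packet) > len(buf):
        raise ValueError("packet does not fit in region")
    seq, _ = HEADER.unpack_from(buf, 0)
    HEADER.pack_into(buf, 0, (seq + 1) & MASK, 0)            # odd: in progress
    buf[HEADER.size:HEADER.size + len(packet)] = packet
    HEADER.pack_into(buf, 0, (seq + 2) & MASK, len(packet))  # even: complete


def read_packet(buf: bytearray, retries: int = 100):
    """Return a consistent payload, or None if no stable snapshot was seen."""
    for _ in range(retries):
        seq_before, length = HEADER.unpack_from(buf, 0)
        if seq_before % 2:                 # writer is mid-update, try again
            continue
        payload = bytes(buf[HEADER.size:HEADER.size + length])
        seq_after, _ = HEADER.unpack_from(buf, 0)
        if seq_before == seq_after:        # header unchanged while copying
            return payload
    return None
```

Because the length lives in the header, the tail never needs zeroing, which removes both the stale-tail and the all-zero window. A reader that gets None should keep its previous state rather than treat the region as empty.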

Severity: High

T8: `real_zinc_plane._slot_from_payload()` reconstructs `internal_trade_id` from slot's `trade_id` instead of order's own — data loss on round-trip

File: real_zinc_plane.py:92,106

active_entry_order = VenueOrder(
    internal_trade_id=str(payload.get("trade_id", "")),  # uses SLOT's trade_id
    ...
)

TradeSlot.to_dict() serializes the order's own internal_trade_id inside the "active_entry_order" sub-dict. But _slot_from_payload() ignores the per-order value and uses the slot-level trade_id instead.

If a slot has multiple orders (e.g., an entry order with trade_id="abc" and an exit order with trade_id="def"), the slot-level trade_id is the current trade's ID — which may match one of the orders. But after a CANCEL_ACK that clears the entry order, the slot trade_id may be empty or changed. The reconstructed order always gets the slot's trade_id, losing the distinction between entry-order and exit-order trade IDs.

This section describes the shared-memory round-trip (RealZincPlane). The FFI path (rust_backend.py) has the same overwrite in _order_from_payload(); see U12.

Severity: Medium

T9: `_slot_from_payload()` duplicated between `real_zinc_plane.py` and `rust_backend.py` — double maintenance burden

Files: real_zinc_plane.py:83-138, rust_backend.py:379-402

The slot deserialization function _slot_from_payload() (or equivalent inline code) exists in two separate files with nearly identical logic. The real_zinc_plane.py version is a 55-line function; the rust_backend.py version is inline in _slot_from_payload().

Both deserialize TradeSlot from the same to_dict() output format. Any schema change (field added, removed, renamed, or type-changed) must be updated in both places. T4 (missing .get() on last_event_time) is a direct consequence of this duplication: the bug exists in one copy but not the other. T8 (internal_trade_id from the wrong source) shows the other cost: the same defect has to be found and fixed twice (U12 records it on the FFI path).

Fix: Extract shared _slot_from_payload() into contracts.py (or utils.py).

Severity: Medium

T10: `_build_pink_extended.py` string index math finds first `finally:` — could match nested `try/finally` inside function body

File: _build_pink_extended.py:117-119

idx = content.index(old_run_pat)
run_end = content.index("    finally:", idx)  # finds FIRST "finally:" — could be nested!
run_end = content.index("\n\n", run_end) + 2  # boundary detection for function end

The search for "    finally:" (four leading spaces) finds the first occurrence after idx, and a more deeply indented finally: also contains that substring. If the _run() function body contains a nested try/finally block, if a helper such as _si() or _verify() with its own try/finally sits between idx and the intended block, or if the text "finally:" appears in a string or comment, the index points to the wrong location. The "\n\n" search then terminates inside the function body, producing a truncated replacement and syntactically broken output.

The generated test_pink_bingx_dita_live_e2e.py is patched with index math that has no validation. A malformed patch silently produces a non-functional test file (syntax error only caught at test import time).

Fix: Parse the function boundaries using the ast module, or use a well-defined sentinel comment (e.g., # END _run) as the anchor point.
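
A minimal sketch of the ast-based variant, assuming the target is a top-level function and Python 3.9 or later. The helper name function_span is hypothetical and not part of the build scripts.

```python
import ast


def function_span(source: str, name: str) -> tuple[int, int]:
    """Return (start, end) character offsets of the top-level function `name`."""
    tree = ast.parse(source)
    lines = source.splitlines(keepends=True)
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == name:
            # lineno/end_lineno are 1-based and inclusive
            start = sum(len(line) for line in lines[:node.lineno - 1])
            end = sum(len(line) for line in lines[:node.end_lineno])
            return start, end
    raise LookupError(f"function {name!r} not found")
```

The patcher can then write source[:start] + replacement + source[end:] and run ast.parse() on the result before saving, so a malformed patch fails at generation time instead of at test import time.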

Severity: Medium

T11: No workspace-root `.gitignore` — `__pycache__`, backup dirs, context files, build artifacts untracked

File: (missing — should be dita_v2/.gitignore)

The only .gitignore in the workspace is inside _rust_kernel/ (covers /target). There is no .gitignore at the workspace root (dita_v2/). This means:

__pycache__/ directories (29 .pyc files present) are ignored only if the developer's global git config happens to exclude them
_backup_20260530/ is visible to git (whether its 22 source files are already indexed was not verified)
_backup_20260530_105512/ is empty, so git itself does not list it, but it still clutters the working tree
Codex_CONTEXT_RESTORE__*.txt context files are visible
tea_debug.log is visible
Any .pyc files that end up in the index cause merge conflicts

git status reports 2004 untracked (??) files; many of these would be excluded by a proper .gitignore.

Severity: Low

T12: `projection.py` lazy import failure silently swallowed — caller gets `writer=None` with no diagnostic

File: projection.py:75-77

try:
    from .hazelcast_projection import HazelcastRowWriter
    writer = HazelcastRowWriter(client)
except Exception:  # catches import errors, constructor errors, everything
    writer = None

If the hazelcast_projection module has a syntax error, HazelcastRowWriter doesn't exist, or the constructor raises, the exception is silently swallowed. The caller gets a HazelcastProjection with writer=None. The write_transition() and write_control() methods check if not self.writer: and silently return — so all Hazelcast writes are silently dropped with no log, no error, no diagnostic.

The "Hazelcast unavailable — fallback active" log message is only printed for the first import attempt. If the module is later fixed (e.g., a missing dependency is installed), the stale writer=None persists because the import is not retried.
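
One way to keep the fallback while making the failure visible is to separate "module missing" from "constructor failed" and log both. This is a sketch only: load_writer is a hypothetical helper, and it takes absolute module and class names as parameters, whereas projection.py uses a relative import.

```python
import importlib
import logging

log = logging.getLogger("projection")


def load_writer(client, module_name="hazelcast_projection", class_name="HazelcastRowWriter"):
    """Return a writer instance, or None with the reason logged."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        # Optional dependency missing: expected fallback, but say so.
        log.warning("Hazelcast writer unavailable (%s); writes will be dropped", exc)
        return None
    try:
        return getattr(module, class_name)(client)
    except Exception:
        # Import worked but lookup or construction failed: keep the traceback.
        log.exception("failed to construct %s", class_name)
        return None
```

A SyntaxError in the module is not an ImportError, so in this sketch it propagates to the caller instead of being converted to writer=None. Calling load_writer() again later also gives a natural retry point once a missing dependency has been installed.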

Severity: Medium

T13: `Codex_CONTEXT_RESTORE__*.txt` and other AI context files in workspace root — debris

Files: Codex_CONTEXT_RESTORE__2026-06-02-130508-*.txt, other .md analysis documents

The workspace root contains AI-assistant context restore files and 6+ Markdown flaw analysis documents (PINK_DITAv2_E2E_TRACE_ANALYSIS.md, PINK_DITAv2_FLAW_ANALYSIS_2026-05-31.md, PINK_DITAv2_THREADING_ATOMICITY.md, etc.). These are analysis artifacts, not source code.

While the flaw documents are intentional project records, the Codex_CONTEXT_RESTORE__*.txt files are ephemeral AI context dumps that should not be in version control. They contain session state information that is meaningless outside the AI session.

Severity: Low

T14: `_backup_20260530/` contains 22 live source files — risk of stale import confusion

File: _backup_20260530/ (22 Python files including rust_backend.py, launcher.py, bingx_venue.py, etc.)

The backup directory contains full copies of all Python source files from May 30. If a developer runs import from within the dita_v2 directory, the backup directory's __init__.py makes it a valid Python package. An accidental from _backup_20260530 import rust_backend would load the old code instead of the current implementation — silently, with no warning.

The backup rust_backend.py lacks the Rust FFI integration, has no _first_invalid_intent_field(), and uses the old ExecutionKernel class. Accidentally importing from the backup would produce hard-to-diagnose errors (missing methods, wrong behavior).

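Fix: Move the backup directories out of the import path, or rename them so that they are not valid Python identifiers (e.g., backup-20260530; dropping the leading underscore alone leaves the name importable), or replace the backup's __init__.py with one that raises ImportError with a clear message.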

Severity: Medium

Pass 17 Summary

#	Flaw	Layer	Severity
T1	`catch_unwind` + `AssertUnwindSafe` — partially mutated state persists, no rollback	Rust	High
T2	Empty backup dir `_backup_20260530_105512/` and stale `tea_debug.log`	Repo	Low
T3	`HazelcastRowWriter` uses bare `json.dumps(row, default=str)` — Enums/datetimes wrong format	Bridge	High
T4	`real_zinc_plane._slot_from_payload()` direct key access `payload["last_event_time"]` — `KeyError` crash	Plane	High
T5	`_build_pink_bodies.py` `str.index("]")` finds first `]` — corrupts SCENARIOS list	Build	High
T6	`VenueAdapter` protocol missing `connect()`/`disconnect()` — `AttributeError` at runtime	Venue	High
T7	Shared memory writes non-atomic — visible-zero window, stale tail exposes partial state	Plane	High
T8	`_slot_from_payload()` reconstructs `internal_trade_id` from slot's trade_id — order-level data loss	Plane	Medium
T9	`_slot_from_payload()` duplicated in two files — double maintenance burden, schema drift risk	Plane	Medium
T10	`_build_pink_extended.py` `str.index("finally:")` finds first match — nested try/finally mismatch	Build	Medium
T11	No workspace-root `.gitignore` — `__pycache__`, backup dirs, debris untracked	Repo	Low
T12	`projection.py` lazy import failure silently swallowed — `writer=None` drops all Hazelcast writes	Bridge	Medium
T13	`Codex_CONTEXT_RESTORE__*.txt` AI context files in workspace root — debris	Repo	Low
T14	`_backup_20260530/` is a valid Python package — accidental old-code import risk	Repo	Medium

Pass 17 Severity

Severity	Count
High	6 (T1, T3, T4, T5, T6, T7)
Medium	5 (T8, T9, T10, T12, T14)
Low	3 (T2, T11, T13)

Combined Catalog (All 17 Passes)

Pass	Focus	Count	Critical	High	Medium	Low	Info
A	Architectural	15	0	2	0	2	11
T	Threading/Atomicity	9	1	3	3	2	0
E	E2E Trace (Pass 1)	26	0	4	10	11	1
F	Deep E2E (Pass 3)	30	0	1	8	17	4
G	Domain Scans (Pass 4)	36	4	11	11	8	2
H	Edge Domains (Pass 5)	22	3	9	5	4	1
I	Pass 6 (Math/Tests/Recovery/Security)	22	3	11	4	2	2
J	Pass 7 (Test Infra/Data/Rust/Env/Conn)	16	0	7	7	2	0
K	Pass 8 (Observability/Memory/Time/DeadCode)	23	2	7	7	1	6
L	Pass 9 (Contracts/Events/Network/FFI/Diffs)	16	0	4	8	4	0
M	Pass 10 (Runtime/TestBugs/FSM/Persistence/Metrics)	18	3	7	5	3	0
N	Pass 11 (Async/Sync Seams/Locks/Threading)	10	4	1	3	1	1
O	Pass 12 (Sync/Async Wider Scope)	11	0	3	7	1	0
P	Pass 13 (FFI Safety/Dangling Pointers/Coverage)	9	1	3	3	1	1
Q	Pass 14 (Serde Edges/Backup Diffs/Market Data)	12	0	4	3	2	3
R	Pass 15 (Resource Leaks/Trust Boundaries/Security)	14	2	6	3	2	1
S	Pass 16 (Error Handling/Arithmetic/Test Infra)	16	4	7	5	0	0
T	Pass 17 (Unsafe Review/Dead Code/Build/Protocols)	14	0	6	5	3	0
Total		319	27	96	97	66	33

PASS 18 — RUST TEST GAPS, ACCOUNTING RECONCILIATION BUGS, FFI TYPE MISMATCHES

U1: Rust `KernelIntent` has no `order_type` or `limit_price` fields — Python sends them, serde silently drops them

Files: rust_backend.py:375-377 (Python sends), _rust_kernel/src/lib.rs:439-456 (Rust receives)

Python's _intent_to_payload() serializes two fields that the Rust KernelIntent struct does not have:

# Python sends (rust_backend.py:375-377):
"order_type": getattr(intent, "order_type", "MARKET"),
"limit_price": float(getattr(intent, "limit_price", 0.0) or 0.0),

// Rust receives (lib.rs:439-456) — no order_type or limit_price fields:
struct KernelIntent {
    timestamp: DateTime<Utc>,
    intent_id: String,
    trade_id: String,
    slot_id: i64,
    asset: String,
    side: TradeSide,
    action: KernelCommandType,
    reference_price: f64,
    target_size: f64,
    leverage: f64,
    exit_leg_ratios: Vec<f64>,
    reason: String,
    metadata: Map<String, Value>,
    stage: TradeStage,
    // order_type — MISSING
    // limit_price — MISSING
}

Serde's default behavior ignores unknown fields during deserialization. Both order_type and limit_price are transmitted across the FFI boundary and silently thrown away. Any downstream logic in Rust that depends on these fields is dead code. The Python KernelIntent dataclass declares them with defaults ("MARKET", 0.0), and they're used in Python-side _first_invalid_intent_field() (which checks limit_price for NaN), but the Rust kernel never sees them.

Impact: If the Rust kernel were to use order_type to distinguish MARKET from LIMIT orders (which would be needed for realistic exchange interaction), the field exists in Python but is never delivered. This was clearly designed to be added to Rust but the addition was never completed.

Severity: High

U2: Rust `VenueEventStatus` deserializer expects `"CANCEL_REJECTED"` — Python sends `"CANCELED_REJECTED"`, deserialization fails

Files: _rust_kernel/src/lib.rs:269-278 (Rust deserializer), Python contracts.py VenueEventStatus enum

The Rust VenueEventStatus custom deserializer has a typo in one of its string literals:

// Rust deserializer (lib.rs:269-278):
"CANCEL_REJECTED" => Ok(VenueEventStatus::CANCELED_REJECTED),
// literal is missing "ED" after CANCEL (typo)

The string literal "CANCEL_REJECTED" is missing the "ED" after "CANCEL". The enum variant is correctly named CANCELED_REJECTED, and Python's VenueEventStatus.CANCELED_REJECTED.value produces "CANCELED_REJECTED".

When Python sends a VenueEvent with status=CANCELED_REJECTED, the JSON contains "status": "CANCELED_REJECTED". Rust's deserializer tries to match "CANCELED_REJECTED" against the string "CANCEL_REJECTED" — which fails. Serde returns an error: "invalid VenueEventStatus: CANCELED_REJECTED". The on_venue_event() returns an error diagnostic.

Impact: Any venue event with status CANCELED_REJECTED fails deserialization on the Rust side. The event is rejected with INVALID_EVENT_PARSE instead of being processed normally, so cancel-rejected events (important FSM signals that tell the kernel the exchange refused a cancel) never drive an FSM transition; the only trace is the error diagnostic.

Note on usage: The mock venue does not produce CANCELED_REJECTED events in the current test suite. The bingx live venue adapter (bingx_venue.py) maps exchange cancel responses but may use a different status string. This bug is dormant until a live exchange returns a cancel-rejected status, at which point the venue event is dropped and only an error diagnostic remains.

Severity: High

U3: R2 reconciliation compares cumulative `k.realized_pnl` against single-last-fill `e.last_fill_realized_pnl` — broken after 2+ fills

File: account.py:459-473

# K side (line 296):
self._k_realized += safe_float(realized_pnl, 0.0)  # ACCUMULATES all fills

# E side (line 299):
self._e_last_fill_realized_pnl = e_fill_realized_pnl  # ONLY the LAST fill

# R2 comparison (line 460):
delta_r2 = abs(k.realized_pnl - e.last_fill_realized_pnl)

After the first fill:

k.realized_pnl = fill_1_pnl (e.g., 10.0)
e.last_fill_realized_pnl = fill_1_pnl (10.0)
delta_r2 = 0 ✅ OK

After the second fill:

k.realized_pnl = fill_1_pnl + fill_2_pnl (e.g., 10.0 + 15.0 = 25.0)
e.last_fill_realized_pnl = fill_2_pnl (15.0 — only the last fill)
delta_r2 = |25.0 - 15.0| = 10.0
With realized_rounding = 0.05, 10.0 > 0.05 → ERROR

After every fill beyond the first, R2 fires ERROR whenever the earlier fills' cumulative PnL exceeds realized_rounding, because the K accumulator includes all fills while the E value holds only the most recent one. This is a fundamental design flaw: K accumulates, E replaces, and the comparison is apples-to-oranges.

Fix: Either e.last_fill_realized_pnl must be accumulated (add each new fill to the previous total) on the Python side, or R2 should compare only the per-fill delta (which would require storing per-fill values on both sides).
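
The two-fill example above, and the first of the two fixes, can be replayed in a few lines. The function r2_delta is a hypothetical model of the bookkeeping, not code from account.py.

```python
def r2_delta(k_fills, e_fills, accumulate_e):
    """Return |K - E| for realized PnL after each fill."""
    k_total = 0.0
    e_value = 0.0
    deltas = []
    for k_pnl, e_pnl in zip(k_fills, e_fills):
        k_total += k_pnl                      # K accumulates every fill
        if accumulate_e:
            e_value += e_pnl                  # proposed fix: E accumulates too
        else:
            e_value = e_pnl                   # current behaviour: E keeps the last fill
        deltas.append(abs(k_total - e_value))
    return deltas


fills = [10.0, 15.0]
current = r2_delta(fills, fills, accumulate_e=False)  # [0.0, 10.0]
fixed = r2_delta(fills, fills, accumulate_e=True)     # [0.0, 0.0]
```

With identical per-fill values on both sides, the current scheme still reports a delta of 10.0 after the second fill, far above a realized_rounding of 0.05.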

Severity: Critical

U4: R4 reconciliation compares `k.open_notional` against `e.used_margin` — fundamentally different quantities

File: account.py:488-498

# K side (line 411):
open_notional += abs(slot.size) * mark  # Σ |qty| · mark_price

# E side (line 370):
self._e_used_margin = _safe(...)  # exchange-reported used margin

# R4 comparison (line 490):
delta_notional = abs(k.open_notional - e.used_margin)

K open_notional = sum of absolute position notional values: |size| × mark_price. This is a gross market exposure measure.

E used_margin = exchange-reported margin used: Σ notional / leverage, possibly adjusted for cross-margin, risk weighting, position tiers, and maintenance margin requirements. This is a margin requirement measure.

For a 100 USDT position at 10x leverage:

K open_notional = |size| × mark = 100 USDT
E used_margin = 100 / 10 = 10 USDT (approximately, ignoring cross-margin effects)
delta = |100 - 10| = 90 USDT
With capital_epsilon = 1e-4 → 90 > 1e-4 → instantly exceeds even the WARN band

R4 produces ERROR for any open position with leverage > 1x. It can pass only when nothing is open or when every position runs at 1x leverage (where notional ≈ used_margin). This makes R4 broken as designed.

Fix: R4 should compare like-with-like: either compare K open_notional against E open_notional (if available from exchange), or compare K used_margin against E used_margin. The current comparison of notional vs margin is meaningless.
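
A small numeric sketch of the like-with-like comparison, using the 100 USDT at 10x example. The tuple layout (size, mark, leverage) and both helper names are assumptions for illustration, and the margin formula ignores cross-margin and tier adjustments.

```python
def k_open_notional(positions):
    """Gross exposure: sum of |size| * mark."""
    return sum(abs(size) * mark for size, mark, _ in positions)


def k_used_margin(positions):
    """Approximate margin: sum of |size| * mark / leverage."""
    return sum(abs(size) * mark / max(1.0, lev) for size, mark, lev in positions)


positions = [(2.0, 50.0, 10.0)]   # 100 USDT notional at 10x
e_used_margin = 10.0              # what the exchange would report, approximately

notional_vs_margin = abs(k_open_notional(positions) - e_used_margin)  # 90.0
margin_vs_margin = abs(k_used_margin(positions) - e_used_margin)      # 0.0
```

Because exchange margin can include adjustments the kernel does not model, a margin-vs-margin rule would likely need a wider tolerance than capital_epsilon.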

Severity: Critical

U5: R3 skipped when `len(e.positions) == 0` — K has open positions but E reports none, silent false negative

File: account.py:478

if len(e.positions) > 0:  # guard — only run R3 when E reports positions
    if k.open_positions != len(e_pos_map):
        return "ERROR", ...

If the exchange reports zero positions (after an exchange-side position clear, during a connection loss, or in the initial state) while K has open positions (slots in POSITION_OPEN), R3 is skipped entirely and the position count mismatch goes undetected.

This is asymmetric: if K=0 and E=1 (E has a position K doesn't know about), R3 fires because len(e.positions)=1 > 0 and k.open_positions=0 != 1. But if K=1 and E=0, the guard prevents detection. The position that exists only in K's view is invisible to reconciliation.

Scenario: The exchange liquidates a position (or it expires) and sends no explicit event. K still thinks the position is open. The reconciler is called with E reporting 0 positions. R3 is skipped. R1 capital divergence may catch it eventually (if the trade had PnL), but if PnL is zero, R1 delta is also zero → no detection at all.
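
A symmetric count check, sketched without the len(e.positions) > 0 guard. The function name and return shape are hypothetical and only mirror the ("ERROR", ...) pattern in the excerpt.

```python
def r3_position_count(k_open_positions: int, e_positions: list) -> tuple:
    """Compare position counts in both directions, including E == 0."""
    e_count = len(e_positions)
    if k_open_positions != e_count:
        return ("ERROR", f"position count mismatch: K={k_open_positions} E={e_count}")
    return ("OK", "")
```

If an empty E list can also mean "snapshot not received yet", the caller would need a separate freshness flag so that a missing snapshot is reported as stale rather than as a mismatch. That flag is not shown here.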

Severity: High

U6: `on_venue_event` and `apply_fill` have no NaN/Inf guards on incoming venue event fields — NaN price/size propagate unchecked

File: _rust_kernel/src/lib.rs — on_venue_event(), apply_fill()

The apply_fill() function uses event.price, event.size, event.filled_size directly in arithmetic without finiteness checks:

// apply_fill (approximately):
slot.entry_price = event.price;           // NaN → stored directly
let realized = realized_pnl(&slot, event.price, fill_size);  // NaN → NaN PnL
slot.realized_pnl += realized;             // NaN accumulates
slot.size = (slot.size - fill_size).max(0.0);  // NaN - x = NaN, .max(0.0) = 0.0 (safe for size)

The realized_pnl() function (line 1121) guards entry_price <= 0.0 but NaN passes through (IEEE 754: NaN <= 0.0 is false). The mark_price() function (line 395) guards price.is_finite() on input but entry_price can be set to NaN by apply_fill before mark_price is called.

Trigger path: If a venue event arrives with price = NaN (corrupted exchange data, deserialization of malformed JSON, or a bug in the venue adapter), apply_fill stores NaN into entry_price. Every subsequent realized_pnl() and mark_price() call produces NaN. Once NaN enters slot.realized_pnl, it propagates to k_realized_pnl → k_capital → all financial computations.

The Rust-side apply_fill_settled() (line 761) has if realized_pnl.is_finite() — but this only guards the account level, not the slot level. The slot-level realized_pnl has already been corrupted.
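
The comparison behaviour and a boundary guard can be illustrated in Python, kept in the same language as the other sketches in this document; in the Rust kernel the corresponding check would use f64::is_finite. The helper name validate_fill and the INVALID_* codes are hypothetical.

```python
import math

nan = float("nan")
assert not (nan <= 0.0)   # a `<= 0.0` guard lets NaN through


def validate_fill(price, size):
    """Reject non-finite or non-positive fill fields before they reach slot state."""
    for name, value in (("price", price), ("size", size)):
        if not math.isfinite(value) or value <= 0.0:
            return f"INVALID_{name.upper()}"
    return None
```

Running a check like this at the top of on_venue_event(), before apply_fill() touches entry_price or realized_pnl, keeps the slot state finite instead of relying on the account-level check alone.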

Severity: Critical

U7: Rust kernel has zero tests for ORDER_REJECT, PARTIAL_FILL, TERMINAL_STATE guard, STALE_STATE_RECONCILING guard, RATE_LIMITED, or MARK_PRICE transitions

File: _rust_kernel/src/lib.rs — mod tests (32 tests total)

The 32 Rust unit tests cover ExchangeFeeConfig (7), AccountState fill_settled (5), predicted→settled (2), JSON dispatch (3), dedup (3), snapshot save/restore (4), capital_frozen (4), and one FSM transition test (enter_then_ack_fill).

FSM transitions with ZERO Rust test coverage:

Transition	Rust test	Python FFI test	Overall
ORDER_REJECT → IDLE	❌	❌	Uncovered
PARTIAL_FILL → ENTRY_WORKING / EXIT_WORKING	❌	❌ (no explicit FSM state check)	Uncovered
FULL_FILL → POSITION_OPEN / CLOSED	❌	✅ (test_flaws.py)	Python only
CANCEL_ACK → IDLE (entry)	❌	✅ (TestFlaw2)	Python only
CANCEL_ACK → POSITION_OPEN (exit)	❌	✅ (TestFlaw2)	Python only
CANCEL_REJECT → POSITION_OPEN	❌	✅ (TestI15)	Python only
TERMINAL_STATE guard	❌	❌	Uncovered
STALE_STATE_RECONCILING guard	❌	❌	Uncovered
RATE_LIMITED response	❌	❌	Uncovered
MARK_PRICE (unrealized_pnl update)	❌	❌	Uncovered
SLOT_BUSY rejection	❌	✅ (incidental)	Weak
ENTER with capital frozen	❌	✅ (TestCapitalFrozen)	Python only
EXIT on IDLE slot	❌	✅ (incidental)	Weak
Multi-leg exit FSM	❌	✅ (TestFlaw4)	Python only

The most critical gap: ORDER_REJECT — if the exchange rejects an entry order, the kernel should transition back to IDLE (or emit a diagnostic). This path exists in the Rust code (line ~1525: KernelEventKind::ORDER_REJECT match arm) but has zero test coverage in Rust or Python.

Severity: High

U8: `safe_float()` in `utils.py` returns NaN/Inf instead of default — contradictory behavior with `_safe()` in `account.py`

File: utils.py:13-19, account.py:229-233

# utils.py safe_float():
def safe_float(value, default=0.0):
    try:
        out = float(value)
    except Exception:
        return default
    if not math.isfinite(out):
        return out       # BUG: returns NaN/Inf unchanged!
    return out

# account.py _safe():
def _safe(v, default=0.0):
    try:
        f = float(v)
        return f if math.isfinite(f) else default  # CORRECT: returns default
    except:
        return default

safe_float() returns the non-finite value (NaN/Inf) when encountered, while _safe() correctly returns the default. Both functions serve the same purpose (safe float conversion) but have opposite behavior for non-finite inputs.

safe_float() is used in AccountProjectionV1.observe_slots() (line 56) which feeds data into the legacy reconciliation path. If an exchange returns NaN for a price or size, safe_float() propagates NaN into the accounting state rather than defaulting to 0.0.

Fix: safe_float() should return default when the value is non-finite, matching _safe()'s behavior.
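
A sketch of the corrected helper. It narrows the except clause to the errors float() normally raises, which is a small behaviour change from the original's except Exception.

```python
import math


def safe_float(value, default=0.0):
    """Convert to float; fall back to `default` on failure or a non-finite result."""
    try:
        out = float(value)
    except (TypeError, ValueError, OverflowError):
        return default
    return out if math.isfinite(out) else default
```

With this version safe_float() and _safe() agree on NaN and Inf, so one of the two could be removed.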

Severity: Medium

U9: `_scan_slots()` uses `slot.metadata.get("leverage")` not `slot.leverage` — wrong leverage source for used_margin computation

File: account.py:414-416

metadata = slot.metadata or {}
lev = max(1.0, _safe(metadata.get("leverage", slot.leverage if hasattr(slot, "leverage") else 1.0), 1.0))

The used_margin computation reads leverage from slot.metadata.get("leverage") rather than slot.leverage. The slot's own .leverage field is only used as a fallback if metadata doesn't have the key.

The Rust kernel sets slot.leverage from the intent's leverage field during process_intent() (line ~1258). It does NOT write leverage into slot.metadata. The metadata is populated from intent.metadata which is a generic dict that may or may not contain a "leverage" key.

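Result: When metadata has no "leverage" key, metadata.get() falls back to slot.leverage and the computation is correct. The problem is precedence: whenever intent.metadata does carry a "leverage" entry, that value overrides the kernel-set slot.leverage, and a stale, differently scaled, or non-numeric entry (for example None, which _safe() maps to 1.0) makes used_margin wrong for any slot whose real leverage differs.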

This affects R5 (used_margin comparison). If the slot has 10x leverage but _scan_slots computes with 1x, k.used_margin is 10x larger than it should be, producing a false R5 ERROR.
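
A sketch of the opposite precedence, where the kernel-set slot.leverage wins and metadata is consulted only when the field is missing or unusable. The helper name resolve_leverage is hypothetical, and SimpleNamespace stands in for TradeSlot.

```python
import math
from types import SimpleNamespace


def resolve_leverage(slot) -> float:
    """Prefer slot.leverage; fall back to metadata["leverage"], then to 1.0."""
    metadata = getattr(slot, "metadata", None) or {}
    for raw in (getattr(slot, "leverage", None), metadata.get("leverage")):
        try:
            lev = float(raw)
        except (TypeError, ValueError):
            continue
        if math.isfinite(lev) and lev >= 1.0:
            return lev
    return 1.0


slot = SimpleNamespace(leverage=10.0, metadata={"leverage": 3.0})
lev = resolve_leverage(slot)   # 10.0: the slot field wins over metadata
```

Unlike the original max(1.0, ...) clamp, this sketch skips values below 1.0 and tries the next source. Whether sub-1x leverage should be clamped or rejected is a policy choice the source does not settle.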

Severity: Medium

U10: `AccountState` serializes `k_fees_paid` but Rust manually injects JSON key `"k_net_fees"` — two keys for same value

Files: _rust_kernel/src/lib.rs:1144-1148 (Rust), rust_backend.py:907 (Python)

Rust's on_account_event() manually injects an additional JSON key into the account event result:

obj.insert("k_net_fees".to_string(), json!(self.account.k_fees_paid));

The serde-serialized AccountState already contains the field k_fees_paid (from #[derive(Serialize)]). This creates two JSON keys ("k_net_fees" and "k_fees_paid") holding the same value.

Python's ExecutionKernel.snapshot() reads "k_fees_paid" from the deserialized account data — it never reads "k_net_fees". The injected key is dead data on the wire. This is not a functional bug but represents confusion about which key names are canonical.

Severity: Low

U11: 10+ `AccountState` fields transmitted across FFI but never read by Python — wasted bandwidth, confusion risk

Files: _rust_kernel/src/lib.rs (Rust serializes), rust_backend.py (Python reads)

Fields serialized by Rust's AccountState serde and transmitted across FFI but never read by Python:

seed_capital — initial capital, available from config
k_taker_fees — individual taker fee bucket
k_maker_fees — individual maker fee bucket
k_maker_rebates — individual rebate bucket
fee_config — entire ExchangeFeeConfig struct (5 subfields: calibration_ratio, taker_rate, maker_rate, etc.)
last_predicted_fee — most recent predicted fee value
last_calibration_ratio — most recent calibration ratio
seen_account_event_ids — entire IndexSet<String> (1024 entries at capacity)

These fields are serialized to JSON, sent across the FFI boundary via CString, and silently discarded by Python. The most wasteful is seen_account_event_ids — a 1024-element set of strings that is transmitted on every snapshot read but never used on the Python side.

Severity: Low

U12: `_order_from_payload()` overwrites `internal_trade_id` with enclosing slot's `trade_id` — loses order-level distinction

File: rust_backend.py:302-310,334-335

def _order_from_payload(payload: dict, trade_id: str) -> VenueOrder:
    return VenueOrder(
        internal_trade_id=trade_id,  # OVERWRITES with slot's trade_id
        ...
    )

# Called as (line 334):
active_exit_order=_order_from_payload(order_dict, trade_id=str(payload.get("trade_id", ""))),

The _order_from_payload() function takes trade_id from the caller and uses it as the order's internal_trade_id. The JSON payload's own internal_trade_id field is ignored. If Rust's TradeSlot.to_dict() serializes the order with its own internal_trade_id (which may differ from the slot's trade_id — e.g., a slot that was re-entered after a cancel), the per-order ID is silently replaced with the slot-level ID.

This affects the Python-side VenueOrder.internal_trade_id field — it will always match the slot's trade_id rather than the order's original ID. Any Python code that relies on internal_trade_id to distinguish between entry and exit orders, or to track orders across cancel/re-enter cycles, gets wrong data.

Note: T8 covers the same bug in real_zinc_plane.py. This is the same bug in rust_backend.py — the FFI path. The bug exists in both code paths.

Severity: Medium

U13: Reconciliation has no independent third reference — any divergence affecting both K and E equally is invisible

File: account.py:442-510

Every R-rule compares K (kernel-computed state) against E (exchange-reported state). If both K and E share a common error source — stale mark price, wrong position count, outdated wallet balance — the delta is small and reconciliation reports OK.

Specific blind spots:

R1: If both k.capital and e.wallet_balance are wrong by the same amount in the same direction (e.g., both show 9,800 but true capital is 10,000), abs(9800 - 9800) = 0 → OK
R4: If both K open_notional and E used_margin are inflated by the same stale mark price, the delta is small → OK (but see U4: these are fundamentally different quantities, so this blind spot is theoretical)
R3: If both K and E report 2 positions but the positions have wrong sizes or entry prices, the count matches → OK (no per-position size or entry-price comparison)

No cross-check against a third data source (independent market data feed, broker API, or blockchain) exists.

Severity: Medium

U14: `lot_step` declared in `ReconcileConfig` but never used anywhere — dead config field

File: account.py:220

@dataclass
class ReconcileConfig:
    ...
    lot_step: float = 0.001   # DEAD — never referenced in _classify or _scan_slots

The lot_step field is declared with a default of 0.001 but is never read by any code path. It was intended for per-position quantity comparison (R6 mentioned in a comment) but that rule was never implemented. A developer configuring reconciliation might set lot_step expecting it to affect behavior, but it has no effect.

Severity: Low

Pass 18 Summary

#	Flaw	Layer	Severity
U1	`order_type`/`limit_price` sent to Rust, no serde fields — silently dropped	FFI	High
U2	Rust `VenueEventStatus` expects `"CANCEL_REJECTED"` (typo) — `"CANCELED_REJECTED"` fails	Rust	High
U3	R2 compares cumulative K realized vs single-last-fill E realized — broken after 2nd fill	Accounting	Critical
U4	R4 compares K open_notional vs E used_margin — fundamentally different quantities	Accounting	Critical
U5	R3 skipped when `len(e.positions)==0` — K has positions but E reports none, silent	Accounting	High
U6	`on_venue_event`/`apply_fill` no NaN guards on venue event price/size — NaN propagates	Rust	Critical
U7	Zero Rust tests for ORDER_REJECT, PARTIAL_FILL, TERMINAL_STATE, etc.	Test	High
U8	`safe_float()` returns NaN/Inf instead of default — contradicts `_safe()`	Bridge	Medium
U9	`_scan_slots` uses `metadata.get("leverage")` not `slot.leverage` — wrong leverage source	Accounting	Medium
U10	Rust injects `"k_net_fees"` key alongside serde's `k_fees_paid` — duplicate key	Bridge	Low
U11	10+ AccountState fields transmitted across FFI but never read by Python	FFI	Low
U12	`_order_from_payload()` overwrites `internal_trade_id` with slot's `trade_id`	Bridge	Medium
U13	No independent third reference — symmetrical K=E errors invisible	Accounting	Medium
U14	`lot_step` declared in ReconcileConfig but never used — dead field	Accounting	Low

Pass 18 Severity

Severity	Count
Critical	3 (U3, U4, U6)
High	4 (U1, U2, U5, U7)
Medium	4 (U8, U9, U12, U13)
Low	3 (U10, U11, U14)

Combined Catalog (All 18 Passes)

Pass	Focus	Count	Critical	High	Medium	Low	Info
A	Architectural	15	0	2	0	2	11
T	Threading/Atomicity	9	1	3	3	2	0
E	E2E Trace (Pass 1)	26	0	4	10	11	1
F	Deep E2E (Pass 3)	30	0	1	8	17	4
G	Domain Scans (Pass 4)	36	4	11	11	8	2
H	Edge Domains (Pass 5)	22	3	9	5	4	1
I	Pass 6 (Math/Tests/Recovery/Security)	22	3	11	4	2	2
J	Pass 7 (Test Infra/Data/Rust/Env/Conn)	16	0	7	7	2	0
K	Pass 8 (Observability/Memory/Time/DeadCode)	23	2	7	7	1	6
L	Pass 9 (Contracts/Events/Network/FFI/Diffs)	16	0	4	8	4	0
M	Pass 10 (Runtime/TestBugs/FSM/Persistence/Metrics)	18	3	7	5	3	0
N	Pass 11 (Async/Sync Seams/Locks/Threading)	10	4	1	3	1	1
O	Pass 12 (Sync/Async Wider Scope)	11	0	3	7	1	0
P	Pass 13 (FFI Safety/Dangling Pointers/Coverage)	9	1	3	3	1	1
Q	Pass 14 (Serde Edges/Backup Diffs/Market Data)	12	0	4	3	2	3
R	Pass 15 (Resource Leaks/Trust Boundaries/Security)	14	2	6	3	2	1
S	Pass 16 (Error Handling/Arithmetic/Test Infra)	16	4	7	5	0	0
T	Pass 17 (Unsafe Review/Dead Code/Build/Protocols)	14	0	6	5	3	0
U	Pass 18 (Rust Test Gaps/Accounting/FFI Types)	14	3	4	4	3	0
Total		333	30	100	101	69	33

PASS 19 — STARTUP/SHUTDOWN LIFECYCLE, RUST KERNEL SUBTLETIES, GENERATED TEST INFRA

V1: `DITAv2LauncherBundle.close()` never calls `kernel.close()` — Rust kernel handle is freed only via `__del__`

File: launcher.py:40-42

def close(self) -> None:
    _maybe_close(self.venue)
    _maybe_close(self.zinc_plane)
    _maybe_close(self.control_plane)
    # NOTE: self.kernel is never closed!

The bundle's close() method closes the venue, zinc plane, and control plane — but not the kernel. The ExecutionKernel Rust handle (self._backend) is only freed when Python's garbage collector calls __del__, which is non-deterministic.

In a with block or explicit bundle.close() pattern, the Rust handle survives until the ExecutionKernel object's refcount drops to zero (usually when the bundle goes out of scope). If the caller holds a reference to the kernel (e.g., k = bundle.kernel for inspection), the handle lives indefinitely.

Compare with _build_rb() in test code (gen2.py:326-338) which creates a Shim that has a close method calling bundle.close() and bundle.kernel.close(). The production DITAv2LauncherBundle has no equivalent.
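
A deterministic shutdown could look like the sketch below. Bundle is a stand-in for DITAv2LauncherBundle with the attribute names from the excerpt above; the getattr-based _maybe_close, the close order, and the context-manager methods are assumptions, not the launcher's actual code.

```python
def _maybe_close(obj) -> None:
    """Call close() or disconnect() if the object has one (hypothetical helper)."""
    for name in ("close", "disconnect"):
        method = getattr(obj, name, None)
        if callable(method):
            method()
            return


class Bundle:
    """Stand-in for DITAv2LauncherBundle."""

    def __init__(self, venue, zinc_plane, control_plane, kernel):
        self.venue = venue
        self.zinc_plane = zinc_plane
        self.control_plane = control_plane
        self.kernel = kernel

    def close(self) -> None:
        first_error = None
        # Close event sources first and the kernel handle last.
        for part in (self.venue, self.zinc_plane, self.control_plane, self.kernel):
            try:
                _maybe_close(part)
            except Exception as exc:
                if first_error is None:
                    first_error = exc      # remember it, keep closing the rest
        if first_error is not None:
            raise first_error

    def __enter__(self):
        return self

    def __exit__(self, *exc_info) -> None:
        self.close()
```

Used as `with Bundle(...) as bundle:`, the kernel handle is released when the block exits, even if the caller still holds a reference such as k = bundle.kernel.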

Severity: Critical

V2: `BingxVenueAdapter` has no `close()` or `disconnect()` — ThreadPoolExecutor, HTTP client sessions, connections never released

File: bingx_venue.py (entire file)

BingxVenueAdapter is the synchronous REST adapter to BingX. It has no close(), disconnect(), or any cleanup method. The _maybe_close() in the launcher tries .close() → AttributeError (caught), then .disconnect() → AttributeError (caught). Nothing happens.

Leaked resources:

ThreadPoolExecutor (class-level _EXECUTOR, 3 threads) — never shut down
BingxDirectExecutionAdapter backend — holds an aiohttp.ClientSession with TCP connection pool (limit=4) — never closed
Any in-flight HTTP connections — abandoned at process exit

This is a compounding of R1 and R2 from Pass 15. It's the single largest resource leak in the system.

Severity: Critical

V3: `process_intent` ENTER path does NOT clear `seen_event_ids` — old trade's dedup set pollutes new trade

File: _rust_kernel/src/lib.rs:1260-1290 (ENTER path)

// ENTER path — slot reuse:
slot.trade_id = intent.trade_id.clone();
slot.asset = intent.asset.clone();
slot.entry_price = 0.0;
slot.size = 0.0;
slot.initial_size = 0.0;
slot.unrealized_pnl = 0.0;
slot.realized_pnl = 0.0;
slot.active_leg_index = 0;
slot.active_entry_order = None;
slot.active_exit_order = None;
slot.close_reason.clear();
slot.closed = false;
slot.last_event_time = None;
// 🔴 seen_event_ids is NOT cleared — survives from old trade
slot.fsm_state = TradeStage::ORDER_REQUESTED;

When a slot is reused by a new ENTER intent, the seen_event_ids vector from the previous trade survives. If any event_id from the new trade collides with one observed by the old trade, the event is rejected as DUPLICATE_EVENT and silently dropped.

Example:

Trade A in slot 0 receives events evt-001, evt-002, evt-003. All stored in seen_event_ids.
Trade A completes. Slot 0 returns to IDLE.
Trade B enters on slot 0. seen_event_ids still contains evt-001, evt-002, evt-003.
Exchange sends evt-001 for Trade B (a legitimate new event with a reused event_id — possible on exchanges that recycle IDs daily).
Rust kernel sees evt-001 in seen_event_ids → rejects as DUPLICATE_EVENT. The fill/ack is lost.

The probability of collision depends on the exchange's event ID scheme. If event IDs are UUIDs, the risk is negligible. If event IDs are counters that restart each day without a date component (so that "001" recurs daily), a collision is certain at every day boundary. A date-prefixed scheme such as "20260601-001" followed by "20260602-001" would not collide.

Fix: Add slot.seen_event_ids.clear(); in the ENTER path.
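
The slot-reuse scenario above can be replayed in a small Python model. SlotSketch is a simplified, hypothetical stand-in for the Rust TradeSlot; it only models the dedup list, and the "ACCEPTED" outcome label is invented for the sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SlotSketch:
    """Simplified Python model of a trade slot; not the Rust TradeSlot."""
    trade_id: str = ""
    seen_event_ids: List[str] = field(default_factory=list)

    def enter(self, trade_id: str, clear_seen: bool) -> None:
        self.trade_id = trade_id
        if clear_seen:
            self.seen_event_ids.clear()  # the proposed one-line fix

    def on_event(self, event_id: str) -> str:
        if event_id in self.seen_event_ids:
            return "DUPLICATE_EVENT"
        self.seen_event_ids.append(event_id)
        return "ACCEPTED"


def replay(clear_seen: bool) -> str:
    """Run trade A, reuse the slot for trade B, and deliver a recycled event ID."""
    slot = SlotSketch()
    slot.enter("trade-A", clear_seen)
    for eid in ("evt-001", "evt-002", "evt-003"):
        slot.on_event(eid)
    slot.enter("trade-B", clear_seen)   # slot reuse
    return slot.on_event("evt-001")     # recycled ID for trade B
```

Calling replay(False) reproduces the dropped event, while replay(True) accepts it.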

Severity: High

V4: Three generators (`gen2.py`, `gen_live_tests.py`, `_gen_test.py`) all write to same output file — last writer wins, incompatible prologues

Files: gen2.py:431-432, gen_live_tests.py:680-681, _gen_test.py:1233-1234

All three generator scripts write to the same file:

/mnt/dolphinng5_predict/prod/tests/test_pink_bingx_dita_live_e2e.py

Each produces different prologues with incompatible import styles, helper function signatures, and test runner implementations:

gen2.py: Uses RB/Shim tuple, _si() helper, _flatten() helper
gen_live_tests.py: Uses _RuntimeBundle/_RuntimeShim dataclass, _run_scenario() runner, _flatten_via_kernel_intent() helper
_gen_test.py: Uses scenario-style B() body definitions, single parametrized test_pink_ditav2 function

The 68 named body functions are otherwise identical across gen2.py and gen_live_tests.py, but their signatures differ ((k, symbol, p) vs (bundle, client, symbol, snap)). Running gen2.py and then gen_live_tests.py produces a file whose bodies don't match the runner: the file compiles (Python sees valid functions) but silently tests nothing meaningful.

_build_pink_extended.py and _build_pink_bodies.py then mutate the same file in-place with str.replace() — which silently does nothing if the expected format doesn't match.
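
The silent no-op in the str.replace() mutation step could be turned into a hard failure with a small helper. This is a sketch; replace_exactly_once is a hypothetical name, not something the build scripts currently define.

```python
def replace_exactly_once(text: str, old: str, new: str) -> str:
    """Like str.replace, but fail loudly unless `old` occurs exactly once."""
    count = text.count(old)
    if count != 1:
        raise ValueError(f"expected exactly 1 occurrence of {old!r}, found {count}")
    return text.replace(old, new)
```

A mutation script using this helper would abort when the generated file came from a different generator than it expects, instead of leaving the file unchanged.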

Severity: Critical

V5: Generated tests are triple env-gated — never run in CI, making them dead code

File: gen2.py (generated output), gen_live_tests.py (generated output), _gen_test.py (generated output)

Every generated test file has three pytest.skip() guards at the top:

if not os.environ.get("BINGX_SMOKE_LIVE"): pytest.skip("...")
if not os.environ.get("BINGX_SMOKE_ALLOW_TRADE"): pytest.skip("...")
if not os.environ.get("PINK_DITA_E2E"): pytest.skip("...")

These three env vars are never set in CI. The CI pipeline has no step that sets them. As a result:

68 test functions from gen2.py → always skipped
70 test functions from gen_live_tests.py → always skipped
157 body scenarios from _gen_test.py → always skipped
~295 combined test scenarios → zero executed in CI

Only the mock-venue tests (test_flaws.py, test_account_core_v2.py, test_account_reconcile_faults.py, test_exchange_event_seam_parity.py, test_kernel_reliability.py, test_kernel_fee_friction.py, test_pink_clickhouse_phase4.py, test_alpha_blue_untouched_g7.py) actually run in CI.

Severity: Critical

V6: `ExecutionKernel.close()` destroys Rust handle immediately — no drain of in-flight intents, no wait for pending FFI calls, no state flush

File: rust_backend.py:320-330

def close(self) -> None:
    backend = self._backend
    if backend is not None:
        self._backend = None
        try:
            _get_rust().destroy(backend)  # immediately frees Rust kernel memory
        except Exception:
            pass

destroy(backend) calls dita_kernel_destroy(handle) which calls drop(Box::from_raw(handle)), freeing the entire KernelHandle heap allocation including KernelCore, all slots, all account state.

If process_intent() or on_venue_event() has been called from another thread and is mid-execution in Rust when close() fires:

Rust's state machine is destroyed mid-transition — dangling self reference
Any HTTP calls (venue.submit()) already in-flight complete, but their results are never processed
The Rust Box::from_raw calls drop() while another thread may be holding a reference to core — use-after-free UB
_last_settled_pnl dict is orphaned
The zinc plane, projection, and account state are all inconsistent with the destroyed kernel

There is no cancel/token mechanism, no pending-call queue drain, and no state flush before destroy.
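
A minimal sketch of one way to make close() wait for in-flight calls, assuming a single lock around every FFI entry point is acceptable. GuardedKernel and its destroy callable are hypothetical; the real ExecutionKernel would call into Rust where the sketch returns a tuple.

```python
import threading


class GuardedKernel:
    """Hypothetical wrapper; `destroy` stands in for the FFI destroy call."""

    def __init__(self, destroy) -> None:
        self._destroy = destroy
        self._lock = threading.Lock()   # serialises FFI calls and close()
        self._closed = False

    def process_intent(self, intent):
        with self._lock:
            if self._closed:
                raise RuntimeError("kernel is closed")
            return ("ok", intent)       # a real wrapper would call into Rust here

    def close(self) -> None:
        # Acquiring the lock waits for any in-flight call to return first.
        with self._lock:
            if self._closed:
                return
            self._closed = True
            self._destroy()


destroyed = []
kernel = GuardedKernel(lambda: destroyed.append(True))
first = kernel.process_intent("ENTER")
kernel.close()
kernel.close()  # idempotent
```

The trade-off is that one lock serialises all kernel calls; it removes the use-after-free window but does not flush state or drain venue responses.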

Severity: Critical

V7: `_last_settled_pnl` dict accessed from both `process_intent` and `on_venue_event` without locks — thread-unsafe

File: rust_backend.py:440,475

# process_intent (line 440):
self._last_settled_pnl[intent.slot_id] = 0.0

# on_venue_event (line 475):
incremental_pnl = slot.realized_pnl - self._last_settled_pnl.get(slot.slot_id, 0.0)
self._last_settled_pnl[slot.slot_id] = slot.realized_pnl

process_intent() (called from the async event loop or scheduler) and on_venue_event() (called from the venue event stream callback) both read and write _last_settled_pnl without any synchronization primitive. If these run on different threads, the dict can experience:

Lost update: Two concurrent writes to the same slot_id — one overwrites the other
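Iteration hazards: under CPython's GIL a single dict read or write is atomic, so the dict itself is not corrupted, but any code that iterates over it while another thread inserts a key raises RuntimeError (dictionary changed size during iteration)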
Incorrect PnL settlement: The incremental PnL calculation uses stale values

In the current single-threaded async architecture, this isn't triggered. But the lack of protection means any future multi-threaded usage introduces a data race.
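
A sketch of how the read-modify-write could be protected if multi-threaded use is ever introduced. SettledPnl is a hypothetical helper, not an existing class in rust_backend.py.

```python
import threading
from typing import Dict


class SettledPnl:
    """Hypothetical helper guarding the per-slot settled PnL with a lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._last: Dict[int, float] = {}

    def reset(self, slot_id: int) -> None:
        """Equivalent of the process_intent write: start a new trade at 0.0."""
        with self._lock:
            self._last[slot_id] = 0.0

    def settle(self, slot_id: int, realized_pnl: float) -> float:
        """Return the increment since the last settle and record the new value."""
        with self._lock:
            increment = realized_pnl - self._last.get(slot_id, 0.0)
            self._last[slot_id] = realized_pnl
            return increment
```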

Severity: Medium

V8: `#[serde(default)]` on `leverage: f64` defaults to 0.0 — `mark_price()` uses leverage directly without `.max(1.0)`, silent accounting error

File: _rust_kernel/src/lib.rs:400-408

fn mark_price(&mut self, price: f64) {
    // ...
    if self.entry_price <= 0.0 || self.size <= 0.0 { return; }
    let mut delta = (price - self.entry_price) / self.entry_price;
    if self.side == TradeSide::SHORT { delta = -delta; }
    self.unrealized_pnl = delta * self.size * self.entry_price * self.leverage;  // uses leverage directly
}

The TradeSlot struct has #[serde(default)] leverage: f64 which defaults to 0.0 when deserialized from JSON without a leverage field.

mark_price() uses self.leverage directly in unrealized_pnl computation — no .max(1.0) guard. If leverage is 0.0 (from missing JSON field, set_slot_json, or snapshot restore without leverage), unrealized_pnl is always 0.0 regardless of price movement.

Compare with realized_pnl() which correctly guards:

let notional = exit_size * slot.entry_price * slot.leverage.max(1.0);  // correct

The process_intent ENTER path handles this (sets leverage to 1.0 if ≤ 0), but set_slot_json and restore_full_snapshot can bypass this and store leverage=0.0 directly into the slot.

Impact: Any slot restored from a snapshot that predates the leverage field (or saved by a version without it) gets leverage=0.0 silently. unrealized_pnl is always 0.0 — the operator sees no mark-to-market PnL even though the position has moved. This PnL is only realized on close (via realized_pnl which correctly uses .max(1.0)), so the total PnL is correct but the intra-period mark appears flat — a silent accounting error.
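
The effect of a zero leverage can be checked with a direct Python transcription of the arithmetic in mark_price(). The function name and the guard parameter are illustrative assumptions; guard=True applies the same max(1.0) clamp that realized_pnl() uses.

```python
def unrealized_pnl(entry: float, price: float, size: float,
                   leverage: float, short: bool = False,
                   guard: bool = False) -> float:
    """Python transcription of the mark_price() arithmetic shown above."""
    if entry <= 0.0 or size <= 0.0:
        return 0.0
    delta = (price - entry) / entry
    if short:
        delta = -delta
    if guard:
        leverage = max(leverage, 1.0)  # same guard realized_pnl() applies
    return delta * size * entry * leverage
```

With entry 100, price 125, size 2 and leverage 0.0, the unguarded result is 0.0 while the guarded result is 50.0.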

Severity: Medium

V9: No `conftest.py`, no `pytest.ini`, no `asyncio_mode` configuration — test discovery relies on default pytest behavior

File: (missing — no pytest configuration in workspace)

The workspace has zero pytest configuration files. No conftest.py, pytest.ini, setup.cfg, pyproject.toml with pytest settings. This means:

asyncio_mode is a pytest-asyncio option, not a core pytest one, and it defaults to "strict". All test files use asyncio.run() inline rather than @pytest.mark.asyncio. In strict mode, async def test functions are not run by pytest-asyncio unless explicitly marked. Since all current tests are sync def wrapping asyncio.run(), this works, but only as long as that specific pattern holds.
No timeout configuration — a hanging test blocks the entire suite until killed
No test filtering markers (no slow, live, offline markers)
No shared fixtures — each test file repeats the _build_rb() setup pattern
No cache or rerun configuration — flaky tests fail the CI suite

If a developer adds a new test using async def + await (the natural pattern for async code) without @pytest.mark.asyncio, the test silently doesn't run — a false negative.

Severity: High

V10: `kernel.close()` has `except Exception: pass` — silently swallows all destroy errors

File: rust_backend.py:325

try:
    _get_rust().destroy(backend)
except Exception:
    pass  # silently swallows RuntimeError, OSError, ctypes errors

If dita_kernel_destroy fails (e.g., a Rust panic caught by catch_unwind and returned as an error, or the shared library was unloaded), the exception is silently consumed. The caller believes the kernel was closed successfully, but the handle may still be allocated. This leak path is invisible to monitoring.
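
A sketch of a destroy wrapper that reports failures instead of hiding them. destroy_and_report and the logger name are hypothetical; the destroy argument stands in for the FFI call.

```python
import logging

log = logging.getLogger("kernel_close_sketch")


def destroy_and_report(destroy, handle) -> bool:
    """Call `destroy(handle)`; log and return False instead of hiding failures."""
    try:
        destroy(handle)
    except Exception:
        # log.exception records the traceback so the leak is visible to monitoring.
        log.exception("kernel destroy failed; handle may still be allocated")
        return False
    return True
```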

Severity: Low

V11: `build_launcher_bundle()` has no cleanup for partially-created components — kernel OOM orphans venue, zinc, control plane

File: launcher.py:163-223

The build order is:

Control plane ✅ built
Projection ✅ built
Zinc plane ✅ built
Venue ✅ built
ExecutionKernel — if this fails (OOM, cargo build fail, CDLL load), components 1-4 are all orphaned

There's no try/finally around the sequence. If ExecutionKernel.__init__() raises, the four already-built components (control plane, projection, zinc plane, venue), plus any partially initialized kernel state, leak. No cleanup code exists.

Fix: Use a context manager or try/finally to close created components if a later one fails.
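
The suggested fix can be sketched with contextlib.ExitStack. build_bundle and the (name, factory) pair format are assumptions for illustration, and each built component is assumed to expose close().

```python
from contextlib import ExitStack


def build_bundle(factories):
    """Build components in order; close the ones already built if a later one fails.

    `factories` is a list of (name, zero-argument callable) pairs. Each built
    component is assumed to expose close().
    """
    with ExitStack() as stack:
        built = {}
        for name, factory in factories:
            component = factory()
            stack.callback(component.close)
            built[name] = component
        # Every step succeeded: detach the callbacks so nothing is closed here.
        stack.pop_all()
        return built
```

If the kernel factory raises, the callbacks registered so far run in reverse order and the original exception propagates to the caller.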

Severity: Medium

V12: `KernelResult` clones entire kernel state (all slots + indexes + AccountState) on every FFI call — performance issue

File: _rust_kernel/src/lib.rs:1030-1050 (snapshot method)

Every FFI call to process_intent() or on_venue_event() returns a KernelResult containing a full snapshot():

fn snapshot(&self) -> KernelSnapshot {
    KernelSnapshot {
        slots: self.slots.clone(),                           // O(n) clone ALL TradeSlots
        active_trade_index: self.active_trade_index.clone(), // clone entire HashMap
        venue_order_index: self.venue_order_index.clone(),
        client_order_index: self.client_order_index.clone(),
        account: self.account.clone(),                       // clone entire AccountState
    }
}

For a kernel with 10 slots, each with metadata, seen_event_ids (1024 IDs), nested VenueOrders, and a full AccountState with fee_config, this produces thousands of heap allocations per FFI call. At 10 intents/second, this is tens of thousands of allocations/second.

The Python side reads only KernelResult.outcome and KernelResult.slot (the single affected slot) from each response. The snapshot field is still generated and serialized in Rust, transmitted, and fully parsed by json.loads in Python, then discarded on every call.

Severity: Medium (performance, not correctness)

V13: `_build_rb()` leaks bundle on any post-creation failure — Shim construction failure orphans kernel and venue

File: gen2.py:326-338, _gen_test.py:69-80

def _build_rb(ic=25000.0, max_slots=1):
    cfg = _build_config(ic)
    b = build_launcher_bundle(...)  # bundle created with active kernel, venue, etc.
    k = b.kernel
    k.account.snapshot.capital = ic  # <-- if this raises (e.g., AttributeError)
    ...  # Shim construction, etc.
    return RB(runtime=Shim(k), config=cfg)

If any line after build_launcher_bundle() raises, the bundle b (with live kernel, venue connections, zinc plane) is leaked. No try/finally to call b.close(). In test code this is acceptable (process exits soon), but if test suites grow large, accumulated leaked bundles exhaust kernel slots or file descriptors.

Severity: Low

V14: `_maybe_close` uses `break` after first method match — only tries `close` OR `disconnect`, never both

File: launcher.py:67

for method_name in ("close", "disconnect"):
    method = getattr(obj, method_name, None)
    if callable(method):
        ...
        break  # <-- breaks after first successful match — never tries the second method

If an object has both close() and disconnect(), only close() is called. The disconnect() fallback is never reached if close() exists. This is correct for objects like RealZincPlane (which has close() but no disconnect()), but for objects with both methods (possible future adapters), the second method is silently skipped.

Severity: Low

Pass 19 Summary

#	Flaw	Layer	Severity
V1	`DITAv2LauncherBundle.close()` never calls `kernel.close()` — Rust handle leaks via `__del__`	Launcher	Critical
V2	`BingxVenueAdapter` no `close()`/`disconnect()` — ThreadPoolExecutor/HTTP never released	Venue	Critical
V3	`process_intent` ENTER doesn't clear `seen_event_ids` — old dedup pollutes new trade	Rust	High
V4	3 generators write same output file — last writer wins, incompatible prologues	Test	Critical
V5	Generated tests triple env-gated — never run in CI, dead code	Test	Critical
V6	`kernel.close()` destroys Rust handle immediately — no drain, no flush, use-after-free risk	Bridge	Critical
V7	`_last_settled_pnl` dict accessed from process_intent and on_venue_event without locks	Bridge	Medium
V8	`#[serde(default)] leverage: f64` default 0.0 — mark_price uses directly no `.max(1.0)`	Rust	Medium
V9	No `conftest.py`, no `pytest.ini`, no `asyncio_mode` — test discovery fragile	Test	High
V10	`kernel.close()` `except Exception: pass` — silently swallows destroy errors	Bridge	Low
V11	`build_launcher_bundle()` no cleanup on partial failure — OOM orphans 4 components	Launcher	Medium
V12	`KernelResult` clones entire kernel state every FFI call — wasted allocations	Rust	Medium
V13	`_build_rb()` leaks bundle on post-creation failure	Test	Low
V14	`_maybe_close` breaks after first method — never tries both close and disconnect	Launcher	Low

Pass 19 Severity

Severity	Count
Critical	5 (V1, V2, V4, V5, V6)
High	2 (V3, V9)
Medium	4 (V7, V8, V11, V12)
Low	3 (V10, V13, V14)

Combined Catalog (All 19 Passes)

Pass	Focus	Count	Critical	High	Medium	Low	Info
A	Architectural	15	0	2	0	2	11
T	Threading/Atomicity	9	1	3	3	2	0
E	E2E Trace (Pass 1)	26	0	4	10	11	1
F	Deep E2E (Pass 3)	30	0	1	8	17	4
G	Domain Scans (Pass 4)	36	4	11	11	8	2
H	Edge Domains (Pass 5)	22	3	9	5	4	1
I	Pass 6 (Math/Tests/Recovery/Security)	22	3	11	4	2	2
J	Pass 7 (Test Infra/Data/Rust/Env/Conn)	16	0	7	7	2	0
K	Pass 8 (Observability/Memory/Time/DeadCode)	23	2	7	7	1	6
L	Pass 9 (Contracts/Events/Network/FFI/Diffs)	16	0	4	8	4	0
M	Pass 10 (Runtime/TestBugs/FSM/Persistence/Metrics)	18	3	7	5	3	0
N	Pass 11 (Async/Sync Seams/Locks/Threading)	10	4	1	3	1	1
O	Pass 12 (Sync/Async Wider Scope)	11	0	3	7	1	0
P	Pass 13 (FFI Safety/Dangling Pointers/Coverage)	9	1	3	3	1	1
Q	Pass 14 (Serde Edges/Backup Diffs/Market Data)	12	0	4	3	2	3
R	Pass 15 (Resource Leaks/Trust Boundaries/Security)	14	2	6	3	2	1
S	Pass 16 (Error Handling/Arithmetic/Test Infra)	16	4	7	5	0	0
T	Pass 17 (Unsafe Review/Dead Code/Build/Protocols)	14	0	5	5	4	0
U	Pass 18 (Rust Test Gaps/Accounting/FFI Types)	14	3	4	4	3	0
V	Pass 19 (Lifecycle/Rust Subtleties/Test Infra)	14	5	2	4	3	0
Total		347	35	101	105	73	33

PASS 20 — CONFIGURATION MANAGEMENT, MATH SIGN CONVENTIONS, BINGX PROTOCOL

W1: `int()` on three env vars (`RECV_WINDOW_MS`, `DEFAULT_LEVERAGE`, `EXCHANGE_LEVERAGE_CAP`) — `ValueError` uncaught, immediate crash on non-numeric input

File: launcher.py:189-191

recv_window_ms = int(os.environ.get("DOLPHIN_BINGX_RECV_WINDOW_MS", "5000"))
default_leverage = int(os.environ.get("DOLPHIN_BINGX_DEFAULT_LEVERAGE", "1"))
exchange_leverage_cap = int(os.environ.get("DOLPHIN_BINGX_EXCHANGE_LEVERAGE_CAP", "3"))

Three consecutive int() calls on raw env var strings with no try/except. If any of these env vars is set to a non-numeric value (e.g., DOLPHIN_BINGX_RECV_WINDOW_MS=abc from a typo in Docker config), int("abc") raises ValueError which propagates uncaught through build_bingx_exec_client_config() → build_launcher_bundle() → crashes the process.

Compare with DITA_V2_ACTIVE_SLOT_LIMIT (launcher.py:140-144) which correctly wraps int() in try/except Exception: pass. The slot limit parsing is safe; these three are not.
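
A sketch of a parsing helper that still fails fast on bad input but names the offending variable. env_int is a hypothetical helper; whether to raise or to fall back to the default is a policy choice the source does not settle.

```python
import os
from typing import Mapping, Optional


def env_int(name: str, default: int,
            env: Optional[Mapping[str, str]] = None) -> int:
    """Parse an integer env var, naming the variable in the error message."""
    source = os.environ if env is None else env
    raw = source.get(name)
    if raw is None or not raw.strip():
        return default
    try:
        return int(raw.strip())
    except ValueError:
        raise ValueError(f"{name} must be an integer, got {raw!r}") from None
```

With this helper, DOLPHIN_BINGX_RECV_WINDOW_MS=abc produces an error that points at the variable rather than a bare int() traceback.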

Severity: Critical

W2: `DITA_V2_PREFIX` default `"dita_v2"` — multi-process with `ZINC=REAL` causes silent shared-memory data corruption

File: launcher.py:311

resolved_prefix = (prefix or os.environ.get("DITA_V2_PREFIX", "dita_v2")).strip()

The prefix defaults to "dita_v2" if not set. When two processes on the same machine use DITA_V2_ZINC=REAL (or one process restarts without cleaning shared memory), both try to create the same three regions via SharedRegion.create(): "dita_v2_intent", "dita_v2_state", and "dita_v2_control".

On Linux, shared memory segments (/dev/shm/) persist until explicitly unlinked or the system reboots. A second process:

Gets EEXIST from SharedRegion.create(), which real_zinc_plane.py's __init__ does NOT handle (no try/except)
Or, if the region already exists with a different size, gets a mismatch error
Or, if the region is simply opened (not created), shares the same memory with the first process, and simultaneous writes corrupt the state

The _write_region functions are non-atomic (T7), so two processes writing concurrently see partial updates.
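
One hedged mitigation is to refuse the shared default prefix whenever real shared memory is selected. resolve_prefix below is a hypothetical variant of the launcher logic that takes the environment as a mapping; it assumes DITA_V2_ZINC=REAL is the only value that selects real shared memory.

```python
from typing import Mapping

DEFAULT_PREFIX = "dita_v2"


def resolve_prefix(env: Mapping[str, str], prefix: str = "") -> str:
    """Return the shared-memory prefix, refusing the shared default in REAL mode."""
    resolved = (prefix or env.get("DITA_V2_PREFIX", DEFAULT_PREFIX)).strip()
    real_zinc = env.get("DITA_V2_ZINC", "").strip().upper() == "REAL"
    if real_zinc and resolved == DEFAULT_PREFIX:
        raise ValueError(
            "DITA_V2_ZINC=REAL requires an explicit, unique DITA_V2_PREFIX"
        )
    return resolved
```

This does not make the writes atomic; it only prevents two independent instances from landing on the same region names by accident.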

Severity: Critical

W3: Funding sign convention opposite between Python V2 `apply_funding()` and Rust `apply_funding_fee()` — same raw exchange value produces opposite capital effect

Files: _rust_kernel/src/lib.rs:839-841 (Rust), account.py:299 (Python V2)

// Rust: amount > 0 = received (capital ↑)
self.k_funding_net -= amount;
// If amount = -3.75 (paid out): k_funding_net -= (-3.75) = k_funding_net + 3.75
// k_capital = seed + realized - fees - k_funding_net = seed + realized - fees - (+3.75) = capital ↓

# Python V2: amount > 0 = paid out (capital ↓)
self._k_funding += amount
self._k_capital = self._seed + self._k_realized - self._k_fees - self._k_funding
# If amount = -3.75 (paid out): _k_funding += (-3.75) = -3.75
# k_capital = seed + realized - fees - (-3.75) = seed + realized - fees + 3.75 = capital ↑ WRONG

Both use k_capital = seed + realized - fees - funding. But:

Rust: A funding payment (capital decreases) is represented as a negative amount. k_funding_net -= (-3.75) gives +3.75, and capital - 3.75 means capital decreases. Correct.
Python V2: A funding payment (capital decreases) is represented as a positive amount. _k_funding += 3.75, and capital - 3.75 means capital decreases. Also correct under its own convention, but that convention has the opposite sign.

The same raw exchange value (e.g., funding_amount = -3.75 from BingX WS showing a funding cost):

System	Input	`k_funding`	k_capital effect	Correct?
Rust	`-3.75`	`funding_net = 0 - (-3.75) = +3.75`	`capital - 3.75` (decrease)	✅
Python V2	`-3.75`	`_k_funding = 0 + (-3.75) = -3.75`	`capital - (-3.75) = capital + 3.75` (increase)	❌ WRONG

The parity test (test_exchange_event_seam_parity.py:426) compares WS path vs Poll path. Both use Python V2 apply_funding() with the same convention, so they match each other but both are wrong in absolute terms. The Rust kernel produces the correct result.
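
The divergence can be reproduced with a few lines of arithmetic. The function names are illustrative and the seed of 1000.0 is an arbitrary value chosen for the example; the formulas are the ones quoted above.

```python
def capital(seed: float, realized: float, fees: float, funding: float) -> float:
    """Shared formula: k_capital = seed + realized - fees - funding."""
    return seed + realized - fees - funding


def rust_funding_net(funding_net: float, amount: float) -> float:
    """Rust convention: amount > 0 means received, so it is subtracted."""
    return funding_net - amount


def python_v2_funding(k_funding: float, amount: float) -> float:
    """Python V2 convention: amount > 0 means paid out, so it is added."""
    return k_funding + amount


raw = -3.75            # raw exchange value for a funding cost
seed = 1000.0          # arbitrary starting capital for the example
rust_capital = capital(seed, 0.0, 0.0, rust_funding_net(0.0, raw))
python_capital = capital(seed, 0.0, 0.0, python_v2_funding(0.0, raw))
```

The Rust path ends at 996.25 and the Python V2 path at 1003.75, a 7.50 gap from a single 3.75 funding cost.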

Severity: Critical

W4: `BingxUserStream` `listenKeyExpired` frames silently swallowed — `continue` at line 272 skips the expiry check at line 275, dead code

File: bingx_user_stream.py:272-276

# Line 272-276 — the main WS message dispatch:
kind = frame.get("e", "")
if kind in self._NORMALISE_MAP:
    yield self._NORMALISE_MAP[kind](frame)
else:
    continue  # <-- UNKNOWN event type → continue

# Line 275: THIS LINE IS NEVER REACHED for listenKeyExpired
if kind == "listenKeyExpired":
    raise RuntimeError("listenKeyExpired")

The else: continue branch skips the listenKeyExpired check at line 275. When BingX sends {"e": "listenKeyExpired"}, the dispatch:

Checks kind in self._NORMALISE_MAP: "listenKeyExpired" is NOT in the map
Falls to else: continue, which skips the rest of the loop body
Never reaches the raise RuntimeError, so the expiry check is dead code

The stream stays connected with a dead listenKey. The keepalive loop (which runs independently) keeps sending PUT keepalive requests to the dead key. The 24-hour rotation timer eventually fires, but in the meantime (potentially hours), all WS events are silently lost.

The raise RuntimeError("listenKeyExpired") at line 276 was clearly intended to trigger a reconnect, but the continue before it makes it unreachable.
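
A sketch of the dispatch with the expiry check moved ahead of the unknown-kind skip. dispatch is a synchronous, hypothetical simplification of the stream loop; the real code is an async generator on BingxUserStream.

```python
from typing import Callable, Dict, Iterable, Iterator


def dispatch(frames: Iterable[dict],
             normalise: Dict[str, Callable[[dict], dict]]) -> Iterator[dict]:
    """Yield normalised events; raise on listenKeyExpired so the caller reconnects."""
    for frame in frames:
        kind = frame.get("e", "")
        # Check expiry before the unknown-kind skip so it cannot be bypassed.
        if kind == "listenKeyExpired":
            raise RuntimeError("listenKeyExpired")
        handler = normalise.get(kind)
        if handler is None:
            continue  # unknown event type
        yield handler(frame)
```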

Severity: Critical

W5: `int()` on `DOLPHIN_BINGX_RECV_WINDOW_MS` with no bounds check — extreme values can enable replay attacks

File: launcher.py:189

recv_window_ms = int(os.environ.get("DOLPHIN_BINGX_RECV_WINDOW_MS", "5000"))

The recv window is used in HMAC-signed BingX requests as the recvWindow parameter. It defines the timestamp tolerance for signed requests — how far off the request timestamp can be from the server's clock.

A value like 86400000 (24 hours) means any signed request is valid for 24 hours from its timestamp. An attacker who intercepts a signed request can replay it for an entire day.

There's no upper bound. The code only clamps max(1, recv_window_ms) — so the minimum is 1ms but the maximum is unbounded.
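
A sketch of a two-sided clamp. The 60-second ceiling is a hypothetical value chosen for illustration, not a documented BingX limit.

```python
MAX_RECV_WINDOW_MS = 60_000  # hypothetical ceiling; pick per venue guidance


def clamp_recv_window(ms: int, ceiling: int = MAX_RECV_WINDOW_MS) -> int:
    """Clamp the recv window to [1, ceiling] milliseconds."""
    return max(1, min(ms, ceiling))
```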

Severity: High

W6: `DITA_V2_ACTIVE_SLOT_LIMIT` stored in control snapshot but never enforced by Rust kernel — dead config value

File: launcher.py:140-144 (read and stored), _rust_kernel/src/lib.rs (never checked)

# launcher.py: stored in control snapshot
raw = os.environ.get("DITA_V2_ACTIVE_SLOT_LIMIT")
if raw is not None:
    fields["active_slot_limit"] = max(1, int(str(raw).strip()))

The env var is read, parsed, clamped, and stored into the control plane's ControlSnapshot. But the Rust kernel allocates max_slots slots at construction (from ExecutionKernel.__init__'s max_slots parameter) and never reads the active_slot_limit from the control snapshot.

The active_slot_limit field is written to projections (hazelcast_projection.py:41 writes control.as_dict() which includes active_slot_limit) and visible in the control plane state, but the Rust kernel never limits slot usage based on it. An algorithm could send ENTER intents to any slot up to max_slots regardless of the configured limit.

Severity: High

W7: No fill/trade history fetched during WS reconnect gap-backfill — fills during disconnect window permanently lost

File: bingx_user_stream.py:117-121

try:
    snap = await self.account_snapshot()
    yield snap
except Exception as exc:
    log.warning("bingx_user_stream: gap-backfill REST failed: %s", exc)

account_snapshot() (lines 153-219) fetches:

GET /openApi/swap/v3/user/balance — wallet balance, available margin
GET /openApi/swap/v2/user/positions — open positions with entry price, qty

It does NOT fetch:

GET /openApi/swap/v2/trade/fill/history — fill history during the gap

If a LIMIT order filled during the reconnect window (e.g., a resting limit order that was placed before disconnect and filled while the WS was down), the fill event is permanently lost. The balance snapshot shows the result (changed wallet balance), but no fill event with price, qty, fee, realized_pnl is emitted. The kernel processes only an ACCOUNT_UPDATE, missing the individual fill details.

Additionally, funding fee events accrued during the disconnect are invisible. The balance reflects them, but no FUNDING_FEE event arrives. The kernel's k_funding_net drifts until the next explicit funding event.

Severity: High

W8: `BingxVenueAdapter` rate limit detection fails on HTTP 429 without matching message body — `_rate_limit_retry_after_ms` returns 0, instant retry

File: bingx_venue.py:169-183

The rate limit detection has three paths:

Response header (retryAfter, retry_after_ms, retryAfterMs) — extracted from response dict
Error message regex — re.search(r"unblocked after (\d+)", msg) on the exchange error text
Return 0 — everything else

If BingX returns HTTP 429 with a message body that doesn't contain the phrase "unblocked after" (e.g., a generic "too many requests" or a localized message), the regex misses and returns 0. The caller then retries immediately, burning more rate limit quota.

The BingxHttpError catch at lines 316-317 catches all HTTP errors (including 429) and converts them to {"status": "REJECTED", ...}. The REJECTED tag prevents the kernel from recognizing the response as RATE_LIMITED. The rate-limit detection is entirely dependent on the error message body format, not the HTTP status code.

Fix: Check for HTTP 429 status code first, then fall back to message parsing.
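
A sketch of status-first detection. Several things here are assumptions: that the adapter can see the HTTP status and headers, that BingX sends a standard Retry-After header in seconds, that the number captured from the message is in milliseconds, and the 1-second default backoff.

```python
import re
from typing import Mapping, Optional

DEFAULT_BACKOFF_MS = 1_000  # hypothetical fallback when the venue gives no hint


def retry_after_ms(status: int, headers: Mapping[str, str],
                   message: str = "") -> Optional[int]:
    """Return a backoff in ms for a rate-limited response, or None if not rate limited."""
    header = headers.get("Retry-After", "").strip()
    if header.isdigit():
        return int(header) * 1000          # Retry-After delta is in seconds
    match = re.search(r"unblocked after (\d+)", message)
    if match:
        return int(match.group(1))         # assumed to be milliseconds
    if status == 429:
        return DEFAULT_BACKOFF_MS          # rate limited, but no usable hint
    return None
```

A 429 with an unrecognised body now yields a non-zero backoff instead of an instant retry.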

Severity: High

W9: `DITA_V2_CONTROL_PLANE=REAL_ZINC` silently falls back to in-memory on any exception — operator thinks they have persistence but don't

File: control.py:205-212

if env_choice in {"REAL", "REAL_ZINC", "SHARED", "SHARED_MEM"}:
    try:
        from .real_control_plane import RealZincControlPlane
        plane = RealZincControlPlane(prefix=prefix, create=True)
    except Exception:
        pass  # <-- silent fallback, no log
return ZincControlPlane(snapshot=snapshot)  # in-memory fallback

If RealZincControlPlane() raises (Zinc library not installed, shared memory creation fails, permission denied), the exception is silently swallowed and the function returns the in-memory ZincControlPlane. No log, no warning, no diagnostic. The operator configured a persistent shared-memory control plane but gets an ephemeral in-memory one.

Compare with build_launcher_bundle()'s _build_zinc_plane() (launcher.py:122-125) which also silently falls back — same pattern.
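
A sketch of a fallback that is at least visible. build_control_plane, its factory arguments, and the strict flag are hypothetical; the real function constructs RealZincControlPlane and ZincControlPlane directly.

```python
import logging

log = logging.getLogger("control_plane_sketch")


def build_control_plane(make_real, make_in_memory, strict: bool = False):
    """Try the shared-memory plane; log loudly (or re-raise) before falling back."""
    try:
        return make_real()
    except Exception as exc:
        if strict:
            raise
        log.warning("REAL_ZINC control plane unavailable, using in-memory: %r", exc)
        return make_in_memory()
```

With strict=True an operator who asked for persistence gets a startup failure rather than a silent downgrade.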

Severity: High

W10: `BingxVenueAdapter` HTTP error handling maps ALL `BingxHttpError` to `"REJECTED"` — cannot distinguish "order not found" from "exchange is down"

File: bingx_venue.py:316-317

except BingxHttpError as exc:
    response = {"status": "REJECTED", "msg": str(exc), ...}

Every BingxHttpError is mapped to status="REJECTED". A 500 Internal Server Error, a 403 forbidden, a 404 order-not-found — all become "REJECTED". The Rust kernel treats "REJECTED" as a specific FSM signal. It cannot distinguish "this cancel was rejected because the order doesn't exist" (harmless, order already cancelled) from "the exchange is returning 500 errors" (system-wide failure, should halt trading).

Impact: If BingX has a transient 500 error, every submit/cancel in that window returns "REJECTED". The kernel may interpret this as genuine order rejections and transition the FSM to CLOSED or trigger cancels, even though the orders may have actually gone through.
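
A sketch of a status-aware mapping. RATE_LIMITED and REJECTED are names used in this analysis; VENUE_UNAVAILABLE, AUTH_ERROR and NOT_FOUND are hypothetical labels, and the kernel would need matching FSM handling before they could be used.

```python
def classify_http_error(status: int) -> str:
    """Map an HTTP status to a coarse outcome instead of a blanket REJECTED.

    RATE_LIMITED and REJECTED appear in the analysis above; VENUE_UNAVAILABLE,
    AUTH_ERROR and NOT_FOUND are hypothetical labels for this sketch.
    """
    if status == 429:
        return "RATE_LIMITED"
    if status >= 500:
        return "VENUE_UNAVAILABLE"   # outcome unknown; the order may have gone through
    if status in (401, 403):
        return "AUTH_ERROR"
    if status == 404:
        return "NOT_FOUND"
    return "REJECTED"
```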

Severity: High

W11: `BINGX_API_KEY` accessed via bracket `os.environ["BINGX_API_KEY"]` in generated tests — `KeyError` crash if unset; inconsistent with launcher's `.get()` which silently passes `None`

Files: gen2.py:320, gen_live_tests.py:116-117, _gen_test.py:60

# Generated test code (all three generators):
BINGX_API_KEY = os.environ["BINGX_API_KEY"]     # bracket access — KeyError if unset
BINGX_SECRET_KEY = os.environ["BINGX_SECRET_KEY"]

# launcher.py:195-196 — different access pattern:
api_key=os.environ.get("BINGX_API_KEY"),          # .get() — returns None if unset
secret_key=os.environ.get("BINGX_SECRET_KEY"),

The test generators use bracket access (os.environ["KEY"]) which raises KeyError instantly if the env var is missing. The launcher uses .get() access (os.environ.get("KEY")) which silently returns None.

This means:

Running generated tests without env vars → KeyError crash at module import time
Running the launcher without env vars → None silently passed to BingxExecClientConfig → failure is delayed until the first HTTP call (a confusing 401)

Two different failure modes for the same missing configuration. The launcher should validate None immediately.
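
A sketch of an up-front check both the launcher and the generated tests could share. require_env is a hypothetical helper that takes the environment as a mapping so it can be exercised without touching os.environ.

```python
from typing import Iterable, Mapping


def require_env(env: Mapping[str, str], names: Iterable[str]) -> dict:
    """Return the requested variables, or raise one error listing every missing name."""
    wanted = list(names)
    missing = [n for n in wanted if not env.get(n)]
    if missing:
        raise RuntimeError("missing required environment variables: " + ", ".join(missing))
    return {n: env[n] for n in wanted}
```

Calling it with ("BINGX_API_KEY", "BINGX_SECRET_KEY") at startup would give one failure mode for both code paths.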

Severity: High

W12: `MockVenueScenario` has no `rate_limit` flag — entire RATE_LIMITED code path untested in CI

File: mock_venue.py:27-35

@dataclass
class MockVenueScenario:
    reject_entries: bool = False
    reject_exits: bool = False
    reject_cancels: bool = False
    all_fills_partial: bool = False
    # NOTE: no rate_limit field

The MockVenueScenario dataclass has flags for rejection and partial fill simulation but no rate_limit flag. The entire RATE_LIMITED code path — in the Python adapter (bingx_venue.py:384-396 detects retry-after, tags event as RATE_LIMITED) and in the Rust kernel (the FSM match arm for KernelEventKind::RATE_LIMITED) — has zero simulation in mock venue tests.

Only a live BingX connection can trigger the rate-limit path, and the live tests are triple env-gated (V5) and never run in CI.
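
A sketch of what a rate-limit switch on the mock could look like. MockVenueScenarioSketch, mock_submit, the retry_after_ms field and the "ACKED" status are hypothetical; only the four existing flags and the RATE_LIMITED and REJECTED labels come from the source.

```python
from dataclasses import dataclass


@dataclass
class MockVenueScenarioSketch:
    """Hypothetical extension of MockVenueScenario with a rate-limit switch."""
    reject_entries: bool = False
    reject_exits: bool = False
    reject_cancels: bool = False
    all_fills_partial: bool = False
    rate_limit: bool = False            # new flag
    retry_after_ms: int = 250           # hypothetical backoff hint


def mock_submit(scenario: MockVenueScenarioSketch, is_entry: bool = True) -> dict:
    """Return a canned venue response for an order submission."""
    if scenario.rate_limit:
        return {"status": "RATE_LIMITED", "retry_after_ms": scenario.retry_after_ms}
    if is_entry and scenario.reject_entries:
        return {"status": "REJECTED"}
    return {"status": "ACKED"}
```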

Severity: Medium

W13: `_rate_limit_retry_after_ms` regex uses English phrase `"unblocked after"` — non-portable, fails on localized exchange messages

File: bingx_venue.py:177

m = re.search(r"unblocked after (\d+)", msg)

The regex relies on the English phrase "unblocked after" in the exchange's error message. BingX is a Chinese exchange. If the error response is localized to Chinese (e.g., "解封后(\d+)毫秒"), or if BingX changes their English wording, the regex silently misses and returns 0 (instant retry).

Fix: Prioritize the Retry-After HTTP response header or JSON field retryAfter/retry_after_ms over parsing the error message body.

Severity: Medium

W14: `DITA_V2_ACTIVE_SLOT_LIMIT` invalid values silently discarded with no logging — operator sets `"abc"`, gets default with no warning

File: launcher.py:140-144

raw = os.environ.get("DITA_V2_ACTIVE_SLOT_LIMIT")
if raw is not None:
    try:
        fields["active_slot_limit"] = max(1, int(str(raw).strip()))
    except Exception:
        pass  # no log, no warning

If the operator sets DITA_V2_ACTIVE_SLOT_LIMIT=abc, the int() raises ValueError, the except swallows it, and the field is never written to fields. The slot limit silently uses the control plane default (10). No log, no warning, no error — the operator thinks they configured a limit but the config was silently ignored.

Severity: Medium

Pass 20 Summary

#	Flaw	Layer	Severity
W1	`int()` on 3 env vars uncaught `ValueError` — non-numeric input crashes process	Config	Critical
W2	`DITA_V2_PREFIX` default `"dita_v2"` — multi-process shared memory corruption	Config	Critical
W3	Funding sign opposite Python V2 vs Rust — same raw value opposite capital effect	Accounting	Critical
W4	`listenKeyExpired` frames silently swallowed — `continue` skips expiry check, dead code	Venue	Critical
W5	`RECV_WINDOW_MS` no upper bound — extreme values enable replay attacks	Config	High
W6	`ACTIVE_SLOT_LIMIT` stored but never enforced by Rust kernel — dead config	Config	High
W7	No fill history fetched during WS reconnect gap-backfill — fills permanently lost	Venue	High
W8	Rate limit detection fails on HTTP 429 without matching message — returns 0 instant retry	Venue	High
W9	`CONTROL_PLANE=REAL_ZINC` silently falls back to in-memory — no persistence	Config	High
W10	All `BingxHttpError` mapped to "REJECTED" — can't distinguish errors from real rejections	Venue	High
W11	`os.environ["KEY"]` bracket access in tests vs `.get()` in launcher — inconsistent	Test	High
W12	`MockVenueScenario` no `rate_limit` flag — RATE_LIMITED path untested in CI	Test	Medium
W13	Rate-limit regex uses English phrase `"unblocked after"` — non-portable	Venue	Medium
W14	Invalid `ACTIVE_SLOT_LIMIT` values silently discarded — no log, no warning	Config	Medium

Pass 20 Severity

Severity	Count
Critical	4 (W1, W2, W3, W4)
High	7 (W5, W6, W7, W8, W9, W10, W11)
Medium	3 (W12, W13, W14)

Combined Catalog (All 20 Passes)

Pass	Focus	Count	Critical	High	Medium	Low	Info
A	Architectural	15	0	2	0	2	11
T	Threading/Atomicity	9	1	3	3	2	0
E	E2E Trace (Pass 1)	26	0	4	10	11	1
F	Deep E2E (Pass 3)	30	0	1	8	17	4
G	Domain Scans (Pass 4)	36	4	11	11	8	2
H	Edge Domains (Pass 5)	22	3	9	5	4	1
I	Pass 6 (Math/Tests/Recovery/Security)	22	3	11	4	2	2
J	Pass 7 (Test Infra/Data/Rust/Env/Conn)	16	0	7	7	2	0
K	Pass 8 (Observability/Memory/Time/DeadCode)	23	2	7	7	1	6
L	Pass 9 (Contracts/Events/Network/FFI/Diffs)	16	0	4	8	4	0
M	Pass 10 (Runtime/TestBugs/FSM/Persistence/Metrics)	18	3	7	5	3	0
N	Pass 11 (Async/Sync Seams/Locks/Threading)	10	4	1	3	1	1
O	Pass 12 (Sync/Async Wider Scope)	11	0	3	7	1	0
P	Pass 13 (FFI Safety/Dangling Pointers/Coverage)	9	1	3	3	1	1
Q	Pass 14 (Serde Edges/Backup Diffs/Market Data)	12	0	4	3	2	3
R	Pass 15 (Resource Leaks/Trust Boundaries/Security)	14	2	6	3	2	1
S	Pass 16 (Error Handling/Arithmetic/Test Infra)	16	4	7	5	0	0
T	Pass 17 (Unsafe Review/Dead Code/Build/Protocols)	14	0	5	5	4	0
U	Pass 18 (Rust Test Gaps/Accounting/FFI Types)	14	3	4	4	3	0
V	Pass 19 (Lifecycle/Rust Subtleties/Test Infra)	14	5	2	4	3	0
W	Pass 20 (Config/Math Signs/BingX Protocol)	14	4	7	3	0	0
Total		361	39	108	108	73	33

PASS 21 — RUST BUILD/DEPS, PYTHON PACKAGING, SHARED MEMORY PROTOCOL

X1: Critical — No ABI compatibility check on Rust `.so` load — stale/wrong binary can crash or silently corrupt state

File: rust_backend.py:86-92

path = _ensure_library()
self.lib = ctypes.CDLL(str(path))

The Python code loads whatever .so/.dylib exists at the computed path with zero verification. Problems:

No Rust version check: If the .so was built with a different Rust compiler version that changed struct layout, data is silently corrupted.
No recompile-on-version-mismatch: If Cargo.lock is updated, the old .so is used until manually deleted.
No hash/checksum: No mechanism to detect stale binary, wrong branch, or tampering.
No #[repr(C)] on internal types: Only KernelHandle has #[repr(C)]. Serde JSON is the FFI wire format, which is type-safe, but the Box::from_raw(handle) call in dita_kernel_destroy assumes the exact same memory layout.
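One possible guard for the missing hash/checksum point, sketched under the assumption that the build step records the SHA-256 of the library it produced (for example in a sidecar file). `verify_library` and the idea of an expected digest are hypothetical; nothing like this exists in rust_backend.py today.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream the file so large libraries are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_library(path: Path, expected_sha256: str) -> None:
    """Raise before loading if the binary is not the one the build recorded."""
    actual = sha256_of(path)
    if actual != expected_sha256:
        raise RuntimeError(
            f"{path.name}: digest {actual[:12]} does not match expected "
            f"{expected_sha256[:12]}; refusing to load a stale or foreign binary"
        )
```

The check would run between `_ensure_library()` and `ctypes.CDLL(...)`. It detects a stale or swapped file, not a layout change inside a correctly recorded build.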

Severity: Critical

X2: Critical — `real_zinc_plane._write_region()` zeroes entire buffer before writing — visible all-zero window, inconsistent with real_control_plane

File: real_zinc_plane.py:258-260

# real_zinc_plane.py — zero THEN write:
view[:] = b"\x00" * len(view)          # Zero entire 1MB buffer
view[:len(packet)] = packet            # Then write packet

# real_control_plane.py — write THEN zero tail:
view[:len(packet)] = packet            # Write packet first
view[len(packet):] = b"\x00" * (len(view) - len(packet))  # Then zero tail

The same operation has two different implementations. The zinc plane zeroes the full buffer before writing the packet. In the window between the zero and the write, a concurrent reader sees all zeros, so _decode_packet returns {} (empty dict) and the reader acts on an empty state instead of the last published one.

The control plane writes the packet first and then zeroes the tail, so a reader never observes an all-zero buffer. The visible window is narrower, although the write itself is still not atomic.

Additionally, b"\x00" * len(view) allocates and copies a full 1MB buffer on every write, even though the packet is typically <1KB. The control plane's tail zero has nearly the same cost, since it allocates len(view) - len(packet) bytes. Performance issue.
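A sketch of a cheaper write order, assuming the writer remembers the length of the previous packet. `write_packet` and `prev_len` are hypothetical names. This removes the all-zero window and the 1MB allocation, but it does not make the write atomic for concurrent readers.

```python
def write_packet(view: memoryview, packet: bytes, prev_len: int) -> int:
    """Write `packet` at offset 0, then clear only what the old packet left behind.

    Returns the new packet length so the caller can pass it as `prev_len` next time.
    """
    if len(packet) > len(view):
        raise ValueError("packet larger than region")
    view[: len(packet)] = packet
    if prev_len > len(packet):
        # Only the stale suffix of the previous packet needs zeroing.
        view[len(packet) : prev_len] = bytes(prev_len - len(packet))
    return len(packet)
```

With packets under 1KB, each write touches at most a few kilobytes instead of the whole region.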

Severity: Critical

X3: Critical — No `requirements.txt`, `setup.py`, or `pyproject.toml` — zero Python dependency declarations

File: (missing — workspace root)

The workspace has no Python dependency management files at all. No requirements.txt, setup.py, setup.cfg, pyproject.toml, Pipfile, or poetry.lock.

Undocumented external dependencies:

aiohttp — used by bingx_user_stream.py
requests — used by gen_live_tests.py
python-dotenv — used by launcher.py
pytest — used by all test files and generators
zinc (SharedRegion C extension) — used by real_zinc_plane.py, real_control_plane.py
prod.bingx.* — 3+ modules outside workspace
prod.clean_arch.* — 5+ modules outside workspace

Without a requirements file:

No pinned versions → build non-reproducible
A fresh machine has no dependency list to install from, so the code runs only if the right packages happen to be present already
Version conflicts between environments cause silent behavior changes
CI cannot install dependencies deterministically
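As a starting point for writing the missing dependency list, the imports can be enumerated mechanically. The sketch below is an assumption about how one might seed that list, not an existing tool in the workspace. It reports import names, which can differ from distribution names (python-dotenv is imported as `dotenv`), and it relies on `sys.stdlib_module_names`, available from Python 3.10.

```python
import ast
import sys


def third_party_imports(source: str) -> set:
    """Top-level module names imported by `source` that are not in the stdlib."""
    # On Python < 3.10 the attribute is missing and stdlib names are not filtered.
    stdlib = getattr(sys, "stdlib_module_names", frozenset())
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 skips relative imports such as `from . import x`.
            found.add(node.module.split(".")[0])
    return {name for name in found if name not in stdlib}
```

Running this over every .py file and taking the union gives the candidate list; versions still have to be pinned by hand or from the currently working environment.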

Severity: Critical

X4: High — `RealZincControlPlane.update()` has no thread lock — concurrent calls corrupt sequence number and shared memory

File: real_control_plane.py:98-99

# No lock on RealZincControlPlane (unlike RealZincPlane which has self._lock)
def update(self) -> None:
    self._seq += 1                       # race: two threads read 5, both write 6
    self._write_region(self._seq, self._snapshot.as_dict())  # race: both write seq=6

RealZincPlane (real_zinc_plane.py:154) has a threading.Lock and uses with self._lock: around all write operations. RealZincControlPlane has no lock. If two threads call update() simultaneously:

Both read self._seq = 5, both increment to 6, both write with seq=6 → one write is lost
Both call _write_region simultaneously → concurrent writes to shared memory → data corruption
Sequence number under-counts: two update() calls advance the sequence by only one (5→6), so a reader comparing sequence numbers cannot tell that an update was missed
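The fix mirrors what RealZincPlane already does. The class below is a hypothetical stand-in (a Python list replaces the shared-memory region) that shows the increment and the write under one lock.

```python
import threading


class LockedControlPlane:
    """Hypothetical stand-in for RealZincControlPlane with a lock added."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._seq = 0
        self.writes = []  # stands in for the shared-memory region

    def _write_region(self, seq: int, payload: dict) -> None:
        self.writes.append((seq, payload))

    def update(self, payload=None) -> None:
        # Increment and write form one critical section, so every sequence
        # number is used exactly once and writes land in sequence order.
        with self._lock:
            self._seq += 1
            self._write_region(self._seq, payload or {})
```

Holding the lock across the write serializes writers within one process; it does nothing for a second process writing the same region.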

Severity: High

X5: High — `libc` declared in `Cargo.toml` but never used — dead dependency

File: _rust_kernel/Cargo.toml:8

[dependencies]
libc = "0.2"

The libc crate is declared as a dependency but grep 'libc' src/lib.rs returns zero matches. The code uses std::ffi::{c_char, CStr, CString} from the standard library (c_char has been available in std::ffi since Rust 1.64), not libc::c_char.

Not harmful at runtime (the unused crate is still fetched and compiled, but the kernel references none of its code), but:

Dead dependency to maintain (version bumps, audit)
Adds to supply chain attack surface
Indicates refactoring residue from an earlier version that used libc types directly

Severity: High

X6: High — 5 test files use hardcoded `sys.path.insert(0, "/mnt/dolphinng5_predict")` — non-portable, environment-specific path

Files: test_flaws.py:13, test_account_core_v2.py:16, test_account_reconcile_faults.py:15, test_alpha_blue_untouched_g7.py:13, test_exchange_event_seam_parity.py (similar)

Every mock-venue test file prepends /mnt/dolphinng5_predict to sys.path using a hardcoded absolute path. This path is specific to the current deployment machine. On any other machine, these tests fail with ModuleNotFoundError for prod.* imports.

real_zinc_plane.py:13-14 also adds a Zinc adapter path using Path(__file__).resolve().parents[3]. This is relative to the file's own location (better), but it assumes a fixed directory depth.
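Until the code is installable as a package, the test files could locate the repository root instead of hardcoding it. The sketch assumes the root is the nearest ancestor containing a `prod` directory; the environment variable name `DOLPHIN_REPO_ROOT` is hypothetical.

```python
import os
from pathlib import Path


def find_repo_root(start: Path, marker: str = "prod",
                   env_var: str = "DOLPHIN_REPO_ROOT") -> Path:
    """Return the explicit override if set, else the nearest ancestor holding `marker`."""
    override = os.environ.get(env_var)
    if override:
        return Path(override).resolve()
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise RuntimeError(f"no ancestor of {start} contains a {marker!r} directory")
```

A test file would then call `sys.path.insert(0, str(find_repo_root(Path(__file__).parent)))`, which works on any checkout location.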

Severity: High

X7: High — Shared memory `_decode_packet()` has no try/except on `json.loads` — partial body read causes unhandled `JSONDecodeError`, crashes reader

File: real_zinc_plane.py:120-130, real_control_plane.py:53-63

def _decode_packet(buf: memoryview) -> Dict[str, Any]:
    if len(buf) < 16: return {}
    seq, size = struct.unpack_from("!QQ", buf, 0)
    if size <= 0 or size > len(buf) - 16: return {}
    payload = bytes(buf[16 : 16 + size]).decode("utf-8")
    out = json.loads(payload)       # NO try/except — crash on partial body
    if isinstance(out, dict): out["_seq"] = seq
    return out

If a reader reads the shared memory at the exact moment when the 16-byte header is written but the JSON body is partially written (or not yet written), json.loads() receives truncated data and raises json.JSONDecodeError. This is not caught — the exception propagates up through all read paths:

RealZincPlane.read_slots() → crash
RealZincPlane.read_intents() → crash
RealZincPlane.read_control() → crash
RealZincControlPlane.read() → crash
RealZincPlane.__init__ open path → crash during init

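The header size check (size > len(buf) - 16) prevents reading beyond buffer bounds, but it doesn't prevent reading an incomplete body. The writer copies header+body in a single memcpy, but that copy is not atomic with respect to a reader in another thread or process: a reader that runs while the copy is in progress can observe a partial body. On x86-64 this is unlikely in practice; on weakly ordered architectures such as ARM, the stores can also become visible in a different order than they were issued.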

Severity: High

X8: High — `ExchangeEvent` and `ExchangeEventKind` not exported from `__init__.py` — package API inconsistency

File: __init__.py:44-88

The __init__.py exports 45+ names from 12 sub-modules but does not export ExchangeEvent, ExchangeEventKind, or ExchangePosition. Consumers import them directly via the raw module path:

from prod.clean_arch.dita_v2.exchange_event import ExchangeEvent

This is a package hygiene violation. mypy --strict flags this. IDE autocomplete fails for these types. If the module is restructured (e.g., exchange_event.py renamed to seam.py), every direct import fails at import time, because there is no package-level re-export to absorb the rename.

Severity: High

X9: Medium — No MSRV (`rust-version`) in `Cargo.toml`, no `rust-toolchain.toml` — builds differ per Rust version

File: _rust_kernel/Cargo.toml

[package]
edition = "2021"
# NO rust-version field

No rust-toolchain.toml, no CI config to pin a Rust version. cargo build uses whatever Rust version is on the builder. Cross-machine, cross-developer, cross-deployment builds can produce different binaries.

The code uses std::ffi::c_char (stabilized in Rust 1.64), so building with <1.64 fails. Any version >=1.64 could produce slightly different codegen. More importantly, if a .so built with one Rust version is loaded by a Python process that expects the output of a different Rust version, the ABI may differ.

Severity: Medium

X10: Medium — RealZincPlane and RealZincControlPlane both use `{prefix}_control` region name — collision when both are REAL

Files: real_zinc_plane.py:153, real_control_plane.py:72

# real_zinc_plane.py:
self.control_region = SharedRegion.create(f"{base}_control", 4096)

# real_control_plane.py (via region_name):
self.region = SharedRegion.create(f"{base}_control", ...)

When both DITA_V2_ZINC=REAL and DITA_V2_CONTROL_PLANE=REAL are set, the launcher creates both RealZincPlane(prefix="dita_v2") and RealZincControlPlane(prefix="dita_v2"). Both create/open a shared memory region named "dita_v2_control". They write different payload structures to the same region — one overwrites the other's data.
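One way to make the collision impossible is to put the owning component into the region name and to fail fast on a duplicate claim. Both `region_name` and `RegionRegistry` are hypothetical helpers for illustration; they are not part of the Zinc adapter.

```python
def region_name(prefix: str, component: str, kind: str) -> str:
    """Build a name that is unique per component, e.g. 'dita_v2_zinc_control'."""
    return f"{prefix}_{component}_{kind}"


class RegionRegistry:
    """Process-local record of which component owns which region name."""

    def __init__(self) -> None:
        self._owners = {}

    def claim(self, name: str, owner: str) -> str:
        existing = self._owners.setdefault(name, owner)
        if existing != owner:
            raise RuntimeError(f"region {name!r} already claimed by {existing}")
        return name
```

The registry only catches collisions inside one process; distinct names are what protect against a second process.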

Severity: Medium

X11: Medium — Sequence number (`_seq`) is decoded and injected into output dict but never read by any consumer — transmitted waste

Files: real_zinc_plane.py:128, real_control_plane.py:61

out["_seq"] = seq  # written to output dict

The sequence number is packed into the 16-byte header, transmitted, decoded, and injected into the output dict — but no consumer ever reads "_seq":

RealZincPlane.read_slots() reads payload.get("slots", []) — ignores _seq
RealZincPlane.read_intents() reads payload.get("items", []) — ignores _seq
RealZincControlPlane.read() reads payload.get("control") — ignores _seq

No gap detection, no staleness check, no ordering verification. The sequence number is dead data on the wire.
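A consumer-side sketch of what reading `_seq` could look like. `SeqTracker` and its status strings are hypothetical; the only assumption about the payload is the `_seq` key that `_decode_packet` already injects.

```python
from typing import Optional


class SeqTracker:
    """Classify each decoded payload by how its _seq relates to the previous one."""

    def __init__(self) -> None:
        self.last: Optional[int] = None

    def observe(self, payload: dict) -> str:
        seq = payload.get("_seq")
        if not isinstance(seq, int):
            return "missing"      # empty or torn packet, tracker state unchanged
        prev, self.last = self.last, seq
        if prev is None:
            return "first"
        if seq == prev:
            return "unchanged"    # nothing new since the last read
        if seq < prev:
            return "regressed"    # e.g. writer restarted and reset its counter
        return "next" if seq == prev + 1 else "gap"
```

A reader can skip work on "unchanged", log on "gap", and force a full resync on "regressed".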

Severity: Medium

X12: Medium — `_maybe_close()` uses `ThreadPoolExecutor` + `result(timeout=10.0)` — `TimeoutError` is swallowed, coroutine is left stranded

File: launcher.py:63-65

pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
fut = pool.submit(asyncio.run, result)
try:
    fut.result(timeout=10.0)       # TimeoutError if >10s
except Exception:
    pass                           # catches TimeoutError but coroutine still running

If the async close()/disconnect() coroutine takes longer than 10 seconds, fut.result(timeout=10.0) raises TimeoutError. The except Exception: pass catches it, but the coroutine is still running in the worker thread. When it eventually completes, it may mutate state (a self._closed event or similar attribute) on an object that the caller has already discarded.

Every _maybe_close call creates a new ThreadPoolExecutor(max_workers=1) without a with block and never calls .shutdown() on it. If multiple components are closed in sequence, each call leaves its own executor behind.
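A sketch of a bounded close that cancels the coroutine on timeout and always shuts the executor down. `close_with_timeout` is a hypothetical replacement, not the launcher's code, and it assumes the coroutine honors cancellation; one that swallows `CancelledError` would still block.

```python
import asyncio
import concurrent.futures
import logging

log = logging.getLogger("launcher_sketch")  # hypothetical logger name


async def _bounded(coro, timeout: float) -> None:
    # wait_for cancels `coro` when the timeout expires, so nothing keeps running.
    await asyncio.wait_for(coro, timeout)


def close_with_timeout(coro, timeout: float = 10.0) -> bool:
    """Run `coro` on a private event loop in a worker thread; True on clean close."""
    # The with block guarantees executor.shutdown() on every path.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        fut = pool.submit(asyncio.run, _bounded(coro, timeout))
        try:
            fut.result()
            return True
        except asyncio.TimeoutError:
            log.warning("close did not finish within %.1fs; coroutine cancelled", timeout)
            return False
        except Exception:
            log.exception("close raised")
            return False
```

The timeout moves from `fut.result()` into the event loop, which is what allows the coroutine to be cancelled instead of abandoned.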

Severity: Medium

X13: Medium — `__init__.py` re-exports 45 names from 12 modules — flat namespace risks naming collisions

File: __init__.py:44-88

The __init__.py flattens all imports into a single namespace. Examples:

BingxVenueAdapter (from .bingx_venue) and MockVenueAdapter (from .mock_venue) — no collision
RealZincUnavailable is defined in both .real_zinc_plane and .real_control_plane. The second is imported under the alias RealZincControlUnavailable, which avoids the collision but shows the risk

If any two sub-modules export the same name, the second import silently overwrites the first. No warning is raised.
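Because Python raises no warning, a collision has to be detected explicitly, for instance in a unit test. The function below is a hypothetical check that takes each sub-module's exported names as plain data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def find_export_collisions(exports: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Map each name exported by more than one module to the modules exporting it."""
    owners = defaultdict(list)
    for module, names in exports.items():
        for name in names:
            owners[name].append(module)
    return {name: mods for name, mods in owners.items() if len(mods) > 1}
```

In a real test the input could be built from each sub-module's `__all__`, assuming the sub-modules define one.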

Severity: Medium

X14: Medium — `real_control_plane.close()` is not idempotent, no `_closed` guard — double-close depends on C extension behavior

File: real_control_plane.py:85-86, real_zinc_plane.py:187-190

# Both implementations:
def close(self) -> None:
    self.intent_region.close()    # or self.region.close()
    # no _closed flag, no guard

Neither RealZincPlane.close() nor RealZincControlPlane.close() has a _closed guard. Calling close() twice calls SharedRegion.close() twice on the same region. The Zinc library's C extension behavior on double-close is unknown — it could segfault (use-after-free pattern common with C extensions) or silently return successfully. No Python-side protection.

Additionally, close() does not clear Python-level caches (_slot_cache, _intent_cache, _control_cache). After closing, stale data is still accessible from the cache.
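A Python-side guard makes the C extension's double-close behavior irrelevant. `ClosableRegions` is a hypothetical wrapper that assumes only that each region has a `close()` method and each cache has a `clear()` method.

```python
import threading


class ClosableRegions:
    """Close each region at most once and drop cached data on close."""

    def __init__(self, regions, caches=()) -> None:
        self._regions = list(regions)
        self._caches = list(caches)
        self._closed = False
        self._lock = threading.Lock()

    def close(self) -> None:
        with self._lock:
            if self._closed:
                return  # second and later calls are no-ops
            self._closed = True
        for region in self._regions:
            region.close()
        for cache in self._caches:
            cache.clear()  # no stale data survives the close
```

The same `_closed` flag could also make reads after close raise instead of returning cached values.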

Severity: Medium

Pass 21 Summary

#	Flaw	Layer	Severity
X1	No ABI compatibility check on Rust `.so` load — stale binary corrupts silently	Bridge	Critical
X2	`real_zinc_plane._write_region()` zeroes entire buffer before write — visible all-zero window	Plane	Critical
X3	No `requirements.txt`/`setup.py`/`pyproject.toml` — zero Python dependency declarations	Build	Critical
X4	`RealZincControlPlane.update()` no thread lock — concurrent calls corrupt seq and shared memory	Plane	High
X5	`libc` declared in `Cargo.toml` but never used — dead dependency	Rust	High
X6	5 test files use hardcoded `sys.path.insert(0, "/mnt/dolphinng5_predict")` — non-portable	Test	High
X7	`_decode_packet()` no try/except on `json.loads` — partial body read crashes reader	Plane	High
X8	`ExchangeEvent`/`ExchangeEventKind` not exported from `__init__.py` — package API inconsistency	Bridge	High
X9	No MSRV or `rust-toolchain.toml` — builds differ per Rust version	Rust	Medium
X10	`RealZincPlane` and `RealZincControlPlane` collide on `{prefix}_control` region name	Plane	Medium
X11	Sequence number decoded but never read by any consumer — dead data on wire	Plane	Medium
X12	`_maybe_close()` `fut.result(timeout=10.0)` — `TimeoutError` leaves coroutine stranded, executor leaks	Launcher	Medium
X13	`__init__.py` flat re-exports 45 names — naming collision risk	Bridge	Medium
X14	`close()` not idempotent on RealZincPlane/RealZincControlPlane — double-close risk	Plane	Medium

Pass 21 Severity

Severity	Count
Critical	3 (X1, X2, X3)
High	5 (X4, X5, X6, X7, X8)
Medium	6 (X9, X10, X11, X12, X13, X14)

Combined Catalog (All 21 Passes)

Pass	Focus	Count	Critical	High	Medium	Low	Info
A	Architectural	15	0	2	0	2	11
T	Threading/Atomicity	9	1	3	3	2	0
E	E2E Trace (Pass 1)	26	0	4	10	11	1
F	Deep E2E (Pass 3)	30	0	1	8	17	4
G	Domain Scans (Pass 4)	36	4	11	11	8	2
H	Edge Domains (Pass 5)	22	3	9	5	4	1
I	Pass 6 (Math/Tests/Recovery/Security)	22	3	11	4	2	2
J	Pass 7 (Test Infra/Data/Rust/Env/Conn)	16	0	7	7	2	0
K	Pass 8 (Observability/Memory/Time/DeadCode)	23	2	7	7	1	6
L	Pass 9 (Contracts/Events/Network/FFI/Diffs)	16	0	4	8	4	0
M	Pass 10 (Runtime/TestBugs/FSM/Persistence/Metrics)	18	3	7	5	3	0
N	Pass 11 (Async/Sync Seams/Locks/Threading)	10	4	1	3	1	1
O	Pass 12 (Sync/Async Wider Scope)	11	0	3	7	1	0
P	Pass 13 (FFI Safety/Dangling Pointers/Coverage)	9	1	3	3	1	1
Q	Pass 14 (Serde Edges/Backup Diffs/Market Data)	12	0	4	3	2	3
R	Pass 15 (Resource Leaks/Trust Boundaries/Security)	14	2	6	3	2	1
S	Pass 16 (Error Handling/Arithmetic/Test Infra)	16	4	7	5	0	0
T	Pass 17 (Unsafe Review/Dead Code/Build/Protocols)	14	0	5	5	4	0
U	Pass 18 (Rust Test Gaps/Accounting/FFI Types)	14	3	4	4	3	0
V	Pass 19 (Lifecycle/Rust Subtleties/Test Infra)	14	5	2	4	3	0
W	Pass 20 (Config/Math Signs/BingX Protocol)	14	4	7	3	0	0
X	Pass 21 (Rust Build/Deps/Python Packaging/Shared Mem)	14	3	5	6	0	0
Total		375	42	113	114	73	33
