Codex d9dd54c24e PINK: E2E trace analysis — Pass 4 domain scans (G1-G36)

Four systematic passes covering Rust kernel invariants (4 criticals — missing
EXIT_RESIDUAL action, unwrap() panic on NUL, backward FSM transition, stale
all_legs_done variable), config validation chain (zero validators on 127 fields),
persistence schema drift (7 confirmed field-level mismatches), and lifecycle
management (no signal handlers, no __del__, no exception safety in builder).

Co-authored-by: CommandCodeBot <noreply@commandcode.ai>

2026-06-01 14:26:36 +02:00

PINK DITAv2 — End-to-End Trace & Flaw Analysis

Analysis date: 2026-05-31

Method: Full-trace static analysis of every file, every data path, and every boundary crossing in the PINK execution pipeline. No test execution.

System scope: 34 active source files, ~12,000 lines across Rust kernel, Python bridge, venue adapter, runtime, and persistence.

E2E Data Flow (One Call)

Every E2E path in the PINK system traces through this sequence. Each numbered step below is a site where data crosses a module boundary and can be lost, mangled, or misinterpreted.

PinkDirectRuntime.step()                    # R1: policy cycle entry
  ├─ pump_venue_events()                    # R2: drain async fills
  ├─ kernel.snapshot()["account"]           # R3: read capital
  ├─ kernel.slot(0)                         # R4: read slot state
  ├─ decision_engine.decide()               # R5: policy-layer ENTER/EXIT
  ├─ intent_engine.plan()                   # R6: intent sizing
  ├─ _decision_to_kernel_intent()           # R7: Decision → KernelIntent
  ├─ kernel.process_intent(kernel_intent)   # R8: KERNEL BOUNDARY
  │   ├─ rust_backend._intent_to_payload()  # R8a: KernelIntent → JSON
  │   ├─ _RustKernelLib.process_intent()    # R8b: JSON → C FFI
  │   │   └─ Rust process_intent()          # R8c: FSM mutates TradeSlot
  │   ├─ venue.submit(intent)               # R9: VENUE BOUNDARY
  │   │   ├─ bingx_venue._legacy_intent()   # R9a: KernelIntent → LegacyIntent
  │   │   ├─ BingxDirectExecutionAdapter    # R9b: HTTP POST /trade/order
  │   │   │   .submit_intent()
  │   │   └─ bingx_venue._events_from_submit() # R9c: receipt → VenueEvent[]
  │   └─ on_venue_event(event)              # R10: FEEDBACK BOUNDARY
  │       ├─ _RustKernelLib → Rust FSM      # R10a: C FFI → FSM transition
  │       ├─ account.settle(delta)          # R10b: incremental PnL settlement
  │       └─ persistence writes             # R10c: ClickHouse / Zinc / HZ
  ├─ kernel.snapshot()["account"]           # R11: read final capital
  └─ persistence.persist_step()             # R12: PERSISTENCE BOUNDARY

Layer 1: Policy Cycle Entry (pink_direct.py:422)

E1: `step()` calls `pump_venue_events()` every cycle unconditionally

pink_direct.py:436

await self.pump_venue_events(snapshot, market_state=market_state)

This is called before reading slot/account state for the policy decision. The pump calls venue.reconcile(), which for BingxVenueAdapter issues up to 5 HTTP requests (balance, positions, open orders, plus history when include_history is set).

For MARKET-only workflows, no resting orders exist, so reconcile() returns empty events every time. But the HTTP calls still happen. On BingX VST with ~10 req/s limit and a 5s policy cycle, a full 5-request reconcile burns 1 req/s (0.6 req/s for the 3-request variant without history) just to learn "nothing changed." Add the actual trade HTTP calls, and the budget is tight.

Flaw: E1 — unconditional exchange poll wastes rate limit. Already documented as A10, but worse when traced E2E: each pump_venue_events calls venue.reconcile() → _backend_snapshot() → parallel asyncio.gather of 3 HTTP GETs. The _refresh_exchange_state at bingx_direct.py:281-352 always fetches balance + positions + openOrders concurrently. Even when include_history=False (which it is for the pump), that's 3 HTTP calls every policy cycle regardless of whether any orders are resting.

Severity: Medium. Wasteful but not destructive on testnet.
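
One possible mitigation, sketched under assumptions: poll on every cycle only while orders are resting, and otherwise fall back to a slow heartbeat so external changes are still noticed. The `PollGate` class and its `heartbeat_s` parameter are hypothetical and do not exist in the PINK codebase.

```python
from dataclasses import dataclass


@dataclass
class PollGate:
    """Decides whether a policy cycle should hit the exchange (hypothetical helper)."""

    heartbeat_s: float = 30.0            # assumed max gap between idle polls
    last_poll_ts: float = float("-inf")  # no poll has happened yet

    def should_poll(self, now: float, resting_orders: int) -> bool:
        # Always poll while orders are resting: fills can arrive at any time.
        if resting_orders > 0:
            return True
        # Otherwise poll only once the heartbeat interval has elapsed.
        return (now - self.last_poll_ts) >= self.heartbeat_s

    def mark_polled(self, now: float) -> None:
        self.last_poll_ts = now


gate = PollGate(heartbeat_s=30.0)
polls = 0
for cycle in range(12):                  # 12 cycles of 5 s = 60 s, MARKET-only
    now = cycle * 5.0
    if gate.should_poll(now, resting_orders=0):
        gate.mark_polled(now)
        polls += 1
```

In this toy run the gate polls twice in a minute instead of twelve times. Whether a 30 s heartbeat is safe depends on how quickly external account changes must be detected.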

E2: `kernel.snapshot()["account"]` returns a fresh dict, not a live view

pink_direct.py:437

acc = self.kernel.snapshot()["account"]

ExecutionKernel.snapshot() at rust_backend.py:740-752 builds a dict from kernel state at call time. The decision/intent engines then consume this snapshot. Between the snapshot and process_intent() (line 523), another caller (or the same runtime in a concurrent cycle) could advance the kernel state, making the decision based on stale capital.

Flaw: E2 — TOCTOU between capital snapshot and intent execution. The context.capital read at line 437 is used at line 523 for the ENTER safety guard (_unsafe_entry_reason) and possibly by the decision/intent engines. If capital changes between these two points (e.g. an async fill arrives via a concurrent test-HTTP path), the guard uses stale capital.

Severity: Low in single-threaded deployment. Critical under concurrency.
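
A common way to close this kind of gap is an optimistic version check: the snapshot carries a sequence number, and execution is refused if the state has moved on since. The sketch below is a toy model; `ToyKernel`, `seq`, and `StaleSnapshotError` are illustrative names, and nothing here reflects the real ExecutionKernel API.

```python
class StaleSnapshotError(RuntimeError):
    """Raised when an intent was planned against an outdated snapshot."""


class ToyKernel:
    def __init__(self, capital: float) -> None:
        self.capital = capital
        self.seq = 0  # bumped on every state mutation

    def snapshot(self) -> dict:
        return {"capital": self.capital, "seq": self.seq}

    def settle(self, delta: float) -> None:
        self.capital += delta
        self.seq += 1

    def process_intent(self, size: float, expected_seq: int) -> str:
        # Refuse to act on a decision computed from stale state.
        if expected_seq != self.seq:
            raise StaleSnapshotError(f"planned at seq {expected_seq}, now {self.seq}")
        return "ACCEPTED"


kernel = ToyKernel(capital=100.0)
snap = kernel.snapshot()
kernel.settle(-5.0)  # an async fill lands between snapshot and execution

try:
    kernel.process_intent(1.0, expected_seq=snap["seq"])
    stale_detected = False
except StaleSnapshotError:
    stale_detected = True

retry = kernel.process_intent(1.0, expected_seq=kernel.snapshot()["seq"])
```

The caller re-reads the snapshot and re-plans on failure, which costs one extra cycle but avoids guarding an ENTER with capital that no longer exists.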

Layer 2: Decision/Intent Bridging (pink_direct.py:79-115)

E3: `_decision_to_kernel_intent` drops `order_type` and `limit_price`

pink_direct.py:79-115

def _decision_to_kernel_intent(decision, intent, slot_id=0):
    return KernelIntent(
        ...
        # order_type and limit_price are NOT SET here
    )

KernelIntent has order_type="MARKET" and limit_price=0.0 as defaults, so MARKET orders work correctly. But the runtime never sets these fields from the policy layer. If decision or intent ever carries order_type or limit_price, it's silently dropped because the bridge doesn't map them.

Flaw: E3 — LIMIT support in runtime is dead code. The order_type/limit_price fields in KernelIntent and the LIMIT payload building in bingx_direct.py lines 384-398 are unreachable from the runtime. The only path that can set them is direct KernelIntent(...) construction in tests (_build_pink_bodies.py style scenarios). The _decision_to_kernel_intent bridge must be patched when a policy engine needs to emit LIMIT orders.

Severity: Medium. Blocks any production path to LIMIT orders.
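
A minimal sketch of what the patched bridge might look like, assuming the policy-layer objects expose optional `order_type` and `limit_price` attributes. `KernelIntentSketch` is a reduced stand-in for KernelIntent and `bridge_with_order_type` is a hypothetical name; the real bridge maps many more fields.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class KernelIntentSketch:
    """Stand-in for KernelIntent with only the fields relevant here."""

    trade_id: str
    target_size: float
    order_type: str = "MARKET"
    limit_price: float = 0.0


def bridge_with_order_type(decision, intent) -> KernelIntentSketch:
    # Prefer the intent's value, then the decision's, then the MARKET default.
    order_type = (
        getattr(intent, "order_type", None)
        or getattr(decision, "order_type", None)
        or "MARKET"
    )
    order_type = str(order_type).upper()
    limit_price = float(
        getattr(intent, "limit_price", None)
        or getattr(decision, "limit_price", None)
        or 0.0
    )
    # Fail loudly instead of silently downgrading a LIMIT order to MARKET.
    if order_type == "LIMIT" and limit_price <= 0.0:
        raise ValueError("LIMIT intent requires a positive limit_price")
    return KernelIntentSketch(
        trade_id=str(intent.trade_id),
        target_size=float(intent.target_size),
        order_type=order_type,
        limit_price=limit_price,
    )


decision = SimpleNamespace()
market = bridge_with_order_type(decision, SimpleNamespace(trade_id="t1", target_size=1.0))
limit = bridge_with_order_type(
    decision,
    SimpleNamespace(trade_id="t2", target_size=1.0, order_type="limit", limit_price=0.0834),
)
```

Existing MARKET flows are unaffected because missing attributes fall through to the same defaults KernelIntent already uses.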

E4: `_exit_intent_from_slot` trusts slot.size but slot may be stale

pink_direct.py:398-420

def _exit_intent_from_slot(self, kernel_intent):
    try:
        slot_size = float(self.kernel.slot(int(kernel_intent.slot_id)).size or 0.0)
    except Exception:
        slot_size = 0.0
    ...
    exit_size = min(policy_size, slot_size) if policy_ok else slot_size

Reads slot.size fresh from the Rust kernel at call time, then uses it to cap the exit size. Between this read and the process_intent call that actually executes the EXIT (line 523), the slot can be modified by pump_venue_events (line 436) or a concurrent cycle. If a partial fill arrived between the slot read and the EXIT, the exit size could be wrong.

Flaw: E4 — TOCTOU between exit sizing and exit execution. Same class as E2 but for exit size rather than capital. If the pump drained a partial fill between R4 (slot read) and R8 (process_intent), the EXIT requests a size based on pre-pump remaining size. The kernel caps it at actual remaining, so this is self-correcting — but the intent payload has wrong metadata.

Severity: Low. Self-correcting at kernel level.

Layer 3: Kernel Bridge — Rust FSM Entry (rust_backend.py)

E5: JSON serialization round-trip loses numeric precision

rust_backend.py:460-485 (_intent_to_payload)

KernelIntent fields like reference_price, target_size, leverage are Python floats. They're serialized to JSON text, sent through C FFI, parsed by serde_json into Rust f64, then serialized back to JSON, parsed by Python json.loads(). Each serialization step can introduce precision loss:

# Python float → JSON: 0.1 → "0.1" → Rust f64: 0.10000000000000000555
# Rust f64 → JSON: → serde_json may print "0.10000000000000001"
# Python json.loads → 0.10000000000000001

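For prices (TRXUSDT at ~$0.08), a 1e-16 relative error is negligible. Summed PnL is more exposed: per-step errors of about 1e-16 relative accumulate roughly linearly with trade count, so thousands of trades at 9x leverage give drift on the order of 1e-13 relative, which stays far below a cent for realistic account sizes. The |Δcapital − realized| < 1e-9 assertion in tests would catch gross errors but not slow accumulation that stays under that tolerance.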

Flaw: E5 — JSON serialization precision drift over long runs. Severity: Low. Not a practical concern for the current deployment scale.
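
The Python half of the round trip can be checked directly. The sketch below assumes CPython 3, where `json.dumps` writes floats with the shortest representation that parses back to the same value. It says nothing about the Rust half, which depends on how serde_json is built and configured.

```python
import json
import random

random.seed(0)  # fixed seed so the sample is reproducible

# A few hand-picked values from the price/size range plus a random sample.
values = [0.1, 0.0834, 1e-9, 123456.789]
values += [random.uniform(0.0, 1e6) for _ in range(10_000)]

# A value "survives" if encoding to JSON text and decoding gives back the
# identical float, bit for bit.
mismatches = [v for v in values if json.loads(json.dumps(v)) != v]
encoded_tenth = json.dumps(0.1)
```

If `mismatches` is empty, any drift on the full path would have to come from the Rust-side parse or print step, which is where a follow-up check should look.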

E6: `_RustKernelLib` is a global singleton — shared across all kernels

rust_backend.py:40-45

_RUST: _RustKernelLib | None = None

def _get_rust() -> _RustKernelLib:
    global _RUST
    if _RUST is None:
        _RUST = _RustKernelLib()
    return _RUST

The _RustKernelLib singleton loads the .so shared library once and provides FFI functions. Each ExecutionKernel instance gets its own KernelHandle via _get_rust().create(max_slots). The FFI functions take the handle as the first argument, so multiple kernels are isolated at the Rust level.

However, the singleton means ALL kernels share the same ctypes function pointer table. If a second kernel is created and the first is destroyed, KernelHandle of the first becomes a dangling pointer. Calling any FFI function on the destroyed kernel's handle is use-after-free.

Flaw: E6 — No protection against use-after-free on kernel destroy. Already documented as T7. Worth re-emphasizing in the E2E trace because the test infrastructure creates and destroys kernels frequently (fresh-kernel reconcile tests, each _build_rb() call in scenario wrappers).

Severity: High. Use-after-free in C FFI is memory corruption.
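
One defensive pattern on the Python side is to wrap the raw handle so that it is cleared on destroy and every later access fails in Python before reaching the FFI. This is a sketch only: `GuardedHandle` and `KernelClosedError` are hypothetical, and the destroy callback stands in for whatever the real library exposes.

```python
class KernelClosedError(RuntimeError):
    """Raised when a destroyed kernel handle is used."""


class GuardedHandle:
    def __init__(self, raw_handle, destroy_fn) -> None:
        self._raw = raw_handle
        self._destroy = destroy_fn

    @property
    def raw(self):
        # Every FFI call goes through this property, so a closed handle
        # raises here instead of reaching native code.
        if self._raw is None:
            raise KernelClosedError("kernel handle already destroyed")
        return self._raw

    def close(self) -> None:
        # Idempotent: a second close() must not double-free.
        if self._raw is not None:
            self._destroy(self._raw)
            self._raw = None

    def __enter__(self):
        return self

    def __exit__(self, *exc) -> None:
        self.close()


destroyed = []  # records what the fake destroy function was called with
with GuardedHandle(42, destroyed.append) as handle:
    value_while_open = handle.raw

handle.close()  # second close is a no-op

try:
    handle.raw
    use_after_close_blocked = False
except KernelClosedError:
    use_after_close_blocked = True
```

The context-manager form also gives test helpers a natural way to scope a kernel's lifetime.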

Layer 4: Rust Kernel FSM (lib.rs:728)

E7: ENTER handler silently allows re-entry with same trade_id

lib.rs:740-745

if !slot.is_free() && !slot.trade_id.is_empty() && slot.trade_id != intent.trade_id {
    return SLOT_BUSY;
}

If slot.trade_id == intent.trade_id, the ENTER is accepted even if the slot is not free (e.g., POSITION_OPEN with an active position). This is by design — it lets the same trade_id re-enter after the slot was partially reconciled or restored from a snapshot. But it also means:

EXIT sets slot.closed=true and transitions to CLOSED
A new ENTER with the same trade_id re-enters the CLOSED slot
The slot resets slot.closed=false, slot.size=0.0, slot.initial_size=0.0
Kernel now thinks the trade is new, but the Rust indexes still have the old trade_id pointing to slot 0

Downstream effect: After a re-entry with the same trade_id, the active_trade_index[trade_id] still correctly points to slot 0. But the old VenueOrder in client_order_index and venue_order_index is still present until the new entry fills and creates new orders. A reconcile event addressed to the old venue_client_id could stomp on the new trade.

Flaw: E7 — Re-entry with same trade_id leaves stale index entries. Severity: Low. The rebuild_indexes() call in commit_slot() rebuilds from scratch, so stale entries are cleared on the first write.

E8: EXIT handler uses `initial_size` not `current size`

lib.rs:770-775

let exit_ratio = slot.next_exit_ratio();
let base_size = if slot.initial_size > 0.0 { slot.initial_size } else { slot.size };
let exit_size = (base_size * exit_ratio).max(0.0);

Already documented as A1. In the E2E trace, this is the single most impactful execution flaw. A concrete scenario:

Enter size=1.0, initial_size=1.0, exit_leg_ratios=(0.5, 0.5, 1.0)
EXIT leg 0: requests 1.0 * 0.5 = 0.5. Slot goes to 0.5.
EXIT leg 1: requests 1.0 * 0.5 = 0.5. Slot goes to 0.0. active_leg_index advances to 2. exit_leg_ratios.len() is 3, so all_legs_done = (2 >= 3) = false. The slot stays at POSITION_OPEN, size=0.0, !closed.
EXIT leg 2 (ratio 1.0): exit_size = 1.0 * 1.0 = 1.0, but the slot is at 0.0. slot.is_free() is false because fsm_state=POSITION_OPEN is not in {IDLE, CLOSED}; the zero size does not make the slot free. So the ENTER guard !slot.is_free() blocks re-entry, while the EXIT guard slot.is_free() || slot.closed || size <= 0.0 triggers on size <= 0.0 and returns NO_OPEN_POSITION.
Slot is stuck forever. No operation can advance it.

Severity: High. Concrete, reproducible, and not caught by any test.
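
The scenario can be replayed with a small Python model of the guards described above. It is a toy, not a port of lib.rs: `ToySlot` is hypothetical, the result strings `EXIT_ACCEPTED` and `ENTER_ACCEPTED` are made up for illustration, and `enter()` models an ENTER carrying a different trade_id.

```python
from dataclasses import dataclass


@dataclass
class ToySlot:
    size: float
    initial_size: float
    ratios: tuple
    leg: int = 0
    state: str = "POSITION_OPEN"

    def is_free(self) -> bool:
        # Freedom depends on FSM state only, not on remaining size.
        return self.state in {"IDLE", "CLOSED"}

    def exit(self) -> str:
        if self.is_free() or self.size <= 0.0:
            return "NO_OPEN_POSITION"
        # Leg size is computed from initial_size, as in the excerpt above.
        base = self.initial_size if self.initial_size > 0.0 else self.size
        requested = max(base * self.ratios[self.leg], 0.0)
        self.size -= min(requested, self.size)  # capped at remaining size
        self.leg += 1
        if self.leg >= len(self.ratios):        # all_legs_done
            self.state = "CLOSED"
        return "EXIT_ACCEPTED"

    def enter(self) -> str:
        return "ENTER_ACCEPTED" if self.is_free() else "SLOT_BUSY"


slot = ToySlot(size=1.0, initial_size=1.0, ratios=(0.5, 0.5, 1.0))
results = [slot.exit(), slot.exit(), slot.exit(), slot.enter()]
```

After two accepted legs the model sits at size 0.0 in POSITION_OPEN with one leg left, and both the third EXIT and a fresh ENTER are refused. A regression test against the real kernel could assert the same four outcomes.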

E9: CANCEL handler returns diagnostic even when nothing happened

lib.rs:795-810

if matches!(intent.action, KernelCommandType::CANCEL) {
    let has_cancellable_exit = slot.active_exit_order.is_some();
    let has_cancellable_entry = slot.active_entry_order.is_some()
        && matches!(slot.fsm_state, ENTRY_WORKING | ORDER_REQUESTED | ORDER_SENT | IDLE);
    if !has_cancellable_exit && !has_cancellable_entry {
        return KernelResult {
            outcome: KernelOutcome {
                accepted: false,
                diagnostic_code: NO_ACTIVE_EXIT_ORDER,
                ...
            },
            ...
        };
    }
    return KernelResult {
        outcome: KernelOutcome {
            accepted: true,
            ...
        },
        ...
    };
}

Two issues:

When neither is cancellable, the diagnostic is NO_ACTIVE_EXIT_ORDER even if the actual reason is "no active entry order either" or "slot is already IDLE". The diagnostic is misleading.
When at least one IS cancellable, the Rust kernel returns accepted=true but does not mutate the slot at all — it returns immediately with the slot as-is. The actual cancel (HTTP call + FSM transition) happens in the Python bridge. The Rust kernel's "accept" just means "yes you may try to cancel this" — not "the cancel is complete."

This disconnect means: if the Python bridge's venue.cancel() fails (HTTP error), the Rust kernel has already returned accepted=true for a cancel that never happened. The caller sees accepted=true but the slot state hasn't changed.

Flaw: E9 — Rust CANCEL "accepts" before Python actually cancels. Severity: Medium. The outcome.accepted boolean is misleading for CANCEL.

E10: `apply_fill` entry branch double-sets `active_entry_order`

lib.rs:1330-1390

// First set — at the top of the entry branch:
slot.active_entry_order = Some(VenueOrder {
    ...
    filled_size: fill_size,
    status: if partial { PARTIALLY_FILLED } else { FILLED },
    ...
});

// ... then later for full fill:
if !partial {
    slot.fsm_state = TradeStage::POSITION_OPEN;
    slot.active_entry_order = Some(VenueOrder {  // SECOND SET
        ...
        filled_size: slot.size,    // uses updated slot.size
        ...
    });
}

The entry branch sets active_entry_order at the top with filled_size from the event, then for a FULL_FILL, sets it again with filled_size = slot.size (which may have been updated by slot.initial_size = fill_size above). The first VenueOrder's intended_size is from the event, the second uses slot.size. Both are correct in isolation, but the double-write is wasteful.

More importantly, for a PARTIAL_FILL entry, the first set is the ONLY set. If a second PARTIAL_FILL arrives for the same order, the entry branch at line 1334 checks slot.active_entry_order.is_some() which is true (set by the first partial), but the FSM state is ENTRY_WORKING (also set by first partial). The condition at line 1334-1338 matches ENTRY_WORKING, so the second partial enters the entry branch again. But fill_size is the event's filled_size — the total filled, not the incremental amount.

Suspected flaw: E10. A second PARTIAL_FILL on entry overwrites the slot fields rather than accumulating:

let fill_size = if event.filled_size > 0.0 {
    event.filled_size      // ← TOTAL filled, not incremental
} else {
    event.size
}.max(0.0);

slot.active_entry_order = Some(VenueOrder {
    ...
    filled_size: fill_size,  // ← overwrites previous filled_size
    ...
});

slot.initial_size = slot.initial_size.max(fill_size);  // ← OK, uses max
slot.size = fill_size;  // ← OVERWRITES previous size with total

On a RESTING LIMIT entry that partially fills in two events:

Event 1: filled_size=0.3 → slot.size=0.3, entry_order.filled_size=0.3
Event 2: filled_size=0.7 → slot.size=0.7, entry_order.filled_size=0.7

The filled_size on the VenueOrder reflects the cumulative fill (0.7), and slot.size moves from 0.3 to 0.7, an increment of 0.4. Overwriting is correct provided the venue reports cumulative filled_size rather than incremental. It does: bingx_venue._events_from_submit() at line ~480 reads:

filled_size = _row_float(ack_row, "executedQty", ...)

executedQty on BingX is cumulative, so the second event's filled_size=0.7 means "total filled across all fills = 0.7." The kernel sets slot.size = 0.7, which is the total position size, so the overwrite itself is not a defect.

But the second fill event has slot.entry_price overwritten by the new fill's price. If the first fill was at 0.0834 and the second at 0.0836, the slot's entry_price becomes 0.0836 — losing the blended average. For a LIMIT entry with two partial fills at different prices, the entry_price in the slot is the price of the LAST fill, not the VWAP.

Flaw: E10a — Entry price on multi-partial entry is last-fill, not VWAP. Severity: Low. Unrealized PnL computation uses this price. Error is small for tight spreads.
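
A blended entry price can be recovered from cumulative fill events by weighting each event's price by the size increment it represents. The sketch assumes each event's price is the price of that incremental fill; if the venue instead reports a running average price, the last value is already the VWAP and no blending is needed. `vwap_from_cumulative` is a hypothetical helper.

```python
def vwap_from_cumulative(events):
    """Blend prices over events given as (cumulative_filled_size, fill_price)."""
    filled = 0.0
    notional = 0.0
    for cum_size, price in events:
        increment = cum_size - filled
        if increment <= 0.0:
            continue  # duplicate or out-of-order event adds nothing new
        notional += increment * price
        filled = cum_size
    return notional / filled if filled > 0.0 else 0.0


# The two partial fills from the example above.
entry_price = vwap_from_cumulative([(0.3, 0.0834), (0.7, 0.0836)])
last_fill_price = 0.0836
```

Here the blended price is (0.3 × 0.0834 + 0.4 × 0.0836) / 0.7 ≈ 0.083514, about 0.1% below the last-fill price the slot currently stores.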

Layer 5: Venue Adapter Boundary (bingx_venue.py)

E11: `_legacy_intent()` is a lossy conversion

bingx_venue.py:270-285

@staticmethod
def _legacy_intent(intent: KernelIntent) -> LegacyIntent:
    action = LegacyDecisionAction.ENTER if intent.action == E.ENTER else ...
    side = LegacyTradeSide.SHORT if intent.side == TS.SHORT else ...
    metadata = dict(intent.metadata)
    metadata["_order_type"] = getattr(intent, "order_type", "MARKET")
    metadata["_limit_price"] = float(getattr(intent, "limit_price", 0.0) or 0.0)
    return LegacyIntent(
        timestamp=intent.timestamp,
        trade_id=intent.trade_id,
        decision_id=intent.intent_id,
        asset=intent.asset,
        action=action,
        side=side,
        reason=intent.reason,
        target_size=float(intent.target_size),
        leverage=float(intent.leverage),
        reference_price=float(intent.reference_price),
        confidence=1.0,           # ← HARDCODED
        bars_held=0,              # ← HARDCODED
        exit_leg_ratios=tuple(intent.exit_leg_ratios or (1.0,)),
        metadata=metadata,
    )

confidence is always 1.0 and bars_held is always 0. The LegacyIntent carries these to BingxDirectExecutionAdapter.submit_intent() which ignores them (it only reads asset, side, action, target_size, leverage, and metadata). So the hardcoded values don't affect execution — but they affect the ExecutionReceipt and any downstream consumers that might read receipt.confidence.

Flaw: E11 — Lossy conversion with hardcoded metadata. Severity: Informational. No downstream consumer reads these fields.

E12: `_events_from_submit()` price fallback chain can lose venue price

bingx_venue.py:375-400 (_events_from_submit)

base_event = VenueEvent(
    ...
    price=safe_float(getattr(receipt, "price", 0.0), 0.0),
    ...
)

# ... later for fill event:
fill_price = safe_float(
    _row_float(ack_row, "avgPrice", "ap", "price", "lastFillPrice",
               default=getattr(receipt, "price", 0.0)),
    0.0
)

The fill price is read from ack_row (the HTTP response dict) first, falling back to receipt.price (the ExecutionReceipt field). The ExecutionReceipt price comes from bingx_direct.py:434:

fill_price = 0.0
for key in ("avgPrice", "avgFilledPrice", "price", "lastFillPrice", "tradePrice"):
    try: value = float(ack_row.get(key) or 0.0)
    except: value = 0.0
    if value > 0: fill_price = value; break
if fill_price <= 0 and self._state is not None:
    fill_price = next((float(...)) for ... in self._state.open_positions.values() ...)

So the price flows: BingX HTTP ack → ack_row[key] → receipt.price → _events_from_submit() → fill_price in VenueEvent.

If ack_row has no price field AND self._state.open_positions has no matching position (e.g., first fill on a new entry), fill_price stays 0.0. The kernel's apply_fill at lib.rs:1397 checks if event.price > 0.0 before setting entry_price — so a zero fill price leaves entry_price at 0.0. This means:

The slot's entry_price stays 0.0
realized_pnl() at lib.rs:662 checks if slot.entry_price <= 0.0 → returns 0.0
PnL is never computed for this fill
Capital never settles

This is very unlikely on BingX VST, which always returns avgPrice in order acknowledgements. But on any venue that doesn't, PnL is silently zeroed.

Flaw: E12 — Zero fill price → zero entry_price → zero PnL. Severity: Medium. Silent PnL loss if venue returns no price.
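
One way to make the failure visible is to refuse a price-less fill at the adapter boundary instead of passing 0.0 downstream. The sketch below is an assumption about how such a guard could look; `resolve_fill_price` is a hypothetical helper and its key list is abbreviated from the excerpt above.

```python
def resolve_fill_price(ack_row, fallback=0.0,
                       keys=("avgPrice", "price", "lastFillPrice")):
    """Return the first positive price found, or raise if there is none."""
    for key in keys:
        try:
            value = float(ack_row.get(key) or 0.0)
        except (TypeError, ValueError):
            value = 0.0  # unparseable field: treat as missing
        if value > 0.0:
            return value
    if fallback > 0.0:
        return fallback
    # Raising here surfaces the problem instead of silently zeroing PnL.
    raise ValueError("fill event carries no usable price")


from_ack = resolve_fill_price({"avgPrice": "0.0834"})
from_fallback = resolve_fill_price({"avgPrice": "0"}, fallback=0.0835)

try:
    resolve_fill_price({})
    missing_price_raised = False
except ValueError:
    missing_price_raised = True
```

Whether to raise, retry a position query, or park the fill for reconcile is a policy choice; the point is only that a zero price should not flow into apply_fill unnoticed.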

E13: `_backend_snapshot()` timeout returns stale data

bingx_venue.py:290-320

def _backend_snapshot(self, *, include_history=False, timeout_ms=5000.0):
    if not self._snapshot_ready.wait(timeout=timeout_ms / 1000.0):
        with self._snap_lock:
            return self._last_snapshot  # ← STALE DATA

If the previous snapshot fetch is still in-flight when a new caller arrives, the timeout returns self._last_snapshot — which could be seconds or minutes old. The caller (e.g., submit()) then uses this stale snapshot to compute _filled_size_from_snapshots() — potentially comparing stale "before" data with fresh "after" data, producing a wrong delta.

Flaw: E13 — Stale snapshot fallback causes wrong fill-size detection. Severity: Medium. The _filled_size_from_snapshots diff can be wrong.
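
A possible refinement is to return the snapshot's age alongside the data so callers can decline to diff against stale state. This is a simplified sketch, not the adapter's real synchronization: `SnapshotCache`, `max_age_s`, and the injectable `now` argument are all hypothetical.

```python
import threading
import time


class SnapshotCache:
    def __init__(self, max_age_s: float = 2.0) -> None:
        self._lock = threading.Lock()
        self._ready = threading.Event()
        self._snapshot = None
        self._taken_at = float("-inf")  # nothing published yet
        self.max_age_s = max_age_s

    def publish(self, snapshot, now=None) -> None:
        with self._lock:
            self._snapshot = snapshot
            self._taken_at = time.monotonic() if now is None else now
        self._ready.set()

    def get(self, timeout_s: float, now=None):
        """Return (snapshot, is_fresh); never hides staleness from the caller."""
        self._ready.wait(timeout=timeout_s)
        with self._lock:
            current = time.monotonic() if now is None else now
            is_fresh = (current - self._taken_at) <= self.max_age_s
            return self._snapshot, is_fresh


cache = SnapshotCache(max_age_s=2.0)
empty = cache.get(timeout_s=0.0, now=0.0)     # nothing published yet
cache.publish({"positions": 1}, now=100.0)
recent = cache.get(timeout_s=0.0, now=101.0)  # 1 s old
old = cache.get(timeout_s=0.0, now=110.0)     # 10 s old
```

A caller such as a fill-size diff could then skip the comparison, or force a refetch, whenever `is_fresh` is false.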

E14: `_events_from_cancel` uses stale `slot_id` from order metadata

bingx_venue.py:485-510

VenueEvent(
    ...
    slot_id=int(order.metadata.get("slot_id", 0) or 0),
    ...
)

The slot_id in the CANCEL event comes from the VenueOrder.metadata which was set when the order was created (in Rust FSM's process_intent or on_venue_event). If the slot was re-assigned or the kernel's slot count changed since order creation, this slot_id is wrong. The Rust kernel's resolve_slot() at lib.rs:610-624 would use the event's slot_id (the stale one) and find the wrong slot.

Flaw: E14 — Cancel event carries stale slot_id from order creation. Severity: Low. Slots are stable and never renumbered.

Layer 6: BingX Direct Adapter (bingx_direct.py)

E15: Submit sets leverage via separate HTTP call

bingx_direct.py:376-379

await self._client.signed_post(
    "/openApi/swap/v2/trade/leverage",
    {"symbol": symbol, "side": "BOTH", "leverage": leverage},
)

This is a POST to set exchange leverage before each order. If this call fails (rate limit, network error), the exception at line 417 sets status = "RATE_LIMITED" and returns a rejection — the order is NOT submitted. But the error handling at line 417 catches BingxHttpError for the leverage call AND the order call with the same handler. If the leverage call fails with a non-rate-limit error (e.g., 400 Bad Request for invalid symbol), the status is "REJECTED" and no order is placed. This is correct behavior — but the error message doesn't distinguish "leverage set failed" from "order submission failed."

Flaw: E15 — Leverage-set failure and order failure share error handler. Severity: Low. Correct behavior, poor diagnostics.

E16: `_format_quantity` and `_format_price` use `_instrument_step`/`_instrument_tick` — both may be zero

bingx_direct.py:234-268

def _instrument_step(self, asset):
    instrument = self._resolve_instrument(asset)
    if instrument is not None:
        try: return Decimal(str(instrument.size_increment.as_decimal()))
        except: pass
    return Decimal("0.001")  # fallback

def _format_quantity(self, asset, quantity):
    step = self._instrument_step(asset)
    if step <= 0:
        return str(max(0.0, quantity))
    ...

If _resolve_instrument returns None (asset not in provider), step=0.001 and tick=0.01. These defaults are correct for most USDT perpetuals on BingX VST, but may be wrong for non-standard symbols. The format functions still produce a valid string — just possibly with wrong precision.

More concerning: _resolve_instrument at line 211-226 tries three lookup strategies and iterates all instruments on the third. This iteration is O(n) in the number of instruments and happens on EVERY submit_intent() call. With 540 instruments, this is ~0.5ms, which is acceptable. But _instrument_step and _instrument_tick each call _resolve_instrument independently, and _instrument_venue_symbol at line 358 calls it again, so submit_intent() resolves the instrument three times: once for quantity, once for price, and once for the venue symbol. In the worst case that is three full-instrument-list iterations per order.

Flaw: E16 — Instrument resolution called 3x per order with O(n) scan. Severity: Low. Performance, not correctness.
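
Memoizing the lookup per asset removes the repeated scans. The sketch below uses a plain dict cache over a list of dicts; `InstrumentResolver` and the `symbol` key are assumptions made for illustration, and the real provider objects are richer.

```python
class InstrumentResolver:
    def __init__(self, instruments) -> None:
        self._instruments = list(instruments)
        self._cache = {}
        self.scans = 0  # counts full-list iterations, for demonstration

    def resolve(self, asset):
        if asset in self._cache:
            return self._cache[asset]
        self.scans += 1
        # Linear scan, done at most once per asset.
        found = next((i for i in self._instruments if i["symbol"] == asset), None)
        self._cache[asset] = found  # misses are cached too
        return found


instruments = [{"symbol": f"SYM{n}USDT", "step": "0.001"} for n in range(540)]
resolver = InstrumentResolver(instruments)

# One order: symbol lookup, quantity formatting, price formatting.
for _ in range(3):
    hit = resolver.resolve("SYM539USDT")
```

Because misses are cached as well, the cache must be cleared whenever the instrument list is reloaded.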

E17: Cancel uses truth-based confirmation — can mask real errors

bingx_direct.py:474-498

still_open = True
try:
    oo = await self._client.signed_get("/openApi/swap/v2/trade/openOrders", ...)
    ...
    still_open = (venue_order_id in ids) if venue_order_id else (venue_client_id in cids)
except Exception:
    still_open = None

if still_open is False:
    return {"status": "CANCELED", ...}
if str(delete_resp.get("status", "")).upper() in {"CANCELED", "CANCELLED", "SUCCESS", "OK"}:
    return {"status": "CANCELED", ...}
return {"status": delete_resp.get("status", "REJECTED"), ...}

The cancel logic:

1. DELETE the order on BingX
2. GET open orders to verify
3. If the order is no longer open, return CANCELED
4. If the DELETE response says CANCELED, return CANCELED
5. Otherwise return REJECTED

If step 2's GET fails (network error, rate limit), still_open=None. Then step 4 checks the DELETE response. If the DELETE also returned an error (e.g., "order not found" because it was already cancelled by another caller), status is "ERROR" or "not found" — neither matches "CANCELED". The cancel is reported as REJECTED even though the order IS cancelled.

The bingx_venue._events_from_cancel() then emits CANCEL_REJECT instead of CANCEL_ACK. The Rust kernel handles CANCEL_REJECT at lib.rs:1218:

KernelEventKind::CANCEL_REJECT => {
    if slot.fsm_state == TradeStage::EXIT_WORKING {
        slot.fsm_state = TradeStage::EXIT_WORKING;  // no-op
    }
    diagnostic_code = KernelDiagnosticCode::CANCEL_REJECTED;
}

The slot stays in its current state (e.g., EXIT_WORKING) with no active order (the exchange has no record of it). The slot is stuck until a manual reconcile.

Flaw: E17 — Cancel can return false REJECTED for already-cancelled orders. Severity: Medium. Leads to stuck slot requiring manual intervention.
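
The ambiguity could be made explicit with a third outcome for the case where verification failed and the DELETE response is inconclusive. This is a sketch: `classify_cancel` and the `UNKNOWN` status are hypothetical, and the real adapter returns dicts rather than strings.

```python
CANCEL_OK = {"CANCELED", "CANCELLED", "SUCCESS", "OK"}


def classify_cancel(still_open, delete_status) -> str:
    """still_open: True, False, or None when the verifying GET failed."""
    status = str(delete_status or "").upper()
    if still_open is False:
        return "CANCELED"   # exchange confirms the order is gone
    if status in CANCEL_OK:
        return "CANCELED"   # DELETE itself reported success
    if still_open is None:
        return "UNKNOWN"    # could not verify: do not claim a rejection
    return "REJECTED"       # order verifiably still open


outcomes = {
    "verified_gone": classify_cancel(False, "ERROR"),
    "delete_ok": classify_cancel(None, "canceled"),
    "unverified": classify_cancel(None, "ERROR"),
    "still_open": classify_cancel(True, "ERROR"),
}
```

An `UNKNOWN` result would let the venue adapter schedule a reconcile instead of emitting CANCEL_REJECT for an order the exchange may no longer hold.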

Layer 7: Fill Feedback Loop (rust_backend.py on_venue_event)

E18: `on_venue_event` settles PnL incrementally — but fees are never included

rust_backend.py:530-545

incremental_pnl = slot.realized_pnl - self._last_settled_pnl.get(slot.slot_id, 0.0)
if abs(incremental_pnl) > 1e-12:
    self.account.settle(incremental_pnl)
    self._last_settled_pnl[slot.slot_id] = slot.realized_pnl

The Rust kernel's apply_fill computes realized PnL as:

let realized = Self::realized_pnl(slot, event.price, fill_size);
slot.realized_pnl += realized;

No fee subtraction. No commission reading from the event. The VenueEvent could carry fee data via metadata["fee"] or raw_payload["commission"], but the Rust kernel doesn't read it and the Python bridge doesn't extract it.

Over the 142 live test scenarios on VST (where fees are 0 or negligible), this is invisible. On live mainnet with exchange fees of 0.02-0.04%, the cumulative error is unbounded.

Flaw: E18 — PnL settlement ignores fees. Already documented as A7. In the E2E trace, the gap is specifically here: VenueEvent.price is used for realized_pnl() but VenueEvent.metadata (which could carry commission from the venue) is never read.

Severity: Medium (grows with trade volume).
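
A back-of-envelope estimate shows how quickly the gap opens. The 0.02-0.04% fee range and the 9x leverage come from this analysis; the $100 margin per trade, the 1,000 round trips, and the assumption that the same rate applies on entry and exit are illustrative only.

```python
def fee_drag(notional_per_trade: float, fee_rate_per_side: float, round_trips: int) -> float:
    """Total fees missing from settled PnL, charging entry and exit."""
    return notional_per_trade * fee_rate_per_side * 2 * round_trips


margin = 100.0                # assumed margin per trade, USD
leverage = 9.0
notional = margin * leverage  # 900 USD of exposure per trade

low = fee_drag(notional, 0.0002, 1_000)   # 0.02% per side
high = fee_drag(notional, 0.0004, 1_000)  # 0.04% per side
```

Under these assumptions the kernel's capital would overstate the exchange balance by roughly $360 to $720 after 1,000 round trips, several times the margin of a single trade.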

E19: `observe_slots` called with ALL slots, not just changed ones

rust_backend.py:538-545

slots = [self._get_slot(i) for i in range(self.max_slots)]
self.account.observe_slots(slots)

Every on_venue_event call re-reads ALL slots from the Rust kernel (N FFI calls) and calls observe_slots with the full list. With max_slots=10, this is 10 FFI round-trips per venue event. Each round-trip serializes a TradeSlot to JSON, passes through C FFI, parses on the Rust side, serializes the result, passes back, and parses on the Python side. For a multi-leg EXIT with 3 events (ACK + PARTIAL + FULL), that's 3 × 10 = 30 slot reads per process_intent call.

Flaw: E19 — Full-slot-list read on every event is N×FFI overhead. Severity: Low (performance). Not a correctness issue.

Layer 8: Persistence Boundary (pink_clickhouse.py)

E20: `_capital()` reads live from `AccountProjection` — stale row risk

pink_clickhouse.py:199-200

def _capital(self) -> float:
    return float(self.account.snapshot.capital or 0.0)

Every row writer calls _capital() at write time to get the current capital. But persist_result() is called AFTER kernel.process_intent() returns — at which point the account has already been settled. The account_events, position_state, and trade_events rows all record the SAME capital value (the post-settle value). capital_before is then reconstructed by subtracting PnL (already documented as A5).

The effect: all ClickHouse rows for a single process_intent() call show identical capital / account_capital / portfolio_capital values, because they're all written within the same Python call stack with no intervening events. This is correct for single-threaded operation — all rows reflect POST-trade state. But it means ClickHouse querying for "capital before trade" must use capital_after - pnl, which is the wrong formula under multi-slot.

Flaw: E20 — All persistence rows write post-trade capital, not pre-trade. Already documented as A5 from the capital_before angle.

Severity: High for multi-slot accounting reconstruction.

E21: `persist_fill_events()` synthesizes fake Decision/Intent

pink_clickhouse.py:383-435

def persist_fill_events(self, *, snapshot, events, slot_dict, market_state):
    ...
    decision = Decision(
        timestamp=ts, decision_id=trade_id or "async", asset=asset,
        action=action, side=side, reason="ASYNC_FILL",
        confidence=0.0, velocity_divergence=0.0, irp_alignment=0.0,
        reference_price=price, target_size=cur_size, leverage=leverage,
        ...
    )
    intent = Intent(
        timestamp=ts, trade_id=trade_id, decision_id=trade_id or "async",
        ...
    )

The async fill pump (called by pump_venue_events) constructs fake Decision/Intent objects because there's no real policy decision backing an async fill — it just arrived from the exchange. These synthetic objects have:

decision_id = trade_id (or "async" if trade_id is empty)
decision_id and trade_id are the same string
confidence=0.0, velocity_divergence=0.0, irp_alignment=0.0
target_size = cur_size (the remaining size after the fill, not the size that was filled)

These are written to policy_events, trade_reconstruction, and trade_events with the same row shapes as real policy-driven fills. Any ClickHouse query that joins policy_events to trade_events on decision_id will find matching rows (both set to trade_id), but the policy_events row's target_size is the POST-fill size, not the pre-fill size. A replay system that reconstructs position from policy_events → trade_reconstruction would see incorrect sizing.

Flaw: E21 — Async fill persistence uses synthetic decision with wrong data. Severity: Medium. Misleading historical records.

E22: `_write_trade_exit_leg` capital_before uses arithmetic reconstruction

pink_clickhouse.py:761-762

capital_after = self._capital()
capital_before = capital_after - pnl_leg

Already documented as A5. In the E2E trace, the specific path is:

Slot 0 exit leg fills → _capital() returns capital AFTER settlement (because the kernel's on_venue_event already called account.settle)
capital_before = capital_after - pnl_leg reconstructs pre-leg capital

If slot 1 also settled between the leg fill and the persistence write (possible in a multi-threaded or concurrent scenario), capital_after includes slot 1's PnL, and capital_before is wrong by exactly slot 1's contribution.

Severity: High for multi-slot.
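
The size of the error is easy to see with made-up numbers. In the sketch below the starting capital and both PnL values are invented for illustration; only the formula capital_before = capital_after - pnl_leg comes from the code above.

```python
capital = 1000.0
true_capital_before_leg = capital  # what the row should record

pnl_leg_slot0 = 12.0
pnl_slot1 = -5.0

capital += pnl_leg_slot0  # slot 0 exit leg settles
capital += pnl_slot1      # slot 1 settles before the row is written

capital_after = capital                                # read at write time
reconstructed_before = capital_after - pnl_leg_slot0   # formula under review
error = reconstructed_before - true_capital_before_leg
```

The reconstructed value is off by exactly slot 1's PnL. Capturing capital before settlement and passing it to the writer would avoid the reconstruction altogether.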

E23: `_write_trade_event` uses `slot_dict.get("entry_price")` as exit_price

pink_clickhouse.py:813-815

entry_price = _safe_float(slot_dict.get("entry_price", 0.0), ...)
exit_price = _safe_float(slot_dict.get("entry_price", 0.0), ...)  # ← SAME FIELD

Already documented as A13. The exit_price is set to entry_price from the same slot dict field. The BingX ack payload does contain the fill price, but it's not propagated to the slot dict's entry_price for exit fills — the slot's entry_price is set during entry fill and remains unchanged during exit. The exit fill price is only on the VenueEvent, which is not passed through to _write_trade_event.

The trade_events row in ClickHouse always shows exit_price == entry_price, making PnL reconstruction from (exit_price - entry_price) × size × lev impossible. The pnl field IS correct (it's slot.realized_pnl), but only the summary is accurate — the component prices are wrong.

Severity: Low. pnl is correct, only the decomposed price is wrong.

Layer 9: Test Infrastructure

E24: `MockVenueAdapter.submit()` always emits fill on `partial_fill_ratio > 0`

mock_venue.py:60-90

if self.scenario.emit_fill_on_submit or self.scenario.partial_fill_ratio > 0:
    fill_ratio = max(0.0, min(1.0, float(effective_ratio)))
    ...
    if is_entry:
        effective_ratio = self.scenario.entry_partial_fill_ratio if \
            self.scenario.entry_partial_fill_ratio != 1.0 else \
            self.scenario.partial_fill_ratio
    else:
        effective_ratio = self.scenario.exit_partial_fill_ratio ...

The default MockVenueScenario() has partial_fill_ratio=1.0. So every submit() call on a default mock emits a FULL_FILL event immediately. This means mock-venue tests always test the "order fills instantly" path — they never test resting orders, partial fills, or async fills.

Any test that relies on the mock venue is testing a subset of real venue behavior. The mock never produces:

DELAYED fills (fill arrives on a later reconcile() call)
PARTIAL fills with subsequent fills
Partial fills during entry (entry fills partially, then more later)
Mixed entry/exit partial behavior

Flaw: E24 — Mock venue always fills synchronously — never tests async path. Severity: Medium. The pump_venue_events() path has never been exercised with the mock venue.
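A mock that would exercise the async path could look roughly like the following. `DelayedFillMock`, `FillEvent`, and the `fill_ratios` parameter are hypothetical names for illustration; the real MockVenueAdapter and VenueEvent have different shapes. The sketch acks on submit and releases one queued fill per reconcile() call.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class FillEvent:
    """Hypothetical event shape; the real VenueEvent has more fields."""
    order_id: str
    kind: str          # "ACK", "PARTIAL_FILL" or "FULL_FILL"
    filled_size: float


class DelayedFillMock:
    """Acks on submit and releases one queued fill per reconcile() call."""

    def __init__(self, fill_ratios):
        # e.g. (0.25, 0.75): a 25% partial fill, then the remaining 75%.
        self._ratios = tuple(fill_ratios)
        self._pending = deque()

    def submit(self, order_id: str, size: float):
        remaining = len(self._ratios)
        for ratio in self._ratios:
            remaining -= 1
            kind = "FULL_FILL" if remaining == 0 else "PARTIAL_FILL"
            self._pending.append(FillEvent(order_id, kind, size * ratio))
        return [FillEvent(order_id, "ACK", 0.0)]   # nothing fills synchronously

    def reconcile(self):
        return [self._pending.popleft()] if self._pending else []


venue = DelayedFillMock(fill_ratios=(0.25, 0.75))
on_submit = venue.submit("o-1", size=0.004)
first = venue.reconcile()
second = venue.reconcile()
third = venue.reconcile()
```

With this shape, a test has to call reconcile() (or whatever drains the venue) before any fill reaches the kernel, which is the path the default scenario never takes.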

E25: Test scenarios use MARKET-only `_si()` helper — no LIMIT tests

gen_live_tests.py and _gen_test.py

The _si() helper constructs a KernelIntent with order_type="MARKET" and limit_price=0.0 (the defaults). All 157 live test scenarios use _si(). The 3 "LIMIT" scenarios (limit_does_not_fill, limit_immediate_fill) use reference_price=0.0 and target_size=-0.001 respectively — they test intent validation, not actual LIMIT order submission.

There is zero live-test coverage of:

Submitting a LIMIT order that rests on the book
A resting LIMIT being cancelled
A resting LIMIT receiving a partial fill then a subsequent fill
An async fill arriving via pump_venue_events()

The Rust kernel's PARTIAL_FILL event handling and the Python bridge's on_venue_event + incremental settle + async pump have never been exercised on a live exchange.

Flaw: E25 — Zero live tests for LIMIT/resting/async-fill paths. Severity: High. The partial-fill code path is untested in production.

E26: Fresh-kernel reconcile doesn't restore venue state

gen_live_tests.py (fresh_kernel_reconcile_entry body)

fresh = _build_fresh_kernel_from_slot(slot_data, ic=cb)
k2 = fresh.runtime.kernel

The _build_fresh_kernel_from_slot function creates a new PinkDirectRuntime with a new ExecutionKernel. But the venue adapter is shared or re-created with the same BingX backend. Two kernels making concurrent HTTP calls to BingX through shared or separate venue adapters is exactly the multi-threaded scenario that triggers T1 (Rust kernel UB) — except the tests are sequential, not concurrent, so they don't trigger it.

The fresh kernel does NOT restore the venue state (open orders, positions). The fresh kernel has a blank venue adapter state — it can't know about previous LIMIT orders resting on the exchange. This is correct for MARKET-only tests (no resting orders) but would fail for LIMIT tests.

Flaw: E26 — Fresh-kernel reconcile doesn't restore venue state. Severity: Medium (would break LIMIT scenarios).

Summary: Critical E2E Flaw Chain

The most dangerous E2E scenario is a LIMIT order with partial fills on a live exchange:

1. Policy emits LIMIT ENTER                       [E3: can't happen — bridge drops order_type]
2. KernelIntent with order_type="LIMIT"            [dead code path from step 1]
3. bingx_direct.submit_intent builds LIMIT payload [works if reached]
4. BingX accepts LIMIT, returns ACK with no fill   [VenueEvent.price may be 0]
5. FSM transitions to ENTRY_WORKING                [correct]
6. RESTING LIMIT sits on book                      [no further kernel events]
7. Next policy cycle: pump_venue_events()           [E1: expensive HTTP calls]
8. Reconciled venue has no fill events              [nothing to drain]
9. Repeated cycles with no progress                 [wasteful but safe]
10. Eventually BingX fills partially               [VenueEvent arrives]
11. apply_fill PARTIAL_FILL entry branch runs       [E10: entry_price = last fill, not VWAP]
12. on_venue_event settles incremental PnL          [E18: fees not included]
13. persistence writes                              [E20/E21/E22/E23: wrong capital_before, exit_price]
14. Remaining LIMIT still rests on book             [continues to step 7]
15. Eventually full fill or cancel                  [E17: cancel can return false REJECTED]

None of steps 4-15 have live test coverage.

Complete Flaw Catalog (All Layers)

#	Flaw	Layer	Step	Severity
E1	Unconditional pump_venue_events wastes rate limit	Runtime	R2	Medium
E2	TOCTOU between capital snapshot and intent	Runtime	R3→R8	Medium
E3	Runtime bridge drops order_type/limit_price	Bridging	R7	Medium
E4	TOCTOU between exit sizing and execution	Runtime	R8	Low
E5	JSON precision drift over long runs	Bridge	R8a→R8c	Low
E6	Global FFI singleton no guard vs use-after-free	Bridge	R8b	High
E7	Same-trade-id re-entry leaves stale index entries	Rust	R8c	Low
E8	EXIT uses initial_size not remaining size	Rust	R8c	High
E9	CANCEL "accepted" before cancel actually happens	Rust	R8c	Medium
E10	Entry price on multi-partial fill = last fill, not VWAP	Rust	R10a	Low
E11	_legacy_intent hardcodes confidence/bars_held	Venue	R9a	Info
E12	Zero fill price → zero PnL	Venue	R9c	Medium
E13	Stale snapshot fallback causes wrong fill delta	Venue	R9c	Medium
E14	Cancel event carries stale slot_id	Venue	R9c	Low
E15	Leverage-set failure and order failure share handler	Adapter	R9b	Low
E16	Instrument resolution 3x per order, O(n) scan	Adapter	R9b	Low
E17	Cancel returns false REJECTED for already-cancelled	Adapter	R9b	Medium
E18	PnL settlement ignores fees	Bridge	R10b	Medium
E19	Full-slot-list read on every event = N×FFI overhead	Bridge	R10b	Low
E20	All persistence rows write post-trade capital	Persistence	R12	High
E21	Async fill uses synthetic Decision with wrong size	Persistence	R12	Medium
E22	capital_before arithmetic reconstruction wrong	Persistence	R12	High
E23	trade_events exit_price = entry_price	Persistence	R12	Low
E24	Mock venue always fills synchronously	Test	—	Medium
E25	Zero live tests for LIMIT/async-fill paths	Test	—	High
E26	Fresh-kernel reconcile doesn't restore venue	Test	—	Medium

Total: 26 E2E flaws (5 High, 11 Medium, 9 Low, 1 Info)

The five High-severity flaws in the E2E trace (E20 and E22 are listed together):

E6: Global FFI singleton + __del__ use-after-free — memory corruption risk
E8: Exit-size overshoot — slot can get stuck (A1)
E20/E22: Post-trade capital in all persistence rows + arithmetic capital_before — ClickHouse records are misleading for accounting
E25: No LIMIT/async-fill test coverage — partial-fill path is production code with zero live validation

PASS 3 — NEW FINDINGS (Deepest E2E Trace)

F1: `process_intent` CANCEL returns "accepted" before the cancel happens — caller gets wrong `outcome.state`

File: rust_backend.py:595-614

The CANCEL path:

Calls self.venue.cancel(order) → HTTP DELETE → returns VenueEvent[]
For each event, calls self.on_venue_event(event) → Rust FSM transition
Assembles final_outcome from the Rust kernel's pre-venue-event slot state

outcome = _outcome_from_payload(result["outcome"])  # Rust CANCEL accepts (slot NOT mutated yet)
# ... venue.cancel() ...
# ... on_venue_event() for each event (now slot IS mutated) ...
final_slot = self._get_slot(outcome.slot_id)         # Re-reads post-mutation state
final_outcome = KernelOutcome(
    accepted=outcome.accepted,        # TRUE — from Rust's pre-event accept
    state=final_slot.fsm_state,       # IDLE — from post-event state
    diagnostic_code=outcome.diagnostic_code,  # "OK" — from Rust's pre-event accept
)

For ENTER/EXIT, the same pattern exists — the Rust kernel's outcome is pre-venue. But for CANCEL the disconnect is worst: Rust returns accepted=true with the slot still in ENTRY_WORKING, and only the subsequent on_venue_event(CANCEL_ACK) transitions to IDLE.

Fix: The diagnostic code should be reconciled with the actual venue outcome, not taken from the pre-venue Rust outcome.

Severity: Medium

F2: `_last_settled_pnl` reset before `venue.submit()` — transient window

File: rust_backend.py:597-604

if intent.action == KernelCommandType.ENTER and outcome.accepted:
    self._last_settled_pnl[intent.slot_id] = 0.0   # reset HERE
# ... venue.submit() called below ...

If venue.submit() fails (HTTP error, rate limit), the ENTER was accepted by the Rust FSM but no venue order was placed. The slot is stuck in ORDER_REQUESTED. If the caller retries the same ENTER, _last_settled_pnl is 0.0 from the first attempt — correct for a new trade.

Real risk: If the previous trade on this slot had realized PnL that was never settled (impossible with incremental settle, but hypothetically), resetting to 0.0 loses that PnL. In practice, incremental settle makes this safe.

Severity: Medium (retry-safe, but exposes slot-stall)

F3: `_first_invalid_intent_field` allows `leverage=0` and `target_size=0`

File: rust_backend.py:295-316

The guard catches NaN/Inf and negative target_size. Does NOT catch:

leverage=0 or negative (Rust silently falls back to 1.0)
target_size=0 (submits zero-quantity order to BingX)
reference_price=0 (mark_price ignores non-positive)
limit_price=0 with order_type="LIMIT" (BingX rejects price=0)

The zero-target-size case: a direct process_intent(EXIT, target_size=0.0) computes exit_size = 0, submits MARKET order with quantity=0 to BingX, which may return an error or silent no-op.

Severity: Low (runtime's _exit_intent_from_slot prevents for EXIT; direct kernel API users can trigger it)
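A stricter guard might look like the sketch below. The function name, parameter list, and the decision to reject every non-positive value are assumptions for illustration, not the real `_first_invalid_intent_field` signature; whether each action should reject zero (for example reference_price on a CANCEL) is a policy choice the source does not settle.

```python
import math
from typing import Optional


def first_invalid_intent_field(
    target_size: float,
    leverage: float,
    reference_price: float,
    order_type: str = "MARKET",
    limit_price: float = 0.0,
) -> Optional[str]:
    """Return the name of the first invalid field, or None if all pass.

    Hypothetical stricter variant of the guard described above.
    """
    numeric = {
        "target_size": target_size,
        "leverage": leverage,
        "reference_price": reference_price,
        "limit_price": limit_price,
    }
    for name, value in numeric.items():
        if not math.isfinite(value):      # rejects NaN and +/-Inf
            return name
    if target_size <= 0.0:
        return "target_size"
    if leverage <= 0.0:
        return "leverage"
    if reference_price <= 0.0:
        return "reference_price"
    if order_type == "LIMIT" and limit_price <= 0.0:
        return "limit_price"
    return None
```

Returning the field name rather than a boolean keeps the diagnostic specific enough to surface in the outcome's diagnostic code.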

F4: `outcome.emitted_events` only contains venue events — Rust kernel's events silently dropped

File: rust_backend.py:641-652

final_outcome = KernelOutcome(
    emitted_events=tuple(emitted_events),  # only from venue.submit()
)

The Rust kernel's KernelOutcome struct has emitted_events — currently always empty because the Rust FSM never sets it. If a future change adds Rust-side event emission, those events are silently dropped: final_outcome only uses the Python-side list.

Severity: Low (no Rust-emitted events exist today)

F5: `on_venue_event` does redundant FFI read of slot already returned by Rust

File: rust_backend.py:698-706

def on_venue_event(self, event):
    result = _get_rust().on_venue_event(...)
    outcome = _outcome_from_payload(result["outcome"])
    slot_payload = result.get("slot")
    slot = _slot_from_payload(slot_payload) if slot_payload else self._get_slot(...)
    # ...
    current = self._get_slot(slot.slot_id)  # REDUNDANT — slot already has this data!
    self.projection.write_slot(current)

Line 706 re-reads current from the backend even though slot (from the Rust result) already has the exact same data. Each redundant FFI read is JSON serialize → C FFI → Rust serialize → C FFI → Python parse — ~100μs. With 2-3 events per process_intent and 10 slots, ~3ms wasted per cycle.

Severity: Low (performance)

F6: `_record_transitions` in `process_intent` records pre-venue transitions with `event=None`

File: rust_backend.py:708, 650

# process_intent line 650:
self._record_transitions(outcome.transitions, final_slot, None)  # event=None

# on_venue_event line 708:
self._record_transitions(outcome.transitions, slot, event)  # event attached

Venue-event transitions ARE recorded individually inside each on_venue_event call (line 708). The journal has all transitions. But the pre-venue transitions (from Rust FSM before venue call) have event=None attached — no event context for the journal reader.

Severity: Informational (diagnostic inconvenience only)

F7: `reconcile_from_slots` writes ALL slots to projection/zinc, not just reconciled ones

File: rust_backend.py:718-733

for current in slots:          # iterates ALL max_slots
    self.projection.write_slot(current)   # writes unchanged slots too
    self.zinc_plane.write_slot(current)

After reconcile, ALL slots are written to projection and Zinc, even if the reconcile only modified one slot. Slots 1-9 are serialized and written with their unchanged state. Wasteful but harmless.

Also: Rust kernel's reconcile_slots_json silently ignores slot_id out of range — no error returned. Caller sees accepted=true even if no slots were reconciled.

Severity: Low

F8: `HazelcastRowWriter.put()` is synchronous with no error handling — Hazelcast failure crashes the intent

File: hazelcast_projection.py:30-48

class HazelcastRowWriter:
    def __call__(self, name, row):
        if name.endswith("trade_events"):
            self.client.get_topic(name).publish(json.dumps(row, ...))
            return
        self.client.get_map(name).put(key, json_safe(row))  # synchronous, no try/except

No try/except. Hazelcast put() is synchronous — blocks until the cluster acknowledges. If Hazelcast is down, under load, or partitioned, this:

Blocks the calling thread (which holds the Rust kernel handle — no other operation can proceed)
Raises an exception that propagates through _set_slot() → process_intent() → crashes the entire intent

Severity: Medium (Hazelcast failure in hot path stalls execution)
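One way to keep a projection failure from aborting the intent is to wrap the writer. `GuardedRowWriter` is a hypothetical wrapper, and `inner` stands in for any HazelcastRowWriter-like callable taking (name, row). This sketch addresses only the exception path; a blocking put() would still need a timeout or an off-thread queue.

```python
import logging
from typing import Any, Callable, Mapping

log = logging.getLogger("projection")


class GuardedRowWriter:
    """Wraps a row-writer callable so projection failures never propagate."""

    def __init__(self, inner: Callable[[str, Mapping[str, Any]], None]):
        self._inner = inner
        self.failures = 0

    def __call__(self, name: str, row: Mapping[str, Any]) -> bool:
        try:
            self._inner(name, row)
            return True
        except Exception:
            # Log and count instead of aborting the intent in flight.
            self.failures += 1
            log.exception("projection write to %r failed", name)
            return False


def _always_down(name, row):
    # Simulated outage for the example.
    raise ConnectionError("cluster unreachable")


writer = GuardedRowWriter(_always_down)
ok = writer("pink_slots", {"slot_id": 0})
```

The failure counter gives an operator something to alert on, so the swallow is not itself silent.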

F9: `RealZincPlane.write_slot()` serializes ALL slots, not just the changed one

File: real_zinc_plane.py:205-212

def write_slot(self, slot):
    with self._lock:
        self._slot_cache[int(slot.slot_id)] = slot
        payload = {"slots": [self._slot_cache[key].to_dict() for key in range(self._slot_count)]}
        self._write_region(self.state_region, self._state_seq, payload)

Every single-slot write serializes ALL slot_count slots (default 10) to JSON. With VenueOrder metadata, each slot payload can be ~1-5KB → 10-50KB per write. This is written to Zinc shared memory on every process_intent() and on_venue_event() call.

InMemoryZincPlane does NOT have this problem — it only stores the one slot.

Severity: Low (performance + Zinc shared-memory capacity waste)

F10: `RealZincPlane.write_slot` zeros buffer before write — concurrent read sees empty data

File: real_zinc_plane.py:255-263

def _write_region(self, region, seq, payload):
    buf = region.as_buffer()
    view = memoryview(buf)
    view[:] = b"\x00" * len(view)     # Zeros the buffer
    view[: len(packet)] = packet       # Writes packet
    region.notify()

Between the zero and the write, any concurrent reader sees zeros or a truncated packet. _decode_packet checks size <= len(buf) - 16 — a partially-written packet fails validation and returns {}. The reader (e.g., another thread calling read_slots()) gets an empty result.

Window is microseconds but it exists. No version guard — reader always returns whatever is in the region.

Severity: Low (brief window, no corruption — just empty results)
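A version guard could follow the seqlock pattern: an odd sequence number marks a write in progress, and a reader discards anything it read while the counter changed. The header layout and function names below are assumptions, not the real Zinc packet format, and pure Python gives no memory-ordering guarantees across processes, so this shows the protocol only.

```python
import json
import struct

HEADER = struct.Struct("<QI")   # sequence counter, payload length


def write_region(buf: bytearray, payload: dict) -> None:
    body = json.dumps(payload).encode("utf-8")
    if HEADER.size + len(body) > len(buf):
        raise ValueError("payload does not fit in region")
    seq, _ = HEADER.unpack_from(buf, 0)
    HEADER.pack_into(buf, 0, seq + 1, 0)              # odd: write in progress
    buf[HEADER.size:HEADER.size + len(body)] = body   # old tail bytes are left alone
    HEADER.pack_into(buf, 0, seq + 2, len(body))      # even: write complete


def read_region(buf: bytearray):
    """Return the decoded payload, or None if a write was in flight."""
    seq_before, length = HEADER.unpack_from(buf, 0)
    if seq_before % 2 == 1:
        return None
    body = bytes(buf[HEADER.size:HEADER.size + length])
    seq_after, _ = HEADER.unpack_from(buf, 0)
    if seq_after != seq_before:
        return None
    return json.loads(body) if length else {}
```

The reader can now tell "torn, retry" (None) apart from "genuinely empty" ({}), and the writer no longer needs to zero the buffer because the length field bounds what is read.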

F11: `RealZincPlane._write_region` has no partial-write recovery

File: real_zinc_plane.py:255-263

If _encode_packet raises (JSON serialization error), the method raises before writing — region retains previous content. Safe.

If view[:] = b"\x00" fails (memory error), the region is partially zeroed. Not recoverable. No fallback.

Severity: Low (memory errors are extremely rare)

F12: `InMemoryZincPlane` intent_region grows without bound

File: zinc_plane.py:83-85

def publish_intent(self, intent):
    self.intent_region.append(intent)   # unbounded growth

self.intent_region is List[KernelIntent] — grows on every publish_intent call. Over thousands of policy cycles, this grows without bound.

RealZincPlane.publish_intent() limits to last 512 entries in shared memory, but its self._intent_cache (in-memory) also grows without bound.

Severity: Low (memory leak — ~MB/day)
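A bounded container removes the growth without changing the append call site. `BoundedIntentLog` is a hypothetical replacement class; the 512 limit mirrors the shared-memory figure quoted above and is otherwise an assumption.

```python
from collections import deque

MAX_INTENTS = 512   # mirrors the shared-memory limit quoted above


class BoundedIntentLog:
    """Hypothetical replacement for an unbounded list-backed intent region."""

    def __init__(self, maxlen: int = MAX_INTENTS):
        self._items = deque(maxlen=maxlen)

    def publish_intent(self, intent) -> None:
        self._items.append(intent)   # the oldest entry is evicted once full

    def __len__(self) -> int:
        return len(self._items)

    def latest(self, n: int):
        return list(self._items)[-n:]


region = BoundedIntentLog()
for i in range(2000):
    region.publish_intent({"seq": i})
```

After 2000 publishes the log holds only the most recent 512 intents.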

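F13: `InMemoryZincPlane` condition lock is re-entrant only by default

File: zinc_plane.py:41-43

_signal: threading.Condition = field(default_factory=threading.Condition)

A threading.Condition constructed with no lock argument, as this default_factory does, wraps a new threading.RLock, so the owning thread can re-acquire it and a callback into publish_intent while the lock is held does not deadlock on acquisition. The deadlock applies only if the condition is ever constructed around a plain threading.Lock, which is not re-entrant.

Severity: Low (no current code path triggers this; it becomes a landmine if the underlying lock is swapped for a non-re-entrant one)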

F14: `KernelSlotView.__setattr__` round-trips unknown fields through Rust, which silently drops them

File: rust_backend.py:370-395

If a new field is added to Python's TradeSlot that Rust's TradeSlot doesn't know about, slot.to_dict() includes it. _set_slot serializes to JSON and sends it to Rust, whose serde deserializer ignores unknown fields unless #[serde(deny_unknown_fields)] is set (#[serde(default)] only fills in missing ones). The round-trip loses data without warning.

The reverse: if Rust adds a field that Python doesn't know about, _slot_from_payload ignores unknown keys. Also silently dropped.

Severity: Low (fields must be added to both sides atomically; no guard)

F15: `on_venue_event` loop in `process_intent` stops on first exception — slot left in partial state

File: rust_backend.py:599-610

for event in emitted_events:
    evt_outcome = self.on_venue_event(event)  # NO TRY/EXCEPT

If self.on_venue_event(event) raises (FFI error, null pointer, OOM), the loop stops. Events after the failing event are never processed. The slot is in a partial state — some events applied, some not.

Concrete scenario: ACK arrives first → applied. FULL_FILL arrives second → FFI error, exception raised. Slot is stuck in ENTRY_WORKING with size=0. Next process_intent(EXIT) returns NO_OPEN_POSITION. No recovery path exists.

Severity: High — single exception during fill feedback leaves slot unrecoverable. Zero defense in depth.
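A guarded version of the feedback loop might look like this. `apply_events`, `FeedbackResult`, and the string event labels are hypothetical; the real loop operates on VenueEvent objects and KernelOutcome values. The sketch stops at the first failure, because later events may depend on it, but records what still has to be replayed instead of losing it.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class FeedbackResult:
    applied: List[str] = field(default_factory=list)
    failed: List[str] = field(default_factory=list)

    @property
    def consistent(self) -> bool:
        return not self.failed


def apply_events(events: Sequence[str], handler: Callable[[str], None]) -> FeedbackResult:
    """Apply events in order; on the first failure, park it and all later
    events as failed so they can be retried rather than silently dropped."""
    result = FeedbackResult()
    for index, event in enumerate(events):
        try:
            handler(event)
        except Exception:
            result.failed.extend(events[index:])
            break
        result.applied.append(event)
    return result


def flaky_handler(event: str) -> None:
    # Simulates the concrete scenario above: ACK applies, FULL_FILL raises.
    if event == "FULL_FILL":
        raise RuntimeError("simulated FFI error")


outcome = apply_events(["ACK", "FULL_FILL"], flaky_handler)
```

A caller can check `outcome.consistent` and either retry the parked events or flag the slot for reconcile, which gives the recovery path the current loop lacks.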

F16: `venue.submit()` returning empty events leaves slot in `ORDER_REQUESTED`

File: rust_backend.py:599-610

If venue.submit() returns [] (venue rejected order with no response, or internal error), the for loop doesn't run. No on_venue_event is called. Slot stays in Rust's pre-venue state (ORDER_REQUESTED).

final_outcome has accepted=true, state=ORDER_REQUESTED, emitted_events=[]. Caller sees "successful" but no exchange order exists. Slot stuck in ORDER_REQUESTED until pump_venue_events() or manual reconcile.

Severity: Medium — silent slot stall with no error indication.

F17: Cancel truth-based confirmation returns `REJECTED` for already-cancelled orders on GET failure

File: bingx_direct.py:474-498

try:
    oo = await self._client.signed_get("/openApi/swap/v2/trade/openOrders", ...)
    still_open = (venue_order_id in ids)
except Exception:
    still_open = None  # GET failed

if still_open is False:
    return {"status": "CANCELED", ...}
# still_open is None (GET failed) or True (order still on book)
# Falls through to DELETE response check

If the DELETE succeeded but the verification GET failed (network blip, rate limit on the verification endpoint), still_open=None. The code then checks the DELETE response. If the DELETE returned an ambiguous error (e.g., "order not found" because it was already cancelled by another path), the status is "ERROR" — reported as REJECTED even though the order IS cancelled.

The bingx_venue._events_from_cancel() emits CANCEL_REJECT. The Rust FSM handles CANCEL_REJECT as a no-op — slot stays in EXIT_WORKING with no active order. Stuck until pump_venue_events() or manual reconcile.

Severity: Medium — needs a third state: "definitely cancelled," "probably cancelled," "definitely not cancelled."
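The three-state outcome could be modelled as below. `CancelVerdict` and `classify_cancel` are hypothetical names, and the mapping from (DELETE result, open-orders check) to a verdict is one reasonable assumption rather than the adapter's actual logic.

```python
from enum import Enum
from typing import Optional


class CancelVerdict(Enum):
    CANCELLED = "definitely cancelled"
    UNCONFIRMED = "probably cancelled"
    STILL_OPEN = "definitely not cancelled"


def classify_cancel(delete_ok: bool, still_open: Optional[bool]) -> CancelVerdict:
    """Combine the DELETE result with the open-orders check.

    still_open is None when the verification GET itself failed.
    """
    if still_open is False:
        return CancelVerdict.CANCELLED      # order is gone from the book
    if still_open is True:
        return CancelVerdict.STILL_OPEN     # order is visibly resting
    if delete_ok:
        return CancelVerdict.CANCELLED      # exchange acknowledged the cancel
    # DELETE errored and verification failed: report uncertainty, not rejection.
    return CancelVerdict.UNCONFIRMED
```

An UNCONFIRMED verdict would let the venue layer schedule a re-check instead of emitting CANCEL_REJECT for an order that may already be gone.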

F18: Leverage-set failure and order-submit failure share one exception handler

File: bingx_direct.py:376-417

await self._client.signed_post("/openApi/swap/v2/trade/leverage", ...)  # step A
# ...
ack_payload = await self._client.signed_post("/openApi/swap/v2/trade/order", payload)  # step B

If step A fails (400 for invalid symbol), the exception handler at line 417 catches BingxHttpError and returns REJECTED. No way for the caller to know whether the leverage set failed or the order submission failed — both go through the same handler. The error message just says "REJECTED."

Also: if step A succeeds and step B fails, leverage was changed on the exchange but no order was placed. System state unchanged (leverage changes don't affect capital), but diagnostics are poor.

Severity: Low (correct behavior, poor diagnostics)

F19: `_events_from_submit` stale snapshot fallback → wrong fill detection

File: bingx_venue.py:375-400

_filled_size_from_snapshots() diffs position quantity before and after submit. The "before" snapshot comes from _backend_snapshot() which can return stale data (E13). A stale "before" against a fresh "after" produces a wrong diff — could be negative, zero, or larger than reality.

This wrong diff propagates to emitted_events — the PARTIAL_FILL or FULL_FILL event has wrong filled_size. The Rust kernel's apply_fill uses this wrong filled_size to set slot.size. Capital settles on the wrong delta.

Severity: Medium — wrong fill size propagates to kernel state and PnL.

F20: `__del__` frees Rust handle at unpredictable GC time (no explicit `close()`)

File: rust_backend.py:558-566

def __del__(self):
    backend = getattr(self, "_backend", None)
    if backend is not None:
        try: _get_rust().destroy(backend)
        except: pass

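ExecutionKernel has no close() method. The Rust KernelHandle is only freed by __del__, which CPython runs at an unpredictable time: on whichever thread drops the last reference or happens to trigger a garbage-collection pass, and possibly not at all during interpreter shutdown. If any code holds a stale copy of self._backend, the pointer dangles once the kernel is collected.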

DITAv2LauncherBundle.close() calls _maybe_close on venue, zinc, and control plane — but NOT on kernel (which has no close() or disconnect()). The kernel is leaked until GC.

Severity: Medium — reliance on __del__ for critical C resource cleanup.
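An explicit, idempotent close() with `__del__` kept only as a safety net could be sketched as follows. `NativeHandleOwner` and its `destroy` callback are hypothetical stand-ins for ExecutionKernel and the FFI destroy call.

```python
class NativeHandleOwner:
    """Hypothetical owner of a native handle; `destroy` stands in for the FFI call."""

    def __init__(self, handle, destroy):
        self._handle = handle
        self._destroy = destroy

    def close(self) -> None:
        # Idempotent: the handle is cleared before destroy, so a second call is a no-op.
        handle, self._handle = self._handle, None
        if handle is not None:
            self._destroy(handle)

    @property
    def closed(self) -> bool:
        return self._handle is None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.close()

    def __del__(self):
        # Last-resort safety net only; normal cleanup goes through close().
        try:
            self.close()
        except Exception:
            pass


destroyed = []
with NativeHandleOwner(handle=0xDEAD, destroy=destroyed.append) as owner:
    pass
owner.close()   # second close does nothing
```

With a close() in place, a launcher bundle could release the kernel deterministically and in a defined order relative to the venue and Zinc plane.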

F21: `DITAv2LauncherBundle.close()` closes venue before kernel is done with it

File: launcher.py:90-95

def close(self):
    _maybe_close(self.venue)       # Closes HTTP client
    _maybe_close(self.zinc_plane)  # Closes Zinc regions

If the kernel is mid-process_intent in another thread (hypothetical — single-threaded in practice), venue.submit() would fail because the HTTP client is already closed. No ordering enforcement.

Severity: Low (single-threaded deployment)

F22: Silent fallback from real Zinc/Hazelcast to in-memory on error — operator unaware

File: control.py:210-217, launcher.py:175-185, projection.py:30-40

def build_control_plane(...):
    if real_requested:
        try:
            return RealZincControlPlane(...)
        except Exception:
            pass  # SILENT — operator never knows
    return ZincControlPlane(snapshot=snapshot)

Three places have this pattern. An operator who sets DITA_V2_ZINC=REAL while Zinc is unavailable gets in-memory storage with no warning, error, or log line. The ZincPlane protocol has no introspection method to report whether it is real or in-memory.

The same applies to Hazelcast projection and the venue adapter.

Severity: Medium — configuration errors are silently masked.

F23: `VenueEvent.size` = `intent.target_size` not actual fill — wrong for multi-leg EXIT

File: bingx_venue.py:410-420

base_event = VenueEvent(
    size=float(intent.target_size or 0.0),  # target, not fill
)

For an EXIT leg, intent.target_size is the intended exit size. The ACK event's size reflects the target, not the actual fill. For fully-filled MARKET orders, target == fill so it's invisible. For partially-filled LIMIT orders, size on the ACK is wrong.

The fill event later has filled_size from the venue's executedQty, so the downstream kernel uses the correct fill size. The ACK's size is unused by the kernel (the kernel uses filled_size for PnL computation).

Severity: Informational (unused by kernel)

F24: `asyncio.run()` inside async function in test generator — nested event loops

File: _build_pink_extended.py:75-81

def _check_open_orders(c, vs):
    r = __import__('asyncio').run(c._request_json("GET", ...))

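asyncio.run() is called from a helper that is reached from an async def context (the test body is async). asyncio.run() does not nest: when a loop is already running on the calling thread it raises RuntimeError instead of suspending the outer loop. Since these tests work in practice, the helper must execute on a thread with no running loop; it will fail as soon as it is invoked directly from a coroutine.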

Severity: Low (works in practice)
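The behaviour of asyncio.run() under a running loop is easy to check in isolation. The function names below are hypothetical; only the asyncio calls are real. The sketch calls a synchronous helper from inside a coroutine and captures the resulting RuntimeError, then shows the loop-safe alternative of awaiting directly.

```python
import asyncio


async def fetch_value() -> int:
    await asyncio.sleep(0)
    return 42


def sync_helper_nested() -> int:
    # Same shape as the excerpt: asyncio.run() inside a plain function.
    coro = fetch_value()
    try:
        return asyncio.run(coro)
    except RuntimeError:
        coro.close()   # avoid a "never awaited" warning for the rejected coroutine
        raise


async def test_body():
    try:
        sync_helper_nested()          # called on a thread with a running loop
    except RuntimeError as exc:
        nested_error = str(exc)
    else:
        nested_error = ""
    # Awaiting the coroutine directly is the loop-safe alternative.
    value = await fetch_value()
    return nested_error, value


nested_error, value = asyncio.run(test_body())
```

If the helper has to stay synchronous, running it in a worker thread (where no loop is active) is the usual way to keep asyncio.run() legal.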

F25: `_build_fresh_kernel_from_slot` leaks old kernel objects per call

File: _build_pink_extended.py:95-108

def _build_fresh_kernel_from_slot(slot_data, ic=25000.0):
    cfg = _build_config(ic)
    b = build_launcher_bundle(venue_mode="BINGX", ...)  # NEW bundle, OLD not closed
    k = b.kernel
    return RB(runtime=Shim(k), config=cfg)

Each call creates a new launcher bundle (new kernel, new Rust handle, new HTTP client, new Zinc plane) without closing the old one. Called 4 times across the fresh-kernel test bodies. Leaks ~50MB per call (Rust lib, HTTP connections).

Severity: Low (test infrastructure only)

F26: `seen_event_ids` not cleared on re-entry — event IDs accumulate across trades

File: lib.rs:672-683

When a slot re-enters (new ENTER after previous EXIT), the Rust kernel resets most fields (lib.rs:740-765) but does NOT clear seen_event_ids. The new trade inherits the previous trade's event history up to MAX_SEEN_EVENT_IDS (256). After 256 events across multiple trades, old IDs are drained.

For MARKET trading (2-4 events per trade), filling the 256-entry window takes roughly 64-128 trades before draining starts. For LIMIT trading (many partial fills), it could take as few as 5-10 trades.

Fix: slot.seen_event_ids.clear() on ENTER.

Severity: Low (event ID collision across trades is astronomically unlikely)

F27: `RealZincControlPlane.read()` parses Zinc region every call — no caching

File: real_control_plane.py:88-94

def read(self):
    payload = _decode_packet(self.region.as_buffer())  # JSON parse every call
    control = payload.get("control")
    self._snapshot = KernelControlSnapshot(**control)   # reconstruct every call
    return self._snapshot

Called by ExecutionKernel.control property on every process_intent(). Each call re-constructs a KernelControlSnapshot from dict — allocating new objects for every field. ~50μs per call. A simple cached-until-modified pattern would eliminate all parses between writes.

Severity: Low (performance)
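A cached-until-modified reader could key on a sequence number. `CachedRegionReader` and the `read_raw` callable returning (sequence, raw_bytes) are assumptions; the real Zinc region API may expose versioning differently, or not at all.

```python
import json
from typing import Any, Callable, Dict, Optional, Tuple


class CachedRegionReader:
    """Decode a region only when its sequence number changes."""

    def __init__(self, read_raw: Callable[[], Tuple[int, bytes]]):
        self._read_raw = read_raw
        self._seq: Optional[int] = None
        self._value: Dict[str, Any] = {}
        self.decodes = 0

    def read(self) -> Dict[str, Any]:
        seq, raw = self._read_raw()
        if seq != self._seq:
            self._value = json.loads(raw)   # parse only on change
            self._seq = seq
            self.decodes += 1
        return self._value


state = {"seq": 1, "raw": b'{"control": {"mode": "LIVE"}}'}
reader = CachedRegionReader(lambda: (state["seq"], state["raw"]))
for _ in range(1000):
    reader.read()
decodes_before_write = reader.decodes
state.update(seq=2, raw=b'{"control": {"mode": "HALT"}}')
latest = reader.read()
```

A thousand reads between writes cost one parse, and the first read after a write picks up the new control state.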

F28: `_legacy_intent` hardcodes `confidence=1.0` and `bars_held=0`

File: bingx_venue.py:270-285

These fields are in LegacyIntent but unused by submit_intent() (which only reads asset, side, action, target_size, leverage, metadata). The downstream ClickHouse rows use the policy-layer Intent, not LegacyIntent, so the hardcoded values don't reach persistence.

Only propagates through the venue adapter's internal chain. No consumer reads them today.

Severity: Informational

F29: `_slot_to_payload` in `real_zinc_plane.py` is dead code

File: real_zinc_plane.py:57-59

def _slot_to_payload(slot):
    data = slot.to_dict()
    return data

Defined, never called anywhere in the file. All slot serialization calls slot.to_dict() directly.

Severity: Informational

F30: Duplicate `_slot_from_payload` in `real_zinc_plane.py` and `rust_backend.py`

File: real_zinc_plane.py:62-112, rust_backend.py:270-310

Two nearly identical implementations. The real_zinc_plane version manually constructs VenueOrder objects (lines 63-88) with different defaults (e.g., fallback to slot size if intended_size missing). The rust_backend version delegates to _order_from_payload with all-default fallbacks.

If fields are added to TradeSlot or VenueOrder, both must be updated.

Severity: Low (code duplication risk)

Complete Flaw Catalog

All-Passes Combined

Family	Focus	Count	Critical	High	Medium	Low	Info
A	Architectural (old 13, now superseded)	15	0	2	0	2	11
T	Threading/Atomicity	9	1	3	3	2	0
E	E2E Trace (Pass 1)	26	0	5	11	9	1
F	Deep E2E (Pass 3)	30	0	1	8	17	4
Total		80	1	11	22	30	16

Most Dangerous Single Flaw: F15

An exception in on_venue_event() during the fill-feedback loop stops the chain mid-apply. The ACK applied but the FILL didn't. The slot sits in ENTRY_WORKING with no position. There is no retry mechanism and no recovery path, so the slot stays stuck until manual intervention. Zero defense in depth: no try/except, no undo, no validation that the slot reached a consistent state.

This is the single highest-impact E2E flaw because it requires no concurrency, no race condition, no unusual market conditions — just a transient FFI error during normal operation.

PASS 4 — SYSTEMATIC DOMAIN SCANS (Config, Rust, Persistence, Lifecycle)

Rust Kernel — Numeric & FSM Invariants

G1: EXIT_RESIDUAL action is entirely missing from Rust KernelCommandType

File: _rust_kernel/src/lib.rs

string_enum! {
    enum KernelCommandType {
        ENTER, EXIT, MARK_PRICE, RECONCILE, CONTROL, CANCEL,
    }
}

Six variants. No EXIT_RESIDUAL. If any caller submits an intent with action = "EXIT_RESIDUAL", the string_enum deserializer fails — serde returns INVALID_INTENT_PARSE. Even if deserialization worked, there's no branch to handle residual-position cleanup. Any position with remaining size after partial exit legs has no way to trigger a clean-up exit via the intent system.

The Python KernelCommandType enum (contracts.py) does have EXIT_RESIDUAL, translated to "EXIT_RESIDUAL" string by _intent_to_payload. This string hits Rust's string_enum → parse error → INVALID_INTENT_PARSE.

Fix: Add EXIT_RESIDUAL variant to Rust enum + match arm that skips the NO_OPEN_POSITION guard for residual-sized positions.

Severity: Critical

G2: `into_c_string` uses `unwrap()` — panics on interior NUL byte

File: _rust_kernel/src/lib.rs:1477

fn into_c_string(value: &str) -> *mut c_char {
    CString::new(value).unwrap().into_raw()
}

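CString::new() returns Err if the string contains an interior NUL ('\0') byte, and .unwrap() turns that into a panic at the C FFI boundary, which takes down the entire process. Note that serde_json::to_string() escapes U+0000 as \u0000, so JSON produced by the serializer should not carry a raw NUL even when a user-controlled string in KernelIntent, VenueEvent, or TradeSlot contains one; the exposure is any string that reaches into_c_string without passing through the JSON serializer.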

Triggered by every FFI call that returns a string:

dita_kernel_process_intent_json
dita_kernel_on_venue_event_json
dita_kernel_reconcile_slots_json
dita_kernel_snapshot_json
dita_kernel_get_slot_json

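Fix: Replace .unwrap() with a fallible conversion such as CString::new(value).map(CString::into_raw).unwrap_or_else(|_| ptr::null_mut()), or feed the error through invalid_intent_cstring. A bare unwrap_or_else(|_| ptr::null_mut()) on the Result does not type-check, because the closure must return a CString. Callers then have to treat a null return as an error.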

Severity: Critical

G3: `process_intent` EXIT hardcodes `prev_state = POSITION_OPEN` unconditionally

File: _rust_kernel/src/lib.rs:842-890

slot.fsm_state = TradeStage::EXIT_REQUESTED;        // unconditional override
let transition = self.transition(
    &slot,
    TradeStage::POSITION_OPEN,                        // always POSITION_OPEN
    slot.fsm_state.clone(),
    "EXIT_INTENT",
);

Three problems:

(a) Transition prev_state is a lie. If the slot was in EXIT_WORKING, EXIT_SENT, EXIT_REQUESTED, or POSITION_PARTIALLY_CLOSED, the transition record says POSITION_OPEN — wrong.

(b) Backward transition. If the slot is EXIT_WORKING and a new EXIT intent arrives, fsm_state is set to EXIT_REQUESTED — a backward transition from EXIT_WORKING → EXIT_REQUESTED. This corrupts the FSM.

(c) No state guard. EXIT should only be allowed from POSITION_OPEN, EXIT_WORKING (for additional legs), or POSITION_PARTIALLY_CLOSED. Currently any state that passes !is_free() && !closed && size > 0 can transition to EXIT_REQUESTED.

Fix: Check actual FSM state before allowing EXIT, log actual prev_state, guard against backward transitions.

Severity: Critical

G4: `consume_exit_leg` advances beyond last valid index — stale `all_legs_done` variable

File: _rust_kernel/src/lib.rs:1420-1435

let all_legs_done = slot.active_leg_index >= slot.exit_leg_ratios.len(); // (A)
let should_close = (slot.size <= 1e-12 || (!partial && all_legs_done));  // (B)

if !partial {
    slot.consume_exit_leg();  // (C) — advances active_leg_index POST (A)
}

if should_close && slot.size <= 1e-12 {         // (D) — close
} else if !partial && !all_legs_done {           // (E) — stale! uses (A) not post-advance index

On the last leg (active_leg_index = len - 1):

(A): all_legs_done = false (pre-advance)
(C): advances to len (exhausted)
(E): !partial && !false = true → enters POSITION_OPEN instead of examining should_close with post-advance index

The all_legs_done variable is captured before consume_exit_leg advances the index. Branch (E) should use the post-advance index to correctly detect exhaustion.

After exhaustion, next_exit_ratio() returns 1.0 (out-of-bounds unwrap_or(1.0)) — silently tries to exit remaining size as 100% instead of detecting completion.

Severity: Critical
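The ordering problem is easier to see in a small model. This is a Python sketch of the branch logic only, not the Rust code; `Slot`, `after_exit_fill`, and the returned labels are hypothetical, and the handling of the exhausted-with-residual case is an assumption about what the kernel should report.

```python
from dataclasses import dataclass

EPS = 1e-12  # same dust threshold as in the excerpt


@dataclass
class Slot:
    size: float
    exit_leg_ratios: list
    active_leg_index: int = 0

    def consume_exit_leg(self) -> None:
        # Never advance past len(): the cursor stops at "exhausted".
        if self.active_leg_index < len(self.exit_leg_ratios):
            self.active_leg_index += 1

    def legs_exhausted(self) -> bool:
        return self.active_leg_index >= len(self.exit_leg_ratios)


def after_exit_fill(slot: Slot, partial: bool) -> str:
    """Pick the follow-up state once an exit fill has been applied."""
    if not partial:
        slot.consume_exit_leg()
    # Evaluate exhaustion AFTER the cursor has advanced.
    all_legs_done = slot.legs_exhausted()
    if slot.size <= EPS:
        return "CLOSED"
    if partial:
        return "EXIT_WORKING"
    if all_legs_done:
        return "LEGS_EXHAUSTED_WITH_RESIDUAL"
    return "POSITION_OPEN"
```

Because exhaustion is computed after the advance, the last leg no longer falls into the POSITION_OPEN branch, and a leftover size is surfaced explicitly rather than being exited through the 1.0 fallback.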

G5: `realized_pnl` uses unbounded f64 — overflows to inf at extreme values

File: _rust_kernel/src/lib.rs:648-656

let notional = exit_size * slot.entry_price * slot.leverage.max(1.0);
delta * notional

No is_finite() check on intermediate products. At exit_price=1e200, entry_price=1e-200: delta = (1e200 - 1e-200) / 1e-200 ≈ 1e400 → inf. The resulting inf is stored in slot.realized_pnl, corrupting all future PnL tracking.

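Subnormals: entry_price=5e-324 (the smallest positive subnormal) makes the division overflow to inf for any realistic exit price, because exit_price / 5e-324 exceeds f64::MAX for every exit_price above roughly 1e-15.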

Fix: Add is_finite() guards on both prices and cap intermediate products.

Severity: High
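A guarded version of the computation might look like the following. It is a Python model of the arithmetic shown above, not the Rust function; the `side` parameter and the choice to return `None` on a non-finite result are assumptions made for illustration.

```python
import math


def realized_pnl(entry_price, exit_price, exit_size, leverage=1.0, side=1):
    """Return leg PnL, or None when any input or intermediate is non-finite."""
    inputs = (entry_price, exit_price, exit_size, leverage)
    if not all(math.isfinite(x) for x in inputs) or entry_price <= 0.0:
        return None
    delta = side * (exit_price - entry_price) / entry_price
    notional = exit_size * entry_price * max(leverage, 1.0)
    pnl = delta * notional
    # Float overflow yields inf rather than raising, so check every product.
    if not all(math.isfinite(x) for x in (delta, notional, pnl)):
        return None
    return pnl
```

Returning a sentinel instead of storing inf lets the caller reject the fill or emit a diagnostic before the slot's running PnL is touched.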

G6: `mark_price` produces unbounded `unrealized_pnl`

File: _rust_kernel/src/lib.rs:384-399

self.unrealized_pnl = delta * self.size * self.entry_price * self.leverage;
// No is_finite() check on result

If any of delta, size, entry_price, or leverage is extreme, the product overflows to inf. The result is never checked, so the inf is stored in unrealized_pnl. The only guard is price <= 0.0 on the input; nothing guards the computation chain.

Also: at line 388, self.entry_price = price runs on every mark_price call while entry_price <= 0.0. For a freshly opened position this is correct: a stale-zero entry_price is set to the current market price on the first mark_price after open. The risk is slot reuse. On re-entry without resetting entry_price, the prior trade's entry_price is still positive, so the overwrite is skipped and the old entry price bleeds into unrealized PnL.

Severity: High

G7: `process_intent` ENTER — no `is_finite()` guard on `target_size`

File: _rust_kernel/src/lib.rs:806-807

intended_size: intent.target_size.max(0.0),

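f64::INFINITY.max(0.0) returns inf, so an infinite target_size passes straight through. f64::NAN.max(0.0) returns 0.0, because Rust's f64::max returns the other operand when one is NaN, so a NaN target_size is silently turned into a zero-size order rather than rejected. NaN and Infinity are not valid JSON tokens and serde_json rejects them by default, whereas Python's json.dumps emits them unless allow_nan=False is set; a non-finite value that leaves Python as JSON should therefore fail to parse rather than arrive as a number, but the kernel has no guard of its own for any path that builds the intent differently. If the Python-side _first_invalid_intent_field guard is bypassed (see F3), nothing in the ENTER handler stops a non-finite or NaN-coerced value from reaching intended_size in VenueOrder and corrupting fill calculations.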

Similarly, reference_price is never validated for finiteness before being stored in VenueOrder.metadata.

Severity: High
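Independently of any Rust-side check, non-finite values can be stopped at the Python side of the boundary with strict JSON encoding. A minimal sketch, assuming the payload is serialized with the standard `json` module (the trace does not show how `_intent_to_payload` encodes it); `encode_intent` is a hypothetical helper.

```python
import json


def encode_intent(payload: dict) -> str:
    """Serialize an intent payload, refusing NaN and +/-Infinity."""
    # allow_nan=False makes json.dumps raise ValueError instead of emitting
    # the non-standard tokens NaN / Infinity.
    return json.dumps(payload, allow_nan=False)
```

By default `json.dumps(float("nan"))` produces the bare token `NaN`, which is not valid JSON, so failing early here gives a clear Python-side error instead of an opaque parse failure across the FFI.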

G8: `reconcile_slots_json` — no dedup or bounds validation

File: _rust_kernel/src/lib.rs:1668-1675

for slot in slots {
    if slot.slot_id < core.slots.len() {
        core.slots[slot.slot_id] = slot.clone();
    }
}

Two slots with the same slot_id: the second overwrites the first silently. A slot with slot_id >= core.slots.len(): silently dropped — no error, no diagnostic. Caller sees accepted=true even if some/all slots were not applied.

Severity: High
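One possible shape for the missing validation is to classify every incoming slot before applying any of them. The sketch models only the slot_id checks in Python; `validate_reconcile` and the reason strings are hypothetical, and rejecting the second duplicate (rather than the first) is an assumption.

```python
def validate_reconcile(slot_ids, capacity):
    """Split incoming slot ids into applicable ids and rejection reasons."""
    applied, rejected, seen = [], [], set()
    for slot_id in slot_ids:
        if not 0 <= slot_id < capacity:
            rejected.append((slot_id, "OUT_OF_RANGE"))
        elif slot_id in seen:
            rejected.append((slot_id, "DUPLICATE"))
        else:
            seen.add(slot_id)
            applied.append(slot_id)
    return applied, rejected
```

A caller could then report accepted=true only when `rejected` is empty, and otherwise return the rejection list as a diagnostic.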

G9: `exchange_order_id` propagation uses wrong order target

File: _rust_kernel/src/lib.rs:1110-1125

let target = if slot.active_entry_order.is_some() {
    slot.active_entry_order.as_mut()
} else {
    slot.active_exit_order.as_mut()
};

If an entry order exists (even if fully filled) and an exit fill event arrives, the code updates the entry order's venue_order_id instead of the exit order's. The exit order's venue_order_id stays empty. Any subsequent CANCEL intent on the exit order fails because active_exit_order.venue_order_id is empty — the venue can't match the cancel.

Fix: Disambiguate by matching venue_client_id, or clear active_entry_order when entry is complete.

Severity: High

G10: CANCEL diagnostic code says NO_ACTIVE_EXIT_ORDER for entry cancel too

File: _rust_kernel/src/lib.rs:966-1005

if !has_cancellable_exit && !has_cancellable_entry {
    return KernelResult {
        diagnostic_code: KernelDiagnosticCode::NO_ACTIVE_EXIT_ORDER, // always says exit
        details: json!({"reason": "NO_ACTIVE_EXIT_ORDER"}),
    };
}

When neither exit nor entry is cancellable, the diagnostic returns NO_ACTIVE_EXIT_ORDER regardless of which order was the target. If the user wanted to cancel an entry order that's not in a cancellable state, the diagnostic is misleading.

Fix: Separate diagnostic codes: NO_ACTIVE_EXIT_ORDER, NO_ACTIVE_ENTRY_ORDER, ENTRY_NOT_CANCELLABLE.

Severity: High

G11: `apply_fill` entry-fill overwrites `active_entry_order.intended_size` with `slot.size`

File: _rust_kernel/src/lib.rs:1363-1377

On FULL_FILL entry, slot.active_entry_order is entirely replaced with a new VenueOrder where intended_size = slot.size (the fill amount) instead of the original intended size. The original intended size (which could be larger than fill size for partial fills) is lost.

If a duplicate fill event arrives (dedup fails due to missing event_id), the second fill would use slot.size as the basis for further fills — wrong values.

Severity: Medium

G12: `leverage` unbounded after `is_finite()` — no maximum cap

File: _rust_kernel/src/lib.rs:778

slot.leverage = if intent.leverage.is_finite() && intent.leverage > 0.0 {
    intent.leverage  // 1e100 accepted here
} else { 1.0 };

leverage = 1e100 passes is_finite(). It feeds into realized_pnl() as slot.leverage.max(1.0) = 1e100, producing notional = exit_size * entry_price * 1e100, and into the unrealized_pnl product in mark_price (G6), so both PnL figures can become arbitrarily large.

No maximum leverage cap enforced anywhere — the exchange-level cap (DOLPHIN_BINGX_EXCHANGE_LEVERAGE_CAP) exists in BingxExecClientConfig but is never passed to the Rust kernel.

Severity: Medium

G13: `resolve_slot` fallback returns `unwrap_or(0)` — can misroute events

File: _rust_kernel/src/lib.rs:623

self.slots.first().map(|slot| slot.slot_id).unwrap_or(0)

When no slot matches the event (slot_id out of range or all slot filters fail), returns slot_id of the first slot (which may be 0 or any value). No diagnostic emitted — caller sees slot state change with no idea the event was misrouted.

Severity: Medium

G14: `commit_slot` silently ignores out-of-bounds slot_id

File: _rust_kernel/src/lib.rs:595-600

fn commit_slot(&mut self, slot: TradeSlot) {
    if slot.slot_id < self.slots.len() {
        self.slots[slot_id] = slot;
    }
    // else: silently dropped — no error returned
}

Mutations to out-of-bounds slot are silently discarded. Can happen if slot.slot_id is corrupted via set_slot_from_json causing index mismatch between slot.slot_id and the actual slot position.

Severity: Medium

Configuration & Validation Chain

G15: Zero `__post_init__` validators on all config dataclasses

Every config dataclass in the system has zero field-level validation:

Dataclass	Fields	Validators
`KernelControlSnapshot`	16	0
`ControlUpdate`	16	0
`KernelIntent`	19	0
`TradeSlot`	22	0
`VenueOrder`	8	0
`VenueEvent`	18	0
`KernelTransition`	11	0
`KernelOutcome`	8	0
`AccountSnapshot`	9	0
Total	127	0

The only validation in the entire chain:

_first_invalid_intent_field() — finiteness guard at Python→Rust FFI boundary (not a dataclass validator)
Rust leverage = if is_finite && > 0.0 { val } else { 1.0 } — post-hoc clamp
Rust KernelCore::new(max_slots.max(1)) — floor only, no ceiling
launcher.py:143: max(1, int(...)) for active_slot_limit — floor only

No __post_init__ exists anywhere. No bounds check on any field except the two floor-only guards.

Severity: High
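For reference, a field-level validator on a dataclass takes only a few lines. The sketch below uses a hypothetical `IntentLimits` class with a small subset of fields; the `max_leverage` default of 125.0 is an illustrative assumption, not a value taken from the source or from any exchange configuration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class IntentLimits:
    """Hypothetical subset of intent fields with basic invariants."""
    target_size: float
    leverage: float = 1.0
    max_leverage: float = 125.0  # assumed cap, for illustration only

    def __post_init__(self):
        for name in ("target_size", "leverage"):
            value = getattr(self, name)
            if not math.isfinite(value):
                raise ValueError(f"{name} must be finite, got {value!r}")
        if self.target_size < 0.0:
            raise ValueError("target_size must be >= 0")
        if not 0.0 < self.leverage <= self.max_leverage:
            raise ValueError("leverage out of range")
```

Validating at construction time means a bad value fails where it is created, rather than several layers later at the FFI boundary or inside the kernel.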

G16: `DITA_V2_DEBUG_CLICKHOUSE` defaults to `True` when env var is unset

File: launcher.py:133

debug = _env_bool("DITA_V2_DEBUG_CLICKHOUSE", True)

_env_bool (launcher.py:75) returns default when the env var is unset. So debug = True by default. Every runtime writes debug traces to ClickHouse by default. DITA_V2_DEBUG_CLICKHOUSE=False is required to disable it.

This is not a bug per se, but it means debug ClickHouse writes are on by default, adding ~10 ClickHouse insertions per process_intent call (every transition + position state + trade event) that most production deployments may not want.

Severity: Informational

G17: String config fields have no charset/length validation — Zinc region injection risk

File: control.py:31-53, real_zinc_plane.py:30

runtime_namespace, strategy_namespace, event_namespace, actor_name, exec_venue, data_venue, ledger_authority are all free-form strings with no validation. They're used as:

Zinc shared memory region names: self.prefix + "." + namespace + "." + kind — an attacker-controlled namespace could collide with other processes' Zinc regions
ClickHouse table names: DOLPHIN_BINGX_JOURNAL_STRATEGY is used as a table suffix — SQL injection risk in ClickHouse journal
Hazelcast map names: Same injection risk via event_namespace

Severity: Medium
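A conservative mitigation is an allowlist check applied before any of these strings is used to build a region, table, or map name. The sketch is an assumption about what a reasonable policy could be (ASCII letters, digits, underscore, at most 64 characters); `validate_namespace` is a hypothetical helper.

```python
import re

# Assumed policy: identifier-like names only, bounded length.
_NAME_RE = re.compile(r"[A-Za-z0-9_]{1,64}")


def validate_namespace(value, field="namespace"):
    """Return value unchanged if it is a safe identifier, else raise."""
    if not isinstance(value, str) or _NAME_RE.fullmatch(value) is None:
        raise ValueError(f"invalid {field}: {value!r}")
    return value
```

Because the dot is excluded, a validated namespace cannot add extra segments to a prefix + "." + namespace + "." + kind region name, and quotes or semicolons never reach a table name.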

G18: `exit_leg_ratios` no sum-to-1 validation

KernelIntent.exit_leg_ratios and TradeSlot.exit_leg_ratios are tuple/list of floats. No validator ensures they sum to approximately 1.0. Ratios summing to 0.5 leave half the position open after the last configured leg. The residual is then only reachable through the exhaustion fallback, where next_exit_ratio() returns 1.0 and exits 100% of the remaining size, which may exceed the intended residual.

Severity: Low
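A sum check of this kind is short. The sketch assumes ratios are fractions of the original position size that must total 1.0 within a tolerance; `validate_exit_leg_ratios` and the tolerance value are hypothetical.

```python
import math


def validate_exit_leg_ratios(ratios, tol=1e-9):
    """Return ratios as a tuple if each is in (0, 1] and they sum to 1."""
    ratios = tuple(float(r) for r in ratios)
    if not ratios:
        raise ValueError("exit_leg_ratios must not be empty")
    if any(not math.isfinite(r) or not 0.0 < r <= 1.0 for r in ratios):
        raise ValueError(f"each ratio must be in (0, 1]: {ratios!r}")
    # fsum avoids accumulating rounding error across many small legs.
    if not math.isclose(math.fsum(ratios), 1.0, abs_tol=tol):
        raise ValueError(f"ratios must sum to 1.0, got {math.fsum(ratios)!r}")
    return ratios
```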

G19: `RealZincControlPlane.read()` has no sequence check — torn-read risk

File: real_control_plane.py:88-94

def read(self):
    payload = _decode_packet(self.region.as_buffer())
    control = payload.get("control")
    if not isinstance(control, dict):
        return self._snapshot
    self._snapshot = KernelControlSnapshot(**control)
    return self._snapshot

The binary packet has a 64-bit sequence number but read() never checks it. Between the zero-write and packet-write in _write_region, a reader sees an empty buffer → _decode_packet fails → falls back to self._snapshot (stale). Between the packet-write and struct.pack header (order depends on implementation), a reader sees a partial write with wrong size → _decode_packet fails.

No checksum on the wire format: struct.pack("!QQ", seq, len) + json_bytes. A torn write produces garbage that json.loads may or may not parse successfully.

Severity: Low
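A reader can use the sequence number it already receives. The sketch below reads the header, copies the payload, re-reads the header, and retries if the sequence changed in between. It assumes the `!QQ` header layout quoted above and a writer that bumps the sequence on every write; `encode_packet` and `read_stable` are hypothetical names, and this reduces rather than eliminates the torn-read window (a checksum would be stronger).

```python
import json
import struct

HEADER = struct.Struct("!QQ")  # (sequence, payload length), as described above


def encode_packet(seq, payload):
    body = json.dumps(payload).encode("utf-8")
    return HEADER.pack(seq, len(body)) + body


def read_stable(buf, retries=3):
    """Return (seq, payload) from a consistent snapshot of buf, else None."""
    for _ in range(retries):
        seq_before, length = HEADER.unpack_from(buf, 0)
        end = HEADER.size + length
        if length == 0 or end > len(buf):
            continue  # empty or implausible header: likely mid-write
        body = bytes(buf[HEADER.size:end])
        seq_after, _ = HEADER.unpack_from(buf, 0)
        if seq_before != seq_after:
            continue  # writer advanced while we copied: retry
        try:
            return seq_before, json.loads(body)
        except ValueError:
            continue  # torn or undecodable payload
    return None
```

Returning the sequence alongside the payload also lets the caller notice when it is being served the same snapshot repeatedly.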

G20: `DOLPHIN_BINGX_JOURNAL_STRATEGY`/`_DB` — ClickHouse SQL injection risk

File: launcher.py:202-203

"DOLPHIN_BINGX_JOURNAL_STRATEGY": os.environ.get("DOLPHIN_BINGX_JOURNAL_STRATEGY", ""),
"DOLPHIN_BINGX_JOURNAL_DB": os.environ.get("DOLPHIN_BINGX_JOURNAL_DB", ""),

These are used as ClickHouse table and database name suffixes in pink_clickhouse.py. An attacker who can set env vars can inject SQL via semicolons or quotes in the table name. ClickHouse supports INSERT INTO db.table FORMAT JSONEachRow — a table name like positions; DROP TABLE ...; could be destructive.

Severity: Low (requires env var control, which implies broader access)

Persistence Schema Alignment

G21: `entry_price` used as `exit_price` in `trade_events` — data loss

File: pink_clickhouse.py (outside workspace)

The _write_trade_event function maps entry_price from slot.to_dict() to both the entry_price and exit_price columns. The actual exit fill price (available on the VenueEvent object) is never written to the exit_price column.

Result: Every trade_events row has exit_price == entry_price. The exit_price column is a dead column — always contains the entry price, never the actual fill.

Severity: High — data loss to DB for the most important trade metric.

G22: `active_leg_index` → `entry_bar` semantic mis-mapping

File: pink_clickhouse.py (outside workspace)

"entry_bar": int(slot_dict.get("active_leg_index", 0) or 0),

active_leg_index tracks the exit-leg-ratios cursor (which leg of a multi-leg exit we're on), not a bar count. It is 0 at position open and 1 after the first exit leg; neither value represents bars held. The entry_bar column stores the wrong concept.

Severity: Medium — column contains semantically meaningless data.

G23: `capital_before` arithmetic reconstruction absorbs cross-slot PnL

File: pink_clickhouse.py (outside workspace)

capital_before = capital_after - pnl_leg

capital_before is reconstructed by subtracting the current leg's PnL from the current capital. In a multi-slot system, other slots' PnL changes between legs are absorbed into capital_before. The column is always wrong in multi-slot scenarios because capital_after reflects total PnL from all slots, not just the leg being recorded.

Severity: Medium — wrong capital_before for multi-slot trading.

G24: Recovery `trade_reconstruction` always has `trade_id=""`

File: pink_clickhouse.py (outside workspace)

The persist_recovery_state function passes kernel.snapshot()["account"] (an account dict with keys capital, equity, realized_pnl, ...) where a slot dict is expected. The trade_id key does not exist on the account dict. The recovery_state row always has trade_id="".

Severity: Medium — recovery data is not associable with any trade.

G25: `seen_event_ids`, `exit_leg_ratios`, `VenueOrder`, `metadata` not in flat ClickHouse tables

These fields are:

Present on the Python TradeSlot ✅
Transmitted through Zinc shared memory ✅
Stored in Hazelcast ✅
Stored in ClickHouse dita_kernel_debug (full JSON) ✅
NOT extracted into main ClickHouse flat tables position_state, trade_events, trade_exit_legs ❌

Data exists at the source, travels through the pipeline, hits the debug journal — but is lost in the main analytical tables.

Severity: Low (data exists in debug journal if needed for reconstruction)

G26: `_safe_float` silently converts NaN/None/Inf to 0.0

File: utils.py:15

def _safe_float(v, default=0.0):
    try:
        f = float(v)
        if not math.isfinite(f):
            return default
        return f
    except (TypeError, ValueError, OverflowError):
        return default

Used in multiple ClickHouse writers. Silently converts NaN/Inf/parsing errors to 0.0. No diagnostic emitted when a non-finite value reaches the persistence layer — data silently zeroed.

Severity: Low (safe default but silent corruption)
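Keeping the safe default while making the substitution visible is a small change. A sketch, assuming standard `logging` is acceptable in the persistence layer; `safe_float_logged`, the `field` argument, and the logger name are hypothetical.

```python
import logging
import math

log = logging.getLogger("pink.persistence")  # hypothetical logger name


def safe_float_logged(v, default=0.0, field="?"):
    """Like _safe_float, but emits a warning whenever a value is replaced."""
    try:
        f = float(v)
    except (TypeError, ValueError, OverflowError):
        log.warning("non-numeric %s=%r replaced by %r", field, v, default)
        return default
    if not math.isfinite(f):
        log.warning("non-finite %s=%r replaced by %r", field, v, default)
        return default
    return f
```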

Lifecycle & Resource Management

G27: `build_launcher_bundle` has no exception safety — prior resources leak

File: launcher.py:264-300

def build_launcher_bundle(...):
    control_plane = _build_control_plane(...)
    projection = build_projection(...)
    zinc_plane = _build_zinc_plane(...)
    venue = _build_venue(...)
    kernel = ExecutionKernel(...)  # ← if THIS fails, everything above leaks

If any step after the first raises, all previously built resources leak:

RealZincPlane created → _build_venue() fails → 3 shared memory regions orphaned
RealZincControlPlane created → _build_zinc_plane() fails → 1 shared memory region orphaned
BingxVenueAdapter created → ExecutionKernel.__init__() fails → HTTP connection leaked

No try/finally anywhere in the builder. The init order is also optimized for forward construction, not backward cleanup.

Severity: High — shared memory leak on any build failure.
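The standard library's `contextlib.ExitStack` gives builder-style code exception safety with little ceremony. The sketch below is generic: `build_bundle` and the `factories` list are hypothetical stand-ins for the real build steps, and it assumes every built resource exposes a `close()` method.

```python
from contextlib import ExitStack


def build_bundle(factories):
    """Build resources in order; close those already built if a later step fails.

    `factories` is a list of zero-argument callables returning objects with close().
    """
    with ExitStack() as stack:
        built = []
        for factory in factories:
            resource = factory()
            stack.callback(resource.close)  # runs in reverse order on failure
            built.append(resource)
        stack.pop_all()  # success: hand ownership to the caller, skip cleanup
        return built
```

If the third factory raises, the first two resources are closed in reverse construction order before the exception propagates; on success nothing is closed and the caller owns the resources.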

G28: `RealZincPlane` and `RealZincControlPlane` have no `__del__`

When close() is not called (exception in builder, forgotten cleanup, GC during shutdown), the shared memory regions opened by RealZincPlane (3 regions) and RealZincControlPlane (1 region) are orphaned on the OS. They persist in /dev/shm/ (or platform equivalent) until system reboot.

Python's __del__ is unreliable (not called on SIGKILL, not called if the object is part of a cycle without a GC run), but its absence means even normal garbage collection can't clean up.

Severity: High — shared memory leaks.
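`weakref.finalize` is generally a more dependable fallback than `__del__`, since it also runs at interpreter exit and fires at most once. A minimal sketch, assuming the region object exposes `close()`; `RegionOwner` is a hypothetical wrapper, not one of the real plane classes.

```python
import weakref


class RegionOwner:
    """Hypothetical wrapper that owns one closeable region handle."""

    def __init__(self, region):
        self._region = region
        # finalize runs region.close at most once: on explicit close(),
        # when this object is collected, or at interpreter exit.
        self._finalizer = weakref.finalize(self, region.close)

    def close(self):
        self._finalizer()  # no-op after the first call

    @property
    def closed(self):
        return not self._finalizer.alive
```

This does nothing for SIGKILL, where no Python code runs at all, but it covers forgotten cleanup and makes repeated `close()` calls harmless.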

G29: Zero signal handlers — no cleanup on SIGTERM/SIGINT

$ grep -rn "signal\|SIGTERM\|SIGINT\|atexit" *.py  # ZERO matches

When SIGTERM or SIGINT arrives:

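SIGTERM terminates the process immediately through the OS default action, with no Python-level cleanup; SIGINT raises KeyboardInterrupt, which unwinds the stack but finds no cleanup code on the way out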
No DITAv2LauncherBundle.close() is called
No ExecutionKernel.__del__ is called (CPython may run GC on normal exit but not reliably)
All shared memory (RealZincPlane, RealZincControlPlane) is orphaned
In-flight BingX HTTP calls are interrupted mid-stream
Rust kernel handle is leaked

Severity: High
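A common pattern is to convert SIGTERM into a normal interpreter exit so that `finally` blocks and `atexit` hooks run. The sketch below assumes it is called from the main thread (a requirement of `signal.signal`); `install_shutdown_hooks` and `close_fn` are hypothetical names, with `close_fn` standing in for something like the bundle's `close()`.

```python
import atexit
import signal
import sys


def install_shutdown_hooks(close_fn):
    """Route SIGTERM through normal interpreter exit so close_fn runs once."""
    state = {"done": False}

    def run_once():
        if not state["done"]:
            state["done"] = True
            close_fn()

    def on_sigterm(signum, frame):
        # SystemExit unwinds the stack, so finally blocks and atexit hooks run.
        sys.exit(128 + signum)

    atexit.register(run_once)
    signal.signal(signal.SIGTERM, on_sigterm)
    return run_once
```

SIGINT already surfaces as KeyboardInterrupt, so once the `atexit` hook is registered it is covered as well, provided the exception is allowed to propagate to interpreter exit.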

G30: `ExecutionKernel` has no `close()` — relies on `__del__` for Rust handle cleanup

ExecutionKernel has __del__ which calls _get_rust().destroy(backend). No close() method. DITAv2LauncherBundle.close() never touches the kernel, so the Rust handle is only freed by GC at an unpredictable time.

If any code holds a stale _backend pointer, the handle dangles when GC runs. If __del__ is suppressed (e.g., during interpreter shutdown with cyclic references), the Rust handle leaks permanently.

Fix: Add close() to ExecutionKernel, call it from DITAv2LauncherBundle.close().

Severity: High

G31: `projection` (Hazelcast) never closed

build_projection() returns a HazelcastProjection which holds a Hazelcast client connection. No close() or disconnect() method exists on the projection, projector, or row writer. DITAv2LauncherBundle.close() doesn't touch the projection. The Hazelcast client connection leaks on shutdown.

Severity: Medium

G32: `_maybe_close()` only calls the first method found — `break` skips the second

File: launcher.py:233-243

for method_name in ("close", "disconnect"):
    method = getattr(obj, method_name, None)
    if method is None:
        continue
    try:
        result = method()
    except TypeError:
        continue
    if inspect.isawaitable(result):
        try:
            asyncio.run(result)
        except RuntimeError:
            pass
    break  # ← ONLY calls the FIRST found method, never both

If an object has both close() and disconnect(), only close() is called. disconnect() is silently skipped. Also: asyncio.run(result) silently swallows RuntimeError when a running event loop exists — the coroutine is never executed.

Currently no object has both, but the pattern is fragile.

Severity: Low
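A less fragile variant calls every teardown method that exists and keeps going when one fails. The sketch handles synchronous methods only (awaitable results, which the original routes through asyncio.run, are deliberately left out); `close_all` is a hypothetical name.

```python
def close_all(obj, method_names=("close", "disconnect")):
    """Call every listed teardown method that exists on obj.

    Returns (called, errors): the names that succeeded and (name, exception)
    pairs for those that raised.
    """
    called, errors = [], []
    for name in method_names:
        method = getattr(obj, name, None)
        if not callable(method):
            continue
        try:
            method()
            called.append(name)
        except Exception as exc:  # keep going so later methods still run
            errors.append((name, exc))
    return called, errors
```

Returning the errors instead of swallowing them lets the caller decide whether a failed teardown should be logged or re-raised.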

G33: `close()` is not idempotent for RealZinc components

RealZincPlane.close() and RealZincControlPlane.close() call their Zinc region's close() method. If called twice, the second call operates on an already-closed region and will likely crash inside Zinc's shared memory code.

No references are nulled after close: DITAv2LauncherBundle.close() calls _maybe_close(), which leaves self.venue, self.zinc_plane, and self.control_plane pointing at the closed objects. A double close() is therefore unsafe.

Severity: Low

G34: No context manager on `DITAv2LauncherBundle`

DITAv2LauncherBundle has no __enter__/__exit__. Users must manually call close(). No with pattern exists anywhere in the source for lifecycle management. No __del__ fallback on the bundle either.

Severity: Low (ergonomic, not a leak source if caller follows the pattern)
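Context-manager support needs only two methods. In this sketch `Bundle` is a hypothetical stand-in for DITAv2LauncherBundle that owns a list of closeable resources; the real bundle's fields and close order are not reproduced here.

```python
class Bundle:
    """Hypothetical stand-in for a launcher bundle owning closeable resources."""

    def __init__(self, *resources):
        self._resources = list(resources)

    def close(self):
        # Close in reverse construction order; popping each resource makes
        # a second close() a no-op.
        while self._resources:
            self._resources.pop().close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # never swallow the caller's exception
```

Callers can then write `with Bundle(...) as bundle:` and get cleanup on both normal exit and exceptions.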

G35: `BingxVenueAdapter.connect()` exists but is never called by the launcher

BingxDirectExecutionAdapter has a connect() method that initializes the lifetime HTTP client. BingxVenueAdapter has connect() that calls _call_backend("connect"). Neither is called in build_launcher_bundle() or _build_venue(). If the adapter's submit_intent() relies on a connected client, it initializes lazily; the explicit connect path is never invoked and is effectively dead code.

Severity: Informational

G36: Only one `try/finally` in the entire codebase

The only try/finally is _RustKernelLib._take_string() (rust_backend.py:140-143) which frees the Rust C string. All other resource management uses try/except with no finally.

No cleanup is guaranteed on exception:

build_launcher_bundle() — no cleanup on failure
process_intent() — no cleanup of partial slot state on venue event exception
on_venue_event() — no cleanup on FFI failure
_set_slot() — no cleanup on projection or Zinc write failure

Severity: High (across all layers)

Pass 4 Summary

#	Flaw	Layer	Severity
G1	EXIT_RESIDUAL action missing from Rust KernelCommandType	Rust	Critical
G2	`into_c_string` unwrap() panics on NUL byte	Rust	Critical
G3	EXIT hardcodes prev_state=POSITION_OPEN, allows backward FSM transition	Rust	Critical
G4	`consume_exit_leg` stale `all_legs_done` variable — wrong branch after last leg	Rust	Critical
G5	`realized_pnl` unbounded f64 overflow to inf	Rust	High
G6	`mark_price` unbounded unrealized_pnl — no result guard	Rust	High
G7	ENTER no is_finite() guard on target_size	Rust	High
G8	`reconcile_slots_json` no dedup or bounds validation	Rust	High
G9	`exchange_order_id` update targets wrong order — exit cancel broken	Rust	High
G10	CANCEL diagnostic always says NO_ACTIVE_EXIT_ORDER	Rust	High
G11	`apply_fill` overwrites intended_size with slot.size	Rust	Medium
G12	No max leverage cap enforced by kernel	Rust	Medium
G13	`resolve_slot` fallback returns unwrap_or(0) — misroutes events	Rust	Medium
G14	`commit_slot` silently ignores out-of-bounds slot_id	Rust	Medium
G15	Zero `__post_init__` validators on all config dataclasses	Config	High
G16	DITA_V2_DEBUG_CLICKHOUSE defaults to True when unset	Config	Info
G17	String config fields — Zinc region injection risk	Config	Medium
G18	`exit_leg_ratios` no sum-to-1 validation	Config	Low
G19	RealZincControlPlane.read() no sequence check — torn-read risk	Config	Low
G20	ClickHouse journal strategy/db env vars — SQL injection risk	Config	Low
G21	entry_price used as exit_price in trade_events — data loss	Persistence	High
G22	active_leg_index → entry_bar semantic mis-mapping	Persistence	Medium
G23	capital_before arithmetic absorbs cross-slot PnL	Persistence	Medium
G24	Recovery trade_reconstruction always has trade_id=""	Persistence	Medium
G25	seen_event_ids, exit_leg_ratios, VenueOrder, metadata not in flat CH tables	Persistence	Low
G26	_safe_float silently converts NaN/None/Inf to 0.0	Persistence	Low
G27	build_launcher_bundle no exception safety — prior resources leak	Lifecycle	High
G28	RealZincPlane/RealZincControlPlane no __del__ — SHM orphaned	Lifecycle	High
G29	Zero signal handlers — no cleanup on SIGTERM/SIGINT	Lifecycle	High
G30	ExecutionKernel has no close() — relies on __del__ for Rust handle	Lifecycle	High
G31	Hazelcast projection never closed	Lifecycle	Medium
G32	_maybe_close() break skips second method	Lifecycle	Low
G33	close() not idempotent for RealZinc components	Lifecycle	Low
G34	No context manager on DITAv2LauncherBundle	Lifecycle	Low
G35	BingxVenueAdapter.connect() never called	Lifecycle	Info
G36	Only one try/finally in entire codebase	Lifecycle	High

Pass 4 Severity Distribution

Severity	Count
Critical	4 (G1, G2, G3, G4)
High	13 (G5-G10, G15, G21, G27, G28, G29, G30, G36)
Medium	9 (G11-G14, G17, G22, G23, G24, G31)
Low	8 (G18, G19, G20, G25, G26, G32, G33, G34)
Info	2 (G16, G35)

Combined Catalog (All 4 Passes)

Pass	Focus	Count	Critical	High	Medium	Low	Info
A	Architectural	15	0	2	0	2	11
T	Threading/Atomicity	9	1	3	3	2	0
E	E2E Trace	26	0	4	10	11	1
F	Deep E2E (Pass 3)	30	0	1	8	17	4
G	Domain Scans (Pass 4)	36	4	13	9	8	2
Total		116	5	23	30	40	18
