PINK DITAv2 — End-to-End Trace & Flaw Analysis
Analysis date: 2026-05-31
Method: Full-trace static analysis of every file, every data path, and every boundary crossing in the PINK execution pipeline. No test execution.
System scope: 34 active source files, ~12,000 lines across Rust kernel, Python bridge, venue adapter, runtime, and persistence.
Central flaw registry: PINK_DITAv2_FLAW_ANALYSIS_2026-05-31.md contains the combined catalog of all 116 flaws (A, T, E, F, G series) with severity distribution and cross-references. This file provides the deep E2E trace context — read the central registry for the master list.
E2E Data Flow (One Call)
Every E2E path in the PINK system traces through this sequence. Each numbered step below is a site where data crosses a module boundary and can be lost, mangled, or misinterpreted.
PinkDirectRuntime.step() # R1: policy cycle entry
├─ pump_venue_events() # R2: drain async fills
├─ kernel.snapshot()["account"] # R3: read capital
├─ kernel.slot(0) # R4: read slot state
├─ decision_engine.decide() # R5: policy-layer ENTER/EXIT
├─ intent_engine.plan() # R6: intent sizing
├─ _decision_to_kernel_intent() # R7: Decision → KernelIntent
├─ kernel.process_intent(kernel_intent) # R8: KERNEL BOUNDARY
│ ├─ rust_backend._intent_to_payload() # R8a: KernelIntent → JSON
│ ├─ _RustKernelLib.process_intent() # R8b: JSON → C FFI
│ │ └─ Rust process_intent() # R8c: FSM mutates TradeSlot
│ ├─ venue.submit(intent) # R9: VENUE BOUNDARY
│ │ ├─ bingx_venue._legacy_intent() # R9a: KernelIntent → LegacyIntent
│ │ ├─ BingxDirectExecutionAdapter # R9b: HTTP POST /trade/order
│ │ │ .submit_intent()
│ │ └─ bingx_venue._events_from_submit() # R9c: receipt → VenueEvent[]
│ └─ on_venue_event(event) # R10: FEEDBACK BOUNDARY
│ ├─ _RustKernelLib → Rust FSM # R10a: C FFI → FSM transition
│ ├─ account.settle(delta) # R10b: incremental PnL settlement
│ └─ persistence writes # R10c: ClickHouse / Zinc / HZ
├─ kernel.snapshot()["account"] # R11: read final capital
└─ persistence.persist_step() # R12: PERSISTENCE BOUNDARY
Layer 1: Policy Cycle Entry (pink_direct.py:422)
E1: step() calls pump_venue_events() every cycle unconditionally
pink_direct.py:436
await self.pump_venue_events(snapshot, market_state=market_state)
This is called before reading slot/account state for the policy decision.
The pump calls venue.reconcile(), which for BingxVenueAdapter does up to 5 HTTP
requests (balance, positions, open orders, plus history if include_history).
For MARKET-only workflows, no resting orders exist, so reconcile() returns
empty events every time. But the HTTP calls still happen. On BingX VST with a
~10 req/s limit and a 5s policy cycle, 5 requests per cycle burns 1 req/s just
to learn "nothing changed." Add the actual trade HTTP calls, and the budget is
tight.
Flaw: E1 — unconditional exchange poll wastes rate limit.
Already documented as A10, but worse when traced E2E: each pump_venue_events
calls venue.reconcile() → _backend_snapshot() → parallel asyncio.gather
of 3 HTTP GETs. The _refresh_exchange_state at bingx_direct.py:281-352
always fetches balance + positions + openOrders concurrently. Even when
include_history=False (which it is for the pump), that's 3 HTTP calls
every policy cycle regardless of whether any orders are resting.
Severity: Medium. Wasteful but not destructive on testnet.
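One possible mitigation, sketched under assumptions: skip the poll while nothing is resting, but still force one every few cycles as a safety net. PollGate, max_idle_cycles, and the has_resting_orders flag are hypothetical names and do not exist in the PINK code.

```python
from dataclasses import dataclass


@dataclass
class PollGate:
    """Decides whether a policy cycle should hit the exchange (hypothetical helper)."""

    max_idle_cycles: int = 6  # assumed: force a poll after this many skipped cycles
    _skipped: int = 0         # consecutive cycles skipped so far

    def should_poll(self, has_resting_orders: bool) -> bool:
        # Always poll while an order can still fill asynchronously,
        # and poll anyway once enough idle cycles have been skipped.
        if has_resting_orders or self._skipped >= self.max_idle_cycles:
            self._skipped = 0
            return True
        self._skipped += 1
        return False


gate = PollGate(max_idle_cycles=2)
decisions = [gate.should_poll(False) for _ in range(6)]
```

With max_idle_cycles=2, an idle runtime polls on one cycle in three, so the 3 GETs per 5s cycle drop from 0.6 req/s to 0.2 req/s.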
E2: kernel.snapshot()["account"] returns a fresh dict, not a live view
pink_direct.py:437
acc = self.kernel.snapshot()["account"]
ExecutionKernel.snapshot() at rust_backend.py:740-752 builds a dict from
kernel state at call time. The decision/intent engines then consume this
snapshot. Between the snapshot and process_intent() (line 523), another
caller (or the same runtime in a concurrent cycle) could advance the kernel
state, making the decision based on stale capital.
Flaw: E2 — TOCTOU between capital snapshot and intent execution.
The context.capital read at line 437 is used at line 523 for the ENTER
safety guard (_unsafe_entry_reason) and possibly by the decision/intent
engines. If capital changes between these two points (e.g. an async fill
arrives via a concurrent test-HTTP path), the guard uses stale capital.
Severity: Low in single-threaded deployment. Critical under concurrency.
Layer 2: Decision/Intent Bridging (pink_direct.py:79-115)
E3: _decision_to_kernel_intent drops order_type and limit_price
pink_direct.py:79-115
def _decision_to_kernel_intent(decision, intent, slot_id=0):
    return KernelIntent(
        ...
        # order_type and limit_price are NOT SET here
    )
KernelIntent has order_type="MARKET" and limit_price=0.0 as defaults,
so MARKET orders work correctly. But the runtime never sets these fields
from the policy layer. If decision or intent ever carries order_type
or limit_price, it's silently dropped because the bridge doesn't map them.
Flaw: E3 — LIMIT support in runtime is dead code.
The order_type/limit_price fields in KernelIntent and the LIMIT payload
building in bingx_direct.py lines 384-398 are unreachable from the runtime.
The only path that can set them is direct KernelIntent(...) construction
in tests (_build_pink_bodies.py style scenarios). The _decision_to_kernel_intent
bridge must be patched when a policy engine needs to emit LIMIT orders.
Severity: Medium. Blocks any production path to LIMIT orders.
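A minimal sketch of what the missing mapping could look like, assuming the policy-layer intent exposes order_type and limit_price attributes (it may not). KernelIntentSketch and bridge_order_fields are hypothetical stand-ins, not the real KernelIntent or bridge.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass(frozen=True)
class KernelIntentSketch:
    """Stand-in for KernelIntent; carries only the two fields under discussion."""
    order_type: str = "MARKET"
    limit_price: float = 0.0


def bridge_order_fields(intent: object) -> KernelIntentSketch:
    # getattr with defaults preserves MARKET behaviour when the policy layer
    # does not set the (assumed) attributes.
    order_type = str(getattr(intent, "order_type", "MARKET") or "MARKET").upper()
    limit_price = float(getattr(intent, "limit_price", 0.0) or 0.0)
    if order_type == "LIMIT" and limit_price <= 0.0:
        raise ValueError("LIMIT intent needs a positive limit_price")
    return KernelIntentSketch(order_type=order_type, limit_price=limit_price)


market = bridge_order_fields(SimpleNamespace())
limit = bridge_order_fields(SimpleNamespace(order_type="limit", limit_price=0.0836))
```

Rejecting a LIMIT with a non-positive price at the bridge keeps the error on the Python side instead of at the exchange.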
E4: _exit_intent_from_slot trusts slot.size but slot may be stale
pink_direct.py:398-420
def _exit_intent_from_slot(self, kernel_intent):
    try:
        slot_size = float(self.kernel.slot(int(kernel_intent.slot_id)).size or 0.0)
    except Exception:
        slot_size = 0.0
    ...
    exit_size = min(policy_size, slot_size) if policy_ok else slot_size
Reads slot.size fresh from the Rust kernel at call time, then uses it to
cap the exit size. Between this read and the process_intent call that
actually executes the EXIT (line 523), the slot can be modified by
pump_venue_events (line 436) or a concurrent cycle. If a partial fill
arrived between the slot read and the EXIT, the exit size could be wrong.
Flaw: E4 - TOCTOU between exit sizing and exit execution.
Same class as E2 but for exit size rather than capital. If the pump drained a partial fill between R4 (slot read) and R8 (process_intent), the EXIT requests a size based on the pre-pump remaining size. The kernel caps it at the actual remaining size, so this is self-correcting, but the intent payload carries the wrong size metadata.
Severity: Low. Self-correcting at kernel level.
Layer 3: Kernel Bridge — Rust FSM Entry (rust_backend.py)
E5: JSON serialization round-trip loses numeric precision
rust_backend.py:460-485 (_intent_to_payload)
KernelIntent fields like reference_price, target_size, leverage are
Python floats. They're serialized to JSON text, sent through C FFI, parsed
by serde_json into Rust f64, then serialized back to JSON, parsed by Python
json.loads(). Each serialization step can introduce precision loss:
# Python float → JSON: 0.1 → "0.1" → Rust f64: 0.10000000000000000555
# Rust f64 → JSON: → serde_json may print "0.10000000000000001"
# Python json.loads → 0.10000000000000001
For prices (TRXUSDT at ~$0.08), a 1e-16 relative error is negligible. For
PnL accumulation over thousands of trades at 9x leverage, the error can grow
to cents or dollars. The |Δcapital − realized| < 1e-9 assertion in tests
would catch gross errors but not sub-cent accumulation.
Flaw: E5 - JSON serialization precision drift over long runs.
Severity: Low. Not a practical concern for the current deployment scale.
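To see where drift could come from, the Python half of the round trip can be checked separately from accumulation error. This sketch covers only json.dumps/json.loads in Python; the serde_json half is not exercised here, and the values are illustrative.

```python
import json
import math

# Python side of the round trip: dumps uses repr, loads parses back to float.
values = [0.1, 0.0834, 1 / 3, 9.0]
python_round_trip_exact = all(json.loads(json.dumps(v)) == v for v in values)

# Accumulation drift: a naive running total versus compensated summation.
pnl_legs = [0.1] * 10
naive_total = 0.0
for leg in pnl_legs:
    naive_total += leg                 # plain float addition, rounds at each step
exact_total = math.fsum(pnl_legs)      # tracks the low-order bits that += drops
drift = abs(naive_total - exact_total)
```

If the same check holds on the Rust side, any long-run drift comes from repeated float addition rather than from serialization, and a compensated or Decimal accumulator would be the place to address it.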
E6: _RustKernelLib is a global singleton — shared across all kernels
rust_backend.py:40-45
_RUST: _RustKernelLib | None = None

def _get_rust() -> _RustKernelLib:
    global _RUST
    if _RUST is None:
        _RUST = _RustKernelLib()
    return _RUST
The _RustKernelLib singleton loads the .so shared library once and
provides FFI functions. Each ExecutionKernel instance gets its own
KernelHandle via _get_rust().create(max_slots). The FFI functions take
the handle as the first argument, so multiple kernels are isolated at the
Rust level.
However, the singleton means ALL kernels share the same ctypes function
pointer table. If a second kernel is created and the first is destroyed,
KernelHandle of the first becomes a dangling pointer. Calling any FFI
function on the destroyed kernel's handle is use-after-free.
Flaw: E6 — No protection against use-after-free on kernel destroy.
Already documented as T7. Worth re-emphasizing in the E2E trace because the
test infrastructure creates and destroys kernels frequently (fresh-kernel
reconcile tests, each _build_rb() call in scenario wrappers).
Severity: High. Use-after-free in C FFI is memory corruption.
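One way to turn the use-after-free into a Python exception is to route every FFI call through a guard that forgets the raw handle on destroy. KernelHandleGuard is a hypothetical wrapper and destroy_fn stands in for whatever FFI destroy function the library exposes.

```python
class KernelHandleGuard:
    """Wraps a raw handle and refuses use after destroy (hypothetical wrapper)."""

    def __init__(self, raw_handle, destroy_fn):
        self._raw = raw_handle
        self._destroy_fn = destroy_fn

    @property
    def raw(self):
        # Every FFI call site would read the handle through this property.
        if self._raw is None:
            raise RuntimeError("kernel handle used after destroy")
        return self._raw

    def destroy(self) -> None:
        # Idempotent: the underlying destroy runs at most once.
        if self._raw is not None:
            raw, self._raw = self._raw, None
            self._destroy_fn(raw)


destroyed = []
guard = KernelHandleGuard(0xBEEF, destroyed.append)
guard.destroy()
guard.destroy()  # second call is a no-op
```

After destroy(), reading guard.raw raises RuntimeError instead of handing a dangling pointer to C.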
Layer 4: Rust Kernel FSM (lib.rs:728)
E7: ENTER handler silently allows re-entry with same trade_id
lib.rs:740-745
if !slot.is_free() && !slot.trade_id.is_empty() && slot.trade_id != intent.trade_id {
    return SLOT_BUSY;
}
If slot.trade_id == intent.trade_id, the ENTER is accepted even if the
slot is not free (e.g., POSITION_OPEN with an active position). This is by
design — it lets the same trade_id re-enter after the slot was partially
reconciled or restored from a snapshot. But it also means:
- EXIT sets slot.closed=true and transitions to CLOSED
- A new ENTER with the same trade_id re-enters the CLOSED slot
- The slot resets slot.closed=false, slot.size=0.0, slot.initial_size=0.0
- Kernel now thinks the trade is new, but the Rust indexes still have the old trade_id pointing to slot 0
Downstream effect: After a re-entry with the same trade_id, the
active_trade_index[trade_id] still correctly points to slot 0. But the
old VenueOrder in client_order_index and venue_order_index is still
present until the new entry fills and creates new orders. A reconcile event
addressed to the old venue_client_id could stomp on the new trade.
Flaw: E7 — Re-entry with same trade_id leaves stale index entries.
Severity: Low. The rebuild_indexes() call in commit_slot() rebuilds
from scratch, so stale entries are cleared on the first write.
E8: EXIT handler uses initial_size not current size
lib.rs:770-775
let exit_ratio = slot.next_exit_ratio();
let base_size = if slot.initial_size > 0.0 { slot.initial_size } else { slot.size };
let exit_size = (base_size * exit_ratio).max(0.0);
Already documented as A1. In the E2E trace, this is the single most impactful execution flaw. A concrete scenario:
- Enter: size=1.0, initial_size=1.0, exit_leg_ratios=(0.5, 0.5, 1.0).
- EXIT leg 0: requests 1.0 * 0.5 = 0.5. Slot goes to 0.5.
- EXIT leg 1: requests 1.0 * 0.5 = 0.5. Slot goes to 0.0. active_leg_index advances to 2. exit_leg_ratios.len() is 3, so all_legs_done = (2 >= 3) = false. The slot stays at POSITION_OPEN with size=0.0 and closed=false.
- EXIT leg 2 (ratio 1.0): exit_size = 1.0 * 1.0 = 1.0, but the slot is at 0.0. slot.is_free() is false because fsm_state=POSITION_OPEN is not in {IDLE, CLOSED}; the size is not part of that check. So the ENTER guard !slot.is_free() blocks re-entry, while the EXIT guard slot.is_free() || slot.closed || size <= 0.0 triggers on size <= 0.0 and returns NO_OPEN_POSITION.
- Slot is stuck forever. No operation can advance it.
Severity: High. Concrete, reproducible, and not caught by any test.
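The scenario can be replayed with a toy model of the guards as described in this section. SlotModel is an illustrative Python stand-in, not the Rust TradeSlot, and it models only the EXIT path.

```python
from dataclasses import dataclass


@dataclass
class SlotModel:
    """Toy model of the guards described above, not the Rust TradeSlot."""
    fsm_state: str = "IDLE"
    size: float = 0.0
    initial_size: float = 0.0
    closed: bool = False
    leg_index: int = 0
    ratios: tuple = (0.5, 0.5, 1.0)

    def is_free(self) -> bool:
        return self.fsm_state in {"IDLE", "CLOSED"}  # size is not consulted

    def exit_leg(self) -> str:
        if self.is_free() or self.closed or self.size <= 0.0:
            return "NO_OPEN_POSITION"
        base = self.initial_size if self.initial_size > 0.0 else self.size
        requested = base * self.ratios[self.leg_index]
        self.size -= min(requested, self.size)   # capped at the remaining size
        self.leg_index += 1
        if self.leg_index >= len(self.ratios):   # all_legs_done
            self.fsm_state, self.closed = "CLOSED", True
        return "OK"


slot = SlotModel(fsm_state="POSITION_OPEN", size=1.0, initial_size=1.0)
results = [slot.exit_leg() for _ in range(3)]
```

After the three calls the model sits at POSITION_OPEN with size 0.0 and one leg outstanding: not free, not closed, and unable to exit.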
E9: CANCEL handler returns diagnostic even when nothing happened
lib.rs:795-810
if matches!(intent.action, KernelCommandType::CANCEL) {
    let has_cancellable_exit = slot.active_exit_order.is_some();
    let has_cancellable_entry = slot.active_entry_order.is_some()
        && matches!(slot.fsm_state, ENTRY_WORKING | ORDER_REQUESTED | ORDER_SENT | IDLE);
    if !has_cancellable_exit && !has_cancellable_entry {
        return KernelResult {
            outcome: KernelOutcome {
                accepted: false,
                diagnostic_code: NO_ACTIVE_EXIT_ORDER,
                ...
            },
            ...
        };
    }
    return KernelResult {
        outcome: KernelOutcome {
            accepted: true,
            ...
        },
        ...
    };
}
Two issues:
- When neither is cancellable, the diagnostic is NO_ACTIVE_EXIT_ORDER even if the actual reason is "no active entry order either" or "slot is already IDLE". The diagnostic is misleading.
- When at least one IS cancellable, the Rust kernel returns accepted=true but does not mutate the slot at all; it returns immediately with the slot as-is. The actual cancel (HTTP call + FSM transition) happens in the Python bridge. The Rust kernel's "accept" just means "yes you may try to cancel this", not "the cancel is complete."
This disconnect means: if the Python bridge's venue.cancel() fails (HTTP
error), the Rust kernel has already returned accepted=true for a cancel
that never happened. The caller sees accepted=true but the slot state
hasn't changed.
Flaw: E9 — Rust CANCEL "accepts" before Python actually cancels.
Severity: Medium. The outcome.accepted boolean is misleading for CANCEL.
E10: apply_fill entry branch double-sets active_entry_order
lib.rs:1330-1390
// First set, at the top of the entry branch:
slot.active_entry_order = Some(VenueOrder {
    ...
    filled_size: fill_size,
    status: if partial { PARTIALLY_FILLED } else { FILLED },
    ...
});
// ... then later for full fill:
if !partial {
    slot.fsm_state = TradeStage::POSITION_OPEN;
    slot.active_entry_order = Some(VenueOrder { // SECOND SET
        ...
        filled_size: slot.size, // uses updated slot.size
        ...
    });
}
The entry branch sets active_entry_order at the top with filled_size from
the event, then for a FULL_FILL, sets it again with filled_size = slot.size
(which may have been updated by slot.initial_size = fill_size above). The
first VenueOrder's intended_size is from the event, the second uses
slot.size. Both are correct in isolation, but the double-write is wasteful.
More importantly, for a PARTIAL_FILL entry, the first set is the ONLY set.
If a second PARTIAL_FILL arrives for the same order, the entry branch at
line 1334 checks slot.active_entry_order.is_some() which is true (set by
the first partial), but the FSM state is ENTRY_WORKING (also set by first
partial). The condition at line 1334-1338 matches ENTRY_WORKING, so the
second partial enters the entry branch again. But fill_size is the event's
filled_size — the total filled, not the incremental amount.
Flaw: E10 — Second PARTIAL_FILL on entry overwrites, doesn't accumulate.
let fill_size = if event.filled_size > 0.0 {
    event.filled_size // ← TOTAL filled, not incremental
} else {
    event.size
}.max(0.0);
slot.active_entry_order = Some(VenueOrder {
    ...
    filled_size: fill_size, // ← overwrites previous filled_size
    ...
});
slot.initial_size = slot.initial_size.max(fill_size); // ← OK, uses max
slot.size = fill_size; // ← OVERWRITES previous size with total
On a RESTING LIMIT entry that partially fills in two events:
- Event 1: filled_size=0.3 → slot.size=0.3, entry_order.filled_size=0.3
- Event 2: filled_size=0.7 → slot.size=0.7, entry_order.filled_size=0.7
The filled_size on the VenueOrder correctly reflects the cumulative fill
(0.7), and slot.size moves from 0.3 to 0.7, an increment of 0.4. Overwriting
is correct here only because the venue reports cumulative filled_size, not
incremental. bingx_venue._events_from_submit() at line ~480 reads:
filled_size = _row_float(ack_row, "executedQty", ...)
executedQty on BingX is cumulative, so the second event's filled_size=0.7
means "total filled across all fills = 0.7." The kernel sets slot.size = 0.7,
which is the total position size. The size handling is therefore correct.
But the second fill event has slot.entry_price overwritten by the new
fill's price. If the first fill was at 0.0834 and the second at 0.0836, the
slot's entry_price becomes 0.0836 — losing the blended average. For a LIMIT
entry with two partial fills at different prices, the entry_price in the slot
is the price of the LAST fill, not the VWAP.
Flaw: E10a - Entry price on multi-partial entry is last-fill, not VWAP.
Severity: Low. Unrealized PnL computation uses this price. Error is small for tight spreads.
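For reference, the blended price for the two-fill example can be computed from cumulative sizes. This is a sketch that assumes each event carries the price of the incremental fill, as described above; blended_entry_price is a hypothetical helper.

```python
def blended_entry_price(prev_size, prev_price, cum_size, fill_price):
    """VWAP after a fill, given cumulative sizes (the venue reports cumulative)."""
    increment = cum_size - prev_size
    if cum_size <= 0.0 or increment <= 0.0:
        return prev_price  # nothing new filled; keep the existing price
    return (prev_size * prev_price + increment * fill_price) / cum_size


price = blended_entry_price(0.0, 0.0, 0.3, 0.0834)    # first partial: 0.3 filled
price = blended_entry_price(0.3, price, 0.7, 0.0836)  # second partial: 0.7 cumulative
```

Here the VWAP is (0.3 × 0.0834 + 0.4 × 0.0836) / 0.7 ≈ 0.083514, against the last-fill value 0.0836 that the slot currently records.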
Layer 5: Venue Adapter Boundary (bingx_venue.py)
E11: _legacy_intent() is a lossy conversion
bingx_venue.py:270-285
@staticmethod
def _legacy_intent(intent: KernelIntent) -> LegacyIntent:
    action = LegacyDecisionAction.ENTER if intent.action == E.ENTER else ...
    side = LegacyTradeSide.SHORT if intent.side == TS.SHORT else ...
    metadata = dict(intent.metadata)
    metadata["_order_type"] = getattr(intent, "order_type", "MARKET")
    metadata["_limit_price"] = float(getattr(intent, "limit_price", 0.0) or 0.0)
    return LegacyIntent(
        timestamp=intent.timestamp,
        trade_id=intent.trade_id,
        decision_id=intent.intent_id,
        asset=intent.asset,
        action=action,
        side=side,
        reason=intent.reason,
        target_size=float(intent.target_size),
        leverage=float(intent.leverage),
        reference_price=float(intent.reference_price),
        confidence=1.0,  # ← HARDCODED
        bars_held=0,  # ← HARDCODED
        exit_leg_ratios=tuple(intent.exit_leg_ratios or (1.0,)),
        metadata=metadata,
    )
confidence is always 1.0 and bars_held is always 0. The LegacyIntent
carries these to BingxDirectExecutionAdapter.submit_intent() which ignores
them (it only reads asset, side, action, target_size, leverage,
and metadata). So the hardcoded values don't affect execution — but they
affect the ExecutionReceipt and any downstream consumers that might read
receipt.confidence.
Flaw: E11 - Lossy conversion with hardcoded metadata.
Severity: Informational. No downstream consumer reads these fields.
E12: _events_from_submit() price fallback chain can lose venue price
bingx_venue.py:375-400 (_events_from_submit)
base_event = VenueEvent(
...
price=safe_float(getattr(receipt, "price", 0.0), 0.0),
...
)
# ... later for fill event:
fill_price = safe_float(
_row_float(ack_row, "avgPrice", "ap", "price", "lastFillPrice",
default=getattr(receipt, "price", 0.0)),
0.0
)
The fill price is read from ack_row (the HTTP response dict) first, falling
back to receipt.price (the ExecutionReceipt field). The ExecutionReceipt
price comes from bingx_direct.py:434:
fill_price = 0.0
for key in ("avgPrice", "avgFilledPrice", "price", "lastFillPrice", "tradePrice"):
    try: value = float(ack_row.get(key) or 0.0)
    except: value = 0.0
    if value > 0: fill_price = value; break
if fill_price <= 0 and self._state is not None:
    fill_price = next((float(...)) for ... in self._state.open_positions.values() ...)
So the price flows: BingX HTTP ack → ack_row[key] → receipt.price →
_events_from_submit() → fill_price in VenueEvent.
If ack_row has no price field AND self._state.open_positions has no matching
position (e.g., first fill on a new entry), fill_price stays 0.0. The kernel's
apply_fill at lib.rs:1397 checks if event.price > 0.0 before setting
entry_price — so a zero fill price leaves entry_price at 0.0. This means:
- The slot's entry_price stays 0.0
- realized_pnl() at lib.rs:662 checks if slot.entry_price <= 0.0 → returns 0.0
- PnL is never computed for this fill
- Capital never settles
This is very unlikely on BingX VST, which always returns avgPrice in order
acknowledgements. But on any venue that doesn't, PnL is silently zeroed.
Flaw: E12 - Zero fill price → zero entry_price → zero PnL.
Severity: Medium. Silent PnL loss if venue returns no price.
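A sketch of a price lookup that makes the "no price" case explicit instead of letting 0.0 flow into the kernel. first_positive_price is a hypothetical helper; the key list is the one quoted above from bingx_direct.py.

```python
from typing import Mapping, Optional

PRICE_KEYS = ("avgPrice", "avgFilledPrice", "price", "lastFillPrice", "tradePrice")


def first_positive_price(ack_row: Mapping[str, object],
                         fallback: float = 0.0) -> Optional[float]:
    """Return the first positive price in ack_row, else the fallback, else None."""
    for key in PRICE_KEYS:
        try:
            value = float(ack_row.get(key) or 0.0)
        except (TypeError, ValueError):
            continue  # unparseable field: try the next key
        if value > 0.0:
            return value
    return fallback if fallback > 0.0 else None


from_ack = first_positive_price({"avgPrice": "0.0834"})
missing = first_positive_price({"avgPrice": "", "price": "abc"})
```

Returning None forces the caller to decide (reject the event, re-query the venue, or log) rather than silently zeroing PnL.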
E13: _backend_snapshot() timeout returns stale data
bingx_venue.py:290-320
def _backend_snapshot(self, *, include_history=False, timeout_ms=5000.0):
    if not self._snapshot_ready.wait(timeout=timeout_ms / 1000.0):
        with self._snap_lock:
            return self._last_snapshot  # ← STALE DATA
If the previous snapshot fetch is still in-flight when a new caller arrives,
the timeout returns self._last_snapshot — which could be seconds or minutes
old. The caller (e.g., submit()) then uses this stale snapshot to compute
_filled_size_from_snapshots() — potentially comparing stale "before" data
with fresh "after" data, producing a wrong delta.
Flaw: E13 — Stale snapshot fallback causes wrong fill-size detection.
Severity: Medium. The _filled_size_from_snapshots diff can be wrong.
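One way to bound the staleness is to store a capture time with each snapshot and refuse a fallback older than a limit. TimedSnapshot, usable_snapshot, and max_age_s are hypothetical names; the adapter does not currently record snapshot age as far as this trace shows.

```python
import time
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class TimedSnapshot:
    data: Any
    taken_at: float  # time.monotonic() value at capture


def usable_snapshot(snap: Optional[TimedSnapshot], max_age_s: float,
                    now: Optional[float] = None) -> Optional[Any]:
    """Return the snapshot data only if it is at most max_age_s old."""
    if snap is None:
        return None
    now = time.monotonic() if now is None else now
    return snap.data if (now - snap.taken_at) <= max_age_s else None


snap = TimedSnapshot(data={"positions": []}, taken_at=100.0)
fresh = usable_snapshot(snap, max_age_s=2.0, now=101.0)
stale = usable_snapshot(snap, max_age_s=2.0, now=110.0)
```

A None result lets submit() skip the snapshot diff entirely instead of comparing a stale "before" against a fresh "after".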
E14: _events_from_cancel uses stale slot_id from order metadata
bingx_venue.py:485-510
VenueEvent(
...
slot_id=int(order.metadata.get("slot_id", 0) or 0),
...
)
The slot_id in the CANCEL event comes from the VenueOrder.metadata which
was set when the order was created (in Rust FSM's process_intent or
on_venue_event). If the slot was re-assigned or the kernel's slot count
changed since order creation, this slot_id is wrong. The Rust kernel's
resolve_slot() at lib.rs:610-624 would use the event's slot_id (the
stale one) and find the wrong slot.
Flaw: E14 - Cancel event carries stale slot_id from order creation.
Severity: Low. Slots are stable and never renumbered.
Layer 6: BingX Direct Adapter (bingx_direct.py)
E15: Submit sets leverage via separate HTTP call
bingx_direct.py:376-379
await self._client.signed_post(
"/openApi/swap/v2/trade/leverage",
{"symbol": symbol, "side": "BOTH", "leverage": leverage},
)
This is a POST to set exchange leverage before each order. If this call
fails (rate limit, network error), the exception at line 417 sets
status = "RATE_LIMITED" and returns a rejection — the order is NOT
submitted. But the error handling at line 417 catches BingxHttpError for
the leverage call AND the order call with the same handler. If the leverage
call fails with a non-rate-limit error (e.g., 400 Bad Request for invalid
symbol), the status is "REJECTED" and no order is placed. This is correct
behavior — but the error message doesn't distinguish "leverage set failed"
from "order submission failed."
Flaw: E15 - Leverage-set failure and order failure share error handler.
Severity: Low. Correct behavior, poor diagnostics.
E16: _format_quantity and _format_price use _instrument_step/_instrument_tick — both may be zero
bingx_direct.py:234-268
def _instrument_step(self, asset):
    instrument = self._resolve_instrument(asset)
    if instrument is not None:
        try: return Decimal(str(instrument.size_increment.as_decimal()))
        except: pass
    return Decimal("0.001")  # fallback

def _format_quantity(self, asset, quantity):
    step = self._instrument_step(asset)
    if step <= 0:
        return str(max(0.0, quantity))
    ...
If _resolve_instrument returns None (asset not in provider), step=0.001
and tick=0.01. These defaults are correct for most USDT perpetuals on
BingX VST, but may be wrong for non-standard symbols. The format functions
still produce a valid string — just possibly with wrong precision.
More concerning: _resolve_instrument at line 211-226 tries three lookup
strategies and iterates all instruments on the third. This iteration is O(n)
in the number of instruments and happens on EVERY submit_intent() call.
With 540 instruments, this is ~0.5ms, which is acceptable. But _instrument_step
and _instrument_tick each call _resolve_instrument independently, and
_instrument_venue_symbol at line 358 calls it once more, so submit_intent()
resolves the instrument three times per order: once for the venue symbol,
once for quantity, and once for price. That is three full-instrument-list
iterations per order.
Flaw: E16 - Instrument resolution called 3x per order with O(n) scan.
Severity: Low. Performance, not correctness.
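A sketch of the obvious fix: build a dict index once and resolve once per order, then read symbol, step, and tick from the same object. InstrumentSpec and InstrumentIndex are hypothetical, and the step and tick values are illustrative rather than real exchange parameters.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class InstrumentSpec:
    venue_symbol: str
    step: Decimal
    tick: Decimal


class InstrumentIndex:
    """Builds a dict once so each order does a single O(1) lookup (hypothetical)."""

    def __init__(self, specs):
        # Normalise keys so "TRX-USDT" and "trxusdt" hit the same entry.
        self._by_asset = {self._key(s.venue_symbol): s for s in specs}
        self.lookups = 0

    @staticmethod
    def _key(asset: str) -> str:
        return asset.replace("-", "").upper()

    def resolve(self, asset: str):
        self.lookups += 1
        return self._by_asset.get(self._key(asset))


index = InstrumentIndex([InstrumentSpec("TRX-USDT", Decimal("1"), Decimal("0.00001"))])
spec = index.resolve("TRXUSDT")  # one lookup, reused for symbol, step, and tick
```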
E17: Cancel uses truth-based confirmation — can mask real errors
bingx_direct.py:474-498
still_open = True
try:
    oo = await self._client.signed_get("/openApi/swap/v2/trade/openOrders", ...)
    ...
    still_open = (venue_order_id in ids) if venue_order_id else (venue_client_id in cids)
except Exception:
    still_open = None
if still_open is False:
    return {"status": "CANCELED", ...}
if str(delete_resp.get("status", "")).upper() in {"CANCELED", "CANCELLED", "SUCCESS", "OK"}:
    return {"status": "CANCELED", ...}
return {"status": delete_resp.get("status", "REJECTED"), ...}
The cancel logic:
- DELETE the order on BingX
- GET open orders to verify
- If the order is no longer open, return CANCELED
- If the DELETE response says CANCELED, return CANCELED
- Otherwise return REJECTED
If step 2's GET fails (network error, rate limit), still_open=None.
Then step 4 checks the DELETE response. If the DELETE also returned an error
(e.g., "order not found" because it was already cancelled by another caller),
status is "ERROR" or "not found" — neither matches "CANCELED".
The cancel is reported as REJECTED even though the order IS cancelled.
The bingx_venue._events_from_cancel() then emits CANCEL_REJECT instead
of CANCEL_ACK. The Rust kernel handles CANCEL_REJECT at lib.rs:1218:
KernelEventKind::CANCEL_REJECT => {
    if slot.fsm_state == TradeStage::EXIT_WORKING {
        slot.fsm_state = TradeStage::EXIT_WORKING; // no-op
    }
    diagnostic_code = KernelDiagnosticCode::CANCEL_REJECTED;
}
The slot stays in its current state (e.g., EXIT_WORKING) with no active order
(the exchange has no record of it). The slot is stuck until a manual reconcile.
Flaw: E17 - Cancel can return false REJECTED for already-cancelled orders.
Severity: Medium. Leads to stuck slot requiring manual intervention.
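The decision table above could separate "verified still open" from "could not verify". In this sketch the UNKNOWN status is an assumption: nothing in the current adapter or kernel handles it, and the caller would need to map it to a reconcile rather than to CANCEL_REJECT.

```python
from typing import Optional

CANCEL_OK = {"CANCELED", "CANCELLED", "SUCCESS", "OK"}


def classify_cancel(delete_status: str, still_open: Optional[bool]) -> str:
    """Map the DELETE status plus the open-orders check to a cancel outcome."""
    if still_open is False:
        return "CANCELED"   # verified gone from the open-orders list
    if delete_status.upper() in CANCEL_OK:
        return "CANCELED"   # the DELETE itself reported success
    if still_open is None:
        return "UNKNOWN"    # verification failed: reconcile instead of rejecting
    return "REJECTED"       # verified still open and DELETE did not succeed


outcomes = [
    classify_cancel("ERROR", None),
    classify_cancel("ERROR", False),
    classify_cancel("ERROR", True),
    classify_cancel("success", None),
]
```

Only the third case, where the order is confirmed open and the DELETE failed, is reported as REJECTED.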
Layer 7: Fill Feedback Loop (rust_backend.py on_venue_event)
E18: on_venue_event settles PnL incrementally — but fees are never included
rust_backend.py:530-545
incremental_pnl = slot.realized_pnl - self._last_settled_pnl.get(slot.slot_id, 0.0)
if abs(incremental_pnl) > 1e-12:
self.account.settle(incremental_pnl)
self._last_settled_pnl[slot.slot_id] = slot.realized_pnl
The Rust kernel's apply_fill computes realized PnL as:
let realized = Self::realized_pnl(slot, event.price, fill_size);
slot.realized_pnl += realized;
No fee subtraction. No commission reading from the event. The VenueEvent
could carry fee data via metadata["fee"] or raw_payload["commission"],
but the Rust kernel doesn't read it and the Python bridge doesn't extract it.
Over the 142 live test scenarios on VST (where fees are 0 or negligible), this is invisible. On live mainnet with exchange fees of 0.02-0.04%, the cumulative error is unbounded.
Flaw: E18 — PnL settlement ignores fees.
Already documented as A7. In the E2E trace, the gap is specifically here:
VenueEvent.price is used for realized_pnl() but VenueEvent.metadata
(which could carry commission from the venue) is never read.
Severity: Medium (grows with trade volume).
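A minimal sketch of fee-aware settlement, reading the fee from the two places named above. extract_fee and net_pnl are hypothetical helpers; treating the commission as a magnitude (abs) is an assumption about venue sign conventions, and the numbers are illustrative.

```python
def extract_fee(metadata: dict, raw_payload: dict) -> float:
    """Read a fee from metadata["fee"] or raw_payload["commission"]; 0.0 if absent."""
    for source, key in ((metadata, "fee"), (raw_payload, "commission")):
        try:
            fee = abs(float(source.get(key) or 0.0))  # assumed: sign varies by venue
        except (TypeError, ValueError):
            fee = 0.0
        if fee > 0.0:
            return fee
    return 0.0


def net_pnl(gross_pnl: float, fee: float) -> float:
    # Fees reduce realized PnL regardless of trade direction.
    return gross_pnl - fee


# Illustrative: 10,000 USDT notional at an assumed 0.04% fee rate is 4.0 USDT.
fee = extract_fee({}, {"commission": "-4.0"})
settled = net_pnl(12.5, fee)
```

Settling the net value per fill keeps the capital error bounded per trade instead of growing with volume.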
E19: observe_slots called with ALL slots, not just changed ones
rust_backend.py:538-545
slots = [self._get_slot(i) for i in range(self.max_slots)]
self.account.observe_slots(slots)
Every on_venue_event call re-reads ALL slots from the Rust kernel (N FFI
calls) and calls observe_slots with the full list. With max_slots=10,
this is 10 FFI round-trips per venue event. Each round-trip serializes a
TradeSlot to JSON, passes through C FFI, parses on the Rust side, serializes
the result, passes back, and parses on the Python side. For a multi-leg EXIT
with 3 fills (ACK + PARTIAL + FULL), that's 3 × 10 = 30 slot reads per
process_intent call.
Flaw: E19 - Full-slot-list read on every event is N×FFI overhead.
Severity: Low (performance). Not a correctness issue.
Layer 8: Persistence Boundary (pink_clickhouse.py)
E20: _capital() reads live from AccountProjection — stale row risk
pink_clickhouse.py:199-200
def _capital(self) -> float:
    return float(self.account.snapshot.capital or 0.0)
Every row writer calls _capital() at write time to get the current capital.
But persist_result() is called AFTER kernel.process_intent() returns —
at which point the account has already been settled. The account_events,
position_state, and trade_events rows all record the SAME capital value
(the post-settle value). capital_before is then reconstructed by
subtracting PnL (already documented as A5).
The effect: all ClickHouse rows for a single process_intent() call show
identical capital / account_capital / portfolio_capital values, because
they're all written within the same Python call stack with no intervening
events. This is correct for single-threaded operation — all rows reflect
POST-trade state. But it means ClickHouse querying for "capital before trade"
must use capital_after - pnl, which is the wrong formula under multi-slot.
Flaw: E20 - All persistence rows write post-trade capital, not pre-trade.
Already documented as A5 from the capital_before angle.
Severity: High for multi-slot accounting reconstruction.
E21: persist_fill_events() synthesizes fake Decision/Intent
pink_clickhouse.py:383-435
def persist_fill_events(self, *, snapshot, events, slot_dict, market_state):
    ...
    decision = Decision(
        timestamp=ts, decision_id=trade_id or "async", asset=asset,
        action=action, side=side, reason="ASYNC_FILL",
        confidence=0.0, velocity_divergence=0.0, irp_alignment=0.0,
        reference_price=price, target_size=cur_size, leverage=leverage,
        ...
    )
    intent = Intent(
        timestamp=ts, trade_id=trade_id, decision_id=trade_id or "async",
        ...
    )
The async fill pump (called by pump_venue_events) constructs fake
Decision/Intent objects because there's no real policy decision backing an
async fill — it just arrived from the exchange. These synthetic objects have:
- decision_id = trade_id (or "async" if trade_id is empty)
- decision_id and trade_id are the same string
- confidence=0.0, velocity_divergence=0.0, irp_alignment=0.0
- target_size = cur_size (the remaining size after the fill, not the size that was filled)
These are written to policy_events, trade_reconstruction, and
trade_events with the same row shapes as real policy-driven fills. Any
ClickHouse query that joins policy_events to trade_events on
decision_id will find matching rows (both set to trade_id), but the
policy_events row's target_size is the POST-fill size, not the pre-fill
size. A replay system that reconstructs position from policy_events →
trade_reconstruction would see incorrect sizing.
Flaw: E21 - Async fill persistence uses synthetic decision with wrong data.
Severity: Medium. Misleading historical records.
E22: _write_trade_exit_leg capital_before uses arithmetic reconstruction
pink_clickhouse.py:761-762
capital_after = self._capital()
capital_before = capital_after - pnl_leg
Already documented as A5. In the E2E trace, the specific path is:
- Slot 0 exit leg fills → _capital() returns capital AFTER settlement (because the kernel's on_venue_event already called account.settle)
- capital_before = capital_after - pnl_leg reconstructs pre-leg capital
If slot 1 also settled between the leg fill and the persistence write
(possible in multi-threaded or concurrent scenario), capital_after includes
slot 1's PnL, and capital_before is wrong by exactly slot 1's contribution.
Severity: High for multi-slot.
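The error is easy to reproduce with two settlements. In this sketch AccountSketch is a hypothetical minimal account that records capital before and after each settle, which is one way to avoid the arithmetic reconstruction; it is not the real AccountProjection.

```python
from dataclasses import dataclass, field


@dataclass
class AccountSketch:
    """Minimal account that records pre/post capital per settlement (hypothetical)."""
    capital: float
    ledger: list = field(default_factory=list)

    def settle(self, slot_id: int, delta: float) -> tuple:
        before = self.capital
        self.capital = before + delta
        self.ledger.append((slot_id, before, self.capital))
        return before, self.capital


acct = AccountSketch(capital=1000.0)
before0, after0 = acct.settle(0, 5.0)  # slot 0 exit leg settles
acct.settle(1, -2.0)                   # slot 1 settles before slot 0 is persisted
reconstructed = acct.capital - 5.0     # the capital_after - pnl_leg formula
```

The reconstructed value is 998.0 while the true pre-leg capital was 1000.0: off by exactly slot 1's contribution of -2.0. Persisting the (before, after) pair captured at settle time avoids the dependency on ordering.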
E23: _write_trade_event uses slot_dict.get("entry_price") as exit_price
pink_clickhouse.py:813-815
entry_price = _safe_float(slot_dict.get("entry_price", 0.0), ...)
exit_price = _safe_float(slot_dict.get("entry_price", 0.0), ...) # ← SAME FIELD
Already documented as A13. The exit_price is set to entry_price from
the same slot dict field. The BingX ack payload does contain the fill price,
but it's not propagated to the slot dict's entry_price for exit fills —
the slot's entry_price is set during entry fill and remains unchanged
during exit. The exit fill price is only on the VenueEvent, which is not
passed through to _write_trade_event.
The trade_events row in ClickHouse always shows exit_price == entry_price,
making PnL reconstruction from (exit_price - entry_price) × size × lev
impossible. The pnl field IS correct (it's slot.realized_pnl), but only
the summary is accurate — the component prices are wrong.
Severity: Low. pnl is correct, only the decomposed price is wrong.
Layer 9: Test Infrastructure
E24: MockVenueAdapter.submit() always emits fill on partial_fill_ratio > 0
mock_venue.py:60-90
if self.scenario.emit_fill_on_submit or self.scenario.partial_fill_ratio > 0:
fill_ratio = max(0.0, min(1.0, float(effective_ratio)))
...
if is_entry:
effective_ratio = self.scenario.entry_partial_fill_ratio if \
self.scenario.entry_partial_fill_ratio != 1.0 else \
self.scenario.partial_fill_ratio
else:
effective_ratio = self.scenario.exit_partial_fill_ratio ...
The default MockVenueScenario() has partial_fill_ratio=1.0. So every
submit() call on a default mock emits a FULL_FILL event immediately.
This means mock-venue tests always test the "order fills instantly" path —
they never test resting orders, partial fills, or async fills.
Any test that relies on the mock venue is testing a subset of real venue behavior. The mock never produces:
- DELAYED fills (fill arrives on a later reconcile() call)
- PARTIAL fills with subsequent fills
- Partial fills during entry (entry fills partially, then more later)
- Mixed entry/exit partial behavior
Flaw: E24 — Mock venue always fills synchronously — never tests async path.
Severity: Medium. The pump_venue_events() path has never been exercised
with the mock venue.
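One way to cover the missing paths is a mock that acknowledges on submit and only releases fills on later reconcile calls. The sketch below is a minimal, hypothetical illustration: `FakeVenueEvent`, `DelayedFillMockVenue`, and their fields are assumptions made for this example and do not mirror the real MockVenueAdapter or VenueEvent interfaces.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List


@dataclass
class FakeVenueEvent:
    # Hypothetical stand-in for VenueEvent; field names are assumptions.
    kind: str              # "ACK", "PARTIAL_FILL" or "FULL_FILL"
    order_id: str
    filled_size: float = 0.0


@dataclass
class DelayedFillMockVenue:
    """Acknowledges on submit() and releases one queued fill per reconcile()."""
    fill_ratios: tuple = (0.4, 0.6)   # fractions of target size, in arrival order
    _pending: Deque[FakeVenueEvent] = field(default_factory=deque)
    _next_id: int = 0

    def submit(self, target_size: float) -> List[FakeVenueEvent]:
        self._next_id += 1
        order_id = f"mock-{self._next_id}"
        last = len(self.fill_ratios) - 1
        for i, ratio in enumerate(self.fill_ratios):
            kind = "FULL_FILL" if i == last else "PARTIAL_FILL"
            self._pending.append(FakeVenueEvent(kind, order_id, target_size * ratio))
        # Only the ACK is synchronous; the fills wait for reconcile().
        return [FakeVenueEvent("ACK", order_id)]

    def reconcile(self) -> List[FakeVenueEvent]:
        # Release at most one queued fill per call to imitate async arrival.
        return [self._pending.popleft()] if self._pending else []


venue = DelayedFillMockVenue()
acks = venue.submit(1.0)
first = venue.reconcile()
second = venue.reconcile()
third = venue.reconcile()
```

A test built on a mock of this shape would see an ACK with no fill, then a partial fill, then the completing fill, which is the sequence the default scenario never produces.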
E25: Test scenarios use MARKET-only _si() helper — no LIMIT tests
gen_live_tests.py and _gen_test.py
The _si() helper constructs a KernelIntent with order_type="MARKET" and
limit_price=0.0 (the defaults). All 157 live test scenarios use _si().
The 3 "LIMIT" scenarios (limit_does_not_fill, limit_immediate_fill) use
reference_price=0.0 and target_size=-0.001 respectively — they test
intent validation, not actual LIMIT order submission.
There is zero live-test coverage of:
- Submitting a LIMIT order that rests on the book
- A resting LIMIT being cancelled
- A resting LIMIT receiving a partial fill then a subsequent fill
- An async fill arriving via pump_venue_events()
The Rust kernel's PARTIAL_FILL event handling and the Python bridge's
on_venue_event + incremental settle + async pump have never been exercised
on a live exchange.
Flaw: E25 — Zero live tests for LIMIT/resting/async-fill paths. Severity: High. The partial-fill code path is untested in production.
E26: Fresh-kernel reconcile tests create second kernel but share venue
gen_live_tests.py (fresh_kernel_reconcile_entry body)
fresh = _build_fresh_kernel_from_slot(slot_data, ic=cb)
k2 = fresh.runtime.kernel
The _build_fresh_kernel_from_slot function creates a new PinkDirectRuntime
with a new ExecutionKernel. But the venue adapter is shared or
re-created with the same BingX backend. Two kernels making concurrent HTTP
calls to BingX through shared or separate venue adapters is exactly the
multi-threaded scenario that triggers T1 (Rust kernel UB) — except the tests
are sequential, not concurrent, so they don't trigger it.
The fresh kernel does NOT restore the venue state (open orders, positions). The fresh kernel has a blank venue adapter state — it can't know about previous LIMIT orders resting on the exchange. This is correct for MARKET-only tests (no resting orders) but would fail for LIMIT tests.
Flaw: E26 — Fresh-kernel reconcile doesn't restore venue state. Severity: Medium (would break LIMIT scenarios).
Summary: Critical E2E Flaw Chain
The most dangerous E2E scenario is a LIMIT order with partial fills on a live exchange:
1. Policy emits LIMIT ENTER [E3: can't happen — bridge drops order_type]
2. KernelIntent with order_type="LIMIT" [dead code path from step 1]
3. bingx_direct.submit_intent builds LIMIT payload [works if reached]
4. BingX accepts LIMIT, returns ACK with no fill [VenueEvent.price may be 0]
5. FSM transitions to ENTRY_WORKING [correct]
6. RESTING LIMIT sits on book [no further kernel events]
7. Next policy cycle: pump_venue_events() [E1: expensive HTTP calls]
8. Reconciled venue has no fill events [nothing to drain]
9. Repeated cycles with no progress [wasteful but safe]
10. Eventually BingX fills partially [VenueEvent arrives]
11. apply_fill PARTIAL_FILL entry branch runs [E10: entry_price = last fill, not VWAP]
12. on_venue_event settles incremental PnL [E18: fees not included]
13. persistence writes [E20/E21/E22/E23: wrong capital_before, exit_price]
14. Remaining LIMIT still rests on book [continues to step 7]
15. Eventually full fill or cancel [E17: cancel can return false REJECTED]
None of steps 4-15 have live test coverage.
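Step 11 of the chain above records the last fill price as the entry price instead of a volume-weighted average. The difference can be sketched as follows; the fill list is illustrative and `vwap_entry_price` is a hypothetical helper, not a function from the codebase.

```python
def vwap_entry_price(fills):
    """Size-weighted average price over (price, size) fills; 0.0 if no size."""
    total_size = sum(size for _, size in fills)
    if total_size <= 0.0:
        return 0.0
    return sum(price * size for price, size in fills) / total_size


fills = [(100.0, 0.25), (104.0, 0.75)]   # illustrative (price, size) pairs
last_fill_price = fills[-1][0]            # the behaviour E10 describes
vwap = vwap_entry_price(fills)            # size-weighted alternative
print(last_fill_price, vwap)
```

With these two fills the last-fill price is 104.0 while the weighted price is 103.0, so PnL computed against the last fill would be understated for a long position.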
Complete Flaw Catalog (All Layers)
| # | Flaw | Layer | Step | Severity |
|---|---|---|---|---|
| E1 | Unconditional pump_venue_events wastes rate limit | Runtime | R2 | Medium |
| E2 | TOCTOU between capital snapshot and intent | Runtime | R3→R8 | Medium |
| E3 | Runtime bridge drops order_type/limit_price | Bridging | R7 | Medium |
| E4 | TOCTOU between exit sizing and execution | Runtime | R8 | Low |
| E5 | JSON precision drift over long runs | Bridge | R8a→R8c | Low |
| E6 | Global FFI singleton no guard vs use-after-free | Bridge | R8b | High |
| E7 | Same-trade-id re-entry leaves stale index entries | Rust | R8c | Low |
| E8 | EXIT uses initial_size not remaining size | Rust | R8c | High |
| E9 | CANCEL "accepted" before cancel actually happens | Rust | R8c | Medium |
| E10 | Entry price on multi-partial fill = last fill, not VWAP | Rust | R10a | Low |
| E11 | _legacy_intent hardcodes confidence/bars_held | Venue | R9a | Info |
| E12 | Zero fill price → zero PnL | Venue | R9c | Medium |
| E13 | Stale snapshot fallback causes wrong fill delta | Venue | R9c | Medium |
| E14 | Cancel event carries stale slot_id | Venue | R9c | Low |
| E15 | Leverage-set failure and order failure share handler | Adapter | R9b | Low |
| E16 | Instrument resolution 3x per order, O(n) scan | Adapter | R9b | Low |
| E17 | Cancel returns false REJECTED for already-cancelled | Adapter | R9b | Medium |
| E18 | PnL settlement ignores fees | Bridge | R10b | Medium |
| E19 | Full-slot-list read on every event = N×FFI overhead | Bridge | R10b | Low |
| E20 | All persistence rows write post-trade capital | Persistence | R12 | High |
| E21 | Async fill uses synthetic Decision with wrong size | Persistence | R12 | Medium |
| E22 | capital_before arithmetic reconstruction wrong | Persistence | R12 | High |
| E23 | trade_events exit_price = entry_price | Persistence | R12 | Low |
| E24 | Mock venue always fills synchronously | Test | — | Medium |
| E25 | Zero live tests for LIMIT/async-fill paths | Test | — | High |
| E26 | Fresh-kernel reconcile doesn't restore venue | Test | — | Medium |
Total: 26 E2E flaws (5 High, 11 Medium, 9 Low, 1 Info), counted from the catalog table above
The High-severity flaws in the E2E trace:
- E6: Global FFI singleton + __del__ use-after-free: memory corruption risk
- E8: Exit-size overshoot: slot can get stuck (A1)
- E20/E22: Post-trade capital in all persistence rows + arithmetic capital_before: ClickHouse records are misleading for accounting
- E25: No LIMIT/async-fill test coverage: partial-fill path is production code with zero live validation
PASS 3 — NEW FINDINGS (Deepest E2E Trace)
F1: process_intent CANCEL returns "accepted" before the cancel happens — caller gets wrong outcome.state
File: rust_backend.py:595-614
The CANCEL path:
- Calls self.venue.cancel(order) → HTTP DELETE → returns VenueEvent[]
- For each event, calls self.on_venue_event(event) → Rust FSM transition
- Assembles final_outcome from the Rust kernel's pre-venue-event slot state
outcome = _outcome_from_payload(result["outcome"]) # Rust CANCEL accepts (slot NOT mutated yet)
# ... venue.cancel() ...
# ... on_venue_event() for each event (now slot IS mutated) ...
final_slot = self._get_slot(outcome.slot_id) # Re-reads post-mutation state
final_outcome = KernelOutcome(
    accepted=outcome.accepted, # TRUE — from Rust's pre-event accept
    state=final_slot.fsm_state, # IDLE — from post-event state
    diagnostic_code=outcome.diagnostic_code, # "OK" — from Rust's pre-event accept
)
For ENTER/EXIT, the same pattern exists — the Rust kernel's outcome is
pre-venue. But for CANCEL the disconnect is worst: Rust returns accepted=true
with the slot still in ENTRY_WORKING, and only the subsequent
on_venue_event(CANCEL_ACK) transitions to IDLE.
Fix: The diagnostic code should be reconciled with the actual venue outcome, not taken from the pre-venue Rust outcome.
Severity: Medium
F2: _last_settled_pnl reset before venue.submit() — transient window
File: rust_backend.py:597-604
if intent.action == KernelCommandType.ENTER and outcome.accepted:
    self._last_settled_pnl[intent.slot_id] = 0.0 # reset HERE
# ... venue.submit() called below ...
If venue.submit() fails (HTTP error, rate limit), the ENTER was accepted by
the Rust FSM but no venue order was placed. The slot is stuck in
ORDER_REQUESTED. If the caller retries the same ENTER, _last_settled_pnl
is 0.0 from the first attempt — correct for a new trade.
Real risk: If the previous trade on this slot had realized PnL that was never settled (impossible with incremental settle, but hypothetically), resetting to 0.0 loses that PnL. In practice, incremental settle makes this safe.
Severity: Medium (retry-safe, but exposes slot-stall)
F3: _first_invalid_intent_field allows leverage=0 and target_size=0
File: rust_backend.py:295-316
The guard catches NaN/Inf and negative target_size. Does NOT catch:
- leverage=0 or negative (Rust silently falls back to 1.0)
- target_size=0 (submits zero-quantity order to BingX)
- reference_price=0 (mark_price ignores non-positive)
- limit_price=0 with order_type="LIMIT" (BingX rejects price=0)
The zero-target-size case: a direct process_intent(EXIT, target_size=0.0)
computes exit_size = 0, submits MARKET order with quantity=0 to BingX,
which may return an error or silent no-op.
Severity: Low (runtime's _exit_intent_from_slot prevents for EXIT; direct
kernel API users can trigger it)
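A stricter guard covering the cases listed above could look like the sketch below. It is hypothetical: the function name, the flat parameter list, and the decision to reject a zero target_size for both ENTER and EXIT are assumptions for illustration and do not mirror the real KernelIntent signature. A zero reference_price is deliberately left alone, since the text above says mark_price ignores it.

```python
import math
from typing import Optional


def first_invalid_intent_field(action: str, order_type: str, target_size: float,
                               leverage: float, reference_price: float,
                               limit_price: float) -> Optional[str]:
    """Return the name of the first invalid field, or None if all pass."""
    numeric = {
        "target_size": target_size,
        "leverage": leverage,
        "reference_price": reference_price,
        "limit_price": limit_price,
    }
    # Reject NaN and +/-inf before any range checks.
    for name, value in numeric.items():
        if not math.isfinite(value):
            return name
    # Assumption: a zero or negative size is never a valid ENTER/EXIT.
    if action in ("ENTER", "EXIT") and target_size <= 0.0:
        return "target_size"
    # Reject instead of silently falling back to 1.0.
    if leverage <= 0.0:
        return "leverage"
    # A LIMIT order needs a positive price; MARKET may keep the 0.0 default.
    if order_type == "LIMIT" and limit_price <= 0.0:
        return "limit_price"
    return None
```

Returning the field name rather than a boolean keeps the diagnostic specific, so a rejected intent can report which input was at fault.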
F4: outcome.emitted_events only contains venue events — Rust kernel's events silently dropped
File: rust_backend.py:641-652
final_outcome = KernelOutcome(
    emitted_events=tuple(emitted_events), # only from venue.submit()
)
The Rust kernel's KernelOutcome struct has emitted_events — currently always
empty because the Rust FSM never sets it. If a future change adds Rust-side
event emission, those events are silently dropped: final_outcome only uses
the Python-side list.
Severity: Low (no Rust-emitted events exist today)
F5: on_venue_event does redundant FFI read of slot already returned by Rust
File: rust_backend.py:698-706
def on_venue_event(self, event):
    result = _get_rust().on_venue_event(...)
    outcome = _outcome_from_payload(result["outcome"])
    slot_payload = result.get("slot")
    slot = _slot_from_payload(slot_payload) if slot_payload else self._get_slot(...)
    # ...
    current = self._get_slot(slot.slot_id) # REDUNDANT: slot already has this data!
    self.projection.write_slot(current)
Line 706 re-reads current from the backend even though slot (from the
Rust result) already has the exact same data. Each redundant FFI read is
JSON serialize → C FFI → Rust serialize → C FFI → Python parse — ~100μs.
With 2-3 events per process_intent and 10 slots, ~3ms wasted per cycle.
Severity: Low (performance)
F6: _record_transitions in process_intent records pre-venue transitions with event=None
File: rust_backend.py:708, 650
# process_intent line 650:
self._record_transitions(outcome.transitions, final_slot, None) # event=None
# on_venue_event line 708:
self._record_transitions(outcome.transitions, slot, event) # event attached
Venue-event transitions ARE recorded individually inside each
on_venue_event call (line 708). The journal has all transitions. But the
pre-venue transitions (from Rust FSM before venue call) have event=None
attached — no event context for the journal reader.
Severity: Informational (diagnostic inconvenience only)
F7: reconcile_from_slots writes ALL slots to projection/zinc, not just reconciled ones
File: rust_backend.py:718-733
for current in slots: # iterates ALL max_slots
    self.projection.write_slot(current) # writes unchanged slots too
    self.zinc_plane.write_slot(current)
After reconcile, ALL slots are written to projection and Zinc, even if the reconcile only modified one slot. Slots 1-9 are serialized and written with their unchanged state. Wasteful but harmless.
Also: Rust kernel's reconcile_slots_json silently ignores slot_id out of
range — no error returned. Caller sees accepted=true even if no slots were
reconciled.
Severity: Low
F8: HazelcastRowWriter.put() is synchronous with no error handling — Hazelcast failure crashes the intent
File: hazelcast_projection.py:30-48
class HazelcastRowWriter:
    def __call__(self, name, row):
        if name.endswith("trade_events"):
            self.client.get_topic(name).publish(json.dumps(row, ...))
            return
        self.client.get_map(name).put(key, json_safe(row)) # synchronous, no try/except
No try/except. Hazelcast put() is synchronous — blocks until the cluster
acknowledges. If Hazelcast is down, under load, or partitioned, this:
- Blocks the calling thread (which holds the Rust kernel handle, so no other operation can proceed)
- Raises an exception that propagates through _set_slot() → process_intent() → crashes the entire intent
Severity: Medium (Hazelcast failure in hot path stalls execution)
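A minimal way to keep a projection failure from crashing the intent is to wrap the writer. The sketch below is hypothetical: `SafeRowWriter` is not part of the codebase, and dropping a row on failure is a design assumption (a real deployment might queue it for retry instead). It addresses only the exception path, not the blocking behaviour of a synchronous put().

```python
import logging
from typing import Any, Callable, Dict

log = logging.getLogger("projection")


class SafeRowWriter:
    """Wraps a row writer so projection failures are logged, not raised.

    `inner` is any callable(name, row), such as the writer described above.
    """

    def __init__(self, inner: Callable[[str, Dict[str, Any]], None]):
        self._inner = inner
        self.failures = 0          # exposed so health checks can alert on it

    def __call__(self, name: str, row: Dict[str, Any]) -> bool:
        try:
            self._inner(name, row)
            return True
        except Exception:
            # Count and log the failure; the caller's intent keeps going.
            self.failures += 1
            log.exception("projection write failed for %s", name)
            return False
```

The failure counter matters as much as the try/except: swallowing errors without a visible signal would repeat the silent-fallback problem described elsewhere in this analysis.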
F9: RealZincPlane.write_slot() serializes ALL slots, not just the changed one
File: real_zinc_plane.py:205-212
def write_slot(self, slot):
    with self._lock:
        self._slot_cache[int(slot.slot_id)] = slot
        payload = {"slots": [self._slot_cache[key].to_dict() for key in range(self._slot_count)]}
        self._write_region(self.state_region, self._state_seq, payload)
Every single-slot write serializes ALL slot_count slots (default 10) to JSON.
With VenueOrder metadata, each slot payload can be ~1-5KB → 10-50KB per write.
This is written to Zinc shared memory on every process_intent() and
on_venue_event() call.
InMemoryZincPlane does NOT have this problem — it only stores the one slot.
Severity: Low (performance + Zinc shared-memory capacity waste)
F10: RealZincPlane.write_slot zeros buffer before write — concurrent read sees empty data
File: real_zinc_plane.py:255-263
def _write_region(self, region, seq, payload):
    buf = region.as_buffer()
    view = memoryview(buf)
    view[:] = b"\x00" * len(view) # Zeros the buffer
    view[: len(packet)] = packet # Writes packet
    region.notify()
Between the zero and the write, any concurrent reader sees zeros or a truncated
packet. _decode_packet checks size <= len(buf) - 16 — a partially-written
packet fails validation and returns {}. The reader (e.g., another thread
calling read_slots()) gets an empty result.
Window is microseconds but it exists. No version guard — reader always returns whatever is in the region.
Severity: Low (brief window, no corruption — just empty results)
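The zero-then-write window can be narrowed by assembling the padded frame in a local buffer and copying it in a single slice assignment. The sketch below assumes the region is exposed as a writable memoryview; `write_region_padded` is a hypothetical name. A single copy is still not atomic with respect to another process, so this reduces the window rather than closing it; a sequence counter checked by the reader would be needed for that.

```python
def write_region_padded(view: memoryview, packet: bytes) -> None:
    """Copy packet plus zero padding into the region in one slice assignment."""
    if len(packet) > len(view):
        raise ValueError("packet larger than region")
    # Build the full frame first so the region never holds an all-zero state.
    frame = packet + b"\x00" * (len(view) - len(packet))
    view[:] = frame


# Stand-in for a shared-memory region: 32 bytes of stale content.
region = bytearray(b"\xff" * 32)
write_region_padded(memoryview(region), b"hello")
print(bytes(region[:8]))
```

The stale tail is still cleared, as in the original, but a reader can no longer observe the intermediate state in which the whole region is zero.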
F11: RealZincPlane._write_region has no partial-write recovery
File: real_zinc_plane.py:255-263
If _encode_packet raises (JSON serialization error), the method raises before
writing — region retains previous content. Safe.
If view[:] = b"\x00" fails (memory error), the region is partially zeroed.
Not recoverable. No fallback.
Severity: Low (memory errors are extremely rare)
F12: InMemoryZincPlane intent_region grows without bound
File: zinc_plane.py:83-85
def publish_intent(self, intent):
    self.intent_region.append(intent) # unbounded growth
self.intent_region is List[KernelIntent] — grows on every publish_intent
call. Over thousands of policy cycles, this grows without bound.
RealZincPlane.publish_intent() limits to last 512 entries in shared memory,
but its self._intent_cache (in-memory) also grows without bound.
Severity: Low (memory leak — ~MB/day)
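A bounded container removes the growth without changing the append call site. The sketch below uses the standard library's `collections.deque` with `maxlen`; reusing the 512-entry limit from the shared-memory side is an assumption, and the dict entries stand in for KernelIntent objects.

```python
from collections import deque

MAX_INTENTS = 512  # assumption: reuse the 512-entry limit mentioned above

# deque(maxlen=N) silently discards the oldest entry once N is reached.
intent_region = deque(maxlen=MAX_INTENTS)
for i in range(2000):
    intent_region.append({"seq": i})

print(len(intent_region), intent_region[0]["seq"])
```

After 2000 appends the deque still holds 512 entries, the oldest being sequence 1488.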
F13: InMemoryZincPlane uses non-re-entrant threading.Condition
File: zinc_plane.py:41-43
_signal: threading.Condition = field(default_factory=threading.Condition)
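A default-constructed threading.Condition() wraps an RLock, so the owning thread can re-acquire it: a code path that calls back into publish_intent while holding the condition's lock does not deadlock on that basis alone. The deadlock described here would require the condition to be built over a plain threading.Lock (for example threading.Condition(threading.Lock())), which the default_factory above does not do.
Severity: Low (no current code path triggers this; the non-re-entrancy claim should be re-verified against the declaration)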
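The re-entrancy behaviour of threading.Condition is easy to check in isolation. In CPython, a Condition created without a lock argument uses an RLock internally, while one built over a plain Lock does not allow a second acquire by the same thread. The sketch below is a standalone probe, not code from zinc_plane.py.

```python
import threading

cond = threading.Condition()          # no lock argument: wraps an RLock


def publish(depth: int) -> int:
    """Re-acquire the condition's lock recursively `depth` times."""
    with cond:
        if depth > 1:
            return publish(depth - 1)
        cond.notify_all()             # legal: this thread owns the lock
        return depth


reentrant_ok = publish(3) == 1

plain = threading.Condition(threading.Lock())
with plain:
    # A blocking second acquire on a plain Lock would hang, so probe without blocking.
    second_acquire = plain.acquire(blocking=False)

print(reentrant_ok, second_acquire)
```

The nested acquisition succeeds on the default condition and fails on the one built over a plain Lock, so the landmine exists only if the declaration is changed to pass an explicit non-re-entrant lock.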
F14: KernelSlotView.__setattr__ round-trips unknown fields through Rust — silently dropped
File: rust_backend.py:370-395
If a new field is added to Python's TradeSlot that Rust's TradeSlot doesn't
know about, slot.to_dict() includes it. _set_slot serializes to JSON, sends
to Rust, which deserializes with #[serde(default)] — unknown fields are
silently dropped. The round-trip loses data without warning.
The reverse: if Rust adds a field that Python doesn't know about,
_slot_from_payload ignores unknown keys. Also silently dropped.
Severity: Low (fields must be added to both sides atomically; no guard)
F15: on_venue_event loop in process_intent stops on first exception — slot left in partial state
File: rust_backend.py:599-610
for event in emitted_events:
    evt_outcome = self.on_venue_event(event) # NO TRY/EXCEPT
If self.on_venue_event(event) raises (FFI error, null pointer, OOM), the loop
stops. Events after the failing event are never processed. The slot is in a
partial state — some events applied, some not.
Concrete scenario: ACK arrives first → applied. FULL_FILL arrives second
→ FFI error, exception raised. Slot is stuck in ENTRY_WORKING with size=0.
Next process_intent(EXIT) returns NO_OPEN_POSITION. No recovery path exists.
Severity: High — single exception during fill feedback leaves slot unrecoverable. Zero defense in depth.
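One defensive shape for the feedback loop is to stop at the first failure but hand back the unprocessed events, so the caller can retry them or mark the slot for reconcile instead of losing them. The sketch below is hypothetical: `apply_events_in_order` is not in the codebase, and it assumes FFI failures surface as ordinary Python exceptions.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def apply_events_in_order(
    events: Iterable[object],
    apply_one: Callable[[object], None],
) -> Tuple[List[object], List[object], Optional[Exception]]:
    """Apply events in order; on failure return (applied, remaining, error)."""
    applied: List[object] = []
    pending = list(events)
    while pending:
        try:
            apply_one(pending[0])
        except Exception as exc:
            # Keep the failing event at the head of `pending` for a retry.
            return applied, pending, exc
        applied.append(pending.pop(0))
    return applied, [], None


def flaky(event):
    # Illustrative failure: the fill event raises, as in the scenario above.
    if event == "FULL_FILL":
        raise RuntimeError("simulated FFI error")


applied, remaining, error = apply_events_in_order(["ACK", "FULL_FILL"], flaky)
print(applied, remaining, error)
```

Order is preserved because nothing after the failing event is applied; the difference from the current loop is that the caller learns exactly which events are outstanding.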
F16: venue.submit() returning empty events leaves slot in ORDER_REQUESTED
File: rust_backend.py:599-610
If venue.submit() returns [] (venue rejected order with no response, or
internal error), the for loop doesn't run. No on_venue_event is called.
Slot stays in Rust's pre-venue state (ORDER_REQUESTED).
final_outcome has accepted=true, state=ORDER_REQUESTED, emitted_events=[].
Caller sees "successful" but no exchange order exists. Slot stuck in
ORDER_REQUESTED until pump_venue_events() or manual reconcile.
Severity: Medium — silent slot stall with no error indication.
F17: Cancel truth-based confirmation returns REJECTED for already-cancelled orders on GET failure
File: bingx_direct.py:474-498
try:
    oo = await self._client.signed_get("/openApi/swap/v2/trade/openOrders", ...)
    still_open = (venue_order_id in ids)
except Exception:
    still_open = None # GET failed
if still_open is False:
    return {"status": "CANCELED", ...}
# still_open is None (GET failed) or True (order still on book)
# Falls through to DELETE response check
If the DELETE succeeded but the verification GET failed (network blip, rate limit
on the verification endpoint), still_open=None. The code then checks the DELETE
response. If the DELETE returned an ambiguous error (e.g., "order not found"
because it was already cancelled by another path), the status is "ERROR" —
reported as REJECTED even though the order IS cancelled.
The bingx_venue._events_from_cancel() emits CANCEL_REJECT. The Rust FSM
handles CANCEL_REJECT as a no-op — slot stays in EXIT_WORKING with no
active order. Stuck until pump_venue_events() or manual reconcile.
Severity: Medium — needs a third state: "definitely cancelled," "probably cancelled," "definitely not cancelled."
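The three-state outcome suggested above can be sketched as a small classifier. Everything here is hypothetical: the `CancelTruth` enum, the `delete_status` strings, and the choice to treat an unverified error as not cancelled (so the caller retries) are assumptions for illustration, not the adapter's API.

```python
from enum import Enum
from typing import Optional


class CancelTruth(Enum):
    CANCELLED = "cancelled"                    # verified gone from the book
    PROBABLY_CANCELLED = "probably_cancelled"  # unverified, but DELETE suggests so
    NOT_CANCELLED = "not_cancelled"            # still open, or nothing suggests success


def classify_cancel(delete_status: str, still_open: Optional[bool]) -> CancelTruth:
    """still_open is the verification GET result, or None if the GET failed.

    delete_status is assumed to be one of "OK", "NOT_FOUND" or "ERROR".
    """
    if still_open is False:
        return CancelTruth.CANCELLED
    if still_open is True:
        return CancelTruth.NOT_CANCELLED
    # Verification unavailable: fall back on what the DELETE itself reported.
    if delete_status in ("OK", "NOT_FOUND"):
        return CancelTruth.PROBABLY_CANCELLED
    # Conservative assumption: report not cancelled so the caller retries.
    return CancelTruth.NOT_CANCELLED
```

With this split, the "order not found" plus failed-GET case maps to a probable cancel rather than REJECTED, which would let the caller schedule a reconcile instead of leaving the slot parked.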
F18: Leverage-set and order-submit failures share error handler — poor diagnostics
File: bingx_direct.py:376-417
await self._client.signed_post("/openApi/swap/v2/trade/leverage", ...) # step A
# ...
ack_payload = await self._client.signed_post("/openApi/swap/v2/trade/order", payload) # step B
If step A fails (400 for invalid symbol), the exception handler at line 417
catches BingxHttpError and returns REJECTED. No way for the caller to know
whether the leverage set failed or the order submission failed — both go through
the same handler. The error message just says "REJECTED."
Also: if step A succeeds and step B fails, leverage was changed on the exchange but no order was placed. System state unchanged (leverage changes don't affect capital), but diagnostics are poor.
Severity: Low (correct behavior, poor diagnostics)
F19: _events_from_submit stale snapshot fallback → wrong fill detection
File: bingx_venue.py:375-400
_filled_size_from_snapshots() diffs position quantity before and after
submit. The "before" snapshot comes from _backend_snapshot() which can
return stale data (E13). A stale "before" against a fresh "after" produces
a wrong diff — could be negative, zero, or larger than reality.
This wrong diff propagates to emitted_events — the PARTIAL_FILL or
FULL_FILL event has wrong filled_size. The Rust kernel's apply_fill
uses this wrong filled_size to set slot.size. Capital settles on the
wrong delta.
Severity: Medium — wrong fill size propagates to kernel state and PnL.
F20: __del__ frees Rust handle at unpredictable GC time — no explicit close()
File: rust_backend.py:558-566
def __del__(self):
    backend = getattr(self, "_backend", None)
    if backend is not None:
        try: _get_rust().destroy(backend)
        except: pass
ExecutionKernel has no close() method. The Rust KernelHandle is only
freed by __del__, which runs on the GC thread at unpredictable time. If
any code holds a stale reference to self._backend, the pointer dangles
when the kernel is GC'd.
DITAv2LauncherBundle.close() calls _maybe_close on venue, zinc, and
control plane — but NOT on kernel (which has no close() or disconnect()).
The kernel is leaked until GC.
Severity: Medium — reliance on __del__ for critical C resource cleanup.
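A deterministic cleanup path usually means an idempotent close() plus context-manager support, with __del__ kept only as a safety net. The sketch below is hypothetical: `ClosableKernel` and `FakeRustLib` are stand-ins invented for this example, and the real ExecutionKernel constructor and FFI calls are not reproduced.

```python
class FakeRustLib:
    """Stand-in for the FFI library; records which handles were destroyed."""

    def __init__(self):
        self.destroyed = []

    def destroy(self, handle):
        self.destroyed.append(handle)


class ClosableKernel:
    """Kernel wrapper with explicit, idempotent release of the native handle."""

    def __init__(self, lib, handle):
        self._lib = lib
        self._backend = handle

    def close(self):
        # Swap the handle out first so a second close() is a no-op.
        backend, self._backend = self._backend, None
        if backend is not None:
            self._lib.destroy(backend)

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()

    def __del__(self):
        # Safety net only; normal cleanup goes through close().
        try:
            self.close()
        except Exception:
            pass


lib = FakeRustLib()
with ClosableKernel(lib, handle=7) as kernel:
    pass
kernel.close()          # second close is harmless
print(lib.destroyed)
```

Because the handle is cleared before destroy() is called, a stale reference can no longer trigger a double free through this wrapper, and a bundle-level close() would have a method to call.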
F21: DITAv2LauncherBundle.close() closes venue before kernel is done with it
File: launcher.py:90-95
def close(self):
    _maybe_close(self.venue) # Closes HTTP client
    _maybe_close(self.zinc_plane) # Closes Zinc regions
If the kernel is mid-process_intent in another thread (hypothetical —
single-threaded in practice), venue.submit() would fail because the HTTP
client is already closed. No ordering enforcement.
Severity: Low (single-threaded deployment)
F22: Silent fallback from real Zinc/Hazelcast to in-memory on error — operator unaware
File: control.py:210-217, launcher.py:175-185, projection.py:30-40
def build_control_plane(...):
    if real_requested:
        try:
            return RealZincControlPlane(...)
        except Exception:
            pass # SILENT: operator never knows
    return ZincControlPlane(snapshot=snapshot)
Three places have this pattern. An operator who configures DITA_V2_ZINC=REAL
and Zinc isn't available gets in-memory storage without any warning, error, or
log. The ZincPlane protocol has no introspection method to check if it's
real or in-memory.
The same applies to Hazelcast projection and the venue adapter.
Severity: Medium — configuration errors are silently masked.
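The fallback itself can stay; what is missing is a signal. The sketch below shows one hedged shape for it: log the failure, return which mode was selected, and offer a strict switch that refuses to fall back. `build_with_fallback` and its parameters are hypothetical names, not functions from control.py or launcher.py.

```python
import logging

log = logging.getLogger("dita.launcher")


def build_with_fallback(real_factory, fallback_factory, component: str,
                        strict: bool = False):
    """Return (instance, mode) where mode is "real" or "in-memory"."""
    try:
        return real_factory(), "real"
    except Exception:
        if strict:
            # Operator asked for the real backend only: surface the error.
            raise
        log.warning("%s: real backend unavailable, using in-memory fallback",
                    component, exc_info=True)
        return fallback_factory(), "in-memory"


def broken_real_backend():
    # Illustrative failure standing in for an unavailable Zinc region.
    raise OSError("zinc region not available")


plane, mode = build_with_fallback(broken_real_backend, dict, "zinc_plane")
print(mode)
```

Returning the mode alongside the instance also gives callers the introspection hook the text above notes is missing from the protocol.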
F23: VenueEvent.size = intent.target_size not actual fill — wrong for multi-leg EXIT
File: bingx_venue.py:410-420
base_event = VenueEvent(
    size=float(intent.target_size or 0.0), # target, not fill
)
For an EXIT leg, intent.target_size is the intended exit size. The ACK
event's size reflects the target, not the actual fill. For fully-filled
MARKET orders, target == fill so it's invisible. For partially-filled
LIMIT orders, size on the ACK is wrong.
The fill event later has filled_size from the venue's executedQty, so
the downstream kernel uses the correct fill size. The ACK's size is
unused by the kernel (the kernel uses filled_size for PnL computation).
Severity: Informational (unused by kernel)
F24: asyncio.run() inside async function in test generator — nested event loops
File: _build_pink_extended.py:75-81
def _check_open_orders(c, vs):
    r = __import__('asyncio').run(c._request_json("GET", ...))
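The helper calls asyncio.run() and is invoked from an async test body.
asyncio.run() does not nest: when an event loop is already running in the
calling thread it raises RuntimeError ("asyncio.run() cannot be called from a
running event loop") rather than suspending pytest's asyncio loop. For the
helper to work in practice it must therefore run where no loop is active (for
example in a worker thread), or the loop must have been patched to allow
re-entry; which of these applies here was not traced.
Severity: Low (works in practice)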
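The standard-library behaviour is straightforward to reproduce: asyncio.run() works from synchronous code and raises RuntimeError when a loop is already running in the same thread. The sketch below uses hypothetical stand-in functions rather than the real test helper.

```python
import asyncio


async def fetch_open_orders():
    # Stand-in for the real request coroutine.
    return ["order-1"]


def check_open_orders_sync():
    """Synchronous helper that drives a coroutine with asyncio.run()."""
    coro = fetch_open_orders()
    try:
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()   # avoid a "never awaited" warning for the unused coroutine
        return f"RuntimeError: {exc}"


async def async_test_body():
    # Calling the sync helper while this thread's loop is running.
    return check_open_orders_sync()


outside = check_open_orders_sync()           # no loop running: succeeds
inside = asyncio.run(async_test_body())      # loop running: asyncio.run refuses
print(outside, inside)
```

Inside an async test body the straightforward alternative is to make the helper a coroutine and await it, which avoids the second loop entirely.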
F25: _build_fresh_kernel_from_slot leaks old kernel objects per call
File: _build_pink_extended.py:95-108
def _build_fresh_kernel_from_slot(slot_data, ic=25000.0):
    cfg = _build_config(ic)
    b = build_launcher_bundle(venue_mode="BINGX", ...) # NEW bundle, OLD not closed
    k = b.kernel
    return RB(runtime=Shim(k), config=cfg)
Each call creates a new launcher bundle (new kernel, new Rust handle, new HTTP client, new Zinc plane) without closing the old one. Called 4 times across the fresh-kernel test bodies. Leaks ~50MB per call (Rust lib, HTTP connections).
Severity: Low (test infrastructure only)
F26: seen_event_ids not cleared on re-entry — event IDs accumulate across trades
File: lib.rs:672-683
When a slot re-enters (new ENTER after previous EXIT), the Rust kernel resets
most fields (lib.rs:740-765) but does NOT clear seen_event_ids. The new
trade inherits the previous trade's event history up to MAX_SEEN_EVENT_IDS
(256). After 256 events across multiple trades, old IDs are drained.
For MARKET trading (2-4 events per trade), the 256-entry window spans roughly 64-128 trades before old IDs start draining. For LIMIT trading with many partial fills (on the order of 25-50 events per trade), it could be as few as 5-10 trades.
Fix: slot.seen_event_ids.clear() on ENTER.
Severity: Low (event ID collision across trades is astronomically unlikely)
F27: RealZincControlPlane.read() parses Zinc region every call — no caching
File: real_control_plane.py:88-94
def read(self):
    payload = _decode_packet(self.region.as_buffer()) # JSON parse every call
    control = payload.get("control")
    self._snapshot = KernelControlSnapshot(**control) # reconstruct every call
    return self._snapshot
Called by ExecutionKernel.control property on every process_intent().
Each call re-constructs a KernelControlSnapshot from dict — allocating
new objects for every field. ~50μs per call. A simple cached-until-modified
pattern would eliminate all parses between writes.
Severity: Low (performance)
F28: _legacy_intent hardcodes confidence=1.0 and bars_held=0
File: bingx_venue.py:270-285
These fields are in LegacyIntent but unused by submit_intent() (which
only reads asset, side, action, target_size, leverage, metadata).
The downstream ClickHouse rows use the policy-layer Intent, not LegacyIntent,
so the hardcoded values don't reach persistence.
Only propagates through the venue adapter's internal chain. No consumer reads them today.
Severity: Informational
F29: _slot_to_payload in real_zinc_plane.py is dead code
File: real_zinc_plane.py:57-59
def _slot_to_payload(slot):
    data = slot.to_dict()
    return data
Defined, never called anywhere in the file. All slot serialization calls
slot.to_dict() directly.
Severity: Informational
F30: Duplicate _slot_from_payload in real_zinc_plane.py and rust_backend.py
File: real_zinc_plane.py:62-112, rust_backend.py:270-310
Two nearly identical implementations. The real_zinc_plane version manually
constructs VenueOrder objects (lines 63-88) with different defaults
(e.g., fallback to slot size if intended_size missing). The rust_backend
version delegates to _order_from_payload with all-default fallbacks.
If fields are added to TradeSlot or VenueOrder, both must be updated.
Severity: Low (code duplication risk)
Complete Flaw Catalog
All-Passes Combined
| Family | Focus | Count | Critical | High | Medium | Low | Info |
|---|---|---|---|---|---|---|---|
| A | Architectural (old 13, now superseded) | 15 | 0 | 2 | 0 | 2 | 11 |
| T | Threading/Atomicity | 9 | 1 | 3 | 3 | 2 | 0 |
| E | E2E Trace (Pass 1) | 26 | 0 | 5 | 11 | 9 | 1 |
| F | Deep E2E (Pass 3) | 30 | 0 | 1 | 8 | 17 | 4 |
| Total | | 80 | 1 | 11 | 22 | 30 | 16 |
Most Dangerous Single Flaw: F15
An exception in on_venue_event() during the fill-feedback loop stops the
chain mid-apply. The ACK applied but the FILL didn't. Slot in ENTRY_WORKING
with no position. No retry mechanism, no recovery path. The slot is stuck
forever until manual intervention. Zero defense in depth — no try/except, no
undo, no validation that the slot reached a consistent state.
This is the single highest-impact E2E flaw because it requires no concurrency, no race condition, no unusual market conditions — just a transient FFI error during normal operation.
PASS 4 — SYSTEMATIC DOMAIN SCANS (Config, Rust, Persistence, Lifecycle)
Rust Kernel — Numeric & FSM Invariants
G1: EXIT_RESIDUAL action is entirely missing from Rust KernelCommandType
File: _rust_kernel/src/lib.rs
string_enum! {
    enum KernelCommandType {
        ENTER, EXIT, MARK_PRICE, RECONCILE, CONTROL, CANCEL,
    }
}
Six variants. No EXIT_RESIDUAL. If any caller submits an intent with action = "EXIT_RESIDUAL", the string_enum deserializer fails — serde returns INVALID_INTENT_PARSE. Even if deserialization worked, there's no branch to handle residual-position cleanup. Any position with remaining size after partial exit legs has no way to trigger a clean-up exit via the intent system.
The Python KernelCommandType enum (contracts.py) does have EXIT_RESIDUAL, translated to "EXIT_RESIDUAL" string by _intent_to_payload. This string hits Rust's string_enum → parse error → INVALID_INTENT_PARSE.
Fix: Add EXIT_RESIDUAL variant to Rust enum + match arm that skips the NO_OPEN_POSITION guard for residual-sized positions.
Severity: Critical
G2: into_c_string uses unwrap() — panics on interior NUL byte
File: _rust_kernel/src/lib.rs:1477
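fn into_c_string(value: &str) -> *mut c_char {
    CString::new(value).unwrap().into_raw()
}
CString::new() returns Err if the string contains a NUL ('\0') byte. .unwrap() panics at the C FFI boundary. Note that serde_json::to_string() escapes U+0000 as \u0000 inside string values, so serialized JSON does not itself contain a raw NUL byte; the panic is reachable for any string handed to into_c_string that has not passed through that escaping (for example a hand-built diagnostic or error message that embeds user-controlled text from KernelIntent, VenueEvent, or TradeSlot). When it triggers, it panics the entire process.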
Triggered by every FFI call that returns a string:
- dita_kernel_process_intent_json
- dita_kernel_on_venue_event_json
- dita_kernel_reconcile_slots_json
- dita_kernel_snapshot_json
- dita_kernel_get_slot_json
Fix: Replace .unwrap() with an error path that returns a null pointer, e.g. CString::new(value).map(CString::into_raw).unwrap_or(ptr::null_mut()), or feed through invalid_intent_cstring.
Severity: Critical
G3: process_intent EXIT hardcodes prev_state = POSITION_OPEN unconditionally
File: _rust_kernel/src/lib.rs:842-890
slot.fsm_state = TradeStage::EXIT_REQUESTED; // unconditional override
let transition = self.transition(
    &slot,
    TradeStage::POSITION_OPEN, // always POSITION_OPEN
    slot.fsm_state.clone(),
    "EXIT_INTENT",
);
Three problems:
(a) Transition prev_state is a lie. If the slot was in EXIT_WORKING, EXIT_SENT, EXIT_REQUESTED, or POSITION_PARTIALLY_CLOSED, the transition record says POSITION_OPEN — wrong.
(b) Backward transition. If the slot is EXIT_WORKING and a new EXIT intent arrives, fsm_state is set to EXIT_REQUESTED — a backward transition from EXIT_WORKING → EXIT_REQUESTED. This corrupts the FSM.
(c) No state guard. EXIT should only be allowed from POSITION_OPEN, EXIT_WORKING (for additional legs), or POSITION_PARTIALLY_CLOSED. Currently any state that passes !is_free() && !closed && size > 0 can transition to EXIT_REQUESTED.
Fix: Check actual FSM state before allowing EXIT, log actual prev_state, guard against backward transitions.
Severity: Critical
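The guard described in the fix can be modelled in a few lines. The sketch below is a Python model of the intended rule, not the Rust implementation: the state names come from the text above, while `request_exit` and the choice to keep an EXIT_WORKING slot in place for an additional leg (rather than moving it backward) are assumptions.

```python
EXIT_ALLOWED_FROM = {"POSITION_OPEN", "EXIT_WORKING", "POSITION_PARTIALLY_CLOSED"}


def request_exit(fsm_state: str):
    """Return (new_state, transition) or raise if EXIT is not allowed.

    The transition tuple is (prev_state, new_state, reason), with prev_state
    taken from the slot's actual state instead of a hardcoded value.
    """
    if fsm_state not in EXIT_ALLOWED_FROM:
        raise ValueError(f"EXIT not allowed from {fsm_state}")
    if fsm_state == "EXIT_WORKING":
        # Assumption: an additional leg keeps the slot in EXIT_WORKING,
        # which avoids the backward EXIT_WORKING -> EXIT_REQUESTED move.
        return fsm_state, None
    return "EXIT_REQUESTED", (fsm_state, "EXIT_REQUESTED", "EXIT_INTENT")


print(request_exit("POSITION_PARTIALLY_CLOSED"))
```

The recorded previous state now reflects where the slot actually was, and states outside the allowed set are rejected instead of being overwritten.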
G4: consume_exit_leg advances beyond last valid index — stale all_legs_done variable
File: _rust_kernel/src/lib.rs:1420-1435
let all_legs_done = slot.active_leg_index >= slot.exit_leg_ratios.len(); // (A)
let should_close = (slot.size <= 1e-12 || (!partial && all_legs_done)); // (B)
if !partial {
slot.consume_exit_leg(); // (C) — advances active_leg_index POST (A)
}
if should_close && slot.size <= 1e-12 { // (D) — close
} else if !partial && !all_legs_done { // (E) — stale! uses (A) not post-advance index
On the last leg (active_leg_index = len - 1):
- (A): all_legs_done = false (pre-advance)
- (C): advances to len (exhausted)
- (E): !partial && !false = true → enters POSITION_OPEN instead of examining should_close with post-advance index
The all_legs_done variable is captured before consume_exit_leg advances the index. Branch (E) should use the post-advance index to correctly detect exhaustion.
After exhaustion, next_exit_ratio() returns 1.0 (out-of-bounds unwrap_or(1.0)) — silently tries to exit remaining size as 100% instead of detecting completion.
Severity: Critical
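The stale-variable effect can be reproduced with a small Python model of branches (A) to (E). This is a model for illustration only, not the Rust code: the fall-through label "LEGS_EXHAUSTED" and the `use_post_advance` switch are hypothetical, added to contrast the current behaviour with a re-evaluated flag.

```python
def next_state(active_leg_index: int, n_legs: int, size: float, partial: bool,
               use_post_advance: bool) -> str:
    """Model of the branch order quoted above for a single fill."""
    all_legs_done = active_leg_index >= n_legs                        # (A)
    should_close = size <= 1e-12 or (not partial and all_legs_done)   # (B)
    if not partial:
        active_leg_index += 1                                         # (C)
        if use_post_advance:
            # Re-evaluate after the advance instead of reusing (A).
            all_legs_done = active_leg_index >= n_legs
    if should_close and size <= 1e-12:                                # (D)
        return "CLOSED"
    if not partial and not all_legs_done:                             # (E)
        return "POSITION_OPEN"
    return "LEGS_EXHAUSTED"


# Last of three legs fills, but a small residual size remains.
stale = next_state(2, 3, 0.1, partial=False, use_post_advance=False)
fresh = next_state(2, 3, 0.1, partial=False, use_post_advance=True)
print(stale, fresh)
```

With the pre-advance flag the model returns to POSITION_OPEN even though no legs remain; with the re-evaluated flag it takes the exhaustion branch, where the residual can be handled explicitly.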
G5: realized_pnl uses unbounded f64 — overflows to inf at extreme values
File: _rust_kernel/src/lib.rs:648-656
let notional = exit_size * slot.entry_price * slot.leverage.max(1.0);
delta * notional
No is_finite() check on intermediate products. At exit_price=1e200, entry_price=1e-200: delta = (1e200 - 1e-200) / 1e-200 ≈ 1e400 → inf. The resulting inf is stored in slot.realized_pnl, corrupting all future PnL tracking.
Subnormals: entry_price=5e-324 (the smallest positive subnormal) makes the division overflow to inf for any realistic exit price, since even 1.0 / 5e-324 exceeds f64::MAX.
Fix: Add is_finite() guards on both prices and cap intermediate products.
Severity: High
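Python floats are the same IEEE 754 doubles as Rust's f64, so the overflow and a possible guard can be sketched without the kernel. The function below is a hypothetical illustration that follows the formula described above for a long position; the name, the None-on-invalid convention, and the exact guard set are assumptions.

```python
import math


def realized_pnl(exit_price: float, entry_price: float,
                 exit_size: float, leverage: float):
    """Return PnL for a long position, or None if any value is not finite."""
    inputs = (exit_price, entry_price, exit_size, leverage)
    if not all(math.isfinite(x) for x in inputs) or entry_price <= 0.0:
        return None
    delta = (exit_price - entry_price) / entry_price
    notional = exit_size * entry_price * max(leverage, 1.0)
    pnl = delta * notional
    # Guard the result too: finite inputs can still overflow in the products.
    return pnl if math.isfinite(pnl) else None


unguarded_delta = (1e200 - 1e-200) / 1e-200   # overflows to inf
guarded = realized_pnl(1e200, 1e-200, 1.0, 1.0)
normal = realized_pnl(110.0, 100.0, 2.0, 5.0)
print(unguarded_delta, guarded, normal)
```

Checking the result as well as the inputs matters here: both prices in the extreme example are finite, and only the intermediate division overflows.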
G6: mark_price produces unbounded unrealized_pnl
File: _rust_kernel/src/lib.rs:384-399
self.unrealized_pnl = delta * self.size * self.entry_price * self.leverage;
// No is_finite() check on result
If any of delta, size, entry_price, or leverage is extreme, the product overflows to inf. No result guard. inf stored in unrealized_pnl forever. Capped only by the price <= 0.0 guard on input — no guard on the computation chain.
Also: self.entry_price = price at line 388 overwrites entry_price on every mark_price call for a position with entry_price <= 0.0, even when the position has been open for a while. This means a stale-zero entry_price gets set to the current market price on first mark_price after open, which is correct — but if the slot is reused (re-entry without resetting entry_price), the old entry price from the prior trade bleeds into unrealized PnL.
Severity: High
G7: process_intent ENTER — no is_finite() guard on target_size
File: _rust_kernel/src/lib.rs:806-807
intended_size: intent.target_size.max(0.0),
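f64::NAN.max(0.0) returns 0.0, not NaN: Rust's f64::max returns the non-NaN operand, so a NaN target_size is silently turned into a zero-size order. f64::INFINITY.max(0.0) returns inf. serde_json's default parser rejects bare NaN and Infinity tokens (they are not valid JSON), but Python's json.dumps emits them by default (allow_nan=True). So if the Python-side _first_invalid_intent_field guard is bypassed (F3: it allows these through), the payload either fails to deserialize or, on any path that builds the intent without JSON parsing, carries inf into intended_size in VenueOrder, corrupting all fill calculations.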
Similarly, reference_price is never validated for finiteness before being stored in VenueOrder.metadata.
Severity: High
G8: reconcile_slots_json — no dedup or bounds validation
File: _rust_kernel/src/lib.rs:1668-1675
for slot in slots {
if slot.slot_id < core.slots.len() {
core.slots[slot.slot_id] = slot.clone();
}
}
Two slots with the same slot_id: the second overwrites the first silently. A slot with slot_id >= core.slots.len(): silently dropped — no error, no diagnostic. Caller sees accepted=true even if some/all slots were not applied.
Severity: High
G9: exchange_order_id propagation uses wrong order target
File: _rust_kernel/src/lib.rs:1110-1125
let target = if slot.active_entry_order.is_some() {
slot.active_entry_order.as_mut()
} else {
slot.active_exit_order.as_mut()
};
If an entry order exists (even if fully filled) and an exit fill event arrives, the code updates the entry order's venue_order_id instead of the exit order's. The exit order's venue_order_id stays empty. Any subsequent CANCEL intent on the exit order fails because active_exit_order.venue_order_id is empty — the venue can't match the cancel.
Fix: Disambiguate by matching venue_client_id, or clear active_entry_order when entry is complete.
Severity: High
G10: CANCEL diagnostic code says NO_ACTIVE_EXIT_ORDER for entry cancel too
File: _rust_kernel/src/lib.rs:966-1005
if !has_cancellable_exit && !has_cancellable_entry {
return KernelResult {
diagnostic_code: KernelDiagnosticCode::NO_ACTIVE_EXIT_ORDER, // always says exit
details: json!({"reason": "NO_ACTIVE_EXIT_ORDER"}),
};
}
When neither exit nor entry is cancellable, the diagnostic returns NO_ACTIVE_EXIT_ORDER regardless of which order was the target. If the user wanted to cancel an entry order that's not in a cancellable state, the diagnostic is misleading.
Fix: Separate diagnostic codes: NO_ACTIVE_EXIT_ORDER, NO_ACTIVE_ENTRY_ORDER, ENTRY_NOT_CANCELLABLE.
Severity: High
G11: apply_fill entry-fill overwrites active_entry_order.intended_size with slot.size
File: _rust_kernel/src/lib.rs:1363-1377
On FULL_FILL entry, slot.active_entry_order is entirely replaced with a new VenueOrder where intended_size = slot.size (the fill amount) instead of the original intended size. The original intended size (which could be larger than fill size for partial fills) is lost.
If a duplicate fill event arrives (dedup fails due to missing event_id), the second fill would use slot.size as the basis for further fills — wrong values.
Severity: Medium
G12: leverage unbounded after is_finite() — no maximum cap
File: _rust_kernel/src/lib.rs:778
slot.leverage = if intent.leverage.is_finite() && intent.leverage > 0.0 {
intent.leverage // 1e100 accepted here
} else { 1.0 };
leverage = 1e100 passes is_finite(). Feeds into realized_pnl() as slot.leverage.max(1.0) = 1e100, producing notional = exit_size * entry_price * 1e100. Makes unrealized_pnl arbitrarily large.
No maximum leverage cap enforced anywhere — the exchange-level cap (DOLPHIN_BINGX_EXCHANGE_LEVERAGE_CAP) exists in BingxExecClientConfig but is never passed to the Rust kernel.
Severity: Medium
G13: resolve_slot fallback returns unwrap_or(0) — can misroute events
File: _rust_kernel/src/lib.rs:623
self.slots.first().map(|slot| slot.slot_id).unwrap_or(0)
When no slot matches the event (slot_id out of range or all slot filters fail), returns slot_id of the first slot (which may be 0 or any value). No diagnostic emitted — caller sees slot state change with no idea the event was misrouted.
Severity: Medium
G14: commit_slot silently ignores out-of-bounds slot_id
File: _rust_kernel/src/lib.rs:595-600
fn commit_slot(&mut self, slot: TradeSlot) {
if slot.slot_id < self.slots.len() {
self.slots[slot_id] = slot;
}
// else: silently dropped — no error returned
}
Mutations to out-of-bounds slot are silently discarded. Can happen if slot.slot_id is corrupted via set_slot_from_json causing index mismatch between slot.slot_id and the actual slot position.
Severity: Medium
Configuration & Validation Chain
G15: Zero __post_init__ validators on all config dataclasses
Every config dataclass in the system has zero field-level validation:
| Dataclass | Fields | Validators |
|---|---|---|
| KernelControlSnapshot | 16 | 0 |
| ControlUpdate | 16 | 0 |
| KernelIntent | 19 | 0 |
| TradeSlot | 22 | 0 |
| VenueOrder | 8 | 0 |
| VenueEvent | 18 | 0 |
| KernelTransition | 11 | 0 |
| KernelOutcome | 8 | 0 |
| AccountSnapshot | 9 | 0 |
| Total | 127 | 0 |
The only validation in the entire chain:
- _first_invalid_intent_field(): finiteness guard at the Python→Rust FFI boundary (not a dataclass validator)
- Rust leverage = if is_finite && > 0.0 { val } else { 1.0 }: post-hoc clamp
- Rust KernelCore::new(max_slots.max(1)): floor only, no ceiling
- launcher.py:143: max(1, int(...)) for active_slot_limit: floor only
No __post_init__ exists anywhere. No bounds check on any field except the two floor-only guards.
Severity: High
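A minimal sketch of what a field-level validator for G15 could look like. ExampleIntent is a hypothetical stand-in, not the real KernelIntent; its field names and the MAX_LEVERAGE cap of 125 are assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass

MAX_LEVERAGE = 125.0  # assumed cap for illustration; not taken from the source


@dataclass(frozen=True)
class ExampleIntent:
    """Hypothetical stand-in for KernelIntent with a __post_init__ validator."""

    slot_id: int
    target_size: float
    leverage: float = 1.0

    def __post_init__(self) -> None:
        if self.slot_id < 0:
            raise ValueError(f"slot_id must be >= 0, got {self.slot_id}")
        # Reject NaN/inf before any range comparison (NaN compares False to everything).
        for name in ("target_size", "leverage"):
            value = getattr(self, name)
            if not math.isfinite(value):
                raise ValueError(f"{name} must be finite, got {value!r}")
        if self.target_size < 0.0:
            raise ValueError(f"target_size must be >= 0, got {self.target_size}")
        if not 0.0 < self.leverage <= MAX_LEVERAGE:
            raise ValueError(f"leverage must be in (0, {MAX_LEVERAGE}], got {self.leverage}")
```

Rejecting at construction time means a bad value fails where it is created rather than after it has crossed the FFI boundary.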
G16: DITA_V2_DEBUG_CLICKHOUSE defaults to True when env var is unset
File: launcher.py:133
debug = _env_bool("DITA_V2_DEBUG_CLICKHOUSE", True)
_env_bool (launcher.py:75) returns default when the env var is unset. So debug = True by default. Every runtime writes debug traces to ClickHouse by default. DITA_V2_DEBUG_CLICKHOUSE=False is required to disable it.
This is not a bug per se, but it means debug ClickHouse writes are on by default, adding ~10 ClickHouse insertions per process_intent call (every transition + position state + trade event) that most production deployments may not want.
Severity: Informational
G17: String config fields have no charset/length validation — Zinc region injection risk
File: control.py:31-53, real_zinc_plane.py:30
runtime_namespace, strategy_namespace, event_namespace, actor_name, exec_venue, data_venue, ledger_authority are all free-form strings with no validation. They're used as:
- Zinc shared memory region names: self.prefix + "." + namespace + "." + kind. An attacker-controlled namespace could collide with other processes' Zinc regions
- ClickHouse table names: DOLPHIN_BINGX_JOURNAL_STRATEGY is used as a table suffix, a SQL injection risk in the ClickHouse journal
- Hazelcast map names: same injection risk via event_namespace
Severity: Medium
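One way to close G17 (and the related G20) is an allow-list check applied before any namespace or table suffix is used to build a name. The sketch below is an assumption about what such a check could look like; validate_identifier and its 63-character limit are hypothetical and not part of the codebase.

```python
import re

# Letters, digits, underscore; must not start with a digit; at most 63 chars.
_IDENT_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,62}")


def validate_identifier(value: str, *, field: str) -> str:
    """Return value unchanged if it is a safe identifier, else raise ValueError."""
    if not isinstance(value, str) or _IDENT_RE.fullmatch(value) is None:
        raise ValueError(f"{field} is not a safe identifier: {value!r}")
    return value
```

An allow-list is used rather than escaping because the same string ends up in three different naming schemes (Zinc, ClickHouse, Hazelcast), each with its own quoting rules.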
G18: exit_leg_ratios no sum-to-1 validation
KernelIntent.exit_leg_ratios and TradeSlot.exit_leg_ratios are tuple/list of floats. No validator ensures they sum to approximately 1.0. Ratios summing to 0.5 leave the position partially closed forever (residual can't be exited because next_exit_ratio() returns 1.0 after exhaustion, exiting 100% of remaining — which may exceed the intended residual).
Severity: Low
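A sketch of the missing G18 check, assuming the ratios are fractions of the original position that should sum to 1.0. validate_exit_leg_ratios and its tolerance are hypothetical.

```python
import math
from typing import Iterable, Tuple


def validate_exit_leg_ratios(ratios: Iterable[float], *, tol: float = 1e-9) -> Tuple[float, ...]:
    """Return the ratios as a tuple if they are positive, finite, and sum to 1.0."""
    legs = tuple(float(r) for r in ratios)
    if not legs:
        raise ValueError("exit_leg_ratios must not be empty")
    if any(not math.isfinite(r) or r <= 0.0 for r in legs):
        raise ValueError(f"exit_leg_ratios must be finite and > 0: {legs}")
    total = math.fsum(legs)  # fsum avoids accumulated rounding error
    if not math.isclose(total, 1.0, rel_tol=0.0, abs_tol=tol):
        raise ValueError(f"exit_leg_ratios must sum to 1.0, got {total}")
    return legs
```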
G19: RealZincControlPlane.read() has no sequence check — torn-read risk
File: real_control_plane.py:88-94
def read(self):
payload = _decode_packet(self.region.as_buffer())
control = payload.get("control")
if not isinstance(control, dict):
return self._snapshot
self._snapshot = KernelControlSnapshot(**control)
return self._snapshot
The binary packet has a 64-bit sequence number but read() never checks it. Between the zero-write and packet-write in _write_region, a reader sees an empty buffer → _decode_packet fails → falls back to self._snapshot (stale). Between the packet-write and struct.pack header (order depends on implementation), a reader sees a partial write with wrong size → _decode_packet fails.
No checksum on the wire format: struct.pack("!QQ", seq, len) + json_bytes. A torn write produces garbage that json.loads may or may not parse successfully.
Severity: Low
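The torn-read problem in G19 can be narrowed by validating the header before trusting the body. The sketch below extends the "!QQ" header described above with a CRC32 field and a last_seq argument; both are assumptions for illustration, as are the function names. It also assumes writers start sequence numbers at 1, so an all-zero buffer is never accepted.

```python
import json
import struct
import zlib

# seq, payload length, CRC32 of payload. The CRC field is an addition to the
# "!QQ" format described in the text.
_HEADER = struct.Struct("!QQI")


def encode_packet(seq: int, payload: dict) -> bytes:
    body = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    return _HEADER.pack(seq, len(body), zlib.crc32(body)) + body


def decode_packet(buf, last_seq: int):
    """Return (seq, payload), or None if the buffer is torn, corrupt, or not newer."""
    if len(buf) < _HEADER.size:
        return None
    seq, length, crc = _HEADER.unpack_from(buf, 0)
    body = bytes(buf[_HEADER.size:_HEADER.size + length])
    if len(body) != length or zlib.crc32(body) != crc:
        return None  # partial write or garbage
    if seq <= last_seq:
        return None  # zero-filled buffer (seq 0) or a packet already seen
    return seq, json.loads(body)
```

A reader would keep last_seq between calls and fall back to its cached snapshot only when decode_packet returns None, which makes the stale-fallback path explicit instead of accidental.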
G20: DOLPHIN_BINGX_JOURNAL_STRATEGY/_DB — ClickHouse SQL injection risk
File: launcher.py:202-203
"DOLPHIN_BINGX_JOURNAL_STRATEGY": os.environ.get("DOLPHIN_BINGX_JOURNAL_STRATEGY", ""),
"DOLPHIN_BINGX_JOURNAL_DB": os.environ.get("DOLPHIN_BINGX_JOURNAL_DB", ""),
These are used as ClickHouse table and database name suffixes in pink_clickhouse.py. An attacker who can set env vars can inject SQL via semicolons or quotes in the table name. ClickHouse supports INSERT INTO db.table FORMAT JSONEachRow — a table name like positions; DROP TABLE ...; could be destructive.
Severity: Low (requires env var control, which implies broader access)
Persistence Schema Alignment
G21: entry_price used as exit_price in trade_events — data loss
File: pink_clickhouse.py (outside workspace)
The _write_trade_event function maps entry_price from slot.to_dict() to both the entry_price and exit_price columns. The actual exit fill price (available on the VenueEvent object) is never written to the exit_price column.
Result: Every trade_events row has exit_price == entry_price. The exit_price column is a dead column — always contains the entry price, never the actual fill.
Severity: High — data loss to DB for the most important trade metric.
G22: active_leg_index → entry_bar semantic mis-mapping
File: pink_clickhouse.py (outside workspace)
"entry_bar": int(slot_dict.get("active_leg_index", 0) or 0),
active_leg_index tracks the exit-leg-ratios cursor (which leg of a multi-leg exit we're on), not a bar count. The value 0 at position open and 1 after the first exit leg — neither value represents bars held. The entry_bar column stores the wrong concept.
Severity: Medium — column contains semantically meaningless data.
G23: capital_before arithmetic reconstruction absorbs cross-slot PnL
File: pink_clickhouse.py (outside workspace)
capital_before = capital_after - pnl_leg
capital_before is reconstructed by subtracting the current leg's PnL from the current capital. In a multi-slot system, other slots' PnL changes between legs are absorbed into capital_before. The column is always wrong in multi-slot scenarios because capital_after reflects total PnL from all slots, not just the leg being recorded.
Severity: Medium — wrong capital_before for multi-slot trading.
G24: Recovery trade_reconstruction always has trade_id=""
File: pink_clickhouse.py (outside workspace)
The persist_recovery_state function passes kernel.snapshot()["account"] (an account dict with keys capital, equity, realized_pnl, ...) where a slot dict is expected. The trade_id key does not exist on the account dict. The recovery_state row always has trade_id="".
Severity: Medium — recovery data is not associable with any trade.
G25: seen_event_ids, exit_leg_ratios, VenueOrder, metadata not in flat ClickHouse tables
These fields are:
- Present on the Python TradeSlot ✅
- Transmitted through Zinc shared memory ✅
- Stored in Hazelcast ✅
- Stored in ClickHouse dita_kernel_debug (full JSON) ✅
- NOT extracted into the main ClickHouse flat tables position_state, trade_events, trade_exit_legs ❌
Data exists at the source, travels through the pipeline, hits the debug journal — but is lost in the main analytical tables.
Severity: Low (data exists in debug journal if needed for reconstruction)
G26: _safe_float silently converts NaN/None/Inf to 0.0
File: utils.py:15
def _safe_float(v, default=0.0):
try:
f = float(v)
if not math.isfinite(f):
return default
return f
except (TypeError, ValueError, OverflowError):
return default
Used in multiple ClickHouse writers. Silently converts NaN/Inf/parsing errors to 0.0. No diagnostic emitted when a non-finite value reaches the persistence layer — data silently zeroed.
Severity: Low (safe default but silent corruption)
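A sketch of a variant for G26 that keeps the safe default but makes the substitution visible. The logger name and the field argument are assumptions; the real writers would need to pass a column name at each call site.

```python
import logging
import math

logger = logging.getLogger("pink.persistence")  # hypothetical logger name


def safe_float_logged(value, *, field: str, default: float = 0.0) -> float:
    """Like _safe_float, but logs whenever the default is substituted."""
    try:
        result = float(value)
    except (TypeError, ValueError, OverflowError):
        logger.warning("non-numeric %s=%r replaced with %r", field, value, default)
        return default
    if not math.isfinite(result):
        logger.warning("non-finite %s=%r replaced with %r", field, value, default)
        return default
    return result
```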
Lifecycle & Resource Management
G27: build_launcher_bundle has no exception safety — prior resources leak
File: launcher.py:264-300
def build_launcher_bundle(...):
control_plane = _build_control_plane(...)
projection = build_projection(...)
zinc_plane = _build_zinc_plane(...)
venue = _build_venue(...)
kernel = ExecutionKernel(...) # ← if THIS fails, everything above leaks
If any step after the first raises, all previously built resources leak:
- RealZincPlane created → _build_venue() fails → 3 shared memory regions orphaned
- RealZincControlPlane created → _build_zinc_plane() fails → 1 shared memory region orphaned
- BingxVenueAdapter created → ExecutionKernel.__init__() fails → HTTP connection leaked
No try/finally anywhere in the builder. The init order is also optimized for forward construction, not backward cleanup.
Severity: High — shared memory leak on any build failure.
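The builder in G27 could be made exception-safe with contextlib.ExitStack. The sketch below uses a hypothetical build_bundle that takes an ordered list of factories; it assumes each built resource exposes close(), which the text says is not yet true for every component.

```python
from contextlib import ExitStack
from typing import Callable, List


def build_bundle(factories: List[Callable[[], object]]) -> list:
    """Build resources in order; if any factory raises, close the ones already built."""
    with ExitStack() as stack:
        built = []
        for factory in factories:
            resource = factory()
            stack.callback(resource.close)  # registered only after successful construction
            built.append(resource)
        # Success: detach the callbacks so nothing is closed on exit.
        stack.pop_all()
        return built
```

ExitStack runs callbacks in reverse registration order, so cleanup on failure is the mirror image of construction.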
G28: RealZincPlane and RealZincControlPlane have no __del__
When close() is not called (exception in builder, forgotten cleanup, GC during shutdown), the shared memory regions opened by RealZincPlane (3 regions) and RealZincControlPlane (1 region) are orphaned on the OS. They persist in /dev/shm/ (or platform equivalent) until system reboot.
Python's __del__ is unreliable (not called on SIGKILL, not called if the object is part of a cycle without a GC run), but its absence means even normal garbage collection can't clean up.
Severity: High — shared memory leaks.
G29: Zero signal handlers — no cleanup on SIGTERM/SIGINT
$ grep -rn "signal\|SIGTERM\|SIGINT\|atexit" *.py # ZERO matches
When SIGTERM or SIGINT arrives:
- SIGTERM's default disposition terminates the process immediately; SIGINT raises KeyboardInterrupt in the main thread, and no atexit hook exists to run cleanup
- DITAv2LauncherBundle.close() is never called
- ExecutionKernel.__del__ is not called (CPython may run GC on normal exit but not reliably)
- All shared memory (RealZincPlane, RealZincControlPlane) is orphaned
- In-flight BingX HTTP calls are interrupted mid-stream
- Rust kernel handle is leaked
Severity: High
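A minimal sketch of the handler G29 says is missing. make_shutdown_handler and install_shutdown_handlers are hypothetical names; close_fn stands for whatever cleanup the process owns, for example the bundle's close(). Note that signal.signal can only be called from the main thread, and that nothing helps against SIGKILL.

```python
import signal
import sys
from typing import Callable


def make_shutdown_handler(close_fn: Callable[[], None]):
    """Build a signal handler that runs close_fn, then exits with 128 + signum."""

    def _handler(signum, frame):
        close_fn()
        sys.exit(128 + signum)  # raises SystemExit so finally blocks still run

    return _handler


def install_shutdown_handlers(close_fn: Callable[[], None]) -> None:
    """Route SIGTERM and SIGINT through the same cleanup path (main thread only)."""
    handler = make_shutdown_handler(close_fn)
    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, handler)
```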
G30: ExecutionKernel has no close() — relies on __del__ for Rust handle cleanup
ExecutionKernel has __del__ which calls _get_rust().destroy(backend). No close() method. DITAv2LauncherBundle.close() never touches the kernel — the Rust handle is only freed by GC at unpredictable time.
If any code holds a stale _backend pointer, the handle dangles when GC runs. If __del__ is suppressed (e.g., during interpreter shutdown with cyclic references), the Rust handle leaks permanently.
Fix: Add close() to ExecutionKernel, call it from DITAv2LauncherBundle.close().
Severity: High
G31: projection (Hazelcast) never closed
build_projection() returns a HazelcastProjection which holds a Hazelcast client connection. No close() or disconnect() method exists on the projection, projector, or row writer. DITAv2LauncherBundle.close() doesn't touch the projection. The Hazelcast client connection leaks on shutdown.
Severity: Medium
G32: _maybe_close() only calls the first method found — break skips the second
File: launcher.py:233-243
for method_name in ("close", "disconnect"):
method = getattr(obj, method_name, None)
if method is None:
continue
try:
result = method()
except TypeError:
continue
if inspect.isawaitable(result):
try:
asyncio.run(result)
except RuntimeError:
pass
break # ← ONLY calls the FIRST found method, never both
If an object has both close() and disconnect(), only close() is called. disconnect() is silently skipped. Also: asyncio.run(result) silently swallows RuntimeError when a running event loop exists — the coroutine is never executed.
Currently no object has both, but the pattern is fragile.
Severity: Low
G33: close() is not idempotent for RealZinc components
RealZincPlane.close() and RealZincControlPlane.close() call their Zinc region's close() method. If called twice, the second call operates on an already-closed region and likely crashes inside Zinc's shared memory code.
No references are nulled after close: DITAv2LauncherBundle.close() calls _maybe_close(), which leaves self.venue, self.zinc_plane, and self.control_plane pointing at the closed objects. A double close() is unsafe.
Severity: Low
G34: No context manager on DITAv2LauncherBundle
DITAv2LauncherBundle has no __enter__/__exit__. Users must manually call close(). No with pattern exists anywhere in the source for lifecycle management. No __del__ fallback on the bundle either.
Severity: Low (ergonomic, not a leak source if caller follows the pattern)
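G33 and G34 can be addressed together. The sketch below uses a hypothetical ClosableBundle in place of DITAv2LauncherBundle: close() is idempotent, releases resources in reverse order, and is wired into the with statement.

```python
class ClosableBundle:
    """Hypothetical stand-in for DITAv2LauncherBundle with an idempotent close()."""

    def __init__(self, *resources):
        self._resources = list(resources)
        self._closed = False

    def close(self) -> None:
        if self._closed:
            return  # second and later calls are no-ops
        self._closed = True
        errors = []
        for resource in reversed(self._resources):
            try:
                resource.close()
            except Exception as exc:  # keep closing the rest, report afterwards
                errors.append(exc)
        self._resources.clear()  # drop references to closed objects
        if errors:
            raise errors[0]

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.close()
        return False  # never suppress the caller's exception
```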
G35: BingxVenueAdapter.connect() exists but is never called by the launcher
BingxDirectExecutionAdapter has a connect() method that initializes the lifetime HTTP client. BingxVenueAdapter has connect() that calls _call_backend("connect"). Neither is called in build_launcher_bundle() or _build_venue(). If the adapter's submit_intent() relies on a connected client, it initializes lazily — but the connect path is dead code that exists but is never invoked.
Severity: Informational
G36: Only one try/finally in the entire codebase
The only try/finally is _RustKernelLib._take_string() (rust_backend.py:140-143) which frees the Rust C string. All other resource management uses try/except with no finally.
No cleanup is guaranteed on exception:
- build_launcher_bundle(): no cleanup on failure
- process_intent(): no cleanup of partial slot state on venue event exception
- on_venue_event(): no cleanup on FFI failure
- _set_slot(): no cleanup on projection or Zinc write failure
Severity: High (across all layers)
Pass 4 Summary
| # | Flaw | Layer | Severity |
|---|---|---|---|
| G1 | EXIT_RESIDUAL action missing from Rust KernelCommandType | Rust | Critical |
| G2 | into_c_string unwrap() panics on NUL byte | Rust | Critical |
| G3 | EXIT hardcodes prev_state=POSITION_OPEN, allows backward FSM transition | Rust | Critical |
| G4 | consume_exit_leg stale all_legs_done variable: wrong branch after last leg | Rust | Critical |
| G5 | realized_pnl unbounded f64 overflow to inf | Rust | High |
| G6 | mark_price unbounded unrealized_pnl, no result guard | Rust | High |
| G7 | ENTER no is_finite() guard on target_size | Rust | High |
| G8 | reconcile_slots_json no dedup or bounds validation | Rust | High |
| G9 | exchange_order_id update targets wrong order; exit cancel broken | Rust | High |
| G10 | CANCEL diagnostic always says NO_ACTIVE_EXIT_ORDER | Rust | High |
| G11 | apply_fill overwrites intended_size with slot.size | Rust | Medium |
| G12 | No max leverage cap enforced by kernel | Rust | Medium |
| G13 | resolve_slot fallback returns unwrap_or(0), misroutes events | Rust | Medium |
| G14 | commit_slot silently ignores out-of-bounds slot_id | Rust | Medium |
| G15 | Zero `__post_init__` validators on all config dataclasses | Config | High |
| G16 | DITA_V2_DEBUG_CLICKHOUSE defaults to True when unset | Config | Info |
| G17 | String config fields: Zinc region injection risk | Config | Medium |
| G18 | exit_leg_ratios no sum-to-1 validation | Config | Low |
| G19 | RealZincControlPlane.read() no sequence check — torn-read risk | Config | Low |
| G20 | ClickHouse journal strategy/db env vars — SQL injection risk | Config | Low |
| G21 | entry_price used as exit_price in trade_events — data loss | Persistence | High |
| G22 | active_leg_index → entry_bar semantic mis-mapping | Persistence | Medium |
| G23 | capital_before arithmetic absorbs cross-slot PnL | Persistence | Medium |
| G24 | Recovery trade_reconstruction always has trade_id="" | Persistence | Medium |
| G25 | seen_event_ids, exit_leg_ratios, VenueOrder, metadata not in flat CH tables | Persistence | Low |
| G26 | _safe_float silently converts NaN/None/Inf to 0.0 | Persistence | Low |
| G27 | build_launcher_bundle no exception safety — prior resources leak | Lifecycle | High |
| G28 | RealZincPlane/RealZincControlPlane no `__del__`: SHM orphaned | Lifecycle | High |
| G29 | Zero signal handlers: no cleanup on SIGTERM/SIGINT | Lifecycle | High |
| G30 | ExecutionKernel has no close(): relies on `__del__` for Rust handle | Lifecycle | High |
| G31 | Hazelcast projection never closed | Lifecycle | Medium |
| G32 | _maybe_close() break skips second method | Lifecycle | Low |
| G33 | close() not idempotent for RealZinc components | Lifecycle | Low |
| G34 | No context manager on DITAv2LauncherBundle | Lifecycle | Low |
| G35 | BingxVenueAdapter.connect() never called | Lifecycle | Info |
| G36 | Only one try/finally in entire codebase | Lifecycle | High |
Pass 4 Severity Distribution
| Severity | Count |
|---|---|
| Critical | 4 (G1, G2, G3, G4) |
| High | 13 (G5-G10, G15, G21, G27, G28, G29, G30, G36) |
| Medium | 9 (G11-G14, G17, G22, G23, G24, G31) |
| Low | 8 (G18, G19, G20, G25, G26, G32, G33, G34) |
| Info | 2 (G16, G35) |
Combined Catalog (All 4 Passes)
| Pass | Focus | Count | Critical | High | Medium | Low | Info |
|---|---|---|---|---|---|---|---|
| A | Architectural | 15 | 0 | 2 | 0 | 2 | 11 |
| T | Threading/Atomicity | 9 | 1 | 3 | 3 | 2 | 0 |
| E | E2E Trace | 26 | 0 | 4 | 10 | 11 | 1 |
| F | Deep E2E (Pass 3) | 30 | 0 | 1 | 8 | 17 | 4 |
| G | Domain Scans (Pass 4) | 36 | 4 | 13 | 9 | 8 | 2 |
| Total | | 116 | 5 | 23 | 30 | 40 | 18 |
PASS 5 — EDGE DOMAINS (Dependencies, Error Handling, Types, Contracts)
H1: No Python dependency declaration files exist in workspace
Files: workspace root
Zero requirements.txt, setup.py, setup.cfg, pyproject.toml, Pipfile, or poetry.lock anywhere. All Python package dependencies are entirely implicit — determined by what's installed in the runtime environment. No reproducible installs, no version pinning, no audit trail.
The Rust side does have Cargo.toml + Cargo.lock — but all 4 direct Rust deps use open ranges ("0.4", "0.2", "1", "1").
Severity: Critical
H2: Rust kernel compiled from source on every cold start via subprocess
File: rust_backend.py:60-72
def _ensure_library() -> Path:
path = _library_path()
if not path.exists():
_build_library() # cargo build --release
return path
def _build_library():
subprocess.run(
["cargo", "build", "--release", ...],
check=True, # no timeout!
)
First load takes 3-10 minutes (Rust compilation). Requires Rust toolchain in production. subprocess.run() has no timeout= — if cargo hangs (network, disk, lock contention), the Python process hangs indefinitely. No prebuilt binary distribution.
Severity: Critical
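For the hang described in H2, subprocess.run accepts a timeout argument. The sketch below wraps it in a hypothetical run_build helper; the helper name and the choice to convert failures to RuntimeError are assumptions, and shipping a prebuilt library would remove the need for it entirely.

```python
import subprocess
from typing import Sequence


def run_build(cmd: Sequence[str], *, timeout_s: float) -> None:
    """Run a build command, failing loudly on timeout or non-zero exit."""
    try:
        subprocess.run(list(cmd), check=True, timeout=timeout_s, capture_output=True)
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills the child before raising TimeoutExpired.
        raise RuntimeError(f"build timed out after {timeout_s}s: {list(cmd)}") from exc
    except subprocess.CalledProcessError as exc:
        raise RuntimeError(f"build failed with exit code {exc.returncode}") from exc
```

A call site would look like run_build(["cargo", "build", "--release"], timeout_s=900), with the timeout value chosen for the slowest expected cold build.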
H3: Zero logging — every swallowed error is invisible
The entire codebase has zero use of Python's logging module, print(), or warnings.warn() for error reporting. Every except: pass, except Exception: pass, and return default silently discards the error. There is no mechanism to detect, alert, or diagnose production failures.
All try/except: pass sites found:
| # | File:Line | What's Hidden |
|---|---|---|
| 1 | bingx_venue.py:51 | float() conversion failure on any API field value |
| 2 | bingx_venue.py:133 | regex match failure in rate-limit parsing |
| 3 | bingx_venue.py:136 | int/float conversion of retry_after |
| 4 | bingx_venue.py:325 | slot lookup failure during cancel asset resolution |
| 5 | bingx_venue.py:350 | BingXHttpError in cancel: network error looks like rejection |
| 6 | control.py:213 | RealZincControlPlane construction failure |
| 7 | launcher.py:187 | RealZincPlane construction failure |
| 8 | launcher.py:119 | malformed env var for active_slot_limit |
| 9 | launcher.py:243 | asyncio.run() RuntimeError in _maybe_close |
| 10 | launcher.py:277 | RealZincControlPlane fallback in build_control_plane |
| 11 | real_control_plane.py:97 | region.wait() exception: timeout and error both return False |
| 12 | real_control_plane.py:112 | region.notify() exception: writer thinks broadcast succeeded |
| 13 | real_zinc_plane.py:31 | Zinc SharedRegion import failure |
| 14 | projection.py:87 | HazelcastRowWriter import failure |
| 15 | rust_backend.py:102 | `__del__` exception in Rust kernel destroy |
| 16 | bingx_venue.py:55 | _row_float tries 5+ key fallbacks, each failing silently |
Severity: Critical
H4: _row_float rejects zero as a valid value — or pattern treats 0 as missing
File: bingx_venue.py:47-55
def _row_float(row, *keys, default=0.0):
for key in keys:
try:
value = float(row.get(key) or 0.0) # `or 0.0` treats 0 as missing
except Exception:
continue
if value == value and value not in (float("inf"), float("-inf")) and value != 0.0:
return value # explicitly rejects 0.0
return default
Two bugs: (a) except Exception: continue swallows ALL conversion errors, and (b) value != 0.0 explicitly rejects zero as a valid return value. A legitimate zero price, zero filled quantity, or zero position amount causes _row_float to skip that key and search further. If ALL keys return 0, the default 0.0 is returned — indistinguishable from "none of the keys existed."
Called by every single BingX API response parser: _position_qty(), _position_price(), _venue_order_from_row(), _event_from_row(), _fill_event_from_row(), _events_from_submit(), _events_from_cancel(), _filled_size_from_snapshots(). None verify the returned 0.0 is real vs. missing-vs-zero.
Severity: High
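A sketch of a replacement for the H4 helper that distinguishes "missing" from "zero" by returning None when no key yields a usable number. row_float_or_none is a hypothetical name; callers would have to handle None explicitly instead of receiving a silent 0.0.

```python
import math
from typing import Mapping, Optional


def row_float_or_none(row: Mapping, *keys: str) -> Optional[float]:
    """Return the first finite float found under keys (0.0 included), else None."""
    for key in keys:
        raw = row.get(key)
        if raw is None or raw == "":
            continue  # key absent or empty: try the next fallback
        try:
            value = float(raw)
        except (TypeError, ValueError):
            continue  # unparseable: try the next fallback
        if math.isfinite(value):
            return value
    return None
```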
H5: _backend_snapshot timeout returns stale data with no signal to callers
File: bingx_venue.py:242-251
def _backend_snapshot(self, *, timeout_ms=5000.0):
if not self._snapshot_ready.wait(timeout=timeout_ms / 1000.0):
with self._snap_lock:
return self._last_snapshot # STALE — could be hours old
When the snapshot-fetch condition times out, returns self._last_snapshot — initialized to None and only updated on successful fetches. First timeout returns None. All callers (cancel(), open_orders(), open_positions(), reconcile(), submit()) access .open_orders, .open_positions immediately — crash with AttributeError: 'NoneType' object has no attribute 'open_orders'.
Even after the first fetch succeeds, subsequent timeouts return the last-good snapshot which could be arbitrarily stale. No caller timestamps, version-checks, or requests a refresh.
Severity: High
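The H5 problem has two parts: the None case and the unbounded-age case. The sketch below shows one way to surface both to callers. TimedSnapshot, SnapshotUnavailable, and require_fresh are hypothetical names, and the adapter would need to record time.monotonic() on every successful fetch.

```python
import time
from dataclasses import dataclass
from typing import Any, Optional


class SnapshotUnavailable(RuntimeError):
    """Raised when no snapshot exists or the last one is too old to trust."""


@dataclass(frozen=True)
class TimedSnapshot:
    value: Any
    fetched_at: float  # time.monotonic() at the moment of the successful fetch


def require_fresh(snapshot: Optional[TimedSnapshot], *, max_age_s: float,
                  now: Optional[float] = None) -> Any:
    """Return the snapshot value, or raise if it is missing or older than max_age_s."""
    if snapshot is None:
        raise SnapshotUnavailable("no snapshot has been fetched yet")
    current = time.monotonic() if now is None else now
    age = current - snapshot.fetched_at
    if age > max_age_s:
        raise SnapshotUnavailable(f"snapshot is {age:.1f}s old (limit {max_age_s}s)")
    return snapshot.value
```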
H6: All enum-from-raw-string sites crash on unknown value — zero fallback
Files: rust_backend.py:250-386, real_zinc_plane.py:70-106
Every site that reconstructs a Python enum from a string received from the Rust kernel:
side=TradeSide(str(payload.get("side", TradeSide.FLAT.value)))
status=VenueOrderStatus(str(payload.get("status", VenueOrderStatus.NEW.value)))
fsm_state=TradeStage(str(payload.get("fsm_state", TradeStage.IDLE.value)))
kind=KernelEventKind(str(row.get("kind", KernelEventKind.ORDER_ACK.value)))
If the Rust kernel introduces a new enum variant (e.g., TradeStage::ENTRY_REJECTED) not in the Python TradeStage enum, TradeStage("ENTRY_REJECTED") raises ValueError with zero fallback. Crashes _outcome_from_payload() and takes down the kernel's event processing loop.
17 sites total across rust_backend.py and real_zinc_plane.py. No try/except, no mapping, no fallback on any of them.
Severity: High
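A sketch of a tolerant parser that the 17 sites in H6 could share. TradeStageExample is a hypothetical subset of the real enum with an added UNKNOWN member, and the logger name is an assumption. Whether falling back is safer than failing depends on the field, so the fallback is an explicit argument.

```python
import logging
from enum import Enum
from typing import Type, TypeVar

logger = logging.getLogger("pink.bridge")  # hypothetical logger name

E = TypeVar("E", bound=Enum)


class TradeStageExample(str, Enum):
    """Hypothetical subset of TradeStage with an explicit UNKNOWN member."""

    IDLE = "IDLE"
    POSITION_OPEN = "POSITION_OPEN"
    UNKNOWN = "UNKNOWN"


def parse_enum(enum_cls: Type[E], raw: object, *, fallback: E) -> E:
    """Map a raw kernel string to an enum member, logging unknown values."""
    try:
        return enum_cls(str(raw))
    except ValueError:
        logger.warning("unknown %s value %r, using %s", enum_cls.__name__, raw, fallback)
        return fallback
```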
H7: _legacy_intent reads getattr(intent, "order_type", "MARKET") — always defaults to MARKET
File: bingx_venue.py:282-285
metadata["_order_type"] = getattr(intent, "order_type", "MARKET")
metadata["_limit_price"] = float(getattr(intent, "limit_price", 0.0) or 0.0)
order_type and limit_price are NOT fields on KernelIntent (contracts.py). They only exist in intent.metadata as metadata["order_type"] if set by the caller. getattr(intent, "order_type", "MARKET") checks the dataclass field — not the metadata dict — so it ALWAYS returns "MARKET".
Even when the PINK runtime produces a LIMIT intent (LIMIT_DECISION → metadata["order_type"] = "LIMIT"), the legacy adapter converts it to MARKET because it reads the wrong source. Every LIMIT order is submitted as MARKET.
Similarly, limit_price is always 0.0 — any limit price from the metadata dict is lost.
Severity: High
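A sketch of reading the H7 fields from the metadata dict first. resolve_order_fields is a hypothetical helper, and the metadata key "limit_price" is an assumption (the text only confirms metadata["order_type"]).

```python
from typing import Tuple


def resolve_order_fields(intent) -> Tuple[str, float]:
    """Return (order_type, limit_price), preferring intent.metadata over attributes."""
    metadata = getattr(intent, "metadata", None) or {}
    raw_type = metadata.get("order_type") or getattr(intent, "order_type", None) or "MARKET"
    # "limit_price" as a metadata key is assumed, not confirmed by the source.
    raw_price = metadata.get("limit_price") or getattr(intent, "limit_price", None) or 0.0
    order_type = str(raw_type).upper()
    limit_price = float(raw_price)
    if order_type == "LIMIT" and limit_price <= 0.0:
        raise ValueError("LIMIT intent without a positive limit_price")
    return order_type, limit_price
```

Raising on a LIMIT intent with no price is a design choice: it prevents the adapter from quietly downgrading to MARKET, which is the behavior this flaw describes.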
H8: _venue_event_status_from_row silently maps unknown venue status to ACKED
File: bingx_venue.py:83-96
def _venue_event_status_from_row(status: str) -> VenueEventStatus:
normalized = _normalize_status(status)
# ... checks known statuses ...
return VenueEventStatus.ACKED # fallthrough for anything unknown
If BingX introduces a new status ("SUSPENDED", "PENDING_CANCEL", "EXPIRED"), it doesn't match any known mapping and silently returns ACKED. The kernel treats a suspended/cancelled/expired order as acknowledged — dangerous misclassification.
Severity: High
H9: RealZincPlane.write_slot() — slot written to slot_id >= slot_count is invisible
File: real_zinc_plane.py:206-210
def write_slot(self, slot):
with self._lock:
self._slot_cache[int(slot.slot_id)] = slot
payload = {"slots": [self._slot_cache[key].to_dict() for key in range(self._slot_count)]}
_slot_cache is a plain dict — accepts any key. But read_slots() only reads 0..slot_count-1. Writing to slot_id >= slot_count stores the slot in the cache but it's never serialized or read back. No error.
Severity: High
H10: RealZincControlPlane.read() has no atomicity with concurrent update()
File: real_control_plane.py:70-77
_write_region() zero-fills the buffer then writes the packet. If read() interleaves between zero-fill and write, it sees a partially-zeroed buffer → _decode_packet returns {} → returns stale self._snapshot with no observable error. No lock, no sequence check, no atomic read.
The same bug exists in RealZincPlane.read_slots() (real_zinc_plane.py:220-230) — reads shared memory while a concurrent write_slot() is in progress.
Severity: High
H11: _RustKernelLib lazily initialized with race condition
File: rust_backend.py:187-190
_RUST: _RustKernelLib | None = None
def _get_rust():
global _RUST
if _RUST is None:
_RUST = _RustKernelLib() # no lock — two threads can both create
return _RUST
No threading lock. Two concurrent calls to _get_rust() (possible via BingxVenueAdapter's thread pool) can create two _RustKernelLib objects. The _RustKernelLib() constructor runs _ensure_library() which runs subprocess.run(["cargo", "build", ...], check=True) — concurrent cargo build can corrupt the build directory.
Severity: High
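The H11 race is the classic unguarded lazy singleton. A minimal sketch of the double-checked locking fix follows; get_singleton and its factory parameter are hypothetical, with factory standing in for _RustKernelLib.

```python
import threading
from typing import Callable, Optional

_instance: Optional[object] = None
_instance_lock = threading.Lock()


def get_singleton(factory: Callable[[], object]) -> object:
    """Create the shared instance at most once, even under concurrent first calls."""
    global _instance
    if _instance is None:  # fast path: no lock once initialized
        with _instance_lock:
            if _instance is None:  # re-check: another thread may have won the race
                _instance = factory()
    return _instance
```

The lock also serializes the cargo build that the constructor may trigger, so two threads can no longer build into the same directory at once.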
H12: ExecutionKernel.__del__ can deadlock or use-after-free
File: rust_backend.py:527-531
def __del__(self):
backend = getattr(self, "_backend", None)
if backend is not None:
try:
_get_rust().destroy(backend) # accesses module singleton
except Exception:
pass
_get_rust() accesses the module-level _RUST singleton, which may already be destroyed if the module's garbage collection runs before the instance's. The destroy call happens outside any lock — one thread's destructor could destroy the Rust kernel while another thread is still using it. Use-after-free.
Severity: High
H13: MirroredControlPlane missing protocol methods
File: control.py:171-184
ControlPlane protocol defines wait() and notify(). MirroredControlPlane inherits from nothing and only implements read(), update(), and mirror(). Calling plane.wait() on a MirroredControlPlane raises AttributeError.
Severity: Medium
H14: TradeSlot.remaining_size() and VenueOrder.remaining_size() — same name, different semantics
Files: contracts.py:207-208, contracts.py:143-145
# TradeSlot:
def remaining_size(self) -> float:
return max(0.0, float(self.size)) # open position size
# VenueOrder:
def remaining_size(self) -> float:
return max(0.0, self.intended_size - self.filled_size) # unfilled order qty
Same method name, completely different semantics. TradeSlot.remaining_size() returns the current open position size. VenueOrder.remaining_size() returns the untracked/unfilled order quantity. A caller using slot.remaining_size() to check if an order is fully filled gets position size, which doesn't change with fills — it changes with entry/exit.
Severity: Medium
H15: _maybe_close() — asyncio.run() RuntimeError silently swallowed for coroutines
File: launcher.py:233-243
if inspect.isawaitable(result):
try:
asyncio.run(result)
except RuntimeError:
pass # SILENT — coroutine never executed
When _maybe_close is called from an async context (which it is: DITAv2LauncherBundle.close() is used in async test code), asyncio.run() raises RuntimeError("asyncio.run() cannot be called from a running event loop"). The exception is swallowed, the coroutine is never awaited, and the close/disconnect never happens.
Also: break after calling the first found method means if an object has both close() and disconnect(), disconnect() is never called.
Severity: Medium
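The two problems above (the dropped coroutine and the early `break`) can be sketched away as follows. This is an illustration under stated assumptions, not the launcher's code: `run_awaitable_blocking`, `close_resource`, and `DemoResource` are hypothetical names. When a loop is already running in the calling thread, the awaitable is driven on a short-lived helper thread with its own loop rather than being discarded.

```python
import asyncio
import inspect
import threading


async def _await(awaitable):
    return await awaitable


def run_awaitable_blocking(awaitable):
    """Drive an awaitable to completion from synchronous code."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return asyncio.run(_await(awaitable))  # no loop running in this thread
    # A loop is already running here: use a helper thread with its own
    # loop instead of silently dropping the coroutine.
    box = {}

    def worker():
        try:
            box["value"] = asyncio.run(_await(awaitable))
        except BaseException as exc:  # re-raised in the caller below
            box["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join()
    if "error" in box:
        raise box["error"]
    return box.get("value")


def close_resource(obj, method_names=("close", "disconnect")):
    """Call every teardown method that exists; return the names invoked."""
    called = []
    for name in method_names:
        method = getattr(obj, name, None)
        if not callable(method):
            continue
        result = method()
        if inspect.isawaitable(result):
            run_awaitable_blocking(result)
        called.append(name)
    return called


class DemoResource:
    """Hypothetical resource with an async close() and a sync disconnect()."""

    def __init__(self):
        self.log = []

    async def close(self):
        self.log.append("close")

    def disconnect(self):
        self.log.append("disconnect")
```

Two caveats apply. The helper thread blocks the calling loop until teardown finishes, and a coroutine tied to objects created on the original loop (for example an HTTP session) may not run correctly on a different loop. Where the caller is itself async, exposing an `async` close method and awaiting it directly is likely the cleaner fix.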
H16: _build_launcher_bundle imports BingxDirectExecutionAdapter inside function — import-time side effect is safe but lazy loading masks errors
File: launcher.py:254
def _build_venue(...):
from prod.clean_arch.adapters.bingx_direct import BingxDirectExecutionAdapter
Import inside function — safe, lazy, no side effects. But if the bingx_direct module has an import error (missing dependency, version mismatch), it only surfaces at bundle construction time, not at process start. A misconfigured production deployment would fail on the first trade, not on boot.
Severity: Informational
H17: load_dotenv() at module level — import-time filesystem I/O and env mutation
File: launcher.py:49-51
load_dotenv(PROJECT_ROOT / ".env") # executes on module import
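Runs on every import of launcher.py: reads the filesystem and mutates the process environment. Hard to isolate in tests: with python-dotenv's default override=False, variables the test has already set are kept, but every key that exists only in .env is injected at import, so a test cannot rely on a variable being unset. Also: if .env doesn't exist, load_dotenv() silently does nothing, so missing config is invisible.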
Severity: Medium
H18: _run() in BingxVenueAdapter — asyncio.run() thread-pool bridge blocks on every call
File: bingx_venue.py:225-233
def _run(self, result):
if inspect.isawaitable(result):
try:
asyncio.get_running_loop()
except RuntimeError:
return asyncio.run(result)
pool = self._get_executor()
return pool.submit(asyncio.run, result).result() # BLOCKS
Every call to _run() that receives an awaitable blocks the calling thread via .result(). The BingX HTTP call inside submit_intent() can take 1-5 seconds. During this block, the event loop cannot process other tasks. In a single-runtime deployment, this stalls the entire policy cycle.
Severity: Medium
H19: HazelcastClientLike protocol has zero concrete implementations in workspace
File: hazelcast_projection.py:13-15
class HazelcastClientLike(Protocol):
def get_map(self, name: str): ...
def get_topic(self, name: str): ...
Used as a type hint. No code in the workspace creates an object that satisfies this protocol. The Hazelcast client comes from an external package. If the external API changes, the protocol silently drifts — no compilation check.
Severity: Low
H20: _decode_packet in RealZinc — no bound check on size beyond > len(buf)-16
Files: real_control_plane.py:50-52, real_zinc_plane.py:70-81
seq, size = struct.unpack_from("!QQ", buf, 0)
if size <= 0 or size > len(buf) - 16:
return {}
payload = bytes(buf[16 : 16 + size]).decode("utf-8") # can raise UnicodeDecodeError
out = json.loads(payload) # can raise ValueError
If shared memory contains a corrupted size field within bounds, .decode() or json.loads() raises — uncaught by callers. A single corrupted byte in shared memory crashes the kernel.
Severity: Low
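A defensive decoder for the same wire layout (16-byte `!QQ` header followed by a UTF-8 JSON payload) could look like the sketch below. `decode_packet` and `encode_packet` are hypothetical helper names; only the header format and the empty-dict fallback are taken from the excerpt above.

```python
import json
import struct

HEADER = struct.Struct("!QQ")  # (sequence, payload size), as in the excerpt


def decode_packet(buf):
    """Return the decoded JSON object, or {} if the packet is malformed."""
    if len(buf) < HEADER.size:
        return {}
    _seq, size = HEADER.unpack_from(buf, 0)
    if size == 0 or size > len(buf) - HEADER.size:
        return {}
    raw = bytes(buf[HEADER.size:HEADER.size + size])
    try:
        out = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, ValueError):
        # json.JSONDecodeError is a ValueError subclass.
        return {}
    return out if isinstance(out, dict) else {}


def encode_packet(seq, obj):
    """Build a packet in the same layout, for round-trip testing."""
    payload = json.dumps(obj).encode("utf-8")
    return HEADER.pack(seq, len(payload)) + payload
```

Returning `{}` keeps the caller alive but also hides the corruption. Given the absence of logging noted in H3, a real fix would presumably record the failure before falling back.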
H21: All Rust crate features enabled by default — wasm-bindgen compiled into native shared library
File: _rust_kernel/Cargo.toml, transitive through chrono → iana-time-zone → js-sys → wasm-bindgen
The Rust kernel is a native .so/.dylib but chrono's iana-time-zone pulls in js-sys and wasm-bindgen (WebAssembly support) even on native Linux. Larger binary, longer compile times. cc crate pulled in for iana-time-zone-haiku which only compiles on Haiku OS.
Severity: Low
H22: socket.getaddrinfo monkey-patch in test generator code
File: gen2.py:295-298
Monkey-patches Python stdlib socket.getaddrinfo to force IPv4 as a workaround for IPv6 resolution failure in the deployment environment. If copied to production code, would break IPv6 connectivity.
Severity: Low
Pass 5 Summary
| # | Flaw | Layer | Severity |
|---|---|---|---|
| H1 | No Python dependency files (requirements.txt, pyproject.toml, etc.) | Build | Critical |
| H2 | Rust kernel compiled from source on every cold start — no prebuilt binary | Build | Critical |
| H3 | Zero logging — 16+ silent except:pass sites, no error observability | All | Critical |
| H4 | _row_float rejects zero as valid, except Exception: continue swallows all | Venue | High |
| H5 | _backend_snapshot timeout returns stale data/None; callers crash | Venue | High |
| H6 | All enum-from-raw-string sites crash on unknown variant (17 sites) | Bridge | High |
| H7 | _legacy_intent reads getattr(intent, "order_type") not metadata; always MARKET | Venue | High |
| H8 | Unknown venue status silently mapped to ACKED | Venue | High |
| H9 | RealZincPlane.write_slot() slot_id >= slot_count silently lost | Zinc | High |
| H10 | RealZincControlPlane.read() no atomicity with concurrent update() | Control | High |
| H11 | _RustKernelLib lazy init with race condition; concurrent cargo build | Bridge | High |
| H12 | ExecutionKernel.__del__ use-after-free on Rust handle | Bridge | High |
| H13 | MirroredControlPlane missing protocol methods (wait/notify) | Control | Medium |
| H14 | TradeSlot.remaining_size vs VenueOrder.remaining_size: different semantics | Contracts | Medium |
| H15 | _maybe_close asyncio.run RuntimeError silently swallowed | Launcher | Medium |
| H16 | Lazy import of bingx_direct masks config errors until first trade | Build | Info |
| H17 | load_dotenv() at module level: import-time I/O side effect | Launcher | Medium |
| H18 | _run() blocks event loop on every HTTP call via thread pool | Venue | Medium |
| H19 | HazelcastClientLike protocol has zero concrete implementations | Projection | Low |
| H20 | _decode_packet uncaught UnicodeDecodeError/ValueError on corrupted SHM | Zinc | Low |
| H21 | wasm-bindgen compiled into native library unnecessarily | Build | Low |
| H22 | socket.getaddrinfo monkey-patch in test code | Test | Low |
Pass 5 Severity Distribution
| Severity | Count |
|---|---|
| Critical | 3 (H1, H2, H3) |
| High | 9 (H4-H12) |
| Medium | 5 (H13, H14, H15, H17, H18) |
| Low | 4 (H19, H20, H21, H22) |
| Info | 1 (H16) |
Combined Catalog (All 5 Passes)
| Pass | Focus | Count | Critical | High | Medium | Low | Info |
|---|---|---|---|---|---|---|---|
| A | Architectural | 15 | 0 | 2 | 0 | 2 | 11 |
| T | Threading/Atomicity | 9 | 1 | 3 | 3 | 2 | 0 |
| E | E2E Trace (Pass 1) | 26 | 0 | 4 | 10 | 11 | 1 |
| F | Deep E2E (Pass 3) | 30 | 0 | 1 | 8 | 17 | 4 |
| G | Domain Scans (Pass 4) | 36 | 4 | 11 | 11 | 8 | 2 |
| H | Edge Domains (Pass 5) | 22 | 3 | 9 | 5 | 4 | 1 |
| Total | | 138 | 8 | 30 | 37 | 44 | 19 |
PASS 6 — MATH, TESTS, CONCURRENCY, RECOVERY, SECURITY
I1: Entry apply_fill sets slot.size = fill_size — multiple partial fills overwrite instead of accumulating
File: _rust_kernel/src/lib.rs:798
// Entry fill path in apply_fill:
slot.size = fill_size; // DIRECT ASSIGNMENT
slot.initial_size = slot.initial_size.max(fill_size); // max, not sum
If a single entry order receives multiple partial fills (e.g., LIMIT order on the book):
- Fill #1: fill_size = 0.5 → slot.size = 0.5, initial_size = max(0, 0.5) = 0.5
- Fill #2: fill_size = 0.3 → slot.size = 0.3, initial_size = max(0.5, 0.3) = 0.5
After both fills, the actual position is 0.8 but slot.size reports 0.3. The position is under-counted by 0.5 — 62.5% error.
The exit path correctly does slot.size = (slot.size - fill_size).max(0.0) (subtractive). The entry path should accumulate: slot.size += fill_size.
This only manifests with LIMIT orders that receive multiple partial fills over time, a scenario entirely absent from tests (I6).
Severity: Critical
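The arithmetic above can be reproduced with a small Python mirror of the two update rules. `SlotSketch` and both functions are hypothetical illustrations, not the Rust code. The sketch assumes each `fill_size` is the incremental quantity of that fill; if the venue reports cumulative filled quantity instead, assignment rather than addition would be the right rule.

```python
from dataclasses import dataclass


@dataclass
class SlotSketch:
    """Hypothetical two-field mirror of the slot state used in the example."""
    size: float = 0.0
    initial_size: float = 0.0


def apply_entry_fill_overwrite(slot, fill_size):
    """Behaviour described above: each fill replaces the running size."""
    slot.size = fill_size
    slot.initial_size = max(slot.initial_size, fill_size)


def apply_entry_fill_accumulate(slot, fill_size):
    """Proposed behaviour: fills add up; initial_size tracks the peak size."""
    slot.size += fill_size
    slot.initial_size = max(slot.initial_size, slot.size)
```

Feeding the fills 0.5 and 0.3 through both rules leaves the overwrite variant at size 0.3 and the accumulate variant at 0.8 (up to floating-point rounding), matching the 0.5 shortfall described above.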
I2: exit_ratio = 0.0 creates zero-size exit order — slot stuck in EXIT_REQUESTED
File: _rust_kernel/src/lib.rs:467-469
let exit_ratio = slot.next_exit_ratio(); // returns 0.0 from exit_leg_ratios=[0.0, ...]
let base_size = if slot.initial_size > 0.0 { ... } else { slot.size };
let exit_size = (base_size * exit_ratio).max(0.0); // = 0.0
When exit_leg_ratios contains 0.0 in any position, exit_size = 0.0. The zero-size exit order is submitted to the venue (intended_size = 0). On the fill side, realized_pnl() returns 0.0 (guarded by exit_size <= 0.0), and slot.size is unchanged. The slot stays in EXIT_REQUESTED with no means to advance — the leg is consumed but nothing happened. Subsequent exits may eventually handle this, but the zero-size leg is a wasted FSM transition that leaves the slot in a confusing intermediate state.
Also: NaN in exit_leg_ratios (from clamp(0.0, 1.0) not guarding NaN, though serde_json rejects NaN) would produce the same zero-size exit behavior.
Severity: Medium
I3: entry_price inconsistency — Python uses falsy check, Rust uses <= 0.0
File: contracts.py:88-98 (Python), _rust_kernel/src/lib.rs:227-228 (Rust)
# Python TradeSlot.mark_price():
self.entry_price = self.entry_price or price # falsy — keeps -0.5, 0.0 replaced
# Rust TradeSlot::mark_price():
if self.entry_price <= 0.0 { self.entry_price = price; } // catches -0.5, replaces it
If entry_price is negative (possible only via set_slot_json direct injection — not from normal trading), Python keeps it and computes unrealized_pnl with wrong sign. Rust replaces it. The Python-side mark_price is only called from ExecutionKernel.mark_price() in rust_backend.py:LOW-1, which never writes back to the Rust kernel — so the Python-side calculation is purely local and the inconsistency has no effect on the Rust kernel's canonical state. However, the observe_slots call after mark_price re-reads from the Rust kernel, which recomputes PnL correctly. The Python-side mark_price is effectively wasted computation that never feeds back.
Severity: Informational
I4: No Rust unit tests for 99% of kernel functionality
File: _rust_kernel/src/lib.rs:1731-1765
Only 1 Rust test exists: enter_then_ack_fill — creates a 2-slot kernel, submits ENTER, sends ACK, asserts state transitions.
Not tested in Rust:
- EXIT, CANCEL, MARK_PRICE, RECONCILE, CONTROL actions
- Any FILL event (PARTIAL, FULL)
- CANCEL_ACK, CANCEL_REJECT, ORDER_REJECT
- RATE_LIMITED handling
- Multi-leg exits
- consume_exit_leg edge cases
- realized_pnl() formula with boundary values
- mark_price() with extreme values
- resolve_slot() fallback path
- reconcile_slots_json dedup/overflow
- Any C FFI boundary function
- Any serde deserialization failure
- Null pointer handling
No #[cfg(test)] module exists — the single test is inline. No Rust integration tests (tests/ directory).
Severity: High
I5: MockVenueScenario rejection flags exist but zero tests use them
File: mock_venue.py:23-35
@dataclass
class MockVenueScenario:
reject_entries: bool = False
reject_exits: bool = False
cancel_reject: bool = False
Three boolean flags to simulate venue rejection of orders. Not a single test in test_flaws.py sets any of them to True. The ORDER_REJECT handler in the Rust kernel's on_venue_event exists (lib.rs lines ~1440-1460) but is never exercised by any test.
Similarly, entry_partial_fill_ratio and exit_partial_fill_ratio exist on MockVenueScenario but only one test (test_cancel_entry_with_partial_fill) uses partial fills at all — and it only checks size > 0, not the full capital-accrual chain.
Severity: High
I6: No LIMIT order test through the full kernel path
The test suite has zero LIMIT orders. The Rust kernel has no reachable LIMIT-specific logic, so in practice all orders are MARKET. The generated live tests have limit_does_not_fill and limit_immediate_fill scenario placeholders, but:
- limit_does_not_fill uses reference_price=0.0 (not a real LIMIT order)
- limit_immediate_fill uses target_size=-0.001 (negative size → clamped to 0.0)
Neither scenario actually submits a LIMIT order with order_type="LIMIT" and a non-zero limit_price. The _legacy_intent bug (H7) would convert any LIMIT attempt to MARKET anyway.
The only LIMIT-related code is the Rust kernel's if intent.order_type == "LIMIT" branches (lib.rs:503, 1584), which are dead code in practice: KernelIntent doesn't have an order_type field that serde would populate, so the branches are never taken.
Severity: High
I7: Three weak/vacuous assertions in test_flaws.py
File: test_flaws.py
- Line 512: assert order.metadata.get("asset") is not None or order.metadata.get("slot_id") is not None. The mock venue always sets both, so this can never fail.
- Line 700: test_pnl_warning_on_unsettled_reentry is titled to assert a warning is raised but only checks r.accepted. It never checks diagnostic_code or verifies the warning was issued.
- Line 318: assert slot.active_entry_order is None or slot.active_entry_order.status == VenueOrderStatus.FILLED. The or allows two different scenarios to pass, reducing diagnostic power.
Severity: Low
I8: slot.size = fill_size entry overfill no guard
File: _rust_kernel/src/lib.rs:798
Already noted in I1 — entry fill sets slot.size directly to fill_size. Unlike exit fill which has (slot.size - fill_size).max(0.0), there's no guard against entry overfill (venue fills more than the intended order size). For MARKET orders this is fine (one fill per order), but for LIMIT orders with multiple partial fills, the accumulated fill could exceed initial_size.
Severity: Low (only relevant with LIMIT + partial fills, which don't exist in the codebase)
I9: No crash durability — slot state is pure in-memory until step 7 of process_intent
File: rust_backend.py:470-560
The process_intent sequence:
1. validate → 2. Rust FSM → 3. venue.submit() → 4. on_venue_event() → 5. projection → 6. zinc_plane
If the process crashes between steps 2-5, the slot state accumulated in the Rust kernel's in-memory KernelCore is completely lost. The Rust kernel has no WAL, no journal, no persistent store. On restart, ExecutionKernel.__init__ creates a fresh KernelCore with all slots IDLE.
The crash between step 3 and step 5 is the most dangerous: the exchange has an open order/position, but the kernel has no record of it. On restart:
- The Rust kernel sees slot.slot_id = IDLE
- The Zinc slot cache may or may not have the pre-crash state (depends on timing)
- No code on restart loads Zinc state back into the Rust kernel (I14)
- The exchange order lives until it fills (unexpected position) or is manually cancelled
Concrete example: venue.submit() sends POST to BingX, order placed. HTTP response arrives. on_venue_event(ORDER_ACK) transitions slot to ENTRY_WORKING. Crash between returning from on_venue_event and zinc_plane.write_slot(). On restart: slot is IDLE, no active entry order, _last_settled_pnl is reset. The exchange still has a live working entry order. The next process_intent(ENTER) is not rejected with SLOT_BUSY, because the fresh kernel has no record of that order: it sees the slot as IDLE and allows a new ENTER. The old order then fills on the exchange → double position.
Severity: Critical
I10: seen_event_ids lost on restart — events replayed after restart are double-processed
File: _rust_kernel/src/lib.rs:672-683
seen_event_ids is per-slot and per-KernelCore instance: purely in-process memory. On restart with a fresh KernelCore, every slot has seen_event_ids = Vec::new(). If events are replayed (from pump_venue_events() calling venue.reconcile(), which re-fetches exchange state):
- Original run: order fills → FULL_FILL with event_id = "EV-00000042" → processed, slot → POSITION_OPEN
- Crash
- Restart: fresh KernelCore, seen_event_ids empty
- pump_venue_events() fetches same exchange state → new VenueEvent objects with new event IDs (adapter's _event_seq resets)
- Rust kernel sees these as novel events and processes them again
- Position is double-booked, PnL double-settled
The bingx_venue._event_seq is an instance-level itertools.count() starting from 1. On adapter restart, it resets — so the new event IDs won't match the old ones anyway. Dedup is fundamentally impossible across restarts.
Severity: Critical
I11: No idempotency key (newClientOrderId) sent to BingX
File: bingx_venue.py:282-285, bingx_direct.py (external)
BingX supports newClientOrderId for order idempotency — sending the same ID twice returns the original order status instead of creating a duplicate. The DITAv2 kernel passes intent.intent_id as decision_id to the legacy adapter, but there's no guarantee this maps to newClientOrderId in the BingX payload.
If the HTTP POST to /trade/order times out before the response is read:
- The order was placed on the exchange
- _call_backend raises a BingxHttpError (or similar network exception)
- process_intent() propagates the exception; no retry
- Next cycle: caller may retry with a new intent_id
- Second POST creates a second order on the exchange: duplicate position
Without a client-order-id that persists across retries, the system can create duplicate orders on network timeouts. The exchange has no way to deduplicate.
Severity: High
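A sketch of the idea, assuming the adapter can pass a client-supplied id through to the exchange (the text above notes that this mapping is not guaranteed today). `client_order_id` and its `prefix` parameter are hypothetical. Whether this exact format satisfies BingX's length and character limits for `newClientOrderId` is not verified here.

```python
import hashlib


def client_order_id(intent_id, prefix="dita"):
    """Derive a stable id from the kernel intent id.

    The same intent_id always yields the same client id, so a retry of
    the same intent can be recognised as a duplicate by the exchange.
    """
    digest = hashlib.sha256(intent_id.encode("utf-8")).hexdigest()[:24]
    return f"{prefix}-{digest}"
```

The scheme only helps if a retry reuses the original `intent_id`. If the caller mints a new `intent_id` on the next cycle, as in the timeout scenario above, the derived id changes too, so the retry path would need to persist and reuse the first intent.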
I12: No graceful degradation for ANY subsystem
Every subsystem failure mode examined:
| Subsystem | Failure | Current behavior |
|---|---|---|
| Zinc SHM init | Corrupted region, OOM | Silent fallback to InMemoryZincPlane (no operator signal) |
| Zinc SHM write | Region overflow, write error | Unhandled exception → kernel crashes |
| Hazelcast write | Cluster unavailable | .put() raises → unhandled exception → kernel crashes |
| ClickHouse journal | Sink failure | Exception propagates (no try/except in callers) |
| BingX HTTP | Timeout, rate limit | Exception or REJECTED → slot stuck in ORDER_REQUESTED |
| Rust kernel | Null pointer from FFI | _take_string raises RuntimeError → kernel crash |
| Memory pressure | OOM | Process killed by the OS. No signal handlers are installed. |
No subsystem has a graceful degradation path. No circuit breaker, no retry queue, no fallback to log-only mode, no offline/cached trading mode. Every failure (except the two init-time silent fallbacks) crashes the current kernel operation.
Severity: High
I13: Stray venue event can reactivate a CLOSED slot — no guard
File: _rust_kernel/src/lib.rs:625+
The on_venue_event function has no guard for closed slots:
fn on_venue_event(&mut self, event: VenueEvent) -> KernelResult {
// ... resolve slot, check duplicates ...
// NO: if slot.closed { return ... }
let prev_state = slot.fsm_state.clone();
match event.kind {
SOME_EVENT_KIND => { /* transitions regardless of closed state */ }
}
}
If a stray venue event arrives for a CLOSED slot:
- ORDER_ACK → sets ENTRY_WORKING; slot re-opens from CLOSED
- FULL_FILL → apply_fill runs → slot.size = fill_size, fsm_state = POSITION_OPEN
- ORDER_REJECT → clears trade_id, asset, sets IDLE; actually benign reset
A CLOSED slot should be a terminal state that rejects all events. Currently only CANCEL_ACK is harmless on a closed slot; the rest can revive a dead position.
Severity: High
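The missing guard can be illustrated with a deliberately simplified Python state table. `Stage`, `TRANSITIONS`, and the `EVENT_ON_CLOSED_SLOT` diagnostic are hypothetical and much smaller than the kernel's real FSM; the only point shown is that a terminal state is checked before any transition is looked up.

```python
from enum import Enum


class Stage(Enum):
    IDLE = "IDLE"
    ENTRY_WORKING = "ENTRY_WORKING"
    POSITION_OPEN = "POSITION_OPEN"
    CLOSED = "CLOSED"


# Simplified transition table for illustration; not the kernel's full FSM.
TRANSITIONS = {
    "ORDER_ACK": Stage.ENTRY_WORKING,
    "FULL_FILL": Stage.POSITION_OPEN,
    "ORDER_REJECT": Stage.IDLE,
}


def on_venue_event(stage, kind):
    """Return (new_stage, diagnostic). CLOSED absorbs every event."""
    if stage is Stage.CLOSED:
        return stage, "EVENT_ON_CLOSED_SLOT"  # hypothetical diagnostic code
    return TRANSITIONS.get(kind, stage), "OK"
```

Returning a diagnostic instead of silently dropping the event keeps the stray event observable, which matters because a late fill on a closed slot may indicate a real position on the exchange.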
I14: No reconcile_from_slots call on startup — Zinc state never loaded into Rust kernel
Files: rust_backend.py:435-465 (init), real_zinc_plane.py:95-115 (init)
On restart:
1. RealZincPlane.__init__ reads state from Zinc shared memory into _slot_cache
2. ExecutionKernel.__init__ creates fresh KernelCore; all slots IDLE
3. KernelStateView(self) reads from the fresh kernel
4. account.observe_slots([self._get_slot(i) for i in range(max_slots)]); all slots IDLE

Steps 3 and 4 read from the Rust kernel, NOT from Zinc. The Zinc _slot_cache populated in step 1 is never loaded into the Rust kernel. The reconcile_on_restart flag exists in KernelControlSnapshot (default True) but is never checked anywhere in ExecutionKernel.__init__ or the launcher.
The system always starts with a blank state even when durable shared memory state exists.
Severity: High
I15: CANCEL_REJECT doesn't clear active_exit_order — slot stuck in EXIT_WORKING
File: _rust_kernel/src/lib.rs:1165-1175
KernelEventKind::CANCEL_REJECT => {
if slot.fsm_state == TradeStage::EXIT_WORKING {
// stays EXIT_WORKING — no state transition
// active_exit_order remains attached
}
diagnostic_code = KernelDiagnosticCode::CANCEL_REJECTED;
}
When the exchange rejects a cancel (typically because the order was already filled or no longer exists), the slot stays in EXIT_WORKING with active_exit_order still attached. Every subsequent CANCEL attempt hits the same path — the exchange returns "order not found," the kernel sees CANCEL_REJECT, and the slot is stuck forever.
If the order was already filled (CANCEL_REJECT means "can't cancel, no longer open"), the slot should check the actual position size and potentially transition to POSITION_OPEN or CLOSED depending on fill status.
Severity: Medium
I16: Zinc shared memory — world-readable/writable by same-machine processes
Files: real_control_plane.py, real_zinc_plane.py
The Zinc shared memory regions are created with these names:
self.region_name = f"{base}_intent" # e.g., "dita_v2_intent"
self.state_name = f"{base}_state" # "dita_v2_state"
self.control_name = f"{base}_control" # "dita_v2_control"
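Region names are predictable (prefix defaults to "dita_v2"). The SharedRegion uses POSIX shm_open; the effective permissions depend on the mode passed at creation and on the process umask (typically yielding 0644 or 0600). Under 0600 the region is open to any process running as the same user; under 0644 every local user can also read it. A process with access can: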
- Read: Open the region → as_buffer() → _decode_packet() → read all slot state, PnL, open orders, control settings
- Write: Open the region → forge a packet (struct.pack("!QQ", seq, len) + json_bytes) → overwrite slot state, inject fake intents, modify control plane
No access control, no encryption, no integrity check (HMAC/signature) on the wire format. The sequence number is the only ordering mechanism, and it's trivially predictable.
Severity: High
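An integrity check on the wire format could be sketched as below: the existing `!QQ` header and JSON payload are followed by an HMAC-SHA256 tag over both. `seal` and `open_sealed` are hypothetical names, and how the shared key is provisioned to the cooperating processes is left out of scope.

```python
import hashlib
import hmac
import json
import struct

HEADER = struct.Struct("!QQ")               # (sequence, payload size)
TAG_LEN = hashlib.sha256().digest_size      # 32 bytes


def seal(seq, obj, key):
    """Serialise obj and append an HMAC tag over header + payload."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    body = HEADER.pack(seq, len(payload)) + payload
    tag = hmac.new(key, body, hashlib.sha256).digest()
    return body + tag


def open_sealed(packet, key):
    """Return (seq, obj) if the tag verifies, else None."""
    if len(packet) < HEADER.size + TAG_LEN:
        return None
    seq, size = HEADER.unpack_from(packet, 0)
    end = HEADER.size + size
    if end + TAG_LEN > len(packet):
        return None
    body, tag = packet[:end], packet[end:end + TAG_LEN]
    expected = hmac.new(key, body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None  # forged or corrupted packet
    return seq, json.loads(body[HEADER.size:].decode("utf-8"))
```

A tag of this kind detects forged or corrupted packets but not replay of an older valid packet, so the reader would still need to reject sequence numbers that do not advance. Restrictive region permissions remain the first line of defence; the tag is an additional layer, not a substitute.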
I17: KernelSlotView exposes full slot state via unrestricted __getattr__/__setattr__
File: rust_backend.py:411-460
class KernelSlotView:
def __getattr__(self, name):
slot = self._snapshot()
return getattr(slot, name) # read ANY field
def __setattr__(self, name, value):
setattr(slot, name, value)
self._kernel._set_slot(slot) # write ANY field — bypasses FSM
Any code with a KernelSlotView reference can:
- Read all slot fields: trade_id, size, entry_price, unrealized_pnl, realized_pnl, seen_event_ids, metadata
- Write all slot fields: slot_view.realized_pnl = -9999999 directly manipulates PnL figures flowing into capital settlement
The _set_slot call writes through to the Rust kernel without any FSM validation. The entire kernel state is exposed through mutable Python objects with zero access control.
Severity: High
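One possible shape for a restricted view, shown as a sketch only: reads go through an allowlist and every write is refused. `ReadOnlySlotView`, its `READABLE` set, and the `snapshot_fn` callback are hypothetical; the real view may need a different field list, and some existing callers may depend on the current write-through behaviour.

```python
from types import SimpleNamespace


class ReadOnlySlotView:
    """Expose an allowlisted subset of slot fields; refuse all writes."""

    # Hypothetical allowlist; the real set of safe fields is an open question.
    READABLE = frozenset({"slot_id", "size", "entry_price", "fsm_state"})

    def __init__(self, snapshot_fn):
        # Bypass our own __setattr__ to store the snapshot callback.
        object.__setattr__(self, "_snapshot_fn", snapshot_fn)

    def __getattr__(self, name):
        # Only reached for names not found by normal attribute lookup.
        if name not in self.READABLE:
            raise AttributeError(name)
        return getattr(self._snapshot_fn(), name)

    def __setattr__(self, name, value):
        raise AttributeError(
            f"slot field {name!r} is read-only; submit an intent instead"
        )
```

With this shape, an assignment such as `view.realized_pnl = -9999999` raises `AttributeError` and the underlying slot is untouched, so state changes would have to flow through the FSM entry points.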
I18: sys.path.insert(0, ...) at import time in two production files and three test files
Files: real_control_plane.py:14, real_zinc_plane.py:22, test_flaws.py:13, _build_pink_bodies.py:2, _gen_test.py:3
# real_control_plane.py, real_zinc_plane.py — at MODULE LEVEL:
sys.path.insert(0, str(_ZINC_ADAPTER_PATH))
# test_flaws.py, _build_pink_bodies.py, _gen_test.py — at MODULE LEVEL:
sys.path.insert(0, '/mnt/dolphinng5_predict')
sys.path.insert(0, ...) gives the injected path highest import priority. An attacker with filesystem write access to the inserted path can create a malicious module that shadows a legitimate import (e.g., zinc.py, utils.py, typing.py). When any subsequent from X import Y runs, the attacker's module loads with the full privileges of the kernel process.
The production files use a relative path resolution (Path(__file__).resolve().parents[3] / "zinc" / "adapters" / "python"), while the test files use a hardcoded absolute path ('/mnt/dolphinng5_predict'). Both patterns are dangerous.
Severity: High
I19: pump_venue_events re-fetches exchange state that can produce phantom position events
File: bingx_venue.py:395-415
reconcile() calls _backend_snapshot(), which fetches current positions and open orders from the exchange. The _events_from_snapshot method diffs the current snapshot against the last-known snapshot to produce events:
def _events_from_snapshot(self, before, after):
for symbol, current_pos in after.open_positions.items():
prev_pos = before.open_positions.get(symbol)
if current_pos and (not prev_pos or abs(prev_pos.position_amount) < 1e-12):
# This looks like a new position — emit event
If before is stale (from _backend_snapshot timeout), the diff can produce spurious events. A position that existed before the crash is absent from the stale snapshot → the diff sees it as "new" → emits an entry fill event → Rust kernel processes it as a fresh enter → double position. This compounds with I10 (seen_event_ids lost on restart).
Severity: High
I20: exit_leg_ratios no guard against empty list — next_exit_ratio returns 1.0
File: contracts.py:196-198
def next_exit_ratio(self) -> float:
if self.active_leg_index < len(self.exit_leg_ratios):
return self.exit_leg_ratios[self.active_leg_index]
return 1.0
If exit_leg_ratios is empty (the dataclass default of (1.0,) normally prevents this, but it applies only when the slot is constructed through the dataclass), next_exit_ratio() returns 1.0. This is the same as "exit everything": consume_exit_leg then advances active_leg_index to min(1, 1) = 1, and all_legs_done = active_leg_index >= exit_leg_ratios.len() → 1 >= 0 = true → slot closes. The empty-ratios edge case is silently handled with unwrap_or(1.0), which happens to be correct but is undocumented.
Severity: Informational
I21: No test for rate-limited events — RATE_LIMITED kernel path is dead code
File: _rust_kernel/src/lib.rs (event handler), mock_venue.py (MockVenueScenario has no rate_limit flag)
The Rust kernel has a handler for KernelEventKind::RATE_LIMITED (lib.rs lines ~1480-1500). The event flows through the Python bridge's process_intent() rate-limit detection (rust_backend.py:585-593). But MockVenueScenario has no flag to emit rate-limited events. The only path to trigger RATE_LIMITED is from the real BingX adapter — which requires live exchange connectivity.
The entire RATE_LIMITED code path — in both Python and Rust — is untested in CI. Any bug in this path only surfaces in production under rate-limit conditions.
Severity: Medium
I22: Thread pool for _run — max_workers=3 shared across ALL adapter instances
File: bingx_venue.py:236-245
@classmethod
def _get_executor(cls):
if cls._EXECUTOR is None:
with cls._EXECUTOR_LOCK:
if cls._EXECUTOR is None:
cls._EXECUTOR = ThreadPoolExecutor(max_workers=3, ...)
return cls._EXECUTOR
Class-level singleton: all BingxVenueAdapter instances share the same 3-thread pool. With the runtime's step() calling submit() (1 thread) + _backend_snapshot (potentially another thread for open orders) + cancel() (1 thread in parallel), all 3 threads are consumed. A fourth concurrent call queues behind them and blocks its caller at .result() until a worker frees up; because .result() is called without a timeout, a hung backend call makes that wait unbounded and freezes the entire event loop.
The pool is never shut down. If a BingxVenueAdapter is destroyed, the threads remain running (zombie workers). No close()/disconnect() path shuts down the executor.
Severity: Medium
Pass 6 Summary
| # | Flaw | Layer | Severity |
|---|---|---|---|
| I1 | Entry apply_fill multiple partial fills overwrite size instead of accumulating | Rust | Critical |
| I2 | Zero exit_ratio creates zero-size exit order — slot stuck in EXIT_REQUESTED | Rust | Medium |
| I3 | entry_price inconsistency: Python falsy vs Rust <= 0.0 gate | Bridge | Info |
| I4 | Only 1 Rust unit test for 1765-line kernel — 99% untested at Rust layer | Rust | High |
| I5 | MockVenueScenario rejection flags exist but zero tests use them | Test | High |
| I6 | No LIMIT order test through full kernel path | Test | High |
| I7 | Three weak/vacuous assertions in test_flaws.py | Test | Low |
| I8 | Entry overfill no guard | Rust | Low |
| I9 | No crash durability — slot state pure in-memory until step 7 of process_intent | Bridge | Critical |
| I10 | seen_event_ids lost on restart — events double-processed | Rust | Critical |
| I11 | No idempotency key sent to BingX — lost response creates duplicate orders | Venue | High |
| I12 | No graceful degradation for ANY subsystem | All | High |
| I13 | Stray venue event can reactivate CLOSED slot — no guard | Rust | High |
| I14 | No reconcile_from_slots call on startup — Zinc state never loaded into kernel | Restart | High |
| I15 | CANCEL_REJECT doesn't clear active_exit_order — slot stuck in EXIT_WORKING | Rust | Medium |
| I16 | Zinc shared memory world-readable/writable by same-machine processes | Zinc | High |
| I17 | KernelSlotView unrestricted getattr/setattr — bypasses all FSM guards | Bridge | High |
| I18 | sys.path.insert(0) at import time in 2 production files and 3 test files; malicious module loading | Build | High |
| I19 | pump_venue_events stale snapshot diff produces phantom position events | Venue | High |
| I20 | exit_leg_ratios empty list — next_exit_ratio defaults to 1.0 (undocumented) | Contracts | Info |
| I21 | RATE_LIMITED code path in both Python and Rust is completely untested | All | Medium |
| I22 | Thread pool max_workers=3 shared across all adapter instances — never shut down | Venue | Medium |
Pass 6 Severity Distribution
| Severity | Count |
|---|---|
| Critical | 3 (I1, I9, I10) |
| High | 11 (I4, I5, I6, I11, I12, I13, I14, I16, I17, I18, I19) |
| Medium | 4 (I2, I15, I21, I22) |
| Low | 2 (I7, I8) |
| Info | 2 (I3, I20) |
Combined Catalog (All 6 Passes)
| Pass | Focus | Count | Critical | High | Medium | Low | Info |
|---|---|---|---|---|---|---|---|
| A | Architectural | 15 | 0 | 2 | 0 | 2 | 11 |
| T | Threading/Atomicity | 9 | 1 | 3 | 3 | 2 | 0 |
| E | E2E Trace (Pass 1) | 26 | 0 | 4 | 10 | 11 | 1 |
| F | Deep E2E (Pass 3) | 30 | 0 | 1 | 8 | 17 | 4 |
| G | Domain Scans (Pass 4) | 36 | 4 | 11 | 11 | 8 | 2 |
| H | Edge Domains (Pass 5) | 22 | 3 | 9 | 5 | 4 | 1 |
| I | Pass 6 (Math/Tests/Recovery/Security) | 22 | 3 | 11 | 4 | 2 | 2 |
| Total | | 160 | 11 | 41 | 41 | 46 | 21 |
PASS 7 — TEST INFRA, DATA FEED, RUST DEEPER, ENV PARSING, CONNECTIONS
J1: Test _flatten helper submits wrong direction for LONG positions
File: _build_pink_extended.py (patch), gen2.py:399-412, gen_live_tests.py:155-169
Every instance of _flatten in the codebase submits a SHORT exit regardless of the actual position direction:
def _flatten(k, symbol, price, label):
_exit(k, symbol, price, slot_id=0) # _exit creates a SHORT exit
# ... if still not free, tries LONG exit
_exit calls _si(k, EXIT, ..., "SHORT", ...). If the open position is LONG, this SHORT exit is actually an enter short — a new position opening, not a flatten. Only after the first attempt fails does it try a LONG exit. This can double the position instead of flattening it.
No test in the suite has ever hit this because no test before the _verify step has an open position with the wrong direction — but the code is fundamentally wrong: it assumes all positions are SHORT.
Severity: Medium
J2: Test _check_slot_accounting double-counts unrealized PnL
File: _build_pink_extended.py (patched into generated file)
total_rp = sum(k.slot(i).realized_pnl for i in range(k.max_slots))
total_up = sum(k.slot(i).unrealized_pnl for i in range(k.max_slots))
expected = start_cap + total_rp + total_up
actual = k.account.snapshot.capital
assert abs(actual - expected) < 0.01
The test asserts the identity capital = start_cap + Σrealized_pnl + Σunrealized_pnl. That identity only holds if the kernel's capital figure includes unrealized PnL, and it does not: account.snapshot.capital is updated by settle(), which adds realized PnL only, so capital = start_cap + Σrealized_pnl.
The test therefore adds unrealized PnL on top of a figure that never contained it: expected = actual + Σunrealized_pnl. Whenever Σunrealized_pnl differs from zero by more than the 0.01 tolerance, the assertion fails. It only passes when unrealized ≈ 0, i.e. for closed positions or when mark_price has not been called, which is always the case (J4).
The assertion would produce false failures for every test with an open position and a non-zero mark. The only reason it does not trigger is that mark_price is never called, so slot.unrealized_pnl is always 0. Silent near-miss.
Severity: Medium
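A worked version of the two quantities, under the analysis above that `settle()` adds realized PnL only. The function names are hypothetical; "equity" is used here as a label for capital plus unrealized PnL and is not a term taken from the codebase.

```python
def expected_capital(start_cap, realized_pnls):
    """Capital as settle() maintains it per the analysis: realized only."""
    return start_cap + sum(realized_pnls)


def expected_equity(start_cap, realized_pnls, unrealized_pnls):
    """Capital plus open-position PnL; a separate figure from capital."""
    return expected_capital(start_cap, realized_pnls) + sum(unrealized_pnls)


def capital_matches(actual_capital, start_cap, realized_pnls, tol=0.01):
    """The check the test presumably intends: compare capital to realized only."""
    return abs(actual_capital - expected_capital(start_cap, realized_pnls)) < tol
```

With a start of 1000, realized PnL of 5 and -2, and one open position marked at +3, capital is 1003 and equity is 1006. Comparing reported capital against 1006, as the current assertion effectively does, would fail by exactly the unrealized amount.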
J3: _build_live_snapshot uses time.time() (float) as timestamp — downstream expects datetime
File: gen_live_tests.py:81
def _build_live_snapshot(client, symbol, interval=None):
# ...
return MarketSnapshot(
timestamp=time.time(), # ← float (Unix epoch seconds)
...
)
By contrast, gen2.py:352 and _gen_test.py:138 correctly use datetime.now(timezone.utc) (a timezone-aware datetime). If any downstream code calls .isoformat() or .strftime() on the snapshot's timestamp, it crashes with AttributeError: 'float' object has no attribute 'isoformat'.
This function is used by the newer _run harness in the generated live-test file. Whether the crash manifests depends on what MarketSnapshot and PinkDirectRuntime.step() do with the timestamp field.
Severity: High
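One way to remove the type mismatch is to normalize at the boundary. The helper below is a hypothetical sketch (to_utc_datetime is not a name from the codebase); it treats floats as Unix epoch seconds and naive datetimes as UTC, both of which are assumptions.

```python
from datetime import datetime, timezone


def to_utc_datetime(ts):
    """Coerce a float epoch (seconds) or a datetime into an aware UTC datetime."""
    if isinstance(ts, datetime):
        # Assumption: a naive datetime is already expressed in UTC.
        return ts if ts.tzinfo is not None else ts.replace(tzinfo=timezone.utc)
    return datetime.fromtimestamp(float(ts), tz=timezone.utc)
```

Calling this once where the snapshot is built means downstream code can rely on .isoformat() regardless of which generator produced the snapshot.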
J4: ExecutionKernel.mark_price() exists but is never called — no periodic mark-to-market
File: rust_backend.py:667-672
def mark_price(self, asset: str, price: float) -> None:
for slot in self.state.slots:
if slot.asset == asset and slot.is_open():
slot.mark_price(price)
self.account.observe_slots(...)
This method exists on ExecutionKernel but has zero callers in the entire codebase. Unrealized PnL is never updated outside of process_intent and on_venue_event (which only compute realized PnL). The slot.unrealized_pnl field stays at its initial value (0) unless mark_price is called externally.
The AccountProjection.observe_slots() (account.py:53-66) reads slot.unrealized_pnl and reports it — but since nothing ever updates it, unrealized PnL is always 0 in the account snapshot.
This means the unrealized PnL figure reported by kernel.snapshot()["account"]["unrealized_pnl"] is always zero for open positions: the system has no live mark-to-market.
Severity: High
J5: All VenueEvent timestamps use local machine clock, not exchange timestamp
File: bingx_venue.py (7 locations)
Every VenueEvent constructed in the venue adapter uses the local machine's clock:
VenueEvent(
timestamp=datetime.now(timezone.utc), # local clock, not exchange
...
)
This includes:
- _events_from_submit() (lines 370, 390), with a getattr(receipt, "timestamp", ...) fallback that still uses the local clock
- _events_from_cancel() (lines 455, 480)
- _event_from_row() (line 546)
- _fill_event_from_row() (line 570)
The exchange's HTTP response includes timestamps (transactTime, updateTime) that are authoritative. These are available in the raw response dict (stored in raw_payload) but are never extracted as the event timestamp. Clock skew between the local machine and the exchange is invisible — event timestamps may be ahead of or behind exchange time.
Severity: Medium
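A hedged sketch of preferring the exchange timestamp over the local clock. The field names transactTime and updateTime come from the text above; that they carry epoch milliseconds is an assumption, and event_timestamp is a hypothetical helper.

```python
from datetime import datetime, timezone


def event_timestamp(raw_payload, fields=("transactTime", "updateTime")):
    """Return (timestamp, source) where source is "exchange" or "local".

    Assumes the exchange fields, when present, are epoch milliseconds.
    """
    for name in fields:
        try:
            ms = int(raw_payload.get(name))
        except (TypeError, ValueError):
            continue  # field missing or not numeric; try the next one
        if ms > 0:
            return datetime.fromtimestamp(ms / 1000.0, tz=timezone.utc), "exchange"
    # Fallback keeps today's behaviour but makes the clock source explicit.
    return datetime.now(timezone.utc), "local"
```

Returning the source alongside the timestamp lets a consumer record which clock produced each event, so skew becomes visible instead of silent.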
J6: No monotonic timestamp verification anywhere in the system
No code path in the entire codebase checks whether a new timestamp is >= the previous one for the same asset/slot:
- process_intent(): no comparison between intent timestamp and slot's last_event_time
- on_venue_event(): no check that event timestamp >= previous events
- TradeSlot.last_event_time is stored but never validated for monotonicity
- VenueEvent timestamps from pump_venue_events() are never compared with event history
With NTP clock adjustments, daylight saving time changes, or VM clock drift, timestamps can go backwards. The system has no detection or guard.
Severity: Low
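A minimal sketch of the missing guard, assuming timestamps are comparable datetimes keyed per asset or slot. MonotonicGuard is a hypothetical class; where it would be called from (process_intent, on_venue_event) is left open.

```python
from datetime import datetime, timedelta, timezone


class MonotonicGuard:
    """Track the latest timestamp per key and flag regressions."""

    def __init__(self):
        self._last = {}

    def observe(self, key, ts):
        """Return True if ts is >= the last timestamp seen for key."""
        prev = self._last.get(key)
        if prev is not None and ts < prev:
            return False  # regression: caller decides whether to log or reject
        self._last[key] = ts
        return True
```

The guard only detects a backwards step; whether to drop, log, or reorder the offending event is a policy decision outside this sketch.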
J7: rebuild_indexes() silently overwrites duplicate trade_id — last slot wins, first becomes invisible
File: _rust_kernel/src/lib.rs:571-596
fn rebuild_indexes(&mut self) {
for slot in &self.slots {
if !slot.trade_id.is_empty() {
self.active_trade_index.insert(slot.trade_id.clone(), slot.slot_id);
// ↑ HashMap::insert overwrites — no duplicate check
}
}
}
If two slots happen to have the same trade_id (not prevented by any invariant check), the index maps to the last slot with that trade_id. The first slot becomes invisible to resolve_slot()'s trade_id-based fallback. Any venue event for that trade_id with an unspecified or negative slot_id always resolves to the last slot.
The process_intent ENTER handler checks slot.trade_id != intent.trade_id to prevent overwriting a different trade on the same slot — but there's no global uniqueness check across all slots.
Severity: High
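The duplicate check that the Rust loop lacks can be illustrated in Python. This is a sketch of the logic only, not a patch to lib.rs; rebuild_trade_index is a hypothetical name, and the first-wins policy is one possible choice.

```python
def rebuild_trade_index(slots):
    """Build trade_id -> slot_id, reporting duplicates instead of overwriting.

    slots is an iterable of (slot_id, trade_id) pairs. The first slot keeps
    the index entry; later slots with the same trade_id are collected in
    the duplicates mapping so the caller can raise or reconcile.
    """
    index, duplicates = {}, {}
    for slot_id, trade_id in slots:
        if not trade_id:
            continue
        if trade_id in index:
            duplicates.setdefault(trade_id, [index[trade_id]]).append(slot_id)
            continue
        index[trade_id] = slot_id
    return index, duplicates
```

Whichever slot wins, surfacing the duplicate set turns a silent overwrite into an observable invariant violation.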
J8: resolve_slot() falls back to slot 0 when all indexes miss — stray event corrupts slot 0
File: _rust_kernel/src/lib.rs:606-622
fn resolve_slot(&self, event: &VenueEvent) -> usize {
// ... try by slot_id, trade_id, venue_order_id, client_order_id ...
self.slots.first().map(|slot| slot.slot_id).unwrap_or(0)
}
When a venue event has:
- slot_id = -1 (negative, so it can't be used as usize)
- Empty trade_id (trade not found on new kernel after restart)
- Empty venue_order_id and venue_client_id
...the event is routed to slot 0 regardless of which slot it was intended for. If slot 0 is in the middle of a trade, the stray event (e.g., a stale ORDER_ACK from a pre-crash order) overwrites slot 0's state. Combined with I10 (seen_event_ids lost on restart), this is a concrete crash-recovery failure path.
Severity: High
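An alternative to the slot-0 fallback is to return "no match" and let the caller quarantine the event. The Python sketch below illustrates that shape under assumed inputs (a dict event and two dict indexes); it is not the Rust implementation.

```python
from typing import Optional


def resolve_slot(event, trade_index, order_index, slot_count) -> Optional[int]:
    """Resolve an event to a slot id, or None when nothing matches."""
    slot_id = event.get("slot_id", -1)
    if isinstance(slot_id, int) and 0 <= slot_id < slot_count:
        return slot_id
    for key, index in (("trade_id", trade_index), ("venue_order_id", order_index)):
        value = event.get(key) or ""
        if value and value in index:
            return index[value]
    return None  # unresolved: do not guess slot 0
```

With a None result the caller can reject the event with a diagnostic instead of mutating an unrelated slot.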
J9: dita_kernel_get_slot_json and dita_kernel_snapshot_json return null with no diagnostic
File: _rust_kernel/src/lib.rs (FFI exports)
The intent/event processing paths (process_intent_json, on_venue_event_json) have two layers of error handling — parse errors produce a structured invalid_intent_cstring() diagnostic JSON, and serialization errors also produce diagnostics.
But the slot/snapshot read functions return bare null pointers:
// dita_kernel_get_slot_json (line 1608):
Err(_) => ptr::null_mut() // ← no diagnostic
// dita_kernel_snapshot_json (line 1765):
Err(_) => ptr::null_mut() // ← no diagnostic
The Python caller (_RustKernelLib.get_slot_json, line 164) checks if not raw: raise IndexError(...) — so null is caught, but the IndexError provides no detail about why it failed. If snapshot() returns null (serialization failure with f64 NaN/Inf in some slot), the Python code gets a bare IndexError or RuntimeError with no diagnostic.
Severity: Medium
J10: Two processes with same DITA_V2_PREFIX corrupt shared Zinc memory
File: real_zinc_plane.py:79-82, launcher.py:302
# launcher.py:
resolved_prefix = (prefix or os.environ.get("DITA_V2_PREFIX", "dita_v2")).strip() or "dita_v2"
# real_zinc_plane.py:
self.intent_name = f"{base}_intent" # e.g., "dita_v2_intent"
self.state_name = f"{base}_state" # "dita_v2_state"
self.control_name = f"{base}_control" # "dita_v2_control"
Two processes on the same machine with the same prefix will:
- Attach to the same named shared memory regions
- Overwrite each other's slot state, intents, and control settings
- Race on concurrent writes — last writer wins with no coordination
- One process's create=True conflicts with another's, so SharedRegion.create() may fail
There is no prefix uniqueness validation, no PID suffix, no UUID, no lock file, no access control. The prefix defaults to "dita_v2" — trivially guessable.
Severity: High
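One possible mitigation is an exclusive lock file per prefix. The sketch below uses only the standard library; claim_prefix and the lock directory are hypothetical, and stale-lock recovery after a crash is deliberately not handled.

```python
import os
import tempfile


def claim_prefix(prefix, lock_dir):
    """Atomically claim a prefix; return the lock path, or None if taken."""
    path = os.path.join(lock_dir, f"{prefix}.lock")
    try:
        # O_EXCL makes creation fail if another process already holds the lock.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return None
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))  # record the owner for diagnostics
    return path


with tempfile.TemporaryDirectory() as lock_dir:
    first = claim_prefix("dita_v2", lock_dir)
    second = claim_prefix("dita_v2", lock_dir)
```

In the usage above the second claim is refused, which is the signal a launcher could use to abort before attaching to shared memory.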
J11: load_dotenv() only runs when launcher.py is imported — env vars unset for other module paths
File: launcher.py:49-51, control.py:205, projection.py:71
# launcher.py (at module level):
load_dotenv(PROJECT_ROOT / ".env") # only runs on `import launcher`
# control.py (at function call time):
raw = os.environ.get("DITA_V2_CONTROL_PLANE") # reads env var
# projection.py (at function call time):
raw = os.environ.get("DITA_V2_HAZELCAST") # reads env var
If any code imports from .control import build_control_plane directly (without first importing launcher.py), load_dotenv() has not run. The .env file is never loaded. Env vars that should have been set from .env are absent.
This creates an ordering dependency: module import order determines whether config files are loaded. Different import paths can produce different runtime behavior.
Severity: Medium
J12: BINGX_API_KEY/BINGX_SECRET_KEY passed as None with no validation — fails at HTTP time
File: launcher.py:195-196
api_key=os.environ.get("BINGX_API_KEY"), # None if unset
secret_key=os.environ.get("BINGX_SECRET_KEY"), # None if unset
When keys are unset, None is passed to BingxExecClientConfig and then to the HTTP client. No validation occurs at config/build time. The system:
- Successfully builds a full DITAv2LauncherBundle with empty keys
- Creates an ExecutionKernel
- The first trade's venue.submit() call sends an HTTP request to BingX with empty auth
- BingX returns 401 with a cryptic "signature verification failed" error
This is a late failure — the operator has no indication of misconfiguration until the first trade attempt. Fast failure at launcher time would catch this.
Also: gen_live_tests.py:116-117 and gen2.py:320 use bracket access os.environ["BINGX_API_KEY"] which crashes with KeyError if the var is missing — an inconsistent pattern (crash immediately vs fail at HTTP time).
Severity: Medium
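A fail-fast check at launcher time could look like the sketch below. require_env and MissingCredentialError are hypothetical names; the environ parameter exists only to make the helper testable.

```python
import os


class MissingCredentialError(RuntimeError):
    """Raised at build time when a required credential is absent."""


def require_env(name, environ=None):
    """Return a non-empty env var value or raise with the variable name."""
    env = os.environ if environ is None else environ
    value = (env.get(name) or "").strip()
    if not value:
        # The message names the variable but never echoes any value.
        raise MissingCredentialError(f"{name} is unset or empty")
    return value
```

Using one helper in both the launcher and the generators would also remove the inconsistency between .get() and bracket access.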
J13: API credentials never masked in error messages or tracebacks
File: launcher.py:195-196, bingx_venue.py (through config object)
Credentials flow through:
- os.environ.get("BINGX_API_KEY") → BingxExecClientConfig(api_key=...)
- BingxExecClientConfig → BingxDirectExecutionAdapter.__init__(config)
- Config object stored as Python attribute, accessible via repr(), str(), error tracebacks
No code masks, redacts, truncates, or otherwise protects the API key or secret key. If an exception propagates and the traceback includes the config object (through local variables, frame inspection, or exception chaining), the credentials are exposed in logs.
The generated live-test code also embeds credentials literally:
client = BingxHttpClient(api_key="<ACTUAL_KEY>", secret="<ACTUAL_SECRET>", ...)
When test files are checked into version control (even temporarily), credentials are at risk.
Severity: High
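If the config is (or can become) a dataclass, excluding secrets from repr() is a small change. ExecClientConfig below is a hypothetical stand-in for BingxExecClientConfig, whose real definition is not shown in this document.

```python
from dataclasses import dataclass, field


@dataclass
class ExecClientConfig:
    """Hypothetical config whose secrets are kept out of repr()/str()."""
    base_url: str
    api_key: str = field(repr=False)
    secret_key: str = field(repr=False)


cfg = ExecClientConfig("https://example.invalid", api_key="k123", secret_key="s456")
shown = repr(cfg)
```

This only covers repr() and str(); dataclasses.asdict(cfg) and direct attribute access still expose the values, so log formatting and serialization paths need their own redaction.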
J14: _env_bool treats empty-string var as False while unset returns default — inconsistent
File: launcher.py:84-88
def _env_bool(name: str, default: bool = False) -> bool:
raw = os.environ.get(name)
if raw is None:
return default # unset → uses caller's default
return str(raw).strip().lower() in {"1", "true", "yes", "on"}
# empty/whitespace → "" → False
| Env Var State | _env_bool(name, True) returns |
|---|---|
| Unset (key absent) | True (caller's default) |
| Set to "" (empty) | False (empty not in truthy set) |
| Set to " " (whitespace) | False |
Setting DITA_V2_DEBUG_CLICKHOUSE="" (intending "don't set, use default") actually forces it to False, overriding the default. And setting DITA_V2_DEBUG_CLICKHOUSE=" " (whitespace accidentally) does the same. The operator would need to know that empty and whitespace are treated as explicit falsy values, not as "unset."
Severity: Low
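A variant that treats empty and whitespace-only values as unset is sketched below. This is a behaviour change rather than a description of the current code; env_bool, the falsy set, and the ValueError on unknown tokens are all assumptions.

```python
import os

_TRUTHY = {"1", "true", "yes", "on"}
_FALSY = {"0", "false", "no", "off"}


def env_bool(name, default=False, environ=None):
    """Parse a boolean env var; empty or whitespace falls back to default."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None or not raw.strip():
        return default
    value = raw.strip().lower()
    if value in _TRUTHY:
        return True
    if value in _FALSY:
        return False
    raise ValueError(f"{name}: unrecognised boolean value {raw!r}")
```

Rejecting unknown tokens is stricter than the original, which maps anything outside the truthy set to False.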
J15: gen2.py and _gen_test.py both write to the same output file — last writer wins
Files: gen2.py and _gen_test.py
Both generators write to:
OUTPUT = "/mnt/dolphinng5_predict/prod/tests/test_pink_bingx_dita_live_e2e.py"
_gen_test.py is more complete (includes _inspect_outcome, _assert_accepted, _check_slot_accounting, _build_fresh_kernel_from_slot helpers). gen2.py is simpler (no helpers). The last file to execute determines what the test file contains.
If gen2.py runs last, the helpers from _build_pink_extended.py and _build_pink_bodies.py are lost — their patches to the generated file become stale updates to a now-overwritten file. The _check_slot_accounting assertions in 14 body functions silently become dead code.
Severity: Medium
J16: Shim test bridge has no step(), decision_engine, intent_engine — zero fidelity to production runtime
File: (generated in _build_rb sections across all test generators)
class Shim:
def __init__(self, k): self.kernel = k
async def connect(self, ic=0): self.kernel.venue.connect()
async def disconnect(self):
try: self.kernel.venue.disconnect()
except: pass
The Shim provides none of PinkDirectRuntime's capabilities:
- No step() method: tests call k.process_intent() directly
- No data_feed: tests must provide prices manually
- No decision_engine: tests construct intents manually
- No intent_engine: no intent sizing/validation
- No lifecycle beyond connect/disconnect
The test suite effectively tests ExecutionKernel in isolation, not the full runtime pipeline. Any bug in the decision→intent→kernel→fill→persist chain that passes through step() is invisible to these tests.
Severity: High
Pass 7 Summary
| # | Flaw | Layer | Severity |
|---|---|---|---|
| J1 | _flatten submits wrong direction for LONG positions | Test | Medium |
| J2 | _check_slot_accounting double-counts unrealized PnL | Test | Medium |
| J3 | _build_live_snapshot timestamp is float vs datetime: type crash risk | Data Feed | High |
| J4 | ExecutionKernel.mark_price() never called: no mark-to-market | Bridge | High |
| J5 | All VenueEvent timestamps use local clock, not exchange timestamp | Venue | Medium |
| J6 | No monotonic timestamp verification anywhere | All | Low |
| J7 | rebuild_indexes() overwrites duplicate trade_id: last wins, first invisible | Rust | High |
| J8 | resolve_slot() falls back to slot 0: stray event corrupts slot 0 | Rust | High |
| J9 | get_slot_json/snapshot_json return null with no diagnostic | Rust | Medium |
| J10 | Two processes with same DITA_V2_PREFIX corrupt shared Zinc memory | Zinc | High |
| J11 | load_dotenv() only runs on launcher.py import: ordering dependency | Config | Medium |
| J12 | BINGX_API_KEY passed None with no validation: fails at HTTP time | Config | Medium |
| J13 | API credentials never masked in error messages or tracebacks | Config | High |
| J14 | _env_bool inconsistent: empty string = False vs unset = default | Config | Low |
| J15 | gen2.py and _gen_test.py write to same output: last writer wins | Test | Medium |
| J16 | Shim test bridge lacks step(), decision_engine: zero runtime fidelity | Test | High |
Pass 7 Severity
| Severity | Count |
|---|---|
| High | 7 (J3, J4, J7, J8, J10, J13, J16) |
| Medium | 7 (J1, J2, J5, J9, J11, J12, J15) |
| Low | 2 (J6, J14) |
Combined Catalog (All 7 Passes)
| Pass | Focus | Count | Critical | High | Medium | Low | Info |
|---|---|---|---|---|---|---|---|
| A | Architectural | 15 | 0 | 2 | 0 | 2 | 11 |
| T | Threading/Atomicity | 9 | 1 | 3 | 3 | 2 | 0 |
| E | E2E Trace (Pass 1) | 26 | 0 | 4 | 10 | 11 | 1 |
| F | Deep E2E (Pass 3) | 30 | 0 | 1 | 8 | 17 | 4 |
| G | Domain Scans (Pass 4) | 36 | 4 | 11 | 11 | 8 | 2 |
| H | Edge Domains (Pass 5) | 22 | 3 | 9 | 5 | 4 | 1 |
| I | Pass 6 (Math/Tests/Recovery/Security) | 22 | 3 | 11 | 4 | 2 | 2 |
| J | Pass 7 (Test Infra/Data/Rust/Env/Conn) | 16 | 0 | 7 | 7 | 2 | 0 |
| Total | | 176 | 11 | 48 | 48 | 48 | 21 |
PASS 8 — OBSERVABILITY, MEMORY, TIME, DEAD CODE, MODULE INIT
K1: Zero stdout/stderr output — system is completely silent
No production code path emits any stdout or stderr output. Zero print(), zero logging output, zero warnings.warn(). The system runs with zero operator-visible evidence of being alive. If Hazelcast and ClickHouse are both disabled, the system is a black box — no logs, no metrics, no health checks, no output of any kind.
logging is imported in exactly one file (bingx_user_stream.py) with no root logger configuration anywhere. Even those logging calls produce no output without logging.basicConfig().
Severity: Critical
K2: No health check endpoint, no metrics, no monitoring surface
There are zero:
- HTTP health check endpoints (/health, /ready)
- Prometheus metrics endpoints
- Statsd/Graphite reporters
- Periodic heartbeats
- Liveness/readiness probes
- Process manager integration (no systemd unit, no supervisor config, no container healthcheck)
The only monitoring surface is programmatic — calling kernel.snapshot() from Python code with access to the same ExecutionKernel instance. For cross-process monitoring, the operator must write custom code to read Zinc shared memory regions and parse the undocumented JSON packets.
Severity: Critical
K3: Failed trades produce no notification — error exists only in return value
process_intent() returns KernelOutcome(accepted=False, diagnostic_code=..., details=...) but:
- No log line is written for the failure
- No stdout/stderr output
- The failure is not persisted to any durable store (unless debug_clickhouse is enabled and sink is configured)
- If the caller (strategy/algo) doesn't inspect the return value, the failure is completely invisible
- There is no alert mechanism, no error counter, no dead-letter queue
Severity: High
K4: Exception tracebacks not captured in production — all except: blocks swallow silently
Every except Exception: pass and except Exception: continue in the codebase discards the full Python traceback. There is no logging infrastructure to capture it. When an exception occurs:
- launcher.py:187: RealZincPlane init failure → traceback lost
- rust_backend.py:102: __del__ exception → traceback lost
- bingx_venue.py:51: _row_float conversion failure → traceback lost
- bingx_venue.py:325: slot lookup failure → traceback lost
- bingx_venue.py:350: cancel HTTP error → traceback lost
- control.py:213: control plane fallback → traceback lost
- All real_control_plane.py try/except blocks → traceback lost
The only exception information that survives is the final exception message in BingxHttpError (converted to a dict) and Rust kernel diagnostic codes (structured JSON). Full Python tracebacks are invisible.
Severity: High
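The smallest change that preserves a traceback is to log with exc_info inside the handler. The sketch below reworks a conversion helper in the spirit of _row_float; the logger name and the row_float signature are assumptions, not the code in bingx_venue.py.

```python
import logging

logger = logging.getLogger("dita_v2.sketch")  # hypothetical logger name


def row_float(row, key, default=0.0):
    """Convert row[key] to float; on failure log the traceback and return default."""
    try:
        return float(row[key])
    except (KeyError, TypeError, ValueError):
        # exc_info=True attaches the active exception and its traceback.
        logger.warning("row_float(%r) failed; using default %r",
                       key, default, exc_info=True)
        return default
```

Output still depends on a handler being configured somewhere (see K1); without one, only records at WARNING and above reach stderr through logging's last-resort handler.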
K5: ~85+ Python objects allocated per process_intent() call — 36 TradeSlot copies via JSON round-trip
Every _get_slot() call does a full JSON serialization (Rust) → C FFI → JSON parse (Python) → new TradeSlot dataclass. A single ENTER intent with 2 venue events results in approximately:
- 36 TradeSlot instances from repeated _get_slot() calls (state refresh, observe_slots, projection writes)
- 4 VenueEvent instances
- 3 KernelOutcome instances
- ~30 dicts for serialization payloads
- ~4 KernelTransition instances
No caching exists — every _get_slot() call goes through the full FFI round-trip. Multiple calls within the same process_intent() invocation fetch the same slot data multiple times.
Severity: Medium
K6: Circular reference cycle Kernel → StateView → SlotView → Kernel — prevents refcount GC
# KernelStateView and KernelSlotView both hold strong references:
self._kernel = kernel # strong reference
This forms ExecutionKernel → state → slots[]._kernel → ExecutionKernel. Python's refcounting cannot free this cycle — it depends on the generational GC. The __del__ method on ExecutionKernel (which destroys the Rust KernelHandle) fires at an unpredictable time, potentially long after the last explicit reference to the kernel is dropped.
Severity: High
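Holding the back-reference weakly is one way to break the cycle. The classes below are simplified, hypothetical stand-ins for ExecutionKernel and KernelSlotView; the immediate release shown relies on CPython reference counting.

```python
import weakref


class SlotView:
    """View that refers back to its kernel through a weak reference."""

    def __init__(self, kernel, slot_id):
        self._kernel_ref = weakref.ref(kernel)
        self.slot_id = slot_id

    @property
    def kernel(self):
        kernel = self._kernel_ref()
        if kernel is None:
            raise ReferenceError("kernel has been released")
        return kernel


class Kernel:
    """Owns its views strongly; views do not keep the kernel alive."""

    def __init__(self, slot_count=2):
        self.views = [SlotView(self, i) for i in range(slot_count)]
```

With no strong cycle, dropping the last external reference frees the kernel at once, so a finalizer that destroys the native handle runs at a predictable point. An explicit close() method would be more deterministic still.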
K7: MemoryKernelJournal silently drops transitions after 10,000 rows — no warning, no rollover
def record(self, row):
if len(self.rows) < self.capture_limit: # capture_limit = 10,000
self.rows.append(dict(row))
# else: silently no-op — every subsequent transition is lost
After 10,000 transitions, record() becomes a no-op. No error, no warning, no FIFO eviction, no rollover to disk. In a production system with 10+ transitions per trade and 100+ trades/day, the journal dies in ~10 days. At that point, all field debugging/troubleshooting capability is silently lost.
Each row holds a full slot.to_dict() (~1 KB) plus event/control payloads. The 10,000 rows retain ~10-15 MB permanently.
Severity: High
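A bounded journal that keeps the most recent rows and counts what it evicts can be sketched with collections.deque. RingJournal is a hypothetical replacement, not the existing MemoryKernelJournal; the FIFO eviction policy is an assumption about what operators would want.

```python
from collections import deque


class RingJournal:
    """Keep the newest capture_limit rows and count evictions."""

    def __init__(self, capture_limit=10_000):
        self.rows = deque(maxlen=capture_limit)
        self.dropped = 0  # number of rows evicted or refused so far

    def record(self, row):
        if len(self.rows) == self.rows.maxlen:
            self.dropped += 1  # the append below evicts the oldest row
        self.rows.append(dict(row))
```

Memory stays bounded as before, but the journal now retains the latest transitions, and the dropped counter gives monitoring something to alert on.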
K8: RealZincPlane._intent_cache Python list unbounded — only shared memory write is bounded
# real_zinc_plane.py:189-191
self._intent_cache.append(row)
self._write_region(self.intent_region, self._intent_seq, {"items": self._intent_cache[-512:]})
The shared memory write limits to the last 512 entries. But the Python _intent_cache list grows unbounded — every intent ever published remains in memory forever. After 1M intents: ~1M dict objects, ~500 MB+ of Python memory.
Note: InMemoryZincPlane.intent_region has the same unbounded growth (already documented as F12).
Severity: High
K9: _backend_snapshot timeout uses wall-clock threading.Event.wait() — NTP can truncate/extend
def _backend_snapshot(self, *, timeout_ms=5000.0):
if not self._snapshot_ready.wait(timeout=timeout_ms / 1000.0): # wall clock!
return self._last_snapshot # stale data
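threading.Event.wait(timeout) is timed against the system wall clock on CPython releases before 3.11 on Linux; 3.11 and later use the monotonic clock only where the C library provides sem_clockwait() (glibc 2.30 and newer). On an affected runtime, if NTP adjusts the clock: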
- Forward (e.g., +2 seconds): the timeout is truncated to ~3 seconds — spurious timeout, stale snapshot returned
- Backward (e.g., -2 seconds): the timeout extends to ~7 seconds — caller blocks longer than expected
The correct pattern is time.monotonic() with a deadline loop — which InMemoryControlPlane.wait() already uses correctly. The _backend_snapshot timeout is the single highest-impact site because it controls whether the venue adapter returns fresh or stale exchange state.
Severity: High
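The deadline-loop pattern referred to above can be sketched as follows. wait_with_deadline and the poll interval are assumptions; the real InMemoryControlPlane.wait() is not reproduced here.

```python
import threading
import time


def wait_with_deadline(event, timeout_s, poll_s=0.05):
    """Wait for event using a time.monotonic() deadline.

    Each inner wait is at most poll_s long, so a wall-clock step can
    distort one poll interval at most, never the overall timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return event.is_set()
        if event.wait(min(poll_s, remaining)):
            return True
```

The cost is a wake-up every poll_s seconds while waiting; a larger interval trades responsiveness to clock steps for fewer wake-ups.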
K10: RealZincControlPlane.wait() uses wall clock — no monotonic guarantee
# real_control_plane.py:126-130
def wait(self, timeout_ms=1000):
try:
return bool(self.region.wait(timeout_ms)) # wall clock
except Exception:
return False
The SharedRegion.wait() implementation (external) uses wall clock. Same NTP sensitivity as K9, though lower impact (controls shared memory synchronization, not exchange data freshness).
Severity: Medium
K11: exchange_ts falls back to local time.time() when exchange timestamp E is missing
# bingx_user_stream.py:278
ts = int(frame.get("E") or time.time() * 1000) # local clock fallback
When the exchange's WebSocket frame lacks the E (event time) field, the code substitutes the local machine's wall clock. Two problems:
- Local clock may differ from exchange clock by seconds or minutes (VM drift)
- time.time() is wall-clock and subject to NTP backward jumps
Events that lack E will have timestamps from a different clock source than events that have E. This creates ordering paradoxes in any downstream consumer that sorts by timestamp.
Severity: Medium
K12: No monotonic timestamp verification anywhere in the system
Zero code paths check whether timestamps progress forward:
- process_intent(): no comparison between intent timestamp and slot's last_event_time
- on_venue_event(): no check that event timestamp >= previous events
- AccountProjectionV2._build(): no monotonicity check on ReconcileResult.ts
- Rust kernel: last_event_time = Some(event.timestamp) stored but never validated
NTP backward jumps, clock skew, or VM migration all can produce decreasing timestamps. The system has no detection, no guard, no warning log.
Severity: Medium
K13: ControlPlane.wait() and notify() have zero callers across all implementations — dead protocol surface
The ControlPlane protocol defines wait(timeout_ms=1000) and notify(). Both are implemented by InMemoryControlPlane, ZincControlPlane, and RealZincControlPlane. But zero callers exist in production code:
- ExecutionKernel never calls self.control_plane.wait() or .notify()
- launcher.py never calls them
- No test exercises them
Combined with the protocol methods having real implementations (with monotonic clock logic in InMemoryControlPlane), this is ~40 lines of dead-but-maintained code.
Similarly: all 7 ZincPlane wait/notify methods (wait_on_intent, notify_intent, wait_on_state, notify_state, wait_on_control, notify_control, read_slots) have zero callers — dead protocol surface.
Severity: Informational
K14: AccountProjection.to_account_event() has zero callers
# account.py:86
def to_account_event(self, metadata=None):
...
Defined, never called anywhere in production code or tests. Dead code.
Severity: Informational
K15: HazelcastProjector entire class dead — zero callers
# hazelcast_projection.py:18-48
class HazelcastProjector:
def publish_slot(self, slot): ...
def publish_event(self, event_type, payload): ...
Both methods have zero callers anywhere in the codebase. The class can never be constructed from any production code path. The actively-used projection class is HazelcastRowWriter.
Severity: Informational
K16: _order_to_payload() dead code
# rust_backend.py:220
def _order_to_payload(order):
...
Defined, never called. Serializing a VenueOrder to dict is done inline in TradeSlot.to_dict() (contracts.py:127-134), not via this function.
Severity: Informational
K17: MirroredControlPlane entire class dead — never constructed
# control.py:171-184
class MirroredControlPlane:
def __init__(self, inner, mirror_sink=None): ...
build_control_plane() never returns a MirroredControlPlane. The class can only be constructed if someone explicitly instantiates it — no code path does. Similarly, KernelJournal protocol is never used as a type annotation outside journal.py.
Severity: Informational
K18: 12 of 20 TradeStage variants never matched in Rust FSM logic
Defined in the Rust string_enum! but never matched in any process_intent or on_venue_event arm:
DECISION_CREATED, INTENT_CREATED, ORDER_SENT, ORDER_ACKED, ORDER_REJECTED, POSITION_OPENED, EXIT_SENT, EXIT_ACKED, EXIT_REJECTED, POSITION_PARTIALLY_CLOSED, POSITION_CLOSED, TRADE_TERMINAL_WRITTEN
Only 8 variants are used in FSM logic: IDLE, ORDER_REQUESTED, ENTRY_WORKING, POSITION_OPEN, EXIT_REQUESTED, EXIT_WORKING, CLOSED, STALE_STATE_RECONCILING. The other 12 are serialization-only: they exist in the enum but the kernel never transitions a slot to them.
Severity: Low
K19: Unused imports in projection.py and hazelcast_projection.py
projection.py imports AccountProjection, TradeStage, datetime, Iterable, List — none used.
hazelcast_projection.py imports KernelTransition, TradeSlot, KernelControlSnapshot, _transition_row — none used.
These are carryovers from earlier code versions. They add no runtime cost (import is cached after first load) but indicate stale code structure.
Severity: Informational
K20: sys.path mutation on import — importing the package appends Zinc path globally
Both real_control_plane.py:13-15 and real_zinc_plane.py:22-24 do:
if _ZINC_ADAPTER_PATH.exists() and str(_ZINC_ADAPTER_PATH) not in sys.path:
sys.path.append(str(_ZINC_ADAPTER_PATH))
This fires at module import time as a side effect of importing __init__.py (through the chain: __init__ → launcher → real_control_plane/real_zinc_plane). It modifies the process-global sys.path, which persists for the entire process lifetime. If the Zinc adapter path shadows or conflicts with other modules, the consequences are global and hard to debug.
Severity: Medium
K21: load_dotenv() runs at module import time — mutates os.environ as side effect
# launcher.py:49-51 (at module level)
PROJECT_ROOT = Path(__file__).resolve().parents[3]
load_dotenv(PROJECT_ROOT / ".env")
This fires on import launcher (which happens via __init__.py). Mutates os.environ process-globally. Tests that need to set specific env vars must import launcher first to get .env loaded, then override — or the .env values win. Also: if .env doesn't exist, load_dotenv() silently does nothing, and the import dependency shifts — importing the package may or may not load .env depending on filesystem state.
Severity: Medium
K22: ControlPlane protocol not in __init__.py.__all__
# __init__.py (__all__)
"ControlPlane" not in __all__ # ← hidden from star imports
from prod.clean_arch.dita_v2 import * exports 44 names but does NOT include ControlPlane (the main interface type). Concrete implementations (InMemoryControlPlane, RealZincControlPlane, etc.) are all exported. The protocol class itself is hidden.
Severity: Informational
K23: KernelSlotView.__getattr__ makes a ctypes call per attribute access — no caching
# rust_backend.py:422-426
def __getattr__(self, name):
slot = self._snapshot() # FFI round-trip every time
if hasattr(slot, name):
return getattr(slot, name)
raise AttributeError(name)
Every attribute access on a KernelSlotView (e.g., slot.size, slot.fsm_state, slot.trade_id) does a full JSON round-trip to the Rust kernel. The _snapshot() method calls self._kernel._get_slot(self._slot_id) which calls _get_rust().get_slot_json() → Rust serializes slot to JSON → Python parses → creates new TradeSlot → attribute is read from the new object.
Accessing 5 fields on a KernelSlotView does 5 FFI round-trips. There is no caching of the deserialized TradeSlot between accesses.
Severity: Medium
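The cost difference between per-attribute fetches and a single snapshot can be shown with a counting stub. CountingKernel, LazySlotView, and frozen() are hypothetical; get_slot stands in for the JSON/FFI round-trip.

```python
from types import SimpleNamespace


class CountingKernel:
    """Stub kernel that counts simulated FFI round-trips."""

    def __init__(self):
        self.ffi_calls = 0

    def get_slot(self, slot_id):
        self.ffi_calls += 1
        return SimpleNamespace(slot_id=slot_id, size=1.5,
                               fsm_state="POSITION_OPEN", trade_id="t1")


class LazySlotView:
    """Mirrors the per-attribute behaviour, plus a one-shot snapshot."""

    def __init__(self, kernel, slot_id):
        self._kernel = kernel
        self._slot_id = slot_id

    def __getattr__(self, name):
        # Only reached for names not set on the instance: one fetch per access.
        return getattr(self._kernel.get_slot(self._slot_id), name)

    def frozen(self):
        """Fetch once and return a plain object for repeated field reads."""
        return self._kernel.get_slot(self._slot_id)
```

Reading three fields through the view costs three fetches, while reading them from frozen() costs one. The snapshot is a point-in-time copy, so callers that need fresh state after a mutation must fetch again.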
Pass 8 Summary
| # | Flaw | Layer | Severity |
|---|---|---|---|
| K1 | Zero stdout/stderr — system completely silent | All | Critical |
| K2 | No health check, metrics, or monitoring surface | All | Critical |
| K3 | Failed trades produce no notification — error in return value only | Bridge | High |
| K4 | Exception tracebacks not captured — all except:pass swallow silently | All | High |
| K5 | ~85+ Python objects per process_intent — 36 TradeSlot copies via FFI | Bridge | Medium |
| K6 | Circular ref cycle Kernel→StateView→SlotView→Kernel: delays __del__ | Bridge | High |
| K7 | MemoryKernelJournal silently drops transitions after 10K rows | Journal | High |
| K8 | RealZincPlane._intent_cache unbounded Python list growth | Zinc | High |
| K9 | _backend_snapshot timeout uses wall clock — NTP truncates/extends | Venue | High |
| K10 | RealZincControlPlane.wait() uses wall clock — no monotonic | Control | Medium |
| K11 | exchange_ts fallback to local time.time() when E missing | Stream | Medium |
| K12 | No monotonic timestamp verification anywhere | All | Medium |
| K13 | ControlPlane.wait()/notify() — zero callers across all impls | Control | Info |
| K14 | AccountProjection.to_account_event() — zero callers | Account | Info |
| K15 | HazelcastProjector entire class dead | Projection | Info |
| K16 | _order_to_payload() dead code | Bridge | Info |
| K17 | MirroredControlPlane entire class dead — never constructed | Control | Info |
| K18 | 12 of 20 TradeStage variants never matched in Rust FSM | Rust | Low |
| K19 | Unused imports in projection.py and hazelcast_projection.py | Projection | Info |
| K20 | sys.path mutation on import — global side effect | Config | Medium |
| K21 | load_dotenv() at module import time — mutates os.environ globally | Config | Medium |
| K22 | ControlPlane protocol not exported in __all__ | Config | Info |
| K23 | KernelSlotView.__getattr__ makes FFI call per attribute access | Bridge | Medium |
Pass 8 Severity
| Severity | Count |
|---|---|
| Critical | 2 (K1, K2) |
| High | 6 (K3, K4, K6, K7, K8, K9) |
| Medium | 7 (K5, K10, K11, K12, K20, K21, K23) |
| Low | 1 (K18) |
| Info | 7 (K13, K14, K15, K16, K17, K19, K22) |
Combined Catalog (All 8 Passes)
| Pass | Focus | Count | Critical | High | Medium | Low | Info |
|---|---|---|---|---|---|---|---|
| A | Architectural | 15 | 0 | 2 | 0 | 2 | 11 |
| T | Threading/Atomicity | 9 | 1 | 3 | 3 | 2 | 0 |
| E | E2E Trace (Pass 1) | 26 | 0 | 4 | 10 | 11 | 1 |
| F | Deep E2E (Pass 3) | 30 | 0 | 1 | 8 | 17 | 4 |
| G | Domain Scans (Pass 4) | 36 | 4 | 11 | 11 | 8 | 2 |
| H | Edge Domains (Pass 5) | 22 | 3 | 9 | 5 | 4 | 1 |
| I | Pass 6 (Math/Tests/Recovery/Security) | 22 | 3 | 11 | 4 | 2 | 2 |
| J | Pass 7 (Test Infra/Data/Rust/Env/Conn) | 16 | 0 | 7 | 7 | 2 | 0 |
| K | Pass 8 (Observability/Memory/Time/DeadCode) | 23 | 2 | 6 | 7 | 1 | 7 |
| Total | | 199 | 13 | 54 | 55 | 49 | 28 |
PASS 9 — CONTRACTS, EXCHANGE EVENTS, NETWORK, FFI, BACKUP DIFFS
L1: KernelOutcome(accepted=True, diagnostic_code=INVALID_INTENT) is parseable — no invariant check
File: rust_backend.py:388-402
def _outcome_from_payload(payload):
return KernelOutcome(
accepted=bool(payload.get("accepted", False)),
diagnostic_code=KernelDiagnosticCode(str(payload.get("diagnostic_code", "OK"))),
)
No validation that accepted=True implies diagnostic_code=OK, or that accepted=False implies a non-OK diagnostic code. If the Rust kernel ever returns contradictory values (e.g., {"accepted": true, "diagnostic_code": "INVALID_INTENT"}), Python silently accepts them. The default for both KernelOutcome.diagnostic_code and _outcome_from_payload fallback is OK — an accepted=False with no explicit diagnostic_code would silently show OK.
Similarly, KernelTransition has no FSM validation — any (prev_state, next_state) pair is accepted, even impossible transitions like IDLE → POSITION_CLOSED.
Severity: Medium
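The missing invariant is a one-line consistency check. validate_outcome below is a hypothetical helper operating on the raw payload dict; it assumes "OK" is the only diagnostic code compatible with accepted=True, as the text above implies.

```python
def validate_outcome(payload):
    """Return (accepted, code) or raise if the pair is contradictory.

    Requires both keys explicitly, so a missing diagnostic_code can no
    longer default to "OK" on a rejected outcome.
    """
    if "accepted" not in payload or "diagnostic_code" not in payload:
        raise ValueError("outcome payload is missing accepted or diagnostic_code")
    accepted = bool(payload["accepted"])
    code = str(payload["diagnostic_code"])
    if accepted != (code == "OK"):
        raise ValueError(f"contradictory outcome: accepted={accepted}, code={code}")
    return accepted, code
```

Running this before constructing KernelOutcome would turn a silent contradiction into an immediate, attributable error at the FFI boundary.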
L2: VenueEvent.filled_size > VenueEvent.size possible — _fill_event_from_row uses different source fields
File: bingx_venue.py:530-531
size=abs(_row_float(row, "executedQty", "z", "lastFilledQty", default=0.0)),
filled_size=abs(_row_float(row, "lastFilledQty", "l", "z", default=0.0)),
size comes from executedQty (cumulative) while filled_size comes from lastFilledQty (incremental). If lastFilledQty > executedQty (exchange-side rounding, partial fill of a partially-cancelled order), filled_size > size. The Rust kernel's apply_fill uses event.filled_size for PnL and position adjustment — an oversized fill could over-count position reduction.
Also: VenueOrder.filled_size > intended_size possible via _venue_order_from_row() (line 157-163) when the exchange reports executedQty > origQty.
Severity: Medium
L3: VenueEvent.price=0 can reach the kernel from multiple paths
File: bingx_venue.py:495 (via _row_float default 0.0), mock_venue.py:180 (via 0.0 when reference_price=0), rust_backend.py:411 (via outcome default 0.0)
The Rust kernel's realized_pnl() guards against entry_price <= 0.0 and exit_size <= 0.0, but exit_price=0 in a fill event produces delta = (0 - entry) / entry = -1.0. For LONG: PnL = -1.0 * notional → -100% of position. A zero-price fill event would register as a total loss.
The mark_price() function guards against price <= 0, so unrealized PnL is safe. But realized PnL from a zero-price fill is not guarded.
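The arithmetic can be checked with a small Python model of the formula described above. This is not the Rust source: realized_pnl_model, realized_pnl_guarded, the side encoding (+1 LONG, -1 SHORT), and notional = entry_price * exit_size are assumptions made for illustration.

```python
def realized_pnl_model(entry_price: float, exit_price: float,
                       exit_size: float, side: int) -> float:
    """Model of the described formula; side is +1 for LONG, -1 for SHORT."""
    if entry_price <= 0.0 or exit_size <= 0.0:
        return 0.0  # the two guards the text says already exist
    delta = (exit_price - entry_price) / entry_price
    notional = entry_price * exit_size  # assumed definition of notional
    return side * delta * notional


def realized_pnl_guarded(entry_price: float, exit_price: float,
                         exit_size: float, side: int) -> float:
    # Hypothetical extra guard: reject a non-positive exit price outright
    # instead of booking it as a real fill.
    if exit_price <= 0.0:
        raise ValueError(f"non-positive exit price: {exit_price}")
    return realized_pnl_model(entry_price, exit_price, exit_size, side)
```

With entry_price=100, exit_size=2 and exit_price=0, the model books a loss equal to the full notional for a LONG and an equally spurious gain for a SHORT, which is why the guard belongs on the fill path and not only on mark_price().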
Severity: High
L4: BingxUserStream — available_margin set to cw (cross wallet balance) instead of crossWalletBalance - usedMargin
File: bingx_user_stream.py:336
available_margin=cw # cw = cross wallet balance, NOT available margin
In BingX's ACCOUNT_UPDATE frame, "cw" is the cross wallet balance (total equity), not the available margin. Available margin = crossWalletBalance - usedMargin. The ExchangeEvent.available_margin field receives the wrong value. This flows into the dual-ledger accounting's EBlock.available_margin — if used for reconcile rules, the exchange-side available_margin is overstated.
Severity: High
L5: BingxUserStream — wallet_balance silently defaults to 0 when "wb" is absent
File: bingx_user_stream.py:334
wallet = _safe_float(usdt_bal.get("wb") or usdt_bal.get("walletBalance"))
If neither "wb" nor "walletBalance" exists in the USDT balance object (possible for some account types or frame formats), the expression evaluates to _safe_float(None), which returns 0.0. The exchange wallet balance is silently zeroed, so the E-side of the dual-ledger reconciliation sees wallet_balance=0 when the actual balance is positive. Every such frame then produces an ERROR reconcile status (R1: capital >> 0 vs wallet=0).
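One way to keep "missing" distinct from "zero" is to return None when neither key is present. The sketch below assumes the two key names quoted above; wallet_balance_or_none is a hypothetical helper, not the stream's existing _safe_float.

```python
from typing import Any, Mapping, Optional


def wallet_balance_or_none(usdt_bal: Mapping[str, Any]) -> Optional[float]:
    # Check key presence explicitly so that a real 0 balance and a
    # missing field produce different results.
    for key in ("wb", "walletBalance"):
        raw = usdt_bal.get(key)
        if raw is None or raw == "":
            continue
        try:
            return float(raw)
        except (TypeError, ValueError):
            return None  # unparseable: treat as unknown, not as zero
    return None
```

A caller that receives None can skip the reconcile step for that frame, or flag it, instead of comparing kernel capital against a fabricated zero.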
Severity: High
L6: BingxUserStream — _keepalive_loop has no stop mechanism — runs forever on old listen key after rotation
File: bingx_user_stream.py:394-405
async def _keepalive_loop(self, listen_key):
    while True:
        await asyncio.sleep(self._keepalive_secs)
        await self._http.signed_put_raw(...)
The keepalive loop is an asyncio.Task with no stop signal. When the 24h rotation creates a new listen key, the old keepalive task keeps sending PUT requests to the old (now-deleted) listen key indefinitely. BingX returns errors for keepalive on deleted keys — these errors are suppressed by with suppress(Exception) in the delete path but NOT in the keepalive path. The keepalive loop's errors are unhandled.
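A stoppable variant can be sketched with an asyncio.Event that the rotation code sets before it creates the new key. The function below is an illustration under that assumption; send_keepalive stands in for the signed PUT call, and none of these names are taken from bingx_user_stream.py.

```python
import asyncio
from typing import Awaitable, Callable


async def keepalive_loop(
    send_keepalive: Callable[[], Awaitable[None]],
    interval_secs: float,
    stop: asyncio.Event,
) -> int:
    """Send keepalives until `stop` is set; return how many were sent."""
    sent = 0
    while not stop.is_set():
        try:
            # Sleep for one interval, but wake immediately if stop is set.
            await asyncio.wait_for(stop.wait(), timeout=interval_secs)
            break
        except asyncio.TimeoutError:
            pass
        try:
            await send_keepalive()
            sent += 1
        except Exception as exc:
            # Report the failure instead of letting the task die silently.
            print(f"keepalive failed: {exc!r}")
    return sent
```

On rotation the owner would set the old loop's event, await its task, and only then start a new loop for the new listen key, so at most one keepalive task exists per key.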
Severity: Medium
L7: BingxUserStream event_id from frame.get("i") can be integer 0, which is falsy in the or chain, so a random UUID is generated
File: bingx_user_stream.py:283
event_id = str(frame.get("i") or frame.get("event_id") or uuid.uuid4().hex)
If frame.get("i") returns integer 0 (a valid event ID in some BingX frames), the or chain skips it because 0 is falsy. str() is applied only to the result of the whole chain, so the value falls through to frame.get("event_id") and then to uuid.uuid4().hex, losing the real event ID. Event dedup downstream sees a random UUID instead of the exchange's ID.
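The fall-through is easy to reproduce, and an explicit None check avoids it. pick_event_id is a hypothetical replacement; pick_event_id_buggy repeats the quoted expression for comparison.

```python
import uuid
from typing import Any, Mapping


def pick_event_id(frame: Mapping[str, Any]) -> str:
    # Test for None (and empty string) explicitly, so a legitimate
    # integer 0 is kept instead of being skipped as falsy.
    for key in ("i", "event_id"):
        value = frame.get(key)
        if value is not None and value != "":
            return str(value)
    return uuid.uuid4().hex


def pick_event_id_buggy(frame: Mapping[str, Any]) -> str:
    # The pattern quoted above: 0 is falsy, so the real ID is discarded.
    return str(frame.get("i") or frame.get("event_id") or uuid.uuid4().hex)
```

For the frame {"i": 0}, the first function returns "0" while the second returns a fresh 32-character hex string on every call.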
Severity: Medium
L8: BingX test URLs hardcoded in test generators — wrong environment if system targets LIVE
Files: gen_live_tests.py:70,77, gen2.py:135
"https://open-api-vst.bingx.com/openApi/swap/v2/user/positions"
"https://open-api-vst.bingx.com/openApi/swap/v2/quote/price"
Hardcoded vst (testnet) URLs. The production launcher.py path selects VST vs LIVE via BingxEnvironment and DOLPHIN_BINGX_ENV, but the test generators hardcode VST. If the system is configured for LIVE and these tests run, they hit the wrong exchange environment.
Severity: Medium
L9: No proxy support — cannot be deployed behind corporate proxy
No code parses HTTP_PROXY, HTTPS_PROXY, SOCKS_PROXY or passes proxy configuration to aiohttp.TCPConnector or ClientSession. The aiohttp.ClientSession in bingx_user_stream.py is created without any proxy parameter. Deployment behind a corporate proxy or SOCKS proxy requires code changes.
Severity: Low (deployment constraint, not a correctness bug)
L10: 5-minute DNS cache TTL in WebSocket adapter — stale IPs on infrastructure change
File: bingx_user_stream.py:425
aiohttp.TCPConnector(limit=4, ttl_dns_cache=300) # 300 seconds = 5 minutes
If BingX changes server IPs during an infrastructure migration or failover, aiohttp keeps serving cached DNS results for up to 5 minutes (ttl_dns_cache=300). The connector is recreated on each WS reconnect, so a reconnect starts with an empty cache and resolves fresh addresses. If the system does not reconnect and keeps the existing WS alive, a DNS change goes unnoticed: the established connection never re-resolves, and any new connection made through the same connector uses the cached address for up to 5 minutes.
Severity: Low
L11: getattr(intent, "limit_price", 0.0) reads the dataclass field, not the metadata dict, so a metadata-only limit_price is lost
File: bingx_venue.py:267
metadata["_limit_price"] = float(getattr(intent, "limit_price", 0.0) or 0.0)
limit_price is a real field on KernelIntent (contracts.py:257, default 0.0). If the policy layer sets intent.limit_price = 10.50, getattr returns 10.50 and the line stores 10.50, so the expression is correct for callers that set the field. The trailing or 0.0 only matters if the field is None: None or 0.0 evaluates to 0.0 before float() is called.
The gap is that _legacy_intent (identical to H7) never consults intent.metadata.get("limit_price"). If any caller passes limit_price only through the metadata dict, the field keeps its 0.0 default and the limit price is lost.
Severity: Low
L12: Backup diff — Rust kernel added 428 lines including entire dual-ledger accounting, 14+ bug fixes
Comparing the backup rust_kernel_src/lib.rs (1614 lines) against current _rust_kernel/src/lib.rs (2042 lines) reveals:
Bugs fixed between backup and current:
- CANCEL now works on entry orders (backup only checked exit orders)
- Partial fills now accumulate (backup overwrote filled_size)
- Stale venue events on closed slots now rejected (TERMINAL_STATE guard, I13 fix)
- CANCEL_ACK properly resets entry orders to IDLE
- EXIT transition captures actual prev_state instead of hardcoded POSITION_OPEN
- into_c_string sanitizes NUL bytes instead of panicking (G2 fix)
- Null-string FFI returns diagnostic JSON instead of null pointer
- invalid_intent_cstring() helper returns structured diagnostics
- Reconcile validates slot invariants before applying
New Rust features:
- AccountState dual-ledger struct with K-value vs E-fact reconcile rules
- on_account_event() FFI for account-level events
- set_seed_capital() FFI
- INVALID_INTENT diagnostic code
Critical finding: The backup still has the entry-fill size overwrite bug (I1), the backward EXIT prev_state bug (G3), and the CANCEL-only-exit-order bug (G10). These were all fixed in the current code. The backup represents a pre-fix state that would double-settle PnL on partial fills.
Severity: Informational
L13: _build_full_runtime in gen_live_tests.py is never called — dead code
File: gen_live_tests.py:148-161
def _build_full_runtime(initial_capital):
    # Creates HazelcastDataFeed, DecisionEngine, IntentEngine, PinkDirectRuntime
    # ... but never called by any test function
This function wires the full production pipeline: HazelcastDataFeed + PinkDirectRuntime + DecisionEngine + IntentEngine. But every test function calls _build_runtime_bundle() instead, which returns a _RuntimeShim with zero fidelity (J16). The real PinkDirectRuntime — with step(), data feed, decision engine, intent engine — is never instantiated in any test.
Also: hz_client=build_projection(...) passes a HazelcastProjection (write-side wrapper) where a Hazelcast client object should go — type mismatch.
Severity: High
L14: BingxUserStream — listenKeyExpired raises RuntimeError instead of clean return — triggers full reconnect
File: bingx_user_stream.py:273
if frame.get("e") == "listenKeyExpired":
    raise RuntimeError("listenKeyExpired")
When the exchange sends listenKeyExpired, the code raises RuntimeError inside the _consume() async generator. This propagates to the outer subscribe() loop's try/except, which treats it as a connection failure — delays, creates a new listen key, reconnects. The proper behavior is to yield an ExchangeEvent(kind=RECONNECTED) and return cleanly, letting the caller handle the rotation without backoff delay.
Severity: Medium
L15: BingxUserStream — _delete_listen_key suppresses all exceptions — leaked keys on auth failures
File: bingx_user_stream.py:413-416
async def _delete_listen_key(self, listen_key):
    with suppress(Exception):
        await self._http.signed_delete_raw(...)
If the DELETE call fails (invalid signature, expired key, network error), the exception is swallowed. The old listen key remains active on BingX, wasting server resources. Over days of operation with unhandled auth failures, leaked listen keys accumulate server-side.
Severity: Low
L16: Backup diff — venue_order_id propagation logic has ambiguous target selection
File: _rust_kernel/src/lib.rs:1110-1125 (current code)
if !event.venue_order_id.is_empty() {
    let target = if slot.active_entry_order.is_some() {
        slot.active_entry_order.as_mut()
    } else {
        slot.active_exit_order.as_mut()
    };
If an entry order exists (even if fully filled and the slot is in POSITION_OPEN), ANY incoming event's venue_order_id propagates to the entry order — even if the event is for the exit order. The active_entry_order status might be FILLED but it's still Some(...), so the exit event's ID goes to the wrong order.
Severity: Medium
Pass 9 Summary
| # | Flaw | Layer | Severity |
|---|---|---|---|
| L1 | KernelOutcome(accepted=True, diag=INVALID_INTENT) parseable; no invariant check | Bridge | Medium |
| L2 | VenueEvent.filled_size > size possible via different source fields | Venue | Medium |
| L3 | VenueEvent.price=0 reaches kernel; zero-price fill = 100% loss PnL | Venue | High |
| L4 | available_margin set to cross-wallet balance, not available margin | Stream | High |
| L5 | wallet_balance defaults to 0 when "wb" absent; E-side reconcile always ERROR | Stream | High |
| L6 | _keepalive_loop has no stop mechanism; runs on old key after rotation | Stream | Medium |
| L7 | event_id integer 0 is falsy in the or chain; random UUID generated | Stream | Medium |
| L8 | Hardcoded VST URLs in test generators; wrong env if LIVE configured | Test | Medium |
| L9 | No proxy support; can't deploy behind corporate proxy | Network | Low |
| L10 | 5-minute DNS cache TTL; stale IPs on infrastructure change | Network | Low |
| L11 | limit_price getattr reads dataclass field, not metadata dict | Venue | Low |
| L12 | Backup diff: 14+ critical bugs fixed, 428-line dual-ledger accounting added | Rust | Info |
| L13 | _build_full_runtime dead; real pipeline never tested | Test | High |
| L14 | listenKeyExpired raises RuntimeError instead of clean yield | Stream | Medium |
| L15 | _delete_listen_key suppresses all exceptions; leaked server keys | Stream | Low |
| L16 | venue_order_id target selection ambiguous when entry order exists | Rust | Medium |
Pass 9 Severity
| Severity | Count |
|---|---|
| High | 4 (L3, L4, L5, L13) |
| Medium | 7 (L1, L2, L6, L7, L8, L14, L16) |
| Low | 4 (L9, L10, L11, L15) |
| Info | 1 (L12) |
Combined Catalog (All 9 Passes)
| Pass | Focus | Count | Critical | High | Medium | Low | Info |
|---|---|---|---|---|---|---|---|
| A | Architectural | 15 | 0 | 2 | 0 | 2 | 11 |
| T | Threading/Atomicity | 9 | 1 | 3 | 3 | 2 | 0 |
| E | E2E Trace (Pass 1) | 26 | 0 | 4 | 10 | 11 | 1 |
| F | Deep E2E (Pass 3) | 30 | 0 | 1 | 8 | 17 | 4 |
| G | Domain Scans (Pass 4) | 36 | 4 | 11 | 11 | 8 | 2 |
| H | Edge Domains (Pass 5) | 22 | 3 | 9 | 5 | 4 | 1 |
| I | Pass 6 (Math/Tests/Recovery/Security) | 22 | 3 | 11 | 4 | 2 | 2 |
| J | Pass 7 (Test Infra/Data/Rust/Env/Conn) | 16 | 0 | 7 | 7 | 2 | 0 |
| K | Pass 8 (Observability/Memory/Time/DeadCode) | 23 | 2 | 7 | 7 | 1 | 6 |
| L | Pass 9 (Contracts/Events/Network/FFI/Diffs) | 16 | 0 | 4 | 7 | 4 | 1 |
| Total | | 215 | 13 | 59 | 62 | 53 | 28 |
PASS 10 — RUNTIME, TEST BUGS, FSM AUDIT, PERSISTENCE, MEASUREMENT
M1: ENTER transition hardcodes prev_state = IDLE — every non-IDLE entry corrupts the audit trail
File: _rust_kernel/src/lib.rs:1117
let transition = self.transition(
    &slot,
    TradeStage::IDLE, // HARDCODED: misreports the actual prev_state
    slot.fsm_state.clone(),
    "ENTER_INTENT",
);
When a slot is entered from CLOSED (re-entry) or from any other state that passed the is_free() or same-trade bypass, the transition record claims prev_state = IDLE. This is always wrong unless the slot was genuinely IDLE. Every ENTER transition in the journal for a re-entered slot or a slot coming from CLOSED records an impossible transition (CLOSED → ORDER_REQUESTED recorded as IDLE → ORDER_REQUESTED).
This corrupts any downstream FSM analysis, journal audit, or trade-lifecycle reconstruction that relies on accurate prev_state values.
Severity: Critical
M2: CANCEL intent creates no transition record — invisible in audit log
File: _rust_kernel/src/lib.rs:1287-1305
The CANCEL branch in process_intent returns a KernelResult with no call to self.transition(). Every other intent (ENTER, EXIT, MARK_PRICE, RECONCILE) records a transition. CANCEL operations — including accepted cancels — are invisible in the transition audit log.
Additionally, CANCEL returns accepted = true but never mutates the slot's fsm_state. The slot stays in whatever state it was in. The caller sees accepted = true with no visible effect.
Severity: Critical
M3: _mk_intent test helper drops order_type/limit_price into metadata instead of setting proper fields
File: test_flaws.py:43
def _mk_intent(action, trade_id="t1", size=0.001, price=100.0, slot_id=0, **kw):
    return KernelIntent(
        ...
        metadata=kw,  # order_type="LIMIT" goes into metadata dict, not the dataclass field!
    )
KernelIntent has dedicated fields order_type: str = "MARKET" and limit_price: float = 0.0 (contracts.py:274-275), but _mk_intent passes **kw as metadata=kw. So _mk_intent(order_type="LIMIT") produces intent.order_type == "MARKET" (the default) while intent.metadata["order_type"] == "LIMIT".
The Flaw 6 tests in test_flaws.py that verify order_type/limit_price preservation through _legacy_intent pass for the wrong reason — they check legacy.metadata.get("order_type") which finds the value in the passthrough metadata, not because _legacy_intent correctly reads intent.order_type. If the production code changes and the test helper isn't fixed, the tests silently become false positives.
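A helper that routes keyword arguments to real fields first would make these tests exercise the field path. The sketch below uses KernelIntentStub, a cut-down stand-in for KernelIntent with only the fields named above, and mk_intent as a hypothetical replacement for _mk_intent.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict


@dataclass
class KernelIntentStub:
    # Stand-in for KernelIntent; the real class has more fields.
    action: str
    trade_id: str = "t1"
    order_type: str = "MARKET"
    limit_price: float = 0.0
    metadata: Dict[str, Any] = field(default_factory=dict)


def mk_intent(action: str, trade_id: str = "t1", **kw: Any) -> KernelIntentStub:
    # Keyword arguments that name a dataclass field go to that field;
    # only the leftovers end up in metadata.
    field_names = {f.name for f in fields(KernelIntentStub)} - {"metadata"}
    direct = {k: v for k, v in kw.items() if k in field_names}
    extra = {k: v for k, v in kw.items() if k not in field_names}
    return KernelIntentStub(action=action, trade_id=trade_id, metadata=extra, **direct)
```

With this shape, mk_intent("ENTER", order_type="LIMIT") sets intent.order_type and leaves metadata empty, so an assertion on the legacy metadata can only pass if _legacy_intent actually reads the field.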
Severity: High
M4: test_cancel_entry_with_partial_fill never sends a CANCEL — misnamed vacuous test
File: test_flaws.py:161-172
def test_cancel_entry_with_partial_fill(self):
    k = _fresh_kernel(scenario=MockVenueScenario(partial_fill_ratio=0.5))
    k.process_intent(_mk_intent(action=E.ENTER, trade_id="ce4", size=0.002))
    slot_after = k._get_slot(0)
    assert slot_after.size > 0, "Should have partial fill"
Named "Cancel entry with partial fill," belongs to TestFlaw1EntryCancel — but no CANCEL intent is ever sent. It only verifies that a partial fill occurred. The test is completely vacuous for its stated purpose.
The same pattern affects Flaw 9 tests — test_cancel_uses_slot_asset_not_trade_id and test_mock_venue_cancel_event_has_asset both have "cancel" in their names but never call any cancel function.
Severity: High
M5: Flaw 7 tests (test_entry_exit_different_ratios, test_per_action_type_ratios) never send EXIT
File: test_flaws.py Flaw 7 test class
Both tests set exit_partial_fill_ratio on the mock venue scenario but only ever process an ENTER intent. The exit_partial_fill_ratio is configured but never exercised. The tests verify entry partial fill behavior only — they don't test what their titles and class name claim.
Severity: Medium
M6: test_dedup_window_accepts_many_events uses wrong constant: actual window is 256, flaw claims 64, and only 70 events are sent
File: test_flaws.py:536-555
The Flaw 10 tests reference a 64-event dedup window, but the actual Rust constant is MAX_SEEN_EVENT_IDS = 256 (lib.rs:8). The test sends 70 events and asserts >= 70. Since 70 < 256, no eviction occurs. The test passes trivially regardless of whether the old-64-bound flaw exists. To meaningfully test eviction, >256 events would be needed.
Similarly, test_dedup_eviction_does_not_accept_old_event sends only 70 events then checks for dedup — with a 256-entry window, the first event is never evicted. The test verifies basic dedup (non-evicted), not eviction behavior.
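The gap between 70 events and a 256-entry window can be shown with a Python model of a bounded dedup window. FIFO eviction is an assumption here, since the Rust eviction policy is not quoted above; DedupWindow is a hypothetical name.

```python
from collections import OrderedDict

MAX_SEEN_EVENT_IDS = 256  # value the text reports for lib.rs


class DedupWindow:
    """Model of a bounded seen-ID window with FIFO eviction (assumed)."""

    def __init__(self, capacity: int = MAX_SEEN_EVENT_IDS) -> None:
        self._capacity = capacity
        self._seen: "OrderedDict[str, None]" = OrderedDict()

    def accept(self, event_id: str) -> bool:
        if event_id in self._seen:
            return False  # duplicate still inside the window
        self._seen[event_id] = None
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)  # evict the oldest ID
        return True
```

In this model, replaying the first ID after 70 events is rejected as a duplicate, so nothing about eviction is exercised. Only after more than MAX_SEEN_EVENT_IDS distinct events does the first ID leave the window, which is the condition an eviction test has to reach.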
Severity: Medium
M7: test_outcome_state_matches_actual_slot is tautological — compares value with itself
File: test_flaws.py:200-210
result = k.process_intent(_mk_intent(action=E.ENTER, trade_id="oc1"))
slot = k._get_slot(0)
assert result.state == slot.fsm_state,
result.state is set from final_slot.fsm_state (which comes from self._get_slot(outcome.slot_id) inside process_intent). The test then calls k._get_slot(0) again. Both read from the same Rust backend — they must be equal by construction. This test proves nothing; it's a tautology.
Severity: Low
M8: ORDER_ACK silent fallthrough when no active order — accepts event with no effect
File: _rust_kernel/src/lib.rs:1476-1498
When on_venue_event receives an ORDER_ACK for a slot with neither active_entry_order nor active_exit_order (shouldn't happen normally, but possible after a reconcile or race), the match arm executes no branch. The state is unchanged, diagnostic_code stays OK, and accepted = true. The event is silently accepted with no effect — no diagnostic, no warning.
The same bug exists for CANCEL_ACK (line 1545): if no matching active order exists, the event is silently accepted with no state change and OK diagnostic.
Severity: Medium
M9: ORDER_REJECT on POSITION_OPEN with stale entry order destroys the position
File: _rust_kernel/src/lib.rs:1499-1530
KernelEventKind::ORDER_REJECT => {
    if slot.active_entry_order.is_some() && slot.fsm_state != TradeStage::POSITION_OPEN {
        // clear entry, wipe trade data, set IDLE
    } else if slot.active_exit_order.is_some() {
        // clear exit order only, set POSITION_OPEN
    } else {
        // no match: reset to IDLE
    }
}
If a slot is in POSITION_OPEN (position active) but active_entry_order is still Some (stale — didn't get cleared on fill), the entry-reject guard fsm_state != POSITION_OPEN prevents the entry path. It falls to the exit check. If no exit order, the final else branch fires — resetting the slot to IDLE and destroying the open position and all trade data.
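The branch structure can be modelled in Python to show one possible change: keep the first two branches as described and make the final branch report a diagnostic without touching the slot. SlotModel, on_order_reject, and the returned diagnostic strings are hypothetical; the real fix would live in the Rust match arm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SlotModel:
    # Python stand-in for the Rust TradeSlot; only the fields used here.
    fsm_state: str
    size: float = 0.0
    active_entry_order: Optional[str] = None
    active_exit_order: Optional[str] = None


def on_order_reject(slot: SlotModel) -> str:
    """Mirror the three branches above, but never reset on 'no match'."""
    if slot.active_entry_order is not None and slot.fsm_state != "POSITION_OPEN":
        slot.active_entry_order = None
        slot.size = 0.0
        slot.fsm_state = "IDLE"
        return "ENTRY_REJECTED"
    if slot.active_exit_order is not None:
        slot.active_exit_order = None
        slot.fsm_state = "POSITION_OPEN"
        return "EXIT_REJECTED"
    # Changed branch: leave the slot untouched and surface a diagnostic.
    return "NO_MATCHING_ORDER"
```

Under this model, a POSITION_OPEN slot with a stale entry order and no exit order keeps its state and size when a reject arrives.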
Severity: Critical
M10: No aggregation of any metric — trade count, success/fail, latency all zero
File: entire codebase
The following metrics are completely impossible to obtain from the current system:
| Metric | Why unavailable |
|---|---|
| Total trades processed | trade_seq declared on AccountSnapshot but never incremented anywhere |
| Succeeded vs failed trades | No aggregation of KernelDiagnosticCode outcomes |
| PnL per individual trade | slot.realized_pnl is overwritten on slot reuse — no per-trade persistence |
| Slippage (fill vs intended price) | Data exists transiently but no computed metric |
| API calls per minute | No call counters anywhere in the venue adapter |
| process_intent latency | Zero timing instrumentation; no time.monotonic() in kernel path |
| Process memory usage | No memory tracking of any kind |
| Deduplicated vs fresh event count | Dedup detection exists but is never counted |
The AccountSnapshot.trade_seq field (account.py:27) is declared as trade_seq: int = 0 but never assigned — no code path ever sets it above 0. It's a dead field.
Severity: High
M11: Flaw 6 tests pass via metadata passthrough, not via _legacy_intent field logic
File: test_flaws.py Flaw 6 tests
The two Flaw 6 tests verify that _legacy_intent preserves order_type and limit_price. They pass because _mk_intent(order_type="LIMIT") puts the value into intent.metadata, and _legacy_intent copies intent.metadata into legacy.metadata verbatim. The tests check legacy.metadata.get("order_type") which finds the value in the passthrough — not because _legacy_intent reads intent.order_type correctly.
_legacy_intent actually reads getattr(intent, "order_type", "MARKET"), which returns "MARKET" (the default, since _mk_intent put the value in metadata rather than the field), and sets legacy.metadata["_order_type"] = "MARKET". The assertion passes via the wrong code path. If _legacy_intent stopped reading intent.order_type entirely, the tests would still pass as long as intent.metadata is passed through.
Severity: High
M12: No retry or fallback for ClickHouse INSERT failures
Evidence across all persistence paths: every sink(table, row) call in pink_clickhouse.py is unprotected. If ClickHouse is unreachable, slow, or returns an error, the exception propagates unhandled through persist_step() → step(). No retry, no backoff, no fallback, no queue, no error reporting to anomaly_events.
This means a transient ClickHouse outage (common in cloud deployments) crashes the entire policy cycle. The slot state in the Rust kernel may be lost as the exception unwinds.
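A bounded retry with a dead-letter list is one way to keep a sink failure from unwinding step(). The sketch assumes only the sink(table, row) call shape mentioned above; persist_with_retry, the dead_letter list, and the delay values are illustrative choices, not existing code.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]


def persist_with_retry(
    sink: Callable[[str, Row], None],
    table: str,
    row: Row,
    dead_letter: List[Tuple[str, Row]],
    attempts: int = 3,
    base_delay_secs: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try sink(table, row) up to `attempts` times; never raise."""
    for attempt in range(attempts):
        try:
            sink(table, row)
            return True
        except Exception:
            if attempt < attempts - 1:
                sleep(base_delay_secs * (2 ** attempt))  # exponential backoff
    dead_letter.append((table, row))  # keep the row for a later replay
    return False
```

The sleep parameter is injectable so tests can record delays instead of waiting. Because the policy cycle is latency-sensitive, a production version would more likely hand rows to a background writer than sleep inline.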
Severity: High
M13: AccountSnapshot.trade_seq declared but never incremented — dead field
File: account.py:27
@dataclass
class AccountSnapshot:
    ...
    trade_seq: int = 0
This field is part of the AccountSnapshot dataclass. It's initialized to 0 and never assigned or incremented anywhere in the entire codebase. Every snapshot from kernel.snapshot()["account"] returns trade_seq: 0. Despite being a standard field in every persistence row, it's always 0 — making it impossible to order trades chronologically by sequence number from any persisted data.
Severity: Medium
M14: test_reentry_after_full_close_no_pnl_loss uses absurdly loose 50% bound
File: test_flaws.py:686-706
assert abs(cap_after_second - cap_before) < cap_before * 0.5
Allows a 50% capital deviation (12,500 USDT on 25,000). The actual PnL from the test's tiny trades (~0.02 USDT) is orders of magnitude smaller. A bug that silently leaked 10,000 USDT of PnL would pass this test. The bound provides no meaningful verification.
Also: the test never checks diagnostic_code for the warning it claims to test (already documented as I7 weakness).
Severity: Low
M15: test_reconcile_rejects_position_open_with_zero_size passes even if reconcile silently ignores bad data
File: test_flaws.py:568-585
result = k.reconcile_from_slots([bad_slot])
slot = k._get_slot(0)
assert slot.fsm_state != TradeStage.POSITION_OPEN or slot.size > 0
The assertion was true before calling reconcile (slot starts IDLE with size=0). The test never checks result.accepted == False or verifies the diagnostic code. If reconcile_from_slots silently ignores the bad slot and returns accepted=True, the test still passes — it only proves the slot wasn't in POSITION_OPEN after reconcile, which was already true.
The same structural weakness exists in test_reconcile_rejects_idle_with_nonzero_size.
Severity: Low
M16: No built-in metric for active slot count, event throughput, or memory usage
The following operational metrics cannot be obtained without writing custom code:
- Active slot count: len([s for s in kernel.state.slots if not s.is_free()]) requires Python access to the ExecutionKernel object. No active_slot_count property exists.
- Total event count: No counter. The journal tracks individual transitions but there's no total_events_processed: int anywhere.
- Memory usage: No tracemalloc, no psutil, no RSS polling. Nothing.
- Runtime uptime: No start_time or uptime() method anywhere.
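Most of the missing counters need very little code. The sketch below is a hypothetical in-process metrics holder: RuntimeMetrics and timed_call are invented names, and the only assumption about the kernel is that a result object exposes a diagnostic_code attribute.

```python
import time
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class RuntimeMetrics:
    started_at: float = field(default_factory=time.monotonic)
    events_processed: int = 0
    outcomes: Counter = field(default_factory=Counter)
    intent_latency_secs: list = field(default_factory=list)

    def record_outcome(self, diagnostic_code: str, latency_secs: float) -> None:
        self.events_processed += 1
        self.outcomes[diagnostic_code] += 1
        self.intent_latency_secs.append(latency_secs)

    def uptime_secs(self) -> float:
        return time.monotonic() - self.started_at


def timed_call(metrics: RuntimeMetrics, fn, *args):
    # Wrap any callable whose result has a diagnostic_code attribute.
    start = time.monotonic()
    result = fn(*args)
    metrics.record_outcome(str(result.diagnostic_code), time.monotonic() - start)
    return result
```

Wrapping process_intent in timed_call would cover total count, success/failure split, latency, and uptime in one place. The latency list grows without bound, so a long-running process would want a fixed-size buffer or histogram instead.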
Severity: Medium
M17: M4 duplicate — test_cancel_uses_slot_asset_not_trade_id and test_mock_venue_cancel_event_has_asset never call cancel
File: test_flaws.py Flaw 9 class
Both tests verify that an entry order's metadata contains an asset key. They never call scenario.cancel() or k.process_intent(action=CANCEL). Despite their names and class (TestFlaw9CancelSymbolFallback), they test metadata preservation on entry, not cancel behavior.
Severity: High
M18: _decision_to_kernel_intent drops order_type and limit_price — LIMIT orders unreachable from the runtime
File: pink_direct.py:79-115 (inferred from E2E trace)
The bridge function converts a Decision to a KernelIntent. It sets timestamp, intent_id, trade_id, asset, side, action, reference_price, target_size, leverage, exit_leg_ratios, reason, and metadata. It does NOT set order_type or limit_price — both default to "MARKET" and 0.0.
Even if the DecisionEngine produced a LIMIT decision with a limit price, the runtime has no path to express it. The entire LIMIT-order pipeline is dead code from the runtime — LIMIT orders can only be set via direct KernelIntent(...) construction in tests, which is itself broken (M3).
Severity: High
Pass 10 Summary
| # | Flaw | Layer | Severity |
|---|---|---|---|
| M1 | ENTER transition hardcodes prev_state=IDLE — audit trail lies for re-entries | Rust | Critical |
| M2 | CANCEL creates no transition record — invisible in audit log | Rust | Critical |
| M3 | _mk_intent drops order_type/limit_price into metadata, not proper field | Test | High |
| M4 | test_cancel_entry_with_partial_fill never sends CANCEL — misnamed vacuous test | Test | High |
| M5 | Flaw 7 tests never send EXIT — exit_partial_fill_ratio untested | Test | Medium |
| M6 | test_dedup tests use wrong constant (actual=256, claim 64) — 70 events insufficient | Test | Medium |
| M7 | test_outcome_state_matches_actual_slot is tautological | Test | Low |
| M8 | ORDER_ACK silent fallthrough when no active order — accepted with no effect | Rust | Medium |
| M9 | ORDER_REJECT on POSITION_OPEN with stale entry order destroys position | Rust | Critical |
| M10 | No aggregation of trade count, success/fail, latency — all zero | All | High |
| M11 | Flaw 6 tests pass via metadata passthrough, not field logic | Test | High |
| M12 | No retry/fallback for ClickHouse INSERT failures — crashes policy cycle | Persistence | High |
| M13 | AccountSnapshot.trade_seq never incremented — always 0 | Account | Medium |
| M14 | test_reentry_after_full_close_no_pnl_loss uses 50% bound — absurd | Test | Low |
| M15 | test_reconcile_rejects_position_open_with_zero_size passes for wrong reason | Test | Low |
| M16 | No built-in metric for active slots, event throughput, or memory | All | Medium |
| M17 | Flaw 9 tests named for cancel but never call cancel | Test | High |
| M18 | _decision_to_kernel_intent drops order_type and limit_price — LIMIT dead from runtime | Runtime | High |
Pass 10 Severity
| Severity | Count |
|---|---|
| Critical | 3 (M1, M2, M9) |
| High | 7 (M3, M4, M10, M11, M12, M17, M18) |
| Medium | 5 (M5, M6, M8, M13, M16) |
| Low | 3 (M7, M14, M15) |
Combined Catalog (All 10 Passes)
| Pass | Focus | Count | Critical | High | Medium | Low | Info |
|---|---|---|---|---|---|---|---|
| A | Architectural | 15 | 0 | 2 | 0 | 2 | 11 |
| T | Threading/Atomicity | 9 | 1 | 3 | 3 | 2 | 0 |
| E | E2E Trace (Pass 1) | 26 | 0 | 4 | 10 | 11 | 1 |
| F | Deep E2E (Pass 3) | 30 | 0 | 1 | 8 | 17 | 4 |
| G | Domain Scans (Pass 4) | 36 | 4 | 11 | 11 | 8 | 2 |
| H | Edge Domains (Pass 5) | 22 | 3 | 9 | 5 | 4 | 1 |
| I | Pass 6 (Math/Tests/Recovery/Security) | 22 | 3 | 11 | 4 | 2 | 2 |
| J | Pass 7 (Test Infra/Data/Rust/Env/Conn) | 16 | 0 | 7 | 7 | 2 | 0 |
| K | Pass 8 (Observability/Memory/Time/DeadCode) | 23 | 2 | 7 | 7 | 1 | 6 |
| L | Pass 9 (Contracts/Events/Network/FFI/Diffs) | 16 | 0 | 4 | 7 | 4 | 1 |
| M | Pass 10 (Runtime/TestBugs/FSM/Persistence/Metrics) | 18 | 3 | 7 | 5 | 3 | 0 |
| Total | | 233 | 16 | 66 | 67 | 56 | 28 |
PASS 11 — ASYNC/SYNC SEAMS, LOCK ANALYSIS, THREADING
N1: Rust kernel with_handle_mut has zero synchronization — &mut KernelCore from raw pointer, UB on concurrent FFI
File: _rust_kernel/src/lib.rs:2042
fn with_handle_mut<F, R>(handle: *mut KernelHandle, f: F) -> Result<R, String>
where
    F: FnOnce(&mut KernelCore) -> Result<R, String>,
{
    // Safety: single-threaded; caller holds exclusive access for the duration.
    let core = unsafe { &mut (*handle).core }; // raw ptr → &mut
The comment says "single-threaded" but provides zero enforcement: no Mutex, no RwLock, no atomic flag, no thread-local constraints, no !Send/!Sync marker on KernelCore. The unsafe block converts a raw pointer to a &mut reference, which under Rust's aliasing rules must be exclusive. Two simultaneous &mut references to the same data are undefined behavior (data race, torn reads, LLVM miscompilation).
The ctypes FFI mechanism releases the GIL during the Rust call (Py_BEGIN_ALLOW_THREADS/Py_END_ALLOW_THREADS). Two Python threads can call any two dita_kernel_* functions simultaneously — one in process_intent (writing slot state), another in snapshot_json (reading). Both produce &mut KernelCore. This is a compiler-level UB, not just a logical race.
Trigger scenario: Thread A calls process_intent() (ENTRY fill → mutates slot). Thread B calls on_venue_event() (exit fill → mutates slot). The GIL is released during both Rust FFI calls. Both get &mut KernelCore. The Rust compiler can reorder, elide, or speculate any memory operation. Slot data becomes corrupted, PnL doubles, or the process segfaults.
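Until the Rust side enforces exclusivity itself, a Python-side mitigation is to hold one lock across every foreign call. The sketch below is hypothetical: SerializedKernel is not an existing class, and raw_call stands in for whatever dispatches to the dita_kernel_* functions. It serializes callers that go through the wrapper but does not make the Rust code sound for any caller that bypasses it.

```python
import threading
from typing import Any, Callable


class SerializedKernel:
    """Hypothetical wrapper: one lock around every FFI entry point."""

    def __init__(self, raw_call: Callable[..., Any]) -> None:
        self._raw_call = raw_call      # stands in for the ctypes function table
        self._lock = threading.Lock()  # held for the whole foreign call

    def call(self, name: str, *args: Any) -> Any:
        # The GIL may be released inside the foreign call, but a second
        # thread still blocks here until the first call has returned.
        with self._lock:
            return self._raw_call(name, *args)
```

A plain Lock is enough as long as no FFI call re-enters the wrapper from a callback; if it can, an RLock or a redesign would be needed.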
Severity: Critical — undefined behavior, no enforcement, no mitigation.
N2: _run() has two completely different code paths depending on event loop state — runtime branch, not design decision
File: bingx_venue.py:225-238
def _run(self, result):
    if inspect.isawaitable(result):
        try:
            asyncio.get_running_loop()
        except RuntimeError:
            return asyncio.run(result)  # Path A: no loop → direct run
        pool = self._get_executor()
        return pool.submit(asyncio.run, result).result()  # Path B: loop → pool + block
    return result
Path A (no event loop running): asyncio.run(result) — creates a new event loop, runs the coroutine, closes it. All on the same thread. Correct for sync contexts.
Path B (event loop running): pool.submit(asyncio.run, result).result() — submits to a 3-thread pool, each worker creates yet ANOTHER event loop via asyncio.run(), then blocks the calling thread with .result().
The asyncio.get_running_loop() check is a runtime probe — the code doesn't know from its design whether it's in an async context. Same logical operation (run a coroutine), two completely different implementations. Path B is a documented anti-pattern (creating/destroying event loops per call), Path A is correct.
This is the root cause of the entire async/sync seam problem — the architecture never committed to being async or sync.
Severity: Critical
N3: _run() Path B blocks the event loop thread for every venue HTTP operation
File: bingx_venue.py:236
return pool.submit(asyncio.run, result).result() # BLOCKS calling thread
When called from within a running event loop (all live tests, any async deployment), .result() blocks the event loop thread until the thread pool worker completes. During this block:
- No WS messages can be received from the BingxUserStream
- No keepalive tasks can run
- No timer-based events can fire
- The event loop is stuck
If the thread pool is exhausted (3 concurrent HTTP calls — e.g., _backend_snapshot from submit() which calls it twice plus cancel() which calls it three times), the 4th call blocks at .result() indefinitely — the work item is queued but no worker is free. This is a stuck-process scenario where the entire system freezes.
The event loop thread is blocked on .result(), so it cannot process the WS events that might contain the fill for the order it just submitted. If the exchange fills instantly, the WS message arrives before .result() returns and sits in the OS kernel's TCP receive buffer, unprocessed, until process_intent completes and the event loop can schedule the WS reader again. This delay can cause stale fills, missed state transitions, or WS timeouts.
Severity: Critical
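The stall can be reproduced in isolation. The sketch below is a minimal demonstration, not the venue adapter's code: `slow_call`, `heartbeat`, and the timings are assumptions standing in for one venue HTTP round-trip and the WS reader. It compares a blocking `.result()` on the event loop thread with awaiting the same pool work.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def slow_call() -> str:
    """Hypothetical stand-in for one blocking venue HTTP round-trip."""
    time.sleep(0.2)
    return "ok"


async def heartbeat(ticks: list) -> None:
    """Stand-in for the WS reader: records one tick roughly every 10 ms."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)


async def measure() -> tuple:
    ticks: list = []
    hb = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0.05)  # let the heartbeat start ticking

    with ThreadPoolExecutor(max_workers=3) as pool:
        # Path B style: .result() blocks the event loop thread itself,
        # so the heartbeat cannot run until the worker finishes.
        before = len(ticks)
        pool.submit(slow_call).result()
        blocked_ticks = len(ticks) - before

        # Awaiting the pool work keeps the loop free to run other tasks.
        before = len(ticks)
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(pool, slow_call)
        yielded_ticks = len(ticks) - before

    hb.cancel()
    await asyncio.gather(hb, return_exceptions=True)
    return blocked_ticks, yielded_ticks


blocked_ticks, yielded_ticks = asyncio.run(measure())
print(blocked_ticks, yielded_ticks)
```

The awaiting variant only helps if the caller is itself a coroutine. A sync `process_intent()` invoked from inside the loop cannot await, which is why the seam is architectural rather than a one-line fix.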
N4: asyncio.run() called repeatedly inside thread pool — creates/destroys event loops per call, documented anti-pattern
File: bingx_venue.py:236
return pool.submit(asyncio.run, result).result()
Each call to asyncio.run() creates a new SelectorEventLoop, runs it, then closes it. Doing this repeatedly for every HTTP call is a documented CPython anti-pattern:
- Each loop allocation costs memory (selector, callbacks, timeout queue)
- Each loop destruction leaves loop-internal objects for GC
- Over many calls (hundreds of trades), this creates GC pressure and memory fragmentation
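- The asyncio.run() documentation says it always creates a new event loop, closes it at the end, and "should ideally only be called once" as the program's main entry point; repeated work belongs on a long-lived loop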
Path A (no event loop) has the same issue — asyncio.run() is called per-_run() invocation.
The total cost: each process_intent() may call _run() 3-4 times (_backend_snapshot ×2 + submit_intent + optionally cancel). Each _run() creates/destroys an event loop. With 10 trades/min, that's 30-40 event loop creations/destructions per minute.
Severity: Critical
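One common alternative to per-call asyncio.run() is a single long-lived loop on a dedicated thread, fed through asyncio.run_coroutine_threadsafe(). The sketch below assumes that design; `LoopThread`, `which_loop`, and the thread name are hypothetical and do not exist in the PINK codebase.

```python
import asyncio
import threading


class LoopThread:
    """Owns one event loop on a daemon thread for the process lifetime."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(
            target=self._loop.run_forever, name="venue-loop", daemon=True
        )
        self._thread.start()

    def run(self, coro, timeout: float = 10.0):
        """Run `coro` on the shared loop and block the caller for its result."""
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout=timeout)

    def close(self) -> None:
        """Stop the loop, join the thread, then release loop resources."""
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def which_loop() -> int:
    # Identity of the loop this coroutine is running on.
    return id(asyncio.get_running_loop())


bridge = LoopThread()
first = bridge.run(which_loop())
second = bridge.run(which_loop())
bridge.close()
print(first == second)
```

Both calls report the same loop, so no loop is created or destroyed per call. The caller still blocks on `.result()`, so calling `run()` from the event loop thread would reproduce N3; this sketch addresses only the loop churn.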
N5: _snapshot_ready Event cascading re-fetch — N concurrent callers produce N overlapping HTTP calls
File: bingx_venue.py:258-274
def _backend_snapshot(self, ...):
    if not self._snapshot_ready.wait(timeout=timeout_ms / 1000.0):
        return self._last_snapshot  # stale
    self._snapshot_ready.clear()
    try:
        snapshot = self._call_backend("refresh_state", ...)  # HTTP call
    except Exception:
        self._snapshot_ready.set()
        raise
    with self._snap_lock:
        self._last_snapshot = snapshot
    self._snapshot_ready.set()
When _snapshot_ready.set() fires at the end, ALL threads waiting on .wait() wake up. Each one proceeds to clear() and start a new HTTP call — even though a fresh snapshot was just written. With N concurrent callers to _backend_snapshot, this produces N overlapping refresh_state HTTP calls instead of N-1 callers reading the just-received result.
On BingX VST (rate limit ~10 req/s), 3 overlapping refresh_state calls (each doing 5 parallel sub-requests) issue 15 requests against a budget of roughly 10 per second. The calls overlap and cascade, spending rate-limit capacity on redundant work.
Severity: High
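A "single-flight" pattern avoids the cascade: the first caller performs the refresh and every caller that arrives while it is in flight reuses that result. This is a swapped-in technique sketched under assumptions, not the adapter's code. `SingleFlightSnapshot` and `fake_refresh` are hypothetical, with `fake_refresh` standing in for the refresh_state call.

```python
import threading
import time
from concurrent.futures import Future


class SingleFlightSnapshot:
    def __init__(self, fetch) -> None:
        self._fetch = fetch          # hypothetical: performs one refresh
        self._lock = threading.Lock()
        self._inflight = None        # Future shared by all current callers

    def get(self, timeout: float = 5.0):
        with self._lock:
            future = self._inflight
            leader = future is None
            if leader:
                future = self._inflight = Future()
        if leader:
            # Only the leader touches the backend; followers wait below.
            try:
                future.set_result(self._fetch())
            except Exception as exc:
                future.set_exception(exc)
            finally:
                with self._lock:
                    self._inflight = None
        return future.result(timeout=timeout)


calls = []


def fake_refresh() -> dict:
    calls.append(1)
    time.sleep(0.3)                  # simulated HTTP latency
    return {"seq": len(calls)}


snap = SingleFlightSnapshot(fake_refresh)
barrier = threading.Barrier(5)
results = []


def worker() -> None:
    barrier.wait()                   # release all five callers together
    results.append(snap.get())


threads = [threading.Thread(target=worker) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(calls), len(results))
```

Five concurrent callers produce one fetch. A caller that arrives after the leader has finished starts a new fetch, so freshness is preserved for sequential calls.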
N6: BingxUserStream.close() does not cancel pending tasks — keepalive/rotation tasks continue after close
File: bingx_user_stream.py:160-169
async def close(self) -> None:
    self._closed.set()
    if self._session is not None and not self._session.closed:
        await self._session.close()
close() sets the _closed event and closes the aiohttp session. It does not cancel the keepalive_task or rotation_task created inside subscribe(). These tasks are only cancelled in the finally block of subscribe(). If close() is called while nobody is iterating the subscribe() generator (or if iteration is blocked in _consume()), those tasks keep running until:
- The event loop shuts down (automatic task cancellation)
- The subscribe generator is garbage collected
- An exception occurs in the WS reader
During this window, the keepalive loop continues sending PUT requests to the (now potentially deleted) listen key. The rotation task continues its 23h50m sleep. Both are zombie tasks with no cleanup path.
Severity: Medium
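One way to give close() a cleanup path is for the stream to own its tasks in a set and cancel them on close. The sketch below models only that ownership. `StreamSketch` and `_spawn` are hypothetical, and the long sleeps stand in for the keepalive and rotation coroutines.

```python
import asyncio


class StreamSketch:
    """Hypothetical stand-in for BingxUserStream; only task ownership is modelled."""

    def __init__(self) -> None:
        self._closed = asyncio.Event()
        self._tasks: set = set()

    def _spawn(self, coro, name: str) -> asyncio.Task:
        task = asyncio.create_task(coro, name=name)
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)  # self-cleaning registry
        return task

    async def start(self) -> None:
        self._spawn(asyncio.sleep(3600), "lk_keepalive")
        self._spawn(asyncio.sleep(3600), "lk_rotation")

    async def close(self) -> None:
        self._closed.set()
        pending = list(self._tasks)
        for task in pending:
            task.cancel()
        # Wait for cancellation to finish so nothing outlives close().
        await asyncio.gather(*pending, return_exceptions=True)


async def demo() -> tuple:
    stream = StreamSketch()
    await stream.start()
    tasks = list(stream._tasks)
    await stream.close()
    return [t.cancelled() for t in tasks], len(stream._tasks)


cancelled_flags, remaining = asyncio.run(demo())
print(cancelled_flags, remaining)
```

With this shape, close() no longer depends on someone iterating the subscribe() generator for the finally block to run.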
N7: Live test architecture forces worst-case _run() behavior for every operation
File: gen_live_tests.py, gen2.py, _gen_test.py (all test generators)
The live tests use this pattern:
def test_pink_ditav2_xxx(_live_client) -> None:
    ...
    result = asyncio.run(_run_scenario(bundle, _live_client, body_fn, name, ic))
Each test is a synchronous function that calls asyncio.run(). Inside the resulting event loop, every call to k.process_intent() triggers Path B of _run() — the pool-submit-.result() path. The test architecture forces the architecture's slowest, most thread-expensive code path for every single intent.
Every HTTP call: creates a new event loop on a pool thread → blocks the main event loop thread → blocks WS processing → wastes pool slots. Even for trivial mock-venue tests that don't need HTTP at all, the architecture still goes through the same _run() → pool → .result() path because the mock venue also returns awaitables.
Severity: Medium
N8: BingxUserStream subscribe() creates new tasks on every reconnect — rapid reconnect causes task churn
File: bingx_user_stream.py:100-120
async def subscribe(self):
    while not self._closed.is_set():
        ...
        keepalive_task = asyncio.create_task(self._keepalive_loop(listen_key))
        rotation_task = asyncio.create_task(self._rotation_sentinel())
        ...
        async for event in self._consume(listen_key, rotation_task):
            yield event
Each iteration of the reconnect loop creates new keepalive_task and rotation_task, then cancels the previous ones in the finally block. If the connection drops every few seconds (unstable WS), tasks are created and cancelled in rapid succession. Cancellation races with task creation — a task can be cancelled before its first await, which changes its state machine.
Also: no rate limiting on the reconnect loop beyond the delay_ms exponential backoff. If the WS repeatedly fails immediately after connection, the loop creates/destroys tasks in a tight cycle.
Severity: Medium
N9: No asyncio.all_tasks() or task accounting anywhere — leaked tasks undetectable
No code in the entire workspace calls asyncio.all_tasks() or maintains a task registry. If a task is leaked (cancellation not propagated, generator not cleaned up), there is:
- No way to detect it programmatically
- No warning log
- No metrics
- No __del__ fallback
Combined with N6 (tasks not cancelled on close) and N8 (task churn on reconnect), leaked tasks accumulate silently. Each leaked task holds references to its coroutine frame, which may hold references to aiohttp.ClientSession, websocket connections, and other resources.
Severity: Low
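A leak check can be built on asyncio.all_tasks() alone. The sketch below is an assumption about how such a check might look: `unexpected_tasks` and `leaked_worker` are hypothetical names, and the allow-list would hold the names of tasks expected to be alive.

```python
import asyncio


async def leaked_worker() -> None:
    await asyncio.sleep(3600)


def unexpected_tasks(allowed_names: set) -> list:
    """Names of live tasks other than the caller and the allow-list."""
    current = asyncio.current_task()
    return sorted(
        t.get_name()
        for t in asyncio.all_tasks()
        if t is not current and t.get_name() not in allowed_names
    )


async def demo() -> list:
    task = asyncio.create_task(leaked_worker(), name="lk_keepalive")
    await asyncio.sleep(0)           # let the task start
    report = unexpected_tasks(allowed_names=set())
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)
    return report


report = asyncio.run(demo())
print(report)
```

Run at the end of a test or after close(), a non-empty report would flag exactly the zombie tasks described in N6 and N8.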
N10: _snap_lock / _snapshot_ready pattern has no reader-side protection on _last_snapshot
File: bingx_venue.py:258-274
The _snap_lock protects _last_snapshot during writes (lines 269-271), and the timeout fallback path (lines 260-262) also reads _last_snapshot under _snap_lock. The _call_backend call at line 266 runs outside the lock, which is correct: a lock should not be held across an HTTP round-trip. The consequence is that two threads can each fetch a snapshot and then write it in either order. _snap_lock serializes the writes, but whichever thread writes last wins. This is the intended behavior (last writer wins for staleness purposes), not a correctness bug.
Severity: Informational
Pass 11 Summary
| # | Flaw | Layer | Severity |
|---|---|---|---|
| N1 | Rust kernel with_handle_mut zero synchronization: &mut from raw ptr, UB on concurrent FFI | Rust | Critical |
| N2 | _run() has two completely different code paths: runtime branch, not design decision | Venue | Critical |
| N3 | _run() path B blocks event loop thread for every venue HTTP operation | Venue | Critical |
| N4 | asyncio.run() called repeatedly: creates/destroys event loops per call, documented anti-pattern | Venue | Critical |
| N5 | _snapshot_ready cascading re-fetch: N callers produce N overlapping HTTP calls | Venue | High |
| N6 | BingxUserStream.close() doesn't cancel pending tasks: zombie keepalive/rotation after close | Stream | Medium |
| N7 | Live test architecture forces worst-case _run() path for every operation | Test | Medium |
| N8 | subscribe() reconnect creates new tasks per iteration: rapid reconnect causes task churn | Stream | Medium |
| N9 | No asyncio.all_tasks() or task accounting: leaked tasks undetectable | All | Low |
| N10 | _snap_lock/_snapshot_ready no reader-side protection (informational) | Venue | Info |
Pass 11 Severity
| Severity | Count |
|---|---|
| Critical | 4 (N1, N2, N3, N4) |
| High | 1 (N5) |
| Medium | 3 (N6, N7, N8) |
| Low | 1 (N9) |
| Info | 1 (N10) |
Combined Catalog (All 11 Passes)
| Pass | Focus | Count | Critical | High | Medium | Low | Info |
|---|---|---|---|---|---|---|---|
| A | Architectural | 15 | 0 | 2 | 0 | 2 | 11 |
| T | Threading/Atomicity | 9 | 1 | 3 | 3 | 2 | 0 |
| E | E2E Trace (Pass 1) | 26 | 0 | 4 | 10 | 11 | 1 |
| F | Deep E2E (Pass 3) | 30 | 0 | 1 | 8 | 17 | 4 |
| G | Domain Scans (Pass 4) | 36 | 4 | 11 | 11 | 8 | 2 |
| H | Edge Domains (Pass 5) | 22 | 3 | 9 | 5 | 4 | 1 |
| I | Pass 6 (Math/Tests/Recovery/Security) | 22 | 3 | 11 | 4 | 2 | 2 |
| J | Pass 7 (Test Infra/Data/Rust/Env/Conn) | 16 | 0 | 7 | 7 | 2 | 0 |
| K | Pass 8 (Observability/Memory/Time/DeadCode) | 23 | 2 | 7 | 7 | 1 | 6 |
| L | Pass 9 (Contracts/Events/Network/FFI/Diffs) | 16 | 0 | 4 | 8 | 4 | 0 |
| M | Pass 10 (Runtime/TestBugs/FSM/Persistence/Metrics) | 18 | 3 | 7 | 5 | 3 | 0 |
| N | Pass 11 (Async/Sync Seams/Locks/Threading) | 10 | 4 | 1 | 3 | 1 | 1 |
| Total | | 243 | 20 | 67 | 71 | 57 | 28 |
PASS 12 — SYNC/ASYNC WIDER SCOPE (launcher, generators, streams, FFI, tests)
O1: _maybe_close() calls asyncio.run() without checking for a running event loop — close/disconnect silently skipped
File: launcher.py:270-274
def _maybe_close(obj):
    ...
    if inspect.isawaitable(result):
        try:
            asyncio.run(result)
        except RuntimeError:
            pass  # SILENT: coroutine never executed
When _maybe_close() is called from any context that already has a running event loop (which includes all async tests, any async def main() orchestrator, or any code path that imports and runs DITAv2LauncherBundle inside an async context), asyncio.run(result) raises RuntimeError: asyncio.run() cannot be called from a running event loop. The except RuntimeError: pass swallows it — the close/disconnect method never executes.
Affected resources when called from async context:
- RealZincPlane.close(): never called → 3 shared memory regions leaked
- RealZincControlPlane.close(): never called → 1 shared memory region leaked
- BingxVenueAdapter has neither close() nor disconnect(): N/A
- InMemoryZincPlane has no close: N/A
The DITAv2LauncherBundle.close() method calls _maybe_close(self.venue), _maybe_close(self.zinc_plane), _maybe_close(self.control_plane) — if any of these have async close/disconnect methods, they're all silently skipped when called from async context.
This means: in any async deployment (which is the only deployment pattern — tests, and presumably production via asyncio.run() at top level), shared memory regions are never explicitly closed. They rely on process exit cleanup.
Severity: High
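One possible shape for a fix is to split the helper into an awaitable variant for async callers and a sync variant that fails loudly instead of silently skipping the close. The sketch below is illustrative only: `maybe_aclose`, `maybe_close`, and `AsyncResource` are hypothetical, and the close-then-disconnect lookup order is an assumption because the original body is elided.

```python
import asyncio
import inspect


async def maybe_aclose(obj) -> bool:
    """Await obj.close() or obj.disconnect() if present; True if one ran."""
    for name in ("close", "disconnect"):
        method = getattr(obj, name, None)
        if callable(method):
            result = method()
            if inspect.isawaitable(result):
                await result
            return True
    return False


def maybe_close(obj) -> bool:
    """Sync entry point: refuses to run inside a running event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop on this thread: safe to drive the coroutine here.
        return asyncio.run(maybe_aclose(obj))
    raise RuntimeError("running event loop detected; await maybe_aclose() instead")


class AsyncResource:
    """Hypothetical resource with an async close()."""

    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        self.closed = True


async def from_async_context() -> tuple:
    res = AsyncResource()
    try:
        maybe_close(res)
        loud = False
    except RuntimeError:
        loud = True                  # the misuse is visible, not swallowed
    ran = await maybe_aclose(res)
    return loud, ran, res.closed


sync_res = AsyncResource()
sync_ok = maybe_close(sync_res)
loud, ran, closed = asyncio.run(from_async_context())
print(sync_ok, sync_res.closed, loud, ran, closed)
```

The loop check happens before any coroutine is created, so the loud failure path leaves no un-awaited coroutine behind.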
O2: async def connect() shims in all test generators call sync venue.connect() without await — misleading pattern
Files: gen_live_tests.py:143-146, gen2.py:332-333, _gen_test.py:70 (via Shim/Shim pattern)
# All three test harnesses have this pattern:
async def connect(self, initial_capital=0):
    self.kernel.venue.connect()  # sync method, no await
BingxVenueAdapter.connect() (bingx_venue.py:301) is a sync def that returns bool. It internally calls self._run(result()) which under a running event loop submits to the thread pool and blocks with .result(). The async def connect() wrapper is misleading — it's async but immediately calls a sync method that will block the event loop for the HTTP round-trip duration.
The caller's perspective: await runtime.connect() should yield the event loop. Instead, it blocks until the BingX HTTP call inside connect() completes (via _run()'s thread pool path).
Severity: Medium
O3: gen_live_tests.py:171 — _contract_rows(client) NOT awaited in async def _pick_live_symbol — silent failure
File: gen_live_tests.py:171
async def _pick_live_symbol(client):
    rows = _contract_rows(client)  # MISSING await! _contract_rows is async def
    ...
    pos_rows = [r for r in rows if ...]
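_contract_rows is async def (line 69). Without await, rows is a coroutine object, not the actual data. The subsequent iteration for r in rows then iterates over that coroutine object, which raises TypeError: 'coroutine' object is not iterable on every Python version with native coroutines, not only 3.12+. Python also emits RuntimeWarning: coroutine '_contract_rows' was never awaited when the object is discarded.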
This function is called from _run_scenario (line 260) and _run_pink_live_roundtrip (line 297). If either path reaches _pick_live_symbol, it crashes with TypeError. This bug may not have manifested in practice if the code paths that call _pick_live_symbol are rarely exercised or if the test generator's output file hasn't been regenerated recently.
Severity: High
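The failure mode is easy to reproduce outside the generator. In the sketch below, `contract_rows`, `pick_buggy`, `pick_fixed`, and the sample row are hypothetical stand-ins for the real helpers and exchange data.

```python
import asyncio


async def contract_rows(client) -> list:
    # Hypothetical payload; the real rows come from the exchange.
    return [{"symbol": "BTC-USDT"}]


async def pick_buggy(client) -> list:
    rows = contract_rows(client)     # coroutine object, not a list
    try:
        return [r["symbol"] for r in rows]
    finally:
        rows.close()                 # suppress "never awaited" in this demo


async def pick_fixed(client) -> list:
    rows = await contract_rows(client)
    return [r["symbol"] for r in rows]


try:
    asyncio.run(pick_buggy(None))
    error = ""
except TypeError as exc:
    error = str(exc)

fixed = asyncio.run(pick_fixed(None))
print(error, fixed)
```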
O4: test_exchange_event_seam_parity.py uses deprecated asyncio.get_event_loop().run_until_complete()
File: test_exchange_event_seam_parity.py:243,264
snap = asyncio.get_event_loop().run_until_complete(mock.account_snapshot()) # line 243
asyncio.get_event_loop().run_until_complete(asyncio.wait_for(_collect(), timeout=2.0)) # line 264
asyncio.get_event_loop() emits a DeprecationWarning in Python 3.12+ when no current event loop is set. In that case it creates a new loop and installs it as the current event loop, which can cause subtle issues when multiple event loops are active. The modern pattern is asyncio.run().
These are the only two places in the workspace that use the deprecated get_event_loop().run_until_complete() pattern.
Severity: Medium
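A minimal sketch of the asyncio.run() form, assuming both calls can share one scenario coroutine. `MockExchange`, `_collect`, and the returned values here are hypothetical stand-ins for the objects in the parity test.

```python
import asyncio


class MockExchange:
    """Hypothetical stand-in for the mock used in the parity test."""

    async def account_snapshot(self) -> dict:
        return {"equity": 100.0}


async def _collect() -> list:
    await asyncio.sleep(0)
    return ["fill"]


async def scenario() -> tuple:
    mock = MockExchange()
    snap = await mock.account_snapshot()
    events = await asyncio.wait_for(_collect(), timeout=2.0)
    return snap, events


# One asyncio.run() call: both awaits share a single event loop.
snap, events = asyncio.run(scenario())
print(snap, events)
```

Wrapping both steps in one coroutine matters if they share loop-bound state, because two separate asyncio.run() calls would each get their own loop.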
O5: _run() thread pool has no timeout on .result() — if backend hangs, calling thread hangs forever
File: bingx_venue.py:236
return pool.submit(asyncio.run, result).result() # NO timeout
concurrent.futures.Future.result() has an optional timeout parameter. None is set here. If the thread pool worker hangs (e.g., the asyncio.run() call in the worker gets stuck on a never-responding HTTP request, a deadlocked coroutine, or an infinite loop), the calling thread blocks forever on .result().
If the calling thread is the event loop thread (Path B), the entire event loop is frozen indefinitely. No WS messages, no keepalive tasks, no timer events. The system is completely dead.
The _backend_snapshot() method has a 5-second timeout for its threading.Event.wait(), but the actual _call_backend("refresh_state", ...) that runs inside the thread pool has no timeout. The HTTP client (BingxHttpClient) may have its own default timeout (aiohttp's default total timeout is 5 minutes unless overridden), but there's no fallback if it hangs beyond that.
Severity: High
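A bounded version of the bridge needs two deadlines: one inside the worker so the coroutine is cancelled and the pool slot is freed, and one on `.result()` so the caller cannot wait forever. The sketch below assumes that approach. `run_with_deadline`, the pool, and the one-second grace period are hypothetical.

```python
import asyncio
import concurrent.futures

_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=3)


def run_with_deadline(coro, timeout_s: float):
    """Run `coro` on a pool worker with an inner and an outer deadline."""

    async def bounded():
        # Inner deadline: cancels the coroutine so the worker is released.
        return await asyncio.wait_for(coro, timeout=timeout_s)

    future = _POOL.submit(asyncio.run, bounded())
    # Outer deadline: unblocks the caller even if the worker is wedged.
    return future.result(timeout=timeout_s + 1.0)


async def hang() -> None:
    await asyncio.sleep(3600)        # simulated never-responding backend


async def quick() -> str:
    return "ok"


try:
    run_with_deadline(hang(), 0.1)
    timed_out = False
except (asyncio.TimeoutError, concurrent.futures.TimeoutError):
    timed_out = True

quick_result = run_with_deadline(quick(), 0.1)
_POOL.shutdown()
print(timed_out, quick_result)
```

The outer timeout alone would only unblock the caller while leaving the worker stuck, which is why the inner asyncio.wait_for() is included. Called from the event loop thread, this still blocks for up to the deadline (see N3).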
O6: MockVenueAdapter never exercises the thread-pool bridge — all CI tests use mock venue, bridge untested
Files: mock_venue.py vs bingx_venue.py
MockVenueAdapter.submit() is pure sync — it does return self._events_from_submit(...) with no awaitables, no thread pools. BingxVenueAdapter.submit() is a sync-bridge that goes through _run() → pool.submit(asyncio.run, ...).result().
All 35+ tests in test_flaws.py use MockVenueAdapter. All generated live tests use BingxVenueAdapter but are rarely executed (require live exchange credentials and API key env vars). The thread-pool bridge — including:
- Thread creation and lifecycle
- asyncio.run() inside pool workers
- Event loop per HTTP call
- Thread pool exhaustion handling
- Exception propagation through .result()
— is never exercised in CI. If the bridge has a bug (e.g., the asyncio.run() inside the pool worker corrupts shared state, or thread-safety issues in aiohttp), it surfaces only in production.
Severity: Medium
O7: BingxUserStream._keepalive_loop and _rotation_sentinel are fire-and-forget tasks — unhandled exceptions silently lost
File: bingx_user_stream.py:105-112
keepalive_task = asyncio.create_task(self._keepalive_loop(listen_key), name="lk_keepalive")
rotation_task = asyncio.create_task(self._rotation_sentinel(), name="lk_rotation")
Both are created with create_task() and tracked for later cancellation, but not supervised during normal operation. If _keepalive_loop raises an exception that its internal try/except does not catch (e.g., a RuntimeError from the HTTP layer), the exception is stored in the Task object. If .result() or .exception() is never called on that Task, asyncio logs "Task exception was never retrieved" when the Task is garbage collected: a log message, but no structured error handling.
_rotation_sentinel has no exception handling in its body — it just does await asyncio.sleep(secs) and returns. It can't raise an exception unless the event loop is shut down during its sleep (in which case CancelledError is raised, which is properly handled in the finally block).
Severity: Low
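A done-callback is one lightweight way to supervise such tasks. The sketch below is an assumption about how that could look: `supervise` and `failing_keepalive` are hypothetical, and a real handler would log or trigger a reconnect rather than append to a list.

```python
import asyncio


def supervise(task: asyncio.Task, failures: list) -> asyncio.Task:
    """Record any non-cancellation failure as soon as the task finishes."""

    def _on_done(done: asyncio.Task) -> None:
        if done.cancelled():
            return                   # cancellation is expected at shutdown
        exc = done.exception()       # also marks the exception as retrieved
        if exc is not None:
            failures.append((done.get_name(), type(exc).__name__))

    task.add_done_callback(_on_done)
    return task


async def failing_keepalive() -> None:
    raise RuntimeError("listen key expired")


async def demo() -> list:
    failures: list = []
    supervise(asyncio.create_task(failing_keepalive(), name="lk_keepalive"), failures)
    rotation = supervise(
        asyncio.create_task(asyncio.sleep(3600), name="lk_rotation"), failures
    )
    await asyncio.sleep(0.01)        # let the keepalive fail and report
    rotation.cancel()
    await asyncio.sleep(0.01)        # cancelled task must not be reported
    return failures


failures = asyncio.run(demo())
print(failures)
```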
O8: KernelSlotView.__getattr__ makes a ctypes call per attribute — each read triggers Rust FFI and is not cached
File: rust_backend.py:422-426
def __getattr__(self, name: str) -> Any:
    slot = self._snapshot()  # FFI call → Rust serialize → JSON parse → TradeSlot
    if hasattr(slot, name):
        return getattr(slot, name)
    raise AttributeError(name)
Every attribute access on a KernelSlotView — including slot.size, slot.fsm_state, slot.trade_id, slot.active_entry_order, etc. — does a full JSON round-trip to the Rust kernel:
- Python calls _get_rust().get_slot_json(self._backend, slot_id)
- ctypes calls Rust dita_kernel_get_slot_json
- Rust serializes the entire TradeSlot to a JSON string
- ctypes returns the C string pointer
- Python calls _take_string(raw) → text.decode("utf-8")
- Python calls json.loads(text) → dict
- _slot_from_payload(dict) → new TradeSlot dataclass
- getattr(slot, name) → read the one field from the new object
Accessing 5 fields on a KernelSlotView (e.g., slot.size, slot.fsm_state, slot.entry_price, slot.active_entry_order, slot.trade_id) does 5 FFI round-trips. The deserialized TradeSlot is created and immediately discarded for each access.
The _snapshot() method (line 435) calls self._kernel._get_slot(self._slot_id) which does the full FFI round-trip. There is no caching of the deserialized TradeSlot between successive accesses. This is an N+1 performance issue — accessing N fields costs N FFI calls instead of 1.
Severity: Medium
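The N+1 cost can be shown with a counting stand-in for the FFI layer. In the sketch below, `SlotSketch`, `CountingKernel`, and `PerAttributeView` are hypothetical models of TradeSlot, the kernel, and KernelSlotView; only the call counting mirrors the behavior described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlotSketch:
    size: float
    fsm_state: str
    trade_id: str


class CountingKernel:
    """Counts simulated FFI round-trips instead of calling Rust."""

    def __init__(self) -> None:
        self.ffi_calls = 0

    def _get_slot(self, slot_id: int) -> SlotSketch:
        self.ffi_calls += 1
        return SlotSketch(size=1.5, fsm_state="OPEN", trade_id="t-1")


class PerAttributeView:
    """Mirrors the per-attribute pattern: one round-trip per read."""

    def __init__(self, kernel: CountingKernel, slot_id: int) -> None:
        self._kernel = kernel
        self._slot_id = slot_id

    def __getattr__(self, name: str):
        slot = self._kernel._get_slot(self._slot_id)
        if hasattr(slot, name):
            return getattr(slot, name)
        raise AttributeError(name)


kernel = CountingKernel()
view = PerAttributeView(kernel, 0)
fields = (view.size, view.fsm_state, view.trade_id)
per_attribute_calls = kernel.ffi_calls

# Alternative: take one immutable snapshot, then read fields locally.
kernel.ffi_calls = 0
slot = kernel._get_slot(0)
same_fields = (slot.size, slot.fsm_state, slot.trade_id)
snapshot_calls = kernel.ffi_calls
print(per_attribute_calls, snapshot_calls)
```

Reading from one snapshot also gives a consistent view: the per-attribute pattern can return fields from different kernel states if the slot mutates between reads.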
O9: DITAv2LauncherBundle has no __del__ — bundle that's garbage collected leaks its entire resource tree
File: launcher.py:64-95
@dataclass
class DITAv2LauncherBundle:
    kernel: ExecutionKernel
    control_plane: ControlPlane
    projection: HazelcastProjection
    zinc_plane: ZincPlane
    venue: VenueAdapter
    def close(self) -> None:
        _maybe_close(self.venue)
        _maybe_close(self.zinc_plane)
        _maybe_close(self.control_plane)
No __del__ method. If a bundle is garbage collected without an explicit close() call:
- The Rust kernel's KernelHandle is freed by ExecutionKernel.__del__ (if GC runs)
- If RealZincPlane was in use, its close() is never called → 3 shared memory regions leaked
- If RealZincControlPlane was in use, its close() is never called → 1 shared memory region leaked
- The projection (Hazelcast) client connection is never closed
- The venue adapter's thread pool executor is never shut down
If the bundle is created and dropped in a loop (e.g., per-test setup/teardown), shared memory regions accumulate until the system runs out of /dev/shm/ space.
Severity: Medium
O10: ExecutionKernel has no close() — __del__ is the only cleanup path for the Rust handle
File: rust_backend.py:519-525
def __del__(self) -> None:
    backend = getattr(self, "_backend", None)
    if backend is not None:
        try:
            _get_rust().destroy(backend)
        except Exception:
            pass
No close() method exists on ExecutionKernel. The DITAv2LauncherBundle.close() doesn't touch the kernel (it calls _maybe_close on venue, zinc_plane, and control_plane only). The Rust _backend handle is only freed when __del__ runs during garbage collection.
If the kernel is part of a reference cycle (K3/K6 — Kernel → KernelStateView → KernelSlotView → Kernel), __del__ may be delayed indefinitely until the cycle GC runs. During that delay, the Rust KernelHandle is alive but unreachable — its memory is leaked until GC.
Severity: Medium
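A deterministic cleanup path usually means an idempotent close() plus context-manager support, with __del__ kept only as a fallback. The sketch below assumes that design. `KernelSketch` is hypothetical, and its `destroy` argument stands in for `_get_rust().destroy`.

```python
class KernelSketch:
    """Hypothetical kernel wrapper with an explicit, idempotent close()."""

    def __init__(self, destroy) -> None:
        self._destroy = destroy      # stand-in for the Rust destroy call
        self._backend = object()     # stand-in for the Rust handle

    def close(self) -> None:
        backend = getattr(self, "_backend", None)
        self._backend = None         # clear first so a repeat call is a no-op
        if backend is not None:
            self._destroy(backend)

    def __enter__(self) -> "KernelSketch":
        return self

    def __exit__(self, *exc_info) -> None:
        self.close()

    def __del__(self) -> None:
        # Fallback only; explicit close() is the primary path.
        try:
            self.close()
        except Exception:
            pass


destroyed = []
with KernelSketch(destroyed.append) as kernel:
    pass                             # handle is live inside the block
kernel.close()                       # second close is a no-op
del kernel                           # __del__ finds nothing left to free
print(len(destroyed))
```

With a close() like this available, a bundle-level close() could release the kernel handle alongside the venue and planes instead of leaving it to GC timing.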
O11: KernelSlotView.__setattr__ triggers 5 side effects including durable writes — undocumented
File: rust_backend.py:428-453
def __setattr__(self, name: str, value: Any) -> None:
    ...
    slot = self._snapshot()
    setattr(slot, name, value)
    self._kernel._set_slot(slot)  # triggers: Rust FFI write + state refresh
                                  # + account.observe_slots
                                  # + projection.write_slot
                                  # + zinc_plane.write_slot
Setting any attribute on a KernelSlotView — even something trivial like slot.some_metadata_field = "test" — triggers 5 side effects: Rust FFI write to the kernel, KernelStateView.refresh(), account.observe_slots(), projection.write_slot(), and zinc_plane.write_slot(). The method name __setattr__ gives no indication that setting a field triggers durable writes across multiple persistence layers.
There is no read-only view that prevents accidental mutation. Any code that holds a KernelSlotView reference and assigns a field bypasses all FSM guards and directly mutates the Rust kernel state.
Severity: Medium
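A read-only view can make accidental mutation fail fast. The sketch below is one possible shape under stated assumptions: `ReadOnlySlotView` is hypothetical, and a SimpleNamespace stands in for a deserialized TradeSlot.

```python
from types import SimpleNamespace


class ReadOnlySlotView:
    """Hypothetical wrapper: reads pass through, writes are rejected."""

    __slots__ = ("_slot",)

    def __init__(self, slot) -> None:
        # Bypass our own __setattr__ for the one internal field.
        object.__setattr__(self, "_slot", slot)

    def __getattr__(self, name: str):
        return getattr(self._slot, name)

    def __setattr__(self, name: str, value) -> None:
        raise AttributeError(f"slot view is read-only; cannot set {name!r}")


slot = SimpleNamespace(size=1.0, fsm_state="OPEN")
view = ReadOnlySlotView(slot)

try:
    view.size = 2.0
    rejected = False
except AttributeError:
    rejected = True

print(rejected, view.size)
```

Mutation would then go through an explicitly named method on the kernel, so the durable writes are visible at the call site rather than hidden behind attribute assignment.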
Pass 12 Summary
| # | Flaw | Layer | Severity |
|---|---|---|---|
| O1 | _maybe_close() asyncio.run without loop guard: close/disconnect silently skipped from async context | Launcher | High |
| O2 | async def connect() shims call sync venue.connect() without await: blocking pattern | Test | Medium |
| O3 | _contract_rows(client) NOT awaited in _pick_live_symbol: silent coroutine iteration crash | Test | High |
| O4 | test_exchange_event_seam_parity.py uses deprecated get_event_loop().run_until_complete() | Test | Medium |
| O5 | _run() thread pool .result() has no timeout: backend hang freezes process indefinitely | Venue | High |
| O6 | MockVenueAdapter never exercises thread-pool bridge: bridge untested in CI | Venue | Medium |
| O7 | _keepalive_loop/_rotation_sentinel fire-and-forget tasks: exceptions silently lost | Stream | Low |
| O8 | KernelSlotView.__getattr__ makes N FFI calls for N attribute accesses: no caching | Bridge | Medium |
| O9 | DITAv2LauncherBundle no __del__: GC'd bundle leaks entire resource tree | Launcher | Medium |
| O10 | ExecutionKernel no close(): Rust handle only freed by unpredictable __del__ | Bridge | Medium |
| O11 | KernelSlotView.__setattr__ triggers 5 persistence side effects: read-only view missing | Bridge | Medium |
Pass 12 Severity
| Severity | Count |
|---|---|
| High | 3 (O1, O3, O5) |
| Medium | 7 (O2, O4, O6, O8, O9, O10, O11) |
| Low | 1 (O7) |
Combined Catalog (All 12 Passes)
| Pass | Focus | Count | Critical | High | Medium | Low | Info |
|---|---|---|---|---|---|---|---|
| A | Architectural | 15 | 0 | 2 | 0 | 2 | 11 |
| T | Threading/Atomicity | 9 | 1 | 3 | 3 | 2 | 0 |
| E | E2E Trace (Pass 1) | 26 | 0 | 4 | 10 | 11 | 1 |
| F | Deep E2E (Pass 3) | 30 | 0 | 1 | 8 | 17 | 4 |
| G | Domain Scans (Pass 4) | 36 | 4 | 11 | 11 | 8 | 2 |
| H | Edge Domains (Pass 5) | 22 | 3 | 9 | 5 | 4 | 1 |
| I | Pass 6 (Math/Tests/Recovery/Security) | 22 | 3 | 11 | 4 | 2 | 2 |
| J | Pass 7 (Test Infra/Data/Rust/Env/Conn) | 16 | 0 | 7 | 7 | 2 | 0 |
| K | Pass 8 (Observability/Memory/Time/DeadCode) | 23 | 2 | 7 | 7 | 1 | 6 |
| L | Pass 9 (Contracts/Events/Network/FFI/Diffs) | 16 | 0 | 4 | 8 | 4 | 0 |
| M | Pass 10 (Runtime/TestBugs/FSM/Persistence/Metrics) | 18 | 3 | 7 | 5 | 3 | 0 |
| N | Pass 11 (Async/Sync Seams/Locks/Threading) | 10 | 4 | 1 | 3 | 1 | 1 |
| O | Pass 12 (Sync/Async Wider Scope) | 11 | 0 | 3 | 7 | 1 | 0 |
| Total | | 254 | 20 | 70 | 78 | 58 | 28 |