Files
siloqy/prod/clean_arch/dita_v2/CRITICAL_DITAv2_FLAWS.md
Codex 3d7b00e28d Snapshot PINK DITAv2 system + Sprint 0 flaw-fix verification
First commit of the previously-untracked PINK-on-DITAv2 migration system
(execution moves to the Rust kernel; policy stays on legacy DITA, so Alpha
Engine algorithmic integrity is preserved). BLUE is untouched.

Sprint 0 (safety snapshot + flaw-fix verification, MARKET single-leg scope):
- Verified Rust FSM fixes (flaws 2,4,10,11,13) by source read of lib.rs.
- Hardened 5 vacuous/guarded assertions in test_flaws.py so each flaw test
  genuinely exercises its fix. Most important: Flaw 5 now asserts capital
  moves by EXACTLY realized PnL (was entering/exiting at the same price).
- Offline suites: 533 passed, 0 failed (35 flaws + 402 kernel/accounting/
  bridge + 96 runtime/persistence/multi-exit/restart/seams).
- GATE PASS: MARKET-path-critical flaws 1,2,5 confirmed fixed + green.
- Added SPRINT0_FLAW_VERIFICATION.md report and _rust_kernel/.gitignore
  (excludes Rust target/ build artifacts).

LIMIT/partial-fill remain explicitly out of scope (MARKET-only bring-up).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 18:26:43 +02:00

32 KiB
Raw Blame History

CRITICAL: DITAv2 Execution Kernel — 13 Structural Flaws

Analysis date: 2026-05-30 Analyst: Systematic code review across Rust kernel, Python bridge, venue adapters, and test infrastructure Scope: Full DITAv2 pipeline — kernel.pyrust_backend.py_rust_kernel/src/lib.rsbingx_venue.pybingx_direct.py → BingX REST


How to read this document

Each flaw follows the same structure:

Section What you'll find
Location File path(s) and approximate line numbers
Nature What kind of defect — structural, logic, protocol, edge-case, missing-feature
Downstream effect What breaks in practice, not just what the code does wrong
Exploit / trigger The exact sequence of events that manifests the bug
Why it's not caught Why existing tests (142/142 pass) don't detect it
Fix strategy High-level approach; no patch code here

Flaw 1: Entry-order cancellation is structurally broken

Location: rust_backend.py lines ~470475 (Python bridge), _rust_kernel/src/lib.rs lines ~660685 (Rust process_intent CANCEL branch), _rust_kernel/src/lib.rs lines ~740748 (Rust on_venue_event CANCEL_ACK branch)

Nature: Missing feature / logic gap — two-layer hole

Downstream effect

A CANCEL intent submitted for an entry order (slot in ORDER_REQUESTED or ENTRY_WORKING) is silently ignored. The venue is never called, so the order remains live on the exchange. The caller receives an accepted=False, diagnostic_code=NO_ACTIVE_EXIT_ORDER outcome but no error is raised — normal execution continues.

With MARKET orders (the only type tested in the 142-scenario suite), this doesn't matter because the order fills in 13 seconds, arriving before the CANCEL even runs or making the CANCEL economically irrelevant. With LIMIT orders (per CRITICAL_NEEDED_PARTIAL_FILL_SUPPORT.md), resting orders on the book would be structurally impossible to cancel through the kernel.

Exact code path

Layer 1 — Python bridge (rust_backend.py):

elif intent.action == KernelCommandType.CANCEL:
    emitted_events = self.venue.cancel(
        self.slot(intent.slot_id).active_exit_order,  # ← None for entry-only slots
        ...
    ) if self.slot(intent.slot_id).active_exit_order else []  # ← always []

The guard if self.slot(...).active_exit_order evaluates to False for any slot that only has an entry order. emitted_events stays []. The venue's cancel() is never called.

Layer 2 — Rust kernel process_intent (lib.rs):

KernelCommandType::CANCEL => {
    if slot.active_exit_order.is_none() {
        return KernelResult {
            outcome: KernelOutcome {
                accepted: false,
                diagnostic_code: KernelDiagnosticCode::NO_ACTIVE_EXIT_ORDER,
                ...
            },
            ...
        };
    }
    // ... code only reachable if active_exit_order.is_some()
}

The Rust kernel also only looks for an exit order. It returns NO_ACTIVE_EXIT_ORDER for entry cancels.

Layer 3 — Rust kernel on_venue_event CANCEL_ACK (lib.rs):

KernelEventKind::CANCEL_ACK => {
    if slot.active_exit_order.is_some() {
        slot.active_exit_order = None;
        slot.fsm_state = TradeStage::POSITION_OPEN;
    }
}

Even if a CANCEL_ACK somehow arrived for an entry order, the Rust FSM has no branch to transition ENTRY_WORKING → IDLE on cancel. The slot would remain stuck.

Why it's not caught

The test suite has:

  • cancel_entry_order — ENTER → sleep 1s → CANCEL. By 1s the MARKET order has filled, so the slot is already POSITION_OPEN, making the CANCEL technically valid against active_exit_order? No — it's active_entry_order that's filled. But wait: when the entry fills, the Rust kernel transitions to POSITION_OPEN and keeps active_entry_order in place (filled state). active_exit_order is still None. So the CANCEL still hits NO_ACTIVE_EXIT_ORDER. But the test only checks that capital is positive and exchange is flat — it never checks outcome.accepted or outcome.diagnostic_code for the CANCEL call.
  • cancel_idempotent — Same pattern: ENTER → sleep 0.5s → CANCEL.
  • double_cancel — Same.
  • All checks are pass/fail on capital + exchange flatness, not on whether the cancel actually did anything.

Fix strategy

  1. Add an order_action field to KernelIntent (or use existing action) to distinguish entry-cancel from exit-cancel
  2. In the Python bridge, call venue.cancel() on active_entry_order when the intent is CANCEL and active_exit_order is None
  3. In the Rust kernel, add an active_entry_order branch to process_intent(CANCEL) that transitions ENTRY_WORKING / ORDER_REQUESTED → IDLE
  4. In the Rust kernel, add an active_entry_order branch to on_venue_event(CANCEL_ACK) that transitions to IDLE

Flaw 2: Rust CANCEL FSM has no entry-order reset path

Location: _rust_kernel/src/lib.rs lines ~740748

Nature: Missing FSM case — the on_venue_event handler for CANCEL_ACK only handles exit orders

Downstream effect

Even if the Python bridge were fixed to call venue.cancel() on the active entry order (fixing Flaw 1), and even if BingX returned a successful cancel-ack, the Rust kernel would not update the slot state. The slot would remain in ENTRY_WORKING with active_entry_order still attached. The kernel would believe the order is still live on the exchange.

No subsequent ENTER intent would be accepted (SLOT_BUSY). The slot would be permanently deadlocked until a manual reconcile_from_slots overwrites it.

Exact code path

KernelEventKind::CANCEL_ACK => {
    if slot.active_exit_order.is_some() {
        slot.active_exit_order = None;
        slot.fsm_state = TradeStage::POSITION_OPEN;
    }
    // No else branch — silent no-op for entry cancels
}

The full FSM transition matrix for CANCEL_ACK should include:

  • ENTRY_WORKING, active_entry_order.is_some() → clear entry order, set IDLE
  • EXIT_WORKING, active_exit_order.is_some() → clear exit order, set POSITION_OPEN (existing code)

Why it's not caught

Same reason as Flaw 1 — the cancel never fires, so CANCEL_ACK never arrives. The code path has never been exercised.

Fix strategy

Add an else if branch:

} else if slot.active_entry_order.is_some() {
    slot.active_entry_order = None;
    slot.trade_id.clear();
    slot.asset.clear();
    slot.side = TradeSide::FLAT;
    slot.size = 0.0;
    slot.initial_size = 0.0;
    slot.fsm_state = TradeStage::IDLE;
}

Flaw 3: Python process_intent overwrites outcome with mixed-epoch state

Location: rust_backend.py lines ~490505

Nature: Data consistency — returned KernelOutcome mixes pre-venue and post-venue state

Downstream effect

Any caller inspecting the returned KernelOutcome from process_intent() gets misleading information:

  • diagnostic_code is from the Rust kernel's pre-venue opinion
  • state is from the slot after venue events were processed
  • transitions only contains pre-venue transitions
  • emitted_events correctly contains post-venue events

A caller checking outcome.accepted == True and outcome.state == ORDER_REQUESTED (the Rust kernel's initial state) would be wrong — the slot is actually already in POSITION_OPEN because the fill arrived within the same function call.

Exact code path

result = _get_rust().process_intent(...)  # Rust: IDLE → ORDER_REQUESTED
outcome = _outcome_from_payload(result["outcome"])  # state=ORDER_REQUESTED

# ... venue.submit() ... on_venue_event() ... transitions slot through ENTRY_WORKING → POSITION_OPEN

final_slot = self._get_slot(outcome.slot_id)  # fsm_state=POSITION_OPEN now

final_outcome = KernelOutcome(
    state=final_slot.fsm_state,                    # POSITION_OPEN ← post-venue
    diagnostic_code=outcome.diagnostic_code,       # OK              ← pre-venue
    transitions=outcome.transitions,                # [IDLE→ORDER_REQUESTED] ← incomplete
    emitted_events=tuple(emitted_events),           # [ORDER_ACK, FULL_FILL] ← correct
)

Why it's not caught

No test inspects outcome.transitions or validates that outcome.state matches outcome.diagnostic_code. The outcome_inspect_entry test (_gen_test.py body) checks len(info["transitions"]) > 0 — which passes because there's at least one — and info["diagnostic"] == "OK". It doesn't check that the state in the outcome matches the diagnostic or that all transitions are present.

Fix strategy

Either:

  1. Re-read the Rust outcome after venue events complete (costly — additional FFI call), or
  2. Emit the venue-event transitions back from on_venue_event and append them to the returned outcome, or
  3. Document that outcome.transitions is a partial snapshot and the caller should inspect the slot directly via k.slot(n) for current state

Flaw 4: Multi-leg exit final leg can double-close and double-settle

Location: _rust_kernel/src/lib.rs lines ~775830, specifically the apply_fill exit path in on_venue_event

Nature: Logic error — redundant state mutation

Downstream effect

When a FULL_FILL closes the last leg of a multi-leg exit, the Rust kernel sets slot.fsm_state = CLOSED and slot.closed = true in two separate code blocks. Block A does it based on active_leg_index, block B does it independently based on slot.size <= 1e-12. Both blocks run on the same event.

In practice this doesn't double-settle because the Python side processes a single on_venue_event call. But the slot state after the event is unpredictable — block B clears active_entry_order and active_exit_order that block A left in place. If any code path depends on inspecting the orders after a close (e.g., for journaling), it sees inconsistent state.

Exact code path

// Block A (lines ~780-800):
if slot.active_leg_index >= slot.exit_leg_ratios.len() {
    slot.closed = true;
    slot.fsm_state = TradeStage::CLOSED;
    slot.active_exit_order = None;
}

// Block B (lines ~810-830), runs unconditionally after block A:
if !partial {
    slot.consume_exit_leg();          // advances leg index
    if slot.size <= 1e-12 {
        slot.closed = true;            // redundant
        slot.fsm_state = TradeStage::CLOSED;  // redundant
        slot.active_exit_order = None;  // redundant
        slot.active_entry_order = None; // extra — block A didn't do this
    }
}

Why it's not caught

The multi-leg exit tests (multi_leg_exit, x4_partial_hold_exit, all leg ratio variants) check capital integrity and exchange flatness. They don't inspect the slot's active_entry_order or active_exit_order after exit. The final capital assertion passes because settle() is called once per on_venue_event call regardless of how many times the slot's internal flags toggle.

Fix strategy

Restructure apply_fill for exit fills so there's a single point where CLOSED is set:

  • If active_leg_index >= ratios.len() or size <= 1e-12 after the fill → set CLOSED
  • Not both independently

Flaw 5: Capital settlement only triggers on terminal states

Location: rust_backend.py lines ~520525

Nature: Accounting accuracy — intra-trade realized PnL invisible to account projection

Downstream effect

When a LIMIT order partially fills (PARTIALLY_FILLED event), the Rust kernel correctly accumulates realized PnL on the slot:

slot.realized_pnl += realized;

But the Python bridge only pushes PnL to the account on terminal transitions:

if slot.fsm_state in {TradeStage.CLOSED, TradeStage.TRADE_TERMINAL_WRITTEN} and slot.realized_pnl != 0.0:
    self.account.settle(slot.realized_pnl)

During a partial fill that leaves the slot in EXIT_WORKING, the accumulated PnL sits on the slot but never reaches account.snapshot.capital. For a LIMIT order that partially fills over several minutes, the system's view of available capital is stale during the entire fill window. This could cause the system to incorrectly calculate available margin for concurrent positions.

Exact trigger

  1. Slot is in POSITION_OPEN with size=1.0
  2. EXIT intent → slot moves to EXIT_WORKING
  3. Venue sends PARTIALLY_FILLED: filled_size=0.3, remaining_size=0.7
  4. Rust: slot.realized_pnl += +2.50 (3% gain on 30% of position)
  5. Python: slot.fsm_state == EXIT_WORKING (not CLOSED) → settle() is NOT called
  6. account.snapshot.capital still shows pre-exit value
  7. Venue sends FULL_FILL: filled_size=0.7, remaining_size=0.0
  8. Rust: slot.realized_pnl += +5.83 (remaining), total = 8.33
  9. Python: slot.fsm_state == CLOSED → settle(8.33) → capital jumps by full amount

For 3 minutes between step 4 and step 7, all downstream consumers see wrong capital.

Why it's not caught

All 142 tests use MARKET orders that fill instantly in one shot. There is never a multi-event fill sequence for a single order. The non-instant fills come from multi-leg exits (multiple MARKET orders), where each exit is a separate process_intent call with its own on_venue_event cycle, and each eventually reaches CLOSED independently.

Fix strategy

Change the settle trigger to fire on any realized PnL change, not just on terminal state transitions:

if slot.realized_pnl != self._last_settled_pnl.get(slot.slot_id, 0.0):
    incremental = slot.realized_pnl - self._last_settled_pnl[slot.slot_id]
    self.account.settle(incremental)
    self._last_settled_pnl[slot.slot_id] = slot.realized_pnl

Or simpler: settle the delta every time on_venue_event processes a fill event, regardless of slot state.


Flaw 6: _legacy_intent() silently drops order_type and limit_price

Location: bingx_venue.py lines ~280295

Nature: Chain break — data loss at the Python level

Downstream effect

The CRITICAL_NEEDED_PARTIAL_FILL_SUPPORT.md spec adds order_type and limit_price to KernelIntent. But there are two venue adapters, and one of them strips the new fields:

BingxVenueAdapter receives KernelIntent and converts to LegacyIntent:

def submit(self, intent: KernelIntent) -> List[VenueEvent]:
    receipt = self._call_backend("submit_intent", self._legacy_intent(intent))

_legacy_intent() builds a LegacyIntent — which has no order_type or limit_price fields:

return LegacyIntent(
    timestamp=intent.timestamp,
    trade_id=intent.trade_id,
    decision_id=intent.intent_id,
    asset=intent.asset,
    action=action,
    side=side,
    reason=intent.reason,
    target_size=float(intent.target_size),
    leverage=float(intent.leverage),
    reference_price=float(intent.reference_price),
    confidence=1.0,
    bars_held=0,
    exit_leg_ratios=tuple(intent.exit_leg_ratios or (1.0,)),
    metadata=dict(intent.metadata),
    # order_type and limit_price are NOT HERE — silently dropped
)

The BingxDirectExecutionAdapter.submit_intent() receives LegacyIntent and uses intent.action, intent.side, intent.target_size, etc. — none of which carry the new fields.

MockVenueAdapter receives KernelIntent directly and would see the new fields — but it only uses intent.target_size, intent.reference_price, intent.side, and intent.action. order_type and limit_price are ignored there too.

So even after KernelIntent gains the new fields, no code path exists that reads them and passes them to the BingX REST payload.

Exact trigger

Someone constructs:

intent = KernelIntent(
    action=ENTER, trade_id="t1",
    order_type="LIMIT", limit_price=0.083456,
    ...
)
k.process_intent(intent)

The new fields survive through _intent_to_payload() to Rust (harmless — Rust ignores unknown fields), then back to Python. The Python bridge calls venue.submit(intent) with the intent that still has order_type="LIMIT". But bingx_venue.submit() converts to LegacyIntent — which drops them. bingx_direct.py sees a MARKET order.

Why it's not caught

The new fields don't exist yet. No test exercises LIMIT orders.

Fix strategy

The cleanest fix is to bypass _legacy_intent() for BingxVenueAdapter.submit() and pass KernelIntent directly to the adapter. The adapter's submit_intent() already has access to intent.asset, intent.side, etc. It just needs to receive the right type.

If BingxDirectExecutionAdapter must keep accepting LegacyIntent for backward compatibility, encode the new fields in LegacyIntent.metadata:

metadata = dict(intent.metadata)
metadata["_order_type"] = intent.order_type
metadata["_limit_price"] = intent.limit_price

Then on the adapter side, read intent.metadata.get("_order_type", "MARKET").


Flaw 7: Mock venue partial_fill_ratio applies to both entry and exit

Location: mock_venue.py lines ~6090

Nature: Test infrastructure limitation — single ratio cannot distinguish entry vs exit

Downstream effect

The MockVenueScenario has one float: partial_fill_ratio: float = 1.0. When set to, say, 0.5, every submit() call produces a PARTIALLY_FILLED event with 50% fill — regardless of whether the intent is an ENTER or an EXIT.

This makes it impossible to write a mock-venue unit test that:

  • Entry fills fully (ratio=1.0) but exit fills partially (ratio=0.5)
  • Entry fills partially (ratio=0.3) and then fills fully on a second submit
  • Different partial ratios per leg of a multi-leg exit

Exact code path

if self.scenario.emit_fill_on_submit or self.scenario.partial_fill_ratio > 0:
    fill_ratio = max(0.0, min(1.0, float(self.scenario.partial_fill_ratio)))
    fill_size = float(intent.target_size) * fill_ratio
    # ... emits PARTIALLY_FILLED or FULL_FILL based on ratio
    # No distinction between ENTER and EXIT

Why it's not caught

The mock venue is used in unit tests (test_rust_backend.py or similar), not in the live BingX e2e tests. The live tests use BingxVenueAdapter with real BingX VST, where MARKET orders always fill fully. The partial_fill_ratio path has never been used for a scenario that distinguishes entry from exit behavior.

Fix strategy

Add per-action-type ratios:

@dataclass(frozen=True)
class MockVenueScenario:
    entry_partial_fill_ratio: float = 1.0
    exit_partial_fill_ratio: float = 1.0

Or add a per-order override via intent.metadata.


Flaw 8: Per-asset price precision helper does not exist

Location: bingx_direct.py_format_quantity() exists (line ~150) but _format_price() does not

Nature: Missing feature — LIMIT orders will be rejected by BingX

Downstream effect

BingX requires the price field of a LIMIT order to have the correct decimal precision for each symbol. The _format_quantity() method resolves size_increment from the instrument provider and quantizes the quantity. No equivalent exists for price.

Without it, submitting a LIMIT order with limit_price=0.08 for TRXUSDT sends "price": "0.08" to BingX. BingX expects 6 decimal places for TRXUSDT prices (e.g., 0.083456). The order is rejected with "code": 100001, "msg": "Invalid price precision".

Symbol Approx price Required decimals limit_price value What BingX expects
TRXUSDT $0.08 6 0.083456 "0.083456"
XRPUSDT $0.52 4 0.5234 "0.5234"
ADAUSDT $0.45 4 0.4523 "0.4523"
DOGEUSDT $0.15 5 0.15234 "0.15234"
BTCUSDT $60,000 2 60000.50 "60000.50"

Why it's not caught

No LIMIT orders are submitted. All 142 tests use MARKET orders where type="MARKET" and no price field is sent.

Fix strategy

Add _format_price(self, asset: str, price: float) -> str mirroring _format_quantity:

def _format_price(self, asset: str, price: float) -> str:
    instrument = self._resolve_instrument(asset)
    if instrument is not None:
        try:
            price_step = Decimal(str(instrument.price_increment.as_decimal()))
            value = Decimal(str(price))
            quantized = (value / price_step).to_integral_value(rounding=ROUND_DOWN) * price_step
            return _decimal_text(quantized)
        except Exception:
            pass
    return f"{price:.8f}".rstrip("0").rstrip(".")

The instrument provider already exposes price_increment — it just needs to be accessed.


Flaw 9: Cancel path falls back to trade_id as symbol

Location: bingx_venue.py lines ~300310 (within cancel())

Nature: Logic error — wrong variable in fallback chain

Downstream effect

When BingxVenueAdapter.cancel() is called and the order's metadata dict lacks an "asset" key, it falls back:

asset = str(order.metadata.get("asset") or order.internal_trade_id or order.venue_client_id or "")

order.internal_trade_id is the system's trade_id (e.g., "cancel-idle-1712345678"). This gets fed to self.backend._instrument_venue_symbol(asset) which does:

def _instrument_venue_symbol(self, asset: str) -> str:
    text = _normalize_symbol(asset)  # "CANCEL-IDLE-1712345678"
    if text.endswith("USDT"):
        return f"{text[:-4]}-USDT"  # "CANCEL-IDLE-1712345678"-USDT — nonsense
    return text  # doesn't end with USDT → returns the garbage

The cancel HTTP call is sent to BingX with a symbol that doesn't exist. BingX returns an error or silently ignores the request. The cancel silently fails.

This can happen whenever a VenueOrder is constructed without metadata["asset"]. The mock venue's _event_from_order sets metadata={"intent_id": ..., "action": ...} but does not include "asset". So any cancel path triggered from a mock venue event will hit this bug.

Exact trigger sequence

  1. MockVenueAdapter.submit() creates a VenueOrder with metadata={"intent_id": ..., "action": ...} — no "asset"
  2. The kernel attaches this order to the slot
  3. A CANCEL intent arrives
  4. Python bridge calls self.venue.cancel(self.slot(slot_id).active_entry_order)
  5. BingxVenueAdapter.cancel() does order.metadata.get("asset") → None
  6. Falls back to order.internal_trade_id → a trade_id string
  7. Sends delete to BingX with a bogus symbol

Note: this only occurs when the mock venue is used in a test configuration. In live mode, BingxDirectExecutionAdapter stores richer metadata. But the fallback chain is still wrong and could bite in edge cases.

Why it's not caught

The live tests always have metadata["asset"] populated because the kernel attaches it before calling the venue. The mock venue's cancel path is only exercised in unit tests that don't check the BingX HTTP call content.

Fix strategy

Change the fallback to use the order's internal_trade_id to look up the slot's asset from the kernel, not try to interpret it as a symbol:

# In cancel(), before the fallback:
slot = self._kernel.slot(order.metadata.get("slot_id", 0))
asset = str(order.metadata.get("asset") or slot.asset or "")

Or at minimum, add the asset to the mock venue's event metadata.


Flaw 10: Event dedup window is bounded at 64

Location: _rust_kernel/src/lib.rs lines ~5 (constant), ~850855 (eviction logic)

Nature: Resource management — fixed-size ring buffer with silent eviction

Downstream effect

Each TradeSlot tracks seen events in seen_event_ids: Vec<String>. When the vector exceeds 64 entries, the oldest entries are drained:

if slot.seen_event_ids.len() > MAX_SEEN_EVENT_IDS {
    let overflow = slot.seen_event_ids.len() - MAX_SEEN_EVENT_IDS;
    slot.seen_event_ids.drain(0..overflow);
}

This means:

  • Events 164 are deduplicated correctly
  • When event 65 arrives, event 1 is evicted. If event 1 arrives again, it's accepted as new
  • When event 66 arrives, event 2 is evicted, etc.
  • After 64 unique events, the dedup window is a rolling window of the last 64 events

With MARKET orders (13 events per trade), a slot would need ~2060 trades before cycling through 64 events. With LIMIT orders that may receive many partial fills per order (e.g., a resting order that gets 5 fills/hour over 6 hours = 30 events), the limit could be hit in a single trade.

Why it's not caught

No test submits more than ~30 events to a single slot (rapid_ten_cycle does 10 entry→exit cycles = ~30 events). The 64 limit was never reached.

Fix strategy

Either:

  1. Increase MAX_SEEN_EVENT_IDS to a larger value (256 or 1024), or
  2. Use a proper LRU/size-bounded set (e.g., LruCache from the lru crate), or
  3. Change to a HashMap-based dedup keyed by (event_id, action) so eviction is explicit

Flaw 11: Reconcile is a raw state override with no FSM validation

Location: _rust_kernel/src/lib.rs lines ~900915 (dita_kernel_reconcile_slots_json)

Nature: Safety — no guards on incoming state

Downstream effect

The reconcile function blindly overwrites slot state:

for slot in slots {
    if slot.slot_id < core.slots.len() {
        core.slots[slot.slot_id] = slot.clone();
    }
}

There is zero validation that the incoming slot state is a valid successor to the current state. A caller could:

  • Set fsm_state = POSITION_OPEN with size = 0.0 — the kernel thinks it has an open position with no size
  • Set fsm_state = CLOSED with size = 5.0 — the kernel thinks a position is closed but still has size
  • Set fsm_state = ENTRY_WORKING with trade_id = "" — the kernel is in "entry working" state for no trade
  • Clear seen_event_ids to reset dedup — silently accepting duplicates

The intended use is restoring kernel state from a snapshot after a crash, where the slot state was explicitly serialized by a previous kernel.snapshot(). In that case the state should be self-consistent. But there's no guard against malformed or corrupted snapshot data.

Why it's not caught

The reconcile tests (reconcile_empty, reconcile_after_entry, etc.) all reconcile with self-consistent slot data from k.slot(0). They never feed malformed state. The fresh_kernel_reconcile_* tests similarly use _slot_from_payload on data serialized from a real slot.

Fix strategy

Add validation in the Rust kernel (or Python bridge) that checks basic consistency:

  • fsm_state == POSITION_OPENsize > 0 and asset non-empty
  • fsm_state == IDLEsize == 0 and trade_id empty
  • fsm_state == CLOSEDclosed == true
  • size >= 0
  • slot_id matches array index

Flaw 12: outcome.transitions is incomplete — pre-venue only

Location: rust_backend.py lines ~490505, _rust_kernel/src/lib.rs lines ~700710

Nature: API contract — returned data is a partial snapshot

Downstream effect

process_intent() runs three phases in sequence:

  1. Rust kernel processes the intent (pure FSM: IDLE → ORDER_REQUESTED)
  2. Venue adapter submits to exchange (HTTP call, receives ack + fill)
  3. on_venue_event called per venue response (ORDER_ACK → ENTRY_WORKING, FULL_FILL → POSITION_OPEN)

Each phase produces KernelTransition records. But only phase 1 transitions appear in the returned KernelOutcome.transitions:

final_outcome = KernelOutcome(
    ...
    transitions=outcome.transitions,         # from Rust — phase 1 only
    emitted_events=tuple(emitted_events),    # from venue — phases 2-3
    ...
)

A caller inspecting transitions sees [IDLE → ORDER_REQUESTED] and has no way to discover that [ORDER_REQUESTED → ENTRY_WORKING] and [ENTRY_WORKING → POSITION_OPEN] also occurred. The journal (ClickHouseKernelJournal) records all transitions correctly — but the returned KernelOutcome is the API surface that callers interact with.

Why it's not caught

The outcome_inspect_entry test checks len(info["transitions"]) > 0 and info["diagnostic"] == "OK". It doesn't validate that all expected transitions are present. The transitions are journaled to the debug sink, but no test reads the journal.

Fix strategy

Collect transitions from phases 2-3 and append them to the outcome:

all_transitions = list(outcome.transitions)
for event in emitted_events:
    event_outcome = self.on_venue_event(event)
    all_transitions.extend(event_outcome.transitions)
final_outcome = KernelOutcome(..., transitions=tuple(all_transitions), ...)

Or document that transitions is an incomplete snapshot and the journal is the authoritative source.


Flaw 13: Slot realized PnL is not reset on re-entry after partial exit

Location: _rust_kernel/src/lib.rs lines ~575600 (ENTER intent handler), specifically slot reset

Nature: State leakage — accumulated PnL from prior trade survives into next cycle

Downstream effect

When an ENTER intent arrives, the Rust kernel resets most slot fields:

slot.trade_id = intent.trade_id.clone();
slot.asset = intent.asset.clone();
slot.side = intent.side.clone();
slot.entry_time = Some(intent.timestamp);
slot.entry_price = 0.0;
slot.size = 0.0;
slot.initial_size = 0.0;
slot.unrealized_pnl = 0.0;
slot.realized_pnl = 0.0;         // ← reset to zero
slot.exit_leg_ratios = ...;
slot.active_leg_index = 0;
slot.active_entry_order = None;
slot.active_exit_order = None;
slot.closed = false;
slot.last_event_time = None;
slot.fsm_state = TradeStage::ORDER_REQUESTED;

slot.realized_pnl = 0.0 is explicitly set — correct for a fresh trade. But recall from Flaw 5 that realized PnL from partial fills (before the terminal close) may not yet have been settled to the account. If the slot accumulates realized PnL during partial fills, then re-enters before the final settle happens, the in-flight PnL is zeroed without being settled.

This is actually the correct behavior because:

  1. All MARKET-order fills settle immediately (they arrive as FULL_FILL and transition to CLOSED in one shot)
  2. For LIMIT orders that partially fill, the re-entry scenario is impossible because the slot isn't IDLE — it can't accept a new ENTER until the position is fully closed
  3. The slot CAN re-enter after a full close, and by then all PnL has been settled

So this is a latent rather than active flaw. It would manifest if:

  1. A LIMIT order partially fills (PnL on slot, not settled)
  2. The remaining limit is cancelled
  3. The slot's consume_exit_leg() leaves the slot in POSITION_OPEN with size > 0 and !closed but no active orders
  4. Another ENTER arrives — but the Rust kernel rejects it because !slot.is_free()

So the slot design prevents this from happening accidentally. The flaw is that if a future code path bypasses the is_free() check (e.g., a force-enter feature), the unreleased PnL would be silently zeroed.

Why it's not caught

The scenario can't happen with the current FSM. All fills eventually reach CLOSED, which triggers settle. No test forces an entry on a non-free slot.

Fix strategy

Add an explicit assertion or sentinel in the ENTER handler:

if slot.realized_pnl.abs() > 1e-10 {
    // Log warning: unsynchronized PnL being discarded
}

Or enforce that settle() is always called before realized_pnl is reset, by moving the settle trigger to the Rust side.


Summary table

# Flaw Layer Severity Blocks partial-fill?
1 Entry-order cancellation broken Python + Rust Critical Yes — can't cancel resting LIMIT entries
2 No CANCEL_ACK → IDLE for entry Rust FSM Critical Yes — slot stuck after cancelled entry
3 Outcome mixes pre/post-venue state Python bridge Medium No
4 Multi-leg exit double-close Rust FSM Low No
5 Capital settle only on terminal state Python bridge High Partial — stale capital during partial fills
6 order_type/limit_price dropped in legacy intent Python venue Critical Yes — LIMIT orders never reach BingX
7 Mock venue single ratio for entry+exit Mock venue Low No (mock tests only)
8 Missing price formatting Adapter High Yes — BingX rejects bad price precision
9 Cancel falls back to trade_id as symbol Python venue Medium No
10 Event dedup window at 64 Rust FSM Low No
11 Reconcile has no FSM validation Rust FSM Low No
12 Outcome transitions incomplete Python bridge Medium No
13 Unsettled realized PnL on re-entry Rust FSM Low No

6 critical/high — must be fixed before safe LIMIT order / partial-fill deployment. 4 medium — should be fixed in the same pass to keep hygiene. 3 low — latent; fix opportunistically.