Six variants. **No `EXIT_RESIDUAL`.** If any caller submits an intent with `action = "EXIT_RESIDUAL"`, the string_enum deserializer fails — serde returns `INVALID_INTENT_PARSE`. Even if deserialization worked, there's no branch to handle residual-position cleanup. Any position with remaining size after partial exit legs has **no way to trigger a clean-up exit** via the intent system.
The Python `KernelCommandType` enum (contracts.py) does have `EXIT_RESIDUAL`, translated to `"EXIT_RESIDUAL"` string by `_intent_to_payload`. This string hits Rust's string_enum → parse error → `INVALID_INTENT_PARSE`.
**Fix:** Add `EXIT_RESIDUAL` variant to Rust enum + match arm that skips the `NO_OPEN_POSITION` guard for residual-sized positions.
`CString::new()` returns `Err` if the string contains a NUL (`'\0'`) byte. `.unwrap()` panics at the C FFI boundary. If any `serde_json::to_string()` output (e.g., user-controlled string in `KernelIntent`, `VenueEvent`, or `TradeSlot`) contains a NUL byte, this **panics the entire process**.
Triggered by every FFI call that returns a string:
- `dita_kernel_process_intent_json`
- `dita_kernel_on_venue_event_json`
- `dita_kernel_reconcile_slots_json`
- `dita_kernel_snapshot_json`
- `dita_kernel_get_slot_json`
**Fix:** Replace `.unwrap()` with `unwrap_or_else(|_| ptr::null_mut())` or feed through `invalid_intent_cstring`.
(a) **Transition prev_state is a lie.** If the slot was in `EXIT_WORKING`, `EXIT_SENT`, `EXIT_REQUESTED`, or `POSITION_PARTIALLY_CLOSED`, the transition record says `POSITION_OPEN` — wrong.
(b) **Backward transition.** If the slot is `EXIT_WORKING` and a new EXIT intent arrives, `fsm_state` is set to `EXIT_REQUESTED` — a backward transition from `EXIT_WORKING` → `EXIT_REQUESTED`. This corrupts the FSM.
(c) **No state guard.** EXIT should only be allowed from `POSITION_OPEN`, `EXIT_WORKING` (for additional legs), or `POSITION_PARTIALLY_CLOSED`. Currently any state that passes `!is_free() && !closed && size > 0` can transition to `EXIT_REQUESTED`.
**Fix:** Check actual FSM state before allowing EXIT, log actual prev_state, guard against backward transitions.
**Severity: Critical**
#### G4: `consume_exit_leg` advances beyond last valid index — stale `all_legs_done` variable
**File:** `_rust_kernel/src/lib.rs:1420-1435`
```rust
let all_legs_done = slot.active_leg_index >= slot.exit_leg_ratios.len(); // (A)
slot.consume_exit_leg(); // (C) — advances active_leg_index POST (A)
}
if should_close && slot.size <= 1e-12 { // (D) — close
} else if !partial && !all_legs_done { // (E) — stale! uses (A) not post-advance index
```
On the last leg (`active_leg_index = len - 1`):
- (A): `all_legs_done = false` (pre-advance)
- (C): advances to `len` (exhausted)
- (E): `!partial && !false` = true → enters `POSITION_OPEN` instead of examining `should_close` with post-advance index
The `all_legs_done` variable is captured **before** `consume_exit_leg` advances the index. Branch (E) should use the post-advance index to correctly detect exhaustion.
After exhaustion, `next_exit_ratio()` returns `1.0` (out-of-bounds `unwrap_or(1.0)`) — silently tries to exit remaining size as 100% instead of detecting completion.
**Severity: Critical**
#### G5: `realized_pnl` uses unbounded f64 — overflows to inf at extreme values
**File:** `_rust_kernel/src/lib.rs:648-656`
```rust
let notional = exit_size * slot.entry_price * slot.leverage.max(1.0);
delta * notional
```
No `is_finite()` check on intermediate products. At `exit_price=1e200`, `entry_price=1e-200`: `delta` = `(1e200 - 1e-200) / 1e-200` ≈ `1e400` → `inf`. The resulting `inf` is stored in `slot.realized_pnl`, corrupting all future PnL tracking.
Subnormals: `entry_price=5e-324` (the smallest positive subnormal `f64`) makes the division overflow to `inf` even for modest exit prices, because the quotient exceeds `f64::MAX` (about `1.8e308`).
**Fix:** Add `is_finite()` guards on both prices and cap intermediate products.
If any of `delta`, `size`, `entry_price`, or `leverage` is extreme, the product overflows to `inf`. No result guard. `inf` stored in `unrealized_pnl` forever. Capped only by the `price <= 0.0` guard on input — no guard on the computation chain.
Also: `self.entry_price = price` at line 388 overwrites entry_price on every mark_price call for a position with `entry_price <= 0.0`, even when the position has been open for a while. This means a stale-zero entry_price gets set to the current market price on first mark_price after open, which is correct — but if the slot is reused (re-entry without resetting entry_price), the old entry price from the prior trade bleeds into unrealized PnL.
**Severity: High**
#### G7: `process_intent` ENTER — no `is_finite()` guard on `target_size`
**File:** `_rust_kernel/src/lib.rs:806-807`
```rust
intended_size: intent.target_size.max(0.0),
```
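`f64::INFINITY.max(0.0)` returns `inf`, so an infinite `target_size` passes through unchanged. `f64::NAN.max(0.0)` returns `0.0`, because `f64::max` ignores a NaN operand, so NaN is neutralized here only by accident. Note that `NaN` and `Infinity` are not valid JSON tokens and serde_json rejects them by default, while Python's `json.dumps` emits them unless `allow_nan=False` is set. A non-finite value that gets past the Python-side `_first_invalid_intent_field` guard (F3 allows these through) is therefore more likely to surface as a parse error at the FFI boundary than as a stored value. The Rust side should still not rely on JSON parsing as its only defence: without an explicit `is_finite()` check, an `inf` that reaches `target_size` by any route propagates into `intended_size` in `VenueOrder`, corrupting all fill calculations.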
Similarly, `reference_price` is never validated for finiteness before being stored in `VenueOrder.metadata`.
**Severity: High**
#### G8: `reconcile_slots_json` — no dedup or bounds validation
**File:** `_rust_kernel/src/lib.rs:1668-1675`
```rust
for slot in slots {
    if slot.slot_id < core.slots.len() {
        core.slots[slot.slot_id] = slot.clone();
    }
}
```
Two slots with the same `slot_id`: the **second overwrites the first** silently. A slot with `slot_id >= core.slots.len()`: **silently dropped** — no error, no diagnostic. Caller sees `accepted=true` even if some/all slots were not applied.
**Severity: High**
#### G9: `exchange_order_id` propagation uses wrong order target
**File:** `_rust_kernel/src/lib.rs:1110-1125`
```rust
let target = if slot.active_entry_order.is_some() {
    slot.active_entry_order.as_mut()
} else {
    slot.active_exit_order.as_mut()
};
```
If an **entry** order exists (even if fully filled) and an **exit** fill event arrives, the code updates the entry order's `venue_order_id` instead of the exit order's. The exit order's `venue_order_id` stays empty. Any subsequent `CANCEL` intent on the exit order fails because `active_exit_order.venue_order_id` is empty — the venue can't match the cancel.
**Fix:** Disambiguate by matching `venue_client_id`, or clear `active_entry_order` when entry is complete.
**Severity: High**
#### G10: CANCEL diagnostic code says NO_ACTIVE_EXIT_ORDER for entry cancel too
**File:** `_rust_kernel/src/lib.rs:966-1005`
```rust
if !has_cancellable_exit && !has_cancellable_entry {
```
When neither exit nor entry is cancellable, the diagnostic returns `NO_ACTIVE_EXIT_ORDER` regardless of which order was the target. If the user wanted to cancel an entry order that's not in a cancellable state, the diagnostic is misleading.
**Fix:** Separate diagnostic codes: `NO_ACTIVE_EXIT_ORDER`, `NO_ACTIVE_ENTRY_ORDER`, `ENTRY_NOT_CANCELLABLE`.
**Severity: High**
#### G11: `apply_fill` entry-fill overwrites `active_entry_order.intended_size` with `slot.size`
**File:** `_rust_kernel/src/lib.rs:1363-1377`
On FULL_FILL entry, `slot.active_entry_order` is entirely replaced with a new `VenueOrder` where `intended_size = slot.size` (the fill amount) instead of the original intended size. The original intended size (which could be larger than fill size for partial fills) is lost.
If a duplicate fill event arrives (dedup fails due to missing event_id), the second fill would use `slot.size` as the basis for further fills — wrong values.
**Severity: Medium**
#### G12: `leverage` unbounded after `is_finite()` — no maximum cap
**File:** `_rust_kernel/src/lib.rs:778`
```rust
slot.leverage = if intent.leverage.is_finite() && intent.leverage > 0.0 {
    intent.leverage // 1e100 accepted here
} else { 1.0 };
```
`leverage = 1e100` passes `is_finite()`. Feeds into `realized_pnl()` as `slot.leverage.max(1.0) = 1e100`, producing `notional = exit_size * entry_price * 1e100`. Makes `unrealized_pnl` arbitrarily large.
No maximum leverage cap enforced anywhere — the exchange-level cap (`DOLPHIN_BINGX_EXCHANGE_LEVERAGE_CAP`) exists in `BingxExecClientConfig` but is **never passed to the Rust kernel**.
**Severity: Medium**
#### G13: `resolve_slot` fallback returns `unwrap_or(0)` — can misroute events
When no slot matches the event (`slot_id` out of range or all slot filters fail), returns `slot_id` of the **first slot** (which may be 0 or any value). No diagnostic emitted — caller sees slot state change with no idea the event was misrouted.
Mutations to out-of-bounds slot are silently discarded. Can happen if `slot.slot_id` is corrupted via `set_slot_from_json` causing index mismatch between `slot.slot_id` and the actual slot position.
**Severity: Medium**
---
### Configuration & Validation Chain
#### G15: Zero `__post_init__` validators on all config dataclasses
Every config dataclass in the system has zero field-level validation:
| Dataclass | Fields | Validators |
|-----------|--------|------------|
| `KernelControlSnapshot` | 16 | **0** |
| `ControlUpdate` | 16 | **0** |
| `KernelIntent` | 19 | **0** |
| `TradeSlot` | 22 | **0** |
| `VenueOrder` | 8 | **0** |
| `VenueEvent` | 18 | **0** |
| `KernelTransition` | 11 | **0** |
| `KernelOutcome` | 8 | **0** |
| `AccountSnapshot` | 9 | **0** |
| **Total** | **127** | **0** |
The only validation in the entire chain:
- `_first_invalid_intent_field()`: finiteness guard at the Python→Rust FFI boundary (not a dataclass validator)
- Rust `leverage = if is_finite && > 0.0 { val } else { 1.0 }`: post-hoc clamp
- Rust `KernelCore::new(max_slots.max(1))`: floor only, no ceiling
- `launcher.py:143`: `max(1, int(...))` for `active_slot_limit`, floor only
**No `__post_init__` exists anywhere. No bounds check on any field except the two floor-only guards.**
**Severity: High**
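A minimal sketch of what a `__post_init__` validator could look like. It uses a hypothetical `ExampleIntent` dataclass rather than the real `KernelIntent`, whose fields are not reproduced here; the field names and the leverage ceiling of 125 are assumptions for illustration only.

```python
import math
from dataclasses import dataclass

MAX_LEVERAGE = 125.0  # assumed ceiling, not taken from the codebase


@dataclass(frozen=True)
class ExampleIntent:
    """Hypothetical stand-in for one of the config/intent dataclasses."""
    target_size: float
    leverage: float = 1.0

    def __post_init__(self) -> None:
        # Reject NaN/inf and out-of-range values at construction time.
        if not math.isfinite(self.target_size) or self.target_size < 0.0:
            raise ValueError(f"target_size must be finite and >= 0, got {self.target_size!r}")
        if not math.isfinite(self.leverage) or not (0.0 < self.leverage <= MAX_LEVERAGE):
            raise ValueError(f"leverage must be in (0, {MAX_LEVERAGE}], got {self.leverage!r}")
```

Validating in `__post_init__` means every construction path, including deserialization, passes through the same checks.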
#### G16: `DITA_V2_DEBUG_CLICKHOUSE` defaults to `True` when env var is unset
`_env_bool` (launcher.py:75) returns `default` when the env var is unset. So `debug = True` by default. Every runtime writes debug traces to ClickHouse by default. `DITA_V2_DEBUG_CLICKHOUSE=False` is required to disable it.
This is not a bug per se, but it means debug ClickHouse writes are **on by default**, adding ~10 ClickHouse insertions per process_intent call (every transition + position state + trade event) that most production deployments may not want.
**Severity: Informational**
#### G17: String config fields have no charset/length validation — Zinc region injection risk
`runtime_namespace`, `strategy_namespace`, `event_namespace`, `actor_name`, `exec_venue`, `data_venue`, `ledger_authority` are all free-form strings with no validation. They're used as:
1. **Zinc shared memory region names**: `self.prefix + "." + namespace + "." + kind`, so an attacker-controlled namespace could collide with other processes' Zinc regions
2. **ClickHouse table names**: `DOLPHIN_BINGX_JOURNAL_STRATEGY` is used as a table suffix, creating SQL injection risk in the ClickHouse journal
3. **Hazelcast map names**: same injection risk via `event_namespace`
**Severity: Medium**
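One way to close this off is an allow-list check applied to every namespace-like field before it is used to build a region, table, or map name. The sketch below is illustrative: `validate_name` is a hypothetical helper, and the character set and 64-character limit are assumed policy, not something the codebase defines.

```python
import re

# Assumed policy: lowercase letters, digits, underscore; 1 to 64 characters.
_NAME_RE = re.compile(r"[a-z0-9_]{1,64}")


def validate_name(value: str, field: str) -> str:
    """Return value unchanged if it matches the allow-list, else raise ValueError."""
    if not isinstance(value, str) or _NAME_RE.fullmatch(value) is None:
        raise ValueError(f"{field} has invalid characters or length: {value!r}")
    return value
```

Excluding `.` from the allowed set also prevents a namespace from forging the `prefix.namespace.kind` separator.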
#### G18: `exit_leg_ratios` no sum-to-1 validation
`KernelIntent.exit_leg_ratios` and `TradeSlot.exit_leg_ratios` are tuple/list of floats. No validator ensures they sum to approximately 1.0. Ratios summing to 0.5 leave the position partially closed forever (residual can't be exited because `next_exit_ratio()` returns `1.0` after exhaustion, exiting 100% of remaining — which may exceed the intended residual).
**Severity: Low**
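A hedged sketch of a validator for the ratios, assuming they are meant as fractions of the original position that sum to 1. `validate_exit_leg_ratios` is a hypothetical helper and the tolerance is an arbitrary choice.

```python
import math
from typing import Sequence


def validate_exit_leg_ratios(ratios: Sequence[float], tol: float = 1e-9) -> tuple:
    """Check each ratio is finite and in (0, 1], and that they sum to 1 within tol."""
    out = tuple(float(r) for r in ratios)
    if not out:
        raise ValueError("exit_leg_ratios must not be empty")
    for r in out:
        if not math.isfinite(r) or not (0.0 < r <= 1.0):
            raise ValueError(f"ratio out of range: {r!r}")
    total = math.fsum(out)  # fsum avoids accumulating rounding error
    if not math.isclose(total, 1.0, abs_tol=tol):
        raise ValueError(f"ratios sum to {total!r}, expected 1.0")
    return out
```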
#### G19: `RealZincControlPlane.read()` has no sequence check — torn-read risk
**File:** `real_control_plane.py:88-94`
```python
def read(self):
    payload = _decode_packet(self.region.as_buffer())
    control = payload.get("control")
    if not isinstance(control, dict):
        return self._snapshot
    self._snapshot = KernelControlSnapshot(**control)
    return self._snapshot
```
The binary packet has a 64-bit sequence number but `read()` **never checks it**. Between the zero-write and packet-write in `_write_region`, a reader sees an empty buffer → `_decode_packet` fails → falls back to `self._snapshot` (stale). Between the packet-write and `struct.pack` header (order depends on implementation), a reader sees a partial write with wrong size → `_decode_packet` fails.
No checksum on the wire format: `struct.pack("!QQ", seq, len) + json_bytes`. A torn write produces garbage that `json.loads` may or may not parse successfully.
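To illustrate how a checksum makes torn writes detectable, here is a sketch of a framing that appends a CRC32 to the header. The extra header field, the function names, and the choice of CRC32 are all assumptions; this is not the existing wire format.

```python
import json
import struct
import zlib

# seq, payload length, CRC32 of payload (the CRC field is an assumed addition)
_HEADER = struct.Struct("!QQI")


def encode_packet(seq: int, payload: dict) -> bytes:
    body = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    return _HEADER.pack(seq, len(body), zlib.crc32(body)) + body


def decode_packet(buf):
    """Return (seq, payload), or None if the buffer is empty, torn, or corrupt."""
    if len(buf) < _HEADER.size:
        return None
    seq, size, crc = _HEADER.unpack_from(buf, 0)
    body = bytes(buf[_HEADER.size:_HEADER.size + size])
    if size == 0 or len(body) != size or zlib.crc32(body) != crc:
        return None
    try:
        return seq, json.loads(body.decode("utf-8"))
    except (UnicodeDecodeError, ValueError):
        return None
```

A reader that gets `None` can retry, and one that gets a `seq` no greater than the last one it accepted knows the data is not new.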
These are used as ClickHouse table and database name suffixes in `pink_clickhouse.py`. An attacker who can set env vars can inject SQL via semicolons or quotes in the table name. ClickHouse supports `INSERT INTO db.table FORMAT JSONEachRow` — a table name like `positions; DROP TABLE ...;` could be destructive.
**Severity: Low** (requires env var control, which implies broader access)
---
### Persistence Schema Alignment
#### G21: `entry_price` used as `exit_price` in `trade_events` — data loss
The `_write_trade_event` function maps `entry_price` from `slot.to_dict()` to both the `entry_price` and `exit_price` columns. The actual exit fill price (available on the `VenueEvent` object) is **never written** to the `exit_price` column.
**Result:** Every `trade_events` row has `exit_price == entry_price`. The `exit_price` column is a dead column — always contains the entry price, never the actual fill.
**Severity: High** — data loss to DB for the most important trade metric.
```python
"entry_bar": int(slot_dict.get("active_leg_index", 0) or 0),
```
`active_leg_index` tracks the exit-leg-ratios cursor (which leg of a multi-leg exit we're on), not a bar count. The value `0` at position open and `1` after the first exit leg — neither value represents bars held. **The `entry_bar` column stores the wrong concept.**
`capital_before` is reconstructed by subtracting the current leg's PnL from the current capital. In a multi-slot system, other slots' PnL changes between legs are absorbed into `capital_before`. The column is **always wrong** in multi-slot scenarios because `capital_after` reflects total PnL from all slots, not just the leg being recorded.
**Severity: Medium** — wrong `capital_before` for multi-slot trading.
#### G24: Recovery `trade_reconstruction` always has `trade_id=""`
The `persist_recovery_state` function passes `kernel.snapshot()["account"]` (an account dict with keys `capital, equity, realized_pnl, ...`) where a slot dict is expected. The `trade_id` key **does not exist** on the account dict. The `recovery_state` row always has `trade_id=""`.
**Severity: Medium** — recovery data is not associable with any trade.
#### G25: `seen_event_ids`, `exit_leg_ratios`, `VenueOrder`, `metadata` not in flat ClickHouse tables
These fields are:
- Present on the Python `TradeSlot` ✅
- Transmitted through Zinc shared memory ✅
- Stored in Hazelcast ✅
- Stored in ClickHouse `dita_kernel_debug` (full JSON) ✅
- **NOT extracted** into main ClickHouse flat tables `position_state`, `trade_events`, `trade_exit_legs` ❌
Data exists at the source, travels through the pipeline, hits the debug journal — but is lost in the main analytical tables.
**Severity: Low** (data exists in debug journal if needed for reconstruction)
#### G26: `_safe_float` silently converts NaN/None/Inf to 0.0
**File:** `utils.py:15`
```python
def _safe_float(v, default=0.0):
    try:
        f = float(v)
        if not math.isfinite(f):
            return default
        return f
    except (TypeError, ValueError, OverflowError):
        return default
```
Used in multiple ClickHouse writers. Silently converts `NaN`/`Inf`/parsing errors to `0.0`. No diagnostic emitted when a non-finite value reaches the persistence layer — data silently zeroed.
**Severity: Low** (safe default but silent corruption)
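A possible variant that keeps the safe default but makes the substitution visible. `safe_float_logged` and its `field` parameter are hypothetical names; the sketch assumes the standard `logging` module is acceptable in the writers.

```python
import logging
import math

log = logging.getLogger(__name__)


def safe_float_logged(v, default=0.0, field="?"):
    """Like _safe_float, but emit a warning whenever the default is substituted."""
    try:
        f = float(v)
    except (TypeError, ValueError, OverflowError):
        log.warning("non-numeric value for %s: %r, using %r", field, v, default)
        return default
    if not math.isfinite(f):
        log.warning("non-finite value for %s: %r, using %r", field, v, default)
        return default
    return f
```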
---
### Lifecycle & Resource Management
#### G27: `build_launcher_bundle` has no exception safety — prior resources leak
**File:** `launcher.py:264-300`
```python
def build_launcher_bundle(...):
    control_plane = _build_control_plane(...)
    projection = build_projection(...)
    zinc_plane = _build_zinc_plane(...)
    venue = _build_venue(...)
    kernel = ExecutionKernel(...)  # ← if THIS fails, everything above leaks
```
If any step after the first raises, all previously built resources leak:
- `RealZincPlane` created → `_build_venue()` fails → 3 shared memory regions orphaned
- `RealZincControlPlane` created → `_build_zinc_plane()` fails → 1 shared memory region orphaned
- `BingxVenueAdapter` created → `ExecutionKernel.__init__()` fails → HTTP connection leaked
**No `try/finally` anywhere in the builder.** The init order is also optimized for forward construction, not backward cleanup.
**Severity: High** — shared memory leak on any build failure.
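A sketch of exception-safe construction using `contextlib.ExitStack` from the standard library. `build_bundle` and its `factories` argument are hypothetical; the sketch assumes every resource exposes a `close()` method.

```python
from contextlib import ExitStack


def build_bundle(factories):
    """Build resources in order; if any factory raises, close those already built.

    `factories` is a hypothetical list of zero-argument callables, each
    returning an object with a close() method.
    """
    with ExitStack() as stack:
        built = []
        for factory in factories:
            resource = factory()
            stack.callback(resource.close)  # runs in reverse order on failure
            built.append(resource)
        stack.pop_all()  # success: cancel the cleanup, caller now owns everything
        return built
```

Because callbacks run last-in first-out, cleanup order mirrors construction order without any extra bookkeeping.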
#### G28: `RealZincPlane` and `RealZincControlPlane` have no `__del__`
When `close()` is not called (exception in builder, forgotten cleanup, GC during shutdown), the shared memory regions opened by `RealZincPlane` (3 regions) and `RealZincControlPlane` (1 region) are **orphaned on the OS**. They persist in `/dev/shm/` (or platform equivalent) until system reboot.
Python's `__del__` is unreliable (not called on SIGKILL, not called if the object is part of a cycle without a GC run), but its absence means even normal garbage collection can't clean up.
**Severity: High** — shared memory leaks.
#### G29: Zero signal handlers — no cleanup on SIGTERM/SIGINT
```bash
$ grep -rn "signal\|SIGTERM\|SIGINT\|atexit" *.py # ZERO matches
```
When SIGTERM or SIGINT arrives:
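1. With no handler installed, SIGTERM terminates the process immediately (the OS default action), and SIGINT raises `KeyboardInterrupt`, which nothing catches to run cleanup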
2. No `DITAv2LauncherBundle.close()` is called
3. No `ExecutionKernel.__del__` is called (CPython may run GC on normal exit but not reliably)
4. All shared memory (RealZincPlane, RealZincControlPlane) is orphaned
5. In-flight BingX HTTP calls are interrupted mid-stream
6. Rust kernel handle is leaked
**Severity: High**
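A minimal sketch of a shutdown hook, assuming a single `close_fn` callable (for example the bundle's `close`) is enough to release everything. Both function names are hypothetical, and `signal.signal` may only be called from the main thread.

```python
import signal
import sys


def make_shutdown_handler(close_fn):
    """Return a signal handler that runs close_fn once, then exits."""
    state = {"closed": False}

    def handler(signum, frame):
        if not state["closed"]:
            state["closed"] = True
            close_fn()
        sys.exit(128 + int(signum))  # conventional exit status for death by signal

    return handler


def install_shutdown_handlers(close_fn):
    """Register the handler for SIGTERM and SIGINT (main thread only)."""
    handler = make_shutdown_handler(close_fn)
    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, handler)
    return handler
```

This does nothing for SIGKILL, which cannot be caught, so orphaned regions still need a startup-time sweep.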
#### G30: `ExecutionKernel` has no `close()` — relies on `__del__` for Rust handle cleanup
`ExecutionKernel` has `__del__` which calls `_get_rust().destroy(backend)`. No `close()` method. `DITAv2LauncherBundle.close()` never touches the kernel — the Rust handle is only freed by GC at unpredictable time.
If any code holds a stale `_backend` pointer, the handle dangles when GC runs. If `__del__` is suppressed (e.g., during interpreter shutdown with cyclic references), the Rust handle leaks permanently.
**Fix:** Add `close()` to `ExecutionKernel`, call it from `DITAv2LauncherBundle.close()`.
**Severity: High**
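The shape of such a `close()` might look like the following sketch. `KernelHandle` is a hypothetical wrapper, and the `destroy` callable stands in for whatever frees the native handle; neither is taken from the codebase.

```python
class KernelHandle:
    """Hypothetical wrapper showing an explicit, idempotent close()."""

    def __init__(self, backend, destroy):
        self._backend = backend  # opaque native handle
        self._destroy = destroy  # callable that frees it (assumed signature)

    def close(self) -> None:
        backend, self._backend = self._backend, None
        if backend is not None:  # a second call is a no-op
            self._destroy(backend)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.close()

    def __del__(self):
        # Last-resort fallback only; callers should use close() or `with`.
        try:
            self.close()
        except Exception:
            pass
```

Nulling the reference before calling `destroy` keeps a double `close()` safe and lets `__del__` remain as a fallback rather than the primary cleanup path.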
#### G31: `projection` (Hazelcast) never closed
`build_projection()` returns a `HazelcastProjection` which holds a Hazelcast client connection. No `close()` or `disconnect()` method exists on the projection, projector, or row writer. `DITAv2LauncherBundle.close()` doesn't touch the projection. The Hazelcast client connection leaks on shutdown.
**Severity: Medium**
#### G32: `_maybe_close()` only calls the first method found — `break` skips the second
**File:** `launcher.py:233-243`
```python
for method_name in ("close", "disconnect"):
    method = getattr(obj, method_name, None)
    if method is None:
        continue
    try:
        result = method()
    except TypeError:
        continue
    if inspect.isawaitable(result):
        try:
            asyncio.run(result)
        except RuntimeError:
            pass
    break  # ← ONLY calls the FIRST found method, never both
```
If an object has both `close()` and `disconnect()`, only `close()` is called. `disconnect()` is silently skipped. Also: `asyncio.run(result)` silently swallows `RuntimeError` when a running event loop exists — the coroutine is **never executed**.
Currently no object has both, but the pattern is fragile.
**Severity: Low**
#### G33: `close()` is not idempotent for RealZinc components
`RealZincPlane.close()` and `RealZincControlPlane.close()` call their Zinc region's `close()` method. If called twice, the second call operates on an already-closed region — likely crashes from Hazelcast's shared memory code.
References are not nulled after close: `DITAv2LauncherBundle.close()` calls `_maybe_close()` on `self.venue`, `self.zinc_plane`, and `self.control_plane` but never sets them to `None`. A double `close()` is therefore unsafe.
**Severity: Low**
#### G34: No context manager on `DITAv2LauncherBundle`
`DITAv2LauncherBundle` has no `__enter__`/`__exit__`. Users must manually call `close()`. No `with` pattern exists anywhere in the source for lifecycle management. No `__del__` fallback on the bundle either.
**Severity: Low** (ergonomic, not a leak source if caller follows the pattern)
#### G35: `BingxVenueAdapter.connect()` exists but is never called by the launcher
`BingxDirectExecutionAdapter` has a `connect()` method that initializes the lifetime HTTP client. `BingxVenueAdapter` has `connect()` that calls `_call_backend("connect")`. Neither is called in `build_launcher_bundle()` or `_build_venue()`. If the adapter's `submit_intent()` relies on a connected client, it must initialize lazily; the explicit connect path is dead code.
**Severity: Informational**
#### G36: Only one `try/finally` in the entire codebase
The only `try/finally` is `_RustKernelLib._take_string()` (rust_backend.py:140-143) which frees the Rust C string. All other resource management uses `try/except` with no `finally`.
No cleanup is guaranteed on exception:
- `build_launcher_bundle()`: no cleanup on failure
- `process_intent()`: no cleanup of partial slot state on venue event exception
- `on_venue_event()`: no cleanup on FFI failure
- `_set_slot()`: no cleanup on projection or Zinc write failure
### H1: No Python dependency declaration files exist in workspace
**Files:** workspace root
Zero `requirements.txt`, `setup.py`, `setup.cfg`, `pyproject.toml`, `Pipfile`, or `poetry.lock` anywhere. All Python package dependencies are entirely implicit — determined by what's installed in the runtime environment. No reproducible installs, no version pinning, no audit trail.
The Rust side does have `Cargo.toml` + `Cargo.lock` — but all 4 direct Rust deps use open ranges (`"0.4"`, `"0.2"`, `"1"`, `"1"`).
**Severity: Critical**
### H2: Rust kernel compiled from source on every cold start via subprocess
**File:** `rust_backend.py:60-72`
```python
def _ensure_library() -> Path:
    path = _library_path()
    if not path.exists():
        _build_library()  # cargo build --release
    return path

def _build_library():
    subprocess.run(
        ["cargo", "build", "--release", ...],
        check=True,  # no timeout!
    )
```
First load takes 3-10 minutes (Rust compilation). Requires Rust toolchain in production. `subprocess.run()` has no `timeout=` — if `cargo` hangs (network, disk, lock contention), the Python process hangs indefinitely. No prebuilt binary distribution.
**Severity: Critical**
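If building at load time is kept at all, the call can at least be bounded. The sketch below wraps `subprocess.run` with its real `timeout=` parameter; `run_build` is a hypothetical helper and the timeout value is left to the caller.

```python
import subprocess


def run_build(cmd, timeout_s: float):
    """Run a build command; raise RuntimeError on timeout or non-zero exit."""
    try:
        return subprocess.run(cmd, check=True, timeout=timeout_s, capture_output=True)
    except subprocess.TimeoutExpired as exc:
        raise RuntimeError(f"build timed out after {timeout_s}s: {cmd!r}") from exc
    except subprocess.CalledProcessError as exc:
        raise RuntimeError(f"build failed with exit code {exc.returncode}") from exc
```

On timeout, `subprocess.run` kills the child before raising, so a hung `cargo` does not outlive the call.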
### H3: Zero logging — every swallowed error is invisible
The entire codebase has zero use of Python's `logging` module, `print()`, or `warnings.warn()` for error reporting. Every `except: pass`, `except Exception: pass`, and `return default` silently discards the error. **There is no mechanism to detect, alert, or diagnose production failures.**
All `try/except: pass` sites found:
| # | File:Line | What's Hidden |
|---|-----------|---------------|
| 1 | `bingx_venue.py:51` | `float()` conversion failure on any API field value |
| 2 | `bingx_venue.py:133` | regex match failure in rate-limit parsing |
| 3 | `bingx_venue.py:136` | int/float conversion of retry_after |
### H4: `_row_float` rejects zero as a valid value — `or` pattern treats 0 as missing
**File:** `bingx_venue.py:47-55`
```python
def _row_float(row, *keys, default=0.0):
    for key in keys:
        try:
            value = float(row.get(key) or 0.0)  # `or 0.0` treats 0 as missing
        except Exception:
            continue
        if value == value and value not in (float("inf"), float("-inf")) and value != 0.0:
            return value  # explicitly rejects 0.0
    return default
```
Two bugs: (a) `except Exception: continue` swallows ALL conversion errors, and (b) `value != 0.0` explicitly rejects zero as a valid return value. A legitimate zero price, zero filled quantity, or zero position amount causes `_row_float` to skip that key and search further. If ALL keys return 0, the default `0.0` is returned — indistinguishable from "none of the keys existed."
Called by every single BingX API response parser: `_position_qty()`, `_position_price()`, `_venue_order_from_row()`, `_event_from_row()`, `_fill_event_from_row()`, `_events_from_submit()`, `_events_from_cancel()`, `_filled_size_from_snapshots()`. None of them can tell whether a returned `0.0` is a genuine zero or a missing value.
**Severity: High**
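One way to separate "zero" from "missing" is to return `None` for the latter. This is a sketch under the assumption that callers can be updated to handle `None`; `row_float` is a hypothetical replacement, not the existing function.

```python
import math
from typing import Any, Mapping, Optional


def row_float(row: Mapping[str, Any], *keys: str) -> Optional[float]:
    """Return the first finite float found under keys, or None if none exists.

    Zero is a valid result; None means "missing or unparseable".
    """
    for key in keys:
        raw = row.get(key)
        if raw is None or raw == "":
            continue
        try:
            value = float(raw)
        except (TypeError, ValueError):
            continue
        if math.isfinite(value):
            return value
    return None
```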
### H5: `_backend_snapshot` timeout returns stale data with no signal to callers
```python
if not self._snapshot_ready.wait(timeout=timeout_ms / 1000.0):
    with self._snap_lock:
        return self._last_snapshot  # STALE: could be hours old
```
When the snapshot-fetch condition times out, returns `self._last_snapshot` — initialized to `None` and only updated on successful fetches. First timeout returns `None`. All callers (`cancel()`, `open_orders()`, `open_positions()`, `reconcile()`, `submit()`) access `.open_orders`, `.open_positions` immediately — crash with `AttributeError: 'NoneType' object has no attribute 'open_orders'`.
Even after the first fetch succeeds, subsequent timeouts return the last-good snapshot which could be arbitrarily stale. No caller timestamps, version-checks, or requests a refresh.
**Severity: High**
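A sketch of how staleness could be surfaced to callers instead of hidden. `TimedSnapshot`, `SnapshotUnavailable`, and `require_fresh` are hypothetical names, and the age limit is a policy choice left to the caller.

```python
import time
from dataclasses import dataclass
from typing import Any, Optional


class SnapshotUnavailable(RuntimeError):
    """Raised when no snapshot exists or the cached one is too old."""


@dataclass
class TimedSnapshot:
    value: Any
    fetched_at: float  # time.monotonic() at fetch time


def require_fresh(cached: Optional[TimedSnapshot], max_age_s: float,
                  now: Optional[float] = None) -> Any:
    """Return the cached value only if it exists and is younger than max_age_s."""
    now = time.monotonic() if now is None else now
    if cached is None:
        raise SnapshotUnavailable("no snapshot has been fetched yet")
    age = now - cached.fetched_at
    if age > max_age_s:
        raise SnapshotUnavailable(f"snapshot is {age:.1f}s old (limit {max_age_s}s)")
    return cached.value
```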
### H6: All enum-from-raw-string sites crash on unknown value — zero fallback
If the Rust kernel introduces a new enum variant (e.g., `TradeStage::ENTRY_REJECTED`) not in the Python `TradeStage` enum, `TradeStage("ENTRY_REJECTED")` raises `ValueError` with zero fallback. Crashes `_outcome_from_payload()` and takes down the kernel's event processing loop.
17 sites total across `rust_backend.py` and `real_zinc_plane.py`. No try/except, no mapping, no fallback on any of them.
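Python's `Enum` supports a `_missing_` hook that can absorb unknown values. The sketch below uses a hypothetical `ExampleStage` with an assumed `UNKNOWN` member; the real `TradeStage` members are not reproduced here.

```python
from enum import Enum


class ExampleStage(str, Enum):
    """Hypothetical enum standing in for TradeStage."""
    IDLE = "IDLE"
    POSITION_OPEN = "POSITION_OPEN"
    UNKNOWN = "UNKNOWN"

    @classmethod
    def _missing_(cls, value):
        # Called by Enum when value matches no member; map it to UNKNOWN
        # so callers can detect and report it instead of crashing.
        return cls.UNKNOWN
```

Callers still need to treat `UNKNOWN` as an error condition; the hook only moves the failure from an uncaught `ValueError` to a value they can inspect.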
```python
metadata["_limit_price"] = float(getattr(intent, "limit_price", 0.0) or 0.0)
```
`order_type` and `limit_price` are NOT fields on `KernelIntent` (contracts.py). They only exist in `intent.metadata` as `metadata["order_type"]` if set by the caller. `getattr(intent, "order_type", "MARKET")` checks the dataclass field — not the metadata dict — so it ALWAYS returns `"MARKET"`.
Even when the PINK runtime produces a LIMIT intent (LIMIT_DECISION → `metadata["order_type"] = "LIMIT"`), the legacy adapter converts it to MARKET because it reads the wrong source. Every LIMIT order is submitted as MARKET.
Similarly, `limit_price` is always `0.0` — any limit price from the metadata dict is lost.
**Severity: High**
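A sketch of reading both values from the metadata dict instead. `order_params` is a hypothetical helper, and the `"limit_price"` metadata key is an assumption, since only `metadata["order_type"]` is confirmed above.

```python
def order_params(intent):
    """Read order_type and limit_price from intent.metadata, not from attributes."""
    metadata = getattr(intent, "metadata", None) or {}
    order_type = str(metadata.get("order_type", "MARKET")).upper()
    raw_price = metadata.get("limit_price")  # assumed key name
    limit_price = float(raw_price) if raw_price is not None else None
    if order_type == "LIMIT" and (limit_price is None or limit_price <= 0.0):
        raise ValueError("LIMIT intent requires a positive limit_price in metadata")
    return order_type, limit_price
```

Failing loudly on a LIMIT intent without a price avoids the silent downgrade to MARKET.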
### H8: `_venue_event_status_from_row` silently maps unknown venue status to ACKED
```python
return VenueEventStatus.ACKED  # fallthrough for anything unknown
```
If BingX introduces a new status (`"SUSPENDED"`, `"PENDING_CANCEL"`, `"EXPIRED"`), it doesn't match any known mapping and silently returns `ACKED`. The kernel treats a suspended/cancelled/expired order as acknowledged — dangerous misclassification.
**Severity: High**
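An explicit mapping with a distinct fallback avoids the misclassification. In this sketch `ExampleStatus`, its `UNKNOWN` member, and the venue strings in `_STATUS_MAP` are all assumptions for illustration; the real `VenueEventStatus` and BingX status strings are not shown here.

```python
from enum import Enum


class ExampleStatus(Enum):
    """Hypothetical subset of statuses."""
    ACKED = "ACKED"
    FILLED = "FILLED"
    CANCELED = "CANCELED"
    UNKNOWN = "UNKNOWN"


# Assumed venue strings, for illustration only.
_STATUS_MAP = {
    "NEW": ExampleStatus.ACKED,
    "FILLED": ExampleStatus.FILLED,
    "CANCELED": ExampleStatus.CANCELED,
}


def status_from_raw(raw) -> ExampleStatus:
    """Map a venue status string explicitly; anything unrecognized becomes UNKNOWN."""
    key = str(raw or "").strip().upper()
    return _STATUS_MAP.get(key, ExampleStatus.UNKNOWN)
```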
### H9: `RealZincPlane.write_slot()` — slot written to `slot_id >= slot_count` is invisible
**File:** `real_zinc_plane.py:206-210`
```python
def write_slot(self, slot):
    with self._lock:
        self._slot_cache[int(slot.slot_id)] = slot
        payload = {"slots": [self._slot_cache[key].to_dict() for key in range(self._slot_count)]}
```
`_slot_cache` is a plain dict — accepts any key. But `read_slots()` only reads 0..slot_count-1. Writing to `slot_id >= slot_count` stores the slot in the cache but it's **never serialized or read back**. No error.
**Severity: High**
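A small bounds check at the top of the write path would turn the silent loss into an error. `check_slot_id` is a hypothetical helper shown in isolation.

```python
def check_slot_id(slot_id, slot_count: int) -> int:
    """Return slot_id as int if 0 <= slot_id < slot_count, else raise IndexError."""
    idx = int(slot_id)
    if not 0 <= idx < slot_count:
        raise IndexError(f"slot_id {idx} outside [0, {slot_count})")
    return idx
```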
### H10: `RealZincControlPlane.read()` has no atomicity with concurrent `update()`
**File:** `real_control_plane.py:70-77`
`_write_region()` zero-fills the buffer then writes the packet. If `read()` interleaves between zero-fill and write, it sees a partially-zeroed buffer → `_decode_packet` returns `{}` → returns stale `self._snapshot` with no observable error. No lock, no sequence check, no atomic read.
The same bug exists in `RealZincPlane.read_slots()` (real_zinc_plane.py:220-230) — reads shared memory while a concurrent `write_slot()` is in progress.
**Severity: High**
### H11: `_RustKernelLib` lazily initialized with race condition
**File:** `rust_backend.py:187-190`
```python
_RUST: _RustKernelLib | None = None

def _get_rust():
    global _RUST
    if _RUST is None:
        _RUST = _RustKernelLib()  # no lock: two threads can both create
    return _RUST
```
No threading lock. Two concurrent calls to `_get_rust()` (possible via `BingxVenueAdapter`'s thread pool) can create two `_RustKernelLib` objects. The `_RustKernelLib()` constructor runs `_ensure_library()` which runs `subprocess.run(["cargo", "build", ...], check=True)` — concurrent `cargo build` can corrupt the build directory.
**Severity: High**
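The usual remedy is double-checked locking around the lazy initialization. The sketch below uses a generic `get_singleton(factory)` for illustration; the names are hypothetical and the factory stands in for the `_RustKernelLib()` constructor.

```python
import threading

_LOCK = threading.Lock()
_INSTANCE = None


def get_singleton(factory):
    """Create the singleton at most once, even under concurrent first calls."""
    global _INSTANCE
    if _INSTANCE is None:          # fast path, no lock once initialized
        with _LOCK:
            if _INSTANCE is None:  # re-check under the lock
                _INSTANCE = factory()
    return _INSTANCE
```

Only the first caller pays for construction; later callers skip the lock entirely.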
### H12: `ExecutionKernel.__del__` can deadlock or use-after-free
`_get_rust()` accesses the module-level `_RUST` singleton, which may already be destroyed if the module's garbage collection runs before the instance's. The destroy call happens outside any lock — one thread's destructor could destroy the Rust kernel while another thread is still using it. Use-after-free.
`ControlPlane` protocol defines `wait()` and `notify()`. `MirroredControlPlane` inherits from nothing and only implements `read()`, `update()`, and `mirror()`. Calling `plane.wait()` on a `MirroredControlPlane` raises `AttributeError`.
**Severity: Medium**
### H14: `TradeSlot.remaining_size()` and `VenueOrder.remaining_size()` — same name, different semantics
```python
# TradeSlot:
def remaining_size(self) -> float:
    return max(0.0, float(self.size))  # open position size

# VenueOrder:
def remaining_size(self) -> float:
    return max(0.0, self.intended_size - self.filled_size)  # unfilled order qty
```
Same method name, completely different semantics. `TradeSlot.remaining_size()` returns the current open position size. `VenueOrder.remaining_size()` returns the untracked/unfilled order quantity. A caller using `slot.remaining_size()` to check if an order is fully filled gets position size, which doesn't change with fills — it changes with entry/exit.
**Severity: Medium**
### H15: `_maybe_close()` — `asyncio.run()` RuntimeError silently swallowed for coroutines
**File:** `launcher.py:233-243`
```python
if inspect.isawaitable(result):
    try:
        asyncio.run(result)
    except RuntimeError:
        pass  # SILENT: coroutine never executed
```
When `_maybe_close` is called from an async context (which it is: `DITAv2LauncherBundle.close()` is used in async test code), `asyncio.run()` raises `RuntimeError("asyncio.run() cannot be called from a running event loop")`. The exception is swallowed, the coroutine is never awaited, and the close/disconnect never happens.
Also: `break` after calling the first found method means if an object has both `close()` and `disconnect()`, `disconnect()` is never called.
**Severity: Medium**
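A sketch of handling the awaitable without dropping it: detect a running loop with `asyncio.get_running_loop()` and schedule a task in that case. `run_close_result` and `_drain` are hypothetical helpers; returning the task so the caller can await it is one possible design, not the only one.

```python
import asyncio
import inspect


async def _drain(awaitable):
    await awaitable


def run_close_result(result):
    """Handle the return value of a close()/disconnect() call.

    Returns None when nothing is pending, or an asyncio.Task the caller
    must await when invoked from inside a running event loop.
    """
    if not inspect.isawaitable(result):
        return None
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run() is safe.
        asyncio.run(_drain(result))
        return None
    # Already inside a loop: schedule instead of silently dropping the coroutine.
    return loop.create_task(_drain(result))
```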
### H16: `_build_venue` imports `BingxDirectExecutionAdapter` inside the function, so lazy loading masks import errors
**File:** `launcher.py:254`
```python
def _build_venue(...):
    from prod.clean_arch.adapters.bingx_direct import BingxDirectExecutionAdapter
```
Import inside function — safe, lazy, no side effects. But if the `bingx_direct` module has an import error (missing dependency, version mismatch), it only surfaces at bundle construction time, not at process start. A misconfigured production deployment would fail on the first trade, not on boot.
**Severity: Informational**
### H17: `load_dotenv()` at module level — import-time filesystem I/O and env mutation
**File:** `launcher.py:49-51`
```python
load_dotenv(PROJECT_ROOT / ".env") # executes on module import
```
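Runs on every import of `launcher.py`: reads the filesystem and mutates the process environment. Hard to control in tests: assuming this is python-dotenv's `load_dotenv`, its default `override=False` keeps variables a test has already set, but any variable a test deliberately leaves unset is populated from `.env` at import time. Also: if `.env` doesn't exist, `load_dotenv()` silently does nothing, so missing config is invisible.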
**Severity: Medium**
### H18: `_run()` in `BingxVenueAdapter` — `asyncio.run()` thread-pool bridge blocks on every call
Every call to `_run()` that receives an awaitable blocks the calling thread via `.result()`. The BingX HTTP call inside `submit_intent()` can take 1-5 seconds. During this block, the event loop cannot process other tasks. In a single-runtime deployment, this stalls the entire policy cycle.
**Severity: Medium**
### H19: `HazelcastClientLike` protocol has zero concrete implementations in workspace
**File:** `hazelcast_projection.py:13-15`
```python
class HazelcastClientLike(Protocol):
    def get_map(self, name: str): ...
    def get_topic(self, name: str): ...
```
Used as a type hint. No code in the workspace creates an object that satisfies this protocol. The Hazelcast client comes from an external package. If the external API changes, the protocol silently drifts — no compilation check.
**Severity: Low**
### H20: `_decode_packet` in RealZinc — no bound check on `size` beyond `> len(buf)-16`
If shared memory contains a corrupted `size` field within bounds, `.decode()` or `json.loads()` raises — uncaught by callers. A single corrupted byte in shared memory crashes the kernel.
**Severity: Low**
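A defensive decoder might look like the following sketch. It assumes the 16-byte `!QQ` header (sequence number, payload length) followed by UTF-8 JSON that this report describes for the Zinc wire format; `decode_packet` and `PacketError` are hypothetical names. The point is that every corruption mode surfaces as one exception type that callers can catch.

```python
import json
import struct

HEADER = struct.Struct("!QQ")  # (sequence number, payload length), 16 bytes


class PacketError(ValueError):
    """Raised for any malformed or corrupted packet."""


def decode_packet(buf):
    """Return (seq, payload) or raise PacketError; no other error escapes."""
    if len(buf) < HEADER.size:
        raise PacketError("buffer shorter than header")
    seq, size = HEADER.unpack_from(buf, 0)
    if size > len(buf) - HEADER.size:
        raise PacketError(f"size {size} exceeds buffer")
    body = bytes(buf[HEADER.size:HEADER.size + size])
    try:
        payload = json.loads(body.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        raise PacketError(f"corrupt payload at seq {seq}") from exc
    return seq, payload
```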
### H21: All Rust crate features enabled by default — `wasm-bindgen` compiled into native shared library
The Rust kernel is a native `.so`/`.dylib` but chrono's `iana-time-zone` pulls in `js-sys` and `wasm-bindgen` (WebAssembly support) even on native Linux. Larger binary, longer compile times. `cc` crate pulled in for `iana-time-zone-haiku` which only compiles on Haiku OS.
**Severity: Low**
### H22: `socket.getaddrinfo` monkey-patch in test generator code
**File:** `gen2.py:295-298`
Monkey-patches Python stdlib `socket.getaddrinfo` to force IPv4 as a workaround for IPv6 resolution failure in the deployment environment. If copied to production code, would break IPv6 connectivity.
After both fills, the actual position is 0.8 but `slot.size` reports 0.3. The position is under-counted by 0.5 — 62.5% error.
The exit path correctly does `slot.size = (slot.size - fill_size).max(0.0)` (subtractive). The entry path should accumulate: `slot.size += fill_size`.
This only manifests with LIMIT orders that receive multiple partial fills over time, a scenario entirely absent from tests (I6).
**Severity: Critical**
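The arithmetic can be checked with a short sketch. The two fill sizes (0.5, then 0.3) are inferred from the figures quoted above; the variable names are illustrative only.

```python
fills = [0.5, 0.3]  # two partial entry fills (sizes inferred from the text)

size_assign = 0.0
size_accumulate = 0.0
for fill_size in fills:
    size_assign = fill_size        # current behavior: overwrite
    size_accumulate += fill_size   # proposed behavior: accumulate

under_count = size_accumulate - size_assign
error_pct = 100.0 * under_count / size_accumulate
print(size_assign, round(size_accumulate, 10), round(error_pct, 1))
```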
### I2: `exit_ratio = 0.0` creates zero-size exit order — slot stuck in EXIT_REQUESTED
**File:** `_rust_kernel/src/lib.rs:467-469`
```rust
let exit_ratio = slot.next_exit_ratio(); // returns 0.0 from exit_leg_ratios=[0.0, ...]
let base_size = if slot.initial_size > 0.0 { ... } else { slot.size };
let exit_size = (base_size * exit_ratio).max(0.0); // = 0.0
```
When `exit_leg_ratios` contains `0.0` in any position, `exit_size = 0.0`. The zero-size exit order is submitted to the venue (`intended_size = 0`). On the fill side, `realized_pnl()` returns 0.0 (guarded by `exit_size <= 0.0`), and `slot.size` is unchanged. The slot stays in `EXIT_REQUESTED` with no means to advance — the leg is consumed but nothing happened. Subsequent exits may eventually handle this, but the zero-size leg is a wasted FSM transition that leaves the slot in a confusing intermediate state.
Also: `NaN` in `exit_leg_ratios` (from `clamp(0.0, 1.0)` not guarding NaN, though serde_json rejects NaN) would produce the same zero-size exit behavior.
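One possible guard is to validate the ratios when the slot is configured, before any exit leg is sized. The sketch below is an assumption about where such a check could live; `validate_exit_leg_ratios` is a hypothetical helper, and the accepted range (0, 1] is a design choice rather than something the source specifies.

```python
import math


def validate_exit_leg_ratios(ratios):
    """Return ratios as a tuple, or raise ValueError on an unusable leg."""
    cleaned = []
    for i, r in enumerate(ratios):
        r = float(r)
        # NaN fails every comparison; the explicit check documents the intent.
        if math.isnan(r) or not (0.0 < r <= 1.0):
            raise ValueError(f"exit_leg_ratios[{i}] = {r!r} is not in (0, 1]")
        cleaned.append(r)
    return tuple(cleaned)
```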
```rust
if self.entry_price <= 0.0 { self.entry_price = price; } // catches -0.5, replaces it
```
If `entry_price` is negative (possible only via `set_slot_json` direct injection — not from normal trading), Python keeps it and computes `unrealized_pnl` with wrong sign. Rust replaces it. The Python-side `mark_price` is only called from `ExecutionKernel.mark_price()` in rust_backend.py:LOW-1, which never writes back to the Rust kernel — so the Python-side calculation is purely local and the inconsistency has no effect on the Rust kernel's canonical state. However, the `observe_slots` call after `mark_price` re-reads from the Rust kernel, which recomputes PnL correctly. The Python-side mark_price is effectively wasted computation that never feeds back.
**Severity: Informational**
### I4: No Rust unit tests for 99% of kernel functionality
**File:** `_rust_kernel/src/lib.rs:1731-1765`
Only 1 Rust test exists: `enter_then_ack_fill` — creates a 2-slot kernel, submits ENTER, sends ACK, asserts state transitions.
**Not tested in Rust:**
- EXIT, CANCEL, MARK_PRICE, RECONCILE, CONTROL actions
- Any FILL event (PARTIAL, FULL)
- CANCEL_ACK, CANCEL_REJECT, ORDER_REJECT
- RATE_LIMITED handling
- Multi-leg exits
- `consume_exit_leg` edge cases
- `realized_pnl()` formula with boundary values
- `mark_price()` with extreme values
- `resolve_slot()` fallback path
- `reconcile_slots_json` dedup/overflow
- Any C FFI boundary function
- Any serde deserialization failure
- Null pointer handling
No `#[cfg(test)]` module exists — the single test is inline. No Rust integration tests (`tests/` directory).
**Severity: High**
### I5: `MockVenueScenario` rejection flags exist but zero tests use them
**File:** `mock_venue.py:23-35`
```python
@dataclass
class MockVenueScenario:
reject_entries: bool = False
reject_exits: bool = False
cancel_reject: bool = False
```
Three boolean flags to simulate venue rejection of orders. Not a single test in `test_flaws.py` sets any of them to `True`. The `ORDER_REJECT` handler in the Rust kernel's `on_venue_event` exists (lib.rs lines ~1440-1460) but is never exercised by any test.
Similarly, `entry_partial_fill_ratio` and `exit_partial_fill_ratio` exist on `MockVenueScenario` but only one test (`test_cancel_entry_with_partial_fill`) uses partial fills at all — and it only checks `size > 0`, not the full capital-accrual chain.
**Severity: High**
### I6: No LIMIT order test through the full kernel path
The test suite has zero LIMIT orders. The Rust kernel contains no reachable LIMIT-specific logic, so all orders are effectively MARKET. The generated live tests have `limit_does_not_fill` and `limit_immediate_fill` scenario placeholders, but:
- `limit_does_not_fill` uses `reference_price=0.0` (not a real LIMIT order)
- `limit_immediate_fill` uses `target_size=-0.001` (negative size → clamped to 0.0)
Neither scenario actually submits a LIMIT order with `order_type="LIMIT"` and a non-zero `limit_price`. The `_legacy_intent` bug (H7) would convert any LIMIT attempt to MARKET anyway.
The only LIMIT-related code is the Rust kernel's `if intent.order_type == "LIMIT"` branches (lib.rs:503, 1584) which are compile-time dead code — `KernelIntent` doesn't have an `order_type` field that serde would populate.
**Severity: High**
### I7: Three weak/vacuous assertions in `test_flaws.py`
**File:** `test_flaws.py`
1. **Line 512:** `assert order.metadata.get("asset") is not None or order.metadata.get("slot_id") is not None`: the mock venue always sets both, so this can never fail.
2. **Line 700:** `test_pnl_warning_on_unsettled_reentry` is named as if it asserts that a warning is raised, but it only checks `r.accepted`. It never checks `diagnostic_code` or verifies that the warning was issued.
3. **Line 318:** `assert slot.active_entry_order is None or slot.active_entry_order.status == VenueOrderStatus.FILLED`: the `or` allows two different scenarios to pass, reducing diagnostic power.
**Severity: Low**
### I8: `slot.size = fill_size` entry overfill no guard
**File:** `_rust_kernel/src/lib.rs:798`
Already noted in I1 — entry fill sets `slot.size` directly to `fill_size`. Unlike exit fill which has `(slot.size - fill_size).max(0.0)`, there's no guard against entry overfill (venue fills more than the intended order size). For MARKET orders this is fine (one fill per order), but for LIMIT orders with multiple partial fills, the accumulated fill could exceed `initial_size`.
**Severity: Low** (only relevant with LIMIT + partial fills, which don't exist in the codebase)
### I9: No crash durability — slot state is pure in-memory until step 7 of process_intent
If the process crashes between steps 2-5, the slot state accumulated in the Rust kernel's in-memory `KernelCore` is **completely lost**. The Rust kernel has no WAL, no journal, no persistent store. On restart, `ExecutionKernel.__init__` creates a fresh `KernelCore` with all slots IDLE.
The crash between step 3 and step 5 is the most dangerous: the exchange has an open order/position, but the kernel has no record of it. On restart:
- The Rust kernel sees `slot.fsm_state = IDLE`
- The Zinc slot cache may or may not have the pre-crash state (depends on timing)
- No code on restart loads Zinc state back into the Rust kernel (I14)
- The exchange order lives until it fills (unexpected position) or is manually cancelled
**Concrete example:** `venue.submit()` sends a POST to BingX and the order is placed. The HTTP response arrives, and `on_venue_event(ORDER_ACK)` transitions the slot to `ENTRY_WORKING`. The process then crashes between returning from `on_venue_event` and `zinc_plane.write_slot()`. On restart the slot is IDLE, there is no active entry order, and `_last_settled_pnl` is reset, while the exchange still has a live working entry order. The next `process_intent(ENTER)` is not rejected with `SLOT_BUSY`: the fresh kernel does not know the order exists, sees the slot as IDLE, and allows a new ENTER. When the old order then fills on the exchange, the result is a double position.
**Severity: Critical**
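A minimal write-ahead journal is one way to close this window. The sketch below is illustrative only: `IntentJournal` is a hypothetical class, and the record fields are assumptions. The idea is to append and fsync a record before the venue call so that a restart can at least see that an order may be in flight.

```python
import json
import os


class IntentJournal:
    """Append-only JSON-lines journal, fsynced before the venue call."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        line = json.dumps(record, sort_keys=True) + "\n"
        with open(self.path, "a", encoding="utf-8") as fh:
            fh.write(line)
            fh.flush()
            os.fsync(fh.fileno())  # record survives a process crash after this

    def replay(self):
        """Return all complete records; an unterminated final line is skipped."""
        if not os.path.exists(self.path):
            return []
        records = []
        with open(self.path, encoding="utf-8") as fh:
            for line in fh:
                if line.endswith("\n"):
                    records.append(json.loads(line))
        return records
```

On startup, any intent whose last journal record is a submit without a matching terminal record would be a candidate for reconciliation against the exchange before new ENTER intents are accepted.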
### I10: `seen_event_ids` lost on restart — events replayed after restart are double-processed
**File:** `_rust_kernel/src/lib.rs:672-683`
`seen_event_ids` is per-slot, per-`KernelCore` instance, so it lives purely in process memory. On restart with a fresh `KernelCore`, every slot has `seen_event_ids = Vec::new()`. If events are replayed (from `pump_venue_events()` calling `venue.reconcile()` which re-fetches exchange state):
1. Original run: order fills → `FULL_FILL` with `event_id = "EV-00000042"` → processed, slot → `POSITION_OPEN`
4. `pump_venue_events()` fetches same exchange state → new `VenueEvent` objects with new event IDs (adapter's `_event_seq` resets)
5. Rust kernel sees these as novel events — processes them again
6. Position is double-booked, PnL double-settled
The `bingx_venue._event_seq` is an instance-level `itertools.count()` starting from 1. On adapter restart, it resets — so the new event IDs won't match the old ones anyway. Dedup is fundamentally impossible across restarts.
**Severity: Critical**
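Dedup across restarts needs event IDs that depend only on exchange-side facts. The following sketch shows one possible derivation; `stable_event_id` and its three inputs are hypothetical, chosen for illustration rather than taken from the adapter.

```python
import hashlib


def stable_event_id(venue_order_id, status, cumulative_filled_qty):
    """Derive an event ID from exchange-side facts only (no local counters)."""
    key = f"{venue_order_id}|{status}|{cumulative_filled_qty}"
    return "EV-" + hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
```

Re-fetching the same order state after a restart would then reproduce the same ID. This only helps if the set of seen IDs is also persisted somewhere that survives the restart.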
### I11: No idempotency key (`newClientOrderId`) sent to BingX
BingX supports `newClientOrderId` for order idempotency — sending the same ID twice returns the original order status instead of creating a duplicate. The DITAv2 kernel passes `intent.intent_id` as `decision_id` to the legacy adapter, but there's no guarantee this maps to `newClientOrderId` in the BingX payload.
If the HTTP POST to `/trade/order` times out before the response is read:
1. The order was placed on the exchange
2. `_call_backend` raises a `BingxHttpError` (or similar network exception)
3. `process_intent()` propagates the exception, with no retry
4. Next cycle: caller may retry with a new `intent_id`
5. Second POST creates a **second order** on the exchange — duplicate position
Without a client-order-id that persists across retries, the system can create duplicate orders on network timeouts. The exchange has no way to deduplicate.
**Severity: High**
### I12: No graceful degradation for ANY subsystem
Every subsystem failure mode examined:
| Subsystem | Failure | Current behavior |
|-----------|---------|-----------------|
| Zinc SHM init | Corrupted region, OOM | Silent fallback to InMemoryZincPlane (no operator signal) |
| Memory pressure | OOM | Process killed by the OS kernel. No signal handlers are installed. |
**No subsystem has a graceful degradation path.** No circuit breaker, no retry queue, no fallback to log-only mode, no offline/cached trading mode. Every failure (except the two init-time silent fallbacks) crashes the current kernel operation.
**Severity: High**
### I13: Stray venue event can reactivate a CLOSED slot — no guard
**File:** `_rust_kernel/src/lib.rs:625+`
The `on_venue_event` function has no guard for closed slots:
A CLOSED slot should be a terminal state that rejects all events. Currently only CANCEL_ACK is harmless on a closed slot; the rest can revive a dead position.
**Severity: High**
### I14: No `reconcile_from_slots` call on startup — Zinc state never loaded into Rust kernel
1. `RealZincPlane.__init__` reads state from Zinc shared memory into `_slot_cache`
2. `ExecutionKernel.__init__` creates a fresh `KernelCore` with all slots IDLE
3. `KernelStateView(self)` reads from the fresh kernel
4. `account.observe_slots([self._get_slot(i) for i in range(max_slots)])` sees all slots IDLE
Step 3 and 4 read from the Rust kernel, NOT from Zinc. The Zinc `_slot_cache` populated in step 1 is **never loaded into the Rust kernel**. The `reconcile_on_restart` flag exists in `KernelControlSnapshot` (default `True`) but is never checked anywhere in `ExecutionKernel.__init__` or the launcher.
The system always starts with a blank state even when durable shared memory state exists.
When the exchange rejects a cancel (typically because the order was already filled or no longer exists), the slot stays in `EXIT_WORKING` with `active_exit_order` still attached. Every subsequent CANCEL attempt hits the same path — the exchange returns "order not found," the kernel sees `CANCEL_REJECT`, and the slot is stuck forever.
If the order was already filled (CANCEL_REJECT means "can't cancel, no longer open"), the slot should check the actual position size and potentially transition to `POSITION_OPEN` or `CLOSED` depending on fill status.
**Severity: Medium**
### I16: Zinc shared memory — world-readable/writable by same-machine processes
Region names are predictable (prefix defaults to `"dita_v2"`). The `SharedRegion` uses POSIX `shm_open` — the default permissions depend on umask (typically `0644` or `0600`). Any process on the same machine can:
- **Read**: Open the region → `as_buffer()` → `_decode_packet()` → read all slot state, PnL, open orders, control settings
- **Write**: Open the region → forge a packet (`struct.pack("!QQ", seq, len) + json_bytes`) → overwrite slot state, inject fake intents, modify control plane
No access control, no encryption, no integrity check (HMAC/signature) on the wire format. The sequence number is the only ordering mechanism, and it's trivially predictable.
**Severity: High**
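An integrity tag on each packet would at least make forged writes detectable. The sketch below assumes the `!QQ` header plus JSON body described above and appends an HMAC-SHA256 tag; `seal`, `unseal`, and the key-distribution scheme are hypothetical, and the functions operate on exact packet bytes rather than a whole shared region.

```python
import hashlib
import hmac
import json
import struct

HEADER = struct.Struct("!QQ")  # (seq, payload length), as in the wire format above
TAG_LEN = hashlib.sha256().digest_size  # 32 bytes


def seal(key, seq, payload):
    """Serialize payload and append an HMAC over header + body."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    signed = HEADER.pack(seq, len(body)) + body
    return signed + hmac.new(key, signed, hashlib.sha256).digest()


def unseal(key, packet):
    """Return (seq, payload); raise ValueError if the tag does not verify."""
    if len(packet) < HEADER.size + TAG_LEN:
        raise ValueError("packet too short")
    signed, tag = packet[:-TAG_LEN], packet[-TAG_LEN:]
    expected = hmac.new(key, signed, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("HMAC mismatch: packet forged or corrupted")
    seq, size = HEADER.unpack_from(signed, 0)
    if size != len(signed) - HEADER.size:
        raise ValueError("length field disagrees with packet size")
    return seq, json.loads(signed[HEADER.size:].decode("utf-8"))
```

This does not provide confidentiality or replay protection; restrictive region permissions would still be needed to stop other local users from reading slot state.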
### I17: `KernelSlotView` exposes full slot state via unrestricted `__getattr__`/`__setattr__`
**File:** `rust_backend.py:411-460`
```python
class KernelSlotView:
    def __getattr__(self, name):
        slot = self._snapshot()
        return getattr(slot, name)  # read ANY field

    def __setattr__(self, name, value):
        slot = self._snapshot()
        setattr(slot, name, value)
        self._kernel._set_slot(slot)  # write ANY field; bypasses the FSM
```
- Write all slot fields: `slot_view.realized_pnl = -9999999` directly manipulates the PnL figures flowing into capital settlement
The `_set_slot` call writes through to the Rust kernel without any FSM validation. The entire kernel state is exposed through mutable Python objects with zero access control.
**Severity: High**
### I18: `sys.path.insert(0, ...)` at import time in three production files
```python
# real_control_plane.py, real_zinc_plane.py (at module level):
sys.path.insert(0, str(_ZINC_ADAPTER_PATH))
# test_flaws.py, _build_pink_bodies.py, _gen_test.py (at module level):
sys.path.insert(0, '/mnt/dolphinng5_predict')
```
`sys.path.insert(0, ...)` gives the injected path highest import priority. An attacker with filesystem write access to the inserted path can create a malicious module that shadows a legitimate import (e.g., `zinc.py`, `utils.py`, `typing.py`). When any subsequent `from X import Y` runs, the attacker's module loads with the full privileges of the kernel process.
The production files use a relative path resolution (`Path(__file__).resolve().parents[3] / "zinc" / "adapters" / "python"`), while the test files use a hardcoded absolute path (`'/mnt/dolphinng5_predict'`). Both patterns are dangerous.
**Severity: High**
### I19: `pump_venue_events` re-fetches exchange state that can produce phantom position events
**File:** `bingx_venue.py:395-415`
`reconcile()` calls `_backend_snapshot()` which fetches current positions and open orders from the exchange. The `_events_from_snapshot` method diff-s the current snapshot against the last-known snapshot to produce events:
```python
def _events_from_snapshot(self, before, after):
    for symbol, current_pos in after.open_positions.items():
        prev_pos = before.open_positions.get(symbol)
        if current_pos and (not prev_pos or abs(prev_pos.position_amount) < 1e-12):
            # This looks like a new position, so an event is emitted
```
If `before` is stale (from `_backend_snapshot` timeout), the diff can produce spurious events. A position that existed before the crash is absent from the stale snapshot → the diff sees it as "new" → emits an entry fill event → Rust kernel processes it as a fresh enter → double position. This compounds with I10 (seen_event_ids lost on restart).
**Severity: High**
### I20: `exit_leg_ratios` no guard against empty list — `next_exit_ratio` returns 1.0
**File:** `contracts.py:196-198`
```python
def next_exit_ratio(self) -> float:
    if self.active_leg_index < len(self.exit_leg_ratios):
        ...
```
If `exit_leg_ratios` is empty (default `(1.0,)` prevents this normally, but the default is only `(1.0,)` in the dataclass), `next_exit_ratio()` returns `1.0`. This is the same as "exit everything" — the `consume_exit_leg` then advances `active_leg_index` to `min(1, 1) = 1`, and `all_legs_done = active_leg_index >= exit_leg_ratios.len()` → `1 >= 0 = true` → slot closes. The empty-ratios edge case is silently handled with `unwrap_or(1.0)`, which happens to be correct — but undocumented.
**Severity: Informational**
### I21: No test for rate-limited events — `RATE_LIMITED` kernel path is dead code
**File:** `_rust_kernel/src/lib.rs` (event handler), `mock_venue.py` (`MockVenueScenario`, no rate_limit flag)
The Rust kernel has a handler for `KernelEventKind::RATE_LIMITED` (lib.rs lines ~1480-1500). The event flows through the Python bridge's `process_intent()` rate-limit detection (rust_backend.py:585-593). But `MockVenueScenario` has no flag to emit rate-limited events. The only path to trigger `RATE_LIMITED` is from the real BingX adapter — which requires live exchange connectivity.
The entire RATE_LIMITED code path — in both Python and Rust — is untested in CI. Any bug in this path only surfaces in production under rate-limit conditions.
**Severity: Medium**
### I22: Thread pool for `_run` — `max_workers=3` shared across ALL adapter instances
Class-level singleton — all `BingxVenueAdapter` instances share the same 3-thread pool. With the runtime's `step()` calling `submit()` (1 thread) + `_backend_snapshot` (potentially another thread for open orders) + `cancel()` (1 thread in parallel), all 3 threads are consumed. A fourth concurrent call blocks the calling thread at `.result()` indefinitely — freezing the entire event loop.
The pool is never shut down. If a `BingxVenueAdapter` is destroyed, the threads remain running (zombie workers). No `close()`/`disconnect()` path shuts down the executor.
Every instance of `_flatten` in the codebase submits a SHORT exit regardless of the actual position direction:
```python
def _flatten(k, symbol, price, label):
    _exit(k, symbol, price, slot_id=0)  # _exit creates a SHORT exit
    # ... if still not free, tries LONG exit
```
`_exit` calls `_si(k, EXIT, ..., "SHORT", ...)`. If the open position is LONG, this SHORT exit is actually an **enter short** — a new position opening, not a flatten. Only after the first attempt fails does it try a LONG exit. This can double the position instead of flattening it.
No test in the suite has ever hit this because no test before the `_verify` step has an open position with the wrong direction — but the code is fundamentally wrong: it assumes all positions are SHORT.
**Severity: Medium**
### J2: Test `_check_slot_accounting` double-counts unrealized PnL
**File:** `_build_pink_extended.py` (patched into generated file)
```python
total_rp = sum(k.slot(i).realized_pnl for i in range(k.max_slots))
total_up = sum(k.slot(i).unrealized_pnl for i in range(k.max_slots))
expected = start_cap + total_rp + total_up
actual = k.account.snapshot.capital
assert abs(actual - expected) < 0.01
```
The test asserts the identity `capital = start_cap + Σrealized_pnl + Σunrealized_pnl`. The kernel's `account.snapshot.capital`, however, is updated only by `account.settle(realized_pnl)`, which adds realized PnL and nothing else, so in fact `capital = start_cap + Σrealized_pnl`.
The test therefore adds unrealized PnL on top of a figure that never contained it: `expected = actual + Σunrealized_pnl`. Whenever `Σunrealized_pnl` is nonzero, the assertion fails. The test passes only when `unrealized ≈ 0`, that is, for closed positions or when `mark_price` has not been called (which is always, per J4).
**This assertion produces false failures for every test with an open position.** The only reason it doesn't trigger is that `mark_price` is never called, so `slot.unrealized_pnl` is always 0. Silent near-miss.
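A worked example with made-up numbers shows the gap. The figures below are purely illustrative, and `capital` models the settle-only behavior described above.

```python
start_cap = 1000.0
realized = [50.0, 0.0]     # per-slot realized PnL (illustrative numbers)
unrealized = [0.0, 20.0]   # slot 1 is open and has been marked to market

capital = start_cap + sum(realized)                      # what settle() accumulates
expected = start_cap + sum(realized) + sum(unrealized)   # what the test asserts

gap = abs(capital - expected)
would_pass = gap < 0.01
print(capital, expected, gap, would_pass)
```

The gap equals the total unrealized PnL, so the assertion fails as soon as any open slot carries a nonzero mark.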
While `gen2.py:352` and `_gen_test.py:138` correctly use `datetime.now(timezone.utc)` (timezone-aware datetime). If any downstream code calls `.isoformat()` or `.strftime()` on the snapshot's timestamp, it crashes with `AttributeError: 'float' object has no attribute 'isoformat'`.
This function is used by the newer `_run` harness in the generated live-test file. Whether the crash manifests depends on what `MarketSnapshot` and `PinkDirectRuntime.step()` do with the timestamp field.
**Severity: High**
### J4: `ExecutionKernel.mark_price()` exists but is never called — no periodic mark-to-market
This method exists on `ExecutionKernel` but has **zero callers** in the entire codebase. Unrealized PnL is never updated outside of `process_intent` and `on_venue_event` (which only compute realized PnL). The `slot.unrealized_pnl` field stays at its initial value (0) unless `mark_price` is called externally.
The `AccountProjection.observe_slots()` (account.py:53-66) reads `slot.unrealized_pnl` and reports it — but since nothing ever updates it, unrealized PnL is always 0 in the account snapshot.
This means the unrealized PnL reported by `kernel.snapshot()["account"]["unrealized_pnl"]` is **always zero for open positions**: the system has no live mark-to-market.
**Severity: High**
### J5: All VenueEvent timestamps use local machine clock, not exchange timestamp
**File:** `bingx_venue.py` (7 locations)
Every VenueEvent constructed in the venue adapter uses the local machine's clock:
```python
VenueEvent(
timestamp=datetime.now(timezone.utc), # local clock, not exchange
...
)
```
This includes:
- `_events_from_submit()` (lines 370, 390), with a `getattr(receipt, "timestamp", ...)` fallback that still uses the local clock
- `_events_from_cancel()` (lines 455, 480)
- `_event_from_row()` (line 546)
- `_fill_event_from_row()` (line 570)
The exchange's HTTP response includes timestamps (`transactTime`, `updateTime`) that are authoritative. These are available in the raw response dict (stored in `raw_payload`) but are never extracted as the event timestamp. Clock skew between the local machine and the exchange is invisible — event timestamps may be ahead of or behind exchange time.
**Severity: Medium**
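Extracting the exchange timestamp could look like the sketch below. It assumes the `transactTime` and `updateTime` fields named above carry epoch milliseconds, which should be verified against the exchange API; `event_timestamp` is a hypothetical helper.

```python
from datetime import datetime, timezone

TIMESTAMP_KEYS = ("transactTime", "updateTime")  # field names from the text above


def event_timestamp(raw_payload, now=None):
    """Prefer an exchange timestamp; fall back to the local clock.

    Assumes the exchange reports epoch milliseconds (verify against the API).
    Returns (timestamp, source) so callers can see which clock was used.
    """
    for key in TIMESTAMP_KEYS:
        value = raw_payload.get(key)
        try:
            millis = int(value)
        except (TypeError, ValueError):
            continue
        if millis > 0:
            return datetime.fromtimestamp(millis / 1000.0, tz=timezone.utc), key
    return (now or datetime.now(timezone.utc)), "local"
```

Returning the source alongside the timestamp makes clock skew observable: a caller can compare exchange time with local time whenever the source is not `"local"`.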
### J6: No monotonic timestamp verification anywhere in the system
No code path in the entire codebase checks whether a new timestamp is >= the previous one for the same asset/slot:
- `process_intent()`: no comparison between intent timestamp and slot's `last_event_time`
- `on_venue_event()`: no check that event timestamp >= previous events
- `TradeSlot.last_event_time` is stored but never validated for monotonicity
- `VenueEvent` timestamps from `pump_venue_events()` are never compared with event history
With NTP clock adjustments, daylight saving time changes, or VM clock drift, timestamps can go backwards. The system has no detection or guard.
**Severity: Low**
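A per-key monotonicity check is small enough to sketch. `MonotonicGuard` is a hypothetical class; the tolerance parameter is an assumption to absorb minor jitter, and whether a regression should be rejected or merely logged is left to the caller.

```python
from datetime import datetime, timedelta, timezone


class MonotonicGuard:
    """Tracks the latest timestamp per key and flags regressions."""

    def __init__(self, tolerance=timedelta(0)):
        self.tolerance = tolerance
        self._last = {}

    def check(self, key, ts):
        """Return True and record ts if it does not go backwards for key."""
        last = self._last.get(key)
        if last is not None and ts + self.tolerance < last:
            return False  # regression: caller decides whether to reject or log
        if last is None or ts > last:
            self._last[key] = ts
        return True
```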
### J7: `rebuild_indexes()` silently overwrites duplicate `trade_id` — last slot wins, first becomes invisible
```rust
// ↑ HashMap::insert overwrites; no duplicate check
}
}
}
```
If two slots happen to have the same `trade_id` (not prevented by any invariant check), the index maps to the **last** slot with that trade_id. The first slot becomes invisible to `resolve_slot()`'s trade_id-based fallback. Any venue event for that trade_id with an unspecified or negative `slot_id` always resolves to the last slot.
The `process_intent` ENTER handler checks `slot.trade_id != intent.trade_id` to prevent overwriting a different trade on the same slot — but there's no global uniqueness check across all slots.
**Severity: High**
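A global uniqueness check during index rebuild is sketched below in Python for illustration; the real index lives in the Rust kernel. `rebuild_trade_index` is a hypothetical function, and treating an empty `trade_id` as "no trade" is an assumption of the sketch.

```python
def rebuild_trade_index(slots):
    """Map trade_id -> slot index; raise if a trade_id appears twice.

    slots is a sequence of objects with a trade_id attribute; empty IDs are
    treated as "no trade" and skipped (an assumption of this sketch).
    """
    index = {}
    for i, slot in enumerate(slots):
        trade_id = slot.trade_id
        if not trade_id:
            continue
        if trade_id in index:
            raise ValueError(
                f"duplicate trade_id {trade_id!r} in slots {index[trade_id]} and {i}"
            )
        index[trade_id] = i
    return index
```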
### J8: `resolve_slot()` falls back to slot 0 when all indexes miss — stray event corrupts slot 0
- `slot_id = -1` (negative, so it can't be used as a `usize`)
- Empty `trade_id` (trade not found on new kernel after restart)
- Empty `venue_order_id` and `venue_client_id`
...the event is routed to **slot 0** regardless of which slot it was intended for. If slot 0 is in the middle of a trade, the stray event (e.g., a stale ORDER_ACK from a pre-crash order) overwrites slot 0's state. Combined with I10 (seen_event_ids lost on restart), this is a concrete crash-recovery failure path.
**Severity: High**
### J9: `dita_kernel_get_slot_json` and `dita_kernel_snapshot_json` return null with no diagnostic
**File:** `_rust_kernel/src/lib.rs` (FFI exports)
The intent/event processing paths (`process_intent_json`, `on_venue_event_json`) have **two layers** of error handling — parse errors produce a structured `invalid_intent_cstring()` diagnostic JSON, and serialization errors also produce diagnostics.
But the slot/snapshot read functions return bare null pointers:
```rust
// dita_kernel_get_slot_json (line 1608):
Err(_) => ptr::null_mut() // ← no diagnostic
// dita_kernel_snapshot_json (line 1765):
Err(_) => ptr::null_mut() // ← no diagnostic
```
The Python caller (`_RustKernelLib.get_slot_json`, line 164) checks `if not raw: raise IndexError(...)` — so null is caught, but the IndexError provides no detail about why it failed. If `snapshot()` returns null (serialization failure with f64 NaN/Inf in some slot), the Python code gets a bare IndexError or RuntimeError with no diagnostic.
**Severity: Medium**
### J10: Two processes with same `DITA_V2_PREFIX` corrupt shared Zinc memory
Two processes on the same machine with the same prefix will:
1. Attach to the same named shared memory regions
2. Overwrite each other's slot state, intents, and control settings
3. Race on concurrent writes — last writer wins with no coordination
4. One process's `create=True` conflicts with another's — `SharedRegion.create()` may fail
There is no prefix uniqueness validation, no PID suffix, no UUID, no lock file, no access control. The prefix defaults to `"dita_v2"` — trivially guessable.
**Severity: High**
### J11: `load_dotenv()` only runs when `launcher.py` is imported — env vars unset for other module paths
```python
load_dotenv(PROJECT_ROOT / ".env")  # only runs on `import launcher`
# control.py (at function call time):
raw = os.environ.get("DITA_V2_CONTROL_PLANE")  # reads env var
# projection.py (at function call time):
raw = os.environ.get("DITA_V2_HAZELCAST")  # reads env var
```
If any code imports `from .control import build_control_plane` directly (without first importing `launcher.py`), `load_dotenv()` has not run. The `.env` file is never loaded. Env vars that should have been set from `.env` are absent.
This creates an ordering dependency: module import order determines whether config files are loaded. Different import paths can produce different runtime behavior.
**Severity: Medium**
### J12: `BINGX_API_KEY`/`BINGX_SECRET_KEY` passed as `None` with no validation — fails at HTTP time
**File:** `launcher.py:195-196`
```python
api_key=os.environ.get("BINGX_API_KEY"), # None if unset
secret_key=os.environ.get("BINGX_SECRET_KEY"), # None if unset
```
When keys are unset, `None` is passed to `BingxExecClientConfig` and then to the HTTP client. No validation occurs at config/build time. The system:
1. Successfully builds a full `DITAv2LauncherBundle` with empty keys
2. Creates an `ExecutionKernel`
3. The first trade's `venue.submit()` call sends an HTTP request to BingX with empty auth
This is a **late failure** — the operator has no indication of misconfiguration until the first trade attempt. Fast failure at launcher time would catch this.
Also: `gen_live_tests.py:116-117` and `gen2.py:320` use bracket access `os.environ["BINGX_API_KEY"]` which crashes with `KeyError` if the var is missing — an inconsistent pattern (crash immediately vs fail at HTTP time).
**Severity: Medium**
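Fail-fast validation at launcher time could be as small as the following sketch. `require_env` is a hypothetical helper; it treats empty and whitespace-only values as missing, which is an assumption about the desired policy.

```python
import os


def require_env(names, environ=None):
    """Return {name: value} or raise RuntimeError naming every missing var."""
    environ = os.environ if environ is None else environ
    values = {name: (environ.get(name) or "").strip() for name in names}
    missing = sorted(name for name, value in values.items() if not value)
    if missing:
        raise RuntimeError(
            "missing required environment variables: " + ", ".join(missing)
        )
    return values
```

Calling `require_env(["BINGX_API_KEY", "BINGX_SECRET_KEY"])` before building the bundle would turn the late HTTP-time failure into a boot-time one. The error message lists only variable names, never values.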
### J13: API credentials never masked in error messages or tracebacks
3. Config object stored as Python attribute — accessible via `repr()`, `str()`, error tracebacks
No code masks, redacts, truncates, or otherwise protects the API key or secret key. If an exception propagates and the traceback includes the config object (through local variables, frame inspection, or exception chaining), the credentials are exposed in logs.
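Keeping secrets out of `repr()` is straightforward with dataclass field options. The sketch below uses a hypothetical `Credentials` holder; it does not reflect the actual config class.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Credentials:
    """Hypothetical holder that keeps secrets out of repr() and str()."""
    api_key: str = field(repr=False)
    secret_key: str = field(repr=False)

    def masked_key(self):
        """Short fingerprint that is safe to log."""
        return self.api_key[:4] + "..." if len(self.api_key) > 8 else "****"
```

This only covers the object's own string forms. Raw strings held in local variables can still appear in tools that dump frame locals, so it is a partial mitigation.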
The generated live-test code also embeds credentials literally:
| Set to `""` (empty) | `False` (empty not in truthy set) |
| Set to `" "` (whitespace) | `False` |
Setting `DITA_V2_DEBUG_CLICKHOUSE=""` (intending "don't set, use default") actually forces it to `False`, **overriding** the default. And setting `DITA_V2_DEBUG_CLICKHOUSE=" "` (whitespace accidentally) does the same. The operator would need to know that empty and whitespace are treated as explicit falsy values, not as "unset."
**Severity: Low**
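A tri-state parser would distinguish "unset or blank" from an explicit false. In the sketch below, `env_flag` is a hypothetical helper and the `TRUTHY`/`FALSY` vocabularies are assumptions, not the sets the source uses.

```python
import os

TRUTHY = {"1", "true", "yes", "on"}    # assumed vocabulary, not the source's
FALSY = {"0", "false", "no", "off"}


def env_flag(name, default, environ=None):
    """Parse a boolean env var; unset, empty, or blank means 'use default'."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None or not raw.strip():
        return default
    value = raw.strip().lower()
    if value in TRUTHY:
        return True
    if value in FALSY:
        return False
    raise ValueError(f"{name}={raw!r} is not a recognized boolean")
```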
### J15: `gen2.py` and `_gen_test.py` both write to the same output file — last writer wins
`_gen_test.py` is more complete (includes `_inspect_outcome`, `_assert_accepted`, `_check_slot_accounting`, `_build_fresh_kernel_from_slot` helpers). `gen2.py` is simpler (no helpers). The last file to execute determines what the test file contains.
If `gen2.py` runs last, the helpers from `_build_pink_extended.py` and `_build_pink_bodies.py` are lost — their patches to the generated file become stale updates to a now-overwritten file. The `_check_slot_accounting` assertions in 14 body functions silently become dead code.
**Severity: Medium**
### J16: Shim test bridge has no `step()`, `decision_engine`, `intent_engine` — zero fidelity to production runtime
**File:** (generated in `_build_rb` sections across all test generators)
The Shim provides none of `PinkDirectRuntime`'s capabilities:
- No `step()` method — tests call `k.process_intent()` directly
- No `data_feed` — tests must provide prices manually
- No `decision_engine` — tests construct intents manually
- No `intent_engine` — no intent sizing/validation
- No lifecycle beyond connect/disconnect
The test suite effectively tests `ExecutionKernel` in isolation, not the full runtime pipeline. Any bug in the decision→intent→kernel→fill→persist chain that passes through `step()` is invisible to these tests.
**Severity: High**
---
## Pass 7 Summary
| # | Flaw | Layer | Severity |
|---|------|-------|----------|
| J1 | `_flatten` submits wrong direction for LONG positions | Test | Medium |
| J2 | `_check_slot_accounting` double-counts unrealized PnL | Test | Medium |
| J3 | `_build_live_snapshot` timestamp is float vs datetime — type crash risk | Data Feed | **High** |
| J4 | `ExecutionKernel.mark_price()` never called — no mark-to-market | Bridge | **High** |
| J5 | All VenueEvent timestamps use local clock, not exchange timestamp | Venue | Medium |
| J6 | No monotonic timestamp verification anywhere | All | Low |
| J7 | `rebuild_indexes()` overwrites duplicate trade_id — last wins, first invisible | Rust | **High** |
| J8 | `resolve_slot()` falls back to slot 0 — stray event corrupts slot 0 | Rust | **High** |
| J9 | `get_slot_json`/`snapshot_json` return null with no diagnostic | Rust | Medium |
| J10 | Two processes with same DITA_V2_PREFIX corrupt shared Zinc memory | Zinc | **High** |
| J11 | `load_dotenv()` only runs on launcher.py import — ordering dependency | Config | Medium |
| J12 | BINGX_API_KEY passed None with no validation — fails at HTTP time | Config | Medium |
| J13 | API credentials never masked in error messages or tracebacks | Config | **High** |
## PASS 8 — OBSERVABILITY, MEMORY, TIME, DEAD CODE, MODULE INIT
### K1: Zero stdout/stderr output — system is completely silent
No production code path emits any stdout or stderr output. Zero `print()`, zero `logging` output, zero `warnings.warn()`. The system runs with zero operator-visible evidence of being alive. If Hazelcast and ClickHouse are both disabled, the system is a black box — no logs, no metrics, no health checks, no output of any kind.
`logging` is imported in exactly one file (`bingx_user_stream.py`) with no root logger configuration anywhere. Even those logging calls produce no output without `logging.basicConfig()`.
**Severity: Critical**
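A minimal sketch of a root logger setup, assuming a single entry point calls it once at startup. The helper name `configure_logging` and the logger name `dita_v2.launcher` are hypothetical and do not come from the codebase.

```python
import io
import logging
import sys


def configure_logging(level=logging.INFO, stream=None):
    """Attach one stream handler to the root logger (hypothetical helper)."""
    handler = logging.StreamHandler(stream if stream is not None else sys.stderr)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    root = logging.getLogger()
    root.setLevel(level)
    root.addHandler(handler)
    return handler


# Usage: write into a buffer to show that records now reach a handler.
buffer = io.StringIO()
handler = configure_logging(stream=buffer)
logging.getLogger("dita_v2.launcher").info("kernel started")  # assumed logger name
logging.getLogger().removeHandler(handler)
```

With a handler on the root logger, the existing `logging` calls in `bingx_user_stream.py` would also start producing output without further changes.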
### K2: No health check endpoint, no metrics, no monitoring surface
There are zero:
- HTTP health check endpoints (`/health`, `/ready`)
- Prometheus metrics endpoints
- Statsd/Graphite reporters
- Periodic heartbeats
- Liveness/readiness probes
- Process manager integration (no systemd unit, no supervisor config, no container healthcheck)
The only monitoring surface is programmatic — calling `kernel.snapshot()` from Python code with access to the same `ExecutionKernel` instance. For cross-process monitoring, the operator must write custom code to read Zinc shared memory regions and parse the undocumented JSON packets.
**Severity: Critical**
### K3: Failed trades produce no notification — error exists only in return value
- The failure is not persisted to any durable store (unless debug_clickhouse is enabled and sink is configured)
- If the caller (strategy/algo) doesn't inspect the return value, the failure is completely invisible
- There is no alert mechanism, no error counter, no dead-letter queue
**Severity: High**
### K4: Exception tracebacks not captured in production — all `except:` blocks swallow silently
Every `except Exception: pass` and `except Exception: continue` in the codebase discards the full Python traceback. There is no logging infrastructure to capture it. When an exception occurs:
- `launcher.py:187`: RealZincPlane init failure → traceback lost
- `rust_backend.py:102`: `__del__` exception → traceback lost
- `bingx_venue.py:51`: `_row_float` conversion failure → traceback lost
- `bingx_venue.py:325`: slot lookup failure → traceback lost
- `bingx_venue.py:350`: cancel HTTP error → traceback lost
- `control.py:213`: control plane fallback → traceback lost
- All real_control_plane.py try/except blocks → traceback lost
The only exception information that survives is the final exception message in `BingxHttpError` (converted to a dict) and Rust kernel diagnostic codes (structured JSON). Full Python tracebacks are invisible.
**Severity: High**
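One way to keep the fallback behavior while preserving the traceback is `logger.exception()`, which logs at ERROR level and appends the active traceback. The sketch below is illustrative only: `row_float_logged` is a hypothetical stand-in, not the real `_row_float`, and the logger name is assumed.

```python
import io
import logging

logger = logging.getLogger("dita_v2.example")  # hypothetical logger name


def row_float_logged(row, key, default=0.0):
    """Convert row[key] to float; on failure return the default and log the traceback."""
    try:
        return float(row[key])
    except Exception:
        # logger.exception() logs at ERROR level and appends the full traceback.
        logger.exception("float conversion failed for key=%r", key)
        return default


# Usage: route the logger to a buffer so the captured traceback is visible.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
logger.addHandler(handler)
logger.propagate = False
value = row_float_logged({"price": "not-a-number"}, "price")
logger.removeHandler(handler)
```

The caller still receives the default value, so control flow is unchanged; only the diagnostic is no longer discarded.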
### K5: ~85+ Python objects allocated per `process_intent()` call — 36 TradeSlot copies via JSON round-trip
Every `_get_slot()` call does a full JSON serialization (Rust) → C FFI → JSON parse (Python) → new `TradeSlot` dataclass. A single ENTER intent with 2 venue events results in approximately:
No caching exists — every `_get_slot()` call goes through the full FFI round-trip. Multiple calls within the same `process_intent()` invocation fetch the same slot data multiple times.
```python
# KernelStateView and KernelSlotView both hold strong references:
self._kernel = kernel  # strong reference
```
This forms `ExecutionKernel → state → slots[]._kernel → ExecutionKernel`. Python's refcounting cannot free this cycle — it depends on the generational GC. The `__del__` method on `ExecutionKernel` (which destroys the Rust `KernelHandle`) fires at an unpredictable time, potentially long after the last explicit reference to the kernel is dropped.
**Severity: High**
### K7: `MemoryKernelJournal` silently drops transitions after 10,000 rows — no warning, no rollover
```python
def record(self, row):
    if len(self.rows) < self.capture_limit:  # capture_limit = 10,000
        self.rows.append(dict(row))
    # else: silently no-op; every subsequent transition is lost
```
After 10,000 transitions, `record()` becomes a no-op. No error, no warning, no FIFO eviction, no rollover to disk. In a production system with 10+ transitions per trade and 100+ trades/day, the journal dies in ~10 days. At that point, all field debugging/troubleshooting capability is silently lost.
Each row holds a full `slot.to_dict()` (~1 KB) plus event/control payloads. The 10,000 rows retain ~10-15 MB permanently.
**Severity: High**
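A minimal sketch of FIFO eviction with a visible drop counter, assuming that losing the oldest rows is preferable to losing the newest. `BoundedJournal` and its `dropped` attribute are hypothetical names, not part of `MemoryKernelJournal`.

```python
from collections import deque


class BoundedJournal:
    """Keeps the most recent `capture_limit` rows and counts what it evicts."""

    def __init__(self, capture_limit=10_000):
        self.rows = deque(maxlen=capture_limit)
        self.dropped = 0  # number of oldest rows evicted so far

    def record(self, row):
        if len(self.rows) == self.rows.maxlen:
            self.dropped += 1  # the deque evicts the oldest row on append
        self.rows.append(dict(row))


# Usage: a limit of 3 with 5 records keeps the last 3 and reports 2 evictions.
journal = BoundedJournal(capture_limit=3)
for seq in range(5):
    journal.record({"seq": seq})
```

Memory stays bounded as before, but recent transitions remain available for debugging and the `dropped` count makes the loss observable.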
### K8: `RealZincPlane._intent_cache` Python list unbounded — only shared memory write is bounded
The shared memory write limits to the last 512 entries. But the Python `_intent_cache` list grows unbounded — every intent ever published remains in memory forever. After 1M intents: ~1M dict objects, ~500 MB+ of Python memory.
Note: `InMemoryZincPlane.intent_region` has the same unbounded growth (already documented as F12).
```python
if not self._snapshot_ready.wait(timeout=timeout_ms / 1000.0):  # wall clock!
    return self._last_snapshot  # stale data
```
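`threading.Event.wait(timeout)` can follow the system wall clock: on CPython before 3.11 on Linux, lock timeouts are implemented with `sem_timedwait()`, which measures against `CLOCK_REALTIME`. For a 5-second timeout, if NTP steps the clock: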
- **Forward** (e.g., +2 seconds): the timeout is truncated to ~3 seconds — spurious timeout, stale snapshot returned
- **Backward** (e.g., -2 seconds): the timeout extends to ~7 seconds — caller blocks longer than expected
The correct pattern is `time.monotonic()` with a deadline loop — which `InMemoryControlPlane.wait()` already uses correctly. The `_backend_snapshot` timeout is the single highest-impact site because it controls whether the venue adapter returns fresh or stale exchange state.
**Severity: High**
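The deadline-loop pattern can be sketched as follows. This is an illustration under assumptions, not a copy of `InMemoryControlPlane.wait()`; the function name and the `poll_s` slice length are hypothetical.

```python
import threading
import time


def wait_with_monotonic_deadline(event, timeout_s, poll_s=0.05):
    """Wait for `event` until a time.monotonic() deadline, in short slices."""
    deadline = time.monotonic() + timeout_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return event.is_set()
        # The overall deadline is monotonic; each wait is only a short slice of it.
        if event.wait(timeout=min(poll_s, remaining)):
            return True


# Usage: the first call times out, the second returns immediately.
ready = threading.Event()
result_unset = wait_with_monotonic_deadline(ready, timeout_s=0.1)
ready.set()
result_set = wait_with_monotonic_deadline(ready, timeout_s=0.1)
```

A forward clock step can cut one slice short, but the loop then re-checks the monotonic deadline instead of returning stale data early.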
### K10: `RealZincControlPlane.wait()` uses wall clock — no monotonic guarantee
The `SharedRegion.wait()` implementation (external) uses wall clock. Same NTP sensitivity as K9, though lower impact (controls shared memory synchronization, not exchange data freshness).
**Severity: Medium**
### K11: `exchange_ts` falls back to local `time.time()` when exchange timestamp `E` is missing
```python
# bingx_user_stream.py:278
ts = int(frame.get("E") or time.time() * 1000) # local clock fallback
```
When the exchange's WebSocket frame lacks the `E` (event time) field, the code substitutes the local machine's wall clock. Two problems:
1. Local clock may differ from exchange clock by seconds or minutes (VM drift)
2. `time.time()` is wall-clock, so it is subject to NTP backward jumps
Events that lack `E` will have timestamps from a different clock source than events that have `E`. This creates ordering paradoxes in any downstream consumer that sorts by timestamp.
**Severity: Medium**
### K12: No monotonic timestamp verification anywhere in the system
Zero code paths check whether timestamps progress forward:
- `process_intent()`: no comparison between intent timestamp and slot's `last_event_time`
- `on_venue_event()`: no check that event timestamp >= previous events
- `AccountProjectionV2._build()`: no monotonicity check on `ReconcileResult.ts`
- Rust kernel: `last_event_time = Some(event.timestamp)` is stored but never validated
NTP backward jumps, clock skew, or VM migration all can produce decreasing timestamps. The system has no detection, no guard, no warning log.
**Severity: Medium**
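A minimal sketch of a per-key timestamp guard, assuming that detection and counting are enough as a first step (it does not decide whether a regressed event should be rejected). `TimestampGuard` is a hypothetical name.

```python
class TimestampGuard:
    """Tracks the last timestamp seen per key and flags regressions."""

    def __init__(self):
        self._last = {}
        self.regressions = 0

    def check(self, key, ts):
        """Return True if `ts` is >= the last timestamp seen for `key`."""
        last = self._last.get(key)
        if last is not None and ts < last:
            self.regressions += 1
            return False  # keep the previous high-water mark
        self._last[key] = ts
        return True


# Usage: the third timestamp goes backward and is flagged.
guard = TimestampGuard()
results = [guard.check("slot-0", ts) for ts in (1000, 1005, 1003, 1005)]
```

Equal timestamps are accepted here; whether ties should count as regressions is a policy choice the source does not specify.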
### K13: `ControlPlane.wait()` and `notify()` have zero callers across all implementations — dead protocol surface
The `ControlPlane` protocol defines `wait(timeout_ms=1000)` and `notify()`. Both are implemented by `InMemoryControlPlane`, `ZincControlPlane`, and `RealZincControlPlane`. But **zero callers exist** in production code:
- `ExecutionKernel` never calls `self.control_plane.wait()` or `.notify()`
- `launcher.py` never calls them
- No test exercises them
Combined with the protocol methods having real implementations (with monotonic clock logic in `InMemoryControlPlane`), this is ~40 lines of dead-but-maintained code.
Similarly: all 7 `ZincPlane` wait/notify methods (`wait_on_intent`, `notify_intent`, `wait_on_state`, `notify_state`, `wait_on_control`, `notify_control`, `read_slots`) have zero callers — dead protocol surface.
**Severity: Informational**
### K14: `AccountProjection.to_account_event()` has zero callers
```python
# account.py:86
def to_account_event(self, metadata=None):
    ...
```
Defined, never called anywhere in production code or tests. Dead code.
**Severity: Informational**
### K15: `HazelcastProjector` entire class dead — zero callers
```python
# hazelcast_projection.py:18-48
class HazelcastProjector:
    def publish_slot(self, slot): ...
    def publish_event(self, event_type, payload): ...
```
Both methods have zero callers anywhere in the codebase. The class can never be constructed from any production code path. The actively-used projection class is `HazelcastRowWriter`.
**Severity: Informational**
### K16: `_order_to_payload()` dead code
```python
# rust_backend.py:220
def _order_to_payload(order):
    ...
```
Defined, never called. Serializing a `VenueOrder` to dict is done inline in `TradeSlot.to_dict()` (contracts.py:127-134), not via this function.
**Severity: Informational**
### K17: `MirroredControlPlane` entire class dead — never constructed
```python
# control.py:171-184
class MirroredControlPlane:
    def __init__(self, inner, mirror_sink=None): ...
```
`build_control_plane()` never returns a `MirroredControlPlane`. The class can only be constructed if someone explicitly instantiates it — no code path does. Similarly, `KernelJournal` protocol is never used as a type annotation outside `journal.py`.
**Severity: Informational**
### K18: 12 of 20 `TradeStage` variants never matched in Rust FSM logic
Defined in the Rust `string_enum!` but never matched in any `process_intent` or `on_venue_event` arm:
Only 8 variants are used in FSM logic: `IDLE`, `ORDER_REQUESTED`, `ENTRY_WORKING`, `POSITION_OPEN`, `EXIT_REQUESTED`, `EXIT_WORKING`, `CLOSED`, `STALE_STATE_RECONCILING`. The other 12 are serialization-only: they exist in the enum but the kernel never transitions a slot to them.
**Severity: Low**
### K19: Unused imports in `projection.py` and `hazelcast_projection.py`
These are carryovers from earlier code versions. They add no runtime cost (import is cached after first load) but indicate stale code structure.
**Severity: Informational**
### K20: `sys.path` mutation on import — importing the package appends Zinc path globally
Both `real_control_plane.py:13-15` and `real_zinc_plane.py:22-24` do:
```python
if _ZINC_ADAPTER_PATH.exists() and str(_ZINC_ADAPTER_PATH) not in sys.path:
    sys.path.append(str(_ZINC_ADAPTER_PATH))
```
This fires at module import time as a side effect of importing `__init__.py` (through the chain: `__init__` → `launcher` → `real_control_plane`/`real_zinc_plane`). It modifies the process-global `sys.path`, which persists for the entire process lifetime. If the Zinc adapter path shadows or conflicts with other modules, the consequences are global and hard to debug.
**Severity: Medium**
### K21: `load_dotenv()` runs at module import time — mutates `os.environ` as side effect
This fires on `import launcher` (which happens via `__init__.py`). Mutates `os.environ` process-globally. Tests that need to set specific env vars must import `launcher` first to get `.env` loaded, then override — or the `.env` values win. Also: if `.env` doesn't exist, `load_dotenv()` silently does nothing, and the import dependency shifts — importing the package may or may not load `.env` depending on filesystem state.
**Severity: Medium**
### K22: `ControlPlane` protocol not in `__init__.py.__all__`
```python
# __init__.py (__all__)
"ControlPlane" not in __all__ # ← hidden from star imports
```
`from prod.clean_arch.dita_v2 import *` exports 44 names but does NOT include `ControlPlane` (the main interface type). Concrete implementations (`InMemoryControlPlane`, `RealZincControlPlane`, etc.) are all exported. The protocol class itself is hidden.
**Severity: Informational**
### K23: `KernelSlotView.__getattr__` makes a ctypes call per attribute access — no caching
```python
# rust_backend.py:422-426
def __getattr__(self, name):
    slot = self._snapshot()  # FFI round-trip every time
    if hasattr(slot, name):
        return getattr(slot, name)
    raise AttributeError(name)
```
Every attribute access on a `KernelSlotView` (e.g., `slot.size`, `slot.fsm_state`, `slot.trade_id`) does a full JSON round-trip to the Rust kernel. The `_snapshot()` method calls `self._kernel._get_slot(self._slot_id)` which calls `_get_rust().get_slot_json()` → Rust serializes slot to JSON → Python parses → creates new `TradeSlot` → attribute is read from the new object.
Accessing 5 fields on a `KernelSlotView` does 5 FFI round-trips. There is no caching of the deserialized `TradeSlot` between accesses.
**Severity: Medium**
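A sketch of snapshot caching, assuming callers can tolerate an explicit `refresh()` after any operation that mutates the slot. `CachedSlotView` is hypothetical, and `fetch_slot` stands in for the `_get_slot()` FFI round-trip.

```python
from types import SimpleNamespace


class CachedSlotView:
    """Fetches the slot once and serves attribute reads from that snapshot."""

    def __init__(self, fetch_slot):
        self._fetch_slot = fetch_slot  # callable standing in for the FFI round-trip
        self._cached = None

    def refresh(self):
        self._cached = self._fetch_slot()
        return self

    def __getattr__(self, name):
        # __getattr__ only runs for attributes not found by normal lookup.
        if name.startswith("_"):
            raise AttributeError(name)
        if self._cached is None:
            self.refresh()
        return getattr(self._cached, name)


# Usage: three field reads cost one fetch instead of three.
calls = []


def fake_fetch():
    calls.append(1)
    return SimpleNamespace(size=1.5, fsm_state="POSITION_OPEN", trade_id="t1")


view = CachedSlotView(fake_fetch)
fields_read = (view.size, view.fsm_state, view.trade_id)
```

The trade-off is staleness: a cached view reflects the slot at fetch time, so it must be refreshed after `process_intent()` or `on_venue_event()`.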
---
## Pass 8 Summary
| # | Flaw | Layer | Severity |
|---|------|-------|----------|
| K1 | Zero stdout/stderr — system completely silent | All | **Critical** |
| K2 | No health check, metrics, or monitoring surface | All | **Critical** |
| K3 | Failed trades produce no notification — error in return value only | Bridge | **High** |
| K4 | Exception tracebacks not captured — all except:pass swallow silently | All | **High** |
| K5 | ~85+ Python objects per process_intent — 36 TradeSlot copies via FFI | Bridge | Medium |
No validation that `accepted=True` implies `diagnostic_code=OK`, or that `accepted=False` implies a non-OK diagnostic code. If the Rust kernel ever returns contradictory values (e.g., `{"accepted": true, "diagnostic_code": "INVALID_INTENT"}`), Python silently accepts them. The default for both `KernelOutcome.diagnostic_code` and `_outcome_from_payload` fallback is `OK` — an `accepted=False` with no explicit `diagnostic_code` would silently show `OK`.
Similarly, `KernelTransition` has no FSM validation — any `(prev_state, next_state)` pair is accepted, even impossible transitions like `IDLE → POSITION_CLOSED`.
**Severity: Medium**
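The missing invariant can be sketched as a small validator. It assumes `OK` is the only diagnostic code that accompanies `accepted=True`, as the finding implies; `validate_outcome_payload` is a hypothetical helper, not an existing function.

```python
def validate_outcome_payload(payload):
    """Reject payloads whose `accepted` flag contradicts `diagnostic_code`."""
    if "accepted" not in payload or "diagnostic_code" not in payload:
        raise ValueError("outcome payload must carry accepted and diagnostic_code")
    accepted = bool(payload["accepted"])
    code = payload["diagnostic_code"]
    if accepted != (code == "OK"):  # assumption: OK is the only success code
        raise ValueError(
            f"contradictory outcome: accepted={accepted}, diagnostic_code={code}"
        )
    return payload


def is_valid(payload):
    try:
        validate_outcome_payload(payload)
        return True
    except ValueError:
        return False


# Usage: consistent, contradictory, and incomplete payloads.
checks = [
    is_valid({"accepted": True, "diagnostic_code": "OK"}),
    is_valid({"accepted": True, "diagnostic_code": "INVALID_INTENT"}),
    is_valid({"accepted": False}),
]
```

Requiring both keys removes the silent `OK` default for a rejected outcome that carries no diagnostic code.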
### L2: `VenueEvent.filled_size > VenueEvent.size` possible — `_fill_event_from_row` uses different source fields
`size` comes from `executedQty` (cumulative) while `filled_size` comes from `lastFilledQty` (incremental). If `lastFilledQty > executedQty` (exchange-side rounding, partial fill of a partially-cancelled order), `filled_size > size`. The Rust kernel's `apply_fill` uses `event.filled_size` for PnL and position adjustment — an oversized fill could over-count position reduction.
Also: `VenueOrder.filled_size > intended_size` is possible via `_venue_order_from_row()` (lines 157-163) when the exchange reports `executedQty > origQty`.
**Severity: Medium**
### L3: `VenueEvent.price=0` can reach the kernel from multiple paths
The Rust kernel's `realized_pnl()` guards against `entry_price <= 0.0` and `exit_size <= 0.0`, but `exit_price=0` in a fill event produces `delta = (0 - entry) / entry = -1.0`. For LONG: PnL = -1.0 * notional → -100% of position. A zero-price fill event would register as a total loss.
The `mark_price()` function guards against `price <= 0`, so unrealized PnL is safe. But realized PnL from a zero-price fill is not guarded.
**Severity: High**
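The arithmetic can be worked through in a few lines. This is a Python model of the LONG formula described above, not the Rust `realized_pnl()`; both function names are hypothetical, and the guarded variant shows one possible fix.

```python
def unguarded_pnl_long(entry_price, exit_price, notional):
    """LONG PnL as fractional price change times notional, with no exit-price guard."""
    return (exit_price - entry_price) / entry_price * notional


def guarded_pnl_long(entry_price, exit_price, notional):
    """Same formula, but a non-positive price on either side is rejected."""
    if entry_price <= 0.0 or exit_price <= 0.0:
        raise ValueError("entry_price and exit_price must be positive")
    return (exit_price - entry_price) / entry_price * notional


# A zero-price fill on a 1000 USDT LONG entered at 100 books the full notional as a loss.
loss = unguarded_pnl_long(entry_price=100.0, exit_price=0.0, notional=1000.0)

try:
    guarded_pnl_long(entry_price=100.0, exit_price=0.0, notional=1000.0)
    rejected = False
except ValueError:
    rejected = True
```

Rejecting the event at the boundary keeps a malformed fill from being settled as a real loss.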
### L4: `BingxUserStream` — `available_margin` set to `cw` (cross wallet balance) instead of `crossWalletBalance - usedMargin`
**File:** `bingx_user_stream.py:336`
```python
available_margin=cw # cw = cross wallet balance, NOT available margin
```
In BingX's `ACCOUNT_UPDATE` frame, `"cw"` is the cross wallet balance (total equity), not the available margin. Available margin = `crossWalletBalance - usedMargin`. The `ExchangeEvent.available_margin` field receives the wrong value. This flows into the dual-ledger accounting's `EBlock.available_margin` — if used for reconcile rules, the exchange-side `available_margin` is overstated.
**Severity: High**
### L5: `BingxUserStream` — `wallet_balance` silently defaults to 0 when `"wb"` is absent
**File:** `bingx_user_stream.py:334`
```python
wallet = _safe_float(usdt_bal.get("wb") or usdt_bal.get("walletBalance"))
```
If neither `"wb"` nor `"walletBalance"` exists in the USDT balance object (possible for some account types or frame formats), the expression evaluates to `_safe_float(None)`, which returns `0.0`. The exchange wallet balance is silently zeroed, making the E-side of the dual-ledger reconciliation see `wallet_balance=0` when the actual balance is positive. This always produces an ERROR reconcile status (R1: capital >> 0 vs wallet=0).
**Severity: High**
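A sketch that separates "absent" from "zero", assuming the caller can skip or defer reconciliation when the balance is unknown. `parse_wallet_balance` is a hypothetical helper, not part of `bingx_user_stream.py`.

```python
def parse_wallet_balance(usdt_bal):
    """Return the wallet balance as float, or None when neither key is present."""
    for key in ("wb", "walletBalance"):
        raw = usdt_bal.get(key)
        if raw is not None:
            return float(raw)
    return None  # absent: the caller should not treat this as a 0.0 balance


# Usage: a present value, a genuine zero, and a frame with neither key.
present = parse_wallet_balance({"wb": "25000.5"})
zero = parse_wallet_balance({"walletBalance": 0})
missing = parse_wallet_balance({"cw": "25000.5"})
```

Returning `None` lets the reconcile step tell a missing field apart from an empty wallet instead of raising a false ERROR.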
### L6: `BingxUserStream` — `_keepalive_loop` has no stop mechanism — runs forever on old listen key after rotation
**File:** `bingx_user_stream.py:394-405`
```python
async def _keepalive_loop(self, listen_key):
    while True:
        await asyncio.sleep(self._keepalive_secs)
        await self._http.signed_put_raw(...)
```
The keepalive loop is an `asyncio.Task` with no stop signal. When the 24h rotation creates a new listen key, the old keepalive task keeps sending PUT requests to the old (now-deleted) listen key indefinitely. BingX returns errors for keepalive on deleted keys — these errors are suppressed by `with suppress(Exception)` in the delete path but NOT in the keepalive path. The keepalive loop's errors are unhandled.
**Severity: Medium**
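A stoppable keepalive loop can be sketched with an `asyncio.Event`. The names `keepalive_loop`, `send_keepalive`, and `stop` are hypothetical; the real loop would pass the signed PUT call as `send_keepalive`.

```python
import asyncio


async def keepalive_loop(send_keepalive, interval_s, stop):
    """Call `send_keepalive` every `interval_s` seconds until `stop` is set."""
    while not stop.is_set():
        try:
            # Sleep by waiting on the stop event so shutdown is immediate.
            await asyncio.wait_for(stop.wait(), timeout=interval_s)
        except asyncio.TimeoutError:
            await send_keepalive()


async def demo():
    sent = []

    async def fake_put():
        sent.append("PUT")

    stop = asyncio.Event()
    task = asyncio.create_task(keepalive_loop(fake_put, 0.01, stop))
    await asyncio.sleep(0.05)
    stop.set()  # on rotation: stop the old key's loop before starting a new one
    await task
    count_at_stop = len(sent)
    await asyncio.sleep(0.03)
    return count_at_stop, len(sent)


count_at_stop, count_later = asyncio.run(demo())
```

Awaiting the task after `stop.set()` also surfaces any exception raised by the keepalive call, which the current fire-and-forget task leaves unhandled.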
### L7: `BingxUserStream` `event_id`: integer 0 from `frame.get("i")` is falsy in the `or` chain, so a random UUID is generated
**File:** `bingx_user_stream.py:283`
```python
event_id = str(frame.get("i") or frame.get("event_id") or uuid.uuid4().hex)
```
If `frame.get("i")` returns integer `0` (a valid event ID in some BingX frames), the value is falsy, so the `or` chain skips it before `str()` is ever applied and falls through to `frame.get("event_id")` and then `uuid.uuid4().hex`. The real event ID is lost, and downstream event dedup sees a random UUID instead of the exchange's ID.
**Severity: Medium**
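The difference between a truthiness check and a presence check can be shown directly. `event_id_from_frame` is a hypothetical replacement, sketched under the assumption that any non-`None` value is a usable ID.

```python
import uuid


def event_id_from_frame(frame):
    """Pick the first key that is present, treating 0 as a valid ID."""
    for key in ("i", "event_id"):
        raw = frame.get(key)
        if raw is not None:
            return str(raw)
    return uuid.uuid4().hex  # only when the frame carries no ID at all


# The `or` chain drops integer 0; the presence check keeps it.
original = str({"i": 0}.get("i") or "fallback")
fixed = event_id_from_frame({"i": 0})
```

Note that this sketch would also accept an empty string as an ID; whether that should fall through is a separate decision.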
### L8: BingX test URLs hardcoded in test generators — wrong environment if system targets LIVE
Hardcoded `vst` (testnet) URLs. The production `launcher.py` path selects VST vs LIVE via `BingxEnvironment` and `DOLPHIN_BINGX_ENV`, but the test generators hardcode VST. If the system is configured for LIVE and these tests run, they hit the wrong exchange environment.
**Severity: Medium**
### L9: No proxy support — cannot be deployed behind corporate proxy
No code parses `HTTP_PROXY`, `HTTPS_PROXY`, `SOCKS_PROXY` or passes proxy configuration to `aiohttp.TCPConnector` or `ClientSession`. The `aiohttp.ClientSession` in `bingx_user_stream.py` is created without any proxy parameter. Deployment behind a corporate proxy or SOCKS proxy requires code changes.
**Severity: Low** (deployment constraint, not a correctness bug)
### L10: 5-minute DNS cache TTL in WebSocket adapter — stale IPs on infrastructure change
With `ttl_dns_cache=300`, aiohttp caches DNS results for 5 minutes. If BingX changes server IPs during an infrastructure migration or failover, a connector that is still alive keeps resolving to the stale IPs for up to 5 minutes. The connector is recreated on each WS reconnect, so a reconnect starts with an empty cache and picks up the new addresses. The exposure is the case where the system does not reconnect and keeps the existing WS alive: DNS changes then go undetected for up to 5 minutes.
**Severity: Low**
### L11: `getattr(intent, "limit_price", 0.0)` reads from dataclass field, not metadata dict: a metadata-only `limit_price` is lost
**File:** `bingx_venue.py:267`
```python
metadata["_limit_price"] = float(getattr(intent, "limit_price", 0.0) or 0.0)
```
`limit_price` is a real field on `KernelIntent` (contracts.py:257, added in this version; default `0.0`). If the policy layer sets `intent.limit_price = 10.50`, `getattr(intent, "limit_price", 0.0)` returns `10.50` and the line is correct. The `or 0.0` is evaluated inside the `float()` call, so a `None` field is coerced to `0.0` rather than raising `TypeError`.
The gap is the metadata path: `_legacy_intent` (identical to H7) reads only the dataclass field and never checks `intent.metadata.get("limit_price")`. If a caller passes `limit_price` via the metadata dict only, the field stays at its default, `_limit_price` is `0.0`, and the requested price is lost.
- Reconcile validates slot invariants before applying
**New Rust features:**
- `AccountState` dual-ledger struct with K-value vs E-fact reconcile rules
- `on_account_event()` FFI for account-level events
- `set_seed_capital()` FFI
- `INVALID_INTENT` diagnostic code
**Critical finding: The backup still has the entry-fill size overwrite bug (I1), the backward EXIT prev_state bug (G3), and the CANCEL-only-exit-order bug (G10).** These were all fixed in the current code. The backup represents a pre-fix state that would double-settle PnL on partial fills.
**Severity: Informational**
### L13: `_build_full_runtime` in gen_live_tests.py is never called — dead code
This function wires the full production pipeline: `HazelcastDataFeed` + `PinkDirectRuntime` + `DecisionEngine` + `IntentEngine`. But every test function calls `_build_runtime_bundle()` instead, which returns a `_RuntimeShim` with zero fidelity (J16). The real `PinkDirectRuntime` — with `step()`, data feed, decision engine, intent engine — is never instantiated in any test.
Also: `hz_client=build_projection(...)` passes a `HazelcastProjection` (write-side wrapper) where a Hazelcast client object should go — type mismatch.
**Severity: High**
### L14: `BingxUserStream` — `listenKeyExpired` raises RuntimeError instead of clean return — triggers full reconnect
**File:** `bingx_user_stream.py:273`
```python
if frame.get("e") == "listenKeyExpired":
    raise RuntimeError("listenKeyExpired")
```
When the exchange sends `listenKeyExpired`, the code raises `RuntimeError` inside the `_consume()` async generator. This propagates to the outer `subscribe()` loop's `try/except`, which treats it as a connection failure — delays, creates a new listen key, reconnects. The proper behavior is to yield an `ExchangeEvent(kind=RECONNECTED)` and return cleanly, letting the caller handle the rotation without backoff delay.
**Severity: Medium**
### L15: `BingxUserStream` — `_delete_listen_key` suppresses all exceptions — leaked keys on auth failures
**File:** `bingx_user_stream.py:413-416`
```python
async def _delete_listen_key(self, listen_key):
    with suppress(Exception):
        await self._http.signed_delete_raw(...)
```
If the DELETE call fails (invalid signature, expired key, network error), the exception is swallowed. The old listen key remains active on BingX, wasting server resources. Over days of operation with unhandled auth failures, leaked listen keys accumulate server-side.
```rust
let target = if slot.active_entry_order.is_some() {
    slot.active_entry_order.as_mut()
} else {
    slot.active_exit_order.as_mut()
};
```
If an entry order exists (even if fully filled and the slot is in `POSITION_OPEN`), ANY incoming event's `venue_order_id` propagates to the entry order — even if the event is for the exit order. The `active_entry_order` status might be `FILLED` but it's still `Some(...)`, so the exit event's ID goes to the wrong order.
**Severity: Medium**
---
## Pass 9 Summary
| # | Flaw | Layer | Severity |
|---|------|-------|----------|
| L1 | `KernelOutcome(accepted=True, diag=INVALID_INTENT)` parseable — no invariant check | Bridge | Medium |
| L2 | `VenueEvent.filled_size > size` possible via different source fields | Venue | Medium |
| L3 | `VenueEvent.price=0` reaches kernel — zero-price fill = 100% loss PnL | Venue | **High** |
| L4 | `available_margin` set to cross-wallet balance, not available margin | Stream | **High** |
| L5 | `wallet_balance` defaults to 0 when `"wb"` absent — E-side reconcile always ERROR | Stream | **High** |
| L6 | `_keepalive_loop` no stop mechanism — runs on old key after rotation | Stream | Medium |
| L7 | `event_id` integer 0 is falsy in the `or` chain → random UUID generated | Stream | Medium |
| L8 | Hardcoded VST URLs in test generators — wrong env if LIVE configured | Test | Medium |
| L9 | No proxy support — can't deploy behind corporate proxy | Network | Low |
| L10 | 5-minute DNS cache TTL — stale IPs on infrastructure change | Network | Low |
### M1: ENTER transition hardcodes `prev_state = IDLE` — every non-IDLE entry corrupts the audit trail
**File:** `_rust_kernel/src/lib.rs:1117`
```rust
let transition = self.transition(
    &slot,
    TradeStage::IDLE, // HARDCODED: lies about actual prev_state
    slot.fsm_state.clone(),
    "ENTER_INTENT",
);
```
When a slot is entered from `CLOSED` (re-entry) or from any other state that passed the `is_free()` or same-trade bypass, the transition record claims `prev_state = IDLE`. This is **always wrong unless the slot was genuinely IDLE**. Every ENTER transition in the journal for a re-entered slot or a slot coming from CLOSED records an impossible transition (`CLOSED → ORDER_REQUESTED` recorded as `IDLE → ORDER_REQUESTED`).
This corrupts any downstream FSM analysis, journal audit, or trade-lifecycle reconstruction that relies on accurate `prev_state` values.
**Severity: Critical**
### M2: CANCEL intent creates no transition record — invisible in audit log
**File:** `_rust_kernel/src/lib.rs:1287-1305`
The CANCEL branch in `process_intent` returns a `KernelResult` with no call to `self.transition()`. Every other intent (ENTER, EXIT, MARK_PRICE, RECONCILE) records a transition. CANCEL operations — including accepted cancels — are **invisible in the transition audit log**.
Additionally, CANCEL returns `accepted = true` but never mutates the slot's `fsm_state`. The slot stays in whatever state it was in. The caller sees `accepted = true` with no visible effect.
**Severity: Critical**
### M3: `_mk_intent` test helper drops `order_type`/`limit_price` into `metadata` instead of setting proper fields
```python
    metadata=kw,  # order_type="LIMIT" goes into metadata dict, not the dataclass field!
)
```
`KernelIntent` has dedicated fields `order_type: str = "MARKET"` and `limit_price: float = 0.0` (contracts.py:274-275), but `_mk_intent` passes `**kw` as `metadata=kw`. So `_mk_intent(order_type="LIMIT")` produces `intent.order_type == "MARKET"` (the default) while `intent.metadata["order_type"] == "LIMIT"`.
The Flaw 6 tests in `test_flaws.py` that verify `order_type`/`limit_price` preservation through `_legacy_intent` pass for the **wrong reason** — they check `legacy.metadata.get("order_type")` which finds the value in the passthrough metadata, not because `_legacy_intent` correctly reads `intent.order_type`. If the production code changes and the test helper isn't fixed, the tests silently become false positives.
**Severity: High**
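A sketch of a helper that routes known dataclass fields to the constructor and only the remainder to `metadata`. The `KernelIntent` below is a reduced stand-in with assumed fields, and `mk_intent` is a hypothetical rewrite, since the full signature of the real `_mk_intent` is not shown here.

```python
from dataclasses import dataclass, field, fields


@dataclass
class KernelIntent:
    """Reduced stand-in for the real dataclass; only the fields needed here."""
    action: str = "ENTER"
    trade_id: str = ""
    order_type: str = "MARKET"
    limit_price: float = 0.0
    metadata: dict = field(default_factory=dict)


_FIELD_NAMES = {f.name for f in fields(KernelIntent)} - {"metadata"}


def mk_intent(**kw):
    """Route known dataclass fields to the constructor, the rest to metadata."""
    known = {k: v for k, v in kw.items() if k in _FIELD_NAMES}
    extra = {k: v for k, v in kw.items() if k not in _FIELD_NAMES}
    return KernelIntent(metadata=extra, **known)


# Usage: order_type and limit_price land on the fields, `note` stays in metadata.
intent = mk_intent(trade_id="t1", order_type="LIMIT", limit_price=10.5, note="x")
```

With the fields populated, a test that asserts on `intent.order_type` exercises the same path production code reads.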
### M4: `test_cancel_entry_with_partial_fill` never sends a CANCEL — misnamed vacuous test
**File:** `test_flaws.py:161-172`
```python
def test_cancel_entry_with_partial_fill(self):
    k = _fresh_kernel(scenario=MockVenueScenario(partial_fill_ratio=0.5))
    assert slot_after.size > 0, "Should have partial fill"
```
The test is named "Cancel entry with partial fill" and belongs to `TestFlaw1EntryCancel`, but **no CANCEL intent is ever sent**. It only verifies that a partial fill occurred. The test is completely vacuous for its stated purpose.
The same pattern affects Flaw 9 tests — `test_cancel_uses_slot_asset_not_trade_id` and `test_mock_venue_cancel_event_has_asset` both have "cancel" in their names but never call any cancel function.
**Severity: High**
### M5: Flaw 7 tests (`test_entry_exit_different_ratios`, `test_per_action_type_ratios`) never send EXIT
**File:** `test_flaws.py` Flaw 7 test class
Both tests set `exit_partial_fill_ratio` on the mock venue scenario but only ever process an ENTER intent. The `exit_partial_fill_ratio` is configured but never exercised. The tests verify entry partial fill behavior only — they don't test what their titles and class name claim.
The Flaw 10 tests reference a 64-event dedup window, but the actual Rust constant is `MAX_SEEN_EVENT_IDS = 256` (lib.rs:8). The test sends 70 events and asserts `>= 70`. Since `70 < 256`, no eviction occurs. The test passes trivially regardless of whether the old-64-bound flaw exists. To meaningfully test eviction, >256 events would be needed.
Similarly, `test_dedup_eviction_does_not_accept_old_event` sends only 70 events then checks for dedup — with a 256-entry window, the first event is never evicted. The test verifies basic dedup (non-evicted), not eviction behavior.
**Severity: Medium**
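Why 70 events cannot exercise a 256-entry window can be shown with a small Python model. It assumes FIFO eviction, which the source does not confirm for the Rust kernel; `DedupWindow` is hypothetical.

```python
from collections import OrderedDict


class DedupWindow:
    """FIFO window of recently seen event IDs (Python model, not the Rust code)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._seen = OrderedDict()

    def is_duplicate(self, event_id):
        if event_id in self._seen:
            return True
        self._seen[event_id] = None
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)  # evict the oldest ID
        return False


def first_event_still_deduped(capacity, n_events):
    """Send n_events distinct IDs, then replay the first one."""
    window = DedupWindow(capacity)
    for i in range(n_events):
        window.is_duplicate(f"ev-{i}")
    return window.is_duplicate("ev-0")


with_70 = first_event_still_deduped(capacity=256, n_events=70)
with_257 = first_event_still_deduped(capacity=256, n_events=257)
```

Only the 257-event run evicts the first ID, so an eviction test has to send more events than `MAX_SEEN_EVENT_IDS`.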
### M7: `test_outcome_state_matches_actual_slot` is tautological — compares value with itself
**File:** `test_flaws.py:200-210`
```python
result = k.process_intent(_mk_intent(action=E.ENTER, trade_id="oc1"))
slot = k._get_slot(0)
assert result.state == slot.fsm_state, ...
```
`result.state` is set from `final_slot.fsm_state` (which comes from `self._get_slot(outcome.slot_id)` inside `process_intent`). The test then calls `k._get_slot(0)` again. Both read from the same Rust backend — they **must** be equal by construction. This test proves nothing; it's a tautology.
**Severity: Low**
### M8: ORDER_ACK silent fallthrough when no active order — accepts event with no effect
**File:** `_rust_kernel/src/lib.rs:1476-1498`
When `on_venue_event` receives an `ORDER_ACK` for a slot with neither `active_entry_order` nor `active_exit_order` (shouldn't happen normally, but possible after a reconcile or race), the match arm executes **no branch**. The state is unchanged, `diagnostic_code` stays `OK`, and `accepted = true`. The event is silently accepted with no effect — no diagnostic, no warning.
The same bug exists for `CANCEL_ACK` (line 1545): if no matching active order exists, the event is silently accepted with no state change and `OK` diagnostic.
**Severity: Medium**
### M9: ORDER_REJECT on POSITION_OPEN with stale entry order destroys the position
**File:** `_rust_kernel/src/lib.rs:1499-1530`
```rust
KernelEventKind::ORDER_REJECT => {
    if slot.active_entry_order.is_some() && slot.fsm_state != TradeStage::POSITION_OPEN {
        // clear entry, wipe trade data, set IDLE
    } else if slot.active_exit_order.is_some() {
        // clear exit order only, set POSITION_OPEN
    } else {
        // no match: reset to IDLE
    }
}
```
If a slot is in `POSITION_OPEN` (position active) but `active_entry_order` is still `Some` (stale — didn't get cleared on fill), the entry-reject guard `fsm_state != POSITION_OPEN` prevents the entry path. It falls to the exit check. If no exit order, the final `else` branch fires — resetting the slot to **IDLE** and destroying the open position and all trade data.
**Severity: Critical**
### M10: No aggregation of any metric — trade count, success/fail, latency all zero
**File:** entire codebase
The following metrics are completely impossible to obtain from the current system:
| Metric | Why unavailable |
|--------|----------------|
| Total trades processed | `trade_seq` declared on `AccountSnapshot` but never incremented anywhere |
| Succeeded vs failed trades | No aggregation of `KernelDiagnosticCode` outcomes |
| PnL per individual trade | `slot.realized_pnl` is overwritten on slot reuse — no per-trade persistence |
| Slippage (fill vs intended price) | Data exists transiently but no computed metric |
| API calls per minute | No call counters anywhere in the venue adapter |
| `process_intent` latency | Zero timing instrumentation — no `time.monotonic()` in kernel path |
| Process memory usage | No memory tracking of any kind |
| Deduplicated vs fresh event count | Dedup detection exists but is never counted |
The `AccountSnapshot.trade_seq` field (account.py:27) is declared as `trade_seq: int = 0` but **never assigned** — no code path ever sets it above 0. It's a dead field.
**Severity: High**
### M11: Flaw 6 tests pass via metadata passthrough, not via `_legacy_intent` field logic
**File:** `test_flaws.py` Flaw 6 tests
The two Flaw 6 tests verify that `_legacy_intent` preserves `order_type` and `limit_price`. They pass because `_mk_intent(order_type="LIMIT")` puts the value into `intent.metadata`, and `_legacy_intent` copies `intent.metadata` into `legacy.metadata` verbatim. The tests check `legacy.metadata.get("order_type")` which finds the value in the passthrough — **not** because `_legacy_intent` reads `intent.order_type` correctly.
`_legacy_intent` actually reads `getattr(intent, "order_type", "MARKET")` which returns `"MARKET"` (the default, since `_mk_intent` put it in metadata not the field), and sets `legacy.metadata["_order_type"] = "MARKET"`. The assertion passes via the wrong code path. If `_legacy_intent` stopped reading `intent.order_type` altogether, the tests would still pass as long as `intent.metadata` is copied through.
**Severity: High**
### M12: No retry or fallback for ClickHouse INSERT failures
Evidence across all persistence paths: every `sink(table, row)` call in `pink_clickhouse.py` is unprotected. If ClickHouse is unreachable, slow, or returns an error, the exception propagates unhandled through `persist_step()` → `step()`. No retry, no backoff, no fallback, no queue, no error reporting to `anomaly_events`.
This means a transient ClickHouse outage (common in cloud deployments) crashes the entire policy cycle. The slot state in the Rust kernel may be lost as the exception unwinds.
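One possible mitigation is a wrapper around the sink that retries with backoff and parks rows it cannot deliver. This is a sketch under assumptions: `make_resilient_sink`, the in-memory dead-letter buffer, and the retry parameters are all hypothetical, and only the `sink(table, row)` call shape is taken from the finding above.

```python
import time
from collections import deque
from typing import Any, Callable, Deque, Dict, Tuple

Row = Dict[str, Any]


def make_resilient_sink(
    sink: Callable[[str, Row], None],
    retries: int = 3,
    base_delay_s: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Callable[[str, Row], bool], Deque[Tuple[str, Row]]]:
    """Wrap `sink` with bounded retry; rows that still fail go to a local buffer."""
    dead_letters: Deque[Tuple[str, Row]] = deque(maxlen=10_000)

    def resilient(table: str, row: Row) -> bool:
        for attempt in range(retries):
            try:
                sink(table, row)
                return True
            except Exception:
                if attempt < retries - 1:
                    sleep(base_delay_s * (2 ** attempt))  # exponential backoff
        dead_letters.append((table, row))  # keep the row for a later replay
        return False

    return resilient, dead_letters


# Demonstration with a sink that fails twice, then succeeds.
calls = {"n": 0}


def flaky_sink(table: str, row: Row) -> None:
    calls["n"] += 1
    if calls["n"] <= 2:
        raise ConnectionError("ClickHouse unreachable")


safe_sink, buffered = make_resilient_sink(flaky_sink, sleep=lambda _s: None)
delivered = safe_sink("trades", {"trade_id": "t-1"})
```

Returning a boolean instead of raising lets `persist_step()` decide whether a failed write should be reported as an anomaly without unwinding the policy cycle.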
**Severity: High**
### M13: `AccountSnapshot.trade_seq` declared but never incremented — dead field
**File:** `account.py:27`
```python
@dataclass
class AccountSnapshot:
    ...
    trade_seq: int = 0
```
This field is part of the `AccountSnapshot` dataclass. It's initialized to 0 and **never assigned or incremented** anywhere in the entire codebase. Every snapshot from `kernel.snapshot()["account"]` returns `trade_seq: 0`. Despite being a standard field in every persistence row, it's always `0` — making it impossible to order trades chronologically by sequence number from any persisted data.
Allows a 50% capital deviation (12,500 USDT on 25,000). The actual PnL from the test's tiny trades (~0.02 USDT) is orders of magnitude smaller. A bug that silently leaked 10,000 USDT of PnL would pass this test. The bound provides no meaningful verification.
Also: the test never checks `diagnostic_code` for the warning it claims to test (already documented as I7 weakness).
**Severity: Low**
### M15: `test_reconcile_rejects_position_open_with_zero_size` passes even if reconcile silently ignores bad data
**File:** `test_flaws.py:568-585`
```python
result = k.reconcile_from_slots([bad_slot])
slot = k._get_slot(0)
assert slot.fsm_state != TradeStage.POSITION_OPEN or slot.size > 0
```
The assertion was true **before** calling reconcile (slot starts IDLE with size=0). The test never checks `result.accepted == False` or verifies the diagnostic code. If `reconcile_from_slots` silently ignores the bad slot and returns `accepted=True`, the test still passes — it only proves the slot wasn't in POSITION_OPEN _after_ reconcile, which was already true.
The same structural weakness exists in `test_reconcile_rejects_idle_with_nonzero_size`.
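The weakness can be reproduced with a deliberately buggy stand-in. Everything in this sketch (`Stage`, `Slot`, `Result`, `SilentKernel`) is a hypothetical substitute for the real `TradeStage`, slot, and kernel types; it only shows that the original assertion cannot fail while a check on `accepted` can.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):  # stand-in for TradeStage
    IDLE = "IDLE"
    POSITION_OPEN = "POSITION_OPEN"


@dataclass
class Slot:
    fsm_state: Stage = Stage.IDLE
    size: float = 0.0


@dataclass
class Result:
    accepted: bool
    diagnostic_code: str


class SilentKernel:
    """Buggy stand-in: ignores bad slots yet reports success."""

    def __init__(self) -> None:
        self.slot = Slot()

    def reconcile_from_slots(self, slots) -> Result:
        return Result(accepted=True, diagnostic_code="OK")


k = SilentKernel()
bad_slot = Slot(fsm_state=Stage.POSITION_OPEN, size=0.0)
result = k.reconcile_from_slots([bad_slot])

# Original assertion: true before and after reconcile, so it cannot fail here.
weak_passes = k.slot.fsm_state != Stage.POSITION_OPEN or k.slot.size > 0

# Stronger check: the call itself must report the rejection.
strong_passes = result.accepted is False
```

Against this buggy kernel `weak_passes` is true and `strong_passes` is false, which is the behavior a regression test for rejection needs.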
**Severity: Low**
### M16: No built-in metric for active slot count, event throughput, or memory usage
The following operational metrics cannot be obtained without writing custom code:
- **Active slot count**: `len([s for s in kernel.state.slots if not s.is_free()])` — requires Python access to the `ExecutionKernel` object. No `active_slot_count` property exists.
- **Total event count**: No counter. The journal tracks individual transitions but there's no `total_events_processed: int` anywhere.
- **Memory usage**: No `tracemalloc`, no `psutil`, no RSS polling. Nothing.
- **Runtime uptime**: No `start_time` or `uptime()` method anywhere.
**Severity: Medium**
### M17: M4 duplicate — test_cancel_uses_slot_asset_not_trade_id and test_mock_venue_cancel_event_has_asset never call cancel
**File:** `test_flaws.py` Flaw 9 class
Both tests verify that an entry order's metadata contains an `asset` key. They never call `scenario.cancel()` or `k.process_intent(action=CANCEL)`. Despite their names and class (`TestFlaw9CancelSymbolFallback`), they test **metadata preservation on entry**, not cancel behavior.
**Severity: High**
### M18: `_decision_to_kernel_intent` drops `order_type` and `limit_price` — LIMIT orders unreachable from the runtime
**File:** `pink_direct.py:79-115` (inferred from E2E trace)
The bridge function converts a `Decision` to a `KernelIntent`. It sets `timestamp`, `intent_id`, `trade_id`, `asset`, `side`, `action`, `reference_price`, `target_size`, `leverage`, `exit_leg_ratios`, `reason`, and `metadata`. It does **NOT** set `order_type` or `limit_price` — both default to `"MARKET"` and `0.0`.
Even if the `DecisionEngine` produced a LIMIT decision with a limit price, the runtime has no path to express it. The entire LIMIT-order pipeline is dead code from the runtime — LIMIT orders can only be set via direct `KernelIntent(...)` construction in tests, which is itself broken (M3).
**Severity: High**
---
## Pass 10 Summary
| # | Flaw | Layer | Severity |
|---|------|-------|----------|
| M1 | ENTER transition hardcodes prev_state=IDLE — audit trail lies for re-entries | Rust | **Critical** |
| M2 | CANCEL creates no transition record — invisible in audit log | Rust | **Critical** |
| M3 | `_mk_intent` drops order_type/limit_price into metadata, not proper field | Test | **High** |
| M4 | test_cancel_entry_with_partial_fill never sends CANCEL — misnamed vacuous test | Test | **High** |
| M5 | Flaw 7 tests never send EXIT — exit_partial_fill_ratio untested | Test | Medium |
| M6 | test_dedup tests use wrong constant (actual=256, claim 64) — 70 events insufficient | Test | Medium |
| M7 | test_outcome_state_matches_actual_slot is tautological | Test | Low |
| M8 | ORDER_ACK silent fallthrough when no active order — accepted with no effect | Rust | Medium |
| M9 | ORDER_REJECT on POSITION_OPEN with stale entry order destroys position | Rust | **Critical** |
| M10 | No aggregation of trade count, success/fail, latency — all zero | All | **High** |
| M11 | Flaw 6 tests pass via metadata passthrough, not field logic | Test | **High** |
| M12 | No retry/fallback for ClickHouse INSERT failures — crashes policy cycle | Persistence | **High** |
| M13 | AccountSnapshot.trade_seq never incremented — always 0 | Account | Medium |
```rust
// Safety: single-threaded; caller holds exclusive access for the duration.
let core = unsafe { &mut (*handle).core }; // raw ptr → &mut
```
The comment says "single-threaded" but provides **zero enforcement**: no `Mutex`, no `RwLock`, no atomic flag, no thread-local constraints, no `!Send`/`!Sync` marker on `KernelCore`. The `unsafe` block converts a raw pointer to a `&mut` reference, which under Rust's aliasing rules must be **exclusive**. Two simultaneous `&mut` references to the same data are **undefined behavior** (data race, torn reads, LLVM miscompilation).
The `ctypes` FFI mechanism releases the GIL during the Rust call (`Py_BEGIN_ALLOW_THREADS`/`Py_END_ALLOW_THREADS`). Two Python threads can call any two `dita_kernel_*` functions simultaneously — one in `process_intent` (writing slot state), another in `snapshot_json` (reading). Both produce `&mut KernelCore`. This is a **compiler-level UB**, not just a logical race.
**Trigger scenario:** Thread A calls `process_intent()` (ENTRY fill → mutates slot). Thread B calls `on_venue_event()` (exit fill → mutates slot). The GIL is released during both Rust FFI calls. Both get `&mut KernelCore`. The Rust compiler can reorder, elide, or speculate any memory operation. Slot data becomes corrupted, PnL doubles, or the process segfaults.
**Severity: Critical** — undefined behavior, no enforcement, no mitigation.
### N2: `_run()` has two completely different code paths depending on event loop state — runtime branch, not design decision
**File:** `bingx_venue.py:225-238`
```python
def _run(self, result):
    if inspect.isawaitable(result):
        try:
            asyncio.get_running_loop()
        except RuntimeError:
            return asyncio.run(result)  # Path A: no loop → direct run
```
Path A (no event loop running): `asyncio.run(result)` — creates a new event loop, runs the coroutine, closes it. All on the same thread. Correct for sync contexts.
Path B (event loop running): `pool.submit(asyncio.run, result).result()` — submits to a 3-thread pool, each worker creates yet ANOTHER event loop via `asyncio.run()`, then blocks the calling thread with `.result()`.
The `asyncio.get_running_loop()` check is a **runtime probe** — the code doesn't know from its design whether it's in an async context. Same logical operation (run a coroutine), two completely different implementations. Path B is a documented anti-pattern (creating/destroying event loops per call), Path A is correct.
This is the root cause of the entire async/sync seam problem — the architecture never committed to being async or sync.
**Severity: Critical**
### N3: `_run()` Path B blocks the event loop thread for every venue HTTP operation
When called from within a running event loop (all live tests, any async deployment), `.result()` **blocks the event loop thread** until the thread pool worker completes. During this block:
- No WS messages can be received from the `BingxUserStream`
- No keepalive tasks can run
- No timer-based events can fire
- The event loop is **stuck**
If the thread pool is exhausted (3 concurrent HTTP calls, e.g., `_backend_snapshot` from `submit()`, which calls it twice, plus `cancel()`, which calls it three times), the 4th call waits at `.result()` for as long as all workers stay busy: the work item is queued but no worker is free. Because `.result()` has no timeout, a single hung worker turns this into a **stuck-process scenario** where the entire system freezes.
The event loop thread is blocked on `.result()`, which means it cannot process the WS events that might contain the fill for the order it just submitted. If the exchange fills instantly, the WS message arrives before `.result()` returns. The WS data then sits in the OS kernel's TCP receive buffer, unprocessed, until `process_intent` completes and the event loop can schedule the WS reader again. This delay can cause stale fills, missed state transitions, or WS timeouts.
**Severity: Critical**
### N4: `asyncio.run()` called repeatedly inside thread pool — creates/destroys event loops per call, documented anti-pattern
**File:** `bingx_venue.py:236`
```python
return pool.submit(asyncio.run, result).result()
```
Each call to `asyncio.run()` creates a new event loop (a `SelectorEventLoop` on Unix), runs it, then closes it. Doing this for every HTTP call goes against the documented usage of `asyncio.run()`:
- Each loop allocation costs memory (selector, callbacks, timeout queue)
- Each loop destruction leaves loop-internal objects for GC
- Over many calls (hundreds of trades), this creates GC pressure and memory fragmentation
- The `asyncio.run()` documentation describes it as the main entry point of an asyncio program that should ideally be called only once; repeated work belongs on a long-lived loop
Path A (no event loop) has the same issue — `asyncio.run()` is called per-`_run()` invocation.
The total cost: each `process_intent()` may call `_run()` 3-4 times (`_backend_snapshot`×2 + `submit_intent` + optionally `cancel`). Each `_run()` creates/destroys an event loop. With 10 trades/min, that's 30-40 event loop creations/destructions per minute.
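A common alternative is one long-lived loop on a dedicated thread, with coroutines handed to it via `asyncio.run_coroutine_threadsafe`. The sketch below assumes that design; `LoopThread` and `fake_http_call` are hypothetical names, and the 10-second default timeout is an arbitrary illustration value.

```python
import asyncio
import threading
from typing import Any, Coroutine


class LoopThread:
    """One event loop on one daemon thread, reused for every coroutine."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro: Coroutine[Any, Any, Any], timeout_s: float = 10.0) -> Any:
        # Schedule on the shared loop and block the calling thread, with a bound.
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout=timeout_s)

    def close(self) -> None:
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def fake_http_call(x: int) -> int:  # stand-in for a venue HTTP coroutine
    await asyncio.sleep(0)
    return x * 2


bridge = LoopThread()
results = [bridge.run(fake_http_call(i)) for i in range(3)]
bridge.close()
```

This removes the per-call loop creation, but `run()` still blocks its caller, so invoking it from the main event loop thread would keep the blocking problem described in N3.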
**Severity: Critical**
### N5: `_snapshot_ready` Event cascading re-fetch — N concurrent callers produce N overlapping HTTP calls
**File:** `bingx_venue.py:258-274`
```python
def _backend_snapshot(self, ...):
    if not self._snapshot_ready.wait(timeout=timeout_ms / 1000.0):
```
When `_snapshot_ready.set()` fires at the end, ALL threads waiting on `.wait()` wake up. Each one proceeds to `clear()` and start a **new** HTTP call — even though a fresh snapshot was just written. With N concurrent callers to `_backend_snapshot`, this produces N overlapping `refresh_state` HTTP calls instead of N-1 callers reading the just-received result.
On BingX VST (rate limit ~10 req/s), 3 overlapping `refresh_state` calls (each doing 5 parallel sub-requests) issue 3 × 5 = 15 requests against a budget of about 10 per second. The calls overlap and cascade, wasting rate-limit capacity on redundant work.
**Severity: High**
### N6: `BingxUserStream.close()` does not cancel pending tasks — keepalive/rotation tasks continue after close
**File:** `bingx_user_stream.py:160-169`
```python
async def close(self) -> None:
    self._closed.set()
    if self._session is not None and not self._session.closed:
        await self._session.close()
```
`close()` sets the `_closed` event and closes the aiohttp session. It does **not** cancel the `keepalive_task` or `rotation_task` created inside `subscribe()`. These tasks are only cancelled in the `finally` block of `subscribe()`. If `close()` is called while nobody is iterating the `subscribe()` generator (or if iteration is blocked in `_consume()`), those tasks **keep running** until:
- The event loop shuts down (automatic task cancellation)
- The subscribe generator is garbage collected
- An exception occurs in the WS reader
During this window, the keepalive loop continues sending PUT requests to the (now potentially deleted) listen key. The rotation task continues its 23h50m sleep. Both are zombie tasks with no cleanup path.
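One way to give `close()` a cleanup path is to track the background tasks on the instance and cancel them there. The sketch below is a minimal stand-in, not the real `BingxUserStream`: the `Stream` class, its `_tasks` list, and the `sleep(3600)` placeholders are assumptions for illustration.

```python
import asyncio
from typing import List


class Stream:
    """Minimal stand-in that tracks its background tasks on the instance."""

    def __init__(self) -> None:
        self._tasks: List[asyncio.Task] = []

    def start(self) -> None:
        self._tasks.append(asyncio.create_task(asyncio.sleep(3600)))  # keepalive stand-in
        self._tasks.append(asyncio.create_task(asyncio.sleep(3600)))  # rotation stand-in

    async def close(self) -> None:
        tasks, self._tasks = self._tasks, []
        for task in tasks:
            task.cancel()
        # Wait for cancellation to finish; return_exceptions keeps close() from raising.
        await asyncio.gather(*tasks, return_exceptions=True)


async def main() -> List[bool]:
    stream = Stream()
    stream.start()
    tasks = list(stream._tasks)
    await stream.close()
    return [t.cancelled() for t in tasks]


cancelled_flags = asyncio.run(main())
```

With this shape, cancellation no longer depends on someone iterating the `subscribe()` generator to reach its `finally` block.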
**Severity: Medium**
### N7: Live test architecture forces worst-case `_run()` behavior for every operation
**File:** `gen_live_tests.py`, `gen2.py`, `_gen_test.py` (all test generators)
The live tests use this pattern:
```python
def test_pink_ditav2_xxx(_live_client) -> None:
    ...
    result = asyncio.run(_run_scenario(bundle, _live_client, body_fn, name, ic))
```
Each test is a **synchronous** function that calls `asyncio.run()`. Inside the resulting event loop, every call to `k.process_intent()` triggers **Path B** of `_run()` — the pool-submit-`.result()` path. The test architecture forces the architecture's slowest, most thread-expensive code path for every single intent.
Every HTTP call: creates a new event loop on a pool thread → blocks the main event loop thread → blocks WS processing → wastes pool slots. Even for trivial mock-venue tests that don't need HTTP at all, the architecture still goes through the same `_run()` → pool → `.result()` path because the mock venue also returns awaitables.
**Severity: Medium**
### N8: `BingxUserStream subscribe()` creates new tasks on every reconnect — rapid reconnect causes task churn
```python
async for event in self._consume(listen_key, rotation_task):
    yield event
```
Each iteration of the reconnect loop creates new `keepalive_task` and `rotation_task`, then cancels the previous ones in the `finally` block. If the connection drops every few seconds (unstable WS), tasks are created and cancelled in rapid succession. Cancellation can also race with task startup: a task cancelled before its first step never runs its body at all, so any `try`/`finally` cleanup inside the coroutine is skipped.
Also: no rate limiting on the reconnect loop beyond the `delay_ms` exponential backoff. If the WS repeatedly fails immediately after connection, the loop creates/destroys tasks in a tight cycle.
**Severity: Medium**
### N9: No `asyncio.all_tasks()` or task accounting anywhere — leaked tasks undetectable
No code in the entire workspace calls `asyncio.all_tasks()` or maintains a task registry. If a task is leaked (cancellation not propagated, generator not cleaned up), there is:
- No way to detect it programmatically
- No warning log
- No metrics
- No `__del__` fallback
Combined with N6 (tasks not cancelled on close) and N8 (task churn on reconnect), leaked tasks accumulate silently. Each leaked task holds references to its coroutine frame, which may hold references to `aiohttp.ClientSession`, websocket connections, and other resources.
**Severity: Low**
### N10: `_snap_lock` / `_snapshot_ready` pattern has no reader-side protection on `_last_snapshot`
**File:** `bingx_venue.py:258-274`
The `_snap_lock` protects `_last_snapshot` only during writes (line 269-271). The fallback path (timeout at line 260-262) also reads `_last_snapshot` under `_snap_lock`. But the `_call_backend` call at line 266 is **outside** the lock — the snapshot is fetched without holding `_snap_lock`, which is correct (don't hold a lock across HTTP). However, the time between releasing the lock and reacquiring it for the write (line 269) means another thread could also be writing `_last_snapshot` concurrently. The `_snap_lock` ensures only one write at a time, but the `_last_snapshot` can still be overwritten between threads — this is the intended behavior (last writer wins for staleness purposes, not a correctness bug).
**Severity: Informational**
---
## Pass 11 Summary
| # | Flaw | Layer | Severity |
|---|------|-------|----------|
| N1 | Rust kernel `with_handle_mut` zero synchronization — `&mut` from raw ptr, UB on concurrent FFI | Rust | **Critical** |
| N2 | `_run()` has two completely different code paths — runtime branch, not design decision | Venue | **Critical** |
| N3 | `_run()` path B blocks event loop thread for every venue HTTP operation | Venue | **Critical** |
| N4 | `asyncio.run()` called repeatedly — creates/destroys event loops per call, documented anti-pattern | Venue | **Critical** |
| N5 | `_snapshot_ready` cascading re-fetch — N callers produce N overlapping HTTP calls | Venue | **High** |
| N6 | `BingxUserStream.close()` doesn't cancel pending tasks — zombie keepalive/rotation after close | Stream | Medium |
| N7 | Live test architecture forces worst-case `_run()` path for every operation | Test | Medium |
| N8 | `subscribe()` reconnect creates new tasks per iteration — rapid reconnect causes task churn | Stream | Medium |
| N9 | No `asyncio.all_tasks()` or task accounting — leaked tasks undetectable | All | Low |
| N10 | `_snap_lock`/`_snapshot_ready` no reader-side protection (informational) | Venue | Info |
### Pass 11 Severity
| Severity | Count |
|----------|-------|
| **Critical** | 4 (N1, N2, N3, N4) |
| **High** | 1 (N5) |
| Medium | 3 (N6, N7, N8) |
| Low | 1 (N9) |
| Info | 1 (N10) |
### Combined Catalog (All 11 Passes)
| Pass | Focus | Count | Critical | High | Medium | Low | Info |
### O1: `_maybe_close()` calls `asyncio.run()` without checking for a running event loop — close/disconnect silently skipped
**File:** `launcher.py:270-274`
```python
def _maybe_close(obj):
    ...
    if inspect.isawaitable(result):
        try:
            asyncio.run(result)
        except RuntimeError:
            pass  # SILENT: coroutine never executed
```
When `_maybe_close()` is called from any context that already has a running event loop (which includes all async tests, any `async def main()` orchestrator, or any code path that imports and runs `DITAv2LauncherBundle` inside an async context), `asyncio.run(result)` raises `RuntimeError: asyncio.run() cannot be called from a running event loop`. The `except RuntimeError: pass` swallows it — the close/disconnect method **never executes**.
Affected resources when called from async context:
- `RealZincPlane.close()`: never called → 3 shared memory regions leaked
- `RealZincControlPlane.close()`: never called → 1 shared memory region leaked
- `BingxVenueAdapter` has neither `close()` nor `disconnect()` (N/A)
- `InMemoryZincPlane` has no close (N/A)
The `DITAv2LauncherBundle.close()` method calls `_maybe_close(self.venue)`, `_maybe_close(self.zinc_plane)`, `_maybe_close(self.control_plane)` — if any of these have async close/disconnect methods, they're all silently skipped when called from async context.
This means: in any async deployment (which is the only deployment pattern — tests, and presumably production via `asyncio.run()` at top level), **shared memory regions are never explicitly closed**. They rely on process exit cleanup.
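A loop-aware variant could probe for a running loop first and hand back a task instead of discarding the coroutine. This is a sketch only: `maybe_close`, `_await_it`, and `AsyncResource` are hypothetical names, and returning the pending future to the caller is one design choice among several.

```python
import asyncio
import inspect
from typing import Any, Optional


async def _await_it(awaitable: Any) -> Any:
    # Wrapper so asyncio.run() also accepts non-coroutine awaitables.
    return await awaitable


def maybe_close(obj: Any) -> Optional["asyncio.Future[Any]"]:
    """Call obj.close() or obj.disconnect() if present; never drop a coroutine."""
    closer = getattr(obj, "close", None) or getattr(obj, "disconnect", None)
    if closer is None:
        return None
    result = closer()
    if not inspect.isawaitable(result):
        return None
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        asyncio.run(_await_it(result))  # no loop in this thread: run to completion
        return None
    # A loop is already running: schedule the close and return the task
    # so the caller can await it instead of losing the coroutine.
    return asyncio.ensure_future(result)


class AsyncResource:  # stand-in for a plane with an async close()
    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        self.closed = True


async def main() -> bool:
    res = AsyncResource()
    pending = maybe_close(res)  # called with a loop already running
    if pending is not None:
        await pending
    return res.closed


closed_in_async = asyncio.run(main())
sync_res = AsyncResource()
maybe_close(sync_res)  # called with no loop running
```

Both call sites end with the resource closed; the async caller has to await the returned task, which makes the pending cleanup visible instead of silent.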
**Severity: High**
### O2: `async def connect()` shims in all test generators call sync `venue.connect()` without `await` — misleading pattern
```python
self.kernel.venue.connect()  # sync method, no await
```
`BingxVenueAdapter.connect()` (bingx_venue.py:301) is a **sync** `def` that returns `bool`. It internally calls `self._run(result())` which under a running event loop submits to the thread pool and blocks with `.result()`. The `async def connect()` wrapper is misleading: it's `async` but immediately calls a sync method that will **block the event loop** for the HTTP round-trip duration.
The caller's perspective: `await runtime.connect()` should yield the event loop. Instead, it blocks until the BingX HTTP call inside `connect()` completes (via `_run()`'s thread pool path).
**Severity: Medium**
### O3: `gen_live_tests.py:171` — `_contract_rows(client)` NOT awaited in `async def _pick_live_symbol` — silent failure
**File:** `gen_live_tests.py:171`
```python
async def _pick_live_symbol(client):
    rows = _contract_rows(client)  # MISSING await! _contract_rows is async def
    ...
    pos_rows = [r for r in rows if ...]
```
`_contract_rows` is `async def` (line 69). Without `await`, `rows` is a **coroutine object**, not the actual data. The subsequent `for r in rows` fails with `TypeError: 'coroutine' object is not iterable` on every Python version that supports `async def`, not only 3.12+, and the interpreter later emits `RuntimeWarning: coroutine '_contract_rows' was never awaited` for the discarded coroutine.
This function is called from `_run_scenario` (line 260) and `_run_pink_live_roundtrip` (line 297). If either path reaches `_pick_live_symbol`, it crashes with `TypeError`. This bug may not have manifested in practice if the code paths that call `_pick_live_symbol` are rarely exercised or if the test generator's output file hasn't been regenerated recently.
```python
snap = asyncio.get_event_loop().run_until_complete(mock.account_snapshot())  # line 243
asyncio.get_event_loop().run_until_complete(asyncio.wait_for(_collect(), timeout=2.0))  # line 264
```
Since Python 3.12, `asyncio.get_event_loop()` emits a `DeprecationWarning` when no current event loop is set and it has to create one implicitly, and the documentation states that this will become an error in a later release. When the implicit creation happens, the new loop is installed as the thread's current loop, which can cause subtle issues when multiple event loops are active. The modern pattern is `asyncio.run()`.
These are the only two places in the workspace that use the deprecated `get_event_loop().run_until_complete()` pattern.
**Severity: Medium**
### O5: `_run()` thread pool has no timeout on `.result()` — if backend hangs, calling thread hangs forever
**File:** `bingx_venue.py:236`
```python
return pool.submit(asyncio.run, result).result() # NO timeout
```
`concurrent.futures.Future.result()` has an optional `timeout` parameter. None is set here. If the thread pool worker hangs (e.g., the `asyncio.run()` call in the worker gets stuck on a never-responding HTTP request, a deadlocked coroutine, or an infinite loop), the calling thread blocks **forever** on `.result()`.
If the calling thread is the event loop thread (Path B), the entire event loop is frozen indefinitely. No WS messages, no keepalive tasks, no timer events. The system is completely dead.
The `_backend_snapshot()` method has a 5-second timeout for its `threading.Event.wait()`, but the actual `_call_backend("refresh_state", ...)` that runs inside the thread pool has no timeout. The HTTP client (`BingxHttpClient`) may have its own timeout (if it relies on `aiohttp` defaults, the total timeout is 5 minutes), but there's no fallback if it hangs beyond that.
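A bounded variant of the bridge might apply two limits: `asyncio.wait_for` inside the worker, so the coroutine is cancelled and the pool slot freed, and a slightly larger `timeout` on `.result()` as a backstop. The names `run_bounded`, `_bounded`, and `slow_backend` below are hypothetical, and the 1-second margin is an arbitrary illustration value.

```python
import asyncio
import concurrent.futures

pool = concurrent.futures.ThreadPoolExecutor(max_workers=3)


async def slow_backend(delay_s: float) -> str:
    await asyncio.sleep(delay_s)  # stand-in for an HTTP call that may stall
    return "ok"


async def _bounded(coro, timeout_s: float):
    # Cancels the coroutine inside the worker so the pool slot is released.
    return await asyncio.wait_for(coro, timeout=timeout_s)


def run_bounded(coro, timeout_s: float):
    future = pool.submit(asyncio.run, _bounded(coro, timeout_s))
    # Outer bound is slightly larger: it only fires if the worker itself is stuck.
    return future.result(timeout=timeout_s + 1.0)


fast = run_bounded(slow_backend(0.0), timeout_s=1.0)
try:
    run_bounded(slow_backend(5.0), timeout_s=0.05)
    timed_out = False
except (asyncio.TimeoutError, concurrent.futures.TimeoutError):
    timed_out = True
pool.shutdown(wait=True)
```

The inner limit matters most: an outer `.result(timeout=...)` alone would unblock the caller but leave the worker thread occupied by the hung call.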
**Severity: High**
### O6: MockVenueAdapter never exercises the thread-pool bridge — all CI tests use mock venue, bridge untested
**Files:** `mock_venue.py` vs `bingx_venue.py`
`MockVenueAdapter.submit()` is pure sync — it does `return self._events_from_submit(...)` with no awaitables, no thread pools. `BingxVenueAdapter.submit()` is a sync-bridge that goes through `_run()` → `pool.submit(asyncio.run, ...).result()`.
All 35+ tests in `test_flaws.py` use `MockVenueAdapter`. All generated live tests use `BingxVenueAdapter` but are rarely executed (require live exchange credentials and API key env vars). The thread-pool bridge — including:
- Thread creation and lifecycle
- `asyncio.run()` inside pool workers
- Event loop per HTTP call
- Thread pool exhaustion handling
- Exception propagation through `.result()`
— is **never exercised in CI**. If the bridge has a bug (e.g., the `asyncio.run()` inside the pool worker corrupts shared state, or thread-safety issues in `aiohttp`), it surfaces only in production.
**Severity: Medium**
### O7: `BingxUserStream._keepalive_loop` and `_rotation_sentinel` are fire-and-forget tasks — unhandled exceptions silently lost
Both are created with `create_task()` and tracked for later cancellation, but **not supervised during normal operation**. If `_keepalive_loop` raises an exception that its internal `try/except` does not catch (e.g., a `RuntimeError` from the HTTP layer), the exception is stored on the `Task` object and the keepalive simply stops. If nothing ever calls `.result()` or `.exception()` on that `Task`, asyncio only logs `"Task exception was never retrieved"` once the task object is garbage collected: a log line, with no structured error handling.
`_rotation_sentinel` has no exception handling in its body — it just does `await asyncio.sleep(secs)` and returns. It can't raise an exception unless the event loop is shut down during its sleep (in which case `CancelledError` is raised, which is properly handled in the `finally` block).
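Supervision can be as small as a done-callback that retrieves the exception and reports it. In this sketch `supervise`, the `failures` list, and `failing_keepalive` are hypothetical; a real implementation would presumably route the error into the system's own anomaly reporting.

```python
import asyncio
import logging

log = logging.getLogger("stream.tasks")
failures: list = []


def supervise(task: "asyncio.Task") -> "asyncio.Task":
    """Attach a callback that retrieves and reports the task's exception."""

    def _on_done(t: "asyncio.Task") -> None:
        if t.cancelled():
            return  # normal shutdown path
        exc = t.exception()  # retrieving it also avoids the "never retrieved" log
        if exc is not None:
            failures.append(exc)
            log.error("background task %s failed: %r", t.get_name(), exc)

    task.add_done_callback(_on_done)
    return task


async def failing_keepalive() -> None:  # stand-in for _keepalive_loop
    raise RuntimeError("keepalive PUT failed")


async def main() -> None:
    supervise(asyncio.create_task(failing_keepalive(), name="keepalive"))
    await asyncio.sleep(0.01)  # give the task and its callback time to run


asyncio.run(main())
```

The callback runs as soon as the task finishes, so the failure is reported while the stream is still alive and able to react.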
**Severity: Low**
### O8: `KernelSlotView.__getattr__` makes a ctypes call per attribute — each read triggers Rust FFI and is not cached
Every attribute access on a `KernelSlotView` — including `slot.size`, `slot.fsm_state`, `slot.trade_id`, `slot.active_entry_order`, etc. — does a full JSON round-trip to the Rust kernel:
7. `_slot_from_payload(dict)` → new `TradeSlot` dataclass
8. `getattr(slot, name)` → read the one field from the new object
Accessing 5 fields on a `KernelSlotView` (e.g., `slot.size`, `slot.fsm_state`, `slot.entry_price`, `slot.active_entry_order`, `slot.trade_id`) does 5 FFI round-trips. The deserialized `TradeSlot` is created and immediately discarded for each access.
The `_snapshot()` method (line 435) calls `self._kernel._get_slot(self._slot_id)` which does the full FFI round-trip. There is no caching of the deserialized `TradeSlot` between successive accesses. This is an N+1 performance issue — accessing N fields costs N FFI calls instead of 1.
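The cost difference is easy to see with a counting stand-in. `FakeKernel` and `SlotSnapshot` below are hypothetical substitutes for the real kernel and `TradeSlot`; the counter replaces the actual FFI round-trip.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlotSnapshot:  # stand-in for the deserialized TradeSlot
    size: float
    fsm_state: str
    entry_price: float


class FakeKernel:
    """Counts round-trips in place of the real FFI call."""

    def __init__(self) -> None:
        self.ffi_calls = 0

    def _get_slot(self, slot_id: int) -> SlotSnapshot:
        self.ffi_calls += 1
        return SlotSnapshot(size=1.5, fsm_state="POSITION_OPEN", entry_price=100.0)


kernel = FakeKernel()

# Per-attribute access pattern: one round-trip per field read.
size = kernel._get_slot(0).size
state = kernel._get_slot(0).fsm_state
price = kernel._get_slot(0).entry_price
calls_per_attribute = kernel.ffi_calls

# Snapshot-once pattern: one round-trip, then plain attribute reads.
kernel.ffi_calls = 0
snap = kernel._get_slot(0)
size, state, price = snap.size, snap.fsm_state, snap.entry_price
calls_snapshot_once = kernel.ffi_calls
```

Reading from a single snapshot also gives a consistent view: all fields come from the same kernel state instead of from successive reads that may straddle a transition.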
**Severity: Medium**
### O9: `DITAv2LauncherBundle` has no `__del__` — bundle that's garbage collected leaks its entire resource tree
**File:** `launcher.py:64-95`
```python
@dataclass
class DITAv2LauncherBundle:
    kernel: ExecutionKernel
    control_plane: ControlPlane
    projection: HazelcastProjection
    zinc_plane: ZincPlane
    venue: VenueAdapter

    def close(self) -> None:
        _maybe_close(self.venue)
        _maybe_close(self.zinc_plane)
        _maybe_close(self.control_plane)
```
No `__del__` method. If a bundle is garbage collected without an explicit `close()` call:
- The Rust kernel's `KernelHandle` is freed by `ExecutionKernel.__del__` (if GC runs)
- If `RealZincPlane` was in use, its `close()` is never called → 3 shared memory regions leaked
- If `RealZincControlPlane` was in use, its `close()` is never called → 1 shared memory region leaked
- The projection (Hazelcast) client connection is never closed
- The venue adapter's thread pool executor is never shut down
If the bundle is created and dropped in a loop (e.g., per-test setup/teardown), shared memory regions accumulate until the system runs out of `/dev/shm/` space.
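For per-test setup and teardown, a context manager gives deterministic cleanup without relying on `__del__`. The sketch below assumes resources expose a synchronous `close()`; `managed_bundle` and `Resource` are hypothetical names, not part of the launcher.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List


class Resource:  # stand-in for a zinc plane, control plane, or venue
    def __init__(self, name: str, log: List[str]) -> None:
        self.name, self._log = name, log

    def close(self) -> None:
        self._log.append(self.name)


@contextmanager
def managed_bundle(factory: Callable[[], List[Resource]]) -> Iterator[List[Resource]]:
    resources = factory()
    try:
        yield resources
    finally:
        # Close in reverse creation order; one failure must not skip the rest.
        for res in reversed(resources):
            try:
                res.close()
            except Exception:
                pass


closed: List[str] = []
try:
    with managed_bundle(lambda: [Resource("venue", closed), Resource("zinc", closed)]):
        raise ValueError("test body failed")
except ValueError:
    pass
```

Even though the body raises, both resources are closed, which is the property a looped setup/teardown needs to avoid accumulating shared memory regions.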
**Severity: Medium**
### O10: ExecutionKernel has no `close()` — `__del__` is the only cleanup path for the Rust handle
**File:** `rust_backend.py:519-525`
```python
def __del__(self) -> None:
    backend = getattr(self, "_backend", None)
    if backend is not None:
        try:
            _get_rust().destroy(backend)
        except Exception:
            pass
```
No `close()` method exists on `ExecutionKernel`. The `DITAv2LauncherBundle.close()` doesn't touch the kernel (it calls `_maybe_close` on venue, zinc_plane, and control_plane only). The Rust `_backend` handle is only freed when `__del__` runs during garbage collection.
If the kernel is part of a reference cycle (K3/K6 — `Kernel → KernelStateView → KernelSlotView → Kernel`), `__del__` may be delayed indefinitely until the cycle GC runs. During that delay, the Rust `KernelHandle` is alive but unreachable — its memory is leaked until GC.
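An explicit, idempotent `close()` with `__del__` kept only as a fallback is one way to make the release deterministic. `Handle` and its `destroy` callable below are hypothetical stand-ins for `ExecutionKernel` and the real FFI destructor.

```python
class Handle:
    """Owns a native resource; `destroy` stands in for the real FFI destructor."""

    def __init__(self, destroy) -> None:
        self._destroy = destroy
        self._backend = object()  # placeholder for the raw handle

    def close(self) -> None:
        # Idempotent: the handle is released exactly once.
        backend, self._backend = self._backend, None
        if backend is not None:
            self._destroy(backend)

    def __enter__(self) -> "Handle":
        return self

    def __exit__(self, *exc_info) -> None:
        self.close()

    def __del__(self) -> None:
        # Fallback only; deterministic cleanup goes through close().
        try:
            self.close()
        except Exception:
            pass


destroyed = []
with Handle(destroyed.append) as h:
    pass
h.close()  # second call is a no-op
```

Because `close()` clears the reference before calling the destructor, a later `__del__` (including one delayed by cycle GC) cannot free the handle twice.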
**Severity: Medium**
### O11: `KernelSlotView.__setattr__` triggers 5 side effects including durable writes — undocumented
```python
self._kernel._set_slot(slot)  # triggers: Rust FFI write + state refresh
                              #           + account.observe_slots
                              #           + projection.write_slot
                              #           + zinc_plane.write_slot
```
Setting any attribute on a `KernelSlotView` — even something trivial like `slot.some_metadata_field = "test"` — triggers 5 side effects: Rust FFI write to the kernel, `KernelStateView.refresh()`, `account.observe_slots()`, `projection.write_slot()`, and `zinc_plane.write_slot()`. The method name `__setattr__` gives no indication that setting a field triggers durable writes across multiple persistence layers.
There is no read-only view that prevents accidental mutation. Any code that holds a `KernelSlotView` reference and assigns a field bypasses all FSM guards and directly mutates the Rust kernel state.
**Severity: Medium**
---
## Pass 12 Summary
| # | Flaw | Layer | Severity |
|---|------|-------|----------|
| O1 | `_maybe_close()` asyncio.run without loop guard — close/disconnect silently skipped from async context | Launcher | **High** |
| O2 | `async def connect()` shims call sync `venue.connect()` without await — blocking pattern | Test | Medium |
| O3 | `_contract_rows(client)` NOT awaited in `_pick_live_symbol` — silent coroutine iteration crash | Test | **High** |
| O4 | `test_exchange_event_seam_parity.py` uses deprecated `get_event_loop().run_until_complete()` | Test | Medium |
| O5 | `_run()` thread pool `.result()` has no timeout — backend hang freezes process indefinitely | Venue | **High** |
| O6 | MockVenueAdapter never exercises thread-pool bridge — bridge untested in CI | Venue | Medium |