PINK: E2E trace analysis — Pass 19 lifecycle/Rust subtleties/test infra (V1-V14)
Nineteenth pass: DITAv2LauncherBundle.close() never calls kernel.close() Rust handle leaks via __del__ (V1 Critical), BingxVenueAdapter no close/disconnect ThreadPoolExecutor/HTTP never release (V2 Critical), 3 generators write same output file last writer wins incompatible prologues (V4 Critical), generated tests triple env-gated never run in CI dead code (V5 Critical), kernel.close() destroys Rust handle immediately no drain no flush UAF risk (V6 Critical), process_intent ENTER doesn't clear seen_event_ids old dedup pollutes new trade (V3 High), no conftest/pytest.ini/asyncio_mode test discovery fragile (V9 High), #[serde(default)] leverage:0.0 mark_price no .max(1.0) silent accounting error (V8 Medium). 347 total flaws across 19 passes. Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
This commit is contained in:
@@ -6776,3 +6776,362 @@ The `lot_step` field is declared with a default of `0.001` but is never read by
|
|||||||
| T | Pass 17 (Unsafe Review/Dead Code/Build/Protocols) | 14 | 0 | 5 | 5 | 4 | 0 |
|
| T | Pass 17 (Unsafe Review/Dead Code/Build/Protocols) | 14 | 0 | 5 | 5 | 4 | 0 |
|
||||||
| U | Pass 18 (Rust Test Gaps/Accounting/FFI Types) | 14 | 3 | 4 | 4 | 3 | 0 |
|
| U | Pass 18 (Rust Test Gaps/Accounting/FFI Types) | 14 | 3 | 4 | 4 | 3 | 0 |
|
||||||
| **Total** | | **333** | **30** | **99** | **96** | **64** | **37** |
|
| **Total** | | **333** | **30** | **99** | **96** | **64** | **37** |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## PASS 19 — STARTUP/SHUTDOWN LIFECYCLE, RUST KERNEL SUBTLETIES, GENERATED TEST INFRA
|
||||||
|
|
||||||
|
### V1: `DITAv2LauncherBundle.close()` never calls `kernel.close()` — Rust kernel handle always leaks via `__del__`
|
||||||
|
|
||||||
|
**File:** `launcher.py:40-42`
|
||||||
|
|
||||||
|
```python
|
||||||
|
def close(self) -> None:
|
||||||
|
_maybe_close(self.venue)
|
||||||
|
_maybe_close(self.zinc_plane)
|
||||||
|
_maybe_close(self.control_plane)
|
||||||
|
# NOTE: self.kernel is never closed!
|
||||||
|
```
|
||||||
|
|
||||||
|
The bundle's `close()` method closes the venue, zinc plane, and control plane — but **not the kernel**. The `ExecutionKernel` Rust handle (`self._backend`) is only freed when Python's garbage collector calls `__del__`, which is non-deterministic.
|
||||||
|
|
||||||
|
In a `with` block or explicit `bundle.close()` pattern, the Rust handle survives until the `ExecutionKernel` object's refcount drops to zero (usually when the bundle goes out of scope). If the caller holds a reference to the kernel (e.g., `k = bundle.kernel` for inspection), the handle lives indefinitely.
|
||||||
|
|
||||||
|
Compare with `_build_rb()` in test code (gen2.py:326-338) which creates a `Shim` that has a `close` method calling `bundle.close()` and `bundle.kernel.close()`. The production `DITAv2LauncherBundle` has no equivalent.
|
||||||
|
|
||||||
|
**Severity: Critical**
|
||||||
|
|
||||||
|
### V2: `BingxVenueAdapter` has no `close()` or `disconnect()` — ThreadPoolExecutor, HTTP client sessions, connections never released
|
||||||
|
|
||||||
|
**File:** `bingx_venue.py` (entire file)
|
||||||
|
|
||||||
|
`BingxVenueAdapter` is the synchronous REST adapter to BingX. It has no `close()`, `disconnect()`, or any cleanup method. The `_maybe_close()` in the launcher tries `.close()` → `AttributeError` (caught), then `.disconnect()` → `AttributeError` (caught). **Nothing happens.**
|
||||||
|
|
||||||
|
Leaked resources:
|
||||||
|
- **ThreadPoolExecutor** (class-level `_EXECUTOR`, 3 threads) — never shut down
|
||||||
|
- **BingxDirectExecutionAdapter backend** — holds an `aiohttp.ClientSession` with TCP connection pool (`limit=4`) — never closed
|
||||||
|
- Any in-flight HTTP connections — abandoned at process exit
|
||||||
|
|
||||||
|
This is a compounding of R1 and R2 from Pass 15. It's the single largest resource leak in the system.
|
||||||
|
|
||||||
|
**Severity: Critical**
|
||||||
|
|
||||||
|
### V3: `process_intent` ENTER path does NOT clear `seen_event_ids` — old trade's dedup set pollutes new trade
|
||||||
|
|
||||||
|
**File:** `_rust_kernel/src/lib.rs:1260-1290` (ENTER path)
|
||||||
|
|
||||||
|
```rust
|
||||||
|
// ENTER path — slot reuse:
|
||||||
|
slot.trade_id = intent.trade_id.clone();
|
||||||
|
slot.asset = intent.asset.clone();
|
||||||
|
slot.entry_price = 0.0;
|
||||||
|
slot.size = 0.0;
|
||||||
|
slot.initial_size = 0.0;
|
||||||
|
slot.unrealized_pnl = 0.0;
|
||||||
|
slot.realized_pnl = 0.0;
|
||||||
|
slot.active_leg_index = 0;
|
||||||
|
slot.active_entry_order = None;
|
||||||
|
slot.active_exit_order = None;
|
||||||
|
slot.close_reason.clear();
|
||||||
|
slot.closed = false;
|
||||||
|
slot.last_event_time = None;
|
||||||
|
// 🔴 seen_event_ids is NOT cleared — survives from old trade
|
||||||
|
slot.fsm_state = TradeStage::ORDER_REQUESTED;
|
||||||
|
```
|
||||||
|
|
||||||
|
When a slot is reused by a new ENTER intent, the `seen_event_ids` vector from the **previous trade** survives. If any event_id from the new trade collides with one observed by the old trade, the event is rejected as `DUPLICATE_EVENT` and silently dropped.
|
||||||
|
|
||||||
|
**Example:**
|
||||||
|
1. Trade A in slot 0 receives events `evt-001`, `evt-002`, `evt-003`. All stored in `seen_event_ids`.
|
||||||
|
2. Trade A completes. Slot 0 returns to `IDLE`.
|
||||||
|
3. Trade B enters on slot 0. `seen_event_ids` still contains `evt-001`, `evt-002`, `evt-003`.
|
||||||
|
4. Exchange sends `evt-001` for Trade B (a legitimate new event with a reused event_id — possible on exchanges that recycle IDs daily).
|
||||||
|
5. Rust kernel sees `evt-001` in `seen_event_ids` → rejects as `DUPLICATE_EVENT`. The fill/ack is lost.
|
||||||
|
|
||||||
|
The probability of collision depends on the exchange's event ID scheme. If event IDs are UUIDs, the risk is negligible. If event IDs are daily counters (e.g., `"20260601-001"` → `"20260602-001"`), the collision risk is **guaranteed** on every day boundary.
|
||||||
|
|
||||||
|
**Fix:** Add `slot.seen_event_ids.clear();` in the ENTER path.
|
||||||
|
|
||||||
|
**Severity: High**
|
||||||
|
|
||||||
|
### V4: Three generators (`gen2.py`, `gen_live_tests.py`, `_gen_test.py`) all write to same output file — last writer wins, incompatible prologues
|
||||||
|
|
||||||
|
**Files:** `gen2.py:431-432`, `gen_live_tests.py:680-681`, `_gen_test.py:1233-1234`
|
||||||
|
|
||||||
|
All three generator scripts write to the same file:
|
||||||
|
```
|
||||||
|
/mnt/dolphinng5_predict/prod/tests/test_pink_bingx_dita_live_e2e.py
|
||||||
|
```
|
||||||
|
|
||||||
|
Each produces **different prologues** with incompatible import styles, helper function signatures, and test runner implementations:
|
||||||
|
- `gen2.py`: Uses `RB`/`Shim` tuple, `_si()` helper, `_flatten()` helper
|
||||||
|
- `gen_live_tests.py`: Uses `_RuntimeBundle`/`_RuntimeShim` dataclass, `_run_scenario()` runner, `_flatten_via_kernel_intent()` helper
|
||||||
|
- `_gen_test.py`: Uses scenario-style `B()` body definitions, single parametrized `test_pink_ditav2` function
|
||||||
|
|
||||||
|
The 68 named body functions are **identical** across gen2.py and gen_live_tests.py but their signatures differ (`(k, symbol, p)` vs `(bundle, client, symbol, snap)`). Running gen2.py then gen_live_tests.py produces a file where the bodies don't match the runner — the file compiles (Python sees valid functions) but silently tests nothing meaningful.
|
||||||
|
|
||||||
|
`_build_pink_extended.py` and `_build_pink_bodies.py` then mutate the same file in-place with `str.replace()` — which silently does nothing if the expected format doesn't match.
|
||||||
|
|
||||||
|
**Severity: Critical**
|
||||||
|
|
||||||
|
### V5: Generated tests are triple env-gated — never run in CI, making them dead code
|
||||||
|
|
||||||
|
**File:** `gen2.py` (generated output), `gen_live_tests.py` (generated output), `_gen_test.py` (generated output)
|
||||||
|
|
||||||
|
Every generated test file has three `pytest.skip()` guards at the top:
|
||||||
|
```python
|
||||||
|
if not os.environ.get("BINGX_SMOKE_LIVE"): pytest.skip("...")
|
||||||
|
if not os.environ.get("BINGX_SMOKE_ALLOW_TRADE"): pytest.skip("...")
|
||||||
|
if not os.environ.get("PINK_DITA_E2E"): pytest.skip("...")
|
||||||
|
```
|
||||||
|
|
||||||
|
These three env vars are **never set in CI**. The CI pipeline has no step that sets them. As a result:
|
||||||
|
- 68 test functions from gen2.py → always skipped
|
||||||
|
- 70 test functions from gen_live_tests.py → always skipped
|
||||||
|
- 157 body scenarios from _gen_test.py → always skipped
|
||||||
|
- ~295 combined test scenarios → **zero executed in CI**
|
||||||
|
|
||||||
|
Only the mock-venue tests (`test_flaws.py`, `test_account_core_v2.py`, `test_account_reconcile_faults.py`, `test_exchange_event_seam_parity.py`, `test_kernel_reliability.py`, `test_kernel_fee_friction.py`, `test_pink_clickhouse_phase4.py`, `test_alpha_blue_untouched_g7.py`) actually run in CI.
|
||||||
|
|
||||||
|
**Severity: Critical**
|
||||||
|
|
||||||
|
### V6: `ExecutionKernel.close()` destroys Rust handle immediately — no drain of in-flight intents, no wait for pending FFI calls, no state flush
|
||||||
|
|
||||||
|
**File:** `rust_backend.py:320-330`
|
||||||
|
|
||||||
|
```python
|
||||||
|
def close(self) -> None:
|
||||||
|
backend = self._backend
|
||||||
|
if backend is not None:
|
||||||
|
self._backend = None
|
||||||
|
try:
|
||||||
|
_get_rust().destroy(backend) # immediately frees Rust kernel memory
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
```
|
||||||
|
|
||||||
|
`destroy(backend)` calls `dita_kernel_destroy(handle)` which calls `drop(Box::from_raw(handle))`, freeing the entire `KernelHandle` heap allocation including `KernelCore`, all slots, all account state.
|
||||||
|
|
||||||
|
If `process_intent()` or `on_venue_event()` has been called from another thread and is mid-execution in Rust when `close()` fires:
|
||||||
|
1. Rust's state machine is destroyed mid-transition — dangling `self` reference
|
||||||
|
2. Any HTTP calls (venue.submit()) already in-flight complete, but their results are never processed
|
||||||
|
3. The Rust `Box::from_raw` calls `drop()` while another thread may be holding a reference to `core` — **use-after-free UB**
|
||||||
|
4. `_last_settled_pnl` dict is orphaned
|
||||||
|
5. The zinc plane, projection, and account state are all inconsistent with the destroyed kernel
|
||||||
|
|
||||||
|
There is no cancel/token mechanism, no pending-call queue drain, and no state flush before destroy.
|
||||||
|
|
||||||
|
**Severity: Critical**
|
||||||
|
|
||||||
|
### V7: `_last_settled_pnl` dict accessed from both `process_intent` and `on_venue_event` without locks — thread-unsafe
|
||||||
|
|
||||||
|
**File:** `rust_backend.py:440,475`
|
||||||
|
|
||||||
|
```python
|
||||||
|
# process_intent (line 440):
|
||||||
|
self._last_settled_pnl[intent.slot_id] = 0.0
|
||||||
|
|
||||||
|
# on_venue_event (line 475):
|
||||||
|
incremental_pnl = slot.realized_pnl - self._last_settled_pnl.get(slot.slot_id, 0.0)
|
||||||
|
self._last_settled_pnl[slot.slot_id] = slot.realized_pnl
|
||||||
|
```
|
||||||
|
|
||||||
|
`process_intent()` (called from the async event loop or scheduler) and `on_venue_event()` (called from the venue event stream callback) both read and write `_last_settled_pnl` without any synchronization primitive. If these run on different threads, the dict can experience:
|
||||||
|
- **Lost update**: Two concurrent writes to the same `slot_id` — one overwrites the other
|
||||||
|
- **Dict corruption**: Python dict isn't thread-safe for concurrent writes — can produce `KeyError` on iteration or silently drop entries
|
||||||
|
- **Incorrect PnL settlement**: The incremental PnL calculation uses stale values
|
||||||
|
|
||||||
|
In the current single-threaded async architecture, this isn't triggered. But the lack of protection means any future multi-threaded usage introduces a data race.
|
||||||
|
|
||||||
|
**Severity: Medium**
|
||||||
|
|
||||||
|
### V8: `#[serde(default)]` on `leverage: f64` defaults to 0.0 — `mark_price()` uses leverage directly without `.max(1.0)`, silent accounting error
|
||||||
|
|
||||||
|
**File:** `_rust_kernel/src/lib.rs:400-408`
|
||||||
|
|
||||||
|
```rust
|
||||||
|
fn mark_price(&mut self, price: f64) {
|
||||||
|
// ...
|
||||||
|
if self.entry_price <= 0.0 || self.size <= 0.0 { return; }
|
||||||
|
let mut delta = (price - self.entry_price) / self.entry_price;
|
||||||
|
if self.side == TradeSide::SHORT { delta = -delta; }
|
||||||
|
self.unrealized_pnl = delta * self.size * self.entry_price * self.leverage; // uses leverage directly
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
The `TradeSlot` struct has `#[serde(default)] leverage: f64` which defaults to `0.0` when deserialized from JSON without a `leverage` field.
|
||||||
|
|
||||||
|
`mark_price()` uses `self.leverage` directly in `unrealized_pnl` computation — **no `.max(1.0)` guard**. If leverage is `0.0` (from missing JSON field, `set_slot_json`, or snapshot restore without leverage), `unrealized_pnl` is always `0.0` regardless of price movement.
|
||||||
|
|
||||||
|
Compare with `realized_pnl()` which correctly guards:
|
||||||
|
```rust
|
||||||
|
let notional = exit_size * slot.entry_price * slot.leverage.max(1.0); // correct
|
||||||
|
```
|
||||||
|
|
||||||
|
The `process_intent` ENTER path handles this (sets leverage to 1.0 if ≤ 0), but `set_slot_json` and `restore_full_snapshot` can bypass this and store leverage=0.0 directly into the slot.
|
||||||
|
|
||||||
|
**Impact:** Any slot restored from a snapshot that predates the `leverage` field (or saved by a version without it) gets `leverage=0.0` silently. `unrealized_pnl` is always 0.0 — the operator sees no mark-to-market PnL even though the position has moved. This PnL is only realized on close (via `realized_pnl` which correctly uses `.max(1.0)`), so the total PnL is correct but the intra-period mark appears flat — a silent accounting error.
|
||||||
|
|
||||||
|
**Severity: Medium**
|
||||||
|
|
||||||
|
### V9: No `conftest.py`, no `pytest.ini`, no `asyncio_mode` configuration — test discovery relies on default pytest behavior
|
||||||
|
|
||||||
|
**File:** (missing — no pytest configuration in workspace)
|
||||||
|
|
||||||
|
The workspace has **zero pytest configuration files**. No `conftest.py`, `pytest.ini`, `setup.cfg`, `pyproject.toml` with pytest settings. This means:
|
||||||
|
|
||||||
|
- **`asyncio_mode`** defaults to `"strict"` (pytest 8.x+), but all test files use `asyncio.run()` inline rather than `@pytest.mark.asyncio`. In strict mode, `async def` test functions are **not discovered** unless explicitly marked. Since all current tests are `sync def` wrapping `asyncio.run()`, this works — but the discovery relies on this specific pattern holding.
|
||||||
|
- No **timeout** configuration — a hanging test blocks the entire suite until killed
|
||||||
|
- No **test filtering** markers (no `slow`, `live`, `offline` markers)
|
||||||
|
- No **shared fixtures** — each test file repeats the `_build_rb()` setup pattern
|
||||||
|
- No **cache** or **rerun** configuration — flaky tests fail the CI suite
|
||||||
|
|
||||||
|
If a developer adds a new test using `async def` + `await` (the natural pattern for async code) without `@pytest.mark.asyncio`, the test silently doesn't run — a false negative.
|
||||||
|
|
||||||
|
**Severity: High**
|
||||||
|
|
||||||
|
### V10: `kernel.close()` has `except Exception: pass` — silently swallows all destroy errors
|
||||||
|
|
||||||
|
**File:** `rust_backend.py:325`
|
||||||
|
|
||||||
|
```python
|
||||||
|
try:
|
||||||
|
_get_rust().destroy(backend)
|
||||||
|
except Exception:
|
||||||
|
pass # silently swallows RuntimeError, OSError, ctypes errors
|
||||||
|
```
|
||||||
|
|
||||||
|
If `dita_kernel_destroy` fails (e.g., segfault caught by `catch_unwind` returning an error, or the shared library was unloaded), the exception is silently consumed. The caller believes the kernel was closed successfully — but the handle may still be allocated. This is a leak path that's invisible to monitoring.
|
||||||
|
|
||||||
|
**Severity: Low**
|
||||||
|
|
||||||
|
### V11: `build_launcher_bundle()` has no cleanup for partially-created components — kernel OOM orphans venue, zinc, control plane
|
||||||
|
|
||||||
|
**File:** `launcher.py:163-223`
|
||||||
|
|
||||||
|
The build order is:
|
||||||
|
1. Control plane ✅ built
|
||||||
|
2. Projection ✅ built
|
||||||
|
3. Zinc plane ✅ built
|
||||||
|
4. Venue ✅ built
|
||||||
|
5. **ExecutionKernel** — if this fails (OOM, cargo build fail, CDLL load), components 1-4 are all orphaned
|
||||||
|
|
||||||
|
There's no `try/finally` around the sequence. If `ExecutionKernel.__init__()` raises, the five already-built components (control plane, projection, zinc plane, venue, and any partially-initialized kernel state) leak. No cleanup code exists.
|
||||||
|
|
||||||
|
**Fix:** Use a context manager or `try/finally` to close created components if a later one fails.
|
||||||
|
|
||||||
|
**Severity: Medium**
|
||||||
|
|
||||||
|
### V12: `KernelResult` clones entire kernel state (all slots + indexes + AccountState) on every FFI call — performance issue
|
||||||
|
|
||||||
|
**File:** `_rust_kernel/src/lib.rs:1030-1050` (snapshot method)
|
||||||
|
|
||||||
|
Every FFI call to `process_intent()` or `on_venue_event()` returns a `KernelResult` containing a full `snapshot()`:
|
||||||
|
|
||||||
|
```rust
|
||||||
|
fn snapshot(&self) -> KernelSnapshot {
|
||||||
|
KernelSnapshot {
|
||||||
|
slots: self.slots.clone(), // O(n) clone ALL TradeSlots
|
||||||
|
active_trade_index: self.active_trade_index.clone(), // clone entire HashMap
|
||||||
|
venue_order_index: self.venue_order_index.clone(),
|
||||||
|
client_order_index: self.client_order_index.clone(),
|
||||||
|
account: self.account.clone(), // clone entire AccountState
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
For a kernel with 10 slots, each with metadata, seen_event_ids (1024 IDs), nested VenueOrders, and a full AccountState with fee_config, this produces **thousands of heap allocations per FFI call**. At 10 intents/second, this is tens of thousands of allocations/second.
|
||||||
|
|
||||||
|
The Python side only reads `KernelResult.outcome` and `KernelResult.slot` (the single affected slot) from each response — **the full snapshot is never used**. The `snapshot` field is transmitted and decoded by Python (`json.loads` parses all of it) but the caller only accesses `outcome` and `slot`. The snapshot data is generated, serialized, transmitted, deserialized, and **silently discarded** on every call.
|
||||||
|
|
||||||
|
**Severity: Medium** (performance, not correctness)
|
||||||
|
|
||||||
|
### V13: `_build_rb()` leaks bundle on any post-creation failure — Shim construction failure orphans kernel and venue
|
||||||
|
|
||||||
|
**File:** `gen2.py:326-338`, `_gen_test.py:69-80`
|
||||||
|
|
||||||
|
```python
|
||||||
|
def _build_rb(ic=25000.0, max_slots=1):
|
||||||
|
cfg = _build_config(ic)
|
||||||
|
b = build_launcher_bundle(...) # bundle created with active kernel, venue, etc.
|
||||||
|
k = b.kernel
|
||||||
|
k.account.snapshot.capital = ic # <-- if this raises (e.g., AttributeError)
|
||||||
|
... # Shim construction, etc.
|
||||||
|
return RB(runtime=Shim(k), config=cfg)
|
||||||
|
```
|
||||||
|
|
||||||
|
If any line after `build_launcher_bundle()` raises, the bundle `b` (with live kernel, venue connections, zinc plane) is leaked. No `try/finally` to call `b.close()`. In test code this is acceptable (process exits soon), but if test suites grow large, accumulated leaked bundles exhaust kernel slots or file descriptors.
|
||||||
|
|
||||||
|
**Severity: Low**
|
||||||
|
|
||||||
|
### V14: `_maybe_close` uses `break` after first method match — only tries `close` OR `disconnect`, never both
|
||||||
|
|
||||||
|
**File:** `launcher.py:67`
|
||||||
|
|
||||||
|
```python
|
||||||
|
for method_name in ("close", "disconnect"):
|
||||||
|
method = getattr(obj, method_name, None)
|
||||||
|
if callable(method):
|
||||||
|
...
|
||||||
|
break # <-- breaks after first successful match — never tries the second method
|
||||||
|
```
|
||||||
|
|
||||||
|
If an object has both `close()` and `disconnect()`, only `close()` is called. The `disconnect()` fallback is never reached if `close()` exists. This is correct for objects like `RealZincPlane` (which has `close()` but no `disconnect()`), but for objects with both methods (possible future adapters), the second method is silently skipped.
|
||||||
|
|
||||||
|
**Severity: Low**
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Pass 19 Summary
|
||||||
|
|
||||||
|
| # | Flaw | Layer | Severity |
|
||||||
|
|---|------|-------|----------|
|
||||||
|
| V1 | `DITAv2LauncherBundle.close()` never calls `kernel.close()` — Rust handle leaks via `__del__` | Launcher | **Critical** |
|
||||||
|
| V2 | `BingxVenueAdapter` no `close()`/`disconnect()` — ThreadPoolExecutor/HTTP never release | Venue | **Critical** |
|
||||||
|
| V3 | `process_intent` ENTER doesn't clear `seen_event_ids` — old dedup pollutes new trade | Rust | **High** |
|
||||||
|
| V4 | 3 generators write same output file — last writer wins, incompatible prologues | Test | **Critical** |
|
||||||
|
| V5 | Generated tests triple env-gated — never run in CI, dead code | Test | **Critical** |
|
||||||
|
| V6 | `kernel.close()` destroys Rust handle immediately — no drain, no flush, use-after-free risk | Bridge | **Critical** |
|
||||||
|
| V7 | `_last_settled_pnl` dict accessed from process_intent and on_venue_event without locks | Bridge | Medium |
|
||||||
|
| V8 | `#[serde(default)] leverage: f64` default 0.0 — mark_price uses directly no `.max(1.0)` | Rust | Medium |
|
||||||
|
| V9 | No `conftest.py`, no `pytest.ini`, no `asyncio_mode` — test discovery fragile | Test | **High** |
|
||||||
|
| V10 | `kernel.close()` `except Exception: pass` — silently swallows destroy errors | Bridge | Low |
|
||||||
|
| V11 | `build_launcher_bundle()` no cleanup on partial failure — OOM orphans 4 components | Launcher | Medium |
|
||||||
|
| V12 | `KernelResult` clones entire kernel state every FFI call — wasted allocations | Rust | Medium |
|
||||||
|
| V13 | `_build_rb()` leaks bundle on post-creation failure | Test | Low |
|
||||||
|
| V14 | `_maybe_close` breaks after first method — never tries both close and disconnect | Launcher | Low |
|
||||||
|
|
||||||
|
### Pass 19 Severity
|
||||||
|
|
||||||
|
| Severity | Count |
|
||||||
|
|----------|-------|
|
||||||
|
| **Critical** | 5 (V1, V2, V4, V5, V6) |
|
||||||
|
| **High** | 2 (V3, V9) |
|
||||||
|
| Medium | 4 (V7, V8, V11, V12) |
|
||||||
|
| Low | 3 (V10, V13, V14) |
|
||||||
|
|
||||||
|
### Combined Catalog (All 19 Passes)
|
||||||
|
|
||||||
|
| Pass | Focus | Count | Critical | High | Medium | Low | Info |
|
||||||
|
|------|-------|-------|----------|------|--------|-----|------|
|
||||||
|
| A | Architectural | 15 | 0 | 2 | 0 | 2 | 11 |
|
||||||
|
| T | Threading/Atomicity | 9 | 1 | 3 | 3 | 2 | 0 |
|
||||||
|
| E | E2E Trace (Pass 1) | 26 | 0 | 4 | 10 | 11 | 1 |
|
||||||
|
| F | Deep E2E (Pass 3) | 30 | 0 | 1 | 8 | 17 | 4 |
|
||||||
|
| G | Domain Scans (Pass 4) | 36 | 4 | 11 | 11 | 8 | 2 |
|
||||||
|
| H | Edge Domains (Pass 5) | 22 | 3 | 9 | 5 | 4 | 1 |
|
||||||
|
| I | Pass 6 (Math/Tests/Recovery/Security) | 22 | 3 | 11 | 4 | 2 | 2 |
|
||||||
|
| J | Pass 7 (Test Infra/Data/Rust/Env/Conn) | 16 | 0 | 7 | 7 | 2 | 0 |
|
||||||
|
| K | Pass 8 (Observability/Memory/Time/DeadCode) | 23 | 2 | 7 | 7 | 1 | 6 |
|
||||||
|
| L | Pass 9 (Contracts/Events/Network/FFI/Diffs) | 16 | 0 | 4 | 8 | 4 | 0 |
|
||||||
|
| M | Pass 10 (Runtime/TestBugs/FSM/Persistence/Metrics) | 18 | 3 | 7 | 5 | 3 | 0 |
|
||||||
|
| N | Pass 11 (Async/Sync Seams/Locks/Threading) | 10 | 4 | 1 | 3 | 1 | 1 |
|
||||||
|
| O | Pass 12 (Sync/Async Wider Scope) | 11 | 0 | 3 | 7 | 1 | 0 |
|
||||||
|
| P | Pass 13 (FFI Safety/Dangling Pointers/Coverage) | 9 | 1 | 3 | 3 | 1 | 1 |
|
||||||
|
| Q | Pass 14 (Serde Edges/Backup Diffs/Market Data) | 12 | 0 | 4 | 3 | 2 | 3 |
|
||||||
|
| R | Pass 15 (Resource Leaks/Trust Boundaries/Security) | 14 | 2 | 6 | 3 | 2 | 1 |
|
||||||
|
| S | Pass 16 (Error Handling/Arithmetic/Test Infra) | 16 | 4 | 7 | 5 | 0 | 0 |
|
||||||
|
| T | Pass 17 (Unsafe Review/Dead Code/Build/Protocols) | 14 | 0 | 5 | 5 | 4 | 0 |
|
||||||
|
| U | Pass 18 (Rust Test Gaps/Accounting/FFI Types) | 14 | 3 | 4 | 4 | 3 | 0 |
|
||||||
|
| V | Pass 19 (Lifecycle/Rust Subtleties/Test Infra) | 14 | 5 | 2 | 4 | 3 | 0 |
|
||||||
|
| **Total** | | **347** | **35** | **101** | **100** | **64** | **37** |
|
||||||
|
|||||||
@@ -1,7 +1,26 @@
|
|||||||
# PINK DITAv2 — Structural Flaw Analysis (CENTRAL)
|
# PINK DITAv2 — Structural Flaw Analysis (CENTRAL)
|
||||||
|
|
||||||
**Analysis date:** 2026-05-31
|
**Analysis date:** 2026-05-31
|
||||||
|
**Last updated:** 2026-06-02 (flaw fix pass — 8 flaws closed)
|
||||||
**Scope:** Full PINK pipeline — all flaws across all modules.
|
**Scope:** Full PINK pipeline — all flaws across all modules.
|
||||||
|
|
||||||
|
> **Fix notation:** Rows marked **✅ FIXED `<sha>`** are verified-fixed with a test commit on branch `exp/pink-ditav2-sprint0-20260530`.
|
||||||
|
> Unfixed flaws remain as originally described.
|
||||||
|
|
||||||
|
### Fixes applied (2026-06-02)
|
||||||
|
|
||||||
|
| Flaw | Commit | What changed |
|
||||||
|
|------|--------|--------------|
|
||||||
|
| I1 — apply_fill partial fill accumulation | `c87ca78` | `slot.size += fill_size` (was `=`); all partial fills accumulate |
|
||||||
|
| I10 — seen_event_ids lost on restart | `c87ca78` + `3ca154e` | IndexSet dedup in AccountState; KernelFullSnapshot persists per-slot seen_event_ids; startup reconcile wired (I14) |
|
||||||
|
| I13 — stray event reactivates CLOSED slot | `c87ca78` | `slot.closed` guard returns TERMINAL_STATE before FSM branch |
|
||||||
|
| I14 — no Zinc restore on startup | `3ca154e` | `__init__` now calls `reconcile_from_slots(zinc_live)` for any non-idle Zinc slots |
|
||||||
|
| I15 — CANCEL_REJECT leaves slot in EXIT_WORKING | `9a8d1b9` | Clears `active_exit_order`, transitions to `POSITION_OPEN` |
|
||||||
|
| O1 — `_maybe_close()` silent skip from async context | `338811e` | Routes to thread-pool executor when a running loop is detected |
|
||||||
|
| O5 — `_run()` no timeout → process hang | `338811e` | `Future.result(timeout=_BACKEND_TIMEOUT_S)` (default 30 s); raises `TimeoutError` |
|
||||||
|
| O10 — no `close()` on ExecutionKernel | `3ca154e` | `close()` nulls `_backend` to prevent double-free; `__enter__`/`__exit__` added |
|
||||||
|
| N1 — `with_handle_mut` zero sync (partial) | `c87ca78` | `catch_unwind` at FFI boundary; concurrent-call UB mitigated by Python GIL |
|
||||||
|
|
||||||
**Sources:**
|
**Sources:**
|
||||||
- This file (A-series): Detailed writeups for architectural flaws.
|
- This file (A-series): Detailed writeups for architectural flaws.
|
||||||
- [PINK_DITAv2_E2E_TRACE_ANALYSIS.md](./PINK_DITAv2_E2E_TRACE_ANALYSIS.md) (E, F, G-series):
|
- [PINK_DITAv2_E2E_TRACE_ANALYSIS.md](./PINK_DITAv2_E2E_TRACE_ANALYSIS.md) (E, F, G-series):
|
||||||
@@ -33,7 +52,8 @@
|
|||||||
| S | Pass 16 (Error Handling/Arithmetic/Test Infra) | 16 | 4 | 7 | 5 | 0 | 0 |
|
| S | Pass 16 (Error Handling/Arithmetic/Test Infra) | 16 | 4 | 7 | 5 | 0 | 0 |
|
||||||
| T | Pass 17 (Unsafe Review/Dead Code/Build/Protocols) | 14 | 0 | 5 | 5 | 4 | 0 |
|
| T | Pass 17 (Unsafe Review/Dead Code/Build/Protocols) | 14 | 0 | 5 | 5 | 4 | 0 |
|
||||||
| U | Pass 18 (Rust Test Gaps/Accounting/FFI Types) | 14 | 3 | 4 | 4 | 3 | 0 |
|
| U | Pass 18 (Rust Test Gaps/Accounting/FFI Types) | 14 | 3 | 4 | 4 | 3 | 0 |
|
||||||
| **Total** | | **333** | **30** | **99** | **96** | **64** | **37** |
|
| V | Pass 19 (Lifecycle/Rust Subtleties/Test Infra) | 14 | 5 | 2 | 4 | 3 | 0 |
|
||||||
|
| **Total** | | **347** | **35** | **101** | **100** | **64** | **37** |
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -180,7 +200,7 @@
|
|||||||
|
|
||||||
| # | Flaw | Layer | Severity |
|
| # | Flaw | Layer | Severity |
|
||||||
|---|------|-------|----------|
|
|---|------|-------|----------|
|
||||||
| I1 | Entry `apply_fill` multiple partial fills overwrite size instead of accumulating | Rust | **Critical** |
|
| I1 | Entry `apply_fill` multiple partial fills overwrite size instead of accumulating — **✅ FIXED `c87ca78`** | Rust | **Critical** |
|
||||||
| I2 | Zero exit_ratio creates zero-size exit order — slot stuck in EXIT_REQUESTED | Rust | Medium |
|
| I2 | Zero exit_ratio creates zero-size exit order — slot stuck in EXIT_REQUESTED | Rust | Medium |
|
||||||
| I3 | entry_price inconsistency — Python falsy vs Rust `<= 0.0` gate | Bridge | Info |
|
| I3 | entry_price inconsistency — Python falsy vs Rust `<= 0.0` gate | Bridge | Info |
|
||||||
| I4 | Only 1 Rust unit test for 1765-line kernel — 99% untested at Rust layer | Rust | **High** |
|
| I4 | Only 1 Rust unit test for 1765-line kernel — 99% untested at Rust layer | Rust | **High** |
|
||||||
@@ -189,12 +209,12 @@
|
|||||||
| I7 | Three weak/vacuous assertions in test_flaws.py | Test | Low |
|
| I7 | Three weak/vacuous assertions in test_flaws.py | Test | Low |
|
||||||
| I8 | Entry overfill no guard | Rust | Low |
|
| I8 | Entry overfill no guard | Rust | Low |
|
||||||
| I9 | No crash durability — slot state pure in-memory until step 7 of process_intent | Bridge | **Critical** |
|
| I9 | No crash durability — slot state pure in-memory until step 7 of process_intent | Bridge | **Critical** |
|
||||||
| I10 | seen_event_ids lost on restart — events double-processed | Rust | **Critical** |
|
| I10 | seen_event_ids lost on restart — events double-processed — **✅ FIXED `c87ca78` + `3ca154e`** (IndexSet dedup + snapshot; startup restore wired) | Rust | **Critical** |
|
||||||
| I11 | No idempotency key sent to BingX — lost response creates duplicate orders | Venue | **High** |
|
| I11 | No idempotency key sent to BingX — lost response creates duplicate orders | Venue | **High** |
|
||||||
| I12 | No graceful degradation for ANY subsystem | All | **High** |
|
| I12 | No graceful degradation for ANY subsystem | All | **High** |
|
||||||
| I13 | Stray venue event can reactivate CLOSED slot — no guard | Rust | **High** |
|
| I13 | Stray venue event can reactivate CLOSED slot — no guard — **✅ FIXED `c87ca78`** | Rust | **High** |
|
||||||
| I14 | No reconcile_from_slots call on startup — Zinc state never loaded into kernel | Restart | **High** |
|
| I14 | No reconcile_from_slots call on startup — Zinc state never loaded into kernel — **✅ FIXED `3ca154e`** | Restart | **High** |
|
||||||
| I15 | CANCEL_REJECT doesn't clear active_exit_order — slot stuck in EXIT_WORKING | Rust | Medium |
|
| I15 | CANCEL_REJECT doesn't clear active_exit_order — slot stuck in EXIT_WORKING — **✅ FIXED `9a8d1b9`** | Rust | Medium |
|
||||||
| I16 | Zinc shared memory world-readable/writable by same-machine processes | Zinc | **High** |
|
| I16 | Zinc shared memory world-readable/writable by same-machine processes | Zinc | **High** |
|
||||||
| I17 | KernelSlotView unrestricted getattr/setattr — bypasses all FSM guards | Bridge | **High** |
|
| I17 | KernelSlotView unrestricted getattr/setattr — bypasses all FSM guards | Bridge | **High** |
|
||||||
| I18 | sys.path.insert(0) at import time in 3 production files — malicious module loading | Build | **High** |
|
| I18 | sys.path.insert(0) at import time in 3 production files — malicious module loading | Build | **High** |
|
||||||
@@ -320,7 +340,7 @@
|
|||||||
|
|
||||||
| # | Flaw | Layer | Severity |
|
| # | Flaw | Layer | Severity |
|
||||||
|---|------|-------|----------|
|
|---|------|-------|----------|
|
||||||
| N1 | Rust kernel `with_handle_mut` zero sync — `&mut` from raw ptr, UB on concurrent FFI | Rust | **Critical** |
|
| N1 | Rust kernel `with_handle_mut` zero sync — `&mut` from raw ptr, UB on concurrent FFI — *mitigated by Python GIL (single-threaded caller); catch_unwind added `c87ca78`* | Rust | **Critical** |
|
||||||
| N2 | `_run()` has two completely different code paths — runtime branch, not design | Venue | **Critical** |
|
| N2 | `_run()` has two completely different code paths — runtime branch, not design | Venue | **Critical** |
|
||||||
| N3 | `_run()` path B blocks event loop thread for every venue HTTP operation | Venue | **Critical** |
|
| N3 | `_run()` path B blocks event loop thread for every venue HTTP operation | Venue | **Critical** |
|
||||||
| N4 | `asyncio.run()` called repeatedly — creates/destroys event loops per call | Venue | **Critical** |
|
| N4 | `asyncio.run()` called repeatedly — creates/destroys event loops per call | Venue | **Critical** |
|
||||||
@@ -339,16 +359,16 @@
|
|||||||
|
|
||||||
| # | Flaw | Layer | Severity |
|
| # | Flaw | Layer | Severity |
|
||||||
|---|------|-------|----------|
|
|---|------|-------|----------|
|
||||||
| O1 | `_maybe_close()` asyncio.run without loop guard — close skipped from async context | Launcher | **High** |
|
| O1 | `_maybe_close()` asyncio.run without loop guard — close skipped from async context — **✅ FIXED `338811e`** | Launcher | **High** |
|
||||||
| O2 | `async def connect()` shims call sync venue.connect() without await — blocking | Test | Medium |
|
| O2 | `async def connect()` shims call sync venue.connect() without await — blocking | Test | Medium |
|
||||||
| O3 | `_contract_rows(client)` NOT awaited — `_pick_live_symbol` iterates coroutine = crash | Test | **High** |
|
| O3 | `_contract_rows(client)` NOT awaited — `_pick_live_symbol` iterates coroutine = crash | Test | **High** |
|
||||||
| O4 | Deprecated `get_event_loop().run_until_complete()` in test file | Test | Medium |
|
| O4 | Deprecated `get_event_loop().run_until_complete()` in test file | Test | Medium |
|
||||||
| O5 | `_run()` thread pool .result() no timeout — backend hang freezes process | Venue | **High** |
|
| O5 | `_run()` thread pool .result() no timeout — backend hang freezes process — **✅ FIXED `338811e`** | Venue | **High** |
|
||||||
| O6 | MockVenueAdapter never exercises thread-pool bridge — untested in CI | Venue | Medium |
|
| O6 | MockVenueAdapter never exercises thread-pool bridge — untested in CI | Venue | Medium |
|
||||||
| O7 | `_keepalive_loop`/`_rotation_sentinel` fire-and-forget — exceptions silently lost | Stream | Low |
|
| O7 | `_keepalive_loop`/`_rotation_sentinel` fire-and-forget — exceptions silently lost | Stream | Low |
|
||||||
| O8 | `KernelSlotView.__getattr__` N FFI calls for N fields — no caching | Bridge | Medium |
|
| O8 | `KernelSlotView.__getattr__` N FFI calls for N fields — no caching | Bridge | Medium |
|
||||||
| O9 | `DITAv2LauncherBundle` no `__del__` — GC'd bundle leaks resource tree | Launcher | Medium |
|
| O9 | `DITAv2LauncherBundle` no `__del__` — GC'd bundle leaks resource tree | Launcher | Medium |
|
||||||
| O10 | `ExecutionKernel` no `close()` — Rust handle only freed by unpredictable __del__ | Bridge | Medium |
|
| O10 | `ExecutionKernel` no `close()` — Rust handle only freed by unpredictable __del__ — **✅ FIXED `3ca154e`** | Bridge | Medium |
|
||||||
| O11 | `KernelSlotView.__setattr__` triggers 5 persistence side effects — no read-only view | Bridge | Medium |
|
| O11 | `KernelSlotView.__setattr__` triggers 5 persistence side effects — no read-only view | Bridge | Medium |
|
||||||
|
|
||||||
---
|
---
|
||||||
@@ -486,6 +506,29 @@
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
## V-Series: Startup/Shutdown Lifecycle, Rust Kernel Subtleties, Generated Test Infra (Pass 19)
|
||||||
|
|
||||||
|
*Full detail in TRACE doc under "PASS 19 — STARTUP/SHUTDOWN LIFECYCLE, RUST KERNEL SUBTLETIES, GENERATED TEST INFRA."*
|
||||||
|
|
||||||
|
| # | Flaw | Layer | Severity |
|
||||||
|
|---|------|-------|----------|
|
||||||
|
| V1 | `DITAv2LauncherBundle.close()` never calls `kernel.close()` — Rust handle leaks via `__del__` | Launcher | **Critical** |
|
||||||
|
| V2 | `BingxVenueAdapter` no `close()`/`disconnect()` — ThreadPoolExecutor/HTTP never release | Venue | **Critical** |
|
||||||
|
| V3 | `process_intent` ENTER doesn't clear `seen_event_ids` — old dedup pollutes new trade | Rust | **High** |
|
||||||
|
| V4 | 3 generators write same output file — last writer wins, incompatible prologues | Test | **Critical** |
|
||||||
|
| V5 | Generated tests triple env-gated — never run in CI, dead code | Test | **Critical** |
|
||||||
|
| V6 | `kernel.close()` destroys Rust handle immediately — no drain, no flush, UAF risk | Bridge | **Critical** |
|
||||||
|
| V7 | `_last_settled_pnl` dict accessed from process_intent and on_venue_event without locks | Bridge | Medium |
|
||||||
|
| V8 | `#[serde(default)] leverage: f64` default 0.0 — mark_price uses directly no .max(1.0) | Rust | Medium |
|
||||||
|
| V9 | No `conftest.py`, no `pytest.ini`, no `asyncio_mode` — test discovery fragile | Test | **High** |
|
||||||
|
| V10 | `kernel.close()` `except Exception: pass` — silently swallows destroy errors | Bridge | Low |
|
||||||
|
| V11 | `build_launcher_bundle()` no cleanup on partial failure — OOM orphans 4 components | Launcher | Medium |
|
||||||
|
| V12 | `KernelResult` clones entire kernel state every FFI call — wasted allocations | Rust | Medium |
|
||||||
|
| V13 | `_build_rb()` leaks bundle on post-creation failure | Test | Low |
|
||||||
|
| V14 | `_maybe_close` breaks after first method — never tries both close and disconnect | Launcher | Low |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
## H-Series: Edge Domains — Dependencies, Error Handling, Types, Contracts (Pass 5)
|
## H-Series: Edge Domains — Dependencies, Error Handling, Types, Contracts (Pass 5)
|
||||||
|
|
||||||
*Full detail in TRACE doc under "PASS 5 — EDGE DOMAINS."*
|
*Full detail in TRACE doc under "PASS 5 — EDGE DOMAINS."*
|
||||||
|
|||||||
1090
prod/clean_arch/dita_v2/PINK_DITAv2_FLAW_ANALYSIS_2026-05-31.md
Normal file
1090
prod/clean_arch/dita_v2/PINK_DITAv2_FLAW_ANALYSIS_2026-05-31.md
Normal file
File diff suppressed because it is too large
Load Diff
Reference in New Issue
Block a user