PINK: E2E trace analysis — Pass 13 FFI safety/dangling pointers/coverage (P1-P9)
Thirteenth pass: dita_kernel_destroy double-free UB — Python doesn't null handle.value (P1 Critical), CStr::from_ptr(payload) without null guard in 3 FFI exports (P2 High), _check_open_orders asyncio.run from async _verify crashes live tests (P3 High), _get_rust() TOCTOU race concurrent cargo build (P6 High), into_c_string NUL sanitizer produces invalid JSON (P4 Medium), reconcile/snapshot_json null on failure no diagnostic (P5 Medium). 263 total flaws across 13 passes. Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
This commit is contained in:
@@ -4819,3 +4819,258 @@ There is no read-only view that prevents accidental mutation. Any code that hold
|
||||
| N | Pass 11 (Async/Sync Seams/Locks/Threading) | 10 | 4 | 1 | 3 | 1 | 1 |
|
||||
| O | Pass 12 (Sync/Async Wider Scope) | 11 | 0 | 3 | 7 | 1 | 0 |
|
||||
| **Total** | | **254** | **20** | **70** | **73** | **60** | **28** |
|
||||
|
||||
---
|
||||
|
||||
## PASS 13 — FFI BOUNDARY SAFETY, DANGLING POINTERS, COVERAGE GAPS
|
||||
|
||||
### P1: `dita_kernel_destroy` double-free UB — Python does not null `handle.value` after destroy
|
||||
|
||||
**File:** `rust_backend.py:145-148`, `_rust_kernel/src/lib.rs:2081-2088`
|
||||
|
||||
```python
|
||||
# Python destroy():
|
||||
def destroy(self, handle):
|
||||
if handle and handle.value:
|
||||
self.lib.dita_kernel_destroy(handle) # handle.value NOT nulled
|
||||
```
|
||||
|
||||
```rust
|
||||
// Rust dita_kernel_destroy:
|
||||
pub extern "C" fn dita_kernel_destroy(handle: *mut KernelHandle) {
|
||||
if !handle.is_null() {
|
||||
unsafe { drop(Box::from_raw(handle)); }
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
If `destroy()` is called twice on the same handle:
|
||||
1. First call: `Box::from_raw(handle)` frees the memory. Python's `handle.value` still points to the old (now dangling) memory.
|
||||
2. Second call: `handle and handle.value` is `True` (dangling but non-null). Passes to Rust.
|
||||
3. Rust: `!handle.is_null()` is `True`. `Box::from_raw(handle)` on a dangling pointer is **undefined behavior** — heap corruption, use-after-free, or silent data corruption.
|
||||
|
||||
**Trigger scenarios:**
|
||||
- `ExecutionKernel.__del__` calls `destroy()` if `_backend` is non-null. If user code also calls `destroy()` explicitly (no such code today, but the method is public), double-free.
|
||||
- If `__del__` runs during interpreter shutdown after the `_RustKernelLib` CDLL object is partially finalized, the `self.lib` attribute might be `None` → `TypeError` rather than double-free, but if the CDLL is still alive, double-free.
|
||||
- Test code that creates/destroys kernels in a loop (fresh_kernel pattern) could trigger double-destroy if GC runs finalization twice on the same handle (possible with reference cycles and PEP 442).
|
||||
|
||||
**Fix:** Python `destroy()` should set `handle.value = None` after calling, or use a `_destroyed` flag.
|
||||
|
||||
**Severity: Critical** — undefined behavior on any double-destroy path.
|
||||
|
||||
### P2: `CStr::from_ptr(payload)` without null guard in multiple FFI exports
|
||||
|
||||
**File:** `_rust_kernel/src/lib.rs` — `dita_kernel_set_exchange_config_json`, `dita_kernel_calibrate_fee_json`, `dita_kernel_on_account_event_json`
|
||||
|
||||
```rust
|
||||
pub extern "C" fn dita_kernel_set_exchange_config_json(handle: *mut KernelHandle, payload: *const c_char) -> i32 {
|
||||
let payload = unsafe { CStr::from_ptr(payload) }; // NO NULL CHECK — UB if payload is null
|
||||
let payload_str = payload.to_str().map_err(|_| -1)?;
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
Three FFI functions call `CStr::from_ptr(payload)` directly on the raw `*const c_char` parameter **without** checking for null first. If a null pointer is passed (from Python ctypes passing `None`, or from a bug in a future caller), this reads from memory address 0 — **segfault or undefined behavior**.
|
||||
|
||||
The existing helper `cstr_to_string()` (line 1500) correctly checks for null:
|
||||
```rust
|
||||
fn cstr_to_string(ptr: *const c_char) -> Result<String, String> {
|
||||
if ptr.is_null() { return Err("NULL_POINTER".to_string()); }
|
||||
unsafe { CStr::from_ptr(ptr) }
|
||||
.to_str().map(|s| s.to_string()).map_err(|e| e.to_string())
|
||||
}
|
||||
```
|
||||
|
||||
But these three FFI functions bypass it. Only `dita_kernel_set_exchange_config_json` is called from Python; `calibrate_fee` and `on_account_event` are newer functions.
|
||||
|
||||
**Fix:** Use `cstr_to_string()` or add an explicit `if payload.is_null() { return -1; }` guard.
|
||||
|
||||
**Severity: High** — null pointer dereference on any call with a null payload.
|
||||
|
||||
### P3: `_check_open_orders` calls `asyncio.run()` from within async `_verify()` — RuntimeError in live test execution
|
||||
|
||||
**File:** `_gen_test.py:104`, `_build_pink_extended.py:75-78`
|
||||
|
||||
```python
|
||||
def _check_open_orders(c, vs):
|
||||
r = __import__('asyncio').run(c._request_json("GET", "/openApi/swap/v2/trade/openOrders", ...))
|
||||
```
|
||||
|
||||
This is a **sync** `def` that calls `asyncio.run()`. It is called from `_verify()` which is `async def` (inside the generated test file). When `_verify` runs inside `asyncio.run(_run(...))`, there is a **running event loop**. `_check_open_orders` calls `asyncio.run(...)` which detects the running loop and raises `RuntimeError: asyncio.run() cannot be called from a running event loop`.
|
||||
|
||||
The same pattern exists in `_build_pink_extended.py:75-78` in the patched version of `_check_open_orders`.
|
||||
|
||||
**Fix:** Make `_check_open_orders` `async def` and use `await` instead of `asyncio.run()`.
|
||||
|
||||
**Severity: High** — any live test that calls `_verify` (which all live tests do via `_run`) will crash.
|
||||
|
||||
### P4: `into_c_string` replaces NUL bytes with `"\\u0000"` — produces invalid JSON
|
||||
|
||||
**File:** `_rust_kernel/src/lib.rs:2006-2013`
|
||||
|
||||
```rust
|
||||
fn into_c_string(value: &str) -> *mut c_char {
|
||||
match CString::new(value) {
|
||||
Ok(cs) => cs.into_raw(),
|
||||
Err(_) => {
|
||||
let sanitized = value.replace('\0', "\\u0000"); // literal backslash-u-0-0-0-0
|
||||
CString::new(sanitized).unwrap_or_else(|_| CString::new("").unwrap()).into_raw()
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
When a string contains an interior NUL byte (`\0`), `into_c_string` replaces it with the 8-character ASCII string `"\\u0000"`. If this string is a JSON payload — which it always is for `process_intent_json`, `on_venue_event_json`, etc. — the sanitized string is **not valid JSON**. Python's `json.loads()` in `_take_string` receives invalid JSON and raises `json.JSONDecodeError`.
|
||||
|
||||
This is a data integrity issue: a NUL byte in an intent field (which shouldn't happen in normal use but could come from a malformed exchange response) causes the entire intent to fail with a `JSONDecodeError` rather than a clean `INVALID_INTENT` diagnostic.
|
||||
|
||||
Note: The NUL-byte panic (G2) was fixed by adding this sanitizer, but the sanitizer produces invalid JSON, trading a crash for a different failure mode.
|
||||
|
||||
**Fix:** Strip NUL bytes entirely (`.replace('\0', "")`) before JSON construction, or reject the intent with `invalid_intent_cstring` if NUL bytes are detected.
|
||||
|
||||
**Severity: Medium**
|
||||
|
||||
### P5: `reconcile_slots_json` returns null on serialize failure — inconsistent with intent/venue error paths
|
||||
|
||||
**File:** `_rust_kernel/src/lib.rs:2258`
|
||||
|
||||
```rust
|
||||
// reconcile_slots_json unwrap_or:
|
||||
.with_handle_mut(handle, |core| ...)
|
||||
.unwrap_or(ptr::null_mut()) // returns null — NO diagnostic
|
||||
```
|
||||
|
||||
When `reconcile_slots_json` encounters a parse or serialize failure, it returns `ptr::null_mut()`. Python's `_take_string` raises `RuntimeError("Rust kernel returned null string")` — an unhandled exception.
|
||||
|
||||
Compare with `process_intent_json` and `on_venue_event_json` which use:
|
||||
```rust
|
||||
.map_err(|e| invalid_intent_cstring("INVALID_INTENT_PARSE", &e))
|
||||
.unwrap_or_else(|ptr| ptr) // returns structured diagnostic JSON
|
||||
```
|
||||
|
||||
The reconcile and snapshot paths return bare null — no diagnostic, no structured error, no way for the Python side to distinguish "parse error" from "serialize error" from "null handle."
|
||||
|
||||
The same issue affects `dita_kernel_snapshot_json` (line 2269).
|
||||
|
||||
**Severity: Medium**
|
||||
|
||||
### P6: `_get_rust()` TOCTOU race on first call — concurrent threads both see `_RUST is None`
|
||||
|
||||
**File:** `rust_backend.py:271-275`
|
||||
|
||||
```python
|
||||
def _get_rust():
|
||||
global _RUST
|
||||
if _RUST is None:
|
||||
_RUST = _RustKernelLib() # two threads can both enter here
|
||||
return _RUST
|
||||
```
|
||||
|
||||
Two threads calling `_get_rust()` simultaneously on first access both see `_RUST is None`. Both enter the `if` block. Both call `_RustKernelLib()` which:
|
||||
1. Calls `_ensure_library()` which runs `subprocess.run(["cargo", "build", "--release", ...], check=True)` — **two concurrent cargo builds** can corrupt the build directory.
|
||||
2. Calls `ctypes.CDLL(path)` — loads the shared library twice. The second CDLL object is assigned to `_RUST` (overwriting the first), which is then GC'd, but the Rust library's global state may have been initialized twice.
|
||||
|
||||
**Fix:** Use a module-level lock.
|
||||
|
||||
**Severity: High**
|
||||
|
||||
### P7: `KernelHandle` has no `!Send`/`!Sync` — but ctypes FFI bypasses all Rust ownership rules
|
||||
|
||||
**File:** `_rust_kernel/src/lib.rs`
|
||||
|
||||
`KernelHandle` and `KernelCore` have no explicit `unsafe impl Send` or `unsafe impl Sync`. The Rust compiler would auto-derive `Send`/`Sync` based on their fields — but because they contain `HashMap<String, Value>` (serde_json::Value is not Sync), they should NOT be auto-Send/Sync. However, the compiler's auto-derivation may include them in the `Send`/`Sync` set based on field composition.
|
||||
|
||||
The real issue: even if Rust correctly determined `KernelHandle` is `!Send` and `!Sync`, the `*mut KernelHandle` pointer passed across FFI has no type-system enforcement. Python's ctypes calls `dita_kernel_process_intent_json(handle, ...)` which immediately converts the raw pointer to `&mut KernelCore` via `unsafe { &mut (*handle).core }`. The Rust compiler cannot enforce ownership rules across the FFI boundary.
|
||||
|
||||
This means: **the Rust kernel's thread-safety design relies entirely on the Python side never calling FFI from multiple threads simultaneously.** There is no mechanism in either language to enforce this.
|
||||
|
||||
**Severity: Informational** — documenting the existing design constraint (already covered in N1, but worth noting the Send/Sync aspect).
|
||||
|
||||
### P8: `dita_kernel_destroy` not called from bundle close — no explicit Rust handle cleanup path
|
||||
|
||||
**Files:** `launcher.py:83-95`, `rust_backend.py`
|
||||
|
||||
`DITAv2LauncherBundle.close()` calls:
|
||||
```python
|
||||
_maybe_close(self.venue)
|
||||
_maybe_close(self.zinc_plane)
|
||||
_maybe_close(self.control_plane)
|
||||
```
|
||||
|
||||
It does **not** call anything on `self.kernel`. `ExecutionKernel` has no `close()` method (O10). The Rust `_backend` handle is only freed when `__del__` runs during garbage collection.
|
||||
|
||||
The bundle holds `self.kernel` as a strong reference. As long as the bundle is alive, the kernel is alive. When the bundle is GC'd (or goes out of scope), the kernel's refcount may drop to zero, triggering `__del__`. But if the kernel has a reference cycle (K6: `Kernel → StateView → SlotView → Kernel`), `__del__` is delayed until the GC cycle.
|
||||
|
||||
**Fix:** Add `ExecutionKernel.close()` and call it from `DITAv2LauncherBundle.close()`.
|
||||
|
||||
**Severity: Medium**
|
||||
|
||||
### P9: `ExecutionKernel.__del__` accesses module-level `_RUST` — NameError during shutdown
|
||||
|
||||
**File:** `rust_backend.py:519-525**
|
||||
|
||||
```python
|
||||
def __del__(self):
|
||||
backend = getattr(self, "_backend", None)
|
||||
if backend is not None:
|
||||
try:
|
||||
_get_rust().destroy(backend) # accesses module-level _RUST
|
||||
except Exception:
|
||||
pass
|
||||
```
|
||||
|
||||
During Python interpreter shutdown, the interpreter clears module globals **before** calling `__del__` on remaining objects. If `ExecutionKernel` survives to shutdown, `_get_rust()` accesses the module-level `_RUST` variable which may have already been set to `None` (module globals cleared). This raises `TypeError: 'NoneType' object is not callable` (when `_RUST is None` and `_get_rust()` tries to call `_RustKernelLib()`), or `NameError` if the variable itself has been deleted from the module namespace.
|
||||
|
||||
The `except Exception: pass` catches this, but the Rust handle is never destroyed — it leaks.
|
||||
|
||||
**Severity: Low** — caught by except, only at shutdown, Rust kernel handle is lost but process is exiting.
|
||||
|
||||
### P10: `_check_open_orders` in `_gen_test.py` has redundant `asyncio.run` already covered in P3 — different location, same pattern
|
||||
|
||||
Already covered in P3. Same root cause in `_build_pink_extended.py:75-78`. No additional finding.
|
||||
|
||||
---
|
||||
|
||||
## Pass 13 Summary
|
||||
|
||||
| # | Flaw | Layer | Severity |
|
||||
|---|------|-------|----------|
|
||||
| P1 | `dita_kernel_destroy` double-free UB — Python doesn't null handle.value | Bridge | **Critical** |
|
||||
| P2 | `CStr::from_ptr(payload)` without null guard in 3 FFI exports | Rust | **High** |
|
||||
| P3 | `_check_open_orders` calls `asyncio.run()` from async `_verify` — RuntimeError | Test | **High** |
|
||||
| P4 | `into_c_string` NUL sanitizer produces invalid JSON — json.loads fails | Rust | Medium |
|
||||
| P5 | `reconcile_slots_json`/`snapshot_json` return null on failure — no diagnostic | Rust | Medium |
|
||||
| P6 | `_get_rust()` TOCTOU race — concurrent cargo build corruption | Bridge | **High** |
|
||||
| P7 | `KernelHandle` no Send/Sync — FFI bypasses Rust ownership rules | Rust | Info |
|
||||
| P8 | No explicit Rust handle destroy path from bundle.close() | Launcher | Medium |
|
||||
| P9 | `__del__` accesses module `_RUST` during shutdown — NameError leak | Bridge | Low |
|
||||
|
||||
### Pass 13 Severity
|
||||
|
||||
| Severity | Count |
|
||||
|----------|-------|
|
||||
| **Critical** | 1 (P1) |
|
||||
| **High** | 3 (P2, P3, P6) |
|
||||
| Medium | 3 (P4, P5, P8) |
|
||||
| Low | 1 (P9) |
|
||||
| Info | 1 (P7) |
|
||||
|
||||
### Combined Catalog (All 13 Passes)
|
||||
|
||||
| Pass | Focus | Count | Critical | High | Medium | Low | Info |
|
||||
|------|-------|-------|----------|------|--------|-----|------|
|
||||
| A | Architectural | 15 | 0 | 2 | 0 | 2 | 11 |
|
||||
| T | Threading/Atomicity | 9 | 1 | 3 | 3 | 2 | 0 |
|
||||
| E | E2E Trace (Pass 1) | 26 | 0 | 4 | 10 | 11 | 1 |
|
||||
| F | Deep E2E (Pass 3) | 30 | 0 | 1 | 8 | 17 | 4 |
|
||||
| G | Domain Scans (Pass 4) | 36 | 4 | 11 | 11 | 8 | 2 |
|
||||
| H | Edge Domains (Pass 5) | 22 | 3 | 9 | 5 | 4 | 1 |
|
||||
| I | Pass 6 (Math/Tests/Recovery/Security) | 22 | 3 | 11 | 4 | 2 | 2 |
|
||||
| J | Pass 7 (Test Infra/Data/Rust/Env/Conn) | 16 | 0 | 7 | 7 | 2 | 0 |
|
||||
| K | Pass 8 (Observability/Memory/Time/DeadCode) | 23 | 2 | 7 | 7 | 1 | 6 |
|
||||
| L | Pass 9 (Contracts/Events/Network/FFI/Diffs) | 16 | 0 | 4 | 8 | 4 | 0 |
|
||||
| M | Pass 10 (Runtime/TestBugs/FSM/Persistence/Metrics) | 18 | 3 | 7 | 5 | 3 | 0 |
|
||||
| N | Pass 11 (Async/Sync Seams/Locks/Threading) | 10 | 4 | 1 | 3 | 1 | 1 |
|
||||
| O | Pass 12 (Sync/Async Wider Scope) | 11 | 0 | 3 | 7 | 1 | 0 |
|
||||
| P | Pass 13 (FFI Safety/Dangling Pointers/Coverage) | 9 | 1 | 3 | 3 | 1 | 1 |
|
||||
| **Total** | | **263** | **21** | **73** | **76** | **64** | **29** |
|
||||
|
||||
@@ -27,7 +27,8 @@
|
||||
| M | Pass 10 (Runtime/TestBugs/FSM/Persistence/Metrics) | 18 | 3 | 7 | 5 | 3 | 0 |
|
||||
| N | Pass 11 (Async/Sync Seams/Locks/Threading) | 10 | 4 | 1 | 3 | 1 | 1 |
|
||||
| O | Pass 12 (Sync/Async Wider Scope) | 11 | 0 | 3 | 7 | 1 | 0 |
|
||||
| **Total** | | **254** | **20** | **70** | **73** | **60** | **28** |
|
||||
| P | Pass 13 (FFI Safety/Dangling Pointers/Coverage) | 9 | 1 | 3 | 3 | 1 | 1 |
|
||||
| **Total** | | **263** | **21** | **73** | **76** | **64** | **29** |
|
||||
|
||||
---
|
||||
|
||||
@@ -347,6 +348,24 @@
|
||||
|
||||
---
|
||||
|
||||
## P-Series: FFI Safety, Dangling Pointers, Coverage Gaps (Pass 13)
|
||||
|
||||
*Full detail in TRACE doc under "PASS 13 — FFI BOUNDARY SAFETY, DANGLING POINTERS, COVERAGE GAPS."*
|
||||
|
||||
| # | Flaw | Layer | Severity |
|
||||
|---|------|-------|----------|
|
||||
| P1 | `dita_kernel_destroy` double-free UB — Python doesn't null handle.value | Bridge | **Critical** |
|
||||
| P2 | `CStr::from_ptr(payload)` without null guard in 3 FFI exports | Rust | **High** |
|
||||
| P3 | `_check_open_orders` calls `asyncio.run()` from async `_verify` — RuntimeError | Test | **High** |
|
||||
| P4 | `into_c_string` NUL sanitizer produces invalid JSON — json.loads fails | Rust | Medium |
|
||||
| P5 | `reconcile_slots_json`/`snapshot_json` return null on failure — no diagnostic | Rust | Medium |
|
||||
| P6 | `_get_rust()` TOCTOU race — concurrent cargo build corruption | Bridge | **High** |
|
||||
| P7 | `KernelHandle` no Send/Sync — FFI bypasses Rust ownership rules | Rust | Info |
|
||||
| P8 | No explicit Rust handle destroy path from bundle.close() | Launcher | Medium |
|
||||
| P9 | `__del__` accesses module `_RUST` during shutdown — NameError leak | Bridge | Low |
|
||||
|
||||
---
|
||||
|
||||
## H-Series: Edge Domains — Dependencies, Error Handling, Types, Contracts (Pass 5)
|
||||
|
||||
*Full detail in TRACE doc under "PASS 5 — EDGE DOMAINS."*
|
||||
|
||||
Reference in New Issue
Block a user