diff --git a/PINK_DITAv2_E2E_TRACE_ANALYSIS.md b/PINK_DITAv2_E2E_TRACE_ANALYSIS.md index 0cd932b..0dc0aae 100644 --- a/PINK_DITAv2_E2E_TRACE_ANALYSIS.md +++ b/PINK_DITAv2_E2E_TRACE_ANALYSIS.md @@ -5312,3 +5312,339 @@ Covered in Q1 and Q6. Listing all sites: `rust_backend.py:215,241,260` and `real | P | Pass 13 (FFI Safety/Dangling Pointers/Coverage) | 9 | 1 | 3 | 3 | 1 | 1 | | Q | Pass 14 (Serde Edges/Backup Diffs/Market Data) | 12 | 0 | 4 | 3 | 2 | 3 | | **Total** | | **275** | **21** | **77** | **79** | **64** | **34** | + +--- + +## PASS 15 — RESOURCE LEAKS, TRUST BOUNDARIES, SECURITY + +### R1: `BingxVenueAdapter` `ThreadPoolExecutor` never shut down — 3 threads leak for process lifetime + +**File:** `bingx_venue.py:194-208` + +```python +class BingxVenueAdapter(VenueAdapter): + _EXECUTOR: concurrent.futures.ThreadPoolExecutor | None = None + _EXECUTOR_LOCK: threading.Lock = threading.Lock() + + @classmethod + def _get_executor(cls) -> concurrent.futures.ThreadPoolExecutor: + if cls._EXECUTOR is None: + with cls._EXECUTOR_LOCK: + if cls._EXECUTOR is None: + cls._EXECUTOR = concurrent.futures.ThreadPoolExecutor( + max_workers=3, thread_name_prefix="bingx_adapter", + ) + return cls._EXECUTOR +``` + +The `ThreadPoolExecutor` is a class-level singleton with **no shutdown path**. No `close()` method, no `atexit` handler, no classmethod for cleanup. The 3 worker threads persist for the entire process lifetime. + +`ThreadPoolExecutor.__del__` in CPython calls `shutdown(wait=False)`, but this only runs when the executor object is GC'd. Since `_EXECUTOR` is a class variable, it's only GC'd when the class is (at interpreter shutdown). The CPython source for `ThreadPoolExecutor.__del__` calls `shutdown(wait=False)` which interrupts idle threads but doesn't wait for them. During shutdown, this races with module cleanup — threads accessing module globals see `None`. + +**Trigger:** Every call to `BingxVenueAdapter.submit()` (line 371), `cancel()` (line 435), or `snapshot()` (line 509) submits to this executor. After 24+ hours of trading, 3 unnamed worker threads have consumed thread-local resources (stack ~8MB each = 24MB minimum) with no clean reclamation path. + +**Severity: High** + +### R2: `BingxVenueAdapter` has no `close()` method — backend `BingxDirectExecutionAdapter` HTTP client unreleasable + +**File:** `bingx_venue.py` (no close method), `venue.py` (protocol), `launcher.py:262-264` + +The `VenueAdapter` protocol (`venue.py`) does not define `close()` or `disconnect()`. `BingxVenueAdapter` holds a `_backend: BingxDirectExecutionAdapter` which in turn holds an HTTP client session. The launcher's `_maybe_close()` tries `.close()` then `.disconnect()` but gets `AttributeError` (caught by `except Exception: pass`). + +This means: +- The `BingxDirectExecutionAdapter`'s `aiohttp.ClientSession` is never closed +- The underlying `TCPConnector` connection pool remains open +- Any HTTP keep-alive connections to BingX remain open until OS timeout +- No clean teardown path exists at any level of the venue stack + +The `BingxUserStream` (WebSocket handler) has a `close()` method and proper task cancellation. The venue adapter (the synchronous REST path) has none. Asymmetric design. + +**Severity: High** + +### R3: `real_zinc_plane._intent_cache` grows unboundedly — memory proportional to total lifetime intents + +**File:** `real_zinc_plane.py:157,202` + +```python +# Line 157 (__init__): +self._intent_cache: List[Dict[str, Any]] = [] + +# Line 201-203 (publish_intent): +self._intent_cache.append(row) +self._write_region(self.intent_region, self._intent_seq, {"items": self._intent_cache[-512:]}) +``` + +Every call to `publish_intent()` appends one dict to `_intent_cache`. Only the last 512 items are written to shared memory, but **the cache list itself is never trimmed**. Over a 24-hour session at 1 intent/second, this grows to 86,400 dicts — approximately 50-100 MB of memory for the cache alone (each dict contains timestamp, intent_id, asset, side, size, price, etc.). + +The `-512` slice on write is a dead giveaway that the developer knew only the last 512 items were relevant — but forgot to trim the source list. The fix is `self._intent_cache = self._intent_cache[-512:]` after the append. + +**Compare with:** `account.py`'s `seen_account_event_ids` (Rust side, capped at 1024 via `IndexSet` LRU), and `journal.py`'s `MemoryKernelJournal` (capped at 10,000). Every other cache in the system has a bound; this one doesn't. + +**Severity: High** + +### R4: `RealZincPlane`/`RealZincControlPlane` partial-construction leak — `SharedRegion` never cleaned up on init failure + +**File:** `real_zinc_plane.py:161-176`, `real_control_plane.py:72-83` + +```python +# real_zinc_plane.py __init__: +self.intent_region = SharedRegion.create(f"{prefix}_intent", 65536) +self.state_region = SharedRegion.create(f"{prefix}_state", 65536) # if this fails... +self.control_region = SharedRegion.create(f"{prefix}_control", 4096) # ...or this +``` + +If `SharedRegion.create()` succeeds for `intent_region` but fails for `state_region` (e.g., out of shared memory, permission denied, name collision), the constructor raises. The already-created `intent_region` has **no cleanup path** — `close()` is never called because the caller never gets a valid object reference. The shared memory segment leaks until the OS cleans it on process exit (or reboot on some systems). + +Same pattern in `RealZincControlPlane.__init__` with 2 regions. + +**Severity: Medium** + +### R5: `BingxUserStream` `ClientSession` has no `__del__` fallback — connection pool leak if `close()` not called + +**File:** `bingx_user_stream.py:229-230,433-436` + +```python +async def close(self) -> None: + self._closed.set() + if self._session is not None and not self._session.closed: + await self._session.close() +``` + +The `aiohttp.ClientSession` (created in `_get_session()` with `TCPConnector(limit=4)`) is only closed when `close()` is explicitly called. There is no `__del__`, `__aenter__`, or `__aexit__`. If a caller abandons the `BingxUserStream` object without calling `close()` — or if `close()` is never called because an exception occurs before the call — the TCP connection pool (4 connections to BingX) leaks. + +During the reconnect loop (`subscribe()`), if `self._session.closed` is detected between retries, a new session and connector are created — the old connector's connections are released by `ClientSession.close()` which is called in the retry path. So the reconnect path itself is clean. But the top-level cleanup depends entirely on external discipline. + +**Severity: Medium** + +### R6: `test_alpha_blue_untouched_g7.py` — two `open()` calls without context manager, file descriptors leak + +**File:** `test_alpha_blue_untouched_g7.py:31,63` + +```python +src = open("/mnt/dolphinng5_predict/prod/clean_arch/dita_v2/gen2.py").read() # line 31 +src = open(full).read() # line 63 +``` + +Both open a file, chain `.read()` to load contents, but **never close the file handle**. The file descriptor is leaked until garbage collection. In a test suite with thousands of tests, this can exhaust the ulimit (default 1024 on Linux, lower on macOS). + +**Severity: Low** (test code, non-production) + +### R7: All exchange REST/WS data parsed without schema validation — exchange controls all field values + +**Files:** `bingx_venue.py:60-74,80-88,96-121,151-186`, `bingx_user_stream.py:267-379` + +The system has a single trust boundary for exchange data: all BingX REST API responses and WebSocket frames are parsed without schema validation. Key entry points: + +- **`bingx_user_stream.py:267`**: `json.loads(text)` on raw WebSocket frame — any valid JSON structure is accepted. No schema validation before field access with `.get()`. +- **`bingx_venue.py:60-74`**: `_row_text()` extracts string values from exchange response dicts with no sanitization beyond `.strip()`. +- **`bingx_venue.py:80-88`**: `_row_float()` — catches `ValueError` on float parse but does not filter NaN/Inf (these pass `float()` fine). +- **`bingx_venue.py:297-301`**: `_rate_limit_retry_after_ms()` parses exchange error message with `re.search(r"unblocked after (\d+)", msg)` — exchange controls the error message content. +- **`bingx_venue.py:338-340`**: `cancel()` exception handler includes `str(exc)` directly in the response event dict. + +An exchange sending crafted responses could inject: +- Arbitrary strings into `reason`/`msg` fields (propagated to journal/ClickHouse) +- Non-numeric values that fail `float()` only on consumption +- Enormous lists in snapshot responses (OOM risk — no size limit on `snapshot.open_orders` iteration) +- NaN/Inf in price/size fields (pass through `float()` — Rust kernel `is_finite()` check on kernel side catches some but not all) + +**Severity: Critical** — exchange controls all inbound data with no schema validation, and data flows to ClickHouse journaling and the Rust kernel memory. + +### R8: Shared memory JSON deserialization without integrity check + +**Files:** `real_zinc_plane.py:127-128`, `real_control_plane.py:60-61` + +```python +# real_zinc_plane.py +def _decode_packet(self, payload: bytes) -> dict: + return json.loads(payload) + +# real_control_plane.py +payload = region.read() +if payload: + data = json.loads(payload) +``` + +Both the Zinc plane and control plane deserialize JSON from shared memory **without any integrity check** (no HMAC, no checksum, no signature). Any process with access to the shared memory segment (`/dev/shm` on Linux, world-readable by default) can: + +1. Inject arbitrary `KernelIntent` objects — the control plane reads intents from `intent_region` and dispatches them to `process_intent()`. An attacker could submit fake intents with malicious parameters. +2. Inject fake events into the event stream via the control plane region. +3. Corrupt slot state via the state region. + +The shared memory segments are named by the `DITA_V2_PREFIX` env var (default `dita_v2`). On a shared system, any process running as the same user can read/write these segments. + +**Severity: High** + +### R9: `restore_state()` deserializes arbitrary JSON into full kernel state — no provenance tracking + +**File:** `rust_backend.py:293-296` (Python), `_rust_kernel/src/lib.rs:934-968` (Rust) + +```python +# Python: +def restore_state(self, json_str: str) -> bool: + result = self.lib.dita_kernel_restore_state_json(self._backend, json_str.encode("utf-8")) + return result == 0 + +# Rust: +pub extern "C" fn dita_kernel_restore_state_json(handle: *mut KernelHandle, payload: *const c_char) -> i32 { + let payload = ...; // parse to KernelFullSnapshot + let core = unsafe { &mut (*handle).core }; + core.restore_full_snapshot(&payload) // overwrites ALL kernel state +} +``` + +`dita_kernel_restore_state_json` overwrites the entire kernel state — all slots, account balances, fee configuration, seen_event_ids, and capital_frozen flag — from a single JSON string. The method is public on `ExecutionKernel.restore_state()` with **no authentication, no authorization check, and no call stack validation**. + +The JSON string can come from: +- The `DITAv2LauncherBundle` (restart path) +- A file read from disk +- An attacker who gains access to the Python runtime (e.g., via shared memory injection, R8) + +Once restored, the kernel accepts the state as truth. There is no `restore_state` counter or version chain to prevent replay of old snapshots. + +**Severity: Critical** + +### R10: `DOLPHIN_BINGX_ENV` + `DOLPHIN_BINGX_ALLOW_MAINNET` — mainnet switch via env var + +**File:** `launcher.py:189-190` + +```python +DOLPHIN_BINGX_ENV = os.environ.get("DOLPHIN_BINGX_ENV", "VST") # VST = testnet +DOLPHIN_BINGX_ALLOW_MAINNET = os.environ.get("DOLPHIN_BINGX_ALLOW_MAINNET", "false").lower() in ("true", "1", "yes") +``` + +Setting `DOLPHIN_BINGX_ENV=LIVE` + `DOLPHIN_BINGX_ALLOW_MAINNET=true` switches from testnet to production mainnet BingX. The `DOLPHIN_BINGX_ALLOW_MAINNET` check exists specifically as a safety gate, but both are attacker-controlled env vars with the same provenance as all other env config. + +An attacker with access to set env vars (container breakout, CI/CD injection, shared hosting) could: +1. Redirect all trades to mainnet +2. Use real capital instead of testnet funds +3. Cost real money on every trade + +**Severity: High** + +### R11: `.env` file loaded from project root — secrets exposure risk + +**File:** `launcher.py:23,51` + +```python +from dotenv import load_dotenv +... +load_dotenv(PROJECT_ROOT / ".env") +``` + +The `.env` file is loaded from `PROJECT_ROOT`, which is `Path(__file__).resolve().parents[3]` — three directories up from the launcher file. On a shared development machine or CI runner, this file is: + +1. World-readable if not explicitly chmod'd (default umask creates files 644) +2. Accessible to any process running as the same user +3. Often committed to version control accidentally (no `.gitignore` guarantee) +4. Visible in Docker layer history if included in the build context + +The `.env` file contains `BINGX_API_KEY` and `BINGX_SECRET_KEY` — the exchange credentials. On a shared system, every user with read access can extract these keys. + +**Severity: High** + +### R12: Unvalidated `int()` on env vars — `DOLPHIN_BINGX_RECV_WINDOW_MS` could accept extreme values + +**File:** `launcher.py:191-193` + +```python +recv_window_ms = int(os.environ.get("DOLPHIN_BINGX_RECV_WINDOW_MS", "5000")) +default_leverage = int(os.environ.get("DOLPHIN_BINGX_DEFAULT_LEVERAGE", "1")) +exchange_leverage_cap = int(os.environ.get("DOLPHIN_BINGX_EXCHANGE_LEVERAGE_CAP", "3")) +``` + +Three env vars are directly passed to `int()` with only a string default — **no bounds checking**. An attacker setting `DOLPHIN_BINGX_RECV_WINDOW_MS=2147483647` could set the exchange recv window to ~24 days, allowing replay attacks on signed requests. An attacker setting `DOLPHIN_BINGX_EXCHANGE_LEVERAGE_CAP=1000` could allow 1000x leverage on the exchange. + +**Severity: Medium** + +### R13: `BingxUserStream` `listenKey` from exchange response used in WebSocket URL — MITM injection surface + +**File:** `bingx_user_stream.py:230,398` + +```python +# Line 230: +url = f"{self._ws_url}?listenKey={listen_key}" + +# Line 398: +listen_key = resp.get("listenKey", "") # from exchange POST /openApi/user/auth/userDataStream +``` + +The `listenKey` comes from the BingX REST API response (`POST /openApi/user/auth/userDataStream`). It is used directly in the WebSocket connection URL **with no encoding or validation**. The listenKey is a short opaque string (looks like a UUID), but: + +- If an attacker can MITM the REST response (DNS spoofing, proxy, etc.), they control the listenKey value +- A malicious listenKey with URL metacharacters (`&`, `=`, `#`) could inject query parameters into the WebSocket URL +- The `listenKey` is BingX's session authentication mechanism — once an attacker controls it, they can hijack the user data stream + +The fix is `urllib.parse.urlencode({"listenKey": listen_key})` but the current code uses an f-string. + +**Severity: High** + +### R14: `mock_venue._exchange_event_queue` unbounded growth — event enqueue rate can exceed consumption rate + +**File:** `mock_venue.py:220,230` + +```python +# _queue_exchange_event: +self._exchange_event_queue.append(event) + +# subscribe (generator): +if self._exchange_event_queue: + yield self._exchange_event_queue.pop(0) +``` + +The mock venue's event queue is consumed one event at a time via a generator in `subscribe()`. If `queue_exchange_event()` is called faster than the consumer calls `next()` on the generator (which happens on every `_on_exchange_event()` callback), the list grows unboundedly. In test scenarios with rapid fire events, this can exhaust memory. + +Not a production risk (mock is test-only), but the unbounded growth pattern is worth noting. + +**Severity: Low** (test code) + +--- + +## Pass 15 Summary + +| # | Flaw | Layer | Severity | +|---|------|-------|----------| +| R1 | `ThreadPoolExecutor` never shut down — 3 threads leak | Venue | **High** | +| R2 | `BingxVenueAdapter` no `close()` — backend HTTP client unreleasable | Venue | **High** | +| R3 | `real_zinc_plane._intent_cache` grows unboundedly | Plane | **High** | +| R4 | `RealZincPlane`/`ControlPlane` partial-construction SharedRegion leak | Plane | Medium | +| R5 | `BingxUserStream.ClientSession` no `__del__` — connection pool leak | Venue | Medium | +| R6 | `test_alpha_blue_untouched_g7.py` open() without context manager | Test | Low | +| R7 | All exchange REST/WS data parsed without schema validation | Venue | **Critical** | +| R8 | Shared memory JSON deserialization without integrity check | Plane | **High** | +| R9 | `restore_state()` deserializes arbitrary JSON — full kernel takeover | Bridge | **Critical** | +| R10 | `DOLPHIN_BINGX_ENV` + `ALLOW_MAINNET` mainnet switch via env | Config | **High** | +| R11 | `.env` file loaded from project root — secrets exposure | Config | **High** | +| R12 | Unvalidated `int()` on env vars — recv_window, leverage extremes | Config | Medium | +| R13 | `listenKey` from exchange in WS URL f-string — MITM injection | Venue | **High** | +| R14 | `mock_venue._exchange_event_queue` unbounded growth | Test | Low | + +### Pass 15 Severity + +| Severity | Count | +|----------|-------| +| **Critical** | 2 (R7, R9) | +| **High** | 6 (R1, R2, R3, R8, R10, R11, R13) | +| Medium | 3 (R4, R5, R12) | +| Low | 2 (R6, R14) | + +### Combined Catalog (All 15 Passes) + +| Pass | Focus | Count | Critical | High | Medium | Low | Info | +|------|-------|-------|----------|------|--------|-----|------| +| A | Architectural | 15 | 0 | 2 | 0 | 2 | 11 | +| T | Threading/Atomicity | 9 | 1 | 3 | 3 | 2 | 0 | +| E | E2E Trace (Pass 1) | 26 | 0 | 4 | 10 | 11 | 1 | +| F | Deep E2E (Pass 3) | 30 | 0 | 1 | 8 | 17 | 4 | +| G | Domain Scans (Pass 4) | 36 | 4 | 11 | 11 | 8 | 2 | +| H | Edge Domains (Pass 5) | 22 | 3 | 9 | 5 | 4 | 1 | +| I | Pass 6 (Math/Tests/Recovery/Security) | 22 | 3 | 11 | 4 | 2 | 2 | +| J | Pass 7 (Test Infra/Data/Rust/Env/Conn) | 16 | 0 | 7 | 7 | 2 | 0 | +| K | Pass 8 (Observability/Memory/Time/DeadCode) | 23 | 2 | 7 | 7 | 1 | 6 | +| L | Pass 9 (Contracts/Events/Network/FFI/Diffs) | 16 | 0 | 4 | 8 | 4 | 0 | +| M | Pass 10 (Runtime/TestBugs/FSM/Persistence/Metrics) | 18 | 3 | 7 | 5 | 3 | 0 | +| N | Pass 11 (Async/Sync Seams/Locks/Threading) | 10 | 4 | 1 | 3 | 1 | 1 | +| O | Pass 12 (Sync/Async Wider Scope) | 11 | 0 | 3 | 7 | 1 | 0 | +| P | Pass 13 (FFI Safety/Dangling Pointers/Coverage) | 9 | 1 | 3 | 3 | 1 | 1 | +| Q | Pass 14 (Serde Edges/Backup Diffs/Market Data) | 12 | 0 | 4 | 3 | 2 | 3 | +| R | Pass 15 (Resource Leaks/Trust Boundaries/Security) | 14 | 2 | 6 | 3 | 2 | 1 | +| **Total** | | **289** | **23** | **83** | **82** | **64** | **37** | diff --git a/PINK_DITAv2_FLAW_ANALYSIS_2026-05-31.md b/PINK_DITAv2_FLAW_ANALYSIS_2026-05-31.md index 30ca78e..fd64550 100644 --- a/PINK_DITAv2_FLAW_ANALYSIS_2026-05-31.md +++ b/PINK_DITAv2_FLAW_ANALYSIS_2026-05-31.md @@ -29,7 +29,8 @@ | O | Pass 12 (Sync/Async Wider Scope) | 11 | 0 | 3 | 7 | 1 | 0 | | P | Pass 13 (FFI Safety/Dangling Pointers/Coverage) | 9 | 1 | 3 | 3 | 1 | 1 | | Q | Pass 14 (Serde Edges/Backup Diffs/Market Data) | 12 | 0 | 4 | 3 | 2 | 3 | -| **Total** | | **275** | **21** | **77** | **79** | **64** | **34** | +| R | Pass 15 (Resource Leaks/Trust Boundaries/Security) | 14 | 2 | 6 | 3 | 2 | 1 | +| **Total** | | **289** | **23** | **83** | **82** | **64** | **37** | --- @@ -388,6 +389,29 @@ --- +## R-Series: Resource Leaks, Trust Boundaries, Security (Pass 15) + +*Full detail in TRACE doc under "PASS 15 — RESOURCE LEAKS, TRUST BOUNDARIES, SECURITY."* + +| # | Flaw | Layer | Severity | +|---|------|-------|----------| +| R1 | `ThreadPoolExecutor` never shut down — 3 threads leak | Venue | **High** | +| R2 | `BingxVenueAdapter` no `close()` — backend HTTP client unreleasable | Venue | **High** | +| R3 | `real_zinc_plane._intent_cache` grows unboundedly | Plane | **High** | +| R4 | `RealZincPlane`/`ControlPlane` partial-construction SharedRegion leak | Plane | Medium | +| R5 | `BingxUserStream.ClientSession` no `__del__` — connection pool leak | Venue | Medium | +| R6 | `test_alpha_blue_untouched_g7.py` open() without context manager | Test | Low | +| R7 | All exchange REST/WS data parsed without schema validation | Venue | **Critical** | +| R8 | Shared memory JSON deserialization without integrity check | Plane | **High** | +| R9 | `restore_state()` deserializes arbitrary JSON — full kernel takeover | Bridge | **Critical** | +| R10 | `DOLPHIN_BINGX_ENV` + `ALLOW_MAINNET` mainnet switch via env | Config | **High** | +| R11 | `.env` file loaded from project root — secrets exposure | Config | **High** | +| R12 | Unvalidated `int()` on env vars — recv_window, leverage extremes | Config | Medium | +| R13 | `listenKey` from exchange in WS URL f-string — MITM injection | Venue | **High** | +| R14 | `mock_venue._exchange_event_queue` unbounded growth | Test | Low | + +--- + ## H-Series: Edge Domains — Dependencies, Error Handling, Types, Contracts (Pass 5) *Full detail in TRACE doc under "PASS 5 — EDGE DOMAINS."*