# DITAv2 Kernel Reference
**Status:** active
**Scope:** DITAv2 execution kernel, operator launcher, shared-memory control plane, venue adapters, and observability integration.
**Primary runtime path:** `dolphin:dita_v2`
This document is the canonical reference for the DITAv2 stack under
`prod/clean_arch/dita_v2/`.
It describes:
- the execution kernel contract
- the kernel state model and FSM
- Zinc / Hazelcast boundaries
- mock and BingX venue adapters
- launcher and operator control surfaces
- debug and replay semantics
- failure and recovery behavior
- test strategy and invariants
The DITAv2 stack is intentionally separate from the legacy `prod.clean_arch.dita`
surface. It can be exercised in isolation, with safe defaults for tests and
explicit opt-in for real shared-memory and live venue wiring.
Recent hardening additions:
- direct slot writes now mirror into the Zinc state region immediately
- the regression surface includes a 50-case hardening suite for diagnostics,
duplicate replay, stale-state handling, and Zinc mirroring
---
## 1. What DITAv2 Is
DITAv2 is a multi-slot execution kernel for trade lifecycle management.
It sits between the alpha layer and the exchange layer.
Its responsibilities are limited to:
1. receiving intents
2. mutating slot state
3. normalizing venue events
4. projecting account state
5. emitting deterministic transition and diagnostic records
6. mirroring confirmed state to durable surfaces
It is not responsible for alpha generation: it does not compute signals and it
does not decide the entry/exit thesis. Those inputs come from BLUE/PINK or
another upstream strategy layer.
### Design intent
DITAv2 is built to make execution state:
- explicit
- replayable
- debuggable
- observable
- testable at the FSM edge
The goal is to eliminate shadow-state drift between local memory, exchange
truth, and durable observability surfaces.
---
## 2. Canonical Components
### Kernel
Files:
- `prod/clean_arch/dita_v2/rust_backend.py`
- `prod/clean_arch/dita_v2/_rust_kernel/`
The Python-facing `ExecutionKernel` is backed by a Rust implementation loaded
through `ctypes`. The Python wrapper keeps the public API stable and writes
through to the Rust backend on slot mutations and event processing.
### Control plane
Files:
- `prod/clean_arch/dita_v2/control.py`
- `prod/clean_arch/dita_v2/real_control_plane.py`
The control plane holds runtime mode, verbosity, backend selection, slot
limits, and debug flags. It supports:
- `NORMAL` / `DEBUG`
- `QUIET` / `VERBOSE` / `TRACE`
- `MOCK` / `BINGX`
- mirror-to-Hazelcast toggles
- restart reconciliation toggles
### Zinc plane
Files:
- `prod/clean_arch/dita_v2/zinc_plane.py`
- `prod/clean_arch/dita_v2/real_zinc_plane.py`
The Zinc plane is the hot-path shared-memory substrate for:
- intents
- slot snapshots
- control snapshots
It follows Zinc's one-shot signal pattern wherever possible:
- writers publish the latest data and then notify
- readers wait for a sequence change from the last value they observed
- state-based sync is preferred over event-count sync
- the in-memory stand-ins emulate the same notify/wait contract for tests
The in-memory plane is used by default for tests. The real Zinc plane is
opt-in and uses the `zinc` Python adapter over shared memory.
Direct slot mutation is intentionally write-through: the Rust-backed kernel
and the Zinc mirror must stay aligned on every `_set_slot()`, venue event, and
reconcile path. The tests assert that a direct slot write is visible in the
state region without waiting for a separate flush cycle. The same update path
also notifies waiters so cross-process readers can wake on the latest state
change instead of polling.
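The notify/wait contract described above can be illustrated with a small in-process stand-in. This is a sketch only: `SequencedRegion`, `publish`, and `wait_for_change` are hypothetical names, and `threading.Condition` stands in for the shared-memory signal. It is not the `zinc` adapter API.

```python
import threading


class SequencedRegion:
    """Hypothetical stand-in: a latest-value cell guarded by a sequence counter."""

    def __init__(self):
        self._cond = threading.Condition()
        self._seq = 0
        self._value = None

    def publish(self, value):
        # Writer: store the latest data first, then bump the sequence and notify.
        with self._cond:
            self._value = value
            self._seq += 1
            self._cond.notify_all()
            return self._seq

    def wait_for_change(self, last_seen, timeout=1.0):
        # Reader: block until the sequence differs from the last one observed.
        with self._cond:
            changed = self._cond.wait_for(lambda: self._seq != last_seen, timeout)
            if not changed:
                return last_seen, None  # timed out, nothing new
            return self._seq, self._value


region = SequencedRegion()
results = []


def reader():
    # The reader starts from sequence 0 and wakes on any newer state.
    results.append(region.wait_for_change(last_seen=0))


t = threading.Thread(target=reader)
t.start()
region.publish({"slot_id": 0, "stage": "ENTRY_WORKING"})
region.publish({"slot_id": 0, "stage": "POSITION_OPEN"})
t.join()
```

Because the reader compares sequences rather than counting notifications, it may skip the intermediate `ENTRY_WORKING` image and observe only the latest state. That is the state-based behavior the list above prefers over event-count sync.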
### Projection
Files:
- `prod/clean_arch/dita_v2/projection.py`
- `prod/clean_arch/dita_v2/hazelcast_projection.py`
The projection layer writes BLUE/PINK-compatible state rows to Hazelcast
and emits lifecycle rows suitable for ClickHouse observability.
### Venue adapters
Files:
- `prod/clean_arch/dita_v2/mock_venue.py`
- `prod/clean_arch/dita_v2/bingx_venue.py`
The mock adapter is deterministic and BingX-shaped. The BingX adapter is a
thin normalization layer over the direct BingX execution client surface.
### Launcher and operator controls
Files:
- `prod/clean_arch/dita_v2/launcher.py`
- `prod/launch_dita_v2.py`
- `prod/ops/dita_v2_ctl.py`
- `prod/supervisor/supervisorctl.sh`
- `prod/ops/dita_v2_live_bingx_smoke.py`
The launcher assembles a full runtime bundle. The operator scripts provide
status, healthcheck, start, stop, and restart paths. The smoke wrapper
provides a repeatable BingX testnet command that runs the full live E2E suite
with the correct live-smoke environment gates and supervisor precheck.
Repeatable live smoke command:
```bash
python /mnt/dolphinng5_predict/prod/ops/dita_v2_live_bingx_smoke.py --symbol TRXUSDT
```
Use `--dry-run` to print the exact env and pytest command without sending
orders.
---
## 3. Runtime Topology
### Default test topology
```text
ExecutionKernel
├─ InMemoryControlPlane
├─ InMemoryZincPlane
├─ MockVenueAdapter
└─ HazelcastProjection(writer=callback)
```
### Real operator topology
```text
ExecutionKernel
├─ RealZincControlPlane or mirrored in-memory control plane
├─ RealZincPlane
├─ BingxVenueAdapter
└─ HazelcastProjection(client-backed writer)
```
### Supervisord-managed service
Program:
```text
dolphin:dita_v2
```
Launcher:
```text
/mnt/dolphinng5_predict/prod/launch_dita_v2.py
```
Default supervised posture:
- `DITA_V2_LAUNCHER_MODE=serve`
- `DITA_V2_VENUE=BINGX`
- `DITA_V2_ZINC=REAL`
- `DITA_V2_CONTROL_PLANE=REAL_ZINC`
- `DITA_V2_HAZELCAST=REAL`
- `DITA_V2_MODE=DEBUG`
- `DITA_V2_VERBOSITY=TRACE`
The supervised path is intentionally separate from the legacy PINK and BLUE
entrypoints.
---
## 4. Data Contracts
### Core contract files
- `prod/clean_arch/dita_v2/contracts.py`
- `prod/clean_arch/dita_v2/venue.py`
### Important types
- `TradeStage`
- `TradeSlot`
- `VenueOrder`
- `VenueEvent`
- `KernelIntent`
- `KernelTransition`
- `KernelOutcome`
- `KernelDiagnosticCode`
- `KernelCommandType`
- `KernelEventKind`
- `KernelMode`
- `KernelVerbosity`
- `BackendMode`
### Slot model
Each slot is the unit of execution. It carries:
- trade identity
- asset
- side
- entry price
- current size
- leverage
- open/close state
- active entry/exit order handles
- leg progression
- idempotency tracking via seen event IDs
The slot is the primary kernel state object. The kernel maintains multiple
slots; one slot can be actively traded while the others remain idle or
recoverable.
### Order model
`VenueOrder` captures the venue-specific identity of an order:
- internal trade ID
- venue order ID
- venue client ID
- side
- intended size
- filled size
- average fill price
- status
- metadata
### Event model
`VenueEvent` captures the normalized venue response surface:
- ack
- partial fill
- full fill
- cancel ack
- cancel reject
- reject
The kernel consumes normalized events, not raw exchange payloads.
---
## 5. State Machine
### Core states
- `IDLE`
- `ENTRY_WORKING`
- `POSITION_OPEN`
- `EXIT_WORKING`
- `CLOSED`
- `STALE_STATE_RECONCILING`
### Basic transitions
```text
IDLE
└─ ENTER intent ─> ENTRY_WORKING
ENTRY_WORKING
├─ PARTIAL_FILL ─> ENTRY_WORKING
├─ FULL_FILL ─> POSITION_OPEN
└─ ORDER_REJECT ─> IDLE
POSITION_OPEN
├─ EXIT intent ─> EXIT_WORKING
└─ MARK_PRICE ─> POSITION_OPEN
EXIT_WORKING
├─ PARTIAL_FILL ─> EXIT_WORKING
├─ FULL_FILL ─> IDLE or POSITION_OPEN (multi-leg)
├─ CANCEL_ACK ─> POSITION_OPEN
└─ CANCEL_REJECT ─> EXIT_WORKING
```
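The diagram above maps naturally onto a lookup table. The following is a minimal sketch, not the Rust kernel's implementation: `Stage`, `TRANSITIONS`, `next_stage`, and the `legs_remaining` parameter are hypothetical, and it assumes that the multi-leg branch returns to `POSITION_OPEN` while further exit legs remain.

```python
from enum import Enum


class Stage(Enum):
    IDLE = "IDLE"
    ENTRY_WORKING = "ENTRY_WORKING"
    POSITION_OPEN = "POSITION_OPEN"
    EXIT_WORKING = "EXIT_WORKING"


# (current stage, trigger) -> next stage, copied from the transition diagram.
TRANSITIONS = {
    (Stage.IDLE, "ENTER"): Stage.ENTRY_WORKING,
    (Stage.ENTRY_WORKING, "PARTIAL_FILL"): Stage.ENTRY_WORKING,
    (Stage.ENTRY_WORKING, "FULL_FILL"): Stage.POSITION_OPEN,
    (Stage.ENTRY_WORKING, "ORDER_REJECT"): Stage.IDLE,
    (Stage.POSITION_OPEN, "EXIT"): Stage.EXIT_WORKING,
    (Stage.POSITION_OPEN, "MARK_PRICE"): Stage.POSITION_OPEN,
    (Stage.EXIT_WORKING, "PARTIAL_FILL"): Stage.EXIT_WORKING,
    (Stage.EXIT_WORKING, "CANCEL_ACK"): Stage.POSITION_OPEN,
    (Stage.EXIT_WORKING, "CANCEL_REJECT"): Stage.EXIT_WORKING,
}


def next_stage(stage, trigger, legs_remaining=0):
    # Assumption: a full exit fill returns to POSITION_OPEN only if legs remain.
    if (stage, trigger) == (Stage.EXIT_WORKING, "FULL_FILL"):
        return Stage.POSITION_OPEN if legs_remaining > 0 else Stage.IDLE
    try:
        return TRANSITIONS[(stage, trigger)]
    except KeyError:
        # Anything not in the table is an invalid transition.
        raise ValueError(f"invalid transition: {stage.name} on {trigger}") from None


stage = next_stage(Stage.IDLE, "ENTER")
stage = next_stage(stage, "FULL_FILL")
```

Keeping the table as data makes every legal edge enumerable, which is convenient for tests that target the FSM edge.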
### Idempotency
Duplicate venue events are tracked via event IDs in the slot image. Repeated
events are treated as no-ops, not as extra fills or duplicate state changes.
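As an illustration of the no-op rule, the sketch below applies a fill only the first time its event ID is seen. `SlotImage` and `apply_fill` are hypothetical names used for this example; the real slot image lives in the Rust-backed kernel.

```python
from dataclasses import dataclass, field


@dataclass
class SlotImage:
    # Hypothetical, reduced slot image: only what the dedupe check needs.
    filled_size: float = 0.0
    seen_event_ids: set = field(default_factory=set)


def apply_fill(slot, event_id, fill_size):
    """Apply a fill once; return False when the event is a replayed duplicate."""
    if event_id in slot.seen_event_ids:
        return False  # duplicate: no extra fill, no state change
    slot.seen_event_ids.add(event_id)
    slot.filled_size += fill_size
    return True


slot = SlotImage()
first = apply_fill(slot, "evt-1", 0.5)
replayed = apply_fill(slot, "evt-1", 0.5)  # same event delivered again
second = apply_fill(slot, "evt-2", 1.0)
```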
### Recovery state
`STALE_STATE_RECONCILING` blocks normal event progression until reconciliation
completes. This state exists to make restart, replay, and venue divergence
explicit.
### Rate limit handling
BingX rate limiting is treated as a first-class retryable condition, not a
generic failure. The kernel surfaces it with:
- `KernelDiagnosticCode.RATE_LIMITED`
- `KernelSeverity.WARNING`
- `details["release_eta"] = "few minutes"` when the exchange provides no
precise retry window
- `details["retry_after_ms"]` when the adapter or venue response includes a
retry hint
- `details["retryable"] = true`
This is intentionally downstream-friendly: operators and orchestration layers
can distinguish transient throttling from hard rejections and choose a retry
policy explicitly.
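A downstream consumer might turn these fields into a retry delay along the following lines. This is a sketch under assumptions: `retry_delay_seconds` and its `default_delay` value are hypothetical, and only the `details` keys listed above are taken from this document.

```python
def retry_delay_seconds(details, default_delay=120.0):
    """Return seconds to wait before retrying, or None for a hard rejection."""
    if not details.get("retryable", False):
        return None  # not a transient throttle; do not retry blindly
    retry_after_ms = details.get("retry_after_ms")
    if retry_after_ms is not None:
        # The adapter or venue supplied an explicit hint; honor it.
        return retry_after_ms / 1000.0
    # Only a coarse release_eta is available; use the caller's default.
    return default_delay


hinted = retry_delay_seconds({"retryable": True, "retry_after_ms": 2500})
coarse = retry_delay_seconds({"retryable": True, "release_eta": "few minutes"})
hard = retry_delay_seconds({"retryable": False})
```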
---
## 6. Control Plane Semantics
The control plane is used to steer runtime behavior without changing kernel
logic.
### Modes
- `NORMAL` for production-like execution
- `DEBUG` for full state and transition tracing
### Verbosity
- `QUIET`
- `VERBOSE`
- `TRACE`
### Backend mode
- `MOCK`
- `BINGX`
### Key toggles
- `debug_clickhouse_enabled`
- `trace_transitions`
- `mirror_to_hazelcast`
- `active_slot_limit`
- `reconcile_on_restart`
### Shared-memory selection
The launcher uses env-driven selection:
- `DITA_V2_CONTROL_PLANE=REAL_ZINC`
- `DITA_V2_ZINC=REAL`
- `DITA_V2_HAZELCAST=REAL`
- `DITA_V2_VENUE=BINGX`
Defaults remain safe and testable. Real shared-memory and live venue wiring are
opt-in.
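One way to read this selection with safe fallbacks is sketched below. The variable names come from the list above; the fallback values (`IN_MEMORY`, `MOCK`) and the `resolve_selection` helper are placeholders chosen for illustration, not the launcher's actual defaults.

```python
import os

# Assumed fallbacks for illustration only; the launcher's real defaults may differ.
ENV_DEFAULTS = {
    "DITA_V2_CONTROL_PLANE": "IN_MEMORY",
    "DITA_V2_ZINC": "IN_MEMORY",
    "DITA_V2_HAZELCAST": "IN_MEMORY",
    "DITA_V2_VENUE": "MOCK",
}


def resolve_selection(environ=None):
    """Resolve each selector from the environment, falling back to the safe value."""
    env = os.environ if environ is None else environ
    return {
        key: env.get(key, default).strip().upper()
        for key, default in ENV_DEFAULTS.items()
    }


# Nothing set: every selector falls back to its safe value.
safe = resolve_selection({})
# Explicit opt-in for the live venue only; other selectors stay safe.
live_venue = resolve_selection({"DITA_V2_VENUE": "bingx"})
```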
---
## 7. Zinc Boundary
### Why Zinc is used
Zinc provides the shared-memory substrate for:
- low-latency control-plane reads
- intent publication
- slot state snapshots
- zero-copy observation across processes
### Hot-path intent region
Written by the alpha/launcher side, read by the kernel.
### Hot-path state region
Written by the kernel, read by the alpha side or operator tooling.
### Control region
Used for runtime mode switches and operator commands.
### Invariants
1. Shared-memory state must not silently diverge from kernel state.
2. Writes should be explicit and versioned.
3. The kernel must not rely on duplicated Python shadow state as authority.
---
## 8. Hazelcast / ClickHouse Boundary
### Hazelcast
Hazelcast is the durable projection mirror for:
- confirmed slot state
- control snapshot mirroring
- active slot registry
- trade event topic emission
### ClickHouse
ClickHouse is the observability and debug journal sink. In debug mode, the
kernel should emit enough rows to reconstruct a transition timeline.
### Compatibility rule
All emitted rows must remain compatible with the BLUE/PINK schema family.
The DITAv2 layer does not invent a new observability universe unless the
schema is explicitly versioned.
---
## 9. Venue Adapters
### Mock venue
File:
- `prod/clean_arch/dita_v2/mock_venue.py`
Behavior:
- deterministic
- BingX-shaped semantics
- configurable reject / partial fill / cancel reject scenarios
- useful for FSM and race testing
### BingX venue
File:
- `prod/clean_arch/dita_v2/bingx_venue.py`
Behavior:
- thin normalization layer
- converts BingX order/account payloads into DITAv2 events/orders
- no reimplementation of exchange logic
- live adapter backed by the direct BingX client path
### Adapter rule
If a mock cannot faithfully mirror BingX behavior in an in-scope path, the
adapter layer must map actual BingX responses into DITAv2 contracts instead of
inventing a separate semantic model.
---
## 10. Launcher and Operator Flow
### Launcher responsibilities
- assemble control plane
- assemble Zinc plane
- assemble projection sink
- select venue adapter
- create the kernel
### Operator controls
Supported command surfaces:
- `prod/ops/dita_v2_ctl.py`
- `prod/supervisor/supervisorctl.sh dita_v2 ...`
- direct `supervisorctl` against `dolphin:dita_v2`
### Script modes
`prod/launch_dita_v2.py` supports:
- `once`
- `serve`
`serve` is the supervised long-running mode. `once` is for snapshot/debug use.
---
## 11. Observability and Debugging
### Debug mode
When debug mode is enabled, the kernel should log:
- state image changes
- transition triggers
- venue requests and responses
- local lock / unlock points
- reconciliation events
- diagnostics and anomaly codes
### Error surface
The kernel must emit deterministic diagnostic codes for:
- invalid slot ID
- busy slot
- no active exit order
- invalid transition
- stale-state reconcile
- duplicate event / replay no-op
- venue rejection
The point is to make failures explainable and machine-queryable.
---
## 12. Testing Strategy
The DITAv2 suite is intentionally wide. It includes:
- kernel-only FSM tests
- extensive state-machine tests
- race / off-by-one / memory anomaly tests
- Zinc interaction tests
- Hazelcast projection tests
- BingX adapter tests
- full-stack E2E / functional tests through the kernel
- BLUE/PINK-style signal gamut coverage, including entry, exit, partial exit, TP, hung orders, cancel-reject, and non-close cases
- launcher and operator path tests
- supervisor config / documentation tests
- a dedicated kernel hardening suite with 50 collected cases
- mocked exchange-first and BingX-basic E2E paths
- chaos / fuzz coverage over both mock and BingX paths
### Testing order
1. kernel-only unit tests
2. Zinc interaction tests
3. projection tests
4. BingX adapter tests
5. launcher and operator wiring tests
6. full suite rerun
7. full-stack E2E / functional coverage through the kernel
8. chaos / fuzz coverage across mock and BingX
### Current validated result
The DITAv2 suite is currently green with a broad test surface covering the
kernel, launcher, operator wrappers, Zinc, venue adapters, and the full-stack
E2E/chaos matrix through the kernel.
---
## 13. Files of Interest
### Core runtime
- `prod/clean_arch/dita_v2/rust_backend.py`
- `prod/clean_arch/dita_v2/launcher.py`
- `prod/clean_arch/dita_v2/control.py`
- `prod/clean_arch/dita_v2/projection.py`
- `prod/clean_arch/dita_v2/mock_venue.py`
- `prod/clean_arch/dita_v2/bingx_venue.py`
- `prod/clean_arch/dita_v2/real_control_plane.py`
- `prod/clean_arch/dita_v2/real_zinc_plane.py`
- `prod/launch_dita_v2.py`
- `prod/ops/dita_v2_ctl.py`
- `prod/supervisor/supervisorctl.sh`
- `prod/supervisor/dolphin-supervisord.conf`
### Tests
- `prod/tests/test_dita_v2_kernel.py`
- `prod/tests/test_dita_v2_zinc.py`
- `prod/tests/test_dita_v2_hazelcast.py`
- `prod/tests/test_dita_v2_bingx_adapter.py`
- `prod/tests/test_dita_v2_launcher.py`
- `prod/tests/test_launch_dita_v2.py`
- `prod/tests/test_dita_v2_ops.py`
### Operator docs
- `prod/docs/DITA_V2_OPERATOR_PLAYBOOK.md`
- `prod/docs/OPERATIONAL_STATUS.md`
---
## 14. Canonical References
This DITAv2 reference is the canonical entry for the new execution kernel.
Supporting references:
- `prod/docs/DITA_V2_OPERATOR_PLAYBOOK.md`
- `prod/docs/OPERATIONAL_STATUS.md`
- `prod/AGENT_READ_Supervisor_migration.md`
---
## 15. PINK Integration (2026-05-27)
PINK now executes trades through the DITAv2 kernel exclusively.
### How it works
The PINK launcher (`launch_dolphin_pink.py`) calls `build_launcher_bundle()` to
construct a DITAv2 bundle (kernel + `BingxVenueAdapter` + control plane + Zinc
plane + Hazelcast projection). The `PinkDirectRuntime` bridges policy
(DecisionEngine/IntentEngine) to execution through a `_decision_to_kernel_intent()`
translation seam that maps `Decision`/`Intent` → `KernelIntent`.
### Capital simplification
The kernel's `AccountProjection` is the **single local capital authority**:
1. Exchange balance seeds `kernel.account.snapshot.capital` once at startup/recovery.
2. `kernel.account.settle(slot.realized_pnl)` is called in `on_venue_event()` when
a fill transitions a slot to CLOSED — the **only** capital mutation post-startup.
3. `observe_slots()` handles mark-to-market (unrealized PnL) — no capital writes.
4. `PinkClickHousePersistence` reads capital/peak/trade_seq from the kernel snapshot.
No balance-poll overwrites during the hot loop.
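The single-authority rule can be sketched as a small account object in which only `settle()` touches capital after the initial seed. `AccountSketch` and its fields are hypothetical; in particular, incrementing `trade_seq` inside `settle()` is an assumption made for the example and not a statement about `AccountProjection`.

```python
from dataclasses import dataclass


@dataclass
class AccountSketch:
    capital: float = 0.0
    peak: float = 0.0
    trade_seq: int = 0
    unrealized_pnl: float = 0.0

    def seed(self, exchange_balance):
        # Startup/recovery only: exchange balance seeds local capital once.
        self.capital = exchange_balance
        self.peak = exchange_balance

    def settle(self, realized_pnl):
        # The only capital mutation after startup: a slot closed with realized PnL.
        self.capital += realized_pnl
        self.peak = max(self.peak, self.capital)
        self.trade_seq += 1  # assumption: one sequence step per settled trade

    def observe(self, unrealized_pnl):
        # Mark-to-market is tracked separately and never writes capital.
        self.unrealized_pnl = unrealized_pnl


account = AccountSketch()
account.seed(1000.0)
account.observe(-5.0)   # open position marked down; capital unchanged
account.settle(12.5)    # slot closed with a gain
account.settle(-20.0)   # next slot closed with a loss
```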
### Files added/changed
- `prod/launch_dolphin_pink.py`: uses `build_launcher_bundle()`
- `prod/clean_arch/runtime/pink_direct.py`: `ExecutionKernel`-backed runtime
- `prod/clean_arch/persistence/pink_clickhouse.py`: reads from kernel account
- `prod/ops/pink_ctl.py`: added `ditav2-status` subcommand
- `prod/tests/test_pink_ditav2_kernel_bridge.py`: mapping tests (7)
- `prod/tests/test_pink_ditav2_rate_limit_contract.py` (1)
- `prod/tests/test_pink_ditav2_restart_reconcile.py` (3)
- `prod/tests/test_pink_ditav2_accounting_invariants.py` (2)
### Live smoke
```bash
python /mnt/dolphinng5_predict/prod/ops/dita_v2_live_bingx_smoke.py --pink --symbol TRXUSDT
```
### PENDING — Live exchange chaos/fuzz
**Status**: Not implemented. Requires a dedicated orchestration layer.
The mock-venue and BingX-basic chaos/fuzz matrix in
`test_dita_v2_e2e_functional.py` provides deterministic fuzzing over mock and
BingX adapter paths (24 cases, all green). True live-testnet chaos/fuzz
against a real order book — non-deterministic event ordering, partial fills at
unpredictable prices, race conditions between submissions and exchange
responses — requires:
- A **live-chaos orchestrator** that submits adversarial intents (rapid
entries/exits, competing cancels, size-at-lot-boundary, cross-book) against
a live BingX testnet symbol.
- An **event-sequencer** that captures raw exchange callback order and
replays it against the kernel to verify deterministic convergence.
- A **state-invariant checker** that asserts slot/account state converges to
the same terminal state regardless of callback ordering.
This is deferred. The current live smoke tests (`test_pink_bingx_dita_live_e2e.py`,
`test_dita_v2_live_bingx_testnet_e2e.py`) cover happy-path E2E cycles only.
### BLUE Non-Impact Proof Checklist
| # | Assertion | Method | Status |
|---|---|---|---|
| 1 | Zero PINK rows in `dolphin` (BLUE) ClickHouse tables | `pink_ctl.py mode-verify` (CH query by `strategy='pink'`) | VERIFIED |
| 2 | Zero PINK rows in `dolphin_prodgreen` ClickHouse tables | `pink_ctl.py mode-verify` (CH query by `strategy='pink'` on prodgreen DB) | VERIFIED |
| 3 | No PINK keys written to BLUE Hazelcast maps (`DOLPHIN_STATE_BLUE`, `DOLPHIN_PNL_BLUE`) | Hazelcast key scan | VERIFIED |
| 4 | No PINK keys written to PRODGREEN Hazelcast maps | Hazelcast key scan | VERIFIED |
| 5 | PINK `trade_events` baseline unchanged (106 rows) | CH count query | VERIFIED |
| 6 | Stopping/restarting PINK does not affect BLUE supervisor programs | `supervisorctl status` before/after | VERIFIED |
| 7 | No BLUE files modified in refactor | `git diff --name-only` (only PINK/DITAv2 paths) | VERIFIED |
| 8 | BLUE runtime env vars unchanged (`DOLPHIN_STATE_BLUE`, `dolphin` DB) | env comparison | VERIFIED |
**Cutover gate**: all 8 assertions must pass before PINK goes live.
**Rollback trigger**: any violation of assertions 1-4 triggers immediate rollback per §6.2 of the refactor guide.
### 15.1 Sync↔Async Seam Analysis (2026-05-27)
**7 distinct boundaries identified and tested**:
| # | Seam | Bridging Mechanism | Test Coverage |
|---|---|---|---|
| 1 | `BingxVenueAdapter._run()` → async backend | 3 modes: passthrough, `asyncio.run()` (no-loop), `ThreadPoolExecutor` (in-loop) | `test_pink_sync_async_seams.py` (36 tests) |
| 2 | `BingxVenueAdapter.connect()` → `BingxDirectExecutionAdapter.connect()` | `_run()` bridges sync→async | 3 tests |
| 3 | `kernel.process_intent()` (sync) → `venue.submit()` (sync) → `_run()` → async HTTP | Thread pool per-call | 4 race-condition tests |
| 4 | `PinkDirectRuntime.step()` (async) → `kernel.process_intent()` (sync) | Direct sync call inside coroutine | 1 nested loop test |
| 5 | `launcher._maybe_close()` (sync) → async close/disconnect | `asyncio.run()` with RuntimeError catch | 4 tests |
| 6 | `_backend_snapshot()` thread safety | No lock — `_last_snapshot` is a plain attribute | 2 concurrent access tests |
| 7 | HTTP client timeout propagation | `httpx.AsyncClient` timeout config | 2 timeout tests |
**Key findings**:
- In its in-loop mode, `_run()` creates a new `ThreadPoolExecutor` per call. At high call frequency this could leak threads; the chaos harness's 10-thread concurrent test observed no leaks under load.
- `_maybe_close()` swallows `RuntimeError` from `asyncio.run()` inside a running loop. This is correct behavior — the close call is best-effort.
- `pink_direct.py` `connect()` now handles both sync and async venue connect methods via `inspect.isawaitable()`.
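The three bridging modes named in seam 1 can be sketched as follows. This is an illustration of the general technique, not the body of `BingxVenueAdapter._run()`: `run_bridged` and `fetch` are hypothetical names, and the sketch assumes the awaitable is a coroutine that is not already bound to another event loop.

```python
import asyncio
import inspect
from concurrent.futures import ThreadPoolExecutor


def run_bridged(value):
    # Mode 1 (passthrough): plain results from a sync backend are returned as-is.
    if not inspect.isawaitable(value):
        return value

    async def _await(awaitable):
        return await awaitable

    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # Mode 2 (no loop): this thread has no running loop, so start one.
        return asyncio.run(_await(value))

    # Mode 3 (in loop): asyncio.run() cannot nest, so use a worker thread
    # with its own loop. The context manager joins the worker before returning.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, _await(value)).result()


async def fetch():
    await asyncio.sleep(0)
    return "ok"


async def outer():
    # Sync call made from inside a coroutine, as in seam 4.
    return run_bridged(fetch())


from_sync = run_bridged(fetch())
from_loop = asyncio.run(outer())
```

Note that mode 3 blocks the calling loop's thread until the worker finishes, so it trades loop responsiveness for a simple synchronous call site.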
**Chaos harness**: `test_pink_ditav2_chaos_harness.py` (22 tests) covers:
- Rapid entry→exit, two-leg partial, competing cancel, cancel-after-fill, mark-price, reconcile, size-at-boundary, 10x entry-exit loop
- Edge cases: zero-size entry, negative price entry
- Deterministic replay (ordered and shuffled) — verifies kernel doesn't crash under any event ordering
- State invariants: no stuck slots, no negative capital, no illegal FSM transitions, no critical diagnostics
### 15.2 TODO — Live testnet chaos E2E
**Status**: Not implemented. Requires dedicated work.
The chaos harness (`test_pink_ditav2_chaos_harness.py`) runs all adversarial
scenarios (rapid entry-exit, competing cancel, size-at-boundary, 10x loops)
against the `MockVenueAdapter` only. To reach prod confidence, these same
scenarios must be run against a live BingX VST symbol with:
1. **Exchange-side verification** — orders/positions/account queried directly
from the exchange after each chaos step, not just from kernel state.
2. **Quantity-compliance monitoring** — BingX may truncate or round lot sizes
differently than the adapter expects; the test must assert the exchange
accepted the intended size.
3. **Fill-price tracking** — partial fills at unpredictable prices under
rapid entry-exit must be captured and reconciled against the kernel's
accounting.
4. **Rate-limit cascade testing** — the parallel HTTP gather in
`_refresh_exchange_state` must be verified under sustained rate-limit
pressure.
**Design sketch**:
- Extend `ChaosOrchestrator.run_chaos_scenario()` to accept a
`BingxVenueAdapter` (live) in addition to `MockVenueAdapter`.
- Add a `LiveStateVerifier` that hits the BingX REST API after each step
and asserts kernel state ≈ exchange state within rounding tolerance.
- Gate the live chaos tests with the same `BINGX_SMOKE_LIVE=1` env convention.
- Run the chaos scenarios that are safe for testnet (no cross-book, no
size-at-boundary that would cause a reject chain).
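The "kernel state ≈ exchange state" assertion could be expressed as a tolerance check like the one below. This is a sketch: `sizes_match` and the choice of half a lot step as the tolerance are assumptions for illustration, and the real `LiveStateVerifier` has not been designed.

```python
import math


def sizes_match(kernel_size, exchange_size, lot_step):
    """True when the two sizes differ by less than half a lot step."""
    # rel_tol=0.0 makes the comparison purely absolute, tied to the lot step.
    return math.isclose(kernel_size, exchange_size, rel_tol=0.0, abs_tol=lot_step / 2)


within = sizes_match(10.0, 10.4, lot_step=1.0)    # rounding-level difference
diverged = sizes_match(10.0, 11.0, lot_step=1.0)  # a full lot apart
```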
This is deferred because the current live E2E tests cover happy-path cycles
only, and the mock-venue chaos harness validates kernel invariants. Bridging
the two for live chaos is a separate engineering effort.