docs: VIBRISS spec (+ §10.6 cascade/adaptive-TP paramsets), PINK accounting fix spec, BLUE incident docs

VIBRISS_PARAMETER_GOVERNANCE_SPEC §10.6: ob_cascade.count_threshold (currently cascade_count>0 = ONE asset widens every TP x1.40), tp_widen_factor, withdrawal_velocity_threshold as governance candidates; adaptive/Dynamic-TP threshold marked fit for VIBRISS governance; TP_FLOOR joint-policy reward requirement. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 15:04:15 +02:00
parent f4ff1cd9b7
commit c3a18f693a
4 changed files with 3653 additions and 0 deletions
--- a/prod/docs/MALFORMED_OPEN_RESTORE_BUG.md
+++ b/prod/docs/MALFORMED_OPEN_RESTORE_BUG.md
@@ -0,0 +1,131 @@
+# MALFORMED_OPEN_RESTORE_BUG
+
+## Summary
+
+BLUE was repeatedly rehydrating after startup because `dolphin.position_state` contained stale `OPEN` rows with zero effective size.
+
+The restore path treated those rows as fatal:
+
+- it selected the latest `OPEN` row per `trade_id`
+- it accepted that row even when `quantity` or `notional` had been driven to `0`
+- it hard-stopped on `position_state row invalid quantity ...`
+- `supervisord` then restarted the trader
+- the next startup read the same bad row again
+
+That created a restart loop.
+
+This was observed most clearly on the `2026-06-11` BLUE window. The recurring bad row was the legacy `ATOMUSDT` leg `1a3d2f9c`, which was persisted as:
+
+- `status = OPEN`
+- `quantity = 0`
+- `notional = 0`
+- `bars_held = 34`
+
+That row is not a live position. It is a stale snapshot that should have been treated as tombstoned history.
+
+## Root Cause
+
+The bad rows were self-inflicted by the partial-retract path in `nautilus_event_trader.py`.
+
+Before the fix:
+
+1. `_apply_internal_retract()` shrank the live position.
+2. It wrote a new `position_state` row with `status="OPEN"` for the remaining leg.
+3. If the remaining size rounded to zero, the row still existed as an `OPEN` snapshot.
+4. A later startup restore could pick that row and treat it as authoritative.
+
+That is enough to leave behind `OPEN` rows with:
+
+- `quantity = 0`
+- `notional = 0`
+
+These are not valid live positions, but the old restore logic treated them as if they were.
+
+There is a second contributing factor in the restore path:
+
+- the restore code historically trusted the latest `OPEN` candidate too early
+- zero-sized `OPEN` rows were only rejected after the row had already been chosen as the best candidate
+- rejection used a hard failure path, which made the process exit instead of trying the next sane source
+
+That means the persistence bug and the restore policy bug reinforced each other.
+
+## Observable Symptoms
+
+- repeated `restore candidate parse failed from capital_update_ledger: 'list' object has no attribute 'get'`
+- repeated `position_state row invalid quantity for trade ...: 0.0`
+- `RESTORE HALT`
+- immediate restart by `supervisord`
+
+The chain-token mismatch logs were a separate warning. They were not the restart trigger.
+
+The capital-ledger parse warning is also distinct:
+
+- it indicates the ledger file is list-shaped, not a dict
+- it forces restore to rely more heavily on the other state surfaces
+- it is noisy, but it is not what actually killed the process in this incident
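The warning text `'list' object has no attribute 'get'` is the message Python raises when `.get()` is called on a list, which fits the note that the ledger is list-shaped. As a minimal sketch only, a shape guard could turn that exception into an explicit rejection. The function name `read_ledger_candidate` and the warning string are hypothetical; the source does not say the ledger parser was changed in this patch.

```python
from typing import Any, Optional, Tuple


def read_ledger_candidate(payload: Any) -> Tuple[Optional[dict], Optional[str]]:
    """Return (candidate, warning). Hypothetical helper, not the real parser.

    A dict payload is accepted as-is. Any other shape (for example a list)
    is rejected with a warning string instead of raising AttributeError
    from a later payload.get(...) call.
    """
    if isinstance(payload, dict):
        return payload, None
    return None, f"unexpected ledger shape: {type(payload).__name__}"


# A list-shaped ledger is rejected cleanly, so restore can move on
# to the other state surfaces.
candidate, warning = read_ledger_candidate([{"capital": 1000.0}])
```

With a guard of this kind the caller can log the warning once and fall through to the next restore source.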
+
+## Fix Applied
+
+Three changes were made.
+
+### 1. Stop writing zero-sized `OPEN` rows
+
+In `_apply_internal_retract()`:
+
+- compute `remaining_qty`
+- if the remaining size is effectively zero, treat the retract as a full close
+- return the forced exit without emitting a new `position_state` row with `status="OPEN"`
+
+This prevents the bad row from being created in the first place.
+
+### 2. Make restore skip legacy bad `OPEN` rows
+
+In `_restore_position_state()`:
+
+- the ClickHouse restore query now filters `OPEN` rows with `quantity > 0 AND notional > 0`
+- if an invalid candidate still appears, restore logs and rejects it instead of hard-halting the process
+- restore falls back to HZ state or flat continuation rather than turning a stale row into a restart loop
+
+This is important because the warehouse already contains stale rows. The fix is not only to stop producing new malformed rows; it also has to prevent old rows from re-triggering the same failure path on the next reboot.
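The restore policy described above can be sketched as follows. This is an illustration under stated assumptions, not the code in `_restore_position_state()`: the helper names `is_live_open` and `pick_restore_source`, the dict-shaped rows, and the newest-first ordering are all hypothetical. Only the `quantity > 0 AND notional > 0` rule and the fallback order (HZ state, then flat) come from the description.

```python
from typing import Any, Iterable, List, Optional, Tuple


def is_live_open(row: dict) -> bool:
    """Mirror of the `quantity > 0 AND notional > 0` filter for OPEN rows."""
    return (
        row.get("status") == "OPEN"
        and float(row.get("quantity") or 0.0) > 0.0
        and float(row.get("notional") or 0.0) > 0.0
    )


def pick_restore_source(
    position_rows: Iterable[dict],
    hz_state: Optional[dict] = None,
) -> Tuple[str, Optional[Any], List[Any]]:
    """Return (source, state, rejected_trade_ids).

    Rows are assumed to arrive newest first. Invalid candidates are
    recorded and skipped instead of halting the process.
    """
    rejected: List[Any] = []
    for row in position_rows:
        if is_live_open(row):
            return "position_state", row, rejected
        rejected.append(row.get("trade_id"))
    if hz_state is not None:
        return "hz", hz_state, rejected  # fall back to HZ state
    return "flat", None, rejected  # flat continuation


# The stale ATOMUSDT row from the incident is rejected, not fatal.
stale = {"trade_id": "1a3d2f9c", "status": "OPEN", "quantity": 0, "notional": 0}
source, state, rejected = pick_restore_source([stale])
```

The point of the design is that a rejected row is recorded and skipped, so it can no longer end the process.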
+
+### 3. Keep the full-close path coherent
+
+The retract path now computes `remaining_qty` explicitly and treats `remaining_notional <= 1e-9` or `remaining_qty <= 0.0` as a full close.
+
+That means:
+
+- a full retract does not leave a zero-size `OPEN` snapshot behind
+- the exit is finalized as a close, not as a pseudo-open partial state
+- the runtime slot is removed cleanly instead of being left in a half-closed limbo
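The full-close rule can be sketched in isolation. This is not the body of `_apply_internal_retract()`: the `RetractPlan` type, the `plan_retract` helper, and the use of unsigned quantities are assumptions made for the example. Only the thresholds `remaining_notional <= 1e-9` and `remaining_qty <= 0.0` are taken from the text.

```python
from dataclasses import dataclass

NOTIONAL_EPS = 1e-9  # threshold quoted in the document


@dataclass(frozen=True)
class RetractPlan:
    full_close: bool
    remaining_qty: float
    remaining_notional: float
    write_open_row: bool  # whether a status="OPEN" snapshot may be written


def plan_retract(quantity: float, price: float, retract_qty: float) -> RetractPlan:
    """Decide whether a retract is partial or a full close.

    Quantities are treated as unsigned sizes (an assumption of this sketch).
    """
    remaining_qty = max(quantity - retract_qty, 0.0)
    remaining_notional = remaining_qty * price
    if remaining_notional <= NOTIONAL_EPS or remaining_qty <= 0.0:
        # Effectively zero: finalize as a close and never emit an OPEN row.
        return RetractPlan(True, 0.0, 0.0, False)
    return RetractPlan(False, remaining_qty, remaining_notional, True)


partial = plan_retract(quantity=10.0, price=2.0, retract_qty=4.0)
full = plan_retract(quantity=10.0, price=2.0, retract_qty=10.0)
```

Keeping the decision in one place means the writer of `position_state` rows never sees a zero-sized remainder.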
+
+## Verification Added
+
+Regression tests were added for both sides:
+
+- full-close retracts no longer emit zero-sized `OPEN` rows
+- restore skips zero-sized `OPEN` candidates without setting `restore_failed`
+
+The tests use the existing retract and restore harnesses:
+
+- one test seeds a tiny short leg that collapses to zero on retract and asserts no `OPEN` zero-size row is written
+- one test feeds a zero-sized `OPEN` `position_state` row into restore and asserts restore does not hard-halt
+
+## Operational Impact
+
+After this fix:
+
+- stale zero-sized `OPEN` rows no longer restart BLUE
+- malformed open snapshots are quarantined as legacy garbage
+- the live runtime can continue from a sane source instead of bouncing on the same bad record
+
+## What This Does Not Fix
+
+This change does not rewrite historical ClickHouse rows already present in the warehouse.
+
+It only changes:
+
+- new retract writes
+- restore selection and rejection policy
+- restart behavior when the old garbage is encountered
+
+If the historical `position_state` rows need to be cleaned up, that is a separate reconciliation task. The current patch is intentionally conservative and only stops the bad rows from causing further damage.