# MALFORMED_OPEN_RESTORE_BUG ## Summary BLUE was repeatedly rehydrating after startup because `dolphin.position_state` contained stale `OPEN` rows with zero effective size. The restore path treated those rows as fatal: - it selected the latest `OPEN` row per `trade_id` - it accepted that row even when `quantity` or `notional` had been driven to `0` - it hard-stopped on `position_state row invalid quantity ...` - `supervisord` then restarted the trader - the next startup read the same bad row again That created a restart loop. This was observed most clearly on the `2026-06-11` BLUE window. The recurring bad row was the legacy `ATOMUSDT` leg `1a3d2f9c`, which was persisted as: - `status = OPEN` - `quantity = 0` - `notional = 0` - `bars_held = 34` That row is not a live position. It is a stale snapshot that should have been treated as tombstoned history. ## Root Cause The bad rows were self-inflicted by the partial-retract path in `nautilus_event_trader.py`. Before the fix: 1. `_apply_internal_retract()` shrank the live position. 2. It wrote a new `position_state` row with `status="OPEN"` for the remaining leg. 3. If the remaining size rounded to zero, the row still existed as an `OPEN` snapshot. 4. A later startup restore could pick that row and treat it as authoritative. That is enough to leave behind `OPEN` rows with: - `quantity = 0` - `notional = 0` These are not valid live positions, but they looked like one to the old restore logic. There is a second contributing factor in the restore path: - the restore code historically trusted the latest `OPEN` candidate too early - zero-sized `OPEN` rows were only rejected after the row had already been chosen as the best candidate - rejection used a hard failure path, which made the process exit instead of trying the next sane source That means the persistence bug and the restore policy bug reinforced each other. ## Observable Symptoms - repeated `restore candidate parse failed from capital_update_ledger: 'list' object has no attribute 'get'` - repeated `position_state row invalid quantity for trade ...: 0.0` - `RESTORE HALT` - immediate restart by `supervisord` The chain-token mismatch logs were a separate warning. They were not the restart trigger. The capital-ledger parse warning is also distinct: - it indicates the ledger file is list-shaped, not a dict - it forces restore to rely more heavily on the other state surfaces - it is noisy, but it is not what actually killed the process in this incident ## Fix Applied Two changes were made. ### 1. Stop writing zero-sized `OPEN` rows In `_apply_internal_retract()`: - compute `remaining_qty` - if the remaining size is effectively zero, treat the retract as a full close - return the forced exit without emitting a new `position_state` row with `status="OPEN"` This prevents the bad row from being created in the first place. ### 2. Make restore skip legacy bad `OPEN` rows In `_restore_position_state()`: - the ClickHouse restore query now filters `OPEN` rows with `quantity > 0 AND notional > 0` - if an invalid candidate still appears, restore logs and rejects it instead of hard-halting the process - restore falls back to HZ state or flat continuation rather than turning a stale row into a restart loop This is important because the repository already contains stale history. The fix is not only to stop producing new malformed rows; it also has to prevent old rows from re-triggering the same failure path on the next reboot. ### 3. Keep the full-close path coherent The retract path now computes `remaining_qty` explicitly and treats `remaining_notional <= 1e-9` or `remaining_qty <= 0.0` as a full close. That means: - a full retract does not leave a zero-size `OPEN` snapshot behind - the exit is finalized as a close, not as a pseudo-open partial state - the runtime slot is removed cleanly instead of being left in a half-closed limbo ## Verification Added Regression tests were added for both sides: - full-close retracts no longer emit zero-sized `OPEN` rows - restore skips zero-sized `OPEN` candidates without setting `restore_failed` The tests use the existing retract and restore harnesses: - one test seeds a tiny short leg that collapses to zero on retract and asserts no `OPEN` zero-size row is written - one test feeds a zero-sized `OPEN` `position_state` row into restore and asserts restore does not hard-halt ## Operational Impact After this fix: - stale zero-sized `OPEN` rows no longer restart BLUE - malformed open snapshots are quarantined as legacy garbage - the live runtime can continue from a sane source instead of bouncing on the same bad record ## What This Does Not Fix This change does not rewrite historical ClickHouse rows already present in the warehouse. It only changes: - new retract writes - restore selection and rejection policy - restart behavior when the old garbage is encountered If you want the historical ledger cleaned up, that is a separate reconciliation task. The current patch is intentionally conservative and only stops the bad row from causing further damage.