siloqy/prod/docs/MALFORMED_OPEN_RESTORE_BUG.md
Codex c3a18f693a docs: VIBRISS spec (+ §10.6 cascade/adaptive-TP paramsets), PINK accounting fix spec, BLUE incident docs
VIBRISS_PARAMETER_GOVERNANCE_SPEC §10.6: ob_cascade.count_threshold
(currently cascade_count>0 = ONE asset widens every TP x1.40),
tp_widen_factor, withdrawal_velocity_threshold as governance candidates;
adaptive/Dynamic-TP threshold marked fit for VIBRISS governance; TP_FLOOR
joint-policy reward requirement.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 15:04:15 +02:00

MALFORMED_OPEN_RESTORE_BUG

Summary

BLUE was repeatedly rehydrating after startup because dolphin.position_state contained stale OPEN rows with zero effective size.

The restore path treated those rows as fatal:

  • it selected the latest OPEN row per trade_id
  • it accepted that row even when quantity or notional had been driven to 0
  • it hard-stopped on position_state row invalid quantity ...
  • supervisord then restarted the trader
  • the next startup read the same bad row again

That created a restart loop.

This was observed most clearly on the 2026-06-11 BLUE window. The recurring bad row was the legacy ATOMUSDT leg 1a3d2f9c, which was persisted as:

  • status = OPEN
  • quantity = 0
  • notional = 0
  • bars_held = 34

That row is not a live position. It is a stale snapshot that should have been treated as tombstoned history.

Root Cause

The bad rows were self-inflicted by the partial-retract path in nautilus_event_trader.py.

Before the fix:

  1. _apply_internal_retract() shrank the live position.
  2. It wrote a new position_state row with status="OPEN" for the remaining leg.
  3. If the remaining size rounded to zero, the row still existed as an OPEN snapshot.
  4. A later startup restore could pick that row and treat it as authoritative.

That is enough to leave behind OPEN rows with:

  • quantity = 0
  • notional = 0

These are not valid live positions, but the old restore logic treated them as if they were.
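
The pre-fix sequence can be illustrated with a minimal sketch. Everything in it is hypothetical: `PositionRow`, `retract_before_fix`, the 4-decimal rounding, and the sample sizes are stand-ins chosen for illustration, not the actual code or data from nautilus_event_trader.py.

```python
from dataclasses import dataclass


@dataclass
class PositionRow:
    # Hypothetical stand-in for one position_state snapshot.
    trade_id: str
    status: str
    quantity: float
    notional: float


def retract_before_fix(row: PositionRow, retract_qty: float, price: float) -> PositionRow:
    """Pre-fix behaviour (illustrative): always write an OPEN row for the remainder."""
    # Assumption: the remaining size is rounded to a fixed number of decimals.
    remaining_qty = round(max(row.quantity - retract_qty, 0.0), 4)
    remaining_notional = round(remaining_qty * price, 4)
    # There is no zero check, so a leg that rounds to nothing is still persisted as OPEN.
    return PositionRow(row.trade_id, "OPEN", remaining_qty, remaining_notional)


# A tiny leg whose remainder rounds to zero after a partial retract.
leg = PositionRow("demo-trade", "OPEN", 0.00004, 0.0002)
stale = retract_before_fix(leg, retract_qty=0.00001, price=5.0)
print(stale)
```

The resulting row has status OPEN with quantity and notional both at zero, which is the shape of the stale snapshot described above.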

There is a second contributing factor in the restore path:

  • the restore code historically trusted the latest OPEN candidate too early
  • zero-sized OPEN rows were only rejected after the row had already been chosen as the best candidate
  • rejection used a hard failure path, which made the process exit instead of trying the next sane source

That means the persistence bug and the restore policy bug reinforced each other.

Observable Symptoms

  • repeated restore candidate parse failed from capital_update_ledger: 'list' object has no attribute 'get'
  • repeated position_state row invalid quantity for trade ...: 0.0
  • RESTORE HALT
  • immediate restart by supervisord

The chain-token mismatch logs were a separate warning. They were not the restart trigger.

The capital-ledger parse warning is also distinct:

  • it indicates the ledger file is list-shaped, not a dict
  • it forces restore to rely more heavily on the other state surfaces
  • it is noisy, but it is not what actually killed the process in this incident
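
The wording of that warning matches what Python raises when `.get` is called on a list. The sketch below reproduces the message and shows one possible shape check. It assumes the ledger is loaded as JSON and uses a hypothetical `parse_ledger_candidate` helper and a hypothetical `capital` field; the source does not say how the real parser works or whether such a guard was added.

```python
import json


def parse_ledger_candidate(raw: str):
    """Hypothetical parser: return the 'capital' field, or None if the payload is not a dict."""
    payload = json.loads(raw)
    if not isinstance(payload, dict):
        # A list-shaped ledger has no .get(); skip it so restore can use other sources.
        return None
    return payload.get("capital")


# Calling .get on a list-shaped payload reproduces the error text seen in the logs.
try:
    json.loads("[]").get("capital")
except AttributeError as exc:
    message = str(exc)

print(message)
print(parse_ledger_candidate('[{"capital": 100.0}]'))
print(parse_ledger_candidate('{"capital": 100.0}'))
```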

Fix Applied

Three changes were made.

1. Stop writing zero-sized OPEN rows

In _apply_internal_retract():

  • compute remaining_qty
  • if the remaining size is effectively zero, treat the retract as a full close
  • return the forced exit without emitting a new position_state row with status="OPEN"

This prevents the bad row from being created in the first place.

2. Make restore skip legacy bad OPEN rows

In _restore_position_state():

  • the ClickHouse restore query now selects only OPEN rows with quantity > 0 AND notional > 0
  • if an invalid candidate still appears, restore logs and rejects it instead of hard-halting the process
  • restore falls back to HZ state or flat continuation rather than turning a stale row into a restart loop

This is important because the warehouse already contains stale history. The fix is not only to stop producing new malformed rows; it also has to prevent old rows from re-triggering the same failure path on the next reboot.
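
The restore policy described in this section can be sketched as follows. This is an illustration under stated assumptions, not the body of _restore_position_state(): `is_live_open` and `choose_restore_source` are hypothetical names, rows are modelled as plain dicts, and the candidate list is assumed to hold one latest OPEN row per trade_id, newest first.

```python
import logging
from typing import Optional

log = logging.getLogger("restore_sketch")


def is_live_open(row: dict) -> bool:
    # Mirrors the documented filter: OPEN with quantity > 0 AND notional > 0.
    return (
        row.get("status") == "OPEN"
        and float(row.get("quantity") or 0.0) > 0.0
        and float(row.get("notional") or 0.0) > 0.0
    )


def choose_restore_source(ch_rows: list, hz_state: Optional[dict]) -> tuple:
    """Return (source_name, state); a malformed OPEN row is logged, never fatal."""
    for row in ch_rows:  # assumption: one candidate per trade_id, newest first
        if is_live_open(row):
            return "clickhouse", row
        log.warning("rejecting invalid restore candidate: %r", row)
    if hz_state is not None:
        return "hz", hz_state  # fall back to HZ state
    return "flat", None  # flat continuation


stale = {"trade_id": "demo-trade", "status": "OPEN", "quantity": 0.0, "notional": 0.0}
print(choose_restore_source([stale], hz_state=None))
```

The design point is that rejection is a log line plus a fallback, so the same stale row produces the same harmless outcome on every startup instead of a process exit.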

3. Keep the full-close path coherent

The retract path now computes remaining_qty explicitly and treats remaining_notional <= 1e-9 or remaining_qty <= 0.0 as a full close.

That means:

  • a full retract does not leave a zero-size OPEN snapshot behind
  • the exit is finalized as a close, not as a pseudo-open partial state
  • the runtime slot is removed cleanly instead of being left in a half-closed limbo
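
A minimal sketch of the full-close decision, using the thresholds quoted above. The names `RetractResult` and `apply_retract_sketch` are hypothetical, sizes are assumed to be unsigned, and notional is assumed to be quantity times price; the real _apply_internal_retract() may differ on all of these.

```python
from dataclasses import dataclass
from typing import Optional

NOTIONAL_EPS = 1e-9  # threshold quoted in the text


@dataclass
class RetractResult:
    # Hypothetical result type for illustration.
    full_close: bool
    open_row: Optional[dict]  # None means no OPEN snapshot is written


def apply_retract_sketch(trade_id: str, quantity: float, price: float,
                         retract_qty: float) -> RetractResult:
    remaining_qty = quantity - retract_qty
    remaining_notional = remaining_qty * price  # assumption: notional = qty * price
    if remaining_notional <= NOTIONAL_EPS or remaining_qty <= 0.0:
        # Effectively zero: finalize as a close and emit no OPEN row.
        return RetractResult(full_close=True, open_row=None)
    row = {
        "trade_id": trade_id,
        "status": "OPEN",
        "quantity": remaining_qty,
        "notional": remaining_notional,
    }
    return RetractResult(full_close=False, open_row=row)


print(apply_retract_sketch("demo-trade", 2.0, 5.0, 2.0))  # full retract
print(apply_retract_sketch("demo-trade", 2.0, 5.0, 0.5))  # partial retract
```

Only the partial retract produces an OPEN snapshot; the full retract returns a close with nothing to persist as open.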

Verification Added

Regression tests were added for both sides:

  • full-close retracts no longer emit zero-sized OPEN rows
  • restore skips zero-sized OPEN candidates without setting restore_failed

The tests use the existing retract and restore harnesses:

  • one test seeds a tiny short leg that collapses to zero on retract and asserts no OPEN zero-size row is written
  • one test feeds a zero-sized OPEN position_state row into restore and asserts restore does not hard-halt

Operational Impact

After this fix:

  • stale zero-sized OPEN rows no longer restart BLUE
  • malformed open snapshots are quarantined as legacy garbage
  • the live runtime can continue from a sane source instead of bouncing on the same bad record

What This Does Not Fix

This change does not rewrite historical ClickHouse rows already present in the warehouse.

It only changes:

  • new retract writes
  • restore selection and rejection policy
  • restart behavior when the old garbage is encountered

If you want the historical ledger cleaned up, that is a separate reconciliation task. The current patch is intentionally conservative and only stops the bad row from causing further damage.