VIBRISS_PARAMETER_GOVERNANCE_SPEC §10.6: ob_cascade.count_threshold (currently cascade_count>0 = ONE asset widens every TP x1.40), tp_widen_factor, withdrawal_velocity_threshold as governance candidates; adaptive/Dynamic-TP threshold marked fit for VIBRISS governance; TP_FLOOR joint-policy reward requirement. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
5.0 KiB
MALFORMED_OPEN_RESTORE_BUG
Summary
BLUE was repeatedly rehydrating after startup because dolphin.position_state contained stale OPEN rows with zero effective size.
The restore path treated those rows as fatal:
- it selected the latest
OPENrow pertrade_id - it accepted that row even when
quantityornotionalhad been driven to0 - it hard-stopped on
position_state row invalid quantity ... supervisordthen restarted the trader- the next startup read the same bad row again
That created a restart loop.
This was observed most clearly on the 2026-06-11 BLUE window. The recurring bad row was the legacy ATOMUSDT leg 1a3d2f9c, which was persisted as:
status = OPENquantity = 0notional = 0bars_held = 34
That row is not a live position. It is a stale snapshot that should have been treated as tombstoned history.
Root Cause
The bad rows were self-inflicted by the partial-retract path in nautilus_event_trader.py.
Before the fix:
_apply_internal_retract()shrank the live position.- It wrote a new
position_staterow withstatus="OPEN"for the remaining leg. - If the remaining size rounded to zero, the row still existed as an
OPENsnapshot. - A later startup restore could pick that row and treat it as authoritative.
That is enough to leave behind OPEN rows with:
quantity = 0notional = 0
These are not valid live positions, but they looked like one to the old restore logic.
There is a second contributing factor in the restore path:
- the restore code historically trusted the latest
OPENcandidate too early - zero-sized
OPENrows were only rejected after the row had already been chosen as the best candidate - rejection used a hard failure path, which made the process exit instead of trying the next sane source
That means the persistence bug and the restore policy bug reinforced each other.
Observable Symptoms
- repeated
restore candidate parse failed from capital_update_ledger: 'list' object has no attribute 'get' - repeated
position_state row invalid quantity for trade ...: 0.0 RESTORE HALT- immediate restart by
supervisord
The chain-token mismatch logs were a separate warning. They were not the restart trigger.
The capital-ledger parse warning is also distinct:
- it indicates the ledger file is list-shaped, not a dict
- it forces restore to rely more heavily on the other state surfaces
- it is noisy, but it is not what actually killed the process in this incident
Fix Applied
Two changes were made.
1. Stop writing zero-sized OPEN rows
In _apply_internal_retract():
- compute
remaining_qty - if the remaining size is effectively zero, treat the retract as a full close
- return the forced exit without emitting a new
position_staterow withstatus="OPEN"
This prevents the bad row from being created in the first place.
2. Make restore skip legacy bad OPEN rows
In _restore_position_state():
- the ClickHouse restore query now filters
OPENrows withquantity > 0 AND notional > 0 - if an invalid candidate still appears, restore logs and rejects it instead of hard-halting the process
- restore falls back to HZ state or flat continuation rather than turning a stale row into a restart loop
This is important because the repository already contains stale history. The fix is not only to stop producing new malformed rows; it also has to prevent old rows from re-triggering the same failure path on the next reboot.
3. Keep the full-close path coherent
The retract path now computes remaining_qty explicitly and treats remaining_notional <= 1e-9 or remaining_qty <= 0.0 as a full close.
That means:
- a full retract does not leave a zero-size
OPENsnapshot behind - the exit is finalized as a close, not as a pseudo-open partial state
- the runtime slot is removed cleanly instead of being left in a half-closed limbo
Verification Added
Regression tests were added for both sides:
- full-close retracts no longer emit zero-sized
OPENrows - restore skips zero-sized
OPENcandidates without settingrestore_failed
The tests use the existing retract and restore harnesses:
- one test seeds a tiny short leg that collapses to zero on retract and asserts no
OPENzero-size row is written - one test feeds a zero-sized
OPENposition_staterow into restore and asserts restore does not hard-halt
Operational Impact
After this fix:
- stale zero-sized
OPENrows no longer restart BLUE - malformed open snapshots are quarantined as legacy garbage
- the live runtime can continue from a sane source instead of bouncing on the same bad record
What This Does Not Fix
This change does not rewrite historical ClickHouse rows already present in the warehouse.
It only changes:
- new retract writes
- restore selection and rejection policy
- restart behavior when the old garbage is encountered
If you want the historical ledger cleaned up, that is a separate reconciliation task. The current patch is intentionally conservative and only stops the bad row from causing further damage.