siloqy/prod/docs/MALFORMED_OPEN_RESTORE_BUG.md

# MALFORMED_OPEN_RESTORE_BUG

## Summary

BLUE was repeatedly rehydrating after startup because `dolphin.position_state` contained stale `OPEN` rows with zero effective size.

The restore path treated those rows as fatal:

- it selected the latest `OPEN` row per `trade_id`
- it accepted that row even when `quantity` or `notional` had been driven to `0`
- it hard-stopped on `position_state row invalid quantity ...`
- `supervisord` then restarted the trader
- the next startup read the same bad row again

That created a restart loop.

This was observed most clearly on the `2026-06-11` BLUE window. The recurring bad row was the legacy `ATOMUSDT` leg `1a3d2f9c`, which was persisted as:

- `status = OPEN`
- `quantity = 0`
- `notional = 0`
- `bars_held = 34`

That row is not a live position. It is a stale snapshot that should have been treated as tombstoned history.

## Root Cause

The bad rows were self-inflicted by the partial-retract path in `nautilus_event_trader.py`.

Before the fix:

1. `_apply_internal_retract()` shrank the live position.
2. It wrote a new `position_state` row with `status="OPEN"` for the remaining leg.
3. If the remaining size rounded to zero, the row still existed as an `OPEN` snapshot.
4. A later startup restore could pick that row and treat it as authoritative.

That is enough to leave behind `OPEN` rows with:

- `quantity = 0`
- `notional = 0`

These are not valid live positions, but they looked like one to the old restore logic.

There is a second contributing factor in the restore path:

- the restore code historically trusted the latest `OPEN` candidate too early
- zero-sized `OPEN` rows were only rejected after the row had already been chosen as the best candidate
- rejection used a hard failure path, which made the process exit instead of trying the next sane source

That means the persistence bug and the restore policy bug reinforced each other.

## Observable Symptoms

- repeated `restore candidate parse failed from capital_update_ledger: 'list' object has no attribute 'get'`
- repeated `position_state row invalid quantity for trade ...: 0.0`
- `RESTORE HALT`
- immediate restart by `supervisord`

The chain-token mismatch logs were a separate warning. They were not the restart trigger.

The capital-ledger parse warning is also distinct:

- it indicates the ledger file is list-shaped, not a dict
- it forces restore to rely more heavily on the other state surfaces
- it is noisy, but it is not what actually killed the process in this incident

## Fix Applied

Two changes were made.

### 1. Stop writing zero-sized `OPEN` rows

In `_apply_internal_retract()`:

- compute `remaining_qty`
- if the remaining size is effectively zero, treat the retract as a full close
- return the forced exit without emitting a new `position_state` row with `status="OPEN"`

This prevents the bad row from being created in the first place.

### 2. Make restore skip legacy bad `OPEN` rows

In `_restore_position_state()`:

- the ClickHouse restore query now filters `OPEN` rows with `quantity > 0 AND notional > 0`
- if an invalid candidate still appears, restore logs and rejects it instead of hard-halting the process
- restore falls back to HZ state or flat continuation rather than turning a stale row into a restart loop

This is important because the repository already contains stale history. The fix is not only to stop producing new malformed rows; it also has to prevent old rows from re-triggering the same failure path on the next reboot.

### 3. Keep the full-close path coherent

The retract path now computes `remaining_qty` explicitly and treats `remaining_notional <= 1e-9` or `remaining_qty <= 0.0` as a full close.

That means:

- a full retract does not leave a zero-size `OPEN` snapshot behind
- the exit is finalized as a close, not as a pseudo-open partial state
- the runtime slot is removed cleanly instead of being left in a half-closed limbo

## Verification Added

Regression tests were added for both sides:

- full-close retracts no longer emit zero-sized `OPEN` rows
- restore skips zero-sized `OPEN` candidates without setting `restore_failed`

The tests use the existing retract and restore harnesses:

- one test seeds a tiny short leg that collapses to zero on retract and asserts no `OPEN` zero-size row is written
- one test feeds a zero-sized `OPEN` `position_state` row into restore and asserts restore does not hard-halt

## Operational Impact

After this fix:

- stale zero-sized `OPEN` rows no longer restart BLUE
- malformed open snapshots are quarantined as legacy garbage
- the live runtime can continue from a sane source instead of bouncing on the same bad record

## What This Does Not Fix

This change does not rewrite historical ClickHouse rows already present in the warehouse.

It only changes:

- new retract writes
- restore selection and rejection policy
- restart behavior when the old garbage is encountered

If you want the historical ledger cleaned up, that is a separate reconciliation task. The current patch is intentionally conservative and only stops the bad row from causing further damage.
docs: VIBRISS spec (+ §10.6 cascade/adaptive-TP paramsets), PINK accounting fix spec, BLUE incident docs VIBRISS_PARAMETER_GOVERNANCE_SPEC §10.6: ob_cascade.count_threshold (currently cascade_count>0 = ONE asset widens every TP x1.40), tp_widen_factor, withdrawal_velocity_threshold as governance candidates; adaptive/Dynamic-TP threshold marked fit for VIBRISS governance; TP_FLOOR joint-policy reward requirement. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> 2026-06-12 15:04:15 +02:00			`# MALFORMED_OPEN_RESTORE_BUG`

			`## Summary`

			BLUE was repeatedly rehydrating after startup because `dolphin.position_state` contained stale `OPEN` rows with zero effective size.

			`The restore path treated those rows as fatal:`

			- it selected the latest `OPEN` row per `trade_id`
			- it accepted that row even when `quantity` or `notional` had been driven to `0`
			- it hard-stopped on `position_state row invalid quantity ...`
			- `supervisord` then restarted the trader
			`- the next startup read the same bad row again`

			`That created a restart loop.`

			This was observed most clearly on the `2026-06-11` BLUE window. The recurring bad row was the legacy `ATOMUSDT` leg `1a3d2f9c`, which was persisted as:

			- `status = OPEN`
			- `quantity = 0`
			- `notional = 0`
			- `bars_held = 34`

			`That row is not a live position. It is a stale snapshot that should have been treated as tombstoned history.`

			`## Root Cause`

			The bad rows were self-inflicted by the partial-retract path in `nautilus_event_trader.py`.

			`Before the fix:`

			1. `_apply_internal_retract()` shrank the live position.
			2. It wrote a new `position_state` row with `status="OPEN"` for the remaining leg.
			3. If the remaining size rounded to zero, the row still existed as an `OPEN` snapshot.
			`4. A later startup restore could pick that row and treat it as authoritative.`

			That is enough to leave behind `OPEN` rows with:

			- `quantity = 0`
			- `notional = 0`

			`These are not valid live positions, but they looked like one to the old restore logic.`

			`There is a second contributing factor in the restore path:`

			- the restore code historically trusted the latest `OPEN` candidate too early
			- zero-sized `OPEN` rows were only rejected after the row had already been chosen as the best candidate
			`- rejection used a hard failure path, which made the process exit instead of trying the next sane source`

			`That means the persistence bug and the restore policy bug reinforced each other.`

			`## Observable Symptoms`

			- repeated `restore candidate parse failed from capital_update_ledger: 'list' object has no attribute 'get'`
			- repeated `position_state row invalid quantity for trade ...: 0.0`
			- `RESTORE HALT`
			- immediate restart by `supervisord`

			`The chain-token mismatch logs were a separate warning. They were not the restart trigger.`

			`The capital-ledger parse warning is also distinct:`

			`- it indicates the ledger file is list-shaped, not a dict`
			`- it forces restore to rely more heavily on the other state surfaces`
			`- it is noisy, but it is not what actually killed the process in this incident`

			`## Fix Applied`

			`Two changes were made.`

			### 1. Stop writing zero-sized `OPEN` rows

			In `_apply_internal_retract()`:

			- compute `remaining_qty`
			`- if the remaining size is effectively zero, treat the retract as a full close`
			- return the forced exit without emitting a new `position_state` row with `status="OPEN"`

			`This prevents the bad row from being created in the first place.`

			### 2. Make restore skip legacy bad `OPEN` rows

			In `_restore_position_state()`:

			- the ClickHouse restore query now filters `OPEN` rows with `quantity > 0 AND notional > 0`
			`- if an invalid candidate still appears, restore logs and rejects it instead of hard-halting the process`
			`- restore falls back to HZ state or flat continuation rather than turning a stale row into a restart loop`

			`This is important because the repository already contains stale history. The fix is not only to stop producing new malformed rows; it also has to prevent old rows from re-triggering the same failure path on the next reboot.`

			`### 3. Keep the full-close path coherent`

			The retract path now computes `remaining_qty` explicitly and treats `remaining_notional <= 1e-9` or `remaining_qty <= 0.0` as a full close.

			`That means:`

			- a full retract does not leave a zero-size `OPEN` snapshot behind
			`- the exit is finalized as a close, not as a pseudo-open partial state`
			`- the runtime slot is removed cleanly instead of being left in a half-closed limbo`

			`## Verification Added`

			`Regression tests were added for both sides:`

			- full-close retracts no longer emit zero-sized `OPEN` rows
			- restore skips zero-sized `OPEN` candidates without setting `restore_failed`

			`The tests use the existing retract and restore harnesses:`

			- one test seeds a tiny short leg that collapses to zero on retract and asserts no `OPEN` zero-size row is written
			- one test feeds a zero-sized `OPEN` `position_state` row into restore and asserts restore does not hard-halt

			`## Operational Impact`

			`After this fix:`

			- stale zero-sized `OPEN` rows no longer restart BLUE
			`- malformed open snapshots are quarantined as legacy garbage`
			`- the live runtime can continue from a sane source instead of bouncing on the same bad record`

			`## What This Does Not Fix`

			`This change does not rewrite historical ClickHouse rows already present in the warehouse.`

			`It only changes:`

			`- new retract writes`
			`- restore selection and rejection policy`
			`- restart behavior when the old garbage is encountered`

			`If you want the historical ledger cleaned up, that is a separate reconciliation task. The current patch is intentionally conservative and only stops the bad row from causing further damage.`