Codex 6d08e97e28 BLUE hardening: spool-poison guards, dead-session clock fix, HZ black-box, RETRACT race-safety
Seven uncommitted production fixes to BLUE's main runner that the LIVE
process has already been running since the 2026-06-15 17:23 restart (file
mtime 17:17, pid started 17:23). Each fix answers a documented incident;
committing now so they survive in history and a stray checkout can't
silently revert running-config code on the next restart.

1. bars_held = max(0, int(...)) at BOTH journal sites (terminal + sub-day).
   CH column is UInt16 — a negative value poisons the spool with a
   head-of-line jam (incident 2026-06-12: bars_held=-106).

2. entry_bar = int(restored_entry_bar) at BOTH reconstruction sites; NEVER
   from chain_meta. trade_reconstruction payloads carry the DEAD session's
   bar counter, so the old override reinstated the stale clock frame the
   re-anchor exists to fix → negative bars_held → same UInt16 spool poison
   (zombie-trade resurrections, incident 2026-06-12). restored_entry_bar
   already encodes hold continuity via stored_bars in THIS session's frame.

3. capital parse handles list/ledger-style payloads: when the restore blob
   is a list of update rows, take the latest dict row instead of falling
   through to {} and losing the capital anchor.

4. _connect_hz routes the `hazelcast` logger to stderr at INFO. The
   silent-HZ-death investigation found ZERO client log lines because
   nothing routed them; without this the reactor's health is invisible.

5. _dump_blackbox(reason): forensic thread dump before a watchdog restart —
   lifecycle.is_running, active_connections, every thread's stack, and a
   flag when any hazelcast/reactor-named thread is MISSING (= reactor died,
   the prime suspect for the silent 40min–8h client deaths). print()-only,
   CIFS-safe. _watchdog_restart calls it first.

6. _drain_runtime_commands / _process_runtime_commands gain
   `*, allow_retract=True`; the heartbeat path drains with
   allow_retract=False and re-queues any RETRACT commands. A RETRACT can
   force a terminal close that must run through the scan-thread close
   finalizer, so the heartbeat must not race it.

7. +import traceback (for the black-box stack dumps).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 12:03:20 +02:00
Description
No description provided
33 MiB
Languages
Python 100%