Files
DOLPHIN/prod/docs/OPERATIONAL_STATUS.md
hjnormey 01c19662cb initial: import DOLPHIN baseline 2026-04-21 from dolphinng5_predict working tree
Includes core prod + GREEN/BLUE subsystems:
- prod/ (BLUE harness, configs, scripts, docs)
- nautilus_dolphin/ (GREEN Nautilus-native impl + dvae/ preserved)
- adaptive_exit/ (AEM engine + models/bucket_assignments.pkl)
- Observability/ (EsoF advisor, TUI, dashboards)
- external_factors/ (EsoF producer)
- mc_forewarning_qlabs_fork/ (MC regime/envelope)

Excludes runtime caches, logs, backups, and reproducible artifacts per .gitignore.
2026-04-21 16:58:38 +02:00

183 lines
4.8 KiB
Markdown
Executable File
Raw Permalink Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Operational Status - NG7 Live
**Last Updated:** 2026-03-25 05:35 UTC
**Status:** ✅ FULLY OPERATIONAL
---
## Current State
| Component | Status | Details |
|-----------|--------|---------|
| NG7 (Windows) | ✅ LIVE | Writing directly to Hz over Tailscale |
| Hz Server | ✅ HEALTHY | Receiving scans ~5s interval |
| Nautilus Trader | ✅ RUNNING | Processing scans, 0 lag |
| Scan Bridge | ✅ RUNNING | Legacy backup (unused) |
---
## Recent Changes
### 1. NG7 Direct Hz Write (Primary)
- **Before:** Arrow → SMB → Scan Bridge → Hz (~5-60s lag)
- **After:** NG7 → Hz direct (~67ms network + ~55ms processing)
- **Result:** 400-500x faster, real-time sync
### 2. Supervisord Migration
- Migrated `nautilus_trader` and `scan_bridge` from systemd to supervisord
- Config: `/mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf`
- Status: `supervisorctl -c ... status`
### 3. Bug Fix: file_mtime
- **Issue:** Nautilus dedup failed (missing `file_mtime` field)
- **Fix:** Added NG7 compatibility fallback using `timestamp`
- **Location:** `nautilus_event_trader.py` line ~320
---
## Test Results
### Latency Benchmark
```
Network (Tailscale): ~67ms (52% of total)
Engine processing: ~55ms (42% of total)
Total end-to-end: ~130ms
Sync quality: 0 lag (100% in-sync)
```
### Scan Statistics (Current)
```
Hz latest scan: #1803
Engine last scan: #1803
Scans processed: 1674
Bar index: 1613
Capital: $25,000
Posture: APEX
```
### Integrity Checks
- ✅ NG7 metadata present
- ✅ Eigenvalue tracking active
- ✅ Pricing data (50 symbols)
- ✅ Multi-window results
- ✅ Byte-for-byte Hz/disk congruence
---
## Architecture
```
NG7 (Windows) ──Tailscale──→ Hz (Linux) ──→ Nautilus
│ │
└────Disk (backup)───────┘
```
**Bottleneck:** Network RTT (~67ms) - physics limited, optimal.
---
## Commands
```bash
# Status
supervisorctl -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf status
# Hz check
python3 -c "import hazelcast; c=HazelcastClient(cluster_name='dolphin',cluster_members=['localhost:5701']); print(json.loads(c.get_map('DOLPHIN_FEATURES').get('latest_eigen_scan').result()))"
# Logs
tail -50 /mnt/dolphinng5_predict/prod/supervisor/logs/nautilus_trader.log
```
---
## Notes
- Network latency (~67ms) is the dominant factor - expected for EU→Sweden
- Engine processing (~55ms) is secondary
- 0 scan lag = optimal sync achieved
- MHS disabled to prevent restart loops
---
## System Recovery - 2026-03-26 08:00 UTC
**Issue:** System extremely sluggish, terminal locked, load average 16.6+
### Root Causes
| Issue | Details |
|-------|---------|
| Zombie Process Storm | 12,385 zombie `timeout` processes from Hazelcast healthcheck |
| Hung CIFS Mounts | DolphinNG6 shares (3 mounts) unresponsive from `100.119.158.61` |
| Stuck Process | `grep -ri` scanning `/mnt` in D-state for 24+ hours |
| I/O Wait | 38% wait time from blocked SMB operations |
### Actions Taken
1. **Killed stuck processes:**
- `grep -ri` (PID 101907) - unlocked terminal
- `meta_health_daemon_v2.py` (PID 224047) - D-state cleared
- Stuck `ls` processes on CIFS mounts
2. **Cleared zombie processes:**
- Killed Hazelcast parent (PID 2049)
- Lazy unmounted 3 hung CIFS shares
- Zombie count: 12,385 → 3
3. **Fixed Hazelcast zombie leak:**
- Added `init: true` to `docker-compose.yml`
- Recreated container with tini init system
- Healthcheck `timeout` processes now properly reaped
### Results
| Metric | Before | After |
|--------|--------|-------|
| Load Average | 16.6+ | 2.72 |
| Zombie Processes | 12,385 | 3 (stable) |
| I/O Wait | 38% | 0% |
| Total Tasks | 12,682 | 352 |
| System Response | Timeout | <100ms |
### Docker Compose Fix
```yaml
# /mnt/dolphinng5_predict/prod/docker-compose.yml
services:
hazelcast:
image: hazelcast/hazelcast:5.3
init: true # Added: enables proper zombie reaping
# ... rest of config
```
### Current Status
| Component | Status | Notes |
|-----------|--------|-------|
| Hazelcast | Healthy | Init: true, zombie reaping working |
| Hz Management Center | Up 36h | Stable |
| Prefect Server | Up 36h | Stable |
| CIFS Mounts | Partial | Only DolphinNG5_Predict mounted |
| System Performance | Normal | Responsive, low latency |
### CIFS Mount Status
```bash
# Currently mounted:
//100.119.158.61/DolphinNG5_Predict on /mnt/dolphinng5_predict
# Unmounted (server unresponsive):
//100.119.158.61/DolphinNG6
//100.119.158.61/DolphinNG6_Data
//100.119.158.61/DolphinNG6_Data_New
//100.119.158.61/Vids
```
**Note:** DolphinNG6 server at `100.119.158.61` is unresponsive for new mount attempts. DolphinNG5_Predict remains operational.
---
**Last Updated:** 2026-03-26 08:15 UTC
**Status:** OPERATIONAL (post-recovery)