# Operational Status - NG7 Live
**Last Updated:** 2026-03-25 05:35 UTC
**Status:** ✅ FULLY OPERATIONAL
---
## Current State
| Component | Status | Details |
|-----------|--------|---------|
| NG7 (Windows) | ✅ LIVE | Writing directly to Hz over Tailscale |
| Hz Server | ✅ HEALTHY | Receiving scans ~5s interval |
| Nautilus Trader | ✅ RUNNING | Processing scans, 0 lag |
| Scan Bridge | ✅ RUNNING | Legacy backup (unused) |
---
## Recent Changes
### 1. NG7 Direct Hz Write (Primary)
- **Before:** Arrow → SMB → Scan Bridge → Hz (~5-60s lag)
- **After:** NG7 → Hz direct (~67ms network + ~55ms processing)
- **Result:** ~40x (vs 5s lag) to ~460x (vs 60s lag) faster end-to-end, real-time sync
### 2. Supervisord Migration
- Migrated `nautilus_trader` and `scan_bridge` from systemd to supervisord
- Config: `/mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf`
- Status: `supervisorctl -c ... status`
### 3. Bug Fix: file_mtime
- **Issue:** Nautilus dedup failed (missing `file_mtime` field)
- **Fix:** Added NG7 compatibility fallback using `timestamp`
- **Location:** `nautilus_event_trader.py` line ~320
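
The fallback described above might look roughly like the sketch below. This is not the code from `nautilus_event_trader.py`: `dedup_key` and `ScanDeduper` are hypothetical names, and the assumption is that legacy payloads carry `file_mtime` while NG7 direct writes carry only `timestamp`.

```python
from typing import Any, Mapping, Optional


def dedup_key(scan: Mapping[str, Any]) -> Optional[Any]:
    """Return the value used to recognise an already-processed scan.

    Prefers `file_mtime` and falls back to `timestamp` when it is missing
    (assumed to be the case for NG7 direct writes).
    """
    mtime = scan.get("file_mtime")
    if mtime is not None:
        return mtime
    return scan.get("timestamp")


class ScanDeduper:
    """Remembers the last key seen and rejects repeats (hypothetical helper)."""

    def __init__(self) -> None:
        self._last_key: Optional[Any] = None

    def is_new(self, scan: Mapping[str, Any]) -> bool:
        key = dedup_key(scan)
        if key is None:
            # No usable key: process the scan rather than silently drop it.
            return True
        if key == self._last_key:
            return False
        self._last_key = key
        return True
```

Treating a scan with neither field as new is a design choice in this sketch; dropping such scans would hide a payload regression instead of surfacing it.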
---
## Test Results
### Latency Benchmark
```
Network (Tailscale): ~67ms (52% of total)
Engine processing: ~55ms (42% of total)
Total end-to-end: ~130ms
Sync quality: 0 lag (100% in-sync)
```
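
As a quick cross-check of the benchmark figures, the shares and the speedup over the legacy path can be recomputed from the numbers in this document. The constant names are illustrative, and the 5-60s legacy range is taken from the "Recent Changes" section.

```python
NETWORK_MS = 67.0   # Tailscale network time
ENGINE_MS = 55.0    # engine processing time
TOTAL_MS = 130.0    # measured end-to-end total


def share_pct(part_ms: float, total_ms: float) -> int:
    """Share of the total, rounded to a whole percent."""
    return round(100.0 * part_ms / total_ms)


def speedup(legacy_s: float, total_ms: float) -> float:
    """How many times faster the new path is than a legacy lag given in seconds."""
    return legacy_s * 1000.0 / total_ms


network_pct = share_pct(NETWORK_MS, TOTAL_MS)         # 52
engine_pct = share_pct(ENGINE_MS, TOTAL_MS)           # 42
unattributed_ms = TOTAL_MS - NETWORK_MS - ENGINE_MS   # 8.0 ms not covered by either figure

if __name__ == "__main__":
    print(f"network {network_pct}%, engine {engine_pct}%, other {unattributed_ms:.0f}ms")
    print(f"speedup vs legacy: {speedup(5, TOTAL_MS):.0f}x to {speedup(60, TOTAL_MS):.0f}x")
```

The roughly 8ms not attributed to network or engine accounts for the remaining ~6% of the total.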
### Scan Statistics (Current)
```
Hz latest scan: #1803
Engine last scan: #1803
Scans processed: 1674
Bar index: 1613
Capital: $25,000
Posture: APEX
```
### Integrity Checks
- ✅ NG7 metadata present
- ✅ Eigenvalue tracking active
- ✅ Pricing data (50 symbols)
- ✅ Multi-window results
- ✅ Byte-for-byte Hz/disk congruence
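
A byte-for-byte congruence check of the kind listed above could be sketched as follows. How the payload is fetched from Hz and where the backup file lives are not covered here; `is_congruent` and its arguments are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_hex(data: bytes) -> str:
    """Hex digest, handy for logging alongside the pass/fail result."""
    return hashlib.sha256(data).hexdigest()


def is_congruent(hz_payload: bytes, disk_path: Path) -> bool:
    """True when the Hz payload and the on-disk copy are identical bytes."""
    return hz_payload == disk_path.read_bytes()
```

Comparing raw bytes rather than parsed JSON is deliberate: two payloads can parse to the same object while differing in whitespace or key order, which would not count as byte-for-byte congruence.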
---
## Architecture
```
NG7 (Windows) ──Tailscale──→ Hz (Linux) ──→ Nautilus
      │                          │
      └─────Disk (backup)────────┘
```
**Bottleneck:** Network RTT (~67ms). This is set by the physical route, so little further gain is expected from software changes.
---
## Commands
```bash
# Status
supervisorctl -c /mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf status
# Hz check
python3 -c "import json, hazelcast; c=hazelcast.HazelcastClient(cluster_name='dolphin',cluster_members=['localhost:5701']); print(json.loads(c.get_map('DOLPHIN_FEATURES').get('latest_eigen_scan').result())); c.shutdown()"
# Logs
tail -50 /mnt/dolphinng5_predict/prod/supervisor/logs/nautilus_trader.log
```
---
## Notes
- Network latency (~67ms) is the dominant factor - expected for EU→Sweden
- Engine processing (~55ms) is secondary
- 0 scan lag = optimal sync achieved
- MHS disabled to prevent restart loops
---
## System Recovery - 2026-03-26 08:00 UTC
**Issue:** System extremely sluggish, terminal locked, load average 16.6+
### Root Causes
| Issue | Details |
|-------|---------|
| Zombie Process Storm | 12,385 zombie `timeout` processes from Hazelcast healthcheck |
| Hung CIFS Mounts | DolphinNG6 shares (3 mounts) unresponsive from `100.119.158.61` |
| Stuck Process | `grep -ri` scanning `/mnt` in D-state for 24+ hours |
| I/O Wait | 38% wait time from blocked SMB operations |
### Actions Taken
1. **Killed stuck processes:**
   - `grep -ri` (PID 101907) - unlocked terminal
   - `meta_health_daemon_v2.py` (PID 224047) - D-state cleared
   - Stuck `ls` processes on CIFS mounts
2. **Cleared zombie processes:**
   - Killed Hazelcast parent (PID 2049)
   - Lazy unmounted 3 hung CIFS shares
   - Zombie count: 12,385 → 3
3. **Fixed Hazelcast zombie leak:**
   - Added `init: true` to `docker-compose.yml`
   - Recreated container with tini init system
   - Healthcheck `timeout` processes now properly reaped
### Results
| Metric | Before | After |
|--------|--------|-------|
| Load Average | 16.6+ | 2.72 |
| Zombie Processes | 12,385 | 3 (stable) |
| I/O Wait | 38% | 0% |
| Total Tasks | 12,682 | 352 |
| System Response | Timeout | <100ms |
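
The zombie and task counts in the table can be re-checked at any time by reading process states from `/proc`. The script below is a minimal sketch for Linux, not part of the deployed tooling. `count_states` and `parse_state` are illustrative names.

```python
from collections import Counter
from pathlib import Path


def parse_state(stat_line: str) -> str:
    """Extract the one-letter state from a /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split after the last ')'.
    """
    return stat_line.rsplit(")", 1)[1].split()[0]


def count_states(proc_root: Path = Path("/proc")) -> Counter:
    """Count processes per state (Z = zombie, D = uninterruptible sleep)."""
    counts: Counter = Counter()
    for entry in proc_root.iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as /proc/self
        try:
            text = (entry / "stat").read_text(errors="replace")
            counts[parse_state(text)] += 1
        except (OSError, IndexError):
            continue  # process exited while we were scanning
    return counts


if __name__ == "__main__":
    c = count_states()
    print(f"zombie (Z): {c['Z']}  D-state: {c['D']}  total: {sum(c.values())}")
```

Reading `/proc/<pid>/stat` does not touch any mounted share, so the check itself should not block on a hung CIFS mount.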
### Docker Compose Fix
```yaml
# /mnt/dolphinng5_predict/prod/docker-compose.yml
services:
  hazelcast:
    image: hazelcast/hazelcast:5.3
    init: true # Added: enables proper zombie reaping
    # ... rest of config
```
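
The reason an init process helps can be illustrated outside Docker. An exited child stays a zombie (state `Z`) until its parent calls `wait`, and an init such as tini does that for orphaned processes. The Linux-only sketch below shows the mechanism in general terms; it does not reproduce the Hazelcast healthcheck, and `zombie_then_reap` is a hypothetical name.

```python
import os
import time
from typing import Optional, Tuple


def proc_state(pid: int) -> Optional[str]:
    """One-letter state from /proc/<pid>/stat, or None if the PID is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            return f.read().rsplit(")", 1)[1].split()[0]
    except OSError:
        return None


def zombie_then_reap() -> Tuple[Optional[str], Optional[str]]:
    """Fork a child that exits at once; report its state before and after reaping."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately without cleanup handlers

    # Parent: the child lingers as a zombie until someone waits on it.
    deadline = time.monotonic() + 2.0
    while proc_state(pid) != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
    before = proc_state(pid)

    os.waitpid(pid, 0)  # reaping removes the process table entry
    after = proc_state(pid)
    return before, after


if __name__ == "__main__":
    print(zombie_then_reap())
```

With `init: true`, Docker runs an init process as PID 1 in the container, and that process performs the equivalent of the `waitpid` call for children the main process never waits on.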
### Current Status
| Component | Status | Notes |
|-----------|--------|-------|
| Hazelcast | ✅ Healthy | Init: true, zombie reaping working |
| Hz Management Center | ✅ Up 36h | Stable |
| Prefect Server | ✅ Up 36h | Stable |
| CIFS Mounts | ⚠️ Partial | Only DolphinNG5_Predict mounted |
| System Performance | ✅ Normal | Responsive, low latency |
### CIFS Mount Status
```bash
# Currently mounted:
//100.119.158.61/DolphinNG5_Predict on /mnt/dolphinng5_predict
# Unmounted (server unresponsive):
//100.119.158.61/DolphinNG6
//100.119.158.61/DolphinNG6_Data
//100.119.158.61/DolphinNG6_Data_New
//100.119.158.61/Vids
```
**Note:** DolphinNG6 server at `100.119.158.61` is unresponsive for new mount attempts. DolphinNG5_Predict remains operational.
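
To see which shares are currently mounted without touching the shares themselves, the kernel mount table can be parsed instead of running `ls` or `stat` on the mount point (either of which may block on a hung CIFS mount). This is a minimal sketch; `cifs_mounts` and `missing_shares` are illustrative names, and the expected share list would need to match the real configuration.

```python
from typing import Iterable, List, NamedTuple


class Mount(NamedTuple):
    source: str
    target: str
    fstype: str


def parse_mounts(text: str) -> List[Mount]:
    """Parse /proc/mounts-style text: '<source> <target> <fstype> ...' per line."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            mounts.append(Mount(fields[0], fields[1], fields[2]))
    return mounts


def cifs_mounts(text: str) -> List[Mount]:
    """Only the CIFS/SMB entries from the mount table."""
    return [m for m in parse_mounts(text) if m.fstype in ("cifs", "smb3")]


def missing_shares(text: str, expected: Iterable[str]) -> List[str]:
    """Expected share sources that do not appear in the mount table."""
    mounted = {m.source for m in cifs_mounts(text)}
    return [s for s in expected if s not in mounted]


if __name__ == "__main__":
    with open("/proc/mounts") as f:
        for m in cifs_mounts(f.read()):
            print(f"{m.source} on {m.target}")
```

A share that appears in the mount table can still be hung; this only reports what the kernel considers mounted.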
---
**Last Updated:** 2026-03-26 08:15 UTC
**Status:** ✅ OPERATIONAL (post-recovery)