Scan Bridge Prefect Integration Study
Date: 2026-03-24
Version: v1.0
Status: Analysis Complete - Recommendation: Hybrid Approach
Executive Summary
The Scan Bridge Service can be integrated into Prefect orchestration, but NOT as a standard flow or task. Because it runs continuously as a file system watchdog, it needs special handling. The recommended approach is a hybrid architecture: the bridge runs as a standalone supervised process, and Prefect provides health monitoring and automatic restarts.
1. Current Architecture (Standalone)
┌─────────────────────────────────────────────────────────────────┐
│                   CURRENT: Standalone Service                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────┐      ┌─────────────────┐                   │
│  │ scan_bridge_    │─────▶│   Hazelcast     │                   │
│  │ service.py      │      │   (SSOT)        │                   │
│  │                 │      │                 │                   │
│  │ • watchdog      │      │ latest_eigen_   │                   │
│  │ • mtime-based   │      │ scan            │                   │
│  │ • continuous    │      │                 │                   │
│  └─────────────────┘      └─────────────────┘                   │
│           ▲                                                     │
│           │ watches                                             │
│  ┌────────┴─────────────────┐                                   │
│  │ /mnt/ng6_data/arrow_     │                                   │
│  │ scans/YYYY-MM-DD/*.      │                                   │
│  │ arrow                    │                                   │
│  └──────────────────────────┘                                   │
│                                                                 │
│  MANAGEMENT: Manual (./scan_bridge_restart.sh)                  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Current Issues
- ❌ No automatic restart on crash
- ❌ No health monitoring
- ❌ No integration with system-wide orchestration
- ❌ Manual log rotation
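The "mtime-based" watching shown in the diagram can be pictured as a polling step that picks the newest .arrow file in the dated directory. This is a minimal sketch, not the bridge's actual code: the function names latest_arrow_file and poll_once are hypothetical, and the real service may rely on watchdog events rather than polling.

```python
from pathlib import Path
from typing import Optional, Tuple


def latest_arrow_file(day_dir: Path) -> Optional[Path]:
    """Return the most recently modified *.arrow file in day_dir, or None."""
    candidates = list(day_dir.glob("*.arrow"))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


def poll_once(day_dir: Path, last_mtime: float) -> Tuple[Optional[Path], float]:
    """One polling step.

    Returns (path, mtime) when a scan newer than last_mtime exists,
    otherwise (None, last_mtime) so the caller keeps its watermark.
    """
    newest = latest_arrow_file(day_dir)
    if newest is None:
        return None, last_mtime
    mtime = newest.stat().st_mtime
    if mtime <= last_mtime:
        return None, last_mtime
    return newest, mtime
```

A caller would keep the returned mtime as its watermark and push the file to Hazelcast only when poll_once returns a path.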
2. Integration Options Analysis
Option A: Prefect Flow Task (REJECTED)
Concept: Run scan bridge as a Prefect flow task
from time import sleep
from prefect import flow

@flow
def scan_bridge_flow():
    while True:  # ← PROBLEM: infinite loop, the flow run never completes
        scan_files()
        sleep(1)
Why Rejected:
| Issue | Explanation |
|---|---|
| Timeouts | Prefect sets no timeout by default, but any timeout_seconds configured on the flow or task would kill the loop |
| Worker Lock | Blocks Prefect worker indefinitely |
| Resource Waste | Prefect worker tied up doing file watching |
| Anti-pattern | Prefect is for discrete workflows, not continuous daemons |
Verdict: ❌ Not suitable
Option B: Prefect Daemon Service (RECOMMENDED)
Concept: Use Prefect's infrastructure to manage the bridge as a long-running service
┌─────────────────────────────────────────────────────────────────┐
│             RECOMMENDED: Prefect-Supervised Daemon              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────────────────────────────────────────────────┐       │
│  │           Prefect Server (localhost:4200)            │       │
│  │                                                      │       │
│  │  ┌────────────────┐    ┌─────────────────────────┐   │       │
│  │  │ Health Check   │───▶│ Scan Bridge Deployment  │   │       │
│  │  │ Flow (30s)     │    │ (long-running)          │   │       │
│  │  └────────────────┘    └─────────────────────────┘   │       │
│  │          │                          │                │       │
│  │          │ monitors                 │ manages        │       │
│  │          ▼                          ▼                │       │
│  │  ┌─────────────────────────────────────────────┐     │       │
│  │  │       scan_bridge_service.py process        │     │       │
│  │  │ • systemd/Prefect managed                   │     │       │
│  │  │ • auto-restart on failure                   │     │       │
│  │  │ • stdout/stderr to Prefect logs             │     │       │
│  │  └─────────────────────────────────────────────┘     │       │
│  └──────────────────────────────────────────────────────┘       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Implementation:
# scan_bridge_prefect_daemon.py
import signal
import subprocess
import sys
import threading
import time

from prefect import flow, get_run_logger

DAEMON_CMD = [sys.executable, "/mnt/dolphinng5_predict/prod/scan_bridge_service.py"]


class ScanBridgeDaemon:
    def __init__(self, logger):
        # get_run_logger() only works inside a flow or task run, so the flow
        # passes its logger in instead of the class creating one at import time.
        self.process = None
        self.logger = logger

    def _forward_output(self, process):
        """Drain the child's output into the Prefect logs.

        An unread stdout pipe eventually fills up and blocks the bridge.
        """
        for line in process.stdout:
            self.logger.info(f"[bridge] {line.rstrip()}")

    def start(self):
        """Start the scan bridge daemon."""
        self.logger.info("Starting Scan Bridge daemon...")
        self.process = subprocess.Popen(
            DAEMON_CMD,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
        threading.Thread(
            target=self._forward_output, args=(self.process,), daemon=True
        ).start()
        # Give the process a moment, then confirm it has not exited already
        time.sleep(2)
        if self.process.poll() is None:
            self.logger.info(f"✓ Daemon started (PID: {self.process.pid})")
            return True
        self.logger.error("✗ Daemon failed to start")
        return False

    def health_check(self) -> bool:
        """Check if daemon is healthy."""
        if self.process is None:
            return False
        # Check process is running
        exit_code = self.process.poll()
        if exit_code is not None:
            self.logger.error(f"Daemon exited with code {exit_code}")
            return False
        # Check Hazelcast for recent data (project helper, not defined in this study)
        from dolphin_hz_utils import check_scan_freshness
        try:
            age_sec = check_scan_freshness()
            if age_sec > 60:  # Data older than 60s
                self.logger.warning(f"Stale data detected (age: {age_sec}s)")
                return False
            return True
        except Exception as e:
            self.logger.error(f"Health check failed: {e}")
            return False

    def stop(self):
        """Stop the daemon gracefully, killing it if SIGTERM is ignored."""
        if self.process and self.process.poll() is None:
            self.logger.info("Stopping daemon...")
            self.process.send_signal(signal.SIGTERM)
            try:
                self.process.wait(timeout=5)
            except subprocess.TimeoutExpired:
                self.logger.warning("Daemon ignored SIGTERM, killing it")
                self.process.kill()
                self.process.wait()
            self.logger.info("✓ Daemon stopped")


@flow(name="scan-bridge-daemon")
def scan_bridge_daemon_flow():
    """
    Long-running Prefect flow that manages the scan bridge daemon.
    This flow runs indefinitely, monitoring and restarting the bridge as needed.
    """
    logger = get_run_logger()
    daemon = ScanBridgeDaemon(logger)
    logger.info("=" * 60)
    logger.info("🐬 Scan Bridge Daemon Manager (Prefect)")
    logger.info("=" * 60)
    # Initial start
    if not daemon.start():
        raise RuntimeError("Failed to start daemon")
    try:
        while True:
            # Health check every 30 seconds
            time.sleep(30)
            if not daemon.health_check():
                logger.warning("Health check failed, restarting daemon...")
                daemon.stop()
                time.sleep(1)
                if daemon.start():
                    logger.info("✓ Daemon restarted")
                else:
                    logger.error("✗ Failed to restart daemon")
                    raise RuntimeError("Daemon restart failed")
            else:
                logger.debug("Health check passed")
    except KeyboardInterrupt:
        logger.info("Shutting down...")
    finally:
        daemon.stop()


if __name__ == "__main__":
    # Run the flow directly (local run, without a deployment)
    scan_bridge_daemon_flow()
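The health check relies on check_scan_freshness from dolphin_hz_utils, which this study does not show. The sketch below illustrates one way the age computation could work once the latest payload has been read from Hazelcast. It assumes, hypothetically, that the payload is a JSON string with a numeric "timestamp" field in Unix epoch seconds; the names scan_age_seconds and is_fresh are illustrative only.

```python
import json
import time
from typing import Optional


def scan_age_seconds(payload: str, now: Optional[float] = None) -> float:
    """Age in seconds of a scan payload, based on its embedded timestamp.

    Assumption: the payload is JSON with a numeric "timestamp" field
    holding Unix epoch seconds.
    """
    now = time.time() if now is None else now
    data = json.loads(payload)
    # Clamp at zero so small clock skew never yields a negative age
    return max(0.0, now - float(data["timestamp"]))


def is_fresh(
    payload: Optional[str],
    max_age_sec: float = 60.0,
    now: Optional[float] = None,
) -> bool:
    """True if a payload exists and is no older than max_age_sec."""
    if payload is None:  # key missing in Hazelcast: treat as stale
        return False
    try:
        return scan_age_seconds(payload, now) <= max_age_sec
    except (ValueError, KeyError, TypeError):
        return False  # an unparseable payload counts as unhealthy
```

Keeping the age computation separate from the Hazelcast read makes the 60 s threshold testable without a running cluster.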
Pros:
| Advantage | Description |
|---|---|
| Auto-restart | The flow's health loop restarts the bridge whenever a check fails |
| Centralized Logs | Bridge logs in Prefect UI |
| Health Monitoring | Automatic detection of stale data |
| Integration | Part of overall orchestration |
Cons:
| Disadvantage | Mitigation |
|---|---|
| Requires Prefect worker | Use dedicated worker pool |
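| Flow never completes | Deploy without a schedule or timeout and treat a permanent Running state as normal (Prefect has no dedicated daemon deployment type) |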
Option C: Systemd Service with Prefect Monitoring (ALTERNATIVE)
Concept: Use systemd for process management, Prefect for health checks
┌─────────────────────────────────────────────────────────────────┐
│            ALTERNATIVE: Systemd + Prefect Monitoring            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────┐      ┌─────────────────┐                   │
│  │     systemd     │      │     Prefect     │                   │
│  │                 │      │     Server      │                   │
│  │ ┌─────────────┐ │      │                 │                   │
│  │ │ scan-bridge │◀┼──────┤ Health Check    │                   │
│  │ │ service     │ │      │ Flow (60s)      │                   │
│  │ │ (auto-      │ │      │                 │                   │
│  │ │ restart)    │ │      │ Alerts on:      │                   │
│  │ └─────────────┘ │      │ • stale data    │                   │
│  │        │        │      │ • process down  │                   │
│  │        ▼        │      │                 │                   │
│  │ ┌─────────────┐ │      │                 │                   │
│  │ │ journald    │─┼──────┤ Log ingestion   │                   │
│  │ │ (logs)      │ │      │                 │                   │
│  │ └─────────────┘ │      │                 │                   │
│  └─────────────────┘      └─────────────────┘                   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Systemd Service:
# /etc/systemd/system/dolphin-scan-bridge.service
[Unit]
Description=DOLPHIN Scan Bridge Service
After=network.target hazelcast.service
Wants=hazelcast.service
# Start-rate limiting belongs in [Unit]: give up after 3 starts within 60s
StartLimitIntervalSec=60
StartLimitBurst=3

[Service]
Type=simple
User=dolphin
Group=dolphin
WorkingDirectory=/mnt/dolphinng5_predict/prod
Environment="PATH=/home/dolphin/siloqy_env/bin"
ExecStart=/home/dolphin/siloqy_env/bin/python3 \
    /mnt/dolphinng5_predict/prod/scan_bridge_service.py
Restart=always
RestartSec=5
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
Prefect Health Check Flow:
import subprocess

from prefect import flow, get_run_logger

# check_hz_scan_freshness() and send_alert() are project helpers that this
# study does not define.


@flow(name="scan-bridge-health-check")
def scan_bridge_health_check():
    """Periodic health check for scan bridge (runs every 60s)."""
    logger = get_run_logger()
    # Check 1: Process running (systemctl is-active exits 0 only when active)
    result = subprocess.run(
        ["systemctl", "is-active", "dolphin-scan-bridge"],
        capture_output=True
    )
    if result.returncode != 0:
        logger.error("❌ Scan bridge service DOWN")
        send_alert("Scan bridge service not active")
        return False
    # Check 2: Data freshness
    age_sec = check_hz_scan_freshness()
    if age_sec > 60:
        logger.error(f"❌ Stale data detected (age: {age_sec}s)")
        send_alert(f"Scan data stale: {age_sec}s old")
        return False
    logger.info(f"✅ Healthy (data age: {age_sec}s)")
    return True
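The flow above calls send_alert, which is not defined anywhere in this study. Because the check runs every 60 s, an unthrottled alert would fire once a minute for as long as the bridge stays down. The sketch below shows one possible shape for it, assuming alerts are emitted through Python logging and identical alerts are suppressed for a cooldown period. The AlertThrottle class, its key argument, and the 300 s default are all hypothetical.

```python
import logging
import time
from typing import Callable, Dict

logger = logging.getLogger("scan_bridge.alerts")


class AlertThrottle:
    """Suppress repeats of the same alert key within a cooldown window."""

    def __init__(
        self,
        cooldown_sec: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.cooldown_sec = cooldown_sec
        self.clock = clock  # injectable so tests can control time
        self._last_sent: Dict[str, float] = {}

    def send_alert(self, key: str, message: str) -> bool:
        """Emit the alert unless this key already fired within the cooldown.

        Returns True if the alert was emitted, False if it was suppressed.
        """
        now = self.clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown_sec:
            return False
        self._last_sent[key] = now
        logger.error("ALERT [%s] %s", key, message)
        return True
```

Keying on a short identifier such as "stale_data" rather than the full message keeps throttling effective even when the message embeds a changing data age.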
Pros:
- Industry-standard process management
- Automatic restart on crash
- Independent of Prefect availability
Cons:
- Requires root access to install and manage a system unit under /etc/systemd/system
- Log aggregation separate from Prefect
- Two systems to manage
3. Comparative Analysis
| Criteria | Option A (Flow Task) | Option B (Prefect Daemon) | Option C (Systemd + Prefect) |
|---|---|---|---|
| Complexity | Low | Medium | Medium |
| Auto-restart | ❌ No | ✅ Yes | ✅ Yes (systemd) |
| Centralized Logs | ✅ Yes | ✅ Yes | ⚠️ Partial (journald) |
| Prefect Integration | ❌ Poor | ✅ Full | ⚠️ Monitoring only |
| Resource Usage | ❌ High (blocks worker) | ✅ Efficient | ✅ Efficient |
| Restart Speed | N/A | ~3 seconds after a failed check (checks run every 30s) | ~5 seconds (RestartSec=5) |
| Root Required | ❌ No | ❌ No | ✅ Yes |
| Production Ready | ❌ No | ✅ Yes | ✅ Yes |
4. Recommendation
Primary: Option B - Prefect Daemon Service
Rationale:
- Unified orchestration - Everything in Prefect (flows, logs, alerts)
- No root required - Runs as dolphin user
- Auto-restart - Prefect manages lifecycle
- Health monitoring - Built-in stale data detection
Deployment Plan:
# Assumes worker-based deployments (prefect.yaml + `prefect deploy`)
# 1. Create a dedicated process work pool
prefect work-pool create dolphin-daemon-pool --type process

# 2. Declare the deployment (no schedule: the run is triggered once and keeps running)
cd /mnt/dolphinng5_predict/prod
cat >> prefect.yaml << 'EOF'
deployments:
  - name: scan-bridge-daemon
    entrypoint: scan_bridge_prefect_daemon.py:scan_bridge_daemon_flow
    work_pool:
      name: dolphin-daemon-pool
    parameters: {}
    # Long-running daemon settings
    enforce_parameter_schema: false
    schedules: []
EOF

# 3. Deploy
prefect deploy --name scan-bridge-daemon

# 4. Start daemon worker
prefect worker start --pool dolphin-daemon-pool

# 5. Trigger the long-running flow run (from a second shell)
prefect deployment run 'scan-bridge-daemon/scan-bridge-daemon'
Secondary: Option C - Systemd (if Prefect unstable)
If Prefect server experiences downtime, systemd ensures the bridge continues running.
5. Implementation Phases
Phase 1: Immediate (Today)
- ✅ Created scan_bridge_restart.sh wrapper
- ✅ Created dolphin-scan-bridge.service systemd file
- Use manual script for now
Phase 2: Prefect Integration (Next Sprint)
- Create scan_bridge_prefect_daemon.py
- Implement health check flow
- Set up daemon worker pool
- Deploy to Prefect
- Configure alerting
Phase 3: Monitoring Hardening
- Dashboard for scan bridge metrics
- Alert on data staleness > 30s
- Log rotation strategy
- Performance metrics (lag from file write to Hz push)
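The last item, lag from file write to Hz push, can be defined as the push time minus the scan file's modification time. A minimal sketch follows, assuming the bridge records a Unix timestamp when it pushes to Hazelcast; the function name file_to_hz_lag simply mirrors the metric name and is not an existing helper.

```python
import os
import time
from typing import Optional


def file_to_hz_lag(path: str, pushed_at: Optional[float] = None) -> float:
    """Seconds between the file's last modification and the Hazelcast push.

    pushed_at is the Unix time of the push; it defaults to the current time
    so the function can be called right after the push completes.
    """
    pushed_at = time.time() if pushed_at is None else pushed_at
    return pushed_at - os.stat(path).st_mtime
```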
6. Health Check Specifications
Metrics to Monitor
| Metric | Warning | Critical | Action |
|---|---|---|---|
| Data age | > 30s | > 60s | Alert / Restart |
| Process CPU | > 50% | > 80% | Investigate |
| Memory | > 100MB | > 500MB | Restart |
| Hz connection | - | Failed | Restart |
| Files processed | < 1/min | < 1/5min | Alert |
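The numeric thresholds in the table can be expressed as a small lookup plus a classification function. This is a sketch under stated assumptions: the metric keys (data_age_sec, cpu_percent, memory_mb) are hypothetical names, and only the three "greater than" rows are covered; the Hz connection and files-processed rows need separate handling.

```python
from typing import Dict, Tuple

# (warning, critical) thresholds taken from the table; a reading is flagged
# when it exceeds the threshold. Metric names are illustrative.
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "data_age_sec": (30.0, 60.0),
    "cpu_percent": (50.0, 80.0),
    "memory_mb": (100.0, 500.0),
}

_ORDER = ["ok", "warning", "critical"]


def classify(metric: str, value: float) -> str:
    """Return "ok", "warning" or "critical" for a single metric reading."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


def overall_status(readings: Dict[str, float]) -> str:
    """Worst level across all readings ("ok" when there are none)."""
    levels = [classify(name, value) for name, value in readings.items()]
    return max(levels, key=_ORDER.index, default="ok")
```

A health check could then restart the bridge when overall_status returns "critical" and only alert on "warning".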
Alerting Rules
ALERT_RULES = {
    "stale_data": {
        "condition": "hz_data_age > 60",
        "severity": "critical",
        "action": "restart_bridge",
        "notify": ["ops", "trading"]
    },
    "high_lag": {
        "condition": "file_to_hz_lag > 10",
        "severity": "warning",
        "action": "log_only",
        "notify": ["ops"]
    },
    "process_crash": {
        "condition": "process_exit_code != 0",
        "severity": "critical",
        "action": "auto_restart",
        "notify": ["ops"]
    }
}
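The rule conditions are plain strings, so something has to evaluate them against current metrics. One option, sketched here, is a tiny parser for the "<metric> <operator> <number>" shape used in ALERT_RULES, which avoids calling eval() on configuration text. The helper names condition_holds and triggered_rules are hypothetical, and treating a missing metric as "not triggered" is a design assumption.

```python
import operator
import re
from typing import Dict, List

# Comparison operators allowed in a rule condition
_OPS = {
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
    "==": operator.eq, "!=": operator.ne,
}
# Two-character operators come first so ">=" is not read as ">"
_CONDITION = re.compile(r"^\s*(\w+)\s*(>=|<=|==|!=|>|<)\s*(-?\d+(?:\.\d+)?)\s*$")


def condition_holds(condition: str, metrics: Dict[str, float]) -> bool:
    """Evaluate a "<metric> <op> <number>" condition without eval()."""
    match = _CONDITION.match(condition)
    if match is None:
        raise ValueError(f"Unsupported condition: {condition!r}")
    name, op, threshold = match.groups()
    if name not in metrics:
        return False  # metric not reported this cycle
    return _OPS[op](metrics[name], float(threshold))


def triggered_rules(rules: Dict[str, dict], metrics: Dict[str, float]) -> List[str]:
    """Names of all rules whose condition holds for the given metrics."""
    return [
        name for name, rule in rules.items()
        if condition_holds(rule["condition"], metrics)
    ]
```

Calling triggered_rules(ALERT_RULES, metrics) on each health check cycle would yield the rule names whose action and notify fields should then be acted on.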
7. Conclusion
The scan bridge SHOULD be integrated into Prefect orchestration using Option B (Prefect Daemon). This provides:
- Automatic management - Start, stop, restart handled by Prefect
- Unified observability - Logs, metrics, alerts in one place
- Self-healing - Automatic restart on failure
- No root required - Runs as dolphin user
Next Steps:
- Implement scan_bridge_prefect_daemon.py
- Create Prefect deployment
- Add to SYSTEM_BIBLE v4.1
Document: SCAN_BRIDGE_PREFECT_INTEGRATION_STUDY.md
Version: 1.0
Author: DOLPHIN System Architecture
Date: 2026-03-24