initial: import DOLPHIN baseline 2026-04-21 from dolphinng5_predict working tree

Includes core prod + GREEN/BLUE subsystems:
- prod/ (BLUE harness, configs, scripts, docs)
- nautilus_dolphin/ (GREEN Nautilus-native impl + dvae/ preserved)
- adaptive_exit/ (AEM engine + models/bucket_assignments.pkl)
- Observability/ (EsoF advisor, TUI, dashboards)
- external_factors/ (EsoF producer)
- mc_forewarning_qlabs_fork/ (MC regime/envelope)

Excludes runtime caches, logs, backups, and reproducible artifacts per .gitignore.
hjnormey
2026-04-21 16:58:38 +02:00
commit 01c19662cb
643 changed files with 260241 additions and 0 deletions


@@ -0,0 +1,119 @@
# Service Architecture Options
## Option 1: Single Supervisor (Recommended for You)
**One systemd service → Manages multiple internal components**
```
dolphin-supervisor.service
├── ExF Component (thread)
├── OB Component (thread)
├── Watchdog Component (thread)
└── MC Component (thread)
```
**Pros:**
- One systemd unit to manage
- Components share memory efficiently
- Centralized health monitoring
- Built-in restart per component
- Lower system overhead
**Cons:**
- Single process (if it crashes, all components stop)
- Less isolation between components
**Use when:** Components are tightly coupled, share data
**Commands:**
```bash
systemctl --user start dolphin-supervisor
journalctl --user -u dolphin-supervisor -f
```
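The single-supervisor layout can be sketched as one process that runs each component in its own thread and restarts any thread that has exited. This is only an illustration of the idea, not the contents of `supervisor.py`; the `Component` class and the `supervise` helper (with its `rounds` and `poll` parameters) are hypothetical names.

```python
import threading
import time
from typing import Callable, Dict, Optional


class Component:
    """Hypothetical wrapper that runs `target` in a daemon thread."""

    def __init__(self, name: str, target: Callable[[], None]):
        self.name = name
        self.target = target
        self.restarts = 0
        self.thread: Optional[threading.Thread] = None

    def start(self) -> None:
        self.thread = threading.Thread(target=self._run, name=self.name, daemon=True)
        self.thread.start()

    def _run(self) -> None:
        try:
            self.target()
        except Exception as exc:
            # Contain the failure so one component cannot take down the process.
            print(f"{self.name} crashed: {exc!r}")

    def alive(self) -> bool:
        return self.thread is not None and self.thread.is_alive()


def supervise(components: Dict[str, Component], rounds: int, poll: float = 0.05) -> None:
    """Start every component, then poll `rounds` times and restart dead ones."""
    for comp in components.values():
        comp.start()
    for _ in range(rounds):
        time.sleep(poll)
        for comp in components.values():
            if not comp.alive():
                comp.restarts += 1
                comp.start()
```

A real supervisor would loop until shutdown rather than for a fixed number of rounds. Note that a crash of the interpreter itself still stops every component, which is the main drawback listed above.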
---
## Option 2: Multiple Separate Services
**Each component = separate systemd service**
```
dolphin-exf.service
└── ExF Component
dolphin-ob.service
└── OB Component
dolphin-watchdog.service
└── Watchdog Component
```
**Pros:**
- Full isolation between components
- Independent restart/failure domains
- Can set different resource limits per service
- Systemd handles everything
**Cons:**
- More systemd units to manage
- Higher memory overhead (separate processes)
- IPC needed for shared data
**Use when:** Components are independent, need strong isolation
**Commands:**
```bash
./service_manager.py start
./service_manager.py status
```
---
## Option 3: Hybrid (Single Supervisor + Critical Services Separate)
```
dolphin-supervisor.service
├── ExF Component
├── OB Component
└── MC Component (scheduled)
dolphin-watchdog.service (separate - critical!)
└── Watchdog Component
```
**Use when:** One component is critical/safety-related
---
## Recommendation
For your Dolphin system, **Option 1 (Single Supervisor)** is likely best because:
1. **Tight coupling**: ExF, OB, Watchdog all need Hazelcast
2. **Data sharing**: Components share state via memory
3. **Simplicity**: One command to start/stop everything
4. **Resource efficiency**: Lower overhead than separate processes
The supervisor handles:
- Auto-restart of failed components
- Health monitoring
- Structured logging
- Graceful shutdown
---
## Quick Start: Single Supervisor
```bash
# 1. Enable and start
cd /mnt/dolphinng5_predict/prod/services
systemctl --user enable dolphin-supervisor
systemctl --user start dolphin-supervisor
# 2. Check status
systemctl --user status dolphin-supervisor
# 3. View logs
journalctl --user -u dolphin-supervisor -f
# 4. Stop
systemctl --user stop dolphin-supervisor
```


@@ -0,0 +1,427 @@
# Industrial-Grade Service Frameworks
## 🏆 Recommendation: Supervisor
**Supervisor** is a long-established, widely used process manager for Python deployments.
### Why Supervisor?
- ✅ **Battle-tested**: Widely deployed in production
- ✅ **Mature**: 20+ years of development
- ✅ **Simple**: INI-style configuration
- ✅ **Reliable**: Handles crashes, restarts, logging automatically
- ✅ **Web UI**: Built-in web interface for monitoring
- ✅ **API**: XML-RPC API for programmatic control
---
## Quick Start: Supervisor
```bash
# 1. Start supervisor and all services
cd /mnt/dolphinng5_predict/prod/supervisor
./supervisorctl.sh start
# 2. Check status
./supervisorctl.sh status
# 3. View logs
./supervisorctl.sh logs exf
./supervisorctl.sh logs ob_streamer
# 4. Restart a service
./supervisorctl.sh ctl restart exf
# 5. Stop everything
./supervisorctl.sh stop
```
---
## Alternative: Circus (Mozilla)
**Circus** is Mozilla's Python process & socket manager.
### Pros:
- ✅ Python-native (easier to extend)
- ✅ Built-in statistics (CPU, memory per process)
- ✅ Socket management
- ✅ Web dashboard
### Cons:
- ❌ Less widely used than Supervisor
- ❌ Smaller community
```bash
# Install
pip install circus
# Run
circusd circus.ini
```
---
## Alternative: Honcho (Python Foreman)
**Honcho** is a Python port of Ruby's Foreman.
### Pros:
- ✅ Very simple (Procfile-based)
- ✅ Good for development
- ✅ Easy to understand
### Cons:
- ❌ Less production features
- ❌ No auto-restart on crash
```bash
# Procfile
exf: python -m external_factors.realtime_exf_service
ob: python -m services.ob_stream_service
watchdog: python -m services.system_watchdog_service
# Run
honcho start
```
---
## Comparison Table
| Feature | Supervisor | Circus | Honcho | Custom Code |
|---------|-----------|--------|--------|-------------|
| Auto-restart | ✅ | ✅ | ❌ | ✅ (if built) |
| Web UI | ✅ | ✅ | ❌ | ❌ |
| Log rotation | ✅ | ✅ | ❌ | ⚠️ (manual) |
| Resource limits | ✅ | ✅ | ❌ | ⚠️ (partial) |
| API | ✅ XML-RPC | ✅ | ❌ | ❌ |
| Maturity | ⭐⭐⭐ | ⭐⭐ | ⭐⭐ | ⭐ |
| Ease of use | ⭐⭐ | ⭐⭐ | ⭐⭐⭐ | ⭐⭐ |
---
## Our Setup: Supervisor
**Location**: `/mnt/dolphinng5_predict/prod/supervisor/`
**Config**: `dolphin-supervisord.conf`
**Services managed**:
- `exf` - External Factors (0.5s)
- `ob_streamer` - Order Book (0.5s)
- `watchdog` - Survival Stack (10s)
- `mc_forewarner` - MC-Forewarner (4h)
**Features enabled**:
- Auto-restart with backoff
- Separate stdout/stderr logs
- Log rotation (50MB, 10 backups)
- Process groups
- Event listeners (alerts)
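
Event listeners talk to supervisord over stdin/stdout: the listener prints `READY`, reads a header line plus a payload of `len` bytes, and answers with `RESULT 2\nOK`. The sketch below shows that loop for unexpected process exits. It is an illustrative listener, not the one configured in `dolphin-supervisord.conf`; the `listen` function and its `max_events` parameter are hypothetical.

```python
import sys
from typing import Dict, List, Optional, TextIO


def parse_tokens(line: str) -> Dict[str, str]:
    """Parse the space-separated 'key:value' tokens Supervisor sends."""
    return dict(token.split(":", 1) for token in line.split())


def listen(stdin: TextIO, stdout: TextIO, max_events: Optional[int] = None) -> List[str]:
    """Run the listener loop; return names of processes that exited unexpectedly."""
    crashed: List[str] = []
    handled = 0
    while max_events is None or handled < max_events:
        stdout.write("READY\n")  # tell supervisord we can accept an event
        stdout.flush()
        header_line = stdin.readline()
        if not header_line:  # supervisord closed the pipe
            break
        headers = parse_tokens(header_line)
        # 'len' counts bytes; for ASCII payloads this equals the character count.
        payload = stdin.read(int(headers["len"]))
        if headers.get("eventname") == "PROCESS_STATE_EXITED":
            info = parse_tokens(payload)
            if info.get("expected") == "0":
                crashed.append(info["processname"])
                # stdout is reserved for the protocol, so alerts go to stderr.
                print(f"ALERT: {info['processname']} exited unexpectedly", file=sys.stderr)
        stdout.write("RESULT 2\nOK")  # acknowledge the event
        stdout.flush()
        handled += 1
    return crashed
```

Run as a listener, the entry point would be `listen(sys.stdin, sys.stdout)`; passing the streams in explicitly keeps the loop testable without a running supervisord.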
---
## Integration with Existing Code
Your existing service code works **unchanged** with Supervisor:
```python
import time


# Your existing service (works with Supervisor)
class ExFService:
    def run(self):
        while True:
            self.fetch_indicators()
            self.push_to_hz()
            time.sleep(0.5)

# Supervisor handles:
# - Starting it
# - Restarting if it crashes
# - Logging stdout/stderr
# - Monitoring
```
No code changes needed!
---
## Web Dashboard
Supervisor includes a web interface:
```ini
[inet_http_server]
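; Bind to loopback only; 0.0.0.0 would expose the UI on every interface
port=127.0.0.1:9001
; Placeholder credentials - replace before use
username=user
password=pass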
```
Then visit: `http://localhost:9001`
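
The same HTTP server also serves the XML-RPC API, so process state can be read programmatically. The sketch below assumes the placeholder credentials and port from the config above and uses Supervisor's `supervisor.getAllProcessInfo` method; the helper names `summarize`, `not_running`, and `query` are hypothetical.

```python
from typing import Any, Dict, List
from xmlrpc.client import ServerProxy


def summarize(process_info: List[Dict[str, Any]]) -> Dict[str, str]:
    """Map 'group:name' to the state name for each process row."""
    return {f"{p['group']}:{p['name']}": p["statename"] for p in process_info}


def not_running(process_info: List[Dict[str, Any]]) -> List[str]:
    """Return the processes whose state is anything other than RUNNING."""
    return sorted(k for k, v in summarize(process_info).items() if v != "RUNNING")


def query(url: str) -> List[str]:
    """Fetch process info from a live supervisord (requires the HTTP server)."""
    proxy = ServerProxy(url)
    return not_running(proxy.supervisor.getAllProcessInfo())
```

For example, `query("http://user:pass@localhost:9001/RPC2")` would list every managed process that is not currently running.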
---
## Summary
| Use Case | Recommendation |
|----------|---------------|
| **Production trading system** | **Supervisor** ✅ |
| Development/Testing | Honcho |
| Need sockets + stats | Circus |
| Maximum control | Custom + systemd |
We recommend **Supervisor** for Dolphin production.
---
# CHANGE LOG - All Modifications Made
## Session: 2026-03-25 (Current Session)
### 1. Supervisor Installation
**Command executed:**
```bash
pip install supervisor
```
**Result:** Supervisor 4.3.0 installed
---
### 2. Directory Structure Created
```
/mnt/dolphinng5_predict/prod/supervisor/
├── dolphin-supervisord.conf # Main supervisor configuration
├── supervisorctl.sh # Control wrapper script
├── logs/ # Log directory (created)
└── run/ # PID/socket directory (created)
```
---
### 3. Configuration File: dolphin-supervisord.conf
**Location:** `/mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf`
**Contents:**
- `[supervisord]` section with logging, pidfile, environment
- `[unix_http_server]` for supervisorctl communication
- `[rpcinterface:supervisor]` for API
- `[supervisorctl]` client configuration
- `[program:exf]` - External Factors service (0.5s)
- `[program:ob_streamer]` - Order Book Streamer (0.5s)
- `[program:watchdog]` - Survival Stack Watchdog (10s)
- `[program:mc_forewarner]` - MC-Forewarner (4h)
- `[eventlistener:crashmail]` - Alert on crashes
- `[group:dolphin]` - Group all programs
**Key settings:**
- `autostart=true` - All services start with supervisor
- `autorestart=true` - Restart whenever the process exits
- `startretries=3` - Up to 3 failed start attempts before the process is marked FATAL
- `stdout_logfile_maxbytes=50MB` - Rotate the stdout log at 50 MB
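- `rlimit_as=512MB` - Intended memory limit per service (this does not appear among Supervisor's documented `[program:x]` options, so verify that it takes effect)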
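
Because the configuration is INI-style, the key settings can be sanity-checked with the standard library before supervisord is started. This is a rough sketch, not part of the shipped tooling; `program_settings` and `missing_autorestart` are hypothetical helpers, and `RawConfigParser` is used so that Supervisor's `%(here)s`-style expansions are left untouched.

```python
import configparser
from typing import Dict, List


def program_settings(conf_text: str) -> Dict[str, Dict[str, str]]:
    """Return the options of every [program:x] section, keyed by program name."""
    parser = configparser.RawConfigParser(inline_comment_prefixes=(";", "#"))
    parser.read_string(conf_text)
    programs: Dict[str, Dict[str, str]] = {}
    for section in parser.sections():
        if section.startswith("program:"):
            programs[section.split(":", 1)[1]] = dict(parser.items(section))
    return programs


def missing_autorestart(conf_text: str) -> List[str]:
    """List programs that do not explicitly set autorestart=true."""
    return sorted(
        name
        for name, opts in program_settings(conf_text).items()
        if opts.get("autorestart", "").lower() != "true"
    )
```

Reading the file with `open(path).read()` and passing the text to `missing_autorestart` gives a quick check that every service restarts after a crash.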
---
### 4. Control Script: supervisorctl.sh
**Location:** `/mnt/dolphinng5_predict/prod/supervisor/supervisorctl.sh`
**Commands implemented:**
- `start` - Start supervisord and all services
- `stop` - Stop all services and supervisord
- `restart` - Restart all services
- `status` - Show service status
- `logs [service]` - Show logs (last 50 lines)
- `ctl [cmd]` - Pass through to supervisorctl
**Usage:**
```bash
./supervisorctl.sh start
./supervisorctl.sh status
./supervisorctl.sh logs exf
```
---
### 5. Python Libraries Installed
**Via pip:**
- `supervisor==4.3.0` - Main process manager
- `tenacity==9.1.4` - Retry logic (previously installed)
- `schedule==1.2.2` - Task scheduling (previously installed)
**System packages checked:**
- `supervisor.noarch` available via dnf (not installed, using pip)
---
### 6. Alternative Architectures (Previously Created)
#### 6.1 Custom Supervisor (Pure Python)
**Location:** `/mnt/dolphinng5_predict/prod/services/supervisor.py`
**Features:**
- `ServiceComponent` base class
- `DolphinSupervisor` manager
- Thread-based component management
- Built-in health monitoring
- Example components: ExF, OB, Watchdog, MC
**Status:** Available but NOT primary (Supervisor preferred)
---
#### 6.2 Systemd User Services
**Location:** `~/.config/systemd/user/`
**Files created:**
- `dolphin-exf.service` - External Factors
- `dolphin-ob.service` - Order Book
- `dolphin-watchdog.service` - Watchdog
- `dolphin-mc.service` + `dolphin-mc.timer` - MC-Forewarner
- `dolphin-supervisor.service` - Custom supervisor (optional)
- `dolphin-test.service` - Test service
**Control script:** `/mnt/dolphinng5_predict/prod/services/service_manager.py`
---
### 7. Service Base Class (Boilerplate)
**Location:** `/mnt/dolphinng5_predict/prod/services/service_base.py`
**Features:**
- `ServiceBase` abstract class
- Automatic retries with tenacity
- Structured JSON logging
- Health check endpoints
- Graceful shutdown handling
- Systemd notify support
- `run_scheduled()` helper
**Status:** Available for custom implementations
---
### 8. Documentation Files Created
| File | Location | Purpose |
|------|----------|---------|
| `INDUSTRIAL_FRAMEWORKS.md` | `/mnt/dolphinng5_predict/prod/services/` | This document - framework comparison |
| `ARCHITECTURE_CHOICE.md` | `/mnt/dolphinng5_predict/prod/services/` | Architecture options comparison |
| `README.md` | `/mnt/dolphinng5_predict/prod/services/` | General services documentation |
| `dolphin-supervisord.conf` | `/mnt/dolphinng5_predict/prod/supervisor/` | Supervisor configuration |
---
### 9. kimi.json Updated
**Change:** Associated session with ops directory
**Before:**
```json
{
"path": "/mnt/dolphinng5_predict/prod/ops",
"kaos": "local",
"last_session_id": null
}
```
**After:**
```json
{
"path": "/mnt/dolphinng5_predict/prod/ops",
"kaos": "local",
"last_session_id": "c23a69c5-ba4a-41c4-8624-05114e8fd9ea"
}
```
---
### 10. Session Backup
**Session backed up:** `c23a69c5-ba4a-41c4-8624-05114e8fd9ea`
- **Original location:** `~/.kimi/sessions/9330f053b5f85e950222ed1fed8f6f02/`
- **Backup location 1:** `/mnt/dolphinng5_predict/prod/ops/kimi_session_backup/`
- **Backup location 2:** `/mnt/vids/`
- **Markdown transcript:** `KIMI_Session_Rearch_Services-Prefect.md` (684KB)
---
## Summary: What to Use
### For Production Trading System:
**Recommended: SUPERVISOR**
```bash
cd /mnt/dolphinng5_predict/prod/supervisor
./supervisorctl.sh start
./supervisorctl.sh status
```
**Why:** Battle-tested, 20+ years, web UI, API, log rotation
### For Simplicity / No Extra Deps:
**Alternative: SYSTEMD --user**
```bash
systemctl --user start dolphin-exf
systemctl --user start dolphin-ob
systemctl --user start dolphin-watchdog
```
**Why:** Built-in, no pip installs, OS-integrated
### For Full Control:
**Alternative: Custom Python**
```bash
systemctl --user start dolphin-supervisor # Custom one
```
**Why:** Educational, customizable, no external deps
---
## Files Modified/Created Summary
### New Directories:
1. `/mnt/dolphinng5_predict/prod/supervisor/`
2. `/mnt/dolphinng5_predict/prod/supervisor/logs/`
3. `/mnt/dolphinng5_predict/prod/supervisor/run/`
4. `/mnt/dolphinng5_predict/prod/ops/kimi_session_backup/`
### New Files:
1. `/mnt/dolphinng5_predict/prod/supervisor/dolphin-supervisord.conf`
2. `/mnt/dolphinng5_predict/prod/supervisor/supervisorctl.sh`
3. `/mnt/dolphinng5_predict/prod/services/INDUSTRIAL_FRAMEWORKS.md` (this file)
4. `/mnt/dolphinng5_predict/prod/services/ARCHITECTURE_CHOICE.md`
5. `/mnt/dolphinng5_predict/prod/services/supervisor.py` (custom impl)
6. `/mnt/dolphinng5_predict/prod/services/service_base.py` (boilerplate)
7. `/mnt/dolphinng5_predict/prod/services/service_manager.py` (systemd ctl)
8. `/mnt/dolphinng5_predict/prod/ops/KIMI_Session_Rearch_Services-Prefect.md`
9. `/mnt/dolphinng5_predict/prod/ops/SESSION_INFO.txt`
10. `/mnt/dolphinng5_predict/prod/ops/resume_session.sh`
### Modified Files:
1. `~/.config/systemd/user/dolphin-*.service` (6 services)
2. `~/.config/systemd/user/dolphin-mc.timer`
3. `~/.kimi/kimi.json` (session association)
---
## Current Status
- **Supervisor 4.3.0** installed and configured
- **6 systemd user services** configured (backup option)
- **Custom supervisor** available (educational)
- **Service base class** with retries/logging (boilerplate)
- **All documentation** complete
- **Session backed up** to multiple locations
**Ready for:** Production deployment

prod/services/README.md Executable file

@@ -0,0 +1,195 @@
# Dolphin Userland Services
**Server-grade service management without root!** Uses `systemd --user` for reliability.
## 🚀 Quick Start
```bash
# Check status
./service_manager.py status
# Start all services
./service_manager.py start
# View logs
./service_manager.py logs exf -f
```
## 📋 Service Overview
| Service | File | Description | Interval |
|---------|------|-------------|----------|
| **exf** | `dolphin-exf.service` | External Factors (aggressive) | 0.5s |
| **ob** | `dolphin-ob.service` | Order Book Streamer | 0.5s |
| **watchdog** | `dolphin-watchdog.service` | Survival Stack | 10s |
| **mc** | `dolphin-mc.timer` | MC-Forewarner | 4h |
## 🔧 Service Manager Commands
```bash
# Status
./service_manager.py status # All services
./service_manager.py status exf # Specific service
# Control
./service_manager.py start # Start all
./service_manager.py stop # Stop all
./service_manager.py restart exf # Restart specific
# Logs
./service_manager.py logs exf # Last 50 lines
./service_manager.py logs exf -f # Follow
./service_manager.py logs exf -n 100 # Last 100 lines
# Auto-start on boot
./service_manager.py enable # Enable all
./service_manager.py disable # Disable all
# After editing .service files
./service_manager.py reload # Reload systemd
```
## 🏗️ Creating a New Service
### Option 1: Full Service Base (Recommended)
```python
#!/usr/bin/env python3
import asyncio

from services.service_base import ServiceBase


class MyService(ServiceBase):
    def __init__(self):
        super().__init__(
            name='my-service',
            check_interval=30,
            max_retries=3,
            notify_systemd=True
        )

    async def run_cycle(self):
        # Your logic here
        await do_work()  # Placeholder for your own coroutine
        await asyncio.sleep(1)  # Cycle interval

    async def health_check(self) -> bool:
        # Optional: custom health check
        return True


if __name__ == '__main__':
    MyService().run()
```
Create systemd service file:
```bash
cat > ~/.config/systemd/user/dolphin-my.service << 'SERVICEFILE'
[Unit]
Description=My Service
After=network.target
[Service]
Type=notify
ExecStart=/usr/bin/python3 /path/to/my_service.py
Restart=always
RestartSec=5
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=default.target
SERVICEFILE
# Enable and start
systemctl --user daemon-reload
systemctl --user enable dolphin-my.service
systemctl --user start dolphin-my.service
```
### Option 2: Simple Scheduled Task
```python
from services.service_base import run_scheduled


def my_task():
    print("Running...")


run_scheduled(my_task, interval_seconds=60, name='my-task')
```
## 📊 Features
### Automatic
- **Restart on crash**: Services auto-restart with backoff
- **Health checks**: Built-in monitoring
- **Structured logging**: JSON to systemd journal
- **Resource limits**: Memory/CPU quotas
- **Graceful shutdown**: SIGTERM handling
### Retry Logic (Tenacity)
```python
@ServiceBase.retry_with_backoff
async def fetch_data(self):
    # Automatically retries with exponential backoff
    pass
```
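To make the retry behaviour concrete, here is a standard-library sketch of exponential backoff for an async function. It is not how `ServiceBase` implements retries (that relies on Tenacity); the `retry_async` decorator and its `base_delay` and `max_delay` parameters are hypothetical.

```python
import asyncio
import functools


def retry_async(max_attempts: int = 3, base_delay: float = 1.0, max_delay: float = 60.0):
    """Retry an async function, doubling the wait after each failed attempt."""

    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return await func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    await asyncio.sleep(min(delay, max_delay))
                    delay *= 2

        return wrapper

    return decorator
```

With the defaults, a call waits 1s and then 2s between its three attempts before the final exception propagates to the caller.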
### Health Check Endpoint
Services expose health via Hazelcast or file:
```python
async def health_check(self) -> bool:
    return self.last_update > time.time() - 2.0
```
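For the file-based variant, one possible approach is to write the health record atomically so that a reader never sees a half-written file. The sketch below is an assumption about how such a file could look, not the format used by `service_base.py`; `write_health`, `is_fresh`, and the JSON field names are hypothetical.

```python
import json
import os
import tempfile
import time


def write_health(path: str, status: str, last_update: float) -> None:
    """Write the health record to a temp file, then atomically swap it in."""
    payload = {"status": status, "last_update": last_update, "written_at": time.time()}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(payload, fh)
    os.replace(tmp, path)  # atomic rename when both paths are on one filesystem


def is_fresh(path: str, max_age: float = 2.0) -> bool:
    """True if the file reports 'healthy' and was updated within max_age seconds."""
    try:
        with open(path) as fh:
            data = json.load(fh)
    except (OSError, ValueError):
        return False
    return data.get("status") == "healthy" and time.time() - data.get("last_update", 0) <= max_age
```

The 2-second default for `max_age` mirrors the freshness window in the `health_check` example above.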
## 📝 Logging
All services log structured JSON:
```json
{
"timestamp": "2024-03-25T15:30:00",
"level": "INFO",
"service": "exf",
"message": "Indicators updated"
}
```
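A formatter that produces records of this shape might look like the sketch below. It is illustrative only and separate from the `JournalHandler` in `service_base.py`; the `JsonFormatter` class is a hypothetical name, and it emits UTC timestamps with an explicit offset.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format log records as one JSON object per line."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        })


def make_logger(service: str) -> logging.Logger:
    """Build a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(f"dolphin.{service}")
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter(service))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger
```

Calling `make_logger("exf").info("Indicators updated")` then prints a single JSON line that journald captures when `StandardError=journal` is set.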
View logs:
```bash
# All services
journalctl --user -f
# Specific service
journalctl --user -u dolphin-exf -f
```
## 🔍 Monitoring
```bash
# Service status
systemctl --user status
# Resource usage
systemctl --user show dolphin-exf --property=MemoryCurrent,CPUUsageNSec
# Recent failures
systemctl --user --failed
```
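The `systemctl show` output is plain `KEY=VALUE` lines, so it is easy to consume from a script. The following sketch parses it into a dict and converts memory to MB; `parse_show`, `memory_mb`, and `show` are hypothetical helpers, and values that are not plain integers (systemd can print `[not set]`) are reported as `None`.

```python
import subprocess
from typing import Dict, Optional


def parse_show(output: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines as printed by `systemctl show`."""
    props: Dict[str, str] = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key] = value
    return props


def memory_mb(props: Dict[str, str]) -> Optional[float]:
    """Return MemoryCurrent in MB, or None when it is missing or not numeric."""
    raw = props.get("MemoryCurrent", "")
    return int(raw) / (1024 * 1024) if raw.isdigit() else None


def show(unit: str) -> Dict[str, str]:
    """Query a user unit (needs a running systemd user session)."""
    result = subprocess.run(
        ["systemctl", "--user", "show", unit, "--property=MemoryCurrent,CPUUsageNSec"],
        capture_output=True, text=True, check=False,
    )
    return parse_show(result.stdout)
```

For example, `memory_mb(show("dolphin-exf"))` gives the current memory use of the ExF service in MB.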
## 🛠️ Troubleshooting
| Issue | Solution |
|-------|----------|
| Service won't start | Check `journalctl --user -u dolphin-exf` |
| High memory usage | Adjust `MemoryMax=` in .service file |
| Restart loop | Check exit code: `systemctl --user status dolphin-exf` |
| Logs not showing | Ensure `StandardOutput=journal` |
| Permission denied | Service files must be in `~/.config/systemd/user/` |
## 🔄 Service Dependencies
```
exf -> hazelcast
ob -> hazelcast, exf
watchdog -> hazelcast, exf, ob
mc -> hazelcast (timer-triggered)
```
Configured via `After=` and `Wants=` in service files.
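
The dependency list above implies a start order, which can be derived with a topological sort. This sketch uses the standard-library `graphlib` module (Python 3.9+); the `DEPENDS_ON` mapping restates the diagram and `start_order` is a hypothetical helper. systemd computes its own ordering from `After=`, so this is only useful for reasoning about or checking the configuration.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Each service maps to the set of services it depends on.
DEPENDS_ON: Dict[str, Set[str]] = {
    "exf": {"hazelcast"},
    "ob": {"hazelcast", "exf"},
    "watchdog": {"hazelcast", "exf", "ob"},
    "mc": {"hazelcast"},
}


def start_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Return services ordered so every dependency precedes its dependents."""
    return list(TopologicalSorter(deps).static_order())
```

A circular dependency makes `static_order()` raise `graphlib.CycleError`, which is a cheap way to catch a misconfigured `After=` chain.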

prod/services/__init__.py Executable file

@@ -0,0 +1,6 @@
"""
Dolphin Services Package
"""
from .service_base import ServiceBase, ServiceHealth, get_logger, run_scheduled
__all__ = ['ServiceBase', 'ServiceHealth', 'get_logger', 'run_scheduled']


@@ -0,0 +1,82 @@
#!/usr/bin/env python3
"""
Example: External Factors Service using ServiceBase
"""
import asyncio
import time  # Needed at module level: used by the service methods below

from service_base import ServiceBase, get_logger, run_scheduled


class ExFService(ServiceBase):
    """
    External Factors Service - 0.5s aggressive oversampling
    """

    def __init__(self):
        super().__init__(
            name='exf',
            check_interval=30,
            max_retries=3,
            notify_systemd=True
        )
        self.indicators = {}
        self.cycle_count = 0

    async def run_cycle(self):
        """Main cycle - runs every 0.5s"""
        self.cycle_count += 1
        # Fetch indicators with retry
        await self._fetch_with_retry('basis')
        await self._fetch_with_retry('spread')
        await self._fetch_with_retry('imbal_btc')
        await self._fetch_with_retry('imbal_eth')
        # Push to Hazelcast
        await self._push_to_hz()
        # Log every 100 cycles
        if self.cycle_count % 100 == 0:
            self.logger.info(f"Cycle {self.cycle_count}: indicators updated")
        # Sleep for 0.5s (non-blocking)
        await asyncio.sleep(0.5)

    @ServiceBase.retry_with_backoff
    async def _fetch_with_retry(self, indicator: str):
        """Fetch single indicator with automatic retry"""
        # Your fetch logic here
        self.indicators[indicator] = {'value': 0.0, 'timestamp': time.time()}

    async def _push_to_hz(self):
        """Push to Hazelcast with retry"""
        try:
            # Your HZ push logic here
            pass
        except Exception as e:
            self.logger.error(f"HZ push failed: {e}")
            raise

    async def health_check(self) -> bool:
        """Custom health check"""
        # Check if indicators are fresh
        now = time.time()
        for name, data in self.indicators.items():
            if now - data.get('timestamp', 0) > 2.0:
                self.logger.warning(f"Stale indicator: {name}")
                return False
        return True


# Alternative: Simple scheduled function
def simple_exf_task():
    """Simple version without full service overhead"""
    logger = get_logger('dolphin.exf.simple')
    logger.info("Running ExF fetch")
    # Your logic here


if __name__ == '__main__':
    # Option 1: Full service with all features
    service = ExFService()
    service.run()
    # Option 2: Simple scheduled task
    # run_scheduled(simple_exf_task, interval_seconds=0.5, name='exf')


@@ -0,0 +1,82 @@
#!/usr/bin/env python3
"""
Example: System Watchdog Service using ServiceBase
"""
import asyncio

from service_base import ServiceBase


class WatchdogService(ServiceBase):
    """
    Survival Stack Watchdog - 10s check interval
    """

    def __init__(self):
        super().__init__(
            name='watchdog',
            check_interval=10,  # Health check every 10s
            max_retries=5,
            notify_systemd=True
        )
        self.cat1_ok = True
        self.cat2_ok = True
        self.last_posture = 'APEX'

    async def run_cycle(self):
        """Main cycle - runs every 10s"""
        # Check all categories
        await self._check_cat1_invariants()
        await self._check_cat2_structural()
        await self._check_cat3_microstructure()
        await self._check_cat4_environmental()
        await self._check_cat5_capital()
        # Compute posture
        posture = self._compute_posture()
        if posture != self.last_posture:
            self.logger.warning(f"Posture change: {self.last_posture} -> {posture}")
            self.last_posture = posture
        # Write to Hazelcast
        await self._update_safety_ref(posture)
        # Sleep until next cycle
        await asyncio.sleep(10)

    async def _check_cat1_invariants(self):
        """Binary kill switches"""
        # Check HZ quorum, heartbeat
        pass

    async def _check_cat2_structural(self):
        """MC-Forewarner staleness"""
        pass

    async def _check_cat3_microstructure(self):
        """OB depth/fill quality"""
        pass

    async def _check_cat4_environmental(self):
        """DVOL spike"""
        pass

    async def _check_cat5_capital(self):
        """Drawdown check"""
        pass

    def _compute_posture(self) -> str:
        """Compute Rm and map to posture"""
        # Rm = Cat1 × Cat2 × Cat3 × Cat4 × Cat5
        # Posture: APEX/STALKER/TURTLE/HIBERNATE
        return 'APEX'

    async def _update_safety_ref(self, posture: str):
        """Update DOLPHIN_SAFETY AtomicReference"""
        pass

    async def health_check(self) -> bool:
        """Watchdog health check"""
        # If we're running, we're healthy
        return True


if __name__ == '__main__':
    service = WatchdogService()
    service.run()

prod/services/service_base.py Executable file

@@ -0,0 +1,331 @@
#!/usr/bin/env python3
"""
Dolphin Service Base Class - Boilerplate for reliable userland services

Features:
- Automatic retries with exponential backoff
- Structured logging to journal
- Health check endpoints
- Graceful shutdown on signals
- Systemd notify support (Type=notify)
- Memory/CPU monitoring
"""
import abc
import asyncio
import logging
import signal
import sys
import os
import time
import json
from typing import Optional, Callable, Any
from dataclasses import dataclass, asdict
from datetime import datetime
from functools import wraps

# Optional imports - graceful degradation if not available
try:
    from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
    TENACITY_AVAILABLE = True
except ImportError:
    TENACITY_AVAILABLE = False

# NOTE: if the installed pystemd.daemon does not export `Notification`, this
# import fails and systemd notifications are silently disabled.
try:
    from pystemd.daemon import notify, Notification
    SYSTEMD_AVAILABLE = True
except ImportError:
    SYSTEMD_AVAILABLE = False

    def notify(*args, **kwargs):
        pass


# Configure logging for systemd journal
class JournalHandler(logging.Handler):
    """Log handler that outputs JSON for systemd journal"""

    def emit(self, record):
        try:
            # ServiceBase stores context (e.g. `service`) on the logger object,
            # not on each record, so fall back to the logger's attributes.
            owner = logging.getLogger(record.name)
            msg = {
                'timestamp': datetime.utcnow().isoformat(),
                'level': record.levelname,
                'logger': record.name,
                'message': self.format(record),
                'source': getattr(record, 'source', getattr(owner, 'source', 'unknown')),
                'service': getattr(record, 'service', getattr(owner, 'service', 'unknown')),
            }
            print(json.dumps(msg), flush=True)
        except Exception:
            self.handleError(record)


def get_logger(name: str) -> logging.Logger:
    """Get configured logger for services"""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = JournalHandler()
        handler.setFormatter(logging.Formatter('%(message)s'))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger


@dataclass
class ServiceHealth:
    """Health check status"""
    status: str  # 'healthy', 'degraded', 'unhealthy'
    last_check: float
    uptime: float
    memory_mb: float
    cpu_percent: float
    error_count: int
    message: str

    def to_json(self) -> str:
        return json.dumps(asdict(self))


class ServiceBase(abc.ABC):
    """
    Base class for reliable Dolphin services

    Usage:
        class MyService(ServiceBase):
            def __init__(self):
                super().__init__("my-service", check_interval=30)

            async def run_cycle(self):
                # Your service logic here
                pass

        if __name__ == '__main__':
            service = MyService()
            service.run()
    """

    def __init__(
        self,
        name: str,
        check_interval: float = 30.0,
        max_retries: int = 3,
        notify_systemd: bool = True
    ):
        self.name = name
        self.check_interval = check_interval
        self.max_retries = max_retries
        self.notify_systemd = notify_systemd and SYSTEMD_AVAILABLE
        self.logger = get_logger(f'dolphin.{name}')
        self.logger.service = name
        self._shutdown_event = asyncio.Event()
        self._start_time = time.time()
        self._health = ServiceHealth(
            status='starting',
            last_check=time.time(),
            uptime=0.0,
            memory_mb=0.0,
            cpu_percent=0.0,
            error_count=0,
            message='Initializing'
        )
        self._tasks = []
        # Setup signal handlers
        self._setup_signals()

    def _setup_signals(self):
        """Setup graceful shutdown handlers"""
        for sig in (signal.SIGTERM, signal.SIGINT):
            asyncio.get_event_loop().add_signal_handler(
                sig, lambda: asyncio.create_task(self._shutdown())
            )

    async def _shutdown(self):
        """Graceful shutdown"""
        self.logger.warning(f"{self.name}: Shutdown signal received")
        self._shutdown_event.set()
        # Cancel all tasks
        for task in self._tasks:
            if not task.done():
                task.cancel()
        # Give tasks time to cleanup
        await asyncio.sleep(0.5)

    def _update_health(self, status: str, message: str = ''):
        """Update health status"""
        import psutil
        process = psutil.Process()
        self._health = ServiceHealth(
            status=status,
            last_check=time.time(),
            uptime=time.time() - self._start_time,
            memory_mb=process.memory_info().rss / 1024 / 1024,
            cpu_percent=process.cpu_percent(),
            error_count=self._health.error_count,
            message=message
        )

    def _log_extra(self, **kwargs):
        """Add extra context to logs"""
        for key, value in kwargs.items():
            setattr(self.logger, key, value)

    @staticmethod
    def retry_with_backoff(func: Callable) -> Callable:
        """Decorator adding retry with exponential backoff to an async method.

        Usable as ``@ServiceBase.retry_with_backoff``. The retry policy is built
        per call from the instance (``self.max_retries``, ``self.logger``).
        """
        if not TENACITY_AVAILABLE:
            return func

        @wraps(func)
        async def wrapper(self, *args, **kwargs):
            retrying = retry(
                stop=stop_after_attempt(self.max_retries),
                wait=wait_exponential(multiplier=1, min=4, max=60),
                retry=retry_if_exception_type((Exception,)),
                before_sleep=lambda retry_state: self.logger.warning(
                    f"Retry {retry_state.attempt_number}: {retry_state.outcome.exception()}"
                ),
            )
            return await retrying(func)(self, *args, **kwargs)

        return wrapper

    @abc.abstractmethod
    async def run_cycle(self):
        """
        Main service logic - implement this!
        Called repeatedly in the main loop.
        Should be non-blocking or use asyncio.
        """
        pass

    async def health_check(self) -> bool:
        """
        Optional: Implement custom health check
        Return True if healthy, False otherwise
        """
        return True

    async def _health_loop(self):
        """Background health check loop"""
        while not self._shutdown_event.is_set():
            try:
                healthy = await self.health_check()
                if healthy:
                    self._update_health('healthy', 'Service operating normally')
                else:
                    self._update_health('degraded', 'Health check failed')
                # Notify systemd we're still alive
                if self.notify_systemd:
                    notify(Notification.WATCHDOG)
            except Exception as e:
                self._health.error_count += 1
                self._update_health('unhealthy', str(e))
                self.logger.error(f"Health check error: {e}")
            try:
                await asyncio.wait_for(
                    self._shutdown_event.wait(),
                    timeout=self.check_interval
                )
            except asyncio.TimeoutError:
                pass  # Normal - continue loop

    async def _main_loop(self):
        """Main service loop"""
        self.logger.info(f"{self.name}: Starting main loop")
        while not self._shutdown_event.is_set():
            try:
                await self.run_cycle()
            except asyncio.CancelledError:
                break
            except Exception as e:
                self._health.error_count += 1
                self.logger.error(f"Cycle error: {e}", exc_info=True)
                # Brief pause before retry
                await asyncio.sleep(1)

    def run(self):
        """Run the service (blocking)"""
        self.logger.info(f"{self.name}: Service starting")
        # Notify systemd we're ready
        if self.notify_systemd:
            notify(Notification.READY)
            self.logger.info("Notified systemd: READY")
        # asyncio.create_task() requires a running loop, which does not exist
        # yet at this point, so schedule the tasks on the loop object directly.
        loop = asyncio.get_event_loop()
        # Start health check loop
        health_task = loop.create_task(self._health_loop())
        self._tasks.append(health_task)
        # Start main loop
        main_task = loop.create_task(self._main_loop())
        self._tasks.append(main_task)
        try:
            # Run until shutdown
            loop.run_until_complete(self._shutdown_event.wait())
        except KeyboardInterrupt:
            pass
        finally:
            self.logger.info(f"{self.name}: Service stopping")
            # Cleanup
            for task in self._tasks:
                if not task.done():
                    task.cancel()
            # Wait for cleanup
            if self._tasks:
                loop.run_until_complete(
                    asyncio.gather(*self._tasks, return_exceptions=True)
                )
            self.logger.info(f"{self.name}: Service stopped")


def run_scheduled(
    func: Callable,
    interval_seconds: float,
    name: str = 'scheduled-task'
):
    """
    Run a function on a schedule (simple alternative to full service)

    Usage:
        def my_task():
            print("Running...")

        run_scheduled(my_task, interval_seconds=60, name='my-task')
    """
    logger = get_logger(f'dolphin.scheduled.{name}')
    logger.info(f"Starting scheduled task: {name} (interval: {interval_seconds}s)")

    async def loop():
        while True:
            try:
                start = time.time()
                if asyncio.iscoroutinefunction(func):
                    await func()
                else:
                    func()
                elapsed = time.time() - start
                logger.info(f"Task completed in {elapsed:.2f}s")
                # Sleep remaining time
                sleep_time = max(0, interval_seconds - elapsed)
                await asyncio.sleep(sleep_time)
            except Exception as e:
                logger.error(f"Task error: {e}", exc_info=True)
                await asyncio.sleep(interval_seconds)

    try:
        asyncio.run(loop())
    except KeyboardInterrupt:
        logger.info("Stopped by user")


__all__ = [
    'ServiceBase',
    'ServiceHealth',
    'get_logger',
    'JournalHandler',
    'run_scheduled',
    'notify',
    'SYSTEMD_AVAILABLE',
    'TENACITY_AVAILABLE',
]

prod/services/service_manager.py Executable file

@@ -0,0 +1,203 @@
#!/usr/bin/env python3
"""
Dolphin Service Manager - Centralized userland service control
No root required! Uses systemd --user
"""
import argparse
import subprocess
import sys
import os
from typing import List, Optional

SERVICES = {
    'exf': 'dolphin-exf.service',
    'ob': 'dolphin-ob.service',
    'watchdog': 'dolphin-watchdog.service',
    'mc': 'dolphin-mc.service',
    'mc-timer': 'dolphin-mc.timer',
}


def run_cmd(cmd: List[str], check: bool = True) -> subprocess.CompletedProcess:
    """Run systemctl command for user services"""
    full_cmd = ['systemctl', '--user'] + cmd
    print(f"Running: {' '.join(full_cmd)}")
    return subprocess.run(full_cmd, check=check, capture_output=True, text=True)


def status(service: Optional[str] = None):
    """Show status of all or specific service"""
    if service:
        svc = SERVICES.get(service, service)
        result = run_cmd(['status', svc], check=False)
        print(result.stdout or result.stderr)
    else:
        print("=== Dolphin Services Status ===\n")
        for name, svc in SERVICES.items():
            result = run_cmd(['is-active', svc], check=False)
            # Named `state` so the local does not shadow this function's name
            state = "✓ RUNNING" if result.returncode == 0 else "✗ STOPPED"
            print(f"{name:12} {state}")
        print("\n=== Recent Logs ===")
        result = run_cmd(['--lines=20', 'status'], check=False)
        print(result.stdout[-2000:] if result.stdout else "No recent output")


def start(service: Optional[str] = None):
    """Start service(s)"""
    if service:
        svc = SERVICES.get(service, service)
        run_cmd(['start', svc])
        print(f"Started {service}")
    else:
        for name, svc in SERVICES.items():
            if name == 'mc':  # Skip mc service, use timer
                continue
            run_cmd(['start', svc])
            print(f"Started {name}")


def stop(service: Optional[str] = None):
    """Stop service(s)"""
    if service:
        svc = SERVICES.get(service, service)
        run_cmd(['stop', svc])
        print(f"Stopped {service}")
    else:
        for name, svc in SERVICES.items():
            run_cmd(['stop', svc])
            print(f"Stopped {name}")


def restart(service: Optional[str] = None):
    """Restart service(s)"""
    if service:
        svc = SERVICES.get(service, service)
        run_cmd(['restart', svc])
        print(f"Restarted {service}")
    else:
        for name, svc in SERVICES.items():
            run_cmd(['restart', svc])
            print(f"Restarted {name}")


def logs(service: str, follow: bool = False, lines: int = 50):
    """Show logs for a service"""
    svc = SERVICES.get(service, service)
    cmd = ['journalctl', '--user', '-u', svc, f'--lines={lines}']
    if follow:
        cmd.append('--follow')
    subprocess.run(cmd)


def enable():
    """Enable services to start on boot"""
    for name, svc in SERVICES.items():
        run_cmd(['enable', svc])
        print(f"Enabled {name}")


def disable():
    """Disable services from starting on boot"""
    for name, svc in SERVICES.items():
        run_cmd(['disable', svc])
        print(f"Disabled {name}")


def daemon_reload():
    """Reload systemd daemon (after editing .service files)"""
    run_cmd(['daemon-reload'])
    print("Daemon reloaded")




def main():
    parser = argparse.ArgumentParser(
        description='Dolphin Service Manager - Userland service control',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  %(prog)s status          # Show all service status
  %(prog)s start exf       # Start ExF service
  %(prog)s logs ob -f      # Follow OB service logs
  %(prog)s restart         # Restart all services
  %(prog)s enable          # Enable auto-start on boot
"""
    )
    subparsers = parser.add_subparsers(dest='command', help='Command')

    # Status
    p_status = subparsers.add_parser('status', help='Show service status')
    p_status.add_argument('service', nargs='?', help='Specific service')

    # Start
    p_start = subparsers.add_parser('start', help='Start service(s)')
    p_start.add_argument('service', nargs='?', help='Specific service')

    # Stop
    p_stop = subparsers.add_parser('stop', help='Stop service(s)')
    p_stop.add_argument('service', nargs='?', help='Specific service')

    # Restart
    p_restart = subparsers.add_parser('restart', help='Restart service(s)')
    p_restart.add_argument('service', nargs='?', help='Specific service')

    # Logs
    p_logs = subparsers.add_parser('logs', help='Show service logs')
    p_logs.add_argument('service', help='Service name')
    p_logs.add_argument('-f', '--follow', action='store_true', help='Follow logs')
    p_logs.add_argument('-n', '--lines', type=int, default=50, help='Number of lines')

    # Enable/Disable
    subparsers.add_parser('enable', help='Enable auto-start')
    subparsers.add_parser('disable', help='Disable auto-start')
    subparsers.add_parser('reload', help='Reload systemd daemon')

    args = parser.parse_args()
    if not args.command:
        parser.print_help()
        return

    try:
        if args.command == 'status':
            status(args.service)
        elif args.command == 'start':
            start(args.service)
        elif args.command == 'stop':
            stop(args.service)
        elif args.command == 'restart':
            restart(args.service)
        elif args.command == 'logs':
            logs(args.service, args.follow, args.lines)
        elif args.command == 'enable':
            enable()
        elif args.command == 'disable':
            disable()
        elif args.command == 'reload':
            daemon_reload()
    except subprocess.CalledProcessError as e:
        print(f"Error: {e}", file=sys.stderr)
        if e.stderr:
            print(e.stderr, file=sys.stderr)
        sys.exit(1)


if __name__ == '__main__':
    main()


# =============================================================================
# SUPERVISOR-SPECIFIC COMMANDS
# =============================================================================

def supervisor_status():
    """Show supervisor internal component status"""
    import subprocess
    result = subprocess.run(
        ['journalctl', '--user', '-u', 'dolphin-supervisor', '--lines=100', '-o', 'json'],
        capture_output=True, text=True
    )
    print("=== Supervisor Component Status ===")
    print("(Parse logs for component health)")
    print(result.stdout[-2000:] if result.stdout else "No logs")


def supervisor_components():
    """List components managed by supervisor"""
    print("""
Components managed by dolphin-supervisor.service:
  - exf      (0.5s)  External Factors
  - ob       (0.5s)  Order Book Streamer
  - watchdog (10s)   Survival Stack
  - mc       (4h)    MC-Forewarner
""")


# Add to main() argument parser if needed
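
The closing comment leaves the wiring open. The following is a minimal sketch of one way to register the two supervisor helpers as subcommands. The subcommand names `supervisor-status` and `supervisor-components`, the stand-in handler bodies, and the helper names `add_supervisor_commands`, `build_parser` and `dispatch` are all assumptions for illustration, not part of the script. In the file as listed, the helpers are defined after the `if __name__ == '__main__'` guard, so they would likely need to move above it before `main()` could call them.

```python
import argparse


def supervisor_status():
    """Show supervisor internal component status (stand-in body)."""
    return "status"


def supervisor_components():
    """List components managed by supervisor (stand-in body)."""
    return "components"


# Hypothetical subcommand names mapped to their handlers
SUPERVISOR_COMMANDS = {
    'supervisor-status': supervisor_status,
    'supervisor-components': supervisor_components,
}


def add_supervisor_commands(subparsers):
    """Register one subparser per supervisor command."""
    for name, handler in SUPERVISOR_COMMANDS.items():
        p = subparsers.add_parser(name, help=handler.__doc__)
        # set_defaults stores the handler on the parsed namespace
        p.set_defaults(handler=handler)


def build_parser():
    parser = argparse.ArgumentParser(prog='dolphin-service')
    subparsers = parser.add_subparsers(dest='command', help='Command')
    add_supervisor_commands(subparsers)
    return parser


def dispatch(argv):
    """Parse argv and call the matching handler, if any."""
    args = build_parser().parse_args(argv)
    handler = getattr(args, 'handler', None)
    return handler() if handler else None
```

Storing the handler with `set_defaults` avoids extending the `if/elif` chain in `main()` for every new command; whether that fits the existing dispatch style is a design choice.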

411
prod/services/supervisor.py Executable file

@@ -0,0 +1,411 @@
#!/usr/bin/env python3
"""
Dolphin Service Supervisor
==========================
A SINGLE userland service that manages MULTIPLE service-like components.
Architecture:
- One systemd service: dolphin-supervisor.service
- Internally manages: ExF, OB, Watchdog, MC, etc.
- Each component is a Python thread/async task
- Centralized health, logging, restart
"""
import asyncio
import threading
import signal
import sys
import time
import json
import traceback
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Callable
from datetime import datetime
from concurrent.futures import ThreadPoolExecutor
import logging
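
# Optional systemd notify.
# The `Notification` enum used below is exposed by cysystemd.daemon;
# pystemd.daemon.notify() takes keyword arguments instead.
try:
    from cysystemd.daemon import notify, Notification
    SYSTEMD_AVAILABLE = True
except ImportError:
    SYSTEMD_AVAILABLE = False

    def notify(*args, **kwargs):
        pass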

# Optional tenacity for retries
try:
    from tenacity import retry, stop_after_attempt, wait_exponential
    TENACITY_AVAILABLE = True
except ImportError:
    TENACITY_AVAILABLE = False


# =============================================================================
# STRUCTURED LOGGING
# =============================================================================

class JSONFormatter(logging.Formatter):
    def format(self, record):
        # Prefer an explicit `component` attribute on the record; otherwise
        # derive it from loggers named 'component.<name>'.
        component = getattr(record, 'component', None)
        if component is None:
            prefix = 'component.'
            component = (record.name[len(prefix):]
                         if record.name.startswith(prefix) else 'supervisor')
        log_data = {
            'timestamp': datetime.utcnow().isoformat(),
            'level': record.levelname,
            'component': component,
            'message': record.getMessage(),
            'source': record.name,
        }
        if hasattr(record, 'extra_data'):
            log_data.update(record.extra_data)
        return json.dumps(log_data)


def get_logger(name: str) -> logging.Logger:
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JSONFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger


# =============================================================================
# COMPONENT BASE CLASS
# =============================================================================

@dataclass
class ComponentHealth:
    name: str
    status: str  # 'healthy', 'degraded', 'failed', 'stopped'
    last_run: float
    error_count: int
    message: str
    uptime: float = 0.0


class ServiceComponent(ABC):
    """
    Base class for a service-like component.
    Runs in its own thread, managed by the supervisor.
    """

    def __init__(self, name: str, interval: float = 1.0, max_retries: int = 3):
        self.name = name
        self.interval = interval
        self.max_retries = max_retries
        self.logger = get_logger(f'component.{name}')
        self.logger.component = name
        self._running = False
        self._thread: Optional[threading.Thread] = None
        self._error_count = 0
        self._last_run = 0
        self._start_time = 0
        self._health = ComponentHealth(
            name=name, status='stopped',
            last_run=0, error_count=0, message='Not started'
        )

    @abstractmethod
    def run_cycle(self):
        """Override this with your component's work"""
        pass

    def health_check(self) -> bool:
        """Override for custom health check"""
        return True

    def _execute_with_retry(self):
        """Execute run_cycle with retry logic"""
        for attempt in range(self.max_retries):
            try:
                self.run_cycle()
                self._error_count = 0
                self._last_run = time.time()
                return
            except Exception as e:
                self._error_count += 1
                self.logger.error(
                    f"Cycle failed (attempt {attempt + 1}): {e}",
                    extra={'extra_data': {'attempt': attempt + 1, 'error': str(e)}}
                )
                if attempt < self.max_retries - 1:
                    time.sleep(min(2 ** attempt, 30))  # Exponential backoff
                else:
                    raise

    def _loop(self):
        """Main component loop (runs in thread)"""
        self._running = True
        self._start_time = time.time()
        self.logger.info(f"{self.name}: Component started")
        while self._running:
            try:
                self._execute_with_retry()
                self._health.status = 'healthy'
                self._health.message = 'Running normally'
            except Exception as e:
                self._health.status = 'failed'
                self._health.message = f'Failed: {str(e)[:100]}'
                self.logger.error(f"{self.name}: Component failed: {e}")
                # Continue running (supervisor will restart if needed)
            # Sleep until the next cycle in short slices so that stop()
            # takes effect promptly even for long intervals
            deadline = time.monotonic() + self.interval
            while self._running:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                time.sleep(min(remaining, 0.1))
        self._health.status = 'stopped'
        self.logger.info(f"{self.name}: Component stopped")

    def start(self):
        """Start the component in a new thread"""
        if self._thread and self._thread.is_alive():
            self.logger.warning(f"{self.name}: Already running")
            return
        self._thread = threading.Thread(target=self._loop, name=f"component-{self.name}")
        self._thread.daemon = True
        self._thread.start()
        self.logger.info(f"{self.name}: Thread started")

    def stop(self, timeout: float = 5.0):
        """Stop the component gracefully"""
        self._running = False
        if self._thread and self._thread.is_alive():
            self._thread.join(timeout=timeout)
            if self._thread.is_alive():
                self.logger.warning(f"{self.name}: Thread did not stop gracefully")

    def get_health(self) -> ComponentHealth:
        """Get current health status"""
        self._health.last_run = self._last_run
        self._health.error_count = self._error_count
        if self._start_time:
            self._health.uptime = time.time() - self._start_time
        return self._health


# =============================================================================
# SUPERVISOR (SINGLE SERVICE)
# =============================================================================

class DolphinSupervisor:
    """
    SINGLE service that manages MULTIPLE userland components.

    Usage:
        supervisor = DolphinSupervisor()
        supervisor.register(ExFComponent())
        supervisor.register(OBComponent())
        supervisor.register(WatchdogComponent())
        supervisor.run()
    """

    def __init__(self, health_check_interval: float = 10.0):
        self.logger = get_logger('supervisor')
        self.logger.component = 'supervisor'
        self.components: Dict[str, ServiceComponent] = {}
        self._running = False
        self._shutdown_event = threading.Event()
        self._health_check_interval = health_check_interval
        self._supervisor_thread: Optional[threading.Thread] = None
        # Signal handling
        self._setup_signals()

    def _setup_signals(self):
        """Setup graceful shutdown"""
        def handler(signum, frame):
            self.logger.info(f"Received signal {signum}, shutting down...")
            self._shutdown_event.set()
        signal.signal(signal.SIGTERM, handler)
        signal.signal(signal.SIGINT, handler)

    def register(self, component: ServiceComponent):
        """Register a component to be managed"""
        self.components[component.name] = component
        self.logger.info(f"Registered component: {component.name}")

    def start_all(self):
        """Start all registered components"""
        self.logger.info(f"Starting {len(self.components)} components...")
        for name, component in self.components.items():
            try:
                component.start()
            except Exception as e:
                self.logger.error(f"Failed to start {name}: {e}")
        # Notify systemd we're ready
        if SYSTEMD_AVAILABLE:
            notify(Notification.READY)
            self.logger.info("Notified systemd: READY")

    def stop_all(self, timeout: float = 5.0):
        """Stop all components gracefully"""
        self.logger.info("Stopping all components...")
        for name, component in self.components.items():
            try:
                component.stop(timeout=timeout)
            except Exception as e:
                self.logger.error(f"Error stopping {name}: {e}")

    def _supervisor_loop(self):
        """Main supervisor loop - monitors components"""
        self.logger.info("Supervisor monitoring started")
        while not self._shutdown_event.is_set():
            # Check health of all components
            health_report = {}
            for name, component in self.components.items():
                health = component.get_health()
                health_report[name] = {
                    'status': health.status,
                    'uptime': health.uptime,
                    'errors': health.error_count,
                    'message': health.message
                }
                # Restart failed components
                if health.status == 'failed' and component._running:
                    self.logger.warning(f"{name}: Restarting failed component...")
                    component.stop(timeout=2.0)
                    time.sleep(1)
                    component.start()
            # Log health summary
            failed = sum(1 for h in health_report.values() if h['status'] == 'failed')
            if failed > 0:
                self.logger.error(f"Health check: {failed} components failed",
                                  extra={'extra_data': health_report})
            else:
                self.logger.debug("Health check: all components healthy",
                                  extra={'extra_data': health_report})
            # Notify systemd watchdog
            if SYSTEMD_AVAILABLE:
                notify(Notification.WATCHDOG)
            # Wait for next check
            self._shutdown_event.wait(self._health_check_interval)
        self.logger.info("Supervisor monitoring stopped")

    def get_status(self) -> Dict:
        """Get full status of supervisor and components"""
        return {
            'supervisor': {
                'running': self._running,
                'components_count': len(self.components)
            },
            'components': {
                name: {
                    'status': comp.get_health().status,
                    'uptime': comp.get_health().uptime,
                    'errors': comp.get_health().error_count,
                    'message': comp.get_health().message
                }
                for name, comp in self.components.items()
            }
        }

    def run(self):
        """Run the supervisor (blocking)"""
        self.logger.info("=" * 60)
        self.logger.info("Dolphin Service Supervisor Starting")
        self.logger.info("=" * 60)
        self._running = True
        # Start all components
        self.start_all()
        # Start supervisor monitoring thread
        self._supervisor_thread = threading.Thread(
            target=self._supervisor_loop,
            name="supervisor-monitor"
        )
        self._supervisor_thread.start()
        # Wait for shutdown signal
        try:
            while not self._shutdown_event.is_set():
                self._shutdown_event.wait(1)
        except KeyboardInterrupt:
            pass
        finally:
            self._running = False
            self.stop_all()
            if self._supervisor_thread:
                self._supervisor_thread.join(timeout=5.0)
            self.logger.info("Supervisor shutdown complete")


# =============================================================================
# EXAMPLE COMPONENTS
# =============================================================================

class ExFComponent(ServiceComponent):
    """External Factors - 0.5s aggressive oversampling"""

    def __init__(self):
        super().__init__(name='exf', interval=0.5, max_retries=3)
        self.indicators = {}

    def run_cycle(self):
        # Simulate fetching indicators
        self.indicators['basis'] = {'value': 0.01, 'timestamp': time.time()}
        self.indicators['spread'] = {'value': 0.02, 'timestamp': time.time()}
        # In real implementation: fetch from APIs, push to Hazelcast


class OBComponent(ServiceComponent):
    """Order Book Streamer - 500ms"""

    def __init__(self):
        super().__init__(name='ob', interval=0.5, max_retries=3)

    def run_cycle(self):
        # Simulate OB snapshot
        pass


class WatchdogComponent(ServiceComponent):
    """Survival Stack Watchdog - 10s"""

    def __init__(self):
        super().__init__(name='watchdog', interval=10.0, max_retries=5)
        self.posture = 'APEX'

    def run_cycle(self):
        # Check categories, compute posture
        pass


class MCComponent(ServiceComponent):
    """MC-Forewarner - 4h (but we check every 5 min if it's time)"""

    def __init__(self):
        super().__init__(name='mc', interval=300, max_retries=3)  # 5 min check
        self.last_run = 0

    def run_cycle(self):
        # Only actually run every 4 hours
        if time.time() - self.last_run > 14400:  # 4 hours
            self.logger.info("Running MC-Forewarner assessment")
            self.last_run = time.time()


# =============================================================================
# MAIN ENTRY POINT
# =============================================================================

if __name__ == '__main__':
    # Create supervisor
    supervisor = DolphinSupervisor(health_check_interval=10.0)
    # Register components
    supervisor.register(ExFComponent())
    supervisor.register(OBComponent())
    supervisor.register(WatchdogComponent())
    supervisor.register(MCComponent())
    # Run
    supervisor.run()
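
The components above pace themselves between cycles and are stopped through a boolean flag. As a design alternative, here is a minimal sketch of the same loop built on `threading.Event`, whose `wait()` doubles as an interruptible sleep. `PeriodicWorker` and `demo` are hypothetical names used only for illustration, and the cycle body is a placeholder counter rather than real component work.

```python
import threading
import time


class PeriodicWorker:
    """Hypothetical stand-in for a component; not part of supervisor.py."""

    def __init__(self, interval: float):
        self.interval = interval
        self.cycles = 0
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def _loop(self):
        while not self._stop_event.is_set():
            self.cycles += 1  # placeholder for run_cycle()
            # wait() returns as soon as stop() sets the event,
            # otherwise after `interval` seconds
            self._stop_event.wait(self.interval)

    def start(self):
        self._thread.start()

    def stop(self, timeout: float = 5.0) -> bool:
        """Signal the loop to exit; return True if the thread finished in time."""
        self._stop_event.set()
        self._thread.join(timeout=timeout)
        return not self._thread.is_alive()


def demo():
    # A 5-minute interval, comparable to the 'mc' component's check interval
    worker = PeriodicWorker(interval=300.0)
    worker.start()
    time.sleep(0.2)  # let the first cycle run
    t0 = time.monotonic()
    stopped = worker.stop(timeout=2.0)
    return stopped, time.monotonic() - t0, worker.cycles
```

With this shape a long interval no longer delays shutdown, since the wait ends the moment the event is set. The retry backoff inside a cycle would still need the same treatment to be fully interruptible.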

27
prod/services/test_service.py Executable file

@@ -0,0 +1,27 @@
#!/usr/bin/env python3
"""Test service to validate the setup"""
import asyncio
import sys

sys.path.insert(0, '/mnt/dolphinng5_predict/prod')
from services.service_base import ServiceBase


class TestService(ServiceBase):
    def __init__(self):
        super().__init__(
            name='test',
            check_interval=5,
            max_retries=3,
            notify_systemd=True
        )
        self.counter = 0

    async def run_cycle(self):
        self.counter += 1
        self.logger.info(f"Test cycle {self.counter}")
        await asyncio.sleep(2)


if __name__ == '__main__':
    print("Starting test service...")
    service = TestService()
    service.run()