Monitoring
Monitoring in AutoResearch is not an agent — it's a set of mechanisms used by the Orchestrator and the OMCC harness to watch over long-running processes. There is no "Monitor Agent" that sits in a tmux session. Instead, monitoring is a responsibility distributed across the system.
Monitoring Categories
| Category | What's Watched | Who Watches | How |
|---|---|---|---|
| Environment setup | Conda/pip install, CUDA availability | Orchestrator | Coder reports back |
| Data download | Download progress, checksums, disk space | Coder | Periodic status check |
| GPU allocation | GPU availability, memory, utilization | OMCC harness | nvidia-smi polling |
| Training Phase 1 | Loss convergence, gradient norms, NaN detection | Orchestrator | Active watch (tail log) |
| Training Phase 2 | Loss trend, checkpoint saves, ETA | CronCreate | Periodic patrol |
| Agent health | Heartbeat freshness, error rates | OMCC harness | Heartbeat protocol |
Two-Phase Training Model
Training monitoring uses two distinct phases, because the failure modes at the start of a run differ from those during steady-state training.
Phase 1: Active Watch
When: First N steps of training (configurable, default 1000 steps).
Why: Most training failures happen early — wrong learning rate, data loading errors, shape mismatches, NaN losses. These need immediate intervention.
How: The Orchestrator actively watches the training log, checking every few seconds.
```mermaid
graph LR
    A[Training Starts] --> B{Step < 1000?}
    B -->|Yes| C[Active Watch]
    C --> D{Healthy?}
    D -->|Yes| B
    D -->|No| E[Intervene]
    E --> F[Fix + Restart]
    F --> A
    B -->|No| G[Transition to Phase 2]
    style C fill:#dbeafe,stroke:#2563eb
    style E fill:#fee2e2,stroke:#dc2626
    style G fill:#dcfce7,stroke:#16a34a
```
Checks during Phase 1:
| Check | Threshold | Action on Failure |
|---|---|---|
| Loss is finite | No NaN/Inf | Stop training, diagnose |
| Loss is decreasing | Avg over 100 steps | Warning, continue watching |
| Gradient norm | < 100.0 | Reduce LR or clip gradients |
| GPU utilization | > 50% | Check data pipeline bottleneck |
| Memory usage | < 95% of available | Reduce batch size |
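The Phase 1 checks above can be sketched as a small per-step validator. This is a minimal sketch, not the actual Orchestrator code: the function name `check_step`, the metrics dict shape, and how metrics are read from the training log are assumptions; the threshold values mirror the table and `thresholds.yaml`.

```python
import math

# Thresholds mirroring the Phase 1 table (values from thresholds.yaml).
THRESHOLDS = {"grad_norm_max": 100.0, "gpu_util_min": 0.5, "mem_max": 0.95}

def check_step(metrics: dict) -> list[str]:
    """Return the list of problems found for one training step.

    `metrics` is a hypothetical dict parsed from the training log:
    {"loss": float, "grad_norm": float, "gpu_util": float, "mem_frac": float}
    """
    problems = []
    if not math.isfinite(metrics["loss"]):
        problems.append("loss is NaN/Inf: stop training, diagnose")
    if metrics["grad_norm"] > THRESHOLDS["grad_norm_max"]:
        problems.append("gradient norm too high: reduce LR or clip gradients")
    if metrics["gpu_util"] < THRESHOLDS["gpu_util_min"]:
        problems.append("GPU underutilized: check data pipeline bottleneck")
    if metrics["mem_frac"] > THRESHOLDS["mem_max"]:
        problems.append("memory near limit: reduce batch size")
    return problems
```

An empty return means the step passed every check; a NaN loss is the only condition the table treats as an immediate stop rather than a warning.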
Phase 2: CronCreate Patrol
When: After Phase 1 (training is stable), until training completes.
Why: Stable training rarely fails catastrophically, but can drift slowly (loss plateau, disk full, GPU throttling). Periodic checks are sufficient.
How: CronCreate schedules a patrol script that runs every N minutes (configurable, default 30 min).
```mermaid
graph TD
    A[CronCreate Schedule] --> B[Patrol Script Runs]
    B --> C[Read Latest Log Lines]
    C --> D{All Healthy?}
    D -->|Yes| E[Write OK to agent_health.yaml]
    D -->|No| F{Severity?}
    F -->|Warning| G[Log Warning<br/>Continue Patrol]
    F -->|Critical| H[Alert Orchestrator<br/>Pause Training]
    style A fill:#fef3c7,stroke:#d97706
    style E fill:#dcfce7,stroke:#16a34a
    style H fill:#fee2e2,stroke:#dc2626
```
Checks during Phase 2:
| Check | Frequency | Action on Failure |
|---|---|---|
| Training process alive | Every patrol | Alert, attempt restart |
| Loss still decreasing | Every patrol | Warning after 2 consecutive flat patrols |
| Disk space adequate | Every patrol | Alert if < 10GB free |
| Checkpoint saved recently | Every patrol | Warning if last checkpoint > 2 hours ago |
| GPU temperature | Every patrol | Throttle alert if > 85°C |
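Two of the patrol checks above (disk space and checkpoint freshness) can be sketched in a few lines. This is a hedged sketch, not the real patrol script: the `patrol` function name, the `*.pt` checkpoint glob, and the path layout are assumptions; the limits come from the table.

```python
import os
import shutil
import time
from pathlib import Path

def patrol(log_path: str, ckpt_dir: str, min_free_gb: float = 10,
           ckpt_max_hours: float = 2.0) -> list[str]:
    """Return alert strings for the disk-space and checkpoint checks."""
    alerts = []
    # Disk space adequate: alert if < min_free_gb free.
    free_gb = shutil.disk_usage(os.path.dirname(log_path) or ".").free / 1e9
    if free_gb < min_free_gb:
        alerts.append(f"critical: only {free_gb:.1f} GB free")
    # Checkpoint saved recently: warn if the newest one is too old.
    ckpts = sorted(Path(ckpt_dir).glob("*.pt"), key=lambda p: p.stat().st_mtime)
    if ckpts:
        age_h = (time.time() - ckpts[-1].stat().st_mtime) / 3600
        if age_h > ckpt_max_hours:
            alerts.append(f"warning: last checkpoint {age_h:.1f}h old")
    else:
        alerts.append("warning: no checkpoints found")
    return alerts
```

Because the patrol runs as an external process scheduled by CronCreate, anything it returns must be written out (e.g. to `agent_health.yaml` or the alert channels) rather than held in memory.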
Why not just watch everything all the time?
Active watching consumes Orchestrator context. During Phase 1, this is worth it because failures are frequent and immediate response matters. During Phase 2, the Orchestrator's context is better spent on other tasks (analysis planning, paper prep). CronCreate patrols use zero context — they run externally and only alert if something goes wrong.
Monitoring Configuration
thresholds.yaml
```yaml
training:
  phase1_steps: 1000
  loss_nan_action: stop
  gradient_norm_max: 100.0
  gpu_util_min: 0.5
  memory_usage_max: 0.95
patrol:
  interval_minutes: 30
  loss_plateau_patience: 2  # patrols before warning
  disk_min_gb: 10
  checkpoint_max_hours: 2
  gpu_temp_max: 85
```
alerts.yaml
```yaml
channels:
  terminal:
    enabled: true  # Print to Orchestrator terminal
  file:
    enabled: true
    path: .omc/research/logs/alerts.log
  webhook:
    enabled: false
    url: ""  # Optional: Slack, Discord, etc.
severity_routing:
  info: [file]
  warning: [terminal, file]
  critical: [terminal, file, webhook]
```
Agent Health via OMCC Heartbeat
The OMCC harness monitors agent health independently of training monitoring.
| Agent | Heartbeat Method | Healthy If |
|---|---|---|
| Coder (tmux) | Process alive + last output timestamp | Output within last 5 min |
| Scout (tmux) | Process alive + last output timestamp | Output within last 10 min |
| Writer (session) | Session active check | Session exists |
| Judge (stateless) | N/A | Responds to codex exec |
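For the tmux-based agents, the freshness rule in the table reduces to comparing the last-output timestamp against a per-agent window. A minimal sketch, assuming the harness already has each agent's last-output time (`is_fresh` and the dict name are hypothetical; the windows come from the table):

```python
import time

# Freshness windows from the table above, in seconds.
FRESHNESS_LIMITS_S = {"coder": 5 * 60, "scout": 10 * 60}

def is_fresh(agent: str, last_output_ts: float, now: float = None) -> bool:
    """True if the agent produced output within its freshness window."""
    now = time.time() if now is None else now
    return (now - last_output_ts) <= FRESHNESS_LIMITS_S[agent]
```

The Writer and Judge rows need different probes (session existence, a live `codex exec` round-trip), so they would not go through this timestamp path.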
Health-based, not timeout-based
AutoResearch does not use fixed timeouts for operations. Instead, it checks health signals. A training run that takes 48 hours is fine as long as the process is alive and loss is moving. A training run that takes 5 minutes is wrong if loss is NaN.
This distinction matters: timeout-based monitoring kills healthy long jobs. Health-based monitoring catches unhealthy short jobs.
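The contrast can be made concrete with two toy predicates (purely illustrative, not AutoResearch code):

```python
def timeout_ok(elapsed_s: float, limit_s: float) -> bool:
    # Timeout-based: kills a healthy 48-hour run the moment it
    # crosses the wall-clock limit.
    return elapsed_s < limit_s

def health_ok(process_alive: bool, loss_finite: bool,
              loss_improving: bool) -> bool:
    # Health-based: passes any run, short or long, whose signals
    # look healthy, and fails a 5-minute run with NaN loss.
    return process_alive and loss_finite and loss_improving
```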
Next
- Workspace Isolation — per-project monitoring separation
- Training Stage — how monitoring integrates with the training pipeline stage