# Monitoring

Monitoring in AutoResearch is not an agent — it's a set of mechanisms used by the Orchestrator and the OMCC harness to watch over long-running processes. There is no "Monitor Agent" that sits in a tmux session. Instead, monitoring is a responsibility distributed across the system.

## Monitoring Categories

| Category | What's Watched | Who Watches | How |
| --- | --- | --- | --- |
| Environment setup | Conda/pip install, CUDA availability | Orchestrator | Coder reports back |
| Data download | Download progress, checksums, disk space | Coder | Periodic status check |
| GPU allocation | GPU availability, memory, utilization | OMCC harness | nvidia-smi polling |
| Training Phase 1 | Loss convergence, gradient norms, NaN detection | Orchestrator | Active watch (tail log) |
| Training Phase 2 | Loss trend, checkpoint saves, ETA | CronCreate | Periodic patrol |
| Agent health | Heartbeat freshness, error rates | OMCC harness | Heartbeat protocol |
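
The nvidia-smi polling used for GPU allocation can be sketched as below. The query flags are standard nvidia-smi options; the function names and dict layout are illustrative, not the harness's actual API.

```python
import subprocess

def parse_gpu_csv(text):
    """Parse 'index, memory.used, memory.total, utilization.gpu' CSV rows."""
    gpus = []
    for line in text.strip().splitlines():
        idx, used, total, util = [f.strip() for f in line.split(",")]
        gpus.append({"index": int(idx), "mem_used_mb": int(used),
                     "mem_total_mb": int(total), "util_pct": int(util)})
    return gpus

def poll_gpus():
    """One poll, as the OMCC harness might run it between checks."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,memory.used,memory.total,utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)
```

Parsing is split from the subprocess call so the polling cadence and the CSV format can each change independently.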

## Two-Phase Training Model

Training monitoring uses two distinct phases, because the failure modes are different at the start vs. during steady-state training.

### Phase 1: Active Watch

When: First N steps of training (configurable, default 1000 steps).

Why: Most training failures happen early — wrong learning rate, data loading errors, shape mismatches, NaN losses. These need immediate intervention.

How: The Orchestrator actively watches the training log, checking every few seconds.

```mermaid
graph LR
    A[Training Starts] --> B{Step < 1000?}
    B -->|Yes| C[Active Watch]
    C --> D{Healthy?}
    D -->|Yes| B
    D -->|No| E[Intervene]
    E --> F[Fix + Restart]
    F --> A
    B -->|No| G[Transition to Phase 2]

    style C fill:#dbeafe,stroke:#2563eb
    style E fill:#fee2e2,stroke:#dc2626
    style G fill:#dcfce7,stroke:#16a34a
```

Checks during Phase 1:

| Check | Threshold | Action on Failure |
| --- | --- | --- |
| Loss is finite | No NaN/Inf | Stop training, diagnose |
| Loss is decreasing | Avg over 100 steps | Warning, continue watching |
| Gradient norm | < 100.0 | Reduce LR or clip gradients |
| GPU utilization | > 50% | Check data pipeline bottleneck |
| Memory usage | < 95% of available | Reduce batch size |
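
The Phase 1 checks reduce to a predicate over the latest metrics. A minimal sketch, assuming the thresholds from the table; the function signature and metric names are illustrative, not the Orchestrator's actual interface.

```python
import math

def phase1_check(losses, grad_norm, gpu_util, mem_frac):
    """Apply the Phase 1 thresholds to recent training metrics.

    losses: per-step losses, most recent last; grad_norm is the latest
    gradient norm; gpu_util and mem_frac are fractions in [0, 1].
    Returns (healthy, issues) where issues lists the failed checks.
    """
    issues = []
    if any(not math.isfinite(l) for l in losses):
        issues.append("loss NaN/Inf: stop training, diagnose")
    window = losses[-100:]
    if len(window) >= 100 and sum(window[-50:]) / 50 >= sum(window[:50]) / 50:
        issues.append("loss not decreasing over last 100 steps (warning)")
    if grad_norm > 100.0:
        issues.append("gradient norm > 100: reduce LR or clip gradients")
    if gpu_util < 0.5:
        issues.append("GPU util < 50%: check data pipeline bottleneck")
    if mem_frac > 0.95:
        issues.append("memory > 95%: reduce batch size")
    return (len(issues) == 0, issues)
```

The "loss is decreasing" check compares the mean of the last 50 steps against the 50 before them, one plausible reading of "avg over 100 steps".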

### Phase 2: CronCreate Patrol

When: After Phase 1 (training is stable), until training completes.

Why: Stable training rarely fails catastrophically, but can drift slowly (loss plateau, disk full, GPU throttling). Periodic checks are sufficient.

How: CronCreate schedules a patrol script that runs every N minutes (configurable, default 30 min).

```mermaid
graph TD
    A[CronCreate Schedule] --> B[Patrol Script Runs]
    B --> C[Read Latest Log Lines]
    C --> D{All Healthy?}
    D -->|Yes| E[Write OK to agent_health.yaml]
    D -->|No| F{Severity?}
    F -->|Warning| G[Log Warning<br/>Continue Patrol]
    F -->|Critical| H[Alert Orchestrator<br/>Pause Training]

    style A fill:#fef3c7,stroke:#d97706
    style E fill:#dcfce7,stroke:#16a34a
    style H fill:#fee2e2,stroke:#dc2626
```

Checks during Phase 2:

| Check | Frequency | Action on Failure |
| --- | --- | --- |
| Training process alive | Every patrol | Alert, attempt restart |
| Loss still decreasing | Every patrol | Warning after 2 consecutive flat patrols |
| Disk space adequate | Every patrol | Alert if < 10 GB free |
| Checkpoint saved recently | Every patrol | Warning if last checkpoint > 2 hours ago |
| GPU temperature | Every patrol | Throttle alert if > 85 °C |
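
A patrol pass along these lines could be what CronCreate schedules. This is a sketch under assumptions: the crash heuristic, argument names, and severity values are illustrative, not the actual patrol script.

```python
import os
import shutil
import time

def patrol(log_tail, recent_losses, ckpt_path, disk_path="/",
           min_free_gb=10, ckpt_max_hours=2, gpu_temp_c=None):
    """One patrol pass: returns ('ok' | 'warning' | 'critical', messages)."""
    warnings, criticals = [], []
    # Process alive: crude heuristic -- a traceback or an empty loss
    # window in the latest log lines suggests the run died.
    if "Traceback" in log_tail or not recent_losses:
        criticals.append("training process appears dead or crashed")
    # Loss still decreasing since the last patrol?
    if len(recent_losses) >= 2 and recent_losses[-1] >= recent_losses[0]:
        warnings.append("loss flat since last patrol")
    free_gb = shutil.disk_usage(disk_path).free / 1e9
    if free_gb < min_free_gb:
        criticals.append(f"only {free_gb:.1f} GB free")
    if ckpt_path and os.path.exists(ckpt_path):
        age_h = (time.time() - os.path.getmtime(ckpt_path)) / 3600
        if age_h > ckpt_max_hours:
            warnings.append(f"last checkpoint {age_h:.1f}h old")
    if gpu_temp_c is not None and gpu_temp_c > 85:
        warnings.append(f"GPU at {gpu_temp_c}C: throttle risk")
    if criticals:
        return "critical", criticals + warnings
    return ("warning", warnings) if warnings else ("ok", [])
```

A "critical" result is what triggers the alert-Orchestrator/pause-training branch in the diagram; "warning" is logged and the patrol continues.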

### Why not just watch everything all the time?

Active watching consumes Orchestrator context. During Phase 1, this is worth it because failures are frequent and immediate response matters. During Phase 2, the Orchestrator's context is better spent on other tasks (analysis planning, paper prep). CronCreate patrols use zero context — they run externally and only alert if something goes wrong.

## Monitoring Configuration

### thresholds.yaml

```yaml
training:
  phase1_steps: 1000
  loss_nan_action: stop
  gradient_norm_max: 100.0
  gpu_util_min: 0.5
  memory_usage_max: 0.95

patrol:
  interval_minutes: 30
  loss_plateau_patience: 2      # patrols before warning
  disk_min_gb: 10
  checkpoint_max_hours: 2
  gpu_temp_max: 85
```
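
In code, these thresholds might be carried as small settings objects whose defaults mirror the file, so a partial thresholds.yaml still yields a complete config. A sketch, the dataclass layout is an assumption:

```python
from dataclasses import dataclass

@dataclass
class TrainingThresholds:
    phase1_steps: int = 1000
    loss_nan_action: str = "stop"
    gradient_norm_max: float = 100.0
    gpu_util_min: float = 0.5
    memory_usage_max: float = 0.95

@dataclass
class PatrolThresholds:
    interval_minutes: int = 30
    loss_plateau_patience: int = 2  # patrols before warning
    disk_min_gb: int = 10
    checkpoint_max_hours: int = 2
    gpu_temp_max: int = 85

def load_thresholds(raw):
    """Merge an already-parsed thresholds.yaml dict over the defaults."""
    return (TrainingThresholds(**raw.get("training", {})),
            PatrolThresholds(**raw.get("patrol", {})))
```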

### alerts.yaml

```yaml
channels:
  terminal:
    enabled: true                # Print to Orchestrator terminal
  file:
    enabled: true
    path: .omc/research/logs/alerts.log
  webhook:
    enabled: false
    url: ""                      # Optional: Slack, Discord, etc.

severity_routing:
  info: [file]
  warning: [terminal, file]
  critical: [terminal, file, webhook]
```
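
The severity_routing table maps straightforwardly to a dispatch function: look up the channels for a severity, skip any that are disabled, send to the rest. The sender-callable shape below is an illustrative sketch, not the actual alert API.

```python
# Mirrors severity_routing in alerts.yaml.
ROUTING = {
    "info": ["file"],
    "warning": ["terminal", "file"],
    "critical": ["terminal", "file", "webhook"],
}

def route_alert(severity, message, senders, enabled):
    """Send message to every enabled channel routed for this severity.

    senders: {channel: callable(message)}; enabled: {channel: bool},
    mirroring the `enabled` flags in alerts.yaml. Returns channels used.
    """
    used = []
    for channel in ROUTING.get(severity, []):
        if enabled.get(channel) and channel in senders:
            senders[channel](message)
            used.append(channel)
    return used
```

With the defaults above, a critical alert reaches terminal and file but silently skips the disabled webhook.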

## Agent Health via OMCC Heartbeat

The OMCC harness monitors agent health independently of training monitoring.

| Agent | Heartbeat Method | Healthy If |
| --- | --- | --- |
| Coder (tmux) | Process alive + last output timestamp | Output within last 5 min |
| Scout (tmux) | Process alive + last output timestamp | Output within last 10 min |
| Writer (session) | Session active check | Session exists |
| Judge (stateless) | N/A | Responds to codex exec |
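
For the tmux agents, the heartbeat reduces to comparing a last-output timestamp against a per-agent freshness budget. A sketch with the budgets from the table; the function and table names are illustrative.

```python
import time

# Freshness budgets per tmux agent, in seconds (from the table above).
FRESHNESS_S = {"coder": 5 * 60, "scout": 10 * 60}

def heartbeat_ok(agent, process_alive, last_output_ts, now=None):
    """tmux-style health: process alive AND output within the budget.

    Agents without a budget (session/stateless) fall back to a plain
    liveness check.
    """
    now = time.time() if now is None else now
    budget = FRESHNESS_S.get(agent)
    if budget is None:
        return process_alive
    return process_alive and (now - last_output_ts) <= budget
```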

### Health-based, not timeout-based

AutoResearch does not use fixed timeouts for operations. Instead, it checks health signals. A training run that takes 48 hours is fine as long as the process is alive and loss is moving. A training run that takes 5 minutes is wrong if loss is NaN.

This distinction matters: timeout-based monitoring kills healthy long jobs. Health-based monitoring catches unhealthy short jobs.
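
The distinction can be stated as two predicates: a timeout check looks only at elapsed time, while a health check looks only at signals. A minimal illustrative sketch, not AutoResearch's actual logic:

```python
import math

def timeout_based(elapsed_s, limit_s):
    """Passes anything short, kills anything long -- regardless of health."""
    return elapsed_s <= limit_s

def health_based(process_alive, recent_losses):
    """Passes anything whose signals look right -- regardless of duration."""
    return (process_alive
            and len(recent_losses) >= 2
            and all(math.isfinite(l) for l in recent_losses)
            and recent_losses[-1] < recent_losses[0])
```

A healthy 48-hour run passes the health check but would fail a 24-hour timeout; a 5-minute run with NaN loss passes the timeout but fails the health check.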
