# Monitoring

Monitoring in AutoResearch is not an agent — it's a set of mechanisms used by the Orchestrator and the OMCC harness to watch over long-running processes. There is no "Monitor Agent" that sits in a tmux session. Instead, monitoring is a responsibility distributed across the system.

## Monitoring Categories

| Category | What's Watched | Who Watches | How |
| --- | --- | --- | --- |
| Environment setup | Conda/pip install, CUDA availability | Orchestrator | Coder reports back |
| Data download | Download progress, checksums, disk space | Coder | Periodic status check |
| GPU allocation | GPU availability, memory, utilization | OMCC harness | nvidia-smi polling |
| Training Phase 1 | Loss convergence, gradient norms, NaN detection | Orchestrator | Active watch (tail log) |
| Training Phase 2 | Loss trend, checkpoint saves, ETA | CronCreate | Periodic patrol |
| Agent health | Heartbeat freshness, error rates | OMCC harness | Heartbeat protocol |
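
The nvidia-smi polling used for GPU allocation can be sketched as below. The query flags are standard nvidia-smi options; the function names and dict layout are illustrative, not the harness's actual API.

```python
import subprocess

def parse_gpu_csv(text):
    """Parse 'index, memory.used, memory.total, utilization.gpu' CSV rows."""
    gpus = []
    for line in text.strip().splitlines():
        idx, used, total, util = [f.strip() for f in line.split(",")]
        gpus.append({"index": int(idx), "mem_used_mb": int(used),
                     "mem_total_mb": int(total), "util_pct": int(util)})
    return gpus

def poll_gpus():
    """One poll, as the OMCC harness might run it between checks."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,memory.used,memory.total,utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)
```

Parsing is split from the subprocess call so the polling cadence and the CSV format can each change independently.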

## Two-Phase Training Model

Training monitoring uses two distinct phases, because the failure modes are different at the start vs. during steady-state training.

### Phase 1: Active Watch

When: First N steps of training (configurable, default 1000 steps).

Why: Most training failures happen early — wrong learning rate, data loading errors, shape mismatches, NaN losses. These need immediate intervention.

How: The Orchestrator actively watches the training log, checking every few seconds.

```mermaid
graph LR
    A[Training Starts] --> B{Step < 1000?}
    B -->|Yes| C[Active Watch]
    C --> D{Healthy?}
    D -->|Yes| B
    D -->|No| E[Intervene]
    E --> F[Fix + Restart]
    F --> A
    B -->|No| G[Transition to Phase 2]

    style C fill:#dbeafe,stroke:#2563eb
    style E fill:#fee2e2,stroke:#dc2626
    style G fill:#dcfce7,stroke:#16a34a
```

Checks during Phase 1:

| Check | Threshold | Action on Failure |
| --- | --- | --- |
| Loss is finite | No NaN/Inf | Stop training, diagnose |
| Loss is decreasing | Avg over 100 steps | Warning, continue watching |
| Gradient norm | < 100.0 | Reduce LR or clip gradients |
| GPU utilization | > 50% | Check data pipeline bottleneck |
| Memory usage | < 95% of available | Reduce batch size |
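
The Phase 1 checks reduce to a predicate over the latest metrics. A minimal sketch, assuming the thresholds from the table; the function signature and metric names are illustrative, not the Orchestrator's actual interface.

```python
import math

def phase1_check(losses, grad_norm, gpu_util, mem_frac):
    """Apply the Phase 1 thresholds to recent training metrics.

    losses: per-step losses, most recent last; grad_norm is the latest
    gradient norm; gpu_util and mem_frac are fractions in [0, 1].
    Returns (healthy, issues) where issues lists the failed checks.
    """
    issues = []
    if any(not math.isfinite(l) for l in losses):
        issues.append("loss NaN/Inf: stop training, diagnose")
    window = losses[-100:]
    if len(window) >= 100 and sum(window[-50:]) / 50 >= sum(window[:50]) / 50:
        issues.append("loss not decreasing over last 100 steps (warning)")
    if grad_norm > 100.0:
        issues.append("gradient norm > 100: reduce LR or clip gradients")
    if gpu_util < 0.5:
        issues.append("GPU util < 50%: check data pipeline bottleneck")
    if mem_frac > 0.95:
        issues.append("memory > 95%: reduce batch size")
    return (len(issues) == 0, issues)
```

The "loss is decreasing" check compares the mean of the last 50 steps against the 50 before them, one plausible reading of "avg over 100 steps".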

### Phase 2: CronCreate Patrol

When: After Phase 1 (training is stable), until training completes.

Why: Stable training rarely fails catastrophically, but can drift slowly (loss plateau, disk full, GPU throttling). Periodic checks are sufficient.

How: CronCreate schedules a patrol script that runs every N minutes (configurable, default 30 min).

```mermaid
graph TD
    A[CronCreate Schedule] --> B[Patrol Script Runs]
    B --> C[Read Latest Log Lines]
    C --> D{All Healthy?}
    D -->|Yes| E[Write OK to agent_health.yaml]
    D -->|No| F{Severity?}
    F -->|Warning| G[Log Warning<br/>Continue Patrol]
    F -->|Critical| H[Alert Orchestrator<br/>Pause Training]

    style A fill:#fef3c7,stroke:#d97706
    style E fill:#dcfce7,stroke:#16a34a
    style H fill:#fee2e2,stroke:#dc2626
```

Checks during Phase 2:

| Check | Frequency | Action on Failure |
| --- | --- | --- |
| Training process alive | Every patrol | Alert, attempt restart |
| Loss still decreasing | Every patrol | Warning after 2 consecutive flat patrols |
| Disk space adequate | Every patrol | Alert if < 10 GB free |
| Checkpoint saved recently | Every patrol | Warning if last checkpoint > 2 hours ago |
| GPU temperature | Every patrol | Throttle alert if > 85 °C |
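
A patrol pass along these lines could be what CronCreate schedules. This is a sketch under assumptions: the crash heuristic, argument names, and severity values are illustrative, not the actual patrol script.

```python
import os
import shutil
import time

def patrol(log_tail, recent_losses, ckpt_path, disk_path="/",
           min_free_gb=10, ckpt_max_hours=2, gpu_temp_c=None):
    """One patrol pass: returns ('ok' | 'warning' | 'critical', messages)."""
    warnings, criticals = [], []
    # Process alive: crude heuristic -- a traceback or an empty loss
    # window in the latest log lines suggests the run died.
    if "Traceback" in log_tail or not recent_losses:
        criticals.append("training process appears dead or crashed")
    # Loss still decreasing since the last patrol?
    if len(recent_losses) >= 2 and recent_losses[-1] >= recent_losses[0]:
        warnings.append("loss flat since last patrol")
    free_gb = shutil.disk_usage(disk_path).free / 1e9
    if free_gb < min_free_gb:
        criticals.append(f"only {free_gb:.1f} GB free")
    if ckpt_path and os.path.exists(ckpt_path):
        age_h = (time.time() - os.path.getmtime(ckpt_path)) / 3600
        if age_h > ckpt_max_hours:
            warnings.append(f"last checkpoint {age_h:.1f}h old")
    if gpu_temp_c is not None and gpu_temp_c > 85:
        warnings.append(f"GPU at {gpu_temp_c}C: throttle risk")
    if criticals:
        return "critical", criticals + warnings
    return ("warning", warnings) if warnings else ("ok", [])
```

A "critical" result is what triggers the alert-Orchestrator/pause-training branch in the diagram; "warning" is logged and the patrol continues.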

### Why not just watch everything all the time?

Active watching consumes Orchestrator context. During Phase 1, this is worth it because failures are frequent and immediate response matters. During Phase 2, the Orchestrator's context is better spent on other tasks (analysis planning, paper prep). CronCreate patrols use zero context — they run externally and only alert if something goes wrong.

## Monitoring Configuration

### thresholds.yaml

```yaml
training:
  phase1_steps: 1000
  loss_nan_action: stop
  gradient_norm_max: 100.0
  gpu_util_min: 0.5
  memory_usage_max: 0.95

patrol:
  interval_minutes: 30
  loss_plateau_patience: 2      # patrols before warning
  disk_min_gb: 10
  checkpoint_max_hours: 2
  gpu_temp_max: 85
```
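
In code, these thresholds might be carried as small settings objects whose defaults mirror the file, so a partial thresholds.yaml still yields a complete config. A sketch, the dataclass layout is an assumption:

```python
from dataclasses import dataclass

@dataclass
class TrainingThresholds:
    phase1_steps: int = 1000
    loss_nan_action: str = "stop"
    gradient_norm_max: float = 100.0
    gpu_util_min: float = 0.5
    memory_usage_max: float = 0.95

@dataclass
class PatrolThresholds:
    interval_minutes: int = 30
    loss_plateau_patience: int = 2  # patrols before warning
    disk_min_gb: int = 10
    checkpoint_max_hours: int = 2
    gpu_temp_max: int = 85

def load_thresholds(raw):
    """Merge an already-parsed thresholds.yaml dict over the defaults."""
    return (TrainingThresholds(**raw.get("training", {})),
            PatrolThresholds(**raw.get("patrol", {})))
```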

### alerts.yaml

```yaml
channels:
  terminal:
    enabled: true                # Print to Orchestrator terminal
  file:
    enabled: true
    path: .omc/research/logs/alerts.log
  webhook:
    enabled: false
    url: ""                      # Optional: Slack, Discord, etc.

severity_routing:
  info: [file]
  warning: [terminal, file]
  critical: [terminal, file, webhook]
```
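
The severity_routing table maps straightforwardly to a dispatch function: look up the channels for a severity, skip any that are disabled, send to the rest. The sender-callable shape below is an illustrative sketch, not the actual alert API.

```python
# Mirrors severity_routing in alerts.yaml.
ROUTING = {
    "info": ["file"],
    "warning": ["terminal", "file"],
    "critical": ["terminal", "file", "webhook"],
}

def route_alert(severity, message, senders, enabled):
    """Send message to every enabled channel routed for this severity.

    senders: {channel: callable(message)}; enabled: {channel: bool},
    mirroring the `enabled` flags in alerts.yaml. Returns channels used.
    """
    used = []
    for channel in ROUTING.get(severity, []):
        if enabled.get(channel) and channel in senders:
            senders[channel](message)
            used.append(channel)
    return used
```

With the defaults above, a critical alert reaches terminal and file but silently skips the disabled webhook.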

## Agent Health via OMCC Heartbeat

The OMCC harness monitors agent health independently of training monitoring.

| Agent | Heartbeat Method | Healthy If |
| --- | --- | --- |
| Coder (tmux) | Process alive + last output timestamp | Output within last 5 min |
| Scout (tmux) | Process alive + last output timestamp | Output within last 10 min |
| Writer (session) | Session active check | Session exists |
| Judge (stateless) | N/A | Responds to codex exec |
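
For the tmux agents, the heartbeat reduces to comparing a last-output timestamp against a per-agent freshness budget. A sketch with the budgets from the table; the function and table names are illustrative.

```python
import time

# Freshness budgets per tmux agent, in seconds (from the table above).
FRESHNESS_S = {"coder": 5 * 60, "scout": 10 * 60}

def heartbeat_ok(agent, process_alive, last_output_ts, now=None):
    """tmux-style health: process alive AND output within the budget.

    Agents without a budget (session/stateless) fall back to a plain
    liveness check.
    """
    now = time.time() if now is None else now
    budget = FRESHNESS_S.get(agent)
    if budget is None:
        return process_alive
    return process_alive and (now - last_output_ts) <= budget
```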

### Health-based, not timeout-based

AutoResearch does not use fixed timeouts for operations. Instead, it checks health signals. A training run that takes 48 hours is fine as long as the process is alive and loss is moving. A training run that takes 5 minutes is wrong if loss is NaN.

This distinction matters: timeout-based monitoring kills healthy long jobs. Health-based monitoring catches unhealthy short jobs.
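
The distinction can be stated as two predicates: a timeout check looks only at elapsed time, while a health check looks only at signals. A minimal illustrative sketch, not AutoResearch's actual logic:

```python
import math

def timeout_based(elapsed_s, limit_s):
    """Passes anything short, kills anything long -- regardless of health."""
    return elapsed_s <= limit_s

def health_based(process_alive, recent_losses):
    """Passes anything whose signals look right -- regardless of duration."""
    return (process_alive
            and len(recent_losses) >= 2
            and all(math.isfinite(l) for l in recent_losses)
            and recent_losses[-1] < recent_losses[0])
```

A healthy 48-hour run passes the health check but would fail a 24-hour timeout; a 5-minute run with NaN loss passes the timeout but fails the health check.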
