
Stage 4: Training

The training stage launches and monitors the full training run. This is typically the longest stage, running from hours to days. The monitoring system uses a two-phase approach — active watch at start, periodic patrol after stabilization.

Entering This Stage

What you have:

  • Working, tested code (passed implementation gate)
  • Training configuration (experiments/exp-001/config.yaml)
  • Evaluation pipeline ready
  • Monitoring thresholds (config/thresholds.yaml)

What you don't have yet:

  • Trained model checkpoints
  • Final metrics
  • Training logs

Steps

```mermaid
graph TD
    A[1. Pre-Flight Check] --> B[2. Launch Training]
    B --> C[3. Phase 1: Active Watch]
    C --> D{Stable?}
    D -->|Yes| E[4. Phase 2: CronCreate Patrol]
    D -->|No| F[Intervene]
    F -->|fixable| B
    F -->|design issue| ESC[Escalate]
    E --> G{Complete?}
    G -->|No| E
    G -->|Yes| H[5. Post-Training]
    H --> I[Gate: Advance to Analysis]

    style C fill:#dbeafe,stroke:#2563eb
    style E fill:#fef3c7,stroke:#d97706
    style F fill:#fee2e2,stroke:#dc2626
    style H fill:#dcfce7,stroke:#16a34a
```

1. Pre-Flight Check

Agent: Coder (Codex, tmux worker)

Before launching the full training run:

| Check | Verify |
| --- | --- |
| GPU availability | All allocated GPUs are free and accessible |
| Disk space | Enough for checkpoints + logs (estimated) |
| Data integrity | Dataset accessible, checksums match |
| Config validity | All hyperparameters set explicitly; no unintended defaults |
| Checkpoint dir | Writable, enough space |
| Monitoring | thresholds.yaml loaded, alert channels configured |

Pre-flight prevents wasted compute

A training run that fails at step 50,000 because of a full disk wastes hours of GPU time. Pre-flight checks catch these issues in seconds.
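
The pre-flight logic can be sketched in a few lines. This is a minimal illustration, not AutoResearch code: `preflight`, `check_config`, and `check_disk_space` are hypothetical names, and a real script would add the remaining checks (GPU probing, checksum verification) alongside.

```python
import os
import shutil

# Hypothetical pre-flight sketch; function names are illustrative.
def check_disk_space(path, required_gb):
    """Verify there is enough free space for checkpoints and logs."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb

def check_config(config, required_keys):
    """Return hyperparameters missing from the training section."""
    training = config.get("training", {})
    return [k for k in required_keys if k not in training]

def preflight(config, ckpt_dir, required_gb):
    """Collect human-readable failures; an empty list means launch."""
    failures = []
    missing = check_config(config, ["total_steps", "batch_size", "learning_rate"])
    if missing:
        failures.append(f"config missing: {missing}")
    if not os.access(ckpt_dir, os.W_OK):
        failures.append(f"checkpoint dir not writable: {ckpt_dir}")
    elif not check_disk_space(ckpt_dir, required_gb):
        failures.append(f"insufficient disk space in {ckpt_dir}")
    return failures
```

Running this takes seconds, which is the point: each failure it returns is a run that would otherwise have died mid-training.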

2. Launch Training

Agent: Coder (Codex, tmux worker)

The Coder launches training in its persistent tmux session:

  • Training runs as a background process with structured logging
  • Logs are written to experiments/exp-001/log.jsonl
  • Checkpoints save according to schedule in config.yaml

```yaml
# experiments/exp-001/config.yaml (training section)
training:
  total_steps: 100000
  batch_size: 64
  learning_rate: 3e-4
  lr_schedule: cosine_warmup
  warmup_steps: 2000
  checkpoint_every: 5000
  log_every: 100
  eval_every: 5000
```
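
The `lr_schedule: cosine_warmup` entry corresponds to linear warmup followed by cosine decay. A sketch of that schedule as a function of step (`lr_at` is an illustrative name, not AutoResearch code):

```python
import math

def lr_at(step, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

With the config above, the learning rate climbs to 3e-4 over the first 2,000 steps and decays back toward zero by step 100,000.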

3. Phase 1: Active Watch

Agent: Orchestrator (active monitoring)

For the first N steps (default: 1000), the Orchestrator actively watches the training log:

| Check | Frequency | Threshold | Action |
| --- | --- | --- | --- |
| Loss finite | Every 100 steps | No NaN/Inf | Stop, diagnose |
| Loss decreasing | Every 100 steps | Avg trend negative | Warning |
| Gradient norm | Every 100 steps | < 100.0 | Reduce LR / clip |
| GPU utilization | Every 100 steps | > 50% | Check data pipeline |
| Memory usage | Every 100 steps | < 95% | Reduce batch size |
| Step throughput | Every 100 steps | > 50% of expected | Check bottleneck |

```text
# Active watch log (Orchestrator context)
Step 100: loss=9.87, grad_norm=12.3, gpu_util=94%, mem=62GB ✓
Step 200: loss=8.45, grad_norm=8.7,  gpu_util=93%, mem=62GB ✓
Step 300: loss=7.12, grad_norm=6.2,  gpu_util=95%, mem=62GB ✓
...
Step 1000: loss=4.31, grad_norm=3.1, gpu_util=94%, mem=62GB ✓
→ Training stable. Transitioning to Phase 2.
```

Phase 1 is short but critical

Most training failures happen in the first few hundred steps. Wrong learning rate, data format errors, numerical instability — these surface early. The investment of actively watching 1000 steps pays for itself many times over.
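
Phase 1's per-step checks can be sketched over parsed `log.jsonl` records. Threshold values mirror the table above; the function names and record fields are illustrative, not AutoResearch code:

```python
import math

def check_step(record, grad_norm_max=100.0, mem_frac_max=0.95):
    """Return warnings for one parsed log record; empty list means healthy."""
    warnings = []
    if not math.isfinite(record["loss"]):
        warnings.append("loss is NaN/Inf: stop and diagnose")
    if record["grad_norm"] >= grad_norm_max:
        warnings.append("gradient norm above threshold: reduce LR or clip")
    if record["mem_frac"] >= mem_frac_max:
        warnings.append("memory near capacity: reduce batch size")
    return warnings

def loss_trend(losses):
    """Average step-to-step change; negative means loss is decreasing."""
    deltas = [b - a for a, b in zip(losses, losses[1:])]
    return sum(deltas) / len(deltas)
```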

4. Phase 2: CronCreate Patrol

Agent: CronCreate scheduled script (external)

After Phase 1, the Orchestrator hands off monitoring to a CronCreate patrol:

```yaml
# CronCreate patrol configuration
schedule: "*/30 * * * *"  # Every 30 minutes
script: patrol_training.sh
checks:
  - process_alive
  - loss_trend
  - disk_space
  - checkpoint_freshness
  - gpu_temperature
```

The patrol script runs independently — it does not consume Orchestrator context. It writes results to logs/agent_health.yaml and only alerts if something goes wrong.
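
The patrol's structure is simple: run each named check, and only surface anything when one fails. A sketch (in Python rather than the shell script above; `patrol` and `checkpoint_freshness` are illustrative names, not AutoResearch code):

```python
import os
import time

def checkpoint_freshness(ckpt_dir, max_age_s, now=None):
    """True if the newest file in the checkpoint dir is recent enough."""
    now = time.time() if now is None else now
    mtimes = [os.path.getmtime(os.path.join(ckpt_dir, f))
              for f in os.listdir(ckpt_dir)]
    return bool(mtimes) and now - max(mtimes) <= max_age_s

def patrol(checks):
    """Run named check callables; the result would be written to agent_health.yaml."""
    failures = [name for name, fn in checks.items() if not fn()]
    return {"status": "healthy" if not failures else "unhealthy",
            "failed_checks": failures}
```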

```mermaid
graph LR
    C[CronCreate] -->|every 30 min| P[Patrol Script]
    P -->|healthy| L[agent_health.yaml]
    P -->|unhealthy| A[Alert Orchestrator]
    A --> O[Orchestrator Wakes Up]
    O --> D{Fixable?}
    D -->|yes| F[Coder fixes]
    D -->|no| H[Human alert]

    style C fill:#fef3c7,stroke:#d97706
    style A fill:#fee2e2,stroke:#dc2626
    style L fill:#dcfce7,stroke:#16a34a
```

Why CronCreate instead of continuous watch?

Continuous watching during Phase 2 would fill the Orchestrator's context with thousands of "still training, everything fine" checks. CronCreate patrols externally and only consume context when something needs attention. This frees the Orchestrator to work on other tasks (planning analysis, preparing for writing).

5. Post-Training

Agent: Coder (Codex, tmux worker)

After training completes:

  1. Run final evaluation on all benchmarks
  2. Extract metrics into experiments/exp-001/results.yaml
  3. Save final checkpoint
  4. Generate training curves (loss, learning rate over time)
  5. Run baseline evaluations if not done yet

```yaml
# experiments/exp-001/results.yaml
experiment: exp-001
method: flash_recurrent_attention
total_steps: 100000
training_time_hours: 18.5

metrics:
  perplexity:
    wikitext103: 17.8
    lambada: 22.1
  throughput:
    seq_1024: 14200
    seq_4096: 11800
    seq_16384: 8900
  memory_peak_gb: 38.2

baselines:
  vanilla_attention:
    perplexity_wikitext103: 17.9
    throughput_seq_4096: 8200
  retnet:
    perplexity_wikitext103: 18.5
    throughput_seq_4096: 10100
```
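
Comparing these numbers against the baselines is simple arithmetic; a one-line helper (illustrative, not AutoResearch code) makes the deltas explicit:

```python
def relative_change(ours, baseline):
    """Signed percent change versus a baseline value."""
    return 100.0 * (ours - baseline) / baseline

# From results.yaml: throughput at seq 4096 vs vanilla attention is
# relative_change(11800, 8200), roughly +44%. For perplexity, a
# negative change is an improvement (lower is better).
```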

Gate

| Gate Type | Recommended | Behavior |
| --- | --- | --- |
| human | Rarely | Only if you want to inspect before analysis |
| auto-judge | Possible | Judge checks results completeness |
| auto | Recommended | Training output is data; analysis judges quality |

Auto gate for training

Training produces raw data — the quality judgment happens in the Analysis stage. The training gate just checks that training completed successfully and results were extracted. This is a good candidate for auto.

Error Handling

| Error | Phase | Recovery |
| --- | --- | --- |
| NaN loss | Phase 1 | Stop, reduce LR, restart |
| OOM (out of memory) | Phase 1 | Reduce batch size, restart |
| Loss plateau | Phase 2 | Patrol alerts → Orchestrator reduces LR |
| Disk full | Phase 2 | Patrol alerts → Coder cleans old checkpoints |
| Process killed | Phase 2 | Patrol alerts → Coder restarts from checkpoint |
| GPU error | Either | Alert → check hardware, restart on different GPU |
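
Restarting from a checkpoint (the "Process killed" row) requires finding the latest one. A sketch, assuming a hypothetical `step_<N>.pt` naming scheme that is not specified by the pipeline:

```python
import os
import re

def latest_checkpoint(ckpt_dir):
    """Return (step, path) for the highest-step checkpoint, or None.

    Assumes files are named step_<N>.pt; this naming is an assumption,
    not part of the AutoResearch spec.
    """
    best = None
    for name in os.listdir(ckpt_dir):
        m = re.fullmatch(r"step_(\d+)\.pt", name)
        if m:
            step = int(m.group(1))
            if best is None or step > best[0]:
                best = (step, os.path.join(ckpt_dir, name))
    return best
```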

Design issues during training

If training reveals a design problem (e.g., the method fundamentally doesn't converge), the Orchestrator does NOT auto-fix this. It escalates to the human with a clear report: "Training suggests the design needs revision. Loss at step 50k is 2x higher than expected baseline."

Outputs Summary

| File | Contents |
| --- | --- |
| experiments/exp-001/log.jsonl | Full structured training log |
| experiments/exp-001/results.yaml | Final metrics and baseline comparison |
| experiments/exp-001/checkpoints/ | Model checkpoints |
| logs/agent_health.yaml | Training health history |

Next Stage

When the gate passes, the pipeline advances to Analysis with complete training results.

AutoResearch — Multi-agent Deep Learning Research System