Stage 4: Training
The training stage launches and monitors the full training run. This is typically the longest stage, lasting anywhere from hours to days. The monitoring system uses a two-phase approach — active watch at the start, periodic patrol after stabilization.
Entering This Stage
What you have:
- Working, tested code (passed implementation gate)
- Training configuration (`experiments/exp-001/config.yaml`)
- Evaluation pipeline ready
- Monitoring thresholds (`config/thresholds.yaml`)
What you don't have yet:
- Trained model checkpoints
- Final metrics
- Training logs
Steps
```mermaid
graph TD
    A[1. Pre-Flight Check] --> B[2. Launch Training]
    B --> C[3. Phase 1: Active Watch]
    C --> D{Stable?}
    D -->|Yes| E[4. Phase 2: CronCreate Patrol]
    D -->|No| F[Intervene]
    F -->|fixable| B
    F -->|design issue| ESC[Escalate]
    E --> G{Complete?}
    G -->|No| E
    G -->|Yes| H[5. Post-Training]
    H --> I[Gate: Advance to Analysis]
    style C fill:#dbeafe,stroke:#2563eb
    style E fill:#fef3c7,stroke:#d97706
    style F fill:#fee2e2,stroke:#dc2626
    style H fill:#dcfce7,stroke:#16a34a
```

1. Pre-Flight Check
Agent: Coder (Codex, tmux worker)
Before launching the full training run:
| Check | Verify |
|---|---|
| GPU availability | All allocated GPUs are free and accessible |
| Disk space | Enough for checkpoints + logs (estimated) |
| Data integrity | Dataset accessible, checksums match |
| Config validity | All hyperparameters set explicitly, no unintended defaults |
| Checkpoint dir | Writable, enough space |
| Monitoring | thresholds.yaml loaded, alert channels configured |
Pre-flight prevents wasted compute
A training run that fails at step 50,000 because of a full disk wastes hours of GPU time. Pre-flight checks catch these issues in seconds.
2. Launch Training
Agent: Coder (Codex, tmux worker)
The Coder launches training in its persistent tmux session:
- Training runs as a background process with structured logging
- Logs are written to `experiments/exp-001/log.jsonl`
- Checkpoints save according to the schedule in `config.yaml`
```yaml
# experiments/exp-001/config.yaml (training section)
training:
  total_steps: 100000
  batch_size: 64
  learning_rate: 3e-4
  lr_schedule: cosine_warmup
  warmup_steps: 2000
  checkpoint_every: 5000
  log_every: 100
  eval_every: 5000
```

3. Phase 1: Active Watch
Agent: Orchestrator (active monitoring)
For the first N steps (default: 1000), the Orchestrator actively watches the training log:
| Check | Frequency | Threshold | Action |
|---|---|---|---|
| Loss finite | Every 100 steps | No NaN/Inf | Stop, diagnose |
| Loss decreasing | Every 100 steps | Avg trend negative | Warning |
| Gradient norm | Every 100 steps | < 100.0 | Reduce LR / clip |
| GPU utilization | Every 100 steps | > 50% | Check data pipeline |
| Memory usage | Every 100 steps | < 95% | Reduce batch size |
| Step throughput | Every 100 steps | > 50% of expected | Check bottleneck |
```
# Active watch log (Orchestrator context)
Step 100: loss=9.87, grad_norm=12.3, gpu_util=94%, mem=62GB ✓
Step 200: loss=8.45, grad_norm=8.7, gpu_util=93%, mem=62GB ✓
Step 300: loss=7.12, grad_norm=6.2, gpu_util=95%, mem=62GB ✓
...
Step 1000: loss=4.31, grad_norm=3.1, gpu_util=94%, mem=62GB ✓
→ Training stable. Transitioning to Phase 2.
```

Phase 1 is short but critical
Most training failures happen in the first few hundred steps. Wrong learning rate, data format errors, numerical instability — these surface early. The investment of actively watching 1000 steps pays for itself many times over.
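Three of the table's checks (loss finite, gradient norm, loss decreasing) can be sketched as a single classifier over the most recent window of steps; the function name and return values are illustrative, and the window is assumed to hold at least two samples:

```python
import math

def check_window(losses: list[float], grad_norms: list[float],
                 grad_norm_max: float = 100.0) -> str:
    """Classify the last window of steps: 'stop', 'warn', or 'ok'."""
    # Loss finite: any NaN/Inf means stop and diagnose immediately.
    if any(not math.isfinite(x) for x in losses):
        return "stop"
    # Gradient norm: above threshold suggests reducing LR or clipping.
    if max(grad_norms) >= grad_norm_max:
        return "warn"
    # Loss decreasing: average of the second half of the window
    # should sit below the average of the first half.
    mid = len(losses) // 2
    if sum(losses[mid:]) / (len(losses) - mid) >= sum(losses[:mid]) / mid:
        return "warn"
    return "ok"
```
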
4. Phase 2: CronCreate Patrol
Agent: CronCreate scheduled script (external)
After Phase 1, the Orchestrator hands off monitoring to a CronCreate patrol:
```yaml
# CronCreate patrol configuration
schedule: "*/30 * * * *"  # Every 30 minutes
script: patrol_training.sh
checks:
  - process_alive
  - loss_trend
  - disk_space
  - checkpoint_freshness
  - gpu_temperature
```

The patrol script runs independently — it does not consume Orchestrator context. It writes results to `logs/agent_health.yaml` and only alerts if something goes wrong.
```mermaid
graph LR
    C[CronCreate] -->|every 30 min| P[Patrol Script]
    P -->|healthy| L[agent_health.yaml]
    P -->|unhealthy| A[Alert Orchestrator]
    A --> O[Orchestrator Wakes Up]
    O --> D{Fixable?}
    D -->|yes| F[Coder fixes]
    D -->|no| H[Human alert]
    style C fill:#fef3c7,stroke:#d97706
    style A fill:#fee2e2,stroke:#dc2626
    style L fill:#dcfce7,stroke:#16a34a
```

Why CronCreate instead of continuous watch?
Continuous watching during Phase 2 would fill the Orchestrator's context with thousands of "still training, everything fine" checks. CronCreate patrols externally and only consume context when something needs attention. This frees the Orchestrator to work on other tasks (planning analysis, preparing for writing).
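The alert-only-on-failure contract can be sketched as follows; the function name, record shape, and check registry are illustrative stand-ins for whatever `patrol_training.sh` actually does:

```python
import time

def run_patrol(checks: dict) -> dict:
    """Run each named check; return a health record to persist.

    Only a non-empty 'alerts' list needs the Orchestrator's attention;
    a healthy patrol just records its results and wakes nobody up.
    """
    results, alerts = {}, []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception as exc:  # a crashing check counts as unhealthy
            ok = False
            alerts.append(f"{name}: {exc}")
        else:
            if not ok:
                alerts.append(f"{name}: failed")
        results[name] = "ok" if ok else "fail"
    return {"ts": time.time(), "checks": results, "alerts": alerts}
```

Each check is a zero-argument callable returning truthy for healthy; the record could then be appended to `logs/agent_health.yaml`, with an alert sent only when `alerts` is non-empty.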
5. Post-Training
Agent: Coder (Codex, tmux worker)
After training completes:
- Run final evaluation on all benchmarks
- Extract metrics into `experiments/exp-001/results.yaml`
- Save final checkpoint
- Generate training curves (loss, learning rate over time)
- Run baseline evaluations if not done yet
```yaml
# experiments/exp-001/results.yaml
experiment: exp-001
method: flash_recurrent_attention
total_steps: 100000
training_time_hours: 18.5
metrics:
  perplexity:
    wikitext103: 17.8
    lambada: 22.1
  throughput:
    seq_1024: 14200
    seq_4096: 11800
    seq_16384: 8900
  memory_peak_gb: 38.2
baselines:
  vanilla_attention:
    perplexity_wikitext103: 17.9
    throughput_seq_4096: 8200
  retnet:
    perplexity_wikitext103: 18.5
    throughput_seq_4096: 10100
```

Gate
| Gate Type | Recommended | Behavior |
|---|---|---|
| human | Rarely | Only if you want to inspect before analysis |
| auto-judge | Possible | Judge checks results completeness |
| auto | Recommended | Training output is data; analysis judges quality |
Auto gate for training
Training produces raw data — the quality judgment happens in the Analysis stage. The training gate just checks that training completed successfully and results were extracted. This is a good candidate for auto.
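A completeness-only auto gate could be as small as this sketch (the function name and the exact set of required keys are assumptions):

```python
def gate_check(results: dict,
               required=("experiment", "total_steps", "metrics", "baselines")) -> bool:
    """Auto gate: pass iff the results record is complete.

    No quality judgment happens here; the Analysis stage decides whether
    the numbers are good. This only checks that they exist and are non-empty.
    """
    return all(results.get(key) not in (None, {}, []) for key in required)
```
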
Error Handling
| Error | Phase | Recovery |
|---|---|---|
| NaN loss | Phase 1 | Stop, reduce LR, restart |
| OOM (out of memory) | Phase 1 | Reduce batch size, restart |
| Loss plateau | Phase 2 | Patrol alerts → Orchestrator reduces LR |
| Disk full | Phase 2 | Patrol alerts → Coder cleans old checkpoints |
| Process killed | Phase 2 | Patrol alerts → Coder restarts from checkpoint |
| GPU error | Either | Alert → check hardware, restart on different GPU |
Design issues during training
If training reveals a design problem (e.g., the method fundamentally doesn't converge), the Orchestrator does NOT auto-fix this. It escalates to the human with a clear report: "Training suggests the design needs revision. Loss at step 50k is 2x higher than expected baseline."
Outputs Summary
| File | Contents |
|---|---|
| `experiments/exp-001/log.jsonl` | Full structured training log |
| `experiments/exp-001/results.yaml` | Final metrics and baseline comparison |
| `experiments/exp-001/checkpoints/` | Model checkpoints |
| `logs/agent_health.yaml` | Training health history |
Next Stage
When the gate passes, the pipeline advances to Analysis with complete training results.