Stage 4: Training
The training stage launches and monitors the full training run. This is typically the longest stage, lasting anywhere from hours to days. The monitoring system uses a two-phase approach — active watch at the start, periodic patrol after stabilization.
Entering This Stage
What you have:
- Working, tested code (passed implementation gate)
- Training configuration (`experiments/exp-001/config.yaml`)
- Evaluation pipeline ready
- Monitoring thresholds (`config/thresholds.yaml`)
What you don't have yet:
- Trained model checkpoints
- Final metrics
- Training logs
Steps
```mermaid
graph TD
    A[1. Pre-Flight Check] --> B[2. Launch Training]
    B --> C[3. Phase 1: Active Watch]
    C --> D{Stable?}
    D -->|Yes| E[4. Phase 2: CronCreate Patrol]
    D -->|No| F[Intervene]
    F -->|fixable| B
    F -->|design issue| ESC[Escalate]
    E --> G{Complete?}
    G -->|No| E
    G -->|Yes| H[5. Post-Training]
    H --> I[Gate: Advance to Analysis]
    style C fill:#dbeafe,stroke:#2563eb
    style E fill:#fef3c7,stroke:#d97706
    style F fill:#fee2e2,stroke:#dc2626
    style H fill:#dcfce7,stroke:#16a34a
```

1. Pre-Flight Check
Agent: Coder (Codex, tmux worker)
Before launching the full training run:
| Check | Verify |
|---|---|
| GPU availability | All allocated GPUs are free and accessible |
| Disk space | Enough for checkpoints + logs (estimated) |
| Data integrity | Dataset accessible, checksums match |
| Config validity | All hyperparameters set explicitly, no unintended defaults |
| Checkpoint dir | Writable, enough space |
| Monitoring | thresholds.yaml loaded, alert channels configured |
Pre-flight prevents wasted compute
A training run that fails at step 50,000 because of a full disk wastes hours of GPU time. Pre-flight checks catch these issues in seconds.
2. Launch Training
Agent: Coder (Codex, tmux worker)
The Coder launches training in its persistent tmux session:
- Training runs as a background process with structured logging
- Logs are written to `experiments/exp-001/log.jsonl`
- Checkpoints save according to the schedule in `config.yaml`
```yaml
# experiments/exp-001/config.yaml (training section)
training:
  total_steps: 100000
  batch_size: 64
  learning_rate: 3e-4
  lr_schedule: cosine_warmup
  warmup_steps: 2000
  checkpoint_every: 5000
  log_every: 100
  eval_every: 5000
```

3. Phase 1: Active Watch
Agent: Orchestrator (active monitoring)
For the first N steps (default: 1000), the Orchestrator actively watches the training log:
| Check | Frequency | Threshold | Action |
|---|---|---|---|
| Loss finite | Every 100 steps | No NaN/Inf | Stop, diagnose |
| Loss decreasing | Every 100 steps | Avg trend negative | Warning |
| Gradient norm | Every 100 steps | < 100.0 | Reduce LR / clip |
| GPU utilization | Every 100 steps | > 50% | Check data pipeline |
| Memory usage | Every 100 steps | < 95% | Reduce batch size |
| Step throughput | Every 100 steps | > 50% of expected | Check bottleneck |
```
# Active watch log (Orchestrator context)
Step 100: loss=9.87, grad_norm=12.3, gpu_util=94%, mem=62GB ✓
Step 200: loss=8.45, grad_norm=8.7, gpu_util=93%, mem=62GB ✓
Step 300: loss=7.12, grad_norm=6.2, gpu_util=95%, mem=62GB ✓
...
Step 1000: loss=4.31, grad_norm=3.1, gpu_util=94%, mem=62GB ✓
→ Training stable. Transitioning to Phase 2.
```

Phase 1 is short but critical
Most training failures happen in the first few hundred steps. Wrong learning rate, data format errors, numerical instability — these surface early. The investment of actively watching 1000 steps pays for itself many times over.
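Three of the table's checks (loss finite, gradient norm, loss decreasing) can be sketched as a single classifier over the most recent window of steps; the function name and return values are illustrative, and the window is assumed to hold at least two samples:

```python
import math

def check_window(losses: list[float], grad_norms: list[float],
                 grad_norm_max: float = 100.0) -> str:
    """Classify the last window of steps: 'stop', 'warn', or 'ok'."""
    # Loss finite: any NaN/Inf means stop and diagnose immediately.
    if any(not math.isfinite(x) for x in losses):
        return "stop"
    # Gradient norm: above threshold suggests reducing LR or clipping.
    if max(grad_norms) >= grad_norm_max:
        return "warn"
    # Loss decreasing: average of the second half of the window
    # should sit below the average of the first half.
    mid = len(losses) // 2
    if sum(losses[mid:]) / (len(losses) - mid) >= sum(losses[:mid]) / mid:
        return "warn"
    return "ok"
```
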
4. Phase 2: CronCreate Patrol
Agent: CronCreate scheduled script (external)
After Phase 1, the Orchestrator hands off monitoring to a CronCreate patrol:
```yaml
# CronCreate patrol configuration
schedule: "*/30 * * * *"  # Every 30 minutes
script: patrol_training.sh
checks:
  - process_alive
  - loss_trend
  - disk_space
  - checkpoint_freshness
  - gpu_temperature
```

The patrol script runs independently — it does not consume Orchestrator context. It writes results to `logs/agent_health.yaml` and only alerts if something goes wrong.
```mermaid
graph LR
    C[CronCreate] -->|every 30 min| P[Patrol Script]
    P -->|healthy| L[agent_health.yaml]
    P -->|unhealthy| A[Alert Orchestrator]
    A --> O[Orchestrator Wakes Up]
    O --> D{Fixable?}
    D -->|yes| F[Coder fixes]
    D -->|no| H[Human alert]
    style C fill:#fef3c7,stroke:#d97706
    style A fill:#fee2e2,stroke:#dc2626
    style L fill:#dcfce7,stroke:#16a34a
```

Why CronCreate instead of continuous watch?
Continuous watching during Phase 2 would fill the Orchestrator's context with thousands of "still training, everything fine" checks. CronCreate patrols externally and only consume context when something needs attention. This frees the Orchestrator to work on other tasks (planning analysis, preparing for writing).
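The alert-only-on-failure contract can be sketched as follows; the function name, record shape, and check registry are illustrative stand-ins for whatever `patrol_training.sh` actually does:

```python
import time

def run_patrol(checks: dict) -> dict:
    """Run each named check; return a health record to persist.

    Only a non-empty 'alerts' list needs the Orchestrator's attention;
    a healthy patrol just records its results and wakes nobody up.
    """
    results, alerts = {}, []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception as exc:  # a crashing check counts as unhealthy
            ok = False
            alerts.append(f"{name}: {exc}")
        else:
            if not ok:
                alerts.append(f"{name}: failed")
        results[name] = "ok" if ok else "fail"
    return {"ts": time.time(), "checks": results, "alerts": alerts}
```

Each check is a zero-argument callable returning truthy for healthy; the record could then be appended to `logs/agent_health.yaml`, with an alert sent only when `alerts` is non-empty.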
5. Post-Training
Agent: Coder (Codex, tmux worker)
After training completes:
- Run final evaluation on all benchmarks
- Extract metrics into `experiments/exp-001/results.yaml`
- Save final checkpoint
- Generate training curves (loss, learning rate over time)
- Run baseline evaluations if not done yet
```yaml
# experiments/exp-001/results.yaml
experiment: exp-001
method: flash_recurrent_attention
total_steps: 100000
training_time_hours: 18.5
metrics:
  perplexity:
    wikitext103: 17.8
    lambada: 22.1
  throughput:
    seq_1024: 14200
    seq_4096: 11800
    seq_16384: 8900
  memory_peak_gb: 38.2
baselines:
  vanilla_attention:
    perplexity_wikitext103: 17.9
    throughput_seq_4096: 8200
  retnet:
    perplexity_wikitext103: 18.5
    throughput_seq_4096: 10100
```

Gate
| Gate Type | Recommended | Behavior |
|---|---|---|
| human | Rarely | Only if you want to inspect before analysis |
| auto-judge | Possible | Judge checks results completeness |
| auto | Recommended | Training output is data; analysis judges quality |
Auto gate for training
Training produces raw data — the quality judgment happens in the Analysis stage. The training gate just checks that training completed successfully and results were extracted. This is a good candidate for auto.
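A completeness-only auto gate could be as small as this sketch (the function name and the exact set of required keys are assumptions):

```python
def gate_check(results: dict,
               required=("experiment", "total_steps", "metrics", "baselines")) -> bool:
    """Auto gate: pass iff the results record is complete.

    No quality judgment happens here; the Analysis stage decides whether
    the numbers are good. This only checks that they exist and are non-empty.
    """
    return all(results.get(key) not in (None, {}, []) for key in required)
```
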
Error Handling
| Error | Phase | Recovery |
|---|---|---|
| NaN loss | Phase 1 | Stop, reduce LR, restart |
| OOM (out of memory) | Phase 1 | Reduce batch size, restart |
| Loss plateau | Phase 2 | Patrol alerts → Orchestrator reduces LR |
| Disk full | Phase 2 | Patrol alerts → Coder cleans old checkpoints |
| Process killed | Phase 2 | Patrol alerts → Coder restarts from checkpoint |
| GPU error | Either | Alert → check hardware, restart on different GPU |
Design issues during training
If training reveals a design problem (e.g., the method fundamentally doesn't converge), the Orchestrator does NOT auto-fix this. It escalates to the human with a clear report: "Training suggests the design needs revision. Loss at step 50k is 2x higher than expected baseline."
Outputs Summary
| File | Contents |
|---|---|
| `experiments/exp-001/log.jsonl` | Full structured training log |
| `experiments/exp-001/results.yaml` | Final metrics and baseline comparison |
| `experiments/exp-001/checkpoints/` | Model checkpoints |
| `logs/agent_health.yaml` | Training health history |
Next Stage
When the gate passes, the pipeline advances to Analysis with complete training results.