Stage 3: Implementation
The implementation stage translates the experiment plan into working, tested code. The Coder follows the Planner's task decomposition and the Orchestrator reviews the output.
Entering This Stage
What you have:
- Complete experiment plan (design/plan.md)
- Task decomposition with dependencies (design/tasks.yaml)
- Baseline sources with code links (design/baselines.yaml)
- Metrics and evaluation spec (design/metrics.yaml)
What you don't have yet:
- Working code
- Training scripts
- Evaluation pipeline
Steps
```mermaid
graph TD
    A[1. Environment Setup] --> B[2. Core Implementation]
    B --> C[3. Baseline Setup]
    C --> D[4. Evaluation Pipeline]
    D --> E[5. Integration Testing]
    E --> F{Tests Pass?}
    F -->|Yes| G[6. Code Review]
    F -->|No| H[Ralph Fix Loop]
    H --> E
    style A fill:#fef3c7,stroke:#d97706
    style B fill:#fef3c7,stroke:#d97706
    style G fill:#f9f0ff,stroke:#7c3aed
    style H fill:#fee2e2,stroke:#dc2626
```

1. Environment Setup
Agent: Coder (Codex, tmux worker)
The Coder sets up the development environment:
- Create conda environment with pinned dependencies
- Verify CUDA availability and GPU access
- Download and prepare datasets
- Set up project directory structure
```
experiments/
├── exp-001/
│   ├── config.yaml   # Hyperparameters
│   └── ...
src/
├── models/
├── data/
├── training/
└── evaluation/
```

Environment problems are caught early
Most environment issues (wrong CUDA version, missing libraries, data download failures) surface here. The Coder's ralph loop handles these automatically. If setup fails after 3 retries, it escalates to the Orchestrator.
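An early environment check like this can be sketched in a few lines. This is illustrative only, not the Coder's actual setup script; the package list and the nvidia-smi probe are assumptions:

```python
import importlib.util
import shutil

def verify_environment(required_packages=("numpy", "torch")):
    """Collect environment problems; an empty list means setup looks healthy."""
    problems = []
    for name in required_packages:
        # find_spec checks importability without actually importing the package
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing package: {name}")
    # A missing nvidia-smi usually means no GPU driver is installed
    if shutil.which("nvidia-smi") is None:
        problems.append("nvidia-smi not on PATH: GPU likely unavailable")
    return problems
```

If a check like this still reports problems after the retry budget, the list of problems becomes the error context handed to the Orchestrator.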
2. Core Implementation
Agent: Coder (Codex, tmux worker)
The Coder implements tasks from design/tasks.yaml in dependency order:
| Task | What's Built |
|---|---|
| Model architecture | Core modules, layers, attention mechanism |
| Training loop | Forward pass, loss computation, optimizer, scheduler |
| Data pipeline | DataLoader, tokenization, batching |
| Checkpointing | Save/load model state, resume training |
| Logging | Structured JSONL logging for monitoring |
The Coder follows the Planner's specification exactly. When the spec is ambiguous, the Coder reports the ambiguity to the Orchestrator rather than making design decisions.
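The structured JSONL logging row above amounts to one JSON object per line. A minimal sketch, with field names that are illustrative assumptions:

```python
import json
import time

def log_step(fh, step, loss, lr):
    """Append one JSON object per line so monitors can tail the log file."""
    record = {"ts": time.time(), "step": step, "loss": loss, "lr": lr}
    fh.write(json.dumps(record) + "\n")
```

Appending to a file such as logs/train.jsonl keeps each record independently parseable, which is what makes tail-based monitoring possible.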
3. Baseline Setup
Agent: Coder (Codex, tmux worker)
For each baseline in design/baselines.yaml:
- Clone the reference implementation
- Adapt to use the same data pipeline and evaluation
- Verify that reproduced numbers fall within 5% of the reported numbers
Baseline reproduction is a quality gate
If a baseline can't be reproduced within 5% of reported numbers, the Coder flags this. The Orchestrator decides whether to use the reproduction as-is, debug further, or replace the baseline.
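The 5% tolerance check is simple to state precisely. This helper is a sketch of the comparison, not the tool's actual gate logic:

```python
def reproduces(reported, reproduced, rel_tol=0.05):
    """True when the reproduced metric is within rel_tol of the reported value."""
    return abs(reproduced - reported) <= rel_tol * abs(reported)
```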
4. Evaluation Pipeline
Agent: Coder (Codex, tmux worker)
Build the evaluation pipeline per design/metrics.yaml:
- Perplexity computation
- Throughput benchmarking (tokens/second)
- Memory profiling (peak GPU memory)
- Automated result extraction into
results.yaml
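For the perplexity item above, the standard definition is the exponential of the mean per-token negative log-likelihood. A minimal sketch:

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))
```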
5. Integration Testing
Agent: Coder (Codex, tmux worker)
Run all components together:
- Short training run (100 steps) to verify the full pipeline
- Check loss decreases
- Verify checkpoint save/load cycle
- Run evaluation on the short-trained model
- Verify all metrics are computed and logged correctly
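The loss-decrease check can be made mechanical by comparing early and late windows of the short run. A sketch, assuming step losses are collected in a list:

```python
def loss_decreased(losses, window=10):
    """Compare the mean loss of the first and last `window` steps."""
    if len(losses) < 2 * window:
        raise ValueError("run too short for a meaningful comparison")
    early = sum(losses[:window]) / window
    late = sum(losses[-window:]) / window
    return late < early
```

Windowed means are less noisy than comparing single first/last steps, which matters on a run as short as 100 steps.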
```yaml
# Quick verification checklist
tests:
  - name: "forward_pass"
    status: pass
    detail: "Output shape correct, gradients flow"
  - name: "short_training"
    status: pass
    detail: "100 steps, loss decreased from 11.2 to 8.7"
  - name: "checkpoint_roundtrip"
    status: pass
    detail: "Save at step 50, load, loss matches"
  - name: "evaluation_pipeline"
    status: pass
    detail: "All metrics computed, output matches schema"
```

6. Code Review
Agent: Orchestrator (Claude Opus) reviews Coder's (Codex) code
Cross-LLM review in action
The Coder (Codex) wrote the code. The Orchestrator (Claude Opus) reviews it. This is the cross-LLM review principle — the reviewing model is always different from the creating model.
The Orchestrator checks:
- Does the code match the experiment plan?
- Are there obvious bugs or logic errors?
- Is the training loop correct (gradient accumulation, LR schedule)?
- Are metrics computed correctly?
With an auto-judge gate, the Judge (Codex, stateless) also evaluates code quality independently.
Gate
| Gate Type | When to Use | Behavior |
|---|---|---|
| human | First project | Full code review by user |
| auto-judge | Recommended default | Judge evaluates code quality + test results |
| auto | Routine re-implementations | Skip review, proceed to training |
The Judge evaluates:
- All tests pass
- Code matches design spec
- No obvious correctness issues
- Baseline reproduction within tolerance
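A gate setting and the Judge's criteria might be expressed together in the pipeline configuration. The file layout and keys below are illustrative assumptions, not a documented schema:

```yaml
# Hypothetical stage-gate configuration (illustrative keys only)
stage: implementation
gate:
  type: auto-judge      # human | auto-judge | auto
  criteria:
    - all_tests_pass
    - matches_design_spec
    - no_obvious_correctness_issues
    - baseline_within_tolerance
```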
Error Handling
| Error | Recovery |
|---|---|
| Dependency installation failure | Coder ralph loop: try alternative versions, build from source |
| CUDA/GPU error | Coder checks driver version, falls back to CPU for testing |
| Test failure | Coder ralph loop: analyze error, fix, re-test (max 3 tries) |
| Baseline reproduction fails | Escalate to Orchestrator — may need design adjustment |
| Design ambiguity | Coder asks Orchestrator — never makes design decisions |
What if ralph can't fix it?
After 3 failed fix attempts, the Coder stops and reports the full error context to the Orchestrator. The Orchestrator then decides:
- Give the Coder different instructions
- Re-invoke the Planner to adjust the design
- Escalate to the human
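The fix loop with escalation reduces to a bounded retry. A sketch, assuming callable hooks for running tests and applying a fix:

```python
def ralph_fix_loop(run_tests, attempt_fix, max_tries=3):
    """Re-test after each fix attempt; escalate with context after max_tries."""
    error = None
    for _ in range(max_tries):
        ok, error = run_tests()
        if ok:
            return "pass", None
        attempt_fix(error)
    # Budget exhausted: hand the last error context to the Orchestrator
    return "escalate", error
```

Returning the final error alongside the "escalate" status mirrors the requirement that the Coder report full error context upward.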
Outputs Summary
| File | Contents |
|---|---|
| src/ | All implementation code |
| experiments/exp-001/config.yaml | Training configuration |
| Test results | Integration test pass/fail |
| logs/errors.log | Any errors and their resolutions |
Next Stage
When the gate passes, the pipeline advances to Training with working, tested code.