Stage 3: Implementation
The implementation stage translates the experiment plan into working, tested code. The Coder follows the Planner's task decomposition and the Orchestrator reviews the output.
Entering This Stage
What you have:
- Complete experiment plan (design/plan.md)
- Task decomposition with dependencies (design/tasks.yaml)
- Baseline sources with code links (design/baselines.yaml)
- Metrics and evaluation spec (design/metrics.yaml)
What you don't have yet:
- Working code
- Training scripts
- Evaluation pipeline
Steps
```mermaid
graph TD
    A[1. Environment Setup] --> B[2. Core Implementation]
    B --> C[3. Baseline Setup]
    C --> D[4. Evaluation Pipeline]
    D --> E[5. Integration Testing]
    E --> F{Tests Pass?}
    F -->|Yes| G[6. Code Review]
    F -->|No| H[Ralph Fix Loop]
    H --> E
    style A fill:#fef3c7,stroke:#d97706
    style B fill:#fef3c7,stroke:#d97706
    style G fill:#f9f0ff,stroke:#7c3aed
    style H fill:#fee2e2,stroke:#dc2626
```

1. Environment Setup
Agent: Coder (Codex, tmux worker)
The Coder sets up the development environment:
- Create conda environment with pinned dependencies
- Verify CUDA availability and GPU access
- Download and prepare datasets
- Set up project directory structure
```
experiments/
├── exp-001/
│   ├── config.yaml   # Hyperparameters
│   └── ...
src/
├── models/
├── data/
├── training/
└── evaluation/
```

Environment problems are caught early
Most environment issues (wrong CUDA version, missing libraries, data download failures) surface here. The Coder's ralph loop handles these automatically. If setup fails after 3 retries, it escalates to the Orchestrator.
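An early environment check like this can be sketched in a few lines. This is illustrative only, not the Coder's actual setup script; the package list and the nvidia-smi probe are assumptions:

```python
import importlib.util
import shutil

def verify_environment(required_packages=("numpy", "torch")):
    """Collect environment problems; an empty list means setup looks healthy."""
    problems = []
    for name in required_packages:
        # find_spec checks importability without actually importing the package
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing package: {name}")
    # A missing nvidia-smi usually means no GPU driver is installed
    if shutil.which("nvidia-smi") is None:
        problems.append("nvidia-smi not on PATH: GPU likely unavailable")
    return problems
```

If a check like this still reports problems after the retry budget, the list of problems becomes the error context handed to the Orchestrator.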
2. Core Implementation
Agent: Coder (Codex, tmux worker)
The Coder implements tasks from design/tasks.yaml in dependency order:
| Task | What's Built |
|---|---|
| Model architecture | Core modules, layers, attention mechanism |
| Training loop | Forward pass, loss computation, optimizer, scheduler |
| Data pipeline | DataLoader, tokenization, batching |
| Checkpointing | Save/load model state, resume training |
| Logging | Structured JSONL logging for monitoring |
The Coder follows the Planner's specification exactly. When the spec is ambiguous, the Coder reports the ambiguity to the Orchestrator rather than making design decisions.
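The structured JSONL logging row above amounts to one JSON object per line. A minimal sketch, with field names that are illustrative assumptions:

```python
import json
import time

def log_step(fh, step, loss, lr):
    """Append one JSON object per line so monitors can tail the log file."""
    record = {"ts": time.time(), "step": step, "loss": loss, "lr": lr}
    fh.write(json.dumps(record) + "\n")
```

Appending to a file such as logs/train.jsonl keeps each record independently parseable, which is what makes tail-based monitoring possible.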
3. Baseline Setup
Agent: Coder (Codex, tmux worker)
For each baseline in design/baselines.yaml:
- Clone the reference implementation
- Adapt to use the same data pipeline and evaluation
- Verify that reproduced numbers fall within 5% of the reported numbers
Baseline reproduction is a quality gate
If a baseline can't be reproduced within 5% of reported numbers, the Coder flags this. The Orchestrator decides whether to use the reproduction as-is, debug further, or replace the baseline.
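The 5% tolerance check is simple to state precisely. This helper is a sketch of the comparison, not the tool's actual gate logic:

```python
def reproduces(reported, reproduced, rel_tol=0.05):
    """True when the reproduced metric is within rel_tol of the reported value."""
    return abs(reproduced - reported) <= rel_tol * abs(reported)
```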
4. Evaluation Pipeline
Agent: Coder (Codex, tmux worker)
Build the evaluation pipeline per design/metrics.yaml:
- Perplexity computation
- Throughput benchmarking (tokens/second)
- Memory profiling (peak GPU memory)
- Automated result extraction into
results.yaml
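For the perplexity item above, the standard definition is the exponential of the mean per-token negative log-likelihood. A minimal sketch:

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))
```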
5. Integration Testing
Agent: Coder (Codex, tmux worker)
Run all components together:
- Short training run (100 steps) to verify the full pipeline
- Check loss decreases
- Verify checkpoint save/load cycle
- Run evaluation on the short-trained model
- Verify all metrics are computed and logged correctly
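The loss-decrease check can be made mechanical by comparing early and late windows of the short run. A sketch, assuming step losses are collected in a list:

```python
def loss_decreased(losses, window=10):
    """Compare the mean loss of the first and last `window` steps."""
    if len(losses) < 2 * window:
        raise ValueError("run too short for a meaningful comparison")
    early = sum(losses[:window]) / window
    late = sum(losses[-window:]) / window
    return late < early
```

Windowed means are less noisy than comparing single first/last steps, which matters on a run as short as 100 steps.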
```yaml
# Quick verification checklist
tests:
  - name: "forward_pass"
    status: pass
    detail: "Output shape correct, gradients flow"
  - name: "short_training"
    status: pass
    detail: "100 steps, loss decreased from 11.2 to 8.7"
  - name: "checkpoint_roundtrip"
    status: pass
    detail: "Save at step 50, load, loss matches"
  - name: "evaluation_pipeline"
    status: pass
    detail: "All metrics computed, output matches schema"
```

6. Code Review
Agent: Orchestrator (Claude Opus) reviews Coder's (Codex) code
Cross-LLM review in action
The Coder (Codex) wrote the code. The Orchestrator (Claude Opus) reviews it. This is the cross-LLM review principle — the reviewing model is always different from the creating model.
The Orchestrator checks:
- Does the code match the experiment plan?
- Are there obvious bugs or logic errors?
- Is the training loop correct (gradient accumulation, LR schedule)?
- Are metrics computed correctly?
With an auto-judge gate, the Judge (Codex, stateless) also evaluates code quality independently.
Gate
| Gate Type | When to Use | Behavior |
|---|---|---|
| human | First project | Full code review by user |
| auto-judge | Recommended default | Judge evaluates code quality + test results |
| auto | Routine re-implementations | Skip review, proceed to training |
The Judge evaluates:
- All tests pass
- Code matches design spec
- No obvious correctness issues
- Baseline reproduction within tolerance
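A gate setting and the Judge's criteria might be expressed together in the pipeline configuration. The file layout and keys below are illustrative assumptions, not a documented schema:

```yaml
# Hypothetical stage-gate configuration (illustrative keys only)
stage: implementation
gate:
  type: auto-judge      # human | auto-judge | auto
  criteria:
    - all_tests_pass
    - matches_design_spec
    - no_obvious_correctness_issues
    - baseline_within_tolerance
```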
Error Handling
| Error | Recovery |
|---|---|
| Dependency installation failure | Coder ralph loop: try alternative versions, build from source |
| CUDA/GPU error | Coder checks driver version, falls back to CPU for testing |
| Test failure | Coder ralph loop: analyze error, fix, re-test (max 3 tries) |
| Baseline reproduction fails | Escalate to Orchestrator — may need design adjustment |
| Design ambiguity | Coder asks Orchestrator — never makes design decisions |
What if ralph can't fix it?
After 3 failed fix attempts, the Coder stops and reports the full error context to the Orchestrator. The Orchestrator then decides:
- Give the Coder different instructions
- Re-invoke the Planner to adjust the design
- Escalate to the human
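The fix loop with escalation reduces to a bounded retry. A sketch, assuming callable hooks for running tests and applying a fix:

```python
def ralph_fix_loop(run_tests, attempt_fix, max_tries=3):
    """Re-test after each fix attempt; escalate with context after max_tries."""
    error = None
    for _ in range(max_tries):
        ok, error = run_tests()
        if ok:
            return "pass", None
        attempt_fix(error)
    # Budget exhausted: hand the last error context to the Orchestrator
    return "escalate", error
```

Returning the final error alongside the "escalate" status mirrors the requirement that the Coder report full error context upward.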
Outputs Summary
| File | Contents |
|---|---|
| src/ | All implementation code |
| experiments/exp-001/config.yaml | Training configuration |
| Test results | Integration test pass/fail |
| logs/errors.log | Any errors and their resolutions |
Next Stage
When the gate passes, the pipeline advances to Training with working, tested code.