Stage 3: Implementation

The implementation stage translates the experiment plan into working, tested code. The Coder follows the Planner's task decomposition and the Orchestrator reviews the output.

Entering This Stage

What you have:

  • Complete experiment plan (design/plan.md)
  • Task decomposition with dependencies (design/tasks.yaml)
  • Baseline sources with code links (design/baselines.yaml)
  • Metrics and evaluation spec (design/metrics.yaml)

What you don't have yet:

  • Working code
  • Training scripts
  • Evaluation pipeline

Steps

```mermaid
graph TD
    A[1. Environment Setup] --> B[2. Core Implementation]
    B --> C[3. Baseline Setup]
    C --> D[4. Evaluation Pipeline]
    D --> E[5. Integration Testing]
    E --> F{Tests Pass?}
    F -->|Yes| G[6. Code Review]
    F -->|No| H[Ralph Fix Loop]
    H --> E
    G --> I{Gate}
    I -->|pass| J[Advance to Training]
    I -->|revise| B

    style A fill:#fef3c7,stroke:#d97706
    style B fill:#fef3c7,stroke:#d97706
    style G fill:#f9f0ff,stroke:#7c3aed
    style H fill:#fee2e2,stroke:#dc2626
```

1. Environment Setup

Agent: Coder (Codex, tmux worker)

The Coder sets up the development environment:

  • Create conda environment with pinned dependencies
  • Verify CUDA availability and GPU access
  • Download and prepare datasets
  • Set up project directory structure

```
experiments/
├── exp-001/
│   ├── config.yaml    # Hyperparameters
│   └── ...
src/
├── models/
├── data/
├── training/
└── evaluation/
```

Environment problems are caught early

Most environment issues (wrong CUDA version, missing libraries, data download failures) surface here. The Coder's ralph loop handles these automatically. If setup fails after 3 retries, it escalates to the Orchestrator.
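
The directory layout above can be scaffolded in a few lines. This sketch is illustrative (stdlib only, paths taken from the tree above) rather than the Coder's actual tooling:

```python
from pathlib import Path

def scaffold(root: str = ".") -> list[str]:
    """Create the project skeleton for this stage and return the created paths."""
    dirs = [
        "experiments/exp-001",
        "src/models",
        "src/data",
        "src/training",
        "src/evaluation",
    ]
    created = []
    for d in dirs:
        p = Path(root) / d
        p.mkdir(parents=True, exist_ok=True)  # idempotent: safe to re-run
        created.append(str(p.relative_to(root)))
    # Seed an empty experiment config for the Coder to fill in later
    (Path(root) / "experiments/exp-001/config.yaml").touch()
    return created
```

Because `mkdir(exist_ok=True)` is idempotent, a ralph retry that re-runs setup does not fail on an existing tree.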

2. Core Implementation

Agent: Coder (Codex, tmux worker)

The Coder implements tasks from design/tasks.yaml in dependency order:

| Task | What's Built |
|------|--------------|
| Model architecture | Core modules, layers, attention mechanism |
| Training loop | Forward pass, loss computation, optimizer, scheduler |
| Data pipeline | DataLoader, tokenization, batching |
| Checkpointing | Save/load model state, resume training |
| Logging | Structured JSONL logging for monitoring |

The Coder follows the Planner's specification exactly. When the spec is ambiguous, the Coder reports the ambiguity to the Orchestrator rather than making design decisions.
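
The real loop is a PyTorch training loop built to the Planner's spec. As a shape reference only, here is a dependency-free toy that mirrors its structure (forward pass, loss, SGD step, one JSONL log record per step); every name here is illustrative:

```python
import json
import random

def train(steps=100, lr=0.05, log_path=None):
    """Toy 1-D linear regression mirroring the real loop's shape:
    forward pass -> loss -> gradient step -> structured JSONL log record."""
    random.seed(0)
    w, target_w = 0.0, 3.0
    records = []
    for step in range(steps):
        x = random.uniform(-1, 1)             # stand-in for a data batch
        pred = w * x                          # forward pass
        loss = (pred - target_w * x) ** 2     # squared-error loss
        grad = 2 * (pred - target_w * x) * x  # d(loss)/dw
        w -= lr * grad                        # optimizer step (plain SGD)
        records.append({"step": step, "loss": loss, "w": w})
    if log_path:                              # JSONL logging: one record per line
        with open(log_path, "w") as f:
            for r in records:
                f.write(json.dumps(r) + "\n")
    return records
```

The JSONL format matters for the later stages: one self-contained record per line is what the monitoring tooling tails during training.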

3. Baseline Setup

Agent: Coder (Codex, tmux worker)

For each baseline in design/baselines.yaml:

  • Clone the reference implementation
  • Adapt to use the same data pipeline and evaluation
  • Verify reproduction of reported numbers (within 5%)

Baseline reproduction is a quality gate

If a baseline can't be reproduced within 5% of reported numbers, the Coder flags this. The Orchestrator decides whether to use the reproduction as-is, debug further, or replace the baseline.
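
The 5% check is a plain relative-error test. A minimal sketch (function name hypothetical):

```python
def within_tolerance(reproduced: float, reported: float, rel_tol: float = 0.05) -> bool:
    """True if the reproduced metric is within rel_tol (5% by default)
    of the reported number, using relative error."""
    if reported == 0:
        # No meaningful relative error against zero; fall back to absolute
        return abs(reproduced) <= rel_tol
    return abs(reproduced - reported) / abs(reported) <= rel_tol
```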

4. Evaluation Pipeline

Agent: Coder (Codex, tmux worker)

Build the evaluation pipeline per design/metrics.yaml:

  • Perplexity computation
  • Throughput benchmarking (tokens/second)
  • Memory profiling (peak GPU memory)
  • Automated result extraction into results.yaml
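
The first metric is fully determined by its definition: perplexity is the exponential of the mean per-token negative log-likelihood (in nats). A minimal stdlib sketch (names illustrative):

```python
import math

def perplexity(token_nlls: list[float]) -> float:
    """Perplexity = exp(mean per-token negative log-likelihood, in nats)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))
```

A quick sanity check: a model that assigns every token probability 1/10 (NLL of ln 10 per token) has perplexity exactly 10.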

5. Integration Testing

Agent: Coder (Codex, tmux worker)

Run all components together:

  • Short training run (100 steps) to verify the full pipeline
  • Check loss decreases
  • Verify checkpoint save/load cycle
  • Run evaluation on the short-trained model
  • Verify all metrics are computed and logged correctly

```yaml
# Quick verification checklist
tests:
  - name: "forward_pass"
    status: pass
    detail: "Output shape correct, gradients flow"
  - name: "short_training"
    status: pass
    detail: "100 steps, loss decreased from 11.2 to 8.7"
  - name: "checkpoint_roundtrip"
    status: pass
    detail: "Save at step 50, load, loss matches"
  - name: "evaluation_pipeline"
    status: pass
    detail: "All metrics computed, output matches schema"
```
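
The checkpoint roundtrip verifies one property: state saved at step N loads back identically. The real implementation would use `torch.save`/`torch.load`; this JSON stand-in (names are assumptions) sketches the roundtrip plus an atomic-write pattern that prevents a crash mid-save from corrupting the checkpoint:

```python
import json
import os

def save_checkpoint(state: dict, path: str) -> None:
    """Write atomically: dump to a temp file, then rename over the target,
    so a crash mid-write never leaves a truncated checkpoint behind."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic on POSIX and Windows

def load_checkpoint(path: str) -> dict:
    with open(path) as f:
        return json.load(f)
```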

6. Code Review

Agent: Orchestrator (Claude Opus) reviews Coder's (Codex) code

Cross-LLM review in action

The Coder (Codex) wrote the code. The Orchestrator (Claude Opus) reviews it. This is the cross-LLM review principle — the reviewing model is always different from the creating model.

The Orchestrator checks:

  • Does the code match the experiment plan?
  • Are there obvious bugs or logic errors?
  • Is the training loop correct (gradient accumulation, LR schedule)?
  • Are metrics computed correctly?

If the gate is auto-judge, the Judge (Codex, stateless) also evaluates code quality independently.

Gate

| Gate Type | When to Use | Behavior |
|-----------|-------------|----------|
| human | For first project | Full code review by user |
| auto-judge | Recommended | Judge evaluates code quality + test results |
| auto | For routine re-implementations | Skip review, proceed to training |

The Judge evaluates:

  • All tests pass
  • Code matches design spec
  • No obvious correctness issues
  • Baseline reproduction within tolerance
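
A decision over these criteria can be sketched as a pure function. The report schema below is an assumption for illustration, not the system's actual format:

```python
def judge_gate(report: dict) -> str:
    """Return 'pass' or 'revise' from a test/review report (schema illustrative)."""
    checks = [
        # All tests pass
        all(t["status"] == "pass" for t in report.get("tests", [])),
        # Code matches design spec
        report.get("matches_spec", False),
        # No obvious correctness issues
        not report.get("correctness_issues", []),
        # Baseline reproduction within tolerance
        report.get("baseline_within_tolerance", False),
    ]
    return "pass" if all(checks) else "revise"
```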

Error Handling

| Error | Recovery |
|-------|----------|
| Dependency installation failure | Coder ralph loop: try alternative versions, build from source |
| CUDA/GPU error | Coder checks driver version, falls back to CPU for testing |
| Test failure | Coder ralph loop: analyze error, fix, re-test (max 3 tries) |
| Baseline reproduction fails | Escalate to Orchestrator — may need design adjustment |
| Design ambiguity | Coder asks Orchestrator — never makes design decisions |

What if ralph can't fix it?

After 3 failed fix attempts, the Coder stops and reports the full error context to the Orchestrator. The Orchestrator then decides:

  1. Give the Coder different instructions
  2. Re-invoke the Planner to adjust the design
  3. Escalate to the human
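
The fix/re-test loop with escalation after three failed attempts can be sketched generically (all names hypothetical; `attempt_fix` and `run_tests` stand in for the Coder's actual actions):

```python
def ralph_loop(attempt_fix, run_tests, max_tries: int = 3):
    """Run fix -> test up to max_tries times.
    Returns ('fixed', attempt_number) on success,
    or ('escalate', last_error) so the Orchestrator can take over."""
    last_error = None
    for attempt in range(1, max_tries + 1):
        attempt_fix(last_error)      # fix attempt sees the previous error context
        ok, last_error = run_tests()
        if ok:
            return ("fixed", attempt)
    return ("escalate", last_error)
```

Passing `last_error` back into each fix attempt matters: the retry is informed by the previous failure rather than repeating the same action blindly.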

Outputs Summary

| File | Contents |
|------|----------|
| src/ | All implementation code |
| experiments/exp-001/config.yaml | Training configuration |
| Test results | Integration test pass/fail |
| logs/errors.log | Any errors and their resolutions |

Next Stage

When the gate passes, the pipeline advances to Training with working, tested code.

AutoResearch — Multi-agent Deep Learning Research System