# Stage 2: Design
The design stage transforms a selected research idea into a complete experiment plan with baselines, ablations, metrics, and task decomposition.
## Entering This Stage

What you have:

- Selected idea with Judge's evaluation (`ideas/selected.yaml`)
- Initial literature survey (`papers/related_work/summaries.yaml`)
- Resource constraints (`infrastructure.yaml`)

What you don't have yet:
- Detailed method specification
- Baseline implementations
- Experiment configurations
## Steps
```mermaid
graph TD
    A[1. Baseline Identification] --> B[2. Method Specification]
    B --> C[3. Experiment Design]
    C --> D[4. Task Decomposition]
    D --> E[5. Plan Review]
    E --> F{Gate}
    F -->|pass| G[Advance to Implementation]
    F -->|revise| C
    style A fill:#ecfdf5,stroke:#059669
    style B fill:#ede9fe,stroke:#7c3aed
    style C fill:#ede9fe,stroke:#7c3aed
    style D fill:#ede9fe,stroke:#7c3aed
    style E fill:#fef3c7,stroke:#d97706
```

### 1. Baseline Identification
Agent: Scout (Gemini)
The Scout finds concrete baselines for comparison:
- Top-venue papers with available code (required)
- Expected performance numbers on target benchmarks
- Compute requirements for each baseline
Output: `design/baselines.yaml`
**Code availability is mandatory**

A baseline without reproducible code is not a baseline — it's a claim. The Scout flags papers without code as `code_available: false` and the Planner deprioritizes them.
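The document names `design/baselines.yaml` but never shows it. A hypothetical sketch of one entry follows; the exact schema is an assumption, apart from the `code_available` flag mentioned above (the perplexity figure mirrors the RetNet target in `design/metrics.yaml`):

```yaml
# design/baselines.yaml (illustrative sketch; field names are assumptions)
baselines:
  - name: RetNet
    code_available: true
    expected_performance:
      benchmark: wikitext-103
      perplexity: 18.5       # matches the target used in design/metrics.yaml
    compute_estimate: "4x A100"
  - name: "Linear attention"
    code_available: false    # deprioritized by the Planner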
### 2. Method Specification
Agent: Planner (Claude Opus, sub-agent)
The Planner writes a detailed method description:
- Algorithm pseudocode
- Key design choices and rationale
- Differences from closest prior work
- Expected computational complexity
Output: `design/plan.md` (method section)
### 3. Experiment Design
Agent: Planner (Claude Opus, sub-agent)
The Planner designs the full experiment suite:
| Component | Contents |
|---|---|
| Main experiments | Proposed method vs. all baselines |
| Ablation studies | Remove each component, measure impact |
| Scaling experiments | Vary sequence length, model size, etc. |
| Metrics | Primary (perplexity), secondary (throughput, memory) |
| Statistical plan | Number of seeds, significance tests |
Output: `design/ablations.yaml`, `design/metrics.yaml`
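The statistical plan in the table above calls for reporting each metric as mean +/- std across seeds. A minimal sketch of that aggregation, using only the standard library (the function name and rounding to two decimals are my own choices):

```python
import statistics

def summarize_metric(seed_results):
    """Aggregate per-seed metric values into the 'mean +/- std' format
    required by the statistical plan (sample standard deviation)."""
    mean = statistics.mean(seed_results)
    std = statistics.stdev(seed_results)  # sample std, n - 1 denominator
    return f"{mean:.2f} +/- {std:.2f}"

# e.g. perplexity measured over three seeds:
print(summarize_metric([18.1, 18.4, 18.3]))  # → 18.27 +/- 0.15
```

With only 3 seeds, the sample standard deviation is a noisy estimate, which is one reason the plan also specifies a paired significance test rather than relying on error bars alone.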
```yaml
# design/metrics.yaml
primary:
  - name: perplexity
    benchmark: wikitext-103
    target: "< 18.5 (RetNet baseline)"
secondary:
  - name: throughput
    unit: "tokens/second"
    target: "> 12000 on 4x A100"
  - name: peak_memory
    unit: "GB"
    target: "< 40 on 80GB A100"
statistical:
  seeds: 3
  significance_test: "paired t-test, p < 0.05"
  error_reporting: "mean +/- std"
```

### 4. Task Decomposition
Agent: Planner (Claude Opus, sub-agent)
The Planner breaks the implementation into numbered tasks for the Coder:
```yaml
# design/tasks.yaml
tasks:
  - id: 1
    title: "Set up training infrastructure"
    description: "Conda env, data loading, training loop skeleton"
    dependencies: []
    estimated_hours: 4
  - id: 2
    title: "Implement retention mechanism"
    description: "Core retention module based on RetNet paper"
    dependencies: [1]
    estimated_hours: 6
  - id: 3
    title: "Integrate flash attention tiling"
    description: "Adapt flash attention's tiling to retention compute"
    dependencies: [2]
    estimated_hours: 8
  - id: 4
    title: "Implement baselines"
    description: "Set up vanilla attention and linear attention baselines"
    dependencies: [1]
    estimated_hours: 4
  - id: 5
    title: "Evaluation pipeline"
    description: "Perplexity, throughput, and memory benchmarks"
    dependencies: [1]
    estimated_hours: 3
```

**Tasks have explicit dependencies**

The Planner specifies which tasks depend on which. This enables ultrawork mode to parallelize independent tasks (e.g., tasks 2 and 4 above can run in parallel).
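The parallelization described above amounts to a topological sort into "waves": each wave contains tasks whose dependencies all appear in earlier waves. A hypothetical helper (not part of the pipeline itself) sketching this against the task list from `design/tasks.yaml`:

```python
def parallel_waves(tasks):
    """Group tasks into waves that can run in parallel.
    tasks: dicts with 'id' and 'dependencies'. Returns a list of waves,
    where every task in a wave depends only on tasks from earlier waves."""
    remaining = {t["id"]: set(t["dependencies"]) for t in tasks}
    waves = []
    while remaining:
        # Tasks with no unmet dependencies are ready to run now.
        ready = sorted(tid for tid, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("dependency cycle in tasks.yaml")
        waves.append(ready)
        for tid in ready:
            del remaining[tid]
        # Mark the just-scheduled tasks as satisfied dependencies.
        for deps in remaining.values():
            deps.difference_update(ready)
    return waves

# The dependency structure from design/tasks.yaml above:
tasks = [
    {"id": 1, "dependencies": []},
    {"id": 2, "dependencies": [1]},
    {"id": 3, "dependencies": [2]},
    {"id": 4, "dependencies": [1]},
    {"id": 5, "dependencies": [1]},
]
print(parallel_waves(tasks))  # → [[1], [2, 4, 5], [3]]
```

This reproduces the claim above: once task 1 finishes, tasks 2, 4, and 5 can run concurrently, with task 3 waiting on task 2.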
### 5. Plan Review
Agent: Orchestrator (and optionally Judge)
The Orchestrator reviews the complete plan for:
- Consistency between method spec and experiment design
- Feasibility within resource constraints
- Completeness of baselines and ablations
- Clear task decomposition for the Coder
## Gate
| Gate Type | Recommended | Behavior |
|---|---|---|
| `human` | Yes | User reviews the full experiment plan |
| `auto-judge` | Possible | Judge checks plan completeness and feasibility |
| `auto` | Not recommended | Experiment design benefits from human review |
**Why human review for design?**

A flawed experiment design is expensive to discover during training. Spending 30 minutes reviewing the plan now can save days of wasted compute later. This is the second-highest-leverage human review point after ideation.
## Error Handling
| Error | Recovery |
|---|---|
| No baselines with code found | Scout broadens search; Planner notes which baselines need reimplementation |
| Resource constraints too tight | Planner proposes scaled-down experiment; Orchestrator discusses with user |
| Plan too ambitious for timeline | Planner splits into "must-have" and "nice-to-have" experiments |
| Missing information about baseline | Orchestrator dispatches Scout for deeper paper analysis |
## Outputs Summary
| File | Contents |
|---|---|
| `design/plan.md` | Complete method specification |
| `design/baselines.yaml` | Baselines with code links and expected performance |
| `design/ablations.yaml` | Ablation study design |
| `design/metrics.yaml` | Metrics, targets, and statistical plan |
| `design/tasks.yaml` | Numbered implementation tasks with dependencies |
## Next Stage
When the gate passes, the pipeline advances to Implementation with the complete experiment plan.