# Stage 2: Design
The design stage transforms a selected research idea into a complete experiment plan with baselines, ablations, metrics, and task decomposition.
## Entering This Stage

What you have:

- Selected idea with Judge's evaluation (`ideas/selected.yaml`)
- Initial literature survey (`papers/related_work/summaries.yaml`)
- Resource constraints (`infrastructure.yaml`)

What you don't have yet:
- Detailed method specification
- Baseline implementations
- Experiment configurations
## Steps
```mermaid
graph TD
    A[1. Baseline Identification] --> B[2. Method Specification]
    B --> C[3. Experiment Design]
    C --> D[4. Task Decomposition]
    D --> E[5. Plan Review]
    E --> F{Gate}
    F -->|pass| G[Advance to Implementation]
    F -->|revise| C
    style A fill:#ecfdf5,stroke:#059669
    style B fill:#ede9fe,stroke:#7c3aed
    style C fill:#ede9fe,stroke:#7c3aed
    style D fill:#ede9fe,stroke:#7c3aed
    style E fill:#fef3c7,stroke:#d97706
```

### 1. Baseline Identification
Agent: Scout (Gemini)
The Scout finds concrete baselines for comparison:
- Top-venue papers with available code (required)
- Expected performance numbers on target benchmarks
- Compute requirements for each baseline
Output: `design/baselines.yaml`
**Code availability is mandatory**

A baseline without reproducible code is not a baseline — it's a claim. The Scout flags papers without code as `code_available: false` and the Planner deprioritizes them.
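The document names `design/baselines.yaml` but never shows it. A hypothetical sketch of one entry follows; the exact schema is an assumption, apart from the `code_available` flag mentioned above (the perplexity figure mirrors the RetNet target in `design/metrics.yaml`):

```yaml
# design/baselines.yaml (illustrative sketch; field names are assumptions)
baselines:
  - name: RetNet
    code_available: true
    expected_performance:
      benchmark: wikitext-103
      perplexity: 18.5       # matches the target used in design/metrics.yaml
    compute_estimate: "4x A100"
  - name: "Linear attention"
    code_available: false    # deprioritized by the Planner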
### 2. Method Specification
Agent: Planner (Claude Opus, sub-agent)
The Planner writes a detailed method description:
- Algorithm pseudocode
- Key design choices and rationale
- Differences from closest prior work
- Expected computational complexity
Output: `design/plan.md` (method section)
### 3. Experiment Design
Agent: Planner (Claude Opus, sub-agent)
The Planner designs the full experiment suite:
| Component | Contents |
|---|---|
| Main experiments | Proposed method vs. all baselines |
| Ablation studies | Remove each component, measure impact |
| Scaling experiments | Vary sequence length, model size, etc. |
| Metrics | Primary (perplexity), secondary (throughput, memory) |
| Statistical plan | Number of seeds, significance tests |
Output: `design/ablations.yaml`, `design/metrics.yaml`
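The statistical plan in the table above calls for reporting each metric as mean +/- std across seeds. A minimal sketch of that aggregation, using only the standard library (the function name and rounding to two decimals are my own choices):

```python
import statistics

def summarize_metric(seed_results):
    """Aggregate per-seed metric values into the 'mean +/- std' format
    required by the statistical plan (sample standard deviation)."""
    mean = statistics.mean(seed_results)
    std = statistics.stdev(seed_results)  # sample std, n - 1 denominator
    return f"{mean:.2f} +/- {std:.2f}"

# e.g. perplexity measured over three seeds:
print(summarize_metric([18.1, 18.4, 18.3]))  # → 18.27 +/- 0.15
```

With only 3 seeds, the sample standard deviation is a noisy estimate, which is one reason the plan also specifies a paired significance test rather than relying on error bars alone.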
```yaml
# design/metrics.yaml
primary:
  - name: perplexity
    benchmark: wikitext-103
    target: "< 18.5 (RetNet baseline)"
secondary:
  - name: throughput
    unit: "tokens/second"
    target: "> 12000 on 4x A100"
  - name: peak_memory
    unit: "GB"
    target: "< 40 on 80GB A100"
statistical:
  seeds: 3
  significance_test: "paired t-test, p < 0.05"
  error_reporting: "mean +/- std"
```

### 4. Task Decomposition
Agent: Planner (Claude Opus, sub-agent)
The Planner breaks the implementation into numbered tasks for the Coder:
```yaml
# design/tasks.yaml
tasks:
  - id: 1
    title: "Set up training infrastructure"
    description: "Conda env, data loading, training loop skeleton"
    dependencies: []
    estimated_hours: 4
  - id: 2
    title: "Implement retention mechanism"
    description: "Core retention module based on RetNet paper"
    dependencies: [1]
    estimated_hours: 6
  - id: 3
    title: "Integrate flash attention tiling"
    description: "Adapt flash attention's tiling to retention compute"
    dependencies: [2]
    estimated_hours: 8
  - id: 4
    title: "Implement baselines"
    description: "Set up vanilla attention and linear attention baselines"
    dependencies: [1]
    estimated_hours: 4
  - id: 5
    title: "Evaluation pipeline"
    description: "Perplexity, throughput, and memory benchmarks"
    dependencies: [1]
    estimated_hours: 3
```

**Tasks have explicit dependencies**

The Planner specifies which tasks depend on which. This enables ultrawork mode to parallelize independent tasks (e.g., tasks 2 and 4 above can run in parallel).
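The parallelization described above amounts to a topological sort into "waves": each wave contains tasks whose dependencies all appear in earlier waves. A hypothetical helper (not part of the pipeline itself) sketching this against the task list from `design/tasks.yaml`:

```python
def parallel_waves(tasks):
    """Group tasks into waves that can run in parallel.
    tasks: dicts with 'id' and 'dependencies'. Returns a list of waves,
    where every task in a wave depends only on tasks from earlier waves."""
    remaining = {t["id"]: set(t["dependencies"]) for t in tasks}
    waves = []
    while remaining:
        # Tasks with no unmet dependencies are ready to run now.
        ready = sorted(tid for tid, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("dependency cycle in tasks.yaml")
        waves.append(ready)
        for tid in ready:
            del remaining[tid]
        # Mark the just-scheduled tasks as satisfied dependencies.
        for deps in remaining.values():
            deps.difference_update(ready)
    return waves

# The dependency structure from design/tasks.yaml above:
tasks = [
    {"id": 1, "dependencies": []},
    {"id": 2, "dependencies": [1]},
    {"id": 3, "dependencies": [2]},
    {"id": 4, "dependencies": [1]},
    {"id": 5, "dependencies": [1]},
]
print(parallel_waves(tasks))  # → [[1], [2, 4, 5], [3]]
```

This reproduces the claim above: once task 1 finishes, tasks 2, 4, and 5 can run concurrently, with task 3 waiting on task 2.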
### 5. Plan Review
Agent: Orchestrator (and optionally Judge)
The Orchestrator reviews the complete plan for:
- Consistency between method spec and experiment design
- Feasibility within resource constraints
- Completeness of baselines and ablations
- Clear task decomposition for the Coder
## Gate
| Gate Type | Recommended | Behavior |
|---|---|---|
| `human` | Yes | User reviews the full experiment plan |
| `auto-judge` | Possible | Judge checks plan completeness and feasibility |
| `auto` | Not recommended | Experiment design benefits from human review |
**Why human review for design?**

A flawed experiment design is expensive to discover during training. Spending 30 minutes reviewing the plan now can save days of wasted compute later. This is the second-highest-leverage human review point after ideation.
## Error Handling
| Error | Recovery |
|---|---|
| No baselines with code found | Scout broadens search; Planner notes which baselines need reimplementation |
| Resource constraints too tight | Planner proposes scaled-down experiment; Orchestrator discusses with user |
| Plan too ambitious for timeline | Planner splits into "must-have" and "nice-to-have" experiments |
| Missing information about baseline | Orchestrator dispatches Scout for deeper paper analysis |
## Outputs Summary
| File | Contents |
|---|---|
| `design/plan.md` | Complete method specification |
| `design/baselines.yaml` | Baselines with code links and expected performance |
| `design/ablations.yaml` | Ablation study design |
| `design/metrics.yaml` | Metrics, targets, and statistical plan |
| `design/tasks.yaml` | Numbered implementation tasks with dependencies |
## Next Stage
When the gate passes, the pipeline advances to Implementation with the complete experiment plan.