Stage 2: Design

The design stage transforms a selected research idea into a complete experiment plan with baselines, ablations, metrics, and task decomposition.

Entering This Stage

What you have:

  • Selected idea with Judge's evaluation (ideas/selected.yaml)
  • Initial literature survey (papers/related_work/summaries.yaml)
  • Resource constraints (infrastructure.yaml)

What you don't have yet:

  • Detailed method specification
  • Baseline implementations
  • Experiment configurations

Steps

```mermaid
graph TD
    A[1. Baseline Identification] --> B[2. Method Specification]
    B --> C[3. Experiment Design]
    C --> D[4. Task Decomposition]
    D --> E[5. Plan Review]
    E --> F{Gate}
    F -->|pass| G[Advance to Implementation]
    F -->|revise| C

    style A fill:#ecfdf5,stroke:#059669
    style B fill:#ede9fe,stroke:#7c3aed
    style C fill:#ede9fe,stroke:#7c3aed
    style D fill:#ede9fe,stroke:#7c3aed
    style E fill:#fef3c7,stroke:#d97706
```

1. Baseline Identification

Agent: Scout (Gemini)

The Scout finds concrete baselines for comparison:

  • Top-venue papers with available code (required)
  • Expected performance numbers on target benchmarks
  • Compute requirements for each baseline

Output: design/baselines.yaml

Code availability is mandatory

A baseline without reproducible code is not a baseline — it's a claim. The Scout flags papers without code as code_available: false and the Planner deprioritizes them.
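As a hedged sketch, design/baselines.yaml might look like the following. Only the code_available flag is confirmed by this page; the other field names, the placeholder URL, and the compute estimate are illustrative assumptions (the expected perplexity mirrors the RetNet target used in design/metrics.yaml):

```yaml
# design/baselines.yaml -- illustrative sketch, not a confirmed schema.
# Only code_available is documented; other fields are assumptions.
baselines:
  - name: RetNet
    code_available: true
    repo: "https://example.com/retnet-code"   # placeholder URL
    expected:
      benchmark: wikitext-103
      perplexity: 18.5
    compute: "4x A100, ~2 days"               # assumed estimate
  - name: "Linear attention variant"
    code_available: false                     # flagged; Planner deprioritizes
```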

2. Method Specification

Agent: Planner (Claude Opus, sub-agent)

The Planner writes a detailed method description:

  • Algorithm pseudocode
  • Key design choices and rationale
  • Differences from closest prior work
  • Expected computational complexity

Output: design/plan.md (method section)
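The four bullets above map naturally onto an outline for the method section; a possible skeleton (section names are assumptions, not a mandated template):

```markdown
<!-- design/plan.md, method section -- illustrative outline only -->
## Method

### Algorithm
Pseudocode for the proposed mechanism, step by step.

### Design Choices
Each key choice with a one-paragraph rationale.

### Relation to Prior Work
Explicit diff against the closest baseline(s).

### Complexity
Expected time/memory complexity vs. the baselines.
```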

3. Experiment Design

Agent: Planner (Claude Opus, sub-agent)

The Planner designs the full experiment suite:

| Component | Contents |
| --- | --- |
| Main experiments | Proposed method vs. all baselines |
| Ablation studies | Remove each component, measure impact |
| Scaling experiments | Vary sequence length, model size, etc. |
| Metrics | Primary (perplexity), secondary (throughput, memory) |
| Statistical plan | Number of seeds, significance tests |

Output: design/ablations.yaml, design/metrics.yaml

```yaml
# design/metrics.yaml
primary:
  - name: perplexity
    benchmark: wikitext-103
    target: "< 18.5 (RetNet baseline)"

secondary:
  - name: throughput
    unit: "tokens/second"
    target: "> 12000 on 4x A100"
  - name: peak_memory
    unit: "GB"
    target: "< 40 on 80GB A100"

statistical:
  seeds: 3
  significance_test: "paired t-test, p < 0.05"
  error_reporting: "mean +/- std"
```
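design/ablations.yaml is produced in the same step. A hedged sketch, mirroring the "remove each component, measure impact" design above — the field names and the specific ablated components are illustrative assumptions:

```yaml
# design/ablations.yaml -- illustrative sketch; fields and components
# are assumptions, not a confirmed schema.
ablations:
  - name: no_tiling
    removes: "flash-attention-style tiling"
    expect: "higher peak memory, similar perplexity"
  - name: no_decay
    removes: "per-head decay in the retention mechanism"
    expect: "degraded long-context perplexity"
seeds: 3   # matches the statistical plan in metrics.yaml
```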

4. Task Decomposition

Agent: Planner (Claude Opus, sub-agent)

The Planner breaks the implementation into numbered tasks for the Coder:

```yaml
# design/tasks.yaml
tasks:
  - id: 1
    title: "Set up training infrastructure"
    description: "Conda env, data loading, training loop skeleton"
    dependencies: []
    estimated_hours: 4

  - id: 2
    title: "Implement retention mechanism"
    description: "Core retention module based on RetNet paper"
    dependencies: [1]
    estimated_hours: 6

  - id: 3
    title: "Integrate flash attention tiling"
    description: "Adapt flash attention's tiling to retention compute"
    dependencies: [2]
    estimated_hours: 8

  - id: 4
    title: "Implement baselines"
    description: "Set up vanilla attention and linear attention baselines"
    dependencies: [1]
    estimated_hours: 4

  - id: 5
    title: "Evaluation pipeline"
    description: "Perplexity, throughput, and memory benchmarks"
    dependencies: [1]
    estimated_hours: 3
```

Tasks have explicit dependencies

The Planner specifies which tasks depend on which. This enables ultrawork mode to parallelize independent tasks (e.g., tasks 2 and 4 above can run in parallel).
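The parallelization the dependency graph enables can be sketched with a simple topological levelling pass. The task data below is copied from tasks.yaml above; the function name is an illustrative assumption, not part of the system:

```python
# Sketch: group tasks into "waves" that can run concurrently.
# Task ids and dependencies come from design/tasks.yaml above.
tasks = {1: [], 2: [1], 3: [2], 4: [1], 5: [1]}

def parallel_waves(deps):
    """Kahn-style levelling: each wave contains tasks whose dependencies
    are all complete, so tasks within a wave can run in parallel."""
    remaining = dict(deps)
    done, waves = set(), []
    while remaining:
        ready = sorted(t for t, d in remaining.items() if set(d) <= done)
        if not ready:
            raise ValueError("dependency cycle detected")
        waves.append(ready)
        done.update(ready)
        for t in ready:
            del remaining[t]
    return waves

print(parallel_waves(tasks))  # [[1], [2, 4, 5], [3]]
```

Task 1 forms the first wave; tasks 2, 4, and 5 then run in parallel, and task 3 waits on task 2 — matching the note above.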

5. Plan Review

Agent: Orchestrator (and optionally Judge)

The Orchestrator reviews the complete plan for:

  • Consistency between method spec and experiment design
  • Feasibility within resource constraints
  • Completeness of baselines and ablations
  • Clear task decomposition for the Coder
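Two of the checks above (feasibility against the hour budget, structural soundness of the task list) can be sketched mechanically, assuming the parsed YAML is available as plain dicts. The function and key names here are assumptions for illustration:

```python
def review_plan(tasks, budget_hours):
    """Flag structural problems a plan review looks for: dependencies on
    unknown tasks, and total estimated hours exceeding the budget."""
    issues = []
    ids = {t["id"] for t in tasks}
    for t in tasks:
        for dep in t["dependencies"]:
            if dep not in ids:
                issues.append(f"task {t['id']} depends on unknown task {dep}")
    total = sum(t["estimated_hours"] for t in tasks)
    if total > budget_hours:
        issues.append(f"estimated {total}h exceeds budget of {budget_hours}h")
    return issues

# Example with two deliberate problems: a dangling dependency and an
# over-budget total (4 + 6 + 8 = 18h > 16h).
tasks = [
    {"id": 1, "dependencies": [], "estimated_hours": 4},
    {"id": 2, "dependencies": [1], "estimated_hours": 6},
    {"id": 3, "dependencies": [7], "estimated_hours": 8},
]
print(review_plan(tasks, budget_hours=16))
```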

Gate

| Gate Type | Recommended | Behavior |
| --- | --- | --- |
| human | Yes | User reviews the full experiment plan |
| auto-judge | Possible | Judge checks plan completeness and feasibility |
| auto | Not recommended | Experiment design benefits from human review |

Why human review for design?

A flawed experiment design is expensive to discover during training. Spending 30 minutes reviewing the plan now can save days of wasted compute later. This is the second-highest-leverage human review point after ideation.

Error Handling

| Error | Recovery |
| --- | --- |
| No baselines with code found | Scout broadens search; Planner notes which baselines need reimplementation |
| Resource constraints too tight | Planner proposes scaled-down experiment; Orchestrator discusses with user |
| Plan too ambitious for timeline | Planner splits into "must-have" and "nice-to-have" experiments |
| Missing information about baseline | Orchestrator dispatches Scout for deeper paper analysis |

Outputs Summary

| File | Contents |
| --- | --- |
| design/plan.md | Complete method specification |
| design/baselines.yaml | Baselines with code links and expected performance |
| design/ablations.yaml | Ablation study design |
| design/metrics.yaml | Metrics, targets, and statistical plan |
| design/tasks.yaml | Numbered implementation tasks with dependencies |

Next Stage

When the gate passes, the pipeline advances to Implementation with the complete experiment plan.

AutoResearch — Multi-agent Deep Learning Research System