
Stage 5: Analysis

The analysis stage interprets experiment results, runs additional evaluations if needed, and produces a structured understanding of what the results mean for the paper.

Entering This Stage

What you have:

  • Complete training results (experiments/exp-001/results.yaml)
  • Baseline comparisons (already evaluated)
  • Training logs and curves
  • Original hypotheses from the experiment plan (design/plan.md)

What you don't have yet:

  • Result interpretation
  • Ablation results (may need additional runs)
  • Statistical significance tests
  • Figures and visualizations

Steps

```mermaid
graph TD
    A[1. Result Validation] --> B[2. Hypothesis Checking]
    B --> C[3. Ablation Runs]
    C --> D[4. Statistical Analysis]
    D --> E[5. Visualization]
    E --> F[6. Analysis Synthesis]
    F --> G{Gate}
    G -->|pass| H[Advance to Writing]
    G -->|need more data| C

    style A fill:#f9f0ff,stroke:#7c3aed
    style B fill:#f9f0ff,stroke:#7c3aed
    style C fill:#fef3c7,stroke:#d97706
    style E fill:#fef3c7,stroke:#d97706
    style F fill:#f9f0ff,stroke:#7c3aed
```

1. Result Validation

Agent: Orchestrator (Claude Opus)

The Orchestrator sanity-checks the results:

  • Are numbers in plausible ranges?
  • Do baselines match reported numbers (within tolerance)?
  • Are there obvious anomalies?
  • Is the data complete (all metrics, all experiments)?

Garbage in, garbage out

If baseline reproduction is off by more than 10%, the analysis is suspect. The Orchestrator flags this before proceeding.
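The tolerance check can be sketched as a simple relative-error comparison. This is a hypothetical helper, not the Orchestrator's actual implementation; the 10% threshold matches the rule above.

```python
# Hypothetical sketch of the baseline sanity check. Assumes we have the
# reported number (from the original paper) and our reproduced number.
def baseline_within_tolerance(reported: float, reproduced: float,
                              tolerance: float = 0.10) -> bool:
    """Return True if the reproduced metric is within `tolerance`
    (relative error) of the reported number."""
    return abs(reproduced - reported) / abs(reported) <= tolerance

assert baseline_within_tolerance(17.9, 18.5)      # ~3% off: acceptable
assert not baseline_within_tolerance(17.9, 19.9)  # ~11% off: flag it
```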

2. Hypothesis Checking

Agent: Orchestrator (Claude Opus) + Judge (Codex)

Compare results against the hypotheses from the experiment plan:

```yaml
# Hypothesis checking
hypotheses:
  - claim: "Our method matches Transformer perplexity"
    expected: "< 18.5 ppl"
    actual: "17.8 ppl"
    status: CONFIRMED

  - claim: "Our method is 50% faster than vanilla attention"
    expected: "> 12000 tok/s"
    actual: "11800 tok/s at 4096 len"
    status: PARTIALLY_CONFIRMED
    note: "Meets target at shorter sequences, slightly below at 4096"

  - claim: "O(n) memory scaling"
    expected: "Memory grows linearly with sequence length"
    actual: "38.2GB at 4096, 39.1GB at 8192, 40.0GB at 16384"
    status: CONFIRMED
```

The Judge independently evaluates whether the claimed results support the hypotheses. This prevents the Orchestrator from seeing what it wants to see.
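A minimal sketch of mechanical hypothesis checking: parse the threshold out of the `expected` string and compare it against the measured value. The parsing rules and string formats are assumptions based on the entries shown; nuanced statuses like `PARTIALLY_CONFIRMED` still require the Judge's judgment.

```python
import re

# Hypothetical helper: turn an expected-value string like "< 18.5 ppl"
# and an actual string like "17.8 ppl" into a CONFIRMED/REJECTED status.
def check_hypothesis(expected: str, actual: str) -> str:
    op, threshold = re.match(r"([<>])\s*([\d.]+)", expected).groups()
    value = float(re.search(r"[\d.]+", actual).group())
    ok = value < float(threshold) if op == "<" else value > float(threshold)
    return "CONFIRMED" if ok else "REJECTED"

assert check_hypothesis("< 18.5 ppl", "17.8 ppl") == "CONFIRMED"
# The throughput claim misses its threshold at 4096 tokens; upgrading
# REJECTED to PARTIALLY_CONFIRMED is a judgment call, not a rule.
assert check_hypothesis("> 12000 tok/s", "11800 tok/s at 4096 len") == "REJECTED"
```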

3. Ablation Runs

Agent: Coder (Codex, tmux worker)

Run ablation studies as designed in design/ablations.yaml:

| Ablation | What's Removed | Purpose |
|---|---|---|
| No flash tiling | Use standard retention compute | Isolate flash tiling contribution |
| No recurrence | Use parallel-only mode | Isolate recurrence benefit |
| Single-scale | Remove multi-scale retention | Isolate multi-scale contribution |

Each ablation is a separate short training run (or a reduced-step run if compute is limited).

Ablations can run in parallel with ultrawork

If multiple ablations are independent, ultrawork mode dispatches them simultaneously across available GPUs. This is one of the highest-value uses of parallel execution.
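The dispatch pattern can be sketched as a one-to-one mapping of independent ablations onto GPUs. `launch` here is a hypothetical stand-in that only records the assignment; the real ultrawork scheduler and tmux worker plumbing are not shown.

```python
from concurrent.futures import ThreadPoolExecutor

# Independent ablations mapped one-to-one onto available GPUs.
ablations = ["no_flash_tiling", "no_recurrence", "single_scale"]

def launch(name: str, gpu: int) -> tuple:
    # In the real pipeline this would start a tmux worker with
    # CUDA_VISIBLE_DEVICES=<gpu> running the ablation's training config.
    return (name, gpu)

with ThreadPoolExecutor(max_workers=len(ablations)) as pool:
    assignments = list(pool.map(launch, ablations, range(len(ablations))))

# Each ablation gets its own GPU slot, launched concurrently.
assert assignments == [("no_flash_tiling", 0),
                       ("no_recurrence", 1),
                       ("single_scale", 2)]
```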

4. Statistical Analysis

Agent: Coder (Codex, tmux worker)

Run the statistical plan from design/metrics.yaml:

  • Re-run key experiments with different seeds (if not done during training)
  • Compute mean and standard deviation
  • Run significance tests (paired t-test, bootstrap, etc.)
  • Flag any results that are not statistically significant

```yaml
# experiments/summary.yaml (statistical section)
statistical_results:
  main_comparison:
    ours_vs_transformer:
      metric: perplexity
      ours: "17.8 ± 0.3"
      baseline: "17.9 ± 0.2"
      p_value: 0.42
      significant: false
      note: "Perplexity difference is NOT significant"
    ours_vs_retnet:
      metric: throughput
      ours: "11800 ± 200"
      baseline: "10100 ± 150"
      p_value: 0.001
      significant: true
```
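A paired bootstrap test like the one in the summary can be sketched with the standard library alone. The per-seed scores below are illustrative placeholders, not the experiment's real data, and the real pipeline may use scipy instead.

```python
import random
from statistics import mean

# Minimal paired bootstrap sketch (stdlib only): two-sided p-value for
# the mean difference between per-seed scores of two systems.
def bootstrap_p(ours, baseline, iters=10_000, seed=0):
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(ours, baseline)]
    observed = mean(diffs)
    # Under the null hypothesis the paired differences are centered at
    # zero; count how often a resampled mean is at least as extreme.
    centered = [d - observed for d in diffs]
    hits = sum(
        abs(mean(rng.choices(centered, k=len(centered)))) >= abs(observed)
        for _ in range(iters)
    )
    return hits / iters

# Illustrative per-seed perplexities (NOT the real results above).
ours     = [17.7, 17.9, 17.8, 18.0, 17.6]
baseline = [17.8, 18.0, 17.9, 17.9, 17.9]
p = bootstrap_p(ours, baseline)
assert 0.0 <= p <= 1.0
```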

Handle non-significant results honestly

If the main perplexity difference is not significant, this is a finding — not a problem to hide. The Orchestrator notes this for the Writer: the paper should frame the contribution as efficiency (significant throughput gain) with comparable quality (non-significant perplexity difference).

5. Visualization

Agent: Coder (Codex, tmux worker)

Generate figures based on Scout's descriptions (papers/figures/descriptions.yaml):

  • Training curves (loss over steps)
  • Throughput vs. sequence length comparison
  • Memory scaling plot
  • Ablation bar charts

Output: papers/figures/*.pdf

6. Analysis Synthesis

Agent: Orchestrator (Claude Opus)

The Orchestrator writes a structured analysis document:

```markdown
# experiments/exp-001/analysis.md
## Key Findings

1. **Throughput**: Our method achieves 44% higher throughput than
   vanilla attention at 4096 length (significant, p<0.001)
2. **Perplexity**: Comparable to Transformer (17.8 vs 17.9,
   not significant, p=0.42)
3. **Memory**: Linear scaling confirmed — memory grows <5%
   from 4096 to 16384 tokens

## Ablation Insights

- Flash tiling contributes most of the speedup (+35%)
- Multi-scale retention improves perplexity by 0.4 points
- Recurrence alone is slower than parallel mode at short sequences

## Paper Framing Recommendation

Frame as an efficiency contribution: "same quality, significantly faster"
NOT as a quality contribution: perplexity improvement is not significant

## Remaining Questions

- Performance at 32k+ sequence lengths (not tested due to compute)
- Behavior with larger model sizes (tested only at 125M params)
```

Gate

| Gate Type | Recommendation | Behavior |
|---|---|---|
| human | For first project | User reviews analysis and framing |
| auto-judge | Recommended | Judge evaluates analysis completeness |
| auto | Possible | If analysis is straightforward |

The Judge checks:

  • All planned experiments completed
  • Statistical tests run
  • Ablations complete
  • Results interpretation is consistent with data
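The checklist maps directly onto the gate's pass/need-more-data outcomes. This is a hedged sketch: the dict shape is an assumption about what the Judge receives, and the real auto-judge evaluates the artifacts themselves rather than pre-computed booleans.

```python
# Hypothetical auto-judge gate over the analysis checklist above.
def analysis_gate(report: dict) -> str:
    checks = [
        report.get("experiments_completed", False),
        report.get("statistical_tests_run", False),
        report.get("ablations_complete", False),
        report.get("interpretation_consistent", False),
    ]
    return "pass" if all(checks) else "need more data"

assert analysis_gate({
    "experiments_completed": True,
    "statistical_tests_run": True,
    "ablations_complete": True,
    "interpretation_consistent": True,
}) == "pass"
# Any missing item sends the pipeline back to ablation runs.
assert analysis_gate({"experiments_completed": True}) == "need more data"
```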

Error Handling

| Error | Recovery |
|---|---|
| Ablation training fails | Coder ralph loop, then escalate |
| Results contradict hypotheses | Orchestrator flags for human review |
| Insufficient statistical significance | Run more seeds, or adjust paper framing |
| Missing baseline comparison | Run additional baseline, or note limitation |

Outputs Summary

| File | Contents |
|---|---|
| `experiments/summary.yaml` | Cross-experiment comparison with statistics |
| `experiments/exp-001/analysis.md` | Structured analysis and paper framing |
| `papers/figures/*.pdf` | Generated visualizations |
| `papers/figures/descriptions.yaml` | Updated with actual data |

Next Stage

When the gate passes, the pipeline advances to Writing with complete, analyzed results.

AutoResearch — Multi-agent Deep Learning Research System