
Stage 5: Analysis

The analysis stage interprets experiment results, runs additional evaluations if needed, and produces a structured understanding of what the results mean for the paper.

Entering This Stage

What you have:

  • Complete training results (experiments/exp-001/results.yaml)
  • Baseline comparisons (already evaluated)
  • Training logs and curves
  • Original hypotheses from the experiment plan (design/plan.md)

What you don't have yet:

  • Result interpretation
  • Ablation results (may need additional runs)
  • Statistical significance tests
  • Figures and visualizations

Steps

```mermaid
graph TD
    A[1. Result Validation] --> B[2. Hypothesis Checking]
    B --> C[3. Ablation Runs]
    C --> D[4. Statistical Analysis]
    D --> E[5. Visualization]
    E --> F[6. Analysis Synthesis]
    F --> G{Gate}
    G -->|pass| H[Advance to Writing]
    G -->|need more data| C

    style A fill:#f9f0ff,stroke:#7c3aed
    style B fill:#f9f0ff,stroke:#7c3aed
    style C fill:#fef3c7,stroke:#d97706
    style E fill:#fef3c7,stroke:#d97706
    style F fill:#f9f0ff,stroke:#7c3aed
```

1. Result Validation

Agent: Orchestrator (Claude Opus)

The Orchestrator sanity-checks the results:

  • Are numbers in plausible ranges?
  • Do baselines match reported numbers (within tolerance)?
  • Are there obvious anomalies?
  • Is the data complete (all metrics, all experiments)?

Garbage in, garbage out

If baseline reproduction is off by more than 10%, the analysis is suspect. The Orchestrator flags this before proceeding.
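The tolerance check can be sketched as a simple relative-error comparison. This is a hypothetical helper, not the Orchestrator's actual implementation; the 10% threshold matches the rule above.

```python
# Hypothetical sketch of the baseline sanity check. Assumes we have the
# reported number (from the original paper) and our reproduced number.
def baseline_within_tolerance(reported: float, reproduced: float,
                              tolerance: float = 0.10) -> bool:
    """Return True if the reproduced metric is within `tolerance`
    (relative error) of the reported number."""
    return abs(reproduced - reported) / abs(reported) <= tolerance

assert baseline_within_tolerance(17.9, 18.5)      # ~3% off: acceptable
assert not baseline_within_tolerance(17.9, 19.9)  # ~11% off: flag it
```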

2. Hypothesis Checking

Agent: Orchestrator (Claude Opus) + Judge (Codex)

Compare results against the hypotheses from the experiment plan:

```yaml
# Hypothesis checking
hypotheses:
  - claim: "Our method matches Transformer perplexity"
    expected: "< 18.5 ppl"
    actual: "17.8 ppl"
    status: CONFIRMED

  - claim: "Our method is 50% faster than vanilla attention"
    expected: "> 12000 tok/s"
    actual: "11800 tok/s at 4096 len"
    status: PARTIALLY_CONFIRMED
    note: "Meets target at shorter sequences, slightly below at 4096"

  - claim: "O(n) memory scaling"
    expected: "Memory grows linearly with sequence length"
    actual: "38.2GB at 4096, 39.1GB at 8192, 40.0GB at 16384"
    status: CONFIRMED
```

The Judge independently evaluates whether the claimed results support the hypotheses. This prevents the Orchestrator from seeing what it wants to see.
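A minimal sketch of mechanical hypothesis checking: parse the threshold out of the `expected` string and compare it against the measured value. The parsing rules and string formats are assumptions based on the entries shown; nuanced statuses like `PARTIALLY_CONFIRMED` still require the Judge's judgment.

```python
import re

# Hypothetical helper: turn an expected-value string like "< 18.5 ppl"
# and an actual string like "17.8 ppl" into a CONFIRMED/REJECTED status.
def check_hypothesis(expected: str, actual: str) -> str:
    op, threshold = re.match(r"([<>])\s*([\d.]+)", expected).groups()
    value = float(re.search(r"[\d.]+", actual).group())
    ok = value < float(threshold) if op == "<" else value > float(threshold)
    return "CONFIRMED" if ok else "REJECTED"

assert check_hypothesis("< 18.5 ppl", "17.8 ppl") == "CONFIRMED"
# The throughput claim misses its threshold at 4096 tokens; upgrading
# REJECTED to PARTIALLY_CONFIRMED is a judgment call, not a rule.
assert check_hypothesis("> 12000 tok/s", "11800 tok/s at 4096 len") == "REJECTED"
```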

3. Ablation Runs

Agent: Coder (Codex, tmux worker)

Run ablation studies as designed in design/ablations.yaml:

| Ablation | What's Removed | Purpose |
|---|---|---|
| No flash tiling | Use standard retention compute | Isolate flash tiling contribution |
| No recurrence | Use parallel-only mode | Isolate recurrence benefit |
| Single-scale | Remove multi-scale retention | Isolate multi-scale contribution |

Each ablation is a separate short training run (or a reduced-step run if compute is limited).

Ablations can run in parallel with ultrawork

If multiple ablations are independent, ultrawork mode dispatches them simultaneously across available GPUs. This is one of the highest-value uses of parallel execution.
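The dispatch pattern can be sketched as a one-to-one mapping of independent ablations onto GPUs. `launch` here is a hypothetical stand-in that only records the assignment; the real ultrawork scheduler and tmux worker plumbing are not shown.

```python
from concurrent.futures import ThreadPoolExecutor

# Independent ablations mapped one-to-one onto available GPUs.
ablations = ["no_flash_tiling", "no_recurrence", "single_scale"]

def launch(name: str, gpu: int) -> tuple:
    # In the real pipeline this would start a tmux worker with
    # CUDA_VISIBLE_DEVICES=<gpu> running the ablation's training config.
    return (name, gpu)

with ThreadPoolExecutor(max_workers=len(ablations)) as pool:
    assignments = list(pool.map(launch, ablations, range(len(ablations))))

# Each ablation gets its own GPU slot, launched concurrently.
assert assignments == [("no_flash_tiling", 0),
                       ("no_recurrence", 1),
                       ("single_scale", 2)]
```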

4. Statistical Analysis

Agent: Coder (Codex, tmux worker)

Run the statistical plan from design/metrics.yaml:

  • Re-run key experiments with different seeds (if not done during training)
  • Compute mean and standard deviation
  • Run significance tests (paired t-test, bootstrap, etc.)
  • Flag any results that are not statistically significant

```yaml
# experiments/summary.yaml (statistical section)
statistical_results:
  main_comparison:
    ours_vs_transformer:
      metric: perplexity
      ours: "17.8 ± 0.3"
      baseline: "17.9 ± 0.2"
      p_value: 0.42
      significant: false
      note: "Perplexity difference is NOT significant"
    ours_vs_retnet:
      metric: throughput
      ours: "11800 ± 200"
      baseline: "10100 ± 150"
      p_value: 0.001
      significant: true
```
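A paired bootstrap test like the one in the summary can be sketched with the standard library alone. The per-seed scores below are illustrative placeholders, not the experiment's real data, and the real pipeline may use scipy instead.

```python
import random
from statistics import mean

# Minimal paired bootstrap sketch (stdlib only): two-sided p-value for
# the mean difference between per-seed scores of two systems.
def bootstrap_p(ours, baseline, iters=10_000, seed=0):
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(ours, baseline)]
    observed = mean(diffs)
    # Under the null hypothesis the paired differences are centered at
    # zero; count how often a resampled mean is at least as extreme.
    centered = [d - observed for d in diffs]
    hits = sum(
        abs(mean(rng.choices(centered, k=len(centered)))) >= abs(observed)
        for _ in range(iters)
    )
    return hits / iters

# Illustrative per-seed perplexities (NOT the real results above).
ours     = [17.7, 17.9, 17.8, 18.0, 17.6]
baseline = [17.8, 18.0, 17.9, 17.9, 17.9]
p = bootstrap_p(ours, baseline)
assert 0.0 <= p <= 1.0
```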

Handle non-significant results honestly

If the main perplexity difference is not significant, this is a finding — not a problem to hide. The Orchestrator notes this for the Writer: the paper should frame the contribution as efficiency (significant throughput gain) with comparable quality (non-significant perplexity difference).

5. Visualization

Agent: Coder (Codex, tmux worker)

Generate figures based on Scout's descriptions (papers/figures/descriptions.yaml):

  • Training curves (loss over steps)
  • Throughput vs. sequence length comparison
  • Memory scaling plot
  • Ablation bar charts

Output: papers/figures/*.pdf

6. Analysis Synthesis

Agent: Orchestrator (Claude Opus)

The Orchestrator writes a structured analysis document:

```markdown
# experiments/exp-001/analysis.md
## Key Findings

1. **Throughput**: Our method achieves 44% higher throughput than
   vanilla attention at 4096 length (significant, p<0.001)
2. **Perplexity**: Comparable to Transformer (17.8 vs 17.9,
   not significant, p=0.42)
3. **Memory**: Linear scaling confirmed — memory grows <5%
   from 4096 to 16384 tokens

## Ablation Insights

- Flash tiling contributes most of the speedup (+35%)
- Multi-scale retention improves perplexity by 0.4 points
- Recurrence alone is slower than parallel mode at short sequences

## Paper Framing Recommendation

Frame as an efficiency contribution: "same quality, significantly faster"
NOT as a quality contribution: perplexity improvement is not significant

## Remaining Questions

- Performance at 32k+ sequence lengths (not tested due to compute)
- Behavior with larger model sizes (tested only at 125M params)
```

Gate

| Gate Type | Recommendation | Behavior |
|---|---|---|
| human | For first project | User reviews analysis and framing |
| auto-judge | Recommended | Judge evaluates analysis completeness |
| auto | Possible | If analysis is straightforward |

The Judge checks:

  • All planned experiments completed
  • Statistical tests run
  • Ablations complete
  • Results interpretation is consistent with data
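The checklist maps directly onto the gate's pass/need-more-data outcomes. This is a hedged sketch: the dict shape is an assumption about what the Judge receives, and the real auto-judge evaluates the artifacts themselves rather than pre-computed booleans.

```python
# Hypothetical auto-judge gate over the analysis checklist above.
def analysis_gate(report: dict) -> str:
    checks = [
        report.get("experiments_completed", False),
        report.get("statistical_tests_run", False),
        report.get("ablations_complete", False),
        report.get("interpretation_consistent", False),
    ]
    return "pass" if all(checks) else "need more data"

assert analysis_gate({
    "experiments_completed": True,
    "statistical_tests_run": True,
    "ablations_complete": True,
    "interpretation_consistent": True,
}) == "pass"
# Any missing item sends the pipeline back to ablation runs.
assert analysis_gate({"experiments_completed": True}) == "need more data"
```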

Error Handling

| Error | Recovery |
|---|---|
| Ablation training fails | Coder ralph loop, then escalate |
| Results contradict hypotheses | Orchestrator flags for human review |
| Insufficient statistical significance | Run more seeds, or adjust paper framing |
| Missing baseline comparison | Run additional baseline, or note limitation |

Outputs Summary

| File | Contents |
|---|---|
| `experiments/summary.yaml` | Cross-experiment comparison with statistics |
| `experiments/exp-001/analysis.md` | Structured analysis and paper framing |
| `papers/figures/*.pdf` | Generated visualizations |
| `papers/figures/descriptions.yaml` | Updated with actual data |

Next Stage

When the gate passes, the pipeline advances to Writing with complete, analyzed results.

AutoResearch — Multi-agent Deep Learning Research System