Stage 5: Analysis
The analysis stage interprets experiment results, runs additional evaluations if needed, and produces a structured understanding of what the results mean for the paper.
Entering This Stage
What you have:
- Complete training results (experiments/exp-001/results.yaml)
- Baseline comparisons (already evaluated)
- Training logs and curves
- Original hypotheses from the experiment plan (design/plan.md)
What you don't have yet:
- Result interpretation
- Ablation results (may need additional runs)
- Statistical significance tests
- Figures and visualizations
Steps
graph TD
A[1. Result Validation] --> B[2. Hypothesis Checking]
B --> C[3. Ablation Runs]
C --> D[4. Statistical Analysis]
D --> E[5. Visualization]
E --> F[6. Analysis Synthesis]
F --> G{Gate}
G -->|pass| H[Advance to Writing]
G -->|need more data| C
style A fill:#f9f0ff,stroke:#7c3aed
style B fill:#f9f0ff,stroke:#7c3aed
style C fill:#fef3c7,stroke:#d97706
style E fill:#fef3c7,stroke:#d97706
style F fill:#f9f0ff,stroke:#7c3aed
1. Result Validation
Agent: Orchestrator (Claude Opus)
The Orchestrator sanity-checks the results:
- Are numbers in plausible ranges?
- Do baselines match reported numbers (within tolerance)?
- Are there obvious anomalies?
- Is the data complete (all metrics, all experiments)?
Garbage in, garbage out
If a reproduced baseline deviates from its reported number by more than 10%, the analysis is suspect. The Orchestrator flags this before proceeding.
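The 10% rule can be written as a relative-tolerance check. The following is a minimal sketch, not the Orchestrator's actual logic: the function names, the dictionary layout, and the example numbers are assumptions made for illustration.

```python
def baseline_within_tolerance(reproduced, reported, tolerance=0.10):
    """Return True if `reproduced` is within a relative `tolerance` of `reported`."""
    if reported == 0:
        raise ValueError("reported value must be non-zero for a relative comparison")
    return abs(reproduced - reported) / abs(reported) <= tolerance


def find_suspect_baselines(reproduced, reported, tolerance=0.10):
    """List metric names that are missing or deviate by more than `tolerance`."""
    suspects = []
    for name, reported_value in reported.items():
        value = reproduced.get(name)
        if value is None or not baseline_within_tolerance(value, reported_value, tolerance):
            suspects.append(name)
    return suspects


# Hypothetical numbers, for illustration only.
reported = {"transformer_ppl": 17.9, "retnet_tok_s": 10100}
reproduced = {"transformer_ppl": 18.1, "retnet_tok_s": 8900}
suspects = find_suspect_baselines(reproduced, reported)
# suspects == ["retnet_tok_s"]: 8900 is about 11.9% below 10100
```

A missing metric is treated as suspect here, which also covers the "is the data complete" check from the list above.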
2. Hypothesis Checking
Agent: Orchestrator (Claude Opus) + Judge (Codex)
Compare results against the hypotheses from the experiment plan:
# Hypothesis checking
hypotheses:
- claim: "Our method matches Transformer perplexity"
expected: "< 18.5 ppl"
actual: "17.8 ppl"
status: CONFIRMED
- claim: "Our method is 50% faster than vanilla attention"
expected: "> 12000 tok/s"
actual: "11800 tok/s at 4096 len"
status: PARTIALLY_CONFIRMED
note: "Meets target at shorter sequences, slightly below at 4096"
- claim: "O(n) memory scaling"
expected: "Memory grows linearly with sequence length"
actual: "38.2GB at 4096, 39.1GB at 8192, 40.0GB at 16384"
status: CONFIRMED
The Judge independently evaluates whether the claimed results support the hypotheses. This prevents the Orchestrator from seeing what it wants to see.
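One way to make the memory-scaling check less subjective is to estimate the scaling exponent from the measurements: the slope of log(memory) against log(sequence length). The following is a sketch under that assumption. The helper name `scaling_exponent` is hypothetical, and the pipeline may check this claim differently.

```python
import math


def scaling_exponent(lengths, memory_gb):
    """Least-squares slope of log(memory) against log(sequence length).

    A slope near 1 suggests linear growth, near 2 quadratic growth,
    and near 0 a cost dominated by a fixed overhead.
    """
    xs = [math.log(n) for n in lengths]
    ys = [math.log(m) for m in memory_gb]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Measurements quoted in the hypothesis block above.
exponent = scaling_exponent([4096, 8192, 16384], [38.2, 39.1, 40.0])
# exponent is roughly 0.03, far below the 2.0 expected for quadratic attention
```

An exponent this close to zero is consistent with an O(n) upper bound, but it mostly shows that fixed costs dominate over this range. Longer sequences would be needed to see the linear term clearly.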
3. Ablation Runs
Agent: Coder (Codex, tmux worker)
Run ablation studies as designed in design/ablations.yaml:
| Ablation | What's Removed | Purpose |
|---|---|---|
| No flash tiling | Use standard retention compute | Isolate flash tiling contribution |
| No recurrence | Use parallel-only mode | Isolate recurrence benefit |
| Single-scale | Remove multi-scale retention | Isolate multi-scale contribution |
Each ablation is a separate short training run (or a reduced-step run if compute is limited).
Ablations can run in parallel with ultrawork
If multiple ablations are independent, ultrawork mode dispatches them simultaneously across available GPUs. This is one of the highest-value uses of parallel execution.
4. Statistical Analysis
Agent: Coder (Codex, tmux worker)
Run the statistical plan from design/metrics.yaml:
- Re-run key experiments with different seeds (if not done during training)
- Compute mean and standard deviation
- Run significance tests (paired t-test, bootstrap, etc.)
- Flag any results that are not statistically significant
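The significance step can be illustrated with the standard library alone. The sketch below uses an exact paired sign-flip permutation test, swapped in here for the paired t-test or bootstrap named above because it needs no extra dependencies. The function name and the per-seed numbers are assumptions made for illustration, not results from the experiments.

```python
from itertools import product
from statistics import mean


def paired_sign_flip_test(ours, baseline):
    """Two-sided exact sign-flip permutation test on paired per-seed scores."""
    if len(ours) != len(baseline):
        raise ValueError("paired test needs one score per seed for both systems")
    diffs = [a - b for a, b in zip(ours, baseline)]
    observed = abs(mean(diffs))
    # Under the null hypothesis, each paired difference is equally likely
    # to have either sign, so enumerate every sign assignment.
    at_least_as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        flipped = abs(mean([s * d for s, d in zip(signs, diffs)]))
        total += 1
        if flipped >= observed - 1e-12:
            at_least_as_extreme += 1
    return at_least_as_extreme / total


# Hypothetical per-seed throughput (tok/s) for five seeds.
ours_tok_s = [11600, 11900, 12000, 11700, 11800]
baseline_tok_s = [10000, 10200, 10250, 9950, 10100]
p_value = paired_sign_flip_test(ours_tok_s, baseline_tok_s)
# p_value == 0.0625: every seed favors our method, yet p stays above 0.05
```

With n seeds the smallest two-sided p-value this test can produce is 2 / 2^n, so five seeds can never reach 0.05. That is one concrete reason to re-run with more seeds before calling a result non-significant.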
# experiments/summary.yaml (statistical section)
statistical_results:
main_comparison:
ours_vs_transformer:
metric: perplexity
ours: "17.8 ± 0.3"
baseline: "17.9 ± 0.2"
p_value: 0.42
significant: false
note: "Perplexity difference is NOT significant"
ours_vs_retnet:
metric: throughput
ours: "11800 ± 200"
baseline: "10100 ± 150"
p_value: 0.001
significant: true
Handle non-significant results honestly
If the main perplexity difference is not significant, this is a finding — not a problem to hide. The Orchestrator notes this for the Writer: the paper should frame the contribution as efficiency (significant throughput gain) with comparable quality (non-significant perplexity difference).
5. Visualization
Agent: Coder (Codex, tmux worker)
Generate figures based on Scout's descriptions (papers/figures/descriptions.yaml):
- Training curves (loss over steps)
- Throughput vs. sequence length comparison
- Memory scaling plot
- Ablation bar charts
Output: papers/figures/*.pdf
6. Analysis Synthesis
Agent: Orchestrator (Claude Opus)
The Orchestrator writes a structured analysis document:
# experiments/exp-001/analysis.md
## Key Findings
1. **Throughput**: Our method achieves 44% higher throughput than
vanilla attention at 4096 length (significant, p<0.001)
2. **Perplexity**: Comparable to Transformer (17.8 vs 17.9,
not significant, p=0.42)
3. **Memory**: Linear scaling confirmed — memory grows <5%
from 4096 to 16384 tokens
## Ablation Insights
- Flash tiling contributes most of the speedup (+35%)
- Multi-scale retention improves perplexity by 0.4 points
- Recurrence alone is slower than parallel mode at short sequences
## Paper Framing Recommendation
Frame as an efficiency contribution: "same quality, significantly faster"
NOT as a quality contribution: perplexity improvement is not significant
## Remaining Questions
- Performance at 32k+ sequence lengths (not tested due to compute)
- Behavior with larger model sizes (tested only at 125M params)
Gate
| Gate Type | Recommended | Behavior |
|---|---|---|
| human | For first project | User reviews analysis and framing |
| auto-judge | Recommended | Judge evaluates analysis completeness |
| auto | Possible | If analysis is straightforward |
The Judge checks:
- All planned experiments completed
- Statistical tests run
- Ablations complete
- Results interpretation is consistent with data
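The four checks map naturally onto a small checklist that yields one of the two gate outcomes from the diagram ("pass" or "need more data"). This is a sketch only: `GateReport` and `gate_decision` are hypothetical names, and the Judge's real evaluation is presumably richer than four booleans.

```python
from dataclasses import dataclass


@dataclass
class GateReport:
    """One boolean per Judge check (field names are illustrative)."""
    experiments_complete: bool
    statistical_tests_run: bool
    ablations_complete: bool
    interpretation_consistent: bool


def gate_decision(report):
    """Return the gate outcome and the names of any failed checks."""
    failed = [name for name, ok in vars(report).items() if not ok]
    outcome = "pass" if not failed else "need more data"
    return outcome, failed


outcome, failed = gate_decision(
    GateReport(
        experiments_complete=True,
        statistical_tests_run=False,
        ablations_complete=True,
        interpretation_consistent=True,
    )
)
# outcome == "need more data", failed == ["statistical_tests_run"]
```

Returning the failed check names alongside the outcome makes it clear which step to revisit when the gate does not pass.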
Error Handling
| Error | Recovery |
|---|---|
| Ablation training fails | Coder ralph loop, then escalate |
| Results contradicting hypotheses | Orchestrator flags for human review |
| Insufficient statistical significance | Run more seeds, or adjust paper framing |
| Missing baseline comparison | Run additional baseline, or note limitation |
Outputs Summary
| File | Contents |
|---|---|
| experiments/summary.yaml | Cross-experiment comparison with statistics |
| experiments/exp-001/analysis.md | Structured analysis and paper framing |
| papers/figures/*.pdf | Generated visualizations |
| papers/figures/descriptions.yaml | Updated with actual data |
Next Stage
When the gate passes, the pipeline advances to Writing with complete, analyzed results.