Judge
The Judge is the independent evaluator. It reviews artifacts — ideas, code, results, papers — and renders structured verdicts. Its independence is the cornerstone of AutoResearch's quality assurance.
Identity
| Property | Value |
|---|---|
| LLM | Codex (GPT) |
| Invocation | `codex exec --skip-git-repo-check -m gpt-5.4 "prompt"` |
| Lifecycle | Stateless — no memory between invocations |
| Context | Only the artifact being evaluated + evaluation criteria |
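Because the Judge is stateless, every invocation must be fully self-contained. A minimal sketch of how such an invocation could be assembled, assuming the CLI shown in the table above (`build_judge_command` is a hypothetical helper, not part of AutoResearch):

```python
import shlex

def build_judge_command(prompt: str, model: str = "gpt-5.4") -> list[str]:
    """Assemble the argv for one stateless Judge invocation.

    Each call is self-contained: the prompt must carry the artifact and
    the evaluation criteria, because no state survives between calls.
    """
    return ["codex", "exec", "--skip-git-repo-check", "-m", model, prompt]

cmd = build_judge_command("Evaluate the attached idea against the criteria.")
print(shlex.join(cmd))
```

Note that the prompt is the only channel into the evaluation; there is no session, history file, or conversation ID to thread through.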
Independence Rules
These rules are critical
The Judge's independence is not a nice-to-have. It is an architectural invariant enforced by the omc-orchestrator hook. Violating independence invalidates the evaluation.
| Rule | Description |
|---|---|
| No creation history | The Judge never sees how an artifact was created |
| No agent context | The Judge doesn't know what the Coder struggled with |
| No Orchestrator reasoning | The Judge doesn't know why the Orchestrator chose this direction |
| Stateless invocation | Each codex exec call starts from zero — no memory of previous evaluations |
| Cross-LLM for self-produced | When evaluating Codex-produced code, Claude (Orchestrator) also reviews |
| Structured output only | Verdicts are YAML, not prose — forces concrete evaluation |
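The hook-enforced guard could look roughly like the following sketch: strip every provenance field from the payload before it reaches the Judge. The field names here are illustrative, not AutoResearch's actual schema:

```python
# Hypothetical independence guard: before an artifact reaches the Judge,
# any provenance the rules above forbid is stripped from the payload.
FORBIDDEN_KEYS = {
    "creation_history",    # how the artifact was produced
    "agent_context",       # what the Coder struggled with
    "orchestrator_notes",  # why this direction was chosen
    "previous_verdicts",   # earlier Judge evaluations
}

def sanitize_for_judge(payload: dict) -> dict:
    """Return only the fields the Judge is allowed to see."""
    return {k: v for k, v in payload.items() if k not in FORBIDDEN_KEYS}

clean = sanitize_for_judge({
    "artifact": "def foo(): ...",
    "criteria": {"novelty_bar": 6},
    "creation_history": ["attempt 1 failed"],
})
# 'creation_history' is dropped; 'artifact' and 'criteria' pass through
```

An allowlist (keep only `artifact` and `criteria`) would be stricter than this denylist; the sketch mirrors the table's rules one-to-one for readability.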
Why stateless?
Statefulness creates bias. If the Judge remembered evaluating an earlier version of the same idea, it might be anchored to its previous assessment. Stateless invocation means every evaluation is fresh, based solely on the artifact's merits.
Evaluation Tasks
| Task | When | What It Receives |
|---|---|---|
| Idea review | After ideation | Idea description + evaluation criteria |
| Code review | After implementation | Code + design spec + test results |
| Result evaluation | After training | Results + expected baselines + metrics |
| Paper review | After writing | Paper draft + venue criteria |
| Revision check | After revision | Original reviews + revised draft |
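A sketch of how the Orchestrator could validate that an invocation carries everything the table says each task receives. The task names and input keys below are illustrative, not AutoResearch's API:

```python
# Hypothetical mapping of each evaluation task to its required inputs,
# taken directly from the "What It Receives" column above.
REQUIRED_INPUTS = {
    "idea_review": ["idea", "evaluation_criteria"],
    "code_review": ["code", "design_spec", "test_results"],
    "result_evaluation": ["results", "expected_baselines", "metrics"],
    "paper_review": ["paper_draft", "venue_criteria"],
    "revision_check": ["original_reviews", "revised_draft"],
}

def missing_inputs(task: str, provided: dict) -> list[str]:
    """List required inputs absent from an invocation payload."""
    return [k for k in REQUIRED_INPUTS[task] if k not in provided]

missing = missing_inputs("code_review", {"code": "...", "design_spec": "..."})
# test_results is required for code review but was not provided
```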
Idea Review
The most nuanced evaluation task. The Judge assesses ideas across five dimensions.
Five Dimensions
| Dimension | Question | Scale |
|---|---|---|
| Novelty | Has this been done before? How different is it? | 1-10 |
| Feasibility | Can this be implemented with available resources? | 1-10 |
| Verifiability | Can the claims be empirically validated? | 1-10 |
| Attack Surface | What are the obvious failure modes and criticisms? | List |
| Impact | If it works, how significant is the contribution? | 1-10 |
What the Judge Receives
```yaml
# Input to codex exec
idea:
  title: "Flash-Recurrent Attention"
  description: |
    Combine flash attention's IO-aware tiling with RetNet's
    recurrent formulation. Apply flash attention's memory-efficient
    forward/backward pass to the retention mechanism.
  expected_benefit: "O(n) memory + hardware efficiency of flash attention"
constraints:
  gpu: "4x A100 80GB"
  timeline: "2 weeks implementation"
  venue: "ICML 2025"
evaluation_criteria:
  novelty_bar: 6      # Minimum novelty score to pass
  feasibility_bar: 7  # Minimum feasibility score to pass
```
What the Judge Does NOT Receive
- Who proposed the idea
- What other ideas were considered
- The Orchestrator's preference
- Scout's search process
- Any previous Judge evaluations
Structured Verdict Output
All verdicts are structured YAML. No free-form prose for the verdict itself (though each dimension includes a brief justification).
```yaml
# reviews/idea_review.yaml
verdict: PASS
confidence: 0.75
scores:
  novelty:
    score: 7
    justification: |
      Flash attention + recurrent attention combination is unexplored.
      Both components are well-known but their integration is novel.
  feasibility:
    score: 8
    justification: |
      Both flash attention and RetNet have open-source implementations.
      Integration is engineering work, not research risk.
  verifiability:
    score: 9
    justification: |
      Standard LM benchmarks. Throughput and memory are directly measurable.
      Perplexity comparison with baselines is straightforward.
  impact:
    score: 6
    justification: |
      Incremental improvement on efficiency. Useful but not paradigm-shifting.
      Would be a solid workshop paper; borderline for main conference.
attack_surface:
  - "Flash attention's tiling may not be compatible with retention's decay pattern"
  - "Memory savings may be marginal if the recurrent state is large"
  - "Reviewer may argue this is 'just engineering' rather than a research contribution"
recommendation: |
  Proceed with implementation. Address attack surface point 1 early —
  if tiling is incompatible, the idea may need fundamental revision.
  Strengthen the 'research contribution' angle by showing the tiling
  adaptation requires non-trivial algorithmic changes.
```
PASS / REVISE / FAIL
The verdict is always one of three values:
- PASS — artifact meets the bar, proceed
- REVISE — specific issues identified, fix and re-submit
- FAIL — fundamental problems, reconsider the approach
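One plausible way to derive the verdict from the scores and the bars in `evaluation_criteria` is a simple gate, sketched below. The exact policy (thresholds, which dimensions are gated) is an assumption for illustration, not AutoResearch's actual logic:

```python
def gate_idea(scores: dict, criteria: dict) -> str:
    """Map dimension scores onto PASS / REVISE / FAIL.

    Hypothetical policy: FAIL if any gated score is far below its bar,
    REVISE if it is just under, PASS otherwise. The bars come straight
    from the evaluation_criteria block the Judge receives.
    """
    verdict = "PASS"
    for dim in ("novelty", "feasibility"):
        bar = criteria[f"{dim}_bar"]
        score = scores[dim]
        if score < bar - 2:
            return "FAIL"  # far below the bar: fundamental problem
        if score < bar:
            verdict = "REVISE"  # just under the bar: fixable
    return verdict

print(gate_idea({"novelty": 7, "feasibility": 8},
                {"novelty_bar": 6, "feasibility_bar": 7}))  # PASS
```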
Three-Model Paper Review
Paper review is the Judge's most complex task. It orchestrates a three-model review panel to simulate peer review.
```mermaid
graph TD
    O[Orchestrator] -->|"paper draft"| R[Review Process]
    R --> C[Codex Review<br/>codex exec]
    R --> CL[Claude Review<br/>sub-agent]
    R --> G[Gemini Review<br/>tmux worker]
    C -->|"review.yaml"| M[Meta-Review]
    CL -->|"review.yaml"| M
    G -->|"review.yaml"| M
    M --> V[Aggregated Verdict]
    style R fill:#dbeafe,stroke:#2563eb
    style C fill:#fef3c7,stroke:#d97706
    style CL fill:#ede9fe,stroke:#7c3aed
    style G fill:#ecfdf5,stroke:#059669
    style V fill:#fef3c7,stroke:#d97706
```
Each Reviewer's Focus
| Reviewer | LLM | Focus Areas |
|---|---|---|
| Codex | Codex (GPT) | Technical correctness, experimental design, reproducibility |
| Claude | Claude Opus | Clarity of writing, strength of arguments, novelty claims |
| Gemini | Gemini | Related work completeness, positioning, broader impact |
Review Output Format
Each reviewer produces a structured review:
```yaml
# reviews/paper_reviews/codex_review.yaml
reviewer: codex
overall: WEAK_ACCEPT
strengths:
  - "Clear experimental setup with strong baselines"
  - "Ablation study is thorough"
  - "Code will be released (reproducibility)"
weaknesses:
  - "Missing comparison with Mamba (concurrent work)"
  - "Wall-clock time not reported, only throughput"
  - "Error bars missing from Table 2"
questions:
  - "How does performance scale beyond 32k sequence length?"
  - "What is the training time compared to standard Transformer?"
suggestions:
  - "Add Mamba comparison in Table 1"
  - "Report wall-clock training time"
  - "Add standard deviation to all reported numbers"
confidence: 3  # 1-5 scale
```
Meta-Review
The Orchestrator aggregates the three reviews into a meta-review:
```yaml
# reviews/meta_review.yaml
verdict: REVISE
consensus: "2/3 accept, 1/3 borderline"
critical_issues:
  - "Missing Mamba comparison (raised by 2/3 reviewers)"
  - "No error bars (raised by all reviewers)"
revision_priorities:
  1: "Add Mamba baseline experiment"
  2: "Re-run experiments with 3 seeds, add error bars"
  3: "Report wall-clock training time"
```
Why three models instead of one?
A single reviewer has blind spots. Codex might miss writing quality issues. Claude might not catch a subtle experimental flaw. Gemini might miss a recent related paper. Three models with different training data and reasoning styles provide broader coverage — similar to how real peer review uses multiple reviewers.
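A minimal sketch of how three independent reviews could be aggregated into a single meta-verdict. The rating scale and thresholds below are assumptions for illustration, not AutoResearch's actual aggregation rule:

```python
# Hypothetical mapping of reviewer ratings to numeric values.
RATING_VALUE = {
    "REJECT": 0, "WEAK_REJECT": 1, "BORDERLINE": 2,
    "WEAK_ACCEPT": 3, "ACCEPT": 4,
}

def meta_verdict(ratings: list[str]) -> str:
    """Average the panel's overall ratings and map to PASS / REVISE / FAIL."""
    avg = sum(RATING_VALUE[r] for r in ratings) / len(ratings)
    if avg >= 3.5:
        return "PASS"
    if avg >= 2.0:
        return "REVISE"
    return "FAIL"

print(meta_verdict(["WEAK_ACCEPT", "ACCEPT", "BORDERLINE"]))
```

In practice a real meta-review would also weigh the reviewers' stated `confidence` and treat issues raised by multiple reviewers (like the missing Mamba comparison above) as blocking; the averaging here is the simplest possible stand-in.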
Next
- Orchestrator — who dispatches the Judge
- Review Stage — the full review pipeline stage
- Architecture — cross-LLM review principle