Judge

The Judge is the independent evaluator. It reviews artifacts — ideas, code, results, papers — and renders structured verdicts. Its independence is the cornerstone of AutoResearch's quality assurance.

Identity

| Property | Value |
|---|---|
| LLM | Codex (GPT) |
| Invocation | `codex exec --skip-git-repo-check -m gpt-5.4 "prompt"` |
| Lifecycle | Stateless — no memory between invocations |
| Context | Only the artifact being evaluated + evaluation criteria |
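
A stateless invocation can be sketched in Python by wrapping the `codex exec` command from the table above in a subprocess call. The prompt layout and function names here are illustrative assumptions, not AutoResearch's actual code; only the command itself comes from the Identity table.

```python
import subprocess

def build_cmd(prompt: str) -> list[str]:
    # Mirrors the invocation shown in the Identity table.
    return ["codex", "exec", "--skip-git-repo-check", "-m", "gpt-5.4", prompt]

def judge(artifact: str, criteria: str) -> str:
    # Only the artifact and the evaluation criteria reach the model;
    # nothing from the orchestration session is included in the prompt.
    prompt = (
        "Evaluate the artifact below against the criteria.\n\n"
        f"{artifact}\n\n{criteria}"
    )
    result = subprocess.run(
        build_cmd(prompt), capture_output=True, text=True, check=True
    )
    return result.stdout  # expected to be a structured YAML verdict
```

Because each call spawns a fresh process with a self-contained prompt, no state can leak between evaluations.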

Independence Rules

**These rules are critical.** The Judge's independence is not a nice-to-have; it is an architectural invariant enforced by the `omc-orchestrator` hook. Violating independence invalidates the evaluation.

| Rule | Description |
|---|---|
| No creation history | The Judge never sees how an artifact was created |
| No agent context | The Judge doesn't know what the Coder struggled with |
| No Orchestrator reasoning | The Judge doesn't know why the Orchestrator chose this direction |
| Stateless invocation | Each `codex exec` call starts from zero — no memory of previous evaluations |
| Cross-LLM for self-produced | When evaluating Codex-produced code, Claude (the Orchestrator) also reviews it |
| Structured output only | Verdicts are YAML, not prose — forces concrete evaluation |

Why stateless?

Statefulness creates bias. If the Judge remembered evaluating an earlier version of the same idea, it might be anchored to its previous assessment. Stateless invocation means every evaluation is fresh, based solely on the artifact's merits.

Evaluation Tasks

| Task | When | What It Receives |
|---|---|---|
| Idea review | After ideation | Idea description + evaluation criteria |
| Code review | After implementation | Code + design spec + test results |
| Result evaluation | After training | Results + expected baselines + metrics |
| Paper review | After writing | Paper draft + venue criteria |
| Revision check | After revision | Original reviews + revised draft |

Idea Review

Idea review is the Judge's most nuanced task: it assesses each idea across five dimensions.

Five Dimensions

| Dimension | Question | Scale |
|---|---|---|
| Novelty | Has this been done before? How different is it? | 1-10 |
| Feasibility | Can this be implemented with available resources? | 1-10 |
| Verifiability | Can the claims be empirically validated? | 1-10 |
| Attack Surface | What are the obvious failure modes and criticisms? | List |
| Impact | If it works, how significant is the contribution? | 1-10 |

What the Judge Receives

```yaml
# Input to codex exec
idea:
  title: "Flash-Recurrent Attention"
  description: |
    Combine flash attention's IO-aware tiling with RetNet's
    recurrent formulation. Apply flash attention's memory-efficient
    forward/backward pass to the retention mechanism.
  expected_benefit: "O(n) memory + hardware efficiency of flash attention"

constraints:
  gpu: "4x A100 80GB"
  timeline: "2 weeks implementation"
  venue: "ICML 2025"

evaluation_criteria:
  novelty_bar: 6        # Minimum novelty score to pass
  feasibility_bar: 7    # Minimum feasibility score to pass
```
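
The `*_bar` thresholds above gate the verdict: a dimension that falls below its bar blocks a PASS. A minimal Python sketch of that check, assuming a flat score dict and the `_bar` naming convention shown above (the function itself is illustrative):

```python
def apply_bars(scores: dict, criteria: dict) -> str:
    """Return PASS if every *_bar in criteria is met, else REVISE."""
    for key, minimum in criteria.items():
        if not key.endswith("_bar"):
            continue
        dimension = key.removesuffix("_bar")  # e.g. "novelty_bar" -> "novelty"
        if scores.get(dimension, 0) < minimum:
            return "REVISE"
    return "PASS"

print(apply_bars({"novelty": 7, "feasibility": 8},
                 {"novelty_bar": 6, "feasibility_bar": 7}))  # prints PASS
```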

What the Judge Does NOT Receive

  • Who proposed the idea
  • What other ideas were considered
  • The Orchestrator's preference
  • Scout's search process
  • Any previous Judge evaluations

Structured Verdict Output

All verdicts are structured YAML. No free-form prose for the verdict itself (though each dimension includes a brief justification).

```yaml
# reviews/idea_review.yaml
verdict: PASS
confidence: 0.75

scores:
  novelty:
    score: 7
    justification: |
      Flash attention + recurrent attention combination is unexplored.
      Both components are well-known but their integration is novel.
  feasibility:
    score: 8
    justification: |
      Both flash attention and RetNet have open-source implementations.
      Integration is engineering work, not research risk.
  verifiability:
    score: 9
    justification: |
      Standard LM benchmarks. Throughput and memory are directly measurable.
      Perplexity comparison with baselines is straightforward.
  impact:
    score: 6
    justification: |
      Incremental improvement on efficiency. Useful but not paradigm-shifting.
      Would be a solid workshop paper; borderline for main conference.

attack_surface:
  - "Flash attention's tiling may not be compatible with retention's decay pattern"
  - "Memory savings may be marginal if the recurrent state is large"
  - "Reviewer may argue this is 'just engineering' rather than a research contribution"

recommendation: |
  Proceed with implementation. Address attack surface point 1 early —
  if tiling is incompatible, the idea may need fundamental revision.
  Strengthen the 'research contribution' angle by showing the tiling
  adaptation requires non-trivial algorithmic changes.
```

PASS / REVISE / FAIL

The verdict is always one of three values:

  • PASS — artifact meets the bar, proceed
  • REVISE — specific issues identified, fix and re-submit
  • FAIL — fundamental problems, reconsider the approach
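
The three-way verdict maps directly to the orchestrator's next action. A small Python sketch of that dispatch (the handler strings are hypothetical; only the three verdict values come from this page):

```python
from enum import Enum

class Verdict(Enum):
    PASS = "PASS"
    REVISE = "REVISE"
    FAIL = "FAIL"

def next_step(verdict: Verdict) -> str:
    # One action per verdict; an unknown value raises KeyError by design.
    return {
        Verdict.PASS: "proceed to next phase",
        Verdict.REVISE: "fix identified issues and re-submit",
        Verdict.FAIL: "reconsider the approach",
    }[verdict]

print(next_step(Verdict.REVISE))  # prints fix identified issues and re-submit
```

Using an enum rather than raw strings makes an unrecognized verdict fail loudly instead of silently falling through.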

Three-Model Paper Review

Paper review is the Judge's most complex task. It orchestrates a three-model review panel to simulate peer review.

```mermaid
graph TD
    O[Orchestrator] -->|"paper draft"| R[Review Process]
    R --> C[Codex Review<br/>codex exec]
    R --> CL[Claude Review<br/>sub-agent]
    R --> G[Gemini Review<br/>tmux worker]

    C -->|"review.yaml"| M[Meta-Review]
    CL -->|"review.yaml"| M
    G -->|"review.yaml"| M

    M --> V[Aggregated Verdict]

    style R fill:#dbeafe,stroke:#2563eb
    style C fill:#fef3c7,stroke:#d97706
    style CL fill:#ede9fe,stroke:#7c3aed
    style G fill:#ecfdf5,stroke:#059669
    style V fill:#fef3c7,stroke:#d97706
```

Each Reviewer's Focus

| Reviewer | LLM | Focus Areas |
|---|---|---|
| Codex | Codex (GPT) | Technical correctness, experimental design, reproducibility |
| Claude | Claude Opus | Clarity of writing, strength of arguments, novelty claims |
| Gemini | Gemini | Related work completeness, positioning, broader impact |

Review Output Format

Each reviewer produces a structured review:

```yaml
# reviews/paper_reviews/codex_review.yaml
reviewer: codex
overall: WEAK_ACCEPT

strengths:
  - "Clear experimental setup with strong baselines"
  - "Ablation study is thorough"
  - "Code will be released (reproducibility)"

weaknesses:
  - "Missing comparison with Mamba (concurrent work)"
  - "Wall-clock time not reported, only throughput"
  - "Error bars missing from Table 2"

questions:
  - "How does performance scale beyond 32k sequence length?"
  - "What is the training time compared to standard Transformer?"

suggestions:
  - "Add Mamba comparison in Table 1"
  - "Report wall-clock training time"
  - "Add standard deviation to all reported numbers"

confidence: 3  # 1-5 scale
```

Meta-Review

The Orchestrator aggregates the three reviews into a meta-review:

```yaml
# reviews/meta_review.yaml
verdict: REVISE
consensus: 2/3 accept, 1/3 borderline

critical_issues:
  - "Missing Mamba comparison (raised by 2/3 reviewers)"
  - "No error bars (raised by all reviewers)"

revision_priorities:
  1: "Add Mamba baseline experiment"
  2: "Re-run experiments with 3 seeds, add error bars"
  3: "Report wall-clock training time"
```

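One way to reduce three `overall` ratings to a meta-verdict is to rank them on an accept scale and vote. The ranking and thresholds below are illustrative assumptions; in AutoResearch the actual aggregation is performed by the Orchestrator's own reasoning, not a fixed formula.

```python
# Ordinal ranking of reviewer ratings; WEAK_ACCEPT and above count as accepts.
RANK = {"REJECT": 0, "WEAK_REJECT": 1, "BORDERLINE": 2,
        "WEAK_ACCEPT": 3, "ACCEPT": 4}

def meta_verdict(overalls: list[str]) -> str:
    accepts = sum(RANK[o] >= RANK["WEAK_ACCEPT"] for o in overalls)
    if accepts == len(overalls):
        return "PASS"          # unanimous accept
    if accepts >= len(overalls) // 2 + 1:
        return "REVISE"        # majority leans accept, issues remain
    return "FAIL"

# 2/3 accept, 1/3 borderline — matches the meta-review example above.
print(meta_verdict(["WEAK_ACCEPT", "ACCEPT", "BORDERLINE"]))  # prints REVISE
```
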
Why three models instead of one?

A single reviewer has blind spots. Codex might miss writing quality issues. Claude might not catch a subtle experimental flaw. Gemini might miss a recent related paper. Three models with different training data and reasoning styles provide broader coverage — similar to how real peer review uses multiple reviewers.
