Judge

The Judge is the independent evaluator. It reviews artifacts — ideas, code, results, papers — and renders structured verdicts. Its independence is the cornerstone of AutoResearch's quality assurance.

Identity

| Property | Value |
|---|---|
| LLM | Codex (GPT) |
| Invocation | `codex exec --skip-git-repo-check -m gpt-5.4 "prompt"` |
| Lifecycle | Stateless — no memory between invocations |
| Context | Only the artifact being evaluated + evaluation criteria |
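
A stateless invocation can be sketched in Python by wrapping the `codex exec` command from the table above in a subprocess call. The prompt layout and function names here are illustrative assumptions, not AutoResearch's actual code; only the command itself comes from the Identity table.

```python
import subprocess

def build_cmd(prompt: str) -> list[str]:
    # Mirrors the invocation shown in the Identity table.
    return ["codex", "exec", "--skip-git-repo-check", "-m", "gpt-5.4", prompt]

def judge(artifact: str, criteria: str) -> str:
    # Only the artifact and the evaluation criteria reach the model;
    # nothing from the orchestration session is included in the prompt.
    prompt = (
        "Evaluate the artifact below against the criteria.\n\n"
        f"{artifact}\n\n{criteria}"
    )
    result = subprocess.run(
        build_cmd(prompt), capture_output=True, text=True, check=True
    )
    return result.stdout  # expected to be a structured YAML verdict
```

Because each call spawns a fresh process with a self-contained prompt, no state can leak between evaluations.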

Independence Rules

**These rules are critical.** The Judge's independence is not a nice-to-have; it is an architectural invariant enforced by the `omc-orchestrator` hook. Violating independence invalidates the evaluation.

| Rule | Description |
|---|---|
| No creation history | The Judge never sees how an artifact was created |
| No agent context | The Judge doesn't know what the Coder struggled with |
| No Orchestrator reasoning | The Judge doesn't know why the Orchestrator chose this direction |
| Stateless invocation | Each `codex exec` call starts from zero — no memory of previous evaluations |
| Cross-LLM for self-produced | When evaluating Codex-produced code, Claude (the Orchestrator) also reviews it |
| Structured output only | Verdicts are YAML, not prose — forces concrete evaluation |

Why stateless?

Statefulness creates bias. If the Judge remembered evaluating an earlier version of the same idea, it might be anchored to its previous assessment. Stateless invocation means every evaluation is fresh, based solely on the artifact's merits.

Evaluation Tasks

| Task | When | What It Receives |
|---|---|---|
| Idea review | After ideation | Idea description + evaluation criteria |
| Code review | After implementation | Code + design spec + test results |
| Result evaluation | After training | Results + expected baselines + metrics |
| Paper review | After writing | Paper draft + venue criteria |
| Revision check | After revision | Original reviews + revised draft |

Idea Review

Idea review is the Judge's most nuanced task: it assesses each idea across five dimensions.

Five Dimensions

| Dimension | Question | Scale |
|---|---|---|
| Novelty | Has this been done before? How different is it? | 1-10 |
| Feasibility | Can this be implemented with available resources? | 1-10 |
| Verifiability | Can the claims be empirically validated? | 1-10 |
| Attack Surface | What are the obvious failure modes and criticisms? | List |
| Impact | If it works, how significant is the contribution? | 1-10 |

What the Judge Receives

```yaml
# Input to codex exec
idea:
  title: "Flash-Recurrent Attention"
  description: |
    Combine flash attention's IO-aware tiling with RetNet's
    recurrent formulation. Apply flash attention's memory-efficient
    forward/backward pass to the retention mechanism.
  expected_benefit: "O(n) memory + hardware efficiency of flash attention"

constraints:
  gpu: "4x A100 80GB"
  timeline: "2 weeks implementation"
  venue: "ICML 2025"

evaluation_criteria:
  novelty_bar: 6        # Minimum novelty score to pass
  feasibility_bar: 7    # Minimum feasibility score to pass
```
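
The `*_bar` thresholds above gate the verdict: a dimension that falls below its bar blocks a PASS. A minimal Python sketch of that check, assuming a flat score dict and the `_bar` naming convention shown above (the function itself is illustrative):

```python
def apply_bars(scores: dict, criteria: dict) -> str:
    """Return PASS if every *_bar in criteria is met, else REVISE."""
    for key, minimum in criteria.items():
        if not key.endswith("_bar"):
            continue
        dimension = key.removesuffix("_bar")  # e.g. "novelty_bar" -> "novelty"
        if scores.get(dimension, 0) < minimum:
            return "REVISE"
    return "PASS"

print(apply_bars({"novelty": 7, "feasibility": 8},
                 {"novelty_bar": 6, "feasibility_bar": 7}))  # prints PASS
```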

What the Judge Does NOT Receive

  • Who proposed the idea
  • What other ideas were considered
  • The Orchestrator's preference
  • Scout's search process
  • Any previous Judge evaluations

Structured Verdict Output

All verdicts are structured YAML. No free-form prose for the verdict itself (though each dimension includes a brief justification).

```yaml
# reviews/idea_review.yaml
verdict: PASS
confidence: 0.75

scores:
  novelty:
    score: 7
    justification: |
      Flash attention + recurrent attention combination is unexplored.
      Both components are well-known but their integration is novel.
  feasibility:
    score: 8
    justification: |
      Both flash attention and RetNet have open-source implementations.
      Integration is engineering work, not research risk.
  verifiability:
    score: 9
    justification: |
      Standard LM benchmarks. Throughput and memory are directly measurable.
      Perplexity comparison with baselines is straightforward.
  impact:
    score: 6
    justification: |
      Incremental improvement on efficiency. Useful but not paradigm-shifting.
      Would be a solid workshop paper; borderline for main conference.

attack_surface:
  - "Flash attention's tiling may not be compatible with retention's decay pattern"
  - "Memory savings may be marginal if the recurrent state is large"
  - "Reviewer may argue this is 'just engineering' rather than a research contribution"

recommendation: |
  Proceed with implementation. Address attack surface point 1 early —
  if tiling is incompatible, the idea may need fundamental revision.
  Strengthen the 'research contribution' angle by showing the tiling
  adaptation requires non-trivial algorithmic changes.
```

PASS / REVISE / FAIL

The verdict is always one of three values:

  • PASS — artifact meets the bar, proceed
  • REVISE — specific issues identified, fix and re-submit
  • FAIL — fundamental problems, reconsider the approach
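
The three-way verdict maps directly to the orchestrator's next action. A small Python sketch of that dispatch (the handler strings are hypothetical; only the three verdict values come from this page):

```python
from enum import Enum

class Verdict(Enum):
    PASS = "PASS"
    REVISE = "REVISE"
    FAIL = "FAIL"

def next_step(verdict: Verdict) -> str:
    # One action per verdict; an unknown value raises KeyError by design.
    return {
        Verdict.PASS: "proceed to next phase",
        Verdict.REVISE: "fix identified issues and re-submit",
        Verdict.FAIL: "reconsider the approach",
    }[verdict]

print(next_step(Verdict.REVISE))  # prints fix identified issues and re-submit
```

Using an enum rather than raw strings makes an unrecognized verdict fail loudly instead of silently falling through.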

Three-Model Paper Review

Paper review is the Judge's most complex task. It orchestrates a three-model review panel to simulate peer review.

```mermaid
graph TD
    O[Orchestrator] -->|"paper draft"| R[Review Process]
    R --> C[Codex Review<br/>codex exec]
    R --> CL[Claude Review<br/>sub-agent]
    R --> G[Gemini Review<br/>tmux worker]

    C -->|"review.yaml"| M[Meta-Review]
    CL -->|"review.yaml"| M
    G -->|"review.yaml"| M

    M --> V[Aggregated Verdict]

    style R fill:#dbeafe,stroke:#2563eb
    style C fill:#fef3c7,stroke:#d97706
    style CL fill:#ede9fe,stroke:#7c3aed
    style G fill:#ecfdf5,stroke:#059669
    style V fill:#fef3c7,stroke:#d97706
```

Each Reviewer's Focus

| Reviewer | LLM | Focus Areas |
|---|---|---|
| Codex | Codex (GPT) | Technical correctness, experimental design, reproducibility |
| Claude | Claude Opus | Clarity of writing, strength of arguments, novelty claims |
| Gemini | Gemini | Related work completeness, positioning, broader impact |

Review Output Format

Each reviewer produces a structured review:

```yaml
# reviews/paper_reviews/codex_review.yaml
reviewer: codex
overall: WEAK_ACCEPT

strengths:
  - "Clear experimental setup with strong baselines"
  - "Ablation study is thorough"
  - "Code will be released (reproducibility)"

weaknesses:
  - "Missing comparison with Mamba (concurrent work)"
  - "Wall-clock time not reported, only throughput"
  - "Error bars missing from Table 2"

questions:
  - "How does performance scale beyond 32k sequence length?"
  - "What is the training time compared to standard Transformer?"

suggestions:
  - "Add Mamba comparison in Table 1"
  - "Report wall-clock training time"
  - "Add standard deviation to all reported numbers"

confidence: 3  # 1-5 scale
```

Meta-Review

The Orchestrator aggregates the three reviews into a meta-review:

```yaml
# reviews/meta_review.yaml
verdict: REVISE
consensus: 2/3 accept, 1/3 borderline

critical_issues:
  - "Missing Mamba comparison (raised by 2/3 reviewers)"
  - "No error bars (raised by all reviewers)"

revision_priorities:
  1: "Add Mamba baseline experiment"
  2: "Re-run experiments with 3 seeds, add error bars"
  3: "Report wall-clock training time"
```

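One way to reduce three `overall` ratings to a meta-verdict is to rank them on an accept scale and vote. The ranking and thresholds below are illustrative assumptions; in AutoResearch the actual aggregation is performed by the Orchestrator's own reasoning, not a fixed formula.

```python
# Ordinal ranking of reviewer ratings; WEAK_ACCEPT and above count as accepts.
RANK = {"REJECT": 0, "WEAK_REJECT": 1, "BORDERLINE": 2,
        "WEAK_ACCEPT": 3, "ACCEPT": 4}

def meta_verdict(overalls: list[str]) -> str:
    accepts = sum(RANK[o] >= RANK["WEAK_ACCEPT"] for o in overalls)
    if accepts == len(overalls):
        return "PASS"          # unanimous accept
    if accepts >= len(overalls) // 2 + 1:
        return "REVISE"        # majority leans accept, issues remain
    return "FAIL"

# 2/3 accept, 1/3 borderline — matches the meta-review example above.
print(meta_verdict(["WEAK_ACCEPT", "ACCEPT", "BORDERLINE"]))  # prints REVISE
```
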
Why three models instead of one?

A single reviewer has blind spots. Codex might miss writing quality issues. Claude might not catch a subtle experimental flaw. Gemini might miss a recent related paper. Three models with different training data and reasoning styles provide broader coverage — similar to how real peer review uses multiple reviewers.
