UX Judge

Methodology

How designs were evaluated and how model conclusions were derived. Judge model: Composer 2.5.

Pipeline

  1. Isolated workspace created; benchmark HTML served read-only.
  2. Composer 2.5 runs one session per generation model. Each design is opened in a browser, captured as screenshots at desktop (1440×900) and mobile (390×844) viewports, and scored on 10 weighted criteria.
  3. Per-model run files merged; cross-model rank assigned per variation.
  4. Composer 2.5 reads all judge write-ups and produces model-level conclusions (no re-scoring).
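
Step 3 can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the `JudgedDesign` shape, the `crossModelRank` field name, and the tie rule (equal scores share a rank, as in 1, 2, 2, 4) are assumptions, since the source only says that a cross-model rank is assigned per variation.

```typescript
// Hypothetical shape of one merged entry; only modelId and overallScore
// are names that appear in the prompts quoted on this page.
interface JudgedDesign {
  modelId: string;
  variation: number; // 1–10
  overallScore: number; // 0–100
}

interface RankedDesign extends JudgedDesign {
  crossModelRank: number; // assumed field name
}

// Group merged entries by variation, then rank models within each group
// by overallScore, highest first. Ties share the better rank (assumption).
function assignCrossModelRank(entries: JudgedDesign[]): RankedDesign[] {
  const byVariation: Record<string, JudgedDesign[]> = {};
  for (const entry of entries) {
    const key = String(entry.variation);
    (byVariation[key] = byVariation[key] || []).push(entry);
  }

  const ranked: RankedDesign[] = [];
  for (const key of Object.keys(byVariation)) {
    const sorted = byVariation[key]
      .slice()
      .sort((a, b) => b.overallScore - a.overallScore);
    let rank = 0;
    sorted.forEach((entry, index) => {
      // Advance the rank only when the score differs from the previous entry.
      if (index === 0 || entry.overallScore !== sorted[index - 1].overallScore) {
        rank = index + 1;
      }
      ranked.push({ ...entry, crossModelRank: rank });
    });
  }
  return ranked;
}
```

Ranking per variation rather than globally means each model is compared only against other models' designs for that variation.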

Scoring rubric

Each design receives a score of 0–10 on each of ten criteria. The weights below sum to 100%, so the overall score is the weighted average of the ten scores scaled to 0–100 and rounded. Verdict thresholds: ship ≥ 82, iterate 65–81, reject < 65.

  • Visual Identity — weight 12%
  • Typography & Hierarchy — weight 10%
  • Layout & Composition — weight 12%
  • Color & Contrast — weight 8%
  • Content & Copy — weight 10%
  • Interaction & CTAs — weight 10%
  • Brand Fit — weight 10%
  • Section Completeness — weight 10%
  • Polish & Craft — weight 10%
  • Mobile Readiness — weight 8%
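
The composite and verdict can be computed as in the sketch below. The weights and thresholds come from the rubric above; the camelCase criterion keys are illustrative names, not necessarily those used in src/data/ux-judgments.ts, and `Math.round` (which rounds halves up) stands in for the unspecified rounding rule.

```typescript
// Weights in percent, copied from the rubric; they sum to 100.
const WEIGHTS_PCT = {
  visualIdentity: 12,
  typographyHierarchy: 10,
  layoutComposition: 12,
  colorContrast: 8,
  contentCopy: 10,
  interactionCtas: 10,
  brandFit: 10,
  sectionCompleteness: 10,
  polishCraft: 10,
  mobileReadiness: 8,
} as const;

type Criterion = keyof typeof WEIGHTS_PCT;
type Scores = Record<Criterion, number>; // each criterion scored 0–10
type Verdict = "ship" | "iterate" | "reject";

// sum(weight% × score) ranges over 0–1000; dividing by 10 gives 0–100.
function overallScore(scores: Scores): number {
  let total = 0;
  for (const key of Object.keys(WEIGHTS_PCT) as Criterion[]) {
    total += WEIGHTS_PCT[key] * scores[key];
  }
  return Math.round(total / 10);
}

function verdictFor(score: number): Verdict {
  if (score >= 82) return "ship";
  if (score >= 65) return "iterate";
  return "reject";
}

const example: Scores = {
  visualIdentity: 9,
  typographyHierarchy: 8,
  layoutComposition: 9,
  colorContrast: 7,
  contentCopy: 8,
  interactionCtas: 8,
  brandFit: 9,
  sectionCompleteness: 8,
  polishCraft: 8,
  mobileReadiness: 7,
};
// Weighted total is 818, so the composite is round(81.8) = 82.
console.log(overallScore(example), verdictFor(overallScore(example))); // prints: 82 ship
```

Keeping the weights as integer percentages avoids accumulating floating-point error when the per-criterion scores are whole numbers.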

Individual judge prompt

You are a UX design evaluator for ui-bench.

## Scope
Judge ONE generation model at a time. For the assigned modelId and categoryId, score EVERY design:
- all variations (1–10)
- all harness and skill combinations present on disk

## Inputs (read-only)
- Category brief from bench/prompts.ts
- Rubric: 10 weighted criteria in src/data/ux-judgments.ts (0–10 each)
- Design lenses: clarity, hierarchy, honesty, intentionality, familiarity, conversion, craft, antiSlop, inclusive
- Gate question: "Would a design lead approve this for a real launch?"

## Procedure
1. Serve benchmark HTML from the isolated workspace (read-only sources).
2. For each design file:
   a. Open in agent-browser
   b. Wait for network idle
   c. Capture desktop screenshot (1440×900)
   d. Capture DOM/accessibility snapshot
   e. Scroll full page; capture mobile screenshot (390×844)
3. Score all 10 criteria with a one-sentence note citing what you observed.
4. Compute overallScore = round(weighted sum × 10), range 0–100.
5. Assign verdict: ship ≥ 82 · iterate 65–81 · reject < 65.
6. Write structured comments: summary, strengths[], weaknesses[], designerNotes.
7. Rank designs within this model (rankWithinModel).

## Output
Write ONE run file per model: runs/{modelId}__{categoryId}.json
Include screenshot paths, criteria scores, comments, judgedAt timestamps.
Do NOT assign cross-model rank. Do NOT modify source HTML.
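
For readers consuming the judge's output, the run file this prompt asks for might be typed roughly as below. This is a sketch under assumptions: the prompt names the comment fields, `overallScore`, `rankWithinModel`, and `judgedAt`, but the exact nesting, the `screenshots` keys, and the `RunFile` and `RunEntry` type names are hypothetical.

```typescript
type Verdict = "ship" | "iterate" | "reject";

interface CriterionScore {
  score: number; // 0–10
  note: string; // one-sentence observation
}

// Hypothetical layout of a single judged design inside a run file.
interface RunEntry {
  variation: number; // 1–10
  screenshots: { desktop: string; mobile: string }; // assumed key names
  criteria: Record<string, CriterionScore>;
  overallScore: number; // 0–100
  verdict: Verdict;
  comments: {
    summary: string;
    strengths: string[];
    weaknesses: string[];
    designerNotes: string;
  };
  rankWithinModel: number;
  judgedAt: string; // timestamp
}

interface RunFile {
  modelId: string;
  categoryId: string;
  entries: RunEntry[]; // assumed container name
}

// Builds the path pattern given in the prompt: runs/{modelId}__{categoryId}.json
function runFilePath(modelId: string, categoryId: string): string {
  return `runs/${modelId}__${categoryId}.json`;
}
```

Note that the double-underscore separator is only unambiguous if neither identifier contains `__` itself.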

Consolidator prompt

Analyze src/data/ux-judgments.json thoroughly.

This file contains per-design UX scores for landing-page designs from multiple generation models.
Each entry has: overallScore, verdict, criteria scores with notes, comments (summary, strengths, weaknesses, designerNotes).

Your job: Read ALL entries and produce a structured meta-analysis synthesis.
Do NOT re-score. Synthesize what the judges said across models and aspects.

Return JSON with this structure:
{
  "generatedAt": "ISO date",
  "modelsAnalyzed": ["..."],
  "totalEntries": number,
  "byModel": {
    "modelId": {
      "personality": "1–2 sentences on this model's design DNA",
      "consistentStrengths": ["..."],
      "consistentWeaknesses": ["..."],
      "skillDelta": "how skill variants changed outcomes vs baseline, if applicable",
      "bestMoment": "which variation/entry and why",
      "worstMoment": "which variation/entry and why",
      "judgeVerdict": "1 paragraph conclusive summary"
    }
  }
}

Be specific. Cite actual patterns from judge comments (empty sections, emoji heroes, purple gradients, editorial minimalism, etc.).
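
The JSON structure requested above maps directly onto a TypeScript type, and a small runtime guard can catch malformed consolidator output before it is rendered. The field names below are taken from the prompt; the `Synthesis` and `ModelSynthesis` type names and the `isSynthesis` guard are illustrative, not part of the benchmark's code.

```typescript
interface ModelSynthesis {
  personality: string;
  consistentStrengths: string[];
  consistentWeaknesses: string[];
  skillDelta: string;
  bestMoment: string;
  worstMoment: string;
  judgeVerdict: string;
}

interface Synthesis {
  generatedAt: string; // ISO date
  modelsAnalyzed: string[];
  totalEntries: number;
  byModel: Record<string, ModelSynthesis>;
}

const STRING_FIELDS = [
  "personality", "skillDelta", "bestMoment", "worstMoment", "judgeVerdict",
];
const LIST_FIELDS = ["consistentStrengths", "consistentWeaknesses"];

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

// Structural check only: it verifies field presence and types, not content.
function isSynthesis(value: unknown): value is Synthesis {
  if (!isRecord(value)) return false;
  if (typeof value.generatedAt !== "string") return false;
  if (!isStringArray(value.modelsAnalyzed)) return false;
  if (typeof value.totalEntries !== "number") return false;
  const byModel = value.byModel;
  if (!isRecord(byModel)) return false;
  return Object.keys(byModel).every((modelId) => {
    const model = byModel[modelId];
    return (
      isRecord(model) &&
      STRING_FIELDS.every((field) => typeof model[field] === "string") &&
      LIST_FIELDS.every((field) => isStringArray(model[field]))
    );
  });
}
```

A caller would typically run `isSynthesis(JSON.parse(raw))` on the model's response and reject or retry when it returns false.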