UX Judge
Methodology
How designs were evaluated and how model conclusions were derived. Judge model: Composer 2.5.
Pipeline
- Isolated workspace created; benchmark HTML served read-only.
- Composer 2.5 runs one session per generation model. Each design is opened in a browser, screenshotted at desktop (1440×900) and mobile (390×844) sizes, and scored on 10 weighted criteria.
- Per-model run files merged; cross-model rank assigned per variation.
- Composer 2.5 reads all judge write-ups and produces model-level conclusions (no re-scoring).
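The merge-and-rank step can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: it assumes designs are grouped by variation number only, that higher overallScore ranks first, and that ties share a rank. The `crossModelRank` field name and the `JudgedDesign` shape are hypothetical.

```typescript
// Hypothetical shape of one judged design after the per-model run files are merged.
interface JudgedDesign {
  modelId: string;
  variation: number;    // 1–10
  overallScore: number; // 0–100
}

// Hypothetical output shape; the real field name may differ.
interface RankedDesign extends JudgedDesign {
  crossModelRank: number;
}

function assignCrossModelRank(entries: JudgedDesign[]): RankedDesign[] {
  // Group entries by variation so models are compared on the same variation.
  const byVariation = new Map<number, JudgedDesign[]>();
  for (const entry of entries) {
    const group = byVariation.get(entry.variation) ?? [];
    group.push(entry);
    byVariation.set(entry.variation, group);
  }

  const ranked: RankedDesign[] = [];
  byVariation.forEach((group) => {
    // Sort a copy, highest score first.
    const sorted = group.slice().sort((a, b) => b.overallScore - a.overallScore);
    sorted.forEach((entry, i) => {
      // Assumption: tied scores share the rank of the entry before them.
      const tiedWithPrevious = i > 0 && sorted[i - 1].overallScore === entry.overallScore;
      const rank = tiedWithPrevious ? ranked[ranked.length - 1].crossModelRank : i + 1;
      ranked.push({ ...entry, crossModelRank: rank });
    });
  });
  return ranked;
}
```

If harness and skill combinations should be ranked separately, the grouping key would need to include them as well.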
Scoring rubric
Each design receives a score of 0–10 on each of ten criteria. The overall score is the weighted composite of those scores, scaled to 0–100; the weights below sum to 100%. Verdict thresholds: ship ≥ 82, iterate 65–81, reject < 65.
- Visual Identity — weight 12%
- Typography & Hierarchy — weight 10%
- Layout & Composition — weight 12%
- Color & Contrast — weight 8%
- Content & Copy — weight 10%
- Interaction & CTAs — weight 10%
- Brand Fit — weight 10%
- Section Completeness — weight 10%
- Polish & Craft — weight 10%
- Mobile Readiness — weight 8%
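The composite and verdict can be worked through in a few lines. The sketch below is an illustration under stated assumptions: the weights and thresholds come from the rubric above, while the criterion keys and the function names `overallScore` and `verdictFor` are made up for this example and may not match src/data/ux-judgments.ts.

```typescript
// Weights from the rubric, in percent. The keys are hypothetical identifiers.
const WEIGHTS = {
  visualIdentity: 12,
  typographyHierarchy: 10,
  layoutComposition: 12,
  colorContrast: 8,
  contentCopy: 10,
  interactionCtas: 10,
  brandFit: 10,
  sectionCompleteness: 10,
  polishCraft: 10,
  mobileReadiness: 8,
} as const;

type CriterionId = keyof typeof WEIGHTS;
type CriteriaScores = Record<CriterionId, number>; // each 0–10
type Verdict = "ship" | "iterate" | "reject";

function overallScore(scores: CriteriaScores): number {
  let weighted = 0;
  for (const id of Object.keys(WEIGHTS) as CriterionId[]) {
    weighted += scores[id] * WEIGHTS[id];
  }
  // Sum of (score × percent) spans 0–1000; dividing by 10 yields 0–100.
  // This equals round(weighted sum × 10) with weights taken as fractions.
  return Math.round(weighted / 10);
}

function verdictFor(score: number): Verdict {
  if (score >= 82) return "ship";
  if (score >= 65) return "iterate";
  return "reject";
}
```

For example, a design scoring 8 on every criterion lands at 80 (iterate); raising Visual Identity and Layout & Composition to 10 lifts the weighted total to 84.8, which rounds to 85 (ship).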
Individual judge prompt
You are a UX design evaluator for ui-bench.
## Scope
Judge ONE generation model at a time. For the assigned modelId and categoryId, score EVERY design:
- all variations (1–10)
- all harness and skill combinations present on disk
## Inputs (read-only)
- Category brief from bench/prompts.ts
- Rubric: 10 weighted criteria in src/data/ux-judgments.ts (0–10 each)
- Design lenses: clarity, hierarchy, honesty, intentionality, familiarity, conversion, craft, antiSlop, inclusive
- Gate question: "Would a design lead approve this for a real launch?"
## Procedure
1. Serve benchmark HTML from the isolated workspace (read-only sources).
2. For each design file:
a. Open in agent-browser
b. Wait for network idle
c. Capture desktop screenshot (1440×900)
d. Capture DOM/accessibility snapshot
e. Scroll full page; capture mobile screenshot (390×844)
3. Score all 10 criteria with a one-sentence note citing what you observed.
4. Compute overallScore = round(weighted sum × 10), range 0–100.
5. Assign verdict: ship ≥ 82 · iterate 65–81 · reject < 65.
6. Write structured comments: summary, strengths[], weaknesses[], designerNotes.
7. Rank designs within this model (rankWithinModel).
## Output
Write ONE run file per model: runs/{modelId}__{categoryId}.json
Include screenshot paths, criteria scores, comments, judgedAt timestamps.
Do NOT assign cross-model rank. Do NOT modify source HTML.
Consolidator prompt
Analyze src/data/ux-judgments.json thoroughly.
This file contains per-design UX scores for landing-page designs from multiple generation models.
Each entry has: overallScore, verdict, criteria scores with notes, comments (summary, strengths, weaknesses, designerNotes).
Your job: Read ALL entries and produce a structured meta-analysis synthesis.
Do NOT re-score. Synthesize what the judges said across models and aspects.
Return JSON with this structure:
{
"generatedAt": "ISO date",
"modelsAnalyzed": ["..."],
"totalEntries": number,
"byModel": {
"modelId": {
"personality": "1–2 sentences on this model's design DNA",
"consistentStrengths": ["..."],
"consistentWeaknesses": ["..."],
"skillDelta": "how skill variants changed outcomes vs baseline, if applicable",
"bestMoment": "which variation/entry and why",
"worstMoment": "which variation/entry and why",
"judgeVerdict": "1 paragraph conclusive summary"
}
}
}
Be specific. Cite actual patterns from judge comments (empty sections, emoji heroes, purple gradients, editorial minimalism, etc.).
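The JSON structure requested by the consolidator prompt can be mirrored as TypeScript types, with a light check on the model's response before it is used. This is only a sketch: the field names come from the prompt, but the type names and the `findProblems` helper are hypothetical, and the check covers just a few top-level fields.

```typescript
interface ModelSynthesis {
  personality: string;
  consistentStrengths: string[];
  consistentWeaknesses: string[];
  skillDelta: string;
  bestMoment: string;
  worstMoment: string;
  judgeVerdict: string;
}

interface ConsolidatorOutput {
  generatedAt: string; // ISO date
  modelsAnalyzed: string[];
  totalEntries: number;
  byModel: Record<string, ModelSynthesis>;
}

// Returns a list of human-readable problems; an empty list means the checks passed.
function findProblems(raw: string): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return ["response is not valid JSON"];
  }
  if (typeof parsed !== "object" || parsed === null) {
    return ["top level is not an object"];
  }

  const data = parsed as Partial<ConsolidatorOutput>;
  const problems: string[] = [];

  if (typeof data.generatedAt !== "string" || isNaN(Date.parse(data.generatedAt))) {
    problems.push("generatedAt is not a parseable date");
  }
  if (typeof data.totalEntries !== "number") {
    problems.push("totalEntries is not a number");
  }
  if (!Array.isArray(data.modelsAnalyzed)) {
    problems.push("modelsAnalyzed is not an array");
  }

  // Every analyzed model should have a matching byModel entry.
  const models = Array.isArray(data.modelsAnalyzed) ? data.modelsAnalyzed : [];
  const byModel =
    typeof data.byModel === "object" && data.byModel !== null ? data.byModel : {};
  for (const id of models) {
    if (!(id in byModel)) problems.push(`byModel is missing "${id}"`);
  }
  return problems;
}
```

A non-empty result would be a reasonable signal to re-run the consolidator rather than publish partial conclusions.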