CompareBench
Visual Reasoning
VLM Evaluation
Multi-Model Benchmark
Can VLMs really compare images?
CompareBench systematically evaluates frontier vision-language models on six distinct visual comparison tasks — from counting objects to reasoning about historical time.
Why This Matters
VLMs are getting stronger — but how do we really measure them?
Existing benchmarks often test recognition or captioning. We ask a harder question: can models reason about differences across images?
📊
Counting is underrated
Most benchmarks skip precise object counting. TallyBench fills this gap with 2,000 images across 50+ categories, from dogs to electronics.
🔍
Comparison is harder than classification
Asking "which has more?" or "which is older?" requires cross-image reasoning — a fundamentally different skill from describing a single image.
⏳
Temporal reasoning is rarely tested
Can a model tell which historical photograph was taken earlier? Which landmark is older? We test this with date-grounded image pairs.
⚖️
Models need fair, diverse evaluation
We test GPT, Claude, Gemini, and Grok under identical conditions across the same tasks — apples-to-apples at scale.
How It Works
The 2×2 Grid Format
We present models with a 1600×1600 px composite of four images labeled A–D, then ask a targeted comparison question.
🖼️
4 Images
Selected from benchmark dataset
→
🔲
2×2 Grid
A / B / C / D labeled composite
→
❓
Question
"Which has the most X?"
→
🤖
VLM Answer
Single letter: A / B / C / D
→
✅
Score
Exact match vs. ground truth
Which image contains the most dogs?
A — 3 dogs
B — 7 dogs ✓
C — 2 dogs
D — 5 dogs
The same format is used across all six task types — only the question and image source change.
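A minimal sketch of this protocol in Python, assuming Pillow for the compositing; the function names, label styling, and answer-parsing regex below are illustrative assumptions, not the released evaluation code.

```python
# Sketch of the CompareBench protocol: tile four images into a labeled
# 1600x1600 composite, then score a single-letter answer by exact match.
import re
from PIL import Image, ImageDraw

def make_composite(image_paths, size=1600):
    """Paste four images into a 2x2 grid labeled A-D, left-to-right, top-to-bottom."""
    cell = size // 2
    canvas = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(canvas)
    for i, (path, label) in enumerate(zip(image_paths, "ABCD")):
        img = Image.open(path).convert("RGB").resize((cell, cell))
        x, y = (i % 2) * cell, (i // 2) * cell
        canvas.paste(img, (x, y))
        draw.text((x + 16, y + 16), label, fill="red")  # corner label for the panel
    return canvas

def score(model_reply, ground_truth):
    """Extract the first standalone A-D letter from the reply and exact-match it."""
    match = re.search(r"\b([ABCD])\b", model_reply.upper())
    return int(match is not None and match.group(1) == ground_truth)
```

A question such as "Which image contains the most dogs?" is then sent alongside the composite, and score(reply, "B") returns 1 only if the model names panel B.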
What We Test
Six Comparison Tasks
Each sub-benchmark isolates a specific visual reasoning ability.
600
CompareTallyBench
Most / least objects — 51 categories (animals, food, electronics, people…)
Counting
200
CompareGeometryBench
Longest / shortest / thinnest / widest / smallest diameter
Geometry
100
CompareSpatialBench
Deepest / highest — depth & vertical height comparisons
Spatial
100
CompareHistBench
Earliest / latest — historical event images dated 1707–2025
Temporal
100
CompareCelebrityBench
Oldest / youngest — notable people ranked by birth year (1643–2007)
Temporal
100
CompareLandmarkBench
Oldest / newest — world landmarks ranked by construction date (80–2024)
Temporal
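Because the protocol is uniform, each sub-benchmark reduces to an image pool plus a question template. A hypothetical registry along those lines; the wording paraphrases the task descriptions above and is not the exact prompt set.

```python
# Hypothetical task registry: image source + question template per sub-benchmark.
# Templates paraphrase the task descriptions; the actual prompts may differ.
TASKS = {
    "CompareTallyBench":     ("tally",     "Which image has the most/fewest {category}?"),
    "CompareGeometryBench":  ("geometry",  "Which object is the longest/shortest/thinnest/widest?"),
    "CompareSpatialBench":   ("spatial",   "Which scene is the deepest/highest?"),
    "CompareHistBench":      ("histcaps",  "Which image shows the earliest/latest event?"),
    "CompareCelebrityBench": ("celebcaps", "Who is the oldest/youngest?"),
    "CompareLandmarkBench":  ("landmarks", "Which landmark is the oldest/newest?"),
}
```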
Key Findings
How Do Models Compare?
Frontier models vary significantly across tasks — strong on temporal reasoning, weaker on precise counting.
96%
GPT-5 on CompareLandmarkBench — best single result across all tasks
82%
GPT-5 on CompareTallyBench — counting comparison is harder
75.6%
GPT-5.2 on TallyBench — best single-image counting performance
49.5%
Grok-4 on TallyBench — near random baseline on counting
Result charts: CompareBench Tally task, CompareBench Landmark task, and TallyBench single-image counting (2,000 samples; * partial run, 1,406 / 2,000 samples).
The Dataset
Built on OmniCaps + TallyBench
CompareBench's temporal tasks are powered by OmniCaps — a date-keyed caption dataset covering history, landmarks, and notable people.
📜
HistCaps
516
1707 – 2025
Historical events, each image named by the event date
🏛️
LandmarkCaps
100
80 – 2024 AD
World landmarks named by construction or opening date
🌟
CelebrityCaps
100
1643 – 2007
Notable people named by birth date
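Since OmniCaps images are keyed by date, the temporal tasks can read ground truth straight from filenames. A minimal sketch, assuming names that begin with the date (e.g. "1969-07-20_moon_landing.jpg"); the real naming scheme may differ.

```python
# Sketch: derive temporal ground truth from date-keyed OmniCaps filenames.
# Assumes the filename starts with the year (e.g. "0080_colosseum.jpg" -> 80);
# the actual naming convention is an assumption here.
from pathlib import Path

def leading_year(path):
    """Extract the year prefix of a date-keyed filename."""
    return int(Path(path).stem.split("_")[0].split("-")[0])

def earliest_label(image_paths):
    """Return the grid label (A-D) of the image with the earliest date."""
    years = [leading_year(p) for p in image_paths]
    return "ABCD"[years.index(min(years))]
```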
TallyBench — 2,000 Counting Images
A diverse mix of real photographs, AI-generated synthetic scenes, and curated objects.
66%
Real Photos 1,325 images
22%
Synthetic (FLUX.1-dev) 435 images
12%
Artificial / Curated 240 images
Models Evaluated
Claude Haiku / Sonnet / Opus 4-5
Qwen2.5-VL 3B / 7B / 32B / 72B