Visual Reasoning · VLM Evaluation · Multi-Model Benchmark

Can VLMs really compare images?

CompareBench systematically evaluates frontier vision-language models on six distinct visual comparison tasks — from counting objects to reasoning about historical time.

1,100 Comparison Pairs
2,000 Counting Images
6 Task Categories
8+ Models Tested
Why This Matters

VLMs are getting stronger — but how do we really measure them?

Existing benchmarks often test recognition or captioning. We ask a harder question: can models reason about differences across images?

📊

Counting is underrated

Most benchmarks skip precise object counting. TallyBench fills this gap with 2,000 images across 50+ categories, from dogs to electronics.

🔍

Comparison is harder than classification

Asking "which has more?" or "which is older?" requires cross-image reasoning — a fundamentally different skill from describing a single image.

Temporal reasoning is rarely tested

Can a model tell which historical photograph was taken earlier? Which landmark is older? We test this with date-grounded image pairs.

⚖️

Models need fair, diverse evaluation

We test GPT, Claude, Gemini, and Grok under identical conditions across the same tasks — apples-to-apples at scale.

How It Works

The 2×2 Grid Format

We present models with a 1600×1600 px composite of four images labeled A–D, then ask a targeted comparison question.

🖼️ 4 Images: selected from the benchmark dataset
🔲 2×2 Grid: A / B / C / D labeled composite
Question: "Which has the most X?"
🤖 VLM Answer: a single letter, A / B / C / D
Score: exact match vs. ground truth
A: 🐕🐕🐕
B: 🐕🐕🐕🐕🐕🐕🐕
C: 🐕🐕
D: 🐕🐕🐕🐕🐕
Which image contains the most dogs?
A — 3 dogs
B — 7 dogs ✓
C — 2 dogs
D — 5 dogs

The same format is used across all six task types — only the question and image source change.
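As a rough illustration, the grid layout and scoring steps above might look like the following. This is a minimal sketch under stated assumptions: the function names, the reading-order A–D placement, and the answer-parsing rule (first standalone A–D letter in the reply) are our illustrations, not CompareBench's published code.

```python
import re

CANVAS = 1600        # the composite is 1600x1600 px, per the format above
CELL = CANVAS // 2   # so each of the four images fills an 800x800 quadrant

def grid_offsets(labels=("A", "B", "C", "D")):
    """Paste offsets (left, top) for a 2x2 grid, labels in reading order."""
    return {lab: ((i % 2) * CELL, (i // 2) * CELL)
            for i, lab in enumerate(labels)}

def score_answer(model_reply: str, ground_truth: str) -> bool:
    """Exact-match scoring: extract the first standalone A-D letter from
    the model's reply and compare it to the ground-truth label."""
    m = re.search(r"\b([A-D])\b", model_reply.upper())
    return m is not None and m.group(1) == ground_truth.upper()

offsets = grid_offsets()
# offsets == {"A": (0, 0), "B": (800, 0), "C": (0, 800), "D": (800, 800)}
print(score_answer("B", "B"))                 # True
print(score_answer("The answer is C.", "B"))  # False
```

With an image library such as Pillow, each source image would then be resized to 800×800 and pasted at its label's offset before the letter overlay is drawn.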

What We Test

Six Comparison Tasks

Each sub-benchmark isolates a specific visual reasoning ability.

CompareTallyBench (600 pairs) · Counting
Most / least objects — 51 categories (animals, food, electronics, people…)

CompareGeometryBench (200 pairs) · Geometry
Longest / shortest / thinnest / widest / smallest diameter

CompareSpatialBench (100 pairs) · Spatial
Deepest / highest — depth & vertical height comparisons

CompareHistBench (100 pairs) · Temporal
Earliest / latest — historical photographs dated 1707–2025

CompareCelebrityBench (100 pairs) · Temporal
Oldest / youngest — notable people ranked by birth year (1643–2007)

CompareLandmarkBench (100 pairs) · Temporal
Oldest / newest — world landmarks ranked by construction date (80–2024)
Key Findings

How Do Models Compare?

Frontier models vary significantly across tasks — strong on temporal reasoning, weaker on precise counting.

96%
GPT-5 on CompareLandmarkBench — best single result across all tasks
82%
GPT-5 on CompareTallyBench — counting comparison is harder
75.6%
GPT-5.2 on TallyBench — best single-image counting performance
49.5%
Grok-4 on TallyBench — near random baseline on counting

CompareBench — Tally task

GPT-5: 82.2%
claude-opus-4-6: 72.3%
xAI grok-4: 67.8%
claude-haiku-4-5: 46.2%

CompareBench — Landmark task

GPT-5: 96.0%
xAI grok-4: 91.0%
claude-haiku-4-5: 84.0%
claude-opus-4-6: 79.0%

TallyBench — Single-image counting (2,000 samples)

GPT-5.2: 75.60%
GPT-5: 73.70%
GPT-5.1: 73.20%
claude-sonnet-4-6: 68.75%
claude-sonnet-4-5: 67.75%
claude-opus-4-6: 66.05%
claude-opus-4-5: 61.30%
claude-haiku-4-5: 58.65%
xAI grok-4 *: 49.50%

* Partial run — 1,406 / 2,000 samples.

The Dataset

Built on OmniCaps + TallyBench

CompareBench's temporal tasks are powered by OmniCaps — a date-keyed caption dataset covering history, landmarks, and notable people.

📜 HistCaps: 516 images (1707–2025). Historical events, each image named by the event date.
🏛️ LandmarkCaps: 100 images (80–2024 AD). World landmarks named by construction or opening date.
🌟 CelebrityCaps: 100 images (1643–2007). Notable people named by birth date.

TallyBench — 2,000 Counting Images

A diverse mix of real photographs, AI-generated synthetic scenes, and curated objects.

66% Real Photos (1,325 images)
22% Synthetic, FLUX.1-dev (435 images)
12% Artificial / Curated (240 images)

Models Evaluated

GPT-5 / 5.1 / 5.2
Claude Haiku / Sonnet / Opus 4-5
Claude Sonnet / Opus 4-6
Gemini 2.5 Pro Preview
xAI Grok-4
Qwen2.5-VL 3B / 7B / 32B / 72B