CompareBench
Visual Reasoning
VLM Evaluation
Multi-Model Benchmark
Can VLMs really compare images?
CompareBench systematically evaluates frontier vision-language models on six distinct visual comparison tasks — from counting objects to reasoning about historical time.
Why This Matters
VLMs are getting stronger — but how do we really measure them?
Existing benchmarks often test recognition or captioning. We ask a harder question: can models reason about differences across images?
📊
Counting is underrated
Most benchmarks skip precise object counting. TallyBench fills this gap with 2,000 images across 50+ categories, from dogs to electronics.
🔍
Comparison is harder than classification
Asking "which has more?" or "which is older?" requires cross-image reasoning — a fundamentally different skill from describing a single image.
⏳
Temporal reasoning is rarely tested
Can a model tell which historical photograph was taken earlier? Which landmark is older? We test this with date-grounded image pairs.
⚖️
Models need fair, diverse evaluation
We test GPT, Claude, Gemini, and Grok under identical conditions across the same tasks — apples-to-apples at scale.
How It Works
The 2×2 Grid Format
We present models with a 1600×1600 px composite of four images labeled A–D, then ask a targeted comparison question.
🖼️
4 Images
Selected from benchmark dataset
→
🔲
2×2 Grid
A / B / C / D labeled composite
→
❓
Question
"Which has the most X?"
→
🤖
VLM Answer
Single letter: A / B / C / D
→
✅
Score
Exact match vs. ground truth
Which image contains the most dogs?
A — 3 dogs
B — 7 dogs ✓
C — 2 dogs
D — 5 dogs
The same format is used across all six task types — only the question and image source change.
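A minimal sketch of this protocol in Python, assuming Pillow for the compositing; the function names, label styling, and answer-parsing regex below are illustrative assumptions, not the released evaluation code.

```python
# Sketch of the CompareBench protocol: tile four images into a labeled
# 1600x1600 composite, then score a single-letter answer by exact match.
import re
from PIL import Image, ImageDraw

def make_composite(image_paths, size=1600):
    """Paste four images into a 2x2 grid labeled A-D, left-to-right, top-to-bottom."""
    cell = size // 2
    canvas = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(canvas)
    for i, (path, label) in enumerate(zip(image_paths, "ABCD")):
        img = Image.open(path).convert("RGB").resize((cell, cell))
        x, y = (i % 2) * cell, (i // 2) * cell
        canvas.paste(img, (x, y))
        draw.text((x + 16, y + 16), label, fill="red")  # corner label for the panel
    return canvas

def score(model_reply, ground_truth):
    """Extract the first standalone A-D letter from the reply and exact-match it."""
    match = re.search(r"\b([ABCD])\b", model_reply.upper())
    return int(match is not None and match.group(1) == ground_truth)
```

A question such as "Which image contains the most dogs?" is then sent alongside the composite, and score(reply, "B") returns 1 only if the model names panel B.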
What We Test
Six Comparison Tasks
Each sub-benchmark isolates a specific visual reasoning ability.
600
CompareTallyBench
Most / least objects — 51 categories (animals, food, electronics, people…)
Counting
200
CompareGeometryBench
Longest / shortest / thinnest / widest / smallest diameter
Geometry
100
CompareSpatialBench
Deepest / highest — depth & vertical height comparisons
Spatial
100
CompareHistBench
Earliest / latest — historical event images dated 1707–2025
Temporal
100
CompareCelebrityBench
Oldest / youngest — notable people ranked by birth year (1643–2007)
Temporal
100
CompareLandmarkBench
Oldest / newest — world landmarks ranked by construction date (80–2024)
Temporal
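Because the protocol is uniform, each sub-benchmark reduces to an image pool plus a question template. A hypothetical registry along those lines; the wording paraphrases the task descriptions above and is not the exact prompt set.

```python
# Hypothetical task registry: image source + question template per sub-benchmark.
# Templates paraphrase the task descriptions; the actual prompts may differ.
TASKS = {
    "CompareTallyBench":     ("tally",     "Which image has the most/fewest {category}?"),
    "CompareGeometryBench":  ("geometry",  "Which object is the longest/shortest/thinnest/widest?"),
    "CompareSpatialBench":   ("spatial",   "Which scene is the deepest/highest?"),
    "CompareHistBench":      ("histcaps",  "Which image shows the earliest/latest event?"),
    "CompareCelebrityBench": ("celebcaps", "Who is the oldest/youngest?"),
    "CompareLandmarkBench":  ("landmarks", "Which landmark is the oldest/newest?"),
}
```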
Key Findings
How Do Models Compare?
Frontier models vary significantly across tasks — strong on temporal reasoning, weaker on precise counting.
96%
GPT-5 on CompareLandmarkBench — best single result across all tasks
82%
GPT-5 on CompareTallyBench — counting comparison is harder
75.6%
GPT-5.2 on TallyBench — best single-image counting performance
49.5%
Grok-4 on TallyBench — near random baseline on counting
Result charts: CompareBench Tally task, CompareBench Landmark task, and TallyBench single-image counting (2,000 samples; * partial run, 1,406 / 2,000 samples).
The Dataset
Built on OmniCaps + TallyBench
CompareBench's temporal tasks are powered by OmniCaps — a date-keyed caption dataset covering history, landmarks, and notable people.
📜
HistCaps
516
1707 – 2025
Historical events, each image named by the event date
🏛️
LandmarkCaps
100
80 – 2024 AD
World landmarks named by construction or opening date
🌟
CelebrityCaps
100
1643 – 2007
Notable people named by birth date
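Since OmniCaps images are keyed by date, the temporal tasks can read ground truth straight from filenames. A minimal sketch, assuming names that begin with the date (e.g. "1969-07-20_moon_landing.jpg"); the real naming scheme may differ.

```python
# Sketch: derive temporal ground truth from date-keyed OmniCaps filenames.
# Assumes the filename starts with the year (e.g. "0080_colosseum.jpg" -> 80);
# the actual naming convention is an assumption here.
from pathlib import Path

def leading_year(path):
    """Extract the year prefix of a date-keyed filename."""
    return int(Path(path).stem.split("_")[0].split("-")[0])

def earliest_label(image_paths):
    """Return the grid label (A-D) of the image with the earliest date."""
    years = [leading_year(p) for p in image_paths]
    return "ABCD"[years.index(min(years))]
```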
TallyBench — 2,000 Counting Images
A diverse mix of real photographs, AI-generated synthetic scenes, and curated objects.
66%
Real Photos 1,325 images
22%
Synthetic (FLUX.1-dev) 435 images
12%
Artificial / Curated 240 images
Models Evaluated
Claude Haiku / Sonnet / Opus 4-5
Qwen2.5-VL 3B / 7B / 32B / 72B