CompareBench

A multi-model visual reasoning benchmark suite evaluating VLMs on side-by-side image comparison, object counting, and temporal reasoning.

- 1,100 CompareBench samples
- 2,000 TallyBench images
- 716 OmniCaps images
- 6 sub-benchmarks
- 51 object categories
- 8+ models evaluated

Project Structure

CompareBench/
├── CompareBench/      # Main 2×2 comparison benchmark (1,100 samples, 6 sub-tasks)
├── TallyBench/        # Single-image object counting (2,000 images)
├── OmniCaps/          # Multi-domain caption datasets (716 images)
│   ├── HistCaps/      # Historical events (516 images, 1707–2025)
│   ├── LandmarkCaps/  # World landmarks (100 images, 80–2024)
│   └── CelebrityCaps/ # Notable people (100 images, 1643–2007)
├── Agent_Solution/    # Agentic task packaging for automated evaluation
├── Drawing_python/    # Dataset visualization scripts
└── Drawing_pptx/      # Presentation assets and paper figures

Benchmarks at a Glance

CompareBench — Visual Comparison Reasoning

Models are shown a 1600×1600 px 2×2 grid of images and asked "which image is most / least X?" with an A/B/C/D answer.
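Scoring such items requires extracting the letter choice from free-form model output. A minimal sketch of one way to do this (the repo's actual parsing logic may differ):

```python
import re

def parse_choice(response: str):
    """Extract the first standalone A/B/C/D letter from a model response.

    Naive sketch: real harnesses usually prompt for a bare letter or a
    fixed output format rather than searching free text like this.
    """
    m = re.search(r"\b([ABCD])\b", response.strip().upper())
    return m.group(1) if m else None
```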

| Sub-benchmark | Samples | Question |
|---|---|---|
| CompareTallyBench | 600 | Most / least objects across 51 categories |
| CompareGeometryBench | 200 | Longest / shortest / thinnest / widest / smallest diameter |
| CompareSpatialBench | 100 | Deepest / highest (depth & vertical height) |
| CompareHistBench | 100 | Earliest / latest historical photograph |
| CompareCelebrityBench | 100 | Oldest / youngest celebrity by birth year |
| CompareLandmarkBench | 100 | Oldest / newest landmark by construction date |

TallyBench — Object Counting

2,000 single images across 51 fine-grained categories. Models output an integer count.

| Image Type | Count | Share |
|---|---|---|
| Real photographs | 1,325 | 66% |
| Synthetic (FLUX.1-dev) | 435 | 22% |
| Artificial / curated | 240 | 12% |
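Because models answer in free-form text, the harness has to pull an integer out of each response before scoring exact-match accuracy. A hedged sketch (the parsing rule here is an assumption, not the repo's exact logic):

```python
import re

def extract_count(response: str):
    """Pull the last integer from a free-form response (assumption: the
    final number in the text is the model's answer)."""
    nums = re.findall(r"\d+", response.replace(",", ""))
    return int(nums[-1]) if nums else None

def counting_accuracy(preds, golds):
    """Exact-match accuracy over paired predictions and ground truths."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)
```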

OmniCaps — Caption Dataset

Date-keyed image captions powering the three temporal sub-benchmarks. Hosted on Hugging Face at qiuzhangTiTi/OmniCaps.

| Split | Images | Date Range | Subject |
|---|---|---|---|
| HistCaps | 516 | 1707–2025 | Historical events |
| LandmarkCaps | 100 | 80–2024 | World landmarks |
| CelebrityCaps | 100 | 1643–2007 | Notable people (birth dates) |
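The date keys are what turn a caption set into a comparison benchmark: sample four records, composite their images into a grid, and the earliest (or latest) date determines the gold letter. A simplified sketch, with `year` as an assumed field name:

```python
import random

def build_temporal_item(records, rng=None):
    """Sample four date-keyed records and label the earliest one A-D.

    Sketch only: the real pipeline also composites the four images into
    a 2x2 grid and phrases the question text.
    """
    rng = rng or random.Random(0)
    picks = rng.sample(records, 4)
    earliest = min(range(4), key=lambda i: picks[i]["year"])
    return {"options": picks, "answer": "ABCD"[earliest]}
```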

Benchmark Results

CompareBench (A/B/C/D accuracy)

| Model | CompareTally | CompareGeo | CompareSpatial | CompareHist | CompareCelebrity | CompareLandmark |
|---|---|---|---|---|---|---|
| GPT-5 | 82.17% | 70.50% | 96.00% | | | |
| claude-opus-4-6 | 72.33% | 79.00% | | | | |
| xAI grok-4 | 67.83% | 91.00% | | | | |
| claude-haiku-4-5 | 46.17% | 84.00% | | | | |

TallyBench (counting accuracy)

| Model | Accuracy |
|---|---|
| GPT-5.2 | 75.60% |
| GPT-5 | 73.70% |
| GPT-5.1 | 73.20% |
| claude-sonnet-4-6 | 68.75% |
| claude-sonnet-4-5 | 67.75% |
| claude-opus-4-6 | 66.05% |
| claude-opus-4-5 | 61.30% |
| claude-haiku-4-5 | 58.65% |
| xAI grok-4 * | 49.50% |

* Partial run — 1,406 / 2,000 samples.

Running Evaluations

Setup

source ~/miniforge3/etc/profile.d/conda.sh && conda activate CompareBench

CompareBench

cd CompareBench/
python anthropic_HF.py   # Claude haiku / sonnet / opus (4-5, 4-6)
python openai_HF.py      # GPT-5, GPT-5.1, GPT-5.2
python gemini_HF.py      # Gemini 2.5 Pro Preview
python xAI_HF.py         # Grok-4

TallyBench

cd TallyBench/
python anthropic_HF.py
python openai_HF.py
python gemini_HF.py
python xAI_HF.py
python Other/Qwen2.5-VL_HF.py   # Qwen2.5-VL 3B/7B/32B/72B (GPU required)

Accuracy

python CompareBench/Results/acc.py
python TallyBench/Results/acc.py

Data Pipeline

CompareBench image generation

cd CompareBench/Other/
python quantity_comparison_data_json.py   # 600 2×2 grids (counting)
python temporal_comparison_data_json.py  # 300 grids (Hist / Celebrity / Landmark)
python image_concat.py                   # Composite 2×2 grid images
python upload_to_hf.py                   # Push to Hugging Face Hub
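The compositing step in `image_concat.py` can be approximated with Pillow; the cell size and A-D ordering below are assumptions about the layout, not the script's verified behavior:

```python
from PIL import Image

def make_2x2_grid(images, cell=800):
    """Paste four images into a 2x2 grid (A top-left, B top-right,
    C bottom-left, D bottom-right), giving the 1600x1600 px layout."""
    grid = Image.new("RGB", (cell * 2, cell * 2), "white")
    for img, pos in zip(images, [(0, 0), (cell, 0), (0, cell), (cell, cell)]):
        grid.paste(img.resize((cell, cell)), pos)
    return grid
```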

TallyBench image preparation

cd TallyBench/Other/
python prepare_tallybench_images.py   # Resize ≤1024px, 16px-aligned, ≥224px
python excel_to_json_converter.py     # Excel metadata → JSON
python rename.py                      # Zero-pad filenames (0001.jpg – 2000.jpg)
python check.py                       # Validate completeness + dimensions
python upload_HF.py                   # Push to Hugging Face Hub
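The three sizing constraints (longest side ≤ 1024 px, dimensions aligned to 16 px, nothing below 224 px) compose into a simple dimension rule. A sketch of what `prepare_tallybench_images.py` plausibly computes; the exact rounding behavior is an assumption:

```python
def target_size(w, h, max_side=1024, align=16, min_side=224):
    """Compute TallyBench-style target dimensions: cap the longest side,
    round each side down to a multiple of `align`, clamp to `min_side`.
    (min_side=224 is itself 16-aligned, so the result stays aligned.)"""
    scale = min(1.0, max_side / max(w, h))
    nw, nh = int(w * scale), int(h * scale)
    nw = max(min_side, nw - nw % align)
    nh = max(min_side, nh - nh % align)
    return nw, nh
```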

OmniCaps upload

python OmniCaps/HistCaps/Other/upload_HF.py
python OmniCaps/LandmarkCaps/Other/upload_HF.py
python OmniCaps/CelebrityCaps/Other/upload_HF.py


Agentic Evaluation

Agent_Solution/ packages four tasks for automated agent frameworks:

| Task | Answer Format | Timeout |
|---|---|---|
| counting/ | Integer | 1200s |
| compare_tally/ | A/B/C/D | 1200s |
| compare_geometry/ | A/B/C/D | 1200s |
| compare_spatial/ | A/B/C/D | 1200s |

Each task folder contains task.toml, instruction.md, and README.md.
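A `task.toml` presumably encodes the answer format and timeout from the table above; an illustrative sketch (the keys are assumptions, not the repo's actual schema):

```toml
# Hypothetical schema -- check the real task.toml in each folder.
[task]
name = "compare_tally"
answer_format = "A/B/C/D"   # "Integer" for the counting/ task
timeout_seconds = 1200
```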

Tech Stack

Anthropic SDK · OpenAI SDK · Google Gemini · xAI · Qwen2.5-VL · FLUX.1-dev · Hugging Face Datasets · Transformers · Pillow · OpenCV · Pandas · Plotly · PyYAML · Python 3.10 · Conda · CUDA