CompareBench

A multi-model visual reasoning benchmark suite evaluating VLMs on side-by-side image comparison, object counting, and temporal reasoning.

- 1,100 CompareBench samples
- 2,000 TallyBench images
- 716 OmniCaps images
- 6 sub-benchmarks
- 51 object categories
- 8+ models evaluated

Project Structure

CompareBench/
├── CompareBench/      # Main 2×2 comparison benchmark (1,100 samples, 6 sub-tasks)
├── TallyBench/        # Single-image object counting (2,000 images)
├── OmniCaps/          # Multi-domain caption datasets (716 images)
│   ├── HistCaps/      # Historical events (516 images, 1707–2025)
│   ├── LandmarkCaps/  # World landmarks (100 images, 80–2024)
│   └── CelebrityCaps/ # Notable people (100 images, 1643–2007)
├── Agent_Solution/    # Agentic task packaging for automated evaluation
├── Drawing_python/    # Dataset visualization scripts
└── Drawing_pptx/      # Presentation assets and paper figures

Benchmarks at a Glance

CompareBench — Visual Comparison Reasoning

Models are shown a 1600×1600 px 2×2 grid of images and asked "which image is most / least X?" with an A/B/C/D answer.
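Scoring such items requires extracting the letter choice from free-form model output. A minimal sketch of one way to do this (the repo's actual parsing logic may differ):

```python
import re

def parse_choice(response: str):
    """Extract the first standalone A/B/C/D letter from a model response.

    Naive sketch: real harnesses usually prompt for a bare letter or a
    fixed output format rather than searching free text like this.
    """
    m = re.search(r"\b([ABCD])\b", response.strip().upper())
    return m.group(1) if m else None
```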

| Sub-benchmark | Samples | Question |
|---|---|---|
| CompareTallyBench | 600 | Most / least objects across 51 categories |
| CompareGeometryBench | 200 | Longest / shortest / thinnest / widest / smallest diameter |
| CompareSpatialBench | 100 | Deepest / highest (depth & vertical height) |
| CompareHistBench | 100 | Earliest / latest historical photograph |
| CompareCelebrityBench | 100 | Oldest / youngest celebrity by birth year |
| CompareLandmarkBench | 100 | Oldest / newest landmark by construction date |

TallyBench — Object Counting

2,000 single images across 51 fine-grained categories. Models output an integer count.

| Image Type | Count | Share |
|---|---|---|
| Real photographs | 1,325 | 66% |
| Synthetic (FLUX.1-dev) | 435 | 22% |
| Artificial / curated | 240 | 12% |
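Because models answer in free-form text, the harness has to pull an integer out of each response before scoring exact-match accuracy. A hedged sketch (the parsing rule here is an assumption, not the repo's exact logic):

```python
import re

def extract_count(response: str):
    """Pull the last integer from a free-form response (assumption: the
    final number in the text is the model's answer)."""
    nums = re.findall(r"\d+", response.replace(",", ""))
    return int(nums[-1]) if nums else None

def counting_accuracy(preds, golds):
    """Exact-match accuracy over paired predictions and ground truths."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)
```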

OmniCaps — Caption Dataset

Date-keyed image captions powering the three temporal sub-benchmarks. Hosted on Hugging Face at qiuzhangTiTi/OmniCaps.

| Split | Images | Date Range | Subject |
|---|---|---|---|
| HistCaps | 516 | 1707–2025 | Historical events |
| LandmarkCaps | 100 | 80–2024 | World landmarks |
| CelebrityCaps | 100 | 1643–2007 | Notable people (birth dates) |
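The date keys are what turn a caption set into a comparison benchmark: sample four records, composite their images into a grid, and the earliest (or latest) date determines the gold letter. A simplified sketch, with `year` as an assumed field name:

```python
import random

def build_temporal_item(records, rng=None):
    """Sample four date-keyed records and label the earliest one A-D.

    Sketch only: the real pipeline also composites the four images into
    a 2x2 grid and phrases the question text.
    """
    rng = rng or random.Random(0)
    picks = rng.sample(records, 4)
    earliest = min(range(4), key=lambda i: picks[i]["year"])
    return {"options": picks, "answer": "ABCD"[earliest]}
```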

Benchmark Results

CompareBench (A/B/C/D accuracy)

| Model | CompareTally | CompareGeo | CompareSpatial | CompareHist | CompareCelebrity | CompareLandmark |
|---|---|---|---|---|---|---|
| GPT-5 | 82.17% | 70.50% | 96.00% | | | |
| claude-opus-4-6 | 72.33% | 79.00% | | | | |
| xAI grok-4 | 67.83% | 91.00% | | | | |
| claude-haiku-4-5 | 46.17% | 84.00% | | | | |

TallyBench (counting accuracy)

| Model | Accuracy |
|---|---|
| GPT-5.2 | 75.60% |
| GPT-5 | 73.70% |
| GPT-5.1 | 73.20% |
| claude-sonnet-4-6 | 68.75% |
| claude-sonnet-4-5 | 67.75% |
| claude-opus-4-6 | 66.05% |
| claude-opus-4-5 | 61.30% |
| claude-haiku-4-5 | 58.65% |
| xAI grok-4 * | 49.50% |

* Partial run — 1,406 / 2,000 samples.

Running Evaluations

Setup

source ~/miniforge3/etc/profile.d/conda.sh && conda activate CompareBench

CompareBench

cd CompareBench/
python anthropic_HF.py   # Claude haiku / sonnet / opus (4-5, 4-6)
python openai_HF.py      # GPT-5, GPT-5.1, GPT-5.2
python gemini_HF.py      # Gemini 2.5 Pro Preview
python xAI_HF.py         # Grok-4

TallyBench

cd TallyBench/
python anthropic_HF.py
python openai_HF.py
python gemini_HF.py
python xAI_HF.py
python Other/Qwen2.5-VL_HF.py   # Qwen2.5-VL 3B/7B/32B/72B (GPU required)

Accuracy

python CompareBench/Results/acc.py
python TallyBench/Results/acc.py

Data Pipeline

CompareBench image generation

cd CompareBench/Other/
python quantity_comparison_data_json.py   # 600 2×2 grids (counting)
python temporal_comparison_data_json.py  # 300 grids (Hist / Celebrity / Landmark)
python image_concat.py                   # Composite 2×2 grid images
python upload_to_hf.py                   # Push to Hugging Face Hub
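The compositing step in `image_concat.py` can be approximated with Pillow; the cell size and A-D ordering below are assumptions about the layout, not the script's verified behavior:

```python
from PIL import Image

def make_2x2_grid(images, cell=800):
    """Paste four images into a 2x2 grid (A top-left, B top-right,
    C bottom-left, D bottom-right), giving the 1600x1600 px layout."""
    grid = Image.new("RGB", (cell * 2, cell * 2), "white")
    for img, pos in zip(images, [(0, 0), (cell, 0), (0, cell), (cell, cell)]):
        grid.paste(img.resize((cell, cell)), pos)
    return grid
```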

TallyBench image preparation

cd TallyBench/Other/
python prepare_tallybench_images.py   # Resize ≤1024px, 16px-aligned, ≥224px
python excel_to_json_converter.py     # Excel metadata → JSON
python rename.py                      # Zero-pad filenames (0001.jpg – 2000.jpg)
python check.py                       # Validate completeness + dimensions
python upload_HF.py                   # Push to Hugging Face Hub
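The three sizing constraints (longest side ≤ 1024 px, dimensions aligned to 16 px, nothing below 224 px) compose into a simple dimension rule. A sketch of what `prepare_tallybench_images.py` plausibly computes; the exact rounding behavior is an assumption:

```python
def target_size(w, h, max_side=1024, align=16, min_side=224):
    """Compute TallyBench-style target dimensions: cap the longest side,
    round each side down to a multiple of `align`, clamp to `min_side`.
    (min_side=224 is itself 16-aligned, so the result stays aligned.)"""
    scale = min(1.0, max_side / max(w, h))
    nw, nh = int(w * scale), int(h * scale)
    nw = max(min_side, nw - nw % align)
    nh = max(min_side, nh - nh % align)
    return nw, nh
```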

OmniCaps upload

python OmniCaps/HistCaps/Other/upload_HF.py
python OmniCaps/LandmarkCaps/Other/upload_HF.py
python OmniCaps/CelebrityCaps/Other/upload_HF.py


Agentic Evaluation

Agent_Solution/ packages four tasks for automated agent frameworks:

| Task | Answer Format | Timeout |
|---|---|---|
| counting/ | Integer | 1200s |
| compare_tally/ | A/B/C/D | 1200s |
| compare_geometry/ | A/B/C/D | 1200s |
| compare_spatial/ | A/B/C/D | 1200s |

Each task folder contains task.toml, instruction.md, and README.md.
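A `task.toml` presumably encodes the answer format and timeout from the table above; an illustrative sketch (the keys are assumptions, not the repo's actual schema):

```toml
# Hypothetical schema -- check the real task.toml in each folder.
[task]
name = "compare_tally"
answer_format = "A/B/C/D"   # "Integer" for the counting/ task
timeout_seconds = 1200
```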

Tech Stack

Anthropic SDK · OpenAI SDK · Google Gemini · xAI · Qwen2.5-VL · FLUX.1-dev · Hugging Face Datasets · Transformers · Pillow · OpenCV · Pandas · Plotly · PyYAML · Python 3.10 · Conda · CUDA