A multi-model visual reasoning benchmark suite that evaluates vision-language models (VLMs) on side-by-side image comparison, object counting, and temporal reasoning.
CompareBench: models are shown a 1600×1600 px 2×2 grid of four images and asked "which image is most / least X?", answering with A, B, C, or D.
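Model replies to these prompts are usually free text, so a scoring script has to reduce each reply to a single letter. The snippet below is a minimal sketch of one way to do that. `extract_choice` and its "last standalone capital letter wins" heuristic are illustrative assumptions, not code taken from this repository.

```python
import re
from typing import Optional

# Hypothetical helper, not part of the repository's scripts.
# Matches a capital A, B, C, or D that stands alone as a word.
_CHOICE = re.compile(r"\b([ABCD])\b")


def extract_choice(reply: str) -> Optional[str]:
    """Return the last standalone A/B/C/D in a reply, or None if there is none."""
    matches = _CHOICE.findall(reply)
    return matches[-1] if matches else None
```

Taking the last match favors replies that reason first and answer at the end. A reply that opens with the article "A" and never states a letter would still be misread, so a stricter prompt format may be preferable.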
TallyBench: 2,000 single images across 50+ fine-grained categories. Models output an integer count.
| Image Type | Count | Share |
|---|---|---|
| Real photographs | 1,325 | 66% |
| Synthetic (FLUX.1-dev) | 435 | 22% |
| Artificial / curated | 240 | 12% |
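Because the counting task expects an integer, a reply such as "I count 12 apples" has to be normalized before it can be compared with the label. The following is a rough sketch under the assumption that the last number in the reply is the answer. `extract_count` is a hypothetical name, and the repository's own parsing may differ.

```python
import re
from typing import Optional


def extract_count(reply: str) -> Optional[int]:
    """Return the last non-negative integer in a reply, or None if there is none.

    Thousands separators such as "1,204" are accepted and stripped.
    """
    numbers = re.findall(r"\d[\d,]*", reply)
    if not numbers:
        return None
    return int(numbers[-1].replace(",", ""))
```

The "last number" rule is only a heuristic: a reply like "3 dogs in image 2" would be parsed as 2, so asking the model to answer with the number alone is the safer setup.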
Date-keyed image captions powering the three temporal sub-benchmarks. Hosted on Hugging Face at qiuzhangTiTi/OmniCaps.
| Split | Images | Date Range | Subject |
|---|---|---|---|
| HistCaps | 516 | 1707–2025 | Historical events |
| LandmarkCaps | 100 | 80–2024 | World landmarks |
| CelebrityCaps | 100 | 1643–2007 | Notable people (birth dates) |
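To illustrate how date-keyed captions can yield a ground-truth answer for a temporal comparison, here is a small sketch. It assumes each image is keyed by a single integer year and that the question asks for the earliest or latest of four images. The function name, the year-only key, and the sample years are assumptions made for illustration.

```python
from typing import Sequence

LABELS = "ABCD"


def temporal_answer(years: Sequence[int], earliest: bool = True) -> str:
    """Return the letter of the earliest (or latest) of four dated images.

    Ties resolve to the first matching position.
    """
    if len(years) != 4:
        raise ValueError("expected exactly four years")
    pick = min if earliest else max
    index = pick(range(4), key=lambda i: years[i])
    return LABELS[index]
```

For example, `temporal_answer([1969, 1707, 2025, 1945])` returns `"B"`, and passing `earliest=False` returns `"C"`.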
| Model | CompareTally | CompareGeo | CompareSpatial | CompareHist | CompareCelebrity | CompareLandmark |
|---|---|---|---|---|---|---|
| GPT-5 | 82.17% | 70.50% | — | — | — | 96.00% |
| claude-opus-4-6 | 72.33% | — | — | — | — | 79.00% |
| xAI grok-4 | 67.83% | — | — | — | 91.00% | — |
| claude-haiku-4-5 | 46.17% | — | — | — | 84.00% | — |
* Partial run — 1,406 / 2,000 samples.
source ~/miniforge3/etc/profile.d/conda.sh && conda activate CompareBench
cd CompareBench/
python anthropic_HF.py # Claude haiku / sonnet / opus (4-5, 4-6)
python openai_HF.py # GPT-5, GPT-5.1, GPT-5.2
python gemini_HF.py # Gemini 2.5 Pro Preview
python xAI_HF.py # Grok-4
cd TallyBench/
python anthropic_HF.py
python openai_HF.py
python gemini_HF.py
python xAI_HF.py
python Other/Qwen2.5-VL_HF.py # Qwen2.5-VL 3B/7B/32B/72B (GPU required)
python CompareBench/Results/acc.py
python TallyBench/Results/acc.py
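For readers who want to sanity-check a reported percentage, accuracy here is taken to mean the fraction of exact matches between predictions and labels. The sketch below assumes that definition. It is not the contents of `acc.py`, and the sample data is made up.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("no samples to score")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


# Toy data: three of four predictions match.
print(f"{accuracy(['A', 'B', 'C', 'D'], ['A', 'B', 'C', 'A']):.2%}")  # prints 75.00%
```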
cd CompareBench/Other/
python quantity_comparison_data_json.py # 600 2×2 grids (counting)
python temporal_comparison_data_json.py # 300 grids (Hist / Celebrity / Landmark)
python image_concat.py # Composite 2×2 grid images
python upload_to_hf.py # Push to Hugging Face Hub
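As a rough picture of the composite layout, the sketch below computes the pixel box of each tile in a 1600×1600 px 2×2 grid. The mapping of letters to positions (A top-left, B top-right, C bottom-left, D bottom-right) is an assumption, and `grid_boxes` is a hypothetical helper rather than a function from `image_concat.py`.

```python
from typing import Dict, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def grid_boxes(size: int = 1600) -> Dict[str, Box]:
    """Return the tile box for each answer letter in a square 2x2 grid.

    Assumed order: row-major, starting from the top-left tile.
    """
    half = size // 2
    return {
        "A": (0, 0, half, half),
        "B": (half, 0, size, half),
        "C": (0, half, half, size),
        "D": (half, half, size, size),
    }
```

With the default size each tile is 800×800 px, and the boxes can be passed directly to an imaging library's paste or crop call.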
cd TallyBench/Other/
python prepare_tallybench_images.py # Resize ≤1024px, 16px-aligned, ≥224px
python excel_to_json_converter.py # Excel metadata → JSON
python rename.py # Zero-pad filenames (0001.jpg – 2000.jpg)
python check.py # Validate completeness + dimensions
python upload_HF.py # Push to Hugging Face Hub
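The resize and rename rules in the comments above can be sketched as follows. This is one interpretation, assuming "≤1024px" bounds the longest side, "16px-aligned" means each side is rounded down to a multiple of 16, and "≥224px" is a floor applied to each side. `target_size` and `padded_name` are hypothetical names, and the actual scripts may behave differently.

```python
from typing import Tuple


def target_size(width: int, height: int, max_side: int = 1024,
                min_side: int = 224, align: int = 16) -> Tuple[int, int]:
    """Compute output dimensions under the assumed resize rules."""
    longest = max(width, height)

    def fit(side: int) -> int:
        if longest > max_side:
            side = side * max_side // longest  # shrink with integer math
        side = side // align * align           # round down to a multiple of align
        return max(min_side, side)             # enforce the minimum side

    return fit(width), fit(height)


def padded_name(index: int) -> str:
    """Zero-pad an image index to four digits, e.g. 1 becomes 0001.jpg."""
    return f"{index:04d}.jpg"
```

For instance, a 4000×3000 input maps to 1024×768 under these assumptions. The minimum-side floor can change the aspect ratio of very elongated images, which a real pipeline might handle by padding or cropping instead.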
python OmniCaps/HistCaps/Other/upload_HF.py
python OmniCaps/LandmarkCaps/Other/upload_HF.py
python OmniCaps/CelebrityCaps/Other/upload_HF.py
Agent_Solution/ packages four tasks for automated agent frameworks:
| Task | Answer Format | Timeout |
|---|---|---|
| counting/ | Integer | 1200s |
| compare_tally/ | A/B/C/D | 1200s |
| compare_geometry/ | A/B/C/D | 1200s |
| compare_spatial/ | A/B/C/D | 1200s |
Each task folder contains task.toml, instruction.md, and README.md.
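A quick way to confirm this layout before handing the tasks to an agent framework is to check that every folder holds the three files. The sketch below assumes the four task folders sit directly under the root passed in. `missing_files` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path
from typing import Dict, List

REQUIRED = ("task.toml", "instruction.md", "README.md")
TASKS = ("counting", "compare_tally", "compare_geometry", "compare_spatial")


def missing_files(root: Path) -> Dict[str, List[str]]:
    """Map each task name to the required files that are absent under root."""
    report: Dict[str, List[str]] = {}
    for task in TASKS:
        absent = [name for name in REQUIRED if not (root / task / name).is_file()]
        if absent:
            report[task] = absent
    return report
```

Calling `missing_files(Path("Agent_Solution"))` returns an empty dictionary when every task folder is complete.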