VisualPuzzles is a multimodal benchmark specifically designed to evaluate reasoning abilities in large models while deliberately minimizing reliance on domain-specific knowledge.
Key features:
- 1168 diverse puzzles
- 5 reasoning categories: Algorithmic, Analogical, Deductive, Inductive, Spatial
- Difficulty labels: Easy, Medium, Hard
- Less knowledge-intensive than existing benchmarks (e.g., MMMU)
- Requires more complex reasoning than existing benchmarks (e.g., MMMU)
- All models perform worse than humans; most can't surpass even 5th-percentile human performance.
- Strong performance on knowledge-heavy benchmarks does not transfer well to VisualPuzzles.
- Larger models and structured "thinking modes" do not guarantee better results; scaling model size alone does not yield stronger reasoning.
The dataset is available on HuggingFace 🤗.
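A minimal sketch of loading the data with the 🤗 datasets library (the dataset ID, split, and field names below are assumptions; check the dataset card for the exact values):

```python
# Minimal sketch of loading VisualPuzzles from the HuggingFace Hub.
# The dataset ID, split, and field names are assumptions; consult the dataset card.
from datasets import load_dataset

dataset = load_dataset("neulab/VisualPuzzles", split="train")  # assumed ID and split
print(len(dataset))        # expected: 1168 puzzles
print(dataset[0].keys())   # e.g., image, question, options, answer, category
```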
Outputs of all models we evaluated are available on Zeno.
We gratefully use the lmms-eval package to evaluate VisualPuzzles.
To reproduce experimental results on VisualPuzzles, run the following commands:
Installation:
cd evaluation/lmms-eval
pip install -e .
Experiments:
python3 -m accelerate.commands.launch \
    --num_processes=8 \
    -m lmms_eval \
    --model model_type \
    --model_args pretrained=model_name \
    --tasks VisualPuzzles_cot \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix VisualPuzzles \
    --output_path ./logs/

Here, model_type is the lmms-eval model type (for example, llava) and model_name is the pretrained checkpoint (for example, "liuhaotian/llava-v1.5-7b"). Use the VisualPuzzles_cot task to evaluate chain-of-thought (CoT) performance, or VisualPuzzles_direct otherwise.

Alternatively, you can also run model_evaluation.py with a custom model.
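For orientation only, a hypothetical sketch of the kind of wrapper a custom model might expose; the class and method names below are illustrative and not the actual model_evaluation.py interface:

```python
# Hypothetical custom-model wrapper; adapt names to model_evaluation.py's actual interface.
from dataclasses import dataclass

@dataclass
class MyCustomModel:
    checkpoint: str  # path or identifier of your own model

    def generate(self, image_path: str, question: str) -> str:
        # Run your model on the puzzle image and question and return its answer text,
        # e.g. one of the multiple-choice letters ("A", "B", "C", "D").
        raise NotImplementedError
```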
This experiment investigates:
- the extent to which solving problems in the VisualPuzzles benchmark relies on domain-specific knowledge, compared to the widely-used MMMU dataset; and
- whether models already possess the knowledge required to solve VisualPuzzles, as compared to MMMU.
We prompted GPT-4o to generate "knowledge concept checklists" for 50 randomly selected questions from each of MMMU and VisualPuzzles.
The knowledge concept checklists we generated for MMMU and VisualPuzzles can be found in knowledge/mmmu_questions.json and knowledge/puzzle_questions.json, respectively.
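As an illustration, a rough sketch of how such a checklist might be generated with GPT-4o via the OpenAI Python client; the prompt wording and output handling are assumptions, not the exact pipeline in get_knowledge_checklists.py:

```python
# Illustrative only: prompt GPT-4o for a knowledge-concept checklist for one question.
# The prompt and output format are assumptions; see get_knowledge_checklists.py for
# the actual pipeline.
from openai import OpenAI

client = OpenAI()

def generate_checklist(question_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                "List, as short standalone questions, the domain-specific knowledge "
                "concepts required to answer the following question:\n\n" + question_text
            ),
        }],
    )
    return response.choices[0].message.content
```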
Run the following command to reproduce this experiment.
python get_knowledge_checklists.py

Note that we manually validated the generated checklists, as discussed in the paper.
We measured models' knowledge accuracy, i.e., their ability to answer the knowledge checklist questions correctly, on both benchmarks, using GPT-4o as an LLM-as-a-judge to decide whether each checklist question was answered correctly. Model outputs and judge outputs can be found in knowledge/knowledge_eval_output.
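A minimal sketch of the judging step, assuming the OpenAI Python client is used to query GPT-4o; the judge prompt below is illustrative rather than the exact one used in our evaluation:

```python
# Illustrative LLM-as-a-judge call: ask GPT-4o whether a model's answer to a
# checklist question matches the reference answer. Prompt wording is an assumption.
from openai import OpenAI

client = OpenAI()

def judge_checklist_answer(question: str, reference: str, model_answer: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\n"
                f"Reference answer: {reference}\n"
                f"Model answer: {model_answer}\n\n"
                "Does the model answer convey the same meaning as the reference? "
                "Answer with only 'yes' or 'no'."
            ),
        }],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```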
After generating model responses to the knowledge checklist questions in knowledge/mmmu_questions.json and knowledge/puzzle_questions.json, run the following commands to reproduce the knowledge accuracy results.
cd knowledge
python get_knowledge_scores.py
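For reference, a toy sketch of how per-question judge verdicts could be aggregated into a benchmark-level knowledge accuracy; the record structure is hypothetical, and get_knowledge_scores.py defines the actual format:

```python
# Toy aggregation of judge verdicts into knowledge accuracy. The record structure
# is hypothetical; see get_knowledge_scores.py and knowledge/knowledge_eval_output
# for the real formats.
from typing import Dict, List

def knowledge_accuracy(records: List[Dict]) -> float:
    """Fraction of knowledge checklist questions judged correct."""
    if not records:
        return 0.0
    return sum(1 for r in records if r["judge_correct"]) / len(records)

# Example with dummy verdicts: 4 of 5 checklist questions judged correct -> 0.8.
dummy = [{"judge_correct": v} for v in (True, True, True, True, False)]
print(f"knowledge accuracy: {knowledge_accuracy(dummy):.2f}")
```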