This repository provides a minimal, tutorial-style evaluation pipeline for the HAERAE-HUB/KMMMU dataset using:
- vLLM (offline inference) with `LLM.generate`
- Math-Verify for structured answer parsing
- LiteLLM for LLM-as-a-Judge evaluation
The goal is clarity and simplicity. There is no over-engineering, no defensive try/except logic, and no batching logic. The full dataset is processed in a single forward pass and a single judge call.
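Because the whole evaluation funnels into one judge call, the judge interface stays small. A hedged sketch of what such a LiteLLM judge call could look like follows; the model name and prompt wording are illustrative assumptions, not the notebook's exact code:

```python
def build_judge_prompt(question: str, gold: str, raw: str, parsed: str) -> str:
    # Illustrative prompt template; the notebook's actual wording may differ.
    return (
        "You are grading a model answer.\n"
        f"Question: {question}\n"
        f"Gold answer: {gold}\n"
        f"Raw prediction: {raw}\n"
        f"Parsed prediction: {parsed}\n"
        "Reply with CORRECT or INCORRECT."
    )

def judge(question: str, gold: str, raw: str, parsed: str, model: str = "gpt-4o") -> str:
    # Lazy import so the prompt builder is usable without litellm installed;
    # the call itself requires a provider API key (e.g., OPENAI_API_KEY).
    import litellm

    response = litellm.completion(
        model=model,
        messages=[{"role": "user", "content": build_judge_prompt(question, gold, raw, parsed)}],
    )
    return response.choices[0].message.content
```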
The main notebook:
- Loads the KMMMU dataset from Hugging Face
- Runs multimodal inference with vLLM
- Forces answers in `$\boxed{...}$` format
- Parses outputs using Math-Verify `parse`
- Sends `(question, gold, raw prediction, parsed prediction)` to an LLM judge
- Computes accuracy and saves results to CSV
The notebook is structured with markdown explanations for each stage of the pipeline.
Utility module containing:
- System prompts
- Image utilities (e.g., `fetch_image`, optional image helpers)
- Shared configuration components
This keeps the notebook clean and focused on the evaluation flow.
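As an illustration of the kind of helper the utility module holds, a simplified `fetch_image`-style function might look like the following. This is a hypothetical sketch, not the module's actual implementation: it only handles base64 data URIs and local paths, while the real helper may also fetch HTTP URLs.

```python
import base64
from pathlib import Path

def fetch_image_bytes(source: str) -> bytes:
    # Hypothetical helper: return raw image bytes from a data URI or a local file.
    if source.startswith("data:"):
        # e.g. "data:image/png;base64,iVBOR..."
        _, encoded = source.split(",", 1)
        return base64.b64decode(encoded)
    return Path(source).read_bytes()
```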
Install the following:
```bash
pip install datasets vllm litellm "math-verify[inference]"
```

You also need:
- A local GPU setup compatible with vLLM
- An API key for your LiteLLM provider (e.g., `OPENAI_API_KEY`)
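For example, with an OpenAI-backed judge the key is supplied via an environment variable (the variable name depends on your provider; `OPENAI_API_KEY` is shown here as an assumption for OpenAI):

```bash
export OPENAI_API_KEY="sk-..."  # or whichever variable your LiteLLM provider expects
```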