Evaluation of Captain's multimodal retrieval-augmented generation on MRAG-Bench (ICLR 2025).
MRAG-Bench is a vision-centric benchmark with 16,130 images and 1,251 human-annotated multiple-choice questions across 9 scenarios, designed to evaluate multimodal RAG systems where visual retrieval outperforms text retrieval.
Install the dependencies:

    pip install -r requirements.txt

Set your Captain API credentials:

    export CAPTAIN_API_KEY="your_api_key"
    export CAPTAIN_ORG_ID="your_org_id"
    export CAPTAIN_API_URL="https://api.runcaptain.com"  # or staging URL
    export CAPTAIN_COLLECTION="multimodal-eval"

Download the MRAG-Bench image corpus (~3.7 GB, 16,130 images) and index it into a Captain collection:

    python index_corpus.py --collection multimodal-eval

This uploads all corpus images to your Captain collection using the `/v2/collections/{name}/index/file` endpoint.
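
As a rough illustration of how the endpoint path above is assembled, the sketch below builds the indexing URL from the environment variables and collects candidate image files. The helper names, the corpus directory layout, and the extension list are assumptions made for this example. The request body and authentication scheme are not described in this README, so the sketch stops before sending anything.

```python
import os
from pathlib import Path
from urllib.parse import quote


def index_endpoint(api_url: str, collection: str) -> str:
    """Hypothetical helper: fill in the /v2/collections/{name}/index/file template."""
    return f"{api_url.rstrip('/')}/v2/collections/{quote(collection, safe='')}/index/file"


def list_images(corpus_dir: str, exts=(".jpg", ".jpeg", ".png")) -> list[Path]:
    """Hypothetical helper: gather image files; the extension list is an assumption."""
    root = Path(corpus_dir)
    return sorted(p for p in root.rglob("*") if p.is_file() and p.suffix.lower() in exts)


if __name__ == "__main__":
    url = index_endpoint(
        os.environ.get("CAPTAIN_API_URL", "https://api.runcaptain.com"),
        os.environ.get("CAPTAIN_COLLECTION", "multimodal-eval"),
    )
    print(url)
```
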
Run the evaluation:

    # Full eval (1,251 questions)
    python eval.py

    # Quick test (first 50 questions)
    python eval.py --limit 50

    # Filter by scenario
    python eval.py --scenario Angle

    # Filter by image type
    python eval.py --image-type Animal

    # Change retrieval depth
    python eval.py --top-k 10

    # Run multiple top-k values
    python eval.py --top-k 1 5 10 20

Full evaluation on 1,251 questions from MRAG-Bench, with 16,130 corpus images indexed into Captain.

| Metric | K=5 | K=10 |
|---|---|---|
| ContentHit@K | 81.3% | 81.8% |
| MRR | 0.807 | 0.798 |
| Precision@K | 74.2% | 70.3% |
- ContentHit@K: At least 1 retrieved image's VLM description contains the correct answer
- MRR: Mean Reciprocal Rank of the first correct hit
- Precision@K: Fraction of retrieved images whose descriptions match the answer
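
The three definitions above can be sketched as follows, assuming each question has already been reduced to a ranked list of booleans (one per retrieved image, `True` when the image's VLM description matches the answer). The function names and input format are illustrative and not taken from `eval.py`. Precision here divides by the number of images actually returned, which is one possible reading of "fraction of retrieved images".

```python
from typing import Sequence


def content_hit_at_k(flags: Sequence[bool], k: int) -> float:
    """1.0 if any of the top-k results matches, else 0.0."""
    return 1.0 if any(flags[:k]) else 0.0


def reciprocal_rank(flags: Sequence[bool], k: int) -> float:
    """1/rank of the first matching result within the top-k, or 0.0 if none."""
    for rank, hit in enumerate(flags[:k], start=1):
        if hit:
            return 1.0 / rank
    return 0.0


def precision_at_k(flags: Sequence[bool], k: int) -> float:
    """Fraction of the returned top-k results that match (assumed denominator)."""
    top = flags[:k]
    return sum(top) / len(top) if top else 0.0


def evaluate(per_question: Sequence[Sequence[bool]], k: int) -> dict:
    """Average the per-question scores over the whole question set."""
    n = len(per_question)
    return {
        "content_hit": sum(content_hit_at_k(f, k) for f in per_question) / n,
        "mrr": sum(reciprocal_rank(f, k) for f in per_question) / n,
        "precision": sum(precision_at_k(f, k) for f in per_question) / n,
    }


# Two toy questions: first hit at rank 2 for one, no hit for the other.
toy = [[False, True, True], [False, False, False]]
scores = evaluate(toy, k=3)
```
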
| Scenario | Count | ContentHit@5 | ContentHit@10 | MRR@5 | Precision@5 |
|---|---|---|---|---|---|
| High Precision | 120 | 100.0% | 100.0% | 1.000 | 98.5% |
| Scope | 102 | 96.1% | 96.1% | 0.956 | 84.7% |
| Incomplete | 102 | 94.1% | 94.1% | 0.931 | 91.0% |
| Obstruction | 108 | 92.6% | 92.6% | 0.926 | 84.1% |
| Partial | 246 | 86.6% | 87.0% | 0.862 | 78.3% |
| Angle | 322 | 83.2% | 84.2% | 0.829 | 75.2% |
| Temporal | 149 | 57.0% | 57.7% | 0.554 | 50.3% |
| Deformation | 102 | 36.3% | 37.3% | 0.353 | 29.4% |

| Aspect | Count | ContentHit@5 | ContentHit@10 | MRR@5 | Precision@5 |
|---|---|---|---|---|---|
| High Precision | 120 | 100.0% | 100.0% | 1.000 | 98.5% |
| Perspective | 778 | 87.3% | 87.8% | 0.870 | 78.7% |
| Transformative | 353 | 61.8% | 62.3% | 0.605 | 56.0% |

| Image Type | Count | ContentHit@5 | ContentHit@10 | MRR@5 | Precision@5 |
|---|---|---|---|---|---|
| High Precision | 120 | 100.0% | 100.0% | 1.000 | 98.5% |
| Keyboard | 102 | 94.1% | 94.1% | 0.931 | 91.0% |
| Building | 44 | 93.2% | 93.2% | 0.902 | 84.1% |
| Animal | 416 | 91.8% | 92.3% | 0.916 | 80.0% |
| Car | 286 | 77.3% | 77.6% | 0.769 | 73.7% |
| Flowers | 178 | 63.5% | 64.6% | 0.626 | 55.4% |
| Cats | 105 | 41.9% | 42.9% | 0.408 | 36.2% |
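
The aspect-level rows are consistent with count-weighted averages of the scenario rows. The sketch below checks this for ContentHit@5, using the counts and rates from the scenario table. The grouping of scenarios into aspects is an assumption inferred from the counts (778 and 353), not something the tables state. Because the inputs are already rounded to one decimal, the recomputed Transformative value comes out near 61.7% rather than the reported 61.8%.

```python
# (question count, ContentHit@5 in percent), copied from the scenario table.
SCENARIOS = {
    "High Precision": (120, 100.0),
    "Scope": (102, 96.1),
    "Incomplete": (102, 94.1),
    "Obstruction": (108, 92.6),
    "Partial": (246, 86.6),
    "Angle": (322, 83.2),
    "Temporal": (149, 57.0),
    "Deformation": (102, 36.3),
}

# Assumed grouping: these sets reproduce the aspect counts 778 and 353.
ASPECTS = {
    "Perspective": ["Scope", "Obstruction", "Partial", "Angle"],
    "Transformative": ["Incomplete", "Temporal", "Deformation"],
}


def weighted_rate(names):
    """Count-weighted mean of the per-scenario rates, in percent."""
    total = sum(SCENARIOS[n][0] for n in names)
    return sum(SCENARIOS[n][0] * SCENARIOS[n][1] for n in names) / total


perspective = weighted_rate(ASPECTS["Perspective"])
transformative = weighted_rate(ASPECTS["Transformative"])
overall = weighted_rate(list(SCENARIOS))

print(f"Perspective    {perspective:.1f}%")
print(f"Transformative {transformative:.1f}%")
print(f"Overall        {overall:.1f}%")
```
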
The MRAG-Bench paper evaluates 14 LVLMs in an end-to-end setting: retrieve top-5 images with CLIP, then feed them to the model to answer the question. The table below compares Captain's retrieval quality (ContentHit@5) against the end-to-end accuracy of the best models from the paper using CLIP-retrieved images.
These metrics are not directly equivalent: ContentHit measures whether the retriever surfaces relevant images, while the paper measures whether the full system (retriever + LVLM) answers correctly. The comparison is still informative because retrieval quality is a major bottleneck: the paper reports a 95% correlation between retriever Recall@5 and downstream LVLM accuracy. A system that relies on retrieved evidence can only use it if the retriever finds the right images first.
| System | Overall | Perspective | Transformative | High Precision |
|---|---|---|---|---|
| Captain (retrieval) | 81.3% | 87.3% | 61.8% | 100.0% |
| GPT-4o + Retrieved RAG | 68.96% | 77.95% | 54.9% | 68.33% |
| Gemini Pro + Retrieved RAG | 65.93% | 73.29% | 55.84% | 65.0% |
| Claude 3.5 Sonnet + Retrieved RAG | 63.56% | 73.91% | 47.0% | 53.33% |
| GPT-4-Turbo + Retrieved RAG | 58.95% | 66.53% | 49.06% | 58.83% |
| Human + Retrieved RAG | 61.38% | 62.42% | 54.36% | 62.5% |
| Best Open-Source (LLaVA-OneVision) + Retrieved RAG | 50.11% | 50.93% | 48.04% | 54.17% |
Captain's ContentHit@5 is higher than the end-to-end accuracy of every system evaluated in the paper, including GPT-4o with retrieved images. The comparison is not like-for-like: Captain's number measures retrieval only (whether the right images are found), while the baselines include both CLIP retrieval and the LVLM answering the question. The size of the gap suggests that the CLIP retriever is a significant bottleneck for those baselines.
| Scenario | Captain | GPT-4o + CLIP RAG | Delta |
|---|---|---|---|
| High Precision | 100.0% | 68.33% | +31.7 |
| Scope | 96.1% | 69.61% | +26.5 |
| Incomplete | 94.1% | 30.95% | +63.2 |
| Obstruction | 92.6% | 75.0% | +17.6 |
| Partial | 86.6% | 78.86% | +7.7 |
| Angle | 83.2% | 77.95% | +5.3 |
| Temporal | 57.0% | 73.83% | -16.8 |
| Deformation | 36.3% | 54.9% | -18.6 |
Captain dramatically outperforms on Scope, Obstruction, and Incomplete, where native multimodal embeddings excel at visual similarity matching. The Incomplete scenario (+63.2 points) is particularly striking and suggests that Captain finds relevant keyboard images far more effectively than CLIP.
The two scenarios where GPT-4o + CLIP RAG scores higher (Temporal, Deformation) are transformative tasks where the question asks about properties not directly visible in the retrieved images. GPT-4o's advantage here comes from its parametric knowledge (it can reason about temporal changes or engine specs), not from better retrieval.
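
The Delta column in the table above is a difference in percentage points. A minimal sketch of that arithmetic follows, using `Decimal` with half-up rounding so that values such as 94.1 − 30.95 = 63.15 round to 63.2 as shown in the table (binary floats can round such ties the other way). The dictionary layout and function name are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

# (Captain ContentHit@5, GPT-4o + CLIP RAG accuracy), copied from the table.
SCORES = {
    "High Precision": ("100.0", "68.33"),
    "Scope": ("96.1", "69.61"),
    "Incomplete": ("94.1", "30.95"),
    "Obstruction": ("92.6", "75.0"),
    "Partial": ("86.6", "78.86"),
    "Angle": ("83.2", "77.95"),
    "Temporal": ("57.0", "73.83"),
    "Deformation": ("36.3", "54.9"),
}


def delta_points(captain: str, baseline: str) -> Decimal:
    """Difference in percentage points, rounded half-up to one decimal."""
    diff = Decimal(captain) - Decimal(baseline)
    return diff.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


deltas = {name: delta_points(c, b) for name, (c, b) in SCORES.items()}
for name, d in deltas.items():
    print(f"{name:15s} {d:+}")
```
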
- Captain's retriever alone (81.3% ContentHit@5) scores higher than every full RAG system in the paper, including GPT-4o + CLIP retrieval (68.96%). This supports the quality of native multimodal embeddings over CLIP for image corpus retrieval.
- The retrieval gap is the real bottleneck. The paper reports that even GPT-4o achieves only a 5.82% improvement with ground-truth images, while humans achieve 33.16%. Better retrieval (like Captain's) should translate into better downstream RAG performance.
- Perspective scenarios are handled well by Captain's approach: 87.3% ContentHit@5 across Angle, Partial, Scope, and Obstruction. These are the scenarios where visual similarity is most important.
- Transformative scenarios remain challenging for all systems. These require reasoning about temporal changes, deformation, or processes that go beyond visual similarity matching.

Strong performance (>90%) on scenarios where visual similarity directly maps to the correct answer: identifying animals, buildings, geographic objects, and reading keyboards. Captain's native multimodal embeddings excel at matching visually similar content across different viewpoints.
Moderate performance (60-90%) on angle/partial views and car identification, where the retriever must match across different viewpoints or identify specific models from partial views.
Challenging scenarios (<60%):
- Temporal (57%): Matching across significant temporal changes (e.g., buildings across decades, kittens to adult cats).
- Cats (42%): Breed identification requires matching across age-related visual transformation.
- Deformation (36%): Identifying car engine specs from deformed/damaged vehicle images requires knowledge not directly visible in the image.
These results align with the MRAG-Bench finding that transformative scenarios are fundamentally harder than perspective scenarios for retrieval systems.
| Metric | Description |
|---|---|
| ContentHit@K | Fraction of questions where at least 1 retrieved image's VLM description contains the correct answer |
| Precision@K | Average fraction of top-K results whose VLM descriptions match the correct answer |
| MRR | Mean Reciprocal Rank of the first content-matching result |
| FnHit@K | Fraction of questions where a ground-truth corpus filename appears in top-K |
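
FnHit@K differs from the content-based metrics in that it compares filenames rather than VLM descriptions. A minimal per-question sketch follows, assuming retrieved results and ground-truth entries can both be reduced to filenames and that comparing basenames is sufficient. Both the function name and the basename comparison are assumptions for illustration.

```python
from pathlib import PurePosixPath
from typing import Iterable, Sequence


def fn_hit_at_k(retrieved: Sequence[str], ground_truth: Iterable[str], k: int) -> bool:
    """True if any top-k retrieved file shares a basename with a ground-truth file."""
    gt_names = {PurePosixPath(name).name for name in ground_truth}
    return any(PurePosixPath(name).name in gt_names for name in retrieved[:k])


# Toy example with made-up filenames: the match sits at rank 3.
retrieved = ["corpus/a.jpg", "corpus/b.jpg", "corpus/c.jpg"]
ground_truth = ["c.jpg", "d.jpg"]
hit_at_2 = fn_hit_at_k(retrieved, ground_truth, k=2)
hit_at_3 = fn_hit_at_k(retrieved, ground_truth, k=3)
```

Averaging this boolean over all questions gives the fraction described in the table.
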
Results are saved to results/ as JSON with full per-question details and aggregate metrics.
MRAG-Bench (Hu et al., ICLR 2025) systematically identifies scenarios where visual retrieval provides more benefit than text retrieval. The benchmark spans 9 scenarios across perspective understanding (Scope, Angle, Obstruction, Partial) and transformative understanding (Temporal, Deformation, Biological, Incomplete).
Citation:

    @article{hu2024mragbench,
      title={MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models},
      author={Hu, Wenbo and Gu, Jia-Chen and Dou, Zi-Yi and Fayyaz, Mohsen and Lu, Pan and Chang, Kai-Wei and Peng, Nanyun},
      journal={arXiv preprint arXiv:2410.08182},
      year={2024}
    }

Captain is a multimodal RAG platform that indexes and searches across text, images, video, and audio using native multimodal embeddings. This evaluation measures Captain's image retrieval quality on a standardized academic benchmark.
This evaluation code is released under the MIT License. The MRAG-Bench dataset is released under CC-BY-4.0 by UCLA NLP.
