Captain MRAG-Bench Evaluation

Evaluation of Captain's multimodal retrieval-augmented generation on MRAG-Bench (ICLR 2025).

MRAG-Bench is a vision-centric benchmark with 16,130 images and 1,353 human-annotated multiple-choice questions across 9 scenarios, designed to evaluate multimodal RAG systems in settings where retrieving images helps more than retrieving text. The results reported here cover 1,251 of those questions.

Setup

pip install -r requirements.txt

Set your Captain API credentials:

export CAPTAIN_API_KEY="your_api_key"
export CAPTAIN_ORG_ID="your_org_id"
export CAPTAIN_API_URL="https://api.runcaptain.com"  # or staging URL
export CAPTAIN_COLLECTION="multimodal-eval"
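
A script can check these four variables before making any request. The sketch below is a minimal example, not the code used by eval.py or index_corpus.py; the `CaptainConfig` class and `load_config` function are hypothetical names introduced for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class CaptainConfig:
    """Hypothetical container for the four CAPTAIN_* settings."""
    api_key: str
    org_id: str
    api_url: str
    collection: str


def load_config(env: Mapping[str, str] = os.environ) -> CaptainConfig:
    """Read the CAPTAIN_* variables and fail early if any is missing or empty."""
    names = [
        "CAPTAIN_API_KEY",
        "CAPTAIN_ORG_ID",
        "CAPTAIN_API_URL",
        "CAPTAIN_COLLECTION",
    ]
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return CaptainConfig(
        api_key=env["CAPTAIN_API_KEY"],
        org_id=env["CAPTAIN_ORG_ID"],
        # Drop a trailing slash so paths can be appended safely.
        api_url=env["CAPTAIN_API_URL"].rstrip("/"),
        collection=env["CAPTAIN_COLLECTION"],
    )
```

Passing a plain dict instead of `os.environ` makes the check easy to exercise in tests.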

Indexing the Corpus

Download the MRAG-Bench image corpus (~3.7GB, 16,130 images) and index it into a Captain collection:

python index_corpus.py --collection multimodal-eval

This uploads all corpus images to your Captain collection using the /v2/collections/{name}/index/file endpoint.
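
To make the endpoint concrete, the sketch below fills in the `{name}` placeholder and lists the image files that would be uploaded. It makes no request, because the request body and authentication headers are not described here. The helper names and the set of image suffixes are assumptions, not taken from index_corpus.py.

```python
from pathlib import Path
from urllib.parse import quote

# Assumed image formats; the actual corpus may use a different set.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def index_file_url(api_url: str, collection: str) -> str:
    """Fill the {name} placeholder of /v2/collections/{name}/index/file."""
    name = quote(collection, safe="")  # escape characters unsafe in a URL path
    return f"{api_url.rstrip('/')}/v2/collections/{name}/index/file"


def corpus_images(root: Path) -> list[Path]:
    """Return image files under root in a stable, sorted order."""
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
```

A sorted file list keeps repeated indexing runs in the same order, which makes an interrupted upload easier to resume.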

Running the Evaluation

# Full eval (1,251 questions)
python eval.py

# Quick test (first 50 questions)
python eval.py --limit 50

# Filter by scenario
python eval.py --scenario Angle

# Filter by image type
python eval.py --image-type Animal

# Change retrieval depth
python eval.py --top-k 10

# Run multiple top-k values
python eval.py --top-k 1 5 10 20
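
The flags above could be declared with argparse roughly as follows. This is a sketch of one possible parser, not the one in eval.py; the default for `--top-k` in particular is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the command-line flags shown in the examples above."""
    parser = argparse.ArgumentParser(description="MRAG-Bench retrieval evaluation (sketch)")
    parser.add_argument("--limit", type=int, default=None,
                        help="evaluate only the first N questions")
    parser.add_argument("--scenario", default=None,
                        help="keep one scenario, e.g. Angle")
    parser.add_argument("--image-type", default=None,
                        help="keep one image type, e.g. Animal")
    # nargs="+" accepts one or more values, so "--top-k 1 5 10 20" yields a list.
    parser.add_argument("--top-k", type=int, nargs="+", default=[5],
                        help="one or more retrieval depths (default is assumed)")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.limit, args.scenario, args.image_type, args.top_k)
```

argparse turns dashes in flag names into underscores, so the values are read as `args.image_type` and `args.top_k`.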

Results

Full evaluation on 1,251 questions from MRAG-Bench, with 16,130 corpus images indexed into Captain.

Overall

Metric	K=5	K=10
ContentHit@K	81.3%	81.8%
MRR	0.807	0.798
Precision@K	74.2%	70.3%

ContentHit@K: At least 1 retrieved image's VLM description contains the correct answer
MRR: Mean Reciprocal Rank of the first correct hit
Precision@K: Fraction of retrieved images whose descriptions match the answer

By Scenario

Scenario	Count	ContentHit@5	ContentHit@10	MRR@5	Precision@5
High Precision	120	100.0%	100.0%	1.000	98.5%
Scope	102	96.1%	96.1%	0.956	84.7%
Incomplete	102	94.1%	94.1%	0.931	91.0%
Obstruction	108	92.6%	92.6%	0.926	84.1%
Partial	246	86.6%	87.0%	0.862	78.3%
Angle	322	83.2%	84.2%	0.829	75.2%
Temporal	149	57.0%	57.7%	0.554	50.3%
Deformation	102	36.3%	37.3%	0.353	29.4%

By Aspect

Aspect	Count	ContentHit@5	ContentHit@10	MRR@5	Precision@5
High Precision	120	100.0%	100.0%	1.000	98.5%
Perspective	778	87.3%	87.8%	0.870	78.7%
Transformative	353	61.8%	62.3%	0.605	56.0%
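
The aspect rows are count-weighted pools of the scenario rows. The sketch below reproduces them from the By Scenario table. The grouping of scenarios into aspects is inferred from the Count column (778 = 102 + 108 + 246 + 322 and 353 = 102 + 149 + 102), and hit counts are recovered by rounding, so treat this as a consistency check rather than the evaluation code.

```python
# (question count, ContentHit@5 in percent) copied from the By Scenario table
SCENARIOS = {
    "High Precision": (120, 100.0),
    "Scope": (102, 96.1),
    "Incomplete": (102, 94.1),
    "Obstruction": (108, 92.6),
    "Partial": (246, 86.6),
    "Angle": (322, 83.2),
    "Temporal": (149, 57.0),
    "Deformation": (102, 36.3),
}

# Grouping inferred from the Count column of the By Aspect table.
ASPECTS = {
    "High Precision": ["High Precision"],
    "Perspective": ["Scope", "Obstruction", "Partial", "Angle"],
    "Transformative": ["Incomplete", "Temporal", "Deformation"],
}


def pooled(names):
    """Return (question count, ContentHit@5 percent) pooled over the named scenarios."""
    count = sum(SCENARIOS[n][0] for n in names)
    # Recover integer hit counts from the rounded percentages before pooling.
    hits = sum(round(SCENARIOS[n][0] * SCENARIOS[n][1] / 100) for n in names)
    return count, round(100 * hits / count, 1)


for aspect, names in ASPECTS.items():
    print(aspect, *pooled(names))
print("Overall", *pooled(SCENARIOS))
```

Pooling hit counts rather than averaging the percentages matters here: a plain average would give the 102-question scenarios the same weight as the 322-question Angle scenario.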

By Image Type

Image Type	Count	ContentHit@5	ContentHit@10	MRR@5	Precision@5
High Precision	120	100.0%	100.0%	1.000	98.5%
Keyboard	102	94.1%	94.1%	0.931	91.0%
Building	44	93.2%	93.2%	0.902	84.1%
Animal	416	91.8%	92.3%	0.916	80.0%
Car	286	77.3%	77.6%	0.769	73.7%
Flowers	178	63.5%	64.6%	0.626	55.4%
Cats	105	41.9%	42.9%	0.408	36.2%

Comparison with Published Baselines

The MRAG-Bench paper evaluates 14 LVLMs in an end-to-end setting: retrieve top-5 images with CLIP, then feed them to the model to answer the question. The table below compares Captain's retrieval quality (ContentHit@5) against the end-to-end accuracy of the best models from the paper using CLIP-retrieved images.

These metrics are not directly equivalent — ContentHit measures whether the retriever surfaces relevant images, while the paper measures whether the full system (retriever + LLM) answers correctly. However, retrieval quality is the bottleneck: the paper shows a 95% correlation between retriever Recall@5 and downstream LVLM accuracy. A system can only answer correctly if the retriever finds the right images first.

Overall

System	Overall	Perspective	Transformative	High Precision
Captain (retrieval)	81.3%	87.3%	61.8%	100.0%
GPT-4o + Retrieved RAG	68.96%	77.95%	54.9%	68.33%
Gemini Pro + Retrieved RAG	65.93%	73.29%	55.84%	65.0%
Claude 3.5 Sonnet + Retrieved RAG	63.56%	73.91%	47.0%	53.33%
GPT-4-Turbo + Retrieved RAG	58.95%	66.53%	49.06%	58.83%
Human + Retrieved RAG	61.38%	62.42%	54.36%	62.5%
Best Open-Source (LLaVA-OneVision) + Retrieved RAG	50.11%	50.93%	48.04%	54.17%

Captain's ContentHit@5 (81.3%) is higher than the overall end-to-end accuracy of every system in the table, including GPT-4o with retrieved images (68.96%). The comparison is indirect: Captain's number measures retrieval only (whether the right images are found), while the baselines cover both CLIP retrieval and the model answering the question. The size of the gap suggests that CLIP retrieval is a significant bottleneck for the baselines.

By Scenario (Captain vs. GPT-4o + Retrieved RAG)

Scenario	Captain	GPT-4o + CLIP RAG	Delta
High Precision	100.0%	68.33%	+31.7
Scope	96.1%	69.61%	+26.5
Incomplete	94.1%	30.95%	+63.2
Obstruction	92.6%	75.0%	+17.6
Partial	86.6%	78.86%	+7.7
Angle	83.2%	77.95%	+5.3
Temporal	57.0%	73.83%	-16.8
Deformation	36.3%	54.9%	-18.6
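
The Delta column is the difference between the two percentages in percentage points. A small sketch that recomputes it from the table values follows; it uses `Decimal` with half-up rounding so that ties such as 63.15 round the same way as in the table. The rounding mode is an assumption about how the column was produced.

```python
from decimal import Decimal, ROUND_HALF_UP

# (Captain ContentHit@5, GPT-4o + CLIP RAG accuracy), copied from the table above
ROWS = {
    "High Precision": ("100.0", "68.33"),
    "Scope": ("96.1", "69.61"),
    "Incomplete": ("94.1", "30.95"),
    "Obstruction": ("92.6", "75.0"),
    "Partial": ("86.6", "78.86"),
    "Angle": ("83.2", "77.95"),
    "Temporal": ("57.0", "73.83"),
    "Deformation": ("36.3", "54.9"),
}


def delta(captain: str, baseline: str) -> Decimal:
    """Difference in percentage points, rounded half-up to one decimal."""
    diff = Decimal(captain) - Decimal(baseline)
    return diff.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


for name, (captain, baseline) in ROWS.items():
    print(f"{name}: {delta(captain, baseline):+}")
```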

Captain scores well ahead on Scope, Obstruction, and Incomplete, where native multimodal embeddings excel at visual similarity matching. Scope and Obstruction are perspective scenarios, while the benchmark classes Incomplete as transformative. The Incomplete scenario (+63.2 points) is particularly striking: Captain finds relevant keyboard images far more effectively than CLIP.

The two scenarios where GPT-4o + CLIP RAG scores higher (Temporal, Deformation) are transformative tasks where the question asks about properties not directly visible in the retrieved images. GPT-4o's advantage here likely comes from its parametric knowledge (it can reason about temporal changes or engine specs) rather than from better retrieval.

Key Takeaways

Captain's retriever alone (81.3% ContentHit@5) scores above every full RAG system in the paper overall, including GPT-4o + CLIP retrieval (68.96%). Although the two metrics are not directly equivalent, this suggests that native multimodal embeddings retrieve better than CLIP on this image corpus.
The retrieval gap is the real bottleneck. The paper reports that even GPT-4o improves by only 5.82% over its no-retrieval baseline when given ground-truth images, while humans improve by 33.16%. Better retrieval (like Captain's) should translate into better downstream RAG performance.
Perspective scenarios are largely solved by Captain's approach — 87.3% ContentHit@5 across Angle, Partial, Scope, and Obstruction. These are the scenarios where visual similarity is most important.
Transformative scenarios remain challenging for all systems. These require reasoning about temporal changes, deformation, or processes that go beyond visual similarity matching.

Analysis

Strong performance (>90%) on scenarios where visual similarity maps directly to the correct answer (High Precision, Scope, Incomplete, Obstruction): identifying animals, buildings, geographic objects, and reading keyboards. Captain's native multimodal embeddings excel at matching visually similar content across different viewpoints.

Moderate performance (roughly 77-87%) on angle/partial views and car identification, where the retriever must match across different viewpoints or identify specific models from partial views.

Challenging scenarios (<60%):

Temporal (57%): Matching across significant temporal changes (e.g., buildings across decades, kittens to adult cats).
Cats (42%): Breed identification requires matching across age-related visual transformation.
Deformation (36%): Identifying car engine specs from deformed/damaged vehicle images requires knowledge not directly visible in the image.

These results align with the MRAG-Bench finding that transformative scenarios are fundamentally harder than perspective scenarios for retrieval systems.

Metrics

Metric	Description
ContentHit@K	Fraction of questions where at least 1 retrieved image's VLM description contains the correct answer
Precision@K	Average fraction of top-K results whose VLM descriptions match the correct answer
MRR	Mean Reciprocal Rank of the first content-matching result
FnHit@K	Fraction of questions where a ground-truth corpus filename appears in top-K
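
The four metrics can be sketched as below. Each question is represented by a ranked list of booleans saying whether each retrieved image's VLM description matches the answer; how that match is decided is not specified here and is left outside the sketch. Dividing Precision@K by the number of results actually returned (rather than by K) is an assumption, as are all function names.

```python
from typing import Sequence


def content_hit_at_k(matches: Sequence[bool], k: int) -> bool:
    """True if any of the top-k results matches the correct answer."""
    return any(matches[:k])


def precision_at_k(matches: Sequence[bool], k: int) -> float:
    """Fraction of the top-k results that match (0.0 if nothing was retrieved)."""
    top = matches[:k]
    return sum(top) / len(top) if top else 0.0


def reciprocal_rank(matches: Sequence[bool], k: int) -> float:
    """1/rank of the first matching result within the top k, else 0.0."""
    for rank, hit in enumerate(matches[:k], start=1):
        if hit:
            return 1.0 / rank
    return 0.0


def fn_hit_at_k(retrieved: Sequence[str], truth: set, k: int) -> bool:
    """True if a ground-truth corpus filename appears in the top k."""
    return any(name in truth for name in retrieved[:k])


def aggregate(per_question: Sequence[Sequence[bool]], k: int) -> dict:
    """Average the per-question metrics over all questions."""
    n = len(per_question)
    return {
        "ContentHit@K": sum(content_hit_at_k(m, k) for m in per_question) / n,
        "Precision@K": sum(precision_at_k(m, k) for m in per_question) / n,
        "MRR": sum(reciprocal_rank(m, k) for m in per_question) / n,
    }


if __name__ == "__main__":
    demo = [[True, False, True], [False, True, False], [False, False, False]]
    print(aggregate(demo, k=3))
```

In the three-question demo, two questions have a hit, the first correct results sit at ranks 1 and 2, and three of the nine retrieved items match.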

Output

Results are saved to results/ as JSON with full per-question details and aggregate metrics.
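
To inspect a run afterwards, a helper like the following can load the newest JSON file from results/. It assumes nothing about the file's schema beyond it being valid JSON, and `load_latest_results` is a hypothetical name, not part of the repository.

```python
import json
from pathlib import Path
from typing import Any, Optional


def load_latest_results(results_dir: Path = Path("results")) -> Optional[Any]:
    """Load the most recently modified *.json file in results_dir, or None if there is none."""
    files = sorted(results_dir.glob("*.json"), key=lambda p: p.stat().st_mtime)
    if not files:
        return None
    with files[-1].open(encoding="utf-8") as fh:
        return json.load(fh)


if __name__ == "__main__":
    data = load_latest_results()
    if data is None:
        print("No results found; run eval.py first.")
    elif isinstance(data, dict):
        # Print only the top-level keys, since the schema is not assumed.
        print(sorted(data))
```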

About MRAG-Bench

MRAG-Bench (Hu et al., ICLR 2025) systematically identifies scenarios where visual retrieval provides more benefit than text retrieval. The benchmark spans 9 scenarios: four for perspective understanding (Scope, Angle, Obstruction, Partial), four for transformative understanding (Temporal, Deformation, Biological, Incomplete), and an Others category.

Citation:

@article{hu2024mragbench,
  title={MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models},
  author={Hu, Wenbo and Gu, Jia-Chen and Dou, Zi-Yi and Fayyaz, Mohsen and Lu, Pan and Chang, Kai-Wei and Peng, Nanyun},
  journal={arXiv preprint arXiv:2410.08182},
  year={2024}
}

About Captain

Captain is a multimodal RAG platform that indexes and searches across text, images, video, and audio using native multimodal embeddings. This evaluation measures Captain's image retrieval quality on a standardized academic benchmark.

License

This evaluation code is released under the MIT License. The MRAG-Bench dataset is released under CC-BY-4.0 by UCLA NLP.
