runcaptain/captain-mrag-bench

Captain MRAG-Bench Evaluation

Evaluation of Captain's multimodal retrieval-augmented generation on MRAG-Bench (ICLR 2025).

MRAG-Bench is a vision-centric benchmark with 16,130 images and 1,251 human-annotated multiple-choice questions across 9 scenarios, designed to evaluate multimodal RAG systems in settings where retrieving images helps more than retrieving text.

MRAG-Bench Results

Setup

pip install -r requirements.txt

Set your Captain API credentials:

export CAPTAIN_API_KEY="your_api_key"
export CAPTAIN_ORG_ID="your_org_id"
export CAPTAIN_API_URL="https://api.runcaptain.com"  # or staging URL
export CAPTAIN_COLLECTION="multimodal-eval"

Indexing the Corpus

Download the MRAG-Bench image corpus (~3.7GB, 16,130 images) and index it into a Captain collection:

python index_corpus.py --collection multimodal-eval

This uploads all corpus images to your Captain collection using the /v2/collections/{name}/index/file endpoint.
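As a rough illustration of the indexing step, the sketch below walks a local corpus directory and builds the endpoint URL from the environment variables set above. The helper names (`index_file_url`, `iter_corpus_images`) and the list of image suffixes are assumptions made for this example, not taken from `index_corpus.py`. The upload request itself is left out, because the request format and authentication scheme are not described here.

```python
import os
from pathlib import Path
from urllib.parse import quote

# Assumed image formats; the actual corpus may use others.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def index_file_url(api_url: str, collection: str) -> str:
    """Build the /v2/collections/{name}/index/file URL for a collection."""
    base = api_url.rstrip("/")
    return f"{base}/v2/collections/{quote(collection, safe='')}/index/file"


def iter_corpus_images(corpus_dir: str):
    """Yield image paths under corpus_dir in a stable (sorted) order."""
    for path in sorted(Path(corpus_dir).rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            yield path


if __name__ == "__main__":
    url = index_file_url(
        os.environ.get("CAPTAIN_API_URL", "https://api.runcaptain.com"),
        os.environ.get("CAPTAIN_COLLECTION", "multimodal-eval"),
    )
    # Each image yielded by iter_corpus_images() would be sent to this URL;
    # the request body and auth headers are intentionally not shown.
    print(url)
```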

Running the Evaluation

# Full eval (1,251 questions)
python eval.py

# Quick test (first 50 questions)
python eval.py --limit 50

# Filter by scenario
python eval.py --scenario Angle

# Filter by image type
python eval.py --image-type Animal

# Change retrieval depth
python eval.py --top-k 10

# Run multiple top-k values
python eval.py --top-k 1 5 10 20
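
The flags above could be parsed along the following lines. This is a minimal sketch, not the actual parser in `eval.py`: the defaults (including `--top-k` defaulting to 5) and the help strings are assumptions. The point it illustrates is that `nargs="+"` lets `--top-k` accept either one value or several.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the flags shown above (defaults assumed)."""
    parser = argparse.ArgumentParser(description="MRAG-Bench retrieval eval (sketch)")
    parser.add_argument("--limit", type=int, default=None,
                        help="evaluate only the first N questions")
    parser.add_argument("--scenario", default=None,
                        help="keep only questions from one scenario, e.g. Angle")
    parser.add_argument("--image-type", default=None,
                        help="keep only questions of one image type, e.g. Animal")
    # nargs='+' collects one or more integers into a list.
    parser.add_argument("--top-k", type=int, nargs="+", default=[5],
                        help="one or more retrieval depths")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--top-k", "1", "5", "10", "20"])
    print(args.top_k)
```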

Results

Full evaluation on 1,251 questions from MRAG-Bench, with 16,130 corpus images indexed into Captain.

Overall

| Metric | K=5 | K=10 |
| --- | --- | --- |
| ContentHit@K | 81.3% | 81.8% |
| MRR | 0.807 | 0.798 |
| Precision@K | 74.2% | 70.3% |

  • ContentHit@K: At least 1 retrieved image's VLM description contains the correct answer
  • MRR: Mean Reciprocal Rank of the first correct hit
  • Precision@K: Fraction of retrieved images whose descriptions match the answer
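
The three definitions above can be sketched in a few lines. The input format is an assumption for illustration: each question is represented as a ranked list of booleans, where `True` means that retrieved image's VLM description contains the correct answer. How that match is decided is not shown here, and Precision@K is assumed to divide by the number of results actually returned within the top K.

```python
from statistics import mean


def content_hit(matches: list[bool], k: int) -> float:
    """1.0 if any of the top-k results matches the answer, else 0.0."""
    return 1.0 if any(matches[:k]) else 0.0


def reciprocal_rank(matches: list[bool], k: int) -> float:
    """1/rank of the first matching result within the top k, or 0.0 if none."""
    for rank, is_match in enumerate(matches[:k], start=1):
        if is_match:
            return 1.0 / rank
    return 0.0


def precision(matches: list[bool], k: int) -> float:
    """Fraction of the returned top-k results that match the answer."""
    top = matches[:k]
    return sum(top) / len(top) if top else 0.0


def aggregate(per_question: list[list[bool]], k: int) -> dict[str, float]:
    """Average each per-question metric over all questions."""
    return {
        "ContentHit@K": mean(content_hit(m, k) for m in per_question),
        "MRR": mean(reciprocal_rank(m, k) for m in per_question),
        "Precision@K": mean(precision(m, k) for m in per_question),
    }


if __name__ == "__main__":
    # Three toy questions with three ranked results each.
    toy = [[True, False, True], [False, False, False], [False, True, False]]
    print(aggregate(toy, k=3))
```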

By Scenario

| Scenario | Count | ContentHit@5 | ContentHit@10 | MRR@5 | Precision@5 |
| --- | --- | --- | --- | --- | --- |
| High Precision | 120 | 100.0% | 100.0% | 1.000 | 98.5% |
| Scope | 102 | 96.1% | 96.1% | 0.956 | 84.7% |
| Incomplete | 102 | 94.1% | 94.1% | 0.931 | 91.0% |
| Obstruction | 108 | 92.6% | 92.6% | 0.926 | 84.1% |
| Partial | 246 | 86.6% | 87.0% | 0.862 | 78.3% |
| Angle | 322 | 83.2% | 84.2% | 0.829 | 75.2% |
| Temporal | 149 | 57.0% | 57.7% | 0.554 | 50.3% |
| Deformation | 102 | 36.3% | 37.3% | 0.353 | 29.4% |

By Aspect

| Aspect | Count | ContentHit@5 | ContentHit@10 | MRR@5 | Precision@5 |
| --- | --- | --- | --- | --- | --- |
| High Precision | 120 | 100.0% | 100.0% | 1.000 | 98.5% |
| Perspective | 778 | 87.3% | 87.8% | 0.870 | 78.7% |
| Transformative | 353 | 61.8% | 62.3% | 0.605 | 56.0% |
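
The aspect rows can be read as count-weighted averages of the scenario rows. The sketch below checks that arithmetic using the counts and ContentHit@5 values from the By Scenario table. The grouping is an assumption based on this README's description of the benchmark: Perspective covers Scope, Obstruction, Partial, and Angle, while Transformative covers Incomplete, Temporal, and Deformation.

```python
# (question count, ContentHit@5 in percent), copied from the By Scenario table.
SCENARIOS = {
    "High Precision": (120, 100.0),
    "Scope": (102, 96.1),
    "Incomplete": (102, 94.1),
    "Obstruction": (108, 92.6),
    "Partial": (246, 86.6),
    "Angle": (322, 83.2),
    "Temporal": (149, 57.0),
    "Deformation": (102, 36.3),
}

# Assumed aspect grouping (see lead-in).
ASPECTS = {
    "Perspective": ["Scope", "Obstruction", "Partial", "Angle"],
    "Transformative": ["Incomplete", "Temporal", "Deformation"],
}


def weighted_hit(names: list[str]) -> float:
    """Count-weighted mean of ContentHit@5 over the named scenarios."""
    total = sum(SCENARIOS[n][0] for n in names)
    return sum(SCENARIOS[n][0] * SCENARIOS[n][1] for n in names) / total


if __name__ == "__main__":
    for aspect, names in ASPECTS.items():
        print(f"{aspect}: {weighted_hit(names):.2f}%")
    print(f"Overall: {weighted_hit(list(SCENARIOS)):.2f}%")
```

The recomputed values land within about 0.1 points of the table. The small difference presumably comes from the per-scenario percentages already being rounded.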

By Image Type

| Image Type | Count | ContentHit@5 | ContentHit@10 | MRR@5 | Precision@5 |
| --- | --- | --- | --- | --- | --- |
| High Precision | 120 | 100.0% | 100.0% | 1.000 | 98.5% |
| Keyboard | 102 | 94.1% | 94.1% | 0.931 | 91.0% |
| Building | 44 | 93.2% | 93.2% | 0.902 | 84.1% |
| Animal | 416 | 91.8% | 92.3% | 0.916 | 80.0% |
| Car | 286 | 77.3% | 77.6% | 0.769 | 73.7% |
| Flowers | 178 | 63.5% | 64.6% | 0.626 | 55.4% |
| Cats | 105 | 41.9% | 42.9% | 0.408 | 36.2% |

Comparison with Published Baselines

The MRAG-Bench paper evaluates 14 LVLMs in an end-to-end setting: retrieve top-5 images with CLIP, then feed them to the model to answer the question. The table below compares Captain's retrieval quality (ContentHit@5) against the end-to-end accuracy of the best models from the paper using CLIP-retrieved images.

These metrics are not directly equivalent: ContentHit measures whether the retriever surfaces relevant images, while the paper measures whether the full system (retriever + LLM) answers correctly. Retrieval quality still matters for the comparison, because the paper shows a 95% correlation between retriever Recall@5 and downstream LVLM accuracy, and a model can only use retrieved evidence that the retriever actually surfaces.

Overall

| System | Overall | Perspective | Transformative | High Precision |
| --- | --- | --- | --- | --- |
| Captain (retrieval) | 81.3% | 87.3% | 61.8% | 100.0% |
| GPT-4o + Retrieved RAG | 68.96% | 77.95% | 54.9% | 68.33% |
| Gemini Pro + Retrieved RAG | 65.93% | 73.29% | 55.84% | 65.0% |
| Claude 3.5 Sonnet + Retrieved RAG | 63.56% | 73.91% | 47.0% | 53.33% |
| GPT-4-Turbo + Retrieved RAG | 58.95% | 66.53% | 49.06% | 58.83% |
| Human + Retrieved RAG | 61.38% | 62.42% | 54.36% | 62.5% |
| Best Open-Source (LLaVA-OneVision) + Retrieved RAG | 50.11% | 50.93% | 48.04% | 54.17% |

Overall, Captain's ContentHit@5 (81.3%) is higher than the end-to-end accuracy of every system listed above, including GPT-4o with retrieved images (68.96%). Captain's number measures retrieval only (whether the right images are found), while the baselines measure retrieval and answer generation together, so the gap suggests that their retriever (CLIP) is a significant bottleneck.

By Scenario (Captain vs. GPT-4o + Retrieved RAG)

| Scenario | Captain | GPT-4o + CLIP RAG | Delta (points) |
| --- | --- | --- | --- |
| High Precision | 100.0% | 68.33% | +31.7 |
| Scope | 96.1% | 69.61% | +26.5 |
| Incomplete | 94.1% | 30.95% | +63.2 |
| Obstruction | 92.6% | 75.0% | +17.6 |
| Partial | 86.6% | 78.86% | +7.7 |
| Angle | 83.2% | 77.95% | +5.3 |
| Temporal | 57.0% | 73.83% | -16.8 |
| Deformation | 36.3% | 54.9% | -18.6 |

Captain's largest gains are on Incomplete (+63.2), High Precision (+31.7), Scope (+26.5), and Obstruction (+17.6), where native multimodal embeddings excel at visual similarity matching. The Incomplete scenario is particularly striking: Captain finds relevant keyboard images far more effectively than CLIP.

The two scenarios where GPT-4o + CLIP RAG scores higher (Temporal, Deformation) are transformative tasks where the question asks about properties not directly visible in the retrieved images. GPT-4o's advantage here likely comes from its parametric knowledge (it can reason about temporal changes or engine specs), not from better retrieval.

Key Takeaways

  1. Captain's retrieval score alone (81.3% ContentHit@5) exceeds the end-to-end accuracy of every full RAG system in the paper, including GPT-4o + CLIP retrieval (68.96%). This supports the quality of native multimodal embeddings over CLIP for image corpus retrieval.

  2. Retrieval quality is a key bottleneck. The paper reports that even GPT-4o improves by only 5.82% when given ground-truth images compared with no retrieval, while humans improve by 33.16%. Better retrieval (like Captain's) gives downstream models better evidence to work with.

  3. Perspective scenarios are where Captain's approach is strongest: 87.3% ContentHit@5 across Angle, Partial, Scope, and Obstruction. These are the scenarios where visual similarity is most important.

  4. Transformative scenarios remain challenging for all systems. These require reasoning about temporal changes, deformation, or processes that go beyond visual similarity matching.

Analysis

Strong performance (>90%) on the Scope, Obstruction, Incomplete, and High Precision scenarios, where visual similarity maps directly to the correct answer: identifying animals, buildings, geographic objects, and reading keyboards. Captain's native multimodal embeddings excel at matching visually similar content across different viewpoints.

Moderate performance (60-90%) on the Angle and Partial scenarios and on car and flower identification, where the retriever must match across different viewpoints or identify specific models from partial views.

Challenging categories (<60%):

  • Temporal (57%): Matching across significant temporal changes (e.g., buildings across decades, kittens to adult cats).
  • Cats (42%): Breed identification requires matching across age-related visual transformation.
  • Deformation (36%): Identifying car engine specs from deformed/damaged vehicle images requires knowledge not directly visible in the image.

These results align with the MRAG-Bench finding that transformative scenarios are fundamentally harder than perspective scenarios for retrieval systems.

Metrics

| Metric | Description |
| --- | --- |
| ContentHit@K | Fraction of questions where at least 1 retrieved image's VLM description contains the correct answer |
| Precision@K | Average fraction of top-K results whose VLM descriptions match the correct answer |
| MRR | Mean Reciprocal Rank of the first content-matching result |
| FnHit@K | Fraction of questions where a ground-truth corpus filename appears in top-K |
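
Unlike the content-based metrics, FnHit@K only compares filenames. A minimal sketch of that check is shown below, assuming each question comes with a ranked list of retrieved filenames and a set of ground-truth filenames. The function names and example filenames are hypothetical and not taken from `eval.py`.

```python
def fn_hit(retrieved: list[str], ground_truth: set[str], k: int) -> bool:
    """True if any ground-truth filename appears in the top-k retrieved names."""
    return any(name in ground_truth for name in retrieved[:k])


def fn_hit_at_k(questions: list[tuple[list[str], set[str]]], k: int) -> float:
    """Fraction of questions with at least one ground-truth filename in top-k."""
    if not questions:
        return 0.0
    return sum(fn_hit(r, gt, k) for r, gt in questions) / len(questions)


if __name__ == "__main__":
    # Hypothetical filenames, for illustration only.
    questions = [
        (["img_a.jpg", "img_b.jpg"], {"img_b.jpg"}),
        (["img_c.jpg", "img_d.jpg"], {"img_z.jpg"}),
    ]
    print(fn_hit_at_k(questions, k=1), fn_hit_at_k(questions, k=2))
```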

Output

Results are saved to results/ as JSON with full per-question details and aggregate metrics.

About MRAG-Bench

MRAG-Bench (Hu et al., ICLR 2025) systematically identifies scenarios where visual retrieval provides more benefit than text retrieval. The benchmark spans 9 scenarios across perspective understanding (Scope, Angle, Obstruction, Partial) and transformative understanding (Temporal, Deformation, Biological, Incomplete).

Citation:

@article{hu2024mragbench,
  title={MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models},
  author={Hu, Wenbo and Gu, Jia-Chen and Dou, Zi-Yi and Fayyaz, Mohsen and Lu, Pan and Chang, Kai-Wei and Peng, Nanyun},
  journal={arXiv preprint arXiv:2410.08182},
  year={2024}
}

About Captain

Captain is a multimodal RAG platform that indexes and searches across text, images, video, and audio using native multimodal embeddings. This evaluation measures Captain's image retrieval quality on a standardized academic benchmark.

License

This evaluation code is released under the MIT License. The MRAG-Bench dataset is released under CC-BY-4.0 by UCLA NLP.
