We conducted comprehensive ablation studies to optimize the RAG pipeline components and identify the most effective configurations. All experiments were performed using the FactBench dataset with the Gemma2:9B model, measuring both accuracy and computational efficiency.
The ablation study focused on five key components of the RAG methodology:
- Document Selection Methods
- Embedding Models
- Chunking Strategies
- Similarity Cut-off Mechanisms
- Top-K Retrieval Configuration
We evaluated both unsupervised and supervised retrieval approaches:
Unsupervised Methods:
- BM25: Traditional term frequency-based retrieval
- Contriever (MS-MARCO): Dense vector representations through contrastive learning
Supervised Methods:
- Jina.ai Reranker: Cross-encoder architecture with multilingual support (26 languages)
- MS MARCO MiniLM-L-6-v2: BERT-based architecture optimized for ranking
Results: MS MARCO MiniLM-L-6-v2 achieved the highest accuracy (90.14%) with acceptable latency (0.8172s).
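To make the supervised selection step concrete, here is a minimal sketch of cross-encoder reranking, assuming the public `cross-encoder/ms-marco-MiniLM-L-6-v2` checkpoint and the `sentence-transformers` `CrossEncoder` wrapper; the claim and candidate documents are illustrative placeholders, not FactBench items, and the actual pipeline code may differ.

```python
from sentence_transformers import CrossEncoder

# Placeholder claim and candidate documents (not FactBench items).
claim = "The Eiffel Tower is located in Paris."
documents = [
    "The Eiffel Tower is a wrought-iron lattice tower in Paris, France.",
    "The Statue of Liberty was a gift from France to the United States.",
    "Paris is the capital and most populous city of France.",
]

# Cross-encoder reranker: scores each (claim, document) pair jointly.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(claim, doc) for doc in documents])

# Keep the highest-scoring documents as candidates for the RAG context.
for doc, score in sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True):
    print(f"{score:7.3f}  {doc}")
```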
We tested diverse embedding models of varying sizes and capabilities:
| Model | Accuracy | Latency | Notes |
|---|---|---|---|
| bge-small-en-v1.5 | 90.14% | 1.70s | ✅ Best overall performance |
| stella_en_1.5B_v5 | 89.61% | 17.69s | Large model, slower inference |
| multilingual-e5-large-instruct | 89.54% | 5.00s | Good multilingual support |
| jina-embeddings-v3* | 88.52% | 4.87s | Memory constraints |
| gte-large-en-v1.5* | 89.71% | 5.86s | Memory constraints |
*Limited evaluation due to memory constraints
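For reference, a minimal sketch of dense retrieval with the best-performing embedder, assuming the public `BAAI/bge-small-en-v1.5` checkpoint and the `sentence-transformers` API; the chunks and query are illustrative placeholders, not pipeline data.

```python
from sentence_transformers import SentenceTransformer, util

# bge-small-en-v1.5 embedder (assumes the public BAAI/bge-small-en-v1.5 checkpoint).
embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")

chunks = [
    "The Eiffel Tower was completed in 1889.",
    "Mount Everest is the highest mountain above sea level.",
    "Gustave Eiffel's company designed and built the tower.",
]
query = "Who built the Eiffel Tower?"

# Dense retrieval: embed chunks and query, then rank chunks by cosine similarity.
chunk_embeddings = embedder.encode(chunks, normalize_embeddings=True)
query_embedding = embedder.encode(query, normalize_embeddings=True)
similarities = util.cos_sim(query_embedding, chunk_embeddings)[0]

for chunk, sim in sorted(zip(chunks, similarities.tolist()), key=lambda p: p[1], reverse=True):
    print(f"{sim:.3f}  {chunk}")
```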
We compared three approaches to text segmentation:
Fixed Chunking:
- Chunk sizes: 256, 512, 1024 tokens
- Best: 1024 tokens (89.46% accuracy, 0.024s latency)
Small2Big Hierarchical:
- Multi-tier: 128, 256, 512 tokens with parent-child relations
- Result: 88.89% accuracy, 0.192s latency
Sliding Window (Recommended):
- Window size 3: 90.14% accuracy, 0.031s latency
- Window size 6: 90.14% accuracy, 0.035s latency
Results: A sliding window of size 3 matches the accuracy of size 6 at slightly lower latency, giving the best balance of context preservation and performance.
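A minimal sketch of sentence-level sliding-window chunking with window size 3; the function name is ours, and sentence splitting is assumed to happen upstream, so this is not the pipeline's actual chunker.

```python
def sliding_window_chunks(sentences, window_size=3, stride=1):
    """Group consecutive sentences into overlapping windows.

    window_size=3 mirrors the best-performing setting above; sentence
    splitting is assumed to happen upstream.
    """
    last_start = max(len(sentences) - window_size, 0)
    return [
        " ".join(sentences[start:start + window_size])
        for start in range(0, last_start + 1, stride)
    ]


sentences = [
    "Sentence one.", "Sentence two.", "Sentence three.",
    "Sentence four.", "Sentence five.",
]
for chunk in sliding_window_chunks(sentences, window_size=3):
    print(chunk)
```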
We evaluated filtering mechanisms that retain only the most relevant chunks:
| Method | Accuracy | Latency Change |
|---|---|---|
| No cut-off (baseline) | 89.71% | - |
| Original score cut-off | 90.18% | -0.22s |
| Re-ranked score cut-off | 90.14% | -0.35s |
A similarity threshold of 0.3 gave the best balance between chunk quality and coverage.
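A minimal sketch of the cut-off step, assuming retrieved chunks arrive as (text, score) pairs; the helper name and sample scores are illustrative.

```python
def apply_similarity_cutoff(scored_chunks, threshold=0.3):
    """Drop retrieved chunks whose similarity score falls below the threshold.

    The 0.3 default matches the threshold reported above.
    """
    return [(chunk, score) for chunk, score in scored_chunks if score >= threshold]


retrieved = [("relevant evidence ...", 0.62), ("borderline text ...", 0.31), ("off-topic text ...", 0.12)]
print(apply_similarity_cutoff(retrieved))  # keeps only the first two chunks
```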
We compared different numbers of retrieved chunks:
| Configuration | Accuracy | Latency | Trade-off |
|---|---|---|---|
| Top-K 3 | 90.18% | 5.21s | Faster, slightly lower accuracy |
| Top-K 6 | 90.32% | 7.03s | Higher accuracy, increased latency |
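A minimal sketch of the final top-K selection applied after the similarity cut-off; the helper name and sample scores are illustrative.

```python
def select_top_k(scored_chunks, top_k=6):
    """Keep the k highest-scoring chunks that survived the similarity cut-off."""
    return sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)[:top_k]


scored = [("chunk a", 0.71), ("chunk b", 0.66), ("chunk c", 0.58), ("chunk d", 0.52),
          ("chunk e", 0.47), ("chunk f", 0.41), ("chunk g", 0.35)]
# top_k=6 keeps more evidence in the prompt (higher accuracy, more latency);
# top_k=3 trades a little accuracy for speed.
print([chunk for chunk, _ in select_top_k(scored, top_k=6)])
```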
Based on the ablation study results, the recommended RAG configuration is:

    rag:
      # Document selection
      document_selector: "ms-marco-MiniLM-L-6-v2"
      # Embedding model
      embedding_model: "bge-small-en-v1.5"
      # Chunking strategy
      chunking_strategy: "sliding_window"
      window_size: 3
      # Similarity filtering
      similarity_cutoff: 0.3
      cutoff_method: "original_score"
      # Retrieval configuration
      top_k: 6
      selected_documents: 10
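A minimal sketch of loading such a configuration with PyYAML; the file name `rag_config.yaml` is an assumption, not a path taken from the project.

```python
import yaml

# "rag_config.yaml" is an assumed file name containing the block above.
with open("rag_config.yaml") as f:
    config = yaml.safe_load(f)["rag"]

print(config["embedding_model"])                            # bge-small-en-v1.5
print(config["chunking_strategy"], config["window_size"])   # sliding_window 3
print(config["similarity_cutoff"], config["top_k"])         # 0.3 6
```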