Best submission score on Codabench:
NDCG@10 = 0.7586 · MAP = 0.6788 · R@100 = 0.9464
This repository contains two distinct best models, produced at different stages of the project.
→ models/best_score_after_presentations(0.7586).py: obtained after attending the oral presentations of other groups. Inspired by ideas from Group A (Beurtheret) and Group B (Salma Alkhalily et al.), we extended our existing pipeline with additional signals (sparse BM25/TF-IDF, citation context embeddings, linear fusion). The core of this model remains our original approach: the BGE + GTE + Granite dense fusion with RRF and the ontological domain bonus, which alone reaches 0.7337 of the final 0.7586 NDCG@10. The post-presentation additions brought a small but consistent improvement (+0.0249 NDCG@10) on top of that foundation.
→ models/best_score_before_presentations(0.7337).py and notebooks/best_score_notebook_(before_presentations).ipynb: our original contribution, developed independently before any inter-group exchange. It is already based on the triple dense model fusion (BGE-large + GTE-large + Granite R2) with weighted RRF and the hand-crafted ontological domain bonus, which is the central idea that drives performance in both models.
If you don't want to read everything, go to notebooks/best_score_notebook_(before_presentations).ipynb. It is the most interesting part: it holds the best pre-presentation model together with the error analysis and the data exploration. For the final best model, run models/best_score_after_presentations(0.7586).py directly.
The data is not included in this repository. Download it from the challenge platform and place the files as follows:
data/
├── corpus.parquet # 20,000 candidate papers
├── queries.parquet # 100 public train queries
├── held_out_queries.parquet # held-out queries (for final submission)
├── qrels.json # ground-truth: {query_id: [gold_doc_ids]}
└── sample_submission.json # example submission format
Pre-computed embeddings also go under data/embeddings/. Each model has its own subdirectory with embeddings.npy and ids.json (see Project Structure below for the full layout).
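As a minimal sketch of how this layout can be consumed, the snippet below loads one model's embeddings.npy and ids.json and ranks documents by dot product. It assumes ids.json is a JSON list aligned row by row with the embedding matrix. The helper names load_embeddings and top_k are illustrative and are not taken from the repository's scripts.

```python
import json
from pathlib import Path

import numpy as np


def load_embeddings(model_dir):
    """Load one model's vectors and the matching list of ids."""
    model_dir = Path(model_dir)
    vectors = np.load(model_dir / "embeddings.npy")
    ids = json.loads((model_dir / "ids.json").read_text())
    # Assumption: one id per row, in the same order as the matrix.
    if len(ids) != vectors.shape[0]:
        raise ValueError(f"{len(ids)} ids for {vectors.shape[0]} vectors")
    return vectors, ids


def top_k(query_vec, doc_vecs, doc_ids, k=100):
    """Rank documents by dot product with a single query vector."""
    scores = doc_vecs @ query_vec
    order = np.argsort(-scores, kind="stable")[:k]
    return [doc_ids[i] for i in order]
```

The length check is there because a mismatch between the two files would otherwise silently attach scores to the wrong papers.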
To re-encode with a different model:
python scripts/embed.py --model BAAI/bge-small-en-v1.5
# or for body chunks:
python scripts/embed_body.py --model BAAI/bge-large-en-v1.5
# Create and activate the virtual environment
python3.10 -m venv .venv
source .venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# For the Matryoshka + LLM pipeline only
pip install ollama sentence-transformers einops
ollama pull llama3.2
All scripts are run from starter_kit/ (the project root).
starter_kit/
│
├── data/
│   ├── corpus.parquet
│   ├── queries.parquet
│   ├── held_out_queries.parquet
│   ├── qrels.json
│   ├── cite_cache/  # citation context embeddings (BGE)
│   │   ├── train_cite_embeddings.npy  # (2826, 1024), per-citation BGE embeddings
│   │   ├── train_cite_query_idx.npy
│   │   ├── held_out_cite_embeddings.npy  # (2299, 1024)
│   │   └── held_out_cite_query_idx.npy
│   └── embeddings/
│       ├── sentence-transformers_all-MiniLM-L6-v2/  # baseline MiniLM embeddings
│       ├── BAAI_bge-large-en-v1.5/  # BGE-large TA + body chunk embeddings
│       │   ├── corpus_ta_embeddings.npy
│       │   ├── corpus_body_embeddings.npy
│       │   ├── corpus_body_doc_ids.json
│       │   ├── corpus_ids.json
│       │   ├── query_ta_embeddings.npy
│       │   ├── query_body_embeddings.npy
│       │   ├── held_out_query_ta_embeddings.npy
│       │   ├── held_out_query_body_embeddings.npy
│       │   └── train/ held_out/  # cite context pooled embeddings
│       ├── gte_large_*/  # GTE-large TA + chunk embeddings
│       ├── gte_modernbert_*/  # GTE-ModernBERT TA embeddings
│       ├── granite_r2_*/  # Granite R2 TA embeddings
│       ├── e5_large_*/  # E5-large TA embeddings
│       └── bgem3_*/  # BGE-M3 TA embeddings
│
├── models/
│   ├── best_score_after_presentations(0.7586).py  # ← BEST OVERALL (post-presentations)
│   ├── best_score_before_presentations(0.7337).py  # ← Best before presentations
│   └── other_models.py  # all other pipeline variants
│
├── notebooks/
│   ├── challenge_baseline.ipynb  # starter: TF-IDF, BM25, MiniLM dense
│   └── best_score_notebook_(before_presentations).ipynb  # best pre-presentation pipeline
│
├── scripts/
│   ├── embed.py  # encode corpus/queries with any HuggingFace model
│   ├── embed_body.py  # encode body chunks (section-level)
│   └── embed_cite_contexts.py  # encode in-text citation contexts
│
├── submissions/
│   ├── cache/  # intermediate results (auto-created, speeds up reruns)
│   │   ├── bm25_scores_train.npy / bm25_scores_held.npy
│   │   ├── tfidf_scores_train.npy / tfidf_scores_held.npy
│   │   ├── v9_bm25_body_train.npy / v9_bm25_body_held.npy
│   │   ├── matryoshka_top100_*.json
│   │   ├── llm_rerank_scores_*.json
│   │   └── nomic_embeddings/
│   ├── results/  # per-run JSON snapshots
│   ├── results_log.csv  # append-only experiment log
│   └── submission.zip  # final submission file
│
└── requirements.txt
| File | Role |
|---|---|
| models/best_score_after_presentations(0.7586).py | Final best model. 11-signal linear fusion (dense + sparse + domain + citations) with Optuna weight optimisation and ontological bonus. Run this to reproduce the Codabench score. |
| models/best_score_before_presentations(0.7337).py | Best model before presentations. BGE + GTE + Granite RRF with grid-searched weights and ontological domain bonus. |
| models/other_models.py | Main entry point for all other pipelines: sparse, rrf, matryoshka. |
| notebooks/best_score_notebook_(before_presentations).ipynb | Full pre-presentation pipeline in notebook form with data exploration and error analysis. |
| notebooks/challenge_baseline.ipynb | Step-by-step walkthrough: TF-IDF → BM25 → MiniLM dense retrieval. Good starting point. |
| scripts/embed.py | Re-encode the corpus or queries with any sentence-transformers model. |
| scripts/embed_body.py | Same but for body chunks. Produces chunk_embeddings.npy + paper_idx.npy. |
| scripts/embed_cite_contexts.py | Encodes in-text citation contexts per query paper. |
Run from starter_kit/. All pipelines write a submission zip and log results automatically.
Sparse
- tfidf: TF-IDF cosine similarity on title + abstract
- bm25_ta: BM25 on title + abstract (tokenized)
- bm25_full: BM25 on title + abstract + all body sections concatenated
Dense (pre-computed embeddings, no GPU needed at run time)
- bge: BGE-large-en-v1.5, TA (title + abstract) dot-product + MaxSim over body chunks
- gte: GTE-large, TA dot-product + MaxSim over body chunks
- gte_modernbert: GTE-ModernBERT-base, TA dot-product
- granite: Granite R2, TA dot-product
- e5: E5-large-instruct, TA dot-product
- bgem3: BGE-M3, TA dot-product
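The "TA dot-product + MaxSim over body chunks" scoring used by bge and gte can be sketched as follows. This is an illustration under assumptions, not the repository's code: MaxSim is read here as the best dot product between any query chunk and any chunk of a document, the two parts are combined with a hypothetical body_weight, and chunk_doc_idx is an assumed array mapping each body chunk to its document row.

```python
import numpy as np


def maxsim_scores(query_chunks, doc_chunks, chunk_doc_idx, n_docs):
    """Per document: best dot product between any query chunk and any of its chunks."""
    sims = query_chunks @ doc_chunks.T      # (n_query_chunks, n_doc_chunks)
    best_per_chunk = sims.max(axis=0)       # best query chunk for each document chunk
    scores = np.full(n_docs, -np.inf)
    # Unbuffered maximum: keeps the best chunk score for every document index.
    np.maximum.at(scores, chunk_doc_idx, best_per_chunk)
    scores[np.isinf(scores)] = 0.0          # documents that have no body chunks
    return scores


def dense_scores(query_ta, doc_ta, body_scores, body_weight=0.5):
    """Title+abstract dot product plus a weighted body score (weight is an assumption)."""
    return doc_ta @ query_ta + body_weight * body_scores
```

Documents without body chunks fall back to a body score of 0, so they are ranked on title + abstract alone.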
All dense methods include an ontological domain bonus: documents from the same domain as the query are boosted (+10), related domains less so (+5 / +2). This bonus is the single most impactful component of the pipeline; on Codabench, adding it to the triple dense fusion raises NDCG@10 from ~0.68 to 0.7337.
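A minimal sketch of such a bonus is shown below. The +10 / +5 / +2 tiers come from the description above. Everything else is an assumption made for illustration: the domain names, the RELATED map (the real ontology is hand-crafted in the model scripts), and the fact that the bonus is simply added to the retrieval score.

```python
import numpy as np

# Hypothetical relatedness map, for illustration only.
RELATED = {
    ("physics", "mathematics"): 5.0,   # closely related domains
    ("physics", "engineering"): 2.0,   # loosely related domains
}


def domain_bonus(query_domain, doc_domain, same=10.0):
    """Return the additive bonus for one query/document domain pair."""
    if query_domain == doc_domain:
        return same
    # The map is treated as symmetric: look the pair up in both orders.
    return RELATED.get(
        (query_domain, doc_domain),
        RELATED.get((doc_domain, query_domain), 0.0),
    )


def apply_bonus(scores, query_domain, doc_domains):
    """Add the domain bonus to a vector of retrieval scores."""
    bonus = np.array([domain_bonus(query_domain, d) for d in doc_domains])
    return np.asarray(scores, dtype=float) + bonus
```

If the underlying scores are much smaller than the bonus values, an additive bonus of this size behaves like a tiered sort: same-domain documents first, then related domains, with the dense score breaking ties inside each tier.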
Matryoshka + LLM
- Nomic-embed-text-v1.5 (768-dim, 2-stage: 64-dim coarse → 768-dim fine) fused with BGE via RRF, then Llama re-ranks the top-20 candidates pointwise (Yes/No) via Ollama.
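The coarse-to-fine search can be sketched with plain NumPy. The 64-dim and 768-dim stages follow the description above. The shortlist size and the function names are assumptions, and the sketch uses the usual Matryoshka recipe of truncating to the leading dimensions and re-normalising; it does not include the RRF fusion or the LLM re-ranking step.

```python
import numpy as np


def normalize(x):
    """L2-normalise along the last axis (assumes no all-zero vectors)."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def two_stage_search(query, docs, coarse_dim=64, shortlist=1000, k=100):
    """Shortlist with truncated vectors, then rescore the shortlist at full size."""
    # Stage 1: cheap cosine similarity on the leading `coarse_dim` dimensions.
    coarse = normalize(docs[:, :coarse_dim]) @ normalize(query[:coarse_dim])
    shortlist = min(shortlist, len(docs))
    candidates = np.argsort(-coarse)[:shortlist]
    # Stage 2: full-dimension cosine similarity, only on the shortlist.
    fine = normalize(docs[candidates]) @ normalize(query)
    order = np.argsort(-fine)[:k]
    return candidates[order]  # document row indices, best first
```

The second stage can only reorder what the first stage kept, so the shortlist should be comfortably larger than k.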
# ── Best model (post-presentations) ──────────────────────────────────
python models/best_score_after_presentations\(0.7586\).py
# ── Best model (pre-presentations) ───────────────────────────────────
python models/best_score_before_presentations\(0.7337\).py
# ── Full Matryoshka + LLM (requires Ollama) ──────────────────────────
python models/other_models.py --pipeline matryoshka
# ── Sparse baselines ─────────────────────────────────────────────────
python models/other_models.py --pipeline sparse
python models/other_models.py --pipeline sparse --sparse_methods bm25_full
# ── Dense RRF (our core approach) ────────────────────────────────────
python models/other_models.py --pipeline rrf --models bge gte granite
# ── Dense RRF with grid-searched weights ─────────────────────────────
python models/other_models.py --pipeline rrf --grid
# ── Matryoshka only (no LLM, fast) ───────────────────────────────────
python models/other_models.py --pipeline matryoshka --no_llm
# ── Ablation: disable ontological bonus ──────────────────────────────
python models/other_models.py --pipeline rrf --models bge gte --no_bonus
Caching: every expensive computation (BM25 index, Matryoshka embeddings, LLM scores, sparse matrices) is cached in submissions/cache/. Re-running the same pipeline is near-instant.
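The caching behaviour can be approximated with a small load-or-compute helper. This is a sketch and not the code in models/other_models.py: cached_array and the file name in the usage comment are hypothetical.

```python
from pathlib import Path

import numpy as np


def cached_array(path, compute):
    """Return the array stored at `path`, computing and saving it on a cache miss."""
    path = Path(path)  # should end in .npy, otherwise np.save appends the suffix
    if path.exists():
        return np.load(path)
    result = np.asarray(compute())
    path.parent.mkdir(parents=True, exist_ok=True)
    np.save(path, result)
    return result


# Usage (hypothetical file name and scoring function):
# scores = cached_array("submissions/cache/example_scores.npy", compute_scores)
```

A cache like this never notices that its inputs changed, so a stale file has to be deleted by hand after modifying the corpus or the scoring code.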
| Model | Description | Codabench NDCG@10 |
|---|---|---|
| TF-IDF baseline | Title + abstract, sparse | ~0.48 |
| Dense MiniLM | Precomputed embeddings | ~0.50 |
| BGE-large alone | TA + MaxSim body chunks | ~0.57 |
| BGE + GTE + Granite RRF | Triple dense fusion | ~0.68 |
| BGE + GTE + Granite + bonus | + ontological domain boost | 0.7337 ← before presentations |
| V9 post-presentations | + sparse signals + linear fusion + citations | 0.7586 ← best overall |
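
The RRF rows in this table rely on reciprocal rank fusion. As a rough sketch of the weighted variant, each model contributes w / (k + rank) for every document it ranks. The value k = 60 matches the rrf_k field logged in the run snapshots; the function name and the weights in the usage comment are illustrative, not the grid-searched values.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights=None, k=60):
    """Fuse rankings given as {model_name: [doc_id, ...]} with best documents first."""
    weights = weights or {}
    fused = defaultdict(float)
    for name, ranked in rankings.items():
        w = weights.get(name, 1.0)  # models without an explicit weight count once
        for rank, doc_id in enumerate(ranked, start=1):
            fused[doc_id] += w / (k + rank)
    # Highest fused score first; ties are broken by document id for determinism.
    return sorted(fused, key=lambda d: (-fused[d], d))


# Usage (illustrative weights):
# weighted_rrf({"bge": bge_ranked, "gte": gte_ranked, "granite": granite_ranked},
#              weights={"bge": 2.0})
```

Because only ranks enter the formula, the models do not need comparable score scales, which is convenient when mixing encoders.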
┌──────────────────────────────────────────────────────┐
│ Model                     NDCG@10   MAP     R@100    │
│──────────────────────────────────────────────────────│
│ Before presentations      0.7337    0.6561  0.9414   │
│ After presentations (V9)  0.7586    0.6788  0.9464   │
└──────────────────────────────────────────────────────┘
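For reference, the three metrics can be computed per query as in the sketch below, assuming binary relevance taken from the gold lists in qrels.json. The official Codabench scorer may differ in details such as how average precision is normalised, so these functions are an approximation for local checks and not the reference implementation.

```python
import math


def ndcg_at_k(ranked, gold, k=10):
    """Binary-relevance NDCG: discounted gain over the ideal ordering."""
    gold = set(gold)
    dcg = sum(1.0 / math.log2(i + 2) for i, d in enumerate(ranked[:k]) if d in gold)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(gold), k)))
    return dcg / ideal if ideal else 0.0


def average_precision(ranked, gold, k=100):
    """Mean of precision at each relevant hit, divided by the number of gold docs."""
    gold = set(gold)
    hits, total = 0, 0.0
    for i, d in enumerate(ranked[:k], start=1):
        if d in gold:
            hits += 1
            total += hits / i
    return total / len(gold) if gold else 0.0


def recall_at_k(ranked, gold, k=100):
    """Fraction of gold documents found in the top k."""
    gold = set(gold)
    return len(gold & set(ranked[:k])) / len(gold) if gold else 0.0
```

Averaging each function over all queries gives the NDCG@10, MAP and R@100 figures for a run.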
Every run automatically writes two outputs.
submissions/results_log.csv: append-only CSV, one row per run.
| Column | Content |
|---|---|
| timestamp | YYYYMMDD_HHMMSS |
| pipeline | rrf, matryoshka, or sparse |
| config_summary | JSON string of all relevant flags |
| final_label | name of the last evaluated stage |
| NDCG@10 | final score on the train query set |
| MAP@100 | |
| R@100 | |
| submission | path to the .zip file produced |
| snapshot | path to the full JSON snapshot |
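
To compare runs, the log can be read back with the standard csv module. The sketch below assumes the file starts with a header row that uses the column names listed above; best_runs is a hypothetical helper, not part of the repository.

```python
import csv


def best_runs(log_path, metric="NDCG@10", top=5):
    """Return the `top` logged runs with the highest value for `metric`."""
    with open(log_path, newline="", encoding="utf-8") as f:
        # Skip rows where the metric is missing or empty.
        rows = [r for r in csv.DictReader(f) if r.get(metric)]
    rows.sort(key=lambda r: float(r[metric]), reverse=True)
    return rows[:top]


# Usage:
# for run in best_runs("submissions/results_log.csv"):
#     print(run["timestamp"], run["pipeline"], run["NDCG@10"])
```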
submissions/results/: full JSON snapshot for one run, for example:
{
"timestamp": "20260418_214423",
"pipeline": "rrf",
"config": { "models": ["bge", "gte", "granite"], "rrf_k": 60, "grid": false },
"results": [
{ "label": "BGE-large (TA + MaxSim body)", "NDCG@10": 0.71, "MAP@100": 0.58, "R@100": 0.88 },
{ "label": "GTE-large (TA + MaxSim chunks)", "NDCG@10": 0.69 },
{ "label": "RRF final (BGE+GTE+Granite) + Bonus", "NDCG@10": 0.74 }
],
"final_NDCG@10": 0.74
}