Scientific Article Information Retrieval — Challenge

Best submission score on Codabench: NDCG@10 = 0.7586 · MAP = 0.6788 · R@100 = 0.9464

⚠️ Important Notice — Two Generations of Models

This repository contains two distinct best models, produced at different stages of the project.

🥇 Best model overall — after presentations (NDCG@10 = 0.7586 on Codabench)

→ models/best_score_after_presentations(0.7586).py

Obtained after attending the oral presentations of other groups. Inspired by ideas from Group A (Beurtheret) and Group B (Salma Alkhalily et al.), we extended our existing pipeline with additional signals (sparse BM25/TF-IDF, citation context embeddings, linear fusion). The core of this model remains our original approach — the BGE + GTE + Granite dense fusion with RRF and the ontological domain bonus — which alone accounted for the large majority of the final score. The post-presentation additions brought marginal but consistent improvements on top of that foundation.

🥈 Best model before presentations (NDCG@10 = 0.7337 on Codabench)

→ models/best_score_before_presentations(0.7337).py and notebooks/best_score_notebook_(before_presentations).ipynb

Our original contribution, developed independently before any inter-group exchange. It already uses the triple dense model fusion (BGE-large + GTE-large + Granite R2) with weighted RRF and the hand-crafted ontological domain bonus, which is the central idea that drives performance in both models.
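As an illustration of the weighted RRF mentioned above, here is a minimal sketch, not the code from the model files. The function name `weighted_rrf`, the example doc ids and the uniform weights are assumptions; `k=60` mirrors the `rrf_k` value shown in the run snapshot further down.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked lists of doc ids with weighted Reciprocal Rank Fusion.

    rankings: dict model_name -> list of doc ids, best first
    weights:  dict model_name -> float (hypothetical per-model weights)
    k:        smoothing constant; larger values flatten rank differences
    """
    fused = defaultdict(float)
    for name, ranked_ids in rankings.items():
        w = weights.get(name, 1.0)
        for rank, doc_id in enumerate(ranked_ids, start=1):
            fused[doc_id] += w / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(fused, key=lambda d: (-fused[d], d))


# Toy example with made-up doc ids.
rankings = {
    "bge": ["d1", "d2", "d3"],
    "gte": ["d2", "d1", "d4"],
    "granite": ["d2", "d3", "d1"],
}
weights = {"bge": 1.0, "gte": 1.0, "granite": 1.0}
print(weighted_rrf(rankings, weights))  # ['d2', 'd1', 'd3', 'd4']
```

Because RRF only looks at ranks, the three models do not need to produce scores on a comparable scale.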

Get things done quickly

If you don't want to read everything, go to notebooks/best_score_notebook_(before_presentations).ipynb. It is the most interesting part: it holds the best model before presentations, together with the error analysis and the data exploration. For the final best model, run models/best_score_after_presentations(0.7586).py directly.

Getting the data

The data is not included in this repository. Download it from the challenge platform and place the files as follows:

data/
├── corpus.parquet           # 20,000 candidate papers
├── queries.parquet          # 100 public train queries
├── held_out_queries.parquet # held-out queries (for final submission)
├── qrels.json               # ground-truth: {query_id: [gold_doc_ids]}
└── sample_submission.json   # example submission format

Pre-computed embeddings also go under data/embeddings/. Each model has its own subdirectory with embeddings.npy and ids.json (see Project Structure below for the full layout).
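A minimal sketch of how such a subdirectory could be loaded and queried, assuming `ids.json` is a JSON list whose order matches the rows of `embeddings.npy`. The helper names `load_embeddings` and `top_k_by_dot` are hypothetical and do not come from the repository.

```python
import json
from pathlib import Path

import numpy as np


def load_embeddings(model_dir):
    """Return (ids, matrix) from a directory holding embeddings.npy and ids.json."""
    model_dir = Path(model_dir)
    matrix = np.load(model_dir / "embeddings.npy")
    with open(model_dir / "ids.json", encoding="utf-8") as fh:
        ids = json.load(fh)
    # Assumption: one id per embedding row, in the same order.
    if matrix.shape[0] != len(ids):
        raise ValueError(f"{matrix.shape[0]} embedding rows but {len(ids)} ids")
    return ids, matrix


def top_k_by_dot(query_vec, ids, matrix, k=10):
    """Rank documents by dot product with a single query vector."""
    scores = matrix @ query_vec
    order = np.argsort(-scores)[:k]
    return [(ids[i], float(scores[i])) for i in order]
```

The shape check catches the most common failure mode, an ids file and an embedding matrix produced by different runs.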

To re-encode with a different model:

python scripts/embed.py --model BAAI/bge-small-en-v1.5
# or for body chunks:
python scripts/embed_body.py --model BAAI/bge-large-en-v1.5

Setup

# Create and activate the virtual environment
python3.10 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# For the Matryoshka + LLM pipeline only
pip install ollama sentence-transformers einops
ollama pull llama3.2

All scripts are run from starter_kit/ (the project root).

Project Structure

starter_kit/
│
├── data/
│   ├── corpus.parquet
│   ├── queries.parquet
│   ├── held_out_queries.parquet
│   ├── qrels.json
│   ├── cite_cache/                                    # citation context embeddings (BGE)
│   │   ├── train_cite_embeddings.npy                  # (2826, 1024) — per-citation BGE embeddings
│   │   ├── train_cite_query_idx.npy
│   │   ├── held_out_cite_embeddings.npy               # (2299, 1024)
│   │   └── held_out_cite_query_idx.npy
│   └── embeddings/
│       ├── sentence-transformers_all-MiniLM-L6-v2/   # baseline MiniLM embeddings
│       ├── BAAI_bge-large-en-v1.5/                   # BGE-large TA + body chunk embeddings
│       │   ├── corpus_ta_embeddings.npy
│       │   ├── corpus_body_embeddings.npy
│       │   ├── corpus_body_doc_ids.json
│       │   ├── corpus_ids.json
│       │   ├── query_ta_embeddings.npy
│       │   ├── query_body_embeddings.npy
│       │   ├── held_out_query_ta_embeddings.npy
│       │   ├── held_out_query_body_embeddings.npy
│       │   └── train/ held_out/                      # cite context pooled embeddings
│       ├── gte_large_*/                              # GTE-large TA + chunk embeddings
│       ├── gte_modernbert_*/                         # GTE-ModernBERT TA embeddings
│       ├── granite_r2_*/                             # Granite R2 TA embeddings
│       ├── e5_large_*/                               # E5-large TA embeddings
│       └── bgem3_*/                                  # BGE-M3 TA embeddings
│
├── models/
│   ├── best_score_after_presentations(0.7586).py  # ← BEST OVERALL (post-presentations)
│   ├── best_score_before_presentations(0.7337).py # ← Best before presentations
│   └── other_models.py                            # all other pipeline variants
│
├── notebooks/
│   ├── challenge_baseline.ipynb                   # starter: TF-IDF, BM25, MiniLM dense
│   └── best_score_notebook_(before_presentations).ipynb  # best pre-presentation pipeline
│
├── scripts/
│   ├── embed.py               # encode corpus/queries with any HuggingFace model
│   ├── embed_body.py          # encode body chunks (section-level)
│   └── embed_cite_contexts.py # encode in-text citation contexts
│
├── submissions/
│   ├── cache/                 # intermediate results (auto-created, speeds up reruns)
│   │   ├── bm25_scores_train.npy / bm25_scores_held.npy
│   │   ├── tfidf_scores_train.npy / tfidf_scores_held.npy
│   │   ├── v9_bm25_body_train.npy / v9_bm25_body_held.npy
│   │   ├── matryoshka_top100_*.json
│   │   ├── llm_rerank_scores_*.json
│   │   └── nomic_embeddings/
│   ├── results/               # per-run JSON snapshots
│   ├── results_log.csv        # append-only experiment log
│   └── submission.zip         # final submission file
│
└── requirements.txt

Files quick reference

File	Role
`models/best_score_after_presentations(0.7586).py`	Final best model. 11-signal linear fusion (dense + sparse + domain + citations) with Optuna weight optimisation and ontological bonus. Run this to reproduce the Codabench score.
`models/best_score_before_presentations(0.7337).py`	Best model before presentations. BGE + GTE + Granite RRF with grid-searched weights and ontological domain bonus.
`models/other_models.py`	Main entry point for all other pipelines: sparse, rrf, matryoshka.
`notebooks/best_score_notebook_(before_presentations).ipynb`	Full pre-presentation pipeline in notebook form with data exploration and error analysis.
`notebooks/challenge_baseline.ipynb`	Step-by-step walkthrough: TF-IDF → BM25 → MiniLM dense retrieval. Good starting point.
`scripts/embed.py`	Re-encode the corpus or queries with any sentence-transformers model.
`scripts/embed_body.py`	Same but for body chunks. Produces `chunk_embeddings.npy` + `paper_idx.npy`.
`scripts/embed_cite_contexts.py`	Encodes in-text citation contexts per query paper.

other_models.py — all pipelines

Run from starter_kit/. All pipelines write a submission zip and log results automatically.

Methods available

Sparse

tfidf — TF-IDF cosine similarity on title + abstract
bm25_ta — BM25 on title + abstract (tokenized)
bm25_full — BM25 on title + abstract + all body sections concatenated
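For readers unfamiliar with BM25, the scoring behind the two `bm25_*` methods can be sketched in a few lines. This is a generic Okapi BM25 with the non-negative (Lucene-style) IDF, not necessarily the variant or library used in `other_models.py`; the parameters `k1=1.5` and `b=0.75` are common defaults chosen here as an assumption.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against a tokenized query with Okapi BM25."""
    if not docs_tokens:
        return []
    n_docs = len(docs_tokens)
    avgdl = sum(len(doc) for doc in docs_tokens) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs_tokens:
        df.update(set(doc))

    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation: long documents are penalised via b.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["dense", "retrieval", "fusion"],
    ["sparse", "bm25", "retrieval"],
    ["protein", "folding"],
]
print(bm25_scores(["bm25", "retrieval"], docs))
```

The only difference between `bm25_ta` and `bm25_full` in this picture is which text is tokenized into `docs_tokens`.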

Dense (pre-computed embeddings, no GPU needed at run time)

bge — BGE-large-en-v1.5, TA dot-product + MaxSim over body chunks
gte — GTE-large, TA dot-product + MaxSim over body chunks
gte_modernbert — GTE-ModernBERT-base, TA dot-product
granite — Granite R2, TA dot-product
e5 — E5-large-instruct, TA dot-product
bgem3 — BGE-M3, TA dot-product
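A minimal sketch of "TA dot-product + MaxSim over body chunks" as listed for `bge` and `gte`. It assumes a single query vector, reads MaxSim as the highest dot product between that vector and any body chunk of a document, and mixes the two scores with a hypothetical weight `alpha`. The actual combination rule in the model files may differ.

```python
import numpy as np


def ta_plus_maxsim(query_vec, doc_ta, chunk_embs, chunk_doc_idx, alpha=0.5):
    """Combine title+abstract similarity with the best body-chunk similarity.

    query_vec:     (d,)            query embedding
    doc_ta:        (n_docs, d)     title+abstract embeddings
    chunk_embs:    (n_chunks, d)   body chunk embeddings
    chunk_doc_idx: (n_chunks,)     row of doc_ta each chunk belongs to
    alpha:         hypothetical mixing weight between TA and MaxSim
    """
    ta_scores = doc_ta @ query_vec
    chunk_scores = chunk_embs @ query_vec

    # Per-document maximum over its chunks.
    max_chunk = np.full(doc_ta.shape[0], -np.inf)
    np.maximum.at(max_chunk, chunk_doc_idx, chunk_scores)

    # Documents without any body chunk fall back to their TA score.
    max_chunk = np.where(np.isneginf(max_chunk), ta_scores, max_chunk)
    return alpha * ta_scores + (1 - alpha) * max_chunk
```

With unit-normalised embeddings the dot product equals cosine similarity, so both terms live on the same scale.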

All dense methods include an ontological domain bonus: documents from the same domain as the query are boosted (+10), related domains less so (+5 / +2). This bonus is the single most impactful component of the pipeline.
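The bonus tiers described above could be applied along these lines. This is only a sketch: the domain names, the `RELATED_BONUS` table and the choice to add the bonus directly to a per-document score are assumptions for illustration, and the real ontology lives in the model files.

```python
SAME_DOMAIN_BONUS = 10.0

# Hypothetical relatedness table: unordered domain pairs and their bonus tier.
RELATED_BONUS = {
    ("Computer Science", "Mathematics"): 5.0,
    ("Computer Science", "Physics"): 2.0,
}


def domain_bonus(query_domain, doc_domain):
    """Return the bonus for a (query, document) domain pair."""
    if query_domain is None or doc_domain is None:
        return 0.0
    if query_domain == doc_domain:
        return SAME_DOMAIN_BONUS
    # Relatedness is treated as symmetric in this sketch.
    pair, reverse = (query_domain, doc_domain), (doc_domain, query_domain)
    return RELATED_BONUS.get(pair, RELATED_BONUS.get(reverse, 0.0))


def apply_bonus(scores, query_domain, doc_domains):
    """scores: dict doc_id -> float; doc_domains: dict doc_id -> domain name."""
    return {
        doc_id: score + domain_bonus(query_domain, doc_domains.get(doc_id))
        for doc_id, score in scores.items()
    }
```

How strongly the bonus reorders results depends on the scale of the base scores: added to cosine similarities in [-1, 1], a +10 bonus effectively ranks same-domain documents first and uses the dense score only to order documents within each tier.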

Matryoshka + LLM

Nomic-embed-text-v1.5 (768-dim, 2-stage: 64-dim coarse → 768-dim fine) fused with BGE via RRF, then Llama re-ranks the top-20 candidates pointwise (Yes/No) via Ollama.
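The two-stage Matryoshka search can be sketched as follows, assuming the coarse stage simply truncates each 768-dim embedding to its first 64 dimensions and re-normalises. The function name and the `n_candidates=100` default are assumptions (the latter suggested only by the `matryoshka_top100_*.json` cache files); the RRF fusion with BGE and the Llama re-ranking are not shown.

```python
import numpy as np


def _normalize(x):
    """L2-normalise along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def two_stage_search(query, corpus, coarse_dim=64, n_candidates=100, top_k=10):
    """Coarse retrieval on a truncated prefix, then re-scoring with full vectors.

    query:  (d,) full-size query embedding
    corpus: (n_docs, d) full-size document embeddings
    """
    # Stage 1: cheap similarity on the first coarse_dim dimensions.
    coarse = _normalize(corpus[:, :coarse_dim]) @ _normalize(query[:coarse_dim])
    n_candidates = min(n_candidates, corpus.shape[0])
    candidates = np.argsort(-coarse)[:n_candidates]

    # Stage 2: exact cosine similarity on the shortlisted documents only.
    fine = _normalize(corpus[candidates]) @ _normalize(query)
    order = np.argsort(-fine)[:top_k]
    return candidates[order], fine[order]
```

Truncation is meaningful here because Matryoshka-trained models concentrate the most useful information in the leading dimensions; with an ordinary embedding model the coarse stage would lose far more recall.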

Useful commands

# ── Best model (post-presentations) ──────────────────────────────────
python models/best_score_after_presentations\(0.7586\).py

# ── Best model (pre-presentations) ───────────────────────────────────
python models/best_score_before_presentations\(0.7337\).py

# ── Full Matryoshka + LLM (requires Ollama) ──────────────────────────
python models/other_models.py --pipeline matryoshka

# ── Sparse baselines ─────────────────────────────────────────────────
python models/other_models.py --pipeline sparse
python models/other_models.py --pipeline sparse --sparse_methods bm25_full

# ── Dense RRF (our core approach) ────────────────────────────────────
python models/other_models.py --pipeline rrf --models bge gte granite

# ── Dense RRF with grid-searched weights ─────────────────────────────
python models/other_models.py --pipeline rrf --grid

# ── Matryoshka only (no LLM, fast) ───────────────────────────────────
python models/other_models.py --pipeline matryoshka --no_llm

# ── Ablation: disable ontological bonus ──────────────────────────────
python models/other_models.py --pipeline rrf --models bge gte --no_bonus

Caching: every expensive computation (BM25 index, Matryoshka embeddings, LLM scores, sparse matrices) is cached in submissions/cache/. Re-running the same pipeline is near-instant.
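The caching pattern amounts to "load if the file exists, otherwise compute and save". A minimal sketch for array-valued results, with a hypothetical helper name `cached_array` (the actual scripts may organise this differently):

```python
from pathlib import Path

import numpy as np


def cached_array(path, compute):
    """Return the array stored at `path`, computing and saving it on a cache miss.

    path:    target .npy file, e.g. submissions/cache/bm25_scores_train.npy
    compute: zero-argument callable that produces the array
    """
    path = Path(path)
    if path.exists():
        return np.load(path)
    result = np.asarray(compute())
    path.parent.mkdir(parents=True, exist_ok=True)
    np.save(path, result)
    return result
```

A cache like this is keyed only by file name, so deleting the relevant file in submissions/cache/ is the way to force a recomputation after changing the corpus or the tokenization.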

Results

Score progression

Model	Description	Codabench NDCG@10
TF-IDF baseline	Title + abstract, sparse	~0.48
Dense MiniLM	Precomputed embeddings	~0.50
BGE-large alone	TA + MaxSim body chunks	~0.57
BGE + GTE + Granite RRF	Triple dense fusion	~0.68
BGE + GTE + Granite + bonus	+ ontological domain boost	0.7337 ← before presentations
V9 post-presentations	+ sparse signals + linear fusion + citations	0.7586 ← best overall

Final Codabench Score

┌──────────────────────────────────────────────────────┐
│  Model                        NDCG@10   MAP    R@100 │
│──────────────────────────────────────────────────────│
│  Before presentations         0.7337   0.6561  0.9414│
│  After presentations (V9)     0.7586   0.6788  0.9464│
└──────────────────────────────────────────────────────┘
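For reference, NDCG@10 and R@100 can be computed locally from `qrels.json` along these lines. The sketch assumes binary relevance (a document is either in the gold list or not) and a run given as `{query_id: [ranked_doc_ids]}`; the official Codabench scorer may differ in details.

```python
import math


def ndcg_at_k(ranked_ids, gold_ids, k=10):
    """NDCG@k with binary relevance."""
    gold = set(gold_ids)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in gold
    )
    # Ideal ranking: all gold documents placed at the top.
    ideal_hits = min(len(gold), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def recall_at_k(ranked_ids, gold_ids, k=100):
    """Fraction of gold documents found in the top k."""
    gold = set(gold_ids)
    if not gold:
        return 0.0
    return len(gold & set(ranked_ids[:k])) / len(gold)


def mean_metric(metric, run, qrels, **kwargs):
    """Average a per-query metric over all queries in qrels."""
    values = [metric(run.get(qid, []), gold, **kwargs) for qid, gold in qrels.items()]
    return sum(values) / len(values) if values else 0.0
```

Queries missing from the run count as zero, which matches how a partial submission would be penalised.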

Results storage

Every run automatically writes two outputs.

`submissions/results_log.csv`

Append-only CSV, one row per run.

Column	Content
`timestamp`	`YYYYMMDD_HHMMSS`
`pipeline`	`rrf`, `matryoshka`, or `sparse`
`config_summary`	JSON string of all relevant flags
`final_label`	name of the last evaluated stage
`NDCG@10`	final score on the train query set
`MAP@100`
`R@100`
`submission`	path to the `.zip` file produced
`snapshot`	path to the full JSON snapshot
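An append-only log with these columns can be written with the standard library alone. The sketch below is an assumption about how such a logger might look, not the repository's code; the helper names `make_timestamp` and `append_result` are hypothetical.

```python
import csv
from datetime import datetime
from pathlib import Path

COLUMNS = [
    "timestamp", "pipeline", "config_summary", "final_label",
    "NDCG@10", "MAP@100", "R@100", "submission", "snapshot",
]


def make_timestamp():
    """Timestamp in the YYYYMMDD_HHMMSS format used by the log."""
    return datetime.now().strftime("%Y%m%d_%H%M%S")


def append_result(log_path, row):
    """Append one run to the CSV log, writing the header only for a new file.

    row: dict with a subset of COLUMNS as keys; missing columns are left empty.
    """
    log_path = Path(log_path)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    write_header = not log_path.exists() or log_path.stat().st_size == 0
    with open(log_path, "a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=COLUMNS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)
```

Opening the file in append mode keeps earlier rows untouched, and `config_summary` can be filled with `json.dumps(config, sort_keys=True)` so that identical configurations produce identical strings.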

`submissions/results/<timestamp>_<pipeline>.json`

Full snapshot for one run:

{
  "timestamp": "20260418_214423",
  "pipeline": "rrf",
  "config": { "models": ["bge", "gte", "granite"], "rrf_k": 60, "grid": false },
  "results": [
    { "label": "BGE-large (TA + MaxSim body)", "NDCG@10": 0.71, "MAP@100": 0.58, "R@100": 0.88 },
    { "label": "GTE-large (TA + MaxSim chunks)", "NDCG@10": 0.69 },
    { "label": "RRF final (BGE+GTE+Granite) + Bonus", "NDCG@10": 0.74 }
  ],
  "final_NDCG@10": 0.74
}
