
amineouat/Information-Retrieval

Scientific Article Information Retrieval — Challenge

Best submission score on Codabench: NDCG@10 = 0.7586 · MAP = 0.6788 · R@100 = 0.9464


⚠️ Important Notice — Two Generations of Models

This repository contains two distinct best models, produced at different stages of the project.

🥇 Best model overall — after presentations (NDCG@10 = 0.7586 on Codabench)

models/best_score_after_presentations(0.7586).py

Obtained after attending the oral presentations of other groups. Inspired by ideas from Group A (Beurtheret) and Group B (Salma Alkhalily et al.), we extended our existing pipeline with additional signals (sparse BM25/TF-IDF, citation context embeddings, linear fusion). The core of this model remains our original approach — the BGE + GTE + Granite dense fusion with RRF and the ontological domain bonus — which alone accounted for the large majority of the final score. The post-presentation additions brought marginal but consistent improvements on top of that foundation.

🥈 Best model before presentations (NDCG@10 = 0.7337 on Codabench)

models/best_score_before_presentations(0.7337).py and notebooks/best_score_notebook_(before_presentations).ipynb

Our original contribution, developed independently before any inter-group exchange. It already relies on the triple dense model fusion (BGE-large + GTE-large + Granite R2) with weighted RRF and the hand-crafted ontological domain bonus, the central idea that drives performance in both models.
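The weighted RRF fusion mentioned above can be sketched as follows. This is a minimal illustration, not the code from the repository: the function name `weighted_rrf`, the toy document IDs, and the equal weights are hypothetical, and `k=60` is simply the `rrf_k` value that appears in the run snapshot shown in the Results storage section.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked lists with weighted Reciprocal Rank Fusion.

    score(d) = sum over models m of w_m / (k + rank_m(d)), ranks starting at 1.
    `rankings` maps a model name to its ranked list of doc ids (best first).
    """
    scores = defaultdict(float)
    for name, ranked_ids in rankings.items():
        w = weights.get(name, 1.0)  # assumption: missing weight defaults to 1.0
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += w / (k + rank)
    # Sort by fused score, breaking ties by doc id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Toy example with three hypothetical rankings.
rankings = {
    "bge": ["d1", "d2", "d3"],
    "gte": ["d2", "d1", "d4"],
    "granite": ["d2", "d3", "d1"],
}
weights = {"bge": 1.0, "gte": 1.0, "granite": 1.0}
fused = weighted_rrf(rankings, weights)
print(fused)
```

Because RRF only uses ranks, the three models do not need comparable score scales, which is one plausible reason to prefer it for fusing heterogeneous dense encoders.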


Get things done quickly

If you don't want to read everything, go to notebooks/best_score_notebook_(before_presentations).ipynb. It is the most interesting part: it holds the best model before presentations, together with the error analysis and the data exploration. For the final best model, run models/best_score_after_presentations(0.7586).py directly.

Getting the data

The data is not included in this repository. Download it from the challenge platform and place the files as follows:

data/
├── corpus.parquet           # 20,000 candidate papers
├── queries.parquet          # 100 public train queries
├── held_out_queries.parquet # held-out queries (for final submission)
├── qrels.json               # ground-truth: {query_id: [gold_doc_ids]}
└── sample_submission.json   # example submission format

Pre-computed embeddings also go under data/embeddings/. Each model has its own subdirectory with embeddings.npy and ids.json (see Project Structure below for the full layout).
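As a rough sketch of how such a subdirectory might be consumed, the snippet below loads `embeddings.npy` and `ids.json` and ranks documents by dot product. It assumes that `ids.json` is a JSON list aligned row by row with the embedding matrix; that layout, and the helper names `load_embeddings` and `top_k`, are assumptions rather than the repository's actual API.

```python
import json
from pathlib import Path

import numpy as np


def load_embeddings(model_dir):
    """Load an (n_docs, dim) matrix and its ids from one model subdirectory."""
    model_dir = Path(model_dir)
    embeddings = np.load(model_dir / "embeddings.npy")
    ids = json.loads((model_dir / "ids.json").read_text())
    if len(ids) != embeddings.shape[0]:
        raise ValueError("ids.json and embeddings.npy are not aligned")
    return embeddings, ids


def top_k(query_vec, embeddings, ids, k=10):
    """Return the k (doc_id, score) pairs with the highest dot product."""
    scores = embeddings @ query_vec
    order = np.argsort(-scores)[:k]
    return [(ids[i], float(scores[i])) for i in order]
```

If the stored vectors are L2-normalised, the dot product is the cosine similarity, so no extra normalisation is needed at query time.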

To re-encode with a different model:

python scripts/embed.py --model BAAI/bge-small-en-v1.5
# or for body chunks:
python scripts/embed_body.py --model BAAI/bge-large-en-v1.5

Setup

# Create and activate the virtual environment
python3.10 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# For the Matryoshka + LLM pipeline only
pip install ollama sentence-transformers einops
ollama pull llama3.2

All scripts are run from starter_kit/ (the project root).


Project Structure

starter_kit/
│
├── data/
│   ├── corpus.parquet
│   ├── queries.parquet
│   ├── held_out_queries.parquet
│   ├── qrels.json
│   ├── cite_cache/                                    # citation context embeddings (BGE)
│   │   ├── train_cite_embeddings.npy                  # (2826, 1024) — per-citation BGE embeddings
│   │   ├── train_cite_query_idx.npy
│   │   ├── held_out_cite_embeddings.npy               # (2299, 1024)
│   │   └── held_out_cite_query_idx.npy
│   └── embeddings/
│       ├── sentence-transformers_all-MiniLM-L6-v2/   # baseline MiniLM embeddings
│       ├── BAAI_bge-large-en-v1.5/                   # BGE-large TA + body chunk embeddings
│       │   ├── corpus_ta_embeddings.npy
│       │   ├── corpus_body_embeddings.npy
│       │   ├── corpus_body_doc_ids.json
│       │   ├── corpus_ids.json
│       │   ├── query_ta_embeddings.npy
│       │   ├── query_body_embeddings.npy
│       │   ├── held_out_query_ta_embeddings.npy
│       │   ├── held_out_query_body_embeddings.npy
│       │   └── train/ held_out/                       # cite context pooled embeddings
│       ├── gte_large_*/                               # GTE-large TA + chunk embeddings
│       ├── gte_modernbert_*/                          # GTE-ModernBERT TA embeddings
│       ├── granite_r2_*/                              # Granite R2 TA embeddings
│       ├── e5_large_*/                                # E5-large TA embeddings
│       └── bgem3_*/                                   # BGE-M3 TA embeddings
│
├── models/
│   ├── best_score_after_presentations(0.7586).py  # ← BEST OVERALL (post-presentations)
│   ├── best_score_before_presentations(0.7337).py # ← Best before presentations
│   └── other_models.py                            # all other pipeline variants
│
├── notebooks/
│   ├── challenge_baseline.ipynb                   # starter: TF-IDF, BM25, MiniLM dense
│   └── best_score_notebook_(before_presentations).ipynb  # best pre-presentation pipeline
│
├── scripts/
│   ├── embed.py               # encode corpus/queries with any HuggingFace model
│   ├── embed_body.py          # encode body chunks (section-level)
│   └── embed_cite_contexts.py # encode in-text citation contexts
│
├── submissions/
│   ├── cache/                 # intermediate results (auto-created, speeds up reruns)
│   │   ├── bm25_scores_train.npy / bm25_scores_held.npy
│   │   ├── tfidf_scores_train.npy / tfidf_scores_held.npy
│   │   ├── v9_bm25_body_train.npy / v9_bm25_body_held.npy
│   │   ├── matryoshka_top100_*.json
│   │   ├── llm_rerank_scores_*.json
│   │   └── nomic_embeddings/
│   ├── results/               # per-run JSON snapshots
│   ├── results_log.csv        # append-only experiment log
│   └── submission.zip         # final submission file
│
└── requirements.txt

Files quick reference

| File | Role |
|---|---|
| models/best_score_after_presentations(0.7586).py | Final best model. 11-signal linear fusion (dense + sparse + domain + citations) with Optuna weight optimisation and ontological bonus. Run this to reproduce the Codabench score. |
| models/best_score_before_presentations(0.7337).py | Best model before presentations. BGE + GTE + Granite RRF with grid-searched weights and ontological domain bonus. |
| models/other_models.py | Main entry point for all other pipelines: sparse, rrf, matryoshka. |
| notebooks/best_score_notebook_(before_presentations).ipynb | Full pre-presentation pipeline in notebook form with data exploration and error analysis. |
| notebooks/challenge_baseline.ipynb | Step-by-step walkthrough: TF-IDF → BM25 → MiniLM dense retrieval. Good starting point. |
| scripts/embed.py | Re-encode the corpus or queries with any sentence-transformers model. |
| scripts/embed_body.py | Same but for body chunks. Produces chunk_embeddings.npy + paper_idx.npy. |
| scripts/embed_cite_contexts.py | Encodes in-text citation contexts per query paper. |

other_models.py — all pipelines

Run from starter_kit/. All pipelines write a submission zip and log results automatically.

Methods available

Sparse

  • tfidf — TF-IDF cosine similarity on title + abstract
  • bm25_ta — BM25 on title + abstract (tokenized)
  • bm25_full — BM25 on title + abstract + all body sections concatenated

Dense (pre-computed embeddings, no GPU needed at run time)

  • bge — BGE-large-en-v1.5, TA dot-product + MaxSim over body chunks
  • gte — GTE-large, TA dot-product + MaxSim over body chunks
  • gte_modernbert — GTE-ModernBERT-base, TA dot-product
  • granite — Granite R2, TA dot-product
  • e5 — E5-large-instruct, TA dot-product
  • bgem3 — BGE-M3, TA dot-product
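The "TA dot-product + MaxSim over body chunks" scoring used by bge and gte can be sketched as below. This is an illustration under stated assumptions, not the repository's implementation: it uses a single query vector, a hypothetical `chunk_doc_idx` array mapping each chunk to its document row, and a hypothetical mixing weight `alpha`, since the way the two signals are combined is not specified here.

```python
import numpy as np


def ta_plus_maxsim(query_vec, ta_emb, chunk_emb, chunk_doc_idx, alpha=0.5):
    """Combine title+abstract similarity with the best-matching body chunk.

    ta_emb:        (n_docs, dim) title+abstract embeddings
    chunk_emb:     (n_chunks, dim) body chunk embeddings
    chunk_doc_idx: (n_chunks,) row index of the document owning each chunk
    alpha:         hypothetical weight of the TA score in the mix
    """
    ta_scores = ta_emb @ query_vec
    chunk_scores = chunk_emb @ query_vec

    # MaxSim: keep, for every document, the score of its best chunk.
    max_scores = np.full(ta_emb.shape[0], -np.inf)
    np.maximum.at(max_scores, chunk_doc_idx, chunk_scores)

    # Documents without any body chunk fall back to their TA score.
    max_scores = np.where(np.isinf(max_scores), ta_scores, max_scores)
    return alpha * ta_scores + (1.0 - alpha) * max_scores
```

Taking the maximum rather than the mean lets a single highly relevant section lift a paper even when the rest of its body is off-topic.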

All dense methods include an ontological domain bonus: documents from the same domain as the query are boosted (+10), related domains less so (+5 / +2). This bonus is the single most impactful component of the pipeline.
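A minimal sketch of such a bonus is shown below. Only the bonus values (+10, +5, +2) come from this README; the domain names, the `RELATED` table, and the choice to add the bonus directly to the retrieval score are hypothetical placeholders for the hand-crafted ontology.

```python
# Hypothetical relatedness table: query domain -> {related doc domain: bonus}.
RELATED = {
    "physics": {"mathematics": 5.0, "engineering": 2.0},
}


def domain_bonus(query_domain, doc_domain, related=RELATED):
    """Return +10 for the same domain, +5 / +2 for related ones, else 0."""
    if query_domain == doc_domain:
        return 10.0
    return related.get(query_domain, {}).get(doc_domain, 0.0)


def apply_bonus(scores, doc_domains, query_domain):
    """Add the domain bonus to each document's base score."""
    return {
        doc_id: score + domain_bonus(query_domain, doc_domains[doc_id])
        for doc_id, score in scores.items()
    }


# Toy example: base scores are close, so the bonus decides the order.
scores = {"d1": 0.90, "d2": 0.80, "d3": 0.70, "d4": 0.95}
doc_domains = {"d1": "biology", "d2": "physics", "d3": "mathematics", "d4": "engineering"}
boosted = apply_bonus(scores, doc_domains, query_domain="physics")
ranking = sorted(boosted, key=boosted.get, reverse=True)
print(ranking)
```

When the bonus is much larger than the spread of the base scores, as in this toy example, it effectively sorts documents into domain tiers and leaves the retrieval score to order documents within each tier.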

Matryoshka + LLM

  • Nomic-embed-text-v1.5 (768-dim, 2-stage: 64-dim coarse → 768-dim fine) fused with BGE via RRF, then Llama re-ranks the top-20 candidates pointwise (Yes/No) via Ollama.
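The two-stage Matryoshka search (64-dim coarse pass, then 768-dim fine pass) can be sketched as follows. The function names and the `shortlist_size` parameter are hypothetical, the sketch covers only the retrieval part (no RRF fusion and no LLM re-ranking), and any model-specific preprocessing that Nomic embeddings may need before truncation is left out.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norm, 1e-12, None)


def two_stage_search(query_vec, corpus_emb, coarse_dim=64, shortlist_size=100, k=10):
    """Coarse search on truncated vectors, then re-score a shortlist in full dim."""
    # Stage 1: keep the first `coarse_dim` dimensions and re-normalise them.
    q_coarse = l2_normalize(query_vec[:coarse_dim])
    c_coarse = l2_normalize(corpus_emb[:, :coarse_dim])
    coarse_scores = c_coarse @ q_coarse
    shortlist_size = min(shortlist_size, corpus_emb.shape[0])
    candidates = np.argsort(-coarse_scores)[:shortlist_size]

    # Stage 2: re-score only the shortlist with the full-dimensional vectors.
    fine_scores = l2_normalize(corpus_emb[candidates]) @ l2_normalize(query_vec)
    order = np.argsort(-fine_scores)[:k]
    return candidates[order], fine_scores[order]
```

The coarse pass touches every document but only 64 dimensions, while the expensive full-dimensional comparison is limited to the shortlist.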

Useful commands

# ── Best model (post-presentations) ──────────────────────────────────
python models/best_score_after_presentations\(0.7586\).py

# ── Best model (pre-presentations) ───────────────────────────────────
python models/best_score_before_presentations\(0.7337\).py

# ── Full Matryoshka + LLM (requires Ollama) ──────────────────────────
python models/other_models.py --pipeline matryoshka

# ── Sparse baselines ─────────────────────────────────────────────────
python models/other_models.py --pipeline sparse
python models/other_models.py --pipeline sparse --sparse_methods bm25_full

# ── Dense RRF (our core approach) ────────────────────────────────────
python models/other_models.py --pipeline rrf --models bge gte granite

# ── Dense RRF with grid-searched weights ─────────────────────────────
python models/other_models.py --pipeline rrf --grid

# ── Matryoshka only (no LLM, fast) ───────────────────────────────────
python models/other_models.py --pipeline matryoshka --no_llm

# ── Ablation: disable ontological bonus ──────────────────────────────
python models/other_models.py --pipeline rrf --models bge gte --no_bonus

Caching: every expensive computation (BM25 index, Matryoshka embeddings, LLM scores, sparse matrices) is cached in submissions/cache/. Re-running the same pipeline is near-instant.
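The compute-or-load pattern behind this kind of cache can be sketched as below. The helper name `cached_json` and the example file name are hypothetical; the actual scripts may key and store their caches differently (for instance as .npy arrays).

```python
import json
from pathlib import Path


def cached_json(path, compute):
    """Return the cached result at `path`, computing and saving it if absent."""
    path = Path(path)
    if path.exists():
        return json.loads(path.read_text())
    result = compute()  # the expensive step, run only on a cache miss
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result))
    return result
```

A call such as `cached_json("submissions/cache/example_top100.json", run_retrieval)`, where `run_retrieval` is a placeholder for any expensive function, would pay the full cost once and read from disk afterwards. Note that a cache keyed only by file name does not notice changed inputs, so stale files have to be deleted by hand.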


Results

Score progression

| Model | Description | Codabench NDCG@10 |
|---|---|---|
| TF-IDF baseline | Title + abstract, sparse | ~0.48 |
| Dense MiniLM | Precomputed embeddings | ~0.50 |
| BGE-large alone | TA + MaxSim body chunks | ~0.57 |
| BGE + GTE + Granite RRF | Triple dense fusion | ~0.68 |
| BGE + GTE + Granite + bonus | + ontological domain boost | 0.7337 (before presentations) |
| V9 post-presentations | + sparse signals + linear fusion + citations | 0.7586 (best overall) |

Final Codabench Score

┌──────────────────────────────────────────────────────┐
│  Model                        NDCG@10   MAP    R@100 │
│──────────────────────────────────────────────────────│
│  Before presentations         0.7337   0.6561  0.9414│
│  After presentations (V9)     0.7586   0.6788  0.9464│
└──────────────────────────────────────────────────────┘
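For reference, the three reported metrics can be computed from a ranked list and the gold IDs in qrels.json roughly as follows. This sketch assumes binary relevance and normalises average precision by the number of gold documents; the official Codabench scorer may differ in such details.

```python
import math


def ndcg_at_k(ranked, gold, k=10):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    gold = set(gold)
    dcg = sum(1.0 / math.log2(i + 2) for i, d in enumerate(ranked[:k]) if d in gold)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(gold), k)))
    return dcg / ideal if ideal else 0.0


def average_precision(ranked, gold, k=100):
    """Average of precision values at each rank where a gold document appears."""
    gold = set(gold)
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in gold:
            hits += 1
            total += hits / rank
    return total / len(gold) if gold else 0.0


def recall_at_k(ranked, gold, k=100):
    """Fraction of gold documents found in the top k."""
    gold = set(gold)
    return len(gold & set(ranked[:k])) / len(gold) if gold else 0.0
```

MAP is then the mean of `average_precision` over all queries, and the other two metrics are averaged over queries in the same way.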

Results storage

Every run automatically writes two outputs.

submissions/results_log.csv

Append-only CSV, one row per run.

| Column | Content |
|---|---|
| timestamp | YYYYMMDD_HHMMSS |
| pipeline | rrf, matryoshka, or sparse |
| config_summary | JSON string of all relevant flags |
| final_label | name of the last evaluated stage |
| NDCG@10 | final score on the train query set |
| MAP@100 | |
| R@100 | |
| submission | path to the .zip file produced |
| snapshot | path to the full JSON snapshot |
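An append-only log with these columns could be written along the following lines. The column names are taken from the table above; the helper names `make_timestamp` and `append_run` are hypothetical and this is not the repository's logging code.

```python
import csv
from datetime import datetime
from pathlib import Path

COLUMNS = [
    "timestamp", "pipeline", "config_summary", "final_label",
    "NDCG@10", "MAP@100", "R@100", "submission", "snapshot",
]


def make_timestamp():
    """Timestamp in the YYYYMMDD_HHMMSS format used by the log."""
    return datetime.now().strftime("%Y%m%d_%H%M%S")


def append_run(log_path, row):
    """Append one run to the CSV, writing the header only for a new file."""
    log_path = Path(log_path)
    write_header = not log_path.exists() or log_path.stat().st_size == 0
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)  # missing columns are written as empty strings
```

Opening the file in append mode means earlier rows are never rewritten, which keeps the full experiment history even when a run crashes partway through.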

submissions/results/<timestamp>_<pipeline>.json

Full snapshot for one run:

{
  "timestamp": "20260418_214423",
  "pipeline": "rrf",
  "config": { "models": ["bge", "gte", "granite"], "rrf_k": 60, "grid": false },
  "results": [
    { "label": "BGE-large (TA + MaxSim body)", "NDCG@10": 0.71, "MAP@100": 0.58, "R@100": 0.88 },
    { "label": "GTE-large (TA + MaxSim chunks)", "NDCG@10": 0.69 },
    { "label": "RRF final (BGE+GTE+Granite) + Bonus", "NDCG@10": 0.74 }
  ],
  "final_NDCG@10": 0.74
}
