Description
Problem
WorkRB includes several neural embedding models for ranking tasks but lacks lexical baselines, which are commonly used as lower-bound reference points for evaluating deep learning models. These techniques could also serve as the candidate-generation stage in future two-stage retrieval pipelines with re-ranking models.
Proposal
Add lexical baseline models to WorkRB:
- BM25Model: BM25 Okapi probabilistic ranking, using the Python library rank-bm25
- TfIdfModel: TF-IDF plus cosine similarity (with configurable tokenization: words vs. character n-grams)
- EditDistanceModel: Levenshtein ratio, using the Python library rapidfuzz
- RandomRankingModel: random score generation, equivalent to RndESCOClassificationModel but for ranking tasks
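For reference, the BM25 Okapi score that the first baseline is built on can be sketched in pure Python. This is an illustrative re-implementation using a common smoothed IDF variant, not the exact formula of rank-bm25, which BM25Model would delegate to:

```python
import math

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with a
    BM25 Okapi-style formula (illustrative sketch, not rank-bm25 itself)."""
    n_docs = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / n_docs
    scores = []
    for doc in corpus_tokens:
        dl = len(doc)
        score = 0.0
        for term in query_tokens:
            tf = doc.count(term)
            if tf == 0:
                continue
            # Document frequency of the term across the corpus.
            df = sum(1 for d in corpus_tokens if term in d)
            # Smoothed IDF; term-frequency saturation via k1, length
            # normalization via b.
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
        scores.append(score)
    return scores
```

Documents that do not contain any query term receive a score of zero, which is why BM25 works well as a cheap first-stage candidate generator.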
Each model will inherit from ModelInterface and implement _compute_rankings(). The models will accept but ignore ModelInputType parameters, since lexical methods are agnostic to the input type, and they will delegate _compute_classification() to _compute_rankings(), as BiEncoderModel does. All models apply Unicode normalization and accept configurable lowercase preprocessing; RandomRankingModel additionally accepts an optional seed parameter for reproducibility.
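The shared preprocessing and the delegation pattern could look roughly like the following sketch. The normalization form (NFKC) and the standalone class shape are assumptions for illustration; the real class would inherit from ModelInterface:

```python
import random
import unicodedata

def preprocess(text: str, lowercase: bool = True) -> str:
    """Unicode-normalize and optionally lowercase a string, as the
    shared preprocessing step for all lexical baselines.
    (NFKC is an assumed choice of normalization form.)"""
    text = unicodedata.normalize("NFKC", text)
    return text.lower() if lowercase else text

class RandomRankingModel:
    """Sketch of the random baseline: scores are independent of the
    inputs, so an optional seed makes rankings reproducible."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)

    def _compute_rankings(self, queries, candidates):
        # One random score per (query, candidate) pair.
        return [[self._rng.random() for _ in candidates] for _ in queries]

    def _compute_classification(self, queries, candidates):
        # Delegate to the ranking path, as BiEncoderModel does.
        return self._compute_rankings(queries, candidates)
```

Seeding the RNG in the constructor means two models built with the same seed produce identical score matrices, which keeps benchmark runs reproducible.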
This would require adding the following new dependencies to pyproject.toml:
- rank-bm25: BM25 Okapi ranking
- rapidfuzz: fast Levenshtein distance
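In pyproject.toml the new entries would look roughly like this (the version bounds are illustrative placeholders, not tested minimums):

```toml
[project]
dependencies = [
    # ... existing dependencies ...
    "rank-bm25>=0.2",  # BM25 Okapi ranking
    "rapidfuzz>=3.0",  # fast Levenshtein ratio
]
```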
Proposal Characteristics

Type:
- [ ] New Ontology (data source for multiple tasks)
- [ ] New Task(s)
- [x] New Model(s)
- [ ] New Metric(s)
- [ ] Other
Area(s) of code: paths, modules, or APIs you expect to touch
- src/workrb/models/lexical_baselines.py
Additional Context
These baselines are adapted from the MELO Benchmark (src/melo_benchmark/evaluation/lexical_baseline/), where they were used to establish reference performance on multilingual entity linking tasks. The MELO implementations also support lemmatization as optional preprocessing. We could expose this in WorkRB as configurable preprocessing logic as well, but it would require installing spaCy models or finding an alternative library.
Implementation
- I plan to implement this in a PR
- I am proposing the idea and would like someone else to pick it up