Description
Problem
WorkRB includes several neural embedding models for ranking tasks but lacks lexical baselines, which are commonly used as lower-bound reference points for evaluating deep learning models. These techniques could also serve as the candidate-generation stage in future two-stage retrieval pipelines with re-ranking models.
Proposal
Add lexical baseline models to WorkRB:
- BM25Model: BM25 Okapi probabilistic ranking, using the Python library rank-bm25
- TfIdfModel: TF-IDF plus cosine similarity (with configurable tokenization: words vs. character n-grams)
- EditDistanceModel: Levenshtein ratio, using the Python library rapidfuzz
- RandomRankingModel: random score generation, equivalent to RndESCOClassificationModel but for ranking tasks
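For reference, the BM25 Okapi score that the first baseline is built on can be sketched in pure Python. This is an illustrative re-implementation using a common smoothed IDF variant, not the exact formula of rank-bm25, which BM25Model would delegate to:

```python
import math

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with a
    BM25 Okapi-style formula (illustrative sketch, not rank-bm25 itself)."""
    n_docs = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / n_docs
    scores = []
    for doc in corpus_tokens:
        dl = len(doc)
        score = 0.0
        for term in query_tokens:
            tf = doc.count(term)
            if tf == 0:
                continue
            # Document frequency of the term across the corpus.
            df = sum(1 for d in corpus_tokens if term in d)
            # Smoothed IDF; term-frequency saturation via k1, length
            # normalization via b.
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
        scores.append(score)
    return scores
```

Documents that do not contain any query term receive a score of zero, which is why BM25 works well as a cheap first-stage candidate generator.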
Each model will inherit from ModelInterface and implement _compute_rankings(). The models will accept but ignore ModelInputType parameters, since lexical methods are agnostic to the input type, and they will delegate _compute_classification() to _compute_rankings(), as BiEncoderModel does. All models apply Unicode normalization and accept configurable lowercase preprocessing; RandomRankingModel additionally accepts an optional seed parameter for reproducibility.
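The shared preprocessing and the delegation pattern could look roughly like the following sketch. The normalization form (NFKC) and the standalone class shape are assumptions for illustration; the real class would inherit from ModelInterface:

```python
import random
import unicodedata

def preprocess(text: str, lowercase: bool = True) -> str:
    """Unicode-normalize and optionally lowercase a string, as the
    shared preprocessing step for all lexical baselines.
    (NFKC is an assumed choice of normalization form.)"""
    text = unicodedata.normalize("NFKC", text)
    return text.lower() if lowercase else text

class RandomRankingModel:
    """Sketch of the random baseline: scores are independent of the
    inputs, so an optional seed makes rankings reproducible."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)

    def _compute_rankings(self, queries, candidates):
        # One random score per (query, candidate) pair.
        return [[self._rng.random() for _ in candidates] for _ in queries]

    def _compute_classification(self, queries, candidates):
        # Delegate to the ranking path, as BiEncoderModel does.
        return self._compute_rankings(queries, candidates)
```

Seeding the RNG in the constructor means two models built with the same seed produce identical score matrices, which keeps benchmark runs reproducible.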
This would require adding the following new dependencies to pyproject.toml:
- rank-bm25: BM25 Okapi ranking
- rapidfuzz: fast Levenshtein distance
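In pyproject.toml the new entries would look roughly like this (the version bounds are illustrative placeholders, not tested minimums):

```toml
[project]
dependencies = [
    # ... existing dependencies ...
    "rank-bm25>=0.2",  # BM25 Okapi ranking
    "rapidfuzz>=3.0",  # fast Levenshtein ratio
]
```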
Proposal Characteristics

Type:
- [ ] New Ontology (data source for multiple tasks)
- [ ] New Task(s)
- [x] New Model(s)
- [ ] New Metric(s)
- [ ] Other
Area(s) of code: paths, modules, or APIs you expect to touch
- src/workrb/models/lexical_baselines.py
Additional Context
These baselines are adapted from the MELO Benchmark (src/melo_benchmark/evaluation/lexical_baseline/), where they were used to establish reference performance on multilingual entity linking tasks. The MELO implementations also support lemmatization as optional preprocessing. We could expose this in WorkRB as configurable preprocessing logic as well, but it would require installing spaCy models or finding an alternative library.
Implementation
- I plan to implement this in a PR
- I am proposing the idea and would like someone else to pick it up