
[FEATURE] Add Lexical Baseline Models for Ranking Tasks #35

@federetyk

Description

Problem

WorkRB includes several neural embedding models for ranking tasks but lacks lexical baselines. Lexical baselines are commonly used as lower-bound reference points when evaluating deep learning models. They could also serve as the candidate-generation stage in future two-stage retrieval pipelines with re-ranking models.

Proposal

Add lexical baseline models to WorkRB:

  • BM25Model: BM25 Okapi probabilistic ranking, built on the Python library rank-bm25
  • TfIdfModel: TF-IDF vectors with cosine similarity, with configurable tokenization (words vs. character n-grams); see the sketch after this list
  • EditDistanceModel: Levenshtein ratio, built on the Python library rapidfuzz
  • RandomRankingModel: Random score generation, equivalent to RndESCOClassificationModel but for ranking tasks
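
As a rough illustration of the TfIdfModel ranking core, the sketch below uses scikit-learn's TfidfVectorizer and cosine similarity. The function name rank_with_tfidf and the char_ngrams flag are hypothetical, and the sketch assumes scikit-learn (or an equivalent) is available to WorkRB:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def rank_with_tfidf(query: str, corpus: list[str], char_ngrams: bool = False) -> list[float]:
    # Fit TF-IDF on the candidate texts, using either word tokens or
    # character n-grams, then score candidates by cosine similarity to the query.
    vectorizer = (
        TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
        if char_ngrams
        else TfidfVectorizer(analyzer="word")
    )
    doc_matrix = vectorizer.fit_transform(corpus)
    query_vector = vectorizer.transform([query])
    return cosine_similarity(query_vector, doc_matrix)[0].tolist()
```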

Each model will inherit from ModelInterface and implement _compute_rankings(). They will accept but ignore ModelInputType parameters, since lexical methods are agnostic to the input type. They will also delegate _compute_classification() to _compute_rankings(), as done in BiEncoderModel. All models apply Unicode normalization and support configurable lowercasing as preprocessing. RandomRankingModel accepts an optional seed parameter for reproducibility. A rough sketch of the BM25 variant follows.
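
A minimal sketch of what BM25Model's _compute_rankings() could wrap, assuming rank-bm25's BM25Okapi; the helper names _preprocess and rank_with_bm25 are illustrative, and the exact ModelInterface signature is not reproduced here:

```python
import unicodedata

from rank_bm25 import BM25Okapi


def _preprocess(text: str, lowercase: bool = True) -> list[str]:
    # Unicode normalization (NFKC here; the exact form is a design choice)
    # plus optional lowercasing, followed by whitespace tokenization.
    text = unicodedata.normalize("NFKC", text)
    if lowercase:
        text = text.lower()
    return text.split()


def rank_with_bm25(query: str, corpus: list[str]) -> list[float]:
    # Index the candidate texts with BM25 Okapi and score them against the query.
    bm25 = BM25Okapi([_preprocess(doc) for doc in corpus])
    return list(bm25.get_scores(_preprocess(query)))
```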

This would require adding the following new dependencies to pyproject.toml:

  • rank-bm25: BM25 Okapi implementation
  • rapidfuzz: fast Levenshtein distance computation
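
For the EditDistanceModel, a minimal sketch of the scoring step with rapidfuzz is shown below; rank_with_edit_distance is a hypothetical name, and whether to use fuzz.ratio (a normalized similarity in [0, 100]) or rapidfuzz.distance.Levenshtein.normalized_similarity is a design choice for the PR:

```python
from rapidfuzz import fuzz


def rank_with_edit_distance(query: str, corpus: list[str]) -> list[float]:
    # Higher similarity means a better match, so the values can be used
    # directly as ranking scores.
    return [fuzz.ratio(query, doc) for doc in corpus]
```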

Proposal Characteristics

  • Type:

    • [ ] New Ontology (data source for multiple tasks)
    • [ ] New Task(s)
    • [x] New Model(s)
    • [ ] New Metric(s)
    • [ ] Other
  • Area(s) of code (paths, modules, or APIs you expect to touch): src/workrb/models/lexical_baselines.py

Additional Context

These baselines are adapted from the MELO Benchmark (src/melo_benchmark/evaluation/lexical_baseline/), where they were used to establish reference performance on multilingual entity linking tasks. The MELO implementations also support lemmatization as optional preprocessing. We could include this in WorkRB as configurable preprocessing logic, but it would require installing spaCy models or finding an alternative library; a rough sketch of the spaCy option follows.
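
A minimal sketch of what the optional lemmatization step could look like with spaCy, assuming an English pipeline such as en_core_web_sm is installed separately (python -m spacy download en_core_web_sm); a multilingual setup would need per-language models or a different lemmatizer:

```python
import spacy

# Loading the pipeline is expensive, so it would be done once per model instance.
# The parser and NER components are not needed for lemmatization.
_nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])


def lemmatize(text: str) -> list[str]:
    # Replace each token with its lemma before lexical matching.
    return [token.lemma_ for token in _nlp(text)]
```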

Implementation

  • I plan to implement this in a PR
  • I am proposing the idea and would like someone else to pick it up
