feat: graded relevance support for ranking metrics by warreveys · Pull Request #51 · techwolf-ai/workrb

warreveys · 2026-05-06T13:45:15Z

Description

RankingDataset now accepts an optional target_relevance field aligned 1-to-1 with target_indices. When graded labels are provided, ndcg@k uses a (2^rel - 1) gain. Binary metrics (map, mrr, recall@k, hit@k, rp@k) consume a thresholded view of the grades via RankingTask.binary_relevance_threshold (default 1e-9, so any positive grade counts as relevant). With target_relevance=None every metric is numerically identical to the previous implementation, so existing tasks behave unchanged.
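As a rough illustration of the (2^rel - 1) gain, here is a minimal standalone sketch of graded nDCG@k. The function names and signatures below are hypothetical and are not workrb's actual code. It assumes the usual log2(rank + 1) discount and treats a missing target_relevance as grade 1 for every listed target, which yields a gain of 1 and therefore the plain binary nDCG.

```python
import math
from typing import Optional, Sequence


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k gains (rank 1 has discount log2(2) = 1)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(
    ranked_items: Sequence[int],
    target_indices: Sequence[int],
    target_relevance: Optional[Sequence[float]] = None,
    k: int = 5,
) -> float:
    """Graded nDCG@k with an exponential gain of 2^rel - 1 (hypothetical helper)."""
    if target_relevance is None:
        # Binary fallback: grade 1 gives a gain of 2^1 - 1 = 1 for every target.
        target_relevance = [1.0] * len(target_indices)
    gain = {idx: 2.0 ** rel - 1.0 for idx, rel in zip(target_indices, target_relevance)}
    # Gains in predicted order; items that are not targets contribute 0.
    actual = [gain.get(item, 0.0) for item in ranked_items]
    # Ideal ordering: all targets sorted by descending gain.
    ideal = sorted(gain.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(actual, k) / idcg if idcg > 0 else 0.0
```

Under these assumptions, placing a grade-3 target ahead of a grade-1 target scores higher than the reverse order, while the binary fallback treats both orders the same.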

binary_relevance_threshold defines what "relevant" means for the binary metrics on a graded dataset. Override it on the task (e.g. 2.0 on a 1-3 scale) to drop low grades from the binary positive set while still letting them contribute to nDCG. Items below the threshold also leave recall@k's denominator, so a graded dataset's binary numbers are not directly comparable to a fully-binary version.
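The effect on recall@k's denominator can be sketched with a toy query. This is a hypothetical standalone helper, not the library's implementation, and the data is invented for illustration rather than taken from the example script. It assumes recall@k divides the hits in the top k by the number of binary positives.

```python
from typing import Sequence, Set


def binary_positives(
    target_indices: Sequence[int],
    target_relevance: Sequence[float],
    threshold: float = 1e-9,
) -> Set[int]:
    """Targets whose grade is >= threshold (hypothetical helper)."""
    return {idx for idx, rel in zip(target_indices, target_relevance) if rel >= threshold}


def recall_at_k(ranked_items: Sequence[int], positives: Set[int], k: int) -> float:
    """Fraction of the binary positives found in the top k."""
    if not positives:
        return 0.0
    hits = sum(1 for item in ranked_items[:k] if item in positives)
    return hits / len(positives)


# Toy data: four graded targets on a 1-3 scale and one predicted ranking.
targets = [10, 11, 12, 13]
grades = [3.0, 2.0, 2.0, 1.0]
ranking = [13, 10, 99, 11, 98, 12]

all_grades = binary_positives(targets, grades)            # default threshold keeps all four
high_grades = binary_positives(targets, grades, 2.0)      # drops the grade-1 target

print(recall_at_k(ranking, all_grades, k=5))   # 0.75 (3 hits out of 4 positives)
print(recall_at_k(ranking, high_grades, k=5))  # 2/3, about 0.67 (2 hits out of 3 positives)
```

Raising the threshold removes the grade-1 target from both the hit count and the denominator, which is why the two numbers are not directly comparable.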

Adds custom_task_graded_relevance_example.py, which defines a graded task, runs evaluate() so the same dataset feeds nDCG (graded) and MAP/MRR/recall (binary) in one call, and replays a fixed prediction matrix at threshold 1e-9 vs. 2.0 to show the trade-off: nDCG@5 stays at 0.61, while MAP drops from 0.79 to 0.44, MRR from 1.0 to 0.5, and recall@5 from 0.75 to 0.67. The README gets a matching "Graded relevance (optional)" section.

Checklist

Added new tests for new functionality
Tested locally with example tasks
Code follows project style guidelines
Documentation updated
No new warnings introduced

RankingDataset now accepts an optional target_relevance field aligned 1-to-1 with target_indices. ndcg@k uses a (2^rel - 1) gain when graded labels are provided; binary metrics (map, mrr, recall@k, hit@k, rp@k) ignore the field. When target_relevance is None the binary nDCG output is numerically identical to the previous implementation, so existing tasks behave unchanged. Indices are sorted on dedup and relevance is permuted in lockstep so (idx, rel) pairs stay aligned through both _postprocess_indices and the RESOLVE strategies for duplicate queries / targets. Adds custom_task_graded_relevance_example.py demonstrating how to define a graded task and how the same dataset can serve nDCG (graded) and MAP/MRR/recall (binary) in one evaluate() call.
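The lockstep handling of indices and relevance during dedup could look roughly like the sketch below. The function is hypothetical and does not reproduce _postprocess_indices or the RESOLVE strategies. In particular, keeping the first grade seen for a duplicated index is an assumption made here for illustration; the commit message does not say which grade wins.

```python
from typing import List, Optional, Sequence, Tuple


def dedup_sorted_with_relevance(
    indices: Sequence[int],
    relevance: Optional[Sequence[float]] = None,
) -> Tuple[List[int], Optional[List[float]]]:
    """Deduplicate and sort indices, permuting relevance the same way (hypothetical helper)."""
    if relevance is None:
        # Binary dataset: nothing to keep aligned.
        return sorted(set(indices)), None
    if len(indices) != len(relevance):
        raise ValueError("target_relevance must align 1-to-1 with target_indices")
    first_grade = {}
    for idx, rel in zip(indices, relevance):
        # ASSUMPTION: on duplicates, keep the grade of the first occurrence.
        first_grade.setdefault(idx, rel)
    ordered = sorted(first_grade)
    # Build relevance from the sorted indices so each (idx, rel) pair stays intact.
    return ordered, [first_grade[idx] for idx in ordered]
```

For example, indices [7, 2, 7, 5] with grades [3, 1, 2, 2] would come out as [2, 5, 7] with grades [1, 2, 3] under this policy.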

Adds RankingTask.binary_relevance_threshold (default 1e-9) so a graded task can choose which grades count as positives for binary metrics (map, mrr, recall@k, hit@k, rp@k). Items with relevance >= threshold are positives; items below are dropped from the binary set but still contribute to graded metrics like ndcg@k. The threshold is plumbed through calculate_ranking_metrics, where the binary positive set is derived on the fly from the (indices, relevance) pair when graded labels are present. Default of 1e-9 keeps every listed item as a positive when the dataset provides target_relevance, so existing graded tasks behave identically. The threshold has no effect when target_relevance is None. The graded example task now sets binary_relevance_threshold=2.0 to demonstrate dropping nice-to-have skills from MAP/MRR while keeping them as gain-1 contributions to nDCG.
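To make the threshold semantics concrete, the sketch below derives the binary positive set on the fly and feeds it to reciprocal rank and average precision. All names are hypothetical; this is not the body of calculate_ranking_metrics, only an illustration of the rules stated above: relevance >= threshold counts as positive, and the threshold is ignored when target_relevance is None.

```python
from typing import Optional, Sequence, Set


def positives_for_binary_metrics(
    target_indices: Sequence[int],
    target_relevance: Optional[Sequence[float]] = None,
    threshold: float = 1e-9,
) -> Set[int]:
    """Binary positive set; the threshold only applies to graded datasets."""
    if target_relevance is None:
        return set(target_indices)
    return {idx for idx, rel in zip(target_indices, target_relevance) if rel >= threshold}


def reciprocal_rank(ranked_items: Sequence[int], positives: Set[int]) -> float:
    """1 / rank of the first positive, or 0 if none is retrieved."""
    for rank, item in enumerate(ranked_items, start=1):
        if item in positives:
            return 1.0 / rank
    return 0.0


def average_precision(ranked_items: Sequence[int], positives: Set[int]) -> float:
    """Mean of precision@rank over the ranks where a positive appears."""
    if not positives:
        return 0.0
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked_items, start=1):
        if item in positives:
            hits += 1
            total += hits / rank
    return total / len(positives)


# Invented data: a grade-1 ("nice-to-have") item is ranked first.
skills = [10, 11, 12, 13]
skill_grades = [3.0, 2.0, 2.0, 1.0]
predicted = [13, 10, 99, 11, 98, 12]

default_set = positives_for_binary_metrics(skills, skill_grades)
strict_set = positives_for_binary_metrics(skills, skill_grades, threshold=2.0)

print(reciprocal_rank(predicted, default_set))  # 1.0: the grade-1 item at rank 1 counts
print(reciprocal_rank(predicted, strict_set))   # 0.5: first positive is now at rank 2
```

With the default threshold every listed item remains a positive, so the result matches what a binary dataset with the same targets would give.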

warreveys added 2 commits May 6, 2026 15:31

warreveys marked this pull request as ready for review May 6, 2026 14:25

warreveys marked this pull request as draft May 6, 2026 14:25

More clear documentation on the relevancy feature

31e00f1

warreveys marked this pull request as ready for review May 6, 2026 14:49

warreveys requested review from Mattdl and removed request for Mattdl May 7, 2026 06:42

warreveys wants to merge 3 commits into techwolf-ai:main from warreveys:feat-target-relevancies
