feat: graded relevance support for ranking metrics #51

Open
warreveys wants to merge 3 commits into techwolf-ai:main from warreveys:feat-target-relevancies

warreveys (Collaborator) commented May 6, 2026

Description

RankingDataset now accepts an optional target_relevance field aligned 1-to-1 with target_indices. When graded labels are provided, ndcg@k uses a (2^rel - 1) gain. Binary metrics (map, mrr, recall@k, hit@k, rp@k) consume a thresholded view of the grades via RankingTask.binary_relevance_threshold (default 1e-9, so any non-zero grade counts as relevant). With target_relevance=None every metric is numerically identical to the previous implementation, so existing tasks behave unchanged.
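
For reviewers who want to sanity-check the gain formula, here is a minimal standalone sketch of nDCG@k with the (2^rel - 1) gain. It is not the code in this PR: the function names and signatures are hypothetical, and it assumes the usual log2 rank discount. It shows why a missing target_relevance can reduce to the binary case: treating every target as grade 1 gives a gain of 2^1 - 1 = 1.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (log2 discount)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_indices, target_indices, target_relevance=None, k=5):
    """nDCG@k with exponential gain 2**rel - 1 (hypothetical helper)."""
    # Assumption: without grades, every target has relevance 1,
    # so its gain is 2**1 - 1 = 1, i.e. the binary case.
    if target_relevance is None:
        target_relevance = [1.0] * len(target_indices)
    rel = dict(zip(target_indices, target_relevance))
    # Items that are not targets get relevance 0 and therefore gain 0.
    gains = [2 ** rel.get(i, 0.0) - 1 for i in ranked_indices]
    # Ideal ordering: all target gains sorted from highest to lowest.
    ideal = sorted((2 ** r - 1 for r in target_relevance), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


binary = ndcg_at_k([7, 3, 9, 1], [3, 7])
graded_all_ones = ndcg_at_k([7, 3, 9, 1], [3, 7], [1.0, 1.0])
best_first = ndcg_at_k([7, 3, 9, 1], [3, 7], [1.0, 3.0])
best_second = ndcg_at_k([3, 7, 9, 1], [3, 7], [1.0, 3.0])
```

In this toy data, binary and graded_all_ones agree, while ranking the grade-3 item second instead of first lowers the score.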

binary_relevance_threshold defines what "relevant" means for the binary metrics on a graded dataset. Override it on the task (e.g. 2.0 on a 1-3 scale) to drop low grades from the binary positive set while still letting them contribute to nDCG. Items below the threshold also leave recall@k's denominator, so a graded dataset's binary numbers are not directly comparable to a fully-binary version.
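
The effect on recall@k's denominator can be illustrated with a small sketch. The helper names and the toy data below are hypothetical, and the sketch assumes recall@k is hits in the top k divided by the number of positives; the PR's actual implementation may differ in detail.

```python
def binary_positives(target_indices, target_relevance, threshold=1e-9):
    """Targets whose grade is at least the threshold (hypothetical helper)."""
    return {i for i, r in zip(target_indices, target_relevance) if r >= threshold}


def recall_at_k(ranked_indices, positives, k):
    """Share of positives found in the top k; 0.0 when there are none."""
    if not positives:
        return 0.0
    hits = sum(1 for i in ranked_indices[:k] if i in positives)
    return hits / len(positives)


# Toy data on a 1-3 scale (not taken from the PR's example script).
targets = [10, 11, 12, 13]
grades = [3.0, 1.0, 2.0, 1.0]
ranked = [11, 10, 12, 20, 21, 13]

loose = binary_positives(targets, grades)                  # all four targets
strict = binary_positives(targets, grades, threshold=2.0)  # grades >= 2 only

recall_loose = recall_at_k(ranked, loose, k=3)    # 3 hits / 4 positives
recall_strict = recall_at_k(ranked, strict, k=3)  # 2 hits / 2 positives
```

Here the stricter threshold shrinks the denominator from 4 to 2, so the two recall values describe different positive sets and should not be compared directly.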

Adds custom_task_graded_relevance_example.py, which defines a graded task, runs evaluate() so the same dataset feeds nDCG (graded) and MAP/MRR/recall (binary) in one call, and replays a fixed prediction matrix at thresholds 1e-9 and 2.0 to show the trade-off: nDCG@5 stays at 0.61, while MAP drops from 0.79 to 0.44, MRR from 1.0 to 0.5, and recall@5 from 0.75 to 0.67. The README gets a matching "Graded relevance (optional)" section.

Checklist

  • Added new tests for new functionality
  • Tested locally with example tasks
  • Code follows project style guidelines
  • Documentation updated
  • No new warnings introduced

warreveys added 2 commits May 6, 2026 15:31
RankingDataset now accepts an optional target_relevance field aligned
1-to-1 with target_indices. ndcg@k uses a (2^rel - 1) gain when graded
labels are provided; binary metrics (map, mrr, recall@k, hit@k, rp@k)
ignore the field. When target_relevance is None the binary nDCG output
is numerically identical to the previous implementation, so existing
tasks behave unchanged.

Indices are sorted on dedup and relevance is permuted in lockstep so
(idx, rel) pairs stay aligned through both _postprocess_indices and
the RESOLVE strategies for duplicate queries / targets.
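
A minimal sketch of the lockstep idea, for illustration only. The function name is hypothetical, and keeping the first grade seen for a duplicated index is an assumption made here, not necessarily what _postprocess_indices or the RESOLVE strategies do.

```python
def sort_dedup_aligned(indices, relevance=None):
    """Deduplicate and sort indices, carrying each grade along with its index."""
    if relevance is None:
        # Binary case: nothing to keep aligned.
        return sorted(set(indices)), None
    if len(indices) != len(relevance):
        raise ValueError("relevance must align 1-to-1 with indices")
    first_seen = {}
    for idx, rel in zip(indices, relevance):
        # Assumption: on a duplicate index, keep the first grade encountered.
        first_seen.setdefault(idx, rel)
    order = sorted(first_seen)
    return order, [first_seen[i] for i in order]


idx, rel = sort_dedup_aligned([5, 2, 5, 9], [1.0, 3.0, 2.0, 2.0])
# idx is [2, 5, 9] and rel is [3.0, 1.0, 2.0]: each grade follows its index.
```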

Adds custom_task_graded_relevance_example.py demonstrating how to
define a graded task and how the same dataset can serve nDCG (graded)
and MAP/MRR/recall (binary) in one evaluate() call.

Adds RankingTask.binary_relevance_threshold (default 1e-9) so a graded
task can choose which grades count as positives for binary metrics
(map, mrr, recall@k, hit@k, rp@k). Items with relevance >= threshold
are positives; items below are dropped from the binary set but still
contribute to graded metrics like ndcg@k. The threshold is plumbed
through calculate_ranking_metrics, where the binary positive set is
derived on the fly from the (indices, relevance) pair when graded
labels are present.

Default of 1e-9 keeps every listed item as a positive when the dataset
provides target_relevance, so existing graded tasks behave identically.
The threshold has no effect when target_relevance is None.
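
The rule described above (relevance >= threshold is a positive; no effect without grades) can be sketched as follows. All names and the toy data are hypothetical, and the reciprocal rank and average precision definitions are the standard per-query ones, assumed here rather than copied from calculate_ranking_metrics.

```python
def positive_set(target_indices, target_relevance=None, threshold=1e-9):
    """Derive the binary positive set from (indices, relevance)."""
    if target_relevance is None:
        # No grades: every listed target is a positive, threshold is ignored.
        return set(target_indices)
    return {i for i, r in zip(target_indices, target_relevance) if r >= threshold}


def reciprocal_rank(ranked, positives):
    """1 / rank of the first positive, or 0.0 if none is retrieved."""
    for rank, idx in enumerate(ranked, start=1):
        if idx in positives:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, positives):
    """Mean of precision@rank over the ranks where a positive appears."""
    hits, total = 0, 0.0
    for rank, idx in enumerate(ranked, start=1):
        if idx in positives:
            hits += 1
            total += hits / rank
    return total / len(positives) if positives else 0.0


# Toy query (not the PR's prediction matrix): item 10 is a grade-1 target.
targets, grades = [10, 11, 12], [1.0, 3.0, 2.0]
ranked = [10, 11, 30, 12]

default_pos = positive_set(targets, grades)                 # {10, 11, 12}
strict_pos = positive_set(targets, grades, threshold=2.0)   # {11, 12}
ungraded_pos = positive_set(targets, None, threshold=2.0)   # {10, 11, 12}

rr_default = reciprocal_rank(ranked, default_pos)   # first positive at rank 1
rr_strict = reciprocal_rank(ranked, strict_pos)     # first positive at rank 2
ap_strict = average_precision(ranked, strict_pos)   # (1/2 + 2/4) / 2
```

With the stricter threshold, the grade-1 item at rank 1 stops counting as a hit, which is what pulls the reciprocal rank and average precision down in this toy query.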

The graded example task now sets binary_relevance_threshold=2.0 to
demonstrate dropping nice-to-have skills from MAP/MRR while keeping
them as gain-1 contributions to nDCG.
warreveys marked this pull request as ready for review May 6, 2026 14:25
warreveys marked this pull request as draft May 6, 2026 14:25
warreveys marked this pull request as ready for review May 6, 2026 14:49
warreveys requested review from Mattdl and removed request for Mattdl May 7, 2026 06:42