feat: graded relevance support for ranking metrics #51
Open
warreveys wants to merge 3 commits into
RankingDataset now accepts an optional target_relevance field aligned 1-to-1 with target_indices. ndcg@k uses a (2^rel - 1) gain when graded labels are provided; binary metrics (map, mrr, recall@k, hit@k, rp@k) ignore the field. When target_relevance is None, the binary nDCG output is numerically identical to the previous implementation, so existing tasks behave unchanged.

Deduplication sorts the indices and permutes relevance in lockstep, so (idx, rel) pairs stay aligned through both _postprocess_indices and the RESOLVE strategies for duplicate queries / targets.

Adds custom_task_graded_relevance_example.py, which demonstrates how to define a graded task and how the same dataset can serve nDCG (graded) and MAP/MRR/recall (binary) in one evaluate() call.
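To make the lockstep requirement concrete, here is a minimal sketch of sorting and deduplicating indices while carrying relevance along. It is not the code in _postprocess_indices: the function name `dedup_sorted_pairs` is hypothetical, and keeping the highest grade for a repeated index is an assumption, since the commit message does not say how conflicting grades are resolved.

```python
def dedup_sorted_pairs(indices, relevance=None):
    """Sort and deduplicate target indices, keeping relevance aligned.

    Hypothetical helper for illustration only.
    """
    if relevance is None:
        # Binary case: nothing to keep aligned.
        return sorted(set(indices)), None
    if len(indices) != len(relevance):
        raise ValueError("target_relevance must align 1-to-1 with target_indices")

    best = {}
    for idx, rel in zip(indices, relevance):
        # Assumption: a repeated index keeps its highest grade.
        best[idx] = max(rel, best.get(idx, rel))

    ordered = sorted(best)
    # Read relevance back through the same ordering so pairs stay aligned.
    return ordered, [best[i] for i in ordered]


idx, rel = dedup_sorted_pairs([7, 2, 7, 4], [1.0, 3.0, 2.0, 1.0])
# idx == [2, 4, 7], rel == [3.0, 1.0, 2.0]
```

Sorting the two lists independently would silently reassign grades to the wrong items, which is the failure mode the lockstep permutation avoids.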
Adds RankingTask.binary_relevance_threshold (default 1e-9) so a graded task can choose which grades count as positives for binary metrics (map, mrr, recall@k, hit@k, rp@k). Items with relevance >= threshold are positives; items below it are dropped from the binary set but still contribute to graded metrics like ndcg@k.

The threshold is plumbed through calculate_ranking_metrics, where the binary positive set is derived on the fly from the (indices, relevance) pair when graded labels are present. The default of 1e-9 keeps every listed item as a positive when the dataset provides target_relevance, so existing graded tasks behave identically. The threshold has no effect when target_relevance is None.

The graded example task now sets binary_relevance_threshold=2.0 to demonstrate dropping nice-to-have skills from MAP/MRR while keeping them as gain-1 contributions to nDCG.
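The thresholding rule above can be sketched in a few lines. This is an illustration of the described behavior, not the code in calculate_ranking_metrics; the function name `binary_positives` and its signature are hypothetical.

```python
def binary_positives(indices, relevance=None, threshold=1e-9):
    """Derive the binary positive set from an (indices, relevance) pair.

    Hypothetical helper for illustration only.
    """
    if relevance is None:
        # No graded labels: every listed target is a positive and the
        # threshold has no effect.
        return list(indices)
    # Graded labels: keep items whose relevance is >= threshold.
    return [i for i, r in zip(indices, relevance) if r >= threshold]


targets = [3, 8, 11]
grades = [3.0, 1.0, 2.0]

binary_positives(targets, grades)                  # [3, 8, 11]
binary_positives(targets, grades, threshold=2.0)   # [3, 11]
binary_positives(targets, None, threshold=2.0)     # [3, 8, 11]
```

With threshold=2.0 the grade-1 item (index 8) leaves the binary set, yet it would still carry a gain of 2^1 - 1 = 1 in nDCG.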
Description
RankingDataset now accepts an optional target_relevance field aligned 1-to-1 with target_indices. ndcg@k uses a (2^rel - 1) gain when graded labels are provided. Binary metrics (map, mrr, recall@k, hit@k, rp@k) consume a thresholded view of the grades via RankingTask.binary_relevance_threshold (default 1e-9, so any positive grade counts). With target_relevance=None every metric is numerically identical to the previous implementation, so existing tasks behave unchanged.
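For reviewers who want to check the backward-compatibility claim, here is a rough sketch of nDCG@k with the (2^rel - 1) gain. It assumes the common log2(position + 1) discount, which the description does not state, and the name `ndcg_at_k` and its signature are hypothetical rather than the library's API.

```python
import math


def ndcg_at_k(ranked, target_indices, target_relevance=None, k=5):
    """nDCG@k with an exponential gain. Hypothetical sketch."""
    if target_relevance is None:
        # Binary case: a grade of 1 gives gain 2**1 - 1 == 1, so the
        # graded formula reduces to plain binary nDCG.
        target_relevance = [1.0] * len(target_indices)

    gain = {i: 2.0 ** r - 1.0 for i, r in zip(target_indices, target_relevance)}

    # Assumed discount: 1 / log2(position + 1) with 1-based positions.
    dcg = sum(
        gain.get(item, 0.0) / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
    )
    ideal = sorted(gain.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(pos + 2) for pos, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


ranked = [3, 8, 11]
targets = [3, 8, 11]

graded = ndcg_at_k(ranked, targets, [3.0, 1.0, 2.0])   # below 1.0: grades 1 and 2 are swapped
binary = ndcg_at_k(ranked, targets)                    # 1.0: every target retrieved
```

Because a grade of 1 maps to a gain of exactly 1, passing no relevance and passing all-ones relevance give the same value under this sketch.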
binary_relevance_threshold defines what "relevant" means for the binary metrics on a graded dataset. Override it on the task (e.g. 2.0 on a 1-3 scale) to drop low grades from the binary positive set while still letting them contribute to nDCG. Items below the threshold also leave recall@k's denominator, so a graded dataset's binary numbers are not directly comparable to those of a fully binary version.
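The denominator effect is easy to miss, so here is a small sketch with made-up data (not the prediction matrix from the example script). The helper `recall_at_k` is hypothetical and implements plain set-based recall over whatever positive set survives the threshold.

```python
def recall_at_k(ranked, positives, k):
    """Fraction of positives found in the top k. Hypothetical sketch."""
    if not positives:
        return 0.0
    top = set(ranked[:k])
    return sum(1 for p in positives if p in top) / len(positives)


targets = [1, 2, 3, 4]
grades = [3.0, 2.0, 1.0, 1.0]
ranked = [1, 2, 9, 8, 7]  # illustrative ranking: finds the two high-grade items only

for threshold in (1e-9, 2.0):
    positives = [i for i, r in zip(targets, grades) if r >= threshold]
    print(threshold, len(positives), recall_at_k(ranked, positives, k=5))
# threshold 1e-9 -> 4 positives, recall 0.5
# threshold 2.0  -> 2 positives, recall 1.0
```

The ranking is identical in both runs; only the size of the positive set changes, which is why thresholded recall should not be compared against recall computed on a fully binary dataset.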
Adds custom_task_graded_relevance_example.py, which defines a graded task, runs evaluate() so the same dataset feeds nDCG (graded) and MAP/MRR/recall (binary) in one call, and replays a fixed prediction matrix at thresholds 1e-9 and 2.0 to show the trade-off: nDCG@5 stays at 0.61, while MAP drops from 0.79 to 0.44, MRR from 1.0 to 0.5, and recall@5 from 0.75 to 0.67. The README gains a matching "Graded relevance (optional)" section.
Checklist