techwolf-ai · warreveys · May 6, 2026 · May 6, 2026 · May 6, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,3 +1,19 @@
+## Unreleased
+
+### Feat
+
+- graded relevance support for ranking metrics: ``RankingDataset`` accepts an
+  optional ``target_relevance`` field aligned 1-to-1 with ``target_indices``.
+  ``ndcg@k`` uses a (2^rel - 1) gain when graded labels are provided; binary
+  metrics (``map``, ``mrr``, ``recall@k``, ``hit@k``, ``rp@k``) use the grades
+  only to threshold positives. Binary nDCG is preserved when ``target_relevance``
+  is ``None``. See ``examples/custom_task_graded_relevance_example.py``.
+- ``RankingTask.binary_relevance_threshold`` (default ``1e-9``) lets a graded
+  task choose which grades count as positives for binary metrics. Items with
+  relevance below the threshold are dropped from the binary positive set but
+  still contribute to graded metrics like ``ndcg@k``. Has no effect when
+  ``target_relevance`` is ``None``.
+
 ## v0.5.1 (2026-03-13)
 
 ### Feat

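The thresholding rule in the changelog entry above can be sketched in a few lines. This is an illustrative stand-in, not the library's implementation: `binary_positives` is a hypothetical helper, the `>=` comparison follows the docstring of `examples/custom_task_graded_relevance_example.py`, and the two validation checks are assumptions about what a careful implementation might enforce.

```python
from typing import Optional, Sequence


def binary_positives(
    target_indices: Sequence[int],
    target_relevance: Optional[Sequence[float]] = None,
    threshold: float = 1e-9,
) -> list[int]:
    """Hypothetical helper: indices that count as positives for binary metrics."""
    if target_relevance is None:
        # No grades supplied: every listed index is a positive; threshold unused.
        return list(target_indices)
    # Assumed validation; the diff does not show how the library checks these.
    if len(target_relevance) != len(target_indices):
        raise ValueError("target_relevance must align 1-to-1 with target_indices")
    if any(rel < 0 for rel in target_relevance):
        raise ValueError("relevance grades must be non-negative")
    return [idx for idx, rel in zip(target_indices, target_relevance) if rel >= threshold]


# "ML engineer" row from the example dataset: ML(3), DL(3), Python(2), SWE(1).
indices, grades = [1, 5, 0, 3], [3.0, 3.0, 2.0, 1.0]
print(binary_positives(indices, grades))                 # default: all four listed skills
print(binary_positives(indices, grades, threshold=2.0))  # grade-1 SWE (idx 3) drops out
print(binary_positives(indices))                         # no grades: list is unchanged
```

Graded metrics such as `ndcg@k` would still receive the full `indices`/`grades` pair; only the binary metrics see the filtered list.
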
diff --git a/README.md b/README.md
@@ -237,11 +237,23 @@ Each aggregation provides 95% confidence intervals (replace `mean` with `ci_marg
 | --- | --- |
 | `map` | Mean Average Precision |
 | `mrr` | Mean Reciprocal Rank |
-| `ndcg@k` | Normalized Discounted Cumulative Gain with support for top-k cutoff.|
+| `ndcg@k` | Normalized Discounted Cumulative Gain with support for top-k cutoff. Uses a (2^rel - 1) gain when the dataset provides graded `target_relevance`; falls back to binary positives otherwise. See [Graded relevance](#graded-relevance-optional) for the `binary_relevance_threshold` interaction. |
 | `recall@k` | Recall at k (e.g. `recall@5`, `recall@10`) |
 | `hit@k` | Hit rate at k — binary: is any relevant item in the top-k? |
 | `rp@k` | R-Precision at k — precision relative to total relevant items |
 
+##### Graded relevance (optional)
+
+`RankingDataset` accepts an optional `target_relevance` aligned 1-to-1 with `target_indices`. When set, `ndcg@k` consumes the grades via the standard `(2^rel - 1)` gain. The other metrics (`map`, `mrr`, `recall@k`, `hit@k`, `rp@k`) remain binary — they decide what counts as a positive by thresholding the grades.
+
+That threshold is `RankingTask.binary_relevance_threshold`. **It is the knob that decides what "relevant" means for the binary metrics on a graded dataset, so changing it changes their values.**
+
+- Default `1e-9`: every listed grade > 0 is a positive, so the binary metrics match the `target_relevance=None` case exactly and legacy tasks need no migration.
+- Override on a graded task to express a stricter cutoff (e.g. `2.0` on a 1-3 scale: only grades 2 and 3 count as positives; grade-1 items still help nDCG but no longer count toward MAP).
+- Always ignored when `target_relevance is None`.
+
+When the threshold excludes some listed grades, the `recall@k` denominator is the *thresholded* positive count, not the count of all listed items, so the binary numbers are no longer directly comparable to a fully-binary version of the same data. See [examples/custom_task_graded_relevance_example.py](examples/custom_task_graded_relevance_example.py) for a runnable side-by-side of how the threshold shifts MAP/MRR/recall while leaving nDCG unchanged.
+
 **Classification metrics** (used in `ClassificationTask`):
 
 | Metric | Description |

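To make the `(2^rel - 1)` gain from the README text concrete, here is a minimal NumPy sketch of graded nDCG@k. It assumes the common `log2(rank + 1)` discount, which this diff does not spell out, and `ndcg_at_k` is a hypothetical name rather than a WorkRB function. The ranking is the one used in `_demo_threshold_effect` in the graded example file.

```python
import numpy as np


def ndcg_at_k(ranked_grades: list[float], all_grades: list[float], k: int) -> float:
    """Hypothetical helper: nDCG@k with (2^rel - 1) gain and log2(rank + 1) discount."""

    def dcg(grades: list[float]) -> float:
        top = np.asarray(grades[:k], dtype=float)
        ranks = np.arange(1, top.size + 1)
        return float(np.sum((2.0**top - 1.0) / np.log2(ranks + 1)))

    # Ideal ordering: all listed grades sorted from highest to lowest.
    ideal = dcg(sorted(all_grades, reverse=True))
    return dcg(ranked_grades) / ideal if ideal > 0 else 0.0


# Grades in predicted order; unlisted targets are grade 0.
# Ranks 1..8 hold targets 3, 0, 5, 6, 1, 7, 2, 4 from the demo.
ranked = [1.0, 3.0, 0.0, 0.0, 3.0, 0.0, 2.0, 0.0]
listed = [3.0, 3.0, 2.0, 1.0]

print(f"nDCG@5 = {ndcg_at_k(ranked, listed, k=5):.4f}")  # nDCG@5 = 0.6087
```

No threshold appears anywhere in this computation, which is consistent with nDCG staying the same under both thresholds while the binary metrics move.
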
diff --git a/examples/custom_task_example.py b/examples/custom_task_example.py
@@ -4,6 +4,10 @@
 This example demonstrates how to create a custom ranking task that can be used
 with the WorkRB framework. Custom tasks should inherit from workrb.tasks.RankingTask
 and implement the required abstract methods.
+
+Uses binary relevance: every entry in ``target_indices`` is treated as relevant=1
+by all metrics. For graded relevance (e.g. nDCG@k with a 1-2-3 scale), see
+``custom_task_graded_relevance_example.py``.
 """
 
 import workrb

diff --git a/examples/custom_task_graded_relevance_example.py b/examples/custom_task_graded_relevance_example.py
@@ -0,0 +1,236 @@
+"""
+Custom Graded-Relevance Ranking Task
+
+Extends ``custom_task_example.py`` with graded relevance: each (query, target)
+pair carries an ordinal grade instead of a binary relevant/not-relevant flag.
+
+How a graded task differs from a binary one:
+
+1. Pass an additional ``target_relevance`` list to ``RankingDataset``, aligned
+   1-to-1 with ``target_indices``. Items NOT listed in ``target_indices`` are
+   implicitly grade 0 (irrelevant).
+2. Add ``ndcg@k`` to ``default_metrics``. The (2^rel - 1) gain in nDCG uses the
+   grades.
+3. Choose ``binary_relevance_threshold`` on the task to control which graded
+   items count as positives for binary metrics (``map``, ``mrr``, ``recall@k``,
+   ``hit@k``, ``rp@k``). Items with relevance >= threshold are positives;
+   items below are dropped from the binary positive set but still contribute
+   to nDCG.
+
+Why the threshold matters
+-------------------------
+The threshold *defines* what "relevant" means for the binary metrics on a
+graded dataset. Changing it changes the values of MAP/MRR/recall/hit/rp.
+nDCG is unaffected — it always sees the full graded list.
+
+- Default ``1e-9``: every listed grade > 0 counts. Binary metrics match the
+  ``target_relevance=None`` (binary-only) case exactly. Safe default if you
+  just want graded nDCG without disturbing legacy binary metrics.
+- Stricter (e.g. ``2.0`` on a 1-3 scale): only secondary/primary count.
+  Nice-to-haves stop counting toward MAP but still help nDCG.
+
+The bottom of this file runs the same dataset under two thresholds so you
+can see nDCG stay constant while MAP/MRR/recall shift.
+
+The grading scale is up to the task. Common choices: {1, 2, 3} (this example),
+{1, 2, 3, 4} (TREC-style), or fractional in [0, 1]. Values must be non-negative.
+"""
+
+import numpy as np
+
+import workrb
+from workrb.metrics.ranking import calculate_ranking_metrics
+from workrb.registry import register_task
+from workrb.tasks.abstract.base import DatasetSplit, LabelType, Language
+from workrb.tasks.abstract.ranking_base import RankingDataset, RankingTaskGroup
+from workrb.types import ModelInputType
+
+
+@register_task()
+class GradedJob2SkillTask(workrb.tasks.RankingTask):
+    """Job-to-skill ranking with three relevance grades.
+
+    Grade 3 — primary skill (the role is defined by it).
+    Grade 2 — secondary skill (clearly expected).
+    Grade 1 — nice-to-have (mentioned but not required).
+    Grade 0 — irrelevant; never appears in ``target_indices`` (implicit).
+    """
+
+    @property
+    def name(self) -> str:
+        return "GradedJob2SkillTask"
+
+    @property
+    def description(self) -> str:
+        return "Job-to-skill ranking with graded relevance for nDCG evaluation."
+
+    @property
+    def query_input_type(self) -> ModelInputType:
+        return ModelInputType.JOB_TITLE
+
+    @property
+    def target_input_type(self) -> ModelInputType:
+        return ModelInputType.SKILL_NAME
+
+    @property
+    def default_metrics(self) -> list[str]:
+        # nDCG@k is the headline metric here. MAP/MRR/recall are kept as a
+        # binary sanity check on the high-grade subset (see the threshold
+        # below): they treat every passing graded positive as relevant=1.
+        return ["ndcg@5", "ndcg@10", "map", "mrr", "recall@5"]
+
+    @property
+    def binary_relevance_threshold(self) -> float:
+        # Only "secondary" (grade 2) and "primary" (grade 3) skills count as
+        # positives for binary metrics; "nice-to-have" (grade 1) items drop
+        # out of MAP/MRR/recall but still contribute to nDCG.
+        return 2.0
+
+    @property
+    def task_group(self) -> RankingTaskGroup:
+        return RankingTaskGroup.JOB2SKILL
+
+    @property
+    def label_type(self) -> LabelType:
+        return LabelType.MULTI_LABEL
+
+    @property
+    def supported_query_languages(self) -> list[Language]:
+        return [Language.EN]
+
+    @property
+    def supported_target_languages(self) -> list[Language]:
+        return [Language.EN]
+
+    def load_dataset(self, dataset_id: str, split: DatasetSplit) -> RankingDataset:
+        """Load a tiny in-memory graded dataset.
+
+        target_indices and target_relevance are aligned 1-to-1: position i in
+        ``target_relevance[q]`` is the grade for ``target_indices[q][i]``.
+        """
+        queries = [
+            "Machine learning engineer",
+            "Data scientist",
+            "Backend software developer",
+        ]
+
+        targets = [
+            "Python programming",  # 0
+            "Machine learning",  # 1
+            "Data analysis",  # 2
+            "Software engineering",  # 3
+            "Statistics",  # 4
+            "Deep learning",  # 5
+            "Web development",  # 6
+            "Database management",  # 7
+        ]
+
+        # Row q lists the relevant skill indices for query q; row q of
+        # target_relevance holds their grades. Unlisted skills are implicitly grade 0.
+        target_indices = [
+            [1, 5, 0, 3],  # ML engineer: ML(3), DL(3), Python(2), SWE(1)
+            [1, 2, 4, 5],  # Data scientist: ML(3), DA(3), Stats(2), DL(2)
+            [0, 3, 7, 6],  # Backend dev: Python(3), SWE(3), DB(2), Web(1)
+        ]
+        target_relevance = [
+            [3.0, 3.0, 2.0, 1.0],
+            [3.0, 3.0, 2.0, 2.0],
+            [3.0, 3.0, 2.0, 1.0],
+        ]
+
+        return RankingDataset(
+            query_texts=queries,
+            target_indices=target_indices,
+            target_space=targets,
+            dataset_id=dataset_id,
+            target_relevance=target_relevance,
+        )
+
+
+def _demo_threshold_effect() -> None:
+    """Show how ``binary_relevance_threshold`` changes binary metrics on the same data.
+
+    Construction is deliberate: a single query with positives at grades {3, 3, 2, 1},
+    and a hand-picked ranking that places the grade-1 "nice-to-have" at rank 1.
+    That makes the trade-off legible:
+
+    - At threshold ``1e-9`` (default), the nice-to-have counts as a positive, so
+      MRR is a perfect 1.0 — even though the model put a low-value item at the top.
+    - At threshold ``2.0``, the nice-to-have is dropped from the positive set, so
+      MRR falls to 0.5, the reciprocal rank of the first surviving positive (rank 2).
+    - nDCG is invariant: it always sees the full graded list, and the (2^rel - 1)
+      gain already down-weights the nice-to-have appropriately.
+
+    Same ``prediction_matrix``, same ``target_relevance`` — only the threshold differs.
+    """
+    target_indices = [[0, 1, 2, 3]]
+    target_relevance = [[3.0, 3.0, 2.0, 1.0]]  # primary, primary, secondary, nice-to-have
+    n_targets = 8
+
+    # Forced ranking (best -> worst):
+    #   rank 1: idx 3 (grade 1, nice-to-have ranked first)
+    #   rank 2: idx 0 (grade 3)
+    #   rank 5: idx 1 (grade 3, just inside top-5)
+    #   rank 7: idx 2 (grade 2, outside top-5)
+    order = [3, 0, 5, 6, 1, 7, 2, 4]
+    prediction_matrix = np.zeros((1, n_targets), dtype=float)
+    for rank, idx in enumerate(order):
+        prediction_matrix[0, idx] = float(n_targets - rank)
+
+    metrics = ["ndcg@5", "map", "mrr", "recall@5"]
+    thresholds = {
+        "1e-9 (every listed grade counts)": 1e-9,
+        "2.0  (only grade >= 2 counts)": 2.0,
+    }
+
+    print("\n--- Effect of binary_relevance_threshold on the same dataset ---")
+    print("Same prediction_matrix, same target_relevance — only the threshold differs.\n")
+    header = f"{'threshold':<36}" + "".join(f"{m:>10}" for m in metrics)
+    print(header)
+    print("-" * len(header))
+    for label, threshold in thresholds.items():
+        results = calculate_ranking_metrics(
+            prediction_matrix=prediction_matrix,
+            pos_label_idxs=target_indices,
+            metrics=metrics,
+            pos_label_relevance=target_relevance,
+            binary_relevance_threshold=threshold,
+        )
+        row = f"{label:<36}" + "".join(f"{results[m]:>10.4f}" for m in metrics)
+        print(row)
+    print(
+        "\nnDCG@5 is identical across thresholds: it always consumes the full graded list."
+        "\nMAP / MRR / recall@5 shift because the threshold redefines the binary positive set:"
+        "\n  - 1e-9: the nice-to-have (grade 1) at rank 1 counts as a positive, so MRR=1.0"
+        "\n          and the model 'looks great' on binary metrics despite ranking a"
+        "\n          low-value item first."
+        "\n  - 2.0:  the nice-to-have is dropped from the positive set; MRR drops to 0.5,"
+        "\n          the reciprocal rank of the first surviving positive; MAP drops, and"
+        "\n          recall@5 is computed against a smaller denominator (3 instead of 4)."
+        "\nLesson: on a graded dataset, binary metrics answer 'how does the model rank"
+        "\nthe items I declared relevant', and *what counts as relevant* is exactly the"
+        "\nthreshold. Pick it deliberately."
+    )
+
+
+if __name__ == "__main__":
+    print("🚀 Custom Graded-Relevance Task Example")
+    print("=" * 50)
+
+    model = workrb.models.BiEncoderModel("all-MiniLM-L6-v2")
+    tasks = [GradedJob2SkillTask(languages=["en"], split="test")]
+
+    # Run with the task's default_metrics, which mix nDCG (graded) with
+    # MAP/MRR/recall (binary). The same dataset feeds both: graded metrics
+    # consult target_relevance, binary metrics see only the thresholded set.
+    results = workrb.evaluate(
+        model,
+        tasks,
+        output_folder="results/graded_task_demo",
+        description="Custom ranking task with graded relevance.",
+        force_restart=True,
+    )
+    print(results)
+
+    # Show why binary_relevance_threshold matters: same data, two thresholds.
+    _demo_threshold_effect()
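
For readers who want to check the demo's binary numbers by hand, the sketch below recomputes MRR, average precision, and recall@5 for the same single query in plain Python. The helper name is hypothetical and the formulas are the textbook definitions, assumed (not confirmed by this diff) to match `calculate_ranking_metrics`. Recall@5 divides by the thresholded positive count, as the README text states.

```python
def binary_metrics(order: list[int], positives: set[int], k: int) -> dict[str, float]:
    """Hypothetical helper: textbook MRR, AP, and recall@k for one query."""
    hit_ranks = [rank for rank, idx in enumerate(order, start=1) if idx in positives]
    if not hit_ranks:
        return {"mrr": 0.0, "map": 0.0, f"recall@{k}": 0.0}
    # AP: mean of the precision measured at each rank where a positive appears.
    ap = sum(n / rank for n, rank in enumerate(hit_ranks, start=1)) / len(positives)
    recall = sum(rank <= k for rank in hit_ranks) / len(positives)
    return {"mrr": 1.0 / hit_ranks[0], "map": ap, f"recall@{k}": recall}


order = [3, 0, 5, 6, 1, 7, 2, 4]           # demo ranking, best -> worst
grades = {0: 3.0, 1: 3.0, 2: 2.0, 3: 1.0}  # target index -> grade

for threshold in (1e-9, 2.0):
    positives = {idx for idx, rel in grades.items() if rel >= threshold}
    scores = binary_metrics(order, positives, k=5)
    print(threshold, {name: round(value, 4) for name, value in scores.items()})
# 1e-09 {'mrr': 1.0, 'map': 0.7929, 'recall@5': 0.75}
# 2.0 {'mrr': 0.5, 'map': 0.4429, 'recall@5': 0.6667}
```

With a single query, MAP equals that query's average precision, so these figures should line up with the table the demo prints if the library follows the same definitions.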