refactor: generalize dataset indexing from language-based to dataset_id-based #34
federetyk wants to merge 5 commits into techwolf-ai:main
Conversation
```diff
@@ -327,11 +327,9 @@ def _run_pending_work(
                language_results={},
```
Best that we also rename `language_results` to `datasetid_results`, for consistency.
```python
def _aggregate_per_language(
    self,
    tag_name: str = "mean_per_language",
    aggregations: tuple = ("mean", "stderr", "ci_margin"),
```
I would suggest we make the monolingual aspect explicit here by providing an argument `aggregation_mode="monolingual_only"`. If any other mode is provided, we raise an exception.
In the future we would also want to allow cross-lingual aggregation to be included, for example; otherwise many of the MELO dataset runs would just be ignored.
Additionally, if we ignore the cross-lingual results, we should actually prevent the model from running them and filter out these tasks when they are not in the final evaluation metrics. I'd suggest we add a preprocessing step that raises an exception if the metric calculation would ignore any of the included tasks.
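Something along these lines, where `aggregation_mode` and the guard are only my suggestion, not part of this PR:

```python
# Hypothetical sketch: the `aggregation_mode` argument is a suggestion,
# not part of this PR. The other parameters mirror the diff above.
def _aggregate_per_language(
    self,
    tag_name: str = "mean_per_language",
    aggregations: tuple = ("mean", "stderr", "ci_margin"),
    aggregation_mode: str = "monolingual_only",
):
    # Fail loudly on any mode we don't support yet, instead of silently
    # dropping results.
    if aggregation_mode != "monolingual_only":
        raise ValueError(
            f"Unsupported aggregation_mode {aggregation_mode!r}; "
            "only 'monolingual_only' is implemented for now."
        )
    ...
```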
```python
assert task.datasets["en_region_a"].dataset_id == "en_region_a"
assert task.datasets["en_region_b"].dataset_id == "en_region_b"
```
We should add a test that checks for the scenario where the metric calculation mode (e.g. `monolingual_only`, `grouped_target_language`, or `grouped_query_language`) does not match the tasks in the dataset. For example, MELO with cross-lingual datasets shouldn't run any cross-lingual tasks if `monolingual_only` is turned on, because the cross-lingual results would be ignored.
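A rough sketch of such a test; `check_tasks_match_mode` is a suggested helper rather than the PR's API, while `task.datasets` and `get_dataset_language` come from this PR:

```python
import pytest

# Illustrative sketch: check_tasks_match_mode is a suggested helper, not the
# PR's API; task.datasets and get_dataset_language follow the PR's convention
# that get_dataset_language returns None for cross-/multilingual datasets.
class FakeCrossLingualTask:
    datasets = {"en_fr": object()}  # one cross-lingual dataset_id

    def get_dataset_language(self, dataset_id):
        return None  # cross-lingual, per the PR's convention


def check_tasks_match_mode(tasks, aggregation_mode: str) -> None:
    if aggregation_mode != "monolingual_only":
        raise ValueError(f"Unknown aggregation mode: {aggregation_mode!r}")
    for task in tasks:
        for dataset_id in task.datasets:
            if task.get_dataset_language(dataset_id) is None:
                raise ValueError(
                    f"Dataset {dataset_id!r} is not monolingual and would be "
                    "ignored by 'monolingual_only' aggregation."
                )


def test_monolingual_only_rejects_cross_lingual_datasets():
    with pytest.raises(ValueError, match="not monolingual"):
        check_tasks_match_mode([FakeCrossLingualTask()], "monolingual_only")
```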
| return "" | ||
|
|
||
| def get_dataset_language(self, dataset_id: str) -> Language | None: | ||
| """Return the language of a dataset if it's monolingual, None otherwise. |
Would clarify the first-line description as to the purpose of this method, and when it needs to be overridden.
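One possible wording, offered only as a suggestion:

```python
# Suggested docstring wording only; the signature is from the PR.
def get_dataset_language(self, dataset_id: str) -> Language | None:
    """Map a dataset_id back to its Language for per-language aggregation.

    Returns None for cross-lingual or multilingual datasets. Override this
    in tasks whose dataset_ids do not map 1:1 to a single language.
    """
```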
Mattdl left a comment
I left some comments. In general, I think this is a great PR because:
- the `dataset_id` allows decoupling the execution of the evaluation loop from the evaluation strategy. Additionally, it allows for backwards compatibility.
- The aggregation logic for the metrics can be tied to the `dataset_id`. Each `dataset_id` has a required mapping to the language.
One issue with the current implementation is that the metric aggregation just ignores cross-lingual tasks. Hence, we would not want to execute these, as they are ignored in the final result calculation. So we need to find a way to include the cross-lingual results in the final metric calculation.
My proposal is to stick by default to a metric aggregation mode that can be passed by the user to the benchmark (in `workrb.evaluate`). Then there are two steps involved:

- Check if all defined tasks match the defined mode (otherwise we raise, or we need to filter the underlying tasks, which could also be controlled by a user-defined mode).
- Then, in the aggregation step, we need to make the distinction between the `input_language` and `output_language`. This would require a refactor of all tasks to define both (they are the same for monolingual tasks). The metric aggregation logic then looks at one of the modes below (see the sketch after this list):
  - `monolingual_only` (default): checks that each task's input and output languages are the same, and only aggregates on those.
  - `crosslingual_group_input_languages`: would aggregate all results based on `input_language`.
  - `crosslingual_group_output_languages`: would aggregate all results based on `output_language`.
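A rough sketch of that aggregation dispatch; all names here (`input_language`, `output_language`, the mode strings, `TaskResult`) come from this proposal rather than the current code:

```python
from collections import defaultdict
from dataclasses import dataclass

# Sketch of the proposed modes. input_language / output_language do not exist
# yet; they would come from the task refactor described above.
@dataclass
class TaskResult:
    input_language: str
    output_language: str
    score: float


def aggregate(results: list[TaskResult], mode: str = "monolingual_only") -> dict[str, float]:
    groups: dict[str, list[float]] = defaultdict(list)
    for r in results:
        if mode == "monolingual_only":
            # Step 1 (validation) should already have raised on, or filtered
            # out, cross-lingual tasks, so input == output here.
            groups[r.input_language].append(r.score)
        elif mode == "crosslingual_group_input_languages":
            groups[r.input_language].append(r.score)
        elif mode == "crosslingual_group_output_languages":
            groups[r.output_language].append(r.score)
        else:
            raise ValueError(f"Unknown aggregation mode: {mode!r}")
    # Per-language mean; stderr/ci_margin would follow the same grouping.
    return {lang: sum(s) / len(s) for lang, s in groups.items()}
```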
Addresses #33
Description
This PR generalizes dataset indexing within tasks from the `Language` enum to arbitrary string identifiers (`dataset_id`). The current architecture limits each task to at most one dataset per language, which prevents supporting tasks with multiple monolingual datasets per language, cross-lingual datasets, or multilingual datasets.

The refactor introduces a `languages_to_dataset_ids()` method with a default 1:1 mapping that preserves backward compatibility for existing tasks. Tasks that require more complex dataset structures can override this method to return custom identifiers. A new `get_dataset_language()` method maps datasets back to their language for proper per-language result aggregation, returning `None` for cross-lingual or multilingual datasets. A sketch of these defaults follows.
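A minimal sketch of the backward-compatible defaults described above; the exact signatures in the PR may differ, and `Language` here is a two-member stand-in for the real enum:

```python
from enum import Enum

# Stand-in for workrb's Language enum, for illustration only.
class Language(str, Enum):
    EN = "en"
    FR = "fr"


class Task:
    def languages_to_dataset_ids(self, languages: list[Language]) -> list[str]:
        # Default 1:1 mapping: one dataset_id per language, preserving the
        # old one-dataset-per-language behavior.
        return [language.value for language in languages]

    def get_dataset_language(self, dataset_id: str) -> "Language | None":
        # Default inverse of the mapping above; tasks with cross-lingual or
        # multilingual datasets override this and return None for those ids.
        try:
            return Language(dataset_id)
        except ValueError:
            return None
```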
Changes:

- Renamed `lang_datasets: dict[Language, Dataset]` to `datasets: dict[str, Dataset]` in the `Task` base class
- Added a `languages_to_dataset_ids(languages) -> list[str]` method with a default backward-compatible mapping
- Renamed `load_monolingual_data(language, split)` to `load_dataset(dataset_id, split)` across all tasks
- Added a `get_dataset_language(dataset_id) -> Language | None` method for per-language aggregation
- Added a `language` field to `MetricsResult` to track dataset language
- Updated `_aggregate_per_language()` to group by the `language` field, skipping datasets marked as cross-lingual or multilingual
- Updated `examples/`

All tests pass, and the output of `examples/run_multiple_models.py` produces results consistent with the main branch.

Checklist