
refactor: generalize dataset indexing from language-based to dataset_id-based#34

Open
federetyk wants to merge 5 commits into techwolf-ai:main from federetyk:refactor/generalize-dataset-indexing

Conversation

@federetyk
Contributor

Addresses #33

Description

This PR generalizes dataset indexing within tasks from the Language enum to arbitrary string identifiers (dataset_id). The current architecture limits each task to at most one dataset per language, which prevents supporting tasks with multiple monolingual datasets per language, cross-lingual datasets, or multilingual datasets.

The refactor introduces a languages_to_dataset_ids() method with a default 1:1 mapping that preserves backward compatibility for existing tasks. Tasks that require more complex dataset structures can override this method to return custom identifiers. A new get_dataset_language() method maps datasets back to their language for proper per-language result aggregation, returning None for cross-lingual or multilingual datasets.
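
For illustration, here is a minimal sketch of what the default 1:1 mapping and its inverse could look like. The Language members and method bodies are assumptions for the example, not the PR's actual code:

from enum import Enum


class Language(str, Enum):
    # Illustrative subset; the real enum presumably defines more languages.
    EN = "en"
    FR = "fr"


class Task:
    def languages_to_dataset_ids(self, languages: list[Language]) -> list[str]:
        # Default 1:1 mapping: one dataset per language, keyed by its code.
        return [language.value for language in languages]

    def get_dataset_language(self, dataset_id: str) -> Language | None:
        # Inverse of the default mapping; None signals a cross-lingual or
        # multilingual dataset that per-language aggregation should skip.
        try:
            return Language(dataset_id)
        except ValueError:
            return None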

Changes:

  • Rename lang_datasets: dict[Language, Dataset] to datasets: dict[str, Dataset] in Task base class
  • Add languages_to_dataset_ids(languages) -> list[str] method with default backward-compatible mapping
  • Rename load_monolingual_data(language, split) to load_dataset(dataset_id, split) across all tasks
  • Add get_dataset_language(dataset_id) -> Language | None method for per-language aggregation
  • Add language field to MetricsResult to track dataset language
  • Update _aggregate_per_language() to group by the language field, skipping datasets marked as cross-lingual or multilingual
  • Update all task implementations to use the new method signature
  • Add unit test for multi-dataset task scenarios
  • Fix minor issues in some files in examples/

All tests pass, and the output of examples/run_multiple_models.py produces results consistent with the main branch.
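
As a hypothetical example of the added flexibility (building on the Task/Language sketch above, with dataset_ids borrowed from the unit test quoted below), a task with two regional English datasets could override the mapping like this:

class RegionalEnglishTask(Task):
    # Hypothetical task with two monolingual English datasets.
    def languages_to_dataset_ids(self, languages: list[Language]) -> list[str]:
        if languages == [Language.EN]:
            return ["en_region_a", "en_region_b"]
        return super().languages_to_dataset_ids(languages)

    def get_dataset_language(self, dataset_id: str) -> Language | None:
        # Both regional datasets are monolingual English.
        if dataset_id in ("en_region_a", "en_region_b"):
            return Language.EN
        return super().get_dataset_language(dataset_id)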

Checklist

  • Added new tests for new functionality
  • Tested locally with example tasks
  • Code follows project style guidelines
  • Documentation updated
  • No new warnings introduced

@@ -327,11 +327,9 @@ def _run_pending_work(
language_results={},
Collaborator

Best that we also rename 'language_results' to 'datasetid_results' for consistency.

def _aggregate_per_language(
self,
tag_name: str = "mean_per_language",
aggregations: tuple = ("mean", "stderr", "ci_margin"),
Collaborator

I would suggest we make the monolingual aspect explicit here by providing an argument aggregation_mode="monolingual_only". If any other mode is provided, we raise an exception.

In the future we would also want to allow cross_lingual to be included, for example; otherwise many of the MELO dataset runs would just be ignored.

Additionally, if we ignore the cross-lingual results, we should prevent the model from running them by filtering out these tasks when they do not contribute to the final evaluation metrics. I'd suggest a preprocessing step that raises an exception if the metric calculation would ignore any of the included tasks.
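
A rough sketch of the suggested signature; the guard and error message are illustrative, not a proposed final form:

def _aggregate_per_language(
    self,
    tag_name: str = "mean_per_language",
    aggregations: tuple = ("mean", "stderr", "ci_margin"),
    aggregation_mode: str = "monolingual_only",
):
    # Only the monolingual mode is supported for now; anything else raises.
    if aggregation_mode != "monolingual_only":
        raise ValueError(f"Unsupported aggregation_mode: {aggregation_mode!r}")
    ...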

assert task.datasets["en_region_a"].dataset_id == "en_region_a"
assert task.datasets["en_region_b"].dataset_id == "en_region_b"


Collaborator

We should add a test that checks the scenario where the metric calculation mode (e.g. monolingual_only, grouped_target_language, or grouped_query_language) does not match the tasks in the dataset. For example, MELO with cross-lingual datasets shouldn't run any cross-lingual tasks if monolingual_only is turned on, because the cross-lingual results would be ignored.
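
A sketch of such a test, assuming pytest; MeloCrossLingualTask and evaluate() are hypothetical names standing in for the real task and entry point:

import pytest

def test_monolingual_only_rejects_crosslingual_tasks():
    # Hypothetical task that defines cross-lingual dataset_ids.
    task = MeloCrossLingualTask()
    # Mismatched mode and tasks should fail fast rather than silently
    # dropping results from the final metrics.
    with pytest.raises(ValueError):
        evaluate(tasks=[task], aggregation_mode="monolingual_only")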

return ""

def get_dataset_language(self, dataset_id: str) -> Language | None:
"""Return the language of a dataset if it's monolingual, None otherwise.
Collaborator

I would clarify the first-line description to state the purpose of this method and when it needs to be overridden.
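
For example, the docstring could spell out the default contract and the override requirement; the wording here is illustrative:

def get_dataset_language(self, dataset_id: str) -> Language | None:
    """Map a dataset_id back to its language for per-language aggregation.

    Returns None for cross-lingual or multilingual datasets, which the
    per-language aggregation skips. Tasks that override
    languages_to_dataset_ids() with custom identifiers must also override
    this method, since the default assumes the 1:1 language mapping.
    """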

Collaborator

@Mattdl Mattdl left a comment

I left some comments. In general, I think this is a great PR because:

  1. The dataset_id decouples the execution of the evaluation loop from the evaluation strategy. Additionally, it allows for backward compatibility.
  2. The aggregation logic for the metrics can be tied to the dataset_id. Each dataset_id has a required mapping to the language.

One issue with the current implementation is that the metric aggregation simply ignores cross-lingual tasks. Hence, we would not want to execute these, as they are ignored in the final result calculation. So we need to find a way to include the cross-lingual results in the final metric calculation.

My proposal is to default to a metric aggregation mode that the user can define for the benchmark (in workrb.evaluate). Two steps are then involved:

  1. Check whether all defined tasks match the defined mode (otherwise we raise, or we filter the underlying tasks, which could also be controlled by a user-defined mode).
  2. Then, in the aggregation step, we need to distinguish between the input_language and output_language. This would require a refactor of all tasks to define both (they are the same for monolingual tasks). The metric aggregation logic then looks at the following modes (see the sketch after this list):
    1. monolingual_only (default): checks that each task's input and output languages are the same, and only aggregates those.
    2. crosslingual_group_input_languages: aggregates all results based on input_language
    3. crosslingual_group_output_languages: aggregates all results based on output_language
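
A sketch of how these modes could dispatch on per-result languages; AggregationMode, input_language, and output_language are assumptions taken from the proposal above, not existing code:

from enum import Enum

class AggregationMode(Enum):
    MONOLINGUAL_ONLY = "monolingual_only"
    CROSSLINGUAL_GROUP_INPUT_LANGUAGES = "crosslingual_group_input_languages"
    CROSSLINGUAL_GROUP_OUTPUT_LANGUAGES = "crosslingual_group_output_languages"

def aggregation_key(result, mode: AggregationMode):
    # result is assumed to expose input_language and output_language fields.
    if mode is AggregationMode.MONOLINGUAL_ONLY:
        # Fail fast instead of silently dropping cross-lingual results.
        if result.input_language != result.output_language:
            raise ValueError("monolingual_only requires matching languages")
        return result.input_language
    if mode is AggregationMode.CROSSLINGUAL_GROUP_INPUT_LANGUAGES:
        return result.input_language
    return result.output_language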
