refactor: generalize dataset indexing from language-based to dataset_id-based #34
federetyk wants to merge 5 commits into techwolf-ai:main
Conversation
```diff
@@ -327,11 +327,9 @@ def _run_pending_work(
                language_results={},
```
Best that we also rename `language_results` to `datasetid_results`, for consistency.
```python
def _aggregate_per_language(
    self,
    tag_name: str = "mean_per_language",
    aggregations: tuple = ("mean", "stderr", "ci_margin"),
```
I would suggest we make the monolingual aspect explicit here by providing an argument `aggregation_mode="monolingual_only"`. If any other mode is provided, we raise an exception.
In the future we would also want to allow cross-lingual aggregation to be included, for example; otherwise many of the MELO dataset runs would just be ignored.
Additionally, if we ignore the cross-lingual results, we should actually prevent the model from running them and filter out these tasks when they are not in the final evaluation metrics. I'd suggest we add a preprocessing step that raises an exception if the metric calculation would ignore any of the included tasks.
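Something along these lines, where `aggregation_mode` and the guard are only my suggestion, not part of this PR:

```python
# Hypothetical sketch: the `aggregation_mode` argument is a suggestion,
# not part of this PR. The other parameters mirror the diff above.
def _aggregate_per_language(
    self,
    tag_name: str = "mean_per_language",
    aggregations: tuple = ("mean", "stderr", "ci_margin"),
    aggregation_mode: str = "monolingual_only",
):
    # Fail loudly on any mode we don't support yet, instead of silently
    # dropping results.
    if aggregation_mode != "monolingual_only":
        raise ValueError(
            f"Unsupported aggregation_mode {aggregation_mode!r}; "
            "only 'monolingual_only' is implemented for now."
        )
    ...
```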
```python
assert task.datasets["en_region_a"].dataset_id == "en_region_a"
assert task.datasets["en_region_b"].dataset_id == "en_region_b"
```
We should add a test that checks for the scenario where the metric calculation mode (e.g. `monolingual_only`, `grouped_target_language`, or `grouped_query_language`) does not match the tasks in the dataset. For example, MELO with cross-lingual datasets shouldn't run any cross-lingual tasks if `monolingual_only` is turned on, because the cross-lingual results would be ignored.
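A rough sketch of such a test; `check_tasks_match_mode` is a suggested helper rather than the PR's API, while `task.datasets` and `get_dataset_language` come from this PR:

```python
import pytest

# Illustrative sketch: check_tasks_match_mode is a suggested helper, not the
# PR's API; task.datasets and get_dataset_language follow the PR's convention
# that get_dataset_language returns None for cross-/multilingual datasets.
class FakeCrossLingualTask:
    datasets = {"en_fr": object()}  # one cross-lingual dataset_id

    def get_dataset_language(self, dataset_id):
        return None  # cross-lingual, per the PR's convention


def check_tasks_match_mode(tasks, aggregation_mode: str) -> None:
    if aggregation_mode != "monolingual_only":
        raise ValueError(f"Unknown aggregation mode: {aggregation_mode!r}")
    for task in tasks:
        for dataset_id in task.datasets:
            if task.get_dataset_language(dataset_id) is None:
                raise ValueError(
                    f"Dataset {dataset_id!r} is not monolingual and would be "
                    "ignored by 'monolingual_only' aggregation."
                )


def test_monolingual_only_rejects_cross_lingual_datasets():
    with pytest.raises(ValueError, match="not monolingual"):
        check_tasks_match_mode([FakeCrossLingualTask()], "monolingual_only")
```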
| return "" | ||
|
|
||
| def get_dataset_language(self, dataset_id: str) -> Language | None: | ||
| """Return the language of a dataset if it's monolingual, None otherwise. |
Would clarify the first-line description as to the purpose of this method, and when it needs to be overridden.
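One possible wording, offered only as a suggestion:

```python
# Suggested docstring wording only; the signature is from the PR.
def get_dataset_language(self, dataset_id: str) -> Language | None:
    """Map a dataset_id back to its Language for per-language aggregation.

    Returns None for cross-lingual or multilingual datasets. Override this
    in tasks whose dataset_ids do not map 1:1 to a single language.
    """
```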
Mattdl left a comment
I left some comments. In general, I think this is a great PR because:
- the `dataset_id` allows decoupling the execution of the evaluation loop from the evaluation strategy. Additionally, it allows for backwards compatibility.
- The aggregation logic for the metrics can be tied to the `dataset_id`. Each `dataset_id` has a required mapping to the language.
One issue with the current implementation is that the metric aggregation just ignores cross-lingual tasks. Hence, we would not want to execute these, as they are ignored in the final result calculation. So we need to find a way to include the cross-lingual results in the final metric calculation.
My proposal is to stick by default to a metric aggregation mode that can be passed by the user to the benchmark (in `workrb.evaluate`). Then there are two steps involved:

- Check if all defined tasks match the defined mode (otherwise we raise, or we need to filter the underlying tasks, which could also be controlled by a user-defined mode).
- Then, in the aggregation step, we need to make the distinction between the `input_language` and `output_language`. This would require a refactor of all tasks to define both (they are the same for monolingual tasks). The metric aggregation logic then looks at one of the modes below (see the sketch after this list):
  - `monolingual_only` (default): checks that each task's input and output languages are the same, and only aggregates on those.
  - `crosslingual_group_input_languages`: would aggregate all results based on `input_language`.
  - `crosslingual_group_output_languages`: would aggregate all results based on `output_language`.
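A rough sketch of that aggregation dispatch; all names here (`input_language`, `output_language`, the mode strings, `TaskResult`) come from this proposal rather than the current code:

```python
from collections import defaultdict
from dataclasses import dataclass

# Sketch of the proposed modes. input_language / output_language do not exist
# yet; they would come from the task refactor described above.
@dataclass
class TaskResult:
    input_language: str
    output_language: str
    score: float


def aggregate(results: list[TaskResult], mode: str = "monolingual_only") -> dict[str, float]:
    groups: dict[str, list[float]] = defaultdict(list)
    for r in results:
        if mode == "monolingual_only":
            # Step 1 (validation) should already have raised on, or filtered
            # out, cross-lingual tasks, so input == output here.
            groups[r.input_language].append(r.score)
        elif mode == "crosslingual_group_input_languages":
            groups[r.input_language].append(r.score)
        elif mode == "crosslingual_group_output_languages":
            groups[r.output_language].append(r.score)
        else:
            raise ValueError(f"Unknown aggregation mode: {mode!r}")
    # Per-language mean; stderr/ci_margin would follow the same grouping.
    return {lang: sum(s) / len(s) for lang, s in groups.items()}
```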
Addresses #33
Description
This PR generalizes dataset indexing within tasks from the `Language` enum to arbitrary string identifiers (`dataset_id`). The current architecture limits each task to at most one dataset per language, which prevents supporting tasks with multiple monolingual datasets per language, cross-lingual datasets, or multilingual datasets.

The refactor introduces a `languages_to_dataset_ids()` method with a default 1:1 mapping that preserves backward compatibility for existing tasks. Tasks that require more complex dataset structures can override this method to return custom identifiers. A new `get_dataset_language()` method maps datasets back to their language for proper per-language result aggregation, returning `None` for cross-lingual or multilingual datasets. A sketch of these defaults follows.
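A minimal sketch of the backward-compatible defaults described above; the exact signatures in the PR may differ, and `Language` here is a two-member stand-in for the real enum:

```python
from enum import Enum

# Stand-in for workrb's Language enum, for illustration only.
class Language(str, Enum):
    EN = "en"
    FR = "fr"


class Task:
    def languages_to_dataset_ids(self, languages: list[Language]) -> list[str]:
        # Default 1:1 mapping: one dataset_id per language, preserving the
        # old one-dataset-per-language behavior.
        return [language.value for language in languages]

    def get_dataset_language(self, dataset_id: str) -> "Language | None":
        # Default inverse of the mapping above; tasks with cross-lingual or
        # multilingual datasets override this and return None for those ids.
        try:
            return Language(dataset_id)
        except ValueError:
            return None
```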
Changes:

- Renamed `lang_datasets: dict[Language, Dataset]` to `datasets: dict[str, Dataset]` in the `Task` base class
- Added a `languages_to_dataset_ids(languages) -> list[str]` method with a default backward-compatible mapping
- Renamed `load_monolingual_data(language, split)` to `load_dataset(dataset_id, split)` across all tasks
- Added a `get_dataset_language(dataset_id) -> Language | None` method for per-language aggregation
- Added a `language` field to `MetricsResult` to track dataset language
- Updated `_aggregate_per_language()` to group by the `language` field, skipping datasets marked as cross-lingual or multilingual
- Updated `examples/`

All tests pass, and the output of `examples/run_multiple_models.py` produces results consistent with the main branch.

Checklist