Make parquet the default data format (#637) #715
kartikmandar wants to merge 2 commits into openml:master
Conversation
Walkthrough: OpenML dataset initialization becomes format-configurable by reading the `default_data_format` option.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/unit/amlb/datasets/openml/test_openml_dataloader.py (1)
14-24: Consider adding a test for custom format configuration.

The current fixture tests the default behavior (parquet fallback). For more comprehensive coverage, consider adding a test that explicitly sets `default_data_format` to verify the configuration is respected. Example test fixture for a custom format:

```python
@pytest.fixture
def oml_config_arff():
    return from_configs(
        ns(
            input_dir="my_input",
            output_dir="my_output",
            user_dir="my_user_dir",
            root_dir="my_root_dir",
            openml=ns(
                apikey="c1994bdb7ecb3c6f3c8f3b35f4b47f1f",
                infer_dtypes=False,
                default_data_format="arff",
            ),
        )
    ).config
```
📒 Files selected for processing (4)

- amlb/datasets/openml.py (1 hunks)
- amlb/datautils.py (3 hunks)
- resources/config.yaml (1 hunks)
- tests/unit/amlb/datasets/openml/test_openml_dataloader.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
amlb/datautils.py (1)
- amlb/utils/os.py (1): split_path (30-33)
🪛 Ruff (0.14.6)
amlb/datautils.py
- 159-162: Avoid specifying long messages outside the exception class (TRY003)
- 178-178: Consider `[*ori[:dest], src, *ori[dest:src], *ori[src + 1:]]` instead of concatenation (RUF005)
🔇 Additional comments (6)
resources/config.yaml (1)
105-107: Well-documented configuration option.

The new `default_data_format` option is clearly documented with allowed values and rationale for the default choice. This provides good backward compatibility through configuration.

amlb/datautils.py (3)
122-162: Clean dispatch logic for multi-format support.

The refactoring follows a clean pattern: extract the extension, then dispatch to a format-specific handler. The case-insensitive extension check and informative error message for unsupported formats are good practices.
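The described pattern can be sketched standalone. This is not the repository's actual code; the function and handler names below are stand-ins mirroring the review's description:

```python
import os

def reorder_dataset(path, target_src=0, target_dest=-1):
    """Dispatch to a format-specific handler based on the file extension."""
    ext = os.path.splitext(path)[1].lower()  # case-insensitive extension check
    handlers = {
        ".arff": _reorder_arff,
        ".parquet": _reorder_parquet,
        ".csv": _reorder_csv,
    }
    if ext not in handlers:
        raise ValueError(f"Unsupported file format for column reordering: {ext}")
    return handlers[ext](path, target_src, target_dest)

# Stub handlers so the sketch runs standalone; the real ones would load the
# file, reorder the columns, and save or return the data.
def _reorder_arff(path, src, dest):
    return ("arff", path)

def _reorder_parquet(path, src, dest):
    return ("parquet", path)

def _reorder_csv(path, src, dest):
    return ("csv", path)
```

A dictionary dispatch keeps adding a new format to a single table rather than extending an if/elif chain.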
181-211: ARFF reordering implementation looks correct.

The logic properly loads the ARFF file, computes the new column order, applies it to both attributes and data, and handles the save vs. return-data paths correctly.
214-253: Consistent implementation for Parquet and CSV handlers.

Both functions follow the same pattern as the ARFF handler, maintaining consistency. The CSV writer correctly uses `index=False` to avoid an extra index column.

amlb/datasets/openml.py (1)
339-349: Robust configuration handling with graceful fallback.

The implementation properly reads the configurable format with a fallback to "parquet", validates against supported formats, and logs a helpful warning if the configuration is invalid. This ensures backward compatibility while enabling user customization.
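The fallback behavior described above amounts to logic along these lines. This is a standalone sketch; the function name and constant are illustrative, not the repository's actual API:

```python
import logging

log = logging.getLogger(__name__)

SUPPORTED_DATA_FORMATS = ("arff", "csv", "parquet")

def resolve_data_format(config_value=None):
    """Return a supported data format, falling back to 'parquet'.

    Missing values and unknown formats both resolve to the default,
    with a warning for the latter so misconfiguration is visible.
    """
    fmt = (config_value or "parquet").lower()
    if fmt not in SUPPORTED_DATA_FORMATS:
        log.warning(
            "Unsupported default_data_format %r; falling back to 'parquet'.",
            config_value,
        )
        return "parquet"
    return fmt
```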
tests/unit/amlb/datasets/openml/test_openml_dataloader.py (1)
138-166: Good test adaptation for configurable default format.

The test correctly derives the expected format from `dataset.train.format` rather than assuming a fixed format. The explicit verification of all supported formats (arff, csv, parquet) via `data_path()` ensures comprehensive coverage of the format handling logic.
```python
def _reorder_columns(columns: list, target_src: int, target_dest: int) -> list | None:
    """Calculate the new column order. Returns None if no reordering needed."""
    n_cols = len(columns)
    src = n_cols + 1 + target_src if target_src < 0 else target_src
    dest = n_cols + 1 + target_dest if target_dest < 0 else target_dest
    if src == dest:
        return None
    ori = list(range(n_cols))
    if src < dest:
        return ori[:src] + ori[src + 1 : dest] + [src] + ori[dest:]
    else:
        return ori[:dest] + [src] + ori[dest:src] + ori[src + 1 :]
```
Off-by-one error in negative index handling.
The formula for converting negative indices is incorrect. For a list with n_cols=5, index -1 should map to 4 (the last element), but the current formula n_cols + 1 + target_src yields 5 + 1 + (-1) = 5, which is out of bounds.
Apply this diff to fix the off-by-one error:
```diff
 def _reorder_columns(columns: list, target_src: int, target_dest: int) -> list | None:
     """Calculate the new column order. Returns None if no reordering needed."""
     n_cols = len(columns)
-    src = n_cols + 1 + target_src if target_src < 0 else target_src
-    dest = n_cols + 1 + target_dest if target_dest < 0 else target_dest
+    src = n_cols + target_src if target_src < 0 else target_src
+    dest = n_cols + target_dest if target_dest < 0 else target_dest
     if src == dest:
         return None
```
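The flagged conversion can be checked in isolation; this standalone snippet is not the repository's code, just the two formulas applied to the review's 5-column example:

```python
# Original vs. corrected negative-index conversion for a 5-column list.
n_cols = 5
buggy = n_cols + 1 + (-1)  # yields 5, one past the last valid index
fixed = n_cols + (-1)      # yields 4, the last valid index

assert buggy not in range(n_cols)
assert fixed == list(range(n_cols))[-1]
```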
Summary
Changes
Backward Compatibility
Users can restore previous behavior by setting:

```yaml
openml:
  default_data_format: arff
```