Make parquet the default data format (#637) #715
kartikmandar wants to merge 2 commits into openml:master
Conversation
Walkthrough: OpenML dataset initialization becomes format-configurable by reading the `default_data_format` option.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/unit/amlb/datasets/openml/test_openml_dataloader.py (1)
14-24: Consider adding a test for custom format configuration.

The current fixture tests the default behavior (parquet fallback). For more comprehensive coverage, consider adding a test that explicitly sets `default_data_format` to verify the configuration is respected. Example test fixture for a custom format:

```python
@pytest.fixture
def oml_config_arff():
    return from_configs(
        ns(
            input_dir="my_input",
            output_dir="my_output",
            user_dir="my_user_dir",
            root_dir="my_root_dir",
            openml=ns(
                apikey="c1994bdb7ecb3c6f3c8f3b35f4b47f1f",
                infer_dtypes=False,
                default_data_format="arff",
            ),
        )
    ).config
```
📒 Files selected for processing (4)

- amlb/datasets/openml.py (1 hunks)
- amlb/datautils.py (3 hunks)
- resources/config.yaml (1 hunks)
- tests/unit/amlb/datasets/openml/test_openml_dataloader.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
amlb/datautils.py (1)
- amlb/utils/os.py (1): split_path (30-33)
🪛 Ruff (0.14.6)
amlb/datautils.py
- 159-162: Avoid specifying long messages outside the exception class (TRY003)
- 178-178: Consider `[*ori[:dest], src, *ori[dest:src], *ori[src + 1:]]` instead of concatenation (RUF005)
🔇 Additional comments (6)
resources/config.yaml (1)
105-107: Well-documented configuration option.

The new `default_data_format` option is clearly documented with allowed values and rationale for the default choice. This provides good backward compatibility through configuration.

amlb/datautils.py (3)
122-162: Clean dispatch logic for multi-format support.

The refactoring follows a clean pattern: extract the extension, then dispatch to a format-specific handler. The case-insensitive extension check and informative error message for unsupported formats are good practices.
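The described pattern can be sketched standalone. This is not the repository's actual code; the function and handler names below are stand-ins mirroring the review's description:

```python
import os

def reorder_dataset(path, target_src=0, target_dest=-1):
    """Dispatch to a format-specific handler based on the file extension."""
    ext = os.path.splitext(path)[1].lower()  # case-insensitive extension check
    handlers = {
        ".arff": _reorder_arff,
        ".parquet": _reorder_parquet,
        ".csv": _reorder_csv,
    }
    if ext not in handlers:
        raise ValueError(f"Unsupported file format for column reordering: {ext}")
    return handlers[ext](path, target_src, target_dest)

# Stub handlers so the sketch runs standalone; the real ones would load the
# file, reorder the columns, and save or return the data.
def _reorder_arff(path, src, dest):
    return ("arff", path)

def _reorder_parquet(path, src, dest):
    return ("parquet", path)

def _reorder_csv(path, src, dest):
    return ("csv", path)
```

A dictionary dispatch keeps adding a new format to a single table rather than extending an if/elif chain.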
181-211: ARFF reordering implementation looks correct.

The logic properly loads the ARFF file, computes the new column order, applies it to both attributes and data, and handles the save vs. return-data paths correctly.
214-253: Consistent implementation for Parquet and CSV handlers.

Both functions follow the same pattern as the ARFF handler, maintaining consistency. The CSV writer correctly uses `index=False` to avoid an extra index column.

amlb/datasets/openml.py (1)
339-349: Robust configuration handling with graceful fallback.

The implementation properly reads the configurable format with a fallback to "parquet", validates against supported formats, and logs a helpful warning if the configuration is invalid. This ensures backward compatibility while enabling user customization.
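The fallback behavior described above amounts to logic along these lines. This is a standalone sketch; the function name and constant are illustrative, not the repository's actual API:

```python
import logging

log = logging.getLogger(__name__)

SUPPORTED_DATA_FORMATS = ("arff", "csv", "parquet")

def resolve_data_format(config_value=None):
    """Return a supported data format, falling back to 'parquet'.

    Missing values and unknown formats both resolve to the default,
    with a warning for the latter so misconfiguration is visible.
    """
    fmt = (config_value or "parquet").lower()
    if fmt not in SUPPORTED_DATA_FORMATS:
        log.warning(
            "Unsupported default_data_format %r; falling back to 'parquet'.",
            config_value,
        )
        return "parquet"
    return fmt
```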
tests/unit/amlb/datasets/openml/test_openml_dataloader.py (1)
138-166: Good test adaptation for configurable default format.

The test correctly derives the expected format from `dataset.train.format` rather than assuming a fixed format. The explicit verification of all supported formats (arff, csv, parquet) via `data_path()` ensures comprehensive coverage of the format handling logic.
```python
def _reorder_columns(columns: list, target_src: int, target_dest: int) -> list | None:
    """Calculate the new column order. Returns None if no reordering needed."""
    n_cols = len(columns)
    src = n_cols + 1 + target_src if target_src < 0 else target_src
    dest = n_cols + 1 + target_dest if target_dest < 0 else target_dest
    if src == dest:
        return None
    ori = list(range(n_cols))
    if src < dest:
        return ori[:src] + ori[src + 1 : dest] + [src] + ori[dest:]
    else:
        return ori[:dest] + [src] + ori[dest:src] + ori[src + 1 :]
```
Off-by-one error in negative index handling.
The formula for converting negative indices is incorrect. For a list with n_cols=5, index -1 should map to 4 (the last element), but the current formula n_cols + 1 + target_src yields 5 + 1 + (-1) = 5, which is out of bounds.
Apply this diff to fix the off-by-one error:
```diff
 def _reorder_columns(columns: list, target_src: int, target_dest: int) -> list | None:
     """Calculate the new column order. Returns None if no reordering needed."""
     n_cols = len(columns)
-    src = n_cols + 1 + target_src if target_src < 0 else target_src
-    dest = n_cols + 1 + target_dest if target_dest < 0 else target_dest
+    src = n_cols + target_src if target_src < 0 else target_src
+    dest = n_cols + target_dest if target_dest < 0 else target_dest
     if src == dest:
         return None
```
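The flagged conversion can be checked in isolation; this standalone snippet is not the repository's code, just the two formulas applied to the review's 5-column example:

```python
# Original vs. corrected negative-index conversion for a 5-column list.
n_cols = 5
buggy = n_cols + 1 + (-1)  # yields 5, one past the last valid index
fixed = n_cols + (-1)      # yields 4, the last valid index

assert buggy not in range(n_cols)
assert fixed == list(range(n_cols))[-1]
```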
Summary
Changes
Backward Compatibility
Users can restore previous behavior by setting:

```yaml
openml:
  default_data_format: arff
```