feat(package): Add `dataset-manager` scripts to support listing datasets, and deleting them entirely. by haiqi96 · Pull Request #1144 · y-scope/clp

haiqi96 · 2025-07-30T21:15:31Z

Description

This PR adds a management script for datasets that supports listing and deleting existing datasets. The script is based on the following assumptions:

The script will be used as an admin tool and will not handle race conditions. The user is expected to ensure that the datasets to be removed are not being searched or compressed to.
The script first (1) removes the archives and then (2) deletes the tables in the database. Between steps 1 and 2, the database and the archives will be temporarily inconsistent (the archives are removed, but their metadata still exists).
If the script fails at step 1 (removing archives), it will not proceed to delete the archive metadata.
The script doesn't return an error if the archives to be deleted don't exist. This allows the user to rerun the script if it fails between removing the archives and deleting the tables.
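
The ordering and rerun behavior described above can be sketched as follows. This is only an illustration, not the script's actual code: `delete_dataset` is a hypothetical name, a local directory stands in for archive storage, and a plain dict stands in for the database tables.

```python
from pathlib import Path
import shutil


def delete_dataset(archive_dir: Path, metadata_tables: dict, dataset: str) -> None:
    """Delete a dataset in two stages: archives first, then metadata.

    Hypothetical helper: `archive_dir` stands in for archive storage and
    `metadata_tables` stands in for the per-dataset database tables.
    """
    # Stage 1: remove the archives. Missing archives are not an error, so a
    # rerun after a partial failure can still reach stage 2.
    dataset_dir = archive_dir / dataset
    if dataset_dir.exists():
        # rmtree raises on failure, so stage 2 below is never reached.
        shutil.rmtree(dataset_dir)

    # Stage 2: drop the metadata. Also tolerant of already-removed entries.
    metadata_tables.pop(dataset, None)
```

Calling `delete_dataset` twice with the same arguments leaves the same end state as calling it once, which is the property that makes rerunning after a failure safe.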

The script supports the following operations:

list (list all existing datasets)
del (delete a list of datasets)
del -a/--all (delete all existing datasets).
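
For illustration, a command-line interface with these three operations could be declared with `argparse` roughly as below. This is a minimal sketch, not the parser from this PR; `build_parser` and the `del_all` destination name are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with `list` and `del` subcommands (hypothetical layout)."""
    parser = argparse.ArgumentParser(prog="dataset-manager")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # `list` takes no arguments.
    subparsers.add_parser("list", help="List all existing datasets.")

    # `del` takes either explicit dataset names or -a/--all.
    del_parser = subparsers.add_parser("del", help="Delete datasets.")
    del_parser.add_argument("datasets", nargs="*", help="Datasets to delete.")
    del_parser.add_argument(
        "-a",
        "--all",
        dest="del_all",
        action="store_true",
        help="Delete all existing datasets.",
    )
    return parser
```

With this layout, `build_parser().parse_args(["del", "--all"])` yields an empty `datasets` list with `del_all` set, so the caller still has to decide what to do when both names and `--all` are given.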

Some behaviors to be decided:

If the user requests to delete multiple datasets, the script will skip any invalid dataset but delete the others. Alternatively, we can have the script validate all datasets first and not proceed with deletion if any dataset is invalid.
If the script fails to delete a dataset, it will abort and will not continue with the remaining datasets.
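
To make the first open question concrete, the two candidate behaviors can be compared side by side. Both functions below are hypothetical sketches (the names and the `ValueError` are assumptions), operating on plain lists and sets rather than the real database.

```python
from typing import Iterable, List, Set


def select_skip_invalid(requested: Iterable[str], existing: Set[str]) -> List[str]:
    """Behavior as described: drop unknown names, delete the rest."""
    return [name for name in requested if name in existing]


def select_validate_first(requested: Iterable[str], existing: Set[str]) -> List[str]:
    """Alternative: refuse to delete anything if any name is unknown."""
    requested = list(requested)
    unknown = [name for name in requested if name not in existing]
    if unknown:
        raise ValueError(f"Unknown datasets: {', '.join(unknown)}")
    return requested
```

The first variant is more forgiving of typos in a long list; the second avoids partially applied deletions, at the cost of making the user fix the input before anything happens.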

Note: this PR also updates the dataset logic in the compression scheduler, because the current implementation assumes that datasets are never removed.
The current implementation lets the compression scheduler poll the datasets for every new compression job, and assumes that no dataset will be deleted while a job is being scheduled.

Some example commands and output:

$ ./sbin/admin-tools/dataset-manager.sh list
2025-07-31T15:50:18.781 INFO [dataset_manager] Found 2 datasets.
2025-07-31T15:50:18.781 INFO [dataset_manager] my_best_dataset
2025-07-31T15:50:18.781 INFO [dataset_manager] my_favorite_dataset
$ ./sbin/admin-tools/dataset-manager.sh del my_best_dataset
2025-07-31T15:50:28.233 INFO [dataset_manager] Successfully deleted archives of dataset `my_best_dataset`.
2025-07-31T15:50:28.265 INFO [dataset_manager] Successfully deleted dataset `my_best_dataset` from database.
$ ./sbin/admin-tools/dataset-manager.sh del --all
2025-07-31T15:50:34.101 INFO [dataset_manager] Successfully deleted archives of dataset `my_favorite_dataset`.
2025-07-31T15:50:34.130 INFO [dataset_manager] Successfully deleted dataset `my_favorite_dataset` from database.

$ ./sbin/admin-tools/dataset-manager.sh del --all
2025-07-31T15:51:28.577 WARNING [dataset_manager] No dataset will be deleted...
$ ./sbin/admin-tools/dataset-manager.sh del no_such_dataset
2025-07-31T15:51:37.575 ERROR [dataset_manager] Dataset `no_such_dataset` doesn't exist.
2025-07-31T15:51:37.575 WARNING [dataset_manager] No dataset will be deleted...

Checklist

The PR satisfies the contribution guidelines.
This is a breaking change and that has been indicated in the PR title, OR this isn't a breaking change.
Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Tested removing datasets from both file-system storage and S3.
Tested removing datasets compressed with tags to ensure that foreign key constraints in the tables don't cause failures.
Confirmed that the script prints an error message properly when the input dataset doesn't exist.
Confirmed that on S3, deleting a dataset doesn't interfere with other datasets that share a prefix.
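
The shared-prefix case in the last check is worth spelling out: if one dataset's name is a prefix of another's (for example `logs` and `logs_old`), a naive prefix match would select both. One common way to avoid this is to terminate the key prefix with a `/`. The sketch below assumes a hypothetical key layout of `<base_prefix>/<dataset>/...` and is not taken from this PR.

```python
from typing import Iterable, List


def dataset_key_prefix(base_prefix: str, dataset: str) -> str:
    """Build a key prefix that ends in '/' so it only matches this dataset."""
    base = base_prefix.strip("/")
    return f"{base}/{dataset}/" if base else f"{dataset}/"


def keys_to_delete(all_keys: Iterable[str], base_prefix: str, dataset: str) -> List[str]:
    """Select only the object keys that belong to `dataset`."""
    prefix = dataset_key_prefix(base_prefix, dataset)
    return [key for key in all_keys if key.startswith(prefix)]
```

Without the trailing slash, the prefix `clp/logs` would also match keys under `clp/logs_old/`.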

Summary by CodeRabbit

New Features
- New CLI to list and delete datasets (selective or all) with a shell helper to run it; containerized native runner included.
Improvements
- S3: batched deletions by prefix, expanded auth handling and deletion-limit constant.
- Added archive-manager action name and compression-tasks table constant.
- New metadata-only dataset removal utility to safely drop per-dataset metadata.
Bug Fixes
- Safer archive deletion checks, stronger validation and error handling.
Documentation
- Object-storage guide adds IAM ListBucket for prefix-limited access.
Refactor
- Simplified existing-dataset handling in task scheduling.
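
As background for the batched S3 deletions mentioned in the summary: the S3 `DeleteObjects` API accepts at most 1000 keys per request, so a long key list has to be split into chunks. The following is a minimal sketch of that chunking only; the constant and function names are hypothetical and no S3 call is made.

```python
from typing import Iterator, List, Sequence

# Hypothetical constant name; the value is the per-request key limit of
# the S3 DeleteObjects API.
S3_MAX_KEYS_PER_DELETE = 1000


def batch_keys(
    keys: Sequence[str], batch_size: int = S3_MAX_KEYS_PER_DELETE
) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `batch_size` keys."""
    for start in range(0, len(keys), batch_size):
        yield list(keys[start : start + batch_size])
```

Each yielded chunk could then be passed to a single delete request.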

haiqi96 merged 22 commits into y-scope:main from haiqi96:dataset_utils

3 participants
