feat(package): Add dataset-manager scripts to support listing datasets, and deleting them entirely.#1144

Merged
haiqi96 merged 22 commits into
y-scope:main from
haiqi96:dataset_utils
Aug 17, 2025

haiqi96 commented Jul 30, 2025

Contributor

Description

This PR adds a management script for datasets that supports listing and deleting existing datasets. The script is based on the following assumptions:

  1. The script is an admin tool and does not handle race conditions. The user is expected to ensure that the datasets being removed are not being searched or compressed into.
  2. The script first (1) removes the archives and then (2) deletes the tables in the database. Between (1) and (2), the database and the archive storage are temporarily inconsistent: the archives have been removed, but their metadata still exists.
  3. If the script fails at stage 1 (removing archives), it does not proceed to delete the archive metadata.
  4. The script doesn't return an error if the archive to be deleted doesn't exist. This allows the user to rerun the script if it fails between removing the archives and deleting the tables.
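
The ordering and rerun behavior in the assumptions above can be sketched as follows. This is a minimal illustration and not the code in this PR: `delete_dataset`, `archive_dir`, and `drop_metadata` are hypothetical names, and a local directory stands in for the archive storage.

```python
import shutil
from pathlib import Path
from typing import Callable


def delete_dataset(archive_dir: Path, drop_metadata: Callable[[], None]) -> None:
    """Delete a dataset in two stages: archives first, then metadata.

    `archive_dir` and `drop_metadata` are hypothetical stand-ins for the
    archive storage location and the database cleanup step.
    """
    # Stage 1: remove the archives. A missing directory is not treated as an
    # error, so a rerun after a partial failure falls through to stage 2.
    if archive_dir.exists():
        shutil.rmtree(archive_dir)

    # Stage 2: reached only if stage 1 raised no exception, so a failure while
    # removing archives leaves the metadata untouched.
    drop_metadata()
```

Because stage 1 tolerates missing archives, calling the function a second time for the same dataset only repeats the metadata step.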

The script supports the following operations:

  • list (list all existing datasets)
  • del (delete a list of datasets)
  • del -a/--all (delete all existing datasets).
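
For readers who want to picture the command-line surface, the operations listed above map naturally onto `argparse` subcommands. The sketch below is an assumption about how such a parser could look, not the parser in this PR; only the `list` and `del` subcommands and the `-a/--all` flag come from the description, while `build_parser`, `datasets`, and `del_all` are hypothetical names.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with `list` and `del` subcommands (illustrative only)."""
    parser = argparse.ArgumentParser(prog="dataset-manager")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # `list` takes no arguments.
    subparsers.add_parser("list", help="List all existing datasets.")

    # `del` takes zero or more dataset names, or -a/--all.
    del_parser = subparsers.add_parser("del", help="Delete datasets.")
    del_parser.add_argument("datasets", nargs="*", help="Datasets to delete.")
    del_parser.add_argument(
        "-a",
        "--all",
        dest="del_all",  # hypothetical attribute name
        action="store_true",
        help="Delete all existing datasets.",
    )
    return parser
```

With this layout, `build_parser().parse_args(["del", "--all"])` yields an empty `datasets` list and `del_all` set to `True`.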

Some behaviors still to be decided:

  1. If the user requests deletion of multiple datasets, the script skips any invalid dataset but deletes the others. Alternatively, the script could first validate all datasets and not proceed with deletion if any dataset is invalid.
  2. If the script fails to delete a dataset, it aborts and does not continue with the remaining datasets.
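
The two options in item 1 differ only in how the requested names are filtered before deletion starts. A minimal sketch of both policies, using hypothetical helper names (`select_skip_invalid`, `select_all_or_nothing`) that do not appear in this PR:

```python
from typing import Iterable, List, Set


def select_skip_invalid(requested: Iterable[str], existing: Set[str]) -> List[str]:
    """Behavior described in this PR: drop unknown names, delete the rest."""
    return [name for name in requested if name in existing]


def select_all_or_nothing(requested: Iterable[str], existing: Set[str]) -> List[str]:
    """Alternative: refuse to delete anything if any name is unknown."""
    requested = list(requested)
    unknown = [name for name in requested if name not in existing]
    if unknown:
        raise ValueError(f"Unknown datasets: {unknown}")
    return requested
```

The first policy is more forgiving of typos in a long list, while the second guarantees that a command either applies to every named dataset or has no effect.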

Note: this PR also updates the dataset logic in the compression scheduler, because the existing implementation assumes that datasets are never removed.
The implementation in this PR has the compression scheduler poll the datasets for every new compression job, and assumes that no dataset is deleted while a job is being scheduled.

Example commands and output:

$ ./sbin/admin-tools/dataset-manager.sh list
2025-07-31T15:50:18.781 INFO [dataset_manager] Found 2 datasets.
2025-07-31T15:50:18.781 INFO [dataset_manager] my_best_dataset
2025-07-31T15:50:18.781 INFO [dataset_manager] my_favorite_dataset
$ ./sbin/admin-tools/dataset-manager.sh del my_best_dataset
2025-07-31T15:50:28.233 INFO [dataset_manager] Successfully deleted archives of dataset `my_best_dataset`.
2025-07-31T15:50:28.265 INFO [dataset_manager] Successfully deleted dataset `my_best_dataset` from database.
$ ./sbin/admin-tools/dataset-manager.sh del --all
2025-07-31T15:50:34.101 INFO [dataset_manager] Successfully deleted archives of dataset `my_favorite_dataset`.
2025-07-31T15:50:34.130 INFO [dataset_manager] Successfully deleted dataset `my_favorite_dataset` from database.

$ ./sbin/admin-tools/dataset-manager.sh del --all
2025-07-31T15:51:28.577 WARNING [dataset_manager] No dataset will be deleted...
$ ./sbin/admin-tools/dataset-manager.sh del no_such_dataset
2025-07-31T15:51:37.575 ERROR [dataset_manager] Dataset `no_such_dataset` doesn't exist.
2025-07-31T15:51:37.575 WARNING [dataset_manager] No dataset will be deleted...

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Tested removing datasets from both file storage and S3.
  • Tested removing datasets compressed with tags to ensure that foreign key constraints in the tables don't cause failures.
  • Confirmed that the script prints an error message properly when the input dataset doesn't exist.
  • Confirmed that on S3, deleting a dataset doesn't interfere with other datasets that share a prefix.
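
The shared-prefix check in the last item matters because S3 prefix matching is a plain string comparison: a prefix of `archives/logs` also matches keys under `archives/logs_v2/`. One common way to avoid this is to end the dataset prefix with `/`. The sketch below assumes keys are laid out as `<base_prefix>/<dataset>/...`, which is an assumption for illustration and not confirmed by this PR; `keys_for_dataset` is a hypothetical helper that filters an in-memory key list rather than calling S3.

```python
from typing import Iterable, List


def keys_for_dataset(all_keys: Iterable[str], base_prefix: str, dataset: str) -> List[str]:
    """Return only the keys that belong to `dataset` (assumed key layout)."""
    # Join the non-empty parts and end with "/" so that `logs` does not also
    # match keys under `logs_v2/`.
    parts = [part for part in (base_prefix.strip("/"), dataset) if part]
    dataset_prefix = "/".join(parts) + "/"
    return [key for key in all_keys if key.startswith(dataset_prefix)]
```

For example, with keys `archives/logs/a` and `archives/logs_v2/b`, asking for dataset `logs` under `archives` selects only the first key.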

Summary by CodeRabbit

  • New Features

    • New CLI to list and delete datasets (selective or all) with a shell helper to run it; containerized native runner included.
  • Improvements

    • S3: batched deletions by prefix, expanded auth handling and deletion-limit constant.
    • Added archive-manager action name and compression-tasks table constant.
    • New metadata-only dataset removal utility to safely drop per-dataset metadata.
  • Bug Fixes

    • Safer archive deletion checks, stronger validation and error handling.
  • Documentation

    • Object-storage guide adds IAM ListBucket for prefix-limited access.
  • Refactor

    • Simplified existing-dataset handling in task scheduling.
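
The "batched deletions" and "deletion-limit constant" mentioned in the summary above relate to a limit in S3 itself: a single DeleteObjects request accepts at most 1000 keys. A minimal sketch of the chunking step follows, with no network calls; `S3_DELETE_BATCH_LIMIT` and `chunk_keys` are hypothetical names, not the constant or helper added in this PR.

```python
from typing import Iterator, List, Sequence

# Hypothetical constant name; 1000 is the per-request key cap of S3 DeleteObjects.
S3_DELETE_BATCH_LIMIT = 1000


def chunk_keys(keys: Sequence[str], limit: int = S3_DELETE_BATCH_LIMIT) -> Iterator[List[str]]:
    """Yield consecutive slices of `keys`, each at most `limit` items long."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    for start in range(0, len(keys), limit):
        yield list(keys[start:start + limit])
```

Each yielded batch could then be passed to one delete request, so 2500 keys would need three requests.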
