Support bulk deletes for datasets whenever account is deleted #1457

@zaychenko-sergei

Currently, when an account is deleted, we sequentially delete each of the datasets that account owns:


    async fn handle_account_lifecycle_deleted_message(
        &self,
        message: &AccountLifecycleMessageDeleted,
    ) -> Result<(), InternalError> {
        let mut owned_dataset_stream = self
            .dataset_registry
            .all_dataset_handles_by_owner_id(&message.account_id);

        use tokio_stream::StreamExt;

        // TODO: PERF: Batch/concurrent processing
        while let Some(dataset_handle) = owned_dataset_stream.try_next().await? {
            match self
                .delete_dataset_use_case
                .execute_via_handle_preauthorized(&dataset_handle)
                .await
            {
                Ok(_) | Err(DeleteDatasetError::NotFound(_)) => { /* idempotent deletion */ }
                e @ Err(_) => e.int_err()?,
            }
        }

        Ok(())
    }

Let's say the user owned N=100 datasets at the moment of deletion.
This results in:

  • N=100 individual sequential SQL queries to delete each dataset's entry
  • a lot of S3 activity to delete each dataset's files (the deletion scenario runs sequentially, one dataset at a time)
  • generating N=100 outbox DatasetLifecycleMessage::deleted events
  • N=100 handler invocations, doing:
    - DID secrets cleanup
    - ReBAC cleanup
    - Associated flow scope removal actions
    - some updates within the dependency graph
    - full-text/semantic search document deletions
    - ... look carefully for other handlers that might exist ...

For some of these, it's possible to replace N individual actions with 1 vectorized action (this should be doable for all of the SQL queries). For others, like S3 cleanup, we could at least launch N concurrently running tasks to minimize the overall waiting time.
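A minimal sketch of both ideas, using only the standard library so it stays self-contained. Everything here is an assumption for illustration: the table and column names (`dataset_entries`, `dataset_id`), the `$n` placeholder style, and the `cleanup_storage` stand-in are hypothetical, and scoped threads stand in for the async tasks the real handler would presumably use.

```rust
use std::thread;

/// Builds ONE parameterized DELETE that covers a whole batch of datasets,
/// instead of issuing one statement per dataset.
/// Table and column names are hypothetical. Returns `None` for an empty
/// batch, because `IN ()` is not valid SQL.
fn build_bulk_delete_sql(batch_len: usize) -> Option<String> {
    if batch_len == 0 {
        return None;
    }
    let placeholders: Vec<String> = (1..=batch_len).map(|i| format!("${i}")).collect();
    Some(format!(
        "DELETE FROM dataset_entries WHERE dataset_id IN ({})",
        placeholders.join(", ")
    ))
}

/// Hypothetical stand-in for per-dataset storage cleanup (e.g. S3 objects).
/// A real implementation would call the object store here.
fn cleanup_storage(dataset_id: &str) -> Result<(), String> {
    if dataset_id.is_empty() {
        Err("empty dataset id".to_string())
    } else {
        Ok(())
    }
}

/// Launches one cleanup task per dataset, waits for all of them, and
/// returns the `(dataset_id, error)` pairs of the tasks that failed.
fn cleanup_all_concurrently(dataset_ids: &[String]) -> Vec<(String, String)> {
    thread::scope(|scope| {
        // Spawn everything first so the tasks actually overlap.
        let handles: Vec<_> = dataset_ids
            .iter()
            .map(|id| (id, scope.spawn(move || cleanup_storage(id))))
            .collect();

        // Then join, keeping only the failures.
        handles
            .into_iter()
            .filter_map(|(id, handle)| match handle.join() {
                Ok(Ok(())) => None,
                Ok(Err(e)) => Some((id.clone(), e)),
                Err(_) => Some((id.clone(), "cleanup task panicked".to_string())),
            })
            .collect()
    })
}
```

Collecting failures instead of stopping at the first error is one possible design choice: it lets the remaining cleanups finish, which fits the idempotent deletion the current handler already relies on. With large N, both the batch size and the number of concurrent tasks would likely need an upper bound.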

This would require introducing a vectorized version of DatasetLifecycleMessage::deleted event. It could be a breaking change in the format to accept a vector of identifiers, or it could be a new event, depending on implementation considerations.
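To make the "new event" option concrete, here is a rough sketch. The types below (`DatasetId`, `LifecycleMessage`, the `DeletedBulk` variant, and the helper functions) are hypothetical stand-ins, not the actual definitions in the codebase. The point is only that handlers can read both shapes through one accessor, so the single-dataset message does not have to change.

```rust
/// Hypothetical stand-in for the real dataset identifier type.
#[derive(Debug, Clone, PartialEq, Eq)]
struct DatasetId(String);

/// Hypothetical message type: the existing single-dataset variant is kept,
/// and a bulk variant is added next to it.
#[derive(Debug, Clone, PartialEq, Eq)]
enum LifecycleMessage {
    Deleted { dataset_id: DatasetId },
    DeletedBulk { dataset_ids: Vec<DatasetId> },
}

impl LifecycleMessage {
    /// Uniform view for handlers: always a slice of ids, whichever
    /// variant was received.
    fn deleted_dataset_ids(&self) -> &[DatasetId] {
        match self {
            Self::Deleted { dataset_id } => std::slice::from_ref(dataset_id),
            Self::DeletedBulk { dataset_ids } => dataset_ids.as_slice(),
        }
    }
}

/// Splits the owned datasets into bulk messages of at most `max_per_message`
/// ids each, so a very large account does not produce one huge message.
fn bulk_deleted_messages(ids: &[DatasetId], max_per_message: usize) -> Vec<LifecycleMessage> {
    ids.chunks(max_per_message.max(1))
        .map(|chunk| LifecycleMessage::DeletedBulk {
            dataset_ids: chunk.to_vec(),
        })
        .collect()
}
```

With this shape, a handler such as ReBAC or DID secrets cleanup could take `message.deleted_dataset_ids()` and issue one vectorized query per message. The breaking-change alternative would instead turn the existing field into a vector and drop the single-id variant.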

Labels: performance, rust
