Currently, when an account is deleted, we sequentially delete each of the datasets this account owns:
```rust
async fn handle_account_lifecycle_deleted_message(
    &self,
    message: &AccountLifecycleMessageDeleted,
) -> Result<(), InternalError> {
    use tokio_stream::StreamExt;

    let mut owned_dataset_stream = self
        .dataset_registry
        .all_dataset_handles_by_owner_id(&message.account_id);

    // TODO: PERF: Batch/concurrent processing
    while let Some(dataset_handle) = owned_dataset_stream.try_next().await? {
        match self
            .delete_dataset_use_case
            .execute_via_handle_preauthorized(&dataset_handle)
            .await
        {
            Ok(_) | Err(DeleteDatasetError::NotFound(_)) => { /* idempotent deletion */ }
            e @ Err(_) => e.int_err()?,
        }
    }
    Ok(())
}
```
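The `PERF` TODO above could be addressed by running deletions concurrently instead of awaiting them one by one. A minimal std-only sketch of the idea, using a hypothetical `delete_dataset` stand-in and OS threads to simulate what buffered async tasks would do in the real code:

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical stand-in for the real async delete use case: deletion is
// idempotent, so removing an already-absent handle is not an error.
fn delete_dataset(store: &Mutex<HashSet<String>>, handle: &str) {
    store.lock().unwrap().remove(handle);
}

// Delete all given handles concurrently (one thread per handle here; in the
// real async code this would be a buffered stream of tasks, not OS threads).
// Returns how many entries were left behind.
fn delete_all_concurrently(handles: Vec<String>) -> usize {
    let store = Arc::new(Mutex::new(
        handles.iter().cloned().collect::<HashSet<_>>(),
    ));
    let workers: Vec<_> = handles
        .into_iter()
        .map(|h| {
            let store = Arc::clone(&store);
            thread::spawn(move || delete_dataset(&store, &h))
        })
        .collect();
    for w in workers {
        w.join().unwrap();
    }
    let remaining = store.lock().unwrap().len();
    remaining
}

fn main() {
    let handles: Vec<String> = (0..100).map(|i| format!("dataset-{i}")).collect();
    // All 100 deletions run concurrently; none should remain.
    assert_eq!(delete_all_concurrently(handles), 0);
}
```

In the actual async code this would likely map to something like a concurrency-limited buffered stream, so overall latency is bounded by the slowest deletion rather than the sum of all of them.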
Suppose the user owned N=100 datasets at the moment of deletion.
This results in:
- N=100 individual sequential SQL queries to delete each dataset's entry
- a lot of S3 activity to delete each dataset's files (executing the cleanup scenario sequentially)
- N=100 `DatasetLifecycleMessage::deleted` events generated through the outbox
- N=100 handler invocations, each doing:
  - DID secrets cleanup
  - ReBAC cleanup
  - associated flow scope removal actions
  - some updates within the dependency graph
  - full-text/semantic search document deletions
  - ... look carefully for other handlers that might exist ...
For some of these, it's possible to replace N individual actions with 1 vectorized action (this should be doable for all of the SQL queries). For others, like S3 cleanup, we could at least launch N concurrently running tasks to minimize the overall waiting time.
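As a sketch of the vectorized-SQL direction: instead of N single-row `DELETE` statements, the identifiers can be chunked into a parameterized `IN` clause. The table name, placeholder style, and chunk size below are illustrative assumptions, not the actual schema:

```rust
// Build one DELETE statement per chunk of dataset ids, instead of one
// statement per id. Placeholders use the Postgres-style `$1, $2, ...`
// convention; the actual binding style depends on the storage backend.
fn build_vectorized_delete(dataset_ids: &[&str], chunk_size: usize) -> Vec<String> {
    dataset_ids
        .chunks(chunk_size)
        .map(|chunk| {
            let placeholders: Vec<String> =
                (1..=chunk.len()).map(|i| format!("${i}")).collect();
            format!(
                "DELETE FROM dataset_entries WHERE dataset_id IN ({})",
                placeholders.join(", ")
            )
        })
        .collect()
}

fn main() {
    let ids: Vec<String> = (0..100).map(|i| format!("did:odf:{i}")).collect();
    let id_refs: Vec<&str> = ids.iter().map(String::as_str).collect();
    let queries = build_vectorized_delete(&id_refs, 50);
    // 100 ids with a chunk size of 50 -> 2 statements instead of 100
    assert_eq!(queries.len(), 2);
    println!("{}", queries[0]);
}
```

Chunking keeps each statement's parameter count bounded, which matters for backends that limit the number of bind parameters per query.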
This would require introducing a vectorized version of the `DatasetLifecycleMessage::deleted` event. It could be a breaking change to the existing format, making it accept a vector of identifiers, or it could be a new event, depending on implementation considerations.
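One possible shape for such a message, purely as an illustration (the struct and field names below are hypothetical; the real message types live in the kamu codebase and use proper DID types rather than strings):

```rust
// Hypothetical vectorized deletion event: one outbox message carries all
// dataset ids removed in a single sweep, instead of N individual messages.
#[derive(Debug, Clone)]
struct DatasetLifecycleMessageDeletedBatch {
    account_id: String,        // owner whose datasets were deleted
    dataset_ids: Vec<String>,  // every dataset removed in this sweep
}

fn main() {
    let msg = DatasetLifecycleMessageDeletedBatch {
        account_id: "did:odf:account".into(),
        dataset_ids: (0..100).map(|i| format!("did:odf:dataset-{i}")).collect(),
    };
    // One outbox event replaces N=100 individual ones
    assert_eq!(msg.dataset_ids.len(), 100);
}
```

Handlers consuming the batch form could then run their own vectorized cleanup (one SQL statement, one search-index bulk delete) per message instead of once per dataset.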