While working on branch cleanup support, I found that the current format prevents deleting a branch while other branches still reference it. IMO this is a format limitation that deserves its own issue for further discussion. @jackye1995 @brendanclement
Problem
The current branch format binds the branch name to its physical layout path.
For example, if we have:
main -> featureA -> experimentA
then:
featureA is stored at tree/featureA
experimentA is stored at tree/featureA/experimentA
Now suppose we want to delete featureA but keep experimentA.
In this case, we cannot really remove the featureA directory, because the root version of experimentA lives inside featureA. We also cannot really remove all data of featureA: versions and data still referenced by downstream branches must be retained, while anything no longer referenced can be released.
This is reasonable by itself.
The issue is that if tree/featureA cannot be removed physically, then the branch name is effectively still occupied, and we cannot create a new branch with the same name later.
Because of this, the current implementation does not allow deleting a branch that is still referenced by descendant branches. This restriction comes from the current format spec.
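To make the coupling concrete, here is a minimal sketch of the name-derived layout described above. `name_based_path` is a hypothetical helper written only for this issue, not an existing API:

```python
from pathlib import PurePosixPath


def name_based_path(lineage):
    """Hypothetical helper: build the physical path from the chain of
    branch names below main, as the current format does."""
    return PurePosixPath("tree").joinpath(*lineage)


feature_a = name_based_path(["featureA"])
experiment_a = name_based_path(["featureA", "experimentA"])

# experimentA is physically nested inside featureA's directory, so
# tree/featureA cannot be removed while experimentA is kept.
blocks_removal = feature_a in experiment_a.parents
```

As long as `blocks_removal` holds, the directory name (and therefore the branch name) stays occupied.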
Current state
We have already introduced branch_identifier, and persistent branch lineage is now available in metadata. That means we no longer need to use the branch name itself to represent lineage.
Proposal
Use UUID-based physical branch directories instead of branch-name-based paths.
For example:
tree/UUID1 -> featureA
tree/UUID2 -> experimentA
The branch_name -> uuid mapping is already stored in branch metadata.
When loading a branch, we can resolve its physical path from branch metadata, instead of deriving it from the branch name.
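A rough sketch of what resolving through metadata could look like. The `BranchMeta` fields and the function names are assumptions for illustration, not the actual metadata schema:

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class BranchMeta:
    # Hypothetical fields; the real branch metadata may differ.
    name: str
    branch_uuid: str
    parent_uuid: Optional[str]  # lineage lives in metadata, not in the path


def create_branch(metadata: Dict[str, BranchMeta], name: str,
                  parent: Optional[str] = None) -> BranchMeta:
    if name in metadata:
        raise ValueError(f"branch {name!r} already exists")
    parent_uuid = metadata[parent].branch_uuid if parent else None
    meta = BranchMeta(name, str(uuid.uuid4()), parent_uuid)
    metadata[name] = meta
    return meta


def resolve_path(metadata: Dict[str, BranchMeta], name: str) -> str:
    # The path comes from the stored UUID, never from the branch name.
    return f"tree/{metadata[name].branch_uuid}"


metadata: Dict[str, BranchMeta] = {}
create_branch(metadata, "featureA")
create_branch(metadata, "experimentA", parent="featureA")
old_feature_path = resolve_path(metadata, "featureA")
experiment_path = resolve_path(metadata, "experimentA")

# Dropping featureA's metadata frees the name right away, even if its
# old directory still holds data that experimentA references.
del metadata["featureA"]
create_branch(metadata, "featureA")
new_feature_path = resolve_path(metadata, "featureA")
```

The new `featureA` gets a fresh directory, and `experimentA` keeps resolving to the same path as before.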
Deletion semantics
With this model, deleting featureA should mean:
- release unreferenced versions and data in this branch
- retain versions and data still referenced by descendant branches
- delete the branch metadata for featureA
This also means branch deletion should examine not only what can be removed from the target branch itself, but also whether versions and data in upstream branches can now be released.
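The semantics above could be sketched as a reference check over all branch directories. This is a simplified model under my own assumptions: versions are `(directory, version)` pairs, and `refs` / `stored` are hypothetical in-memory stand-ins for the real metadata:

```python
def delete_branch(name, refs, stored):
    """Drop `name`'s metadata and return the (directory, version) pairs
    that no remaining branch references, in any directory."""
    del refs[name]
    live = set().union(*refs.values())
    released = stored - live
    stored -= released  # mutates the caller's set in place
    return released


# U1 is featureA's directory, U2 is experimentA's.
stored = {("U1", 1), ("U1", 2), ("U1", 3), ("U2", 1), ("U2", 2)}
refs = {
    "featureA": {("U1", 1), ("U1", 2), ("U1", 3)},
    # experimentA was created from featureA version 2.
    "experimentA": {("U1", 2), ("U2", 1), ("U2", 2)},
}

first = delete_branch("featureA", refs, stored)
second = delete_branch("experimentA", refs, stored)
```

Deleting `featureA` releases only versions 1 and 3 of `U1`, because `experimentA` still references version 2. Deleting `experimentA` afterwards also releases the upstream `("U1", 2)`, which is the case the last sentence above refers to.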
Compatibility
We should also consider compatibility carefully:
- it should still be possible to load a branch dataset from dataset/tree/branch_name
- we need to consider how to remain compatible with existing branch URLs
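One possible direction, purely as a strawman: treat the part after tree/ as a branch name first, and fall back to the legacy name-based directory when metadata has no UUID for it. `resolve_branch_dir` and `uuid_by_name` are hypothetical names, and this is not a settled design:

```python
def resolve_branch_dir(tree_suffix, uuid_by_name):
    """Map the part of a URL after 'tree/' to a physical directory.

    Assumption: branches created under the new layout have an entry in
    `uuid_by_name`; branches created before it do not.
    """
    branch_uuid = uuid_by_name.get(tree_suffix)
    if branch_uuid is not None:
        return f"tree/{branch_uuid}"
    # Legacy branch: the suffix already is the physical path.
    return f"tree/{tree_suffix}"


uuid_by_name = {"featureB": "UUID3"}
new_style = resolve_branch_dir("featureB", uuid_by_name)
legacy = resolve_branch_dir("featureA/experimentA", uuid_by_name)
```

This keeps existing URLs working, but it leaves open what happens when a new branch name collides with a legacy directory name.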
Cleanup
Dangling directories left by branch creation should also be considered and removed by cleanup.