Goal
Promote commit / abort from a per-sandbox property to a first-class property of a Pipeline (or any group of sandboxes that should succeed or fail together). Stages should be able to see each other's in-flight writes under a defined isolation model, and the runtime should commit all of them together at the end or none of them at all.
This is an RFC for the full feature, not a single PR. It describes the end state, the design choices, the recommended defaults, and a delivery phasing.
What exists today
- Sandbox owns cow_branch: Option<Box<dyn CowBranch>> with commit() / abort() / cleanup() / changes() (crates/sandlock-core/src/cow/mod.rs).
- Three backends:
  - seccomp: pure userspace, no mount, no privilege. Supervisor intercepts opens, plans copy-on-write into an upper directory, tracks deletions in memory. Auto-enabled when fs_isolation = None and a workdir is set. Works on any Linux 5.9+.
  - overlayfs: kernel overlayfs. Requires a mount in the child, which needs mount privilege.
  - branchfs: ioctls against a .branchfs_ctl node. Requires a custom kernel module that is not generally available.
- Per-sandbox on_exit / on_error policy (BranchAction::{Commit, Abort, Keep}) auto-applied in Sandbox::Drop (crates/sandlock-core/src/sandbox.rs:1615).
- Pipeline forks N stages with inter-stage pipes (crates/sandlock-core/src/pipeline.rs), zero awareness of COW. Each stage's Drop independently decides commit/abort based on its own exit code.
- Dry-run for a single sandbox (Sandbox::do_dry_run) captures changes() then aborts.
- Gather (fan-in) exists for data flow over pipes; it has no COW story.
The end state (what "done" looks like)
A user can declare a group of sandboxes (initially a Pipeline or Gather, eventually any user-defined set) and run them under a transaction. The runtime guarantees:
- Every stage observes prior stages' writes according to the chosen isolation model (default: read-committed for sequential, snapshot for parallel siblings).
- At end of transaction, either every stage's writes commit to the underlying workdir, or none of them do.
- Parallel siblings that touch the same path produce a structured conflict, surfaced to the caller with a configurable resolution policy.
- The pipeline can be dry-run: the union of intended changes is reported and nothing is committed.
- The transaction works on any sandlock-supported kernel (5.9+) without elevated privilege.
- Crash mid-commit is recoverable: a subsequent sandlock invocation can either complete or roll back the interrupted transaction from durable transaction metadata.
- Concurrent transactions over the same workdir are serializable (or, at minimum, fail fast with a clear error rather than interleaving in undefined ways).
The user surface:
- Rust: Pipeline::commit() / abort() / dry_run(), Transaction::new(sandboxes).run(), MergeConflict type with policy enum.
- CLI: sandlock pipeline ... --dry-run, sandlock txn list / show / abort, exit codes that distinguish stage failure from commit failure from conflict.
- Python SDK: mirrored API plus with Transaction(...) as t: context manager.
Design axes (the choices the RFC commits to)
Isolation model
| Model | What a stage sees | Cost |
| --- | --- | --- |
| Read-committed (sequential) | All writes committed by prior stages in this transaction, plus the pre-transaction workdir | Low. Lookup chain extends through prior uppers. |
| Snapshot (parallel siblings) | The pre-transaction workdir only; siblings' writes are invisible until commit | Low for seccomp backend (each sibling has its own upper). |
| Serializable (across transactions) | No other transaction's mid-flight writes ever visible | Needs a workdir-level lock manager. |
Recommendation: read-committed for sequential pipelines, snapshot for parallel siblings, opt-in workdir lock for cross-transaction serializability. All three are expressible on top of the seccomp backend without new kernel features.
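As a rough illustration of how the first two models differ, the sketch below builds the list of directories a stage would consult on read. The function name, its parameters, and the flat "own upper plus workdir" treatment of siblings are assumptions made for this example, not part of the existing sandlock code.

```python
from pathlib import Path


def lookup_chain(
    own_upper: Path,
    prior_uppers: list[Path],
    workdir: Path,
    parallel_sibling: bool = False,
) -> list[Path]:
    """Return the directories a stage consults on read, nearest first.

    prior_uppers is in declaration order (stage 1 first).
    """
    if parallel_sibling:
        # Snapshot: a sibling sees only its own writes and the
        # pre-transaction workdir, never another sibling's upper.
        return [own_upper, workdir]
    # Read-committed (sequential): newest prior stage first, workdir last.
    return [own_upper, *reversed(prior_uppers), workdir]
```

Under these assumptions, serializability across transactions is orthogonal to the chain: it only decides whether a second transaction may start on the same workdir at all.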
Atomicity of finalization
| Strategy | Property | Cost |
| --- | --- | --- |
| Serial per-branch commit | Simple. Mid-commit failure leaves partial state. | Trivial. |
| Two-phase commit | Prepare all branches (validate writable, no conflicts), then flip. Mid-flip failure still partial unless combined with a journal. | Medium. |
| Write-ahead log + recovery | Durable transaction metadata; recovery step on next startup completes or rolls back. | Significant, but unblocks crash safety. |
| Single-rename finalize | Stage entire result tree aside, rename(2) into place. Atomic but only for whole-subtree replacement. | Limited applicability. |
Recommendation: serial per-branch commit in Phase 1 with an explicit "crash mid-commit may leave partial state" caveat in the API doc; WAL + recovery as a later phase.
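To make the Phase 1 crash window concrete, here is a minimal sketch of a serial per-branch commit over plain directories. It assumes each branch is represented as an upper directory plus a set of deleted relative paths; that representation and the name commit_serial are illustrative only and do not describe the real CowBranch implementation.

```python
import shutil
from pathlib import Path


def commit_serial(uppers: list[tuple[Path, set[str]]], workdir: Path) -> None:
    """Apply each (upper_dir, deleted_paths) to workdir in declaration order."""
    for upper, deleted in uppers:
        # Deletions first, so a path deleted and recreated in the same
        # branch ends up present.
        for rel in sorted(deleted):
            target = workdir / rel
            if target.is_dir():
                shutil.rmtree(target)
            else:
                target.unlink(missing_ok=True)
        # Copy every regular file from the upper into the workdir.
        for src in sorted(p for p in upper.rglob("*") if p.is_file()):
            dst = workdir / src.relative_to(upper)
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
        # A crash at this point leaves this and all earlier branches
        # applied and every later branch unapplied: the partial state
        # that the API doc caveat has to describe.
```

Nothing in this loop is atomic across branches, which is why the table lists a journal or write-ahead log as the step that would close the window.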
Conflict resolution for parallel siblings
| Policy | Behavior |
| --- | --- |
| Fail (default) | Any path written by two siblings aborts the transaction with a MergeConflict listing the conflicting paths. |
| FirstWriterWins | Earlier-declared sibling wins per-path; later writes discarded. |
| LastWriterWins | Later-declared sibling wins per-path. |
| Markers | Both versions retained with a suffix; commits succeed, user resolves later. |
Recommendation: Fail is the only safe default. The others should require explicit opt-in.
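The policies above can be sketched over in-memory sibling uppers as follows. This is an illustration under stated assumptions: each upper is modelled as a dict of path to contents, the enum member names other than FAIL are guesses at the Python spelling, and Markers is left out because its suffix scheme is not specified here.

```python
from __future__ import annotations

from enum import Enum


class ConflictPolicy(Enum):
    FAIL = "fail"
    FIRST_WRITER_WINS = "first_writer_wins"  # assumed Python spelling
    LAST_WRITER_WINS = "last_writer_wins"    # assumed Python spelling


def merge_siblings(
    uppers: list[tuple[str, dict[str, bytes]]],
    policy: ConflictPolicy,
) -> tuple[dict[str, bytes] | None, dict[str, list[str]]]:
    """Merge sibling uppers given in declaration order.

    Returns (merged, conflicts); merged is None when the policy is FAIL
    and at least one path was written by more than one sibling.
    """
    writers: dict[str, list[str]] = {}
    for name, files in uppers:
        for path in files:
            writers.setdefault(path, []).append(name)
    conflicts = {p: w for p, w in writers.items() if len(w) > 1}
    if conflicts and policy is ConflictPolicy.FAIL:
        return None, conflicts

    merged: dict[str, bytes] = {}
    for _name, files in uppers:
        for path, data in files.items():
            if policy is ConflictPolicy.FIRST_WRITER_WINS and path in merged:
                continue  # an earlier-declared sibling already wrote this path
            merged[path] = data  # otherwise the later-declared write replaces it
    return merged, conflicts
```

Returning the conflict map even when a non-Fail policy resolves it keeps the information available for reporting.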
Backend strategy
| Backend | Suitability for transactions |
| --- | --- |
| seccomp | Best fit. Pure userspace, the supervisor already arbitrates every open. Extending the upper-lookup chain across multiple uppers (prior stages' or siblings') is a natural extension of the existing model. Works without privilege on any Linux 5.9+. This is sandlock's identity. |
| overlayfs | Off the rootless happy path. Requires mount privilege via user namespaces or CAP_SYS_ADMIN. Stacking via lowerdir=A:B:C is convenient syntactically but the privilege requirement contradicts sandlock's positioning. Future optimization target for heavy-I/O workloads where in-kernel overlay is meaningfully faster than userspace COW, not a starting point. |
| branchfs | Niche. Requires a kernel module that is not generally available; backend's changes() returns empty (kernel-managed, opaque to userspace), so dry-run does not work. Out of scope for transactions. |
Recommendation: Phase 1 implements transactions on the seccomp backend only. Overlayfs support is a later phase if and only if performance data justifies it. Branchfs is out of scope.
Coordinator placement
Pipeline already owns the stage handles and lifetime. The coordinator lives next to Pipeline. A general Transaction API that accepts an arbitrary set of Sandboxes can be exposed later; the pipeline coordinator is the first user of the underlying primitive.
Phased delivery
Phase 1: Transactional sequential pipelines on the seccomp COW backend
Smallest useful slice that delivers a real, demoable feature without compromising sandlock's unprivileged identity.
- Pipeline::commit() / abort() / dry_run() API and matching Python surface.
- Coordinator forces every stage's on_exit / on_error to BranchAction::Keep, waits for all stages, then commits all in declaration order on aggregate success, or aborts all on any failure.
- Sequential stacking in the seccomp backend: stage N+1's upper-lookup chain is [own upper, stage_N upper, stage_{N-1} upper, ..., original workdir]. Reads walk the chain; first hit wins; deletions tracked per-upper still hide lower entries.
- Isolation: read-committed only.
- Conflict policy: not yet meaningful (sequential, single writer per timestamp).
- Atomicity: serial per-branch commit; document the crash window.
- Backends other than seccomp: fail fast with a clear error pointing at this RFC.
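The read rule in the stacking bullet above (walk the chain, first hit wins, per-upper deletions hide lower entries) might look roughly like this. The Upper class and the in-memory dicts are stand-ins invented for the example; the real backend works on directories and supervisor state.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Upper:
    """In-memory stand-in for one stage's upper directory."""
    files: dict[str, bytes] = field(default_factory=dict)
    deleted: set[str] = field(default_factory=set)


def resolve(path: str, chain: list[Upper], workdir: dict[str, bytes]) -> bytes | None:
    """Resolve a read through the chain, nearest upper first."""
    for upper in chain:
        if path in upper.files:
            return upper.files[path]  # first hit wins
        if path in upper.deleted:
            return None  # a deletion here hides every lower layer
    return workdir.get(path)
```

For stage 3 of a three-stage pipeline the chain would be [stage 3 upper, stage 2 upper, stage 1 upper], with the original workdir consulted last.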
Acceptance: 3-stage pipeline where stage 1 writes a.txt, stage 2 reads a.txt and writes b.txt, stage 3 reads both. On success both files appear in the workdir; on any non-zero stage exit, neither appears. Existing per-sandbox COW behavior unchanged. Python SDK covered by at least one end-to-end test.
Phase 2: Parallel siblings and Gather transactions
- Snapshot isolation for siblings: each sibling has its own upper, lower is the pre-transaction state.
- MergeConflict type and ConflictPolicy::{Fail, FirstWriterWins, LastWriterWins, Markers}.
- Commit-time conflict detection across sibling uppers.
- Gather::commit() / abort() / dry_run() mirroring Pipeline.
Phase 3: Crash safety
- Durable transaction metadata under $XDG_RUNTIME_DIR/sandlock/txn/<id>/ describing every branch and its committed state.
- Recovery sweep on next sandlock invocation: complete or roll back interrupted transactions.
- Two-phase commit option for callers that need stronger guarantees than serial.
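One possible shape for the durable metadata and the recovery decision is sketched below. The journal.json file name, the "phase" field, and its values ("running", "committing", "done") are assumptions made for illustration; the RFC does not fix a format.

```python
import json
import os
from pathlib import Path


def write_journal(txn_dir: Path, state: dict) -> None:
    """Durably replace journal.json: write a temp file, fsync, then rename."""
    txn_dir.mkdir(parents=True, exist_ok=True)
    tmp = txn_dir / "journal.json.tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, txn_dir / "journal.json")  # atomic within one filesystem
    # fsync the directory so the rename itself is persisted.
    dir_fd = os.open(txn_dir, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


def recovery_action(txn_dir: Path) -> str:
    """Decide what a recovery sweep should do with one transaction directory."""
    journal = txn_dir / "journal.json"
    if not journal.exists():
        return "rollback"  # never reached a durable state
    phase = json.loads(journal.read_text())["phase"]
    if phase == "committing":
        return "complete"  # roll forward the branches not yet committed
    if phase == "done":
        return "cleanup"   # only the metadata is left to remove
    return "rollback"      # stages were still running; discard all uppers
```

Rolling forward from "committing" is one choice; rolling back would need the pre-commit contents of every overwritten path, which this sketch does not record.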
Phase 4: Cross-transaction serializability
- Workdir-level advisory lock (file-based, flock) acquired at txn open, released at commit/abort.
- Configurable: IsolationPolicy::{None, FailIfBusy, WaitForLock(Duration)}.
- Opt-in; default remains "undefined behavior on concurrent transactions, same as today."
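A minimal sketch of the flock-based lock, assuming a lock file chosen by the caller. Passing wait_seconds=None approximates FailIfBusy and a number approximates WaitForLock(Duration); the function names and the polling interval are illustrative, not a proposed API.

```python
from __future__ import annotations

import fcntl
import os
import time


def acquire_workdir_lock(lock_path: str, wait_seconds: float | None = None) -> int:
    """Take an exclusive flock on lock_path and return the fd that holds it.

    Raises BlockingIOError if the lock is busy (immediately when
    wait_seconds is None, otherwise once the deadline passes).
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    deadline = None if wait_seconds is None else time.monotonic() + wait_seconds
    while True:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except BlockingIOError:
            if deadline is None or time.monotonic() >= deadline:
                os.close(fd)
                raise
            time.sleep(0.05)  # poll until the deadline


def release_workdir_lock(fd: int) -> None:
    """Release the lock at commit or abort."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

Because flock is advisory, this only serializes cooperating sandlock transactions; other processes writing to the workdir are not blocked.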
Phase 5 (optional): Overlayfs backend support
Only if profiling shows the userspace COW path is a real bottleneck for a class of workloads where users are willing to accept the mount-privilege requirement. Most likely route: --fs-isolation=overlayfs becomes a performance opt-in for transactions, with the same coordinator and lookup-chain semantics as the seccomp path.
Open questions
- Partial / selective commit by path ("commit ./out/, discard ./scratch/"). Useful for build pipelines. Phase 2 or 3? Affects the commit() signature; better to settle before Phase 1 ships.
- Stage retries inside a transaction. A retried stage's prior upper should be discarded before re-running; needs a hook in the coordinator.
- Memory bound on the seccomp backend's deletion-tracking set when chained across many stages. Worth measuring before Phase 1 ships.
- Interaction with --no-supervisor mode. Transactions require the supervisor (the seccomp backend needs it). --no-supervisor should fail fast with a clear error.
- Pipelines that mix sandboxed and unsandboxed stages. Out of scope for now; document.
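For the selective-commit question, one option is to split the changed paths by prefix before committing. The sketch below is only meant to show the matching rule; the function name and the use of plain relative path strings are assumptions, and it takes no position on what the commit() signature should be.

```python
from pathlib import PurePosixPath


def select_changes(
    changed_paths: list[str],
    keep_prefixes: list[str],
) -> tuple[list[str], list[str]]:
    """Split workdir-relative paths into (to_commit, to_discard)."""
    to_commit: list[str] = []
    to_discard: list[str] = []
    for path in changed_paths:
        p = PurePosixPath(path)
        # Component-wise match, so "out" keeps "out/bin/app" but not "output/log".
        if any(p.is_relative_to(PurePosixPath(prefix)) for prefix in keep_prefixes):
            to_commit.append(path)
        else:
            to_discard.append(path)
    return to_commit, to_discard
```

Matching on path components rather than raw string prefixes avoids accidentally committing a sibling directory whose name merely starts with the same characters.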
Out of scope (this RFC)
- Transactions across hosts.
- Transactions over the chroot-mode workdir.
- Branchfs backend support.
- A generic Transaction API decoupled from Pipeline / Gather (Phase 1+2 deliver enough to inform the right shape later).
Proposed Python API and examples
These examples make the design concrete. They use the Python SDK's existing Sandbox(...).cmd([...]) shape plus the | pipeline operator and the Gather builder, and add run_transactional(), prepare(), dry_run(), and a transaction() context manager.
API additions
# python/src/sandlock/_sdk.py
from __future__ import annotations

from dataclasses import dataclass
from typing import ContextManager


class Pipeline:
    def run_transactional(self) -> TxnOutcome: ...
    def dry_run(self) -> DryRunOutcome: ...
    def prepare(self) -> PreparedTxn: ...
    def transaction(self) -> ContextManager[PreparedTxn]: ...


class Gather:
    def run_transactional(self) -> TxnOutcome: ...
    def dry_run(self) -> DryRunOutcome: ...
    def prepare(self) -> PreparedTxn: ...
    def with_conflict_policy(self, policy: ConflictPolicy) -> "Gather": ...  # Phase 2


@dataclass
class TxnOutcome:
    committed: bool
    stages: list[StageResult]
    changes: list[Change]
    conflicts: list[MergeConflict]  # Phase 2
    abort_reason: AbortReason | None


class PreparedTxn:
    stages: list[StageResult]
    changes: list[Change]
    conflicts: list[MergeConflict]

    def all_succeeded(self) -> bool: ...
    def has_conflicts(self) -> bool: ...
    def commit(self) -> None: ...
    def abort(self) -> None: ...
The existing Pipeline.run() keeps today's per-stage-Drop behavior. run_transactional() is the new entry point; existing callers are unaffected.
Example 1: agent loop, all-or-nothing (Phase 1)
from sandlock import Sandbox

workspace = "/home/me/repo"
BASE_READ = ["/usr", "/lib", "/lib64", "/bin", "/etc"]

planner = Sandbox(
    workdir=workspace,
    fs_readable=BASE_READ + [workspace],
    net_allow=["api.openai.com:443"],
)
editor = Sandbox(
    workdir=workspace,
    fs_readable=BASE_READ,
    fs_writable=[workspace],
)
tester = Sandbox(
    workdir=workspace,
    fs_readable=BASE_READ + [workspace],
    max_processes=20,
    max_memory="1G",
)

result = (
    planner.cmd(["python3", "plan.py"])
    | editor.cmd(["python3", "apply_patch.py"])
    | tester.cmd(["pytest", "-x"])
).run_transactional()

if result.committed:
    print("agent step merged")
else:
    print(f"rolled back: {result.abort_reason}")
Example 2: preview before commit, context-managed (Phase 1)
pipeline = (
    planner.cmd(["python3", "plan.py"])
    | editor.cmd(["python3", "apply_patch.py"])
    | tester.cmd(["pytest"])
)

with pipeline.transaction() as txn:
    txn.run()
    if not txn.all_succeeded():
        raise RuntimeError("stage failed; context exit will abort")
    print(f"would change {len(txn.changes)} paths:")
    for c in txn.changes:
        print(f"  {c.kind} {c.path}")
    if input("apply? [y/N] ").lower() == "y":
        txn.commit()
# leaving the `with` block without committing aborts automatically;
# exceptions also auto-abort.
Example 3: test-and-merge for an LLM-proposed patch (Phase 1)
# workspace and BASE_READ are as defined in Example 1.
# call_llm is a caller-supplied helper, not part of sandlock.
def apply_and_test(prompt: str) -> bool:
    patch_text = call_llm(prompt)

    apply = Sandbox(
        workdir=workspace,
        fs_readable=BASE_READ,
        fs_writable=[workspace],
    )
    test = Sandbox(
        workdir=workspace,
        fs_readable=BASE_READ + [workspace],
    )

    outcome = (
        apply.cmd(["git", "apply", "-"], stdin=patch_text.encode())
        | test.cmd(["pytest", "-x"])
    ).run_transactional()

    # Workspace is byte-identical to before the call if anything failed.
    return outcome.committed
Example 4: parallel agents on disjoint subsystems (Phase 2)
from sandlock import ConflictPolicy, Gather, Sandbox

frontend = Sandbox(workdir=workspace, fs_writable=[f"{workspace}/web"])
backend = Sandbox(workdir=workspace, fs_writable=[f"{workspace}/api"])
test = Sandbox(workdir=workspace, fs_readable=BASE_READ + [workspace])

result = (
    Gather()
    .source("frontend_agent", frontend.cmd(["agent", "--scope", "web"]))
    .source("backend_agent", backend.cmd(["agent", "--scope", "api"]))
    .consumer(test.cmd(["pytest"]))
    .with_conflict_policy(ConflictPolicy.FAIL)
    .run_transactional()
)

if not result.committed:
    if result.conflicts:
        for c in result.conflicts:
            print(f"CONFLICT {c.path} written by {c.stages}")
    else:
        print(f"failed: {result.abort_reason}")
Design notes worth flagging
- run() stays as today. Adding run_transactional() rather than mutating run()'s behavior avoids silently changing existing pipelines.
- prepare() returns a value the caller must resolve. PreparedTxn aborts on garbage-collection if never committed, so a forgotten .commit() / .abort() does not silently commit.
- transaction() context manager is the ergonomic surface: clean exit commits if .commit() was called, otherwise aborts; any exception aborts.
- TxnOutcome always carries stages even when aborted, so the caller can inspect which stage failed without a second call.
- Change is the existing dry_run.Change, so dry_run() for a pipeline is the same shape as dry_run() for a single sandbox, just merged across stages.
- No CLI surface in Phase 1. Pipelines are constructed in code today; CLI declaration of multi-stage pipelines is a separate question and should not block this work.
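The abort-unless-committed rule for transaction() can be sketched with a stand-in object. FakePreparedTxn below is a test double invented for this example and only tracks state; it is not the proposed PreparedTxn.

```python
from contextlib import contextmanager


class FakePreparedTxn:
    """Stand-in for PreparedTxn that records only the final state."""

    def __init__(self) -> None:
        self.state = "prepared"

    def commit(self) -> None:
        if self.state != "prepared":
            raise RuntimeError(f"cannot commit from state {self.state!r}")
        self.state = "committed"

    def abort(self) -> None:
        # Aborting after a commit is a no-op; the commit already happened.
        if self.state == "prepared":
            self.state = "aborted"


@contextmanager
def transaction(txn: FakePreparedTxn):
    """Yield txn; on any exit path, abort it unless it was committed."""
    try:
        yield txn
    finally:
        txn.abort()
```

Putting the abort in a finally clause covers both the forgotten-commit case and the exception case with one code path, and the exception still propagates to the caller.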