RFC: Multi-sandbox transactions for pipelines #65

## Goal

Promote `commit` / `abort` from a per-sandbox property to a first-class property of a `Pipeline` (or any group of sandboxes that should succeed or fail together). Stages should be able to see each other's in-flight writes under a defined isolation model, and the runtime should commit all of them together at the end or none of them at all.

This is an RFC for the full feature, not a single PR. It describes the end state, the design choices, the recommended defaults, and a delivery phasing.

## What exists today

- `Sandbox` owns `cow_branch: Option<Box<dyn CowBranch>>` with `commit() / abort() / cleanup() / changes()` (`crates/sandlock-core/src/cow/mod.rs`).
- Three backends:
  - **seccomp**: pure userspace, no mount, no privilege. Supervisor intercepts opens, plans copy-on-write into an upper directory, tracks deletions in memory. Auto-enabled when `fs_isolation = None` and a workdir is set. Works on any Linux 5.9+.
  - **overlayfs**: kernel overlayfs. Requires a mount in the child, which needs mount privilege.
  - **branchfs**: ioctls against a `.branchfs_ctl` node. Requires a custom kernel module that is not generally available.
- Per-sandbox `on_exit` / `on_error` policy (`BranchAction::{Commit, Abort, Keep}`) auto-applied in `Sandbox::Drop` (`crates/sandlock-core/src/sandbox.rs:1615`).
- `Pipeline` forks N stages with inter-stage pipes (`crates/sandlock-core/src/pipeline.rs`) and has no awareness of COW. Each stage's `Drop` independently decides commit/abort based on its own exit code.
- Dry-run for a single sandbox (`Sandbox::do_dry_run`) captures `changes()` then aborts.
- `Gather` (fan-in) exists for data flow over pipes; it has no COW story.

## The end state (what "done" looks like)

A user can declare a group of sandboxes (initially a `Pipeline` or `Gather`, eventually any user-defined set) and run them under a transaction. The runtime guarantees:

1. Every stage observes prior stages' writes according to the chosen isolation model (default: read-committed for sequential, snapshot for parallel siblings).
2. At end of transaction, either every stage's writes commit to the underlying workdir, or none of them do.
3. Parallel siblings that touch the same path produce a structured conflict, surfaced to the caller with a configurable resolution policy.
4. The pipeline can be dry-run: the union of intended changes is reported and nothing is committed.
5. The transaction works on any sandlock-supported kernel (5.9+) without elevated privilege.
6. Crash mid-commit is recoverable: a subsequent sandlock invocation can either complete or roll back the interrupted transaction from durable transaction metadata.
7. Concurrent transactions over the same workdir are serializable (or at minimum: fail fast with a clear error rather than interleaving with undefined results).

The user surface:

- Rust: `Pipeline::commit() / abort() / dry_run()`, `Transaction::new(sandboxes).run()`, `MergeConflict` type with policy enum.
- CLI: `sandlock pipeline ... --dry-run`, `sandlock txn list / show / abort`, exit codes that distinguish stage failure from commit failure from conflict.
- Python SDK: mirrored API plus `with Transaction(...) as t:` context manager.

## Design axes (the choices the RFC commits to)

### Isolation model

| Model | What a stage sees | Cost |
|---|---|---|
| Read-committed (sequential) | All writes committed by *prior* stages in this transaction, plus the pre-transaction workdir | Low. Lookup chain extends through prior uppers. |
| Snapshot (parallel siblings) | The pre-transaction workdir only; siblings' writes are invisible until commit | Low for seccomp backend (each sibling has its own upper). |
| Serializable (across transactions) | No other transaction's mid-flight writes ever visible | Needs a workdir-level lock manager. |

**Recommendation**: read-committed for sequential pipelines, snapshot for parallel siblings, opt-in workdir lock for cross-transaction serializability. All three are expressible on top of the seccomp backend without new kernel features.
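To make the two in-transaction models concrete, here is a minimal in-memory sketch of the lookup chains. It is not sandlock code: `Upper`, `resolve`, `sequential_chain`, and `sibling_chain` are hypothetical names, and plain dictionaries stand in for the workdir and the per-stage upper directories.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Upper:
    """Hypothetical model of one stage's private copy-on-write layer."""
    files: dict[str, bytes] = field(default_factory=dict)
    deleted: set[str] = field(default_factory=set)


def resolve(path: str, chain: list[Upper], workdir: dict[str, bytes]) -> Optional[bytes]:
    """Walk the chain newest-first; the first hit (or deletion mark) wins."""
    for upper in chain:
        if path in upper.deleted:
            return None
        if path in upper.files:
            return upper.files[path]
    return workdir.get(path)


def sequential_chain(uppers: list[Upper], stage: int) -> list[Upper]:
    """Read-committed: own upper first, then prior stages' uppers, newest first."""
    return list(reversed(uppers[: stage + 1]))


def sibling_chain(uppers: list[Upper], sibling: int) -> list[Upper]:
    """Snapshot: own upper only, so siblings' writes are never consulted."""
    return [uppers[sibling]]


workdir = {"README": b"base"}
uppers = [Upper(), Upper(), Upper()]
uppers[0].files["a.txt"] = b"from stage 0"
uppers[1].deleted.add("README")

seq = sequential_chain(uppers, 2)   # stage 2 of a sequential pipeline
sib = sibling_chain(uppers, 2)      # the same stage run as a parallel sibling
```

Under these assumptions, the sequential view sees `a.txt` and no longer sees `README`, while the sibling view sees only the pre-transaction workdir.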

### Atomicity of finalization

| Strategy | Property | Cost |
|---|---|---|
| Serial per-branch commit | Simple. Mid-commit failure leaves partial state. | Trivial. |
| Two-phase commit | Prepare all branches (validate writable, no conflicts), then flip. Mid-flip failure still partial unless combined with a journal. | Medium. |
| Write-ahead log + recovery | Durable transaction metadata; recovery step on next startup completes or rolls back. | Significant, but unblocks crash safety. |
| Single-rename finalize | Stage entire result tree aside, `rename(2)` into place. Atomic but only for whole-subtree replacement. | Limited applicability. |

**Recommendation**: serial per-branch commit in Phase 1 with an explicit "crash mid-commit may leave partial state" caveat in the API doc; WAL + recovery as a later phase.
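The crash window in the serial strategy can be illustrated with a toy model. This is a sketch only: `serial_commit`, `CommitError`, and the `fail_at` injection parameter are hypothetical, and dictionaries stand in for the workdir and branch uppers.

```python
from typing import Optional


class CommitError(Exception):
    """Raised by the model to simulate an interruption mid-commit."""


def serial_commit(
    workdir: dict[str, bytes],
    branches: list[dict[str, bytes]],
    fail_at: Optional[int] = None,
) -> None:
    """Apply each branch's writes in declaration order.

    `fail_at` injects a failure just before that branch is applied,
    standing in for a crash or an I/O error.
    """
    for index, branch in enumerate(branches):
        if index == fail_at:
            raise CommitError(f"commit interrupted before branch {index}")
        workdir.update(branch)


workdir = {"base.txt": b"0"}
branches = [{"a.txt": b"1"}, {"b.txt": b"2"}, {"c.txt": b"3"}]

try:
    serial_commit(workdir, branches, fail_at=2)
except CommitError:
    pass

partial = sorted(workdir)  # branches 0 and 1 landed, branch 2 did not
```

The workdir ends up holding the first two branches but not the third, which is exactly the partial state the API doc caveat needs to describe.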

### Conflict resolution for parallel siblings

| Policy | Behavior |
|---|---|
| `Fail` (default) | Any path written by two siblings aborts the transaction with a `MergeConflict` listing the conflicting paths. |
| `FirstWriterWins` | Earlier-declared sibling wins per-path; later writes discarded. |
| `LastWriterWins` | Later-declared sibling wins per-path. |
| `Markers` | Both versions retained with a suffix; commits succeed, user resolves later. |

**Recommendation**: `Fail` is the only safe default. The others should require explicit opt-in.
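A rough sketch of commit-time conflict detection and three of the four policies follows. The dictionary-based uppers and the function names are assumptions made for illustration; `Markers` is left out because its suffix scheme is not specified here.

```python
from enum import Enum


class ConflictPolicy(Enum):
    FAIL = "fail"
    FIRST_WRITER_WINS = "first"
    LAST_WRITER_WINS = "last"


class MergeConflict(Exception):
    def __init__(self, paths: dict[str, list[int]]):
        super().__init__(f"conflicting paths: {sorted(paths)}")
        self.paths = paths  # path -> indices of the siblings that wrote it


def find_conflicts(siblings: list[dict[str, bytes]]) -> dict[str, list[int]]:
    """Return every path written by two or more siblings."""
    writers: dict[str, list[int]] = {}
    for index, upper in enumerate(siblings):
        for path in upper:
            writers.setdefault(path, []).append(index)
    return {path: who for path, who in writers.items() if len(who) > 1}


def merge(
    siblings: list[dict[str, bytes]],
    policy: ConflictPolicy = ConflictPolicy.FAIL,
) -> dict[str, bytes]:
    """Merge sibling uppers in declaration order under the given policy."""
    conflicts = find_conflicts(siblings)
    if conflicts and policy is ConflictPolicy.FAIL:
        raise MergeConflict(conflicts)
    # Later updates overwrite earlier ones, so apply the winner last.
    if policy is ConflictPolicy.FIRST_WRITER_WINS:
        order = list(reversed(siblings))
    else:
        order = siblings
    merged: dict[str, bytes] = {}
    for upper in order:
        merged.update(upper)
    return merged


web = {"web/app.js": b"w", "VERSION": b"1"}
api = {"api/main.py": b"a", "VERSION": b"2"}
conflicts = find_conflicts([web, api])
```

Here both siblings write `VERSION`, so the default policy raises `MergeConflict`, while the opt-in policies pick one version by declaration order.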

### Backend strategy

| Backend | Suitability for transactions |
|---|---|
| **seccomp** | Best fit. Pure userspace, the supervisor already arbitrates every open. Extending the upper-lookup chain across multiple uppers (prior stages' or parallel siblings') is a natural extension of the existing model. Works without privilege on any Linux 5.9+. This is sandlock's identity. |
| **overlayfs** | Off the rootless happy path. Requires mount privilege via user namespaces or `CAP_SYS_ADMIN`. Stacking via `lowerdir=A:B:C` is convenient *syntactically* but the privilege requirement contradicts sandlock's positioning. Future optimization target for heavy-I/O workloads where in-kernel overlay is meaningfully faster than userspace COW, not a starting point. |
| **branchfs** | Niche. Requires a kernel module that is not generally available; backend's `changes()` returns empty (kernel-managed, opaque to userspace), so dry-run does not work. Out of scope for transactions. |

**Recommendation**: Phase 1 implements transactions on the seccomp backend only. Overlayfs support is a later phase if and only if performance data justifies it. Branchfs is out of scope.

### Coordinator placement

`Pipeline` already owns the stage handles and lifetime. The coordinator lives next to `Pipeline`. A general `Transaction` API that accepts an arbitrary set of `Sandbox`es can be exposed later; the pipeline coordinator is the first user of the underlying primitive.

## Phased delivery

### Phase 1: Transactional sequential pipelines on the seccomp COW backend

Smallest useful slice that delivers a real, demoable feature without compromising sandlock's unprivileged identity.

- `Pipeline::commit() / abort() / dry_run()` API and matching Python surface.
- Coordinator forces every stage's `on_exit` / `on_error` to `BranchAction::Keep`, waits for all stages, then commits all in declaration order on aggregate success, or aborts all on any failure.
- **Sequential stacking in the seccomp backend**: stage N+1's upper-lookup chain is `[own upper, stage_N upper, stage_{N-1} upper, ..., original workdir]`. Reads walk the chain; first hit wins; deletions tracked per-upper still hide lower entries.
- Isolation: read-committed only.
- Conflict policy: not yet meaningful (stages are strictly ordered, so a later stage's write to a path simply shadows an earlier stage's write).
- Atomicity: serial per-branch commit; document the crash window.
- Backends other than seccomp: fail fast with a clear error pointing at this RFC.

Acceptance: 3-stage pipeline where stage 1 writes `a.txt`, stage 2 reads `a.txt` and writes `b.txt`, stage 3 reads both. On success both files appear in the workdir; on any non-zero stage exit, neither appears. Existing per-sandbox COW behavior unchanged. Python SDK covered by at least one end-to-end test.
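The coordinator rule and the acceptance scenario can be modeled in a few lines. This is a sketch under stated assumptions: `Stage`, `finalize`, and `run_acceptance` are hypothetical names, each stage's held upper is a dictionary, and the stages are treated as already finished with a known exit code.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    name: str
    exit_code: int
    upper: dict[str, bytes] = field(default_factory=dict)  # writes held under Keep


def finalize(workdir: dict[str, bytes], stages: list[Stage]) -> bool:
    """Commit every stage's upper in declaration order, or none of them."""
    if any(stage.exit_code != 0 for stage in stages):
        for stage in stages:
            stage.upper.clear()      # abort: discard the held writes
        return False
    for stage in stages:
        workdir.update(stage.upper)  # commit in declaration order
        stage.upper.clear()
    return True


def run_acceptance(stage2_exit: int) -> dict[str, bytes]:
    """Replay the 3-stage acceptance scenario with a chosen exit code for stage 2."""
    workdir: dict[str, bytes] = {}
    stages = [
        Stage("write-a", 0, {"a.txt": b"a"}),
        Stage("write-b", stage2_exit, {"b.txt": b"b"}),
        Stage("read-both", 0),
    ]
    finalize(workdir, stages)
    return workdir


on_success = run_acceptance(0)
on_failure = run_acceptance(1)
```

With all stages exiting zero both files land in the workdir; with any non-zero exit the workdir stays empty.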

### Phase 2: Parallel siblings and `Gather` transactions

- Snapshot isolation for siblings: each sibling has its own upper, lower is the pre-transaction state.
- `MergeConflict` type and `ConflictPolicy::{Fail, FirstWriterWins, LastWriterWins, Markers}`.
- Commit-time conflict detection across sibling uppers.
- `Gather::commit() / abort() / dry_run()` mirroring `Pipeline`.

### Phase 3: Crash safety

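- Durable transaction metadata under `$XDG_RUNTIME_DIR/sandlock/txn/<id>/` describing every branch and its committed state. Because the XDG Base Directory Specification does not preserve `$XDG_RUNTIME_DIR` across reboots, this location covers a crashed sandlock process but not a host crash.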
- Recovery sweep on next sandlock invocation: complete or roll back interrupted transactions.
- Two-phase commit option for callers that need stronger guarantees than serial.
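One possible shape for the recovery decision is sketched below. The journal layout (`journal.json` with `state`, `branches`, and `committed` fields) and the function names are assumptions for illustration, not a proposed on-disk format.

```python
import json
import os
import tempfile
from pathlib import Path


def write_journal(txn_dir: Path, record: dict) -> None:
    """Persist the record atomically: write a temp file, fsync, then rename."""
    tmp = txn_dir / "journal.json.tmp"
    with open(tmp, "w") as handle:
        json.dump(record, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp, txn_dir / "journal.json")


def recover(txn_dir: Path) -> tuple[str, list[str]]:
    """Decide what a recovery sweep should do with one transaction directory."""
    journal = txn_dir / "journal.json"
    if not journal.exists():
        return ("rollback", [])      # no durable commit decision was recorded
    record = json.loads(journal.read_text())
    if record["state"] != "committing":
        return ("rollback", [])      # stages ran, but commit never started
    pending = [b for b in record["branches"] if b not in record["committed"]]
    return ("complete", pending)     # roll forward whatever is left


with tempfile.TemporaryDirectory() as tmp_dir:
    txn_dir = Path(tmp_dir)
    before_journal = recover(txn_dir)

    write_journal(txn_dir, {"state": "prepared", "branches": ["s1", "s2"], "committed": []})
    after_prepare = recover(txn_dir)

    write_journal(txn_dir, {"state": "committing", "branches": ["s1", "s2"], "committed": ["s1"]})
    mid_commit = recover(txn_dir)
```

The rule modeled here is that a transaction rolls forward only once a `committing` record is durable; anything earlier rolls back.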

### Phase 4: Cross-transaction serializability

- Workdir-level advisory lock (file-based, `flock`) acquired at txn open, released at commit/abort.
- Configurable: `IsolationPolicy::{None, FailIfBusy, WaitForLock(Duration)}`.
- Opt-in; default remains "undefined behavior on concurrent transactions, same as today."
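A minimal sketch of the advisory lock, using Python's standard `fcntl.flock`, might look like the following. The lock file name `.sandlock-txn.lock`, the `WorkdirBusy` exception, and the polling interval are assumptions; `wait_seconds=0` models `FailIfBusy` and a positive value models `WaitForLock(Duration)`.

```python
import fcntl
import os
import tempfile
import time


class WorkdirBusy(Exception):
    """Another transaction holds the workdir lock."""


def acquire_workdir_lock(workdir: str, wait_seconds: float = 0.0) -> int:
    """Take an exclusive advisory lock and return the fd that holds it."""
    lock_path = os.path.join(workdir, ".sandlock-txn.lock")  # hypothetical name
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    deadline = time.monotonic() + wait_seconds
    while True:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except BlockingIOError:
            if time.monotonic() >= deadline:
                os.close(fd)
                raise WorkdirBusy(workdir)
            time.sleep(0.05)  # poll until the deadline passes


def release_workdir_lock(fd: int) -> None:
    """Release at commit/abort; closing the fd would also drop the lock."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)


with tempfile.TemporaryDirectory() as workdir:
    first = acquire_workdir_lock(workdir)
    try:
        acquire_workdir_lock(workdir)   # second transaction, FailIfBusy
        second_blocked = False
    except WorkdirBusy:
        second_blocked = True
    release_workdir_lock(first)

    third = acquire_workdir_lock(workdir)  # succeeds once the lock is free
    release_workdir_lock(third)
    reacquired = True
```

Because `flock` locks belong to the open file description, the second `os.open` contends with the first even inside one process, which is what makes this demo self-contained.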

### Phase 5 (optional): Overlayfs backend support

Only if profiling shows the userspace COW path is a real bottleneck for a class of workloads where users are willing to accept the mount-privilege requirement. Most likely route: `--fs-isolation=overlayfs` becomes a performance opt-in for transactions, with the same coordinator and lookup-chain semantics as the seccomp path.

## Open questions

- **Partial / selective commit by path** ("commit `./out/`, discard `./scratch/`"). Useful for build pipelines. Phase 2 or 3? Affects the `commit()` signature; better to settle before Phase 1 ships.
- **Stage retries inside a transaction.** A retried stage's prior upper should be discarded before re-running; needs a hook in the coordinator.
- **Memory bound on the seccomp backend's deletion-tracking set** when chained across many stages. Worth measuring before Phase 1 ships.
- **Interaction with `--no-supervisor` mode.** Transactions require the supervisor (the seccomp backend needs it). `--no-supervisor` should fail fast with a clear error.
- **Pipelines that mix sandboxed and unsandboxed stages.** Out of scope for now; document.
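To give the selective-commit question something concrete to react to, here is one way the path filter could behave. `select_paths` and its prefix-list argument are hypothetical; the RFC has not settled on a signature.

```python
from pathlib import PurePosixPath


def select_paths(
    changed: list[str], commit_prefixes: list[str]
) -> tuple[list[str], list[str]]:
    """Split changed paths into (commit, discard) by directory prefix."""
    commit: list[str] = []
    discard: list[str] = []
    for path in changed:
        candidate = PurePosixPath(path)
        # is_relative_to compares whole path components, so "output/"
        # does not match the prefix "out".
        if any(candidate.is_relative_to(prefix) for prefix in commit_prefixes):
            commit.append(path)
        else:
            discard.append(path)
    return commit, discard


changed = ["out/bin/app", "scratch/tmp.o", "output/log"]
to_commit, to_discard = select_paths(changed, ["./out/"])
```

Matching on path components rather than string prefixes avoids committing `output/log` when the caller asked only for `./out/`.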

## Out of scope (this RFC)

- Transactions across hosts.
- Transactions over the chroot-mode workdir.
- Branchfs backend support.
- A generic `Transaction` API decoupled from `Pipeline` / `Gather` (Phase 1+2 deliver enough to inform the right shape later).

## Proposed Python API and examples

These examples make the design concrete. They use the Python SDK's existing `Sandbox(...).cmd([...])` shape plus the `|` pipeline operator and the `Gather` builder, and add `run_transactional()`, `prepare()`, `dry_run()`, and a `transaction()` context manager.

### API additions

```python
# python/src/sandlock/_sdk.py

class Pipeline:
    def run_transactional(self) -> TxnOutcome: ...
    def dry_run(self) -> DryRunOutcome: ...
    def prepare(self) -> PreparedTxn: ...
    def transaction(self) -> ContextManager[PreparedTxn]: ...

class Gather:
    def run_transactional(self) -> TxnOutcome: ...
    def dry_run(self) -> DryRunOutcome: ...
    def prepare(self) -> PreparedTxn: ...
    def with_conflict_policy(self, policy: ConflictPolicy) -> "Gather": ...   # Phase 2

@dataclass
class TxnOutcome:
    committed: bool
    stages: list[StageResult]
    changes: list[Change]
    conflicts: list[MergeConflict]       # Phase 2
    abort_reason: AbortReason | None

class PreparedTxn:
    stages: list[StageResult]
    changes: list[Change]
    conflicts: list[MergeConflict]
    def all_succeeded(self) -> bool: ...
    def has_conflicts(self) -> bool: ...
    def commit(self) -> None: ...
    def abort(self) -> None: ...
```

The existing `Pipeline.run()` keeps today's per-stage-Drop behavior. `run_transactional()` is the new entry point; existing callers are unaffected.

### Example 1: agent loop, all-or-nothing (Phase 1)

```python
from sandlock import Sandbox

workspace = "/home/me/repo"
BASE_READ = ["/usr", "/lib", "/lib64", "/bin", "/etc"]

planner = Sandbox(
    workdir=workspace,
    fs_readable=BASE_READ + [workspace],
    net_allow=["api.openai.com:443"],
)
editor = Sandbox(
    workdir=workspace,
    fs_readable=BASE_READ,
    fs_writable=[workspace],
)
tester = Sandbox(
    workdir=workspace,
    fs_readable=BASE_READ + [workspace],
    max_processes=20,
    max_memory="1G",
)

result = (
    planner.cmd(["python3", "plan.py"])
    | editor.cmd(["python3", "apply_patch.py"])
    | tester.cmd(["pytest", "-x"])
).run_transactional()

if result.committed:
    print("agent step merged")
else:
    print(f"rolled back: {result.abort_reason}")
```

### Example 2: preview before commit, context-managed (Phase 1)

```python
pipeline = (
    planner.cmd(["python3", "plan.py"])
    | editor.cmd(["python3", "apply_patch.py"])
    | tester.cmd(["pytest"])
)

with pipeline.transaction() as txn:
    txn.run()

    if not txn.all_succeeded():
        raise RuntimeError("stage failed; context exit will abort")

    print(f"would change {len(txn.changes)} paths:")
    for c in txn.changes:
        print(f"  {c.kind} {c.path}")

    if input("apply? [y/N] ").lower() == "y":
        txn.commit()
    # leaving the `with` block without committing aborts automatically;
    # exceptions also auto-abort.
```
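The abort-unless-committed behavior used in this example can be modeled with `contextlib`. This is a toy stand-in, not the SDK: `PreparedTxnModel` tracks only a state string and runs no stages.

```python
from contextlib import contextmanager


class PreparedTxnModel:
    """Toy stand-in for PreparedTxn that only tracks its final state."""

    def __init__(self) -> None:
        self.state = "open"

    def commit(self) -> None:
        if self.state != "open":
            raise RuntimeError(f"cannot commit a transaction that is {self.state}")
        self.state = "committed"

    def abort(self) -> None:
        if self.state == "open":   # no-op once committed or aborted
            self.state = "aborted"


@contextmanager
def transaction():
    txn = PreparedTxnModel()
    try:
        yield txn
    finally:
        txn.abort()   # runs on clean exit and on exceptions alike


with transaction() as committed_txn:
    committed_txn.commit()

with transaction() as forgotten_txn:
    pass  # caller never commits

try:
    with transaction() as failed_txn:
        raise ValueError("stage failed")
except ValueError:
    pass
```

Putting the abort in `finally` means the three exits (explicit commit, forgotten commit, exception) need no separate handling.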

### Example 3: test-and-merge for an LLM-proposed patch (Phase 1)

```python
# `call_llm` stands in for the caller's own LLM client; `workspace` and
# `BASE_READ` are the values defined in Example 1.
def apply_if_tests_pass(prompt: str) -> bool:
    patch_text = call_llm(prompt)

    apply = Sandbox(
        workdir=workspace,
        fs_readable=BASE_READ,
        fs_writable=[workspace],
    )
    test = Sandbox(
        workdir=workspace,
        fs_readable=BASE_READ + [workspace],
    )

    outcome = (
        apply.cmd(["git", "apply", "-"], stdin=patch_text.encode())
        | test.cmd(["pytest", "-x"])
    ).run_transactional()

    # Workspace is byte-identical to before the call if anything failed.
    return outcome.committed
```

### Example 4: parallel agents on disjoint subsystems (Phase 2)

```python
from sandlock import Gather, ConflictPolicy

frontend = Sandbox(workdir=workspace, fs_writable=[f"{workspace}/web"])
backend  = Sandbox(workdir=workspace, fs_writable=[f"{workspace}/api"])
test     = Sandbox(workdir=workspace, fs_readable=BASE_READ + [workspace])

result = (
    Gather()
    .source("frontend_agent", frontend.cmd(["agent", "--scope", "web"]))
    .source("backend_agent",  backend.cmd(["agent", "--scope", "api"]))
    .consumer(test.cmd(["pytest"]))
    .with_conflict_policy(ConflictPolicy.FAIL)
    .run_transactional()
)

if not result.committed:
    if result.conflicts:
        for c in result.conflicts:
            print(f"CONFLICT {c.path} written by {c.stages}")
    else:
        print(f"failed: {result.abort_reason}")
```

### Design notes worth flagging

1. **`run()` stays as today.** Adding `run_transactional()` rather than mutating `run()`'s behavior avoids silently changing existing pipelines.
2. **`prepare()` returns a value the caller must resolve.** `PreparedTxn` aborts on garbage-collection if never committed, so a forgotten `.commit()` / `.abort()` does not silently commit.
3. **`transaction()` context manager** is the ergonomic surface: a `.commit()` called inside the block stands; leaving the block without one aborts, and any exception aborts.
4. **`TxnOutcome` always carries `stages`** even when aborted, so the caller can inspect which stage failed without a second call.
5. **`Change` is the existing `dry_run.Change`**, so `dry_run()` for a pipeline is the same shape as `dry_run()` for a single sandbox, just merged across stages.
6. **No CLI surface in Phase 1.** Pipelines are constructed in code today; CLI declaration of multi-stage pipelines is a separate question and should not block this work.
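For note 5, one way per-stage change lists might be merged into a single pipeline-level report is sketched below. The `Change` fields mirror the `c.kind` / `c.path` usage in Example 2, but the kind values (`"added"`, `"modified"`, `"deleted"`) and the collapsing rules are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    kind: str   # assumed values: "added", "modified", "deleted"
    path: str


def merge_changes(per_stage: list[list[Change]]) -> list[Change]:
    """Collapse per-stage changes to one net change per path, relative to the workdir."""
    latest: dict[str, Change] = {}
    for changes in per_stage:
        for change in changes:
            previous = latest.get(change.path)
            if previous is not None and previous.kind == "added":
                if change.kind == "deleted":
                    del latest[change.path]   # created and removed inside the txn
                    continue
                change = Change("added", change.path)     # still new to the workdir
            elif previous is not None and previous.kind == "deleted" and change.kind == "added":
                change = Change("modified", change.path)  # replaced an existing file
            latest[change.path] = change
    return sorted(latest.values(), key=lambda c: c.path)


merged = merge_changes([
    [Change("added", "a.txt"), Change("modified", "cfg")],
    [Change("modified", "a.txt"), Change("added", "tmp")],
    [Change("deleted", "tmp")],
])
```

In this run the scratch file `tmp` disappears from the report and `a.txt` is still reported as added, since neither existed before the transaction.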

