Sandboxed Coding Agent — Testing Plan

Preamble: Review of Proposed Strategy

This testing plan is derived from a review of the externally proposed testing strategy. That strategy is strong in several areas: it correctly identifies risk-driven quality goals, proposes a sound test pyramid with CI gating, calls for fault injection and model-based testing (both high-ROI), recommends reusing upstream test suites rather than reinventing them, and sequences test-driven development so that "debugging QEMU" does not become the main loop.

The review identified the following issues and gaps that this plan addresses:

Spec ambiguities the strategy correctly flagged but left open. This plan locks them down (§1) so that tests have precise oracles.
Missing coverage for ambient-step behavior. Writes arriving outside any active command step must be tested explicitly — the strategy mentions the concept but omits test cases.
Incomplete POSIX metadata overlay testing (Windows). The strategy's Windows normalization tests don't cover the SQLite-backed overlay store added in the updated project plan for persistent chmod tracking.
Reflink/clone detection path untested. The strategy mentions reflinks as a storage optimization but has no tests for the detection and fallback logic.
Quiescence window edge cases underspecified. The strategy mentions quiescence tests but doesn't define the timeout, hang-prevention, or interaction with ambient steps.
MCP write_file synthetic step lifecycle. The strategy lists write_file producing an API step but doesn't specify the full lifecycle (step open → preimage → write → step close) or error paths.
No explicit test for undo-barrier visibility in undo.history. The strategy tests barrier blocking but not the API shape of barrier entries in history responses.
Event ordering guarantee left unresolved. This plan chooses a concrete semantics so tests can assert against it.
CI infrastructure for QEMU E2E. The strategy recommends E2E tests but doesn't address the practical question of KVM availability in CI runners.

All of these are resolved in the sections below.

1. Spec Decisions Required by Tests

Tests need precise oracles. The following decisions are locked down here and must be reflected in implementation.

1.1 `post_create` vs `pre_open_trunc` disambiguation

Decision: The filesystem backend is responsible for distinguishing "create new" from "open-and-truncate existing." If the backend's create operation opens an existing file with truncation, the backend must call pre_open_trunc (not post_create). post_create is only called when a genuinely new inode is created. The interceptor does not attempt to detect this itself — the backend has the information (e.g., O_CREAT|O_TRUNC vs O_CREAT|O_EXCL).

Test implication: Tests must verify that overwriting an existing file via creat() captures the preimage, while creating a truly new file records existed_before=false.
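
The decision above can be sketched as a small pure function that a backend might use to pick the hook. This is a minimal illustration, not the backend's actual code: `OpenIntent`, `Hook`, and `choose_hook` are hypothetical names, and the open flags are modeled as booleans rather than real `O_*` constants.

```rust
/// Which interceptor hook the backend should call for an open/create request.
/// (Hypothetical type for illustration.)
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Hook {
    /// A genuinely new inode was created.
    PostCreate,
    /// An existing file is about to be truncated; capture its preimage first.
    PreOpenTrunc,
    /// No hook applies at open time (no mutation, or the open fails).
    NoHook,
}

/// Simplified view of the open flags that matter for this decision.
#[derive(Debug, Clone, Copy)]
struct OpenIntent {
    create: bool, // O_CREAT
    excl: bool,   // O_EXCL
    trunc: bool,  // O_TRUNC
}

fn choose_hook(intent: OpenIntent, exists: bool) -> Hook {
    match (exists, intent.create, intent.excl, intent.trunc) {
        // O_CREAT|O_EXCL on an existing path fails with EEXIST: nothing to record.
        (true, true, true, _) => Hook::NoHook,
        // Existing file opened with truncation: preimage must be captured.
        (true, _, _, true) => Hook::PreOpenTrunc,
        // Existing file opened without truncation: no mutation at open time.
        (true, _, _, false) => Hook::NoHook,
        // Path does not exist and O_CREAT is set: a new inode is created.
        (false, true, _, _) => Hook::PostCreate,
        // Path does not exist and O_CREAT is absent: open fails with ENOENT.
        (false, false, _, _) => Hook::NoHook,
    }
}
```

Under this sketch, `creat()` (equivalent to `O_CREAT|O_WRONLY|O_TRUNC`) maps to `PreOpenTrunc` for an existing file and to `PostCreate` for a new one, which is the pair of cases the test implication calls out.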

1.2 Directory restore ordering during rollback

Decision: Rollback processes paths in two passes:

Create pass (top-down, shallowest first): Recreate directories from shallowest to deepest, then restore file contents and metadata.
Metadata pass (bottom-up, deepest first): Restore directory metadata (mode, mtime) after all children are restored, proceeding from deepest to shallowest. This prevents child restoration from updating a parent directory's mtime after it was already restored.

Test implication: Tests that delete a directory tree and roll back must assert that both file contents and directory metadata (including mtime) are restored correctly.
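
One way to obtain the two orderings is to sort the affected directory paths by component depth. The sketch below assumes paths are relative to the share root; `create_pass_order` and `metadata_pass_order` are hypothetical helper names, not part of the project plan.

```rust
use std::path::{Path, PathBuf};

/// Number of path components, used as the directory depth.
fn depth(p: &Path) -> usize {
    p.components().count()
}

/// Create pass: parents before children (shallowest first).
fn create_pass_order(mut dirs: Vec<PathBuf>) -> Vec<PathBuf> {
    dirs.sort_by(|a, b| depth(a).cmp(&depth(b)).then_with(|| a.cmp(b)));
    dirs
}

/// Metadata pass: children before parents (deepest first), so restoring a
/// child can no longer disturb a parent's already-restored mtime.
fn metadata_pass_order(mut dirs: Vec<PathBuf>) -> Vec<PathBuf> {
    dirs.sort_by(|a, b| depth(b).cmp(&depth(a)).then_with(|| a.cmp(b)));
    dirs
}
```

The lexicographic tie-break only makes the order deterministic for tests; siblings at the same depth are independent of each other.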

1.3 Metadata equality semantics

Decision: Rollback restores the following, and TreeSnapshot comparison asserts them:

Attribute	Assertion	Notes
File contents	Byte-exact	Always
File type	Exact (reg/dir/symlink)	Always
Mode bits	Exact (all 12 bits: suid/sgid/sticky + rwx)	Always on Linux; via overlay on Windows
mtime	Within filesystem granularity tolerance (configurable, default 1ms)	FAT32 has 2-second granularity; ext4/APFS are sub-second
xattrs	Exact key-value set if the filesystem supports xattrs	Tests skip with explicit reason if unsupported
Symlink target	Exact string	Always
atime	Not asserted	Deliberately excluded — too volatile
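
The mode and mtime rows of the table could be checked with helpers along these lines. This is a sketch under the table's stated semantics; `MODE_MASK`, `modes_equal`, and `mtimes_equal` are hypothetical names.

```rust
/// The 12 asserted mode bits: suid/sgid/sticky plus rwx for user/group/other.
const MODE_MASK: u32 = 0o7777;

/// Compare only the asserted permission bits, ignoring file-type bits
/// that stat-style mode values carry in the higher positions.
fn modes_equal(a: u32, b: u32) -> bool {
    (a & MODE_MASK) == (b & MODE_MASK)
}

/// Compare mtimes (nanoseconds since the epoch) within a tolerance.
fn mtimes_equal(a_ns: i128, b_ns: i128, tolerance_ns: i128) -> bool {
    (a_ns - b_ns).abs() <= tolerance_ns
}
```

With the default 1ms tolerance a test passes `1_000_000`; a FAT32-backed run would raise it to `2_000_000_000`.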

1.4 Undo history after rollback

Decision: Rollback is a pop operation. Rolled-back steps are removed from the history and cannot be re-applied (no "redo"). undo.history after undo.rollback(2) returns a list that no longer contains the two most recent steps.

Rationale: Pop is simpler and avoids the question of whether redo is valid after external modifications or new steps. Redo can be added post-MVP as a separate feature.
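
The pop semantics can be modeled with a plain stack. In this sketch `History` and `RollbackError` are hypothetical types, and rejecting a request for more steps than exist is an assumption; the plan does not specify that case.

```rust
#[derive(Debug, PartialEq, Eq)]
enum RollbackError {
    NotEnoughSteps { requested: usize, available: usize },
}

/// Step IDs in commit order, oldest first.
struct History {
    steps: Vec<i64>,
}

impl History {
    /// Remove the `count` most recent steps and return them newest-first,
    /// which is the order in which they would be undone.
    fn rollback(&mut self, count: usize) -> Result<Vec<i64>, RollbackError> {
        let available = self.steps.len();
        if count > available {
            // Assumption: an over-long rollback is rejected and changes nothing.
            return Err(RollbackError::NotEnoughSteps { requested: count, available });
        }
        let mut removed = self.steps.split_off(available - count);
        removed.reverse();
        Ok(removed)
    }
}
```

After `rollback(2)` on steps `[1, 2, 3, 4]`, the history holds `[1, 2]` and there is nothing left from which a redo could be built.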

1.5 STDIO event/response ordering

Decision: Events and responses may interleave on stdout. The only ordering guarantee is: the response for a given request_id is sent after the corresponding operation completes (or fails). Events (event.step_completed, event.terminal_output, etc.) may arrive before or after the response they are associated with. Clients must correlate by request_id and step_id, not by position in the stream.

Test implication: JsonlClient must buffer and correlate, not assume positional ordering.
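
A buffering demultiplexer of the kind JsonlClient needs might look like the following. The sketch works on already-parsed messages to stay dependency-free; `Message` and `Demux` are hypothetical names, and the real client would parse JSON lines from stdout first.

```rust
use std::collections::{HashMap, VecDeque};

/// A parsed stdout line: either a response or an event.
#[derive(Debug, Clone, PartialEq)]
enum Message {
    Response { request_id: String, body: String },
    Event { event_type: String, body: String },
}

/// Buffers messages so callers can wait by key instead of by stream position.
#[derive(Default)]
struct Demux {
    responses: HashMap<String, String>,
    events: HashMap<String, VecDeque<String>>,
}

impl Demux {
    fn push(&mut self, msg: Message) {
        match msg {
            Message::Response { request_id, body } => {
                self.responses.insert(request_id, body);
            }
            Message::Event { event_type, body } => {
                self.events.entry(event_type).or_default().push_back(body);
            }
        }
    }

    /// Take the buffered response for `request_id`, if it has arrived.
    fn take_response(&mut self, request_id: &str) -> Option<String> {
        self.responses.remove(request_id)
    }

    /// Take the oldest buffered event of `event_type`, if any.
    fn take_event(&mut self, event_type: &str) -> Option<String> {
        self.events.get_mut(event_type)?.pop_front()
    }
}
```

Because lookups are keyed, a test can request the response for `r1` even when an event and the response for `r2` arrived first.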

1.6 Ambient step behavior

Decision: Filesystem writes that arrive outside any active command step are attributed to a synthetic "ambient" step. Ambient steps:

Have a system-generated step ID (negative IDs, e.g., -1, -2, to distinguish from command steps).
Capture preimages and participate in undo like normal steps.
Are auto-closed after a configurable inactivity timeout (default 5 seconds of no new writes).
Appear in undo.history with type: "ambient" and no associated command.
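
The ID assignment and inactivity timeout can be sketched as a small state machine driven by an injected time value. `AmbientTracker` is a hypothetical name, time is expressed as a `Duration` since session start, and treating a gap exactly equal to the timeout as "closed" is an assumption.

```rust
use std::time::Duration;

struct AmbientTracker {
    timeout: Duration,
    /// Next ambient ID to hand out: -1, -2, ...
    next_id: i64,
    /// Currently open ambient step: (step id, time of its last write).
    open: Option<(i64, Duration)>,
}

impl AmbientTracker {
    fn new(timeout: Duration) -> Self {
        Self { timeout, next_id: -1, open: None }
    }

    /// Attribute a write observed at `now` to an ambient step and return its ID.
    fn on_write(&mut self, now: Duration) -> i64 {
        let id = match self.open {
            // Still within the inactivity window: reuse the open step.
            Some((id, last)) if now.saturating_sub(last) < self.timeout => id,
            // No open step, or it timed out: start a new ambient step.
            _ => {
                let id = self.next_id;
                self.next_id -= 1;
                id
            }
        };
        self.open = Some((id, now));
        id
    }
}
```

Each write refreshes the window, so writes at 0s, 3s, and 7s share step -1 under the 5-second default, while a write at 13s opens step -2.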

2. Test Pyramid and CI Gating

2.1 Layers

Layer	Speed	Scope	Requires
L1: Unit	< 1s each	Pure logic: parsers, state machines, manifests, error taxonomy, step tracker, path normalization	Nothing beyond `cargo test`
L2: Component integration	< 5s each	Real host filesystem + `UndoInterceptor`, WAL, pruning, barrier logic, `TreeSnapshot` comparison	`tempfile` crate, host filesystem
L3: Protocol integration	< 5s each	9P server and control channel with in-process clients (no kernel mount, no QEMU)	Tokio test runtime
L4: System / E2E	10–60s each	Full agent binary, QEMU guest, STDIO/MCP clients, end-to-end undo/safeguard/barrier validation	QEMU, KVM (or TCG fallback), test guest image
L5: Security fuzzing	Continuous	`cargo-fuzz` targets for protocol parsers, path normalization, manifest parsing	`libFuzzer`, fuzz corpora
L6: Stress / performance	Minutes	fsx/fio workloads, large repo operations, sustained write pressure, watcher overflow	QEMU + KVM, `criterion` benchmarks

2.2 CI Gating

Per-PR (required, must pass before merge):

L1 + L2 + L3 (all unit, component, protocol tests)
L5 fuzz smoke: each fuzz target runs for 30 seconds with existing corpus
cargo clippy --all-targets, cargo fmt --check, cargo deny check, cargo audit
Total budget: < 10 minutes

Nightly (blocks release if failing):

L4 full QEMU E2E suite (Linux host with KVM; if KVM unavailable, run a reduced TCG subset)
L5 extended fuzz runs: 10 minutes per target, corpus regression
L6 performance baselines with 30% regression alert threshold
Total budget: < 45 minutes wall-clock (the six 10-minute fuzz targets must run in parallel to fit)

Pre-release gate:

All of the above plus manual review of fuzz coverage report
Phase 2/3: macOS and Windows E2E suites on dedicated runners

2.3 CI Infrastructure for QEMU E2E

QEMU E2E tests require KVM access. Options by CI provider:

Self-hosted runner (recommended for nightly): A Linux VM with nested virtualization enabled (/dev/kvm available). Most cloud providers support this (GCP N2, AWS metal/nested, etc.).
GitHub Actions: Use a Linux runner that exposes /dev/kvm. GitHub documents KVM support on its larger Linux runners, so confirm that the runner label you select (for example ubuntu-latest) actually provides it. Alternatively, use TCG (software emulation) for a slow but functional subset.
Fallback for PRs: Skip L4 tests on PR CI if KVM is unavailable and gate on them only in nightly runs. Mark L4 tests with #[ignore] and run them nightly with cargo test -- --ignored.

3. Test Harness Architecture

3.1 Crate and directory layout

sandbox-agent/
  crates/
    test-support/              # Shared test utilities (library crate)
      src/
        lib.rs                 # Re-exports
        workspace.rs           # TempWorkspace: fixture trees + undo dir
        snapshot.rs            # TreeSnapshot + assert_tree_eq
        jsonl_client.rs        # STDIO API test client
        mcp_client.rs          # MCP socket test client
        fake_shim.rs           # In-process fake VM shim
        fault.rs               # Fault injection registry
        clock.rs               # Deterministic clock for tests
        fixtures.rs            # Reusable fixture tree builders
      Cargo.toml               # dev-dependency only

  tests/
    integration/               # L2 component integration tests
      undo_interceptor.rs
      wal_crash_recovery.rs
      undo_barriers.rs
      undo_pruning.rs
      undo_resource_limits.rs
      ambient_steps.rs
      multi_directory.rs
    protocol/                  # L3 protocol integration tests
      control_channel.rs
      stdio_api.rs
      mcp_server.rs
      p9_wire.rs               # Phase 3
    e2e/                       # L4 system tests (require QEMU)
      session_lifecycle.rs
      undo_roundtrip.rs
      safeguard_flow.rs
      external_modification.rs
      mcp_integration.rs
      pjdfstest_subset.rs

  fuzz/                        # L5 fuzz targets
    Cargo.toml
    corpus/
      p9_wire/
      control_jsonl/
      stdio_json/
      mcp_jsonrpc/
      undo_manifest/
      path_normalize/
    fuzz_targets/
      p9_wire.rs
      control_jsonl.rs
      stdio_json.rs
      mcp_jsonrpc.rs
      undo_manifest.rs
      path_normalize.rs

  benches/                     # L6 microbenchmarks
    preimage_capture.rs
    zstd_compression.rs
    rollback_restore.rs
    manifest_io.rs

3.2 `TempWorkspace`

/// Creates an isolated working directory + undo directory for a single test.
pub struct TempWorkspace {
    pub working_dir: PathBuf,   // The "shared folder" equivalent
    pub undo_dir: PathBuf,      // Adjacent, outside share root
    _temp: TempDir,             // Deletes the directory tree when the workspace is dropped
}

impl TempWorkspace {
    /// Create empty workspace.
    pub fn new() -> Self { ... }

    /// Create workspace from a fixture builder.
    pub fn with_fixture(f: impl FnOnce(&Path)) -> Self { ... }

    /// Snapshot the current state of the working directory.
    pub fn snapshot(&self) -> TreeSnapshot { ... }
}
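
For readers who want to see the layout concretely, here is a standard-library-only sketch of the same idea. It is not the planned implementation (which uses the `tempfile` crate's `TempDir`); `ScratchWorkspace` and its directory names are hypothetical.

```rust
use std::fs;
use std::path::PathBuf;
use std::sync::atomic::{AtomicU64, Ordering};

/// Distinguishes workspaces created by the same test process.
static NEXT_ID: AtomicU64 = AtomicU64::new(0);

struct ScratchWorkspace {
    root: PathBuf,
    working_dir: PathBuf, // plays the role of the share root
    undo_dir: PathBuf,    // sibling of working_dir, never inside it
}

impl ScratchWorkspace {
    fn new() -> std::io::Result<Self> {
        let unique = format!(
            "ws-sketch-{}-{}",
            std::process::id(),
            NEXT_ID.fetch_add(1, Ordering::Relaxed)
        );
        let root = std::env::temp_dir().join(unique);
        let working_dir = root.join("work");
        let undo_dir = root.join("undo");
        fs::create_dir_all(&working_dir)?;
        fs::create_dir_all(&undo_dir)?;
        Ok(Self { root, working_dir, undo_dir })
    }
}

impl Drop for ScratchWorkspace {
    fn drop(&mut self) {
        // Best-effort cleanup; errors are ignored because Drop cannot fail.
        let _ = fs::remove_dir_all(&self.root);
    }
}
```

Keeping `undo_dir` as a sibling rather than a child of `working_dir` mirrors the "adjacent, outside share root" requirement, so snapshots of the working directory never include undo data.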

3.3 `TreeSnapshot` and `assert_tree_eq`

pub struct TreeSnapshot {
    pub entries: BTreeMap<PathBuf, EntrySnapshot>,
}

pub struct EntrySnapshot {
    pub file_type: FileType,       // Reg, Dir, Symlink
    pub content_hash: Option<[u8; 32]>,  // blake3 for regular files
    pub size: u64,
    pub mode: u32,
    pub mtime_ns: i128,
    pub symlink_target: Option<String>,
    pub xattrs: BTreeMap<String, Vec<u8>>,
}

pub struct SnapshotCompareOptions {
    pub mtime_tolerance_ns: i128,  // Default: 1_000_000 (1ms)
    pub check_xattrs: bool,        // Default: true on Linux, false on Windows
    pub exclude_patterns: Vec<String>,
}

/// Panics with a human-readable diff on mismatch.
pub fn assert_tree_eq(
    before: &TreeSnapshot,
    after: &TreeSnapshot,
    opts: &SnapshotCompareOptions,
) { ... }

3.4 `JsonlClient` and `McpClient`

/// Spawns the agent as a child process, speaks STDIO API.
pub struct JsonlClient {
    child: Child,
    // Reads stdout in a background task, demuxes events and responses
    // into separate channels keyed by request_id / event type.
}

impl JsonlClient {
    pub async fn send(&mut self, msg: Value) -> Result<()>;
    pub async fn recv_response(&mut self, request_id: &str, timeout: Duration) -> Result<Value>;
    pub async fn recv_event(&mut self, event_type: &str, timeout: Duration) -> Result<Value>;
    pub fn stderr_lines(&self) -> Vec<String>;  // For log validation
}

/// Connects to MCP socket, speaks JSON-RPC.
pub struct McpClient { ... }

impl McpClient {
    pub async fn connect(socket_path: &Path) -> Result<Self>;
    pub async fn call(&mut self, method: &str, params: Value) -> Result<Value>;
}

3.5 Deterministic `Clock` and step IDs

The UndoInterceptor and step finalization logic accept a Clock trait:

pub trait Clock: Send + Sync {
    fn now(&self) -> SystemTime;
}

pub struct RealClock;
pub struct FakeClock { inner: Mutex<SystemTime> }

impl FakeClock {
    pub fn advance(&self, duration: Duration);
    pub fn set(&self, time: SystemTime);
}

Step IDs are host-generated and deterministic in tests (sequential integers starting from a test-provided seed).
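
A minimal FakeClock body consistent with the signatures above could look like this. The `new` constructor is an assumption added for the sketch; the trait is repeated so the block is self-contained.

```rust
use std::sync::Mutex;
use std::time::{Duration, SystemTime};

trait Clock: Send + Sync {
    fn now(&self) -> SystemTime;
}

struct FakeClock {
    inner: Mutex<SystemTime>,
}

impl FakeClock {
    /// Hypothetical constructor: start the clock at a fixed instant.
    fn new(start: SystemTime) -> Self {
        Self { inner: Mutex::new(start) }
    }

    /// Move time forward without sleeping.
    fn advance(&self, duration: Duration) {
        let mut t = self.inner.lock().unwrap();
        *t += duration;
    }

    /// Jump to an absolute instant.
    fn set(&self, time: SystemTime) {
        *self.inner.lock().unwrap() = time;
    }
}

impl Clock for FakeClock {
    fn now(&self) -> SystemTime {
        *self.inner.lock().unwrap()
    }
}
```

Timeout-driven behavior such as ambient-step auto-close can then be tested by calling `advance` instead of waiting in real time.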

3.6 Fault injection

Compile-time gated (cfg(feature = "fault_injection")), never in release builds.

pub struct FaultInjector {
    faults: Mutex<VecDeque<Fault>>,
}

pub enum Fault {
    FailPreimageWrite { errno: i32 },     // ENOSPC, EIO, etc.
    FailStepPromotion,                     // Rename WAL→steps fails
    TruncateManifest { after_bytes: u64 }, // Partial manifest write
    ForceWatcherOverflow,                  // Simulate inotify overflow
    ForceLateWrite { delay_ms: u64 },      // Write arrives after step_completed
}

impl FaultInjector {
    pub fn enqueue(&self, fault: Fault);
    /// Returns Some(fault) and removes it, or None.
    pub fn check(&self, point: &str) -> Option<Fault>;
}

Injection points are placed at the start of each fallible operation in UndoInterceptor:

// Only FailPreimageWrite carries an errno, so match that variant explicitly.
if let Some(Fault::FailPreimageWrite { errno }) = self.fault_injector.check("preimage_write") {
    return Err(io::Error::from_raw_os_error(errno));
}
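
The struct above queues bare faults, yet `check` takes an injection point name. One way to reconcile the two is to queue each fault together with its point. The sketch below does that; it is an assumption about the intended design, with a reduced `Fault` enum and a hypothetical `write_preimage` call site.

```rust
use std::collections::VecDeque;
use std::io;
use std::sync::Mutex;

#[derive(Debug, Clone, PartialEq)]
enum Fault {
    FailPreimageWrite { errno: i32 },
    FailStepPromotion,
}

#[derive(Default)]
struct FaultInjector {
    /// (injection point, fault) pairs in the order they were enqueued.
    faults: Mutex<VecDeque<(String, Fault)>>,
}

impl FaultInjector {
    fn enqueue(&self, point: &str, fault: Fault) {
        self.faults.lock().unwrap().push_back((point.to_string(), fault));
    }

    /// Remove and return the oldest fault queued for `point`, if any.
    fn check(&self, point: &str) -> Option<Fault> {
        let mut queue = self.faults.lock().unwrap();
        let idx = queue.iter().position(|(p, _)| p.as_str() == point)?;
        queue.remove(idx).map(|(_, fault)| fault)
    }
}

/// Hypothetical call site: fails once per queued FailPreimageWrite.
/// Any other fault queued for this point is consumed and ignored.
fn write_preimage(injector: &FaultInjector) -> io::Result<()> {
    if let Some(Fault::FailPreimageWrite { errno }) = injector.check("preimage_write") {
        return Err(io::Error::from_raw_os_error(errno));
    }
    Ok(())
}
```

Each fault fires exactly once, so a test can assert both the failing call and the recovery on the next call.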

4. Core Invariants (Driving All Tests)

These are the properties that, if violated, constitute a bug. Every test traces back to one or more of these.

INV-1: Undo correctness

For any completed, protected step S that is the most recent step in working directory D: undo.rollback(1) restores D so that TreeSnapshot(D) equals the snapshot captured before step S began (within the metadata semantics of §1.3).

INV-2: First-touch fidelity

For any path P mutated multiple times within step S, the stored preimage corresponds to the state of P before the first mutation within S. Subsequent mutations within S do not overwrite the preimage.
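
First-touch capture amounts to "insert if absent" on a per-step map. The sketch below keeps preimages in memory for illustration; `StepPreimages` is a hypothetical name, and the real store writes preimages to the undo directory.

```rust
use std::collections::HashMap;

#[derive(Default)]
struct StepPreimages {
    /// Path -> contents before the step. `None` records that the path
    /// did not exist before the step (existed_before = false).
    preimages: HashMap<String, Option<Vec<u8>>>,
}

impl StepPreimages {
    /// Record the pre-mutation state of `path`. Returns true only on the
    /// first touch; later touches within the step leave the preimage intact.
    fn capture(&mut self, path: &str, current: Option<&[u8]>) -> bool {
        if self.preimages.contains_key(path) {
            return false;
        }
        self.preimages
            .insert(path.to_string(), current.map(|bytes| bytes.to_vec()));
        true
    }
}
```

This is the property UI-01 exercises: three writes to one file in a step yield exactly one stored preimage, holding the original bytes.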

INV-3: Crash atomicity

If the agent crashes at any point during a step, restart always produces a working directory state equal to the pre-step snapshot. The WAL in_progress directory is removed. An event.recovery is emitted.

INV-4: Barrier integrity

Rollback cannot cross an undo barrier unless force: true. Barriers are visible in undo.history. Internal sandbox writes never create barriers.

INV-5: Safeguard pre-operation trigger

The safeguard triggers before the operation that would cross the threshold executes on the host filesystem. On deny, zero host mutations from the paused-and-denied portion persist.

INV-6: Path containment

No guest-originated filesystem operation, undo preimage capture, or rollback restore can read or write any host path outside the share root directory.

INV-7: Parser robustness

No input to any protocol parser (9P wire, control channel JSONL, STDIO JSON, MCP JSON-RPC, undo manifest) causes a panic, unbounded allocation, or undefined behavior. Malformed input produces a structured error.

INV-8: Transport isolation

STDIO API stdout never contains log output. Stderr never contains protocol messages. MCP socket traffic never appears on stdout/stderr. No cross-contamination.

5. Component Test Specifications

5.1 WriteInterceptor / UndoInterceptor

Priority: Highest. This is the core correctness path — implement and test first.

Test scaffolding required:

TempWorkspace + TreeSnapshot (§3.2, §3.3)
StepTracker test double: allows manual open_step(id) / close_step(id) calls
OperationApplier helper: calls the interceptor hook, then performs the real std::fs operation, mirroring backend behavior

Test matrix

ID	Category	Scenario	Assert
UI-01	First-touch	Write same file 3× in one step	One preimage stored; rollback restores original
UI-02	Create	Create new file + write content	Rollback deletes file; parent dir mtime restored
UI-03	Create-dir	Create nested directory structure	Rollback removes all created dirs (deepest first)
UI-04	Delete	Delete file	Rollback restores bytes + mode + mtime + xattrs
UI-05	Delete-tree	`rm -rf` simulation (deep nested tree)	Rollback restores full tree with correct structure
UI-06	Rename-new	Rename A→B where B doesn't exist	Rollback restores A, removes B
UI-07	Rename-over	Rename A→B where B exists	Rollback restores both A and B to pre-step state
UI-08	Rename-dir	Rename directory with nested files	Rollback restores all paths under old name
UI-09	Truncate-open	Open existing file with O_TRUNC	`pre_open_trunc` captures preimage before truncation
UI-10	Truncate-setattr	`setattr` truncate to shorter length	Preimage contains original full contents
UI-11	Chmod	Flip executable bit	Rollback restores original mode
UI-12	Xattr-set	Set user xattr on file	Rollback removes xattr (or restores previous value)
UI-13	Xattr-remove	Remove existing xattr	Rollback restores xattr
UI-14	Fallocate	Extend file via fallocate	Rollback restores original size
UI-15	Copy-file-range	Copy range into existing file	Destination preimage captured; rollback restores
UI-16	Multi-step	Steps 1 and 2 modify same file differently	Rollback(1) restores to post-step-1 state, not original
UI-17	Unprotected	Step exceeds `max_single_step_size`	Step marked unprotected; rollback returns error
UI-18	FIFO-eviction	Exceed `max_step_count`	Oldest step evicted; `event.warning` emitted
UI-19	Log-size-eviction	Exceed `max_log_size_bytes`	Oldest steps evicted until within budget
UI-20	Ambient-step	Write arrives outside any command step	Attributed to ambient step; undo works
UI-21	Ambient-timeout	Ambient step auto-closes after inactivity	New write after timeout opens new ambient step
UI-22	Multi-dir	Rollback in dir A	Dir B unmodified
UI-23	Hardlink	Hardlink to file within share root	No panic; behavior documented (path-based capture)
UI-24	Symlink-internal	Symlink within share root	Symlink target string captured; rollback restores

Model-based test (proptest / quickcheck)

Generate random sequences of operations (CreateFile, Write, Truncate, Chmod, Rename, Delete, Mkdir, Rmdir, SetXattr, RemoveXattr) grouped into steps. After each step, optionally roll back and compare to the stored snapshot. This catches ordering bugs, rename collisions, and multi-touch edge cases that enumerated tests miss.

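// #[proptest] is the attribute macro from the test-strategy crate;
// with plain proptest, wrap the function in proptest! { ... } instead.
#[proptest]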
fn undo_model(ops: Vec<StepOps>) {
    let ws = TempWorkspace::with_fixture(random_small_tree);
    let interceptor = UndoInterceptor::new(ws.undo_dir.clone(), ...);
    for step in &ops {
        let snapshot_before = ws.snapshot();
        interceptor.open_step(step.id);
        for op in &step.ops { op.apply(&ws, &interceptor); }
        interceptor.close_step(step.id);
        if step.should_rollback {
            interceptor.rollback(1);
            assert_tree_eq(&snapshot_before, &ws.snapshot(), &default_opts());
        }
    }
}

5.2 WAL and Crash Recovery

Test scaffolding: Fault injection (§3.6), TempWorkspace.

ID	Scenario	Fault injected	Assert
CR-01	Crash mid-step (after some preimages written)	Kill process (or return from test without closing step)	Restart rolls back; working dir equals pre-step snapshot
CR-02	I/O failure during preimage write	`FailPreimageWrite { errno: EIO }`	Operation fails; step becomes unprotected OR write is rejected (test whichever policy is chosen)
CR-03	Crash during step promotion	`FailStepPromotion`	WAL `in_progress` remains; restart rolls back
CR-04	Truncated manifest	`TruncateManifest { after_bytes: 50 }`	Restart detects corruption, rolls back, emits `event.recovery`
CR-05	Clean shutdown (no crash)	None	WAL empty; steps directory contains committed steps
CR-06	Double recovery (restart twice without new writes)	None	Second restart is a no-op; no duplicate events
CR-07	Crash with empty step (step opened but no writes)	Kill before any preimage	Restart discards empty WAL entry; no-op rollback

5.3 External Modifications and Undo Barriers

ID	Scenario	Assert
EB-01	External write during active session	`event.external_modification` emitted with affected paths
EB-02	Rollback across barrier (no force)	Rollback rejected with error listing barrier details
EB-03	Rollback across barrier (force=true)	Rollback proceeds; warning included in response
EB-04	Barrier visible in `undo.history`	History entry has `type: "barrier"` with timestamp and paths
EB-05	Internal sandbox write does NOT trigger barrier	Correlation logic filters backend-originated watcher events
EB-06	Multiple barriers between steps	Each barrier listed; rollback blocked at nearest
EB-07	Watcher overflow	Agent emits warning; degrades to conservative barrier behavior
EB-08	`policy=warn`	External write emits warning but no barrier; rollback proceeds
EB-09	`policy=lock` (if implemented)	External write attempt fails (CI-optional, requires permission control)

5.4 Safeguards

ID	Scenario	Assert
SG-01	Delete count reaches threshold	`event.safeguard_triggered` emitted; no further host mutations while paused
SG-02	Confirm allow	Command completes; step commits; undo works
SG-03	Confirm deny	Entire step rolled back; tree matches pre-step snapshot
SG-04	Timeout (no confirm sent)	Auto-deny; step rolled back
SG-05	Overwrite-large-file threshold	Triggered when existing file > configured size is overwritten
SG-06	Rename-over-existing threshold	Triggered when rename destination exists
SG-07	Queue overflow (request-holding mode)	Queue cap reached; further ops get `ENOSPC`; no OOM
SG-08	QMP pause mode (if available)	QMP `stop` issued on trigger; `cont` on allow/deny; VM verifiably paused
SG-09	Pre-operation trigger ordering	Safeguard fires before the Nth deletion executes (not after)
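
The pre-operation ordering in SG-09 can be made concrete with a counter that is consulted before each delete and updated only after the delete executes. `DeleteSafeguard` and `Decision` are hypothetical names, and reading the threshold as "pause before the Nth delete" is this sketch's interpretation.

```rust
#[derive(Debug, PartialEq, Eq)]
enum Decision {
    Proceed,
    Pause,
}

struct DeleteSafeguard {
    /// N: the delete that would be the Nth must not run without confirmation.
    threshold: u32,
    /// Deletes that have actually executed on the host in this step.
    executed: u32,
}

impl DeleteSafeguard {
    /// Call before performing a delete on the host filesystem.
    fn before_delete(&self) -> Decision {
        if self.executed + 1 >= self.threshold {
            Decision::Pause
        } else {
            Decision::Proceed
        }
    }

    /// Call only after a delete has actually been performed.
    fn record_executed(&mut self) {
        self.executed += 1;
    }
}
```

With a threshold of 3, two deletes run and the third is held, so a deny leaves exactly two mutations to roll back and none from the paused operation.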

5.5 Control Channel and Step Tracking

Unit tests (JSONL parsing):

ID	Scenario	Assert
CC-01	Valid `step_started` / `step_completed` sequence	Step opens and closes; filesystem writes attributed correctly
CC-02	Malformed JSON	Structured error logged; channel not broken
CC-03	Unknown message type	Ignored or logged; channel not broken
CC-04	Oversized message (>1MB)	Rejected before full allocation
CC-05	`step_completed` without `step_started`	Error logged; no crash
CC-06	Duplicate `step_started` for same ID	Error logged; existing step unaffected
CC-07	Cancellation mid-step	Step finalized appropriately

Integration tests (fake shim, no QEMU):

ID	Scenario	Assert
CC-08	Normal exec cycle	Host sends `exec`; fake shim returns started/output/completed; events forwarded
CC-09	Quiescence window: no late writes	Step closes as soon as the quiescence window elapses after `step_completed`
CC-10	Quiescence window: late write arrives	Step closure waits for in-flight ops to drain; late write included in step
CC-11	Quiescence timeout: prevent hang	If in-flight ops never drain, step closes after max quiescence timeout (e.g., 2s)
CC-12	Ambient writes after step close	Writes after quiescence window go to ambient step, not the closed step
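
The close decision behind CC-09 through CC-11 can be expressed as a pure function of elapsed time and in-flight operation count. `QuiescencePolicy` and `CloseDecision` are hypothetical names; the plan gives 2s only as an example maximum and does not fix the window length.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq, Eq)]
enum CloseDecision {
    /// Keep the step open and check again later.
    Wait,
    /// Window elapsed and nothing is in flight: close normally.
    Close,
    /// Maximum wait reached with operations still in flight: close anyway.
    ForceClose,
}

struct QuiescencePolicy {
    window: Duration,
    max_wait: Duration,
}

impl QuiescencePolicy {
    /// `since_completed` is the time elapsed since `step_completed` arrived.
    fn decide(&self, since_completed: Duration, in_flight: usize) -> CloseDecision {
        if in_flight > 0 && since_completed >= self.max_wait {
            CloseDecision::ForceClose
        } else if in_flight == 0 && since_completed >= self.window {
            CloseDecision::Close
        } else {
            CloseDecision::Wait
        }
    }
}
```

Writes that arrive after `Close` or `ForceClose` are no longer part of the step, which is where the ambient-step attribution of CC-12 takes over.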

5.6 STDIO API

Schema tests (unit):

ID	Scenario	Assert
SA-01	Each request type parses correctly	Valid request → accepted
SA-02	Unknown request type	Structured error: `{code: "unknown_operation", message: "..."}`
SA-03	Missing required field	Structured error with field name
SA-04	Version negotiation (once defined)	Mismatched version → graceful rejection

Stream behavior tests (integration with JsonlClient):

ID	Scenario	Assert
SA-05	Response correlates to `request_id`	Response `request_id` matches request
SA-06	Events interleave with responses	Client correctly demuxes both
SA-07	Stderr is valid JSONL logs	Every stderr line parses as JSON with `timestamp`, `level`, `component`
SA-08	Stdout contains no log lines	No line on stdout has `level` or `component` fields
SA-09	Backpressure: client stops reading	Agent does not deadlock (bounded buffers or timeout)

Security tests:

ID	Scenario	Assert
SA-10	`fs.read` with `../../etc/passwd` path	Rejected: the path is resolved against the working dir root and fails containment
SA-11	`fs.list` with absolute path outside root	Rejected
SA-12	Oversized `write_file` payload	Size limit enforced; structured error
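
The lexical part of the containment check in SA-10 and SA-11 might be sketched as follows. `resolve_within_root` and `PathError` are hypothetical names. The function only normalizes path components; it does not follow symlinks, which the containment tests in §7.2 cover separately.

```rust
use std::ffi::OsStr;
use std::path::{Component, Path, PathBuf};

#[derive(Debug, PartialEq, Eq)]
enum PathError {
    /// The request named an absolute path (or a Windows prefix).
    Absolute,
    /// A `..` component would climb above the root.
    EscapesRoot,
}

/// Resolve `requested` against `root` without touching the filesystem.
fn resolve_within_root(root: &Path, requested: &str) -> Result<PathBuf, PathError> {
    let mut stack: Vec<&OsStr> = Vec::new();
    for component in Path::new(requested).components() {
        match component {
            Component::Prefix(_) | Component::RootDir => return Err(PathError::Absolute),
            Component::CurDir => {}
            Component::ParentDir => {
                // Popping past the start means the path leaves the root.
                if stack.pop().is_none() {
                    return Err(PathError::EscapesRoot);
                }
            }
            Component::Normal(name) => stack.push(name),
        }
    }
    let mut resolved = root.to_path_buf();
    resolved.extend(stack);
    Ok(resolved)
}
```

The same function is a natural body for the `path_normalize` fuzz target, whose assertion is exactly that the result is within the root or an error.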

5.7 MCP Server

ID	Scenario	Assert
MC-01	JSON-RPC compliance (id, errors, unknown method)	Correct JSON-RPC responses
MC-02	`execute_command` returns exit_code/stdout/stderr	Values match what fake shim sent
MC-03	`write_file` creates synthetic API step	Step appears in `undo.history` with `type: "api"`
MC-04	`write_file` → rollback	Newly created file removed, or existing file's preimage restored
MC-05	`write_file` error (path outside root)	JSON-RPC error; no step created
MC-06	MCP triggers safeguard → STDIO event emitted	Cross-interface consistency
MC-07	Connection without auth token (if implemented)	Rejected
MC-08	Concurrent MCP + STDIO operations	Shared undo/safeguard state consistent; no races

5.8 Undo Log Storage and Versioning

ID	Scenario	Assert
UL-01	Manifest correctness	Affected paths, `existed_before`, file type, metadata encoding all round-trip
UL-02	Preimage atomicity	Preimage writes use temp file + atomic rename
UL-03	Step promotion atomicity	`wal/in_progress/` renamed to `steps/{id}/` atomically
UL-04	Version mismatch on startup	`event.undo_version_mismatch` emitted; undo disabled
UL-05	`undo.discard` after mismatch	Old log wiped; new version file written; undo re-enabled
UL-06	Corrupt manifest (truncated)	Graceful error; agent doesn't crash
UL-07	Missing preimage file	Rollback returns error for that step; other steps unaffected
UL-08	Corrupt preimage (flipped bytes)	Detected (if checksums used) or rollback produces incorrect state (documented)
UL-09	Reflink detection + fallback	If `FICLONE` succeeds, preimage is a reflink; if it fails, falls back to copy+zstd
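
The temp-file-plus-rename pattern named in UL-02 can be sketched with the standard library alone. `write_atomically` and the `.tmp` suffix are hypothetical, and the `sync_all` call is an added durability step that the plan does not mention.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Write `bytes` to `dest` so that readers see either the old file
/// (or no file) or the complete new contents, never a partial write.
fn write_atomically(dest: &Path, bytes: &[u8]) -> io::Result<()> {
    let file_name = dest.file_name().ok_or_else(|| {
        io::Error::new(io::ErrorKind::InvalidInput, "destination has no file name")
    })?;
    // The temp file lives in the same directory so the rename stays on
    // one filesystem, which is what makes it atomic.
    let mut tmp_name = file_name.to_os_string();
    tmp_name.push(".tmp");
    let tmp = dest.with_file_name(tmp_name);
    {
        let mut file = File::create(&tmp)?;
        file.write_all(bytes)?;
        file.sync_all()?; // flush data to disk before publishing
    }
    fs::rename(&tmp, dest)
}
```

A test for UL-02 can inject a failure between the write and the rename and assert that `dest` is either absent or unchanged.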

5.9 Filesystem Backends

5.9.1 virtiofsd fork (Linux/macOS)

Unit/integration (no VM):

ID	Scenario	Assert
VF-01	`InodePathMap`: insert/update/remove/rename	Lookup returns correct path
VF-02	`InodePathMap`: negative lookup	Returns defined error (not panic)
VF-03	`InodePathMap`: path always within root	No path returned outside share root

E2E (with QEMU):

ID	Scenario	Assert
VF-04	Guest performs each primitive op	Undo step lists correct paths; rollback restores snapshot
VF-05	`pjdfstest` curated subset	POSIX semantics match for create/unlink/rename/chmod/symlink

Reuse: Run upstream virtiofsd unit tests in fork CI. Keep them passing. Add wrapper-specific tests on top.

5.9.2 9P server (Phase 3)

ID	Scenario	Assert
P9-01	Wire round-trip for each message type	Serialize → deserialize = identity
P9-02	Known-byte fixtures	Match crosvm test vectors (check license compatibility before reusing them)
P9-03	Invalid sizes/offsets/flags	Correct `Rlerror` errno
P9-04	Out-of-order responses by tag	Pipelined requests handled correctly
P9-05	Oversized message	Rejected before full allocation
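
P9-05 depends on validating the size field before any buffer is allocated. A 9P message starts with a 4-byte little-endian size (which includes the size field itself), a 1-byte type, and a 2-byte tag. The sketch below checks that header only; `parse_header`, `Header`, and `WireError` are hypothetical names.

```rust
#[derive(Debug, PartialEq, Eq)]
enum WireError {
    /// Fewer bytes available than a full header.
    Truncated,
    /// Declared size is smaller than the header itself.
    TooSmall(u32),
    /// Declared size exceeds the configured maximum.
    TooLarge(u32),
}

#[derive(Debug, PartialEq, Eq)]
struct Header {
    size: u32,
    msg_type: u8,
    tag: u16,
}

/// size[4] + type[1] + tag[2]
const HEADER_LEN: usize = 7;

/// Validate the header without allocating anything based on `size`.
fn parse_header(buf: &[u8], max_size: u32) -> Result<Header, WireError> {
    if buf.len() < HEADER_LEN {
        return Err(WireError::Truncated);
    }
    let size = u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]);
    if (size as usize) < HEADER_LEN {
        return Err(WireError::TooSmall(size));
    }
    if size > max_size {
        return Err(WireError::TooLarge(size));
    }
    Ok(Header {
        size,
        msg_type: buf[4],
        tag: u16::from_le_bytes([buf[5], buf[6]]),
    })
}
```

Only after this check succeeds should the server reserve `size` bytes for the body, which is also what DOS-04 asserts for a message claiming 2GB.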

5.9.3 Windows normalization (Phase 3)

ID	Scenario	Assert
WN-01	Case-collision detection	Create `Foo` then `foo` → error
WN-02	Reserved names	Create `CON`, `NUL`, etc. → rejected
WN-03	POSIX metadata overlay: chmod persistence	`chmod 755` → `getattr` returns 755 across sessions
WN-04	Overlay: new file defaults	File without overlay entry gets heuristic mode
WN-05	Overlay: rollback restores mode	Mode changed by step → rollback restores previous overlay entry
WN-06	Reparse point escape	Create junction inside root → outside; write through it → rejected
WN-07	Reparse point: read through junction	Read via junction pointing outside root → rejected

5.10 Session Lifecycle

ID	Scenario	Assert
SL-01	`session.start` with invalid working dir	Structured error
SL-02	`session.start` with multiple dirs	Each dir gets mount tag; backend instances created
SL-03	`session.stop` (persistent)	VM shuts down; disk image preserved
SL-04	`session.stop` (ephemeral)	VM destroyed; disk image deleted
SL-05	`session.reset`	Persistent VM wiped and recreated
SL-06	QEMU launch failure	Structured error event; agent doesn't hang
SL-07	Control channel disconnect	Agent transitions to error state; emits event
SL-08	Resource cleanup on stop	Sockets removed; child processes terminated

5.11 Observability

ID	Scenario	Assert
OB-01	Stderr logs parse as JSONL	Every line has `timestamp`, `level`, `component`
OB-02	`request_id` correlation	Log entries for a request include matching `request_id`
OB-03	`step_id` correlation	Log entries during a step include matching `step_id`
OB-04	No protocol frames in logs	Logs never contain raw 9P bytes or control channel messages
OB-05	Log level filtering	`--log-level=warn` suppresses info/debug/trace

6. End-to-End Test Design (QEMU, MVP Linux)

6.1 Test guest image

Build a minimal guest image containing:

Busybox or Alpine base (< 50MB)
VM-side shim (baked in)
Core utilities: sh, dd, truncate, chmod, ln, mv, rm, mkdir, touch, stat
Optional: setfattr/getfattr (for xattr E2E tests)
No node, cargo, etc. — those are nightly workload tests

Build recipe: Alpine-based initramfs created via a Dockerfile or Buildroot config. The xtask command cargo xtask build-guest produces vmlinuz + initrd.img for both x86_64 and aarch64.

6.2 E2E test pattern

Every E2E test follows this sequence:

#[tokio::test]
#[ignore] // Only run in nightly CI with KVM
async fn test_undo_single_file_write() {
    let ws = TempWorkspace::with_fixture(small_tree);
    let initial_snapshot = ws.snapshot();

    let mut client = JsonlClient::spawn_agent(&[
        "--working-dir", ws.working_dir.to_str().unwrap(),
        "--undo-dir", ws.undo_dir.to_str().unwrap(),
        "--vm-mode", "ephemeral",
        "--backend", "virtiofs", // or "9p" for Phase 3 tests
    ]).await;

    client.send(session_start()).await.unwrap();
    client.recv_response("start", Duration::from_secs(30)).await.unwrap(); // VM boot

    // Execute mutation
    client.send(agent_execute("echo 'hello' > /mnt/working/test.txt")).await.unwrap();
    client.recv_event("event.step_completed", Duration::from_secs(10)).await.unwrap();

    // Verify mutation happened (assumes TreeSnapshot derives PartialEq + Debug)
    let post_snapshot = ws.snapshot();
    assert_ne!(&initial_snapshot, &post_snapshot);

    // Rollback
    client.send(undo_rollback(1)).await.unwrap();
    client.recv_response("rollback", Duration::from_secs(5)).await.unwrap();

    // Verify restoration
    assert_tree_eq(&initial_snapshot, &ws.snapshot(), &default_opts());

    client.send(session_stop()).await.unwrap();
}

6.3 Fixture trees

use std::fs::{self, Permissions};
use std::os::unix::fs::PermissionsExt; // provides Permissions::from_mode (Unix only)
use std::path::Path;

pub fn small_tree(root: &Path) {
    // Files of various sizes; unwrap so a broken fixture fails the test loudly
    fs::write(root.join("empty.txt"), "").unwrap();
    fs::write(root.join("small.txt"), "hello world").unwrap();
    fs::write(root.join("medium.txt"), "x".repeat(4096)).unwrap();
    fs::write(root.join("large.bin"), &vec![0xABu8; 1_000_000]).unwrap();
    // Nested directories
    fs::create_dir_all(root.join("src/components")).unwrap();
    fs::write(root.join("src/main.rs"), "fn main() {}").unwrap();
    fs::write(root.join("src/components/app.rs"), "pub struct App;").unwrap();
    // Executable file
    let exec_path = root.join("run.sh");
    fs::write(&exec_path, "#!/bin/sh\necho ok").unwrap();
    fs::set_permissions(&exec_path, Permissions::from_mode(0o755)).unwrap();
}

pub fn rename_tree(root: &Path) { /* a.txt + b.txt with distinct contents */ }
pub fn xattr_tree(root: &Path) { /* file with user.test xattr */ }
pub fn symlink_tree(root: &Path) { /* symlinks to file and dir */ }
pub fn deep_tree(root: &Path) { /* 5 levels deep, 100+ files for safeguard tests */ }

7. Security Testing

7.1 Fuzz targets

Each target is a fuzz_target! in fuzz/fuzz_targets/. Corpora are committed and grow over time.

Target	Input	Assertions
`p9_wire`	Raw bytes	No panic; no alloc > 16MB; error on invalid input
`control_jsonl`	UTF-8 string (one line)	No panic; no alloc > 1MB; valid parse or structured error
`stdio_json`	UTF-8 string (one line)	No panic; no alloc > 1MB; valid parse or structured error
`mcp_jsonrpc`	UTF-8 string	No panic; no alloc > 1MB
`undo_manifest`	Raw bytes (simulated manifest file)	No panic; valid parse or error
`path_normalize`	`Vec<Vec<u8>>` (path components)	No panic; result is within root or error; no `..` traversal

7.2 Containment tests

ID	Scenario	Phase	Assert
SC-01	Symlink in working dir → `/etc/passwd` (Linux virtiofsd+chroot)	MVP	Preimage capture does not access `/etc/passwd`
SC-02	Rollback with symlink to outside root	MVP	Restore does not write outside root
SC-03	Symlink chain (A→B→C→outside)	MVP	Entire chain resolved; access denied
SC-04	macOS: symlink escape without chroot	Phase 2	`openat`-relative containment rejects
SC-05	macOS: TOCTOU during `F_GETPATH` re-open	Phase 2	No re-open by path; use fd directly
SC-06	Windows: junction to `C:\Windows\System32`	Phase 3	9P server rejects; no host access
SC-07	Windows: reparse point during rename	Phase 3	Rename target validated within root

7.3 DoS / resource exhaustion

ID	Scenario	Assert
DOS-01	Create many unique files in one step until `max_single_step_size` hit	Step becomes unprotected; agent responsive
DOS-02	Safeguard pause: flood with filesystem ops	Queue capped; `ENOSPC` beyond cap; no OOM
DOS-03	Many concurrent 9P requests (pipelined)	Agent handles within bounded memory
DOS-04	Giant 9P message (size field claims 2GB)	Rejected at wire parse; no allocation
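
DOS-04 relies on validating the 9P size field before any buffer is sized from it. A minimal sketch of such a check follows. The 9P header layout (`size[4] type[1] tag[2]`, little-endian, with `size` counting itself) is standard, but `checked_body_len`, `WireError`, and the 16 MiB cap are assumptions for illustration. The real limit may instead be the negotiated `msize`.

```rust
/// Assumed upper bound on one message; chosen to match the 16MB allocation
/// limit asserted for the `p9_wire` fuzz target.
pub const MAX_MESSAGE_SIZE: u32 = 16 * 1024 * 1024;
/// 9P header: size[4] type[1] tag[2].
pub const HEADER_LEN: u32 = 7;

#[derive(Debug, PartialEq)]
pub enum WireError {
    Truncated,
    TooSmall(u32),
    TooLarge(u32),
}

/// Validates the size field and returns the body length still to be read.
/// Nothing is allocated here, so a hostile size cannot trigger a large
/// allocation before it is rejected.
pub fn checked_body_len(header: &[u8]) -> Result<usize, WireError> {
    if header.len() < 4 {
        return Err(WireError::Truncated);
    }
    let size = u32::from_le_bytes([header[0], header[1], header[2], header[3]]);
    if size < HEADER_LEN {
        return Err(WireError::TooSmall(size));
    }
    if size > MAX_MESSAGE_SIZE {
        return Err(WireError::TooLarge(size));
    }
    Ok((size - HEADER_LEN) as usize)
}
```

A DOS-04 test would feed a header whose size field claims 2GB and assert that the result is `TooLarge` before any read buffer is created.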

7.4 CI hardening checks

Run as part of per-PR CI, except the sanitizer jobs, which run nightly:

cargo audit — known vulnerability check
cargo deny check — license and advisory policy
cargo clippy --all-targets — lint
Sanitizer jobs (nightly CI): ASan + UBSan on fuzz targets, TSan on concurrency-heavy unit tests if feasible

8. Performance Testing

8.1 Microbenchmarks (`criterion`, per-PR acceptable)

Benchmark	Sizes	Regression alert threshold
Preimage capture throughput	4KB, 1MB, 100MB	30%
zstd compression (level 3)	4KB, 1MB, 100MB	30%
Rollback restore throughput	4KB, 1MB, 100MB	30%
Manifest write + promotion	10 paths, 100 paths, 1000 paths	30%
`TreeSnapshot` capture	100 files, 1000 files, 10000 files	30%

8.2 Macrobenchmarks (nightly / manual, in QEMU)

Workload	Metrics
`git status` on large repo (Linux kernel tree)	Wall time, agent RSS
`rm -rf node_modules` (10,000 files) → undo	Wall time, undo log size, restore time
`fsx` (random filesystem exerciser) for 60s	No errors; agent RSS stable
`fio` sequential 1MB writes × 1000	Throughput vs baseline (no interception)

Record results in CI artifacts for trend analysis. Alert on >30% regression from rolling 7-day baseline.
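
The alert rule can be sketched as a small pure function. This assumes a lower-is-better metric such as wall time and uses the arithmetic mean of the stored 7-day samples as the baseline. The plan does not specify mean versus median, so that choice, along with the names `regression_ratio` and `should_alert`, is an assumption.

```rust
/// Fractional slowdown of `current` relative to the mean of `baseline`.
/// Returns None when no usable baseline exists.
pub fn regression_ratio(baseline: &[f64], current: f64) -> Option<f64> {
    if baseline.is_empty() {
        return None;
    }
    let mean = baseline.iter().sum::<f64>() / baseline.len() as f64;
    if mean <= 0.0 {
        return None;
    }
    Some((current - mean) / mean)
}

/// True when the slowdown exceeds `threshold` (0.30 for the 30% rule).
/// With no baseline, no alert is raised.
pub fn should_alert(baseline: &[f64], current: f64, threshold: f64) -> bool {
    matches!(regression_ratio(baseline, current), Some(r) if r > threshold)
}
```

For throughput metrics such as the `fio` workload, where higher is better, the sign of the comparison would need to be inverted.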

9. Reusing Upstream Tests

Suite	Source	How to use	Phase
virtiofsd unit tests	Upstream fork	Run in fork CI; keep passing; add wrapper tests on top	MVP
`pjdfstest`	`github.com/pjd/pjdfstest`	Run curated subset inside guest against `/mnt/working`	MVP
crosvm `p9` crate fixtures	`chromium.googlesource.com/crosvm`	Port known-byte test vectors for wire format; adapt attribution	Phase 3
Mutagen test vectors	`github.com/mutagen-io/mutagen`	Port reserved-name tables, case-collision scenarios, chmod persistence behaviors as Rust table-driven tests	Phase 3
`xfstests` (optional)	`github.com/kdave/xfstests`	Run small subset externally (not vendored) for extended POSIX validation	Nightly

10. Cross-Platform Test Matrix

Phase	Host OS	Backend	CI Runner	Must pass before moving on
MVP	Linux x86_64	virtiofsd fork	GitHub Actions + self-hosted KVM runner	L1–L3 full; L4 E2E subset; L5 fuzz smoke; `pjdfstest` subset
Phase 2	macOS Apple Silicon	virtiofsd fork (ported)	macOS self-hosted runner (M-series)	macOS containment tests; portability layer tests; FSEvents barrier tests; E2E mount+undo
Phase 3	Windows x86_64	9P server	Windows self-hosted runner with WHPX	9P wire+dispatch tests; junction/reparse containment; case/reserved-name tests; metadata overlay tests; WHPX E2E

11. TDD Development Sequence

This ordering keeps the tight TDD loop fast and avoids "debugging QEMU" as the primary development activity.

Step	What to build	What to test	Layer
1	`TreeSnapshot` + `assert_tree_eq`	Snapshot round-trip; equality and diff output	L1
2	`UndoInterceptor` core (first-touch, preimage write, rollback)	UI-01 through UI-08 (create/write/rename/delete → rollback)	L2
3	WAL + crash recovery	CR-01 through CR-07 (fault injection, no VM)	L2
4	Undo barriers	EB-01 through EB-06 (external mod simulation)	L2
5	Safeguards (interceptor level)	SG-01 through SG-06 (simulate delete counts)	L2
6	Metadata capture (mode, mtime, xattrs)	UI-09 through UI-15	L2
7	Resource limits + pruning	UI-17 through UI-19, UL-01 through UL-09	L2
8	Control channel parsing + state machine	CC-01 through CC-07	L1
9	Control channel integration (fake shim)	CC-08 through CC-12 (quiescence, ambient)	L3
10	STDIO API contract tests	SA-01 through SA-12	L3
11	MCP server contract tests	MC-01 through MC-08	L3
12	Fuzz targets (initial)	All 6 fuzz targets with seed corpus	L5
13	QEMU E2E: session lifecycle	SL-01 through SL-08	L4
14	QEMU E2E: undo round-trip	`echo` → step → rollback → snapshot compare	L4
15	QEMU E2E: `pjdfstest` subset	POSIX semantics validation	L4
16	QEMU E2E: safeguard flow	`rm -rf` in VM → trigger → deny → verify rollback	L4
17	Model-based / property tests	Random op sequences → rollback → snapshot	L2
18	Performance baselines	Microbenchmarks (`criterion`)	L6

Steps 1–7 require no QEMU, no networking, no async runtime — pure Rust + filesystem. This is where the majority of correctness bugs will be found and fixed.

12. `cargo xtask` Commands

cargo xtask test-fast        # L1 + L2 + L3 (per-PR)
cargo xtask test-fuzz-smoke  # L5 short runs (per-PR)
cargo xtask test-e2e         # L4 (requires KVM; nightly)
cargo xtask test-all         # Everything
cargo xtask fuzz <target>    # Run a specific fuzz target continuously
cargo xtask bench            # L6 microbenchmarks
cargo xtask build-guest      # Build test guest image (vmlinuz + initrd)
cargo xtask ci-check         # clippy + fmt + deny + audit

13. Cargo Features

[features]
default = []
fault_injection = []  # Enables FaultInjector compile paths; never in release
e2e_tests = []        # Enables QEMU E2E test compilation

Tests use:

cargo test                                    # L1 + L2 + L3
cargo test --features fault_injection         # L2 with fault injection
cargo test --features e2e_tests -- --ignored  # L4 QEMU E2E
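
One way the `fault_injection` feature could keep injection code out of release builds is conditional compilation, sketched below. The `FaultPoint` variants and the `FaultInjector` methods are hypothetical and shown only to illustrate the gating pattern. The project's actual injector may be shaped differently.

```rust
use std::io;

/// Points at which a test may request an injected failure (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum FaultPoint {
    PreimageWrite,
    ManifestPromote,
}

/// Real injector: compiled only with `--features fault_injection`.
#[cfg(feature = "fault_injection")]
pub struct FaultInjector {
    armed: Option<FaultPoint>,
}

#[cfg(feature = "fault_injection")]
impl FaultInjector {
    pub fn new() -> Self {
        FaultInjector { armed: None }
    }
    /// Arms a one-shot failure at the given point.
    pub fn arm(&mut self, point: FaultPoint) {
        self.armed = Some(point);
    }
    /// Fails once if `point` is armed, then disarms.
    pub fn check(&mut self, point: FaultPoint) -> io::Result<()> {
        if self.armed == Some(point) {
            self.armed = None;
            return Err(io::Error::new(io::ErrorKind::Other, "injected fault"));
        }
        Ok(())
    }
}

/// No-op stand-in: the only version present in default and release builds.
#[cfg(not(feature = "fault_injection"))]
pub struct FaultInjector;

#[cfg(not(feature = "fault_injection"))]
impl FaultInjector {
    pub fn new() -> Self {
        FaultInjector
    }
    #[inline(always)]
    pub fn check(&mut self, _point: FaultPoint) -> io::Result<()> {
        Ok(())
    }
}
```

Call sites use `check` unconditionally. Only tests built with the feature can call `arm`, so a release build has no path that produces an injected error.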
