This testing plan is derived from a review of the externally proposed testing strategy. That strategy is strong in several areas: it correctly identifies risk-driven quality goals, proposes a sound test pyramid with CI gating, calls for fault injection and model-based testing (both high-ROI), recommends reusing upstream test suites rather than reinventing them, and sequences TDD development to avoid "debugging QEMU" as the main loop.
The review identified the following issues and gaps that this plan addresses:
- Spec ambiguities the strategy correctly flagged but left open. This plan locks them down (§1) so that tests have precise oracles.
- Missing coverage for ambient-step behavior. Writes arriving outside any active command step must be tested explicitly — the strategy mentions the concept but omits test cases.
- Incomplete POSIX metadata overlay testing (Windows). The strategy's Windows normalization tests don't cover the SQLite-backed overlay store added in the updated project plan for persistent chmod tracking.
- Reflink/clone detection path untested. The strategy mentions reflinks as a storage optimization but has no tests for the detection and fallback logic.
- Quiescence window edge cases underspecified. The strategy mentions quiescence tests but doesn't define the timeout, hang-prevention, or interaction with ambient steps.
- MCP write_file synthetic step lifecycle. The strategy lists write_file producing an API step but doesn't specify the full lifecycle (step open → preimage → write → step close) or error paths.
- No explicit test for undo-barrier visibility in undo.history. The strategy tests barrier blocking but not the API shape of barrier entries in history responses.
- Event ordering guarantee left unresolved. This plan chooses a concrete semantics so tests can assert against it.
- CI infrastructure for QEMU E2E. The strategy recommends E2E tests but doesn't address the practical question of KVM availability in CI runners.
All of these are resolved in the sections below.
Tests need precise oracles. The following decisions are locked down here and must be reflected in implementation.
Decision: The filesystem backend is responsible for distinguishing "create new" from "open-and-truncate existing." If the backend's create operation opens an existing file with truncation, the backend must call pre_open_trunc (not post_create). post_create is only called when a genuinely new inode is created. The interceptor does not attempt to detect this itself — the backend has the information (e.g., O_CREAT|O_TRUNC vs O_CREAT|O_EXCL).
Test implication: Tests must verify that overwriting an existing file via creat() captures the preimage, while creating a truly new file records existed_before=false.
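The decision above can be sketched as a small classification function. This is only an illustration: `CreateHook` and `select_create_hook` are hypothetical names, and the two booleans stand in for whatever the backend actually observes (whether the inode existed, and whether truncation was requested).

```rust
/// Which interceptor hook the backend should invoke for a create-style open.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq, Eq)]
pub enum CreateHook {
    /// An existing file is about to be truncated: capture the preimage first.
    PreOpenTrunc,
    /// A genuinely new inode was created: record existed_before = false.
    PostCreate,
    /// An existing file was opened without truncation: nothing to capture yet.
    None,
}

/// `existed_before` is what the backend observed on the host;
/// `truncate` reflects whether truncation (e.g. O_TRUNC) was requested.
pub fn select_create_hook(existed_before: bool, truncate: bool) -> CreateHook {
    match (existed_before, truncate) {
        (false, _) => CreateHook::PostCreate,
        (true, true) => CreateHook::PreOpenTrunc,
        (true, false) => CreateHook::None,
    }
}
```

A test for creat() on an existing file would then expect the PreOpenTrunc path, and a test for a new file would expect PostCreate.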
Decision: Rollback processes paths in two passes:
- Create pass (top-down, parents first): Recreate directories from shallowest to deepest, then restore file contents and file metadata.
- Metadata pass (depth-first, leaves first): Restore directory metadata (mode, mtime) after all children are restored, proceeding from deepest to shallowest. This prevents child restoration from updating parent directory mtime after it was already restored.
Test implication: Tests that delete a directory tree and roll back must assert that both file contents and directory metadata (including mtime) are restored correctly.
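The two orderings can be sketched by sorting paths on their component count. The function names are hypothetical and the sketch only covers ordering, not the actual restore work.

```rust
use std::path::{Path, PathBuf};

/// Number of components in a relative path, used as its depth.
fn depth(p: &Path) -> usize {
    p.components().count()
}

/// Create pass: shallowest paths first, so parents exist before children.
pub fn create_pass_order(paths: &[PathBuf]) -> Vec<PathBuf> {
    let mut ordered = paths.to_vec();
    ordered.sort_by(|a, b| depth(a).cmp(&depth(b)).then_with(|| a.cmp(b)));
    ordered
}

/// Metadata pass: deepest paths first, so a directory's mtime is restored
/// only after everything beneath it has been touched.
pub fn metadata_pass_order(paths: &[PathBuf]) -> Vec<PathBuf> {
    let mut ordered = create_pass_order(paths);
    ordered.reverse();
    ordered
}
```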
Decision: Rollback restores the following, and TreeSnapshot comparison asserts them:
| Attribute | Assertion | Notes |
|---|---|---|
| File contents | Byte-exact | Always |
| File type | Exact (reg/dir/symlink) | Always |
| Mode bits | Exact (all 12 bits: suid/sgid/sticky + rwx) | Always on Linux; via overlay on Windows |
| mtime | Within filesystem granularity tolerance (configurable, default 1ms) | FAT32 has 2-second granularity; ext4/APFS are sub-second |
| xattrs | Exact key-value set if the filesystem supports xattrs | Tests skip with explicit reason if unsupported |
| Symlink target | Exact string | Always |
| atime | Not asserted | Deliberately excluded — too volatile |
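The mtime row is the only non-exact comparison in the table. A minimal sketch of that check, assuming timestamps are held as nanoseconds in an i128 (the helper name is hypothetical):

```rust
/// Returns true when two mtimes (in nanoseconds) differ by no more than
/// the configured tolerance. All other attributes are compared exactly.
pub fn mtime_matches(expected_ns: i128, actual_ns: i128, tolerance_ns: i128) -> bool {
    (expected_ns - actual_ns).abs() <= tolerance_ns
}
```

With the default 1ms tolerance, a 0.5ms difference passes and a 2ms difference fails; on FAT32 the tolerance would need to be raised to cover the 2-second granularity.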
Decision: Rollback is a pop operation. Rolled-back steps are removed from the history and cannot be re-applied (no "redo"). undo.history after undo.rollback(2) returns a list that no longer contains the two most recent steps.
Rationale: Pop is simpler and avoids the question of whether redo is valid after external modifications or new steps. Redo can be added post-MVP as a separate feature.
Decision: Events and responses may interleave on stdout. The only ordering guarantee is: the response for a given request_id is sent after the corresponding operation completes (or fails). Events (event.step_completed, event.terminal_output, etc.) may arrive before or after the response they are associated with. Clients must correlate by request_id and step_id, not by position in the stream.
Test implication: JsonlClient must buffer and correlate, not assume positional ordering.
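One way to picture the buffering requirement is a small demultiplexer that files each incoming message by key rather than by arrival order. This is a synchronous sketch with hypothetical types (`Incoming`, `Demux`); the real JsonlClient is async and would parse JSON rather than carry plain strings.

```rust
use std::collections::{HashMap, VecDeque};

/// A decoded stdout line: either a response or an event.
#[derive(Debug, Clone, PartialEq)]
pub enum Incoming {
    Response { request_id: String, body: String },
    Event { event_type: String, body: String },
}

/// Buffers messages so callers can wait on a key instead of a stream position.
#[derive(Default)]
pub struct Demux {
    responses: HashMap<String, String>,
    events: HashMap<String, VecDeque<String>>,
}

impl Demux {
    /// File a message under its request_id or event type.
    pub fn push(&mut self, msg: Incoming) {
        match msg {
            Incoming::Response { request_id, body } => {
                self.responses.insert(request_id, body);
            }
            Incoming::Event { event_type, body } => {
                self.events.entry(event_type).or_default().push_back(body);
            }
        }
    }

    /// Take the buffered response for a request, if it has arrived.
    pub fn take_response(&mut self, request_id: &str) -> Option<String> {
        self.responses.remove(request_id)
    }

    /// Take the oldest buffered event of the given type, if any.
    pub fn take_event(&mut self, event_type: &str) -> Option<String> {
        self.events.get_mut(event_type)?.pop_front()
    }
}
```

Because lookups are keyed, a test passes whether event.step_completed arrives before or after the response it relates to.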
Decision: Filesystem writes that arrive outside any active command step are attributed to a synthetic "ambient" step. Ambient steps:
- Have a system-generated step ID (negative IDs, e.g., -1, -2, to distinguish from command steps).
- Capture preimages and participate in undo like normal steps.
- Are auto-closed after a configurable inactivity timeout (default 5 seconds of no new writes).
- Appear in undo.history with type: "ambient" and no associated command.
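The ID allocation and inactivity timeout could look roughly like the following. `AmbientTracker` is a hypothetical name, and time is passed in explicitly (as elapsed time on a test clock) so the behavior stays deterministic; preimage capture and history entries are left out.

```rust
use std::time::Duration;

/// Tracks the currently open ambient step, if any.
pub struct AmbientTracker {
    timeout: Duration,
    next_id: i64,                   // counts down: -1, -2, ...
    open: Option<(i64, Duration)>,  // (step id, time of last write)
}

impl AmbientTracker {
    pub fn new(timeout: Duration) -> Self {
        Self { timeout, next_id: -1, open: None }
    }

    /// Record a write that arrived outside any command step at time `now`
    /// and return the ambient step ID it is attributed to.
    pub fn on_write(&mut self, now: Duration) -> i64 {
        if let Some((id, last_write)) = self.open {
            // Still within the inactivity window: extend the open step.
            if now.saturating_sub(last_write) < self.timeout {
                self.open = Some((id, now));
                return id;
            }
        }
        // No open step, or the previous one timed out: start a new one.
        let id = self.next_id;
        self.next_id -= 1;
        self.open = Some((id, now));
        id
    }
}
```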
| Layer | Speed | Scope | Requires |
|---|---|---|---|
| L1: Unit | < 1s each | Pure logic: parsers, state machines, manifests, error taxonomy, step tracker, path normalization | Nothing beyond cargo test |
| L2: Component integration | < 5s each | Real host filesystem + UndoInterceptor, WAL, pruning, barrier logic, TreeSnapshot comparison | tempfile crate, host filesystem |
| L3: Protocol integration | < 5s each | 9P server and control channel with in-process clients (no kernel mount, no QEMU) | Tokio test runtime |
| L4: System / E2E | 10–60s each | Full agent binary, QEMU guest, STDIO/MCP clients, end-to-end undo/safeguard/barrier validation | QEMU, KVM (or TCG fallback), test guest image |
| L5: Security fuzzing | Continuous | cargo-fuzz targets for protocol parsers, path normalization, manifest parsing | libFuzzer, fuzz corpora |
| L6: Stress / performance | Minutes | fsx/fio workloads, large repo operations, sustained write pressure, watcher overflow | QEMU + KVM, criterion benchmarks |
Per-PR (required, must pass before merge):
- L1 + L2 + L3 (all unit, component, protocol tests)
- L5 fuzz smoke: each fuzz target runs for 30 seconds with existing corpus
- cargo clippy --all-targets, cargo fmt --check, cargo deny check, cargo audit
- Total budget: < 10 minutes
Nightly (blocks release if failing):
- L4 full QEMU E2E suite (Linux host with KVM; if KVM unavailable, run a reduced TCG subset)
- L5 extended fuzz runs: 10 minutes per target, corpus regression
- L6 performance baselines with 30% regression alert threshold
- Total budget: < 45 minutes
Pre-release gate:
- All of the above plus manual review of fuzz coverage report
- Phase 2/3: macOS and Windows E2E suites on dedicated runners
QEMU E2E tests require KVM access. Options by CI provider:
- Self-hosted runner (recommended for nightly): A Linux VM with nested virtualization enabled (/dev/kvm available). Most cloud providers support this (GCP N2, AWS metal/nested, etc.).
- GitHub Actions: Use runs-on: ubuntu-latest with KVM enabled (available on larger runners). Alternatively, use TCG (software emulation) for a slow but functional subset.
- Fallback for PRs: Skip L4 tests on PR CI if KVM is unavailable; gate only on nightly. Mark L4 tests with #[ignore] and enable them via the --ignored flag on nightly runs.
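A test harness could choose between KVM and TCG at startup with a check along these lines. The helper name is hypothetical, and the existence check is a simplification: a runner may expose /dev/kvm without granting the test user access, so a real harness would probably also try opening the device.

```rust
use std::path::Path;

/// Pick the QEMU accelerator name based on whether the KVM device node exists.
/// The returned string is intended for QEMU's `-accel` option.
pub fn pick_qemu_accel(kvm_device: &Path) -> &'static str {
    if kvm_device.exists() {
        "kvm"
    } else {
        "tcg"
    }
}
```

Typical use would be `pick_qemu_accel(Path::new("/dev/kvm"))`, with the reduced TCG subset selected whenever the result is "tcg".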
sandbox-agent/
  crates/
    test-support/                 # Shared test utilities (library crate)
      src/
        lib.rs                    # Re-exports
        workspace.rs              # TempWorkspace: fixture trees + undo dir
        snapshot.rs               # TreeSnapshot + assert_tree_eq
        jsonl_client.rs           # STDIO API test client
        mcp_client.rs             # MCP socket test client
        fake_shim.rs              # In-process fake VM shim
        fault.rs                  # Fault injection registry
        clock.rs                  # Deterministic clock for tests
        fixtures.rs               # Reusable fixture tree builders
      Cargo.toml                  # dev-dependency only
  tests/
    integration/                  # L2 component integration tests
      undo_interceptor.rs
      wal_crash_recovery.rs
      undo_barriers.rs
      undo_pruning.rs
      undo_resource_limits.rs
      ambient_steps.rs
      multi_directory.rs
    protocol/                     # L3 protocol integration tests
      control_channel.rs
      stdio_api.rs
      mcp_server.rs
      p9_wire.rs                  # Phase 3
    e2e/                          # L4 system tests (require QEMU)
      session_lifecycle.rs
      undo_roundtrip.rs
      safeguard_flow.rs
      external_modification.rs
      mcp_integration.rs
      pjdfstest_subset.rs
  fuzz/                           # L5 fuzz targets
    Cargo.toml
    corpus/
      p9_wire/
      control_jsonl/
      stdio_json/
      mcp_jsonrpc/
      undo_manifest/
      path_normalize/
    fuzz_targets/
      p9_wire.rs
      control_jsonl.rs
      stdio_json.rs
      mcp_jsonrpc.rs
      undo_manifest.rs
      path_normalize.rs
  benches/                        # L6 microbenchmarks
    preimage_capture.rs
    zstd_compression.rs
    rollback_restore.rs
    manifest_io.rs
/// Creates an isolated working directory + undo directory for a single test.
pub struct TempWorkspace {
    pub working_dir: PathBuf, // The "shared folder" equivalent
    pub undo_dir: PathBuf,    // Adjacent, outside share root
    _temp: TempDir,           // Dropped on test exit
}

impl TempWorkspace {
    /// Create empty workspace.
    pub fn new() -> Self { ... }
    /// Create workspace from a fixture builder.
    pub fn with_fixture(f: impl FnOnce(&Path)) -> Self { ... }
    /// Snapshot the current state of the working directory.
    pub fn snapshot(&self) -> TreeSnapshot { ... }
}

pub struct TreeSnapshot {
    pub entries: BTreeMap<PathBuf, EntrySnapshot>,
}

pub struct EntrySnapshot {
    pub file_type: FileType,            // Reg, Dir, Symlink
    pub content_hash: Option<[u8; 32]>, // blake3 for regular files
    pub size: u64,
    pub mode: u32,
    pub mtime_ns: i128,
    pub symlink_target: Option<String>,
    pub xattrs: BTreeMap<String, Vec<u8>>,
}

pub struct SnapshotCompareOptions {
    pub mtime_tolerance_ns: i128, // Default: 1_000_000 (1ms)
    pub check_xattrs: bool,       // Default: true on Linux, false on Windows
    pub exclude_patterns: Vec<String>,
}

/// Panics with a human-readable diff on mismatch.
pub fn assert_tree_eq(
    before: &TreeSnapshot,
    after: &TreeSnapshot,
    opts: &SnapshotCompareOptions,
) { ... }

/// Spawns the agent as a child process, speaks STDIO API.
pub struct JsonlClient {
    child: Child,
    // Reads stdout in a background task, demuxes events and responses
    // into separate channels keyed by request_id / event type.
}

impl JsonlClient {
    pub async fn send(&mut self, msg: Value) -> Result<()>;
    pub async fn recv_response(&mut self, request_id: &str, timeout: Duration) -> Result<Value>;
    pub async fn recv_event(&mut self, event_type: &str, timeout: Duration) -> Result<Value>;
    pub fn stderr_lines(&self) -> Vec<String>; // For log validation
}

/// Connects to MCP socket, speaks JSON-RPC.
pub struct McpClient { ... }

impl McpClient {
    pub async fn connect(socket_path: &Path) -> Result<Self>;
    pub async fn call(&mut self, method: &str, params: Value) -> Result<Value>;
}

The UndoInterceptor and step finalization logic accept a Clock trait:
pub trait Clock: Send + Sync {
    fn now(&self) -> SystemTime;
}

pub struct RealClock;
pub struct FakeClock { inner: Mutex<SystemTime> }

impl FakeClock {
    pub fn advance(&self, duration: Duration);
    pub fn set(&self, time: SystemTime);
}

Step IDs are host-generated and deterministic in tests (sequential integers starting from a test-provided seed).
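For reference, the FakeClock signatures above could be filled in roughly as follows. This is a sketch under stated assumptions: the `new` constructor is not part of the plan and is added here only so the example is self-contained, and lock poisoning is handled with a plain unwrap.

```rust
use std::sync::Mutex;
use std::time::{Duration, SystemTime};

pub trait Clock: Send + Sync {
    fn now(&self) -> SystemTime;
}

/// Deterministic clock whose time only moves when a test says so.
pub struct FakeClock {
    inner: Mutex<SystemTime>,
}

impl FakeClock {
    /// Hypothetical constructor (not in the plan's signature list).
    pub fn new(start: SystemTime) -> Self {
        Self { inner: Mutex::new(start) }
    }

    /// Move the clock forward by `duration`.
    pub fn advance(&self, duration: Duration) {
        let mut now = self.inner.lock().unwrap();
        *now += duration;
    }

    /// Jump the clock to an absolute time.
    pub fn set(&self, time: SystemTime) {
        *self.inner.lock().unwrap() = time;
    }
}

impl Clock for FakeClock {
    fn now(&self) -> SystemTime {
        *self.inner.lock().unwrap()
    }
}
```

Tests for timeouts (ambient step auto-close, quiescence, safeguard auto-deny) can then call advance instead of sleeping.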
Compile-time gated (cfg(feature = "fault_injection")), never in release builds.
pub struct FaultInjector {
    faults: Mutex<VecDeque<Fault>>,
}

pub enum Fault {
    FailPreimageWrite { errno: i32 },      // ENOSPC, EIO, etc.
    FailStepPromotion,                     // Rename WAL→steps fails
    TruncateManifest { after_bytes: u64 }, // Partial manifest write
    ForceWatcherOverflow,                  // Simulate inotify overflow
    ForceLateWrite { delay_ms: u64 },      // Write arrives after step_completed
}

impl FaultInjector {
    pub fn enqueue(&self, fault: Fault);
    /// Returns Some(fault) and removes it, or None.
    pub fn check(&self, point: &str) -> Option<Fault>;
}

Injection points are placed at the start of each fallible operation in UndoInterceptor:
// Only FailPreimageWrite carries an errno, so match that variant directly.
if let Some(Fault::FailPreimageWrite { errno }) = self.fault_injector.check("preimage_write") {
    return Err(io::Error::from_raw_os_error(errno));
}

These are the properties that, if violated, constitute a bug. Every test traces back to one or more of these.
For any completed, protected step S in working directory D: undo.rollback(1) restores TreeSnapshot(D) to exactly the snapshot captured before step S began (within the metadata semantics of §1.3).
For any path P mutated multiple times within step S, the stored preimage corresponds to the state of P before the first mutation within S. Subsequent mutations within S do not overwrite the preimage.
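The first-touch rule amounts to an insert-if-absent on a per-step map. A minimal in-memory sketch, with the hypothetical `StepPreimages` type standing in for the on-disk preimage store:

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// Preimages captured within a single step, keyed by path.
#[derive(Default)]
pub struct StepPreimages {
    /// `None` records that the path did not exist before the step.
    first_seen: HashMap<PathBuf, Option<Vec<u8>>>,
}

impl StepPreimages {
    /// Capture the current contents of `path` unless this step already
    /// holds a preimage for it. Returns true if a preimage was stored.
    pub fn capture(&mut self, path: &Path, current: Option<Vec<u8>>) -> bool {
        if self.first_seen.contains_key(path) {
            return false;
        }
        self.first_seen.insert(path.to_path_buf(), current);
        true
    }

    /// Look up the stored preimage for `path`, if any.
    pub fn preimage(&self, path: &Path) -> Option<&Option<Vec<u8>>> {
        self.first_seen.get(path)
    }
}
```

UI-01 (write the same file three times in one step) corresponds to three capture calls of which only the first returns true.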
If the agent crashes at any point during a step, restart always produces a working directory state equal to the pre-step snapshot. The WAL in_progress directory is removed. An event.recovery is emitted.
Rollback cannot cross an undo barrier unless force: true. Barriers are visible in undo.history. Internal sandbox writes never create barriers.
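The barrier rule can be modeled on a plain history list. The sketch below uses hypothetical names and makes one assumption the plan does not state: when force is set, barriers crossed by the rollback are removed from history along with the steps.

```rust
/// One entry in the undo history, oldest first.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum HistoryEntry {
    Step(u64),
    Barrier,
}

/// Pop the `count` most recent steps. Fails without modifying `history`
/// if a barrier is in the way and `force` is false. Returns the IDs of the
/// rolled-back steps, newest first.
pub fn rollback(
    history: &mut Vec<HistoryEntry>,
    count: usize,
    force: bool,
) -> Result<Vec<u64>, String> {
    let mut steps_seen = 0;
    let mut cut = history.len();
    for (i, entry) in history.iter().enumerate().rev() {
        if steps_seen == count {
            break;
        }
        match entry {
            HistoryEntry::Barrier if !force => {
                return Err(format!("undo barrier at history index {}", i));
            }
            HistoryEntry::Barrier => {}
            HistoryEntry::Step(_) => steps_seen += 1,
        }
        cut = i;
    }
    if steps_seen < count {
        return Err(format!("only {} step(s) available", steps_seen));
    }
    let removed = history.split_off(cut);
    Ok(removed
        .iter()
        .rev()
        .filter_map(|e| match e {
            HistoryEntry::Step(id) => Some(*id),
            HistoryEntry::Barrier => None,
        })
        .collect())
}
```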
The safeguard triggers before the operation that would cross the threshold executes on the host filesystem. On deny, zero host mutations from the paused-and-denied portion persist.
No guest-originated filesystem operation, undo preimage capture, or rollback restore can read or write any host path outside the share root directory.
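As an illustration of the lexical part of this invariant, the following sketch resolves a guest-supplied relative path against the share root and rejects anything that would climb above it. The function name is hypothetical. It deliberately does not address symlinks or reparse points, which need fd-relative (openat-style) containment as covered by the SC tests.

```rust
use std::path::{Component, Path, PathBuf};

/// Resolve `guest_path` under `root`, rejecting absolute paths and any
/// `..` sequence that would escape the root. Purely lexical: no filesystem
/// access, so symlink escapes must be handled separately.
pub fn resolve_within_root(root: &Path, guest_path: &Path) -> Option<PathBuf> {
    let mut relative = PathBuf::new();
    for component in guest_path.components() {
        match component {
            Component::Normal(part) => relative.push(part),
            Component::CurDir => {}
            Component::ParentDir => {
                // pop() returns false when there is nothing left to remove,
                // which means the path tried to climb above the root.
                if !relative.pop() {
                    return None;
                }
            }
            Component::RootDir | Component::Prefix(_) => return None,
        }
    }
    Some(root.join(relative))
}
```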
No input to any protocol parser (9P wire, control channel JSONL, STDIO JSON, MCP JSON-RPC, undo manifest) causes a panic, unbounded allocation, or undefined behavior. Malformed input produces a structured error.
STDIO API stdout never contains log output. Stderr never contains protocol messages. MCP socket traffic never appears on stdout/stderr. No cross-contamination.
Priority: Highest. This is the core correctness path — implement and test first.
Test scaffolding required:
- TempWorkspace + TreeSnapshot (§3.2, §3.3)
- StepTracker test double: allows manual open_step(id) / close_step(id) calls
- OperationApplier helper: calls the interceptor hook, then performs the real std::fs operation, mirroring backend behavior
| ID | Category | Scenario | Assert |
|---|---|---|---|
| UI-01 | First-touch | Write same file 3× in one step | One preimage stored; rollback restores original |
| UI-02 | Create | Create new file + write content | Rollback deletes file; parent dir mtime restored |
| UI-03 | Create-dir | Create nested directory structure | Rollback removes all created dirs (deepest first) |
| UI-04 | Delete | Delete file | Rollback restores bytes + mode + mtime + xattrs |
| UI-05 | Delete-tree | rm -rf simulation (deep nested tree) | Rollback restores full tree with correct structure |
| UI-06 | Rename-new | Rename A→B where B doesn't exist | Rollback restores A, removes B |
| UI-07 | Rename-over | Rename A→B where B exists | Rollback restores both A and B to pre-step state |
| UI-08 | Rename-dir | Rename directory with nested files | Rollback restores all paths under old name |
| UI-09 | Truncate-open | Open existing file with O_TRUNC | pre_open_trunc captures preimage before truncation |
| UI-10 | Truncate-setattr | setattr truncate to shorter length | Preimage contains original full contents |
| UI-11 | Chmod | Flip executable bit | Rollback restores original mode |
| UI-12 | Xattr-set | Set user xattr on file | Rollback removes xattr (or restores previous value) |
| UI-13 | Xattr-remove | Remove existing xattr | Rollback restores xattr |
| UI-14 | Fallocate | Extend file via fallocate | Rollback restores original size |
| UI-15 | Copy-file-range | Copy range into existing file | Destination preimage captured; rollback restores |
| UI-16 | Multi-step | Steps 1 and 2 modify same file differently | Rollback(1) restores to post-step-1 state, not original |
| UI-17 | Unprotected | Step exceeds max_single_step_size | Step marked unprotected; rollback returns error |
| UI-18 | FIFO-eviction | Exceed max_step_count | Oldest step evicted; event.warning emitted |
| UI-19 | Log-size-eviction | Exceed max_log_size_bytes | Oldest steps evicted until within budget |
| UI-20 | Ambient-step | Write arrives outside any command step | Attributed to ambient step; undo works |
| UI-21 | Ambient-timeout | Ambient step auto-closes after inactivity | New write after timeout opens new ambient step |
| UI-22 | Multi-dir | Rollback in dir A | Dir B unmodified |
| UI-23 | Hardlink | Hardlink to file within share root | No panic; behavior documented (path-based capture) |
| UI-24 | Symlink-internal | Symlink within share root | Symlink target string captured; rollback restores |
Generate random sequences of operations (CreateFile, Write, Truncate, Chmod, Rename, Delete, Mkdir, Rmdir, SetXattr, RemoveXattr) grouped into steps. After each step, optionally roll back and compare to the stored snapshot. This catches ordering bugs, rename collisions, and multi-touch edge cases that enumerated tests miss.
#[proptest]
fn undo_model(ops: Vec<StepOps>) {
    let ws = TempWorkspace::with_fixture(random_small_tree);
    let interceptor = UndoInterceptor::new(ws.undo_dir.clone(), ...);
    for step in &ops {
        let snapshot_before = ws.snapshot();
        interceptor.open_step(step.id);
        for op in &step.ops { op.apply(&ws, &interceptor); }
        interceptor.close_step(step.id);
        if step.should_rollback {
            interceptor.rollback(1);
            assert_tree_eq(&snapshot_before, &ws.snapshot(), &default_opts());
        }
    }
}

Test scaffolding: Fault injection (§3.6), TempWorkspace.
| ID | Scenario | Fault injected | Assert |
|---|---|---|---|
| CR-01 | Crash mid-step (after some preimages written) | Kill process (or return from test without closing step) | Restart rolls back; working dir equals pre-step snapshot |
| CR-02 | Crash during preimage write | FailPreimageWrite { errno: EIO } | Operation fails; step becomes unprotected OR write is rejected (test whichever policy is chosen) |
| CR-03 | Crash during step promotion | FailStepPromotion | WAL in_progress remains; restart rolls back |
| CR-04 | Truncated manifest | TruncateManifest { after_bytes: 50 } | Restart detects corruption, rolls back, emits event.recovery |
| CR-05 | Clean shutdown (no crash) | None | WAL empty; steps directory contains committed steps |
| CR-06 | Double recovery (restart twice without new writes) | None | Second restart is a no-op; no duplicate events |
| CR-07 | Crash with empty step (step opened but no writes) | Kill before any preimage | Restart discards empty WAL entry; no-op rollback |
| ID | Scenario | Assert |
|---|---|---|
| EB-01 | External write during active session | event.external_modification emitted with affected paths |
| EB-02 | Rollback across barrier (no force) | Rollback rejected with error listing barrier details |
| EB-03 | Rollback across barrier (force=true) | Rollback proceeds; warning included in response |
| EB-04 | Barrier visible in undo.history | History entry has type: "barrier" with timestamp and paths |
| EB-05 | Internal sandbox write does NOT trigger barrier | Correlation logic filters backend-originated watcher events |
| EB-06 | Multiple barriers between steps | Each barrier listed; rollback blocked at nearest |
| EB-07 | Watcher overflow | Agent emits warning; degrades to conservative barrier behavior |
| EB-08 | policy=warn | External write emits warning but no barrier; rollback proceeds |
| EB-09 | policy=lock (if implemented) | External write attempt fails (CI-optional, requires permission control) |
| ID | Scenario | Assert |
|---|---|---|
| SG-01 | Delete count reaches threshold | event.safeguard_triggered emitted; no further host mutations while paused |
| SG-02 | Confirm allow | Command completes; step commits; undo works |
| SG-03 | Confirm deny | Entire step rolled back; tree matches pre-step snapshot |
| SG-04 | Timeout (no confirm sent) | Auto-deny; step rolled back |
| SG-05 | Overwrite-large-file threshold | Triggered when existing file > configured size is overwritten |
| SG-06 | Rename-over-existing threshold | Triggered when rename destination exists |
| SG-07 | Queue overflow (request-holding mode) | Queue cap reached; further ops get ENOSPC; no OOM |
| SG-08 | QMP pause mode (if available) | QMP stop issued on trigger; cont on allow/deny; VM verifiably paused |
| SG-09 | Pre-operation trigger ordering | Safeguard fires before the Nth deletion executes (not after) |
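SG-09 hinges on the check running before the mutation rather than after it. A minimal sketch of that ordering for the delete-count safeguard, using hypothetical names and assuming "threshold N" means the Nth deletion is the one that gets held:

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    Proceed,
    Pause,
}

/// Counts executed deletions within a step and decides, before each new
/// deletion, whether it may run.
pub struct DeleteSafeguard {
    threshold: u32,
    executed: u32,
}

impl DeleteSafeguard {
    pub fn new(threshold: u32) -> Self {
        Self { threshold, executed: 0 }
    }

    /// Called before a deletion touches the host filesystem.
    pub fn before_delete(&self) -> Verdict {
        if self.executed + 1 >= self.threshold {
            Verdict::Pause
        } else {
            Verdict::Proceed
        }
    }

    /// Called only after a deletion has actually been applied.
    pub fn record_executed(&mut self) {
        self.executed += 1;
    }

    pub fn executed(&self) -> u32 {
        self.executed
    }
}
```

With a threshold of 3, two deletions run and the third is paused with the executed count still at 2, which is what the test asserts.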
Unit tests (JSONL parsing):
| ID | Scenario | Assert |
|---|---|---|
| CC-01 | Valid step_started / step_completed sequence | Step opens and closes; filesystem writes attributed correctly |
| CC-02 | Malformed JSON | Structured error logged; channel not broken |
| CC-03 | Unknown message type | Ignored or logged; channel not broken |
| CC-04 | Oversized message (>1MB) | Rejected before full allocation |
| CC-05 | step_completed without step_started | Error logged; no crash |
| CC-06 | Duplicate step_started for same ID | Error logged; existing step unaffected |
| CC-07 | Cancellation mid-step | Step finalized appropriately |
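For CC-04, "rejected before full allocation" suggests capping the read itself rather than checking the length afterwards. One possible shape, using only the standard library (the function name and the error kind are illustrative choices, not the plan's API):

```rust
use std::io::{self, BufRead, Read};

/// Limit taken from CC-04 (messages over 1MB are rejected).
pub const MAX_LINE_BYTES: u64 = 1024 * 1024;

/// Read one newline-terminated message without ever buffering more than
/// `max + 1` bytes. Returns Ok(None) at end of stream.
pub fn read_bounded_line<R: BufRead>(reader: &mut R, max: u64) -> io::Result<Option<Vec<u8>>> {
    let mut buf = Vec::new();
    // take() caps how many bytes read_until may pull from the stream.
    let n = reader.by_ref().take(max + 1).read_until(b'\n', &mut buf)?;
    if n == 0 {
        return Ok(None);
    }
    if buf.last() == Some(&b'\n') {
        buf.pop();
    }
    if buf.len() as u64 > max {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "message exceeds size limit",
        ));
    }
    Ok(Some(buf))
}
```

After an oversized line the rest of that line is still in the stream, so the caller has to decide whether to skip to the next newline or close the channel.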
Integration tests (fake shim, no QEMU):
| ID | Scenario | Assert |
|---|---|---|
| CC-08 | Normal exec cycle | Host sends exec; fake shim returns started/output/completed; events forwarded |
| CC-09 | Quiescence window: no late writes | Step closes immediately after step_completed + quiescence timeout |
| CC-10 | Quiescence window: late write arrives | Step closure waits for in-flight ops to drain; late write included in step |
| CC-11 | Quiescence timeout: prevent hang | If in-flight ops never drain, step closes after max quiescence timeout (e.g., 2s) |
| CC-12 | Ambient writes after step close | Writes after quiescence window go to ambient step, not the closed step |
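The quiescence cases CC-09 to CC-11 reduce to a three-way decision. The sketch below is a pure function with hypothetical names; the 2s maximum comes from the CC-11 example, while the quiescence window length used in the test is an arbitrary placeholder.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq, Eq)]
pub enum CloseDecision {
    /// Keep the step open and check again later.
    Wait,
    /// No in-flight operations and the quiescence window has elapsed.
    Close,
    /// In-flight operations never drained; close anyway to avoid a hang.
    ForceClose,
}

/// Decide what to do with a step after step_completed was received.
pub fn quiescence_decision(
    in_flight_ops: usize,
    since_completed: Duration,
    quiescence_window: Duration,
    max_wait: Duration,
) -> CloseDecision {
    if in_flight_ops == 0 && since_completed >= quiescence_window {
        CloseDecision::Close
    } else if since_completed >= max_wait {
        CloseDecision::ForceClose
    } else {
        CloseDecision::Wait
    }
}
```

Writes that arrive after Close or ForceClose are the ones CC-12 expects to land in an ambient step.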
Schema tests (unit):
| ID | Scenario | Assert |
|---|---|---|
| SA-01 | Each request type parses correctly | Valid request → accepted |
| SA-02 | Unknown request type | Structured error: {code: "unknown_operation", message: "..."} |
| SA-03 | Missing required field | Structured error with field name |
| SA-04 | Version negotiation (once defined) | Mismatched version → graceful rejection |
Stream behavior tests (integration with JsonlClient):
| ID | Scenario | Assert |
|---|---|---|
| SA-05 | Response correlates to request_id | Response request_id matches request |
| SA-06 | Events interleave with responses | Client correctly demuxes both |
| SA-07 | Stderr is valid JSONL logs | Every stderr line parses as JSON with timestamp, level, component |
| SA-08 | Stdout contains no log lines | No line on stdout has level or component fields |
| SA-09 | Backpressure: client stops reading | Agent does not deadlock (bounded buffers or timeout) |
Security tests:
| ID | Scenario | Assert |
|---|---|---|
| SA-10 | fs.read with ../../etc/passwd path | Rejected; resolved relative to working dir root |
| SA-11 | fs.list with absolute path outside root | Rejected |
| SA-12 | Oversized write_file payload | Size limit enforced; structured error |
| ID | Scenario | Assert |
|---|---|---|
| MC-01 | JSON-RPC compliance (id, errors, unknown method) | Correct JSON-RPC responses |
| MC-02 | execute_command returns exit_code/stdout/stderr | Values match what fake shim sent |
| MC-03 | write_file creates synthetic API step | Step appears in undo.history with type: "api" |
| MC-04 | write_file → rollback | Written file removed; preimage restored |
| MC-05 | write_file error (path outside root) | JSON-RPC error; no step created |
| MC-06 | MCP triggers safeguard → STDIO event emitted | Cross-interface consistency |
| MC-07 | Connection without auth token (if implemented) | Rejected |
| MC-08 | Concurrent MCP + STDIO operations | Shared undo/safeguard state consistent; no races |
| ID | Scenario | Assert |
|---|---|---|
| UL-01 | Manifest correctness | Affected paths, existed_before, file type, metadata encoding all round-trip |
| UL-02 | Preimage atomicity | Preimage writes use temp file + atomic rename |
| UL-03 | Step promotion atomicity | wal/in_progress/ renamed to steps/{id}/ atomically |
| UL-04 | Version mismatch on startup | event.undo_version_mismatch emitted; undo disabled |
| UL-05 | undo.discard after mismatch | Old log wiped; new version file written; undo re-enabled |
| UL-06 | Corrupt manifest (truncated) | Graceful error; agent doesn't crash |
| UL-07 | Missing preimage file | Rollback returns error for that step; other steps unaffected |
| UL-08 | Corrupt preimage (flipped bytes) | Detected (if checksums used) or rollback produces incorrect state (documented) |
| UL-09 | Reflink detection + fallback | If FICLONE succeeds, preimage is a reflink; if it fails, falls back to copy+zstd |
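UL-02 describes the usual temp-file-plus-rename pattern. A standard-library sketch follows; the function name and the ".tmp" suffix are assumptions, and it omits fsync of the parent directory, which a crash-safe implementation would likely add.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Write `bytes` to a sibling temp file, flush it to disk, then rename it
/// over `dest`. Readers see either the old file or the complete new one.
pub fn write_atomic(dest: &Path, bytes: &[u8]) -> io::Result<()> {
    // Temp file lives next to the destination so the rename stays on one
    // filesystem.
    let mut tmp_name = dest.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp = PathBuf::from(tmp_name);

    let mut file = fs::File::create(&tmp)?;
    file.write_all(bytes)?;
    file.sync_all()?;
    drop(file);

    fs::rename(&tmp, dest)
}
```

The same rename-based approach applies to UL-03, where the unit being renamed is the wal/in_progress/ directory rather than a single file.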
Unit/integration (no VM):
| ID | Scenario | Assert |
|---|---|---|
| VF-01 | InodePathMap: insert/update/remove/rename | Lookup returns correct path |
| VF-02 | InodePathMap: negative lookup | Returns defined error (not panic) |
| VF-03 | InodePathMap: path always within root | No path returned outside share root |
E2E (with QEMU):
| ID | Scenario | Assert |
|---|---|---|
| VF-04 | Guest performs each primitive op | Undo step lists correct paths; rollback restores snapshot |
| VF-05 | pjdfstest curated subset | POSIX semantics match for create/unlink/rename/chmod/symlink |
Reuse: Run upstream virtiofsd unit tests in fork CI. Keep them passing. Add wrapper-specific tests on top.
| ID | Scenario | Assert |
|---|---|---|
| P9-01 | Wire round-trip for each message type | Serialize → deserialize = identity |
| P9-02 | Known-byte fixtures | Match crosvm test vectors (adapt licensing) |
| P9-03 | Invalid sizes/offsets/flags | Correct Rlerror errno |
| P9-04 | Out-of-order responses by tag | Pipelined requests handled correctly |
| P9-05 | Oversized message | Rejected before full allocation |
| ID | Scenario | Assert |
|---|---|---|
| WN-01 | Case-collision detection | Create Foo then foo → error |
| WN-02 | Reserved names | Create CON, NUL, etc. → rejected |
| WN-03 | POSIX metadata overlay: chmod persistence | chmod 755 → getattr returns 755 across sessions |
| WN-04 | Overlay: new file defaults | File without overlay entry gets heuristic mode |
| WN-05 | Overlay: rollback restores mode | Mode changed by step → rollback restores previous overlay entry |
| WN-06 | Reparse point escape | Create junction inside root → outside; write through it → rejected |
| WN-07 | Reparse point: read through junction | Read via junction pointing outside root → rejected |
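WN-01 and WN-02 lend themselves to table-driven unit tests over two small predicates. The sketch below is illustrative only: the function names are hypothetical, the case comparison is ASCII-only (real Windows case folding is broader), and the reserved list covers the commonly documented device names CON, PRN, AUX, NUL, COM1 to COM9, and LPT1 to LPT9.

```rust
/// True if `name` is a reserved Windows device name, with or without an
/// extension (for example both "NUL" and "nul.txt").
pub fn is_windows_reserved_name(name: &str) -> bool {
    let stem = name.split('.').next().unwrap_or("").trim_end_matches(' ');
    let upper = stem.to_ascii_uppercase();
    match upper.as_str() {
        "CON" | "PRN" | "AUX" | "NUL" => true,
        s => {
            let bytes = s.as_bytes();
            bytes.len() == 4
                && (s.starts_with("COM") || s.starts_with("LPT"))
                && (b'1'..=b'9').contains(&bytes[3])
        }
    }
}

/// True if `candidate` differs from an existing name only by ASCII case.
pub fn case_collides(existing: &[&str], candidate: &str) -> bool {
    existing
        .iter()
        .any(|e| *e != candidate && e.eq_ignore_ascii_case(candidate))
}
```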
| ID | Scenario | Assert |
|---|---|---|
| SL-01 | session.start with invalid working dir | Structured error |
| SL-02 | session.start with multiple dirs | Each dir gets mount tag; backend instances created |
| SL-03 | session.stop (persistent) | VM shuts down; disk image preserved |
| SL-04 | session.stop (ephemeral) | VM destroyed; disk image deleted |
| SL-05 | session.reset | Persistent VM wiped and recreated |
| SL-06 | QEMU launch failure | Structured error event; agent doesn't hang |
| SL-07 | Control channel disconnect | Agent transitions to error state; emits event |
| SL-08 | Resource cleanup on stop | Sockets removed; child processes terminated |
| ID | Scenario | Assert |
|---|---|---|
| OB-01 | Stderr logs parse as JSONL | Every line has timestamp, level, component |
| OB-02 | request_id correlation | Log entries for a request include matching request_id |
| OB-03 | step_id correlation | Log entries during a step include matching step_id |
| OB-04 | No protocol frames in logs | Logs never contain raw 9P bytes or control channel messages |
| OB-05 | Log level filtering | --log-level=warn suppresses info/debug/trace |
Build a minimal guest image containing:
- Busybox or Alpine base (< 50MB)
- VM-side shim (baked in)
- Core utilities: sh, dd, truncate, chmod, ln, mv, rm, mkdir, touch, stat
- Optional: setfattr/getfattr (for xattr E2E tests)
- No node, cargo, etc.; those are nightly workload tests
Build recipe: Alpine-based initramfs created via a Dockerfile or Buildroot config. The xtask command cargo xtask build-guest produces vmlinuz + initrd.img for both x86_64 and aarch64.
Every E2E test follows this sequence:
#[tokio::test]
#[ignore] // Only run in nightly CI with KVM
async fn test_undo_single_file_write() {
    let ws = TempWorkspace::with_fixture(small_tree);
    let initial_snapshot = ws.snapshot();
    let mut client = JsonlClient::spawn_agent(&[
        "--working-dir", ws.working_dir.to_str().unwrap(),
        "--undo-dir", ws.undo_dir.to_str().unwrap(),
        "--vm-mode", "ephemeral",
        "--backend", "virtiofs", // or "9p" for Phase 3 tests
    ]).await;
    client.send(session_start()).await;
    client.recv_response("start", Duration::from_secs(30)).await; // VM boot
    // Execute mutation
    client.send(agent_execute("echo 'hello' > /mnt/working/test.txt")).await;
    client.recv_event("event.step_completed", Duration::from_secs(10)).await;
    // Verify mutation happened
    let post_snapshot = ws.snapshot();
    assert_ne!(&initial_snapshot, &post_snapshot);
    // Rollback
    client.send(undo_rollback(1)).await;
    client.recv_response("rollback", Duration::from_secs(5)).await;
    // Verify restoration
    assert_tree_eq(&initial_snapshot, &ws.snapshot(), &default_opts());
    client.send(session_stop()).await;
}

use std::fs::{self, Permissions};
use std::os::unix::fs::PermissionsExt; // for Permissions::from_mode
use std::path::Path;

pub fn small_tree(root: &Path) {
    // Files of various sizes
    fs::write(root.join("empty.txt"), "").unwrap();
    fs::write(root.join("small.txt"), "hello world").unwrap();
    fs::write(root.join("medium.txt"), "x".repeat(4096)).unwrap();
    fs::write(root.join("large.bin"), vec![0xABu8; 1_000_000]).unwrap();
    // Nested directories
    fs::create_dir_all(root.join("src/components")).unwrap();
    fs::write(root.join("src/main.rs"), "fn main() {}").unwrap();
    fs::write(root.join("src/components/app.rs"), "pub struct App;").unwrap();
    // Executable file
    let exec_path = root.join("run.sh");
    fs::write(&exec_path, "#!/bin/sh\necho ok").unwrap();
    fs::set_permissions(&exec_path, Permissions::from_mode(0o755)).unwrap();
}

pub fn rename_tree(root: &Path) { /* a.txt + b.txt with distinct contents */ }
pub fn xattr_tree(root: &Path) { /* file with user.test xattr */ }
pub fn symlink_tree(root: &Path) { /* symlinks to file and dir */ }
pub fn deep_tree(root: &Path) { /* 5 levels deep, 100+ files for safeguard tests */ }

Each target is a fuzz_target! in fuzz/fuzz_targets/. Corpora are committed and grow over time.
| Target | Input | Assertions |
|---|---|---|
| p9_wire | Raw bytes | No panic; no alloc > 16MB; error on invalid input |
| control_jsonl | UTF-8 string (one line) | No panic; no alloc > 1MB; valid parse or structured error |
| stdio_json | UTF-8 string (one line) | No panic; no alloc > 1MB; valid parse or structured error |
| mcp_jsonrpc | UTF-8 string | No panic; no alloc > 1MB |
| undo_manifest | Raw bytes (simulated manifest file) | No panic; valid parse or error |
| path_normalize | Vec<Vec<u8>> (path components) | No panic; result is within root or error; no .. traversal |
| ID | Scenario | Phase | Assert |
|---|---|---|---|
| SC-01 | Symlink in working dir → /etc/passwd (Linux virtiofsd+chroot) | MVP | Preimage capture does not access /etc/passwd |
| SC-02 | Rollback with symlink to outside root | MVP | Restore does not write outside root |
| SC-03 | Symlink chain (A→B→C→outside) | MVP | Entire chain resolved; access denied |
| SC-04 | macOS: symlink escape without chroot | Phase 2 | openat-relative containment rejects |
| SC-05 | macOS: TOCTOU during F_GETPATH re-open | Phase 2 | No re-open by path; use fd directly |
| SC-06 | Windows: junction to C:\Windows\System32 | Phase 3 | 9P server rejects; no host access |
| SC-07 | Windows: reparse point during rename | Phase 3 | Rename target validated within root |
| ID | Scenario | Assert |
|---|---|---|
| DOS-01 | Create many unique files in one step until max_single_step_size hit | Step becomes unprotected; agent responsive |
| DOS-02 | Safeguard pause: flood with filesystem ops | Queue capped; ENOSPC beyond cap; no OOM |
| DOS-03 | Many concurrent 9P requests (pipelined) | Agent handles within bounded memory |
| DOS-04 | Giant 9P message (size field claims 2GB) | Rejected at wire parse; no allocation |
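DOS-04 can be satisfied by validating the fixed 7-byte 9P header (size[4], type[1], tag[2], all little-endian) before allocating a body buffer. A sketch under stated assumptions: the type and error names are hypothetical, and the 16MB cap is borrowed from the p9_wire fuzz assertion rather than from a negotiated msize.

```rust
/// Illustrative cap, taken from the "no alloc > 16MB" fuzz assertion.
pub const MAX_MSG_SIZE: u32 = 16 * 1024 * 1024;

#[derive(Debug, PartialEq, Eq)]
pub struct P9Header {
    pub size: u32,
    pub msg_type: u8,
    pub tag: u16,
}

#[derive(Debug, PartialEq, Eq)]
pub enum HeaderError {
    Truncated,
    TooSmall(u32),
    TooLarge(u32),
}

/// Parse and validate the header without allocating anything.
pub fn parse_header(buf: &[u8], max_size: u32) -> Result<P9Header, HeaderError> {
    if buf.len() < 7 {
        return Err(HeaderError::Truncated);
    }
    let size = u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]);
    // The size field counts the header itself, so it can never be below 7.
    if size < 7 {
        return Err(HeaderError::TooSmall(size));
    }
    if size > max_size {
        return Err(HeaderError::TooLarge(size));
    }
    Ok(P9Header {
        size,
        msg_type: buf[4],
        tag: u16::from_le_bytes([buf[5], buf[6]]),
    })
}
```

Only after parse_header succeeds would the server reserve `size - 7` bytes for the message body.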
Run as part of per-PR CI:
- cargo audit: known vulnerability check
- cargo deny check: license and advisory policy
- cargo clippy --all-targets: lint
- Sanitizer jobs (nightly CI): ASan + UBSan on fuzz targets, TSan on concurrency-heavy unit tests if feasible
| Benchmark | Sizes | Regression threshold |
|---|---|---|
| Preimage capture throughput | 4KB, 1MB, 100MB | 30% regression alerts |
| zstd compression (level 3) | 4KB, 1MB, 100MB | 30% |
| Rollback restore throughput | 4KB, 1MB, 100MB | 30% |
| Manifest write + promotion | 10 paths, 100 paths, 1000 paths | 30% |
| `TreeSnapshot` capture | 100 files, 1000 files, 10000 files | 30% |

| Workload | Metrics |
|---|---|
| `git status` on large repo (Linux kernel tree) | Wall time, agent RSS |
| `rm -rf node_modules` (10,000 files) → undo | Wall time, undo log size, restore time |
| `fsx` (random filesystem exerciser) for 60s | No errors; agent RSS stable |
| `fio` sequential 1MB writes × 1000 | Throughput vs baseline (no interception) |

Record results in CI artifacts for trend analysis. Alert on >30% regression from rolling 7-day baseline.
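
The alert rule can be made precise with a small comparison function. This is a sketch under assumptions: it uses the arithmetic mean of the stored samples as the rolling baseline (a median would be an equally plausible choice), and `Better`, `regression_pct`, and `should_alert` are hypothetical names. Throughput metrics regress when they fall and wall-time metrics regress when they rise, so the direction is an explicit parameter.

```rust
/// Which direction counts as an improvement for a metric.
#[derive(Clone, Copy)]
pub enum Better {
    Higher, // e.g. throughput
    Lower,  // e.g. wall time, RSS
}

/// Regression as a positive percentage relative to the baseline mean.
/// Returns `None` when there is no usable baseline.
pub fn regression_pct(baseline: &[f64], current: f64, better: Better) -> Option<f64> {
    if baseline.is_empty() {
        return None;
    }
    let mean = baseline.iter().sum::<f64>() / baseline.len() as f64;
    if mean <= 0.0 {
        return None;
    }
    let change = (current - mean) / mean * 100.0;
    Some(match better {
        Better::Higher => -change, // a drop is a regression
        Better::Lower => change,   // a rise is a regression
    })
}

/// Applies the 30% threshold from the plan.
pub fn should_alert(baseline: &[f64], current: f64, better: Better) -> bool {
    matches!(regression_pct(baseline, current, better), Some(p) if p > 30.0)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn throughput_drop_beyond_threshold_alerts() {
        let week = [100.0; 7]; // one sample per day
        assert!(should_alert(&week, 65.0, Better::Higher));
        assert!(!should_alert(&week, 75.0, Better::Higher));
    }
}
```

Returning `None` for an empty baseline keeps the first runs on a new benchmark from alerting before any history exists.
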
| Suite | Source | How to use | Phase |
|---|---|---|---|
| virtiofsd unit tests | Upstream fork | Run in fork CI; keep passing; add wrapper tests on top | MVP |
| `pjdfstest` | `github.com/pjd/pjdfstest` | Run curated subset inside guest against `/mnt/working` | MVP |
| crosvm `p9` crate fixtures | `chromium.googlesource.com/crosvm` | Port known-byte test vectors for wire format; adapt attribution | Phase 3 |
| Mutagen test vectors | `github.com/mutagen-io/mutagen` | Port reserved-name tables, case-collision scenarios, chmod persistence behaviors as Rust table-driven tests | Phase 3 |
| `xfstests` (optional) | `github.com/kdave/xfstests` | Run small subset externally (not vendored) for extended POSIX validation | Nightly |

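
To show the shape a ported table-driven test might take, here is a sketch for the reserved-name and case-collision checks. The vectors are illustrative and are not copied from Mutagen; `is_reserved_name` and `case_collides` are hypothetical helpers. The name list covers the classic Windows device names only, and the case comparison uses simple Unicode lowercasing, which is an approximation of NTFS case folding.

```rust
/// Classic Windows reserved device names (illustrative, not exhaustive).
const RESERVED: &[&str] = &[
    "CON", "PRN", "AUX", "NUL",
    "COM1", "COM2", "COM3", "COM4", "COM5", "COM6", "COM7", "COM8", "COM9",
    "LPT1", "LPT2", "LPT3", "LPT4", "LPT5", "LPT6", "LPT7", "LPT8", "LPT9",
];

/// A name is reserved if the part before the first dot matches a device
/// name, ignoring ASCII case (so `nul.txt` is reserved as well).
pub fn is_reserved_name(name: &str) -> bool {
    let stem = name.split('.').next().unwrap_or("");
    RESERVED.iter().any(|r| r.eq_ignore_ascii_case(stem))
}

/// Two distinct names that would map to one entry on a
/// case-insensitive filesystem.
pub fn case_collides(a: &str, b: &str) -> bool {
    a != b && a.to_lowercase() == b.to_lowercase()
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn reserved_name_table() {
        // (input, expected) pairs; extend by appending rows.
        let cases = [
            ("CON", true),
            ("con", true),
            ("nul.txt", true),
            ("NUL.tar.gz", true),
            ("COM1", true),
            ("COM10", false),
            ("console", false),
            ("readme.md", false),
        ];
        for (name, expected) in cases {
            assert_eq!(is_reserved_name(name), expected, "name: {name}");
        }
    }
}
```

Keeping the vectors in a flat array makes it straightforward to append rows as upstream tables are ported, with each failure message naming the offending input.
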
| Phase | Host OS | Backend | CI Runner | Must pass before moving on |
|---|---|---|---|---|
| MVP | Linux x86_64 | virtiofsd fork | GitHub Actions + self-hosted KVM runner | L1–L3 full; L4 E2E subset; L5 fuzz smoke; pjdfstest subset |
| Phase 2 | macOS Apple Silicon | virtiofsd fork (ported) | macOS self-hosted runner (M-series) | macOS containment tests; portability layer tests; FSEvents barrier tests; E2E mount+undo |
| Phase 3 | Windows x86_64 | 9P server | Windows self-hosted runner with WHPX | 9P wire+dispatch tests; junction/reparse containment; case/reserved-name tests; metadata overlay tests; WHPX E2E |
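
Because the MVP E2E tests depend on a KVM-capable runner, a test can probe for the accelerator and skip itself when it is absent. The sketch below covers only the Linux/KVM case; `accel_device_usable` and `kvm_available` are hypothetical helpers, and treating "can open `/dev/kvm` read-write" as the availability criterion is an assumption.

```rust
use std::fs::OpenOptions;
use std::path::Path;

/// True if the device node can be opened read-write by the current user.
pub fn accel_device_usable(dev: &Path) -> bool {
    OpenOptions::new().read(true).write(true).open(dev).is_ok()
}

/// Convenience wrapper for the standard KVM device node on Linux.
pub fn kvm_available() -> bool {
    accel_device_usable(Path::new("/dev/kvm"))
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn missing_device_is_reported_unusable() {
        assert!(!accel_device_usable(Path::new("/nonexistent/kvm-device")));
    }

    #[test]
    fn e2e_session_smoke() {
        if !kvm_available() {
            // Hosted runners without nested virtualization land here.
            eprintln!("skipping: /dev/kvm is not usable on this runner");
            return;
        }
        // The QEMU launch and session assertions would go here.
    }
}
```

A silent skip can hide a misconfigured self-hosted runner, so a nightly job may prefer to fail instead of returning early when the probe is false.
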
This ordering keeps the tight TDD loop fast and avoids "debugging QEMU" as the primary development activity.
| Step | What to build | What to test | Layer |
|---|---|---|---|
| 1 | `TreeSnapshot` + `assert_tree_eq` | Snapshot round-trip; equality and diff output | L1 |
| 2 | `UndoInterceptor` core (first-touch, preimage write, rollback) | UI-01 through UI-08 (create/write/rename/delete → rollback) | L2 |
| 3 | WAL + crash recovery | CR-01 through CR-07 (fault injection, no VM) | L2 |
| 4 | Undo barriers | EB-01 through EB-06 (external mod simulation) | L2 |
| 5 | Safeguards (interceptor level) | SG-01 through SG-06 (simulate delete counts) | L2 |
| 6 | Metadata capture (mode, mtime, xattrs) | UI-09 through UI-15 | L2 |
| 7 | Resource limits + pruning | UI-17 through UI-19, UL-01 through UL-09 | L2 |
| 8 | Control channel parsing + state machine | CC-01 through CC-07 | L1 |
| 9 | Control channel integration (fake shim) | CC-08 through CC-12 (quiescence, ambient) | L3 |
| 10 | STDIO API contract tests | SA-01 through SA-12 | L3 |
| 11 | MCP server contract tests | MC-01 through MC-08 | L3 |
| 12 | Fuzz targets (initial) | All 6 fuzz targets with seed corpus | L5 |
| 13 | QEMU E2E: session lifecycle | SL-01 through SL-08 | L4 |
| 14 | QEMU E2E: undo round-trip | `echo` → step → rollback → snapshot compare | L4 |
| 15 | QEMU E2E: `pjdfstest` subset | POSIX semantics validation | L4 |
| 16 | QEMU E2E: safeguard flow | `rm -rf` in VM → trigger → deny → verify rollback | L4 |
| 17 | Model-based / property tests | Random op sequences → rollback → snapshot | L2 |
| 18 | Performance baselines | Microbenchmarks (`criterion`) | L6 |

Steps 1–7 require no QEMU, no networking, no async runtime — pure Rust + filesystem. This is where the majority of correctness bugs will be found and fixed.
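
The first-touch and rollback logic from step 2, and the random-sequence check from step 17, can be prototyped against an in-memory model before any real filesystem code exists. The sketch below is such a model, not the `UndoInterceptor` itself: `UndoModel` and `rollback_restores` are hypothetical, the tree is a flat map of path to contents, and a hand-rolled linear congruential generator stands in for a property-testing crate to keep the example dependency-free.

```rust
use std::collections::BTreeMap;

type Tree = BTreeMap<String, Vec<u8>>;

/// In-memory stand-in for the interceptor (hypothetical).
#[derive(Default)]
pub struct UndoModel {
    pub tree: Tree,
    /// Preimage per path, recorded on first touch only.
    /// `None` means the path did not exist when the step began.
    preimages: BTreeMap<String, Option<Vec<u8>>>,
}

impl UndoModel {
    fn first_touch(&mut self, path: &str) {
        if !self.preimages.contains_key(path) {
            let pre = self.tree.get(path).cloned();
            self.preimages.insert(path.to_string(), pre);
        }
    }

    pub fn write(&mut self, path: &str, data: &[u8]) {
        self.first_touch(path);
        self.tree.insert(path.to_string(), data.to_vec());
    }

    pub fn delete(&mut self, path: &str) {
        self.first_touch(path);
        self.tree.remove(path);
    }

    pub fn rename(&mut self, from: &str, to: &str) {
        // Both ends of a rename are touched, so both need preimages.
        self.first_touch(from);
        self.first_touch(to);
        if let Some(data) = self.tree.remove(from) {
            self.tree.insert(to.to_string(), data);
        }
    }

    /// Restores every touched path to its preimage and clears the step.
    pub fn rollback(&mut self) {
        for (path, pre) in std::mem::take(&mut self.preimages) {
            match pre {
                Some(data) => {
                    self.tree.insert(path, data);
                }
                None => {
                    self.tree.remove(&path);
                }
            }
        }
    }
}

/// Applies `ops` pseudo-random operations, rolls back, and reports
/// whether the tree equals the snapshot taken before the step.
pub fn rollback_restores(seed: u64, ops: usize) -> bool {
    let mut m = UndoModel::default();
    m.tree.insert("a".to_string(), b"1".to_vec());
    m.tree.insert("b".to_string(), b"2".to_vec());
    let before = m.tree.clone();
    let names = ["a", "b", "c", "d"];
    let mut s = seed;
    for _ in 0..ops {
        // Linear congruential step; only used to pick operations.
        s = s.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        let p = names[(s >> 33) as usize % names.len()];
        let q = names[(s >> 40) as usize % names.len()];
        match (s >> 60) % 3 {
            0 => m.write(p, &s.to_le_bytes()),
            1 => m.delete(p),
            _ => m.rename(p, q),
        }
    }
    m.rollback();
    m.tree == before
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn rollback_restores_snapshot_for_many_seeds() {
        for seed in 0..200 {
            assert!(rollback_restores(seed, 50), "seed {seed}");
        }
    }
}
```

The same oracle (snapshot before, operate, roll back, compare) carries over unchanged once the model is replaced by the real interceptor and `TreeSnapshot`.
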
```bash
cargo xtask test-fast          # L1 + L2 + L3 (per-PR)
cargo xtask test-fuzz-smoke    # L5 short runs (per-PR)
cargo xtask test-e2e           # L4 (requires KVM; nightly)
cargo xtask test-all           # Everything
cargo xtask fuzz <target>      # Run a specific fuzz target continuously
cargo xtask bench              # L6 microbenchmarks
cargo xtask build-guest        # Build test guest image (vmlinuz + initrd)
cargo xtask ci-check           # clippy + fmt + deny + audit
```

```toml
[features]
default = []
fault_injection = []   # Enables FaultInjector compile paths; never in release
e2e_tests = []         # Enables QEMU E2E test compilation
```

Tests use:

```bash
cargo test                                    # L1 + L2 + L3
cargo test --features fault_injection         # L2 with fault injection
cargo test --features e2e_tests -- --ignored  # L4 QEMU E2E
```
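
One way to keep fault-injection code out of release builds is to compile two versions of the same hook behind the `fault_injection` feature. The sketch below shows that pattern only; `FaultPoint`, the `fault` module, and `write_manifest` are hypothetical, and the real `FaultInjector` API is not specified in this plan.

```rust
use std::io;

/// Hypothetical injection points.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum FaultPoint {
    BeforeManifestWrite,
    AfterPreimageWrite,
}

#[cfg(feature = "fault_injection")]
pub mod fault {
    use super::FaultPoint;
    use std::io;
    use std::sync::Mutex;

    /// The single armed fault, if any. Global state means tests that
    /// arm faults should not run in parallel with each other.
    static ARMED: Mutex<Option<FaultPoint>> = Mutex::new(None);

    pub fn arm(point: FaultPoint) {
        *ARMED.lock().unwrap() = Some(point);
    }

    /// Fails once when `point` is armed, then disarms itself.
    pub fn check(point: FaultPoint) -> io::Result<()> {
        let mut armed = ARMED.lock().unwrap();
        if *armed == Some(point) {
            *armed = None;
            return Err(io::Error::new(io::ErrorKind::Other, "injected fault"));
        }
        Ok(())
    }
}

#[cfg(not(feature = "fault_injection"))]
pub mod fault {
    use super::FaultPoint;
    use std::io;

    /// No-op in builds without the feature.
    #[inline(always)]
    pub fn check(_point: FaultPoint) -> io::Result<()> {
        Ok(())
    }
}

/// Production code calls the hook unconditionally; the feature decides
/// whether it can ever fail.
pub fn write_manifest(out: &mut Vec<u8>, bytes: &[u8]) -> io::Result<()> {
    fault::check(FaultPoint::BeforeManifestWrite)?;
    out.extend_from_slice(bytes);
    Ok(())
}

#[cfg(all(test, feature = "fault_injection"))]
mod tests {
    use super::*;

    #[test]
    fn armed_fault_aborts_manifest_write() {
        fault::arm(FaultPoint::BeforeManifestWrite);
        let mut out = Vec::new();
        assert!(write_manifest(&mut out, b"m").is_err());
        assert!(out.is_empty());
    }
}
```

With this layout, `cargo test --features fault_injection` exercises the failing path, while a default build compiles the hook down to a call that always returns `Ok(())`.
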