
CRUCIBLE Performance Baseline

This document records the current performance model carried by the CRUCIBLE audit artifacts. It is intentionally specific about where speed comes from, which code paths are supposed to be fast, and which caveats still affect any performance claim made against the current repo.

What “fast” means in the current codebase

UFFS has two distinct performance stories:

| Workflow | Preferred representation | Why |
| --- | --- | --- |
| Simple live search (glob, extension, size, basic filters) | MftIndex | Avoids DataFrame construction and uses search-specific structures. |
| Rich tabular workflows, parquet input, aggregations/sorting | DataFrame | Pays conversion/materialization cost to unlock the richer query model. |
| Full-drive live indexing | MftReader + drive-tuned MFT readers | Speed comes from direct NTFS MFT access, not Win32 file enumeration. |
| Multi-drive search | Bounded cross-drive orchestration | Favors predictable host utilization over unbounded fanout. |

The fastest path in the repo is not “Polars everywhere”; it is “read the MFT efficiently, keep simple search on MftIndex, and convert to tabular form only when that extra expressiveness is actually needed.”

Current hot-path model

1. Direct MFT access is still the primary macro-win

UFFS gets its biggest speed advantage by bypassing Windows file enumeration APIs and reading NTFS metadata directly.

That still means:

  • volume handle opening and MFT extent discovery in uffs-mft
  • optional bitmap-driven skipping of unused records
  • direct/aligned volume I/O for the MFT pipeline
  • no dependency on recursive directory walking for the live path
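The bitmap-driven skipping above can be sketched in isolation. This is a minimal illustration, not the actual uffs-mft API: it assumes the $MFT bitmap is available as a byte slice where bit `i` marks record `i` as allocated, and the function name is hypothetical.

```rust
/// Yield indices of in-use MFT records according to the volume's $MFT
/// bitmap, where bit `i` marks record `i` as allocated. Skipping unset
/// bits avoids fixup/parse work on records that are known to be unused.
fn in_use_records(bitmap: &[u8], total_records: usize) -> Vec<usize> {
    let mut out = Vec::new();
    for (byte_idx, &byte) in bitmap.iter().enumerate() {
        if byte == 0 {
            continue; // fast path: skip 8 unused records at once
        }
        for bit in 0..8 {
            let record = byte_idx * 8 + bit;
            if record >= total_records {
                break;
            }
            if byte & (1u8 << bit) != 0 {
                out.push(record);
            }
        }
    }
    out
}

fn main() {
    // Records 0, 2, and 9 are allocated; the other 13 are never parsed.
    let bitmap = [0b0000_0101u8, 0b0000_0010];
    println!("{:?}", in_use_records(&bitmap, 16));
}
```

On mostly-empty regions of the MFT, the zero-byte fast path is why bitmap usage is "usually faster" in practice.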

2. Auto mode is now IOCP-first

One important current-state performance detail: Auto mode is effectively an IOCP-sliding-window selector now.

Today:

  • DataFrame reads map Auto -> SlidingIocp
  • lean-index reads map Auto -> SlidingIocpInline

That is true across Nvme, Ssd, Hdd, and Unknown drive types. Drive type still matters, but mainly as a tuning input for:

  • I/O chunk size (Nvme 4 MiB, Ssd 2 MiB, Hdd/Unknown 1 MiB)
  • read concurrency (Nvme 32, Ssd 8, Hdd 4 with extent-aware HDD tuning)
  • whether certain parsing strategies are beneficial

So the current performance story is not “Auto picks different reader families by device class”; it is “Auto picks the sliding-window IOCP family, then tunes it per device.”
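That selection-then-tuning split can be sketched as follows. All type and function names are illustrative, not the real uffs surface; the Unknown concurrency value is an assumption (the text only specifies concurrency for Nvme, Ssd, and Hdd), so Unknown is treated like Hdd here.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum DriveType { Nvme, Ssd, Hdd, Unknown }

#[derive(Debug, PartialEq)]
enum ReadMode { DataFrame, LeanIndex }

#[derive(Debug, PartialEq)]
enum ReaderFamily { SlidingIocp, SlidingIocpInline }

#[derive(Debug, PartialEq)]
struct IoTuning { chunk_bytes: usize, concurrency: usize }

/// Auto resolves to the sliding-window IOCP family for every drive type;
/// only the read mode decides which variant.
fn resolve_auto(mode: ReadMode) -> ReaderFamily {
    match mode {
        ReadMode::DataFrame => ReaderFamily::SlidingIocp,
        ReadMode::LeanIndex => ReaderFamily::SlidingIocpInline,
    }
}

/// Drive type is a tuning input, not a family selector.
/// Assumption: Unknown shares Hdd's concurrency as well as its chunk size.
fn tune(drive: DriveType) -> IoTuning {
    const MIB: usize = 1024 * 1024;
    match drive {
        DriveType::Nvme => IoTuning { chunk_bytes: 4 * MIB, concurrency: 32 },
        DriveType::Ssd => IoTuning { chunk_bytes: 2 * MIB, concurrency: 8 },
        DriveType::Hdd | DriveType::Unknown => IoTuning { chunk_bytes: MIB, concurrency: 4 },
    }
}

fn main() {
    assert_eq!(resolve_auto(ReadMode::DataFrame), ReaderFamily::SlidingIocp);
    println!("{:?}", tune(DriveType::Nvme));
}
```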

3. The index path is the main query-time optimization

uffs-core::index_search still documents the expected gap between the two query representations for simple searches on large datasets:

  • MftIndex path: roughly 100-200 ms for 23M entries
  • DataFrame path: roughly 3-5 s, largely because of conversion/materialization overhead

That gap is not accidental. The fast path is built to avoid work:

  • direct record iteration instead of DataFrame expression setup
  • extension-aware candidate reduction through the extension index
  • path resolution only when needed
  • Rayon-powered filtering/expansion over already-compact in-memory structures

For simple search, any performance comparison that forces the DataFrame path is measuring a different workload.
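The extension-aware candidate reduction mentioned above can be illustrated with a toy index. This is a sketch of the idea only, under the assumption that the index keeps a per-extension posting list; the names `LeanIndex`, `build`, and `find` are hypothetical, not the uffs-core API.

```rust
use std::collections::HashMap;

/// Illustrative lean index: a flat name table plus an extension index that
/// maps a lowercase extension to the record ids that carry it, so a
/// "*.rs"-style search filters a small candidate set instead of every record.
struct LeanIndex {
    names: Vec<String>,
    by_ext: HashMap<String, Vec<usize>>,
}

impl LeanIndex {
    fn build(names: Vec<String>) -> Self {
        let mut by_ext: HashMap<String, Vec<usize>> = HashMap::new();
        for (id, name) in names.iter().enumerate() {
            // rsplit('.') yields the extension first; a name with no dot
            // yields itself, which we reject.
            if let Some(ext) = name.rsplit('.').next().filter(|e| *e != name.as_str()) {
                by_ext.entry(ext.to_ascii_lowercase()).or_default().push(id);
            }
        }
        LeanIndex { names, by_ext }
    }

    /// Extension-aware candidate reduction: only records carrying the
    /// extension are examined by the substring filter.
    fn find(&self, ext: &str, contains: &str) -> Vec<&str> {
        self.by_ext
            .get(&ext.to_ascii_lowercase())
            .map(|ids| {
                ids.iter()
                    .map(|&id| self.names[id].as_str())
                    .filter(|n| n.contains(contains))
                    .collect()
            })
            .unwrap_or_default()
    }
}

fn main() {
    let idx = LeanIndex::build(vec![
        "main.rs".into(), "lib.rs".into(), "notes.txt".into(), "Makefile".into(),
    ]);
    println!("{:?}", idx.find("rs", "lib"));
}
```

The point is structural: an extension query never touches the `notes.txt` or `Makefile` records at all, which is the kind of avoided work that separates the MftIndex path from a full DataFrame filter.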

4. Path resolution is an explicit hot path

Path reconstruction remains a major cost center because NTFS stores parent FRS links, not full paths.

The current optimization stack is:

  • FastPathResolver uses Vec-indexed O(1) lookup rather than HashMap-based lookup
  • file names are interned in a contiguous NameArena
  • resolver entries are packed to 16 bytes each
  • parallel path-column addition uses Rayon when the dataset is large enough

The code documents the expected win as:

  • FastPathResolver is roughly 3-5x faster than the legacy resolver
  • memory use is roughly 50% lower due to NameArena
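The resolver layout can be sketched in miniature. This mirrors the shape described above (contiguous name arena, Vec indexed by file record number, parent links walked until a self-parenting root), but the names are illustrative and the entry here is not packed to the real 16-byte layout.

```rust
/// Sketch of the resolver shape: names interned in one contiguous arena,
/// entries indexed directly by file record number (Vec, not HashMap),
/// each entry holding a parent link plus an (offset, len) view into the arena.
struct FastPathResolverSketch {
    arena: Vec<u8>,               // NameArena idea: all names, back to back
    entries: Vec<Option<Entry>>,  // indexed by file record number
}

#[derive(Clone, Copy)]
struct Entry {
    parent: u32,   // parent record number; a root points at itself
    name_off: u32, // offset of the name in the arena
    name_len: u16, // name length in bytes
}

impl FastPathResolverSketch {
    fn with_capacity(records: usize) -> Self {
        Self { arena: Vec::new(), entries: vec![None; records] }
    }

    fn insert(&mut self, frs: u32, parent: u32, name: &str) {
        let name_off = self.arena.len() as u32;
        self.arena.extend_from_slice(name.as_bytes());
        self.entries[frs as usize] =
            Some(Entry { parent, name_off, name_len: name.len() as u16 });
    }

    fn name_of(&self, e: Entry) -> &str {
        let start = e.name_off as usize;
        std::str::from_utf8(&self.arena[start..start + e.name_len as usize]).unwrap()
    }

    /// O(depth) path reconstruction with an O(1) Vec lookup per hop.
    fn resolve(&self, frs: u32) -> Option<String> {
        let mut parts = Vec::new();
        let mut cur = frs;
        loop {
            let e = (*self.entries.get(cur as usize)?)?;
            parts.push(self.name_of(e));
            if e.parent == cur { break; } // reached the root
            cur = e.parent;
        }
        parts.reverse();
        Some(parts.join("\\"))
    }
}

fn main() {
    let mut r = FastPathResolverSketch::with_capacity(16);
    r.insert(5, 5, "C:");            // root maps to itself
    r.insert(6, 5, "Users");
    r.insert(7, 6, "readme.txt");
    println!("{:?}", r.resolve(7));
}
```

The Vec-plus-arena design is also where the memory win comes from: one allocation for all name bytes instead of one `String` per record.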

5. SoA staging reduces DataFrame build cost

ParsedColumns is the current struct-of-arrays staging format for the MFT parse pipeline.

That matters because it removes the old array-of-structs -> struct-of-arrays transpose during DataFrame build. The code documents this as reducing df_build time by about 20%.

This optimization only helps the tabular path, but it is still important because that path is used for parquet/export/analytics-style workflows.
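The struct-of-arrays idea can be shown in a few lines. This is a sketch in the spirit of ParsedColumns, not its real definition; field names are illustrative.

```rust
/// Struct-of-arrays staging: each parsed field appends straight into its
/// own column vector, so the DataFrame build can consume ready-made
/// columns instead of transposing an array-of-structs row by row.
#[derive(Default)]
struct ParsedColumnsSketch {
    frs: Vec<u64>,
    name: Vec<String>,
    size: Vec<u64>,
    is_dir: Vec<bool>,
}

impl ParsedColumnsSketch {
    fn push(&mut self, frs: u64, name: &str, size: u64, is_dir: bool) {
        self.frs.push(frs);
        self.name.push(name.to_string());
        self.size.push(size);
        self.is_dir.push(is_dir);
    }

    /// Merging a worker's batch is four column appends, no per-row shuffle.
    fn merge(&mut self, mut other: ParsedColumnsSketch) {
        self.frs.append(&mut other.frs);
        self.name.append(&mut other.name);
        self.size.append(&mut other.size);
        self.is_dir.append(&mut other.is_dir);
    }

    fn len(&self) -> usize {
        self.frs.len()
    }
}

fn main() {
    let mut a = ParsedColumnsSketch::default();
    a.push(5, "C:", 0, true);
    let mut b = ParsedColumnsSketch::default();
    b.push(7, "readme.txt", 1024, false);
    a.merge(b);
    println!("{} rows staged", a.len());
}
```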

Where the current costs still go

Live full-scan costs

Even with the current optimizations, a live full-drive read still spends real time in:

  • volume setup and metadata collection
  • reading the MFT extents themselves
  • fixup/parsing work on each record
  • optional extension-record merging
  • optional placeholder parent synthesis
  • DataFrame construction when the tabular path is requested

Query-time costs

For search, the main remaining variable costs are:

  • path resolution when output needs full paths
  • row expansion for hard links and streams
  • DataFrame materialization when the query mode requires it
  • drive fanout and output serialization for multi-drive streaming search

Cache-sensitive costs

The best-case live-query performance is often “use a fresh cached index and do a small or no-op USN refresh.”

The worst case is still “cache miss or stale cache -> full rebuild.”

In between, current behavior intentionally trades peak freshness for safety:

  • journal unavailable/read failure -> use cached index as-is
  • journal wrapped/journal ID changed -> rebuild
  • valid delta -> apply USN changes and recompute tree metrics

That is often a major latency difference, so performance discussions should always state whether the measured path was a cold full scan, a warm cache hit, or a USN refresh.
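The refresh policy above reduces to a small decision table. The enum and function names here are hypothetical, but the mapping follows the three bullets exactly.

```rust
#[derive(Debug, PartialEq)]
enum JournalState { Unavailable, ReadFailed, Wrapped, IdChanged, ValidDelta }

#[derive(Debug, PartialEq)]
enum RefreshPlan { UseCachedAsIs, FullRebuild, ApplyUsnDelta }

/// Safety-first refresh policy: prefer a stale-but-consistent cache when
/// the journal cannot be read, rebuild when the journal can no longer
/// prove continuity, and apply the delta only when it is valid.
fn plan_refresh(state: JournalState) -> RefreshPlan {
    match state {
        JournalState::Unavailable | JournalState::ReadFailed => RefreshPlan::UseCachedAsIs,
        JournalState::Wrapped | JournalState::IdChanged => RefreshPlan::FullRebuild,
        JournalState::ValidDelta => RefreshPlan::ApplyUsnDelta,
    }
}

fn main() {
    println!("{:?}", plan_refresh(JournalState::ValidDelta));
}
```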

Performance-sensitive options and their tradeoffs

The current codebase still exposes a few switches that materially change runtime costs:

  • extension-record merging: more complete, slower
  • bitmap usage: usually faster, but can be disabled for debugging/experiments
  • placeholder creation: code comments document roughly 15% CPU savings when disabled, at the cost of weaker path-resolution behavior on some records
  • hard-link expansion: more user-visible rows, more work

One caveat worth stating explicitly: the public “fast vs --full” story still exists, but internal defaults around extension merging are not fully harmonized across every entrypoint. Benchmark with explicit flags/settings when making claims instead of assuming one universal default.
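One way to keep benchmark settings explicit is to bundle the switches into a single value that gets constructed and reported with every measurement. The struct and its profile constructors below are hypothetical, not the uffs API; the field values in the profiles are illustrative choices, not blessed defaults.

```rust
/// Hypothetical bundle of the performance-sensitive switches listed above.
/// A benchmark should construct and report one of these explicitly rather
/// than rely on per-entrypoint defaults.
#[derive(Clone, Debug, PartialEq)]
struct ScanTradeoffs {
    merge_extension_records: bool, // more complete output, slower scan
    use_mft_bitmap: bool,          // skip unused records; disable only for debugging
    create_placeholders: bool,     // better path resolution, more CPU
    expand_hard_links: bool,       // more user-visible rows, more work
}

impl ScanTradeoffs {
    /// An illustrative "fastest" profile, spelled out rather than implied.
    fn fastest() -> Self {
        Self {
            merge_extension_records: false,
            use_mft_bitmap: true,
            create_placeholders: false,
            expand_hard_links: false,
        }
    }

    /// An illustrative "most complete" profile for the other end of the range.
    fn most_complete() -> Self {
        Self {
            merge_extension_records: true,
            use_mft_bitmap: true,
            create_placeholders: true,
            expand_hard_links: true,
        }
    }
}

fn main() {
    assert_ne!(ScanTradeoffs::fastest(), ScanTradeoffs::most_complete());
    println!("{:?}", ScanTradeoffs::fastest());
}
```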

Benchmark inventory in the repo

Cross-platform benchmark lane

When live Windows MFT validation is unavailable, the repo still has useful, smaller-scope benchmark coverage in uffs-core:

  • cargo bench -p uffs-core --bench query
  • cargo bench -p uffs-core --bench search_benchmarks

Those benches cover:

  • pattern parsing and glob/regex conversion
  • MftQuery building and execution
  • extension filter performance
  • tree metric/index operations
  • PathResolver vs FastPathResolver
  • sequential vs parallel path-column addition

Windows-specific MFT benchmark lane

The repo also keeps a Windows-oriented benchmark lane for the MFT path:

  • cargo bench -p uffs-mft --bench mft_read

That bench focuses on lower-level components such as:

  • aligned buffer allocation/write cost
  • ParsedColumns allocation/merge cost
  • ParsedColumns -> DataFrame conversion

For more realistic end-to-end Windows measurements, the uffs_mft binary also retains benchmark commands such as benchmark-index and benchmark-index-lean.

Validation canon anchors

The approved validation canon for this baseline remains:

  1. cargo build --release -p uffs-cli --bin uffs
  2. cargo xwin check -p uffs-mft --lib --bin uffs_mft
  3. cargo test -p uffs-mft --bin uffs_mft required_output_path
  4. rust-script scripts/verify_parity.rs /Users/rnio/uffs_data D --regenerate

These are not all “performance benchmarks” in the narrow sense. They are the approved correctness-and-parity anchors that protect the performance baseline from silently regressing via behavior drift.

Current carried status

  • Validation canon alignment is verified.
  • Wave 1C parity artifact resolution is verified.
  • The required_output_path regression check is still considered mandatory, but the current rerun is blocked by external disk pressure on the host (No space left on device, os error 28). This is carried forward as an environment blocker, not a performance or correctness regression.

How to interpret performance claims safely

When discussing UFFS performance, always specify at least:

  • live MFT vs cached index vs cached DataFrame
  • MftIndex vs DataFrame query path
  • single-drive vs multi-drive
  • extension merge / placeholder settings if relevant
  • Windows elevated live run vs offline/non-Windows benchmark lane

Without those qualifiers, two “UFFS is fast” statements may be describing very different parts of the system.