This document records the current performance model carried by the CRUCIBLE audit artifacts. It is intentionally specific about where speed comes from, which code paths are supposed to be fast, and which caveats still affect any performance claim made against the current repo.
UFFS has two distinct performance stories:
| Workflow | Preferred representation | Why |
|---|---|---|
| Simple live search (glob, extension, size, basic filters) | `MftIndex` | Avoids DataFrame construction and uses search-specific structures. |
| Rich tabular workflows, parquet input, aggregations/sorting | `DataFrame` | Pays conversion/materialization cost to unlock the richer query model. |
| Full-drive live indexing | `MftReader` + drive-tuned MFT readers | Speed comes from direct NTFS MFT access, not Win32 file enumeration. |
| Multi-drive search | Bounded cross-drive orchestration | Favors predictable host utilization over unbounded fanout. |
The fastest path in the repo is not “Polars everywhere”; it is “read the MFT efficiently, keep simple search on `MftIndex`, and convert to tabular form only when that extra expressiveness is actually needed.”
UFFS gets its biggest speed advantage by bypassing Windows file enumeration APIs and reading NTFS metadata directly.
That still means:
- volume handle opening and MFT extent discovery in `uffs-mft`
- optional bitmap-driven skipping of unused records
- direct/aligned volume I/O for the MFT pipeline
- no dependency on recursive directory walking for the live path
One important current-state performance detail: Auto mode is effectively an IOCP-sliding-window selector now. Today:

- `DataFrame` reads map `Auto` -> `SlidingIocp`
- lean-index reads map `Auto` -> `SlidingIocpInline`
That is true across Nvme, Ssd, Hdd, and Unknown drive types. Drive type
still matters, but mainly as a tuning input for:
- I/O chunk size (`Nvme` 4 MiB, `Ssd` 2 MiB, `Hdd`/`Unknown` 1 MiB)
- read concurrency (`Nvme` 32, `Ssd` 8, `Hdd` 4 with extent-aware HDD tuning)
- whether certain parsing strategies are beneficial
So the current performance story is not “Auto picks different reader families by device class”; it is “Auto picks the sliding-window IOCP family, then tunes it per device.”
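The per-device tuning described above can be sketched as a pair of lookup functions. This is an illustrative sketch, not the repo's actual API: the enum and function names are hypothetical, and the read-concurrency value for `Unknown` drives is an assumption (the source only documents `Hdd` at 4).

```rust
// Hypothetical sketch of Auto-mode per-device tuning. Names and structure
// are illustrative; the constants mirror the documented values.
#[derive(Clone, Copy, Debug)]
enum DriveType { Nvme, Ssd, Hdd, Unknown }

/// I/O chunk size in bytes, per the documented tuning table.
fn chunk_size(drive: DriveType) -> usize {
    const MIB: usize = 1024 * 1024;
    match drive {
        DriveType::Nvme => 4 * MIB,
        DriveType::Ssd => 2 * MIB,
        DriveType::Hdd | DriveType::Unknown => MIB,
    }
}

/// Sliding-window read concurrency. The `Unknown` value is an assumption;
/// the source only documents Nvme=32, Ssd=8, Hdd=4.
fn read_concurrency(drive: DriveType) -> usize {
    match drive {
        DriveType::Nvme => 32,
        DriveType::Ssd => 8,
        DriveType::Hdd | DriveType::Unknown => 4,
    }
}

fn main() {
    for d in [DriveType::Nvme, DriveType::Ssd, DriveType::Hdd, DriveType::Unknown] {
        println!("{:?}: chunk={} B, concurrency={}", d, chunk_size(d), read_concurrency(d));
    }
}
```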
`uffs-core::index_search` still documents the expected gap between the two query representations for simple searches on large datasets:

- `MftIndex` path: roughly ~100-200 ms for 23M entries
- `DataFrame` path: roughly ~3-5 s, largely because of conversion/materialization overhead
That gap is not accidental. The fast path is built to avoid work:
- direct record iteration instead of `DataFrame` expression setup
- extension-aware candidate reduction through the extension index
- path resolution only when needed
- Rayon-powered filtering/expansion over already-compact in-memory structures
For simple search, any performance comparison that forces the DataFrame path is
measuring a different workload.
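As a rough illustration of why the fast path avoids work, here is a minimal sketch of extension-aware candidate reduction: a prebuilt extension index narrows the candidate set to a bucket lookup before any per-record filtering happens. The types and method names (`ExtensionIndex`, `Record`, `candidates`) are hypothetical, not the repo's actual API.

```rust
use std::collections::HashMap;

// Hypothetical record type; the real index stores compact MFT-derived entries.
struct Record { name: String }

struct ExtensionIndex {
    by_ext: HashMap<String, Vec<usize>>, // extension -> record indices
}

impl ExtensionIndex {
    fn build(records: &[Record]) -> Self {
        let mut by_ext: HashMap<String, Vec<usize>> = HashMap::new();
        for (i, r) in records.iter().enumerate() {
            if let Some(ext) = r.name.rsplit_once('.').map(|(_, e)| e.to_ascii_lowercase()) {
                by_ext.entry(ext).or_default().push(i);
            }
        }
        Self { by_ext }
    }

    /// Candidates for a `*.ext` query: one bucket lookup, no full scan and
    /// no DataFrame expression setup.
    fn candidates(&self, ext: &str) -> &[usize] {
        self.by_ext.get(ext).map(Vec::as_slice).unwrap_or(&[])
    }
}

fn main() {
    let records = vec![
        Record { name: "a.txt".into() },
        Record { name: "b.rs".into() },
        Record { name: "c.txt".into() },
    ];
    let idx = ExtensionIndex::build(&records);
    println!("{:?}", idx.candidates("txt")); // indices of .txt records
}
```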
Path reconstruction remains a major cost center because NTFS stores parent FRS links, not full paths.
The current optimization stack is:
- `FastPathResolver` uses `Vec`-indexed O(1) lookup rather than `HashMap`-based lookup
- file names are interned in a contiguous `NameArena`
- resolver entries are packed to 16 bytes each
- parallel path-column addition uses Rayon when the dataset is large enough
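The core of the optimization stack above can be sketched as follows: FRS numbers index directly into a `Vec` instead of hashing into a `HashMap`, and names live in one contiguous arena addressed by `(offset, len)`. This is a hedged sketch assuming a simplified entry layout; field names and the exact packing are illustrative, not the repo's real resolver. (FRS 5 is used for the root because that is the NTFS root directory's record number.)

```rust
// Contiguous name storage: one allocation, names addressed by (offset, len).
struct NameArena { bytes: Vec<u8> }

impl NameArena {
    fn intern(&mut self, name: &str) -> (u32, u16) {
        let off = self.bytes.len() as u32;
        self.bytes.extend_from_slice(name.as_bytes());
        (off, name.len() as u16)
    }
    fn get(&self, off: u32, len: u16) -> &str {
        std::str::from_utf8(&self.bytes[off as usize..off as usize + len as usize]).unwrap()
    }
}

// Illustrative compact entry: parent FRS link plus an arena name reference.
#[derive(Clone, Copy)]
struct Entry { parent: u64, name_off: u32, name_len: u16 }

struct FastResolver { entries: Vec<Option<Entry>>, arena: NameArena }

impl FastResolver {
    /// Walk parent links to the root; each hop is an O(1) Vec index,
    /// no hashing involved.
    fn resolve(&self, mut frs: u64, root: u64) -> String {
        let mut parts = Vec::new();
        while frs != root {
            let e = self.entries[frs as usize].expect("known record");
            parts.push(self.arena.get(e.name_off, e.name_len));
            frs = e.parent;
        }
        let mut path = String::new();
        for p in parts.iter().rev() {
            path.push('\\');
            path.push_str(p);
        }
        path
    }
}

fn main() {
    let mut arena = NameArena { bytes: Vec::new() };
    let (uo, ul) = arena.intern("Users");
    let (ao, al) = arena.intern("a.txt");
    let mut entries = vec![None; 8];
    entries[6] = Some(Entry { parent: 5, name_off: uo, name_len: ul });
    entries[7] = Some(Entry { parent: 6, name_off: ao, name_len: al });
    let r = FastResolver { entries, arena };
    println!("{}", r.resolve(7, 5)); // \Users\a.txt
}
```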
The code documents the expected win as:
- `FastPathResolver` is roughly 3-5x faster than the legacy resolver
- memory use is roughly ~50% lower due to `NameArena`
`ParsedColumns` is the current struct-of-arrays staging format for the MFT parse pipeline.

That matters because it removes the old array-of-structs -> struct-of-arrays transpose during DataFrame build. The code documents this as reducing `df_build` time by about ~20%.
This optimization only helps the tabular path, but it is still important because that path is used for parquet/export/analytics-style workflows.
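To make the layout difference concrete, here is a minimal sketch of the two shapes. The field set is hypothetical and much smaller than the real struct; the point is that a struct-of-arrays stage hands each column `Vec` to the DataFrame builder directly, so no transpose pass is needed.

```rust
// Array-of-structs: one struct per parsed record. Building a DataFrame from
// this shape requires a transpose pass into per-column buffers.
struct ParsedRecord { frs: u64, size: u64, is_dir: bool }

// Struct-of-arrays: one Vec per column, already in the shape a columnar
// DataFrame builder consumes. This sketches the ParsedColumns idea; the
// real struct has many more columns.
#[derive(Default)]
struct ParsedColumns { frs: Vec<u64>, size: Vec<u64>, is_dir: Vec<bool> }

impl ParsedColumns {
    fn push(&mut self, r: &ParsedRecord) {
        self.frs.push(r.frs);
        self.size.push(r.size);
        self.is_dir.push(r.is_dir);
    }
}

fn main() {
    let mut cols = ParsedColumns::default();
    for r in [
        ParsedRecord { frs: 5, size: 0, is_dir: true },
        ParsedRecord { frs: 42, size: 1024, is_dir: false },
    ] {
        cols.push(&r);
    }
    println!("{} records staged column-major", cols.frs.len());
}
```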
Even with the current optimizations, a live full-drive read still spends real time in:
- volume setup and metadata collection
- reading the MFT extents themselves
- fixup/parsing work on each record
- optional extension-record merging
- optional placeholder parent synthesis
- `DataFrame` construction when the tabular path is requested
For search, the main remaining variable costs are:
- path resolution when output needs full paths
- row expansion for hard links and streams
- `DataFrame` materialization when the query mode requires it
- drive fanout and output serialization for multi-drive streaming search
The best-case live-query performance is often “use a fresh cached index and do a small or no-op USN refresh.”
The worst case is still “cache miss or stale cache -> full rebuild.”
In between, current behavior intentionally trades peak freshness for safety:
- journal unavailable/read failure -> use cached index as-is
- journal wrapped/journal ID changed -> rebuild
- valid delta -> apply USN changes and recompute tree metrics
That is often a major latency difference, so performance discussions should always state whether the measured path was a cold full scan, a warm cache hit, or a USN refresh.
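The safety-over-freshness policy above can be sketched as a small decision function. The enum and names here are hypothetical; the point is that only one journal state forces a full rebuild, one falls back to the cache as-is, and everything else is an incremental delta.

```rust
// Hypothetical sketch of the documented cache-refresh policy.
struct UsnChange; // placeholder for a USN journal record

enum JournalState {
    Unavailable,           // journal missing or read failed
    Wrapped,               // journal wrapped or journal ID changed
    Delta(Vec<UsnChange>), // valid delta since the cached checkpoint
}

#[derive(Debug, PartialEq)]
enum Plan { UseCacheAsIs, FullRebuild, ApplyDelta(usize) }

fn plan_refresh(state: JournalState) -> Plan {
    match state {
        // Safety over freshness: a read failure does not discard the cache.
        JournalState::Unavailable => Plan::UseCacheAsIs,
        // A wrapped journal means the delta is unrecoverable: rebuild.
        JournalState::Wrapped => Plan::FullRebuild,
        // Normal case: apply USN changes, then recompute tree metrics.
        JournalState::Delta(changes) => Plan::ApplyDelta(changes.len()),
    }
}

fn main() {
    match plan_refresh(JournalState::Delta(vec![UsnChange])) {
        Plan::ApplyDelta(n) => println!("apply {} USN change(s)", n),
        other => println!("{:?}", other),
    }
}
```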
The current codebase still exposes a few switches that materially change runtime costs:
- extension-record merging: more complete, slower
- bitmap usage: usually faster, but can be disabled for debugging/experiments
- placeholder creation: code comments document roughly ~15% CPU savings when disabled, at the cost of weaker path-resolution behavior on some records
- hard-link expansion: more user-visible rows, more work
One caveat worth stating explicitly: the public “fast vs --full” story still exists, but internal defaults around extension merging are not fully harmonized across every entrypoint. Benchmark with explicit flags/settings when making claims instead of assuming one universal default.
When live Windows MFT validation is unavailable, the repo still has useful, smaller-scope benchmark coverage in `uffs-core`:

- `cargo bench -p uffs-core --bench query`
- `cargo bench -p uffs-core --bench search_benchmarks`
Those benches cover:
- pattern parsing and glob/regex conversion
- `MftQuery` building and execution
- extension filter performance
- tree metric/index operations
- `PathResolver` vs `FastPathResolver`
- sequential vs parallel path-column addition
The repo also keeps a Windows-oriented benchmark lane for the MFT path:
`cargo bench -p uffs-mft --bench mft_read`
That bench focuses on lower-level components such as:
- aligned buffer allocation/write cost
- `ParsedColumns` allocation/merge cost
- `ParsedColumns` -> `DataFrame` conversion
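The aligned-buffer cost that this bench measures exists because unbuffered volume I/O on Windows requires sector-aligned buffers, which a plain `Vec<u8>` does not guarantee. The sketch below shows one way to obtain such a buffer via `std::alloc::Layout`; it is an illustration of the alignment requirement under assumed names, not the repo's actual buffer type, and it omits the Windows I/O calls themselves.

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

// Hypothetical sector-aligned buffer: allocation alignment is chosen
// explicitly (e.g. 4096 for a 4K sector size) rather than left to Vec.
struct AlignedBuf { ptr: *mut u8, layout: Layout }

impl AlignedBuf {
    fn new(len: usize, align: usize) -> Self {
        let layout = Layout::from_size_align(len, align).expect("valid layout");
        // Zeroed allocation so the slice below is initialized memory.
        let ptr = unsafe { alloc_zeroed(layout) };
        assert!(!ptr.is_null(), "allocation failed");
        Self { ptr, layout }
    }

    fn as_mut_slice(&mut self) -> &mut [u8] {
        unsafe { std::slice::from_raw_parts_mut(self.ptr, self.layout.size()) }
    }
}

impl Drop for AlignedBuf {
    fn drop(&mut self) {
        unsafe { dealloc(self.ptr, self.layout) }
    }
}

fn main() {
    let mut buf = AlignedBuf::new(64 * 1024, 4096);
    let aligned = buf.ptr as usize % 4096 == 0;
    let slice = buf.as_mut_slice();
    slice[0] = 0xEB; // writable like any byte slice
    println!("len={}, 4K-aligned={}", slice.len(), aligned);
}
```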
For more realistic end-to-end Windows measurements, the `uffs_mft` binary also retains benchmark commands such as `benchmark-index` and `benchmark-index-lean`.
The approved validation canon for this baseline remains:
- `cargo build --release -p uffs-cli --bin uffs`
- `cargo xwin check -p uffs-mft --lib --bin uffs_mft`
- `cargo test -p uffs-mft --bin uffs_mft required_output_path`
- `rust-script scripts/verify_parity.rs /Users/rnio/uffs_data D --regenerate`
These are not all “performance benchmarks” in the narrow sense. They are the approved correctness-and-parity anchors that protect the performance baseline from silently regressing via behavior drift.
- Validation canon alignment is verified.
- Wave 1C parity artifact resolution is verified.
- The `required_output_path` regression check is still considered mandatory, but the current rerun is blocked by external disk pressure on the host (`No space left on device`, os error 28). This is carried forward as an environment blocker, not a performance or correctness regression.
When discussing UFFS performance, always specify at least:
- live MFT vs cached index vs cached `DataFrame`
- `MftIndex` vs `DataFrame` query path
- single-drive vs multi-drive
- extension merge / placeholder settings if relevant
- Windows elevated live run vs offline/non-Windows benchmark lane
Without those qualifiers, two “UFFS is fast” statements may be describing very different parts of the system.