Authoritative for: optimization history. Per-layer performance analysis, before/after numbers, experiments (including failures), pool development phases (L0-L3.5), competitive analysis vs MinIO and RustFS, allocation audit. If you need to know why a design decision was made or what alternatives were tried, this is the only doc.
Not covered here: how writes work now (see write-path.md), on-disk format (see storage-layout.md), benchmark results (see benchmarks.md).
- Layer architecture
- Benchmark environment
- L1: HTTP transport
- L2: S3 protocol
- L3: Storage pipeline
- L4: Hashing + RS encode
- L5: Disk I/O
- L6: S3 + real storage
- L7: Full client path
- Layer-to-layer gaps
- Optimization history
- Allocation audit
- Log-structured storage
- Competitive analysis
- Future optimization candidates
PUT path (outside in, following the request):

    Client request
      |
    L1  HTTP transport (hyper)       accept TCP, read/write body
    L2  S3 protocol (s3s)            parse HTTP, SigV4 verify, dispatch to AbixioS3
    L3  Storage pipeline             VolumePool: placement, RS encode, write shards
    L4  Hashing + RS encode          blake3, MD5, reed-solomon-erasure galois_8
    L5  Disk I/O                     tokio::fs::write / tokio::fs::read
    L6  S3 + real storage            s3s + VolumePool combined (integration)
    L7  Full client (aws-sdk-s3)     SigV4 signing, SDK HTTP, UNSIGNED-PAYLOAD

L1-L2 are protocol layers. L3 is the storage pipeline. L4-L5 are primitives measured in isolation. L6 combines L1-L5 into an integrated in-process test. L7 adds a real S3 client with auth.
- Windows 10 Home, single machine, all volumes on same NTFS drive (tmpdir)
- Single-node, page-cache writes (no fsync), sequential requests
- 10 iterations per measurement (50 for 4KB, 5 for 1GB, 30 for focused tests)
- bench_perf in abixio-ui/src/bench/ with JSON output and A/B comparison
| Operation | Throughput |
|---|---|
| blake3 | 4303 MB/s |
| MD5 | 703 MB/s |
blake3 uses SIMD (AVX2/SSE4.1); 4.3 GB/s is near the hardware ceiling. MD5 at 703 MB/s is near the theoretical maximum for single-threaded MD5. Both are computed inline during the streaming encode path, so their cost overlaps with I/O for large objects (10MB+).
MD5 is required by S3 (ETag = MD5 of body) when the client sends
Content-MD5. When absent, we skip MD5 and use xxhash64 for the ETag.
Before: MD5 computed inline on every PUT regardless of client headers. 703 MB/s throughput = ~30% of L3 PUT time for 1GB objects.
After: when input.content_md5 is None (the common case; aws-sdk-s3
with disable_payload_signing(), mc, rclone all skip it), set
PutOptions.skip_md5 = true. The encode path uses xxhash64 (~10+ GB/s)
instead of MD5 for the ETag. Format: {xxh64:016x}{size:016x} (32 hex
chars, same length as MD5 ETags).
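The ETag format and the skip decision described above can be sketched as follows. This is a minimal illustration, not the project's code: the helper names format_fast_etag and should_skip_md5 are hypothetical, and the 64-bit hash is assumed to be computed elsewhere (xxhash64 in the text).

```rust
/// Hypothetical helper: builds the non-MD5 ETag from an already-computed
/// 64-bit hash and the object size. Computing the hash itself (xxhash64 in
/// the text) is out of scope; this only covers the formatting step.
fn format_fast_etag(xxh64: u64, size: u64) -> String {
    // Two zero-padded, 16-digit lowercase hex fields = 32 hex chars total,
    // the same length as a hex-encoded MD5 digest.
    format!("{:016x}{:016x}", xxh64, size)
}

/// MD5 is skipped only when the client did not send Content-MD5.
fn should_skip_md5(content_md5: Option<&str>) -> bool {
    content_md5.is_none()
}
```

For example, format_fast_etag(0x1234, 10) yields "0000000000001234000000000000000a".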
Result: L7 1GB PUT improved from 380 to 695 MB/s (+83%). Matrix PUT improved from 320 to 441 MB/s (+38%), now fastest of all three servers.
Files changed:
- src/storage/metadata.rs: added skip_md5: bool to PutOptions
- src/storage/erasure_encode.rs: conditional MD5/xxhash in streaming path
- src/storage/volume_pool.rs: conditional MD5/xxhash in small-object path
- src/s3_service.rs: set skip_md5 based on content_md5 header
Investigated 2026-04-09 for cases where MD5 IS required:
- md-5 crate asm feature: ships x86_64 assembly via md5-asm crate. Fails on MSVC. The .S files use GAS syntax which cl.exe cannot compile. Would work on Linux/macOS or with x86_64-pc-windows-gnu target.
- Go's crypto/md5 (what MinIO uses): has hand-written amd64 assembly that runs automatically.
- RustCrypto/asm-hashes archived (Aug 2025). Official recommendation: port to Rust global_asm! inline assembly in RustCrypto/hashes. SHA-1 already has compress/x86.rs; MD5 does not yet have x86_64 inline asm.
- Future: write x86_64 MD5 compress using global_asm! (works on all targets including MSVC). On Linux, enable md-5 = { features = ["asm"] }.
| Operation | Throughput |
|---|---|
| RS encode 3+1 | 2762 MB/s |
reed-solomon-erasure with simd-accel feature. 2.8 GB/s is within expected
range for galois_8 on x86. 6x faster than the L3 storage pipeline (439 MB/s).
Could use ISA-L bindings for ~2x more, but pointless when L3 is the bottleneck.
No change. RS encode is not limiting.
| Operation | 10MB | 1GB |
|---|---|---|
| write (page cache) | 1625 MB/s | 1056 MB/s |
| read (cached) | 2703 MB/s | 2902 MB/s |
These are the ceiling numbers for L3. Page-cache writes avoid disk sync. Real durability requires fsync, which would cut throughput to physical disk speed (SATA SSD ~500 MB/s, NVMe ~2-3 GB/s).
4KB writes are slow (13 MB/s) due to filesystem metadata overhead per operation.
No change. This is the theoretical ceiling, not the current bottleneck. Future: configurable fsync for durability, io_uring on Linux for zero-copy I/O.
| Operation | 1 disk 10MB | 1 disk 1GB | 4 disk 10MB | 4 disk 1GB |
|---|---|---|---|---|
| put_stream | 439 MB/s | 489 MB/s | 371 MB/s | 409 MB/s |
| get (buffered) | 1175 MB/s | 1244 MB/s | 774 MB/s | 919 MB/s |
| get_stream | 1287 MB/s | 1439 MB/s | 842 MB/s | 983 MB/s |
The streaming PUT path (encode_and_write in src/storage/erasure_encode.rs)
reads body chunks, accumulates to 1MB blocks, RS-encodes each block, and writes
encoded shard data to each shard writer sequentially.
The sequential write loop in write_block():

    for (i, shard_data) in shards.iter().enumerate() {
        shard_hashers[i].update(shard_data);
        writers[i].write_chunk(shard_data).await?;
    }

L3 PUT at 510 MB/s (with skip_md5) vs L5 disk write at 1625 MB/s = 3.2x overhead from inline hashing (blake3 + xxhash64), RS encode, metadata writes, and the sequential shard write loop. Without skip_md5 (MD5 required): 439 MB/s, 3.7x overhead.
Three approaches were tested to parallelize shard writes:
- FuturesUnordered (earlier work): regressed 10MB from 147->74 MB/s. Per-iteration spawn overhead exceeded sequential cost.
- Channel-based shard tasks (this session): spawned one tokio task per shard writer at encode start, sent blocks via mpsc channels. Regressed 4-disk 10MB by -21%. Channel + task scheduling overhead outweighed any parallelism benefit.
- join_all on concurrent writes: not possible due to Rust's borrow rules (write_chunk takes &mut self, can't hold multiple &mut into a slice).
Root cause: all tmpdir shard files are on the same physical NTFS volume. There is no actual I/O parallelism to exploit. The OS page cache serializes writes to the same device. Parallel shard writes would only help when shards map to different physical disks, which the single-machine benchmark doesn't test.
Decision: sequential writes are correct for same-disk configurations.
The comment in write_block() documents this finding so future work doesn't
repeat the experiment without separate physical disks.
Before: read_and_decode() read all shard files in parallel via
join_all (good), but collected the full decoded object into a Vec<u8>
before returning. For 1GB objects, this allocated 1GB in memory before the
first response byte could be sent.
After: read_and_decode_stream() opens shard files, reads them in 1MB
blocks matching the encode path's STREAM_BLOCK_SIZE, RS-decodes each block,
and yields decoded chunks via a futures::channel::mpsc channel (4-slot
buffer). A spawned tokio task handles the block-by-block decode loop.
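The producer/consumer shape described above can be sketched with the standard library alone. This is an illustration under stated assumptions: a std thread and std::sync::mpsc::sync_channel stand in for the tokio task and futures::channel::mpsc, the function name stream_blocks is hypothetical, and the "decode" step is a plain copy rather than RS decoding.

```rust
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread;

const STREAM_BLOCK_SIZE: usize = 1024 * 1024; // 1MB, as in the text

/// Spawns a producer that emits `data` in 1MB blocks through a 4-slot
/// bounded channel. The real path RS-decodes each block; here the "decode"
/// is a plain copy so the sketch stays std-only.
fn stream_blocks(data: Vec<u8>) -> Receiver<Vec<u8>> {
    let (tx, rx) = sync_channel::<Vec<u8>>(4);
    thread::spawn(move || {
        for block in data.chunks(STREAM_BLOCK_SIZE) {
            // send() blocks once 4 blocks are queued: this is the backpressure
            // that keeps memory bounded regardless of object size.
            if tx.send(block.to_vec()).is_err() {
                break; // receiver dropped, stop producing
            }
        }
    });
    rx
}
```

The consumer can start forwarding the first block as soon as it arrives, while at most four decoded blocks are ever queued.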
Key implementation details:
- Backend::read_shard_stream() returns (AsyncRead, ObjectMeta) instead of (Vec<u8>, ObjectMeta). LocalVolume opens the file; RemoteVolume falls back to buffered read.
- Block boundaries are derived from total_size, data_n, and STREAM_BLOCK_SIZE (1MB). Each shard file stores blocks contiguously, so per-block shard chunk size = ceil(block_size / data_n).
- The last block may be smaller than 1MB. split_data padding is handled by truncating the reassembled block to the actual remaining bytes.
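The block-boundary arithmetic above can be sketched as a small function. The name block_layout and its return shape are assumptions made for illustration; only total_size, data_n, and STREAM_BLOCK_SIZE come from the text.

```rust
const STREAM_BLOCK_SIZE: u64 = 1024 * 1024; // 1MB, as in the text

/// For each block, returns (decoded bytes in the block, bytes per shard chunk).
/// Hypothetical helper; the real decoder derives the same values inline.
fn block_layout(total_size: u64, data_n: u64) -> Vec<(u64, u64)> {
    assert!(data_n > 0, "need at least one data shard");
    let mut layout = Vec::new();
    let mut remaining = total_size;
    while remaining > 0 {
        // Every block is 1MB except possibly the last one.
        let block_size = remaining.min(STREAM_BLOCK_SIZE);
        // ceil(block_size / data_n): the padded length of each shard chunk.
        let shard_chunk = (block_size + data_n - 1) / data_n;
        layout.push((block_size, shard_chunk));
        remaining -= block_size;
    }
    layout
}
```

With 3 data shards, a full 1MB block uses 349526-byte shard chunks, so the reassembled data is 1048578 bytes and gets truncated back to 1048576.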
Result: +9-16% throughput at L3, but the bigger win is at L6 (see below).
Files changed:
- src/storage/erasure_decode.rs: read_and_decode_stream()
- src/storage/mod.rs: Backend::read_shard_stream(), Store::get_object_stream()
- src/storage/local_volume.rs: native file-backed read_shard_stream()
- src/storage/volume_pool.rs: VolumePool::get_object_stream()
Problem: streaming decode still allocated 1MB Bytes per chunk through
an mpsc channel (1024 chunks for 1GB). RustFS uses tokio::io::duplex + mmap.
After: two-case approach via memmap2:
- 1+0 (no EC): mmap the shard file; Bytes::from_owner(mmap) gives a ref-counted handle to the entire file. Yield 4MB .slice() chunks. Zero-copy, zero allocation, no RS decode, no spawned task.
- EC (N+M): mmap all shard files. Spawned task slices directly into mmap regions (no read syscall), RS-decodes 4MB blocks (4x fewer iterations than 1MB), yields decoded data through mpsc channel.
Both paths use Backend::mmap_shard() which returns MmapOrVec. Local
volumes return a real mmap, remote volumes fall back to buffered read.
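The idea behind the 1+0 path, one shared owner plus cheap ref-counted range handles, can be sketched without external crates. Everything here is a stand-in: Arc<Vec<u8>> plays the role of the mmap, and the hypothetical Chunk type plays the role that Bytes::slice has in the text.

```rust
use std::ops::Range;
use std::sync::Arc;

const CHUNK: usize = 4 * 1024 * 1024; // 4MB slices, as in the text

/// Stand-in for a ref-counted slice handle: shares the owner and stores
/// only a byte range, so creating one copies no data.
struct Chunk {
    owner: Arc<Vec<u8>>,
    range: Range<usize>,
}

impl Chunk {
    fn as_slice(&self) -> &[u8] {
        &self.owner[self.range.clone()]
    }
}

/// Splits one shared buffer into 4MB handles without copying the payload.
fn chunked(owner: Arc<Vec<u8>>) -> Vec<Chunk> {
    (0..owner.len())
        .step_by(CHUNK)
        .map(|start| Chunk {
            owner: Arc::clone(&owner), // bumps a counter, no data copy
            range: start..(start + CHUNK).min(owner.len()),
        })
        .collect()
}
```

A 10MB buffer produces three handles (4MB, 4MB, 2MB); the underlying memory is released only when the last handle is dropped.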
Result:
- L3 GET (1 disk, mmap): 19,098 MB/s at 10MB (page cache), effectively instant
- L6 GET (1 disk): 809 MB/s at 10MB, 1048 MB/s at 1GB (was 833)
- curl GET (no auth): 1220 MB/s at 1GB, faster than MinIO through mc
- EC GET (4 disk): 1048 MB/s (10MB), 1236 MB/s (1GB), zero-alloc fast path slices from mmap when all shards healthy. +35% over pre-mmap baseline
Files changed:
- Cargo.toml: added memmap2
- src/storage/mod.rs: MmapOrVec type, Backend::mmap_shard()
- src/storage/local_volume.rs: mmap-backed mmap_shard()
- src/storage/erasure_decode.rs: mmap-based read_and_decode_stream(), 4MB blocks
- src/storage/volume_pool.rs: 1+0 mmap fast path in get_object_stream()
| Operation | Throughput |
|---|---|
| HTTP PUT (reqwest -> hyper) | 762 MB/s |
hyper 1.x with http-body-util on localhost. 762 MB/s is reasonable, limited
by memcpy + TCP stack overhead. The benchmark uses data.clone() on each PUT
which adds a copy; the real S3 path streams the body without copying.
No change. HTTP transport is not limiting (762 MB/s > L6's 272 MB/s). L1 is the outermost layer a PUT request hits.
| Operation | 10MB | 1GB |
|---|---|---|
| S3 PUT | 272 MB/s | 310 MB/s |
| S3 GET (1 disk, mmap) | 809 MB/s | 1048 MB/s |
| curl GET (1 disk, no auth) | n/a | 1220 MB/s |
L6 runs in-process with no_auth: true and raw reqwest (no SigV4).
Before: every PUT called get_versioning_config() which read bucket
settings from disk via Backend::read_bucket_settings(). This added a file
read to every single PUT request, even when versioning was never configured.
After: AbixioS3 caches versioning config in a RwLock<HashMap>. First
access populates the cache from disk. put_bucket_versioning invalidates the
cache entry. get_bucket_versioning also invalidates before re-reading to
ensure fresh data.
Result: L6 PUT improved from 214 MB/s to 272 MB/s (+27%).
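The caching behaviour described above can be sketched as a read-mostly map with explicit invalidation. This is a simplified illustration, not the code in src/s3_service.rs: the VersioningCache type and its method names are hypothetical, Option<String> stands in for the real config type, and the disk read is passed in as a closure.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Hypothetical cache keyed by bucket name. `None` as a value means
/// "versioning was never configured", which is also worth caching.
struct VersioningCache {
    entries: RwLock<HashMap<String, Option<String>>>,
}

impl VersioningCache {
    fn new() -> Self {
        Self { entries: RwLock::new(HashMap::new()) }
    }

    /// Returns the cached value, calling `load` (the disk read) only on a miss.
    fn get_or_load(
        &self,
        bucket: &str,
        load: impl FnOnce() -> Option<String>,
    ) -> Option<String> {
        // Fast path: shared read lock, released at the end of this statement.
        let cached = self.entries.read().unwrap().get(bucket).cloned();
        if let Some(hit) = cached {
            return hit;
        }
        let value = load();
        self.entries
            .write()
            .unwrap()
            .insert(bucket.to_string(), value.clone());
        value
    }

    /// Called when the config changes so the next read goes back to disk.
    fn invalidate(&self, bucket: &str) {
        self.entries.write().unwrap().remove(bucket);
    }
}
```

Two concurrent misses may both run the loader in this sketch; that is harmless here because both would read the same on-disk value.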
Remaining PUT overhead (272 vs L3's 439 = 38% gap) is s3s request parsing, XML serialization, and service dispatch. This is inherent to the protocol layer and not easily reducible without replacing s3s.
Files changed:
- src/s3_service.rs: cached_versioning_config(), invalidate_versioning_cache()
Before: get_object() called Store::get_object() which returned
(Vec<u8>, ObjectInfo), the entire object buffered in memory. Then it
wrapped the data as a single-chunk StreamingBlob::wrap(once(...)). For
1GB objects, this meant 1GB allocated before the first response byte was sent.
After: get_object() calls Store::get_object_stream() which returns
(ObjectInfo, Stream<Item=Result<Bytes>>). The stream feeds directly into
StreamingBlob::wrap(). The first response byte is sent after decoding the
first 1MB block, not after decoding the entire object.
Key implementation detail: s3s StreamingBlob::wrap() requires Send + Sync.
The boxed stream from get_object_stream() is Send but not Sync. A
SyncStream newtype wrapper adds Sync via unsafe impl. This is safe
because streams are only polled from one task at a time.
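The newtype pattern can be sketched as below. Several things are assumptions for the sake of a dependency-free example: std has no Stream trait, so Iterator stands in for it (next(&mut self) mirrors poll_next(Pin<&mut Self>)), and the BoxedChunks alias and require_send_sync helper are hypothetical. The soundness argument shown is the usual one for this pattern: the inner value is only reachable through exclusive access.

```rust
/// Stand-in for the boxed stream: Send but not Sync.
type BoxedChunks = Box<dyn Iterator<Item = Vec<u8>> + Send>;

/// Newtype that adds Sync. The inner value is private to this wrapper and
/// never exposed through &self, so a shared reference cannot reach it.
struct SyncStream(BoxedChunks);

// SAFETY: every method that touches the inner iterator takes &mut self,
// which already guarantees exclusive access. Sharing &SyncStream between
// threads therefore gives no way to use the inner value concurrently.
unsafe impl Sync for SyncStream {}

impl Iterator for SyncStream {
    type Item = Vec<u8>;
    fn next(&mut self) -> Option<Self::Item> {
        self.0.next()
    }
}

/// Compile-time check: only accepts types that are both Send and Sync.
fn require_send_sync<T: Send + Sync>(value: T) -> T {
    value
}
```

Passing the bare BoxedChunks to require_send_sync would fail to compile; the wrapped SyncStream is accepted.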
Range requests and versioned GETs still use the buffered path because they need random access into the response body. This is acceptable because range requests typically target small byte ranges, and versioned objects are the minority of GET traffic.
Result: L6 GET improved from 365 to 750 MB/s at 10MB (+105%), and from 452 to 833 MB/s at 1GB (+84%).
Files changed:
- src/s3_service.rs: streaming GET path, SyncStream, get_object_buffered()
Added 2026-04-09 to close the gap between L6 (in-process, no auth) and external matrix benchmarks.
| Operation | L6 (no auth, reqwest) | L6A (auth, reqwest) | L7 (auth, aws-sdk-s3) |
|---|---|---|---|
| PUT (before skip-md5) | 344 MB/s | 337 MB/s | 380 MB/s |
| PUT (after skip-md5) | n/a | n/a | 695 MB/s |
| GET | 535 MB/s | n/a | 696 MB/s |
L7 runs in-process with auth enabled and the full aws-sdk-s3 client, the
same path real users take. aws-sdk-s3 does not send Content-MD5, so
skip_md5 is active and xxhash64 is used for the ETag.
The matrix benchmark previously showed 52 MB/s because it was running an
unoptimized debug binary (target/debug/abixio.exe). After fixing the
binary path to use release builds, matrix results match L7.
Auth overhead is negligible (L6 vs L6A: 344 vs 337 MB/s, within noise). aws-sdk-s3 client overhead is negligible (L6 vs L7: 344 vs 380 MB/s pre-skip-md5).
Note: L6A is an auth variant of L6, not a separate layer.
| Layer | 1GB GET MB/s | What it measures |
|---|---|---|
| L3 get_stream | instant (mmap) | Storage: zero-copy mmap, 4MB Bytes::slice |
| L1 http_get | 724 | Raw hyper HTTP body (no S3, no storage) |
| L6 s3s_get | 714 | s3s + storage, no auth, reqwest |
| L7 sdk_get | 920 | s3s + auth + aws-sdk-s3 (in-process) |
| Matrix | 604 | Separate abixio.exe process |
| MinIO (matrix) | 791 | Separate minio.exe process |
Key findings:
- s3s adds near-zero overhead to GET (L1 724 vs L6 714, within noise). The s3s response path is efficient. Prior attribution of GET bottleneck to s3s was incorrect.
- L1 (raw hyper) is the HTTP ceiling at 724 MB/s for writing 1GB over 127.0.0.1 TCP on Windows. This is the hyper/Windows loopback limit.
- The real GET gap is L7 (920) -> Matrix (604) = -34%. This is the cost of running as a separate process on Windows. MinIO loses less here (791 vs 920-equivalent) because Go's net/http is more efficient at TCP response writes on Windows than hyper.
- AbixIO's GET code is not the bottleneck. The mmap path is zero-copy, chunked yield is working, s3s adds nothing. The remaining gap vs MinIO is hyper's TCP write performance on Windows vs Go's net/http.
| Layer | 1GB PUT MB/s | What it measures |
|---|---|---|
| L3 put_stream | 510 (skip-md5) | Storage: encode + write shards |
| L6 s3s_put | 344 (with MD5) | s3s + storage, no auth |
| L7 sdk_put | 695 (skip-md5) | s3s + auth + aws-sdk-s3 (in-process) |
| Matrix | 498 (skip-md5, TCP_NODELAY) | Separate abixio.exe process |
| MinIO (matrix) | 380 | Separate minio.exe process |
| Gap | Current | Ratio | Notes |
|---|---|---|---|
| L7 PUT -> Matrix PUT | 695 vs 498 MB/s | 1.4x | process boundary (improved from 1.5x with TCP_NODELAY) |
| L7 GET -> Matrix GET | 920 vs 686 MB/s | 1.3x | process boundary (improved from 1.5x with TCP_NODELAY) |
| Matrix GET: AbixIO vs MinIO | 686 vs 757 MB/s | 0.91x | 9% gap, down from 24% before TCP_NODELAY |
| 10MB GET: AbixIO vs MinIO | 324 vs 748 MB/s | 0.43x | 2.3x gap from per-response overhead in hyper vs Go |
| L3 PUT vs L5 write | 510 vs 1625 MB/s | 3.2x | blake3 + xxhash + RS + metadata |
| EC GET vs 1+0 GET | 1236 vs page cache | n/a | RS decode (inherent cost of EC) |
Root cause: Go enables TCP_NODELAY by default on all TCP connections.
AbixIO did not. Nagle's algorithm was batching small writes, adding latency.
Fix: stream.set_nodelay(true)? in src/main.rs after listener.accept().
Result: Matrix PUT +10% (453->498), Matrix GET +14% (604->686). Closed the 1GB GET gap from 24% to 9% vs MinIO.
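The fix amounts to one call per accepted connection. The sketch below uses std::net to stay dependency-free, which is an assumption: the text applies the call on the async listener in src/main.rs, and the function name accept_nodelay is hypothetical.

```rust
use std::io;
use std::net::{TcpListener, TcpStream};

/// Accepts one connection and disables Nagle's algorithm on it, so small
/// writes are sent immediately instead of being batched by the kernel.
fn accept_nodelay(listener: &TcpListener) -> io::Result<TcpStream> {
    let (stream, _peer) = listener.accept()?;
    stream.set_nodelay(true)?;
    Ok(stream)
}
```

The setting is per socket, so it has to be applied to every accepted stream rather than once on the listener.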
Remaining 10MB GET gap (2.3x vs MinIO): confirmed as hyper's HTTP
write ceiling, not AbixIO code. Evidence: L1 raw hyper GET (no S3, no
storage, just Full::new(bytes)) = 510 MB/s with TCP_NODELAY. MinIO
achieves 748 MB/s through a full S3 stack in a separate process.
Tested hyper config variants to find where time goes in HTTP GET responses. All measurements in-process, 10 iterations, TCP_NODELAY on.
| Variant | 10MB GET MB/s | 1GB GET MB/s | Config |
|---|---|---|---|
| raw_tcp | 327 | 395 | Raw tokio write_all, no HTTP framing |
| L1 (baseline) | 510 | 724 | hyper default, Full::new(bytes) |
| writev | 429 | 605 | hyper writev(true) |
| bigbuf | 364 | 800 | hyper max_buf_size(4MB) |
| stream | 462 | 849 | hyper StreamBody (4MB frame chunks) |
| all_opts | 726 | 844 | writev + bigbuf + stream combined |
| MinIO (matrix) | 648-685 | 796 | Go net/http reference |
Key findings:
- Raw TCP is SLOWER than hyper (327 vs 510 MB/s at 10MB). Naive write_all on a raw socket is worse than hyper's optimized write path. hyper is NOT fundamentally slower; its default config just isn't optimized for large bodies.
- all_opts hits 726 MB/s at 10MB, about 6% above MinIO's best matrix result of 685. The combination of writev(true) + max_buf_size(4MB) + chunked StreamBody closes the gap at the L1 level.
- For 1GB, tuned hyper matches MinIO (844 vs 796 MB/s).
- The prior claim that "hyper's ceiling is 32% below MinIO" was wrong. hyper with correct config is competitive. The gap was configuration, not architecture.
Applied: writev(true) + max_buf_size(4MB) in src/main.rs.
The remaining matrix gap (674 vs 796 for 1GB) is the s3s/ChunkedBytes
overhead between L1X and the full stack.
| # | Change | Layer | Result | Status |
|---|---|---|---|---|
| 1 | Streaming GET | L3+L6 | GET +105% (10MB), +84% (1GB) | done |
| 2 | Parallel shard writes | L3 | Regressed -21%. Sequential faster on same disk | reverted |
| 3 | Versioning config cache | L6 | PUT +27% (10MB) | done |
| 4 | mmap GET (1+0 fast path) | L3+L6 | L6 GET 1GB: 833->1048 MB/s (+26%). curl: 1220 MB/s | done |
| 5 | mmap GET (EC, 4MB blocks) | L3 | 4-disk GET 675-803 MB/s (regressed from mmap->Vec copy) | superseded by #6 |
| 6 | Zero-alloc EC GET fast path | L3 | EC GET +35% (919->1236 MB/s at 1GB). Slices from mmap, no Vec alloc | done |
| 7 | Debug binary fix | n/a | Matrix PUT 52->320 MB/s (was benchmarking debug build) | done |
| 8 | Pipeline experiments | L3 | Double-buffer -96%, channel -96%, 4MB chunks -11%. All worse | reverted |
| 9 | Skip MD5 (xxhash64 ETag) | L4+L3+L7 | L7 PUT +83% (380->695 MB/s). Matrix PUT +38% (320->441). Fastest PUT at all sizes | done |
| 10 | Chunked mmap GET (4MB slices) | L3+L7 | Matrix GET +25% (484->604 MB/s). Zero-copy Bytes::slice | done |
| 11 | TCP_NODELAY on accept | L1+ | Matrix PUT +10% (453->498), GET +14% (604->686). Go sets this by default | done |
| 12 | ChunkedBytes GET (hybrid) | L6+L7 | 10MB GET +36% (324->440 MB/s). Collect <=64MB, stream >64MB. Eliminates SyncStream+StreamWrapper chain | done |
| 13 | hyper writev + max_buf_size | L1+ | 1GB GET +15% (584->674). L1X shows hyper can match MinIO with correct config | done |
| Operation | Before | After | Change |
|---|---|---|---|
| PUT 1GB | 380 MB/s | 695 MB/s | +83% (skip MD5) |
| GET 1GB | 752 MB/s | 880 MB/s | +17% (chunked mmap) |

| Operation | Start | Current | Change |
|---|---|---|---|
| PUT 4KB | 442 obj/s | 1923 obj/s | +335% |
| PUT 10MB | 50 MB/s | 325 MB/s | +550% |
| PUT 1GB | 52 MB/s | 546 MB/s | +950% |
| GET 10MB | 241 MB/s | 399 MB/s | +66% |
| GET 1GB | 577 MB/s | 674 MB/s | +17% |

All optimizations applied: debug binary fix, skip MD5 (xxhash64 ETag), chunked mmap (4MB Bytes::slice), TCP_NODELAY, ChunkedBytes hybrid response, hyper writev(true) + max_buf_size(4MB).
| Operation | Before | After | Change |
|---|---|---|---|
| PUT 10MB | 214 MB/s | 272 MB/s | +27% (versioning cache) |
| PUT 1GB | 305 MB/s | 310 MB/s | ~same |
| GET 10MB | 365 MB/s | 809 MB/s | +122% (streaming + mmap) |
| GET 1GB | 452 MB/s | 1048 MB/s | +132% (mmap fast path) |
| GET 1GB (curl) | n/a | 1220 MB/s | n/a (no mc overhead) |

| Operation | Before | After | Change |
|---|---|---|---|
| GET 10MB (1d) | 1175 MB/s | 19098 MB/s | mmap page cache |
| GET 1GB (1d) | 1244 MB/s | (page cache) | mmap instant |
| GET 10MB (4d EC) | 774 MB/s | 1048 MB/s | +35% (zero-alloc mmap) |
| GET 1GB (4d EC) | 919 MB/s | 1236 MB/s | +35% (zero-alloc mmap) |
| PUT 1GB (1d) | 439 MB/s | 510 MB/s | +16% (skip MD5, xxhash64) |

Every allocation in the hot path is a potential throughput ceiling. The 1+0 GET path is already zero-alloc (mmap -> Bytes::from_owner -> slice). The EC GET and PUT paths still allocate heavily. This section inventories every allocation, rates its impact, and identifies the fix.
Complete inventory of every heap allocation per request. "Per 1GB" counts assume 3+1 EC with 1024 encode blocks and 256 decode iterations (4MB each).
| # | File:line | What | Count | Heap? |
|---|---|---|---|---|
| 1 | volume_pool.rs:307 | Bytes::from_owner(mmap) | 1 | ref-counted wrapper, no data copy |
| 2 | volume_pool.rs:308 | stream::once(owned), yields entire Bytes | 1 | one yield, one poll_frame, zero slicing |
| 3 | s3_service.rs:419 | StreamingBlob::wrap(SyncStream(..)) | 1 | one Box for stream trait object |
| 4 | s3_service.rs:424-435 | info.content_type.clone() etc | ~5 | small strings, once per request |

Data-path allocs: 0. Stream yields: 1. Entire mmap served as single Bytes. hyper writes to TCP in kernel-sized chunks. No slicing, no Arc per chunk.
| # | File:line | What | Count | Heap? |
|---|---|---|---|---|
| 1 | :112 | mmap_slots: Vec<Option<MmapOrVec>> | 1 | once at start (N slots) |
| 2 | :135 | meta_clone = good_meta.clone() | 1 | strings in ObjectMeta |
| 3 | :137 | mpsc::channel(4) | 1 | channel internal buffers |
| 4 | :146 | ReedSolomon::new() | 1 | RS lookup tables |
| 5 | :158 | pool.take(), 4MB output buffer from pool | 0 (steady state) | done. BufPool returns on drop |
| 6 | :174-177 | extend_from_slice(&mmap[..]) into iter_data | 256 x data_n | copy from mmap (inherent for EC reassembly) but no alloc (writes into #5) |
| 7 | :218 | Bytes::from(iter_data) | 256/GB | takes ownership of #5, no copy |

Data-path allocs: 8 warmup, then 0 steady state. BufPool pre-allocates
8 x 4MB Vecs. Bytes::from_owner(PooledBuf) wraps the buffer; when hyper
drops the Bytes, PooledBuf::drop returns the Vec to the pool. After
warmup, buffers circulate without allocation. EC GET 4-disk 1GB: 2105 MB/s.
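The recycle-on-drop mechanism can be sketched with std alone. The type names mirror the text, but the fields, signatures, and the Mutex-backed free list are assumptions for illustration; the real PooledBuf is additionally wrapped in Bytes::from_owner, which is omitted here.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical fixed-size buffer pool backed by a simple free list.
struct BufPool {
    free: Mutex<Vec<Vec<u8>>>,
    buf_size: usize,
}

/// A buffer on loan. Dropping it hands the Vec back to the pool.
struct PooledBuf {
    buf: Vec<u8>,
    pool: Arc<BufPool>,
}

impl BufPool {
    /// Pre-allocates `count` buffers up front (the "warmup" allocations).
    fn new(count: usize, buf_size: usize) -> Arc<Self> {
        let free = (0..count).map(|_| Vec::with_capacity(buf_size)).collect();
        Arc::new(Self { free: Mutex::new(free), buf_size })
    }

    /// Reuses a free buffer, allocating only when the pool is empty.
    fn take(pool: &Arc<Self>) -> PooledBuf {
        let recycled = pool.free.lock().unwrap().pop();
        let buf = recycled.unwrap_or_else(|| Vec::with_capacity(pool.buf_size));
        PooledBuf { buf, pool: Arc::clone(pool) }
    }
}

impl Drop for PooledBuf {
    fn drop(&mut self) {
        // Swap in an empty Vec (no allocation) and return the real one.
        let mut buf = std::mem::take(&mut self.buf);
        buf.clear(); // keep the capacity, drop the contents
        self.pool.free.lock().unwrap().push(buf);
    }
}
```

Because the buffer returns itself on drop, the consumer (hyper, in the text) needs no knowledge of the pool; it only has to drop the value when the write completes.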
| # | File:line | What | Count | Heap? |
|---|---|---|---|---|
| 1-7 | same as healthy | same | same | |
| 8 | :183 | vec![None; total] shard slot array | per degraded block | yes |
| 9 | :188 | mmap[..].to_vec() per available shard | per shard per degraded block | yes, full shard copy |
| 10 | :200 | rs.reconstruct() internal | per degraded block | yes (reconstructs missing shards) |

Only when shards are missing. Correctness path, not optimized.
| # | File:line | What | Count | Heap? |
|---|---|---|---|---|
| 1 | :96-98 | content_type.clone() or .to_string() | 1 | once at start |
| 2 | :104 | distribution: Vec<usize> | 1 | once at start (N shards) |
| 3 | :109-112 | volume_ids: Vec<String> from placement | 1 | once at start |
| 4 | :118 | ReedSolomon::new() | 1 | RS lookup tables |
| 5 | :131 | writers: Vec<Box<dyn ShardWriter>> | 1 | N boxed writers |
| 6 | :138 | shard_hashers: Vec<blake3::Hasher> | 1 | N hashers |
| 7 | :144-146 | shard_bufs: Vec<Vec<u8>>, pre-allocated, reused | 1 | N Vecs with capacity, allocated once |
| 8 | :149 | block_buf: Vec::with_capacity(2MB) | 1 | once, reused across all chunks |
| 9 | :313 (split_data_into) | buf.clear() + extend_from_slice + resize | 0 | zero alloc, reuses #7 capacity |
| 10 | :318-320 | parity buf clear() + resize | 0 | zero alloc, reuses #7 capacity |
| 11 | :193-195 | checksums: Vec<String> from blake3 hex | N shards | finalize path, once |
| 12 | :206-223 | ObjectMeta { etag.clone(), .. } per shard | N shards | finalize path, once per shard |
| 13 | :255-256 | MrfEntry { bucket.to_string(), .. } | 0 or 1 | only on partial write |
| 14 | :262-263 | ObjectInfo { bucket.to_string(), .. } | 1 | return value |

Data-path allocs per block: 0. The encode loop (#9, #10) allocates
nothing. split_data_into and parity zeroing reuse pre-allocated buffers.
All allocations are either once-at-start (#1-8) or once-at-finalize (#11-14).
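The clear + extend_from_slice + resize pattern from rows #9 and #10 can be sketched as follows. The function body is an assumption written for illustration, not the code at the cited line; it relies on each buffer having been created with enough capacity up front.

```rust
/// Sketch of an allocation-free split: copies `block` into the
/// pre-allocated shard buffers, zero-padding the tail. Assumes each
/// buffer's capacity is already at least the shard length.
fn split_data_into(block: &[u8], shard_bufs: &mut [Vec<u8>]) {
    let data_n = shard_bufs.len();
    assert!(data_n > 0, "need at least one data shard");
    let shard_len = (block.len() + data_n - 1) / data_n; // ceil division
    for (i, buf) in shard_bufs.iter_mut().enumerate() {
        let start = (i * shard_len).min(block.len());
        let end = ((i + 1) * shard_len).min(block.len());
        buf.clear(); // length becomes 0, capacity is untouched
        buf.extend_from_slice(&block[start..end]);
        buf.resize(shard_len, 0); // zero-pad shards past the end of the data
    }
}
```

Since none of the three calls exceeds the existing capacity, the buffers keep the same heap pointer across every block of the request.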
| # | File:line | What | Count | Heap? |
|---|---|---|---|---|
| 1 | :417 | io::Error::new(..e.to_string()) in stream map | per chunk on error | error path only |
| 2 | :419 | SyncStream(mapped) | 1 | stack wrapper, no heap |
| 3 | :419 | StreamingBlob::wrap() -> DynByteStream | 1 | one Box |
| 4 | :424-435 | info.etag.clone() etc | ~5 | small strings |
| 5 | :352 | PUT body chunk.map_err(..e.to_string()) | per chunk on error | error path only |

Data-path allocs: 0. Error conversions only allocate on error.
| Path | Data-path allocs/GB | Remaining alloc | Can eliminate? |
|---|---|---|---|
| 1+0 GET | 0 | none | done |
| EC GET (healthy) | 0 (steady state) | 8 warmup allocs, then pool recycles | done. BufPool + PooledBuf + Bytes::from_owner |
| EC GET (degraded) | ~5400 | Vec per shard for RS reconstruct | no: RS API requires owned buffers |
| PUT (per-block loop) | 0 | none | done |
| PUT (total request) | ~10 | setup + finalize allocs | no: inherent (metadata, hashers, writers) |
| Response wrapper | 0 | none | done |