Resolve MPI deduplicatable data generation (mlpstorage integration) #278

@russfellows

Description

Bug: Multi-process / MPI runs produce 100% deduplicatable data across workers

This is related to issue #272

Summary

When kv_cache_benchmark is run with multiple processes (MPI or otherwise) targeting shared storage, every worker generates byte-for-byte identical data for the same logical request IDs. On a dedup-capable storage system this is invisible, but it means the benchmark measures write throughput of deduplicated data — not unique data — rendering multi-host storage stress results meaningless.

Affected Files

  • kv_cache_benchmark/kv_cache/cache.py: KVCacheGenerator, _seed_from_key()
  • kv_cache_benchmark/kv_cache/benchmark.py: cache_key construction

Root Cause

Key generation is deterministic with no per-process identity

Cache keys are constructed in benchmark.py using only the request counter and a modulo of num_users:

# benchmark.py line ~382
user_id  = f"dataset_user_{req_id % self.num_users}"
cache_key = f"{user_id}_req_{req_id:06d}"

These strings are identical on every worker, because each process independently starts its req_id counter at 0.
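
A minimal sketch of the collision (the simulated worker loop is illustrative, not the benchmark's actual code; num_users and the key format mirror the snippet above):

```python
# Sketch: simulate cache_key construction on independently launched workers.
num_users = 4

def worker_keys(n_requests: int) -> list:
    keys = []
    for req_id in range(n_requests):  # every process starts req_id at 0
        user_id = f"dataset_user_{req_id % num_users}"
        keys.append(f"{user_id}_req_{req_id:06d}")
    return keys

# No rank or host identity enters the key, so every worker emits the
# exact same sequence:
assert worker_keys(8) == worker_keys(8)
```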

Data generation depends only on the key string and a fixed global seed

In cache.py, KVCacheGenerator uses a fixed global_seed (default 0) and derives a per-entry seed via SHA-256 of the key string:

# cache.py line ~39
rng = np.random.default_rng(self.global_seed)   # same on every worker
self.precomputed_buffer = rng.uniform(...)        # same 256 MB buffer on every worker

# cache.py line ~49
return (key_hash64 ^ self.global_seed) & 0xFFFF_FFFF_FFFF_FFFF  # same XOR stamp for same key

Because both the key strings and the global seed are identical across workers, every worker produces bitwise-identical 4 KB blocks for every cache entry.
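
The identity can be demonstrated end to end. The snippet below is a paraphrase of the derivation described above, not the exact cache.py code, but it shows why same key + same seed must yield the same bytes:

```python
import hashlib
import numpy as np

GLOBAL_SEED = 0  # fixed default, identical on every worker

def seed_from_key(key: str) -> int:
    # Paraphrase of the SHA-256-based per-entry seed derivation (sketch).
    key_hash64 = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    return (key_hash64 ^ GLOBAL_SEED) & 0xFFFF_FFFF_FFFF_FFFF

def block_for(key: str) -> bytes:
    # One 4 KB block, seeded only by the key string and the global seed.
    rng = np.random.default_rng(seed_from_key(key))
    return rng.bytes(4096)

# Two "workers" handling the same logical key emit bitwise-identical data:
a = block_for("dataset_user_0_req_000000")
b = block_for("dataset_user_0_req_000000")
assert a == b
```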

Impact

| Workers (N) | Dedup ratio | Unique data written |
|------------:|------------:|--------------------:|
| 1           | 0%          | 100%                |
| 2           | 50%         | 50%                 |
| 8           | 87.5%       | 12.5%               |
| 16          | 93.75%      | 6.25%               |
| 64          | 98.4%       | 1.6%                |
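
The ratios follow directly from N workers writing identical copies of every block: exactly 1/N of the logical bytes are unique. A quick check of the table:

```python
def dedup_ratio(n_workers: int) -> float:
    # N identical copies of each block: 1 unique + (N - 1) duplicates.
    return (n_workers - 1) / n_workers

for n in (1, 2, 8, 16, 64):
    print(f"{n:>2} workers -> {dedup_ratio(n):.2%} dedup")
```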

A storage system with inline deduplication (e.g. many all-flash arrays and object stores) will absorb N× the logical write I/O while storing only 1× the data, appearing N× faster than it actually is for unique workloads. This makes the benchmark unreliable as a measure of raw write capacity in any multi-host scenario.

Steps to Reproduce

  1. Run the benchmark on 2+ hosts targeting the same shared storage mount or object store endpoint.
  2. Compare effective storage capacity consumed vs. logical bytes written — consumption will not scale with host count.
  3. Alternatively, inspect the raw data: any two workers' output files for the same time window will be byte-for-byte identical.

Expected Behavior

Each worker (MPI rank, process, or host) should produce unique data so that N workers write N× unique bytes to storage, properly stressing storage capacity and ingestion throughput.

Proposed Fix

Embed a per-worker identity into either the global_seed or the cache key string. Two options (both eliminate cross-worker dedup, but they differ in scope; see Recommendation):

Option A — Unique seed per worker (minimal change)

Pass the MPI rank (or os.getpid() / hostname hash as fallback) as global_seed when constructing KVCacheGenerator:

import os, socket, hashlib

def _worker_seed() -> int:
    """Return a seed unique to this process on this host."""
    try:
        from mpi4py import MPI
        if MPI.COMM_WORLD.Get_size() > 1:
            return MPI.COMM_WORLD.Get_rank()
    except ImportError:
        pass
    # Fallback: hash of hostname + PID. Covers non-MPI launches, and
    # mpi4py-present-but-single-rank launches, where independently started
    # processes would otherwise all report rank 0 and collide again.
    ident = f"{socket.gethostname()}:{os.getpid()}"
    return int(hashlib.sha256(ident.encode()).hexdigest()[:16], 16)

# When constructing KVCacheGenerator:
generator = KVCacheGenerator(model_config, global_seed=_worker_seed())

This changes the 256 MB precomputed buffer and the XOR stamp for every worker, making all 4 KB blocks unique across workers while keeping them reproducible within a single worker run.

Option B — Unique key prefix per worker (more explicit)

Prefix every cache key with the worker identity:

worker_prefix = f"rank{mpi_rank}_host{hostname_hash}"
cache_key = f"{worker_prefix}_{user_id}_req_{req_id:06d}"

This keeps the same precomputed buffer but changes the XOR stamp per worker, which is sufficient to eliminate cross-worker dedup.
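
To see why the prefix alone suffices: any change to the key string changes the SHA-256-derived stamp. A sketch (seed_from_key paraphrases cache.py's derivation; the hostabc prefix is a placeholder):

```python
import hashlib

def seed_from_key(key: str, global_seed: int = 0) -> int:
    # Paraphrase of cache.py's key-to-seed derivation (sketch).
    h = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    return (h ^ global_seed) & 0xFFFF_FFFF_FFFF_FFFF

base_key = "dataset_user_0_req_000000"
s0 = seed_from_key(f"rank0_hostabc_{base_key}")
s1 = seed_from_key(f"rank1_hostabc_{base_key}")
assert s0 != s1  # different stamp -> different block contents
```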

Recommendation

Option A (unique global_seed) is preferred because:

  • It also diversifies the 256 MB precomputed noise buffer, giving true statistical independence between workers.
  • It requires changes in only one place (benchmark initialization).
  • It is transparent to all downstream key-derivation and stamping logic.

The --seed CLI argument (if added) should document that it sets the per-worker base seed, and that MPI rank is XOR'd in automatically so users can still get reproducible multi-worker runs by fixing --seed.
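
That reproducibility contract could be sketched as follows (effective_seed and the XOR composition are an assumption about the eventual implementation, not existing code):

```python
def effective_seed(user_seed: int, rank: int) -> int:
    # Fixing --seed keeps a run reproducible end to end; XOR-ing in the
    # rank keeps each worker's data stream distinct from its peers.
    return (user_seed ^ rank) & 0xFFFF_FFFF_FFFF_FFFF

run1 = [effective_seed(42, r) for r in range(4)]
run2 = [effective_seed(42, r) for r in range(4)]
assert run1 == run2          # same --seed -> same data, run to run
assert len(set(run1)) == 4   # still unique per rank
```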

Notes

  • Single-process runs are not affected — the current design is correct, and the anti-dedup properties verified at 64 GB scale (no intra-entry or cross-entry block collisions) still hold for single-process use.
  • The fix does not change the on-disk format, the benchmark output schema, or any config file fields.
  • This issue is distinct from the previously fixed 96.7% intra-entry dedup bug (commit 0aa9aee) — that was a single-process issue; this is a multi-process issue.
