Closed
86 commits
39246aa
Merge pull request #214 from hazemawadalla/TF_KVCache
FileSystemGuy Nov 25, 2025
f3577eb
vdb_benchmark commit with unit tests
idevasena Dec 11, 2025
6876bac
Merge pull request #220 from idevasena/TF_VDBBench
idevasena Dec 12, 2025
92d5e89
feat: Replace legacy spillover logic with Waterfall LRU architecture …
hazemawadalla Dec 22, 2025
27d506d
vdb_benchmark: adding AISAQ indexing support
ram-sangle Jan 27, 2026
d1fc97a
feat(kv-cache): MLPerf v3.0 compliance and configuration overhaul
hazemawadalla Jan 27, 2026
d9715bc
feat(wrapper): config integration and workload automation
hazemawadalla Jan 27, 2026
001fd3b
test(kv-cache): comprehensive pytest suite for v3.0 features
hazemawadalla Jan 27, 2026
2956288
docs(readme): comprehensive documentation for v3.0
hazemawadalla Jan 27, 2026
166f2b2
test(results): add pytest HTML test report
hazemawadalla Jan 27, 2026
99b42f0
feat(xlsx): extended metrics export for v3.0
hazemawadalla Jan 27, 2026
1bfe885
deps(requirements): add pyyaml for config support
hazemawadalla Jan 27, 2026
8a6aa50
config: add default YAML configuration file
hazemawadalla Jan 27, 2026
094eec5
Updated README.md
ram-sangle Jan 30, 2026
3db89bd
Refactor monolithic kv-cache.py into modular kv_cache/ package
hazemawadalla Feb 10, 2026
e38cfe9
Fix DeepSeek-V3 MLA values in README, move validate.sh to utils/
hazemawadalla Feb 10, 2026
f4c10a2
docs: fix decode_batch_size shown as hardcoded in proposal
hazemawadalla Feb 10, 2026
f7ecca1
docs: clarify eviction mechanisms in proposal
hazemawadalla Feb 10, 2026
f1d6dc1
Merge pull request #230 from ram-sangle/vdb
FileSystemGuy Feb 12, 2026
0bf572b
Merge hazem/modular-refactor into TF_KVCache with conflict resolution
FileSystemGuy Feb 18, 2026
1c72521
Added recall metrics to VDB benchmark script
idevasena Feb 18, 2026
059c494
Merge pull request #244 from mlcommons/feature/hazem-refactor-merge
dslik Feb 18, 2026
0ceb24f
Merge pull request #245 from idevasena/TF_VDB_Recall
dslik Feb 19, 2026
1f6fbca
feat: Add s3dlio integration for MLPerf Storage with s3torchconnector…
Feb 7, 2026
95d1396
feat: Add multi-library S3 storage support (s3torchconnector, minio, …
Feb 13, 2026
cfa584a
refactor: Organize integration tests into tests/integration/
Feb 16, 2026
366904a
docs: Add branch strategy and PR management infrastructure
Feb 16, 2026
2b8cf25
feat: Integrate dgen-py for 155x faster checkpoint data generation
Feb 16, 2026
79a9849
feat: Add StreamingCheckpointing implementation for producer-consumer…
russfellows Feb 17, 2026
0271bc3
feat: Add multi-library streaming checkpoint support
russfellows Feb 19, 2026
af8e3fd
test: Add comprehensive streaming checkpoint tests and demos
russfellows Feb 19, 2026
b5eb1fe
docs: Consolidate and enhance documentation
russfellows Feb 19, 2026
1f818c9
security: Remove hardcoded credentials and internal IPs from test files
russfellows Feb 19, 2026
afb6f1f
docs: Remove unnecessary TWO_PR_WORKFLOW.md
russfellows Feb 19, 2026
d7e73fe
Point to russfellows/dlio_benchmark fork for integrated setup
russfellows Feb 19, 2026
923df54
docs: Clean up outdated documentation and remove azstoragetorch refer…
russfellows Feb 19, 2026
ac1a07f
refactor: Remove all azstoragetorch references from codebase
russfellows Feb 19, 2026
0eee558
deps: Update s3dlio requirement to version 0.9.50
russfellows Feb 19, 2026
0c5a0a4
refactor: Remove dlio_benchmark from git tracking
russfellows Feb 19, 2026
ce62e8b
Add required dependencies and remove native Azure backend
russfellows Feb 19, 2026
690e6b8
Merged TF_KVCache into main, accepting all incoming changes for kv_ca…
russfellows Feb 25, 2026
98da6d4
Add vdb benchmark
idevasena Feb 26, 2026
ac5970c
feat: add --io-trace-log trace mode with tensor-parallel / multi-GPU …
russfellows Feb 26, 2026
b22a1d5
Merge pull request #1 from russfellows/feature/io-trace-log
russfellows Feb 26, 2026
397f38b
feat: zero-copy data generation via dgen-py producer-consumer pool
russfellows Feb 26, 2026
cf88bc0
Merge pull request #2 from russfellows/feature/zero-copy-datagen
russfellows Feb 26, 2026
969d168
feat: add fill-rate comparison benchmark (numpy vs dgen-py producer p…
russfellows Feb 27, 2026
9a9750f
Merge pull request #3 from russfellows/feature/bench-fill-comparison
russfellows Feb 27, 2026
04e8125
include interactive vdb collection manager
idevasena Feb 27, 2026
ef3e503
Merge branch 'main' into TF_VDBBench
idevasena Feb 27, 2026
5f2e568
vector normalization and updated diskann parameter naming
idevasena Mar 3, 2026
901c36c
Merge branch 'TF_VDBBench' into TF_VDBBench
russfellows Mar 3, 2026
62dbc4a
Merge pull request #4 from russfellows/TF_VDBBench
russfellows Mar 3, 2026
6195885
Merge branch 'mlcommons:main' into main
russfellows Mar 4, 2026
8c0fede
enhancements to benchmark script for GT comparisons for recall from A…
idevasena Mar 10, 2026
9096c68
Merge pull request #10 from russfellows/TF_VDBBench
idevasena Mar 12, 2026
90492b4
update clone command from branch as it will be merged to main
idevasena Mar 13, 2026
5899573
Merge pull request #11 from russfellows/TF_VDBBench
idevasena Mar 13, 2026
a3120a3
Merge upstream mlcommons/storage main (156 commits) into local main
russfellows Mar 18, 2026
2ef6e95
Merge mlc-fork/main (VDBBench PRs and bench-fill-comparison)
russfellows Mar 18, 2026
b2228b8
feat: restore --io-trace-log, --num-gpus, --tensor-parallel from ac5970c
russfellows Mar 18, 2026
7162835
chore: remove dgen-py integration files (preserved on feature/dgen-py…
russfellows Mar 18, 2026
abd8430
chore: add .env and env-fast to .gitignore (contain S3 credentials)
russfellows Mar 18, 2026
ea06774
Merge pull request #12 from russfellows/feature/io-trace-log-tensor-p…
russfellows Mar 18, 2026
13ec205
feat: object-store S3 integration, storage readers/writers, ban boto3
russfellows Mar 19, 2026
741f5ed
chore: update submodule branch tracking to main (feature branch merged)
russfellows Mar 19, 2026
ee9594d
chore: update dlio_benchmark submodule pointer to main (post-merge a2…
russfellows Mar 19, 2026
e61e7ba
chore: bump version to 3.0.0; derive VERSION from package metadata
russfellows Mar 19, 2026
e7014b4
feat: integrate dlio_benchmark v3.0.0-beta with multi-library S3 read…
russfellows Mar 19, 2026
e29a92c
Merge pull request #13 from russfellows/feature/parquet-readers-and-i…
russfellows Mar 19, 2026
c441062
chore: update dlio_benchmark submodule pointer to latest main (4d5703c)
russfellows Mar 19, 2026
9638e5e
docs: add temporary development fork notice and clone instructions to…
russfellows Mar 19, 2026
6de6049
feat: add local_fs (fadvise) and direct_fs (O_DIRECT) checkpoint back…
russfellows Mar 20, 2026
895f737
fix: skip filesystem path checks when storage type is S3/object
russfellows Mar 20, 2026
05f9f6e
test: fix test infrastructure for object store backends
russfellows Mar 20, 2026
5ce8095
docs: add comprehensive benchmark results for all checkpoint backends
russfellows Mar 20, 2026
6a05fdc
chore: bump s3dlio requirement to >=0.9.82; update .gitignore
russfellows Mar 20, 2026
b5ad516
Merge pull request #14 from russfellows/feat/file-and-direct-backends
russfellows Mar 20, 2026
70922eb
Merge branch 'mlcommons:main' into main
russfellows Mar 20, 2026
b2918d1
chore: update dlio_benchmark submodule to merged main (3f37071)
russfellows Mar 20, 2026
78982bb
storage: three-library S3 benchmark suite, unet3d h100 configs, secur…
russfellows Mar 21, 2026
a08c84d
chore: update dlio_benchmark submodule to merged branch
russfellows Mar 21, 2026
1634c94
Merge pull request #15 from russfellows/feat/s3-benchmark-suite-march…
russfellows Mar 21, 2026
6f4ff59
docs: restructure docs for library-neutral coverage — remove S3DLIO-s…
russfellows Mar 23, 2026
52b7e3a
docs: fix Where to Start table formatting; add complete cross-referen…
russfellows Mar 23, 2026
056df25
docs: neutralize STORAGE_LIBRARIES bias; replace s3dlio detailed anal…
russfellows Mar 23, 2026
20 changes: 20 additions & 0 deletions .gitignore
@@ -37,3 +37,23 @@ Thumbs.db
.vscode/
CLAUDE.md
.roomodes
LOCAL_BRANCH_NOTES.md

# DLIO test artifacts — created in cwd when running dlio_benchmark tests
output/
dlio_test_output/
data/
checkpoints/
dlio_benchmark_test.log
dlio_aistore_benchmark_test.log

# Backup directories — local-only, never commit
Test-Backup/
dlio_benchmark.OLD*/

# Credential / environment files — NEVER commit these
.env
env-fast

# TLS certificates — local only, never commit (paths to certs are in .env)
.certs/
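
To see what the new ignore rules cover, the sketch below approximates gitignore matching for the patterns added in this hunk. It is a simplification rather than Git's actual matcher: it only handles slash-free patterns like the ones above (a trailing `/` meaning "directories only"), and the helper name `is_ignored` is hypothetical. `git check-ignore` remains the authoritative check.

```python
from fnmatch import fnmatchcase

# Patterns copied from the .gitignore hunk above.
NEW_PATTERNS = [
    "LOCAL_BRANCH_NOTES.md",
    "output/", "dlio_test_output/", "data/", "checkpoints/",
    "dlio_benchmark_test.log", "dlio_aistore_benchmark_test.log",
    "Test-Backup/", "dlio_benchmark.OLD*/",
    ".env", "env-fast",
    ".certs/",
]


def is_ignored(path: str, patterns=NEW_PATTERNS) -> bool:
    """Approximate gitignore matching (hypothetical helper, slash-free patterns only).

    A pattern without an inner slash may match a path component at any depth.
    A pattern ending in "/" matches directories only. Pass a trailing "/" in
    `path` to mark the final component as a directory.
    """
    path_is_dir = path.endswith("/")
    parts = [p for p in path.split("/") if p]
    for pattern in patterns:
        dir_only = pattern.endswith("/")
        name = pattern.rstrip("/")
        for i, part in enumerate(parts):
            # Every component except the last is necessarily a directory.
            part_is_dir = i < len(parts) - 1 or path_is_dir
            if fnmatchcase(part, name) and (part_is_dir or not dir_only):
                return True
    return False


for candidate in (".env", "data/train/img.npz", "dlio_benchmark/main.py", "output"):
    print(candidate, is_ignored(candidate))
```

Under this approximation a plain file named `output` is not ignored, because `output/` only applies to directories, while anything beneath a `dlio_benchmark.OLD*` backup directory is.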
4 changes: 4 additions & 0 deletions .gitmodules
@@ -0,0 +1,4 @@
[submodule "dlio_benchmark"]
path = dlio_benchmark
url = https://github.com/russfellows/dlio_benchmark.git
branch = main
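
As a quick sanity check of the submodule entry above, the following sketch parses the same text with Python's standard `configparser`. It assumes the file content is exactly what this hunk shows. The function name `read_submodules` is hypothetical, and `configparser` only approximates Git's config syntax, so treat it as an illustration rather than a general `.gitmodules` parser.

```python
import configparser

# Content copied from the .gitmodules hunk above.
GITMODULES_TEXT = """\
[submodule "dlio_benchmark"]
path = dlio_benchmark
url = https://github.com/russfellows/dlio_benchmark.git
branch = main
"""


def read_submodules(text: str) -> dict:
    """Return {submodule_name: {key: value}} (hypothetical helper)."""
    parser = configparser.ConfigParser(interpolation=None)
    # Real .gitmodules files indent their keys; strip each line so that
    # configparser does not treat the indentation as a continuation line.
    parser.read_string("\n".join(line.strip() for line in text.splitlines()))
    submodules = {}
    for section in parser.sections():
        # Section headers look like: submodule "dlio_benchmark"
        name = section.split('"')[1]
        submodules[name] = dict(parser[section])
    return submodules


submodules = read_submodules(GITMODULES_TEXT)
print(submodules["dlio_benchmark"]["url"])
print(submodules["dlio_benchmark"]["branch"])
```

The entry tracks the `main` branch of the `russfellows/dlio_benchmark` fork, which is why the README added in this PR asks for `git clone --recurse-submodules`.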
52 changes: 51 additions & 1 deletion README.md
@@ -1,9 +1,19 @@
# MLPerf Storage Benchmark Suite
MLPerf® Storage is a benchmark suite to characterize the performance of storage systems that support machine learning workloads.

> **⚠️ TEMPORARY — Development Fork**
>
> This is a personal development fork ([russfellows/mlc-storage](https://github.com/russfellows/mlc-storage)) containing work-in-progress features not yet merged into the official [MLCommons Storage](https://github.com/mlcommons/storage) repository. Once this work is accepted upstream, this notice will be removed and users should switch to the official repo.
>
> **To clone this fork with all submodules (required):**
> ```bash
> git clone --recurse-submodules https://github.com/russfellows/mlc-storage.git
> ```

- [Overview](#overview)
- [Prerequisite](#prerequisite)
- [Installation](#installation)
- [Testing and Demos](#testing-and-demos)
- [Configuration](#configuration)
- [Workloads](#workloads)
- [U-Net3D](#u-net3d)
@@ -13,7 +23,24 @@ MLPerf® Storage is a benchmark suite to characterize the performance of storage
- [CLOSED](#closed)
- [OPEN](#open)
- [Submission Rules](#submission-rules)
-

---

## Documentation

Two README files cover the full project in detail — read both before diving into the
code or running benchmarks:

| Document | What it covers |
|----------|----------------|
| **[docs/README.md](docs/README.md)** | Complete project overview: all four benchmark workloads, document reference, object storage library guides, and quick-link index to every test script |
| **[tests/README.md](tests/README.md)** | Everything needed to run tests: environment setup, unit tests, integration tests, object-store performance scripts, and how pytest is configured |

The top-level sections below give the official MLCommons parameter reference and
are retained for submission compliance.

---

## Overview
For an overview of how this benchmark suite is used by submitters to compare the performance of storage systems supporting an AI cluster, see the MLPerf® Storage Benchmark submission rules here: [doc](https://github.com/mlcommons/storage/blob/main/Submission_guidelines.md).

@@ -76,6 +103,29 @@ The working directory structure is as follows

The benchmark simulation is performed by [dlio_benchmark](https://github.com/argonne-lcf/dlio_benchmark), a benchmark suite for emulating I/O patterns for deep learning workloads. [dlio_benchmark](https://github.com/argonne-lcf/dlio_benchmark) is listed as a prerequisite pinned to a specific git branch. A future release will update the installer to pull DLIO from PyPI. The DLIO configuration of each workload is specified in a YAML file. You can see the configs of all MLPerf Storage workloads in the `configs` folder.

## Testing and Demos

See **[tests/README.md](tests/README.md)** for the complete test guide — environment
setup, unit tests (no infrastructure required), integration tests, and object-store
performance scripts for all three supported object storage libraries.

### Quick Demos

- **StreamingCheckpointing Demo**: Run `./tests/checkpointing/demo_checkpoint_methods.sh` to see:
  - dgen-py integration (155× faster data generation)
  - StreamingCheckpointing (192× memory reduction)
  - Comparison of old vs new checkpoint methods

- **Backend Validation**: Test multi-library support:
  ```bash
  python tests/checkpointing/test_streaming_backends.py --backends s3dlio minio
  ```

- **Unit tests** (no infrastructure required):
  ```bash
  pytest tests/unit/
  ```

## Operation
The benchmark uses nested commands to select the workload category, workload, and workload parameters.
