- Define filesystem caching rules in detail
- Define system json schema and creation process
- Define allowed time between runs
- Define rules that use local SSD for caching data
- Define rules for hyperconverged and local cache
-
Configure datasize to collect the memory information from the hosts instead of getting a number of hosts for the calculation
-
Determine method to use cgroups for memory limitation in the benchmark script.
-
Add a log block at the start of datagen & run that output all the parms being used to be clear on what a run is.
-
Remove accelerator type from datagen
-
datasize should output the datagen command to copy and paste
-
Add autosize parameter for run_benchmark and datasize
-
for run it's just size of dataset based on memory capacity
-
For datasize it needs an input of GB/s for the cluster and list of hosts
-
Keep a log of mlperfstorage commands executed in a mlperf.history file in results_dir
-
Add support for datagen to use subdirectories
-
Capture cluster information and write to a json document in outputdir.
-
Figure out how to get all clients for milvus
- Unique names for files and directories with structure for benchmark, accelerator, count, run-sequence, run-number
- Better installer that manages dependencies
- Containerization
-
- Ease of Deployment of Benchmark (just get it working)
-
- Cgroups and resource limits (better cache management)
- Flush Cache before a run
- Validate inputs for –closed runs (eg: don’t allow runs against datasets that are too small)
- Reportgen should run validation against outputs
- Add better system.json creation to automate the system description for consistency
-
- Add json schema checker for system documents that submitters create
- Automate execution of multiple runs
-
Add support for code changes in closed to supported categories [ data loader, s3 connector, etc] -
-
Add patches directory that gets applied before execution
-
- Add runtime estimation
- and --what-if or --dry-run flag
- Automate selection of minimum required dataset
-
Determine if batch sizes in MLPerf Training are representative of batch sizes for realistically sized datasets - Split system.json into automatically capturable (clients) and manual (storage)
- Define system.json schema and add schema checker to the tool for reportgen
- Add report-dir csv of results from tests as they are run
- Collect versions of all prerequisite packages for storage and dlio
- Reduce verbosity of logging
- Add callback handler for custom monitoring
-
- SPECStorage uses a “PRIME_MON_SCRIPT” environment variable that will execute at different times
-
- Checkpoint_bench uses RPC to call execution which can be wrapped externally
- Add support for DIRECTIO
- Add seed for dataset creation so that distribution of sizes is the same for all submitters (file 1 = mean + x bytes, file 2 = mean + y bytes, etc)
- Determine if global barrier for each batch matches industry behavior
- Better linking and presentation of system diagrams (add working links to system diagrams to supplementals)
- Define presentation and rules for hyperconverged or systems with local cache