fix: resolve 11 high/medium severity bugs from broad codebase scan #395

Open
FileSystemGuy wants to merge 1 commit into main from FileSystemGuy-bugs-high


@FileSystemGuy
Contributor

Summary

This PR fixes 11 bugs identified in a broad static analysis of the codebase. The findings span core framework stability, submission validation correctness, and benchmark execution reliability. Every change is a targeted fix — no refactoring or feature additions.

Issues are grouped by subsystem below. Each entry includes the root cause and what was done to fix it.


Core Framework

CORE-1 — Ctrl+C raises AttributeError instead of exiting cleanly

File: mlpstorage_py/config.py

EXIT_CODE.INTERRUPTED and EXIT_CODE.ERROR were referenced in main.py but absent from the EXIT_CODE enum. Pressing Ctrl+C triggered the SIGINT handler, which called sys.exit(EXIT_CODE.INTERRUPTED). The lookup of the missing enum member raised AttributeError, so the process crashed with a traceback instead of exiting cleanly.

Fix: Added INTERRUPTED = 8 and ERROR = 9 to the EXIT_CODE enum in config.py.
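
As a rough illustration of the fix, the sketch below defines an enum with the two new members and a SIGINT handler that uses one of them. The use of IntEnum, the SUCCESS and FAILURE placeholders, and the handler and helper names are assumptions made for this example; only INTERRUPTED = 8 and ERROR = 9 come from the PR.

```python
import enum
import signal
import sys


class EXIT_CODE(enum.IntEnum):
    # Placeholder members; their real values are not shown in this PR.
    SUCCESS = 0
    FAILURE = 1
    # The two members added by the fix, with the values stated above.
    INTERRUPTED = 8
    ERROR = 9


def handle_sigint(signum, frame):
    # With the member defined, this lookup succeeds and sys.exit() receives
    # an integer status instead of the lookup raising AttributeError.
    sys.exit(EXIT_CODE.INTERRUPTED)


def install_handler():
    # Hypothetical helper: route Ctrl+C (SIGINT) to the handler above.
    signal.signal(signal.SIGINT, handle_sigint)
```

Because IntEnum members are integers, sys.exit() uses the member's value as the process exit status. With a plain Enum the status would not be the member's value, so the base class matters here.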


CORE-2 — Run entries validated against datagen metadata

File: mlpstorage_py/submission_checker/loader.py

In the submission loader, metadata_path was set inside the outer loop over datagen timestamps and never updated in the inner loop over run timestamps. Every run entry was therefore loaded with the datagen's metadata file. Checks such as closed_submission_parameters and verify_datasize_usage were reading datagen invocation arguments instead of run invocation arguments, silently corrupting all parameter validation for training runs.

Fix: Added metadata_path = self.find_metadata_path(timestamp_path) as the first statement inside the inner run-timestamp loop. Cross-phase checks that need datagen params already iterate self.submissions_logs.datagen_files explicitly and are unaffected.
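
The loop structure described above can be sketched roughly as follows. This is a toy model, not the loader's code: the `tree` mapping, the tuple layout, and `fake_find_metadata_path` are hypothetical stand-ins, and only the placement of the `metadata_path` assignment reflects the fix.

```python
def load_entries(tree, find_metadata_path):
    """Pair every datagen and run directory with its own metadata file.

    `tree` maps a datagen timestamp path to the run timestamp paths that
    belong to it (an assumed layout for this sketch).
    """
    entries = []
    for datagen_path, run_paths in tree.items():
        metadata_path = find_metadata_path(datagen_path)
        entries.append(("datagen", datagen_path, metadata_path))
        for timestamp_path in run_paths:
            # The fix: recompute inside the inner loop so a run does not
            # inherit the datagen's metadata_path from the outer loop.
            metadata_path = find_metadata_path(timestamp_path)
            entries.append(("run", timestamp_path, metadata_path))
    return entries


def fake_find_metadata_path(path):
    # Stand-in for self.find_metadata_path(); the file name is hypothetical.
    return f"{path}/metadata.json"


tree = {"datagen/20250101_000000": ["run/20250102_000000", "run/20250103_000000"]}
entries = load_entries(tree, fake_find_metadata_path)
```

Removing the assignment in the inner loop reproduces the bug in this toy: both run entries would carry the datagen directory's metadata path.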


CORE-3 — --params values containing = crash before benchmark starts

File: mlpstorage_py/benchmarks/dlio.py

process_dlio_params() split each --params argument on every = character with no maxsplit. Any param value containing = (base64 credentials, S3 endpoint URIs, connection strings) produced more than 2 parts, causing ValueError: too many values to unpack before the benchmark started. Object-storage workloads with credential parameters were entirely blocked.

Fix: Changed item.split("=") to item.split("=", 1).
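
A minimal sketch of the difference, using a hypothetical parameter name and a base64-style value with "=" padding (neither is taken from the codebase):

```python
def parse_param(item):
    # maxsplit=1 splits only on the first "=", so the value keeps any
    # further "=" characters intact.
    key, value = item.split("=", 1)
    return key, value


# Hypothetical --params item with "=" inside the value.
item = "storage.secret_key=c2VjcmV0PT0="
key, value = parse_param(item)

# Without maxsplit the same string yields three parts, which is what broke
# the two-name unpacking.
parts_without_maxsplit = item.split("=")
```

An item with no "=" at all still fails the two-name unpacking, so this change does not alter how malformed arguments are reported.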


CORE-4 — CheckpointingBenchmark._run() swallows all exceptions silently

File: mlpstorage_py/benchmarks/dlio.py

The except Exception as e block in CheckpointingBenchmark._run() discarded the caught exception without logging it, returning EXIT_CODE.FAILURE with no diagnostic output. Any failure in execute_command() or datasize() was invisible to the operator. TrainingBenchmark._run() already logged str(e) correctly.

Fix: Added self.logger.error(f'Checkpointing benchmark failed: {e}') before the return.
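
The pattern can be sketched as below. The function form, the `execute` callable, and the numeric placeholders for the exit codes are assumptions for illustration; only the logged message format comes from the fix.

```python
import logging

SUCCESS = 0  # placeholder for EXIT_CODE.SUCCESS (value assumed)
FAILURE = 1  # placeholder for EXIT_CODE.FAILURE (value assumed)


def run_checkpointing(execute, logger):
    # `execute` stands in for the execute_command() and datasize() calls
    # made inside CheckpointingBenchmark._run().
    try:
        execute()
    except Exception as e:
        # The added line: record the cause before returning the failure code.
        logger.error(f"Checkpointing benchmark failed: {e}")
        return FAILURE
    return SUCCESS
```

The return value is unchanged by the fix; the only difference is that the operator now sees why the run failed.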


CORE-5 — UnboundLocalError replaces original exception from _run()

File: mlpstorage_py/benchmarks/base.py

In Benchmark.run(), result = self._run() was inside a try with a finally that performed cleanup. If _run() raised an exception, the finally block ran correctly but execution then fell through to return result. Since result was never assigned, Python raised UnboundLocalError: local variable 'result' referenced before assignment, replacing the original diagnostic exception with a secondary one and destroying failure information.

Fix: Initialized result = EXIT_CODE.FAILURE immediately before the try block. Also added EXIT_CODE to the import from mlpstorage_py.config in base.py.
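
The PR does not show the full body of Benchmark.run(), so the following is only a sketch under one assumption: that an except clause handles the error, which lets control reach `return result` after the finally block. The function names, the `log` list, and the FAILURE placeholder are hypothetical.

```python
FAILURE = 1  # placeholder for EXIT_CODE.FAILURE (value assumed)


def run_before(_run, cleanup, log):
    try:
        result = _run()
    except Exception as e:
        log.append(str(e))  # assumed handler; not shown in the PR
    finally:
        cleanup()
    return result  # never assigned if _run() raised


def run_after(_run, cleanup, log):
    result = FAILURE  # the fix: result is defined on every path
    try:
        result = _run()
    except Exception as e:
        log.append(str(e))
    finally:
        cleanup()
    return result
```

In this layout `run_before` raises UnboundLocalError whenever `_run()` fails, while `run_after` returns the failure code and leaves the original error message in `log`.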


CORE-6 — JSONParser.__contains__ raises AttributeError on any in test

File: mlpstorage_py/submission_checker/parsers/json_parser.py

__contains__ returned key in self.messages, but self.messages does not exist on JSONParser — the parsed JSON dict is stored in self.d. Any key in parser test raised AttributeError: 'JSONParser' object has no attribute 'messages'.

Fix: Changed self.messages to self.d.
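
A stripped-down illustration of the corrected method follows. The constructor taking a JSON string is an assumption made to keep the sketch self-contained; the real class may load its data differently.

```python
import json


class JSONParser:
    def __init__(self, text):
        # The parsed JSON dict is stored in self.d, as described above.
        self.d = json.loads(text)

    def __contains__(self, key):
        # Membership tests delegate to the parsed dict.
        return key in self.d


parser = JSONParser('{"train_au_percentage": [95.0]}')
has_au = "train_au_percentage" in parser
has_other = "missing_field" in parser
```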


CORE-12 — Unused pyarrow import makes it a hard dependency of all benchmarks

File: mlpstorage_py/benchmarks/base.py

from pyarrow.ipc import open_stream was present at the top of base.py but open_stream was never referenced anywhere in the file. Because it was a top-level import, any environment without pyarrow installed would fail to import the benchmark base class entirely, breaking all benchmarks with ImportError. pyarrow is retained in pyproject.toml as it is needed by DLIO and parquet handling elsewhere.

Fix: Removed the unused import line.


CORE-13 — DLIO exit code discarded; failed runs report EXIT_CODE.SUCCESS

File: mlpstorage_py/benchmarks/dlio.py

execute_command() called self._execute_command(...) but discarded the (stdout, stderr, return_code) return value. If DLIO exited with a non-zero code (OOM, assertion failure, I/O error), execute_command() returned silently and TrainingBenchmark._run() returned EXIT_CODE.SUCCESS. Results validation then proceeded against nonexistent or incomplete output files.

Fix: execute_command() now unpacks the return value and raises RuntimeError if the return code is non-zero. The existing except Exception handler in _run() (improved by CORE-4) catches and logs this.
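
The check can be sketched as below. Passing the runner in as `run_command` and the two fake runners are assumptions for this example; the tuple shape (stdout, stderr, return_code) follows the description above, and the exact error message in the PR is not shown.

```python
def execute_command(run_command, command):
    # `run_command` stands in for self._execute_command() and is assumed to
    # return a (stdout, stderr, return_code) tuple.
    stdout, stderr, return_code = run_command(command)
    if return_code != 0:
        # Surface the failure so the caller's except block can log it.
        raise RuntimeError(
            f"DLIO exited with return code {return_code}: {stderr}"
        )


def failing_runner(command):
    # Hypothetical runner simulating a process that exits with status 137.
    return "", "Killed", 137


def passing_runner(command):
    # Hypothetical runner simulating a clean exit.
    return "done", "", 0
```

Raising instead of returning a status keeps the change small: the caller's existing exception handler converts the error into a failure exit code without any further edits.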


Submission Validation

RULES-1 — 500-steps dataset minimum formula is circular; constraint never fires

File: mlpstorage_py/submission_checker/checks/training_checks.py

The dataset minimum size check computed:

num_steps_per_epoch = max(MIN_STEPS_PER_EPOCH,
                          num_files_train * num_samples_per_file // (batch_size * num_accelerators))
min_samples_steps = num_steps_per_epoch * batch_size * num_accelerators

Because the second argument to max() is derived from the actual file count, num_steps_per_epoch is always ≥ the actual steps, so the threshold tracked the size of the submitted dataset instead of acting as a fixed floor. The steps constraint could never produce a "too few files" error. The canonical computation in rules/utils.py does not have this defect.

Fix: Replaced the two-line calculation with the direct formula:

min_samples_steps = MIN_STEPS_PER_EPOCH * batch_size * num_accelerators
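
As a worked example of the direct formula, the sketch below plugs in illustrative numbers. The value 500 for MIN_STEPS_PER_EPOCH is inferred from the "500-steps" wording in the heading, and the batch size and accelerator count are arbitrary.

```python
MIN_STEPS_PER_EPOCH = 500  # inferred from the "500-steps" rule named above


def min_samples_for_steps(batch_size, num_accelerators):
    # Each step consumes batch_size samples on every accelerator, so the
    # floor is a fixed product that ignores the submitted file count.
    return MIN_STEPS_PER_EPOCH * batch_size * num_accelerators


# Illustrative values only: batch size 7 on 8 accelerators.
required = min_samples_for_steps(batch_size=7, num_accelerators=8)
```

With these inputs the floor is 500 × 7 × 8 = 28,000 samples, regardless of how many files the submission actually contains.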

RULES-3 — NameError in subset-mode process count check; check silently passes

File: mlpstorage_py/submission_checker/checks/checkpointing_checks.py

In the closed-submission process count check, model_key was used in the error log inside the if checkpoint_mode == "subset": branch, but model_key was only assigned inside the else: branch (after a regex match on the model name). When a CLOSED subset-mode submission had the wrong process count, the code hit NameError, which was silently swallowed, and the check returned as if it passed. The required 8-process count for subset mode was never enforced.

Fix: Replaced model_key with model_name in the subset-mode error log. model_name is assigned unconditionally at the top of the loop body and is the correct identifier for the message.


RULES-4 — AU check reads nonexistent DLIO fields; every submission fails spuriously

File: mlpstorage_py/submission_checker/checks/training_checks.py

The accelerator utilization check read train_au_mean_percentage and train_au_meet_expectation from the DLIO summary JSON. Neither field exists in actual DLIO output. The real field is train_au_percentage, a list of per-epoch AU percentage values. Both .get() calls always returned their defaults (0 and ""), causing au_expectation != "success" to always be True. Every training submission was flagged as an AU failure regardless of actual utilization, making it impossible to distinguish passing from failing submissions.

Fix: Replaced the broken field lookups with logic that reads train_au_percentage, computes the mean, and compares it against the 90% minimum threshold specified in the MLPerf Storage rules (Rules.md §3.3.2):

au_values = metrics.get("train_au_percentage", [])
au_mean = sum(au_values) / len(au_values)
if au_mean < 90.0:
    # log and fail
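
Read literally, the excerpt above would divide by zero when train_au_percentage is missing or empty; whether the PR guards against that elsewhere is not shown. A possible guarded variant is sketched below, where treating missing AU data as a failure is an assumption and the function name is hypothetical.

```python
AU_THRESHOLD = 90.0  # minimum AU percentage cited above


def au_check_passes(metrics):
    au_values = metrics.get("train_au_percentage", [])
    if not au_values:
        # Assumption: missing or empty AU data counts as a failure rather
        # than raising ZeroDivisionError.
        return False
    au_mean = sum(au_values) / len(au_values)
    return au_mean >= AU_THRESHOLD
```

For example, per-epoch values of 95.0 and 91.0 average 93.0 and pass, while 80.0 and 95.0 average 87.5 and fail.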

Test plan

  • pytest tests/unit -v passes with no new failures
  • pytest tests/integration -v passes where applicable
  • Ctrl+C during a benchmark run exits cleanly with a non-zero code (not a traceback)
  • A --params argument containing = in the value (e.g. key=val=extra) is parsed correctly
  • A submission with train_au_percentage values above 90% passes the AU check; values below 90% fail it
  • A checkpointing run that fails produces a logged error message, not a silent failure

Fixes span core framework stability, submission validation correctness,
and benchmark execution reliability. All changes are targeted one-line
or small-block fixes with no refactoring.

## CORE-1: Ctrl+C raises AttributeError instead of exiting cleanly
config.py — Added missing EXIT_CODE.INTERRUPTED (8) and EXIT_CODE.ERROR (9)
enum members. Without these, the SIGINT/SIGTERM signal handler called
sys.exit(EXIT_CODE.INTERRUPTED) and crashed with AttributeError before
the process could exit.

## CORE-2: Run entries validated against datagen metadata (stale path)
submission_checker/loader.py — Added metadata_path re-computation inside
the inner run-timestamp loop. Previously metadata_path was set only in the
outer datagen loop, so every run entry was loaded with the datagen's
metadata file. Checks like closed_submission_parameters and
verify_datasize_usage were auditing datagen invocation args instead of
run invocation args.

## CORE-3: --params values containing '=' crash before benchmark starts
benchmarks/dlio.py — Changed item.split("=") to item.split("=", 1) in
process_dlio_params(). Without maxsplit=1, any param value containing '='
(base64 credentials, S3 URIs, endpoint strings) produced more than 2 parts
and raised ValueError: too many values to unpack before the benchmark started.

## CORE-4: CheckpointingBenchmark._run() swallows exceptions silently
benchmarks/dlio.py — Added self.logger.error() call in the except block of
CheckpointingBenchmark._run(). Previously the caught exception 'e' was
discarded with no log output, making all checkpointing failures produce a
silent EXIT_CODE.FAILURE. Now matches the logging pattern in
TrainingBenchmark._run().

## CORE-5: UnboundLocalError masks original exception from _run()
benchmarks/base.py — Initialized result = EXIT_CODE.FAILURE before the
try block in Benchmark.run(). If _run() raised an exception, the finally
block completed cleanup correctly but then 'return result' hit
UnboundLocalError (result was never assigned), replacing the original
diagnostic exception with a secondary one. Also added EXIT_CODE to the
config import in base.py.

## CORE-6: JSONParser.__contains__ raises AttributeError on any 'in' test
submission_checker/parsers/json_parser.py — Changed self.messages to
self.d in __contains__. The attribute self.messages does not exist; the
parsed JSON dict is stored in self.d. Any 'key in parser' test raised
AttributeError.

## CORE-12: Unused pyarrow import makes pyarrow a hard benchmark dependency
benchmarks/base.py — Removed unused 'from pyarrow.ipc import open_stream'.
The symbol open_stream was never referenced in the file. The top-level
import forced pyarrow to be present at import time for all benchmarks,
failing with ImportError if absent. pyarrow remains in pyproject.toml as
it is needed by DLIO and parquet handling elsewhere.

## CORE-13: DLIO exit code discarded; failed runs report EXIT_CODE.SUCCESS
benchmarks/dlio.py — execute_command() now captures the return code from
_execute_command() and raises RuntimeError on non-zero. Previously the
return value was discarded entirely, so a DLIO crash or assertion failure
left TrainingBenchmark._run() returning SUCCESS and proceeding to validate
nonexistent or incomplete results.

## RULES-1: 500-steps dataset minimum formula is circular; check never fires
submission_checker/checks/training_checks.py — Replaced the circular
num_steps_per_epoch intermediate with the direct formula:
  min_samples_steps = MIN_STEPS_PER_EPOCH * batch_size * num_accelerators
The old code derived num_steps_per_epoch from the actual file count using
max(..., actual_steps), then multiplied back, so the threshold tracked the
submitted dataset instead of acting as a fixed floor and the constraint could
never produce a "too few files" error. The direct formula matches rules/utils.py.

## RULES-3: NameError in subset-mode process count check; check silently passes
submission_checker/checks/checkpointing_checks.py — Replaced undefined
model_key with model_name in the subset-mode error log. model_key was only
assigned inside the else branch (after a regex match), but was referenced in
the if branch. The NameError was silently swallowed, causing CLOSED subset-mode
submissions with the wrong process count to pass validation unchallenged.

## RULES-4: AU check reads nonexistent DLIO fields; every submission fails
submission_checker/checks/training_checks.py — Replaced lookups for
train_au_mean_percentage and train_au_meet_expectation (neither exists in
DLIO output) with the actual field train_au_percentage (a list of per-epoch
AU values). The check now computes the mean of that list and compares it
against the 90% minimum required by the MLPerf Storage rules (Rules.md §3.3.2).
Previously both .get() calls always returned their defaults (0 and ""),
making au_expectation != "success" always True and flagging every submission
as an AU failure regardless of actual utilization.
@FileSystemGuy FileSystemGuy requested a review from a team May 26, 2026 23:26
@github-actions

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
