fix: resolve 11 high/medium severity bugs from broad codebase scan#395
Open
FileSystemGuy wants to merge 1 commit into
Open
fix: resolve 11 high/medium severity bugs from broad codebase scan#395FileSystemGuy wants to merge 1 commit into
FileSystemGuy wants to merge 1 commit into
Conversation
Fixes span core framework stability, submission validation correctness,
and benchmark execution reliability. All changes are targeted one-line
or small-block fixes with no refactoring.
## CORE-1: Ctrl+C raises AttributeError instead of exiting cleanly
config.py — Added missing EXIT_CODE.INTERRUPTED (8) and EXIT_CODE.ERROR (9)
enum members. Without these, the SIGINT/SIGTERM signal handler called
sys.exit(EXIT_CODE.INTERRUPTED) and crashed with AttributeError before
the process could exit.
## CORE-2: Run entries validated against datagen metadata (stale path)
submission_checker/loader.py — Added metadata_path re-computation inside
the inner run-timestamp loop. Previously metadata_path was set only in the
outer datagen loop, so every run entry was loaded with the datagen's
metadata file. Checks like closed_submission_parameters and
verify_datasize_usage were auditing datagen invocation args instead of
run invocation args.
## CORE-3: --params values containing '=' crash before benchmark starts
benchmarks/dlio.py — Changed item.split("=") to item.split("=", 1) in
process_dlio_params(). Without maxsplit=1, any param value containing '='
(base64 credentials, S3 URIs, endpoint strings) produced more than 2 parts
and raised ValueError: too many values to unpack before the benchmark started.
## CORE-4: CheckpointingBenchmark._run() swallows exceptions silently
benchmarks/dlio.py — Added self.logger.error() call in the except block of
CheckpointingBenchmark._run(). Previously the caught exception 'e' was
discarded with no log output, making all checkpointing failures produce a
silent EXIT_CODE.FAILURE. Now matches the logging pattern in
TrainingBenchmark._run().
## CORE-5: UnboundLocalError masks original exception from _run()
benchmarks/base.py — Initialized result = EXIT_CODE.FAILURE before the
try block in Benchmark.run(). If _run() raised an exception, the finally
block completed cleanup correctly but then 'return result' hit
UnboundLocalError (result was never assigned), replacing the original
diagnostic exception with a secondary one. Also added EXIT_CODE to the
config import in base.py.
## CORE-6: JSONParser.__contains__ raises AttributeError on any 'in' test
submission_checker/parsers/json_parser.py — Changed self.messages to
self.d in __contains__. The attribute self.messages does not exist; the
parsed JSON dict is stored in self.d. Any 'key in parser' test raised
AttributeError.
## CORE-12: Unused pyarrow import makes pyarrow a hard benchmark dependency
benchmarks/base.py — Removed unused 'from pyarrow.ipc import open_stream'.
The symbol open_stream was never referenced in the file. The top-level
import forced pyarrow to be present at import time for all benchmarks,
failing with ImportError if absent. pyarrow remains in pyproject.toml as
it is needed by DLIO and parquet handling elsewhere.
## CORE-13: DLIO exit code discarded; failed runs report EXIT_CODE.SUCCESS
benchmarks/dlio.py — execute_command() now captures the return code from
_execute_command() and raises RuntimeError on non-zero. Previously the
return value was discarded entirely, so a DLIO crash or assertion failure
left TrainingBenchmark._run() returning SUCCESS and proceeding to validate
nonexistent or incomplete results.
## RULES-1: 500-steps dataset minimum formula is circular; check never fires
submission_checker/checks/training_checks.py — Replaced the circular
num_steps_per_epoch intermediate with the direct formula:
min_samples_steps = MIN_STEPS_PER_EPOCH * batch_size * num_accelerators
The old code derived num_steps_per_epoch from the actual file count using
max(..., actual_steps), then multiplied back. Because actual_steps >= itself,
min_samples_steps was always >= actual samples, so the constraint could never
produce a "too few files" error. The direct formula matches rules/utils.py.
## RULES-3: NameError in subset-mode process count check; check silently passes
submission_checker/checks/checkpointing_checks.py — Replaced undefined
model_key with model_name in the subset-mode error log. model_key was only
assigned inside the else branch (after a regex match), but was referenced in
the if branch. The NameError was silently swallowed, causing CLOSED subset-mode
submissions with the wrong process count to pass validation unchallenged.
## RULES-4: AU check reads nonexistent DLIO fields; every submission fails
submission_checker/checks/training_checks.py — Replaced lookups for
train_au_mean_percentage and train_au_meet_expectation (neither exists in
DLIO output) with the actual field train_au_percentage (a list of per-epoch
AU values). The check now computes the mean of that list and compares it
against the 90% minimum required by the MLPerf Storage rules (Rules.md §3.3.2).
Previously both .get() calls always returned their defaults (0 and ""),
making au_expectation != "success" always True and flagging every submission
as an AU failure regardless of actual utilization.
|
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR fixes 11 bugs identified in a broad static analysis of the codebase. The findings span core framework stability, submission validation correctness, and benchmark execution reliability. Every change is a targeted fix — no refactoring or feature additions.
Issues are grouped by subsystem below. Each entry includes the root cause and what was done to fix it.
Core Framework
CORE-1 — Ctrl+C raises
AttributeErrorinstead of exiting cleanlyFile:
mlpstorage_py/config.pyEXIT_CODE.INTERRUPTEDandEXIT_CODE.ERRORwere referenced inmain.pybut absent from theEXIT_CODEenum. Pressing Ctrl+C triggered the SIGINT handler which calledsys.exit(EXIT_CODE.INTERRUPTED), raisingAttributeError: INTERRUPTED is not a valid EXIT_CODEand crashing with a traceback instead of exiting cleanly.Fix: Added
INTERRUPTED = 8andERROR = 9to theEXIT_CODEenum inconfig.py.CORE-2 — Run entries validated against datagen metadata
File:
mlpstorage_py/submission_checker/loader.pyIn the submission loader,
metadata_pathwas set inside the outer loop over datagen timestamps and never updated in the inner loop over run timestamps. Every run entry was therefore loaded with the datagen's metadata file. Checks such asclosed_submission_parametersandverify_datasize_usagewere reading datagen invocation arguments instead of run invocation arguments, silently corrupting all parameter validation for training runs.Fix: Added
metadata_path = self.find_metadata_path(timestamp_path)as the first statement inside the inner run-timestamp loop. Cross-phase checks that need datagen params already iterateself.submissions_logs.datagen_filesexplicitly and are unaffected.CORE-3 —
--paramsvalues containing=crash before benchmark startsFile:
mlpstorage_py/benchmarks/dlio.pyprocess_dlio_params()split each--paramsargument on every=character with nomaxsplit. Any param value containing=(base64 credentials, S3 endpoint URIs, connection strings) produced more than 2 parts, causingValueError: too many values to unpackbefore the benchmark started. Object-storage workloads with credential parameters were entirely blocked.Fix: Changed
item.split("=")toitem.split("=", 1).CORE-4 —
CheckpointingBenchmark._run()swallows all exceptions silentlyFile:
mlpstorage_py/benchmarks/dlio.pyThe
except Exception as eblock inCheckpointingBenchmark._run()discarded the caught exception without logging it, returningEXIT_CODE.FAILUREwith no diagnostic output. Any failure inexecute_command()ordatasize()was invisible to the operator.TrainingBenchmark._run()already loggedstr(e)correctly.Fix: Added
self.logger.error(f'Checkpointing benchmark failed: {e}')before thereturn.CORE-5 —
UnboundLocalErrorreplaces original exception from_run()File:
mlpstorage_py/benchmarks/base.pyIn
Benchmark.run(),result = self._run()was inside atrywith afinallythat performed cleanup. If_run()raised an exception, thefinallyblock ran correctly but execution then fell through toreturn result. Sinceresultwas never assigned, Python raisedUnboundLocalError: local variable 'result' referenced before assignment, replacing the original diagnostic exception with a secondary one and destroying failure information.Fix: Initialized
result = EXIT_CODE.FAILUREimmediately before thetryblock. Also addedEXIT_CODEto the import frommlpstorage_py.configinbase.py.CORE-6 —
JSONParser.__contains__raisesAttributeErroron anyintestFile:
mlpstorage_py/submission_checker/parsers/json_parser.py__contains__returnedkey in self.messages, butself.messagesdoes not exist onJSONParser— the parsed JSON dict is stored inself.d. Anykey in parsertest raisedAttributeError: 'JSONParser' object has no attribute 'messages'.Fix: Changed
self.messagestoself.d.CORE-12 — Unused
pyarrowimport makes it a hard dependency of all benchmarksFile:
mlpstorage_py/benchmarks/base.pyfrom pyarrow.ipc import open_streamwas present at the top ofbase.pybutopen_streamwas never referenced anywhere in the file. Because it was a top-level import, any environment withoutpyarrowinstalled would fail to import the benchmark base class entirely, breaking all benchmarks withImportError.pyarrowis retained inpyproject.tomlas it is needed by DLIO and parquet handling elsewhere.Fix: Removed the unused import line.
CORE-13 — DLIO exit code discarded; failed runs report
EXIT_CODE.SUCCESSFile:
mlpstorage_py/benchmarks/dlio.pyexecute_command()calledself._execute_command(...)but discarded the(stdout, stderr, return_code)return value. If DLIO exited with a non-zero code (OOM, assertion failure, I/O error),execute_command()returned silently andTrainingBenchmark._run()returnedEXIT_CODE.SUCCESS. Results validation then proceeded against nonexistent or incomplete output files.Fix:
execute_command()now unpacks the return value and raisesRuntimeErrorif the return code is non-zero. The existingexcept Exceptionhandler in_run()(improved by CORE-4) catches and logs this.Submission Validation
RULES-1 — 500-steps dataset minimum formula is circular; constraint never fires
File:
mlpstorage_py/submission_checker/checks/training_checks.pyThe dataset minimum size check computed:
Because the second argument to
max()is derived from the actual file count,num_steps_per_epochis always ≥ the actual steps, makingmin_samples_stepsalways ≥ the actual sample count. The steps constraint could never produce a "too few files" error. The canonical computation inrules/utils.pydoes not have this defect.Fix: Replaced the two-line calculation with the direct formula:
RULES-3 —
NameErrorin subset-mode process count check; check silently passesFile:
mlpstorage_py/submission_checker/checks/checkpointing_checks.pyIn the closed-submission process count check,
model_keywas used in the error log inside theif checkpoint_mode == "subset":branch, butmodel_keywas only assigned inside theelse:branch (after a regex match on the model name). When a CLOSED subset-mode submission had the wrong process count, the code hitNameError, which was silently swallowed, and the check returned as if it passed. The required 8-process count for subset mode was never enforced.Fix: Replaced
model_keywithmodel_namein the subset-mode error log.model_nameis assigned unconditionally at the top of the loop body and is the correct identifier for the message.RULES-4 — AU check reads nonexistent DLIO fields; every submission fails spuriously
File:
mlpstorage_py/submission_checker/checks/training_checks.pyThe accelerator utilization check read
train_au_mean_percentageandtrain_au_meet_expectationfrom the DLIO summary JSON. Neither field exists in actual DLIO output. The real field istrain_au_percentage, a list of per-epoch AU percentage values. Both.get()calls always returned their defaults (0 and""), causingau_expectation != "success"to always beTrue. Every training submission was flagged as an AU failure regardless of actual utilization, making it impossible to distinguish passing from failing submissions.Fix: Replaced the broken field lookups with logic that reads
train_au_percentage, computes the mean, and compares it against the 90% minimum threshold specified in the MLPerf Storage rules (Rules.md §3.3.2):Test plan
pytest tests/unit -vpasses with no new failurespytest tests/integration -vpasses where applicable--paramsargument containing=in the value (e.g.key=val=extra) is parsed correctlytrain_au_percentagevalues above 90% passes the AU check; values below 90% fail it