madengine v2 with unified framework for local and distribution#57
…, and K8s; Expanded command line interface;
…environment variables, which is particularly useful in CI/CD environments, containerized deployments, or when you want to avoid storing credentials in files
[WIP] Enhanced distributed execution with runners
…into coketaste/refactor
…t all Docker operations
… for truly successful runs while correctly identifying and handling various failure scenarios.
…engine into coketaste/refactor-dis
Silent error check via subprocess
Update the additional context and manifest printout
Update the interface of report to-html
* feat(madengine): ROCm path override and RPD e2e fix
  ROCM_PATH / --rocm-path:
  - Add get_rocm_path() and wire it through Context, ROCmToolManager, gpu_tool_factory, gpu_validator, container_runner, and the run orchestrator; add --rocm-path to the run CLI
  - Unit tests for get_rocm_path, Context, ROCmToolManager, and run --help; update fixtures and test_get_cached_managers for the (vendor, rocm_path) cache
  RPD e2e:
  - pre_scripts/trace.sh: install nlohmann-json3-dev (Ubuntu) and json-devel (CentOS) so the rocmProfileData rpd_tracer build finds nlohmann/json.hpp
* Updated docs
* Added madengine logo icon
* Resized the logo icon
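The commit above only names get_rocm_path() and the --rocm-path flag; a minimal sketch of the resolution order such a helper would likely implement (CLI flag, then ROCM_PATH environment variable, then the conventional default) could look like this. The function signature and the /opt/rocm default are assumptions for illustration, not madengine's actual code:

```python
import os

# Conventional ROCm install prefix; assumed default for this sketch
DEFAULT_ROCM_PATH = "/opt/rocm"

def get_rocm_path(cli_value=None):
    """Resolve the ROCm installation path.

    Assumed precedence, mirroring the commit description:
    1. explicit --rocm-path CLI value
    2. ROCM_PATH environment variable
    3. default /opt/rocm
    """
    if cli_value:
        return cli_value
    return os.environ.get("ROCM_PATH") or DEFAULT_ROCM_PATH
```

Wiring one resolved value through Context and the tool managers (rather than reading the environment in each component) keeps the (vendor, rocm_path) cache key consistent.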
…performance, metric, and status show in the correct column (#77). Use the header index instead of a fixed column order.
…to coketaste/refactor-dis
…super) (#78) K8s orchestrator now uses the same reporting path as single-node Docker when a multiple_results CSV is present, so both produce the same artifacts: perf.csv, perf_entry.json, perf_entry.csv, perf_super.json, perf_entry_super.json/csv, and perf_super.csv.
- Add a Docker-compatible reporting path in _collect_results:
  - _build_common_info_dict() for common_info (args, gpu_arch, etc.)
  - _ensure_perf_csv_exists() so update_perf_csv can read perf.csv
  - Call update_perf_csv, update_perf_super_json, and update_perf_super_csv with scripts_base_dir so --config from models.json resolves correctly
- Multi-node multiple_results: resolve one CSV path per run
  - _resolve_multiple_results_csv(): single pod → use that CSV; multi-pod → merge all pod CSVs with sum/average rules
  - _merge_multi_node_multiple_results_csv(): align rows by index; performance aggregated by metric type (throughput → sum, latency → avg, memory → max); extra columns by _aggregation_for_extra_column (sum/avg/max/first)
  - _aggregation_for_extra_column() for consistent multi-node semantics
- Keep the legacy row-by-row _write_to_perf_csv when the reporting module is unavailable; record a failure when no CSV is found
- job.yaml.j2: no functional change required; the existing copy block and find fallback for multiple_results already support this refactor
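The per-metric aggregation rules described above (throughput summed across nodes, latency averaged, memory taken as the max) can be sketched as a small dispatch function. The metric-name matching is an assumption; the commit does not specify how metric types are detected:

```python
from statistics import mean

def aggregate_metric(values, metric_name):
    """Combine one performance metric across pods/nodes.

    Assumed semantics, following the commit description:
    throughput-like metrics sum, latency-like metrics average,
    memory-like metrics take the max; anything else falls back
    to the first node's value.
    """
    name = metric_name.lower()
    if "throughput" in name or "samples_per_second" in name:
        return sum(values)
    if "latency" in name:
        return mean(values)
    if "memory" in name:
        return max(values)
    return values[0]
```

Centralizing the rule in one function (as _aggregation_for_extra_column apparently does) keeps single-node and multi-node reports semantically consistent.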
Resolve merge conflicts by keeping refactor-dis (v2) and discarding main (v1) changes:
- Remove src/madengine/mad.py and src/madengine/tools/run_models.py (deleted in v2; accept the deletion over main's modifications)
- Resolve the rocenv_tool.py conflict: keep the current-branch version for unknown GPU device handling
- Resolve tests/fixtures/dummy/models.json: keep the v2 fixture set (dummy_superset and the full model list) over main's therock-only entry
…V2 (#85)
* Fixed model discovery to support tags in subdirectories
* Made additional context text and file mutually exclusive on build and run
* refactor(madengine): improve code quality, remove dead code, reduce duplication
- Remove dead test: delete tests/test_cleanup.py (imported removed
madengine.tools.run_models).
- Constants: add _get_env_or_creds_or_default() in core/constants.py and
refactor NAS_NODES, MAD_AWS_S3, MAD_MINIO, PUBLIC_GITHUB_ROCM_KEY to use
it; add tests/unit/test_constants.py.
- Path helpers: add utils/path_utils.py with scripts_base_dir_from() and
get_madengine_root(); use in container_runner and kubernetes; add
tests/unit/test_path_utils.py.
- Run-details: add utils/run_details.py with get_pipeline(), get_build_number(),
and flatten_tags_in_place(); refactor container_runner and kubernetes to use
them; add tests/unit/test_run_details.py.
- Deployment: add apply_deployment_config() in config_loader and create_jinja_env()
in base; refactor slurm and kubernetes init to use them; add
TestApplyDeploymentConfig and tests/unit/test_deployment_base.py.
All changes preserve behavior. Unit test suite (157+ tests) passes.
* test: reorganize unit and integration tests, reduce duplication
Unit tests (tests/unit/):
- Remove duplicate get_rocm_path coverage from test_constants (kept in test_rocm_path).
- Consolidate by topic: test_utils (path_utils + run_details), test_deployment
(base + common), test_execution (container_runner_helpers + dockerfile_utils),
test_orchestration (image_filtering + orchestrator_logic).
- Parametrize and merge redundant cases (e.g. falsy inputs, env/default, launcher
normalization, timeout behavior).
- Drop duplicate test (e.g. gfx_compilation in dockerfile_utils) and trim
edge-case bloat; keep important behavior.
- Reduce from 16 to 12 unit test files; 203 tests passing.
Integration tests (tests/integration/):
- Fix 6 failures in multi-GPU-arch tests: use madengine.execution.dockerfile_utils
instead of removed DockerBuilder private methods.
- Merge error tests into test_errors.py (CLI + unified system, dedupe setup_logging
and context serialization).
- Merge batch manifest into test_orchestrator_workflows; merge multi_gpu_arch into
test_platform_integration.
- Reduce from 10 to 7 integration test files; 155 passed, 1 skipped.
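The consolidation above leans on pytest parametrization to merge redundant cases (falsy inputs, launcher normalization, and so on). A small illustration of the pattern, using a toy stand-in rather than madengine's real helper:

```python
import pytest

def normalize_launcher(value):
    # Toy stand-in for the kind of helper under test; not madengine's code.
    # Falsy or whitespace-only input collapses to None; otherwise lowercase.
    return (value or "").strip().lower() or None

@pytest.mark.parametrize(
    "raw, expected",
    [
        (None, None),        # falsy inputs merged into one test
        ("", None),
        ("  ", None),
        ("TORCHRUN", "torchrun"),
        (" mpirun ", "mpirun"),
    ],
)
def test_normalize_launcher(raw, expected):
    assert normalize_launcher(raw) == expected
```

One parametrized test replaces several near-identical functions, which is how a suite can shrink from 16 to 12 files while the test count grows.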
* Add shared e2e helpers in tests/fixtures/utils.py:
- build_run_command(), assert_model_in_perf_csv(),
get_run_live_log_path(), get_timeout_seconds_from_log()
- DEFAULT_CLEAN_FILES for consistent cleanup lists
- Use DEFAULT_CLEAN_FILES (and + extras) in all e2e modules:
test_run_workflows, test_execution_features, test_data_workflows,
test_build_workflows, test_profiling_workflows, test_scripting_workflows
- Parametrize context tests: merge test_dockerfile_picked_on_detected_context_0/1
into test_dockerfile_picked_on_detected_context (ctx_value, expected_performance)
- Parametrize timeout-in-log tests: replace four separate tests with
test_timeout_value_in_log (tags, log_base_name, expected_seconds, extra_args)
- Leave timeout-fire and other e2e tests unchanged; 60 e2e tests still pass
* Stop tracking run_directory (now in .gitignore)
* Multi-node SLURM: no node writes perf.csv; after the job, collect_results parses each node's .out file, aggregates (e.g. sum for samples_per_second), and writes one row. A 2N×8G run should report about 2× the 1N×8G throughput.
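The post-job collection step described above can be sketched as parsing a throughput line out of each node's .out text and summing. The `metric: value` log format and the function shape are assumptions for illustration; only the parse-then-sum behavior comes from the commit message:

```python
import re

def collect_results(out_texts, metric="samples_per_second"):
    """Sum one throughput metric across per-node SLURM .out logs.

    Assumes each log contains a line like 'samples_per_second: 120.5'
    (format is hypothetical). Nodes whose log lacks the metric
    contribute nothing to the total.
    """
    pattern = re.compile(rf"{re.escape(metric)}\s*[:=]\s*([0-9.]+)")
    total = 0.0
    for text in out_texts:
        match = pattern.search(text)
        if match:
            total += float(match.group(1))
    return total
```

With sum as the aggregation, doubling the node count roughly doubles the reported throughput, which is exactly the 2N×8G ≈ 2× 1N×8G expectation stated above.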
* Refactored slurm_output to slurm_results to improve management of the results collected from each node
* Fixed the workload's output to slurm results
* Refactored the reporting system for local, K8s, and SLURM to handle single-result and multi-result cases
* Fixed multi-result collection for multi-node runs
* Fixed the missing test duration in the multi-result case
* Improve madengine reliability, error handling, and exit codes for Jenkins
- Record pre-run failures (e.g. pull/setup) in the performance CSV with status
FAILURE and add to failed_runs so the perf table and exit code stay correct.
- Preserve typer.Exit in cli_main so run/build exit codes (0/1/2/3/4) are
passed through to the shell/Jenkins.
- Use ExitCode.BUILD_FAILURE (2) when BuildError is raised; let BuildError
propagate from RunOrchestrator instead of wrapping in MADRuntimeError.
- Make update_perf_csv create the perf CSV with PERF_CSV_HEADER when the file
is missing but a result (exception/single/multiple) is provided.
- Use shared PERF_CSV_HEADER in container_runner.ensure_perf_csv_exists and
fix read_json to use a context manager for the file handle.
- Add unit tests for exit codes, cli_main typer.Exit preservation, BuildError
→ BUILD_FAILURE, setup-failure recording to CSV, and update_perf_csv
create-when-missing; fix unclosed file in update_perf_csv.
* Cleanup
* Updated docs of project
* Update model discovery to handle tags in subdirectories (#83)
---------
Co-authored-by: leconcio <leconcio@amd.com>
Add --tags scope/tag (e.g. MAD-private/inference) so discovery is limited to models under scripts/<scope>/ with the given tag. Exactly one / and no : in the tag enables scoped matching; scope/all selects all models in that scope.
- Parse scope/tag in DiscoverModels.select_models()
- Add _model_entry_has_tag() for list or comma-separated tag checks
- Document scoped tags in the discover command help
- Add unit tests in tests/unit/test_discover_models.py
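The scoped-matching rule above (exactly one / and no : enables the scope/tag form) can be sketched as a small parser. The function name and the (scope, tag) return shape are assumptions; only the matching rule comes from the description:

```python
def parse_scoped_tag(tag):
    """Split a --tags value into (scope, tag) when it matches the
    scoped form: exactly one '/' and no ':'. Plain tags (and values
    with a ':' or multiple '/') return (None, tag) unchanged.
    Illustrative sketch only.
    """
    if tag.count("/") == 1 and ":" not in tag:
        scope, name = tag.split("/")
        return scope, name
    return None, tag
```

Under this rule, MAD-private/inference selects models under scripts/MAD-private/ tagged inference, while a value like image:tag stays a plain tag and is never mistaken for a scope.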
- Update .gitignore, docker.py, slurm.py - dummy_vllm: Dockerfile (coreutils), run.sh, run_vllm_inference.py - dummy_torchtitan: Dockerfile, run.sh - dummy_sglang: README, run.sh, run_sglang_inference.py
…ngine into coketaste/refactor-dis