Releases: ROCm/madengine
v2.0.0
🎉 What's New
madengine v2.0 is a complete rewrite of the MAD orchestration engine with a modern, production-ready architecture. This major release replaces the legacy v1.x codebase with a unified CLI, comprehensive error handling, and support for distributed AI workloads across Kubernetes and SLURM.
🚀 Key Highlights
Unified CLI Experience
One command to rule them all: madengine now provides a consistent interface for all operations.
Multi-Target Deployment
Run AI workloads wherever you need them:
- Local: Direct Docker execution for development and single-GPU jobs
- Kubernetes: Production-ready K8s Jobs with full launcher support
- SLURM: HPC cluster integration with intelligent job scheduling
Distributed Framework Support
Native support for 6 distributed training and inference frameworks:
Training:
- torchrun (PyTorch DDP/FSDP)
- DeepSpeed (ZeRO optimization)
- Megatron-LM (large-scale transformers)
- TorchTitan (LLM pre-training with FSDP2+TP+PP+CP)
Inference:
- vLLM (high-throughput LLM inference)
- SGLang (structured generation)
All launchers work seamlessly with both Kubernetes and SLURM deployments.
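As a rough illustration of what a launched worker sees, the sketch below reads the rank and world-size variables that torchrun exports to each process (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`). The `DistEnv` class and `read_dist_env` helper are hypothetical names used only for this example; how madengine itself wires these values through Kubernetes or SLURM is not described in these notes.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    """Hypothetical container for the per-process values torchrun exports."""
    rank: int
    local_rank: int
    world_size: int


def read_dist_env(environ=None):
    """Read RANK, LOCAL_RANK and WORLD_SIZE, falling back to single-process defaults.

    Assumption: the process was started by torchrun, which sets these variables.
    When they are absent (for example, a plain `python script.py` run), the
    defaults describe a single process with rank 0.
    """
    env = os.environ if environ is None else environ
    return DistEnv(
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
    )


if __name__ == "__main__":
    dist = read_dist_env()
    print(f"rank {dist.rank} of {dist.world_size} (local rank {dist.local_rank})")
```

Falling back to single-process defaults lets the same script run unchanged on a local development machine and under a multi-node launcher.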
Advanced Profiling
Comprehensive ROCm profiling suite for AMD GPUs:
- 8 pre-configured profiles: compute, memory, communication, full analysis, and more
- ROCprofv3 support: Latest ROCm 7.0+ profiling capabilities
- Perfetto integration: Generate traces for Perfetto UI visualization
- Ready-to-use configs: 6 example configurations in examples/profiling-configs/
Production-Grade Quality
- 4.5/5 code quality rating (detailed metrics in CODE_QUALITY_REPORT_v2.md)
- 71% type hint coverage with mypy validation
- Zero technical debt: No TODO/FIXME/HACK markers
- Pre-commit hooks: Automated quality checks (black, isort, flake8, mypy, bandit)
- Security fixes: SQL injection vulnerability patched, improved exception handling
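These notes do not say which query was affected or how it was patched. As a general illustration of the usual remedy for SQL injection, the sketch below uses a parameterized query with Python's standard `sqlite3` module. The `runs` table and the `find_runs_by_model` function are hypothetical and are not taken from the madengine codebase.

```python
import sqlite3


def find_runs_by_model(conn, model_name):
    """Look up rows by model name using a bound parameter.

    The `?` placeholder makes the driver treat `model_name` strictly as data,
    so quote characters in the input cannot change the structure of the query.
    """
    cur = conn.execute(
        "SELECT model, status FROM runs WHERE model = ?",
        (model_name,),
    )
    return cur.fetchall()


# Hypothetical in-memory table, used only to exercise the function.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (model TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO runs VALUES (?, ?)",
    [("bert", "ok"), ("gpt2", "failed")],
)

if __name__ == "__main__":
    print(find_runs_by_model(conn, "bert"))
    # A classic injection string is matched literally and finds no rows.
    print(find_runs_by_model(conn, "x' OR '1'='1"))
```

Building the same query with string formatting would let the second input rewrite the `WHERE` clause; binding the value avoids that without any manual escaping.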
What's Changed
- madengine v2 with unified framework for local and distribution by @coketaste in #57
Full Changelog: v1.0.0...v2.0.0
v1.0.0
What's Changed
- Update README.md by @gargrahul in #1
- Update the scripts and dockers in madengine package by @coketaste in #2
- Add support of deprecated models by @coketaste in #4
- Fix the failure of unit tests by @coketaste in #6
- Use normpath and improve override argument parsing in madengine discover by @Rohan138 in #7
- Fix small issues with madengine by @GeneDer in #5
- Fix docker sha inspect by @Rohan138 in #9
- Fix the location of error in perf csv update: by @coketaste in #13
- shared memory config in docker run by @coketaste in #10
- Revert "shared memory config in docker run" by @gargrahul in #21
- Share memory control, disable ipc option when shm-size is set by @coketaste in #22
- Add MAD_SYSTEM_GPU_PRODUCT_NAME to the madengine by @coketaste in #33
- fix GPU product name on MI250,MI355, and other platforms by @Rohan138 in #34
- Refactor rocm-smi to amd-smi by @coketaste in #19
- Update profiler and tracing with ROCm7 and amd-smi by @coketaste in #44
- Add self test for MAD_SYSTEM_GPU_PRODUCT_NAME by @ahmed-bsod in #39
- Update amd-smi and utils for ROCm7 by @coketaste in #48
- Fix DataFrame concatenation warning by @ahmed-bsod in #40
- Make the validation logic smarter by @coketaste in #49
- Fix profiling using amdsmi_cli python module by @coketaste in #50
- Add proper support for multiple_columns by @Rohan138 in #51
- Add TheRock model for validation by @coketaste in #53
- Fix the cleanup by @coketaste in #60
- Perf entry superset by @coketaste in #58
- Revert "Perf entry superset" by @gargrahul in #66
- Fail Check condition update for RPM distro by @shashank-parsi in #64
- update model discovery to handle tags in subdirectories for madenginev1 by @leconcio in #83
- rocm-smi back call if amd-smi missing by @coketaste in #54
- Enhanced Perf Metric Reporting System by @coketaste in #65
New Contributors
- @gargrahul made their first contribution in #1
- @Rohan138 made their first contribution in #7
- @GeneDer made their first contribution in #5
- @ahmed-bsod made their first contribution in #39
- @shashank-parsi made their first contribution in #64
Full Changelog: https://github.com/ROCm/madengine/commits/v1.0.0