v1.0
This release delivers a major improvement in the accuracy of memory and performance estimation for large models. It also introduces several features that enhance model compatibility, flexibility, and user experience.
Highlights
- Dramatically Improved Estimation Accuracy:
  - Memory Estimation: Expanded test coverage for both Dense and MoE models; memory estimation error is now consistently within 1%.
  - Performance Estimation: On NVIDIA A100-PCIE, performance estimation error is consistently below 3% (the error metric is sketched below).
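For reference, the accuracy figures above are relative errors of the estimate against a measured baseline. A minimal sketch of that metric, using made-up numbers rather than actual measurements:

```python
def relative_error(estimated: float, measured: float) -> float:
    """Relative estimation error, as a fraction of the measured value."""
    return abs(estimated - measured) / measured

# Hypothetical numbers for illustration only (not real measurements):
# a 79.4 GB memory estimate against an 80.0 GB measurement.
err = relative_error(estimated=79.4, measured=80.0)
print(f"{err:.2%}")  # 0.75% -- within the 1% bound cited above
```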
New Features & Enhancements
- MLA Support:
  - Introduced support for the Multi-head Latent Attention (MLA) model architecture.
- Enhanced Layer Specification:
  - Added granular control over the number of first-stage and last-stage layers in pipeline parallelism, allowing for more optimized model partitioning (sketched after this list).
- Advanced MoE Customization:
  - Support for customizable dense layers in Mixture-of-Experts (MoE) models, providing greater flexibility in model design (a configuration sketch follows the list).
- Megatron Compatibility Layer:
  - Launched a simplified migration pipeline for straightforward conversion and analysis of models built with NVIDIA's Megatron framework.
- Optimized Recomputation Strategy:
  - Implemented finer-grained selective recomputation, enabling more precise control over the memory-for-compute trade-off when optimizing for larger models or higher throughput (illustrated below).
- Comprehensive Efficiency Analysis:
  - New capability to measure and analyze efficiency and utilization across various tensor shapes and memory layouts (see the benchmark sketch below).
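To make the pipeline-stage layer control concrete, here is a minimal sketch of how explicit first/last stage sizes change a layer partition. The helper and parameter names are hypothetical, not this tool's actual API:

```python
def partition_layers(num_layers: int, pp_size: int,
                     first_stage_layers: int, last_stage_layers: int) -> list[int]:
    """Split num_layers across pp_size pipeline stages, with explicit
    layer counts for the first and last stages (hypothetical helper)."""
    middle = num_layers - first_stage_layers - last_stage_layers
    middle_stages = pp_size - 2
    assert middle % middle_stages == 0, "middle layers must divide evenly"
    per_stage = middle // middle_stages
    return [first_stage_layers] + [per_stage] * middle_stages + [last_stage_layers]

# Embedding and loss overheads often motivate lighter end stages:
print(partition_layers(num_layers=32, pp_size=4,
                       first_stage_layers=7, last_stage_layers=7))
# [7, 9, 9, 7]
```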
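For the MoE dense-layer customization, a common pattern is to keep the first k transformer blocks dense and make the rest MoE, as several published MoE models do. The config below is a hypothetical illustration, not this tool's schema:

```python
# Hypothetical model config: keep the first 3 transformer blocks dense,
# use MoE FFNs everywhere else.
moe_config = {
    "num_layers": 28,
    "first_k_dense": 3,   # hypothetical knob: dense blocks before MoE starts
    "num_experts": 64,
    "top_k": 6,
}

def is_moe_layer(layer_idx: int, cfg: dict) -> bool:
    """MoE applies only after the leading dense blocks."""
    return layer_idx >= cfg["first_k_dense"]

print([is_moe_layer(i, moe_config) for i in range(5)])
# [False, False, False, True, True]
```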
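Finer-grained selective recomputation trades memory for compute by re-running chosen sub-modules during the backward pass instead of storing their activations. As a generic PyTorch sketch of the idea (not this tool's implementation), recomputation can be applied per layer:

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim), torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.ff(x)

# Hypothetical policy: recompute only the first 2 of 4 blocks,
# freeing their activations while keeping the rest fast.
blocks = torch.nn.ModuleList(Block(256) for _ in range(4))
recompute = {0, 1}

x = torch.randn(8, 256, requires_grad=True)
for i, blk in enumerate(blocks):
    if i in recompute:
        # Activations are discarded here and recomputed in backward.
        x = checkpoint(blk, x, use_reentrant=False)
    else:
        x = blk(x)
x.sum().backward()
```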
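The efficiency analysis measures achieved throughput against hardware peak across tensor shapes. A minimal GEMM benchmark sketch, where the peak figure and shapes are illustrative assumptions (requires a CUDA device):

```python
import time
import torch

PEAK_TFLOPS = 312.0  # illustrative: A100 dense FP16 tensor-core peak

def matmul_utilization(m: int, n: int, k: int, iters: int = 50) -> float:
    """Achieved FLOP/s of an FP16 GEMM as a fraction of the assumed peak."""
    a = torch.randn(m, k, device="cuda", dtype=torch.float16)
    b = torch.randn(k, n, device="cuda", dtype=torch.float16)
    for _ in range(5):                 # warm-up
        a @ b
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    achieved = 2 * m * n * k * iters / elapsed / 1e12  # TFLOP/s
    return achieved / PEAK_TFLOPS

for shape in [(4096, 4096, 4096), (4096, 4096, 1024), (8192, 128, 8192)]:
    print(shape, f"{matmul_utilization(*shape):.1%}")
```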
Bug Fixes
- Fixed an incorrect token-count calculation when etp > 1.
- Corrected the FLOPs and memory-access (e.g., HBM access volume) calculations for several operators (the standard counting convention is sketched below).
- Resolved inaccuracies in estimated communication volumes and their associated data types.
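For context on the operator-cost fix, FLOPs and HBM traffic for a GEMM are conventionally counted as follows. This is the standard cost model, not necessarily the tool's exact code:

```python
def gemm_costs(m: int, n: int, k: int, bytes_per_elem: int = 2):
    """Standard cost model for C[m,n] = A[m,k] @ B[k,n].

    FLOPs: one multiply + one add per (m, n, k) triple.
    HBM traffic (lower bound): read A and B once, write C once.
    """
    flops = 2 * m * n * k
    mem_bytes = (m * k + k * n + m * n) * bytes_per_elem
    return flops, mem_bytes

flops, mem = gemm_costs(4096, 4096, 4096)
print(f"{flops / 1e12:.2f} TFLOPs, {mem / 1e6:.1f} MB")
# 0.14 TFLOPs, 100.7 MB
```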