
v1.0


@HuangHunag-MT released this 26 Aug 02:37
· 5 commits to main since this release
a613115

This release delivers a major step forward in the accuracy of memory and performance estimation for large models, and introduces several features that improve model compatibility, flexibility, and usability.

Highlights

  • Dramatically Improved Estimation Accuracy:
    • Memory Estimation: Expanded test coverage for both Dense and MoE models; memory estimation error is now consistently within 1% (see the error-metric sketch after this list).
    • Performance Estimation: On NVIDIA A100-PCIE, performance estimation error is consistently below 3%.
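
For context, the accuracy figures above are relative errors against measured runs. Below is a minimal sketch of that metric; the measured and estimated numbers are hypothetical, chosen only to illustrate the 1% and 3% bounds, not taken from this release:

```python
# Relative estimation error, as a fraction of the measured value.
# All numbers below are hypothetical, for illustration only.

def relative_error(estimated: float, measured: float) -> float:
    return abs(estimated - measured) / measured

# Memory: e.g. a model measured at 61.2 GB and estimated at 61.6 GB.
mem_err = relative_error(estimated=61.6, measured=61.2)
print(f"memory error: {mem_err:.2%}")  # ~0.65%, within the 1% bound

# Performance: e.g. a step measured at 1.250 s and estimated at 1.286 s.
perf_err = relative_error(estimated=1.286, measured=1.250)
print(f"perf error: {perf_err:.2%}")   # ~2.88%, below the 3% bound
```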

New Features & Enhancements

  • MLA Support:
    • Introduced support for the Multi-head Latent Attention (MLA) architecture; a rough KV-cache sketch follows this list.
  • Enhanced Layer Specification:
    • Added granular control over the number of layers in the first and last pipeline-parallel stages, enabling better-balanced model partitioning (see the partition sketch after this list).
  • Advanced MoE Customization:
    • Support for customizable dense layers in Mixture-of-Experts (MoE) models, providing greater flexibility in model design (also covered in the partition sketch below).
  • Megatron Compatibility Layer:
    • Launched a simplified model migration pipeline for effortless conversion and analysis of models built with NVIDIA's Megatron framework.
  • Optimized Recomputation Strategy:
    • Implemented finer-grained selective recompute, giving more precise control over the memory-for-computation trade-off when targeting larger models or higher throughput (a simplified cost-model sketch follows this list).
  • Comprehensive Efficiency Analysis:
    • New capability to measure and analyze efficiency and utilization across various tensor shapes and memory layouts.
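
To illustrate why MLA support matters for memory estimation: MLA caches a compressed latent vector plus a small decoupled RoPE key instead of full per-head keys and values. The sketch below compares per-token KV-cache sizes using DeepSeek-V2-style dimensions; the specific values are illustrative assumptions, not numbers from this release:

```python
# Per-token, per-layer KV-cache size: standard MHA caches full keys and
# values; MLA caches a compressed latent plus a small decoupled RoPE key.
# Dimensions below are DeepSeek-V2-style illustrative values.

BYTES = 2  # bf16/fp16 element size

def mha_kv_bytes(num_heads: int, head_dim: int) -> int:
    return 2 * num_heads * head_dim * BYTES  # K and V, all heads

def mla_kv_bytes(kv_lora_rank: int, rope_head_dim: int) -> int:
    return (kv_lora_rank + rope_head_dim) * BYTES  # latent + RoPE key

mha = mha_kv_bytes(num_heads=128, head_dim=128)         # 65,536 B
mla = mla_kv_bytes(kv_lora_rank=512, rope_head_dim=64)  # 1,152 B
print(f"MHA: {mha} B/token/layer, MLA: {mla} B/token/layer "
      f"({mha / mla:.0f}x smaller cache)")
```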
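The layer-specification and MoE-customization items above can be sketched together: build a per-layer type list in which the first few layers stay dense, then split the layers into pipeline stages with explicitly sized first and last stages. Every name here (`build_layer_types`, `split_stages`, `first_stage_layers`, ...) is hypothetical, not this project's API:

```python
# Hypothetical sketch: build a per-layer type list (dense vs. MoE) and
# split it into pipeline stages with custom first/last stage sizes.

def build_layer_types(num_layers: int, num_dense_layers: int) -> list[str]:
    # e.g. DeepSeek-style MoE models keep the first few FFN layers dense.
    return ["dense"] * num_dense_layers + ["moe"] * (num_layers - num_dense_layers)

def split_stages(num_layers: int, num_stages: int,
                 first_stage_layers: int, last_stage_layers: int) -> list[int]:
    # First and last stages are sized explicitly (they also hold the
    # embedding and LM head); middle stages share the rest evenly.
    middle = num_layers - first_stage_layers - last_stage_layers
    mid_stages = num_stages - 2
    assert middle % mid_stages == 0, "middle layers must divide evenly"
    return [first_stage_layers] + [middle // mid_stages] * mid_stages + [last_stage_layers]

layers = build_layer_types(num_layers=61, num_dense_layers=3)
sizes = split_stages(num_layers=61, num_stages=8,
                     first_stage_layers=5, last_stage_layers=8)
print(layers.count("dense"), "dense layers,", layers.count("moe"), "MoE layers")
print("layers per stage:", sizes)  # [5, 8, 8, 8, 8, 8, 8, 8]
```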
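And a sketch of the trade-off behind finer-grained selective recompute: recomputing activations for a subset of layers frees their activation memory at the cost of roughly one extra forward pass through those layers. The cost model is deliberately simplified and all numbers are illustrative:

```python
# Simplified cost model for selective activation recomputation:
# recomputed layers store (almost) no activations but pay roughly one
# extra forward pass in compute. All numbers/names are illustrative.

def recompute_tradeoff(recompute_layers: int, act_gb_per_layer: float,
                       fwd_tflops_per_layer: float) -> tuple[float, float]:
    saved_gb = recompute_layers * act_gb_per_layer
    extra_tflops = recompute_layers * fwd_tflops_per_layer
    return saved_gb, extra_tflops

for k in (0, 8, 16, 32):
    saved, extra = recompute_tradeoff(recompute_layers=k,
                                      act_gb_per_layer=1.5,
                                      fwd_tflops_per_layer=2.0)
    print(f"recompute {k:2d}/32 layers: save {saved:5.1f} GB, "
          f"+{extra:5.1f} TFLOPs per step")
```

Finer granularity means choosing not just how many layers to recompute, but which activations within a layer, so the estimator can search this trade-off curve more precisely.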

Bug Fixes

  • Fixed an incorrect token-count calculation when etp > 1 (see the token-accounting sketch after this list).
  • Corrected the FLOPs and memory-access (e.g., HBM access volume) calculations for several operators.
  • Resolved inaccuracies in the estimated communication volume and the associated data types.
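
For context on the first fix, here is the generic token accounting for top-k MoE routing, under the assumption that etp denotes the expert tensor-parallel size; this is background, not the corrected formula itself:

```python
# Generic top-k MoE token accounting under perfectly balanced routing.
# Illustrative background for the fix above, not the corrected formula.

def tokens_per_expert(total_tokens: int, top_k: int, num_experts: int) -> float:
    # Each token is dispatched to top_k experts, so expert inputs total
    # total_tokens * top_k across all experts.
    return total_tokens * top_k / num_experts

print(tokens_per_expert(total_tokens=8192, top_k=2, num_experts=64))  # 256.0

# Assumption: with etp > 1 (expert tensor parallelism) each expert's
# weights are sharded across etp ranks, but the tokens routed to that
# expert are not; an estimator that divided token counts by etp would
# understate per-rank FLOPs and activation memory.
```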