[Build Speed][WIP] Dynamic Type, Polymorphic Value, and Precompiled Headers #5747
base: main
Preparatory refactor for wrapper class conversion. No behavior change - just moves the DynamicType alias into detail::DynamicTypeAlias and re-exports as PolymorphicValue.
This reverts commit 339731c.
…, wrapper dynamic_type.h)
…r extern template suppression. Reduces compile time by 56% and template instantiation by 75%.
…nment) to friend functions.
… guards. Fix tests for new error messages.
Reduces template instantiation by 28% by confining ForAllTypes dispatch to one TU.
Precompile polymorphic_value.h to eliminate ~4000s of redundant header parsing. Enabled by default for Release builds. Disable with -DNVFUSER_USE_POLYMORPHIC_PCH=OFF.
Replaces ForAllTypes/dispatch with fold expression dispatch, eliminating template overhead.
…rAllTypes/Void overhead and fix Clang 18 template crash
…wise, named comparisons). Uses macro-generated switch statements supporting up to 16 type alternatives.
…cpp with -fvisibility=default. Resolves undefined symbol error when importing nvfuser.
!test
Description
PR Reviewer Guide
Here are some key observations to aid the review process:
| 🧪 PR contains tests |
| ⚡ Recommended focus areas for review |
Header Refactoring
Test failures
- (High, 44) NCCL NVLS multicast memory bind failures in multi-device distributed tests (dtensor/matmul/overlap/transformer) on dlcluster_viking_ci

| Test Name | H100 (dist.) | Source |
|---|---|---|
| tests.python.multidevice.test_communication.test_allgather | ❌ | |
| tests.python.multidevice.test_communication.test_allgather_expanded_broadcast | ❌ | |
| tests.python.multidevice.test_communication.test_allreduce | ❌ | |
| tests.python.multidevice.test_communication.test_reduce_scatter | ❌ | |
| tests.python.multidevice.test_communication.test_reduce_scatter_noncontiguous | ❌ | |
| tests.python.multidevice.test_dtensor.test_column_parallel_linear | ❌ | |
| tests.python.multidevice.test_dtensor.test_plus_one | ❌ | |
| tests.python.multidevice.test_dtensor.test_row_parallel_linear | ❌ | |
| tests.python.multidevice.test_expert_parallel.test_dispatch_and_combine | ❌ | |
| tests.python.multidevice.test_matmul.test_column_parallel_grouped_mm | ❌ | |

... with 34 more test failures omitted. Check internal logs.

- (High, 1) NCCL invalid usage error in multidevice overlap tests (test_overlap_allgather_matmul_shard_outermost)

| Test Name | H100 (dist.) | Source |
|---|---|---|
| tests.python.multidevice.test_overlap.test_overlap_allgather_matmul_shard_outermost[backend_type=CommunicatorBackend.cuda] | ❌ | |
Build Time Improvements (GCC)
*M10 GCC measurement pending; estimate based on Clang improvement (6m 55s, -45% vs M9).
Template instantiation reduced by 94%+ from original baseline.
Milestone 8: Friend Functions + Extern Template
Build time: 19m 34s → 15m 12s (-22%)
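The friend-function plus `extern template` pattern named in this milestone can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `Dyn`, `add_impl`, `Number`, and `demo_sum` are hypothetical names, and the real `DynamicType` has many more operators. The operator is a thin hidden friend that forwards to a static member, so an explicit instantiation of the class covers the heavy `std::visit` body.

```cpp
#include <variant>

// Hypothetical stand-in for DynamicType (assumption, not the nvFuser class).
template <typename... Ts>
struct Dyn {
  std::variant<Ts...> value;

  // Static member holding the expensive body. As a non-inline class member it
  // is covered by an explicit instantiation of Dyn<Ts...>.
  static Dyn add_impl(const Dyn& a, const Dyn& b);

  // Hidden friend: found by argument-dependent lookup, forwards to add_impl.
  friend Dyn operator+(const Dyn& a, const Dyn& b) { return add_impl(a, b); }
};

template <typename... Ts>
Dyn<Ts...> Dyn<Ts...>::add_impl(const Dyn& a, const Dyn& b) {
  // Visit both alternatives and wrap the sum back into the variant.
  return std::visit(
      [](const auto& x, const auto& y) -> Dyn { return Dyn{x + y}; },
      a.value, b.value);
}

using Number = Dyn<int, double>;

// Would live in the header: tells every TU not to instantiate implicitly.
extern template struct Dyn<int, double>;

// Would live in exactly one .cpp file: the single explicit instantiation.
template struct Dyn<int, double>;

// Small usage helper: 1 (int) + 2.5 (double) is stored as a double.
inline double demo_sum() {
  Number a{1};
  Number b{2.5};
  Number c = a + b;
  return std::get<double>(c.value);
}
```

In a real build the `extern template` declaration and the explicit instantiation definition would sit in different files; they are shown together here only so the sketch compiles as one unit.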
Converted all DynamicType operators from free function templates to friend functions with static member implementations. This allows `extern template` to properly suppress instantiation across translation units; previously, free function templates were instantiated in every TU regardless of extern template declarations.
Milestone 9: Function Moving + Expanded PCH
Build time: 15m 12s → 12m 28s (-18%)
With M8's template reduction, header parsing became a significant cost (~48% of frontend time, up from 6%). Applied multiple optimizations:
Task 2: Function Moving
Moved `getDataType()` and `castToDtype()` from `type.h` to `type.cpp`. GCC: 15m 12s → 14m 50s (-2.4%).
Task 3: Narrow PCH
Precompiled header for `polymorphic_value.h`. GCC: 14m 50s → 13m 33s (-9%).
Task 4: Shared Test PCH
Discovered 20 redundant PCH files (9.2 GB total) being created for test targets. Consolidated to 2 shared PCH files, saving 8.3 GB. GCC: 13m 33s → 12m 42s (-6%).
Task 5: Expanded PCH (10 Headers)
Expanded PCH from 1 header to 10 nvFuser-controllable headers:
- polymorphic_value.h
- type_traits.h
- ir/base_nodes.h
- abstract_tensor.h
- type.h
- ir/container.h
- serde/fusion_cache_generated.h
- iter_visitor.h
- ir/internal_nodes.h
- ir/interface_nodes.h

GCC: 12m 42s → 12m 28s (-1.8%).
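As a cross-check on the Milestone 9 figures, the per-task percentages can be recomputed from the reported GCC wall-clock times. The helpers below (`seconds`, `pct_change`) are hypothetical names introduced only for this sketch. To one decimal place the four tasks come out at -2.4%, -8.7%, -6.3%, and -1.8%, and the overall 15m 12s → 12m 28s change at -18.0%, which agrees with the reported values at their stated rounding.

```cpp
#include <cmath>

// Convert a "Xm Ys" build time to seconds.
constexpr int seconds(int m, int s) { return 60 * m + s; }

// Signed percentage change from `before` to `after`,
// rounded to one decimal place.
inline double pct_change(int before, int after) {
  return std::round(1000.0 * (after - before) / before) / 10.0;
}
```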
Milestone 10: Index-Based Switch Dispatch
GCC build time: ~10m (estimated, pending verification)
Clang verified: 6m 55s (-45% vs M9)
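The idea of dispatching on the variant index with a flat switch, rather than through recursive templates, can be sketched as below. This is an illustration under assumptions: `Value`, `dispatch`, and `type_name` are hypothetical names, the variant has only three alternatives, and the cases are written by hand instead of being macro-generated as the PR describes.

```cpp
#include <string>
#include <type_traits>
#include <variant>

// Hypothetical value type with three alternatives (assumption).
using Value = std::variant<long, double, bool>;

// Flat dispatch: a single switch over variant::index(), with one direct
// access per case. No recursive template expansion is needed.
template <typename F>
auto dispatch(const Value& v, F&& f) {
  switch (v.index()) {
    case 0: return f(*std::get_if<0>(&v));
    case 1: return f(*std::get_if<1>(&v));
    case 2: return f(*std::get_if<2>(&v));
  }
  // Reached only for a valueless variant.
  throw std::bad_variant_access();
}

// Example visitor: report the name of the held alternative.
inline std::string type_name(const Value& v) {
  return dispatch(v, [](const auto& x) -> std::string {
    using T = std::decay_t<decltype(x)>;
    if constexpr (std::is_same_v<T, long>) {
      return "long";
    } else if constexpr (std::is_same_v<T, double>) {
      return "double";
    } else {
      return "bool";
    }
  });
}
```

The callable must return the same type for every alternative, since `dispatch` deduces one return type from all cases.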
Core rewrite replacing recursive `ForAllTypes` template machinery with flat switch statements using variant index. This eliminates the expensive Void/tuple/dispatch template overhead entirely.
Task 3: Visibility Fix
Template-template parameters break visibility attributes. Fixed by compiling `polymorphic_value.cpp` with `-fvisibility=default`.
Key Files Modified
- lib/dynamic_type/src/dynamic_type/decl.h
- lib/dynamic_type/src/dynamic_type/impl.h
- csrc/polymorphic_value.h
- csrc/polymorphic_value.cpp
- csrc/type.h: getDataType/castToDtype implementations
- csrc/type.cpp: getDataType/castToDtype implementations
- CMakeLists.txt

Test Status
Commits
518198f874e50e4a4bd2fdb387f89ae03c0f321164986d9ab3518c1c4484a2743668285036a887906fecdd77b3473ae6b2e