Skip to content

Conversation

@csarofeen
Copy link
Collaborator

@csarofeen csarofeen commented Jan 2, 2026

Build Time Improvements (GCC)

Milestone Build Time Improvement Cumulative
Baseline (main) 19m 34s - -
M8 15m 12s -22% -22%
M9 12m 28s -18% -36%
M10 ~10m* ~-20%* ~-49%*

*M10 GCC measurement pending; estimate based on Clang improvement (6m 55s, -45% vs M9).

Template instantiation reduced by 94%+ from original baseline.


Milestone 8: Friend Functions + Extern Template

Build time: 19m 34s → 15m 12s (-22%)

Converted all DynamicType operators from free function templates to friend functions with static member implementations. This allows extern template to properly suppress instantiation across translation units—previously, free function templates were instantiated in every TU regardless of extern template declarations.

// Before: Free function template (NOT covered by extern template)
template <typename LHS, typename RHS, typename DT>
DT operator+(LHS&& x, RHS&& y);

// After: Friend function + static member (COVERED by extern template)
friend DynamicType operator+(const DynamicType& a, const DynamicType& b) {
    return add_impl(a, b);
}

Milestone 9: Function Moving + Expanded PCH

Build time: 15m 12s → 12m 28s (-18%)

With M8's template reduction, header parsing became a significant cost (~48% of frontend time, up from 6%). Applied multiple optimizations:

Task 2: Function Moving

Moved getDataType() and castToDtype() from type.h to type.cpp. GCC: 15m 12s → 14m 50s (-2.4%).

Task 3: Narrow PCH

Precompiled header for polymorphic_value.h. GCC: 14m 50s → 13m 33s (-9%).

Task 4: Shared Test PCH

Discovered 20 redundant PCH files (9.2 GB total) being created for test targets. Consolidated to 2 shared PCH files, saving 8.3 GB. GCC: 13m 33s → 12m 42s (-6%).

Task 5: Expanded PCH (10 Headers)

Expanded PCH from 1 header to 10 nvFuser-controllable headers:

Header Exclusive Parse Time
polymorphic_value.h 27.9m
type_traits.h 7.9m
ir/base_nodes.h 4.7m
abstract_tensor.h 2.7m
type.h 1.4m
ir/container.h 0.9m
serde/fusion_cache_generated.h 0.7m
iter_visitor.h 0.6m
ir/internal_nodes.h 0.6m
ir/interface_nodes.h 0.5m

GCC: 12m 42s → 12m 28s (-1.8%).


Milestone 10: Index-Based Switch Dispatch

GCC build time: ~10m (estimated, pending verification)
Clang verified: 6m 55s (-45% vs M9)

Core rewrite replacing recursive ForAllTypes template machinery with flat switch statements using variant index. This eliminates the expensive Void/tuple/dispatch template overhead entirely.

// Before: Recursive template instantiation
ForAllTypes<T1, T2, ..., T9>{}(lambda)
    → tuple<Void, Result, Void, ...>
    → remove_void_from_tuple()

// After: Flat switch dispatch
switch (a.index()) {
    case 0: switch (b.index()) { case 0: return op(get<0>(a), get<0>(b)); ... }
    case 1: ...
}

Task 3: Visibility Fix

Template-template parameters break visibility attributes. Fixed by compiling polymorphic_value.cpp with -fvisibility=default:

set_source_files_properties(
  "${NVFUSER_SRCS_DIR}/polymorphic_value.cpp"
  PROPERTIES 
    SKIP_PRECOMPILE_HEADERS ON
    COMPILE_OPTIONS "-fvisibility=default"
)

Key Files Modified

File Change
lib/dynamic_type/src/dynamic_type/decl.h Friend function declarations
lib/dynamic_type/src/dynamic_type/impl.h Switch dispatch macros + implementations
csrc/polymorphic_value.h Extern template declaration
csrc/polymorphic_value.cpp Explicit template instantiation
csrc/type.h Removed getDataType/castToDtype implementations
csrc/type.cpp Added getDataType/castToDtype implementations
CMakeLists.txt PCH configuration (10 headers), visibility override

Test Status

Test Suite Status
DynamicType library (72 tests) ✅ All pass
nvFuser PolymorphicValue tests ✅ All pass
Python import ✅ Works

Commits

Commit Description
518198f87 Fix symbol visibility for DynamicType (-fvisibility=default)
4e50e4a4b Extend switch dispatch to all binary operators
d2fdb387f Convert comparison operators to switch dispatch
89ae03c0f Implement index-based switch dispatch for operator==
321164986 Expand PCH to include top nvFuser headers
d9ab3518c Extend PCH to test targets (shared PCH)
1c4484a27 Enable narrow PCH for polymorphic_value.h
436682850 Move getDataType and castToDtype to type.cpp
36a887906 Refactor DynamicType operators to non-template friends
fecdd77b3 M8 Task 12: Convert remaining operators to friend functions
473ae6b2e M8 Task 12: Convert 22 binary operators to friend pattern

Preparatory refactor for wrapper class conversion.
No behavior change - just moves the DynamicType alias into
detail::DynamicTypeAlias and re-exports as PolymorphicValue.
…r extern template suppression. Reduces compile time by 56% and template instantiation by 75%.
Reduces template instantiation by 28% by confining ForAllTypes dispatch to one TU.
Precompile polymorphic_value.h to eliminate ~4000s of redundant header parsing. Enabled by default for Release builds. Disable with -DNVFUSER_USE_POLYMORPHIC_PCH=OFF.
Replaces ForAllTypes/dispatch with fold expression dispatch, eliminating template overhead.
…rAllTypes/Void overhead and fix Clang 18 template crash
…wise, named comparisons). Uses macro-generated switch statements supporting up to 16 type alternatives.
…cpp with -fvisibility=default. Resolves undefined symbol error when importing nvfuser.
@csarofeen
Copy link
Collaborator Author

!test

@github-actions
Copy link

github-actions bot commented Jan 2, 2026

Description

  • Split DynamicType into decl.h, impl.h, and wrapper headers for better build organization

  • Convert 22+ operators to friend functions reducing template instantiation by 75%

  • Move operators and functions from headers to implementation files to reduce compile time

  • Enable precompiled headers for polymorphic_value.h eliminating ~4000s of redundant parsing

Changes walkthrough

Relevant files

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review
Header Refactoring

The main header file has been completely refactored to now just include "impl.h" instead of containing the full implementation. This is a breaking change for users who were relying on the previous header structure. The PR mentions this is for "compile-time optimization" but doesn't explain the migration path for existing users.

// clang-format on
#pragma once

// Backward-compatible header - includes everything.
// For compile-time optimization, include decl.h directly
// and use extern template declarations.

#include "impl.h"
Template Instantiation

An explicit template instantiation is added for DynamicType with specific types. This creates a single point of instantiation but may cause linking issues if the explicit instantiation doesn't match what other translation units expect. The -fvisibility=default requirement should be verified across different build configurations.

// Explicit instantiation of DynamicType for PolymorphicValue.
// This is the single point where the template is fully instantiated.
// Note: This file is compiled with -fvisibility=default (set in CMakeLists.txt)
// to ensure all DynamicType symbols are exported from the shared library.
template struct dynamic_type::DynamicType<
    dynamic_type::Containers<std::vector>,
    nvfuser::StructHandle,
    nvfuser::Pointer,
    nvfuser::Opaque,
    at::Tensor,
    std::complex<double>,
    double,
    int64_t,
    bool>;
Test Behavior Change

Tests are converted from compile-time static_assert to runtime EXPECT_EQ tests, and error messages are changed from "Result is dynamic but not convertible to result type" to "Cannot compute". This changes the testing behavior from compile-time failures to runtime failures, which could affect test execution patterns and debugging workflows.

EXPECT_EQ(                                                                 \
    (DoubleInt64Bool(2L) op DoubleInt64Bool(2.5))                          \
        .as<decltype(2L op 2.5)>(),                                        \
    (2L op 2.5));                                                          \
EXPECT_EQ(                                                                 \
    (DoubleInt64Bool(2L) op DoubleInt64BoolTwo{})                          \
        .as<decltype(2L op 2L)>(),                                         \
    (2L op 2L));                                                           \
EXPECT_EQ(                                                                 \
    (DoubleInt64BoolTwo {} op DoubleInt64Bool(2L))                         \
        .as<decltype(2L op 2L)>(),                                         \
    (2L op 2L));                                                           \

Test failures

  • (High, 44) NCCL NVLS multicast memory bind failures in multi-device distributed tests (dtensor/matmul/overlap/transformer) on dlcluster_viking_ci

    Test Name H100 (dist.) Source
    tests.python.multidevice.test_communication.test_allgather
    tests.python.multidevice.test_communication.test_allgather_expanded_broadcast
    tests.python.multidevice.test_communication.test_allreduce
    tests.python.multidevice.test_communication.test_reduce_scatter
    tests.python.multidevice.test_communication.test_reduce_scatter_noncontiguous
    tests.python.multidevice.test_dtensor.test_column_parallel_linear
    tests.python.multidevice.test_dtensor.test_plus_one
    tests.python.multidevice.test_dtensor.test_row_parallel_linear
    tests.python.multidevice.test_expert_parallel.test_dispatch_and_combine
    tests.python.multidevice.test_matmul.test_column_parallel_grouped_mm
    ... with 34 more test failures omitted. Check internal logs.
  • (High, 1) NCCL invalid usage error in multidevice overlap tests (test_overlap_allgather_matmul_shard_outermost)

    Test Name H100 (dist.) Source
    tests.python.multidevice.test_overlap.test_overlap_allgather_matmul_shard_outermost[backend_type=CommunicatorBackend.cuda]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants