| 1 | +# Build Time Optimization |
| 2 | + |
| 3 | +This document describes techniques for reducing C++ template instantiation overhead in the Composable Kernel codebase. |
| 4 | + |
| 5 | +## Why Build Time Matters |
| 6 | + |
| 7 | +Composable Kernel relies heavily on C++ template metaprogramming to achieve GPU kernels with no runtime abstraction penalty. However, deep template instantiation can significantly impact build times. A single translation unit may trigger hundreds of thousands of template instantiations, with each instantiation adding to compile time. |
| 8 | + |
| 9 | +## Measuring Build Time |
| 10 | + |
| 11 | +Use Clang's `-ftime-trace` flag to generate JSON build traces: |
| 12 | + |
| 13 | +```bash |
| 14 | +# Build with time trace enabled |
| 15 | +cmake -DCMAKE_CXX_FLAGS="-ftime-trace -ftime-trace-granularity=1" .. |
| 16 | +ninja example_gemm_xdl_fp16 |
| 17 | + |
| 18 | +# Find the trace file |
| 19 | +find . -name "*.json" -path "*/CMakeFiles/*" |
| 20 | +``` |
| 21 | + |
| 22 | +The trace file can be viewed in Chrome's `chrome://tracing` or analyzed with tools like [ClangBuildAnalyzer](https://github.com/aras-p/ClangBuildAnalyzer). |
| 23 | + |
| 24 | +Key metrics to monitor: |
| 25 | + |
| 26 | +- **Template instantiation count**: Total number of unique template instantiations |
| 27 | +- **Template instantiation depth**: Maximum recursion depth during instantiation |
| 28 | +- **Wall-clock time**: Actual time spent instantiating templates |
| 29 | + |
| 30 | +The `script/tools/ck-build-analysis` script automates trace collection and analysis: |
| 31 | + |
| 32 | +```bash |
| 33 | +script/tools/ck-build-analysis example_gemm_xdl_fp16 --granularity=1 |
| 34 | +``` |
| 35 | + |
| 36 | +## Optimization Techniques |
| 37 | + |
| 38 | +### 1. Replace O(N) Recursion with O(1) Pack Expansion |
| 39 | + |
| 40 | +Recursive template patterns create O(N) instantiation depth. Use compiler intrinsics and fold expressions for O(1) depth. |
| 41 | + |
| 42 | +**Before** (O(N) recursive instantiation): |
| 43 | + |
| 44 | +```cpp |
| 45 | +template <index_t N, typename F, index_t... Is> |
| 46 | +struct sequence_gen_impl |
| 47 | +{ |
| 48 | + using type = typename sequence_gen_impl<N-1, F, F{}(Number<N-1>{}), Is...>::type; |
| 49 | +}; |
| 50 | + |
| 51 | +template <typename F, index_t... Is> |
| 52 | +struct sequence_gen_impl<0, F, Is...> |
| 53 | +{ |
| 54 | + using type = Sequence<Is...>; |
| 55 | +}; |
| 56 | +``` |
| 57 | + |
| 58 | +**After** (O(1) using compiler intrinsic): |
| 59 | + |
| 60 | +```cpp |
| 61 | +template <index_t N, typename F> |
| 62 | +struct sequence_gen |
| 63 | +{ |
| 64 | + template <index_t... Is> |
| 65 | + static constexpr auto make(std::integer_sequence<index_t, Is...>) |
| 66 | + { |
| 67 | + return Sequence<F{}(Number<Is>{})...>{}; |
| 68 | + } |
| 69 | + using type = decltype(make(__make_integer_seq<std::integer_sequence, index_t, N>{})); |
| 70 | +}; |
| 71 | +``` |
| 72 | + |
| 73 | +The `__make_integer_seq` intrinsic (available in Clang and MSVC) generates integer sequences with O(1) template depth. |
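For readers who want to try the pack-expansion pattern outside the codebase, here is a minimal, portable sketch. It assumes simplified stand-ins for `index_t`, `Number`, and `Sequence`, and a hypothetical `DoubleIndex` functor and `sequence_gen_t` alias; none of these are the library's real definitions. It uses `std::make_integer_sequence`, which mainstream standard libraries implement on top of a compiler builtin, in place of calling `__make_integer_seq` directly.

```cpp
#include <type_traits>
#include <utility>

// Simplified stand-ins for CK's index_t, Number, and Sequence (assumptions).
using index_t = int;

template <index_t I>
struct Number
{
    static constexpr index_t value = I;
};

template <index_t... Is>
struct Sequence
{
};

// Hypothetical example functor: maps index I to 2 * I.
struct DoubleIndex
{
    template <index_t I>
    constexpr index_t operator()(Number<I>) const
    {
        return 2 * I;
    }
};

// A single pack expansion applies F to every index at once, so no chain of
// nested instantiations is needed.
template <typename F, index_t... Is>
constexpr auto apply_to_indices(std::integer_sequence<index_t, Is...>)
{
    return Sequence<F{}(Number<Is>{})...>{};
}

// Hypothetical alias playing the role of sequence_gen<N, F>::type.
template <index_t N, typename F>
using sequence_gen_t =
    decltype(apply_to_indices<F>(std::make_integer_sequence<index_t, N>{}));
```

With these definitions, `sequence_gen_t<4, DoubleIndex>` names `Sequence<0, 2, 4, 6>`.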
| 74 | + |
| 75 | +### 2. Replace Lambdas with Named Functors |
| 76 | + |
| 77 | +Each lambda expression creates a unique closure type, so every call site that passes a lambda to a template instantiates that template separately, even when the lambda bodies are identical. |
| 78 | + |
| 79 | +**Before** (lambda creates unique instantiations): |
| 80 | + |
| 81 | +```cpp |
| 82 | +// Called in multiple places - each creates new instantiations |
| 83 | +auto result = transform_tensor_descriptor( |
| 84 | + desc, |
| 85 | + make_tuple(make_pass_through_transform(Length)), |
| 86 | + make_tuple(Sequence<0>{}), |
| 87 | + make_tuple(Sequence<0>{})); |
| 88 | + |
| 89 | +// The lambda inside transform_tensor_descriptor: |
| 90 | +generate_tuple([](auto i) { return Sequence<i>{}; }, Number<N>{}); |
| 91 | +``` |
| 92 | + |
| 93 | +**After** (named functor shares instantiations): |
| 94 | + |
| 95 | +```cpp |
| 96 | +// Define functor once |
| 97 | +struct generate_identity_sequence |
| 98 | +{ |
| 99 | + template <index_t I> |
| 100 | + __host__ __device__ constexpr auto operator()(Number<I>) const |
| 101 | + { |
| 102 | + return Sequence<I>{}; |
| 103 | + } |
| 104 | +}; |
| 105 | + |
| 106 | +// Use everywhere - shares instantiations |
| 107 | +generate_tuple(generate_identity_sequence{}, Number<N>{}); |
| 108 | +``` |
| 109 | + |
| 110 | +This reduced `transform_tensor_descriptor` instantiations from 388 to 32 (92% reduction). |
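The underlying language rule can be checked in isolation. The sketch below is standalone C++17 and does not use any Composable Kernel code; `identity_a`, `identity_b`, `identity_functor`, and `apply_to_one` are hypothetical names chosen for illustration.

```cpp
#include <type_traits>

// Two lambdas with token-for-token identical bodies.
constexpr auto identity_a = [](auto x) { return x; };
constexpr auto identity_b = [](auto x) { return x; };

// A named functor with the same behavior.
struct identity_functor
{
    template <typename T>
    constexpr T operator()(T x) const
    {
        return x;
    }
};

// Hypothetical generic algorithm: instantiated once per distinct F.
template <typename F>
constexpr int apply_to_one(F f)
{
    return f(1);
}

// Each lambda expression has its own closure type, so apply_to_one is
// instantiated separately for identity_a and for identity_b.
static_assert(!std::is_same<decltype(identity_a), decltype(identity_b)>::value,
              "identical lambda bodies still produce distinct closure types");

// Every use of identity_functor names the same type, so all call sites share
// the single instantiation apply_to_one<identity_functor>.
static_assert(apply_to_one(identity_functor{}) == 1, "functor is usable at compile time");
```

The same effect compounds inside function templates, where a lambda yields a new closure type for each instantiation of the enclosing template.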
| 111 | + |
| 112 | +#### container_concat optimization |
| 113 | + |
| 114 | +The same pattern applies to utility functions like `container_concat`: |
| 115 | + |
| 116 | +**Before**: |
| 117 | + |
| 118 | +```cpp |
| 119 | +template <typename... X, typename... Y> |
| 120 | +__host__ __device__ constexpr auto container_concat(const Tuple<X...>& tx, const Tuple<Y...>& ty) |
| 121 | +{ |
| 122 | + return unpack2([](auto&&... zs) { return make_tuple(forward<decltype(zs)>(zs)...); }, tx, ty); |
| 123 | +} |
| 124 | +``` |
| 125 | + |
| 126 | +**After**: |
| 127 | + |
| 128 | +```cpp |
| 129 | +struct make_tuple_functor |
| 130 | +{ |
| 131 | + template <typename... Ts> |
| 132 | + __host__ __device__ constexpr auto operator()(Ts&&... xs) const |
| 133 | + { |
| 134 | + return make_tuple(forward<Ts>(xs)...); |
| 135 | + } |
| 136 | +}; |
| 137 | + |
| 138 | +template <typename... X, typename... Y> |
| 139 | +__host__ __device__ constexpr auto container_concat(const Tuple<X...>& tx, const Tuple<Y...>& ty) |
| 140 | +{ |
| 141 | + return unpack2(make_tuple_functor{}, tx, ty); |
| 142 | +} |
| 143 | +``` |
| 144 | + |
| 145 | +This reduced `container_concat` instantiations from 186 to 93 (50% reduction). |
| 146 | + |
| 147 | +#### make_uniform_tuple helper |
| 148 | + |
| 149 | +For patterns that create tuples with repeated values, use dedicated helpers instead of lambdas: |
| 150 | + |
| 151 | +**Before**: |
| 152 | + |
| 153 | +```cpp |
| 154 | +// Creates unique lambda type at each call site |
| 155 | +generate_tuple([](auto) { return some_value; }, Number<N>{}); |
| 156 | +``` |
| 157 | + |
| 158 | +**After**: |
| 159 | + |
| 160 | +```cpp |
| 161 | +// Defined once, shared across all call sites |
| 162 | +template <index_t N, typename T> |
| 163 | +__host__ __device__ constexpr auto make_uniform_tuple(T&& value) |
| 164 | +{ |
| 165 | + return detail::make_uniform_tuple_impl(static_cast<T&&>(value), make_index_sequence<N>{}); |
| 166 | +} |
| 167 | + |
| 168 | +// Usage |
| 169 | +make_uniform_tuple<N>(some_value); |
| 170 | +``` |
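The snippet above does not show `detail::make_uniform_tuple_impl`. One plausible shape for such a helper is sketched below as an assumption, using `std::tuple` and `std::index_sequence` instead of the library's own `Tuple` and `make_index_sequence`, and omitting the `__host__ __device__` qualifiers so it compiles as plain C++.

```cpp
#include <cstddef>
#include <tuple>
#include <utility>

namespace detail {
// Hypothetical helper: repeats `value` once for each index in the pack.
template <typename T, std::size_t... Is>
constexpr auto make_uniform_tuple_impl(const T& value, std::index_sequence<Is...>)
{
    // The comma expression discards the index and yields `value`, so the pack
    // expansion produces sizeof...(Is) copies of it.
    return std::make_tuple(((void)Is, value)...);
}
} // namespace detail

// Builds a tuple holding N copies of `value` without any lambda.
template <std::size_t N, typename T>
constexpr auto make_uniform_tuple(const T& value)
{
    return detail::make_uniform_tuple_impl(value, std::make_index_sequence<N>{});
}
```

For example, `make_uniform_tuple<3>(7)` yields a `std::tuple<int, int, int>` whose elements are all 7.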
| 171 | + |
| 172 | +### 3. Use Constexpr Arrays Instead of Template Recursion |
| 173 | + |
| 174 | +Replace recursive template searches with constexpr functions using arrays. |
| 175 | + |
| 176 | +**Before** (O(N) recursive template search): |
| 177 | + |
| 178 | +```cpp |
| 179 | +template <index_t Target, typename FirstSeq, typename... RestSeqs> |
| 180 | +struct find_in_tuple_of_sequences_impl |
| 181 | +{ |
| 182 | + static constexpr index_t pos = sequence_find<Target>(FirstSeq{}); |
| 183 | + static constexpr bool found_here = (pos >= 0); |
| 184 | + |
| 185 | + using next = find_in_tuple_of_sequences_impl<Target, RestSeqs...>; |
| 186 | + |
| 187 | + static constexpr index_t itran = found_here ? 0 : 1 + next::itran; |
| 188 | + static constexpr index_t idim_up = found_here ? pos : next::idim_up; |
| 189 | +}; |
| 190 | +``` |
| 191 | + |
| 192 | +**After** (O(1) pack expansion with constexpr array): |
| 193 | + |
| 194 | +```cpp |
| 195 | +template <index_t Target, typename... Seqs> |
| 196 | +struct FindInTupleOfSequencesCompute |
| 197 | +{ |
| 198 | + static constexpr auto compute() |
| 199 | + { |
| 200 | + if constexpr(sizeof...(Seqs) == 0) { |
| 201 | + return ResultData{0, 0, false}; |
| 202 | + } else { |
| 203 | + // Pack expansion creates array - O(1) template depth |
| 204 | + constexpr index_t indices[] = {sequence_find_value<Target>(Seqs{})...}; |
| 205 | + for(index_t i = 0; i < static_cast<index_t>(sizeof...(Seqs)); ++i) |
| 206 | + if(indices[i] >= 0) return ResultData{i, indices[i], true}; |
| 207 | + return ResultData{0, 0, false}; |
| 208 | + } |
| 209 | + } |
| 210 | +}; |
| 211 | +``` |
| 212 | + |
| 213 | +This reduced `find_in_tuple_of_sequences` instantiations by 50% and wall-clock time by 69%. |
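The "After" snippet relies on `ResultData` and `sequence_find_value`, which are not shown. The following self-contained sketch fills them in with hypothetical definitions so the technique can be compiled and tested on its own; the field names and the `-1` "not found" convention are assumptions rather than the library's actual interface.

```cpp
using index_t = int;

// Simplified stand-in for CK's Sequence (assumption).
template <index_t... Is>
struct Sequence
{
};

// Hypothetical result type for the search.
struct ResultData
{
    index_t itran;   // index of the sequence that contains the target
    index_t idim_up; // position of the target inside that sequence
    bool found;
};

// Hypothetical helper: position of Target in the sequence, or -1 if absent.
template <index_t Target, index_t... Is>
constexpr index_t sequence_find_value(Sequence<Is...>)
{
    // The trailing 0 is padding so the array is never zero-length.
    constexpr index_t values[] = {Is..., 0};
    for(index_t i = 0; i < static_cast<index_t>(sizeof...(Is)); ++i)
        if(values[i] == Target)
            return i;
    return -1;
}

template <index_t Target, typename... Seqs>
constexpr ResultData find_in_tuple_of_sequences()
{
    // One pack expansion runs every per-sequence search; the trailing -1 is
    // padding for an empty pack and is never read by the loop.
    constexpr index_t indices[] = {sequence_find_value<Target>(Seqs{})..., -1};
    for(index_t i = 0; i < static_cast<index_t>(sizeof...(Seqs)); ++i)
        if(indices[i] >= 0)
            return ResultData{i, indices[i], true};
    return ResultData{0, 0, false};
}
```

Searching for 5 in `Sequence<0, 1>`, `Sequence<2, 5, 3>`, `Sequence<4>` reports the second sequence (`itran == 1`) at position 1 (`idim_up == 1`). The search itself is an ordinary loop over an array, so adding more sequences does not deepen the template instantiation stack.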
| 214 | + |
| 215 | +### 4. Avoid Unnecessary Template Parameter Variations |
| 216 | + |
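| 217 | +Every distinct combination of template arguments produces a separate instantiation, so templates with many independent parameters can cause a combinatorial explosion. To limit this: |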
| 218 | + |
| 219 | +- Cache template results where possible |
| 220 | +- Use type erasure for runtime-only variations |
| 221 | +- Consider `if constexpr` to reduce branch template instantiations |
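As an illustration of the `if constexpr` suggestion, the sketch below uses a hypothetical `element_count` function that is not part of Composable Kernel. Inside a template, the branch that is not taken is discarded and never instantiated, so the compiler does no template work for it.

```cpp
#include <array>
#include <cstddef>
#include <type_traits>

// Hypothetical helper: number of elements in a scalar or a container.
template <typename T>
constexpr std::size_t element_count(const T& value)
{
    if constexpr(std::is_arithmetic_v<T>)
    {
        // Scalar case: the container branch below is discarded, so
        // value.size() is never instantiated for arithmetic types.
        (void)value;
        return 1;
    }
    else
    {
        // Container case: only instantiated for types that reach this branch.
        return value.size();
    }
}
```

With a plain `if`, `element_count(3.0f)` would fail to compile because `float` has no `size()` member; with `if constexpr`, that branch is not instantiated at all.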
| 222 | + |
| 223 | +## Case Studies |
| 224 | + |
| 225 | +The following PRs demonstrate these techniques applied to Composable Kernel: |
| 226 | + |
| 227 | +- **sequence_gen optimization**: Replaced O(N) recursion with `__make_integer_seq` intrinsic |
| 228 | +- **transform_tensor_descriptor**: Replaced lambdas with named functors (92% instantiation reduction) |
| 229 | +- **container_concat**: Replaced lambdas with named functors (50% instantiation reduction) |
| 230 | +- **find_in_tuple_of_sequences**: Replaced recursive search with pack expansion (50% reduction) |
| 231 | +- **sequence_merge**: Replaced O(log N) recursion with O(1) fold expression |
| 232 | + |
| 233 | +See tracking issue [#3575](https://github.com/ROCm/composable_kernel/issues/3575) for the full list of PRs. |
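The `sequence_merge` entry above is the only technique in the list without a snippet. The following sketch shows one way a fold expression can express such a merge. It is an illustration under assumed definitions (a simplified `Sequence` and a hypothetical `operator+` and `sequence_merge_t`) and may differ from what the actual PR does.

```cpp
#include <type_traits>

using index_t = int;

// Simplified stand-in for CK's Sequence (assumption).
template <index_t... Is>
struct Sequence
{
};

// Hypothetical concatenation operator: joins two sequences into one.
template <index_t... As, index_t... Bs>
constexpr Sequence<As..., Bs...> operator+(Sequence<As...>, Sequence<Bs...>)
{
    return {};
}

// Binary left fold over operator+; the Sequence<> seed also handles an
// empty pack. No recursive helper template is involved.
template <typename... Seqs>
using sequence_merge_t = decltype((Sequence<>{} + ... + Seqs{}));
```

Here `sequence_merge_t<Sequence<1, 2>, Sequence<3>, Sequence<4, 5>>` names `Sequence<1, 2, 3, 4, 5>`. The fold avoids nested template recursion, although it still creates one intermediate `Sequence` type per operand.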
| 234 | + |
| 235 | +## Tools and Commands |
| 236 | + |
| 237 | +Identify optimization targets: |
| 238 | + |
| 239 | +```bash |
| 240 | +# Run analysis on a specific target |
| 241 | +script/tools/ck-build-analysis example_convnd_fwd_xdl_fp16 --granularity=1 |
| 242 | + |
| 243 | +# View the generated report |
| 244 | +cat build_time_analysis_report.md |
| 245 | +``` |
| 246 | + |
| 247 | +The report shows template instantiation counts, wall-clock times, and identifies the most expensive templates. |