Commit 52fa8f6

Add build time optimization documentation
1 parent a565d87 commit 52fa8f6

1 file changed

Lines changed: 247 additions & 0 deletions

BUILD_TIME_OPTIMIZATION.md

@@ -0,0 +1,247 @@
# Build Time Optimization

This document describes techniques for reducing C++ template instantiation overhead in the Composable Kernel codebase.

## Why Build Time Matters

Composable Kernel relies heavily on C++ template metaprogramming to produce GPU kernels with no runtime abstraction penalty. The cost is paid at compile time: a single translation unit may trigger hundreds of thousands of template instantiations, and each one adds to the build time.

## Measuring Build Time

Use Clang's `-ftime-trace` flag to generate JSON build traces:

```bash
# Build with time trace enabled
cmake -DCMAKE_CXX_FLAGS="-ftime-trace -ftime-trace-granularity=1" ..
ninja example_gemm_xdl_fp16

# Find the trace file
find . -name "*.json" -path "*/CMakeFiles/*"
```

The trace file can be viewed in Chrome's `chrome://tracing` or analyzed with tools like [ClangBuildAnalyzer](https://github.com/aras-p/ClangBuildAnalyzer).

Key metrics to monitor:

- **Template instantiation count**: Total number of unique template instantiations
- **Template instantiation depth**: Maximum recursion depth during instantiation
- **Wall-clock time**: Actual time spent instantiating templates

The `script/tools/ck-build-analysis` script automates trace collection and analysis:

```bash
script/tools/ck-build-analysis example_gemm_xdl_fp16 --granularity=1
```

## Optimization Techniques

### 1. Replace O(N) Recursion with O(1) Pack Expansion

Recursive template patterns create O(N) instantiation depth. Use compiler intrinsics and pack expansion to keep the depth at O(1).

**Before** (O(N) recursive instantiation):

```cpp
template <index_t N, typename F, index_t... Is>
struct sequence_gen_impl
{
    using type = typename sequence_gen_impl<N-1, F, F{}(Number<N-1>{}), Is...>::type;
};

template <typename F, index_t... Is>
struct sequence_gen_impl<0, F, Is...>
{
    using type = Sequence<Is...>;
};
```

**After** (O(1) using compiler intrinsic):

```cpp
template <index_t N, typename F>
struct sequence_gen
{
    // Declaration only: make() is named solely inside decltype, so it needs no body.
    // The explicit return type matters: a deduced (auto) return type cannot be
    // used inside the class before the member function is defined.
    template <index_t... Is>
    static auto make(std::integer_sequence<index_t, Is...>)
        -> Sequence<F{}(Number<Is>{})...>;

    using type = decltype(make(__make_integer_seq<std::integer_sequence, index_t, N>{}));
};
```

The `__make_integer_seq` intrinsic (available in Clang and MSVC) generates integer sequences with O(1) template depth. GCC provides `__integer_pack` for the same purpose.

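The pattern can be tried outside the CK tree with a few stand-in types. The following is a minimal sketch, not CK's implementation: `Sequence`, `Number`, and the `square` functor are simplified placeholders introduced here for illustration, and `std::make_integer_sequence` is used as the portable spelling (standard libraries typically implement it on top of a compiler builtin).

```cpp
#include <type_traits>
#include <utility>

using index_t = int;

// Simplified stand-ins for CK's Sequence and Number (assumed shapes).
template <index_t... Is>
struct Sequence
{
};

template <index_t I>
struct Number
{
    static constexpr index_t value = I;
};

template <index_t N, typename F>
struct sequence_gen
{
    // One pack expansion computes every element; no recursive helper is needed.
    template <index_t... Is>
    static auto make(std::integer_sequence<index_t, Is...>)
        -> Sequence<F{}(Number<Is>{})...>;

    using type = decltype(make(std::make_integer_sequence<index_t, N>{}));
};

// Hypothetical generator functor: maps index I to I * I.
struct square
{
    template <index_t I>
    constexpr index_t operator()(Number<I>) const
    {
        return I * I;
    }
};

static_assert(std::is_same_v<sequence_gen<4, square>::type, Sequence<0, 1, 4, 9>>);
```

The functor's `operator()` has to be `constexpr`, because its result is used as a template argument.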
### 2. Replace Lambdas with Named Functors

Each lambda expression creates a unique closure type, so every template that receives the lambda as an argument is instantiated again at each call site.

**Before** (lambda creates unique instantiations):

```cpp
// Called in multiple places - each creates new instantiations
auto result = transform_tensor_descriptor(
    desc,
    make_tuple(make_pass_through_transform(Length)),
    make_tuple(Sequence<0>{}),
    make_tuple(Sequence<0>{}));

// The lambda inside transform_tensor_descriptor:
generate_tuple([](auto i) { return Sequence<i>{}; }, Number<N>{});
```

**After** (named functor shares instantiations):

```cpp
// Define functor once
struct generate_identity_sequence
{
    template <index_t I>
    __host__ __device__ constexpr auto operator()(Number<I>) const
    {
        return Sequence<I>{};
    }
};

// Use everywhere - shares instantiations
generate_tuple(generate_identity_sequence{}, Number<N>{});
```

This reduced `transform_tensor_descriptor` instantiations from 388 to 32 (92% reduction).

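The underlying language rule can be checked directly. The sketch below uses hypothetical names (`instantiation_tag`, `add_one_a`, `add_one_b`, `add_one`) that do not come from CK; it only shows that two textually identical lambdas have different types, while a named functor has one.

```cpp
#include <type_traits>

// Hypothetical stand-in for any template parameterized on a callable type.
template <typename F>
struct instantiation_tag
{
};

// Two textually identical lambdas still have two distinct closure types...
constexpr auto add_one_a = [](int x) { return x + 1; };
constexpr auto add_one_b = [](int x) { return x + 1; };

using lambda_a = std::remove_const_t<decltype(add_one_a)>;
using lambda_b = std::remove_const_t<decltype(add_one_b)>;

static_assert(!std::is_same_v<lambda_a, lambda_b>);

// ...so a template taking the callable type is specialized once per lambda.
static_assert(!std::is_same_v<instantiation_tag<lambda_a>, instantiation_tag<lambda_b>>);

// A named functor has a single type, however many call sites use it.
struct add_one
{
    constexpr int operator()(int x) const { return x + 1; }
};

static_assert(std::is_same_v<instantiation_tag<decltype(add_one{})>,
                             instantiation_tag<add_one>>);
static_assert(add_one{}(41) == 42 && add_one_a(41) == 42);
```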
#### container_concat optimization

The same pattern applies to utility functions like `container_concat`:

**Before**:

```cpp
template <typename... X, typename... Y>
__host__ __device__ constexpr auto container_concat(const Tuple<X...>& tx, const Tuple<Y...>& ty)
{
    return unpack2([](auto&&... zs) { return make_tuple(forward<decltype(zs)>(zs)...); }, tx, ty);
}
```

**After**:

```cpp
struct make_tuple_functor
{
    template <typename... Ts>
    __host__ __device__ constexpr auto operator()(Ts&&... xs) const
    {
        return make_tuple(forward<Ts>(xs)...);
    }
};

template <typename... X, typename... Y>
__host__ __device__ constexpr auto container_concat(const Tuple<X...>& tx, const Tuple<Y...>& ty)
{
    return unpack2(make_tuple_functor{}, tx, ty);
}
```

This reduced `container_concat` instantiations from 186 to 93 (50% reduction).

#### make_uniform_tuple helper

For patterns that create tuples with repeated values, use dedicated helpers instead of lambdas:

**Before**:

```cpp
// Creates unique lambda type at each call site
generate_tuple([](auto) { return some_value; }, Number<N>{});
```

**After**:

```cpp
// Defined once, shared across all call sites
template <index_t N, typename T>
__host__ __device__ constexpr auto make_uniform_tuple(T&& value)
{
    return detail::make_uniform_tuple_impl(static_cast<T&&>(value), make_index_sequence<N>{});
}

// Usage
make_uniform_tuple<N>(some_value);
```

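The snippet above calls `detail::make_uniform_tuple_impl` without showing it. One plausible shape is sketched below using `std::tuple` and `std::index_sequence` in place of CK's `Tuple` and `make_index_sequence`; this is an assumption for illustration, and the real helper may differ.

```cpp
#include <cstddef>
#include <tuple>
#include <utility>

namespace detail {
// The comma expression (static_cast<void>(Is), value) yields `value` once per
// index in the pack, so the tuple is built by a single pack expansion.
template <typename T, std::size_t... Is>
constexpr auto make_uniform_tuple_impl(const T& value, std::index_sequence<Is...>)
{
    return std::make_tuple((static_cast<void>(Is), value)...);
}
} // namespace detail

// Sketch only: takes the value by const reference and copies it N times.
template <std::size_t N, typename T>
constexpr auto make_uniform_tuple(const T& value)
{
    return detail::make_uniform_tuple_impl(value, std::make_index_sequence<N>{});
}

static_assert(make_uniform_tuple<3>(7) == std::make_tuple(7, 7, 7));
static_assert(std::tuple_size_v<decltype(make_uniform_tuple<0>(7))> == 0);
```

This sketch takes the value by `const&` rather than by forwarding reference because the value is copied N times, and forwarding an rvalue more than once could move from it repeatedly.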
### 3. Use Constexpr Arrays Instead of Template Recursion

Replace recursive template searches with constexpr functions using arrays.

**Before** (O(N) recursive template search):

```cpp
template <index_t Target, typename FirstSeq, typename... RestSeqs>
struct find_in_tuple_of_sequences_impl
{
    static constexpr index_t pos = sequence_find<Target>(FirstSeq{});
    static constexpr bool found_here = (pos >= 0);

    using next = find_in_tuple_of_sequences_impl<Target, RestSeqs...>;

    static constexpr index_t itran = found_here ? 0 : 1 + next::itran;
    static constexpr index_t idim_up = found_here ? pos : next::idim_up;
};
```

**After** (O(1) pack expansion with constexpr array):

```cpp
template <index_t Target, typename... Seqs>
struct FindInTupleOfSequencesCompute
{
    // ResultData is initialized here as {itran, idim_up, found}
    static constexpr auto compute()
    {
        if constexpr(sizeof...(Seqs) == 0) {
            return ResultData{0, 0, false};
        } else {
            // Pack expansion creates array - O(1) template depth
            constexpr index_t indices[] = {sequence_find_value<Target>(Seqs{})...};
            // Cast avoids a signed/unsigned comparison with sizeof...
            for(index_t i = 0; i < static_cast<index_t>(sizeof...(Seqs)); ++i)
                if(indices[i] >= 0) return ResultData{i, indices[i], true};
            return ResultData{0, 0, false};
        }
    }
};
```

For `find_in_tuple_of_sequences`, this reduced instantiations by 50% and wall-clock time by 69%.

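The snippet above relies on `ResultData` and `sequence_find_value`, which are defined elsewhere. A self-contained sketch of the same idea follows; the field layout of `ResultData`, the body of `sequence_find_value`, and the free function `find_in_sequences` are assumptions made for illustration and are not taken from CK.

```cpp
using index_t = int;

template <index_t... Is>
struct Sequence
{
};

// Assumed layout: index of the matching sequence, position inside it, found flag.
struct ResultData
{
    index_t itran;
    index_t idim_up;
    bool found;
};

// Hypothetical helper: position of Target in the sequence, or -1 if absent.
template <index_t Target, index_t... Is>
constexpr index_t sequence_find_value(Sequence<Is...>)
{
    constexpr index_t values[] = {Is..., 0}; // trailing 0 keeps the array non-empty
    for(index_t i = 0; i < static_cast<index_t>(sizeof...(Is)); ++i)
        if(values[i] == Target)
            return i;
    return -1;
}

template <index_t Target, typename... Seqs>
constexpr ResultData find_in_sequences()
{
    // One pack expansion fills the array; the search itself is an ordinary loop.
    // The trailing -1 sentinel keeps the array non-empty for an empty pack.
    constexpr index_t indices[] = {sequence_find_value<Target>(Seqs{})..., -1};
    for(index_t i = 0; i < static_cast<index_t>(sizeof...(Seqs)); ++i)
        if(indices[i] >= 0)
            return ResultData{i, indices[i], true};
    return ResultData{0, 0, false};
}

// 5 is at position 1 of the second sequence (index 1).
constexpr ResultData r = find_in_sequences<5, Sequence<0, 1>, Sequence<2, 5, 7>>();
static_assert(r.found && r.itran == 1 && r.idim_up == 1);
```

The number of template instantiations still grows with the number of sequences, since each one needs its own `sequence_find_value` call, but none of them nests inside another.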
### 4. Avoid Unnecessary Template Parameter Variations

Every distinct combination of template arguments is a separate instantiation, so templates with many independent parameters can multiply into a combinatorial explosion.

- Cache template results where possible
- Use type erasure for runtime-only variations
- Consider `if constexpr` to reduce branch template instantiations

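As a small illustration of the `if constexpr` point, the sketch below uses hypothetical helpers (`scale_integral`, `scale_floating`, `scale`) that are not part of CK. Inside a template, the discarded branch of `if constexpr` is not instantiated, so the helper it names is never instantiated for that type.

```cpp
#include <type_traits>

template <typename T>
constexpr T scale_integral(T x)
{
    return x * 2;
}

template <typename T>
constexpr T scale_floating(T x)
{
    // Would fail to compile if this template were ever instantiated for int.
    static_assert(std::is_floating_point_v<T>, "floating-point types only");
    return x * T(0.5);
}

template <typename T>
constexpr T scale(T x)
{
    // For T = int the else-branch is discarded, so scale_floating<int>
    // is never instantiated and its static_assert never fires.
    if constexpr(std::is_integral_v<T>)
        return scale_integral(x);
    else
        return scale_floating(x);
}

static_assert(scale(4) == 8);
static_assert(scale(4.0) == 2.0);
```

With a plain `if`, both branches would be instantiated for every `T`, and the `int` case would fail on the `static_assert`.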
## Case Studies

The following PRs demonstrate these techniques applied to Composable Kernel:

- **sequence_gen optimization**: Replaced O(N) recursion with `__make_integer_seq` intrinsic
- **transform_tensor_descriptor**: Replaced lambdas with named functors (92% instantiation reduction)
- **container_concat**: Replaced lambdas with named functors (50% instantiation reduction)
- **find_in_tuple_of_sequences**: Replaced recursive search with pack expansion (50% reduction)
- **sequence_merge**: Replaced O(log N) recursion with O(1) fold expression

See tracking issue [#3575](https://github.com/ROCm/composable_kernel/issues/3575) for the full list of PRs.

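The `sequence_merge` entry is the only technique in the list without a code sample in this document. A fold expression over a concatenating `operator+` is one way such a merge might look; the sketch below is an assumption about the general approach, not the code from the actual PR, and `sequence_merge_t` is a name chosen here for illustration.

```cpp
#include <type_traits>

using index_t = int;

template <index_t... Is>
struct Sequence
{
};

// Concatenating two sequences is a single pack expansion.
template <index_t... Xs, index_t... Ys>
constexpr auto operator+(Sequence<Xs...>, Sequence<Ys...>)
{
    return Sequence<Xs..., Ys...>{};
}

// A binary left fold over operator+ merges any number of sequences
// without a recursive helper template.
template <typename... Seqs>
using sequence_merge_t = decltype((Sequence<>{} + ... + Seqs{}));

static_assert(std::is_same_v<sequence_merge_t<Sequence<1, 2>, Sequence<3>, Sequence<4, 5>>,
                             Sequence<1, 2, 3, 4, 5>>);
static_assert(std::is_same_v<sequence_merge_t<>, Sequence<>>);
```

The fold keeps the template nesting depth flat, although it still instantiates one `operator+` per merged sequence.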
## Tools and Commands

Identify optimization targets:

```bash
# Run analysis on a specific target
script/tools/ck-build-analysis example_convnd_fwd_xdl_fp16 --granularity=1

# View the generated report
cat build_time_analysis_report.md
```

The report shows template instantiation counts, wall-clock times, and identifies the most expensive templates.
