Three changes to the CPU deformable convolution forward kernel:
1. Replace at::zeros with at::empty for the columns and out_buf buffers.
deformable_im2col_kernel writes every element of the columns
buffer, and addmm_ fully overwrites out_buf once beta=0 (change 2),
so zero-initialization is wasted work.
2. Call addmm_ with beta=0 instead of the default beta=1. With beta=0
the previous contents of out_buf are ignored, so nothing is accumulated
from uninitialized memory, and the operation stays in place (no extra
allocation, unlike at::mm).
3. Parallelize deformable_im2col_kernel with at::parallel_for. The
im2col loop was the only single-threaded phase in the forward pass
(GEMM is already parallelized by BLAS). Each loop iteration writes
to a non-overlapping region of the columns buffer, so parallelization
is safe.
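
A minimal sketch of the two ideas above, in plain standard C++ rather than
ATen. All names here (parallel_for_sketch, fill_columns, gemm_beta) are
hypothetical stand-ins and not the kernel's actual code. The sketch assumes
a simple one-thread-per-chunk split, which is not how at::parallel_for
schedules work internally. It only illustrates why disjoint writes need no
locking and why beta=0 makes the prior contents of the output irrelevant.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for at::parallel_for: split [begin, end) into
// contiguous chunks and run fn(chunk_begin, chunk_end) on one thread each.
template <typename Fn>
void parallel_for_sketch(int64_t begin, int64_t end, int64_t num_threads, Fn fn) {
  assert(num_threads > 0);
  if (end <= begin) return;
  const int64_t chunk = (end - begin + num_threads - 1) / num_threads;
  std::vector<std::thread> workers;
  for (int64_t start = begin; start < end; start += chunk) {
    const int64_t stop = std::min(end, start + chunk);
    // fn outlives the workers because every thread is joined before returning.
    workers.emplace_back([&fn, start, stop] { fn(start, stop); });
  }
  for (auto& w : workers) w.join();
}

// Toy im2col-style fill: iteration i writes only columns[i], so chunks never
// touch the same element. Note that std::vector zero-initializes here, unlike
// at::empty; every element is overwritten below either way.
std::vector<float> fill_columns(int64_t n, int64_t num_threads) {
  std::vector<float> columns(static_cast<std::size_t>(n));
  float* data = columns.data();
  parallel_for_sketch(0, n, num_threads, [data](int64_t lo, int64_t hi) {
    for (int64_t i = lo; i < hi; ++i) data[i] = 2.0f * static_cast<float>(i);
  });
  return columns;
}

// out = beta * out + A(m x k) * B(k x n), row-major. With beta == 0 the old
// contents of out are never read, so stale values cannot leak into the result.
void gemm_beta(const std::vector<float>& a, const std::vector<float>& b,
               std::vector<float>& out, int64_t m, int64_t k, int64_t n,
               float beta) {
  for (int64_t i = 0; i < m; ++i) {
    for (int64_t j = 0; j < n; ++j) {
      float acc = 0.0f;
      for (int64_t p = 0; p < k; ++p) acc += a[i * k + p] * b[p * n + j];
      float& dst = out[i * n + j];
      dst = (beta == 0.0f) ? acc : beta * dst + acc;
    }
  }
}
```

Calling gemm_beta with beta=0 on an output buffer holding stale values
yields the plain product, while beta=1 adds the stale values in. On some
toolchains the thread-based helper needs -pthread at link time.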
Benchmark results on Apple M2 (CPU, float32):
Config       Before (ms)   After (ms)   Change
small-b1            9.76         2.44     -75%
small-b8           91.77        33.88     -63%
medium-b1         216.70        75.80     -65%
medium-b8        1152.09       650.00     -44%
large-b1          348.86       302.70     -13%
large-b4         1342.75      1289.96      -4%
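
The Change column is consistent with (after - before) / before, rounded to
the nearest percent. A small sketch for recomputing it follows. The helper
name percent_change is hypothetical and not part of the benchmark script.

```cpp
#include <cassert>
#include <cmath>

// Relative change from before to after, as a rounded integer percentage.
// Negative values mean the "after" timing is faster.
long percent_change(double before_ms, double after_ms) {
  assert(before_ms != 0.0);
  return std::lround((after_ms - before_ms) / before_ms * 100.0);
}
```

For example, percent_change(1152.09, 650.00) reproduces the medium-b8 row.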
Signed-off-by: Yonghye Kwon <developer.0hye@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>