Perf: OpenMP cache blocking and SIMD for PW_Basis FFT transform copy routines by MiniYuanBot · Pull Request #7439 · deepmodeling/abacus-develop

MiniYuanBot · 2026-06-05T14:51:14Z

What's changed

This PR optimizes the memory-bound copy loops in PW_Basis::real2recip and PW_Basis::recip2real (source/module_pw/pw_transform.cpp) using cache blocking and SIMD vectorization, while maintaining full numerical compatibility with the original implementation.

Key Changes

Cache blocking (tiling)
Introduced a unified block size pw_transform_cache_block = 1024 and helper block_end(). All long copy loops are rewritten in a two-level structure:

#pragma omp parallel for schedule(static)
for (int ib = 0; ib < nrxx_; ib += pw_transform_cache_block) {
    const int iend = block_end(ib, nrxx_);
    #pragma omp simd
    for (int ir = ib; ir < iend; ++ir) {
        auxr[ir] = in_[ir];
    }
}

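Each inner loop touches at most 1024 contiguous elements per buffer, which keeps the per-block working set small enough for L1/L2 cache. Because schedule(static) hands every thread whole blocks, neighbouring threads rarely write to the same cache line, which mitigates false sharing across OpenMP threads.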
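For readers who want to try the pattern outside ABACUS, the following is a minimal self-contained sketch of the two-level loop. The body of block_end() is not shown in this description, so the clamping logic below is an assumption about what it does, and blocked_copy is a hypothetical name used only for this illustration.

```cpp
#include <algorithm>
#include <complex>
#include <vector>

// Block size quoted in the PR description.
constexpr int pw_transform_cache_block = 1024;

// Assumed behaviour of the helper: return the end index of the block that
// starts at `begin`, clamped so it never runs past `n`.
inline int block_end(const int begin, const int n)
{
    return begin + std::min(pw_transform_cache_block, n - begin);
}

// Hypothetical wrapper: copy n elements from `in` to `out` block by block.
// The pragmas are ignored if the compiler is run without OpenMP support,
// in which case this is a plain serial copy.
template <typename T>
void blocked_copy(const T* in, T* out, const int n)
{
#pragma omp parallel for schedule(static)
    for (int ib = 0; ib < n; ib += pw_transform_cache_block)
    {
        const int iend = block_end(ib, n);
#pragma omp simd
        for (int i = ib; i < iend; ++i)
        {
            out[i] = in[i];
        }
    }
}
```

With GCC this would typically be compiled with -fopenmp (or -fopenmp-simd to enable only the simd pragma). The last block is shorter than 1024 whenever n is not a multiple of the block size, which is why the end index is clamped.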

SIMD vectorization
Added #pragma omp simd to the inner stride-1 loops (contiguous copy, zeroing, and accumulation). This helps the compiler emit packed SIMD loads and stores (AVX2/AVX-512) for std::complex<FPTYPE> and real-valued buffers.

Alias analysis & pointer caching
Cached frequently accessed member variables (nrxx, npw, nxyz, ig2isz) and FFT buffer pointers (auxr, auxg, rspace) as local const variables. This avoids repeated this-> indirection and makes it easier for the compiler to see that the loop bounds and base pointers do not change inside the loop.

Finer-grained timers
Added sub-timers (real2recip_copy_r, real2recip_copy_g, recip2real_copy_r, recip2real_copy_g) to isolate memory-copy overhead from FFT library time, aiding future profiling.
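As an illustration of the SIMD and pointer-caching items above, here is a small sketch of an accumulation loop that reads its bounds and buffer pointer from locals. ToyBasis and accumulate are hypothetical names invented for this example; they are not the PW_Basis interface, and the real routines differ in detail.

```cpp
#include <complex>
#include <vector>

// Hypothetical stand-in for a class that owns an FFT work buffer.
template <typename FPTYPE>
struct ToyBasis
{
    int nrxx = 0;                            // number of real-space points
    std::vector<std::complex<FPTYPE>> auxr;  // work buffer, at least nrxx long

    // Computes out[ir] += factor * auxr[ir] for ir in [0, nrxx).
    // `out` must not partially overlap the work buffer, because the simd
    // pragma tells the compiler that iterations are independent.
    void accumulate(std::complex<FPTYPE>* out, const FPTYPE factor) const
    {
        // Read the members once so the loop body only uses local constants.
        const int nrxx_ = this->nrxx;
        const std::complex<FPTYPE>* const auxr_ = this->auxr.data();

#pragma omp simd
        for (int ir = 0; ir < nrxx_; ++ir)
        {
            out[ir] += factor * auxr_[ir];
        }
    }
};
```

Whether the local copies change the generated code depends on the compiler and optimization level, so this is best checked by inspecting the assembly or timing the loop rather than assumed.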

Performance (256^3 grid, ecut=50, 20 repeats, WSL2 GCC 13.3.0)

Threads	Time (s)	Speedup	Efficiency
1	29.34	1.00	100.0%
2	13.84	2.12	105.9%
4	10.17	2.88	72.1%
8	4.53	6.48	81.0%
12	4.93	5.96	49.6%
16	3.71	7.91	49.4%

8 physical cores achieve 6.48× speedup at 81% parallel efficiency.
Efficiency drops beyond 8 threads due to Hyper-Threading and memory-bandwidth saturation, which is expected for memory-intensive FFT kernels.

Files Changed

source/module_pw/pw_transform.cpp — optimized copy loops and timers

MiniYuanBot · 2026-06-05T14:52:16Z

\label project_learning
This is Problem 3 of assignment01 on the plane wave module.
Thanks for the review :)

add simd to fft

4af6586

mohanchen added the project_learning label Jun 5, 2026

mohanchen assigned Qianruipku Jun 5, 2026

mohanchen requested a review from Qianruipku June 5, 2026 22:04

MiniYuanBot wants to merge 1 commit into deepmodeling:develop from mystic-qaq:feat/fft-copy-block-simd

3 participants
