Perf: OpenMP cache blocking and SIMD for PW_Basis FFT transform copy routines #7439

Open
MiniYuanBot wants to merge 1 commit into deepmodeling:develop from mystic-qaq:feat/fft-copy-block-simd

What's changed

This PR optimizes the memory-bound copy loops in PW_Basis::real2recip and PW_Basis::recip2real (source/module_pw/pw_transform.cpp) using cache blocking and SIMD vectorization, while maintaining full numerical compatibility with the original implementation.

Key Changes

  1. Cache blocking (tiling)
    Introduced a unified block size pw_transform_cache_block = 1024 and helper block_end(). All long copy loops are rewritten in a two-level structure:

    #pragma omp parallel for schedule(static)
    for (int ib = 0; ib < nrxx_; ib += pw_transform_cache_block) {
        const int iend = block_end(ib, nrxx_);
        #pragma omp simd
        for (int ir = ib; ir < iend; ++ir) {
            auxr[ir] = in_[ir];
        }
    }

    This keeps the working set in L1/L2 cache and mitigates false sharing across OpenMP threads.

  2. SIMD vectorization
    Added #pragma omp simd to the inner stride-1 loops (continuous copy, zeroing, and accumulation). This helps the compiler emit contiguous SIMD instructions (AVX2/AVX-512) for std::complex<FPTYPE> and real-valued buffers.

  3. Alias analysis & pointer caching
    Cached frequently accessed member variables (nrxx, npw, nxyz, ig2isz) and FFT buffer pointers (auxr, auxg, rspace) as local const variables. This reduces repeated this-> indirection and improves compiler aliasing assumptions.

  4. Finer-grained timers
    Added sub-timers (real2recip_copy_r, real2recip_copy_g, recip2real_copy_r, recip2real_copy_g) to isolate memory-copy overhead from FFT library time, aiding future profiling.
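
The four changes above can be combined in one small, self-contained sketch. It is illustrative only and is not the code in pw_transform.cpp: `MiniBasis`, `ScopedTimer`, `timer_totals()`, `copy_in()` and `accumulate_out()` are hypothetical names, the body of `block_end()` is an assumption about what the PR's helper does, and `std::complex<double>` stands in for `std::complex<FPTYPE>`. Only the block size 1024 and the timer labels come from the description.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <complex>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Block size taken from the PR description.
constexpr int pw_transform_cache_block = 1024;

// Assumed behaviour of block_end(): clamp the end of a block to n.
inline int block_end(int begin, int n) {
    return begin + std::min(pw_transform_cache_block, n - begin);
}

// Hypothetical store of accumulated seconds per sub-timer label.
inline std::map<std::string, double>& timer_totals() {
    static std::map<std::string, double> totals;
    return totals;
}

// Hypothetical RAII sub-timer. Not thread-safe, so it is created
// outside the parallel region and times the whole copy loop.
class ScopedTimer {
  public:
    explicit ScopedTimer(std::string label)
        : label_(std::move(label)), start_(Clock::now()) {}
    ~ScopedTimer() {
        const std::chrono::duration<double> dt = Clock::now() - start_;
        timer_totals()[label_] += dt.count();
    }
  private:
    using Clock = std::chrono::steady_clock;
    std::string label_;
    Clock::time_point start_;
};

// Hypothetical stand-in for the buffers owned by PW_Basis.
struct MiniBasis {
    int nrxx = 0;
    std::vector<std::complex<double>> auxr;

    // Blocked, stride-1 copy of the input into the FFT buffer.
    void copy_in(const std::complex<double>* in) {
        ScopedTimer timer("real2recip_copy_r");
        const int n = nrxx;                             // member cached in a local
        std::complex<double>* const dst = auxr.data();  // buffer pointer cached
        assert(static_cast<int>(auxr.size()) >= n);
#pragma omp parallel for schedule(static)
        for (int ib = 0; ib < n; ib += pw_transform_cache_block) {
            const int iend = block_end(ib, n);
#pragma omp simd
            for (int ir = ib; ir < iend; ++ir) {
                dst[ir] = in[ir];
            }
        }
    }

    // Blocked accumulation: out[ir] += factor * auxr[ir].
    void accumulate_out(std::complex<double>* out, double factor) const {
        ScopedTimer timer("recip2real_copy_r");
        const int n = nrxx;
        const std::complex<double>* const src = auxr.data();
        assert(static_cast<int>(auxr.size()) >= n);
#pragma omp parallel for schedule(static)
        for (int ib = 0; ib < n; ib += pw_transform_cache_block) {
            const int iend = block_end(ib, n);
#pragma omp simd
            for (int ir = ib; ir < iend; ++ir) {
                out[ir] += factor * src[ir];
            }
        }
    }
};
```

Built without OpenMP support, the pragmas are ignored and the loops run serially with the same results, which makes the sketch convenient for checking numerical equivalence against an unblocked loop.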

Performance (256^3 grid, ecut=50, 20 repeats, WSL2 GCC 13.3.0)

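| Threads | Time (s) | Speedup | Efficiency |
|---------|----------|---------|------------|
| 1       | 29.34    | 1.00    | 100.0%     |
| 2       | 13.84    | 2.12    | 105.9%     |
| 4       | 10.17    | 2.88    | 72.1%      |
| 8       | 4.53     | 6.48    | 81.0%      |
| 12      | 4.93     | 5.96    | 49.6%      |
| 16      | 3.71     | 7.91    | 49.4%      |
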
  • 8 physical cores achieve 6.48× speedup at 81% parallel efficiency.
  • Efficiency drops beyond 8 threads due to Hyper-Threading and memory-bandwidth saturation, which is expected for memory-intensive FFT kernels.
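
For reference, the Speedup and Efficiency columns appear to follow the usual definitions, speedup = T(1) / T(p) and efficiency = speedup / p. The helper names below are hypothetical and only restate that arithmetic; they are not part of the PR.

```cpp
#include <cassert>

// Speedup relative to the single-thread run: S(p) = T(1) / T(p).
inline double speedup(double t1, double tp) {
    assert(t1 > 0.0 && tp > 0.0);
    return t1 / tp;
}

// Parallel efficiency: E(p) = S(p) / p.
inline double efficiency(double t1, double tp, int threads) {
    assert(threads > 0);
    return speedup(t1, tp) / threads;
}
```

With the 8-thread row, `speedup(29.34, 4.53)` evaluates to about 6.48 and `efficiency(29.34, 4.53, 8)` to about 0.81, matching the table.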

Files Changed

  • source/module_pw/pw_transform.cpp — optimized copy loops and timers


MiniYuanBot commented Jun 5, 2026

\label project_learning
This is Problem 3 of assignment01 on the plane wave module.
Thanks for the review :)
