Add experimental PW LOBPCG eigensolver #7406
|
Nice try. You can keep refining the implementation and use test cases to verify its correctness and performance. |
| Real* eigenvalue_in, | ||
| const std::vector<double>& ethr_band); | ||
|
|
||
| /// USPP (S≠I) diagonalization. NOT IMPLEMENTED — aborts. |
Actually, the standard eigenproblem is just a special case (S = I) of the generalized problem. Only one diag interface is needed for both, and the latter is adequate.
Thanks for the suggestion. I will narrow down the interface.
| Real* eigenvalue_in, | ||
| const std::vector<double>& ethr_band); | ||
|
|
||
| /// USPP (S≠I) diagonalization. NOT IMPLEMENTED — aborts. |
The standard eigenvalue problem is just a special case (where S is I) for the generalized problem. Thus one diag interface (the latter) is adequate.
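To illustrate the reviewer's point, here is a minimal sketch of one generalized entry point that also covers the standard case by passing an identity overlap. The names `ApplyOp`, `apply_identity`, `inner` and `rayleigh_quotient` are hypothetical and are not taken from the PR; the only assumption about the solver is that it consumes an H operator and an S operator.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <functional>
#include <vector>

using Complex = std::complex<double>;
using Vec = std::vector<Complex>;
// Hypothetical operator type: writes op(x) into out.
using ApplyOp = std::function<void(const Vec& x, Vec& out)>;

// Identity overlap, S x = x. Passing this reduces H x = lambda S x to H x = lambda x.
inline void apply_identity(const Vec& x, Vec& out) { out = x; }

// Inner product <a|b>, conjugating the left argument.
inline Complex inner(const Vec& a, const Vec& b)
{
    assert(a.size() == b.size());
    Complex sum(0.0, 0.0);
    for (std::size_t i = 0; i < a.size(); ++i) { sum += std::conj(a[i]) * b[i]; }
    return sum;
}

// Generalized Rayleigh quotient <x|H|x> / <x|S|x>; one code path for both problems.
inline double rayleigh_quotient(const ApplyOp& hpsi, const ApplyOp& spsi, const Vec& x)
{
    Vec hx(x.size()), sx(x.size());
    hpsi(x, hx);
    spsi(x, sx);
    return inner(x, hx).real() / inner(x, sx).real();
}
```

A standard-problem caller would pass `apply_identity` as the second argument, so no separate interface is needed.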
| namespace hsolver { | ||
|
|
||
| // ============================================================================ | ||
| // Band-major explicit-loop helpers (CPU, n_band_l == n_band required) |
Most of these loops can be replaced with a BLAS level-3 operation.
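As a sketch of what this comment suggests: the per-band inner-product loops fill the matrix G = X^H Y, which is the product a complex GEMM computes with transa = 'C' and transb = 'N'. The function below is a plain C++ reference with that contract, so it stays self-contained; in the solver a single BLAS call (or a helper such as the PGemmCN mentioned in this thread) would take the place of the triple loop. The name `overlap_cn` and the band-major layout are assumptions made for this example.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<double>;

// Assumed layout: band j of x occupies x[j * n_basis .. (j + 1) * n_basis),
// i.e. a column-major n_basis-by-n_band matrix.
// Returns G = X^H Y in column-major order: G(i, j) is stored at g[i + j * n_x].
std::vector<Complex> overlap_cn(const std::vector<Complex>& x, int n_x,
                                const std::vector<Complex>& y, int n_y, int n_basis)
{
    assert(x.size() == static_cast<std::size_t>(n_x) * n_basis);
    assert(y.size() == static_cast<std::size_t>(n_y) * n_basis);
    std::vector<Complex> g(static_cast<std::size_t>(n_x) * n_y, Complex(0.0, 0.0));
    for (int j = 0; j < n_y; ++j) {
        for (int i = 0; i < n_x; ++i) {
            Complex sum(0.0, 0.0);
            for (int k = 0; k < n_basis; ++k) {
                sum += std::conj(x[i * n_basis + k]) * y[j * n_basis + k];
            }
            g[i + j * n_x] = sum;
        }
    }
    return g;
}
```

Writing the loops as one matrix product first makes the later swap to a tuned library routine a local change.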
|
Updated LOBPCG to expose only the generalized diag(hpsi, spsi, ...) interface. The current implementation still only accepts S = I; non-identity overlap exits explicitly. Also replaced several explicit dense matrix operations with existing PGemmCN / PLinearTransform helpers. |
|
Update: PW LOBPCG band-parallel generalized path
This update moves the experimental PW LOBPCG solver beyond the previous
What changed
The LOBPCG PW path now always enters through the generalized
Band parallelism has been wired through the PW solver and input/global-state
For the generalized band-parallel algorithm, the distributed subspace now
Stability changes
The generalized band-parallel path uses rank-compression when the projected
The implementation also keeps explicit diagnostics for failed subspace solves,
Performance-related changes
Two low-risk optimizations are included:
There is also a small shared
Validation
Unit and integration-style tests run locally:
cmake --build build --target MODULE_HSOLVER_lobpcg -j2
cmake --build build --target MODULE_HSOLVER_linear_trans MODULE_HSOLVER_lobpcg -j2
cmake --build build --target abacus_basic_para -j2
cmake --build build --target MODULE_IO_system_variable_test -j2
env I_MPI_FABRICS=shm ctest --test-dir build -R '^MODULE_HSOLVER_lobpcg$' --output-on-failure
env I_MPI_FABRICS=shm ctest --test-dir build -R '^MODULE_HSOLVER_lobpcg_parallel$' --output-on-failure
env I_MPI_FABRICS=shm ctest --test-dir build -R '^MODULE_HSOLVER_para_linear_trans$' --output-on-failure
ctest --test-dir build -R '^MODULE_IO_system_variable_test$' --output-on-failure
git diff --check
Real-case checks
The following 2-rank band-parallel USPP cases were rerun after the latest
Previously validated reference matrix:
Current limitations
This is still an experimental PW LOBPCG implementation. The currently
Still not covered:
The next performance step would likely require a batched or multi-RHS |
Pull request overview
Copilot reviewed 14 out of 15 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
source/source_psi/psi_prepare.cpp:1
- In the lobpcg_band_parallel case, scatter_source is taken from bp_global_evc_device->get_pointer(). When PARAM.inp.device == "gpu", that pointer is very likely device memory; passing it into Parallel_Common::send_data(...) (and also into syncmem_complex_op() as a source) will break unless the codebase guarantees CUDA-aware MPI and that send_data supports device buffers. A safer approach is to always use a host-accessible pointer for MPI sends (e.g., use bp_global_evc_cpu->get_pointer() / stage a host buffer and synchronize device→host explicitly before the sends), and keep device pointers confined to device-side operations.
#include "psi_prepare.h"
| void DiagoLobpcg<T, Device>::validate_ethr_band(const std::vector<double>& ethr_band) const | ||
| { | ||
| if (ethr_band.size() < static_cast<size_t>(this->n_band_l)) { | ||
| std::ostringstream oss; | ||
| oss << "LOBPCG ethr_band size mismatch: size=" << ethr_band.size() | ||
| << ", required=" << this->n_band_l; | ||
| if (!this->diag_context.empty()) { | ||
| oss << ", context={" << this->diag_context << "}"; | ||
| } | ||
| throw std::invalid_argument(oss.str()); | ||
| } | ||
| } | ||
|
|
||
| template <typename T, typename Device> | ||
| bool DiagoLobpcg<T, Device>::test_error( | ||
| const ct::Tensor& err_in, const std::vector<double>& ethr_band) | ||
| { | ||
| Real* _err_st = err_in.data<Real>(); | ||
| bool not_conv = false; | ||
| std::vector<Real> tmp_cpu; | ||
| if (err_in.device_type() == ct::DeviceType::GpuDevice) { | ||
| tmp_cpu.resize(this->n_band_l); | ||
| _err_st = tmp_cpu.data(); | ||
| syncmem_var_d2h_op()(_err_st, err_in.data<Real>(), this->n_band_l); | ||
| } | ||
| for (int ii = 0; ii < this->n_band_l; ii++) | ||
| if (_err_st[ii] > ethr_band[ii]) not_conv = true; | ||
| #ifdef __MPI | ||
| MPI_Allreduce(MPI_IN_PLACE, &not_conv, 1, MPI_C_BOOL, MPI_LOR, BP_WORLD); | ||
| #endif | ||
| return not_conv; | ||
| } |
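The convergence test in the hunk above can be sketched serially as follows. This is an illustration under stated assumptions, not the PR's code: `local_not_converged` and `reduce_lor` are hypothetical names, and `reduce_lor` is only a stand-in for the logical-OR reduction that MPI_Allreduce with MPI_LOR performs across ranks. One point that may be worth checking in the real code: `not_conv` is a C++ `bool`, so pairing it with MPI_CXX_BOOL, or reducing an `int` flag with MPI_INT, is likely the more portable choice than MPI_C_BOOL.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// True if any of the first n_band_l residuals exceeds its per-band threshold.
// Both vectors must hold at least n_band_l entries, mirroring validate_ethr_band.
bool local_not_converged(const std::vector<double>& err,
                         const std::vector<double>& ethr_band, int n_band_l)
{
    const std::size_t n = static_cast<std::size_t>(n_band_l);
    if (err.size() < n || ethr_band.size() < n) {
        throw std::invalid_argument("err or ethr_band is shorter than n_band_l");
    }
    for (std::size_t i = 0; i < n; ++i) {
        if (err[i] > ethr_band[i]) { return true; }
    }
    return false;
}

// Serial stand-in for a logical-OR reduction over the per-rank flags:
// the iteration continues if any rank still has an unconverged band.
bool reduce_lor(const std::vector<bool>& flags)
{
    bool any = false;
    for (bool f : flags) { any = any || f; }
    return any;
}
```

Reducing the flag on every rank keeps all ranks in the same iteration, which matters because the following collective calls must be entered by the whole communicator.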
| #ifdef __MPI | ||
| int nproc_in_pool, kpar = 1, mypool, rank_in_pool; | ||
| setupmpi(argc, argv, nproc, myrank); | ||
| divide_pools(nproc, myrank, nproc_in_pool, kpar, mypool, rank_in_pool); | ||
| const bool use_band_parallel_world = std::getenv("ABACUS_LOBPCG_TEST_BNDPAR") != nullptr; | ||
| MPI_Comm_split(MPI_COMM_WORLD, use_band_parallel_world ? 0 : myrank, 0, &BP_WORLD); | ||
| if (use_band_parallel_world) | ||
| { | ||
| GlobalV::MY_BNDGROUP = myrank; | ||
| GlobalV::NPROC_IN_BNDGROUP = nproc; | ||
| MPI_Comm_free(&POOL_WORLD); | ||
| MPI_Comm_split(MPI_COMM_WORLD, myrank, 0, &POOL_WORLD); | ||
| } | ||
| GlobalV::NPROC_IN_POOL = nproc; | ||
| #else | ||
| MPI_Init(&argc, &argv); | ||
| #endif |
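The effect of the color argument in the MPI_Comm_split call above can be modeled without MPI. MPI_Comm_split places ranks that pass the same color into the same new communicator, so color 0 on every rank gives one shared BP_WORLD, while color = myrank gives each rank its own single-member communicator. The helpers `group_by_color` and `bp_world_colors` below are hypothetical and only mimic that grouping rule.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Serial model of MPI_Comm_split grouping: colors[r] is the color passed by rank r;
// ranks sharing a color land in the same group, ordered by their old rank
// (which is what equal keys give in MPI).
std::map<int, std::vector<int>> group_by_color(const std::vector<int>& colors)
{
    std::map<int, std::vector<int>> groups;
    for (int rank = 0; rank < static_cast<int>(colors.size()); ++rank) {
        groups[colors[rank]].push_back(rank);
    }
    return groups;
}

// Colors the test harness passes when building BP_WORLD.
std::vector<int> bp_world_colors(int nproc, bool use_band_parallel_world)
{
    std::vector<int> colors(nproc);
    for (int rank = 0; rank < nproc; ++rank) {
        colors[rank] = use_band_parallel_world ? 0 : rank;
    }
    return colors;
}
```

With 4 ranks, the band-parallel setting yields one group of four and the default yields four groups of one.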
Background
During recent code reading and testing, we found that the current PW
bpcg solver path does not appear to expose an explicit generalized eigenproblem
interface for the overlap matrix S. Further comparison on USPP cases showed that
bpcg can produce results noticeably different from cg and dav, where
generalized eigenproblem handling is required.
As a first step toward improving this area, this PR introduces an experimental
LOBPCG solver framework for PW standard eigenproblems.
Changes
- DiagoLobpcg for CPU PW standard eigenproblems (S = I).
- ks_solver = lobpcg support in the PW hsolver path.
- lobpcg.
- LB.
Current Limitations
This PR intentionally does not claim USPP/generalized eigenproblem support yet.
Currently supported:
- std::complex<double>
- S = I)
Currently not supported:
- S != I)
Unsupported generalized usage exits explicitly with WARNING_QUIT.
Tests
cmake --build . --target hsolver MODULE_HSOLVER_lobpcg abacus_basic_para -j2
env I_MPI_FABRICS=shm ctest -R MODULE_HSOLVER_lobpcg --output-on-failure
git diff --check