Skip to content

Add experimental PW LOBPCG eigensolver#7406

Open
amadeus725 wants to merge 7 commits into
deepmodeling:developfrom
amadeus725:feature/pw-lobpcg
Open

Add experimental PW LOBPCG eigensolver#7406
amadeus725 wants to merge 7 commits into
deepmodeling:developfrom
amadeus725:feature/pw-lobpcg

Conversation

@amadeus725
Copy link
Copy Markdown

Background

During recent code reading and testing, we found that the current PW bpcg
solver path does not appear to expose an explicit generalized eigenproblem
interface for the overlap matrix S. Further comparison on USPP cases showed
that bpcg can produce results noticeably different from cg and dav, where
generalized eigenproblem handling is required.

As a first step toward improving this area, this PR introduces an experimental
LOBPCG solver framework for PW standard eigenproblems.

Changes

  • Add DiagoLobpcg for CPU PW standard eigenproblems (S = I).
  • Add ks_solver = lobpcg support in the PW hsolver path.
  • Add input documentation for lobpcg.
  • Add SCF iteration print label LB.
  • Add unit tests for:
    • Hermitian eigenvalue correctness
    • orthonormality and residual checks
    • generalized-interface not-implemented behavior

Current Limitations

This PR intentionally does not claim USPP/generalized eigenproblem support yet.

Currently supported:

  • PW basis
  • CPU
  • std::complex<double>
  • norm-conserving / standard eigenproblem path (S = I)

Currently not supported:

  • USPP / generalized eigenproblem (S != I)
  • GPU
  • float instantiation
  • band-parallel LOBPCG

Unsupported generalized usage exits explicitly with WARNING_QUIT.

Tests

  • cmake --build . --target hsolver MODULE_HSOLVER_lobpcg abacus_basic_para -j2
  • env I_MPI_FABRICS=shm ctest -R MODULE_HSOLVER_lobpcg --output-on-failure
  • git diff --check

Copilot AI review requested due to automatic review settings May 30, 2026 13:24
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@mohanchen
Copy link
Copy Markdown
Collaborator

Nice try. You can keep refining the implementation and use test cases to verify its correctness and performance.

Comment thread source/source_hsolver/diago_lobpcg.h Outdated
Real* eigenvalue_in,
const std::vector<double>& ethr_band);

/// USPP (S≠I) diagonalization. NOT IMPLEMENTED — aborts.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually standard eigenproblem is just a special case(S=I) for generalized problem. Only one diag interface is necessary for both, and the latter is adequate.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion. I will narrow down the interface.

Comment thread source/source_hsolver/diago_lobpcg.h Outdated
Real* eigenvalue_in,
const std::vector<double>& ethr_band);

/// USPP (S≠I) diagonalization. NOT IMPLEMENTED — aborts.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The standard eigenvalue problem is just a special case (where S is I) for the generalized problem. Thus one diag interface (the latter) is adequate.

Comment thread source/source_hsolver/diago_lobpcg.cpp Outdated
namespace hsolver {

// ============================================================================
// Band-major explicit-loop helpers (CPU, n_band_l == n_band required)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most loops can be achieved with a BLAS 3 op

@amadeus725
Copy link
Copy Markdown
Author

Updated LOBPCG to expose only the generalized diag(hpsi, spsi, ...) interface. The current implementation still only accepts S = I; non-identity overlap exits explicitly. Also replaced several explicit dense matrix operations with existing PGemmCN / PLinearTransform helpers.

@amadeus725
Copy link
Copy Markdown
Author

Update: PW LOBPCG band-parallel generalized path

This update moves the experimental PW LOBPCG solver beyond the previous
serial/S=I prototype. The current version now has a 2-rank validated
band-parallel path, including the generalized S != I / USPP case.

What changed

The LOBPCG PW path now always enters through the generalized
diag(hpsi, spsi, ...) interface. The standard problem is treated as the
S = I special case, while USPP uses the same interface with an explicit
overlap operator.

Band parallelism has been wired through the PW solver and input/global-state
setup:

  • ks_solver = lobpcg can now keep bndpar > 1.
  • ks_run, local band counts, and initial wavefunction preparation are
    updated for LOBPCG band-parallel execution.
  • Initial PW wavefunctions are prepared globally when needed and then
    scattered across band groups.

For the generalized band-parallel algorithm, the distributed subspace now
tracks the full set of objects needed by LOBPCG:

  • trial/search blocks: X, Z, P
  • Hamiltonian products: HX, HZ, HP
  • overlap products: SX, SZ, SP
  • projected subspace matrices: H_sub, S_sub
  • global coefficient rotation back to distributed wavefunctions

Stability changes

The generalized band-parallel path uses rank-compression when the projected
overlap matrix S_sub is ill-conditioned. This is important for USPP cases:
direct generalized diagonalization of the raw subspace can be unstable when
the local LOBPCG subspace contains nearly dependent directions.

The implementation also keeps explicit diagnostics for failed subspace solves,
non-finite vectors, and unconverged LOBPCG inner iterations. These diagnostics
include k-point, local/global band count, basis size, and USPP context.

Performance-related changes

Two low-risk optimizations are included:

  • Build Hermitian projected subspace blocks from the upper triangle and mirror
    the conjugate block. This reduces PGemmCN calls in the tested
    band-parallel cases.
  • Avoid redundant H*pdir / S*pdir recomputation after the generalized
    band-parallel tail update.

There is also a small shared PLinearTransform::act cleanup: an unused
B_tmp copy in the MPI path was removed. The existing
MODULE_HSOLVER_para_linear_trans test was run after this change.

Validation

Unit and integration-style tests run locally:

cmake --build build --target MODULE_HSOLVER_lobpcg -j2
cmake --build build --target MODULE_HSOLVER_linear_trans MODULE_HSOLVER_lobpcg -j2
cmake --build build --target abacus_basic_para -j2
cmake --build build --target MODULE_IO_system_variable_test -j2

env I_MPI_FABRICS=shm ctest --test-dir build -R '^MODULE_HSOLVER_lobpcg$' --output-on-failure
env I_MPI_FABRICS=shm ctest --test-dir build -R '^MODULE_HSOLVER_lobpcg_parallel$' --output-on-failure
env I_MPI_FABRICS=shm ctest --test-dir build -R '^MODULE_HSOLVER_para_linear_trans$' --output-on-failure
ctest --test-dir build -R '^MODULE_IO_system_variable_test$' --output-on-failure
git diff --check

Real-case checks

The following 2-rank band-parallel USPP cases were rerun after the latest
cleanup:

Case Final energy Notes
NaCl USPP bp2 -1220.8380554789837333 eV PGemmCN multiply = 82
Fe USPP tight bp2 -2860.7186058101456183 eV PGemmCN multiply = 160, PLinearTransform act = 272

Previously validated reference matrix:

Case family Parallel settings Result
Si NC bp1 / bp2 / bp4 energies agree to roundoff
NaCl USPP bp1 / bp2 / bp4 energies agree to roundoff
Fe USPP tight bp1 / bp2 / bp4 energies agree to roundoff

Current limitations

This is still an experimental PW LOBPCG implementation. The currently
validated scope is:

  • PW basis
  • CPU
  • std::complex<double>
  • standard S = I and generalized S != I / USPP paths
  • band-parallel validation on 2 ranks, with prior bp4 reference checks

Still not covered:

  • GPU
  • float instantiation
  • systematic bp4/bp8 stress testing on a larger machine
  • deeper PLinearTransform call-count reduction

The next performance step would likely require a batched or multi-RHS
PLinearTransform interface. That would touch lower-level band-parallel
communication, so I kept it out of this update.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 15 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

source/source_psi/psi_prepare.cpp:1

  • In the lobpcg_band_parallel case, scatter_source is taken from bp_global_evc_device->get_pointer(). When PARAM.inp.device == "gpu", that pointer is very likely device memory; passing it into Parallel_Common::send_data(...) (and also into syncmem_complex_op() as a source) will break unless the codebase guarantees CUDA-aware MPI and that send_data supports device buffers. A safer approach is to always use a host-accessible pointer for MPI sends (e.g., use bp_global_evc_cpu->get_pointer() / stage a host buffer and synchronize device→host explicitly before the sends), and keep device pointers confined to device-side operations."
#include "psi_prepare.h"

Comment on lines +1503 to +1534
void DiagoLobpcg<T, Device>::validate_ethr_band(const std::vector<double>& ethr_band) const
{
if (ethr_band.size() < static_cast<size_t>(this->n_band_l)) {
std::ostringstream oss;
oss << "LOBPCG ethr_band size mismatch: size=" << ethr_band.size()
<< ", required=" << this->n_band_l;
if (!this->diag_context.empty()) {
oss << ", context={" << this->diag_context << "}";
}
throw std::invalid_argument(oss.str());
}
}

template <typename T, typename Device>
bool DiagoLobpcg<T, Device>::test_error(
const ct::Tensor& err_in, const std::vector<double>& ethr_band)
{
Real* _err_st = err_in.data<Real>();
bool not_conv = false;
std::vector<Real> tmp_cpu;
if (err_in.device_type() == ct::DeviceType::GpuDevice) {
tmp_cpu.resize(this->n_band_l);
_err_st = tmp_cpu.data();
syncmem_var_d2h_op()(_err_st, err_in.data<Real>(), this->n_band_l);
}
for (int ii = 0; ii < this->n_band_l; ii++)
if (_err_st[ii] > ethr_band[ii]) not_conv = true;
#ifdef __MPI
MPI_Allreduce(MPI_IN_PLACE, &not_conv, 1, MPI_C_BOOL, MPI_LOR, BP_WORLD);
#endif
return not_conv;
}
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

Comment on lines +931 to +947
#ifdef __MPI
int nproc_in_pool, kpar = 1, mypool, rank_in_pool;
setupmpi(argc, argv, nproc, myrank);
divide_pools(nproc, myrank, nproc_in_pool, kpar, mypool, rank_in_pool);
const bool use_band_parallel_world = std::getenv("ABACUS_LOBPCG_TEST_BNDPAR") != nullptr;
MPI_Comm_split(MPI_COMM_WORLD, use_band_parallel_world ? 0 : myrank, 0, &BP_WORLD);
if (use_band_parallel_world)
{
GlobalV::MY_BNDGROUP = myrank;
GlobalV::NPROC_IN_BNDGROUP = nproc;
MPI_Comm_free(&POOL_WORLD);
MPI_Comm_split(MPI_COMM_WORLD, myrank, 0, &POOL_WORLD);
}
GlobalV::NPROC_IN_POOL = nproc;
#else
MPI_Init(&argc, &argv);
#endif
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants