Intermittent LatticeReduction::sort assertion m[0]<=m[1]*onePlusEpsilon during well-tempered metadynamics (GROMACS 2025.4 runtime kernel, v2.10.0) #1399

@PricklyE

Summary

PLUMED 2.10.0, loaded as a runtime kernel by GROMACS 2025.4, occasionally aborts with an internal assertion in LatticeReduction::sort during a well-tempered 2D metadynamics run on an explicit-solvent polymer-drug system. The crash is intermittent: restarts from the same checkpoint sometimes reproduce it within a few ns and sometimes run cleanly for tens of thousands of steps past the last crash point. No fix commit exists after v2.10.0 on either the v2.10 maintenance branch or master (for v2.10, git log v2.10.0..origin/v2.10 -- src/tools/LatticeReduction.cpp returns nothing).

Environment

  • PLUMED: 2.10.0 (tag 0f2be2d617d7b8d577f1856c9172759543f5c9f0 Release v2.10.0)
  • GROMACS: 2025.4, mixed precision, CUDA 13.1, linked against PLUMED via runtime kernel loader (PLUMED_KERNEL env var; PLUMED is not statically patched into GROMACS)
  • Hardware: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, AMD Threadripper 7960X, Ubuntu 24.04
  • mdrun flags: -ntmpi 1 -ntomp 16 -nb gpu -pme gpu -bonded gpu -update cpu -pin on -pinoffset 16 -gpu_id 1 -noappend
  • PLUMED CV stack: DISTANCE between two COM groups + COORDINATION with NLIST NL_CUTOFF=1.8 + GYRATION, biased by 2D well-tempered METAD
  • Integrator: 1 fs timestep, NPT with the pressure coupling set in the system's standard mdp (Parrinello-Rahman or C-rescale)

Assertion message (verbatim)

-------------------------------------------------------
Program:     gmx mdrun, version 2025.4
Source file: src/gromacs/applied_forces/plumed/plumedforceprovider.cpp (line 212)
Function:    virtual void gmx::PlumedForceProvider::calculateForces(
               const gmx::ForceProviderInput&, gmx::ForceProviderOutput*)

Internal error (bug):
An error occurred while PLUMED was calculating the forces
:
(tools/LatticeReduction.cpp:43) static void
PLMD::LatticeReduction::sort(PLMD::Vector*)
+++ assertion failed: m[0]<=m[1]*onePlusEpsilon
-------------------------------------------------------

Observation history

| Attempt | Kernel | Start step | End step / outcome | Progress (ns) |
|---|---|---|---|---|
| 1 (Apr 18) | 2.10.0 | 17 252 300 | crashed at assertion (step not logged in summary) | ~4-5 |
| 2 (Apr 18, retry) | 2.10.0 | 20 661 100 | crashed at step 22 910 000 at the same assertion | 2.25 |
| 3 (Apr 20, 15 min diagnostic run on sandbox copy of same checkpoint) | 2.10.0 | 22 902 100 | clean exit at -maxh 0.25, reached step 22 936 500 | 0.034, no crash |

Key observation: attempts 1 and 2 crashed; attempt 3, resuming from a checkpoint about 8 ps (7 900 steps) before attempt 2's crash point, ran past that point cleanly. The bug is therefore not triggered deterministically by a particular trajectory state. GPU GROMACS + PLUMED restarts are not bit-exact (non-associative floating-point reduction on the GPU; stochastic elements in PLUMED), so each restart explores a slightly different numerical trajectory.
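To illustrate why reduction order alone can make restarts diverge, here is a minimal sketch of non-associative single-precision addition. The function names and the input values are illustrative only and are not taken from GROMACS, PLUMED, or this run.

```cpp
#include <cassert>

// Sum three floats left to right: (a + b) + c.
float sumLeftToRight(float a, float b, float c) {
    volatile float partial = a + b;  // volatile forces rounding to float
    return partial + c;
}

// Sum the same three floats right to left: a + (b + c).
float sumRightToLeft(float a, float b, float c) {
    volatile float partial = b + c;  // volatile forces rounding to float
    return a + partial;
}

// Illustrative values chosen to make the effect visible.
void demoReductionOrder() {
    const float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
    // (a + b) + c: a + b is exactly 0, so the result is 1.
    assert(sumLeftToRight(a, b, c) == 1.0f);
    // a + (b + c): b + c rounds back to -1e8 (float spacing near 1e8 is 8),
    // so the result is 0.
    assert(sumRightToLeft(a, b, c) == 0.0f);
}
```

A GPU reduction that accumulates partial sums in a different order from run to run can produce last-bit differences of this kind, which then grow over many MD steps.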

Why this is probably numerical

The assertion m[0]<=m[1]*onePlusEpsilon is in LatticeReduction::sort(), which is called to sort lattice vectors by magnitude for minimum-image calculations. Because the box evolves under NPT with moderate anisotropy, two lattice-vector magnitudes may occasionally become degenerate to within the onePlusEpsilon tolerance (current value ≈ 1 + 1e-6), at which point the sort's ordering invariant would no longer hold.
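To help narrow down which inputs can violate a check of this form, here is a minimal stand-in that sorts three values with pairwise swaps and then tests the tolerance-based ordering. It is a sketch, not PLUMED's implementation: sortAndCheck and demoSortAndCheck are hypothetical names, the values are made up, and the tolerance is simply the one quoted in this report.

```cpp
#include <array>
#include <cassert>
#include <limits>
#include <utility>

// Hypothetical stand-in (not PLUMED code): sort three values with pairwise
// swaps, then report whether the tolerance-based ordering check holds.
bool sortAndCheck(std::array<double, 3>& m, double onePlusEpsilon) {
    for (int i = 0; i < 3; ++i) {
        for (int j = i + 1; j < 3; ++j) {
            if (m[i] > m[j]) std::swap(m[i], m[j]);
        }
    }
    return m[0] <= m[1] * onePlusEpsilon && m[1] <= m[2] * onePlusEpsilon;
}

void demoSortAndCheck() {
    const double tol = 1.0 + 1e-6;  // tolerance value quoted in this report

    // Exactly degenerate finite values: the check holds in this sketch.
    std::array<double, 3> degenerate{36.0, 36.0, 36.0};
    assert(sortAndCheck(degenerate, tol));

    // Unsorted finite values: sorted by the swaps, then the check holds.
    std::array<double, 3> unsorted{37.5, 36.0, 36.0000001};
    assert(sortAndCheck(unsorted, tol));

    // A NaN entry: every comparison with NaN is false, so the check fails.
    const double nan = std::numeric_limits<double>::quiet_NaN();
    std::array<double, 3> withNaN{36.0, nan, 36.5};
    assert(!sortAndCheck(withNaN, tol));
}
```

In this simplified model, finite inputs pass the check after sorting, even when exactly degenerate, while a non-finite entry fails it. Whether the actual routine behaves the same way is not established here, but logging the box vectors on the failing step would show whether they are all finite.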

Reproducer

Available on request: metad.tpr (45 MB), plumed.dat, metad.cpt (38 MB), start.gro (112 MB), system.top, index.ndx. System: 6285 atoms polymer (10-chain cationic amphiphile), 47 atoms drug (curcumin), ~19 357 solute atoms total in explicit TIP3P water, triclinic box ~6×6×6 nm, 1 fs timestep.

Crash is intermittent; tends to surface within 2-6 ns of restart. Current MTBF estimate (n=2 crashes over 8 ns progress): ~4 ns between crashes, but sample is small.

Suggested investigation

  1. Is onePlusEpsilon safely chosen for single-precision float paths? (GROMACS 2025.4 used mixed precision; PLUMED kernel uses double internally AFAIK, but the vectors may originate from GROMACS single floats.)
  2. Would std::stable_sort with a relaxed comparator (tolerance, or nominal ordering on tie) be a safer fallback than the assertion?
  3. Is there a precision gap between single-precision GROMACS box vectors passed into the PLUMED kernel vs. what the kernel expects?
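For questions 1 and 3, the size of the single-to-double precision gap can be estimated with a short sketch. The helper name floatRoundTripRelativeError and the constant kFloatUnitRoundoff are hypothetical, and any box length passed in would be illustrative rather than read from the run.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Relative error from storing a double in a float and widening it back,
// as would happen if a box component passed through single precision.
double floatRoundTripRelativeError(double x) {
    assert(x != 0.0 && std::isfinite(x));
    const float narrowed = static_cast<float>(x);
    return std::fabs(static_cast<double>(narrowed) - x) / std::fabs(x);
}

// Upper bound on that error for normal floats with round-to-nearest:
// the unit roundoff 2^-24, about 6e-8.
constexpr double kFloatUnitRoundoff =
    std::numeric_limits<float>::epsilon() / 2.0;
```

Calling floatRoundTripRelativeError on a box length of roughly 6 nm gives a value no larger than kFloatUnitRoundoff. That bound can be compared directly against whatever tolerance the kernel actually uses.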

Workarounds already ruled out

  • NOPBC on DISTANCE / COORDINATION: scientifically incorrect for inter-molecular CVs where the drug crosses box boundaries.
  • WHOLEMOLECULES: keeps molecules intact but does not fix inter-molecular minimum-image calculation.
  • Downgrade to v2.9: shares the same LatticeReduction.cpp code (78c27bc67 Fix lattice reduction from Oct 2020 is already in both trees). Unlikely to help.
  • Pull v2.10 maintenance branch HEAD: verified — no commits touch LatticeReduction.cpp since v2.10.0.

Current workaround in use

Auto-restart wrapper (run_metad_restart.sh) that retries on any non-zero gmx mdrun exit from the same checkpoint. Tolerable for this 100 ns run (expected ~20-40 restarts worst-case), but not a long-term solution for ensemble or production workflows.
