Skip to content

add gromacs to frames node#463

Open
MrJulEnergy wants to merge 3 commits intomainfrom
addGmx2FramesNode
Open

add gromacs to frames node#463
MrJulEnergy wants to merge 3 commits intomainfrom
addGmx2FramesNode

Conversation

@MrJulEnergy
Copy link
Copy Markdown
Contributor

@MrJulEnergy MrJulEnergy commented Apr 3, 2026

Summary by CodeRabbit

  • New Features
    • Added GROMACS → ASE conversion for trajectories and topologies, including optional energies, stresses, forces and velocities per frame. Supports selecting frame ranges, matches external energy data to trajectory times (with tolerance reporting), and saves converted frames to a persistent frames file for later access. Includes a command-line/demo entry to print basic frame info.

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 3, 2026

📝 Walkthrough

Walkthrough

Adds a new GROMACS→ASE converter gmx_to_ase() that builds ASE Atoms frames (positions, cell/pbc, velocities, forces), parses optional .edr energy data matched by timestamps, and a Gmx2Frames zntrack node that runs the conversion and persists frames to frames.h5.

Changes

Cohort / File(s) Summary
GROMACS-to-ASE Conversion & ZnTrack Node
ipsuite/data_loading/add_data_gromacs.py
New module implementing gmx_to_ase(topology, trajectory=None, edr=None, start=None, stop=None, step=None) and Gmx2Frames(zntrack.Node). Adds atom symbol resolution (elements → types → capitalization), frame iteration, positions/velocities/forces handling, EDR parsing with timestamp nearest-match and tolerance logging, energy and stress extraction, SinglePointCalculator attachment, HDF5 persistence via znh5md.IO.extend(), and a frames property to read stored ase.Atoms list. Includes __main__ demonstration block.

Sequence Diagram

sequenceDiagram
    actor User
    participant Gmx2Frames as Gmx2Frames<br/>(ZnTrack Node)
    participant gmx_to_ase as gmx_to_ase<br/>(Function)
    participant MDAnalysis as MDAnalysis<br/>(Universe/Trajectory)
    participant EDRParser as EDR Parser
    participant ASE as ASE<br/>(Atoms + SinglePointCalculator)
    participant HDF5 as HDF5<br/>(frames.h5)

    User->>Gmx2Frames: run()
    Gmx2Frames->>gmx_to_ase: gmx_to_ase(topology, trajectory, edr, start, stop, step)
    gmx_to_ase->>MDAnalysis: Load topology & trajectory
    MDAnalysis-->>gmx_to_ase: Universe & timesteps

    loop per selected timestep
        gmx_to_ase->>MDAnalysis: read frame (positions, box, velocities, forces)
        alt edr provided
            gmx_to_ase->>EDRParser: load/suppress-warnings & find nearest sample by time
            EDRParser-->>gmx_to_ase: energy, pressure terms, all EDR terms
            gmx_to_ase->>gmx_to_ase: compute 6-component stress vector (from Pres-*)
        end
        gmx_to_ase->>ASE: create Atoms, set positions/cell/pbc/velocities
        gmx_to_ase->>ASE: attach SinglePointCalculator (energy, forces, stress, results)
        ASE-->>gmx_to_ase: Atoms frame
    end

    gmx_to_ase-->>Gmx2Frames: list[Atoms]
    Gmx2Frames->>HDF5: persist frames via znh5md.IO.extend()
    HDF5-->>Gmx2Frames: confirmation

    User->>Gmx2Frames: .frames
    Gmx2Frames->>HDF5: read frames
    HDF5-->>User: list[Atoms]
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I hopped through GROMACS trails one night,

found atoms, boxes, velocities in flight,
matched EDR beats by their time,
stitched energies, stresses—what a rhyme!
Frames tucked in HDF5, snug and bright.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'add gromacs to frames node' directly and clearly describes the main change: adding a new GROMACS-to-frames conversion node (Gmx2Frames class) to the codebase.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch addGmx2FramesNode

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (1)
ipsuite/data_loading/add_data_gromacs.py (1)

134-134: Consider warning when EDR and trajectory times diverge significantly.

The nearest-time matching via argmin silently succeeds even when the time difference is large (e.g., different output frequencies). Consider logging a warning if the matched time differs by more than a small tolerance.

Example tolerance check
idx = np.argmin(np.abs(edr_times - ts.time))
time_diff = abs(edr_times[idx] - ts.time)
if time_diff > 0.1:  # ps, adjust as appropriate
    warnings.warn(f"EDR time {edr_times[idx]:.3f} differs from trajectory time {ts.time:.3f} by {time_diff:.3f} ps")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ipsuite/data_loading/add_data_gromacs.py` at line 134, The nearest-time
matching using idx = np.argmin(np.abs(edr_times - ts.time)) can silently match
large time gaps; in the function or loop where that line appears (e.g., the
trajectory frame handling block in add_data_gromacs.py), compute time_diff =
abs(edr_times[idx] - ts.time) after selecting idx and if time_diff exceeds a
small tolerance (e.g., 0.1 ps, configurable) emit a warning (warnings.warn or
your logger.warn) indicating the EDR time, trajectory time, and the difference
so users know the outputs are misaligned; make the tolerance configurable or
documented.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ipsuite/data_loading/add_data_gromacs.py`:
- Line 211: The run method in the class defined in add_data_gromacs.py is
missing a return type annotation; update the def run(self): signature to include
the appropriate return type (e.g., -> None or the actual return type used) so it
conforms with the project's full type annotation guideline, and update any
related docstring or callers if necessary; locate the method by the symbol run
in that file and add the correct type hint.
- Around line 202-207: The annotated types for node attributes are too narrow:
change trajectory and edr from Path to Optional[Path] and start, stop, step from
int to Optional[int] to reflect that zntrack.deps_path(None) /
zntrack.params(None) can return None; update the import to include
typing.Optional (e.g., from typing import Optional) and modify the annotations
for trajectory, edr, start, stop, and step accordingly while leaving topology as
Path.
- Around line 239-240: The example prints may raise an IndexError and shows the
wrong unit: guard access to frames[1] by checking len(frames) >= 2 before
referencing frames[1].calc.results (or else reference frame 0 consistently), and
fix the unit label by using the actual energy units returned by
frames[0].get_potential_energy() (or an energy_unit variable if present) instead
of hardcoding "kJ/mol" (use "eV" if energies are converted to eV); update the
two print lines to conditionally print the second-frame EDR terms only when a
second frame exists and to display the correct units from the energy
value/source.
- Line 135: The energy read from the EDR file (assigned to variable energy in
the line energy = float(edr_data["Potential"][idx])) is in kJ/mol and must be
converted to eV before passing to ASE's SinglePointCalculator; change the code
to convert from kJ/mol to eV (using ase.units, e.g. multiply/divide by units.kJ,
units.mol and units.eV) and supply the converted value to the
SinglePointCalculator so downstream ASE methods return correct energies.

---

Nitpick comments:
In `@ipsuite/data_loading/add_data_gromacs.py`:
- Line 134: The nearest-time matching using idx = np.argmin(np.abs(edr_times -
ts.time)) can silently match large time gaps; in the function or loop where that
line appears (e.g., the trajectory frame handling block in add_data_gromacs.py),
compute time_diff = abs(edr_times[idx] - ts.time) after selecting idx and if
time_diff exceeds a small tolerance (e.g., 0.1 ps, configurable) emit a warning
(warnings.warn or your logger.warn) indicating the EDR time, trajectory time,
and the difference so users know the outputs are misaligned; make the tolerance
configurable or documented.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9d200379-0275-4ef7-a180-661cd8621105

📥 Commits

Reviewing files that changed from the base of the PR and between 236f4ac and 4584b12.

📒 Files selected for processing (1)
  • ipsuite/data_loading/add_data_gromacs.py

Comment on lines +145 to +146
# GROMACS pressure in bar -> store as-is (not ASE native eV/ų)
stress = np.array([pxx, pyy, pzz, pyz, pxz, pxy])
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Stress units mismatch: GROMACS pressure is in bar, ASE expects eV/ų.

The comment acknowledges the unit discrepancy, but storing stress in bar breaks ASE's stress conventions. Any code calling atoms.get_stress() will receive values in the wrong units, leading to incorrect virial/pressure calculations downstream.

Convert bar to eV/ų: 1 bar = 1e-4 GPa = 6.2415e-7 eV/ų (use ase.units for precision).

Proposed fix
+from ase import units
+
+# bar -> eV/ų
+BAR_TO_EV_ANG3 = 1.0 / (units.bar / (units.eV / units.Ang**3))
+
 ...
-                # GROMACS pressure in bar -> store as-is (not ASE native eV/ų)
-                stress = np.array([pxx, pyy, pzz, pyz, pxz, pxy])
+                # Convert GROMACS pressure (bar) to ASE stress (eV/ų)
+                stress = np.array([pxx, pyy, pzz, pyz, pxz, pxy]) * BAR_TO_EV_ANG3

Comment on lines +239 to +240
print(f"Potential energy (frame 0): {frames[0].get_potential_energy()} kJ/mol")
print(f"All EDR terms on frame 0: {list(frames[1].calc.results.keys())}")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Potential IndexError and misleading unit label in example.

  1. Line 240 accesses frames[1] without checking if there are at least 2 frames.
  2. Line 239 prints "kJ/mol" but if energy units are converted to eV (as recommended), this label will be incorrect.
Proposed fix
-    print(f"Potential energy (frame 0): {frames[0].get_potential_energy()} kJ/mol")
-    print(f"All EDR terms on frame 0: {list(frames[1].calc.results.keys())}")
+    print(f"Potential energy (frame 0): {frames[0].get_potential_energy()} eV")
+    print(f"All EDR terms on frame 0: {list(frames[0].calc.results.keys())}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
print(f"Potential energy (frame 0): {frames[0].get_potential_energy()} kJ/mol")
print(f"All EDR terms on frame 0: {list(frames[1].calc.results.keys())}")
print(f"Potential energy (frame 0): {frames[0].get_potential_energy()} eV")
print(f"All EDR terms on frame 0: {list(frames[0].calc.results.keys())}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ipsuite/data_loading/add_data_gromacs.py` around lines 239 - 240, The example
prints may raise an IndexError and shows the wrong unit: guard access to
frames[1] by checking len(frames) >= 2 before referencing frames[1].calc.results
(or else reference frame 0 consistently), and fix the unit label by using the
actual energy units returned by frames[0].get_potential_energy() (or an
energy_unit variable if present) instead of hardcoding "kJ/mol" (use "eV" if
energies are converted to eV); update the two print lines to conditionally print
the second-frame EDR terms only when a second frame exists and to display the
correct units from the energy value/source.

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 3, 2026

Note

Docstrings generation - SUCCESS
Generated docstrings for this pull request at #464

coderabbitai bot added a commit that referenced this pull request Apr 3, 2026
Docstrings generation was requested by @MrJulEnergy.

* #463 (comment)

The following files were modified:

* `ipsuite/data_loading/add_data_gromacs.py`
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
ipsuite/data_loading/add_data_gromacs.py (1)

156-165: ⚠️ Potential issue | 🟠 Major

Convert the pressure tensor before assigning stress.

ASE works in eV-based units, and its pressure/stress quantities are represented in eV/ų; passing the raw Pres-* values here keeps atoms.get_stress() in bar instead. If you want to preserve the native tensor too, keep the bar numbers only under the Pres-* keys and convert the array you pass as stress. (ase-lib.org)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ipsuite/data_loading/add_data_gromacs.py` around lines 156 - 165, The code
currently assigns raw GROMACS Pres-* values (in bar) directly to stress, which
leaves ASE seeing stress in bar; instead, convert those pressure components to
ASE internal units (eV/ų) before assigning to the stress array. Locate the
block reading edr_data["Pres-XX"/"Pres-YY"/..."] and replace the direct
assignment to stress with a conversion step (keep the raw bar values in edr_data
if you need them) — e.g., compute a conversion factor or use ASE unit helpers
(ase.units / ase.units.convert) to convert from bar to eV/ų and then set stress
= np.array([pxx_converted, pyy_converted, pzz_converted, pyz_converted,
pxz_converted, pxy_converted]) so atoms.get_stress() returns values in ASE
units.
🧹 Nitpick comments (1)
ipsuite/data_loading/add_data_gromacs.py (1)

210-219: Make the node example runnable end-to-end.

Right now the example stops after construction, so it isn't testable in the same way as the other data-loading nodes. Add project.repro() and one observable result/assertion.

📝 Suggested docstring tweak
     >>> with project:
     ...     md = ips.Gmx2Frames(
     ...         topology="gromacs/system.gro",
     ...         trajectory="gromacs/production.xtc",
     ...         edr="gromacs/production.edr",
     ...         start=1,
     ...     )
+    >>> project.repro()
+    >>> len(md.frames) > 0
+    True

As per coding guidelines: "Each Node’s docstring should include a testable example using project and ips".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ipsuite/data_loading/add_data_gromacs.py` around lines 210 - 219, The
docstring example for Gmx2Frames stops after construction and isn't runnable;
update it to call project.repro() and assert one observable outcome so it
executes end-to-end. Specifically, after constructing md = ips.Gmx2Frames(...),
invoke project.repro() and then check a result such as verifying md.frames (or
another public attribute/method on Gmx2Frames) has expected length/type (e.g.,
assert len(md.frames) > 0 or isinstance(md.frames, list/ndarray)) so the example
both runs and asserts behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ipsuite/data_loading/add_data_gromacs.py`:
- Around line 60-73: _match_edr_frame currently scans the whole edr_times array
per frame which is O(n×m) and silently returns an out-of-tolerance index;
replace the np.argmin approach with np.searchsorted on the monotonic edr_times
(or use a moving pointer maintained by the caller) to find the nearest candidate
on the left/right, compute the nearest index and time_diff, and if time_diff >
tolerance reject the match (raise a ValueError or return a sentinel like
None/-1) instead of returning a bad index; update callers to handle the
rejection and keep the function signature (_match_edr_frame(edr_times,
traj_time, tolerance)) and logging behavior.
- Around line 147-179: ts.forces are left in MDAnalysis units (kJ/(mol·Å)) but
ASE expects eV/Å; multiply the copied forces by the same conversion factor used
for energy (the (kJ / mol) factor) before passing into SinglePointCalculator.
Update the code that sets forces (the ts.forces.copy() branch) to convert units
(e.g., forces = ts.forces.copy() * (kJ / mol) when ts.has_forces) so the forces
variable passed to SinglePointCalculator(atoms, energy=..., forces=forces,
stress=...) is in eV/Å.
- Around line 44-57: The code currently derives element symbols from
u.atoms.types which misclassifies CHARMM/AMBER types; change the logic to use
atom names and MDAnalysis's element guessing instead: use u.atoms.names (or call
u.guess_TopologyAttrs(to_guess=['elements']) and read u.atoms.elements) and
build the symbols list from those guessed elements (or rely on
MDAnalysis.DefaultGuesser behavior) rather than transforming types; update the
block that references types, t, and symbols to use names/elements (keeping the
same return of symbols) and remove the manual type-based heuristics.

---

Duplicate comments:
In `@ipsuite/data_loading/add_data_gromacs.py`:
- Around line 156-165: The code currently assigns raw GROMACS Pres-* values (in
bar) directly to stress, which leaves ASE seeing stress in bar; instead, convert
those pressure components to ASE internal units (eV/ų) before assigning to the
stress array. Locate the block reading edr_data["Pres-XX"/"Pres-YY"/..."] and
replace the direct assignment to stress with a conversion step (keep the raw bar
values in edr_data if you need them) — e.g., compute a conversion factor or use
ASE unit helpers (ase.units / ase.units.convert) to convert from bar to eV/ų
and then set stress = np.array([pxx_converted, pyy_converted, pzz_converted,
pyz_converted, pxz_converted, pxy_converted]) so atoms.get_stress() returns
values in ASE units.

---

Nitpick comments:
In `@ipsuite/data_loading/add_data_gromacs.py`:
- Around line 210-219: The docstring example for Gmx2Frames stops after
construction and isn't runnable; update it to call project.repro() and assert
one observable outcome so it executes end-to-end. Specifically, after
constructing md = ips.Gmx2Frames(...), invoke project.repro() and then check a
result such as verifying md.frames (or another public attribute/method on
Gmx2Frames) has expected length/type (e.g., assert len(md.frames) > 0 or
isinstance(md.frames, list/ndarray)) so the example both runs and asserts
behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3211bbb4-e6f9-44fd-bc99-ce5c595a7702

📥 Commits

Reviewing files that changed from the base of the PR and between 4584b12 and c4a2475.

📒 Files selected for processing (1)
  • ipsuite/data_loading/add_data_gromacs.py

Comment on lines +44 to +57
# 2. Use atom types (usually cleaner than names for CHARMM-GUI)
types = u.atoms.types
symbols = []
for t in types:
t_upper = t.upper()
if t_upper in _TYPE_TO_ELEMENT:
symbols.append(_TYPE_TO_ELEMENT[t_upper])
elif len(t) <= 2 and t[0].isalpha():
# Capitalize properly: first letter upper, rest lower
symbols.append(t[0].upper() + t[1:].lower() if len(t) > 1 else t.upper())
else:
# Last resort: take leading alphabetic characters from atom name
symbols.append(t[0].upper())
return symbols
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n ipsuite/data_loading/add_data_gromacs.py | head -100

Repository: zincware/IPSuite

Length of output: 3699


🏁 Script executed:

# Check the full _get_symbols function and surrounding context
rg -A 30 "def _get_symbols" ipsuite/data_loading/add_data_gromacs.py

Repository: zincware/IPSuite

Length of output: 1268


🏁 Script executed:

# Look for how _get_symbols is used in the file
rg "_get_symbols" ipsuite/data_loading/add_data_gromacs.py -B 2 -A 2

Repository: zincware/IPSuite

Length of output: 335


🏁 Script executed:

# Check if there's any existing logic that uses u.atoms.names
rg "u\.atoms\.(names|types)" ipsuite/data_loading/add_data_gromacs.py -B 2 -A 2

Repository: zincware/IPSuite

Length of output: 191


🏁 Script executed:

# Let's understand MDAnalysis element guessing behavior better
python << 'EOF'
# Check what MDAnalysis does with element guessing
import sys
# Try to get info about the default guesser
try:
    from MDAnalysis.guesser import DEFAULT_GUESSER
    print("DEFAULT_GUESSER available")
    print(f"Type: {type(DEFAULT_GUESSER)}")
except ImportError:
    print("DEFAULT_GUESSER not directly available")

# Check the implementation
try:
    from MDAnalysis.guesser.default import DefaultGuesser
    print("DefaultGuesser available")
    # Check what the guesser does
    import inspect
    print(inspect.getsourcefile(DefaultGuesser))
except ImportError as e:
    print(f"Could not import DefaultGuesser: {e}")

EOF

Repository: zincware/IPSuite

Length of output: 160


🏁 Script executed:

# Let's search for documentation or usage of element guessing in the codebase
rg "guess.*element|element.*guess" ipsuite/ -i

Repository: zincware/IPSuite

Length of output: 42


🏁 Script executed:

# Check if there are any tests or examples showing how _get_symbols should work
rg -l "_get_symbols|gmx_to_ase" ipsuite/ --type py

Repository: zincware/IPSuite

Length of output: 100


🏁 Script executed:

# Look at the broader context - is there any test file for this module?
fd "test.*gromacs|gromacs.*test" ipsuite/ -type f

Repository: zincware/IPSuite

Length of output: 230


🏁 Script executed:

# Let's check what MDAnalysis version is being used and look at its documentation
rg "MDAnalysis" ipsuite/ setup.py pyproject.toml requirements.txt -A 1 -B 1

Repository: zincware/IPSuite

Length of output: 2179


🏁 Script executed:

# Check if there are any comments or documentation about element inference in the file
rg "element|symbol" ipsuite/data_loading/add_data_gromacs.py -i

Repository: zincware/IPSuite

Length of output: 647


🏁 Script executed:

# Let's check if there are other usages of this in the codebase to understand the impact
rg "gmx_to_ase|_get_symbols" ipsuite/ -l

Repository: zincware/IPSuite

Length of output: 100


🏁 Script executed:

# Check the pyproject.toml or similar for MDAnalysis version info
find ipsuite -name "pyproject.toml" -o -name "setup.cfg" | head -5 | xargs cat 2>/dev/null | head -100

Repository: zincware/IPSuite

Length of output: 42


🏁 Script executed:

# Look at what ASE Atoms expects for symbols
python << 'EOF'
# Check what happens when you create ASE atoms with invalid symbols
# This will help us understand the severity

test_cases = [
    "Ow",  # Invalid - OW type
    "Ca",  # Calcium
    "C",   # Carbon
    "O",   # Oxygen
]

# Check if these are valid element symbols in ASE
try:
    from ase.data import atomic_numbers
    for sym in test_cases:
        try:
            an = atomic_numbers[sym]
            print(f"✓ {sym:3s} -> atomic number {an}")
        except KeyError:
            print(f"✗ {sym:3s} -> INVALID/UNKNOWN")
except ImportError:
    print("ASE not available in sandbox, but we can reason about this")
    print("Standard element symbols are 1-2 chars, capitalized properly")
    for sym in test_cases:
        # Standard chemistry rules
        if sym == "Ow":
            print(f"✗ {sym} -> INVALID (no such element)")
        elif sym == "Ca":
            print(f"✓ {sym} -> Calcium (valid but semantically wrong for Carbon Alpha)")
        elif sym in ["C", "O"]:
            print(f"✓ {sym} -> Valid")

EOF

Repository: zincware/IPSuite

Length of output: 303


🏁 Script executed:

# Let's also verify what the comment says about MDAnalysis names vs types
# Look at the gmx.py file that might have similar logic
cat -n ipsuite/configuration_generation/gmx.py | grep -A 20 "def timestep_to_atoms"

Repository: zincware/IPSuite

Length of output: 842


🏁 Script executed:

# Let's check what happens after symbols are used - does ASE validate them?
rg -A 5 "Atoms(symbols=symbols" ipsuite/data_loading/add_data_gromacs.py

Repository: zincware/IPSuite

Length of output: 137


🏁 Script executed:

# Let's understand the problem better by looking at actual GROMACS atom types
# Check if there are any test files or examples
find . -name "*.gro" -o -name "*.tpr" -o -name "*.xtc" 2>/dev/null | head -5

Repository: zincware/IPSuite

Length of output: 42


🏁 Script executed:

# Search for any documentation on the expected behavior
rg "water|OW|CA" ipsuite/data_loading/ -i

Repository: zincware/IPSuite

Length of output: 1415


🏁 Script executed:

# Let's check what the MDAnalysis documentation actually says about element guessing
web_search
MDAnalysis element guessing default guesser names types

Repository: zincware/IPSuite

Length of output: 159


🌐 Web query:

MDAnalysis element guessing default guesser documentation

💡 Result:

The default guesser for element guessing in MDAnalysis is DefaultGuesser from MDAnalysis.guesser.default_guesser. It is the generic, default guessing class used for most topologies, particularly following biological naming conventions (e.g., "CA" guessed as carbon, not calcium). Element guessing behavior: The guess_atom_element(atomname) method strips numbers and symbols from the atom name, looks it up in MDAnalysis.guesser.tables.atomelements (derived from CHARMM/AMBER tables), removes the last character if not found, or falls back to the first character of the name if still not found. Empty names return empty string. When guessing types, they represent the element using the same method. It is used automatically as the default context (context='default') when creating a Universe with to_guess=['types', 'elements'] or via u.guess_TopologyAttrs(to_guess=['elements']). Pre-2.8.0, standalone functions like guess_atom_element in MDAnalysis.topology.guessers were used (now deprecated). Example: import MDAnalysis as mda u = mda.Universe('topology.gro', context='default', to_guess=['elements']) # or u.guess_TopologyAttrs(to_guess=['elements']) Official documentation: https://docs.mdanalysis.org/stable/documentation_pages/guesser_modules/default_guesser.html and https://userguide.mdanalysis.org/stable/formats/guessers/default.html (MDAnalysis 2.10.0 as of latest docs).

Citations:


🏁 Script executed:

# Let's look at the actual usage and see if invalid symbols cause problems
cat -n ipsuite/data_loading/add_data_gromacs.py | sed -n '120,160p'

Repository: zincware/IPSuite

Length of output: 1823


🏁 Script executed:

# Check if there's validation happening elsewhere
rg "atomic_numbers\|invalid.*symbol\|element.*valid" ipsuite/data_loading/add_data_gromacs.py

Repository: zincware/IPSuite

Length of output: 42


🏁 Script executed:

# Let's examine what happens with "Ow" type - is it even a realistic scenario?
# Check if there are any CHARMM-GUI or common GROMACS force field references
rg "CHARMM|amber|oplsaa|gromos" ipsuite/ -i | head -10

Repository: zincware/IPSuite

Length of output: 168


🏁 Script executed:

# Let's look at the exact lines where symbols are used to create Atoms
cat -n ipsuite/data_loading/add_data_gromacs.py | sed -n '136,155p'

Repository: zincware/IPSuite

Length of output: 928


Use MDAnalysis's proper element guessing from atom names, not types.

The current approach using u.atoms.types creates semantic bugs:

  • OW (water oxygen in CHARMM-GUI) becomes invalid Ow
  • CA (carbon alpha) is hardcoded to Ca (calcium) instead of being guessed as carbon

MDAnalysis's DefaultGuesser operates on atom names using domain-standard rules (stripping digits/symbols, lookups in CHARMM/AMBER tables). Replace the manual type-based logic with u.atoms.names and leverage MDAnalysis's built-in element inference, or call u.guess_TopologyAttrs(to_guess=['elements']) to populate u.atoms.elements properly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ipsuite/data_loading/add_data_gromacs.py` around lines 44 - 57, The code
currently derives element symbols from u.atoms.types which misclassifies
CHARMM/AMBER types; change the logic to use atom names and MDAnalysis's element
guessing instead: use u.atoms.names (or call
u.guess_TopologyAttrs(to_guess=['elements']) and read u.atoms.elements) and
build the symbols list from those guessed elements (or rely on
MDAnalysis.DefaultGuesser behavior) rather than transforming types; update the
block that references types, t, and symbols to use names/elements (keeping the
same return of symbols) and remove the manual type-based heuristics.

Comment on lines +60 to +73
def _match_edr_frame(
edr_times: np.ndarray, traj_time: float, tolerance: float = 0.1
) -> int:
"""Find the EDR index closest to a trajectory time, warning on large gaps."""
idx = int(np.argmin(np.abs(edr_times - traj_time)))
time_diff = abs(edr_times[idx] - traj_time)
if time_diff > tolerance:
logger.warning(
"EDR time %.3f ps does not match trajectory time %.3f ps (diff=%.3f ps)",
edr_times[idx],
traj_time,
time_diff,
)
return idx
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Avoid quadratic EDR matching and reject missed timestamps.

np.argmin(np.abs(edr_times - traj_time)) rescans the full EDR axis for every frame, so matching becomes O(n×m) on long runs. It also still returns a row after tolerance is exceeded, which means shifted or downsampled .edr files silently annotate frames with the wrong labels. If the time axes are monotonic, np.searchsorted or a moving pointer avoids the repeated scan and makes it easy to skip/raise on out-of-tolerance matches.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ipsuite/data_loading/add_data_gromacs.py` around lines 60 - 73,
_match_edr_frame currently scans the whole edr_times array per frame which is
O(n×m) and silently returns an out-of-tolerance index; replace the np.argmin
approach with np.searchsorted on the monotonic edr_times (or use a moving
pointer maintained by the caller) to find the nearest candidate on the
left/right, compute the nearest index and time_diff, and if time_diff >
tolerance reject the match (raise a ValueError or return a sentinel like
None/-1) instead of returning a bad index; update callers to handle the
rejection and keep the function signature (_match_edr_frame(edr_times,
traj_time, tolerance)) and logging behavior.

Comment on lines +147 to +179
forces = ts.forces.copy() if ts.has_forces else None
energy = None
stress = None
extra_results = {}

if edr_data is not None:
idx = _match_edr_frame(edr_times, ts.time)
energy = float(edr_data["Potential"][idx]) * (kJ / mol) # convert to eV

# Build Voigt stress from pressure tensor if available
try:
pxx = edr_data["Pres-XX"][idx]
pyy = edr_data["Pres-YY"][idx]
pzz = edr_data["Pres-ZZ"][idx]
pyz = edr_data["Pres-YZ"][idx]
pxz = edr_data["Pres-XZ"][idx]
pxy = edr_data["Pres-XY"][idx]
# GROMACS pressure in bar -> store as-is (not ASE native eV/ų)
stress = np.array([pxx, pyy, pzz, pyz, pxz, pxy])
except KeyError:
pass

# Store all EDR terms for this frame
for term in edr_terms:
extra_results[term] = float(edr_data[term][idx])

if energy is not None or forces is not None:
calc = SinglePointCalculator(
atoms,
energy=energy,
forces=forces,
stress=stress,
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the full file context
cd /tmp && find . -name "add_data_gromacs.py" -type f 2>/dev/null | head -5

Repository: zincware/IPSuite

Length of output: 42


🏁 Script executed:

# Find the file in the repository
find . -name "add_data_gromacs.py" -type f 2>/dev/null

Repository: zincware/IPSuite

Length of output: 102


🏁 Script executed:

# Read the relevant file section to understand context
cat -n ipsuite/data_loading/add_data_gromacs.py | head -200

Repository: zincware/IPSuite

Length of output: 7743


🏁 Script executed:

# Check imports to see how kJ/mol is defined
head -50 ipsuite/data_loading/add_data_gromacs.py

Repository: zincware/IPSuite

Length of output: 1203


🌐 Web query:

MDAnalysis force units trajectory trajectories

💡 Result:

In MDAnalysis, the base unit for forces in trajectories is kJ/(mol·Å). This applies when forces are present in the trajectory file format (e.g., via Universe.atoms.forces). MDAnalysis automatically converts forces to this unit upon reading trajectories, regardless of the original MD simulation format. Other units like kcal/(mol·Å) or N can be converted using MDAnalysis.units.convert.

Citations:


🌐 Web query:

ASE SinglePointCalculator force units eV Angstrom

💡 Result:

In the Atomic Simulation Environment (ASE), the SinglePointCalculator stores pre-computed results (energy, forces, stress, etc.) for a specific atomic configuration from an external calculation. It does not perform new calculations but returns the stored values via methods like get_forces. ASE uses consistent internal units: energies in eV, lengths in Angstroms (Å). Consequently, forces are in eV/Å. The SinglePointCalculator preserves whatever units the input data had when created (typically from I/O readers that convert external code outputs to ASE units). When reading results from formats like CASTEP .geom or ABINIT output, forces are explicitly converted to eV/Angstrom. For example: - CASTEP .geom reader: "Note that everything in the .geom file is in atomic units, which has been conversed to commonly used unit angstrom(length) and eV (energy)." - ABINIT reader: forces from 'cartesian_forces' converted using Hartree / Bohr = eV/Å. All ASE calculators, including those populating SinglePointCalculator, follow this convention (eV for energy, eV/Å for forces), as confirmed by documentation, tutorials (e.g., EMT calculator forces in eV/Å), and community sources. Thus, forces from ASE SinglePointCalculator are in eV/Å.

Citations:


Convert forces to eV/Å before creating the calculator.

Line 147 forwards ts.forces unchanged, but MDAnalysis uses kJ/(mol·Å) as its base force unit while ASE expects forces in eV/Å. That leaves every stored force off by the same kJ/mol → eV factor you're already applying to Potential on line 154.

Suggested fix
-        forces = ts.forces.copy() if ts.has_forces else None
+        forces = ts.forces.copy() * (kJ / mol) if ts.has_forces else None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ipsuite/data_loading/add_data_gromacs.py` around lines 147 - 179, ts.forces
are left in MDAnalysis units (kJ/(mol·Å)) but ASE expects eV/Å; multiply the
copied forces by the same conversion factor used for energy (the (kJ / mol)
factor) before passing into SinglePointCalculator. Update the code that sets
forces (the ts.forces.copy() branch) to convert units (e.g., forces =
ts.forces.copy() * (kJ / mol) when ts.has_forces) so the forces variable passed
to SinglePointCalculator(atoms, energy=..., forces=forces, stress=...) is in
eV/Å.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant