Conversation
for more information, see https://pre-commit.ci
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In `@src/lammpsparser/potential.py`:
- Around line 343-357: Normalize species to a set before checking membership and
avoid double metadata parsing: add a helper (e.g., normalize_species) used by
contains_target_elements, deduplicate, and filter_by_elements to convert species
(string or list) into a consistent set of element strings; in the loop over
df.iterrows() call species_set = normalize_species(row["Species"]) and pass
species_set to contains_target_elements instead of the current conditional. Also
call metadata = parse_potential_metadata(name) once, use metadata to compute
family_id (or inline the family id logic that currently lives in get_family_id
to accept metadata) and then update families[family_id]["count"],
["repos"].add(metadata["repo_type"]) and ["names"].append(name) so
parse_potential_metadata is not invoked twice.
- Around line 119-131: The current camelCase-splitting in normalize_author is
too aggressive and chops names like "McDonald" or "DErrico"; update
normalize_author to only split when the main token matches a clear
concatenated-capitalized-names pattern (e.g., two capitalized name-like
segments) and then keep the first full segment; specifically, replace the
heuristic that splits on any uppercase boundary (inside normalize_author, using
variables author_str and main) with a targeted regex check such as matching
^([A-Z][a-z]{2,})([A-Z][a-z]+) and, only if that matches, extract group 1 as the
primary name before lowercasing and stripping non-letters—otherwise leave main
intact.
- Around line 216-221: The code silently drops string-valued species by treating
anything not list/set as empty; update the handling in the block that sets
elements (variable name elements in src/lammpsparser/potential.py) to detect
strings and other single-value types and convert them into a one-element set
(e.g., if isinstance(species, str): elements = {species}) instead of set(),
while preserving the existing branches for list and set; ensure this logic is
used wherever the target-elements filter relies on elements so single-species
strings like "Ni" are preserved.
- Around line 80-105: The __init__ currently hardcodes target_elements="Ni" and
raises ValueError for wrong types; change the signature to default
target_elements=None (or require explicit argument) and update initialization to
handle None (e.g., set self.target_elements=None and validate/use later) or
document/validate that it must be provided before use; also replace the final
raise ValueError(...) with raise TypeError(...) for invalid input types,
referencing the __init__ parameter target_elements and the attribute
self.target_elements so you update both the default and the exception type
consistently.
- Around line 663-668: The deduplicator currently forces verbose output in
view_potentials by calling
PotentialDeduplicator(target_elements=list_of_elements, verbose=True) and also
redundantly passes target_elements even though
LammpsPotentialFile.find(list_of_elements) already filters; change this to
silence library output: set verbose to False (or omit the flag) when
constructing PotentialDeduplicator in view_potentials (and in any callers such
as get_potential_dataframe), and remove/ignore the redundant target_elements
argument so PotentialDeduplicator does not re-filter already-filtered raw_df (or
accept None to skip element filtering). Also update PotentialDeduplicator to use
the logging module (logger = logging.getLogger(__name__)) and replace print(...)
calls with logger.debug/info as appropriate so diagnostics are controlled by the
consumer's logging configuration.
- Around line 36-41: The file has duplicate/unused typing imports and
inconsistent pandas aliasing: remove the second typing import that re-exports
Optional and Union (avoid F811), drop unused Tuple from the typing list, and
consolidate pandas import to a single alias (either import pandas as pd and
update existing references or keep import pandas and change the new pd usages to
pandas); update all references accordingly so only one pandas symbol and one
typing block (including Optional and Union) remain, and ensure symbols mentioned
(Optional, Union, Tuple, pandas/pd) are corrected.
🧹 Nitpick comments (6)
src/lammpsparser/potential.py (6)
44-67: `family_id` logic is duplicated in multiple places. The `family_id` property (lines 63–66) computes `author_year[suffix]`, but the same logic is repeated in `deduplicate` (lines 266–267) and `get_family_id` (lines 326–331). Use `pot.family_id` in `deduplicate` for the grouping key, and delegate from `get_family_id` to `Potential.family_id` to stay DRY.

Also, `df_index: Optional[int]` is narrower than what `DataFrame.iterrows()` actually yields — the index can be any hashable type (string, tuple, etc.). Consider `Optional[Any]` or a broader annotation to avoid confusion.
160-170: Clarify the `year_idx` guard — accidental correctness when index is 0. Line 167 uses `if year_idx and year_idx > 0`, which relies on `0` being falsy. This happens to be correct (no author exists before index 0), but it conflates "not found" (`None`) with "found at position 0". A reader or future maintainer could easily misread the intent. Suggested clarity fix:

```diff
- if year_idx and year_idx > 0:
+ if year_idx is not None and year_idx > 0:
```
36-41: Duplicate `import pandas` under two aliases. `pandas` is imported on line 5 and again as `pd` on line 41. The existing codebase (`PotentialAbstract`, `view_potentials`, etc.) uses `pandas`, while all new code uses `pd`. Pick one alias and use it consistently to avoid confusion. Also applies to: 1-6
111-117: Nit: prefer `next(iter(…))` over list conversion for single-element set. Line 115 creates a temporary list to extract one element. As flagged by Ruff (RUF015):

```diff
- return list(self.target_elements)[0]
+ return next(iter(self.target_elements))
```
138-150: LAMMPS regex discards elements — no cross-check with the `Species` column. The third capture group (the element list from the name, e.g., `Ni-Cu`) is assigned to `_` and discarded. The `Potential` object's `elements` field is populated from the DataFrame's `Species` column instead. This means a mismatch between the name-embedded elements and the Species column would go undetected, potentially causing silent mis-grouping or incorrect filtering. Consider at least a debug-level assertion or log when they diverge.
294-295: Unparsed potentials are silently kept without deduplication.Potentials whose names don't match either LAMMPS or OpenKIM format are re-added as-is (line 295). This is reasonable for safety, but if the set of unparsable names is large, duplicates among them — or between them and parsed potentials — will slip through. A verbose log line or stat entry for this case would improve transparency.
```python
import re
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple, Union

import pandas as pd
```
Duplicate and unused imports.
`Optional` and `Union` are already imported on line 3; re-importing them on line 39 triggers F811. `Tuple` is imported but never used. `pandas` is imported on line 5 as `pandas`, but line 41 imports it again as `pd`. The new code uses `pd` while existing code uses `pandas`, creating inconsistency.
Consolidate into a single set of imports at the top of the file.
Proposed fix
At the top of the file, merge the imports:
```diff
-from typing import Optional, Union
+from typing import Dict, List, Optional, Set, Union
```

Remove the duplicate block:

```diff
 import re
 from collections import defaultdict
 from dataclasses import dataclass
-from typing import Dict, List, Optional, Set, Tuple, Union
-
-import pandas as pd
```

Then either use `pandas` consistently or alias once at the top. The new code can use `pandas` (matching the existing convention), or change the top-level import to `import pandas as pd` and update existing references.
🧰 Tools
🪛 Ruff (0.15.0)
[error] 39-39: Redefinition of unused Optional from line 3: Optional redefined here
Remove definition: Optional
(F811)
[error] 39-39: Redefinition of unused Union from line 3: Union redefined here
Remove definition: Union
(F811)
```python
def __init__(
    self,
    target_elements: Union[str, List[str], Set[str]] = "Ni",
    verbose: bool = False,
):
    """
    Parameters
    ----------
    target_elements : str, list of str, or set of str
        Element(s) to filter for. If multiple elements provided,
        potentials must contain ALL of them.
        Examples: 'Ni', ['Ni', 'Al'], {'Ni', 'Al', 'Cu'}
    verbose : bool
        Print deduplication details
    """
    # Normalize to set
    if isinstance(target_elements, str):
        self.target_elements = {target_elements}
    elif isinstance(target_elements, (list, tuple)):
        self.target_elements = set(target_elements)
    elif isinstance(target_elements, set):
        self.target_elements = target_elements
    else:
        raise ValueError(
            f"target_elements must be str, list, or set, got {type(target_elements)}"
        )
```
Hardcoded default "Ni" and wrong exception type.

- `target_elements` defaults to `"Ni"`, which is surprising for a general-purpose utility. A sentinel like `None` (with validation that it's set before use) or requiring an explicit argument would be safer.
- Line 103: the `else` branch raises `ValueError` for an invalid type, but `TypeError` is the idiomatic choice (also flagged by Ruff TRY004).

Proposed fix for exception type:

```diff
-        raise ValueError(
+        raise TypeError(
             f"target_elements must be str, list, or set, got {type(target_elements)}"
         )
```

🧰 Tools
🪛 Ruff (0.15.0)
[warning] 103-105: Prefer TypeError exception for invalid type
(TRY004)
[warning] 103-105: Avoid specifying long messages outside the exception class
(TRY003)
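A sketch of how both points (the `None` sentinel default and the `TypeError`) could land together, assuming only the constructor changes; the surrounding class body is elided:

```python
from typing import List, Optional, Set, Union

class PotentialDeduplicator:
    def __init__(
        self,
        target_elements: Optional[Union[str, List[str], Set[str]]] = None,
        verbose: bool = False,
    ):
        # None means "no element filtering"; callers that need filtering
        # must pass elements explicitly instead of inheriting "Ni".
        if target_elements is None:
            self.target_elements = None
        elif isinstance(target_elements, str):
            self.target_elements = {target_elements}
        elif isinstance(target_elements, (list, tuple, set)):
            self.target_elements = set(target_elements)
        else:
            # TypeError is the idiomatic exception for a wrong argument type
            raise TypeError(
                f"target_elements must be str, list, or set, got {type(target_elements)}"
            )
        self.verbose = verbose
```

Any code that consumes `self.target_elements` would then treat `None` as "skip the element filter" rather than silently filtering for nickel.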
```python
@staticmethod
def normalize_author(author_str: str) -> str:
    """Extract and normalize primary author surname."""
    parts = re.split(r"[-_]", author_str)
    main = parts[0]

    # Handle camelCase in OpenKIM
    if len(re.findall(r"[A-Z]", main)) > 1:
        camel_parts = re.split(r"(?=[A-Z])", main)
        camel_parts = [p for p in camel_parts if p]
        main = camel_parts[0] if camel_parts else main

    return re.sub(r"[^a-z]", "", main.lower())
```
CamelCase splitting heuristic is fragile and can mis-normalize author names.

The logic splits on uppercase boundaries when more than one capital is detected, then keeps only the first fragment. This incorrectly truncates real surnames:

- `"McDonald"` → `["Mc", "Donald"]` → `"mc"` (should be `"mcdonald"`)
- `"DErrico"` → `["D", "Errico"]` → `"d"` (loses the name entirely)
- `"MacLeod"` → `["Mac", "Leod"]` → `"mac"`

This can cause false-positive deduplication (unrelated authors collapsing to the same key) or false-negative deduplication (the same author with different casing getting different keys). If the goal is to handle OpenKIM's concatenated multi-author strings like "SmithJones", a more targeted pattern (e.g., splitting only when the name contains a known concatenation marker, or checking specific OpenKIM conventions) would be safer.
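One safer variant, sketched below with an assumed regex that only fires when the token looks like two concatenated name-length segments. Requiring at least three letters in the first segment keeps "McDonald" and "DErrico" intact; note that "MacLeod" would still split, so the threshold is a trade-off, not a complete fix:

```python
import re

def normalize_author(author_str: str) -> str:
    """Extract and normalize the primary author surname (hypothetical sketch)."""
    main = re.split(r"[-_]", author_str)[0]
    # Split only on a clear "NameName" concatenation: a capitalized segment
    # with 2+ lowercase letters followed by another capitalized segment.
    m = re.match(r"^([A-Z][a-z]{2,})([A-Z][a-z]+)$", main)
    if m:
        main = m.group(1)  # keep the first full name segment
    return re.sub(r"[^a-z]", "", main.lower())
```

With this pattern, `"SmithJones"` still normalizes to `"smith"`, while `"McDonald"` survives as `"mcdonald"` because its first segment (`"Mc"`) has only one lowercase letter.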
```python
if isinstance(species, list):
    elements = set(species)
elif isinstance(species, set):
    elements = species
else:
    elements = set()
```
String species silently becomes an empty set — potential data loss.
If species is a string (e.g., "Ni"), the else branch produces set(), which will never pass the target-elements filter and the potential is silently dropped. While the upstream CSV parser currently returns lists, this fallback is misleading — it looks like defensive handling but actually discards valid data.
Proposed fix
```diff
 if isinstance(species, list):
     elements = set(species)
 elif isinstance(species, set):
     elements = species
+elif isinstance(species, str):
+    elements = {species}
 else:
     elements = set()
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
if isinstance(species, list):
    elements = set(species)
elif isinstance(species, set):
    elements = species
elif isinstance(species, str):
    elements = {species}
else:
    elements = set()
```
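Since the same branching recurs wherever the element filter reads `Species`, a shared helper is the cleaner fix. The sketch below (the name `normalize_species` and its signature are assumptions, not code from the PR) covers the string case alongside the existing list/set branches:

```python
from typing import Iterable, Optional, Set, Union

def normalize_species(species: Optional[Union[str, Iterable[str]]]) -> Set[str]:
    """Coerce a Species cell (string, list, set, or missing) into a set of element symbols."""
    if species is None:
        return set()
    if isinstance(species, str):
        return {species}  # single-species strings like "Ni" are preserved
    return set(species)   # list, tuple, set, ...
```

All three call sites (`deduplicate`, `contains_target_elements` callers, `filter_by_elements`) could then share one conversion path instead of three slightly different ones.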
```python
for _, row in df.iterrows():
    name = row["Name"]
    species = row["Species"]

    # Check target elements
    elements = set(species) if isinstance(species, list) else species
    if not self.contains_target_elements(elements):
        continue

    family_id = self.get_family_id(name)
    if family_id:
        metadata = self.parse_potential_metadata(name)
        families[family_id]["count"] += 1
        families[family_id]["repos"].add(metadata["repo_type"])
        families[family_id]["names"].append(name)
```
🛠️ Refactor suggestion | 🟠 Major
Inconsistent species-to-set conversion and redundant metadata parsing.

- Line 348 uses `set(species) if isinstance(species, list) else species`. If `species` is a string, it passes the raw string to `issubset`, which does character-level membership — `{"Ni"}.issubset("Ni")` is `False` because the element `"Ni"` is not a character in the iterable `"Ni"`. This is the same pattern as in `deduplicate` (lines 216–221) and `filter_by_elements` (line 407). Consider extracting a shared helper for species normalization.
- `parse_potential_metadata(name)` is called twice per row — once inside `get_family_id` (line 352) and again on line 354. Inline the family_id computation or cache the result.
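The character-level membership pitfall is easy to verify in isolation:

```python
# issubset accepts any iterable; a string iterates as characters,
# so the two-letter symbol "Ni" is never a member of the string "Ni".
assert not {"Ni"}.issubset("Ni")   # "Ni" is not among the characters "N", "i"
assert {"N", "i"}.issubset("Ni")   # what string iteration actually yields
assert {"Ni"}.issubset({"Ni"})     # correct once species is normalized to a set
```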
```python
raw_df = LammpsPotentialFile(resource_path=resource_path).find(list_of_elements)

dedup = PotentialDeduplicator(target_elements=list_of_elements, verbose=True)
clean_df = dedup.deduplicate(raw_df)

return clean_df
```
Hardcoded `verbose=True` prints to stdout on every call — inappropriate for a library function.

`view_potentials` is called by `get_potential_dataframe` (line 728), which is part of the public API. Hardcoding `verbose=True` means every consumer silently gets deduplication diagnostics dumped to stdout. Library code should default to silent operation and use the `logging` module for diagnostics.

Additionally, `raw_df` is already element-filtered by `find(list_of_elements)` on line 663, so the deduplicator's own element filter is redundant — `filtered_out_indices` will always be empty.

Proposed fix:

```diff
-    dedup = PotentialDeduplicator(target_elements=list_of_elements, verbose=True)
+    dedup = PotentialDeduplicator(target_elements=list_of_elements, verbose=False)
     clean_df = dedup.deduplicate(raw_df)
```

Or better, use logging:

```python
import logging

logger = logging.getLogger(__name__)
```

and replace all `print(...)` calls in `PotentialDeduplicator` with `logger.debug(...)` / `logger.info(...)`.
📝 Committable suggestion

```python
raw_df = LammpsPotentialFile(resource_path=resource_path).find(list_of_elements)
dedup = PotentialDeduplicator(target_elements=list_of_elements, verbose=False)
clean_df = dedup.deduplicate(raw_df)
return clean_df
```
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #310      +/-   ##
==========================================
- Coverage   82.36%   78.50%    -3.87%
==========================================
  Files          11       11
  Lines        1157     1326      +169
==========================================
+ Hits          953     1041       +88
- Misses        204      285       +81
```

☔ View full report in Codecov by Sentry.
Adds PotentialDeduplicator class to identify and remove duplicate interatomic potentials from DataFrame-based potential databases.
There is considerable overlap between the potentials in the NIST-IPR and OpenKIM repositories: the same potential appears multiple times with different naming conventions or version numbers, which creates redundant calculations.
The `PotentialDeduplicator` class:

- Parses potential names from both LAMMPS (`YEAR--Author--Elements--LAMMPS--iprN`) and OpenKIM (`Author_YEAR[suffix]Elements__MO*`) formats.
- Groups potentials by family (normalized author + year + suffix).
- Selects the preferred version within each family using preference rules:
  - NIST-IPR potentials are given preference over OpenKIM potentials due to the frequent issues in building OpenKIM potentials.
  - Higher IPR number beats lower IPR number (within NIST), i.e. the latest version wins.
- Filters potentials by the Species present (supports single or multiple target elements).
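Those preference rules could be expressed as a single sort key; the sketch below uses assumed field names (`repo_type`, `ipr_number`) and plain dicts rather than the PR's actual data model:

```python
def preference_key(potential: dict) -> tuple:
    """Higher key wins within a family: NIST-IPR beats OpenKIM, then higher iprN."""
    repo_rank = 1 if potential["repo_type"] == "NIST-IPR" else 0
    ipr = potential.get("ipr_number") or 0  # OpenKIM entries carry no iprN
    return (repo_rank, ipr)

def select_best(family: list) -> dict:
    # One pass over the family picks the preferred version.
    return max(family, key=preference_key)
```

Encoding the rules as a tuple key keeps the ordering explicit and makes it trivial to extend (e.g., appending a tiebreaker field to the tuple).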