384 changes: 383 additions & 1 deletion src/lammpsparser/potential.py
@@ -33,6 +33,383 @@

"""

import re
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple, Union

import pandas as pd
Comment on lines +36 to +41

⚠️ Potential issue | 🟡 Minor

Duplicate and unused imports.

  • Optional and Union are already imported on line 3; re-importing them on line 39 triggers F811.
  • Tuple is imported but never used.
  • pandas is imported on line 5 as pandas, but line 41 imports it again as pd. The new code uses pd while existing code uses pandas, creating inconsistency.

Consolidate into a single set of imports at the top of the file.

Proposed fix

At the top of the file, merge the imports:

-from typing import Optional, Union
+from typing import Dict, List, Optional, Set, Union

Remove the duplicate block:

 import re
 from collections import defaultdict
 from dataclasses import dataclass
-from typing import Dict, List, Optional, Set, Tuple, Union
-
-import pandas as pd

Then either use pandas consistently or alias once at the top. The new code can use pandas (matching the existing convention) or change the top-level import to import pandas as pd and update existing references.

🧰 Tools
🪛 Ruff (0.15.0)

[error] 39-39: Redefinition of unused Optional from line 3: Optional redefined here

Remove definition: Optional

(F811)


[error] 39-39: Redefinition of unused Union from line 3: Union redefined here

Remove definition: Union

(F811)

🤖 Prompt for AI Agents
In `@src/lammpsparser/potential.py` around lines 36 - 41, The file has
duplicate/unused typing imports and inconsistent pandas aliasing: remove the
second typing import that re-exports Optional and Union (avoid F811), drop
unused Tuple from the typing list, and consolidate pandas import to a single
alias (either import pandas as pd and update existing references or keep import
pandas and change the new pd usages to pandas); update all references
accordingly so only one pandas symbol and one typing block (including Optional
and Union) remain, and ensure symbols mentioned (Optional, Union, Tuple,
pandas/pd) are corrected.



@dataclass
class Potential:
    """Unified potential representation."""

    year: str
    year_suffix: str
    authors: str
    elements: Set[str]
    repo_type: str
    ipr: Optional[int]
    original: str
    df_index: Optional[int] = None

    @property
    def sort_key(self):
        """Key for sorting - prefer LAMMPS, then higher ipr."""
        return (0 if self.repo_type == "LAMMPS" else 1, -(self.ipr if self.ipr else 0))

    @property
    def family_id(self) -> str:
        """Return author_year[suffix] identifier."""
        year_full = self.year + self.year_suffix if self.year_suffix else self.year
        return f"{self.authors}_{year_full}"


class PotentialDeduplicator:
    """
    Deduplicate interatomic potentials from DataFrame.

    Rules:
    1. Potentials from same author+year+suffix are duplicates
    2. Within LAMMPS: prefer higher ipr
    3. Across repos: prefer LAMMPS over OpenKIM
    4. Only keep potentials containing ALL target_elements
    """

    def __init__(
        self,
        target_elements: Union[str, List[str], Set[str]] = "Ni",
        verbose: bool = False,
    ):
        """
        Parameters
        ----------
        target_elements : str, list of str, or set of str
            Element(s) to filter for. If multiple elements provided,
            potentials must contain ALL of them.
            Examples: 'Ni', ['Ni', 'Al'], {'Ni', 'Al', 'Cu'}
        verbose : bool
            Print deduplication details
        """
        # Normalize to set
        if isinstance(target_elements, str):
            self.target_elements = {target_elements}
        elif isinstance(target_elements, (list, tuple)):
            self.target_elements = set(target_elements)
        elif isinstance(target_elements, set):
            self.target_elements = target_elements
        else:
            raise ValueError(
                f"target_elements must be str, list, or set, got {type(target_elements)}"
            )
Comment on lines +80 to +105

⚠️ Potential issue | 🟡 Minor

Hardcoded default "Ni" and wrong exception type.

  1. target_elements defaults to "Ni", which is surprising for a general-purpose utility. A sentinel like None (with validation that it's set before use) or requiring an explicit argument would be safer.

  2. Line 103: the else branch raises ValueError for an invalid type, but TypeError is the idiomatic choice (also flagged by Ruff TRY004).

Proposed fix for exception type
-            raise ValueError(
+            raise TypeError(
                 f"target_elements must be str, list, or set, got {type(target_elements)}"
             )
🧰 Tools
🪛 Ruff (0.15.0)

[warning] 103-105: Prefer TypeError exception for invalid type

(TRY004)


[warning] 103-105: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In `@src/lammpsparser/potential.py` around lines 80 - 105, The __init__ currently
hardcodes target_elements="Ni" and raises ValueError for wrong types; change the
signature to default target_elements=None (or require explicit argument) and
update initialization to handle None (e.g., set self.target_elements=None and
validate/use later) or document/validate that it must be provided before use;
also replace the final raise ValueError(...) with raise TypeError(...) for
invalid input types, referencing the __init__ parameter target_elements and the
attribute self.target_elements so you update both the default and the exception
type consistently.


        self.verbose = verbose
        self.last_duplicates_map = {}
        self.last_stats = {}

    @property
    def target_elements_str(self) -> str:
        """Human-readable string of target elements."""
        if len(self.target_elements) == 1:
            return list(self.target_elements)[0]
        else:
            return "{" + ", ".join(sorted(self.target_elements)) + "}"

    @staticmethod
    def normalize_author(author_str: str) -> str:
        """Extract and normalize primary author surname."""
        parts = re.split(r"[-_]", author_str)
        main = parts[0]

        # Handle camelCase in OpenKIM
        if len(re.findall(r"[A-Z]", main)) > 1:
            camel_parts = re.split(r"(?=[A-Z])", main)
            camel_parts = [p for p in camel_parts if p]
            main = camel_parts[0] if camel_parts else main

        return re.sub(r"[^a-z]", "", main.lower())
Comment on lines +119 to +131

⚠️ Potential issue | 🟠 Major

CamelCase splitting heuristic is fragile and can mis-normalize author names.

The logic splits on uppercase boundaries when more than one capital is detected, then keeps only the first fragment. This incorrectly truncates real surnames:

  • "McDonald" → ["Mc", "Donald"] → "mc" (should be "mcdonald")
  • "DErrico" → ["D", "Errico"] → "d" (loses the name entirely)
  • "MacLeod" → ["Mac", "Leod"] → "mac"

This can cause false-positive deduplication (unrelated authors collapsing to the same key) or false-negative (same author with different casing getting different keys). If the goal is to handle OpenKIM's concatenated multi-author strings like "SmithJones", a more targeted pattern (e.g., splitting only when the name contains a known concatenation marker or checking specific OpenKIM conventions) would be safer.
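A more conservative version of the heuristic can be sketched as follows. This is an illustration, not code from the PR: the `{2,}` lowercase threshold is an assumption, and genuinely ambiguous concatenations ("MacLeod" vs. "SmithJones") cannot be told apart by pattern alone, so "MacLeod" would still split.

```python
import re


def normalize_author(author_str: str) -> str:
    """Normalize the primary author surname, splitting camelCase only
    when the token clearly looks like two concatenated capitalized names."""
    main = re.split(r"[-_]", author_str)[0]
    # Require at least 2 lowercase letters before the second capital, so
    # prefixes like "Mc" or a lone "D" no longer trigger a split
    # ("McDonald" and "DErrico" stay whole; "MacLeod" remains ambiguous).
    match = re.match(r"^([A-Z][a-z]{2,})([A-Z][a-z]+)$", main)
    if match:
        main = match.group(1)
    return re.sub(r"[^a-z]", "", main.lower())
```

Under this rule, "SmithJones" still normalizes to "smith" while "McDonald" normalizes to "mcdonald".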

🤖 Prompt for AI Agents
In `@src/lammpsparser/potential.py` around lines 119 - 131, The current
camelCase-splitting in normalize_author is too aggressive and chops names like
"McDonald" or "DErrico"; update normalize_author to only split when the main
token matches a clear concatenated-capitalized-names pattern (e.g., two
capitalized name-like segments) and then keep the first full segment;
specifically, replace the heuristic that splits on any uppercase boundary
(inside normalize_author, using variables author_str and main) with a targeted
regex check such as matching ^([A-Z][a-z]{2,})([A-Z][a-z]+) and, only if that
matches, extract group 1 as the primary name before lowercasing and stripping
non-letters—otherwise leave main intact.


    @staticmethod
    def parse_potential_metadata(name: str) -> Optional[Dict]:
        """Parse potential name for metadata (year, author, repo, ipr)."""

        # Try LAMMPS format
        lammps_pattern = (
            r"(\d{4})--([^-]+(?:-[^-]+)*)--([^-]+(?:-[^-]+)*)--LAMMPS--ipr(\d+)"
        )
        match = re.match(lammps_pattern, name)
        if match:
            year, authors, _, ipr = match.groups()
            return {
                "year": year,
                "year_suffix": "",
                "authors": PotentialDeduplicator.normalize_author(authors),
                "repo_type": "LAMMPS",
                "ipr": int(ipr),
            }

        # Try OpenKIM format
        year_match = re.search(r"_(\d{4})([^_]*)", name)
        mo_match = re.search(r"__(MO_|SM_)", name)

        if year_match and mo_match:
            year = year_match.group(1)
            year_suffix = year_match.group(2)

            parts = name.split("_")
            year_idx = None
            for i, part in enumerate(parts):
                if part.startswith(year):
                    year_idx = i
                    break

            if year_idx and year_idx > 0:
                authors = parts[year_idx - 1]
            else:
                authors = ""

            return {
                "year": year,
                "year_suffix": year_suffix,
                "authors": (
                    PotentialDeduplicator.normalize_author(authors) if authors else ""
                ),
                "repo_type": "OpenKIM",
                "ipr": None,
            }

        return None

    def contains_target_elements(self, elements: Set[str]) -> bool:
        """Check if elements set contains ALL target elements."""
        return self.target_elements.issubset(elements)

    def deduplicate(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Deduplicate potentials from DataFrame.

        Parameters
        ----------
        df : DataFrame
            Must have 'Name' and 'Species' columns

        Returns
        -------
        deduplicated_df : DataFrame
            Deduplicated potentials containing all target elements
        """

        if "Name" not in df.columns or "Species" not in df.columns:
            raise ValueError("DataFrame must have 'Name' and 'Species' columns")

        # Parse all potentials
        potentials = []
        unparsed_indices = []
        filtered_out_indices = []

        for idx, row in df.iterrows():
            name = row["Name"]
            species = row["Species"]

            # Convert species to set
            if isinstance(species, list):
                elements = set(species)
            elif isinstance(species, set):
                elements = species
            else:
                elements = set()
Comment on lines +216 to +221

⚠️ Potential issue | 🟡 Minor

String species silently becomes an empty set — potential data loss.

If species is a string (e.g., "Ni"), the else branch produces set(), which will never pass the target-elements filter and the potential is silently dropped. While the upstream CSV parser currently returns lists, this fallback is misleading — it looks like defensive handling but actually discards valid data.

Proposed fix
             if isinstance(species, list):
                 elements = set(species)
             elif isinstance(species, set):
                 elements = species
+            elif isinstance(species, str):
+                elements = {species}
             else:
                 elements = set()
🤖 Prompt for AI Agents
In `@src/lammpsparser/potential.py` around lines 216 - 221, The code silently
drops string-valued species by treating anything not list/set as empty; update
the handling in the block that sets elements (variable name elements in
src/lammpsparser/potential.py) to detect strings and other single-value types
and convert them into a one-element set (e.g., if isinstance(species, str):
elements = {species}) instead of set(), while preserving the existing branches
for list and set; ensure this logic is used wherever the target-elements filter
relies on elements so single-species strings like "Ni" are preserved.


            # Check if ALL target elements are present
            if not self.contains_target_elements(elements):
                filtered_out_indices.append(idx)
                continue

            # Parse metadata
            metadata = self.parse_potential_metadata(name)

            if metadata:
                pot = Potential(
                    year=metadata["year"],
                    year_suffix=metadata["year_suffix"],
                    authors=metadata["authors"],
                    elements=elements,
                    repo_type=metadata["repo_type"],
                    ipr=metadata["ipr"],
                    original=name,
                    df_index=idx,
                )
                potentials.append(pot)
            else:
                unparsed_indices.append(idx)

        # Store stats
        self.last_stats = {
            "total": len(df),
            "filtered_out": len(filtered_out_indices),
            "unparsed": len(unparsed_indices),
            "valid": len(potentials),
        }

        if self.verbose:
            print(f"Total potentials: {self.last_stats['total']}")
            print(f"Target elements: {self.target_elements_str}")
            print(
                f"Filtered out (missing target elements): {self.last_stats['filtered_out']}"
            )
            print(f"Unparsed: {self.last_stats['unparsed']}")
            print(f"Valid for deduplication: {self.last_stats['valid']}")

        # Group by (year+suffix, author)
        groups = defaultdict(list)
        for pot in potentials:
            year_full = pot.year + pot.year_suffix if pot.year_suffix else pot.year
            key = (year_full, pot.authors)
            groups[key].append(pot)

        # Keep only the best from each group
        kept_indices = []
        self.last_duplicates_map = {}

        for (year_full, author), group in sorted(groups.items()):
            if len(group) == 1:
                kept_indices.append(group[0].df_index)
                continue

            # Sort by preference: LAMMPS first, then highest ipr
            group.sort(key=lambda p: p.sort_key)

            best = group[0]
            rest = group[1:]

            kept_indices.append(best.df_index)
            self.last_duplicates_map[best.original] = [p.original for p in rest]

            if self.verbose:
                print(f"\nGroup: {year_full} - {author}")
                print(f"  Kept: {best.original}")
                for dup in rest:
                    print(f"  Removed: {dup.original}")

        # Add back unparsed items
        kept_indices.extend(unparsed_indices)

        # Update stats
        self.last_stats["kept"] = len(kept_indices)
        self.last_stats["removed_duplicates"] = sum(
            len(v) for v in self.last_duplicates_map.values()
        )

        if self.verbose:
            print(f"\nFinal count: {self.last_stats['kept']}")
            print(f"Duplicates removed: {self.last_stats['removed_duplicates']}")

        # Return deduplicated DataFrame
        return df.loc[kept_indices].copy()

    def get_duplicates(self) -> Dict[str, List[str]]:
        """Return the duplicates map from last deduplication."""
        return self.last_duplicates_map.copy()

    def get_stats(self) -> Dict[str, int]:
        """Return statistics from last deduplication."""
        return self.last_stats.copy()

    def get_family_id(self, potential_name: str) -> Optional[str]:
        """
        Get the family label (author_year[suffix]) for a potential.

        Returns normalized label like 'foiles_1986' or 'adams_1989Universal6'.
        """
        metadata = self.parse_potential_metadata(potential_name)
        if metadata:
            year_full = (
                metadata["year"] + metadata["year_suffix"]
                if metadata["year_suffix"]
                else metadata["year"]
            )
            return f"{metadata['authors']}_{year_full}"
        return None

    def analyze_families(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Analyze potential families in the DataFrame.

        Returns a summary DataFrame with family counts and repo types.
        Only includes potentials with all target elements.
        """
        families = defaultdict(lambda: {"count": 0, "repos": set(), "names": []})

        for _, row in df.iterrows():
            name = row["Name"]
            species = row["Species"]

            # Check target elements
            elements = set(species) if isinstance(species, list) else species
            if not self.contains_target_elements(elements):
                continue

            family_id = self.get_family_id(name)
            if family_id:
                metadata = self.parse_potential_metadata(name)
                families[family_id]["count"] += 1
                families[family_id]["repos"].add(metadata["repo_type"])
                families[family_id]["names"].append(name)
Comment on lines +343 to +357

🛠️ Refactor suggestion | 🟠 Major

Inconsistent species-to-set conversion and redundant metadata parsing.

  1. Line 348 uses set(species) if isinstance(species, list) else species. If species is a string, it passes the raw string to issubset, which does character-level membership — {"Ni"}.issubset("Ni") is False because the element "Ni" is not a character in the iterable "Ni". This is the same pattern as in deduplicate (lines 216–221) and filter_by_elements (line 407). Consider extracting a shared helper for species normalization.

  2. parse_potential_metadata(name) is called twice per row — once inside get_family_id (line 352) and again on line 354. Inline the family_id computation or cache the result.
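A shared helper along the lines the prompt describes might look like this; `normalize_species` is a suggested name, not existing API in the PR, and it also covers the string case flagged in the earlier comment:

```python
from typing import Any, Set


def normalize_species(species: Any) -> Set[str]:
    """Coerce a Species cell into a set of element symbols.

    Strings become a one-element set (so "Ni" -> {"Ni"}, never the
    character set {"N", "i"}); lists/sets/tuples are converted directly;
    anything else (NaN, None) yields an empty set.
    """
    if isinstance(species, str):
        return {species}
    if isinstance(species, (list, set, tuple)):
        return set(species)
    return set()
```

With this helper, `deduplicate`, `analyze_families`, and `filter_by_elements` could all call `normalize_species(row["Species"])` instead of repeating slightly different inline conversions.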

🤖 Prompt for AI Agents
In `@src/lammpsparser/potential.py` around lines 343 - 357, Normalize species to a
set before checking membership and avoid double metadata parsing: add a helper
(e.g., normalize_species) used by contains_target_elements, deduplicate, and
filter_by_elements to convert species (string or list) into a consistent set of
element strings; in the loop over df.iterrows() call species_set =
normalize_species(row["Species"]) and pass species_set to
contains_target_elements instead of the current conditional. Also call metadata
= parse_potential_metadata(name) once, use metadata to compute family_id (or
inline the family id logic that currently lives in get_family_id to accept
metadata) and then update families[family_id]["count"],
["repos"].add(metadata["repo_type"]) and ["names"].append(name) so
parse_potential_metadata is not invoked twice.


        # Convert to DataFrame
        summary = []
        for family_id, info in sorted(families.items()):
            summary.append(
                {
                    "family": family_id,
                    "count": info["count"],
                    "repos": ", ".join(sorted(info["repos"])),
                    "has_duplicates": info["count"] > 1,
                }
            )

        return pd.DataFrame(summary)

    def filter_by_elements(
        self,
        df: pd.DataFrame,
        target_elements: Optional[Union[str, List[str], Set[str]]] = None,
    ) -> pd.DataFrame:
        """
        Filter DataFrame to only potentials containing specified elements.

        Parameters
        ----------
        df : DataFrame
            Input DataFrame with 'Species' column
        target_elements : str, list, set, or None
            Elements to filter for. If None, uses self.target_elements

        Returns
        -------
        filtered_df : DataFrame
            Filtered to potentials with all target elements
        """
        if target_elements is not None:
            # Temporarily change target elements
            if isinstance(target_elements, str):
                target_set = {target_elements}
            elif isinstance(target_elements, (list, tuple)):
                target_set = set(target_elements)
            else:
                target_set = target_elements
        else:
            target_set = self.target_elements

        filtered_indices = []
        for idx, row in df.iterrows():
            species = row["Species"]
            elements = set(species) if isinstance(species, list) else species
            if target_set.issubset(elements):
                filtered_indices.append(idx)

        return df.loc[filtered_indices].copy()


class PotentialAbstract:
    """
@@ -283,7 +660,12 @@ def view_potentials(structure: Atoms, resource_path: str) -> pandas.DataFrame:
        pandas.Dataframe: Dataframe including all potential parameters.
    """
    list_of_elements = set(structure.get_chemical_symbols())
-    return LammpsPotentialFile(resource_path=resource_path).find(list_of_elements)
+    raw_df = LammpsPotentialFile(resource_path=resource_path).find(list_of_elements)
+
+    dedup = PotentialDeduplicator(target_elements=list_of_elements, verbose=True)
+    clean_df = dedup.deduplicate(raw_df)
+
+    return clean_df
Comment on lines +663 to +668

⚠️ Potential issue | 🟠 Major

Hardcoded verbose=True prints to stdout on every call — inappropriate for a library function.

view_potentials is called by get_potential_dataframe (line 728), which is part of the public API. Hardcoding verbose=True means every consumer silently gets deduplication diagnostics dumped to stdout. Library code should default to silent operation and use the logging module for diagnostics.

Additionally, raw_df is already element-filtered by find(list_of_elements) on line 663, so the deduplicator's own element filter is redundant — filtered_out_indices will always be empty.

Proposed fix
-    dedup = PotentialDeduplicator(target_elements=list_of_elements, verbose=True)
+    dedup = PotentialDeduplicator(target_elements=list_of_elements, verbose=False)
     clean_df = dedup.deduplicate(raw_df)

Or better, use logging:

import logging
logger = logging.getLogger(__name__)

and replace all print(...) calls in PotentialDeduplicator with logger.debug(...) / logger.info(...).
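A minimal sketch of that logging pattern; `report_stats` is an illustrative stand-in for the print-based diagnostics in `deduplicate`, not a function in the PR:

```python
import logging

# Module-level logger; the consuming application configures handlers and levels.
logger = logging.getLogger(__name__)


def report_stats(stats: dict) -> None:
    """Emit deduplication diagnostics via logging instead of print().

    Summary counts go to INFO; noisier per-run detail goes to DEBUG, so
    library output stays silent unless the consumer opts in.
    """
    logger.info("Total potentials: %d", stats["total"])
    logger.debug("Unparsed: %d", stats["unparsed"])
```

This keeps `view_potentials` quiet by default while still letting callers enable diagnostics with `logging.basicConfig(level=logging.DEBUG)`.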

🤖 Prompt for AI Agents
In `@src/lammpsparser/potential.py` around lines 663 - 668, The deduplicator
currently forces verbose output in view_potentials by calling
PotentialDeduplicator(target_elements=list_of_elements, verbose=True) and also
redundantly passes target_elements even though
LammpsPotentialFile.find(list_of_elements) already filters; change this to
silence library output: set verbose to False (or omit the flag) when
constructing PotentialDeduplicator in view_potentials (and in any callers such
as get_potential_dataframe), and remove/ignore the redundant target_elements
argument so PotentialDeduplicator does not re-filter already-filtered raw_df (or
accept None to skip element filtering). Also update PotentialDeduplicator to use
the logging module (logger = logging.getLogger(__name__)) and replace print(...)
calls with logger.debug/info as appropriate so diagnostics are controlled by the
consumer's logging configuration.



def convert_path_to_abs_posix(path: str) -> str: