
Adding Potential deduplication #310

Open
Sriram2631 wants to merge 5 commits into main from SortingPotentials

Conversation


Sriram2631 commented Feb 12, 2026

Adds a PotentialDeduplicator class to identify and remove duplicate interatomic potentials from DataFrame-based potential databases. There is considerable overlap between the potentials in the NIST-IPR and OpenKIM repositories: the same potential often appears multiple times under different naming conventions or version numbers, which leads to redundant calculations.

The PotentialDeduplicator class:

  • Parses potential names in both the LAMMPS (YEAR--Author--Elements--LAMMPS--iprN) and OpenKIM (Author_YEAR[suffix]Elements__MO*) formats
  • Groups potentials by family (normalized author + year + suffix)
  • Selects one version within each family using preference rules:
    • NIST-IPR potentials are preferred over OpenKIM potentials, due to frequent issues in building OpenKIM potentials
    • Within NIST-IPR, a higher IPR number beats a lower one, i.e. the latest version wins
  • Filters potentials by the species present (supports single or multiple target elements)
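These preference rules can be sketched as a sort key (a hypothetical illustration; the attribute names repo_type and ipr follow the review summary, not necessarily the exact implementation):

```python
from types import SimpleNamespace

def preference_key(pot):
    """Rank candidates within a family: NIST-IPR (repo_type 'LAMMPS')
    beats OpenKIM, then a higher IPR number (latest version) wins."""
    repo_rank = 0 if pot.repo_type == "LAMMPS" else 1  # NIST-IPR first
    return (repo_rank, -(pot.ipr or 0))                # higher ipr sorts earlier

# Toy family with three candidate versions:
candidates = [
    SimpleNamespace(repo_type="OpenKIM", ipr=None),
    SimpleNamespace(repo_type="LAMMPS", ipr=1),
    SimpleNamespace(repo_type="LAMMPS", ipr=3),
]
best = min(candidates, key=preference_key)  # NIST-IPR entry with ipr=3
```

Sorting by a composite tuple keeps the two rules independent: repository type decides first, and the IPR version only breaks ties within NIST-IPR.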

Summary by CodeRabbit

Release Notes

  • New Features
    • Deduplication workflow removes duplicate potentials based on publication year and author information
    • Element-based filtering ensures only potentials containing all target elements are included
    • Family analysis tool provides summary statistics on potential groups and repository types
    • Enhanced data quality in potential listings through automated cleaning and validation


coderabbitai bot commented Feb 12, 2026

📝 Walkthrough

A new Potential dataclass and PotentialDeduplicator class have been added to handle deduplication and analysis of potential data. The deduplicator supports metadata parsing, element filtering, grouping by author and year, and family analysis. The view_potentials() function now applies this deduplication workflow to return a cleaned DataFrame.

Changes

All changes are in src/lammpsparser/potential.py:

  • Potential Data Container: Added Potential dataclass with fields for metadata (year, authors, elements, repo_type, ipr, original) and deduplication-specific attributes (df_index, sort_key, family_id).
  • Deduplication Engine: Added PotentialDeduplicator class with multi-rule deduplication: parses potential names, normalizes author surnames, filters by target elements, groups by (year_full, author), and ranks by repo type (LAMMPS priority) and IPR. Provides deduplicate(), get_duplicates(), get_stats(), get_family_id(), analyze_families(), and filter_by_elements() methods.
  • Potential Viewing: Updated view_potentials() to instantiate PotentialDeduplicator and apply the deduplication workflow with verbose output before returning the cleaned DataFrame.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰✨ Potentials once tangled, now sorted with care,
Duplicates removed, the data laid bare!
By author and year, grouped neat as can be,
A deduplicator hops—cleaned data, hurray! 🎉

🚥 Pre-merge checks | ✅ 3 passed
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title 'Adding Potential deduplication' directly reflects the main change, introducing a PotentialDeduplicator class for deduplicating potentials in the database.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


coderabbitai bot left a comment

Actionable comments posted: 6

🤖 Fix all issues with AI agents
In `@src/lammpsparser/potential.py`:
- Around line 343-357: Normalize species to a set before checking membership and
avoid double metadata parsing: add a helper (e.g., normalize_species) used by
contains_target_elements, deduplicate, and filter_by_elements to convert species
(string or list) into a consistent set of element strings; in the loop over
df.iterrows() call species_set = normalize_species(row["Species"]) and pass
species_set to contains_target_elements instead of the current conditional. Also
call metadata = parse_potential_metadata(name) once, use metadata to compute
family_id (or inline the family id logic that currently lives in get_family_id
to accept metadata) and then update families[family_id]["count"],
["repos"].add(metadata["repo_type"]) and ["names"].append(name) so
parse_potential_metadata is not invoked twice.
- Around line 119-131: The current camelCase-splitting in normalize_author is
too aggressive and chops names like "McDonald" or "DErrico"; update
normalize_author to only split when the main token matches a clear
concatenated-capitalized-names pattern (e.g., two capitalized name-like
segments) and then keep the first full segment; specifically, replace the
heuristic that splits on any uppercase boundary (inside normalize_author, using
variables author_str and main) with a targeted regex check such as matching
^([A-Z][a-z]{2,})([A-Z][a-z]+) and, only if that matches, extract group 1 as the
primary name before lowercasing and stripping non-letters—otherwise leave main
intact.
- Around line 216-221: The code silently drops string-valued species by treating
anything not list/set as empty; update the handling in the block that sets
elements (variable name elements in src/lammpsparser/potential.py) to detect
strings and other single-value types and convert them into a one-element set
(e.g., if isinstance(species, str): elements = {species}) instead of set(),
while preserving the existing branches for list and set; ensure this logic is
used wherever the target-elements filter relies on elements so single-species
strings like "Ni" are preserved.
- Around line 80-105: The __init__ currently hardcodes target_elements="Ni" and
raises ValueError for wrong types; change the signature to default
target_elements=None (or require explicit argument) and update initialization to
handle None (e.g., set self.target_elements=None and validate/use later) or
document/validate that it must be provided before use; also replace the final
raise ValueError(...) with raise TypeError(...) for invalid input types,
referencing the __init__ parameter target_elements and the attribute
self.target_elements so you update both the default and the exception type
consistently.
- Around line 663-668: The deduplicator currently forces verbose output in
view_potentials by calling
PotentialDeduplicator(target_elements=list_of_elements, verbose=True) and also
redundantly passes target_elements even though
LammpsPotentialFile.find(list_of_elements) already filters; change this to
silence library output: set verbose to False (or omit the flag) when
constructing PotentialDeduplicator in view_potentials (and in any callers such
as get_potential_dataframe), and remove/ignore the redundant target_elements
argument so PotentialDeduplicator does not re-filter already-filtered raw_df (or
accept None to skip element filtering). Also update PotentialDeduplicator to use
the logging module (logger = logging.getLogger(__name__)) and replace print(...)
calls with logger.debug/info as appropriate so diagnostics are controlled by the
consumer's logging configuration.
- Around line 36-41: The file has duplicate/unused typing imports and
inconsistent pandas aliasing: remove the second typing import that re-exports
Optional and Union (avoid F811), drop unused Tuple from the typing list, and
consolidate pandas import to a single alias (either import pandas as pd and
update existing references or keep import pandas and change the new pd usages to
pandas); update all references accordingly so only one pandas symbol and one
typing block (including Optional and Union) remain, and ensure symbols mentioned
(Optional, Union, Tuple, pandas/pd) are corrected.
🧹 Nitpick comments (6)
src/lammpsparser/potential.py (6)

44-67: family_id logic is duplicated in multiple places.

The family_id property (lines 63–66) computes author_year[suffix], but the same logic is repeated in deduplicate (lines 266–267) and get_family_id (lines 326–331). Use pot.family_id in deduplicate for the grouping key, and delegate from get_family_id to Potential.family_id to stay DRY.

Also, df_index: Optional[int] is narrower than what DataFrame.iterrows() actually yields — the index can be any hashable type (string, tuple, etc.). Consider Optional[Any] or a broader annotation to avoid confusion.
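A minimal sketch of the suggested delegation, keeping the grouping key in one place (field names beyond those in the walkthrough summary, such as suffix, are assumptions):

```python
from dataclasses import dataclass

@dataclass
class Potential:
    authors: str
    year: str
    suffix: str = ""

    @property
    def family_id(self) -> str:
        # Single source of truth for the grouping key: author_year[suffix].
        return f"{self.authors}_{self.year}{self.suffix}"

# deduplicate() and get_family_id() would then reuse pot.family_id
# instead of recomputing the key inline.
```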


160-170: Clarify the year_idx guard — accidental correctness when index is 0.

Line 167 uses if year_idx and year_idx > 0, which relies on 0 being falsy. This happens to be correct (no author exists before index 0), but it conflates "not found" (None) with "found at position 0". A reader or future maintainer could easily misread the intent.

Suggested clarity fix
-            if year_idx and year_idx > 0:
+            if year_idx is not None and year_idx > 0:

36-41: Duplicate import pandas under two aliases.

pandas is imported on line 5 and again as pd on line 41. The existing codebase (PotentialAbstract, view_potentials, etc.) uses pandas, while all new code uses pd. Pick one alias and use it consistently to avoid confusion.

Also applies to: 1-6


111-117: Nit: prefer next(iter(…)) over list conversion for single-element set.

Line 115 creates a temporary list to extract one element. As flagged by Ruff (RUF015):

-            return list(self.target_elements)[0]
+            return next(iter(self.target_elements))

138-150: LAMMPS regex discards elements — no cross-check with the Species column.

The third capture group (element list from the name, e.g., Ni-Cu) is assigned to _ and discarded. The Potential object's elements field is populated from the DataFrame's Species column instead. This means a mismatch between the name-embedded elements and the Species column would go undetected, potentially causing silent mis-grouping or incorrect filtering. Consider at least a debug-level assertion or log when they diverge.
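One way to add the suggested cross-check is a small helper that compares the two element sources and logs divergences at debug level (a sketch; the function name and parameters are hypothetical):

```python
import logging

logger = logging.getLogger(__name__)

def check_name_elements(name: str, name_element_field: str, species: list) -> bool:
    """Compare the element list embedded in a LAMMPS-style name
    (e.g. 'Ni-Cu') against the DataFrame's Species column, logging
    any mismatch instead of silently discarding the name-embedded list."""
    name_elements = set(name_element_field.split("-"))
    if name_elements != set(species):
        logger.debug("Element mismatch for %s: name has %s, Species has %s",
                     name, sorted(name_elements), sorted(species))
        return False
    return True
```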


294-295: Unparsed potentials are silently kept without deduplication.

Potentials whose names don't match either LAMMPS or OpenKIM format are re-added as-is (line 295). This is reasonable for safety, but if the set of unparsable names is large, duplicates among them — or between them and parsed potentials — will slip through. A verbose log line or stat entry for this case would improve transparency.

Comment on lines +36 to +41
import re
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple, Union

import pandas as pd

⚠️ Potential issue | 🟡 Minor

Duplicate and unused imports.

  • Optional and Union are already imported on line 3; re-importing them on line 39 triggers F811.
  • Tuple is imported but never used.
  • pandas is imported on line 5 as pandas, but line 41 imports it again as pd. The new code uses pd while existing code uses pandas, creating inconsistency.

Consolidate into a single set of imports at the top of the file.

Proposed fix

At the top of the file, merge the imports:

-from typing import Optional, Union
+from typing import Dict, List, Optional, Set, Union

Remove the duplicate block:

 import re
 from collections import defaultdict
 from dataclasses import dataclass
-from typing import Dict, List, Optional, Set, Tuple, Union
-
-import pandas as pd

Then either use pandas consistently or alias once at the top. The new code can use pandas (matching the existing convention) or change the top-level import to import pandas as pd and update existing references.

🧰 Tools
🪛 Ruff (0.15.0)

[error] 39-39: Redefinition of unused Optional from line 3: Optional redefined here

Remove definition: Optional

(F811)


[error] 39-39: Redefinition of unused Union from line 3: Union redefined here

Remove definition: Union

(F811)


Comment on lines +80 to +105
    def __init__(
        self,
        target_elements: Union[str, List[str], Set[str]] = "Ni",
        verbose: bool = False,
    ):
        """
        Parameters
        ----------
        target_elements : str, list of str, or set of str
            Element(s) to filter for. If multiple elements provided,
            potentials must contain ALL of them.
            Examples: 'Ni', ['Ni', 'Al'], {'Ni', 'Al', 'Cu'}
        verbose : bool
            Print deduplication details
        """
        # Normalize to set
        if isinstance(target_elements, str):
            self.target_elements = {target_elements}
        elif isinstance(target_elements, (list, tuple)):
            self.target_elements = set(target_elements)
        elif isinstance(target_elements, set):
            self.target_elements = target_elements
        else:
            raise ValueError(
                f"target_elements must be str, list, or set, got {type(target_elements)}"
            )

⚠️ Potential issue | 🟡 Minor

Hardcoded default "Ni" and wrong exception type.

  1. target_elements defaults to "Ni", which is surprising for a general-purpose utility. A sentinel like None (with validation that it's set before use) or requiring an explicit argument would be safer.

  2. Line 103: the else branch raises ValueError for an invalid type, but TypeError is the idiomatic choice (also flagged by Ruff TRY004).

Proposed fix for exception type
-            raise ValueError(
+            raise TypeError(
                 f"target_elements must be str, list, or set, got {type(target_elements)}"
             )
🧰 Tools
🪛 Ruff (0.15.0)

[warning] 103-105: Prefer TypeError exception for invalid type

(TRY004)


[warning] 103-105: Avoid specifying long messages outside the exception class

(TRY003)


Comment on lines +119 to +131
    @staticmethod
    def normalize_author(author_str: str) -> str:
        """Extract and normalize primary author surname."""
        parts = re.split(r"[-_]", author_str)
        main = parts[0]

        # Handle camelCase in OpenKIM
        if len(re.findall(r"[A-Z]", main)) > 1:
            camel_parts = re.split(r"(?=[A-Z])", main)
            camel_parts = [p for p in camel_parts if p]
            main = camel_parts[0] if camel_parts else main

        return re.sub(r"[^a-z]", "", main.lower())

⚠️ Potential issue | 🟠 Major

CamelCase splitting heuristic is fragile and can mis-normalize author names.

The logic splits on uppercase boundaries when more than one capital is detected, then keeps only the first fragment. This incorrectly truncates real surnames:

  • "McDonald"["Mc", "Donald"]"mc" (should be "mcdonald")
  • "DErrico"["D", "Errico"]"d" (loses the name entirely)
  • "MacLeod"["Mac", "Leod"]"mac"

This can cause false-positive deduplication (unrelated authors collapsing to the same key) or false-negative (same author with different casing getting different keys). If the goal is to handle OpenKIM's concatenated multi-author strings like "SmithJones", a more targeted pattern (e.g., splitting only when the name contains a known concatenation marker or checking specific OpenKIM conventions) would be safer.
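The more targeted pattern suggested above can be sketched as follows (a hypothetical rewrite, not the shipped code; note that this pattern still splits "MacLeod", so a prefix allowlist would be needed for full coverage):

```python
import re

def normalize_author(author_str: str) -> str:
    """Only split a token that looks like two concatenated capitalized
    names (e.g. 'SmithJones'); short prefixes like 'Mc' and single
    capitals ('DErrico') fail the pattern and stay intact."""
    main = re.split(r"[-_]", author_str)[0]
    m = re.match(r"^([A-Z][a-z]{2,})([A-Z][a-z]+)", main)
    if m:
        main = m.group(1)  # keep the first full name segment
    return re.sub(r"[^a-z]", "", main.lower())
```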


Comment on lines +216 to +221
            if isinstance(species, list):
                elements = set(species)
            elif isinstance(species, set):
                elements = species
            else:
                elements = set()

⚠️ Potential issue | 🟡 Minor

String species silently becomes an empty set — potential data loss.

If species is a string (e.g., "Ni"), the else branch produces set(), which will never pass the target-elements filter and the potential is silently dropped. While the upstream CSV parser currently returns lists, this fallback is misleading — it looks like defensive handling but actually discards valid data.

Proposed fix
             if isinstance(species, list):
                 elements = set(species)
             elif isinstance(species, set):
                 elements = species
+            elif isinstance(species, str):
+                elements = {species}
             else:
                 elements = set()

Comment on lines +343 to +357
        for _, row in df.iterrows():
            name = row["Name"]
            species = row["Species"]

            # Check target elements
            elements = set(species) if isinstance(species, list) else species
            if not self.contains_target_elements(elements):
                continue

            family_id = self.get_family_id(name)
            if family_id:
                metadata = self.parse_potential_metadata(name)
                families[family_id]["count"] += 1
                families[family_id]["repos"].add(metadata["repo_type"])
                families[family_id]["names"].append(name)

🛠️ Refactor suggestion | 🟠 Major

Inconsistent species-to-set conversion and redundant metadata parsing.

  1. Line 348 uses set(species) if isinstance(species, list) else species. If species is a string, it passes the raw string to issubset, which does character-level membership — {"Ni"}.issubset("Ni") is False because the element "Ni" is not a character in the iterable "Ni". This is the same pattern as in deduplicate (lines 216–221) and filter_by_elements (line 407). Consider extracting a shared helper for species normalization.

  2. parse_potential_metadata(name) is called twice per row — once inside get_family_id (line 352) and again on line 354. Inline the family_id computation or cache the result.
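The shared helper suggested in point 1 could look like this (a sketch; the name normalize_species comes from the review, not from the PR):

```python
def normalize_species(species) -> set:
    """Coerce a DataFrame 'Species' cell into a set of element symbols.
    Handles str ('Ni'), list/tuple, and set; anything else yields an
    empty set rather than a character-level iterable."""
    if isinstance(species, str):
        return {species}
    if isinstance(species, (list, tuple, set)):
        return set(species)
    return set()

# The membership check then becomes uniform across all three call sites:
# self.target_elements.issubset(normalize_species(row["Species"]))
```

This avoids the character-level trap described above: {"Ni"}.issubset("Ni") is False because iterating a string yields single characters, while {"Ni"}.issubset(normalize_species("Ni")) is True.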


Comment on lines +663 to +668
    raw_df = LammpsPotentialFile(resource_path=resource_path).find(list_of_elements)

    dedup = PotentialDeduplicator(target_elements=list_of_elements, verbose=True)
    clean_df = dedup.deduplicate(raw_df)

    return clean_df

⚠️ Potential issue | 🟠 Major

Hardcoded verbose=True prints to stdout on every call — inappropriate for a library function.

view_potentials is called by get_potential_dataframe (line 728), which is part of the public API. Hardcoding verbose=True means every consumer silently gets deduplication diagnostics dumped to stdout. Library code should default to silent operation and use the logging module for diagnostics.

Additionally, raw_df is already element-filtered by find(list_of_elements) on line 663, so the deduplicator's own element filter is redundant — filtered_out_indices will always be empty.

Proposed fix
-    dedup = PotentialDeduplicator(target_elements=list_of_elements, verbose=True)
+    dedup = PotentialDeduplicator(target_elements=list_of_elements, verbose=False)
     clean_df = dedup.deduplicate(raw_df)

Or better, use logging:

import logging
logger = logging.getLogger(__name__)

and replace all print(...) calls in PotentialDeduplicator with logger.debug(...) / logger.info(...).
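With prints routed through logging, a consumer who wants the diagnostics back can opt in via standard logging configuration (a sketch; the logger name assumes the module path lammpsparser.potential):

```python
import logging

# Opt back in to deduplication diagnostics for this module only.
logging.basicConfig()  # attach a default stderr handler
logging.getLogger("lammpsparser.potential").setLevel(logging.DEBUG)
```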



codecov bot commented Feb 12, 2026

Codecov Report

❌ Patch coverage is 52.35294% with 81 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.50%. Comparing base (687df62) to head (7b23c2f).

Files with missing lines | Patch % | Lines
src/lammpsparser/potential.py | 52.35% | 81 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #310      +/-   ##
==========================================
- Coverage   82.36%   78.50%   -3.87%     
==========================================
  Files          11       11              
  Lines        1157     1326     +169     
==========================================
+ Hits          953     1041      +88     
- Misses        204      285      +81     
