[WIP] ElementEmbeddings integration #442
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           develop     #442      +/-   ##
===========================================
+ Coverage    80.66%   81.30%    +0.64%
===========================================
  Files           33       36        +3
  Lines         2871     3055      +184
===========================================
+ Hits          2316     2484      +168
- Misses         555      571       +16
```

☔ View full report in Codecov by Sentry.
> [!NOTE]
> Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings.
Walkthrough

Introduces element embeddings integration via a new wrapper module, adds a Python 3.11 StrEnum compatibility backport, enhances smact_filter with flexible output formatting, adds a composition-dictionary conversion utility, expands test coverage, and updates project dependencies and configurations.
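The StrEnum backport mentioned in the walkthrough can be sketched roughly as follows. This is an assumption about the compatibility shim, not the code under review; the real implementation in this PR may differ.

```python
import sys
from enum import Enum

if sys.version_info >= (3, 11):
    from enum import StrEnum
else:

    class StrEnum(str, Enum):
        """Minimal backport: members are str instances and compare equal to plain strings."""

        def __str__(self) -> str:
            return str(self.value)

        @staticmethod
        def _generate_next_value_(name, start, count, last_values):
            # Mirror Python 3.11 behaviour: auto() yields the lowercased member name.
            return name.lower()
```

On 3.11+ the stdlib class is used directly, so downstream code sees identical semantics either way.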
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 1
🧹 Nitpick comments (5)
smact/utils/composition.py (1)
96-105: Fix type annotation inconsistency in docstring.

The function signature correctly shows `tuple[str, int, int] | tuple[str, int]`, but the docstring only mentions `tuple[str, int, int]`.

```diff
 Args:
-    smact_filter_output (tuple[str, int, int]): An item in the list returned from smact_filter
+    smact_filter_output (tuple[str, int, int] | tuple[str, int]): An item in the list returned from smact_filter

 Returns:
     composition_dict (dict[str, float]): An composition dictionary
```

The implementation correctly reuses `comp_maker` and follows the same pattern as `formula_maker`.

smact/screening.py (2)
337-338: Consider refactoring to reduce code duplication in match statements.

The match statements at lines 426-433 and 438-444 are nearly identical. Consider extracting the format conversion logic into a helper function.

```diff
+def _format_compositions(compositions, return_output):
+    """Helper function to format compositions based on return_output type."""
+    match return_output:
+        case SmactFilterOutputs.default:
+            return compositions
+        case SmactFilterOutputs.formula:
+            return [formula_maker(smact_filter_output=comp) for comp in compositions]
+        case SmactFilterOutputs.dict:
+            return [composition_dict_maker(smact_filter_output=comp) for comp in compositions]
+
 # Return list depending on whether we are interested in unique species combinations
 # or just unique element combinations.
 if species_unique:
-    match return_output:
-        case SmactFilterOutputs.default:
-            return compositions
-        case SmactFilterOutputs.formula:
-            return [formula_maker(smact_filter_output=comp) for comp in compositions]
-        case SmactFilterOutputs.dict:
-            return [composition_dict_maker(smact_filter_output=comp) for comp in compositions]
+    return _format_compositions(compositions, return_output)
 else:
     compositions = [(i[0], i[2]) for i in compositions]
     compositions = list(set(compositions))
-    match return_output:
-        case SmactFilterOutputs.default:
-            return compositions
-        case SmactFilterOutputs.formula:
-            return [formula_maker(smact_filter_output=comp) for comp in compositions]
-        case SmactFilterOutputs.dict:
-            return [composition_dict_maker(smact_filter_output=comp) for comp in compositions]
+    return _format_compositions(compositions, return_output)
```
354-354: Improve docstring formatting and add return type documentation.

The docstring should better document the different return types based on the `return_output` parameter.

```diff
-    return_output (SmactFilterOutputs): If set to 'default', the function will return a list of tuples containing the tuples of symbols, oxidation states and stoichiometry values. "Formula" returns a list of formulas and "dict" returns a list of dictionaries.
+    return_output (SmactFilterOutputs): Controls the output format:
+        - 'default': List of tuples containing symbols, oxidation states and stoichiometry values
+        - 'formula': List of chemical formula strings
+        - 'dict': List of composition dictionaries

 Returns:
 -------
-    allowed_comps (list): Allowed compositions for that chemical system
-        in the form [(elements), (oxidation states), (ratios)] if species_unique=True and tuple=False
-        or in the form [(elements), (ratios)] if species_unique=False and tuple=False.
+    allowed_comps (list): Allowed compositions for that chemical system.
+        Return type depends on return_output parameter:
+        - SmactFilterOutputs.default: List[tuple] with format depending on species_unique
+        - SmactFilterOutputs.formula: List[str] of chemical formulas
+        - SmactFilterOutputs.dict: List[dict] of composition dictionaries
```

smact/io/elementembeddings.py (2)
17-31: Address the TODO comment about moving enums to ElementEmbeddings.

The comment suggests this enum should be moved to the ElementEmbeddings codebase. Consider whether this duplication is necessary or whether the enums should be imported from ElementEmbeddings directly.
If these enums are temporary, consider adding a TODO comment with a timeline or GitHub issue reference. If they're permanent, remove the comment and ensure they stay in sync with ElementEmbeddings capabilities.
39-49: Consider the same enum consolidation for PoolingStats.

Similar to the previous enum, this one also carries a comment suggesting it should be moved to the ElementEmbeddings codebase.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (5)
- docs/tutorials/element_embeddings_integration.ipynb (1 hunks)
- smact/io/__init__.py (1 hunks)
- smact/io/elementembeddings.py (1 hunks)
- smact/screening.py (5 hunks)
- smact/utils/composition.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
smact/screening.py (1)
smact/structure_prediction/structure.py (1)
`composition` (575-595)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: test (3.12, macos-latest)
- GitHub Check: test (3.13, ubuntu-latest)
- GitHub Check: test (3.13, macos-latest)
🔇 Additional comments (6)
smact/io/__init__.py (1)
1-1: LGTM! Clean package initialisation.

The docstring appropriately describes the purpose of this package as an interface to external libraries.
docs/tutorials/element_embeddings_integration.ipynb (2)
732-753: Excellent tutorial demonstrating the integration workflow.

This cell effectively shows the complete workflow from element screening to feature vector generation. The imports are correct, and the progression from `smact_filter` with the new `SmactFilterOutputs.formula` option to `composition_featuriser` demonstrates the practical value of the integration.
898-902: Good demonstration of the raw output format.

This second cell shows users what the actual feature vector arrays look like, which is valuable for understanding the output format.
smact/screening.py (1)
27-33: Well-designed enum for output format options.

The `SmactFilterOutputs` enum provides clear, descriptive options for the different return formats. The use of `auto()` ensures consistent string values.

smact/io/elementembeddings.py (2)
70-102: Well-documented wrapper function with clear parameter descriptions.

The `species_composition_featuriser` function provides good documentation and maintains consistency with the ElementEmbeddings interface. The parameter passing is straightforward and appropriate.
53-67: Verify ElementEmbeddings wrapper parameters

Please ensure that our `composition_featuriser` wrapper remains aligned with the upstream ElementEmbeddings API:

- Confirm that `ee_composition_featuriser` in `elementembeddings.composition` actually expects parameters named `data`, `formula_column`, `embedding`, `stats` and `inplace`.
- Verify that the allowed types (`pd.DataFrame | pd.Series | CompositionalEmbedding | list` for data; `Embedding | AllowedElementEmbeddings` for embedding; and `PoolingStats | list[PoolingStats]` for stats) match the upstream function's signature.
- Pin and document the minimum ElementEmbeddings version that our wrapper was tested against, so future releases don't silently break this interface.
- Add an integration test (or CI check) that exercises the wrapper end-to-end with a known ElementEmbeddings version to catch breaking changes early.
@AntObi do you need any help increasing patch coverage?
I've forgotten about this PR from the hackathon. I think the code here pretty much handles the ElementEmbeddings integration. The only additional thing required would be writing up unit tests.
…ocks

- Extract _format_output() helper in screening.py to deduplicate match blocks
- Add 11 unit tests for smact.io.elementembeddings (enums, import check, featurisers)
- Add smact_filter return_output tests in test_utils.py
- Consolidate StrEnum backport into smact/utils/compat.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
🧹 Nitpick comments (5)
smact/io/elementembeddings.py (2)
15-23: `_check_elementembeddings`: the `# noqa: F401` is unnecessary and the error message is clear.

Ruff reports the `noqa` directive on line 18 is unused (the F401 rule is not enabled). The import is only for its side effect (checking availability), so the `noqa` can be safely removed.

🧹 Suggested cleanup

```diff
-    import elementembeddings  # noqa: F401
+    import elementembeddings
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@smact/io/elementembeddings.py` around lines 15 - 23, Remove the unnecessary "# noqa: F401" in the _check_elementembeddings function: the import of elementembeddings is intentionally only to test availability, so delete the noqa directive from the import line inside _check_elementembeddings to satisfy Ruff and keep the clear ImportError message unchanged.
26-28: Track the TODO: "Should be moved to element embeddings codebase".

Lines 26 and 48 contain comments indicating these enums should live in the upstream `ElementEmbeddings` package. Consider creating a tracking issue so this doesn't get lost, especially since the PR is already WIP.

Would you like me to open an issue to track migrating `AllowedElementEmbeddings` and `PoolingStats` upstream?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@smact/io/elementembeddings.py` around lines 26 - 28, Create a tracking issue in the repository to migrate the enums to the upstream ElementEmbeddings package and reference it in the code: open an issue titled like "Migrate AllowedElementEmbeddings and PoolingStats to ElementEmbeddings package" describing both symbols (AllowedElementEmbeddings, PoolingStats), why they should move, and assign labels/assignee; then add a TODO comment in the code next to the existing comments for AllowedElementEmbeddings and PoolingStats that includes the new issue number or URL so the migration task is discoverable and not lost.smact/screening.py (1)
449-452: Potential data loss when `species_unique=False` is combined with `return_output="dict"`.

When `species_unique=False`, compositions are de-duplicated into 2-tuples `(symbols, ratio)`, dropping oxidation states. The `composition_dict_maker` and `formula_maker` then receive these 2-tuples instead of 3-tuples.

This is fine as long as both utility functions handle 2-element tuples correctly (the tests in `test_utils.py` suggest they do), but users should be aware that `return_output="dict"` with `species_unique=False` will produce element-keyed dictionaries (e.g. `{"Fe": 1, "O": 1}`) rather than species-keyed ones (e.g. `{"Fe2+": 1, "O2-": 1}`). It might be worth noting this in the docstring for `return_output`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@smact/screening.py` around lines 449 - 452, The code de-duplicates compositions into 2-tuples when species_unique is False (compositions = list({(i[0], i[2]) for i in compositions})), which strips oxidation states and means downstream helpers (_format_output -> composition_dict_maker / formula_maker) will produce element-keyed dictionaries when return_output="dict" rather than species (oxidation-state)-keyed ones; update the docstring for the function that exposes return_output to explicitly document this behavior (mention species_unique=False yields element-keyed dicts like {"Fe":1,"O":1} instead of {"Fe2+":1,"O2-":1}) so users are aware of the data loss of oxidation-state labels.smact/tests/test_elementembeddings.py (1)
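To make the shape change concrete, here is a minimal illustration of the de-duplication step described above, using made-up data rather than real smact output:

```python
# Hypothetical smact_filter-style 3-tuples: (symbols, oxidation states, ratios).
compositions = [
    (("Fe", "O"), (2, -2), (1, 1)),
    (("Fe", "O"), (3, -2), (2, 3)),
    (("Fe", "O"), (2, -2), (1, 1)),  # duplicate species combination
]

# The species_unique=False path keeps only (symbols, ratios):
deduped = sorted({(c[0], c[2]) for c in compositions})

# Oxidation states are gone, so downstream formatters can only be element-keyed.
print(deduped)
```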
78-84: Minor DRY opportunity: `_setup_mocks` is duplicated across two test classes.

`TestCompositionFeaturiser._setup_mocks` and `TestSpeciesCompositionFeaturiser._setup_mocks` are nearly identical, differing only in which attribute is set on `mock_comp_mod` and the return value string. Consider extracting a shared helper or base class to reduce duplication, though this is a minor nit for a test file.

Also applies to: 138-144
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@smact/tests/test_elementembeddings.py` around lines 78 - 84, Extract the duplicated mock setup into a shared helper or base class to DRY the tests: factor the logic in _setup_mocks used by TestCompositionFeaturiser and TestSpeciesCompositionFeaturiser into a single helper (e.g., a module-level function create_mocks(mock_attr_name, return_value) or a TestBase class with a _setup_mocks that accepts which attribute to set), then have each test class call that helper or inherit and pass "composition_featuriser"/"species_composition_featuriser" and the respective return strings ("featurised_result"/"species_featurised_result") so only the differing attribute name and return value remain configurable while reusing the shared MagicMock creation.smact/tests/test_utils.py (1)
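The suggested consolidation could look like the sketch below. The attribute and return-value names mirror the review comment, but the shape of the real test fixtures is an assumption:

```python
from unittest.mock import MagicMock


def create_mocks(attr_name: str, return_value: str) -> MagicMock:
    """Build a mock module whose `attr_name` callable returns `return_value`.

    Replaces the two near-identical _setup_mocks bodies: each test class
    passes only the attribute name and return string that differ.
    """
    mock_comp_mod = MagicMock()
    getattr(mock_comp_mod, attr_name).return_value = return_value
    return mock_comp_mod
```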
172-176: Consider testing with a valid string that matches an enum value.

The invalid-input test is good. You might also want a test that passes a plain string like `"formula"` (not the enum member) to confirm the `StrEnum` equality semantics work end-to-end; this would validate that users can pass raw strings as a convenience.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@smact/tests/test_utils.py` around lines 172 - 176, Add a positive test that verifies smact_filter accepts a plain string matching the StrEnum value (e.g., "formula") and behaves the same as passing the enum member; specifically, create a new test (alongside test_smact_filter_invalid_return_output) that calls smact_filter(els, threshold=2, return_output="formula") and asserts the returned output equals the result when calling smact_filter(els, threshold=2, return_output=ReturnOutput.FORMULA) (or otherwise checks the same output shape/value), referencing the smact_filter function and the ReturnOutput enum to validate string-to-enum equality handling.
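The end-to-end behaviour this suggested test would exercise rests on StrEnum semantics, illustrated here with a stand-in enum (not imported from smact, so the example runs anywhere):

```python
from enum import Enum, auto


class StrEnum(str, Enum):
    """Stand-in for enum.StrEnum so the example runs on Python < 3.11 too."""

    @staticmethod
    def _generate_next_value_(name, start, count, last_values):
        return name.lower()


class SmactFilterOutputs(StrEnum):  # stand-in mirroring the PR's enum
    default = auto()
    formula = auto()


# A raw string compares equal to the member and can look the member up:
assert SmactFilterOutputs.formula == "formula"
assert SmactFilterOutputs("formula") is SmactFilterOutputs.formula
```

This is why `smact_filter(..., return_output="formula")` can behave identically to passing the enum member itself.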
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@smact/io/elementembeddings.py`:
- Around line 15-23: Remove the unnecessary "# noqa: F401" in the
_check_elementembeddings function: the import of elementembeddings is
intentionally only to test availability, so delete the noqa directive from the
import line inside _check_elementembeddings to satisfy Ruff and keep the clear
ImportError message unchanged.
- Around line 26-28: Create a tracking issue in the repository to migrate the
enums to the upstream ElementEmbeddings package and reference it in the code:
open an issue titled like "Migrate AllowedElementEmbeddings and PoolingStats to
ElementEmbeddings package" describing both symbols (AllowedElementEmbeddings,
PoolingStats), why they should move, and assign labels/assignee; then add a TODO
comment in the code next to the existing comments for AllowedElementEmbeddings
and PoolingStats that includes the new issue number or URL so the migration task
is discoverable and not lost.
In `@smact/screening.py`:
- Around line 449-452: The code de-duplicates compositions into 2-tuples when
species_unique is False (compositions = list({(i[0], i[2]) for i in
compositions})), which strips oxidation states and means downstream helpers
(_format_output -> composition_dict_maker / formula_maker) will produce
element-keyed dictionaries when return_output="dict" rather than species
(oxidation-state)-keyed ones; update the docstring for the function that exposes
return_output to explicitly document this behavior (mention species_unique=False
yields element-keyed dicts like {"Fe":1,"O":1} instead of {"Fe2+":1,"O2-":1}) so
users are aware of the data loss of oxidation-state labels.
In `@smact/tests/test_elementembeddings.py`:
- Around line 78-84: Extract the duplicated mock setup into a shared helper or
base class to DRY the tests: factor the logic in _setup_mocks used by
TestCompositionFeaturiser and TestSpeciesCompositionFeaturiser into a single
helper (e.g., a module-level function create_mocks(mock_attr_name, return_value)
or a TestBase class with a _setup_mocks that accepts which attribute to set),
then have each test class call that helper or inherit and pass
"composition_featuriser"/"species_composition_featuriser" and the respective
return strings ("featurised_result"/"species_featurised_result") so only the
differing attribute name and return value remain configurable while reusing the
shared MagicMock creation.
In `@smact/tests/test_utils.py`:
- Around line 172-176: Add a positive test that verifies smact_filter accepts a
plain string matching the StrEnum value (e.g., "formula") and behaves the same
as passing the enum member; specifically, create a new test (alongside
test_smact_filter_invalid_return_output) that calls smact_filter(els,
threshold=2, return_output="formula") and asserts the returned output equals the
result when calling smact_filter(els, threshold=2,
return_output=ReturnOutput.FORMULA) (or otherwise checks the same output
shape/value), referencing the smact_filter function and the ReturnOutput enum to
validate string-to-enum equality handling.
ℹ️ Review info
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- smact/io/elementembeddings.py
- smact/screening.py
- smact/tests/test_elementembeddings.py
- smact/tests/test_utils.py
- smact/utils/compat.py
- smact/utils/composition.py
🚧 Files skipped from review as they are similar to previous changes (1)
- smact/utils/composition.py
Actionable comments posted: 6
> [!CAUTION]
> Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)

smact/utils/crystal_space/generate_composition_with_smact.py (1)
154-154: ⚠️ Potential issue | 🔴 Critical

`generate_composition_with_smact_custom()` will crash at runtime when called with default arguments.

The type annotation change to `oxidation_states_set: str | None = None` (line 154) is syntactically correct but breaks downstream calls:

- Line 209: `lookup_element_oxidation_states_custom("all", oxidation_states_set, copy=False)` requires `filepath` to be a string pointing to an oxidation states file. When `oxidation_states_set=None`, this parameter receives no default value and is not guarded; `_get_data_rows(None)` will raise an error.
- Line 226: `smact_filter(..., oxidation_states_set=oxidation_states_set)` expects `oxidation_states_set: str` (a preset name like `"icsd24"` or a file path). Passing `None` violates the type contract and will fail downstream when the function expects a string.

Either:

- Change the default back to a valid string (e.g. `str = "icsd24"`), or
- Update both downstream functions to handle `None` by using a sensible fallback preset, or
- Change the line 154 type annotation to require the argument and remove the default.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@smact/utils/crystal_space/generate_composition_with_smact.py` at line 154, The parameter oxidation_states_set in generate_composition_with_smact_custom currently defaults to None which causes downstream calls to lookup_element_oxidation_states_custom and smact_filter to receive None and crash; revert the default to a valid preset string and strict str type by changing the signature to oxidation_states_set: str = "icsd24" (or another appropriate preset), so lookup_element_oxidation_states_custom("all", oxidation_states_set, copy=False) and smact_filter(..., oxidation_states_set=oxidation_states_set) always receive a valid string.

smact/screening.py (1)
548-549: ⚠️ Potential issue | 🟡 Minor

Dead branch condition: `or oxidation_states_set is None` is unreachable.

By the time execution reaches line 548, `oxidation_states_set is None` has already been handled and returned from the `if` block starting at line 529. The `or oxidation_states_set is None` sub-expression can never be `True` here.

🐛 Proposed fix

```diff
-    elif oxidation_states_set == "icsd24" or oxidation_states_set is None:
+    elif oxidation_states_set == "icsd24":
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@smact/screening.py` around lines 548 - 549, The condition "or oxidation_states_set is None" in the elif is unreachable because None is already handled earlier; remove the redundant sub-expression so the branch reads "elif oxidation_states_set == 'icsd24':" and keep the body assigning ox_combos = [el.oxidation_states_icsd24 for el in smact_elems] (no other logic changes to oxidation_states_set, ox_combos, or smact_elems).
🧹 Nitpick comments (3)
.github/dependabot.yml (1)
15-19: Consider `weekly` instead of `daily` for pip updates.

`daily` pip polling is unusually aggressive for a research Python library whose dependencies don't change on that cadence. Combined with a per-ecosystem limit of 5 open PRs, this could generate up to 5 simultaneous dependency-bump PRs every day, adding meaningful review overhead.

`weekly` is the Dependabot default and suits most projects well. If staying with `daily`, adding a `groups:` block to batch compatible updates into a single PR would significantly reduce noise.

⚙️ Suggested configuration with grouping and a less aggressive schedule

```diff
   - package-ecosystem: "pip"
     directory: "/"
     schedule:
-      interval: daily
+      interval: weekly
     open-pull-requests-limit: 5
     target-branch: develop
+    groups:
+      python-dependencies:
+        patterns:
+          - "*"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/dependabot.yml around lines 15 - 19, Update the Dependabot pip config to reduce noise: change the schedule block under the package-ecosystem: "pip" entry so interval: daily is replaced with interval: weekly (or, if you opt to keep daily, add a groups: section to batch compatible updates into a single PR); adjust or retain open-pull-requests-limit: 5 as needed after grouping. Target the stanza that contains package-ecosystem: "pip", schedule: interval: daily and open-pull-requests-limit: 5 and make the schedule change or add grouping to limit daily PR churn.

.pre-commit-config.yaml (1)
7-16: Consider migrating the legacy `ruff` hook ID to `ruff-check`.

The upstream `.pre-commit-hooks.yaml` explicitly labels `id: ruff` as a "Legacy alias", with `ruff-check` being the canonical hook ID. The alias remains functional, but aligning with the canonical name future-proofs the config against eventual alias removal.

♻️ Proposed update

```diff
   - repo: https://github.com/astral-sh/ruff-pre-commit
     rev: v0.14.10
     hooks:
       # Run the linting tool
-      - id: ruff
+      - id: ruff-check
         types_or: [python, pyi]
         args: [--fix]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.pre-commit-config.yaml around lines 7 - 16, Replace the legacy hook ID "ruff" with the canonical "ruff-check" in the hooks block so the linting hook uses id: ruff-check (preserve existing types_or and args settings); leave the ruff-format hook unchanged. Locate the hooks where "id: ruff" appears and update it to "id: ruff-check" (ensuring the surrounding hook entries, e.g., types_or and args, remain intact).

smact/screening.py (1)
28-33: `SmactFilterOutputs.dict` shadows the Python builtin `dict`.

The member name `dict` is scoped to the class namespace so it causes no runtime collision, but it makes the attribute name potentially confusing (e.g. in type annotations, docs, IDE auto-complete). Consider renaming to `composition_dict` or `as_dict`.

♻️ Suggested rename

```diff
 class SmactFilterOutputs(StrEnum):
     """Allowed outputs of the `smact_filter` function."""

     default = "default"
     formula = "formula"
-    dict = "dict"
+    composition_dict = "dict"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@smact/screening.py` around lines 28 - 33, The enum member SmactFilterOutputs.dict shadows the builtin name dict which can confuse type hints and IDEs; rename this member (e.g., to SmactFilterOutputs.composition_dict or SmactFilterOutputs.as_dict) and update all usages of SmactFilterOutputs.dict (calls, comparisons, docs, and type annotations) to the new name so references to the enum remain correct and clear. Ensure the enum string value is updated if desired (or left as "dict" only if external compatibility requires it) and run tests/imports to confirm no broken references remain.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pyproject.toml`:
- Line 78: Update the Tornado dependency lower bound from "tornado>=6.5" to
"tornado>=6.5.3" wherever it appears (the docs dependency entry and the
[dependency-groups] dev section) to ensure inclusion of the 6.5.3 security
fixes; locate the string "tornado>=6.5" in pyproject.toml and replace it with
"tornado>=6.5.3" in both places.
In `@smact/screening.py`:
- Line 379: The docstring for the parameter return_output refers to allowed
values with inconsistent capitalization ("Formula" vs the enum value "formula");
update the docstring so the listed values exactly match the SmactFilterOutputs
enum (e.g., use "default", "formula", "dict") to avoid confusion—locate the
return_output description in screening.py (the SmactFilterOutputs reference) and
change "Formula" to "formula".
In `@smact/structure_prediction/database.py`:
- Line 21: The module-level import of MPResterNew (from mp_api.client) must be
made optional to avoid hard failure when mp-api isn't installed; either wrap the
import in try/except (set MPResterNew = None on ImportError and raise a clear
error when the MP-API functionality is invoked) or move the import statement for
MPResterNew inside the function that actually uses it (the function that
currently calls the MP API client at the usage site), mirroring how pathos is
handled; update the runtime check to raise an informative message if MPResterNew
is unavailable when that function is called.
In `@smact/structure_prediction/structure.py`:
- Around line 131-159: The __hash__ and __eq__ mismatch is caused by np.allclose
using a nonzero rtol and by lax iteration using zip(..., strict=False); update
the equality logic in __eq__ to call np.allclose(..., atol=1e-7, rtol=0) so it
becomes an absolute-only tolerance matching the rounding in __hash__, and
tighten the site/spec iteration to enforce equal-length coordinate lists (e.g.,
use zip(..., strict=True) or explicitly compare lengths for each species key) so
that objects considered equal will produce the same hash computed in __hash__.
- Line 14: The top-level import of MPResterNew makes the except ImportError in
from_mp unreachable; move the import of MPResterNew (mp_api.client.MPRester)
into the from_mp function and mirror the optional-dependency pattern used for
pathos: attempt the import inside from_mp, catch ImportError, and then fall back
to the legacy MPRester path. Update references in from_mp to use the locally
imported MPResterNew and ensure the ImportError branch preserves the existing
fallback behavior.
In `@smact/utils/crystal_space/generate_composition_with_smact.py`:
- Line 138: The assignment results_df.loc[smact_allowed, "smact_allowed"] = True
can silently create new rows when smact_allowed contains labels not present in
results_df.index (derived from compounds); change it to only set rows that exist
by using results_df.loc[results_df.index.isin(smact_allowed), "smact_allowed"] =
True (i.e., use isin against results_df.index or compounds to filter
smact_allowed before assignment) to avoid inflating the DataFrame with
missing-index entries; apply the same guarded fix to the analogous assignment
later in the file (the other occurrence around the second mentioned location).
---
Outside diff comments:
In `@smact/screening.py`:
- Around line 548-549: The condition "or oxidation_states_set is None" in the
elif is unreachable because None is already handled earlier; remove the
redundant sub-expression so the branch reads "elif oxidation_states_set ==
'icsd24':" and keep the body assigning ox_combos = [el.oxidation_states_icsd24
for el in smact_elems] (no other logic changes to oxidation_states_set,
ox_combos, or smact_elems).
In `@smact/utils/crystal_space/generate_composition_with_smact.py`:
- Line 154: The parameter oxidation_states_set in
generate_composition_with_smact_custom currently defaults to None which causes
downstream calls to lookup_element_oxidation_states_custom and smact_filter to
receive None and crash; revert the default to a valid preset string and strict
str type by changing the signature to oxidation_states_set: str = "icsd24" (or
another appropriate preset), so lookup_element_oxidation_states_custom("all",
oxidation_states_set, copy=False) and smact_filter(...,
oxidation_states_set=oxidation_states_set) always receive a valid string.
---
Nitpick comments:
In @.github/dependabot.yml:
- Around line 15-19: Update the Dependabot pip config to reduce noise: change
the schedule block under the package-ecosystem: "pip" entry so interval: daily
is replaced with interval: weekly (or, if you opt to keep daily, add a groups:
section to batch compatible updates into a single PR); adjust or retain
open-pull-requests-limit: 5 as needed after grouping. Target the stanza that
contains package-ecosystem: "pip", schedule: interval: daily and
open-pull-requests-limit: 5 and make the schedule change or add grouping to
limit daily PR churn.
In @.pre-commit-config.yaml:
- Around line 7-16: Replace the legacy hook ID "ruff" with the canonical
"ruff-check" in the hooks block so the linting hook uses id: ruff-check
(preserve existing types_or and args settings); leave the ruff-format hook
unchanged. Locate the hooks where "id: ruff" appears and update it to "id:
ruff-check" (ensuring the surrounding hook entries, e.g., types_or and args,
remain intact).
In `@smact/screening.py`:
- Around line 28-33: The enum member SmactFilterOutputs.dict shadows the builtin
name dict which can confuse type hints and IDEs; rename this member (e.g., to
SmactFilterOutputs.composition_dict or SmactFilterOutputs.as_dict) and update
all usages of SmactFilterOutputs.dict (calls, comparisons, docs, and type
annotations) to the new name so references to the enum remain correct and clear.
Ensure the enum string value is updated if desired (or left as "dict" only if
external compatibility requires it) and run tests/imports to confirm no broken
references remain.
ℹ️ Review info
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
- .github/dependabot.yml
- .pre-commit-config.yaml
- README.md
- pyproject.toml
- requirements.txt
- requirements/requirements-py310.txt
- smact/data_loader.py
- smact/screening.py
- smact/structure_prediction/database.py
- smact/structure_prediction/probability_models.py
- smact/structure_prediction/structure.py
- smact/tests/test_core.py
- smact/tests/test_elementembeddings.py
- smact/tests/test_utils.py
- smact/utils/crystal_space/generate_composition_with_smact.py
✅ Files skipped from review due to trivial changes (2)
- smact/data_loader.py
- smact/structure_prediction/probability_models.py
🚧 Files skipped from review as they are similar to previous changes (1)
- smact/tests/test_elementembeddings.py
```diff
-df = pd.DataFrame(data=False, index=compounds, columns=["smact_allowed"])
-df.loc[smact_allowed, "smact_allowed"] = True
+results_df = pd.DataFrame(data=False, index=compounds, columns=["smact_allowed"])
+results_df.loc[smact_allowed, "smact_allowed"] = True
```
Silent DataFrame row expansion if a formula in smact_allowed is absent from compounds.
pd.DataFrame.loc silently inserts a new row when a label is not present in the index. In the normal execution path smact_allowed is a strict subset of compounds, but there is no guard enforcing this. Any reduced-formula mismatch (e.g., from a Pauling-threshold element producing a compound whose GCD-reduced form was never in the stoichiometry sweep) would inflate the DataFrame with rows that only carry smact_allowed=True but no False baseline entry, corrupting the output.
Using isin is explicit and zero-cost for correctness:
🛡️ Proposed fix
```diff
-    results_df = pd.DataFrame(data=False, index=compounds, columns=["smact_allowed"])
-    results_df.loc[smact_allowed, "smact_allowed"] = True
+    smact_allowed_set = set(smact_allowed)
+    results_df = pd.DataFrame(
+        {"smact_allowed": [c in smact_allowed_set for c in compounds]},
+        index=compounds,
+    )
```

Apply the same fix at lines 249–250.
Also applies to: 250-250
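The enlargement behaviour described in this comment is easy to reproduce in isolation. This standalone sketch (the compound labels are made up) shows `.loc` silently inserting a row for a label that was never in the index, and the membership-based construction that avoids it:

```python
import pandas as pd

compounds = ["NaCl", "KCl", "MgO"]

# Unguarded: .loc setting with a label absent from the index silently
# enlarges the frame instead of raising, which is the corruption described.
df = pd.DataFrame(data=False, index=compounds, columns=["smact_allowed"])
df.loc["CsBr", "smact_allowed"] = True  # "CsBr" was never in `compounds`
assert "CsBr" in df.index  # a phantom row has appeared

# Guarded: build the column from set membership so the index is fixed up front.
smact_allowed = ["NaCl", "CsBr"]
allowed = set(smact_allowed)
df_ok = pd.DataFrame(
    {"smact_allowed": [c in allowed for c in compounds]}, index=compounds
)
assert list(df_ok.index) == compounds  # index unchanged, no phantom rows
```

The guarded construction also makes the `False` baseline explicit: every compound gets a boolean from the membership test, so there is no row that carries only a `True` with no counterpart.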
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@smact/utils/crystal_space/generate_composition_with_smact.py` at line 138,
The assignment results_df.loc[smact_allowed, "smact_allowed"] = True can
silently create new rows when smact_allowed contains labels not present in
results_df.index (derived from compounds); change it to only set rows that exist
by using results_df.loc[results_df.index.isin(smact_allowed), "smact_allowed"] =
True (i.e., use isin against results_df.index or compounds to filter
smact_allowed before assignment) to avoid inflating the DataFrame with
missing-index entries; apply the same guarded fix to the analogous assignment
later in the file (the other occurrence around the second mentioned location).
Replace _check_elementembeddings() lazy imports with top-level try/except and HAS_ELEMENTEMBEDDINGS flag to satisfy ruff (no noqa suppressions). Update test mocking strategy to patch the flag and featuriser functions directly. Use pytest.raises instead of self.assertRaises (PT027), keyword arg new= for patch() booleans (FBT003), and combined with statements (SIM117).
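The module-level guard described in this commit follows a common optional-dependency pattern. A minimal sketch is below; `composition_featuriser` is assumed to be the elementembeddings entry point, and the `featurise` wrapper name is hypothetical:

```python
# Module-level optional-import guard: the flag is set once at import time,
# so tests can patch HAS_ELEMENTEMBEDDINGS directly and ruff sees no
# function-level imports.
try:
    from elementembeddings.composition import composition_featuriser
    HAS_ELEMENTEMBEDDINGS = True
except ImportError:
    HAS_ELEMENTEMBEDDINGS = False

def featurise(formulas):
    """Fail fast with an actionable message when the optional extra is missing."""
    if not HAS_ELEMENTEMBEDDINGS:
        raise ImportError(
            "elementembeddings is not installed; "
            "install it with `pip install elementembeddings`."
        )
    return composition_featuriser(formulas)
```

Patching the flag (rather than the import machinery) is what makes the test mocking strategy in this commit straightforward: `patch("module.HAS_ELEMENTEMBEDDINGS", new=False)` exercises the error path without uninstalling anything.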
- Rename SmactFilterOutputs.dict -> composition_dict to avoid shadowing Python's built-in dict - Move mp_api lazy imports inside the branches that use them in structure_prediction/database.py and structure_prediction/structure.py - Add rtol=0 to np.allclose in SmactStructure.__eq__ for stricter coordinate comparison - Change oxidation_states_set default from None to "icsd24" in generate_composition_with_smact_custom - Update test_utils.py to reflect composition_dict rename
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
smact/utils/crystal_space/generate_composition_with_smact.py (1)
154-166: ⚠️ Potential issue | 🔴 Critical

The `"icsd24"` default will cause `FileNotFoundError` in `lookup_element_oxidation_states_custom`.

The function at line 209 calls `lookup_element_oxidation_states_custom("all", oxidation_states_set, copy=False)`, which expects a file path as the second argument. The `_get_data_rows()` function it uses internally calls `open(filename)` directly. Passing `"icsd24"` (the current default) will raise a `FileNotFoundError` because it is not a valid file path.

Additionally, the docstring at line 166 claims the parameter is "the path to the oxidation states file" whilst the default is `"icsd24"` — a named set identifier, not a path — creating a contradictory contract.

Either provide a valid file path as the default or `None` with proper fallback logic, and ensure the docstring accurately reflects what the parameter accepts.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@smact/utils/crystal_space/generate_composition_with_smact.py` around lines 154 - 166, The default "icsd24" for oxidation_states_set will cause FileNotFoundError because lookup_element_oxidation_states_custom expects a file path (its helper _get_data_rows calls open(filename)); change the function generate_composition_with_smact signature to use oxidation_states_set: Optional[str] = None, update the docstring to state the parameter accepts either a file path or None (which will use the built‑in "icsd24" dataset), and add fallback logic before calling lookup_element_oxidation_states_custom (e.g. if oxidation_states_set is None resolve the actual file path for the "icsd24" resource or call the appropriate loader to obtain a real filename) so that lookup_element_oxidation_states_custom("all", resolved_path, copy=False) always receives a valid filesystem path; reference functions: generate_composition_with_smact, lookup_element_oxidation_states_custom, and _get_data_rows.
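The path-or-None contract suggested in the prompt above can be sketched as a small resolver. `DATA_DIR` and the icsd24 file name here are assumptions about SMACT's data layout, used only to illustrate the fallback logic:

```python
from pathlib import Path

# Hypothetical location of the packaged oxidation-states data; the real
# package would resolve this relative to its own installation directory.
DATA_DIR = Path("smact") / "data"

def resolve_oxidation_states_file(oxidation_states_set=None):
    """Return a real file path, falling back to the packaged icsd24 data on None."""
    if oxidation_states_set is None:
        return DATA_DIR / "oxidation_states_icsd24.txt"
    path = Path(oxidation_states_set)
    if not path.is_file():
        raise FileNotFoundError(f"No oxidation-states file at {path!r}")
    return path
```

With this shape, `lookup_element_oxidation_states_custom("all", resolve_oxidation_states_file(oxidation_states_set), copy=False)` always receives a filesystem path, and a bad user-supplied path fails with a clear message instead of a bare `open()` error.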
♻️ Duplicate comments (3)
smact/structure_prediction/structure.py (2)
113-117: ⚠️ Potential issue | 🟠 Major

`__eq__` can still treat length-mismatched sites as equal, breaking hashing guarantees.

`zip(..., strict=False)` truncates, so extra coordinates in one structure won't be checked. That can yield `self == other` while `hash(self) != hash(other)` because `__hash__` includes all coordinates. Please enforce equal-length site lists (and species count) during comparison.

🔧 Proposed fix (enforce equal lengths)

```diff
-        for si, sj in zip(self.sites.values(), other.sites.values(), strict=False):
-            for ci, cj in zip(si, sj, strict=False):
+        for si, sj in zip(self.sites.values(), other.sites.values(), strict=True):
+            if len(si) != len(sj):
+                sites_equal = False
+                break
+            for ci, cj in zip(si, sj, strict=True):
                 if not np.allclose(ci, cj, atol=1e-7, rtol=0):
                     sites_equal = False
                     break
```
Verify each finding against the current code and only fix it if needed. In `@smact/structure_prediction/structure.py` around lines 113 - 117, The equality method (__eq__) currently uses zip(..., strict=False) which truncates unequal-length iterables and can mark structures equal even when site lists or per-site coordinate lists differ; update __eq__ to first check that the number of sites (len(self.sites) vs len(other.sites)) is equal and then for each corresponding site (iterate by index or by ordered keys from self.sites) verify that the length of the coordinate/species list for that site matches before comparing coordinates with np.allclose; if any length differs return False immediately so __eq__ is consistent with __hash__ which assumes all coordinates are accounted for.
130-158: ⚠️ Potential issue | 🟠 Major

Hash rounding still allows `__eq__`-equal coords to hash differently.

`__eq__` allows differences up to 1e-7, but `__hash__` rounds to 7 decimals. Two coords that differ by, say, 9e-8 can be `==` yet round to different values, violating the hash contract. Canonicalise coordinates the same way in both `__eq__` and `__hash__` (e.g., compare rounded coords in `__eq__`, or drop tolerance and use exact equality).

🔧 One consistent approach (compare rounded coords in __eq__)

```diff
-            for ci, cj in zip(si, sj, strict=True):
-                if not np.allclose(ci, cj, atol=1e-7, rtol=0):
+            for ci, cj in zip(si, sj, strict=True):
+                if tuple(round(c, 7) for c in ci) != tuple(round(c, 7) for c in cj):
                     sites_equal = False
                     break
```
Verify each finding against the current code and only fix it if needed. In `@smact/structure_prediction/structure.py` around lines 130 - 158, The current __hash__ rounds coordinates to 7 decimals which can still map __eq__-equal coordinates to different hashes; fix by quantizing coordinates the same way the equality tolerance is applied (use integer quantization at the 1e-7 resolution) so values within atol=1e-7 collapse to the same representation. Update the __hash__ implementation (function __hash__) to convert each coordinate component c to an integer key like int(round(c * 1e7)) (apply same transformation for lattice_mat rows and any site coordinates), then build species_t, lattice_t (from quantized lattice), and sites_t from these integer tuples before hashing (keep lattice_param unchanged); this ensures hash consistency with __eq__ that uses np.allclose(atol=1e-7).smact/utils/crystal_space/generate_composition_with_smact.py (1)
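The canonicalisation idea from this comment generalises to quantising coordinates at the tolerance resolution in both methods. The sketch below uses a stand-in `Site` class, not the real `SmactStructure`:

```python
# Tolerance-consistent __eq__/__hash__ pairing: both methods quantise each
# coordinate to an integer multiple of the 1e-7 resolution, so any two
# objects that compare equal necessarily hash equal.
RESOLUTION = 1e-7

def _quantise(coords):
    return tuple(int(round(c / RESOLUTION)) for c in coords)

class Site:
    def __init__(self, coords):
        self.coords = tuple(coords)

    def __eq__(self, other):
        return _quantise(self.coords) == _quantise(other.coords)

    def __hash__(self):
        return hash(_quantise(self.coords))
```

Quantised equality is slightly coarser than an `np.allclose` tolerance (two values 9e-8 apart can straddle a bin boundary and compare unequal), but that direction is harmless: Python's data model only requires that `a == b` implies `hash(a) == hash(b)`, which quantisation guarantees by construction.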
137-138: Unresolved: silent DataFrame row expansion via `loc` assignment.

The `results_df.loc[smact_allowed, "smact_allowed"] = True` assignment at both lines 138 and 250 still silently inserts rows when any formula in `smact_allowed` is absent from `results_df.index`. The issue and proposed fix were raised in a prior review and remain unaddressed in this revision.

Also applies to: 249-250
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@smact/utils/crystal_space/generate_composition_with_smact.py` around lines 137 - 138, The assignment results_df.loc[smact_allowed, "smact_allowed"] = True can silently add rows if any formulas in the smact_allowed list are not present in results_df.index; instead restrict the assignment to the intersection of existing index values (or use an index boolean mask) so only existing rows are set. Locate the two occurrences that perform this write (the results_df variable and the smact_allowed identifier around the assignment) and change them to compute the valid targets first (e.g., valid = results_df.index.intersection(smact_allowed) or mask = results_df.index.isin(smact_allowed)) and then assign True only to those valid rows to avoid row expansion. Ensure both occurrences (the one near the top and the one near line ~250) are updated consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@smact/utils/crystal_space/generate_composition_with_smact.py`:
- Around line 154-166: The default "icsd24" for oxidation_states_set will cause
FileNotFoundError because lookup_element_oxidation_states_custom expects a file
path (its helper _get_data_rows calls open(filename)); change the function
generate_composition_with_smact signature to use oxidation_states_set:
Optional[str] = None, update the docstring to state the parameter accepts either
a file path or None (which will use the built‑in "icsd24" dataset), and add
fallback logic before calling lookup_element_oxidation_states_custom (e.g. if
oxidation_states_set is None resolve the actual file path for the "icsd24"
resource or call the appropriate loader to obtain a real filename) so that
lookup_element_oxidation_states_custom("all", resolved_path, copy=False) always
receives a valid filesystem path; reference functions:
generate_composition_with_smact, lookup_element_oxidation_states_custom, and
_get_data_rows.
---
Duplicate comments:
In `@smact/structure_prediction/structure.py`:
- Around line 113-117: The equality method (__eq__) currently uses zip(...,
strict=False) which truncates unequal-length iterables and can mark structures
equal even when site lists or per-site coordinate lists differ; update __eq__ to
first check that the number of sites (len(self.sites) vs len(other.sites)) is
equal and then for each corresponding site (iterate by index or by ordered keys
from self.sites) verify that the length of the coordinate/species list for that
site matches before comparing coordinates with np.allclose; if any length
differs return False immediately so __eq__ is consistent with __hash__ which
assumes all coordinates are accounted for.
- Around line 130-158: The current __hash__ rounds coordinates to 7 decimals
which can still map __eq__-equal coordinates to different hashes; fix by
quantizing coordinates the same way the equality tolerance is applied (use
integer quantization at the 1e-7 resolution) so values within atol=1e-7 collapse
to the same representation. Update the __hash__ implementation (function
__hash__) to convert each coordinate component c to an integer key like
int(round(c * 1e7)) (apply same transformation for lattice_mat rows and any site
coordinates), then build species_t, lattice_t (from quantized lattice), and
sites_t from these integer tuples before hashing (keep lattice_param unchanged);
this ensures hash consistency with __eq__ that uses np.allclose(atol=1e-7).
In `@smact/utils/crystal_space/generate_composition_with_smact.py`:
- Around line 137-138: The assignment results_df.loc[smact_allowed,
"smact_allowed"] = True can silently add rows if any formulas in the
smact_allowed list are not present in results_df.index; instead restrict the
assignment to the intersection of existing index values (or use an index boolean
mask) so only existing rows are set. Locate the two occurrences that perform
this write (the results_df variable and the smact_allowed identifier around the
assignment) and change them to compute the valid targets first (e.g., valid =
results_df.index.intersection(smact_allowed) or mask =
results_df.index.isin(smact_allowed)) and then assign True only to those valid
rows to avoid row expansion. Ensure both occurrences (the one near the top and
the one near line ~250) are updated consistently.
ℹ️ Review info
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (9)
- .github/dependabot.yml
- .pre-commit-config.yaml
- pyproject.toml
- smact/io/elementembeddings.py
- smact/screening.py
- smact/structure_prediction/structure.py
- smact/tests/test_elementembeddings.py
- smact/tests/test_utils.py
- smact/utils/crystal_space/generate_composition_with_smact.py
🚧 Files skipped from review as they are similar to previous changes (3)
- .github/dependabot.yml
- .pre-commit-config.yaml
- pyproject.toml
- Add top-level try/except HAS_MP_API flag in structure_prediction/database.py and structure_prediction/structure.py to satisfy ruff PLC0415 (imports must be at module level); replace inline `from mp_api.client import` with conditional HAS_MP_API checks - Refactor else/if to elif in SmactStructure.from_mp (PLR5501) - Remove PD901 from ruff ignore list in pyproject.toml — rule was removed upstream and was emitting a noisy warning on every hook run
ElementEmbeddings Integration
Description
Type of change
New feature (non-breaking change which adds functionality)
This change requires a documentation update
How Has This Been Tested?
(TODO)
Test Configuration:
Reviewers
Checklist
Summary by CodeRabbit
Release Notes
New Features
Documentation
Chores