4 changes: 4 additions & 0 deletions .jules/bolt.md
@@ -5,3 +5,7 @@
## 2024-05-19 - Caching YAML Load for Framework Registry
**Learning:** `yaml.safe_load` on `frameworks.yml` inside `load_framework_registry()` took ~2-3 ms per call, and `get_framework_config()` triggered it once for every framework entry. The cost added up to a micro-bottleneck whenever lists of frameworks were processed.
**Action:** Applied the `@lru_cache` and `deepcopy` pattern again, this time to `load_framework_registry()` and `get_framework_config()`. The cache avoids repeated YAML I/O and parsing, and returning a `deepcopy` keeps callers from sharing (and mutating) the cached dictionary.
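
A minimal sketch of the cache-plus-copy pattern described in this entry, not the project's actual implementation. To stay runnable with the standard library only, `json.loads` on an inline string stands in for `yaml.safe_load` on `frameworks.yml`; the helper `_load_registry_cached`, the counter `parse_calls`, and the registry contents are hypothetical names introduced for illustration.

```python
import copy
import json
from functools import lru_cache

# Hypothetical registry text; the real code parses frameworks.yml with yaml.safe_load.
_REGISTRY_TEXT = '{"frameworks": {"demo": {"label": "Demo", "tags": ["a"]}}}'

parse_calls = 0  # counts how often the expensive parse actually runs


@lru_cache(maxsize=1)
def _load_registry_cached() -> dict:
    """Parse the registry once; later calls reuse the cached object."""
    global parse_calls
    parse_calls += 1
    return json.loads(_REGISTRY_TEXT)


def load_framework_registry() -> dict:
    """Return a private copy so callers cannot mutate the cached dict."""
    return copy.deepcopy(_load_registry_cached())


def get_framework_config(name: str) -> dict:
    """Look up one framework entry from a fresh copy of the registry."""
    return load_framework_registry()["frameworks"][name]


first = get_framework_config("demo")
first["tags"].append("mutated")  # only changes the caller's copy
second = get_framework_config("demo")
```

Here the parse runs once even though the registry is read twice, and `second` is unaffected by the mutation of `first`. The trade-off is a `deepcopy` per call, which is typically far cheaper than re-reading and re-parsing the file.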

## 2024-05-20 - Fast DataFrame Iteration Avoids Overhead
**Learning:** Iterating over pandas DataFrames with `.iterrows()` adds significant overhead because it builds a new `pd.Series` object for every row. In calculation scripts (elasticity, MPCONF196, solvMPCONF196, etc.), this causes noticeable slowdowns when processing large datasets or many files.
**Action:** Replace `.iterrows()` with `.itertuples(index=False)` for namedtuple attribute access (`row.Reaction`), or with `.itertuples(index=False, name=None)` for plain tuples accessed by position (`row[0]`). Both are much faster because no `pd.Series` is built per row. If the columns must be addressed by string key, use `for row in df.to_dict('records'):`. Check whether `.iterrows()` is a bottleneck and, where it is, replace it.
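
The three replacements named above can be compared side by side in a short sketch. The DataFrame contents are made up for illustration; only the column names echo the ones used in this PR.

```python
import pandas as pd

# Hypothetical data; column names mirror those used in gscdb138.py.
df = pd.DataFrame(
    {
        "Reaction": ["r1", "r2"],
        "Stoichiometry": ["-1,A,1,B", "-2,C,1,D"],
        "Reference": [0.5, 1.5],
    }
)

# 1. Positional access: plain tuples, index column omitted.
by_position = {row[0]: row[2] for row in df.itertuples(index=False, name=None)}

# 2. Attribute access: namedtuples whose fields are the column names.
by_attribute = {row.Reaction: row.Reference for row in df.itertuples(index=False)}

# 3. String keys: one plain dict per row.
by_key = {row["Reaction"]: row["Reference"] for row in df.to_dict("records")}
```

All three build the same mapping. One caveat for attribute access: `itertuples` renames columns that are not valid Python identifiers (for example, names containing spaces) to positional names, so `to_dict("records")` or positional access is the safer choice for such columns.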
2 changes: 1 addition & 1 deletion ml_peg/calcs/bulk_crystal/elasticity/calc_elasticity.py
@@ -235,7 +235,7 @@ def run_elasticity_benchmark(

# Save relaxed structures to extxyz for visualisation
atoms_list = []
- for _, row in results.iterrows():
+ for row in results.to_dict("records"):
struct = row.get("final_structure")
if struct is not None:
atoms = AseAtomsAdaptor.get_atoms(struct).copy()
6 changes: 3 additions & 3 deletions ml_peg/calcs/conformers/MPCONF196/calc_MPCONF196.py
@@ -85,9 +85,9 @@ def get_ref_energies(data_path: Path) -> dict[str, float]:
)
ref_energies = {}

- for row in df.iterrows():
- label = row[1][0]
- ref_energies[label] = float(row[1][2]) * KCAL_TO_EV
+ for row in df.itertuples(index=False, name=None):
+ label = row[0]
+ ref_energies[label] = float(row[2]) * KCAL_TO_EV

return ref_energies

6 changes: 3 additions & 3 deletions ml_peg/calcs/conformers/solvMPCONF196/calc_solvMPCONF196.py
@@ -83,9 +83,9 @@ def get_ref_energies(data_path: Path) -> dict[str, float]:
)
ref_energies = {}

- for row in df.iterrows():
- label = row[1][0]
- e_ref = float(row[1][1]) * units.Hartree
+ for row in df.itertuples(index=False, name=None):
+ label = row[0]
+ e_ref = float(row[1]) * units.Hartree
ref_energies[label] = e_ref

return ref_energies
10 changes: 6 additions & 4 deletions ml_peg/calcs/utils/gscdb138.py
@@ -105,11 +105,13 @@ def run_gscdb138(
df_refs["Reference"] *= units.Hartree

# Calculate relative energy for each entry.
- for _, row in tqdm(df_refs.iterrows(), dataset, total=df_refs.shape[0]):
+ for row in tqdm(
+ df_refs.itertuples(index=False), dataset, total=df_refs.shape[0]
+ ):
atoms_list = []
- identifier = row["Reaction"]
- reactions = row["Stoichiometry"].split(",") # Parse stoichiometry string.
- e_rel_ref = row["Reference"]
+ identifier = row.Reaction
+ reactions = row.Stoichiometry.split(",") # Parse stoichiometry string.
+ e_rel_ref = row.Reference
num_species = len(reactions) // 2 # Each species has coefficient and name.

e_rel_model = 0