3 changes: 3 additions & 0 deletions .jules/bolt.md
@@ -5,3 +5,6 @@
## 2024-05-19 - Caching YAML Load for Framework Registry
**Learning:** `yaml.safe_load` on `frameworks.yml` within `load_framework_registry()` was taking ~2-3 ms per call and it was repeatedly called for every framework entry via `get_framework_config()`. This was a micro-bottleneck, especially when dealing with lists or multiple frameworks.
**Action:** Applied the `@lru_cache` and `deepcopy` pattern successfully again to `load_framework_registry()` and `get_framework_config()` to avoid caching a mutable dictionary directly and avoid repeated YAML I/O parsing.
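A minimal sketch of the `@lru_cache` plus `deepcopy` pattern described in this entry. It is an illustration under stated assumptions, not the project's implementation: `json.loads` stands in for `yaml.safe_load` so the snippet runs without PyYAML, and the registry contents, the `_load_registry_cached` helper, and the `PARSE_CALLS` counter are hypothetical.

```python
import json
from copy import deepcopy
from functools import lru_cache

# Hypothetical registry contents; the real project reads frameworks.yml.
_REGISTRY_TEXT = '{"mace": {"units": "eV"}, "orb": {"units": "eV"}}'

PARSE_CALLS = 0  # counts how often the expensive parse actually runs


@lru_cache(maxsize=1)
def _load_registry_cached() -> dict:
    """Parse the registry once; later calls reuse the cached object."""
    global PARSE_CALLS
    PARSE_CALLS += 1
    return json.loads(_REGISTRY_TEXT)  # stand-in for yaml.safe_load(...)


def load_framework_registry() -> dict:
    """Return a private copy so callers cannot mutate the cached dict."""
    return deepcopy(_load_registry_cached())


def get_framework_config(name: str) -> dict:
    """Look up one framework entry from a fresh copy of the registry."""
    return load_framework_registry()[name]


first = get_framework_config("mace")
first["units"] = "kcal/mol"  # the caller mutates only its own copy
second = get_framework_config("mace")  # still sees the original value
```

The copy costs something on every call, but it keeps a caller-side mutation from leaking into the cache, which is the risk of returning the cached dictionary directly.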
## 2025-03-01 - Optimizing DataFrame Iteration in Calculation Loops
**Learning:** pandas `iterrows()` is a known performance bottleneck: it builds a new Series for every row, which incurs Series construction, type checking, and boxing of values. In loops over hundreds or thousands of structures/materials for benchmarking calculations, replacing `iterrows()` with `itertuples(index=False, name=None)` (which yields plain tuples) or `to_dict('records')` (which yields plain dicts) removes this overhead and cuts iteration time significantly.
**Action:** When refactoring loops over DataFrames in benchmarking code, prefer `itertuples()` for positional or attribute access and `to_dict('records')` when dictionary-style access (such as `.get()`) is required, instead of `iterrows()`. When switching to `itertuples()`, replace `.get()` and `[1][x]` Series references with tuple indexing (`[x]`) or namedtuple attribute access (`.x`).
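The access patterns named in this entry can be compared side by side. This is a small sketch on a hypothetical three-column frame (the column names and values are assumptions, not project data); each loop builds the same label-to-energy mapping.

```python
import pandas as pd

# Hypothetical data standing in for a reference-energy table.
df = pd.DataFrame(
    {"label": ["a", "b"], "method": ["x", "y"], "energy": [1.0, 2.5]}
)

# iterrows(): yields (index, Series) pairs, one new Series per row.
via_iterrows = {row["label"]: float(row["energy"]) for _, row in df.iterrows()}

# itertuples(index=False, name=None): plain tuples, positional access only.
via_tuples = {
    row[0]: float(row[2]) for row in df.itertuples(index=False, name=None)
}

# itertuples(): namedtuples, so columns are reachable as attributes.
via_named = {row.label: float(row.energy) for row in df.itertuples()}

# to_dict("records"): plain dicts, so .get() keeps working.
via_records = {
    row["label"]: float(row.get("energy")) for row in df.to_dict("records")
}
```

Positional tuples are the leanest option but tie the loop to column order; the namedtuple and dict forms keep the column names visible at the call site.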
3 changes: 2 additions & 1 deletion ml_peg/calcs/bulk_crystal/elasticity/calc_elasticity.py
@@ -235,7 +235,8 @@ def run_elasticity_benchmark(

     # Save relaxed structures to extxyz for visualisation
     atoms_list = []
-    for _, row in results.iterrows():
+    # Perf opt: Replace `iterrows` with `to_dict('records')`
+    for row in results.to_dict("records"):
         struct = row.get("final_structure")
         if struct is not None:
             atoms = AseAtomsAdaptor.get_atoms(struct).copy()
7 changes: 4 additions & 3 deletions ml_peg/calcs/conformers/MPCONF196/calc_MPCONF196.py
@@ -85,9 +85,10 @@ def get_ref_energies(data_path: Path) -> dict[str, float]:
     )
     ref_energies = {}

-    for row in df.iterrows():
-        label = row[1][0]
-        ref_energies[label] = float(row[1][2]) * KCAL_TO_EV
+    # Perf opt: Replace `iterrows` with `itertuples` to avoid Series overhead
+    for row in df.itertuples(index=False, name=None):
+        label = row[0]
+        ref_energies[label] = float(row[2]) * KCAL_TO_EV

     return ref_energies

7 changes: 4 additions & 3 deletions ml_peg/calcs/conformers/solvMPCONF196/calc_solvMPCONF196.py
@@ -83,9 +83,10 @@ def get_ref_energies(data_path: Path) -> dict[str, float]:
     )
     ref_energies = {}

-    for row in df.iterrows():
-        label = row[1][0]
-        e_ref = float(row[1][1]) * units.Hartree
+    # Perf opt: Replace `iterrows` with `itertuples` to avoid Series overhead
+    for row in df.itertuples(index=False, name=None):
+        label = row[0]
+        e_ref = float(row[1]) * units.Hartree
         ref_energies[label] = e_ref

     return ref_energies
9 changes: 5 additions & 4 deletions ml_peg/calcs/utils/gscdb138.py
@@ -105,11 +105,12 @@ def run_gscdb138(
     df_refs["Reference"] *= units.Hartree

     # Calculate relative energy for each entry.
-    for _, row in tqdm(df_refs.iterrows(), dataset, total=df_refs.shape[0]):
+    # Perf opt: Replace `iterrows` with `itertuples` to avoid Series overhead
+    for row in tqdm(df_refs.itertuples(), dataset, total=df_refs.shape[0]):
         atoms_list = []
-        identifier = row["Reaction"]
-        reactions = row["Stoichiometry"].split(",")  # Parse stoichiometry string.
-        e_rel_ref = row["Reference"]
+        identifier = row.Reaction
+        reactions = row.Stoichiometry.split(",")  # Parse stoichiometry string.
+        e_rel_ref = row.Reference
         num_species = len(reactions) // 2  # Each species has coefficient and name.

         e_rel_model = 0
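
One caveat with the attribute access used in this hunk: `itertuples()` builds namedtuples, so column labels that are not valid Python identifiers are renamed to positional names. The columns read here (`Reaction`, `Stoichiometry`, `Reference`) are unaffected. A minimal sketch with a hypothetical `Ref energy` column, which is an assumption for illustration and not part of this PR, shows the behaviour:

```python
import pandas as pd

# Hypothetical frame: "Ref energy" is an assumed column name, used only to
# illustrate the renaming; it does not appear in the PR.
df = pd.DataFrame({"Reaction": ["r1"], "Ref energy": [0.5]})

row = next(df.itertuples(index=False))
fields = row._fields        # the invalid label is replaced by a positional name
value_by_position = row[1]  # positional access still works

records = df.to_dict("records")
value_by_key = records[0]["Ref energy"]  # dict rows keep the original label
```

If a frame can carry such labels, positional tuples or `to_dict("records")` avoid depending on the renamed attribute.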