Skip to content

_deform_mesh: bump _mesh_version + clear runner nav caches#191

Open
lmoresi wants to merge 3 commits into
developmentfrom
bugfix/deform-cache-invalidation
Open

_deform_mesh: bump _mesh_version + clear runner nav caches#191
lmoresi wants to merge 3 commits into
developmentfrom
bugfix/deform-cache-invalidation

Conversation

@lmoresi
Copy link
Copy Markdown
Member

@lmoresi lmoresi commented May 15, 2026

Summary

Two cache-invalidation gaps in Mesh._deform_mesh, both triggered by direct _deform_mesh(coords) calls (which bypass the mesh.X.coords NDArray callback that normally performs this hygiene — e.g. every free-surface RK stage in the convection benchmark):

  • _mesh_version not bumped. PR Consolidate and unify cached spatial indexing (KDTree) #182 introduced version-keyed kdtree navigation caches (_BaseMeshVariable._get_kdtree, Mesh._get_domain_kdtree) gated on _mesh_version. The mesh.X.coords callback bumps it; direct _deform_mesh() did not — so navigation kdtrees stayed frozen on the undeformed mesh. PR Invalidate evaluate/DMInterp/topology caches on Mesh._deform_mesh #188 added _topology_version invalidation here but missed _mesh_version.
  • Runner coord-identity nav caches not cleared. A runner's restore_points_to_domain caches a kdtree keyed on id(mesh.X.coords). _deform_mesh replaces self._coords, but CPython reuses freed ids → a fresh array can collide with the old id() and the staleness check false-negatives. Explicitly clearing _restore_kdt / _restore_coords_id defeats the id()-reuse hazard.

Brings _deform_mesh into line with the cache hygiene mesh.adapt() and _legacy_access already perform.

Test plan

  • _mesh_version increments on a direct _deform_mesh() call (was frozen at 0 before the fix)
  • Existing mesh/smoother test suites pass
  • Parallel smoke (np=2) of a deforming-mesh case

Note: this is an independent correctness fix. It does not by itself resolve the separate free-surface convection feedback regression under investigation (verified — both fixes present in a clean build still reproduce the damped regime).

Underworld development team with AI support from Claude Code

Two cache-invalidation gaps in Mesh._deform_mesh, both exposed by
direct _deform_mesh(coords) calls (e.g. every free-surface RK
stage in the convection benchmark), which bypass the
mesh.X.coords NDArray callback that normally performs this
hygiene:

1. _mesh_version was not incremented. PR #182 introduced
   version-keyed kdtree navigation caches
   (_BaseMeshVariable._get_kdtree, Mesh._get_domain_kdtree) that
   gate their rebuild on _mesh_version. The mesh.X.coords callback
   bumps it; direct _deform_mesh() did not. Result: navigation
   kdtrees stay frozen on the undeformed mesh, so spatial lookups
   return pre-deform DOFs after the geometry has moved. PR #188
   added _topology_version invalidation here but missed
   _mesh_version.

2. User-installed coord-identity nav caches were not cleared. A
   runner's restore_points_to_domain typically caches a kdtree
   keyed on id(mesh.X.coords). _deform_mesh replaces self._coords
   with a new object, but CPython reuses freed ids, so a fresh
   coords array can collide with the old id() and the staleness
   check false-negatives. Explicitly clearing _restore_kdt /
   _restore_coords_id defeats the id()-reuse hazard.

Verified: _mesh_version now increments on a direct
_deform_mesh() call (was frozen at 0 before). Matches the cache
hygiene mesh.adapt() and _legacy_access already perform; brings
_deform_mesh into line. Independent correctness fix; does not by
itself resolve the separate free-surface convection feedback
regression under investigation.

Underworld development team with AI support from Claude Code
Copilot AI review requested due to automatic review settings May 15, 2026 22:36
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR closes two cache invalidation gaps in Mesh._deform_mesh() that occur when _deform_mesh(coords) is called directly (bypassing the mesh.X.coords NDArray callback), ensuring version-gated navigation caches and runner-installed navigation helpers don’t remain stale after mesh deformation.

Changes:

  • Increment self._mesh_version inside _deform_mesh() so version-keyed KDTree navigation caches rebuild on geometry updates.
  • Clear runner/user-installed navigation cache attributes (_restore_kdt, _restore_coords_id) to avoid stale reuse due to CPython id() reuse.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1847 to +1860
# Bump the geometry-version counter so version-keyed
# kdtree navigation caches rebuild against the new DOF
# positions: _BaseMeshVariable._get_kdtree and
# Mesh._get_domain_kdtree both gate their rebuild on
# `_mesh_version`. PR #182 introduced those version-keyed
# caches; the mesh.X.coords callback path bumps
# _mesh_version, but direct _deform_mesh() calls (every
# free-surface RK stage) bypass that callback. Without
# this bump the navigation kdtrees stay frozen on the
# undeformed mesh — back-advected SL samples land at the
# wrong DOFs, corrupting the temperature field. PR #188
# added _topology_version invalidation but missed this.
self._mesh_version += 1
# Also nuke any *user-installed* navigation caches that
lmoresi added 2 commits May 25, 2026 12:16
test_lambdify_caching and test_rbf_false_not_slow asserted one-off
wall-clock timings (time2 <= time1*2, elapsed < 0.01s). These are
inherently flaky on shared CI runners and test_lambdify_caching has
been failing CI on unrelated PRs (0.0107s vs 0.0049s runner jitter).

Replaced with deterministic, timing-free behaviour tests:

- test_lambdify_cache_hit: exercises get_cached_lambdified directly --
  identical request returns the SAME function object and adds no entry
  (true cache hit); a distinct expression is cached separately. This is
  the cache mechanism's actual contract.
- test_rbf_modes_consistent: rbf=True and rbf=False must agree for a
  pure-sympy expression (meaningful, timing-free; replaces the
  "rbf=False not slow" wall-clock check).

Note: writing the cache-hit test surfaced that the high-level
uw.function.evaluate() path does NOT hit the lambdify cache on repeated
identical calls -- _expr_hash(srepr) differs every call, so the cache
grows by one entry per call and never returns a hit. The old loose
timing tolerance was masking this. Not fixed here (evaluate/lambdify is
a performance-critical hot path needing separate benchmarking); see PR
description.

Full module: 19 passed.

Underworld development team with AI support from Claude Code
_expr_hash used sympy.srepr(expr), which embeds the volatile global
dummy_index of any sympy.Dummy. The evaluate() coordinate-substitution
path mints a fresh Dummy per call, so an otherwise-identical expression
hashed differently every call and the lambdify cache never matched --
_lambdify_cache grew one entry per call (1,2,3,4,...) and sympy.lambdify
recompiled every time on a hot path.

Fix: canonicalise Dummy -> name-stable Symbol in _expr_hash before
srepr. This changes only the cache *key*; the real sympy.lambdify()
call still uses the original expr/symbols, so numerics are unchanged.
The cache key separately carries the symbol-name tuple, so name-keying
is safe and deterministic.

Verified: repeated identical evaluate() now holds cache size flat
([1,1,1,...] vs [1,2,3,...] before). test_0720 module 19 passed;
test_0501_integrals 9 passed / 3 pre-existing xfail (unrelated
CellWiseIntegral #172/#174).

test_evaluate_cache_stable_across_calls added as the #194 regression
guard (aggregate cache behaviour + result-consistency, no wall-clock).

Underworld development team with AI support from Claude Code
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants