Commit b964380

torch_training: unify HDF5 cache reuse and trainer runtime policy
Add persisted raw-feature support to HDF5 torch-training datasets, recover
descriptors from stored manifests, and consolidate dataset materialization
through shared helpers across the in-memory, HDF5, and cached paths. This
change:

- adds a versioned /torch_cache schema with optional feature and derivative
  payloads while preserving legacy derivative-cache compatibility
- defines descriptor manifests for HDF5 build/load recovery and
  compatibility checks
- loads persisted features and derivatives lazily with explicit runtime
  precedence relative to trainer-owned caches
- moves runtime force-sampling and cache policy ownership to the trainer
  side via passive datasets plus split-local policy wrappers
- keeps random force resampling correct with multi-worker loading by
  disabling persistent training workers when epoch-level resampling is active
- expands regression coverage for round trips, dtype/compatibility behavior,
  collate/materialization parity, trainer equivalence, and docs-backed usage
- updates user/developer docs and notebook examples for the unified workflow
1 parent 621c9dc commit b964380

19 files changed, +5123 −926 lines changed

Lines changed: 186 additions & 0 deletions
@@ -0,0 +1,186 @@
Unified HDF5 Torch Cache Schema
===============================

This page documents the versioned on-disk cache schema used by
``HDF5StructureDataset.build_database(..., persist_features=...,``
``persist_force_derivatives=...)``.

The user-facing training and dataset guides describe when to enable these
cache sections and how they interact with ``cache_features=True`` at runtime.
This page focuses on the on-disk schema, the metadata contract, and the
compatibility rules behind that workflow.

Scope
-----

Schema version 2 introduces a unified ``/torch_cache`` container for optional
persisted payload sections:

- raw unnormalized descriptor features
- sparse local derivative payloads for force-labeled structures

New cache-writing builds use schema version 2 whenever either optional
payload is requested. Legacy derivative-only schema version 1 files stored
under ``/force_derivatives`` remain readable.

Compatibility Contract
----------------------

Persisted cache compatibility is keyed to the descriptor settings that change
the raw geometry-dependent payloads:

- descriptor class
- species order
- radial order and cutoff
- angular order and cutoff
- minimum cutoff
- whether multi-species/typespin weighting is active

The storage dtype is recorded as metadata, but it is not part of the
compatibility signature. A cache may therefore be written in one
floating-point dtype and loaded through another compatible descriptor dtype,
with values cast on load.
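A canonical-serialization sketch of how such a compatibility signature pair
(JSON text plus its SHA-256 digest) can be produced. The setting keys below
are illustrative only; the actual manifest keys and descriptor class name may
differ:

```python
import hashlib
import json

def descriptor_compat_signature(settings: dict) -> tuple[str, str]:
    """Return (canonical JSON, SHA-256 hex digest) for descriptor settings.

    Illustrative sketch: ``settings`` holds only the compatibility-relevant
    fields; the storage dtype is deliberately excluded per the contract above.
    """
    # sort_keys plus fixed separators make the serialization canonical, so
    # logically identical settings always hash to the same digest
    compat_json = json.dumps(settings, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(compat_json.encode("utf-8")).hexdigest()
    return compat_json, digest

# hypothetical settings dict, not the exact manifest keys
settings = {
    "descriptor_class": "ChebyshevDescriptor",
    "species": ["O", "H"],
    "radial_order": 8, "radial_cutoff": 6.5,
    "angular_order": 4, "angular_cutoff": 4.0,
    "min_cutoff": 0.55,
    "multi": True,
}
compat_json, sha = descriptor_compat_signature(settings)
```

Because the serialization is canonical, key insertion order does not affect
the hash, so a rebuilt settings dict still matches the stored signature.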

Schema Version 2 Layout
-----------------------

The root group is ``/torch_cache``.

Root attributes:

- ``schema_version``: integer schema version, currently ``2``
- ``cache_format``: format identifier string,
  ``"aenet.torch_training.cache.v2"``
- ``descriptor_compat_json``: canonical JSON serialization of the
  compatibility-relevant descriptor settings
- ``descriptor_compat_sha256``: SHA-256 hash of that JSON payload
- ``storage_dtype``: floating-point dtype used for stored arrays
- ``contains_features``: whether the ``/torch_cache/features`` section exists
- ``contains_force_derivatives``: whether the
  ``/torch_cache/force_derivatives`` section exists
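A sketch of how a reader could inspect these root attributes with ``h5py``.
The helper name and return shape are illustrative, not part of the library
API; only the group and attribute names come from the layout above:

```python
import h5py

def check_torch_cache_root(path, expected_sha):
    """Validate the /torch_cache root attributes of a cache file.

    Returns None for files without a unified cache root (e.g. legacy
    schema version 1); raises on version or descriptor mismatch.
    """
    with h5py.File(path, "r") as f:
        if "torch_cache" not in f:
            return None  # legacy /force_derivatives-only file, or no cache
        root = f["torch_cache"]
        if int(root.attrs["schema_version"]) != 2:
            raise ValueError("unsupported cache schema version")
        if root.attrs["descriptor_compat_sha256"] != expected_sha:
            raise ValueError("descriptor settings incompatible with cache")
        # report which optional payload sections are present
        return {
            "features": bool(root.attrs["contains_features"]),
            "force_derivatives": bool(root.attrs["contains_force_derivatives"]),
            "storage_dtype": str(root.attrs["storage_dtype"]),
        }
```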

Feature Section
---------------

Feature payloads live under ``/torch_cache/features``.

Nodes:

- ``/torch_cache/features/index``
- ``/torch_cache/features/values``

Index columns:

- ``entry_idx``: dataset entry index in ``/entries/structures``
- ``cache_row``: row number used by ``values``
- ``n_atoms``: atom count for the structure
- ``n_features``: raw feature width ``F``

Payload semantics:

- one flattened raw ``(N, F)`` tensor per cached entry in ``values``
- features are stored pre-normalization
- load-time helpers reshape back to ``(N, F)`` and cast to the active
  descriptor dtype
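A minimal illustration of the reshape-and-cast semantics. It assumes
``values`` stores one flat row of ``N * F`` floats per cached entry; that
exact storage layout, and the helper itself, are assumptions for the sketch,
while the index column names follow the table above:

```python
import h5py
import numpy as np

def load_raw_features(path, entry_idx, out_dtype=np.float64):
    """Recover the raw (N, F) feature tensor for one dataset entry."""
    with h5py.File(path, "r") as f:
        index = f["torch_cache/features/index"][...]
        # look up the index row for this dataset entry
        row = index[index["entry_idx"] == entry_idx][0]
        n, width = int(row["n_atoms"]), int(row["n_features"])
        flat = f["torch_cache/features/values"][int(row["cache_row"])]
        # values are stored flattened and pre-normalization;
        # reshape to (N, F) and cast to the active descriptor dtype
        return np.asarray(flat[: n * width], dtype=out_dtype).reshape(n, width)
```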

Force-Derivative Section
------------------------

Derivative payloads live under ``/torch_cache/force_derivatives``.

Section attributes:

- ``schema_version``: derivative payload schema version, currently ``1``
- ``payload_format``: format identifier string,
  ``"aenet.torch_training.local_derivatives.v1"``
- ``descriptor_compat_json``
- ``descriptor_compat_sha256``
- ``storage_dtype``
- ``n_radial_features``
- ``n_angular_features``
- ``multi``
- ``contains_features``: currently ``False`` within the derivative subsection
- ``contains_positions``: currently ``False``

Index table:

- ``/torch_cache/force_derivatives/index``
- one row per cached force-labeled structure
- columns:

  - ``entry_idx``
  - ``cache_row``
  - ``n_atoms``
  - ``n_radial_edges``
  - ``n_angular_triplets``

Radial payload nodes:

- ``/torch_cache/force_derivatives/radial/center_idx``
- ``/torch_cache/force_derivatives/radial/neighbor_idx``
- ``/torch_cache/force_derivatives/radial/dG_drij``
- ``/torch_cache/force_derivatives/radial/neighbor_typespin``

Angular payload nodes:

- ``/torch_cache/force_derivatives/angular/center_idx``
- ``/torch_cache/force_derivatives/angular/neighbor_j_idx``
- ``/torch_cache/force_derivatives/angular/neighbor_k_idx``
- ``/torch_cache/force_derivatives/angular/grads_i``
- ``/torch_cache/force_derivatives/angular/grads_j``
- ``/torch_cache/force_derivatives/angular/grads_k``
- ``/torch_cache/force_derivatives/angular/triplet_typespin``

The logical tensor shapes are unchanged from the original derivative cache
design. The v2 schema only relocates the derivative section under the shared
cache root.
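The sparse radial arrays are concatenated across structures. One way to slice
out a single structure's edges, assuming edges are stored contiguously in
index order so offsets follow from a cumulative sum of ``n_radial_edges``
(an assumption for illustration, not a documented guarantee):

```python
import numpy as np

def radial_edge_slice(index_rows, entry_idx):
    """Locate one structure's radial-edge span in the concatenated arrays.

    ``index_rows`` stands for the rows of the documented index table;
    the returned slice applies to center_idx, neighbor_idx, dG_drij, etc.
    """
    counts = np.asarray([r["n_radial_edges"] for r in index_rows], dtype=np.int64)
    # prefix sums give the start offset of each structure's edge block
    offsets = np.concatenate(([0], np.cumsum(counts)))
    for i, row in enumerate(index_rows):
        if row["entry_idx"] == entry_idx:
            return slice(int(offsets[i]), int(offsets[i + 1]))
    raise KeyError(entry_idx)
```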

Loading Semantics
-----------------

The persistence layer exposes the cache through explicit dataset helpers:

- ``has_persisted_features()``
- ``get_persisted_feature_cache_info()``
- ``load_persisted_features(idx)``
- ``has_persisted_force_derivatives()``
- ``get_force_derivative_cache_info()``
- ``load_persisted_force_derivatives(idx)``

Runtime sample materialization now uses the persisted cache lazily when the
payload is present and descriptor-compatible:

- energy-view materialization checks the trainer-owned runtime
  ``cache_features=True`` cache first, then falls back to persisted HDF5
  features, and finally recomputes features on demand
- force-view materialization reuses persisted raw features when available
- when both persisted raw features and persisted local derivatives are
  available for a force-supervised entry, ``HDF5StructureDataset`` can serve
  the force sample without rebuilding graph/triplet payloads

This keeps feature normalization a runtime training concern and preserves
on-the-fly fallback behavior when a persisted section is absent.
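The energy-view precedence can be sketched in a few lines. Only
``has_persisted_features`` and ``load_persisted_features`` come from the
helper list above; ``compute_features`` and the ``runtime_cache`` mapping are
hypothetical stand-ins for the trainer-owned pieces:

```python
def materialize_energy_features(dataset, idx, runtime_cache):
    """Three-tier precedence for energy-view feature materialization."""
    # 1. trainer-owned runtime cache (cache_features=True) wins
    if idx in runtime_cache:
        return runtime_cache[idx]
    # 2. persisted HDF5 features, if present and descriptor-compatible
    if dataset.has_persisted_features():
        return dataset.load_persisted_features(idx)
    # 3. on-the-fly fallback: recompute from the stored structure
    return dataset.compute_features(idx)
```

The point of the ordering is that persisted features never shadow what the
trainer has already cached, and their absence degrades gracefully to
recomputation.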

Legacy Version 1 Compatibility
------------------------------

Legacy derivative-only files with a root ``/force_derivatives`` group remain
supported for read access.

Version 1 characteristics:

- derivative-only layout
- ``schema_version = 1``
- no unified ``/torch_cache`` root
- no persisted raw feature section

New builds do not write schema version 1. They standardize on schema version
2 whenever persisted cache payloads are requested.

Related Descriptor Manifest
---------------------------

When ``persist_descriptor=True`` is requested explicitly, or implicitly via
``persist_features=True`` or ``persist_force_derivatives=True``, the HDF5 file
also stores a versioned descriptor manifest under ``/descriptor_manifest``.

That manifest remains distinct from the cache payload schema and exists only
to reconstruct supported descriptor objects safely when a dataset is reopened.

docs/source/index.rst

Lines changed: 1 addition & 0 deletions

@@ -125,6 +125,7 @@ Developer Documentation
    dev/commandline
    dev/docs_examples
    dev/analytical_gradients
+   dev/torch_force_hdf5_cache

 API Reference
 -------------