| 1 | +Unified HDF5 Torch Cache Schema |
| 2 | +=============================== |
| 3 | + |
| 4 | +This page documents the versioned on-disk cache schema used by |
| 5 | +``HDF5StructureDataset.build_database(..., persist_features=...,`` |
| 6 | +``persist_force_derivatives=...)``. |
| 7 | + |
| 8 | +The user-facing training and dataset guides describe when to enable these |
| 9 | +cache sections and how they interact with ``cache_features=True`` at runtime. |
| 10 | +This page focuses on the on-disk schema, metadata contract, and compatibility |
| 11 | +rules behind that workflow. |
| 12 | + |
| 13 | +Scope |
| 14 | +----- |
| 15 | + |
| 16 | +Schema version 2 introduces a unified ``/torch_cache`` container for optional |
| 17 | +persisted payload sections: |
| 18 | + |
| 19 | +- raw unnormalized descriptor features |
| 20 | +- sparse local derivative payloads for force-labeled structures |
| 21 | + |
| 22 | +New cache-writing builds use schema version 2 whenever either optional |
| 23 | +payload is requested. Legacy derivative-only schema version 1 files stored |
| 24 | +under ``/force_derivatives`` remain readable. |
| 25 | + |
| 26 | +Compatibility Contract |
| 27 | +---------------------- |
| 28 | + |
| 29 | +Persisted cache compatibility is keyed to the descriptor settings that change |
| 30 | +the raw geometry-dependent payloads: |
| 31 | + |
| 32 | +- descriptor class |
| 33 | +- species order |
| 34 | +- radial order and cutoff |
| 35 | +- angular order and cutoff |
| 36 | +- minimum cutoff |
| 37 | +- whether multi-species/typespin weighting is active |
| 38 | + |
| 39 | +Storage dtype is recorded as metadata, but it is not part of the compatibility |
| 40 | +signature. A cache may therefore be written in one floating-point dtype and |
| 41 | +loaded through another compatible descriptor dtype, with values cast on load. |
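A minimal sketch of how such a compatibility signature could be derived, using only the standard library. The field names in ``settings`` and the canonicalization rules (sorted keys, compact separators, UTF-8) are assumptions for illustration; the library defines the actual keys and serialization. Note that no storage dtype appears in the mapping, mirroring the rule above.

```python
import hashlib
import json


def compat_signature(settings: dict) -> tuple:
    """Return (canonical_json, sha256_hex) for a settings mapping."""
    # Sorted keys and fixed separators make the serialization deterministic,
    # so equal settings always hash to the same digest.
    payload = json.dumps(settings, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return payload, digest


# Hypothetical field names and values, chosen only for this example.
settings = {
    "descriptor_class": "ExampleDescriptor",
    "species": ["Ti", "O"],
    "rad_order": 10,
    "rad_cutoff": 4.0,
    "ang_order": 3,
    "ang_cutoff": 1.5,
    "min_cutoff": 0.55,
    "multi": True,
}

_, sig_a = compat_signature(settings)

# Insertion order of the keys does not affect the signature.
_, sig_b = compat_signature(dict(reversed(list(settings.items()))))

# Changing a geometry-relevant setting produces a different signature.
_, sig_c = compat_signature({**settings, "rad_cutoff": 5.0})
```

Here ``sig_a`` and ``sig_b`` are equal, while ``sig_c`` differs, which is the behavior a compatibility check relies on.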
| 42 | + |
| 43 | +Schema Version 2 Layout |
| 44 | +----------------------- |
| 45 | + |
| 46 | +The root group is ``/torch_cache``. |
| 47 | + |
| 48 | +Root attributes: |
| 49 | + |
| 50 | +- ``schema_version``: integer schema version, currently ``2`` |
| 51 | +- ``cache_format``: format identifier string, |
| 52 | + ``"aenet.torch_training.cache.v2"`` |
| 53 | +- ``descriptor_compat_json``: canonical JSON serialization of the |
| 54 | + compatibility-relevant descriptor settings |
| 55 | +- ``descriptor_compat_sha256``: SHA-256 hash of that JSON payload |
| 56 | +- ``storage_dtype``: floating-point dtype used for stored arrays |
| 57 | +- ``contains_features``: whether the ``/torch_cache/features`` section exists |
| 58 | +- ``contains_force_derivatives``: whether the |
| 59 | + ``/torch_cache/force_derivatives`` section exists |
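The root attributes allow a reader to validate a cache before touching any payload. The sketch below is one way to do that, not the library's own loader: ``check_root_attrs`` is a hypothetical helper, and it assumes the hash is a hex digest taken over the UTF-8 bytes of the stored JSON string. It accepts any mapping, so a plain ``dict`` stands in for the HDF5 attribute set here.

```python
import hashlib
import json


def check_root_attrs(attrs) -> dict:
    """Validate cache root attributes and return the decoded settings."""
    if int(attrs["schema_version"]) != 2:
        raise ValueError(f"unsupported schema_version: {attrs['schema_version']}")

    payload = attrs["descriptor_compat_json"]
    expected = attrs["descriptor_compat_sha256"]
    # String attributes may be returned as bytes depending on how they were stored.
    if isinstance(payload, bytes):
        payload = payload.decode("utf-8")
    if isinstance(expected, bytes):
        expected = expected.decode("utf-8")

    # Assumption: the stored hash is the hex SHA-256 of the UTF-8 JSON text.
    actual = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    if actual != expected:
        raise ValueError("descriptor_compat_sha256 does not match descriptor_compat_json")
    return json.loads(payload)


# Stand-in for the attribute mapping of the /torch_cache group.
compat_json = json.dumps({"multi": False, "species": ["H", "O"]}, sort_keys=True)
root_attrs = {
    "schema_version": 2,
    "cache_format": "aenet.torch_training.cache.v2",
    "descriptor_compat_json": compat_json,
    "descriptor_compat_sha256": hashlib.sha256(compat_json.encode("utf-8")).hexdigest(),
    "contains_features": True,
    "contains_force_derivatives": False,
}

decoded = check_root_attrs(root_attrs)
```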
| 60 | + |
| 61 | +Feature Section |
| 62 | +--------------- |
| 63 | + |
| 64 | +Feature payloads live under ``/torch_cache/features``. |
| 65 | + |
| 66 | +Nodes: |
| 67 | + |
| 68 | +- ``/torch_cache/features/index`` |
| 69 | +- ``/torch_cache/features/values`` |
| 70 | + |
| 71 | +Index columns: |
| 72 | + |
| 73 | +- ``entry_idx``: dataset entry index in ``/entries/structures`` |
| 74 | +- ``cache_row``: row number used by ``values`` |
| 75 | +- ``n_atoms``: atom count for the structure |
| 76 | +- ``n_features``: raw feature width ``F`` |
| 77 | + |
| 78 | +Payload semantics: |
| 79 | + |
| 80 | +- one flattened raw ``(N, F)`` tensor per cached entry in ``values``, with ``N = n_atoms`` and ``F = n_features`` |
| 81 | +- features are stored pre-normalization |
| 82 | +- load-time helpers reshape back to ``(N, F)`` and cast to the active |
| 83 | + descriptor dtype |
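The reshape step can be illustrated without any array library. This is a dependency-free sketch that assumes row-major (C-order) flattening, which the page does not state explicitly; ``reshape_features`` is a hypothetical name, and ``float()`` stands in for the cast to the active descriptor dtype.

```python
def reshape_features(flat, n_atoms, n_features):
    """Split a flattened (N, F) payload into N rows of F values."""
    if len(flat) != n_atoms * n_features:
        raise ValueError(
            f"expected {n_atoms * n_features} values, got {len(flat)}"
        )
    # Assumption: values were flattened row by row (one atom after another).
    return [
        [float(v) for v in flat[i * n_features:(i + 1) * n_features]]
        for i in range(n_atoms)
    ]


# Two atoms with three raw features each, as one stored row of ``values``.
flat_row = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
features = reshape_features(flat_row, n_atoms=2, n_features=3)
# features == [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
```

The length check uses the ``n_atoms`` and ``n_features`` index columns, so a mismatched row is rejected instead of being silently misaligned.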
| 84 | + |
| 85 | +Force-Derivative Section |
| 86 | +------------------------ |
| 87 | + |
| 88 | +Derivative payloads live under ``/torch_cache/force_derivatives``. |
| 89 | + |
| 90 | +Section attributes: |
| 91 | + |
| 92 | +- ``schema_version``: derivative payload schema version, currently ``1`` |
| 93 | +- ``payload_format``: format identifier string, |
| 94 | + ``"aenet.torch_training.local_derivatives.v1"`` |
| 95 | +- ``descriptor_compat_json`` |
| 96 | +- ``descriptor_compat_sha256`` |
| 97 | +- ``storage_dtype`` |
| 98 | +- ``n_radial_features`` |
| 99 | +- ``n_angular_features`` |
| 100 | +- ``multi`` |
| 101 | +- ``contains_features``: currently ``False`` within the derivative subsection |
| 102 | +- ``contains_positions``: currently ``False`` |
| 103 | + |
| 104 | +Index table: |
| 105 | + |
| 106 | +- ``/torch_cache/force_derivatives/index`` |
| 107 | +- one row per cached force-labeled structure |
| 108 | +- columns: |
| 109 | + - ``entry_idx`` |
| 110 | + - ``cache_row`` |
| 111 | + - ``n_atoms`` |
| 112 | + - ``n_radial_edges`` |
| 113 | + - ``n_angular_triplets`` |
| 114 | + |
| 115 | +Radial payload nodes: |
| 116 | + |
| 117 | +- ``/torch_cache/force_derivatives/radial/center_idx`` |
| 118 | +- ``/torch_cache/force_derivatives/radial/neighbor_idx`` |
| 119 | +- ``/torch_cache/force_derivatives/radial/dG_drij`` |
| 120 | +- ``/torch_cache/force_derivatives/radial/neighbor_typespin`` |
| 121 | + |
| 122 | +Angular payload nodes: |
| 123 | + |
| 124 | +- ``/torch_cache/force_derivatives/angular/center_idx`` |
| 125 | +- ``/torch_cache/force_derivatives/angular/neighbor_j_idx`` |
| 126 | +- ``/torch_cache/force_derivatives/angular/neighbor_k_idx`` |
| 127 | +- ``/torch_cache/force_derivatives/angular/grads_i`` |
| 128 | +- ``/torch_cache/force_derivatives/angular/grads_j`` |
| 129 | +- ``/torch_cache/force_derivatives/angular/grads_k`` |
| 130 | +- ``/torch_cache/force_derivatives/angular/triplet_typespin`` |
| 131 | + |
| 132 | +The logical tensor shapes are unchanged from the original derivative cache |
| 133 | +design. The v2 schema only relocates the derivative section under the shared |
| 134 | +cache root. |
| 135 | + |
| 136 | +Loading Semantics |
| 137 | +----------------- |
| 138 | + |
| 139 | +The persistence layer exposes the cache through explicit dataset helpers: |
| 140 | + |
| 141 | +- ``has_persisted_features()`` |
| 142 | +- ``get_persisted_feature_cache_info()`` |
| 143 | +- ``load_persisted_features(idx)`` |
| 144 | +- ``has_persisted_force_derivatives()`` |
| 145 | +- ``get_force_derivative_cache_info()`` |
| 146 | +- ``load_persisted_force_derivatives(idx)`` |
| 147 | + |
| 148 | +Runtime sample materialization now uses the persisted cache lazily when the |
| 149 | +payload is present and descriptor-compatible: |
| 150 | + |
| 151 | +- energy-view materialization checks the trainer-owned runtime |
| 152 | + ``cache_features=True`` cache first, then falls back to persisted HDF5 |
| 153 | + features, then finally recomputes features on demand |
| 154 | +- force-view materialization reuses persisted raw features when available |
| 155 | +- when both persisted raw features and persisted local derivatives are |
| 156 | + available for a force-supervised entry, ``HDF5StructureDataset`` can serve |
| 157 | + the force sample without rebuilding graph/triplet payloads |
| 158 | + |
| 159 | +This keeps feature normalization as a runtime training concern and preserves |
| 160 | +on-the-fly fallback behavior when a persisted section is absent. |
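The energy-view lookup order described above can be sketched as a three-step fallback. All names here are hypothetical stand-ins: ``runtime_cache`` plays the role of the trainer-owned ``cache_features=True`` cache, ``load_persisted`` the role of a persisted-feature loader that returns ``None`` when no compatible payload exists, and ``recompute`` the on-demand featurization.

```python
from typing import Callable, Optional, Sequence

Features = Sequence[Sequence[float]]


def materialize_energy_features(
    idx: int,
    runtime_cache: dict,
    load_persisted: Callable[[int], Optional[Features]],
    recompute: Callable[[int], Features],
):
    """Return (features, source) following the documented lookup order."""
    # 1. The in-memory runtime cache wins when it already holds the entry.
    cached = runtime_cache.get(idx)
    if cached is not None:
        return cached, "runtime"
    # 2. Otherwise try the persisted HDF5 section.
    persisted = load_persisted(idx)
    if persisted is not None:
        return persisted, "persisted"
    # 3. Finally fall back to computing the features on demand.
    return recompute(idx), "recomputed"


# Toy stand-ins: entry 0 is in the runtime cache, entry 1 only on disk.
runtime = {0: [[1.0, 2.0]]}
on_disk = {1: [[3.0, 4.0]]}

source_0 = materialize_energy_features(0, runtime, on_disk.get, lambda i: [[0.0, 0.0]])[1]
source_1 = materialize_energy_features(1, runtime, on_disk.get, lambda i: [[0.0, 0.0]])[1]
source_2 = materialize_energy_features(2, runtime, on_disk.get, lambda i: [[0.0, 0.0]])[1]
```

Because every step returns raw features, normalization can still be applied afterwards regardless of which source served the entry.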
| 161 | + |
| 162 | +Legacy Version 1 Compatibility |
| 163 | +------------------------------ |
| 164 | + |
| 165 | +Legacy derivative-only files with a root ``/force_derivatives`` group remain |
| 166 | +supported for read access. |
| 167 | + |
| 168 | +Version 1 characteristics: |
| 169 | + |
| 170 | +- derivative-only layout |
| 171 | +- ``schema_version = 1`` |
| 172 | +- no unified ``/torch_cache`` root |
| 173 | +- no persisted raw feature section |
| 174 | + |
| 175 | +New builds do not write schema version 1. They standardize on schema version |
| 176 | +2 whenever persisted cache payloads are requested. |
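A reader that supports both layouts needs to decide which one a file uses. The sketch below bases that decision on which root group is present; ``detect_cache_schema`` is a hypothetical helper, and preferring ``/torch_cache`` when both groups exist is an assumption. It accepts any container that supports ``in``, which an open ``h5py.File`` also does for root-level names.

```python
from typing import Container, Optional


def detect_cache_schema(root: Container[str]) -> Optional[int]:
    """Return 2, 1, or None depending on which cache root group exists."""
    # Assumption: the unified v2 root takes precedence if both are present.
    if "torch_cache" in root:
        return 2
    # Legacy derivative-only files keep their payload directly under the root.
    if "force_derivatives" in root:
        return 1
    # No persisted cache payload at all.
    return None


# Sets of root group names stand in for open HDF5 files.
v2_file = {"entries", "torch_cache", "descriptor_manifest"}
v1_file = {"entries", "force_derivatives"}
plain_file = {"entries"}

versions = [detect_cache_schema(f) for f in (v2_file, v1_file, plain_file)]
# versions == [2, 1, None]
```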
| 177 | + |
| 178 | +Related Descriptor Manifest |
| 179 | +--------------------------- |
| 180 | + |
| 181 | +When ``persist_descriptor=True`` is requested explicitly, or implicitly via |
| 182 | +``persist_features=True`` or ``persist_force_derivatives=True``, the HDF5 file |
| 183 | +also stores a versioned descriptor manifest under ``/descriptor_manifest``. |
| 184 | + |
| 185 | +That manifest remains distinct from the cache payload schema and exists only |
| 186 | +to reconstruct supported descriptor objects safely when a dataset is reopened. |