
Add AlchemicalArchive #687

Merged
atravitz merged 36 commits into OpenFreeEnergy:main from ianmkenney:feat/AlchemicalArchive
Feb 6, 2026

Conversation

@ianmkenney
Member

This PR introduces the AlchemicalArchive for serializing an AlchemicalNetwork along with its transformation results. Closes #323.

@codecov

codecov Bot commented Dec 3, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.82%. Comparing base (25c818a) to head (fe1033c).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #687      +/-   ##
==========================================
+ Coverage   98.79%   98.82%   +0.02%     
==========================================
  Files          40       41       +1     
  Lines        2498     2555      +57     
==========================================
+ Hits         2468     2525      +57     
  Misses         30       30              

☔ View full report in Codecov by Sentry.

@ianmkenney ianmkenney force-pushed the feat/AlchemicalArchive branch from d3f33c9 to fe155bf on December 3, 2025 14:55
@ianmkenney ianmkenney force-pushed the feat/AlchemicalArchive branch from fe155bf to 3b00b6f on December 3, 2025 15:33
@ianmkenney ianmkenney requested a review from atravitz December 8, 2025 17:10
On real data consisting of 264 ProtocolDAGResults, the dataclass
implementation was not scalable due to the lack of proper
deduplication. Serialized, the archive was 222 MiB (117 MiB when zst
compressed) and took nearly 2 minutes to produce. Using a
GufeTokenizable approach, this was reduced to 5 MiB (1 MiB compressed)
while only taking seconds to produce.
@ianmkenney
Member Author

Commit 1772aa6 replaces the use of dataclasses with subclassing GufeTokenizable. While the simplicity of dataclasses is appealing, the performance benefits of the GufeTokenizable subclass leave little room for debate.

Implementation | compression algorithm | size (MB) | to_json (ms) | from_json, string (ms)
-------------- | --------------------- | --------- | ------------ | ----------------------
dataclass      | None                  | 222       | 113000       | ~8500
dataclass      | zstandard             | 117       | --           | --
Tokenizable    | None                  | 5         | 0.0014       | 0.0013

If anyone has thoughts on this, please share!
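The deduplication win comes from content-addressing: each distinct sub-object is stored once under a hash token, and repeated occurrences become token references. Below is a minimal, stand-alone sketch of that idea; the names `token` and `dedup_results` are illustrative only, not gufe's actual API, and gufe's tokenization map differs in detail.

```python
import hashlib
import json

def token(obj) -> str:
    """Content-address an object by hashing its canonical JSON form."""
    blob = json.dumps(obj, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def dedup_results(results):
    """Store each distinct settings payload once in a registry and
    replace repeats with their token, analogous to a tokenization map."""
    registry = {}
    slim = []
    for r in results:
        t = token(r["settings"])
        registry.setdefault(t, r["settings"])  # first occurrence wins
        slim.append({"settings": t, "dG": r["dG"]})
    return {"registry": registry, "results": slim}

# 100 results sharing one identical, sizeable settings payload:
shared = {"forcefield": "openff-2.1.0", "positions": list(range(500))}
results = [{"settings": shared, "dG": i * 0.1} for i in range(100)]

flat = json.dumps(results)                  # payload duplicated 100 times
dedup = json.dumps(dedup_results(results))  # payload stored once
print(len(dedup) < len(flat))  # True
```

The same trade-off shows up here in miniature: a 64-character token only pays off when the shared payload is larger than the token, which is why dedup matters most for big repeated objects like ProtocolDAGResult inputs.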

@dotsdl dotsdl marked this pull request as ready for review January 20, 2026 17:22
@dotsdl dotsdl changed the title [WIP] Add AlchemicalArchive Add AlchemicalArchive Jan 20, 2026
@jthorton
Contributor

Great job @ianmkenney. Is there any substantial benefit to adding compression to the Tokenizable implementation as well? If these objects are intended for long-term storage, minimising the footprint at the cost of inspectability might be okay if there is a large difference. Or what about a msgpack option?

@ianmkenney
Member Author

ianmkenney commented Jan 21, 2026

@jthorton at least for the network I've tested, compressing with zstandard reduced the size to about 1 MB. MessagePack would probably be a good option. From what I see currently implemented, compression needs to be done manually. We would want to add a compress keyword argument to to_msgpack for ease of use.

edit: I think as protocols start producing and collecting more artifacts, compression will be much more valuable and should probably be the default.
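As a rough stdlib-only illustration of why compression-by-default pays off for repetitive serialized networks: in this sketch zlib stands in for zstandard and JSON for MessagePack, and the helper names `to_packed`/`from_packed` are hypothetical, not gufe's API.

```python
import json
import zlib

def to_packed(obj, compress: bool = True) -> bytes:
    """Serialize obj and optionally compress the payload.
    zlib stands in for zstandard, JSON for MessagePack."""
    raw = json.dumps(obj).encode()
    return zlib.compress(raw, level=9) if compress else raw

def from_packed(payload: bytes, compressed: bool = True):
    """Invert to_packed: optionally decompress, then deserialize."""
    raw = zlib.decompress(payload) if compressed else payload
    return json.loads(raw)

# A repetitive structure, like many similar transformation results:
network = {"edges": [{"dG": i * 0.01} for i in range(1000)]}

packed = to_packed(network)
assert from_packed(packed) == network  # round-trips losslessly
print(len(packed) < len(to_packed(network, compress=False)))  # True
```

Structured scientific payloads are usually highly redundant, so the compressed form is dramatically smaller; the main cost is that the bytes are no longer human-inspectable without a decompression step.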

@ianmkenney
Member Author

A quick test using MessagePack, with the new compression kwarg.

from gufe.archival import AlchemicalArchive
from gufe.compression import zst_compress, zst_decompress

from sys import getsizeof

archive = AlchemicalArchive.from_json(file="archive.json")

payload = archive.to_msgpack(compress=False)
print("msgpack, uncompressed (bytes):", getsizeof(payload))

payload = archive.to_msgpack(compress=True)
print("msgpack, compressed (bytes):", getsizeof(payload))

payload = archive.to_json()
print("JSON, uncompressed (bytes):", getsizeof(payload))
print("JSON, compressed (bytes):", getsizeof(zst_compress(payload.encode())))

Output:

msgpack, uncompressed (bytes): 2494841
msgpack, compressed (bytes): 762852
JSON, uncompressed (bytes): 5544757
JSON, compressed (bytes): 1032531

Comment thread gufe/archival.py Outdated
@ianmkenney
Member Author

pre-commit.ci autofix

@ianmkenney ianmkenney force-pushed the feat/AlchemicalArchive branch from d72ca1f to ced3ece on February 3, 2026 17:10
Comment thread gufe/tokenization.py
@ianmkenney ianmkenney force-pushed the feat/AlchemicalArchive branch from 2afbd3a to 29a1c6f on February 5, 2026 21:09
Comment thread gufe/archival.py
@github-actions

github-actions Bot commented Feb 6, 2026

No API break detected ✅

@ianmkenney ianmkenney requested a review from atravitz February 6, 2026 17:41
@atravitz atravitz merged commit c490289 into OpenFreeEnergy:main Feb 6, 2026
14 checks passed
atravitz added a commit that referenced this pull request Mar 2, 2026
* Add untested implementation of AlchemicalArchive

* Address ruff check issues

* Add docstrings to from_json and to_json

* Add tests for AlchemicalArchive

* Add archival module to API autosummary

* Add news entry

* Include fake ProtocolDAGResults in test archive

* Fix error in Archive fixture

* Implement md5sum and deterministic ProtocolDAGResult ordering

* Use lists instead of tuples

* Share tokenization map with AlchemicalNetwork and ProtocolDAGResults

* Use GufeTokenizable approach over dataclass

On real data consisting of 264 ProtocolDAGResults, the dataclass
implementation was not scalable due to the lack of proper
deduplication. Serialized, the archive was 222 MiB (117 MiB when zst
compressed) and took nearly 2 minutes to produce. Using a
GufeTokenizable approach, this was reduced to 5 MiB (1 MiB compressed)
while only taking seconds to produce.

* Update news entry

* Allow zstandard compression of msgpack bytes

* Fix errors in TestArchival

* Test MessagePack roundtrip with and without compression

* Check that transformation keys correspond to network edges

* Remove use of dictionaries for storing transformation results

* Simplify transformation_results validation process

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Removed typo

* Test ValueError raised on duplicate Transformation

* Test ordering of input transformation_results

* Add docstrings

* Add regression test for deserializing an AlchemicalArchive

* Revert "Add regression test for deserializing an AlchemicalArchive"

This reverts commit 6c2f7ee.

* Add regression test for deserializing an AlchemicalArchive

This reflects the previously reverted commit but changes execution
order of the tests.

* don't mutate the fixture

* add immutability test

* Allow user to skip specifying metadata

* Issue warning only if difference in major or minor versions

* Test conditional issue of warning by semver differences

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Alyssa Travitz <alyssa.travitz@omsf.io>


Development

Successfully merging this pull request may close these issues.

Define AlchemicalArchive object for use as archival artifact

5 participants