Add AlchemicalArchive #687
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
##             main     #687      +/-   ##
==========================================
+ Coverage   98.79%   98.82%   +0.02%
==========================================
  Files          40       41       +1
  Lines        2498     2555      +57
==========================================
+ Hits         2468     2525      +57
  Misses         30       30
Force-pushed d3f33c9 to fe155bf
Force-pushed fe155bf to 3b00b6f
On real data consisting of 264 ProtocolDAGResults, the dataclass implementation was not scalable due to the lack of proper deduplication. Serialized, the archive was 222 MiB (117 MiB when zst compressed) and took nearly 2 minutes to produce. Using a GufeTokenizable approach, this was reduced to 5 MiB (1 MiB compressed) while only taking seconds to produce.
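The deduplication effect described above can be illustrated with a minimal, hypothetical sketch (plain dicts and a made-up registry key stand in for GufeTokenizable objects and GUFE tokens; this is not the gufe implementation): embedding a copy of a shared component in every result grows linearly, while registering it once and referencing it by key keeps the payload small.

```python
import json

# A stand-in for a component shared by many results (e.g. common settings).
shared = {"settings": {"cutoff": 1.2, "steps": 10000}}

# Naive "dataclass-style" serialization: every result embeds a full copy.
naive = json.dumps([{"result": i, "component": shared} for i in range(264)])

# Token-style serialization: one registry entry, results hold only a key.
registry = {"component-abc123": shared}  # key stands in for a GUFE token
deduped = json.dumps(
    {
        "registry": registry,
        "results": [{"result": i, "component": "component-abc123"} for i in range(264)],
    }
)

# The deduplicated form is dramatically smaller for repeated components.
print(len(naive), len(deduped))
```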
Commit 1772aa6 replaces the use of dataclasses with subclassing GufeTokenizable. While the simplicity of dataclasses is appealing, the performance benefits of the GufeTokenizable subclass leave little room for debate.
If anyone has thoughts on this, please share!
Great job @ianmkenney. Is there any substantial benefit to adding compression to the Tokenizable implementation as well? If these objects are intended for long-term storage, minimising the footprint at the cost of inspectability might be okay if the difference is large. Or what about a msgpack option?
@jthorton at least for the network I've tested, compressing with zstandard reduced the size to about 1 MB. MessagePack would probably be a good option. From what I can see in the current implementation, compression needs to be done manually. We would want to add a

Edit: I think as protocols start producing and collecting more artifacts, compression will become much more valuable and should probably be the default.
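As a rough sketch of what round-trip compression helpers could look like (a hypothetical illustration, not the gufe API; stdlib zlib stands in for zstandard here so the example is self-contained), the key point is that serialized results are highly repetitive and compress very well:

```python
import zlib


def compress(payload: bytes) -> bytes:
    """Compress serialized archive bytes (zstandard in gufe; zlib here)."""
    return zlib.compress(payload, level=9)


def decompress(payload: bytes) -> bytes:
    """Invert compress(), recovering the original serialized bytes."""
    return zlib.decompress(payload)


# Repetitive serialized results compress extremely well.
data = b'{"results": [' + b'{"energy": 0.0},' * 500 + b"]}"
blob = compress(data)
print(f"{len(data)} -> {len(blob)} bytes")
```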
A quick test using MessagePack, with the new compression kwarg:

```python
from sys import getsizeof

from gufe.archival import AlchemicalArchive
from gufe.compression import zst_compress

archive = AlchemicalArchive.from_json(file="archive.json")

payload = archive.to_msgpack(compress=False)
print("msgpack, uncompressed (bytes):", getsizeof(payload))

payload = archive.to_msgpack(compress=True)
print("msgpack, compressed (bytes):", getsizeof(payload))

payload = archive.to_json()
print("JSON, uncompressed (bytes):", getsizeof(payload))
print("JSON, compressed (bytes):", getsizeof(zst_compress(payload.encode())))
```
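One caveat on the measurement above: `sys.getsizeof` reports the size of the Python object including the interpreter's per-object overhead, so for `bytes` payloads `len()` is the more direct measure of serialized size. A small stdlib-only illustration:

```python
from sys import getsizeof

payload = b"x" * 1000
# getsizeof includes the bytes object's header; len() is the payload itself.
print("len:", len(payload), "getsizeof:", getsizeof(payload))
```

For payloads in the MiB range the difference is negligible, but `len()` avoids the constant offset entirely.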
pre-commit.ci autofix
Force-pushed d72ca1f to ced3ece
This reverts commit 6c2f7ee.
This reflects the previously reverted commit but changes execution order of the tests.
Force-pushed 2afbd3a to 29a1c6f
No API break detected ✅
* Add untested implementation of AlchemicalArchive
* Address ruff check issues
* Add docstrings to from_json and to_json
* Add tests for AlchemicalArchive
* Add archival module to API autosummary
* Add news entry
* Include fake ProtocolDAGResults in test archive
* Fix error in Archive fixture
* Implement md5sum and deterministic ProtocolDAGResult ordering
* Use lists instead of tuples
* Share tokenization map with AlchemicalNetwork and ProtocolDAGResults
* Use GufeTokenizable approach over dataclass

  On real data consisting of 264 ProtocolDAGResults, the dataclass implementation was not scalable due to the lack of proper deduplication. Serialized, the archive was 222 MiB (117 MiB when zst compressed) and took nearly 2 minutes to produce. Using a GufeTokenizable approach, this was reduced to 5 MiB (1 MiB compressed) while only taking seconds to produce.
* Update news entry
* Allow zstandard compression of msgpack bytes
* Fix errors in TestArchival
* Test MessagePack roundtrip with and without compression
* Check that transformation keys correspond to network edges
* Remove use of dictionaries for storing transformation results
* Simplify transformation_results validation process
* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci
* Removed typo
* Test ValueError raised on duplicate Transformation
* Test ordering of input transformation_results
* Add docstrings
* Add regression test for deserializing an AlchemicalArchive
* Revert "Add regression test for deserializing an AlchemicalArchive"

  This reverts commit 6c2f7ee.
* Add regression test for deserializing an AlchemicalArchive

  This reflects the previously reverted commit but changes the execution order of the tests.
* don't mutate the fixture
* add immutability test
* Allow user to skip specifying metadata
* Issue warning only if difference in major or minor versions
* Test conditional issue of warning by semver differences

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Alyssa Travitz <alyssa.travitz@omsf.io>
This PR introduced the AlchemicalArchive for serializing an AlchemicalNetwork along with its transformation results. Closes #323.