Add pack and unpack of a more compact serialization format #68
Conversation
Force-pushed 89e1f19 to fed3034.
@lemire thank you for your review, which I did not at all expect for a draft. I hope to have addressed your feedback appropriately, and force-pushed this branch.
FYI, I also looked into the Duff's Device idea with branchless code, and it seems not to pay off because of the bounds check.
@nigoroll This PR looks good to me. I consider that changing the definition of the structs is a breaking change... but given that it is well motivated, I think we can go forward. There was a prior request to bundle the size in the struct #63 by @oschonrock. At the time, I dismissed it because I viewed it as a metadata issue. However, it is true that it can allow for a leaner serialized format... which can matter for some applications. ❤️
@lemire thank you. So I will go ahead with the remaining TODOs to add the other filters. FTR, I also re-ran the zlib test on the packed output (without the dictionary) for the small filter sizes I am interested in. As an overview, the numbers show that, except for one case, zlib cannot further compress the packed output.
... in preparation of a more compact serialization format: all other parameters except for the Seed are derived from the size parameter. The drawback is that this format is sensitive to changes of binary_fuse8_allocate(). Due to alignment, this does not need any more space on 64-bit (there were five 32-bit values in between two 64-bit values). Yet formally, this is a breaking change of the in-core format, which should not be used to store information across versions. See the follow-up commits for the new compact serialization formats.
Rationale:
As mentioned in the previous commit, for binary_fuse filters, we do not
need to save values derived from the size, saving 5 x sizeof(uint32_t).
For both filter implementations, we add a bitmap to indicate non-zero
fingerprint values. This adds 1/{8,16} of the fingerprint array size,
but saves one or two bytes for each zero fingerprint.
The net result is a packed format which cannot be compressed further by
zlib for the bundled unit tests.
Note that this format is incompatible with the existing _serialize()
format and, in the case of binary_fuse, sensitive to changes of the
derived parameters in _allocate.
Interface:
We add _pack_bytes() to match _serialization_bytes(). _pack() and
_unpack() match _serialize() and _deserialize().
The existing _{de,}serialize() interfaces take a buffer pointer only and
thus implicitly assume that the buffer will be of sufficient size. For
the new functions, we add a size_t parameter indicating the size of the
buffer and check its bounds in the implementation.
_pack returns the used size, or zero for "does not fit", so when called
with a buffer of arbitrary size, the used space or the error condition
can be determined without an additional call to _pack_bytes(), avoiding
duplicate work.
Implementation:
We add some XOR_bitf_* macros to address words and individual bits of
bitfields.
The XOR_ser and XOR_deser macros have the otherwise repeated code for
bounds checking and the actual serialization.
Because the implementations for the 8 and 16 bit words are equal except
for the data type, we add macros and create the actual functions by
expanding the macros with the possible data types.
Alternatives considered:
Compared to _{de,}serialize(), the new functions need to copy individual
fingerprint words rather than the whole array at once, which is less
efficient. Therefore, an implementation using Duff's Device with
branchless code was attempted but dismissed, because avoiding
out-of-bounds access would require an over-allocated buffer.
To exercise the new code without too much of a change to the existing
unit test, we change the signature of _{un,}serialize_gen() to take an
additional (const) size_t argument, which we ignore for
_{un,}serialize().
We add absolute and relative size information for the "in-core" and
"wire" formats to the reported metrics, the latter referring jointly
to _{un,}serialize() and _{un,}pack().
Force-pushed fed3034 to 1382d41.
I think this PR is ready now with the last force-push. Notable changes compared to before: […]
Force-pushed 1382d41 to 5e2aff2.
Oh, it is only now that I notice that the packed format is actually larger for […]
I am happy to merge your change, but it wasn't a glitch in the sense that there is no interop support for big endian. Also: I do not think that big endian is making a comeback any time soon. :-) It is dead.
I am merging, but note that I toned down the wording in the README. The packed format should definitely not be the default. It is always going to be more computationally expensive, and whether it saves bytes... it depends.
Merging!!!
Thank you, @lemire, I do fully agree with your changes to the README wording.
Sorry if I am a bit late here, it all happened a bit quickly in the end...
I noticed the change to the structs (adding […]). Do we want that? It's a breaking change, but then, we have done that already... It could be argued that we should, because otherwise the […]. Also, for the new […]
@oschonrock Can you prepare a pull request? It would be a breaking change, but we can make it a 2.0.0 release.
This implements the idea of serializing fuse8 filters more efficiently by using a bitmask marking non-zero fingerprints.
I am also proposing a changed signature for (a new type of) serialization functions. Current:

void binary_fuse8_serialize(const binary_fuse8_t *filter, char *buffer)

Proposed:

size_t binary_fuse8_pack(const binary_fuse8_t *filter, uint8_t *buffer, size_t space)

The proposed API takes an additional size argument to enable the serializer to do bounds checking on the buffer, returning the used size or 0 for "would overrun". In addition, this also enables opportunistic serialization to a "most likely large enough" buffer without a prior size calculation, which is duplicate work.
The purpose of this PR is to ask if such a change would be welcome. If yes, I would work on implementing it for the other filter types. Also, I would be interested in opinions regarding the changed signature: should it also be applied to the existing serialization functions?
Note: There are also some obvious optimizations, for example the loops over the fingerprints array can be partially unrolled and combined with a Duff's Device. I will be happy to implement such improvements if there is any interest.