Add pack and unpack of a more compact serialization format #68

nigoroll · 2025-01-19T15:56:35Z

This implements the idea to more efficiently serialize fuse8 filters by using a bitmask marking non-zero fingerprints.
I am also proposing a changed signature of (a new type of) serialization functions:

existing: void binary_fuse8_serialize(const binary_fuse8_t *filter, char *buffer)
proposed: size_t binary_fuse8_pack(const binary_fuse8_t *filter, uint8_t *buffer, size_t space)

The proposed API takes an additional size argument to enable the serializer to do bounds checking on buffer, returning the used size or 0 for "would overrun". In addition, this also enables opportunistic serialization to a "most likely large enough" buffer without prior size calculation, which is duplicate work.
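The calling convention described above can be sketched as follows. This is only an illustration under stated assumptions: `toy_filter_t` and `toy_pack` are hypothetical stand-ins, not the library's types or functions, and the layout written here is not the proposed packed format.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a filter; not the library's struct. */
typedef struct {
  uint64_t seed;
  uint32_t size;               /* number of fingerprints */
  const uint8_t *fingerprints;
} toy_filter_t;

/* Writes the filter to `buffer`, checking bounds as it goes.
 * Returns the number of bytes used, or 0 for "would overrun". */
static size_t toy_pack(const toy_filter_t *f, uint8_t *buffer, size_t space) {
  uint8_t *p = buffer;
  const size_t header = sizeof f->seed + sizeof f->size;

  if (space < header) return 0;
  memcpy(p, &f->seed, sizeof f->seed);
  p += sizeof f->seed;
  memcpy(p, &f->size, sizeof f->size);
  p += sizeof f->size;
  space -= header;

  if (space < f->size) return 0;
  if (f->size > 0) memcpy(p, f->fingerprints, f->size);
  p += f->size;

  return (size_t)(p - buffer);
}
```

A caller can pass a buffer that is most likely large enough and only fall back to an exact size calculation when the return value is 0.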

The purpose of this PR is to ask if such a change would be welcome. If yes, I would work on implementing it for the other filter types. Also, I would be interested in opinions regarding the changed signature: should it also be applied to the existing serialization functions?

Note: There are also some obvious optimizations, for example the loops over the fingerprints array can be partially unrolled and combined with a Duff's Device. I will be happy to implement such improvements if there is any interest.

include/binaryfusefilter.h

nigoroll · 2025-01-20T11:03:27Z

@lemire thank you for your review, which I did not at all expect for a draft. I hope to have addressed your feedback appropriately, and force pushed this branch.

nigoroll · 2025-01-20T14:25:19Z

FYI, I also looked into the Duff's Device idea with branchless code, and it seems to not pay off because of the bounds check.

lemire · 2025-01-20T15:06:11Z

@nigoroll This PR looks good to me.

I consider that changing the definition of the structs is a breaking change... but given that it is well motivated, I think we can go forward. There was a prior request to bundle the size in the struct #63 by @oschonrock. At the time, I dismissed it because I viewed it as a metadata issue. However, it is true that it can allow for a leaner serialized format... which can matter for some applications.

❤️

nigoroll · 2025-01-20T19:33:13Z

@lemire thank you. So I will go ahead with the remaining TODOs to add the other filters.

FTR, I also re-ran the zlib test on the packed output (without the dictionary) for the small filter sizes I am interested in. As an overview, the numbers are:

keys	_serialize	_serialize + zlib	_serialize + dictionary	_pack	_pack + zlib
10	72	43	24	28	31
100	216	184	170	136	141
1000	1432	1220	1216	1185	1190

So zlib cannot further compress the _pack output.

Except for one case, _pack is better than _serialize with zlib and a dictionary, and for that one case (10), the zlib dictionary is most likely "cheating" because of the Seed almost always being 6df6b22537d23467. In summary, I think it is fair to say that _pack beats _serialize plus zlib. So your advice was gold, @lemire .

pack + zlib run:

== size = 10 
testing binary fuse8 with size 10 and 0 duplicates
 ser 72 pack 28
 31 compress Z_FILTERED
 31 compress Z_HUFFMAN_ONLY
 31 compress Z_RLE
 31 compress Z_FIXED
 31 compressed 28 serialized
 fpp 0.00390 (estimated) 
 bits per entry 24.80
 bits per entry 8.00 (theoretical lower bound)
 efficiency ratio 3.099 

testing xor8
 36 compress Z_FILTERED
 55 compress Z_HUFFMAN_ONLY
 37 compress Z_RLE
 36 compress Z_FIXED
 36 compressed 58 serialized
 fpp 0.00389 (estimated) 
 bits per entry 28.80
 bits per entry 8.01 (theoretical lower bound)
 efficiency ratio 3.597 

======
== size = 100 
testing binary fuse8 with size 100 and 0 duplicates
 ser 216 pack 136
 141 compress Z_FILTERED
 141 compress Z_HUFFMAN_ONLY
 141 compress Z_RLE
 141 compress Z_FIXED
 141 compressed 136 serialized
 fpp 0.00392 (estimated) 
 bits per entry 11.28
 bits per entry 8.00 (theoretical lower bound)
 efficiency ratio 1.411 

testing xor8
 166 compress Z_FILTERED
 174 compress Z_HUFFMAN_ONLY
 166 compress Z_RLE
 170 compress Z_FIXED
 166 compressed 169 serialized
 fpp 0.00393 (estimated) 
 bits per entry 13.28
 bits per entry 7.99 (theoretical lower bound)
 efficiency ratio 1.662 

======
== size = 1000 
testing binary fuse8 with size 1000 and 0 duplicates
 ser 1432 pack 1185
 1190 compress Z_FILTERED
 1190 compress Z_HUFFMAN_ONLY
 1190 compress Z_RLE
 1190 compress Z_FIXED
 1190 compressed 1185 serialized
 fpp 0.00391 (estimated) 
 bits per entry 9.52
 bits per entry 8.00 (theoretical lower bound)
 efficiency ratio 1.190 

testing xor8
 1183 compress Z_FILTERED
 1179 compress Z_HUFFMAN_ONLY
 1179 compress Z_RLE
 1281 compress Z_FIXED
 1179 compressed 1276 serialized
 fpp 0.00387 (estimated) 
 bits per entry 9.43
 bits per entry 8.01 (theoretical lower bound)
 efficiency ratio 1.177

... in preparation of a more compact serialization format: All other parameters except for the Seed are derived from the size parameter. The drawback is that this format is sensitive to changes of binary_fuse8_allocate(). Due to alignment, this does not need any more space on 64-bit platforms (there were five 32-bit values in between two 64-bit values). Yet formally, this is a breaking change of the in-core format, which should not be used to store information across versions. See follow-up commits for new compact serialization formats.
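The alignment argument can be checked with a small layout sketch. The two structs below are hypothetical and only mirror the description (one 64-bit value, five 32-bit values, then a pointer); the field names are illustrative and are not taken from the header.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout before the change: on a typical 64-bit ABI,
 * 8 + 5 * 4 = 28 bytes are followed by 4 bytes of padding so that
 * the pointer is 8-byte aligned. */
typedef struct {
  uint64_t seed;
  uint32_t derived[5];   /* five 32-bit derived parameters */
  uint8_t *fingerprints;
} layout_before_t;

/* Hypothetical layout after the change: the new 32-bit size field
 * occupies what used to be padding. */
typedef struct {
  uint64_t seed;
  uint32_t size;
  uint32_t derived[5];
  uint8_t *fingerprints;
} layout_after_t;

/* Extra bytes needed by the new layout (0 where pointers are 8-byte aligned). */
static size_t layout_growth(void) {
  return sizeof(layout_after_t) - sizeof(layout_before_t);
}
```

On a 32-bit target with 4-byte pointers there is no such padding, so the same struct would grow by four bytes there.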

Rationale: As mentioned in the previous commit, for binary_fuse filters, we do not need to save values derived from the size, saving 5 x sizeof(uint32_t). For both filter implementations, we add a bitmap to indicate non-zero fingerprint values. This adds 1/{8,16} of the fingerprint array size, but saves one or two bytes for each zero fingerprint. The net result is a packed format which cannot be compressed further by zlib for the bundled unit tests. Note that this format is incompatible with the existing _serialize() format and, in the case of binary_fuse, sensitive to changes of the derived parameters in _allocate.

Interface: We add _pack_bytes() to match _serialization_bytes(). _pack() and _unpack() match _serialize() and _deserialize(). The existing _{de,}serialize() interfaces take a buffer pointer only and thus implicitly assume that the buffer will be of sufficient size. For the new functions, we add a size_t parameter indicating the size of the buffer and check its bounds in the implementation. _pack returns the used size or zero for "does not fit", so when called with a buffer of arbitrary size, the used space or error condition can be determined without an additional call to _pack_bytes(), avoiding duplicate work.

Implementation: We add some XOR_bitf_* macros to address words and individual bits of bitfields. The XOR_ser and XOR_deser macros contain the otherwise repeated code for bounds checking and the actual serialization. Because the implementations for the 8 and 16 bit words are equal except for the data type, we add macros and create the actual functions by expanding the macros with the possible data types.

Alternatives considered: Compared to _{de,}serialize(), the new functions need to copy individual fingerprint words rather than the whole array at once, which is less efficient. Therefore, an implementation using Duff's Device with branchless code was attempted but dismissed because avoiding out-of-bounds access would require an over-allocated buffer.
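The bitmap idea from the rationale can be illustrated for 8-bit fingerprints as below. This is a minimal sketch, not the code from this commit: `sketch_pack8` and `sketch_unpack8` are hypothetical names, the sketch omits the header (seed and size), and it uses plain functions instead of the generator macros.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Pack n 8-bit fingerprints as a bitmap (one bit per slot, set means
 * non-zero) followed by the non-zero bytes only.
 * Returns the bytes used, or 0 if the output does not fit into `space`. */
static size_t sketch_pack8(const uint8_t *fp, size_t n,
                           uint8_t *out, size_t space) {
  const size_t bitmap_bytes = (n + 7) / 8;
  size_t used = bitmap_bytes;

  if (space < bitmap_bytes) return 0;
  memset(out, 0, bitmap_bytes);
  for (size_t i = 0; i < n; i++) {
    if (fp[i] == 0) continue;          /* zero slots cost one bit only */
    if (used == space) return 0;       /* bounds check per stored byte */
    out[i / 8] |= (uint8_t)(1u << (i % 8));
    out[used++] = fp[i];
  }
  return used;
}

/* Inverse of sketch_pack8 for a known slot count n.
 * Returns the bytes consumed, or 0 if `len` is too short. */
static size_t sketch_unpack8(uint8_t *fp, size_t n,
                             const uint8_t *in, size_t len) {
  const size_t bitmap_bytes = (n + 7) / 8;
  size_t used = bitmap_bytes;

  if (len < bitmap_bytes) return 0;
  for (size_t i = 0; i < n; i++) {
    if (in[i / 8] & (1u << (i % 8))) {
      if (used == len) return 0;
      fp[i] = in[used++];
    } else {
      fp[i] = 0;
    }
  }
  return used;
}
```

With this layout the packed size is ceil(n / 8) plus the number of non-zero fingerprints, so packing only pays off when more than roughly one slot in eight is zero.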

To exercise the new code without too much of a change to the existing unit test, we change the signature of _{un,}serialize_gen() to take an additional (const) size_t argument, which we ignore for _{un,}serialize(). We add to the reported metrics absolute and relative size information for the "in-core" and "wire" format, the latter referring jointly to _{un,}serialize() and _{un,}pack().

nigoroll · 2025-01-21T10:17:31Z

I think this PR is ready now with the last force-push.

Notable changes since the previous version:

More detailed commit messages
Turned pack-related functions into generator macros to avoid duplicating code just for the differing fingerprint data types
Added xor pack-related functions
Adjusted the unit test to accommodate the new signature and added calls to exercise pack/unpack.
Added additional output to the unit test to inform about absolute and relative sizes of the respective serializer. I propose the terms "in-core" and "wire" with the latter referring to the serialized format. Example:

testing binary fuse16 pack/unpack with size 300000 and 0 duplicates
 fpp 0.00001 (estimated) 
 size in-core 696360 wire 643524
 bits per entry in-core 18.57 wire 17.16
 bits per entry 16.17 (theoretical lower bound)
 efficiency ratio in-core 1.149 wire 1.062

Added documentation with an example
Fixed what I think is a minor glitch in the documentation: The serializers should work on big endian (which might see a bit of a comeback thanks to ARM), it's just a change in endianness which they do not handle.
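For reference, the size metrics in the example output can be reproduced from the reported byte counts. The helper names below are hypothetical, and the definition of the efficiency ratio (bits per entry divided by the theoretical lower bound) is an assumption inferred from the printed numbers, not taken from the test source.

```c
#include <stddef.h>

/* Storage cost in bits per key for a given size in bytes. */
static double bits_per_entry(size_t bytes, size_t keys) {
  return 8.0 * (double)bytes / (double)keys;
}

/* Assumed definition: actual bits per entry relative to the lower bound. */
static double efficiency_ratio(double bits, double lower_bound_bits) {
  return bits / lower_bound_bits;
}
```

For the fuse16 example with 300000 keys, 643524 bytes on the wire give 8 * 643524 / 300000 ≈ 17.16 bits per entry, and 696360 bytes in-core give ≈ 18.57.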

nigoroll · 2025-01-21T10:27:01Z

Oh, it is only now that I notice that the packed format is actually larger for xor (I should have noticed this earlier). I guess we should remove it?

testing xor8
 fpp 0.00392 (estimated) 
 size in-core 1284 wire 1276
 bits per entry in-core 10.27 wire 10.21
 bits per entry 7.99 (theoretical lower bound)
 efficiency ratio in-core 1.285 wire 1.277
...
testing xor8 pack/unpack
 fpp 0.00394 (estimated) 
 size in-core 1284 wire 1429
 bits per entry in-core 10.27 wire 11.43
 bits per entry 7.99 (theoretical lower bound)
 efficiency ratio in-core 1.286 wire 1.431

lemire · 2025-01-22T02:29:19Z

Fixed what I think is a minor glitch in the documentation: The serializers should work on big endian (which might see a bit of a comeback thanks to ARM), it's just a change in endianness which they do not handle.

I am happy to merge your change but it wasn't a glitch in the sense that there is no interop support for big endian.

Also: I do not think that big endian is making a comeback any time soon. :-) It is dead.

lemire · 2025-01-22T02:31:45Z

I am merging but note that I toned down the wording in the README. The packed format should definitely not be the default. It is always going to be more computationally expensive, and whether it saves bytes is... it depends.

160        fuse16: 27.00 18.20   fuse8: 14.00 10.10   xor16: 23.60 18.25   xor8: 12.20 10.25   
320        fuse16: 26.30 17.90   fuse8: 13.40  9.85   xor16: 21.55 17.73   xor8: 10.97  9.72   
640        fuse16: 25.95 17.75   fuse8: 13.10  9.75   xor16: 20.68 20.16   xor8: 10.44  9.44   
1280       fuse16: 22.57 17.48   fuse8: 11.35  9.43   xor16: 20.16 20.74   xor8: 10.13 10.79   
2560       fuse16: 22.49 17.44   fuse8: 11.28  9.41   xor16: 19.93 20.14   xor8:  9.99 10.60   
5120       fuse16: 20.84 17.32   fuse8: 10.44  9.28   xor16: 19.80 20.40   xor8:  9.91 10.71   
10240      fuse16: 20.02 17.26   fuse8: 10.02  9.23   xor16: 19.74 17.25   xor8:  9.88  9.21   
20480      fuse16: 20.01 17.25   fuse8: 10.01  9.22   xor16: 19.71 18.18   xor8:  9.86 10.71   
40960      fuse16: 20.01 17.25   fuse8: 10.00  9.22   xor16: 19.70 20.26   xor8:  9.85 10.69   
81920      fuse16: 19.20 17.20   fuse8:  9.60  9.17   xor16: 19.69 18.14   xor8:  9.84  9.21   
163840     fuse16: 18.80 17.18   fuse8:  9.40  9.14   xor16: 19.68 18.15   xor8:  9.84 10.76   
327680     fuse16: 18.40 17.15   fuse8:  9.20  9.12   xor16: 19.68 20.55   xor8:  9.84 10.81   
655360     fuse16: 18.20 17.14   fuse8:  9.10  9.11   xor16: 19.68 18.79   xor8:  9.84  9.20   
1310720    fuse16: 18.00 17.12   fuse8:  9.00  9.09   xor16: 19.68 18.93   xor8:  9.84 10.36   
2621440    fuse16: 18.00 17.12   fuse8:  9.00  9.09   xor16: 19.68 17.23   xor8:  9.84 10.46   
5242880    fuse16: 18.00 17.12   fuse8:  9.00  9.09   xor16: 19.68 20.37   xor8:  9.84  9.20

lemire · 2025-01-22T02:35:59Z

Merging!!!

nigoroll · 2025-01-22T08:07:35Z

Thank you, @lemire , I do fully agree with your changes to the README wording.
And thank you for writing a space benchmark, this is really helpful to make a more qualified judgement.

oschonrock · 2025-01-23T12:26:06Z

Sorry if I am a bit late here, it all happened a bit quickly in the end....

@nigoroll This PR looks good to me.

I consider that changing the definition of the structs is a breaking change... but given that it is well motivated, I think we can go forward. There was a prior request to bundle the size in the struct #63 by @oschonrock. At the time, I dismissed it because I viewed it as a metadata issue.

I noticed that the change to the structs (adding Size) is part of the new 1.2.0, but this is not reflected in binary_fuse(8|16)_(de)serialize(_header).

Do we want that? It's a breaking change, but then, we have done that already...

It could be argued that we should, because otherwise the ->Size value is either UB or wrong (i.e. zero) after deserialization, depending on whether the user zero-initialized their struct?

Also, for the new .Size properties to be useful for https://github.com/oschonrock/binfuse (which was the intent of #63) they would have to be serialized.
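The concern about ->Size can be illustrated with a toy example. Everything here is hypothetical: `toy_fuse_t` and `toy_deserialize_header` only imitate a deserializer that predates the Size field and are not the library's code or wire layout.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical filter with a Size field that the old format does not carry. */
typedef struct {
  uint64_t Seed;
  uint32_t Size;
  uint32_t ArrayLength;
} toy_fuse_t;

/* Old-style header deserializer: restores Seed and ArrayLength
 * but never writes Size. */
static void toy_deserialize_header(toy_fuse_t *f, const char *buffer) {
  memcpy(&f->Seed, buffer, sizeof f->Seed);
  memcpy(&f->ArrayLength, buffer + sizeof f->Seed, sizeof f->ArrayLength);
}

/* Returns Size as seen after deserializing into a zero-initialized struct. */
static uint32_t toy_size_after_deserialize(const char *buffer) {
  toy_fuse_t f = {0};   /* without this, reading f.Size is indeterminate */
  toy_deserialize_header(&f, buffer);
  return f.Size;        /* 0: well-defined, but not the real key count */
}
```

Either outcome leaves the caller without the original size, which is why serializing the field would be needed for it to be useful.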

lemire · 2025-01-23T16:09:27Z

@oschonrock Can you prepare a pull request?

It would be a breaking change, but we can make it a 2.0.0 release.