@nigoroll
Contributor

This implements the idea to more efficiently serialize fuse8 filters by using a bitmask marking non-zero fingerprints.
I am also proposing a changed signature of (a new type of) serialization functions:

  • existing: void binary_fuse8_serialize(const binary_fuse8_t *filter, char *buffer)
  • proposed: size_t binary_fuse8_pack(const binary_fuse8_t *filter, uint8_t *buffer, size_t space)

The proposed API takes an additional size argument to enable the serializer to do bounds checking on buffer, returning the used size or 0 for "would overrun". In addition, this also enables opportunistic serialization to a "most likely large enough" buffer without prior size calculation, which is duplicate work.
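The return convention described above can be sketched with a self-contained toy. Everything here is an assumption made for illustration: `toy_filter_t` and `toy_pack` are hypothetical stand-ins, not the library's `binary_fuse8_t` or the proposed `binary_fuse8_pack`, and the byte layout is arbitrary.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a filter; not the library's binary_fuse8_t. */
typedef struct {
  uint64_t seed;
  uint32_t len;                /* number of fingerprint bytes */
  const uint8_t *fingerprints;
} toy_filter_t;

/* Returns the number of bytes written, or 0 if `space` is too small.
 * Nothing is written when the buffer would overrun. */
static size_t toy_pack(const toy_filter_t *f, uint8_t *buffer, size_t space) {
  size_t need = sizeof f->seed + sizeof f->len + (size_t)f->len;
  if (space < need) return 0;  /* bounds check before any store */
  memcpy(buffer, &f->seed, sizeof f->seed);
  buffer += sizeof f->seed;
  memcpy(buffer, &f->len, sizeof f->len);
  buffer += sizeof f->len;
  memcpy(buffer, f->fingerprints, f->len);
  return need;
}
```

A caller can treat a return value of 0 as "buffer too small" and any other value as the length to store or transmit.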

The purpose of this PR is to ask if such a change would be welcome. If yes, I would work on implementing it for the other filter types. I would also be interested in opinions on the changed signature: should it be applied to the existing serialization functions as well?

Note: There are also some obvious optimizations, for example the loops over the fingerprints array can be partially unrolled and combined with a Duff's Device. I will be happy to implement such improvements if there is any interest.

@nigoroll
Contributor Author

@lemire thank you for your review, which I did not at all expect for a draft. I hope to have addressed your feedback appropriately, and force pushed this branch.

@nigoroll
Contributor Author

FYI, I also looked into the Duff's Device idea with branchless code, and it seems to not pay off because of the bounds check.

@lemire
Member

lemire commented Jan 20, 2025

@nigoroll This PR looks good to me.

I consider that changing the definition of the structs is a breaking change... but given that it is well motivated, I think we can go forward. There was a prior request to bundle the size in the struct #63 by @oschonrock. At the time, I dismissed it because I viewed it as a metadata issue. However, it is true that it can allow for a leaner serialized format... which can matter for some applications.

@nigoroll
Contributor Author

@lemire thank you. So I will go ahead with the remaining TODOs to add the other filters.

FTR, I also re-ran the zlib test on the packed output (without the dictionary) for the small filter sizes I am interested in. As an overview, the numbers are:

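| keys | _serialize | _serialize + zlib | _serialize + dictionary | _pack | _pack + zlib |
|-----:|-----------:|------------------:|------------------------:|------:|-------------:|
|   10 |         72 |                43 |                      24 |    28 |           31 |
|  100 |        216 |               184 |                     170 |   136 |          141 |
| 1000 |       1432 |              1220 |                    1216 |  1185 |         1190 |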

So zlib cannot further compress _pack output.

Except for one case, _pack is better than _serialize with zlib and a dictionary, and for that one case (10), the zlib dictionary is most likely "cheating" because of the Seed almost always being 6df6b22537d23467. In summary, I think it is fair to say that _pack beats _serialize plus zlib. So your advice was gold, @lemire .

pack + zlib run:

== size = 10 
testing binary fuse8 with size 10 and 0 duplicates
 ser 72 pack 28
 31 compress Z_FILTERED
 31 compress Z_HUFFMAN_ONLY
 31 compress Z_RLE
 31 compress Z_FIXED
 31 compressed 28 serialized
 fpp 0.00390 (estimated) 
 bits per entry 24.80
 bits per entry 8.00 (theoretical lower bound)
 efficiency ratio 3.099 

testing xor8
 36 compress Z_FILTERED
 55 compress Z_HUFFMAN_ONLY
 37 compress Z_RLE
 36 compress Z_FIXED
 36 compressed 58 serialized
 fpp 0.00389 (estimated) 
 bits per entry 28.80
 bits per entry 8.01 (theoretical lower bound)
 efficiency ratio 3.597 

======
== size = 100 
testing binary fuse8 with size 100 and 0 duplicates
 ser 216 pack 136
 141 compress Z_FILTERED
 141 compress Z_HUFFMAN_ONLY
 141 compress Z_RLE
 141 compress Z_FIXED
 141 compressed 136 serialized
 fpp 0.00392 (estimated) 
 bits per entry 11.28
 bits per entry 8.00 (theoretical lower bound)
 efficiency ratio 1.411 

testing xor8
 166 compress Z_FILTERED
 174 compress Z_HUFFMAN_ONLY
 166 compress Z_RLE
 170 compress Z_FIXED
 166 compressed 169 serialized
 fpp 0.00393 (estimated) 
 bits per entry 13.28
 bits per entry 7.99 (theoretical lower bound)
 efficiency ratio 1.662 

======
== size = 1000 
testing binary fuse8 with size 1000 and 0 duplicates
 ser 1432 pack 1185
 1190 compress Z_FILTERED
 1190 compress Z_HUFFMAN_ONLY
 1190 compress Z_RLE
 1190 compress Z_FIXED
 1190 compressed 1185 serialized
 fpp 0.00391 (estimated) 
 bits per entry 9.52
 bits per entry 8.00 (theoretical lower bound)
 efficiency ratio 1.190 

testing xor8
 1183 compress Z_FILTERED
 1179 compress Z_HUFFMAN_ONLY
 1179 compress Z_RLE
 1281 compress Z_FIXED
 1179 compressed 1276 serialized
 fpp 0.00387 (estimated) 
 bits per entry 9.43
 bits per entry 8.01 (theoretical lower bound)
 efficiency ratio 1.177 

... in preparation of a more compact serialization format: All other
parameters except for the Seed are derived from the size parameter.

The drawback is that this format is sensitive to changes of
binary_fuse8_allocate().

Due to alignment, this does not need any more space on 64bit.
(There were five 32bit values in between two 64bit values.)
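
The alignment argument can be checked with a minimal sketch. The two structs below are hypothetical layouts: the member names are assumptions and are not copied from the library. On a typical 64bit ABI, five 32bit members between two 8-byte members leave 4 bytes of padding, which a sixth 32bit member can occupy.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layouts for illustration; member names are assumptions. */
typedef struct {
  uint64_t seed;
  uint32_t derived[5];    /* five 32bit values between two 64bit members */
  uint8_t *fingerprints;  /* 8 bytes on a 64bit platform */
} layout_before_t;

typedef struct {
  uint64_t seed;
  uint32_t size;          /* the added member takes the former padding */
  uint32_t derived[5];
  uint8_t *fingerprints;
} layout_after_t;

/* Returns 1 when adding the size member did not grow the struct. */
static int size_member_is_free(void) {
  return sizeof(layout_before_t) == sizeof(layout_after_t);
}
```

On platforms with 4-byte pointers the two sizes may differ, which matches the "on 64bit" qualifier in the commit message.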

Yet formally, this is a breaking change of the in-core format, which
should not be used to store information across versions. See follow up
commits for new compact serialization formats.

Rationale:

As mentioned in the previous commit, for binary_fuse filters, we do not
need to save values derived from the size, saving 5 x sizeof(uint32_t).

For both filter implementations, we add a bitmap to indicate non-zero
fingerprint values. This adds 1/{8,16} of the fingerprint array size,
but saves one or two bytes for each zero fingerprint.

The net result is a packed format which cannot be compressed further by
zlib for the bundled unit tests.

Note that this format is incompatible with the existing _serialize()
format and, in the case of binary_fuse, sensitive to changes of the
derived parameters in _allocate.
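
The bitmap idea from the rationale can be sketched as follows for 8 bit fingerprints. This is a minimal illustration under an assumed layout (bitmap first, then the non-zero values); the function names are hypothetical and the merged format may differ in detail. The sketch assumes n > 0, since a return value of 0 signals an error.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Packs n fingerprints as [bitmap of (n+7)/8 bytes][non-zero values].
 * Returns bytes used, or 0 if `space` is too small. */
static size_t sketch_pack8(const uint8_t *fp, size_t n, uint8_t *out,
                           size_t space) {
  size_t bitmap_bytes = (n + 7) / 8;
  if (space < bitmap_bytes) return 0;
  memset(out, 0, bitmap_bytes);
  size_t used = bitmap_bytes;
  for (size_t i = 0; i < n; i++) {
    if (fp[i] == 0) continue;               /* zero: bitmap bit stays clear */
    if (used == space) return 0;            /* bounds check per value */
    out[i / 8] |= (uint8_t)(1u << (i % 8)); /* mark slot i as non-zero */
    out[used++] = fp[i];
  }
  return used;
}

/* Reverses sketch_pack8. Returns bytes consumed, or 0 on truncated input. */
static size_t sketch_unpack8(uint8_t *fp, size_t n, const uint8_t *in,
                             size_t len) {
  size_t bitmap_bytes = (n + 7) / 8;
  if (len < bitmap_bytes) return 0;
  size_t used = bitmap_bytes;
  for (size_t i = 0; i < n; i++) {
    if (in[i / 8] & (1u << (i % 8))) {
      if (used == len) return 0;
      fp[i] = in[used++];
    } else {
      fp[i] = 0;
    }
  }
  return used;
}
```

With 10 fingerprints of which 4 are non-zero, this layout uses 2 bitmap bytes plus 4 value bytes, compared with 10 bytes for the raw array.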

Interface:

We add _pack_bytes() to match _serialization_bytes(). _pack() and
_unpack() match _serialize() and _deserialize().

The existing _{de,}serialize() interfaces take a buffer pointer only and
thus implicitly assume that the buffer will be of sufficient size. For
the new functions, we add a size_t parameter indicating the size of the
buffer and check its bounds in the implementation.

_pack returns the used size or zero for "does not fit", so when called
with a buffer of arbitrary size, the used space or error condition can
be determined without an additional call to _pack_bytes(), avoiding
duplicate work.
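
The calling pattern this enables might look like the sketch below. `demo_filter_t`, `demo_pack_bytes` and `demo_pack` are hypothetical stand-ins for a filter and its `_pack_bytes()` / `_pack()` functions; the payload is just a byte string so the focus stays on the caller.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins; not the library's types or functions. */
typedef struct { const uint8_t *data; size_t len; } demo_filter_t;

static size_t demo_pack_bytes(const demo_filter_t *f) { return f->len; }

static size_t demo_pack(const demo_filter_t *f, uint8_t *buf, size_t space) {
  if (space < f->len) return 0;
  memcpy(buf, f->data, f->len);
  return f->len;
}

/* Tries `scratch` first and falls back to an exactly sized heap buffer.
 * Returns the buffer holding the packed bytes (the caller frees it if it
 * is not `scratch`) and stores the length in *used; NULL on failure. */
static uint8_t *pack_opportunistic(const demo_filter_t *f, uint8_t *scratch,
                                   size_t scratch_len, size_t *used) {
  *used = demo_pack(f, scratch, scratch_len);
  if (*used != 0) return scratch;   /* common case: a single call */
  size_t need = demo_pack_bytes(f); /* rare case: compute the exact size */
  uint8_t *heap = malloc(need);
  if (heap == NULL) return NULL;
  *used = demo_pack(f, heap, need);
  if (*used == 0) { free(heap); return NULL; }
  return heap;
}
```

The size calculation only runs when the first attempt does not fit, which is the duplicate work the interface is meant to avoid.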

Implementation:

We add some XOR_bitf_* macros to address words and individual bits of
bitfields.

The XOR_ser and XOR_deser macros have the otherwise repeated code for
bounds checking and the actual serialization.

Because the implementations for the 8 and 16 bit words are equal except
for the data type, we add macros and create the actual functions by
expanding the macros with the possible data types.
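
A generator macro of this kind could look like the sketch below. The macro and function names are hypothetical and the body only computes a packed size under the assumed layout of one bitmap bit per fingerprint plus the non-zero values; the actual macros in the commit are not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical generator: expands to a function returning the packed size
 * (bitmap plus non-zero values) of an array with element type `type`. */
#define SKETCH_PACK_BYTES_GEN(name, type)                      \
  static size_t name(const type *fp, size_t n) {               \
    size_t bytes = (n + 7) / 8; /* one bitmap bit per slot */  \
    for (size_t i = 0; i < n; i++)                             \
      bytes += (fp[i] != 0) ? sizeof(type) : 0;                \
    return bytes;                                              \
  }

/* One expansion per fingerprint width. */
SKETCH_PACK_BYTES_GEN(sketch_pack_bytes8, uint8_t)
SKETCH_PACK_BYTES_GEN(sketch_pack_bytes16, uint16_t)
```

The loop body is written once, and the compiler still sees two ordinary, fully typed functions.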

Alternatives considered:

Compared to _{de,}serialize(), the new functions need to copy individual
fingerprint words rather than the whole array at once, which is less
efficient. Therefore, an implementation using Duff's Device with
branchless code was attempted but dismissed because avoiding
out-of-bounds access would require an over-allocated buffer.
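
One way such a conflict can arise is sketched below. This is not the attempted implementation, which is not shown in this PR; it is a hypothetical branchless compaction loop (without the unrolling of a Duff's Device) that illustrates why unconditional stores need a spare slot in the output buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchless compaction of non-zero fingerprints: every element is stored
 * unconditionally and the write position only advances for non-zero
 * values. Returns the number of values kept.
 * `out` needs room for (number of non-zero values + 1) elements, because
 * a zero following the last kept value is still stored one slot past it. */
static size_t compact_branchless(const uint8_t *fp, size_t n, uint8_t *out) {
  size_t w = 0;
  for (size_t i = 0; i < n; i++) {
    out[w] = fp[i];    /* unconditional store, possibly a throwaway */
    w += (fp[i] != 0); /* advance only if the value is kept */
  }
  return w;
}
```

A bounds-checked version has to test `w` against the buffer size before each store, which reintroduces the branch that the technique was meant to remove.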

To exercise the new code without too much of a change to the existing
unit test, we change the signature of _{un,}serialize_gen() to take an
additional (const) size_t argument, which we ignore for
_{un,}serialize().

We add to the reported metrics absolute and relative size information
for the "in-core" and "wire" format, the latter referring jointly to
_{un,}serialize() and _{un,}pack().

@nigoroll
Contributor Author

nigoroll commented Jan 21, 2025

I think this PR is ready now with the last force-push.

Notable changes to before:

  • More detailed commit messages

  • Turned pack-related functions into generator macros to avoid duplicating code just for the differing fingerprint data types

  • Added xor pack-related functions

  • Adjusted the unit test to accommodate the new signature and added calls to exercise pack/unpack.

  • Added additional output to the unit test to inform about absolute and relative sizes of the respective serializer. I propose the terms "in-core" and "wire" with the latter referring to the serialized format. Example:

testing binary fuse16 pack/unpack with size 300000 and 0 duplicates
 fpp 0.00001 (estimated) 
 size in-core 696360 wire 643524
 bits per entry in-core 18.57 wire 17.16
 bits per entry 16.17 (theoretical lower bound)
 efficiency ratio in-core 1.149 wire 1.062

  • Added documentation with an example

  • Fixed what I think is a minor glitch in the documentation: The serializers should work on big endian (which might see a bit of a comeback thanks to ARM), it's just a change in endianness which they do not handle.
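
The "bits per entry" figures in the example output above are consistent with a simple ratio of size to key count. The helper below is a hypothetical sketch of that arithmetic, assuming the sizes are in bytes; it is not taken from the unit test.

```c
#include <stddef.h>

/* Bits per entry = 8 * size in bytes / number of keys. */
static double bits_per_entry(size_t bytes, size_t keys) {
  return 8.0 * (double)bytes / (double)keys;
}
```

For 300000 keys, 696360 bytes in-core give about 18.57 bits per entry and 643524 bytes on the wire give about 17.16, matching the quoted output.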

@nigoroll nigoroll marked this pull request as ready for review January 21, 2025 10:17
@nigoroll
Contributor Author

Oh, it is only now that I notice that the packed format is actually larger for xor (I should have noticed this earlier). I guess we should remove it?

testing xor8
 fpp 0.00392 (estimated) 
 size in-core 1284 wire 1276
 bits per entry in-core 10.27 wire 10.21
 bits per entry 7.99 (theoretical lower bound)
 efficiency ratio in-core 1.285 wire 1.277
...
testing xor8 pack/unpack
 fpp 0.00394 (estimated) 
 size in-core 1284 wire 1429
 bits per entry in-core 10.27 wire 11.43
 bits per entry 7.99 (theoretical lower bound)
 efficiency ratio in-core 1.286 wire 1.431

@nigoroll nigoroll changed the title DRAFT: Add binary_fuse8_{pack,unpack} Add pack and unpack of a more compact serialization format Jan 21, 2025
@lemire
Member

lemire commented Jan 22, 2025

> Fixed what I think is a minor glitch in the documentation: The serializers should work on big endian (which might see a bit of a comeback thanks to ARM), it's just a change in endianness which they do not handle.

I am happy to merge your change, but it wasn't a glitch in the sense that there is no interop support for big endian.

Also: I do not think that big endian is making a comeback any time soon. :-) It is dead.

@lemire
Member

lemire commented Jan 22, 2025

I am merging, but note that I toned down the wording in the README. The packed format should definitely not be the default. It is always going to be more computationally expensive, and whether it saves bytes depends on the filter type and the number of keys.

160        fuse16: 27.00 18.20   fuse8: 14.00 10.10   xor16: 23.60 18.25   xor8: 12.20 10.25   
320        fuse16: 26.30 17.90   fuse8: 13.40  9.85   xor16: 21.55 17.73   xor8: 10.97  9.72   
640        fuse16: 25.95 17.75   fuse8: 13.10  9.75   xor16: 20.68 20.16   xor8: 10.44  9.44   
1280       fuse16: 22.57 17.48   fuse8: 11.35  9.43   xor16: 20.16 20.74   xor8: 10.13 10.79   
2560       fuse16: 22.49 17.44   fuse8: 11.28  9.41   xor16: 19.93 20.14   xor8:  9.99 10.60   
5120       fuse16: 20.84 17.32   fuse8: 10.44  9.28   xor16: 19.80 20.40   xor8:  9.91 10.71   
10240      fuse16: 20.02 17.26   fuse8: 10.02  9.23   xor16: 19.74 17.25   xor8:  9.88  9.21   
20480      fuse16: 20.01 17.25   fuse8: 10.01  9.22   xor16: 19.71 18.18   xor8:  9.86 10.71   
40960      fuse16: 20.01 17.25   fuse8: 10.00  9.22   xor16: 19.70 20.26   xor8:  9.85 10.69   
81920      fuse16: 19.20 17.20   fuse8:  9.60  9.17   xor16: 19.69 18.14   xor8:  9.84  9.21   
163840     fuse16: 18.80 17.18   fuse8:  9.40  9.14   xor16: 19.68 18.15   xor8:  9.84 10.76   
327680     fuse16: 18.40 17.15   fuse8:  9.20  9.12   xor16: 19.68 20.55   xor8:  9.84 10.81   
655360     fuse16: 18.20 17.14   fuse8:  9.10  9.11   xor16: 19.68 18.79   xor8:  9.84  9.20   
1310720    fuse16: 18.00 17.12   fuse8:  9.00  9.09   xor16: 19.68 18.93   xor8:  9.84 10.36   
2621440    fuse16: 18.00 17.12   fuse8:  9.00  9.09   xor16: 19.68 17.23   xor8:  9.84 10.46   
5242880    fuse16: 18.00 17.12   fuse8:  9.00  9.09   xor16: 19.68 20.37   xor8:  9.84  9.20   

@lemire
Member

lemire commented Jan 22, 2025

Merging!!!

@lemire lemire merged commit d3bb4e9 into FastFilter:master Jan 22, 2025
6 checks passed
@nigoroll
Contributor Author

Thank you, @lemire , I do fully agree with your changes to the README wording.
And thank you for writing a space benchmark, this is really helpful to make a more qualified judgement.

@nigoroll nigoroll deleted the binary_fuse8_pack branch January 22, 2025 08:07
@oschonrock
Contributor

oschonrock commented Jan 23, 2025

Sorry if I am a bit late here, it all happened a bit quickly in the end....

> @nigoroll This PR looks good to me.
>
> I consider that changing the definition of the structs is a breaking change... but given that it is well motivated, I think we can go forward. There was a prior request to bundle the size in the struct #63 by @oschonrock. At the time, I dismissed it because I viewed it as a metadata issue.

I noticed that the change to the structs (adding Size) is part of the new 1.2.0, but this is not reflected in binary_fuse(8|16)_(de)serialize(_header).

Do we want that? It's a breaking change, but then, we have done that already...

It could be argued that we should, because otherwise the ->Size value is either UB or wrong (i.e. zero) after deserialization, depending on whether the user zero-initialized their struct?

Also, for the new .Size properties to be useful for https://github.com/oschonrock/binfuse (which was the intent of #63) they would have to be serialized.

@lemire
Member

lemire commented Jan 23, 2025

@oschonrock Can you prepare a pull request ?

It would be a breaking change, but we can make it a 2.0.0 release.
