Compression levels adjustment #166

@tansy

I noticed that in version 1.9 the compression levels "overlap": some of them produce basically the same result.
I ran it on the silesia corpus*, and this is the result:

Compressor name Compr. size Ratio Filename
memcpy 211957760 100.00 silesia.tar
libdeflate 1.9 -1 73503035 34.68 silesia.tar
libdeflate 1.9 -2 71070103 33.53 silesia.tar
libdeflate 1.9 -3 70170668 33.11 silesia.tar
libdeflate 1.9 -4 69471739 32.78 silesia.tar
libdeflate 1.9 -5 68171764 32.16 silesia.tar
libdeflate 1.9 -6 67510595 31.85 silesia.tar
libdeflate 1.9 -7 67141683 31.68 silesia.tar
libdeflate 1.9 -8 66766242 31.50 silesia.tar
libdeflate 1.9 -9 66716614 31.48 silesia.tar
libdeflate 1.9 -10 64786046 30.57 silesia.tar
libdeflate 1.9 -11 64710756 30.53 silesia.tar
libdeflate 1.9 -12 64687172 30.52 silesia.tar

Compression at levels 8 and 9, and at levels 11 and 12, is almost the same: ratio differences of 0.02 and 0.01 percentage points are hardly noticeable. Level 10 is not much different from 11 either. The difference across levels 10-12 is ~0.05 points, while the gap between 9 and 10 is almost 1 point (0.91).
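
As a quick check of those gaps, here is a minimal Python sketch that computes the ratio drop between consecutive levels. The ratios are copied from the 1.9 table above; the names `ratios_v19` and `adjacent_gaps` are only illustrative and are not part of libdeflate or lzbench.

```python
# Ratios (%) for libdeflate 1.9 on silesia.tar, copied from the table above.
ratios_v19 = {
    1: 34.68, 2: 33.53, 3: 33.11, 4: 32.78, 5: 32.16, 6: 31.85,
    7: 31.68, 8: 31.50, 9: 31.48, 10: 30.57, 11: 30.53, 12: 30.52,
}

def adjacent_gaps(ratios):
    """Return {(level, next_level): ratio drop in percentage points}."""
    levels = sorted(ratios)
    return {
        (a, b): round(ratios[a] - ratios[b], 2)
        for a, b in zip(levels, levels[1:])
    }

gaps = adjacent_gaps(ratios_v19)
for (a, b), gap in gaps.items():
    print(f"{a:>2} -> {b:<2}: {gap:.2f}")
```

The 8 -> 9 and 11 -> 12 steps come out at 0.02 and 0.01, while 9 -> 10 is 0.91, which is the unevenness described above.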

I decided to leave levels 6, 9, and 12 as they are and spread the levels in between by ratio, partly because level 9 is now the boundary between the lazy/2 and near_optimal algorithms. At first I thought an even spread would be good, but then I realised that a "logarithmic" spread, or something like it, would be better because it resembles the existing progression. So I calculated the new target ratios at 1/2, 3/4 (and 1) of the gap between levels 6 and 9 (31.85 - 31.48) and between levels 9 and 12 (31.48 - 30.52).

level               x      x+1    x+2    y
f in x + f*(y-x)    0      1/2    3/4    1

level               6      7      8      9
ratio               31.85  31.66  31.57  31.48

level               9      10     11     12
ratio               31.48  31.00  30.76  30.52

That would be the "ideal" to aim for.
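
The target ratios can be reproduced with a few lines of Python. This is only a sketch of the arithmetic described above: the function name `target_ratios` is made up for illustration, and the anchor ratios are the 1.9 values for levels 6, 9 and 12.

```python
def target_ratios(start, end, fractions=(0.0, 0.5, 0.75, 1.0)):
    """Place levels at the given fractions of the gap between two anchor ratios."""
    return [start + f * (end - start) for f in fractions]

# Anchors: level 6 = 31.85, level 9 = 31.48, level 12 = 30.52 (libdeflate 1.9).
targets_6_9 = target_ratios(31.85, 31.48)
targets_9_12 = target_ratios(31.48, 30.52)

print("levels 6-9: ", targets_6_9)
print("levels 9-12:", targets_9_12)
```

The unrounded targets for levels 7 and 8 are 31.665 and 31.5725, which appear in the table as 31.66 and 31.57; for levels 10 and 11 they are 31.00 and 30.76.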

First I took the 9-12 range and checked what the results were for v1.8. After some tweaking I got to the point where they matched almost perfectly. Levels 6-9 were not that easy, but I brought them to an acceptable point. Now the results look like this:

Compressor name Compr. size Ratio Filename
memcpy 211957760 100.00 silesia.tar
libdeflate 1.10-1 -1 73503035 34.68 silesia.tar
libdeflate 1.10-1 -2 71070103 33.53 silesia.tar
libdeflate 1.10-1 -3 70170668 33.11 silesia.tar
libdeflate 1.10-1 -4 69471739 32.78 silesia.tar
libdeflate 1.10-1 -5 68171764 32.16 silesia.tar
libdeflate 1.10-1 -6 67510595 31.85 silesia.tar
libdeflate 1.10-1 -7 67155164 31.68 silesia.tar
libdeflate 1.10-1 -8 66850226 31.54 silesia.tar
libdeflate 1.10-1 -9 66716614 31.48 silesia.tar
libdeflate 1.10-1 -10 65724812 31.01 silesia.tar
libdeflate 1.10-1 -11 65030245 30.68 silesia.tar
libdeflate 1.10-1 -12 64685969 30.52 silesia.tar

Deltas calculated for it are as follows:

lv 6 7 8 9
d(6-9) 0 0.46 0.84 1
lv 9 10 11 12
d(9-12) 0 0.49 0.83 1
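
For reference, these deltas are each level's position inside its range, normalized so that the first level is 0 and the last is 1. A minimal Python sketch of that calculation, using the ratios from the 1.10-1 table above (the names `new_ratios` and `deltas` are illustrative only):

```python
# Ratios (%) for the adjusted levels 6-12, copied from the 1.10-1 table above.
new_ratios = {6: 31.85, 7: 31.68, 8: 31.54, 9: 31.48,
              10: 31.01, 11: 30.68, 12: 30.52}

def deltas(ratios, lo, hi):
    """Fraction of the lo..hi ratio gap covered at each level, rounded to 2 places."""
    span = ratios[lo] - ratios[hi]
    return [round((ratios[lo] - ratios[lv]) / span, 2) for lv in range(lo, hi + 1)]

d_6_9 = deltas(new_ratios, 6, 9)
d_9_12 = deltas(new_ratios, 9, 12)

print("d(6-9): ", d_6_9)
print("d(9-12):", d_9_12)
```

Compared with the targets of 0, 1/2, 3/4 and 1, the measured positions are 0.46 and 0.84 for levels 7 and 8, and 0.49 and 0.83 for levels 10 and 11.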

The results are spread a bit more "evenly", and the gap between 9 and 10 is halved.

Results are similar for other data sets.

I think you should consider adjusting the compression levels to this or something similar.

Here are the changes to deflate_compress.c as a diff. I will open a pull request if you are interested.


* I chose it because it is a diverse, non-homogeneous, relatively big corpus that resembles real-life data; imo it is the best choice for a general-purpose compressor. I tested other corpora available on the net and the results were very similar, almost the same.
They include enwik, lukas medical images and my "own" sets, namely app (mozilla, the 64-bit executables from the silesia corpus; google earth 32-bit for windows; and firefox for linux), png-dec (a bunch of decompressed png images) and html/css/js (a bunch of site styles and scripts; something that imitates html pages).

** To produce the results I used lzbench with libdeflate-1.9 support.
