feat: Adds pre-trained dictionary support #247
Draft
hellobertrand wants to merge 34 commits into
Introduces specific error types for dictionary-related issues, such as missing, mismatched, or oversized dictionaries. These errors are essential for robust dictionary support.
Introduces a comprehensive API for creating, managing, and utilizing pre-trained dictionaries. This includes functions for training, saving to the new `.zxd` file format, loading, and computing unique dictionary IDs. Dictionaries enhance compression ratios for small, similar data by pre-filling the LZ77 sliding window. The dictionary ID is integrated into the ZXC file header, enabling decoders to verify the correct dictionary is provided at decompression time.
Extends `zxc_compress_opts_t` and `zxc_decompress_opts_t` with dictionary parameters. The ZXC file header is updated to include a dictionary ID and a flag, enabling decoders to verify the correct dictionary is used. The LZ77 compression engine is enhanced to seed its hash tables with dictionary content, while the decompressor is adjusted to logically pre-fill its sliding window, facilitating matches against dictionary data for improved compression ratios.
Prepares input buffers for dictionary-aware compression by prefixing data blocks with dictionary content. This allows the core compression logic to operate on the data block while referencing the dictionary. For decompression, a dictionary-prefixed bounce buffer facilitates dictionary lookups during decoding. Includes dictionary ID validation during file header processing to ensure correct dictionary usage.
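A minimal sketch of the prefixing idea described above, not the code in this PR: the dictionary and the block are laid out back to back, and a match search that starts at the first block byte is allowed to look at every earlier position, including the dictionary region. `demo_prefix_block` and `demo_longest_match` are hypothetical names, and the brute-force search stands in for the real hash-based match finder.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a malloc'd buffer laid out as [dict | block]. */
static unsigned char *demo_prefix_block(const unsigned char *dict, size_t dict_len,
                                        const unsigned char *block, size_t block_len)
{
    size_t total = dict_len + block_len;
    unsigned char *buf = malloc(total ? total : 1);
    if (!buf) return NULL;
    if (dict_len) memcpy(buf, dict, dict_len);
    if (block_len) memcpy(buf + dict_len, block, block_len);
    return buf;
}

/* Brute-force longest match for the data at `pos`, searching every earlier
 * position in the buffer (dictionary bytes included). Returns the match
 * length and stores the backward distance in *offset (0 if no match). */
static size_t demo_longest_match(const unsigned char *buf, size_t total,
                                 size_t pos, size_t *offset)
{
    size_t best = 0;
    *offset = 0;
    for (size_t cand = 0; cand < pos; cand++) {
        size_t len = 0;
        while (pos + len < total && buf[cand + len] == buf[pos + len]) len++;
        if (len > best) {
            best = len;
            *offset = pos - cand;
        }
    }
    return best;
}
```

With a dictionary of `"hello world"` and a block of `"world!"`, a search at the first block byte finds a 5-byte match at distance 5, which points back into the dictionary region even though no block data has been emitted yet.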
This suite validates the core dictionary functionalities, including:
- Saving and loading dictionaries using the `.zxd` format.
- Deterministic generation of dictionary IDs.
- Roundtrip compression/decompression with dictionaries across all levels.
- Correct error reporting for dictionary mismatches and missing dictionaries.
- Compatibility for decompressing non-dictionary-aware streams.

This ensures the dictionary feature is robust and correctly integrated.
Introduces a k-gram frequency based algorithm for `zxc_train_dict`, enabling users to generate effective dictionaries from sample data. This commit also fully integrates dictionary handling into the command-line interface (via the new `-D` option), block-level compression/decompression, and the seekable API. This completes the dictionary feature by providing the necessary training capability and ensuring dictionaries are correctly processed across all compression and decompression paths, including robust buffer management for dictionary-aware chunk processing.
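For readers unfamiliar with k-gram frequency training, here is a rough sketch of the counting and scoring steps. It is an illustration under assumptions, not the `zxc_train_dict` implementation: the k-gram length (4), the table size (12 bits), the multiplicative hash, and all `demo_` names are placeholders chosen for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_K 4                      /* assumed k-gram length in bytes */
#define DEMO_HT_BITS 12               /* assumed frequency-table size */
#define DEMO_HT_SIZE (1u << DEMO_HT_BITS)

/* Hash the DEMO_K bytes at p into a table slot (multiplicative hash). */
static uint32_t demo_hash_kgram(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return (v * 2654435761u) >> (32 - DEMO_HT_BITS);
}

/* Add every k-gram of one sample to the frequency table.
 * Colliding k-grams share a slot; counts saturate at UINT16_MAX. */
static void demo_count_kgrams(const unsigned char *sample, size_t len,
                              uint16_t freq[DEMO_HT_SIZE])
{
    for (size_t i = 0; i + DEMO_K <= len; i++) {
        uint32_t h = demo_hash_kgram(sample + i);
        if (freq[h] < UINT16_MAX) freq[h]++;
    }
}

/* Score a candidate segment as the summed frequency of its k-grams. */
static uint32_t demo_score_segment(const unsigned char *seg, size_t len,
                                   const uint16_t freq[DEMO_HT_SIZE])
{
    uint32_t score = 0;
    for (size_t i = 0; i + DEMO_K <= len; i++)
        score += freq[demo_hash_kgram(seg + i)];
    return score;
}
```

A trainer built this way would count k-grams over all samples, score fixed-size candidate segments, and copy the highest-scoring ones into the dictionary until the size budget is reached.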
This commit adds comprehensive documentation for the pre-trained dictionary feature. It formalizes the `.zxd` dictionary file format, updates the ZXC file header specification to include dictionary details, and provides user-facing guidance on training and using dictionaries in the `README.md`.
Correct buffer allocation for compression contexts to properly account for dictionaries, preventing potential overflows when dictionary content is prefixed to data blocks. Extend the valid match range in the decompressor to include dictionary data, resolving `ZXC_ERROR_BAD_OFFSET` for valid references into the dictionary. A new test case validates robust handling of large dictionaries with smaller blocks.
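The offset fix can be pictured as a widened bounds check. The sketch below is an assumption about the shape of that check, with a hypothetical `demo_offset_ok` helper; the actual decoder code and its error path (`ZXC_ERROR_BAD_OFFSET`) are not reproduced here.

```c
#include <stddef.h>

/* Returns 1 if a back-reference of `offset` bytes is valid once `produced`
 * bytes of the block have been decoded and `dict_len` dictionary bytes
 * logically precede the block; returns 0 otherwise. */
static int demo_offset_ok(size_t offset, size_t produced, size_t dict_len)
{
    if (offset == 0) return 0;            /* zero distance is never valid */
    if (offset <= produced) return 1;     /* stays inside decoded output */
    return offset - produced <= dict_len; /* may reach into the dictionary */
}
```

Without a dictionary (`dict_len == 0`) the check reduces to the usual `offset <= produced`, so an offset of 5 at the very start of a block is rejected; with an 11-byte dictionary the same offset is accepted.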
Refactors the dictionary header to use a 16-bit `zxc_hash16` checksum. This change standardizes the integrity check mechanism for dictionaries, aligning it with the method used for the main ZXC file header. The previous 32-bit CRC32 field is replaced by the 2-byte `CRC16` and a 2-byte reserved field to maintain overall header size.
Refines the `zxc_lz_seed_dict` function to use a sparse seeding strategy for the first half of the dictionary and a dense strategy for the second. This approach prioritizes seeding positions closer to the end of the dictionary, as they are more likely to produce shorter offsets and effective matches, enhancing overall compression efficiency. Also removes redundant comments in `zxc_dict.c` and adds `UNLIKELY` hints to dictionary loading error checks for minor performance guidance.
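One way to picture the sparse/dense split is shown below. This is a sketch only: the stride of 4 for the first half, the 10-bit table, the hash, and the `demo_` names are assumptions made for illustration, and `zxc_lz_seed_dict` may differ in all of them.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_HASH_BITS 10
#define DEMO_HASH_SIZE (1u << DEMO_HASH_BITS)
#define DEMO_MIN_MATCH 4       /* bytes hashed per position */
#define DEMO_SPARSE_STEP 4     /* assumed stride for the first half */
#define DEMO_EMPTY UINT32_MAX  /* marks an unused slot */

static uint32_t demo_hash4(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return (v * 2654435761u) >> (32 - DEMO_HASH_BITS);
}

/* Seed `table` with dictionary positions: every DEMO_SPARSE_STEP-th position
 * in the first half, every position in the second half. Later positions
 * overwrite earlier ones, so a slot always holds the entry closest to the
 * end of the dictionary. Returns the number of positions inserted. */
static size_t demo_seed_dict(const unsigned char *dict, size_t dict_len,
                             uint32_t table[DEMO_HASH_SIZE])
{
    size_t seeded = 0;
    for (size_t i = 0; i < DEMO_HASH_SIZE; i++) table[i] = DEMO_EMPTY;
    if (dict_len < DEMO_MIN_MATCH) return 0;

    size_t last = dict_len - DEMO_MIN_MATCH; /* last hashable position */
    size_t half = dict_len / 2;

    for (size_t i = 0; i < half && i <= last; i += DEMO_SPARSE_STEP) {
        table[demo_hash4(dict + i)] = (uint32_t)i;   /* sparse pass */
        seeded++;
    }
    for (size_t i = half; i <= last; i++) {
        table[demo_hash4(dict + i)] = (uint32_t)i;   /* dense pass */
        seeded++;
    }
    return seeded;
}
```

For a 32-byte dictionary this inserts 4 sparse and 13 dense positions, so the seeding cost drops while the bytes nearest the block (the ones reachable with the shortest offsets) are still fully indexed.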
Introduces `zxc_get_dict_id` to extract the dictionary ID from ZXC compressed buffers and `zxc_dict_get_id` for `.zxd` dictionary files. These functions allow quickly determining the dictionary used without full decompression or validation. The CLI's `zxc -l` (list) command is updated to display the dictionary ID for files compressed with a dictionary, or a dash for files without. This information is also included in the JSON output. This enhances the inspectability of ZXC files.
The new `--train-dict` option enables users to train custom dictionaries from multiple input files and save them in the `.zxd` format, completing the dictionary workflow in the command-line interface.
Introduces `fuzz_dict` to test the integrity of dictionary-based compression and decompression. This fuzzer uses the input to construct a dictionary and data, then verifies that compressing and decompressing with the dictionary results in identical output.
The conformance test suite now supports zxc archives that require an external dictionary for decompression. The `test_valid_vector` function is enhanced to automatically detect the dictionary ID from the zxc archive and search for a matching `.zxd` dictionary file in the same directory. If found, the dictionary is loaded and supplied to the decompressor. New valid vector test cases are added to validate dictionary-based compression.
…sion The `zxc_seek_mt_worker` function now allocates a thread-local "bounce buffer" for dictionary-based decompression. This buffer concatenates the dictionary content with the necessary working space for the decompressor, ensuring each worker can decompress chunks independently using the provided dictionary. A new test case `test_dict_seekable_mt_roundtrip` has been added to validate this functionality, covering full-range and sub-range multi-threaded decompression with a dictionary.
Previously, the dictionary test created a dictionary using a custom C helper compiled on the fly. This update replaces that with a direct call to the `zxc --train-dict` CLI command. This change makes the test more realistic by exercising the public command-line interface for dictionary training, and it removes the external dependency on `cc` for test execution, simplifying the test environment.
Ensures that dictionary memory, allocated as part of the new dictionary support, is properly released in error handling branches within the compression/decompression functions and upon CLI exit. This prevents memory leaks.
Doubles the buffer size to accommodate longer directory and file names, preventing potential path truncation issues during dictionary file discovery.
The `ctx` parameter is present for future dictionary context usage in `zxc_encode_block_num` but is currently unused. This silences a compiler warning.
Introduces a validation step for paths used by the dictionary and compression input files. This prevents issues from malformed paths and improves the CLI's robustness. Invalid dictionary paths now cause an immediate exit, while invalid input files are skipped with a warning.
Ensures that the `dict_input` memory is properly freed when the number of blocks exceeds the maximum allowed, preventing a memory leak in this error path during compression.
On Unix-like systems, switch from `fopen` to `open` and `fdopen` when creating the training dictionary. This enables explicit control over file permissions, ensuring the dictionary file is created with predictable and secure access rights (owner R/W, group R, others R).
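The `open` plus `fdopen` pattern looks roughly like this. It is a sketch using standard POSIX calls, with a hypothetical `demo_create_0644` helper; the flags and error handling in the CLI itself may differ.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create (or truncate) `path` for writing with mode 0644:
 * owner read/write, group read, others read. The process umask can
 * still remove bits from this mode. Returns NULL on failure. */
static FILE *demo_create_0644(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC,
                  S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);
    if (fd < 0) return NULL;

    FILE *f = fdopen(fd, "wb");
    if (!f) close(fd); /* avoid leaking the descriptor if wrapping fails */
    return f;
}
```

Plain `fopen` requests mode 0666 before the umask is applied, so going through `open` is what lets the caller cap the permissions at 0644 regardless of how permissive the umask is.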
Adds LCOV exclusion markers to various error handling and memory allocation failure paths across the library. These paths are often difficult to trigger reliably in unit tests but represent valid error conditions that are correctly handled. Excluding them ensures more accurate code coverage metrics without skewing results for untestable conditions.
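As an illustration of what such markers look like, the sketch below uses lcov's standard `LCOV_EXCL_LINE` and `LCOV_EXCL_START`/`LCOV_EXCL_STOP` comments on allocation-failure branches. The `demo_dup` and `demo_dup_checked` functions are made up for this example and are not taken from the library.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Copy `len` bytes into a fresh allocation; NULL on allocation failure. */
static unsigned char *demo_dup(const unsigned char *src, size_t len)
{
    unsigned char *copy = malloc(len ? len : 1);
    if (!copy) return NULL; /* LCOV_EXCL_LINE */
    if (len) memcpy(copy, src, len);
    return copy;
}

/* Same idea with a multi-line error branch excluded as a region. */
static int demo_dup_checked(const unsigned char *src, size_t len,
                            unsigned char **out)
{
    *out = demo_dup(src, len);
    /* LCOV_EXCL_START */
    if (!*out) {
        return -1;
    }
    /* LCOV_EXCL_STOP */
    return 0;
}
```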
Replaces `libc`'s `qsort` with an in-place heapsort implementation for sorting dictionary segments during training. This change removes a dependency on `libc`, making the library more suitable for freestanding environments. It also ensures deterministic dictionary output across different platforms and `libc` versions by providing a fixed sorting algorithm.
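A compact in-place heapsort along these lines is sketched below. The `demo_seg_t` layout, the descending-score order, and the tie-break on position are assumptions for illustration; the tie-break is included because heapsort is not stable, and a total order is what makes the output independent of the input permutation.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t score; /* hypothetical segment score */
    uint32_t pos;   /* hypothetical segment position, used as tie-break */
} demo_seg_t;

/* Nonzero if `a` must come after `b`: lower score first loses,
 * equal scores are ordered by ascending position. */
static int demo_seg_after(const demo_seg_t *a, const demo_seg_t *b)
{
    if (a->score != b->score) return a->score < b->score;
    return a->pos > b->pos;
}

static void demo_swap(demo_seg_t *a, demo_seg_t *b)
{
    demo_seg_t t = *a;
    *a = *b;
    *b = t;
}

/* Restore the heap property below `root` within v[0..n). */
static void demo_sift_down(demo_seg_t *v, size_t root, size_t n)
{
    for (;;) {
        size_t child = 2 * root + 1;
        if (child >= n) return;
        if (child + 1 < n && demo_seg_after(&v[child + 1], &v[child])) child++;
        if (!demo_seg_after(&v[child], &v[root])) return;
        demo_swap(&v[root], &v[child]);
        root = child;
    }
}

/* Sort v[0..n) by descending score, ascending position, without libc. */
static void demo_heapsort(demo_seg_t *v, size_t n)
{
    if (n < 2) return;
    for (size_t i = n / 2; i-- > 0;) demo_sift_down(v, i, n);
    for (size_t end = n - 1; end > 0; end--) {
        demo_swap(&v[0], &v[end]); /* move the "last" element into place */
        demo_sift_down(v, 0, end);
    }
}
```

Heapsort needs no auxiliary memory and has an O(n log n) worst case, which fits a freestanding build where neither `qsort` nor `malloc` can be assumed.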
This test verifies `zxc_compress_block` and `zxc_decompress_block` functionality with a dictionary, covering all compression levels from 1 to 6. It ensures data integrity and correct behavior for block-level dictionary-based compression and decompression.
Relocates `ZXC_DICT_KGRAM_LEN`, `ZXC_DICT_HT_BITS`, and `ZXC_DICT_HT_SIZE` from `zxc_dict.c` to `zxc_internal.h`. This centralizes shared dictionary-related constants, making them accessible throughout the library and improving code organization.
Adds a new section to `FORMAT.md` that details the `.zxd` dictionary file format. This includes the unique magic word for dictionary files and a comprehensive worked example with a hexdump and byte-level decoding of the dictionary header and content. This provides essential guidance for implementers working with zxc dictionaries.
The default lzbench repository does not include necessary modifications to test zxc's dictionary features. This temporary change allows the benchmark workflow to use a custom fork with dictionary support for ongoing development.
Ensures core dictionary logic is compiled for the Rust wrapper.
Refines the dictionary training algorithm to avoid redundant patterns. When a segment is added to the dictionary, its k-grams are marked as covered in the frequency table. This ensures that subsequent dictionary picks prioritize novel patterns, maximizing the dictionary's coverage of the corpus with unique content.
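The coverage step can be sketched as zeroing the table entries of every k-gram in a picked segment, so that rescoring an overlapping candidate yields no credit for patterns the dictionary already holds. The table size, hash, and `demo_cov_` names below are assumptions for illustration, not the trainer's actual code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_COV_K 4                     /* assumed k-gram length */
#define DEMO_COV_BITS 12                 /* assumed table size */
#define DEMO_COV_SLOTS (1u << DEMO_COV_BITS)

static uint32_t demo_cov_slot(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return (v * 2654435761u) >> (32 - DEMO_COV_BITS);
}

/* Summed frequency of the k-grams in a candidate segment. */
static uint32_t demo_cov_score(const unsigned char *seg, size_t len,
                               const uint16_t *freq)
{
    uint32_t score = 0;
    for (size_t i = 0; i + DEMO_COV_K <= len; i++)
        score += freq[demo_cov_slot(seg + i)];
    return score;
}

/* After picking `seg`, mark its k-grams as covered by zeroing their
 * counts, so later picks are scored only on patterns not yet included. */
static void demo_cov_mark(const unsigned char *seg, size_t len, uint16_t *freq)
{
    for (size_t i = 0; i + DEMO_COV_K <= len; i++)
        freq[demo_cov_slot(seg + i)] = 0;
}
```

In a greedy loop, the trainer would pick the best-scoring segment, call the marking step, rescore the remaining candidates, and repeat, which steers later picks toward content that earlier picks did not contain.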
Enables pre-trained dictionaries to significantly improve compression ratios for small, similar payloads by prefilling the LZ77 sliding window.
- `zxc_dict.h` API for dictionary training, serialization (`.zxd` format), loading, and ID generation.
- A CLI option (`-D`/`--dict`) to specify a dictionary for compression/decompression, and a new command (`--train-dict`) for generating dictionaries from a corpus of samples. The `list` command now displays dictionary IDs.
- Error codes `DICT_REQUIRED`, `DICT_MISMATCH`, and `DICT_TOO_LARGE`.