
feat: Adds pre-trained dictionary support#247

Draft
hellobertrand wants to merge 34 commits into main from feat/dictionary-support


@hellobertrand
Owner

Enables pre-trained dictionaries to significantly improve compression ratios for small, similar payloads by prefilling the LZ77 sliding window.

  • New API: Introduces a dedicated zxc_dict.h API for dictionary training, serialization (.zxd format), loading, and ID generation.
  • File Format & CLI: Updates the ZXC file format to include a dictionary ID in the header. Adds CLI options (-D/--dict) to specify a dictionary for compression/decompression, and a new command (--train-dict) for generating dictionaries from a corpus of samples. The list command now displays dictionary IDs.
  • Core Integration: Integrates dictionary handling into every compression and decompression interface (buffer, streaming, and seekable), in both single-threaded and multi-threaded operation.
  • Error Handling: Introduces new error codes for dictionary-related issues, such as DICT_REQUIRED, DICT_MISMATCH, and DICT_TOO_LARGE.
  • Infrastructure: Updates build systems, fuzzing targets, and conformance tests to thoroughly validate dictionary functionality.

Comment thread src/cli/main.c Fixed
Comment thread src/cli/main.c Fixed
hellobertrand changed the title from "Adds pre-trained dictionary support" to "feat: Adds pre-trained dictionary support" May 27, 2026
@github-advanced-security
Contributor

You are seeing this message because GitHub Code Scanning has recently been set up for this repository, or this pull request contains the workflow file for the Code Scanning tool.

What Enabling Code Scanning Means:

  • The 'Security' tab will display more code scanning analysis results (e.g., for the default branch).
  • Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results.
  • You will be able to see the analysis results for the pull request's branch on this overview once the scans have completed and the checks have passed.

For more information about GitHub Code Scanning, check out the documentation.

@codecov

codecov Bot commented May 27, 2026

Codecov Report

❌ Patch coverage is 99.33555% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/lib/zxc_dispatch.c | 97.59% | 2 Missing ⚠️ |


hellobertrand force-pushed the feat/dictionary-support branch 2 times, most recently from baecc25 to a7c6676 May 28, 2026 20:35
hellobertrand marked this pull request as ready for review May 28, 2026 20:41
Introduces specific error types for dictionary-related issues, such as missing, mismatched, or oversized dictionaries. These errors are essential for robust dictionary support.
Introduces a comprehensive API for creating, managing, and utilizing pre-trained dictionaries. This includes functions for training, saving to the new `.zxd` file format, loading, and computing unique dictionary IDs.

Dictionaries enhance compression ratios for small, similar data by pre-filling the LZ77 sliding window. The dictionary ID is integrated into the ZXC file header, enabling decoders to verify the correct dictionary is provided at decompression time.
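
The verification step described above can be sketched as follows. This is only an illustration: the `demo_` names, the 32-bit ID width, and the numeric error values are assumptions, not the actual `zxc` API. It also assumes that a stream compressed without a dictionary decodes normally even when a dictionary is supplied.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes. The PR names DICT_REQUIRED and DICT_MISMATCH,
 * but these identifiers and values are assumptions made for this sketch. */
typedef enum {
    DEMO_OK = 0,
    DEMO_ERR_DICT_REQUIRED = -1,
    DEMO_ERR_DICT_MISMATCH = -2
} demo_status_t;

/* Hypothetical view of the two header fields involved: a flag and an ID. */
typedef struct {
    int      has_dict;  /* non-zero if the stream was compressed with a dictionary */
    uint32_t dict_id;   /* ID recorded by the compressor (width is an assumption) */
} demo_header_t;

/* Compare what the header demands with what the caller supplied.
 * dict == NULL means the caller passed no dictionary. */
static demo_status_t demo_check_dict(const demo_header_t *hdr,
                                     const void *dict, uint32_t dict_id)
{
    assert(hdr != NULL);
    if (!hdr->has_dict)
        return DEMO_OK;                  /* nothing to verify */
    if (dict == NULL)
        return DEMO_ERR_DICT_REQUIRED;   /* stream needs a dictionary, none given */
    if (dict_id != hdr->dict_id)
        return DEMO_ERR_DICT_MISMATCH;   /* wrong dictionary for this stream */
    return DEMO_OK;
}
```

Doing this check while parsing the header lets a decoder reject a wrong or missing dictionary before it decodes any block.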
Extends `zxc_compress_opts_t` and `zxc_decompress_opts_t` with dictionary parameters. The ZXC file header is updated to include a dictionary ID and a flag, enabling decoders to verify the correct dictionary is used. The LZ77 compression engine is enhanced to seed its hash tables with dictionary content, while the decompressor is adjusted to logically pre-fill its sliding window, facilitating matches against dictionary data for improved compression ratios.
Prepares input buffers for dictionary-aware compression by prefixing data blocks with dictionary content. This allows the core compression logic to operate on the data block while referencing the dictionary.

For decompression, a dictionary-prefixed bounce buffer facilitates dictionary lookups during decoding. Includes dictionary ID validation during file header processing to ensure correct dictionary usage.
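
To illustrate why prefixing helps, here is a minimal sketch that copies a dictionary in front of a block and then searches for the longest match at the first block byte. The function names are hypothetical, and the brute-force search merely stands in for the real hash-based match finder.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the dictionary, then the block, into one contiguous buffer.
 * Returns the offset at which the block starts (equal to dict_len). */
static size_t demo_build_prefixed(unsigned char *buf, size_t cap,
                                  const unsigned char *dict, size_t dict_len,
                                  const unsigned char *block, size_t block_len)
{
    assert(dict_len <= cap && block_len <= cap - dict_len);
    memcpy(buf, dict, dict_len);
    memcpy(buf + dict_len, block, block_len);
    return dict_len;
}

/* Brute-force longest match for position pos against every earlier position
 * in buf, dictionary bytes included. Stores the backward distance in *offset
 * (0 if no match was found) and returns the match length. */
static size_t demo_longest_match(const unsigned char *buf, size_t total,
                                 size_t pos, size_t *offset)
{
    size_t best = 0;
    *offset = 0;
    for (size_t cand = 0; cand < pos; cand++) {
        size_t len = 0;
        while (pos + len < total && buf[cand + len] == buf[pos + len])
            len++;
        if (len > best) {
            best = len;
            *offset = pos - cand;
        }
    }
    return best;
}
```

With the dictionary `GET /api/v1/` and the block `GET /api/v1/users`, the first block byte already has a 12-byte match at distance 12. Without the prefix, nothing precedes that byte, so no match is possible.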
This suite validates the core dictionary functionalities, including:
- Saving and loading dictionaries using the `.zxd` format.
- Deterministic generation of dictionary IDs.
- Roundtrip compression/decompression with dictionaries across all levels.
- Correct error reporting for dictionary mismatches and missing dictionaries.
- Compatibility for decompressing non-dictionary-aware streams.

This ensures the dictionary feature is robust and correctly integrated.
Introduces a k-gram frequency based algorithm for `zxc_train_dict`, enabling users to generate effective dictionaries from sample data. This commit also fully integrates dictionary handling into the command-line interface (via the new `-D` option), block-level compression/decompression, and the seekable API.

This completes the dictionary feature by providing the necessary training capability and ensuring dictionaries are correctly processed across all compression and decompression paths, including robust buffer management for dictionary-aware chunk processing.
This commit adds comprehensive documentation for the pre-trained dictionary
feature. It formalizes the `.zxd` dictionary file format, updates the ZXC
file header specification to include dictionary details, and provides
user-facing guidance on training and using dictionaries in the `README.md`.
Correct buffer allocation for compression contexts to properly account for
dictionaries, preventing potential overflows when dictionary content is
prefixed to data blocks. Extend the valid match range in the decompressor
to include dictionary data, resolving `ZXC_ERROR_BAD_OFFSET` for valid
references into the dictionary. A new test case validates robust handling
of large dictionaries with smaller blocks.
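
The widened range check can be sketched like this, assuming the dictionary is treated as `dict_len` bytes that logically precede the decoded output. The function name is hypothetical and the real decoder's check may be structured differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if a back-reference of `offset` bytes, issued at output position
 * `pos`, stays inside the data available to the decoder. With dict_len == 0
 * this is the classic "offset <= bytes already written" rule. */
static int demo_offset_is_valid(size_t offset, size_t pos, size_t dict_len)
{
    assert(dict_len <= SIZE_MAX - pos);  /* pos + dict_len must not overflow */
    return offset >= 1 && offset <= pos + dict_len;
}
```

For example, an offset of 5 at output position 3 is rejected without a dictionary, but accepted with an 8-byte dictionary because the reference lands 2 bytes before the start of the output, inside the dictionary.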
Refactors the dictionary header to use a 16-bit `zxc_hash16` checksum. This change standardizes the integrity check mechanism for dictionaries, aligning it with the method used for the main ZXC file header. The previous 32-bit CRC32 field is replaced by the 2-byte `CRC16` and a 2-byte reserved field to maintain overall header size.
Refines the `zxc_lz_seed_dict` function to use a sparse seeding strategy
for the first half of the dictionary and a dense strategy for the second.
This approach prioritizes seeding positions closer to the end of the
dictionary, as they are more likely to produce shorter offsets and
effective matches, enhancing overall compression efficiency.

Also removes redundant comments in `zxc_dict.c` and adds `UNLIKELY` hints
to dictionary loading error checks for minor performance guidance.
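
A minimal sketch of the sparse-then-dense seeding idea follows. It is not the code of `zxc_lz_seed_dict`: the stride of 2 for the first half, the table size, the 4-byte hash, and all `demo_` names are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_HT_BITS   12
#define DEMO_HT_SIZE   (1u << DEMO_HT_BITS)
#define DEMO_MIN_MATCH 4

/* Multiplicative hash of 4 bytes, reduced to DEMO_HT_BITS bits. */
static uint32_t demo_hash4(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);  /* alignment-safe 4-byte load */
    return (v * 2654435761u) >> (32 - DEMO_HT_BITS);
}

/* Seed table[] with dictionary positions: every 2nd position in the first
 * half (sparse), every position in the second half (dense). Entries store
 * position + 1 so that 0 means "empty". Returns the number of insertions. */
static size_t demo_seed_dict(uint32_t *table, const unsigned char *dict,
                             size_t dict_len)
{
    assert(table != NULL && dict != NULL && dict_len < UINT32_MAX);
    if (dict_len < DEMO_MIN_MATCH)
        return 0;
    size_t last = dict_len - DEMO_MIN_MATCH;  /* last position with 4 readable bytes */
    size_t half = dict_len / 2;
    size_t inserted = 0;
    size_t i = 0;
    while (i <= last) {
        table[demo_hash4(dict + i)] = (uint32_t)(i + 1);
        inserted++;
        i += (i < half) ? 2 : 1;
    }
    return inserted;
}
```

Because later insertions overwrite earlier ones in the same bucket, positions near the end of the dictionary win collisions, which matches the stated preference for shorter offsets.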
Introduces `zxc_get_dict_id` to extract the dictionary ID from ZXC compressed buffers and `zxc_dict_get_id` for `.zxd` dictionary files. These functions allow quickly determining the dictionary used without full decompression or validation.

The CLI's `zxc -l` (list) command is updated to display the dictionary ID for files compressed with a dictionary, or a dash for files without. This information is also included in the JSON output. This enhances the inspectability of ZXC files.
This new option enables users to train custom dictionaries from multiple input files and save them in the `.zxd` format, completing the dictionary workflow in the command-line interface.
Introduces `fuzz_dict` to test the integrity of dictionary-based
compression and decompression. This fuzzer uses the input to construct
a dictionary and data, then verifies that compressing and decompressing
with the dictionary results in identical output.
The conformance test suite now supports zxc archives that require an external dictionary for decompression. The `test_valid_vector` function is enhanced to automatically detect the dictionary ID from the zxc archive and search for a matching `.zxd` dictionary file in the same directory. If found, the dictionary is loaded and supplied to the decompressor.

New valid vector test cases are added to validate dictionary-based compression.
…sion

The `zxc_seek_mt_worker` function now allocates a thread-local "bounce buffer" for dictionary-based decompression. This buffer concatenates the dictionary content with the necessary working space for the decompressor, ensuring each worker can decompress chunks independently using the provided dictionary.

A new test case `test_dict_seekable_mt_roundtrip` has been added to validate this functionality, covering full-range and sub-range multi-threaded decompression with a dictionary.
Previously, the dictionary test created a dictionary using a custom C helper compiled on the fly. This update replaces that with a direct call to the `zxc --train-dict` CLI command.

This change makes the test more realistic by exercising the public command-line interface for dictionary training. It also removes the test's dependency on `cc`, simplifying the test environment.
Ensures that dictionary memory, allocated as part of the new dictionary support, is properly released in error handling branches within the compression/decompression functions and upon CLI exit. This prevents memory leaks.
Doubles the buffer size to accommodate longer directory and file names, preventing potential path truncation issues during dictionary file discovery.
The `ctx` parameter is present for future dictionary context usage in `zxc_encode_block_num` but is currently unused. This silences a compiler warning.
Introduces a validation step for paths used by the dictionary and compression input files. This prevents issues from malformed paths and improves the CLI's robustness. Invalid dictionary paths now cause an immediate exit, while invalid input files are skipped with a warning.
Ensures that the `dict_input` memory is properly freed when the number of blocks exceeds the maximum allowed, preventing a memory leak in this error path during compression.
On Unix-like systems, switch from `fopen` to `open` and `fdopen` when writing the trained dictionary file. This enables explicit control over file permissions, ensuring the dictionary file is created with predictable and secure access rights (owner R/W, group R, others R).
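
The `open` plus `fdopen` pattern looks roughly like the sketch below. The helper name and the exact flag set are assumptions, and the mode passed to `open` is still filtered by the process umask.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create (or truncate) `path` with mode 0644 (rw-r--r--) and wrap the
 * descriptor in a FILE* for buffered writes. Returns NULL on failure. */
static FILE *demo_create_0644(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC,
                  S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);
    if (fd < 0)
        return NULL;
    FILE *f = fdopen(fd, "wb");
    if (f == NULL)
        close(fd);  /* do not leak the descriptor if wrapping fails */
    return f;
}
```

On success, `fclose` on the returned stream also closes the underlying descriptor, so the caller does not call `close` separately.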
Adds LCOV exclusion markers to various error handling and memory allocation failure paths across the library. These paths are often difficult to trigger reliably in unit tests but represent valid error conditions that are correctly handled. Excluding them ensures more accurate code coverage metrics without skewing results for untestable conditions.
Replaces `libc`'s `qsort` with an in-place heapsort implementation for sorting dictionary segments during training. This change removes a dependency on `libc`, making the library more suitable for freestanding environments. It also ensures deterministic dictionary output across different platforms and `libc` versions by providing a fixed sorting algorithm.
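
As an illustration of the approach, here is a small in-place heapsort over a hypothetical segment record. The struct layout and the ordering (higher score first, ties broken by lower position) are assumptions. The tie-break makes the comparison a total order, so the output is fully determined even though heapsort is not stable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical segment record used only for this sketch. */
typedef struct {
    uint32_t score;
    uint32_t pos;
} demo_seg_t;

/* Returns 1 if a must come before b in the sorted output. */
static int demo_seg_before(const demo_seg_t *a, const demo_seg_t *b)
{
    if (a->score != b->score)
        return a->score > b->score;
    return a->pos < b->pos;
}

/* Restore the heap property below `root` in v[0..n). The heap keeps the
 * element that sorts LAST at the root. */
static void demo_sift_down(demo_seg_t *v, size_t root, size_t n)
{
    for (;;) {
        size_t child = 2 * root + 1;
        if (child >= n)
            return;
        if (child + 1 < n && demo_seg_before(&v[child], &v[child + 1]))
            child++;  /* pick the child that sorts later */
        if (!demo_seg_before(&v[root], &v[child]))
            return;
        demo_seg_t tmp = v[root];
        v[root] = v[child];
        v[child] = tmp;
        root = child;
    }
}

/* In-place heapsort: no libc calls, O(n log n), no extra memory. */
static void demo_heapsort(demo_seg_t *v, size_t n)
{
    assert(v != NULL || n == 0);
    if (n < 2)
        return;
    for (size_t i = n / 2; i-- > 0; )
        demo_sift_down(v, i, n);
    for (size_t end = n - 1; end > 0; end--) {
        demo_seg_t tmp = v[0];
        v[0] = v[end];
        v[end] = tmp;
        demo_sift_down(v, 0, end);
    }
}
```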
This test verifies `zxc_compress_block` and `zxc_decompress_block`
functionality with a dictionary, covering all compression levels from 1 to 6.
It ensures data integrity and correct behavior for block-level
dictionary-based compression and decompression.
Relocates `ZXC_DICT_KGRAM_LEN`, `ZXC_DICT_HT_BITS`, and `ZXC_DICT_HT_SIZE`
from `zxc_dict.c` to `zxc_internal.h`. This centralizes shared
dictionary-related constants, making them accessible throughout the library
and improving code organization.
Adds a new section to `FORMAT.md` that details the `.zxd` dictionary
file format. This includes the unique magic word for dictionary files
and a comprehensive worked example with a hexdump and byte-level
decoding of the dictionary header and content. This provides essential
guidance for implementers working with zxc dictionaries.
The default lzbench repository does not include necessary modifications to test zxc's
dictionary features. This temporary change allows the benchmark workflow to use
a custom fork with dictionary support for ongoing development.
Ensures core dictionary logic is compiled for the Rust wrapper.
hellobertrand force-pushed the feat/dictionary-support branch from caf1cc9 to 36ff5d2 May 29, 2026 07:04
Refines the dictionary training algorithm to avoid redundant patterns.

When a segment is added to the dictionary, its k-grams are marked as covered in the frequency table. This ensures that subsequent dictionary picks prioritize novel patterns, maximizing the dictionary's coverage of the corpus with unique content.
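
The coverage idea can be sketched as below. This is a deliberately simplified model, not the `zxc_train_dict` algorithm: it uses 2-byte grams with a direct-indexed count table (so there are no hash collisions), fixed-size non-overlapping candidate segments, and hypothetical `demo_` names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_K          2        /* gram length used by this sketch */
#define DEMO_TABLE_SIZE 65536u   /* one counter per possible 2-byte gram */

static size_t demo_gram_index(const unsigned char *p)
{
    return ((size_t)p[0] << 8) | (size_t)p[1];
}

/* Count every gram of the corpus. */
static void demo_count_grams(uint32_t *freq, const unsigned char *s, size_t n)
{
    for (size_t i = 0; i + DEMO_K <= n; i++)
        freq[demo_gram_index(s + i)]++;
}

/* Score of a segment: sum of the current counts of the grams it contains. */
static uint32_t demo_score(const uint32_t *freq, const unsigned char *seg,
                           size_t len)
{
    uint32_t score = 0;
    for (size_t i = 0; i + DEMO_K <= len; i++)
        score += freq[demo_gram_index(seg + i)];
    return score;
}

/* Pick the best-scoring segment, then zero the counts of its grams so that
 * later picks are not rewarded for repeating the same content.
 * Returns the start offset of the chosen segment. */
static size_t demo_pick_segment(uint32_t *freq, const unsigned char *corpus,
                                size_t n, size_t seg_len)
{
    assert(seg_len >= DEMO_K && seg_len <= n);
    size_t best = 0;
    uint32_t best_score = 0;
    for (size_t s = 0; s + seg_len <= n; s += seg_len) {
        uint32_t sc = demo_score(freq, corpus + s, seg_len);
        if (sc > best_score) {
            best_score = sc;
            best = s;
        }
    }
    for (size_t i = 0; i + DEMO_K <= seg_len; i++)
        freq[demo_gram_index(corpus + best + i)] = 0;  /* mark as covered */
    return best;
}
```

On the corpus `AAAAAAAABBBBBBBBAAAAAAAACCCCCCCC` with 8-byte segments, the first pick is the `A` run at offset 0. The second pick skips the duplicate `A` run at offset 16, whose grams are now covered, and takes the `B` run at offset 8.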
hellobertrand marked this pull request as draft May 29, 2026 09:30
2 participants