feat: Adds pre-trained dictionary support by hellobertrand · Pull Request #247 · hellobertrand/zxc

hellobertrand · 2026-05-27T15:59:55Z

Enables pre-trained dictionaries to significantly improve compression ratios for small, similar payloads by prefilling the LZ77 sliding window.

New API: Introduces a dedicated zxc_dict.h API for dictionary training, serialization (.zxd format), loading, and ID generation.
File Format & CLI: Updates the ZXC file format to include a dictionary ID in the header. Adds CLI options (-D/--dict) to specify a dictionary for compression/decompression, and a new command (--train-dict) for generating dictionaries from a corpus of samples. The list command now displays dictionary IDs.
Core Integration: Integrates dictionary handling into every compression and decompression interface (buffer, streaming, and seekable), for both single-threaded and multi-threaded operation.
Error Handling: Introduces new error codes for dictionary-related issues, such as DICT_REQUIRED, DICT_MISMATCH, and DICT_TOO_LARGE.
Infrastructure: Updates build systems, fuzzing targets, and conformance tests to thoroughly validate dictionary functionality.

github-advanced-security · 2026-05-27T16:06:08Z

You are seeing this message because GitHub Code Scanning has recently been set up for this repository, or this pull request contains the workflow file for the Code Scanning tool.

What Enabling Code Scanning Means:

The 'Security' tab will display more code scanning analysis results (e.g., for the default branch).
Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results.
You will be able to see the analysis results for the pull request's branch on this overview once the scans have completed and the checks have passed.

For more information about GitHub Code Scanning, check out the documentation.

codecov · 2026-05-27T16:19:04Z

Codecov Report

❌ Patch coverage is 99.33555% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/lib/zxc_dispatch.c | 97.59% | 2 Missing ⚠️ |


Introduces specific error types for dictionary-related issues, such as missing, mismatched, or oversized dictionaries. These errors are essential for robust dictionary support.

Introduces a comprehensive API for creating, managing, and utilizing pre-trained dictionaries. This includes functions for training, saving to the new `.zxd` file format, loading, and computing unique dictionary IDs. Dictionaries enhance compression ratios for small, similar data by pre-filling the LZ77 sliding window. The dictionary ID is integrated into the ZXC file header, enabling decoders to verify the correct dictionary is provided at decompression time.

Extends `zxc_compress_opts_t` and `zxc_decompress_opts_t` with dictionary parameters. The ZXC file header is updated to include a dictionary ID and a flag, enabling decoders to verify the correct dictionary is used. The LZ77 compression engine is enhanced to seed its hash tables with dictionary content, while the decompressor is adjusted to logically pre-fill its sliding window, facilitating matches against dictionary data for improved compression ratios.

Prepares input buffers for dictionary-aware compression by prefixing data blocks with dictionary content. This allows the core compression logic to operate on the data block while referencing the dictionary. For decompression, a dictionary-prefixed bounce buffer facilitates dictionary lookups during decoding. Includes dictionary ID validation during file header processing to ensure correct dictionary usage.
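The prefixed-buffer idea can be illustrated with a minimal sketch. It assumes a single contiguous buffer laid out as `[dictionary | block]`; `toy_prefix_dict` and `toy_copy_match` are hypothetical names for illustration and are not part of the zxc API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the dictionary to the front of `buf` and
 * return the position where the block data starts. */
static unsigned char *toy_prefix_dict(unsigned char *buf, size_t cap,
                                      const unsigned char *dict, size_t dict_len) {
    assert(buf != NULL && dict_len <= cap);
    if (dict_len > 0) memcpy(buf, dict, dict_len);
    return buf + dict_len;
}

/* Hypothetical helper: resolve one LZ77 match at `dst`. `base` is the start
 * of the whole buffer (the dictionary), so an offset may reach back past the
 * start of the block into dictionary bytes. Returns 0 on success, -1 if the
 * match would read before `base` or write past `end`. */
static int toy_copy_match(const unsigned char *base, unsigned char *dst,
                          const unsigned char *end, size_t offset, size_t len) {
    if (offset == 0 || offset > (size_t)(dst - base)) return -1;
    if (len > (size_t)(end - dst)) return -1;
    const unsigned char *src = dst - offset;
    /* Byte-wise copy so overlapping matches (offset < len) behave correctly. */
    for (size_t i = 0; i < len; i++) dst[i] = src[i];
    return 0;
}
```

With an 11-byte dictionary `"hello world"` and an empty block, a match with offset 11 and length 5 writes `hello` as the first block bytes, while offset 12 is rejected because it would read before the dictionary.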

This suite validates the core dictionary functionalities, including:

- Saving and loading dictionaries using the `.zxd` format.
- Deterministic generation of dictionary IDs.
- Roundtrip compression/decompression with dictionaries across all levels.
- Correct error reporting for dictionary mismatches and missing dictionaries.
- Compatibility for decompressing non-dictionary-aware streams.

This ensures the dictionary feature is robust and correctly integrated.

Introduces a k-gram frequency based algorithm for `zxc_train_dict`, enabling users to generate effective dictionaries from sample data. This commit also fully integrates dictionary handling into the command-line interface (via the new `-D` option), block-level compression/decompression, and the seekable API. This completes the dictionary feature by providing the necessary training capability and ensuring dictionaries are correctly processed across all compression and decompression paths, including robust buffer management for dictionary-aware chunk processing.

This commit adds comprehensive documentation for the pre-trained dictionary feature. It formalizes the `.zxd` dictionary file format, updates the ZXC file header specification to include dictionary details, and provides user-facing guidance on training and using dictionaries in the `README.md`.

Correct buffer allocation for compression contexts to properly account for dictionaries, preventing potential overflows when dictionary content is prefixed to data blocks. Extend the valid match range in the decompressor to include dictionary data, resolving `ZXC_ERROR_BAD_OFFSET` for valid references into the dictionary. A new test case validates robust handling of large dictionaries with smaller blocks.
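The two checks described here can be sketched as follows. This is an illustration under assumptions, not the library's code: `toy_prefixed_capacity` and `toy_offset_ok` are hypothetical names, and the dictionary is modeled as a logical prefix of `dict_len` bytes in front of the bytes produced so far.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: size a buffer that holds the dictionary followed by
 * one block. Returns 0 and stores the sum in *out, or -1 if the addition
 * would overflow size_t. */
static int toy_prefixed_capacity(size_t dict_len, size_t block_len, size_t *out) {
    assert(out != NULL);
    if (dict_len > SIZE_MAX - block_len) return -1;
    *out = dict_len + block_len;
    return 0;
}

/* Hypothetical helper: a match offset is valid if it reaches back no further
 * than the start of the dictionary, that is, offset <= produced + dict_len.
 * The comparison is written so that it cannot overflow. */
static int toy_offset_ok(size_t offset, size_t produced, size_t dict_len) {
    if (offset == 0) return 0;
    if (offset <= produced) return 1;
    return (offset - produced) <= dict_len;
}
```

Without a dictionary (`dict_len == 0`) the second check reduces to the usual rule that an offset may not exceed the number of bytes already produced.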

Refactors the dictionary header to use a 16-bit `zxc_hash16` checksum. This change standardizes the integrity check mechanism for dictionaries, aligning it with the method used for the main ZXC file header. The previous 32-bit CRC32 field is replaced by the 2-byte `CRC16` and a 2-byte reserved field to maintain overall header size.

Refines `zxc_lz_seed_dict` to seed the first half of the dictionary sparsely and the second half densely. Positions closer to the end of the dictionary are prioritized because they yield shorter offsets and are more likely to produce useful matches, which improves compression efficiency. Also removes redundant comments in `zxc_dict.c` and adds `UNLIKELY` branch hints to the dictionary loading error checks.
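A sparse-then-dense seeding pass might look roughly like the sketch below. The stride of 2, the 4-byte hash, and the 12-bit table are assumptions chosen for illustration; the actual parameters and layout used by `zxc_lz_seed_dict` are not stated in this PR, and all `toy_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_MIN_MATCH   4               /* assumed hash window, in bytes */
#define TOY_HT_BITS     12              /* assumed table size: 4096 slots */
#define TOY_HT_SIZE     (1u << TOY_HT_BITS)
#define TOY_SPARSE_STEP 2               /* assumed stride for the first half */

/* Multiplicative hash of 4 bytes, reduced to TOY_HT_BITS bits. */
static uint32_t toy_hash4(const uint8_t *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return (v * 2654435761u) >> (32 - TOY_HT_BITS);
}

/* Seed `table` with dictionary positions (stored as pos + 1, so 0 means
 * "empty"). The first half is visited every TOY_SPARSE_STEP bytes, the
 * second half at every byte. Returns the number of positions inserted. */
static size_t toy_seed_dict(uint32_t *table, const uint8_t *dict, size_t len) {
    assert(table != NULL && dict != NULL);
    if (len < TOY_MIN_MATCH) return 0;
    const size_t last = len - TOY_MIN_MATCH;  /* last position with 4 bytes left */
    const size_t half = len / 2;
    size_t seeded = 0;
    size_t pos = 0;
    while (pos <= last) {
        table[toy_hash4(dict + pos)] = (uint32_t)(pos + 1);
        seeded++;
        pos += (pos < half) ? TOY_SPARSE_STEP : 1;
    }
    return seeded;
}
```

Because later insertions overwrite earlier ones in the same slot, positions near the end of the dictionary win collisions in this sketch, which matches the stated preference for shorter offsets.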

Introduces `zxc_get_dict_id` to extract the dictionary ID from ZXC compressed buffers and `zxc_dict_get_id` for `.zxd` dictionary files. These functions allow quickly determining the dictionary used without full decompression or validation. The CLI's `zxc -l` (list) command is updated to display the dictionary ID for files compressed with a dictionary, or a dash for files without. This information is also included in the JSON output. This enhances the inspectability of ZXC files.

This new option enables users to train custom dictionaries from multiple input files and save them in the `.zxd` format, completing the dictionary workflow in the command-line interface.

Introduces `fuzz_dict` to test the integrity of dictionary-based compression and decompression. This fuzzer uses the input to construct a dictionary and data, then verifies that compressing and decompressing with the dictionary results in identical output.

The conformance test suite now supports zxc archives that require an external dictionary for decompression. The `test_valid_vector` function is enhanced to automatically detect the dictionary ID from the zxc archive and search for a matching `.zxd` dictionary file in the same directory. If found, the dictionary is loaded and supplied to the decompressor. New valid vector test cases are added to validate dictionary-based compression.

…sion The `zxc_seek_mt_worker` function now allocates a thread-local "bounce buffer" for dictionary-based decompression. This buffer concatenates the dictionary content with the necessary working space for the decompressor, ensuring each worker can decompress chunks independently using the provided dictionary. A new test case `test_dict_seekable_mt_roundtrip` has been added to validate this functionality, covering full-range and sub-range multi-threaded decompression with a dictionary.

Previously, the dictionary test created a dictionary using a custom C helper compiled on the fly. This update replaces that with a direct call to the `zxc --train-dict` CLI command. This change makes the test more realistic by utilizing the public API for dictionary training and removes the external dependency on `cc` for test execution, simplifying the test environment.

Ensures that dictionary memory, allocated as part of the new dictionary support, is properly released in error handling branches within the compression/decompression functions and upon CLI exit. This prevents memory leaks.

Doubles the buffer size to accommodate longer directory and file names, preventing potential path truncation issues during dictionary file discovery.
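Independently of the buffer size, truncation can be detected from the return value of `snprintf`. The sketch below shows that check; `toy_join_path` is a hypothetical helper and not a function from the zxc CLI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: build "dir/name" in `out`. Returns 0 on success and
 * -1 if the result did not fit, since snprintf reports the length it would
 * have written had the buffer been large enough. */
static int toy_join_path(char *out, size_t cap, const char *dir, const char *name) {
    assert(out != NULL && dir != NULL && name != NULL);
    int n = snprintf(out, cap, "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= cap) return -1;
    return 0;
}
```

A caller could skip the candidate file when this returns -1 rather than opening a silently truncated path.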

The `ctx` parameter is present for future dictionary context usage in `zxc_encode_block_num` but is currently unused. This silences a compiler warning.

Introduces a validation step for paths used by the dictionary and compression input files. This prevents issues from malformed paths and improves the CLI's robustness. Invalid dictionary paths now cause an immediate exit, while invalid input files are skipped with a warning.

Ensures that the `dict_input` memory is properly freed when the number of blocks exceeds the maximum allowed, preventing a memory leak in this error path during compression.

On Unix-like systems, switch from `fopen` to `open` and `fdopen` when creating the training dictionary. This enables explicit control over file permissions, ensuring the dictionary file is created with predictable and secure access rights (owner R/W, group R, others R).
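The `open` plus `fdopen` pattern can be sketched as below, using only POSIX calls. The helper name `toy_create_0644` is hypothetical, and the flags are an assumption about what a "create or overwrite" output file would use; note that the process umask may remove further permission bits from the requested mode.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: create (or truncate) `path` with mode 0644, that is
 * owner read/write, group read, others read, and wrap it in a FILE*. */
static FILE *toy_create_0644(const char *path) {
    assert(path != NULL);
    const mode_t mode = S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, mode);
    if (fd < 0) return NULL;
    FILE *f = fdopen(fd, "wb");
    if (f == NULL) close(fd);   /* do not leak the descriptor on failure */
    return f;
}
```

Compared with `fopen`, which requests mode 0666 before the umask is applied, this makes the upper bound on the permissions explicit in the code.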

Adds LCOV exclusion markers to various error handling and memory allocation failure paths across the library. These paths are often difficult to trigger reliably in unit tests but represent valid error conditions that are correctly handled. Excluding them ensures more accurate code coverage metrics without skewing results for untestable conditions.

Replaces `libc`'s `qsort` with an in-place heapsort implementation for sorting dictionary segments during training. This change removes a dependency on `libc`, making the library more suitable for freestanding environments. It also ensures deterministic dictionary output across different platforms and `libc` versions by providing a fixed sorting algorithm.
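An in-place heapsort of this kind can be sketched as follows. The `toy_seg_t` record, its fields, and the ordering (higher score first, ties broken by lower position) are assumptions for illustration; the PR does not show the actual segment type or comparison.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical segment record. */
typedef struct { uint32_t score; uint32_t pos; } toy_seg_t;

/* Strict ordering: higher score first; ties broken by lower position, so
 * the result is unique whenever positions are distinct. */
static int toy_before(const toy_seg_t *a, const toy_seg_t *b) {
    if (a->score != b->score) return a->score > b->score;
    return a->pos < b->pos;
}

static void toy_swap(toy_seg_t *a, toy_seg_t *b) {
    toy_seg_t t = *a; *a = *b; *b = t;
}

/* Restore the heap property below `root`. The heap keeps the element that
 * sorts last at the top. */
static void toy_sift_down(toy_seg_t *v, size_t root, size_t n) {
    for (;;) {
        size_t child = 2 * root + 1;
        if (child >= n) return;
        if (child + 1 < n && toy_before(&v[child], &v[child + 1])) child++;
        if (!toy_before(&v[root], &v[child])) return;
        toy_swap(&v[root], &v[child]);
        root = child;
    }
}

/* In-place heapsort: no allocation and no libc dependency. */
static void toy_heapsort(toy_seg_t *v, size_t n) {
    assert(v != NULL || n == 0);
    if (n < 2) return;
    for (size_t i = n / 2; i-- > 0;) toy_sift_down(v, i, n);
    for (size_t end = n - 1; end > 0; end--) {
        toy_swap(&v[0], &v[end]);   /* move the last-sorting element into place */
        toy_sift_down(v, 0, end);
    }
}
```

Heapsort is not stable, so the explicit tie-break in `toy_before` is what makes the output identical across platforms in this sketch.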

This test verifies `zxc_compress_block` and `zxc_decompress_block` functionality with a dictionary, covering all compression levels from 1 to 6. It ensures data integrity and correct behavior for block-level dictionary-based compression and decompression.

Relocates `ZXC_DICT_KGRAM_LEN`, `ZXC_DICT_HT_BITS`, and `ZXC_DICT_HT_SIZE` from `zxc_dict.c` to `zxc_internal.h`. This centralizes shared dictionary-related constants, making them accessible throughout the library and improving code organization.

Adds a new section to `FORMAT.md` that details the `.zxd` dictionary file format. This includes the unique magic word for dictionary files and a comprehensive worked example with a hexdump and byte-level decoding of the dictionary header and content. This provides essential guidance for implementers working with zxc dictionaries.

The default lzbench repository does not include necessary modifications to test zxc's dictionary features. This temporary change allows the benchmark workflow to use a custom fork with dictionary support for ongoing development.

Ensures core dictionary logic is compiled for the Rust wrapper.

Refines the dictionary training algorithm to avoid redundant patterns. When a segment is added to the dictionary, its k-grams are marked as covered in the frequency table. This ensures that subsequent dictionary picks prioritize novel patterns, maximizing the dictionary's coverage of the corpus with unique content.
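The coverage idea can be sketched with a small hashed k-gram frequency table. Everything here is an assumption for illustration: the k-gram length of 4, the 12-bit table, the scoring rule (sum of k-gram frequencies), and all `toy_` names. The real values behind `ZXC_DICT_KGRAM_LEN` and `ZXC_DICT_HT_BITS` are not given in this PR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_K       4                   /* assumed k-gram length */
#define TOY_HT_BITS 12                  /* assumed table size: 4096 slots */
#define TOY_HT_SIZE (1u << TOY_HT_BITS)

static uint32_t toy_hash_kgram(const uint8_t *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return (v * 2654435761u) >> (32 - TOY_HT_BITS);
}

/* Count every k-gram of a sample into the (hashed) frequency table. */
static void toy_count_kgrams(uint32_t *freq, const uint8_t *s, size_t n) {
    assert(freq != NULL);
    for (size_t i = 0; i + TOY_K <= n; i++) freq[toy_hash_kgram(s + i)]++;
}

/* Score a candidate segment as the sum of its k-gram frequencies. */
static uint64_t toy_score_segment(const uint32_t *freq, const uint8_t *seg, size_t n) {
    uint64_t score = 0;
    for (size_t i = 0; i + TOY_K <= n; i++) score += freq[toy_hash_kgram(seg + i)];
    return score;
}

/* After a segment is picked, zero its k-grams so that later candidates
 * made of the same patterns score lower. */
static void toy_mark_covered(uint32_t *freq, const uint8_t *seg, size_t n) {
    for (size_t i = 0; i + TOY_K <= n; i++) freq[toy_hash_kgram(seg + i)] = 0;
}
```

A training loop built on these helpers would repeatedly pick the highest-scoring segment, append it to the dictionary, call `toy_mark_covered` on it, and rescore the remaining candidates.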

github-advanced-security AI found potential problems May 27, 2026


Comment thread src/cli/main.c Fixed

Comment thread src/cli/main.c Fixed

hellobertrand changed the title ~~Adds pre-trained dictionary support~~ feat: Adds pre-trained dictionary support May 27, 2026

hellobertrand force-pushed the feat/dictionary-support branch 2 times, most recently from baecc25 to a7c6676 Compare May 28, 2026 20:35

hellobertrand marked this pull request as ready for review May 28, 2026 20:41

hellobertrand added 23 commits May 29, 2026 09:04

feat: Add dictionary error codes

cde378e

Introduces specific error types for dictionary-related issues, such as missing, mismatched, or oversized dictionaries. These errors are essential for robust dictionary support.

feat: Introduce --train-dict CLI command

61e684b

This new option enables users to train custom dictionaries from multiple input files and save them in the `.zxd` format, completing the dictionary workflow in the command-line interface.

feat: Integrate dictionary component into build

b61972f

refactor: Make dirent pointer const-correct in dictionary lookup

f7bf236

fix: Free dictionary allocations on all exit paths

f765392

Ensures that dictionary memory, allocated as part of the new dictionary support, is properly released in error handling branches within the compression/decompression functions and upon CLI exit. This prevents memory leaks.

fix: Enlarge path buffers for dictionary lookup

34b5a0d

Doubles the buffer size to accommodate longer directory and file names, preventing potential path truncation issues during dictionary file discovery.

fix: Silence unused 'ctx' parameter warning

acf3e7e

The `ctx` parameter is present for future dictionary context usage in `zxc_encode_block_num` but is currently unused. This silences a compiler warning.

feat: Include dictionary test

bea18a4

fix: Expand dictionary memory freeing to all CLI and lib error paths

1ee87b6

hellobertrand added 10 commits May 29, 2026 09:04

fix: Free dictionary input buffer on block count overflow

d55efd2

Ensures that the `dict_input` memory is properly freed when the number of blocks exceeds the maximum allowed, preventing a memory leak in this error path during compression.

feat: Include zxc_dict.c in zxc-sys build

36ff5d2

Ensures core dictionary logic is compiled for the Rust wrapper.

hellobertrand force-pushed the feat/dictionary-support branch from caf1cc9 to 36ff5d2 Compare May 29, 2026 07:04

hellobertrand marked this pull request as draft May 29, 2026 09:30

hellobertrand wants to merge 34 commits into main from feat/dictionary-support


2 participants
