-
Notifications
You must be signed in to change notification settings - Fork 3
[Model] MinimumExternalMacroDataCompression #441
Description
Important
Build as optimization, not decision. Value = Min<usize>.
Objective: Minimize compression cost |D| + |C| + (h−1) × (pointer count).
Do not add a bound/threshold field — let the solver find the optimum directly. See #765.
Motivation
MINIMUM EXTERNAL MACRO DATA COMPRESSION (derived from Garey & Johnson A4 SR22, "EXTERNAL MACRO DATA COMPRESSION"). A classical NP-hard problem in data compression theory, where the goal is to find the minimum-cost compression of a string using a separate dictionary string and a compressed string with pointers. This problem formalizes the macro model of data compression introduced by Storer and Szymanski, which generalizes many practical compression schemes including LZ-family algorithms.
Associated reduction rules:
- R116: VERTEX COVER -> MINIMUM EXTERNAL MACRO DATA COMPRESSION (GJ reference reduction)
Definition
Name: MinimumExternalMacroDataCompression
Canonical name: External Macro Data Compression (also: External Pointer Macro Compression, EPM Compression)
Reference: Garey & Johnson, Computers and Intractability, A4 SR22
Mathematical definition:
INSTANCE: Alphabet Σ, string s ∈ Σ*, pointer cost h ∈ Z⁺.
OBJECTIVE: Find strings D (dictionary string) and C (compressed string) in (Σ ∪ {pᵢ : 1 ≤ i ≤ |s|})*, where the symbols pᵢ are "pointers," such that s can be obtained from C by repeatedly replacing pointers with their corresponding substrings of D, minimizing the total cost:
cost(D, C) = |D| + |C| + (h − 1) × (number of pointer occurrences in D and C)
The corresponding decision problem (GJ SR22) asks whether the minimum cost is at most B for a given bound B ∈ Z⁺.
Variables
- Count: 2 × |s| variables — |s| slots for the dictionary string D and |s| slots for the compressed string C. Unused trailing slots are filled with a special "empty" marker.
- D-slot domain: {empty} ∪ Σ — each dictionary position is either an alphabet symbol or unused. Domain size =
alphabet_size + 1. - C-slot domain: {empty} ∪ Σ ∪ {ptr(start, len) : 0 ≤ start < |s|, 1 ≤ len ≤ |s| − start} — each compressed-string position is an alphabet symbol, a pointer into D (specified by start position and length), or unused. Domain size =
alphabet_size + 1 + |s| × (|s| + 1) / 2. - Meaning: A configuration encodes a candidate (D, C) pair. D (the non-empty prefix of D-slots) is a dictionary of reusable substrings. C (the non-empty prefix of C-slots) contains literal symbols and pointers into D. The pair is valid if replacing all pointers in C with their referenced D-substrings yields s. The cost is |D| + |C| + (h − 1) × (number of pointer symbols in C).
Note: This encoding restricts D to be pointer-free (pure alphabet string). The GJ definition allows pointers in D, but the pointer-free variant is NP-complete (Storer & Szymanski, 1982) and sufficient for all known reductions.
Schema (data type)
Type name: MinimumExternalMacroDataCompression
Variants: none (no graph or weight type parameter)
Value type: Min<usize>
| Field | Type | Description |
|---|---|---|
alphabet_size |
usize |
Size of the alphabet Σ (symbols indexed 0..alphabet_size) |
string |
Vec<usize> |
The source string s to be compressed, as a sequence of symbol indices |
pointer_cost |
usize |
The pointer cost h (each pointer occurrence contributes h − 1 extra beyond the position it occupies) |
Getters (for overhead expressions):
string_length()→string.len()alphabet_size()→alphabet_sizepointer_cost()→pointer_cost
Complexity
- Optimization complexity: NP-hard (Storer, 1977; Storer and Szymanski, 1978; transformation from VERTEX COVER). The decision version (is minimum cost ≤ B?) is NP-complete. Remains NP-hard even for fixed h ≥ 2, for alphabet size ≥ 3, and for variants where D contains no pointers.
- Best known exact algorithm: Brute-force over all possible (D, C) pairs: for each candidate dictionary string D of length up to |s|, enumerate all compressed strings C using alphabet symbols and pointers into D, checking whether C decodes to s. Upper bound: O(alphabet_size^string_length × (alphabet_size + string_length²)^string_length).
- Approximation: Practical heuristic algorithms (LZ77, LZSS, LZ78) achieve good compression ratios in linear or near-linear time but do not guarantee optimality. LZSS (Lempel-Ziv-Storer-Szymanski) is a direct practical algorithm derived from this theoretical framework. The related Smallest Grammar Problem is APX-hard (Charikar et al., 2005).
- References:
- [Storer, 1977] J. A. Storer, "NP-completeness results concerning data compression", Tech. Report 234, Princeton University.
- [Storer and Szymanski, 1978] J. A. Storer and T. G. Szymanski, "The macro model for data compression", Proc. 10th STOC, pp. 30-39.
- [Storer and Szymanski, 1982] J. A. Storer and T. G. Szymanski, "Data compression via textual substitution", JACM 29(4), pp. 928-951.
Specialization
- This is related to: INTERNAL MACRO DATA COMPRESSION (P171) -- the variant where D and C are merged into a single self-referencing string
- Generalization of: Many practical compression schemes (LZ77, LZSS) are restricted forms of external macro compression
Extra Remark
Full book text (GJ SR22, decision version):
INSTANCE: Alphabet Sigma, string s in Sigma*, pointer cost h in Z+, and a bound B in Z+.
QUESTION: Are there strings D (dictionary string) and C (compressed string) in (Sigma union {p_i: 1 <= i <= |s|})*, where the symbols p_i are "pointers," such that
|D| + |C| + (h-1) * (number of occurrences of pointers in D and C) <= B
and such that there is a way of identifying pointers with substrings of D so that S can be obtained from C by repeatedly replacing pointers in C by their corresponding substrings in D?
Reference: [Storer, 1977], [Storer and Szymanski, 1978]. Transformation from VERTEX COVER.
Comment: Remains NP-complete even if h is any fixed integer 2 or greater. Many variants, including those in which D can contain no pointers and/or no pointers can refer to overlapping strings, are also NP-complete. If the alphabet size is fixed at 3 or greater, and the pointer cost is ceiling(h * log|s|), the problem is also NP-complete. For further variants, including the case of "original pointers," see references.
How to solve
- It can be solved by (existing) bruteforce. Enumerate all possible dictionary strings D up to length |s|, and for each D, enumerate all compressed strings C using alphabet symbols and pointers into D, checking whether C decodes to s. Track the minimum total cost across all valid (D, C) pairs.
- It can be solved by reducing to integer programming.
- Other: Practical heuristic compression algorithms (LZSS, LZ77) provide approximate solutions in linear time, though they do not guarantee optimality.
Example Instance
Alphabet Σ = {a, b, c, d, e, f} (alphabet_size = 6)
String s = [a, b, c, d, e, f, a, b, c, d, e, f, a, b, c, d, e, f] (length 18)
Pointer cost h = 2
Optimal compression (cost = 12):
- Dictionary D = "abcdef" (length 6)
- Compressed string C = "p₁ p₁ p₁" where p₁ → D[0..6] = "abcdef"
- Decode: "abcdef" × 3 = s ✓
- Cost = |D| + |C| + (h−1) × ptrs = 6 + 3 + 1 × 3 = 12
Suboptimal scheme (cost = 16):
- D = "abcdefabcdef" (length 12), C = "p₁ p₂" where p₁ → D[0..12], p₂ → D[0..6]
- Cost = 12 + 2 + 1 × 2 = 16
No compression: cost = |s| = 18
Optimal value: Min(12)
Metadata
Metadata
Assignees
Labels
Type
Projects
Status