-
Notifications
You must be signed in to change notification settings - Fork 3
[Model] MinimumInternalMacroDataCompression #442
Description
Important
Build as optimization, not decision. Value = Min<usize>.
Objective: Minimize compression cost |C| + (h−1) × (pointer count).
Do not add a bound/threshold field — let the solver find the optimum directly. See #765.
Motivation
MINIMUM INTERNAL MACRO DATA COMPRESSION (derived from Garey & Johnson A4 SR23, "INTERNAL MACRO DATA COMPRESSION"). A classical NP-hard problem in data compression theory, where the goal is to find the minimum-cost compression of a string into a single self-referencing string with embedded pointers. Unlike the external variant (#441), there is no separate dictionary — the compressed string C serves as both the dictionary and the output, with pointers referencing substrings within C itself.
Associated reduction rules:
- R117: VERTEX COVER → MINIMUM INTERNAL MACRO DATA COMPRESSION (GJ reference reduction)
Definition
Name: MinimumInternalMacroDataCompression
Canonical name: Internal Macro Data Compression (also: Internal Pointer Macro Compression, Self-Referencing Macro Compression)
Reference: Garey & Johnson, Computers and Intractability, A4 SR23
Mathematical definition:
INSTANCE: Alphabet Σ (with |Σ| = alphabet_size), string s ∈ Σ* (with |s| = string_len), pointer cost h ∈ Z⁺ (h ≥ 1).
OBJECTIVE: Minimize the compression cost
cost(C) = |C| + (h − 1) × (number of pointer occurrences in C)
over all strings C ∈ (Σ ∪ {pᵢ : 1 ≤ i ≤ |s|})* such that there is a way of identifying pointers with substrings of C so that s can be obtained from C by using C as both compressed string and dictionary string.
A pointer p at position i in C references a substring C[j..j+l] (where j < i in the compressed string and l ≥ 1). During decoding, each pointer is replaced by its referenced substring. The decoding order is left-to-right; a pointer may only reference positions that have already been fully decoded.
The uncompressed string C = s is always a valid compression with cost |s| (zero pointers). The optimal compression minimizes cost by replacing repeated substrings with pointers.
The corresponding decision problem (GJ SR23) asks whether the minimum cost is at most B for a given bound B ∈ Z⁺.
Variables
- Count: string_len (= |s|). The configuration vector has one variable per position of the compressed string C, which has at most |s| characters (since the uncompressed C = s has length |s|).
- Per-variable domain: alphabet_size + string_len + 1. Each position can be:
- A literal symbol: values 0..alphabet_size − 1 (symbol indices in Σ)
- An end-of-string marker: value alphabet_size (positions after this are padding, ignored)
- A pointer: values alphabet_size + 1..alphabet_size + string_len, where value v means "copy from C[v − alphabet_size − 1] using greedy longest match"
- Meaning: The configuration encodes a compressed string C of variable length (up to |s|). Active positions are those before the first end-of-string marker. The evaluate function decodes C by resolving all pointer references within C itself (left-to-right, greedy longest match), checks whether the decoded result equals s, and computes the cost. Invalid configurations (decoding fails, circular references, or decoded string ≠ s) return
Min(usize::MAX).
dims() returns [alphabet_size + string_len + 1; string_len].
Schema (data type)
Type name: MinimumInternalMacroDataCompression
Variants: none (no graph or weight type parameter)
| Field | Type | Description | Getter |
|---|---|---|---|
alphabet_size |
usize |
Size of alphabet Σ (symbols indexed 0..alphabet_size) | alphabet_size() |
string |
Vec<usize> |
The source string s to be compressed, as symbol indices | string_len() → self.string.len() |
pointer_cost |
usize |
The pointer cost h (each pointer adds h − 1 extra to the cost) | pointer_cost() |
Problem type: Optimization (minimization)
Value type: Min<usize>
Complexity
- Decision complexity: NP-complete (Storer, 1977; Storer and Szymanski, 1978; transformation from VERTEX COVER). Remains NP-complete even for fixed h ≥ 2.
- Best known exact algorithm: Brute-force over all possible compressed strings C of length up to |s|, where each position is one of alphabet_size + |s| + 1 tokens. Worst-case time complexity:
(alphabet_size + string_len + 1)^string_len. - Approximation: Practical algorithms like LZ77 (sliding window) and LZ78 (internal referencing) approximate this problem in O(|s|) or O(|s| log |s|) time but do not guarantee optimal compression.
- Relationship to grammar compression: Internal macro compression is closely related to the smallest grammar problem (finding the smallest context-free grammar generating exactly the string s), which is also NP-hard (Charikar et al., 2005).
- References:
- [Storer, 1977] J. A. Storer, "NP-completeness results concerning data compression", Tech. Report 234, Princeton University.
- [Storer and Szymanski, 1978] J. A. Storer and T. G. Szymanski, "The macro model for data compression", Proc. 10th STOC, pp. 30-39.
- [Storer and Szymanski, 1982] J. A. Storer and T. G. Szymanski, "Data compression via textual substitution", JACM 29(4), pp. 928-951.
- [Charikar et al., 2005] M. Charikar et al., "The smallest grammar problem", IEEE Trans. Inf. Theory 51(7), pp. 2554-2576.
Specialization
- This is related to: MINIMUM EXTERNAL MACRO DATA COMPRESSION ([Model] MinimumExternalMacroDataCompression #441) — the variant with a separate dictionary string
- This is a special case of: General macro compression (both external and internal variants are special cases of the unified macro model)
Extra Remark
Full book text:
INSTANCE: Alphabet Sigma, string s in Sigma*, pointer cost h in Z+, and a bound B in Z+.
QUESTION: Is there a single string C in (Sigma union {p_i: 1 <= i <= |s|})* such that
|C| + (h-1) * (number of occurences of pointers in C) <= B
and such that there is a way of identifying pointers with substrings of C so that s can be obtained from C by using C as both compressed string and dictionary string in the manner indicated in the previous problem?
Reference: [Storer, 1977], [Storer and Szymanski, 1978]. Transformation from VERTEX COVER.
Comment: Remains NP-complete even if h is any fixed integer 2 or greater. For other NP-complete variants (as in the previous problem), see references.
Note: The original GJ formulation is a decision problem with bound B. This issue reformulates it as an optimization problem (minimize cost). The decision version is recovered by checking whether the optimal cost ≤ B.
How to solve
- It can be solved by (existing) bruteforce. The search space is (alphabet_size + string_len + 1)^string_len, which is too large for practical brute-force enumeration even on small instances.
- It can be solved by reducing to integer programming. The ILP formulation uses a flow-on-DAG partition model: binary variables
lit[i](position i is a literal in C),ptr[i][l][k](a segment starting at position i with length l copies from previously decoded position k), and flow conservation ensures positions 0..|s| are partitioned into segments. The objective minimizes total literal count + h × pointer count. This is analogous to the ILP formulation used for the external variant (MinimumExternalMacroDataCompression), but without dictionary variables. - Other: Practical compression algorithms (LZ77 with sliding window, LZ78) provide heuristic solutions in near-linear time.
Example Instance
Alphabet Σ = {a, b, c} (alphabet_size = 3)
String s = "abcabcabc" (string_len = 9)
Pointer cost h = 2
Token encoding:
- Domain per variable = 3 + 9 + 1 = 13
- Values 0..2 = literals {a, b, c}
- Value 3 = end-of-string marker
- Values 4..12 = pointer to C[0], C[1], ..., C[8]
Optimal compression:
- Config = [0, 1, 2, 4, 4, 3, 3, 3, 3]
- Positions 0–2: literals a, b, c
- Positions 3–4: pointer to C[0] (greedy match: "abc" at each)
- Position 5: end-of-string marker (positions 6–8 are padding)
- Active C = [a, b, c, ptr(0), ptr(0)], length = 5
Decoding verification:
- C = [a, b, c, ptr(0), ptr(0)]
- ptr(0) at position 3: greedy match from C[0] = "abc" → decoded so far: "abcabc"
- ptr(0) at position 4: greedy match from C[0] = "abc" → decoded: "abcabcabc"
- Result: "abcabcabc" = s ✓
Cost: |C| + (h−1) × pointers = 5 + (2−1) × 2 = 7
Suboptimal feasible solutions (confirming optimality):
- C = [a, b, c, a, b, c, ptr(0)], length 7, pointers 1 → cost = 7 + 1 = 8
- C = "abcabcabc" (uncompressed), length 9, pointers 0 → cost = 9
Expected outcome: Min(7)
Reduction Rule Crossref
- R117: MinimumVertexCover → MinimumInternalMacroDataCompression (GJ reference reduction)
- Direct ILP: MinimumInternalMacroDataCompression → ILP (flow-on-DAG partition model, implemented together with this model)
Metadata
Metadata
Assignees
Labels
Type
Projects
Status