Skip to content

[Model] MinimumExternalMacroDataCompression #441

@isPANN

Description

@isPANN

Important

Build as optimization, not decision. Value = Min<usize>.
Objective: Minimize compression cost |D| + |C| + (h−1) × (pointer count).
Do not add a bound/threshold field — let the solver find the optimum directly. See #765.

Motivation

MINIMUM EXTERNAL MACRO DATA COMPRESSION (derived from Garey & Johnson A4 SR22, "EXTERNAL MACRO DATA COMPRESSION"). A classical NP-hard problem in data compression theory, where the goal is to find the minimum-cost compression of a string using a separate dictionary string and a compressed string with pointers. This problem formalizes the macro model of data compression introduced by Storer and Szymanski, which generalizes many practical compression schemes including LZ-family algorithms.

Associated reduction rules:

  • R116: VERTEX COVER -> MINIMUM EXTERNAL MACRO DATA COMPRESSION (GJ reference reduction)

Definition

Name: MinimumExternalMacroDataCompression
Canonical name: External Macro Data Compression (also: External Pointer Macro Compression, EPM Compression)
Reference: Garey & Johnson, Computers and Intractability, A4 SR22

Mathematical definition:

INSTANCE: Alphabet Σ, string s ∈ Σ*, pointer cost h ∈ Z⁺.
OBJECTIVE: Find strings D (dictionary string) and C (compressed string) in (Σ ∪ {pᵢ : 1 ≤ i ≤ |s|})*, where the symbols pᵢ are "pointers," such that s can be obtained from C by repeatedly replacing pointers with their corresponding substrings of D, minimizing the total cost:

cost(D, C) = |D| + |C| + (h − 1) × (number of pointer occurrences in D and C)

The corresponding decision problem (GJ SR22) asks whether the minimum cost is at most B for a given bound B ∈ Z⁺.

Variables

  • Count: 2 × |s| variables — |s| slots for the dictionary string D and |s| slots for the compressed string C. Unused trailing slots are filled with a special "empty" marker.
  • D-slot domain: {empty} ∪ Σ — each dictionary position is either an alphabet symbol or unused. Domain size = alphabet_size + 1.
  • C-slot domain: {empty} ∪ Σ ∪ {ptr(start, len) : 0 ≤ start < |s|, 1 ≤ len ≤ |s| − start} — each compressed-string position is an alphabet symbol, a pointer into D (specified by start position and length), or unused. Domain size = alphabet_size + 1 + |s| × (|s| + 1) / 2.
  • Meaning: A configuration encodes a candidate (D, C) pair. D (the non-empty prefix of D-slots) is a dictionary of reusable substrings. C (the non-empty prefix of C-slots) contains literal symbols and pointers into D. The pair is valid if replacing all pointers in C with their referenced D-substrings yields s. The cost is |D| + |C| + (h − 1) × (number of pointer symbols in C).

Note: This encoding restricts D to be pointer-free (pure alphabet string). The GJ definition allows pointers in D, but the pointer-free variant is NP-complete (Storer & Szymanski, 1982) and sufficient for all known reductions.

Schema (data type)

Type name: MinimumExternalMacroDataCompression
Variants: none (no graph or weight type parameter)
Value type: Min<usize>

Field Type Description
alphabet_size usize Size of the alphabet Σ (symbols indexed 0..alphabet_size)
string Vec<usize> The source string s to be compressed, as a sequence of symbol indices
pointer_cost usize The pointer cost h (each pointer occurrence contributes h − 1 extra beyond the position it occupies)

Getters (for overhead expressions):

  • string_length()string.len()
  • alphabet_size()alphabet_size
  • pointer_cost()pointer_cost

Complexity

  • Optimization complexity: NP-hard (Storer, 1977; Storer and Szymanski, 1978; transformation from VERTEX COVER). The decision version (is minimum cost ≤ B?) is NP-complete. Remains NP-hard even for fixed h ≥ 2, for alphabet size ≥ 3, and for variants where D contains no pointers.
  • Best known exact algorithm: Brute-force over all possible (D, C) pairs: for each candidate dictionary string D of length up to |s|, enumerate all compressed strings C using alphabet symbols and pointers into D, checking whether C decodes to s. Upper bound: O(alphabet_size^string_length × (alphabet_size + string_length²)^string_length).
  • Approximation: Practical heuristic algorithms (LZ77, LZSS, LZ78) achieve good compression ratios in linear or near-linear time but do not guarantee optimality. LZSS (Lempel-Ziv-Storer-Szymanski) is a direct practical algorithm derived from this theoretical framework. The related Smallest Grammar Problem is APX-hard (Charikar et al., 2005).
  • References:
    • [Storer, 1977] J. A. Storer, "NP-completeness results concerning data compression", Tech. Report 234, Princeton University.
    • [Storer and Szymanski, 1978] J. A. Storer and T. G. Szymanski, "The macro model for data compression", Proc. 10th STOC, pp. 30-39.
    • [Storer and Szymanski, 1982] J. A. Storer and T. G. Szymanski, "Data compression via textual substitution", JACM 29(4), pp. 928-951.

Specialization

  • This is related to: INTERNAL MACRO DATA COMPRESSION (P171) -- the variant where D and C are merged into a single self-referencing string
  • Generalization of: Many practical compression schemes (LZ77, LZSS) are restricted forms of external macro compression

Extra Remark

Full book text (GJ SR22, decision version):

INSTANCE: Alphabet Sigma, string s in Sigma*, pointer cost h in Z+, and a bound B in Z+.
QUESTION: Are there strings D (dictionary string) and C (compressed string) in (Sigma union {p_i: 1 <= i <= |s|})*, where the symbols p_i are "pointers," such that
|D| + |C| + (h-1) * (number of occurrences of pointers in D and C) <= B
and such that there is a way of identifying pointers with substrings of D so that S can be obtained from C by repeatedly replacing pointers in C by their corresponding substrings in D?
Reference: [Storer, 1977], [Storer and Szymanski, 1978]. Transformation from VERTEX COVER.
Comment: Remains NP-complete even if h is any fixed integer 2 or greater. Many variants, including those in which D can contain no pointers and/or no pointers can refer to overlapping strings, are also NP-complete. If the alphabet size is fixed at 3 or greater, and the pointer cost is ceiling(h * log|s|), the problem is also NP-complete. For further variants, including the case of "original pointers," see references.

How to solve

  • It can be solved by (existing) bruteforce. Enumerate all possible dictionary strings D up to length |s|, and for each D, enumerate all compressed strings C using alphabet symbols and pointers into D, checking whether C decodes to s. Track the minimum total cost across all valid (D, C) pairs.
  • It can be solved by reducing to integer programming.
  • Other: Practical heuristic compression algorithms (LZSS, LZ77) provide approximate solutions in linear time, though they do not guarantee optimality.

Example Instance

Alphabet Σ = {a, b, c, d, e, f} (alphabet_size = 6)
String s = [a, b, c, d, e, f, a, b, c, d, e, f, a, b, c, d, e, f] (length 18)
Pointer cost h = 2

Optimal compression (cost = 12):

  • Dictionary D = "abcdef" (length 6)
  • Compressed string C = "p₁ p₁ p₁" where p₁ → D[0..6] = "abcdef"
  • Decode: "abcdef" × 3 = s ✓
  • Cost = |D| + |C| + (h−1) × ptrs = 6 + 3 + 1 × 3 = 12

Suboptimal scheme (cost = 16):

  • D = "abcdefabcdef" (length 12), C = "p₁ p₂" where p₁ → D[0..12], p₂ → D[0..6]
  • Cost = 12 + 2 + 1 × 2 = 16

No compression: cost = |s| = 18

Optimal value: Min(12)

Metadata

Metadata

Assignees

No one assigned

    Labels

    GoodAn issue passed all checks.modelA model problem to be implemented.

    Type

    No type

    Projects

    Status

    Done

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions