Skip to content

[Model] MinimumInternalMacroDataCompression #442

@isPANN

Description

@isPANN

Important

Build as optimization, not decision. Value = Min<usize>.
Objective: Minimize compression cost |C| + (h−1) × (pointer count).
Do not add a bound/threshold field — let the solver find the optimum directly. See #765.

Motivation

MINIMUM INTERNAL MACRO DATA COMPRESSION (derived from Garey & Johnson A4 SR23, "INTERNAL MACRO DATA COMPRESSION"). A classical NP-hard problem in data compression theory, where the goal is to find the minimum-cost compression of a string into a single self-referencing string with embedded pointers. Unlike the external variant (#441), there is no separate dictionary — the compressed string C serves as both the dictionary and the output, with pointers referencing substrings within C itself.

Associated reduction rules:

  • R117: VERTEX COVER → MINIMUM INTERNAL MACRO DATA COMPRESSION (GJ reference reduction)

Definition

Name: MinimumInternalMacroDataCompression
Canonical name: Internal Macro Data Compression (also: Internal Pointer Macro Compression, Self-Referencing Macro Compression)
Reference: Garey & Johnson, Computers and Intractability, A4 SR23

Mathematical definition:

INSTANCE: Alphabet Σ (with |Σ| = alphabet_size), string s ∈ Σ* (with |s| = string_len), pointer cost h ∈ Z⁺ (h ≥ 1).

OBJECTIVE: Minimize the compression cost

cost(C) = |C| + (h − 1) × (number of pointer occurrences in C)

over all strings C ∈ (Σ ∪ {pᵢ : 1 ≤ i ≤ |s|})* such that there is a way of identifying pointers with substrings of C so that s can be obtained from C by using C as both compressed string and dictionary string.

A pointer p at position i in C references a substring C[j..j+l] (where j < i in the compressed string and l ≥ 1). During decoding, each pointer is replaced by its referenced substring. The decoding order is left-to-right; a pointer may only reference positions that have already been fully decoded.

The uncompressed string C = s is always a valid compression with cost |s| (zero pointers). The optimal compression minimizes cost by replacing repeated substrings with pointers.

The corresponding decision problem (GJ SR23) asks whether the minimum cost is at most B for a given bound B ∈ Z⁺.

Variables

  • Count: string_len (= |s|). The configuration vector has one variable per position of the compressed string C, which has at most |s| characters (since the uncompressed C = s has length |s|).
  • Per-variable domain: alphabet_size + string_len + 1. Each position can be:
    • A literal symbol: values 0..alphabet_size − 1 (symbol indices in Σ)
    • An end-of-string marker: value alphabet_size (positions after this are padding, ignored)
    • A pointer: values alphabet_size + 1..alphabet_size + string_len, where value v means "copy from C[v − alphabet_size − 1] using greedy longest match"
  • Meaning: The configuration encodes a compressed string C of variable length (up to |s|). Active positions are those before the first end-of-string marker. The evaluate function decodes C by resolving all pointer references within C itself (left-to-right, greedy longest match), checks whether the decoded result equals s, and computes the cost. Invalid configurations (decoding fails, circular references, or decoded string ≠ s) return Min(usize::MAX).

dims() returns [alphabet_size + string_len + 1; string_len].

Schema (data type)

Type name: MinimumInternalMacroDataCompression
Variants: none (no graph or weight type parameter)

Field Type Description Getter
alphabet_size usize Size of alphabet Σ (symbols indexed 0..alphabet_size) alphabet_size()
string Vec<usize> The source string s to be compressed, as symbol indices string_len()self.string.len()
pointer_cost usize The pointer cost h (each pointer adds h − 1 extra to the cost) pointer_cost()

Problem type: Optimization (minimization)
Value type: Min<usize>

Complexity

  • Decision complexity: NP-complete (Storer, 1977; Storer and Szymanski, 1978; transformation from VERTEX COVER). Remains NP-complete even for fixed h ≥ 2.
  • Best known exact algorithm: Brute-force over all possible compressed strings C of length up to |s|, where each position is one of alphabet_size + |s| + 1 tokens. Worst-case time complexity: (alphabet_size + string_len + 1)^string_len.
  • Approximation: Practical algorithms like LZ77 (sliding window) and LZ78 (internal referencing) approximate this problem in O(|s|) or O(|s| log |s|) time but do not guarantee optimal compression.
  • Relationship to grammar compression: Internal macro compression is closely related to the smallest grammar problem (finding the smallest context-free grammar generating exactly the string s), which is also NP-hard (Charikar et al., 2005).
  • References:
    • [Storer, 1977] J. A. Storer, "NP-completeness results concerning data compression", Tech. Report 234, Princeton University.
    • [Storer and Szymanski, 1978] J. A. Storer and T. G. Szymanski, "The macro model for data compression", Proc. 10th STOC, pp. 30-39.
    • [Storer and Szymanski, 1982] J. A. Storer and T. G. Szymanski, "Data compression via textual substitution", JACM 29(4), pp. 928-951.
    • [Charikar et al., 2005] M. Charikar et al., "The smallest grammar problem", IEEE Trans. Inf. Theory 51(7), pp. 2554-2576.

Specialization

  • This is related to: MINIMUM EXTERNAL MACRO DATA COMPRESSION ([Model] MinimumExternalMacroDataCompression #441) — the variant with a separate dictionary string
  • This is a special case of: General macro compression (both external and internal variants are special cases of the unified macro model)

Extra Remark

Full book text:

INSTANCE: Alphabet Sigma, string s in Sigma*, pointer cost h in Z+, and a bound B in Z+.
QUESTION: Is there a single string C in (Sigma union {p_i: 1 <= i <= |s|})* such that
|C| + (h-1) * (number of occurences of pointers in C) <= B
and such that there is a way of identifying pointers with substrings of C so that s can be obtained from C by using C as both compressed string and dictionary string in the manner indicated in the previous problem?
Reference: [Storer, 1977], [Storer and Szymanski, 1978]. Transformation from VERTEX COVER.
Comment: Remains NP-complete even if h is any fixed integer 2 or greater. For other NP-complete variants (as in the previous problem), see references.

Note: The original GJ formulation is a decision problem with bound B. This issue reformulates it as an optimization problem (minimize cost). The decision version is recovered by checking whether the optimal cost ≤ B.

How to solve

  • It can be solved by (existing) bruteforce. The search space is (alphabet_size + string_len + 1)^string_len, which is too large for practical brute-force enumeration even on small instances.
  • It can be solved by reducing to integer programming. The ILP formulation uses a flow-on-DAG partition model: binary variables lit[i] (position i is a literal in C), ptr[i][l][k] (a segment starting at position i with length l copies from previously decoded position k), and flow conservation ensures positions 0..|s| are partitioned into segments. The objective minimizes total literal count + h × pointer count. This is analogous to the ILP formulation used for the external variant (MinimumExternalMacroDataCompression), but without dictionary variables.
  • Other: Practical compression algorithms (LZ77 with sliding window, LZ78) provide heuristic solutions in near-linear time.

Example Instance

Alphabet Σ = {a, b, c} (alphabet_size = 3)
String s = "abcabcabc" (string_len = 9)
Pointer cost h = 2

Token encoding:

  • Domain per variable = 3 + 9 + 1 = 13
  • Values 0..2 = literals {a, b, c}
  • Value 3 = end-of-string marker
  • Values 4..12 = pointer to C[0], C[1], ..., C[8]

Optimal compression:

  • Config = [0, 1, 2, 4, 4, 3, 3, 3, 3]
    • Positions 0–2: literals a, b, c
    • Positions 3–4: pointer to C[0] (greedy match: "abc" at each)
    • Position 5: end-of-string marker (positions 6–8 are padding)
  • Active C = [a, b, c, ptr(0), ptr(0)], length = 5

Decoding verification:

  • C = [a, b, c, ptr(0), ptr(0)]
  • ptr(0) at position 3: greedy match from C[0] = "abc" → decoded so far: "abcabc"
  • ptr(0) at position 4: greedy match from C[0] = "abc" → decoded: "abcabcabc"
  • Result: "abcabcabc" = s ✓

Cost: |C| + (h−1) × pointers = 5 + (2−1) × 2 = 7

Suboptimal feasible solutions (confirming optimality):

  1. C = [a, b, c, a, b, c, ptr(0)], length 7, pointers 1 → cost = 7 + 1 = 8
  2. C = "abcabcabc" (uncompressed), length 9, pointers 0 → cost = 9

Expected outcome: Min(7)

Reduction Rule Crossref

  • R117: MinimumVertexCover → MinimumInternalMacroDataCompression (GJ reference reduction)
  • Direct ILP: MinimumInternalMacroDataCompression → ILP (flow-on-DAG partition model, implemented together with this model)

Metadata

Metadata

Assignees

No one assigned

    Labels

    GoodAn issue passed all checks.modelA model problem to be implemented.

    Type

    No type

    Projects

    Status

    Done

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions