[Model] MinimumExternalMacroDataCompression

> [!IMPORTANT]
> **Build as optimization, not decision.** `Value = Min<usize>`.
> **Objective:** Minimize compression cost |D| + |C| + (h−1) × (pointer count).
> Do not add a `bound`/`threshold` field — let the solver find the optimum directly. See #765.

## Motivation

MINIMUM EXTERNAL MACRO DATA COMPRESSION (derived from Garey & Johnson A4 SR22, "EXTERNAL MACRO DATA COMPRESSION"). A classical NP-hard problem in data compression theory, where the goal is to find the minimum-cost compression of a string using a separate dictionary string and a compressed string with pointers. This problem formalizes the macro model of data compression introduced by Storer and Szymanski, which generalizes many practical compression schemes including LZ-family algorithms.



**Associated reduction rules:**
- **R116:** VERTEX COVER -> MINIMUM EXTERNAL MACRO DATA COMPRESSION (GJ reference reduction)

## Definition

**Name:** `MinimumExternalMacroDataCompression`
**Canonical name:**  External Macro Data Compression (also: External Pointer Macro Compression, EPM Compression)
**Reference:** Garey & Johnson, *Computers and Intractability*, A4 SR22

**Mathematical definition:**

INSTANCE: Alphabet Σ, string s ∈ Σ*, pointer cost h ∈ Z⁺.
OBJECTIVE: Find strings D (dictionary string) and C (compressed string) in (Σ ∪ {pᵢ : 1 ≤ i ≤ |s|})*, where the symbols pᵢ are "pointers," such that s can be obtained from C by repeatedly replacing pointers with their corresponding substrings of D, minimizing the total cost:

cost(D, C) = |D| + |C| + (h − 1) × (number of pointer occurrences in D and C)

The corresponding decision problem (GJ SR22) asks whether the minimum cost is at most B for a given bound B ∈ Z⁺.

## Variables



- **Count:** 2 × |s| variables — |s| slots for the dictionary string D and |s| slots for the compressed string C. Unused trailing slots are filled with a special "empty" marker.
- **D-slot domain:** {empty} ∪ Σ — each dictionary position is either an alphabet symbol or unused. Domain size = `alphabet_size + 1`.
- **C-slot domain:** {empty} ∪ Σ ∪ {ptr(start, len) : 0 ≤ start < |s|, 1 ≤ len ≤ |s| − start} — each compressed-string position is an alphabet symbol, a pointer into D (specified by start position and length), or unused. Domain size = `alphabet_size + 1 + |s| × (|s| + 1) / 2`.
- **Meaning:** A configuration encodes a candidate (D, C) pair. D (the non-empty prefix of D-slots) is a dictionary of reusable substrings. C (the non-empty prefix of C-slots) contains literal symbols and pointers into D. The pair is valid if replacing all pointers in C with their referenced D-substrings yields s. The cost is |D| + |C| + (h − 1) × (number of pointer symbols in C).

**Note:** This encoding restricts D to be pointer-free (pure alphabet string). The GJ definition allows pointers in D, but the pointer-free variant is NP-complete (Storer & Szymanski, 1982) and sufficient for all known reductions.

## Schema (data type)



**Type name:** `MinimumExternalMacroDataCompression`
**Variants:** none (no graph or weight type parameter)
**Value type:** `Min<usize>`

| Field | Type | Description |
|-------|------|-------------|
| `alphabet_size` | `usize` | Size of the alphabet Σ (symbols indexed 0..alphabet_size) |
| `string` | `Vec<usize>` | The source string s to be compressed, as a sequence of symbol indices |
| `pointer_cost` | `usize` | The pointer cost h (each pointer occurrence contributes h − 1 extra beyond the position it occupies) |

**Getters (for overhead expressions):**
- `string_length()` → `string.len()`
- `alphabet_size()` → `alphabet_size`
- `pointer_cost()` → `pointer_cost`

## Complexity



- **Optimization complexity:** NP-hard (Storer, 1977; Storer and Szymanski, 1978; transformation from VERTEX COVER). The decision version (is minimum cost ≤ B?) is NP-complete. Remains NP-hard even for fixed h ≥ 2, for alphabet size ≥ 3, and for variants where D contains no pointers.
- **Best known exact algorithm:** Brute-force over all possible (D, C) pairs: for each candidate dictionary string D of length up to |s|, enumerate all compressed strings C using alphabet symbols and pointers into D, checking whether C decodes to s. Upper bound: O(alphabet_size^string_length × (alphabet_size + string_length²)^string_length).
- **Approximation:** Practical heuristic algorithms (LZ77, LZSS, LZ78) achieve good compression ratios in linear or near-linear time but do not guarantee optimality. LZSS (Lempel-Ziv-Storer-Szymanski) is a direct practical algorithm derived from this theoretical framework. The related Smallest Grammar Problem is APX-hard (Charikar et al., 2005).
- **References:**
  - [Storer, 1977] J. A. Storer, "NP-completeness results concerning data compression", Tech. Report 234, Princeton University.
  - [Storer and Szymanski, 1978] J. A. Storer and T. G. Szymanski, "The macro model for data compression", Proc. 10th STOC, pp. 30-39.
  - [Storer and Szymanski, 1982] J. A. Storer and T. G. Szymanski, "Data compression via textual substitution", JACM 29(4), pp. 928-951.

## Specialization



- **This is related to:** INTERNAL MACRO DATA COMPRESSION (P171) -- the variant where D and C are merged into a single self-referencing string
- **Generalization of:** Many practical compression schemes (LZ77, LZSS) are restricted forms of external macro compression

## Extra Remark

**Full book text (GJ SR22, decision version):**

INSTANCE: Alphabet Sigma, string s in Sigma*, pointer cost h in Z+, and a bound B in Z+.
QUESTION: Are there strings D (dictionary string) and C (compressed string) in (Sigma union {p_i: 1 <= i <= |s|})*, where the symbols p_i are "pointers," such that
|D| + |C| + (h-1) * (number of occurrences of pointers in D and C) <= B
and such that there is a way of identifying pointers with substrings of D so that S can be obtained from C by repeatedly replacing pointers in C by their corresponding substrings in D?
Reference: [Storer, 1977], [Storer and Szymanski, 1978]. Transformation from VERTEX COVER.
Comment: Remains NP-complete even if h is any fixed integer 2 or greater. Many variants, including those in which D can contain no pointers and/or no pointers can refer to overlapping strings, are also NP-complete. If the alphabet size is fixed at 3 or greater, and the pointer cost is ceiling(h * log|s|), the problem is also NP-complete. For further variants, including the case of "original pointers," see references.

## How to solve

- [x] It can be solved by (existing) bruteforce. Enumerate all possible dictionary strings D up to length |s|, and for each D, enumerate all compressed strings C using alphabet symbols and pointers into D, checking whether C decodes to s. Track the minimum total cost across all valid (D, C) pairs.
- [ ] It can be solved by reducing to integer programming.
- [x] Other: Practical heuristic compression algorithms (LZSS, LZ77) provide approximate solutions in linear time, though they do not guarantee optimality.

## Example Instance



Alphabet Σ = {a, b, c, d, e, f} (alphabet_size = 6)
String s = [a, b, c, d, e, f, a, b, c, d, e, f, a, b, c, d, e, f] (length 18)
Pointer cost h = 2

**Optimal compression (cost = 12):**
- Dictionary D = "abcdef" (length 6)
- Compressed string C = "p₁ p₁ p₁" where p₁ → D[0..6] = "abcdef"
- Decode: "abcdef" × 3 = s ✓
- Cost = |D| + |C| + (h−1) × ptrs = 6 + 3 + 1 × 3 = 12

**Suboptimal scheme (cost = 16):**
- D = "abcdefabcdef" (length 12), C = "p₁ p₂" where p₁ → D[0..12], p₂ → D[0..6]
- Cost = 12 + 2 + 1 × 2 = 16

**No compression:** cost = |s| = 18

**Optimal value:** Min(12)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Model] MinimumExternalMacroDataCompression #441

Motivation

Definition

Variables

Schema (data type)

Complexity

Specialization

Extra Remark

How to solve

Example Instance

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Field	Type	Description
`alphabet_size`	`usize`	Size of the alphabet Σ (symbols indexed 0..alphabet_size)
`string`	`Vec<usize>`	The source string s to be compressed, as a sequence of symbol indices
`pointer_cost`	`usize`	The pointer cost h (each pointer occurrence contributes h − 1 extra beyond the position it occupies)

[Model] MinimumExternalMacroDataCompression #441

Description

Motivation

Definition

Variables

Schema (data type)

Complexity

Specialization

Extra Remark

How to solve

Example Instance

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions