-
Notifications
You must be signed in to change notification settings - Fork 3
[Model] MinimumDecisionTree #932
Description
Motivation
MINIMUM DECISION TREE (MS15) from Garey & Johnson, A1.2. Given a set of objects distinguished by binary tests, find a decision tree that identifies each object with minimum total external path length (the weighted sum of test depths for identifying each object). NP-hard even when each test has at most 3 objects with value 1 (Hyafil and Rivest 1976). The problem captures optimal test sequencing for identification — applications in medical diagnosis, fault isolation, species classification, and information-theoretic search.
Definition
Name: MinimumDecisionTree
Reference: Garey & Johnson, Computers and Intractability, A1.2 MS15; Hyafil and Rivest 1976
Mathematical definition:
INSTANCE: Set S of objects, collection T = {T_1, T_2, ..., T_m} of binary tests, where each test T_j is a function from S into {0,1}, such that for every pair of objects s, s' ∈ S there exists a test T_j ∈ T with T_j(s) ≠ T_j(s').
OBJECTIVE: Find a decision tree using tests from T that identifies each member of S and minimizes the total external path length.
A decision tree is a binary tree where each internal node is labeled by a test T_j. The left child corresponds to outcome 0 and the right child to outcome 1. Each leaf is labeled by an object. The external path length of a leaf is the number of edges from the root to that leaf. The total external path length is the sum over all leaves.
Variables
- Count: The tree structure is the decision variable. Equivalently, one can encode the tree as a sequence of test selections at each internal node.
- Per-variable domain: Each internal node chooses from {T_1, ..., T_m}.
- Meaning: The assignment of tests to internal nodes of a binary tree. The tree must correctly distinguish all objects, and the objective is to minimize the total external path length.
Schema (data type)
Type name: MinimumDecisionTree
Variants: none (unique input structure)
| Field | Type | Description |
|---|---|---|
test_matrix |
Vec<Vec<bool>> |
Binary matrix: test_matrix[j][i] = true iff object i passes test j |
num_objects |
usize |
Number of objects |
num_tests |
usize |
Number of tests |
Complexity
- Best known exact algorithm: O*(num_tests^num_objects) by enumerating all possible decision trees. NP-hard even when each test has at most 3 objects with value 1 (Hyafil and Rivest 1976; transformation from EXACT COVER BY 3-SETS).
- Complexity string:
"num_tests^num_objects"
Extra Remark
Full book text:
INSTANCE: Set S of "objects," collection T = {T_1, T_2, ..., T_m} of "binary tests," where each test T_j is a function from S into {0,1}, such that for every pair of objects s, s' ∈ S there exists a test T_j ∈ T with T_j(s) ≠ T_j(s'), positive integer K.
QUESTION: Is there a decision tree for T that "identifies" each member of S and has total external path length of at most K?
Reference: [Hyafil and Rivest, 1976]. Transformation from EXACT COVER BY 3-SETS.
Comment: NP-complete even if each test T_j has at most 3 objects s ∈ S for which T_j(s) = 1.
Note: The original G&J formulation is a decision problem with threshold K. We use the equivalent optimization formulation: minimize total external path length.
How to solve
- It can be solved by (existing) bruteforce. (Enumerate all binary trees with test labels at internal nodes; prune those that don't identify all objects; return minimum total path length.)
- It can be solved by reducing to integer programming.
- Other:
Reduction Rule Crossref
(Planned: ExactCoverBy3Sets → MinimumDecisionTree, following Hyafil & Rivest 1976)
Example Instance
Input:
Objects: S = {o0, o1, o2, o3} (4 objects).
Tests:
o0 o1 o2 o3
T0: 1 1 0 0
T1: 1 0 0 0
T2: 0 1 0 1
Each pair of objects is distinguished by at least one test:
(o0,o1): T1,T2; (o0,o2): T0,T1; (o0,o3): T0,T1,T2; (o1,o2): T0,T2; (o1,o3): T0; (o2,o3): T2.
Optimal decision tree (total path length = 8):
T0
/ \
(0) (1)
T2 T1
/ \ / \
o2 o3 o1 o0
- Root: test T0. Left (T0=0): objects {o2, o3}. Right (T0=1): objects {o0, o1}.
- Left child: test T2. T2(o2)=0 → leaf o2 (depth 2). T2(o3)=1 → leaf o3 (depth 2).
- Right child: test T1. T1(o1)=0 → leaf o1 (depth 2). T1(o0)=1 → leaf o0 (depth 2).
- Total path length = 2+2+2+2 = 8. ✓
Suboptimal decision tree (total path length = 9):
T1
/ \
(0) (1)
T0 o0
/ \
T2 o1
/ \
o2 o3
- Root: test T1. T1(o0)=1 → leaf o0 (depth 1). T1=0: objects {o1, o2, o3}.
- T0 on {o1,o2,o3}: T0(o1)=1 → leaf o1 (depth 2). T0=0: objects {o2, o3}.
- T2 on {o2,o3}: T2(o2)=0 → leaf o2 (depth 3). T2(o3)=1 → leaf o3 (depth 3).
- Total path length = 1+2+3+3 = 9.
Why the balanced tree wins: Starting with T1 (which isolates only o0) gives depth 1 for o0 but pushes three objects deeper. The balanced split by T0 (2-2) yields uniform depth 2 for all objects, achieving the minimum total path length.
Expected Outcome
Optimal value: Min(8) — the minimum total external path length is 8. There are 6 valid decision trees with 2 distinct path lengths: 4 trees with PL=8 and 2 trees with PL=9.
Metadata
Metadata
Assignees
Labels
Type
Projects
Status