diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index 6894062..43ec69d 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -52,7 +52,7 @@ jobs:
working-directory: solution/backend/rust
run: cargo build --release
- - name: Run unit tests (15 tests)
+ - name: Run unit tests
working-directory: solution/backend/rust
run: cargo test -- --nocapture
diff --git a/solution/CHANGELOG.md b/solution/CHANGELOG.md
index f1c109b..951befa 100644
--- a/solution/CHANGELOG.md
+++ b/solution/CHANGELOG.md
@@ -73,7 +73,7 @@ Format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
- Traversal optimization: snake/zig-zag tile order for MatMul data reuse.
- **Test suite**
- - 15 Rust unit tests in `src/main.rs`:
+ - 18 Rust unit tests in `src/main.rs`:
- Example 1 (baseline pointwise chain): strategies A, B, C
- Example 2 (larger tensors, 256x256): strategies A and B
- Example 3 (diamond graph): spilling baseline and selective retention
@@ -82,6 +82,9 @@ Format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
- Edge cases: single tiny op, OOM detection, serialization round-trip,
ephemeral tensor boundary correctness, cyclic DAG rejection
- All 5 released benchmarks: full pipeline validity check
+ - New tests (3): `test_fused_matmul_pointwise_splitk`,
+ `test_fused_matmul_pointwise_splitk_boundary_pw_input`,
+ `test_mixed_k_two_matmuls`
- E2E script (`solution/scripts/test-e2e.sh`):
- Track A build + 5 benchmark validation
- Track B import verification + 5 benchmark validation (baseline mode)
@@ -105,14 +108,17 @@ Format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
topology (Rust binary + Python agent), user journeys, C4 workspace,
error catalog (Rust error handling), security model.
- `solution/docs/decisions/` — ADR-001 (Rust + Python language selection),
- ADR-002 (baseline-first development), ADR-003 (greedy fusion over DP).
+ ADR-002 (baseline-first development), ADR-003 (greedy fusion over DP),
+ ADR-004 (k-dimension search in granularity optimization),
+ ADR-005 (closed-form latency evaluation),
+ ADR-006 (mixed-K fusion).
- **Benchmark results** (Track A — Rust)
- | Benchmark | Ops | Latency |
- |-----------|-----|---------|
- | mlsys-2026-1 | 5 | 27,443 |
- | mlsys-2026-5 | 19 | 27,856 |
- | mlsys-2026-9 | 32 | 110,100 |
- | mlsys-2026-13 | 63 | 191,693 |
- | mlsys-2026-17 | 103 | 23,650 |
+ | Benchmark | Ops | Latency | Subgraphs |
+ |-----------|-----|---------|-----------|
+ | mlsys-2026-1 | 5 | 262,822 | 4 |
+ | mlsys-2026-5 | 19 | 909,261 | 13 |
+ | mlsys-2026-9 | 32 | 12,415,140 | 24 |
+ | mlsys-2026-13 | 63 | 4,707,779 | 25 |
+ | mlsys-2026-17 | 103 | 814,572 | 81 |
diff --git a/solution/README.md b/solution/README.md
index 7a59d8f..c044029 100644
--- a/solution/README.md
+++ b/solution/README.md
@@ -84,7 +84,7 @@ Solution: 3 subgraphs, total latency = 8234.56
Solution written to /tmp/out.json
```
-To run the full unit test suite (15 tests, including all 5 released benchmarks):
+To run the full unit test suite (18 tests, including all 5 released benchmarks):
```bash
cargo test
@@ -232,7 +232,7 @@ solution/
│ └── rust/
│ ├── Cargo.toml
│ └── src/
-│ ├── main.rs # Entry point + 15 unit tests
+│ ├── main.rs # Entry point + 18 unit tests
│ ├── models.rs
│ ├── parser.rs
│ ├── dag.rs
@@ -274,7 +274,7 @@ solution/
## Testing
-### Track A — Rust Unit Tests (15 tests)
+### Track A — Rust Unit Tests (18 tests)
```bash
cd solution/backend/rust
@@ -336,18 +336,18 @@ Validation checks per output file:
All 5 released benchmarks produce valid solutions within the memory constraint.
Reported latencies are from Track A (Rust) on the local machine.
-| Benchmark | Ops | Tensors | Fast Mem | Bandwidth | Track A Latency |
-|-----------|-----|---------|----------|-----------|-----------------|
-| mlsys-2026-1 | 5 | 9 | 60,000 | 20 | 27,443 |
-| mlsys-2026-5 | 19 | 29 | 30,000 | 15 | 27,856 |
-| mlsys-2026-9 | 32 | 49 | 250,000 | 25 | 110,100 |
-| mlsys-2026-13 | 63 | 100 | 600,000 | 50 | 191,693 |
-| mlsys-2026-17 | 103 | 160 | 500,000 | 100 | 23,650 |
-
-All benchmarks complete in under 1 second. The optimizer fuses adjacent
-chains, applies Split-K for memory-constrained MatMuls, searches tile
-granularities to balance compute/memory costs, and uses snake traversal
-for MatMul data reuse.
+| Benchmark | Ops | Tensors | Fast Mem | Bandwidth | Track A Latency | Subgraphs |
+|-----------|-----|---------|----------|-----------|-----------------|-----------|
+| mlsys-2026-1 | 5 | 9 | 60,000 | 20 | 262,822 | 4 |
+| mlsys-2026-5 | 19 | 29 | 30,000 | 15 | 909,261 | 13 |
+| mlsys-2026-9 | 32 | 49 | 250,000 | 25 | 12,415,140 | 24 |
+| mlsys-2026-13 | 63 | 100 | 600,000 | 50 | 4,707,779 | 25 |
+| mlsys-2026-17 | 103 | 160 | 500,000 | 100 | 814,572 | 81 |
+
+The optimizer runs in under 1 second per benchmark on a standard dev machine. It uses cost-based
+fusion with epsilon tolerance to merge adjacent chains, applies Split-K for
+memory-constrained MatMuls, searches tile granularities to balance
+compute/memory costs, and uses snake traversal for MatMul data reuse.
---
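Editor's note: the snake traversal mentioned above can be sketched in a few lines. This is an illustrative sketch only, not the Track A Rust implementation; the function name and grid parameters are hypothetical.

```python
def snake_order(n_rows, n_cols):
    """Enumerate (row, col) tiles in snake/zig-zag order.

    Even rows go left-to-right, odd rows right-to-left, so consecutive
    tiles share an input strip at each row turn instead of reloading it
    after a full sweep back to column 0.
    """
    order = []
    for r in range(n_rows):
        cols = range(n_cols) if r % 2 == 0 else range(n_cols - 1, -1, -1)
        for c in cols:
            order.append((r, c))
    return order

order = snake_order(3, 3)
assert order[:3] == [(0, 0), (0, 1), (0, 2)]   # row 0 left-to-right
assert order[3:6] == [(1, 2), (1, 1), (1, 0)]  # row 1 reversed
```

The key property is that the last tile of row r and the first tile of row r+1 share the same column, so the column strip loaded for one is immediately reused by the other.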
diff --git a/solution/checkpoints/stage-4-validation.md b/solution/checkpoints/stage-4-validation.md
index 658ca83..bc56045 100644
--- a/solution/checkpoints/stage-4-validation.md
+++ b/solution/checkpoints/stage-4-validation.md
@@ -6,7 +6,7 @@
- Remaining: ~60 minutes (Stage 5)
## Deliverables
-- [x] Track A (Rust) unit tests - 15 tests passing
+- [x] Track A (Rust) unit tests - 18 tests passing
- [x] Track B (Python) unit tests - 29 tests passing
- [x] E2E happy path script - 13/13 tests passing
- [x] Both tracks validated against all 5 benchmarks
@@ -43,6 +43,7 @@
|-------|------|-------|--------|
| Track A (Rust) - Latency model examples | `src/main.rs` | 9 | PASS |
| Track A (Rust) - Edge cases + benchmarks | `src/main.rs` | 6 | PASS |
+| Track A (Rust) - Mixed-K + split-K + boundary PW | `src/main.rs` | 3 | PASS |
| Track B (Python) - Example 1 (Pointwise chain) | `tests/test_evaluator.py` | 4 | PASS |
| Track B (Python) - Example 2 (Larger tensors) | `tests/test_evaluator.py` | 2 | PASS |
| Track B (Python) - Example 3 (Diamond graph) | `tests/test_evaluator.py` | 5 | PASS |
@@ -50,9 +51,9 @@
| Track B (Python) - Example 5 (Split-K) | `tests/test_evaluator.py` | 1 | PASS |
| Track B (Python) - Edge cases | `tests/test_evaluator.py` | 11 | PASS |
| Track B (Python) - Benchmark integration | `tests/test_evaluator.py` | 5 | PASS |
-| **Total** | | **44** | **PASS** |
+| **Total** | | **47** | **PASS** |
-### Rust Test Details (15 tests)
+### Rust Test Details (18 tests)
| Test | Validates |
|------|-----------|
@@ -71,6 +72,9 @@
| test_edge_fusion_ephemeral_correctness | Tensor 3 ephemeral in fused [0,1] |
| test_edge_cyclic_dag_rejected | Cyclic input returns Err("cycle") |
| test_benchmark_solutions_validity | All 5 benchmarks: full op coverage, valid JSON |
+| test_fused_matmul_pointwise_splitk | Fused MatMul+Pointwise with split-K granularity |
+| test_fused_matmul_pointwise_splitk_boundary_pw_input | Boundary Pointwise input tensor memory accounting |
+| test_mixed_k_two_matmuls | Two MatMuls with different K_full in same subgraph |
### E2E Tests
@@ -100,7 +104,7 @@
| Track A Scheduler | Rust (Cargo, edition 2021) | Compiled and passing |
| Track B Agent | Python 3.12 + google-genai | Imports OK, baseline mode works |
| Track B Evaluator | Pure Python (no external deps) | 29 tests passing |
-| Test Runner (Rust) | `cargo test` | 15/15 passing |
+| Test Runner (Rust) | `cargo test` | 18/18 passing |
| Test Runner (Python) | pytest 9.0.2 via uv venv | 29/29 passing |
### Benchmark Latency Summary
@@ -145,7 +149,7 @@ No bugs were found requiring fixes. All tests passed on first run after the `inc
## Ready for Next Stage?
- [x] All deliverables complete
- [x] Judge validation passed (5.00/5)
-- [x] 44 unit tests passing (15 Rust + 29 Python)
+- [x] 47 unit tests passing (18 Rust + 29 Python)
- [x] 13/13 E2E tests passing
- [x] Both tracks validated against all 5 benchmark problems
diff --git a/solution/docs/architecture/data-flow.md b/solution/docs/architecture/data-flow.md
index dddde78..05cfa68 100644
--- a/solution/docs/architecture/data-flow.md
+++ b/solution/docs/architecture/data-flow.md
@@ -63,7 +63,7 @@ sequenceDiagram
S-->>G: ScheduleState (with split-K)
G->>G: For each subgraph, generate (w, h, k) candidates
-    Note over G: k candidates: K_cap down to 1 in powers of 2; K_cap = min(K_full across MatMuls in subgraph)
+    Note over G: k candidates: K_max down to 1 by repeated halving; K_max = max(K_full across MatMuls in subgraph)
G->>M: Check OOM for each candidate
G->>L: Calculate total latency for each valid (w, h, k)
L-->>G: candidate latencies (sum of per-step roofline)
@@ -135,7 +135,7 @@ flowchart LR
S2[2. Greedy Fusion Merge adjacent ops]
S3[3. Retention pass 1 Keep tensors resident]
S4[4. Split-K Reduce k for OOM relief]
- S5[5. Granularity Search Optimize w,h,k per subgraph k: K_cap...1 in powers of 2]
+ S5[5. Granularity Search Optimize w,h,k per subgraph k: K_max...1 by halving]
S6[6. Retention pass 2 Re-evaluate after granularity changes]
S7[7. Emergency OOM Fix Reduce granularity for any remaining OOM]
S8[8. Final Latency Recalculate all subgraph latencies]
diff --git a/solution/docs/architecture/database-schema.md b/solution/docs/architecture/database-schema.md
index b6c533d..f2554b0 100644
--- a/solution/docs/architecture/database-schema.md
+++ b/solution/docs/architecture/database-schema.md
@@ -122,7 +122,7 @@ fn optimize_*(subgraphs: &mut Vec, problem: &Problem, dag: &DagInfo
| 1 | 9 | 5 | 512x512 (262,144) | 60,000 | 20 |
| 5 | 29 | 19 | 1024x1024 (1,048,576) | 30,000 | 15 |
| 9 | 49 | 32 | 4096x4096 (16,777,216) | 250,000 | 25 |
-| 13 | 96 | 63 | 4096x4096 (16,777,216) | 600,000 | 50 |
+| 13 | 100 | 63 | 4096x4096 (16,777,216) | 600,000 | 50 |
| 17 | 160 | 103 | 2048x2048 (4,194,304) | 500,000 | 100 |
Note: Tensor sizes can be much larger than fast memory capacity, requiring tiling (spatial granularity < tensor dimensions).
diff --git a/solution/docs/architecture/security-model.md b/solution/docs/architecture/security-model.md
index bf01968..2b262de 100644
--- a/solution/docs/architecture/security-model.md
+++ b/solution/docs/architecture/security-model.md
@@ -1,6 +1,20 @@
# Security Model
-This is a single-user CLI tool that processes local JSON files. There is no network communication, no authentication, no multi-tenancy, and no user-supplied code execution.
+This is a single-user CLI tool that processes local JSON files. Track A (Rust
+binary) has no network communication, no authentication, no multi-tenancy, and
+no user-supplied code execution.
+
+Track B (Python agent) makes HTTPS calls to the Gemini API
+(`generativelanguage.googleapis.com`) when a `GOOGLE_API_KEY` environment
+variable is set. When `GOOGLE_API_KEY` is omitted or set to `dummy`, the agent
+runs in local-only mode with zero network traffic. When `GOOGLE_API_KEY` is
+valid, Track B transmits the full problem JSON and the current solution JSON to
+the Gemini API as part of the prompt; the API client does not log or persist
+problem data or solution JSON beyond the scope of a single agent run. No claims
+are made about Gemini's data retention policies, which are controlled by the
+contest organizers' environment. On API error paths, the agent may print a
+preview of the Gemini response (first 500 chars) to stderr for debugging.
## Threat Model
@@ -9,7 +23,7 @@ This is a single-user CLI tool that processes local JSON files. There is no netw
| Injection attacks (SQL, command) | No | No database, no shell commands |
| Malicious input JSON | Low risk | JSON parser handles untrusted input safely; no `eval()` |
| Denial of service | Not applicable | Single-user local tool |
-| Data exfiltration | Not applicable | No network, no secrets |
+| Data exfiltration | Track B only | HTTPS to Gemini API; no credentials stored; `GOOGLE_API_KEY` read from environment only |
| Path traversal | Low risk | CLI accepts file paths; use `std::fs::canonicalize()` for safety |
## Input Validation
@@ -21,4 +35,9 @@ This is a single-user CLI tool that processes local JSON files. There is no netw
## No Secrets
-This project contains no API keys, passwords, tokens, or credentials. The `.env` file pattern is not applicable.
+Track A contains no API keys, passwords, tokens, or credentials. The `.env` file
+pattern is not applicable for the Rust binary.
+
+Track B reads `GOOGLE_API_KEY` from the environment at runtime. This key is never
+written to disk, never logged, and never embedded in source code. The agent falls
+back to local-only mode if the key is absent or set to `dummy`.
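Editor's note: the local-only fallback described above amounts to a small environment check. A minimal sketch, with an illustrative function name (not the agent's actual API):

```python
import os

def agent_mode(env=None):
    """Decide whether the agent may call the Gemini API.

    The key is read from the environment only; an absent key or the
    sentinel value 'dummy' means local-only mode with no network traffic.
    """
    env = os.environ if env is None else env
    key = env.get("GOOGLE_API_KEY")
    if not key or key == "dummy":
        return "local-only"
    return "api"

assert agent_mode({}) == "local-only"
assert agent_mode({"GOOGLE_API_KEY": "dummy"}) == "local-only"
```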
diff --git a/solution/docs/architecture/system-design.md b/solution/docs/architecture/system-design.md
index e6c43ed..587866e 100644
--- a/solution/docs/architecture/system-design.md
+++ b/solution/docs/architecture/system-design.md
@@ -6,7 +6,7 @@ This is a **computational optimization tool**, not a web service. It is a single
## Scale Estimates
-- Input size: 2 ops / 3 tensors (trivial) to 96 ops / 160 tensors (benchmark 17)
+- Input size: 2 ops / 3 tensors (trivial) to 103 ops / 160 tensors (benchmark 17)
- Runtime target: < 2 seconds per benchmark on a standard developer machine
- No concurrency, no network, no database
- Memory: All data fits easily in RAM (< 1 MB input, < 10 MB working state)
@@ -470,28 +470,17 @@ working_set = sum(slice_size for each boundary input and output tensor that must
### Retained Tensors from Previous Subgraphs
-When a previous subgraph retains a tensor, that tensor occupies fast memory at its **full size** (not a slice), because it was computed across all spatial tiles and remains fully materialized.
+When a previous subgraph retains a tensor, that tensor occupies fast memory at its **full size** (not a slice), because it was fully materialized by the prior subgraph and remains resident across subgraph boundaries.
-Wait -- actually, retained tensors are computed slice-by-slice but the full tensor accumulates. Let me reconsider.
-
-Actually, from Example 3C: Tensor1 (128x128 = 16384) is retained. The working set of subgraph 1 must include this full tensor. The subgraph 1 has Tensor1 as input (already resident), processes Op1 and Op2 producing Tensor3. Working set = Tensor1 (16384, resident) + Tensor2 (ephemeral, 0) + Tensor3 output (16384) = 32768 <= 50000. This works.
-
-But wait -- if the subgraph uses a granularity smaller than the tensor, only a slice of the retained tensor is needed per step. The retained tensor is at full size in fast memory though (it was fully computed by the prior subgraph at its granularity).
-
-Actually, the problem says retained tensors stay in fast memory at full size. The working set calculation must include:
-- The **full size** of all currently retained tensors
+The working set calculation must include:
+- The **full size** of all currently retained tensors from previous subgraphs
- Plus the **slice sizes** of all boundary inputs/outputs needed for the current execution step
-Correction: from Example 5B, the accumulator Tensor4 (128x128 = 16384) and Tensor0 (128x128 = 16384) are resident, plus Tensor1 strip (128x32 = 4096) and Tensor2 strip (32x128 = 4096). Working set = 16384 + 16384 + 4096 + 4096 = 40960. That matches.
-
-But Tensor0 is a full input that gets loaded in step 1 and reused. It's NOT a retained tensor from a previous subgraph -- it's loaded in this subgraph. Tensor4 is the accumulator (output). So the working set includes:
-- Full-size inputs that are resident (loaded once, reused): full tensor size
-- Streamed input strips: slice size
-- Output/accumulator: slice size (w * h)
+From Example 3C: Tensor1 (128x128 = 16384) is retained. The working set of subgraph 1 includes Tensor1 at full size (16384, already resident), Tensor2 as ephemeral (0), and Tensor3 output slice (16384). Working set = 32768 <= 50000. This confirms that retained tensors count at full size.
-This is more nuanced. The working set depends on which step we're computing and the traversal order. The **maximum** working set across all steps must fit.
+From Example 5B: the accumulator Tensor4 (128x128 = 16384) and Tensor0 (128x128 = 16384) are resident, plus Tensor1 strip (128x32 = 4096) and Tensor2 strip (32x128 = 4096). Working set = 16384 + 16384 + 4096 + 4096 = 40960. Note that Tensor0 here is a full input loaded and reused within this subgraph (not a cross-subgraph retained tensor), and Tensor4 is the accumulator output.
-For the OOM check, we need the **worst-case step** (typically the first step, where the most data is loaded fresh).
+For the OOM check, the **worst-case step** (typically the first step, where the most data is loaded fresh) must fit within fast memory capacity.
---
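Editor's note: the worst-case-step OOM check described above can be illustrated with the numbers from Examples 3C and 5B. A minimal sketch with illustrative names; the 50,000 capacity is the Example 3C budget:

```python
def fits_in_fast_memory(full_resident, per_step_slices, capacity):
    """OOM check for the worst-case step.

    Tensors resident at full size (cross-subgraph retained tensors,
    reused full inputs, accumulators) count whole; streamed boundary
    tensors count one slice/strip each per step.
    """
    working_set = sum(full_resident) + sum(per_step_slices)
    return working_set, working_set <= capacity

# Example 5B: Tensor0 and the Tensor4 accumulator resident at full size
# (128*128 each); Tensor1 and Tensor2 stream in strips (128*32, 32*128).
ws, ok = fits_in_fast_memory(
    full_resident=[128 * 128, 128 * 128],
    per_step_slices=[128 * 32, 32 * 128],
    capacity=50_000,
)
assert (ws, ok) == (40960, True)
```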
@@ -583,8 +572,8 @@ total_latency = sum(subgraph_latency for each subgraph)
| 1 | 5 | 9 | 512x512 | 60,000 | 20 | Linear chain (MatMul + Pointwise) |
| 5 | 19 | 29 | 128-1024 mixed | 30,000 | 15 | 3x attention heads + aggregation |
| 9 | 32 | 49 | 1024-4096 mixed | 250,000 | 25 | 8x repeating MatMul+PW blocks |
-| 13 | 63 | 96 | 128-4096 mixed | 600,000 | 50 | 16x parallel MatMul heads + PW aggregation |
-| 17 | 96 | 160 | 128-2048 mixed | 500,000 | 100 | 8x attention + 8x MLP blocks + residual |
+| 13 | 63 | 100 | 128-4096 mixed | 600,000 | 50 | 16x parallel MatMul heads + PW aggregation |
+| 17 | 103 | 160 | 128-2048 mixed | 500,000 | 100 | 8x attention + 8x MLP blocks + residual |
---
diff --git a/solution/docs/architecture/workspace.dsl b/solution/docs/architecture/workspace.dsl
index d1b54ff..ed4fa3c 100644
--- a/solution/docs/architecture/workspace.dsl
+++ b/solution/docs/architecture/workspace.dsl
@@ -57,7 +57,7 @@ workspace "MLSys DAG Scheduler" "Computational graph scheduler for memory-constr
description "Finds optimal k for MatMul subgraphs under memory pressure"
}
granularityComponent = component "Granularity Search" "Spatial tiling" {
- description "Searches (w, h) candidates to minimize per-subgraph latency"
+ description "Searches (w, h, k) candidates to minimize per-subgraph latency"
}
traversalComponent = component "Traversal Order" "Tile ordering" {
description "Snake/zig-zag traversal to reduce input strip reloads"
diff --git a/solution/docs/decisions/ADR-003-greedy-fusion.md b/solution/docs/decisions/ADR-003-greedy-fusion.md
index 5ec2cb7..d7c35b9 100644
--- a/solution/docs/decisions/ADR-003-greedy-fusion.md
+++ b/solution/docs/decisions/ADR-003-greedy-fusion.md
@@ -6,7 +6,7 @@ Accepted
## Context
Operation grouping (fusion) is the highest-impact optimization in the scheduler. Grouping adjacent ops into a single subgraph makes intermediate tensors ephemeral (zero memory, zero transfer cost), which can dramatically reduce latency (2x improvement shown in Example 1B).
-The grouping problem is combinatorial: for N ops, the number of possible partitions into subgraphs is the Bell number B(N), which grows super-exponentially. For benchmark 17 (96 ops), exhaustive search is infeasible.
+The grouping problem is combinatorial: for N ops, the number of possible partitions into subgraphs is the Bell number B(N), which grows super-exponentially. For benchmark 17 (103 ops), exhaustive search is infeasible.
### Alternatives Considered
@@ -16,7 +16,7 @@ The grouping problem is combinatorial: for N ops, the number of possible partiti
3. **Beam search**: Maintain top-K candidate partitions, expand greedily. Better than pure greedy but O(K * N^2) with uncertain quality guarantees.
-4. **ILP/constraint solver**: Formulate as integer program. Optimal but requires a solver dependency and may be slow for N=96.
+4. **ILP/constraint solver**: Formulate as integer program. Optimal but requires a solver dependency and may be slow for N=103.
## Decision
Use **greedy bottom-up fusion** with the following rules:
@@ -41,8 +41,8 @@ A merge is only valid if the resulting subgraph is a **connected directed subDAG
## Consequences
### Positive
-- **Simple to implement**: ~150 lines of Python, well within the time budget
-- **Fast to execute**: O(N^2) worst case, sub-second for N=96
+- **Simple to implement**: straightforward to write in both Rust (Track A) and Python (Track B), well within the time budget
+- **Fast to execute**: O(N^2) worst case, sub-second for N=103
- **Deterministic**: Same input always produces the same output
- **Incremental**: Each merge is independently validated -- no risk of cascading failures
- **Good enough**: For the linear and repeating block structures in the benchmarks, greedy fusion captures most of the benefit (chains fuse naturally)
@@ -59,3 +59,22 @@ A merge is only valid if the resulting subgraph is a **connected directed subDAG
### Neutral
- Greedy fusion is the standard approach in production ML compilers (XLA, TVM, Triton) for operation grouping
+
+---
+
+## Enhancement: Cost-Based Fusion with Epsilon Tolerance
+
+Since the initial ADR was written, the merge criterion was strengthened (Issue #16).
+The pure feasibility check (merge if working set fits) was replaced with a
+**cost-based merge criterion**: merge subgraphs A and B only when
+`latency(A+B, best_gran_fused) < latency(A, best_gran_A) + latency(B, best_gran_B)`.
+
+An epsilon tolerance is applied: a merge is accepted only when the fused latency
+is strictly lower than the split latency by more than the tolerance (i.e.,
+`lat_fused < lat_split - eps`). Borderline cases within epsilon are rejected,
+ensuring merges provide a meaningful improvement.
+
+This prevents fusions where forcing a shared granularity on the merged subgraph
+degrades latency more than the DRAM savings from making intermediate tensors
+ephemeral. The decision to merge is now based on measured benefit rather than
+assumed benefit.
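Editor's note: the epsilon-tolerance merge test above reduces to a one-line predicate. A sketch with an illustrative eps value; the actual tolerance is an implementation detail of the scheduler:

```python
EPS = 1e-6  # illustrative tolerance, not the scheduler's actual value

def should_merge(lat_a, lat_b, lat_fused, eps=EPS):
    """Cost-based merge test: accept the fusion only if the fused
    subgraph beats the sum of the split subgraphs by more than eps."""
    return lat_fused < (lat_a + lat_b) - eps

assert should_merge(100.0, 50.0, 120.0)      # clear win: merge
assert not should_merge(100.0, 50.0, 150.0)  # no improvement: keep split
```

Borderline fusions whose benefit is smaller than eps are rejected, which is exactly the "meaningful improvement" requirement stated above.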
diff --git a/solution/docs/decisions/ADR-004-k-dimension-search.md b/solution/docs/decisions/ADR-004-k-dimension-search.md
index 0ddecc9..493d58a 100644
--- a/solution/docs/decisions/ADR-004-k-dimension-search.md
+++ b/solution/docs/decisions/ADR-004-k-dimension-search.md
@@ -1,7 +1,12 @@
# ADR-004: Full k-Dimension Search in Granularity Optimization
## Status
-Accepted
+Accepted (partially superseded by ADR-006)
+
+**Note:** ADR-006 (Mixed-K Fusion) superseded the `min(K_full)` cap used in the
+initial granularity search. The current implementation uses `K_max = max(K_full)`
+across all MatMuls in the subgraph as the upper bound for k candidates. See ADR-006
+for the mixed-K execution model that makes this correct.
## Context
The initial granularity search only varied (w, h) spatially and used k=1 (or the k value inherited from the split-K stage). This produced pathologically bad schedules for MatMul-heavy subgraphs where k=1 creates K_full k-steps per spatial tile, each loading tiny input strips.
@@ -14,7 +19,7 @@ Benchmark analysis showed:
The root cause: the search favored k=1 because it minimizes the per-step working set (smallest slices), but it did not properly account for the multiplicative k-step count and the repeated strip reloading that come with smaller k values.
## Decision
-Search k from min(K_full across all MatMuls in the subgraph) down to 1 in powers of 2, jointly with (w, h) spatial candidates. The full search space becomes:
+Search k from max(K_full across all MatMuls in the subgraph) down to 1 by repeated integer halving, jointly with (w, h) spatial candidates. The full search space becomes:
```
candidates = w_values x h_values x k_values
@@ -23,14 +28,14 @@ candidates = w_values x h_values x k_values
Where:
- `w_values`: powers of 2 up to output width (as before)
- `h_values`: powers of 2 up to output height (as before)
-- `k_values`: K_cap, K_cap/2, K_cap/4, ..., 1 where K_cap = min(K_full for each MatMul in the subgraph)
+- `k_values`: K_max, floor(K_max/2), floor(K_max/4), ..., 1 (repeated integer halving from K_max) where K_max = max(K_full for each MatMul in the subgraph)
For each (w, h, k) candidate:
1. Check the working set fits in fast memory (OOM constraint)
2. Compute total subgraph latency as the sum of per-step roofline costs across all tiles and k-steps
3. Select the (w, h, k) triple that minimizes total subgraph latency
-Using min(K_full) across MatMuls ensures k never exceeds any op's reduction dimension. Note: k candidates are powers of 2, so `ceil(K_full / k)` correctly handles cases where k does not evenly divide K_full — the last k-step simply processes the remainder.
+Using max(K_full) across MatMuls as the upper bound (K_max) drives the subgraph for the full extent of the largest reduction dimension. MatMuls with smaller K_full values become inactive once they finish their k-steps, contributing zero compute and memory on subsequent steps. See ADR-006 for the mixed-K execution model. Note: k candidates are generated by repeated integer halving (not necessarily powers of 2 unless K_max is). `ceil(K_full / k)` correctly handles cases where k does not evenly divide K_full — the last k-step simply processes the remainder.
## Consequences
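Editor's note: the amended candidate generation and per-MatMul k-step count can be sketched as follows (function names illustrative):

```python
import math

def k_candidates(k_max):
    """k values from K_max down to 1 by repeated integer halving
    (ADR-004 as amended); not powers of 2 unless K_max is one."""
    ks, k = [], k_max
    while k >= 1:
        ks.append(k)
        if k == 1:
            break
        k //= 2
    return ks

def k_steps(k_full, k):
    """k-steps for one MatMul; the last step covers any remainder."""
    return math.ceil(k_full / k)

assert k_candidates(96) == [96, 48, 24, 12, 6, 3, 1]  # halving chain
assert k_steps(100, 24) == 5  # 4 full steps + a remainder step of 4
```

Under the mixed-K model, a MatMul with K_full smaller than the subgraph's chosen k extent simply runs out of k-steps early and contributes nothing on later steps.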