Merged
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -52,7 +52,7 @@ jobs:
working-directory: solution/backend/rust
run: cargo build --release

- name: Run unit tests (15 tests)
- name: Run unit tests
working-directory: solution/backend/rust
run: cargo test -- --nocapture

24 changes: 15 additions & 9 deletions solution/CHANGELOG.md
@@ -73,7 +73,7 @@ Format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
- Traversal optimization: snake/zig-zag tile order for MatMul data reuse.

- **Test suite**
- 15 Rust unit tests in `src/main.rs`:
- 18 Rust unit tests in `src/main.rs`:
- Example 1 (baseline pointwise chain): strategies A, B, C
- Example 2 (larger tensors, 256x256): strategies A and B
- Example 3 (diamond graph): spilling baseline and selective retention
@@ -82,6 +82,9 @@ Format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
- Edge cases: single tiny op, OOM detection, serialization round-trip,
ephemeral tensor boundary correctness, cyclic DAG rejection
- All 5 released benchmarks: full pipeline validity check
- New tests (3): `test_fused_matmul_pointwise_splitk`,
`test_fused_matmul_pointwise_splitk_boundary_pw_input`,
`test_mixed_k_two_matmuls`
- E2E script (`solution/scripts/test-e2e.sh`):
- Track A build + 5 benchmark validation
- Track B import verification + 5 benchmark validation (baseline mode)
@@ -105,14 +108,17 @@ Format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
topology (Rust binary + Python agent), user journeys, C4 workspace,
error catalog (Rust error handling), security model.
- `solution/docs/decisions/` — ADR-001 (Rust + Python language selection),
ADR-002 (baseline-first development), ADR-003 (greedy fusion over DP).
ADR-002 (baseline-first development), ADR-003 (greedy fusion over DP),
ADR-004 (k-dimension search in granularity optimization),
ADR-005 (closed-form latency evaluation),
ADR-006 (mixed-K fusion).

- **Benchmark results** (Track A — Rust)

| Benchmark | Ops | Latency |
|-----------|-----|---------|
| mlsys-2026-1 | 5 | 27,443 |
| mlsys-2026-5 | 19 | 27,856 |
| mlsys-2026-9 | 32 | 110,100 |
| mlsys-2026-13 | 63 | 191,693 |
| mlsys-2026-17 | 103 | 23,650 |
| Benchmark | Ops | Latency | Subgraphs |
|-----------|-----|---------|-----------|
| mlsys-2026-1 | 5 | 262,822 | 4 |
| mlsys-2026-5 | 19 | 909,261 | 13 |
| mlsys-2026-9 | 32 | 12,415,140 | 24 |
| mlsys-2026-13 | 63 | 4,707,779 | 25 |
| mlsys-2026-17 | 103 | 814,572 | 81 |
30 changes: 15 additions & 15 deletions solution/README.md
@@ -84,7 +84,7 @@ Solution: 3 subgraphs, total latency = 8234.56
Solution written to /tmp/out.json
```

To run the full unit test suite (15 tests, including all 5 released benchmarks):
To run the full unit test suite (18 tests, including all 5 released benchmarks):

```bash
cargo test
@@ -232,7 +232,7 @@ solution/
│ └── rust/
│ ├── Cargo.toml
│ └── src/
│ ├── main.rs # Entry point + 15 unit tests
│ ├── main.rs # Entry point + 18 unit tests
│ ├── models.rs
│ ├── parser.rs
│ ├── dag.rs
@@ -274,7 +274,7 @@ solution/

## Testing

### Track A — Rust Unit Tests (15 tests)
### Track A — Rust Unit Tests (18 tests)

```bash
cd solution/backend/rust
@@ -336,18 +336,18 @@ Validation checks per output file:
All 5 released benchmarks produce valid solutions within the memory constraint.
Reported latencies are from Track A (Rust) on the local machine.

| Benchmark | Ops | Tensors | Fast Mem | Bandwidth | Track A Latency |
|-----------|-----|---------|----------|-----------|-----------------|
| mlsys-2026-1 | 5 | 9 | 60,000 | 20 | 27,443 |
| mlsys-2026-5 | 19 | 29 | 30,000 | 15 | 27,856 |
| mlsys-2026-9 | 32 | 49 | 250,000 | 25 | 110,100 |
| mlsys-2026-13 | 63 | 100 | 600,000 | 50 | 191,693 |
| mlsys-2026-17 | 103 | 160 | 500,000 | 100 | 23,650 |

All benchmarks complete in under 1 second. The optimizer fuses adjacent
chains, applies Split-K for memory-constrained MatMuls, searches tile
granularities to balance compute/memory costs, and uses snake traversal
for MatMul data reuse.
| Benchmark | Ops | Tensors | Fast Mem | Bandwidth | Track A Latency | Subgraphs |
|-----------|-----|---------|----------|-----------|-----------------|-----------|
| mlsys-2026-1 | 5 | 9 | 60,000 | 20 | 262,822 | 4 |
| mlsys-2026-5 | 19 | 29 | 30,000 | 15 | 909,261 | 13 |
| mlsys-2026-9 | 32 | 49 | 250,000 | 25 | 12,415,140 | 24 |
| mlsys-2026-13 | 63 | 100 | 600,000 | 50 | 4,707,779 | 25 |
| mlsys-2026-17 | 103 | 160 | 500,000 | 100 | 814,572 | 81 |

The optimizer runs in under 1 second per benchmark on a standard dev machine. It uses cost-based
fusion with epsilon tolerance to merge adjacent chains, applies Split-K for
memory-constrained MatMuls, searches tile granularities to balance
compute/memory costs, and uses snake traversal for MatMul data reuse.
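The k-dimension half of that granularity search (described in `docs/architecture/data-flow.md` as starting at K_max, the largest `K_full` among the subgraph's MatMuls, and repeatedly halving down to 1) can be sketched as below. The function name and signature are illustrative, not the actual identifiers in the Rust backend:

```rust
// Hypothetical sketch of k-candidate generation for the granularity
// search: start at K_max and halve until reaching 1. Illustrative only;
// names do not match the real scheduler code.
fn k_candidates(k_max: u32) -> Vec<u32> {
    let mut ks = Vec::new();
    let mut k = k_max.max(1);
    while k > 1 {
        ks.push(k);
        k /= 2; // repeated halving
    }
    ks.push(1); // always try the fully split case
    ks
}

fn main() {
    // For K_max = 48 this yields [48, 24, 12, 6, 3, 1].
    println!("{:?}", k_candidates(48));
}
```

Each candidate k is then paired with (w, h) spatial candidates, checked for OOM, and scored with the closed-form latency model.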

---

14 changes: 9 additions & 5 deletions solution/checkpoints/stage-4-validation.md
@@ -6,7 +6,7 @@
- Remaining: ~60 minutes (Stage 5)

## Deliverables
- [x] Track A (Rust) unit tests - 15 tests passing
- [x] Track A (Rust) unit tests - 18 tests passing
- [x] Track B (Python) unit tests - 29 tests passing
- [x] E2E happy path script - 13/13 tests passing
- [x] Both tracks validated against all 5 benchmarks
@@ -43,16 +43,17 @@
|-------|------|-------|--------|
| Track A (Rust) - Latency model examples | `src/main.rs` | 9 | PASS |
| Track A (Rust) - Edge cases + benchmarks | `src/main.rs` | 6 | PASS |
| Track A (Rust) - Mixed-K + split-K + boundary PW | `src/main.rs` | 3 | PASS |
| Track B (Python) - Example 1 (Pointwise chain) | `tests/test_evaluator.py` | 4 | PASS |
| Track B (Python) - Example 2 (Larger tensors) | `tests/test_evaluator.py` | 2 | PASS |
| Track B (Python) - Example 3 (Diamond graph) | `tests/test_evaluator.py` | 5 | PASS |
| Track B (Python) - Example 4 (MatMul tiling) | `tests/test_evaluator.py` | 1 | PASS |
| Track B (Python) - Example 5 (Split-K) | `tests/test_evaluator.py` | 1 | PASS |
| Track B (Python) - Edge cases | `tests/test_evaluator.py` | 11 | PASS |
| Track B (Python) - Benchmark integration | `tests/test_evaluator.py` | 5 | PASS |
| **Total** | | **44** | **PASS** |
| **Total** | | **47** | **PASS** |

### Rust Test Details (15 tests)
### Rust Test Details (18 tests)

| Test | Validates |
|------|-----------|
@@ -71,6 +72,9 @@
| test_edge_fusion_ephemeral_correctness | Tensor 3 ephemeral in fused [0,1] |
| test_edge_cyclic_dag_rejected | Cyclic input returns Err("cycle") |
| test_benchmark_solutions_validity | All 5 benchmarks: full op coverage, valid JSON |
| test_fused_matmul_pointwise_splitk | Fused MatMul+Pointwise with split-K granularity |
| test_fused_matmul_pointwise_splitk_boundary_pw_input | Boundary Pointwise input tensor memory accounting |
| test_mixed_k_two_matmuls | Two MatMuls with different K_full in same subgraph |

### E2E Tests

@@ -100,7 +104,7 @@
| Track A Scheduler | Rust (Cargo, edition 2021) | Compiled and passing |
| Track B Agent | Python 3.12 + google-genai | Imports OK, baseline mode works |
| Track B Evaluator | Pure Python (no external deps) | 29 tests passing |
| Test Runner (Rust) | `cargo test` | 15/15 passing |
| Test Runner (Rust) | `cargo test` | 18/18 passing |
| Test Runner (Python) | pytest 9.0.2 via uv venv | 29/29 passing |

### Benchmark Latency Summary
@@ -145,7 +149,7 @@ No bugs were found requiring fixes. All tests passed on first run after the `inc
## Ready for Next Stage?
- [x] All deliverables complete
- [x] Judge validation passed (5.00/5)
- [x] 44 unit tests passing (15 Rust + 29 Python)
- [x] 47 unit tests passing (18 Rust + 29 Python)
- [x] 13/13 E2E tests passing
- [x] Both tracks validated against all 5 benchmark problems

4 changes: 2 additions & 2 deletions solution/docs/architecture/data-flow.md
@@ -63,7 +63,7 @@ sequenceDiagram
S-->>G: ScheduleState (with split-K)

G->>G: For each subgraph, generate (w, h, k) candidates
Note over G: k candidates: K_cap down to 1 in powers of 2<br/>K_cap = min(K_full across MatMuls in subgraph)
Note over G: k candidates: K_max down to 1 by repeated halving<br/>K_max = max(K_full across MatMuls in subgraph)
G->>M: Check OOM for each candidate
G->>L: Calculate total latency for each valid (w, h, k)
L-->>G: candidate latencies (sum of per-step roofline)
@@ -135,7 +135,7 @@ flowchart LR
S2[2. Greedy Fusion<br>Merge adjacent ops]
S3[3. Retention pass 1<br>Keep tensors resident]
S4[4. Split-K<br>Reduce k for OOM relief]
S5[5. Granularity Search<br>Optimize w,h,k per subgraph<br>k: K_cap...1 in powers of 2]
S5[5. Granularity Search<br>Optimize w,h,k per subgraph<br>k: K_max...1 by halving]
S6[6. Retention pass 2<br>Re-evaluate after granularity changes]
S7[7. Emergency OOM Fix<br>Reduce granularity for any remaining OOM]
S8[8. Final Latency<br>Recalculate all subgraph latencies]
2 changes: 1 addition & 1 deletion solution/docs/architecture/database-schema.md
@@ -122,7 +122,7 @@ fn optimize_*(subgraphs: &mut Vec<SubgraphDef>, problem: &Problem, dag: &DagInfo
| 1 | 9 | 5 | 512x512 (262,144) | 60,000 | 20 |
| 5 | 29 | 19 | 1024x1024 (1,048,576) | 30,000 | 15 |
| 9 | 49 | 32 | 4096x4096 (16,777,216) | 250,000 | 25 |
| 13 | 96 | 63 | 4096x4096 (16,777,216) | 600,000 | 50 |
| 13 | 100 | 63 | 4096x4096 (16,777,216) | 600,000 | 50 |
| 17 | 160 | 103 | 2048x2048 (4,194,304) | 500,000 | 100 |

Note: Tensor sizes can be much larger than fast memory capacity, requiring tiling (spatial granularity < tensor dimensions).
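The arithmetic behind that note is straightforward; a minimal sketch using the benchmark 9 figures from the table above (all values taken from the table, nothing else assumed):

```rust
// Why tiling is mandatory: a full 4096x4096 tensor from benchmark 9
// cannot fit in its fast memory, so the scheduler must pick a spatial
// granularity (w, h) whose per-step slice does fit.
fn main() {
    let full_tensor = 4096u64 * 4096; // 16,777,216 elements
    let fast_mem = 250_000u64;        // benchmark 9 capacity
    assert!(full_tensor > fast_mem);

    // A 128x128 tile, for example, occupies 16,384 elements and fits.
    let tile = 128u64 * 128;
    assert!(tile <= fast_mem);
    println!("full = {full_tensor}, tile = {tile}");
}
```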
25 changes: 22 additions & 3 deletions solution/docs/architecture/security-model.md
@@ -1,6 +1,20 @@
# Security Model

This is a single-user CLI tool that processes local JSON files. There is no network communication, no authentication, no multi-tenancy, and no user-supplied code execution.
This is a single-user CLI tool that processes local JSON files. Track A (Rust
binary) has no network communication, no authentication, no multi-tenancy, and
no user-supplied code execution.

Track B (Python agent) makes HTTPS calls to the Gemini API
(`generativelanguage.googleapis.com`) when a `GOOGLE_API_KEY` environment
variable is set. No problem data or solution JSON is logged or persisted by the
API client beyond the scope of a single agent run. When `GOOGLE_API_KEY` is set
to `dummy` or omitted, the agent runs in local-only mode with zero network
traffic. When `GOOGLE_API_KEY` is valid, Track B transmits the full problem
JSON and current solution JSON to the Gemini API as part of the prompt. No
claims are made about Gemini's data retention policies — this is controlled
by the contest organizers' environment. On API error paths, the agent may
print a preview of the Gemini response (first 500 chars) to stderr for
debugging.

## Threat Model

@@ -9,7 +9,7 @@ This is a single-user CLI tool that processes local JSON files. There is no netw
| Injection attacks (SQL, command) | No | No database, no shell commands |
| Malicious input JSON | Low risk | JSON parser handles untrusted input safely; no `eval()` |
| Denial of service | Not applicable | Single-user local tool |
| Data exfiltration | Not applicable | No network, no secrets |
| Data exfiltration | Track B only | HTTPS to Gemini API; no credentials stored; `GOOGLE_API_KEY` read from environment only |
| Path traversal | Low risk | CLI accepts file paths; use `std::fs::canonicalize()` for safety |

## Input Validation
@@ -21,4 +35,9 @@ This is a single-user CLI tool that processes local JSON files. There is no netw

## No Secrets

This project contains no API keys, passwords, tokens, or credentials. The `.env` file pattern is not applicable.
Track A contains no API keys, passwords, tokens, or credentials. The `.env` file
pattern is not applicable for the Rust binary.

Track B reads `GOOGLE_API_KEY` from the environment at runtime. This key is never
written to disk, never logged, and never embedded in source code. The agent falls
back to local-only mode if the key is absent or set to `dummy`.
29 changes: 9 additions & 20 deletions solution/docs/architecture/system-design.md
@@ -6,7 +6,7 @@ This is a **computational optimization tool**, not a web service. It is a single

## Scale Estimates

- Input size: 2 ops / 3 tensors (trivial) to 96 ops / 160 tensors (benchmark 17)
- Input size: 2 ops / 3 tensors (trivial) to 103 ops / 160 tensors (benchmark 17)
- Runtime target: < 2 seconds per benchmark on a standard developer machine
- No concurrency, no network, no database
- Memory: All data fits easily in RAM (< 1 MB input, < 10 MB working state)
@@ -470,28 +470,17 @@ working_set = sum(slice_size for each boundary input and output tensor that must

### Retained Tensors from Previous Subgraphs

When a previous subgraph retains a tensor, that tensor occupies fast memory at its **full size** (not a slice), because it was computed across all spatial tiles and remains fully materialized.
When a previous subgraph retains a tensor, that tensor occupies fast memory at its **full size** (not a slice), because it was fully materialized by the prior subgraph and remains resident across subgraph boundaries.

Wait -- actually, retained tensors are computed slice-by-slice but the full tensor accumulates. Let me reconsider.

Actually, from Example 3C: Tensor1 (128x128 = 16384) is retained. The working set of subgraph 1 must include this full tensor. The subgraph 1 has Tensor1 as input (already resident), processes Op1 and Op2 producing Tensor3. Working set = Tensor1 (16384, resident) + Tensor2 (ephemeral, 0) + Tensor3 output (16384) = 32768 <= 50000. This works.

But wait -- if the subgraph uses a granularity smaller than the tensor, only a slice of the retained tensor is needed per step. The retained tensor is at full size in fast memory though (it was fully computed by the prior subgraph at its granularity).

Actually, the problem says retained tensors stay in fast memory at full size. The working set calculation must include:
- The **full size** of all currently retained tensors
The working set calculation must include:
- The **full size** of all currently retained tensors from previous subgraphs
- Plus the **slice sizes** of all boundary inputs/outputs needed for the current execution step

Correction: from Example 5B, the accumulator Tensor4 (128x128 = 16384) and Tensor0 (128x128 = 16384) are resident, plus Tensor1 strip (128x32 = 4096) and Tensor2 strip (32x128 = 4096). Working set = 16384 + 16384 + 4096 + 4096 = 40960. That matches.

But Tensor0 is a full input that gets loaded in step 1 and reused. It's NOT a retained tensor from a previous subgraph -- it's loaded in this subgraph. Tensor4 is the accumulator (output). So the working set includes:
- Full-size inputs that are resident (loaded once, reused): full tensor size
- Streamed input strips: slice size
- Output/accumulator: slice size (w * h)
From Example 3C: Tensor1 (128x128 = 16384) is retained. The working set of subgraph 1 includes Tensor1 at full size (16384, already resident), Tensor2 as ephemeral (0), and Tensor3 output slice (16384). Working set = 32768 <= 50000. This confirms that retained tensors count at full size.

This is more nuanced. The working set depends on which step we're computing and the traversal order. The **maximum** working set across all steps must fit.
From Example 5B: the accumulator Tensor4 (128x128 = 16384) and Tensor0 (128x128 = 16384) are resident, plus Tensor1 strip (128x32 = 4096) and Tensor2 strip (32x128 = 4096). Working set = 16384 + 16384 + 4096 + 4096 = 40960. Note that Tensor0 here is a full input loaded and reused within this subgraph (not a cross-subgraph retained tensor), and Tensor4 is the accumulator output.

For the OOM check, we need the **worst-case step** (typically the first step, where the most data is loaded fresh).
For the OOM check, the **worst-case step** (typically the first step, where the most data is loaded fresh) must fit within fast memory capacity.
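The rule above (retained tensors at full size, boundary tensors at slice size) can be sketched as a small helper; the function name and signature are hypothetical, not the scheduler's actual API:

```rust
// Illustrative working-set check: retained tensors from previous
// subgraphs count at FULL size; boundary inputs/outputs of the current
// step count at their slice size. Names are hypothetical.
fn step_working_set(retained_full: &[u64], step_slices: &[u64]) -> u64 {
    let resident: u64 = retained_full.iter().sum();
    let slices: u64 = step_slices.iter().sum();
    resident + slices
}

fn main() {
    // Example 3C from the text: Tensor1 retained at full size (16384),
    // Tensor2 ephemeral (0), Tensor3 output slice (16384).
    let ws = step_working_set(&[16_384], &[0, 16_384]);
    assert_eq!(ws, 32_768);
    assert!(ws <= 50_000); // fits the 50,000 fast-memory budget

    // Example 5B: Tensor4 and Tensor0 resident (16384 each) plus two
    // 128x32 strips (4096 each) => 40960.
    assert_eq!(step_working_set(&[16_384, 16_384], &[4_096, 4_096]), 40_960);
    println!("3C working set = {ws}");
}
```

The OOM check then takes the maximum of this value over all execution steps and compares it against fast memory capacity.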

---

@@ -583,8 +572,8 @@ total_latency = sum(subgraph_latency for each subgraph)
| 1 | 5 | 9 | 512x512 | 60,000 | 20 | Linear chain (MatMul + Pointwise) |
| 5 | 19 | 29 | 128-1024 mixed | 30,000 | 15 | 3x attention heads + aggregation |
| 9 | 32 | 49 | 1024-4096 mixed | 250,000 | 25 | 8x repeating MatMul+PW blocks |
| 13 | 63 | 96 | 128-4096 mixed | 600,000 | 50 | 16x parallel MatMul heads + PW aggregation |
| 17 | 96 | 160 | 128-2048 mixed | 500,000 | 100 | 8x attention + 8x MLP blocks + residual |
| 13 | 63 | 100 | 128-4096 mixed | 600,000 | 50 | 16x parallel MatMul heads + PW aggregation |
| 17 | 103 | 160 | 128-2048 mixed | 500,000 | 100 | 8x attention + 8x MLP blocks + residual |

---

2 changes: 1 addition & 1 deletion solution/docs/architecture/workspace.dsl
@@ -57,7 +57,7 @@ workspace "MLSys DAG Scheduler" "Computational graph scheduler for memory-constr
description "Finds optimal k for MatMul subgraphs under memory pressure"
}
granularityComponent = component "Granularity Search" "Spatial tiling" {
description "Searches (w, h) candidates to minimize per-subgraph latency"
description "Searches (w, h, k) candidates to minimize per-subgraph latency"
}
traversalComponent = component "Traversal Order" "Tile ordering" {
description "Snake/zig-zag traversal to reduce input strip reloads"