feat: allow mixed-K fusion (remove K_full consistency constraint)#23
feat: allow mixed-K fusion (remove K_full consistency constraint)#23melroyanthony merged 6 commits intomainfrom
Conversation
Architecture: - system-design.md: mixed-K execution model with per-op active steps, updated compute/memory formulas, K_max replaces K_cap - ADR-006: documents why K_full consistency was overly restrictive, the correct per-op k-step model, and 30% expected improvement Requirements: - F-05: allows mixed-K fusion with num_k_steps = ceil(max(K_full)/k) - F-06: K_max for k search range - New acceptance criteria for mixed-K support - Risk register: mixed-K latency model correctness Refs #22
Rust + Python changes: - compute_num_k_steps: max(K_full) instead of min(K_full) - find_k_full/find_split_k: max(K_full) for search range - Removed K_full consistency check from fusion (both tracks) - Evaluator: k <= max(K_full) instead of K_full consistency - Latency model: mixed-K phase-based computation for split-K (each MatMul active for ceil(its_K_full/k) steps only) - Uniform-K case uses existing formula (no regression) 16/16 tests pass. Benchmark latencies unchanged because cost-based fusion correctly determines that mixed-K fused latency is not better than split for these benchmarks' tensor dimensions. Closes #22
There was a problem hiding this comment.
Pull request overview
This PR removes the K_full consistency constraint that prevented fusing MatMul ops with different reduction dimensions into the same subgraph. The change spans both the Rust backend and Python evaluator, updating the latency model, fusion logic, granularity search, validation, and documentation.
Changes:
- Replaced
min(K_full)withmax(K_full)across all k-step computations (Rust latency, memory, granularity; Python evaluator), and removed the K_full consistency check from fusion and validation - Added a mixed-K phased latency computation path in both Rust (
latency.rs) and Python (evaluator.py) that computes per-phase costs when MatMuls have different K_full values - Added ADR-006 and updated system-design docs and MVP scope to document the mixed-K execution model
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| solution/backend/rust/src/latency.rs | Changed min → max for K_full; added mixed-K phased latency path |
| solution/backend/rust/src/evaluate.rs | Removed K_full consistency validation; k <= max(K_full) check |
| solution/backend/rust/src/optimizer/fusion.rs | Removed K_full consistency check from merge eligibility |
| solution/backend/rust/src/optimizer/granularity.rs | Changed find_k_full to use max() instead of min() |
| solution/backend/rust/src/memory.rs | Changed find_split_k to search from max K_full downward |
| solution/agent/evaluator.py | Changed min → max for K_full; added mixed-K phased latency path; updated validation |
| solution/agent/scheduler.py | Removed K_full consistency check from fusion |
| solution/docs/decisions/ADR-006-mixed-k-fusion.md | New ADR documenting the mixed-K fusion decision |
| solution/docs/architecture/system-design.md | Updated system design docs for mixed-K model |
| solution/requirements/mvp-scope.md | Updated requirements and acceptance criteria for mixed-K |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
Rust + Python: - Per-op k_strip: track actual tensor-dimension-based k_strip per MatMul instead of compute-ratio proxy (correct for mixed input dimensions) - O(1) per phase: replaced step loop with first/interior/last arithmetic matching documented O(distinct_K) complexity - Simulation path: skip k_strip loads for inactive MatMuls (finished their reduction) using tensor-to-active-steps mapping Refs #22
There was a problem hiding this comment.
Pull request overview
This PR removes the K_full consistency constraint that prevented fusing MatMul ops with different reduction dimensions into the same subgraph. The change touches both the Rust optimizer and the Python evaluator to support "mixed-K" execution where each MatMul independently runs ceil(K_full_op / k) k-steps, with the subgraph running for ceil(max(K_full) / k) total steps. A phased latency computation handles the varying active-op sets efficiently in O(distinct_K) per candidate.
Changes:
- Changed
min(K_full)tomax(K_full)for num_k_steps, k search range, and validation across Rust and Python - Added mixed-K phased latency computation in both
latency.rsandevaluator.pywith O(1)-per-phase closed-form arithmetic, while retaining the uniform-K fast path - Removed K_full consistency checks from fusion (Rust + Python) and evaluator validation, and added ADR-006 documenting the decision
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| solution/backend/rust/src/latency.rs | Changed num_k_steps from min→max K_full; added mixed-K phased latency with per-op k_strip tracking |
| solution/backend/rust/src/evaluate.rs | Relaxed validation to check k <= max(K_full) instead of K_full equality |
| solution/backend/rust/src/optimizer/fusion.rs | Removed K_full consistency check from merge eligibility |
| solution/backend/rust/src/optimizer/granularity.rs | Changed find_k_full from min to max |
| solution/backend/rust/src/memory.rs | Changed split-K search bound from min to max K_full |
| solution/agent/evaluator.py | Python evaluator: min→max K_full, mixed-K phased latency, simulation path active-op filtering |
| solution/agent/scheduler.py | Removed K_full consistency check from Python fusion |
| solution/docs/decisions/ADR-006-mixed-k-fusion.md | New ADR documenting mixed-K fusion decision |
| solution/docs/architecture/system-design.md | Updated design docs with mixed-K execution model |
| solution/requirements/mvp-scope.md | Updated requirements and acceptance criteria for mixed-K |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
- Rust + Python: compute uses min(k, K_full_op) / K_full_op to prevent base_cost scaling above 1.0 when k > K_full_op in mixed-K subgraphs - Applies to matmul_compute_per_step, mixed-K active_compute, and simulation path active_matmul_compute - Removed dead total_k_steps variable from Python mixed-K path Refs #22
There was a problem hiding this comment.
Pull request overview
This PR removes the K_full consistency constraint from the fusion stage, allowing MatMul ops with different reduction dimensions (K_full values) to coexist in the same subgraph. The subgraph now runs for ceil(max(K_full) / k) total k-steps, with each MatMul active only for its own ceil(K_full_op / k) steps. The change is applied consistently across the Rust optimizer, Python evaluator, and documentation.
Changes:
- Replaced
min(K_full)withmax(K_full)fornum_k_stepscomputation, granularity search bounds, and split-K search in both Rust and Python - Removed K_full consistency checks from fusion (Rust + Python) and evaluator validation; validation now only checks
k <= max(K_full) - Added mixed-K phase-based latency computation with O(distinct_K) closed-form evaluation per candidate, with a fast path for uniform-K subgraphs
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| solution/backend/rust/src/latency.rs | Core mixed-K latency model: phase-based computation, min(k, K_full_op) clamping, max-based num_k_steps |
| solution/backend/rust/src/evaluate.rs | Removed K_full consistency validation; now only checks k <= max(K_full) |
| solution/backend/rust/src/optimizer/fusion.rs | Removed K_full consistency check from fusion eligibility |
| solution/backend/rust/src/optimizer/granularity.rs | Changed find_k_full from min to max |
| solution/backend/rust/src/memory.rs | Changed split-K search bound from min to max K_full |
| solution/agent/evaluator.py | Python mirror: mixed-K latency, phase-based evaluation, validation relaxation |
| solution/agent/scheduler.py | Removed K_full consistency check from Python fusion |
| solution/docs/decisions/ADR-006-mixed-k-fusion.md | ADR documenting the mixed-K decision and execution model |
| solution/docs/architecture/system-design.md | Updated latency model docs for mixed-K semantics |
| solution/requirements/mvp-scope.md | Updated acceptance criteria and feature descriptions for mixed-K |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
Simulation path k_strip loads now use per-tensor k_eff = min(k, K_full_op) via k_strip_tensor_k_eff mapping. Prevents loading more elements than the tensor's actual reduction dimension when k > K_full_op. Refs #22
There was a problem hiding this comment.
Pull request overview
This PR removes the K_full consistency constraint from the fusion stage, allowing MatMul ops with different reduction dimensions (K_full values) to be fused into the same subgraph. The change implements a mixed-K execution model where each MatMul runs for ceil(K_full_op / k) steps and the subgraph runs for ceil(max(K_full) / k) total steps, with inactive MatMuls contributing zero compute and memory on later steps.
Changes:
- Changed
num_k_stepsfromceil(min(K_full) / k)toceil(max(K_full) / k)and removed K_full consistency checks from fusion (Rust + Python) - Added phase-based closed-form latency computation for mixed-K subgraphs in both Rust and Python, with a uniform-K fast path for regression safety
- Added ADR-006 and updated system design documentation to describe the mixed-K execution model
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| solution/backend/rust/src/latency.rs | Changed min → max for K_full, added min(k, K_full_op) clamping to compute, and implemented phase-based mixed-K latency path |
| solution/backend/rust/src/optimizer/fusion.rs | Removed K_full consistency check from merge eligibility |
| solution/backend/rust/src/optimizer/granularity.rs | Changed find_k_full from min to max |
| solution/backend/rust/src/memory.rs | Changed find_split_k to use max K_full as upper search bound |
| solution/backend/rust/src/evaluate.rs | Relaxed validation: k <= max(K_full) instead of K_full equality |
| solution/agent/scheduler.py | Removed K_full consistency check from Python fusion |
| solution/agent/evaluator.py | Changed to max K_full, added mixed-K phase-based latency path with k_eff clamping, updated validation |
| solution/docs/decisions/ADR-006-mixed-k-fusion.md | New ADR documenting the mixed-K fusion decision |
| solution/docs/architecture/system-design.md | Updated latency model, granularity search, and performance docs for mixed-K |
| solution/requirements/mvp-scope.md | Updated acceptance criteria and feature descriptions for mixed-K |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
…-K test - Rust build_memory_plan: k_strip sizes use min(k, K_full_op) via k_eff - Python comments: updated to show k_eff instead of raw k - Added test_mixed_k_two_matmuls: verifies phase-based latency with K_full=64 and K_full=128 at various k values, checks compute_num_k_steps 17 tests passing. Refs #22
There was a problem hiding this comment.
Pull request overview
This PR removes the K_full consistency constraint that prevented fusing MatMuls with different reduction dimensions into the same subgraph. The problem statement allows per-op k-steps, so this constraint was overly restrictive. The cost-based fusion criterion remains as the gatekeeper for merge decisions.
Changes:
- Changed
min(K_full)tomax(K_full)across all components (granularity search, split-K search, num_k_steps, evaluator validation) and removed K_full equality checks from fusion (Rust + Python) - Added mixed-K phased latency computation in both Rust and Python: when MatMuls have different K_full values, steps are grouped into phases by active op set, with O(1) computation per phase
- Added
test_mixed_k_two_matmulsunit test and comprehensive documentation (ADR-006, system-design.md updates, mvp-scope.md updates)
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| solution/backend/rust/src/latency.rs | Core change: min→max for num_k_steps, k_eff = min(k, K_full_op) clamping in build_memory_plan and compute, mixed-K phased latency path |
| solution/backend/rust/src/optimizer/fusion.rs | Removed K_full consistency check from merge eligibility |
| solution/backend/rust/src/optimizer/granularity.rs | min→max for K_full search bound |
| solution/backend/rust/src/memory.rs | min→max for split-K search upper bound |
| solution/backend/rust/src/evaluate.rs | Replaced K_full consistency validation with k <= max(K_full) check |
| solution/backend/rust/src/main.rs | Added test_mixed_k_two_matmuls test with K_full=64/128 |
| solution/agent/evaluator.py | Python mirror of all Rust changes: max K_full, mixed-K phased latency, k_eff clamping, per-op k_strip tracking |
| solution/agent/scheduler.py | Removed K_full consistency check from Python fusion |
| solution/docs/decisions/ADR-006-mixed-k-fusion.md | ADR documenting the decision and execution model |
| solution/docs/architecture/system-design.md | Updated latency model docs for mixed-K semantics |
| solution/requirements/mvp-scope.md | Updated acceptance criteria and feature descriptions |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
Summary
Removes the K_full consistency constraint that prevented fusing MatMuls with different reduction dimensions. The problem statement allows per-op k-steps — each MatMul does ceil(its_K_full / k) steps independently.
Closes #22
Changes
compute_num_k_steps: max(K_full) instead of min(K_full)Results
Benchmark latencies are unchanged — the cost-based fusion correctly determines that mixed-K fused latency is not beneficial for these benchmarks' tensor dimensions. The mixed-K model is now correct and ready for future benchmarks where it would help.
All 16 unit tests pass. All 5 benchmarks complete in < 1s.
Test plan