Merged
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -52,7 +52,7 @@ jobs:
working-directory: solution/backend/rust
run: cargo build --release

- name: Run unit tests (15 tests)
- name: Run unit tests
working-directory: solution/backend/rust
run: cargo test -- --nocapture

24 changes: 15 additions & 9 deletions solution/CHANGELOG.md
@@ -73,7 +73,7 @@ Format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
- Traversal optimization: snake/zig-zag tile order for MatMul data reuse.

- **Test suite**
- 15 Rust unit tests in `src/main.rs`:
- 18 Rust unit tests in `src/main.rs`:
- Example 1 (baseline pointwise chain): strategies A, B, C
- Example 2 (larger tensors, 256x256): strategies A and B
- Example 3 (diamond graph): spilling baseline and selective retention
@@ -82,6 +82,9 @@ Format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
- Edge cases: single tiny op, OOM detection, serialization round-trip,
ephemeral tensor boundary correctness, cyclic DAG rejection
- All 5 released benchmarks: full pipeline validity check
- New tests (3): `test_fused_matmul_pointwise_splitk`,
`test_fused_matmul_pointwise_splitk_boundary_pw_input`,
`test_mixed_k_two_matmuls`
- E2E script (`solution/scripts/test-e2e.sh`):
- Track A build + 5 benchmark validation
- Track B import verification + 5 benchmark validation (baseline mode)
@@ -105,14 +108,17 @@ Format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
topology (Rust binary + Python agent), user journeys, C4 workspace,
error catalog (Rust error handling), security model.
- `solution/docs/decisions/` — ADR-001 (Rust + Python language selection),
ADR-002 (baseline-first development), ADR-003 (greedy fusion over DP).
ADR-002 (baseline-first development), ADR-003 (greedy fusion over DP),
ADR-004 (k-dimension search in granularity optimization),
ADR-005 (closed-form latency evaluation),
ADR-006 (mixed-K fusion).

- **Benchmark results** (Track A — Rust)

| Benchmark | Ops | Latency |
|-----------|-----|---------|
| mlsys-2026-1 | 5 | 27,443 |
| mlsys-2026-5 | 19 | 27,856 |
| mlsys-2026-9 | 32 | 110,100 |
| mlsys-2026-13 | 63 | 191,693 |
| mlsys-2026-17 | 103 | 23,650 |
| Benchmark | Ops | Latency | Subgraphs |
|-----------|-----|---------|-----------|
| mlsys-2026-1 | 5 | 262,822 | 4 |
| mlsys-2026-5 | 19 | 909,261 | 13 |
| mlsys-2026-9 | 32 | 12,415,140 | 24 |
| mlsys-2026-13 | 63 | 4,707,779 | 25 |
| mlsys-2026-17 | 103 | 814,572 | 81 |
30 changes: 15 additions & 15 deletions solution/README.md
@@ -84,7 +84,7 @@ Solution: 3 subgraphs, total latency = 8234.56
Solution written to /tmp/out.json
```

To run the full unit test suite (15 tests, including all 5 released benchmarks):
To run the full unit test suite (18 tests, including all 5 released benchmarks):

```bash
cargo test
@@ -232,7 +232,7 @@ solution/
│ └── rust/
│ ├── Cargo.toml
│ └── src/
│ ├── main.rs # Entry point + 15 unit tests
│ ├── main.rs # Entry point + 18 unit tests
│ ├── models.rs
│ ├── parser.rs
│ ├── dag.rs
@@ -274,7 +274,7 @@ solution/

## Testing

### Track A — Rust Unit Tests (15 tests)
### Track A — Rust Unit Tests (18 tests)

```bash
cd solution/backend/rust
@@ -336,18 +336,18 @@ Validation checks per output file:
All 5 released benchmarks produce valid solutions within the memory constraint.
Reported latencies are from Track A (Rust) on the local machine.

| Benchmark | Ops | Tensors | Fast Mem | Bandwidth | Track A Latency |
|-----------|-----|---------|----------|-----------|-----------------|
| mlsys-2026-1 | 5 | 9 | 60,000 | 20 | 27,443 |
| mlsys-2026-5 | 19 | 29 | 30,000 | 15 | 27,856 |
| mlsys-2026-9 | 32 | 49 | 250,000 | 25 | 110,100 |
| mlsys-2026-13 | 63 | 100 | 600,000 | 50 | 191,693 |
| mlsys-2026-17 | 103 | 160 | 500,000 | 100 | 23,650 |

All benchmarks complete in under 1 second. The optimizer fuses adjacent
chains, applies Split-K for memory-constrained MatMuls, searches tile
granularities to balance compute/memory costs, and uses snake traversal
for MatMul data reuse.
| Benchmark | Ops | Tensors | Fast Mem | Bandwidth | Track A Latency | Subgraphs |
|-----------|-----|---------|----------|-----------|-----------------|-----------|
| mlsys-2026-1 | 5 | 9 | 60,000 | 20 | 262,822 | 4 |
| mlsys-2026-5 | 19 | 29 | 30,000 | 15 | 909,261 | 13 |
| mlsys-2026-9 | 32 | 49 | 250,000 | 25 | 12,415,140 | 24 |
| mlsys-2026-13 | 63 | 100 | 600,000 | 50 | 4,707,779 | 25 |
| mlsys-2026-17 | 103 | 160 | 500,000 | 100 | 814,572 | 81 |

The optimizer runs in under 1 second per benchmark on a standard dev machine. It uses cost-based
fusion with epsilon tolerance to merge adjacent chains, applies Split-K for
memory-constrained MatMuls, searches tile granularities to balance
compute/memory costs, and uses snake traversal for MatMul data reuse.
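The k-dimension half of that granularity search (described in `docs/architecture/data-flow.md` as starting at K_max, the largest `K_full` among the subgraph's MatMuls, and repeatedly halving down to 1) can be sketched as below. The function name and signature are illustrative, not the actual identifiers in the Rust backend:

```rust
// Hypothetical sketch of k-candidate generation for the granularity
// search: start at K_max and halve until reaching 1. Illustrative only;
// names do not match the real scheduler code.
fn k_candidates(k_max: u32) -> Vec<u32> {
    let mut ks = Vec::new();
    let mut k = k_max.max(1);
    while k > 1 {
        ks.push(k);
        k /= 2; // repeated halving
    }
    ks.push(1); // always try the fully split case
    ks
}

fn main() {
    // For K_max = 48 this yields [48, 24, 12, 6, 3, 1].
    println!("{:?}", k_candidates(48));
}
```

Each candidate k is then paired with (w, h) spatial candidates, checked for OOM, and scored with the closed-form latency model.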

---

14 changes: 9 additions & 5 deletions solution/checkpoints/stage-4-validation.md
@@ -6,7 +6,7 @@
- Remaining: ~60 minutes (Stage 5)

## Deliverables
- [x] Track A (Rust) unit tests - 15 tests passing
- [x] Track A (Rust) unit tests - 18 tests passing
- [x] Track B (Python) unit tests - 29 tests passing
- [x] E2E happy path script - 13/13 tests passing
- [x] Both tracks validated against all 5 benchmarks
@@ -43,16 +43,17 @@
|-------|------|-------|--------|
| Track A (Rust) - Latency model examples | `src/main.rs` | 9 | PASS |
| Track A (Rust) - Edge cases + benchmarks | `src/main.rs` | 6 | PASS |
| Track A (Rust) - Mixed-K + split-K + boundary PW | `src/main.rs` | 3 | PASS |
| Track B (Python) - Example 1 (Pointwise chain) | `tests/test_evaluator.py` | 4 | PASS |
| Track B (Python) - Example 2 (Larger tensors) | `tests/test_evaluator.py` | 2 | PASS |
| Track B (Python) - Example 3 (Diamond graph) | `tests/test_evaluator.py` | 5 | PASS |
| Track B (Python) - Example 4 (MatMul tiling) | `tests/test_evaluator.py` | 1 | PASS |
| Track B (Python) - Example 5 (Split-K) | `tests/test_evaluator.py` | 1 | PASS |
| Track B (Python) - Edge cases | `tests/test_evaluator.py` | 11 | PASS |
| Track B (Python) - Benchmark integration | `tests/test_evaluator.py` | 5 | PASS |
| **Total** | | **44** | **PASS** |
| **Total** | | **47** | **PASS** |

### Rust Test Details (15 tests)
### Rust Test Details (18 tests)

| Test | Validates |
|------|-----------|
@@ -71,6 +72,9 @@
| test_edge_fusion_ephemeral_correctness | Tensor 3 ephemeral in fused [0,1] |
| test_edge_cyclic_dag_rejected | Cyclic input returns Err("cycle") |
| test_benchmark_solutions_validity | All 5 benchmarks: full op coverage, valid JSON |
| test_fused_matmul_pointwise_splitk | Fused MatMul+Pointwise with split-K granularity |
| test_fused_matmul_pointwise_splitk_boundary_pw_input | Boundary Pointwise input tensor memory accounting |
| test_mixed_k_two_matmuls | Two MatMuls with different K_full in same subgraph |

### E2E Tests

@@ -100,7 +104,7 @@
| Track A Scheduler | Rust (Cargo, edition 2021) | Compiled and passing |
| Track B Agent | Python 3.12 + google-genai | Imports OK, baseline mode works |
| Track B Evaluator | Pure Python (no external deps) | 29 tests passing |
| Test Runner (Rust) | `cargo test` | 15/15 passing |
| Test Runner (Rust) | `cargo test` | 18/18 passing |
| Test Runner (Python) | pytest 9.0.2 via uv venv | 29/29 passing |

### Benchmark Latency Summary
@@ -145,7 +149,7 @@ No bugs were found requiring fixes. All tests passed on first run after the `inc
## Ready for Next Stage?
- [x] All deliverables complete
- [x] Judge validation passed (5.00/5)
- [x] 44 unit tests passing (15 Rust + 29 Python)
- [x] 47 unit tests passing (18 Rust + 29 Python)
- [x] 13/13 E2E tests passing
- [x] Both tracks validated against all 5 benchmark problems

4 changes: 2 additions & 2 deletions solution/docs/architecture/data-flow.md
@@ -63,7 +63,7 @@ sequenceDiagram
S-->>G: ScheduleState (with split-K)

G->>G: For each subgraph, generate (w, h, k) candidates
Note over G: k candidates: K_cap down to 1 in powers of 2<br/>K_cap = min(K_full across MatMuls in subgraph)
Note over G: k candidates: K_max down to 1 by repeated halving<br/>K_max = max(K_full across MatMuls in subgraph)
G->>M: Check OOM for each candidate
G->>L: Calculate total latency for each valid (w, h, k)
L-->>G: candidate latencies (sum of per-step roofline)
@@ -135,7 +135,7 @@ flowchart LR
S2[2. Greedy Fusion<br>Merge adjacent ops]
S3[3. Retention pass 1<br>Keep tensors resident]
S4[4. Split-K<br>Reduce k for OOM relief]
S5[5. Granularity Search<br>Optimize w,h,k per subgraph<br>k: K_cap...1 in powers of 2]
S5[5. Granularity Search<br>Optimize w,h,k per subgraph<br>k: K_max...1 by halving]
S6[6. Retention pass 2<br>Re-evaluate after granularity changes]
S7[7. Emergency OOM Fix<br>Reduce granularity for any remaining OOM]
S8[8. Final Latency<br>Recalculate all subgraph latencies]
2 changes: 1 addition & 1 deletion solution/docs/architecture/database-schema.md
@@ -122,7 +122,7 @@ fn optimize_*(subgraphs: &mut Vec<SubgraphDef>, problem: &Problem, dag: &DagInfo
| 1 | 9 | 5 | 512x512 (262,144) | 60,000 | 20 |
| 5 | 29 | 19 | 1024x1024 (1,048,576) | 30,000 | 15 |
| 9 | 49 | 32 | 4096x4096 (16,777,216) | 250,000 | 25 |
| 13 | 96 | 63 | 4096x4096 (16,777,216) | 600,000 | 50 |
| 13 | 100 | 63 | 4096x4096 (16,777,216) | 600,000 | 50 |
| 17 | 160 | 103 | 2048x2048 (4,194,304) | 500,000 | 100 |

Note: Tensor sizes can be much larger than fast memory capacity, requiring tiling (spatial granularity < tensor dimensions).
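The arithmetic behind that note is straightforward; a minimal sketch using the benchmark 9 figures from the table above (all values taken from the table, nothing else assumed):

```rust
// Why tiling is mandatory: a full 4096x4096 tensor from benchmark 9
// cannot fit in its fast memory, so the scheduler must pick a spatial
// granularity (w, h) whose per-step slice does fit.
fn main() {
    let full_tensor = 4096u64 * 4096; // 16,777,216 elements
    let fast_mem = 250_000u64;        // benchmark 9 capacity
    assert!(full_tensor > fast_mem);

    // A 128x128 tile, for example, occupies 16,384 elements and fits.
    let tile = 128u64 * 128;
    assert!(tile <= fast_mem);
    println!("full = {full_tensor}, tile = {tile}");
}
```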
25 changes: 22 additions & 3 deletions solution/docs/architecture/security-model.md
@@ -1,6 +1,20 @@
# Security Model

This is a single-user CLI tool that processes local JSON files. There is no network communication, no authentication, no multi-tenancy, and no user-supplied code execution.
This is a single-user CLI tool that processes local JSON files. Track A (Rust
binary) has no network communication, no authentication, no multi-tenancy, and
no user-supplied code execution.

Track B (Python agent) makes HTTPS calls to the Gemini API
(`generativelanguage.googleapis.com`) when a `GOOGLE_API_KEY` environment
variable is set. No problem data or solution JSON is logged or persisted by the
API client beyond the scope of a single agent run. When `GOOGLE_API_KEY` is set
to `dummy` or omitted, the agent runs in local-only mode with zero network
traffic. When `GOOGLE_API_KEY` is valid, Track B transmits the full problem
JSON and current solution JSON to the Gemini API as part of the prompt. No
claims are made about Gemini's data retention policies — this is controlled
by the contest organizers' environment. On API error paths, the agent may
print a preview of the Gemini response (first 500 chars) to stderr for
debugging.

## Threat Model

@@ -9,7 +9,7 @@ This is a single-user CLI tool that processes local JSON files. There is no netw
| Injection attacks (SQL, command) | No | No database, no shell commands |
| Malicious input JSON | Low risk | JSON parser handles untrusted input safely; no `eval()` |
| Denial of service | Not applicable | Single-user local tool |
| Data exfiltration | Not applicable | No network, no secrets |
| Data exfiltration | Track B only | HTTPS to Gemini API; no credentials stored; `GOOGLE_API_KEY` read from environment only |
| Path traversal | Low risk | CLI accepts file paths; use `std::fs::canonicalize()` for safety |

## Input Validation
@@ -21,4 +35,9 @@ This is a single-user CLI tool that processes local JSON files. There is no netw

## No Secrets

This project contains no API keys, passwords, tokens, or credentials. The `.env` file pattern is not applicable.
Track A contains no API keys, passwords, tokens, or credentials. The `.env` file
pattern is not applicable for the Rust binary.

Track B reads `GOOGLE_API_KEY` from the environment at runtime. This key is never
written to disk, never logged, and never embedded in source code. The agent falls
back to local-only mode if the key is absent or set to `dummy`.
29 changes: 9 additions & 20 deletions solution/docs/architecture/system-design.md
@@ -6,7 +6,7 @@ This is a **computational optimization tool**, not a web service. It is a single

## Scale Estimates

- Input size: 2 ops / 3 tensors (trivial) to 96 ops / 160 tensors (benchmark 17)
- Input size: 2 ops / 3 tensors (trivial) to 103 ops / 160 tensors (benchmark 17)
- Runtime target: < 2 seconds per benchmark on a standard developer machine
- No concurrency, no network, no database
- Memory: All data fits easily in RAM (< 1 MB input, < 10 MB working state)
@@ -470,28 +470,17 @@ working_set = sum(slice_size for each boundary input and output tensor that must

### Retained Tensors from Previous Subgraphs

When a previous subgraph retains a tensor, that tensor occupies fast memory at its **full size** (not a slice), because it was computed across all spatial tiles and remains fully materialized.
When a previous subgraph retains a tensor, that tensor occupies fast memory at its **full size** (not a slice), because it was fully materialized by the prior subgraph and remains resident across subgraph boundaries.

Wait -- actually, retained tensors are computed slice-by-slice but the full tensor accumulates. Let me reconsider.

Actually, from Example 3C: Tensor1 (128x128 = 16384) is retained. The working set of subgraph 1 must include this full tensor. The subgraph 1 has Tensor1 as input (already resident), processes Op1 and Op2 producing Tensor3. Working set = Tensor1 (16384, resident) + Tensor2 (ephemeral, 0) + Tensor3 output (16384) = 32768 <= 50000. This works.

But wait -- if the subgraph uses a granularity smaller than the tensor, only a slice of the retained tensor is needed per step. The retained tensor is at full size in fast memory though (it was fully computed by the prior subgraph at its granularity).

Actually, the problem says retained tensors stay in fast memory at full size. The working set calculation must include:
- The **full size** of all currently retained tensors
The working set calculation must include:
- The **full size** of all currently retained tensors from previous subgraphs
- Plus the **slice sizes** of all boundary inputs/outputs needed for the current execution step

Correction: from Example 5B, the accumulator Tensor4 (128x128 = 16384) and Tensor0 (128x128 = 16384) are resident, plus Tensor1 strip (128x32 = 4096) and Tensor2 strip (32x128 = 4096). Working set = 16384 + 16384 + 4096 + 4096 = 40960. That matches.

But Tensor0 is a full input that gets loaded in step 1 and reused. It's NOT a retained tensor from a previous subgraph -- it's loaded in this subgraph. Tensor4 is the accumulator (output). So the working set includes:
- Full-size inputs that are resident (loaded once, reused): full tensor size
- Streamed input strips: slice size
- Output/accumulator: slice size (w * h)
From Example 3C: Tensor1 (128x128 = 16384) is retained. The working set of subgraph 1 includes Tensor1 at full size (16384, already resident), Tensor2 as ephemeral (0), and Tensor3 output slice (16384). Working set = 32768 <= 50000. This confirms that retained tensors count at full size.

This is more nuanced. The working set depends on which step we're computing and the traversal order. The **maximum** working set across all steps must fit.
From Example 5B: the accumulator Tensor4 (128x128 = 16384) and Tensor0 (128x128 = 16384) are resident, plus Tensor1 strip (128x32 = 4096) and Tensor2 strip (32x128 = 4096). Working set = 16384 + 16384 + 4096 + 4096 = 40960. Note that Tensor0 here is a full input loaded and reused within this subgraph (not a cross-subgraph retained tensor), and Tensor4 is the accumulator output.

For the OOM check, we need the **worst-case step** (typically the first step, where the most data is loaded fresh).
For the OOM check, the **worst-case step** (typically the first step, where the most data is loaded fresh) must fit within fast memory capacity.
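The rule above (retained tensors at full size, boundary tensors at slice size) can be sketched as a small helper; the function name and signature are hypothetical, not the scheduler's actual API:

```rust
// Illustrative working-set check: retained tensors from previous
// subgraphs count at FULL size; boundary inputs/outputs of the current
// step count at their slice size. Names are hypothetical.
fn step_working_set(retained_full: &[u64], step_slices: &[u64]) -> u64 {
    let resident: u64 = retained_full.iter().sum();
    let slices: u64 = step_slices.iter().sum();
    resident + slices
}

fn main() {
    // Example 3C from the text: Tensor1 retained at full size (16384),
    // Tensor2 ephemeral (0), Tensor3 output slice (16384).
    let ws = step_working_set(&[16_384], &[0, 16_384]);
    assert_eq!(ws, 32_768);
    assert!(ws <= 50_000); // fits the 50,000 fast-memory budget

    // Example 5B: Tensor4 and Tensor0 resident (16384 each) plus two
    // 128x32 strips (4096 each) => 40960.
    assert_eq!(step_working_set(&[16_384, 16_384], &[4_096, 4_096]), 40_960);
    println!("3C working set = {ws}");
}
```

The OOM check then takes the maximum of this value over all execution steps and compares it against fast memory capacity.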

---

@@ -583,8 +572,8 @@ total_latency = sum(subgraph_latency for each subgraph)
| 1 | 5 | 9 | 512x512 | 60,000 | 20 | Linear chain (MatMul + Pointwise) |
| 5 | 19 | 29 | 128-1024 mixed | 30,000 | 15 | 3x attention heads + aggregation |
| 9 | 32 | 49 | 1024-4096 mixed | 250,000 | 25 | 8x repeating MatMul+PW blocks |
| 13 | 63 | 96 | 128-4096 mixed | 600,000 | 50 | 16x parallel MatMul heads + PW aggregation |
| 17 | 96 | 160 | 128-2048 mixed | 500,000 | 100 | 8x attention + 8x MLP blocks + residual |
| 13 | 63 | 100 | 128-4096 mixed | 600,000 | 50 | 16x parallel MatMul heads + PW aggregation |
| 17 | 103 | 160 | 128-2048 mixed | 500,000 | 100 | 8x attention + 8x MLP blocks + residual |

---

2 changes: 1 addition & 1 deletion solution/docs/architecture/workspace.dsl
@@ -57,7 +57,7 @@ workspace "MLSys DAG Scheduler" "Computational graph scheduler for memory-constr
description "Finds optimal k for MatMul subgraphs under memory pressure"
}
granularityComponent = component "Granularity Search" "Spatial tiling" {
description "Searches (w, h) candidates to minimize per-subgraph latency"
description "Searches (w, h, k) candidates to minimize per-subgraph latency"
}
traversalComponent = component "Traversal Order" "Tile ordering" {
description "Snake/zig-zag traversal to reduce input strip reloads"