LLM360 · mvillmow · Apr 29, 2026
diff --git a/docs/k2-aic-exhaustive-study/README.md b/docs/k2-aic-exhaustive-study/README.md
@@ -0,0 +1,9 @@
+# K2 Failure Boundary Study
+
+This slice of epic `#138` checks in a derived study of the no-result region in the K2 exhaustive AIConfigurator sweep.
+
+- Narrative: [sections/138-failure-boundary.md](sections/138-failure-boundary.md)
+- Tables: [data/138_failure_boundary_overview.tsv](data/138_failure_boundary_overview.tsv), [data/138_failure_boundary_profile_summary.tsv](data/138_failure_boundary_profile_summary.tsv), [data/138_failure_boundary_context_node_summary.tsv](data/138_failure_boundary_context_node_summary.tsv)
+- Figures: [figures/138_failure_boundary_mix_heatmap.png](figures/138_failure_boundary_mix_heatmap.png), [figures/138_disagg_unsupported_ratio_heatmap.png](figures/138_disagg_unsupported_ratio_heatmap.png), [figures/138_failure_unsupported_ratio_pointcloud.png](figures/138_failure_unsupported_ratio_pointcloud.png)
+
+The raw source catalogs remain outside GitHub. This subtree contains only safe derived tables, figures, and prose. Reproducibility scripts are intentionally excluded here and are tracked separately in RL360 issue `#182`.
diff --git a/docs/k2-aic-exhaustive-study/data/138_failure_boundary_cells.tsv b/docs/k2-aic-exhaustive-study/data/138_failure_boundary_cells.tsv
diff --git a/docs/k2-aic-exhaustive-study/data/138_failure_boundary_context_node_summary.tsv b/docs/k2-aic-exhaustive-study/data/138_failure_boundary_context_node_summary.tsv
diff --git a/docs/k2-aic-exhaustive-study/data/138_failure_boundary_context_summary.tsv b/docs/k2-aic-exhaustive-study/data/138_failure_boundary_context_summary.tsv
@@ -0,0 +1,18 @@
+profile_context_budget	context_budget_label	total_cells	supported_both_cells	unsupported_disagg_cells	failed_both_cells	supported_both_ratio	unsupported_disagg_ratio	failed_both_ratio
+1024	1K	255	240	15	0	0.9412	0.0588	0.0000
+1536	1.5K	255	240	15	0	0.9412	0.0588	0.0000
+2048	2K	255	240	15	0	0.9412	0.0588	0.0000
+2560	2.5K	255	240	15	0	0.9412	0.0588	0.0000
+4096	4K	255	240	15	0	0.9412	0.0588	0.0000
+4500	4500	255	240	15	0	0.9412	0.0588	0.0000
+8192	8K	510	480	30	0	0.9412	0.0588	0.0000
+16384	16K	255	240	15	0	0.9412	0.0588	0.0000
+32768	32K	255	240	15	0	0.9412	0.0588	0.0000
+40960	40K	255	240	15	0	0.9412	0.0588	0.0000
+65536	64K	255	240	15	0	0.9412	0.0588	0.0000
+73728	72K	255	240	15	0	0.9412	0.0588	0.0000
+131072	128K	510	480	30	0	0.9412	0.0588	0.0000
+131584	128.5K	255	240	15	0	0.9412	0.0588	0.0000
+132608	129.5K	255	240	15	0	0.9412	0.0588	0.0000
+138240	135K	255	240	15	0	0.9412	0.0588	0.0000
+262144	256K	510	240	15	255	0.4706	0.0294	0.5000
diff --git a/docs/k2-aic-exhaustive-study/data/138_failure_boundary_node_summary.tsv b/docs/k2-aic-exhaustive-study/data/138_failure_boundary_node_summary.tsv
@@ -0,0 +1,18 @@
+node_count	total_cells	supported_both_cells	unsupported_disagg_cells	failed_both_cells	supported_both_ratio	unsupported_disagg_ratio	failed_both_ratio
+1	300	0	285	15	0.0000	0.9500	0.0500
+2	300	285	0	15	0.9500	0.0000	0.0500
+4	300	285	0	15	0.9500	0.0000	0.0500
+8	300	285	0	15	0.9500	0.0000	0.0500
+16	300	285	0	15	0.9500	0.0000	0.0500
+32	300	285	0	15	0.9500	0.0000	0.0500
+64	300	285	0	15	0.9500	0.0000	0.0500
+128	300	285	0	15	0.9500	0.0000	0.0500
+144	300	285	0	15	0.9500	0.0000	0.0500
+160	300	285	0	15	0.9500	0.0000	0.0500
+176	300	285	0	15	0.9500	0.0000	0.0500
+192	300	285	0	15	0.9500	0.0000	0.0500
+208	300	285	0	15	0.9500	0.0000	0.0500
+224	300	285	0	15	0.9500	0.0000	0.0500
+240	300	285	0	15	0.9500	0.0000	0.0500
+256	300	285	0	15	0.9500	0.0000	0.0500
+512	300	285	0	15	0.9500	0.0000	0.0500
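The per-node rows above can be cross-checked against the overview totals with simple grid arithmetic. The sketch below assumes the sweep is a full grid of 17 node counts by 20 workload profiles, and infers 15 batch cells per profile and node from `300 / 20`; that batch-cell count is not stated directly in the tables, and the variable names are illustrative.

```python
# Node counts listed in 138_failure_boundary_node_summary.tsv (17 rows).
node_counts = [1, 2, 4, 8, 16, 32, 64, 128, 144, 160,
               176, 192, 208, 224, 240, 256, 512]

profiles = 20          # rows in 138_failure_boundary_profile_summary.tsv
cells_per_node = 300   # total_cells column, identical for every node count

# Assumption: a full grid, so each profile contributes the same number of
# batch cells at each node count.
batch_cells = cells_per_node // profiles

total_cells = len(node_counts) * cells_per_node
# stress_256k fails at every node count.
failed_both = len(node_counts) * batch_cells
# Every other profile is unsupported for disagg at node count 1 only.
unsupported_disagg = (profiles - 1) * batch_cells
supported_both = total_cells - failed_both - unsupported_disagg

print(batch_cells, total_cells, supported_both, unsupported_disagg, failed_both)
```

Under those assumptions the derived counts match the `4560 / 285 / 255` split reported in the overview table.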
diff --git a/docs/k2-aic-exhaustive-study/data/138_failure_boundary_overview.tsv b/docs/k2-aic-exhaustive-study/data/138_failure_boundary_overview.tsv
@@ -0,0 +1,10 @@
+metric	value	notes
+source_catalog_count	2	Merged safe derived catalogs from the additive sweep and intermediate-node backfill.
+unique_topology_rows	10200	Each topology row is one workload-profile, node-count, and batch cell for agg or disagg.
+unique_boundary_cells	5100	Each boundary cell collapses agg and disagg into one workload-profile, node-count, and batch point.
+supported_both_cells	4560	Both agg and disagg emitted candidates.
+unsupported_disagg_cells	285	Agg succeeded but disagg emitted no CSV on single-node runs.
+failed_both_cells	255	Both topologies land in the non-wideep SGLang TP>1 and DP>1 guard path.
+supported_both_ratio	0.8941	Share of merged boundary cells that are fully supported.
+unsupported_disagg_ratio	0.0559	Share of merged boundary cells in the single-node disagg-only unsupported region.
+failed_both_ratio	0.0500	Share of merged boundary cells in the hard failure region.
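The three ratio rows follow directly from the three count rows. A minimal sketch of that arithmetic, with the counts copied by hand from the table rather than parsed from the TSV (the `counts` dictionary and the `ratios` helper are illustrative names, not part of the study's tooling):

```python
# Counts copied from 138_failure_boundary_overview.tsv.
counts = {
    "supported_both": 4560,
    "unsupported_disagg": 285,
    "failed_both": 255,
}


def ratios(cell_counts):
    """Return each class's share of all cells, rounded to 4 places like the TSV."""
    total = sum(cell_counts.values())
    return {name: round(n / total, 4) for name, n in cell_counts.items()}


total_cells = sum(counts.values())
shares = ratios(counts)
print(total_cells, shares)
```

The three classes are mutually exclusive, so their counts sum to `unique_boundary_cells` and their shares sum to 1 up to rounding.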
diff --git a/docs/k2-aic-exhaustive-study/data/138_failure_boundary_profile_summary.tsv b/docs/k2-aic-exhaustive-study/data/138_failure_boundary_profile_summary.tsv
@@ -0,0 +1,21 @@
+workload_profile	profile_family	profile_tier	summary_label	profile_isl	profile_osl	profile_context_budget	context_budget_label	total_cells	supported_both_cells	unsupported_disagg_cells	failed_both_cells	supported_both_ratio	unsupported_disagg_ratio	failed_both_ratio
+balanced_1k	balanced_pow2	balanced_pow2	Balanced 1K	512	512	1024	1K	255	240	15	0	0.9412	0.0588	0.0000
+observed_low_0p5k_1k	issue_preserved	observed_practical	0.5K / 1K	512	1024	1536	1.5K	255	240	15	0	0.9412	0.0588	0.0000
+balanced_2k	balanced_pow2	balanced_pow2	Balanced 2K	1024	1024	2048	2K	255	240	15	0	0.9412	0.0588	0.0000
+observed_mean_1p5k_1k	issue_preserved	observed_practical	1.5K / 1K	1536	1024	2560	2.5K	255	240	15	0	0.9412	0.0588	0.0000
+balanced_4k	balanced_pow2	balanced_pow2	Balanced 4K	2048	2048	4096	4K	255	240	15	0	0.9412	0.0588	0.0000
+chat_4k_500	issue_preserved	legacy_chat	Chat 4K / 500	4000	500	4500	4500	255	240	15	0	0.9412	0.0588	0.0000
+balanced_8k	balanced_pow2	balanced_pow2	Balanced 8K	4096	4096	8192	8K	255	240	15	0	0.9412	0.0588	0.0000
+observed_high_7k_1k	issue_preserved	observed_practical	7K / 1K	7168	1024	8192	8K	255	240	15	0	0.9412	0.0588	0.0000
+balanced_16k	balanced_pow2	balanced_pow2	Balanced 16K	8192	8192	16384	16K	255	240	15	0	0.9412	0.0588	0.0000
+balanced_32k	balanced_pow2	balanced_pow2	Balanced 32K	16384	16384	32768	32K	255	240	15	0	0.9412	0.0588	0.0000
+practical_32k_8k	issue_preserved	practical_rl	32K / 8K	32768	8192	40960	40K	255	240	15	0	0.9412	0.0588	0.0000
+balanced_64k	balanced_pow2	balanced_pow2	Balanced 64K	32768	32768	65536	64K	255	240	15	0	0.9412	0.0588	0.0000
+practical_64k_8k	issue_preserved	practical_rl	64K / 8K	65536	8192	73728	72K	255	240	15	0	0.9412	0.0588	0.0000
+balanced_128k	balanced_pow2	balanced_pow2	Balanced 128K	65536	65536	131072	128K	255	240	15	0	0.9412	0.0588	0.0000
+stress_128k	issue_preserved	stress_rl	128K profile	114688	16384	131072	128K	255	240	15	0	0.9412	0.0588	0.0000
+observed_low_0p5k_128k	issue_preserved	observed_stress	0.5K / 128K	512	131072	131584	128.5K	255	240	15	0	0.9412	0.0588	0.0000
+observed_mean_1p5k_128k	issue_preserved	observed_stress	1.5K / 128K	1536	131072	132608	129.5K	255	240	15	0	0.9412	0.0588	0.0000
+observed_high_7k_128k	issue_preserved	observed_stress	7K / 128K	7168	131072	138240	135K	255	240	15	0	0.9412	0.0588	0.0000
+balanced_256k	balanced_pow2	balanced_pow2	Balanced 256K	131072	131072	262144	256K	255	240	15	0	0.9412	0.0588	0.0000
+stress_256k	issue_preserved	stress_rl	256K profile	229376	32768	262144	256K	255	0	0	255	0.0000	0.0000	1.0000
diff --git a/docs/k2-aic-exhaustive-study/data/138_failure_boundary_topology_summary.tsv b/docs/k2-aic-exhaustive-study/data/138_failure_boundary_topology_summary.tsv
@@ -0,0 +1,3 @@
+topology	total_points	candidate_points	no_results_points	candidate_ratio	no_results_ratio	supported_points	unsupported_single_node_disagg_points	failed_nonwideep_tp_dp_guard_points
+agg	5100	4845	255	0.9500	0.0500	4845	0	255
+disagg	5100	4560	540	0.8941	0.1059	4560	285	255
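The per-topology rows can be reconciled with the boundary-cell classes: an agg point has no result only when the cell is `failed_both`, while a disagg point has no result when the cell is either `unsupported_disagg` or `failed_both`. A short sketch of that mapping, using counts copied from the overview table (the function name `topology_points` is hypothetical):

```python
# Boundary-cell counts copied from 138_failure_boundary_overview.tsv.
supported_both = 4560
unsupported_disagg = 285
failed_both = 255


def topology_points(supported, unsupported, failed):
    """Map boundary-cell counts to (candidate_points, no_results_points) per topology."""
    return {
        # Agg emits candidates in every cell except the hard-failure cells.
        "agg": (supported + unsupported, failed),
        # Disagg emits candidates only where both topologies are supported.
        "disagg": (supported, unsupported + failed),
    }


points = topology_points(supported_both, unsupported_disagg, failed_both)
print(points)
```

This reproduces the `4845 / 255` agg split and the `4560 / 540` disagg split shown above.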
diff --git a/docs/k2-aic-exhaustive-study/figures/138_disagg_unsupported_ratio_heatmap.png b/docs/k2-aic-exhaustive-study/figures/138_disagg_unsupported_ratio_heatmap.png
diff --git a/docs/k2-aic-exhaustive-study/figures/138_failure_boundary_mix_heatmap.png b/docs/k2-aic-exhaustive-study/figures/138_failure_boundary_mix_heatmap.png
diff --git a/docs/k2-aic-exhaustive-study/figures/138_failure_unsupported_ratio_pointcloud.png b/docs/k2-aic-exhaustive-study/figures/138_failure_unsupported_ratio_pointcloud.png
diff --git a/docs/k2-aic-exhaustive-study/sections/138-failure-boundary.md b/docs/k2-aic-exhaustive-study/sections/138-failure-boundary.md
@@ -0,0 +1,70 @@
+# K2 Failure Boundary And No-Result Analysis
+
+## Scope
+
+This section analyzes the safe derived K2 exhaustive AIConfigurator sweep for `h200_sxm + sglang 0.5.9` and asks a narrow question: where do no-result regions represent a stable unsupported topology boundary, and where do they represent a harder backend failure that should block planning outright?
+
+The checked-in derived dataset merges two local catalogs:
+
+- the additive batch-band sweep over node counts `1, 2, 4, 8, 16, 32, 64, 128, 256, 512`
+- the intermediate-node backfill over node counts `144, 160, 176, 192, 208, 224, 240`
+
+That merge yields `10,200` topology rows and `5,100` agg/disagg boundary cells in [../data/138_failure_boundary_cells.tsv](../data/138_failure_boundary_cells.tsv).
+
+## Boundary Taxonomy
+
+The no-result region is not uniform. The derived tables normalize the sweep into three boundary classes:
+
+| Boundary class | Cells | Ratio | Meaning |
+| --- | ---: | ---: | --- |
+| `supported_both` | 4560 | 89.4% | Both agg and disagg emit candidates. |
+| `unsupported_disagg` | 285 | 5.6% | Agg emits a candidate, but single-node disagg emits no result CSV. |
+| `failed_both` | 255 | 5.0% | Both topologies collapse into the non-wideep SGLang `TP>1 && DP>1` guard path. |
+
+Those counts are summarized in [../data/138_failure_boundary_overview.tsv](../data/138_failure_boundary_overview.tsv) and visualized in [../figures/138_failure_boundary_mix_heatmap.png](../figures/138_failure_boundary_mix_heatmap.png).
+
+## What The Boundary Actually Says
+
+Two qualitatively different regions appear in the sweep:
+
+1. The disaggregated topology is unsupported only at node count `1`, and this holds for every non-`stress_256k` profile.
+2. The `stress_256k` profile fails across every sampled node count and every sampled load cell.
+
+The distinction matters because it changes the operational recommendation:
+
+- The single-node disagg region should be treated as an unsupported planning boundary, not as evidence that the entire workload shape is invalid.
+- The `stress_256k` region should be treated as a hard failure boundary until the backend guard or estimator coverage changes.
+
+The topology-specific unsupported view in [../figures/138_disagg_unsupported_ratio_heatmap.png](../figures/138_disagg_unsupported_ratio_heatmap.png) makes the first point explicit: unsupported disagg is concentrated at `1` node and disappears immediately at `2` nodes.
+
+The ratio view in [../figures/138_failure_unsupported_ratio_pointcloud.png](../figures/138_failure_unsupported_ratio_pointcloud.png) shows that every multi-node bucket below `256K` is fully supported. The remaining node/context buckets fall into three nonzero ratio states, given here as implied by the checked-in summary tables, where the `256K` bucket holds only `balanced_256k` and `stress_256k`:
+
+- `unsupported=1.00, failure=0.00` for the single-node non-`256K` unsupported strip
+- `unsupported=0.50, failure=0.50` for the single-node `256K` mixed bucket
+- `unsupported=0.00, failure=0.50` for every multi-node `256K` bucket
+
+## Profile-Level Findings
+
+Every profile except `stress_256k` follows the same pattern in [../data/138_failure_boundary_profile_summary.tsv](../data/138_failure_boundary_profile_summary.tsv):
+
+- `240 / 255` cells are fully supported
+- `15 / 255` cells are single-node disagg no-results
+- `0 / 255` cells are hard failures
+
+`stress_256k` is the lone outlier:
+
+- `0 / 255` cells are supported
+- `0 / 255` cells land in the disagg-only unsupported category
+- `255 / 255` cells fail
+
+This means the apparent `256K` boundary is not a generic "all 256K contexts are broken" result. `balanced_256k`, the only other profile with a `262144`-token context budget, remains fully supported above one node. The failure boundary is much narrower: it is tied to the preserved `stress_256k` shape with `isl=229376` and `osl=32768`.
+
+## Implications For Measured Follow-Up
+
+This boundary study suggests three concrete follow-ups for measured runtime work:
+
+- Exclude single-node disagg from measured comparisons unless the goal is explicitly to validate the unsupported region.
+- Sample directly around the `128K -> 256K` stress transition, because `stress_128k` is fully supported above one node while `stress_256k` fails everywhere.
+- Keep the `stress_256k` failure mode separate from ordinary unsupported planning gaps so that future estimator or backend fixes can be evaluated against a clean baseline.
+
+In short, the safe derived sweep does not show a broad collapse of K2 AIConfigurator coverage. It shows one narrow topology boundary and one preserved hard-failure lane. Treating those as separate planning outputs is the key result of epic `#138`.
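The three-way taxonomy from the Boundary Taxonomy section can be expressed as a small classifier over per-topology outcomes. This is a sketch under stated assumptions, not the study's actual derivation script: it takes two booleans per cell (whether agg and disagg emitted candidates), ignores the failure cause, and treats a disagg-only success as an error because that state does not appear in the sweep. The name `classify_cell` is hypothetical.

```python
def classify_cell(agg_has_candidates: bool, disagg_has_candidates: bool) -> str:
    """Assign one boundary class to a (profile, node count, batch) cell."""
    if agg_has_candidates and disagg_has_candidates:
        return "supported_both"
    if agg_has_candidates:
        # Agg succeeded, disagg emitted no result CSV.
        return "unsupported_disagg"
    if not disagg_has_candidates:
        # Neither topology emitted candidates.
        return "failed_both"
    # Assumption: disagg-only success is outside the taxonomy, so flag it.
    raise ValueError("disagg succeeded without agg; not covered by the taxonomy")


# Example cells: a multi-node supported cell, a single-node cell,
# and a stress_256k cell.
labels = [
    classify_cell(True, True),
    classify_cell(True, False),
    classify_cell(False, False),
]
print(labels)
```

Keeping `unsupported_disagg` and `failed_both` as distinct labels is what lets a planner skip single-node disagg without discarding the whole workload shape.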