9 changes: 9 additions & 0 deletions docs/k2-aic-exhaustive-study/README.md
@@ -0,0 +1,9 @@
# K2 Failure Boundary Study

This slice of epic `#138` checks in the K2 exhaustive AIConfigurator no-result region as a derived study artifact.

- Narrative: [sections/138-failure-boundary.md](sections/138-failure-boundary.md)
- Tables: [data/138_failure_boundary_overview.tsv](data/138_failure_boundary_overview.tsv), [data/138_failure_boundary_profile_summary.tsv](data/138_failure_boundary_profile_summary.tsv), [data/138_failure_boundary_context_node_summary.tsv](data/138_failure_boundary_context_node_summary.tsv)
- Figures: [figures/138_failure_boundary_mix_heatmap.png](figures/138_failure_boundary_mix_heatmap.png), [figures/138_disagg_unsupported_ratio_heatmap.png](figures/138_disagg_unsupported_ratio_heatmap.png), [figures/138_failure_unsupported_ratio_pointcloud.png](figures/138_failure_unsupported_ratio_pointcloud.png)

The raw source catalogs remain outside GitHub. This subtree contains only safe derived tables, figures, and prose. Reproducibility scripts are intentionally excluded here and are tracked separately in RL360 issue `#182`.
5,101 changes: 5,101 additions & 0 deletions docs/k2-aic-exhaustive-study/data/138_failure_boundary_cells.tsv

Large diffs are not rendered by default.

@@ -0,0 +1,18 @@
profile_context_budget context_budget_label total_cells supported_both_cells unsupported_disagg_cells failed_both_cells supported_both_ratio unsupported_disagg_ratio failed_both_ratio
1024 1K 255 240 15 0 0.9412 0.0588 0.0000
1536 1.5K 255 240 15 0 0.9412 0.0588 0.0000
2048 2K 255 240 15 0 0.9412 0.0588 0.0000
2560 2.5K 255 240 15 0 0.9412 0.0588 0.0000
4096 4K 255 240 15 0 0.9412 0.0588 0.0000
4500 4500 255 240 15 0 0.9412 0.0588 0.0000
8192 8K 510 480 30 0 0.9412 0.0588 0.0000
16384 16K 255 240 15 0 0.9412 0.0588 0.0000
32768 32K 255 240 15 0 0.9412 0.0588 0.0000
40960 40K 255 240 15 0 0.9412 0.0588 0.0000
65536 64K 255 240 15 0 0.9412 0.0588 0.0000
73728 72K 255 240 15 0 0.9412 0.0588 0.0000
131072 128K 510 480 30 0 0.9412 0.0588 0.0000
131584 128.5K 255 240 15 0 0.9412 0.0588 0.0000
132608 129.5K 255 240 15 0 0.9412 0.0588 0.0000
138240 135K 255 240 15 0 0.9412 0.0588 0.0000
262144 256K 510 240 15 255 0.4706 0.0294 0.5000
@@ -0,0 +1,18 @@
node_count total_cells supported_both_cells unsupported_disagg_cells failed_both_cells supported_both_ratio unsupported_disagg_ratio failed_both_ratio
1 300 0 285 15 0.0000 0.9500 0.0500
2 300 285 0 15 0.9500 0.0000 0.0500
4 300 285 0 15 0.9500 0.0000 0.0500
8 300 285 0 15 0.9500 0.0000 0.0500
16 300 285 0 15 0.9500 0.0000 0.0500
32 300 285 0 15 0.9500 0.0000 0.0500
64 300 285 0 15 0.9500 0.0000 0.0500
128 300 285 0 15 0.9500 0.0000 0.0500
144 300 285 0 15 0.9500 0.0000 0.0500
160 300 285 0 15 0.9500 0.0000 0.0500
176 300 285 0 15 0.9500 0.0000 0.0500
192 300 285 0 15 0.9500 0.0000 0.0500
208 300 285 0 15 0.9500 0.0000 0.0500
224 300 285 0 15 0.9500 0.0000 0.0500
240 300 285 0 15 0.9500 0.0000 0.0500
256 300 285 0 15 0.9500 0.0000 0.0500
512 300 285 0 15 0.9500 0.0000 0.0500
@@ -0,0 +1,10 @@
metric value notes
source_catalog_count 2 Merged safe derived catalogs from the additive sweep and intermediate-node backfill.
unique_topology_rows 10200 Each topology row is one workload-profile, node-count, and batch cell for agg or disagg.
unique_boundary_cells 5100 Each boundary cell collapses agg and disagg into one workload-profile, node-count, and batch point.
supported_both_cells 4560 Both agg and disagg emitted candidates.
unsupported_disagg_cells 285 Agg succeeded but disagg emitted no CSV on single-node runs.
failed_both_cells 255 Both topologies land in the non-wideep SGLang TP>1 and DP>1 guard path.
supported_both_ratio 0.8941 Share of merged boundary cells that are fully supported.
unsupported_disagg_ratio 0.0559 Share of merged boundary cells in the single-node disagg-only unsupported region.
failed_both_ratio 0.0500 Share of merged boundary cells in the hard failure region.
@@ -0,0 +1,21 @@
workload_profile profile_family profile_tier summary_label profile_isl profile_osl profile_context_budget context_budget_label total_cells supported_both_cells unsupported_disagg_cells failed_both_cells supported_both_ratio unsupported_disagg_ratio failed_both_ratio
balanced_1k balanced_pow2 balanced_pow2 Balanced 1K 512 512 1024 1K 255 240 15 0 0.9412 0.0588 0.0000
observed_low_0p5k_1k issue_preserved observed_practical 0.5K / 1K 512 1024 1536 1.5K 255 240 15 0 0.9412 0.0588 0.0000
balanced_2k balanced_pow2 balanced_pow2 Balanced 2K 1024 1024 2048 2K 255 240 15 0 0.9412 0.0588 0.0000
observed_mean_1p5k_1k issue_preserved observed_practical 1.5K / 1K 1536 1024 2560 2.5K 255 240 15 0 0.9412 0.0588 0.0000
balanced_4k balanced_pow2 balanced_pow2 Balanced 4K 2048 2048 4096 4K 255 240 15 0 0.9412 0.0588 0.0000
chat_4k_500 issue_preserved legacy_chat Chat 4K / 500 4000 500 4500 4500 255 240 15 0 0.9412 0.0588 0.0000
balanced_8k balanced_pow2 balanced_pow2 Balanced 8K 4096 4096 8192 8K 255 240 15 0 0.9412 0.0588 0.0000
observed_high_7k_1k issue_preserved observed_practical 7K / 1K 7168 1024 8192 8K 255 240 15 0 0.9412 0.0588 0.0000
balanced_16k balanced_pow2 balanced_pow2 Balanced 16K 8192 8192 16384 16K 255 240 15 0 0.9412 0.0588 0.0000
balanced_32k balanced_pow2 balanced_pow2 Balanced 32K 16384 16384 32768 32K 255 240 15 0 0.9412 0.0588 0.0000
practical_32k_8k issue_preserved practical_rl 32K / 8K 32768 8192 40960 40K 255 240 15 0 0.9412 0.0588 0.0000
balanced_64k balanced_pow2 balanced_pow2 Balanced 64K 32768 32768 65536 64K 255 240 15 0 0.9412 0.0588 0.0000
practical_64k_8k issue_preserved practical_rl 64K / 8K 65536 8192 73728 72K 255 240 15 0 0.9412 0.0588 0.0000
balanced_128k balanced_pow2 balanced_pow2 Balanced 128K 65536 65536 131072 128K 255 240 15 0 0.9412 0.0588 0.0000
stress_128k issue_preserved stress_rl 128K profile 114688 16384 131072 128K 255 240 15 0 0.9412 0.0588 0.0000
observed_low_0p5k_128k issue_preserved observed_stress 0.5K / 128K 512 131072 131584 128.5K 255 240 15 0 0.9412 0.0588 0.0000
observed_mean_1p5k_128k issue_preserved observed_stress 1.5K / 128K 1536 131072 132608 129.5K 255 240 15 0 0.9412 0.0588 0.0000
observed_high_7k_128k issue_preserved observed_stress 7K / 128K 7168 131072 138240 135K 255 240 15 0 0.9412 0.0588 0.0000
balanced_256k balanced_pow2 balanced_pow2 Balanced 256K 131072 131072 262144 256K 255 240 15 0 0.9412 0.0588 0.0000
stress_256k issue_preserved stress_rl 256K profile 229376 32768 262144 256K 255 0 0 255 0.0000 0.0000 1.0000
@@ -0,0 +1,3 @@
topology total_points candidate_points no_results_points candidate_ratio no_results_ratio supported_points unsupported_single_node_disagg_points failed_nonwideep_tp_dp_guard_points
agg 5100 4845 255 0.9500 0.0500 4845 0 255
disagg 5100 4560 540 0.8941 0.1059 4560 285 255
70 changes: 70 additions & 0 deletions docs/k2-aic-exhaustive-study/sections/138-failure-boundary.md
@@ -0,0 +1,70 @@
# K2 Failure Boundary And No-Result Analysis

## Scope

This section analyzes the safe derived K2 exhaustive AIConfigurator sweep for `h200_sxm + sglang 0.5.9` and asks a narrow question: where do no-result regions represent a stable unsupported topology boundary, and where do they represent a harder backend failure that should block planning outright?

The checked-in derived dataset merges two local catalogs:

- the additive batch-band sweep over node counts `1, 2, 4, 8, 16, 32, 64, 128, 256, 512`
- the intermediate-node backfill over node counts `144, 160, 176, 192, 208, 224, 240`

That merge yields `10,200` topology rows and `5,100` agg/disagg boundary cells in [../data/138_failure_boundary_cells.tsv](../data/138_failure_boundary_cells.tsv).
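
The row and cell totals can be cross-checked from the node lists above and the per-profile counts in the profile summary table. The sketch below is only an arithmetic check: the number of batch points per profile and node count is inferred from `255 / 17` and is not stated anywhere in the source.

```python
# Node counts listed for the two source catalogs.
additive_nodes = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
backfill_nodes = [144, 160, 176, 192, 208, 224, 240]
node_counts = sorted(set(additive_nodes) | set(backfill_nodes))

PROFILES = 20             # data rows in the profile summary table
CELLS_PER_PROFILE = 255   # total_cells for each profile in that table

# Inferred (assumption): batch points per (profile, node count) pair.
batch_points = CELLS_PER_PROFILE // len(node_counts)

boundary_cells = PROFILES * len(node_counts) * batch_points
topology_rows = boundary_cells * 2  # one row each for agg and disagg

print(len(node_counts), batch_points, boundary_cells, topology_rows)
# 17 15 5100 10200
```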

## Boundary Taxonomy

The no-result region is not uniform. The derived tables normalize the sweep into three boundary classes:

| Boundary class | Cells | Ratio | Meaning |
| --- | ---: | ---: | --- |
| `supported_both` | 4560 | 89.4% | Both agg and disagg emit candidates. |
| `unsupported_disagg` | 285 | 5.6% | Agg emits a candidate, but single-node disagg emits no result CSV. |
| `failed_both` | 255 | 5.0% | Both topologies collapse into the non-wideep SGLang `TP>1 && DP>1` guard path. |

Those counts are summarized in [../data/138_failure_boundary_overview.tsv](../data/138_failure_boundary_overview.tsv) and visualized in [../figures/138_failure_boundary_mix_heatmap.png](../figures/138_failure_boundary_mix_heatmap.png).
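
As a minimal sketch of the taxonomy, the function below maps the agg and disagg outcomes of one cell to a boundary class, then recomputes the ratios from the counts in the table. The function name and its boolean inputs are hypothetical. It classifies by outcome only and does not inspect which guard path produced a failure.

```python
def classify_cell(agg_has_candidates: bool, disagg_has_candidates: bool) -> str:
    """Map one (profile, node count, batch) cell to a boundary class."""
    if agg_has_candidates and disagg_has_candidates:
        return "supported_both"
    if agg_has_candidates:
        return "unsupported_disagg"
    if not disagg_has_candidates:
        return "failed_both"
    # Disagg succeeding while agg fails is not one of the three classes.
    raise ValueError("disagg-only success is outside the documented taxonomy")


# Cell counts copied from the taxonomy table.
counts = {"supported_both": 4560, "unsupported_disagg": 285, "failed_both": 255}
total = sum(counts.values())
ratios = {name: round(n / total, 4) for name, n in counts.items()}

print(total, ratios)
# 5100 {'supported_both': 0.8941, 'unsupported_disagg': 0.0559, 'failed_both': 0.05}
```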

## What The Boundary Actually Says

Two qualitatively different regions appear in the sweep:

1. The disaggregated topology is unsupported only at a node count of `1`, and this holds for every profile except `stress_256k`.
2. The `stress_256k` profile fails for both topologies at every sampled node count and every sampled batch cell.

The distinction matters because it changes the operational recommendation:

- The single-node disagg region should be treated as an unsupported planning boundary, not as evidence that the entire workload shape is invalid.
- The `stress_256k` region should be treated as a hard failure boundary until the backend guard or estimator coverage changes.
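
One way a planner could encode that recommendation is a lookup from boundary class to the topologies that remain plannable. This is an illustrative sketch: `PLANNABLE` and `plannable_topologies` are hypothetical names, not part of AIConfigurator.

```python
# Topologies that stay eligible for planning, per boundary class.
PLANNABLE = {
    "supported_both": ("agg", "disagg"),
    "unsupported_disagg": ("agg",),   # topology boundary: drop disagg only
    "failed_both": (),                # hard failure: block the cell outright
}


def plannable_topologies(boundary_class: str) -> tuple:
    """Return the topologies a planner may still consider for a cell."""
    try:
        return PLANNABLE[boundary_class]
    except KeyError:
        raise ValueError(f"unknown boundary class: {boundary_class!r}") from None


print(plannable_topologies("unsupported_disagg"))
# ('agg',)
```

Keeping the two no-result classes as separate keys means a later backend fix can change one entry without touching the other.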

The topology-specific unsupported view in [../figures/138_disagg_unsupported_ratio_heatmap.png](../figures/138_disagg_unsupported_ratio_heatmap.png) makes the first point explicit: unsupported disagg is concentrated at `1` node and disappears immediately at `2` nodes.

The ratio view in [../figures/138_failure_unsupported_ratio_pointcloud.png](../figures/138_failure_unsupported_ratio_pointcloud.png) shows that almost every node/context bucket is fully supported, with only three nonzero ratio states:

- `unsupported=1.00, failure=0.00` for the single-node non-`256K` unsupported strip
- `unsupported=0.80, failure=0.20` for the single-node `256K` mixed bucket
- `unsupported=0.00, failure=0.20` for every multi-node `256K` bucket
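
The `0.80 / 0.20` split is what a bucket of five profiles would produce if one of them always fails. The sketch below reproduces the three states under a loudly stated assumption: that the figure's context buckets round each profile's context budget up to the next power of two. The source does not confirm this bucketing rule. The budgets are copied from the profile summary table.

```python
# Context budgets (tokens) copied from the profile summary table.
PROFILE_BUDGETS = {
    "balanced_128k": 131072,
    "stress_128k": 131072,
    "observed_low_0p5k_128k": 131584,
    "observed_mean_1p5k_128k": 132608,
    "observed_high_7k_128k": 138240,
    "balanced_256k": 262144,
    "stress_256k": 262144,
}


def pow2_bucket(budget: int) -> int:
    """ASSUMPTION: a bucket is the budget rounded up to a power of two."""
    return 1 << (budget - 1).bit_length()


def boundary_class(profile: str, node_count: int) -> str:
    """Per-cell class, following the two regions described above."""
    if profile == "stress_256k":
        return "failed_both"
    return "unsupported_disagg" if node_count == 1 else "supported_both"


def bucket_ratios(bucket: int, node_count: int) -> tuple[float, float]:
    """Return (unsupported ratio, failure ratio) for one node/context bucket."""
    classes = [
        boundary_class(profile, node_count)
        for profile, budget in PROFILE_BUDGETS.items()
        if pow2_bucket(budget) == bucket
    ]
    n = len(classes)
    return classes.count("unsupported_disagg") / n, classes.count("failed_both") / n


print(bucket_ratios(131072, 1))  # (1.0, 0.0)
print(bucket_ratios(262144, 1))  # (0.8, 0.2)
print(bucket_ratios(262144, 2))  # (0.0, 0.2)
```

Every profile contributes the same number of batch cells per node count, so counting profiles gives the same ratios as counting cells.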

## Profile-Level Findings

Every profile except `stress_256k` follows the same pattern in [../data/138_failure_boundary_profile_summary.tsv](../data/138_failure_boundary_profile_summary.tsv):

- `240 / 255` cells are fully supported
- `15 / 255` cells are single-node disagg no-results
- `0 / 255` cells are hard failures

`stress_256k` is the lone outlier:

- `0 / 255` cells are supported
- `0 / 255` cells land in the disagg-only unsupported category
- `255 / 255` cells fail

This means the apparent `256K` boundary is not a generic "all 256K contexts are broken" result. `balanced_256k` and the other `~256K` observed-context profiles remain fully supported above one node and lose only disagg at one node. The failure boundary is much narrower: it is tied to the preserved `stress_256k` shape with `isl=229376` and `osl=32768`.

## Implications For Measured Follow-Up

This boundary study suggests three concrete follow-ups for measured runtime work:

- Exclude single-node disagg from measured comparisons unless the goal is explicitly to validate the unsupported region.
- Sample directly around the `128K -> 256K` stress transition, because `stress_128k` is stable while `stress_256k` fails everywhere.
- Keep the `stress_256k` failure mode separate from ordinary unsupported planning gaps so that future estimator or backend fixes can be evaluated against a clean baseline.
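
The first two follow-ups could be sketched as a filter plus a sample generator. Both stress profiles use a 7:1 `isl:osl` split, so the sketch keeps that split while stepping the context budget from `stress_128k` to `stress_256k`. The step size, the function names, and the choice of linear spacing are assumptions made for illustration only.

```python
STRESS_128K = (114688, 16384)   # (isl, osl) from the profile summary table
STRESS_256K = (229376, 32768)


def keep_for_measurement(topology: str, node_count: int) -> bool:
    """Drop single-node disagg, which the sweep marks as unsupported."""
    return not (topology == "disagg" and node_count == 1)


def transition_samples(step: int = 16384) -> list[tuple[int, int]]:
    """Hypothetical (isl, osl) points between the two stress profiles."""
    low, high = sum(STRESS_128K), sum(STRESS_256K)
    samples = []
    for budget in range(low, high + 1, step):
        osl = budget // 8            # preserve the 7:1 isl:osl split
        samples.append((budget - osl, osl))
    return samples


samples = transition_samples()
print(len(samples), samples[0], samples[-1])
# 9 (114688, 16384) (229376, 32768)
```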

In short, the safe derived sweep does not show a broad collapse of K2 AIConfigurator coverage. It shows one narrow topology boundary and one preserved hard-failure lane. Treating those as separate planning outputs is the key result of epic `#138`.