25 changes: 25 additions & 0 deletions docs/k2-aic-exhaustive-study/README.md
@@ -0,0 +1,25 @@
# Epic 122: RL360 Global Scaling

This study package is the self-contained documentation artifact for RL360 epic `#122`.
It stays inside `docs/k2-aic-exhaustive-study` and does not depend on `scripts/analysis/*`.

## Scope

- data and tables: [sections/122-data-and-tables.md](sections/122-data-and-tables.md)
- figures: [sections/122-figures.md](sections/122-figures.md)
- narrative: [sections/122-narrative.md](sections/122-narrative.md)

## Provenance

The artifacts are derived from the local AIConfigurator catalogs captured on `2026-04-23`:

- `/mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered`
- `/mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256`

The first catalog covers `practical_32k_8k`, `practical_64k_8k`, and `stress_128k_16k`.
The retry catalog records the all-scale `stress_256k_32k` no-results boundary.

## High-Level Result

The agg topology is the global default for this epic.
Disagg has a narrow mid-scale win region: it wins cluster throughput in only `4` of `27` supported head-to-head comparisons.
The `256K` stress profile has no viable agg or disagg candidate at any tested scale.
81 changes: 81 additions & 0 deletions docs/k2-aic-exhaustive-study/data/122_global_scaling_points.tsv

Large diffs are not rendered by default.

7 changes: 7 additions & 0 deletions docs/k2-aic-exhaustive-study/data/122_scaling_summary.tsv
@@ -0,0 +1,7 @@
workload_profile workload_tier topology baseline_node_count baseline_cluster_tokens_s peak_node_count peak_cluster_tokens_s peak_vs_baseline_ratio final_node_count final_cluster_tokens_s final_vs_baseline_ratio final_vs_peak_ratio peak_selected_gpu_fraction final_selected_gpu_fraction scale_vs_agg_1n_at_peak
practical_32k_8k practical_rl agg 1 732.727 512 375156.224 512.000 512 375156.224 512.000 1.000 1.000 1.000 512.000
practical_32k_8k practical_rl disagg 2 1190.789 16 15480.259 13.000 512 15480.259 13.000 1.000 0.875 0.984 21.127
practical_64k_8k practical_rl agg 1 507.713 512 259949.056 512.000 512 259949.056 512.000 1.000 1.000 1.000 512.000
practical_64k_8k practical_rl disagg 2 504.955 16 5042.995 9.987 512 5042.995 9.987 1.000 0.688 0.988 9.933
stress_128k_16k stress_rl agg 1 349.452 512 178919.424 512.000 512 178919.424 512.000 1.000 1.000 1.000 512.000
stress_128k_16k stress_rl disagg 2 422.052 16 5094.769 12.071 512 5064.622 12.000 0.994 0.875 0.990 14.579
11 changes: 11 additions & 0 deletions docs/k2-aic-exhaustive-study/data/122_stress256_boundary.tsv
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
workload_profile workload_tier context_budget node_count total_gpus agg_status disagg_status agg_selected_total_gpus disagg_selected_total_gpus agg_manifest_path disagg_manifest_path
stress_256k_32k stress_rl 256K 1 8 no_results no_results 0 0 /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/1n/agg_manifest.json /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/1n/disagg_manifest.json
stress_256k_32k stress_rl 256K 2 16 no_results no_results 0 0 /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/2n/agg_manifest.json /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/2n/disagg_manifest.json
stress_256k_32k stress_rl 256K 4 32 no_results no_results 0 0 /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/4n/agg_manifest.json /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/4n/disagg_manifest.json
stress_256k_32k stress_rl 256K 8 64 no_results no_results 0 0 /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/8n/agg_manifest.json /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/8n/disagg_manifest.json
stress_256k_32k stress_rl 256K 16 128 no_results no_results 0 0 /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/16n/agg_manifest.json /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/16n/disagg_manifest.json
stress_256k_32k stress_rl 256K 32 256 no_results no_results 0 0 /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/32n/agg_manifest.json /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/32n/disagg_manifest.json
stress_256k_32k stress_rl 256K 64 512 no_results no_results 0 0 /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/64n/agg_manifest.json /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/64n/disagg_manifest.json
stress_256k_32k stress_rl 256K 128 1024 no_results no_results 0 0 /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/128n/agg_manifest.json /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/128n/disagg_manifest.json
stress_256k_32k stress_rl 256K 256 2048 no_results no_results 0 0 /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/256n/agg_manifest.json /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/256n/disagg_manifest.json
stress_256k_32k stress_rl 256K 512 4096 no_results no_results 0 0 /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/512n/agg_manifest.json /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/512n/disagg_manifest.json
28 changes: 28 additions & 0 deletions docs/k2-aic-exhaustive-study/data/122_topology_head_to_head.tsv
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
workload_profile workload_tier context_budget node_count total_gpus agg_cluster_tokens_s disagg_cluster_tokens_s disagg_to_agg_cluster_ratio agg_tokens_s_user disagg_tokens_s_user disagg_to_agg_user_ratio agg_tokens_s_gpu_allocated disagg_tokens_s_gpu_allocated disagg_to_agg_allocated_gpu_ratio disagg_tokens_s_gpu_selected disagg_to_agg_selected_gpu_ratio disagg_selected_gpu_fraction winner_cluster_tokens winner_user_throughput winner_allocated_gpu_efficiency winner_selected_gpu_efficiency disagg_script_shape
practical_32k_8k practical_rl 40K 2 16 1465.454 1190.789 0.813 31.302 26.884 0.859 91.591 74.424 0.813 74.424 0.813 1.000 agg agg agg agg cross_node
practical_32k_8k practical_rl 40K 4 32 2930.908 3572.367 1.219 31.302 26.884 0.859 91.591 111.636 1.219 111.636 1.219 1.000 disagg agg disagg disagg cross_node
practical_32k_8k practical_rl 40K 8 64 5861.816 8335.524 1.422 31.302 26.884 0.859 91.591 130.243 1.422 130.243 1.422 1.000 disagg agg disagg disagg cross_node
practical_32k_8k practical_rl 40K 16 128 11723.632 15480.259 1.320 31.302 26.884 0.859 91.591 120.940 1.320 138.217 1.509 0.875 disagg agg disagg disagg cross_node
practical_32k_8k practical_rl 40K 32 256 23447.264 15480.259 0.660 31.302 26.884 0.859 91.591 60.470 0.660 69.108 0.755 0.875 agg agg agg agg cross_node
practical_32k_8k practical_rl 40K 64 512 46894.528 15480.259 0.330 31.302 26.884 0.859 91.591 30.235 0.330 34.554 0.377 0.875 agg agg agg agg cross_node
practical_32k_8k practical_rl 40K 128 1024 93789.056 15480.259 0.165 31.302 26.884 0.859 91.591 15.117 0.165 15.357 0.168 0.984 agg agg agg agg cross_node
practical_32k_8k practical_rl 40K 256 2048 187578.112 15480.259 0.083 31.302 26.884 0.859 91.591 7.559 0.083 7.679 0.084 0.984 agg agg agg agg cross_node
practical_32k_8k practical_rl 40K 512 4096 375156.224 15480.259 0.041 31.302 26.884 0.859 91.591 3.779 0.041 3.839 0.042 0.984 agg agg agg agg cross_node
practical_64k_8k practical_rl 72K 2 16 1015.426 504.955 0.497 88.240 61.356 0.695 63.464 31.560 0.497 31.560 0.497 1.000 agg agg agg agg node_local
practical_64k_8k practical_rl 72K 4 32 2030.852 1514.865 0.746 88.240 61.356 0.695 63.464 47.340 0.746 47.340 0.746 1.000 agg agg agg agg node_local
practical_64k_8k practical_rl 72K 8 64 4061.704 3534.684 0.870 88.240 61.356 0.695 63.464 55.229 0.870 55.229 0.870 1.000 agg agg agg agg node_local
practical_64k_8k practical_rl 72K 16 128 8123.408 5042.995 0.621 88.240 61.356 0.695 63.464 39.398 0.621 57.307 0.903 0.688 agg agg agg agg node_local
practical_64k_8k practical_rl 72K 32 256 16246.816 5042.995 0.310 88.240 61.356 0.695 63.464 19.699 0.310 28.653 0.451 0.688 agg agg agg agg node_local
practical_64k_8k practical_rl 72K 64 512 32493.632 5042.995 0.155 88.240 61.356 0.695 63.464 9.850 0.155 11.461 0.181 0.859 agg agg agg agg node_local
practical_64k_8k practical_rl 72K 128 1024 64987.264 5042.995 0.078 88.240 61.356 0.695 63.464 4.925 0.078 5.210 0.082 0.945 agg agg agg agg node_local
practical_64k_8k practical_rl 72K 256 2048 129974.528 5042.995 0.039 88.240 61.356 0.695 63.464 2.462 0.039 2.492 0.039 0.988 agg agg agg agg node_local
practical_64k_8k practical_rl 72K 512 4096 259949.056 5042.995 0.019 88.240 61.356 0.695 63.464 1.231 0.019 1.246 0.020 0.988 agg agg agg agg node_local
stress_128k_16k stress_rl 128K 2 16 698.904 422.052 0.604 179.730 90.246 0.502 43.681 26.378 0.604 26.378 0.604 1.000 agg agg agg agg node_local
stress_128k_16k stress_rl 128K 4 32 1397.808 1266.156 0.906 179.730 90.246 0.502 43.681 39.567 0.906 39.567 0.906 1.000 agg agg agg agg node_local
stress_128k_16k stress_rl 128K 8 64 2795.616 2954.363 1.057 179.730 90.246 0.502 43.681 46.162 1.057 46.162 1.057 1.000 disagg agg disagg disagg node_local
stress_128k_16k stress_rl 128K 16 128 5591.232 5094.769 0.911 179.730 105.023 0.584 43.681 39.803 0.911 45.489 1.041 0.875 agg agg agg disagg node_local
stress_128k_16k stress_rl 128K 32 256 11182.464 5094.769 0.456 179.730 105.023 0.584 43.681 19.901 0.456 22.745 0.521 0.875 agg agg agg agg node_local
stress_128k_16k stress_rl 128K 64 512 22364.928 5094.769 0.228 179.730 105.023 0.584 43.681 9.951 0.228 11.372 0.260 0.875 agg agg agg agg node_local
stress_128k_16k stress_rl 128K 128 1024 44729.856 5094.769 0.114 179.730 105.023 0.584 43.681 4.975 0.114 5.054 0.116 0.984 agg agg agg agg node_local
stress_128k_16k stress_rl 128K 256 2048 89459.712 5064.622 0.057 179.730 90.246 0.502 43.681 2.473 0.057 2.563 0.059 0.965 agg agg agg agg node_local
stress_128k_16k stress_rl 128K 512 4096 178919.424 5064.622 0.028 179.730 90.246 0.502 43.681 1.236 0.028 1.249 0.029 0.990 agg agg agg agg node_local
7 changes: 7 additions & 0 deletions docs/k2-aic-exhaustive-study/data/122_topology_win_rates.tsv
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
scope_type scope comparisons disagg_cluster_wins agg_cluster_wins disagg_cluster_win_rate disagg_allocated_gpu_efficiency_wins disagg_allocated_gpu_efficiency_win_rate disagg_selected_gpu_efficiency_wins disagg_selected_gpu_efficiency_win_rate geometric_mean_cluster_ratio median_cluster_ratio max_cluster_ratio node_count_at_max_ratio min_cluster_ratio node_count_at_min_ratio median_disagg_selected_gpu_fraction
workload_profile practical_32k_8k 9 3 6 0.333 3 0.333 3 0.333 0.394 0.660 1.422 8 0.041 512 0.984
workload_profile practical_64k_8k 9 0 9 0.000 0 0.000 0 0.000 0.202 0.310 0.870 8 0.019 512 0.988
workload_profile stress_128k_16k 9 1 8 0.111 1 0.111 2 0.222 0.278 0.456 1.057 8 0.028 512 0.984
workload_tier practical_rl 18 3 15 0.167 3 0.167 3 0.167 0.282 0.413 1.422 8 0.019 512 0.984
workload_tier stress_rl 9 1 8 0.111 1 0.111 2 0.222 0.278 0.456 1.057 8 0.028 512 0.984
overall all_rl_profiles 27 4 23 0.148 4 0.148 5 0.185 0.281 0.456 1.422 8 0.019 512 0.984
37 changes: 37 additions & 0 deletions docs/k2-aic-exhaustive-study/sections/122-data-and-tables.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
# Epic 122 Data And Tables

This section packages the derived tables for the RL360 global-scaling sweep.
All TSV artifacts are epic-scoped and live under `data/122_*`.

## Source Artifacts

- `data/122_global_scaling_points.tsv`: one row per `workload_profile x node_count x topology`, including cluster throughput, per-user throughput, allocated-GPU efficiency, selected-GPU efficiency, and topology shape.
- `data/122_topology_head_to_head.tsv`: paired agg-vs-disagg comparisons for each scale where both topologies produced candidate manifests.
- `data/122_topology_win_rates.tsv`: win-rate and ratio rollups by workload profile, workload tier, and overall.
- `data/122_scaling_summary.tsv`: baseline, peak, and final scale summaries for each topology.
- `data/122_stress256_boundary.tsv`: the 256K stress boundary, where both topologies returned `no_results` at every tested scale.

## Topology Win Rates

| Scope | Comparisons | Disagg cluster wins | Disagg cluster win rate | Disagg selected-GPU efficiency win rate | Geometric mean cluster ratio |
| --- | ---: | ---: | ---: | ---: | ---: |
| `practical_32k_8k` | 9 | 3 | 33.3% | 33.3% | 0.394 |
| `practical_64k_8k` | 9 | 0 | 0.0% | 0.0% | 0.202 |
| `stress_128k_16k` | 9 | 1 | 11.1% | 22.2% | 0.278 |
| `overall` | 27 | 4 | 14.8% | 18.5% | 0.281 |
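
The win counts, win rates, and median ratios above can be rechecked directly from the `disagg_to_agg_cluster_ratio` column of `data/122_topology_head_to_head.tsv`. The following is a minimal sketch, not the study's own tooling: the names `CLUSTER_RATIOS` and `win_summary` are illustrative, and it assumes a disagg win means a cluster ratio above `1.0`.

```python
from statistics import median

# disagg_to_agg_cluster_ratio copied from data/122_topology_head_to_head.tsv,
# one value per tested scale: 2n, 4n, 8n, 16n, 32n, 64n, 128n, 256n, 512n.
CLUSTER_RATIOS = {
    "practical_32k_8k": [0.813, 1.219, 1.422, 1.320, 0.660, 0.330, 0.165, 0.083, 0.041],
    "practical_64k_8k": [0.497, 0.746, 0.870, 0.621, 0.310, 0.155, 0.078, 0.039, 0.019],
    "stress_128k_16k": [0.604, 0.906, 1.057, 0.911, 0.456, 0.228, 0.114, 0.057, 0.028],
}


def win_summary(ratios):
    """Summarize one scope; a ratio above 1.0 is counted as a disagg win (assumption)."""
    wins = sum(1 for ratio in ratios if ratio > 1.0)
    return {
        "comparisons": len(ratios),
        "disagg_cluster_wins": wins,
        "disagg_cluster_win_rate": round(wins / len(ratios), 3),
        "median_cluster_ratio": median(ratios),
    }


per_profile = {name: win_summary(ratios) for name, ratios in CLUSTER_RATIOS.items()}
overall = win_summary([r for ratios in CLUSTER_RATIOS.values() for r in ratios])

for name, row in per_profile.items():
    print(name, row)
print("overall", overall)
```

Because the ratios are rounded to three decimals in the TSV, derived statistics such as the geometric mean may differ in the last digit from the published rollup.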

## Peak Scaling Summary

| Workload profile | Topology | Baseline nodes | Peak nodes | Peak cluster tokens/s | Peak vs baseline | Final vs peak |
| --- | --- | ---: | ---: | ---: | ---: | ---: |
| `practical_32k_8k` | `agg` | 1 | 512 | 375156.224 | 512.000x | 1.000x |
| `practical_32k_8k` | `disagg` | 2 | 16 | 15480.259 | 13.000x | 1.000x |
| `practical_64k_8k` | `agg` | 1 | 512 | 259949.056 | 512.000x | 1.000x |
| `practical_64k_8k` | `disagg` | 2 | 16 | 5042.995 | 9.987x | 1.000x |
| `stress_128k_16k` | `agg` | 1 | 512 | 178919.424 | 512.000x | 1.000x |
| `stress_128k_16k` | `disagg` | 2 | 16 | 5094.769 | 12.071x | 0.994x |
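
The two ratio columns follow from the baseline, peak, and final throughput values in `data/122_scaling_summary.tsv`. As a hedged sketch of that arithmetic (the `SUMMARY` mapping and `scaling_ratios` helper are illustrative names, not part of the study package):

```python
# (baseline_cluster_tokens_s, peak_cluster_tokens_s, final_cluster_tokens_s)
# copied from data/122_scaling_summary.tsv.
SUMMARY = {
    ("practical_32k_8k", "agg"): (732.727, 375156.224, 375156.224),
    ("practical_32k_8k", "disagg"): (1190.789, 15480.259, 15480.259),
    ("practical_64k_8k", "agg"): (507.713, 259949.056, 259949.056),
    ("practical_64k_8k", "disagg"): (504.955, 5042.995, 5042.995),
    ("stress_128k_16k", "agg"): (349.452, 178919.424, 178919.424),
    ("stress_128k_16k", "disagg"): (422.052, 5094.769, 5064.622),
}


def scaling_ratios(baseline, peak, final):
    """Return (peak_vs_baseline, final_vs_peak), rounded to three decimals."""
    return round(peak / baseline, 3), round(final / peak, 3)


for (profile, topology), values in SUMMARY.items():
    peak_vs_baseline, final_vs_peak = scaling_ratios(*values)
    print(f"{profile:18s} {topology:7s} {peak_vs_baseline:8.3f}x {final_vs_peak:.3f}x")
```

Note that the disagg baseline is the `2n` point, so its peak-vs-baseline ratio is not directly comparable to the agg ratio, which is measured from `1n`.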

## 256K Boundary

`stress_256k_32k` is a clean all-scale boundary in this epic dataset.
`data/122_stress256_boundary.tsv` records `agg_status=no_results` and `disagg_status=no_results` for every tested scale from `1n` through `512n`.
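
One way to confirm the boundary mechanically is to parse the TSV and check both status columns on every row. The sketch below is illustrative only: it rebuilds a reduced excerpt of the file inline (manifest path columns omitted), and `is_all_scale_boundary` is a hypothetical helper name.

```python
import csv
import io

# Reduced excerpt of data/122_stress256_boundary.tsv, rebuilt inline so the
# sketch is self-contained. The manifest path columns are omitted.
NODE_COUNTS = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
HEADER = ["workload_profile", "node_count", "total_gpus", "agg_status", "disagg_status"]
lines = ["\t".join(HEADER)]
for n in NODE_COUNTS:
    # Each node contributes 8 GPUs in the source table (1n -> 8, 512n -> 4096).
    lines.append("\t".join(["stress_256k_32k", str(n), str(8 * n), "no_results", "no_results"]))
excerpt = "\n".join(lines)


def is_all_scale_boundary(tsv_text):
    """True when at least one row exists and every row has no_results for both topologies."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    return bool(rows) and all(
        row["agg_status"] == "no_results" and row["disagg_status"] == "no_results"
        for row in rows
    )


print(is_all_scale_boundary(excerpt))
```

To run the same check against the real artifact, the file contents could be passed in place of `excerpt`.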
34 changes: 34 additions & 0 deletions docs/k2-aic-exhaustive-study/sections/122-figures.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,34 @@
# Epic 122 Figures

This section collects the figure artifacts for the RL360 global-scaling epic.
All image outputs are epic-scoped and live under `figures/122_*`.

## Figure Inventory

### Cluster Capacity Scaling

![Epic 122 cluster capacity scaling](../figures/122_cluster_capacity_scaling.png)

The agg path scales linearly because the selected `8`-GPU serving replica is repeated across nodes.
The disagg path grows through `16n`, then flattens.
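
The shape of both curves can be sketched with simple arithmetic: agg throughput is the `1n` value times the node count, and once disagg is flat its ratio to agg halves with every doubling of nodes. The helper names below are illustrative, and the plateau model is an assumption that holds only from `16n` upward for the two practical profiles (the stress profile dips slightly at `256n`, so it is left out).

```python
# 1n agg throughput (baseline_cluster_tokens_s) from data/122_scaling_summary.tsv.
AGG_1N_TOKENS_S = {
    "practical_32k_8k": 732.727,
    "practical_64k_8k": 507.713,
    "stress_128k_16k": 349.452,
}

# Disagg throughput at and beyond 16n (peak_cluster_tokens_s). Assumption: the
# value stays exactly flat, which the tables show for these two profiles only.
DISAGG_PLATEAU_TOKENS_S = {
    "practical_32k_8k": 15480.259,
    "practical_64k_8k": 5042.995,
}


def agg_cluster_tokens_s(profile, node_count):
    """Linear model: one 8-GPU replica per node, repeated node_count times."""
    return AGG_1N_TOKENS_S[profile] * node_count


def plateau_ratio(profile, node_count):
    """Disagg-to-agg cluster ratio once disagg has flattened (node_count >= 16)."""
    if node_count < 16:
        raise ValueError("plateau model only applies from 16n upward")
    return DISAGG_PLATEAU_TOKENS_S[profile] / agg_cluster_tokens_s(profile, node_count)


for n in (16, 32, 64, 128, 256, 512):
    print(n, round(plateau_ratio("practical_32k_8k", n), 3))
```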

### Disagg To Agg Ratio Heatmap

![Epic 122 disagg to agg ratio heatmap](../figures/122_disagg_to_agg_ratio_heatmap.png)

The heatmap highlights the narrow region where disagg wins on total cluster throughput:

- `practical_32k_8k`: `4n`, `8n`, and `16n`
- `stress_128k_16k`: `8n`
- `practical_64k_8k`: no disagg cluster-throughput wins

### Disagg Efficiency Point Cloud

![Epic 122 disagg efficiency point cloud](../figures/122_disagg_efficiency_pointcloud.png)

This point-cloud view is the required non-heatmap/non-line figure for the epic.
It shows how disagg win margin relates to selected-GPU fraction:

- points above `1.0` are disagg cluster-throughput wins
- circle markers are disagg wins; `X` markers are agg wins
- the `16n` points show where disagg can stay efficient while already leaving GPUs idle
34 changes: 34 additions & 0 deletions docs/k2-aic-exhaustive-study/sections/122-narrative.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,34 @@
# Epic 122 Narrative

This narrative summarizes the RL360 global-scaling findings from the local AIConfigurator catalogs dated `2026-04-23`.

## Findings

1. The agg path is globally stable.
The selected agg layout is a repeated `8`-GPU serving replica, so cluster throughput scales linearly from `1n` through `512n` for every supported RL profile in this epic.

2. Disagg wins only in a narrow mid-scale window.
Across the `27` supported agg-vs-disagg comparisons, disagg wins cluster throughput `4` times (`14.8%`): `practical_32k_8k` at `4n`, `8n`, and `16n`, plus `stress_128k_16k` at `8n`.

3. The strongest disagg win is still localized.
The best head-to-head result is `practical_32k_8k` at `8n`, where disagg reaches a `1.422x` cluster-throughput ratio over agg.
The stress-tier win is much smaller: `stress_128k_16k` reaches only `1.057x` at `8n`.

4. Practical `64K/8K` never flips to disagg.
`practical_64k_8k` has zero disagg cluster-throughput wins; its best ratio is `0.870x` at `8n`, and the geometric-mean ratio is `0.202x`.

5. Disagg plateaus early.
For every supported RL profile, disagg peaks by `16n` and then flattens, while agg keeps scaling linearly.
The disagg peak-to-baseline ratios are `13.000x` for `practical_32k_8k`, `9.987x` for `practical_64k_8k`, and `12.071x` for `stress_128k_16k`.

6. Selected-GPU efficiency is better than cluster wins suggest, but not enough to change the default.
Disagg wins selected-GPU efficiency `5` times (`18.5%`), one more than its `4` cluster-throughput wins.
The extra win is `stress_128k_16k` at `16n`: disagg loses on cluster throughput (`0.911x`) but still beats agg on selected-GPU efficiency (`1.041x`), because that metric counts only the GPUs disagg actually selects and it leaves `12.5%` of the allocation idle.

7. The `256K` stress profile is outside the current supported envelope.
`stress_256k_32k` returns `no_results` for both agg and disagg at every tested scale from `1n` through `512n`.
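
The split in finding 6 between allocated-GPU and selected-GPU efficiency can be reproduced for `stress_128k_16k` at `16n`. This is a sketch under an assumption: the formula (cluster throughput divided by the number of selected GPUs) is inferred from the values in `data/122_topology_head_to_head.tsv`, not taken from the study's code, and `gpu_efficiency` is an illustrative name.

```python
def gpu_efficiency(cluster_tokens_s, total_gpus, selected_fraction=1.0):
    """Tokens/s per GPU, counting only the selected share of the allocation.

    Assumed formula, inferred from the head-to-head table values.
    """
    return cluster_tokens_s / (total_gpus * selected_fraction)


# stress_128k_16k at 16n (128 GPUs), from data/122_topology_head_to_head.tsv.
AGG_TOKENS_S_GPU = 43.681           # agg_tokens_s_gpu_allocated
DISAGG_CLUSTER_TOKENS_S = 5094.769  # disagg_cluster_tokens_s
TOTAL_GPUS = 128
SELECTED_FRACTION = 0.875           # disagg_selected_gpu_fraction (12.5% idle)

allocated = gpu_efficiency(DISAGG_CLUSTER_TOKENS_S, TOTAL_GPUS)
selected = gpu_efficiency(DISAGG_CLUSTER_TOKENS_S, TOTAL_GPUS, SELECTED_FRACTION)

print("allocated ratio:", round(allocated / AGG_TOKENS_S_GPU, 3))
print("selected ratio:", round(selected / AGG_TOKENS_S_GPU, 3))
```

Dividing by `112` selected GPUs instead of all `128` is what lifts the ratio from below `1.0` to above it.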

## Recommendation

Use agg as the default topology for global rollout planning in this epic.
Treat disagg as a targeted option for `practical_32k_8k` at `4n` to `16n`, and as a one-point stress exception at `stress_128k_16k` `8n`.
Do not plan around `stress_256k_32k` until the missing estimator and cache-coverage path is fixed upstream.
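
For planning scripts, the recommendation could be encoded as a small lookup. This is only a sketch: `plan_topology` and the constant names are hypothetical, and the disagg windows list only the tested power-of-two scales where disagg won on cluster throughput.

```python
# Tested scales where disagg beat agg on cluster throughput in this epic.
DISAGG_CANDIDATE_SCALES = {
    "practical_32k_8k": frozenset({4, 8, 16}),
    "stress_128k_16k": frozenset({8}),
}

# Profiles that returned no_results for both topologies at every tested scale.
UNSUPPORTED_PROFILES = frozenset({"stress_256k_32k"})


def plan_topology(profile, node_count):
    """Return (default_topology, disagg_is_candidate), or None if unsupported."""
    if profile in UNSUPPORTED_PROFILES:
        return None
    disagg_is_candidate = node_count in DISAGG_CANDIDATE_SCALES.get(profile, frozenset())
    return "agg", disagg_is_candidate


print(plan_topology("practical_32k_8k", 8))
print(plan_topology("practical_64k_8k", 8))
print(plan_topology("stress_256k_32k", 8))
```

Returning `agg` as the default in every supported case keeps the lookup aligned with the recommendation, with disagg surfaced only as an option to evaluate.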