diff --git a/docs/k2-aic-exhaustive-study/README.md b/docs/k2-aic-exhaustive-study/README.md new file mode 100644 index 000000000000..46db133a2481 --- /dev/null +++ b/docs/k2-aic-exhaustive-study/README.md @@ -0,0 +1,25 @@ +# Epic 122: RL360 Global Scaling + +This study package is the self-contained documentation artifact for RL360 epic `#122`. +It stays inside `docs/k2-aic-exhaustive-study` and does not depend on `scripts/analysis/*`. + +## Scope + +- data and tables: [sections/122-data-and-tables.md](sections/122-data-and-tables.md) +- figures: [sections/122-figures.md](sections/122-figures.md) +- narrative: [sections/122-narrative.md](sections/122-narrative.md) + +## Provenance + +The artifacts are derived from the local AIConfigurator catalogs captured on `2026-04-23`: + +- `/mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered` +- `/mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256` + +The first catalog covers `practical_32k_8k`, `practical_64k_8k`, and `stress_128k_16k`. +The retry catalog records the all-scale `stress_256k_32k` no-results boundary. + +## High-Level Result + +The agg topology is the global default for this epic. +Disagg has a narrow mid-scale win region, but it wins cluster throughput in only `4` of `27` supported head-to-head comparisons and the `256K` stress profile has no viable agg or disagg candidate at any tested scale. 
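The `4` of `27` figure in the high-level result can be cross-checked against the head-to-head table. The following is a minimal sketch, not part of the study's tooling: the ratios are copied by hand from the `disagg_to_agg_cluster_ratio` column of `data/122_topology_head_to_head.tsv`, and the names `CLUSTER_RATIOS` and `count_disagg_wins` are hypothetical. It assumes a disagg win means a ratio strictly above `1.0`.

```python
# Disagg-to-agg cluster-throughput ratios, copied by hand from
# data/122_topology_head_to_head.tsv, ordered 2n through 512n.
# CLUSTER_RATIOS and count_disagg_wins are illustrative names only.
CLUSTER_RATIOS = {
    "practical_32k_8k": [0.813, 1.219, 1.422, 1.320, 0.660, 0.330, 0.165, 0.083, 0.041],
    "practical_64k_8k": [0.497, 0.746, 0.870, 0.621, 0.310, 0.155, 0.078, 0.039, 0.019],
    "stress_128k_16k": [0.604, 0.906, 1.057, 0.911, 0.456, 0.228, 0.114, 0.057, 0.028],
}


def count_disagg_wins(ratios_by_profile):
    """Return (wins, comparisons), assuming a win is a ratio strictly above 1.0."""
    wins = 0
    comparisons = 0
    for ratios in ratios_by_profile.values():
        comparisons += len(ratios)
        wins += sum(1 for ratio in ratios if ratio > 1.0)
    return wins, comparisons


wins, comparisons = count_disagg_wins(CLUSTER_RATIOS)
summary = f"{wins} of {comparisons} ({wins / comparisons:.1%})"
print(summary)
```

Passing a single-profile dictionary to the same function gives the per-profile counts reported in `data/122_topology_win_rates.tsv`.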
diff --git a/docs/k2-aic-exhaustive-study/data/122_global_scaling_points.tsv b/docs/k2-aic-exhaustive-study/data/122_global_scaling_points.tsv new file mode 100644 index 000000000000..b13f87915b9f --- /dev/null +++ b/docs/k2-aic-exhaustive-study/data/122_global_scaling_points.tsv @@ -0,0 +1,81 @@ +workload_profile workload_tier context_budget node_count topology status total_gpus selected_total_gpus selected_gpu_fraction cluster_tokens_s tokens_s_user tokens_s_gpu_selected tokens_s_gpu_allocated request_rate concurrency scale_vs_topology_baseline scale_vs_agg_1n parallelism prefill_parallelism decode_parallelism script_shape source_run +practical_32k_8k practical_rl 40K 1 agg candidate 8 8 1.000 732.727 31.302 91.591 91.591 0.089 24.000 1.000 1.000 tp1pp1dp8etp1ep8 cluster_spanning 20260423T221500Z_rl_pow2_512_unfiltered +practical_32k_8k practical_rl 40K 2 agg candidate 16 16 1.000 1465.454 31.302 91.591 91.591 0.178 48.000 2.000 2.000 tp1pp1dp8etp1ep8 cluster_spanning 20260423T221500Z_rl_pow2_512_unfiltered +practical_32k_8k practical_rl 40K 4 agg candidate 32 32 1.000 2930.908 31.302 91.591 91.591 0.356 96.000 4.000 4.000 tp1pp1dp8etp1ep8 cluster_spanning 20260423T221500Z_rl_pow2_512_unfiltered +practical_32k_8k practical_rl 40K 8 agg candidate 64 64 1.000 5861.816 31.302 91.591 91.591 0.712 192.000 8.000 8.000 tp1pp1dp8etp1ep8 cluster_spanning 20260423T221500Z_rl_pow2_512_unfiltered +practical_32k_8k practical_rl 40K 16 agg candidate 128 128 1.000 11723.632 31.302 91.591 91.591 1.424 384.000 16.000 16.000 tp1pp1dp8etp1ep8 cluster_spanning 20260423T221500Z_rl_pow2_512_unfiltered +practical_32k_8k practical_rl 40K 32 agg candidate 256 256 1.000 23447.264 31.302 91.591 91.591 2.848 768.000 32.000 32.000 tp1pp1dp8etp1ep8 cluster_spanning 20260423T221500Z_rl_pow2_512_unfiltered +practical_32k_8k practical_rl 40K 64 agg candidate 512 512 1.000 46894.528 31.302 91.591 91.591 5.696 1536.000 64.000 64.000 tp1pp1dp8etp1ep8 cluster_spanning 
20260423T221500Z_rl_pow2_512_unfiltered +practical_32k_8k practical_rl 40K 128 agg candidate 1024 1024 1.000 93789.056 31.302 91.591 91.591 11.392 3072.000 128.000 128.000 tp1pp1dp8etp1ep8 cluster_spanning 20260423T221500Z_rl_pow2_512_unfiltered +practical_32k_8k practical_rl 40K 256 agg candidate 2048 2048 1.000 187578.112 31.302 91.591 91.591 22.784 6144.000 256.000 256.000 tp1pp1dp8etp1ep8 cluster_spanning 20260423T221500Z_rl_pow2_512_unfiltered +practical_32k_8k practical_rl 40K 512 agg candidate 4096 4096 1.000 375156.224 31.302 91.591 91.591 45.568 12288.000 512.000 512.000 tp1pp1dp8etp1ep8 cluster_spanning 20260423T221500Z_rl_pow2_512_unfiltered +practical_32k_8k practical_rl 40K 1 disagg no_results 8 0 0.000 no_results 20260423T221500Z_rl_pow2_512_unfiltered +practical_32k_8k practical_rl 40K 2 disagg candidate 16 16 1.000 1190.789 26.884 74.424 74.424 0.145 48.000 1.000 1.625 prefill tp1pp1dp8etp1ep8 | decode tp1pp1dp8etp1ep8 tp1pp1dp8etp1ep8 tp1pp1dp8etp1ep8 cross_node 20260423T221500Z_rl_pow2_512_unfiltered +practical_32k_8k practical_rl 40K 4 disagg candidate 32 32 1.000 3572.367 26.884 111.636 111.636 0.436 144.000 3.000 4.875 prefill tp1pp1dp8etp1ep8 | decode tp1pp1dp8etp1ep8 tp1pp1dp8etp1ep8 tp1pp1dp8etp1ep8 cross_node 20260423T221500Z_rl_pow2_512_unfiltered +practical_32k_8k practical_rl 40K 8 disagg candidate 64 64 1.000 8335.524 26.884 130.243 130.243 1.018 336.000 7.000 11.376 prefill tp1pp1dp8etp1ep8 | decode tp1pp1dp8etp1ep8 tp1pp1dp8etp1ep8 tp1pp1dp8etp1ep8 cross_node 20260423T221500Z_rl_pow2_512_unfiltered +practical_32k_8k practical_rl 40K 16 disagg candidate 128 112 0.875 15480.259 26.884 138.217 120.940 1.890 624.000 13.000 21.127 prefill tp1pp1dp8etp1ep8 | decode tp1pp1dp8etp1ep8 tp1pp1dp8etp1ep8 tp1pp1dp8etp1ep8 cross_node 20260423T221500Z_rl_pow2_512_unfiltered +practical_32k_8k practical_rl 40K 32 disagg candidate 256 224 0.875 15480.259 26.884 69.108 60.470 1.890 624.000 13.000 21.127 prefill tp1pp1dp8etp1ep8 | decode tp1pp1dp8etp1ep8 
tp1pp1dp8etp1ep8 tp1pp1dp8etp1ep8 cross_node 20260423T221500Z_rl_pow2_512_unfiltered +practical_32k_8k practical_rl 40K 64 disagg candidate 512 448 0.875 15480.259 26.884 34.554 30.235 1.890 624.000 13.000 21.127 prefill tp1pp1dp8etp1ep8 | decode tp1pp1dp8etp1ep8 tp1pp1dp8etp1ep8 tp1pp1dp8etp1ep8 cross_node 20260423T221500Z_rl_pow2_512_unfiltered +practical_32k_8k practical_rl 40K 128 disagg candidate 1024 1008 0.984 15480.259 26.884 15.357 15.117 1.890 624.000 13.000 21.127 prefill tp1pp1dp8etp1ep8 | decode tp1pp1dp8etp1ep8 tp1pp1dp8etp1ep8 tp1pp1dp8etp1ep8 cross_node 20260423T221500Z_rl_pow2_512_unfiltered +practical_32k_8k practical_rl 40K 256 disagg candidate 2048 2016 0.984 15480.259 26.884 7.679 7.559 1.890 624.000 13.000 21.127 prefill tp1pp1dp8etp1ep8 | decode tp1pp1dp8etp1ep8 tp1pp1dp8etp1ep8 tp1pp1dp8etp1ep8 cross_node 20260423T221500Z_rl_pow2_512_unfiltered +practical_32k_8k practical_rl 40K 512 disagg candidate 4096 4032 0.984 15480.259 26.884 3.839 3.779 1.890 624.000 13.000 21.127 prefill tp1pp1dp8etp1ep8 | decode tp1pp1dp8etp1ep8 tp1pp1dp8etp1ep8 tp1pp1dp8etp1ep8 cross_node 20260423T221500Z_rl_pow2_512_unfiltered +practical_64k_8k practical_rl 72K 1 agg candidate 8 8 1.000 507.713 88.240 63.464 63.464 0.062 6.000 1.000 1.000 tp8pp1dp1etp1ep8 node_local 20260423T221500Z_rl_pow2_512_unfiltered +practical_64k_8k practical_rl 72K 2 agg candidate 16 16 1.000 1015.426 88.240 63.464 63.464 0.124 12.000 2.000 2.000 tp8pp1dp1etp1ep8 node_local 20260423T221500Z_rl_pow2_512_unfiltered +practical_64k_8k practical_rl 72K 4 agg candidate 32 32 1.000 2030.852 88.240 63.464 63.464 0.248 24.000 4.000 4.000 tp8pp1dp1etp1ep8 node_local 20260423T221500Z_rl_pow2_512_unfiltered +practical_64k_8k practical_rl 72K 8 agg candidate 64 64 1.000 4061.704 88.240 63.464 63.464 0.496 48.000 8.000 8.000 tp8pp1dp1etp1ep8 node_local 20260423T221500Z_rl_pow2_512_unfiltered +practical_64k_8k practical_rl 72K 16 agg candidate 128 128 1.000 8123.408 88.240 63.464 63.464 0.992 96.000 
16.000 16.000 tp8pp1dp1etp1ep8 node_local 20260423T221500Z_rl_pow2_512_unfiltered +practical_64k_8k practical_rl 72K 32 agg candidate 256 256 1.000 16246.816 88.240 63.464 63.464 1.984 192.000 32.000 32.000 tp8pp1dp1etp1ep8 node_local 20260423T221500Z_rl_pow2_512_unfiltered +practical_64k_8k practical_rl 72K 64 agg candidate 512 512 1.000 32493.632 88.240 63.464 63.464 3.968 384.000 64.000 64.000 tp8pp1dp1etp1ep8 node_local 20260423T221500Z_rl_pow2_512_unfiltered +practical_64k_8k practical_rl 72K 128 agg candidate 1024 1024 1.000 64987.264 88.240 63.464 63.464 7.936 768.000 128.000 128.000 tp8pp1dp1etp1ep8 node_local 20260423T221500Z_rl_pow2_512_unfiltered +practical_64k_8k practical_rl 72K 256 agg candidate 2048 2048 1.000 129974.528 88.240 63.464 63.464 15.872 1536.000 256.000 256.000 tp8pp1dp1etp1ep8 node_local 20260423T221500Z_rl_pow2_512_unfiltered +practical_64k_8k practical_rl 72K 512 agg candidate 4096 4096 1.000 259949.056 88.240 63.464 63.464 31.744 3072.000 512.000 512.000 tp8pp1dp1etp1ep8 node_local 20260423T221500Z_rl_pow2_512_unfiltered +practical_64k_8k practical_rl 72K 1 disagg no_results 8 0 0.000 no_results 20260423T221500Z_rl_pow2_512_unfiltered +practical_64k_8k practical_rl 72K 2 disagg candidate 16 16 1.000 504.955 61.356 31.560 31.560 0.062 9.000 1.000 0.995 prefill tp8pp1dp1etp1ep8 | decode tp8pp1dp1etp1ep8 tp8pp1dp1etp1ep8 tp8pp1dp1etp1ep8 node_local 20260423T221500Z_rl_pow2_512_unfiltered +practical_64k_8k practical_rl 72K 4 disagg candidate 32 32 1.000 1514.865 61.356 47.340 47.340 0.185 27.000 3.000 2.984 prefill tp8pp1dp1etp1ep8 | decode tp8pp1dp1etp1ep8 tp8pp1dp1etp1ep8 tp8pp1dp1etp1ep8 node_local 20260423T221500Z_rl_pow2_512_unfiltered +practical_64k_8k practical_rl 72K 8 disagg candidate 64 64 1.000 3534.684 61.356 55.229 55.229 0.431 63.000 7.000 6.962 prefill tp8pp1dp1etp1ep8 | decode tp8pp1dp1etp1ep8 tp8pp1dp1etp1ep8 tp8pp1dp1etp1ep8 node_local 20260423T221500Z_rl_pow2_512_unfiltered +practical_64k_8k practical_rl 72K 16 disagg 
candidate 128 88 0.688 5042.995 61.356 57.307 39.398 0.616 90.000 9.987 9.933 prefill tp8pp1dp1etp1ep8 | decode tp8pp1dp1etp1ep8 tp8pp1dp1etp1ep8 tp8pp1dp1etp1ep8 node_local 20260423T221500Z_rl_pow2_512_unfiltered +practical_64k_8k practical_rl 72K 32 disagg candidate 256 176 0.688 5042.995 61.356 28.653 19.699 0.616 90.000 9.987 9.933 prefill tp8pp1dp1etp1ep8 | decode tp8pp1dp1etp1ep8 tp8pp1dp1etp1ep8 tp8pp1dp1etp1ep8 node_local 20260423T221500Z_rl_pow2_512_unfiltered +practical_64k_8k practical_rl 72K 64 disagg candidate 512 440 0.859 5042.995 61.356 11.461 9.850 0.616 90.000 9.987 9.933 prefill tp8pp1dp1etp1ep8 | decode tp8pp1dp1etp1ep8 tp8pp1dp1etp1ep8 tp8pp1dp1etp1ep8 node_local 20260423T221500Z_rl_pow2_512_unfiltered +practical_64k_8k practical_rl 72K 128 disagg candidate 1024 968 0.945 5042.995 61.356 5.210 4.925 0.616 90.000 9.987 9.933 prefill tp8pp1dp1etp1ep8 | decode tp8pp1dp1etp1ep8 tp8pp1dp1etp1ep8 tp8pp1dp1etp1ep8 node_local 20260423T221500Z_rl_pow2_512_unfiltered +practical_64k_8k practical_rl 72K 256 disagg candidate 2048 2024 0.988 5042.995 61.356 2.492 2.462 0.616 90.000 9.987 9.933 prefill tp8pp1dp1etp1ep8 | decode tp8pp1dp1etp1ep8 tp8pp1dp1etp1ep8 tp8pp1dp1etp1ep8 node_local 20260423T221500Z_rl_pow2_512_unfiltered +practical_64k_8k practical_rl 72K 512 disagg candidate 4096 4048 0.988 5042.995 61.356 1.246 1.231 0.616 90.000 9.987 9.933 prefill tp8pp1dp1etp1ep8 | decode tp8pp1dp1etp1ep8 tp8pp1dp1etp1ep8 tp8pp1dp1etp1ep8 node_local 20260423T221500Z_rl_pow2_512_unfiltered +stress_128k_16k stress_rl 128K 1 agg candidate 8 8 1.000 349.452 179.730 43.681 43.681 0.021 2.000 1.000 1.000 tp8pp1dp1etp1ep8 node_local 20260423T221500Z_rl_pow2_512_unfiltered +stress_128k_16k stress_rl 128K 2 agg candidate 16 16 1.000 698.904 179.730 43.681 43.681 0.042 4.000 2.000 2.000 tp8pp1dp1etp1ep8 node_local 20260423T221500Z_rl_pow2_512_unfiltered +stress_128k_16k stress_rl 128K 4 agg candidate 32 32 1.000 1397.808 179.730 43.681 43.681 0.084 8.000 4.000 4.000 
tp8pp1dp1etp1ep8 node_local 20260423T221500Z_rl_pow2_512_unfiltered +stress_128k_16k stress_rl 128K 8 agg candidate 64 64 1.000 2795.616 179.730 43.681 43.681 0.168 16.000 8.000 8.000 tp8pp1dp1etp1ep8 node_local 20260423T221500Z_rl_pow2_512_unfiltered +stress_128k_16k stress_rl 128K 16 agg candidate 128 128 1.000 5591.232 179.730 43.681 43.681 0.336 32.000 16.000 16.000 tp8pp1dp1etp1ep8 node_local 20260423T221500Z_rl_pow2_512_unfiltered +stress_128k_16k stress_rl 128K 32 agg candidate 256 256 1.000 11182.464 179.730 43.681 43.681 0.672 64.000 32.000 32.000 tp8pp1dp1etp1ep8 node_local 20260423T221500Z_rl_pow2_512_unfiltered +stress_128k_16k stress_rl 128K 64 agg candidate 512 512 1.000 22364.928 179.730 43.681 43.681 1.344 128.000 64.000 64.000 tp8pp1dp1etp1ep8 node_local 20260423T221500Z_rl_pow2_512_unfiltered +stress_128k_16k stress_rl 128K 128 agg candidate 1024 1024 1.000 44729.856 179.730 43.681 43.681 2.688 256.000 128.000 128.000 tp8pp1dp1etp1ep8 node_local 20260423T221500Z_rl_pow2_512_unfiltered +stress_128k_16k stress_rl 128K 256 agg candidate 2048 2048 1.000 89459.712 179.730 43.681 43.681 5.376 512.000 256.000 256.000 tp8pp1dp1etp1ep8 node_local 20260423T221500Z_rl_pow2_512_unfiltered +stress_128k_16k stress_rl 128K 512 agg candidate 4096 4096 1.000 178919.424 179.730 43.681 43.681 10.752 1024.000 512.000 512.000 tp8pp1dp1etp1ep8 node_local 20260423T221500Z_rl_pow2_512_unfiltered +stress_128k_16k stress_rl 128K 1 disagg no_results 8 0 0.000 no_results 20260423T221500Z_rl_pow2_512_unfiltered +stress_128k_16k stress_rl 128K 2 disagg candidate 16 16 1.000 422.052 90.246 26.378 26.378 0.026 5.000 1.000 1.208 prefill tp8pp1dp1etp1ep8 | decode tp8pp1dp1etp1ep8 tp8pp1dp1etp1ep8 tp8pp1dp1etp1ep8 node_local 20260423T221500Z_rl_pow2_512_unfiltered +stress_128k_16k stress_rl 128K 4 disagg candidate 32 32 1.000 1266.156 90.246 39.567 39.567 0.077 15.000 3.000 3.623 prefill tp8pp1dp1etp1ep8 | decode tp8pp1dp1etp1ep8 tp8pp1dp1etp1ep8 tp8pp1dp1etp1ep8 node_local 
20260423T221500Z_rl_pow2_512_unfiltered +stress_128k_16k stress_rl 128K 8 disagg candidate 64 64 1.000 2954.363 90.246 46.162 46.162 0.180 35.000 7.000 8.454 prefill tp8pp1dp1etp1ep8 | decode tp8pp1dp1etp1ep8 tp8pp1dp1etp1ep8 tp8pp1dp1etp1ep8 node_local 20260423T221500Z_rl_pow2_512_unfiltered +stress_128k_16k stress_rl 128K 16 disagg candidate 128 112 0.875 5094.769 105.023 45.489 39.803 0.311 52.000 12.071 14.579 prefill tp8pp1dp1etp1ep8 | decode tp8pp1dp1etp1ep8 tp8pp1dp1etp1ep8 tp8pp1dp1etp1ep8 node_local 20260423T221500Z_rl_pow2_512_unfiltered +stress_128k_16k stress_rl 128K 32 disagg candidate 256 224 0.875 5094.769 105.023 22.745 19.901 0.311 52.000 12.071 14.579 prefill tp8pp1dp1etp1ep8 | decode tp8pp1dp1etp1ep8 tp8pp1dp1etp1ep8 tp8pp1dp1etp1ep8 node_local 20260423T221500Z_rl_pow2_512_unfiltered +stress_128k_16k stress_rl 128K 64 disagg candidate 512 448 0.875 5094.769 105.023 11.372 9.951 0.311 52.000 12.071 14.579 prefill tp8pp1dp1etp1ep8 | decode tp8pp1dp1etp1ep8 tp8pp1dp1etp1ep8 tp8pp1dp1etp1ep8 node_local 20260423T221500Z_rl_pow2_512_unfiltered +stress_128k_16k stress_rl 128K 128 disagg candidate 1024 1008 0.984 5094.769 105.023 5.054 4.975 0.311 52.000 12.071 14.579 prefill tp8pp1dp1etp1ep8 | decode tp8pp1dp1etp1ep8 tp8pp1dp1etp1ep8 tp8pp1dp1etp1ep8 node_local 20260423T221500Z_rl_pow2_512_unfiltered +stress_128k_16k stress_rl 128K 256 disagg candidate 2048 1976 0.965 5064.622 90.246 2.563 2.473 0.309 60.000 12.000 14.493 prefill tp8pp1dp1etp1ep8 | decode tp8pp1dp1etp1ep8 tp8pp1dp1etp1ep8 tp8pp1dp1etp1ep8 node_local 20260423T221500Z_rl_pow2_512_unfiltered +stress_128k_16k stress_rl 128K 512 disagg candidate 4096 4056 0.990 5064.622 90.246 1.249 1.236 0.309 60.000 12.000 14.493 prefill tp8pp1dp1etp1ep8 | decode tp8pp1dp1etp1ep8 tp8pp1dp1etp1ep8 tp8pp1dp1etp1ep8 node_local 20260423T221500Z_rl_pow2_512_unfiltered +stress_256k_32k stress_rl 256K 1 agg no_results 8 0 0.000 no_results 20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256 +stress_256k_32k 
stress_rl 256K 2 agg no_results 16 0 0.000 no_results 20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256 +stress_256k_32k stress_rl 256K 4 agg no_results 32 0 0.000 no_results 20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256 +stress_256k_32k stress_rl 256K 8 agg no_results 64 0 0.000 no_results 20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256 +stress_256k_32k stress_rl 256K 16 agg no_results 128 0 0.000 no_results 20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256 +stress_256k_32k stress_rl 256K 32 agg no_results 256 0 0.000 no_results 20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256 +stress_256k_32k stress_rl 256K 64 agg no_results 512 0 0.000 no_results 20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256 +stress_256k_32k stress_rl 256K 128 agg no_results 1024 0 0.000 no_results 20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256 +stress_256k_32k stress_rl 256K 256 agg no_results 2048 0 0.000 no_results 20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256 +stress_256k_32k stress_rl 256K 512 agg no_results 4096 0 0.000 no_results 20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256 +stress_256k_32k stress_rl 256K 1 disagg no_results 8 0 0.000 no_results 20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256 +stress_256k_32k stress_rl 256K 2 disagg no_results 16 0 0.000 no_results 20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256 +stress_256k_32k stress_rl 256K 4 disagg no_results 32 0 0.000 no_results 20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256 +stress_256k_32k stress_rl 256K 8 disagg no_results 64 0 0.000 no_results 20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256 +stress_256k_32k stress_rl 256K 16 disagg no_results 128 0 0.000 no_results 20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256 +stress_256k_32k stress_rl 256K 32 disagg no_results 256 0 0.000 no_results 20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256 +stress_256k_32k stress_rl 256K 64 disagg no_results 512 0 0.000 
no_results 20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256 +stress_256k_32k stress_rl 256K 128 disagg no_results 1024 0 0.000 no_results 20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256 +stress_256k_32k stress_rl 256K 256 disagg no_results 2048 0 0.000 no_results 20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256 +stress_256k_32k stress_rl 256K 512 disagg no_results 4096 0 0.000 no_results 20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256 diff --git a/docs/k2-aic-exhaustive-study/data/122_scaling_summary.tsv b/docs/k2-aic-exhaustive-study/data/122_scaling_summary.tsv new file mode 100644 index 000000000000..f4e34430d14c --- /dev/null +++ b/docs/k2-aic-exhaustive-study/data/122_scaling_summary.tsv @@ -0,0 +1,7 @@ +workload_profile workload_tier topology baseline_node_count baseline_cluster_tokens_s peak_node_count peak_cluster_tokens_s peak_vs_baseline_ratio final_node_count final_cluster_tokens_s final_vs_baseline_ratio final_vs_peak_ratio peak_selected_gpu_fraction final_selected_gpu_fraction scale_vs_agg_1n_at_peak +practical_32k_8k practical_rl agg 1 732.727 512 375156.224 512.000 512 375156.224 512.000 1.000 1.000 1.000 512.000 +practical_32k_8k practical_rl disagg 2 1190.789 16 15480.259 13.000 512 15480.259 13.000 1.000 0.875 0.984 21.127 +practical_64k_8k practical_rl agg 1 507.713 512 259949.056 512.000 512 259949.056 512.000 1.000 1.000 1.000 512.000 +practical_64k_8k practical_rl disagg 2 504.955 16 5042.995 9.987 512 5042.995 9.987 1.000 0.688 0.988 9.933 +stress_128k_16k stress_rl agg 1 349.452 512 178919.424 512.000 512 178919.424 512.000 1.000 1.000 1.000 512.000 +stress_128k_16k stress_rl disagg 2 422.052 16 5094.769 12.071 512 5064.622 12.000 0.994 0.875 0.990 14.579 diff --git a/docs/k2-aic-exhaustive-study/data/122_stress256_boundary.tsv b/docs/k2-aic-exhaustive-study/data/122_stress256_boundary.tsv new file mode 100644 index 000000000000..8a7bec2d9e49 --- /dev/null +++ 
b/docs/k2-aic-exhaustive-study/data/122_stress256_boundary.tsv @@ -0,0 +1,11 @@ +workload_profile workload_tier context_budget node_count total_gpus agg_status disagg_status agg_selected_total_gpus disagg_selected_total_gpus agg_manifest_path disagg_manifest_path +stress_256k_32k stress_rl 256K 1 8 no_results no_results 0 0 /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/1n/agg_manifest.json /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/1n/disagg_manifest.json +stress_256k_32k stress_rl 256K 2 16 no_results no_results 0 0 /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/2n/agg_manifest.json /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/2n/disagg_manifest.json +stress_256k_32k stress_rl 256K 4 32 no_results no_results 0 0 /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/4n/agg_manifest.json /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/4n/disagg_manifest.json +stress_256k_32k stress_rl 256K 8 64 no_results no_results 0 0 /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/8n/agg_manifest.json /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/8n/disagg_manifest.json +stress_256k_32k stress_rl 256K 16 128 no_results no_results 0 0 
/mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/16n/agg_manifest.json /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/16n/disagg_manifest.json +stress_256k_32k stress_rl 256K 32 256 no_results no_results 0 0 /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/32n/agg_manifest.json /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/32n/disagg_manifest.json +stress_256k_32k stress_rl 256K 64 512 no_results no_results 0 0 /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/64n/agg_manifest.json /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/64n/disagg_manifest.json +stress_256k_32k stress_rl 256K 128 1024 no_results no_results 0 0 /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/128n/agg_manifest.json /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/128n/disagg_manifest.json +stress_256k_32k stress_rl 256K 256 2048 no_results no_results 0 0 /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/256n/agg_manifest.json /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/256n/disagg_manifest.json +stress_256k_32k stress_rl 256K 512 4096 no_results no_results 0 0 
/mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/512n/agg_manifest.json /mnt/weka/shrd/k2m/micah.villmow/aiconfigurator/qwen35_397b_a17b/20260423T221500Z_rl_pow2_512_unfiltered_retry_stress256/stress_256k_32k/512n/disagg_manifest.json diff --git a/docs/k2-aic-exhaustive-study/data/122_topology_head_to_head.tsv b/docs/k2-aic-exhaustive-study/data/122_topology_head_to_head.tsv new file mode 100644 index 000000000000..4b6b95b3b935 --- /dev/null +++ b/docs/k2-aic-exhaustive-study/data/122_topology_head_to_head.tsv @@ -0,0 +1,28 @@ +workload_profile workload_tier context_budget node_count total_gpus agg_cluster_tokens_s disagg_cluster_tokens_s disagg_to_agg_cluster_ratio agg_tokens_s_user disagg_tokens_s_user disagg_to_agg_user_ratio agg_tokens_s_gpu_allocated disagg_tokens_s_gpu_allocated disagg_to_agg_allocated_gpu_ratio disagg_tokens_s_gpu_selected disagg_to_agg_selected_gpu_ratio disagg_selected_gpu_fraction winner_cluster_tokens winner_user_throughput winner_allocated_gpu_efficiency winner_selected_gpu_efficiency disagg_script_shape +practical_32k_8k practical_rl 40K 2 16 1465.454 1190.789 0.813 31.302 26.884 0.859 91.591 74.424 0.813 74.424 0.813 1.000 agg agg agg agg cross_node +practical_32k_8k practical_rl 40K 4 32 2930.908 3572.367 1.219 31.302 26.884 0.859 91.591 111.636 1.219 111.636 1.219 1.000 disagg agg disagg disagg cross_node +practical_32k_8k practical_rl 40K 8 64 5861.816 8335.524 1.422 31.302 26.884 0.859 91.591 130.243 1.422 130.243 1.422 1.000 disagg agg disagg disagg cross_node +practical_32k_8k practical_rl 40K 16 128 11723.632 15480.259 1.320 31.302 26.884 0.859 91.591 120.940 1.320 138.217 1.509 0.875 disagg agg disagg disagg cross_node +practical_32k_8k practical_rl 40K 32 256 23447.264 15480.259 0.660 31.302 26.884 0.859 91.591 60.470 0.660 69.108 0.755 0.875 agg agg agg agg cross_node +practical_32k_8k practical_rl 40K 64 512 46894.528 15480.259 0.330 
31.302 26.884 0.859 91.591 30.235 0.330 34.554 0.377 0.875 agg agg agg agg cross_node +practical_32k_8k practical_rl 40K 128 1024 93789.056 15480.259 0.165 31.302 26.884 0.859 91.591 15.117 0.165 15.357 0.168 0.984 agg agg agg agg cross_node +practical_32k_8k practical_rl 40K 256 2048 187578.112 15480.259 0.083 31.302 26.884 0.859 91.591 7.559 0.083 7.679 0.084 0.984 agg agg agg agg cross_node +practical_32k_8k practical_rl 40K 512 4096 375156.224 15480.259 0.041 31.302 26.884 0.859 91.591 3.779 0.041 3.839 0.042 0.984 agg agg agg agg cross_node +practical_64k_8k practical_rl 72K 2 16 1015.426 504.955 0.497 88.240 61.356 0.695 63.464 31.560 0.497 31.560 0.497 1.000 agg agg agg agg node_local +practical_64k_8k practical_rl 72K 4 32 2030.852 1514.865 0.746 88.240 61.356 0.695 63.464 47.340 0.746 47.340 0.746 1.000 agg agg agg agg node_local +practical_64k_8k practical_rl 72K 8 64 4061.704 3534.684 0.870 88.240 61.356 0.695 63.464 55.229 0.870 55.229 0.870 1.000 agg agg agg agg node_local +practical_64k_8k practical_rl 72K 16 128 8123.408 5042.995 0.621 88.240 61.356 0.695 63.464 39.398 0.621 57.307 0.903 0.688 agg agg agg agg node_local +practical_64k_8k practical_rl 72K 32 256 16246.816 5042.995 0.310 88.240 61.356 0.695 63.464 19.699 0.310 28.653 0.451 0.688 agg agg agg agg node_local +practical_64k_8k practical_rl 72K 64 512 32493.632 5042.995 0.155 88.240 61.356 0.695 63.464 9.850 0.155 11.461 0.181 0.859 agg agg agg agg node_local +practical_64k_8k practical_rl 72K 128 1024 64987.264 5042.995 0.078 88.240 61.356 0.695 63.464 4.925 0.078 5.210 0.082 0.945 agg agg agg agg node_local +practical_64k_8k practical_rl 72K 256 2048 129974.528 5042.995 0.039 88.240 61.356 0.695 63.464 2.462 0.039 2.492 0.039 0.988 agg agg agg agg node_local +practical_64k_8k practical_rl 72K 512 4096 259949.056 5042.995 0.019 88.240 61.356 0.695 63.464 1.231 0.019 1.246 0.020 0.988 agg agg agg agg node_local +stress_128k_16k stress_rl 128K 2 16 698.904 422.052 0.604 179.730 90.246 0.502 
43.681 26.378 0.604 26.378 0.604 1.000 agg agg agg agg node_local +stress_128k_16k stress_rl 128K 4 32 1397.808 1266.156 0.906 179.730 90.246 0.502 43.681 39.567 0.906 39.567 0.906 1.000 agg agg agg agg node_local +stress_128k_16k stress_rl 128K 8 64 2795.616 2954.363 1.057 179.730 90.246 0.502 43.681 46.162 1.057 46.162 1.057 1.000 disagg agg disagg disagg node_local +stress_128k_16k stress_rl 128K 16 128 5591.232 5094.769 0.911 179.730 105.023 0.584 43.681 39.803 0.911 45.489 1.041 0.875 agg agg agg disagg node_local +stress_128k_16k stress_rl 128K 32 256 11182.464 5094.769 0.456 179.730 105.023 0.584 43.681 19.901 0.456 22.745 0.521 0.875 agg agg agg agg node_local +stress_128k_16k stress_rl 128K 64 512 22364.928 5094.769 0.228 179.730 105.023 0.584 43.681 9.951 0.228 11.372 0.260 0.875 agg agg agg agg node_local +stress_128k_16k stress_rl 128K 128 1024 44729.856 5094.769 0.114 179.730 105.023 0.584 43.681 4.975 0.114 5.054 0.116 0.984 agg agg agg agg node_local +stress_128k_16k stress_rl 128K 256 2048 89459.712 5064.622 0.057 179.730 90.246 0.502 43.681 2.473 0.057 2.563 0.059 0.965 agg agg agg agg node_local +stress_128k_16k stress_rl 128K 512 4096 178919.424 5064.622 0.028 179.730 90.246 0.502 43.681 1.236 0.028 1.249 0.029 0.990 agg agg agg agg node_local diff --git a/docs/k2-aic-exhaustive-study/data/122_topology_win_rates.tsv b/docs/k2-aic-exhaustive-study/data/122_topology_win_rates.tsv new file mode 100644 index 000000000000..9704d231da47 --- /dev/null +++ b/docs/k2-aic-exhaustive-study/data/122_topology_win_rates.tsv @@ -0,0 +1,7 @@ +scope_type scope comparisons disagg_cluster_wins agg_cluster_wins disagg_cluster_win_rate disagg_allocated_gpu_efficiency_wins disagg_allocated_gpu_efficiency_win_rate disagg_selected_gpu_efficiency_wins disagg_selected_gpu_efficiency_win_rate geometric_mean_cluster_ratio median_cluster_ratio max_cluster_ratio node_count_at_max_ratio min_cluster_ratio node_count_at_min_ratio median_disagg_selected_gpu_fraction 
+workload_profile practical_32k_8k 9 3 6 0.333 3 0.333 3 0.333 0.394 0.660 1.422 8 0.041 512 0.984 +workload_profile practical_64k_8k 9 0 9 0.000 0 0.000 0 0.000 0.202 0.310 0.870 8 0.019 512 0.988 +workload_profile stress_128k_16k 9 1 8 0.111 1 0.111 2 0.222 0.278 0.456 1.057 8 0.028 512 0.984 +workload_tier practical_rl 18 3 15 0.167 3 0.167 3 0.167 0.282 0.413 1.422 8 0.019 512 0.984 +workload_tier stress_rl 9 1 8 0.111 1 0.111 2 0.222 0.278 0.456 1.057 8 0.028 512 0.984 +overall all_rl_profiles 27 4 23 0.148 4 0.148 5 0.185 0.281 0.456 1.422 8 0.019 512 0.984 diff --git a/docs/k2-aic-exhaustive-study/figures/122_cluster_capacity_scaling.png b/docs/k2-aic-exhaustive-study/figures/122_cluster_capacity_scaling.png new file mode 100644 index 000000000000..a88a01f447aa Binary files /dev/null and b/docs/k2-aic-exhaustive-study/figures/122_cluster_capacity_scaling.png differ diff --git a/docs/k2-aic-exhaustive-study/figures/122_disagg_efficiency_pointcloud.png b/docs/k2-aic-exhaustive-study/figures/122_disagg_efficiency_pointcloud.png new file mode 100644 index 000000000000..cc43f00e3519 Binary files /dev/null and b/docs/k2-aic-exhaustive-study/figures/122_disagg_efficiency_pointcloud.png differ diff --git a/docs/k2-aic-exhaustive-study/figures/122_disagg_to_agg_ratio_heatmap.png b/docs/k2-aic-exhaustive-study/figures/122_disagg_to_agg_ratio_heatmap.png new file mode 100644 index 000000000000..349bd34c3829 Binary files /dev/null and b/docs/k2-aic-exhaustive-study/figures/122_disagg_to_agg_ratio_heatmap.png differ diff --git a/docs/k2-aic-exhaustive-study/sections/122-data-and-tables.md b/docs/k2-aic-exhaustive-study/sections/122-data-and-tables.md new file mode 100644 index 000000000000..2e87ba19dd89 --- /dev/null +++ b/docs/k2-aic-exhaustive-study/sections/122-data-and-tables.md @@ -0,0 +1,37 @@ +# Epic 122 Data And Tables + +This section packages the derived tables for the RL360 global-scaling sweep. +All TSV artifacts are epic-scoped and live under `data/122_*`. 
+ +## Source Artifacts + +- `data/122_global_scaling_points.tsv`: one row per `workload_profile x node_count x topology`, including cluster throughput, per-user throughput, allocated-GPU efficiency, selected-GPU efficiency, and topology shape. +- `data/122_topology_head_to_head.tsv`: paired agg-vs-disagg comparisons for each scale where both topologies produced candidate manifests. +- `data/122_topology_win_rates.tsv`: win-rate and ratio rollups by workload profile, workload tier, and overall. +- `data/122_scaling_summary.tsv`: baseline, peak, and final scale summaries for each topology. +- `data/122_stress256_boundary.tsv`: the 256K stress boundary, where both topologies returned `no_results` at every tested scale. + +## Topology Win Rates + +| Scope | Comparisons | Disagg cluster wins | Disagg cluster win rate | Disagg selected-GPU efficiency win rate | Geometric mean cluster ratio | +| --- | ---: | ---: | ---: | ---: | ---: | +| `practical_32k_8k` | 9 | 3 | 33.3% | 33.3% | 0.394 | +| `practical_64k_8k` | 9 | 0 | 0.0% | 0.0% | 0.202 | +| `stress_128k_16k` | 9 | 1 | 11.1% | 22.2% | 0.278 | +| `overall` | 27 | 4 | 14.8% | 18.5% | 0.281 | + +## Peak Scaling Summary + +| Workload profile | Topology | Baseline node | Peak node | Peak cluster tokens/s | Peak vs baseline | Final vs peak | +| --- | --- | ---: | ---: | ---: | ---: | ---: | +| `practical_32k_8k` | `agg` | 1 | 512 | 375156.224 | 512.000x | 1.000x | +| `practical_32k_8k` | `disagg` | 2 | 16 | 15480.259 | 13.000x | 1.000x | +| `practical_64k_8k` | `agg` | 1 | 512 | 259949.056 | 512.000x | 1.000x | +| `practical_64k_8k` | `disagg` | 2 | 16 | 5042.995 | 9.987x | 1.000x | +| `stress_128k_16k` | `agg` | 1 | 512 | 178919.424 | 512.000x | 1.000x | +| `stress_128k_16k` | `disagg` | 2 | 16 | 5094.769 | 12.071x | 0.994x | + +## 256K Boundary + +`stress_256k_32k` is a clean all-scale boundary in this epic dataset. 
+`data/122_stress256_boundary.tsv` records `agg_status=no_results` and `disagg_status=no_results` for every tested scale from `1n` through `512n`.
diff --git a/docs/k2-aic-exhaustive-study/sections/122-figures.md b/docs/k2-aic-exhaustive-study/sections/122-figures.md
new file mode 100644
index 000000000000..a8eba6854e06
--- /dev/null
+++ b/docs/k2-aic-exhaustive-study/sections/122-figures.md
@@ -0,0 +1,34 @@
+# Epic 122 Figures
+
+This section collects the figure artifacts for the RL360 global-scaling epic.
+All image outputs are epic-scoped and live under `figures/122_*`.
+
+## Figure Inventory
+
+### Cluster Capacity Scaling
+
+![Epic 122 cluster capacity scaling](../figures/122_cluster_capacity_scaling.png)
+
+The agg path scales linearly because the selected `8`-GPU serving replica is repeated across nodes.
+The disagg path grows through `16n`, then flattens.
+
+### Disagg To Agg Ratio Heatmap
+
+![Epic 122 disagg to agg ratio heatmap](../figures/122_disagg_to_agg_ratio_heatmap.png)
+
+The heatmap highlights the narrow region where disagg wins on total cluster throughput:
+
+- `practical_32k_8k`: `4n`, `8n`, and `16n`
+- `stress_128k_16k`: `8n`
+- `practical_64k_8k`: no disagg cluster-throughput wins
+
+### Disagg Efficiency Point Cloud
+
+![Epic 122 disagg efficiency point cloud](../figures/122_disagg_efficiency_pointcloud.png)
+
+This point-cloud view is the required non-heatmap/non-line figure for the epic.
+It shows how disagg win margin relates to selected-GPU fraction:
+
+- points above `1.0` are disagg cluster-throughput wins
+- circle markers are disagg wins; `X` markers are agg wins
+- the `16n` points show where disagg can stay efficient while already leaving GPUs idle
diff --git a/docs/k2-aic-exhaustive-study/sections/122-narrative.md b/docs/k2-aic-exhaustive-study/sections/122-narrative.md
new file mode 100644
index 000000000000..ec3b32e80f22
--- /dev/null
+++ b/docs/k2-aic-exhaustive-study/sections/122-narrative.md
@@ -0,0 +1,34 @@
+# Epic 122 Narrative
+
+This narrative summarizes the RL360 global-scaling findings from the local AIConfigurator catalogs dated `2026-04-23`.
+
+## Findings
+
+1. The agg path is globally stable.
+   The selected agg layout is a repeated `8`-GPU serving replica, so cluster throughput scales linearly from `1n` through `512n` for every supported RL profile in this epic.
+
+2. Disagg wins only in a narrow mid-scale window.
+   Across the `27` supported agg-vs-disagg comparisons, disagg wins cluster throughput `4` times (`14.8%`): `practical_32k_8k` at `4n`, `8n`, and `16n`, plus `stress_128k_16k` at `8n`.
+
+3. The strongest disagg win is still localized.
+   The best head-to-head result is `practical_32k_8k` at `8n`, where disagg reaches a `1.422x` cluster-throughput ratio over agg.
+   The stress-tier win is much smaller: `stress_128k_16k` reaches only `1.057x` at `8n`.
+
+4. Practical `64K/8K` never flips to disagg.
+   `practical_64k_8k` has zero disagg cluster-throughput wins; its best ratio is `0.870x` at `8n`, and the geometric-mean ratio is `0.202x`.
+
+5. Disagg plateaus early.
+   For every supported RL profile, disagg peaks by `16n` and then flattens, while agg keeps scaling linearly.
+   The disagg peak-to-baseline ratios are `13.000x` for `practical_32k_8k`, `9.987x` for `practical_64k_8k`, and `12.071x` for `stress_128k_16k`.
+
+6. Selected-GPU efficiency is better than cluster wins suggest, but not enough to change the default.
+   Disagg wins selected-GPU efficiency `5` times (`18.5%`), including `stress_128k_16k` at `16n`, where it loses on cluster throughput (`0.911x`) but still beats agg on active-GPU efficiency (`1.041x`) because it leaves `12.5%` of the allocation idle.
+
+7. The `256K` stress tier is outside the current supported envelope.
+   `stress_256k_32k` returns `no_results` for both agg and disagg at every tested scale from `1n` through `512n`.
+
+## Recommendation
+
+Use agg as the default topology for global rollout planning in this epic.
+Treat disagg as a targeted option for `practical_32k_8k` at `4n` to `16n`, and as a one-point stress exception at `stress_128k_16k` `8n`.
+Do not plan around `stress_256k_32k` until the missing estimator and cache-coverage path is fixed upstream.
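
The rollup arithmetic behind Findings 2 and 6 can be cross-checked with a short Python sketch. This is not part of the study package or of AIConfigurator: the function names and the `PROFILE_ROLLUPS` dictionary are hypothetical, the inputs are copied by hand from the Topology Win Rates table and Finding 6, and the efficiency rescaling assumes agg selects its full allocation (selected fraction `1.0`).

```python
import math

# Per-profile rollups copied from the Topology Win Rates table:
# profile -> (comparisons, disagg cluster wins, geometric-mean cluster ratio)
PROFILE_ROLLUPS = {
    "practical_32k_8k": (9, 3, 0.394),
    "practical_64k_8k": (9, 0, 0.202),
    "stress_128k_16k": (9, 1, 0.278),
}


def pooled_win_rate(rollups):
    """Total disagg cluster wins divided by total comparisons."""
    comparisons = sum(n for n, _, _ in rollups.values())
    wins = sum(w for _, w, _ in rollups.values())
    return wins / comparisons


def pooled_geometric_mean(rollups):
    """Comparison-weighted geometric mean of the per-profile geometric means."""
    comparisons = sum(n for n, _, _ in rollups.values())
    log_sum = sum(n * math.log(g) for n, _, g in rollups.values())
    return math.exp(log_sum / comparisons)


def selected_gpu_efficiency_ratio(cluster_ratio, disagg_selected_fraction):
    """Rescale a disagg/agg cluster-throughput ratio to per-active-GPU terms.

    Assumption: agg uses its whole allocation, so only the disagg side
    needs to be divided by its selected-GPU fraction.
    """
    return cluster_ratio / disagg_selected_fraction


print(f"overall win rate: {pooled_win_rate(PROFILE_ROLLUPS):.3f}")
print(f"overall geometric mean: {pooled_geometric_mean(PROFILE_ROLLUPS):.3f}")
# Finding 6: 0.911x cluster ratio with 12.5% of the allocation left idle.
print(f"stress_128k_16k at 16n: {selected_gpu_efficiency_ratio(0.911, 1 - 0.125):.3f}")
```

Under these assumptions the three values round to `0.148`, `0.281`, and `1.041`, matching the `overall` row and the Finding 6 figures. Pooling in log space works here because the overall geometric mean of all `27` ratios equals the comparison-weighted geometric mean of the per-profile values.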