dstackai
diff --git a/docs/concepts/gateways.md b/docs/concepts/gateways.md
Lines changed: 6 additions & 4 deletions
diff --git a/docs/concepts/gateways/index.html b/docs/concepts/gateways/index.html
Lines changed: 30 additions & 24 deletions
diff --git a/docs/concepts/services.md b/docs/concepts/services.md
Lines changed: 80 additions & 1 deletion
diff --git a/docs/concepts/services/index.html b/docs/concepts/services/index.html
Lines changed: 79 additions & 1 deletion
@@ -5,7 +5,7 @@ description: Managing ingress traffic and endpoints for services
 
 # Gateways
 
-Gateways manage ingress traffic for running [services](services.md), handle auto-scaling and rate limits, enable HTTPS, and allow you to configure a custom domain. They also support custom routers, such as the [SGLang Model Gateway](https://docs.sglang.ai/advanced_features/router.html#).
+Gateways manage ingress traffic for running [services](services.md), handle auto-scaling and rate limits, enable HTTPS, and allow you to configure a custom domain.
 
 <!-- > If you're using [dstack Sky](https://sky.dstack.ai),
 > the gateway is already set up for you. -->
@@ -67,6 +67,10 @@ You can create gateways with the `aws`, `azure`, `gcp`, or `kubernetes` backends
 
 ### Router
 
+> In previous releases, `dstack` allowed configuring `router` in the gateway, which was required for PD disaggregation. Since 0.20.17, the `router` configuration has moved to [services](services.md#pd-disaggregation), and the gateway no longer needs a `router` configuration.
+
+<!-- ### Router
+
 By default, the gateway uses its own load balancer to route traffic between replicas. However, you can delegate this responsibility to a specific router by setting the `router` property. Currently, the only supported external router is `sglang`.
 
 #### SGLang
@@ -107,9 +111,7 @@ If you configure the `sglang` router, [services](../concepts/services.md) can ru
     * `cache_aware` &mdash; Default policy; combines cache locality with load balancing, falling back to shortest queue. 
     * `power_of_two` &mdash; Samples two workers and picks the lighter one.                                               
     * `random` &mdash; Uniform random selection.                                                                    
-    * `round_robin` &mdash; Cycles through workers in order.                                                             
-
-
+    * `round_robin` &mdash; Cycles through workers in order.                                                              -->
 
 ### Certificate
 
 
@@ -4527,7 +4527,7 @@
 
 
 <h1 id="gateways">Gateways<a class="headerlink" href="#gateways" title="Permanent link">&para;</a></h1>
-<p>Gateways manage ingress traffic for running <a href="../services/">services</a>, handle auto-scaling and rate limits, enable HTTPS, and allow you to configure a custom domain. They also support custom routers, such as the <a href="https://docs.sglang.ai/advanced_features/router.html#">SGLang Model Gateway</a>.</p>
+<p>Gateways manage ingress traffic for running <a href="../services/">services</a>, handle auto-scaling and rate limits, enable HTTPS, and allow you to configure a custom domain.</p>
 <!-- > If you're using [dstack Sky](https://sky.dstack.ai),
 > the gateway is already set up for you. -->
 
@@ -4577,10 +4577,19 @@ <h3 id="backend">Backend<a class="headerlink" href="#backend" title="Permanent l
 For self-hosted Kubernetes, you must provide a load balancer by yourself.</p>
 </details>
 <h3 id="router">Router<a class="headerlink" href="#router" title="Permanent link">&para;</a></h3>
-<p>By default, the gateway uses its own load balancer to route traffic between replicas. However, you can delegate this responsibility to a specific router by setting the <code>router</code> property. Currently, the only supported external router is <code>sglang</code>.</p>
-<h4 id="sglang">SGLang<a class="headerlink" href="#sglang" title="Permanent link">&para;</a></h4>
-<p>The <code>sglang</code> router delegates routing logic to the <a href="https://docs.sglang.ai/advanced_features/router.html#">SGLang Model Gateway</a>.</p>
-<p>To enable it, set <code>type</code> field under <code>router</code> to <code>sglang</code>:</p>
+<blockquote>
+<p>In previous releases, <code>dstack</code> allowed configuring <code>router</code> in the gateway, which was required for PD disaggregation. Since 0.20.17, the <code>router</code> configuration has moved to <a href="../services/#pd-disaggregation">services</a>, and the gateway no longer needs a <code>router</code> configuration.</p>
+</blockquote>
+<!-- ### Router
+
+By default, the gateway uses its own load balancer to route traffic between replicas. However, you can delegate this responsibility to a specific router by setting the `router` property. Currently, the only supported external router is `sglang`.
+
+#### SGLang
+
+The `sglang` router delegates routing logic to the [SGLang Model Gateway](https://docs.sglang.ai/advanced_features/router.html#).
+
+To enable it, set the `type` field under `router` to `sglang`:
+
 <div editor-title="gateway.dstack.yml">
 
 <div class="highlight"><pre><span></span><code><span class="nt">type</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">gateway</span>
@@ -4598,25 +4607,22 @@ <h4 id="sglang">SGLang<a class="headerlink" href="#sglang" title="Permanent link
 
 </div>
 
-<p>If you configure the <code>sglang</code> router, <a href="../services/">services</a> can run either <a href="../../../examples/inference/sglang/">standard SGLang workers</a> or <a href="../../../examples/inference/sglang/#pd-disaggregation">Prefill-Decode workers</a> (aka PD disaggregation).</p>
-<div class="admonition note">
-<p class="admonition-title">PD disaggregation</p>
-<p>To run services with PD disaggregation see <a href="https://dstack.ai/examples/inference/sglang/#pd-disaggregation">SGLang PD disaggregation</a>.</p>
-</div>
-<div class="admonition note">
-<p class="admonition-title">Deprecation</p>
-<p>Configuring the SGLang router in a gateway is deprecated and will be disallowed in a future release.</p>
-</div>
-<details class="info">
-<summary>Policy</summary>
-<p>The <code>policy</code> property allows you to configure the routing policy:</p>
-<ul>
-<li><code>cache_aware</code> &mdash; Default policy; combines cache locality with load balancing, falling back to shortest queue. </li>
-<li><code>power_of_two</code> &mdash; Samples two workers and picks the lighter one.                                               </li>
-<li><code>random</code> &mdash; Uniform random selection.                                                                    </li>
-<li><code>round_robin</code> &mdash; Cycles through workers in order.                                                             </li>
-</ul>
-</details>
+If you configure the `sglang` router, [services](../concepts/services.md) can run either [standard SGLang workers](../../examples/inference/sglang/index.md) or [Prefill-Decode workers](../../examples/inference/sglang/index.md#pd-disaggregation) (aka PD disaggregation).
+
+!!! note "PD disaggregation"
+    To run services with PD disaggregation see [SGLang PD disaggregation](https://dstack.ai/examples/inference/sglang/#pd-disaggregation).
+
+!!! note "Deprecation"
+    Configuring the SGLang router in a gateway is deprecated and will be disallowed in a future release.
+
+??? info "Policy"
+    The `policy` property allows you to configure the routing policy:
+
+    * `cache_aware` &mdash; Default policy; combines cache locality with load balancing, falling back to shortest queue. 
+    * `power_of_two` &mdash; Samples two workers and picks the lighter one.                                               
+    * `random` &mdash; Uniform random selection.                                                                    
+    * `round_robin` &mdash; Cycles through workers in order.                                                              -->
+
 <h3 id="certificate">Certificate<a class="headerlink" href="#certificate" title="Permanent link">&para;</a></h3>
 <p>By default, when you run a service with a gateway, <code>dstack</code> provisions an SSL certificate via Let's Encrypt for the configured domain. This automatically enables HTTPS for the service endpoint.</p>
 <p>If you disable <a href="#public-ip">public IP</a> (e.g. to make the gateway private) or if you simply don't need HTTPS, you can set <code>certificate</code> to <code>null</code>. </p>
 
@@ -233,7 +233,86 @@ Setting the minimum number of replicas to `0` allows the service to scale down t
 
 ### PD disaggregation
 
-You can run SGLang with [Prefill-Decode disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html). See the [corresponding example](../../examples/inference/sglang/index.md#pd-disaggregation).
+Since 0.20.17, `dstack` supports serving a model using Prefill-Decode (PD) disaggregation. To use it, configure three replica groups: one for a router (for example, [SGLang Model Gateway](https://docs.sglang.io/advanced_features/sgl_model_gateway.html)), one for prefill workers, and one for decode workers.
+
+> Currently, Prefill-Decode disaggregation is supported only for SGLang.
+
+Below is an example for running `zai-org/GLM-4.5-Air-FP8`:
+
+<div editor-title="examples/inference/sglang/pd.dstack.yml">
+
+```yaml
+type: service
+name: prefill-decode
+image: lmsysorg/sglang:latest
+
+env:
+  - HF_TOKEN
+  - MODEL_ID=zai-org/GLM-4.5-Air-FP8
+
+replicas:
+  - count: 1
+    # For now, the replica group with the router must have count: 1
+    commands:
+      - pip install sglang_router
+      - |
+        python -m sglang_router.launch_router \
+          --host 0.0.0.0 \
+          --port 8000 \
+          --pd-disaggregation \
+          --prefill-policy cache_aware
+    router:
+      type: sglang
+    resources:
+      cpu: 4
+
+  - count: 1..4
+    scaling:
+      metric: rps
+      target: 3
+    commands:
+      - |
+        python -m sglang.launch_server \
+          --model-path $MODEL_ID \
+          --disaggregation-mode prefill \
+          --disaggregation-transfer-backend nixl \
+          --host 0.0.0.0 \
+          --port 8000 \
+          --disaggregation-bootstrap-port 8998
+    resources:
+      gpu: H200
+
+  - count: 1..8
+    scaling:
+      metric: rps
+      target: 2
+    commands:
+      - |
+        python -m sglang.launch_server \
+          --model-path $MODEL_ID \
+          --disaggregation-mode decode \
+          --disaggregation-transfer-backend nixl \
+          --host 0.0.0.0 \
+          --port 8000
+    resources:
+      gpu: H200
+
+port: 8000
+model: zai-org/GLM-4.5-Air-FP8
+
+# A custom probe is required for PD disaggregation.
+probes:
+  - type: http
+    url: /health
+    interval: 15s
+```
+
+</div>
+
+!!! info "Cluster"
+    PD disaggregation requires the service to run in a fleet with `placement` set to `cluster`, because the replicas require an interconnect between instances.
+
+    While the prefill and decode replicas run on GPUs, the router replica requires a CPU instance in the same cluster.
 
 ### Authorization
 
 
@@ -5007,7 +5007,85 @@ <h3 id="replicas-and-scaling">Replicas and scaling<a class="headerlink" href="#r
 </blockquote>
 </details>
 <h3 id="pd-disaggregation">PD disaggregation<a class="headerlink" href="#pd-disaggregation" title="Permanent link">&para;</a></h3>
-<p>You can run SGLang with <a href="https://docs.sglang.io/advanced_features/pd_disaggregation.html">Prefill-Decode disaggregation</a>. See the <a href="../../../examples/inference/sglang/#pd-disaggregation">corresponding example</a>.</p>
+<p>Since 0.20.17, <code>dstack</code> supports serving a model using Prefill-Decode (PD) disaggregation. To use it, configure three replica groups: one for a router (for example, <a href="https://docs.sglang.io/advanced_features/sgl_model_gateway.html">SGLang Model Gateway</a>), one for prefill workers, and one for decode workers.</p>
+<blockquote>
+<p>Currently, Prefill-Decode disaggregation is supported only for SGLang.</p>
+</blockquote>
+<p>Below is an example for running <code>zai-org/GLM-4.5-Air-FP8</code>:</p>
+<div editor-title="examples/inference/sglang/pd.dstack.yml">
+
+<div class="highlight"><pre><span></span><code><span class="nt">type</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">service</span>
+<span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">prefill-decode</span>
+<span class="nt">image</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">lmsysorg/sglang:latest</span>
+
+<span class="nt">env</span><span class="p">:</span>
+<span class="w">  </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">HF_TOKEN</span>
+<span class="w">  </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">MODEL_ID=zai-org/GLM-4.5-Air-FP8</span>
+
+<span class="nt">replicas</span><span class="p">:</span>
+<span class="w">  </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">count</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">1</span>
+<span class="w">    </span><span class="c1"># For now, the replica group with the router must have count: 1</span>
+<span class="w">    </span><span class="nt">commands</span><span class="p">:</span>
+<span class="w">      </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">pip install sglang_router</span>
+<span class="w">      </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="p p-Indicator">|</span>
+<span class="w">        </span><span class="no">python -m sglang_router.launch_router \</span>
+<span class="w">          </span><span class="no">--host 0.0.0.0 \</span>
+<span class="w">          </span><span class="no">--port 8000 \</span>
+<span class="w">          </span><span class="no">--pd-disaggregation \</span>
+<span class="w">          </span><span class="no">--prefill-policy cache_aware</span>
+<span class="w">    </span><span class="nt">router</span><span class="p">:</span>
+<span class="w">      </span><span class="nt">type</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">sglang</span>
+<span class="w">    </span><span class="nt">resources</span><span class="p">:</span>
+<span class="w">      </span><span class="nt">cpu</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">4</span>
+
+<span class="w">  </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">count</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">1..4</span>
+<span class="w">    </span><span class="nt">scaling</span><span class="p">:</span>
+<span class="w">      </span><span class="nt">metric</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">rps</span>
+<span class="w">      </span><span class="nt">target</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">3</span>
+<span class="w">    </span><span class="nt">commands</span><span class="p">:</span>
+<span class="w">      </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="p p-Indicator">|</span>
+<span class="w">        </span><span class="no">python -m sglang.launch_server \</span>
+<span class="w">          </span><span class="no">--model-path $MODEL_ID \</span>
+<span class="w">          </span><span class="no">--disaggregation-mode prefill \</span>
+<span class="w">          </span><span class="no">--disaggregation-transfer-backend nixl \</span>
+<span class="w">          </span><span class="no">--host 0.0.0.0 \</span>
+<span class="w">          </span><span class="no">--port 8000 \</span>
+<span class="w">          </span><span class="no">--disaggregation-bootstrap-port 8998</span>
+<span class="w">    </span><span class="nt">resources</span><span class="p">:</span>
+<span class="w">      </span><span class="nt">gpu</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">H200</span>
+
+<span class="w">  </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">count</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">1..8</span>
+<span class="w">    </span><span class="nt">scaling</span><span class="p">:</span>
+<span class="w">      </span><span class="nt">metric</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">rps</span>
+<span class="w">      </span><span class="nt">target</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">2</span>
+<span class="w">    </span><span class="nt">commands</span><span class="p">:</span>
+<span class="w">      </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="p p-Indicator">|</span>
+<span class="w">        </span><span class="no">python -m sglang.launch_server \</span>
+<span class="w">          </span><span class="no">--model-path $MODEL_ID \</span>
+<span class="w">          </span><span class="no">--disaggregation-mode decode \</span>
+<span class="w">          </span><span class="no">--disaggregation-transfer-backend nixl \</span>
+<span class="w">          </span><span class="no">--host 0.0.0.0 \</span>
+<span class="w">          </span><span class="no">--port 8000</span>
+<span class="w">    </span><span class="nt">resources</span><span class="p">:</span>
+<span class="w">      </span><span class="nt">gpu</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">H200</span>
+
+<span class="nt">port</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">8000</span>
+<span class="nt">model</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">zai-org/GLM-4.5-Air-FP8</span>
+
+<span class="c1"># A custom probe is required for PD disaggregation.</span>
+<span class="nt">probes</span><span class="p">:</span>
+<span class="w">  </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">type</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">http</span>
+<span class="w">    </span><span class="nt">url</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">/health</span>
+<span class="w">    </span><span class="nt">interval</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">15s</span>
+</code></pre></div>
+
+</div>
+
+<div class="admonition info">
+<p class="admonition-title">Cluster</p>
+<p>PD disaggregation requires the service to run in a fleet with <code>placement</code> set to <code>cluster</code>, because the replicas require an interconnect between instances.</p>
+<p>While the prefill and decode replicas run on GPUs, the router replica requires a CPU instance in the same cluster.</p>
+</div>
 <h3 id="authorization">Authorization<a class="headerlink" href="#authorization" title="Permanent link">&para;</a></h3>
 <p>By default, the service enables authorization, meaning the service endpoint requires a <code>dstack</code> user token.
 This can be disabled by setting <code>auth</code> to <code>false</code>.</p>