docs/concepts/gateways.md
+6 -4 (6 additions & 4 deletions)
@@ -5,7 +5,7 @@ description: Managing ingress traffic and endpoints for services
 
 # Gateways
 
-Gateways manage ingress traffic for running [services](services.md), handle auto-scaling and rate limits, enable HTTPS, and allow you to configure a custom domain. They also support custom routers, such as the [SGLang Model Gateway](https://docs.sglang.ai/advanced_features/router.html#).
+Gateways manage ingress traffic for running [services](services.md), handle auto-scaling and rate limits, enable HTTPS, and allow you to configure a custom domain.
 
 <!-- > If you're using [dstack Sky](https://sky.dstack.ai),
 > the gateway is already set up for you. -->
@@ -67,6 +67,10 @@ You can create gateways with the `aws`, `azure`, `gcp`, or `kubernetes` backends
 
 ### Router
 
+> In previous releases, `dstack` allowed configuring `router` on the gateway, which was required for PD disaggregation. Since 0.20.17, the `router` configuration has moved to [services](services.md#pd-disaggregation), and the gateway no longer needs a `router` configuration.
+
+<!-- ### Router
+
 By default, the gateway uses its own load balancer to route traffic between replicas. However, you can delegate this responsibility to a specific router by setting the `router` property. Currently, the only supported external router is `sglang`.
 
 #### SGLang
@@ -107,9 +111,7 @@ If you configure the `sglang` router, [services](../concepts/services.md) can ru
     * `cache_aware` — Default policy; combines cache locality with load balancing, falling back to shortest queue.
     * `power_of_two` — Samples two workers and picks the lighter one.
     * `random` — Uniform random selection.
-    * `round_robin` — Cycles through workers in order.
-
-
+    * `round_robin` — Cycles through workers in order. -->
-<p>Gateways manage ingress traffic for running <a href="../services/">services</a>, handle auto-scaling and rate limits, enable HTTPS, and allow you to configure a custom domain. They also support custom routers, such as the <a href="https://docs.sglang.ai/advanced_features/router.html#">SGLang Model Gateway</a>.</p>
+<p>Gateways manage ingress traffic for running <a href="../services/">services</a>, handle auto-scaling and rate limits, enable HTTPS, and allow you to configure a custom domain.</p>
 <!-- > If you're using [dstack Sky](https://sky.dstack.ai),
 > the gateway is already set up for you. -->
 
@@ -4577,10 +4577,19 @@ <h3 id="backend">Backend<a class="headerlink" href="#backend" title="Permanent l
 For self-hosted Kubernetes, you must provide a load balancer by yourself.</p>
-<p>By default, the gateway uses its own load balancer to route traffic between replicas. However, you can delegate this responsibility to a specific router by setting the <code>router</code> property. Currently, the only supported external router is <code>sglang</code>.</p>
-<p>The <code>sglang</code> router delegates routing logic to the <a href="https://docs.sglang.ai/advanced_features/router.html#">SGLang Model Gateway</a>.</p>
-<p>To enable it, set <code>type</code> field under <code>router</code> to <code>sglang</code>:</p>
+<blockquote>
+<p>In previous releases, <code>dstack</code> allowed configuring <code>router</code> on the gateway, which was required for PD disaggregation. Since 0.20.17, the <code>router</code> configuration has moved to <a href="../services/#pd-disaggregation">services</a>, and the gateway no longer needs a <code>router</code> configuration.</p>
+</blockquote>
+<!-- ### Router
+
+By default, the gateway uses its own load balancer to route traffic between replicas. However, you can delegate this responsibility to a specific router by setting the `router` property. Currently, the only supported external router is `sglang`.
+
+#### SGLang
+
+The `sglang` router delegates routing logic to the [SGLang Model Gateway](https://docs.sglang.ai/advanced_features/router.html#).
+
+To enable it, set `type` field under `router` to `sglang`:
@@ -4598,25 +4607,22 @@ <h4 id="sglang">SGLang<a class="headerlink" href="#sglang" title="Permanent link
 
 </div>
 
-<p>If you configure the <code>sglang</code> router, <a href="../services/">services</a> can run either <a href="../../../examples/inference/sglang/">standard SGLang workers</a> or <a href="../../../examples/inference/sglang/#pd-disaggregation">Prefill-Decode workers</a> (aka PD disaggregation).</p>
-<div class="admonition note">
-<p class="admonition-title">PD disaggregation</p>
-<p>To run services with PD disaggregation see <a href="https://dstack.ai/examples/inference/sglang/#pd-disaggregation">SGLang PD disaggregation</a>.</p>
-</div>
-<div class="admonition note">
-<p class="admonition-title">Deprecation</p>
-<p>Configuring the SGLang router in a gateway is deprecated and will be disallowed in a future release.</p>
-</div>
-<details class="info">
-<summary>Policy</summary>
-<p>The <code>policy</code> property allows you to configure the routing policy:</p>
-<ul>
-<li><code>cache_aware</code> — Default policy; combines cache locality with load balancing, falling back to shortest queue. </li>
-<li><code>power_of_two</code> — Samples two workers and picks the lighter one. </li>
-<li><code>random</code> — Uniform random selection. </li>
-<li><code>round_robin</code> — Cycles through workers in order. </li>
-</ul>
-</details>
+If you configure the `sglang` router, [services](../concepts/services.md) can run either [standard SGLang workers](../../examples/inference/sglang/index.md) or [Prefill-Decode workers](../../examples/inference/sglang/index.md#pd-disaggregation) (aka PD disaggregation).
+
+!!! note "PD disaggregation"
+    To run services with PD disaggregation see [SGLang PD disaggregation](https://dstack.ai/examples/inference/sglang/#pd-disaggregation).
+
+!!! note "Deprecation"
+    Configuring the SGLang router in a gateway is deprecated and will be disallowed in a future release.
+
+??? info "Policy"
+    The `policy` property allows you to configure the routing policy:
+
+    * `cache_aware` — Default policy; combines cache locality with load balancing, falling back to shortest queue.
+    * `power_of_two` — Samples two workers and picks the lighter one.
+    * `random` — Uniform random selection.
+    * `round_robin` — Cycles through workers in order. -->
 <p>By default, when you run a service with a gateway, <code>dstack</code> provisions an SSL certificate via Let's Encrypt for the configured domain. This automatically enables HTTPS for the service endpoint.</p>
 <p>If you disable <a href="#public-ip">public IP</a> (e.g. to make the gateway private) or if you simply don't need HTTPS, you can set <code>certificate</code> to <code>null</code>. </p>
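
To illustrate the `certificate` setting described in the context lines above, a gateway configuration that skips certificate provisioning might look like the sketch below. The `name`, `backend`, `region`, and `domain` values are placeholder assumptions, not values taken from this change.

```yaml
type: gateway
# Hypothetical name, backend, region, and domain, for illustration only
name: example-gateway
backend: aws
region: eu-west-1
domain: example.com

# Make the gateway private
public_ip: false
# Do not provision a Let's Encrypt certificate (the endpoint is served without HTTPS)
certificate: null
```
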
docs/concepts/services.md
+80 -1 (80 additions & 1 deletion)
@@ -233,7 +233,86 @@ Setting the minimum number of replicas to `0` allows the service to scale down t
 
 ### PD disaggregation
 
-You can run SGLang with [Prefill-Decode disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html). See the [corresponding example](../../examples/inference/sglang/index.md#pd-disaggregation).
+Since 0.20.17, `dstack` supports serving a model using PD disaggregation. To use it, configure three replica groups: one for a router (for example, [SGLang Model Gateway](https://docs.sglang.io/advanced_features/sgl_model_gateway.html)), one for prefill workers, and one for decode workers.
+
+> Currently, Prefill-Decode disaggregation is supported only for SGLang.
+
+Below is an example for running `zai-org/GLM-4.5-Air-FP8`:
+    # For now replica group with router must have count: 1
+    commands:
+      - pip install sglang_router
+      - |
+        python -m sglang_router.launch_router \
+          --host 0.0.0.0 \
+          --port 8000 \
+          --pd-disaggregation \
+          --prefill-policy cache_aware
+    router:
+      type: sglang
+    resources:
+      cpu: 4
+
+  - count: 1..4
+    scaling:
+      metric: rps
+      target: 3
+    commands:
+      - |
+        python -m sglang.launch_server \
+          --model-path $MODEL_ID \
+          --disaggregation-mode prefill \
+          --disaggregation-transfer-backend nixl \
+          --host 0.0.0.0 \
+          --port 8000 \
+          --disaggregation-bootstrap-port 8998
+    resources:
+      gpu: H200
+
+  - count: 1..8
+    scaling:
+      metric: rps
+      target: 2
+    commands:
+      - |
+        python -m sglang.launch_server \
+          --model-path $MODEL_ID \
+          --disaggregation-mode decode \
+          --disaggregation-transfer-backend nixl \
+          --host 0.0.0.0 \
+          --port 8000
+    resources:
+      gpu: H200
+
+port: 8000
+model: zai-org/GLM-4.5-Air-FP8
+
+# Custom probe is required for PD disaggregation.
+probes:
+  - type: http
+    url: /health
+    interval: 15s
+```
+
+</div>
+
+!!! info "Cluster"
+    PD disaggregation requires the service to run in a fleet with `placement` set to `cluster`, because the replicas require an interconnect between instances.
+
+    While the prefill and decode replicas run on GPUs, the router replica requires a CPU instance in the same cluster.
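
The cluster requirement in the admonition above could be met with a fleet along the following lines. This is a minimal sketch, not part of the change: the fleet name and node count are assumptions for illustration, and backend and resource settings are left out.

```yaml
type: fleet
# Hypothetical fleet name
name: pd-cluster
# Assumed node count, chosen only for illustration
nodes: 3
# Provision the instances as an interconnected cluster
placement: cluster
```
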
-<p>You can run SGLang with <a href="https://docs.sglang.io/advanced_features/pd_disaggregation.html">Prefill-Decode disaggregation</a>. See the <a href="../../../examples/inference/sglang/#pd-disaggregation">corresponding example</a>.</p>
+<p>Since 0.20.17, <code>dstack</code> supports serving a model using PD disaggregation. To use it, configure three replica groups: one for a router (for example, <a href="https://docs.sglang.io/advanced_features/sgl_model_gateway.html">SGLang Model Gateway</a>), one for prefill workers, and one for decode workers.</p>
+<blockquote>
+<p>Currently, Prefill-Decode disaggregation is supported only for SGLang.</p>
+</blockquote>
+<p>Below is an example for running <code>zai-org/GLM-4.5-Air-FP8</code>:</p>
+<p>PD disaggregation requires the service to run in a fleet with <code>placement</code> set to <code>cluster</code>, because the replicas require an interconnect between instances.</p>
+<p>While the prefill and decode replicas run on GPUs, the router replica requires a CPU instance in the same cluster.</p>