Commit b0eabd5

Deploying to gh-pages from @ dstackai/dstack@7bc818e 🚀
1 parent b8af5e1 commit b0eabd5

12 files changed

Lines changed: 444 additions & 118 deletions

docs/concepts/gateways.md

Lines changed: 6 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -5,7 +5,7 @@ description: Managing ingress traffic and endpoints for services
55

66
# Gateways
77

8-
Gateways manage ingress traffic for running [services](services.md), handle auto-scaling and rate limits, enable HTTPS, and allow you to configure a custom domain. They also support custom routers, such as the [SGLang Model Gateway](https://docs.sglang.ai/advanced_features/router.html#).
8+
Gateways manage ingress traffic for running [services](services.md), handle auto-scaling and rate limits, enable HTTPS, and allow you to configure a custom domain.
99

1010
<!-- > If you're using [dstack Sky](https://sky.dstack.ai),
1111
> the gateway is already set up for you. -->
@@ -67,6 +67,10 @@ You can create gateways with the `aws`, `azure`, `gcp`, or `kubernetes` backends
6767

6868
### Router
6969

70+
> In previous releases, `dstack` allowed configuring `router` in the gateway, which was required for PD disaggregation. Since 0.20.17, the `router` configuration has moved to [services](services.md#pd-disaggregation), and the gateway no longer needs a `router` configuration.
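
As a rough illustration of what this means in practice, the sketch below shows a gateway configuration with no `router` property. The name, backend, region, and domain are placeholder values chosen for illustration, not taken from this commit:

```yaml
type: gateway
# Placeholder values, for illustration only
name: example-gateway
backend: aws
region: eu-west-1
domain: example.com
```

With a gateway like this, routing for PD disaggregation would be configured on the service's replica group rather than on the gateway.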
71+
72+
<!-- ### Router
73+
7074
By default, the gateway uses its own load balancer to route traffic between replicas. However, you can delegate this responsibility to a specific router by setting the `router` property. Currently, the only supported external router is `sglang`.
7175

7276
#### SGLang
@@ -107,9 +111,7 @@ If you configure the `sglang` router, [services](../concepts/services.md) can ru
107111
* `cache_aware` &mdash; Default policy; combines cache locality with load balancing, falling back to shortest queue.
108112
* `power_of_two` &mdash; Samples two workers and picks the lighter one.
109113
* `random` &mdash; Uniform random selection.
110-
* `round_robin` &mdash; Cycles through workers in order.
111-
112-
114+
* `round_robin` &mdash; Cycles through workers in order. -->
113115

114116
### Certificate
115117

docs/concepts/gateways/index.html

Lines changed: 30 additions & 24 deletions
Original file line numberDiff line numberDiff line change
@@ -4527,7 +4527,7 @@
45274527

45284528

45294529
<h1 id="gateways">Gateways<a class="headerlink" href="#gateways" title="Permanent link">&para;</a></h1>
4530-
<p>Gateways manage ingress traffic for running <a href="../services/">services</a>, handle auto-scaling and rate limits, enable HTTPS, and allow you to configure a custom domain. They also support custom routers, such as the <a href="https://docs.sglang.ai/advanced_features/router.html#">SGLang Model Gateway</a>.</p>
4530+
<p>Gateways manage ingress traffic for running <a href="../services/">services</a>, handle auto-scaling and rate limits, enable HTTPS, and allow you to configure a custom domain.</p>
45314531
<!-- > If you're using [dstack Sky](https://sky.dstack.ai),
45324532
> the gateway is already set up for you. -->
45334533

@@ -4577,10 +4577,19 @@ <h3 id="backend">Backend<a class="headerlink" href="#backend" title="Permanent l
45774577
For self-hosted Kubernetes, you must provide a load balancer by yourself.</p>
45784578
</details>
45794579
<h3 id="router">Router<a class="headerlink" href="#router" title="Permanent link">&para;</a></h3>
4580-
<p>By default, the gateway uses its own load balancer to route traffic between replicas. However, you can delegate this responsibility to a specific router by setting the <code>router</code> property. Currently, the only supported external router is <code>sglang</code>.</p>
4581-
<h4 id="sglang">SGLang<a class="headerlink" href="#sglang" title="Permanent link">&para;</a></h4>
4582-
<p>The <code>sglang</code> router delegates routing logic to the <a href="https://docs.sglang.ai/advanced_features/router.html#">SGLang Model Gateway</a>.</p>
4583-
<p>To enable it, set <code>type</code> field under <code>router</code> to <code>sglang</code>:</p>
4580+
<blockquote>
4581+
<p>In previous releases, <code>dstack</code> allowed configuring <code>router</code> in the gateway, which was required for PD disaggregation. Since 0.20.17, the <code>router</code> configuration has moved to <a href="../services/#pd-disaggregation">services</a>, and the gateway no longer needs a <code>router</code> configuration.</p>
4582+
</blockquote>
4583+
<!-- ### Router
4584+
4585+
By default, the gateway uses its own load balancer to route traffic between replicas. However, you can delegate this responsibility to a specific router by setting the `router` property. Currently, the only supported external router is `sglang`.
4586+
4587+
#### SGLang
4588+
4589+
The `sglang` router delegates routing logic to the [SGLang Model Gateway](https://docs.sglang.ai/advanced_features/router.html#).
4590+
4591+
To enable it, set `type` field under `router` to `sglang`:
4592+
45844593
<div editor-title="gateway.dstack.yml">
45854594
45864595
<div class="highlight"><pre><span></span><code><span class="nt">type</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">gateway</span>
@@ -4598,25 +4607,22 @@ <h4 id="sglang">SGLang<a class="headerlink" href="#sglang" title="Permanent link
45984607
45994608
</div>
46004609
4601-
<p>If you configure the <code>sglang</code> router, <a href="../services/">services</a> can run either <a href="../../../examples/inference/sglang/">standard SGLang workers</a> or <a href="../../../examples/inference/sglang/#pd-disaggregation">Prefill-Decode workers</a> (aka PD disaggregation).</p>
4602-
<div class="admonition note">
4603-
<p class="admonition-title">PD disaggregation</p>
4604-
<p>To run services with PD disaggregation see <a href="https://dstack.ai/examples/inference/sglang/#pd-disaggregation">SGLang PD disaggregation</a>.</p>
4605-
</div>
4606-
<div class="admonition note">
4607-
<p class="admonition-title">Deprecation</p>
4608-
<p>Configuring the SGLang router in a gateway is deprecated and will be disallowed in a future release.</p>
4609-
</div>
4610-
<details class="info">
4611-
<summary>Policy</summary>
4612-
<p>The <code>policy</code> property allows you to configure the routing policy:</p>
4613-
<ul>
4614-
<li><code>cache_aware</code> &mdash; Default policy; combines cache locality with load balancing, falling back to shortest queue. </li>
4615-
<li><code>power_of_two</code> &mdash; Samples two workers and picks the lighter one. </li>
4616-
<li><code>random</code> &mdash; Uniform random selection. </li>
4617-
<li><code>round_robin</code> &mdash; Cycles through workers in order. </li>
4618-
</ul>
4619-
</details>
4610+
If you configure the `sglang` router, [services](../concepts/services.md) can run either [standard SGLang workers](../../examples/inference/sglang/index.md) or [Prefill-Decode workers](../../examples/inference/sglang/index.md#pd-disaggregation) (aka PD disaggregation).
4611+
4612+
!!! note "PD disaggregation"
4613+
To run services with PD disaggregation see [SGLang PD disaggregation](https://dstack.ai/examples/inference/sglang/#pd-disaggregation).
4614+
4615+
!!! note "Deprecation"
4616+
Configuring the SGLang router in a gateway is deprecated and will be disallowed in a future release.
4617+
4618+
??? info "Policy"
4619+
The `policy` property allows you to configure the routing policy:
4620+
4621+
* `cache_aware` &mdash; Default policy; combines cache locality with load balancing, falling back to shortest queue.
4622+
* `power_of_two` &mdash; Samples two workers and picks the lighter one.
4623+
* `random` &mdash; Uniform random selection.
4624+
* `round_robin` &mdash; Cycles through workers in order. -->
4625+
46204626
<h3 id="certificate">Certificate<a class="headerlink" href="#certificate" title="Permanent link">&para;</a></h3>
46214627
<p>By default, when you run a service with a gateway, <code>dstack</code> provisions an SSL certificate via Let's Encrypt for the configured domain. This automatically enables HTTPS for the service endpoint.</p>
46224628
<p>If you disable <a href="#public-ip">public IP</a> (e.g. to make the gateway private) or if you simply don't need HTTPS, you can set <code>certificate</code> to <code>null</code>. </p>

docs/concepts/services.md

Lines changed: 80 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -233,7 +233,86 @@ Setting the minimum number of replicas to `0` allows the service to scale down t
233233

234234
### PD disaggregation
235235

236-
You can run SGLang with [Prefill-Decode disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html). See the [corresponding example](../../examples/inference/sglang/index.md#pd-disaggregation).
236+
Since 0.20.17, `dstack` supports serving a model using PD disaggregation. To use it, configure three replica groups: one for a router (for example, [SGLang Model Gateway](https://docs.sglang.io/advanced_features/sgl_model_gateway.html)), one for prefill workers, and one for decode workers.
237+
238+
> Currently, Prefill-Decode disaggregation is supported only for SGLang.
239+
240+
Below is an example for running `zai-org/GLM-4.5-Air-FP8`:
241+
242+
<div editor-title="examples/inference/sglang/pd.dstack.yml">
243+
244+
```yaml
type: service
name: prefill-decode
image: lmsysorg/sglang:latest

env:
  - HF_TOKEN
  - MODEL_ID=zai-org/GLM-4.5-Air-FP8

replicas:
  - count: 1
    # For now, the replica group with the router must have count: 1
    commands:
      - pip install sglang_router
      - |
        python -m sglang_router.launch_router \
          --host 0.0.0.0 \
          --port 8000 \
          --pd-disaggregation \
          --prefill-policy cache_aware
    router:
      type: sglang
    resources:
      cpu: 4

  - count: 1..4
    scaling:
      metric: rps
      target: 3
    commands:
      - |
        python -m sglang.launch_server \
          --model-path $MODEL_ID \
          --disaggregation-mode prefill \
          --disaggregation-transfer-backend nixl \
          --host 0.0.0.0 \
          --port 8000 \
          --disaggregation-bootstrap-port 8998
    resources:
      gpu: H200

  - count: 1..8
    scaling:
      metric: rps
      target: 2
    commands:
      - |
        python -m sglang.launch_server \
          --model-path $MODEL_ID \
          --disaggregation-mode decode \
          --disaggregation-transfer-backend nixl \
          --host 0.0.0.0 \
          --port 8000
    resources:
      gpu: H200

port: 8000
model: zai-org/GLM-4.5-Air-FP8

# A custom probe is required for PD disaggregation.
probes:
  - type: http
    url: /health
    interval: 15s
```
309+
310+
</div>
311+
312+
!!! info "Cluster"
313+
PD disaggregation requires the service to run in a fleet with `placement` set to `cluster`, because the replicas require an interconnect between instances.
314+
315+
While the prefill and decode replicas run on GPUs, the router replica requires a CPU instance in the same cluster.
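
A fleet that satisfies the `placement` requirement might look like the minimal sketch below. The fleet name, node count, and GPU spec are assumptions chosen for illustration and are not part of this commit:

```yaml
type: fleet
# Hypothetical name and size, for illustration only
name: pd-cluster
nodes: 2
# Provision the instances as an interconnected cluster
placement: cluster
resources:
  gpu: H200
```

This sketch covers only the GPU nodes. How a CPU instance for the router replica is made available in the same cluster depends on the backend and is not shown here.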
237316

238317
### Authorization
239318

docs/concepts/services/index.html

Lines changed: 79 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -5007,7 +5007,85 @@ <h3 id="replicas-and-scaling">Replicas and scaling<a class="headerlink" href="#r
50075007
</blockquote>
50085008
</details>
50095009
<h3 id="pd-disaggregation">PD disaggregation<a class="headerlink" href="#pd-disaggregation" title="Permanent link">&para;</a></h3>
5010-
<p>You can run SGLang with <a href="https://docs.sglang.io/advanced_features/pd_disaggregation.html">Prefill-Decode disaggregation</a>. See the <a href="../../../examples/inference/sglang/#pd-disaggregation">corresponding example</a>.</p>
5010+
<p>Since 0.20.17, <code>dstack</code> supports serving a model using PD disaggregation. To use it, configure three replica groups: one for a router (for example, <a href="https://docs.sglang.io/advanced_features/sgl_model_gateway.html">SGLang Model Gateway</a>), one for prefill workers, and one for decode workers.</p>
5011+
<blockquote>
5012+
<p>Currently, Prefill-Decode disaggregation is supported only for SGLang.</p>
5013+
</blockquote>
5014+
<p>Below is an example for running <code>zai-org/GLM-4.5-Air-FP8</code>:</p>
5015+
<div editor-title="examples/inference/sglang/pd.dstack.yml">
5016+
5017+
<div class="highlight"><pre><span></span><code><span class="nt">type</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">service</span>
5018+
<span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">prefill-decode</span>
5019+
<span class="nt">image</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">lmsysorg/sglang:latest</span>
5020+
5021+
<span class="nt">env</span><span class="p">:</span>
5022+
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">HF_TOKEN</span>
5023+
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">MODEL_ID=zai-org/GLM-4.5-Air-FP8</span>
5024+
5025+
<span class="nt">replicas</span><span class="p">:</span>
5026+
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">count</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">1</span>
5027+
<span class="w"> </span><span class="c1"># For now replica group with router must have count: 1</span>
5028+
<span class="w"> </span><span class="nt">commands</span><span class="p">:</span>
5029+
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">pip install sglang_router</span>
5030+
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="p p-Indicator">|</span>
5031+
<span class="w"> </span><span class="no">python -m sglang_router.launch_router \</span>
5032+
<span class="w"> </span><span class="no">--host 0.0.0.0 \</span>
5033+
<span class="w"> </span><span class="no">--port 8000 \</span>
5034+
<span class="w"> </span><span class="no">--pd-disaggregation \</span>
5035+
<span class="w"> </span><span class="no">--prefill-policy cache_aware</span>
5036+
<span class="w"> </span><span class="nt">router</span><span class="p">:</span>
5037+
<span class="w"> </span><span class="nt">type</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">sglang</span>
5038+
<span class="w"> </span><span class="nt">resources</span><span class="p">:</span>
5039+
<span class="w"> </span><span class="nt">cpu</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">4</span>
5040+
5041+
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">count</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">1..4</span>
5042+
<span class="w"> </span><span class="nt">scaling</span><span class="p">:</span>
5043+
<span class="w"> </span><span class="nt">metric</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">rps</span>
5044+
<span class="w"> </span><span class="nt">target</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">3</span>
5045+
<span class="w"> </span><span class="nt">commands</span><span class="p">:</span>
5046+
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="p p-Indicator">|</span>
5047+
<span class="w"> </span><span class="no">python -m sglang.launch_server \</span>
5048+
<span class="w"> </span><span class="no">--model-path $MODEL_ID \</span>
5049+
<span class="w"> </span><span class="no">--disaggregation-mode prefill \</span>
5050+
<span class="w"> </span><span class="no">--disaggregation-transfer-backend nixl \</span>
5051+
<span class="w"> </span><span class="no">--host 0.0.0.0 \</span>
5052+
<span class="w"> </span><span class="no">--port 8000 \</span>
5053+
<span class="w"> </span><span class="no">--disaggregation-bootstrap-port 8998</span>
5054+
<span class="w"> </span><span class="nt">resources</span><span class="p">:</span>
5055+
<span class="w"> </span><span class="nt">gpu</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">H200</span>
5056+
5057+
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">count</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">1..8</span>
5058+
<span class="w"> </span><span class="nt">scaling</span><span class="p">:</span>
5059+
<span class="w"> </span><span class="nt">metric</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">rps</span>
5060+
<span class="w"> </span><span class="nt">target</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">2</span>
5061+
<span class="w"> </span><span class="nt">commands</span><span class="p">:</span>
5062+
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="p p-Indicator">|</span>
5063+
<span class="w"> </span><span class="no">python -m sglang.launch_server \</span>
5064+
<span class="w"> </span><span class="no">--model-path $MODEL_ID \</span>
5065+
<span class="w"> </span><span class="no">--disaggregation-mode decode \</span>
5066+
<span class="w"> </span><span class="no">--disaggregation-transfer-backend nixl \</span>
5067+
<span class="w"> </span><span class="no">--host 0.0.0.0 \</span>
5068+
<span class="w"> </span><span class="no">--port 8000</span>
5069+
<span class="w"> </span><span class="nt">resources</span><span class="p">:</span>
5070+
<span class="w"> </span><span class="nt">gpu</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">H200</span>
5071+
5072+
<span class="nt">port</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">8000</span>
5073+
<span class="nt">model</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">zai-org/GLM-4.5-Air-FP8</span>
5074+
5075+
<span class="c1"># Custom probe is required for PD disaggregation.</span>
5076+
<span class="nt">probes</span><span class="p">:</span>
5077+
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">type</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">http</span>
5078+
<span class="w"> </span><span class="nt">url</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">/health</span>
5079+
<span class="w"> </span><span class="nt">interval</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">15s</span>
5080+
</code></pre></div>
5081+
5082+
</div>
5083+
5084+
<div class="admonition info">
5085+
<p class="admonition-title">Cluster</p>
5086+
<p>PD disaggregation requires the service to run in a fleet with <code>placement</code> set to <code>cluster</code>, because the replicas require an interconnect between instances.</p>
5087+
<p>While the prefill and decode replicas run on GPUs, the router replica requires a CPU instance in the same cluster.</p>
5088+
</div>
50115089
<h3 id="authorization">Authorization<a class="headerlink" href="#authorization" title="Permanent link">&para;</a></h3>
50125090
<p>By default, the service enables authorization, meaning the service endpoint requires a <code>dstack</code> user token.
50135091
This can be disabled by setting <code>auth</code> to <code>false</code>.</p>
