
Quick Start

RoleBasedGroup (RBG) is a custom resource that models a group of roles (each role represents a workload type and set of pods) and the relationships between them. It is intended to manage multi-role applications that may require coordinated scheduling, lifecycle management, rolling updates, and optional gang-scheduling (PodGroup) support.
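The structure described above can be sketched as a minimal manifest. This is a hypothetical illustration, not the project's exact schema: the `apiVersion`, `kind`, and field names below (especially `dependencies`) are assumptions — consult the project's API reference and examples for the real spec.

```yaml
# Illustrative RBG manifest: two roles managed as one group.
# Field names are assumptions for illustration only.
apiVersion: workloads.x-k8s.io/v1alpha1
kind: RoleBasedGroup
metadata:
  name: demo-app
spec:
  roles:
    - name: frontend            # one role = one workload type and its pods
      replicas: 2
      template:                 # ordinary pod template for this role
        spec:
          containers:
            - name: frontend
              image: my-frontend:latest     # placeholder image
    - name: backend
      replicas: 1
      dependencies: ["frontend"]  # assumed field expressing role relationships
      template:
        spec:
          containers:
            - name: backend
              image: my-backend:latest      # placeholder image
```

Each entry under `roles` carries its own replica count and pod template, so the group can roll out, scale, and (optionally) gang-schedule the roles together.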

Conceptual View

Key Feature

PD Colocation

When a request arrives at an LLM inference engine, the system first processes the user input to generate the first token (prefill), then generates output tokens one at a time autoregressively (decode). A request usually consists of one prefill step and multiple decode steps until termination.

Single Node

When the model is small enough that a single Kubernetes Node can load all model files, you can deploy the LLM inference service on a single node.

Examples
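A single-node deployment can be sketched as an RBG with one role whose pod requests all the GPUs on a node. As above, the schema and image below are illustrative assumptions, not the project's confirmed API:

```yaml
# Illustrative single-node inference service: the model fits on one node,
# so a single role with one replica serves it. Field names are assumptions.
apiVersion: workloads.x-k8s.io/v1alpha1
kind: RoleBasedGroup
metadata:
  name: llm-single-node
spec:
  roles:
    - name: server
      replicas: 1
      template:
        spec:
          containers:
            - name: inference
              image: my-inference-engine:latest   # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 8   # all GPUs on one node
```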

Multi Nodes

When the model is too large for a single Node to load all files, use multi-node distributed inference.

Examples
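Multi-node distributed inference typically splits the engine into a leader role and a worker role that are scheduled and updated together. The manifest below is a hypothetical sketch under the same schema assumptions as the earlier examples; the project's bundled examples show the real field names and a real engine image:

```yaml
# Illustrative multi-node deployment: one leader plus workers, each pod on
# its own node, gang-scheduled as a group. Field names are assumptions.
apiVersion: workloads.x-k8s.io/v1alpha1
kind: RoleBasedGroup
metadata:
  name: llm-multi-node
spec:
  roles:
    - name: leader
      replicas: 1
      template:
        spec:
          containers:
            - name: inference
              image: my-inference-engine:latest   # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 8
    - name: worker
      replicas: 3               # model shards spread across worker nodes
      template:
        spec:
          containers:
            - name: inference
              image: my-inference-engine:latest
              resources:
                limits:
                  nvidia.com/gpu: 8
```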

PD Disaggregated

Colocating the two phases and batching the prefill and decode computation across all users and requests not only leads to strong prefill-decode interference but also couples the resource allocation and parallelism plans of both phases. Disaggregating the prefill and decode computation improves the serving performance of large language models (LLMs).

Examples

Deploy a PD-disaggregated inference service with RBG.
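In a PD-disaggregated layout, prefill and decode run as separate roles so each phase gets its own replica count and resource allocation. The sketch below is hypothetical and uses the same assumed schema as the examples above:

```yaml
# Illustrative PD-disaggregated service: prefill and decode are separate
# roles, scaled independently. Field names are assumptions.
apiVersion: workloads.x-k8s.io/v1alpha1
kind: RoleBasedGroup
metadata:
  name: llm-pd-disagg
spec:
  roles:
    - name: prefill             # compute-bound: processes the full prompt
      replicas: 2
      template:
        spec:
          containers:
            - name: inference
              image: my-inference-engine:latest   # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 8
    - name: decode              # memory-bound: generates tokens one by one
      replicas: 4
      template:
        spec:
          containers:
            - name: inference
              image: my-inference-engine:latest
              resources:
                limits:
                  nvidia.com/gpu: 8
```

Because the roles are decoupled, the decode tier can be scaled out for token throughput without changing the prefill tier's parallelism plan.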