158 changes: 78 additions & 80 deletions site-src/guides/epp-configuration/config-text.md
@@ -5,8 +5,9 @@ The Inference Gateway (IGW) can be configured via a YAML file.
At this time the YAML file based configuration allows for:

1. The set of the lifecycle hooks (plugins) that are used by the IGW.
2. The configuration of the saturation detector
3. A set of feature gates that are used to enable experimental features.
2. The set of scheduling profiles that define how requests are scheduled to pods.
3. The configuration of the saturation detector
4. A set of feature gates that are used to enable experimental features.

The YAML file can either be specified as a path to a file or in-line as a parameter.

@@ -32,79 +33,17 @@ featureGates:

The first two lines of the configuration are constant and must appear as is.

The plugins section defines the set of plugins that will be instantiated and their parameters. This section is described in more detail in the section [Configuring Plugins via text](#configuring-plugins-via-text)
The plugins section defines the set of plugins that will be instantiated and their parameters. This section is described in more detail in the section [Plugin Configuration](#plugin-configuration).

The schedulingProfiles section defines the set of scheduling profiles that can be used in scheduling
requests to pods. This section is described in more detail in the section [Configuring Plugins via YAML](#configuring-plugins-via-yaml)
requests to pods. This section is described in more detail in the section [Scheduling Profiles](#scheduling-profiles).

The saturationDetector section configures the saturation detector, which is used to determine if special
action needs to be taken due to the system being overloaded or saturated. This section is described in more detail in the section [Saturation Detector configuration](#saturation-detector-configuration)
action needs to be taken due to the system being overloaded or saturated. This section is described in more detail in the section [Saturation Detector Configuration](#saturation-detector-configuration)

The featureGates section allows the enablement of experimental features of the IGW. This section is
described in more detail in the section [Feature Gates](#feature-gates)

## Configuring Plugins via YAML

The set of plugins that are used by the IGW is determined by how it is configured. The IGW is
primarily configured via a configuration file.

The configuration defines the set of plugins to be instantiated along with their parameters.
Each plugin can also be given a name, enabling the same plugin type to be instantiated multiple
times, if needed (such as when configuring multiple scheduling profiles).

Also defined is a set of SchedulingProfiles, which determine the set of plugins to be used when scheduling
a request. If one is not defined, a default one named `default` will be added and will reference all of
the instantiated plugins.

The set of plugins instantiated can include a Profile Handler, which determines which SchedulingProfiles
will be used for a particular request. A Profile Handler must be specified, unless the configuration only
contains one profile, in which case the `SingleProfileHandler` will be used.

In addition, the set of instantiated plugins can also include a picker, which chooses the actual pod to which
the request is scheduled after filtering and scoring. If one is not referenced in a SchedulingProfile, an
instance of `MaxScorePicker` will be added to the SchedulingProfile in question.

The plugins section defines the set of plugins that will be instantiated and their parameters.
Each entry in this section has the following form:

```yaml
- name: aName
  type: a-type
  parameters:
    parm1: val1
    parm2: val2
```

The fields in a plugin entry are:

- *name*, which is optional, provides a name by which the plugin instance can be referenced. If this
  field is omitted, the plugin's type will be used as its name.
- *type* specifies the type of the plugin to be instantiated.
- *parameters*, which is optional, defines the set of parameters used to configure the plugin in question.
  The actual set of parameters varies from plugin to plugin.

The schedulingProfiles section defines the set of scheduling profiles that can be used in scheduling
requests to pods. The number of scheduling profiles one defines depends on the use case. For simple
serving of requests, one is enough. For disaggregated prefill, two profiles are required. Each entry
in this section has the following form:

```yaml
- name: aName
  plugins:
  - pluginRef: plugin1
  - pluginRef: plugin2
    weight: 50
```

The fields in a schedulingProfile entry are:

- *name* specifies the scheduling profile's name.
- *plugins* specifies the set of plugins to be used when this scheduling profile is chosen for a request.
  Each entry in the schedulingProfile's plugins section has the following fields:
    - *pluginRef* is a reference to the name of the plugin instance to be used.
    - *weight* is the weight to be used if the referenced plugin is a scorer. If omitted, a weight of one
      will be used.

A complete configuration might look like this:
```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha1
@@ -208,21 +147,53 @@ schedulingProfiles:
plugins:
- pluginRef: prefix-cache-scorer
weight: 50
-pluginRef: max-score-picker
- pluginRef: max-score-picker
```

### Plugin Configuration
## Plugin Configuration

This section describes how to set up the various plugins that are available with the IGW.
The set of plugins that are used by the IGW is determined by how it is configured. The IGW is
primarily configured via a configuration file.

#### **SingleProfileHandler**
The configuration defines the set of plugins to be instantiated along with their parameters.
Each plugin can also be given a name, enabling the same plugin type to be instantiated multiple
times, if needed (such as when configuring multiple scheduling profiles).

The set of plugins instantiated can include a Profile Handler, which determines which SchedulingProfiles
will be used for a particular request. A Profile Handler must be specified, unless the configuration only
contains one profile, in which case the `SingleProfileHandler` will be used.

In addition, the set of instantiated plugins can also include a picker, which chooses the actual pod to which
the request is scheduled after filtering and scoring. If one is not referenced in a SchedulingProfile, an
instance of `MaxScorePicker` will be added to the SchedulingProfile in question.

The plugins section defines the set of plugins that will be instantiated and their parameters.
Each entry in this section has the following form:

```yaml
- name: aName
  type: a-type
  parameters:
    parm1: val1
    parm2: val2
```

The fields in a plugin entry are:

- *name*, which is optional, provides a name by which the plugin instance can be referenced. If this
  field is omitted, the plugin's type will be used as its name.
- *type* specifies the type of the plugin to be instantiated.
- *parameters*, which is optional, defines the set of parameters used to configure the plugin in question.
  The actual set of parameters varies from plugin to plugin.
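
As a rough sketch of the entry form described above, the fragment below instantiates one plugin without a name and the same plugin type twice under different names. The instance names `prefix-small` and `prefix-large` and the value `10000` are hypothetical and chosen only for illustration. The type names `max-score-picker` and `prefix-cache-scorer` are assumed from the `pluginRef` values used in the complete configuration earlier in this file.

```yaml
plugins:
# No name is given, so this instance is referenced by its type, "max-score-picker".
- type: max-score-picker
# Two instances of the same plugin type, told apart by their (hypothetical) names.
- name: prefix-small
  type: prefix-cache-scorer
  parameters:
    lruCapacityPerServer: 10000   # illustrative value, not a recommendation
- name: prefix-large
  type: prefix-cache-scorer
  parameters:
    lruCapacityPerServer: 31250   # the documented default
```

Naming the instances is what makes it possible for different scheduling profiles to reference differently configured copies of the same plugin type.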

### **SingleProfileHandler**

Selects a single profile which is always the primary profile.

- *Type*: single-profile-handler
- *Parameters*: none

#### **PrefixCacheScorer**
### **PrefixCacheScorer**

Scores pods based on how much of the prompt is believed to be in the pod's KV cache.

@@ -235,15 +206,15 @@ Scores pods based on the amount of the prompt is believed to be in the pod's KvC
- `lruCapacityPerServer` specifies the capacity of the LRU indexer in number of entries
per server (pod). If not specified defaults to `31250`

#### **LoRAAffinityScorer**
### **LoRAAffinityScorer**

Scores pods based on whether the requested LoRA adapter is already loaded in the pod's HBM, or if
the pod is ready to load the LoRA on demand.

- *Type*: lora-affinity-scorer
- *Parameters*: none

#### **MaxScorePicker**
### **MaxScorePicker**

Picks the pod with the maximum score from the list of candidates. This is the default picker plugin
if not specified.
@@ -253,7 +224,7 @@ if not specified.
- `maxNumOfEndpoints`: Maximum number of endpoints to pick from the list of candidates, based on
the scores of those endpoints. If not specified defaults to `1`.
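
For illustration, a picker entry that overrides this default might look like the fragment below. This is a sketch only: the type name `max-score-picker` is assumed from the `pluginRef` used in the complete configuration earlier in this file, and the value `2` is arbitrary.

```yaml
plugins:
- type: max-score-picker
  parameters:
    maxNumOfEndpoints: 2   # illustrative; when omitted, the documented default of 1 applies
```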

#### **RandomPicker**
### **RandomPicker**

Picks a random pod from the list of candidates.

@@ -262,7 +233,7 @@ Picks a random pod from the list of candidates.
- `maxNumOfEndpoints`: Maximum number of endpoints to pick from the list of candidates. If not
specified defaults to `1`.

#### **WeightedRandomPicker**
### **WeightedRandomPicker**

Picks pod(s) from the list of candidates based on weighted random sampling using the A-Res algorithm.

@@ -271,14 +242,14 @@ Picks pod(s) from the list of candidates based on weighted random sampling using
- `maxNumOfEndpoints`: Maximum number of endpoints to pick from the list of candidates. If not
specified defaults to `1`.

#### **KvCacheScorer**
### **KvCacheScorer**

Scores the candidate pods based on their KV cache utilization.

- *Type*: kv-cache-utilization-scorer
- *Parameters*: none

#### **QueueScorer**
### **QueueScorer**

Scores the list of candidate pods based on each pod's waiting queue size. The lower the
waiting queue size the pod has, the higher the score it will get (since it's more
@@ -288,7 +259,7 @@ available to serve new request).
- *Parameters*: none


#### **LoraAffinityScorer**
### **LoraAffinityScorer**

Scores the list of candidate pods based on the LoRA adapters loaded on the pod.
Pods with the adapter already loaded or able to be actively loaded will be
@@ -297,7 +268,34 @@ scored higher (since it's more available to serve new request).
- *Type*: lora-affinity-scorer
- *Parameters*: none

## Saturation Detector configuration
## Scheduling Profiles

The schedulingProfiles section defines the set of scheduling profiles that can be used in scheduling
requests to pods. If one is not defined, a default one named `default` will be added and will reference all of
the instantiated plugins.

The number of scheduling profiles one defines depends on the use case. For simple
serving of requests, one is enough. For disaggregated prefill, two profiles are required. Each entry
in this section has the following form:

```yaml
- name: aName
  plugins:
  - pluginRef: plugin1
  - pluginRef: plugin2
    weight: 50
```

The fields in a schedulingProfile entry are:

- *name* specifies the scheduling profile's name.
- *plugins* specifies the set of plugins to be used when this scheduling profile is chosen for a request.
  Each entry in the schedulingProfile's plugins section has the following fields:
    - *pluginRef* is a reference to the name of the plugin instance to be used.
    - *weight* is the weight to be used if the referenced plugin is a scorer. If omitted, a weight of one
      will be used.
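
A single-profile setup following the form above could be sketched as below. It assumes that plugins of the referenced types were instantiated without explicit names in the plugins section, so each `pluginRef` is simply the plugin's type. The profile name `default` and the weight `50` are illustrative choices, not requirements.

```yaml
schedulingProfiles:
- name: default
  plugins:
  # A scorer with an explicit weight.
  - pluginRef: prefix-cache-scorer
    weight: 50
  # A scorer with no weight given, so a weight of one is used.
  - pluginRef: kv-cache-utilization-scorer
  # The picker; weight applies only to scorers, so none is set here.
  - pluginRef: max-score-picker
```

With only one profile defined, no Profile Handler needs to be listed, since the `SingleProfileHandler` is used in that case.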

## Saturation Detector Configuration

The Saturation Detector is used to determine if the cluster is overloaded, i.e. saturated. When
the cluster is saturated, special actions will be taken depending on what has been enabled. At this time, sheddable requests will be dropped.