diff --git a/site-src/guides/epp-configuration/config-text.md b/site-src/guides/epp-configuration/config-text.md
index 2526091020..a49612166c 100644
--- a/site-src/guides/epp-configuration/config-text.md
+++ b/site-src/guides/epp-configuration/config-text.md
@@ -5,8 +5,9 @@ The Inference Gateway (IGW) can be configured via a YAML file.
 At this time the YAML file based configuration allows for:
 
 1. The set of the lifecycle hooks (plugins) that are used by the IGW.
-2. The configuration of the saturation detector
-3. A set of feature gates that are used to enable experimental features.
+2. The set of scheduling profiles that define how requests are scheduled to pods.
+3. The configuration of the saturation detector
+4. A set of feature gates that are used to enable experimental features.
 
 The YAML file can either be specified as a path to a file or in-line as a parameter.
 
@@ -32,79 +33,17 @@ featureGates:
 
 The first two lines of the configuration are constant and must appear as is.
 
-The plugins section defines the set of plugins that will be instantiated and their parameters. This section is described in more detail in the section [Configuring Plugins via text](#configuring-plugins-via-text)
+The plugins section defines the set of plugins that will be instantiated and their parameters. This section is described in more detail in the section [Plugin Configuration](#plugin-configuration).
 
 The schedulingProfiles section defines the set of scheduling profiles that can be used in scheduling
-requests to pods. This section is described in more detail in the section [Configuring Plugins via YAML](#configuring-plugins-via-yaml)
+requests to pods. This section is described in more detail in the section [Scheduling Profiles](#scheduling-profiles).
 
 The saturationDetector section configures the saturation detector, which is used to determine if special
-action needs to eb taken due to the system being overloaded or saturated. This section is described in more detail in the section [Saturation Detector configuration](#saturation-detector-configuration)
+action needs to eb taken due to the system being overloaded or saturated. This section is described in more detail in the section [Saturation Detector Configuration](#saturation-detector-configuration)
 
 The featureGates sections allows the enablement of experimental features of the IGW. This section is
 described in more detail in the section [Feature Gates](#feature-gates)
 
-## Configuring Plugins via YAML
-
-The set of plugins that are used by the IGW is determined by how it is configured. The IGW is
-primarily configured via a configuration file.
-
-The configuration defines the set of plugins to be instantiated along with their parameters.
-Each plugin can also be given a name, enabling the same plugin type to be instantiated multiple
-times, if needed (such as when configuring multiple scheduling profiles).
-
-Also defined is a set of SchedulingProfiles, which determine the set of plugins to be used when scheduling
-a request. If one is not defined, a default one names `default` will be added and will reference all of
-the instantiated plugins.
-
-The set of plugins instantiated can include a Profile Handler, which determines which SchedulingProfiles
-will be used for a particular request. A Profile Handler must be specified, unless the configuration only
-contains one profile, in which case the `SingleProfileHandler` will be used.
-
-In addition, the set of instantiated plugins can also include a picker, which chooses the actual pod to which
-the request is scheduled after filtering and scoring. If one is not referenced in a SchedulingProfile, an
-instance of `MaxScorePicker` will be added to the SchedulingProfile in question.
-
-The plugins section defines the set of plugins that will be instantiated and their parameters.
-Each entry in this section has the following form:
-
-```yaml
-- name: aName
-  type: a-type
-  parameters:
-    parm1: val1
-    parm2: val2
-```
-
-The fields in a plugin entry are:
-
-- *name* which is optional, provides a name by which the plugin instance can be referenced. If this
-field is omitted, the plugin's type will be used as its name.
-- *type* specifies the type of the plugin to be instantiated.
-- *parameters* which is optional, defines the set of parameters used to configure the plugin in question.
-The actual set of parameters varies from plugin to plugin.
-
-The schedulingProfiles section defines the set of scheduling profiles that can be used in scheduling
-requests to pods. The number of scheduling profiles one defines, depends on the use case. For simple
-serving of requests, one is enough. For disaggregated prefill, two profiles are required. Each entry
-in this section has the following form:
-
-```yaml
-- name: aName
-  plugins:
-  - pluginRef: plugin1
-  - pluginRef: plugin2
-    weight: 50
-```
-
-The fields in a schedulingProfile entry are:
-
-- *name* specifies the scheduling profile's name.
-- *plugins* specifies the set of plugins to be used when this scheduling profile is chosen for a request.
-Each entry in the schedulingProfile's plugins section has the following fields:
-  - *pluginRef* is a reference to the name of the plugin instance to be used
-  - *weight* is the weight to be used if the referenced plugin is a scorer. If omitted, a weight of one
-    will be used.
-
 A complete configuration might look like this:
 ```yaml
 apiVersion: inference.networking.x-k8s.io/v1alpha1
@@ -208,21 +147,53 @@ schedulingProfiles:
   plugins:
   - pluginRef: prefix-cache-scorer
     weight: 50
-  -pluginRef: max-score-picker
+  - pluginRef: max-score-picker
 ```
 
-### Plugin Configuration
+## Plugin Configuration
 
-This section describes how to setup the various plugins that are available with the IGW.
+The set of plugins that are used by the IGW is determined by how it is configured. The IGW is
+primarily configured via a configuration file.
 
-#### **SingleProfileHandler**
+The configuration defines the set of plugins to be instantiated along with their parameters.
+Each plugin can also be given a name, enabling the same plugin type to be instantiated multiple
+times, if needed (such as when configuring multiple scheduling profiles).
+
+The set of plugins instantiated can include a Profile Handler, which determines which SchedulingProfiles
+will be used for a particular request. A Profile Handler must be specified, unless the configuration only
+contains one profile, in which case the `SingleProfileHandler` will be used.
+
+In addition, the set of instantiated plugins can also include a picker, which chooses the actual pod to which
+the request is scheduled after filtering and scoring. If one is not referenced in a SchedulingProfile, an
+instance of `MaxScorePicker` will be added to the SchedulingProfile in question.
+
+The plugins section defines the set of plugins that will be instantiated and their parameters.
+Each entry in this section has the following form:
+
+```yaml
+- name: aName
+  type: a-type
+  parameters:
+    parm1: val1
+    parm2: val2
+```
+
+The fields in a plugin entry are:
+
+- *name* which is optional, provides a name by which the plugin instance can be referenced. If this
+field is omitted, the plugin's type will be used as its name.
+- *type* specifies the type of the plugin to be instantiated.
+- *parameters* which is optional, defines the set of parameters used to configure the plugin in question.
+The actual set of parameters varies from plugin to plugin.
+
+### **SingleProfileHandler**
 
 Selects a single profile which is always the primary profile.
 
 - *Type*: single-profile-handler
 - *Parameters*: none
 
-#### **PrefixCacheScorer**
+### **PrefixCacheScorer**
 
 Scores pods based on the amount of the prompt is believed to be in the pod's KvCache.
 
@@ -235,7 +206,7 @@ Scores pods based on the amount of the prompt is believed to be in the pod's KvC
   - `lruCapacityPerServer` specifies the capacity of the LRU indexer in number of entries per server
     (pod). If not specified defaults to `31250`
 
-#### **LoRAAffinityScorer**
+### **LoRAAffinityScorer**
 
 Scores pods based on whether the requested LoRA adapter is already loaded in the pod's HBM, or if
 the pod is ready to load the LoRA on demand.
@@ -243,7 +214,7 @@ the pod is ready to load the LoRA on demand.
 - *Type*: lora-affinity-scorer
 - *Parameters*: none
 
-#### **MaxScorePicker**
+### **MaxScorePicker**
 
 Picks the pod with the maximum score from the list of candidates. This is the default picker plugin
 if not specified.
@@ -253,7 +224,7 @@ if not specified.
   - `maxNumOfEndpoints`: Maximum number of endpoints to pick from the list of candidates, based on
     the scores of those endpoints. If not specified defaults to `1`.
 
-#### **RandomPicker**
+### **RandomPicker**
 
 Picks a random pod from the list of candidates.
 
@@ -262,7 +233,7 @@ Picks a random pod from the list of candidates.
   - `maxNumOfEndpoints`: Maximum number of endpoints to pick from the list of candidates. If not
     specified defaults to `1`.
 
-#### **WeightedRandomPicker**
+### **WeightedRandomPicker**
 
 Picks pod(s) from the list of candidates based on weighted random sampling using A-Res algorithm.
 
@@ -271,14 +242,14 @@ Picks pod(s) from the list of candidates based on weighted random sampling using
   - `maxNumOfEndpoints`: Maximum number of endpoints to pick from the list of candidates. If not
     specified defaults to `1`.
 
-#### **KvCacheScorer**
+### **KvCacheScorer**
 
 Scores the candidate pods based on their KV cache utilization.
 
 - *Type*: kv-cache-utilization-scorer
 - *Parameters*: none
 
-#### **QueueScorer**
+### **QueueScorer**
 
 Scores list of candidate pods based on the pod's waiting queue size.
 The lower the waiting queue size the pod has, the higher the score it will get (since it's more
@@ -288,7 +259,7 @@ available to serve new request).
 - *Parameters*: none
 
 
-#### **LoraAffinityScorer**
+### **LoraAffinityScorer**
 
 Scores list of candidate pods based on the LoRA adapters loaded on the pod.
 Pods with the adapter already loaded or able to be actively loaded will be
@@ -297,7 +268,34 @@ scored higher (since it's more available to serve new request).
 - *Type*: lora-affinity-scorer
 - *Parameters*: none
 
-## Saturation Detector configuration
+## Scheduling Profiles
+
+The schedulingProfiles section defines the set of scheduling profiles that can be used in scheduling
+requests to pods. If one is not defined, a default one names `default` will be added and will reference all of
+the instantiated plugins.
+
+The number of scheduling profiles one defines, depends on the use case. For simple
+serving of requests, one is enough. For disaggregated prefill, two profiles are required. Each entry
+in this section has the following form:
+
+```yaml
+- name: aName
+  plugins:
+  - pluginRef: plugin1
+  - pluginRef: plugin2
+    weight: 50
+```
+
+The fields in a schedulingProfile entry are:
+
+- *name* specifies the scheduling profile's name.
+- *plugins* specifies the set of plugins to be used when this scheduling profile is chosen for a request.
+Each entry in the schedulingProfile's plugins section has the following fields:
+  - *pluginRef* is a reference to the name of the plugin instance to be used
+  - *weight* is the weight to be used if the referenced plugin is a scorer. If omitted, a weight of one
+    will be used.
+
+## Saturation Detector Configuration
 
 The Saturation Detector is used to determine if the the cluster is overloaded, i.e. saturated. When the cluster is saturated
 special actions will be taken depending what has been enabled. At this time, sheddable requests will be dropped.