Model Group Prefix Aware Routing #4

@pjb157

Description

Currently, Model Groups are load balanced by a Kubernetes Service, which round-robins requests across the replicas of the model deployment. We can build a more intelligent routing pattern that takes advantage of prefix caching by sending similarly prefixed prompts to the same replicas. This helps decrease latency for predictable workloads and increases throughput.

Implementation details

We have a couple choices:

Route requests based on sender

This is a simple implementation that can be done natively in Kubernetes without custom components. It bets that similar workloads come from the same clients, so routing on that basis would give you most of the benefit without unpacking the HTTP body.

We can route on:

  • Sender IP
  • API key
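As a minimal sketch of the mapping this option implies (the replica names and the hash-based helper below are hypothetical; in practice this could be a Kubernetes primitive such as ClientIP session affinity rather than custom code):

```python
import hashlib

def pick_replica(api_key: str, replicas: list[str]) -> str:
    # Stable hash of the API key, so the same client always lands on
    # the same pod. For Sender IP routing, the client address would be
    # hashed instead of the key.
    digest = hashlib.sha256(api_key.encode()).digest()
    return replicas[int.from_bytes(digest[:8], "big") % len(replicas)]

replicas = ["model-0", "model-1", "model-2"]
# The same key always maps to the same replica:
assert pick_replica("client-a", replicas) == pick_replica("client-a", replicas)
```

Because the mapping never inspects the request body, it preserves the "high level routing" property described above; the trade-off is that it is only as good as the client-to-workload assumption.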

Strengths

  • Routing stays high level and never needs to unpack the data sent in the requests.
  • Simpler architecture: no new components needed. It is resilient and battle tested because it runs at the Kubernetes level, which is relied on by many other production-grade distributed systems.

Weaknesses

  • No knowledge of downstream system usage: sending a similar request to the same pod could be disadvantageous if that pod is already at capacity while other replicas have space. We can write algorithms with configurable thresholds for when to assume a replica is at full occupancy, but these are only models and will not be fully reliable.
  • If the assumption that most similar requests come from the same clients does not hold, this is no better (and could be worse) than round robin.

Route based on a history of seen requests

The router could asynchronously save a copy of the prompt sent to each replica into some sort of in-memory datastore. Then, once that store reaches some saturation, the router can start matching incoming prompts against it to make routing decisions.
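A sketch of that matching step, assuming an in-process history and a simple longest-shared-prefix comparison (the class, replica names, and `min_prefix` threshold are all invented for illustration; a real implementation would use a prefix trie and an external datastore):

```python
class PrefixRouter:
    """Remember which replica each prompt went to, then send new
    prompts to the replica whose recorded prompt shares the longest
    prefix; fall back to round robin when nothing matches."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.history = []  # (prompt, replica) pairs seen so far
        self.rr = 0        # round-robin fallback counter

    @staticmethod
    def _shared_prefix(a, b):
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n

    def route(self, prompt, min_prefix=8):
        best = max(self.history,
                   key=lambda e: self._shared_prefix(prompt, e[0]),
                   default=None)
        if best and self._shared_prefix(prompt, best[0]) >= min_prefix:
            replica = best[1]  # likely warm prefix cache on this pod
        else:
            replica = self.replicas[self.rr % len(self.replicas)]
            self.rr += 1
        self.history.append((prompt, replica))
        return replica

router = PrefixRouter(["model-0", "model-1"])
first = router.route("Translate to French: hello")
# A prompt sharing a long prefix is routed to the same replica:
assert router.route("Translate to French: goodbye") == first
```

Note that the history grows without bound here; the eviction problem this section's weaknesses describe is exactly the part the sketch leaves out.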

Strengths

  • Actually makes decisions based on the individual request, so a prompt is more likely to hit the cache when it reaches a replica.

Weaknesses

  • You are essentially rebuilding a slimmed-down version of the prefix caching inside the inference engine. The two can easily get out of sync and invalidate the router's choices. It is also hard to know when the router should evict previously seen prompts without coupling it too closely to the inference engine.
  • A more complicated component is needed to unpack prompts and build a prefix trie from them.
  • Requires unwrapping the request body to make decisions.
  • Still no idea of the occupancy of each replica; in a low-traffic regime it can make very poor decisions and overload a single instance.

Route based on the state of each inference engine

This would replicate the state of the KV cache of each inference engine at the router level. This is what dynamo does, and it requires quite a few additional components.
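To make the idea concrete, here is a sketch of the scoring step such a router would run, assuming each replica publishes hashes of the fixed-size token blocks it currently holds in its KV cache (the block size, hashing scheme, and function names are invented; dynamo's actual mechanism involves more machinery):

```python
import hashlib

BLOCK = 16  # tokens per cache block (assumed size)

def block_hashes(tokens):
    # Chain-hash full blocks so a block's hash depends on everything
    # before it, mirroring how paged KV caches identify shared prefixes.
    hashes, prev = [], b""
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        h = hashlib.sha256(prev + repr(tokens[i:i + BLOCK]).encode()).digest()
        hashes.append(h)
        prev = h
    return hashes

def best_replica(prompt_tokens, replica_caches):
    # Pick the replica whose replicated cache state covers the most
    # leading blocks of the incoming prompt.
    want = block_hashes(prompt_tokens)

    def score(cached):
        n = 0
        for h in want:
            if h not in cached:
                break
            n += 1
        return n

    return max(replica_caches, key=lambda name: score(replica_caches[name]))
```

Keeping `replica_caches` consistent with the engines' real eviction behaviour is the hard part, and is where the additional components come in.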

Strengths

  • Routes intelligently and has been proven to improve the throughput of large-scale systems.

Weaknesses

  • The component set required for this feature introduces a lot of surface area for bugs.
  • This replicates work done elsewhere; it would be better to integrate dynamo as a Model Group option rather than rebuild it for single-engine deployment groups.
