Without this scaling special sauce (APIM using retries with exponential backoff), once the initial rate limit is hit, say due to many concurrent users sending too many prompts, a 429 (server busy) response code is returned. As subsequent prompts/completions continue to arrive, the problem compounds quickly: more 429 errors are returned, and the error rate climbs higher and higher.
It is the retries with exponential backoff that allow you to scale to many thousands of concurrent users with very low error rates, providing scalability for the AOAI service. I will include metrics in the near future on scaling 5K concurrent users with low latency and an error rate below 0.02%.
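In APIM this behavior is configured declaratively in policy XML, but the underlying backoff logic can be sketched in Python. This is a minimal illustration, not the APIM implementation; `send_request`, `base_delay`, and `max_retries` are all hypothetical names chosen for the example:

```python
import random
import time


def complete_with_backoff(send_request, max_retries=5, base_delay=1.0):
    """Retry a request with exponential backoff plus jitter on 429 responses.

    `send_request` is a hypothetical callable returning (status_code, body).
    """
    for attempt in range(max_retries + 1):
        status, body = send_request()
        if status != 429:
            return status, body
        if attempt == max_retries:
            break
        # Delay doubles on each attempt; jitter spreads out retries from
        # many concurrent clients so they don't hammer the service in lockstep.
        delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
        time.sleep(delay)
    return status, body
```

The key idea is that clients back off progressively instead of immediately re-sending, which is what prevents the 429 cascade described above.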

In addition to using retries with exponential backoff, Azure APIM also supports content-based routing. Content-based routing is where the message routing endpoint is determined by the **content** of the message at runtime. You can leverage this to send AOAI prompts to multiple AOAI accounts, including both PTU and TPM deployments, to meet further scaling requirements.
For example, if your API request names a specific model deployment, say gpt-35-turbo-16k, you can route that request to your GPT 3.5 Turbo (16K) PTU deployment. We won't go into too much detail here, but there are additional repo examples in the references section at the end of this repo.
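The routing decision itself is simple: inspect the requested deployment and pick a backend. A minimal sketch in Python, with hypothetical backend URLs and a single assumed PTU mapping:

```python
def pick_backend(deployment: str) -> str:
    """Map a requested model deployment to a backend AOAI endpoint.

    The URLs and the mapping below are illustrative assumptions, not real
    endpoints: PTU-backed deployments go to a reserved-capacity account,
    and everything else falls back to a pay-as-you-go (TPM) account.
    """
    ptu_backends = {
        "gpt-35-turbo-16k": "https://aoai-ptu.example.com",
    }
    return ptu_backends.get(deployment, "https://aoai-paygo.example.com")
```

In APIM the same decision would be expressed in policy (e.g. choosing a backend based on the request body), but the lookup-with-fallback shape is the same.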

In the [infra](./infra/) directory you will find a sample Bicep template to deploy Azure APIM and an API that applies this exponential retry logic, with optional failover between different Azure OpenAI deployments. You will need an Azure subscription and two Azure OpenAI LLM deployments. Once deployed, you will need to give APIM's system-assigned managed identity the Cognitive Services OpenAI User role on the Azure OpenAI accounts it's connected to, and add any required networking configurations.
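The deployment and role assignment can be done with the Azure CLI. The resource-group name, template filename, and principal/account identifiers below are placeholders (the actual template name lives in the [infra](./infra/) directory); the role name comes from the instructions above:

```shell
# Deploy the sample Bicep template (template filename is a placeholder).
az deployment group create \
  --resource-group <my-rg> \
  --template-file infra/<template>.bicep

# Grant APIM's system-assigned managed identity access to each AOAI account.
az role assignment create \
  --assignee <apim-principal-id> \
  --role "Cognitive Services OpenAI User" \
  --scope "/subscriptions/<sub-id>/resourceGroups/<my-rg>/providers/Microsoft.CognitiveServices/accounts/<aoai-account>"
```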