cld2labs/llama-3.1-8b-instruct#98
Conversation
…ell EI Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>
Model deployment works, but testing inference returns a Gateway Timeout error. The vLLM pod is fine, but ingress-nginx-controller reports an upstream timeout:
| @@ -0,0 +1,73 @@ | |||
| # Deployed with EI Version-1.3.1 | |||
We can delete this file since all the instructions will be in deployment.md.
Inference is functional after increasing the ingress and APISIX timeouts to 300s.
| kubectl get ingress -A | grep <model-name> | ||
| ``` | ||
| Then annotate each ingress: |
| Then annotate each ingress: | |
| Then annotate **EACH** ingress: |
Let's emphasize EACH since two ingresses are created.
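To illustrate the "annotate each ingress" step, here is a minimal sketch that prints (but does not run) one `kubectl annotate` command per matching ingress. The helper name `build_annotate_cmds` is hypothetical and not part of this PR. The sketch assumes the standard ingress-nginx `proxy-read-timeout` and `proxy-send-timeout` annotations and the 300s value reported to work in this thread. It does not cover the APISIX timeout, which is configured separately.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): reads `kubectl get ingress -A`
# output on stdin and prints an annotate command for every ingress whose
# line contains the given model name. Nothing is applied to the cluster.
build_annotate_cmds() {
  local model="$1" timeout="${2:-300}"   # 300s is the value used in this thread
  awk -v m="$model" -v t="$timeout" '
    # Column 1 is the namespace, column 2 the ingress name; skip the header row.
    $1 != "NAMESPACE" && index($0, m) {
      p = "nginx.ingress.kubernetes.io/proxy-"
      printf "kubectl annotate ingress %s -n %s --overwrite %sread-timeout=%s %ssend-timeout=%s\n", $2, $1, p, t, p, t
    }'
}
```

A possible usage: run `kubectl get ingress -A | build_annotate_cmds <model-name>`, review the printed commands (there should be one per ingress), and only then pipe the output to `sh` to apply them.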
| **Cause:** | ||
| CPU-based model inference (`vllm-cpu`) generates tokens at ~0.3-0.4 tokens/s. Responses requiring more than ~24 tokens exceed the default 60s upstream timeout enforced by ingress-nginx and APISIX. |
Performance differs for every Xeon SKU and will change over time. Let's keep the note generic and mention only that the root cause is the response time exceeding the default 60-second upstream timeout.
| **Notes:** | ||
| - The nginx ingress annotation takes effect immediately; no pod restart required. | ||
| - For GPU-based deployments this timeout is rarely needed as throughput is significantly higher (30-50 tokens/s vs 0.3-0.4 tokens/s on CPU). |
Same here: let's remove the performance numbers, as they will vary by SKU, by config, and over time.
Summary
third_party/Dell/model-deployment/llama-3.1-8b-instruct/