cld2labs/llama-3.1-8b-instruct#98

Open
arpannookala-12 wants to merge 8 commits into opea-project:main from cld2labs:cld2labs/llama-3.1-8b-instruct

@arpannookala-12
Contributor

Summary

  • Adds model card for llama-3.1-8b-instruct (Meta) under third_party/Dell/model-deployment/llama-3.1-8b-instruct/
  • Adds Helm-based deployment guide for deploying llama-3.1-8b-instruct via vLLM on Gaudi and CPU (Xeon) with Keycloak OIDC and APISIX ingress

…ell EI

Signed-off-by: arpannookala-12 <ganesh.arpan.nookala@cloud2labs.com>
alexsin368 self-requested a review April 29, 2026 04:18
@alexsin368
Collaborator

Model deployment works, but testing inference returns a Gateway Timeout (504) error.

The vLLM pod is fine, but ingress-nginx-controller reports an upstream timeout:

2026/05/18 23:31:52 [error] 265195#265195: *10094624 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 172.17.23.1, server: api.example.com, request: "POST /Llama-3.1-8B-Instruct-vllmcpu/v1/completions HTTP/2.0", upstream: "http://10.233.104.80:9080/Llama-3.1-8B-Instruct-vllmcpu/v1/completions", host: "api.example.com"
172.17.23.1 - - [18/May/2026:23:31:52 +0000] "POST /Llama-3.1-8B-Instruct-vllmcpu/v1/completions HTTP/2.0" 504 160 "-" "curl/7.81.0" 1233 60.001 [auth-apisix-auth-apisix-gateway-80] [] 10.233.104.80:9080 0 60.000 504 b90b75191948d3bf2aff518dc7b72510
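
For triage, access-log lines like the one above can be filtered mechanically. The sketch below is illustrative and not part of this PR. It assumes ingress-nginx's default access-log format, where the third field from the end is the upstream response time, and that the request line has exactly three tokens (method, path, protocol). `slow_requests` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read ingress-nginx access-log lines on stdin and print
# the requests whose upstream response time reached the given threshold.
# Assumes the default ingress-nginx access-log format.
slow_requests() {
  threshold="${1:-60}"   # seconds; 60 matches the timeout seen in the log above
  awk -v t="$threshold" '
    # Count from the end of the line so a user agent containing spaces
    # does not shift the upstream fields.
    NF >= 21 && ($(NF-2) + 0) >= (t + 0) {
      method = $6
      gsub(/"/, "", method)          # strip the opening quote of the request line
      printf "%s %s status=%s upstream_time=%s\n", method, $7, $9, $(NF-2)
    }'
}
```

One possible usage, with the namespace and pod name left as placeholders: `kubectl logs -n <namespace> <ingress-nginx-controller-pod> | slow_requests 60`.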

@@ -0,0 +1,73 @@

# Deployed with EI Version-1.3.1
Collaborator

We can delete this file, since all of the instructions will be in deployment.md.

@alexsin368
Collaborator

Inference is functional after increasing the ingress and APISIX timeouts to 300s.

kubectl get ingress -A | grep <model-name>
```

Then annotate each ingress:
Collaborator

Suggested change
Then annotate each ingress:
Then annotate **EACH** ingress:

Let's emphasize EACH, since two ingresses are created.
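
To make "each" hard to miss, the two steps could also be combined into a loop. This is only a sketch, not the command from the guide: `annotate_ingress_timeouts` is a hypothetical helper, and the annotation keys are the standard ingress-nginx `proxy-read-timeout` and `proxy-send-timeout` annotations, which are assumed here to match what the guide sets.

```shell
#!/bin/sh
# Hypothetical helper: annotate every ingress whose listing matches the model
# name. Annotation keys are an assumption (standard ingress-nginx timeouts);
# adjust them to whatever the deployment guide actually uses.
annotate_ingress_timeouts() {
  model="$1"             # e.g. the <model-name> used in the grep above
  timeout="${2:-300}"    # seconds

  # With -A and --no-headers, the first two columns are NAMESPACE and NAME.
  kubectl get ingress -A --no-headers | grep -F -- "$model" |
  while read -r ns name _; do
    kubectl annotate ingress "$name" -n "$ns" --overwrite \
      "nginx.ingress.kubernetes.io/proxy-read-timeout=$timeout" \
      "nginx.ingress.kubernetes.io/proxy-send-timeout=$timeout" </dev/null
  done
}
```

Calling `annotate_ingress_timeouts <model-name> 300` would then touch every matching ingress in one pass, which avoids annotating only one of the two.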


**Cause:**

CPU-based model inference (`vllm-cpu`) generates tokens at ~0.3-0.4 tokens/s. Responses requiring more than ~24 tokens exceed the default 60s upstream timeout enforced by ingress-nginx and APISIX.
Collaborator

Performance differs for every Xeon SKU and will change over time. Let's keep the note generic and mention only the root cause: the upstream response takes longer than the default 60-second timeout.

**Notes:**

- The nginx ingress annotation takes effect immediately; no pod restart required.
- For GPU-based deployments this timeout is rarely needed as throughput is significantly higher (30-50 tokens/s vs 0.3-0.4 tokens/s on CPU).
Collaborator

Same here: let's remove the performance numbers, as they will vary by SKU and configuration, and over time.

2 participants