cld2labs/llama-3.1-8b-instruct #98
base: main
Changes from all commits
d68922a
9e3c97d
806461e
7bcf2bc
8eaa078
3410c1d
1d52bc5
971441a
This file was deleted.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,101 @@ | ||
| ## Step 1: Prerequisites to Deploy Llama-3.1-8B-Instruct Model on Xeon with Keycloak | ||
|
|
||
| Ensure the Enterprise Inference stack with Keycloak is already deployed before proceeding. | ||
|
|
||
| Edit `core/scripts/generate-token.sh` and set your values before sourcing it: | ||
|
|
||
| | Variable | Description | | ||
| | ------------------------- | ------------------------------------------------------------------------ | | ||
| | `BASE_URL` | Hostname of your cluster (e.g. `api.example.com`), without `https://` | | ||
| | `KEYCLOAK_ADMIN_USERNAME` | Keycloak admin username | | ||
| | `KEYCLOAK_PASSWORD` | Keycloak admin password | | ||
| | `KEYCLOAK_CLIENT_ID` | Keycloak client ID configured during EI deployment | | ||
|
|
||
| Then run: | ||
|
|
||
| ```bash | ||
| export HUGGING_FACE_HUB_TOKEN="your_token_here" | ||
|
|
||
| cd ~/Enterprise-Inference | ||
| source core/scripts/generate-token.sh | ||
| ``` | ||
|
|
||
| This exports: `BASE_URL`, `KEYCLOAK_CLIENT_ID`, `KEYCLOAK_CLIENT_SECRET`, and `TOKEN`. | ||
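Before moving on to Step 2, it can help to confirm that the variables the Helm command reads are actually set. The following is a minimal sketch, not part of the repository: `require_vars` and `base_url_ok` are hypothetical helper names introduced here for illustration.

```bash
# Hypothetical helpers (not part of Enterprise-Inference).

# Report which of the named environment variables are unset or empty.
# Returns 0 when all are set, 1 otherwise.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# BASE_URL is expected as a bare hostname, without a scheme.
base_url_ok() {
  case "$1" in
    http://*|https://*) return 1 ;;
    *) return 0 ;;
  esac
}
```

A possible call after sourcing the script: `require_vars HUGGING_FACE_HUB_TOKEN BASE_URL KEYCLOAK_CLIENT_ID KEYCLOAK_CLIENT_SECRET TOKEN && base_url_ok "$BASE_URL" && echo "ready for Step 2"`.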
|
|
||
| ## Step 2: Deploy Llama-3.1-8B-Instruct Model | ||
|
|
||
| ```bash | ||
| helm install vllm-llama-8b ./core/helm-charts/vllm \ | ||
| --values ./core/helm-charts/vllm/xeon-values.yaml \ | ||
| --set LLM_MODEL_ID="meta-llama/Llama-3.1-8B-Instruct" \ | ||
| --set global.HUGGINGFACEHUB_API_TOKEN="$HUGGING_FACE_HUB_TOKEN" \ | ||
| --set ingress.enabled=true \ | ||
| --set ingress.secretname="${BASE_URL}" \ | ||
| --set ingress.host="${BASE_URL}" \ | ||
| --set oidc.client_id="$KEYCLOAK_CLIENT_ID" \ | ||
| --set oidc.client_secret="$KEYCLOAK_CLIENT_SECRET" \ | ||
| --set apisix.enabled=true \ | ||
| --set tensor_parallel_size="1" \ | ||
| --set pipeline_parallel_size="1" | ||
| ``` | ||
|
|
||
| ## Step 3: Verify the Deployment | ||
|
|
||
| ```bash | ||
| kubectl get pods | ||
| kubectl get apisixroutes | ||
| ``` | ||
|
|
||
| Expected Output: | ||
|
|
||
| ``` | ||
| NAME READY STATUS RESTARTS | ||
| keycloak-0 1/1 Running 0 | ||
| keycloak-postgresql-0 1/1 Running 0 | ||
| vllm-llama-8b-<hash>-<hash> 1/1 Running 0 | ||
| ``` | ||
|
|
||
| > Note: The pod name suffix `<hash>-<hash>` is auto-generated by Kubernetes and will differ on each deployment. Ensure all pods show `1/1 Running`. | ||
|
|
||
| ``` | ||
| NAME HOSTS | ||
| vllm-llama-8b-apisixroute api.example.com | ||
| ``` | ||
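Instead of polling `kubectl get pods` by hand, the check can be scripted. This is a rough sketch under two assumptions that the guide does not confirm: that the chart creates a Deployment named after the Helm release (`vllm-llama-8b`, suggested by the pod name pattern above), and that 30 minutes is a generous enough wait for the model to load. `wait_for_model` is a hypothetical helper name.

```bash
# Sketch: block until the model Deployment reports ready, then list the routes.
# Assumption: the Deployment is named after the Helm release; check with
# `kubectl get deployments` and adjust if it differs.
wait_for_model() {
  release="${1:-vllm-llama-8b}"
  timeout="${2:-30m}"   # arbitrary default, tune for your environment
  kubectl rollout status "deployment/${release}" --timeout="${timeout}" &&
    kubectl get apisixroutes
}
```

Calling `wait_for_model` with no arguments uses the release name from Step 2.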
|
|
||
| ## Step 4: Test the Deployed Model | ||
|
|
||
| ```bash | ||
| curl -k https://${BASE_URL}/Llama-3.1-8B-Instruct-vllmcpu/v1/completions \ | ||
| -X POST \ | ||
| -H "Content-Type: application/json" \ | ||
| -H "Authorization: Bearer $TOKEN" \ | ||
| -d '{ | ||
| "model": "meta-llama/Llama-3.1-8B-Instruct", | ||
| "prompt": "What is Deep Learning?", | ||
| "max_tokens": 25, | ||
| "temperature": 0 | ||
| }' | ||
| ``` | ||
|
|
||
| If successful, the model will return a completion response. | ||
|
|
||
| ## To undeploy the model | ||
|
|
||
| ```bash | ||
| helm uninstall vllm-llama-8b | ||
| ``` | ||
|
|
||
| ## Parameters | ||
|
|
||
| | Parameter | Description | | ||
| | ---------------------------------------------------------- | ------------------------------------------------------------------------------------------------- | | ||
| | `--set LLM_MODEL_ID="meta-llama/Llama-3.1-8B-Instruct"` | Defines the target model from **Hugging Face** to deploy. | | ||
| | `--set global.HUGGINGFACEHUB_API_TOKEN="..."` | Authenticates access to gated or private Hugging Face models. Replace with your own secure token. | | ||
| | `--set ingress.enabled=true` | Enables Kubernetes **Ingress** to expose the model service externally. | | ||
| | `--set ingress.host="${BASE_URL}"` | Public hostname or FQDN for the inference endpoint (maps to your Ingress controller IP). | | ||
| | `--set ingress.secretname="${BASE_URL}"` | Kubernetes **TLS Secret** used for HTTPS termination at the ingress layer. | | ||
| | `--set oidc.client_id="..."` | Keycloak OIDC client ID used for token-based authentication. | | ||
| | `--set oidc.client_secret="..."` | Keycloak OIDC client secret corresponding to the client ID. | | ||
| | `--set apisix.enabled=true` | Enables **APISIX** as the API gateway for routing and authentication. | | ||
| | `--set tensor_parallel_size="1"`                          | Number of tensor parallel workers. Kept at `1` here because this deployment targets Xeon (CPU), not Gaudi cards. | | ||
| | `--set pipeline_parallel_size="1"` | Number of pipeline parallel stages. Typically `1` for single-node deployments. | |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,64 @@ | ||
| # Llama-3.1-8B-Instruct | ||
|
|
||
| This model uses Llama-3.1-8B-Instruct, an 8-billion-parameter instruction-tuned model from Meta Platforms, Inc. (Meta AI). It belongs to the Llama 3.1 model family and is optimized for multilingual dialogue, code tasks, and general instruction-following across a large context window. | ||
|
|
||
| For full details including model specifications, licensing, intended use, safety guidance, and example prompts, see the official Hugging Face page: | ||
|
|
||
| https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct | ||
|
|
||
| This model provides inference services only; weights are hosted by Hugging Face under Meta’s license. | ||
|
|
||
| Ensure compliance with the Llama 3.1 Community License Agreement before using this model. | ||
|
|
||
| ### Model Attribution | ||
|
|
||
| **Developer:** Meta Platforms, Inc. (Meta AI) | ||
|
|
||
| **Purpose:** Instruction-following model for dialogue, code generation/completion, multilingual tasks | ||
|
|
||
| **Sizes/Variants:** 8 B parameters (instruction tuned); the Llama 3.1 family also includes 70 B and 405 B parameter variants | ||
|
|
||
| **Modalities:** Text input → Text (including code) output | ||
|
|
||
| **Parameter Size:** ~8 billion | ||
|
|
||
| **Max Context:** Up to ~128 k tokens (for the 3.1 family) | ||
|
|
||
| **License:** Llama 3.1 Community License (custom commercial license) | ||
|
|
||
| **Minimum required CPU Cores:** 157 | ||
|
|
||
| **Minimum required PCIe Cards:** 1 | ||
|
|
||
| ### Usage Notice | ||
|
|
||
| **By using this model, you agree that:** | ||
|
|
||
| - Inputs and outputs are processed through Llama-3.1-8B-Instruct under Meta’s Community License. | ||
| - You will comply with Meta’s licensing terms, including restrictions on redistribution, commercial scale-use thresholds, attribution (“Built with Llama”), and acceptable use policy. | ||
| - All generated content (text or code) must be reviewed for accuracy, compliance, and safety before deployment. | ||
| - The model should not be used for generating malicious content, disallowed content, or automating decisions in high-risk or regulated systems without appropriate safeguards. | ||
|
|
||
| ### Intended Applications | ||
|
|
||
| - Instruction-following chatbots and assistants (multilingual) | ||
| - Code generation, completion, refactoring tasks (Python, Java, JavaScript, etc.) | ||
| - Multilingual support (English, German, French, Italian, Portuguese, Hindi, Spanish, Thai) and potentially others with fine-tuning. | ||
| - Large-context tasks: summarization of long documents, dialog over long history, RAG (retrieval-augmented generation) over extended context. | ||
| - Research, prototyping, and commercial workflows (subject to license terms). | ||
|
|
||
| ### Limitations | ||
|
|
||
| - Although capable, the 8 B size still has trade-offs: accuracy and depth of reasoning may lag behind much larger models. | ||
| - As with all large language models, risk of hallucinations (incorrect statements), biases, or unsafe outputs remains. | ||
| - The custom license restricts certain uses (e.g., if your product has > 700 million monthly active users you may require a special license) as described in Meta’s license terms. | ||
| - The model is primarily text → text; tool use and vision/multimodal input are not guaranteed unless you fine-tune or wrap it appropriately. | ||
| - Running it efficiently still requires significant hardware resources for full context and best performance. | ||
|
|
||
| ### References | ||
|
|
||
| “Introducing Llama 3.1: Our most capable models to date”. https://ai.meta.com/blog/meta-llama-3-1 | ||
|
|
||
| Hugging Face Model Card: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct | ||
|
|
||
| Meta Llama GitHub Repository & License Details. https://github.com/meta-llama/llama3 |
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,79 @@ | ||||||
| # Troubleshooting Guide | ||||||
|
|
||||||
| This guide covers common issues observed when running inference against models deployed via Helm commands on Intel® AI for Enterprise Inference, along with step-by-step resolutions. | ||||||
|
|
||||||
| **Issues:** | ||||||
| 1. [Gateway Timeout (504) on Inference Requests](#1-gateway-timeout-504-on-inference-requests) | ||||||
|
|
||||||
| --- | ||||||
|
|
||||||
| ### 1. Gateway Timeout (504) on Inference Requests | ||||||
|
|
||||||
| **Context:** Model deployed via Helm commands. Inference request sent through the ingress stack (ingress-nginx -> APISIX -> vLLM service). | ||||||
|
|
||||||
| **Error:** Inference requests return `504 Gateway Timeout` after 60 seconds: | ||||||
|
|
||||||
| ``` | ||||||
| "POST /<model-name>/v1/completions HTTP/2.0" 504 | ||||||
| upstream timed out (110: Operation timed out) ... 60.001 | ||||||
| ``` | ||||||
|
|
||||||
| **Cause:** | ||||||
|
|
||||||
| CPU-based model inference (`vllm-cpu`) generates tokens at ~0.3-0.4 tokens/s. Responses requiring more than ~24 tokens exceed the default 60s upstream timeout enforced by ingress-nginx and APISIX. | ||||||
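The arithmetic behind that threshold can be checked directly. The sketch below uses only the throughput figures quoted above, which will vary with hardware and configuration; `estimate_seconds` is a hypothetical helper name, and 25 is the `max_tokens` value used in the deployment guide's example request.

```bash
# Estimate generation time in seconds for a token count and a throughput.
# Usage: estimate_seconds <tokens> <tokens_per_second>
estimate_seconds() {
  awk -v n="$1" -v r="$2" 'BEGIN { printf "%.1f\n", n / r }'
}

estimate_seconds 24 0.4   # prints 60.0: right at the 60s default
estimate_seconds 25 0.4   # prints 62.5: already past the default
estimate_seconds 25 0.3   # prints 83.3: slower end of the quoted range
```

Under these figures, even a 25-token completion can cross the 60-second upstream timeout.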
|
|
||||||
| **Fix:** | ||||||
|
|
||||||
| **Step 1 - Increase the nginx ingress timeout** | ||||||
|
|
||||||
| Apply the annotations below to the ingresses in both the `default` and `auth-apisix` namespaces. To find the ingress names: | ||||||
|
|
||||||
| ```bash | ||||||
| kubectl get ingress -A | grep <model-name> | ||||||
| ``` | ||||||
|
|
||||||
| Then annotate each ingress: | ||||||
|
Collaborator
Suggested change
Let's emphasize EACH since 2 is created. |
||||||
|
|
||||||
| ```bash | ||||||
| kubectl annotate ingress <ingress-name> -n <namespace> \ | ||||||
| nginx.ingress.kubernetes.io/proxy-read-timeout="300" \ | ||||||
| nginx.ingress.kubernetes.io/proxy-send-timeout="300" \ | ||||||
| nginx.ingress.kubernetes.io/proxy-connect-timeout="60" \ | ||||||
| --overwrite | ||||||
| ``` | ||||||
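Because the annotation has to be applied to each matching ingress, a small loop avoids missing one. This is a sketch that combines the two commands above; `annotate_model_ingresses` is a hypothetical helper name, and it relies on `kubectl get ingress -A` printing the namespace in the first column and the ingress name in the second.

```bash
# Sketch: annotate every ingress whose listing line matches the model name.
# Usage: annotate_model_ingresses <model-name>
annotate_model_ingresses() {
  model="$1"
  kubectl get ingress -A --no-headers | grep "$model" |
    while read -r namespace name _rest; do
      # Redirect stdin so the inner command cannot consume the loop's input.
      kubectl annotate ingress "$name" -n "$namespace" \
        nginx.ingress.kubernetes.io/proxy-read-timeout="300" \
        nginx.ingress.kubernetes.io/proxy-send-timeout="300" \
        nginx.ingress.kubernetes.io/proxy-connect-timeout="60" \
        --overwrite </dev/null
    done
}
```

As with the `grep` above, the match is against the whole listing line, so review the output of `kubectl get ingress -A | grep <model-name>` first to be sure only the intended ingresses are selected.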
|
|
||||||
| **Step 2 - Increase the APISIX route timeout** | ||||||
|
|
||||||
| To find the route name: | ||||||
|
|
||||||
| ```bash | ||||||
| kubectl get apisixroute -n auth-apisix | grep <model-name> | ||||||
| ``` | ||||||
|
|
||||||
| Edit the route: | ||||||
|
|
||||||
| ```bash | ||||||
| kubectl edit apisixroute <route-name> -n auth-apisix | ||||||
| ``` | ||||||
|
|
||||||
| Update the timeout section under the route: | ||||||
|
|
||||||
| ```yaml | ||||||
| spec: | ||||||
| http: | ||||||
| - name: <route-name> | ||||||
| timeout: | ||||||
| connect: 60s | ||||||
| send: 300s | ||||||
| read: 300s | ||||||
| ``` | ||||||
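Where an interactive edit is inconvenient (for example in a script), the same change could be applied with a JSON patch. This is a sketch, not the documented procedure: `patch_route_timeout` is a hypothetical helper name, and the patch path assumes the model's rule is the first entry under `spec.http` (index `0`), which should be confirmed with `kubectl get apisixroute <route-name> -n auth-apisix -o yaml`.

```bash
# Sketch: set the route timeouts non-interactively.
# Assumption: the rule to change is spec.http[0].
# Usage: patch_route_timeout <route-name>
patch_route_timeout() {
  route="$1"
  kubectl patch apisixroute "$route" -n auth-apisix --type=json -p '[
    {"op": "add", "path": "/spec/http/0/timeout",
     "value": {"connect": "60s", "send": "300s", "read": "300s"}}
  ]'
}
```

A JSON Patch `add` on an existing object member replaces its value, so the same call works whether or not a `timeout` block is already present.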
|
|
||||||
| **Verification:** | ||||||
|
|
||||||
| Re-run the inference request and confirm a `200 OK` response is returned within the new timeout window. | ||||||
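One way to make that check explicit is to print the HTTP status and elapsed time. The sketch below assumes `BASE_URL` and `TOKEN` are exported as in the deployment guide and reuses that guide's example path and payload; substitute your own `<model-name>` path if it differs. `verify_inference` is a hypothetical helper name, and the 310-second client limit is an arbitrary value chosen to sit just above the 300-second server-side timeouts.

```bash
# Sketch: re-run the request, discard the body, and report status and duration.
# Assumes BASE_URL and TOKEN are set in the environment.
verify_inference() {
  curl -k -s -o /dev/null \
    --max-time 310 \
    -w 'status=%{http_code} seconds=%{time_total}\n' \
    "https://${BASE_URL}/Llama-3.1-8B-Instruct-vllmcpu/v1/completions" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer ${TOKEN}" \
    -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "What is Deep Learning?", "max_tokens": 25, "temperature": 0}'
}
```

A `status=200` line with a duration above 60 seconds indicates that the new timeouts are in effect.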
|
|
||||||
| **Notes:** | ||||||
|
|
||||||
| - The nginx ingress annotation takes effect immediately; no pod restart required. | ||||||
| - For GPU-based deployments this timeout is rarely needed as throughput is significantly higher (30-50 tokens/s vs 0.3-0.4 tokens/s on CPU). | ||||||
|
Collaborator
Same here, let's remove mentions of performance numbers as it will vary from SKU, config, and over time |
||||||
| - If requests still time out after increasing both timeouts, reduce `max_tokens` in the request payload to limit response length. | ||||||
The performance is different for every Xeon SKU and will change over time. Let's just keep the note generic by only mentioning the root cause is the upstream timeout exceeds 60 seconds.