101 changes: 101 additions & 0 deletions third_party/Dell/model-deployment/Qwen3-8b/deployment.md
@@ -0,0 +1,101 @@
## Step 1: Prerequisites for Deploying the Qwen3-8B Model on Xeon with Keycloak

Ensure the Enterprise Inference stack with Keycloak is already deployed before proceeding.

Edit `core/scripts/generate-token.sh` and set your values before sourcing it:

| Variable | Description |
| ------------------------- | ------------------------------------------------------------------------ |
| `BASE_URL` | Hostname of your cluster (e.g. `api.example.com`), without `https://` |
| `KEYCLOAK_ADMIN_USERNAME` | Keycloak admin username |
| `KEYCLOAK_PASSWORD` | Keycloak admin password |
| `KEYCLOAK_CLIENT_ID` | Keycloak client ID configured during EI deployment |

Then run:

```bash
export HUGGING_FACE_HUB_TOKEN="your_token_here"

cd ~/Enterprise-Inference
source core/scripts/generate-token.sh
```

Sourcing the script exports `BASE_URL`, `KEYCLOAK_CLIENT_ID`, `KEYCLOAK_CLIENT_SECRET`, and `TOKEN` into the current shell.
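
Because the later steps depend on these variables, it can help to confirm they are set before running `helm install`. The sketch below assumes nothing about the script beyond the variable names listed above; `require_vars` is a hypothetical helper.

```bash
# Hypothetical helper: report any of the named variables that are unset or empty.
# Returns 0 when all are set, 1 otherwise.
require_vars() {
  local name value missing
  missing=0
  for name in "$@"; do
    # Look up the variable whose name is stored in $name.
    # The names are fixed literals supplied by the caller, so eval is safe here.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A possible call is `require_vars HUGGING_FACE_HUB_TOKEN BASE_URL KEYCLOAK_CLIENT_ID KEYCLOAK_CLIENT_SECRET TOKEN`, which prints the name of each missing variable to stderr.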

## Step 2: Deploy the Qwen3-8B Model

```bash
helm install qwen3-8b-cpu ./core/helm-charts/vllm \
  --values ./core/helm-charts/vllm/xeon-values.yaml \
  --set LLM_MODEL_ID="Qwen/Qwen3-8B" \
  --set global.HUGGINGFACEHUB_API_TOKEN="$HUGGING_FACE_HUB_TOKEN" \
  --set ingress.enabled=true \
  --set ingress.secretname="${BASE_URL}" \
  --set ingress.host="${BASE_URL}" \
  --set oidc.client_id="$KEYCLOAK_CLIENT_ID" \
  --set oidc.client_secret="$KEYCLOAK_CLIENT_SECRET" \
  --set apisix.enabled=true \
  --set tensor_parallel_size="1" \
  --set pipeline_parallel_size="1"
```
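
`helm install` returns as soon as the resources are created, while the pod may still be downloading and loading the model weights. One way to block until the pod is ready is sketched below. It assumes the chart creates a Deployment named `<release>-vllm`, which is inferred from the pod name pattern shown in Step 3 and not confirmed by the chart itself; `wait_for_model` and the 30-minute default are illustrative choices.

```bash
# Assumption: the chart creates a Deployment named "<release>-vllm"
# (inferred from the pod names in Step 3, not confirmed by the chart).
wait_for_model() {
  local release="${1:-qwen3-8b-cpu}"   # Helm release name used in Step 2
  local timeout="${2:-30m}"            # illustrative default; adjust as needed
  # Blocks until the rollout completes or the timeout expires.
  kubectl rollout status "deployment/${release}-vllm" --timeout="$timeout"
}
```

Calling `wait_for_model` with no arguments waits on the release installed above. If the Deployment has a different name in your cluster, `kubectl get deployments` shows the actual one.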

## Step 3: Verify the Deployment

```bash
kubectl get pods
kubectl get apisixroutes
```

Expected Output:

```
NAME                              READY   STATUS    RESTARTS
keycloak-0                        1/1     Running   0
keycloak-postgresql-0             1/1     Running   0
qwen3-8b-cpu-vllm-<hash>-<hash>   1/1     Running   0
```

> Note: The pod name suffix `<hash>-<hash>` is auto-generated by Kubernetes and will differ on each deployment. Ensure all pods show `1/1 Running`.

```
NAME                            HOSTS
qwen3-8b-cpu-vllm-apisixroute   api.example.com
```
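
The `1/1 Running` check from the note above can also be scripted. The following is a rough sketch that reads `kubectl get pods --no-headers` output on stdin; `all_pods_ready` is a hypothetical helper, and it treats any pod that is not `Running` (including completed Job pods) as not ready.

```bash
# Hypothetical helper: succeeds only if every pod listed on stdin is Running
# and its READY column has matching counts (for example 1/1).
# Note: empty input also succeeds, and Completed pods are reported as not ready.
all_pods_ready() {
  awk '
    { split($2, r, "/") }                      # READY column, e.g. "1/1"
    $3 != "Running" || r[1] != r[2] {
      print "not ready: " $1
      bad = 1
    }
    END { exit bad }
  '
}
```

A possible invocation is `kubectl get pods --no-headers | all_pods_ready`, which prints the name of each pod that fails the check and returns a non-zero status if any do.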

## Step 4: Test the Deployed Model

```bash
# -k skips TLS certificate verification; omit it if the cluster uses a trusted certificate.
curl -k "https://${BASE_URL}/Qwen3-8B-vllmcpu/v1/completions" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "prompt": "What is Deep Learning?",
    "max_tokens": 25,
    "temperature": 0
  }'
```

On success, the endpoint returns a JSON completion response containing the generated text.
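
When the request fails, the HTTP status code usually narrows down the cause. The sketch below maps common codes to likely causes for this setup; `explain_status` is a hypothetical helper, and the suggested causes are general HTTP conventions, not behavior confirmed for this specific gateway configuration.

```bash
# Hypothetical helper: map an HTTP status code to a likely cause.
explain_status() {
  case "$1" in
    200)     echo "ok: the model returned a completion" ;;
    401|403) echo "auth: token missing, expired, or rejected; re-source generate-token.sh" ;;
    404)     echo "route: path not found; compare the URL with 'kubectl get apisixroutes'" ;;
    5??)     echo "backend: gateway or model pod error; check 'kubectl get pods'" ;;
    *)       echo "unexpected status: $1" ;;
  esac
}

# Usage (commented out because it needs a live cluster):
# code="$(curl -k -s -o /dev/null -w '%{http_code}' \
#   -H "Authorization: Bearer $TOKEN" \
#   "https://${BASE_URL}/Qwen3-8B-vllmcpu/v1/models")"
# explain_status "$code"
```

The usage example queries `/v1/models`, which vLLM's OpenAI-compatible server exposes; whether it is reachable under the same route prefix here is an assumption. Keycloak access tokens are typically short-lived, so a `401` after a period of inactivity often just means `TOKEN` needs to be regenerated.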

## Undeploy the Model

```bash
helm uninstall qwen3-8b-cpu
```

## Parameters

| Parameter | Description |
| ----------------------------------------------- | ------------------------------------------------------------------------------------------------- |
| `--set LLM_MODEL_ID="Qwen/Qwen3-8B"` | Defines the target model from **Hugging Face** to deploy. |
| `--set global.HUGGINGFACEHUB_API_TOKEN="..."` | Authenticates access to gated or private Hugging Face models. Replace with your own secure token. |
| `--set ingress.enabled=true` | Enables Kubernetes **Ingress** to expose the model service externally. |
| `--set ingress.host="${BASE_URL}"` | Public hostname or FQDN for the inference endpoint (maps to your Ingress controller IP). |
| `--set ingress.secretname="${BASE_URL}"` | Kubernetes **TLS Secret** used for HTTPS termination at the ingress layer. |
| `--set oidc.client_id="..."` | Keycloak OIDC client ID used for token-based authentication. |
| `--set oidc.client_secret="..."` | Keycloak OIDC client secret corresponding to the client ID. |
| `--set apisix.enabled=true` | Enables **APISIX** as the API gateway for routing and authentication. |
| `--set tensor_parallel_size="1"`                | Number of tensor-parallel workers (model shards). Typically matches the number of devices (CPU sockets or GPUs) used per node. |
| `--set pipeline_parallel_size="1"` | Number of pipeline parallel stages. Typically `1` for single-node deployments. |
62 changes: 62 additions & 0 deletions third_party/Dell/model-deployment/Qwen3-8b/model-card.md
@@ -0,0 +1,62 @@
# Qwen3-8B

Qwen3-8B is an open-weight language model developed by the Qwen Team at Alibaba Cloud. It is designed for natural language understanding, reasoning, instruction following, and code tasks across a broad range of enterprise and research workloads. It belongs to the Qwen3 generation of the Qwen model family, which improves on earlier generations in reasoning depth, instruction alignment, and multilingual capability.

For full details, including model specifications, licensing, intended use, safety guidance, and example prompts, see the [official Hugging Face page](https://huggingface.co/Qwen/Qwen3-8B).

This deployment provides inference services only; the model weights are hosted on Hugging Face and released under the Apache License 2.0.

Ensure compliance with the license terms before using this model.

### Model Attribution

**Developer:** Alibaba Cloud / Qwen Team

**Purpose:** General-purpose instruction-tuned reasoning and language model

**Sizes/Variants:** 8B parameters

**Modalities:** Text → Natural Language + Code

**Parameter Size:** 8 Billion

**Max Context:** 32,768 tokens natively; up to 131,072 tokens with YaRN scaling (depending on backend integration)

**License:** Apache License 2.0 (commercial use permitted)

### Usage Notice

**By using this model, you agree that:**

- Inputs and outputs are processed by the Qwen3-8B model, which is distributed under the Apache License 2.0.
- You are responsible for validating outputs before production use.
- This model should not be used for generating malicious, deceptive, or unsafe content.
- Commercial usage must comply with all license obligations and regional legal requirements.

### Intended Applications

- Enterprise chatbots and virtual assistants
- Retrieval-Augmented Generation (RAG) systems
- Agentic AI workflows and task automation
- Code generation, debugging, and refactoring
- API reasoning and architecture guidance
- Multilingual document analysis and summarization
- Knowledge base and search augmentation systems

### Limitations

- Higher compute and memory requirements than sub-3B models
- May hallucinate in open-ended or low-context prompts
- Not suitable for unsupervised safety-critical decision systems
- Long-context performance depends on serving backend configuration
- Requires responsible deployment with output validation

### References

Qwen Project Official Repository - https://github.com/QwenLM

Hugging Face Model Page - https://huggingface.co/Qwen/Qwen3-8B

Apache License 2.0 - https://www.apache.org/licenses/LICENSE-2.0
1 change: 0 additions & 1 deletion third_party/Dell/model-deployment/README.md

This file was deleted.