@@ -0,0 +1,101 @@
## Step 1: Prerequisites to Deploy DeepSeek-R1-Distill-Llama-8B Model on Xeon with Keycloak

Ensure the Enterprise Inference stack with Keycloak is already deployed before proceeding.

Edit `core/scripts/generate-token.sh` and set your values before sourcing it:

| Variable | Description |
| ------------------------- | ------------------------------------------------------------------------ |
| `BASE_URL` | Hostname of your cluster (e.g. `api.example.com`), without `https://` |
| `KEYCLOAK_ADMIN_USERNAME` | Keycloak admin username |
| `KEYCLOAK_PASSWORD` | Keycloak admin password |
| `KEYCLOAK_CLIENT_ID` | Keycloak client ID configured during EI deployment |

Then run:

```bash
export HUGGING_FACE_HUB_TOKEN="your_token_here"

cd ~/Enterprise-Inference
source core/scripts/generate-token.sh
```

This exports: `BASE_URL`, `KEYCLOAK_CLIENT_ID`, `KEYCLOAK_CLIENT_SECRET`, and `TOKEN`.
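Before moving on to the Helm install, it can help to confirm that these variables are actually set in the current shell. The snippet below is a minimal sketch; `require_vars` is a hypothetical helper written for this guide, not something shipped in the Enterprise Inference repository.

```bash
# Hypothetical helper (not part of the Enterprise Inference repo):
# prints every listed variable that is unset or empty and returns non-zero
# if any is missing. Uses eval instead of bash-only ${!name} for portability.
require_vars() {
  missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"   # look up the variable whose name is in $name
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Check everything the helm install in Step 2 reads from the environment.
if require_vars BASE_URL KEYCLOAK_CLIENT_ID KEYCLOAK_CLIENT_SECRET TOKEN HUGGING_FACE_HUB_TOKEN; then
  echo "all required variables are set"
else
  echo "set the variables listed above, then re-source generate-token.sh" >&2
fi
```

Because the script is sourced rather than executed, the exported values live only in the current shell session; a new terminal needs the `source` step again.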

## Step 2: Deploy DeepSeek-R1-Distill-Llama-8B Model

```bash
helm install deepseek-r1-distill-cpu ./core/helm-charts/vllm \
  --values ./core/helm-charts/vllm/xeon-values.yaml \
  --set LLM_MODEL_ID="deepseek-ai/DeepSeek-R1-Distill-Llama-8B" \
  --set global.HUGGINGFACEHUB_API_TOKEN="$HUGGING_FACE_HUB_TOKEN" \
  --set ingress.enabled=true \
  --set ingress.secretname="${BASE_URL}" \
  --set ingress.host="${BASE_URL}" \
  --set oidc.client_id="$KEYCLOAK_CLIENT_ID" \
  --set oidc.client_secret="$KEYCLOAK_CLIENT_SECRET" \
  --set apisix.enabled=true \
  --set tensor_parallel_size="1" \
  --set pipeline_parallel_size="1"
```

## Step 3: Verify the Deployment

```bash
kubectl get pods
kubectl get apisixroutes
```

Expected Output:

```
NAME                                    READY   STATUS    RESTARTS
keycloak-0                              1/1     Running   0
keycloak-postgresql-0                   1/1     Running   0
deepseek-r1-distill-cpu-<hash>-<hash>   1/1     Running   0
```

> Note: The pod name suffix `<hash>-<hash>` is auto-generated by Kubernetes and will differ on each deployment. Ensure all pods show `1/1 Running`.

```
NAME                                  HOSTS
deepseek-r1-distill-cpu-apisixroute   api.example.com
```
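The "all pods show `1/1 Running`" check can also be scripted instead of read by eye. The following is a rough sketch that assumes the default `kubectl get pods` column order (NAME, READY, STATUS); `all_pods_running` is a hypothetical helper name, and the sample input is the expected output shown above.

```bash
# Hypothetical helper: reads `kubectl get pods` output on stdin and succeeds
# only if every pod row has matching READY counts (e.g. 1/1) and STATUS Running.
# Rows that fail the check are printed as "not ready: <pod name>".
all_pods_running() {
  awk 'NR > 1 {
         split($2, ready, "/")
         if (ready[1] != ready[2] || $3 != "Running") {
           print "not ready: " $1
           bad = 1
         }
       }
       END { exit bad + 0 }'
}

# Demonstration on sample output; against a live cluster you would run:
#   kubectl get pods | all_pods_running && echo "all pods ready"
all_pods_running <<'EOF' && echo "all pods ready"
NAME READY STATUS RESTARTS
keycloak-0 1/1 Running 0
keycloak-postgresql-0 1/1 Running 0
EOF
```

Note that the helper succeeds on empty input, so it is only meaningful when the pod list itself is non-empty.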

## Step 4: Test the Deployed Model

```bash
curl -k "https://${BASE_URL}/DeepSeek-R1-Distill-Llama-8B-vllmcpu/v1/completions" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "prompt": "What is Deep Learning?",
    "max_tokens": 25,
    "temperature": 0
  }'
```
```

If successful, the model will return a completion response.
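When the request does not succeed, the HTTP status code is usually the quickest clue. The sketch below maps status codes to likely causes; `explain_status` is a hypothetical helper, and the suggested causes are assumptions based on the components used in this guide (Keycloak tokens, APISIX routes, the vLLM pod), not documented behaviour of the stack.

```bash
# Hypothetical helper: translate an HTTP status code into a likely cause.
# The causes listed here are assumptions, not guaranteed diagnoses.
explain_status() {
  case "$1" in
    200)     echo "ok: the model returned a completion" ;;
    401|403) echo "auth error: TOKEN may be missing or expired; re-source generate-token.sh" ;;
    404)     echo "not found: check the URL path and the output of 'kubectl get apisixroutes'" ;;
    5??)     echo "server error: the model pod may still be loading; check 'kubectl get pods'" ;;
    *)       echo "unexpected status: $1" ;;
  esac
}

# Live usage (requires the deployed cluster). curl's -o writes the body to a
# file and -w '%{http_code}' prints only the status code:
#   status=$(curl -k -s -o response.json -w '%{http_code}' \
#     "https://${BASE_URL}/DeepSeek-R1-Distill-Llama-8B-vllmcpu/v1/completions" \
#     -H "Content-Type: application/json" \
#     -H "Authorization: Bearer $TOKEN" \
#     -d '{"model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B", "prompt": "Hi", "max_tokens": 5}')
#   explain_status "$status"

explain_status 200
explain_status 401
```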

## Undeploy the Model

```bash
helm uninstall deepseek-r1-distill-cpu
```

## Parameters

| Parameter | Description |
| --------------------------------------------------------------- | ------------------------------------------------------------------------------------------------- |
| `--set LLM_MODEL_ID="deepseek-ai/DeepSeek-R1-Distill-Llama-8B"`| Defines the target model from **Hugging Face** to deploy. |
| `--set global.HUGGINGFACEHUB_API_TOKEN="..."` | Authenticates access to gated or private Hugging Face models. Replace with your own secure token. |
| `--set ingress.enabled=true` | Enables Kubernetes **Ingress** to expose the model service externally. |
| `--set ingress.host="${BASE_URL}"` | Public hostname or FQDN for the inference endpoint (maps to your Ingress controller IP). |
| `--set ingress.secretname="${BASE_URL}"` | Kubernetes **TLS Secret** used for HTTPS termination at the ingress layer. |
| `--set oidc.client_id="..."` | Keycloak OIDC client ID used for token-based authentication. |
| `--set oidc.client_secret="..."` | Keycloak OIDC client secret corresponding to the client ID. |
| `--set apisix.enabled=true` | Enables **APISIX** as the API gateway for routing and authentication. |
| `--set tensor_parallel_size="1"` | Number of tensor-parallel workers the model is sharded across. Keep at `1` for a single processor; otherwise do not exceed the processors (CPU sockets or GPUs) available per node. |
| `--set pipeline_parallel_size="1"` | Number of pipeline parallel stages. Typically `1` for single-node deployments. |
@@ -0,0 +1,67 @@
# DeepSeek-R1-Distill-Llama-8B

This deployment uses DeepSeek-R1-Distill-Llama-8B, an 8-billion-parameter reasoning model distilled from the larger DeepSeek-R1 family and built on Meta’s Llama architecture. It is optimized for lightweight deployment, faster inference, and efficient reasoning while preserving strong capabilities in logic, dialogue, and code generation.
DeepSeek’s R1 reinforcement learning process and distillation techniques let this smaller variant maintain high reasoning quality with substantially lower computational requirements.

For complete technical details, licensing, evaluation metrics, and usage guidelines, please refer to the official Hugging Face model page:

https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B

This model provides inference-only access and is distributed under the DeepSeek license.

Ensure full compliance with the DeepSeek and Meta licensing terms before integrating this model into any application or service.

### Model Attribution

**Developer:** DeepSeek AI

**Purpose:** Lightweight reasoning, dialogue, and code generation

**Sizes/Variants:** 8B distilled reasoning model

**Modalities:** Text → Text (Reasoning, Coding, and Dialogue)

**Parameter Size:** 8 billion

**Max Context:** ~64K tokens (backend dependent)

**License:** DeepSeek License (use-restricted; see Hugging Face page)

**Minimum required CPU Cores:** 157

### Usage Notice

**By using this model, you agree that:**

- All data is processed through the DeepSeek-R1-Distill-Llama-8B model hosted under the DeepSeek license.
- You must follow the DeepSeek and Meta licensing requirements, including possible non-commercial or restricted-use clauses.
- Generated content (text, reasoning traces, or code) must be validated for correctness and safety before production use.
- The model must not be used to produce harmful content, misinformation, or automated decisions in critical or regulated domains.

### Intended Applications

- Lightweight and cost-efficient reasoning and problem-solving
- Assistant-style multi-turn conversations
- Code generation, completion, and debugging (Python, Go, JavaScript, etc.)
- Educational tools, research prototypes, and RAG-based assistant systems
- Baselines for fine-tuning and further distillation research
- On-device or edge inference scenarios with GPU/memory constraints

### Limitations

- May produce inaccurate or incomplete reasoning steps
- Smaller size may reduce performance on highly complex logic or long-context tasks
- Not suited for safety-critical or regulated environments
- License may restrict commercial use
- Inference performance depends on optimized backends for smaller Llama-based models

### References

DeepSeek Official Site: https://www.deepseek.com

Hugging Face Model Card: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B

Llama Architecture Reference: https://huggingface.co/meta-llama



1 change: 0 additions & 1 deletion third_party/Dell/model-deployment/README.md

This file was deleted.