@@ -0,0 +1,101 @@
## Step 1: Prerequisites to Deploy Granite-3.2-2b-Instruct Model on Xeon with Keycloak

Ensure the Enterprise Inference stack with Keycloak is already deployed before proceeding.

Edit `core/scripts/generate-token.sh` and set your values before sourcing it:

| Variable | Description |
| ------------------------- | ------------------------------------------------------------------------ |
| `BASE_URL` | Hostname of your cluster (e.g. `api.example.com`), without `https://` |
| `KEYCLOAK_ADMIN_USERNAME` | Keycloak admin username |
| `KEYCLOAK_PASSWORD` | Keycloak admin password |
| `KEYCLOAK_CLIENT_ID` | Keycloak client ID configured during EI deployment |

Then run:

```bash
export HUGGING_FACE_HUB_TOKEN="your_token_here"

cd ~/Enterprise-Inference
source core/scripts/generate-token.sh
```

This exports: `BASE_URL`, `KEYCLOAK_CLIENT_ID`, `KEYCLOAK_CLIENT_SECRET`, and `TOKEN`.
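
Before moving on, it can help to confirm that these variables are actually populated in the current shell, since the `helm install` command in Step 2 would otherwise receive empty strings. The following is a minimal sketch; `check_required_vars` is a hypothetical helper, not part of the Enterprise Inference scripts.

```bash
# Hypothetical helper (not part of generate-token.sh): report every named
# variable that is unset or empty, and return non-zero if any is missing.
check_required_vars() {
  missing=0
  for name in "$@"; do
    # Read the value of the variable whose name is held in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

if check_required_vars HUGGING_FACE_HUB_TOKEN BASE_URL KEYCLOAK_CLIENT_ID KEYCLOAK_CLIENT_SECRET TOKEN; then
  echo "all required variables are set"
fi
```

If any name is printed as `missing`, re-check the values in `core/scripts/generate-token.sh` and source it again.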

## Step 2: Deploy Granite-3.2-2b-Instruct Model

```bash
helm install vllm-granite-3-2-instruct ./core/helm-charts/vllm \
--values ./core/helm-charts/vllm/xeon-values.yaml \
--set LLM_MODEL_ID="ibm-granite/granite-3.2-2b-instruct" \
--set global.HUGGINGFACEHUB_API_TOKEN="$HUGGING_FACE_HUB_TOKEN" \
--set ingress.enabled=true \
--set ingress.secretname="${BASE_URL}" \
--set ingress.host="${BASE_URL}" \
--set oidc.client_id="$KEYCLOAK_CLIENT_ID" \
--set oidc.client_secret="$KEYCLOAK_CLIENT_SECRET" \
--set apisix.enabled=true \
--set tensor_parallel_size="1" \
--set pipeline_parallel_size="1"
```

## Step 3: Verify the Deployment

```bash
kubectl get pods
kubectl get apisixroutes
```

Expected Output:

```
NAME READY STATUS RESTARTS
keycloak-0 1/1 Running 0
keycloak-postgresql-0 1/1 Running 0
vllm-granite-3-2-instruct-<hash>-<hash> 1/1 Running 0
```

> Note: The pod name suffix `<hash>-<hash>` is auto-generated by Kubernetes and will differ on each deployment. Ensure all pods show `1/1 Running`.

```
NAME HOSTS
vllm-granite-3-2-instruct-apisixroute api.example.com
```
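
The model pod may take some time to reach `1/1 Running`, so instead of re-running `kubectl get pods` by hand, the wait can be scripted. This is a sketch under one assumption: that the chart creates a Deployment named after the Helm release, which the pod name prefix suggests but this guide does not confirm. The helper name `wait_for_model` and the `15m` default timeout are illustrative.

```bash
# Hypothetical helper: block until the Deployment has rolled out, or fail
# once the timeout expires.
# Assumption: the Deployment name equals the Helm release name.
wait_for_model() {
  release="${1:-vllm-granite-3-2-instruct}"
  timeout="${2:-15m}"
  kubectl rollout status "deployment/${release}" --timeout="${timeout}"
}
```

Calling `wait_for_model` with no arguments uses the release name from Step 2. If the Deployment is named differently in your cluster, pass the name shown by `kubectl get deployments` as the first argument.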

## Step 4: Test the Deployed Model

```bash
curl -k https://${BASE_URL}/granite-3.2-2b-instruct-vllmcpu/v1/completions \
-X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $TOKEN" \
-d '{
"model": "ibm-granite/granite-3.2-2b-instruct",
"prompt": "What is Deep Learning?",
"max_tokens": 25,
"temperature": 0
}'
```

If successful, the model will return a completion response.
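
When the request does not succeed, the HTTP status code is usually the quickest clue. The sketch below sends a small request to the same endpoint and maps the status code to a likely next step. Both helper names are hypothetical, and the explanations are assumptions about how this stack typically behaves rather than documented behavior.

```bash
# Hypothetical helper: map an HTTP status code to a likely next step.
# The explanations are assumptions about this stack, not documented behavior.
explain_status() {
  case "$1" in
    200)     echo "ok: request succeeded" ;;
    401|403) echo "auth rejected: TOKEN may be expired, re-source generate-token.sh" ;;
    404)     echo "not found: compare the URL path with 'kubectl get apisixroutes'" ;;
    5??)     echo "server error: check 'kubectl get pods' and the pod logs" ;;
    *)       echo "unexpected status: $1" ;;
  esac
}

# Hypothetical helper: send a one-token request to the endpoint from Step 4,
# discard the response body, and print only the HTTP status code.
probe_completion() {
  curl -k -s -o /dev/null -w '%{http_code}' \
    "https://${BASE_URL}/granite-3.2-2b-instruct-vllmcpu/v1/completions" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer ${TOKEN}" \
    -d '{"model": "ibm-granite/granite-3.2-2b-instruct", "prompt": "ping", "max_tokens": 1}'
}
```

Run `explain_status "$(probe_completion)"` in the shell where `BASE_URL` and `TOKEN` are exported.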

## Undeploy the Model

```bash
helm uninstall vllm-granite-3-2-instruct
```

## Parameters

| Parameter | Description |
| ------------------------------------------------------------ | ------------------------------------------------------------------------------------------------- |
| `--set LLM_MODEL_ID="ibm-granite/granite-3.2-2b-instruct"` | Defines the target model from **Hugging Face** to deploy. |
| `--set global.HUGGINGFACEHUB_API_TOKEN="..."` | Authenticates access to gated or private Hugging Face models. Replace with your own secure token. |
| `--set ingress.enabled=true` | Enables Kubernetes **Ingress** to expose the model service externally. |
| `--set ingress.host="${BASE_URL}"` | Public hostname or FQDN for the inference endpoint (maps to your Ingress controller IP). |
| `--set ingress.secretname="${BASE_URL}"` | Kubernetes **TLS Secret** used for HTTPS termination at the ingress layer. |
| `--set oidc.client_id="..."` | Keycloak OIDC client ID used for token-based authentication. |
| `--set oidc.client_secret="..."` | Keycloak OIDC client secret corresponding to the client ID. |
| `--set apisix.enabled=true` | Enables **APISIX** as the API gateway for routing and authentication. |
| `--set tensor_parallel_size="1"` | Number of tensor parallel workers. Set to the number of available CPUs/GPUs per node. |
| `--set pipeline_parallel_size="1"` | Number of pipeline parallel stages. Typically `1` for single-node deployments. |
@@ -0,0 +1,66 @@
# granite-3.2-2b-instruct

This deployment serves `ibm-granite/granite-3.2-2b-instruct`, a lightweight instruction-tuned large language model developed by the IBM Granite Team. It is designed for efficient reasoning, instruction following, and enterprise AI workloads such as summarization, problem solving, structured response generation, and conversational AI.

For full details, including model specifications, licensing, intended use, safety guidance, and example prompts, see the official Hugging Face page:

https://huggingface.co/ibm-granite/granite-3.2-2b-instruct

This model provides inference services using IBM’s open-weight Granite architecture and is distributed under the Apache 2.0 license.

### Model Attribution

**Developer:** IBM (Granite Team)

**Purpose:** General-purpose instruction-following, reasoning, and enterprise AI workloads

**Sizes/Variants:** Granite 3.2 family – includes 2B, 8B, and larger variants optimized for different deployment scales

**Modalities:** Text → Text (natural language, reasoning, structured responses, code-related logic)

**Parameter Size:** ~2 billion parameters (dense)

**Max Context:** Up to ~128K tokens (depending on backend and serving configuration)

**License:** Apache 2.0 (open-weight, commercially usable)

### Usage Notice

**By using this model, you agree that:**

- Inputs and outputs are processed by the IBM Granite 3.2 2B Instruct model.
- You accept and comply with the Apache 2.0 License.
- Generated outputs must be reviewed for accuracy, safety, and compliance prior to production use.
- The model must not be used for malicious activities or violation of applicable laws or policies.
- Deployment in high-risk or regulated environments should include appropriate validation and guardrails.

### Intended Applications

- Conversational AI and enterprise assistants
- Instruction-following automation
- Reasoning and decision-support systems
- Long-document summarization and analysis
- Retrieval-Augmented Generation (RAG) systems
- Classification, extraction, and knowledge workflows
- Code-related reasoning and structured logic explanation
- Multilingual AI applications

### Limitations

- May still produce factual inaccuracies or hallucinated responses
- Performance may vary depending on prompt quality and domain complexity
- Not a replacement for expert decision-making in regulated environments
- Requires human validation in sensitive or critical applications
- Smaller size may reduce performance on highly complex tasks compared to much larger models

### References

IBM Granite 3.2 Model Documentation
https://www.ibm.com/architectures/product-guides/granite-32

Hugging Face Model Page – IBM Granite 3.2 2B Instruct
https://huggingface.co/ibm-granite/granite-3.2-2b-instruct

IBM Granite Announcement Blog
https://www.ibm.com/new/announcements/ibm-granite-3-2-open-source-reasoning-and-vision

1 change: 0 additions & 1 deletion third_party/Dell/model-deployment/README.md

This file was deleted.