diff --git a/third_party/Dell/model-deployment/Granite-3.2-2b-Instruct/deployment.md b/third_party/Dell/model-deployment/Granite-3.2-2b-Instruct/deployment.md
new file mode 100644
index 00000000..21bf16c2
--- /dev/null
+++ b/third_party/Dell/model-deployment/Granite-3.2-2b-Instruct/deployment.md
@@ -0,0 +1,101 @@
+## Step 1: Prerequisites to Deploy Granite-3.2-2b-Instruct Model on Xeon with Keycloak
+
+Ensure the Enterprise Inference stack with Keycloak is already deployed before proceeding.
+
+Edit `core/scripts/generate-token.sh` and set your values before sourcing it:
+
+| Variable | Description |
+| ------------------------- | ------------------------------------------------------------------------ |
+| `BASE_URL` | Hostname of your cluster (e.g. `api.example.com`), without `https://` |
+| `KEYCLOAK_ADMIN_USERNAME` | Keycloak admin username |
+| `KEYCLOAK_PASSWORD` | Keycloak admin password |
+| `KEYCLOAK_CLIENT_ID` | Keycloak client ID configured during EI deployment |
+
+Then run:
+
+```bash
+export HUGGING_FACE_HUB_TOKEN="your_token_here"
+
+cd ~/Enterprise-Inference
+source core/scripts/generate-token.sh
+```
+
+This exports: `BASE_URL`, `KEYCLOAK_CLIENT_ID`, `KEYCLOAK_CLIENT_SECRET`, and `TOKEN`.
+
+## Step 2: Deploy Granite-3.2-2b-Instruct Model
+
+```bash
+helm install vllm-granite-3-2-instruct ./core/helm-charts/vllm \
+  --values ./core/helm-charts/vllm/xeon-values.yaml \
+  --set LLM_MODEL_ID="ibm-granite/granite-3.2-2b-instruct" \
+  --set global.HUGGINGFACEHUB_API_TOKEN="$HUGGING_FACE_HUB_TOKEN" \
+  --set ingress.enabled=true \
+  --set ingress.secretname="${BASE_URL}" \
+  --set ingress.host="${BASE_URL}" \
+  --set oidc.client_id="$KEYCLOAK_CLIENT_ID" \
+  --set oidc.client_secret="$KEYCLOAK_CLIENT_SECRET" \
+  --set apisix.enabled=true \
+  --set tensor_parallel_size="1" \
+  --set pipeline_parallel_size="1"
+```
+
+## Step 3: Verify the Deployment
+
+```bash
+kubectl get pods
+kubectl get apisixroutes
+```
+
+Expected Output:
+
+```
+NAME                          READY   STATUS    RESTARTS
+keycloak-0                    1/1     Running   0
+keycloak-postgresql-0         1/1     Running   0
+vllm-granite-3-2-instruct--   1/1     Running   0
+```
+
+> Note: The pod name suffix `-` is auto-generated by Kubernetes and will differ on each deployment. Ensure all pods show `1/1 Running`.
+
+```
+NAME                                    HOSTS
+vllm-granite-3-2-instruct-apisixroute   api.example.com
+```
+
+## Step 4: Test the Deployed Model
+
+```bash
+curl -k https://${BASE_URL}/granite-3.2-2b-instruct-vllmcpu/v1/completions \
+  -X POST \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer $TOKEN" \
+  -d '{
+    "model": "ibm-granite/granite-3.2-2b-instruct",
+    "prompt": "What is Deep Learning?",
+    "max_tokens": 25,
+    "temperature": 0
+  }'
+```
+
+If successful, the model will return a completion response.
+
+## To undeploy the model
+
+```bash
+helm uninstall vllm-granite-3-2-instruct
+```
+
+## Parameters
+
+| Parameter | Description |
+| ------------------------------------------------------------ | ------------------------------------------------------------------------------------------------- |
+| `--set LLM_MODEL_ID="ibm-granite/granite-3.2-2b-instruct"` | Defines the target model from **Hugging Face** to deploy. |
+| `--set global.HUGGINGFACEHUB_API_TOKEN="..."` | Authenticates access to gated or private Hugging Face models. Replace with your own secure token. |
+| `--set ingress.enabled=true` | Enables Kubernetes **Ingress** to expose the model service externally. |
+| `--set ingress.host="${BASE_URL}"` | Public hostname or FQDN for the inference endpoint (maps to your Ingress controller IP). |
+| `--set ingress.secretname="${BASE_URL}"` | Kubernetes **TLS Secret** used for HTTPS termination at the ingress layer. |
+| `--set oidc.client_id="..."` | Keycloak OIDC client ID used for token-based authentication. |
+| `--set oidc.client_secret="..."` | Keycloak OIDC client secret corresponding to the client ID. |
+| `--set apisix.enabled=true` | Enables **APISIX** as the API gateway for routing and authentication. |
+| `--set tensor_parallel_size="1"` | Number of tensor parallel workers. Set to the number of available CPUs/GPUs per node. |
+| `--set pipeline_parallel_size="1"` | Number of pipeline parallel stages. Typically `1` for single-node deployments. |
diff --git a/third_party/Dell/model-deployment/Granite-3.2-2b-Instruct/model-card.md b/third_party/Dell/model-deployment/Granite-3.2-2b-Instruct/model-card.md
new file mode 100644
index 00000000..d8b83238
--- /dev/null
+++ b/third_party/Dell/model-deployment/Granite-3.2-2b-Instruct/model-card.md
@@ -0,0 +1,66 @@
+# granite-3.2-2b-instruct
+
+This model uses ibm-granite/granite-3.2-2b-instruct, a modern, lightweight instruction-tuned large language model developed by the IBM Granite Team. It is designed for efficient reasoning, instruction-following, and enterprise-grade AI workloads such as summarization, problem solving, structured response generation, and conversational AI.
+
+For full details including model specifications, licensing, intended use, safety guidance, and example prompts, please visit the official Hugging Face page: **Official Hugging Face Page**
+
+https://huggingface.co/ibm-granite/granite-3.2-2b-instruct
+
+This model provides inference services using IBM’s open-weight Granite architecture and is distributed under the Apache 2.0 license.
+
+### Model Attribution
+
+**Developer:** IBM (Granite Team)
+
+**Purpose:** General-purpose instruction-following, reasoning, and enterprise AI workloads
+
+**Sizes/Variants:** Granite 3.2 family – includes 2B, 8B, and larger variants optimized for different deployment scales
+
+**Modalities:** Text → Text (natural language, reasoning, structured responses, code-related logic)
+
+**Parameter Size:** ~2 billion parameters (dense)
+
+**Max Context:** Up to ~128K tokens (depending on backend and serving configuration)
+
+**License:** Apache 2.0 (open-weight, commercially usable)
+
+### Usage Notice
+
+**By using this model, you agree that:**
+
+- Inputs and outputs are processed by the IBM Granite 3.2 2B Instruct model.
+- You accept and comply with the Apache 2.0 License.
+- Generated outputs must be reviewed for accuracy, safety, and compliance prior to production use.
+- The model must not be used for malicious activities or violation of applicable laws or policies.
+- Deployment in high-risk or regulated environments should include appropriate validation and guardrails.
+
+### Intended Applications
+
+- Conversational AI and enterprise assistants
+- Instruction-following automation
+- Reasoning and decision-support systems
+- Long-document summarization and analysis
+- Retrieval-Augmented Generation (RAG) systems
+- Classification, extraction, and knowledge workflows
+- Code-related reasoning and structured logic explanation
+- Multilingual AI applications
+
+### Limitations
+
+- May still produce factual inaccuracies or hallucinated responses
+- Performance may vary depending on prompt quality and domain complexity
+- Not a replacement for expert decision-making in regulated environments
+- Requires human validation in sensitive or critical applications
+- Smaller size may reduce performance in highly complex multimodal tasks compared to very large models
+
+### References
+
+IBM Granite 3.2 Model Documentation
+https://www.ibm.com/architectures/product-guides/granite-32
+
+Hugging Face Model Page – IBM Granite 3.2 2B Instruct
+https://huggingface.co/ibm-granite/granite-3.2-2b-instruct
+
+IBM Granite Announcement Blog
+https://www.ibm.com/new/announcements/ibm-granite-3-2-open-source-reasoning-and-vision
+
diff --git a/third_party/Dell/model-deployment/README.md b/third_party/Dell/model-deployment/README.md
deleted file mode 100644
index 43d98118..00000000
--- a/third_party/Dell/model-deployment/README.md
+++ /dev/null
@@ -1 +0,0 @@
-# PLACEHOLDER
\ No newline at end of file