117 changes: 117 additions & 0 deletions docs/sagemaker/01_quickstart.md
@@ -0,0 +1,117 @@
# Quick Start: Deploy vLLM on SageMaker

Deploy a vLLM-powered Large Language Model on Amazon SageMaker in minutes.

## Quick Start

**Fastest way to get started:**

👉 **[Basic Endpoint Notebook](../../examples/vllm/notebooks/basic_endpoint.ipynb)**

The notebook covers the complete deployment workflow, inference examples (single, concurrent, and streaming), and automatic cleanup.

## Container Images

AWS provides official vLLM container images in the [Amazon ECR Public Gallery](https://gallery.ecr.aws/deep-learning-containers/vllm).

**Example:**
```
public.ecr.aws/deep-learning-containers/vllm:0.11.2-gpu-py312-cu129-ubuntu22.04-sagemaker-v1.2
```

**Note:** Copy the public image to your private ECR repository for SageMaker deployment. See [copy_image.ipynb](../../examples/vllm/notebooks/copy_image.ipynb).

**Features:**
- vLLM inference engine
- SageMaker-compatible API
- Custom handler support
- Custom middleware support
- Custom pre/post-processing support
- Sticky routing (stateful sessions)
- Multi-LoRA adapter management

## Basic Deployment

### Required Configuration

```python
import boto3

sagemaker_client = boto3.client('sagemaker')

# account_id and region are placeholders for your AWS account ID and the Region
# hosting your private ECR copy of the vLLM image
sagemaker_client.create_model(
ModelName='my-vllm-model',
ExecutionRoleArn='arn:aws:iam::123456789012:role/SageMakerExecutionRole',
PrimaryContainer={
'Image': f'{account_id}.dkr.ecr.{region}.amazonaws.com/vllm:0.11.2-sagemaker-v1.2',
'Environment': {
'SM_VLLM_MODEL': 'meta-llama/Meta-Llama-3-8B-Instruct',
'HUGGING_FACE_HUB_TOKEN': 'hf_your_token_here',
}
}
)
```
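
`create_model` only registers the container and its environment; to serve traffic you still need an endpoint configuration and an endpoint. A minimal sketch follows; the names and the `ml.g5.2xlarge` instance type are illustrative assumptions, so size the instance for your model. The [Basic Endpoint Notebook](../../examples/vllm/notebooks/basic_endpoint.ipynb) shows the full workflow.

```python
# Sketch: create an endpoint configuration and endpoint for the model above.
sagemaker_client.create_endpoint_config(
    EndpointConfigName='my-vllm-endpoint-config',
    ProductionVariants=[{
        'VariantName': 'AllTraffic',
        'ModelName': 'my-vllm-model',
        'InstanceType': 'ml.g5.2xlarge',   # illustrative; pick a GPU instance sized for your model
        'InitialInstanceCount': 1,
    }]
)

sagemaker_client.create_endpoint(
    EndpointName='my-vllm-endpoint',
    EndpointConfigName='my-vllm-endpoint-config'
)

# Endpoint creation takes several minutes; wait until it is InService before invoking it
sagemaker_client.get_waiter('endpoint_in_service').wait(EndpointName='my-vllm-endpoint')
```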

### vLLM Engine Configuration

Configure vLLM using `SM_VLLM_*` environment variables (automatically converted to CLI arguments):

```python
'Environment': {
'SM_VLLM_MODEL': 'meta-llama/Meta-Llama-3-8B-Instruct',
'HUGGING_FACE_HUB_TOKEN': 'hf_your_token_here',
'SM_VLLM_MAX_MODEL_LEN': '2048',
'SM_VLLM_GPU_MEMORY_UTILIZATION': '0.9',
'SM_VLLM_DTYPE': 'auto',
'SM_VLLM_TENSOR_PARALLEL_SIZE': '1',
}
```

All vLLM CLI arguments are supported. See [vLLM CLI documentation](https://docs.vllm.ai/en/latest/cli/serve/#frontend) for available parameters.
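
As a rough mental model of the conversion (the exact logic lives inside the container, so treat this as an assumption), an `SM_VLLM_*` variable maps to the CLI flag you get by dropping the prefix, lowercasing, and replacing underscores with hyphens:

```python
# Illustrative sketch of the assumed SM_VLLM_* -> vLLM CLI flag mapping
def env_to_cli_flag(name: str) -> str:
    """e.g. 'SM_VLLM_MAX_MODEL_LEN' -> '--max-model-len'"""
    return '--' + name.removeprefix('SM_VLLM_').lower().replace('_', '-')

print(env_to_cli_flag('SM_VLLM_GPU_MEMORY_UTILIZATION'))  # --gpu-memory-utilization
print(env_to_cli_flag('SM_VLLM_TENSOR_PARALLEL_SIZE'))    # --tensor-parallel-size
```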

### Model Path Options

`SM_VLLM_MODEL` accepts two types of values:

**1. Hugging Face Model ID** (downloads from HF Hub):
```python
'SM_VLLM_MODEL': 'meta-llama/Meta-Llama-3-8B-Instruct'
```

**2. Local Folder Path** (for S3 model artifacts):
```python
'SM_VLLM_MODEL': '/opt/ml/model'
```

When deploying with model artifacts from S3, SageMaker automatically downloads them to `/opt/ml/model`. Use this path to load your pre-downloaded models instead of fetching from Hugging Face.
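
A minimal sketch of that setup, combining `ModelDataSource` (as in the handler guide's example) with the local path; the bucket, prefix, and model name are placeholders:

```python
# Sketch: deploy pre-downloaded model weights from S3 instead of pulling from the HF Hub
sagemaker_client.create_model(
    ModelName='my-vllm-model-s3',
    ExecutionRoleArn='arn:aws:iam::123456789012:role/SageMakerExecutionRole',
    PrimaryContainer={
        'Image': f'{account_id}.dkr.ecr.{region}.amazonaws.com/vllm:0.11.2-sagemaker-v1.2',
        'ModelDataSource': {
            'S3DataSource': {
                'S3Uri': 's3://my-bucket/llama-3-8b-instruct/',  # placeholder location of your artifacts
                'S3DataType': 'S3Prefix',
                'CompressionType': 'None',
            }
        },
        'Environment': {
            # Point vLLM at the artifacts SageMaker downloads to /opt/ml/model
            'SM_VLLM_MODEL': '/opt/ml/model',
        }
    }
)
```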


## Making Inference Requests

```python
import boto3
import json

runtime_client = boto3.client('sagemaker-runtime')

response = runtime_client.invoke_endpoint(
EndpointName='my-vllm-endpoint',
ContentType='application/json',
Body=json.dumps({
"prompt": "What is the capital of France?",
"max_tokens": 100,
"temperature": 0.7
})
)

result = json.loads(response['Body'].read())
print(result['choices'][0]['text'])
```
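
Streaming uses the `invoke_endpoint_with_response_stream` API instead. A minimal sketch, assuming the container honors the OpenAI-style `"stream": true` flag in the request body:

```python
# Sketch of a streaming request against the same endpoint
response = runtime_client.invoke_endpoint_with_response_stream(
    EndpointName='my-vllm-endpoint',
    ContentType='application/json',
    Body=json.dumps({
        "prompt": "What is the capital of France?",
        "max_tokens": 100,
        "stream": True
    })
)

# The body is an event stream; each event carries a chunk of the response payload
for event in response['Body']:
    if 'PayloadPart' in event:
        print(event['PayloadPart']['Bytes'].decode('utf-8'), end='')
```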

For complete deployment code including concurrent requests, streaming responses, and cleanup, see the [Basic Endpoint Notebook](../../examples/vllm/notebooks/basic_endpoint.ipynb).

## Next Steps

**Advanced Features:**
- [Customize Handlers](02_customize_handlers.md) - Custom ping and invocation handlers
- [Customize Pre/Post Processing](03_customize_pre_post_processing.md) - Custom middleware for request/response transformation

**Resources:**
- [Basic Endpoint Notebook](../../examples/vllm/notebooks/basic_endpoint.ipynb) - Complete deployment example
- [Handler Customization Notebook](../../examples/vllm/notebooks/handler_customization_methods.ipynb) - Handler override examples
- [Pre/Post Processing Notebook](../../examples/vllm/notebooks/preprocessing_postprocessing_methods.ipynb) - Middleware examples
- [Python Package README](../../python/README.md) - Handler system documentation
- [vLLM Documentation](https://docs.vllm.ai/) - vLLM engine details
276 changes: 276 additions & 0 deletions docs/sagemaker/02_customize_handlers.md
@@ -0,0 +1,276 @@
# Customize Handlers

This guide explains how to customize the `/ping` and `/invocations` endpoints for your vLLM models on Amazon SageMaker.

> 📓 **Working Example**: See the [Handler Override Notebook](../../examples/vllm/notebooks/handler_customization_methods.ipynb) for a complete working example.

## Overview

The vLLM container comes with default handlers for health checks and inference. You can override these defaults using custom Python code in your model artifacts.

**Handler Resolution Priority** (first match wins):
1. **Environment Variables** - `CUSTOM_FASTAPI_PING_HANDLER`, `CUSTOM_FASTAPI_INVOCATION_HANDLER`
2. **Decorator Registration** - `@custom_ping_handler`, `@custom_invocation_handler`
3. **Function Discovery** - Functions named `custom_sagemaker_ping_handler`, `custom_sagemaker_invocation_handler`
4. **Framework Defaults** - vLLM's built-in handlers

## Quick Start

### Step 1: Create Custom Handler Script

Create a Python file (e.g., `model.py`) with your custom handlers:

```python
# model.py
import model_hosting_container_standards.sagemaker as sagemaker_standards
from fastapi import Request, Response
import json

@sagemaker_standards.custom_ping_handler
async def my_health_check(request: Request) -> Response:
"""Custom health check logic."""
return Response(
content=json.dumps({"status": "healthy", "custom": True}),
media_type="application/json",
status_code=200
)

@sagemaker_standards.custom_invocation_handler
async def my_inference(request: Request) -> Response:
"""Custom inference logic."""
body = await request.json()
# Your custom logic here
result = {"predictions": ["custom response"]}
return Response(
content=json.dumps(result),
media_type="application/json"
)
```

### Step 2: Upload to S3

```python
import boto3

s3_client = boto3.client('s3')
s3_client.upload_file('model.py', 'my-bucket', 'my-model/model.py')
```

### Step 3: Deploy to SageMaker

```python
sagemaker_client = boto3.client('sagemaker')

sagemaker_client.create_model(
ModelName='my-vllm-model',
ExecutionRoleArn='arn:aws:iam::123456789012:role/SageMakerExecutionRole',
PrimaryContainer={
'Image': f'{account_id}.dkr.ecr.{region}.amazonaws.com/vllm:latest',
'ModelDataSource': {
'S3DataSource': {
'S3Uri': 's3://my-bucket/my-model/',
'S3DataType': 'S3Prefix',
'CompressionType': 'None',
}
},
'Environment': {
'SM_VLLM_MODEL': 'meta-llama/Meta-Llama-3-8B-Instruct',
'HUGGING_FACE_HUB_TOKEN': 'hf_your_token_here',
'CUSTOM_SCRIPT_FILENAME': 'model.py', # Default: model.py
}
}
)
```

**Key Environment Variables:**
- `CUSTOM_SCRIPT_FILENAME`: Name of your custom script (default: `model.py`)
- `SAGEMAKER_MODEL_PATH`: Model directory (default: `/opt/ml/model`)
- `SAGEMAKER_CONTAINER_LOG_LEVEL`: Logging level (ERROR, INFO, DEBUG)

## Customization Methods

### Method 1: Environment Variables (Highest Priority)

Point directly to specific handler functions using environment variables. This overrides all other methods.

**⚠️ Important:** When using environment variables, reference handlers through the `model:` module alias rather than file paths.

```python
# ✅ RECOMMENDED: Use module alias
environment = {
'CUSTOM_SCRIPT_FILENAME': 'handlers_env_var.py',
'CUSTOM_FASTAPI_PING_HANDLER': 'model:health_check',
'CUSTOM_FASTAPI_INVOCATION_HANDLER': 'model:inference',
}

# ⚠️ ALTERNATIVE: Use absolute path
environment = {
'CUSTOM_FASTAPI_PING_HANDLER': '/opt/ml/model/handlers_env_var.py:health_check',
'CUSTOM_FASTAPI_INVOCATION_HANDLER': '/opt/ml/model/handlers_env_var.py:inference',
}
```

**Path Formats:**
- `model:function_name` - **Recommended** - Module alias (`model` = `$SAGEMAKER_MODEL_PATH/$CUSTOM_SCRIPT_FILENAME`)
- `/opt/ml/model/handlers.py:ping` - Absolute path
- `handlers.py:function_name` - Relative to `/opt/ml/model` (requires file to exist in that directory)
- `vllm.entrypoints.openai.api_server:health` - Python module path (for installed packages)

**Why use `model:` alias?**
- The `model` alias automatically resolves to your custom script file
- It's more portable and doesn't depend on absolute paths
- It works consistently across different deployment scenarios

### Method 2: Decorators

Use decorators to mark your custom handler functions. The system automatically discovers and registers them when your script loads.

**⚠️ Important:** Decorators must be defined in the file specified by `CUSTOM_SCRIPT_FILENAME` (default: `model.py`). The system only loads and scans this file for decorated functions.

```python
# model.py (or the file specified in CUSTOM_SCRIPT_FILENAME)
import model_hosting_container_standards.sagemaker as sagemaker_standards

@sagemaker_standards.custom_ping_handler
async def my_ping(request):
return {"status": "ok"}

@sagemaker_standards.custom_invocation_handler
async def my_invoke(request):
return {"result": "processed"}
```

**Environment Variable:**
```python
environment = {
'CUSTOM_SCRIPT_FILENAME': 'model.py', # File containing decorated functions
}
```

### Method 3: Function Discovery (Lowest Priority)

Name your functions using the expected pattern and they are discovered automatically; no decorator is needed.

**⚠️ Important:** Functions must be defined in the file specified by `CUSTOM_SCRIPT_FILENAME` (default: `model.py`). The system only loads and scans this file for functions matching the expected names.

```python
# model.py (or the file specified in CUSTOM_SCRIPT_FILENAME)
async def custom_sagemaker_ping_handler(request):
"""Automatically discovered by name."""
return {"status": "healthy"}

async def custom_sagemaker_invocation_handler(request):
"""Automatically discovered by name."""
return {"result": "processed"}
```

**Environment Variable:**
```python
environment = {
'CUSTOM_SCRIPT_FILENAME': 'model.py', # File containing handler functions
}
```

## Complete Example

```python
# model.py
import model_hosting_container_standards.sagemaker as sagemaker_standards
from model_hosting_container_standards.logging_config import logger
from fastapi import Request, Response
import json

@sagemaker_standards.custom_ping_handler
async def health_check(request: Request) -> Response:
"""Custom health check."""
logger.info("Custom health check called")
return Response(
content=json.dumps({"status": "healthy"}),
media_type="application/json",
status_code=200
)

@sagemaker_standards.custom_invocation_handler
async def inference_with_rag(request: Request) -> Response:
"""Custom inference with optional RAG integration and vLLM engine."""
body = await request.json()
prompt = body["prompt"]
max_tokens = body.get("max_tokens", 100)
temperature = body.get("temperature", 0.7)

# Optional RAG integration
if body.get("use_rag", False):
context = "Retrieved context..."
prompt = f"Context: {context}\n\nQuestion: {prompt}"

logger.info(f"Processing prompt: {prompt[:50]}...")

# Call vLLM engine directly
from vllm import SamplingParams
import uuid

engine = request.app.state.engine_client

# Create sampling parameters
sampling_params = SamplingParams(
temperature=temperature,
max_tokens=max_tokens,
)

# Generate response using vLLM engine
request_id = str(uuid.uuid4())
results_generator = engine.generate(prompt, sampling_params, request_id)

# Collect final output
final_output = None
async for request_output in results_generator:
final_output = request_output

# Extract generated text
if final_output and final_output.outputs:
generated_text = final_output.outputs[0].text
prompt_tokens = len(final_output.prompt_token_ids) if hasattr(final_output, "prompt_token_ids") else 0
completion_tokens = len(final_output.outputs[0].token_ids)
else:
generated_text = ""
prompt_tokens = 0
completion_tokens = 0

response_data = {
"predictions": [generated_text],
"usage": {
"prompt_tokens": prompt_tokens,
"completion_tokens": completion_tokens,
"total_tokens": prompt_tokens + completion_tokens,
},
"rag_enabled": body.get("use_rag", False)
}

return Response(
content=json.dumps(response_data),
media_type="application/json",
headers={"X-Request-Id": request_id}
)
```
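
Once the model is deployed behind an endpoint, the custom handler is invoked like any other SageMaker endpoint. A sketch of a client call (the endpoint name is a placeholder) that exercises the optional RAG branch and reads the fields the handler above returns:

```python
import boto3
import json

runtime_client = boto3.client('sagemaker-runtime')

response = runtime_client.invoke_endpoint(
    EndpointName='my-vllm-endpoint',
    ContentType='application/json',
    Body=json.dumps({
        "prompt": "What is the capital of France?",
        "max_tokens": 100,
        "temperature": 0.7,
        "use_rag": True   # routes through the RAG branch of the handler above
    })
)

result = json.loads(response['Body'].read())
print(result['predictions'][0])
print(result['usage'])
```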

## Troubleshooting

**Custom handlers not loading:**
- Verify `CUSTOM_SCRIPT_FILENAME` is set correctly (default: `model.py`)
- Check CloudWatch logs for import errors
- Enable debug logging: `SAGEMAKER_CONTAINER_LOG_LEVEL=DEBUG`

**Wrong handler being called:**
- Check handler resolution priority (env vars > decorators > function discovery > framework defaults)
- Use `SAGEMAKER_CONTAINER_LOG_LEVEL=DEBUG` to see which handler is selected

**Import errors:**
- Ensure all dependencies are installed in the container
- Verify the script path is correct

## Additional Resources

- **[Python Package README](../../python/README.md)** - Detailed decorator documentation and middleware options
- **[Handler Override Notebook](../../examples/vllm/notebooks/handler_customization_methods.ipynb)** - Complete working example
- **[Quick Start Guide](01_quickstart.md)** - Basic deployment
- **[Customize Pre/Post Processing](03_customize_pre_post_processing.md)** - Custom middleware for request/response transformation