From de68d685200df56d8adddac20cdada07fea72fe6 Mon Sep 17 00:00:00 2001
From: Luis Chamberlain
Date: Sat, 6 Dec 2025 08:56:21 -0800
Subject: [PATCH 1/2] terraform: Document tier-based GPU selection for Lambda Labs

Add comprehensive documentation for the tier-based GPU selection
feature to the Lambda Labs README. This includes documentation for the
capacity checking and tier selection scripts, the available tier groups
for both single GPU and multi-GPU configurations, and quick start
examples.

The documentation covers how tier-based selection works with automatic
fallback from higher to lower GPU tiers when capacity is unavailable.
It also updates the defconfigs table and scripts reference to include
the new tier-based options.

Generated-by: Claude AI
Signed-off-by: Luis Chamberlain
Signed-off-by: Chuck Lever
---
 terraform/lambdalabs/README.md | 103 +++++++++++++++++++++++++++++++++
 1 file changed, 103 insertions(+)

diff --git a/terraform/lambdalabs/README.md b/terraform/lambdalabs/README.md
index 4ec1ac447..71da2490b 100644
--- a/terraform/lambdalabs/README.md
+++ b/terraform/lambdalabs/README.md
@@ -8,6 +8,7 @@ This directory contains the Terraform configuration for deploying kdevops infras
 - [Prerequisites](#prerequisites)
 - [Quick Start](#quick-start)
 - [Dynamic Configuration](#dynamic-configuration)
+- [Tier-Based GPU Selection](#tier-based-gpu-selection)
 - [SSH Key Security](#ssh-key-security)
 - [Configuration Options](#configuration-options)
 - [Provider Limitations](#provider-limitations)
@@ -111,6 +112,101 @@ scripts/lambda-cli --output json pricing list
 For more details on the dynamic configuration system, see
 [Dynamic Cloud Kconfig Documentation](../../docs/dynamic-cloud-kconfig.md).
 
+## Tier-Based GPU Selection
+
+The Lambda Labs integration supports tier-based GPU selection with automatic
+fallback. Instead of specifying a single instance type, you can specify a
+maximum tier and kdevops will automatically select the highest available GPU
+within that tier.
+
+### How It Works
+
+1. **Specify Maximum Tier**: Choose a tier group like `H100_OR_LESS`
+2. **Capacity Check**: The system queries the Lambda Labs API for available instances
+3. **Tier Fallback**: Tries each tier from highest to lowest until one is available
+4. **Auto-Provision**: Deploys to the first region with available capacity
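+
+Conceptually, the fallback is a simple loop over an ordered list of instance
+types. The sketch below is illustrative only, not the shipped
+`lambdalabs_select_tier.py` implementation: the `regions_with_capacity()`
+helper is a hypothetical stand-in for the real capacity check, and every
+instance type name other than `gpu_1x_h100_sxm5` is a placeholder.
+
+```python
+#!/usr/bin/env python3
+# Illustrative sketch of tier fallback; see the tier tables below for
+# the real fallback orders.
+
+# Hypothetical fallback order for the h100-or-less tier group, highest
+# tier first. Only gpu_1x_h100_sxm5 is a confirmed instance type name.
+H100_OR_LESS = [
+    "gpu_1x_h100_sxm5",
+    "gpu_1x_h100_pcie",
+    "gpu_1x_a100_sxm4",
+    "gpu_1x_a100",
+    "gpu_1x_a6000",
+    "gpu_1x_rtx6000",
+    "gpu_1x_a10",
+]
+
+
+def regions_with_capacity(instance_type):
+    """Hypothetical helper: return the regions where instance_type is
+    available right now (the real check queries the Lambda Labs API)."""
+    fake_capacity = {"gpu_1x_a100": ["us-west-1"]}
+    return fake_capacity.get(instance_type, [])
+
+
+def select_tier(fallback_order):
+    """Walk the tiers from highest to lowest and return the first
+    (instance_type, region) pair with capacity, or (None, None)."""
+    for instance_type in fallback_order:
+        regions = regions_with_capacity(instance_type)
+        if regions:
+            return instance_type, regions[0]
+    return None, None
+
+
+print(*select_tier(H100_OR_LESS))  # e.g.: gpu_1x_a100 us-west-1
+```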
+
+### Single GPU Tier Groups
+
+| Tier Group | Fallback Order | Use Case |
+|------------|----------------|----------|
+| `GH200_OR_LESS` | GH200 → H100-SXM → H100-PCIe → A100-SXM → A100 → A6000 → RTX6000 → A10 | Maximum performance |
+| `H100_OR_LESS` | H100-SXM → H100-PCIe → A100-SXM → A100 → A6000 → RTX6000 → A10 | High performance |
+| `A100_OR_LESS` | A100-SXM → A100 → A6000 → RTX6000 → A10 | Cost-effective |
+| `A6000_OR_LESS` | A6000 → RTX6000 → A10 | Budget-friendly |
+
+### Multi-GPU (8x) Tier Groups
+
+| Tier Group | Fallback Order | Use Case |
+|------------|----------------|----------|
+| `8X_B200_OR_LESS` | 8x B200 → 8x H100 → 8x A100-80 → 8x A100 → 8x V100 | Maximum multi-GPU |
+| `8X_H100_OR_LESS` | 8x H100 → 8x A100-80 → 8x A100 → 8x V100 | High-end multi-GPU |
+| `8X_A100_OR_LESS` | 8x A100-80 → 8x A100 → 8x V100 | Cost-effective multi-GPU |
+
+### Quick Start with Tier Selection
+
+```bash
+# Single GPU - best available up to H100
+make defconfig-lambdalabs-h100-or-less
+make bringup
+
+# Single GPU - best available up to GH200
+make defconfig-lambdalabs-gh200-or-less
+make bringup
+
+# 8x GPU - best available up to H100
+make defconfig-lambdalabs-8x-h100-or-less
+make bringup
+```
+
+### Checking Capacity
+
+Before deploying, you can check current GPU availability:
+
+```bash
+# Check all available GPU instances
+python3 scripts/lambdalabs_check_capacity.py
+
+# Check specific instance type
+python3 scripts/lambdalabs_check_capacity.py --instance-type gpu_1x_h100_sxm5
+
+# JSON output for scripting
+python3 scripts/lambdalabs_check_capacity.py --json
+```
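+
+Under the hood, a capacity check amounts to a single API query. The snippet
+below is a minimal sketch rather than the shipped script: it assumes the
+Lambda Labs public API's `/api/v1/instance-types` endpoint, whose
+per-instance-type entries carry a `regions_with_capacity_available` list,
+and it assumes the API key is exported in a hypothetical
+`LAMBDALABS_API_KEY` environment variable.
+
+```python
+#!/usr/bin/env python3
+# Minimal capacity probe; endpoint, response layout, and environment
+# variable name are assumptions noted above.
+import os
+
+import requests
+
+API_URL = "https://cloud.lambdalabs.com/api/v1/instance-types"
+
+
+def check_capacity(api_key):
+    """Print every instance type that currently has capacity somewhere."""
+    resp = requests.get(API_URL, headers={"Authorization": f"Bearer {api_key}"})
+    resp.raise_for_status()
+    for name, info in resp.json()["data"].items():
+        regions = [r["name"] for r in info.get("regions_with_capacity_available", [])]
+        if regions:
+            print(f"{name}: {', '.join(regions)}")
+
+
+if __name__ == "__main__":
+    check_capacity(os.environ["LAMBDALABS_API_KEY"])
+```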
+
+### Tier Selection Script
+
+The tier selection script finds the best available GPU:
+
+```bash
+# Find best single GPU up to H100
+python3 scripts/lambdalabs_select_tier.py h100-or-less --verbose
+
+# Find best 8x GPU up to H100
+python3 scripts/lambdalabs_select_tier.py 8x-h100-or-less --verbose
+
+# List all available tier groups
+python3 scripts/lambdalabs_select_tier.py --list-tiers
+```
+
+Example output:
+```
+Checking tier group: h100-or-less
+Tiers to check (highest to lowest): h100-sxm, h100-pcie, a100-sxm, a100, a6000, rtx6000, a10
+
+Checking tier 'h100-sxm': gpu_1x_h100_sxm5
+  Checking gpu_1x_h100_sxm5... ✓ AVAILABLE in us-west-1
+
+Selected: gpu_1x_h100_sxm5 in us-west-1 (tier: h100-sxm)
+gpu_1x_h100_sxm5 us-west-1
+```
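+
+The final line of output is machine readable (`<instance_type> <region>`),
+which makes the selector easy to drive from automation. A minimal sketch,
+assuming the script exits non-zero when no tier in the group has capacity:
+
+```python
+#!/usr/bin/env python3
+# Illustrative wrapper; relies only on the documented final output line.
+import subprocess
+import sys
+
+result = subprocess.run(
+    ["python3", "scripts/lambdalabs_select_tier.py", "h100-or-less"],
+    capture_output=True,
+    text=True,
+)
+if result.returncode != 0:
+    sys.exit("no capacity available in this tier group")
+
+# Final stdout line, e.g. "gpu_1x_h100_sxm5 us-west-1".
+instance_type, region = result.stdout.strip().splitlines()[-1].split()
+print(f"provisioning {instance_type} in {region}")
+```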
+
+### Benefits of Tier-Based Selection
+
+- **Higher Success Rate**: Automatically falls back to available GPUs
+- **No Manual Intervention**: System handles capacity changes
+- **Best Performance**: Always gets the highest tier available
+- **Simple Configuration**: One defconfig covers multiple GPU types
+
 ## SSH Key Security
 
 ### Automatic Unique Keys (Default - Recommended)
@@ -168,6 +264,11 @@ The default configuration automatically:
 |--------|-------------|----------|
 | `defconfig-lambdalabs` | Smart instance + unique SSH keys | Production (recommended) |
 | `defconfig-lambdalabs-shared-key` | Smart instance + shared SSH key | Legacy/testing |
+| `defconfig-lambdalabs-gh200-or-less` | Best single GPU up to GH200 | Maximum performance |
+| `defconfig-lambdalabs-h100-or-less` | Best single GPU up to H100 | High performance |
+| `defconfig-lambdalabs-a100-or-less` | Best single GPU up to A100 | Cost-effective |
+| `defconfig-lambdalabs-8x-b200-or-less` | Best 8-GPU up to B200 | Maximum multi-GPU |
+| `defconfig-lambdalabs-8x-h100-or-less` | Best 8-GPU up to H100 | High-end multi-GPU |
 
 ### Manual Configuration
 
@@ -274,6 +375,8 @@ The Lambda Labs Terraform provider (elct9620/lambdalabs v0.3.0) has significant
 |--------|---------|
 | `lambdalabs_api.py` | Main API integration, generates Kconfig |
 | `lambdalabs_smart_inference.py` | Smart instance/region selection |
+| `lambdalabs_check_capacity.py` | Check GPU availability across regions |
+| `lambdalabs_select_tier.py` | Tier-based GPU selection with fallback |
 | `lambdalabs_ssh_keys.py` | SSH key management |
 | `lambdalabs_list_instances.py` | List running instances |
 | `lambdalabs_credentials.py` | Manage API credentials |

From 304c6a365447d44fe79c21815d0dfd43fd677f56 Mon Sep 17 00:00:00 2001
From: Luis Chamberlain
Date: Sat, 6 Dec 2025 08:56:22 -0800
Subject: [PATCH 2/2] docs: Organize cloud providers with Neoclouds section

Reorganize the terraform documentation to distinguish between
traditional cloud providers and neoclouds. A neocloud is a specialized
cloud provider focused on GPU-as-a-Service for AI and ML workloads,
with infrastructure optimized for raw speed, dense GPU clusters, and
simplified pricing.

Traditional providers include Azure, AWS, GCE, and OCI. Neoclouds
include DataCrunch and Lambda Labs with their GPU-focused offerings.

The documentation now includes links to Lambda Labs dynamic Kconfig
generation and CLI reference documentation.

Generated-by: Claude AI
Signed-off-by: Luis Chamberlain
Signed-off-by: Chuck Lever
---
 README.md                 |  2 +-
 docs/kdevops-terraform.md | 33 ++++++++++++++++++++++++++++++++-
 2 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 52b07aede..bb64c6404 100644
--- a/README.md
+++ b/README.md
@@ -526,7 +526,7 @@ Below are sections which get into technical details of how kdevops works.
 * [Linux distribution support](docs/linux-distro-support.md)
 * [Overriding all Ansible role options with one file](docs/ansible-override.md)
 * [kdevops Vagrant support](docs/kdevops-vagrant.md)
-* [kdevops terraform support - cloud setup with kdevops](docs/kdevops-terraform.md)
+* [kdevops terraform and cloud provider support](docs/kdevops-terraform.md) - AWS, Azure, GCE, OCI, Lambda Labs, DataCrunch
 * [kdevops local Ansible roles](docs/ansible-roles.md)
 * [Tutorial on building your own custom Vagrant boxes](docs/custom-vagrant-boxes.md)
diff --git a/docs/kdevops-terraform.md b/docs/kdevops-terraform.md
index 0dd420481..51f920043 100644
--- a/docs/kdevops-terraform.md
+++ b/docs/kdevops-terraform.md
@@ -7,10 +7,13 @@ a Terraform plan.
 Terraform is used to deploy your development hosts on cloud virtual
 machines. Below are the list of clouds providers currently supported:
 
+**Traditional Cloud Providers:**
 * azure - Microsoft Azure
 * aws - Amazon Web Service
 * gce - Google Cloud Compute
 * oci - Oracle Cloud Infrastructure
+
+**Neoclouds (GPU-optimized):**
 * datacrunch - DataCrunch GPU Cloud
 * lambdalabs - Lambda Labs GPU Cloud
 
@@ -271,7 +274,18 @@ If your Ansible controller (where you run "make bringup") and your test
 instances operate inside the same subnet, you can disable the
 TERRAFORM_OCI_ASSIGN_PUBLIC_IP option for better network security.
 
-### DataCrunch - GPU Cloud Provider
+## Neoclouds
+
+A neocloud is a new type of specialized cloud provider that focuses on offering
+high-performance computing, particularly GPU-as-a-Service, to handle demanding
+AI and machine learning workloads. Unlike traditional, general-purpose cloud
+providers like AWS or Azure, neoclouds are purpose-built for AI, with
+infrastructure optimized for raw speed, specialized hardware such as dense GPU
+clusters, and tailored services such as fast deployment and simplified pricing.
+
+kdevops supports the following neocloud providers:
+
+### DataCrunch
 
 kdevops supports DataCrunch, a cloud provider specialized in GPU computing
 with competitive pricing for NVIDIA A100, H100, B200, and B300 instances.
@@ -450,3 +464,20 @@ provider_installation {
 ```
 
 For more information, visit: https://datacrunch.io/
+
+### Lambda Labs
+
+kdevops supports Lambda Labs, a cloud provider focused on GPU instances for
+machine learning workloads, with competitive pricing.
+
+For detailed documentation on Lambda Labs integration, including tier-based
+GPU selection, smart instance selection, and dynamic Kconfig generation, see:
+
+ * [Lambda Labs Dynamic Cloud Kconfig](dynamic-cloud-kconfig.md) - Dynamic configuration generation for Lambda Labs
+ * [Lambda Labs CLI Reference](lambda-cli.1) - Man page for the lambda-cli tool
+
+Lambda Labs offers various GPU instance types including A10, A100, and H100
+configurations. kdevops provides smart selection features that automatically
+choose the cheapest available instance type and region.
+
+For more information, visit: https://lambdalabs.com/