Cloud GPU Deployment Guide

This guide covers everything needed to run Tensorbit Core on cloud NVIDIA GPUs (A100 / H100) to prune real large language models. Development does not require a GPU; see docs/DOCUMENTATION.md for local laptop testing.

Choosing a GPU

GPU	VRAM	Max model (FP32)	Hourly rate	Provider
A100-SXM4-80GB	80 GB	7B–8B	~$1.10	Lambda, RunPod
H100-SXM-80GB	80 GB	7B–13B	~$2.50	Lambda
A100-SXM4-40GB	40 GB	Up to 3B	~$0.75	Vast.ai
2× A100-80GB	160 GB	13B–30B	~$2.20	Lambda
4× A100-80GB	320 GB	30B–70B	~$4.40	Lambda

For 7B-parameter models (Llama 2, Mistral), a single A100-80GB is sufficient. Peak memory during pruning is ~58 GB: 28 GB for FP32 weights (7B parameters × 4 bytes), 28 GB for Fisher information, and 2 GB for masks.
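
The 58 GB figure is simple arithmetic and can be reproduced for other model sizes. The sketch below assumes FP32 storage (4 bytes per value) for both the weights and the Fisher information, and scales the mask overhead linearly from the 2 GB quoted for 7B. The function name and the linear scaling are illustrative assumptions, not part of Tensorbit Core.

```python
GB = 1e9  # decimal gigabytes, matching the figures in this guide


def estimate_peak_gb(params_billion, bytes_per_value=4, mask_gb_per_billion=2 / 7):
    """Rough peak memory (GB) while pruning: weights + Fisher + masks.

    Assumptions (not taken from the Tensorbit Core source):
    - weights and Fisher information each store one value per parameter
    - mask overhead scales linearly from the guide's 2 GB at 7B
    """
    params = params_billion * 1e9
    weights_gb = params * bytes_per_value / GB
    fisher_gb = params * bytes_per_value / GB
    masks_gb = params_billion * mask_gb_per_billion
    return weights_gb + fisher_gb + masks_gb


for size in (3, 7):
    print(f"{size}B parameters: ~{estimate_peak_gb(size):.0f} GB peak")
```

Under these assumptions a 7B model lands at about 58 GB, which fits on an 80 GB card, and a 3B model at about 25 GB, which fits on a 40 GB card.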

Providers

Lambda (Recommended)

Best balance of price, availability, and ease of use.

1. Go to lambda.ai.
2. Create an account and add an SSH key.
3. Click "Launch GPU instance" from the dashboard.
4. Select:
   - GPU: 1× NVIDIA H100 (80 GB SXM) or 1× A100 (80 GB SXM)
   - Image: Ubuntu 22.04
5. SSH into the instance.

RunPod

Flexible, pay-per-minute GPU rentals. Good for short pruning jobs.

1. Go to runpod.io.
2. Deploy a Secure Cloud pod or use GPU Cloud for spot pricing.
3. Select A100 80GB or H100 80GB.
4. Choose the RunPod PyTorch template (has CUDA pre-installed).
5. Connect via SSH or Web Terminal.

Vast.ai

Cheapest option but less reliable. Good for experimentation.

1. Go to vast.ai.
2. Search for "A100" or "H100" in the rental marketplace.
3. Filter by reliability (> 99%) and price.
4. Rent and SSH in.

Instance Setup

Once connected to your cloud instance, run the setup script:

git clone <your-repo-url> tensorbit-core
cd tensorbit-core
sudo ./scripts/setup_cloud.sh

This installs:

- GCC 13, CMake 3.28, Ninja, ccache
- CUDA 12.6
- Eigen3 3.4.0
- Python 3.11 + PyTorch + safetensors + huggingface_hub

After setup, log out and back in (or source /etc/profile.d/cuda.sh) to activate CUDA in your PATH.

Verify:

nvcc --version     # Should show CUDA 12.x
nvidia-smi         # Should show your GPU
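
The two checks above can also be scripted, which is convenient when provisioning several instances. This is a minimal sketch using only the Python standard library; the helper name `missing_tools` is hypothetical and not part of the repository.

```python
import shutil


def missing_tools(tools=("nvcc", "nvidia-smi")):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not on PATH:", ", ".join(missing))
        # Same remedy as described in this guide
        print("Try: source /etc/profile.d/cuda.sh")
    else:
        print("nvcc and nvidia-smi are both on PATH")
```

This only confirms that the executables are reachable. It does not check the CUDA version or that a GPU is attached, so `nvcc --version` and `nvidia-smi` remain the authoritative checks.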

Building with CUDA

cd tensorbit-core
mkdir -p build && cd build
cmake .. -DEIGEN3_ROOT=/usr/local/include/eigen3 \
         -GNinja -DCMAKE_BUILD_TYPE=Release
ninja

The binary is at build/bin/tb-prune.

Downloading a Model

Activate the Python environment created by setup_cloud.sh:

source /opt/tensorbit-venv/bin/activate

# Download Llama 2 7B (requires HuggingFace access)
python scripts/download_model.py \
    --repo meta-llama/Llama-2-7b-hf \
    --output ./models/llama-2-7b/ \
    --token hf_YOUR_TOKEN \
    --quantize fp16

# Or Mistral 7B (open weights, no token needed for some variants)
python scripts/download_model.py \
    --repo mistralai/Mistral-7B-v0.1 \
    --output ./models/mistral-7b/ \
    --quantize fp16

Pruning a Real Model

cd build

# 2:4 structured sparsity with BlockOBS (best accuracy)
./bin/tb-prune \
    --model ../models/llama-2-7b/model-00001-of-00002.safetensors \
    --sparsity 2:4 \
    --strategy BlockOBS \
    --output llama-2-7b-2of4.tb

# 2:4 with faster OneShot strategy
./bin/tb-prune \
    --model ../models/llama-2-7b/model-00001-of-00002.safetensors \
    --sparsity 2:4 \
    --strategy OneShot \
    --output llama-2-7b-2of4-fast.tb

# 1:4 aggressive sparsity
./bin/tb-prune \
    --model ../models/llama-2-7b/model-00001-of-00002.safetensors \
    --sparsity 1:4 \
    --strategy Iterative \
    --output llama-2-7b-1of4.tb

Expected output: the same format as mock mode. The CLI shows EHAP progress, CORING mask generation, and .tb file verification.
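
To make the `--sparsity` patterns concrete, the sketch below builds an N:M mask for a flat list of weights. It assumes that N:M means N weights are kept in every group of M consecutive weights, so 2:4 is 50% sparse and 1:4 is 75% sparse. It uses weight magnitude as a stand-in importance score; the strategies tb-prune actually offers (BlockOBS, OneShot, Iterative) and its CORING mask generation are not reproduced here.

```python
def nm_mask(weights, n=2, m=4):
    """Return a 0/1 mask keeping the n largest-magnitude weights per group of m.

    Illustrative only: magnitude is a placeholder importance score.
    """
    if len(weights) % m != 0:
        raise ValueError("number of weights must be a multiple of m")
    mask = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices within the group, ordered by descending magnitude
        ranked = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(ranked[:n])
        mask.extend(1 if i in keep else 0 for i in range(m))
    return mask


def sparsity(mask):
    """Fraction of weights that the mask zeroes out."""
    return 1 - sum(mask) / len(mask)


weights = [0.1, -0.9, 0.5, 0.2, 1.0, 0.0, -0.3, 0.05]
print(nm_mask(weights, 2, 4), sparsity(nm_mask(weights, 2, 4)))
print(nm_mask(weights, 1, 4), sparsity(nm_mask(weights, 1, 4)))
```

Because every group of four has the same number of survivors, the pattern is regular enough for hardware to exploit, which is the motivation for structured rather than unstructured sparsity.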

Multi-GPU Pruning

For models too large for a single GPU (>13B parameters):

# Distributed Fisher accumulation over NCCL is a planned feature.
# For now, prune each safetensors shard independently:
for shard in ../models/llama-2-70b/model-*.safetensors; do
    base=$(basename "$shard" .safetensors)
    ./bin/tb-prune \
        --model "$shard" \
        --sparsity 2:4 \
        --strategy BlockOBS \
        --output "llama-70b-${base}.tb"
done
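
The loop above processes shards one after another. On a multi-GPU instance, one way to use every card is to pin each shard to a GPU through the standard `CUDA_VISIBLE_DEVICES` environment variable. The sketch below is an assumption-laden illustration: it uses only the tb-prune flags shown in this guide, assumes tb-prune runs on the first visible device, and assumes each shard fits in a single GPU's memory. The helper names are hypothetical.

```python
import os
import subprocess
from pathlib import Path


def plan_shard_jobs(shards, num_gpus, sparsity="2:4", strategy="BlockOBS",
                    binary="./bin/tb-prune", output_prefix="llama-70b"):
    """Assign shards to GPUs round-robin and build one command per shard."""
    jobs = []
    for index, shard in enumerate(sorted(str(s) for s in shards)):
        gpu = index % num_gpus
        cmd = [binary,
               "--model", shard,
               "--sparsity", sparsity,
               "--strategy", strategy,
               "--output", f"{output_prefix}-{Path(shard).stem}.tb"]
        # Restrict the child process to a single GPU
        env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu)}
        jobs.append((gpu, cmd, env))
    return jobs


def run_jobs(jobs, num_gpus):
    """Run jobs in waves so each GPU handles one shard at a time."""
    for start in range(0, len(jobs), num_gpus):
        wave = [subprocess.Popen(cmd, env=env)
                for _, cmd, env in jobs[start:start + num_gpus]]
        for proc in wave:
            proc.wait()
```

A typical call from the build directory would be `run_jobs(plan_shard_jobs(Path("../models/llama-2-70b").glob("model-*.safetensors"), 4), 4)`. Shards are still pruned independently of each other, exactly as in the shell loop.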

Transferring .tb Files

After pruning, transfer the output to your local machine:

# From cloud to local
rsync -avh --progress user@cloud-ip:~/tensorbit-core/build/llama-2-7b-2of4.tb .

# Or using a cloud storage bucket
gsutil cp llama-2-7b-2of4.tb gs://your-bucket/
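
Pruned model files are large, so it can be worth confirming that a transfer completed intact. The sketch below is an optional extra step rather than part of the Tensorbit Core workflow: it computes a SHA-256 digest with the Python standard library, reading the file in chunks so memory use stays small.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Run `sha256_of("llama-2-7b-2of4.tb")` on the cloud instance and again on the local copy; the two digests should be identical.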

Cost Estimates

Model	Strategy	GPU	Time (est.)	Cost (est.)
Llama 2 7B	OneShot	A100 80GB	~10 min	~$0.18
Llama 2 7B	BlockOBS	A100 80GB	~60 min	~$1.10
Llama 2 13B	BlockOBS	2× A100 80GB	~120 min	~$4.40
Mistral 7B	OneShot	A100 80GB	~10 min	~$0.18
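
The cost column is the hourly rate from the GPU table multiplied by the estimated run time. The sketch below reproduces that arithmetic so the estimate can be redone with current prices; the rates and durations are the approximate figures from this guide, not measured values.

```python
def estimated_cost(hourly_rate_usd, minutes):
    """Cost in USD of renting a GPU at `hourly_rate_usd` for `minutes`."""
    return hourly_rate_usd * minutes / 60


rows = [
    ("Llama 2 7B, OneShot, A100 80GB", 1.10, 10),
    ("Llama 2 7B, BlockOBS, A100 80GB", 1.10, 60),
    ("Llama 2 13B, BlockOBS, 2x A100 80GB", 2.20, 120),
]
for label, rate, minutes in rows:
    print(f"{label}: ~${estimated_cost(rate, minutes):.2f}")
```

Setup, model download, and file transfer time are billed as well, so the real total for a session will be somewhat higher than the pruning time alone suggests.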

Troubleshooting

`nvcc: command not found`

source /etc/profile.d/cuda.sh
# or: export PATH=/usr/local/cuda-12/bin:$PATH

`CMake Error: Failed to find nvcc`

The CUDA toolkit is not installed. Run sudo ./scripts/setup_cloud.sh from the repository root.

`cudaErrorNoDevice: no CUDA-capable device is detected`

nvidia-smi
# If no GPU shows, your instance may need a reboot or re-provisioning
sudo reboot

Out of memory during BlockOBS

Reduce the block size via EHAPConfig::obs_block_size (default 128). BlockOBS allocates O(B²) memory per block, so halving B from 128 to 64 cuts per-block memory to roughly a quarter. For large models, set B = 64.
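
The quadratic scaling can be checked with a few lines of arithmetic. The sketch below assumes each block holds one dense B × B matrix of FP32 values; the actual per-block layout in Tensorbit Core may include additional buffers, so treat the absolute numbers as a lower bound and the ratio as the useful part.

```python
def obs_block_bytes(block_size, bytes_per_value=4):
    """Bytes for one dense block_size x block_size matrix (assumed FP32)."""
    return block_size * block_size * bytes_per_value


for b in (128, 64):
    print(f"B = {b}: {obs_block_bytes(b) // 1024} KiB per block")

ratio = obs_block_bytes(128) / obs_block_bytes(64)
print(f"B = 128 uses {ratio:.0f}x the per-block memory of B = 64")
```

Total memory also depends on how many blocks are resident at once, which this guide does not specify, so the overall saving may differ from the per-block ratio.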
