This guide covers everything needed to run Tensorbit Core on cloud NVIDIA GPUs
(A100 / H100) for pruning real large language models. No GPU is needed for
development — see docs/DOCUMENTATION.md for local laptop testing.
| GPU | VRAM | Max model (FP32) | Hourly rate | Provider |
|---|---|---|---|---|
| A100-SXM4-80GB | 80 GB | 7B–8B | ~$1.10 | Lambda, RunPod |
| H100-SXM-80GB | 80 GB | 7B–13B | ~$2.50 | Lambda |
| A100-SXM4-40GB | 40 GB | Up to 3B | ~$0.75 | Vast.ai |
| 2× A100-80GB | 160 GB | 13B–30B | ~$2.20 | Lambda |
| 4× A100-80GB | 320 GB | 30B–70B | ~$4.40 | Lambda |
For 7B-parameter models (Llama 2, Mistral), a single A100-80GB is sufficient. Peak memory during pruning: ~58 GB (28 GB weights + 28 GB Fisher + 2 GB masks).
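The ~58 GB figure can be sanity-checked with quick arithmetic: one FP32 copy of the weights, a same-sized Fisher buffer, and a small mask allowance. The helper below is illustrative only (the 2 GB mask allowance is taken from the estimate above, not from the tb-prune code):

```python
def pruning_peak_gb(n_params: float, bytes_per_param: int = 4,
                    mask_overhead_gb: float = 2.0) -> float:
    """Rough peak-memory estimate for Fisher-based pruning:
    weights + same-sized Fisher buffer + mask overhead (decimal GB)."""
    gb = 1e9
    weights = n_params * bytes_per_param / gb   # FP32 weight copy
    fisher = weights                            # Fisher stored at same precision
    return weights + fisher + mask_overhead_gb

print(round(pruning_peak_gb(7e9), 1))  # 58.0
```

This is why a 7B model fits comfortably on a single A100-80GB, while 13B (~106 GB peak by the same arithmetic) needs two.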
Lambda offers the best balance of price, availability, and ease of use.
- Go to lambda.ai
- Create an account, add SSH key
- Click "Launch GPU instance" from the dashboard
- Select:
  - GPU: 1× NVIDIA H100 (80 GB SXM) or 1× A100 (80 GB SXM)
  - Image: Ubuntu 22.04
- SSH into the instance
RunPod offers flexible, pay-per-minute GPU rentals, well suited to short pruning jobs.
- Go to runpod.io
- Deploy a Secure Cloud pod or use GPU Cloud for spot pricing
- Select A100 80GB or H100 80GB
- Choose the RunPod PyTorch template (has CUDA pre-installed)
- Connect via SSH or Web Terminal
Vast.ai is the cheapest option but less reliable; good for experimentation.
- Go to vast.ai
- Search for "A100" or "H100" in the rental marketplace
- Filter by reliability (> 99%) and price
- Rent and SSH in
Once connected to your cloud instance, run the setup script:

```bash
git clone <your-repo-url> tensorbit-core
cd tensorbit-core
sudo ./scripts/setup_cloud.sh
```

This installs:

- GCC 13, CMake 3.28, Ninja, ccache
- CUDA 12.6
- Eigen3 3.4.0
- Python 3.11 + PyTorch + safetensors + huggingface_hub

After setup, log out and back in (or run `source /etc/profile.d/cuda.sh`) to put CUDA on your PATH.

Verify:

```bash
nvcc --version   # should show CUDA 12.x
nvidia-smi       # should list your GPU
```

Then build:

```bash
cd tensorbit-core
mkdir -p build && cd build
cmake .. -DEIGEN3_ROOT=/usr/local/include/eigen3 \
         -GNinja -DCMAKE_BUILD_TYPE=Release
ninja
```

The binary is at `build/bin/tb-prune`.
Activate the Python environment created by setup_cloud.sh:

```bash
source /opt/tensorbit-venv/bin/activate

# Download Llama 2 7B (requires HuggingFace access)
python scripts/download_model.py \
    --repo meta-llama/Llama-2-7b-hf \
    --output ./models/llama-2-7b/ \
    --token hf_YOUR_TOKEN \
    --quantize fp16

# Or Mistral 7B (open weights, no token needed for some variants)
python scripts/download_model.py \
    --repo mistralai/Mistral-7B-v0.1 \
    --output ./models/mistral-7b/ \
    --quantize fp16
```

Then change into the build directory:

```bash
cd build
```
Run the pruner:

```bash
# 2:4 structured sparsity with BlockOBS (best accuracy)
./bin/tb-prune \
    --model ../models/llama-2-7b/model-00001-of-00002.safetensors \
    --sparsity 2:4 \
    --strategy BlockOBS \
    --output llama-2-7b-2of4.tb

# 2:4 with the faster OneShot strategy
./bin/tb-prune \
    --model ../models/llama-2-7b/model-00001-of-00002.safetensors \
    --sparsity 2:4 \
    --strategy OneShot \
    --output llama-2-7b-2of4-fast.tb

# 1:4 aggressive sparsity
./bin/tb-prune \
    --model ../models/llama-2-7b/model-00001-of-00002.safetensors \
    --sparsity 1:4 \
    --strategy Iterative \
    --output llama-2-7b-1of4.tb
```

Expected output: the same format as mock mode; the CLI shows EHAP progress, CORING mask generation, and .tb file verification.
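Mechanically, 2:4 sparsity means that in every group of four consecutive weights, only two may be nonzero. The sketch below illustrates the constraint using plain magnitude selection; the actual tool's BlockOBS/OneShot strategies pick which weights to drop using curvature information, not raw magnitude:

```python
def nm_prune(weights, n=2, m=4):
    """Keep the n largest-magnitude values in each group of m
    consecutive weights; zero the rest (N:M structured sparsity)."""
    assert len(weights) % m == 0
    out = []
    for i in range(0, len(weights), m):
        group = weights[i:i + m]
        # indices of the n entries with largest |value| in this group
        keep = sorted(range(m), key=lambda j: -abs(group[j]))[:n]
        out.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return out

print(nm_prune([0.9, -0.1, 0.4, 0.05, -0.7, 0.2, 0.1, 0.6]))
# [0.9, 0.0, 0.4, 0.0, -0.7, 0.0, 0.0, 0.6]
```

The resulting pattern is what NVIDIA tensor cores accelerate: each 4-wide group carries exactly two values plus a 2-bit index, halving both storage and math.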
For models too large for a single GPU (>13B parameters):

```bash
# Launch with NCCL for distributed Fisher accumulation (future feature)
# Currently: prune each safetensors shard independently
for shard in ../models/llama-2-70b/model-*.safetensors; do
    base=$(basename "$shard" .safetensors)
    ./bin/tb-prune \
        --model "$shard" \
        --sparsity 2:4 \
        --strategy BlockOBS \
        --output "llama-70b-${base}.tb"
done
```

After pruning, transfer the output to your local machine:

```bash
# From cloud to local
rsync -avh --progress user@cloud-ip:~/tensorbit-core/build/llama-2-7b-2of4.tb .

# Or using a cloud storage bucket
gsutil cp llama-2-7b-2of4.tb gs://your-bucket/
```

| Model | Strategy | GPU | Time (est.) | Cost (est.) |
|---|---|---|---|---|
| Llama 2 7B | OneShot | A100 80GB | ~10 min | ~$0.18 |
| Llama 2 7B | BlockOBS | A100 80GB | ~60 min | ~$1.10 |
| Llama 2 13B | BlockOBS | 2× A100 80GB | ~120 min | ~$4.40 |
| Mistral 7B | OneShot | A100 80GB | ~10 min | ~$0.18 |
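The cost column is simply runtime times the hourly rate from the GPU table (e.g. 10 min on an A100-80GB at ~$1.10/hr). A tiny helper, hypothetical and only for estimating your own jobs:

```python
def job_cost(minutes: float, hourly_rate: float) -> float:
    """Estimated cost of a job billed at an hourly GPU rate, in dollars."""
    return round(minutes / 60 * hourly_rate, 2)

print(job_cost(10, 1.10))   # 0.18  (Llama 2 7B, OneShot, A100 80GB)
print(job_cost(120, 2.20))  # 4.4   (Llama 2 13B, BlockOBS, 2x A100)
```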
If `nvcc` is not on your PATH:

```bash
source /etc/profile.d/cuda.sh
# or: export PATH=/usr/local/cuda-12/bin:$PATH
```

If the CUDA toolkit is not installed at all, run `sudo ./scripts/setup_cloud.sh`.

If no GPU is visible:

```bash
nvidia-smi
# If no GPU shows, your instance may need a reboot or re-provisioning
sudo reboot
```

If BlockOBS runs out of memory, reduce the block size via EHAPConfig::obs_block_size (default 128). BlockOBS allocates O(B²) memory per block; for large models, set B = 64.
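The O(B²) scaling is easy to quantify: each block holds a B×B curvature matrix, so halving B quarters the per-block buffer. A sketch (assuming one FP32 B×B buffer per block; the actual BlockOBS implementation may allocate additional workspace):

```python
def obs_block_bytes(block_size: int, bytes_per_elem: int = 4) -> int:
    """Memory for one B x B FP32 curvature block in OBS-style pruning."""
    return block_size * block_size * bytes_per_elem

print(obs_block_bytes(128))  # 65536 bytes (64 KiB) per block
print(obs_block_bytes(64))   # 16384 bytes (16 KiB), 4x smaller
```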