Skip to content

Enterprise-AI-Atlas/awesome-gpu-cloud

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Awesome GPU Cloud

Awesome License: CC0-1.0 PRs Welcome

A curated registry of GPU cloud providers, serverless GPU inference platforms, Kubernetes GPU platforms, and supporting tooling for training, fine-tuning, and serving AI models.

The goal is to be the definitive place to discover on-demand, reserved, and serverless GPU compute—from hyperscalers and GPU-first clouds to marketplaces and container images.

Inclusion criteria: Resources must be publicly accessible, actively maintained, and directly related to GPU cloud compute or GPU-accelerated AI infrastructure. General CPU cloud services and closed, undocumented offerings are out of scope. See CONTRIBUTING.md for the full quality bar.


Contents


Major Cloud Providers

Hyperscale clouds with managed GPU instances for training, inference, and HPC.

  • AWS EC2 Accelerated Computing Official — P5/P5e (H100/H200), P4d (A100), G6e (L40S), and Trn/Inf accelerators with EFA networking and UltraClusters.
    • Pricing: on-demand, spot, reserved, and capacity blocks.
  • Google Cloud GPU Instances Official — A3 (H100), A3 Ultra (H200), A4 (B200), A4X (GB200), G2 (L4), and G4 (RTX PRO 6000) machine types.
    • Tight integration with Vertex AI, GKE, and TPUs.
  • Azure NC/ND GPU Virtual Machines Official — NCads H100 v5, ND H100 v5, NC A100 v4, and NVads A10 v5 series with InfiniBand options.
    • Strong fit for Windows/Linux workstations and Azure ML training jobs.
  • Oracle Cloud Infrastructure GPU Official — Bare-metal and VM instances with H100, H200, A100, L40S, and AMD MI300X GPUs; RDMA cluster networking.
    • Often priced below AWS/GCP/Azure for equivalent shapes; free egress tiers.
  • IBM Cloud GPU Official — NVIDIA H100 and A100 instances for AI training, inference, and VDI workloads.
  • Alibaba Cloud GPU Official — GN7 (A10), GN10e (V100), and GPU-accelerated ECS instances for AI and rendering in Asia-Pacific.

Serverless GPU Inference

Platforms that abstract away infrastructure and scale GPU workloads to zero.

  • Modal Official — Serverless GPU functions with a Python SDK; deploy inference, training, and batch jobs on L40S, A100, and H100.
    • Pricing: per-second billing; H100 from ~$3.95/hr, L40S from ~$1.95/hr.
  • RunPod Serverless Official — Autoscaling GPU endpoints with FlashBoot; supports 30+ GPU types from RTX 4090 to H200.
    • Pricing: billed by the millisecond; H100 from ~$1.99/hr on Community Cloud.
  • Baseten Official — Model-serving platform with low-latency inference, async workers, and the open-source Truss packaging framework.
  • Replicate Official — Run and deploy open-source ML models as scalable APIs; supports custom models and a large public model zoo.
  • Fal Official — Generative-media inference engine optimized for diffusion and real-time image/video workloads.
  • Koyeb Official — Serverless cloud with global GPU deployment and scale-to-zero; supports H100, A100, and Tenstorrent accelerators.
    • Pricing: A100 ~$2/hr, H100 ~$3.30/hr.
  • Novita AI Official — GPU cloud and serverless inference with 200+ ready-to-use models and on-demand GPU instances.
  • Google Cloud Run GPUs Official — Serverless container platform with NVIDIA L4 GPU support and request-driven or job-based GPU workloads.
  • Cerebrium Official — Serverless GPU platform for AI inference and training with infrastructure-as-code and fast cold starts.
  • Beam Cloud Official — Serverless GPU cloud for training and inference with per-second billing and Python SDK.

GPU Cloud Marketplaces

Aggregators and peer-to-peer marketplaces for finding GPU capacity across providers.

  • Vast.ai Community / Marketplace — Peer-to-peer GPU marketplace with 20,000+ GPUs; on-demand, interruptible, and reserved pricing.
    • Pricing: RTX 4090 from ~$0.27/hr, A100 80GB from ~$0.70/hr.
  • TensorDock Community / Marketplace — Marketplace of independent hosts offering H100, A100, RTX 4090, and more across 100+ locations.
    • Pricing: H100 from ~$2.20/hr; no quotas or long-term contracts.
  • Shadeform Community / Marketplace — GPU cloud marketplace deploying across 30+ clouds with one console, API, and bill.
  • Prime Intellect Community — Decentralized compute exchange aggregating 12+ providers for distributed training and inference.
  • GPUFindr Community — Live price comparison across CoreWeave, Lambda, RunPod, Vast.ai, and other GPU clouds; free API and MCP server.
  • NodeHawk Community — Price aggregator scanning 50+ clouds to surface the cheapest GPU deals.

Kubernetes GPU Platforms

Managed Kubernetes services and schedulers purpose-built for GPU workloads.

  • CoreWeave Kubernetes Service Official — Kubernetes-native GPU cloud with bare-metal NVIDIA H100, H200, B200, and GB200 nodes.
  • RunPod Official — GPU cloud with Kubernetes support, virtual kubelet integrations, and on-demand Pods/Serverless/Clusters.
  • Nebius AI Cloud Official — Full-stack AI cloud with managed Kubernetes and Slurm, H100/H200/B200/GB200 clusters, and InfiniBand.
  • Akamai Cloud GPU Official — GPU instances and managed Kubernetes (LKE) nodes with RTX PRO 6000 Blackwell and L40S GPUs.
  • Crusoe Cloud Managed Kubernetes Official — Managed Kubernetes for GPU and CPU workloads with InfiniBand-backed clusters.
  • NVIDIA Run:ai Official — Kubernetes-native GPU orchestration and workload scheduler for maximizing AI compute utilization.
  • SUNK (Slurm on Kubernetes) Official — CoreWeave's Slurm-on-Kubernetes solution for running HPC and AI training on the same cluster.

Budget / Consumer GPU Cloud

Low-cost options that prioritize price-per-FLOP, often using consumer GPUs.

  • Vast.ai Community — Marketplace with consumer RTX 3090/4090 and data-center GPUs at market-driven prices.
    • Pricing: RTX 3090 from ~$0.07/hr, RTX 4090 from ~$0.27/hr.
  • SaladCloud Community — Distributed cloud using consumer gaming GPUs across 450,000+ providers.
    • Pricing: RTX 4090 from ~$0.16/hr, H100 from ~$0.99/hr.
  • JarvisLabs Community — Per-minute billing GPU cloud for individual developers and small teams.
    • Pricing: RTX 3090 from ~$0.29/hr, H100 from ~$2.69/hr.
  • Paperspace Official — Managed GPU notebooks and instances from RTX 4090 to A100/H100.
  • TensorDock Community — Marketplace with consumer and data-center GPUs; full VM control.
  • DataCrunch / Verda Official — Low-cost GPU instances and clusters in Iceland/Finland.
    • Pricing: H100 from ~$2.29/hr.
  • Hyperstack Official — EU-focused GPU cloud with H100, A100, and L40S instances; 100% renewable energy.

Enterprise HPC Cloud

GPU-first clouds built for large-scale training, HPC, and production AI at scale.

  • CoreWeave Official — AI hyperscaler with Kubernetes-native H100/H200/B200/GB200/GB300 clusters and InfiniBand.
    • Customers include OpenAI, Mistral AI, and Jane Street.
  • Lambda Official — GPU cloud and superclusters for AI training/inference; 1-Click Clusters and Lambda Stack.
    • Pricing: H100 SXM from ~$3.29/hr, A100 SXM from ~$1.29/hr.
  • NVIDIA DGX Cloud Official — AI supercomputing-as-a-service with DGX systems, NIM/NeMo, and NVIDIA Cloud Partners.
  • NVIDIA DGX Cloud Lepton Official — Compute marketplace connecting developers to tens of thousands of GPUs across global cloud partners.
  • FluidStack Official — Exascale GPU clusters for frontier AI training, deployed in 48 hours across global data centers.
  • Crusoe Energy Official — AI cloud powered by stranded and renewable energy; H100/H200/GB200 and managed inference.
  • GMI Cloud Official — GPU cloud with H100, H200, B200, and GB200 clusters and managed Kubernetes.
  • Nscale Official — GPU cloud built for large-scale AI training and sovereign AI deployments.
  • Voltage Park Official — Bare-metal GPU cloud with H100 clusters for training and inference.

AI Training Platforms

Managed platforms that streamline distributed training, fine-tuning, and model development.

  • Anyscale Official — Managed Ray platform for distributed training, batch inference, and serving across clouds.
  • Together AI GPU Clusters Official — Self-serve H100/H200/B200/GB200 clusters with InfiniBand and Kubernetes/Slurm orchestration.
  • Databricks Mosaic AI Official — Unified platform for fine-tuning, training, and serving LLMs inside the Databricks Lakehouse.
  • MosaicML Foundry Community — Open-source training framework for LLMs from 125M to 70B+ parameters.
  • Hugging Face Training Cluster as a Service Community — On-demand GPU clusters integrated with Hugging Face tools and datasets.
  • Weights & Biases Official — MLOps platform for experiment tracking, model registry, and performance monitoring for GPU training runs.

Container Registries / Images

Registries and pre-built images for GPU-accelerated AI workloads.

  • NVIDIA NGC Catalog Official — Curated GPU-optimized containers, models, SDKs, and Helm charts for PyTorch, TensorFlow, TensorRT, CUDA, and more.
  • Docker Hub Official — Public container registry hosting official GPU-enabled images for CUDA, PyTorch, TensorFlow, and community ML projects.
  • GitHub Container Registry Official — Container and artifact hosting tightly integrated with GitHub Actions and repositories.
  • Amazon ECR Official — Fully managed container registry for storing and deploying GPU container images on AWS.
  • Google Artifact Registry Official — Managed artifact and container registry for Google Cloud GPU/AI workloads.
  • Red Hat Quay Official — Enterprise container registry with vulnerability scanning and geo-replication.

Related Awesome Lists


Notes

  • Pricing is indicative: Rates change frequently and vary by region, GPU SKU (PCIe vs SXM), contract length, and spot/interruptible availability. Always verify current pricing on the provider's site.
  • Community vs. Official: Official denotes the provider's own service; Community / Marketplace denotes peer-to-peer, aggregator, or open-source/community-maintained resources.
  • Reliability trade-offs: Marketplace and consumer-GPU options can offer dramatic cost savings, but may have weaker SLAs, variable networking, and preemptible instances. Use them for fault-tolerant batch work and keep production APIs on SLA-backed platforms.
  • Multi-node training: For large distributed training jobs, prioritize providers with NVLink/NVSwitch, InfiniBand/RoCE, and RDMA support (CoreWeave, Lambda, Nebius, Crusoe, AWS P5 UltraClusters, etc.).
  • Updates welcome: If you find a new GPU cloud provider or category, see CONTRIBUTING.md and open a pull request.

Contributing

Read CONTRIBUTING.md for the quality bar, entry format, and PR process.


License

This list is released into the public domain under CC0-1.0.

Want us to build this for you?

Enterprise AI Atlas is maintained by Vibe Coding Agency. We prototype and ship agentic systems, MCP servers, and enterprise AI integrations for teams that need working software fast — without hiring a full AI engineering team.

Free guide: The Non-Technical Founder's Guide to Agentic AI — what agents and MCP servers are, and how to get a system built.

About

Curated GPU cloud providers, platforms, and services for AI and accelerated computing workloads.

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages