Commit 1ba0d93

docs(gpu): add doc blackwell vs hopper (#5930)
* docs(gpu): add doc blackwell vs hopper
* fix(gpu): fix link
* docs(gpu): fix link
1 parent 3bac255 commit 1ba0d93

File tree

5 files changed: +75 −9 lines


pages/gpu/menu.ts

Lines changed: 4 additions & 0 deletions
@@ -62,6 +62,10 @@ export const gpuMenu = {
      label: 'Choosing the right GPU Instance type',
      slug: 'choosing-gpu-instance-type',
    },
+   {
+     label: 'Blackwell vs Hopper - Choosing the right NVIDIA GPU architecture',
+     slug: 'blackwell-vs-hopper-choosing-the-right-architecture',
+   },
    {
      label:
        'GPU Instances internet and Block Storage bandwidth overview',
Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
---
title: Blackwell vs Hopper - Choosing the right NVIDIA GPU architecture
description: This page provides information about the NVIDIA Blackwell and Hopper GPU architectures.
tags: NVIDIA GPU cloud instance
dates:
  validation: 2025-12-05
  posted: 2025-12-05
---

A GPU architecture defines the underlying design of NVIDIA’s Graphics Processing Units (GPUs), optimized for accelerating AI training, inference, and high-performance computing (HPC) workloads.

* **[Blackwell](https://www.nvidia.com/en/us/data-center/technologies/blackwell-architecture/)**, announced in 2024 and shipping in late 2025, is the newest evolution, featuring a dual-die design in GPUs such as those powering Scaleway's [B300-SXM GPU Instances](https://www.scaleway.com/en/b300-sxm/). Engineered for trillion-parameter AI at unprecedented scale, Blackwell pushes the boundaries of performance and efficiency.
* **[Hopper](https://www.nvidia.com/en/us/data-center/technologies/hopper-architecture/)**, introduced in 2022, powers flagship data center GPUs such as the H100. Available in multiple configurations, including Scaleway's [H100-SXM GPU Instances](https://www.scaleway.com/en/h100/), it excels at mixed-precision computing for large language models (LLMs) and general-purpose AI.

Choosing between Blackwell and Hopper ultimately depends on your workload’s requirements for performance, memory capacity, precision, and cost-efficiency.

## B300-SXM Instances: The specialized Instance for frontier AI

Frontier AI refers to the most advanced AI models available: those that can match or even exceed human performance across a wide range of tasks, and that require massive computing performance. Scaleway’s [B300-SXM GPU Instances](https://www.scaleway.com/en/b300-sxm/), powered by the **Blackwell Ultra architecture**, are engineered for the new era of AI reasoning and trillion-parameter models.

Launched by Scaleway during the [AI pulse event in December 2025](https://www.scaleway.com/en/news/scaleway-announces-at-ai-pulse-major-advancements-in-ai-model-accessibility-new-compute-capabilities-and-expansion-of-its-presence-across-europe/), the B300-SXM marks the current pinnacle of data center AI performance, making it the preferred platform for hyperscale AI factories running massive language models, long-context reasoning, and high-throughput inference.

The B300 delivers exceptional memory capacity and bandwidth:
* 288 GB of faster HBM3e memory: more than 3.5× the 80 GB of HBM3 on the H100-SXM
* Up to 7.7 TB/s of memory bandwidth: more than double what the H100-SXM's HBM3 provides

This massive capacity enables entire 1-trillion-parameter models, huge batch sizes, and ultra-long context windows (*up to 1 million+ tokens*) to reside on just a few GPUs, reducing the need for complex multi-node communication and drastically lowering inter-node overhead.
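
To get an intuition for this capacity math, here is a minimal back-of-the-envelope sketch. It only counts model weights; the parameter count, bytes-per-parameter values, and per-GPU memory figures are illustrative assumptions, and a real deployment also needs room for the KV cache, activations, and framework overhead.

```python
import math

def gpus_for_weights(params_billion: float, bytes_per_param: float, gpu_mem_gb: float) -> int:
    """Estimate how many GPUs are needed just to hold the model weights."""
    weights_gb = params_billion * 1e9 * bytes_per_param / 1e9  # bytes -> GB
    return math.ceil(weights_gb / gpu_mem_gb)

# Hypothetical 1-trillion-parameter model quantized to FP4 (0.5 bytes per parameter):
print(gpus_for_weights(1000, 0.5, 288))  # 288 GB HBM3e (B300-class): 2 GPUs for the weights alone
print(gpus_for_weights(1000, 0.5, 80))   # 80 GB HBM3 (H100-SXM): 7 GPUs for the weights alone
```

Even under these optimistic assumptions, the larger per-GPU memory pool is what lets a small number of GPUs keep the whole model resident instead of spreading it across many interconnected nodes.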
30+
Equipped with **fifth-generation Tensor Cores**, the B300 introduces native hardware support for FP4 and FP6 precision, a major advancement over Hopper. On H100, FP4 operations are emulated using INT8 arithmetic, which limits efficiency and real-world performance. In contrast, Blackwell’s Tensor Cores process FP4 natively, unlocking significantly higher throughput and energy efficiency for ultra-low-precision AI workloads.
31+
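
To make "FP4 precision" concrete, the sketch below rounds weights to the small value grid of a 4-bit floating-point format (E2M1: one sign bit, two exponent bits, one mantissa bit). This is purely illustrative: real FP4 inference pipelines add per-block scaling factors and calibration, and the hardware details differ, but it shows how coarse 4-bit values are and why native hardware support matters at this precision.

```python
# Illustrative only: snap values to the FP4 (E2M1) magnitude grid, ignoring
# the per-block scale factors that real low-precision pipelines rely on.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_fp4(x: float) -> float:
    """Round a value to the nearest representable FP4 (E2M1) value."""
    sign = -1.0 if x < 0 else 1.0
    magnitude = min(FP4_GRID, key=lambda g: abs(abs(x) - g))
    return sign * magnitude

weights = [0.07, -0.8, 1.3, 2.4, -5.1, 9.0]
print([quantize_fp4(w) for w in weights])  # [0.0, -1.0, 1.5, 2.0, -6.0, 6.0]
```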

Combined with an enhanced **second-generation Transformer Engine**, these improvements enable the B300 to deliver up to **15 PFLOPS of dense FP4 performance**, achieving multiple times higher inference throughput than the H100-SXM on modern reasoning workloads such as DeepSeek-R1 and Llama 3.1 405B+.
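
As a rough sense of scale, comparing only the peak figures quoted on this page, 15 PFLOPS of dense FP4 against roughly 2 PFLOPS of FP16 Tensor Core throughput on the H100-SXM is about a 7.5× gap in raw compute, before memory bandwidth, batch sizes, and software efficiency narrow the difference in practice.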

Compared to the B200, the B300 delivers higher FP4 performance at the expense of FP64 performance. As a result, the B300 delivers lower FP64 performance than the H100-SXM, making it less well-suited for traditional scientific computing and HPC simulations that rely on high-precision arithmetic.

With its combination of vast memory and ultra-efficient low-precision compute, the B300 shines in “*big-AI*” scenarios, including:
* Training and fine-tuning trillion-parameter dense or Mixture-of-Experts (MoE) models
* Real-time, high-throughput inference at scale
* Retrieval-Augmented Generation (RAG) with massive context
* AI reasoning pipelines requiring extended token sequences

The architecture's strong focus on frontier AI means the B300 is *not optimized for HPC workloads* such as computational physics, quantum chemistry, or climate modeling, domains where FP64 accuracy is critical. In these cases, performance may be inferior to the H100, despite Blackwell’s generational leap in AI capabilities.

Moreover, the B300’s extreme capabilities make it over-provisioned for smaller or mid-sized models (≤70B parameters), prototyping, or general-purpose AI tasks. For these use cases, Scaleway’s H100-SXM GPU Instances remain a more economical and practical choice.

## NVIDIA H100-SXM: The reliable standard for AI and HPC

Scaleway’s [H100-SXM GPU Instances](https://www.scaleway.com/en/h100/), built on the 2022 Hopper architecture, are based on the most widely adopted and battle-tested data center GPU. With several years of production deployment, the H100 remains the industry standard, offering a robust balance of AI acceleration, high-precision computing, and broad software compatibility across cloud providers and supercomputing environments.

Its maturity ensures unmatched stability and predictability. Drivers, frameworks (PyTorch, JAX, TensorFlow), and ecosystem tooling are fully optimized, making the H100-SXM the default choice for:
- Open-source model development
- Enterprise AI pipelines
- Scientific research and academic workloads

Powered by **fourth-generation Tensor Cores** and the **first-generation Transformer Engine**, the H100 supports automatic mixed precision (FP8, FP16, BF16, TF32), delivering up to 1,979 TFLOPS of FP16 Tensor Core performance.
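
As an illustration of how this mixed-precision support is typically consumed from software, here is a minimal PyTorch sketch using automatic mixed precision. The model and tensor shapes are placeholders, and FP8 usually goes through additional libraries (such as NVIDIA's Transformer Engine package) rather than plain autocast, so this only shows the BF16 path.

```python
import torch

# Placeholder model and data; any nn.Module works the same way.
model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(32, 1024, device="cuda")

# Autocast runs eligible ops in BF16 on Hopper-class GPUs while keeping
# numerically sensitive ops in FP32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    y = model(x)

print(y.dtype)  # torch.bfloat16
```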

While the H100 can perform FP4-like operations, it does so via software emulation using INT8, which is *less efficient* and *less accurate* than true FP4 computation. This limits its peak performance and efficiency in low-precision scenarios compared to Blackwell.

Crucially, the H100 maintains strong FP64 performance, a key advantage for legacy HPC, scientific simulations, and engineering workloads where double-precision accuracy is essential. This makes Hopper a true **dual-use architecture**, capable of excelling in both AI and traditional HPC.

Additional features enhance flexibility and efficiency: [NVLink 4.0](/gpu/reference-content/understanding-nvidia-nvlink/) enables 900 GB/s of GPU-to-GPU bandwidth, and [Multi-Instance GPU (MIG)](/gpu/how-to/use-nvidia-mig-technology/) allows secure, isolated workloads to run on a single GPU, which is ideal for Kubernetes cloud environments.

Scaleway’s H100-SXM Instances offer the best cost-performance ratio for most applications, including fine-tuning 7B–70B parameter models, running large-scale inference or RAG pipelines, and powering computer vision and speech processing workloads.

That said, the 80 GB of HBM3 memory can become a bottleneck for models exceeding 400 billion parameters or when processing very long contexts. In such cases, advanced techniques like model parallelism or offloading are often required, as sketched below.
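
As a hedged illustration of the offloading approach (not an official Scaleway or NVIDIA recipe), Hugging Face Transformers with Accelerate can spread a model across the available GPUs and spill the remainder to CPU memory when it does not fit on one card. The model name below is a placeholder; substitute the checkpoint you actually use and adjust dtypes and memory limits to your Instance.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-large-model"  # placeholder checkpoint

# device_map="auto" (via Accelerate) splits layers across available GPUs
# and offloads anything that does not fit to CPU RAM.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Hello from an H100-SXM Instance", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```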
