Comparing Ascend 910B and NVIDIA H100 | Generated by AI


Question: How powerful is a cluster of 8,192 Huawei Ascend 910B NPUs, and how does it compare to a cluster of 8,192 NVIDIA H100 GPUs?

Answer:

Single-Chip Comparison: Ascend 910B vs NVIDIA H100

First, let’s establish the baseline for one chip:

| Metric | Ascend 910B | NVIDIA H100 |
|---|---|---|
| FP16 Compute | ~320 TFLOPS | ~989 TFLOPS |
| Memory | 64 GB HBM2e | 80 GB HBM3 |
| Memory Bandwidth | ~400–1,200 GB/s (varies by variant) | ~3,350 GB/s |
| Process Node | SMIC 7nm (N+1) | TSMC 4nm |
| TDP | ~400 W | ~700 W |

The Ascend 910B has an FP16 computing power of approximately 320 TFLOPS and INT8 computing power of about 640 TOPS, with 64 GB of HBM2e memory and a memory bandwidth of 400 GB/s. The 910B3 variant introduced HBM3e memory with 1.2 TB/s bandwidth.

NVIDIA’s H100 delivers 989.5 dense FP16 Tensor Core TFLOPS, roughly 3× the A100’s 312 TFLOPS.

In short, one H100 is roughly 3× more powerful than one 910B in raw FP16 compute.
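
The ~3× figure is just the ratio of the two peak numbers from the table; a minimal sanity check:

```python
# Back-of-envelope check of the per-chip FP16 gap, using the table values above.
H100_FP16_TFLOPS = 989.5        # dense Tensor Core FP16
ASCEND_910B_FP16_TFLOPS = 320.0

ratio = H100_FP16_TFLOPS / ASCEND_910B_FP16_TFLOPS
print(f"One H100 ~ {ratio:.1f}x one 910B in peak FP16")  # ~3.1x
```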


8,192-Chip Cluster: Raw Numbers

| Metric | 8,192 × Ascend 910B | 8,192 × NVIDIA H100 |
|---|---|---|
| Total FP16 Compute | ~2,621 PFLOPS (~2.6 EFLOPS) | ~8,106 PFLOPS (~8.1 EFLOPS) |
| Total Memory | ~524 TB | ~655 TB |

So in raw theoretical peak FP16 compute, the H100 cluster delivers roughly 3× the throughput of the 910B cluster (~8.1 vs ~2.6 EFLOPS).
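
These totals are straight multiplication of the per-chip peaks; a small sketch reproducing the table (ignoring any scaling losses):

```python
# Cluster totals are simple multiplication of per-chip peaks; this ignores
# interconnect, utilization, and scaling losses entirely.
N = 8192
chips = {
    "Ascend 910B": {"fp16_tflops": 320.0, "mem_gb": 64},
    "NVIDIA H100": {"fp16_tflops": 989.5, "mem_gb": 80},
}

for name, c in chips.items():
    total_eflops = N * c["fp16_tflops"] / 1e6  # TFLOPS -> EFLOPS
    total_tb = N * c["mem_gb"] / 1e3           # GB -> TB
    print(f"{name}: ~{total_eflops:.1f} EFLOPS FP16, ~{total_tb:.0f} TB HBM")

# Ascend 910B: ~2.6 EFLOPS FP16, ~524 TB HBM
# NVIDIA H100: ~8.1 EFLOPS FP16, ~655 TB HBM
```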


Huawei’s Atlas 950 SuperPoD: The Real 8,192-Chip System

Interestingly, Huawei is building exactly this scale of system. Huawei plans to launch the Atlas 950 SuperPoD, linking 8,192 Ascend chips to deliver 8 EFLOPS of FP8 performance, backed by 1,152 TB of memory and 16.3 PB/s of interconnect bandwidth. Note that the 8 EFLOPS figure is for FP8 precision (lower precision, commonly used in inference), not the FP16 figure used above — so it’s not directly comparable to the H100’s FP16 numbers.
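
To put the FP8 figure on a roughly comparable footing, one can normalize it under the common assumption that FP8 peak throughput is about 2× FP16 peak; that ratio is an assumption here, since Huawei has not published an FP16 figure for this system:

```python
# Hedged normalization of the Atlas 950's 8 EFLOPS FP8 claim to FP16-equivalent.
# ASSUMPTION: FP8 peak is ~2x FP16 peak, as on most recent accelerators;
# Huawei has not published an FP16 figure for this system.
atlas950_fp8_eflops = 8.0
fp8_to_fp16_ratio = 2.0  # assumed, not an official Huawei number

fp16_equiv_eflops = atlas950_fp8_eflops / fp8_to_fp16_ratio
print(f"Atlas 950: ~{fp16_equiv_eflops:.0f} EFLOPS FP16-equivalent")
# ~4 EFLOPS FP16-equivalent -- about half the ~8.1 EFLOPS of 8,192 H100s.
```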


Why It’s Not Just About Raw TFLOPS

Beyond raw compute, several factors widen the gap further:

1. Memory Bandwidth Gap. Even the newer 910C’s bandwidth (~1,800 GB/s vs the H100’s ~3,350 GB/s) is the primary limiter for the autoregressive decode phase of LLM inference, which is dominated by memory reads; the 910B is lower still at ~400 GB/s, a significant disadvantage. See the roofline sketch after this list.

2. Interconnect. The 910B’s HCCS interconnect offers 392 GB/s total bandwidth across an 8-card module, broadly comparable to the A800’s 400 GB/s NVLink. NVIDIA’s fourth-generation NVLink on the H100, however, provides 900 GB/s per GPU, enabling much more efficient multi-GPU scaling.

3. Software Ecosystem (CUDA vs CANN). Working with Ascend 910B chips still often means debugging without meaningful community support. Public models generally require deep optimization by Huawei before they run well on the CANN stack, and that optimization process is heavily dependent on Huawei and progresses slowly. As a result, real-world utilization of 910B clusters is lower.

4. Training Reliability. Long-term training reliability remains a critical weakness of Chinese processors. The gap reflects the maturity of Nvidia’s deeply integrated hardware and software ecosystem, refined over two decades.
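
Why bandwidth dominates decode (point 1): generating one token streams essentially all model weights from HBM, so bandwidth sets a hard ceiling on single-batch tokens per second. A crude roofline sketch, assuming a hypothetical dense 70B-parameter FP16 model at batch size 1 (illustrative numbers, not from the article):

```python
# Roofline-style upper bound on single-batch decode speed: generating one token
# streams (roughly) all model weights from HBM once, so
#     tokens/sec <= memory_bandwidth / bytes_read_per_token
# ASSUMPTIONS (illustrative, not from the article): a hypothetical dense
# 70B-parameter model, FP16 weights (2 bytes/param), batch size 1, and no
# KV-cache or overlap effects -- a deliberately crude bound.
PARAMS = 70e9
BYTES_PER_PARAM = 2  # FP16
bytes_per_token = PARAMS * BYTES_PER_PARAM  # 140 GB streamed per token

for chip, bw_gb_s in [("Ascend 910B", 400), ("Ascend 910C", 1800), ("H100", 3350)]:
    max_tok_s = bw_gb_s * 1e9 / bytes_per_token
    print(f"{chip}: <= {max_tok_s:.1f} tokens/s per chip (bandwidth-bound)")

# Ascend 910B: <= 2.9 tokens/s
# Ascend 910C: <= 12.9 tokens/s
# H100:        <= 23.9 tokens/s
```

The bandwidth gap maps almost directly onto decode throughput, which is why it matters more than peak TFLOPS for inference.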


Verdict: Is 8,192 × 910B Equivalent to 8,192 × H100?

No, not by a significant margin.

Bottom line: You would need approximately 3× as many Ascend 910Bs (roughly 24,000 or more) to match the raw compute of 8,192 H100s, and even then real-world training performance would likely still lag due to software and interconnect limitations.
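
The “3× as many” figure falls straight out of peak-FLOPS parity arithmetic:

```python
# Peak-FLOPS parity: how many 910Bs equal 8,192 H100s on paper?
# Real-world parity would need more, given software and interconnect losses.
import math

needed = math.ceil(8192 * 989.5 / 320)
print(needed)  # 25332 -- consistent with the "roughly 24,000 or more" above
```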

