Comparing Ascend 910B and NVIDIA H100 | Generated by AI


Question: How powerful is a cluster of 8,192 Huawei Ascend 910B NPUs, and how does it compare to a cluster of 8,192 NVIDIA H100 GPUs?

Answer:

Single-Chip Comparison: Ascend 910B vs NVIDIA H100

First, let’s establish the baseline for one chip:

| Metric | Ascend 910B | NVIDIA H100 |
|---|---|---|
| FP16 Compute | ~320 TFLOPS | ~989 TFLOPS |
| Memory | 64 GB HBM2e | 80 GB HBM3 |
| Memory Bandwidth | ~400–1,200 GB/s (varies by variant) | ~3,350 GB/s |
| Process Node | SMIC 7nm (N+1) | TSMC 4nm |
| TDP | ~400 W | ~700 W |

The Ascend 910B has an FP16 computing power of approximately 320 TFLOPS and INT8 computing power of about 640 TOPS, with 64 GB of HBM2e memory and a memory bandwidth of 400 GB/s. The 910B3 variant introduced HBM3e memory with 1.2 TB/s bandwidth.

NVIDIA’s H100 delivers 989.5 dense FP16 Tensor Core TFLOPS, roughly 3× the A100’s 312 TFLOPS.

In short, one H100 is roughly 3× more powerful than one 910B in raw FP16 compute.
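
The ~3× figure is just the ratio of the two peak numbers from the table; a minimal sanity check:

```python
# Back-of-envelope check of the per-chip FP16 gap, using the table values above.
H100_FP16_TFLOPS = 989.5        # dense Tensor Core FP16
ASCEND_910B_FP16_TFLOPS = 320.0

ratio = H100_FP16_TFLOPS / ASCEND_910B_FP16_TFLOPS
print(f"One H100 ~ {ratio:.1f}x one 910B in peak FP16")  # ~3.1x
```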


8,192-Chip Cluster: Raw Numbers

| Metric | 8,192 × Ascend 910B | 8,192 × NVIDIA H100 |
|---|---|---|
| Total FP16 Compute | ~2,621 PFLOPS (~2.6 EFLOPS) | ~8,106 PFLOPS (~8.1 EFLOPS) |
| Total Memory | ~524 TB | ~655 TB |

So in raw theoretical peak FP16 compute, the H100 cluster delivers roughly 3× the throughput of the 910B cluster (~8.1 vs ~2.6 EFLOPS).
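
These totals are straight multiplication of the per-chip peaks; a small sketch reproducing the table (ignoring any scaling losses):

```python
# Cluster totals are simple multiplication of per-chip peaks; this ignores
# interconnect, utilization, and scaling losses entirely.
N = 8192
chips = {
    "Ascend 910B": {"fp16_tflops": 320.0, "mem_gb": 64},
    "NVIDIA H100": {"fp16_tflops": 989.5, "mem_gb": 80},
}

for name, c in chips.items():
    total_eflops = N * c["fp16_tflops"] / 1e6  # TFLOPS -> EFLOPS
    total_tb = N * c["mem_gb"] / 1e3           # GB -> TB
    print(f"{name}: ~{total_eflops:.1f} EFLOPS FP16, ~{total_tb:.0f} TB HBM")

# Ascend 910B: ~2.6 EFLOPS FP16, ~524 TB HBM
# NVIDIA H100: ~8.1 EFLOPS FP16, ~655 TB HBM
```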


Huawei’s Atlas 950 SuperPoD: The Real 8,192-Chip System

Interestingly, Huawei is building exactly this scale of system. Huawei plans to launch the Atlas 950 SuperPoD, linking 8,192 Ascend chips to deliver 8 EFLOPS of FP8 performance, backed by 1,152 TB of memory and 16.3 PB/s of interconnect bandwidth. Note that the 8 EFLOPS figure is for FP8 precision (lower precision, commonly used in inference), not the FP16 figure used above — so it’s not directly comparable to the H100’s FP16 numbers.
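
To put the FP8 figure on a roughly comparable footing, one can normalize it under the common assumption that FP8 peak throughput is about 2× FP16 peak; that ratio is an assumption here, since Huawei has not published an FP16 figure for this system:

```python
# Hedged normalization of the Atlas 950's 8 EFLOPS FP8 claim to FP16-equivalent.
# ASSUMPTION: FP8 peak is ~2x FP16 peak, as on most recent accelerators;
# Huawei has not published an FP16 figure for this system.
atlas950_fp8_eflops = 8.0
fp8_to_fp16_ratio = 2.0  # assumed, not an official Huawei number

fp16_equiv_eflops = atlas950_fp8_eflops / fp8_to_fp16_ratio
print(f"Atlas 950: ~{fp16_equiv_eflops:.0f} EFLOPS FP16-equivalent")
# ~4 EFLOPS FP16-equivalent -- about half the ~8.1 EFLOPS of 8,192 H100s.
```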


Why It’s Not Just About Raw TFLOPS

Beyond raw compute, several factors widen the gap further:

1. Memory Bandwidth Gap. Even the newer 910C’s bandwidth (~1,800 GB/s vs the H100’s ~3,350 GB/s) is the primary limiter for the autoregressive decode phase of LLM inference, which is dominated by memory reads; the 910B is lower still at ~400 GB/s, a significant disadvantage. See the roofline sketch after this list.

2. Interconnect. The 910B’s HCCS interconnect offers 392 GB/s total bandwidth across an 8-card module, broadly comparable to the A800’s 400 GB/s NVLink. NVIDIA’s fourth-generation NVLink on the H100, however, provides 900 GB/s per GPU, enabling much more efficient multi-GPU scaling.

3. Software Ecosystem (CUDA vs CANN). Working with Ascend 910B chips still often means debugging without meaningful community support. Public models generally require deep optimization by Huawei before they run well on the CANN stack, and that optimization process is heavily dependent on Huawei and progresses slowly. As a result, real-world utilization of 910B clusters is lower.

4. Training Reliability. Long-term training reliability remains a critical weakness of Chinese processors. The gap reflects the maturity of Nvidia’s deeply integrated hardware and software ecosystem, refined over two decades.
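
Why bandwidth dominates decode (point 1): generating one token streams essentially all model weights from HBM, so bandwidth sets a hard ceiling on single-batch tokens per second. A crude roofline sketch, assuming a hypothetical dense 70B-parameter FP16 model at batch size 1 (illustrative numbers, not from the article):

```python
# Roofline-style upper bound on single-batch decode speed: generating one token
# streams (roughly) all model weights from HBM once, so
#     tokens/sec <= memory_bandwidth / bytes_read_per_token
# ASSUMPTIONS (illustrative, not from the article): a hypothetical dense
# 70B-parameter model, FP16 weights (2 bytes/param), batch size 1, and no
# KV-cache or overlap effects -- a deliberately crude bound.
PARAMS = 70e9
BYTES_PER_PARAM = 2  # FP16
bytes_per_token = PARAMS * BYTES_PER_PARAM  # 140 GB streamed per token

for chip, bw_gb_s in [("Ascend 910B", 400), ("Ascend 910C", 1800), ("H100", 3350)]:
    max_tok_s = bw_gb_s * 1e9 / bytes_per_token
    print(f"{chip}: <= {max_tok_s:.1f} tokens/s per chip (bandwidth-bound)")

# Ascend 910B: <= 2.9 tokens/s
# Ascend 910C: <= 12.9 tokens/s
# H100:        <= 23.9 tokens/s
```

The bandwidth gap maps almost directly onto decode throughput, which is why it matters more than peak TFLOPS for inference.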


Verdict: Is 8,192 × 910B Equivalent to 8,192 × H100?

No, not by a significant margin.

Bottom line: You would need approximately 3× as many Ascend 910Bs (roughly 24,000 or more) to match the raw compute of 8,192 H100s, and even then real-world training performance would likely still lag due to software and interconnect limitations.
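
The “3× as many” figure falls straight out of peak-FLOPS parity arithmetic:

```python
# Peak-FLOPS parity: how many 910Bs equal 8,192 H100s on paper?
# Real-world parity would need more, given software and interconnect losses.
import math

needed = math.ceil(8192 * 989.5 / 320)
print(needed)  # 25332 -- consistent with the "roughly 24,000 or more" above
```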

