PyTorch FP8 Support Current State | Generated by AI

Home 2026.06

Question: How’s PyTorch’s support of FP8?

Answer:

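PyTorch has shipped native FP8 dtypes since version 2.1, but support is still maturing: most operators do not accept FP8 tensors directly, and the training workflow lives in the separate torchao library. Here’s the current state: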

What’s Available

PyTorch supports FP8 through torch.float8_e4m3fn and torch.float8_e5m2 dtypes. The key APIs:

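import torch

x = torch.randn(4, 4)  # example input tensor

# Cast to FP8
x_e4m3 = x.to(torch.float8_e4m3fn)  # E4M3: higher precision, good for weights/activations
x_e5m2 = x.to(torch.float8_e5m2)    # E5M2: wider range, good for gradients

# Check hardware support: FP8 Tensor Core matmuls need compute capability 8.9 (Ada) or 9.0 (Hopper)
if torch.cuda.is_available():
    print(torch.cuda.get_device_capability() >= (8, 9))  # True on RTX 4070 (Ada, sm_89)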
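
To make the precision/range trade-off between the two formats concrete, the following minimal sketch derives the largest finite value and the relative step size of each format from its bit layout. It is plain Python rather than a PyTorch API: the helper name fp8_limits is made up for illustration, and the layouts assumed are E4M3FN (4 exponent bits, 3 mantissa bits, bias 7, no infinities) and E5M2 (5 exponent bits, 2 mantissa bits, bias 15, IEEE-style infinities).

```python
def fp8_limits(exp_bits: int, man_bits: int, bias: int, ieee_specials: bool):
    """Return (max_finite, epsilon) for a small floating-point format.

    Hypothetical helper for illustration; not part of PyTorch.
    """
    top_exp_field = (1 << exp_bits) - 1
    top_man_field = (1 << man_bits) - 1
    if ieee_specials:
        # IEEE-style (E5M2): the all-ones exponent is reserved for inf/NaN,
        # so the largest finite value uses the exponent field just below it.
        exp_field, man_field = top_exp_field - 1, top_man_field
    else:
        # "fn" style (E4M3FN): only exponent=all-ones with mantissa=all-ones
        # is NaN, so the largest finite value drops one mantissa step instead.
        exp_field, man_field = top_exp_field, top_man_field - 1
    max_finite = 2.0 ** (exp_field - bias) * (1 + man_field / (1 << man_bits))
    epsilon = 2.0 ** -man_bits  # gap between 1.0 and the next representable value
    return max_finite, epsilon


e4m3_max, e4m3_eps = fp8_limits(4, 3, bias=7, ieee_specials=False)
e5m2_max, e5m2_eps = fp8_limits(5, 2, bias=15, ieee_specials=True)
print(e4m3_max, e4m3_eps)  # 448.0 0.125
print(e5m2_max, e5m2_eps)  # 57344.0 0.25
```

These numbers should agree with what torch.finfo reports for the two dtypes. E4M3 resolves values twice as finely, while E5M2 reaches 128 times further, which is why the former suits weights and activations and the latter suits gradients.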

torchao.float8 (formerly float8_experimental)

The main FP8 training path is torchao.float8, which absorbed the standalone float8_experimental project. It was never a torch.* submodule:

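import torch
from torchao.float8 import convert_to_float8_training

# Convert model to FP8: swaps eligible nn.Linear layers for FP8-aware versions
model = convert_to_float8_training(model)

# torch.autocast does not accept FP8 dtypes (CUDA autocast supports only
# torch.float16 and torch.bfloat16), so there is no FP8 autocast context manager.
# BF16 autocast can still wrap the converted model for the remaining ops:
with torch.autocast("cuda", dtype=torch.bfloat16):
    output = model(input)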

FP8 Training with torchao

The most practical path today:

pip install torchao

import torch
from torchao.float8 import convert_to_float8_training

model = YourModel().cuda()  # FP8 matmuls run on the GPU
model = convert_to_float8_training(model)

# Training loop works as normal
optimizer = torch.optim.AdamW(model.parameters())
for batch in dataloader:
    optimizer.zero_grad()  # clear gradients left over from the previous step
    loss = model(batch).sum()
    loss.backward()
    optimizer.step()

Current Limitations

Hardware support: FP8 Tensor Core matmuls require compute capability 8.9 or higher, which means Ada Lovelace (the whole RTX 40 series, including both the RTX 4070 and the RTX 4090) or Hopper (H100). The RTX 4070 has the required Tensor Cores, but PyTorch’s FP8 kernels aren’t fully optimized for it yet.
Kernel maturity: The FP8 kernels are primarily tested on H100 and RTX 4090. Performance on RTX 4070 is inconsistent.
Gradient accumulation: FP8 gradients need careful handling. Keeping gradients in FP16/BF16 under mixed precision is more stable.
torch.compile compatibility: FP8 works better with torch.compile(), which fuses the casting and scaling kernels, but compilation overhead can negate the gains on small models.

What Works Well Today

For your RTX 4070 + nanoGPT setup:

# Best approach: BF16 mixed precision
# BF16 has the same exponent range as FP32, so no GradScaler is needed
# (loss scaling is for FP16).
with torch.amp.autocast("cuda", dtype=torch.bfloat16):
    output = model(input)
    loss = criterion(output, target)
loss.backward()
optimizer.step()
optimizer.zero_grad()

The Real Answer

PyTorch’s FP8 support is production-ready on H100, experimental on RTX 4090, and unreliable on RTX 4070. The ecosystem is moving fast, and torchao (which now hosts the former float8_experimental code) is the place to watch.

For your current workload: stick with BF16. FP8 on RTX 4070 isn’t worth the debugging time yet.
