Open-Source vs Closed-Source AI Hardware

Generated by AI

2026.05

Question: I’m diving into open-source chip software (AMD ROCm as the open-source CUDA alternative). I want to understand the gap between open-source and closed-source in the AI value chain. What should I know next?

Answer:

Good instinct. Here’s the map — from where you are to where it gets interesting.

The AI Value Chain: Open vs Closed Gap (honest assessment)

Layer	Open Source	Closed Source	Gap
ML Frameworks	PyTorch, JAX	none	None: the standard frameworks are already open source
Compiler/IR	MLIR, XLA, Triton	NVCC back end (ptxas), TensorRT	Medium: Triton is closing it fast
Runtime	ROCm HIP, oneAPI	CUDA runtime	Medium: ROCm covers most of the CUDA runtime API
Kernel libs	rocBLAS, MIOpen	cuBLAS, cuDNN	Significant: NVIDIA still wins on performance
Hardware uArch	RISC-V cores (VexRiscv, CVA6)	NVIDIA SM design	Huge: GPU uArch docs are still proprietary
Driver	`amdgpu` (in-kernel)	NVIDIA open kernel modules (user space still closed)	AMD actually wins here

Verdict: Software stack gap is closeable. Hardware uArch gap is the moat NVIDIA actually has.

What to Actually Study Next (ordered)

1. HIP + ROCm internals (you’re here)

# Confirm your ROCm install sees the GPU
# (Agent 1 is usually the CPU, so on a single-CPU machine the first GPU is typically Agent 2)
rocminfo | grep -A5 "Agent 2"
hipcc --version

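# Hello world in HIP: vector add
cat << 'EOF' > vec_add.hip
#include <hip/hip_runtime.h>
#include <stdio.h>
#include <vector>

__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    std::vector<float> h_a(n, 1.0f), h_b(n, 2.0f), h_c(n);
    float *d_a, *d_b, *d_c;
    hipMalloc(&d_a, bytes); hipMalloc(&d_b, bytes); hipMalloc(&d_c, bytes);
    // Copy the inputs to the device before launching
    hipMemcpy(d_a, h_a.data(), bytes, hipMemcpyHostToDevice);
    hipMemcpy(d_b, h_b.data(), bytes, hipMemcpyHostToDevice);
    // Round the grid up so every element is covered even if n % 256 != 0
    hipLaunchKernelGGL(vecAdd, dim3((n + 255) / 256), dim3(256), 0, 0, d_a, d_b, d_c, n);
    // This blocking copy on the default stream waits for the kernel to finish
    hipMemcpy(h_c.data(), d_c, bytes, hipMemcpyDeviceToHost);
    printf("c[0] = %f (expected 3.0)\n", h_c[0]);
    hipFree(d_a); hipFree(d_b); hipFree(d_c);
    return 0;
}
EOF
hipcc vec_add.hip -o vec_add && ./vec_add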

CUDA → HIP translation is largely mechanical: most runtime calls map one-to-one (cudaMalloc becomes hipMalloc, and so on). hipify-clang converts CUDA → HIP automatically. This is how ROCm bootstrapped.
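To make "mechanical" concrete, here is a toy sketch of the renaming step in Python. It is not how hipify-clang works internally (that tool is built on Clang and understands the source, rather than matching text). The mapping table is a small hand-picked subset, and the name `toy_hipify` is made up for illustration.

```python
import re

# Hand-picked subset of CUDA -> HIP renames; the real tools cover far more.
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpyDeviceToHost": "hipMemcpyDeviceToHost",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

# Longest names first so cudaMemcpyHostToDevice is not matched as cudaMemcpy.
_PATTERN = re.compile(
    "|".join(re.escape(k) for k in sorted(CUDA_TO_HIP, key=len, reverse=True))
)


def toy_hipify(source: str) -> str:
    """Rename known CUDA identifiers to their HIP equivalents (hypothetical helper)."""
    return _PATTERN.sub(lambda m: CUDA_TO_HIP[m.group(0)], source)


cuda_src = """#include <cuda_runtime.h>
cudaMalloc(&d_a, bytes);
cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
cudaFree(d_a);
"""
hip_src = toy_hipify(cuda_src)
print(hip_src)
```

The part that is not mechanical is whatever has no one-to-one HIP counterpart, such as inline PTX or code tuned to a specific warp size. That still needs manual porting.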

2. Triton — the real open-source inflection point

This is where the action is. Triton lets you write GPU kernels in Python that compile to both NVIDIA PTX and AMD GCN-family ISAs. OpenAI open-sourced it, and it is now the default GPU code generator behind torch.compile (TorchInductor).

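import triton
import triton.language as tl

@triton.jit
def matmul_kernel(A, B, C, M, N, K, BLOCK: tl.constexpr):
    # One program computes one BLOCK x BLOCK tile of C (A, B, C are row-major).
    # Launch on a 2D grid: (cdiv(M, BLOCK), cdiv(N, BLOCK)).
    # tl.dot generally requires tiles of at least 16 x 16, so use BLOCK >= 16.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    rm = pid_m * BLOCK + tl.arange(0, BLOCK)   # rows of C handled by this program
    rn = pid_n * BLOCK + tl.arange(0, BLOCK)   # columns of C handled by this program
    rk = tl.arange(0, BLOCK)
    acc = tl.zeros((BLOCK, BLOCK), dtype=tl.float32)
    for k in range(0, K, BLOCK):
        # Masks zero-fill out-of-range elements when sizes are not multiples of BLOCK
        a_mask = (rm[:, None] < M) & (rk[None, :] + k < K)
        b_mask = (rk[:, None] + k < K) & (rn[None, :] < N)
        a = tl.load(A + rm[:, None] * K + (rk[None, :] + k), mask=a_mask, other=0.0)
        b = tl.load(B + (rk[:, None] + k) * N + rn[None, :], mask=b_mask, other=0.0)
        acc += tl.dot(a, b)
    c_mask = (rm[:, None] < M) & (rn[None, :] < N)
    tl.store(C + rm[:, None] * N + rn[None, :], acc, mask=c_mask)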

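Why this matters: fused attention kernels in the FlashAttention family, which make LLM training and inference fast, have Triton implementations that run on both vendors (the reference FlashAttention-2 itself is written in CUDA). The gap between ROCm and CUDA is shrinking fastest here.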
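The tiling idea behind a blocked matmul kernel can be checked on the CPU before touching a GPU. The sketch below is plain Python, not Triton. The function name `tiled_matmul` and the flat row-major list layout are assumptions made for illustration. Each (pid_m, pid_n) pair stands in for one GPU program, and the bounds checks stand in for load and store masks.

```python
def tiled_matmul(A, B, M, N, K, BLOCK=2):
    """Emulate block tiling on the CPU: A is M x K, B is K x N, flat row-major lists."""
    C = [0.0] * (M * N)
    grid_m = (M + BLOCK - 1) // BLOCK    # ceil(M / BLOCK) programs along rows
    grid_n = (N + BLOCK - 1) // BLOCK    # ceil(N / BLOCK) programs along columns
    for pid_m in range(grid_m):          # on a GPU these two loops run in parallel
        for pid_n in range(grid_n):
            rows = [pid_m * BLOCK + i for i in range(BLOCK)]
            cols = [pid_n * BLOCK + j for j in range(BLOCK)]
            acc = [[0.0] * BLOCK for _ in range(BLOCK)]
            for k0 in range(0, K, BLOCK):            # walk the K dimension tile by tile
                for i, r in enumerate(rows):
                    for j, c in enumerate(cols):
                        for k in range(k0, min(k0 + BLOCK, K)):
                            if r < M and c < N:      # plays the role of the load masks
                                acc[i][j] += A[r * K + k] * B[k * N + c]
            for i, r in enumerate(rows):
                for j, c in enumerate(cols):
                    if r < M and c < N:              # masked store
                        C[r * N + c] = acc[i][j]
    return C


# 3 x 3 is deliberately not a multiple of BLOCK=2, so the edge tiles are exercised.
A = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
I3 = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
result = tiled_matmul(A, I3, 3, 3, 3)
print(result)
```

Multiplying by the identity should return A unchanged. A kernel without masks would read or write out of bounds on exactly these edge tiles.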

3. MLIR + the compiler stack (where open-source is winning)

PyTorch/JAX
    ↓ torch.compile / XLA
  Triton IR / StableHLO
    ↓ MLIR passes
  LLVM IR
    ↓
  PTX (NVIDIA) / GCN ISA (AMD) / RISC-V (future)

MLIR started at Google and is now part of the LLVM project; it is their bet on a unified, multi-level compiler IR. Several AI chip startups (Tenstorrent, Groq, Cerebras) build their compilers on MLIR. This is the layer where open-source actually leads.
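Progressive lowering, the core idea behind the stack above, can be sketched without MLIR at all. The toy below does not use MLIR's API. The tuple-based IR, the op names (`fma`, `mul`, `add`), and both function names are invented for illustration. It rewrites one higher-level op into two lower-level ones, then checks that the result is unchanged.

```python
from itertools import count


def lower_fma(ops):
    """Rewrite each ('fma', dst, a, b, c) into a mul followed by an add.

    Ops are tuples (name, dst, *srcs); this IR is hypothetical, not MLIR's.
    """
    fresh = count()                      # source of unique temporary names
    lowered = []
    for op in ops:
        if op[0] == "fma":
            _, dst, a, b, c = op
            tmp = f"%t{next(fresh)}"
            lowered.append(("mul", tmp, a, b))
            lowered.append(("add", dst, tmp, c))
        else:
            lowered.append(op)           # ops already at the lower level pass through
    return lowered


def evaluate(ops, env):
    """Tiny interpreter, used to check that lowering preserves results."""
    env = dict(env)
    impl = {
        "fma": lambda a, b, c: a * b + c,
        "mul": lambda a, b: a * b,
        "add": lambda a, b: a + b,
    }
    for name, dst, *srcs in ops:
        env[dst] = impl[name](*(env[s] for s in srcs))
    return env


high_level = [("fma", "%y", "%a", "%b", "%c"), ("add", "%z", "%y", "%a")]
low_level = lower_fma(high_level)
inputs = {"%a": 2, "%b": 3, "%c": 4}
print(low_level)
print(evaluate(low_level, inputs)["%z"])
```

A real MLIR pass has the same shape (match an op, emit replacements in a lower dialect, keep semantics), but works on typed SSA values and dialects rather than tuples.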

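# Play with MLIR
# This package name may not resolve on PyPI for your platform;
# building from source is the dependable route.
pip install mlir-python-bindings
# Or build from source for full control (the CMake entry point is the llvm/ subdirectory)
git clone https://github.com/llvm/llvm-project
cd llvm-project && cmake -S llvm -B build -DLLVM_ENABLE_PROJECTS="mlir" ...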

4. Open-source hardware — RISC-V GPU experiments

This is frontier territory. Actual open GPU cores:

Vortex: RISC-V GPGPU, runs OpenCL, runs on FPGA

git clone https://github.com/vortexgpgpu/vortex
# Runs on Verilator simulation or actual FPGA

NaxRiscv: out-of-order RISC-V CPU core (not a GPU, but serious open uArch to study)
NVDLA (NVIDIA open-sourced their DLA): inference accelerator RTL

The gap: open RTL exists, but process node access (TSMC 4nm) is the actual moat. Software is solvable. Fab access isn’t.

The Real Moat Map

cuDNN fused kernels (hand-tuned asm)        ← hardest to replicate
CUDA PTX ISA stability + ecosystem lock-in  ← nearly two decades of software
NVLink / NVSwitch topology                  ← hardware, needs $$$
HBM allocation + supplier relationships     ← supply chain

ROCm closes the software gap. Nobody closes the HBM/NVLink gap without foundry capacity, HBM supply, and a lot of capital.

Your Learning Path

Week 1-2:  HIP kernels, hipify existing CUDA code
Week 3-4:  Triton kernel writing — reimplement softmax, layernorm
Week 5-6:  torch.compile internals, Dynamo/Inductor
Week 7-8:  MLIR basics — write a simple lowering pass
Month 3+:  Vortex RISC-V GPU on Verilator, understand GCN ISA
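For the softmax exercise in weeks 3-4, a plain CPU reference makes it easier to check kernel output. A minimal sketch in Python follows. The function names are my own, and the one-program-per-row mapping mentioned in the comment is a common choice rather than a requirement.

```python
import math


def softmax_row(row):
    """Numerically stable softmax: subtract the row max before exponentiating."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_rows(matrix):
    """Row-wise softmax; a GPU kernel would typically assign one program per row."""
    return [softmax_row(r) for r in matrix]


# The second row would overflow math.exp without the max subtraction.
out = softmax_rows([[1.0, 2.0, 3.0], [1000.0, 1000.0, 1000.0]])
print(out)
```

Comparing a GPU kernel against a reference like this, with a small tolerance, catches both indexing bugs and overflow on large inputs.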

Key repos to star:

# Core
https://github.com/ROCm/ROCm
https://github.com/triton-lang/triton
https://github.com/llvm/llvm-project          # MLIR lives here

# Learn from
https://github.com/tinygrad/tinygrad          # George Hotz's ROCm-first stack
https://github.com/vortexgpgpu/vortex         # Open RISC-V GPU
https://github.com/openxla/xla               # Google's compiler

tinygrad is your best teacher here — it’s a full ML framework that runs on ROCm, written to be readable, and George explicitly built it to understand + fight CUDA lock-in. Read the runtime/ folder.

Bottom Line

The open-source software stack (ROCm + Triton + MLIR + PyTorch) is good enough for most mainstream workloads and closing fast. The remaining gap is in hand-tuned kernel libraries (cuDNN) and hardware topology (NVLink). If you’re going deep on this, Triton + tinygrad is the fastest path to understanding where the real work happens.
