TileLang: Pythonic DSL for GPU Kernels | Generated by AI

2026.06

Question: What is tile-ai/tilelang?

Answer:

TileLang is a Pythonic DSL for writing high-performance GPU/CPU kernels, built on top of TVM as a compiler backend. Think of it as the “middle ground” between writing raw CUDA/HIP (tedious, not portable) and using cuBLAS/CUTLASS (opaque, not flexible) — you get close-to-metal control with Python ergonomics.

6.5k stars, 597 forks — it’s gaining real traction.


What problem it solves

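Writing kernels like FlashAttention, GEMM, or MLA by hand in CUDA requires dealing with low-level details such as shared-memory staging, register tiles, thread mapping, and asynchronous copy pipelines.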

TileLang abstracts all of that while still letting you express the key algorithmic decisions (tile sizes, swizzle, pipeline stages).
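
To see why tile sizes and pipeline stages are the decisions worth keeping explicit, here is a rough back-of-the-envelope sketch in plain Python. It assumes a GEMM where each pipeline stage holds one A tile and one B tile in shared memory and the accumulator lives in registers. The function name `gemm_smem_bytes` and the simple "stages times tiles" model are illustrative assumptions, not something TileLang exposes.

```python
def gemm_smem_bytes(block_m, block_n, block_k, num_stages, dtype_bytes=2):
    """Estimate shared memory for a tiled GEMM with multi-stage buffering.

    Assumption: each stage holds one A tile (block_m x block_k) and one
    B tile (block_k x block_n); the accumulator is kept in registers.
    """
    a_tile = block_m * block_k * dtype_bytes
    b_tile = block_k * block_n * dtype_bytes
    return num_stages * (a_tile + b_tile)


if __name__ == "__main__":
    # float16 inputs (2 bytes per element), 128x128 output tile, K step of 32
    for stages in (1, 2, 3, 4):
        kib = gemm_smem_bytes(128, 128, 32, stages) / 1024
        print(f"num_stages={stages}: {kib:.0f} KiB of shared memory")
```

Under this model, three stages of a 128x128x32 float16 tile need 48 KiB, so deeper pipelines or larger tiles quickly run into the per-block shared-memory limit of the target GPU.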


Core abstractions

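import tilelang.language as T

# Excerpt from the body of a TileLang kernel function: A, B, C are its global
# tensor arguments; M, N, K, block_M, block_N, block_K, dtype and accum_dtype
# come from the enclosing scope.
with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
    A_shared = T.alloc_shared((block_M, block_K), dtype)   # shared mem
    B_shared = T.alloc_shared((block_K, block_N), dtype)   # shared mem
    C_local  = T.alloc_fragment((block_M, block_N), accum_dtype)  # register tile

    T.clear(C_local)  # zero the accumulator

    for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):  # async pipeline
        T.copy(A[by * block_M, ko * block_K], A_shared)  # parallelized gmem->smem
        T.copy(B[ko * block_K, bx * block_N], B_shared)
        T.gemm(A_shared, B_shared, C_local)  # dispatches to CuTe/HIP intrinsics

    T.copy(C_local, C[by * block_M, bx * block_N])  # write the tile back to gmem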
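
As a reading aid, the loop structure of the kernel above can be mirrored on the CPU with NumPy. This is only a reference model of the math, not TileLang code: the two outer Python loops stand in for the block grid, the variable names are borrowed from the snippet, and the slice clamping at the edges is this sketch's own way of handling sizes that are not multiples of the tile.

```python
import numpy as np


def ceildiv(a, b):
    """Integer ceiling division, the role T.ceildiv plays in the kernel."""
    return -(-a // b)


def tiled_gemm(A, B, block_M=64, block_N=64, block_K=32):
    """CPU reference model of a tiled GEMM: C = A @ B, one output tile at a time."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=np.float32)
    for by in range(ceildiv(M, block_M)):          # grid dimension over rows
        for bx in range(ceildiv(N, block_N)):      # grid dimension over columns
            rows = slice(by * block_M, min((by + 1) * block_M, M))
            cols = slice(bx * block_N, min((bx + 1) * block_N, N))
            # Accumulator tile, analogous to C_local after T.clear
            C_local = np.zeros((rows.stop - rows.start, cols.stop - cols.start),
                               dtype=np.float32)
            for ko in range(ceildiv(K, block_K)):  # reduction over K tiles
                ks = slice(ko * block_K, min((ko + 1) * block_K, K))
                A_shared = A[rows, ks].astype(np.float32)  # stands in for gmem->smem copy
                B_shared = B[ks, cols].astype(np.float32)
                C_local += A_shared @ B_shared             # stands in for T.gemm
            C[rows, cols] = C_local                        # write the tile back
    return C


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((100, 70)).astype(np.float32)
    B = rng.standard_normal((70, 50)).astype(np.float32)
    print(np.allclose(tiled_gemm(A, B, 32, 32, 16), A @ B, atol=1e-3))
```

A model like this is also handy as a correctness oracle when experimenting with different tile sizes.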

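Key primitives (as used in the snippet above):

T.Kernel: launches the block grid and binds the block indices (bx, by)
T.alloc_shared: allocates a tile in shared memory
T.alloc_fragment: allocates a register-level tile, used here as the accumulator
T.clear: zero-fills a buffer
T.Pipelined: loop with an asynchronous software pipeline, depth set by num_stages
T.copy: parallelized tile copy between global memory, shared memory, and fragments
T.gemm: tile-level matrix multiply, dispatched to CuTe/HIP intrinsics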


Why it’s relevant to you specifically

You trained GPT-2 on MI300X, and TileLang explicitly supports that GPU, including async copy. It also has a FlashMLA implementation for AMD MI300X that matches hand-optimized assembly (Aiter). This is directly useful if you want to write custom attention or MoE routing kernels for your training runs.

The DeepSeek MLA decode kernel is only about 80 lines of Python and matches FlashMLA on H100; that’s the kind of productivity/performance tradeoff this DSL is targeting.


Stack


TileLang vs alternatives

| | TileLang | Triton | CUTLASS/CuTe | Raw CUDA |
|---|---|---|---|---|
| Abstraction level | Tile | Block/warp | Thread block | Thread |
| Portability | CUDA+HIP+Metal | Mainly CUDA | CUDA only | CUDA only |
| Async pipeline control | ✅ explicit | | | manual |
| AMD support | ✅ MI300X | partial | | HIP |
| Pythonic | | | | |

The main differentiator vs Triton: TileLang gives you explicit shared memory layout control and per-thread ops (critical for dequant GEMM), while Triton abstracts that away. It’s also used as the backend for Microsoft BitBLAS and AttentionEngine.
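
To make the dequant GEMM point concrete, here is a minimal NumPy sketch of what has to happen between loading quantized weights and the matrix multiply. Every element is unpacked and rescaled before it can feed the GEMM, which is the kind of per-element work that benefits from per-thread control. The packing layout (two unsigned 4-bit values per byte, low nibble first), the `zero_point` of 8, and the function names are assumptions made for illustration; they are not the format used by BitBLAS or TileLang.

```python
import numpy as np


def pack_int4(q):
    """Pack unsigned 4-bit values (0..15) pairwise into uint8, low nibble first."""
    q = q.astype(np.uint8)
    return q[..., 0::2] | (q[..., 1::2] << 4)


def unpack_int4(packed):
    """Inverse of pack_int4: expand each byte into two 4-bit values."""
    out = np.empty(packed.shape[:-1] + (packed.shape[-1] * 2,), dtype=np.uint8)
    out[..., 0::2] = packed & 0x0F   # low nibble
    out[..., 1::2] = packed >> 4     # high nibble
    return out


def dequant_gemm(A, W_packed, scale, zero_point=8):
    """Dequantize packed 4-bit weights to float32, then multiply: A @ W."""
    W = (unpack_int4(W_packed).astype(np.float32) - zero_point) * scale
    return A @ W


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.integers(0, 16, size=(8, 6))           # quantized weights, shape (K, N)
    A = rng.standard_normal((4, 8)).astype(np.float32)
    packed = pack_int4(q)                          # shape (K, N // 2)
    reference = A @ ((q - 8) * 0.5).astype(np.float32)
    print(np.allclose(dequant_gemm(A, packed, 0.5), reference, atol=1e-5))
```

In a GPU kernel the unpack and rescale step would run on the tile after it is loaded and before the tile-level GEMM, rather than materializing the full float weight matrix as this sketch does.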


Quick try on your RTX 4070

pip install tilelang

Then run the GEMM example from the README — it should JIT-compile and benchmark cleanly on CUDA.

For MI300X on AMD Dev Cloud:

pip install tilelang -f https://tile-ai.github.io/whl/nightly  # nightly for latest ROCm fixes


