Best Sparse Attention Repos Ranked | Generated by AI
Here are the repos worth your time, ranked by signal-to-noise for actually understanding the mechanics:
Original / canonical
openai/sparse_attention: Child et al.’s own reference implementation. Small, readable. attention.py contains the strided and fixed patterns directly. This is the primary source, so read it first. Note that it was written for TensorFlow 1.x, so you probably won’t run it, but the pattern logic is what matters.
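To make the two patterns concrete before you open attention.py, here is a minimal pure-Python sketch of them as boolean masks. It is written from the paper's description, not copied from the repo. The function names `strided_pattern` and `fixed_pattern` and the parameter `c` are my own, the two heads of each pattern are merged into one mask for readability, and the exact window boundaries may differ by one from the paper's definitions.

```python
def strided_pattern(n, stride):
    """mask[i][j] is True when query i may attend to key j (causal).

    Assumption: the paper's two heads (local window + strided column)
    are merged into a single mask for illustration.
    """
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):                  # j <= i keeps it causal
            local = (i - j) < stride            # the most recent `stride` positions
            column = (i - j) % stride == 0      # every stride-th earlier position
            mask[i][j] = local or column
    return mask


def fixed_pattern(n, stride, c=1):
    """Fixed pattern: attend within your own block, plus the last `c`
    positions of every earlier block (the "summary" columns)."""
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            same_block = (j // stride) == (i // stride)
            summary = (j % stride) >= stride - c
            mask[i][j] = same_block or summary
    return mask


def attended(mask, i):
    """Key indices that query i attends to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]
```

Calling `attended(strided_pattern(8, 4), 5)` and `attended(fixed_pattern(8, 4), 5)` side by side shows the difference: the strided pattern's extra positions move with the query, while the fixed pattern always looks at the same summary columns.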
For the kernels (the part that makes it real)
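openai/triton (now hosted at triton-lang/triton): the python/tutorials/06-fused-attention.py tutorial is the canonical fused attention walkthrough in Triton. It implements dense, FlashAttention-style attention rather than a block-sparse pattern, but its loop over key/value tiles is exactly the loop a block-sparse kernel prunes. This is where “fused block-sparse GPU kernels” stops being a phrase and becomes code you can modify. Directly relevant to your CUDA/inference-optimization interest.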
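The step from a token-level mask to something a tiled kernel can use is a block layout: one boolean per tile saying whether the kernel needs to visit it at all. A rough pure-Python sketch of that reduction follows. `sliding_window_mask`, `block_layout` and `active_fraction` are hypothetical helper names, not Triton APIs, and a real kernel would typically still apply the fine-grained mask inside each active tile.

```python
def sliding_window_mask(n, window):
    """Causal local mask: query i sees keys i-window+1 .. i."""
    return [[0 <= i - j < window for j in range(n)] for i in range(n)]


def block_layout(mask, block):
    """Reduce an n x n token mask to an (n/block) x (n/block) tile layout.

    A tile is active if any (query, key) pair inside it is allowed.
    Inactive tiles are the ones a block-sparse kernel never loads.
    """
    n = len(mask)
    assert n % block == 0, "sequence length must be a multiple of the block size"
    nb = n // block
    layout = [[False] * nb for _ in range(nb)]
    for bi in range(nb):
        for bj in range(nb):
            layout[bi][bj] = any(
                mask[i][j]
                for i in range(bi * block, (bi + 1) * block)
                for j in range(bj * block, (bj + 1) * block)
            )
    return layout


def active_fraction(layout):
    """Share of tiles the kernel has to compute."""
    cells = [cell for row in layout for cell in row]
    return sum(cells) / len(cells)


layout = block_layout(sliding_window_mask(8, 2), block=2)
```

For this small example the layout is the block diagonal plus the first sub-diagonal, which is the kind of structure you would feed to a block-sparse kernel as a list of active tile indices per query block.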
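Dao-AILab/flash-attention: not sparse per se, but FlashAttention is the IO-aware tiling idea that refined the “recompute to save memory” trick from the Sparse Transformer paper. It still recomputes attention in the backward pass, but does so tile by tile in on-chip memory, so the full attention matrix is never written to GPU main memory. Read flash_attn/flash_attn_triton.py to see the modern version of what Child et al. were reaching for. Essential lineage.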
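The piece that makes the tiling work is the streaming (online) softmax: you can process keys one tile at a time and rescale the running result whenever a larger score shows up. Here is a minimal sketch for a single query vector in pure Python, checked against the naive version. It is an illustration of the idea, not the repo's kernel; the function names are mine, and the usual 1/sqrt(d) scaling is left out to keep it short.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(q, K, V):
    """Softmax attention for one query, materialising every score at once."""
    scores = [dot(q, k) for k in K]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[c] for w, v in zip(weights, V)) / z for c in range(len(V[0]))]


def tiled_attention(q, K, V, tile=2):
    """Same result, but keys/values are consumed `tile` rows at a time."""
    d = len(V[0])
    running_max = float("-inf")   # largest score seen so far
    running_sum = 0.0             # softmax denominator, relative to running_max
    acc = [0.0] * d               # unnormalised output, relative to running_max
    for start in range(0, len(K), tile):
        k_tile, v_tile = K[start:start + tile], V[start:start + tile]
        scores = [dot(q, k) for k in k_tile]
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)   # 0.0 on the first tile
        probs = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * rescale + sum(probs)
        acc = [
            a * rescale + sum(p * v[c] for p, v in zip(probs, v_tile))
            for c, a in enumerate(acc)
        ]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.3, -1.2]
K = [[1.0, 0.0], [0.5, 0.5], [-1.0, 2.0], [0.0, -3.0], [2.0, 1.0]]
V = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0], [-2.0, 0.5], [0.5, 0.5]]
out_naive = naive_attention(q, K, V)
out_tiled = tiled_attention(q, K, V, tile=2)
```

Only one tile of scores exists at any time, which is why the memory cost no longer grows with the square of the sequence length.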
For the lineage forward
[lucidrains repos]: Phil Wang reimplements nearly every attention variant in clean, minimal PyTorch:
- sparse-attention: the fixed-pattern side
- routing-transformer: content-based clustering branch
- reformer-pytorch: LSH bucketing branch
These are the best way to see the content-agnostic vs content-based fork in actual code, side by side. Each is compact, with no infra.
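To see what "content-based" means in code, here is a small sketch of angular LSH bucketing in the style Reformer describes: project each vector with a shared random matrix and take the argmax over the projections and their negations. This is a simplified illustration, not reformer-pytorch's implementation. `lsh_buckets` and `bucket_mask` are hypothetical names, and the real thing also sorts tokens by bucket, chunks them, and uses several hash rounds. Routing Transformer replaces the random projection with k-means centroids, but the resulting "attend within your cluster" mask has the same shape.

```python
import random


def lsh_buckets(vectors, n_buckets, seed=0):
    """Assign each vector a bucket id in [0, n_buckets) via random projections."""
    assert n_buckets % 2 == 0, "buckets come in +/- pairs"
    rng = random.Random(seed)
    d = len(vectors[0])
    half = n_buckets // 2
    # One shared random projection matrix of shape (d, half).
    R = [[rng.gauss(0.0, 1.0) for _ in range(half)] for _ in range(d)]
    buckets = []
    for x in vectors:
        proj = [sum(x[i] * R[i][h] for i in range(d)) for h in range(half)]
        scores = proj + [-p for p in proj]          # concat [xR, -xR]
        buckets.append(max(range(n_buckets), key=scores.__getitem__))
    return buckets


def bucket_mask(buckets):
    """mask[i][j] is True when tokens i and j landed in the same bucket."""
    n = len(buckets)
    return [[buckets[i] == buckets[j] for j in range(n)] for i in range(n)]


vectors = [[1.0, 0.5, -0.2], [2.0, 1.0, -0.4], [-1.0, -0.5, 0.2]]
buckets = lsh_buckets(vectors, n_buckets=4)
mask = bucket_mask(buckets)
```

The first two vectors point in the same direction, so they share a bucket regardless of the seed, while the third points the opposite way and lands elsewhere. Unlike the strided or fixed patterns, this mask changes whenever the token content changes.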
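allenai/longformer and google-research/bigbird: the direct generalizations of fixed-pattern attention. Longformer combines sliding-window local attention with a handful of global tokens; BigBird adds random connections on top (local + global + random). BigBird’s bigbird/core/attention.py shows how the fixed patterns got combined.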
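The local + global + random combination is easy to picture as a mask. The sketch below builds one at token level in pure Python. It is an illustration under my own assumptions, not the code in either repo: `combined_mask` and its parameters are hypothetical names, the global tokens are simply the first `n_global` positions, and BigBird itself works on blocks of tokens rather than single tokens. Setting `n_random=0` gives a Longformer-style local + global mask.

```python
import random


def combined_mask(n, window, n_global, n_random, seed=0):
    """Bidirectional mask: mask[i][j] is True when token i may attend to j."""
    rng = random.Random(seed)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            local = abs(i - j) <= window                 # sliding window
            is_global = i < n_global or j < n_global     # global rows and columns
            mask[i][j] = local or is_global
        # A few random keys per query (may overlap with the above).
        for j in rng.sample(range(n), n_random):
            mask[i][j] = True
    return mask


mask = combined_mask(n=12, window=1, n_global=2, n_random=2)
density = sum(sum(row) for row in mask) / (12 * 12)
```

With the window, global count and random count held constant, the number of allowed pairs grows linearly with `n` instead of quadratically, which is the whole point of these patterns.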
My recommendation for your workflow
Given you learn by building: skip straight to the Triton fused-attention tutorial and lucidrains’ minimal reimplementations. Clone both, run on your 4070, then port the strided_mask sketch above into a working block-sparse attention layer. That’s a tight afternoon project that teaches you more than reading all six papers.
Want a minimal nanoGPT-style training script that swaps in block-sparse attention so you can benchmark it against dense on your hardware?