Optimize by Doing Not Just Reading | Generated by AI

2026.05

Question: For optimization learning (e.g., AMD ROCm kernel optimization), the key is to find something real to optimize and iterate with AI assistance, rather than just reading — is that right?

Answer:

Yes, exactly. This is the Karpathy principle applied to systems work:

“Don’t be a tourist. Write the code.”

Reading ROCm docs or CUDA papers gives you vocabulary. Actually profiling and optimizing a real kernel gives you intuition — and intuition is what compounds.

The Learning Loop That Works

Pick a real bottleneck → measure it → form hypothesis → 
change code → measure again → ask AI to explain the delta → repeat

The AI is most useful inside the loop, not before it. You don’t ask “explain ROCm memory coalescing” in the abstract — you paste your HIP kernel, your rocprof output, and ask “why is this 3x slower than expected?”
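As a minimal sketch of the "measure, change, measure again" part of the loop, the harness below times a baseline and a candidate, checks that they still agree, and reports the speedup. The function names and the toy workload are hypothetical stand-ins, not part of any library. It uses wall-clock timing, which is only valid for synchronous code; GPU work would need device-side timing such as the event-based measurement in Step 1.

```python
import statistics
import time


def measure_ms(fn, *args, warmup=3, repeats=10):
    """Return the median wall-clock time of fn(*args) in milliseconds."""
    for _ in range(warmup):              # warm caches before timing
        fn(*args)
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(samples)


def compare(baseline, candidate, *args):
    """Check both versions agree, then measure each and report the speedup."""
    assert baseline(*args) == candidate(*args), "candidate changed the result"
    base_ms = measure_ms(baseline, *args)
    cand_ms = measure_ms(candidate, *args)
    speedup = base_ms / cand_ms if cand_ms > 0 else float("inf")
    return {"baseline_ms": base_ms, "candidate_ms": cand_ms, "speedup": speedup}


# Hypothetical "before" and "after" versions of the same computation.
def sum_squares_loop(n):
    total = 0
    for i in range(n):
        total += i * i
    return total


def sum_squares_formula(n):
    return (n - 1) * n * (2 * n - 1) // 6


if __name__ == "__main__":
    print(compare(sum_squares_loop, sum_squares_formula, 100_000))
```

The correctness check comes before the timing on purpose: a faster kernel that returns different numbers is a bug, not an optimization.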

Concrete Example: ROCm Kernel Optimization

Step 1: Find something real

# You're running a transformer, so time the attention computation
import torch

# ROCm builds of PyTorch expose the same "cuda" device API
x = torch.randn(32, 512, 512, device="cuda")

def attention(x):
    return torch.softmax(x @ x.transpose(-1, -2), dim=-1) @ x

# Warm up so one-time setup cost stays out of the measurement
for _ in range(3):
    attention(x)

# Measure
torch.cuda.synchronize()
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
out = attention(x)
end.record()
torch.cuda.synchronize()
print(f"{start.elapsed_time(end):.2f} ms")
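A measured time only means something next to an expected one. The sketch below is a rough roofline-style estimate for the same shapes (batch 32, sequence 512, dimension 512, fp32). The helper names are hypothetical, the traffic count assumes every intermediate round-trips through memory, and the peak figures are placeholders rather than real device specs.

```python
def attention_flops(batch, seq, dim):
    """FLOPs for softmax(x @ x^T) @ x, counting only the two matmuls."""
    scores = 2 * batch * seq * seq * dim   # x @ x^T
    output = 2 * batch * seq * seq * dim   # probabilities @ x
    return scores + output


def unfused_bytes(batch, seq, dim, elem_size=4):
    """Rough fp32 traffic if every intermediate round-trips through memory."""
    x_bytes = batch * seq * dim * elem_size
    s_bytes = batch * seq * seq * elem_size
    # read x, write scores, softmax read + write, read probs, read x, write out
    return 3 * x_bytes + 4 * s_bytes


def roofline_ms(flops, bytes_moved, peak_tflops, peak_gbs):
    """Lower-bound time: the slower of the compute limit and the memory limit."""
    compute_ms = flops / (peak_tflops * 1e12) * 1e3
    memory_ms = bytes_moved / (peak_gbs * 1e9) * 1e3
    bound = "compute" if compute_ms >= memory_ms else "memory"
    return max(compute_ms, memory_ms), bound


# Placeholder peaks, NOT real device specs: substitute your GPU's datasheet values.
PEAK_TFLOPS = 50.0
PEAK_GBS = 2000.0

if __name__ == "__main__":
    flops = attention_flops(32, 512, 512)
    traffic = unfused_bytes(32, 512, 512)
    floor_ms, bound = roofline_ms(flops, traffic, PEAK_TFLOPS, PEAK_GBS)
    print(f"{flops / 1e9:.1f} GFLOP, {traffic / 1e6:.1f} MB, "
          f"floor {floor_ms:.3f} ms ({bound}-bound)")
```

If the measured time is several times above this floor, that ratio is the concrete number to bring to the AI along with the kernel and the profile.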

Step 2: Write a naive HIP kernel

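#include <hip/hip_runtime.h>

// Naive N x N matmul (row-major): one thread computes one element of C
__global__ void naive_matmul(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N || col >= N) return;  // guard threads past the matrix edge
    float sum = 0.0f;
    for (int k = 0; k < N; k++)
        sum += A[row * N + k] * B[k * N + col];
    C[row * N + col] = sum;
}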
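To see why the naive kernel is a useful starting point, it helps to count memory reads. The pure-Python model below mirrors the kernel's loop structure and compares it with a tiled variant. It is a sketch, not a GPU implementation: the staged tiles stand in for LDS (shared memory), the load counter counts read requests rather than actual DRAM traffic (real caches absorb some of them), and the function names are hypothetical.

```python
def naive_matmul(A, B, n):
    """Mirror of the naive kernel: each output element reads a full row and column."""
    loads = 0
    C = [[0.0] * n for _ in range(n)]
    for row in range(n):
        for col in range(n):
            acc = 0.0
            for k in range(n):
                acc += A[row][k] * B[k][col]
                loads += 2                  # one read of A, one read of B
            C[row][col] = acc
    return C, loads


def tiled_matmul(A, B, n, tile):
    """Tiled variant: a tile of A and B is staged once, then reused `tile` times."""
    assert n % tile == 0, "sketch assumes n is a multiple of tile"
    loads = 0
    C = [[0.0] * n for _ in range(n)]
    for r0 in range(0, n, tile):
        for c0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # Stage one tile of each operand (stands in for LDS).
                a_tile = [A[r0 + i][k0:k0 + tile] for i in range(tile)]
                b_tile = [B[k0 + i][c0:c0 + tile] for i in range(tile)]
                loads += 2 * tile * tile
                for i in range(tile):
                    for j in range(tile):
                        acc = 0.0
                        for k in range(tile):
                            acc += a_tile[i][k] * b_tile[k][j]
                        C[r0 + i][c0 + j] += acc
    return C, loads


if __name__ == "__main__":
    n, tile = 8, 4
    A = [[float(i + j) for j in range(n)] for i in range(n)]
    B = [[float((i * j) % 5) for j in range(n)] for i in range(n)]
    _, naive_loads = naive_matmul(A, B, n)
    _, tiled_loads = tiled_matmul(A, B, n, tile)
    print(naive_loads, tiled_loads)
```

In this model the naive version issues 2·N³ reads and the tiled version 2·N³ / tile, so the tile size is the reuse factor you are buying.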

Step 3: Profile with rocprof

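rocprof --stats -o profile.csv ./my_kernel
# --stats reports per-kernel timing. Hardware counters are collected in a separate
# run with a counter input file, e.g. a line "pmc: FETCH_SIZE L2CacheHit" passed via -i.
# Look at: FETCH_SIZE, L2CacheHit, wavefront occupancy (available counters vary by GPU)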
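Before pasting profiler output anywhere, it is worth reducing it to "which kernel dominates". The sketch below ranks kernels from a stats CSV. The column names (`Name`, `TotalDurationNs`) and the sample rows are assumptions about the file layout, not a guaranteed format, so they are exposed as parameters; check them against the header of your own file.

```python
import csv
import io


def top_kernels(csv_text, name_col="Name", total_col="TotalDurationNs", limit=5):
    """Return (kernel name, total ms, share of total) sorted by time spent."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    totals = [(r[name_col], float(r[total_col])) for r in rows]
    grand = sum(ns for _, ns in totals) or 1.0   # avoid dividing by zero
    ranked = sorted(totals, key=lambda t: t[1], reverse=True)[:limit]
    return [(name, ns / 1e6, ns / grand) for name, ns in ranked]


# Made-up sample in the assumed layout; real files come from the profiler.
SAMPLE = '''"Name","Calls","TotalDurationNs","AverageNs","Percentage"
"naive_matmul(float*, float*, float*, int)",10,9000000,900000,90.0
"fill_kernel",10,1000000,100000,10.0
'''

if __name__ == "__main__":
    for name, ms, share in top_kernels(SAMPLE):
        print(f"{share:6.1%}  {ms:8.3f} ms  {name}")
```

The kernel at the top of this list is the one worth a hypothesis; anything below a few percent of the total rarely is.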

Step 4: Ask AI with evidence

Here's my kernel: [paste]
Here's my rocprof output: [paste]
L2 hit rate is 23%. My hypothesis is that every thread re-reads A and B
from global memory with no reuse through LDS. Is that right? What's the fix?

Now you’re using AI as a senior engineer to pair with, not as a textbook.

The Real Insight

Most optimization knowledge is not in the reading. It’s in the gap between what you expected the code to do and what the profiler shows it actually did.

That gap is where learning lives. AI accelerates closing it because you can explain the gap and get a targeted hypothesis in 30 seconds instead of digging through docs for 3 hours.

Applied to Your Situation

You train GPT-2 on H100/H200. That’s your real workload. Concrete things to optimize right now:

| Target | Tool | What you’ll learn |
| --- | --- | --- |
| Attention FLOP efficiency | nsys / rocprof | Memory bandwidth vs compute bound |
| DataLoader bottleneck | py-spy | CPU/GPU pipeline overlap |
| All-reduce in DDP | NCCL logs | Communication hiding |
| Flash Attention vs naive | Triton kernel | Tiling, SRAM usage |

Each of these is a 1-week project that will teach you more than a month of reading optimization papers — because you’ll have numbers to explain, not abstractions.
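As one example of getting such a number, the sketch below splits a training loop's time into "waiting for the next batch" and "running the step", which is a first check for the DataLoader row. The function names and the sleeping loader are hypothetical. For GPU work, `step_fn` would need to synchronize (for instance with `torch.cuda.synchronize()`) before returning, otherwise asynchronous kernel launches make the step look cheaper than it is.

```python
import time


def data_wait_fraction(batches, step_fn):
    """Return the fraction of loop time spent waiting for the next batch."""
    wait_s = 0.0
    step_s = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)            # time spent here is loader latency
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)                  # time spent here is the training step
        t2 = time.perf_counter()
        wait_s += t1 - t0
        step_s += t2 - t1
    total = wait_s + step_s
    return wait_s / total if total > 0 else 0.0


def slow_loader(n, delay_s):
    """Hypothetical loader that takes delay_s seconds to produce each batch."""
    for i in range(n):
        time.sleep(delay_s)
        yield i


if __name__ == "__main__":
    frac = data_wait_fraction(slow_loader(20, 0.01), lambda b: time.sleep(0.002))
    print(f"waiting on data for {frac:.0%} of the loop")
```

A high fraction points at the input pipeline (workers, prefetching, decoding) rather than at any kernel, which changes what is worth profiling next.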

The reading then becomes a reference, not the lesson.

