Inference vs Pretraining Career Path | Generated by AI

2026.06

Question: Is working on LLM inference/serving infrastructure (vLLM, SGLang, llama.cpp) a better career/learning direction than LLM pretraining?

Answer:

It depends on what you mean by “better,” but here is a sharp take for your specific trajectory.

The Honest Comparison

| Dimension          | Inference/Serving  | Pretraining                          |
|--------------------|--------------------|--------------------------------------|
| Feedback loop      | Minutes to hours   | Days to weeks                        |
| Hardware cost      | RTX 4070 is enough | Need A100/H100 clusters              |
| Job market         | Huge right now     | Narrow (OpenAI, Anthropic, DeepSeek) |
| Depth ceiling      | Very high          | Highest                              |
| Transferability    | Broad              | Narrow                               |
| Your current setup | ✅ fits            | ⚠️ limited                           |

Why Inference Is Underrated

Most people assume pretraining is the more prestigious track. In practice that undersells inference, where the systems problems are legitimately hard.

These aren’t “glue code” problems. They require knowing attention math, memory bandwidth math, CUDA kernels, and distributed systems simultaneously.

What You’d Learn in Inference That Compounds

Transformer internals
    → KV cache mechanics (K and V each shaped [layers, kv_heads, seq_len, head_dim] per sequence)
    → Attention complexity O(n²) → why sequence length kills you
    → FlashAttention: fused kernel, tiling, avoid HBM round-trips
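
To make the KV cache shape above concrete, here is a minimal sizing sketch. The `ModelShape` container, the function names, and the 7B-class dimensions (32 layers, 32 KV heads, head_dim 128, fp16) are assumptions chosen for illustration, not values read from any specific checkpoint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical container for the dimensions that set KV cache size."""
    num_layers: int
    num_kv_heads: int
    head_dim: int


def kv_cache_bytes(shape, seq_len, batch_size=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer."""
    # Leading factor of 2: one tensor for K and one for V.
    return (2 * shape.num_layers * shape.num_kv_heads * shape.head_dim
            * seq_len * batch_size * bytes_per_elem)


def attention_score_elems(seq_len, num_heads):
    """Entries in one layer's full (unmasked) attention score matrix."""
    return num_heads * seq_len * seq_len


# Assumed 7B-class configuration in fp16 (2 bytes per element).
shape = ModelShape(num_layers=32, num_kv_heads=32, head_dim=128)

per_token = kv_cache_bytes(shape, seq_len=1)
at_4k = kv_cache_bytes(shape, seq_len=4096)
print(f"{per_token / 2**20} MiB per token")   # 0.5 MiB
print(f"{at_4k / 2**30} GiB at 4096 tokens")  # 2.0 GiB
```

The cache grows linearly with sequence length, while the score matrix grows quadratically: doubling `seq_len` quadruples `attention_score_elems`. That quadratic term is the intermediate FlashAttention avoids writing out to HBM.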

Systems
    → CUDA memory hierarchy (HBM → L2 → SRAM)
    → Tensor parallelism (split heads across GPUs)
    → Pipeline parallelism (split layers)
    → Prefill vs decode phases (compute-bound vs memory-bound)
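
The prefill/decode split can be sketched with a back-of-the-envelope roofline calculation for a single linear layer. The function names are hypothetical, and the hardware numbers (100 TFLOP/s peak, 1 TB/s memory bandwidth) are round placeholders, not the specs of any particular GPU.

```python
def linear_layer_intensity(num_tokens, d_in, d_out, bytes_per_elem=2):
    """Approximate FLOPs per byte moved for y = x @ W with x: [num_tokens, d_in]."""
    flops = 2 * num_tokens * d_in * d_out           # one multiply + one add per MAC
    weight_bytes = d_in * d_out * bytes_per_elem    # weights are read once
    act_bytes = num_tokens * (d_in + d_out) * bytes_per_elem  # read x, write y
    return flops / (weight_bytes + act_bytes)


def is_compute_bound(intensity, peak_flops, mem_bandwidth):
    """Roofline test: compute-bound once intensity reaches the ridge point."""
    return intensity >= peak_flops / mem_bandwidth


# Placeholder hardware numbers (assumptions, not a real device).
PEAK_FLOPS = 100e12    # FLOP/s
MEM_BANDWIDTH = 1e12   # bytes/s, so the ridge point is 100 FLOP/byte

decode = linear_layer_intensity(num_tokens=1, d_in=4096, d_out=4096)
prefill = linear_layer_intensity(num_tokens=2048, d_in=4096, d_out=4096)

print(f"decode:  {decode:.2f} FLOP/byte, compute-bound: "
      f"{is_compute_bound(decode, PEAK_FLOPS, MEM_BANDWIDTH)}")
print(f"prefill: {prefill:.2f} FLOP/byte, compute-bound: "
      f"{is_compute_bound(prefill, PEAK_FLOPS, MEM_BANDWIDTH)}")
```

With one token per step, decode does roughly one FLOP per byte of weights read, so it sits far below the ridge point. Prefill amortizes the same weight read over thousands of tokens. This model ignores attention itself and batching across requests, which is how serving engines raise decode intensity.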

Quantization
    → Post-training quantization math
    → Calibration datasets, perplexity impact
    → GPTQ, AWQ, SmoothQuant internals
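
As a starting point for the quantization math, here is a sketch of the simplest baseline: symmetric round-to-nearest with one scale per tensor. This is deliberately not GPTQ, AWQ, or SmoothQuant, which add calibration data and error compensation on top. The helper names and the weight values are made up for illustration.

```python
def quantize_symmetric(weights, num_bits=8):
    """Round-to-nearest symmetric quantization with a single scale."""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for int8, 7 for int4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level, then clamp to [-qmax, qmax].
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map integer levels back to approximate float weights."""
    return [v * scale for v in q]


def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


# Made-up weights standing in for one row of a weight matrix.
weights = [0.02, -0.15, 0.40, -0.33, 0.07, 0.91, -0.58, 0.005]

for bits in (8, 4):
    q, scale = quantize_symmetric(weights, bits)
    err = max_abs_error(weights, dequantize(q, scale))
    print(f"int{bits}: scale={scale:.5f} max_abs_error={err:.5f}")
```

The worst-case error per weight is half the scale, and the scale is set by the largest weight, so a single outlier coarsens every other value. Handling those outliers is a large part of what the calibrated methods listed above are for.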

All of this transfers directly to pretraining: each training step is a forward pass followed by a backward pass and an optimizer update, and inference work already teaches you the forward pass in depth.

The Strategic Answer for You

Given your setup (RTX 4070, MI300X access, already training GPT-2 scale):

Do inference first as your daily driver, pretraining as your weekend project.

Inference gives you:

Pretraining gives you:

Concrete Ramp for You

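# Note: file paths below reflect one snapshot of each repo and may have moved; search by file name if a path is missing.
# Week 1-2: vLLM internals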
git clone https://github.com/vllm-project/vllm
# Read: vllm/core/scheduler.py — continuous batching logic
# Read: vllm/attention/backends/flash_attn.py (attention backend that reads the paged KV cache through block tables)

# Week 3-4: SGLang
git clone https://github.com/sgl-project/sglang
# Read: python/sglang/srt/mem_cache/radix_cache.py (RadixAttention prefix cache)
# Experiment: prefix caching benchmarks on your 4070

# Week 5-6: llama.cpp
git clone https://github.com/ggerganov/llama.cpp
# Read: ggml-cuda.cu — how CUDA kernels are hand-written
# Experiment: quantize a 7B model, measure perplexity vs speed

# Then: contribute a small PR to vLLM or SGLang
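
Before running the prefix caching benchmarks, it may help to see the idea in miniature. The sketch below is a toy token-level trie, not SGLang's implementation: a real radix cache compresses edges, stores KV block references, and evicts under memory pressure. The `PrefixCache` class, the `serve` helper, and the token IDs are all hypothetical.

```python
class PrefixCache:
    """Toy token trie; each node stands in for one cached KV entry."""

    def __init__(self):
        self.root = {}

    def match_len(self, tokens):
        """Length of the longest already-cached prefix of `tokens`."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node = node[t]
            matched += 1
        return matched

    def insert(self, tokens):
        """Record every token of the sequence as cached."""
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})


def serve(cache, requests):
    """Return (tokens reused from cache, tokens that needed prefill)."""
    reused = computed = 0
    for tokens in requests:
        hit = cache.match_len(tokens)
        reused += hit
        computed += len(tokens) - hit
        cache.insert(tokens)
    return reused, computed


# Four requests sharing a 100-token system prompt plus 2 unique tokens each.
system_prompt = list(range(100))
requests = [system_prompt + [1000 + i, 2000 + i] for i in range(4)]

reused, computed = serve(PrefixCache(), requests)
print(f"reused={reused} computed={computed}")  # reused=300 computed=108
```

Only the first request pays for the shared prompt. The later three reuse it and prefill just their two unique tokens, which is the effect a prefix caching benchmark should surface as lower time to first token.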

Bottom Line

Inference/serving is not easier; it is differently hard, with faster feedback. For your situation it is the better primary track because the compounding is faster and you can run real experiments on hardware you already have. Much of the pretraining knowledge then follows as a byproduct once you deeply understand the forward pass from the inference side.

The Karpathy path: understand every line of the forward pass, and that is inference. Pretraining then adds loss.backward() and an optimizer step on top.

