Deepen Training, Fine-Tuning, MoE | Generated by AI

2026.06

Question: Based on my AI portfolio, what should I do in the next few months to learn AI better — training, inference, what projects to build, what experiments to run?

Answer:

You’ve got a solid foundation: GPT-2 training from scratch, nanoGPT internals, ~2.3B tokens consumed, agent tooling (ww/iclaw/zz), and you’ve given a first-principles talk to 80 engineers. The gap between where you are and where Karpathy/Yin Wang operate is mostly depth of understanding + publication/artifact quality. Here’s how I’d structure the next 3–6 months.


1. Go Deeper on Training — Finish the GPT Story Before Moving On

You’ve trained GPT-2 124M and 760M. The next inflection point is understanding why loss curves look the way they do, not just that they go down.

Experiments to run on your existing nanoGPT fork:

# Track these explicitly per run — log to W&B or even just a CSV
{
  "model_size": "124M",
  "dataset": "fineweb-10B",
  "lr_schedule": "cosine",
  "batch_size": 524288,  # tokens
  "loss_at_1B_tokens": ...,
  "loss_at_5B_tokens": ...,
  "grad_norm_mean": ...,
  "throughput_tok_per_sec": ...,
}
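If you go the CSV route, a minimal sketch could look like the following. The helper name `log_run` and the file layout are assumptions for illustration, not part of nanoGPT; the field names simply mirror the record above.

```python
import csv
import os

# Field names mirror the record above; extend the list as you track more metrics.
FIELDS = ["model_size", "dataset", "lr_schedule", "batch_size",
          "loss_at_1B_tokens", "loss_at_5B_tokens",
          "grad_norm_mean", "throughput_tok_per_sec"]

def log_run(path, record):
    """Append one run record to a CSV, writing the header on first use."""
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        # Missing metrics are written as empty cells rather than raising.
        writer.writerow({k: record.get(k, "") for k in FIELDS})
```

One row per run keeps comparisons trivial: load the file into a spreadsheet or a dataframe and sort by whichever loss column you care about.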

Concrete experiments worth running, in order of ROI:


2. The Missing Piece: Fine-Tuning + RLHF/DPO

You’ve done pretraining. You haven’t done fine-tuning at the code level. This is the most commercially valuable skill gap you have right now.

Project: nanochat — instruction fine-tuning from scratch

This is exactly what Karpathy’s nanochat is about. Build it yourself:

# Phase 1: SFT on Alpaca/ShareGPT
# Start from your GPT-2 124M pretrained checkpoint
# Add chat template: <|user|>...<|assistant|>...
# Train with cross-entropy only on assistant tokens

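import torch.nn.functional as F

def compute_sft_loss(logits, targets, mask):
    # logits: (B, T, V); targets and mask: (B, T)
    # mask = 1 only for assistant tokens, so prompt tokens contribute no loss
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                           targets.view(-1), reduction='none')
    mask = mask.view(-1).to(loss.dtype)
    # clamp guards against a batch that contains no assistant tokens
    return (loss * mask).sum() / mask.sum().clamp(min=1)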
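The loss above needs a mask that is aligned with the targets, which is where off-by-one mistakes usually creep in. Here is one way to build it, as a sketch under assumptions: `user_id` and `assistant_id` are hypothetical token IDs for the `<|user|>` and `<|assistant|>` markers, and the marker tokens themselves are left out of the loss.

```python
def assistant_mask(token_ids, user_id, assistant_id):
    """Return 1 for tokens inside an assistant turn, 0 elsewhere (markers are 0)."""
    mask, in_assistant = [], False
    for tok in token_ids:
        if tok == assistant_id:
            in_assistant = True
            mask.append(0)
        elif tok == user_id:
            in_assistant = False
            mask.append(0)
        else:
            mask.append(1 if in_assistant else 0)
    return mask

def shift_for_next_token(token_ids, mask):
    """Position t predicts token t+1, so targets and mask are both shifted left by one."""
    return token_ids[:-1], token_ids[1:], mask[1:]
```

A real chat template would probably also train on an end-of-turn token so the model learns when to stop; that is left out here to keep the alignment logic visible.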

Then DPO (Direct Preference Optimization) — the math is simple, the implementation is ~100 lines:

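# DPO loss: no reward model needed
# Each argument is a per-sequence summed log-prob of shape (B,);
# ref_* come from the frozen reference (SFT) model, so compute them under torch.no_grad()
def dpo_loss(pi_logps_chosen, pi_logps_rejected,
             ref_logps_chosen, ref_logps_rejected, beta=0.1):
    pi_ratio = pi_logps_chosen - pi_logps_rejected      # policy log-ratio, chosen vs rejected
    ref_ratio = ref_logps_chosen - ref_logps_rejected   # same log-ratio under the reference
    return -F.logsigmoid(beta * (pi_ratio - ref_ratio)).mean()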

Run this on Anthropic’s HH-RLHF dataset or UltraFeedback. The goal isn’t a great model — it’s that you can say “I built SFT + DPO from scratch” and mean it at the code level.
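The DPO loss takes per-sequence log-probabilities, so you need a step that turns model logits into those numbers. A minimal sketch follows; the function name `sequence_logps` and the convention that `mask` marks response tokens are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def sequence_logps(logits, labels, mask):
    """Sum of log p(label_t | prefix) over positions where mask == 1.

    logits: (B, T, V); labels and mask: (B, T). Returns a tensor of shape (B,).
    """
    logps = F.log_softmax(logits, dim=-1)
    # Pick out the log-prob assigned to each target token.
    token_logps = logps.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return (token_logps * mask).sum(dim=-1)
```

You would call this four times per batch: policy and reference model, each on the chosen and the rejected sequence. Summing rather than averaging over tokens matches the usual DPO formulation, though it does make the loss sensitive to response length.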


3. MoE — You Listed DeepSeek v4, Now Implement It

You mentioned exploring DeepSeek v4 MoE. The gap is: have you actually implemented sparse routing? If not, this is the most important architecture experiment for the next 6 months.

Minimal MoE implementation (~150 lines):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, n_experts=8, top_k=2, d_model=512, d_ff=2048):
        super().__init__()
        self.n_experts = n_experts
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                         nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):
        # x: (B, T, C)
        B, T, C = x.shape
        x_flat = x.view(-1, C)  # (B*T, C)

        logits = self.gate(x_flat)  # (B*T, n_experts)
        top_k_logits, top_k_indices = logits.topk(self.top_k, dim=-1)
        # softmax over the selected experts only, so each token's weights sum to 1
        weights = F.softmax(top_k_logits, dim=-1)  # (B*T, top_k)

        out = torch.zeros_like(x_flat)
        for i, expert in enumerate(self.experts):
            mask = (top_k_indices == i).any(dim=-1)  # which tokens route here
            if mask.any():
                expert_out = expert(x_flat[mask])
                # each routed token has expert i in exactly one of its top-k slots
                w = weights[mask][top_k_indices[mask] == i].unsqueeze(-1)
                out[mask] += w * expert_out

        # Load balancing loss (simplified: penalizes non-uniform mean routing
        # probability; equals 1.0 when routing is perfectly uniform)
        routing_probs = F.softmax(logits, dim=-1).mean(0)  # (n_experts,)
        load_balance_loss = self.n_experts * (routing_probs * routing_probs).sum()

        return out.view(B, T, C), load_balance_loss

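Then read the auxiliary-loss-free load balancing paper that DeepSeek-V3 builds on and implement that variant. Standard top-k routing adds a balance term to the training loss, which interferes with the language-modeling gradient; DeepSeek's approach instead adjusts a per-expert bias that is used only when selecting experts. The difference is subtle but important.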
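To make the contrast concrete, here is a rough sketch of the bias-based idea behind auxiliary-loss-free balancing, written in plain Python for a single token and a single update step. The function names and the `step` value are hypothetical, and the update rule is a simplified reading of the approach, so check it against the paper before relying on it.

```python
def select_top_k(scores, bias, k):
    """Pick k experts by biased score.

    The bias influences selection only; gate weights should still be
    computed from the original, unbiased scores.
    """
    biased = [s + b for s, b in zip(scores, bias)]
    return sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:k]

def update_bias(bias, expert_load, step=0.001):
    """Nudge under-loaded experts up and over-loaded experts down by a fixed step.

    expert_load[i] is the number of tokens routed to expert i in the last batch.
    """
    mean_load = sum(expert_load) / len(expert_load)
    new_bias = []
    for b, load in zip(bias, expert_load):
        if load < mean_load:
            new_bias.append(b + step)
        elif load > mean_load:
            new_bias.append(b - step)
        else:
            new_bias.append(b)
    return new_bias
```

The point of the design is that balancing happens outside the gradient: the bias is updated by a rule after each batch, not trained through the loss.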


4. Inference Engineering — You’re Under-indexed Here

Your profile is pretraining-heavy. Inference optimization is where the real production value is and you’ll need it as an AI engineer at a bank or a startup.

Projects in order:

a) KV Cache from scratch — implement in your nanoGPT:

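# During generation, cache K and V instead of recomputing them.
# Assumes you extend model.forward to accept and return past_kv: a list of (k, v) per layer.
logits, past_kv = model(x, past_kv=None)  # prefill: run the full prompt once
for t in range(max_new_tokens):
    next_tok = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy, for simplicity
    x = torch.cat([x, next_tok], dim=1)
    # only run forward on the new token, not the full sequence
    logits, past_kv = model(next_tok, past_kv=past_kv)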
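Before wiring a cache into the full model, it helps to convince yourself on a single attention head that cached decoding gives the same numbers as the full causal pass. The following is a toy check under that assumption (one head, no batch dimension, random tensors standing in for projected Q/K/V); the function names are made up for this sketch.

```python
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    """Single-head attention. q: (T_q, d); k, v: (T_k, d), with T_q <= T_k.

    The queries are taken to be the last T_q positions of the sequence.
    """
    T_q, T_k = q.size(0), k.size(0)
    scores = (q @ k.T) / (k.size(-1) ** 0.5)
    q_pos = torch.arange(T_k - T_q, T_k).unsqueeze(1)  # absolute query positions
    k_pos = torch.arange(T_k).unsqueeze(0)
    scores = scores.masked_fill(k_pos > q_pos, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

def cached_attention(q, k, v):
    """Process tokens one at a time, appending each new K/V row to a cache."""
    k_cache, v_cache, outputs = [], [], []
    for t in range(q.size(0)):
        k_cache.append(k[t])
        v_cache.append(v[t])
        out_t = causal_attention(q[t:t + 1], torch.stack(k_cache), torch.stack(v_cache))
        outputs.append(out_t)
    return torch.cat(outputs, dim=0)

torch.manual_seed(0)
q, k, v = (torch.randn(6, 8) for _ in range(3))
full = causal_attention(q, k, v)
cached = cached_attention(q, k, v)
```

Comparing `full` and `cached` with `torch.allclose` is the same equivalence test you would want for the real model: logits from cached generation should match logits from a full forward pass on the same tokens.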

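b) Speculative decoding: implement the draft + verify loop. Use a small model (124M) as the draft and your 760M as the target. Measure the speedup empirically on your RTX 4070 rather than assuming it: the 2-4x figure often quoted depends on how many draft tokens the target accepts, so log the acceptance rate alongside tokens per second.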
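The heart of the verify step is the accept/reject rule, which is what keeps the output distribution identical to sampling from the target alone. A sketch for a single token, assuming `p` and `q` are the target and draft probability vectors at that position (names chosen here for illustration):

```python
import random

def residual_distribution(p, q):
    """Normalized max(0, p - q): the distribution to resample from after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    if total == 0.0:  # p == q everywhere, so a rejection cannot actually occur
        return list(p)
    return [d / total for d in diff]

def verify_token(token, p, q, rng):
    """Accept the draft token with probability min(1, p/q), else resample.

    `token` must have been sampled from q, which guarantees q[token] > 0.
    Returns (final_token, accepted).
    """
    if rng.random() < min(1.0, p[token] / q[token]):
        return token, True
    resampled = rng.choices(range(len(p)), weights=residual_distribution(p, q))[0]
    return resampled, False
```

In the full loop, the draft proposes several tokens, the target scores them all in one forward pass, and you apply this rule left to right, stopping at the first rejection.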

c) INT8/INT4 quantization — use bitsandbytes or implement naive post-training quantization, measure perplexity degradation vs throughput gain. This is what you’d need in production at HSBC.
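For the naive route, symmetric absmax quantization is about the simplest scheme to start from. The sketch below works on a flat list of floats so the arithmetic is visible; it is an illustration, not how bitsandbytes does it, and a real version would operate per-channel or per-block on tensors.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate floats; the error per weight is at most scale / 2."""
    return [qi * scale for qi in q]
```

Because one outlier weight sets the scale for everything that shares it, comparing per-tensor against per-channel scales is a natural first experiment when you measure perplexity.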

d) vLLM / SGLang internals — don’t just use them. Read the PagedAttention paper, then trace through the vLLM codebase for one forward pass. Write a 1000-word blog post explaining it. This cements it.


5. One “Research-Grade” Artifact Per Month

You have 9,700+ notes and 400 blog posts. The next level is publishing something that gets cited or forked, not just read. Concretely:

The goal: something Karpathy would retweet. That’s the bar.


6. Agent Architecture — Upgrade iclaw

iclaw is a good MVP. The next version should implement:


Priority Stack (Next 3 Months)

| Priority | Project | Machine | Time |
| --- | --- | --- | --- |
| 1 | SFT + DPO from scratch on nanochat | MI300X | 3–4 weeks |
| 2 | MoE implementation with load balancing | RTX 4070 | 2 weeks |
| 3 | Speculative decoding on your 760M | RTX 4070 | 1 week |
| 4 | Chinchilla scaling experiment, W&B tracked | MI300X | 2 weeks |
| 5 | iclaw v2 with eval harness | MacBook | ongoing |

The single most leveraged thing you can do right now is SFT + DPO from scratch. It closes the biggest gap between your pretraining skill and production AI engineering, and it’s the thing you can’t fake having done.

