Fine-Tune on 10k Jekyll Notes

2026.06

Question: I have ~10k notes at lzwjava.github.io (Jekyll repo: github.com/lzwjava/jekyll-ai-blog). I want to fine-tune a model on them — what should I do?

Answer:

First, the honest framing: 10k notes ≈ 8–15M tokens. That’s far too small for pretraining, but it’s a very good SFT dataset — especially because your notes-en corpus is literally already in Q&A format (title = question, body = answer). Fine-tuning will teach a model your voice, format, and answer style, not reliable factual recall of every note’s content (for recall, RAG wins; weights are a lossy compressor at this data scale). The interesting project here is: “a model that answers like lzwjava’s notes” — and that’s exactly what SFT does well.

Step 1 — Extract the dataset from Jekyll

Your posts are markdown with YAML front matter. Parse them into {question, answer} pairs:

# build_dataset.py
import json, re
from pathlib import Path
import frontmatter  # pip install python-frontmatter

POSTS = Path("jekyll-ai-blog/_posts")  # or wherever notes live

# Strips Liquid tags ({% ... %}) and outputs ({{ ... }}), including multi-line ones.
LIQUID = re.compile(r"{[%{].*?[%}]}", re.DOTALL)

n = 0
# The with-block guarantees the file is flushed and closed, even on an error.
with open("notes_sft.jsonl", "w", encoding="utf-8") as out:
    for f in sorted(POSTS.rglob("*.md")):
        post = frontmatter.load(f)
        if post.get("lang", "en") != "en":      # English only
            continue
        # The title can be missing or parsed by YAML as a non-string (e.g. a number).
        title = str(post.get("title") or "").strip()
        body = LIQUID.sub("", post.content).strip()
        if not title or len(body) < 300:        # drop stubs
            continue
        out.write(json.dumps({
            "conversations": [
                {"role": "user", "content": title},
                {"role": "assistant", "content": body},
            ]
        }, ensure_ascii=False) + "\n")
        n += 1

print(f"{n} examples")
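To see what the Liquid-stripping regex in build_dataset.py actually removes, here is a small check on a made-up note body (the sample string is hypothetical). It deletes the tags themselves but keeps whatever sits between `{% raw %}` and `{% endraw %}`. It will also eat literal `{{ ... }}` inside code samples such as Jinja or Go templates, so spot-check a few notes that contain template code.

```python
import re

# Same pattern as in build_dataset.py: matches {% ... %} tags and {{ ... }} outputs.
LIQUID = re.compile(r"{[%{].*?[%}]}", re.DOTALL)

def strip_liquid(text: str) -> str:
    """Remove Liquid tags and outputs, leaving the surrounding text untouched."""
    return LIQUID.sub("", text)

# Hypothetical note body, for illustration only.
sample = "Intro {% raw %}code{% endraw %} and {{ site.url }}/x"
cleaned = strip_liquid(sample)
print(cleaned)
```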

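Then sanity-check the token count (deduplication is a separate pass). The cl100k_base count is only a proxy, since the Qwen and Llama tokenizers will give somewhat different numbers: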

python -c "
import json, tiktoken
enc = tiktoken.get_encoding('cl100k_base')
tot = sum(len(enc.encode(json.loads(l)['conversations'][1]['content'])) for l in open('notes_sft.jsonl'))
print(f'{tot/1e6:.1f}M assistant tokens')"
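The one-liner above only counts tokens. For the dedupe pass, a minimal sketch could hash a normalized copy of each answer and keep the first occurrence. This assumes the `conversations` layout written by build_dataset.py; the output filename in the usage note is hypothetical. It only catches exact duplicates after whitespace and case normalization, so near-duplicates would need something like MinHash instead.

```python
import hashlib
import json
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences hash the same."""
    return re.sub(r"\s+", " ", text).strip().lower()

def dedupe(records):
    """Keep the first record for each distinct (normalized) assistant answer."""
    seen = set()
    kept = []
    for rec in records:
        answer = rec["conversations"][1]["content"]
        key = hashlib.sha256(normalize(answer).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept

def dedupe_file(src: str, dst: str) -> int:
    """Read a JSONL file, write the deduplicated records, return how many were kept."""
    with open(src, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    kept = dedupe(records)
    with open(dst, "w", encoding="utf-8") as f:
        for rec in kept:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return len(kept)
```

Usage would look like `dedupe_file("notes_sft.jsonl", "notes_sft_dedup.jsonl")`, after which you point the later steps at the deduplicated file.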

Hold out ~200 examples as an eval set before training. Also filter aggressively — 7k clean examples beat 10k noisy ones. Drop notes that are pure link dumps, config snippets without prose, or duplicated answers.
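The filtering advice above can be approximated with a couple of cheap heuristics. This is a sketch, not a tuned filter: the function names and both thresholds are assumptions, so check what it drops on a sample of your notes before trusting it.

```python
import re

URL = re.compile(r"https?://\S+")
# Matches a fenced code block: three backticks, anything, three backticks.
FENCE = re.compile(r"`{3}.*?`{3}", re.DOTALL)

def link_line_ratio(body: str) -> float:
    """Fraction of non-empty lines that contain a URL."""
    lines = [ln for ln in body.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return sum(1 for ln in lines if URL.search(ln)) / len(lines)

def prose_chars(body: str) -> int:
    """Characters left after removing fenced code blocks and URLs."""
    without_code = FENCE.sub("", body)
    return len(URL.sub("", without_code).strip())

def keep(body: str, max_link_ratio: float = 0.5, min_prose: int = 200) -> bool:
    """Heuristic filter; both thresholds are guesses to tune on your own notes."""
    return link_line_ratio(body) <= max_link_ratio and prose_chars(body) >= min_prose
```

A note that is mostly URLs fails the first test, and a note that is one big config block with no explanation fails the second.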

Step 2 — Pick model + machine

| Setup | Model | Method | Why |
| --- | --- | --- | --- |
| RTX 4070 (12GB) | Qwen3-4B / Llama-3.1-8B | QLoRA 4-bit | Fits in 12GB with Unsloth; free iteration |
| MI300X ($2/hr) | Qwen3-32B or Llama-3.3-70B | LoRA 16-bit | 192GB HBM3 lets you LoRA a 70B on one GPU |
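The "fits" claims in the table come down to weight-memory arithmetic. Here is a rough sketch using nominal parameter counts (4B, 8B, 32B, 70B); it covers weights only, so LoRA adapters, optimizer state, activations and quantization overhead all come on top.

```python
def weight_gb(params_billion: float, bits_per_param: float) -> float:
    """Weight memory in GB (1 GB = 1e9 bytes), ignoring quantization overhead."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

# Nominal sizes; real checkpoints are slightly larger than their names suggest.
for name, params, bits in [
    ("Qwen3-4B, 4-bit", 4, 4),
    ("Llama-3.1-8B, 4-bit", 8, 4),
    ("Qwen3-32B, 16-bit", 32, 16),
    ("Llama-3.3-70B, 16-bit", 70, 16),
]:
    print(f"{name}: ~{weight_gb(params, bits):.0f} GB of weights")
```

By this estimate a 70B model in 16-bit needs about 140 GB for weights alone, which leaves roughly 50 GB of the MI300X's 192 GB for everything else. That is why batch size and sequence length still need care at that scale.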

Unsloth now officially supports AMD — its Triton kernels were ported to HIP/ROCm in collaboration with the ROCm team, verified on real MI300X hardware. Install on the MI300X box:

python3 -m venv unsloth_env && source unsloth_env/bin/activate
pip install torch torchvision torchao xformers --index-url https://download.pytorch.org/whl/rocm6.4
pip install --no-deps unsloth unsloth-zoo
pip install "unsloth[amd] @ git+https://github.com/unslothai/unsloth"

(This only matters if you train in 4-bit on ROCm: bitsandbytes versions ≤ 0.49.2 have a 4-bit decode NaN bug on AMD GPUs, so you would need a pre-release build. On the MI300X, just use 16-bit LoRA and avoid the issue.)
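After installing, it is worth confirming that PyTorch actually sees the GPU before launching a training run. Here is a small sketch using standard PyTorch calls; `describe_backend` is a hypothetical helper name. ROCm builds of PyTorch expose the GPU through the same `torch.cuda` API, so one check covers both machines.

```python
def describe_backend() -> str:
    """Report which PyTorch GPU backend is active, or why none is."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"
    if not torch.cuda.is_available():
        return "torch is installed but no GPU is visible"
    # torch.version.hip is set on ROCm builds and None on CUDA builds.
    hip = getattr(torch.version, "hip", None)
    backend = f"ROCm/HIP {hip}" if hip else f"CUDA {torch.version.cuda}"
    return f"{torch.cuda.get_device_name(0)} via {backend}"

print(describe_backend())
```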

Step 3 — Train

# train.py: works on both the 4070 (CUDA) and the MI300X (ROCm)
from unsloth import FastLanguageModel   # import unsloth before trl so its patches apply
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3-8B",          # swap in a 32B/70B checkpoint on the MI300X
    max_seq_length=4096,
    load_in_4bit=True,            # False on MI300X (16-bit LoRA)
)
model = FastLanguageModel.get_peft_model(
    model, r=32, lora_alpha=32,
    target_modules=["q_proj","k_proj","v_proj","o_proj",
                    "gate_proj","up_proj","down_proj"],
)

ds = load_dataset("json", data_files="notes_sft.jsonl", split="train")
ds = ds.map(lambda x: {"text": tokenizer.apply_chat_template(
    x["conversations"], tokenize=False)})

# Hold out ~200 examples so eval loss is measured on notes the model never trains on.
split = ds.train_test_split(test_size=200, seed=42)
split["test"].to_json("notes_eval.jsonl")   # reuse these titles for the Step 4 comparison

trainer = SFTTrainer(
    model=model, tokenizer=tokenizer,
    train_dataset=split["train"], eval_dataset=split["test"],
    args=SFTConfig(
        per_device_train_batch_size=2,   # 64+ on MI300X
        gradient_accumulation_steps=8,   # lower this when you raise the batch size
        per_device_eval_batch_size=2,
        num_train_epochs=2,              # small data: don't go past 3
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        bf16=True, logging_steps=10,
        eval_strategy="steps", eval_steps=50,   # makes "watch eval loss" possible
        output_dir="lzw-notes-lora",
        packing=True,                    # critical for short Q&A pairs
    ),
)
trainer.train()
model.save_pretrained_merged("lzw-notes-merged", tokenizer)

With ~10M tokens, 2 epochs on an 8B QLoRA is a few hours on the 4070; on MI300X with batch 64+ it’s well under an hour — one $2 droplet session covers many experiments. Watch eval loss; with data this small, overfitting (memorized phrasing, repetition) shows up fast after epoch 2–3.
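To make the time estimates concrete, here is a back-of-the-envelope step count. It assumes packing fills every sequence to the full 4096 tokens and uses the rough ~10M-token figure from above; `optimizer_steps` is a hypothetical helper, not part of any library.

```python
import math

def optimizer_steps(total_tokens, epochs, seq_len, per_device_batch, grad_accum, n_gpus=1):
    """Approximate optimizer steps when packing fills every sequence to seq_len."""
    tokens_per_step = seq_len * per_device_batch * grad_accum * n_gpus
    return math.ceil(total_tokens * epochs / tokens_per_step)

# 4070 settings: batch 2, gradient accumulation 8, 2 epochs.
steps_4070 = optimizer_steps(10_000_000, 2, 4096, 2, 8)
# Hypothetical MI300X run: batch 64 while leaving gradient accumulation at 8.
steps_mi300x_accum8 = optimizer_steps(10_000_000, 2, 4096, 64, 8)
# Same batch size with gradient accumulation reduced to 1.
steps_mi300x_accum1 = optimizer_steps(10_000_000, 2, 4096, 64, 1)
print(steps_4070, steps_mi300x_accum8, steps_mi300x_accum1)
```

Under these assumptions the 4070 run is about 306 optimizer steps. Raising the batch to 64 without touching gradient accumulation leaves only about 10 steps, which is too few for the warmup and cosine schedule to mean much. Dropping accumulation to 1 gives about 77.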

Step 4 — Eval and serve

Vibe-eval against the base model on your 200 held-out titles: generate both, diff the style. Then export for local use:
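"Diff the style" can be made slightly less subjective with a few surface statistics. The sketch below assumes you have already generated two lists of answers (base model and fine-tuned model) for the same held-out titles; the feature set is an arbitrary starting point, not a validated metric.

```python
import re
from statistics import mean

def style_stats(text: str) -> dict:
    """Crude, surface-level style features for one generated answer."""
    lines = text.splitlines()
    return {
        "words": len(text.split()),
        "headings": sum(1 for ln in lines if re.match(r"#{1,6}\s", ln)),
        "bullets": sum(1 for ln in lines if re.match(r"\s*([-*]|\d+\.)\s", ln)),
        # Fences come in open/close pairs, so halve the count of fence lines.
        "code_fences": sum(1 for ln in lines if ln.lstrip().startswith("`" * 3)) // 2,
    }

def compare(base_answers, tuned_answers):
    """Average each feature over both sets; returns {feature: (base_mean, tuned_mean)}."""
    base = [style_stats(a) for a in base_answers]
    tuned = [style_stats(a) for a in tuned_answers]
    return {k: (mean(s[k] for s in base), mean(s[k] for s in tuned)) for k in base[0]}
```

If the tuned model's averages move toward those of your real notes (run `style_stats` on the held-out answers too), the voice is probably transferring. Reading a handful of pairs side by side is still the real test.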

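# Python (end of train.py): GGUF for llama.cpp / Ollama on the M2 Air
model.save_pretrained_gguf("lzw-notes-gguf", tokenizer, quantization_method="q4_k_m")

# Shell: or serve the merged 16-bit model on the 4070 box.
# A 4B merge is about 8 GB of weights; an 8B merge is about 16 GB and will not fit in 12 GB unquantized.
vllm serve ./lzw-notes-merged --max-model-len 4096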
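Once `vllm serve` is running, it exposes an OpenAI-compatible HTTP API. Here is a minimal stdlib client sketch, assuming the default port 8000 and that the served model name is the path passed on the command line. Adjust both if you used `--port` or `--served-model-name`. The sampling values are placeholders.

```python
import json
import urllib.request

def build_request(question: str, model: str = "./lzw-notes-merged",
                  base_url: str = "http://localhost:8000/v1") -> urllib.request.Request:
    """Build a chat-completions request for an OpenAI-compatible server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "max_tokens": 1024,      # placeholder value
        "temperature": 0.7,      # placeholder value
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(question: str) -> str:
    """Send the question and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(question)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `ask(...)` with one of your held-out titles is a quick way to collect fine-tuned answers for the comparison in this step.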

Recommended path

This weekend, 4070: build dataset → QLoRA Qwen3-4B → see if the “lzwjava voice” emerges. Total cost: $0.
Once the pipeline works, MI300X: same script, LoRA a 32B/70B 16-bit, batch 64. This is also exactly the muscle memory that transfers to your nanochat/DeepSeek ambitions — the SFT stage of nanochat is this same loop, just hand-rolled.
Stretch: take the LoRA’d model + DPO pairs (your edited answers vs raw model answers) — that’s a natural next step into the RLHF/DPO territory you want to learn.
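For the DPO stretch goal, the data side is simple: TRL's DPOTrainer works with records that have `prompt`, `chosen` and `rejected` fields. Here is a sketch of turning (title, your edited answer, raw model answer) triples into that format; the helper names and file path are hypothetical.

```python
import json

def make_dpo_record(question: str, edited: str, raw: str) -> dict:
    """One preference pair: the hand-edited answer is preferred over the raw model output."""
    return {"prompt": question, "chosen": edited, "rejected": raw}

def write_dpo_jsonl(triples, path: str) -> int:
    """Write (question, edited, raw) triples as JSONL, skipping pairs with no real edit."""
    n = 0
    with open(path, "w", encoding="utf-8") as f:
        for question, edited, raw in triples:
            if edited.strip() == raw.strip():
                continue  # no preference signal if nothing was changed
            record = make_dpo_record(question, edited, raw)
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            n += 1
    return n
```

Skipping unchanged pairs matters because a pair whose chosen and rejected texts are identical carries no preference signal.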
