Fine-tuning Qwen 3.5 Using LoRA Guide | Generated by AI




1. Configuration (top of file)

MODEL_NAME = "Qwen/Qwen3.5-9B"        # The base model to finetune
POSTS_DIR = "~/projects/blog-source/_posts/en"  # Your blog posts
MAX_LENGTH = 2048                      # Max tokens per training sample
BATCH_SIZE = 4                         # Samples per GPU step
GRAD_ACCUM = 4                         # Accumulate gradients before update (effective batch = 4×4 = 16)
EPOCHS = 3                             # Train through all data 3 times
LEARNING_RATE = 2e-5                   # How fast the model learns
LORA_R = 16                            # LoRA rank (how expressive the adapter is)
LORA_ALPHA = 32                        # LoRA scaling factor
OUTPUT_DIR = "trained_model_qwen"      # Where adapter + tokenizer are saved

BATCH_SIZE × GRAD_ACCUM = 16 effective batch size. Low per-step batch but accumulates — saves VRAM.
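The accumulation arithmetic, sketched in plain Python (the sample count here is hypothetical):

```python
# Each optimizer update sees BATCH_SIZE * GRAD_ACCUM samples.
BATCH_SIZE = 4
GRAD_ACCUM = 4
effective_batch = BATCH_SIZE * GRAD_ACCUM

# Optimizer updates per epoch for a hypothetical dataset of 320 samples
# (ceiling division via negation):
num_samples = 320
steps_per_epoch = -(-num_samples // effective_batch)
print(effective_batch, steps_per_epoch)
```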


2. create_training_data() — Loading your blog posts

import glob
import os

def create_training_data(posts_dir):
    all_texts = []
    posts_dir = os.path.expanduser(posts_dir)  # Expand "~" in the path
    for file_path in sorted(glob.glob(os.path.join(posts_dir, "*.md"))):
        # Read file
        with open(file_path, encoding="utf-8") as f:
            content = f.read()
        # Split on "---" — Jekyll front matter separator
        # e.g. "---\ntitle: Foo\n---\nActual content here"
        parts = content.split("---", 2)
        if len(parts) >= 3:
            content = parts[2].strip()  # Keep only body, discard metadata
        if len(content) < 50:
            continue  # Skip tiny fragments
        all_texts.append(content)
    return all_texts
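The front-matter split can be checked in isolation, e.g.:

```python
# Minimal check of the Jekyll front-matter split used above.
sample = "---\ntitle: Foo\ndate: 2026-03-01\n---\nActual content here."

# split("---", 2) yields: ["", front matter, body]
parts = sample.split("---", 2)
body = parts[2].strip() if len(parts) >= 3 else sample
print(body)  # "Actual content here."
```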

3. prepare_dataset() — Tokenization

from datasets import Dataset

def prepare_dataset(texts, tokenizer):
    encodings = tokenizer(
        texts,
        truncation=True,       # Cut off at MAX_LENGTH
        padding=True,          # Pad shorter sequences
        max_length=MAX_LENGTH, # 2048 tokens
        return_tensors="pt",   # Return PyTorch tensors
    )
    return Dataset.from_dict({
        "input_ids": encodings["input_ids"],      # Token IDs
        "attention_mask": encodings["attention_mask"], # 1=real, 0=padding
    })

This is pure next-token prediction (causal LM): the labels are the input tokens themselves, shifted by one position inside the model's loss computation.
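A toy sketch of what next-token prediction means in practice (the token IDs are made up):

```python
# In causal-LM training, position i is trained to predict token i+1,
# so inputs[:-1] line up with targets[1:].
input_ids = [101, 7, 42, 9, 102]

pairs = list(zip(input_ids[:-1], input_ids[1:]))
print(pairs)  # [(101, 7), (7, 42), (42, 9), (9, 102)]
```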


4. train_model() — The core

4a. Model loading

import torch
from transformers import AutoModelForCausalLM

torch_dtype = torch.bfloat16  # 16-bit brain float — good for modern GPUs
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,         # Qwen uses custom code
    torch_dtype=torch_dtype,        # Load in bf16 (saves VRAM vs fp32)
    attn_implementation="flash_attention_2",  # Fast attention kernel
)
model.gradient_checkpointing_enable()  # Trade compute for VRAM

4b. LoRA — The key trick for 9B

from peft import LoraConfig, TaskType, get_peft_model

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,           # Rank — how expressive the adapter is
    lora_alpha=32,  # Scaling factor (alpha/r = 2x multiplier)
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention projections
        "gate_proj", "up_proj", "down_proj",       # MLP projections
    ],
)
model = get_peft_model(model, lora_config)

LoRA (Low-Rank Adaptation) freezes the original 9B weights and injects small trainable matrices (rank 16) into every linear layer. Instead of training 9 billion parameters, you train ~0.5-1% of them.
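As a rough sanity check on that percentage, here is the parameter arithmetic for one hypothetical 4096x4096 projection with rank-16 adapters:

```python
# LoRA adds two small matrices per target layer: A (r x d_in) and B (d_out x r).
d_in, d_out, r = 4096, 4096, 16

full_params = d_in * d_out         # frozen weights in the original projection
lora_params = r * (d_in + d_out)   # trainable adapter weights
fraction = lora_params / full_params
print(f"{fraction:.2%}")           # well under 1% per layer
```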

4c. Training arguments

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRAD_ACCUM,
    num_train_epochs=EPOCHS,
    learning_rate=2e-5,           # Conservative LR for stable finetuning
    lr_scheduler_type="cosine",   # Smooth decay from peak → 0
    warmup_ratio=0.05,            # 5% of steps to ramp up LR
    weight_decay=0.01,            # L2 regularization
    optim="adamw_torch_fused",    # Fused AdamW — faster
    save_steps=500,               # Checkpoint every 500 steps
    save_total_limit=3,           # Keep only last 3 checkpoints
    logging_steps=10,             # Print loss every 10 steps
)
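A minimal sketch of the warmup-plus-cosine LR shape (the total step count is hypothetical; Transformers implements this internally, e.g. via `get_cosine_schedule_with_warmup`):

```python
import math

def lr_at(step, total_steps=1000, peak=2e-5, warmup_ratio=0.05):
    # Linear ramp to the peak LR, then cosine decay toward 0.
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak * 0.5 * (1 + math.cos(math.pi * progress))

print(lr_at(25), lr_at(50), lr_at(1000))  # mid-warmup, peak, end of training
```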

4d. Data collator

from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
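With `mlm=False` the collator builds causal-LM labels: a copy of `input_ids` with padding positions set to -100 so the loss ignores them. A pure-Python sketch of that idea (toy token IDs):

```python
# What the causal-LM collator conceptually produces for one sequence.
input_ids      = [101, 7, 42, 0, 0]   # 0 = pad token
attention_mask = [1,   1, 1,  0, 0]   # 1 = real token, 0 = padding

labels = [tok if m == 1 else -100
          for tok, m in zip(input_ids, attention_mask)]
print(labels)  # padding masked out with -100
```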

4e. Training loop

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=data_collator,
)
trainer.train()
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)

HuggingFace Trainer handles the batching, gradient accumulation, LR scheduling, checkpointing, and loss logging.

After training, the LoRA adapter weights and tokenizer are saved to trained_model_qwen/.


5. To load the trained model later

from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-9B", trust_remote_code=True)
model = PeftModel.from_pretrained(base, "trained_model_qwen")

Or merge into a single model:

model = model.merge_and_unload()  # Fuses LoRA weights back into base
model.save_pretrained("merged_model")

VRAM estimate

| Component | VRAM |
| --- | --- |
| 9B model in bf16 | ~18 GB |
| LoRA adapters | ~0.5 GB |
| Gradients + optimizer states | ~2 GB |
| Activations (with checkpointing) | ~2-4 GB |
| **Total** | **~22-25 GB** |
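The first row is just parameter-count arithmetic:

```python
# Back-of-envelope math for the base-model row of the table.
params = 9e9        # 9B parameters
bf16_bytes = 2      # bfloat16 stores 2 bytes per parameter

model_gb = params * bf16_bytes / 1e9
print(model_gb)     # GB needed just for the frozen base weights
```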

Needs a 24 GB+ GPU (e.g. A100, RTX 4090, or RTX 3090).

