Training Qwen2.5 on 12 GB | Generated by AI

Home 2026.07

Question: Beyond inference, can I actually train (or fine-tune) Qwen2.5-1.5B-Instruct on an RTX 4070 (12 GB)? SGLang is inference-only, so what are the real training options on this card?

Answer:

Yes — but let’s be precise about what “train” means on 12 GB. SGLang is an inference/serving engine (RadixAttention, continuous batching); it does no gradient computation. For training you need PyTorch + an optimizer, and the memory math is brutal.

The 12 GB memory wall

A 1.5B model in bf16 weights alone is ~3 GB. The killer is the optimizer state. Full fine-tuning with AdamW needs, per parameter:

That’s ~16 bytes/param → 1.5B × 16 ≈ 24 GB just for static state, before activations. Full fine-tuning is dead on a 4070. So the real options are LoRA/QLoRA.

QLoRA memory math (this is what fits)

QLoRA quantizes the frozen base to 4-bit (NF4) and only trains small low-rank adapters:

Total lands around 4–7 GB depending on seq length and batch size. Very comfortable on 12 GB. Plain LoRA (base in bf16, ~3 GB) also fits with headroom.

Minimal QLoRA training script

This is the build-first path — trl + peft + bitsandbytes:

# pip install -U trl peft bitsandbytes transformers datasets accelerate
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

model_id = "Qwen/Qwen2.5-1.5B-Instruct"

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,   # saves another ~0.4 GB
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map={"": 0},
    attn_implementation="flash_attention_2",  # or "sdpa" if FA2 unbuilt
)
tok = AutoTokenizer.from_pretrained(model_id)

peft_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj","k_proj","v_proj","o_proj",
                    "gate_proj","up_proj","down_proj"],  # attn + MLP
)

ds = load_dataset("trl-lib/Capybara", split="train[:2000]")

cfg = SFTConfig(
    output_dir="qwen-qlora",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,        # effective batch 16
    gradient_checkpointing=True,          # biggest activation saver
    bf16=True,
    max_length=1024,
    learning_rate=2e-4,
    num_train_epochs=1,
    logging_steps=10,
    optim="paged_adamw_8bit",             # paged = survives VRAM spikes
)

trainer = SFTTrainer(model=model, args=cfg, train_dataset=ds, peft_config=peft_cfg)
trainer.train()
trainer.save_model("qwen-qlora/adapter")

The levers that actually control whether it OOMs, in order of impact: gradient_checkpointing=True (trades ~30% compute for large activation savings), max_length (activations scale linearly — drop to 512 if tight), per_device_train_batch_size, and paged_adamw_8bit (the paged optimizer offloads to CPU RAM on spikes instead of crashing).

Knowledge → memory map

Method Base dtype Fits 12 GB? Trains
Full FT bf16 ❌ ~24 GB all params
LoRA bf16 (~3 GB) ✅ ~6–9 GB adapters
QLoRA 4-bit (~0.75 GB) ✅ ~4–7 GB adapters

Where your two GPUs split the work

Prototype the QLoRA loop on the 4070 — get the data pipeline, chat template, and adapter config correct at max_length=512, batch 1–2. When you want full fine-tuning, longer context, or a real sweep, push it to the MI300X (192 GB), where full bf16 FT of a 1.5B sits in well under 30 GB and you can crank batch size and sequence length. The 4070 is your correctness-checking environment; the MI300X is your scale environment.

After training, merge and serve back through SGLang:

from peft import PeftModel
from transformers import AutoModelForCausalLM
base = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16")
merged = PeftModel.from_pretrained(base, "qwen-qlora/adapter").merge_and_unload()
merged.save_pretrained("qwen-merged")
# then: python -m sglang.launch_server --model-path qwen-merged

One caveat on flash-attention-2: prebuilt wheels for the 4070 (Ada, sm_89) can be finicky — if flash_attention_2 errors at load, switch to attn_implementation="sdpa", which uses PyTorch’s built-in fused attention and is nearly as fast at this scale.

References:


Back Donate