FinGPT on RTX 4070: Issues & Fixes | Generated by AI
Here’s the full picture:
GPU Status
RTX 4070 | 12GB VRAM (12,282 MiB) | 230MiB used | CUDA 13.0
Plenty of headroom, only Xorg using it.
The Code — What It Does
FinGPT is a LoRA fine-tuning framework for financial LLMs. It wraps HuggingFace Trainer + PEFT + DeepSpeed. Key sub-projects:
| Sub-project | What it does |
|---|---|
| FinGPT_Benchmark | Multi-task training (sentiment, NER, headline IE, relation extraction, ConvFinQA) on 7B models |
| FinGPT_Forecaster | Stock movement prediction from news — Llama-2-7B, max_length 4096 |
| FinGPT_Sentiment_Analysis | Financial sentiment classifier |
| FinGPT_RAG / MultiAgentsRAG | RAG pipelines for financial QA |
| FinGPT_FinancialReportAnalysis | PDF report analysis |
Model sizes supported:
| Model | Size (fp16) | 8-bit | 4-bit |
|---|---|---|---|
| ChatGLM2-6B / Falcon-7B / Qwen-7B / Llama-2-7B | ~13-14 GB | ~6.5 GB | ~3.5 GB |
| Llama-2-13B | ~26 GB | ~13 GB | ~7 GB |
Training uses LoRA (not full fine-tune) — adapter weights are tiny (~few MB).
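The sizes in the table follow from simple arithmetic: parameter count times bits per parameter. The sketch below assumes nominal parameter counts (7e9 and 13e9) and 1 GB = 1e9 bytes, so its figures differ slightly from the rounded table values. It covers weights only, not activations, the LoRA optimizer state, or quantization metadata.

```python
# Rough weight-only memory estimate: parameters x bits per parameter.
# Parameter counts are nominal ("7B" = 7e9); real checkpoints differ slightly.

def weight_gb(n_params: float, bits_per_param: int) -> float:
    """Return the approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for name, n_params in [("7B", 7e9), ("13B", 13e9)]:
    sizes = {bits: weight_gb(n_params, bits) for bits in (16, 8, 4)}
    print(name, {f"{bits}-bit": f"{gb:.1f} GB" for bits, gb in sizes.items()})
```

A nominal 7B model in 16-bit comes out at about 14 GB, which is already more than the card's 12 GB before any training overhead is counted.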
Issues I See
1. load_in_8bit=True is commented out (line 50 in both train_lora.py files)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    # load_in_8bit=True,   <-- commented out
    # device_map="auto",
    trust_remote_code=True
)
This means the model loads in full fp16 — 13GB for a 7B model won’t fit in 12GB VRAM. You WILL get CUDA OOM unless you uncomment this or use 4-bit.
2. Designed for 4 GPUs, not 1
- Forecaster train.sh does --include localhost:2,3, hardcoded for GPUs 2-3 on a multi-GPU box
- DeepSpeed config uses ZeRO stage 0 (no sharding), so it saves no memory: every GPU keeps a full copy of the model, gradients, and optimizer state
- Original paper says training costs <$300 per run, but they assume 4× A100 or similar
3. Needs model files locally
- from_remote=False by default → looks for models at local base_models/Llama-2-7b-hf etc.
- Would need to either download models first or set --from_remote (and have an HF token for gated models like Llama-2)
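The local-versus-remote decision above can be sketched as a small helper. This is an illustration, not FinGPT's actual code: the function name, the two lookup dictionaries, and the short name "llama2" are hypothetical. The Hub ID meta-llama/Llama-2-7b-hf is the gated repository that requires an access token.

```python
from pathlib import Path

# Hypothetical lookup tables for illustration; FinGPT's own mapping may differ.
HUB_IDS = {"llama2": "meta-llama/Llama-2-7b-hf"}
LOCAL_DIRS = {"llama2": "Llama-2-7b-hf"}  # folder name under base_models/


def resolve_model_source(name: str, from_remote: bool, base_dir: str = "base_models") -> str:
    """Return a Hub repo ID or a local directory to pass to from_pretrained."""
    if from_remote:
        return HUB_IDS[name]
    local = Path(base_dir) / LOCAL_DIRS[name]
    if not local.is_dir():
        # Fail early with a clear message instead of a confusing loader error.
        raise FileNotFoundError(
            f"{local} not found: download the model first or pass --from_remote"
        )
    return str(local)


print(resolve_model_source("llama2", from_remote=True))
```

Checking for the directory up front turns a missing download into an immediate, readable error rather than a failure deep inside model loading.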
4. Dated dependencies
- Uses prepare_model_for_int8_training from PEFT, deprecated in newer versions and replaced by prepare_model_for_kbit_training
- No flash attention, no gradient checkpointing, no QLoRA (4-bit NF4)
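One way to cope with the PEFT rename is to look the helper up by name and fall back to the old one. The sketch below is an assumption about how that could be done, not something FinGPT ships: pick_prepare_fn is a hypothetical helper, and the two SimpleNamespace objects are stand-ins so the example runs without PEFT installed.

```python
from types import SimpleNamespace


def pick_prepare_fn(peft_module):
    """Prefer prepare_model_for_kbit_training, fall back to the old int8 helper."""
    for fn_name in ("prepare_model_for_kbit_training", "prepare_model_for_int8_training"):
        fn = getattr(peft_module, fn_name, None)
        if fn is not None:
            return fn
    raise ImportError("no PEFT quantized-training helper found")


# Stand-in modules (not real PEFT) that mimic an old and a new release.
old_peft = SimpleNamespace(prepare_model_for_int8_training=lambda model: ("int8", model))
new_peft = SimpleNamespace(prepare_model_for_kbit_training=lambda model: ("kbit", model))

print(pick_prepare_fn(old_peft)("model"))
print(pick_prepare_fn(new_peft)("model"))
```

In real code this would be called as pick_prepare_fn(peft) after import peft, and the returned function applied to the quantized model before attaching the LoRA adapters.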
What You Can Actually Do on RTX 4070 12GB
✅ EASILY fits — comfortable
| Task | Setup |
|---|---|
| Sentiment / NER / Headline fine-tune on 7B | 4-bit QLoRA, batch_size 4, max_length 512 → ~5-6 GB |
| Forecaster on 7B | 4-bit, batch_size 1, max_length 4096 → ~6-7 GB |
| Inference on any 7B model | 8-bit (~6.5 GB) or 4-bit (~3.5 GB) fits easily; fp16 (~13-14 GB) does not fit without CPU offload |
| Inference on 13B | 4-bit quantization → ~7-8 GB |
⚠️ Fits with work — needs tuning
| Task | Setup |
|---|---|
| Forecaster on 7B with 8-bit | Batch_size 1, grad_accum 16 (like the original), max_length 2048 |
| 13B fine-tune with LoRA | 4-bit QLoRA only, tiny batch_size 1, max_length 512 |
❌ Won’t work
| Task | Why |
|---|---|
| Full fp16 fine-tune of any 7B+ model | Weights alone are ~13 GB > 12 GB, before gradients and optimizer state |
| 13B model in 8-bit | ~13GB barely over 12GB, no headroom for activations |
| Anything requiring fp32 (without mixed precision) | 26GB+ per 7B model |
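The three tables boil down to comparing an estimated footprint against the free VRAM reported in the GPU status. A minimal sketch of that check follows. It assumes a 1 GB safety margin (an arbitrary choice), treats the table's GB figures as GiB, and takes the upper end of each estimated range.

```python
TOTAL_MIB = 12_282  # total VRAM from the GPU status
USED_MIB = 230      # already taken by Xorg


def free_gib(total_mib: int = TOTAL_MIB, used_mib: int = USED_MIB) -> float:
    """Free VRAM in GiB."""
    return (total_mib - used_mib) / 1024


def fits(estimate_gb: float, margin_gb: float = 1.0) -> bool:
    """True if the estimate plus a safety margin fits in free VRAM."""
    return estimate_gb + margin_gb <= free_gib()


# Estimates copied from the tables above (upper end of each range).
scenarios = [
    ("7B sentiment fine-tune, 4-bit QLoRA", 6.0),
    ("7B Forecaster, 4-bit", 7.0),
    ("13B inference, 4-bit", 8.0),
    ("7B fp16 weights", 13.0),
    ("13B 8-bit weights", 13.0),
]
for label, estimate in scenarios:
    print(f"{label}: {'fits' if fits(estimate) else 'does not fit'}")
```

The margin is there because fragmentation and CUDA context overhead mean the last gigabyte is rarely usable in practice.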
Quick Fix to Get Running
Uncomment load_in_8bit=True and device_map="auto" in train_lora.py. Or better, upgrade to 4-bit QLoRA:
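import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization; matmuls run in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,  # same variable as in the original train_lora.py call
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)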
Then drop DeepSpeed (overkill for 1 GPU) and use HuggingFace Trainer directly. For the Forecaster, also enable gradient checkpointing and set batch_size=1, gradient_accumulation_steps=16.
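The Forecaster settings just described can be collected as keyword arguments for HuggingFace TrainingArguments. This is a sketch under assumptions: the helper name, output_dir, and logging_steps values are placeholders, and bf16=True is chosen to match the bfloat16 compute dtype used for 4-bit loading. The keyword names themselves are existing TrainingArguments parameters.

```python
def single_gpu_training_kwargs(output_dir: str = "out") -> dict:
    """Keyword arguments for a single-GPU run without DeepSpeed."""
    return {
        "output_dir": output_dir,          # placeholder path
        "per_device_train_batch_size": 1,  # one 4096-token sample at a time
        "gradient_accumulation_steps": 16, # accumulate before each optimizer step
        "gradient_checkpointing": True,    # trade compute for activation memory
        "bf16": True,                      # matches the bfloat16 compute dtype
        "logging_steps": 10,               # placeholder value
    }


kwargs = single_gpu_training_kwargs()
effective_batch = kwargs["per_device_train_batch_size"] * kwargs["gradient_accumulation_steps"]
print("effective batch size:", effective_batch)

# With transformers installed:
#   from transformers import TrainingArguments
#   args = TrainingArguments(**kwargs)
```

Leaving out the deepspeed argument entirely is what drops DeepSpeed; the effective batch size stays at 16 because accumulation replaces the larger per-step batch.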
Want me to make these changes and set up a test run?