While Llama-3 is a powerful base model trained on diverse data, it’s not optimized for specific domains or tasks. Fine-tuning allows you to:
- Adapt to Specific Domains: Customize the model for medical, legal, financial, or technical content
- Improve Task Performance: Enhance accuracy on specific tasks like summarization, code generation, or question-answering
- Reduce Hallucinations: Ground responses in proprietary knowledge bases and documents
- Control Behavior: Adjust tone, style, and response patterns to match your requirements
- Lower Costs: Deploy smaller fine-tuned models instead of larger commercial APIs
- Privacy Compliance: Keep sensitive data on-premises without sending it to external services
Parameter-Efficient Fine-Tuning (PEFT) Methods
PEFT techniques reduce the computational cost of fine-tuning by training only a small subset of parameters. Let’s compare full fine-tuning against the two main PEFT approaches:
1. Full Fine-Tuning
Updates all model parameters during training.
- ✅ Best quality results
- ❌ Requires 160GB+ VRAM for Llama-3 70B
- ❌ Extremely expensive and slow
- ❌ Risk of catastrophic forgetting
2. Low-Rank Adaptation (LoRA)
Freezes model weights and trains small adapter matrices (rank decomposition).
- ✅ 25-40x memory reduction
- ✅ Fast training (2-4x faster than full fine-tuning)
- ✅ ~0.5% trainable parameters
- ⚠️ Still requires 16-24GB VRAM for Llama-3 8B
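To make the rank decomposition concrete, here is a minimal illustrative sketch in plain PyTorch (not the peft implementation): a frozen linear layer plus two small trainable matrices whose product forms the low-rank update, scaled by alpha / r.

import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer + trainable low-rank update (illustration only)."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                # freeze original weights
        self.lora_A = nn.Linear(base.in_features, r, bias=False)   # d_in -> r
        self.lora_B = nn.Linear(r, base.out_features, bias=False)  # r    -> d_out
        nn.init.zeros_(self.lora_B.weight)                         # adapters start as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        # Full-rank frozen path + low-rank trainable correction
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))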
3. Quantized LoRA (QLoRA)
Combines 4-bit quantization with LoRA for maximum efficiency.
- ✅ 33% additional memory reduction vs. LoRA (15-16GB vs. 24GB)
- ✅ Fits 8B models on consumer GPUs
- ✅ ~0.5% trainable parameters
- ⚠️ Training is 39% slower due to dequantization overhead
Performance & Resource Comparison
QLoRA achieves 90% of full fine-tuning quality while using only 15% of the memory
QLoRA: Understanding the Magic Behind Single-GPU Fine-Tuning
How QLoRA Works
QLoRA combines three key techniques to fit large models on small GPUs:
NF4 Quantization
Uses NormalFloat 4-bit (NF4) precision instead of standard 32-bit floats, shrinking the stored model weights to roughly 1/8th of their original size while preserving accuracy.
Double Quantization
Quantizes the quantization constants themselves, saving an additional ~0.4 bits per parameter on average.
Paged Optimizers
Pages optimizer state from GPU to CPU RAM during memory spikes, preventing out-of-memory errors during training.
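In code, all three techniques map onto a handful of settings in the bitsandbytes/transformers stack; a minimal sketch (the same settings appear in the full script later in this guide):

import torch
from transformers import BitsAndBytesConfig, TrainingArguments

# NF4 quantization with double quantization of the quantization constants
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store frozen weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat-4 data type
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to BF16 for matmuls
)

# Paged optimizer: optimizer state can spill to CPU RAM to avoid OOM spikes
training_args = TrainingArguments(output_dir="./out", optim="paged_adamw_8bit")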
The QLoRA Computation Flow
- Frozen Weights (NF4): Model weights are stored in 4-bit NF4 format on VRAM
- Dequantization: When needed, weights are dequantized to BF16 (brain float 16)
- LoRA Computation: LoRA adapter matrices (trained) are combined with dequantized weights in BF16
- Backprop: Gradients are computed and only LoRA adapters are updated
- Memory Release: The temporary BF16 copies are discarded after use; the weights themselves stay stored in NF4, keeping VRAM usage low (the full cycle is sketched below)
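Put together, one adapted layer behaves roughly like the sketch below. This is a conceptual illustration of what bitsandbytes' 4-bit linear layer does internally; the real implementation fuses these steps in custom kernels.

import torch
import bitsandbytes.functional as bnbF

def qlora_linear_forward(x, W_nf4, quant_state, lora_A, lora_B, alpha, r):
    # 1. Dequantize the frozen NF4 weights to BF16 just for this matmul
    W_bf16 = bnbF.dequantize_4bit(W_nf4, quant_state).to(torch.bfloat16)
    # 2. Frozen full-rank path (no gradients flow into W)
    base_out = x @ W_bf16.T
    # 3. Trainable low-rank LoRA path, scaled by alpha / r
    lora_out = (x @ lora_A.T) @ lora_B.T * (alpha / r)
    # 4. The BF16 copy of W is discarded after the call; only lora_A / lora_B get gradients
    return base_out + lora_out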
QLoRA vs LoRA: When to Use Which?
| Factor | Use LoRA | Use QLoRA |
|---|---|---|
| Available VRAM | 16GB+ | <16GB |
| Model Size | 7B-13B | 30B-70B (or 8B on limited hardware) |
| Training Speed | Fast (baseline) | Slower (39% overhead) |
| Quality | Slightly better (~95% of full fine-tuning) | Nearly identical (~92%) |
| Use Case | Production systems with resources | Research, prototyping, limited budgets |
Prerequisites & Environment Setup
Hardware Requirements
Software Installation
Step 1: Create Python Environment
$ python -m venv llama-finetune
$ source llama-finetune/bin/activate # Linux/Mac
$ llama-finetune\Scripts\activate # Windows
Step 2: Install Core Dependencies
$ pip install transformers datasets peft bitsandbytes accelerate
$ pip install wandb trl # For logging and training utilities
$ pip install flash-attn --no-build-isolation # Optional: only needed for attn_implementation="flash_attention_2"
Step 3: Verify Installation
python -c "
import torch
print(f'CUDA available: {torch.cuda.is_available()}')
print(f'GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else \"CPU only\"}')
print(f'CUDA Version: {torch.version.cuda}')
"
Complete QLoRA Implementation
Full Production-Ready Code
#!/usr/bin/env python3
"""
Complete QLoRA Fine-Tuning Implementation for Llama-3
Optimized for single GPU with 12GB+ VRAM
"""
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    TextStreamer
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
import wandb
import os
# ===== Configuration =====
MODEL_NAME = "meta-llama/Meta-Llama-3-8B"  # Or use "meta-llama/Meta-Llama-3-8B-Instruct"
OUTPUT_DIR = "./llama-3-finetuned"
DATASET_NAME = "mlabonne/FineTome-100k" # Replace with your dataset
MAX_SEQ_LENGTH = 2048
LEARNING_RATE = 3e-4
BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = 4
NUM_EPOCHS = 1
WARMUP_STEPS = 100
# ===== Quantization Config (QLoRA) =====
def create_bnb_config():
    """Configure 4-bit quantization using bitsandbytes"""
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    return bnb_config
# ===== LoRA Config =====
def create_peft_config():
    """Configure LoRA adapters"""
    peft_config = LoraConfig(
        r=16,                  # LoRA rank
        lora_alpha=32,         # LoRA scaling factor
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=[
            "q_proj",     # Query projection
            "v_proj",     # Value projection
            "k_proj",     # Key projection
            "o_proj",     # Output projection
            "gate_proj",  # Gate projection (feed-forward)
            "up_proj",    # Up projection (feed-forward)
            "down_proj"   # Down projection (feed-forward)
        ],
        modules_to_save=None,  # Freeze all other modules
    )
    return peft_config
# ===== Model Loading =====
def load_model_and_tokenizer():
    """Load Llama-3 model with QLoRA configuration"""
    print("Loading model with QLoRA quantization...")
    bnb_config = create_bnb_config()
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True,
        attn_implementation="flash_attention_2"  # Requires the flash-attn package; remove if unavailable
    )
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"  # For efficient padding
    print(f"Model loaded: {MODEL_NAME}")
    print(f"Model dtype: {model.dtype}")
    return model, tokenizer
# ===== Dataset Preparation =====
def prepare_dataset(tokenizer):
    """Load and prepare training dataset"""
    print(f"Loading dataset: {DATASET_NAME}")
    dataset = load_dataset(DATASET_NAME, split="train")

    # For large datasets, use a subset
    if len(dataset) > 50000:
        dataset = dataset.select(range(50000))

    # Split into train/validation
    dataset = dataset.train_test_split(test_size=0.1)

    def format_instruction(example):
        """Format conversations for instruction tuning"""
        # Assuming dataset has 'text' or 'conversations' field
        if 'text' in example:
            return {"text": example["text"]}
        elif 'conversations' in example:
            # Convert ShareGPT format to single text
            messages = example["conversations"]
            text = ""
            for msg in messages:
                role = msg.get("from", "").replace("human", "user")
                content = msg.get("value", "")
                text += f"{role}: {content}\n"
            return {"text": text}
        return example

    dataset = dataset.map(
        format_instruction,
        batched=False,
        remove_columns=dataset["train"].column_names  # dataset is a DatasetDict after the split
    )
    return dataset
# ===== Training Setup =====
def setup_training(model, tokenizer, dataset):
    """Configure training parameters"""
    # Prepare model for k-bit training
    model = prepare_model_for_kbit_training(model)

    # Apply LoRA config
    peft_config = create_peft_config()
    model = get_peft_model(model, peft_config)

    # Print trainable parameters
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total_params = sum(p.numel() for p in model.parameters())
    print(f"Trainable params: {trainable_params:,}")
    print(f"Total params: {total_params:,}")
    print(f"Trainable %: {100 * trainable_params / total_params:.2f}%")

    training_args = TrainingArguments(
        output_dir=OUTPUT_DIR,
        per_device_train_batch_size=BATCH_SIZE,
        per_device_eval_batch_size=BATCH_SIZE,
        gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
        learning_rate=LEARNING_RATE,
        num_train_epochs=NUM_EPOCHS,
        logging_steps=10,
        evaluation_strategy="steps",  # Evaluate on the validation split every eval_steps
        eval_steps=100,
        save_steps=100,
        save_total_limit=3,
        warmup_steps=WARMUP_STEPS,
        logging_dir="./logs",
        optim="paged_adamw_8bit",     # 8-bit paged optimizer for memory efficiency
        bf16=True,                    # Use bfloat16 for faster training
        tf32=True,                    # Use TF32 on Ampere+ GPUs (A100/H100/RTX 30xx and newer)
        lr_scheduler_type="cosine",
        weight_decay=0.01,
        push_to_hub=False,            # Set to True if uploading to Hugging Face
        report_to=["wandb"],          # Enable W&B logging
        dataloader_pin_memory=True,
        dataloader_num_workers=4,
        ddp_find_unused_parameters=False,
    )

    # Note: newer trl releases move dataset_text_field/packing/max_seq_length into SFTConfig
    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset["train"],
        eval_dataset=dataset["test"],
        dataset_text_field="text",
        args=training_args,
        packing=True,                 # Pack multiple examples into a single sequence
        max_seq_length=MAX_SEQ_LENGTH,
    )
    return trainer
# ===== Main Execution =====
def main():
    # Initialize W&B logging
    wandb.init(project="llama3-finetuning", name="qlora-experiment")

    # Load model and tokenizer
    model, tokenizer = load_model_and_tokenizer()

    # Prepare dataset
    dataset = prepare_dataset(tokenizer)

    # Setup training
    trainer = setup_training(model, tokenizer, dataset)

    # Train
    print("Starting training...")
    trainer.train()

    # Save LoRA adapters and tokenizer
    print("Saving model...")
    trainer.model.save_pretrained(OUTPUT_DIR)
    tokenizer.save_pretrained(OUTPUT_DIR)

    # Merge LoRA adapters with base model (optional)
    print("Merging LoRA adapters...")
    from peft import AutoPeftModelForCausalLM
    merged_model = AutoPeftModelForCausalLM.from_pretrained(
        OUTPUT_DIR,
        torch_dtype=torch.float16,
        device_map="auto"
    )
    merged_model = merged_model.merge_and_unload()
    merged_model.save_pretrained(f"{OUTPUT_DIR}-merged")
    print("✅ Fine-tuning complete!")


if __name__ == "__main__":
    main()
Hyperparameter Tuning for QLoRA
LoRA Configuration
Key LoRA Parameters:
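The parameters that matter most live in peft's LoraConfig. The values below mirror the script above; the comments summarize what each knob trades off (targeting only the attention projections, as shown here, is a common lighter-weight alternative to adapting all seven modules):

from peft import LoraConfig

peft_config = LoraConfig(
    r=16,               # adapter rank: higher = more capacity and VRAM, slower training
    lora_alpha=32,      # scaling factor; the update is scaled by alpha / r (2x the rank is a common default)
    lora_dropout=0.05,  # dropout on the adapter path, regularizes small datasets
    bias="none",        # keep bias terms frozen
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention-only variant
)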
Training Hyperparameters
Recommended Configurations
| Scenario | Rank | Alpha | Batch | LR | Notes |
|---|---|---|---|---|---|
| Small Dataset (<5k) | 8 | 16 | 2 | 1e-4 | Lower rank to prevent overfitting |
| Medium Dataset (5k-50k) | 16 | 32 | 4 | 3e-4 | Balanced quality/efficiency |
| Large Dataset (>50k) | 32 | 64 | 8 | 5e-4 | Can use higher rank for better quality |
| Limited VRAM (8GB) | 8 | 16 | 1-2 | 1e-4 | Minimal rank, smaller batch |
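As a worked example, the "Medium Dataset" row of the table maps directly onto the configs used in the script above:

from peft import LoraConfig
from transformers import TrainingArguments

# Medium dataset (5k-50k examples): balanced quality/efficiency
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
training_args = TrainingArguments(
    output_dir="./llama-3-finetuned",
    per_device_train_batch_size=4,
    learning_rate=3e-4,
)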
Training & Monitoring
Memory Optimization Techniques
- Gradient Checkpointing: Trade compute for memory by recomputing activations during backward pass
- Packing: Combine multiple examples into single sequence to maximize GPU utilization
- Flash Attention: Use optimized attention implementation for 2x speedup
- 8-bit Optimizer: Use paged_adamw_8bit instead of standard AdamW
- BF16 Training: Use bfloat16 for memory savings and faster computation on A100/H100 (most of these switches appear together in the sketch after this list)
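Most of these switches are single arguments on TrainingArguments and SFTTrainer; a combined sketch (argument names match the main script, and as noted there, newer trl versions expect the SFTTrainer options via SFTConfig):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./llama-3-finetuned",
    gradient_checkpointing=True,   # recompute activations during the backward pass
    optim="paged_adamw_8bit",      # 8-bit paged optimizer state
    bf16=True,                     # bfloat16 mixed precision
)
# Packing is enabled via SFTTrainer(..., packing=True); FlashAttention is enabled when the
# model is loaded with attn_implementation="flash_attention_2" (requires the flash-attn package).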
Monitoring Training
import wandb
import matplotlib.pyplot as plt
# Metrics to track (placeholders - populate these from the trainer's log history
# or your W&B run before plotting)
wandb_metrics = {
    "loss": [],
    "learning_rate": [],
    "epoch": []
}
# After training, plot results
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(wandb_metrics["epoch"], wandb_metrics["loss"])
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Training Loss")
plt.subplot(1, 2, 2)
plt.plot(wandb_metrics["epoch"], wandb_metrics["learning_rate"])
plt.xlabel("Epoch")
plt.ylabel("Learning Rate")
plt.title("Learning Rate Schedule")
plt.tight_layout()
plt.savefig("training_metrics.png")
plt.show()
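If you would rather not collect metrics by hand, the same curves can be read back from the trainer's log history once training finishes (a sketch; it assumes the trainer object from the main script is still in scope):

import matplotlib.pyplot as plt

# Each entry in log_history is a dict written every `logging_steps`
history = trainer.state.log_history
steps = [h["step"] for h in history if "loss" in h]
losses = [h["loss"] for h in history if "loss" in h]

plt.plot(steps, losses)
plt.xlabel("Step")
plt.ylabel("Loss")
plt.title("Training Loss")
plt.savefig("loss_curve.png")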
Inference After Fine-Tuning
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer
# Load fine-tuned model
model = AutoPeftModelForCausalLM.from_pretrained(
"./llama-3-finetuned",
torch_dtype=torch.float16,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("./llama-3-finetuned")
# Inference
prompt = "Explain quantum computing in simple terms:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(
**inputs,
max_new_tokens=512,
temperature=0.7,
top_p=0.9,
do_sample=True
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
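The main script imports TextStreamer but never uses it; for interactive testing you can stream tokens to stdout as they are generated:

from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    streamer=streamer,   # prints each token as soon as it is produced
)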
Advanced Optimization Techniques
Alternative: Using Unsloth for 2x Faster Training
Unsloth provides custom kernels for even faster QLoRA training. Install it with pip install unsloth, then load the model through its FastLanguageModel wrapper:
from unsloth import FastLanguageModel

# Load with Unsloth (2x faster, 60% less memory)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
    dtype=None,
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    use_rslora=True,                      # Rank-stabilized LoRA
    use_gradient_checkpointing="unsloth"  # Unsloth's memory-efficient checkpointing
)
Model Merging & Quantization
from peft import AutoPeftModelForCausalLM

# Merge LoRA adapters with base model
model = AutoPeftModelForCausalLM.from_pretrained("./llama-3-finetuned")
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./llama-3-merged")

# Optional: Quantize to GGUF for llama.cpp inference
# (requires converting the merged model to GGUF first - see the commands below)
from llama_cpp import Llama

llm = Llama(model_path="./llama-3-merged.gguf", n_gpu_layers=35)
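The GGUF conversion itself is typically done with llama.cpp's converter script. A rough sketch of the commands, assuming you have cloned and built llama.cpp (script and binary names vary between llama.cpp versions, so check your checkout):

$ git clone https://github.com/ggerganov/llama.cpp
$ python llama.cpp/convert_hf_to_gguf.py ./llama-3-merged --outfile llama-3-merged.gguf
$ ./llama.cpp/llama-quantize llama-3-merged.gguf llama-3-merged-q4_k_m.gguf Q4_K_M # optional 4-bit GGUF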
Multi-GPU Training with FSDP
# For distributed training across multiple GPUs
# Set in TrainingArguments:
training_args = TrainingArguments(
    ...
    ddp_backend="nccl",
    ddp_find_unused_parameters=False,
    # Enable FSDP (the "full_shard auto_wrap" string controls sharding/wrapping;
    # accepted fsdp_config keys vary between transformers versions)
    fsdp="full_shard auto_wrap",
    fsdp_config={
        "sharding_strategy": "FULL_SHARD",
        "backward_prefetch": "BACKWARD_PRE",
        "forward_prefetch": True,
    },
)
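FSDP also needs a distributed launcher; assuming the training script is saved as finetune.py (a name used here purely for illustration), it is typically started with torchrun or accelerate:

$ torchrun --nproc_per_node=4 finetune.py
# or, equivalently, through Accelerate
$ accelerate launch --num_processes 4 finetune.py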
The Future of Model Customization
QLoRA represents a paradigm shift in making large language models accessible to everyone. By combining 4-bit quantization with LoRA, you can now fine-tune Llama-3 on consumer-grade hardware, something that would have been impossible just two years ago.
- QLoRA fits Llama-3 8B on 12GB GPUs (previously required 80GB+)
- Training quality remains 90%+ compared to full fine-tuning
- Merging adapters gives production-ready models at negligible cost
- Unsloth provides 2x speedup with minimal code changes
- Monitor VRAM usage and adjust batch size/rank accordingly
Next Steps
- Set up your environment with the provided installation steps
- Prepare your training dataset in instruction-answer format
- Start with the recommended hyperparameters for your dataset size
- Monitor training with W&B or TensorBoard
- Evaluate on validation set and adjust hyperparameters if needed
- Merge adapters and deploy your fine-tuned model
- Consider using GGUF quantization for CPU/edge device inference