While Llama-3 is a powerful base model trained on diverse data, it’s not optimized for specific domains or tasks. Fine-tuning allows you to:
- Adapt to Specific Domains: Customize the model for medical, legal, financial, or technical content
- Improve Task Performance: Enhance accuracy on specific tasks like summarization, code generation, or question-answering
- Reduce Hallucinations: Ground responses in proprietary knowledge bases and documents
- Control Behavior: Adjust tone, style, and response patterns to match your requirements
- Lower Costs: Deploy smaller fine-tuned models instead of larger commercial APIs
- Privacy Compliance: Keep sensitive data on-premises without sending it to external services
Parameter-Efficient Fine-Tuning (PEFT) Methods
PEFT techniques reduce the computational cost of fine-tuning by training only a small subset of parameters. Let’s compare full fine-tuning against the two main PEFT approaches:
1. Full Fine-Tuning
Updates all model parameters during training.
- ✅ Best quality results
- ❌ Requires 160GB+ VRAM for Llama-3 70B
- ❌ Extremely expensive and slow
- ❌ Risk of catastrophic forgetting
2. Low-Rank Adaptation (LoRA)
Freezes model weights and trains small adapter matrices (rank decomposition).
- ✅ 25-40x memory reduction
- ✅ Fast training (2-4x faster than full fine-tuning)
- ✅ ~0.5% trainable parameters
- ⚠️ Still requires 16-24GB VRAM for Llama-3 8B
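To make the rank decomposition concrete, here is a minimal illustrative sketch in plain PyTorch (not the peft implementation): a frozen linear layer plus two small trainable matrices whose product forms the low-rank update, scaled by alpha / r.

import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer + trainable low-rank update (illustration only)."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                # freeze original weights
        self.lora_A = nn.Linear(base.in_features, r, bias=False)   # d_in -> r
        self.lora_B = nn.Linear(r, base.out_features, bias=False)  # r    -> d_out
        nn.init.zeros_(self.lora_B.weight)                         # adapters start as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        # Full-rank frozen path + low-rank trainable correction
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))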
3. Quantized LoRA (QLoRA)
Combines 4-bit quantization with LoRA for maximum efficiency.
- ✅ 33% additional memory reduction vs. LoRA (15-16GB vs. 24GB)
- ✅ Fits 8B models on consumer GPUs
- ✅ ~0.5% trainable parameters
- ⚠️ Training is 39% slower due to dequantization overhead
Performance & Resource Comparison
QLoRA achieves 90% of full fine-tuning quality while using only 15% of the memory
QLoRA: Understanding the Magic Behind Single-GPU Fine-Tuning
How QLoRA Works
QLoRA combines three key techniques to fit large models on small GPUs:
NF4 Quantization
Uses NormalFloat 4-bit (NF4) precision instead of standard 32-bit floats, shrinking the stored model weights to roughly 1/8th of their original size while preserving accuracy.
Double Quantization
Quantizes the quantization constants themselves, saving an additional ~0.4 bits per parameter on average.
Paged Optimizers
Pages optimizer state from GPU to CPU RAM during memory spikes, preventing out-of-memory errors during training.
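In code, all three techniques map onto a handful of settings in the bitsandbytes/transformers stack; a minimal sketch (the same settings appear in the full script later in this guide):

import torch
from transformers import BitsAndBytesConfig, TrainingArguments

# NF4 quantization with double quantization of the quantization constants
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store frozen weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat-4 data type
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to BF16 for matmuls
)

# Paged optimizer: optimizer state can spill to CPU RAM to avoid OOM spikes
training_args = TrainingArguments(output_dir="./out", optim="paged_adamw_8bit")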
The QLoRA Computation Flow
- Frozen Weights (NF4): Model weights are stored in 4-bit NF4 format on VRAM
- Dequantization: When needed, weights are dequantized to BF16 (brain float 16)
- LoRA Computation: LoRA adapter matrices (trained) are combined with dequantized weights in BF16
- Backprop: Gradients are computed and only LoRA adapters are updated
- Memory Release: The temporary BF16 copies are discarded after use; the weights themselves stay stored in NF4, keeping VRAM usage low (the full cycle is sketched below)
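Put together, one adapted layer behaves roughly like the sketch below. This is a conceptual illustration of what bitsandbytes' 4-bit linear layer does internally; the real implementation fuses these steps in custom kernels.

import torch
import bitsandbytes.functional as bnbF

def qlora_linear_forward(x, W_nf4, quant_state, lora_A, lora_B, alpha, r):
    # 1. Dequantize the frozen NF4 weights to BF16 just for this matmul
    W_bf16 = bnbF.dequantize_4bit(W_nf4, quant_state).to(torch.bfloat16)
    # 2. Frozen full-rank path (no gradients flow into W)
    base_out = x @ W_bf16.T
    # 3. Trainable low-rank LoRA path, scaled by alpha / r
    lora_out = (x @ lora_A.T) @ lora_B.T * (alpha / r)
    # 4. The BF16 copy of W is discarded after the call; only lora_A / lora_B get gradients
    return base_out + lora_out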
QLoRA vs LoRA: When to Use Which?
| Factor | Use LoRA | Use QLoRA |
|---|---|---|
| Available VRAM | 16GB+ | <16GB |
| Model Size | 7B-13B | 30B-70B (or 8B on limited hardware) |
| Training Speed | Fast (baseline) | Slower (39% overhead) |
| Quality | Slightly better (~95% of full fine-tuning) | Nearly identical (~92%) |
| Use Case | Production systems with resources | Research, prototyping, limited budgets |
Prerequisites & Environment Setup
Hardware Requirements
Software Installation
Step 1: Create Python Environment
$ python -m venv llama-finetune
$ source llama-finetune/bin/activate # Linux/Mac
$ llama-finetune\Scripts\activate # Windows
Step 2: Install Core Dependencies
$ pip install transformers datasets peft bitsandbytes accelerate
$ pip install wandb trl # For logging and training utilities
$ pip install flash-attn --no-build-isolation # Optional: only needed for attn_implementation="flash_attention_2"
Step 3: Verify Installation
python -c "
import torch
print(f'CUDA available: {torch.cuda.is_available()}')
print(f'GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else \"CPU only\"}')
print(f'CUDA Version: {torch.version.cuda}')
"
Complete QLoRA Implementation
Full Production-Ready Code
#!/usr/bin/env python3
"""
Complete QLoRA Fine-Tuning Implementation for Llama-3
Optimized for single GPU with 12GB+ VRAM
"""
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    TextStreamer
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
import wandb
import os
# ===== Configuration =====
MODEL_NAME = "meta-llama/Meta-Llama-3-8B"  # Or use "meta-llama/Meta-Llama-3-8B-Instruct"
OUTPUT_DIR = "./llama-3-finetuned"
DATASET_NAME = "mlabonne/FineTome-100k" # Replace with your dataset
MAX_SEQ_LENGTH = 2048
LEARNING_RATE = 3e-4
BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = 4
NUM_EPOCHS = 1
WARMUP_STEPS = 100
# ===== Quantization Config (QLoRA) =====
def create_bnb_config():
    """Configure 4-bit quantization using bitsandbytes"""
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    return bnb_config
# ===== LoRA Config =====
def create_peft_config():
    """Configure LoRA adapters"""
    peft_config = LoraConfig(
        r=16,                  # LoRA rank
        lora_alpha=32,         # LoRA scaling factor
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=[
            "q_proj",     # Query projection
            "v_proj",     # Value projection
            "k_proj",     # Key projection
            "o_proj",     # Output projection
            "gate_proj",  # Gate projection (feed-forward)
            "up_proj",    # Up projection (feed-forward)
            "down_proj"   # Down projection (feed-forward)
        ],
        modules_to_save=None,  # Freeze all other modules
    )
    return peft_config
# ===== Model Loading =====
def load_model_and_tokenizer():
    """Load Llama-3 model with QLoRA configuration"""
    print("Loading model with QLoRA quantization...")
    bnb_config = create_bnb_config()
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True,
        attn_implementation="flash_attention_2"  # Requires the flash-attn package; remove if unavailable
    )
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"  # For efficient padding
    print(f"Model loaded: {MODEL_NAME}")
    print(f"Model dtype: {model.dtype}")
    return model, tokenizer
# ===== Dataset Preparation =====
def prepare_dataset(tokenizer):
    """Load and prepare training dataset"""
    print(f"Loading dataset: {DATASET_NAME}")
    dataset = load_dataset(DATASET_NAME, split="train")

    # For large datasets, use a subset
    if len(dataset) > 50000:
        dataset = dataset.select(range(50000))

    # Split into train/validation
    dataset = dataset.train_test_split(test_size=0.1)

    def format_instruction(example):
        """Format conversations for instruction tuning"""
        # Assuming dataset has 'text' or 'conversations' field
        if 'text' in example:
            return {"text": example["text"]}
        elif 'conversations' in example:
            # Convert ShareGPT format to single text
            messages = example["conversations"]
            text = ""
            for msg in messages:
                role = msg.get("from", "").replace("human", "user")
                content = msg.get("value", "")
                text += f"{role}: {content}\n"
            return {"text": text}
        return example

    dataset = dataset.map(
        format_instruction,
        batched=False,
        remove_columns=dataset["train"].column_names  # dataset is a DatasetDict after the split
    )
    return dataset
# ===== Training Setup =====
def setup_training(model, tokenizer, dataset):
    """Configure training parameters"""
    # Prepare model for k-bit training
    model = prepare_model_for_kbit_training(model)

    # Apply LoRA config
    peft_config = create_peft_config()
    model = get_peft_model(model, peft_config)

    # Print trainable parameters
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total_params = sum(p.numel() for p in model.parameters())
    print(f"Trainable params: {trainable_params:,}")
    print(f"Total params: {total_params:,}")
    print(f"Trainable %: {100 * trainable_params / total_params:.2f}%")

    training_args = TrainingArguments(
        output_dir=OUTPUT_DIR,
        per_device_train_batch_size=BATCH_SIZE,
        per_device_eval_batch_size=BATCH_SIZE,
        gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
        learning_rate=LEARNING_RATE,
        num_train_epochs=NUM_EPOCHS,
        logging_steps=10,
        evaluation_strategy="steps",  # Evaluate on the validation split every eval_steps
        eval_steps=100,
        save_steps=100,
        save_total_limit=3,
        warmup_steps=WARMUP_STEPS,
        logging_dir="./logs",
        optim="paged_adamw_8bit",     # 8-bit paged optimizer for memory efficiency
        bf16=True,                    # Use bfloat16 for faster training
        tf32=True,                    # Use TF32 on Ampere+ GPUs (A100/H100/RTX 30xx and newer)
        lr_scheduler_type="cosine",
        weight_decay=0.01,
        push_to_hub=False,            # Set to True if uploading to Hugging Face
        report_to=["wandb"],          # Enable W&B logging
        dataloader_pin_memory=True,
        dataloader_num_workers=4,
        ddp_find_unused_parameters=False,
    )

    # Note: newer trl releases move dataset_text_field/packing/max_seq_length into SFTConfig
    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset["train"],
        eval_dataset=dataset["test"],
        dataset_text_field="text",
        args=training_args,
        packing=True,                 # Pack multiple examples into a single sequence
        max_seq_length=MAX_SEQ_LENGTH,
    )
    return trainer
# ===== Main Execution =====
def main():
    # Initialize W&B logging
    wandb.init(project="llama3-finetuning", name="qlora-experiment")

    # Load model and tokenizer
    model, tokenizer = load_model_and_tokenizer()

    # Prepare dataset
    dataset = prepare_dataset(tokenizer)

    # Setup training
    trainer = setup_training(model, tokenizer, dataset)

    # Train
    print("Starting training...")
    trainer.train()

    # Save LoRA adapters and tokenizer
    print("Saving model...")
    trainer.model.save_pretrained(OUTPUT_DIR)
    tokenizer.save_pretrained(OUTPUT_DIR)

    # Merge LoRA adapters with base model (optional)
    print("Merging LoRA adapters...")
    from peft import AutoPeftModelForCausalLM
    merged_model = AutoPeftModelForCausalLM.from_pretrained(
        OUTPUT_DIR,
        torch_dtype=torch.float16,
        device_map="auto"
    )
    merged_model = merged_model.merge_and_unload()
    merged_model.save_pretrained(f"{OUTPUT_DIR}-merged")
    print("✅ Fine-tuning complete!")


if __name__ == "__main__":
    main()
Hyperparameter Tuning for QLoRA
LoRA Configuration
Key LoRA Parameters:
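The parameters that matter most live in peft's LoraConfig. The values below mirror the script above; the comments summarize what each knob trades off (targeting only the attention projections, as shown here, is a common lighter-weight alternative to adapting all seven modules):

from peft import LoraConfig

peft_config = LoraConfig(
    r=16,               # adapter rank: higher = more capacity and VRAM, slower training
    lora_alpha=32,      # scaling factor; the update is scaled by alpha / r (2x the rank is a common default)
    lora_dropout=0.05,  # dropout on the adapter path, regularizes small datasets
    bias="none",        # keep bias terms frozen
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention-only variant
)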
Training Hyperparameters
Recommended Configurations
| Scenario | Rank | Alpha | Batch | LR | Notes |
|---|---|---|---|---|---|
| Small Dataset (<5k) | 8 | 16 | 2 | 1e-4 | Lower rank to prevent overfitting |
| Medium Dataset (5k-50k) | 16 | 32 | 4 | 3e-4 | Balanced quality/efficiency |
| Large Dataset (>50k) | 32 | 64 | 8 | 5e-4 | Can use higher rank for better quality |
| Limited VRAM (8GB) | 8 | 16 | 1-2 | 1e-4 | Minimal rank, smaller batch |
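As a worked example, the "Medium Dataset" row of the table maps directly onto the configs used in the script above:

from peft import LoraConfig
from transformers import TrainingArguments

# Medium dataset (5k-50k examples): balanced quality/efficiency
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
training_args = TrainingArguments(
    output_dir="./llama-3-finetuned",
    per_device_train_batch_size=4,
    learning_rate=3e-4,
)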
Training & Monitoring
Memory Optimization Techniques
- Gradient Checkpointing: Trade compute for memory by recomputing activations during backward pass
- Packing: Combine multiple examples into single sequence to maximize GPU utilization
- Flash Attention: Use optimized attention implementation for 2x speedup
- 8-bit Optimizer: Use paged_adamw_8bit instead of standard AdamW
- BF16 Training: Use bfloat16 for memory savings and faster computation on A100/H100 (most of these switches appear together in the sketch after this list)
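Most of these switches are single arguments on TrainingArguments and SFTTrainer; a combined sketch (argument names match the main script, and as noted there, newer trl versions expect the SFTTrainer options via SFTConfig):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./llama-3-finetuned",
    gradient_checkpointing=True,   # recompute activations during the backward pass
    optim="paged_adamw_8bit",      # 8-bit paged optimizer state
    bf16=True,                     # bfloat16 mixed precision
)
# Packing is enabled via SFTTrainer(..., packing=True); FlashAttention is enabled when the
# model is loaded with attn_implementation="flash_attention_2" (requires the flash-attn package).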
Monitoring Training
import wandb
import matplotlib.pyplot as plt
# Metrics to track (placeholders - populate these from the trainer's log history
# or your W&B run before plotting)
wandb_metrics = {
    "loss": [],
    "learning_rate": [],
    "epoch": []
}
# After training, plot results
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(wandb_metrics["epoch"], wandb_metrics["loss"])
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Training Loss")
plt.subplot(1, 2, 2)
plt.plot(wandb_metrics["epoch"], wandb_metrics["learning_rate"])
plt.xlabel("Epoch")
plt.ylabel("Learning Rate")
plt.title("Learning Rate Schedule")
plt.tight_layout()
plt.savefig("training_metrics.png")
plt.show()
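If you would rather not collect metrics by hand, the same curves can be read back from the trainer's log history once training finishes (a sketch; it assumes the trainer object from the main script is still in scope):

import matplotlib.pyplot as plt

# Each entry in log_history is a dict written every `logging_steps`
history = trainer.state.log_history
steps = [h["step"] for h in history if "loss" in h]
losses = [h["loss"] for h in history if "loss" in h]

plt.plot(steps, losses)
plt.xlabel("Step")
plt.ylabel("Loss")
plt.title("Training Loss")
plt.savefig("loss_curve.png")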
Inference After Fine-Tuning
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer
# Load fine-tuned model
model = AutoPeftModelForCausalLM.from_pretrained(
"./llama-3-finetuned",
torch_dtype=torch.float16,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("./llama-3-finetuned")
# Inference
prompt = "Explain quantum computing in simple terms:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(
**inputs,
max_new_tokens=512,
temperature=0.7,
top_p=0.9,
do_sample=True
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
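The main script imports TextStreamer but never uses it; for interactive testing you can stream tokens to stdout as they are generated:

from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    streamer=streamer,   # prints each token as soon as it is produced
)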
Advanced Optimization Techniques
Alternative: Using Unsloth for 2x Faster Training
Unsloth provides custom kernels for even faster QLoRA training. Install it with pip install unsloth, then load the model through its FastLanguageModel wrapper:
from unsloth import FastLanguageModel

# Load with Unsloth (2x faster, 60% less memory)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
    dtype=None,
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    use_rslora=True,                      # Rank-stabilized LoRA
    use_gradient_checkpointing="unsloth"  # Unsloth's memory-efficient checkpointing
)
Model Merging & Quantization
from peft import AutoPeftModelForCausalLM

# Merge LoRA adapters with base model
model = AutoPeftModelForCausalLM.from_pretrained("./llama-3-finetuned")
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./llama-3-merged")

# Optional: Quantize to GGUF for llama.cpp inference
# (requires converting the merged model to GGUF first - see the commands below)
from llama_cpp import Llama

llm = Llama(model_path="./llama-3-merged.gguf", n_gpu_layers=35)
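The GGUF conversion itself is typically done with llama.cpp's converter script. A rough sketch of the commands, assuming you have cloned and built llama.cpp (script and binary names vary between llama.cpp versions, so check your checkout):

$ git clone https://github.com/ggerganov/llama.cpp
$ python llama.cpp/convert_hf_to_gguf.py ./llama-3-merged --outfile llama-3-merged.gguf
$ ./llama.cpp/llama-quantize llama-3-merged.gguf llama-3-merged-q4_k_m.gguf Q4_K_M # optional 4-bit GGUF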
Multi-GPU Training with FSDP
# For distributed training across multiple GPUs
# Set in TrainingArguments:
training_args = TrainingArguments(
    ...
    ddp_backend="nccl",
    ddp_find_unused_parameters=False,
    # Enable FSDP (the "full_shard auto_wrap" string controls sharding/wrapping;
    # accepted fsdp_config keys vary between transformers versions)
    fsdp="full_shard auto_wrap",
    fsdp_config={
        "sharding_strategy": "FULL_SHARD",
        "backward_prefetch": "BACKWARD_PRE",
        "forward_prefetch": True,
    },
)
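FSDP also needs a distributed launcher; assuming the training script is saved as finetune.py (a name used here purely for illustration), it is typically started with torchrun or accelerate:

$ torchrun --nproc_per_node=4 finetune.py
# or, equivalently, through Accelerate
$ accelerate launch --num_processes 4 finetune.py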
The Future of Model Customization
QLoRA represents a paradigm shift in making large language models accessible to everyone. By combining 4-bit quantization with LoRA, you can now fine-tune Llama-3 on consumer-grade hardware, something that would have been impossible just two years ago.
- QLoRA fits Llama-3 8B on 12GB GPUs (previously required 80GB+)
- Training quality remains 90%+ compared to full fine-tuning
- Merging adapters gives production-ready models at negligible cost
- Unsloth provides 2x speedup with minimal code changes
- Monitor VRAM usage and adjust batch size/rank accordingly
Next Steps
- Set up your environment with the provided installation steps
- Prepare your training dataset in instruction-answer format
- Start with the recommended hyperparameters for your dataset size
- Monitor training with W&B or TensorBoard
- Evaluate on validation set and adjust hyperparameters if needed
- Merge adapters and deploy your fine-tuned model
- Consider using GGUF quantization for CPU/edge device inference