Quantized Low-Rank Adaptation (QLoRA) is an extension of Low-Rank Adaptation (LoRA) that enables fine-tuning of extremely large models (65B or 70B parameters) on a single 48GB GPU while preserving full 16-bit fine-tuning performance.
It works by quantizing the frozen base model to 4-bit precision while keeping the trainable LoRA adapters in higher precision (16-bit) during training.
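The arithmetic of that split can be sketched in a few lines. This is a hypothetical, pure-Python illustration (no ML libraries, and plain symmetric 4-bit absmax quantization rather than true NF4): the frozen base weight is stored as 4-bit codes plus a per-row scale, while the LoRA matrices A and B stay in full precision.

```python
def quantize_4bit(row):
    """Symmetric 4-bit absmax quantization of one weight row (illustrative)."""
    c = max(abs(w) for w in row) or 1.0          # per-row scale constant
    q = [round(w / c * 7) for w in row]          # integer levels in [-7, 7]
    return q, c

def dequantize_4bit(q, c):
    """Recover approximate weights from 4-bit codes and the scale."""
    return [v / 7 * c for v in q]

def qlora_linear(x, q_rows, scales, A, B, alpha=16, r=2):
    """Forward pass: y = W_dequantized @ x + (alpha / r) * B @ (A @ x)."""
    def matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]
    base = [sum(w * xi for w, xi in zip(dequantize_4bit(q, c), x))
            for q, c in zip(q_rows, scales)]     # frozen quantized base path
    update = matvec(B, matvec(A, x))             # trainable low-rank path
    return [b + (alpha / r) * u for b, u in zip(base, update)]
```

As in LoRA, initializing B to zeros makes the adapter path a no-op at the start of training, so the quantized model's behavior is unchanged until the adapters learn.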
Key Innovations
QLoRA introduced three main memory-saving techniques:
1. 4-bit NormalFloat (NF4)
A data type whose quantization levels are optimized for normally distributed values, which is how pretrained neural network weights are typically distributed. It packs information more efficiently than standard 4-bit integers (Int4) or floats (FP4), retaining higher fidelity to the original model weights.
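A minimal sketch of the idea, with some assumed simplifications: the 16 code values below are the published NF4 levels (quantiles of a standard normal, normalized to [-1, 1], with an exact zero), and each block of weights is scaled by its absolute maximum and then rounded to the nearest level.

```python
# The 16 NF4 code values: normal quantiles normalized to [-1, 1],
# including an exact zero so that zero weights survive quantization.
NF4_LEVELS = [
    -1.0, -0.6961928010, -0.5250730515, -0.3949174881,
    -0.2844413817, -0.1847734302, -0.0910500363, 0.0,
    0.0795802996, 0.1609302014, 0.2461123019, 0.3379152417,
    0.4407098293, 0.5626170039, 0.7229568362, 1.0,
]

def nf4_quantize_block(block):
    """Quantize one block: store one fp scale plus a 4-bit index per weight."""
    c = max(abs(w) for w in block) or 1.0            # per-block absmax constant
    idx = [min(range(16), key=lambda i: abs(NF4_LEVELS[i] - w / c))
           for w in block]                           # nearest NF4 code value
    return idx, c

def nf4_dequantize_block(idx, c):
    """Recover approximate weights by rescaling the code values."""
    return [NF4_LEVELS[i] * c for i in idx]
```

Because the levels cluster near zero, where most weights of a trained network actually live, the rounding error is smaller than with evenly spaced Int4 levels.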
2. Double Quantization
Quantizing the quantization constants themselves.
- Standard quantization stores a 32-bit constant for every block of parameters (typically one per 64 weights). QLoRA quantizes these constants again to 8-bit, saving an average of ~0.37 bits per parameter, which adds up to roughly 3GB for a 70B model.
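The savings arithmetic can be checked directly. A hedged sketch (the paper's second-level format differs in detail; plain int8 is used here for simplicity): the first-level fp32 absmax constants are themselves absmax-quantized in groups of 256, leaving one fp32 scale per group.

```python
def double_quantize(constants, group=256):
    """Quantize first-level fp32 constants to int8 plus one fp32 scale per group."""
    out = []
    for g in range(0, len(constants), group):
        chunk = constants[g:g + group]
        c2 = max(chunk) or 1.0                   # second-level fp32 constant
        q = [round(c / c2 * 127) for c in chunk] # 8-bit codes (constants >= 0)
        out.append((q, c2))
    return out

def double_dequantize(groups):
    """Recover approximate first-level constants."""
    return [qi / 127 * c2 for q, c2 in groups for qi in q]

# Memory arithmetic behind the ~0.37 bits/parameter figure:
bits_before = 32 / 64                       # one 32-bit constant per 64 weights
bits_after = 8 / 64 + 32 / (64 * 256)       # 8-bit code + shared fp32 scale
saving = bits_before - bits_after           # ~0.373 bits per parameter
```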
3. Paged Optimizers
Uses NVIDIA Unified Memory to automatically page optimizer states between GPU VRAM and CPU RAM when GPU memory pressure spikes (for example, when processing a batch with unusually long sequences). This prevents the dreaded “Out of Memory” (OOM) errors during training.
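In practice all three techniques are enabled through configuration rather than custom code. A hedged sketch using the Hugging Face transformers and bitsandbytes integration (parameter names are assumed from those libraries' current APIs, and the fragment is not runnable without them installed):

```python
# Config fragment: assumes transformers, bitsandbytes, and torch are installed.
import torch
from transformers import BitsAndBytesConfig, TrainingArguments

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # quantize the frozen base to 4-bit
    bnb_4bit_quant_type="nf4",               # 4-bit NormalFloat data type
    bnb_4bit_use_double_quant=True,          # quantize the constants too
    bnb_4bit_compute_dtype=torch.bfloat16,   # 16-bit compute for the adapters
)

args = TrainingArguments(
    output_dir="qlora-out",
    optim="paged_adamw_32bit",               # paged optimizer via unified memory
)
```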
Impact
QLoRA “democratized” fine-tuning. Before QLoRA, full 16-bit fine-tuning of a 65B model required over 780GB of GPU memory (many A100s). With QLoRA, it can be done on a single 48GB GPU, making state-of-the-art model customization accessible to researchers and hobbyists.
