Parameter-Efficient Fine-Tuning (PEFT) is a family of techniques used to fine-tune large pre-trained models by updating only a small subset of parameters (or adding a small number of new trainable parameters), while keeping the vast majority of the original pre-trained weights frozen.
Core Problem Solved
Full fine-tuning of large models (e.g., 70B parameters) is extremely expensive on three fronts:
- Compute: Requires massive GPU processing power to calculate gradients for all weights.
- Memory: Requires storing optimizer states for every single parameter (often 3-4x the model size; see the back-of-the-envelope sketch after this list).
- Storage: Producing a separate full copy of the model for every downstream task is inefficient.
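As a rough illustration of the memory bullet above, the sketch below estimates full fine-tuning memory under one common setup: fp16 weights and gradients plus fp32 master weights and two fp32 Adam states per parameter. The byte layout is an assumption for illustration; real numbers vary with optimizer, precision, sharding, and activation memory.

```python
# Back-of-the-envelope memory for full fine-tuning with Adam in mixed precision.
# Assumed layout (illustrative): fp16 weights (2 B) + fp16 gradients (2 B)
# + fp32 master weights (4 B) + two fp32 Adam states (4 B + 4 B) per parameter.
def full_finetune_memory_gb(n_params: float) -> float:
    bytes_per_param = 2 + 2 + 4 + 4 + 4
    return n_params * bytes_per_param / 1e9

print(f"{full_finetune_memory_gb(70e9):,.0f} GB")  # ~1,120 GB, before activations
```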
Key Techniques
- Low-Rank Adaptation (LoRA): Injects small, trainable rank-decomposition matrices into linear layers, so the learned weight update has low rank while the original weights stay frozen (first sketch below).
- Adapters: Inserts small trainable neural network layers (typically bottleneck MLPs) between existing frozen layers (second sketch below).
- Prompt Tuning: Adds trainable “virtual tokens” to the input prompt, leaving the model weights entirely untouched (third sketch below).
- Quantized Low-Rank Adaptation (QLoRA): Combines LoRA with aggressive 4-bit quantization of the frozen base weights to further reduce memory usage (fourth sketch below).
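To make these concrete, here are minimal PyTorch sketches of each idea. First, LoRA: a frozen linear layer plus a trainable low-rank update, W + (alpha/r)·BA. The hyperparameters (r=8, alpha=16) are common defaults, not prescriptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W + (alpha/r) * B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze the pre-trained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # A gets small random values, B starts at zero, so the adapted layer
        # is exactly the frozen base layer at initialization.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Only A and B are trained: for a 4096x4096 projection at r=8, that is ~65K trainable parameters instead of ~16.8M.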
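Second, a bottleneck adapter: a small down-project/up-project block with a residual connection, initialized so it acts as an identity. The bottleneck width is an illustrative choice.

```python
class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual add."""
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        nn.init.zeros_(self.up.weight)  # near-identity at initialization
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))
```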
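Third, prompt tuning reduces to learning a small matrix of virtual-token embeddings that is prepended to the input embeddings before they reach the frozen model; `n_virtual` below is an illustrative choice.

```python
class PromptTuning(nn.Module):
    """Prepends trainable 'virtual token' embeddings; base model weights stay frozen."""
    def __init__(self, d_model: int, n_virtual: int = 20):
        super().__init__()
        self.virtual = nn.Parameter(torch.randn(n_virtual, d_model) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, d_model)
        batch = token_embeds.size(0)
        prefix = self.virtual.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prefix, token_embeds], dim=1)
```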
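Finally, QLoRA is usually set up through the Hugging Face transformers/peft/bitsandbytes stack rather than by hand. The sketch below shows the common pattern; the model ID and target module names are illustrative, and argument names should be checked against your installed library versions.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative; any causal LM works

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize frozen base weights to 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4, as in the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for matmuls
)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # attention projections, a common choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```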
Benefits
- Reduced Hardware Requirements: Enables fine-tuning massive models on consumer hardware (e.g., a single GPU).
- Modularity: You can have one widely shared “Base Model” and swap small (megabyte-sized) “Adapter” files for different tasks (e.g., one adapter for coding, one for creative writing); see the sketch after this list.
- Less Catastrophic Forgetting: Since most weights are frozen, the model is less likely to forget its pre-trained general knowledge.
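As an example of the modularity benefit, the Hugging Face peft library lets a single base model host several adapters and switch between them at runtime. The model ID, adapter paths, and adapter names below are hypothetical.

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # illustrative

# Attach one adapter, then register a second and switch between them.
model = PeftModel.from_pretrained(base, "adapters/coding", adapter_name="coding")  # hypothetical paths
model.load_adapter("adapters/creative-writing", adapter_name="writing")
model.set_adapter("writing")  # swap tasks without reloading the multi-GB base model
```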
