Fine-tuning large language models used to require significant resources and expertise. That's changed dramatically with QLoRA, PEFT, and cloud fine-tuning APIs from major providers.
But accessibility doesn't mean you should always fine-tune. The decision between fine-tuning, prompt engineering, and RAG depends on your specific use case.
Fine-Tuning vs Prompting vs RAG
Use Prompting when:
- Task is well-defined and models handle it reasonably
- You need flexibility to change behavior quickly
- Dataset is small or constantly changing
- Budget is limited
Use RAG when:
- You need to incorporate external knowledge
- Information changes frequently
- You need citations/sources
- Knowledge base is large
Use Fine-Tuning when:
- Consistent style/tone required
- Domain-specific terminology matters
- Prompts are too long/complex
- You have quality training data
- Cost per query matters at scale
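As a rough heuristic, the checklists above can be sketched as a decision function. This is illustrative only; the criteria names are made up for this sketch, not a formal API:

```python
from dataclasses import dataclass

@dataclass
class UseCase:
    needs_external_knowledge: bool   # facts live outside the model
    knowledge_changes_often: bool    # freshness/citations matter
    needs_consistent_style: bool     # tone or format must be uniform
    has_quality_training_data: bool  # a few hundred good examples exist
    high_query_volume: bool          # per-query cost dominates at scale

def choose_approach(uc: UseCase) -> str:
    """Map the checklists onto a first-pass recommendation."""
    if uc.needs_external_knowledge or uc.knowledge_changes_often:
        return "rag"
    if uc.has_quality_training_data and (uc.needs_consistent_style or uc.high_query_volume):
        return "fine-tune"
    return "prompting"  # default: cheapest and fastest to iterate on
```

In practice these options compose: a fine-tuned model can still sit behind a RAG pipeline, and every approach benefits from good prompting.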
Preparing Your Dataset
Quality matters more than quantity. A few hundred high-quality examples often outperform thousands of mediocre ones.
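Since fine-tuning APIs typically expect one JSON object per line (JSONL) in the chat-messages format shown below, it pays to validate examples before upload. A minimal Python sketch (the checks here are illustrative, not a provider requirement):

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def validate_example(line: str) -> list[str]:
    """Return a list of problems with one JSONL training example (empty = ok)."""
    problems = []
    try:
        obj = json.loads(line)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    messages = obj.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' array"]
    for i, m in enumerate(messages):
        if m.get("role") not in VALID_ROLES:
            problems.append(f"message {i}: bad role {m.get('role')!r}")
        if not isinstance(m.get("content"), str) or not m["content"].strip():
            problems.append(f"message {i}: empty content")
    if not problems and messages[-1].get("role") != "assistant":
        problems.append("last message should be the assistant's answer")
    return problems
```

Running this over the whole file before training catches the cheap mistakes (truncated lines, missing roles) that otherwise fail a job after upload.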
```typescript
interface TrainingExample {
  messages: Array<{
    role: 'system' | 'user' | 'assistant'
    content: string
  }>
}

// Quality checklist for each example:
// ✅ Demonstrates desired behavior clearly
// ✅ Contains domain-specific patterns
// ✅ Represents realistic user inputs
// ✅ Has accurate, high-quality outputs
// ❌ Not copied from base model outputs
// ❌ Not artificially padded or repetitive
```

QLoRA and PEFT Explained
LoRA (Low-Rank Adaptation): Instead of updating all model weights, LoRA adds small trainable matrices that modify the model's behavior. This dramatically reduces compute and memory requirements.
QLoRA: Combines LoRA with 4-bit quantization. Train a 7B model on a single consumer GPU.
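The savings are easy to see with back-of-the-envelope arithmetic: LoRA replaces a full d×k weight update with two low-rank factors, B (d×r) and A (r×k). For a Llama-style 4096×4096 attention projection at rank r=16:

```python
def lora_trainable_params(d: int, k: int, r: int) -> tuple[int, int]:
    """Full-update parameter count vs. LoRA's two factor matrices."""
    full = d * k            # updating W directly
    lora = d * r + r * k    # B (d x r) plus A (r x k)
    return full, lora

full, lora = lora_trainable_params(4096, 4096, 16)
print(full, lora, f"{lora / full:.2%}")   # 16777216 131072 0.78%
```

Under 1% of the parameters per adapted layer are trainable, which is why the optimizer state and gradients fit where full fine-tuning would not.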
```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load base model with quantization
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA config
lora_config = LoraConfig(
    r=16,              # rank of the update matrices
    lora_alpha=32,     # scaling factor for the LoRA update
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```

Cloud Fine-Tuning APIs
OpenAI: Simple API, limited customization. Good for fine-tuning models like GPT-3.5 Turbo and GPT-4o mini.
Anthropic: Claude fine-tuning available for enterprise customers.
Together/Replicate: Fine-tune open-source models with more control.
```typescript
// OpenAI fine-tuning example
import fs from 'node:fs'
import OpenAI from 'openai'

const openai = new OpenAI()

// Upload training data
const file = await openai.files.create({
  file: fs.createReadStream('training_data.jsonl'),
  purpose: 'fine-tune'
})

// Create fine-tuning job (runs asynchronously)
const job = await openai.fineTuning.jobs.create({
  training_file: file.id,
  model: 'gpt-4o-mini-2024-07-18',
  hyperparameters: {
    n_epochs: 3
  }
})

// Check progress later with openai.fineTuning.jobs.retrieve(job.id)
```

Measuring Success
```typescript
interface EvaluationMetrics {
  // Task-specific accuracy
  accuracy: number
  // Style/format consistency
  formatCompliance: number
  // Comparison to base model
  improvementOverBase: number
  // Regression on general capabilities
  generalCapabilityRetention: number
}
```

Create a held-out test set before training. Evaluate both task-specific performance AND general capability retention to catch catastrophic forgetting.
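A sketch of aggregating these metrics from per-example results. The function and argument names are illustrative; it assumes you have boolean task-level judgments for both models and scalar scores from a general-capability benchmark:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    accuracy: float
    format_compliance: float
    improvement_over_base: float
    general_capability_retention: float

def evaluate(task_correct, task_format_ok, base_correct,
             general_ft, general_base):
    """Aggregate per-example booleans into the metrics above.

    task_correct / task_format_ok: fine-tuned model on the held-out task set.
    base_correct: base model on the same set, for the improvement delta.
    general_ft / general_base: scores on a general benchmark, to detect
    catastrophic forgetting (retention well below 1.0 is a red flag).
    """
    n = len(task_correct)
    acc = sum(task_correct) / n
    return EvalResult(
        accuracy=acc,
        format_compliance=sum(task_format_ok) / n,
        improvement_over_base=acc - sum(base_correct) / n,
        general_capability_retention=general_ft / general_base,
    )
```

If `improvement_over_base` is small, the fine-tune may not be paying for itself; if `general_capability_retention` drops, revisit your data mix or train for fewer epochs.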
Conclusion
Fine-tuning is now accessible, but it's not always the answer. Start with prompting, try RAG for knowledge-heavy tasks, and fine-tune when you have quality data and consistent behavior matters. The right choice depends on your specific requirements.
