For privacy-sensitive applications, sending data to cloud AI providers isn't an option. The good news: local LLMs have become remarkably capable, and running them is more accessible than ever.
Why Run LLMs Locally?
- Privacy: Data never leaves your machine
- Cost: No per-token charges after hardware investment
- Latency: No network round-trip
- Control: Choose your model, customize as needed
- Offline: Works without internet
Hardware Requirements
7B-8B model (e.g., Mistral 7B, Llama 3 8B):
- RAM: 8GB minimum, 16GB recommended
- GPU: 6GB VRAM, or CPU-only (slower)
- Storage: 5-10GB per model
13B model:
- RAM: 16GB minimum, 32GB recommended
- GPU: 12GB VRAM recommended
- Storage: 10-15GB per model
70B model:
- RAM: 64GB+ of system RAM, or an equivalent amount of GPU memory
- GPU: Multiple GPUs or quantization required
- Storage: 40-100GB per model

Ollama: The Easy Way
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull and run a model
ollama pull llama3
ollama run llama3
# API access (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3",
"messages": [{ "role": "user", "content": "Hello!" }]
}'

// Use with the OpenAI SDK (Ollama exposes an OpenAI-compatible API)
import OpenAI from 'openai'

const client = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama' // Required by the SDK but ignored by Ollama
})

const response = await client.chat.completions.create({
  model: 'llama3',
  messages: [{ role: 'user', content: 'Explain React hooks' }]
})

// Print the assistant's reply
console.log(response.choices[0].message.content)

llama.cpp: Maximum Control
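# Build llama.cpp (recent versions build with CMake; older checkouts used `make`)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

# Run with a quantized model (the binary was named ./main in older releases)
./build/bin/llama-cli -m models/llama-3-8b-q4_k_m.gguf \
  -p "Write a function to sort an array" \
  -n 256 \
  --temp 0.7

Quantization: Quality vs Speed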
Quantization levels (for 7B model):
Q8_0: ~8GB, highest quality, slowest
Q6_K: ~6GB, excellent quality
Q5_K_M: ~5GB, great quality (recommended)
Q4_K_M: ~4GB, good quality
Q4_0: ~4GB, acceptable, fastest
Q2_K: ~3GB, degraded quality

Recommendation: Start with Q4_K_M or Q5_K_M for the best balance of quality and performance.
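
The sizes listed above follow from simple arithmetic: parameter count times bits per weight, divided by eight. The sketch below is a rough estimator, not an exact calculator. The function name `estimateWeightsGB` and the lookup table are hypothetical, and the bits-per-weight figures for the K-quants are approximate averages assumed for illustration. Real GGUF files differ somewhat because some tensors are stored at higher precision, and runtime memory use is higher still once the context cache is allocated.

```typescript
// Approximate bits per weight for common GGUF quantization types.
// Q8_0 and Q4_0 follow from their block layouts; the K-quant values are
// rough averages and are assumptions made for this illustration.
const BITS_PER_WEIGHT: Record<string, number> = {
  Q8_0: 8.5,
  Q6_K: 6.56,
  Q5_K_M: 5.7,
  Q4_K_M: 4.85,
  Q4_0: 4.5,
}

// Estimate the size of the model weights in GB (10^9 bytes) for a model
// with the given parameter count (in billions) at a quantization level.
function estimateWeightsGB(paramsBillions: number, quant: string): number {
  const bits = BITS_PER_WEIGHT[quant]
  if (bits === undefined) {
    throw new Error(`Unknown quantization type: ${quant}`)
  }
  // params * bits / 8 gives bytes; the 1e9 in "billions" cancels the 1e9 in GB.
  return (paramsBillions * bits) / 8
}

// Print an estimate for a 7B model at each quantization level in the table.
for (const quant of Object.keys(BITS_PER_WEIGHT)) {
  console.log(`${quant}: ~${estimateWeightsGB(7, quant).toFixed(1)} GB`)
}
```

The same function gives a quick sanity check for the larger tiers: passing 13 or 70 as the parameter count shows why a 70B model needs multiple GPUs or heavy quantization.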
Building Production Apps
// Local LLM service wrapper
import { Ollama } from 'ollama'

// Per-call overrides; any field left unset falls back to a default
interface GenerateOptions {
  model?: string
  temperature?: number
  maxTokens?: number
}

class LocalLLMService {
  private ollama = new Ollama({ host: 'http://localhost:11434' })

  // Return the full completion as a single string
  async generate(prompt: string, options?: GenerateOptions) {
    const response = await this.ollama.generate({
      model: options?.model ?? 'llama3',
      prompt,
      options: {
        // ?? rather than || so an explicit temperature of 0 is respected
        temperature: options?.temperature ?? 0.7,
        num_predict: options?.maxTokens ?? 512
      },
      stream: false
    })
    return response.response
  }

  // Yield the completion piece by piece as it is generated
  async *stream(prompt: string) {
    const response = await this.ollama.generate({
      model: 'llama3',
      prompt,
      stream: true
    })
    for await (const chunk of response) {
      yield chunk.response
    }
  }

  // Return the embedding vector for a piece of text
  async embeddings(text: string): Promise<number[]> {
    const response = await this.ollama.embeddings({
      model: 'nomic-embed-text',
      prompt: text
    })
    return response.embedding
  }
}

Model Selection
General purpose:
- Llama 3 8B: Best overall quality for size
- Mistral 7B: Fast, good reasoning
- Phi-3 Mini: Compact, surprisingly capable
Code generation:
- CodeLlama 7B/13B
- DeepSeek Coder
- StarCoder2
Embeddings:
- nomic-embed-text
- all-minilm

Conclusion
Local LLMs provide privacy, control, and cost savings. Start with Ollama for ease of use, then move to llama.cpp when you need finer control over performance. The capability gap with cloud models is shrinking, and for many use cases local is now good enough.
