Local LLMs: Running AI Without the Cloud

For privacy-sensitive applications, sending data to cloud AI providers often isn't an option. The good news: local LLMs have become remarkably capable, and tools like Ollama and llama.cpp make them practical to run on consumer hardware.

Why Run LLMs Locally?

Privacy: Data never leaves your machine
Cost: No per-token charges after hardware investment
Latency: No network round-trip
Control: Choose your model, customize as needed
Offline: Works without internet

Hardware Requirements

7B-8B model (e.g., Mistral 7B, Llama 3 8B):
- RAM: 8GB minimum, 16GB recommended
- GPU: 6GB VRAM, or CPU-only (slower)
- Storage: 5-10GB per model

13B model:
- RAM: 16GB minimum, 32GB recommended
- GPU: 12GB VRAM recommended
- Storage: 10-15GB per model

70B model:
- RAM: 64GB+ of system RAM, or enough combined GPU memory to hold the model
- GPU: Multiple GPUs, quantization, or both
- Storage: 40-100GB per model (quantized)
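
The figures above follow from simple arithmetic: parameter count times bytes per weight, plus some headroom. Here is a rough sketch of that estimate. The function name and the 20% overhead default are assumptions for illustration, not measured values; real usage also depends on context length and the runtime you choose.

```typescript
// Rough memory estimate for loading a model's weights.
// Assumption: weights dominate memory use. overheadFraction (default 20%)
// is a guess covering KV cache and runtime buffers, not a measured value.
function estimateModelMemoryGB(
  paramsBillions: number,
  bitsPerWeight: number,
  overheadFraction = 0.2
): number {
  const weightBytes = paramsBillions * 1e9 * (bitsPerWeight / 8)
  const totalBytes = weightBytes * (1 + overheadFraction)
  return totalBytes / 1e9
}

// A 7B model with 16-bit weights vs. roughly 4.5 bits per weight
console.log(estimateModelMemoryGB(7, 16).toFixed(1))
console.log(estimateModelMemoryGB(7, 4.5).toFixed(1))
```

The gap between the two results is why quantized models are the usual choice on consumer hardware.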

Ollama: The Easy Way

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama pull llama3
ollama run llama3

# API access (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "messages": [{ "role": "user", "content": "Hello!" }]
  }'

// Use with OpenAI SDK (compatible API)
import OpenAI from 'openai'

const client = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama'  // Required but ignored
})

const response = await client.chat.completions.create({
  model: 'llama3',
  messages: [{ role: 'user', content: 'Explain React hooks' }]
})
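
Before sending requests, it can help to confirm that the server is running and the model has been pulled. The sketch below queries Ollama's /api/tags endpoint, which lists locally available models. The names hasModel, isModelAvailable, and FetchLike are hypothetical helpers, and the fetch function is passed in so the check can be exercised without a running server.

```typescript
// Shape of the part of the /api/tags response this sketch relies on
interface TagsResponse {
  models?: Array<{ name: string }>
}

// Minimal fetch-like signature so a stub can be injected
type FetchLike = (url: string) => Promise<{ ok: boolean; json(): Promise<unknown> }>

// True if the list contains `model` exactly or with a tag (e.g. "llama3:latest")
function hasModel(tags: TagsResponse, model: string): boolean {
  return (tags.models ?? []).some(
    (m) => m.name === model || m.name.startsWith(`${model}:`)
  )
}

// Returns false if the server is unreachable or the model is missing
async function isModelAvailable(
  model: string,
  fetchFn: FetchLike,
  baseURL = 'http://localhost:11434'
): Promise<boolean> {
  try {
    const res = await fetchFn(`${baseURL}/api/tags`)
    if (!res.ok) return false
    return hasModel((await res.json()) as TagsResponse, model)
  } catch {
    return false // server not running or network error
  }
}
```

On Node 18 or later, the built-in fetch can be passed directly: await isModelAvailable('llama3', fetch).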

llama.cpp: Maximum Control

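# Build llama.cpp with CMake
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

# Run with quantized model (older builds name this binary ./main)
./build/bin/llama-cli -m models/llama-3-8b-q4_k_m.gguf \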
  -p "Write a function to sort an array" \
  -n 256 \
  --temp 0.7

Quantization: Quality vs Speed

Quantization levels (for 7B model):

Q8_0: ~8GB, highest quality, slowest
Q6_K: ~6GB, excellent quality
Q5_K_M: ~5GB, great quality (recommended)
Q4_K_M: ~4GB, good quality
Q4_0: ~4GB, acceptable, fastest
Q2_K: ~3GB, degraded quality

Recommendation: Start with Q4_K_M or Q5_K_M for best quality/performance balance.
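
To make the trade-off concrete, here is a small sketch that picks the highest-quality level fitting a memory budget. The sizes are copied from the table above. The function name and the 1.5GB headroom default are assumptions for illustration. Q4_0 is left out because it is the same size as Q4_K_M at lower quality.

```typescript
// Approximate sizes for a 7B model, ordered from highest to lowest quality
const QUANT_SIZES_GB: Array<{ name: string; sizeGB: number }> = [
  { name: 'Q8_0', sizeGB: 8 },
  { name: 'Q6_K', sizeGB: 6 },
  { name: 'Q5_K_M', sizeGB: 5 },
  { name: 'Q4_K_M', sizeGB: 4 },
  { name: 'Q2_K', sizeGB: 3 },
]

// Returns the first (highest-quality) level that fits in budgetGB after
// reserving headroomGB, a hypothetical allowance for context and runtime
// buffers. Returns null if nothing fits.
function pickQuantization(budgetGB: number, headroomGB = 1.5): string | null {
  for (const { name, sizeGB } of QUANT_SIZES_GB) {
    if (sizeGB + headroomGB <= budgetGB) return name
  }
  return null
}

console.log(pickQuantization(8)) // budget of an 8GB GPU
console.log(pickQuantization(6)) // budget of a 6GB GPU
```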

Building Production Apps

// Local LLM service wrapper
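import { Ollama } from 'ollama'

// Options accepted by generate(); all fields are optional
interface GenerateOptions {
  model?: string
  temperature?: number
  maxTokens?: number
}
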
class LocalLLMService {
  private ollama = new Ollama({ host: 'http://localhost:11434' })
  
  async generate(prompt: string, options?: GenerateOptions) {
    const response = await this.ollama.generate({
      model: options?.model || 'llama3',
      prompt,
      options: {
        // Use ?? rather than || so an explicit temperature of 0 is respected
        temperature: options?.temperature ?? 0.7,
        num_predict: options?.maxTokens ?? 512
      },
      stream: false
    })
    
    return response.response
  }
  
  async *stream(prompt: string) {
    const response = await this.ollama.generate({
      model: 'llama3',
      prompt,
      stream: true
    })
    
    for await (const chunk of response) {
      yield chunk.response
    }
  }
  
  async embeddings(text: string): Promise<number[]> {
    const response = await this.ollama.embeddings({
      model: 'nomic-embed-text',
      prompt: text
    })
    return response.embedding
  }
}
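
The embeddings method returns a plain vector of numbers. To use those vectors for search, you typically compare them with cosine similarity. The following is a minimal sketch of that comparison; cosineSimilarity and rankBySimilarity are hypothetical helper names, and a production app with many documents would likely use a vector store instead.

```typescript
// Cosine similarity between two equal-length vectors, in the range [-1, 1]
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length')
  }
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  if (normA === 0 || normB === 0) return 0 // zero vector: treat as unrelated
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Returns document indices sorted from most to least similar to the query
function rankBySimilarity(query: number[], docs: number[][]): number[] {
  return docs
    .map((doc, index) => ({ index, score: cosineSimilarity(query, doc) }))
    .sort((x, y) => y.score - x.score)
    .map((entry) => entry.index)
}
```

In practice you would embed each document once, store the vectors, and embed only the query at request time.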

Model Selection

General purpose:
- Llama 3 8B: Best overall quality for size
- Mistral 7B: Fast, good reasoning
- Phi-3 Mini: Compact, surprisingly capable

Code generation:
- CodeLlama 7B/13B
- DeepSeek Coder
- StarCoder2

Embeddings:
- nomic-embed-text
- all-minilm

Conclusion

Local LLMs provide privacy, control, and cost savings. Start with Ollama for ease of use, then move to llama.cpp when you need finer control over performance. The capability gap with cloud models is shrinking, and for many use cases local is now good enough.