Prompt Caching: The Cost Optimization Nobody Talks About

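LLM API costs spiral quickly at scale. A 2,000-token system prompt repeated across 100,000 daily requests means 200 million input tokens per day just for the static part. Prompt caching addresses this directly: providers bill cached input tokens at a steep discount (Anthropic charges cache reads at 10% of the base input price), so the cost of the repeated portion can drop by up to 90% for repetitive workloads.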

Understanding Prompt Caching

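Prompt caching works by recognizing when the beginning of a prompt exactly matches the beginning of a recent request. Instead of processing those tokens again, the provider reuses the computation it already did for that prefix. Only the prefix counts: a match in the middle or at the end of a prompt does not help.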
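To make the prefix idea concrete, here is a toy sketch that counts how many leading tokens two prompts share. The function name and the word-level "tokens" are illustrative assumptions; real providers match on their own tokenization and internal cache keys.

```typescript
// Toy model of prefix matching: count how many leading tokens two prompts share.
// The token arrays are hypothetical stand-ins for real tokenizer output.
function sharedPrefixLength(
  previous: readonly string[],
  current: readonly string[]
): number {
  const limit = Math.min(previous.length, current.length);
  let i = 0;
  while (i < limit && previous[i] === current[i]) {
    i++;
  }
  return i;
}

const earlier = ["You", "are", "a", "support", "agent", ".", "Where", "is", "my", "order", "?"];
const later = ["You", "are", "a", "support", "agent", ".", "Cancel", "my", "plan", "."];

// The first six tokens match, so only the remaining four need fresh processing.
const reusable = sharedPrefixLength(earlier, later);
const freshTokens = later.length - reusable;
console.log(reusable, freshTokens); // 6 4
```

Note that "my" appears in both prompts after the point of divergence, but it does not count: once the prefix breaks, nothing after it can be reused.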

Explicit caching (Anthropic): You mark the end of the cacheable prefix with a cache_control breakpoint, and everything up to and including that block is cached.

Automatic caching (OpenAI): The provider automatically caches repeated prefixes.

Anthropic's Cache Control

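import Anthropic from '@anthropic-ai/sdk';

// Reads ANTHROPIC_API_KEY from the environment by default.
const anthropic = new Anthropic();

// systemPrompt and userMessage are assumed to be defined elsewhere.
const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 1024, // required by the Messages API
  system: [{
    type: 'text',
    text: systemPrompt,
    // Cache everything up to and including this block.
    cache_control: { type: 'ephemeral' }
  }],
  messages: [{ role: 'user', content: userMessage }]
});

// Write tokens are non-zero when the prefix was just cached; read tokens are non-zero on a hit.
console.log('Cache write tokens:', response.usage.cache_creation_input_tokens);
console.log('Cache read tokens:', response.usage.cache_read_input_tokens);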

OpenAI's Automatic Caching

OpenAI automatically caches prompts of 1,024 tokens or longer. No code changes are required. Cache hits occur in 128-token increments from the start of the prompt, so the cached portion of a request is always 1,024, 1,152, 1,280 tokens, and so on.
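The threshold and increment described above suggest a simple rule of thumb for how much of a shared prefix can be served from cache. The sketch below is a back-of-the-envelope model, not OpenAI's actual algorithm: the function and constant names are hypothetical, and it treats 1,024 tokens as the minimum cacheable length.

```typescript
// Rough estimate of cacheable tokens under a 1,024-token minimum
// and 128-token increments. Names and logic are illustrative assumptions.
const MIN_CACHEABLE_TOKENS = 1024;
const CACHE_INCREMENT = 128;

function estimateCachedTokens(sharedPrefixTokens: number): number {
  if (sharedPrefixTokens < MIN_CACHEABLE_TOKENS) {
    return 0; // Prefix too short to be cached at all.
  }
  // Round down to the nearest 128-token boundary.
  return Math.floor(sharedPrefixTokens / CACHE_INCREMENT) * CACHE_INCREMENT;
}

console.log(estimateCachedTokens(900));  // 0
console.log(estimateCachedTokens(1024)); // 1024
console.log(estimateCachedTokens(2000)); // 1920
```

To see what was actually cached, check usage.prompt_tokens_details.cached_tokens on a Chat Completions response rather than relying on an estimate like this one.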

Designing Prompts for Maximum Cache Hits

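Structure your prompts with stable content first, variable content last. A cache hit requires an identical prefix, so anything that changes per request (a timestamp, a user name, the retrieved documents) invalidates everything that follows it. For RAG applications, consider caching your entire knowledge base as part of the system prompt if it fits in context.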
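One way to enforce that ordering is to build every request through a single helper. The sketch below assumes Anthropic-style system blocks; buildCachedRequest and the two type names are hypothetical, and the local types only approximate what the SDK expects.

```typescript
// Approximate shape of a system block in an Anthropic Messages API request.
type SystemBlock = {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
};

type CachedRequest = {
  system: SystemBlock[];
  messages: { role: "user"; content: string }[];
};

// Hypothetical helper: stable content goes first, and the cache breakpoint
// sits on the last stable block so the whole stable prefix can be reused.
function buildCachedRequest(
  instructions: string,
  knowledgeBase: string,
  question: string
): CachedRequest {
  return {
    system: [
      { type: "text", text: instructions },
      { type: "text", text: knowledgeBase, cache_control: { type: "ephemeral" } },
    ],
    // Variable content comes last so it never changes the cached prefix.
    messages: [{ role: "user", content: question }],
  };
}

const first = buildCachedRequest("You are a support agent.", "FAQ: ...", "Where is my order?");
const second = buildCachedRequest("You are a support agent.", "FAQ: ...", "How do I cancel?");

// The system blocks are identical across requests; only the user message differs.
const samePrefix = JSON.stringify(first.system) === JSON.stringify(second.system);
console.log(samePrefix); // true
```

The result can be spread into a messages.create call alongside model and max_tokens. Keeping the helper free of per-request values makes it harder to break the cache by accident.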

Measuring Cache Performance

Track three metrics: cache hit rate (target above 80%), cache write rate (below 5%), and cost reduction (above 50% for eligible workloads).
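As a rough sketch, the first two metrics can be computed from the usage fields Anthropic returns on each response. Defining the rates as token-weighted shares of total input is an assumption here, since the targets above could equally be measured per request; summarizeCacheUsage and the CacheUsage type are hypothetical names.

```typescript
// Usage fields reported on Anthropic Messages API responses.
type CacheUsage = {
  input_tokens: number;                // input tokens not served from or written to cache
  cache_creation_input_tokens: number; // tokens written to the cache
  cache_read_input_tokens: number;     // tokens served from the cache
};

// Token-weighted hit and write rates over a batch of responses.
function summarizeCacheUsage(usages: CacheUsage[]): { hitRate: number; writeRate: number } {
  let read = 0;
  let written = 0;
  let uncached = 0;
  for (const u of usages) {
    read += u.cache_read_input_tokens;
    written += u.cache_creation_input_tokens;
    uncached += u.input_tokens;
  }
  const total = read + written + uncached;
  if (total === 0) {
    return { hitRate: 0, writeRate: 0 }; // Avoid dividing by zero on an empty batch.
  }
  return { hitRate: read / total, writeRate: written / total };
}

// Four requests: the first writes a 2,000-token prefix, the next three read it.
const sample: CacheUsage[] = [
  { input_tokens: 500, cache_creation_input_tokens: 2000, cache_read_input_tokens: 0 },
  { input_tokens: 500, cache_creation_input_tokens: 0, cache_read_input_tokens: 2000 },
  { input_tokens: 500, cache_creation_input_tokens: 0, cache_read_input_tokens: 2000 },
  { input_tokens: 500, cache_creation_input_tokens: 0, cache_read_input_tokens: 2000 },
];

const summary = summarizeCacheUsage(sample);
console.log(summary.hitRate.toFixed(2), summary.writeRate.toFixed(2)); // 0.60 0.20
```

In production the batch would come from logged responses rather than a hard-coded array, and the rates settle only after enough traffic for the initial cache writes to stop dominating.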

Cost Analysis Example

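Customer service app: a 3,000-token system prompt and 50,000 daily requests, or 150 million system-prompt tokens per day. Without caching: $615/day. With a 90% cache hit rate: $205.50/day. Savings: $409.50/day (67% reduction). The exact figures depend on the provider's per-token prices and on how cache misses are billed; on Anthropic, for example, a miss that writes the cache costs 1.25x the base input rate.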
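To rerun this kind of estimate with your own numbers, a small cost model helps. The sketch below prices only the system-prompt tokens, and its rates are assumptions: a $3.00 per million token base price with Anthropic's 1.25x cache-write and 0.1x cache-read multipliers. The example above does not state which rates it uses, so the output is not expected to reproduce its totals. The function and type names are hypothetical.

```typescript
// Prices in USD per million input tokens (illustrative assumptions).
type InputPricing = { base: number; cacheWrite: number; cacheRead: number };

const ASSUMED_PRICING: InputPricing = { base: 3.0, cacheWrite: 3.75, cacheRead: 0.3 };

// Daily cost of the system prompt alone, with and without caching.
function dailySystemPromptCost(
  promptTokens: number,
  requestsPerDay: number,
  hitRate: number,
  pricing: InputPricing
): { uncached: number; cached: number } {
  const millionsOfTokens = (promptTokens * requestsPerDay) / 1e6;
  const uncached = millionsOfTokens * pricing.base;
  // Hits pay the read price; misses are assumed to rewrite the cache.
  const blendedPrice = hitRate * pricing.cacheRead + (1 - hitRate) * pricing.cacheWrite;
  const cached = millionsOfTokens * blendedPrice;
  return { uncached, cached };
}

const estimate = dailySystemPromptCost(3000, 50000, 0.9, ASSUMED_PRICING);
console.log(estimate.uncached.toFixed(2)); // 450.00
console.log(estimate.cached.toFixed(2));   // 96.75
```

User messages and output tokens are billed on top of this and are unaffected by caching, which is why the reduction on the total bill is smaller than the reduction on the cached portion.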

Prompt caching is one of the highest-leverage optimizations for LLM applications at scale. The implementation cost is low—a few hours of work—and the savings can be dramatic.