LLM API costs spiral quickly at scale. A 2,000-token system prompt repeated across 100,000 daily requests adds up to 200 million input tokens per day for the static part alone. Prompt caching addresses this directly: because cached input tokens are billed at a steep discount, it can cut the cost of the repeated portion by as much as 90% for repetitive workloads.
Understanding Prompt Caching
Prompt caching works by recognizing when the beginning of a prompt matches a recent request. Instead of processing those tokens again, the provider reuses cached computations.
Explicit caching (Anthropic): You mark which parts should be cached.
Automatic caching (OpenAI): The provider automatically caches repeated prefixes.
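To make the prefix-matching idea concrete, here is a minimal sketch. It is a toy model, not how any provider implements caching: `tokenize` and `sharedPrefixLength` are hypothetical helpers, and a "token" is just a whitespace-separated word rather than the output of a real tokenizer.

```typescript
// Toy tokenizer (assumption): splits on whitespace instead of using a real tokenizer.
function tokenize(text: string): string[] {
  return text.split(/\s+/).filter((t) => t.length > 0);
}

// Counts the leading tokens two prompts have in common.
// Only this shared prefix could be served from a cache.
function sharedPrefixLength(a: string[], b: string[]): number {
  const limit = Math.min(a.length, b.length);
  let i = 0;
  while (i < limit && a[i] === b[i]) i++;
  return i;
}

const previous = tokenize("You are a support agent. Be concise. Question: where is my order?");
const sameStart = tokenize("You are a support agent. Be concise. Question: how do I get a refund?");
const changedStart = tokenize("Today is Monday. You are a support agent. Be concise. Question: where is my order?");

console.log(sharedPrefixLength(previous, sameStart));    // 8
console.log(sharedPrefixLength(previous, changedStart)); // 0
```

The second comparison shows why a single change at the start of a prompt matters: nothing after the first mismatch counts as shared, even though most of the text is identical.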
Anthropic's Cache Control
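import Anthropic from '@anthropic-ai/sdk';

// Reads ANTHROPIC_API_KEY from the environment by default.
const anthropic = new Anthropic();

const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 1024, // required by the Messages API
  system: [{
    type: 'text',
    text: systemPrompt, // the long, static prompt, defined elsewhere
    // Marks the prompt up to and including this block as cacheable.
    cache_control: { type: 'ephemeral' }
  }],
  messages: [{ role: 'user', content: userMessage }]
});

// A non-zero value means part of the prompt was served from the cache.
console.log('Cache read tokens:', response.usage.cache_read_input_tokens);

OpenAI's Automatic Caching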
OpenAI automatically caches prompts of 1,024 tokens or more. No code changes are required. Cache hits apply to the longest matching prefix, starting at 1,024 tokens and growing in 128-token increments.
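The threshold and increment rule can be sketched as a small calculation. This is an approximation of the documented behavior, not OpenAI's implementation: `maxCachedTokens` is a hypothetical helper that returns an upper bound on how many tokens could come from the cache for a given matching prefix. The actual figure for a request is reported in the response under `usage.prompt_tokens_details.cached_tokens`.

```typescript
const MIN_CACHEABLE_TOKENS = 1024; // minimum length before caching applies
const CACHE_INCREMENT = 128;       // cached prefix grows in steps of this size

// Upper bound on cached tokens, given how many leading tokens of the
// new prompt match an earlier request (hypothetical helper).
function maxCachedTokens(matchingPrefixTokens: number): number {
  if (matchingPrefixTokens < MIN_CACHEABLE_TOKENS) return 0;
  // Round down to the nearest 128-token boundary.
  return Math.floor(matchingPrefixTokens / CACHE_INCREMENT) * CACHE_INCREMENT;
}

console.log(maxCachedTokens(1000)); // 0
console.log(maxCachedTokens(1100)); // 1024
console.log(maxCachedTokens(2000)); // 1920
```

One practical consequence: a static prefix just under 1,024 tokens gets no benefit at all, so it can be worth checking the token count of a system prompt that sits near the threshold.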
Designing Prompts for Maximum Cache Hits
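Structure your prompts with stable content first and variable content last, because a cache hit requires an exact match from the start of the prompt. Anything that changes per request (user input, timestamps, retrieved snippets) belongs after the static part. For RAG applications, consider caching your entire knowledge base as part of the system prompt if it fits in context.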
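One way to enforce that ordering is to build every request through a single helper. The sketch below is an assumption about how such a helper might look: `buildPrompt`, `TextBlock`, and `PromptParts` are hypothetical names, the sample strings are placeholders, and the block shapes simply mirror the Anthropic request shown earlier.

```typescript
// Block shape modeled on the Anthropic request earlier in this article.
type TextBlock = {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
};

interface PromptParts {
  system: TextBlock[];
  messages: { role: "user"; content: string }[];
}

// Hypothetical helper: keeps stable content in a fixed order at the front.
function buildPrompt(instructions: string, knowledgeBase: string, userMessage: string): PromptParts {
  return {
    // Stable content first. The cache marker sits on the last stable block,
    // so the prompt up to and including it forms the reusable prefix.
    system: [
      { type: "text", text: instructions },
      { type: "text", text: knowledgeBase, cache_control: { type: "ephemeral" } },
    ],
    // Variable content last, after the cached prefix.
    messages: [{ role: "user", content: userMessage }],
  };
}

const first = buildPrompt("You are a support agent.", "Placeholder knowledge base text.", "Where is my order?");
const second = buildPrompt("You are a support agent.", "Placeholder knowledge base text.", "How do I get a refund?");

// The static part is byte-for-byte identical across requests.
console.log(JSON.stringify(first.system) === JSON.stringify(second.system)); // true
```

Routing all requests through one function also guards against accidental drift, such as a stray timestamp or reordered block that would silently change the prefix.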
Measuring Cache Performance
Track three metrics: cache hit rate (target above 80%), cache write rate (below 5%), and cost reduction (above 50% for eligible workloads).
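The article does not define these rates precisely, so the following sketch assumes a token-level definition: hit rate is the share of all input tokens read from the cache, and write rate is the share written to it. `cacheMetrics` is a hypothetical helper and the sample numbers are made up for illustration. The field names follow the `usage` object in Anthropic responses.

```typescript
// Field names follow the usage object returned by Anthropic's Messages API.
interface Usage {
  input_tokens: number;                // input tokens not involved in caching
  cache_creation_input_tokens: number; // tokens written to the cache
  cache_read_input_tokens: number;     // tokens served from the cache
}

interface CacheMetrics {
  hitRate: number;   // cached reads / all input tokens (assumed definition)
  writeRate: number; // cache writes / all input tokens (assumed definition)
}

function cacheMetrics(usages: Usage[]): CacheMetrics {
  let read = 0;
  let written = 0;
  let uncached = 0;
  for (const u of usages) {
    read += u.cache_read_input_tokens;
    written += u.cache_creation_input_tokens;
    uncached += u.input_tokens;
  }
  const total = read + written + uncached;
  if (total === 0) return { hitRate: 0, writeRate: 0 }; // avoid dividing by zero
  return { hitRate: read / total, writeRate: written / total };
}

// Illustrative data: one request writes the cache, nine later requests read it.
const sampleUsages: Usage[] = [
  { input_tokens: 50, cache_creation_input_tokens: 3000, cache_read_input_tokens: 0 },
  ...Array.from({ length: 9 }, () => ({
    input_tokens: 50,
    cache_creation_input_tokens: 0,
    cache_read_input_tokens: 3000,
  })),
];

const metrics = cacheMetrics(sampleUsages);
console.log(`hit rate ${(metrics.hitRate * 100).toFixed(1)}%, write rate ${(metrics.writeRate * 100).toFixed(1)}%`);
```

A request-level definition (fraction of requests with any cache read) is an equally reasonable choice; whichever one you pick, apply it consistently when comparing against the targets.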
Cost Analysis Example
Consider a customer service app with a 3,000-token system prompt and 50,000 daily requests. Without caching, it costs $615/day. With a 90% cache hit rate, the cost drops to $205.50/day, a saving of $409.50/day (a 67% reduction).
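To run this kind of estimate for your own workload, a small calculator helps. The sketch below covers only the cached prefix, so it will not reproduce request-level totals that also include user messages and output tokens. All prices are assumptions for illustration ($3 per million input tokens, cache writes at 1.25x and cache reads at 0.1x the base price); check your provider's current price list before relying on them. `dailyPrefixCost` and `Pricing` are hypothetical names.

```typescript
interface Pricing {
  inputPerMTok: number;         // USD per million uncached input tokens (assumed)
  cacheWriteMultiplier: number; // price factor for tokens written to the cache (assumed)
  cacheReadMultiplier: number;  // price factor for tokens read from the cache (assumed)
}

const assumedPricing: Pricing = {
  inputPerMTok: 3.0,
  cacheWriteMultiplier: 1.25,
  cacheReadMultiplier: 0.1,
};

// Daily cost of the static prefix only. A miss is billed as a cache write,
// a hit as a cache read. hitRate = 0 with no caching is handled separately.
function dailyPrefixCost(prefixTokens: number, requestsPerDay: number, hitRate: number, p: Pricing): number {
  const totalMTok = (prefixTokens * requestsPerDay) / 1e6;
  const hitCost = totalMTok * hitRate * p.inputPerMTok * p.cacheReadMultiplier;
  const missCost = totalMTok * (1 - hitRate) * p.inputPerMTok * p.cacheWriteMultiplier;
  return hitCost + missCost;
}

// Baseline: every prefix token billed at the normal input price.
function dailyUncachedCost(prefixTokens: number, requestsPerDay: number, p: Pricing): number {
  return ((prefixTokens * requestsPerDay) / 1e6) * p.inputPerMTok;
}

const uncached = dailyUncachedCost(3000, 50000, assumedPricing);
const cached = dailyPrefixCost(3000, 50000, 0.9, assumedPricing);
console.log(`prefix without caching: $${uncached.toFixed(2)}/day`);
console.log(`prefix with 90% hits:   $${cached.toFixed(2)}/day`);
```

Because misses are billed at a premium under these assumptions, a low hit rate can make caching slightly more expensive than no caching at all, which is one reason to measure the hit rate before and after enabling it.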
Prompt caching is one of the highest-leverage optimizations for LLM applications at scale. The implementation cost is low (a few hours of work), and the savings can be dramatic.
