Modern LLMs boast impressive context windows: Claude offers 200K tokens, Gemini pushes into the millions. But "just dump everything in" is not a strategy. Models exhibit degraded performance on long, unfocused contexts, costs scale linearly with input tokens, and more context does not mean better results.
Understanding Context Window Economics
// Context cost analysis (assumes a countTokens helper, e.g. a tokenizer wrapper)
function analyzeContextCosts(systemPrompt, context, responseLength, dailyRequests) {
  const systemTokens = countTokens(systemPrompt)
  const contextTokens = countTokens(context)
  const promptTokens = systemTokens + contextTokens

  // Claude 3.5 Sonnet pricing: $3 per million input tokens, $15 per million output
  const inputCost = (promptTokens / 1_000_000) * 3.00
  const outputCost = (responseLength / 1_000_000) * 15.00
  const costPerRequest = inputCost + outputCost

  return { promptTokens, monthlyCost: costPerRequest * dailyRequests * 30 }
}

// At 10K requests/day with a 500-token response:
//   2.5K prompt tokens → ~$4,500/month
//   50K prompt tokens  → ~$47,250/month (roughly 10x the cost!)
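The monthly figures above follow directly from the per-million-token rates. A self-contained sketch of just that arithmetic (the function name and parameters are illustrative):

```javascript
// Worked cost math: $3 per million input tokens, $15 per million output tokens.
function monthlyCost(promptTokens, responseTokens, dailyRequests) {
  const perRequest =
    (promptTokens / 1_000_000) * 3.00 +
    (responseTokens / 1_000_000) * 15.00
  return perRequest * dailyRequests * 30
}

console.log(monthlyCost(2_500, 500, 10_000))  // ≈ $4,500/month
console.log(monthlyCost(50_000, 500, 10_000)) // ≈ $47,250/month
```

Note that at a 50K-token prompt the input side dwarfs the output side, which is why prompt size, not response length, dominates the bill.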
Chunking and Summarization
// Recursive summarization for very long documents
class HierarchicalSummarizer {
  async summarize(document, targetTokens) {
    const currentTokens = countTokens(document)
    if (currentTokens <= targetTokens) return document

    // Split into coherent chunks and summarize each in parallel
    const chunks = new SemanticChunker().chunk(document)
    const summaries = await Promise.all(
      chunks.map(chunk => this.summarizeChunk(chunk.text))
    )

    // Recurse until the combined summary fits the budget
    // (assumes summarizeChunk reliably shrinks its input; otherwise this won't terminate)
    const combined = summaries.join('\n\n')
    if (countTokens(combined) > targetTokens) {
      return this.summarize(combined, targetTokens)
    }
    return combined
  }
}
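The summarizer above relies on a `SemanticChunker` the excerpt does not define. A minimal stand-in that splits on paragraph boundaries and packs paragraphs into token-budgeted chunks, approximating token counts as characters/4 (a rough rule of thumb; use a real tokenizer in production):

```javascript
// Rough token estimate: ~4 characters per token. An assumption, not a tokenizer.
const estimateTokens = (text) => Math.ceil(text.length / 4)

// Paragraph-based chunker: a simple stand-in for true semantic chunking.
class SemanticChunker {
  constructor(maxChunkTokens = 500) {
    this.maxChunkTokens = maxChunkTokens
  }

  chunk(document) {
    const paragraphs = document.split(/\n{2,}/).filter(p => p.trim())
    const chunks = []
    let current = []
    let currentTokens = 0

    for (const para of paragraphs) {
      const paraTokens = estimateTokens(para)
      // Flush the current chunk if adding this paragraph would exceed the budget
      if (current.length && currentTokens + paraTokens > this.maxChunkTokens) {
        chunks.push({ text: current.join('\n\n'), tokens: currentTokens })
        current = []
        currentTokens = 0
      }
      current.push(para)
      currentTokens += paraTokens
    }
    if (current.length) chunks.push({ text: current.join('\n\n'), tokens: currentTokens })
    return chunks
  }
}
```

Splitting on paragraph boundaries keeps each chunk self-contained enough to summarize independently, which is what the parallel `Promise.all` above depends on.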
Dynamic Context: Retrieving What's Relevant
class RAGContextManager {
  async retrieve(query, maxTokens) {
    const queryEmbedding = await this.embed(query)
    const results = await this.vectorStore.search({ embedding: queryEmbedding, topK: 20 })

    // Greedily select the highest-scoring chunks that fit the token budget
    const selectedChunks = []
    let usedTokens = 0
    for (const result of results) {
      const chunkTokens = countTokens(result.text)
      if (usedTokens + chunkTokens <= maxTokens) {
        selectedChunks.push(result)
        usedTokens += chunkTokens
      }
    }

    // Rerank the selected chunks so the most relevant appear first in the prompt
    return { chunks: await this.rerank(query, selectedChunks), totalTokens: usedTokens }
  }
}
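The greedy budget-packing step is the part worth testing in isolation. A self-contained sketch (the `packChunks` helper and the chars/4 token estimate are illustrative, not part of the class above):

```javascript
// Rough token estimate (~4 chars/token); swap in a real tokenizer in production.
const estimateTokens = (text) => Math.ceil(text.length / 4)

// Greedily pack chunks into a token budget.
// Assumes `results` is already sorted by relevance, highest first.
function packChunks(results, maxTokens) {
  const selected = []
  let used = 0
  for (const r of results) {
    const t = estimateTokens(r.text)
    if (used + t <= maxTokens) {
      selected.push(r)
      used += t
    }
  }
  return { selected, used }
}
```

Note the greedy loop skips an oversized chunk but keeps scanning, so a smaller, lower-ranked chunk can still fill remaining budget; whether that trade-off is acceptable depends on how steeply relevance falls off in your result list.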
Conversation Memory Management
class SlidingWindowMemory {
  async getContext(memory, maxTokens = 4000) {
    const result = []
    let tokenCount = 0

    // Include a summary of the older conversation, if one exists
    if (memory.summary) {
      result.push({ role: 'system', content: `Previous conversation summary: ${memory.summary}` })
      tokenCount += countTokens(memory.summary)
    }

    // Add recent messages from newest to oldest until the budget is exhausted
    // (slice() before reverse() avoids mutating memory.messages in place)
    for (const msg of memory.messages.slice().reverse()) {
      const msgTokens = countTokens(msg.content)
      if (tokenCount + msgTokens > maxTokens) break
      result.unshift(msg)
      tokenCount += msgTokens
    }
    return result
  }

  // Condense everything except the 10 most recent messages
  // (llm is assumed to be an injected client exposing a summarize helper)
  async condenseOldMessages(messages) {
    const oldMessages = messages.slice(0, -10)
    return await llm.summarize(oldMessages)
  }
}
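What the class leaves implicit is when condensation fires and how its output feeds back into `memory.summary`. A control-flow sketch of that loop, with the LLM call abstracted as an injected `summarizeFn` (the `compactMemory` name and shape are illustrative):

```javascript
// Keep memory bounded: once the transcript exceeds `keepRecent` messages,
// fold the older ones into a running summary and truncate the message list.
// `summarizeFn(messages) -> Promise<string>` is a stand-in for an LLM call.
async function compactMemory(memory, summarizeFn, keepRecent = 10) {
  if (memory.messages.length <= keepRecent) return memory

  const old = memory.messages.slice(0, -keepRecent)
  const newSummary = await summarizeFn(old)

  return {
    // Carry the previous summary forward alongside the new one
    summary: [memory.summary, newSummary].filter(Boolean).join('\n'),
    messages: memory.messages.slice(-keepRecent),
  }
}
```

Running this after each turn (or every N turns) keeps `getContext` fast and the token budget predictable, since the window never grows past the summary plus the last `keepRecent` messages.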
Key Takeaways
More tokens ≠ better results. Quality often degrades with long, unfocused context.
Retrieve dynamically. Use RAG to include only what's relevant to the current query.
Summarize strategically. Hierarchical summarization preserves information while reducing tokens.
Manage conversation memory. Sliding windows plus summaries keep context focused and bounded.
