Context Windows: Strategies for Working Within LLM Limits


March 21, 2026 · 2 min read

Modern LLMs boast impressive context windows—Claude offers 200K tokens, Gemini pushes to millions. But "just dump everything in" is not a strategy. Models often lose track of information buried in long, unfocused contexts, costs scale linearly with input tokens, and more context doesn't mean better results.

Understanding Context Window Economics

// Context cost analysis
function analyzeContextCosts(systemPrompt, context, responseLength, dailyRequests) {
  const systemTokens = countTokens(systemPrompt)
  const contextTokens = countTokens(context)
  const promptTokens = systemTokens + contextTokens
  
  // Claude 3.5 Sonnet pricing
  const inputCost = (promptTokens / 1_000_000) * 3.00
  const outputCost = (responseLength / 1_000_000) * 15.00
  const costPerRequest = inputCost + outputCost
  
  return { promptTokens, monthlyCost: costPerRequest * dailyRequests * 30 }
}

// At 10K requests/day with Claude 3.5 Sonnet pricing ($3/M input tokens):
// 2.5K prompt tokens: ~$2,250/month in input costs alone
// 50K prompt tokens: ~$45,000/month (20x the cost!)
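The `countTokens` helper used throughout this post is never defined. A minimal sketch using the common "roughly four characters per token" heuristic for English text—for anything beyond estimation, use a real tokenizer library (e.g. the js-tiktoken package) instead:

```javascript
// Rough token estimate: English text averages ~4 characters per token.
// Good enough for budgeting; not exact for any specific model.
function countTokens(text) {
  return Math.ceil(text.length / 4)
}
```
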

Chunking and Summarization

// Recursive summarization for very long documents
class HierarchicalSummarizer {
  async summarize(document, targetTokens) {
    const currentTokens = countTokens(document)
    if (currentTokens <= targetTokens) return document
    
    const chunks = new SemanticChunker().chunk(document)
    const summaries = await Promise.all(
      chunks.map(chunk => this.summarizeChunk(chunk.text))
    )
    
    const combined = summaries.join('\n\n')
    if (countTokens(combined) > targetTokens) {
      return this.summarize(combined, targetTokens)
    }
    return combined
  }
}
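The summarizer above relies on a `SemanticChunker` that the post doesn't define. A minimal paragraph-based sketch: real semantic chunkers often use embeddings to detect topic boundaries, but greedily packing paragraphs under a token budget is the simplest useful approximation.

```javascript
// Minimal stand-in for SemanticChunker: split on paragraph boundaries,
// then greedily pack paragraphs into chunks under a token budget.
class SemanticChunker {
  constructor(maxChunkTokens = 500) {
    this.maxChunkTokens = maxChunkTokens
  }

  chunk(document) {
    const paragraphs = document.split(/\n\n+/)
    const chunks = []
    let current = []
    let currentTokens = 0

    for (const para of paragraphs) {
      const paraTokens = Math.ceil(para.length / 4) // rough token estimate
      // Flush the current chunk if adding this paragraph would overflow it
      if (currentTokens + paraTokens > this.maxChunkTokens && current.length > 0) {
        chunks.push({ text: current.join('\n\n') })
        current = []
        currentTokens = 0
      }
      current.push(para)
      currentTokens += paraTokens
    }
    if (current.length > 0) chunks.push({ text: current.join('\n\n') })
    return chunks
  }
}
```
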

Dynamic Context: Retrieving What's Relevant

class RAGContextManager {
  async retrieve(query, maxTokens) {
    const queryEmbedding = await this.embed(query)
    const results = await this.vectorStore.search({ embedding: queryEmbedding, topK: 20 })
    
    // Greedily select chunks that fit
    const selectedChunks = []
    let usedTokens = 0
    
    for (const result of results) {
      const chunkTokens = countTokens(result.text)
      if (usedTokens + chunkTokens <= maxTokens) {
        selectedChunks.push(result)
        usedTokens += chunkTokens
      }
    }
    
    return { chunks: await this.rerank(query, selectedChunks), totalTokens: usedTokens }
  }
}
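The `rerank` step above is also left undefined. A lightweight lexical stand-in that orders retrieved chunks by keyword overlap with the query—production systems typically use a dedicated cross-encoder reranking model instead, but this illustrates the shape of the operation:

```javascript
// Lexical reranker: score each chunk by the fraction of its terms that
// appear in the query, then sort descending. A cheap proxy for a real
// cross-encoder reranker.
function rerankByOverlap(query, chunks) {
  const queryTerms = new Set(query.toLowerCase().split(/\W+/).filter(Boolean))
  return [...chunks]
    .map(chunk => {
      const terms = chunk.text.toLowerCase().split(/\W+/).filter(Boolean)
      const overlap = terms.filter(t => queryTerms.has(t)).length
      return { ...chunk, score: overlap / Math.max(terms.length, 1) }
    })
    .sort((a, b) => b.score - a.score)
}
```
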

Conversation Memory Management

class SlidingWindowMemory {
  async getContext(memory, maxTokens = 4000) {
    let result = []
    let tokenCount = 0
    
    // Include summary of older conversation
    if (memory.summary) {
      result.push({ role: 'system', content: `Previous conversation summary: ${memory.summary}` })
      tokenCount += countTokens(memory.summary)
    }
    
    // Add recent messages from newest to oldest until budget exhausted
    for (const msg of [...memory.messages].reverse()) { // copy first: reverse() mutates in place
      const msgTokens = countTokens(msg.content)
      if (tokenCount + msgTokens > maxTokens) break
      result.unshift(msg)
      tokenCount += msgTokens
    }
    
    return result
  }
  
  async condenseOldMessages(messages) {
    const oldMessages = messages.slice(0, -10)
    return await llm.summarize(oldMessages)
  }
}
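The budget-trimming loop is the heart of the sliding window. A standalone sketch of the same newest-to-oldest walk, using a simple word-count proxy for tokens so it runs without a tokenizer:

```javascript
// Keep the most recent messages that fit within a token budget,
// walking newest-to-oldest without mutating the input array.
function trimToBudget(messages, maxTokens) {
  const count = text => text.split(/\s+/).filter(Boolean).length // word-count proxy
  const result = []
  let used = 0
  for (let i = messages.length - 1; i >= 0; i--) {
    const t = count(messages[i].content)
    if (used + t > maxTokens) break
    result.unshift(messages[i]) // prepend to preserve chronological order
    used += t
  }
  return result
}
```
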

Key Takeaways

More tokens ≠ better results. Quality often degrades with long, unfocused context.

Retrieve dynamically. Use RAG to include only what's relevant to the current query.

Summarize strategically. Hierarchical summarization preserves information while reducing tokens.

Manage conversation memory. Sliding windows plus summaries keep context focused and bounded.
