When RAG first gained popularity, the promise was simple: embed your documents, store them in a vector database, and let the LLM answer questions with your data as context. The reality proved more complex. Naive RAG implementations often retrieve irrelevant chunks, miss critical information, and produce confident-sounding but wrong answers.
The RAG systems succeeding in production today look very different from those early experiments.
Limitations of Pure Vector Search
Vector search has fundamental limitations:
Semantic Similarity ≠ Relevance: Searching for "What is our refund policy?" might return an HR policy about employee expense refunds instead of the customer-facing policy—semantically similar but contextually wrong.
Keyword Blind Spots: Embedding models sometimes miss exact matches that keyword search would catch. Searching "Error code E-4521" might miss the specific troubleshooting document.
Hybrid Retrieval: Dense and Sparse Combined
```typescript
interface Options {
  topK: number
  // alpha weights dense vs. sparse: 1 = dense only, 0 = sparse only
  alpha: number
}

async function hybridSearch(query: string, { topK, alpha }: Options) {
  // Dense (vector) search
  const queryEmbedding = await embed(query)
  const denseResults = await vectorDb.search(queryEmbedding, topK * 2)

  // Sparse (keyword) search - BM25
  const sparseResults = await bm25Search(query, topK * 2)

  // Reciprocal Rank Fusion: score = weight / (k + rank), with k = 60
  const fusedScores = new Map<string, number>()
  const docsById = new Map<string, { id: string; content: string }>()

  denseResults.forEach((result, rank) => {
    const rrfScore = (1 / (60 + rank)) * alpha
    fusedScores.set(result.id, (fusedScores.get(result.id) || 0) + rrfScore)
    docsById.set(result.id, result)
  })
  sparseResults.forEach((result, rank) => {
    const rrfScore = (1 / (60 + rank)) * (1 - alpha)
    fusedScores.set(result.id, (fusedScores.get(result.id) || 0) + rrfScore)
    docsById.set(result.id, result)
  })

  // Return the top-K documents (with content) sorted by fused score
  return Array.from(fusedScores.entries())
    .sort((a, b) => b[1] - a[1])
    .slice(0, topK)
    .map(([id, score]) => ({ ...docsById.get(id)!, score }))
}
```

Reranking: The Secret to Better Results
Retrieval gets candidates; reranking selects the best ones:
```typescript
async function ragPipeline(query: string) {
  // Stage 1: Retrieve candidates (over-fetch)
  const candidates = await hybridSearch(query, { topK: 20, alpha: 0.5 })

  // Stage 2: Rerank to find the truly relevant ones
  const reranked = await cohere.rerank({
    query,
    documents: candidates.map(c => c.content),
    topN: 5,
    model: 'rerank-english-v3.0'
  })

  // Stage 3: Build context from the reranked results and generate the answer
  const context = reranked.results
    .map(r => candidates[r.index].content)
    .join('\n\n')
  return generateAnswer(query, context)
}
```

Chunking Strategies That Work
Semantic Chunking: Break at semantic boundaries, not fixed sizes.
Overlapping Chunks: Include overlap to preserve context at boundaries.
Hierarchical Chunking: Store at multiple granularities—document summaries, section summaries, paragraphs.
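The overlapping strategy above can be sketched in a few lines. This is a minimal fixed-size chunker with overlap; the function name and the default chunk size and overlap are illustrative, not recommendations from the article:

```typescript
// Split text into fixed-size chunks, where each chunk repeats the last
// `overlap` characters of the previous one to preserve boundary context.
function chunkWithOverlap(text: string, chunkSize = 500, overlap = 100): string[] {
  const chunks: string[] = []
  const step = chunkSize - overlap // how far the window advances each iteration
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize))
    if (start + chunkSize >= text.length) break // final chunk reached the end
  }
  return chunks
}
```

A production chunker would break at sentence or section boundaries rather than raw character offsets, but the sliding-window-with-overlap structure stays the same.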
RAG Quality Metrics
```typescript
function calculateRetrievalMetrics(retrievedIds: string[], relevantIds: Set<string>) {
  const relevant = retrievedIds.filter(id => relevantIds.has(id))
  const firstRelevantRank = retrievedIds.findIndex(id => relevantIds.has(id))
  return {
    precision: relevant.length / retrievedIds.length,
    recall: relevant.length / relevantIds.size,
    // Reciprocal rank of the first relevant result; 0 if none was retrieved
    mrr: firstRelevantRank === -1 ? 0 : 1 / (firstRelevantRank + 1)
  }
}
```
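To make the formulas concrete, here is a worked example with hypothetical document ids (`d1`, `d3`, etc. are placeholders, not from a real corpus):

```typescript
const retrieved = ['d3', 'd1', 'd7']     // ranked ids returned by retrieval
const relevant = new Set(['d1', 'd5'])   // ground-truth relevant ids

const hits = retrieved.filter(id => relevant.has(id)) // ['d1']
const precision = hits.length / retrieved.length      // 1/3: one of three results is relevant
const recall = hits.length / relevant.size            // 1/2: one of two relevant docs retrieved
const firstHit = retrieved.findIndex(id => relevant.has(id))
const mrr = firstHit === -1 ? 0 : 1 / (firstHit + 1)  // 1/2: first hit is at rank 2
```

Note that precision and recall ignore ordering; MRR is what tells you the relevant document arrived at rank 2 rather than rank 1.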
```typescript
// LLM-as-judge for answer quality
async function evaluateAnswer(question: string, answer: string, groundTruth: string) {
  return llm.evaluate(`
    Evaluate the answer on a 1-5 scale for each criterion:
    - Faithfulness: Does the answer stick to the retrieved context?
    - Relevance: Does it address the question?
    - Correctness: How does it compare to the ground truth?

    Question: ${question}
    Answer: ${answer}
    Ground truth: ${groundTruth}
  `)
}
```

Conclusion
Production RAG isn't a one-time implementation—it's ongoing measurement, refinement, and optimization. Hybrid retrieval catches what pure vector search misses. Reranking dramatically improves relevance. Smart chunking preserves context. Metrics tell you what's actually working. Start with the basics, measure relentlessly, and iterate.



