Retrieval-Augmented Generation (RAG): Systems, Search, and Reranking
Large language models are strong pattern learners, but they are not perfect knowledge stores. Their training data is frozen at pretraining time, they can hallucinate, and they struggle with private or rapidly changing information. Retrieval-Augmented Generation (RAG) addresses this by combining a retriever (to fetch evidence) with a generator (to produce grounded answers).
The core idea is simple: retrieve first, then generate. The implementation details, however, determine whether your system is reliable, fast, and cost-effective.
What RAG is, formally
RAG was introduced by Lewis et al. (2020), where a model retrieves passages from a non-parametric memory and conditions generation on those passages. This separates knowledge storage (retrieval index) from language behavior (generator), making updates cheaper and improving factuality.
In production, most RAG systems follow this high-level pipeline:
- Ingestion: parse documents, clean text, chunk content, add metadata.
- Indexing: build lexical, vector, graph, or hybrid indexes.
- Query understanding: rewrite, expand, or decompose user questions.
- Retrieval: fetch candidate contexts from one or more sources.
- Reranking: reorder candidates with stronger relevance models.
- Generation: answer with citations and uncertainty-aware behavior.
- Evaluation: measure retrieval recall, answer faithfulness, latency, and cost.
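The pipeline above can be sketched as a thin orchestration layer. The sketch below is illustrative only: `retrieve`, `rerank`, and `generate` are placeholder functions standing in for the real components discussed in the rest of this article, not any specific library.

```python
# Minimal RAG pipeline sketch. Every function here is a placeholder that
# illustrates data flow; real systems swap in an actual retriever,
# reranker, and LLM client.

def retrieve(query: str, corpus: dict[str, str], k: int = 5) -> list[str]:
    """Toy first-stage retriever: rank documents by shared-term count."""
    q_terms = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda d: -len(q_terms & set(corpus[d].lower().split())),
    )
    return ranked[:k]

def rerank(query: str, doc_ids: list[str], corpus: dict[str, str]) -> list[str]:
    """Placeholder reranker: keeps the retriever's order unchanged."""
    return doc_ids

def generate(query: str, doc_ids: list[str], corpus: dict[str, str]) -> str:
    """Placeholder generator: concatenates cited evidence instead of calling an LLM."""
    evidence = " ".join(f"[{d}] {corpus[d]}" for d in doc_ids)
    return f"Q: {query}\nEvidence: {evidence}"

corpus = {
    "doc1": "BM25 is a lexical ranking function",
    "doc2": "Dense retrieval embeds queries and passages",
}
query = "what is BM25"
answer = generate(query, rerank(query, retrieve(query, corpus, k=1), corpus), corpus)
```

The value of structuring code this way is that each stage can be swapped or A/B-tested independently.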
Which tools can be used for retrieval?
There is no single best backend. Different data shapes need different retrieval mechanisms:
- Web search: useful for fresh public information and broad discovery. It can bootstrap context before deeper internal retrieval.
- Graph databases (for example, Neo4j): useful when relationships are first-class (entities, events, ownership, dependencies). Graph traversal can recover multi-hop facts that flat chunk retrieval may miss.
- PostgreSQL for SQL search: strong for structured filters, joins, permissions, and business constraints. With extensions (for example pgvector), PostgreSQL can also support hybrid lexical + vector retrieval in one operational stack.
In practice, high-performing systems are often multi-retriever: they retrieve from a web layer, a vector/keyword corpus, and a structured database, then merge and rerank.
Search techniques used in modern RAG
1) Lexical search (BM25 and variants)
Classic sparse retrieval (such as BM25) remains extremely strong for exact terms, rare entities, and identifiers. It is often the best first-stage retriever for enterprise content that has domain-specific vocabulary.
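To make the scoring concrete, here is a minimal Okapi BM25 implementation over whitespace-tokenized documents. It is a teaching sketch, not a production retriever (real systems add stemming, stopword handling, and inverted indexes); the smoothed IDF variant used here keeps scores non-negative.

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for d in tokenized:
        df.update(set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF (adds 1 inside the log to avoid negative weights).
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term-frequency saturation with document-length normalization.
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(score)
    return scores

docs = [
    "the quick brown fox",
    "error 0x80070005 during update",
    "release update notes",
]
scores = bm25_scores("0x80070005", docs)
```

Note how the rare identifier `0x80070005` pins the match exactly, which is precisely where lexical retrieval beats embeddings.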
2) Dense vector retrieval
Embed queries and chunks into a semantic space and retrieve nearest neighbors via ANN search (e.g., HNSW/IVF-style indexes). This improves paraphrase matching and concept-level recall.
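Under the hood, dense retrieval reduces to nearest-neighbor search over embedding vectors. The sketch below uses exhaustive cosine-similarity search over hand-written toy vectors; real systems obtain embeddings from a model and replace the exhaustive loop with an ANN index such as HNSW or IVF.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbors(query_vec: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Exhaustive nearest-neighbor search; ANN indexes approximate this at scale."""
    ranked = sorted(index, key=lambda doc_id: cosine(query_vec, index[doc_id]), reverse=True)
    return ranked[:k]

# Toy 3-d "embeddings"; real ones come from an embedding model.
index = {
    "refund-policy": [0.9, 0.1, 0.0],
    "api-reference": [0.1, 0.9, 0.2],
    "office-hours": [0.0, 0.2, 0.9],
}
top = nearest_neighbors([0.8, 0.2, 0.1], index, k=1)
```

The query vector is close to `refund-policy` in direction even though the raw values differ, which is the paraphrase-matching behavior the prose describes.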
3) Hybrid retrieval
Combine lexical and dense scores. Reciprocal Rank Fusion (RRF) and weighted score fusion are common because lexical and semantic retrievers have complementary failure modes.
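RRF is simple enough to show in full: each retriever contributes a score of 1/(k + rank) per document, and documents found by both retrievers rise to the top. The constant k = 60 is the value commonly used in the original RRF formulation.

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank(d))."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

lexical = ["d3", "d1", "d2"]   # e.g. BM25 order
dense = ["d1", "d4", "d3"]     # e.g. vector order
fused = rrf_fuse([lexical, dense])
```

Here `d1` wins the fused ranking because it appears near the top of both lists, even though neither retriever ranked it first, illustrating why fusion handles complementary failure modes well.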
4) Metadata and faceted filtering
Apply hard constraints before retrieval: tenant, access control, timestamp, source type, jurisdiction, language. This often boosts relevance more than any model upgrade.
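A pre-filter is just a hard predicate applied before any similarity scoring. The chunk schema below (`tenant`, `day` fields) is hypothetical, chosen only to illustrate tenancy and freshness constraints; in practice these filters run inside the database or vector store.

```python
def prefilter(chunks: list[dict], *, tenant: str, max_age_days: int, now_day: int) -> list[dict]:
    """Apply hard constraints (tenancy, freshness) before any similarity scoring."""
    return [
        c for c in chunks
        if c["tenant"] == tenant and (now_day - c["day"]) <= max_age_days
    ]

# Hypothetical chunk records with a tenant label and an ingestion day number.
chunks = [
    {"id": "a", "tenant": "acme", "day": 100},
    {"id": "b", "tenant": "other", "day": 100},   # wrong tenant: must never leak
    {"id": "c", "tenant": "acme", "day": 10},     # too stale
]
kept = prefilter(chunks, tenant="acme", max_age_days=30, now_day=110)
```

Note that access control is a correctness constraint, not a relevance heuristic: a cross-tenant chunk must be excluded no matter how semantically similar it is.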
5) Query rewriting and decomposition
Transform user questions into retrieval-friendly forms: acronym expansion, canonical entity names, multi-step decomposition, and sub-query generation. This is especially useful for long or ambiguous prompts.
6) Multi-hop and graph-based retrieval
For questions requiring reasoning across entities, graph expansion (e.g., Neo4j traversals) can produce better candidate context than independent chunk similarity.
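The core of graph-based retrieval is bounded traversal from seed entities. The sketch below does a plain breadth-first expansion over an in-memory adjacency map; a real deployment would issue an equivalent traversal query (e.g., Cypher in Neo4j) instead.

```python
from collections import deque

def expand(graph: dict[str, list[str]], seeds: list[str], hops: int = 2) -> set[str]:
    """Breadth-first expansion up to `hops` edges from the seed entities."""
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue  # hop budget exhausted along this path
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen

# Toy ownership graph: AcmeCo owns SubsidiaryB, which operates PlantX.
graph = {"AcmeCo": ["SubsidiaryB"], "SubsidiaryB": ["PlantX"]}
entities = expand(graph, ["AcmeCo"], hops=2)
```

A question like "which plants does AcmeCo ultimately operate?" needs the two-hop fact `AcmeCo → SubsidiaryB → PlantX`; chunk-level similarity to "AcmeCo" alone would likely never surface the `PlantX` document.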
Reranking techniques (the quality multiplier)
First-stage retrieval optimizes recall; reranking optimizes precision in top-k context. Strong RAG systems almost always rerank.
- Cross-encoder rerankers: score (query, passage) pairs jointly and usually outperform bi-encoder retrieval scores for top-k ordering.
- Late interaction models: ColBERT-style token-level matching can improve relevance while preserving some retrieval efficiency.
- LLM-as-reranker: prompt an LLM to judge passage relevance; often high quality but higher latency/cost and more variance.
- Diversity-aware reranking: Maximal Marginal Relevance (MMR) reduces redundancy so context windows contain complementary evidence.
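Of the techniques above, MMR is compact enough to show directly. This sketch uses hand-assigned similarity values in place of real model scores: `query_sim` stands in for retriever relevance and `pair_sim` for inter-passage similarity.

```python
def mmr(query_sim: dict[str, float], pair_sim, candidates: list[str],
        k: int = 2, lam: float = 0.7) -> list[str]:
    """Maximal Marginal Relevance: greedily trade off relevance vs. redundancy."""
    selected: list[str] = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def score(d: str) -> float:
            # Penalize similarity to anything already selected.
            redundancy = max((pair_sim(d, s) for s in selected), default=0.0)
            return lam * query_sim[d] - (1 - lam) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected

# Toy scores: d1 and d2 are near-duplicates; d3 is less relevant but novel.
query_sim = {"d1": 0.9, "d2": 0.85, "d3": 0.6}
dup = {("d1", "d2"), ("d2", "d1")}
pair_sim = lambda a, b: 0.95 if (a, b) in dup else 0.1
picked = mmr(query_sim, pair_sim, ["d1", "d2", "d3"], k=2)
```

Pure relevance ordering would pick the near-duplicates `d1` and `d2`; MMR instead selects `d1` and then `d3`, so the context window carries two distinct pieces of evidence.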
Reranking typically yields larger answer-quality gains than increasing generator size.
Generation techniques for grounded answers
- Citation prompting: force answer statements to map to retrieved passages.
- Context compression: summarize or select evidence spans before final generation to fit token budgets.
- Answer abstention: instruct the model to say "insufficient evidence" when retrieval confidence is low.
- Self-consistency checks: run secondary verification against retrieved evidence before final output.
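Citation prompting and abstention are usually implemented at prompt-assembly time. The function and wording below are illustrative, not a canonical template; the key ingredients are labeled passages, a per-claim citation requirement, and an explicit abstention instruction.

```python
def build_grounded_prompt(question: str, passages: list[tuple[str, str]]) -> str:
    """Assemble a prompt that demands per-claim citations and allows abstention."""
    context = "\n".join(f"[{pid}] {text}" for pid, text in passages)
    return (
        "Answer using ONLY the passages below. Cite each claim as [id]. "
        'If the passages are insufficient, reply "insufficient evidence".\n\n'
        f"Passages:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_grounded_prompt(
    "When was the refund policy updated?",
    [("p1", "The refund policy was updated in March 2024.")],
)
```

Keeping the passage identifiers stable from retrieval through generation is what makes citation correctness checkable afterwards.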
Evaluation: what to measure
End-to-end answer quality is important, but retrieval diagnostics are mandatory for iteration speed:
- Retrieval metrics: Recall@k, nDCG@k, MRR.
- Answer metrics: exactness, faithfulness/groundedness, citation correctness.
- Operational metrics: p95 latency, throughput, index freshness, and per-query cost.
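The retrieval metrics above are a few lines each, so there is little excuse not to compute them. This sketch implements binary-relevance versions of Recall@k, MRR (single-query reciprocal rank), and nDCG@k.

```python
import math

def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents found in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def mrr(ranked: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant hit (0 if none found)."""
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG: discounted gain normalized by the ideal ordering."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal

ranked = ["d2", "d7", "d1"]
relevant = {"d1", "d9"}
```

Averaging these per-query values over a labeled evaluation set gives the retrieval-level dashboard that the next paragraph argues is mandatory.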
Without retrieval-level observability, teams often misattribute errors to the generator when the actual issue is poor candidate recall.
Practical reference tools
If you want to compare interaction styles, retrieval depth, and answer-grounding behaviors, run the same benchmark prompts across several commercial assistants. For engineering work, treat such assistants as product references rather than fixed ground truth. Always validate with your own domain-specific evaluation set.
Key takeaway
RAG is not just "vector search + prompt." It is a full retrieval system: indexing strategy, query processing, multi-source search, reranking, grounded generation, and evaluation. Teams that invest in retrieval quality and reranking usually get the largest and most stable gains.
Selected papers and references
- Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS. arXiv:2005.11401.
- Karpukhin, V. et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. EMNLP. arXiv:2004.04906.
- Khattab, O., & Zaharia, M. (2020). ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR. arXiv:2004.12832.
- Nogueira, R., & Cho, K. (2019). Passage Re-ranking with BERT. arXiv:1901.04085.
- Johnson, J., Douze, M., & Jégou, H. (2019). Billion-scale similarity search with GPUs. IEEE Transactions on Big Data.
- Robertson, S., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval.
- Chen, D. et al. (2017). Reading Wikipedia to Answer Open-Domain Questions (DrQA). ACL. arXiv:1704.00051.
- Gao, Y. et al. (2023). Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997.