Retrieval-Augmented Generation (RAG) has become the dominant pattern for grounding LLM responses in real data. If you're building an agent that needs to answer questions about documents, codebases, or proprietary knowledge, RAG is usually the right starting point. This guide cuts through the theory and focuses on what you actually need to build it right.
The Core RAG Architecture
A RAG pipeline has three stages: indexing (process and embed your source documents), retrieval (find the most relevant chunks for a given query), and generation (pass those chunks as context to the LLM). Each stage has meaningful decisions that affect quality and cost.
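The three stages map directly onto code. A minimal skeleton of the pipeline, where `embed`, `vector_store`, and `llm_complete` are hypothetical stand-ins for your embedding model, vector store, and LLM client:

```python
def chunk_text(text, size=512):
    """Naive fixed-size chunking; see Stage 1 for better strategies."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def index(documents, embed, vector_store):
    """Stage 1: chunk and embed source documents, then store the vectors."""
    for doc in documents:
        for chunk in chunk_text(doc):
            vector_store.add(vector=embed(chunk), text=chunk)

def retrieve(query, embed, vector_store, k=5):
    """Stage 2: return the k chunks closest to the query embedding."""
    return vector_store.search(vector=embed(query), top_k=k)

def generate(query, chunks, llm_complete):
    """Stage 3: pass retrieved chunks as context to the LLM."""
    context = "\n---\n".join(chunks)
    prompt = f"CONTEXT:\n{context}\n\nQUESTION: {query}\n\nAnswer based only on the context above."
    return llm_complete(prompt)
```

The rest of this guide fills in the decisions hidden inside each of these functions.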
Stage 1: Indexing — Getting Your Data In
Chunking strategy is where most RAG systems fail. Common mistakes:
- Chunks too large (LLM can't focus on the relevant part)
- Chunks too small (lose context and sentence boundaries)
- Splitting in the middle of code blocks or tables
For text documents, start with recursive character splitting with a chunk size of around 512 tokens and a 50-token overlap. Note that LangChain's splitters count characters by default; use the from_tiktoken_encoder constructor for token-based sizing. For structured data (code, markdown, HTML), use format-aware splitters. LangChain provides RecursiveCharacterTextSplitter, MarkdownTextSplitter, and PythonCodeTextSplitter out of the box.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# chunk_size counts characters by default; use
# RecursiveCharacterTextSplitter.from_tiktoken_encoder() for token-based sizing.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", " ", ""],
)
chunks = splitter.split_documents(docs)
```

For embeddings, OpenAI's text-embedding-3-small offers excellent quality at low cost. For self-hosted options, sentence-transformers/all-MiniLM-L6-v2 runs fast on CPU and is good enough for many use cases.
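Whichever embedding model you pick, retrieval ultimately compares vectors. A minimal sketch of the cosine similarity that vector stores compute under the hood, using toy vectors in pure Python:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors: a·b / (|a||b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same direction score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # 0.0
```

In practice your vector store does this (or a dot product over normalized vectors) across millions of chunks with an approximate-nearest-neighbor index.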
Stage 2: Retrieval — Finding the Right Chunks
Basic vector similarity (cosine or dot product) gets you 70% of the way there. To get to 90%+, add:
- Hybrid search: combine dense vector retrieval with BM25 keyword search. Weaviate and Elasticsearch support this natively. This helps when queries contain exact product names or technical terms.
- Reranking: use a cross-encoder model (e.g., Cohere Rerank, or the open-source cross-encoder/ms-marco-MiniLM-L-6-v2) to re-score your top-k results before passing them to the LLM. This consistently improves the precision of the final context.
- Metadata filtering: don't retrieve across your entire corpus if you can filter first. Store document source, date, and category as metadata and filter at retrieval time.
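A common way to merge the dense and keyword result lists in hybrid search is reciprocal rank fusion (RRF). A minimal sketch, assuming you already have two ranked lists of chunk IDs (k=60 is the conventional constant from the original RRF paper):

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge multiple ranked lists of document IDs.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so documents ranked well by BOTH retrievers rise to the top.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc3", "doc1", "doc7"]  # from vector search
bm25 = ["doc1", "doc9", "doc3"]   # from BM25 keyword search
print(reciprocal_rank_fusion([dense, bm25]))  # doc1 and doc3 appear in both lists
```

RRF is attractive because it needs only ranks, not scores, so you never have to calibrate BM25 scores against cosine similarities.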
Stage 3: Generation — Building the Prompt
The retrieval step hands you 3–10 relevant chunks. How you format them matters:
```
CONTEXT:
{chunk_1}
---
{chunk_2}
---
{chunk_3}

QUESTION: {user_query}

Answer based only on the context above. If the answer is not in the context, say so.
```

The "answer only from context" instruction is critical for production systems. Without it, models will hallucinate when the context is insufficient rather than admitting uncertainty.
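Assembling that template in code is straightforward. A minimal sketch (the function name and `---` separator are illustrative choices, not a fixed convention):

```python
def build_rag_prompt(chunks, user_query):
    """Format retrieved chunks and the question into a grounded-answer prompt."""
    context = "\n---\n".join(chunks)
    return (
        f"CONTEXT:\n{context}\n\n"
        f"QUESTION: {user_query}\n\n"
        "Answer based only on the context above. "
        "If the answer is not in the context, say so."
    )

prompt = build_rag_prompt(["Chunk one.", "Chunk two."], "What does chunk one say?")
print(prompt)
```

Keeping prompt assembly in one function like this also gives you a single place to enforce a context-length budget as chunk counts grow.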
Evaluating RAG Quality
Use RAGAS to measure: faithfulness (does the answer follow from the retrieved context?), answer relevancy (does the answer address the question?), and context recall (were the right chunks retrieved?). Set up automated RAGAS evals in your CI pipeline so regressions are caught before deployment.
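RAGAS computes these metrics with LLM judges, but the underlying idea of context recall is simple. A deterministic toy illustration — exact-match only, not the RAGAS implementation — of the fraction of ground-truth reference chunks that retrieval actually returned:

```python
def context_recall(retrieved_chunks, reference_chunks):
    """Toy context recall: fraction of reference chunks found in retrieval.

    RAGAS judges claim-level coverage with an LLM; this exact-match version
    only illustrates the metric's shape for a quick sanity check.
    """
    if not reference_chunks:
        return 1.0
    retrieved = set(retrieved_chunks)
    hits = sum(1 for ref in reference_chunks if ref in retrieved)
    return hits / len(reference_chunks)

print(context_recall(["a", "b", "c"], ["a", "c", "d"]))  # 2 of 3 references found
```

Even a crude metric like this, run over a fixed query set in CI, will catch the most common regression: an index or chunking change that silently stops surfacing the right documents.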
Production Considerations
At scale, your RAG pipeline needs: caching for repeated queries (Redis works well), async indexing for new documents, index versioning for schema changes, and monitoring on retrieval latency and empty-result rates.
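Query caching can be as simple as keying on a hash of the normalized query. A minimal in-process sketch (the class and method names are illustrative; in production, swap the dict for Redis so the cache is shared across workers and entries can expire via TTL):

```python
import hashlib

class QueryCache:
    """In-process answer cache keyed by a hash of the normalized query."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(query):
        # Normalize case and whitespace so trivially different phrasings hit.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, query):
        return self._store.get(self._key(query))

    def set(self, query, answer):
        self._store[self._key(query)] = answer

cache = QueryCache()
cache.set("What is RAG?", "Retrieval-Augmented Generation.")
print(cache.get("  what is  RAG? "))  # normalization makes this a cache hit
```

Remember to include the index version in the cache key (or flush on reindex) so cached answers don't outlive the documents they were grounded in.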