Almost every AI feature we are asked to build for a client now has the same shape. The user asks a question. The system looks something up in the company's own data. The answer is grounded in what was found. The pattern is called Retrieval-Augmented Generation, RAG for short, and it has become the workhorse of practical enterprise AI.
This post covers what RAG is, how the architecture fits together, when it is the right pattern, and the decisions that make or break a real system.
The Problem RAG Solves
Large language models know an astonishing amount. They do not know your company. They do not know what your support team wrote in a runbook last Tuesday, what is in the contract you signed with a vendor in 2023, or what your engineering team agreed to in last week's design review. They also have a knowledge cutoff, so anything that happened after the model was trained is invisible to them.
For a lot of useful AI features, that is a problem. A chatbot that answers questions about your product needs to know your product. A research assistant that summarizes recent regulations needs access to recent regulations. An internal copilot that answers "where did we land on the database migration plan" needs the actual planning doc.
You have two options. You can fine-tune a model on your data, which is expensive, slow to update, and a poor fit for information that changes. Or you can fetch the relevant material at the moment of the question and hand it to the model along with the prompt. The second option is RAG.
| | RAG | Fine-Tuning | Long-Context Prompt |
|---|---|---|---|
| Setup cost | Medium (build pipeline + index) | High (training infra + dataset) | Low (write the prompt, ship) |
| Time to update | Real-time (re-index when content changes) | Slow (re-train + redeploy) | Real-time (edit the prompt) |
| Corpus size limit | Practically none | Bound by training cost | Context window (~200K tokens, paid each call) |
| Provenance | Yes (cite the retrieved chunks) | No (knowledge is in the weights) | Manual (if you instrument it yourself) |
| Best for | Large knowledge bases that change and need citations | Behavior, style, and domain-specific patterns | Small static reference that fits in one prompt |
The Architecture, in One Diagram of Words
A RAG system has two phases. The first happens once, in advance. The second happens every time a user asks a question.
Phase 1: Indexing
You take your source documents: PDFs, wiki pages, support tickets, contracts, whatever the system needs to know about. You break each document into smaller pieces called chunks. You run each chunk through an embedding model, which turns the text into a vector (a list of numbers that captures the meaning of the chunk). You store the vectors in a database optimized for finding similar vectors quickly.
![Two-part diagram on sand background showing how text embeddings capture meaning rather than keywords, with three monospace sentences on the left each followed by an arrow and a truncated four-dimension vector — "the client requested a refund" mapping to [0.82, 0.14, -0.47, 0.91, ...], "the customer asked for money back" mapping to a visibly similar [0.79, 0.18, -0.44, 0.88, ...], and "the HTTP client timed out" mapping to a clearly different [-0.31, 0.67, 0.22, -0.15, ...] — and on the right a 2D embedding space bounded by four corner brackets where the two semantically related sentences appear as paired seafoam dots joined by a dashed connector and labeled "similar meaning" while the unrelated HTTP timeout sentence sits as a squid-purple dot far away in the lower-left, illustrating that embeddings group sentences by meaning rather than shared words and that this is why retrieval-augmented generation can find relevant context even when the user's wording differs from the source documents.](https://ik.imagekit.io/cuttlesoft/tr:w-1472,q-auto,f-auto/wp-content/uploads/2026/05/05171137/Users_frank_Work_cuttlesoft_cuttlesoft.com_content_rag-fundamentals_embeddings-capture-meaning.html.png)
That database is called a vector database. Pinecone, Weaviate, Qdrant, and pgvector on Postgres are the names you will hear most. They are all variations on the same idea: store vectors, find the closest ones to a query vector quickly, return the original text.
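To make "closest vectors" concrete, here is a small sketch that embeds the three sentences from the diagram above with the same OpenAI embedding model used later in this post and compares them with cosine similarity. The exact numbers will vary by model and version; the gap between the pairs is the point.

```python
from langchain_openai import OpenAIEmbeddings
import numpy as np

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

sentences = [
    "the client requested a refund",
    "the customer asked for money back",
    "the HTTP client timed out",
]
vectors = np.array(embeddings.embed_documents(sentences))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # normalize for cosine similarity

# The two refund sentences score close to each other; the timeout sentence
# does not, even though it shares the word "client" with the first one.
print(vectors[0] @ vectors[1])  # high: similar meaning
print(vectors[0] @ vectors[2])  # low: different meaning
```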
Phase 2: Querying
A user asks a question. You run the question through the same embedding model that you used during indexing. You get a vector. You ask the vector database for the chunks whose vectors are closest to the question vector. You take the top few chunks, paste them into a prompt that says something like "answer the user's question using the following information," and send the whole thing to an LLM. The model writes an answer that draws on the chunks you provided.
That is it. The whole pattern. Everything else, every reranker, every hybrid search trick, every graph database overlay, is a refinement on top of this skeleton.
When RAG Is the Right Tool
RAG is the right pattern when three conditions are true.
Your knowledge base is too large to fit in the context window. If you have a 50-page handbook and the model can take a 200,000-token prompt, you do not need RAG. Just paste the handbook into the prompt every time. RAG starts mattering when the corpus is bigger than what the model can read in one shot, or when you do not want to pay to read it every time.
Your knowledge base changes faster than you would want to retrain. If your data is updated daily or hourly, fine-tuning is impractical. RAG indexes new content the moment you add it.
You need provenance. A RAG system can show the user which documents the answer came from. That is critical for any compliance-sensitive use case, and increasingly expected by users in any context. A fine-tuned model gives you a confident answer with no audit trail. A RAG system gives you a confident answer with citations.
If only one of those three conditions is true, you probably do not need RAG. A simple prompt with the relevant document attached is faster, cheaper, and easier to debug.
When RAG Is the Wrong Tool
The most common mistake we see is reaching for RAG when the question is not really a question, but a calculation, a lookup, or a transaction. RAG is a reading-comprehension pattern. If a user asks "how many open invoices does customer X have," that is a database query, not a RAG query. You do not want the model to find a chunk that mentions invoices and reason about it. You want to run a SQL statement and return the number.
The fix is usually a tool-calling pattern, where the model decides whether to retrieve text or call a function. We covered the architecture for that in A Practical Guide to Agent Orchestration Frameworks. RAG is a tool in that toolbox, not the whole toolbox.
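A minimal sketch of that routing decision looks like the snippet below. Everything in it is illustrative: the classification prompt, the `run_sql_query` handler, and the hand-off to a RAG `answer` function like the one in the example at the end of this post.

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

def route(question: str) -> str:
    # Ask a small model whether this needs a database lookup or a reading answer.
    decision = llm.invoke(
        "Classify the question as DATABASE (counts, totals, record lookups) or "
        f"DOCUMENTS (answered by reading text). Reply with one word.\n\n{question}"
    ).content.strip().upper()
    if decision == "DATABASE":
        return run_sql_query(question)  # hypothetical: text-to-SQL or a fixed report
    return answer(question)            # the RAG path (see the minimal example below)
```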
RAG is also wrong when the answer requires reasoning across many documents at once. A vector search returns the most similar chunks to the question, not the chunks that, when combined, would answer it. If your use case is "summarize what every customer said about feature X across 10,000 support tickets," you do not want top-k retrieval. You want a different pattern, often involving map-reduce summarization or a graph-based approach.
The Decisions That Make or Break a RAG System
The architecture is simple. The decisions inside it are not. Five of them deserve more attention than they get.
1. Chunking strategy
How you slice the documents matters more than which database you use. Chunks that are too small lose context. Chunks that are too large dilute the embedding's signal and crowd the prompt. The right size depends on the content. Code wants different chunks than legal text. Conversations want different chunks than reference manuals. Most teams start with a default like 500 tokens with 50-token overlap, and almost everyone tunes from there.
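As a concrete starting point, here is what that 500/50 default looks like with LangChain's recursive splitter, counting in tokens. Treat the numbers and the `handbook_text` placeholder as things to replace with your own content and your own evals.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split on token counts so chunk sizes match what the embedding model sees.
# 500 tokens with 50-token overlap is the common default, not a recommendation.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=500,
    chunk_overlap=50,
)

handbook_text = open("handbook.md").read()  # placeholder source document
chunks = splitter.split_text(handbook_text)
```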
2. Embedding model
OpenAI's text-embedding-3 family, Cohere, Voyage, and the open Sentence Transformers models all do the job. The differences show up at scale and on domain-specific corpora. A general-purpose embedding model trained on web text may underperform on highly technical content. Run an eval on your own corpus before committing.
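The eval does not have to be elaborate. Here is a sketch of the simplest useful version: a few dozen hand-labeled question/chunk pairs and a hit-rate check per candidate model. The `embed` argument is any callable that turns a list of strings into vectors, for example `OpenAIEmbeddings().embed_documents` or a Sentence Transformers `encode`.

```python
import numpy as np

def hit_rate_at_k(embed, questions, chunks, relevant_idx, k=3):
    """Fraction of questions whose known-relevant chunk lands in the top k.

    relevant_idx[i] is the index in `chunks` of the chunk that should
    answer questions[i] -- a small hand-labeled set from your own corpus.
    """
    q_vecs = np.asarray(embed(questions), dtype=float)
    c_vecs = np.asarray(embed(chunks), dtype=float)
    # Normalize so the dot product is cosine similarity.
    q_vecs /= np.linalg.norm(q_vecs, axis=1, keepdims=True)
    c_vecs /= np.linalg.norm(c_vecs, axis=1, keepdims=True)
    top_k = np.argsort(-(q_vecs @ c_vecs.T), axis=1)[:, :k]
    hits = [relevant_idx[i] in top_k[i] for i in range(len(questions))]
    return sum(hits) / len(hits)
```

Run it once per candidate model on the same labeled set and keep the one that wins on your data, not on a leaderboard.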
3. Retrieval strategy
Pure vector search is rarely the best answer. Hybrid search combines vector similarity with keyword matching, which catches cases where the user's wording matches the source but the embeddings do not align. A reranker, a small model that scores the top k results from initial retrieval, often improves quality by a meaningful margin for a small latency cost. We will get into these techniques in a future post.
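One common way to combine the two signals is reciprocal rank fusion, which merges ranked lists without needing their scores to be comparable. The sketch below is the whole algorithm; the example rankings and the damping constant k=60 are conventional placeholders.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document ids, e.g. one from BM25 keyword search
    and one from vector search. A document scores higher the nearer it sits
    to the top of any list."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results from two retrievers for the same query.
keyword_hits = ["doc_7", "doc_2", "doc_9"]
vector_hits = ["doc_2", "doc_4", "doc_7"]
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))  # doc_2 and doc_7 rise to the top
```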
4. Prompt construction
How you assemble the retrieved chunks into the final prompt determines what the model sees. Order matters, because models pay more attention to the start and end of long contexts. Formatting matters, because the model uses cues like document titles and section headers to ground its answer. Templates that include "if the answer is not in the context, say so" reduce hallucinations more than you would expect.
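Here is a sketch of what that looks like in practice. The chunk schema (a title plus text) and the reordering heuristic are illustrative, not a prescribed format.

```python
def reorder_for_attention(chunks: list[dict]) -> list[dict]:
    # Interleave so the highest-scoring chunks land at the start and the end
    # of the context, where long-context models attend most. A heuristic to
    # evaluate on your own data, not a rule.
    front, back = chunks[0::2], chunks[1::2]
    return front + back[::-1]

def format_sources(chunks: list[dict]) -> str:
    # Titles and section headers give the model cues to ground and cite its answer.
    return "\n\n".join(f"### {c['title']}\n{c['text']}" for c in chunks)

GROUNDED_PROMPT = """Answer the question using only the sources below.
Cite the source title for each claim. If the answer is not in the sources, say so.

{sources}

Question: {question}"""
```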
5. Model choice
The generation model matters. A frontier model like GPT-4o or Claude Sonnet 4 will follow grounding instructions better and hallucinate less than a smaller model on the same retrieved context. The cost difference is real, and the right answer is often not the most capable model. We covered the model-selection tradeoff in How to Choose an LLM When Every Model Claims State of the Art.
A Minimal Working Example
Here is the smallest useful RAG system in Python, using LangChain and pgvector. It is not production-ready, but it shows the moving parts.
```python
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_postgres import PGVector
from langchain_core.prompts import ChatPromptTemplate

# 1. Indexing (run once)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
store = PGVector(
    embeddings=embeddings,
    collection_name="company_docs",
    connection="postgresql+psycopg://user:pass@localhost/db",
)
store.add_texts(["...your chunked documents..."])

# 2. Querying (every request)
def answer(question: str) -> str:
    docs = store.similarity_search(question, k=4)
    context = "\n\n".join(d.page_content for d in docs)
    prompt = ChatPromptTemplate.from_messages([
        ("system", "Answer using only the context. If the answer is not in the context, say so."),
        ("user", "Context:\n{context}\n\nQuestion: {question}"),
    ])
    llm = ChatOpenAI(model="gpt-4o-mini")
    messages = prompt.format_messages(context=context, question=question)
    return llm.invoke(messages).content
```

Twenty lines, give or take. That is the foundation. Everything else in this series is what you do when those twenty lines are not enough.
Where the Real Work Begins
The minimal pattern works in development. It works for the demo. What happens when you put it in front of real users with real data is a different story: how to monitor whether retrieval is actually working, what to do when faithfulness scores plateau, and which advanced techniques (hybrid retrieval, rerankers, query rewriting, graph-based methods) are worth the added complexity. Those are the next posts in this series.
If you are building a RAG system right now and want help making the architecture decisions before they bake into your codebase, that is the kind of work we do.


