Featured image: a stylized 2D projection of a vector embedding space, with color-coded clusters of dots representing topical groups in a knowledge base (Pricing, Onboarding, API Docs, Policies, Support, Release Notes) and a bright central query point linked to its k = 4 nearest neighbors, illustrating that retrieval-augmented generation works because semantically related content lands close together in vector space, so a similarity search reliably pulls the right chunks out of a much larger corpus.

Almost every AI feature we are asked to build for a client now has the same shape. The user asks a question. The system looks something up in the company's own data. The answer is grounded in what was found. The pattern is called Retrieval-Augmented Generation, RAG for short, and it has become the workhorse of practical enterprise AI.

This post covers what RAG is, how the architecture fits together, when it is the right pattern, and the decisions that make or break a real system.

The Problem RAG Solves

Large language models know an astonishing amount. They do not know your company. They do not know what your support team wrote in a runbook last Tuesday, what is in the contract you signed with a vendor in 2023, or what your engineering team agreed to in last week's design review. They also have a knowledge cutoff, so anything that happened after the model was trained is invisible to them.

For a lot of useful AI features, that is a problem. A chatbot that answers questions about your product needs to know your product. A research assistant that summarizes recent regulations needs access to recent regulations. An internal copilot that answers "where did we land on the database migration plan" needs the actual planning doc.

You have two options. You can fine-tune a model on your data, which is expensive, slow to update, and a poor fit for information that changes. Or you can fetch the relevant material at the moment of the question and hand it to the model along with the prompt. The second option is RAG.

| | RAG | Fine-Tuning | Long-Context Prompt |
| --- | --- | --- | --- |
| Setup cost | Medium (build pipeline + index) | High (training infra + dataset) | Low (write the prompt, ship) |
| Time to update | Real-time (re-index when content changes) | Slow (re-train + redeploy) | Real-time (edit the prompt) |
| Corpus size limit | Practically none | Bound by training cost | Context window (~200K tokens, paid each call) |
| Provenance | Yes (cite the retrieved chunks) | No (knowledge is in the weights) | Manual (if you instrument it yourself) |
| Best for | Large knowledge bases that change and need citations | Behavior, style, and domain-specific patterns | Small static reference that fits in one prompt |
The three patterns are not mutually exclusive. Many production systems combine all three.

The Architecture, in One Diagram of Words

A RAG system has two phases. The first happens once, in advance. The second happens every time a user asks a question.

Phase 1: Indexing

You take your source documents: PDFs, wiki pages, support tickets, contracts, whatever the system needs to know about. You break each document into smaller pieces called chunks. You run each chunk through an embedding model, which turns the text into a vector (a list of numbers that captures the meaning of the chunk). You store the vectors in a database optimized for finding similar vectors quickly.

Figure: three sentences mapped to embedding vectors. "The client requested a refund" and "the customer asked for money back" receive visibly similar vectors and sit close together in the 2D embedding space (labeled "similar meaning"), while "the HTTP client timed out" maps to a distant point. Embeddings group sentences by meaning rather than shared words, which is why retrieval-augmented generation can find relevant context even when the user's wording differs from the source documents.

That database is called a vector database. Pinecone, Weaviate, Qdrant, and pgvector on Postgres are the names you will hear most. They are all variations on the same idea: store vectors, find the closest ones to a query vector quickly, return the original text.
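To make "closest" concrete: similarity between two embedding vectors is usually measured with cosine similarity. Here is a toy sketch with made-up four-dimensional vectors; real embedding models produce hundreds to thousands of dimensions, but the math is identical:

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity: near 1.0 means same direction (similar meaning);
    near zero or negative means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Toy 4-dimensional "embeddings" for three sentences (values are illustrative).
refund     = [0.82, 0.14, -0.47, 0.91]   # "the client requested a refund"
money_back = [0.79, 0.18, -0.44, 0.88]   # "the customer asked for money back"
timeout    = [-0.31, 0.67, 0.22, -0.15]  # "the HTTP client timed out"

print(round(cosine_similarity(refund, money_back), 2))  # close to 1.0: near neighbors
print(round(cosine_similarity(refund, timeout), 2))     # negative: far apart
```

A vector database does exactly this comparison, just against millions of stored vectors at once, using an index so it does not have to scan them all.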

Phase 2: Querying

A user asks a question. You run the question through the same embedding model that you used during indexing. You get a vector. You ask the vector database for the chunks whose vectors are closest to the question vector. You take the top few chunks, paste them into a prompt that says something like "answer the user's question using the following information," and send the whole thing to an LLM. The model writes an answer that draws on the chunks you provided.

That is it. The whole pattern. Everything else, every reranker, every hybrid search trick, every graph database overlay, is a refinement on top of this skeleton.

When RAG Is the Right Tool

RAG is the right pattern when three conditions are true.

Your knowledge base is too large to fit in the context window. If you have a 50-page handbook and the model can take a 200,000-token prompt, you do not need RAG. Just paste the handbook into the prompt every time. RAG starts mattering when the corpus is bigger than what the model can read in one shot, or when you do not want to pay to read it every time.

Your knowledge base changes faster than you would want to retrain. If your data is updated daily or hourly, fine-tuning is impractical. RAG indexes new content the moment you add it.

You need provenance. A RAG system can show the user which documents the answer came from. That is critical for any compliance-sensitive use case, and increasingly expected by users in any context. A fine-tuned model gives you a confident answer with no audit trail. A RAG system gives you a confident answer with citations.
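Provenance is cheap to implement as long as source metadata survives indexing. A minimal sketch, assuming each retrieved chunk carries the source and page it was cut from; the chunk dicts here are hypothetical stand-ins for whatever your vector store returns:

```python
# Hypothetical retrieved chunks. In a real system these come back from the
# vector store with whatever metadata you attached at indexing time.
retrieved = [
    {"text": "Refunds are processed within 14 days.", "source": "policies/refunds.md", "page": 2},
    {"text": "Contact billing for expedited refunds.", "source": "handbook.pdf", "page": 31},
]

def format_citations(chunks: list[dict]) -> str:
    """Build a deduplicated, ordered source list to show beside the answer."""
    seen = dict.fromkeys(f'{c["source"]} (p. {c["page"]})' for c in chunks)
    return "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(seen))

print(format_citations(retrieved))
# [1] policies/refunds.md (p. 2)
# [2] handbook.pdf (p. 31)
```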

If only one of those three conditions is true, you probably do not need RAG. A simple prompt with the relevant document attached is faster, cheaper, and easier to debug.

When RAG Is the Wrong Tool

The most common mistake we see is reaching for RAG when the question is not really a question, but a calculation, a lookup, or a transaction. RAG is a reading-comprehension pattern. If a user asks "how many open invoices does customer X have," that is a database query, not a RAG query. You do not want the model to find a chunk that mentions invoices and reason about it. You want to run a SQL statement and return the number.

The fix is usually a tool-calling pattern, where the model decides whether to retrieve text or call a function. We covered the architecture for that in A Practical Guide to Agent Orchestration Frameworks. RAG is a tool in that toolbox, not the whole toolbox.
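To make the distinction concrete, here is a minimal sketch of that routing decision. Everything in it is a stand-in: `run_sql` and `rag_answer` are hypothetical helpers, and the keyword heuristic substitutes for the LLM's actual tool-calling decision, just to keep the sketch self-contained:

```python
def run_sql(query: str) -> str:
    """Stand-in for a real database call."""
    return "42"

def rag_answer(question: str) -> str:
    """Stand-in for the full RAG pipeline (retrieve, prompt, generate)."""
    return "Based on the retrieved documents..."

def route(question: str) -> str:
    # In production the LLM makes this decision via tool calling; a keyword
    # heuristic stands in here so the sketch runs without a model.
    structured = ("how many", "count of", "total", "sum of")
    if any(phrase in question.lower() for phrase in structured):
        return run_sql("SELECT COUNT(*) FROM invoices WHERE status = 'open'")
    return rag_answer(question)

print(route("How many open invoices does customer X have?"))  # SQL path: "42"
print(route("What does our refund policy say?"))              # RAG path
```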

RAG is also wrong when the answer requires reasoning across many documents at once. A vector search returns the most similar chunks to the question, not the chunks that, when combined, would answer it. If your use case is "summarize what every customer said about feature X across 10,000 support tickets," you do not want top-k retrieval. You want a different pattern, often involving map-reduce summarization or a graph-based approach.

The Decisions That Make or Break a RAG System

The architecture is simple. The decisions inside it are not. Five of them deserve more attention than they get.

1. Chunking strategy

How you slice the documents matters more than which database you use. Chunks that are too small lose context. Chunks that are too large dilute the embedding's signal and crowd the prompt. The right size depends on the content. Code wants different chunks than legal text. Conversations want different chunks than reference manuals. Most teams start with a default like 500 tokens with 50-token overlap, and almost everyone tunes from there.
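That default is a few lines of code. This sketch counts whitespace-separated words as a rough proxy for tokens; a real pipeline would use the embedding model's tokenizer, but the sliding-window-with-overlap shape is the same:

```python
def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into word-based chunks of `size` words, with `overlap`
    words repeated between consecutive chunks so that sentences straddling
    a boundary appear in both. Words approximate tokens here."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

chunks = chunk("word " * 1200, size=500, overlap=50)
print(len(chunks))             # 3 chunks for a 1200-word document
print(len(chunks[1].split()))  # 500 words, the first 50 shared with chunk 0
```

The overlap is the knob people forget: without it, a fact split across a chunk boundary is invisible to retrieval, because neither half embeds to anything close to the question.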

2. Embedding model

OpenAI's text-embedding-3 family, Cohere, Voyage, and the open Sentence Transformers models all do the job. The differences show up at scale and on domain-specific corpora. A general-purpose embedding model trained on web text may underperform on highly technical content. Run an eval on your own corpus before committing.

3. Retrieval strategy

Pure vector search is rarely the best answer. Hybrid search combines vector similarity with keyword matching, which catches cases where the user's wording matches the source but the embeddings do not align. A reranker, a small model that scores the top k results from initial retrieval, often improves quality by a meaningful margin for a small latency cost. We will get into these techniques in a future post.
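As a preview, one common way to combine the vector and keyword result lists is reciprocal rank fusion, which needs no score normalization because it only looks at ranks. A sketch with made-up document IDs:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score each doc by the sum of 1/(k + rank)
    over every ranked list it appears in, then sort by combined score.
    k=60 is the conventional damping constant from the RRF paper."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits  = ["doc_a", "doc_b", "doc_c"]   # from embedding similarity
keyword_hits = ["doc_c", "doc_a", "doc_d"]   # from keyword / BM25 match
print(rrf([vector_hits, keyword_hits]))
# ['doc_a', 'doc_c', 'doc_b', 'doc_d'] -- doc_a wins: ranked high in both lists
```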

4. Prompt construction

How you assemble the retrieved chunks into the final prompt determines what the model sees. Order matters, because models pay more attention to the start and end of long contexts. Formatting matters, because the model uses cues like document titles and section headers to ground its answer. Templates that include "if the answer is not in the context, say so" reduce hallucinations more than you would expect.

5. Model choice

The generation model matters. A frontier model like GPT-4o or Claude Sonnet 4 will follow grounding instructions better and hallucinate less than a smaller model on the same retrieved context. The cost difference is real, and the right answer is often not the most capable model. We covered the model-selection tradeoff in How to Choose an LLM When Every Model Claims State of the Art.

A Minimal Working Example

Here is the smallest useful RAG system in Python, using LangChain and pgvector. It is not production-ready, but it shows the moving parts.

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_postgres import PGVector
from langchain_core.prompts import ChatPromptTemplate

# 1. Indexing (run once)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
store = PGVector(
    embeddings=embeddings,
    collection_name="company_docs",
    connection="postgresql+psycopg://user:pass@localhost/db",
)
store.add_texts(["...your chunked documents..."])

# 2. Querying (every request)
def answer(question: str) -> str:
    docs = store.similarity_search(question, k=4)
    context = "\n\n".join(d.page_content for d in docs)
    prompt = ChatPromptTemplate.from_messages([
        ("system", "Answer using only the context. If the answer is not in the context, say so."),
        ("user", "Context:\n{context}\n\nQuestion: {question}"),
    ])
    llm = ChatOpenAI(model="gpt-4o-mini")
    return llm.invoke(prompt.format_messages(context=context, question=question)).content

Twenty lines, give or take. That is the foundation. Everything else in this series is what you do when those twenty lines are not enough.

Where the Real Work Begins

The minimal pattern works in development. It works for the demo. What happens when you put it in front of real users with real data is a different story: how to monitor whether retrieval is actually working, what to do when faithfulness scores plateau, and which advanced techniques (hybrid retrieval, rerankers, query rewriting, graph-based methods) are worth the added complexity. Those are the next posts in this series.

If you are building a RAG system right now and want help making the architecture decisions before they bake into your codebase, that is the kind of work we do.
