How does RAG find the right context?
Retrieval-Augmented Generation (RAG) helps a language model answer with facts drawn from your own documents. Instead of relying only on what the model “knows,” it retrieves relevant text passages and then writes a response grounded in them. The key question is how it selects the right passages from a large collection during a conversation.
What RAG is trying to solve
Large language models can write fluent text, but they may:
- Miss details that exist in private or recent documents
- Hallucinate specifics when the prompt is vague
- Struggle with long source material that does not fit in the context window
RAG adds a retrieval step: find candidate passages from a knowledge base, attach them to the prompt, then generate an answer using both the user’s message and those passages.
The basic RAG pipeline
A typical RAG system has two phases: indexing (offline) and retrieval + generation (online).
1) Indexing: turning documents into embeddings
- Chunking: Documents are split into smaller pieces (chunks). Chunk size matters: too small loses context; too big dilutes meaning and wastes tokens.
- Embedding: Each chunk is converted into a vector (an embedding) using an embedding model. This vector is a dense numeric representation where semantic similarity tends to correspond to geometric closeness.
- Vector storage: Vectors are stored in a vector database (or any approximate nearest neighbor index). Each vector keeps metadata such as document id, section title, timestamps, and access permissions.
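To make the offline phase concrete, here is a minimal sketch in Python. The `embed()` function is a toy stand-in (a hashed bag-of-words), not a real embedding model, and `store` is just an in-memory list; in practice you would call an embedding model and write to a vector database.

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for an embedding model: a hashed bag-of-words vector.
    A real system would call an embedding model here instead."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def chunk_document(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split a document into overlapping character windows (one simple strategy)."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def index_document(doc_id: str, text: str, store: list[dict]) -> None:
    """Chunk, embed, and store each piece together with its metadata."""
    for i, piece in enumerate(chunk_document(text)):
        store.append({
            "text": piece,
            "vector": embed(piece),
            "metadata": {"doc_id": doc_id, "chunk_index": i},
        })
```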
2) Retrieval: finding candidate chunks for a user query
When a user asks a question, the system:
- Builds a search query (often embedded into a vector)
- Performs nearest neighbor search to find the top-K chunks whose embeddings are closest to the query embedding
- Optionally applies filters (permissions, product area, date range)
- Optionally reranks the candidates using a stronger model that reads the text directly and scores relevance
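A brute-force sketch of this step, reusing the toy `embed()` and in-memory `store` from the indexing example above. A production system would use an approximate nearest neighbor index instead of scanning every vector.

```python
import numpy as np

def retrieve(query: str, store: list[dict], top_k: int = 5,
             metadata_filter: dict | None = None) -> list[dict]:
    """Embed the query, apply optional metadata filters, return the top-K closest chunks."""
    query_vec = embed(query)  # same toy embed() as in the indexing sketch
    candidates = [
        c for c in store
        if not metadata_filter
        or all(c["metadata"].get(k) == v for k, v in metadata_filter.items())
    ]
    # Vectors are normalized, so the dot product behaves like cosine similarity.
    scored = sorted(candidates,
                    key=lambda c: float(np.dot(query_vec, c["vector"])),
                    reverse=True)
    return scored[:top_k]
```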
3) Generation: writing with retrieved context
The retrieved chunks are inserted into the model prompt (often as “Context” or “Sources”). The model is instructed to answer using that context, and sometimes to cite which chunk each claim comes from.
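Prompt assembly is mostly string formatting. Here is one possible layout with numbered sources so the model can cite them; the exact wording and structure are a design choice, not a standard.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Insert retrieved chunks into the prompt as numbered sources."""
    sources = "\n\n".join(
        f"[{i + 1}] {c['text']}" for i, c in enumerate(chunks)
    )
    return (
        "Answer the question using only the sources below. "
        "Cite sources by number, e.g. [1]. If the sources do not contain "
        "the answer, say so.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\nAnswer:"
    )
```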
How embeddings match meaning
Embeddings are not keywords; they encode patterns of meaning learned from large text corpora. Similar phrases, paraphrases, and related concepts tend to land near each other in vector space. For example:
- “refund policy for yearly plan” can match “annual subscription cancellations and refunds”
- “reset password” can match “account recovery steps”
Similarity measures such as cosine similarity or dot product quantify closeness. Retrieval then becomes a geometric lookup problem: find the nearest vectors to the query vector.
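A small numeric example of cosine similarity, using made-up 3-dimensional vectors purely to show the arithmetic (real embeddings have hundreds or thousands of dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: near 1.0 = very similar, near 0.0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings", invented just for this illustration.
refund_query   = np.array([0.9, 0.1, 0.0])
refund_chunk   = np.array([0.8, 0.2, 0.1])
password_chunk = np.array([0.1, 0.1, 0.9])

print(cosine_similarity(refund_query, refund_chunk))    # high, ~0.98
print(cosine_similarity(refund_query, password_chunk))  # low,  ~0.12
```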
How RAG works in a conversation
Conversation adds a twist: the “query” is rarely just the last message. Users lean on pronouns (“that feature”), leave out the subject entirely (“What about pricing?”), and refer back to earlier turns.
RAG systems handle this with a conversation-aware query construction step.
Query rewriting (standalone question)
A common approach is to rewrite the user’s latest turn into a standalone question using the chat history. Example:
- User: “Does it support SSO?”
- Assistant: “Which plan are you on?”
- User: “Enterprise. Also, what about SCIM?”
Standalone rewrite: “Does the Enterprise plan support SSO, and does it support SCIM provisioning?”
This rewritten query is embedded and used for retrieval. The rewrite reduces ambiguity, which improves embedding search.
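One common way to implement this is to ask a language model to produce the standalone question from the chat history. In the sketch below, `call_llm` is a placeholder for whatever model API you use; only the prompt construction is shown concretely.

```python
def build_rewrite_prompt(history: list[tuple[str, str]], latest: str) -> str:
    """Ask a model to fold the chat history into one standalone question."""
    transcript = "\n".join(f"{role}: {text}" for role, text in history)
    return (
        "Rewrite the user's latest message as a standalone question that "
        "contains all the context needed to search a knowledge base.\n\n"
        f"Chat history:\n{transcript}\n\n"
        f"Latest message: {latest}\n"
        "Standalone question:"
    )

def rewrite_query(history: list[tuple[str, str]], latest: str, call_llm) -> str:
    # `call_llm` is a placeholder: any function that takes a prompt string and returns text.
    return call_llm(build_rewrite_prompt(history, latest)).strip()
```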
Multi-vector retrieval for richer intent
Some systems generate multiple queries:
- One optimized for definitions (“What is SCIM?”)
- One for procedures (“How to configure SCIM?”)
- One for constraints (“SCIM requirements Enterprise plan”)
Each query retrieves its own set of chunks, and the union of the results is then reranked.
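A sketch of the union step, reusing the `retrieve()` helper from the earlier retrieval example and deduplicating by document id and chunk index before handing the merged list to a reranker:

```python
def multi_query_retrieve(queries: list[str], store: list[dict],
                         top_k_per_query: int = 5) -> list[dict]:
    """Retrieve for each query variant and union the results, deduplicated."""
    seen, merged = set(), []
    for query in queries:
        for chunk in retrieve(query, store, top_k=top_k_per_query):
            key = (chunk["metadata"]["doc_id"], chunk["metadata"]["chunk_index"])
            if key not in seen:
                seen.add(key)
                merged.append(chunk)
    return merged  # hand this union to the reranker
```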
Reranking: picking the truly relevant chunks
Nearest neighbors from embeddings are good candidates, but not always the best. A reranker reads the actual text of each chunk and scores relevance to the rewritten question. This step often improves accuracy, especially when many chunks share similar vocabulary.
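In code, the reranking step is mostly sorting; the interesting part is the scoring model. In the sketch below, `score_pair` is a placeholder for a cross-encoder or similar model that reads both texts and returns a relevance score.

```python
def rerank(question: str, chunks: list[dict], score_pair, keep: int = 5) -> list[dict]:
    """Re-order candidate chunks by a pairwise relevance score.

    `score_pair(question, text) -> float` is a placeholder for a cross-encoder
    or other model that reads both texts and scores their relevance."""
    scored = [(score_pair(question, c["text"]), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:keep]]
```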
Memory vs retrieval
Chat “memory” (stored user preferences, profile, prior decisions) is different from RAG retrieval:
- Memory answers “Who is the user and what do they prefer?”
- Retrieval answers “What do the documents say about this topic?”
Many assistants use both: memory to shape the response, retrieval to ground factual claims.
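One way the two can meet in a single prompt is sketched below; the field names and wording are illustrative, not a standard schema.

```python
def build_grounded_prompt(question: str, memory: dict, chunks: list[dict]) -> str:
    """Combine stored user memory (preferences) with retrieved sources in one prompt."""
    profile = "\n".join(f"- {k}: {v}" for k, v in memory.items())
    sources = "\n\n".join(f"[{i + 1}] {c['text']}" for i, c in enumerate(chunks))
    return (
        f"User profile (for tone and defaults, not as a source of facts):\n{profile}\n\n"
        f"Sources (ground all factual claims in these):\n{sources}\n\n"
        f"Question: {question}\nAnswer:"
    )
```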
Why the system sometimes retrieves the wrong thing
Common failure modes include:
- Poor chunking that splits key sentences across chunks
- Missing metadata filters (wrong product version, wrong region)
- Vague questions with no rewrite step
- Embedding model mismatch with the domain’s language
- Overly large top-K that floods the prompt with semi-related text
What “right embeddings” really means
The model does not hunt for “the right embeddings” during generation. The retrieval system computes embeddings for the query, finds nearby chunk embeddings, reranks, and then feeds the selected text into the model. The assistant’s output looks smart because the retrieved context is well chosen, not because the model secretly searches the database while writing.