RAG Glossary: Key Terms and Concepts

Philipp Pahl


A reference guide to the terminology used in Retrieval-Augmented Generation systems. Bookmark this for quick lookups when navigating RAG documentation and discussions.


Core Concepts

RAG (Retrieval-Augmented Generation)

An architecture that combines information retrieval with language model generation. Instead of relying solely on a model's training data, RAG retrieves relevant documents at query time and uses them to ground responses.

Vector Embedding

A numerical representation of text (or other content) in a high-dimensional space where semantically similar items are close together. Embeddings enable semantic search—finding content by meaning rather than exact keyword matches.

Semantic Search

Search based on meaning rather than keywords. Uses vector embeddings to find documents that are conceptually related to a query, even if they don't share exact terms.
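The idea can be illustrated with toy vectors and cosine similarity. A minimal sketch, noting that real embeddings come from a trained model and have hundreds of dimensions; these 4-dimensional vectors are made up for illustration:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: "dog" and "puppy" point in similar directions,
# "invoice" points elsewhere.
dog = [0.9, 0.1, 0.8, 0.2]
puppy = [0.85, 0.15, 0.75, 0.25]
invoice = [0.1, 0.9, 0.05, 0.7]

print(cosine_similarity(dog, puppy) > cosine_similarity(dog, invoice))  # True
```

This is why semantic search can match "puppy" to a document about dogs even though the keyword never appears.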

Knowledge Base

The collection of documents, data, and information that a RAG system can retrieve from. May include documents, FAQs, product catalogs, internal wikis, and other content sources.


Document Processing

Chunking

The process of splitting documents into smaller pieces for embedding and retrieval. Chunk size and strategy significantly impact retrieval quality.

Chunk Overlap

Including some content from adjacent chunks to preserve context at boundaries. Typically 10-20% overlap helps maintain coherence.
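A character-based sliding-window splitter with overlap can be sketched in a few lines (the sizes here are arbitrary examples; production splitters usually count tokens and respect sentence boundaries):

```python
def chunk_text(text, chunk_size=200, overlap=40):
    """Split text into fixed-size character chunks, each sharing
    `overlap` characters with the previous chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```

With `chunk_size=200` and `overlap=40`, each chunk repeats the last 40 characters of its predecessor, so a sentence straddling a boundary appears whole in at least one chunk.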

Document Loader

A component that ingests documents from various sources (PDFs, web pages, databases) and prepares them for processing.

Text Splitter

The algorithm or method used to divide documents into chunks. Options include character-based, token-based, semantic, and recursive splitting.

Metadata

Additional information attached to document chunks (source, date, author, category) that can be used for filtering during retrieval.
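Metadata filtering typically narrows the candidate set before (or alongside) similarity search. A sketch with hypothetical chunk records; real vector databases expose this as a filter clause on the query:

```python
# Hypothetical chunk records; text fields are illustrative.
chunks = [
    {"text": "Refund policy ...", "source": "faq.md", "category": "billing"},
    {"text": "API rate limits ...", "source": "docs/api.md", "category": "api"},
    {"text": "Invoice history ...", "source": "faq.md", "category": "billing"},
]

def filter_chunks(chunks, **criteria):
    """Keep only chunks whose metadata matches every criterion."""
    return [c for c in chunks if all(c.get(k) == v for k, v in criteria.items())]

billing_chunks = filter_chunks(chunks, category="billing")
```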


Embeddings & Storage

Embedding Model

A neural network that converts text into vector embeddings. Popular options include OpenAI's text-embedding-ada-002, Cohere's embed models, and open-source alternatives like BAAI/bge.

Vector Database

A database optimized for storing, indexing, and searching vector embeddings. Examples: Pinecone, Weaviate, Chroma, Qdrant, Milvus, pgvector.

Vector Index

A data structure that enables efficient similarity search over vector embeddings. Common types include HNSW (Hierarchical Navigable Small World) and IVF (Inverted File Index).

Dimension

The number of elements in a vector embedding. Common dimensions range from 384 to 1536. Higher dimensions can capture more nuance but require more storage and computation.


Retrieval

Similarity Search

Finding vectors that are mathematically close to a query vector. Common metrics include cosine similarity, Euclidean distance, and dot product.

Top-K Retrieval

Returning the K most similar documents to a query. K is typically 3-10, balancing relevance with context window limits.
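A brute-force sketch of top-K retrieval over toy vectors (vector databases replace the linear scan with an index such as HNSW, but the contract is the same: score everything, keep the K best):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, doc_vecs, k=3):
    """Return indices of the k documents most similar to the query."""
    scored = [(cosine(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]
```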

Dense Retrieval

Retrieval using vector embeddings (dense vectors). Captures semantic meaning but may miss exact keyword matches.

Sparse Retrieval

Traditional keyword-based retrieval using sparse vectors (like BM25). Good at exact matching but misses semantic relationships.

Hybrid Search

Combining dense and sparse retrieval to leverage the strengths of both. Often implemented with a weighted combination of scores.
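A weighted score combination can be sketched as follows. This is one simple fusion scheme, not the only one; min-max normalization is needed because dense (cosine) and sparse (BM25) scores live on different scales:

```python
def hybrid_scores(dense, sparse, alpha=0.7):
    """Combine dense and sparse scores per document id.

    dense/sparse map doc_id -> raw score; alpha weights the dense side.
    """
    def normalize(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {d: (s - lo) / span for d, s in scores.items()}

    dense_n, sparse_n = normalize(dense), normalize(sparse)
    all_docs = set(dense_n) | set(sparse_n)
    return {d: alpha * dense_n.get(d, 0.0) + (1 - alpha) * sparse_n.get(d, 0.0)
            for d in all_docs}
```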

Reranking

A second-stage process that uses a more sophisticated model to reorder initial retrieval results for better relevance.

Cross-Encoder

A model used for reranking that processes the query and document together to produce a relevance score. More accurate than bi-encoders but slower.

Bi-Encoder

A model architecture where query and document are encoded separately, enabling fast retrieval through pre-computed document embeddings.


Generation

Context Window

The maximum amount of text (measured in tokens) that a language model can process at once. Retrieved documents must fit within this limit along with the query and response.
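Fitting retrieved chunks into the budget can be done greedily, keeping the highest-ranked chunks first. A sketch where word count stands in for a real tokenizer (an approximation; actual systems count model tokens):

```python
def fit_to_budget(chunks, budget_tokens, count_tokens=lambda t: len(t.split())):
    """Keep the top-ranked chunks that fit within budget_tokens.

    chunks are assumed to be ordered best-first; count_tokens is a
    stand-in for a real tokenizer (here: whitespace word count).
    """
    kept, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept
```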

Prompt Template

A structured template that combines the user's query with retrieved context and instructions for the language model.
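In practice this is often just string assembly. A minimal sketch (the wording and the numbered-source convention are illustrative choices, not a standard):

```python
PROMPT_TEMPLATE = """Answer the question using only the context below.
If the context does not contain the answer, say you don't know.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(question, retrieved_chunks):
    """Number each chunk so the model can cite sources as [1], [2], ..."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieved_chunks))
    return PROMPT_TEMPLATE.format(context=context, question=question)
```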

Grounding

Ensuring that generated responses are based on and supported by the retrieved context, rather than the model's general knowledge.

Citation

Referencing the specific source document(s) that support claims in a generated response. Improves transparency and trustworthiness.

Hallucination

When a language model generates content that is plausible-sounding but not factually accurate or not supported by the provided context.


Advanced Concepts

Query Expansion

Techniques to reformulate or expand user queries to improve retrieval. May include adding synonyms, generating sub-questions, or using LLMs to create better search queries.

HyDE (Hypothetical Document Embeddings)

A technique where the LLM first generates a hypothetical answer, which is then embedded and used for retrieval. Can improve retrieval for certain query types.

Multi-Query Retrieval

Generating multiple variants of a query and combining retrieval results. Increases recall by capturing different phrasings.
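One common way to merge the per-variant result lists is Reciprocal Rank Fusion (RRF), which rewards documents that rank well in several lists. A sketch, using the k=60 damping constant from the original RRF formulation:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several best-first ranked lists of doc ids into one ranking.

    Each document scores 1/(k + rank) per list it appears in; k dampens
    the influence of lower ranks.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```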

Parent-Child Retrieval

Retrieving smaller chunks for precision but returning larger parent sections for context. Balances specificity with completeness.

RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval)

A technique that builds hierarchical summaries of documents for multi-level retrieval.

Graph RAG

Combining knowledge graphs with RAG to capture entity relationships and enable more structured reasoning.

Agentic RAG

RAG systems where an AI agent decides when and how to retrieve, potentially making multiple retrieval calls and reasoning over results.

MAG (Memory-Augmented Generation)

Systems that maintain persistent memory across conversations, enabling long-term context and personalization.


Evaluation Metrics

Retrieval Precision

The proportion of retrieved documents that are actually relevant to the query.

Retrieval Recall

The proportion of all relevant documents that were successfully retrieved.
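Both metrics follow directly from set overlap between retrieved and relevant document ids:

```python
def precision_recall(retrieved, relevant):
    """Retrieval precision and recall over sets of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Retrieving 4 documents of which 2 are relevant, out of 3 relevant documents total, gives precision 0.5 and recall 2/3.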

MRR (Mean Reciprocal Rank)

Measures how high the first relevant document appears in the ranked results. Higher is better.
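MRR averages, over all queries, the reciprocal of the rank at which the first relevant document appears (queries with no relevant hit contribute 0):

```python
def mean_reciprocal_rank(ranked_results, relevant_per_query):
    """MRR over queries: mean of 1/rank of each query's first relevant hit."""
    total = 0.0
    for ranking, relevant in zip(ranked_results, relevant_per_query):
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_results)
```

If the first relevant document appears at rank 1 for one query and rank 2 for another, MRR is (1 + 0.5) / 2 = 0.75.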

NDCG (Normalized Discounted Cumulative Gain)

A metric that accounts for both relevance and ranking position of retrieved documents.

Faithfulness

Whether generated answers are supported by the retrieved context (vs. using general knowledge or hallucinating).

Answer Relevance

Whether generated answers actually address the user's question.

Context Relevance

Whether the retrieved context is appropriate for answering the query.


Infrastructure Terms

Latency

The time between a query being submitted and a response being returned. RAG adds retrieval latency to generation latency.

Throughput

The number of queries a RAG system can handle per unit of time.

Cold Start

Delay when a system needs to initialize or load models/indexes before processing queries.

Caching

Storing frequently retrieved documents or embeddings to reduce latency for common queries.
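An embedding cache is often the simplest win: identical texts skip the model call. A sketch where `embed_fn` stands in for a real embedding-model call (an assumption, not a specific API):

```python
class EmbeddingCache:
    """Cache embeddings so repeated texts skip the embedding model."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # stand-in for a real embedding API call
        self.store = {}
        self.hits = 0

    def get(self, text):
        if text in self.store:
            self.hits += 1
        else:
            self.store[text] = self.embed_fn(text)
        return self.store[text]
```

Production systems usually add an eviction policy (e.g. LRU) and persist the cache alongside the vector store.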


This glossary covers the most common RAG terminology. As the field evolves, new concepts and techniques continue to emerge.


Building a RAG system and need guidance? Get in touch to discuss your implementation.