How to Evaluate RAG Systems

Every few weeks, someone declares that retrieval augmented generation (RAG) is dead. But here’s the thing: retrieval isn’t going anywhere. Any large language model (LLM) system worth its salt needs to retrieve data at some point, whether through an MCP call, a database query, or a document lookup. The real question isn’t whether RAG is dead, but how to evaluate RAG systems properly.

Evaluating RAG Pipelines: Why Architecture-Specific Testing Matters

Think about it this way: if you’re running end-to-end evaluations and everything fails, you might spend hours debugging prompts when the real culprit is a broken retrieval component. This is why evaluating RAG pipelines at the component level is crucial—you need to test each piece separately before running comprehensive end-to-end tests.

Consider a CRM system where every contact lookup is essentially a retrieval process. If that retrieval is broken, no amount of prompt engineering will save your system. You need to evaluate the retrieval mechanism first, then build from there.

The Evolution from Vanilla to Agentic RAG

Traditional “vanilla” RAG follows a simple pattern: user query → retrieve relevant documents from vector database → insert retrieved context into LLM prompt → generate answer. This straightforward approach might be what people mean when they say “RAG is dead.”

What’s actually gaining traction is agentic RAG: given a query → the language model decides how to retrieve → dynamic retrieval (search, lookup, etc.) → generate response. The key insight is that the retrieval and LLM steps remain separate and individually evaluable, which lets you optimize each part of the system on its own.
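
To make the separation concrete, here’s a minimal Python sketch of both patterns. The retriever, llm, and tools objects are hypothetical stand-ins rather than any particular library’s API; the point is that retrieval stays a distinct step you can test on its own.

```python
# Hypothetical interfaces: retriever.search(), llm.generate(), llm.choose_tool(), etc.
# are placeholders, not a specific framework's API.

def vanilla_rag(query, retriever, llm):
    chunks = retriever.search(query, top_k=5)        # fixed retrieval step: evaluate this in isolation
    context = "\n\n".join(chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return llm.generate(prompt)                      # generation step: evaluate separately, then end-to-end

def agentic_rag(query, tools, llm):
    plan = llm.choose_tool(query, list(tools))       # the model decides how to retrieve
    chunks = tools[plan.tool](**plan.arguments)      # dynamic retrieval: search, lookup, etc.
    return llm.generate_with_context(query, chunks)  # retrieval and generation are still separate steps
```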

Why “Just Put Everything in Context” Doesn’t Work

Some argue that massive context windows eliminate the need for retrieval. This thinking falls apart for three reasons:

  1. Efficiency: Processing enormous contexts is computationally expensive
  2. Cost: Token costs scale with context size
  3. Practicality: Most data is dynamic, not static

In real-world applications, you’re often retrieving the most up-to-date information from constantly changing databases. You can’t just dump your entire contact database into a context window—it’s being updated in real-time with LinkedIn changes, new interactions, and fresh data. This is where retrieval quality becomes critical for maintaining response accuracy.

Building Your Evaluation Dataset

The fundamental question driving RAG evaluation is simple: Have I retrieved the right information to put into context? If the retrieval step doesn’t surface relevant information, it’s garbage in, garbage out.

Creating a Quality Evaluation Dataset

Most organizations have plenty of source documents but lack the queries needed for evaluation. An automated evaluation framework requires both comprehensive test data and ground truth answers to measure retrieval performance effectively.

The Three-Step Process:

  1. Chunk your source documents (size matters—we’ll cover this later)
  2. Convert chunks into factual claims (e.g., “The pool closes at 5pm”)
  3. Invert claims into questions (e.g., “What time does the pool close?”)

This process creates synthetic data that serves as your ground truth for evaluation. Each generated question should have a known correct answer based on the retrieved documents.
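
As a rough illustration, here’s what that pipeline might look like in Python. The llm callable is a hypothetical helper that takes a prompt and returns a completion string; in practice you’d plug in whatever model client you already use.

```python
def build_eval_examples(chunks, llm):
    """Turn document chunks into (question, ground-truth answer) pairs via factual claims."""
    examples = []
    for chunk in chunks:
        claim = llm(f"Extract one factual claim stated in this text:\n\n{chunk}")
        question = llm(f"Write a question that is answered by this claim:\n\n{claim}")
        examples.append({
            "question": question,   # e.g. "What time does the pool close?"
            "ground_truth": claim,  # e.g. "The pool closes at 5pm."
            "source_chunk": chunk,  # the chunk retrieval should surface for this question
        })
    return examples
```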

Quality Control Through Iteration:

  • Start with real queries as few-shot examples
  • Generate synthetic queries based on your documents
  • Rate queries 1-5 for realism (how likely are humans to ask this?)
  • Feed ratings back to improve future generation
  • Keep only 4s and 5s, discard anything rated 3 or below

This fuzzy 1-5 scale works perfectly because you’re trying to separate obviously terrible queries from genuinely useful ones. While human evaluation provides the gold standard for query quality, this automated approach scales much better for large evaluation datasets. The goal is ensuring your test data reflects real business objectives and user needs.
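
The rate-and-filter step might look something like this sketch, assuming a hypothetical llm_judge helper that returns an integer realism score from 1 to 5 given a generated query and a handful of real queries as few-shot examples.

```python
def filter_realistic(examples, llm_judge, real_queries):
    """Keep only synthetic queries rated 4 or 5 for realism by an LLM judge."""
    kept = []
    for ex in examples:
        score = llm_judge(ex["question"], few_shot=real_queries)  # 1-5: how likely is a human to ask this?
        if score >= 4:                                            # keep 4s and 5s, discard 3 and below
            kept.append({**ex, "realism": score})
    return kept
```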

Advanced Testing: Adversarial Examples

To really test your RAG system, create challenging scenarios with similar chunks where queries can only be answered by one specific chunk, despite using terminology from others. This tests whether your system ranks relevant documents properly and retrieves the most relevant documents for each query.

For example:

  • “When does the pool open?” vs. “Is the pool open-air?”
  • Both chunks discuss pools, but only one provides the ground truth answer for each question
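
These adversarial cases can live right alongside your regular test set. Here’s a hypothetical pair built on the pool example, where each question is answerable by exactly one of two similar-looking chunks:

```python
adversarial_cases = [
    {
        "question": "When does the pool open?",
        "answer_chunk": "The pool is open daily from 8am to 5pm.",
        "distractor_chunk": "The open-air pool offers sweeping views of the bay.",
    },
    {
        "question": "Is the pool open-air?",
        "answer_chunk": "The open-air pool offers sweeping views of the bay.",
        "distractor_chunk": "The pool is open daily from 8am to 5pm.",
    },
]
```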

This process reveals two critical insights about your RAG components:

  • Redundancy: Multiple chunks answering the same question (consolidation opportunity)
  • Gaps: Questions your documentation can’t answer (content improvement opportunity)

This isn’t just evaluation—it’s documentation improvement. You might discover you have 50 different references to pool hours (streamline this) but zero information about bar drinks (add this content). This comprehensive understanding helps optimize your document relevance and retrieval accuracy.

Essential Evaluation Metrics for RAG Systems

While countless retrieval metrics exist, four evaluation metrics cover most practical use cases. These quantitative metrics help you understand both retrieval performance and response quality:

1. Context Recall

What it measures: Completeness—of all relevant documents available, how many did we find?

If there are 50 references to pool opening times in your source documents, did your RAG system retrieve all 50? This metric tells you about coverage and is crucial for ensuring your system doesn’t miss important relevant context.
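
A minimal sketch, assuming you track which chunk IDs are relevant to each query (which is exactly what your synthetic dataset gives you):

```python
def context_recall(retrieved_ids, relevant_ids):
    """Of all the relevant chunks that exist, what fraction did the retriever return?"""
    if not relevant_ids:
        return 1.0  # nothing relevant to find
    hits = len(set(retrieved_ids) & set(relevant_ids))
    return hits / len(relevant_ids)
```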

2. Context Relevance (Precision at K)

What it measures: Context relevance—of the documents retrieved, how many were actually useful?

Your system returned 50 chunks, but only 7 specified actual opening times while 43 just mentioned the pool being “open” without times. This metric was more critical when context windows were smaller, but it’s still valuable for measuring how well your system identifies the most relevant documents and avoids factually incorrect responses.
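
Precision at K is the mirror image of recall. A minimal sketch:

```python
def precision_at_k(retrieved_ids, relevant_ids, k=10):
    """Of the top-k retrieved chunks, what fraction were actually relevant?"""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / len(top_k)
```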

3. Mean Reciprocal Rank (MRR)

What it measures: Ranking quality—how high up was the first relevant document?

If your system returns 10 results but the useful one is #7, that’s suboptimal. The mean reciprocal rank matters most for systems designed to extract single facts from documents. A higher reciprocal rank indicates better retrieval effectiveness.
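
A minimal sketch of MRR over a batch of queries:

```python
def mean_reciprocal_rank(ranked_results, relevant_sets):
    """Average of 1/rank of the first relevant result per query; 0 when nothing relevant is returned."""
    total = 0.0
    for retrieved, relevant in zip(ranked_results, relevant_sets):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_results)
```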

4. Discounted Cumulative Gain (DCG)

What it measures: Weighted relevance—accounts for some chunks being more relevant than others.

This helps when you need to prioritize certain types of information over others in your results. DCG and its normalized variant (normalized discounted cumulative gain, or nDCG) are particularly useful when some retrieved chunks contain more comprehensive or authoritative information than others, because they weight the relevance value of each position in your ranked results.
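
A minimal sketch of DCG and nDCG, given graded relevance scores (for example 0 to 3) for each position in the ranking:

```python
import math

def dcg(relevance_scores):
    """Discounted cumulative gain: relevance counts for less the further down the list it appears."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevance_scores, start=1))

def ndcg(relevance_scores):
    """DCG normalized by the ideal ordering, so 1.0 means a perfect ranking."""
    ideal = dcg(sorted(relevance_scores, reverse=True))
    return dcg(relevance_scores) / ideal if ideal > 0 else 0.0
```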

Evaluation Frameworks: The ARES Approach

RAG evaluation frameworks like ARES suggest plotting two key dimensions for comprehensive evaluation:

Factual Accuracy: Did the system avoid hallucinations, omissions, and misinterpretations? This measures answer correctness and ensures the generated response aligns with the ground truth answer.

Answer Relevance: Were the retrieved items actually needed for the task? This evaluates whether the relevant context supports the generated answer.

Track these metrics over time to understand your RAG system performance trends and identify areas for improvement. Many teams use libraries like Ragas to run these evaluations systematically.
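
Here’s a rough sketch of what that looks like with Ragas. Exact column names and metric imports vary between Ragas versions, so treat this as a starting point rather than a copy-paste recipe:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

eval_data = Dataset.from_dict({
    "question": ["What time does the pool close?"],
    "answer": ["The pool closes at 5pm."],                      # the generated answer
    "contexts": [["The pool is open daily from 8am to 5pm."]],  # the retrieved chunks
    "ground_truth": ["The pool closes at 5pm."],                # from your synthetic dataset
})

result = evaluate(eval_data, metrics=[context_recall, context_precision, faithfulness, answer_relevancy])
print(result)  # per-metric scores you can track over time
```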

How to Evaluate Your RAG System: A Step-by-Step Approach

When your retrieval metrics indicate problems, you have three primary adjustment mechanisms:

1. Adjust Retrieval Prompts

Fine-tune the prompts that control how your system decides what to retrieve. This directly impacts the system’s ability to identify and rank relevant documents based on each user query.

2. Fix Engineering Problems

Often, issues stem from basic implementation problems in the RAG pipeline rather than sophisticated AI challenges. This includes checking your embedding model performance, ensuring proper document preprocessing, and verifying your vector database configuration. Custom metrics can help identify specific bottlenecks in retrieving relevant information.

3. Optimize Chunk Size

Here’s the human-centered approach: Take a query from your test set and answer it yourself using the source documents. Notice what you needed to read—was it a single sentence, a full paragraph, or multiple sections?

Repeat this exercise several times to identify the average chunk size humans need to answer your typical questions. This becomes your starting point for chunk size optimization and directly impacts retrieval quality.

For RAG systems with varied requirements, implement multiple chunking strategies rather than forcing one size for all use cases. Different question types may require different amounts of relevant context to produce accurate answers.
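
A minimal sketch of indexing the same corpus at more than one granularity; the chunker and word counts here are illustrative, not recommendations:

```python
def chunk_by_words(text, chunk_size, overlap=0):
    """Naive fixed-size chunker; real pipelines often split on sentences or headings instead."""
    words = text.split()
    step = max(chunk_size - overlap, 1)
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

documents = ["...your source documents go here..."]  # placeholder corpus

# Index the same documents at multiple granularities for different question types.
chunk_configs = {"facts": 50, "paragraphs": 200, "sections": 800}  # words per chunk, illustrative only
indexes = {
    name: [chunk for doc in documents for chunk in chunk_by_words(doc, size)]
    for name, size in chunk_configs.items()
}
```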

Common Pitfalls to Avoid

Running end-to-end tests before locking down retrieval: Don’t run end-to-end tests until your retrieval evaluation is solid.

Overfitting to poor synthetic data: Ensure your synthetic dataset actually represents real user needs and business objectives.

Choosing inappropriate evaluation metrics: Select metrics that align with your RAG application’s specific use case and answer relevancy requirements.

Ignoring chunk size impact: This factor has more influence on retrieval performance than many realize.

Skipping grounding checks: Always verify that responses are actually based on retrieved context, not generated from scratch. This prevents factually incorrect responses.

Optimizing the wrong component: Start by asking whether retrieval is actually the problem in your RAG system.

Beyond RAG: Applying These Principles to Tool Calling

These evaluation principles extend beyond document retrieval to tool calling in agentic systems. Evaluate (a sketch follows the list):

  • Tool selection accuracy: Did we call the correct tool for the given query?
  • Argument validity: Did the language model provide structurally valid arguments?
  • Execution success: What’s the failure rate, and do failures happen silently?
  • Output quality: Apply the same four evaluation metrics you use for retrieval to measure response quality
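
Here’s a minimal sketch of those checks, assuming a hypothetical call record containing the tool name, a JSON string of arguments, and any execution error:

```python
import json

def evaluate_tool_call(call, expected_tool, required_args):
    """Score one tool call on selection, argument validity, and execution success."""
    scores = {"correct_tool": call["tool"] == expected_tool}
    try:
        args = json.loads(call["arguments"])
        scores["valid_arguments"] = set(required_args) <= set(args)
    except (json.JSONDecodeError, TypeError):
        scores["valid_arguments"] = False
    scores["executed"] = call.get("error") is None  # surface silent failures instead of swallowing them
    return scores
```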

In most tool calling problems, fixing tool descriptions resolves the majority of issues and improves the system’s ability to generate accurate responses.

The Bottom Line

Retrieval evaluation isn’t just about RAG—it’s about systematically testing any component that fetches information for your LLM system. By creating quality synthetic datasets, choosing appropriate metrics, and methodically calibrating your system, you can build retrieval components that actually work.

The key insight is separation of concerns: evaluate your retrieval independently, get it working well, then integrate it into your broader system evaluation. This approach will save you countless hours of debugging and lead to more reliable AI applications.

