
Chapter 16: Evaluation - Using RAGAS to Test Accuracy

Theoretical Foundations

In the world of traditional software development, we have a simple contract with our code: if we write if (x > 5) { ... }, we can prove, with mathematical certainty, that for any given x, the logic will behave as expected. The system is deterministic. But when we introduce Large Language Models into the stack, we enter the realm of probabilistic systems. We are no longer giving the machine a set of explicit instructions; we are giving it a vast sea of information and a vague sense of a goal, hoping it navigates to the correct answer.

This is the fundamental challenge that evaluation seeks to solve. It’s the process of moving from "I think this works" to "I can prove this works with a 95% confidence score."

Imagine you are building a high-performance engine. You wouldn't just assemble the pistons, the crankshaft, and the fuel injectors and then immediately put the car on the road. You would first put the engine on a dynamometer. You would run it through a thousand different cycles, measuring horsepower, torque, fuel efficiency, and heat output under various loads. You would stress-test every component to find its breaking point. Evaluation is the dynamometer for your RAG pipeline. It's the rigorous, automated testing suite that ensures your engine not only runs but runs reliably, efficiently, and safely before it ever faces a real user.

The Illusion of a Single Metric

A common mistake for newcomers is to ask, "What is the score for my RAG system?" This is like asking, "How good is this doctor?" The answer is multifaceted. Is the doctor good at diagnosing rare diseases? Are they good at communicating with patients? Do they have a low rate of surgical errors? Each of these is a separate, critical metric.

Similarly, a RAG system's "goodness" is not a single number. It's a dashboard of metrics, each probing a different part of the complex machinery we built in the previous chapters. The core problem we are solving is the "chain of failure." A RAG pipeline is a chain of steps:

  1. User asks a question.
  2. The question is converted to an embedding.
  3. The embedding is used to search a vector database.
  4. The top-k retrieved chunks are sent to the LLM.
  5. The LLM synthesizes an answer.

A failure at any step can corrupt the entire final output. The user's question might be ambiguous. The embedding model might misunderstand the semantic intent. The retrieval might fetch irrelevant chunks. The LLM might ignore the retrieved context and rely on its own parametric knowledge (hallucination) or misinterpret the provided information.

Evaluation metrics are our tools for isolating these failures. They are the diagnostic instruments that tell us which part of the engine is misfiring.

A Web Development Analogy: The API Contract

Let's use an analogy from the world of web development, which will be familiar to anyone working with JavaScript. Think of your entire RAG pipeline as a microservices architecture.

  • The User's Query is an incoming API request to your query-service.
  • The Retriever is a downstream search-service. It takes the request, queries a database, and returns a payload.
  • The Generator (LLM) is a downstream llm-service. It takes the original request and the payload from the search-service and generates a final response.

In a well-architected microservices system, you would never allow these services to communicate without a strict contract. You would use TypeScript interfaces and Zod schemas to validate the data flowing between them. You'd have unit tests for each service and integration tests for the whole flow.

This is precisely what evaluation metrics do for your RAG pipeline. They are the automated test suite for your AI-powered API.

  • Context Precision is like checking the search-service. Did it return the most relevant records for the query? If you ask for a user's email and it returns their street_address first, its precision is low. It's a bad database query.
  • Answer Relevance is like checking the query-service and the llm-service together. Did the final response actually answer the original API request? If the user asked "How do I reset my password?" and the response was "Our company was founded in 2010," the contract was violated. The response is irrelevant.
  • Faithfulness is the most critical check on the llm-service. It's the equivalent of asking, "Does the data in this response actually come from the payload we sent you, or did you just make it up?" In a traditional API, this is impossible; the service can only work with the data it's given. An LLM, however, can invent data. Faithfulness is our runtime validation that prevents this "API contract violation" from ever reaching the user.

Without these checks, you are shipping a microservice that you know can, and will, return fabricated data or irrelevant information. It's an architectural nightmare.

Deconstructing the Core Metrics: The Three Pillars of RAG Quality

To truly master our data, we must understand the anatomy of these metrics. They are not arbitrary numbers; they are cleverly designed calculations, often performed by other LLMs acting as "judges," that probe the pipeline's integrity.

1. Faithfulness: The Anchor to Reality

What it is: Faithfulness measures the degree to which the generated answer is grounded in the retrieved context. It is a direct check against hallucination.

Why it's critical: In an enterprise setting, an LLM claiming "The Q3 revenue was $12.3M" when the source documents say "$11.9M" is not just a minor error; it's a catastrophic business risk. Faithfulness ensures that every factual claim made in the answer can be traced back, like a footnote in a legal document, to a specific piece of retrieved evidence.

The Under-the-Hood Mechanism: The process is a beautiful example of meta-analysis.

  1. The system takes the final answer and the retrieved_context.
  2. It feeds both to a "judge" LLM with a very specific prompt: "Given this context and this answer, break down the answer into individual statements. For each statement, check if it can be inferred from the context. If it cannot, mark it as unfaithful."
  3. The score is then calculated: Number of Faithful Statements / Total Statements in Answer.

This is like having a meticulous fact-checker read a journalist's article and compare every single claim against the source interviews. If the journalist wrote "The CEO was furious," but the interview only said "The CEO was disappointed," the fact-checker flags it as unfaithful.
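The scoring step of that mechanism fits in a few lines of TypeScript. This is a minimal sketch: the `StatementVerdict` shape and the example verdicts are hypothetical stand-ins for what the judge LLM would return, not part of any real library.

```typescript
// Hypothetical shape of one judge-LLM verdict for a single statement
// extracted from the generated answer.
type StatementVerdict = {
  statement: string;
  isFaithful: boolean;
};

// Faithfulness = Number of Faithful Statements / Total Statements in Answer.
function faithfulnessScore(verdicts: StatementVerdict[]): number {
  if (verdicts.length === 0) return 1.0; // no claims: vacuously faithful
  const faithful = verdicts.filter((v) => v.isFaithful).length;
  return faithful / verdicts.length;
}

// Example: 2 of 3 statements are supported by the context.
const score = faithfulnessScore([
  { statement: "Q3 revenue was $11.9M", isFaithful: true },
  { statement: "Revenue grew year over year", isFaithful: true },
  { statement: "The CEO was furious", isFaithful: false },
]);
// score ≈ 0.67
```

The hard part, of course, is producing the verdicts; the arithmetic is trivial once the judge LLM has decomposed and checked each statement.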

2. Answer Relevance: The Compass of Intent

What it is: Answer Relevance measures how well the generated answer addresses the user's original query. It is not concerned with factual accuracy (that's Faithfulness) but with utility and directness.

Why it's critical: A technically perfect, 100% faithful answer that doesn't answer the user's question is useless. Imagine asking a colleague, "What's the status of the Alpha project?" and they respond with a perfectly faithful, detailed summary of the budget for the Alpha project. The information is correct, but the answer is irrelevant to the question asked. This leads to user frustration and a loss of trust in the system.

The Under-the-Hood Mechanism: This is another meta-analysis that cleverly reverses the problem.

  1. The system takes only the answer.
  2. It feeds the answer to a "judge" LLM with the prompt: "Given this answer, what was the most likely original question that prompted it?"
  3. It then uses an embedding model to calculate the semantic similarity between the generated question and the actual original user question.
  4. A high similarity score means the answer was highly relevant to the original query.

Think of it as a game of charades. If I give you an answer and you can't guess the original question I was thinking of, then my answer was a bad one.
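Step 3 of that mechanism, the similarity comparison, can be sketched as follows. This is an illustration only: the toy vectors stand in for real embedding-model output, and `answerRelevance` is a hypothetical helper, not the RAGAS implementation.

```typescript
// Cosine similarity between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Averages the similarity between the original question's embedding and the
// embeddings of questions the judge LLM reverse-engineered from the answer.
function answerRelevance(originalQ: number[], generatedQs: number[][]): number {
  const sims = generatedQs.map((g) => cosineSimilarity(originalQ, g));
  return sims.reduce((sum, s) => sum + s, 0) / sims.length;
}

// Toy 3-dimensional "embeddings" stand in for real model output.
const relevance = answerRelevance([1, 0, 0], [[0.9, 0.1, 0], [1, 0, 0]]);
// relevance is close to 1.0: the reverse-engineered questions nearly match
```

If the answer were off-topic, the reverse-engineered questions would point in a different semantic direction, and the average similarity would drop accordingly.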

3. Context Precision: The Quality of the Ingredients

What it is: Context Precision measures the signal-to-noise ratio in the retrieved chunks. It asks: "Out of all the chunks we retrieved, how many of them were actually useful for answering the question?"

Why it's critical: This metric directly diagnoses the health of your retrieval system—your embeddings, your vector search algorithm, and your chunking strategy. If you are consistently getting low Context Precision, it means your search-service is noisy. It's returning irrelevant documents alongside relevant ones. This not only wastes tokens and money but also confuses the LLM, which has to sift through the noise to find the signal, increasing the chance of it getting distracted or misinterpreting the context.

The Under-the-Hood Mechanism: This is a more direct, deterministic-style check.

  1. The system takes the retrieved_context (a list of chunks) and the user_query.
  2. A "judge" LLM evaluates each individual chunk and asks: "Is this chunk relevant to answering the query?"
  3. The score is calculated by looking at the proportion of relevant chunks in the top-k results. It heavily rewards relevant chunks appearing at the top of the list.

This is like reviewing a search engine's results page. If the first five results are all perfect matches, the precision is high. If only the eighth result is relevant, the precision is low, even if a relevant result exists somewhere on the page.
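The rank-weighted scoring can be sketched like this, assuming the judge LLM has already produced a boolean verdict for each chunk in ranked order. The exact RAGAS formula differs in its details, so treat this as an illustration of the idea rather than the library's implementation.

```typescript
// Context Precision sketch: the mean of precision@k taken at each rank k
// where the chunk is relevant. Relevant chunks ranked early score higher.
function contextPrecision(relevances: boolean[]): number {
  let relevantSoFar = 0;
  let sum = 0;
  relevances.forEach((isRelevant, i) => {
    if (isRelevant) {
      relevantSoFar++;
      sum += relevantSoFar / (i + 1); // precision@k at this rank
    }
  });
  return relevantSoFar === 0 ? 0 : sum / relevantSoFar;
}

// Two relevant chunks at ranks 1-2 beat the same chunks at ranks 3-4.
const early = contextPrecision([true, true, false, false]); // 1.0
const late = contextPrecision([false, false, true, true]);  // ≈ 0.42
```

Both retrievals found the same two relevant chunks, but the second buried them under noise, and the metric penalizes exactly that.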

The Role of Structured Output and Validation

This is where the definitions you've learned become not just theoretical concepts but practical necessities. When we ask a "judge" LLM to evaluate our pipeline, we cannot trust its free-form text response. We need a structured, predictable output that we can programmatically parse to calculate our scores.

This is the role of JSON Schema Output. We instruct the judge LLM: "Do not just talk to me. Output your evaluation in this exact JSON format: { 'statement': '...', 'is_faithful': true }." This allows us to reliably extract the judgment.

But what happens if the LLM has a bad day and outputs malformed JSON? Or a string instead of a boolean? Our entire evaluation pipeline would crash. This is the role of Zod Schema. On our application side, we define a Zod schema that perfectly matches the JSON structure we expect from the LLM. We then use this schema to parse the LLM's output.

// We define the contract for what a valid judgment must look like.
import { z } from 'zod';

// This Zod schema is our runtime guardrail.
const FaithfulnessJudgmentSchema = z.object({
  statement: z.string().min(1),
  is_faithful: z.boolean(),
});

// An array of these judgments is what we expect from the LLM.
const EvaluationResultSchema = z.array(FaithfulnessJudgmentSchema);

// When the LLM responds, we parse it with Zod.
// If the LLM's output doesn't match the schema, Zod throws a clear error.
// This prevents silent failures and ensures our metrics are calculated on valid data.
const llmResponse = '[{"statement": "The sky is blue", "is_faithful": true}]'; // Example LLM output
const parsedResult = EvaluationResultSchema.parse(JSON.parse(llmResponse));

By combining JSON Schema Output (to guide the LLM) and Zod (to validate the LLM's response), we create a robust, type-safe bridge between the probabilistic world of the LLM and the deterministic world of our evaluation logic. This is the essence of professional AI engineering.

Visualizing the Evaluation Flow

To cement this understanding, let's visualize the entire process. Notice how the evaluation flow runs in parallel to the main RAG flow, acting as a quality gate.

This diagram illustrates how the evaluation flow runs in parallel to the main RAG pipeline, acting as a quality gate that assesses the system's output before it reaches the user.

The Ultimate Goal: CI/CD Integration and the Single Responsibility Principle

This entire theoretical foundation leads to one final, crucial concept. The reason we go through the trouble of defining these metrics, building judge LLMs, and structuring our outputs is to automate quality assurance.

In modern software, we don't manually test every deployment; we have Continuous Integration/Continuous Deployment (CI/CD) pipelines that automatically run tests. We must do the same for our AI systems. By wrapping our evaluation logic in a TypeScript function, we can add it as a mandatory step in our deployment pipeline.

// A conceptual CI/CD pipeline step
async function runRagEvaluationPipeline(): Promise<boolean> {
  const testCases = loadTestSuite();
  const results = [];

  for (const testCase of testCases) {
    // Run the full RAG pipeline
    const { answer, context } = await runRagPipeline(testCase.query);

    // Run the evaluation metrics
    const faithfulness = await evaluateFaithfulness(answer, context);
    const relevance = await evaluateAnswerRelevance(answer, testCase.query);

    results.push({ faithfulness, relevance });
  }

  // Determine if the pipeline meets quality gates
  const averageFaithfulness = calculateAverage(results.map(r => r.faithfulness));
  return averageFaithfulness > 0.95; // Only deploy if >95% faithful
}

This brings us back to the Single Responsibility Principle (SRP). Our runRagPipeline function has one job: generate an answer. Our runRagEvaluationPipeline function has one job: verify the quality of that answer. By keeping these concerns strictly separated, we build a system that is robust, testable, and maintainable. We are not just building an AI that can answer questions; we are building a reliable, enterprise-grade data product that we can stand behind and prove is trustworthy. That is the power and the purpose of evaluation.

Basic Code Example

This example demonstrates a minimal, self-contained TypeScript implementation of a RAG evaluation step within a SaaS context. We will simulate a simplified RAG pipeline that retrieves context for a user query and generates an answer. Crucially, we will then evaluate the Faithfulness of that answer using a heuristic approach (since a full RAGAS implementation typically requires a Python backend or heavy LLM calls, which we will abstract for this "Hello World" example).

Scenario: A user asks a question about a company's internal API documentation. We retrieve relevant text chunks and generate an answer. We then check if the answer contains any claims that are not supported by the retrieved context.

The Logic Flow

  1. Input: A user query and a set of retrieved document chunks (Context).
  2. Generation: A function simulates an LLM generating an answer based on the context.
  3. Evaluation: A function analyzes the generated answer against the context to calculate a Faithfulness score (0.0 to 1.0).
  4. Output: The result is returned as a JSON object, suitable for a web API response.

TypeScript Implementation

/**
 * @fileoverview A basic implementation of RAG evaluation (Faithfulness)
 *               for a TypeScript Web Application context.
 */

// ============================================================================
// 1. TYPE DEFINITIONS
// ============================================================================

/**
 * Represents a single retrieved document chunk.
 * In a real app, this might come from a Vector Database (e.g., Pinecone, Qdrant).
 */
type ContextChunk = {
  content: string;
  score: number; // Similarity score
};

/**
 * The result object returned by our evaluation endpoint.
 */
type EvaluationResult = {
  answer: string;
  metrics: {
    faithfulness: number; // 0.0 to 1.0
    reasoning: string;    // Explanation of the score
  };
};

// ============================================================================
// 2. MOCK DATA & UTILITIES
// ============================================================================

/**
 * Mock database of document chunks.
 * In a production environment, this would be a query to a Vector Store
 * using a Query Vector generated from the user's input.
 */
const MOCK_DOCUMENT_STORE: ContextChunk[] = [
  { content: "The API endpoint /v1/users requires a Bearer token in the header.", score: 0.92 },
  { content: "Rate limits are set to 100 requests per minute for the free tier.", score: 0.85 },
  { content: "Authentication uses JWT (JSON Web Tokens) signed with HS256.", score: 0.78 },
];

/**
 * Simulates a vector similarity search.
 * In a real app, this would involve embedding the query and comparing vectors.
 * @param query - The user's natural language question.
 * @returns - A list of context chunks relevant to the query.
 */
function retrieveContext(query: string): ContextChunk[] {
  // Simple keyword matching simulation for the example
  if (query.toLowerCase().includes("auth") || query.toLowerCase().includes("token")) {
    return [MOCK_DOCUMENT_STORE[0], MOCK_DOCUMENT_STORE[2]]; // Return auth-related chunks
  }
  return [MOCK_DOCUMENT_STORE[1]]; // Default to rate limits
}

/**
 * Simulates an LLM generating an answer based on context.
 * In a real app, this would be a call to OpenAI or a local model.
 * @param query - The user question.
 * @param context - The retrieved chunks.
 * @returns - The generated string answer.
 */
function generateAnswer(query: string, context: ContextChunk[]): string {
  const contextText = context.map(c => c.content).join(" ");

  // SIMULATION: We will generate a "Hallucination" here for testing purposes.
  // The context mentions "HS256", but we will force the LLM to say "RSA256".
  if (query.includes("algorithm")) {
    return "Based on the documentation, the authentication algorithm used is RSA256.";
  }

  // Standard answer
  return `Answer: ${contextText}`;
}

// ============================================================================
// 3. CORE EVALUATION LOGIC (SIMPLIFIED RAGAS)
// ============================================================================

/**
 * Calculates Faithfulness Score.
 * 
 * Faithfulness measures whether the generated answer contains claims
 * that are supported by the retrieved context.
 * 
 * Implementation Strategy:
 * Since we cannot run a full LLM inside this lightweight TS example,
 * we use a keyword-based heuristic. In a real RAGAS setup, you would
 * send the 'answer' and 'context' to a Python service or a secondary LLM call.
 * 
 * @param answer - The generated answer string.
 * @param context - The list of retrieved context chunks.
 * @returns - A score between 0.0 and 1.0.
 */
function calculateFaithfulness(answer: string, context: ContextChunk[]): number {
  // 1. Flatten context into a single string for checking
  const contextText = context.map(c => c.content.toLowerCase()).join(" ");

  // 2. Extract "claims" from the answer (Simplified: We look for specific nouns/adjectives)
  // In a real scenario, an LLM would decompose the answer into statements.
  // Here, we hardcode a check for the specific hallucination we planted.

  const claims = [
    "rsa256", // This is the hallucination
    "hs256",  // This is the truth
    "bearer token",
    "rate limit"
  ];

  let supportedClaims = 0;
  let totalClaims = 0;

  const lowerAnswer = answer.toLowerCase();

  for (const claim of claims) {
    if (lowerAnswer.includes(claim)) {
      totalClaims++;
      if (contextText.includes(claim)) {
        supportedClaims++;
      }
    }
  }

  // 3. Calculate Score
  if (totalClaims === 0) return 1.0; // No claims made implies perfect faithfulness (vacuously true)

  return supportedClaims / totalClaims;
}

// ============================================================================
// 4. MAIN EXECUTION FLOW (API ROUTE SIMULATION)
// ============================================================================

/**
 * Simulates a Next.js API Route or Edge Runtime function.
 * 
 * @param query - The user input.
 * @returns - Promise<EvaluationResult>
 */
async function evaluateRagPipeline(query: string): Promise<EvaluationResult> {
  console.log(`\n--- Processing Query: "${query}" ---`);

  // Step 1: Retrieval
  // In a real app, this uses the Query Vector.
  const context = retrieveContext(query);
  console.log("Retrieved Context:", context.map(c => c.content));

  // Step 2: Generation
  const answer = generateAnswer(query, context);
  console.log("Generated Answer:", answer);

  // Step 3: Evaluation
  const faithfulnessScore = calculateFaithfulness(answer, context);

  // Step 4: Construct Result
  const result: EvaluationResult = {
    answer,
    metrics: {
      faithfulness: faithfulnessScore,
      reasoning: faithfulnessScore < 1.0 
        ? "The answer contained claims not found in the retrieved context." 
        : "All claims in the answer are supported by the context."
    }
  };

  return result;
}

// ============================================================================
// 5. EXECUTION (Entry Point)
// ============================================================================

/**
 * Main entry point to run the simulation.
 */
async function main() {
  // Scenario A: Hallucination detected
  const queryA = "What algorithm does the API use for authentication?";
  const resultA = await evaluateRagPipeline(queryA);

  console.log("\n=== RESULT A ===");
  console.log(JSON.stringify(resultA, null, 2));

  // Scenario B: Faithful answer
  const queryB = "What is the rate limit?";
  const resultB = await evaluateRagPipeline(queryB);

  console.log("\n=== RESULT B ===");
  console.log(JSON.stringify(resultB, null, 2));
}

// Execute if this file is run directly (Node.js environment)
if (require.main === module) {
  main().catch(console.error);
}

export { evaluateRagPipeline, calculateFaithfulness };

Line-by-Line Explanation

Here is the detailed breakdown of the code logic, designed to map directly to enterprise RAG architecture concepts.

1. Type Definitions

  • ContextChunk: Defines the shape of data coming from your Vector Database. It includes the text content and the similarity score (cosine distance) used for ranking.
  • EvaluationResult: Defines the output structure expected by a frontend dashboard or monitoring tool. It encapsulates the raw answer and the computed metrics.

2. Mock Data & Utilities

  • MOCK_DOCUMENT_STORE: Represents the "Knowledge Base." In a production environment, this data resides in a vector store like Pinecone or Weaviate. We simulate it here to ensure the example is self-contained.
  • retrieveContext: Simulates the Retrieval phase of RAG. It accepts a query and returns relevant chunks. In a real system, this step involves converting the user query into a Query Vector and performing a similarity search against the database.
  • generateAnswer: Simulates the Generation phase. It takes the context and the query and produces a string.
    • Critical Detail: In the code, we intentionally inject a hallucination ("RSA256") when the query asks about the "algorithm." This allows us to test the evaluation logic later. In a real app, this would be a call to an LLM like GPT-4.

3. Core Evaluation Logic (calculateFaithfulness)

This is the heart of the "Hello World" example. It implements a simplified version of the RAGAS Faithfulness metric.

  • Context Flattening: We combine all retrieved context chunks into a single lowercase string to make searching easier.
  • Claim Extraction: In a production RAGAS implementation, an LLM is used to decompose the generated answer into individual statements (claims). In this TypeScript example, we simulate this by defining a list of claims (keywords/phrases) we expect to see.
  • Verification Loop: The code iterates through the claims. If a claim exists in the answer, it checks if that same claim exists in the contextText.
  • Scoring: The score is calculated as the ratio of supportedClaims to totalClaims.
      • If the answer says "RSA256" (the planted hallucination) but the context only has "HS256", the answer's single claim is unsupported and the score drops to 0.0 (0 supported claims / 1 total claim).
      • If the answer is purely based on context, the score remains 1.0.

4. Main Execution Flow (evaluateRagPipeline)

This function orchestrates the pipeline, mimicking a serverless function (like a Vercel Edge Function or Next.js API Route).

  1. Retrieval: Calls retrieveContext.
  2. Generation: Calls generateAnswer.
  3. Evaluation: Calls calculateFaithfulness.
  4. Packaging: Returns a structured JSON object.

5. Execution (main)

This block runs the simulation locally. It tests two scenarios:

  1. Scenario A: Queries for the authentication algorithm. This triggers the hallucination logic, resulting in a lower faithfulness score.
  2. Scenario B: Queries for the rate limit. This retrieves the correct context and generates a faithful answer, resulting in a score of 1.0.

Visualization of the Data Flow

The following diagram illustrates the logic flow within the evaluation pipeline.

The diagram visually maps the sequential data flow through the evaluation pipeline, tracing the movement of inputs, processing steps, and outputs to clarify the system's logic.

Common Pitfalls in TypeScript/Node.js RAG Evaluation

When implementing RAG evaluation in a TypeScript environment (especially for SaaS/Web Apps), watch out for these specific issues:

1. Async/Await Race Conditions in Loops

When evaluating multiple queries in parallel (e.g., batch evaluation), developers often pair forEach with an async callback. Array.prototype.forEach never awaits its callbacks.

  • The Issue: If you are updating a shared counter for metrics inside a forEach loop using await, the loop returns before the callbacks finish, so any aggregation runs on incomplete state.
  • The Fix: Use Promise.all with map and ensure state updates are handled via pure functions.

// BAD
let totalScore = 0;
queries.forEach(async (q) => {
    const res = await evaluate(q); // Race condition
    totalScore += res.score;
});

// GOOD
const results = await Promise.all(queries.map(q => evaluate(q)));
const totalScore = results.reduce((sum, r) => sum + r.score, 0);

2. Vercel/AWS Lambda Timeouts

RAG evaluation, especially if you call an external LLM for scoring (e.g., using GPT-4 to check faithfulness), can take 5–10 seconds per query.

  • The Issue: Serverless functions (Vercel Edge, AWS Lambda) often have strict timeouts (e.g., 10 seconds for Hobby plans). If you evaluate a batch of 10 queries sequentially, you will time out.
  • The Fix: Move evaluation to a background job (e.g., Vercel Cron or a queue like BullMQ). If doing it live, ensure you stream the response or use the Edge Runtime with streamToResponse.
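If you must stay inside a single invocation, one mitigation is to cap concurrency with small Promise.all batches rather than running one long sequential loop. The helper below is a hypothetical utility sketch, not part of any framework.

```typescript
// Runs `worker` over `items` in fixed-size concurrent batches, so no more
// than `batchSize` evaluations are in flight at once and results stay in
// input order.
async function evaluateInBatches<T, R>(
  items: T[],
  worker: (item: T) => Promise<R>,
  batchSize = 5,
): Promise<R[]> {
  const results: R[] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    results.push(...(await Promise.all(batch.map(worker))));
  }
  return results;
}
```

Usage might look like `const scores = await evaluateInBatches(queries, evaluate, 5);` — each batch of 5 runs concurrently, which shortens wall-clock time without unbounded parallel LLM calls.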

3. Hallucinated JSON Parsing

When building an evaluator that outputs structured data (e.g., asking an LLM to return a JSON object with scores), LLMs often return invalid JSON (trailing commas, unquoted keys).

  • The Issue: JSON.parse() throws a syntax error, crashing the Node.js process.
  • The Fix: Use a validation library like Zod to parse and validate the output strictly before using it.

import { z } from 'zod';

const EvaluationSchema = z.object({
    score: z.number().min(0).max(1),
    reasoning: z.string()
});

// `llmOutput` is whatever the LLM returned, already run through JSON.parse.
const llmOutput: unknown = { score: 0.9, reasoning: "Grounded in context." };

try {
    const parsed = EvaluationSchema.parse(llmOutput);
} catch (error) {
    // Handle the parsing error gracefully (e.g., retry the LLM call or log it)
}

4. Token Limit Exceeded in Context

When passing context to the evaluation LLM (if using a second LLM for scoring), you might exceed the context window.

  • The Issue: If you retrieve 20 chunks of text and pass them all to GPT-4 for evaluation, you might hit the 8k/32k token limit.
  • The Fix: Implement a "context compression" step or truncate the context string before passing it to the evaluator, prioritizing the highest-scoring chunks from the vector search.
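A naive truncation step might look like the following sketch. The 4-characters-per-token ratio is a rough heuristic (a real implementation would use an actual tokenizer), and the `Chunk` shape is assumed to match what your vector store returns.

```typescript
// A retrieved chunk with its similarity score, as returned by a vector store.
type Chunk = { content: string; score: number };

// Keeps the highest-scoring chunks that fit within a rough token budget,
// dropping the rest. Token counts are estimated at ~4 characters per token.
function truncateContext(chunks: Chunk[], maxTokens: number): Chunk[] {
  const sorted = [...chunks].sort((a, b) => b.score - a.score);
  const kept: Chunk[] = [];
  let tokens = 0;
  for (const chunk of sorted) {
    const estimate = Math.ceil(chunk.content.length / 4); // crude heuristic
    if (tokens + estimate > maxTokens) break;
    kept.push(chunk);
    tokens += estimate;
  }
  return kept;
}
```

Because the chunks are sorted by score first, the budget is always spent on the most relevant evidence, which is exactly what the evaluator needs most.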

The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.





Code License: All code examples are released under the MIT License. Github repo.

Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.

All textual explanations, original diagrams, and illustrations are the intellectual property of the author. To support the maintenance of this site via AdSense, please read this content exclusively online. Copying, redistribution, or reproduction is strictly prohibited.