
Chapter 18: Testing AI Applications - Jest & Evals

Theoretical Foundations

At the heart of testing AI applications lies a fundamental tension that does not exist in traditional software engineering: the collision between deterministic logic and probabilistic generation. In conventional web development, when you send a request to an API endpoint, you expect a predictable response. If you send a GET request to /api/users/1, the database query returns a specific row. The logic is binary; it either works or it fails. The code executes instructions sequentially, and the output is a direct, reproducible function of the input.

Large Language Models (LLMs), however, operate on a completely different paradigm. They are not instruction-following machines in the traditional sense; they are probabilistic engines. When you ask an LLM a question, it does not "look up" an answer. Instead, it calculates the statistical likelihood of the next token (a word or sub-word) appearing in a sequence, based on the patterns it learned during training. This introduces non-determinism. Even with the exact same prompt and system instructions, an LLM might generate a slightly different response, or a radically different one, depending on the "temperature" (randomness) setting.

This creates a unique challenge for developers. How do you write tests for code that is, by its very nature, unpredictable? If you write a test asserting that generateEmailSubject("Urgent") returns exactly "Action Required: Urgent Request", the test will eventually fail, not because your code is broken, but because the model might return "Immediate Attention Needed" or "Urgent Matter Requiring Your Input". Both are semantically correct, but they break traditional assertions.
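To make this concrete, here is a minimal sketch (the subject lines are hypothetical) contrasting a brittle exact-match check with an assertion on properties that any valid subject should satisfy:

```typescript
// Two plausible subjects an LLM might return for the same prompt.
const runA = "Action Required: Urgent Request";
const runB = "Immediate Attention Needed";

// Brittle: exact string equality fails across runs.
const exactMatch = (s: string): boolean =>
  s === "Action Required: Urgent Request";

// More robust: assert on properties any valid subject must have —
// non-empty, within a sane length, and signaling urgency.
const isValidSubject = (s: string): boolean =>
  s.length > 0 && s.length <= 78 && /urgent|immediate|action/i.test(s);
```

The exact-match predicate accepts only one of the two runs, while the property-based predicate accepts both — which is exactly the shift in mindset this chapter is about.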

This is why the theoretical foundation of testing AI applications shifts from simple "pass/fail" unit tests to a more nuanced approach involving evaluation frameworks and deterministic wrappers. We must treat the LLM not as a calculator, but as a probabilistic component within a larger deterministic system.

The Web Development Analogy: The Unpredictable Microservice

To understand this, let's use a web development analogy. Imagine you are building an e-commerce platform. In your backend, you have a PaymentService microservice. This service is deterministic. You send it a payload with a credit card number and amount, and it returns a JSON object with a status: "success" or status: "failed". You can unit test this easily with Jest: you mock the input, assert the output, and you're done.

Now, imagine replacing that PaymentService with an EmailDraftingService powered by an LLM. You send it a payload containing a customer complaint, and it returns a drafted email response.

If you treat this LLM service like the PaymentService, your tests will be flaky. The LLM might decide to be more empathetic today, or slightly more formal, changing the text of the email. Your assertion expect(email).toBe("Dear Customer...") will fail because the model generated "Hello Customer...".

Therefore, the theoretical approach to testing AI apps is to build a "wrapper" around the LLM. This wrapper acts as a contract enforcer. It ensures that while the content of the LLM's output might vary, the structure of the data it returns remains rigid. This is where concepts from previous chapters, specifically Zod, become critical. Zod allows us to define a schema for the LLM's output, effectively forcing the probabilistic model into a deterministic data shape.

The Role of Structured Output and Zod

In Chapter 16, we explored Zod as a tool for runtime type validation. In the context of AI testing, Zod evolves from a validation utility into a testing enabler. Without structured output, testing an LLM's response is like trying to catch a greased pig; it's slippery and unpredictable.

When we instruct an LLM to produce output conforming to a JSON Schema, we are essentially giving it a template. We are saying, "Do not just emit free-form text. Structure your internal reasoning into this specific object format."
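As a sketch, such a template-style instruction might look like the following (the wording is illustrative, not a guaranteed-effective prompt; real apps typically pair it with the provider's structured-output or JSON-mode settings):

```typescript
// A hypothetical system prompt pinning the model to a JSON "template".
const systemPrompt = [
  'Respond ONLY with a JSON object of this exact shape:',
  '{ "title": string, "sentiment": "positive" | "neutral" | "negative" }',
  'Do not include markdown fences or any commentary.',
].join('\n');
```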

This is analogous to how an HTML form works. When a user fills out a <form>, they can type anything they want into the <input> fields, but the browser enforces the type="email" attribute. It ensures the data conforms to a specific structure before it is sent to the server. Zod acts as that browser validator on the server side. It takes the LLM's raw output and validates it against a schema.

If the LLM outputs a string that cannot be parsed into the expected JSON structure, Zod throws an error. This allows us to write a deterministic test: expect(() => schema.parse(llmOutput)).not.toThrow(). This test doesn't care about the semantic content of the email draft; it cares about the structural integrity of the data. This is the first layer of testing: Input/Output Structure Testing.
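The idea can be sketched without any framework. The dependency-free stand-in below mimics Zod's .parse contract (throw on a structural violation) so that the structural test stays deterministic; the schema shape itself is hypothetical:

```typescript
// Minimal stand-in for a Zod-style schema (illustrative; real code would use z.object).
// Like Zod's .parse, it throws when the structure is wrong.
const emailDraftSchema = {
  parse(raw: string): { subject: string; body: string } {
    const data = JSON.parse(raw) as Record<string, unknown>;
    if (typeof data.subject !== 'string' || typeof data.body !== 'string') {
      throw new Error('Structural validation failed');
    }
    return { subject: data.subject, body: data.body };
  },
};

// Deterministic structural check: the content may vary, the shape may not.
const structurallyValid = (raw: string): boolean => {
  try {
    emailDraftSchema.parse(raw);
    return true;
  } catch {
    return false;
  }
};
```

A Jest assertion over this is fully deterministic: `expect(structurallyValid(llmOutput)).toBe(true)` passes for any phrasing the model chooses, as long as the shape holds.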

The Analogy of the "Golden Dataset"

Moving beyond structure, we must address content. How do we test if the LLM is actually generating good emails? We cannot rely on manual inspection for every iteration. This leads us to the concept of the Golden Dataset (also known as a "Regression Dataset").

In traditional software testing, we often have "fixtures"—static data files used to simulate specific scenarios. In AI testing, the Golden Dataset is a collection of input-output pairs that have been manually verified by a human expert.

Think of this as a standardized test for students. You don't grade a student by comparing their essay word-for-word to a "perfect" essay (because there isn't one). Instead, you grade them based on whether they covered the key points (rubric) and whether the grammar is correct.

In AI, a Golden Dataset entry might look like this:

  • Input: "Customer is angry about a late shipment."
  • Expected Output (Verified): An email that is empathetic, offers a discount, and apologizes.
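In code, such an entry might be modeled like this (the interface and rubric keywords are illustrative assumptions, not a standard format):

```typescript
// A hypothetical Golden Dataset entry: a verified input/output pair plus a rubric.
interface GoldenEntry {
  input: string;
  expectedOutput: string; // human-verified reference response
  mustInclude: string[];  // rubric: key points any valid answer covers
}

const goldenDataset: GoldenEntry[] = [
  {
    input: "Customer is angry about a late shipment.",
    expectedOutput:
      "We sincerely apologize for the delay. As a gesture of goodwill, here is a 10% discount on your next order.",
    mustInclude: ["apolog", "discount"], // substring stems, checked case-insensitively
  },
];

// Rubric check: does a generated reply cover every required point?
const coversRubric = (reply: string, entry: GoldenEntry): boolean =>
  entry.mustInclude.every(k => reply.toLowerCase().includes(k));
```

The rubric check mirrors how a teacher grades against key points rather than against one "perfect" essay.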

When we run our evaluation, we feed the Input into the LLM and compare the Generated Output to the Expected Output. Since we cannot do a simple string equality check (due to the non-determinism discussed earlier), we use Semantic Similarity.

This is where the analogy of Embeddings (from Chapter 15) becomes crucial. Embeddings convert text into high-dimensional vectors (lists of numbers). In the web development world, think of an embedding as a Hash Map or a Search Index. Just as a hash map runs a key (a string) through a hash function to get a bucket index (a number) for fast lookups, an embedding converts a sentence into a vector that captures its meaning.

To test the LLM, we generate the embedding for the Expected Output and the Generated Output. We then calculate the Cosine Similarity between these two vectors. If the angle between them is small (a high similarity score), the LLM has produced a semantically equivalent response, even if the words are different. This allows us to automate the grading of the LLM's performance.
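The similarity computation itself is simple; here is a sketch over plain number arrays (in a real eval, the vectors would come from an embedding API):

```typescript
// Cosine similarity between two embedding vectors (assumed equal length).
// Returns 1 for identical directions, 0 for orthogonal vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

An eval might then assert `cosineSimilarity(expectedVec, generatedVec) > 0.85`, where the threshold is tuned empirically per task.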

The Testing Pyramid for AI Applications

The theoretical framework for testing AI applications can be visualized as a pyramid, similar to the traditional testing pyramid, but adapted for the unique properties of LLMs.

The pyramid visually prioritizes LLM testing layers, starting with high-volume unit tests at the base, moving through integration and evaluation tests, and culminating in targeted manual reviews at the apex.
  1. The Base: Deterministic Unit Tests (Jest & Zod) This is the foundation. We use Jest not to test the LLM's creativity, but to test the glue code surrounding it. We test that our functions correctly format the prompt, that our API calls handle errors, and that our Zod schemas correctly parse the structured output.

    • Analogy: Testing the plumbing of a house. You don't test if the water is "tasty"; you test that the pipes are connected correctly and don't leak.
  2. The Middle: Semantic Evaluation (Golden Datasets) This layer tests the actual intelligence of the application. We run our LLM against a dataset of known inputs and compare the outputs using metrics like semantic similarity (via embeddings) or string metrics (like Levenshtein distance for simple edits).

    • Analogy: A spell-checker or grammar tool. It doesn't rewrite your essay, but it highlights areas where the meaning might deviate from standard expectations.
  3. The Top: Human Evaluation (Evals) At the very top, we have "Evals." This is a term popularized by OpenAI, referring to the continuous monitoring and manual review of model performance. No automated metric can perfectly capture nuance, humor, or brand voice. This layer involves human experts reviewing a sample of the LLM's outputs to ensure quality.

    • Analogy: Code review. A linter (automated test) can catch syntax errors, but only a human developer can judge the architectural elegance and maintainability of the code.
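Of the automated metrics in the middle layer, Levenshtein distance is the easiest to sketch from scratch (standard dynamic programming; embeddings, by contrast, require a model):

```typescript
// Levenshtein edit distance: the minimum number of insertions, deletions,
// and substitutions needed to turn one string into another.
function levenshtein(a: string, b: string): number {
  // dp[i][j] = distance between the first i chars of a and first j chars of b.
  const dp: number[][] = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) =>
      i === 0 ? j : j === 0 ? i : 0
    )
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,     // deletion
        dp[i][j - 1] + 1,     // insertion
        dp[i - 1][j - 1] + cost // substitution (or match)
      );
    }
  }
  return dp[a.length][b.length];
}
```

This metric only suits near-verbatim expectations (e.g., a fixed disclaimer sentence); for free-form content, semantic similarity is the better tool.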

Under the Hood: How Jest and V8 Handle AI Tests

It is important to understand the role of the V8 Engine in this process. When we run our tests using Jest (which runs on Node.js, powered by V8), we are executing JavaScript in a highly optimized environment.

Jest does not "know" that it is testing an AI application. To Jest, an AI test is just another asynchronous function. When we write a test that calls an LLM, Jest waits for the Promise to resolve. However, because LLM calls are slow and expensive, we cannot run them on every single code change like we do with pure logic tests.

This necessitates a shift in the testing workflow:

  1. Mocking the LLM: During rapid development, we use Jest to mock the OpenAI API response. We simulate the LLM returning a specific string. This allows us to test our parsing logic (Zod) and application logic instantly, leveraging V8's speed.
  2. Integration Testing: When we are ready to validate the AI behavior, we switch off the mocks and run the "Golden Dataset" tests. These are slower and might take minutes to run, rather than milliseconds.
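One way to make the mock swap trivial is dependency injection; the client interface and names below are illustrative, not a real SDK:

```typescript
// Dependency-injected LLM client so tests can substitute a deterministic mock
// without touching the network (a sketch under assumed names).
type LLMClient = { complete(prompt: string): Promise<string> };

async function draftEmail(client: LLMClient, complaint: string): Promise<string> {
  const raw = await client.complete(`Draft a reply to: ${complaint}`);
  return raw.trim();
}

// In unit tests, a mock client resolves instantly with a canned response:
const mockClient: LLMClient = {
  complete: async () => ' {"subject":"Re: delay","body":"We apologize."} ',
};
```

In a Jest suite you would typically achieve the same seam with jest.mock on the SDK module; explicit injection simply makes the boundary visible in the type system.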

The V8 engine ensures that the deterministic parts of our application (the code that processes the LLM's output) remain incredibly fast, even if the LLM calls themselves are slow. By isolating the probabilistic component (the LLM call) behind a deterministic interface (the API wrapper), we maintain the reliability of our application.

Theoretical Foundations: A Summary

To summarize, testing AI applications requires a hybrid approach:

  1. Enforce Structure: Use JSON Schema and Zod to force the LLM into a predictable data format, allowing for deterministic parsing.
  2. Validate Semantics: Use Golden Datasets and Embeddings to measure the "meaning" of the output, rather than the exact text.
  3. Layer Your Tests: Use Jest for the logic surrounding the LLM (fast, deterministic) and Evals for the LLM's output itself (slow, probabilistic).

This theoretical foundation moves us away from the binary thinking of "pass/fail" and toward a statistical understanding of software quality, where we measure confidence and accuracy rather than absolute correctness.

Basic Code Example

In the context of a SaaS web application, a common pattern is to receive streaming AI responses (via Server-Sent Events) and validate them against a strict schema before rendering them in the UI. This prevents the "hallucinated JSON" issue where the LLM outputs malformed data that crashes the client application.

Below is a self-contained TypeScript example demonstrating how to:

  1. Define a strict schema using Zod.
  2. Create a mock LLM function that simulates streaming JSON tokens (simulating an OpenAI or LangChain.js response).
  3. Parse the accumulated stream using Zod.
  4. Write a Jest test to validate the parsing logic.

The TypeScript Implementation

// file: ai-stream-parser.ts
import { z } from 'zod';

// 1. Define the Schema
// We expect the AI to return a structured summary of a user's project.
// This ensures type safety in our React/Vue/Svelte frontend.
const ProjectSummarySchema = z.object({
  title: z.string().min(1),
  sentiment: z.enum(['positive', 'neutral', 'negative']),
  keyFeatures: z.array(z.string()).max(3),
  estimatedHours: z.number().positive(),
});

// Type inference for TypeScript
export type ProjectSummary = z.infer<typeof ProjectSummarySchema>;

/**
 * Simulates an LLM streaming response (Server-Sent Events).
 * In a real app, this would be an OpenAI stream or LangChain.js callback.
 * 
 * @param input - The prompt sent to the AI
 * @returns An async generator yielding string tokens
 */
export async function* mockLLMStream(input: string): AsyncGenerator<string, void, unknown> {
  // Simulating a JSON response stream from the AI
  const jsonTokens = JSON.stringify({
    title: "Project Alpha",
    sentiment: "positive",
    keyFeatures: ["Fast", "Scalable", "Secure"],
    estimatedHours: 40
  }).split(''); // Split into individual characters to simulate streaming

  // Yield tokens with a slight delay to simulate network latency
  for (const char of jsonTokens) {
    yield char;
    await new Promise(resolve => setTimeout(resolve, 10)); // Simulate network delay
  }
}

/**
 * Accumulates stream tokens and validates against the Zod schema.
 * 
 * @param stream - The AsyncGenerator from the LLM
 * @returns The validated, typed object
 * @throws Error if the accumulated JSON is invalid or schema validation fails
 */
export async function parseAndValidateStream(
  stream: AsyncGenerator<string, void, unknown>
): Promise<ProjectSummary> {
  let accumulatedContent = '';

  // 1. Consume the stream
  for await (const chunk of stream) {
    accumulatedContent += chunk;
  }

  // 2. Attempt to parse the raw string into JSON
  let parsedData: unknown;
  try {
    parsedData = JSON.parse(accumulatedContent);
  } catch (error) {
    throw new Error(`Failed to parse JSON from stream: ${accumulatedContent}`);
  }

  // 3. Validate against the Zod schema
  // This is the critical step: converting unstructured AI output into typed data
  const validationResult = ProjectSummarySchema.safeParse(parsedData);

  if (!validationResult.success) {
    // Format Zod errors for debugging (useful in SaaS logs)
    const errorMessages = validationResult.error.issues.map(issue => 
      `${issue.path.join('.')}: ${issue.message}`
    ).join(', ');
    throw new Error(`Schema validation failed: ${errorMessages}`);
  }

  return validationResult.data;
}

The Jest Test Implementation

// file: ai-stream-parser.test.ts
import { parseAndValidateStream, mockLLMStream } from './ai-stream-parser';

describe('AI Stream Parser', () => {
  // Unit test: Logic validation
  test('should correctly parse and validate a valid AI stream', async () => {
    // Arrange: Create the stream
    const stream = mockLLMStream("Summarize project Alpha");

    // Act: Parse the stream
    const result = await parseAndValidateStream(stream);

    // Assert: Check the result matches the schema and types
    expect(result).toBeDefined();
    expect(result.title).toBe("Project Alpha");
    expect(result.sentiment).toBe("positive");
    expect(result.estimatedHours).toBe(40);
    expect(Array.isArray(result.keyFeatures)).toBe(true);
  });

  // Unit test: Error handling for malformed JSON
  test('should throw an error if the stream contains invalid JSON', async () => {
    // Helper generator that yields broken JSON
    async function* brokenStream() {
      yield '{ "title": "Unclosed'; // Missing closing quote and brace
    }

    // Act & Assert
    await expect(parseAndValidateStream(brokenStream())).rejects.toThrow(
      'Failed to parse JSON from stream'
    );
  });

  // Unit test: Error handling for schema mismatch
  test('should throw a Zod error if data types do not match schema', async () => {
    // Helper generator that yields valid JSON but wrong types
    async function* mismatchedStream() {
      yield JSON.stringify({
        title: "Project Beta",
        sentiment: "unknown_value", // Invalid enum value
        keyFeatures: ["Feature 1"],
        estimatedHours: "forty" // Should be a number, not a string
      });
    }

    // Act & Assert
    await expect(parseAndValidateStream(mismatchedStream())).rejects.toThrow(
      'Schema validation failed'
    );
  });
});

Line-by-Line Explanation

1. Schema Definition (ProjectSummarySchema)

  • Why: LLMs are non-deterministic. They might return text instead of numbers, or extra fields you don't expect.
  • How: We use z.object to define the shape. z.enum restricts values to specific strings. z.array(z.string()) ensures keyFeatures is a list of strings.
  • Under the Hood: Zod creates a validation function that runs at runtime. It checks types and values, and with safeParse it returns a result object (success/failure) rather than throwing an exception immediately, which is safer for stream processing.

2. Mock LLM Stream (mockLLMStream)

  • Why: To test the parser, we need a stream that behaves like a real network connection (Server-Sent Events).
  • How: We use an AsyncGenerator function (async function*). It yields one character at a time after a 10ms delay.
  • Under the Hood: This cooperates with the event loop: the yield keyword pauses execution until the consumer (the for await loop) requests the next chunk, simulating backpressure.

3. The Parser (parseAndValidateStream)

  • Why: This is the core logic of the application. It bridges the gap between the raw text stream and the typed application state.
  • Step 1 (let accumulatedContent): Initializes a buffer to collect tokens.
  • Step 2 (for await...of): This loop handles the asynchronous nature of the stream. It waits for each chunk to arrive before processing the next.
  • Step 3 (JSON.parse): Converts the accumulated string into a JavaScript object. We wrap this in a try/catch because JSON.parse throws on invalid syntax (e.g., a missing comma).
  • Step 4 (ProjectSummarySchema.safeParse): This is the validation step. Unlike .parse(), safeParse returns an object containing either the data (if valid) or error (if invalid). This prevents the application from crashing and allows us to handle validation errors gracefully (e.g., showing a "Retry" button in the UI).

4. Jest Tests

  • Test 1 (Happy Path): Verifies that valid input results in a correctly typed object.
  • Test 2 (JSON Error): Verifies that if the AI hallucinates broken JSON syntax, the parser catches the exception and throws a descriptive error.
  • Test 3 (Schema Error): Verifies that if the AI returns valid JSON but wrong data (e.g., a string instead of a number), Zod catches it and throws a validation error.

Visualizing the Data Flow

The following diagram illustrates the flow of data from the LLM to the validated TypeScript object.

The diagram illustrates the data flowing from the LLM output through a Zod schema validation step, where type mismatches trigger errors, before finally arriving as a validated TypeScript object.

Common Pitfalls

When implementing AI streaming validation in a production SaaS environment, watch out for these specific issues:

  1. Incomplete JSON Chunks (The "Cut-off" Problem)

    • Issue: LLMs often stream tokens faster than the client can process them, or the connection might drop mid-stream. The accumulatedContent might end with {"title": "My Proj (incomplete).
    • Solution: Always wrap JSON.parse in a try/catch. If parsing fails, do not crash; instead, buffer the data and wait for more chunks (or implement a timeout). In a real app, you might use a streaming JSON parser (like jsonrepair or a state-machine parser) rather than waiting for the stream to end to parse.
  2. Vercel/AWS Lambda Timeouts

    • Issue: Serverless functions have strict execution timeouts (e.g., 10s on Vercel Hobby). If the LLM is slow or the stream is long, the function might time out before the stream finishes.
    • Solution: Do not await the full stream in the serverless function if possible. Instead, pipe the stream directly to the HTTP response (using Web Streams API). The validation should happen on the client side (or in a separate background worker) if the response time is critical.
  3. Async/Await Loops and Memory Leaks

    • Issue: Using while loops with await inside can block the Node.js Event Loop if not managed correctly, or create memory leaks if streams aren't closed properly.
    • Solution: Use for await...of syntax, which handles backpressure and cleanup automatically. Ensure you implement proper AbortController signals to cancel the stream if the user navigates away from the page.
  4. Zod Performance on Large Objects

    • Issue: Validating massive JSON objects (e.g., a 10MB response) synchronously can block the main thread, making the UI unresponsive.
    • Solution: For large payloads, validate in chunks or offload validation to a Web Worker. Alternatively, validate only the slices of the object you actually need, using schema helpers such as .pick() or .partial(), instead of revalidating the entire payload on every update.
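The buffering strategy from pitfall 1 can be sketched as a "try, and treat failure as incomplete" helper (a simplification; a real streaming parser would also distinguish truly malformed JSON from merely incomplete JSON):

```typescript
// Attempt to parse the accumulated buffer; a parse failure mid-stream is
// expected and simply means "keep buffering", not "crash".
// (Caveat: a top-level JSON null is indistinguishable from "incomplete" here.)
function tryParse(buffer: string): unknown {
  try {
    return JSON.parse(buffer);
  } catch {
    return undefined; // incomplete or malformed — wait for more chunks
  }
}
```

Calling this after every chunk lets the consumer react the moment the payload becomes well-formed, rather than waiting for the stream to close.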

The chapter continues with advanced code, exercises and solutions with analysis, you can find them on the ebook on Leanpub.com or Amazon





Code License: All code examples are released under the MIT License. Github repo.

Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.

All textual explanations, original diagrams, and illustrations are the intellectual property of the author. To support the maintenance of this site via AdSense, please read this content exclusively online. Copying, redistribution, or reproduction is strictly prohibited.