Chapter 13: Unit Testing Prompts - CI/CD for AI
Theoretical Foundations
In traditional software engineering, a unit test is a contract of certainty. When you write a function add(a, b), you assert that add(2, 2) will always return 4. This is deterministic. The logic is rigid, the inputs are defined, and the output is a mathematical guarantee.
When we move to the world of Large Language Models (LLMs), this contract shatters. We are no longer dealing with deterministic logic; we are dealing with probabilistic inference. An LLM is a "stochastic parrot": given the prompt "The capital of France is", it will likely output "Paris", but it might output "Paris, a city of light" or "Paris (population 2.1 million)".
Unit Testing Prompts is the discipline of enforcing quality and consistency on these probabilistic outputs. It is the bridge between the wild, creative potential of an LLM and the rigorous reliability required for production software.
The Analogy: The Master Chef and the Food Critic
Imagine you are building a system for a high-end restaurant (your application). You hire a brilliant, creative Master Chef (the LLM) to generate new dishes (text outputs).
- The Problem: You cannot simply tell the Chef "make a soup" and expect the exact same bowl every time. The Chef might use a different garnish, a slightly different spice blend, or describe it with different poetry. This is the stochastic nature of the LLM.
- The Solution: You hire a team of Food Critics (your Test Suite). These Critics don't have a recipe for the "perfect" soup. Instead, they have a set of criteria:
- Deterministic Assertion (The "Salt" Test): Does the soup contain salt? (Does the output contain the word "Paris"?). This is a hard, binary check.
- Semantic Similarity (The "Taste" Test): Does the soup taste like a French onion soup? Even if the Chef used a different type of onion, the essence should be the same. We measure this with a "flavor profile" score (Cosine Similarity of Embeddings).
- Structural Validation (The "Plating" Test): Is the soup served in a bowl, not a shoe? (Is the output valid JSON? Does it adhere to a specific schema?).
By running these critics automatically (CI/CD), we ensure that no matter how creatively the Chef works, the output consistently meets our restaurant's standards.
Why This is Necessary: The "Butterfly Effect" of Prompts
In Book 4, Chapter 11, we explored Prompt Engineering Strategies, learning how to craft the perfect instruction to guide the LLM. We learned that a single word change can drastically alter the output.
Unit testing is the safety net for this fragility. We need it for three critical reasons:
- Regression Prevention: When you decide to switch your model from a generic 7B parameter model to a more specialized 13B fine-tuned model, or even just update the system prompt, you need to know if your application's behavior has unexpectedly drifted. Without tests, you are flying blind.
- Cost and Latency Management: As we move to the edge using tools like Ollama and WebGPU, we are often constrained by hardware. Running an LLM is expensive. If a change in your prompt logic causes the model to output verbose, rambling text instead of concise answers, your token costs will explode and latency will spike. Tests catch this immediately.
- Behavioral Guardrails: LLMs can be unpredictable. A test suite ensures that the model never outputs harmful content, violates formatting constraints, or fails to adhere to the logical rules of your application.
The Testing Pyramid for LLMs
Just as in traditional web development, we don't test everything with the most expensive method. We use a pyramid approach.
1. Deterministic Assertions (The Base)
This is the fastest and cheapest layer. We treat the LLM output as a string and apply standard software logic.
- Regex: Does the output match a specific pattern (e.g., an email address, a date format)?
- Keyword Inclusion/Exclusion: Must the output contain the word "success"? Must it not contain the word "error"?
- Length Constraints: Is the summary under 100 words?
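The checks above can be sketched with nothing but native string and regex APIs. The function name and result shape below are illustrative, not from any testing library:

```typescript
// A sketch of the deterministic assertion layer using only native APIs.

interface DeterministicResult {
  matchesPattern: boolean;    // Regex check
  containsRequired: boolean;  // Keyword inclusion
  excludesForbidden: boolean; // Keyword exclusion
  withinLength: boolean;      // Word-count constraint
}

function runDeterministicChecks(
  output: string,
  pattern: RegExp,
  required: string[],
  forbidden: string[],
  maxWords: number
): DeterministicResult {
  const wordCount = output.trim().split(/\s+/).length;
  return {
    matchesPattern: pattern.test(output),
    containsRequired: required.every(word => output.includes(word)),
    excludesForbidden: forbidden.every(word => !output.includes(word)),
    withinLength: wordCount <= maxWords,
  };
}

// Example: validate a hypothetical completion for the "capital of France" prompt.
const checkResult = runDeterministicChecks(
  "The capital of France is Paris.",
  /Paris/,    // must match this pattern
  ["Paris"],  // must include these keywords
  ["error"],  // must not include these keywords
  100         // must stay under 100 words
);
```

Because this layer is just string logic, it runs in microseconds and can gate every single CI run without touching a model.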
2. Semantic Similarity (The Middle)
This is where we validate the intent and meaning rather than the exact characters. This requires understanding from Book 2, Chapter 4, where we discussed Embeddings.
- How it works: We convert both the expected output (the "golden" response) and the actual LLM output into vector embeddings.
- The Metric: We calculate the Cosine Similarity between these two vectors. A score of 1.0 means they are identical in meaning. A score of 0.95 is likely a pass. A score of 0.6 indicates a significant drift in meaning.
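As a sketch of this layer, assuming the embedding vectors are already available (in production they would come from an embedding model), the metric and a pass threshold look like this. The tiny hand-written vectors and the 0.9 cut-off are purely illustrative:

```typescript
// Cosine similarity between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vector dimensions must match");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const SIMILARITY_THRESHOLD = 0.9; // pass/fail cut-off, tuned per use case

const goldenEmbedding = [0.2, 0.8, 0.1];    // embedding of the "golden" response
const actualEmbedding = [0.25, 0.75, 0.12]; // embedding of the actual LLM output
const score = cosineSimilarity(goldenEmbedding, actualEmbedding);
const semanticPass = score >= SIMILARITY_THRESHOLD;
```

The threshold is the knob you will tune most: too high and harmless rephrasings fail the build; too low and genuine meaning drift slips through.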
3. LLM-as-a-Judge (The Peak)
For complex, subjective tasks (e.g., "Is this summary coherent?", "Is this code efficient?"), we can use another, more powerful LLM to evaluate the output of the application LLM. This is the "Recursive Critic" pattern.
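A minimal sketch of this pattern, with the judge call mocked so the example is self-contained; the rubric, threshold, and names (buildJudgePrompt, judgeLLM, evaluateWithJudge) are illustrative assumptions, not a library API:

```typescript
// LLM-as-a-Judge: a stronger model scores the application model's output.

interface JudgeVerdict {
  score: number;     // 1-5 rubric score assigned by the judge model
  reasoning: string; // the judge's explanation
}

function buildJudgePrompt(task: string, candidateOutput: string): string {
  return [
    "You are an impartial evaluator.",
    `Task given to the application model: ${task}`,
    `Candidate output: ${candidateOutput}`,
    'Rate coherence from 1 to 5. Reply ONLY as JSON: {"score": n, "reasoning": "..."}',
  ].join("\n");
}

// Mocked judge; a real implementation would call a stronger model here.
async function judgeLLM(_prompt: string): Promise<string> {
  return JSON.stringify({ score: 4, reasoning: "Coherent and on-topic." });
}

async function evaluateWithJudge(task: string, candidateOutput: string): Promise<boolean> {
  const raw = await judgeLLM(buildJudgePrompt(task, candidateOutput));
  const verdict = JSON.parse(raw) as JudgeVerdict;
  return verdict.score >= 3; // the passing threshold is a project-level choice
}
```

Because each judgment costs a full inference call, reserve this layer for the few assertions that regex and embeddings genuinely cannot express.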
Visualizing the Test Execution Flow
The following diagram illustrates how a single prompt test flows through our system, comparing the LLM's raw output against our various assertion layers.
The Role of the Edge Runtime and useCompletion
In the context of Book 5, our focus is on the Edge. We want these tests to run fast. We don't want to spin up a heavy server for every test run.
This is where the Edge Runtime becomes critical. It provides a lightweight environment (like V8 Isolates) that starts instantly. When we integrate our testing framework into a CI/CD pipeline (like GitHub Actions), we can execute these tests in an environment that mimics our production edge deployment.
Furthermore, the testing patterns we establish here directly influence our frontend architecture. When we build our UI, we will use hooks like useCompletion from the Vercel AI SDK. This hook is optimized for the exact type of non-conversational generation we are testing here (summarization, classification, etc.).
By building a robust testing suite, we ensure that the backend logic (the prompt and model) is solid, so when we connect it to useCompletion on the frontend, we are confident in the data stream we receive.
The ReAct Pattern and Iterative Testing
Finally, we must consider that not all LLM interactions are simple, one-shot generations. As we build more advanced agents, we will encounter the Thought-Action-Observation Triple (from Book 4, Chapter 12).
Testing an agent is fundamentally different. You are not testing a single output; you are testing a loop. A unit test for a ReAct agent must simulate an entire cycle:
- Assert the Thought: Is the reasoning step logically sound?
- Assert the Action: Did the agent choose the correct tool with the correct arguments?
- Simulate the Observation: Feed a mock result back into the agent.
This requires a stateful test runner that can manage the conversation context, something we will delve into deeper in the next section. For now, understand that the principles of deterministic and semantic validation are the atomic tools we use to validate every step of these complex agent loops.
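The three assertions above can be sketched for a single step, assuming the agent emits plain text in a "Thought: / Action: / Action Input:" format; the AgentStep shape and helper names are illustrative, not a specific framework's API:

```typescript
// Asserting one Thought-Action step of a ReAct-style agent.

interface AgentStep {
  thought: string;
  action: { tool: string; args: Record<string, unknown> };
}

// Parse a ReAct-style text block into a structured step.
function parseStep(raw: string): AgentStep {
  const thought = raw.match(/Thought:\s*(.*)/)?.[1] ?? "";
  const tool = raw.match(/Action:\s*(\w+)/)?.[1] ?? "";
  const argsText = raw.match(/Action Input:\s*(\{.*\})/)?.[1] ?? "{}";
  return { thought, action: { tool, args: JSON.parse(argsText) } };
}

// Assert the Thought is non-empty and the Action picked the expected tool.
function assertStep(step: AgentStep, expectedTool: string): boolean {
  return step.thought.length > 0 && step.action.tool === expectedTool;
}

const rawStep = [
  "Thought: I need the current weather before answering.",
  "Action: getWeather",
  'Action Input: {"city": "Paris"}',
].join("\n");

const step = parseStep(rawStep);
const stepPass = assertStep(step, "getWeather");
```

A full agent test would run this assertion once per loop iteration, feeding a mocked Observation back in between steps.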
Basic Code Example
In a SaaS application, you might have a feature that generates a "friendly" summary of a user's support ticket. The prompt is critical: if it's too vague, the model might hallucinate details; if it's too strict, the output might sound robotic. Unit testing this prompt ensures that the logic remains consistent even if you switch models (e.g., from GPT-4 to a local Ollama model) or tweak the system instructions.
Below is a self-contained TypeScript example. It simulates a testing environment where we mock the LLM response (using a deterministic function for this example) to demonstrate how to validate the output structure and semantic intent. In a real CI/CD pipeline, you would replace the mock with a call to ollama run or an API wrapper.
/**
* @fileoverview A basic unit test example for validating LLM prompt outputs.
* This simulates a SaaS feature where we generate a summary for a support ticket.
*
* Dependencies: None (Native Node.js APIs used for demonstration).
* In production, you would import an LLM client (e.g., `ollama-js` or `openai`).
*/
// ==========================================
// 1. Types & Interfaces
// ==========================================
/**
* Represents the input data for our prompt.
*/
interface SupportTicket {
id: string;
subject: string;
description: string;
priority: 'low' | 'medium' | 'high';
}
/**
* Represents the expected structured output from the LLM.
* We enforce this schema to prevent hallucinations.
*/
interface TicketSummary {
summary: string;
sentiment: 'positive' | 'neutral' | 'negative';
suggestedAction: string;
}
// ==========================================
// 2. The Prompt Logic (The "System")
// ==========================================
/**
* Generates a system prompt for the LLM based on the ticket data.
*
* @param ticket - The support ticket object.
* @returns A formatted string prompt.
*/
function createPrompt(ticket: SupportTicket): string {
return `
You are a helpful support assistant.
Analyze the following support ticket and provide a JSON summary.
Ticket ID: ${ticket.id}
Subject: ${ticket.subject}
Description: ${ticket.description}
Priority: ${ticket.priority}
Respond ONLY with valid JSON in the following format:
{
"summary": "A brief 1-sentence summary of the issue.",
"sentiment": "positive" | "neutral" | "negative",
"suggestedAction": "One specific action to take."
}
`.trim();
}
/**
* Mocks the LLM call.
* In a real app, this would be `await ollama.generate({ model: 'llama2', prompt: ... })`
*
* For this deterministic test, we return a fixed response that matches our expected schema.
*
* @param prompt - The generated prompt string.
* @returns A promise resolving to a JSON string.
*/
async function callLLM(prompt: string): Promise<string> {
// Simulate network latency
await new Promise(resolve => setTimeout(resolve, 100));
// In a real test, this might return different variations based on the input.
// Here, we return a "happy path" response.
const mockResponse = {
summary: "User is experiencing login failures due to expired credentials.",
sentiment: "negative",
suggestedAction: "Send a password reset link immediately."
};
return JSON.stringify(mockResponse);
}
// ==========================================
// 3. The Unit Test Logic
// ==========================================
/**
* Executes the prompt generation and validates the LLM output.
* This function represents the core logic of our unit test.
*
* @param ticket - The input ticket data.
* @returns An object containing validation results and the parsed output.
*/
async function runPromptTest(ticket: SupportTicket) {
// Step 1: Generate the prompt
const prompt = createPrompt(ticket);
// Step 2: Call the LLM (Mocked for this example)
const rawOutput = await callLLM(prompt);
// Step 3: Parse and Validate
let parsedOutput: TicketSummary | undefined;
let isValidJSON = false;
let hasRequiredFields = false;
let semanticCheckPass = false;
try {
// A. Syntax Validation: Check if output is valid JSON
const candidate = JSON.parse(rawOutput) as TicketSummary;
parsedOutput = candidate;
isValidJSON = true;
// B. Schema Validation: Check if required fields exist
const requiredKeys = ['summary', 'sentiment', 'suggestedAction'];
hasRequiredFields = requiredKeys.every(key => key in candidate);
if (hasRequiredFields) {
// C. Semantic Validation: Check if sentiment matches the priority context
// (Simple heuristic: High priority tickets usually imply negative sentiment)
if (ticket.priority === 'high' && parsedOutput.sentiment === 'negative') {
semanticCheckPass = true;
} else if (ticket.priority !== 'high') {
// For low/medium, any sentiment is acceptable in this simple logic
semanticCheckPass = true;
}
}
} catch (error) {
// JSON parsing failed
isValidJSON = false;
}
return {
input: ticket,
output: parsedOutput,
validation: {
isValidJSON,
hasRequiredFields,
semanticCheckPass,
overallPass: isValidJSON && hasRequiredFields && semanticCheckPass
}
};
}
// ==========================================
// 4. Execution & Assertions (Simulating a Test Runner)
// ==========================================
/**
* Main entry point to run the test suite.
*/
async function main() {
console.log("🧪 Running Unit Test: Support Ticket Summary\n");
// Test Case 1: High Priority Ticket (Expecting 'negative' sentiment)
const highPriorityTicket: SupportTicket = {
id: "T-1001",
subject: "Critical System Outage",
description: "The server is down and we cannot access the database.",
priority: "high"
};
const result1 = await runPromptTest(highPriorityTicket);
console.log("--- Test Case 1: High Priority ---");
console.log("Input:", JSON.stringify(result1.input, null, 2));
console.log("Output:", JSON.stringify(result1.output, null, 2));
console.log("Validation Result:", result1.validation);
// Assertion
if (!result1.validation.overallPass) {
console.error("❌ Test Case 1 FAILED");
process.exit(1); // Fail the CI/CD pipeline
} else {
console.log("✅ Test Case 1 PASSED\n");
}
// Test Case 2: Low Priority Ticket (Expecting structure adherence)
const lowPriorityTicket: SupportTicket = {
id: "T-1002",
subject: "Feature Request",
description: "It would be nice to have dark mode.",
priority: "low"
};
// To test failure, we simulate a bad response (e.g., malformed JSON).
// We temporarily reassign the module-level `callLLM` binding to a broken mock.
const originalCallLLM = callLLM;
callLLM = async () => "{ 'summary': 'Bad JSON', "; // Intentional syntax error
const result2 = await runPromptTest(lowPriorityTicket);
callLLM = originalCallLLM; // Restore the real implementation
console.log("--- Test Case 2: Invalid JSON Handling ---");
console.log("Validation Result:", result2.validation);
if (result2.validation.overallPass) {
console.error("❌ Test Case 2 FAILED (Should have caught bad JSON)");
process.exit(1);
} else {
console.log("✅ Test Case 2 PASSED (Correctly detected invalid output)\n");
}
console.log("🎉 All tests completed successfully.");
}
// Run the main function
if (require.main === module) {
main().catch(err => { console.error(err); process.exit(1); });
}
Line-by-Line Explanation
1. Types & Interfaces
- SupportTicket: Defines the data structure entering our system. In a real SaaS app, this data is fetched via Data Fetching in SCs (Server Components) before the AI logic runs.
- TicketSummary: Defines the strict schema we expect the LLM to return. This is the first line of defense against hallucinations. By defining this interface, TypeScript ensures that downstream code (e.g., a UI component displaying the summary) won't crash if the LLM omits a field.
2. The Prompt Logic
- createPrompt: This function constructs the "System" instruction. Notice the explicit instruction: "Respond ONLY with valid JSON...". This is a prompt engineering technique to force the model to adhere to a structure.
- callLLM: This is a mock implementation of an LLM client (like ollama-js). In a real environment, this function handles the network request to your local Ollama instance or a cloud provider. We wrap it in a setTimeout to simulate the asynchronous nature of LLM inference.
3. The Unit Test Logic (runPromptTest)
This is the core of the testing framework. It breaks down validation into three distinct layers:
1. Syntax Validation (JSON.parse): LLMs are probabilistic. Even with strict instructions, they might output trailing commas or hallucinate text outside the JSON block. Wrapping parsing in a try/catch block is essential.
2. Schema Validation: We check if the parsed object contains the keys defined in TicketSummary. This ensures the LLM didn't just return a string but actually followed the structural requirements.
3. Semantic Validation: This is a "deterministic assertion" about the content. We check if the sentiment aligns with the priority. While simple, this prevents the model from being "too cheerful" about a critical outage.
4. Execution & Assertions
- main: Acts as the test runner (like Jest or Mocha).
- Test Case 1: A "Happy Path" test. It verifies that the mock LLM returns the correct structure and that our semantic logic holds true.
- Test Case 2: A "Sad Path" test. We intentionally swap in a broken callLLM mock that returns malformed JSON. This verifies that our validation logic correctly identifies a failure and prevents the application from processing invalid data.
Visualizing the Data Flow
The following diagram illustrates the flow of data through the unit test, highlighting where validation occurs.
Common Pitfalls in JS/TS Unit Testing for LLMs
When implementing these tests in a CI/CD pipeline (e.g., GitHub Actions), watch out for these specific issues:
1. Non-Deterministic JSON Parsing
   - The Issue: LLMs often output Markdown code blocks (e.g., ```json ... ```) alongside the raw JSON. Standard JSON.parse() will fail if the string contains these markers.
   - The Fix: Use a regex extractor before parsing: const jsonString = rawOutput.match(/```json([\s\S]*?)```/)?.[1] || rawOutput;
2. Vercel/Serverless Timeouts
   - The Issue: If you run these tests against a local Ollama instance, the initial "cold start" or model loading time can exceed the default timeout of serverless functions (usually 10s).
   - The Fix: Increase the timeout setting in your CI runner or test configuration. For local testing, ensure the Ollama service is kept warm.
3. Async/Await Loops in Test Runners
   - The Issue: When running multiple tests sequentially, improper handling of async/await can cause the test runner to exit before all assertions are made, or errors might be swallowed.
   - The Fix: Always use async/await in your test cases and ensure the main test runner function awaits all sub-tests. Use process.exit(1) explicitly on failure to signal CI/CD failure.
4. Token Drift in CI Environments
   - The Issue: Running tests locally with WebGPU might produce slightly different tokenization than a CPU-only CI runner (e.g., GitHub Actions standard runners). This can lead to subtle changes in the output string length or content.
   - The Fix: Focus assertions on structure (JSON keys) and semantic meaning (sentiment analysis) rather than exact string matching. Avoid checking for specific word counts unless strictly necessary.
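The first pitfall's fix can be wrapped in a small reusable helper that also returns null instead of throwing on malformed output. This is a sketch, not a library API; the name extractJSON is illustrative:

```typescript
// Strip Markdown code fences (if present) before parsing LLM output as JSON.
function extractJSON<T>(rawOutput: string): T | null {
  // Prefer the fenced block if one exists, otherwise use the raw string.
  const fenced = rawOutput.match(/```(?:json)?([\s\S]*?)```/)?.[1] ?? rawOutput;
  try {
    return JSON.parse(fenced.trim()) as T;
  } catch {
    return null; // signal invalid output so the caller can fail the test
  }
}

const wrapped = '```json\n{"status": "ok"}\n```';
const parsed = extractJSON<{ status: string }>(wrapped);    // { status: "ok" }
const malformed = extractJSON('{ "summary": "Bad JSON", '); // null
```

Returning null rather than throwing keeps the validation flags in runPromptTest simple: a null result simply sets isValidJSON to false.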
The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.
Code License: All code examples are released under the MIT License. Github repo.
Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.