Chapter 19: Testing Async Code - Deterministic Testing for Non-Deterministic AI
Theoretical Foundations
The fundamental challenge in testing asynchronous AI pipelines stems from the collision between two worlds: the deterministic, sequential logic of traditional software engineering and the non-deterministic, probabilistic nature of Large Language Models (LLMs). When we build an AI application, we are essentially orchestrating a symphony of asynchronous operations—network calls to external APIs, concurrent processing of multiple data streams, and parallel execution of independent tasks. However, the outputs of these operations are governed by statistical sampling rather than deterministic algorithms. This creates a testing environment where traditional assertions fail, latency varies wildly, and race conditions are not just bugs but expected behaviors.
To understand this deeply, we must first dissect the nature of asynchronous operations in C# and how they differ from their synchronous counterparts. In a synchronous world, execution flows linearly. When you call a function, the thread blocks until the function returns. This predictability makes testing straightforward: given the same input, you expect the same output at the same time. However, in an asynchronous world, we utilize the async and await keywords to free up the calling thread while waiting for an operation to complete. This is managed internally by the C# compiler through a state machine that captures the context and resumes execution when the awaited task finishes.
Consider the analogy of a restaurant kitchen. In a synchronous kitchen, a chef prepares a dish step-by-step: chop vegetables, boil water, cook pasta, and plate it. The chef cannot start the next step until the current one is finished. This is easy to test—if you measure the time and quality of the dish, you get consistent results. In an asynchronous kitchen, the chef starts boiling water and, while waiting for it to boil, immediately starts chopping vegetables. When the water boils, they switch back to add the pasta. This is efficient but introduces variability. The water might boil faster or slower depending on the stove's heat, and the chef might be interrupted by another task. Testing this kitchen requires verifying that the pasta is cooked correctly eventually, regardless of the exact timing of the boiling water.
In AI applications, this variability is amplified. An LLM call is not a simple function that returns a value; it is a network request to a remote service that involves token generation, model inference, and potential rate limiting. Furthermore, the output is non-deterministic: even with the same input prompt, the model might generate different responses due to internal sampling parameters (like temperature). Therefore, testing an AI pipeline cannot rely on verifying exact string matches or fixed execution times.
The theoretical foundation of testing such systems lies in Deterministic Mocking and Structured Concurrency. Deterministic mocking is the practice of replacing non-deterministic dependencies (like LLM API calls) with deterministic substitutes. Instead of calling a real model that returns a random poem, we substitute it with a mock that returns a predefined poem. This isolates the business logic—the code that processes the poem—from the unpredictability of the model. Structured concurrency, a concept heavily emphasized in modern C# through Task.WhenAll and Task.WhenAny, ensures that we manage the lifecycle of asynchronous operations as a coherent unit. It prevents "fire-and-forget" patterns where tasks are lost in the system, leading to unobserved exceptions or resource leaks.
To visualize the flow of an asynchronous AI pipeline, consider the following diagram. It illustrates the separation between the deterministic business logic and the non-deterministic external calls.
The diagram highlights a critical architectural pattern: the Adapter Pattern. In Book 3, Chapter 15, we discussed how to design interfaces for AI services (e.g., IChatClient). This interface allows us to swap implementations. In testing, we inject a MockChatClient that implements IChatClient but returns deterministic data. This decoupling is vital. Without it, testing a pipeline that requires an LLM response would force us to actually call the API, making tests slow, expensive, and flaky.
Let's delve deeper into the concept of Non-Determinism. In C#, non-determinism usually arises from external factors: system time, random number generators, or I/O operations. In AI, it is intrinsic to the model's generation process. When we await a call to an LLM, we are suspending execution until a stream of tokens arrives. This stream might be received in chunks. The IAsyncEnumerable<T> interface, introduced in C# 8.0, is the cornerstone of handling such streams. It allows us to iterate over a sequence asynchronously, yielding items as they become available.
Testing IAsyncEnumerable requires a shift in mindset. We cannot simply check the return value of a method; we must consume the stream and verify the sequence of items produced. However, because the stream's length and timing are variable, we need Time-Bound Retries. This is a strategy where we poll the stream for a specific condition (e.g., "does the stream contain the word 'success'?") within a defined timeout window. If the condition is met within the window, the test passes; if the window expires, the test fails.
Consider the analogy of a package delivery tracker. You don't know exactly when the package will arrive, but you know it should arrive between 9 AM and 5 PM. You check the tracker periodically. If you see "Delivered" by 5 PM, the delivery was successful. If not, it failed. This is the essence of time-bound assertions in asynchronous testing.
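The tracker analogy translates directly into code. Below is a minimal sketch of a time-bound assertion over an async stream; the helper names (ContainsWithinAsync, MockStream) and the token values are illustrative, not part of any testing framework. It races the stream consumption against a Task.Delay representing the window.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class TimeBoundStreamAssert
{
    // Returns true if 'expected' appears in the stream within 'budget'.
    public static async Task<bool> ContainsWithinAsync(
        IAsyncEnumerable<string> stream, string expected, TimeSpan budget)
    {
        var consume = ConsumeAsync(stream, expected);
        var timeout = Task.Delay(budget);

        var winner = await Task.WhenAny(consume, timeout);
        return winner == consume && await consume; // false if the window expired
    }

    private static async Task<bool> ConsumeAsync(
        IAsyncEnumerable<string> stream, string expected)
    {
        await foreach (var token in stream)
            if (token.Contains(expected)) return true;
        return false; // Stream ended without meeting the condition.
    }

    // A deterministic mock stream for demonstration.
    private static async IAsyncEnumerable<string> MockStream()
    {
        yield return "processing";
        await Task.Yield(); // Preserve asynchrony without real latency.
        yield return "success";
    }

    public static async Task Main()
    {
        bool ok = await ContainsWithinAsync(
            MockStream(), "success", TimeSpan.FromSeconds(1));
        Console.WriteLine(ok); // True
    }
}
```

Because the mock yields immediately, the test completes in microseconds while still exercising the same timeout path a real stream would.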
Another critical concept is Structured Concurrency using Task.WhenAll. When an AI application needs to perform multiple independent operations—such as summarizing a document and extracting keywords simultaneously—we use Task.WhenAll to await the completion of all tasks. This ensures that the application doesn't proceed until all background work is finished.
However, this introduces a complexity in testing: Partial Failures. What if the summary task succeeds but the keyword extraction fails? In a synchronous world, an exception would halt execution. In an asynchronous world, Task.WhenAll aggregates exceptions: the combined task faults only after every task has completed (or failed), its Exception property holds an AggregateException containing all of the failures, and awaiting it rethrows only the first one. Testing this requires verifying that the system handles these aggregated exceptions gracefully, perhaps by logging the failure of one task while utilizing the result of the successful one.
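To make partial-failure handling concrete, here is a small self-contained sketch (the two task names are illustrative): after awaiting Task.WhenAll inside a try/catch, we inspect each task individually, using the successful result while logging the failed one.

```csharp
using System;
using System.Threading.Tasks;

public static class PartialFailureDemo
{
    public static async Task Main()
    {
        // Two hypothetical pipeline stages: one succeeds, one fails.
        Task<string> summaryTask = Task.FromResult("A short summary.");
        Task<string> keywordsTask = Task.FromException<string>(
            new InvalidOperationException("Keyword extraction failed."));

        try
        {
            await Task.WhenAll(summaryTask, keywordsTask);
        }
        catch (Exception)
        {
            // Awaiting Task.WhenAll rethrows only the FIRST exception.
            // Inspect each task to handle partial failure gracefully.
        }

        if (summaryTask.IsCompletedSuccessfully)
            Console.WriteLine($"Usable result: {summaryTask.Result}");

        if (keywordsTask.IsFaulted)
            Console.WriteLine(
                $"Logged failure: {keywordsTask.Exception!.InnerException!.Message}");
    }
}
```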
To manage this, we often use Cancellation Tokens (CancellationToken). In testing, we can simulate a timeout by passing a token that cancels after a specific duration. This allows us to test how the pipeline behaves when an LLM call takes too long. Does the system clean up resources? Does it return a fallback response? This is crucial for building resilient AI applications.
Let's visualize the state machine of an asynchronous task in the context of testing. The task transitions through various states: Created, WaitingForActivation, Running, RanToCompletion, Faulted, or Canceled. A test must assert that the task ends in the expected state.
The theoretical underpinning of testing these states relies on the Task API. We can inspect the Task.Status property to verify the state. However, in unit tests, we rarely interact with the raw Task object directly; instead, we use helper methods provided by testing frameworks (like xUnit or NUnit) that understand async methods.
For example, a test method marked with async Task will automatically await the completion of the task being tested. If the task throws an exception, the test framework catches it and marks the test as failed. This is the standard pattern. However, when testing timeouts, we need to be more explicit. We might wrap the awaited task in a Task.WhenAny with a Task.Delay representing the timeout.
// Conceptual example of a timeout wrapper
var timeoutTask = Task.Delay(TimeSpan.FromSeconds(5));
var operationTask = _mockLlm.SummarizeAsync(text);
var completedTask = await Task.WhenAny(operationTask, timeoutTask);
if (completedTask == timeoutTask)
{
throw new TimeoutException("LLM call took too long.");
}
// If we are here, operationTask completed first.
var result = await operationTask;
This pattern is fundamental to testing streaming responses. When we consume an IAsyncEnumerable, we might want to ensure that the first item arrives within a specific time, or that the entire stream finishes within a total budget.
Now, let's discuss Deterministic Mocking in the context of IAsyncEnumerable. A mock of an LLM that streams tokens needs to simulate the delay between tokens. In a production environment, network latency causes these delays. In a test, we want to verify the logic that processes the stream, not the network latency. Therefore, we create a mock that yields items immediately. This makes the test run instantly and deterministically.
However, there is a nuance: Race Conditions. If the consumer of the stream processes items faster than the mock yields them, or vice versa, the behavior might differ. By using a deterministic mock that yields items at a predictable rate (even if zero delay), we can verify the logic's correctness regarding the sequence of items.
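Such a deterministic streaming mock can be written in a few lines; the class name and the fixed token sequence below are illustrative. It is an async iterator that yields a predefined sequence, using await Task.Yield() to preserve genuine asynchrony without real latency.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Threading.Tasks;

public static class MockLlmStreamDemo
{
    // Deterministic mock: always yields the same tokens, instantly.
    public static async IAsyncEnumerable<string> StreamCompletionAsync()
    {
        foreach (var token in new[] { "The", " answer", " is", " 42." })
        {
            await Task.Yield(); // Force an asynchronous continuation.
            yield return token;
        }
    }

    public static async Task Main()
    {
        var sb = new StringBuilder();
        await foreach (var token in StreamCompletionAsync())
            sb.Append(token);

        Console.WriteLine(sb.ToString()); // "The answer is 42."
    }
}
```

The consumer logic (here, token aggregation) is exercised exactly as in production, but the test is instant and repeatable.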
Consider a scenario where we are building a chatbot that maintains a conversation history. The history is updated asynchronously as the LLM generates a response. This is a classic race condition scenario. If two requests come in simultaneously, they might read the history, append their own context, and write back, resulting in a lost update.
In Book 3, we covered Concurrent Collections (like ConcurrentDictionary). In the context of AI pipelines, we often use Channel<T> or BlockingCollection<T> to handle producer-consumer scenarios. Testing these requires simulating high concurrency. We might spawn multiple Task instances that write to a channel simultaneously and verify that the consumer receives all items without loss.
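A minimal concurrency test for a channel might look like the following sketch (producer counts are illustrative): ten tasks write to the same channel concurrently, and the consumer verifies that no item is lost.

```csharp
using System;
using System.Linq;
using System.Threading.Channels;
using System.Threading.Tasks;

public static class ChannelConcurrencyDemo
{
    public static async Task Main()
    {
        var channel = Channel.CreateUnbounded<int>();

        // Spawn 10 concurrent producers, each writing 100 items.
        var producers = Enumerable.Range(0, 10)
            .Select(_ => Task.Run(async () =>
            {
                for (int i = 0; i < 100; i++)
                    await channel.Writer.WriteAsync(i);
            }))
            .ToArray();

        // Unbounded writes never block, so we can await the producers
        // before draining, then signal that no more items will arrive.
        await Task.WhenAll(producers);
        channel.Writer.Complete();

        int count = 0;
        await foreach (var _ in channel.Reader.ReadAllAsync())
            count++;

        Console.WriteLine(count); // 1000: no items lost under concurrency.
    }
}
```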
The theoretical solution to race conditions in async code is Locking or Atomic Operations. However, in asynchronous code, traditional locks (lock keyword) can cause deadlocks if not used carefully, especially in UI applications or when mixing synchronous and asynchronous code. The modern C# approach uses SemaphoreSlim with WaitAsync(), which allows asynchronous waiting for a resource.
Testing code that uses SemaphoreSlim involves verifying that the semaphore is released correctly, even in the face of exceptions. This is where the try-finally block becomes critical. A test case must simulate an exception during the resource usage and verify that the semaphore count returns to its original state.
Let's look at the analogy of a nightclub with a bouncer. The bouncer (Semaphore) allows a fixed number of people (threads) inside. If the club is full, people wait outside. Testing this involves sending a burst of people (tasks) and verifying that exactly the allowed number enters, and that when someone leaves (releases the semaphore), the next person in line enters. If a person faints inside (exception), the bouncer must still allow the next person in (via finally).
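The bouncer scenario can be sketched as a small test (method and field names are illustrative): we simulate an exception while holding the semaphore and then verify that the count was restored by the finally block, so the next caller is not locked out.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class SemaphoreReleaseDemo
{
    private static readonly SemaphoreSlim Gate = new(1, 1);

    public static async Task UseResourceAsync(bool fail)
    {
        await Gate.WaitAsync(); // Asynchronous wait: no blocked threads.
        try
        {
            if (fail) throw new InvalidOperationException("Simulated failure.");
        }
        finally
        {
            Gate.Release(); // Always restore the count, even on exception.
        }
    }

    public static async Task Main()
    {
        try { await UseResourceAsync(fail: true); }
        catch (InvalidOperationException) { /* expected */ }

        // The count is back to 1, so a second call does not deadlock.
        await UseResourceAsync(fail: false);
        Console.WriteLine(Gate.CurrentCount); // 1
    }
}
```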
In AI pipelines, we often use Polly for resilience (retry, circuit breaker patterns). Testing these policies requires simulating failures. We mock the LLM to throw an exception or return a specific error code. Then we verify that the Polly policy retries the correct number of times before giving up.
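Assuming Polly's classic Policy API (v7-style), such a retry test might be sketched as follows; the retry count and exception type are illustrative. We count how many times the mocked call is attempted and assert that the policy exhausted exactly its configured retries before giving up.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Polly; // Requires NuGet: Polly (classic Policy API)

public static class RetryPolicyDemo
{
    public static async Task Main()
    {
        int attempts = 0;

        var policy = Policy
            .Handle<HttpRequestException>()
            .RetryAsync(3); // 3 retries after the initial attempt.

        try
        {
            await policy.ExecuteAsync(() =>
            {
                attempts++;
                // The mocked LLM always fails, forcing every retry to fire.
                throw new HttpRequestException("Simulated outage.");
            });
        }
        catch (HttpRequestException)
        {
            // Expected: the policy gave up after exhausting its retries.
        }

        Console.WriteLine(attempts); // 4 = 1 initial attempt + 3 retries
    }
}
```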
This leads us to the concept of Flakiness. A test is flaky if it passes sometimes and fails others without code changes. In async AI testing, flakiness arises from:
- Timing: Assuming a task finishes in X milliseconds.
- Non-Determinism: Assuming the LLM output is fixed.
- Resource Contention: Tests running in parallel fighting for CPU or I/O.
To combat flakiness, we must design tests that are Deterministic. This means:
- Mocking all external dependencies.
- Using Task.Yield() to control scheduling. Task.Yield() forces the continuation to be scheduled asynchronously (onto the current synchronization context, or the thread pool when there is none), allowing other tasks to run; note that await Task.Delay(0) completes synchronously and does not yield. This is useful for simulating context switches in tests.
- Avoiding Thread.Sleep(). In tests, never use blocking sleeps. Use await Task.Delay() to keep the test thread free.
Let's discuss the Event Loop. In C#, the synchronization context (like the UI thread or ASP.NET request context) maintains an event loop. When you await a task, the continuation is posted to this loop. In unit tests, there is usually no synchronization context (unless testing UI code), so continuations run on the thread pool. However, when testing async code that interacts with UI elements or specific frameworks, we must be aware of the context.
For example, in a Blazor application, updating a component must happen on the UI thread. If an AI pipeline finishes on a background thread, we must marshal the call back to the UI thread. Testing this requires a test harness that simulates the synchronization context.
The theoretical foundation of Event-Loop Aware Assertions suggests that assertions should be made on the correct thread. In xUnit, tests run on a thread pool thread, so there is no synchronization context by default. However, if we are testing code that relies on SynchronizationContext.Current (like legacy ASP.NET or UI apps), we might need to capture and restore it.
Finally, let's synthesize these concepts into a cohesive testing strategy for an AI pipeline.
- Isolate the LLM Call: Create an interface IAsyncLlmClient.
- Implement a Deterministic Mock: Create MockLlmClient that implements IAsyncLlmClient. It should return predefined strings or throw predefined exceptions.
- Test the Business Logic: Inject the mock into the service under test. Assert that the service transforms the LLM output correctly.
- Test Concurrency: Use Task.WhenAll to spawn multiple calls to the service. Verify thread safety and data integrity.
- Test Streaming: Implement IAsyncEnumerable<string> MockStream() in the mock. Verify that the consumer aggregates the stream correctly.
- Test Timeouts: Use Task.WhenAny with a Task.Delay to ensure the system doesn't hang.
This approach ensures that we are testing the logic of our application, not the reliability of the network or the stability of the LLM provider. By mocking the non-deterministic parts, we gain confidence that our code handles the responses correctly, regardless of how they are delivered.
The shift from synchronous to asynchronous testing requires a deep understanding of the Task Parallel Library (TPL) and the async/await state machine. It moves the focus from "what is the result?" to "what is the behavior over time?". In AI applications, where the "result" is often probabilistic, this behavioral testing is the only reliable way to ensure software quality.
The theoretical foundation of testing asynchronous AI pipelines rests on the principle of Deterministic Isolation. In the realm of software engineering, determinism is the bedrock of reliability; a function is deterministic if, given the same input, it always produces the same output and exhibits the same side effects. However, when we introduce asynchronous operations—specifically those interacting with non-deterministic external systems like Large Language Models (LLMs)—we shatter this predictability. The challenge is not merely that the code runs asynchronously, but that the results of those asynchronous operations are variable in both content and timing.
To understand this, we must first dissect the nature of asynchronous execution in C#. When we mark a method with the async modifier, we are instructing the compiler to transform that method into a state machine. This state machine captures the local variables and the execution context, allowing the method to suspend its execution at await points and resume later, typically when the awaited task completes. This suspension and resumption are managed by the SynchronizationContext or the TaskScheduler. In a typical application (like ASP.NET Core), there is no synchronization context, and tasks are scheduled on the thread pool. In a UI application (like WPF or Blazor), the context captures the UI thread, ensuring that continuations update the user interface safely.
The non-determinism of AI pipelines arises from two distinct sources: Latency Variance and Output Stochasticity.
Latency Variance is the time it takes for the LLM to respond. This is influenced by network conditions, server load, and the complexity of the prompt. In a synchronous world, we might block a thread waiting for this response. In an asynchronous world, we release the thread back to the pool, allowing the application to remain responsive. However, when testing, we cannot rely on wall-clock time. A test that waits for a real LLM API call might pass in 500ms on a fast connection but timeout on a slower one, leading to flaky tests.
Output Stochasticity is the probabilistic nature of LLM generation. Even with a fixed temperature (randomness parameter) and seed, some models still produce slight variations, or the underlying infrastructure might introduce subtle differences. If a test asserts that an AI pipeline returns the exact string "The capital of France is Paris.", but the model returns "The capital of France is Paris!", the test fails despite the business logic being correct.
To navigate this, we employ Deterministic Mocking. This concept was briefly touched upon in Book 3, Chapter 12, regarding Dependency Injection, where we decoupled the business logic from concrete implementations. In the context of AI, we define an interface, say IAsyncLlmClient, which exposes methods like Task<string> GetCompletionAsync(string prompt) or IAsyncEnumerable<string> StreamCompletionAsync(string prompt). In production, we inject a concrete implementation that calls the actual API. In testing, we inject a mock implementation that returns predefined data. This isolates the unit of work—the code that processes the LLM response—from the unpredictability of the network and the model.
Consider the analogy of a Restaurant Kitchen. The Chef (Business Logic) prepares a dish based on ingredients provided by a Supplier (LLM). In a real scenario, the Supplier might be late, or the ingredients might vary slightly in quality (non-determinism). Testing the Chef's skill by waiting for the real Supplier is inefficient and unreliable. Instead, we use a Mock Supplier who delivers a specific, pre-chosen set of ingredients exactly at a scheduled time. This allows us to verify that the Chef can turn those specific ingredients into the correct dish every single time.
However, mocking is not enough. We must also handle the Concurrency of the operations. In AI pipelines, we often perform multiple tasks simultaneously. For example, a user might ask a question that requires the AI to search a database, summarize three different documents, and generate an answer. These are independent operations and should be executed in parallel to minimize latency.
In C#, parallelism is achieved using Task.WhenAll. This method takes an array of tasks and returns a single task that completes when all the input tasks have completed. It is crucial to understand that Task.WhenAll does not block the thread; it yields control back to the event loop while waiting.
// Conceptual example of parallel execution
var task1 = SummarizeDocumentAsync(doc1);
var task2 = SummarizeDocumentAsync(doc2);
var task3 = SummarizeDocumentAsync(doc3);
await Task.WhenAll(task1, task2, task3);
// Safe to read .Result here: WhenAll guarantees all tasks have completed.
var summary1 = task1.Result;
var summary2 = task2.Result;
var summary3 = task3.Result;
Testing parallel code introduces the risk of Race Conditions. If two tasks modify a shared resource (like a list or a counter) without proper synchronization, the result is unpredictable. In the context of AI, imagine a pipeline that aggregates tokens from a streaming response into a shared StringBuilder. If multiple streams write to the same builder concurrently, the data will be corrupted.
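The corruption risk can be demonstrated, and remedied, in a few lines. The sketch below uses a plain lock around a shared StringBuilder as one simple fix (the task and iteration counts are illustrative); removing the lock makes the final length unpredictable.

```csharp
using System;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

public static class RaceConditionDemo
{
    public static async Task Main()
    {
        // StringBuilder is not thread-safe: concurrent Appends from
        // parallel "streams" can corrupt its internal state.
        var shared = new StringBuilder();
        var gate = new object();

        var writers = Enumerable.Range(0, 8).Select(_ => Task.Run(() =>
        {
            for (int j = 0; j < 1000; j++)
            {
                lock (gate) // Remove this lock to observe data loss.
                {
                    shared.Append('x');
                }
            }
        }));

        await Task.WhenAll(writers);
        Console.WriteLine(shared.Length); // 8000 with the lock in place.
    }
}
```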
To visualize the flow of a parallel AI pipeline, consider the following diagram. It shows how independent tasks are forked and then joined.
The "Aggregation" step in the diagram is critical. It often involves merging the results of the parallel tasks. In a deterministic test, we verify that the aggregation logic correctly combines the mocked outputs. For instance, if Task 1 returns "Summary A" and Task 2 returns "Summary B", the aggregation should produce a result containing both.
Moving from parallelism to Streaming, we encounter IAsyncEnumerable<T>. This interface, introduced in C# 8.0, is the standard for representing a stream of data that is produced asynchronously. When an LLM streams a response, it yields tokens one by one (or in chunks). The consumer of this stream (the UI or the next stage of the pipeline) processes these tokens as they arrive, rather than waiting for the entire response.
Testing an IAsyncEnumerable requires consuming the stream and verifying the sequence of items. However, because the stream is asynchronous, we cannot use a standard foreach loop. We must use await foreach.
// Conceptual consumption of a stream
await foreach (var token in llmClient.StreamCompletionAsync("Prompt"))
{
Console.Write(token);
}
The challenge here is Time-Bound Verification. In a real scenario, tokens arrive with delays. In a test, we want to verify that the stream yields the correct tokens, but we don't want to wait for real delays. We use a mock that yields tokens immediately. However, we must also test the resilience of the consumer. What happens if the stream is slow? What if it never ends?
This leads us to the concept of Cancellation Tokens (CancellationToken). In asynchronous programming, cancellation tokens allow a caller to request that an operation stop before it completes. In AI pipelines, we often attach a timeout to an LLM call to prevent the application from hanging indefinitely.
// Using a cancellation token for timeout
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(10));
try
{
var response = await llmClient.GetCompletionAsync(prompt, cts.Token);
}
catch (OperationCanceledException)
{
// Handle timeout
}
Testing code that uses cancellation tokens involves verifying that the operation stops gracefully when the token is cancelled. In a unit test, we can manually trigger the cancellation by calling cts.Cancel() and asserting that the operation throws OperationCanceledException or returns a specific fallback value.
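A sketch of such a test follows (the SlowCompletionAsync method is hypothetical): instead of waiting for a real timeout, we cancel manually, which keeps the test fast and fully deterministic.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class CancellationTestDemo
{
    // A hypothetical operation that honors its cancellation token.
    public static async Task<string> SlowCompletionAsync(CancellationToken token)
    {
        await Task.Delay(TimeSpan.FromSeconds(30), token); // Throws on cancel.
        return "done";
    }

    public static async Task Main()
    {
        using var cts = new CancellationTokenSource();
        var operation = SlowCompletionAsync(cts.Token);

        cts.Cancel(); // Trigger cancellation deterministically: no waiting.

        try
        {
            await operation;
            Console.WriteLine("FAIL: operation was not cancelled.");
        }
        catch (OperationCanceledException)
        {
            Console.WriteLine("PASS: operation observed cancellation.");
        }
    }
}
```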
Let's synthesize these concepts into a cohesive testing strategy. We will focus on three pillars:
- Mocking the Non-Deterministic Layer: We create a DeterministicLlmClient. This mock implements IAsyncLlmClient. Instead of making network calls, it reads from a pre-defined set of inputs and outputs. For streaming, it uses yield return to simulate token arrival, optionally inserting await Task.Yield() to simulate the asynchronous context switch without actual delay (await Task.Delay(0) completes synchronously and does not yield).
- Testing Parallel Aggregation: We verify that when multiple tasks complete, the aggregation logic works. We use Task.WhenAll in the test itself to await the parallel operations. We assert that the combined result is correct. We also use ConcurrentExclusiveSchedulerPair or SemaphoreSlim in the production code if shared resources are involved, and verify their usage in the test.
- Testing Streaming with Time-Bound Assertions: We consume the IAsyncEnumerable stream in the test. We collect the tokens into a list and assert the content. To test timeouts, we wrap the stream consumption in a Task.WhenAny with a Task.Delay. If the delay completes first, the test fails (indicating the stream was too slow or hung).
A common pattern in modern C# AI applications is the Producer-Consumer pattern using System.Threading.Channels. A producer (the LLM call) writes to a channel, and a consumer (the UI or database writer) reads from it. This decouples the generation of tokens from their processing.
Testing channels requires verifying that data flows correctly from producer to consumer. We can create a bounded channel and write a mock stream to it. Then, we read from the channel in the test and assert the values.
Consider the analogy of an Assembly Line. The Producer is the worker adding parts to a conveyor belt (the Channel). The Consumer is the worker at the end of the line assembling the final product. Testing this involves checking that the parts arrive in the correct order and that the belt doesn't jam (backpressure). If the producer is too fast (simulated by an immediate mock), the consumer must be able to keep up, or the channel must signal backpressure.
In the context of AI, backpressure is vital. If an LLM generates tokens faster than the UI can render them, the application might become unresponsive. Using a bounded channel (with a limited capacity) allows the producer to pause when the buffer is full, applying natural backpressure.
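A bounded channel with Wait mode demonstrates this backpressure in miniature (the capacity, token values, and delay are illustrative): once the buffer holds two items, WriteAsync suspends the producer until the consumer catches up.

```csharp
using System;
using System.Threading.Channels;
using System.Threading.Tasks;

public static class BackpressureDemo
{
    public static async Task Main()
    {
        // Capacity 2: the producer must wait once the buffer is full.
        var channel = Channel.CreateBounded<string>(new BoundedChannelOptions(2)
        {
            FullMode = BoundedChannelFullMode.Wait
        });

        var producer = Task.Run(async () =>
        {
            foreach (var token in new[] { "a", "b", "c", "d" })
                await channel.Writer.WriteAsync(token); // Suspends when full.
            channel.Writer.Complete();
        });

        // A deliberately slow consumer; the bounded buffer throttles
        // the producer instead of letting memory grow without limit.
        int consumed = 0;
        await foreach (var _ in channel.Reader.ReadAllAsync())
        {
            await Task.Delay(10);
            consumed++;
        }

        await producer;
        Console.WriteLine(consumed); // 4: all tokens delivered despite throttling.
    }
}
```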
Finally, we must address Event-Loop Awareness. In UI applications (Blazor, MAUI), the synchronization context ensures that updates to the UI happen on the UI thread. When we await an LLM call, the continuation might run on a background thread if we use ConfigureAwait(false), or it might return to the UI thread if we don't. In tests, there is usually no synchronization context, so all code runs on thread pool threads.
However, when testing UI logic, we need to simulate the synchronization context. Libraries like Xunit.StaFact or manual synchronization context mocks allow us to run tests on a specific thread (like the UI thread). This ensures that our assertions about UI state updates are valid.
To summarize the theoretical foundation:
- Isolate via Interfaces: Decouple business logic from LLM calls using IAsyncLlmClient.
- Mock for Determinism: Replace the real LLM with a mock that returns fixed data and simulates streaming via IAsyncEnumerable with yield return.
- Handle Concurrency: Use Task.WhenAll for parallel tasks and verify aggregation logic. Use synchronization primitives (SemaphoreSlim) for shared resources.
- Manage Time: Use CancellationToken for timeouts and Task.Delay for simulating or bounding latency. Avoid Thread.Sleep.
- Verify Streams: Consume IAsyncEnumerable in tests to verify sequence correctness, using await foreach.
- Respect the Event Loop: Be aware of synchronization contexts, especially in UI applications, and use appropriate testing harnesses to simulate them.
By adhering to these principles, we transform flaky, non-deterministic AI tests into reliable, fast, and deterministic unit tests. This allows us to refactor and extend our AI pipelines with confidence, knowing that the core business logic is verified independently of the volatile external dependencies.
Basic Code Example
Let's model a scenario where an AI-powered "News Summarizer" service needs to process a batch of articles. To ensure responsiveness, it processes articles in parallel. However, the underlying AI model is non-deterministic and occasionally fails or takes too long. We need to test this business logic without actually calling the AI or waiting for real network latency.
We will use Microsoft.Extensions.AI for abstraction and Moq for deterministic mocking.
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.AI; // Requires NuGet: Microsoft.Extensions.AI
using Microsoft.Extensions.Logging; // For ILogger<T> and LogLevel
using Moq; // Requires NuGet: Moq
// 1. THE PROBLEM CONTEXT
// Imagine a service that summarizes news articles. It calls an external AI model.
// The test must verify that:
// a) Successful summaries are collected.
// b) Failed articles are tracked as errors.
// c) The entire process respects a timeout (doesn't hang).
// We cannot rely on the real AI (too slow/non-deterministic), so we mock it.
public class NewsSummarizer
{
private readonly IChatClient _aiClient;
private readonly ILogger<NewsSummarizer> _logger;
public NewsSummarizer(IChatClient aiClient, ILogger<NewsSummarizer> logger)
{
_aiClient = aiClient;
_logger = logger;
}
// 2. THE METHOD UNDER TEST
// Uses Parallel.ForEachAsync to process articles concurrently.
// Uses a CancellationTokenSource to enforce a timeout.
public async Task<SummaryResult> SummarizeBatchAsync(
IEnumerable<string> articles,
TimeSpan timeout)
{
var results = new ConcurrentBag<string>();
var errors = new ConcurrentBag<string>();
// Create a timeout token. If the operation takes longer than 'timeout', it cancels.
using var cts = new CancellationTokenSource(timeout);
try
{
// Modern C# Parallelism: Process items asynchronously in parallel.
await Parallel.ForEachAsync(articles, new ParallelOptions
{
MaxDegreeOfParallelism = 3, // Limit concurrency to avoid overwhelming the mock/LLM
CancellationToken = cts.Token
}, async (article, token) =>
{
try
{
// Call the AI (which will be mocked in our test)
var response = await _aiClient.GetResponseAsync(
$"Summarize this news: {article}",
cancellationToken: token);
var summary = response.Text;
results.Add(summary);
}
catch (OperationCanceledException)
{
_logger.LogWarning("Processing timed out for an article.");
errors.Add("Timeout");
throw; // Re-throw to stop the parallel operation
}
catch (Exception ex)
{
_logger.LogError(ex, "Error summarizing article.");
errors.Add(ex.Message);
}
});
}
catch (OperationCanceledException)
{
_logger.LogError("Batch processing was cancelled due to timeout.");
}
return new SummaryResult(results.ToList(), errors.ToList());
}
}
// Simple DTO for the result
public record SummaryResult(List<string> Summaries, List<string> Errors);
// Dummy Logger implementation for the example to be runnable
public class ConsoleLogger<T> : ILogger<T>
{
public IDisposable BeginScope<TState>(TState state) => null!;
public bool IsEnabled(LogLevel logLevel) => true;
public void Log<TState>(LogLevel logLevel, EventId eventId, TState state, Exception exception, Func<TState, Exception, string> formatter)
{
// In a real test, we'd capture logs. Here we just print.
Console.WriteLine($"[{logLevel}] {formatter(state, exception)}");
}
}
// 3. THE TEST
public class DeterministicTest
{
public static async Task Main()
{
// --- SETUP ---
// Create the mock for the IChatClient
var mockClient = new Mock<IChatClient>();
// Define the behavior: "Deterministic Mocking"
// We configure the mock to return specific values based on specific inputs.
// This removes randomness.
// Note: GetResponseAsync(string, ...) is an extension method, so the mock
// must target the underlying interface method, which receives the messages.
mockClient
.Setup(c => c.GetResponseAsync(
It.Is<IEnumerable<ChatMessage>>(msgs => msgs.Any(m => m.Text.Contains("Article A"))),
It.IsAny<ChatOptions>(),
It.IsAny<CancellationToken>()))
.ReturnsAsync(new ChatResponse(new ChatMessage(ChatRole.Assistant, "Summary A")));
mockClient
.Setup(c => c.GetResponseAsync(
It.Is<IEnumerable<ChatMessage>>(msgs => msgs.Any(m => m.Text.Contains("Article B"))),
It.IsAny<ChatOptions>(),
It.IsAny<CancellationToken>()))
.ReturnsAsync(new ChatResponse(new ChatMessage(ChatRole.Assistant, "Summary B")));
// Setup a "slow" article to test our timeout logic
mockClient
.Setup(c => c.GetResponseAsync(
It.Is<IEnumerable<ChatMessage>>(msgs => msgs.Any(m => m.Text.Contains("Article C"))),
It.IsAny<ChatOptions>(),
It.IsAny<CancellationToken>()))
.Returns(async (IEnumerable<ChatMessage> msgs, ChatOptions? options, CancellationToken token) =>
{
// ReturnsAsync cannot take an async lambda, so we use Returns with one.
await Task.Delay(2000, token); // Simulate 2 seconds of latency; honors cancellation
return new ChatResponse(new ChatMessage(ChatRole.Assistant, "Summary C"));
});
// Setup a "failing" article
mockClient
.Setup(c => c.GetResponseAsync(
It.Is<IEnumerable<ChatMessage>>(msgs => msgs.Any(m => m.Text.Contains("Article D"))),
It.IsAny<ChatOptions>(),
It.IsAny<CancellationToken>()))
.ThrowsAsync(new HttpRequestException("Network unstable"));
var logger = new ConsoleLogger<NewsSummarizer>();
var service = new NewsSummarizer(mockClient.Object, logger);
// --- EXECUTION ---
var articles = new[] { "Article A", "Article B", "Article C", "Article D" };
// We set a strict timeout of 1 second.
// Article A and B should pass instantly.
// Article C takes 2 seconds (should timeout).
// Article D throws immediately (should be caught and logged).
var result = await service.SummarizeBatchAsync(articles, TimeSpan.FromSeconds(1));
// --- ASSERTIONS ---
Console.WriteLine("\n--- TEST RESULTS ---");
Console.WriteLine($"Successes: {string.Join(", ", result.Summaries)}");
Console.WriteLine($"Errors: {string.Join(", ", result.Errors)}");
// Verify the mock was called correctly
mockClient.Verify(
x => x.GetResponseAsync(It.IsAny<string>(), It.IsAny<CancellationToken>()),
Times.Exactly(4)); // Called for all 4 articles
}
}
Detailed Explanation
1. The Problem: Flakiness in Async AI Pipelines
When testing code that relies on external AI services, standard unit tests often fail due to:
- Network Jitter: The AI takes 100ms one time, 5000ms the next.
- Non-Determinism: The AI returns "Yes" in one run and "Yes." in another.
- Concurrency Bugs: Race conditions that only appear under specific timing.
To solve this, we must isolate the code. We treat the AI as a "Black Box" and force it to behave predictably using Mocking.
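To make this isolation concrete, here is a minimal sketch (assuming xUnit and Moq; `ISentimentClient` is a hypothetical interface invented for this illustration) contrasting a brittle exact-match assertion with a behavioral one:

```csharp
// Hypothetical interface standing in for a real AI client.
public interface ISentimentClient
{
    Task<string> ClassifyAsync(string text, CancellationToken ct = default);
}

public class SentimentTests
{
    [Fact]
    public async Task Classify_ReturnsPositive_ForHappyText()
    {
        var mock = new Mock<ISentimentClient>();
        mock.Setup(c => c.ClassifyAsync(It.IsAny<string>(), It.IsAny<CancellationToken>()))
            .ReturnsAsync("Positive."); // canned, deterministic reply

        var verdict = await mock.Object.ClassifyAsync("I love this!");

        // Brittle: fails if the model ever says "Positive" instead of "Positive."
        // Assert.Equal("Positive", verdict);

        // Robust: asserts the behavior we actually care about.
        Assert.Contains("Positive", verdict);
    }
}
```

The mock removes network jitter and sampling noise; the `Contains` assertion removes sensitivity to cosmetic output variation.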
2. The Mocking Strategy (Microsoft.Extensions.AI + Moq)
We use the `IChatClient` interface. This allows us to swap the real OpenAI/Azure call for a fake one.
- `It.Is<string>(...)`: This is the key to deterministic testing. Instead of saying "mock any call," we say "mock calls where the input string contains 'Article A'." This allows us to simulate specific logic paths.
- `ReturnsAsync`: We immediately return a pre-canned `AIResponse`. No network delay.
- Simulating Failure: We use `ThrowsAsync(new HttpRequestException(...))` to force the code into the `catch` block without needing to unplug a network cable.
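Moq's `SetupSequence` extends this strategy to multi-call scenarios. As a sketch (reusing the chapter's simplified `IChatClient` and `AIResponse` shapes), here is how you could simulate a transient failure followed by recovery to exercise retry logic:

```csharp
// Simulate "fail once, then succeed" — useful for testing retry policies.
var flaky = new Mock<IChatClient>();
flaky.SetupSequence(c => c.GetResponseAsync(
        It.IsAny<string>(),
        It.IsAny<CancellationToken>()))
    .ThrowsAsync(new HttpRequestException("Transient 503"))  // 1st call fails
    .ReturnsAsync(new AIResponse(
        new ChatMessage(ChatRole.Assistant, "Recovered")));  // 2nd call succeeds
```

Each invocation consumes the next step in the sequence, so a retry wrapper under test should surface exactly one logged failure and one success.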
3. The Business Logic (Parallel.ForEachAsync)
The `SummarizeBatchAsync` method uses `Parallel.ForEachAsync`.
- `MaxDegreeOfParallelism = 3`: We limit concurrency. In a real scenario, hitting an AI with 1000 parallel requests might hit rate limits. In our test, it ensures the mock executes in a controlled manner.
- `CancellationTokenSource`: We pass a `TimeSpan` to create a token. This is crucial for timeout testing. Why? If Article C hangs indefinitely, the test would hang forever. By passing the token to the AI call (the `token` argument), we ensure that if the global timeout hits, the internal call is also cancelled.
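The `SummarizeBatchAsync` body itself appears earlier in the book; a sketch consistent with the description above (the `SummaryResult` record and the `response.Message.Text` property are assumptions made for illustration) might look like this:

```csharp
public async Task<SummaryResult> SummarizeBatchAsync(string[] articles, TimeSpan timeout)
{
    var summaries = new ConcurrentBag<string>(); // thread-safe aggregation
    var errors = new ConcurrentBag<string>();

    using var cts = new CancellationTokenSource(timeout); // global timeout

    var options = new ParallelOptions
    {
        MaxDegreeOfParallelism = 3 // cap concurrent AI calls
    };

    await Parallel.ForEachAsync(articles, options, async (article, _) =>
    {
        try
        {
            // The timeout token flows into the AI call itself.
            var response = await _client.GetResponseAsync(article, cts.Token);
            summaries.Add(response.Message.Text);
        }
        catch (OperationCanceledException)
        {
            errors.Add($"{article}: Timeout");
        }
        catch (HttpRequestException ex)
        {
            _logger.LogError(ex, "AI call failed for {Article}", article);
            errors.Add($"{article}: {ex.Message}");
        }
    });

    return new SummaryResult(summaries.ToList(), errors.ToList());
}
```

Note the design choice: the timeout token is passed to the AI call but not to `ParallelOptions.CancellationToken`, so a timeout on Article C is recorded as one per-item error instead of aborting the whole loop — which is what lets A and B still succeed.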
4. The Event Loop & Assertions
- Asynchronous Entry Point: Notice `Main` is `async Task Main`. The console app's runtime drives the async state machine to completion, so the awaited pipeline runs without blocking threads.
- Deterministic Verification: After execution, we check the `SummaryResult`:
  - We expect 2 successes (A, B).
  - We expect 2 errors (C [Timeout], D [Network Error]).
- Because the mock is deterministic, this result will be the same every single time we run the code.
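In a real test project, the `Console.WriteLine` checks would become assertions. A sketch using xUnit (assuming the mock setup above is moved into the test method and `result.Summaries`/`result.Errors` are lists) could be:

```csharp
[Fact]
public async Task SummarizeBatch_HandlesTimeouts_AndFailures()
{
    // ... same mockClient setup and NewsSummarizer construction as above ...
    var articles = new[] { "Article A", "Article B", "Article C", "Article D" };

    var result = await service.SummarizeBatchAsync(articles, TimeSpan.FromSeconds(1));

    // Outcome assertions: behavior, not implementation.
    Assert.Equal(2, result.Summaries.Count);
    Assert.Contains("Summary A", result.Summaries);
    Assert.Contains("Summary B", result.Summaries);
    Assert.Equal(2, result.Errors.Count);

    // Interaction assertion: the AI was consulted once per article.
    mockClient.Verify(
        x => x.GetResponseAsync(It.IsAny<string>(), It.IsAny<CancellationToken>()),
        Times.Exactly(4));
}
```

Because every input maps to a fixed mock behavior, these assertions never flake.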
Common Pitfalls
- Using `Task.Delay` in Tests (The "Sleep" Trap):
  - Mistake: Writing `await Task.Delay(1000)` inside a unit test to "wait for the AI."
  - Why it's bad: It makes tests slow (flakiness via slowness) and unreliable. If the machine is busy, the test fails.
  - Fix: Use `Task.CompletedTask` or return immediately. Only use `Task.Delay` inside the mock to simulate latency, never in the test runner itself.
- Testing Implementation Details, Not Behavior:
  - Mistake: Verifying that `Parallel.ForEachAsync` was called exactly once.
  - Why it's bad: You might refactor the code to use `Task.WhenAll` later. The test will break even though the code works.
  - Fix: Verify the outcome (did we get 2 summaries?) and the external interactions (was the AI called 4 times?).
- Ignoring the `CancellationToken`:
  - Mistake: Mocking an AI call that ignores the passed `CancellationToken`.
  - Why it's bad: If your code has a timeout but the mock ignores it, the test will pass, yet production code might hang forever if the real AI hangs.
  - Fix: Ensure your mock setup includes `It.IsAny<CancellationToken>()` and that the logic inside the mock respects the token (e.g., by throwing `OperationCanceledException` when it is triggered).
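A token-respecting mock is a one-line change: `Task.Delay(ms, token)` throws `TaskCanceledException` (a subclass of `OperationCanceledException`) the moment the token fires, just like a well-behaved HTTP client. A sketch using the chapter's types:

```csharp
// The simulated latency is cancelled when the caller's timeout fires,
// so the "Slow" article fails fast instead of hanging the test.
mockClient
    .Setup(c => c.GetResponseAsync(
        It.Is<string>(s => s.Contains("Slow")),
        It.IsAny<CancellationToken>()))
    .Returns<string, CancellationToken>(async (_, token) =>
    {
        await Task.Delay(5000, token); // throws on cancellation, like a real client
        return new AIResponse(new ChatMessage(ChatRole.Assistant, "Too late"));
    });
```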
- Shared State in Parallel Tests:
  - Mistake: Using a standard `List<string>` inside the parallel loop.
  - Why it's bad: `List<T>` is not thread-safe. Accessing it from multiple parallel tasks causes `IndexOutOfRangeException` or data corruption.
  - Fix: Always use thread-safe collections like `ConcurrentBag<T>`, or lock access when aggregating results in parallel workflows.
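The shared-state pitfall can be demonstrated without any AI client at all. A self-contained sketch (using `System.Collections.Concurrent` and `System.Linq`):

```csharp
var results = new ConcurrentBag<int>(); // thread-safe, lock-free adds
// var results = new List<int>();       // NOT thread-safe: races under parallelism

await Parallel.ForEachAsync(
    Enumerable.Range(0, 1000),
    new ParallelOptions { MaxDegreeOfParallelism = 8 },
    async (i, ct) =>
    {
        await Task.Yield();   // simulate async work
        results.Add(i * 2);   // safe from any thread with ConcurrentBag
    });

// With ConcurrentBag the count is always 1000; with List<T>, concurrent
// Add calls can silently drop items or throw.
Console.WriteLine(results.Count);
```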
The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.
Code License: All code examples are released under the MIT License. Github repo.
Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.
All textual explanations, original diagrams, and illustrations are the intellectual property of the author. To support the maintenance of this site via AdSense, please read this content exclusively online. Copying, redistribution, or reproduction is strictly prohibited.