
Chapter 12: Resilience Patterns - Retries and Circuit Breakers (Polly)

Theoretical Foundations

In the landscape of building AI-powered web APIs with ASP.NET Core, the reliability of the underlying infrastructure is just as critical as the intelligence of the models themselves. When your application acts as a client to external AI services—such as OpenAI, Azure Cognitive Services, or third-party APIs—it enters a world of inherent unpredictability. These external dependencies are not static libraries running within your process; they are distributed systems subject to network latency, transient server errors, rate limits, and occasional downtime. Relying on a simple HttpClient call without defensive mechanisms is akin to building a skyscraper on sand. This is where resilience patterns, specifically Retries and Circuit Breakers, become the bedrock of a production-grade AI application.

The Fragile Nature of AI Service Consumption

In previous chapters, we explored how to structure ASP.NET Core applications to consume AI services, often utilizing interfaces to abstract the underlying provider (e.g., IOpenAIService). We learned to inject these dependencies and configure them in the DI container. However, that abstraction primarily addresses architectural cleanliness and testability; it does not inherently solve the problem of reliability. When an AI model API returns a 503 Service Unavailable or a 429 Too Many Requests error, the default behavior of a standard HTTP client is to fail immediately, propagating that exception up the call stack and ultimately returning an error response to the user.

In the context of AI applications, this fragility is amplified. AI inference is often computationally expensive and time-consuming. A request to a large language model (LLM) might take several seconds to process. If a transient network blip occurs during that window, simply failing the request wastes valuable compute time and degrades the user experience. Furthermore, AI APIs are often rate-limited to protect the provider's resources. A burst of traffic to your application could trigger a cascade of rate-limiting errors from the AI provider, which, if not handled gracefully, could crash your service or render it unresponsive.

The Analogy: The Busy Coffee Shop

To understand the necessity of resilience patterns, imagine your AI API is a customer (you) trying to order a complex coffee from a barista (the external AI service).

Scenario 1: No Resilience (The Naive Approach) You walk up to the counter. The barista is currently overwhelmed with orders and ignores you. You immediately leave the shop, coffee-less, and tell your friends the shop is terrible. This is a simple failure. If you had just waited a moment (a Retry), the barista would have been free to serve you.

Scenario 2: Simple Retries You decide to be persistent. You walk up, get ignored, walk away, and immediately walk back to the counter. You get ignored again. You repeat this 10 times. Eventually, the barista gets annoyed and kicks you out. This is a naive retry strategy without backoff. It doesn't give the system time to recover and can actually worsen congestion (a "thundering herd" problem).

Scenario 3: Retries with Exponential Backoff You walk up, get ignored, and wait 1 minute before trying again. If ignored again, you wait 2 minutes, then 4 minutes, then 8 minutes. This gives the barista time to catch up. This is Exponential Backoff. It's polite and increases the likelihood of success without overwhelming the service.

Scenario 4: The Circuit Breaker You walk up to the counter as usual (the circuit is Closed, which in circuit-breaker terminology means requests flow through normally). You try to order, but the barista has a meltdown and screams that the espresso machine is broken. You note this failure. After a few more failed attempts (exceeding a failure threshold), you decide not to even walk up to the counter anymore. You "trip" the circuit to "Open." For the next 15 minutes (the Reset Timeout), you don't even try to order; you just sit down. This prevents you from wasting energy and frustrating the barista. After 15 minutes, the circuit enters a Half-Open state. You cautiously walk up to see if the machine is fixed. If the order succeeds, the circuit closes (normal operation resumes). If it fails, the circuit opens again, and you wait another 15 minutes.

This analogy illustrates the core concepts:

  • Transient Faults: The barista being momentarily busy (recoverable).
  • Cascading Failures: If you keep nagging the barista while they are trying to fix the machine, you prevent them from fixing it, causing a total outage for everyone.
  • Resource Protection: The circuit breaker protects the barista's sanity and your time.

Deep Dive: Retries with Exponential Backoff and Jitter

A retry policy is the first line of defense. It assumes that many failures are temporary and that a subsequent attempt might succeed. In the context of AI APIs, this is highly relevant for:

  1. Transient Network Errors: TCP packet loss, DNS resolution failures, or temporary routing issues.
  2. Transient Server Errors: HTTP 5xx errors (e.g., 500 Internal Server Error, 502 Bad Gateway, 503 Service Unavailable) that indicate the server is temporarily unable to handle the request but might recover quickly.
  3. Rate Limiting (with caution): HTTP 429 errors. While retries are useful here, they must be handled carefully to avoid violating the rate limit further.

The Mechanics of Exponential Backoff

Simply retrying immediately is dangerous. If the AI service is experiencing high load, immediate retries from thousands of clients will synchronize and create a massive spike in traffic, likely causing a complete outage. This is the "Thundering Herd" problem.

Exponential Backoff solves this by increasing the delay between retries exponentially. The formula is typically:

Delay = BaseDelay * (2 ^ AttemptCount)

With a BaseDelay of 100ms (counting attempts from zero):

  • Retry 1: Wait 100ms.
  • Retry 2: Wait 200ms.
  • Retry 3: Wait 400ms.
  • Retry 4: Wait 800ms.
  • Retry 5: Wait 1600ms.

This gives the external service progressively more time to recover between each failed attempt.
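The schedule above can be computed in a few lines. This is a minimal sketch, assuming a 100ms base delay (an illustrative value, not a recommendation from any specific provider):

```csharp
using System;

// Compute exponential backoff delays for 5 retries.
// Delay = BaseDelay * 2^retryNumber, with retryNumber starting at 0.
class BackoffDemo
{
    static void Main()
    {
        TimeSpan baseDelay = TimeSpan.FromMilliseconds(100);
        for (int retry = 0; retry < 5; retry++)
        {
            TimeSpan delay = TimeSpan.FromMilliseconds(
                baseDelay.TotalMilliseconds * Math.Pow(2, retry));
            Console.WriteLine($"Retry {retry + 1}: wait {delay.TotalMilliseconds} ms");
        }
    }
}
```

Running this prints waits of 100, 200, 400, 800, and 1600 ms, doubling each time.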

The Nuance of Jitter

Exponential backoff has a flaw: if multiple clients fail at the same time (e.g., a database outage affects all users simultaneously), their retry schedules will be perfectly synchronized. When they all retry after 400ms, they will hit the recovering service at the exact same moment, potentially knocking it over again.

Jitter introduces a random element to the delay, desynchronizing the clients. Instead of a deterministic 400ms, the delay might be 400ms ± random(0, 100ms). This spreads the retry load over time, significantly increasing the probability that the external service recovers gracefully.
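One common way to add that random element is to bolt a small random offset onto each exponential delay. The sketch below adds 0 to 100 ms of noise per retry; the constants are illustrative assumptions:

```csharp
using System;

// Exponential backoff plus additive jitter: clients that failed at the
// same instant will retry at slightly different moments.
class JitterDemo
{
    private static readonly Random _random = new Random();

    static TimeSpan JitteredDelay(int retryNumber, double baseMs = 100)
    {
        double exponential = baseMs * Math.Pow(2, retryNumber); // 100, 200, 400, ...
        double jitter = _random.NextDouble() * 100;             // 0-100 ms of noise
        return TimeSpan.FromMilliseconds(exponential + jitter);
    }

    static void Main()
    {
        for (int retry = 0; retry < 5; retry++)
            Console.WriteLine(
                $"Retry {retry + 1}: wait {JitteredDelay(retry).TotalMilliseconds:F0} ms");
    }
}
```

In production code you may prefer a ready-made strategy such as the decorrelated jitter helpers in the Polly.Contrib.WaitAndRetry package rather than hand-rolling the math.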

Architectural Implications for AI APIs

In an AI application, the cost of a retry is non-trivial. Retrying a request to an LLM means paying for inference again (if billing is per token) or consuming GPU cycles. Therefore, the retry policy must be selective.

  • Idempotency: Retries are safest on idempotent operations. A GET request to fetch a model's status is idempotent. A POST request to generate a chat completion is technically non-idempotent (it might generate a different response), but most AI APIs treat repeated requests with the same parameters as idempotent for billing purposes. However, you must be aware of the business logic implications.
  • Latency: Retries increase the total latency of the request. If your API has a strict SLA (e.g., respond within 2 seconds), and the AI service takes 1.5 seconds per attempt, you might only have time for one retry.

Deep Dive: The Circuit Breaker Pattern

While retries handle transient faults, they are useless against persistent failures. If the AI service is down for maintenance or experiencing a major outage, retrying every 5 seconds for 5 minutes will only waste resources and delay the reporting of the true error to the user. The Circuit Breaker pattern prevents an application from repeatedly trying to execute an operation that is likely to fail, allowing it to fail fast and spare resources.

The Three States of a Circuit Breaker

  1. Closed (Normal Operation):

    • Behavior: Requests pass through to the external service.
    • Monitoring: The circuit breaker tracks failures (e.g., specific HTTP status codes or exceptions).
    • Transition: If the number of failures exceeds a configured threshold within a time window, the circuit "trips" and transitions to the Open state.
  2. Open (Failure Mode):

    • Behavior: Requests do not go to the external service. The circuit breaker immediately throws a BrokenCircuitException (or similar) back to the caller. This is a "fail-fast" mechanism.
    • Duration: The circuit remains open for a configured ResetTimeout (e.g., 30 seconds).
    • Transition: After the timeout expires, the circuit transitions to the Half-Open state.
  3. Half-Open (Testing Recovery):

    • Behavior: The circuit allows a limited number of "trial" requests to pass through to the external service.
    • Monitoring: It monitors the outcome of these trial requests.
    • Transition:
      • Success: If the trial requests succeed, the circuit assumes the service has recovered and transitions back to Closed.
      • Failure: If any trial request fails, the circuit assumes the service is still down and immediately transitions back to Open, restarting the timeout period (often with an exponential backoff for the reset timeout itself).
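The three states and their transitions map directly onto Polly's circuit-breaker configuration. A minimal sketch in Polly v7 syntax follows; the thresholds and messages are illustrative assumptions, not recommendations:

```csharp
using System;
using System.Net.Http;
using Polly;
using Polly.CircuitBreaker;

// A circuit breaker that trips to Open after 5 consecutive
// HttpRequestExceptions and stays Open for 30 seconds before
// entering Half-Open.
static class CircuitBreakerSketch
{
    public static AsyncCircuitBreakerPolicy Build() =>
        Policy
            .Handle<HttpRequestException>()
            .CircuitBreakerAsync(
                exceptionsAllowedBeforeBreaking: 5,        // failure threshold
                durationOfBreak: TimeSpan.FromSeconds(30), // reset timeout
                onBreak: (ex, breakDelay) =>
                    Console.WriteLine($"Circuit OPEN for {breakDelay.TotalSeconds}s: {ex.Message}"),
                onReset: () =>
                    Console.WriteLine("Circuit CLOSED: service recovered."),
                onHalfOpen: () =>
                    Console.WriteLine("Circuit HALF-OPEN: sending a trial request."));
}
```

While the circuit is Open, calls through this policy fail immediately with a BrokenCircuitException instead of reaching the external service.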

Why Circuit Breakers are Essential for AI APIs

AI services are often hosted on shared infrastructure. A sudden spike in demand (e.g., a viral social media post) can cause the provider to throttle or degrade service. Without a circuit breaker:

  1. Resource Exhaustion: Your application's threads might be blocked waiting for timed-out HTTP requests, exhausting the thread pool and making your API unresponsive to other users.
  2. Cascading Failure: If your API is a dependency for other services, the failure propagates downstream.
  3. Poor User Experience: Users wait for the full request timeout (e.g., 100 seconds) instead of getting an immediate "Service Unavailable" message.

By implementing a circuit breaker, you isolate the failure. When the AI service goes down, your API immediately fails fast, preserving its own stability and allowing it to serve other requests or fallback responses (e.g., a cached response or a simpler model).

Integrating Resilience with HttpClient

In modern .NET, HttpClient is the primary tool for communicating with external APIs. However, the standard HttpClient does not have built-in resilience features. This is where the Polly library comes in.

Polly is a .NET resilience and transient-fault-handling library that allows you to express policies such as Retry, Circuit Breaker, Timeout, and Fallback in a fluent, thread-safe manner. It integrates seamlessly with HttpClient via the IHttpClientFactory pattern, which was introduced in .NET Core 2.1 to manage the lifecycle of HttpClient instances correctly (avoiding DNS stale connection issues).

The Concept of Policy Composition

Polly allows you to compose multiple policies into a single "policy wrap." The order of composition is critical. A typical order for an AI API client would be:

  1. Fallback Policy: (Outermost) If all else fails, return a cached response or a graceful degradation message.
  2. Circuit Breaker Policy: Prevents calls to the external service if it's known to be down.
  3. Retry Policy: Handles transient faults by retrying the request.
  4. Timeout Policy: (Innermost) Ensures a single attempt doesn't hang indefinitely.

This composition creates a resilient chain. If the circuit is open, the retry policy never executes. If the retry policy fails after all attempts, the fallback policy executes.
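The four-policy chain above can be sketched with Polly v7's PolicyWrap. All thresholds and the fallback payload here are illustrative assumptions:

```csharp
using System;
using System.Net;
using System.Net.Http;
using Polly;

// Compose Fallback -> CircuitBreaker -> Retry -> Timeout into one pipeline.
static class ResiliencePipeline
{
    public static IAsyncPolicy<HttpResponseMessage> Build()
    {
        // Innermost: cap a single attempt at 10 seconds.
        var timeout = Policy.TimeoutAsync<HttpResponseMessage>(TimeSpan.FromSeconds(10));

        var retry = Policy<HttpResponseMessage>
            .Handle<Exception>()
            .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

        var breaker = Policy<HttpResponseMessage>
            .Handle<Exception>()
            .CircuitBreakerAsync(5, TimeSpan.FromSeconds(30));

        // Outermost: the final safety net, so it deliberately handles any exception.
        var fallback = Policy<HttpResponseMessage>
            .Handle<Exception>()
            .FallbackAsync(new HttpResponseMessage(HttpStatusCode.ServiceUnavailable));

        // WrapAsync lists the outermost policy first.
        return Policy.WrapAsync(fallback, breaker, retry, timeout);
    }
}
```

Note the ordering: because the breaker wraps the retry, it observes only the final outcome of each retried sequence, not every individual attempt.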

Visualizing the Resilience Flow

The following diagram illustrates how these patterns interact within an ASP.NET Core application consuming an AI service.

A visual flow showing an ASP.NET Core application orchestrating resilience patterns—such as retry, circuit breaker, and fallback—around an AI service call to handle transient failures and ensure graceful degradation.

Practical Considerations for AI Workloads

When applying these patterns to AI APIs, specific nuances arise:

  1. Handling Rate Limits (HTTP 429):

    • A simple retry policy will exacerbate rate limiting.
    • Solution: Use Polly's WaitAndRetryAsync with a delay extracted from the Retry-After header if provided by the API. If not, use a conservative exponential backoff. Some AI APIs (like OpenAI) return a 429 with a header indicating how long to wait. Polly can be configured to respect this.
  2. Cost Management:

    • Retrying a failed AI inference request costs money (tokens/seconds).
    • Solution: Limit the number of retries (e.g., max 3 attempts). Use the Circuit Breaker to stop retries quickly during outages. Implement a Fallback policy that switches to a cheaper, faster model (e.g., a smaller LLM or a cached response) if the primary model is unavailable.
  3. Latency Sensitivity:

    • AI applications often require low latency (e.g., real-time chat).
    • Solution: Configure aggressive timeouts. If the AI service takes longer than 2 seconds, fail fast and serve a "Thinking..." message or a cached result. Use the Circuit Breaker to fail fast if the service is slow (e.g., track timeouts as failures).
  4. Idempotency and State:

    • AI APIs often maintain conversation state (context windows). Retrying a request might result in duplicate messages in the conversation history.
    • Solution: Design your API client to be idempotent by including a unique request ID. If the AI service supports it, pass this ID to ensure deduplication on the server side.
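The Retry-After handling described under "Handling Rate Limits" can be sketched with Polly v7 as follows. Header semantics vary by provider, so treat the details as assumptions to verify against your API's documentation:

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;

// A 429-aware retry: honor the server's Retry-After delay when present,
// otherwise fall back to exponential backoff.
static class RateLimitRetry
{
    public static IAsyncPolicy<HttpResponseMessage> Build() =>
        Policy<HttpResponseMessage>
            .HandleResult(r => r.StatusCode == (HttpStatusCode)429)
            .WaitAndRetryAsync(
                retryCount: 3,
                sleepDurationProvider: (attempt, outcome, context) =>
                {
                    // Retry-After may be a delta; null if the header is absent.
                    TimeSpan? retryAfter = outcome.Result?.Headers.RetryAfter?.Delta;
                    return retryAfter ?? TimeSpan.FromSeconds(Math.Pow(2, attempt));
                },
                onRetryAsync: (outcome, delay, attempt, context) =>
                {
                    Console.WriteLine(
                        $"429 received; waiting {delay.TotalSeconds}s (attempt {attempt}).");
                    return Task.CompletedTask;
                });
}
```

Retry-After can also arrive as an absolute date rather than a delta; a production implementation should handle both forms.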

The Role of IHttpClientFactory

In .NET, creating a new HttpClient for every request is an anti-pattern that leads to socket exhaustion. Conversely, using a single static HttpClient can lead to DNS staleness (the client doesn't respect DNS TTL changes). The IHttpClientFactory solves this by managing the lifetime of HttpClient instances.

When using Polly with IHttpClientFactory, you register named or typed clients. The resilience policies are attached to these clients at registration time in Program.cs (or Startup.cs). This ensures that every time your service requests an HttpClient from the factory, it gets one pre-configured with the same resilience policies.
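A minimal registration sketch follows, assuming the Microsoft.Extensions.Http.Polly package; the client name "AiClient", the endpoint URL, and the thresholds are hypothetical:

```csharp
using System;
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;
using Polly;
using Polly.Extensions.Http;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddHttpClient("AiClient", client =>
    {
        client.BaseAddress = new Uri("https://api.example-ai.com/"); // hypothetical endpoint
        client.Timeout = TimeSpan.FromSeconds(30);
    })
    // Handlers added first sit outermost, so the breaker wraps the retry,
    // matching the composition order discussed earlier.
    .AddPolicyHandler(HttpPolicyExtensions
        .HandleTransientHttpError() // 5xx, 408, and HttpRequestException
        .CircuitBreakerAsync(5, TimeSpan.FromSeconds(30)))
    .AddPolicyHandler(HttpPolicyExtensions
        .HandleTransientHttpError()
        .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt))));

var app = builder.Build();
```

Every HttpClient the factory hands out under the name "AiClient" now carries this pipeline; no call site needs to know the policies exist.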

Summary

In summary, building a robust AI API requires more than just correct business logic; it requires defensive programming against the volatility of distributed systems. Retries with exponential backoff and jitter provide a mechanism to recover from transient faults without overwhelming the external service. The Circuit Breaker pattern acts as a safety valve, preventing cascading failures and allowing the system to fail fast during prolonged outages. By composing these policies using a library like Polly and integrating them with IHttpClientFactory, you create a resilient communication layer that ensures your AI application remains responsive, stable, and cost-effective, even when the underlying AI services are experiencing turbulence. This resilience is not an optional add-on; it is a fundamental requirement for any production-grade system that relies on external dependencies.

Basic Code Example

Imagine you are building a weather forecasting service. Your service relies on a third-party external API (like OpenWeatherMap) to fetch data. Sometimes, that external API might be temporarily unavailable due to network glitches, rate limiting, or brief server hiccups. If your service simply fails immediately, your users see an error. To solve this, we implement a Retry Pattern: if the first call fails, wait a moment and try again. If it fails again, wait a bit longer and try one more time. If it still fails, then we give up.

Here is a minimal, self-contained console application demonstrating how to implement this using the Polly library.

using Polly;
using Polly.Retry;
using System.Net.Http;
using System.Threading.Tasks;
using System.Threading;
using System;
using System.Net;

namespace PollyRetryDemo
{
    class Program
    {
        static async Task Main(string[] args)
        {
            // 1. Setup a mock handler that simulates failure
            var mockHandler = new SimulatedFailureHandler();
            var client = new HttpClient(mockHandler);

            // 2. Define the Retry Policy with Exponential Backoff
            // Retry up to 3 times after the initial attempt (4 tries total).
            // Wait times between attempts: 2s, 4s, 8s (exponential)
            AsyncRetryPolicy retryPolicy = Policy
                .Handle<HttpRequestException>()
                .WaitAndRetryAsync(
                    retryCount: 3,
                    sleepDurationProvider: attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)),
                    onRetry: (exception, timespan, retryCount, context) =>
                    {
                        Console.WriteLine($"[Retry] Attempt {retryCount}: Waiting {timespan.TotalSeconds}s due to {exception.Message}");
                    });

            Console.WriteLine("--- Starting Request Execution ---");

            try
            {
                // 3. Wrap the execution inside the policy
                await retryPolicy.ExecuteAsync(async () =>
                {
                    Console.WriteLine("Executing HTTP Request...");
                    // This will hit our mock handler which simulates failures
                    var response = await client.GetAsync("https://api.mock-weather.com/data");
                    response.EnsureSuccessStatusCode();
                    return response;
                });

                Console.WriteLine("SUCCESS: Data retrieved successfully.");
            }
            catch (Exception ex)
            {
                Console.WriteLine($"FAILURE: All retries exhausted. Final Error: {ex.Message}");
            }

            Console.WriteLine("\n--- Execution Finished ---");
        }
    }

    // --- Mock Infrastructure to simulate the scenario ---

    /// <summary>
    /// A custom HttpMessageHandler that simulates a flaky external API.
    /// It fails the first 3 requests and succeeds on the 4th.
    /// </summary>
    public class SimulatedFailureHandler : HttpMessageHandler
    {
        private int _requestCount = 0;

        protected override async Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
        {
            _requestCount++;
            Console.WriteLine($"   [Server Log] Received request #{_requestCount}");

            // Simulate network latency
            await Task.Delay(500, cancellationToken);

            if (_requestCount <= 3)
            {
                // Simulate a transient network failure (e.g., a dropped connection)
                throw new HttpRequestException("Simulated transient network error");
            }

            // Success on the 4th attempt
            return new HttpResponseMessage(HttpStatusCode.OK)
            {
                Content = new StringContent("Temperature: 22°C")
            };
        }
    }
}

Line-by-Line Explanation

  1. using Polly;: Imports the core Polly namespace.
  2. using Polly.Retry;: Imports the specific namespace for Retry policies.
  3. SimulatedFailureHandler: Since we don't actually have a failing external API, we create a class inheriting from HttpMessageHandler. This allows us to intercept the HTTP request and artificially throw exceptions or return errors.
    • _requestCount: Tracks how many times we've been called.
    • _requestCount <= 3: We deliberately fail the first three calls.
    • throw new HttpRequestException(...): This is the specific exception type the Polly policy will be looking for.
  4. AsyncRetryPolicy retryPolicy = Policy...: This is the core Polly setup.
    • .Handle<HttpRequestException>(): Tells Polly, "Only catch and retry on this specific exception type. If it's a different error (like a coding bug), let it crash immediately."
    • .WaitAndRetryAsync(...): Configures the retry logic.
    • retryCount: 3: We allow up to 3 retries on top of the original call (4 attempts in total).
    • sleepDurationProvider: This lambda calculates the wait time. We use Math.Pow(2, attempt) to get exponential growth (2, 4, 8 seconds). This is crucial to avoid "thundering herd" problems where many clients retry simultaneously.
    • onRetry: A callback that executes before the wait happens. It's great for logging so developers know the system is working as intended.
  5. retryPolicy.ExecuteAsync(...): This wraps the actual code we want to protect. It takes an async delegate containing the HTTP call.
  6. client.GetAsync(...): The actual network call. Because client uses our SimulatedFailureHandler, this will throw HttpRequestException three times in a row.
  7. The Execution Flow:
    • Attempt 1: Throws exception. Policy catches it. Logs "Waiting 2s". Waits.
    • Attempt 2: Throws exception. Policy catches it. Logs "Waiting 4s". Waits.
    • Attempt 3: Throws exception. Policy catches it. Logs "Waiting 8s". Waits.
    • Attempt 4: The SimulatedFailureHandler returns OK. The policy sees success and returns the result.
  8. catch (Exception ex): If all 3 retries fail (or if the error wasn't an HttpRequestException), the policy throws the last exception, which is caught here.

Visualizing the Flow

The diagram would show a flow starting with an HTTP request, passing through a retry loop that attempts up to three times on failure, and ultimately routing to a catch (Exception ex) block that captures the final error if all retries are exhausted.

Common Pitfalls

1. Catching Exception instead of specific types The most dangerous mistake is using .Handle<Exception>(). This tells Polly to retry on every error, including programming errors like NullReferenceException or ArgumentOutOfRangeException. If your code has a bug, Polly will burn through its entire retry budget (or, with a forever-retrying policy such as WaitAndRetryForeverAsync, loop indefinitely) trying to fix a problem that isn't network related. Always catch the specific network exceptions you expect (e.g., HttpRequestException, TimeoutException).

2. Not using Exponential Backoff Using a fixed delay (e.g., "wait 1 second every time") is bad practice for external dependencies. If the external service is down, a fixed delay causes your application to hit it relentlessly. Exponential backoff gives the external service breathing room to recover.

3. Blocking inside the retry policy Polly supports both synchronous and asynchronous execution. In ASP.NET Core (which is asynchronous by nature), you must use the Async variants (e.g., WaitAndRetryAsync, ExecuteAsync). Using the synchronous versions inside an async controller method can lead to thread pool starvation.
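To make pitfall 3 concrete, here is a sketch of the async variants used end-to-end in a minimal-API endpoint. The route, relative URL, and named client are illustrative and assume a client registered elsewhere:

```csharp
using System;
using System.Net.Http;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;
using Polly;

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddHttpClient("AiClient"); // hypothetical named client
var app = builder.Build();

app.MapGet("/forecast", async (IHttpClientFactory factory) =>
{
    var client = factory.CreateClient("AiClient");

    var policy = Policy
        .Handle<HttpRequestException>()
        .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

    // ExecuteAsync + await: asynchronous all the way down,
    // so no thread pool thread blocks during the waits.
    var response = await policy.ExecuteAsync(
        () => client.GetAsync("https://api.example-ai.com/data")); // hypothetical endpoint

    return Results.Text(await response.Content.ReadAsStringAsync());
});

app.Run();
```

Contrast this with calling the synchronous Execute and .Result inside the handler, which ties up a thread for the full duration of every wait and retry.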

The chapter continues with advanced code samples, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.





Code License: All code examples are released under the MIT License. Github repo.

Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.

All textual explanations, original diagrams, and illustrations are the intellectual property of the author. To support the maintenance of this site via AdSense, please read this content exclusively online. Copying, redistribution, or reproduction is strictly prohibited.