
Chapter 13: Caching Responses to Save API Costs (HybridCache)

Theoretical Foundations

The fundamental challenge of building AI-powered APIs with ASP.NET Core is managing the inherent cost and latency of Large Language Model (LLM) invocations. Unlike traditional database queries where execution plans and indexes offer predictable performance, LLM inference is computationally expensive and financially variable. Every call to an external provider like OpenAI or a self-hosted model consumes tokens, translating directly to operational cost. Furthermore, the network round-trip introduces latency that degrades user experience.

To address this, we introduce HybridCache, a sophisticated caching mechanism designed specifically for the stateful and complex nature of AI responses. While traditional caching libraries focus on simple key-value storage, HybridCache addresses the specific needs of AI APIs: handling complex object graphs (serialized AI responses), managing cache stampedes (thundering herds) during high load, and ensuring consistency when underlying model configurations change.

The Cost and Latency Bottleneck in AI APIs

Consider an API endpoint responsible for generating a summary of a legal document. The workflow involves:

  1. Receiving the document text.
  2. Constructing a prompt (system instructions + user input).
  3. Sending the request to an LLM.
  4. Receiving and parsing the response.

If 1,000 users request summaries of the same legal document, a naive implementation would result in 1,000 distinct LLM calls. This is inefficient. With a fixed input, model configuration, and sampling settings (e.g., temperature 0), the response is effectively deterministic, so caching the output eliminates redundant inference costs.
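To put rough numbers on it, caching collapses those 1,000 calls into one. All prices and token counts below are hypothetical assumptions for illustration, not a real provider rate card:

```csharp
// Back-of-the-envelope cost comparison; every figure here is assumed.
const double PricePer1KTokens = 0.01;  // assumed $/1K tokens
const int TokensPerCall = 3_000;       // assumed prompt + completion size
const int Requests = 1_000;            // identical requests for the same document

double costWithoutCache = Requests * (TokensPerCall / 1_000.0) * PricePer1KTokens;
double costWithCache = 1 * (TokensPerCall / 1_000.0) * PricePer1KTokens;

Console.WriteLine($"Without cache: ${costWithoutCache:F2}"); // $30.00
Console.WriteLine($"With cache:    ${costWithCache:F2}");    // $0.03
```

The ratio scales linearly: every additional duplicate request is pure savings once the first response is cached.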

However, caching AI responses introduces complexity. AI responses are rarely primitive strings; they are structured objects containing metadata, token usage statistics, and sometimes nested conversation histories. Furthermore, LLM behavior changes if we update the system prompt or switch model versions. A cache entry valid for "Model GPT-4" is invalid for "Model GPT-4-Turbo."

Analogy: The Library and the Librarian

To understand HybridCache, imagine a university library (the API) serving students (users) who ask complex research questions (LLM prompts).

  • The Expensive Researcher (The LLM): There is a single, highly specialized researcher in the basement. Conducting research takes hours and costs money (tokens). If 50 students ask the exact same question, asking the researcher 50 times is wasteful.
  • The Notebook (Simple Cache): A student writes the answer in a notebook and leaves it on a desk. The next student finds the notebook and reads the answer. This is fast and free. However, if the library updates its reference books (System Prompt/Model Weights), the notebook's answer is outdated. Someone must manually erase the notebook (Manual Invalidation).
  • The Librarian (HybridCache): The Librarian manages a sophisticated system.
    • Time-Based Expiration: The Librarian knows that scientific facts change. Even if a question was answered yesterday, the answer might be stale today. So, the Librarian stamps an expiration date on the answer.
    • Stale-While-Revalidate: If a student asks a question and the answer is "expired" but still usable, the Librarian hands over the old answer immediately (to keep the student happy/fast) while quietly asking the Researcher for an updated answer in the background. Once the Researcher replies, the Librarian updates the notebook.
    • Tag-Based Invalidation: If the library buys a new edition of an encyclopedia (Model Update), the Librarian doesn't just erase one notebook. They look at a master index (Tags) to find all notebooks related to that encyclopedia and remove them instantly.

Serialization Strategies for Complex AI Object Graphs

In Chapter 12, we discussed Dependency Injection and how to swap between different AI providers (e.g., IOpenAIService vs. ILlamaService). When caching, we face a serialization challenge. AI responses are rarely flat strings.

A typical AI response object might look like this:

public class AiResponse
{
    public string Content { get; set; }
    public List<Citation> Citations { get; set; }
    public UsageMetrics Usage { get; set; }
    public DateTime CreatedAt { get; set; }
}

When storing this in a distributed cache (like Redis) or an in-memory store, the object graph must be serialized into a byte stream.

The Serialization Dilemma:

  1. JSON: Human-readable and flexible, but verbose. For high-volume caching, the size of the serialized JSON (especially with long content strings) increases memory pressure and consumes network bandwidth between the API server and the cache store.
  2. Binary (e.g., MessagePack): Highly efficient and compact. However, it requires strict contract definitions. If we change the AiResponse class (e.g., adding a new property), deserializing old cached entries might fail or lose data.

HybridCache abstracts this complexity. It allows developers to define serialization strategies that handle versioning. For example, using a schema version prefix in the cache key ensures that a change in the response object structure automatically invalidates old cache entries, preventing runtime errors.
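A minimal sketch of that idea, assuming a version constant maintained alongside the response type (the names below are illustrative, not a HybridCache feature):

```csharp
// Bump this constant whenever the cached response type changes shape.
const int SchemaVersion = 2;

// Keys carry the schema version, so a structural change retargets all lookups.
string BuildSummaryKey(string documentId) => $"summary:v{SchemaVersion}:{documentId}";

Console.WriteLine(BuildSummaryKey("doc-123")); // summary:v2:doc-123
```

After a deployment that bumps SchemaVersion to 3, every lookup targets summary:v3:* keys, so stale v2 entries are never deserialized and simply expire on their own.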

Time-Based Expiration and Stale-While-Revalidate

In the context of AI, "freshness" is a trade-off. For a chatbot summarizing news, data must be near real-time. For a code-generation assistant, a cached response from yesterday might still be valid today.

Time-Based Expiration (TTL - Time To Live): This is the simplest strategy. We define a duration for which a cached response is considered valid.

  • Why: Prevents the cache from growing indefinitely and ensures data doesn't become too outdated.
  • Implementation: When storing a response, we set a DateTimeOffset expiration.

Stale-While-Revalidate (SWR): This is a more advanced pattern crucial for AI APIs where latency is the primary user experience metric.

  1. Request arrives: Check cache.
  2. Cache Hit (Expired): Return the expired data immediately to the user. Simultaneously, trigger a background task to fetch fresh data from the LLM.
  3. Cache Hit (Fresh): Return the data.
  4. Cache Miss: Fetch from LLM, store in cache, return.

Why is this critical for AI? LLM inference can take 5–30 seconds. If a cache entry expires, waiting for the LLM to generate a new response results in a timeout or a terrible user experience. SWR ensures the user always gets an instant response, even if it's slightly stale, while the system updates itself in the background.
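The four steps above can be sketched in a few lines. One caveat: as of .NET 9, HybridCache does not expose stale-while-revalidate as a built-in option, so the sketch below implements the pattern by hand over a ConcurrentDictionary. All names, TTLs, and the single-string value type are assumptions for illustration; a production version would also need per-key locking.

```csharp
using System.Collections.Concurrent;

// Illustrative SWR cache over a plain dictionary (not HybridCache itself).
var store = new ConcurrentDictionary<string, (string Value, DateTimeOffset FreshUntil, bool Refreshing)>();
var freshTtl = TimeSpan.FromMilliseconds(50);

async Task<string> GetWithSwrAsync(string key, Func<Task<string>> factory)
{
    if (store.TryGetValue(key, out var entry))
    {
        if (DateTimeOffset.UtcNow <= entry.FreshUntil)
            return entry.Value; // 3. fresh hit: no LLM call

        // 2. stale hit: serve the old answer instantly, refresh in the background
        if (!entry.Refreshing)
        {
            store[key] = (entry.Value, entry.FreshUntil, true);
            _ = Task.Run(async () =>
            {
                var value = await factory();
                store[key] = (value, DateTimeOffset.UtcNow + freshTtl, false);
            });
        }
        return entry.Value;
    }

    // 4. miss: the first caller pays the full inference latency once
    var freshValue = await factory();
    store[key] = (freshValue, DateTimeOffset.UtcNow + freshTtl, false);
    return freshValue;
}

// Demo with a slow fake "LLM":
async Task<string> SlowLlm() { await Task.Delay(10); return "summary"; }

var first = await GetWithSwrAsync("doc:123", SlowLlm);  // miss: waits for the factory
await Task.Delay(100);                                  // entry goes stale
var second = await GetWithSwrAsync("doc:123", SlowLlm); // stale hit: returns instantly
Console.WriteLine((first, second)); // (summary, summary)
```

The comment numbers map back to the steps above: the stale path (step 2) is what keeps the user-facing latency constant even when the entry has expired.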

Tag-Based Invalidation: The Key to Consistency

Traditional caching relies on key-value pairs (e.g., cache.Set("summary_doc_123", data)). The problem arises when the underlying data source changes. If we update the system prompt used to generate summaries, summary_doc_123 is now invalid.

Naive approach: Clear the entire cache. This is destructive and causes a "cold start" problem where every request misses the cache and hammers the LLM.

Tag-Based Invalidation allows us to associate metadata (tags) with cache entries.

  • Entry: Key: "summary_doc_123", Tags: ["model:gpt-4", "prompt:v2"]
  • Invalidation: When we update the prompt to version 3, we issue an invalidation command: InvalidateByTag("prompt:v2").

The cache store finds all entries with the tag prompt:v2 and removes them. This surgical precision ensures that only relevant cache entries are cleared.
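With HybridCache, the entry/invalidation pair above maps directly onto the tags parameter of GetOrCreateAsync and onto RemoveByTagAsync. The snippet below is a fragment, assuming an injected HybridCache instance named cache and a hypothetical GenerateSummaryAsync helper standing in for the real LLM call:

```csharp
// Fragment: `cache` is an injected HybridCache; GenerateSummaryAsync is a
// hypothetical helper, not a real SDK method.
var summary = await cache.GetOrCreateAsync(
    "summary_doc_123",
    async ct => await GenerateSummaryAsync(ct),
    tags: ["model:gpt-4", "prompt:v2"]);

// After deploying prompt v3, surgically clear every entry produced by v2:
await cache.RemoveByTagAsync("prompt:v2");
```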

Visualizing the HybridCache Flow

The following diagram illustrates the decision flow within a HybridCache implementation for an AI API endpoint.


Architectural Implications for AI APIs

Implementing HybridCache fundamentally changes how we design AI endpoints.

  1. Deterministic Key Generation: Since LLMs are (conceptually) stateless functions, the cache key must be derived from every input that influences the output. This includes:

    • The User Prompt.
    • The System Prompt (Crucial! Changing the system prompt changes the output).
    • The Model Identifier (e.g., "gpt-4-turbo").
    • Any Tool/Function definitions passed to the model.

    In Chapter 12, we used interfaces to abstract the AI provider. Here, we must ensure that the CacheKey generator incorporates the specific configuration of the concrete implementation. If we swap from OpenAI to a local Llama model via Dependency Injection, the cache key must reflect this to avoid serving OpenAI responses to a request intended for Llama.

  2. Memory Management: AI responses can be large (thousands of tokens). Caching 10,000 unique long-form summaries can consume gigabytes of RAM. HybridCache strategies must include eviction policies (like Least Recently Used - LRU) to prevent the cache from overwhelming the server's memory, which could lead to OutOfMemoryException and crashing the API.

  3. Distributed vs. Local Caching:

    • Local (In-Memory): Fastest, but specific to a single server instance. In a load-balanced environment with multiple API replicas, a cache hit on Server A does not help a request routed to Server B.
    • Distributed (Redis/SQL): Slower due to network overhead, but shared across all instances.

    HybridCache in .NET 9 is designed to be a "two-level" cache. It checks local memory first (L1) and falls back to a distributed store (L2) if needed. This optimizes for the highest hit rate while maintaining consistency across scaled-out infrastructure.
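Point 1 above (deterministic key generation) can be sketched as a small helper that hashes every output-affecting input. The field list and the "ai:v1" prefix are assumptions for illustration:

```csharp
using System.Security.Cryptography;
using System.Text;

// Sketch: a deterministic key built from *all* inputs that affect the output.
string BuildKey(string systemPrompt, string userPrompt, string modelId, string toolSchema)
{
    // Hash the combined inputs so long prompts still yield a short, fixed-size key.
    var material = $"{modelId}\n{systemPrompt}\n{toolSchema}\n{userPrompt}";
    var hash = Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(material)));
    return $"ai:v1:{modelId}:{hash}";
}

var k1 = BuildKey("You are a legal summarizer.", "Summarize doc 123", "gpt-4-turbo", "[]");
var k2 = BuildKey("You are a legal summarizer.", "Summarize doc 123", "llama-3-70b", "[]");
Console.WriteLine(k1 == k2); // False — swapping the model changes the key
```

Including the model identifier in both the hash and the readable prefix guarantees that a DI-driven swap from OpenAI to Llama can never serve a cached response generated by the other provider.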

From Theory to Practice

By understanding these theoretical components—serialization, expiration strategies, and tag-based invalidation—we prepare to implement a robust caching layer. This layer acts as a shock absorber between the high-volume demands of our API users and the expensive, slow reality of LLM inference. The goal is not just speed, but cost sustainability: allowing the application to scale to thousands of users without incurring prohibitive token usage fees.

Basic Code Example

using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Caching.Hybrid;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

// 1. Register the HybridCache service with default configuration
builder.Services.AddHybridCache(options =>
{
    // 2. Set a default expiration time for cached entries
    options.DefaultEntryOptions = new HybridCacheEntryOptions
    {
        // 3. Expiration defines the hard limit for the entry
        Expiration = TimeSpan.FromMinutes(5)
    };
});

// Register the service that wraps the cache so it can be injected below
builder.Services.AddScoped<WeatherService>();

var app = builder.Build();

// 14. Define the API endpoint
app.MapPost("/api/weather", async (WeatherRequest request, WeatherService service, CancellationToken ct) =>
{
    // 15. Call the service which handles caching internally
    var forecast = await service.GetWeatherAsync(request.City, ct);
    return Results.Ok(forecast);
});

// 16. Define an endpoint to demonstrate cache invalidation
app.MapDelete("/api/weather/cache", async (string city, HybridCache cache, CancellationToken ct) =>
{
    // 17. Invalidate all entries tagged with the specific city
    await cache.RemoveByTagAsync($"city:{city}", ct);
    return Results.Ok($"Cache for {city} invalidated.");
});

app.Run();

// 4. Define a simple domain object to cache
// (Type declarations must come after all top-level statements.)
public record WeatherForecast(string City, DateTime Date, double TemperatureC, string Summary);

// 5. Define a request object for the endpoint
public record WeatherRequest(string City);

// 6. Create a mock service to simulate expensive AI model calls or database queries
public class WeatherService
{
    private readonly HybridCache _cache;

    public WeatherService(HybridCache cache)
    {
        _cache = cache;
    }

    // 7. Method to get weather with caching
    public async Task<WeatherForecast> GetWeatherAsync(string city, CancellationToken ct)
    {
        // 8. Create a unique cache key based on the input
        string cacheKey = $"weather:{city.ToLowerInvariant()}";

        // 9. Attempt to get the value from cache or compute it
        return await _cache.GetOrCreateAsync(
            cacheKey,
            async token =>
            {
                // 10. This block executes ONLY on a cache miss
                // Simulate a delay (e.g., LLM inference time)
                await Task.Delay(1000, token);

                // 11. Simulate generating a response
                var rng = new Random();
                return new WeatherForecast(
                    City: city,
                    Date: DateTime.UtcNow,
                    TemperatureC: rng.Next(-10, 35),
                    Summary: rng.Next(0, 10) > 5 ? "Sunny" : "Cloudy"
                );
            },
            // 12. Optional: Override default options per entry
            new HybridCacheEntryOptions
            {
                Expiration = TimeSpan.FromMinutes(2) // Shorter TTL for this specific data
            },
            // 13. Tags allow invalidating multiple related entries later
            tags: ["weather", $"city:{city}"],
            cancellationToken: ct
        );
    }
}

Detailed Explanation

1. The Real-World Problem: The "Cold Start" Cost

Imagine you are building a chatbot interface using an LLM. A user asks, "What are the benefits of ASP.NET Core?".

  1. The Input: The prompt is sent to the model.
  2. The Processing: The model processes the tokens (expensive compute time).
  3. The Output: A response is generated.

If a second user asks the exact same question 5 seconds later, sending the request through the entire pipeline again is wasteful. It costs you money (token usage) and latency (waiting for the model). This code example solves that by introducing a memory cache that sits between the API endpoint and the expensive operation.

2. Code Breakdown (Step-by-Step)

Step 1-3: Service Registration

builder.Services.AddHybridCache(options =>
{
    options.DefaultEntryOptions = new HybridCacheEntryOptions
    {
        Expiration = TimeSpan.FromMinutes(5)
    };
});

  • AddHybridCache: This is the entry point in .NET 9. It registers the HybridCache service. Unlike the older IMemoryCache, it is designed for modern serialization needs and ships with built-in cache-stampede protection (we are using basic expiration here).
  • DefaultEntryOptions: We define a global configuration. Expiration ensures that data doesn't live forever. If the underlying data changes (e.g., weather updates), the cache will eventually clear itself.

Step 4-6: Domain Logic and Service

public class WeatherService
{
    private readonly HybridCache _cache;
    // ... constructor ...
}

  • Dependency Injection: We inject HybridCache into our service. This keeps the caching logic separate from the business logic.
  • The Mock: The WeatherService simulates an expensive operation (like an HTTP call to an LLM provider) using Task.Delay(1000). In a real scenario, this would be await _llmClient.GenerateAsync(...).

Step 7-13: The Caching Logic (GetOrCreateAsync) This is the core of the example.

return await _cache.GetOrCreateAsync(
    cacheKey,
    async token => { /* Expensive Logic */ },
    options,
    tags,
    ct
);

  • Step 8 (Cache Key): string cacheKey = $"weather:{city.ToLowerInvariant()}";
    • Why: Caching relies on unique identifiers. If we don't normalize the input (e.g., "London" vs "london"), we might miss the cache or create duplicates.
  • Step 9 (The Factory): async token => { ... }
    • How it works: This is a lambda passed as an argument; its parameter is named token rather than ct so it doesn't shadow the method's own CancellationToken. The HybridCache library checks its internal store.
      • Hit: If the key exists and isn't expired, it returns the value immediately. The lambda is never executed.
      • Miss: If the key is missing, the library executes the lambda, waits for the result, stores it in the cache, and then returns it.
  • Step 12 (Per-Entry Options):
    • While we set a global 5-minute default, we override it here to 2 minutes. This is useful for data that changes more frequently than others.
  • Step 13 (Tags):
    • Tags are metadata associated with cache entries. Here, we tag the entry with "weather" and "city:London". This is powerful because you can invalidate all cached weather data, or just data related to a specific city, without knowing the exact keys.

Step 14-17: The API Endpoints

  • MapPost("/api/weather"): This is the consumer of the service. It simply passes the request to the service. It doesn't know if the data came from the cache or the "AI model."
  • MapDelete("/api/weather/cache"): This demonstrates Tag-Based Invalidation. If the underlying weather data source updates (e.g., a new sensor reading), we can call this endpoint to wipe all cached data for "London" instantly.

Visualizing the Flow

A flowchart shows a Clear London Cache endpoint receiving an external trigger, which sends an immediate command to the cache layer to purge all stored data associated with the London key.

Common Pitfalls

1. Mutable State in Cached Objects A frequent mistake is caching mutable objects (like lists or classes with public setters) and modifying them after retrieval.

  • The Mistake:

    var list = await cache.GetOrCreateAsync("key", ...);
    list.Add(new Item()); // Modifies the cached instance!
    

  • The Consequence: Since HybridCache stores the reference (or a serialized version), subsequent requests might see the modified data or throw exceptions due to concurrent modification. Always treat cached objects as immutable.

  • The Fix: Use record types (as shown in the example) or clone objects before modifying them.
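A quick sketch of why the record types used in this chapter help, reusing the WeatherForecast shape from the example:

```csharp
var cached = new WeatherForecast("London", DateTime.UtcNow, 18, "Cloudy");

// `with` produces a modified copy; whatever instance the cache still holds
// is untouched.
var adjusted = cached with { Summary = "Sunny" };

Console.WriteLine(cached.Summary);   // Cloudy — the cached instance is untouched
Console.WriteLine(adjusted.Summary); // Sunny

// (Type declarations must follow top-level statements.)
public record WeatherForecast(string City, DateTime Date, double TemperatureC, string Summary);
```

Positional records expose no setters, so the only way to "change" a cached value is to create a copy, which is exactly the immutability this pitfall calls for.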

2. Ignoring Cancellation Tokens In the example, CancellationToken ct is passed to GetOrCreateAsync.

  • The Mistake: Omitting the token in the cache call.
  • The Consequence: If a user cancels a request (closes the browser), the background factory method might still run to completion, wasting resources.
  • The Fix: Always pass the CancellationToken from the HTTP context down to the cache layer.

3. Cache Stampede (Thundering Herd) If a popular cache key expires, and 1000 requests arrive simultaneously, they might all miss the cache and hit the database/LLM at the same time.

  • HybridCache Advantage: The .NET 9 HybridCache library handles this internally. It uses locking mechanisms to ensure that if a key is being computed, other requests for the same key wait for the result rather than triggering the factory multiple times. However, if you build a caching layer by hand without such locking, this remains a critical risk.
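A small self-contained demo of the stampede itself, using a naive check-then-compute cache (all names are illustrative); this is the behavior HybridCache's internal locking prevents:

```csharp
using System.Collections.Concurrent;
using System.Linq;

// A naive cache with no single-flight protection.
var naive = new ConcurrentDictionary<string, string>();
int factoryRuns = 0;

async Task<string> GetNaiveAsync(string key)
{
    if (naive.TryGetValue(key, out var hit)) return hit;

    Interlocked.Increment(ref factoryRuns); // every concurrent miss pays the cost
    await Task.Delay(50);                   // simulated LLM latency
    return naive[key] = "expensive answer";
}

// 100 simultaneous requests for the same cold key:
var results = await Task.WhenAll(Enumerable.Range(0, 100).Select(_ => GetNaiveAsync("hot-key")));
Console.WriteLine(factoryRuns); // typically near 100 here; HybridCache runs the factory once
```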

4. Key Collisions Using simple keys like "data" or "summary".

  • The Consequence: Different users or different contexts overwrite each other's data.
  • The Fix: Always namespace your keys (e.g., user:123:summary or model:gpt-4:response). In the example, we used weather:{city}.

The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.





Code License: All code examples are released under the MIT License. Github repo.

Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.

All textual explanations, original diagrams, and illustrations are the intellectual property of the author. To support the maintenance of this site via AdSense, please read this content exclusively online. Copying, redistribution, or reproduction is strictly prohibited.