
Chapter 10: Handling Backpressure - When the AI Generates Faster than the UI

Theoretical Foundations

In asynchronous AI pipelines, a fundamental mismatch often arises between the rate at which an AI model generates tokens and the rate at which the consuming client—typically a UI—can render them. This phenomenon is known as backpressure. Backpressure is not merely a performance nuisance; it is a critical system design constraint. If left unmanaged, it leads to memory exhaustion, unresponsive user interfaces, and a degradation of the user experience that mimics system failure.

To understand this deeply, we must first establish the theoretical mechanics of data flow in a C# asynchronous context, specifically focusing on how IAsyncEnumerable<T> and System.Threading.Channels manage flow control.

The Core Problem: Producer-Consumer Imbalance

In an AI application, the "Producer" is the LLM (Large Language Model) inference engine. Modern GPUs and TPUs can generate text at speeds exceeding hundreds of tokens per second. The "Consumer" is the UI rendering loop (e.g., a React component updating the DOM or a Blazor WASM rendering tree).

The disparity is vast:

  • Producer Speed: 500 tokens/sec (GPU bound).
  • Consumer Speed: 60 frames/sec (Browser refresh rate, ~16ms per frame).

If the Producer pushes data indiscriminately into a shared buffer, the buffer grows indefinitely. In a managed memory environment like .NET, this triggers frequent Garbage Collection (GC) pauses, causing the UI thread to stutter. In a browser-based WASM context, it can exhaust the linear memory heap, crashing the application.

Analogy: The Firehose and the Teacup

Imagine a firehose (the AI) connected to a garden hose (the network), emptying into a tiny teacup (the UI render loop).

  • Without Backpressure: You open the valve fully. The teacup overflows instantly. Water splashes everywhere (memory leaks/crashes). The cup never actually gets to enjoy the water because it's constantly drowning.
  • With Backpressure: You install a valve (a buffer) and a pressure sensor. You only open the valve when the cup is empty. The firehose waits. The system remains stable, and the cup receives water at a manageable rate.

Asynchronous Streaming Primitives in C#

In C#, asynchronous programming is built upon the Task and Task<T> types, which represent a single unit of work. However, for streaming data—like an AI generating a stream of text—Task<T> is insufficient because it represents a single future value, not a sequence.

1. IAsyncEnumerable and the Pull Model

Introduced in C# 8.0, IAsyncEnumerable<T> is the cornerstone of asynchronous streaming. It represents a sequence of elements that can be queried asynchronously.

The Mechanism: Unlike IEnumerable<T>, which is synchronous (you pull data, and the thread blocks until data is available), IAsyncEnumerable<T> uses the await foreach construct. This creates a Pull Model. The consumer (the UI loop) signals readiness for the next item by awaiting the enumerator.

// Conceptual representation of the consumer loop
await foreach (var token in aiResponseStream)
{
    // The UI renders the token here.
    // The loop does NOT proceed to the next token until this iteration completes.
}

Why this matters for Backpressure: The await foreach loop inherently applies backpressure. The MoveNextAsync() method returns a ValueTask<bool>. The loop pauses execution at the await keyword, suspending the call stack. The AI generator (Producer) cannot push the next token until the consumer requests it via MoveNextAsync(). This is a natural flow-control mechanism.

The Limitation: While IAsyncEnumerable<T> handles logical backpressure (the consumer controls the pull rate), it does not inherently handle buffering or multicast scenarios well. If multiple UI components need the same stream, or if the Producer is faster than the Consumer even within a single iteration (e.g., complex token processing), IAsyncEnumerable alone is not enough.
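The one-token lookahead described above can be observed directly. The following minimal, self-contained sketch (token strings, counters, and delays are illustrative, not from any real model API) asserts inside the loop that the producer is never more than one item ahead of the consumer:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;

class PullModelDemo
{
    static int _produced;

    // Producer: yields a token only when the consumer asks for the next one.
    static async IAsyncEnumerable<string> GenerateAsync()
    {
        foreach (var token in new[] { "Hello", " from", " the", " model" })
        {
            _produced++;
            yield return token;       // suspends here until the next MoveNextAsync()
        }
        await Task.CompletedTask;     // keeps the iterator genuinely async
    }

    static async Task Main()
    {
        int consumed = 0;
        await foreach (var token in GenerateAsync())
        {
            consumed++;
            // Pull-model guarantee: the producer cannot run ahead of the consumer.
            Trace.Assert(_produced == consumed);
            await Task.Delay(50);     // simulated slow render step
            Console.Write(token);
        }
        Console.WriteLine();
        Trace.Assert(consumed == 4);  // every token arrived, none buffered up
    }
}
```

Because the producer resumes only inside MoveNextAsync(), slowing the consumer (the Task.Delay) automatically slows generation: no buffer, no configuration, just the shape of the loop.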

2. System.Threading.Channels: The Push Model with Flow Control

For high-throughput AI scenarios, we often need a Push Model decoupled from the consumer. The System.Threading.Channels namespace provides primitives for this, specifically Channel<T>.

The Mechanism: A Channel<T> consists of a writer (ChannelWriter<T>) and a reader (ChannelReader<T>). The AI writes to the writer; the UI reads from the reader.

Crucially, channels can be bounded. We define a Capacity. When the channel is full, WriteAsync waits asynchronously, TryWrite returns false, or items are dropped, depending on the configured FullMode. This is explicit backpressure.

Analogy: The Assembly Line Buffer

Imagine a factory assembly line (the AI pipeline). Parts (tokens) are produced rapidly.

  • Unbounded Channel: The conveyor belt is infinitely long. Parts pile up at the end, eventually burying the factory (OOM).
  • Bounded Channel: The conveyor belt has a limited length (e.g., 100 parts). When the buffer is full, the machine producing parts stops. The machine waits until space is available. This keeps the factory floor clear and efficient.

// Conceptual definition of a bounded channel for AI tokens
var options = new BoundedChannelOptions(capacity: 100)
{
    FullMode = BoundedChannelFullMode.Wait // The critical backpressure setting
};
var channel = Channel.CreateBounded<string>(options);

// Producer (AI) logic
await channel.Writer.WriteAsync(token);

// Consumer (UI) logic
while (await channel.Reader.WaitToReadAsync())
{
    while (channel.Reader.TryRead(out var token))
    {
        // Render token
    }
}

Decoupling Generation from Consumption

The primary architectural goal in handling backpressure is decoupling. We must decouple the generation rate of the AI from the consumption rate of the UI.

In a naive implementation, the AI generation runs directly on the UI thread (or a request thread). If the AI is slow (e.g., waiting for a GPU), the UI freezes. If the AI is fast, the UI floods.

The Decoupled Architecture:

  1. Generation Thread: A background thread (or thread-pool operation) running the LLM inference. It pushes tokens into a buffer (Channel).
  2. Buffer: A bounded memory queue holding tokens waiting to be rendered.
  3. Render Thread: The UI thread (or a dedicated render loop) pulls from the buffer at its own pace.

This architecture ensures that the UI never receives more data than it can handle in a single frame cycle. If the AI generates 1000 tokens in 1 second, but the UI can only render 60 tokens per second, the buffer fills to its capacity (e.g., 100 tokens). The AI generation task is then suspended (backpressure). The UI drains the buffer at 60 tokens/sec, and each token removed frees a slot that lets the AI resume.

Streaming LLM Responses: The Chunking Problem

LLMs do not output a complete response instantly. They output a stream of tokens. In HTTP terms, this is a Transfer-Encoding: chunked response.

When handling this in C#, we often deal with System.Net.Http.HttpResponseMessage. The Content is a Stream. Reading this stream asynchronously is the first line of defense against backpressure.

The ReadAsync Buffer: When reading from an HttpResponseStream (network stream), we read into a fixed-size byte buffer (e.g., 4KB).

byte[] buffer = new byte[4096];
int bytesRead = await responseStream.ReadAsync(buffer, 0, buffer.Length);

This is a form of physical backpressure. The network card fills the OS receive buffer, and our code drains it. If the application reads more slowly than the network delivers, the TCP receive window shrinks and the sender slows down. This happens at the OS level.

However, once the bytes are in our application memory, we must decode them (UTF-8 to String) and then parse them (String to Token Objects). This is where application-level backpressure is required.
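The decoding step has a subtle trap worth demonstrating: a multi-byte UTF-8 character can straddle two ReadAsync chunks. A minimal sketch (the "café tokens" payload and split point are contrived to force the problem) shows why a stateful System.Text.Decoder is required rather than decoding each chunk independently:

```csharp
using System;
using System.Diagnostics;
using System.Text;

class ChunkDecodeDemo
{
    static void Main()
    {
        // "é" is two UTF-8 bytes; split the payload in the middle of it.
        byte[] payload = Encoding.UTF8.GetBytes("café tokens");
        byte[] chunk1 = payload[..4];   // ends after the first byte of 'é'
        byte[] chunk2 = payload[4..];

        // Wrong: decoding chunks independently corrupts the split character.
        string naive = Encoding.UTF8.GetString(chunk1) + Encoding.UTF8.GetString(chunk2);
        Trace.Assert(naive != "café tokens");   // contains U+FFFD replacement chars

        // Right: a stateful Decoder carries the pending byte across reads.
        Decoder decoder = Encoding.UTF8.GetDecoder();
        var sb = new StringBuilder();
        foreach (var chunk in new[] { chunk1, chunk2 })
        {
            char[] chars = new char[decoder.GetCharCount(chunk, 0, chunk.Length)];
            decoder.GetChars(chunk, 0, chunk.Length, chars, 0);
            sb.Append(chars);
        }
        Trace.Assert(sb.ToString() == "café tokens");
        Console.WriteLine(sb);
    }
}
```

In a real token pipeline the same Decoder instance lives for the duration of the HTTP response, so every 4KB read can be decoded as it arrives without waiting for character boundaries.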

Advanced Backpressure Strategies

1. Rate Limiting (Token Bucket Algorithm)

Sometimes, even if the UI could render faster, we want to artificially limit the rate to prevent overwhelming the user or to comply with API rate limits.

The Token Bucket algorithm is ideal here. Imagine a bucket that holds a maximum number of tokens. It refills at a specific rate (e.g., 10 tokens per second). Every time the AI generates a token, we remove one from the bucket. If the bucket is empty, the AI pauses.

This is distinct from channel buffering. Buffering handles bursts; Rate Limiting smooths out the flow.
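The bucket mechanics can be sketched in a few lines. This is a hand-rolled, deterministic version for clarity (the class name, capacity, and manual Refill call are illustrative; in .NET 7+ the System.Threading.RateLimiting package offers a production-grade TokenBucketRateLimiter driven by a real clock):

```csharp
using System;
using System.Diagnostics;

// A minimal token bucket: capacity caps bursts; Refill models the steady
// replenishment rate. Refill is called manually here to keep the demo
// deterministic; real code would call it from a timer.
class TokenBucket
{
    private readonly int _capacity;
    private double _tokens;

    public TokenBucket(int capacity) { _capacity = capacity; _tokens = capacity; }

    public void Refill(double tokens) =>
        _tokens = Math.Min(_capacity, _tokens + tokens);

    // Returns true if the caller may emit one item right now.
    public bool TryTake()
    {
        if (_tokens < 1) return false;
        _tokens -= 1;
        return true;
    }
}

class Program
{
    static void Main()
    {
        var bucket = new TokenBucket(capacity: 3);

        // A burst of 3 is allowed; the 4th emission is refused...
        Trace.Assert(bucket.TryTake() && bucket.TryTake() && bucket.TryTake());
        Trace.Assert(!bucket.TryTake());

        // ...until the refill interval restores a token.
        bucket.Refill(1);
        Trace.Assert(bucket.TryTake());
        Console.WriteLine("Token bucket behaves as expected.");
    }
}
```

The producer loop simply awaits until TryTake succeeds before yielding the next token, which smooths sustained output to the refill rate while still permitting short bursts up to the capacity.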

2. Virtualization (UI Level)

Even with perfect backpressure management, rendering thousands of tokens in the DOM is expensive. If the AI generates a massive code block, the UI thread will hang during DOM reconciliation.

UI Virtualization is the practice of rendering only the items currently visible in the viewport.

  • Concept: The UI maintains a "window" of rendered elements. As the user scrolls, elements outside the window are removed from the DOM, and new elements are added.
  • Backpressure Connection: Virtualization acts as a "read head." It only pulls data from the IAsyncEnumerable or Channel when the scroll position demands it. This is the ultimate lazy loading.
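The window concept reduces to a slice operation over the full transcript. A minimal sketch (the VirtualizedLog class and its Render signature are invented for illustration; frameworks such as Blazor's Virtualize component do this bookkeeping for you):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

// The full transcript lives in memory (cheap strings), but only the lines
// inside the scroll window are "rendered" (the expensive DOM/visual-tree work).
class VirtualizedLog
{
    private readonly List<string> _lines = new();
    public int RenderedCount { get; private set; }

    public void Append(string line) => _lines.Add(line);

    // Materialize only the slice the viewport can show.
    public IReadOnlyList<string> Render(int firstVisible, int viewportSize)
    {
        var visible = _lines.Skip(firstVisible).Take(viewportSize).ToList();
        RenderedCount = visible.Count;
        return visible;
    }
}

class Program
{
    static void Main()
    {
        var log = new VirtualizedLog();
        for (int i = 0; i < 10_000; i++) log.Append($"line {i}");

        var visible = log.Render(firstVisible: 500, viewportSize: 40);
        Trace.Assert(visible.Count == 40);       // only 40 of 10,000 rendered
        Trace.Assert(visible[0] == "line 500");
        Console.WriteLine($"Rendered {log.RenderedCount} of 10000 lines.");
    }
}
```

Appending a streamed token is O(1); the render cost is bounded by the viewport size, not the transcript length, which is what keeps long AI responses from hanging the UI thread.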

3. Reactive Extensions (Rx.NET)

For complex pipelines, System.Reactive (Rx) offers powerful backpressure operators.

  • Buffer: Groups tokens into batches (e.g., render every 10 tokens or every 100ms).
  • Throttle: Emits a value only after a specified timespan has passed without another value (useful for autocomplete UIs).
  • Sample: Emits the most recent value at regular intervals (useful for updating a progress bar without flooding the UI thread).
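The Buffer operator's effect (many tokens per UI update) can be approximated without the System.Reactive dependency by draining whatever has accumulated in a channel before each render. This is a sketch of the batching idea, not Rx itself; token names and delays are illustrative:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Channels;
using System.Threading.Tasks;

// Drain-all batching: one UI update covers every token that arrived since
// the previous update, so a fast producer cannot force one update per token.
class BatchingDemo
{
    static async Task Main()
    {
        var channel = Channel.CreateUnbounded<string>();

        // Producer: 12 tokens, written as fast as possible.
        var producer = Task.Run(async () =>
        {
            for (int i = 0; i < 12; i++)
                await channel.Writer.WriteAsync($"tok{i} ");
            channel.Writer.Complete();
        });

        int rendered = 0, updates = 0;
        while (await channel.Reader.WaitToReadAsync())
        {
            // Collect everything currently buffered into a single batch.
            var batch = new List<string>();
            while (channel.Reader.TryRead(out var token))
                batch.Add(token);

            updates++;                 // one UI update per batch
            rendered += batch.Count;
            Console.Write(string.Concat(batch));
            await Task.Delay(20);      // simulated render cost per update
        }

        await producer;
        Trace.Assert(rendered == 12);  // no token is lost
        Trace.Assert(updates <= 12);   // never more than one update per token
        Console.WriteLine($"\n{rendered} tokens in {updates} UI updates.");
    }
}
```

With real Rx, Observable.Buffer(TimeSpan.FromMilliseconds(100)) expresses the same idea declaratively; the channel version shown here trades that expressiveness for zero dependencies.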

Visualizing the Data Flow

The following diagram illustrates the flow of data through a decoupled system using a bounded channel. Note the feedback loop where the "Full" state of the channel signals the generator to pause.

A diagram illustrating a decoupled system where a bounded channel acts as a buffer between a generator and a consumer, featuring a feedback loop that signals the generator to pause when the channel is full.

Architectural Implications in .NET

When building AI applications in .NET (e.g., using Microsoft.SemanticKernel or custom OnnxRuntime bindings), the choice of synchronization context is vital.

  1. ConfigureAwait(false): When awaiting the AI generation or reading from channels in library code, use .ConfigureAwait(false). This prevents capturing the SynchronizationContext (the UI thread context). Capturing it pushes every continuation onto the UI thread, and if any caller synchronously blocks on the task (.Result or .Wait()), it can deadlock the UI.
  2. ValueTask vs Task: For high-frequency token generation (millions of tokens per minute), allocating a Task object for every token creates severe GC pressure. IAsyncEnumerable often leverages ValueTask<T>, a struct that avoids a heap allocation when the result is available synchronously. This reduces GC pressure significantly.
  3. Thread Pool Management: Long-running AI inference should not block the thread pool. In .NET, we use Task.Run or Task.Factory.StartNew with TaskCreationOptions.LongRunning for synchronous, CPU-bound work. This hints to the scheduler to dedicate a thread, preventing thread starvation for other requests.
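Point 3 can be verified directly: LongRunning yields a dedicated thread rather than a pool thread. A minimal sketch (the comment about the inference loop is a placeholder; note this only helps synchronous delegates, since an async delegate releases the dedicated thread at its first await):

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class LongRunningDemo
{
    static void Main()
    {
        // LongRunning hints the scheduler to spin up a dedicated thread
        // instead of borrowing one from the pool.
        var inference = Task.Factory.StartNew(() =>
        {
            Trace.Assert(!Thread.CurrentThread.IsThreadPoolThread);
            // ...a synchronous, CPU-bound token-generation loop would run here,
            // publishing tokens into a Channel<string> for the UI to consume...
        }, TaskCreationOptions.LongRunning);

        inference.GetAwaiter().GetResult();
        Console.WriteLine("Inference ran on a dedicated (non-pool) thread.");
    }
}
```

Keeping inference off the pool means a minutes-long generation cannot starve the threads that serve other requests or render the UI.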

Key Takeaways

  • Backpressure is the resistance of a stream consumer to the flow of data, causing the producer to slow down.
  • Pull Models (IAsyncEnumerable) naturally handle backpressure by letting the consumer dictate the pace.
  • Push Models (Channels) require explicit bounded capacities to prevent memory exhaustion.
  • Decoupling via buffers allows the fast AI producer and slow UI consumer to operate independently without blocking each other.
  • UI Virtualization is the final layer of backpressure, ensuring that only necessary data is rendered to the screen.

By mastering these theoretical foundations, we ensure that our AI applications remain responsive, memory-efficient, and scalable, regardless of the speed of the underlying model.

Basic Code Example

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Channels;

// Simulating an AI model that generates text tokens rapidly
public class FastAiModel
{
    private readonly Random _random = new();

    // Generates tokens faster than a typical UI can render comfortably
    public async IAsyncEnumerable<string> GenerateTokensAsync(
        string prompt, 
        [System.Runtime.CompilerServices.EnumeratorCancellation] CancellationToken ct = default)
    {
        string[] tokens = ["The", " quick", " brown", " fox", " jumps", " over", " the", " lazy", " dog.", " This", " is", " a", " demonstration", " of", " backpressure."];

        for (int i = 0; i < 100; i++) // Generate 100 tokens rapidly
        {
            ct.ThrowIfCancellationRequested();

            // Simulate variable generation speed (some tokens take longer)
            int delay = _random.Next(10, 50); // 10-50ms per token
            await Task.Delay(delay, ct);

            string token = tokens[_random.Next(tokens.Length)];
            yield return token;
        }
    }
}

// The UI layer that renders tokens (simulated as slow)
public class SlowUiRenderer
{
    public async Task RenderTokenAsync(string token)
    {
        // Simulate UI rendering latency (e.g., DOM updates, layout calculations)
        // This is intentionally slow to demonstrate backpressure
        await Task.Delay(100); // 100ms per token (slower than AI generation)

        // In a real UI, this would update the DOM, Canvas, etc.
        Console.Write(token);
    }
}

// The backpressure manager using a bounded Channel (Producer-Consumer pattern)
public class BackpressureManager
{
    private readonly Channel<string> _buffer;
    private readonly FastAiModel _aiModel;
    private readonly SlowUiRenderer _uiRenderer;
    private readonly CancellationTokenSource _cts;

    public BackpressureManager(int bufferSize = 5)
    {
        // Create a bounded channel with a capacity of 5 tokens.
        // The buffer can never grow beyond bufferSize, capping memory usage.
        _buffer = Channel.CreateBounded<string>(new BoundedChannelOptions(bufferSize)
        {
            // DropOldest: when full, evict the oldest item to make room for new ones.
            // Useful for real-time streaming where recent data matters most.
            // (Use BoundedChannelFullMode.Wait to suspend the producer instead.)
            FullMode = BoundedChannelFullMode.DropOldest,

            // SingleReader/SingleWriter optimizations for performance
            SingleReader = true,
            SingleWriter = true
        });

        _aiModel = new FastAiModel();
        _uiRenderer = new SlowUiRenderer();
        _cts = new CancellationTokenSource();
    }

    // Producer: AI generates tokens and writes to the buffer
    private async Task ProduceAsync()
    {
        try
        {
            await foreach (var token in _aiModel.GenerateTokensAsync("Hello", _cts.Token))
            {
                // With FullMode.Wait, WriteAsync would suspend here when the channel
                // is full (true backpressure). With DropOldest (configured above),
                // it evicts the oldest buffered token instead and never waits.
                await _buffer.Writer.WriteAsync(token, _cts.Token);

                // Log buffer status for demonstration
                Console.WriteLine($"\n[Producer] Wrote '{token}'. Buffer count: {_buffer.Reader.Count}");
            }

            // Signal that no more data will be written
            _buffer.Writer.Complete();
        }
        catch (OperationCanceledException ex)
        {
            // Propagate the cancellation to the reader instead of allocating a new exception
            _buffer.Writer.Complete(ex);
        }
    }

    // Consumer: UI reads from buffer and renders
    private async Task ConsumeAsync()
    {
        try
        {
            // ReadAsync will return immediately if data is available, 
            // or wait if the channel is empty (waiting for producer)
            await foreach (var token in _buffer.Reader.ReadAllAsync(_cts.Token))
            {
                await _uiRenderer.RenderTokenAsync(token);

                // Log buffer status for demonstration
                Console.WriteLine($" [Consumer] Rendered '{token}'. Buffer count: {_buffer.Reader.Count}");
            }
        }
        catch (OperationCanceledException)
        {
            // Handle cancellation gracefully
        }
    }

    // Start both producer and consumer concurrently
    public async Task RunAsync()
    {
        Console.WriteLine("Starting Backpressure Demo...");
        Console.WriteLine($"Buffer capacity: 5 tokens");
        Console.WriteLine($"AI generation: ~10-50ms per token");
        Console.WriteLine($"UI rendering: 100ms per token");
        Console.WriteLine("================================\n");

        // Run producer and consumer in parallel
        var producerTask = ProduceAsync();
        var consumerTask = ConsumeAsync();

        // Wait for both to complete
        await Task.WhenAll(producerTask, consumerTask);

        Console.WriteLine("\n\nDemo completed.");
    }

    // Graceful shutdown
    public void Stop()
    {
        _cts.Cancel();
        _buffer.Writer.Complete();
    }
}

// Main program entry point
public class Program
{
    public static async Task Main(string[] args)
    {
        // Create the backpressure manager with a buffer of 5 tokens
        var manager = new BackpressureManager(bufferSize: 5);

        // Run the demo
        await manager.RunAsync();

        // Wait for user input to exit
        Console.WriteLine("\nPress any key to exit...");
        Console.ReadKey();
    }
}

Detailed Line-by-Line Explanation

1. Using Directives

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Channels;

  • System: Basic .NET types and console I/O.
  • System.Collections.Concurrent: Thread-safe collections (not used directly here, but included for context).
  • System.Collections.Generic: Generic collections like IEnumerable<T>.
  • System.Threading: Threading primitives like CancellationToken.
  • System.Threading.Tasks: Task-based asynchronous programming (TAP).
  • System.Threading.Channels: CRITICAL: Provides Channel<T>, a modern async/await alternative to BlockingCollection<T>. Channels are optimized for high-performance producer-consumer scenarios and are the cornerstone of this backpressure implementation.

2. FastAiModel Class

public class FastAiModel
{
    private readonly Random _random = new();

    public async IAsyncEnumerable<string> GenerateTokensAsync(
        string prompt, 
        [System.Runtime.CompilerServices.EnumeratorCancellation] CancellationToken ct = default)
    {
        string[] tokens = ["The", " quick", " brown", " fox", " jumps", " over", " the", " lazy", " dog.", " This", " is", " a", " demonstration", " of", " backpressure."];

        for (int i = 0; i < 100; i++) // Generate 100 tokens rapidly
        {
            ct.ThrowIfCancellationRequested();

            int delay = _random.Next(10, 50); // 10-50ms per token
            await Task.Delay(delay, ct);

            string token = tokens[_random.Next(tokens.Length)];
            yield return token;
        }
    }
}

  • Random _random: Introduces variability in generation speed to simulate real-world conditions.
  • IAsyncEnumerable<string>: MODERN C# FEATURE (C# 8.0+). Allows streaming data asynchronously. This is perfect for LLM responses which arrive token-by-token.
  • [EnumeratorCancellation]: MODERN C# FEATURE. Allows the CancellationToken to be passed to the async enumerator, enabling cancellation of the entire stream.
  • await Task.Delay(...): Simulates the time it takes for the AI model to compute a token. The delay is short (10-50ms), making the AI "fast".
  • yield return token: Streams each token immediately as it's generated, without buffering the entire result in memory.

3. SlowUiRenderer Class

public class SlowUiRenderer
{
    public async Task RenderTokenAsync(string token)
    {
        await Task.Delay(100); // 100ms per token
        Console.Write(token);
    }
}

  • await Task.Delay(100): SIMULATION OF SLOW UI. This represents the overhead of rendering in a UI framework (e.g., React re-rendering, DOM updates, layout calculations). At 100ms per token, the UI is 2-10x slower than the AI generation.
  • Console.Write(token): In a real application, this would be replaced with element.textContent += token or similar UI update logic.

4. BackpressureManager Class (Core Logic)

public class BackpressureManager
{
    private readonly Channel<string> _buffer;
    private readonly FastAiModel _aiModel;
    private readonly SlowUiRenderer _uiRenderer;
    private readonly CancellationTokenSource _cts;

    public BackpressureManager(int bufferSize = 5)
    {
        _buffer = Channel.CreateBounded<string>(new BoundedChannelOptions(bufferSize)
        {
            FullMode = BoundedChannelFullMode.DropOldest,
            SingleReader = true,
            SingleWriter = true
        });
        // ... rest of constructor
    }
}

  • Channel<string> _buffer: The heart of the system. A bounded channel acts as a queue with a fixed capacity.
  • BoundedChannelOptions: Configures the channel's behavior.
  • bufferSize = 5: The channel can hold at most 5 tokens. This is a small buffer to clearly demonstrate backpressure.
  • FullMode = DropOldest: CRITICAL CHOICE. When the buffer is full and the AI tries to write, the oldest token is dropped to make room for the new one. This is useful for real-time streaming where you care more about the latest data than every single token. Alternatives: Wait (suspends the producer until space frees), DropNewest (drops the newest buffered item), or DropWrite (discards the incoming token).
  • SingleReader/SingleWriter: Optimization hints. If you know only one task will read/write, this improves performance by reducing synchronization overhead.

5. Producer Method (ProduceAsync)

private async Task ProduceAsync()
{
    try
    {
        await foreach (var token in _aiModel.GenerateTokensAsync("Hello", _cts.Token))
        {
            await _buffer.Writer.WriteAsync(token, _cts.Token);
            Console.WriteLine($"\n[Producer] Wrote '{token}'. Buffer count: {_buffer.Reader.Count}");
        }
        _buffer.Writer.Complete();
    }
    catch (OperationCanceledException) { /* ... */ }
}

  • await foreach: Consumes the IAsyncEnumerable from the AI model. This loop runs as fast as the AI generates tokens.
  • await _buffer.Writer.WriteAsync(...): BACKPRESSURE APPLIED HERE.
  • If the buffer has space, the token is written immediately.
  • If the buffer is full (5 tokens), WriteAsync suspends until space becomes available (with FullMode = Wait) or the channel drops data (with DropOldest). In our configuration, the oldest token is evicted and the new one is written immediately.
  • _buffer.Reader.Count: Shows the current number of items in the buffer for debugging/monitoring.
  • _buffer.Writer.Complete(): Signals to the consumer that no more data will be written. This allows the consumer's ReadAllAsync loop to exit gracefully.

6. Consumer Method (ConsumeAsync)

private async Task ConsumeAsync()
{
    try
    {
        await foreach (var token in _buffer.Reader.ReadAllAsync(_cts.Token))
        {
            await _uiRenderer.RenderTokenAsync(token);
            Console.WriteLine($" [Consumer] Rendered '{token}'. Buffer count: {_buffer.Reader.Count}");
        }
    }
    catch (OperationCanceledException) { /* ... */ }
}

  • _buffer.Reader.ReadAllAsync(): MODERN .NET API (.NET Core 3.0+). Returns an IAsyncEnumerable<string> that yields items as they become available. It automatically waits when the buffer is empty and stops when the writer is completed.
  • await _uiRenderer.RenderTokenAsync(token): The UI consumes the token. Because the UI is slow (100ms), the buffer will naturally fill up if the AI is fast, triggering backpressure.

7. RunAsync Method

public async Task RunAsync()
{
    var producerTask = ProduceAsync();
    var consumerTask = ConsumeAsync();
    await Task.WhenAll(producerTask, consumerTask);
}

  • Task.WhenAll: Runs the producer and consumer concurrently. This is crucial because the producer and consumer need to run in parallel for the buffer to be effective. If we ran them sequentially, the buffer would be pointless.

Common Pitfalls

  1. Unbounded Channels (The Memory Leak Trap)
     • Mistake: Using Channel.CreateUnbounded<string>() without a backpressure strategy.
     • Consequence: If the UI crashes or is slow, the AI continues generating tokens and the channel grows infinitely, consuming all available RAM. Eventually, the application crashes with an OutOfMemoryException.
     • Solution: Always use Channel.CreateBounded for streaming data. Set a reasonable buffer size based on available memory and expected throughput.

  2. Ignoring DropOldest vs. Wait Semantics
     • Mistake: Using FullMode = Wait for real-time chat applications.
     • Consequence: The AI model's generation will be throttled to the UI's rendering speed. If the UI lags for 5 seconds, the AI will pause for 5 seconds, creating a jarring user experience.
     • Solution: For real-time streaming (like a chatbot), use DropOldest. This ensures the user sees the most recent, relevant tokens even if some intermediate tokens are lost. For non-real-time tasks (like generating a report), use Wait to ensure no data is lost.

  3. Forgetting Complete()
     • Mistake: Not calling _buffer.Writer.Complete() after the producer finishes.
     • Consequence: The consumer's ReadAllAsync loop will wait forever for more data, causing a deadlock where the program never exits.
     • Solution: Always call Complete() on the writer when done. Also, handle ChannelClosedException gracefully in the consumer.

  4. Blocking the UI Thread
     • Mistake: Running ConsumeAsync on the UI thread without proper async/await.
     • Consequence: Even though we use async/await, if the UI framework's synchronization context is blocked (e.g., by calling .Result or .Wait()), the UI will freeze.
     • Solution: Use async void only for event handlers (in UI frameworks) and ConfigureAwait(false) in library code to avoid deadlocks. Ensure the UI rendering loop is truly asynchronous.

  5. Buffer Size Too Small
     • Mistake: Setting bufferSize = 1 for a high-throughput AI.
     • Consequence: Extreme backpressure. The AI will be constantly blocked waiting for the UI, negating the benefits of async streaming.
     • Solution: Profile your application. Start with a buffer size of 10-20 tokens and adjust based on memory usage and latency requirements.

Visualizing the Data Flow

This diagram visualizes the iterative data flow of an AI application, showing how profiling memory usage and latency guides the adjustment of buffer sizes to find the optimal performance configuration.

Diagram Explanation:

  1. AI (Producer): Generates tokens rapidly. It writes to the buffer using WriteAsync.
  2. Buffer (Bounded Channel): Acts as a shock absorber. With a capacity of 5, it can hold 5 tokens before applying backpressure.
  3. UI (Consumer): Reads from the buffer using ReadAllAsync and renders slowly.
  4. Backpressure Signal: When the buffer is full, the WriteAsync call waits (or drops data), effectively slowing down the AI's effective output rate to match the UI's consumption rate. This prevents memory overflow and UI freezing.

Real-World Context: A Chat Application

Imagine a chat application where an AI assistant responds to a user. The AI model (e.g., GPT-4) generates text at ~50 tokens/second. The user's browser, however, can only render ~10 tokens/second due to DOM updates and CSS calculations.

Without Backpressure:

  • The AI sends 50 tokens/second to the browser.
  • The browser's JavaScript event queue is flooded.
  • The UI becomes unresponsive, keystrokes lag, and eventually, the browser tab crashes due to memory exhaustion.

With Backpressure (using this code):

  • The AI generates tokens and pushes them into a bounded channel (buffer).
  • The UI consumes tokens from the channel at its own pace (10/second).
  • If the UI is slow, the buffer fills up. With FullMode.Wait, the AI's WriteAsync call suspends, throttling the generation rate.
  • If the buffer is configured to DropOldest, the user sees the most recent, relevant tokens without lag, even if some intermediate tokens are skipped.
  • The application remains responsive and memory usage stays constant.

This pattern is essential for building robust, high-performance AI applications that handle real-time streaming data.

The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.





Code License: All code examples are released under the MIT License. Github repo.

Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.

All textual explanations, original diagrams, and illustrations are the intellectual property of the author. To support the maintenance of this site via AdSense, please read this content exclusively online. Copying, redistribution, or reproduction is strictly prohibited.