Chapter 8: Streaming LLM Tokens - Implementing the 'Typewriter Effect'
Theoretical Foundations
The fundamental limitation of traditional request-response models in AI applications is perceived latency. When a user sends a prompt to an LLM, the model generates a response token by token. In a standard synchronous call, the client waits for the entire sequence to be generated, serialized, and transmitted before the UI updates at all. This creates a "dead air" period in which the user stares at a loading spinner and perceives the system as slow or unresponsive, even if the total generation time is identical to a streaming approach.
Streaming LLM responses, often visualized as the "Typewriter Effect," solves this by treating the LLM's output not as a monolithic string, but as an asynchronous sequence of discrete tokens (words, sub-words, or characters) arriving over time. This architectural shift transforms the user experience from a binary "waiting" state to an active "watching" state, significantly improving perceived performance.
The Mechanics of Token Streams
At the protocol level, most LLM APIs (like OpenAI or Anthropic) utilize Server-Sent Events (SSE). Unlike WebSockets, which are bidirectional, SSE is a unidirectional protocol where the server pushes data to the client over a persistent HTTP connection. The data is framed as distinct events, typically delimited by double newlines.
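For illustration, an OpenAI-style SSE stream might look roughly like the following on the wire (the JSON shape is representative of such APIs, not an exact contract; each event is separated from the next by a blank line):

```text
data: {"choices":[{"delta":{"content":"The"}}]}

data: {"choices":[{"delta":{"content":" weather"}}]}

data: [DONE]
```

The client's job is to split the byte stream on these blank-line boundaries, strip the `data:` prefix, and parse each payload individually.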
In the context of C#, this means we are not dealing with a standard HttpResponseMessage that returns a complete body. Instead, we are dealing with a continuous stream of bytes that must be parsed incrementally. The HttpClient provides the GetStreamAsync method, which returns a Stream that can be read asynchronously. However, raw streams are byte-oriented; we need to layer a text decoder on top to handle multi-byte UTF-8 characters that might be split across TCP packets.
Consider the analogy of a live radio broadcast. In a traditional HTTP request (like downloading a podcast), you must wait for the entire audio file to finish downloading before you can listen. In an SSE stream (like the radio), the audio arrives continuously. You can start listening (processing) immediately, even though the broadcast hasn't finished. The challenge in C# is that the "radio signal" might be noisy or fragmented, requiring a buffer to assemble coherent "words" (tokens) from the incoming signal.
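Putting these pieces together, a minimal sketch of consuming such a stream with HttpClient might look like the following. The endpoint URL, request body, and `data:` framing are illustrative placeholders, not a real provider's API; a production client would also need authentication headers and the provider's actual request schema.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

public static class SseClientSketch
{
    // Reads an SSE response line by line and prints each "data:" payload.
    public static async Task ReadEventsAsync(HttpClient httpClient, CancellationToken ct)
    {
        using var request = new HttpRequestMessage(HttpMethod.Post, "https://api.example.com/v1/chat")
        {
            // Hypothetical request body; real providers define their own schema.
            Content = new StringContent("{\"prompt\":\"Hello\",\"stream\":true}", Encoding.UTF8, "application/json")
        };

        // ResponseHeadersRead is the crucial flag: without it, HttpClient
        // buffers the entire response body before returning control.
        using var response = await httpClient.SendAsync(
            request, HttpCompletionOption.ResponseHeadersRead, ct);
        response.EnsureSuccessStatusCode();

        await using var stream = await response.Content.ReadAsStreamAsync(ct);
        // StreamReader layers a UTF-8 decoder over the raw bytes, so multi-byte
        // characters split across TCP packets are reassembled correctly.
        using var reader = new StreamReader(stream, Encoding.UTF8);

        string? line;
        while ((line = await reader.ReadLineAsync()) is not null)
        {
            ct.ThrowIfCancellationRequested();
            if (line.StartsWith("data: ") && line != "data: [DONE]")
            {
                // Each event's payload would be parsed as JSON here.
                Console.WriteLine(line["data: ".Length..]);
            }
        }
    }
}
```

The HttpCompletionOption.ResponseHeadersRead argument is what makes streaming possible at all: it tells HttpClient to hand back the response as soon as the headers arrive, leaving the body as a live stream.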
Asynchronous Iteration and IAsyncEnumerable<T>
C# 8.0 introduced IAsyncEnumerable<T>, a pivotal feature for handling streams of data without blocking the calling thread. This interface is the asynchronous counterpart to IEnumerable<T>. While IEnumerable<T> represents a sequence that can be enumerated synchronously, IAsyncEnumerable<T> represents a sequence that is produced asynchronously, requiring the await foreach loop to consume it.
In the context of AI pipelines, IAsyncEnumerable<string> is the ideal abstraction for a token stream. It allows the application to yield control back to the event loop while waiting for the next token to arrive from the LLM.
Why is this critical for AI applications?
In a desktop or web application, the UI thread is responsible for rendering the interface. If we block this thread waiting for a token, the application freezes. By using IAsyncEnumerable, we can await the next token without blocking the thread. The thread is free to handle other events (like button clicks or scrolling) while the network request progresses in the background.
The relationship between the raw network stream and IAsyncEnumerable forms a pipeline: raw bytes arrive from the socket, a text decoder turns them into characters, a framing layer splits them into discrete events, and an async iterator yields each event's payload as a token.
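As a sketch, the final stage of that pipeline can be written as an async iterator that adapts a line-oriented TextReader (the decoded network stream) into an IAsyncEnumerable<string>. The `data:` framing here mirrors SSE and is an assumption, not a prescribed format:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

public static class TokenPipeline
{
    // Adapts a line-oriented text reader into an async sequence of tokens.
    public static async IAsyncEnumerable<string> ReadTokensAsync(
        TextReader reader,
        [EnumeratorCancellation] CancellationToken ct = default)
    {
        string? line;
        while ((line = await reader.ReadLineAsync()) is not null)
        {
            ct.ThrowIfCancellationRequested();
            if (line.StartsWith("data: ") && line != "data: [DONE]")
            {
                // Each framed event's payload is yielded as one token.
                yield return line["data: ".Length..];
            }
        }
    }
}
```

A consumer then simply writes `await foreach (var token in TokenPipeline.ReadTokensAsync(reader)) { ... }` — the loop suspends, without blocking a thread, until the next line arrives.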
Handling Partial JSON Fragments
A common complexity arises when the LLM is instructed to return structured data (e.g., JSON). In a streaming context, a single JSON object is rarely sent in one event. Instead, the stream might look like this:
- Event 1: {"content": "The weather
- Event 2:  in New York is
- Event 3:  sunny."}
If we attempt to deserialize each event individually as a complete JSON object, we will fail because the fragments are invalid JSON on their own.
The Solution: Incremental Parsing
We must maintain a buffer that accumulates the incoming text fragments until a valid, parseable object is complete. This requires a stateful parser. In C#, System.Text.Json's Utf8JsonReader type is designed for high-performance, low-allocation parsing of JSON streams. However, for the specific case of LLM token streams, we often implement a custom buffer that concatenates strings until a complete object boundary (such as the matching closing }) is found.
Analogy: The Jigsaw Puzzle Imagine receiving a jigsaw puzzle one piece at a time via mail. You cannot assemble the picture until you have enough pieces to form a recognizable section. Similarly, we cannot parse the JSON object until we have received enough tokens (pieces) to form a complete syntax structure. We keep the pieces in a box (buffer) and only attempt to assemble (parse) them when we detect a boundary (a complete JSON object).
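A minimal (and deliberately naive) version of such a buffer simply accumulates fragments and attempts a full parse after each append, treating a parse failure as "not complete yet":

```csharp
using System.Text;
using System.Text.Json;

public sealed class JsonFragmentBuffer
{
    private readonly StringBuilder _buffer = new();

    // Appends a fragment and returns a parsed document once the accumulated
    // text forms complete JSON; returns null while it is still incomplete.
    public JsonDocument? Append(string fragment)
    {
        _buffer.Append(fragment);
        try
        {
            var document = JsonDocument.Parse(_buffer.ToString());
            _buffer.Clear();
            return document;
        }
        catch (JsonException)
        {
            // Not a complete JSON object yet; keep buffering.
            return null;
        }
    }
}

// Usage with the three fragments from the example above:
// var buffer = new JsonFragmentBuffer();
// buffer.Append("{\"content\": \"The weather");  // returns null
// buffer.Append(" in New York is");              // returns null
// buffer.Append(" sunny.\"}");                   // returns the parsed document
```

Note the trade-off: re-parsing the whole buffer on every fragment is quadratic in the response length. For large payloads, tracking brace depth outside string literals, or running Utf8JsonReader over the accumulated bytes, avoids the repeated work.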
Architectural Implications for AI Agents
In a previous chapter, we discussed Dependency Injection (DI) and Interfaces for swapping model providers. Streaming introduces a new dimension to this architecture. The interface for an AI model provider must evolve from returning a Task<string> to returning an IAsyncEnumerable<string>.
For example, an IModelProvider interface might look like this:
public interface IModelProvider
{
// Previous non-streaming approach
// Task<string> GenerateAsync(string prompt);
// New streaming approach
IAsyncEnumerable<string> GenerateStreamAsync(string prompt, CancellationToken cancellationToken);
}
This change propagates through the entire application stack. The consuming service (e.g., an AgentService) no longer awaits a single result; it iterates over the stream. This allows for progressive rendering in the UI and early termination. If the user cancels the operation, the CancellationToken propagates down to the network layer, closing the connection immediately rather than waiting for the server to finish.
The "Typewriter Effect" and UI Responsiveness
The "Typewriter Effect" is not merely a cosmetic feature; it is a feedback mechanism. It confirms to the user that the system is working. In high-frequency updates, however, rendering every single token can cause performance bottlenecks in UI frameworks (e.g., excessive re-renders in React or WPF).
Debouncing and Batching
To mitigate this, we often implement a buffering strategy within the IAsyncEnumerable consumer. Instead of updating the UI for every token, we might accumulate tokens in a local buffer and flush them to the UI only when:
- A punctuation mark is encountered (natural pause).
- A specific time interval has elapsed (e.g., 50ms).
- The buffer reaches a certain size (e.g., 10 tokens).
This requires a custom IAsyncEnumerable wrapper that implements this buffering logic, decoupling the raw network speed from the UI update rate.
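As a sketch, such a wrapper might batch on buffer size or elapsed time (both thresholds below are illustrative defaults, not recommendations). One caveat: the time check only runs when a token arrives, so a production version might add a timer to flush a quiet buffer.

```csharp
using System.Collections.Generic;
using System.Diagnostics;
using System.Runtime.CompilerServices;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

public static class StreamBatching
{
    // Wraps a token stream, yielding accumulated batches instead of
    // individual tokens. Flushes when the buffer reaches maxTokens or
    // when maxMilliseconds have elapsed since the last flush.
    public static async IAsyncEnumerable<string> Batch(
        IAsyncEnumerable<string> tokens,
        int maxTokens = 10,
        int maxMilliseconds = 50,
        [EnumeratorCancellation] CancellationToken ct = default)
    {
        var buffer = new StringBuilder();
        int count = 0;
        var sinceFlush = Stopwatch.StartNew();

        await foreach (string token in tokens.WithCancellation(ct))
        {
            buffer.Append(token);
            count++;

            if (count >= maxTokens || sinceFlush.ElapsedMilliseconds >= maxMilliseconds)
            {
                yield return buffer.ToString();
                buffer.Clear();
                count = 0;
                sinceFlush.Restart();
            }
        }

        // Flush whatever remains when the source completes.
        if (buffer.Length > 0)
        {
            yield return buffer.ToString();
        }
    }
}
```

Because the wrapper is itself an IAsyncEnumerable<string>, the UI consumer's await foreach loop is unchanged; only the granularity of updates differs.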
Conclusion
The shift to streaming LLM responses in C# represents a move from imperative programming (do this, then do that) to reactive programming (react to data as it arrives). By leveraging IAsyncEnumerable, HttpClient streams, and robust text parsing, we transform the AI interaction from a monolithic transaction into a fluid conversation. This architecture supports the dynamic, real-time nature of modern AI agents, ensuring that the application remains responsive and the user remains engaged, even during complex, long-running generations.
Basic Code Example
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net.Http;
using System.Runtime.CompilerServices;
using System.Text;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;
namespace StreamingLlmTypewriter
{
// Represents a single token or chunk of text from an LLM response.
public record TokenChunk(string Text, bool IsComplete);
// Simulates an LLM API endpoint that streams tokens.
// In a real scenario, this would be an HttpClient call to an external service.
public class MockLlmApi
{
private static readonly Random _rng = new();
// Simulates a streaming response using an async iterator.
// This mimics the behavior of Server-Sent Events (SSE) or HTTP streaming.
public async IAsyncEnumerable<TokenChunk> GetStreamingResponseAsync(
string prompt,
[EnumeratorCancellation] CancellationToken cancellationToken = default)
{
// 1. Define the response content.
// We are simulating a "Hello World" response from an LLM.
string[] tokens = ["Hello", " ", "World", "!", " This", " is", " a", " streaming", " response", "."];
// 2. Iterate through the tokens and yield them one by one.
foreach (string token in tokens)
{
// Check for cancellation before processing.
cancellationToken.ThrowIfCancellationRequested();
// Simulate network latency (random delay between 50 and 150 ms).
int delay = _rng.Next(50, 151);
await Task.Delay(delay, cancellationToken);
// Yield the token chunk.
// IsComplete is false for intermediate tokens, true for the last one.
bool isComplete = token == tokens.Last();
yield return new TokenChunk(token, isComplete);
}
}
}
// Handles the rendering of tokens to the console.
// This class simulates the UI layer (e.g., a text block in a GUI).
public class TypewriterRenderer
{
// Renders the stream of tokens to the console with a typewriter effect.
public async Task RenderStreamAsync(IAsyncEnumerable<TokenChunk> stream, CancellationToken cancellationToken = default)
{
Console.WriteLine("\n--- Start Typewriter Output ---\n");
// 1. Asynchronously iterate over the stream.
// This is the core mechanism that enables non-blocking consumption of data.
await foreach (var chunk in stream.WithCancellation(cancellationToken))
{
// 2. Write the token to the console immediately.
// In a UI application (WPF, MAUI, Blazor), this would update a TextBlock.
Console.Write(chunk.Text);
// 3. Flush the output buffer to ensure immediate display.
// Crucial for console apps to see real-time updates.
Console.Out.Flush();
}
Console.WriteLine("\n\n--- End Typewriter Output ---\n");
}
}
public class Program
{
public static async Task Main(string[] args)
{
// Setup dependencies.
var api = new MockLlmApi();
var renderer = new TypewriterRenderer();
// Create a cancellation token source to handle graceful shutdown.
using var cts = new CancellationTokenSource();
// Handle Ctrl+C to cancel the stream gracefully.
Console.CancelKeyPress += (sender, e) =>
{
e.Cancel = true; // Prevent immediate process termination.
cts.Cancel(); // Signal cancellation to the async operations.
Console.WriteLine("\nCancellation requested...");
};
try
{
// 1. Get the stream from the API.
// Note: No data is fetched yet; this is just setting up the async iterator.
var tokenStream = api.GetStreamingResponseAsync("Say Hello World", cts.Token);
// 2. Render the stream.
await renderer.RenderStreamAsync(tokenStream, cts.Token);
}
catch (OperationCanceledException)
{
Console.WriteLine("Operation was cancelled.");
}
catch (Exception ex)
{
Console.WriteLine($"An error occurred: {ex.Message}");
}
}
}
}
Detailed Explanation
This code example demonstrates a complete, self-contained simulation of streaming tokens from an LLM (Large Language Model) and rendering them in real-time. This pattern is fundamental for creating responsive AI applications where waiting for a full response would result in a poor user experience.
1. The TokenChunk Record
- Purpose: Defines the data structure for a single piece of data coming from the stream.
- Why a Record? Records are immutable by default in C# 9+. This is ideal for data transfer objects (DTOs) in asynchronous streams because it prevents accidental modification of data during processing, ensuring thread safety.
- Fields:
  - Text: The actual string content (e.g., a word or a punctuation mark).
  - IsComplete: A boolean flag indicating if this is the final chunk in the sequence. This is useful for closing network connections or finalizing UI updates.
2. The MockLlmApi Class (The Producer)
- Purpose: Simulates the behavior of an external LLM API (like OpenAI's GPT or Azure OpenAI) that supports streaming (Server-Sent Events).
- IAsyncEnumerable<TokenChunk>: This is the core of modern asynchronous streams in C#. It allows a method to return a sequence of values asynchronously, meaning it can await data generation (like network delays) without blocking the calling thread.
- [EnumeratorCancellation]: This attribute ensures that the CancellationToken passed to the async iterator is respected when the loop is cancelled externally.
- The Loop: foreach (string token in tokens) { // ... delay ... yield return new TokenChunk(token, isComplete); }
- Simulation: We split a sentence ("Hello World...") into an array of strings to mimic how LLMs output token-by-token.
- Latency Simulation: Task.Delay mimics the real-world network latency. Without this, the stream would finish instantly, hiding the benefits of async streaming.
- yield return: This keyword pauses the method execution, returns the value to the caller, and waits for the caller to request the next value. This is the mechanism that creates the "stream".
3. The TypewriterRenderer Class (The Consumer)
public class TypewriterRenderer
{
public async Task RenderStreamAsync(IAsyncEnumerable<TokenChunk> stream, ...)
{
await foreach (var chunk in stream.WithCancellation(cancellationToken))
{
Console.Write(chunk.Text);
Console.Out.Flush();
}
}
}
- Purpose: Consumes the stream and updates the UI (here, the Console).
- await foreach: This is the syntax for consuming an IAsyncEnumerable. It iterates over the stream asynchronously; the loop pauses at the await foreach line until the next item is available from the producer.
- Console.Out.Flush(): In console applications, output is often buffered. Flush() forces the buffer to write to the screen immediately. Without this, you might see nothing until the entire stream finishes.
- UI Responsiveness: In a GUI application (WPF, WinUI, MAUI), Console.Write would be replaced by Dispatcher.Invoke or DispatcherQueue.TryEnqueue to update a TextBlock on the UI thread. Because the processing happens asynchronously, the UI remains responsive (buttons clickable, scrollable) while tokens arrive.
4. The Program Class (Orchestration)
public static async Task Main(string[] args)
{
// ... setup ...
var tokenStream = api.GetStreamingResponseAsync(...);
await renderer.RenderStreamAsync(tokenStream, ...);
}
- Lazy Execution: Notice that api.GetStreamingResponseAsync is called, but the data isn't fetched immediately. The iteration only begins when await renderer.RenderStreamAsync starts consuming the IAsyncEnumerable. This is a key concept of deferred execution in async streams.
- Cancellation Handling:
  - CancellationTokenSource: Manages the cancellation token.
  - Console.CancelKeyPress: Hooks into the OS signal (Ctrl+C). When triggered, it calls cts.Cancel().
  - Propagation: The token is passed to both the producer (API) and consumer (Renderer). If cancelled, OperationCanceledException is thrown, breaking the loop gracefully.
Common Pitfalls
- Blocking the Stream:
  - Mistake: Performing CPU-intensive work or synchronous I/O (e.g., Thread.Sleep or File.ReadAllText) inside the await foreach loop.
  - Consequence: This blocks the thread processing the stream. If this is the UI thread, the application will freeze (hang) between tokens.
  - Fix: Ensure all operations inside the loop are non-blocking (use await, Task.Delay, async file I/O).
- Forgetting Flush():
  - Mistake: Writing to Console.Out or a StreamWriter without calling Flush().
  - Consequence: The output will be buffered. The user will see nothing for several seconds, and then the entire text will appear instantly, defeating the purpose of the typewriter effect.
  - Fix: Always call Flush() after writing to a stream in a real-time display loop.
- Improper Exception Handling:
  - Mistake: Letting an exception bubble up from the async iterator without handling it in the consumer.
  - Consequence: If the network connection drops (simulated by an exception in the API), the stream stops abruptly. The UI might be left in an inconsistent state (e.g., a "Loading..." indicator that never disappears).
  - Fix: Wrap the await foreach in a try/catch block to handle errors and update the UI accordingly (e.g., show an error message).
- Ignoring Cancellation Tokens:
  - Mistake: Not passing the CancellationToken to Task.Delay or await foreach.
  - Consequence: If a user clicks "Cancel" or closes the window, the background process may continue running and consuming resources (memory, network) unnecessarily, leading to memory leaks or zombie processes.
Visualizing the Data Flow
The following diagram illustrates the flow of data from the LLM API to the UI renderer.
The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.
Code License: All code examples are released under the MIT License. Github repo.
Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.