
Chapter 9: Streaming Inference to the Console

Theoretical Foundations

The core challenge in making a local Large Language Model (LLM) feel "alive" is overcoming the latency of the generation process. When a user sends a prompt to a local model running via ONNX Runtime, the model does not simply return a finished paragraph. Instead, it performs a massive matrix multiplication operation to predict the very next token, returns that token, feeds it back into itself as input for the next step, and repeats this cycle hundreds of times. If we were to wait for this entire cycle to complete before showing any output to the user, the application would appear frozen for seconds or even minutes. This creates a jarring, unnatural user experience.

To solve this, we must treat the LLM not as a static function that returns a result, but as a continuous source of data—a stream. In C#, the premier mechanism for handling asynchronous data streams is the IAsyncEnumerable<T> interface. This interface allows us to yield results as they become available, creating a bridge between the heavy computational world of ONNX Runtime and the responsive, interactive world of the user interface.

The "Assembly Line" Analogy

Imagine a highly skilled chef (the LLM) preparing a complex, multi-course meal (the generated text). The chef works in a kitchen isolated from the dining room (the user interface).

  1. The Synchronous Approach (The "All-at-Once" Meal): The waiter takes the order and waits at the kitchen door. The chef cooks the entire seven-course meal, plates it, and only then hands it to the waiter. The waiter walks out and serves the entire meal at once. The diner is left waiting for a very long time with an empty table, unsure if anything is happening.
  2. The Asynchronous Streaming Approach (The "Course-by-Course" Meal): The waiter takes the order and immediately goes to the kitchen door. As soon as the chef finishes the appetizer (the first few tokens), the waiter takes it and serves it to the diner. While the diner is enjoying the appetizer, the chef is already working on the soup. The waiter keeps checking back, taking each dish the moment it's ready. The diner gets immediate feedback, the meal feels engaging, and the kitchen's workflow is continuous and efficient.

In this analogy:

  • The Chef is the ONNX Runtime InferenceSession.
  • The Kitchen is the background processing thread.
  • The Waiter is the IAsyncEnumerable pipeline.
  • The Dishes are the individual tokens.
  • The Diner is the Console or UI.

Our goal is to build the "waiter" and the "kitchen workflow" in C# so that the diner never has to wait for the entire meal to be finished.

The Foundation: IAsyncEnumerable<T>

Introduced in C# 8.0, IAsyncEnumerable<T> is the asynchronous counterpart to IEnumerable<T>. While IEnumerable<T> represents a sequence of values that can be iterated over synchronously, IAsyncEnumerable<T> represents a sequence that can be iterated over asynchronously.

The key method here is await foreach. This language construct allows a consumer to pull items from a producer as they become available, without blocking the calling thread.

// Conceptual usage of the consumer
public async Task ConsumeStreamAsync(IAsyncEnumerable<string> tokenStream)
{
    await foreach (var token in tokenStream)
    {
        // This 'await' pauses execution here until the next token is yielded
        // by the producer, but it frees up the thread to do other work.
        Console.Write(token);
    }
}

This is crucial for AI applications because the "producer" (the inference loop) is subject to variable latency. Some tokens are generated quickly (common words), while others take longer (rare or complex words). IAsyncEnumerable respects this natural cadence.
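To make the producer side concrete, here is a minimal sketch (the method name TokenSourceAsync and the token list are invented for illustration) of an async iterator that yields tokens with uneven delays, the way a real inference loop would:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class StreamDemo
{
    // Hypothetical producer: yields a fixed set of tokens with variable
    // delays, mimicking the uneven cadence of real per-token inference.
    public static async IAsyncEnumerable<string> TokenSourceAsync()
    {
        string[] tokens = { "The", " quick", " brown", " fox" };
        var rng = new Random();
        foreach (var token in tokens)
        {
            await Task.Delay(rng.Next(50, 200)); // simulated inference latency
            yield return token;                  // hand off to the consumer at once
        }
    }
}
```

Feeding this producer to ConsumeStreamAsync above prints each token the moment its delay elapses, rather than all four at the end.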

The Technical Workflow: From Tensor to Text

The process of generating a stream of text involves a tight loop that bridges the high-level C# logic with the low-level numerical operations of the ONNX model. This loop is the "engine" of our streaming pipeline.

1. The Inference Loop

The loop's primary job is to repeatedly call the ONNX Runtime InferenceSession.Run() method. However, Run() is a stateless operation. It takes an input tensor and produces an output tensor. It has no memory of what was generated in the previous step.

To give the model memory, we must maintain the state ourselves. This is the inference loop:

  1. Tokenize the Prompt: Convert the initial user prompt into a list of token IDs.
  2. Initial Run: Feed these token IDs into the model to get the first set of logits (raw predictions).
  3. Process Logits: Convert the logits into a probability distribution and select the most likely token ID.
  4. Yield Token: Convert the token ID back to text and "yield" it to the stream.
  5. Update State: Append the newly generated token ID to our list of input tokens.
  6. Repeat: Go back to step 2, using the updated list of tokens as the new input.

This loop is the heart of the generation process. It must be implemented carefully to ensure it can be paused and resumed as we stream tokens.
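The six steps above can be condensed into a single method. This is a schematic, synchronous sketch: the delegates (tokenize, runModel, and so on) are placeholders for the real tokenizer and ONNX Runtime calls built in earlier chapters, not library APIs.

```csharp
using System;
using System.Collections.Generic;

public static class InferenceLoopSketch
{
    // Schematic inference loop. The delegates stand in for the real
    // tokenizer and ONNX Runtime calls; none of these names come from
    // an actual library.
    public static IEnumerable<string> Generate(
        string prompt,
        Func<string, List<int>> tokenize,       // step 1: prompt -> token IDs
        Func<List<int>, float[]> runModel,      // step 2: token IDs -> logits
        Func<float[], int> selectToken,         // step 3: logits -> next token ID
        Func<int, string> detokenize,           // step 4: token ID -> text
        int eosId,
        int maxTokens = 256)
    {
        var tokens = tokenize(prompt);          // 1. tokenize the prompt
        for (int i = 0; i < maxTokens; i++)     // 6. repeat (with a safety cap)
        {
            float[] logits = runModel(tokens);  // 2. run the model
            int nextId = selectToken(logits);   // 3. pick the next token
            if (nextId == eosId) yield break;   // stop at end-of-sequence
            yield return detokenize(nextId);    // 4. yield it to the stream
            tokens.Add(nextId);                 // 5. update state
        }
    }
}
```

The maxTokens cap is a practical guard: a model that never emits its end-of-sequence token would otherwise generate forever.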

2. Tensor Manipulation and Post-processing

When InferenceSession.Run() returns, it gives us a NamedOnnxValue object containing a Tensor<float>. This tensor represents the "logits"—a raw, unnormalized score for every single token in the model's vocabulary.

To turn these logits into a meaningful token, we perform two critical steps:

  • Softmax (Conceptual): We convert the raw logits into probabilities; a higher logit score becomes a higher probability. For greedy decoding this step can be skipped, because the token with the highest logit is also the token with the highest probability.
  • Argmax (Selection): We scan the tensor to find the index of the highest value. This index is the ID of the next token.
  • Detokenization: We use a vocabulary mapping (loaded from a tokenizer.json or similar file, which we built in a previous chapter) to look up the string representation of this token ID.

This process transforms a multi-dimensional array of floating-point numbers into a single, human-readable character or word piece.
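Greedy selection over the logits takes only a few lines of plain C#. This sketch assumes the logits for the final sequence position have already been copied out of the tensor into a float[]:

```csharp
using System;

public static class LogitOps
{
    // Argmax: index of the highest logit = ID of the next token (greedy decoding).
    public static int ArgMax(float[] logits)
    {
        int best = 0;
        for (int i = 1; i < logits.Length; i++)
            if (logits[i] > logits[best]) best = i;
        return best;
    }

    // Softmax: turns raw logits into a probability distribution.
    // Subtracting the max first keeps Math.Exp from overflowing.
    public static float[] Softmax(float[] logits)
    {
        float max = logits[ArgMax(logits)];
        var exp = new float[logits.Length];
        float sum = 0f;
        for (int i = 0; i < logits.Length; i++)
        {
            exp[i] = (float)Math.Exp(logits[i] - max);
            sum += exp[i];
        }
        for (int i = 0; i < exp.Length; i++) exp[i] /= sum;
        return exp;
    }
}
```

For greedy decoding, argmax over the raw logits and argmax over the softmaxed probabilities pick the same token; the probabilities matter once you move to sampling strategies such as top-k.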

Architectural Flow

The relationship between the consumer (UI), the streaming pipeline, and the ONNX engine can be visualized as a series of hand-offs.

This diagram illustrates the architectural flow where a multi-dimensional array of floating-point numbers is processed by an ONNX engine and streamed through a pipeline to a UI consumer, transforming complex data into a single, human-readable character or word piece.

Why This Matters for Edge AI

This architectural pattern is not just a "nice-to-have"; it is fundamental to the viability of local AI.

  1. Responsiveness: By streaming, we can display the first token within milliseconds of the request, even if the full response takes 30 seconds. This immediate feedback psychologically reduces the perceived wait time.
  2. Memory Efficiency: Instead of storing the entire generated response in memory before displaying it, we process and display it token by token. For very long generations, this is critical.
  3. Cancellation: Because the generation happens inside an async iterator, it is naturally interruptible. A user can press a "Stop" button, and we can simply cancel the await foreach loop. The CancellationToken can be passed down into the inference loop, letting it stop cleanly before the next InferenceSession.Run() call (ONNX Runtime's RunOptions.Terminate flag can even interrupt a run already in progress). This is impossible with a synchronous, blocking call.
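Point 3 can be sketched end to end. The producer name SlowTokensAsync is invented for illustration; the pattern (a CancellationTokenSource whose token reaches the iterator via WithCancellation) is the standard C# idiom:

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

public static class CancelDemo
{
    // Illustrative producer that honors cancellation between tokens.
    public static async IAsyncEnumerable<string> SlowTokensAsync(
        [EnumeratorCancellation] CancellationToken ct = default)
    {
        for (int i = 0; ; i++)
        {
            await Task.Delay(100, ct);   // throws OperationCanceledException when cancelled
            yield return $"token{i} ";
        }
    }

    public static async Task<int> RunUntilCancelledAsync()
    {
        using var cts = new CancellationTokenSource();
        cts.CancelAfter(350);            // stand-in for the user pressing "Stop"
        int received = 0;
        try
        {
            await foreach (var t in SlowTokensAsync().WithCancellation(cts.Token))
                received++;
        }
        catch (OperationCanceledException)
        {
            // Generation stopped mid-stream; no further inference work happens.
        }
        return received;
    }
}
```

Without the CancellationToken, the producer would keep generating in the background even after the consumer stopped listening.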

The Role of Previous Concepts

This chapter builds directly upon the foundation laid in Book 8: The ONNX Pipeline. Specifically, we rely on the InferenceSession object created there. The InferenceSession is the loaded model, the "brain" that we will now learn to converse with in a real-time dialogue. Furthermore, the tokenization logic—the mapping of strings to token IDs—is the inverse of the detokenization we will perform here. The vocabulary map created during tokenization is essential for converting the generated token IDs back into text that the user can read. Without that pre-processing step, the output of this chapter would be nothing but a stream of meaningless integers.

Basic Code Example

Here is a basic code example demonstrating asynchronous streaming of LLM tokens to the console using C# and ONNX Runtime.

The Problem: The "Waiting" Cursor

Imagine you are building a local chat application. When you ask a question, the AI takes a few seconds to generate a response. If you wait for the entire response to finish processing before displaying anything, the user sees a frozen screen or a spinning wheel. This feels slow and unresponsive. The goal is to mimic human typing: as soon as the AI generates a word (or a token), we want it to appear on the screen immediately. This requires streaming inference.

The Code Example

This example simulates the core logic of streaming tokens. It uses a mock InferenceSession to represent the ONNX model execution, focusing on the IAsyncEnumerable pipeline that handles the asynchronous generation and display.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

namespace EdgeAIStreamingDemo
{
    // ---------------------------------------------------------
    // 1. MOCK INFRASTRUCTURE (Simulating ONNX Runtime)
    // ---------------------------------------------------------

    // Represents a simplified ONNX Runtime InferenceSession.
    // In a real scenario, this wraps the Microsoft.ML.OnnxRuntime.InferenceSession.
    public class MockInferenceSession : IDisposable
    {
        private readonly string[] _mockVocabulary;
        private readonly Random _random = new Random(); // shared; a new Random() per call can repeat values

        public MockInferenceSession()
        {
            // A tiny vocabulary for demonstration purposes. The leading spaces
            // mimic real subword tokenizers, where tokens carry their own spacing.
            _mockVocabulary = new[] { "Hello", " World", "!", " How", " are", " you", "?", "\n" };
        }

        // Simulates the complex 'Run' method of ONNX Runtime.
        // It takes an input token ID and returns the next predicted token ID.
        // We simulate a delay to represent model computation time.
        public async Task<int> RunAsync(int inputTokenId, CancellationToken cancellationToken)
        {
            // Simulate GPU/CPU inference latency (e.g., 100ms - 300ms per token)
            await Task.Delay(_random.Next(100, 300), cancellationToken);

            // Simple logic to generate a fixed sequence: "Hello World! How are you?"
            // This is purely for the demo to produce readable output.
            int nextTokenId = inputTokenId switch
            {
                0 => 1, // "Hello" -> " World"
                1 => 2, // " World" -> "!"
                2 => 3, // "!" -> " How"
                3 => 4, // " How" -> " are"
                4 => 5, // " are" -> " you"
                5 => 6, // " you" -> "?"
                6 => 7, // "?" -> "\n"
                _ => -1 // End of sequence
            };

            return nextTokenId;
        }

        // Helper to convert ID back to text for display.
        public string TokenIdToString(int tokenId)
        {
            if (tokenId < 0 || tokenId >= _mockVocabulary.Length) return "";
            return _mockVocabulary[tokenId];
        }

        public void Dispose() { /* Cleanup native resources */ }
    }

    // ---------------------------------------------------------
    // 2. CORE LOGIC: Streaming Pipeline
    // ---------------------------------------------------------

    public class StreamingGenerator
    {
        private readonly MockInferenceSession _session;

        public StreamingGenerator(MockInferenceSession session)
        {
            _session = session;
        }

        /// <summary>
        /// Generates text asynchronously and yields tokens as they are produced.
        /// </summary>
        /// <param name="promptTokenId">The starting token ID (e.g., BOS token).</param>
        /// <param name="cancellationToken">Cancellation token to stop generation.</param>
        /// <returns>An async stream of strings (tokens).</returns>
        public async IAsyncEnumerable<string> GenerateStreamAsync(
            int promptTokenId, 
            [EnumeratorCancellation] CancellationToken cancellationToken)
        {
            int currentTokenId = promptTokenId;

            // The Inference Loop
            while (true)
            {
                // 1. Run inference asynchronously (non-blocking)
                int nextTokenId = await _session.RunAsync(currentTokenId, cancellationToken);

                // 2. Check for end-of-sequence (EOS) token
                if (nextTokenId == -1) 
                    break;

                // 3. Decode the token ID to text
                string tokenText = _session.TokenIdToString(nextTokenId);

                // 4. Yield the token immediately to the consumer
                yield return tokenText;

                // 5. Update state for the next iteration
                currentTokenId = nextTokenId;
            }
        }
    }

    // ---------------------------------------------------------
    // 3. CONSUMER: Console Application
    // ---------------------------------------------------------

    class Program
    {
        static async Task Main(string[] args)
        {
            Console.WriteLine("Initializing Local LLM Stream...");

            // Initialize the mock session (In real code, load .onnx file here)
            using var inferenceSession = new MockInferenceSession();
            var generator = new StreamingGenerator(inferenceSession);

            // Define a cancellation token source (e.g., handle Ctrl+C)
            using var cts = new CancellationTokenSource();

            try
            {
                Console.WriteLine("\nGenerated Output: ");
                Console.WriteLine("-------------------");

                // Start the stream. We begin with Token ID 0 ("Hello") and
                // echo the prompt token first so the full sentence is visible.
                Console.Write(inferenceSession.TokenIdToString(0));
                await foreach (var token in generator.GenerateStreamAsync(0, cts.Token))
                {
                    // Write token to console immediately without waiting for the next line
                    Console.Write(token);

                    // Optional: Flush standard output to ensure text appears immediately
                    // (Usually handled by Console.Write, but critical in some redirections)
                    Console.Out.Flush();
                }

                Console.WriteLine("\n-------------------");
                Console.WriteLine("Stream finished.");
            }
            catch (OperationCanceledException)
            {
                Console.WriteLine("\n[Stream cancelled by user]");
            }
            catch (Exception ex)
            {
                Console.WriteLine($"\n[Error: {ex.Message}]");
            }
        }
    }
}

Detailed Line-by-Line Explanation

1. Mock Infrastructure (MockInferenceSession)

In a real-world scenario, you would use the Microsoft.ML.OnnxRuntime NuGet package to load a model file (e.g., llama-2-7b.onnx). Since we cannot bundle a 4GB model file in a text example, we create a MockInferenceSession.

  • _mockVocabulary: Represents the model's tokenizer mapping. In reality, this is a massive dictionary mapping integers to strings (sub-words).
  • RunAsync: This is the critical method. In ONNX Runtime, you call session.Run(inputs). Here, we simulate that call.
    • Task.Delay: Real inference takes time (milliseconds to seconds depending on hardware). We simulate this latency to demonstrate why async is necessary. Without await, the UI thread would freeze.
    • Logic: We implement a simple state machine to output a specific sentence ("Hello World! How are you?"). In a real LLM, the logic is the neural network math, which predicts the next likely token based on probability.
  • TokenIdToString: Converts the integer output of the model back into human-readable text.

2. Core Logic (StreamingGenerator)

This class encapsulates the inference loop and exposes it via IAsyncEnumerable.

  • IAsyncEnumerable<string>: This is the key C# feature for this chapter. It allows a method to return a sequence of items asynchronously. Unlike Task<string[]>, it doesn't wait for the entire array to be filled; it yields items one by one as they become available.
  • [EnumeratorCancellation]: This attribute lets a CancellationToken supplied by the consumer (either passed directly, as here, or attached via WithCancellation() on the await foreach) flow into the iterator body, enabling graceful stopping.
  • The Loop (while true):
    1. await _session.RunAsync(...): We await the model's prediction. This frees up the thread to do other work (or handle UI events) while the model "thinks".
    2. EOS Check: We check for a specific ID (e.g., -1) that signals the model has finished generating the response.
    3. yield return tokenText: This is the "magic" of streaming. Execution pauses here, returns the token to the caller (the Main method), and waits for the caller to request the next item. Once the caller processes it (writes to console), the loop resumes to fetch the next token.

3. Consumer (Program.Main)

This simulates the application logic.

  • await foreach (var token in ...): This is the consumer side of IAsyncEnumerable. It iterates over the stream.
  • Console.Write(token): Because await foreach waits for the yield return in the generator, the text appears on screen token-by-token, creating the "typing" effect.
  • Console.Out.Flush(): While Console.Write usually flushes automatically, explicitly flushing ensures the text appears immediately, especially if the output is being piped to another process or rendered in a specific UI context.

Visualizing the Data Flow

The following diagram illustrates the flow of control and data between the Consumer, the Streaming Generator, and the ONNX Inference Session.

This diagram illustrates the flow of control and data: the Consumer's await foreach pulls from the Streaming Generator, whose yield return hands back each token produced by the ONNX Inference Session.

Common Pitfalls

  1. Blocking the UI Thread with Result or Wait():

    • Mistake: Calling session.RunAsync(...).Result or .Wait() inside the generation loop.
    • Consequence: In a GUI application (WPF, MAUI, WinUI), this freezes the entire interface. The user cannot click "Cancel" or interact with the app while the model generates tokens.
    • Fix: Always use await and async all the way down. Ensure the InferenceSession.Run method is truly asynchronous (or offloaded to a background thread if the native library is synchronous).
  2. Buffering Tokens Before Display:

    • Mistake: Collecting tokens into a List<string> or StringBuilder inside the generator and returning the full string only after the loop finishes.
    • Consequence: You lose the real-time streaming benefit. The user sees nothing until the entire generation is complete, defeating the purpose of this chapter.
    • Fix: Use yield return immediately after decoding the token.
  3. Ignoring CancellationToken:

    • Mistake: Not passing the CancellationToken into the RunAsync method or the Task.Delay simulation.
    • Consequence: If the user clicks "Stop", the application might continue generating tokens in the background, wasting CPU/GPU resources and potentially causing memory leaks.
    • Fix: Call cancellationToken.ThrowIfCancellationRequested() inside long-running loops and pass the token to every await call.
  4. Thread Starvation:

    • Mistake: Running heavy synchronous code (like tensor manipulation in pure C#) inside the async loop without offloading.
    • Consequence: Even though the code is async, if the post-processing of the tensor (e.g., softmax calculations) is done inline on the thread consuming the stream, it can delay the consumption of the next token.
    • Fix: Ensure heavy CPU-bound work is wrapped in Task.Run or handled by the ONNX Runtime's native execution providers (which usually run on separate threads).
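As a minimal sketch of that last fix, the placeholder HeavyPostProcess below stands in for expensive tensor math or a synchronous native call; Task.Run moves it onto a thread-pool thread so the async loop is not blocked:

```csharp
using System.Threading;
using System.Threading.Tasks;

public static class OffloadDemo
{
    // Placeholder for CPU-heavy, synchronous work (tensor post-processing,
    // or a blocking native call). Here: a simple argmax over logits.
    public static int HeavyPostProcess(float[] logits)
    {
        int best = 0;
        for (int i = 1; i < logits.Length; i++)
            if (logits[i] > logits[best]) best = i;
        return best;
    }

    // Offload the synchronous work to the thread pool so the caller can
    // await it without blocking, and cancel it via the token.
    public static Task<int> HeavyPostProcessAsync(float[] logits, CancellationToken ct = default)
        => Task.Run(() => HeavyPostProcess(logits), ct);
}
```

The trade-off is a small scheduling overhead per call, which is usually negligible next to the cost of the work being offloaded.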

The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.





Code License: All code examples are released under the MIT License. Github repo.

Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.
