
Chapter 18: Streaming Data Processing with IAsyncEnumerable

Theoretical Foundations

In the realm of high-performance computing, particularly when building AI models in C#, we are constrained less by raw CPU speed than by the physical limits of memory bandwidth. When we process vector embeddings—arrays of floating-point numbers representing semantic meaning—we are dealing with massive buffers of data. A single embedding might be 4096 dimensions, and a batch of 64 such embeddings already occupies a full megabyte (64 × 4096 × 4 bytes).

The traditional approach in .NET involves allocating arrays on the Managed Heap. Every time you slice an array with embedding.Take(100).ToArray(), you ask the runtime to reserve fresh memory, copy bytes, and leave the Garbage Collector (GC) to eventually clean it up. In a high-throughput scenario, like a real-time RAG (Retrieval-Augmented Generation) system processing thousands of queries per second, these small allocations create "GC pressure": the application pauses for collections, and performance collapses.

To solve this, we enter the world of Zero-Allocation Memory using Span<T>. Think of Span<T> not as a container, but as a view or a window into existing memory.

The Real-World Analogy: The Glass Floor

Imagine a massive warehouse filled with data (the Heap). This warehouse is expensive to build and slow to clean up. In the past, if you wanted to work on a specific section of the warehouse, you had to rent a truck, pack up that section, drive it to a new location, and unpack it (Allocation and Copying).

Span<T> is a window in the floor of that warehouse. It doesn't move the goods. It doesn't rent a truck. It simply gives you a safe, managed way to look at and manipulate a specific area of the warehouse from where you are standing. You can slide this window anywhere you want, instantly, without moving a single box.
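In code, the glass-floor analogy is a one-liner. The sketch below (class and variable names are ours, not the chapter's) slices a window over an existing array and shows that writing through the window mutates the original buffer:

```csharp
using System;

public class SpanWindowDemo
{
    public static void Main()
    {
        float[] warehouse = { 1f, 2f, 3f, 4f, 5f, 6f };

        // A window over elements 2..4 of the existing buffer — no copy, no truck.
        Span<float> window = warehouse.AsSpan(2, 3);

        // Writing through the window mutates the original array.
        window[0] = 42f;

        Console.WriteLine(warehouse[2]); // prints 42
    }
}
```

Sliding the window is just constructing a new Span over a different range; the boxes never move.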

The Stack vs. The Heap: Where Span Lives

To understand Span, you must understand the memory model.

  1. The Heap: This is where new byte[1000] lives. It is garbage collected. It is flexible but slow.
  2. The Stack: This is where local variables live (e.g., int i = 5). It is incredibly fast (just moving a pointer), but the memory is temporary and limited.

Span<T> is a ref struct. This is a critical architectural decision. A ref struct can only live on the Stack. It cannot be boxed, it cannot be stored in a class field, and it cannot survive an await in an async method (because the compiler hoists an async method's locals onto the Heap when execution is suspended).

Why is this important? It guarantees that Span will never cause a Heap allocation. It forces the compiler to keep memory operations on the fast path.
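To see the guarantee in action, this hedged sketch (assuming .NET Core 3.0 or later for GC.GetAllocatedBytesForCurrentThread; the class name is ours) measures the allocation cost of a copying slice against a Span slice:

```csharp
using System;
using System.Linq;

public class AllocationDemo
{
    public static void Main()
    {
        float[] embedding = new float[4096];

        // Copying slice: a fresh array (plus LINQ machinery) lands on the Heap.
        long before = GC.GetAllocatedBytesForCurrentThread();
        float[] copy = embedding.Take(100).ToArray();
        long copyCost = GC.GetAllocatedBytesForCurrentThread() - before;

        // Span slice: just a view — the allocation counter does not move.
        before = GC.GetAllocatedBytesForCurrentThread();
        Span<float> view = embedding.AsSpan(0, 100);
        long viewCost = GC.GetAllocatedBytesForCurrentThread() - before;

        Console.WriteLine($"ToArray: {copyCost} bytes, AsSpan: {viewCost} bytes");
    }
}
```

On a typical run the ToArray line reports several hundred bytes while the AsSpan line reports zero.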

Zero-Allocation Slicing in AI

In the context of AI, we often receive a large tensor (a flat array of floats) and need to extract specific vectors or chunks. Without Span, we copy. With Span, we slice.

using System;

public class EmbeddingProcessor
{
    public void ProcessBatch(float[] tensorBuffer, int vectorsCount, int dimension)
    {
        // This is a zero-allocation operation.
        // We are not creating new arrays. We are creating 'views' into the existing buffer.

        for (int i = 0; i < vectorsCount; i++)
        {
            // Calculate the start index for this vector
            int startIndex = i * dimension;

            // Create a Span representing just this vector
            // No memory is allocated on the Heap here.
            Span<float> currentVector = tensorBuffer.AsSpan(startIndex, dimension);

            // We can now pass this 'currentVector' to other methods
            // that expect Span<float>, and it costs us nothing.
            NormalizeVector(currentVector);
        }
    }

    private void NormalizeVector(Span<float> vector)
    {
        // We can read and write through the Span directly.
        // This modifies the original tensorBuffer!
        float sumSq = 0f;
        for (int i = 0; i < vector.Length; i++)
        {
            sumSq += vector[i] * vector[i];
        }

        float norm = MathF.Sqrt(sumSq);
        if (norm == 0f) return; // avoid division by zero

        for (int i = 0; i < vector.Length; i++)
        {
            vector[i] /= norm;
        }
    }
}

Memory<T> and the Asynchronous World

Since Span<T> is a ref struct, it cannot cross await boundaries. If you are streaming data (as we are in Chapter 18), you might need to hold onto a buffer while waiting for the next chunk of data from the network.

This is where Memory<T> comes in. Memory<T> is the "cousin" of Span. It can live on the Heap (it's a regular struct), so it can be stored in class fields and passed into async methods.

When you need to do the heavy lifting, you slice the Memory<T> and read its Span property to get a Span<T> for the high-speed math; no pinning or copying is involved.

public async Task ProcessStreamAsync(Memory<float> largeBuffer)
{
    // We can await here because we are holding Memory, not Span
    await Task.Delay(100); 

    // When we are ready to do the heavy lifting (CPU bound work):
    Span<float> workingSet = largeBuffer.Span;

    // Perform zero-allocation math
    for(int i=0; i<workingSet.Length; i++) workingSet[i] *= 2;
}

ArrayPool<T>: Renting, Not Buying

Even with Span, we sometimes need a temporary buffer. In the old days, we did new byte[1024], which hits the GC. In high-performance AI, we avoid new like the plague.

Enter ArrayPool<T>. This is a shared pool of arrays. Instead of creating a new array, you "rent" one from the pool. When you are done, you "return" it. This reuses memory, keeping the GC dormant.

using System.Buffers;

public void HighPerformanceVectorAdd(Span<float> a, Span<float> b)
{
    // We need a temporary buffer to store the result before applying activation function.
    // We rent it from the pool.
    float[] rentedArray = ArrayPool<float>.Shared.Rent(a.Length);

    try
    {
        // Create a Span over the rented array
        Span<float> result = rentedArray.AsSpan(0, a.Length);

        // Perform the addition
        for (int i = 0; i < a.Length; i++)
        {
            result[i] = a[i] + b[i];
        }

        // Apply activation (e.g., ReLU) directly on the result Span
        for (int i = 0; i < result.Length; i++)
        {
            if (result[i] < 0) result[i] = 0;
        }

        // Now we might pass 'result' to the next stage of the pipeline
        SendToNextLayer(result);
    }
    finally
    {
        // CRITICAL: Always return the array to the pool.
        // If we forget, the pool must allocate fresh arrays on future Rent calls, reintroducing GC pressure.
        ArrayPool<float>.Shared.Return(rentedArray);
    }
}

Hardware Acceleration: System.Numerics.Vector<T>

So far, we have discussed memory management. But speed also comes from doing more work per CPU cycle. Modern CPUs support SIMD (Single Instruction, Multiple Data). This allows the CPU to load 4, 8, or 16 floats into a single register and add them all in one go.

In C#, we access this via System.Numerics.Vector<T>.

This is where the "Vector" in "Data Manipulation & Vectors" becomes literal. When processing embeddings, we don't want to add float[0] + float[0]. We want to add (float[0], float[1], float[2], float[3]) + (float[4], float[5], float[6], float[7]) in one instruction.

Span<T> is the perfect partner for Vector<T>. Because Span guarantees contiguous memory (no gaps), the CPU can safely load chunks of it into SIMD registers.

using System.Numerics; // Vector<T> ships with the runtime (the System.Numerics.Vectors assembly)

public void AddVectorsSimd(Span<float> a, Span<float> b, Span<float> result)
{
    // Determine how many floats fit into a single Vector register
    int vectorSize = Vector<float>.Count;

    int i = 0;

    // Vectorized main loop: process 'vectorSize' floats per iteration
    for (; i <= a.Length - vectorSize; i += vectorSize)
    {
        // Load chunks of memory into SIMD registers
        var va = new Vector<float>(a.Slice(i, vectorSize));
        var vb = new Vector<float>(b.Slice(i, vectorSize));

        // Perform the addition on ALL elements in the register simultaneously
        var vres = va + vb;

        // Store the result back into the memory
        vres.CopyTo(result.Slice(i, vectorSize));
    }

    // Handle the "tail" (remaining elements that didn't fit in a full vector)
    for (; i < a.Length; i++)
    {
        result[i] = a[i] + b[i];
    }
}

Stack Allocation: The Ultimate Speed

In extremely critical paths, like the inner loop of a matrix multiplication algorithm, even renting an array from the ArrayPool might be too slow (it requires thread synchronization). If we know the buffer size is small and fixed, we can allocate memory directly on the Stack using stackalloc.

This memory is instantly available and instantly freed when the method returns. It is literally just moving the stack pointer down.

public float DotProduct(Span<float> vectorA, Span<float> vectorB)
{
    // A small scratch buffer allocated directly on the Stack.
    // WARNING: keep stackalloc sizes small and fixed, or you risk a StackOverflowException.
    Span<float> partials = stackalloc float[Vector<float>.Count];

    int count = Vector<float>.Count;
    Vector<float> sum = Vector<float>.Zero;
    int i = 0;

    // SIMD loop: multiply 'count' pairs of floats per iteration.
    for (; i <= vectorA.Length - count; i += count)
    {
        sum += new Vector<float>(vectorA.Slice(i, count)) * new Vector<float>(vectorB.Slice(i, count));
    }

    // Spill the register into the stack buffer and reduce it, then add the scalar tail.
    sum.CopyTo(partials);
    float result = 0f;
    for (int j = 0; j < partials.Length; j++) result += partials[j];
    for (; i < vectorA.Length; i++) result += vectorA[i] * vectorB[i];

    // 'partials' disappears automatically when this method returns. No GC involved at all.
    return result;
}

Summary of the Architecture

In this chapter, we are moving away from "Object-Oriented" memory management (allocating classes everywhere) to "Data-Oriented" design.

  1. Span<T> gives us a safe, zero-allocation window into memory (Stack or Heap).
  2. Memory<T> allows us to carry that window across asynchronous boundaries.
  3. ArrayPool<T> recycles buffers to prevent the GC from waking up.
  4. Vector<T> utilizes the CPU's SIMD capabilities to process data in parallel.

This combination allows us to process infinite streams of vector embeddings in real-time, maintaining low latency and high throughput, which is the backbone of modern AI systems.

Basic Code Example

Here is a high-performance implementation of a basic IAsyncEnumerable iterator that processes vector embeddings using zero-allocation techniques.

using System;
using System.Buffers;
using System.Collections.Generic;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

public class HighPerformanceEmbeddingStream
{
    // Simulating a raw byte buffer (e.g., from a network stream or file)
    private readonly byte[] _rawDataBuffer;
    private readonly int _embeddingDimension;

    public HighPerformanceEmbeddingStream(byte[] rawData, int dimension)
    {
        _rawDataBuffer = rawData;
        _embeddingDimension = dimension;
    }

    /// <summary>
    /// Asynchronously streams normalized vector embeddings using SIMD for hardware acceleration.
    /// </summary>
    public async IAsyncEnumerable<Memory<float>> StreamEmbeddingsAsync(
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        // Calculate the size of a single embedding vector in bytes
        int vectorSizeBytes = _embeddingDimension * sizeof(float);

        // Process the buffer in chunks to simulate streaming
        for (int offset = 0; offset <= _rawDataBuffer.Length - vectorSizeBytes; offset += vectorSizeBytes)
        {
            // Check for cancellation before heavy processing
            cancellationToken.ThrowIfCancellationRequested();

            // CRITICAL: Zero-Allocation Slicing.
            // We use Memory<T> to slice the underlying array without creating a new array or copying data.
            // This is a reference-based operation (O(1)).
            Memory<byte> rawSlice = _rawDataBuffer.AsMemory(offset, vectorSizeBytes);

            // Rent an array from the shared pool to avoid Heap allocations (Gen 0 pressure).
            // This is crucial for high-throughput scenarios (AI batch processing).
            float[] rentedArray = ArrayPool<float>.Shared.Rent(_embeddingDimension);

            try
            {
                // Convert bytes to floats.
                // In a real scenario, MemoryMarshal.Cast<byte, float>() reinterprets the bytes
                // in place with no copy. Here we copy into the rented array for demonstration.
                Span<byte> rawSpan = rawSlice.Span;
                Span<float> floatSpan = rentedArray.AsSpan(0, _embeddingDimension);

                // Decode each 4-byte group into a float (BitConverter reads machine-endian).
                for (int i = 0; i < rawSpan.Length; i += sizeof(float))
                {
                    floatSpan[i / sizeof(float)] = BitConverter.ToSingle(rawSpan.Slice(i, sizeof(float)));
                }

                // HARDWARE ACCELERATION: Normalize the vector using SIMD (System.Numerics.Vector<T>)
                NormalizeVectorSimd(floatSpan);

                // Yield the result. We yield Memory<T> to allow the consumer to decide 
                // if they want to operate on the rented array or copy it out.
                yield return rentedArray.AsMemory(0, _embeddingDimension);

                // IMPORTANT: We cannot return the array to the pool here, because the
                // consumer uses it after the yield. Ownership transfers to the consumer,
                // who must return it, or we must wrap the stream in a custom enumerator
                // whose DisposeAsync performs the return. If nobody returns it, the pool
                // simply allocates fresh arrays on later Rent calls, and GC pressure is back.
            }
            finally
            {
                // In a robust implementation, the array would be returned here on the
                // error path (an exception before the yield). On success, ownership has
                // moved to the consumer, so we deliberately do not return it unconditionally:
                // ArrayPool<float>.Shared.Return(rentedArray);
            }

            // Simulate I/O latency (network delay)
            await Task.Delay(10, cancellationToken);
        }
    }

    /// <summary>
    /// Normalizes a vector in-place using SIMD instructions.
    /// Calculates the L2 norm (Euclidean length) and divides every element by it.
    /// </summary>
    private void NormalizeVectorSimd(Span<float> vector)
    {
        // Vector<float> maps to a 256-bit (AVX) or 128-bit (SSE) register depending on hardware.
        int count = Vector<float>.Count;
        int i = 0;

        // Step 1: Sum of squares, accumulated in a SIMD register.
        Vector<float> sumVector = Vector<float>.Zero;
        for (; i <= vector.Length - count; i += count)
        {
            var v = new Vector<float>(vector.Slice(i, count));
            sumVector += v * v; // element-wise multiply and add across the whole register
        }

        // Horizontal sum of the SIMD register.
        float sumSq = Vector.Dot(sumVector, Vector<float>.One);

        // Scalar loop for the remainder (tail processing).
        for (; i < vector.Length; i++)
        {
            sumSq += vector[i] * vector[i];
        }

        float norm = MathF.Sqrt(sumSq);

        // Step 2: Divide by the norm (normalization). Guard against division by zero.
        if (norm == 0) return;

        // Vector<float> supports element-wise division via the '/' operator.
        var normVector = new Vector<float>(norm);
        i = 0;
        for (; i <= vector.Length - count; i += count)
        {
            var v = new Vector<float>(vector.Slice(i, count));
            (v / normVector).CopyTo(vector.Slice(i, count));
        }

        // Scalar tail.
        for (; i < vector.Length; i++)
        {
            vector[i] /= norm;
        }
    }
}

Code Explanation

  1. Zero-Allocation Slicing (Memory<T>): The line Memory<byte> rawSlice = _rawDataBuffer.AsMemory(offset, vectorSizeBytes); is the cornerstone of high-performance streaming. Unlike Substring or Array.Copy, AsMemory creates a lightweight wrapper (a reference, an offset, and a length) over the existing data. This prevents unnecessary heap allocations, which is critical when processing gigabytes of embedding data, as it reduces Garbage Collector (GC) pressure.

  2. Array Pooling (ArrayPool<T>): ArrayPool<float>.Shared.Rent(_embeddingDimension); borrows a pre-allocated array from a shared pool. In standard coding, new float[] would allocate on the heap, triggering Gen 0 collections. By renting, we reuse memory blocks, keeping the application allocation-free during the hot path.

  3. SIMD Acceleration (System.Numerics.Vector<T>): The NormalizeVectorSimd method utilizes Vector<float>, which maps to CPU SIMD registers (AVX/SSE). This allows the processor to perform mathematical operations (like multiplication and addition) on multiple floats simultaneously (e.g., 8 floats at once on a 256-bit register), drastically speeding up tensor math compared to scalar loops.

  4. Async Streaming (IAsyncEnumerable): The await foreach pattern allows the consumer to process data as it arrives. The yield return keyword suspends the method execution until the next item is requested, enabling efficient memory usage even for infinite streams.

  5. Span and Stackalloc (Conceptual): Span<T> does the actual in-place work inside NormalizeVectorSimd, while stackalloc allocates memory directly on the stack (which is automatically freed when the function exits). This is ideal for small, temporary buffers used in calculations, avoiding heap allocations entirely.
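The consuming side of point 4 is worth seeing in full. This self-contained sketch uses a tiny stand-in producer (our own, not the chapter's class) so the await foreach mechanics are visible in isolation:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public class StreamConsumerDemo
{
    // A minimal producer (hypothetical) with the same shape as StreamEmbeddingsAsync.
    public static async IAsyncEnumerable<Memory<float>> ProduceAsync(int vectors, int dimension)
    {
        for (int i = 0; i < vectors; i++)
        {
            await Task.Yield(); // simulate waiting on the next network chunk
            yield return new float[dimension].AsMemory();
        }
    }

    public static async Task Main()
    {
        int received = 0;

        // The consumer pulls one item at a time; the producer is suspended in between.
        await foreach (Memory<float> embedding in ProduceAsync(vectors: 3, dimension: 4))
        {
            received++;
        }

        Console.WriteLine(received); // prints 3
    }
}
```

Swap ProduceAsync for StreamEmbeddingsAsync and the consumer loop is unchanged.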

Real-World Context

Imagine an AI application processing a live feed of video frames converted into vector embeddings (e.g., 512-dimensional floats). The raw data arrives as a continuous stream of bytes from a network socket.

If we used standard List<float> or new float[] for every frame, the application would quickly run out of memory or spend all its CPU time in Garbage Collection. By using IAsyncEnumerable combined with Span<T> and ArrayPool, we create a pipeline that:

  1. Slices the incoming byte buffer (Zero-copy).
  2. Rents memory only for the duration of the calculation.
  3. Uses SIMD to normalize vectors at maximum CPU throughput.
  4. Yields results back to the UI or database layer without blocking the thread.

Common Pitfalls

Pitfall 1: Disposing Rented Arrays When using ArrayPool<T>.Shared.Rent, you must call ArrayPool<T>.Shared.Return(array) when you are finished with the array. A common mistake in IAsyncEnumerable is returning the array immediately after yielding:

// WRONG
yield return rentedArray;
ArrayPool<float>.Shared.Return(rentedArray); // The consumer gets a returned (dirty) array!
Solution: In the example above, we yield Memory<T> but never return the array to the pool, for simplicity. In production, implement a custom IAsyncEnumerator<T> that holds the rented array and returns it in its DisposeAsync method.
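An alternative that avoids writing a custom enumerator (a sketch using the standard MemoryPool<T> API, not the chapter's own code) is to yield the IMemoryOwner<T> itself, so that disposal by the consumer is what returns the buffer:

```csharp
using System;
using System.Buffers;
using System.Collections.Generic;
using System.Threading.Tasks;

public class OwnedStreamDemo
{
    // Yield the owner itself; whoever disposes it hands the buffer back to the pool.
    public static async IAsyncEnumerable<IMemoryOwner<float>> StreamAsync(int vectors, int dimension)
    {
        for (int i = 0; i < vectors; i++)
        {
            IMemoryOwner<float> owner = MemoryPool<float>.Shared.Rent(dimension);
            // ... decode the next embedding into owner.Memory.Span here ...
            await Task.Yield();
            yield return owner;
        }
    }

    public static async Task Main()
    {
        await foreach (IMemoryOwner<float> owner in StreamAsync(2, 4))
        {
            using (owner) // Dispose() returns the buffer to the pool
            {
                // Rented buffers may be larger than requested.
                Console.WriteLine(owner.Memory.Length >= 4 ? "ok" : "too small");
            }
        }
    }
}
```

The using block makes the ownership handoff explicit: whoever unwraps the Memory is also responsible for handing the buffer back.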

Pitfall 2: Using LINQ on Span Standard LINQ methods (.Select(), .Where()) are not available on Span<T>: as a ref struct, a Span cannot implement IEnumerable<T>, so the interface-based enumerator machinery LINQ relies on simply cannot apply to it.

// WRONG
var normalized = vectorSpan.Select(x => x / norm).ToArray(); 
Solution: Always use for or foreach loops when manipulating Span<T> in performance-critical code.
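For completeness, the loop-based replacement for the rejected LINQ line might look like this (norm and the sample values are stand-ins):

```csharp
using System;

public class SpanLoopDemo
{
    public static void Main()
    {
        float norm = 2f;
        Span<float> vectorSpan = stackalloc float[] { 2f, 4f, 8f };

        // In-place scaling: no enumerator, no boxing, no new array.
        for (int i = 0; i < vectorSpan.Length; i++)
        {
            vectorSpan[i] /= norm;
        }

        Console.WriteLine(string.Join(", ", vectorSpan.ToArray())); // prints "1, 2, 4"
    }
}
```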

Visualizing the Memory Pipeline

A diagram illustrating the memory pipeline would show how a Span<T> provides a lightweight, safe view into a contiguous block of memory, contrasting the direct, low-overhead access of a for loop with the potential overhead of LINQ operations that may not be optimized for such memory representations.

The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.



Code License: All code examples are released under the MIT License. Github repo.

Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.

All textual explanations, original diagrams, and illustrations are the intellectual property of the author. Copying, redistribution, or reproduction is strictly prohibited.