
Chapter 3: Memory - The Asynchronous & Heap-Stable Counterpart

Theoretical Foundations

The fundamental limitation of Span<T> is its stack-based nature, which restricts its lifetime to the execution of the current method. This constraint poses a significant architectural challenge in high-performance AI applications, particularly when processing large token streams asynchronously. To understand the necessity of Memory<T>, we must first revisit the constraints established in previous chapters regarding memory ownership and the execution context of modern neural networks.

In the preceding discussions on Span<T>, we established that it is a ref struct, meaning it can only exist on the stack and cannot be boxed or stored in heap-based data structures. While this provides blazing-fast access to contiguous memory regions without heap allocations, it creates a hard boundary for asynchronous control flows. Consider a typical AI inference pipeline: a request arrives, a tokenizer converts text into tokens, and an asynchronous method processes these tokens through a neural network. If we attempt to pass a Span<T> of tokens from the tokenizer to the inference engine across an await boundary, the compiler will emit an error. The Span<T> would be invalid by the time the asynchronous method resumes execution, as its stack frame would have been popped.
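The compiler restriction described above is easy to reproduce. Below is a minimal sketch (type names, method names, and delay values are illustrative) contrasting the rejected Span<T> pattern with the Memory<T> signature that compiles:

```csharp
using System;
using System.Threading.Tasks;

class SpanVsMemoryDemo
{
    // ❌ Will NOT compile (error CS4012): locals of a ref struct type such as
    // Span<int> cannot appear in an async method, because async locals are
    // hoisted onto a heap-allocated state machine when the method suspends.
    //
    // static async Task BrokenAsync(int[] tokens)
    // {
    //     Span<int> view = tokens;   // stack-only view
    //     await Task.Delay(10);      // suspension point
    //     Console.WriteLine(view[0]);
    // }

    // ✅ Compiles: Memory<int> is an ordinary struct and survives the await.
    static async Task<int> ProcessAsync(Memory<int> tokens)
    {
        await Task.Delay(10);          // simulated I/O or GPU wait
        return tokens.Span[0];         // materialize a Span only after resuming
    }

    static async Task Main()
    {
        int[] tokens = { 42, 7, 99 };
        Console.WriteLine(await ProcessAsync(tokens.AsMemory())); // prints 42
    }
}
```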

This is where Memory<T> and ReadOnlyMemory<T> enter the architecture. They serve as the heap-allocated, asynchronous-safe counterparts to Span<T>. Unlike Span<T>, Memory<T> is a struct that holds a reference to a heap object (such as an array) or a native memory pointer, along with metadata about the offset and length. Because it does not contain any stack-restricted references, it can be stored in class fields, passed across await boundaries, and enqueued in producer-consumer queues used for batching AI inference requests.

To visualize the relationship between these types and the execution context, consider the following diagram:

A Memory<T> struct points to heap-allocated or native memory, enabling its use in asynchronous methods and multi-threaded queues, unlike stack-restricted Span<T>.

Memory<T> is a struct that encapsulates a contiguous region of memory, similar to Span<T>, but with a crucial distinction: it is not a ref struct. This allows it to be stored on the heap, making it compatible with asynchronous programming models. It acts as a "handle" or "capability" to a memory region that remains valid for the duration of the object's lifecycle, regardless of the stack state.

The Asynchronous Safety Mechanism

In high-performance AI, we often deal with "pipelines" where data flows through distinct stages: Tokenization -> Embedding -> Inference -> Decoding. These stages are often decoupled and run asynchronously to maximize hardware utilization (e.g., overlapping GPU compute with CPU pre-processing).

When a tokenizer produces a sequence of integers (tokens), it typically allocates an array on the heap. If we were to wrap this array in a Span<int>, we could only use it within the scope of the tokenizer method. To pass these tokens to an asynchronous inference method, we must wrap them in a Memory<int>. The inference method can then accept Memory<int> as a parameter, and when it is ready to process the data, it can create a Span<int> view of that memory safely within its own execution context.
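A minimal sketch of that handoff follows; the "tokenizer" and "inference" stages here are toy stand-ins (character codes and a sum), not a real model API:

```csharp
using System;
using System.Threading.Tasks;

class Pipeline
{
    // The tokenizer allocates on the heap and hands back a heap-stable view.
    static Memory<int> Tokenize(string text)
    {
        int[] tokens = new int[text.Length];
        for (int i = 0; i < text.Length; i++)
            tokens[i] = text[i];          // toy "tokenization": char codes
        return tokens.AsMemory();
    }

    // The inference stage accepts ReadOnlyMemory<int> and only materializes
    // a Span inside its own synchronous sections, after the await.
    static async Task<long> InferAsync(ReadOnlyMemory<int> tokens)
    {
        await Task.Delay(10);             // simulated GPU/network latency
        long sum = 0;
        ReadOnlySpan<int> view = tokens.Span;
        for (int i = 0; i < view.Length; i++)
            sum += view[i];               // toy "inference": sum of token ids
        return sum;
    }

    static async Task Main()
    {
        Memory<int> tokens = Tokenize("AI");
        Console.WriteLine(await InferAsync(tokens)); // 'A' (65) + 'I' (73) = 138
    }
}
```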

This is analogous to a logistics warehouse. Imagine a pallet of goods (the Memory<T>) sitting in a loading bay. The pallet is stable and stays there (on the heap). A worker (the asynchronous method) can arrive at any time, pick up a forklift (the Span<T>), and move specific boxes from the pallet. The worker doesn't need to hold the pallet; they just need a temporary view of it. If another worker arrives later, they can do the same. The pallet remains valid as long as the warehouse exists.

Zero-Copy Data Handling

A critical aspect of Memory<T> in AI is enabling zero-copy data transfers. In large language models (LLMs), the input context window can be massive (e.g., 32k tokens). Copying this data for every asynchronous operation would saturate memory bandwidth and introduce latency.

Memory<T> allows us to hold a reference to a single, shared buffer. For instance, a buffer might hold the concatenated tokens of a conversation history. We can slice this Memory<T> to create views for specific system prompts or user queries without allocating new arrays. These slices remain valid as long as the parent Memory<T> is alive. This is particularly useful in batch processing, where we might stack multiple sequences into a single tensor buffer on the GPU. We can use Memory<T> to manage the offsets of these sequences on the CPU side before transferring them to the GPU.
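As a sketch, the conversation-history slicing described above looks like this (the token values and slice boundaries are arbitrary placeholders):

```csharp
using System;

class ConversationBuffer
{
    static void Main()
    {
        // One shared buffer holds the whole conversation history.
        int[] history = { 1, 2, 3,   10, 11, 12, 13,   20, 21 };
        Memory<int> all = history.AsMemory();

        // Views into the same storage: Slice allocates no new arrays.
        Memory<int> systemPrompt = all.Slice(0, 3);   // tokens 1..3
        Memory<int> userQuery    = all.Slice(3, 4);   // tokens 10..13
        Memory<int> reply        = all.Slice(7, 2);   // tokens 20..21

        // Writing through a slice writes through to the parent buffer.
        systemPrompt.Span[0] = 99;
        Console.WriteLine(history[0]);                // prints 99
        Console.WriteLine(userQuery.Length);          // prints 4
    }
}
```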

The MemoryManager<T> Abstraction

While Memory<T> is typically constructed from arrays (e.g., new Memory<int>(new int[100]) or array.AsMemory()), high-performance scenarios often require more control over memory allocation. This is where MemoryManager<T> comes into play. It is an abstract class that allows developers to create custom memory sources that integrate seamlessly with the Memory<T> ecosystem.

MemoryManager<T> is crucial for AI applications that interface with native libraries or specialized memory pools. For example, when working with GPU-accelerated libraries (like CUDA or DirectML), we often allocate memory directly in device memory or pinned (page-locked) host memory for faster transfers. Standard .NET arrays are managed by the GC and may be moved in memory, which is problematic for native pointers.

By implementing MemoryManager<T>, we can wrap a native memory pointer or a pinned array and expose it as a Memory<T>. This allows the high-level C# code to treat native memory exactly like managed memory, passing it through async pipelines without worrying about the underlying implementation details.
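A minimal sketch of that idea follows, using a pinned managed array as a stand-in for device or page-locked memory (the class name and sizes are illustrative; a real GPU integration would wrap a pointer obtained from the native allocator instead):

```csharp
using System;
using System.Buffers;
using System.Runtime.InteropServices;

// Wraps a pinned managed array so its address stays stable for native
// interop, while still flowing through async pipelines as Memory<float>.
public sealed class PinnedFloatManager : MemoryManager<float>
{
    private readonly float[] _array;
    private GCHandle _pin;   // keeps the GC from relocating the array

    public PinnedFloatManager(int length)
    {
        _array = new float[length];
        _pin = GCHandle.Alloc(_array, GCHandleType.Pinned);
    }

    public override Span<float> GetSpan() => _array;

    // Hands out a MemoryHandle for callers that need a raw pointer
    // (e.g. to pass into a native CUDA/DirectML call).
    public override MemoryHandle Pin(int elementIndex = 0)
        => _array.AsMemory(elementIndex).Pin();

    public override void Unpin() { } // per-call MemoryHandles release themselves

    protected override void Dispose(bool disposing)
    {
        if (_pin.IsAllocated) _pin.Free();
    }
}

class Demo
{
    static void Main()
    {
        using var manager = new PinnedFloatManager(4);
        Memory<float> memory = manager.Memory;  // view built by the base class
        memory.Span[0] = 42f;
        Console.WriteLine(memory.Span[0]);      // prints 42
    }
}
```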

Architectural Implications for AI Pipelines

The introduction of Memory<T> fundamentally changes how we design AI pipelines. Previously, developers might have relied on Task<byte[]> or Task<ArraySegment<byte>> to pass data asynchronously. These approaches often involved hidden allocations or copies. With Memory<T>, we can design APIs that are allocation-free and highly composable.

Consider an IAsyncEnumerable<Token> stream. In a high-throughput scenario, we want to buffer tokens efficiently. Using Memory<T>, we can implement a ring buffer or a sliding window mechanism that reuses memory blocks. As tokens arrive, we write them into a Memory<T> slice. When the buffer is full, we yield the Memory<T> to the consumer for processing. The consumer processes the data asynchronously, and once finished, signals that the memory can be recycled. This pattern minimizes GC pressure, which is critical for maintaining low latency in real-time AI applications (e.g., chatbots, code completion).

Furthermore, Memory<T> pairs naturally with System.Buffers.ArrayPool<T>. While Span<T> is often used with a buffer obtained from ArrayPool<T>.Shared.Rent() within a single method, Memory<T> allows us to hold onto that rented buffer across asynchronous boundaries. This is vital for scenarios where the lifetime of the data exceeds the lifetime of a single synchronous method call, such as streaming a large file through an AI model for summarization.
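A sketch of the rent-across-await pattern (the checksum stands in for real model work; buffer sizes are illustrative):

```csharp
using System;
using System.Buffers;
using System.Threading.Tasks;

class PooledStreaming
{
    // The rented buffer is held across an await via Memory<byte>.
    static async Task<int> SummarizeChunkAsync(ReadOnlyMemory<byte> chunk)
    {
        await Task.Delay(5);                      // simulated model latency
        int checksum = 0;
        foreach (byte b in chunk.Span) checksum += b;
        return checksum;
    }

    static async Task Main()
    {
        byte[] rented = ArrayPool<byte>.Shared.Rent(1024);
        try
        {
            // Rent may return a larger array than requested,
            // so always slice to the length you actually use.
            rented[0] = 1; rented[1] = 2; rented[2] = 3;
            Memory<byte> chunk = rented.AsMemory(0, 3);

            Console.WriteLine(await SummarizeChunkAsync(chunk)); // prints 6
        }
        finally
        {
            // Only return once no Memory<T> over the buffer is still in use.
            ArrayPool<byte>.Shared.Return(rented);
        }
    }
}
```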

Real-World Analogy: The Construction Site Blueprint

To fully grasp the utility of Memory<T>, let us use the analogy of a construction site blueprint.

  1. The Raw Data (The Heap): Imagine a massive, complex blueprint for a skyscraper stored in a secure vault (the Heap). This blueprint is large and expensive to reproduce.
  2. Span<T> (The Magnifying Glass): A construction worker (a synchronous method) needs to look at a specific section of the blueprint. They take a magnifying glass (Span<T>) and look directly at the blueprint in the vault. They cannot take the magnifying glass out of the vault room; once they leave the room, the view is lost. This is fast and efficient but very restrictive.
  3. Memory<T> (The Blueprint Reference Ticket): Instead of looking directly, the site manager issues a "Reference Ticket" (Memory<T>) that points to the specific section of the blueprint in the vault. This ticket is a physical object (a struct) that can be carried around the construction site (the heap). It doesn't contain the blueprint itself, just the instructions on how to find it.
  4. Asynchronous Execution (The Night Shift): The day shift worker finishes their planning but needs the night shift to continue the work. The day shift hands the Reference Ticket (Memory<T>) to the night shift worker. The night shift worker arrives later, presents the ticket, and uses their own magnifying glass (Span<T>) to view the blueprint section.
  5. MemoryManager<T> (The Architect's Custom Drawer): Sometimes, the blueprint isn't in a standard vault but in a specialized, custom-built drawer that requires a specific key. The architect (MemoryManager<T>) builds a custom mechanism that allows the reference ticket to work with this special drawer, ensuring that even non-standard storage can be accessed through the same ticketing system.

Deep Dive: Memory Safety and Lifetime Management

One of the most nuanced aspects of Memory<T> is lifetime management. A Memory<T> over a plain managed array keeps that array alive, so the GC cannot reclaim it; the hazard arises with pooled or native buffers, where a Memory<T> can outlive the storage it points to once the buffer is returned or freed. Unlike Span<T>, whose validity is enforced by the compiler through stack discipline, Memory<T> requires explicit lifetime management.

In the context of AI, this is managed through ownership patterns. When a pool (e.g., MemoryPool<T>, often implemented on top of MemoryManager<T>) rents a block of memory, it returns an owner exposing a Memory<T>. The consumer is responsible for returning the memory to the pool once processing is complete. Holding onto a Memory<T> after returning it to the pool results in undefined behavior (likely accessing reused memory, leading to data corruption).

This contrasts with the "fire and forget" nature of some GC-managed objects. Memory<T> forces the developer to think about the lifecycle of data buffers, which is a necessary discipline for high-performance computing. It bridges the gap between the safety of managed code and the performance requirements of manual memory management.
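The ownership discipline can be sketched with MemoryPool<T>, whose IMemoryOwner<T> makes the lifetime explicit (the dangling access at the end is shown as a comment precisely because it must never run):

```csharp
using System;
using System.Buffers;

class OwnershipDemo
{
    static void Main()
    {
        // Whoever holds the IMemoryOwner is responsible for disposing it,
        // which returns the buffer to the pool.
        IMemoryOwner<char> owner = MemoryPool<char>.Shared.Rent(16);
        Memory<char> memory = owner.Memory;

        "tok".AsSpan().CopyTo(memory.Span);
        Console.WriteLine(memory.Span.Slice(0, 3).ToString()); // prints tok

        owner.Dispose(); // buffer goes back to the pool

        // ❌ From here on, 'memory' is a dangling view: the pool may hand the
        // same storage to another caller, so reads and writes would race.
        // memory.Span[0] = 'X';  // never touch a view after disposing its owner
    }
}
```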

Integration with AI Token Processing

In the specific domain of token processing, Memory<T> enables sophisticated batching strategies. Modern AI models often require inputs to be padded to a fixed length or grouped into batches for parallel processing on the GPU.

Using Memory<T>, we can construct a "batched" input tensor on the CPU without copying data. Suppose we have three sentences of varying lengths. We can allocate a single contiguous Memory<byte> buffer large enough to hold all three. We then calculate the offsets and create three distinct Memory<byte> slices representing each sentence. These slices can be passed to an asynchronous encoding method. The method can then gather these slices (potentially using MemoryMarshal.TryGetArray for interop) and construct a single contiguous block for the GPU transfer.

Without Memory<T>, this would require either copying each sentence into a new array (allocating memory) or using unsafe pointers and pinning objects manually, which is error-prone and complicates the asynchronous flow.
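A sketch of that packing scheme: one copy to pack the batch into a single buffer, after which every per-sentence view is a zero-copy slice (the token values are placeholders):

```csharp
using System;

class BatchedTensor
{
    static void Main()
    {
        // Three "sentences" of varying token counts, to be packed together.
        int[][] sentences =
        {
            new[] { 1, 2 },
            new[] { 3, 4, 5 },
            new[] { 6 },
        };

        int total = 0;
        foreach (int[] s in sentences) total += s.Length;

        int[] backing = new int[total];        // one allocation for the batch
        Memory<int> batch = backing.AsMemory();

        // Pack each sentence once, then keep a zero-copy slice per sentence.
        Memory<int>[] views = new Memory<int>[sentences.Length];
        int offset = 0;
        for (int i = 0; i < sentences.Length; i++)
        {
            sentences[i].AsSpan().CopyTo(batch.Span.Slice(offset, sentences[i].Length));
            views[i] = batch.Slice(offset, sentences[i].Length);
            offset += sentences[i].Length;
        }

        Console.WriteLine(views[1].Span[0]);   // prints 3
        Console.WriteLine(backing[5]);         // prints 6
    }
}
```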

Conclusion

The theoretical foundation of Memory<T> rests on its ability to decouple the reference to memory from the execution context of the stack. By providing a heap-stable, asynchronous-safe wrapper around contiguous memory, it solves the critical problem of data lifetime in complex, non-blocking pipelines. It is the essential glue that allows high-performance synchronous operations (via Span<T>) to integrate seamlessly with the asynchronous nature of modern AI workloads, ensuring that massive token streams can be processed with zero-copy efficiency and minimal GC overhead.

Basic Code Example

using System;
using System.Buffers;
using System.IO;
using System.Text;
using System.Threading.Tasks;

namespace MemoryTDemo
{
    // A custom MemoryManager to demonstrate how to create heap-stable memory
    // without copying data. This is useful for wrapping native memory or
    // pooled buffers in the AI pipeline.
    public sealed class PooledTokenBuffer : MemoryManager<char>
    {
        private readonly char[] _pooledArray;
        private readonly int _length;

        public PooledTokenBuffer(int size)
        {
            // Simulate renting from a high-performance ArrayPool (e.g., ArrayPool<char>.Shared)
            _pooledArray = new char[size];
            _length = size;
        }

        // CreateMemory (from the base class) produces a Memory<char> that
        // routes back through this manager, rather than exposing the raw array.
        public override Memory<char> Memory => CreateMemory(_length);

        public override Span<char> GetSpan() => _pooledArray.AsSpan(0, _length);

        public override MemoryHandle Pin(int elementIndex = 0)
        {
            // Pinning ensures the garbage collector does not move the memory
            // while the pointer is being used (e.g., by native code or SIMD).
            return _pooledArray.AsMemory(elementIndex).Pin();
        }

        public override void Unpin()
        {
            // In a real scenario, this would release a pinned handle.
            // For managed arrays, this is often a no-op but is required by the interface.
        }

        protected override void Dispose(bool disposing)
        {
            if (disposing)
            {
                // Return the array to the pool or clean up resources
                Array.Clear(_pooledArray, 0, _length);
            }
        }
    }

    class Program
    {
        static async Task Main(string[] args)
        {
            // CONTEXT: An AI model processes a stream of tokens (text chunks).
            // We receive a large stream from a file or network, but we want to
            // process it asynchronously without copying the data into new strings
            // for every step, which would cause GC pressure.

            Console.WriteLine("--- 1. Creating Heap-Stable Memory from Array ---");

            // 1. Standard allocation: Memory<T> wraps an existing array.
            // This is "heap-stable" because the array is on the managed heap,
            // and Memory<T> tracks it safely across async contexts.
            char[] tokenBuffer = new char[1024];
            "Hello AI World".AsSpan().CopyTo(tokenBuffer);
            Memory<char> memorySource = tokenBuffer.AsMemory(0, 14);

            // 2. Demonstrate Async Processing
            // We pass 'memorySource' to an async method. Unlike Span<T>,
            // Memory<T> is allowed to "cross await boundaries".
            await ProcessTokenStreamAsync(memorySource);

            Console.WriteLine("\n--- 2. Using Custom MemoryManager (Pooled) ---");

            // 3. Custom Memory Source
            // Using MemoryManager<T> allows us to wrap custom memory sources
            // (like native memory or pooled arrays) and expose them as Memory<T>.
            using (var pooledBuffer = new PooledTokenBuffer(64))
            {
                // Fill with data
                "Optimized Token Processing".AsSpan().CopyTo(pooledBuffer.GetSpan());

                // Pass the Memory<T> property to the async processor
                await ProcessTokenStreamAsync(pooledBuffer.Memory);
            }

            Console.WriteLine("\n--- 3. ReadOnlyMemory<T> for Input Safety ---");

            // 4. Using ReadOnlyMemory<T> for input parameters
            // When the callee doesn't need to modify the data, use ReadOnlyMemory<T>.
            // This prevents accidental modification and signals intent.
            ReadOnlyMemory<char> readOnlySource = "Read-Only Data".AsMemory();
            await InspectTokenStreamAsync(readOnlySource);
        }

        // ASYNC SAFE: Span<T> cannot be used here. Memory<T> is required.
        // This method simulates an asynchronous AI pipeline step (e.g., tokenization).
        static async Task ProcessTokenStreamAsync(Memory<char> data)
        {
            Console.WriteLine($"Processing: '{data}'");

            // Simulate I/O latency (e.g., waiting for a GPU or network)
            await Task.Delay(50); 

            // We can access the underlying Span for synchronous processing
            // within the async method safely.
            Span<char> span = data.Span;

            // Example: Convert to uppercase in-place (modifying the original buffer)
            for (int i = 0; i < span.Length; i++)
            {
                if (char.IsLower(span[i]))
                    span[i] = char.ToUpper(span[i]);
            }

            Console.WriteLine($"Result:    '{data}'");
        }

        static async Task InspectTokenStreamAsync(ReadOnlyMemory<char> data)
        {
            // Simulate reading data without modifying it
            await Task.Delay(20);

            // We can slice the ReadOnlyMemory without allocating new memory
            ReadOnlyMemory<char> slice = data.Slice(0, 4); // "Read"
            Console.WriteLine($"Inspected prefix: '{slice}'");
        }
    }
}

Detailed Explanation

1. The Problem: Asynchronous Data Flow

In high-performance AI applications, data often flows through a pipeline: Disk/Network -> Tokenizer -> Model Inference -> Post-processing.

  • Synchronous Context: Span<T> is perfect for synchronous, stack-allocated, or temporary memory manipulation. It is extremely fast and allocation-free.
  • Asynchronous Context: Span<T> has a critical limitation: it cannot be used across await boundaries. The compiler enforces this because Span<T> is a ref struct (it must live on the stack), while an async method's locals are hoisted onto a heap-allocated state machine whenever the method is suspended — a stack-only type cannot survive that hoisting.
  • The Solution: Memory<T> and ReadOnlyMemory<T> are standard structs (not ref structs) that wrap a memory source (array, string, or custom buffer). They are safe to store in fields, pass to async methods, and "slice" without copying data.

2. Code Breakdown

Part 1: The PooledTokenBuffer Class

public sealed class PooledTokenBuffer : MemoryManager<char>

  • Why MemoryManager<T>? In AI workloads, we often use Object Pools or Native Memory (via NativeMemory.Alloc) to avoid GC pauses. MemoryManager<T> acts as an adapter. It allows us to present these custom memory sources as a standard Memory<T> that the .NET ecosystem understands.
  • Memory<char> Memory: This property returns the view of the memory. In our implementation, it wraps _pooledArray.
  • Pin(): This is crucial for high-performance scenarios. If you need to pass a pointer to this memory to a SIMD library or native code (like a C++ CUDA kernel), Pin() locks the memory in place so the Garbage Collector doesn't move it.

Part 2: The Main Method

  1. Standard Allocation:

    char[] tokenBuffer = new char[1024];
    Memory<char> memorySource = tokenBuffer.AsMemory(0, 14);
    
    We create a standard array and wrap it in Memory<char>. This is the most common use case.

  2. Passing to Async:

    await ProcessTokenStreamAsync(memorySource);
    
    We pass memorySource into the async method. Unlike Span<T>, Memory<T> is a regular struct (not a ref struct) and can survive the context switch of await.

Part 3: The ProcessTokenStreamAsync Method

  1. Signature:

    static async Task ProcessTokenStreamAsync(Memory<char> data)
    
    The parameter is Memory<char>, not Span<char>. This is the defining characteristic of an async-friendly buffer.

  2. Accessing the Span:

    Span<char> span = data.Span;
    
    Inside the async method, we often need to do synchronous processing (like iterating characters). We access the .Span property to get a view of the memory for synchronous operations.

  3. In-Place Modification:

    span[i] = char.ToUpper(span[i]);
    
    Because Memory<T> wraps a reference to the original array, modifying the Span modifies the original data in tokenBuffer. This is zero-copy processing.

Part 4: ReadOnlyMemory<T>

static async Task InspectTokenStreamAsync(ReadOnlyMemory<char> data)

  • Immutability: When the downstream processor only needs to read data (e.g., sending tokens to a logging service), using ReadOnlyMemory<T> prevents accidental modification and allows the compiler to enforce safety.
  • Slicing:
    ReadOnlyMemory<char> slice = data.Slice(0, 4);
    
    Slicing Memory<T> (or ReadOnlyMemory<T>) is extremely cheap. It creates a new struct pointing to the same underlying array but with different offset and length. No data is copied.

Visualizing the Memory Flow

The diagram below illustrates how Memory<T> acts as a bridge between the managed heap (or custom pools) and the asynchronous execution context.

This diagram illustrates how Memory<T> acts as a zero-copy bridge, allowing data to be accessed directly from the managed heap or custom pools within an asynchronous execution context.

Common Pitfalls

1. Storing Span<T> in a Class Field A frequent mistake is trying to store Span<T> in a field of a class to hold data between asynchronous steps.

// ❌ INCORRECT
class TokenProcessor {
    private Span<char> _buffer; // Error: Span<T> cannot be used as a field.
}
Why it fails: Span<T> is a ref struct and must live on the stack. It may hold an interior pointer into the middle of an object, which the runtime only tracks safely for stack locations — so it cannot be stored in a heap-resident field. The Fix: Use Memory<T> for fields.
// ✅ CORRECT
class TokenProcessor {
    private Memory<char> _buffer;
}

2. Forgetting to .Pin() for Native Interop When passing Memory<T> to a P/Invoke method (native code), you cannot simply pass the Memory<T> object.

// ❌ INCORRECT
[DllImport("NativeLib")]
static extern void ProcessData(Memory<char> data); // Memory<T> is not blittable; marshaling fails.
The Fix: You must use the Pin() method to get a pointer, or use MemoryMarshal.TryGetArray to get the underlying array.
// ✅ CORRECT (conceptual; requires an unsafe context)
using (MemoryHandle handle = memory.Pin()) {
    ProcessData((char*)handle.Pointer);
}

3. Confusing Memory<T> with ReadOnlyMemory<T> Using Memory<T> when the data is intended to be immutable.

// ❌ SUBOPTIMAL
void LogTokens(Memory<char> tokens) { ... }
Why it's bad: It signals to the caller that the method might modify the data, even if it doesn't. This restricts the caller from passing ReadOnlyMemory<T> directly without casting. The Fix: Always prefer ReadOnlyMemory<T> for input parameters unless modification is explicitly intended.

4. Slicing Overhead Misconception Developers sometimes avoid slicing because they fear it allocates.

// ✅ NO HEAP ALLOCATION
Memory<char> slice = memory.Slice(0, 10);
The Reality: Memory<T> is a struct (value type). Slicing creates a copy of the struct (which is cheap) pointing to the same underlying data with a different offset and length. It does not allocate on the heap. Slicing is encouraged for zero-copy segmentation of token streams.

The chapter continues with advanced code samples, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.





Code License: All code examples are released under the MIT License. Github repo.

Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.

All textual explanations, original diagrams, and illustrations are the intellectual property of the author. To support the maintenance of this site via AdSense, please read this content exclusively online. Copying, redistribution, or reproduction is strictly prohibited.