
Chapter 20: Project - Building a Keyword Extraction Engine with LINQ

Theoretical Foundations

In the realm of high-performance AI and data manipulation, memory management is not merely a background task; it is the primary bottleneck. When processing massive text corpora or high-dimensional vectors, the standard approach of allocating new objects for every substring, array slice, or intermediate calculation creates a "garbage collection (GC) tax" that can stall execution for milliseconds or even seconds. This is unacceptable in real-time inference or large-scale batch processing. To address this, we must shift our mental model from "convenient object references" to "precise memory ownership."

The Analogy: The Library vs. The Photocopier

Imagine you are a researcher analyzing a specific paragraph in a 1,000-page encyclopedia.

The Traditional Heap Allocation Approach (The Photocopier): Every time you want to isolate that paragraph to highlight it, you don't just mark the page. You walk to the photocopier, copy the entire page, take the physical copy back to your desk, and highlight the paragraph. If you need to extract just one sentence from that paragraph, you copy the paragraph again. When you are done, you throw the copies in the recycling bin. In a high-performance environment (analyzing the whole encyclopedia), you spend more time walking to the copier and recycling paper than you do actually reading.

The Span<T> Approach (The Bookmark): Span<T> is like placing a bookmark on the starting word and using your finger to trace the ending word. You haven't copied the text. You haven't allocated new paper. You have simply defined a "view" or a "region" of the existing data. You can slide this view around the page, change its length, or pass it to a colleague without duplicating the underlying content. This is Zero-Allocation Slicing.

Stack vs. Heap: The Architectural Divide

To understand Span<T>, we must revisit memory allocation strategies, a concept introduced in Book 2 regarding value types, but now elevated to critical importance in Book 3.

The Heap: The Heap is a large, managed pool of memory. When you use new (e.g., new string(...), new byte[]), the CLR reserves space on the Heap.

  • Pros: Flexible size, long lifetime.
  • Cons: Allocation is slow (requires searching for free blocks). Deallocation (Garbage Collection) is expensive and non-deterministic. It causes cache misses because data is scattered.

The Stack: The Stack is a contiguous block of memory allocated per thread. It grows and shrinks strictly in a Last-In-First-Out (LIFO) manner.

  • Pros: Allocation is essentially free (just moving a pointer). It is extremely cache-friendly (data is contiguous).
  • Cons: Size must be known at compile time (mostly). Lifetime is strictly scoped to the method call.

The Constraint of Span<T>: Span<T> is a ref struct. A ref struct can only live on the Stack. It cannot be boxed, it cannot be a field in a class, and it cannot be used in an async state machine. Why? Because if it lived on the Heap, it could potentially point to memory that has been moved or collected by the GC, creating a dangling reference. By confining Span<T> to the Stack, the CLR guarantees that the memory it points to remains valid for the duration of the method call.
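These restrictions are enforced at compile time, not at runtime. A minimal sketch (class and variable names are illustrative, not from the chapter's engine) of what the compiler accepts and rejects:

```csharp
using System;

public class SpanConstraintsDemo
{
    // A ref struct cannot be a field of a class; this line would not compile:
    // private Span<char> _cachedView;   // compile-time error

    public static void Main()
    {
        // Fine: the Span lives and dies within this stack frame.
        Span<int> numbers = stackalloc int[4] { 1, 2, 3, 4 };
        Console.WriteLine(numbers[2]); // 3

        // These would not compile either:
        // object boxed = numbers;            // cannot box a ref struct
        // Func<int> f = () => numbers[0];    // cannot capture a ref struct in a lambda
    }
}
```

Because every illegal use is rejected before the program ever runs, the GC never has to reason about a Span pointing at reclaimed memory.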

Span<T> and Memory<T>: The Tools of Zero-Copy

In our Keyword Extraction Engine, we will process a massive document string. Instead of using string.Substring() (which allocates a new string on the Heap), we use Span<char>.

Span<T>: The Stack-Only View

Span<T> provides a type-safe view over contiguous memory.

using System;

public void AnalyzeWord(ReadOnlySpan<char> word)
{
    // We can inspect the characters without allocating a new string.
    // This method accepts a slice of the original buffer, or the whole buffer.
    if (word.Length > 0 && char.IsUpper(word[0]))
    {
        Console.WriteLine($"Found capitalized word: {word}");
    }
}

public void ProcessDocument()
{
    string massiveDocument = "The quick brown Fox jumps over the lazy Dog.";

    // NO ALLOCATION HERE. 
    // We are creating a view, not a copy.
    ReadOnlySpan<char> docSpan = massiveDocument.AsSpan();

    // We can slice without allocation.
    // This creates a new Span pointing to a subset of the original memory.
    ReadOnlySpan<char> foxSlice = docSpan.Slice(16, 3); // Points to "Fox"

    AnalyzeWord(foxSlice);
}

Memory<T>: The Heap-Capable Token

Sometimes you need to store a reference to memory that outlives a single method call (e.g., passing data to an async method). Span<T> cannot do this. This is where Memory<T> comes in. Memory<T> is a regular struct, so it can live on the Heap (as a field, in a closure, or across an await), but it does not allow direct element access. To touch the data, you request a Span<T> from it (via its .Span property) inside synchronous code.

using System;
using System.Threading.Tasks;

// This method is async, so it cannot declare Span<T> locals
// (the compiler rejects ref struct locals in async methods before C# 13).
public async Task ProcessAsync(Memory<char> memory)
{
    // We can still hand a Span view straight to a synchronous helper,
    // as long as it never crosses an await.
    AnalyzeWord(memory.Span);

    await Task.Delay(100); // Allowed: Memory<T> is a regular struct and survives the await.
}

ArrayPool<T>: Recycling Memory

In AI applications, we often process tensors (multidimensional arrays of numbers). A standard keyword extraction might involve creating arrays to store n-gram frequencies or token indices. If we allocate a new int[] for every document, we generate massive GC pressure.

ArrayPool<T> is a shared pool of arrays. Instead of new int[1024], we rent an array from the pool. When finished, we return it. This is like a library of reusable lab equipment rather than manufacturing new beakers for every experiment.

using System;
using System.Buffers; // Required for ArrayPool

public void CountCharactersInSegment(string segment)
{
    // Rent an array from the shared pool.
    // This avoids a heap allocation.
    char[] buffer = ArrayPool<char>.Shared.Rent(segment.Length);

    try
    {
        // Copy data into the rented buffer.
        segment.AsSpan().CopyTo(buffer);

        // Process the buffer (e.g., count vowels).
        int vowelCount = 0;
        for (int i = 0; i < segment.Length; i++)
        {
            char c = buffer[i];
            if ("aeiou".Contains(c, StringComparison.OrdinalIgnoreCase))
                vowelCount++;
        }

        Console.WriteLine($"Vowels: {vowelCount}");
    }
    finally
    {
        // CRITICAL: Return the array to the pool so it can be reused.
        // Forgetting to return it doesn't crash: the array just becomes
        // ordinary garbage, and the pooling benefit is lost.
        ArrayPool<char>.Shared.Return(buffer);
    }
}

stackalloc: Stack-Based Arrays

For small buffers where the size is known at compile time (or determined dynamically but bounded), stackalloc allocates memory directly on the stack. This is the ultimate zero-allocation technique for temporary buffers.

Warning: Stack space is limited (typically 1MB per thread). Overusing stackalloc causes a StackOverflowException.

using System;

public void ProcessSmallToken(string token)
{
    // Allocate a buffer of 64 chars on the stack.
    // No 'unsafe' needed: stackalloc assigned to Span<T> is safe C# (7.2+).
    // This memory is freed automatically when the method returns.
    Span<char> buffer = stackalloc char[64];

    if (token.Length > buffer.Length)
    {
        // Handle overflow logic (e.g., fall back to ArrayPool<char>).
        return;
    }

    token.AsSpan().CopyTo(buffer);

    // Work only with the portion we actually filled.
    Span<char> filled = buffer.Slice(0, token.Length);

    // We can now manipulate 'filled' with zero GC impact.
    filled.Reverse();
}

System.Numerics.Vector<T>: Hardware Acceleration (SIMD)

In Keyword Extraction, we often need to perform mathematical operations on large arrays of numbers—perhaps calculating the cosine similarity between two word embeddings or normalizing frequency counts. Doing this in a standard for loop processes one number at a time.

SIMD (Single Instruction, Multiple Data) allows the CPU to process multiple data points in a single clock cycle. Modern CPUs have registers (e.g., AVX2, AVX-512) that can hold 256 or 512 bits of data. System.Numerics.Vector<T> abstracts these hardware capabilities.

The AI Context: When building embeddings for keywords, we convert words into vectors of floats (e.g., [0.1, 0.8, 0.3, ...]). To compare thousands of these vectors, we need to calculate the dot product. Using SIMD, we can multiply 8 or 16 float pairs simultaneously.

using System;
using System.Numerics; // Vector<T> and Vector.Dot live here

public float CalculateSimilaritySimd(float[] embeddingA, float[] embeddingB)
{
    int length = embeddingA.Length;
    int vectorSize = Vector<float>.Count; // Depends on CPU (e.g., 8 for AVX2)

    float sum = 0;
    int i = 0;

    // Process in chunks using SIMD registers
    for (; i <= length - vectorSize; i += vectorSize)
    {
        var aVec = new Vector<float>(embeddingA, i);
        var bVec = new Vector<float>(embeddingB, i);

        // Multiply the elements pairwise and horizontally add the products.
        // The JIT lowers this to SIMD multiply/add instructions on supported CPUs.
        sum += Vector.Dot(aVec, bVec);
    }

    // Process remaining elements (tail) scalarly
    for (; i < length; i++)
    {
        sum += embeddingA[i] * embeddingB[i];
    }

    return sum;
}

AI Context: Processing Tensor Buffers

In modern AI (e.g., ONNX Runtime or custom ML models), data is represented as Tensors—multi-dimensional arrays of floats. A single model inference might involve processing a 512-dimensional embedding for a batch of 32 documents.

Using standard C# arrays (float[][]), this requires:

  1. Allocating the outer array.
  2. Allocating 32 inner arrays.
  3. Copying data between buffers.

Using Span<T> and Memory<T>:

  1. We can allocate a single contiguous block of memory (a 1D array) of size 512 * 32.
  2. We use Span<T>.Slice to create "views" of rows or columns without copying data.
  3. We pass these Spans directly to SIMD-accelerated math kernels.

This "Zero-Copy" architecture is essential for real-time AI. If you are running a Large Language Model locally, every millisecond spent allocating memory is a millisecond the GPU sits idle.
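The three steps above can be sketched as follows. The dimensions (32 documents, 512-dimensional embeddings) come from the text; the class and variable names are illustrative:

```csharp
using System;

public class TensorViewDemo
{
    public static void Main()
    {
        const int Rows = 32;   // batch size (documents)
        const int Cols = 512;  // embedding dimension

        // Step 1: one contiguous allocation for the whole batch.
        float[] tensor = new float[Rows * Cols];

        // Step 2: a zero-copy "view" of row 5 -- no data is moved or duplicated.
        Span<float> row5 = tensor.AsSpan(5 * Cols, Cols);

        // Writes through the view land directly in the backing buffer,
        // so step 3 (passing the Span to a math kernel) needs no copying.
        row5[0] = 0.42f;
        Console.WriteLine(tensor[5 * Cols]); // 0.42
    }
}
```

The same pattern generalizes to columns or arbitrary sub-blocks: only the offset arithmetic changes, never the allocation count.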

Visualizing Memory Layout

The following diagram illustrates the difference between fragmented Heap allocations and contiguous Stack/Pool allocations.

A diagram comparing fragmented heap allocations, which scatter data across memory and cause GPU idle time, with contiguous stack or pool allocations, which store data together to maximize processing efficiency.

Summary of Architectural Implications

By adopting Span<T>, Memory<T>, and ArrayPool<T>, we fundamentally change how the Keyword Extraction Engine operates:

  1. Throughput: We eliminate the GC pauses that occur when processing millions of documents. The GC remains idle because we aren't creating garbage.
  2. Latency: By using stackalloc and Span<T>, we ensure that memory operations are deterministic and fast, crucial for real-time keyword extraction in streaming applications.
  3. Vectorization: By aligning our data in contiguous memory blocks (facilitated by these types), we maximize the efficiency of System.Numerics.Vector<T>, allowing us to perform mathematical operations on word embeddings at the hardware level.

This theoretical foundation sets the stage for the practical implementation of our engine, where we will replace standard LINQ operations with high-performance loops and vectorized math to extract keywords with minimal overhead.

Basic Code Example

In the domain of high-performance AI and data manipulation, memory allocation is often the primary bottleneck. When processing massive text corpora or converting raw text into vector embeddings, allocating millions of small objects on the heap leads to frequent Garbage Collection (GC) pauses, halting the CPU.

The following example demonstrates a "Hello World" of high-performance text processing. We will calculate the sum of character frequencies in a string. While simple, this operation is the foundational step in creating embeddings (converting text to numerical vectors). We will achieve this using Span<T> for zero-allocation slicing and System.Numerics.Vector<T> for hardware-accelerated (SIMD) math.

The Problem Context

Imagine you are building a keyword extraction engine that processes terabytes of web pages. You need to convert every page into a numerical vector (an embedding) to compare them. Using standard string.Substring() and foreach loops would create millions of temporary objects, causing the application to pause for garbage collection.

To solve this, we must work directly on memory buffers without copying data. We treat the text not as an immutable string, but as a contiguous block of memory (Span<char>) that can be processed in parallel using the CPU's vector instructions.

High-Performance Code Example

using System;
using System.Numerics; // Required for Vector<T> (SIMD)
using System.Buffers;  // Required for ArrayPool<T>

public class HighPerformanceEmbeddingEngine
{
    public static void Main()
    {
        // 1. Simulate a large block of text (e.g., a document buffer).
        // In a real AI scenario, this might be a memory-mapped file or a tensor buffer.
        string rawText = "The quick brown fox jumps over the lazy dog. " +
                         "The quick brown fox jumps over the lazy dog. " +
                         "The quick brown fox jumps over the lazy dog.";

        // 2. Convert to Span to avoid heap allocations during slicing.
        // 'ReadOnlySpan<char>' is a view into the original string's memory.
        ReadOnlySpan<char> textSpan = rawText.AsSpan();

        // 3. Rent a shared array from the ArrayPool (Zero-Allocation pattern).
        // We use this to store the frequency count of characters 'a' through 'z'.
        // This avoids 'new int[26]' which would immediately put pressure on the GC.
        int[] rentedBuffer = ArrayPool<int>.Shared.Rent(26);

        try
        {
            // Zero out the buffer (required because ArrayPool returns dirty memory).
            Array.Clear(rentedBuffer, 0, rentedBuffer.Length);

            // 4. Process the Span using Hardware Acceleration (SIMD).
            CalculateCharacterFrequenciesSimd(textSpan, rentedBuffer);

            // 5. Output results (Simulating vector embedding generation).
            Console.WriteLine("Character Frequency Vector (Embedding):");
            for (int i = 0; i < 26; i++)
            {
                Console.WriteLine($"'{(char)('a' + i)}': {rentedBuffer[i]}");
            }
        }
        finally
        {
            // 6. CRITICAL: Return the array to the pool.
            // If we forget this, we lose the benefits of pooling and cause memory leaks.
            ArrayPool<int>.Shared.Return(rentedBuffer);
        }
    }

    /// <summary>
    /// Builds the letter histogram, then totals it using SIMD
    /// (Single Instruction, Multiple Data) registers.
    /// </summary>
    private static void CalculateCharacterFrequenciesSimd(ReadOnlySpan<char> text, int[] frequencies)
    {
        // 7. The Hot Path: a plain loop over the Span.
        // We avoid LINQ here: Span<T> is a ref struct and does not implement
        // IEnumerable<T>, so LINQ operators cannot be applied to it at all.
        // Histogram updates are data-dependent writes, which Vector<T> cannot
        // express, so the counting itself stays scalar.
        foreach (char c in text)
        {
            char lower = char.ToLowerInvariant(c);
            if (lower >= 'a' && lower <= 'z')
            {
                frequencies[lower - 'a']++;
            }
        }

        // 8. Hardware-Accelerated Math: sum the histogram with SIMD.
        // Vector<int>.Count depends on the CPU (e.g., 8 for 256-bit AVX2,
        // 4 for 128-bit SSE). Each addition below operates on every lane of
        // the register in a single instruction. (The rented buffer was fully
        // cleared in Main, so summing past index 25 is safe.)
        int vectorSize = Vector<int>.Count;
        Vector<int> accumulator = Vector<int>.Zero;
        int i = 0;

        for (; i <= frequencies.Length - vectorSize; i += vectorSize)
        {
            // Load a chunk of counts into a SIMD register and accumulate.
            accumulator += new Vector<int>(frequencies, i);
        }

        // 9. Horizontal sum: collapse the accumulator lanes into one int.
        int total = Vector.Dot(accumulator, Vector<int>.One);

        // 10. The Tail: process remaining elements that didn't fit in a vector.
        // This ensures correctness for any buffer length.
        for (; i < frequencies.Length; i++)
        {
            total += frequencies[i];
        }

        Console.WriteLine($"Total letters (SIMD sum): {total}");
    }
}

Explanation of the Code

  1. Memory Slicing (Span<T>): We convert the input string into a ReadOnlySpan<char>. Unlike string.Substring(), creating a Span does not allocate a new object on the Heap. It simply creates a lightweight "view" (a pointer + length) into the existing memory of the original string. This allows us to slice "The quick" or "brown fox" without copying data.

  2. Array Pooling (ArrayPool<T>): Instead of new int[26], we rent a buffer from ArrayPool<int>.Shared. The ArrayPool is a singleton that maintains a collection of reusable arrays. When you rent an array, you take one from the pool (or a new one is created if the pool is empty). When you return it, it is available for the next request. This drastically reduces "Gen 0" allocations.

  3. SIMD Vectors (System.Numerics.Vector<T>): The CalculateCharacterFrequenciesSimd method uses Vector<int>. This type utilizes the CPU's SIMD registers (AVX/SSE). Instead of adding numbers one by one, the CPU performs the addition on 4 or 8 integers simultaneously in a single clock cycle. This is the essence of vectorization in AI workloads.

  4. Reinterpreting Memory (Zero-Copy): In high-performance AI, we often reinterpret memory buffers (e.g., treating a float buffer as a byte buffer) without copying data. The supported, type-safe API for this is MemoryMarshal.Cast<TFrom, TTo> (or MemoryMarshal.AsBytes) in System.Runtime.InteropServices, which re-views the same bytes under a different element type. This is how raw text or tensor bytes can be fed directly into numeric kernels without intermediate copies.

Common Pitfalls

1. The "Hidden Allocation" of LINQ on Spans A common mistake is attempting to use LINQ operators like .Where() or .Select() directly on a Span<T>.

  • Why it fails: Standard LINQ operates on IEnumerable<T>. Span<T> does not implement this interface because it is a ref struct (it lives on the stack), so calls like span.Where(...) simply do not compile. Copying the data into an array or List<T> just to use LINQ reintroduces the heap allocations you were trying to avoid.
  • The Fix: Always use standard for or foreach loops when iterating over Span<T> in hot paths.
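A minimal sketch of the fix (names illustrative): the LINQ call is rejected by the compiler, while foreach works because Span<T> exposes its own allocation-free struct enumerator:

```csharp
using System;

public class SpanIterationDemo
{
    public static void Main()
    {
        ReadOnlySpan<char> text = "hello world".AsSpan();

        // text.Count(c => c == 'l');  // does not compile: Span<T> is not IEnumerable<T>

        // The allocation-free alternative: a plain loop.
        int count = 0;
        foreach (char c in text)  // uses Span's struct enumerator, no boxing
        {
            if (c == 'l') count++;
        }

        Console.WriteLine(count); // 3
    }
}
```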

2. Forgetting to Return to ArrayPool If you rent an array from ArrayPool and fail to call .Return(), the array is never recycled; once unreferenced, it becomes ordinary garbage.

  • Why it's bad: The pool does not track outstanding rentals. If arrays are never returned, every subsequent Rent must allocate a fresh array, recreating exactly the GC pressure pooling was meant to eliminate.
  • The Fix: Always wrap ArrayPool usage in a try / finally block to ensure the array is returned, even if an exception occurs.

3. Misinterpreting SIMD Vectors When reinterpreting raw buffers as SIMD vectors, you must be careful about element types.

  • The Risk: If you treat a buffer of float (embeddings) as int, you will corrupt the numerical values due to IEEE 754 bit-pattern interpretation.
  • The Fix: Ensure your vector types match the underlying data representation or use explicit conversion intrinsics provided by the hardware.
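A minimal sketch of safe reinterpretation (values illustrative): MemoryMarshal.Cast and MemoryMarshal.AsBytes re-view the same bytes under a new element type without copying, and the int view shows why bit patterns differ from numeric values:

```csharp
using System;
using System.Runtime.InteropServices;

public class ReinterpretDemo
{
    public static void Main()
    {
        // Four 32-bit floats occupy 16 bytes.
        Span<float> floats = stackalloc float[4] { 1f, 2f, 3f, 4f };

        // Reinterpret the same 16 bytes as raw bytes: zero-copy, size-checked.
        Span<byte> bytes = MemoryMarshal.AsBytes(floats);
        Console.WriteLine(bytes.Length); // 16

        // Reinterpreting as int exposes the IEEE 754 bit patterns,
        // not the numeric values -- exactly the risk described above.
        Span<int> bits = MemoryMarshal.Cast<float, int>(floats);
        Console.WriteLine(bits[0] == 1); // False: 1.0f is stored as 0x3F800000
    }
}
```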

Visualizing Memory Layout

The following diagram illustrates the difference between the standard Heap allocation approach and the High-Performance Span/Pool approach.

Diagram Analysis:

  • Left Side (Red): Shows the standard approach where slicing creates new objects on the Heap. This fragments memory and burdens the Garbage Collector.
  • Right Side (Green): Shows the Span approach. The Span<char> objects are stack-allocated (very fast) and point to the original buffer. The ArrayPool provides a reusable integer buffer, preventing heap churn. This is how we process Tensors in AI—viewing a massive memory block without copying it.

The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.



Code License: All code examples are released under the MIT License. Github repo.

Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.

All textual explanations, original diagrams, and illustrations are the intellectual property of the author. To support the maintenance of this site via AdSense, please read this content exclusively online. Copying, redistribution, or reproduction is strictly prohibited.