Chapter 20: Project - Building a Keyword Extraction Engine with LINQ
Theoretical Foundations
In the realm of high-performance AI and data manipulation, memory management is not merely a background task; it is the primary bottleneck. When processing massive text corpora or high-dimensional vectors, the standard approach of allocating new objects for every substring, array slice, or intermediate calculation creates a "garbage collection (GC) tax" that can stall execution for milliseconds or even seconds. This is unacceptable in real-time inference or large-scale batch processing. To address this, we must shift our mental model from "convenient object references" to "precise memory ownership."
The Analogy: The Library vs. The Photocopier
Imagine you are a researcher analyzing a specific paragraph in a 1,000-page encyclopedia.
The Traditional Heap Allocation Approach (The Photocopier): Every time you want to isolate that paragraph to highlight it, you don't just mark the page. You walk to the photocopier, copy the entire page, take the physical copy back to your desk, and highlight the paragraph. If you need to extract just one sentence from that paragraph, you copy the paragraph again. When you are done, you throw the copies in the recycling bin. In a high-performance environment (analyzing the whole encyclopedia), you spend more time walking to the copier and recycling paper than you do actually reading.
The Span<T> Approach (The Bookmark):
Span<T> is like placing a bookmark on the starting word and using your finger to trace the ending word. You haven't copied the text. You haven't allocated new paper. You have simply defined a "view" or a "region" of the existing data. You can slide this view around the page, change its length, or pass it to a colleague without duplicating the underlying content. This is Zero-Allocation Slicing.
Stack vs. Heap: The Architectural Divide
To understand Span<T>, we must revisit memory allocation strategies, a concept introduced in Book 2 regarding value types, but now elevated to critical importance in Book 3.
The Heap:
The Heap is a large, managed pool of memory. When you use new (e.g., new string(...), new byte[]), the CLR reserves space on the Heap.
- Pros: Flexible size, long lifetime.
- Cons: Allocation is slow (requires searching for free blocks). Deallocation (Garbage Collection) is expensive and non-deterministic. It causes cache misses because data is scattered.
The Stack: The Stack is a contiguous block of memory allocated per thread. It grows and shrinks strictly in a Last-In-First-Out (LIFO) manner.
- Pros: Allocation is essentially free (just moving a pointer). It is extremely cache-friendly (data is contiguous).
- Cons: Size must be known at compile time (mostly). Lifetime is strictly scoped to the method call.
The Constraint of Span<T>:
Span<T> is a ref struct. A ref struct can only live on the Stack. It cannot be boxed, it cannot be a field in a class, and it cannot be preserved across an await in an async state machine. Why? Because if it lived on the Heap, it could outlive the memory it points to, or point to memory the GC has moved or collected, creating a dangling reference. By confining Span<T> to the Stack, the compiler guarantees that the memory it views remains valid for the duration of the method call.
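These restrictions are enforced at compile time; a minimal sketch (the class name is illustrative) shows what is and is not allowed:

```csharp
using System;

public class TokenCache
{
    // ILLEGAL: a ref struct cannot be a field of a class, because the class
    // lives on the Heap. The compiler rejects this (error CS8345):
    // private Span<char> _cachedToken;

    // LEGAL: Memory<char> is an ordinary struct, so it may live on the Heap.
    private Memory<char> _cachedToken;

    public void Inspect(char[] buffer)
    {
        // LEGAL: a Span local lives on the Stack and dies with this method call.
        Span<char> view = buffer.AsSpan();
        Console.WriteLine(view.Length);
    }
}
```

The pattern in the class above is the standard workaround: store a Memory<T> field for long-lived state, and materialize a Span<T> from it only inside method bodies.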
Span<T> and Memory<T>: The Tools of Zero-Copy
In our Keyword Extraction Engine, we will process a massive document string. Instead of using string.Substring() (which allocates a new string on the Heap), we use Span<char>.
Span<T>: The Stack-Only View
Span<T> provides a type-safe view over contiguous memory.
using System;
public void AnalyzeWord(ReadOnlySpan<char> word)
{
// We can inspect the characters without allocating a new string.
// This method accepts a slice of the original buffer, or the whole buffer.
if (word.Length > 0 && char.IsUpper(word[0]))
{
Console.WriteLine($"Found capitalized word: {word}");
}
}
public void ProcessDocument()
{
string massiveDocument = "The quick brown Fox jumps over the lazy Dog.";
// NO ALLOCATION HERE.
// We are creating a view, not a copy.
ReadOnlySpan<char> docSpan = massiveDocument.AsSpan();
// We can slice without allocation.
// This creates a new Span pointing to a subset of the original memory.
ReadOnlySpan<char> foxSlice = docSpan.Slice(16, 3); // Points to "Fox"
AnalyzeWord(foxSlice);
}
Memory<T>: The Heap-Capable Token
Sometimes you need to store a reference to memory that outlives a single method call (e.g., passing data to an async method). Span<T> cannot do this. This is where Memory<T> comes in. Memory<T> is a wrapper that can live on the Heap, but it does not allow direct access to the data. To access the data, you read its Span property from synchronous code, or pin it via Pin() when interoperating with unmanaged code.
using System;
using System.Threading.Tasks;
// This method is async, so it cannot accept Span<T> directly.
public async Task ProcessAsync(Memory<char> memory)
{
// Do the Span-based work in a separate synchronous method: before C# 13,
// ref struct locals are not allowed inside async methods at all.
AnalyzeSynchronously(memory);
await Task.Delay(100); // Allowed: Memory<T>, unlike Span<T>, can live in the async state machine.
}
private void AnalyzeSynchronously(Memory<char> memory)
{
// Inside a synchronous method we can freely materialize the Span.
ReadOnlySpan<char> span = memory.Span;
AnalyzeWord(span);
}
ArrayPool<T>: Recycling Memory
In AI applications, we often process tensors (multidimensional arrays of numbers). A standard keyword extraction might involve creating arrays to store n-gram frequencies or token indices. If we allocate a new int[] for every document, we generate massive GC pressure.
ArrayPool<T> is a shared pool of arrays. Instead of new int[1024], we rent an array from the pool. When finished, we return it. This is like a library of reusable lab equipment rather than manufacturing new beakers for every experiment.
using System;
using System.Buffers; // Required for ArrayPool
public void CountCharactersInSegment(string segment)
{
// Rent an array from the shared pool.
// This avoids a heap allocation.
char[] buffer = ArrayPool<char>.Shared.Rent(segment.Length);
try
{
// Copy data into the rented buffer.
segment.AsSpan().CopyTo(buffer);
// Process the buffer (e.g., count vowels).
int vowelCount = 0;
for (int i = 0; i < segment.Length; i++)
{
char c = buffer[i];
if ("aeiou".Contains(c, StringComparison.OrdinalIgnoreCase))
vowelCount++;
}
Console.WriteLine($"Vowels: {vowelCount}");
}
finally
{
// CRITICAL: Return the array to the pool so it can be reused.
// The pool does not track rentals; an unreturned array is simply collected
// by the GC and the pooling benefit is lost.
ArrayPool<char>.Shared.Return(buffer);
}
}
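One behavior of the pool worth calling out: Rent(n) returns an array of at least n elements, and usually more, because the shared pool buckets arrays by size. Always track the logical length yourself rather than using buffer.Length. A small sketch (the exact returned length is an assumption about the shared pool's bucketing):

```csharp
using System;
using System.Buffers;

public static class RentDemo
{
    public static void Main()
    {
        int needed = 26;
        // The pool may hand back a larger array (commonly the next power of two).
        int[] buffer = ArrayPool<int>.Shared.Rent(needed);
        try
        {
            Console.WriteLine($"Requested {needed}, got {buffer.Length}");
            // Loop over 'needed', never over buffer.Length.
            for (int i = 0; i < needed; i++)
                buffer[i] = 0;
        }
        finally
        {
            ArrayPool<int>.Shared.Return(buffer);
        }
    }
}
```

This is why the frequency-counting examples in this chapter index the rented buffer with an explicit count (26) instead of its Length.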
stackalloc: Stack-Based Arrays
For small buffers where the size is known at compile time (or determined dynamically but bounded), stackalloc allocates memory directly on the stack. This is the ultimate zero-allocation technique for temporary buffers.
Warning: Stack space is limited (typically 1MB per thread). Overusing stackalloc can exhaust the stack and terminate the process with a StackOverflowException, which cannot be caught.
using System;
public void ProcessSmallToken(string token) // no 'unsafe' needed: stackalloc into Span<char> is safe C#
{
// Allocate a buffer of 64 chars on the stack.
// This memory is freed automatically when the method returns.
Span<char> buffer = stackalloc char[64];
if (token.Length > 64)
{
// Handle overflow logic
return;
}
token.AsSpan().CopyTo(buffer);
// Operate only on the slice that actually holds the token;
// the rest of the 64-char buffer is padding.
Span<char> used = buffer.Slice(0, token.Length);
// We can now manipulate 'used' with zero GC impact.
used.Reverse();
}
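A common production idiom combines both techniques: stackalloc for small inputs, ArrayPool for large ones, so the fast path never touches the heap and the slow path never allocates garbage. A sketch, assuming a 256-char threshold (the method name and threshold are illustrative):

```csharp
using System;
using System.Buffers;

public static class HybridBufferDemo
{
    public static void NormalizeToken(ReadOnlySpan<char> token)
    {
        const int StackLimit = 256;
        char[]? rented = null;

        // Small tokens: stack allocation. Large tokens: pooled heap array.
        // Either way, no per-call 'new' on the fast path.
        Span<char> buffer = token.Length <= StackLimit
            ? stackalloc char[StackLimit]
            : (rented = ArrayPool<char>.Shared.Rent(token.Length));
        try
        {
            // Lower-case the token into the scratch buffer, then use the result.
            token.ToLowerInvariant(buffer);
            Console.WriteLine(buffer.Slice(0, token.Length).ToString());
        }
        finally
        {
            if (rented != null)
                ArrayPool<char>.Shared.Return(rented);
        }
    }
}
```

Note the try/finally still guards only the rented branch; the stackalloc branch needs no cleanup because the stack memory vanishes when the method returns.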
System.Numerics.Vector<T>: Hardware Acceleration (SIMD)
In Keyword Extraction, we often need to perform mathematical operations on large arrays of numbers—perhaps calculating the cosine similarity between two word embeddings or normalizing frequency counts. Doing this in a standard for loop processes one number at a time.
SIMD (Single Instruction, Multiple Data) allows the CPU to process multiple data points with a single instruction. Modern CPUs provide wide registers through instruction-set extensions such as AVX2 (256-bit) and AVX-512 (512-bit). System.Numerics.Vector<T> abstracts these hardware capabilities.
The AI Context:
When building embeddings for keywords, we convert words into vectors of floats (e.g., [0.1, 0.8, 0.3, ...]). To compare thousands of these vectors, we need to calculate the dot product. Using SIMD, we can multiply 8 or 16 float pairs simultaneously.
using System;
using System.Numerics; // Vector<T> and Vector.Dot live in System.Numerics
public float CalculateSimilaritySimd(float[] embeddingA, float[] embeddingB)
{
int length = embeddingA.Length;
int vectorSize = Vector<float>.Count; // Depends on CPU (e.g., 8 for AVX2)
float sum = 0;
int i = 0;
// Process in chunks using SIMD registers
for (; i <= length - vectorSize; i += vectorSize)
{
var aVec = new Vector<float>(embeddingA, i);
var bVec = new Vector<float>(embeddingB, i);
// Multiply elements and accumulate
// On supporting hardware this lowers to vector multiply and horizontal-add instructions
sum += Vector.Dot(aVec, bVec);
}
// Process remaining elements (tail) scalarly
for (; i < length; i++)
{
sum += embeddingA[i] * embeddingB[i];
}
return sum;
}
AI Context: Processing Tensor Buffers
In modern AI (e.g., ONNX Runtime or custom ML models), data is represented as Tensors—multi-dimensional arrays of floats. A single model inference might involve processing a 512-dimensional embedding for a batch of 32 documents.
Using standard C# arrays (float[][]), this requires:
- Allocating the outer array.
- Allocating 32 inner arrays.
- Copying data between buffers.
Using Span<T> and Memory<T>:
- We allocate a single contiguous block of memory (a 1D array) of size 512 * 32.
- We use Span<T>.Slice to create "views" of rows or columns without copying data.
- We pass these Spans directly to SIMD-accelerated math kernels.
This "Zero-Copy" architecture is essential for real-time AI. If you are running a Large Language Model locally, every millisecond spent allocating memory is a millisecond the GPU sits idle.
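The layout described above can be sketched in a few lines; the dimensions (32 documents x 512 floats) match the running example, and the class name is illustrative:

```csharp
using System;

public static class TensorDemo
{
    const int Rows = 32;   // batch of 32 documents
    const int Cols = 512;  // 512-dimensional embeddings

    public static void Main()
    {
        // One contiguous allocation for the whole batch
        // (instead of an outer array plus 32 inner arrays).
        float[] tensor = new float[Rows * Cols];

        // A zero-copy "view" of row 5: just a pointer + length into 'tensor'.
        Span<float> row5 = tensor.AsSpan(5 * Cols, Cols);
        row5.Fill(0.5f);

        // The underlying storage was modified through the view.
        Console.WriteLine(tensor[5 * Cols]);       // 0.5 (written via the slice)
        Console.WriteLine(tensor[4 * Cols + 511]); // 0 (previous row untouched)
    }
}
```

Each row view produced this way can be handed straight to a SIMD kernel such as the CalculateSimilaritySimd method shown earlier.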
Visualizing Memory Layout
The following diagram illustrates the difference between fragmented Heap allocations and contiguous Stack/Pool allocations.
Summary of Architectural Implications
By adopting Span<T>, Memory<T>, and ArrayPool<T>, we fundamentally change how the Keyword Extraction Engine operates:
- Throughput: We eliminate the GC pauses that occur when processing millions of documents. The GC remains idle because we aren't creating garbage.
- Latency: By using stackalloc and Span<T>, we ensure that memory operations are deterministic and fast, crucial for real-time keyword extraction in streaming applications.
- Vectorization: By aligning our data in contiguous memory blocks (facilitated by these types), we maximize the efficiency of System.Numerics.Vector<T>, allowing us to perform mathematical operations on word embeddings at the hardware level.
This theoretical foundation sets the stage for the practical implementation of our engine, where we will replace standard LINQ operations with high-performance loops and vectorized math to extract keywords with minimal overhead.
Basic Code Example
In the domain of high-performance AI and data manipulation, memory allocation is often the primary bottleneck. When processing massive text corpora or converting raw text into vector embeddings, allocating millions of small objects on the heap leads to frequent Garbage Collection (GC) pauses, halting the CPU.
The following example demonstrates a "Hello World" of high-performance text processing: counting character frequencies in a string. While simple, this operation is a foundational step in creating embeddings (converting text to numerical vectors). We will achieve it using Span<T> for zero-allocation iteration and ArrayPool<T> for buffer reuse, and we will see where SIMD (System.Numerics.Vector<T>) helps and where it does not.
The Problem Context
Imagine you are building a keyword extraction engine that processes terabytes of web pages. You need to convert every page into a numerical vector (an embedding) to compare them. Using standard string.Substring() and foreach loops would create millions of temporary objects, causing the application to pause for garbage collection.
To solve this, we must work directly on memory buffers without copying data. We treat the text not as an immutable string, but as a contiguous block of memory (Span<char>) that can be processed in parallel using the CPU's vector instructions.
High-Performance Code Example
using System;
using System.Numerics; // Required for Vector<T> (SIMD)
using System.Buffers; // Required for ArrayPool<T>
public class HighPerformanceEmbeddingEngine
{
public static void Main()
{
// 1. Simulate a large block of text (e.g., a document buffer).
// In a real AI scenario, this might be a memory-mapped file or a tensor buffer.
string rawText = "The quick brown fox jumps over the lazy dog. " +
"The quick brown fox jumps over the lazy dog. " +
"The quick brown fox jumps over the lazy dog.";
// 2. Convert to Span to avoid heap allocations during slicing.
// 'ReadOnlySpan<char>' is a view into the original string's memory.
ReadOnlySpan<char> textSpan = rawText.AsSpan();
// 3. Rent a shared array from the ArrayPool (Zero-Allocation pattern).
// We use this to store the frequency count of characters 'a' through 'z'.
// This avoids 'new int[26]' which would immediately put pressure on the GC.
int[] rentedBuffer = ArrayPool<int>.Shared.Rent(26);
try
{
// Zero out the buffer (required because ArrayPool returns dirty memory).
Array.Clear(rentedBuffer, 0, rentedBuffer.Length);
// 4. Process the Span in a tight, allocation-free loop.
CalculateCharacterFrequenciesSimd(textSpan, rentedBuffer);
// 5. Output results (Simulating vector embedding generation).
Console.WriteLine("Character Frequency Vector (Embedding):");
for (int i = 0; i < 26; i++)
{
Console.WriteLine($"'{(char)('a' + i)}': {rentedBuffer[i]}");
}
}
finally
{
// 6. CRITICAL: Return the array to the pool.
// If we forget, the array is simply garbage-collected and we lose the
// benefit of pooling (the pool will allocate fresh arrays for future rents).
ArrayPool<int>.Shared.Return(rentedBuffer);
}
}
/// <summary>
/// Calculates character frequencies over a Span with zero heap allocations.
/// Note: a frequency histogram requires data-dependent "scatter" writes
/// (frequencies[c - 'a']++), which Vector<T> cannot express, so the hot loop
/// stays scalar. SIMD shines where the same operation applies uniformly to
/// every element, as in the dot-product example earlier in this chapter.
/// </summary>
private static void CalculateCharacterFrequenciesSimd(ReadOnlySpan<char> text, int[] frequencies)
{
// 7. The Hot Path: Looping over the Span.
// We avoid LINQ here. Span<T> is a ref struct and cannot implement
// IEnumerable<T>, so LINQ operators do not even compile against it;
// a plain 'for' loop is both the only option and the fastest one.
for (int i = 0; i < text.Length; i++)
{
char c = text[i];
// 8. ASCII range check, writing straight into the rented buffer.
if (c >= 'a' && c <= 'z')
{
frequencies[c - 'a']++;
}
}
}
}
Explanation of the Code
- Memory Slicing (Span<T>): We convert the input string into a ReadOnlySpan<char>. Unlike string.Substring(), creating a Span does not allocate a new object on the Heap. It simply creates a lightweight "view" (a pointer + length) into the existing memory of the original string. This allows us to slice "The quick" or "brown fox" without copying data.
- Array Pooling (ArrayPool<T>): Instead of new int[26], we rent a buffer from ArrayPool<int>.Shared. The ArrayPool is a singleton that maintains a collection of reusable arrays. When you rent an array, you take one from the pool (or a new one is created if the pool is empty). When you return it, it is available for the next request. This drastically reduces "Gen 0" allocations.
- The Scalar Hot Loop: The CalculateCharacterFrequenciesSimd method iterates the Span with a plain for loop. A frequency histogram requires data-dependent writes (frequencies[c - 'a']++), a "scatter" pattern that Vector<T> cannot express, so the hot loop stays scalar; the dot-product example earlier shows where Vector<T> genuinely accelerates the math.
- Reinterpreting Memory: In high-performance AI, we often reinterpret memory buffers without copying data (e.g., treating a byte buffer as a float buffer when feeding raw data into a model). The supported, safe API for this is MemoryMarshal.Cast<TFrom, TTo> in System.Runtime.InteropServices, which reinterprets a Span's bits in place.
Common Pitfalls
1. The "Hidden Allocation" of LINQ on Spans
A common mistake is attempting to use LINQ operators like .Where() or .Select() directly on a Span<T>.
- Why it fails: Standard LINQ operates on IEnumerable<T>. Span<T> does not implement this interface because it is a ref struct (it lives on the stack), so LINQ calls on a Span simply do not compile. Workarounds that copy the Span into an array or string just to enable LINQ reintroduce the heap allocations we set out to avoid.
- The Fix: Always use standard for or foreach loops when iterating over Span<T> in hot paths.
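As a concrete illustration of the fix, here is a small helper (the names are illustrative) that counts uppercase characters with a plain loop over the Span:

```csharp
using System;

public static class SpanLoopDemo
{
    public static int CountUppercase(ReadOnlySpan<char> text)
    {
        // text.Count(char.IsUpper) would not compile: Span<T> has no LINQ.
        int count = 0;
        foreach (char c in text) // foreach works: Span<T> exposes a struct enumerator
        {
            if (char.IsUpper(c))
                count++;
        }
        return count;
    }

    public static void Main()
    {
        Console.WriteLine(CountUppercase("The quick brown Fox".AsSpan())); // prints 2
    }
}
```

The struct enumerator means this foreach compiles to the same allocation-free code as an index-based for loop.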
2. Forgetting to Return to ArrayPool
If you rent an array from ArrayPool and fail to call .Return(), the array simply falls out of the pool.
- Why it's bad: The pool cannot reuse an array it never gets back. The GC eventually reclaims it, but every subsequent Rent() may have to allocate a fresh array, so you pay the very allocation cost you were trying to avoid.
- The Fix: Always wrap ArrayPool usage in a try/finally block to ensure the array is returned, even if an exception occurs.
3. Misinterpreting SIMD Vectors
In the example, we copy char data into a rented int buffer. In a real keyword extraction engine, you must be careful about data types whenever buffers are reinterpreted.
- The Risk: If you treat a buffer of float (embeddings) as int, you will corrupt the numerical values: the int you read is the raw IEEE 754 bit pattern of the float, not its numeric value.
- The Fix: Ensure your vector types match the underlying data representation, or use explicit conversion APIs such as Vector.ConvertToInt32 rather than reinterpreting bits.
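When you genuinely need to reinterpret a buffer, the supported API is MemoryMarshal.Cast, which changes the element type of a Span in place without copying. A minimal sketch of both the capability and the pitfall:

```csharp
using System;
using System.Runtime.InteropServices;

public static class CastDemo
{
    public static void Main()
    {
        // 16 raw bytes, e.g. read from a file or network into a pooled buffer.
        byte[] raw = new byte[16];
        BitConverter.TryWriteBytes(raw.AsSpan(0, 4), 1.0f);

        // Reinterpret the same memory as four floats: zero-copy,
        // the length is recomputed from the element sizes (16 bytes / 4 = 4).
        Span<float> floats = MemoryMarshal.Cast<byte, float>(raw);
        Console.WriteLine(floats.Length); // 4
        Console.WriteLine(floats[0]);     // 1 (the float we wrote above)

        // Pitfall: Cast preserves BIT PATTERNS, not values. Casting the same
        // bytes to Span<int> yields the IEEE 754 bits of 1.0f (1065353216),
        // not the integer 1.
    }
}
```

Writes through the cast Span go straight to the original buffer, which is exactly the zero-copy behavior tensor pipelines rely on.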
Visualizing Memory Layout
The following diagram illustrates the difference between the standard Heap allocation approach and the High-Performance Span/Pool approach.
Diagram Analysis:
- Left Side (Red): Shows the standard approach where slicing creates new objects on the Heap. This fragments memory and burdens the Garbage Collector.
- Right Side (Green): Shows the Span approach. The Span<char> values are stack-allocated (very fast) and point to the original buffer. The ArrayPool provides a reusable integer buffer, preventing heap churn. This is how we process tensors in AI: viewing a massive memory block without copying it.
The chapter continues with advanced code, exercises and solutions with analysis, you can find them on the ebook on Leanpub.com or Amazon
Code License: All code examples are released under the MIT License. Github repo.
Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.