Chapter 17: Tokenization Basics - Counting Words vs Tokens

Theoretical Foundations

The efficiency of tokenization in AI pipelines is not merely a matter of code cleanliness; it is a fundamental bottleneck in the throughput of Large Language Models (LLMs). When processing massive datasets for training or inference, the allocation of memory for string objects (immutable and heap-allocated) creates garbage collection pressure: stop-the-world pauses and heap compaction that cripple performance. To understand this, we must look at how C# manages memory and how we can work with it directly to align with the hardware.

The Memory Hierarchy: Stack vs. Heap

In .NET, data allocation occurs primarily in two locations: the Stack and the Heap.

  1. The Stack: A contiguous block of memory with strict Last-In-First-Out (LIFO) semantics. Allocating on the stack is essentially moving a pointer (the stack pointer). It is incredibly fast and deterministic. Crucially, stack memory is automatically reclaimed when a method returns. It is limited in size (typically 1MB per thread) and cannot store variable-sized data like strings.
  2. The Heap: A free-store memory region for dynamic data. Objects allocated here (like string, List<T>, class instances) require a header for metadata and garbage collection (GC) tracking. Allocation is slower, and deallocation is non-deterministic, managed by the GC.

The AI Bottleneck: In a standard tokenization workflow, splitting a sentence like "Hello, AI!" by whitespace creates a string[]. Each substring is a new object on the Heap. If you process a 1GB dataset, you might temporarily allocate 2GB+ of strings, triggering frequent Gen 0/Gen 1 GC pauses. In real-time AI inference, a 50ms GC pause can cause a timeout in a streaming response.
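The cost is easy to observe directly. The sketch below (assuming .NET Core 3.0+ for GC.GetAllocatedBytesForCurrentThread) compares the bytes allocated by Split against a plain span scan; the exact byte counts vary by runtime, but the scan allocates nothing:

```csharp
using System;

public static class AllocationDemo
{
    public static void Main()
    {
        string text = "Hello, AI! Tokenization creates many small strings.";

        long before = GC.GetAllocatedBytesForCurrentThread();
        string[] parts = text.Split(' ');                 // one array + one string per word
        long afterSplit = GC.GetAllocatedBytesForCurrentThread();

        int spaces = 0;
        foreach (char c in text.AsSpan())                 // scanning a span allocates nothing
            if (c == ' ') spaces++;
        long afterScan = GC.GetAllocatedBytesForCurrentThread();

        Console.WriteLine($"Split allocated: {afterSplit - before} bytes for {parts.Length} tokens");
        Console.WriteLine($"Span scan allocated: {afterScan - afterSplit} bytes ({spaces} spaces found)");
    }
}
```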

Zero-Copy Slicing with Span<T>

To solve this, we introduce Span<T>. A Span<T> is a lightweight ref struct (a structure that can only live on the stack). It represents a contiguous region of arbitrary memory. It is essentially a type-safe pointer combined with a length.

Key Characteristics:

  • Zero-Allocation: Creating a Span does not allocate on the heap. It simply points to existing memory.
  • Zero-Copy: Slicing a Span (e.g., getting a substring) does not copy the data; it moves the pointer and adjusts the length.
  • Restrictions: Because it is a ref struct, it cannot be boxed, cannot be a field in a class, and cannot be used in async state machines. This enforces high-performance patterns.

Real-World Analogy: Imagine a massive scroll of text (the Heap). Standard string slicing asks a scribe to copy the relevant sentence onto a new piece of paper (a new Heap allocation). Span<T> is like placing one finger on the start of the sentence and another on the end: you are reading the original scroll without creating a copy. This is vital when processing "Context Windows" in LLMs, where materializing a 200k-token context as separate string objects is prohibitively expensive, while viewing it through a Span is trivial.
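The two-fingers idea translates directly to code; a minimal sketch:

```csharp
using System;

public static class SliceDemo
{
    public static void Main()
    {
        string scroll = "Hello, AI!";
        ReadOnlySpan<char> view = scroll.AsSpan();

        // Slice(7, 2) just adjusts the pointer and length: no characters are copied.
        ReadOnlySpan<char> token = view.Slice(7, 2);

        Console.WriteLine(token.ToString());          // "AI" (ToString() is where a copy finally happens)
        Console.WriteLine(token.SequenceEqual("AI")); // True: compares in place, still zero-copy
    }
}
```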

ReadOnlySpan<char> for Tokenization

In the context of tokenization, we rarely need to modify the input text. Therefore, ReadOnlySpan<char> is the preferred type. It allows us to scan the raw text buffer and identify boundaries (spaces, punctuation) to yield tokens as views into the original text.

Let's contrast the "Old" (Heap-heavy) way with the "New" (Stack-only) way.

using System;
using System.Collections.Generic;
using System.Linq;

public class TokenizerComparison
{
    // OLD WAY: Heap Allocations
    // Returns a List<string>, meaning every token is a new object on the Heap.
    public static List<string> TokenizeNaive(string text)
    {
        // Splits create new string objects immediately.
        return text.Split(new[] { ' ', '.', ',', '!' }, StringSplitOptions.RemoveEmptyEntries)
                   .ToList();
    }

    // NEW WAY: Zero-Copy
    // Returns a List<ReadOnlyMemory<char>>.
    // The character data stays in the original buffer; each token is just
    // an offset + length view, so no per-token string is allocated.
    public static List<ReadOnlyMemory<char>> TokenizeWithSpan(ReadOnlyMemory<char> text)
    {
        var tokens = new List<ReadOnlyMemory<char>>();
        ReadOnlySpan<char> span = text.Span;

        // We iterate manually: LINQ cannot operate on ref structs like Span<T>,
        // and its enumerators would cause heap allocations anyway.
        int start = -1;
        for (int i = 0; i < span.Length; i++)
        {
            char c = span[i];

            // Define delimiters (simplified for this example)
            if (c == ' ' || c == '.' || c == ',' || c == '!')
            {
                if (start != -1)
                {
                    // SLICE: a zero-copy operation. Slicing the Memory just
                    // adjusts an offset and a length; no characters move.
                    // We store ReadOnlyMemory<T> rather than Span<T> because
                    // Span<T> is a ref struct and cannot live inside a List<T>.
                    tokens.Add(text.Slice(start, i - start));
                    start = -1;
                }
            }
            else
            {
                if (start == -1) start = i;
            }
        }

        // Handle the last token
        if (start != -1)
        {
            tokens.Add(text.Slice(start));
        }

        return tokens;
    }
}

Note on the Code: Span<T> cannot be stored in a List<T> because it is a ref struct, and ref structs cannot be used as generic type arguments (nor boxed). ReadOnlyMemory<T> is the bridge: an ordinary struct holding a reference, an offset, and a length, so it can live in collections while still describing a slice of the original buffer. The critical performance gain is that Slice(...) never copies character data. Materialize a token as its own array or string (via ToArray() or ToString()) only when it must outlive the source buffer, which is often necessary only for the final vocabulary lookup.
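The Span/Memory division of labor can be shown in isolation (a self-contained sketch, independent of any particular tokenizer): Memory<T> slices are storable in ordinary collections, and each converts to a Span<T> at the point of use:

```csharp
using System;
using System.Collections.Generic;

public static class MemoryBridgeDemo
{
    public static void Main()
    {
        string text = "zero copy slicing";
        ReadOnlyMemory<char> memory = text.AsMemory();

        // Memory<T> slices can be stored in ordinary collections...
        var tokens = new List<ReadOnlyMemory<char>>
        {
            memory.Slice(0, 4),   // "zero"
            memory.Slice(5, 4),   // "copy"
            memory.Slice(10, 7),  // "slicing"
        };

        // ...and converted back to a Span only where the data is consumed.
        foreach (ReadOnlyMemory<char> token in tokens)
        {
            ReadOnlySpan<char> span = token.Span; // still a view into the original string
            Console.WriteLine(span.Length);
        }
    }
}
```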

Hardware Acceleration with System.Numerics.Vector<T>

Tokenization isn't just about splitting strings; it's often about validating characters (e.g., checking if a character is alphanumeric or a specific delimiter). In high-performance AI, we don't check characters one by one. We check them in blocks using SIMD (Single Instruction, Multiple Data).

System.Numerics.Vector<T> abstracts the CPU's SIMD registers (like SSE, AVX2, AVX-512). A single Vector operation can compare 128, 256, or 512 bits of data in a single CPU cycle.
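Before writing SIMD code, it is worth probing what the runtime actually provides; the numbers printed below are machine-dependent (the comments assume a typical AVX2 machine). Note in particular that Vector<char> is not a supported element type, which is why character data must be reinterpreted as ushort:

```csharp
using System;
using System.Numerics;

public static class SimdProbe
{
    public static void Main()
    {
        // False only on hardware/runtimes without SIMD support.
        Console.WriteLine($"Accelerated: {Vector.IsHardwareAccelerated}");

        // Lane counts: register width divided by element size.
        // e.g. with 256-bit AVX2 registers: 32 bytes, 16 ushorts, 8 ints.
        Console.WriteLine($"Vector<byte>.Count:   {Vector<byte>.Count}");
        Console.WriteLine($"Vector<ushort>.Count: {Vector<ushort>.Count}");
        Console.WriteLine($"Vector<int>.Count:    {Vector<int>.Count}");

        // Note: Vector<char> is NOT supported; char is not a numeric type.
        // Reinterpret ReadOnlySpan<char> as ReadOnlySpan<ushort> instead
        // (e.g. via MemoryMarshal.Cast<char, ushort>).
    }
}
```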

Scenario: Counting the non-delimiter characters in a massive buffer, a simplified stand-in for the alphanumeric filtering done before embedding.

using System;
using System.Numerics;
using System.Runtime.InteropServices; // MemoryMarshal.Cast

public static class VectorizedFilter
{
    // Counts the characters that are not the space delimiter.
    // (A full alphanumeric check would add range comparisons such as
    // Vector.GreaterThanOrEqual / Vector.LessThan; the structure is the same.)
    public static int CountNonSpace(ReadOnlySpan<char> text)
    {
        // Vector<T> does not support char, so we reinterpret the UTF-16 code
        // units as ushort. MemoryMarshal.Cast is a zero-copy reinterpretation.
        ReadOnlySpan<ushort> codes = MemoryMarshal.Cast<char, ushort>(text);

        // Lane count depends on hardware: 8 with 128-bit SSE registers,
        // 16 with 256-bit AVX2 (16-bit elements).
        int vectorLength = Vector<ushort>.Count;

        // A vector filled with the space character to compare against.
        var spaceVector = new Vector<ushort>(' ');

        int spaceCount = 0;
        int i = 0;

        if (Vector.IsHardwareAccelerated)
        {
            // Process the buffer in vector-sized chunks.
            for (; i <= codes.Length - vectorLength; i += vectorLength)
            {
                // Load a chunk of code units into a SIMD register.
                var data = new Vector<ushort>(codes.Slice(i, vectorLength));

                // Lanes equal to ' ' become 0xFFFF; all other lanes become 0.
                Vector<ushort> mask = Vector.Equals(data, spaceVector);

                // Viewed as signed 16-bit lanes, each match is -1, so the
                // negated lane sum is the number of spaces in this chunk.
                // (Vector.Sum requires .NET 6 or later.)
                spaceCount += -Vector.Sum(Vector.AsVectorInt16(mask));
            }
        }

        // Process remaining elements (tail) scalarly.
        for (; i < codes.Length; i++)
        {
            if (codes[i] == ' ') spaceCount++;
        }

        return text.Length - spaceCount;
    }
}

Memory Pools: ArrayPool<T>

In AI pipelines, we frequently allocate temporary buffers to hold token IDs or intermediate tensor data. ArrayPool<T> provides a shared pool of arrays to reduce GC pressure.

Instead of new byte[1024] (which allocates on the Heap and eventually triggers GC), we request an array from the pool:

using System.Buffers;

public class TokenEncoder
{
    public void EncodeBatch(string[] inputs)
    {
        // Rent an array from the shared pool.
        // This is a zero-allocation operation if one is available in the pool.
        int[] rentedBuffer = ArrayPool<int>.Shared.Rent(1024);

        try
        {
            // Use the buffer for heavy computation (e.g., converting tokens to IDs)
            for (int i = 0; i < inputs.Length; i++)
            {
                // Process input into rentedBuffer
                // ...
            }
        }
        finally
        {
            // CRITICAL: Return the array to the pool.
            // Forgetting this does not leak memory in the strict sense (the GC
            // will still collect the array), but it defeats the pool: the next
            // Rent must allocate fresh. Clearing on return prevents data
            // leakage between uses.
            ArrayPool<int>.Shared.Return(rentedBuffer, clearArray: true);
        }
    }
}
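One behavior worth knowing before relying on the pool: Rent returns an array at least as large as requested, typically rounded up to a power of two (an implementation detail that may vary), so the logical element count must be tracked separately:

```csharp
using System;
using System.Buffers;

public static class PoolSizeDemo
{
    public static void Main()
    {
        int[] buffer = ArrayPool<int>.Shared.Rent(1000);
        try
        {
            // The pool rounds up: you asked for 1000 but may get 1024 or more.
            Console.WriteLine($"Requested 1000, got {buffer.Length}");

            // So never use buffer.Length as your element count;
            // carry the logical length alongside the rented array.
            int logicalLength = 1000;
            Span<int> usable = buffer.AsSpan(0, logicalLength);
            usable.Fill(42);
            Console.WriteLine($"Usable slice: {usable.Length} elements");
        }
        finally
        {
            ArrayPool<int>.Shared.Return(buffer);
        }
    }
}
```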

Architectural Implications for AI

Why does this matter for the "Embeddings" chapter?

  1. Tensor Buffers: AI models (like BERT or GPT) rely on Tensors—multi-dimensional arrays of floats. In .NET, these are often stored as float[]. When we tokenize, we eventually map tokens to IDs (integers). These IDs are then looked up in an embedding matrix (a large 2D array).
  2. The Bridge: Using Span<T> allows us to process the raw text string, identify token boundaries, and immediately write the token IDs into the float[] tensor buffer without intermediate allocations.
  3. Vectorization: Once the IDs are in the buffer, System.Numerics.Vector<T> allows us to perform the embedding lookup (matrix multiplication) in parallel across the CPU's vector units.

Visualizing the Data Flow:

The diagram illustrates how Vector<T> leverages CPU vector units to parallelize an embedding lookup (matrix multiplication), efficiently mapping input indices to dense vector representations by processing multiple elements simultaneously.

Summary of Constraints

In this chapter, we strictly adhere to:

  • Zero-Allocation Slicing: Using Span<T> to avoid copying text data.
  • Hardware Acceleration: Using Vector<T> to process characters in parallel.
  • Memory Reuse: Using ArrayPool<T> to manage temporary buffers for token IDs.

We explicitly avoid standard LINQ (Where, Select) on hot paths involving Span<T>, as LINQ relies on interfaces and enumerators that cause boxing or heap allocations, destroying the performance gains we seek. Instead, we use for loops and direct memory manipulation.
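The difference is visible at compile time; a minimal sketch:

```csharp
using System;
using System.Linq;

public static class LinqVsLoop
{
    public static void Main()
    {
        string text = "count the spaces";

        // Does NOT compile: Span<T> is a ref struct, so it cannot satisfy the
        // IEnumerable<T>-based type parameters that LINQ operators require.
        // int n = text.AsSpan().Count(c => c == ' ');

        // LINQ on the string works, but enumerates via an allocated
        // IEnumerator<char>; acceptable off the hot path only.
        int viaLinq = text.Count(c => c == ' ');

        // The hot-path idiom: a plain loop over the span, zero allocations.
        int viaLoop = 0;
        foreach (char c in text.AsSpan())
            if (c == ' ') viaLoop++;

        Console.WriteLine($"{viaLinq} == {viaLoop}"); // 2 == 2
    }
}
```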

Basic Code Example

using System;
using System.Buffers;
using System.Numerics;
using System.Text;

// Example: Zero-Allocation Tokenization and Vectorized Counting
// Context: Processing a high-volume text stream for AI model input.
// Goal: Avoid heap allocations (GC pressure) and leverage SIMD for counting.
public class ZeroAllocationTokenizer
{
    public static void ProcessText()
    {
        // Simulating a large input buffer (e.g., from a file or network stream).
        // In a real scenario, this might be a MemoryMappedFile view or a rented ArrayPool buffer.
        string rawText = "The quick brown fox jumps over the lazy dog. The dog, however, wasn't lazy!";

        // 1. Convert to ReadOnlySpan<char> to enable zero-allocation slicing.
        // NO heap allocation for the string data itself; we are just creating a view.
        ReadOnlySpan<char> textSpan = rawText.AsSpan();

        // 2. Rent a buffer from the ArrayPool to store tokens.
        // This avoids 'new byte[]' allocations on the heap.
        // We estimate an upper bound: raw length is usually enough for tokens.
        char[] tokenBuffer = ArrayPool<char>.Shared.Rent(textSpan.Length);

        // 3. Tokenize using a stack-allocated span for delimiters (SIMD-friendly).
        // We define punctuation as delimiters.
        // Using stackalloc for small, fixed-size arrays avoids heap allocation entirely.
        Span<char> delimiters = stackalloc char[] { ' ', '.', ',', '!', '?' };

        try
        {
            // Tokenization Logic: Manual loop over Span for zero allocation.
            // We iterate through the textSpan, identifying word boundaries.
            int tokenCount = 0;
            int start = 0;

            // We will store tokens as (start index, length) tuples in a stack buffer for this example.
            // In a real high-performance scenario, we might process tokens immediately or stream them.
            Span<(int start, int length)> tokenIndices = stackalloc (int, int)[128]; // Stack allocated indices

            for (int i = 0; i < textSpan.Length; i++)
            {
                char c = textSpan[i];
                bool isDelimiter = false;

                // Check against delimiters (SIMD potential here, but simple loop for clarity)
                foreach (char d in delimiters)
                {
                    if (c == d)
                    {
                        isDelimiter = true;
                        break;
                    }
                }

                if (isDelimiter)
                {
                    if (i > start)
                    {
                        // Found a token
                        if (tokenCount < tokenIndices.Length)
                        {
                            tokenIndices[tokenCount] = (start, i - start);
                            tokenCount++;
                        }
                    }
                    start = i + 1;
                }
            }

            // Handle the last token if text doesn't end with a delimiter
            if (start < textSpan.Length)
            {
                if (tokenCount < tokenIndices.Length)
                {
                    tokenIndices[tokenCount] = (start, textSpan.Length - start);
                    tokenCount++;
                }
            }

            // 4. Vectorized Counting (SIMD)
            // Goal: Count vowels in the tokenized text without allocating strings.
            // Vector<T> does not support char, so we reinterpret the UTF-16 code
            // units as ushort: a zero-copy view over the same memory.
            ReadOnlySpan<ushort> codes =
                System.Runtime.InteropServices.MemoryMarshal.Cast<char, ushort>(textSpan);

            int vowelCount = 0;
            int pos = 0;

            // Lane count depends on hardware (e.g., 16 with 256-bit AVX2 registers).
            int vectorSize = Vector<ushort>.Count;

            // Load each vowel into a vector constant for parallel comparison.
            // Note: Handling case-insensitivity in SIMD requires careful masking
            // or pre-processing. For this example, we count lowercase vowels.
            var aVec = new Vector<ushort>('a');
            var eVec = new Vector<ushort>('e');
            var iVec = new Vector<ushort>('i');
            var oVec = new Vector<ushort>('o');
            var uVec = new Vector<ushort>('u');

            // Process the text in Vector-sized chunks
            for (; pos <= codes.Length - vectorSize; pos += vectorSize)
            {
                var chunk = new Vector<ushort>(codes.Slice(pos, vectorSize));

                // SIMD Comparison: Vector.Equals sets matching lanes to 0xFFFF
                // and all others to 0. A character matches at most one vowel,
                // so the five masks can simply be OR-ed together.
                Vector<ushort> mask = Vector.Equals(chunk, aVec) | Vector.Equals(chunk, eVec)
                                    | Vector.Equals(chunk, iVec) | Vector.Equals(chunk, oVec)
                                    | Vector.Equals(chunk, uVec);

                // Reinterpreted as signed lanes, each match is -1, so negating
                // the lane sum yields this chunk's vowel count (.NET 6+).
                vowelCount += -Vector.Sum(Vector.AsVectorInt16(mask));
            }

            // Process remaining elements (tail) using a standard scalar loop
            for (; pos < textSpan.Length; pos++)
            {
                char c = textSpan[pos];
                if (c == 'a' || c == 'e' || c == 'i' || c == 'o' || c == 'u')
                {
                    vowelCount++;
                }
            }

            // Output results
            Console.WriteLine($"Token Count: {tokenCount}");
            Console.WriteLine($"Total Vowels (Vectorized): {vowelCount}");

            // Demonstrate accessing a specific token without allocation
            // We slice the original span using the indices we found earlier
            if (tokenCount > 0)
            {
                var firstTokenInfo = tokenIndices[0];
                ReadOnlySpan<char> firstToken = textSpan.Slice(firstTokenInfo.start, firstTokenInfo.length);
                Console.Write("First Token: ");
                Console.Out.WriteLine(firstToken); // TextWriter has a ReadOnlySpan<char> overload, so no intermediate string is created
            }
        }
        finally
        {
            // CRITICAL: Return the rented array so the pool can hand it to the
            // next caller. Skipping this does not leak memory (the GC still
            // collects the array), but it defeats pooling by forcing fresh
            // allocations elsewhere in the application.
            ArrayPool<char>.Shared.Return(tokenBuffer);
        }
    }
}

Explanation of the Code

  1. Context Setup: The code simulates an AI preprocessing pipeline where a raw text string needs to be broken down into tokens (vocabulary units) and analyzed (counting vowels) before being converted into a vector embedding.
  2. Span Creation: rawText.AsSpan() creates a ReadOnlySpan<char>. This is a ref struct allocated on the stack. It points to the memory of the original string (which is on the heap) but does not copy the data. This is the foundation of zero-allocation slicing.
  3. Array Pooling: ArrayPool<char>.Shared.Rent() borrows a character array from a shared pool instead of paying for a new char[] heap allocation, reducing Garbage Collector (GC) pressure. In this example the rented buffer is a placeholder for real scratch work (the token indices themselves live on the stack); the rent/return discipline it demonstrates is the important part.
  4. Stack Allocation: Span<char> delimiters = stackalloc char[] { ... } allocates memory directly on the stack. This is extremely fast and has zero GC cost. It is ideal for small, fixed-size data structures used within a single method scope.
  5. Tokenization Loop: We iterate through the textSpan. Instead of creating new string objects for every word (which would be disastrous for performance), we record the start index and length of each token relative to the original span.
  6. SIMD Vectorization:
    • System.Numerics.Vector<T> utilizes hardware SIMD (Single Instruction, Multiple Data) instructions (like SSE/AVX on x86).
    • Because Vector<T> does not support char, we reinterpret the text as ushort lanes and load chunks (e.g., 16 characters at a time when Vector<ushort>.Count is 16) into a CPU register.
    • We compare these chunks against vowel vectors in parallel, producing lane masks.
    • Summing the masks (each matching lane reads as -1 when viewed as signed) aggregates the results via Vector.Sum. This is significantly faster than a scalar if check per character for large datasets.
  7. Memory Management: The finally block ensures the rented array is returned to the pool. Failing to do this causes memory leaks and depletes the pool, forcing new allocations.

Real-World Problem: High-Throughput Tokenization

Imagine you are building a search engine or a Large Language Model (LLM) inference engine. You receive gigabytes of text data daily. Standard string manipulation (e.g., text.Split()) creates millions of temporary string objects. This causes frequent GC pauses, freezing the application.

The Solution: Use Span<T> and ArrayPool<T>.

  • Zero-Copy: We process the data where it lives (in the input buffer).
  • Stack vs Heap: By moving temporary variables to the stack (stackalloc, Span), we keep the heap clean.
  • Vectorization: Counting characters or matching patterns (like finding specific tokens) is accelerated by the CPU's vector units, processing 16+ characters in a single clock cycle.

Common Pitfalls

Pitfall 1: Boxing and LINQ on Spans You cannot use standard LINQ (e.g., tokenSpan.Where(c => c == 'a')) on a Span<T>, because Span<T> is a ref struct and cannot be used as a generic type argument, which is exactly what IEnumerable<T>-based LINQ operators require. The compiler rejects the call outright rather than silently boxing the span onto the heap.

  • Correct Approach: Use for or foreach loops. They are JIT-optimized and zero-allocation.

Pitfall 2: Escaping Stack Memory Span<T> is tied to its stack context. You cannot store a Span in a field of a class or carry it across an await or yield boundary, as the stack frame (or the buffer it points into) might disappear, leaving the span pointing to invalid memory.

  • Correct Approach: Use Memory<T> for scenarios requiring deferred execution or storage, and convert to Span<T> only when you can guarantee the memory is alive (synchronously).
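A minimal sketch of the pattern: Memory<T> lives in the field, and a Span<T> is materialized only inside the synchronous method that consumes it:

```csharp
using System;

public class TokenWindow
{
    // A Span<T> field would not compile: ref structs cannot live on the
    // heap. Memory<T> is the storable equivalent.
    private readonly ReadOnlyMemory<char> _buffer;

    public TokenWindow(ReadOnlyMemory<char> buffer) => _buffer = buffer;

    // Safe: the span exists only for the duration of this synchronous call.
    public int CountSpaces()
    {
        ReadOnlySpan<char> span = _buffer.Span;
        int count = 0;
        foreach (char c in span)
            if (c == ' ') count++;
        return count;
    }
}

public static class TokenWindowDemo
{
    public static void Main()
    {
        var window = new TokenWindow("a b c".AsMemory());
        Console.WriteLine(window.CountSpaces()); // 2
    }
}
```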

Pitfall 3: Forgetting ArrayPool Return Renting from ArrayPool is manual buffer management. If you forget the Return call, the array is not leaked in the strict sense (the GC will eventually collect it), but the pool is defeated: subsequent Rent calls must allocate fresh arrays, re-introducing exactly the GC pressure you were trying to avoid.

  • Correct Approach: Always wrap rented arrays in a try-finally block.

Visualizing Memory Layout

The following diagram illustrates how Span<T> acts as a window over existing memory, avoiding copies.

A diagram here would show Span<T> as a lightweight window over existing memory: a pointer plus a length referring into a contiguous block (an array, a string, or a stack-allocated buffer), with no copy of the underlying data ever being made.

The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.



Code License: All code examples are released under the MIT License. Github repo.

Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.

All textual explanations, original diagrams, and illustrations are the intellectual property of the author. To support the maintenance of this site via AdSense, please read this content exclusively online. Copying, redistribution, or reproduction is strictly prohibited.