Chapter 19: Case Study - Accelerating Cosine Similarity with SIMD and Span
Theoretical Foundations
At the heart of modern Natural Language Processing (NLP) and vector databases lies a fundamental mathematical operation: measuring the similarity between two pieces of text. While large language models (LLMs) generate text, the underlying architecture often relies on comparing numerical representations (embeddings) of that text. The most common metric for this is Cosine Similarity. In the previous book, we discussed how to represent text as vectors using tokenization. Here, we will explore how to compute the geometric relationship between these vectors with maximum efficiency.
The Mathematical Essence of Cosine Similarity
Cosine similarity measures the cosine of the angle between two non-zero vectors in a multi-dimensional space. In the context of AI, these vectors are often high-dimensional embeddings (e.g., 768 dimensions for BERT-base, 1536 for text-embedding-ada-002).
The formula is defined as:

\[ \text{similarity}(A, B) = \cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} \]

Where:
- \(A \cdot B\) is the dot product of vectors \(A\) and \(B\).
- \(\|A\|\) is the Euclidean norm (magnitude) of vector \(A\).
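A quick worked example makes the formula concrete. Take two small vectors chosen so the arithmetic is clean:

```latex
A = (3, 4), \qquad B = (4, 3)
\\[4pt]
A \cdot B = 3 \cdot 4 + 4 \cdot 3 = 24
\\[4pt]
\|A\| = \sqrt{3^2 + 4^2} = 5, \qquad \|B\| = \sqrt{4^2 + 3^2} = 5
\\[4pt]
\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{24}{25} = 0.96
```

A score of 0.96 means the two vectors point in nearly the same direction.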
The Analogy: The Library of Concepts
Imagine a vast library where every book represents a concept. The library is multi-dimensional; shelves represent topics (history, science, art), and the depth of the shelf represents the intensity of the topic.
- Vector A is a book about "Maritime History." It sits on the "History" shelf (high value), touches the "Geography" shelf (medium value), and ignores the "Physics" shelf (zero value).
- Vector B is a book about "Pirate Biographies." It also sits heavily on "History" and "Geography."
- Vector C is a book about "Quantum Mechanics." It sits on the "Physics" shelf.
To find how related two books are, we don't compare their titles; we compare their positions in the library. Cosine similarity calculates the angle between the vectors pointing from the library entrance to these books. If the angle is small (0°), they are identical in topic. If the angle is 90°, they are unrelated (orthogonal). If the angle is 180°, they are opposites.
Why Optimization is Critical: The Token Explosion
In the context of AI, we are not dealing with simple 3-dimensional vectors. We are dealing with Token Vectors. When you process a document, it is broken down into tokens. Each token is mapped to a high-dimensional vector.
Consider a scenario where we are comparing a user query against a database of 1 million document embeddings, each with 1536 dimensions.
- Naive Approach: A standard `double` uses 8 bytes. \(1536 \times 8 = 12,288\) bytes per vector. Comparing two vectors involves reading ~24 KB of memory, performing 1536 multiplications, 1536 additions, and two square roots.
- The Bottleneck: If we process this sequentially for 1 million comparisons, the CPU spends the vast majority of its time moving data from RAM to the CPU cache and waiting for the Arithmetic Logic Unit (ALU) to crunch numbers one by one.
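The arithmetic above is easy to verify with a short sketch (the dimension and comparison counts are the ones assumed in this scenario; the class and method names are invented for illustration):

```csharp
using System;

public static class NaiveCost
{
    // Bytes occupied by one embedding stored as 8-byte doubles.
    public static long BytesPerVector(int dimensions) => dimensions * 8L;

    // Total memory traffic for 'count' comparisons: both operands
    // of every comparison must be read from memory.
    public static long TotalBytes(int dimensions, long count)
        => 2 * BytesPerVector(dimensions) * count;

    public static void Main()
    {
        Console.WriteLine(BytesPerVector(1536));        // 12288 bytes per vector
        Console.WriteLine(TotalBytes(1536, 1_000_000)); // ~24.6 GB of reads for a full scan
    }
}
```

Even before any arithmetic happens, a full scan of the index moves tens of gigabytes through the memory hierarchy, which is why the data-movement cost dominates.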
This is where SIMD (Single Instruction, Multiple Data) and `Span<T>` come into play.
Memory Efficiency: The Role of Span<T>
In previous chapters, we discussed the garbage collector (GC) and the cost of heap allocations. When processing token vectors, a naive implementation might look like this:
// The "Old" Way - High Allocation
public float CalculateSimilarity(float[] vectorA, float[] vectorB)
{
    float dotProduct = 0, normA = 0, normB = 0;
    for (int i = 0; i < vectorA.Length; i++)
    {
        dotProduct += vectorA[i] * vectorB[i];
        normA += vectorA[i] * vectorA[i];
        normB += vectorB[i] * vectorB[i];
    }
    return dotProduct / (MathF.Sqrt(normA) * MathF.Sqrt(normB));
}
If `vectorA` and `vectorB` are coming from a stream of data (like a network packet or a file), creating new arrays for every comparison triggers GC pressure.
Span<T> is a ref struct that represents a contiguous region of arbitrary memory. It allows us to work with slices of arrays, stack memory, or native memory without allocating on the heap.
The Analogy: The Librarian's Hand
Imagine the library (heap memory) is massive. To compare two books, you could ask a librarian to photocopy both books (allocate new arrays) and bring them to your desk. This is slow and creates clutter (garbage).
Using Span<T> is like the librarian simply walking to the shelf and holding the open pages of both books in their hands. No copies are made. The "Span" is the view of the data the librarian is currently holding. It is ephemeral, fast, and zero-allocation.
In AI applications, Span<T> is crucial because it allows us to slice a large embedding matrix (e.g., a batch of 100 documents) and pass references to specific vectors to our calculation engine without copying memory.
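As a minimal sketch of that idea, the helper below (`GetEmbedding` is a name invented for this example) exposes one row of a flat, row-major batch buffer as a `ReadOnlySpan<float>` without copying anything:

```csharp
using System;

public static class EmbeddingBatch
{
    // Zero-allocation view of the index-th row of a row-major [count x dim] buffer.
    public static ReadOnlySpan<float> GetEmbedding(float[] batch, int index, int dim)
        => new ReadOnlySpan<float>(batch, index * dim, dim);

    public static void Main()
    {
        // Two 4-dimensional embeddings packed back-to-back.
        float[] batch = { 1, 2, 3, 4, 5, 6, 7, 8 };

        ReadOnlySpan<float> second = GetEmbedding(batch, 1, 4);
        Console.WriteLine(second[0]); // 5 — a view into 'batch', not a copy
    }
}
```

The span is just a (pointer, length) pair over the existing array, so slicing a batch of thousands of embeddings costs nothing per slice.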
Computational Speed: The Power of SIMD
While Span<T> solves the memory bandwidth issue, SIMD solves the CPU execution issue.
Modern CPUs are not just scalar processors (doing one thing at a time); they are also vector processors. A standard general-purpose register is 64 bits wide, while a SIMD register is much wider: 256 bits with AVX2, and 512 bits with AVX-512.
The Analogy: The Assembly Line
- Scalar Processing (The Old Way): Imagine an assembly line where one worker takes a single widget, paints it, and passes it on. Then the next worker takes the next widget.
- SIMD Processing (The New Way): Imagine the assembly line now uses a wide tray that holds 8 widgets at once. A worker paints all 8 widgets in a single motion (Single Instruction). The throughput is 8x higher.
In C#, the System.Numerics namespace provides types like Vector<T>. When you use Vector<float>, the .NET JIT (Just-In-Time compiler) attempts to emit SIMD instructions (SSE, AVX) if the hardware supports it.
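You can ask the runtime directly what it chose. This snippet prints whether SIMD acceleration is active and how many floats fit in one `Vector<float>` on the current machine (the exact values vary by hardware, so none are asserted here):

```csharp
using System;
using System.Numerics;

public static class VectorCapabilities
{
    public static void Main()
    {
        // True when the JIT emitted real SIMD instructions rather than a software fallback.
        Console.WriteLine($"Hardware accelerated: {Vector.IsHardwareAccelerated}");

        // Lanes per register: 4 on SSE (128-bit), 8 on AVX2 (256-bit).
        Console.WriteLine($"Floats per Vector<float>: {Vector<float>.Count}");
    }
}
```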
How it works for Cosine Similarity: instead of:
- Load `A[0]`, load `B[0]`, multiply, store.
- Load `A[1]`, load `B[1]`, multiply, store.

SIMD does:
- Load `A[0..7]`, load `B[0..7]`, multiply packed, store.
This is particularly effective for Cosine Similarity because the operations are "embarrassingly parallel"—the calculation for index i is completely independent of index j.
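A minimal sketch of a single "multiply packed" step, using `Vector<float>` so the lane count adapts to the hardware (`MultiplyFirstBlock` is a name invented for this example):

```csharp
using System;
using System.Numerics;

public static class PackedMultiply
{
    // Multiplies the first register-width block of two arrays lane-by-lane.
    // The arrays must contain at least Vector<float>.Count elements.
    public static float[] MultiplyFirstBlock(float[] a, float[] b)
    {
        var product = new Vector<float>(a) * new Vector<float>(b); // one packed multiply
        float[] result = new float[Vector<float>.Count];
        product.CopyTo(result);
        return result;
    }

    public static void Main()
    {
        int n = Vector<float>.Count;   // adapt to the current hardware
        float[] a = new float[n];
        float[] b = new float[n];
        for (int i = 0; i < n; i++) { a[i] = i + 1; b[i] = 2f; }

        float[] block = MultiplyFirstBlock(a, b);
        Console.WriteLine(block[0]); // first lane: 1 * 2 = 2
    }
}
```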
Architectural Implications for AI Systems
When building AI systems in C#, such as a local semantic search engine or a RAG (Retrieval-Augmented Generation) pipeline, these optimizations shift the architectural possibilities.
1. Real-time Processing: Without SIMD, comparing a query against a large index might take hundreds of milliseconds, introducing latency in chat interfaces. With AVX2/AVX-512, this can drop to single-digit milliseconds. This enables real-time "type-ahead" semantic search.
2. Throughput Scaling: In server-side scenarios (e.g., an ASP.NET Core API handling embedding requests), raw throughput is money. Optimizing Cosine Similarity allows you to serve more requests per second on the same hardware, reducing cloud costs.
3. Quantization Compatibility:
Modern AI often uses quantization (converting float32 to int8 or float16 to save space). Span<T> and SIMD intrinsics allow us to work directly with these low-precision formats without converting them back to float, preserving memory bandwidth.
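As a scalar sketch of that idea, the dot product below runs directly over `sbyte` (int8-quantized) embeddings and accumulates into an `int`, so the data is never widened back to `float`. A production version would use SIMD widening intrinsics, which are omitted here for clarity:

```csharp
using System;

public static class QuantizedDot
{
    // Dot product over int8-quantized embeddings, accumulated in int.
    public static int DotInt8(ReadOnlySpan<sbyte> a, ReadOnlySpan<sbyte> b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must be of the same length.");

        int sum = 0;
        for (int i = 0; i < a.Length; i++)
            sum += a[i] * b[i]; // sbyte * sbyte promotes to int, no float involved
        return sum;
    }

    public static void Main()
    {
        sbyte[] a = { 1, -2, 3 };
        sbyte[] b = { 4, 5, -6 };
        Console.WriteLine(DotInt8(a, b)); // 4 - 10 - 18 = -24
    }
}
```

Because each element is 1 byte instead of 4, four times as many values fit in every cache line and SIMD register.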
Visualizing the Data Flow
The following diagram illustrates how data moves from storage to the CPU registers for processing. Note the distinction between the logical view (the array) and the physical view (the SIMD registers).
The "What If": Handling Edge Cases
In theoretical math, vectors are perfect. In AI engineering, they are messy.
- Zero Vectors: If a document is empty or a token is unknown, its norm is 0. Division by zero occurs.
  - Solution: We must check whether the magnitude is close to zero (using an epsilon comparison) before dividing.
- Dimension Mismatch: Embeddings from different models (e.g., BERT vs. Ada-002) have different dimensions.
  - Solution: `Span<T>.Length` allows us to check dimensions instantly. If they don't match, similarity is impossible (or requires projection, which is out of scope).
- Hardware Support: Not all CPUs support AVX-512.
  - Solution: .NET's `Vector<T>` abstracts this. It automatically selects the largest supported vector size at runtime. If AVX2 is available, it uses 256-bit registers; otherwise, it falls back to SSE (128-bit) or scalar operations.
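The zero-vector guard can be captured in one small helper; the `1e-12` threshold and the helper name are assumptions you may want to tune for your data:

```csharp
using System;

public static class SafeCosine
{
    const double Epsilon = 1e-12; // assumed threshold; adjust per data

    // Final division step of cosine similarity with a zero-vector guard.
    public static double Divide(double dot, double magA, double magB)
        => (magA < Epsilon || magB < Epsilon)
            ? 0.0                     // define similarity with a zero vector as 0, not NaN
            : dot / (magA * magB);

    public static void Main()
    {
        Console.WriteLine(Divide(24, 5, 5)); // 0.96
        Console.WriteLine(Divide(0, 0, 5));  // 0 — zero vector handled, no NaN
    }
}
```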
Summary of Concepts
To summarize this theoretical foundation, we are combining three specific high-performance concepts to solve the Cosine Similarity problem:
- `Span<T>` (Memory): We treat our token vectors as views into memory, not copies. This is essential for handling large batches of embeddings without exhausting the heap.
- SIMD (Compute): We utilize the CPU's vector registers to perform arithmetic on multiple data points simultaneously. This is the key to reducing the latency of the dot product and norm calculations.
- Cosine Similarity (Logic): The mathematical core that allows us to determine semantic closeness, independent of vector magnitude (length), focusing only on orientation (direction).
In the next section, we will move from theory to implementation, utilizing the System.Runtime.Intrinsics namespace to manually orchestrate these operations for maximum performance.
Basic Code Example
Here is a self-contained, high-performance C# code example demonstrating the core concepts of Span<T> and SIMD vectorization for calculating Cosine Similarity.
using System;
using System.Numerics; // Required for Vector<T> (SIMD)
using System.Runtime.CompilerServices; // Required for AggressiveInlining
public static class CosineSimilarityCalculator
{
    /// <summary>
    /// Calculates the cosine similarity between two vectors using standard loops.
    /// This serves as our baseline for performance comparison.
    /// </summary>
    /// <param name="vectorA">First vector (read-only span).</param>
    /// <param name="vectorB">Second vector (read-only span).</param>
    /// <returns>The cosine similarity score between -1.0 and 1.0.</returns>
    public static double CalculateStandard(ReadOnlySpan<float> vectorA, ReadOnlySpan<float> vectorB)
    {
        if (vectorA.Length != vectorB.Length)
            throw new ArgumentException("Vectors must be of the same length.");

        double dotProduct = 0.0;
        double magnitudeA = 0.0;
        double magnitudeB = 0.0;

        // Standard sequential loop
        for (int i = 0; i < vectorA.Length; i++)
        {
            dotProduct += vectorA[i] * vectorB[i];
            magnitudeA += vectorA[i] * vectorA[i];
            magnitudeB += vectorB[i] * vectorB[i];
        }

        magnitudeA = Math.Sqrt(magnitudeA);
        magnitudeB = Math.Sqrt(magnitudeB);

        // Epsilon guard: avoid dividing by a (near-)zero magnitude.
        if (magnitudeA < 1e-12 || magnitudeB < 1e-12)
            return 0.0;

        return dotProduct / (magnitudeA * magnitudeB);
    }
    /// <summary>
    /// Calculates cosine similarity using SIMD (Single Instruction, Multiple Data)
    /// and Span for zero-allocation memory management.
    /// </summary>
    public static double CalculateSimd(ReadOnlySpan<float> vectorA, ReadOnlySpan<float> vectorB)
    {
        if (vectorA.Length != vectorB.Length)
            throw new ArgumentException("Vectors must be of the same length.");

        // 1. Vector<float> maps to hardware registers
        //    (e.g., 128-bit SSE or 256-bit AVX on x64).
        int length = vectorA.Length;
        int vectorSize = Vector<float>.Count;
        int i = 0;

        // 2. Initialize accumulators using SIMD vectors.
        //    These hold partial sums for the dot product and magnitudes.
        Vector<float> dotProductVec = Vector<float>.Zero;
        Vector<float> magnitudeAVec = Vector<float>.Zero;
        Vector<float> magnitudeBVec = Vector<float>.Zero;

        // 3. Process data in blocks matching the hardware vector width.
        for (; i <= length - vectorSize; i += vectorSize)
        {
            // Load a block of data from memory into CPU registers.
            // 'Unsafe' reads a whole register's worth of floats in one step;
            // Unsafe.AsRef is required because the ReadOnlySpan indexer
            // returns a read-only reference.
            var aBlock = Unsafe.ReadUnaligned<Vector<float>>(
                ref Unsafe.As<float, byte>(ref Unsafe.AsRef(in vectorA[i])));
            var bBlock = Unsafe.ReadUnaligned<Vector<float>>(
                ref Unsafe.As<float, byte>(ref Unsafe.AsRef(in vectorB[i])));

            // 4. Perform SIMD operations:
            //    - Multiply vectors element-wise in parallel.
            //    - Add the results to the running accumulators.
            dotProductVec += aBlock * bBlock;
            magnitudeAVec += aBlock * aBlock;
            magnitudeBVec += bBlock * bBlock;
        }

        // 5. Horizontal Reduction:
        //    SIMD registers are wide (e.g., 8 floats). We need to sum the values
        //    within the register to get a single scalar value.
        double dotProduct = 0.0;
        double magnitudeA = 0.0;
        double magnitudeB = 0.0;
        for (int j = 0; j < vectorSize; j++)
        {
            dotProduct += dotProductVec[j];
            magnitudeA += magnitudeAVec[j];
            magnitudeB += magnitudeBVec[j];
        }

        // 6. Process the remaining elements (the "tail") using scalar operations.
        for (; i < length; i++)
        {
            dotProduct += vectorA[i] * vectorB[i];
            magnitudeA += vectorA[i] * vectorA[i];
            magnitudeB += vectorB[i] * vectorB[i];
        }

        magnitudeA = Math.Sqrt(magnitudeA);
        magnitudeB = Math.Sqrt(magnitudeB);

        // Epsilon guard: avoid dividing by a (near-)zero magnitude.
        if (magnitudeA < 1e-12 || magnitudeB < 1e-12)
            return 0.0;

        return dotProduct / (magnitudeA * magnitudeB);
    }
    public static void Main()
    {
        // Context: Comparing user queries against document embeddings in a search engine.
        // We have two vectors representing semantic meaning.

        // Generate dummy data (e.g., 1024-dimensional embeddings)
        int dimensions = 1024;
        float[] vecAData = new float[dimensions];
        float[] vecBData = new float[dimensions];

        var rng = new Random(42);
        for (int i = 0; i < dimensions; i++)
        {
            vecAData[i] = (float)rng.NextDouble();
            vecBData[i] = (float)rng.NextDouble();
        }

        // Convert to Spans (zero-allocation view over existing memory)
        ReadOnlySpan<float> vecA = vecAData;
        ReadOnlySpan<float> vecB = vecBData;

        Console.WriteLine($"Calculating Cosine Similarity for {dimensions} dimensions...\n");

        // Run Standard Calculation
        var sw = System.Diagnostics.Stopwatch.StartNew();
        double resultStandard = CalculateStandard(vecA, vecB);
        sw.Stop();
        Console.WriteLine($"Standard Loop Result: {resultStandard:F6}");
        Console.WriteLine($"Time Taken: {sw.Elapsed.TotalMilliseconds:F2} ms\n");

        // Run SIMD Calculation
        sw.Restart();
        double resultSimd = CalculateSimd(vecA, vecB);
        sw.Stop();
        Console.WriteLine($"SIMD + Span Result: {resultSimd:F6}");
        Console.WriteLine($"Time Taken: {sw.Elapsed.TotalMilliseconds:F2} ms");

        // Verify correctness
        if (Math.Abs(resultStandard - resultSimd) < 0.00001)
            Console.WriteLine("\nResults match! Optimization successful.");
        else
            Console.WriteLine("\nWarning: Results do not match.");
    }
}
Detailed Line-by-Line Explanation
1. The Standard Baseline (CalculateStandard)
Before optimizing, we establish a baseline using standard sequential loops.
- `ReadOnlySpan<float>`: We use `ReadOnlySpan` instead of `float[]` or `List<float>`. This is a ref struct that provides a type-safe window into memory without allocating new objects. It allows us to pass arrays, stack-allocated memory, or even pointers to the method efficiently.
- The Loop: The loop iterates index by index.
  - `dotProduct += vectorA[i] * vectorB[i]`: standard multiplication and accumulation.
  - `magnitudeA += vectorA[i] * vectorA[i]`: squaring the element for the magnitude calculation.
- Scalar Math: Every operation happens one at a time. If the CPU can process 4 floats in parallel, this code leaves 75% of the hardware idle (on a 128-bit register).
2. The SIMD Implementation (CalculateSimd)
This is where high-performance computing comes into play.
- `Vector<float>`: This is the core abstraction in `System.Numerics`.
  - On a modern x64 CPU with AVX support, `Vector<float>` typically contains 8 floats (256 bits).
  - On older SSE hardware, it contains 4 floats (128 bits).
  - The JIT compiler automatically selects the optimal size based on the hardware the code is running on.
- `Vector<float>.Zero`: Initializes a vector register filled with zeros. This acts as our starting accumulator.
- The SIMD Loop:
  - Blocking: We calculate `vectorSize` (e.g., 8) and iterate `i` in steps of 8.
  - `Unsafe.ReadUnaligned`: Standard element-by-element access involves per-element bounds checking. In high-performance loops this overhead adds up, so we use `Unsafe` to read a whole block of memory directly into a CPU register.
  - `ref Unsafe.As<float, byte>`: This tricks the compiler into treating the float reference as a raw byte address, allowing us to read arbitrary memory blocks.
  - Parallel Math: `dotProductVec += aBlock * bBlock` performs 8 multiplications and 8 additions in a single CPU instruction. This is the essence of SIMD (Single Instruction, Multiple Data).
- Horizontal Reduction:
  - At the end of the loop, `dotProductVec` holds 8 partial sums (e.g., `[sum1, sum2, ..., sum8]`).
  - We cannot simply return this vector; we need a scalar `double`.
  - The `for (int j = 0; j < vectorSize; j++)` loop extracts these values and sums them manually. (Note: .NET 6+ introduced `Vector.Sum()`, but the manual loop is often inlined efficiently and demonstrates the concept clearly.)
- The Tail Loop:
  - If the vector length is 100 and `Vector<float>.Count` is 8, we process 96 elements in the SIMD loop. The remaining 4 elements must be processed scalarly to avoid reading past the end of the array. This is called handling the "tail".
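The manual reduction and the built-in helper can be compared directly. This sketch fills every lane with the same value so the expected sum is obvious:

```csharp
using System;
using System.Numerics;

public static class Reduce
{
    // Manual horizontal reduction: sum the lanes of a SIMD accumulator.
    public static float SumLanes(Vector<float> v)
    {
        float total = 0f;
        for (int j = 0; j < Vector<float>.Count; j++)
            total += v[j];
        return total;
    }

    public static void Main()
    {
        var acc = new Vector<float>(1.5f);                    // every lane holds 1.5
        Console.WriteLine(SumLanes(acc));                     // 1.5 * Vector<float>.Count
        Console.WriteLine(SumLanes(acc) == Vector.Sum(acc));  // matches the .NET 6+ helper: True
    }
}
```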
3. Execution Context (Main)
- Memory Management: We allocate the arrays once. When passing them to methods via `Span`, no heap allocations occur during the calculation. This is critical for AI workloads, where garbage collection (GC) pauses can cause latency spikes.
- Benchmarking: We use `Stopwatch` to measure execution time. In real-world scenarios you would use BenchmarkDotNet for statistical rigor, but this suffices for demonstration.
Visualizing the SIMD Process
The following diagram illustrates how Vector<float> processes 4 elements simultaneously (assuming a 128-bit SSE register).
Common Pitfalls
1. Misaligned Memory Access
When using `Unsafe.ReadUnaligned`, you are bypassing safety checks. If the memory address is not aligned to the vector size (e.g., trying to read a 16-byte vector from an odd memory address), it can cause a hardware exception on some older architectures, or a performance penalty on x64, which handles unaligned reads but more slowly (particularly when a read crosses a cache-line boundary).
- Mitigation: Ensure your backing arrays are aligned, or stick to `Vector.LoadUnsafe`, which handles alignment better in newer .NET versions.
2. Ignoring the "Tail"
A common mistake is looping `for (; i < length; i += vectorSize)` without a final loop for the remaining elements. If your data length isn't a perfect multiple of the vector width, you will either skip the last few elements or read out of bounds (causing a crash or garbage data).
- Mitigation: Always implement the "tail loop" to handle the `length % vectorSize` remaining elements.
3. Premature Allocation
Using `new float[]` inside the calculation method forces the Garbage Collector to work. In AI scenarios processing millions of tokens per second, this creates memory pressure.
- Mitigation: Use `Span<T>` and stack allocation (`stackalloc float[128]`) where possible, or reuse buffers from a pool.
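A sketch of the stack-allocation route (the helper name and the normalize-then-sum operation are invented for illustration); note that `stackalloc` is only safe for small, bounded sizes:

```csharp
using System;

public static class StackBuffer
{
    // Normalizes a query vector into a stack-allocated scratch buffer
    // and sums the result: no heap allocation, nothing for the GC to track.
    public static float SumOfNormalized(ReadOnlySpan<float> v)
    {
        Span<float> scratch = stackalloc float[v.Length]; // assumes small v (stack space!)

        float norm = 0f;
        for (int i = 0; i < v.Length; i++)
            norm += v[i] * v[i];
        norm = MathF.Sqrt(norm);

        float sum = 0f;
        for (int i = 0; i < v.Length; i++)
        {
            scratch[i] = v[i] / norm; // written to the stack, not the heap
            sum += scratch[i];
        }
        return sum;
    }

    public static void Main()
    {
        // (3, 4) has norm 5, so the normalized sum is 0.6 + 0.8 ≈ 1.4.
        Console.WriteLine(SumOfNormalized(new float[] { 3f, 4f }));
    }
}
```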
4. Double vs. Float Precision
`Vector<float>` uses single-precision floating-point numbers. While faster, they have lower precision than `double`. In cosine similarity, small errors usually don't matter, but in cumulative calculations over billions of operations, floating-point error can accumulate.
- Mitigation: If precision is critical, consider using `Vector<double>`, but be aware that the vector width will be halved (e.g., 4 doubles instead of 8 floats on AVX), reducing throughput.
The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.
Code License: All code examples are released under the MIT License. Github repo.
Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.