
Chapter 6: Parallelism on a Single Core - Introduction to SIMD with Vector

Theoretical Foundations

At the heart of high-performance AI applications in C# lies the relentless pursuit of efficiency, particularly when dealing with the massive streams of numerical data that constitute tokens, embeddings, and model weights. While we previously explored in Book 9 how Span<T> and Memory<T> allow us to manage memory allocation and slicing with zero-copy semantics, effectively reducing the overhead of memory management, we now face a different bottleneck: the raw execution speed of the CPU. Even if memory access is optimized, processing data element-by-element (scalar processing) leaves the vast majority of the processor's execution units idle. This is where Single Instruction, Multiple Data (SIMD) enters the picture, specifically through the portable Vector<T> API in .NET.

To understand SIMD, we must first appreciate the architectural design of modern CPUs. A processor core is not merely a single arithmetic logic unit (ALU) capable of performing one operation per clock cycle. Modern processors are equipped with wide vector registers—often 128-bit, 256-bit, or even 512-bit wide—and specialized vector processing units (AVX, SSE, NEON). These units can perform the same mathematical operation on multiple pieces of data simultaneously. For instance, a 256-bit register can hold eight 32-bit floating-point numbers. A single SIMD instruction can instruct the CPU to add eight pairs of numbers in the time it takes to add a single pair using scalar instructions. This is a theoretical 8x speedup for arithmetic-bound operations, which is critical in AI where operations like calculating dot products, applying activation functions, or normalizing vectors are ubiquitous.

The "why" of SIMD in AI is driven by the sheer volume of operations. When processing a token stream, we often need to compute token embeddings or apply attention mechanisms. These involve massive matrix-vector multiplications. If we process these vectors one element at a time, we are ignoring the parallel processing capabilities of the hardware. By utilizing Vector<T>, we bridge the gap between high-level C# code and the low-level hardware intrinsics, allowing the Just-In-Time (JIT) compiler to emit optimized assembly instructions tailored to the specific CPU architecture running the application.

Consider the analogy of a grocery store checkout. In a scalar scenario (traditional loop), a cashier scans one item at a time, moves it to the bagging area, and then scans the next. This is efficient for a single item but slow for a full cart. SIMD is akin to a specialized scanner that can read the barcodes of eight items simultaneously as they pass under the sensor in a single sweep. The cashier (CPU core) performs one action (sweeping) but processes eight items. This dramatically increases throughput. However, this requires the items to be lined up correctly (data alignment) and the items must be compatible (same data type, e.g., all cans of soup, not a mix of soup and bread). Vector<T> abstracts this complexity, ensuring that we get the maximum "sweep" width supported by the hardware (whether it's a 128-bit, 256-bit, or 512-bit register) without writing platform-specific code.

In the context of AI, this is vital for token processing. When converting tokens to vectors or performing element-wise operations during the inference of a Transformer model, we are often iterating over arrays of float or double. Using Vector<T>, we can process these arrays in chunks equal to the vector width. For example, if we are applying a simple transformation to an embedding vector (e.g., scaling by a learned parameter), a scalar loop would require one multiplication per element. A vectorized loop performs one multiplication instruction that acts on, say, 8 elements at once. This reduces the instruction count and leverages the parallel execution units, directly increasing the tokens-per-second throughput of the AI model.
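To make this concrete, here is a minimal sketch of the scaling example above (the class and method names, and the idea of a single learned scale factor, are illustrative assumptions, not taken from any real model):

```csharp
using System;
using System.Numerics;

public static class EmbeddingOps
{
    // Hypothetical helper: scales every element of an embedding in place.
    public static void Scale(float[] embedding, float factor)
    {
        int i = 0;
        int width = Vector<float>.Count;
        var vFactor = new Vector<float>(factor); // broadcast the factor to all lanes

        // One multiply instruction handles 'width' elements per iteration.
        for (; i <= embedding.Length - width; i += width)
        {
            var v = new Vector<float>(embedding, i) * vFactor;
            v.CopyTo(embedding, i);
        }

        // Scalar tail for lengths that are not a multiple of the vector width.
        for (; i < embedding.Length; i++)
            embedding[i] *= factor;
    }
}
```

Note how the scalar `factor` is broadcast into every lane of `vFactor` once, outside the loop, so the hot path is a single vector multiply and store per chunk.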

The core concept of Vector<T> is its portability and abstraction. Unlike hardware intrinsics (e.g., Avx2.Add), which require checking the CPU capabilities and writing different code paths for different architectures, Vector<T> is resolved at runtime. The JIT compiler determines the maximum safe vector size for the underlying hardware. If the code runs on a CPU supporting AVX2 (256-bit), Vector<float>.Count will be 8. If it runs on a CPU supporting AVX-512 (512-bit), it will be 16. If the hardware is older, it might fall back to SSE (128-bit, 4 elements). This ensures that your AI application runs optimally everywhere without recompilation.

However, this power comes with constraints. Data alignment is a critical factor: for maximum performance, data should be aligned to the vector size. While .NET memory allocation often aligns data well, explicitly managing alignment can avoid "peel" loops—the scalar iterations a compiler emits at the start of a loop to reach an aligned boundary. Furthermore, we must handle "tail" elements—the elements left over after the last full-width vector operation. A robust vectorized algorithm must include a scalar cleanup loop for these remaining elements.

Let us visualize the flow of a vectorized operation compared to a scalar one. In a scalar operation, the CPU fetches, decodes, and executes one instruction per data element. In a vectorized operation, the CPU fetches one instruction that operates on a vector of data.

A single instruction fetched by the CPU operates on a single data element in a scalar operation, whereas in a vectorized operation, that same single instruction is applied to a whole vector of data elements simultaneously.

To implement this in C#, we utilize the System.Numerics namespace. The Vector<T> struct acts as a wrapper around these vector registers. When we load data from an array into a Vector<T>, we perform a contiguous load, moving a block of adjacent memory into the register (a true "gather," by contrast, collects non-contiguous elements). When we perform arithmetic, we are using overloaded operators that map to hardware instructions.

For example, adding two arrays element-wise (a common operation in neural network layers) looks like this conceptually:

using System.Numerics;

public void AddArraysSimd(float[] a, float[] b, float[] result)
{
    // Assumes a, b, and result all have the same length.
    int i = 0;
    int vectorSize = Vector<float>.Count; // e.g., 8 on AVX2

    // Vectorized loop
    for (; i <= a.Length - vectorSize; i += vectorSize)
    {
        var va = new Vector<float>(a, i); // Load vector from array offset
        var vb = new Vector<float>(b, i);
        var vres = va + vb;               // Single instruction for N elements
        vres.CopyTo(result, i);           // Store vector to array
    }

    // Scalar cleanup loop for remaining elements
    for (; i < a.Length; i++)
    {
        result[i] = a[i] + b[i];
    }
}

The "what if" scenarios are crucial for robust AI systems. What if the arrays are not multiples of the vector size? The cleanup loop handles this. What if the data is not aligned? Vector<T> handles unaligned loads, though they might be slightly slower on some architectures. What if we are processing non-contiguous data (e.g., strided access in matrix multiplication)? Vector<T> requires contiguous memory for loading. If we need to gather non-contiguous data, we might need to use hardware intrinsics directly (such as Avx2.GatherVector256), but for standard token processing, data is usually contiguous in arrays or spans.
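As a sketch of the strided-access workaround mentioned above—copying a strided column into a contiguous scratch buffer before vectorizing—consider summing one column of a row-major matrix (the helper name and layout are assumptions for illustration; Vector.Sum requires .NET 6 or later):

```csharp
using System;
using System.Numerics;

public static class StridedOps
{
    // Hypothetical workaround: gather a strided column (one column of a
    // row-major matrix) into a contiguous scratch buffer, then vectorize.
    public static float SumColumn(float[] rowMajor, int rows, int cols, int col)
    {
        // Scalar gather into contiguous memory; Vector<T> can only load contiguous data.
        float[] scratch = new float[rows];
        for (int r = 0; r < rows; r++)
            scratch[r] = rowMajor[r * cols + col];

        // Vectorized sum over the contiguous copy.
        var acc = Vector<float>.Zero;
        int i = 0;
        for (; i <= rows - Vector<float>.Count; i += Vector<float>.Count)
            acc += new Vector<float>(scratch, i);

        float sum = Vector.Sum(acc); // horizontal sum of the accumulator lanes
        for (; i < rows; i++)        // scalar tail
            sum += scratch[i];
        return sum;
    }
}
```

The scalar gather costs one pass over the data, so this only pays off when the vectorized work per element is substantial; for a single pass like this sum, a plain scalar loop may be just as fast.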

In AI, specifically for token processing, we often deal with ReadOnlySpan<T>. We can adapt the vectorized approach to work seamlessly with spans, ensuring we avoid heap allocations. This combination of Span<T> for memory safety and zero-copy slicing, paired with Vector<T> for computational throughput, forms the backbone of high-performance C# AI libraries. By mastering these concepts, we move away from the interpreted nature of Python-based AI (which relies on C/C++ backends like NumPy or PyTorch) and leverage the raw, compiled performance of .NET, enabling real-time inference on edge devices or high-throughput batch processing on servers.
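A span-based variant of the element-wise add can be sketched as follows (the class and method names are illustrative; the Vector<T>(ReadOnlySpan<T>) constructor, available since .NET Core 3.0, reads the first Count elements of the span):

```csharp
using System;
using System.Numerics;

public static class SpanOps
{
    // Sketch: element-wise add over spans, with no heap allocation.
    public static void Add(ReadOnlySpan<float> a, ReadOnlySpan<float> b, Span<float> result)
    {
        if (a.Length != b.Length || result.Length < a.Length)
            throw new ArgumentException("Span lengths must match.");

        int i = 0;
        int width = Vector<float>.Count;

        // Vectorized loop over full-width chunks.
        for (; i <= a.Length - width; i += width)
        {
            var va = new Vector<float>(a.Slice(i));
            var vb = new Vector<float>(b.Slice(i));
            (va + vb).CopyTo(result.Slice(i));
        }

        // Scalar cleanup for the tail.
        for (; i < a.Length; i++)
            result[i] = a[i] + b[i];
    }
}
```

Because the method takes ReadOnlySpan<float>, callers can pass slices of arrays, stackalloc buffers, or pooled memory without copying.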

Basic Code Example

Let's consider a common scenario in AI text processing: calculating the cosine similarity between two word embedding vectors. This operation is fundamental in semantic search, clustering, and recommendation systems. It involves dot products and square roots—operations that are computationally expensive when performed scalar-by-scalar on large datasets.

Here, we will refactor a standard scalar implementation to use Vector<T>, demonstrating how to leverage SIMD (Single Instruction, Multiple Data) to process multiple floating-point numbers simultaneously, even on a single core.

using System;
using System.Numerics; // Required for Vector<T>
using System.Runtime.CompilerServices; // For MethodImplOptions.AggressiveInlining

public class VectorSimilarity
{
    public static void Main()
    {
        // 1. Setup: Create two sample embedding vectors (dimension 128 is common in AI models)
        // In a real scenario, these would be loaded from a model or database.
        const int dimension = 128;
        float[] vectorA = new float[dimension];
        float[] vectorB = new float[dimension];

        // Populate with dummy data (e.g., random values between 0 and 1)
        Random rand = new Random(42);
        for (int i = 0; i < dimension; i++)
        {
            vectorA[i] = (float)rand.NextDouble();
            vectorB[i] = (float)rand.NextDouble();
        }

        Console.WriteLine($"Processing vectors of dimension: {dimension}");
        Console.WriteLine($"Hardware Vector<T> Count (SIMD width): {Vector<float>.Count}");
        Console.WriteLine($"Is Hardware Acceleration Supported: {Vector.IsHardwareAccelerated}");
        Console.WriteLine(new string('-', 40));

        // 2. Execution: Run Scalar and SIMD versions
        double scalarSimilarity = CalculateCosineSimilarityScalar(vectorA, vectorB);
        double simdSimilarity = CalculateCosineSimilaritySimd(vectorA, vectorB);

        // 3. Validation: Compare results
        Console.WriteLine($"Scalar Result: {scalarSimilarity:F10}");
        Console.WriteLine($"SIMD Result:   {simdSimilarity:F10}");
        Console.WriteLine($"Difference:    {Math.Abs(scalarSimilarity - simdSimilarity):e5}");
    }

    /// <summary>
    /// Standard scalar implementation of Cosine Similarity.
    /// Formula: dot(A, B) / (sqrt(sum(A^2)) * sqrt(sum(B^2)))
    /// </summary>
    public static double CalculateCosineSimilarityScalar(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must be the same length.");

        double dotProduct = 0.0;
        double magnitudeA = 0.0;
        double magnitudeB = 0.0;

        // Process one element at a time
        for (int i = 0; i < a.Length; i++)
        {
            dotProduct += a[i] * b[i];
            magnitudeA += a[i] * a[i];
            magnitudeB += b[i] * b[i];
        }

        return dotProduct / (Math.Sqrt(magnitudeA) * Math.Sqrt(magnitudeB));
    }

    /// <summary>
    /// Optimized SIMD implementation using Vector<T>.
    /// </summary>
    public static double CalculateCosineSimilaritySimd(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must be the same length.");

        int i = 0;
        int lastBlockIndex = a.Length - Vector<float>.Count;

        // Accumulators for the vector operations
        // We use Vector<float> which holds as many floats as the hardware supports (e.g., 4, 8, or 16)
        Vector<float> dotProductVec = Vector<float>.Zero;
        Vector<float> magnitudeAVec = Vector<float>.Zero;
        Vector<float> magnitudeBVec = Vector<float>.Zero;

        // 1. Main Loop: Process chunks of data using SIMD
        for (; i <= lastBlockIndex; i += Vector<float>.Count)
        {
            // Load contiguous memory blocks into Vectors
            Vector<float> va = new Vector<float>(a, i);
            Vector<float> vb = new Vector<float>(b, i);

            // Perform operations on the entire vector at once
            dotProductVec += va * vb;
            magnitudeAVec += va * va;
            magnitudeBVec += vb * vb;
        }

        // 2. Horizontal Reduction: Sum the lanes of each accumulator vector.
        // Vector.Sum (available since .NET 6) adds all elements of a vector together.
        // Note that Vector.Dot requires two vector arguments, so on older runtimes
        // Vector.Dot(vec, Vector<float>.One) performs the equivalent reduction.
        double dotProduct = Vector.Sum(dotProductVec);
        double magnitudeA = Vector.Sum(magnitudeAVec);
        double magnitudeB = Vector.Sum(magnitudeBVec);

        // 3. Remainder Loop: Process any leftover elements (if length isn't a multiple of Vector<float>.Count)
        for (; i < a.Length; i++)
        {
            dotProduct += a[i] * b[i];
            magnitudeA += a[i] * a[i];
            magnitudeB += b[i] * b[i];
        }

        // 4. Final Calculation
        return dotProduct / (Math.Sqrt(magnitudeA) * Math.Sqrt(magnitudeB));
    }
}

Detailed Explanation

  1. Namespace and Setup:

    • using System.Numerics: This namespace contains the Vector<T> type, which is the cornerstone of .NET's portable SIMD API.
    • Vector<float>.Count: This property returns the number of float values that fit into a single CPU vector register (e.g., 4 on SSE2, 8 on AVX2, 16 on AVX-512). This determines the width of our parallel processing.
  2. Scalar Implementation (CalculateCosineSimilarityScalar):

    • This method represents the traditional approach. It iterates through the arrays one element at a time.
    • Why it's slow: The CPU fetches one float, multiplies it, adds it to the accumulator, and repeats. This underutilizes the CPU's vector units (ALUs), which are capable of processing much wider data paths.
  3. SIMD Implementation (CalculateCosineSimilaritySimd):

    • Initialization: We declare Vector<float> accumulators initialized to Zero. These act as registers holding partial sums for the dot products and magnitudes.
    • The Main Loop:
      • new Vector<float>(a, i): This constructor loads a block of floats from the array starting at index i. It assumes the data is contiguous in memory. The JIT compiler translates this into a single CPU instruction (e.g., VMOVUPS on x64).
      • va * vb: This is a single operator, but it compiles down to a vectorized multiplication instruction (e.g., VMULPS). It multiplies N pairs of floats simultaneously.
      • +=: The results are accumulated into the vector registers.
    • Reduction (Vector.Sum):
      • After the loop, dotProductVec contains partial sums (e.g., if the Vector width is 4, it holds [sum0, sum1, sum2, sum3]).
      • Vector.Sum performs a horizontal sum (adds all elements together) efficiently, typically via unrolled lane additions. Note that Vector.Dot takes two vectors; Vector.Dot(vec, Vector<float>.One) is the equivalent reduction on runtimes without Vector.Sum.
    • The Remainder Loop:
      • Crucial Edge Case: If the array length is 129 and Vector<float>.Count is 4, the main loop processes indices 0-127. Index 128 remains. We must handle this scalar fallback to ensure correctness.
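The two horizontal-reduction approaches discussed above can be compared directly in a short sketch (the class name is illustrative; Vector.Sum requires .NET 6 or later, while the dot-product form works on any runtime with Vector<T>):

```csharp
using System;
using System.Numerics;

public static class Reductions
{
    // Horizontal sum via dot product with a vector of ones:
    // works on all runtimes that support Vector<T>.
    public static float SumViaDot(Vector<float> v) => Vector.Dot(v, Vector<float>.One);

    // Direct horizontal sum API, available since .NET 6.
    public static float SumViaSum(Vector<float> v) => Vector.Sum(v);
}
```

For a vector whose lanes all hold the same value x, both methods return x * Vector<float>.Count, regardless of the hardware's vector width.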

Visualizing the Data Flow

The following diagram illustrates how Vector<T> processes data compared to scalar operations. Note that the SIMD path utilizes the vector register width to process multiple data points per clock cycle.

[Diagram: scalar vs. SIMD data flow through vector registers]

Common Pitfalls

  1. Ignoring Array Length Alignment (The "Off-by-N" Error):

    • The Mistake: Assuming Vector<T>.Count divides the array length perfectly.
    • The Consequence: If the loop condition is merely i < length instead of i <= length - Vector<float>.Count, the final vector load will read past the end of the array, causing an IndexOutOfRangeException. If you stop too early, you silently skip data.
    • The Fix: Always calculate the remainder and process it with a scalar cleanup loop (as shown in the code). Alternatively, ensure your input data buffers are padded to multiples of Vector<float>.Count.
  2. Misunderstanding Vector.IsHardwareAccelerated:

    • The Mistake: Checking this property to decide whether to use Vector<T> or scalar code.
    • The Reality: Vector<T> is designed to be portable. If hardware acceleration is not available, the runtime falls back to a software implementation (scalar emulation). While slower than true SIMD, it is still valid. You generally do not need to branch your code based on this check; Vector<T> handles it internally.
  3. Data Alignment:

    • The Mistake: Using Vector<T> on unaligned memory pointers expecting maximum performance.
    • The Reality: While Vector<T> handles unaligned loads gracefully on modern x64 processors, performance can suffer slightly if data isn't aligned to 16-byte or 32-byte boundaries. In high-performance AI scenarios, using Span<T> and pinned memory can help align data, though Vector<T>'s constructor abstracts this away for standard arrays.
  4. Premature Optimization without Profiling:

    • The Mistake: Converting every loop to SIMD.
    • The Reality: SIMD introduces overhead (loading/storing registers). For very small arrays (e.g., length < 64), the scalar loop might be faster due to loop overhead and instruction cache pressure. Always profile your specific workload.

The chapter continues with advanced code, exercises and solutions with analysis, you can find them on the ebook on Leanpub.com or Amazon





Code License: All code examples are released under the MIT License. Github repo.

Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.

All textual explanations, original diagrams, and illustrations are the intellectual property of the author. To support the maintenance of this site via AdSense, please read this content exclusively online. Copying, redistribution, or reproduction is strictly prohibited.