
Chapter 12: Calculating Cosine Similarity with C#

Theoretical Foundations

In the realm of AI data preprocessing, we often deal with massive datasets that require cleaning, normalization, and transformation before feeding them into a model. Imagine a high-volume assembly line in a factory. If you pause the entire line to inspect every single component individually before moving to the next step, efficiency plummets. However, if you design a pipeline where the conveyor belt moves continuously and inspection stations apply their checks as the items pass by, you achieve optimal throughput. This is the essence of Deferred Execution in LINQ: the query defines the steps, but the processing happens only when the data is actually needed.

In previous chapters, we explored how System.Numerics.Vector<T> allows us to perform Single Instruction, Multiple Data (SIMD) operations, treating numbers as parallel lanes of data. While Vector<T> is imperative and stateful, LINQ (Language Integrated Query) offers a declarative, functional approach to data manipulation. When building AI applications, we rarely process data in isolation; we filter noise, map raw text to tokens, and shuffle batches for stochastic gradient descent. LINQ provides the syntax to express these transformations as pure functional pipelines, ensuring that data processing logic remains readable, composable, and free from side effects.

The Mechanics of Deferred vs. Immediate Execution

To understand the efficiency of LINQ in AI pipelines, we must distinguish between defining a query and executing it.

Deferred Execution means that the query expression is not evaluated until the result is enumerated (e.g., in a foreach loop or by calling an aggregator like .Count()). The query stores the logic of the operation, not the result.

Immediate Execution forces the query to evaluate immediately and store the results, typically in memory (e.g., .ToList(), .ToArray(), .ToDictionary()).
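The difference is easiest to see side by side: a deferred query re-reads its source on every enumeration, so it observes mutations made after the query was defined, while a .ToList() snapshot is frozen at the moment it runs. A minimal sketch:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredVsSnapshot
{
    public static (int DeferredCount, int SnapshotCount) Demo()
    {
        var numbers = new List<int> { 1, 2, 3 };

        // Deferred: stores only the logic, nothing runs here.
        var deferred = numbers.Where(n => n % 2 == 0);

        // Immediate: executes now and copies the results out.
        var snapshot = numbers.Where(n => n % 2 == 0).ToList();

        numbers.Add(4); // mutate the source AFTER both definitions

        // The deferred query re-reads the source and now sees 4;
        // the snapshot was materialized before the mutation.
        return (deferred.Count(), snapshot.Count);
    }
}
```

Calling Demo() returns a deferred count of 2 (the elements 2 and 4) but a snapshot count of 1 (just the element 2).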

Consider a dataset of raw text documents. We want to filter out empty lines and normalize whitespace.

using System;
using System.Collections.Generic;
using System.Linq;

public class DataPipeline
{
    public static void ProcessDocuments(IEnumerable<string> rawDocs)
    {
        // 1. DEFINING THE QUERY (Deferred Execution)
        // No processing happens here. We are just building a blueprint.
        var validDocsQuery = rawDocs
            .Where(doc => !string.IsNullOrWhiteSpace(doc))
            .Select(doc => doc.Trim().ToLowerInvariant());

        // 2. EXECUTION TRIGGER
        // The pipeline is executed here. If rawDocs changes after this definition
        // but before this loop, the query reflects those changes.
        foreach (var doc in validDocsQuery)
        {
            Console.WriteLine($"Processing: {doc}");
        }

        // 3. IMMEDIATE EXECUTION
        // We force the query to run now and store the results in memory.
        // This creates a snapshot of the data at this specific moment.
        List<string> processedList = validDocsQuery.ToList();
    }
}

Why this matters for AI: In AI training loops, we often stream data from disk. If we used Immediate Execution (.ToList()) on a massive dataset before training, we would exhaust RAM. By keeping the query deferred, we can stream data, preprocess it on the fly, and feed it to the GPU in batches. However, if we need to shuffle the data (which requires knowing the full dataset size), we must switch to Immediate Execution to materialize the collection.
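A hedged sketch of that trade-off (the method names and the batch size are illustrative, not from the chapter): streaming keeps only one batch resident in memory via deferred iteration, while shuffling forces a full materialization first.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StreamingVsMaterialization
{
    // Deferred, streaming path: each batch is produced on demand,
    // so only batchSize elements need to be in memory at once.
    public static IEnumerable<string[]> StreamBatches(
        IEnumerable<string> source, int batchSize)
    {
        var batch = new List<string>(batchSize);
        foreach (var line in source.Where(l => !string.IsNullOrWhiteSpace(l)))
        {
            batch.Add(line.Trim());
            if (batch.Count == batchSize)
            {
                yield return batch.ToArray();
                batch.Clear();
            }
        }
        if (batch.Count > 0) yield return batch.ToArray();
    }

    // Shuffling needs random access to the whole dataset, so we must
    // switch to Immediate Execution and materialize with ToList().
    public static List<string> ShuffleForTraining(
        IEnumerable<string> source, int seed)
    {
        var snapshot = source.ToList(); // full copy in RAM
        var rng = new Random(seed);
        // Fisher-Yates shuffle over the materialized list
        for (int i = snapshot.Count - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1);
            (snapshot[i], snapshot[j]) = (snapshot[j], snapshot[i]);
        }
        return snapshot;
    }
}
```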

Pure Functional Pipelines and Side Effects

A critical constraint in building robust AI systems is immutability. Side effects (modifying global state) lead to bugs that are notoriously difficult to trace in concurrent environments. LINQ encourages a functional style where the input collection is never modified; instead, new sequences are projected.

The Forbidden Pattern (Imperative with Side Effects):

// BAD: Modifying external state inside a query
int counter = 0;
var badQuery = rawDocs.Select(doc => {
    counter++; // Side effect: alters external variable
    return doc.ToUpper(); 
});
This violates the principles of pure functional programming. If badQuery is executed multiple times, counter increments unpredictably. In a parallel processing context (PLINQ), this would cause race conditions and corrupted data.

The Functional Pattern (Pure Transformation):

// GOOD: Pure function. Input -> Output. No external state.
var cleanQuery = rawDocs
    .Where(doc => !string.IsNullOrWhiteSpace(doc))
    .Select((doc, index) => new { Index = index, Text = doc.Trim() });
Here, the Select overload provides the index, allowing us to generate metadata without mutating external variables.
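For the common case that motivates the counter in the first place (how many elements survived the filter), the pure alternative is to let the pipeline itself report the number:

```csharp
using System;
using System.Linq;

public static class PureCounting
{
    public static int CountValid(string[] rawDocs)
    {
        // No external variable is mutated. Count(predicate) triggers
        // execution and returns how many elements pass the filter;
        // running it twice gives the same answer both times.
        return rawDocs.Count(doc => !string.IsNullOrWhiteSpace(doc));
    }
}
```

Because the aggregation is part of the query, it is deterministic across repeated enumerations and safe to parallelize later.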

Parallel Processing with PLINQ

AI data preprocessing is computationally expensive. Tokenization, normalization, and feature extraction are CPU-bound tasks. PLINQ (Parallel LINQ) utilizes multiple cores to accelerate these pipelines.

By calling .AsParallel(), we transform the query into a parallel execution plan. However, this introduces non-determinism regarding order unless explicitly handled.

using System.Linq;

public class ParallelPreprocessor
{
    public static List<string> NormalizeBatch(List<string> batch)
    {
        // AsParallel() partitions the source collection across threads.
        // The order of elements is not guaranteed unless AsOrdered() is used.
        return batch
            .AsParallel()
            .AsOrdered() // Preserves the original sequence order
            .WithDegreeOfParallelism(Environment.ProcessorCount)
            .Where(doc => doc.Length > 0)
            .Select(doc => doc.Normalize(System.Text.NormalizationForm.FormKC))
            .ToList(); // Immediate execution to materialize the result
    }
}

Architectural Implication: In distributed AI training, data sharding is common. PLINQ allows us to mimic this locally by partitioning a dataset across logical cores, simulating a distributed preprocessing step. This is vital for preparing embeddings before they are vectorized using System.Numerics.Vector<T>.
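As a rough sketch of that local-sharding idea (the shardCount parameter and the shard-by-index scheme are illustrative choices, not a prescribed technique), we can assign each document to a shard by its index and let PLINQ normalize the shards across cores:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class LocalSharding
{
    // Partition documents into shardCount shards by index, then
    // normalize each document in parallel and group by shard.
    public static Dictionary<int, List<string>> ShardAndNormalize(
        IReadOnlyList<string> docs, int shardCount)
    {
        return docs
            .Select((doc, i) => (Doc: doc, Shard: i % shardCount))
            .AsParallel()
            .GroupBy(x => x.Shard)
            .ToDictionary(
                g => g.Key,
                g => g.Select(x => x.Doc.Trim()
                        .Normalize(NormalizationForm.FormKC))
                      .ToList());
    }
}
```

Each shard can then be handed to a separate vectorization step, mirroring how a distributed trainer would receive its partition of the data.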

Data Preprocessing Pipelines in AI Context

When building embeddings for semantic similarity (the ultimate goal of this chapter), raw text must pass through a strict pipeline. LINQ acts as the glue between raw data and numerical representation.

The Pipeline Stages:

  1. Ingestion: Reading streams (Deferred).
  2. Cleaning: Filtering noise, removing HTML tags (.Where).
  3. Normalization: Lowercasing, Unicode normalization (.Select).
  4. Tokenization: Splitting strings into words (.SelectMany).
  5. Batching: Grouping tokens into fixed-size vectors (.GroupBy).

Here is a comprehensive example demonstrating a pure functional pipeline for preparing text for vectorization:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public class EmbeddingPipeline
{
    // Represents a raw document
    public record Document(string Id, string Content);

    // Represents a tokenized, normalized entry ready for vectorization
    public record ProcessedToken(string DocId, string Token, int Position);

    public static IEnumerable<ProcessedToken> PrepareForEmbedding(IEnumerable<Document> documents)
    {
        // Define the pipeline (Deferred Execution)
        var pipeline = documents
            // 1. Cleaning: Filter out documents with insufficient content
            .Where(d => !string.IsNullOrEmpty(d.Content) && d.Content.Length > 10)

            // 2. Normalization: Lowercase and remove special characters
            .Select(d => new Document(
                d.Id, 
                Regex.Replace(d.Content.ToLower(), @"[^\w\s]", "")
            ))

            // 3. Tokenization: Flatten documents into individual words using SelectMany
            .SelectMany(d => 
                d.Content.Split(' ', StringSplitOptions.RemoveEmptyEntries)
                    .Select((word, index) => new { Word = word, Index = index })
            , (doc, tokenInfo) => new ProcessedToken(
                doc.Id, 
                tokenInfo.Word, 
                tokenInfo.Index
            ))

            // 4. Filtering: Remove stop words (conceptually)
            .Where(t => t.Token.Length > 2); // Simple filter for demo

        // The pipeline is defined but not executed yet.
        // We can now iterate or convert to list.
        return pipeline;
    }

    public static void ExecutePipeline()
    {
        var rawData = new List<Document>
        {
            new("Doc1", "The quick brown fox!"),
            new("Doc2", "Jumps over the lazy dog."),
            new("Doc3", "") // Empty, will be filtered
        };

        // Execution happens here (Deferred)
        foreach (var token in PrepareForEmbedding(rawData))
        {
            Console.WriteLine($"Token: {token.Token} (Doc: {token.DocId})");
        }
    }
}

Connecting to Cosine Similarity and Embeddings

The theoretical foundation of Cosine Similarity relies on comparing the orientation of two vectors in a multi-dimensional space. In the context of AI, these vectors are embeddings—dense numerical representations of text.

The LINQ pipeline above is the prerequisite step. Once text is tokenized and cleaned, we map these tokens to numerical values (often using a vocabulary lookup or a pre-trained model). This results in two vectors, \(A\) and \(B\).

While System.Numerics.Vector<T> handles the low-level arithmetic for calculating the dot product and magnitude (as explored in subsequent sections of this chapter), LINQ is responsible for the data orchestration.

The Connection:

  1. LINQ prepares the data structure: It ensures that vector \(A\) and vector \(B\) are derived from comparable sources (e.g., same preprocessing steps, same tokenization logic).
  2. Vector performs the calculation: It computes \(\frac{A \cdot B}{\|A\| \|B\|}\).

If the LINQ pipeline is flawed (e.g., includes side effects or inconsistent ordering), the resulting vectors will not accurately represent the semantic meaning of the text. Therefore, mastering LINQ's deferred execution and functional purity is not just a coding style choice; it is a requirement for reproducible AI model inputs.
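To make the connection concrete, the formula above can itself be written with LINQ's Zip and Sum. This is a scalar sketch for clarity; the SIMD version using System.Numerics.Vector<T> is covered elsewhere in this chapter, and the zero-vector convention here (returning 0) is a choice for this example.

```csharp
using System;
using System.Linq;

public static class CosineSimilarity
{
    // Computes (A · B) / (||A|| * ||B||) for two equal-length vectors.
    public static double Compute(double[] a, double[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same dimension.");

        double dot  = a.Zip(b, (x, y) => x * y).Sum(); // A · B
        double magA = Math.Sqrt(a.Sum(x => x * x));    // ||A||
        double magB = Math.Sqrt(b.Sum(x => x * x));    // ||B||

        if (magA == 0 || magB == 0)
            return 0; // convention: similarity with a zero vector is 0

        return dot / (magA * magB);
    }
}
```

Identical orientations yield 1.0, orthogonal vectors yield 0.0, and opposite orientations yield -1.0, regardless of vector magnitude.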

Visualizing the Pipeline

The following diagram illustrates the flow of data through a LINQ-based preprocessing pipeline, highlighting the decision points between deferred and immediate execution.


Summary of Concepts

  • Deferred Execution: Queries are definitions, not results. This allows for efficient streaming of data in AI training loops.
  • Immediate Execution: .ToList() creates a snapshot. This is necessary when the data source is transient or when random access (indexing) is required for batching.
  • Pure Functional Style: Avoiding side effects ensures that data pipelines are deterministic and thread-safe, a necessity when using PLINQ for high-performance preprocessing.
  • PLINQ: Leverages multi-core CPUs to accelerate data cleaning and normalization, reducing the bottleneck before vectorization.

By adhering to these principles, we ensure that the transition from raw text to numerical embeddings is robust, scalable, and mathematically sound—setting the stage for accurate Cosine Similarity calculations.

Basic Code Example

A common real-world problem is to process a raw dataset of text documents, clean them, and then categorize them based on a specific keyword. We want to do this efficiently without writing manual loops.

Here is a simple example using LINQ to build a functional data pipeline. It demonstrates Deferred Execution by separating the definition of the query from its execution.

using System;
using System.Collections.Generic;
using System.Linq;

public class DocumentPreprocessor
{
    public static void Main()
    {
        // 1. The Raw Data Source (Simulating a stream of documents)
        var rawDocuments = new List<string>
        {
            "The quick brown fox jumps over the lazy dog.",
            "  A quick brown DOG is a happy pet.  ", // Contains whitespace and casing issues
            "The lazy fox sleeps all day.",
            "Just a random sentence." // Irrelevant data
        };

        // 2. Define the Processing Pipeline (Deferred Execution)
        // We define the steps here, but no processing happens yet.
        // The 'Where' and 'Select' lambdas are stored as delegates inside
        // lazy iterator objects; nothing runs until the query is enumerated.
        var processingPipeline = rawDocuments
            .Where(doc => doc.Contains("fox")) // Step A: Filter (Clean/Select)
            .Select(doc => doc.Trim().ToLower()) // Step B: Normalize
            .Select(doc => $"[PROCESSED]: {doc}"); // Step C: Format

        Console.WriteLine("Pipeline defined. No processing has occurred yet.\n");

        // 3. Trigger Execution (Immediate Execution)
        // The pipeline is executed only when we iterate (e.g., .ToList()).
        // This is where the data is actually cleaned and transformed.
        List<string> processedResults = processingPipeline.ToList();

        // 4. Output the results
        Console.WriteLine("Execution Triggered. Results:");
        foreach (var result in processedResults)
        {
            Console.WriteLine(result);
        }
    }
}

Visualizing the Pipeline

The data flows through the pipeline lazily but in a fixed order. The "Filter" and "Transform" steps are applied to each element in sequence, yet none of them runs until the ToList() method is called.


Step-by-Step Explanation

  1. Data Initialization: We create a List<string> named rawDocuments. This represents our raw, unstructured data source. It contains messy strings with extra whitespace and varying capitalization.
  2. Pipeline Definition: We define processingPipeline. This variable holds the logic for the operations, not the results.
    • .Where(doc => doc.Contains("fox")): This filters the list. It looks for the substring "fox".
    • .Select(doc => doc.Trim().ToLower()): This transforms the filtered results. It removes whitespace and standardizes the text to lowercase.
    • .Select(doc => $"[PROCESSED]: {doc}"): This adds a label to the data.
  3. Deferred Execution: Crucially, when processingPipeline is defined, nothing is actually computed yet. The Where and Select methods return an IEnumerable<T> that wraps the logic. If you put a breakpoint here, you would see no strings processed.
  4. Immediate Execution: We call .ToList(). This forces the pipeline to execute immediately. It iterates through the source, applies the filter, transforms the data, and stores the final results in a new list in memory.
  5. Output: We iterate over the processedResults list to print the clean data.

Common Pitfalls

Mistake: Modifying External State inside a LINQ Query

A common error when coming from imperative loops is trying to modify a variable defined outside the query. This breaks the functional paradigm and causes unpredictable behavior, especially with Deferred Execution.

// BAD PRACTICE - DO NOT DO THIS
int counter = 0;
var badQuery = rawDocuments.Select(doc => {
    counter++; // Side Effect: Modifying external variable
    return doc.ToUpper();
});

// 'counter' is still 0 here because the query has not been
// enumerated yet; each later enumeration would increment it again.
Console.WriteLine(counter); // Prints 0

Why this fails:

  1. Side Effects: It violates the principle of pure functions. The query should only depend on its input and produce an output, without changing the outside world.
  2. Deferred Execution Risk: If you define the query but don't call .ToList(), the code inside the lambda (including counter++) never runs. If you iterate the query twice, counter increments twice.
  3. PLINQ Issues: If you add .AsParallel() later, multiple threads might try to modify counter simultaneously, causing race conditions and crashes.

Solution: Always calculate values based on the input data or return new objects. Use .Select to transform data, not to update counters.
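One hedged sketch of that solution (the method shape is illustrative): if you genuinely need both the transformed data and a count, derive the count from the materialized result rather than mutating a captured variable. This stays correct even if .AsParallel() is added later, because nothing outside the pipeline is written to.

```csharp
using System;
using System.Linq;

public static class SafeCounting
{
    public static (string[] Upper, int Count) Process(string[] rawDocuments)
    {
        // The transformation is pure: input in, new sequence out.
        // AsOrdered() keeps the original order despite parallelism.
        string[] upper = rawDocuments
            .AsParallel()
            .AsOrdered()
            .Select(doc => doc.ToUpper())
            .ToArray();

        // The count is read from the result, not accumulated as a
        // side effect, so there is no race condition to worry about.
        return (upper, upper.Length);
    }
}
```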

The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.



Code License: All code examples are released under the MIT License. Github repo.

Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.

All textual explanations, original diagrams, and illustrations are the intellectual property of the author. To support the maintenance of this site via AdSense, please read this content exclusively online. Copying, redistribution, or reproduction is strictly prohibited.