
Chapter 14: Distance Algorithms (Euclidean vs Manhattan)

Theoretical Foundations

Distance is the fundamental currency of similarity in high-dimensional spaces. When we represent data as vectors—whether they are word embeddings, customer feature profiles, or image latent representations—we must quantify how "close" two vectors are to one another. This quantification drives clustering, classification (k-NN), and anomaly detection.

In this chapter, we explore two foundational distance algorithms: Euclidean and Manhattan. While they share the same input (vectors), their geometric interpretations and computational behaviors differ significantly, influencing their suitability for specific AI pipelines.

The Vector Foundation: Coordinates as Data

Before defining distance, we must establish the vector representation. In previous books, we utilized List<T> for general collections. However, for mathematical operations, we require a structure that enforces dimensionality and supports arithmetic operations.

In C#, we can model a vector as an immutable sequence of doubles. Immutability is critical for functional purity; operations on a vector should return a new vector rather than mutating the existing one.

using System;
using System.Collections.Generic;
using System.Linq;

public record Vector(params double[] Coordinates)
{
    public int Dimension => Coordinates.Length;

    // Element-wise addition
    public static Vector operator +(Vector a, Vector b)
    {
        if (a.Dimension != b.Dimension) throw new InvalidOperationException("Dimension mismatch");
        var result = new double[a.Dimension];
        for (int i = 0; i < a.Dimension; i++)
            result[i] = a.Coordinates[i] + b.Coordinates[i];
        return new Vector(result);
    }

    // Element-wise subtraction
    public static Vector operator -(Vector a, Vector b)
    {
        if (a.Dimension != b.Dimension) throw new InvalidOperationException("Dimension mismatch");
        var result = new double[a.Dimension];
        for (int i = 0; i < a.Dimension; i++)
            result[i] = a.Coordinates[i] - b.Coordinates[i];
        return new Vector(result);
    }

    // Scalar multiplication
    public static Vector operator *(Vector v, double scalar)
    {
        var result = new double[v.Dimension];
        for (int i = 0; i < v.Dimension; i++)
            result[i] = v.Coordinates[i] * scalar;
        return new Vector(result);
    }
}
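A quick usage sketch of the record above; the expected values follow directly from the element-wise definitions:

```csharp
var a = new Vector(1.0, 2.0, 3.0);
var b = new Vector(4.0, 6.0, 8.0);

var sum    = a + b;    // (5, 8, 11)
var diff   = b - a;    // (3, 4, 5)
var scaled = a * 2.0;  // (2, 4, 6)
```

One caveat: because `Coordinates` is an array, the compiler-generated record equality compares array references, not contents, so two vectors with identical coordinates are not `Equals`. If value equality matters in your pipeline, override `Equals`/`GetHashCode` to compare element-wise.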

1. Euclidean Distance: The Straight Line

Euclidean distance is the "ordinary" straight-line distance between two points that we learn in geometry. It is derived from the L2 Norm (or Euclidean norm) of the difference vector.

The Formula: For two vectors \(A = (a_1, a_2, ..., a_n)\) and \(B = (b_1, b_2, ..., b_n)\): $$ d(A, B) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2} $$

Geometric Interpretation: Imagine a bird flying directly from point A to point B. The bird ignores terrain and flies in a straight line through the air. Euclidean distance measures this direct path.

Computational Characteristics:

  • Sensitivity: It is sensitive to the magnitude of the vectors. If one vector has large coordinate values, the squared differences will dominate the sum.
  • Cost: It requires squaring differences, summing them, and computing a square root. This is computationally more expensive than Manhattan distance.

Functional Implementation: We can implement this as a LINQ pipeline that sums the squared differences. Note that we strictly avoid side effects; the query computes the sum purely.

public static class DistanceMetrics
{
    public static double Euclidean(Vector a, Vector b)
    {
        // Guard explicitly: Zip would otherwise silently truncate
        // to the shorter sequence on a dimension mismatch.
        if (a.Dimension != b.Dimension) throw new InvalidOperationException("Dimension mismatch");

        // We use Zip to pair coordinates (a_i, b_i),
        // then select the squared difference,
        // and finally sum and take the square root.
        return Math.Sqrt(
            a.Coordinates
             .Zip(b.Coordinates, (x, y) => x - y)
             .Select(diff => diff * diff)
             .Sum()
        );
    }
}

2. Manhattan Distance: The Grid Walk

Manhattan distance (also known as Taxicab geometry or L1 distance) measures the distance between two points by summing the absolute differences of their Cartesian coordinates. It is derived from the L1 Norm.

The Formula: $$ d(A, B) = \sum_{i=1}^{n} |a_i - b_i| $$

Geometric Interpretation: Imagine a taxi driving in a city laid out in a grid (like Manhattan). The taxi cannot cut through buildings; it must drive along the grid lines (streets) to get from A to B. The distance is the sum of the blocks traveled horizontally and vertically.

Computational Characteristics:

  • Sensitivity: It is less sensitive to outliers than Euclidean distance because it does not square the differences; large deviations contribute only linearly to the sum.
  • Cost: It only requires subtraction, absolute value, and summation. It avoids the expensive square root operation, making it faster in high-dimensional spaces.

Functional Implementation:

public static double Manhattan(Vector a, Vector b)
{
    // Guard explicitly: Zip would otherwise silently truncate
    // to the shorter sequence on a dimension mismatch.
    if (a.Dimension != b.Dimension) throw new InvalidOperationException("Dimension mismatch");

    return a.Coordinates
            .Zip(b.Coordinates, (x, y) => Math.Abs(x - y))
            .Sum();
}
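A small check of both metrics side by side, assuming `Euclidean` and `Manhattan` both live in the `DistanceMetrics` class above. The values come from the classic 3-4-5 right triangle:

```csharp
var p = new Vector(0.0, 0.0);
var q = new Vector(3.0, 4.0);

double l2 = DistanceMetrics.Euclidean(p, q); // sqrt(9 + 16) = 5.0
double l1 = DistanceMetrics.Manhattan(p, q); // |3| + |4|    = 7.0
```

Note that for any pair of vectors the L1 distance is at least the L2 distance: the taxi can never beat the bird.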

3. Comparative Analysis: When to Use Which?

The choice between Euclidean and Manhattan is not arbitrary; it depends on the data distribution and the dimensionality.

Feature             Euclidean (L2)                                      Manhattan (L1)
Geometry            Straight line (as the crow flies)                   Grid path (taxicab)
High Dimensions     Becomes less meaningful (curse of dimensionality)   Often more robust in high dimensions
Outliers            Squared errors amplify outliers                     Linear errors dampen outliers
Computational Cost  Higher (squaring + square root)                     Lower (absolute value + sum)

AI Application Context:

  • Use Euclidean for: Physical spatial data (e.g., GPS coordinates), or when the magnitude of the vector components is meaningful and normally distributed. It is the default for K-Means clustering.
  • Use Manhattan for: High-dimensional sparse data (e.g., text embeddings where many dimensions are zero), or when features are independent. It is often preferred for k-NN in high-dimensional spaces because the distance distribution tends to be less skewed.
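The metric choice can even change which neighbor counts as "nearest". A minimal illustration, assuming the `Vector` and `DistanceMetrics` definitions from earlier in the chapter:

```csharp
var target = new Vector(0.0, 0.0);
var a = new Vector(3.0, 3.0); // diagonal neighbor
var b = new Vector(0.0, 5.0); // axis-aligned neighbor

// Euclidean favors the diagonal point:
//   d2(target, a) = sqrt(18) ≈ 4.24  <  d2(target, b) = 5.0
// Manhattan favors the axis-aligned point:
//   d1(target, a) = 6.0              >  d1(target, b) = 5.0
double e2a = DistanceMetrics.Euclidean(target, a);
double e2b = DistanceMetrics.Euclidean(target, b);
double m1a = DistanceMetrics.Manhattan(target, a);
double m1b = DistanceMetrics.Manhattan(target, b);
```

Here a k-NN classifier with k = 1 would pick `a` under L2 but `b` under L1 for the same dataset.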

4. Deferred Execution and Functional Pipelines

In the context of building AI applications, we often deal with massive datasets. Calculating distances for every pair in a dataset is an \(O(N^2)\) operation. We must be mindful of Deferred Execution.

In C#, LINQ queries are not executed until the data is enumerated (e.g., via .ToList(), .ToArray(), or a foreach loop). This allows us to build complex query pipelines without incurring immediate computational cost.

Example: Pairwise Distance Calculation Suppose we have a list of Vector objects representing data points. We want to compute a distance matrix.

public static IEnumerable<(int IdA, int IdB, double Distance)> ComputeDistanceMatrix(
    IEnumerable<Vector> dataset, 
    Func<Vector, Vector, double> metric)
{
    // Materialize the dataset to avoid multiple enumeration
    // This is an IMMEDIATE execution step
    var dataPoints = dataset.Select((v, index) => new { Index = index, Vector = v })
                            .ToList();

    // This is a DEFERRED execution query
    // It generates a query plan but does not run it yet
    var query = 
        from a in dataPoints
        from b in dataPoints
        where a.Index < b.Index // Avoid duplicates and self-comparison
        select (a.Index, b.Index, metric(a.Vector, b.Vector));

    return query;
}
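A usage sketch for the method above: with three points the query yields exactly 3 = n(n-1)/2 unique pairs, and nothing is computed until we enumerate (the sample points are arbitrary, chosen so the distances come out round):

```csharp
var points = new List<Vector>
{
    new Vector(0.0, 0.0),
    new Vector(3.0, 4.0),
    new Vector(6.0, 8.0)
};

var pairs = ComputeDistanceMatrix(points, DistanceMetrics.Euclidean);

// Enumeration triggers the deferred query here:
foreach (var (idA, idB, dist) in pairs)
    Console.WriteLine($"({idA}, {idB}) -> {dist:F2}");
// (0, 1) -> 5.00
// (0, 2) -> 10.00
// (1, 2) -> 5.00
```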

Critical Note on Side Effects: When using LINQ for AI pipelines (e.g., normalizing data), we must avoid side effects.

  • Bad: Modifying a global counter inside a .Select.
  • Good: Returning a new normalized vector from the .Select.

Example: Normalization Pipeline (Z-Score) Normalization is crucial for distance algorithms. If one feature ranges from 0-1 and another from 0-1000, the Euclidean distance will be dominated by the second feature. We normalize by calculating the mean and standard deviation.

public static IEnumerable<Vector> NormalizeDataset(IEnumerable<Vector> dataset)
{
    // Materialize to calculate statistics (Immediate)
    var matrix = dataset.ToList();

    // Calculate Mean for each dimension
    int dimensions = matrix.First().Dimension;
    var means = Enumerable.Range(0, dimensions)
                          .Select(dim => matrix.Average(v => v.Coordinates[dim]))
                          .ToArray();

    // Calculate Standard Deviation for each dimension
    var stdDevs = Enumerable.Range(0, dimensions)
                            .Select(dim => 
                            {
                                var dimValues = matrix.Select(v => v.Coordinates[dim]).ToArray();
                                var mean = means[dim];
                                var sumSquares = dimValues.Sum(val => Math.Pow(val - mean, 2));
                                return Math.Sqrt(sumSquares / dimValues.Length);
                            })
                            .ToArray();

    // Apply transformation (Deferred until enumeration)
    return matrix.Select(v => 
    {
        var normalizedCoords = new double[v.Dimension];
        for (int i = 0; i < v.Dimension; i++)
        {
            // Avoid division by zero
            normalizedCoords[i] = stdDevs[i] == 0 ? 0 : (v.Coordinates[i] - means[i]) / stdDevs[i];
        }
        return new Vector(normalizedCoords);
    });
}
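A quick sanity check on the pipeline above: after z-score normalization each dimension should have a mean of approximately 0 and a standard deviation of approximately 1 (the sample values are made up, assuming the `Vector` record from earlier):

```csharp
var raw = new List<Vector>
{
    new Vector(0.0,  100.0),
    new Vector(10.0, 200.0),
    new Vector(20.0, 300.0)
};

var normalized = NormalizeDataset(raw).ToList(); // ToList() forces execution

double meanDim0 = normalized.Average(v => v.Coordinates[0]); // ≈ 0.0
double maxAbs   = normalized.Max(v => Math.Abs(v.Coordinates[0])); // ≈ 1.22
```

For the first dimension the raw values 0, 10, 20 map to roughly -1.22, 0, 1.22, so both features now contribute on the same scale.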

5. Parallelization with PLINQ

For large-scale AI datasets, sequential processing is inefficient. PLINQ (Parallel LINQ) allows us to utilize multiple cores by simply adding .AsParallel().

Visualizing the Pipeline: The following diagram illustrates the flow of data through a functional pipeline, highlighting where parallelization occurs.

This diagram illustrates a functional pipeline where data flows through sequential processing steps, with the `AsParallel()` method transforming the pipeline into a concurrent execution model that splits the data source to process elements simultaneously across multiple threads before merging the results.

Optimized Distance Calculation: By applying .AsParallel(), we distribute the distance calculations across available CPU cores. This is particularly effective for the \(O(N^2)\) pairwise comparison problem.

public static List<(int IdA, int IdB, double Distance)> ComputeParallelDistances(
    IEnumerable<Vector> dataset)
{
    var dataPoints = dataset.Select((v, i) => new { Id = i, Vector = v }).ToList();

    // Parallelize the outer loop
    var results = dataPoints
        .AsParallel() // Enables multi-threading
        .WithDegreeOfParallelism(Environment.ProcessorCount)
        .SelectMany(a => 
            dataPoints.Where(b => b.Id > a.Id) // Inner loop
                      .Select(b => (a.Id, b.Id, DistanceMetrics.Euclidean(a.Vector, b.Vector)))
        )
        .ToList(); // Immediate execution to capture results

    return results;
}
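Because the PLINQ pipeline above is unordered, the result list may arrive in any order; its content, however, must match the sequential version. A small sanity check on the pair count:

```csharp
var data = Enumerable.Range(0, 100)
                     .Select(i => new Vector(i, i * 2.0))
                     .ToList();

var parallel = ComputeParallelDistances(data);

// n(n-1)/2 unique pairs regardless of thread scheduling:
Console.WriteLine(parallel.Count); // 4950
```

If deterministic ordering matters downstream, add `.AsOrdered()` immediately after `.AsParallel()`; it preserves source order at a modest throughput cost.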

Summary

In this section, we established the theoretical foundations of distance metrics. We moved from concrete vector representations to abstract distance calculations using functional LINQ pipelines. We distinguished between Euclidean (L2) and Manhattan (L1) norms, recognizing that Euclidean emphasizes magnitude (squared differences) while Manhattan emphasizes rank (absolute differences). Finally, we utilized deferred execution to build efficient pipelines and PLINQ to parallelize the heavy computational load, ensuring our AI applications scale effectively.

Basic Code Example

Let's start with a relatable problem: recommending a movie to a user based on their viewing history. We have a small dataset of users and their ratings for two genres: Sci-Fi and Romance. Each user is represented as a 2-dimensional vector [Sci-Fi Rating, Romance Rating]. Our goal is to find the user most similar to a target user (e.g., "Alice") using distance metrics.

We will implement Euclidean and Manhattan distance calculations using a functional LINQ pipeline. This example emphasizes deferred execution (defining the query) versus immediate execution (materializing the results) and avoids side effects.

The Code Example

using System;
using System.Collections.Generic;
using System.Linq;

public static class DistanceCalculator
{
    // 1. Define a simple record to hold user data (Vector representation).
    // Using 'record' for immutability aligns with functional programming principles.
    public record UserVector(string Name, double SciFiRating, double RomanceRating);

    public static void Main()
    {
        // 2. The Dataset: A collection of user vectors.
        // In a real scenario, this might come from a database or CSV file.
        var users = new List<UserVector>
        {
            new UserVector("Alice",  9.0, 2.0), // Target user: loves Sci-Fi, hates Romance
            new UserVector("Bob",    8.5, 2.5), // Very similar to Alice
            new UserVector("Charlie",3.0, 9.0), // Opposite taste
            new UserVector("Diana",  8.0, 3.0)  // Similar to Alice
        };

        // 3. Identify the target user (Alice).
        // We use First() which is an immediate execution to fetch the specific object.
        var targetUser = users.First(u => u.Name == "Alice");

        // 4. Define the Distance Calculation Functions (Pure Functions).
        // These calculate the distance between two vectors (u1 and u2).

        // Euclidean Distance: Straight-line distance (Pythagorean theorem).
        // Formula: sqrt(sum((u1_i - u2_i)^2))
        Func<UserVector, UserVector, double> euclideanDistance = (u1, u2) =>
        {
            // We project the differences into a sequence, square them, sum them, and sqrt.
            var differences = new[] 
            { 
                u1.SciFiRating - u2.SciFiRating, 
                u1.RomanceRating - u2.RomanceRating 
            };

            // Immediate Execution: .Sum() processes the collection immediately.
            double sumOfSquares = differences.Select(d => d * d).Sum();
            return Math.Sqrt(sumOfSquares);
        };

        // Manhattan Distance: Sum of absolute differences (Grid-like path).
        // Formula: sum(|u1_i - u2_i|)
        Func<UserVector, UserVector, double> manhattanDistance = (u1, u2) =>
        {
            var differences = new[] 
            { 
                Math.Abs(u1.SciFiRating - u2.SciFiRating), 
                Math.Abs(u1.RomanceRating - u2.RomanceRating) 
            };

            // Immediate Execution: .Sum() processes the collection immediately.
            return differences.Sum();
        };

        // 5. Build the LINQ Pipeline (Deferred Execution).
        // We define the query here, but it does NOT execute yet.
        // This pipeline filters out the target user and calculates distances.
        var similarityQuery = users
            .Where(u => u.Name != targetUser.Name) // Filter: Exclude Alice herself
            .Select(u => new 
            { 
                Name = u.Name, 
                // Calculate both distances for comparison
                Euclidean = euclideanDistance(u, targetUser), 
                Manhattan = manhattanDistance(u, targetUser) 
            });

        // 6. Materialize the results (Immediate Execution).
        // The query executes here because .ToList() forces iteration.
        var results = similarityQuery.ToList();

        // 7. Display Results using a functional pipeline (ForEach is forbidden, so we project to strings).
        // We use string.Join to aggregate the output without side effects in the query.
        var outputLines = results
            .Select(r => $"User: {r.Name,-8} | Euclidean: {r.Euclidean:F2} | Manhattan: {r.Manhattan:F2}")
            .ToList();

        Console.WriteLine($"Comparing to Target: {targetUser.Name} [Sci-Fi: {targetUser.SciFiRating}, Romance: {targetUser.RomanceRating}]");
        Console.WriteLine(string.Join(Environment.NewLine, outputLines));
    }
}

Step-by-Step Explanation

  1. Data Representation (UserVector): We define a record named UserVector. In functional programming, immutability is key. A record provides value-based equality and immutability by default. This vector contains two dimensions (features): Sci-Fi rating and Romance rating.

  2. Dataset Initialization: We create a List<UserVector> containing four users. Notice that "Alice" has high Sci-Fi and low Romance ratings. "Bob" and "Diana" are close to her, while "Charlie" is far away.

  3. Target Selection: We use users.First(u => u.Name == "Alice"). This is an Immediate Execution method. It iterates the list immediately to find the first match. We need the actual object reference (or its values) to pass into our distance calculations.

  4. Defining Pure Functions: We define two Func delegates: euclideanDistance and manhattanDistance.

    • Euclidean: Represents the straight-line distance. It squares differences, sums them, and takes the square root.
    • Manhattan: Represents the distance along axes at right angles (like navigating city blocks). It sums the absolute differences.
    • Critical Note: These functions are pure. They take inputs and return an output without modifying any external state (no side effects).
  5. The LINQ Pipeline (Deferred Execution): The variable similarityQuery is defined using standard LINQ syntax (Where, Select).

    • Deferred Execution: At this exact moment, no calculations are performed. The code merely constructs a "blueprint" of how to process the data when requested.
    • Where(u => u.Name != targetUser.Name): Filters the list to remove "Alice" so she isn't compared to herself.
    • Select(...): Projects the data. For every remaining user, it creates a new anonymous object containing the Name and the calculated distances.
  6. Materializing Results: The line var results = similarityQuery.ToList(); triggers the execution. ToList() is an Immediate Execution operator. It iterates through the source, runs the Where filter, executes the Select projection (calling our distance functions), and stores the results in a new list.

  7. Output Generation: To avoid using foreach (which is imperative and often involves side effects like modifying a counter), we use another LINQ chain. We project the calculation results into formatted strings and then join them with string.Join for display. This keeps the entire flow declarative.

Visualizing the Data Flow

The following diagram illustrates how data moves through the LINQ pipeline, highlighting the split between Deferred and Immediate execution.

The diagram visually traces the path of a LINQ query from its declarative definition through deferred operations like `Where` and `Select` to the final immediate execution trigger, such as `ToList()` or `foreach`, highlighting where data processing is postponed versus where it is actually materialized.

Common Pitfalls

1. Confusing Deferred and Immediate Execution A common mistake is defining a LINQ query inside a loop without materializing it.

  • The Mistake:
    foreach (var u in users) {
        var query = users.Where(x => x.Name != u.Name); // Deferred: nothing runs yet
        var count = query.Count(); // Immediate: the filter executes here
    }
    
    Count() forces execution, and it does so every time it is called. If you enumerate the same IEnumerable variable more than once (for example, calling Count() and then iterating the query again), the filter re-runs from scratch on each enumeration, causing hidden performance costs. Materialize with .ToList() when you need the results repeatedly, and always be aware of when the data is actually being processed.

2. Side Effects in Select

  • The Mistake:
    int counter = 0;
    var badQuery = users.Select(u => {
        counter++; // SIDE EFFECT! Dangerous in deferred execution.
        return u.Name;
    });
    
    In deferred execution, counter might not be updated when you expect, or it might update multiple times unpredictably if the query is executed more than once. In PLINQ (parallel execution), this causes race conditions. Never modify external variables inside a LINQ lambda.
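The side-effect-free alternative is to let the pipeline carry the counter for you via the indexed `Select` overload, then materialize once:

```csharp
var users = new[] { "Alice", "Bob", "Charlie" };

// Good: the index is produced by the pipeline itself; no external state is mutated
var indexedNames = users
    .Select((name, i) => $"{i}: {name}")
    .ToList(); // materialize once; re-enumeration cannot corrupt a counter
// ["0: Alice", "1: Bob", "2: Charlie"]
```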

3. Forgetting Normalization The example uses raw ratings (0-10). In real-world vector math, features often have different scales (e.g., Age [0-100] vs. Income [0-100,000]). Without normalization (scaling values to a common range like 0-1), the feature with the larger magnitude (Income) will dominate the distance calculation, making Age irrelevant. Always scale your vectors before comparing distances.
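A minimal min-max scaling sketch for a single feature, as the pitfall suggests (scaling to the [0, 1] range; the income figures are made-up sample values):

```csharp
double[] incomes = { 20_000, 50_000, 100_000 };

double min = incomes.Min();
double max = incomes.Max();

var scaled = incomes
    .Select(x => (x - min) / (max - min)) // guard against max == min in real code
    .ToArray();
// scaled: [0.0, 0.375, 1.0]
```

After scaling both Age and Income this way, each feature contributes to the distance on equal footing.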

The chapter continues with advanced code examples, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.



Code License: All code examples are released under the MIT License. Github repo.

Content Copyright: Copyright © 2026 Edgar Milvus. All rights reserved.

All textual explanations, original diagrams, and illustrations are the intellectual property of the author. Copying, redistribution, or reproduction is strictly prohibited.