Chapter 14: Distance Algorithms (Euclidean vs Manhattan)
Theoretical Foundations
Distance is the fundamental currency of similarity in high-dimensional spaces. When we represent data as vectors—whether they are word embeddings, customer feature profiles, or image latent representations—we must quantify how "close" two vectors are to one another. This quantification drives clustering, classification (k-NN), and anomaly detection.
In this chapter, we explore two foundational distance algorithms: Euclidean and Manhattan. While they share the same input (vectors), their geometric interpretations and computational behaviors differ significantly, influencing their suitability for specific AI pipelines.
The Vector Foundation: Coordinates as Data
Before defining distance, we must establish the vector representation. Previously, we used List&lt;T&gt; for general collections. For mathematical operations, however, we need a structure that enforces dimensionality and supports arithmetic operations.
In C#, we can model a vector as an immutable sequence of doubles. Immutability is critical for functional purity; operations on a vector should return a new vector rather than mutating the existing one.
using System;
using System.Collections.Generic;
using System.Linq;
public record Vector(params double[] Coordinates)
{
public int Dimension => Coordinates.Length;
// Element-wise addition
public static Vector operator +(Vector a, Vector b)
{
if (a.Dimension != b.Dimension) throw new InvalidOperationException("Dimension mismatch");
var result = new double[a.Dimension];
for (int i = 0; i < a.Dimension; i++)
result[i] = a.Coordinates[i] + b.Coordinates[i];
return new Vector(result);
}
// Element-wise subtraction
public static Vector operator -(Vector a, Vector b)
{
if (a.Dimension != b.Dimension) throw new InvalidOperationException("Dimension mismatch");
var result = new double[a.Dimension];
for (int i = 0; i < a.Dimension; i++)
result[i] = a.Coordinates[i] - b.Coordinates[i];
return new Vector(result);
}
// Scalar multiplication
public static Vector operator *(Vector v, double scalar)
{
var result = new double[v.Dimension];
for (int i = 0; i < v.Dimension; i++)
result[i] = v.Coordinates[i] * scalar;
return new Vector(result);
}
}
1. Euclidean Distance: The Straight Line
Euclidean distance is the "ordinary" straight-line distance between two points that we learn in geometry. It is derived from the L2 Norm (or Euclidean norm) of the difference vector.
The Formula: For two vectors \(A = (a_1, a_2, ..., a_n)\) and \(B = (b_1, b_2, ..., b_n)\): $$ d(A, B) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2} $$
Geometric Interpretation: Imagine a bird flying directly from point A to point B. The bird ignores terrain and flies in a straight line through the air. Euclidean distance measures this direct path.
Computational Characteristics:
- Sensitivity: It is sensitive to the magnitude of the vectors. If one vector has large coordinate values, the squared differences will dominate the sum.
- Cost: It requires squaring differences, summing them, and computing a square root. This is computationally more expensive than Manhattan distance.
Functional Implementation: We can implement this with a LINQ pipeline of Zip, Select, and Sum. Note that we strictly avoid side effects; the query computes the sum purely.
public static class DistanceMetrics
{
public static double Euclidean(Vector a, Vector b)
{
if (a.Dimension != b.Dimension) throw new InvalidOperationException("Dimension mismatch");
// Zip pairs coordinates (a_i, b_i); without the guard above,
// Zip would silently truncate to the shorter sequence.
// Select the squared differences, sum them, and take the square root.
return Math.Sqrt(
a.Coordinates
.Zip(b.Coordinates, (x, y) => x - y)
.Select(diff => diff * diff)
.Sum()
);
}
}
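As a quick sanity check, here is a minimal sketch of the same pipeline applied to the classic 3-4-5 right triangle, using raw double arrays so the snippet is self-contained (it does not depend on the Vector record above):

```csharp
using System;
using System.Linq;

// Sanity check: the straight-line distance from (0,0) to (3,4)
// should be 5 by the Pythagorean theorem.
double[] a = { 0.0, 0.0 };
double[] b = { 3.0, 4.0 };

double euclidean = Math.Sqrt(
    a.Zip(b, (x, y) => x - y)
     .Select(d => d * d)
     .Sum());

Console.WriteLine(euclidean); // 5
```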
2. Manhattan Distance: The Grid Walk
Manhattan distance (also known as Taxicab geometry or L1 distance) measures the distance between two points by summing the absolute differences of their Cartesian coordinates. It is derived from the L1 Norm.
The Formula: $$ d(A, B) = \sum_{i=1}^{n} |a_i - b_i| $$
Geometric Interpretation: Imagine a taxi driving in a city laid out in a grid (like Manhattan). The taxi cannot cut through buildings; it must drive along the grid lines (streets) to get from A to B. The distance is the sum of the blocks traveled horizontally and vertically.
Computational Characteristics:
- Sensitivity: It is more robust to outliers than Euclidean distance because it does not square the differences; a large deviation contributes only linearly rather than quadratically.
- Cost: It only requires subtraction, absolute value, and summation. It avoids the expensive square root operation, making it faster in high-dimensional spaces.
Functional Implementation:
public static double Manhattan(Vector a, Vector b)
{
if (a.Dimension != b.Dimension) throw new InvalidOperationException("Dimension mismatch");
return a.Coordinates
.Zip(b.Coordinates, (x, y) => Math.Abs(x - y))
.Sum();
}
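Running both metrics on the same pair of points makes the geometric difference concrete. A self-contained sketch on the 3-4-5 pair (again using raw arrays rather than the Vector record):

```csharp
using System;
using System.Linq;

// The crow flies 5 units from (0,0) to (3,4);
// the taxi drives 3 blocks east plus 4 blocks north.
double[] a = { 0.0, 0.0 };
double[] b = { 3.0, 4.0 };

double manhattan = a.Zip(b, (x, y) => Math.Abs(x - y)).Sum();
double euclidean = Math.Sqrt(a.Zip(b, (x, y) => (x - y) * (x - y)).Sum());

Console.WriteLine($"Manhattan: {manhattan}, Euclidean: {euclidean}"); // Manhattan: 7, Euclidean: 5
```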
3. Comparative Analysis: When to Use Which?
The choice between Euclidean and Manhattan is not arbitrary; it depends on the data distribution and the dimensionality.
| Feature | Euclidean (L2) | Manhattan (L1) |
|---|---|---|
| Geometry | Straight line (as the crow flies) | Grid path (taxi cab) |
| High Dimensions | Becomes less meaningful (curse of dimensionality) | Often more robust in high dimensions |
| Outliers | Squared errors amplify outliers | Linear errors dampen outliers |
| Computational Cost | Higher (Square root + Squaring) | Lower (Abs + Sum) |
AI Application Context:
- Use Euclidean for: Physical spatial data (e.g., GPS coordinates), or when the magnitude of the vector components is meaningful and normally distributed. It is the default for K-Means clustering.
- Use Manhattan for: High-dimensional sparse data (e.g., text embeddings where many dimensions are zero), or when features are independent. It is often preferred for k-NN in high-dimensional spaces because the distance distribution tends to be less skewed.
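The outlier row of the table can be demonstrated with a small sketch. Both candidate points below differ from the origin by a total of 4 units of coordinate change; Manhattan treats them identically, while Euclidean ranks the concentrated (outlier-like) difference as twice as far:

```csharp
using System;
using System.Linq;

// 'spread' distributes 4 units of difference evenly across dimensions;
// 'outlier' concentrates all 4 units in a single dimension.
double[] origin  = { 0, 0, 0, 0 };
double[] spread  = { 1, 1, 1, 1 };
double[] outlier = { 4, 0, 0, 0 };

Func<double[], double[], double> l1 =
    (a, b) => a.Zip(b, (x, y) => Math.Abs(x - y)).Sum();
Func<double[], double[], double> l2 =
    (a, b) => Math.Sqrt(a.Zip(b, (x, y) => (x - y) * (x - y)).Sum());

Console.WriteLine($"L1: spread={l1(origin, spread)}, outlier={l1(origin, outlier)}"); // L1: spread=4, outlier=4
Console.WriteLine($"L2: spread={l2(origin, spread)}, outlier={l2(origin, outlier)}"); // L2: spread=2, outlier=4
```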
4. Deferred Execution and Functional Pipelines
In the context of building AI applications, we often deal with massive datasets. Calculating distances for every pair in a dataset is an \(O(N^2)\) operation. We must be mindful of Deferred Execution.
In C#, LINQ queries are not executed until the data is enumerated (e.g., via .ToList(), .ToArray(), or a foreach loop). This allows us to build complex query pipelines without incurring immediate computational cost.
Example: Pairwise Distance Calculation
Suppose we have a list of Vector objects representing data points. We want to compute a distance matrix.
public static IEnumerable<(int IdA, int IdB, double Distance)> ComputeDistanceMatrix(
IEnumerable<Vector> dataset,
Func<Vector, Vector, double> metric)
{
// Materialize the dataset to avoid multiple enumeration
// This is an IMMEDIATE execution step
var dataPoints = dataset.Select((v, index) => new { Index = index, Vector = v })
.ToList();
// This is a DEFERRED execution query
// It generates a query plan but does not run it yet
var query =
from a in dataPoints
from b in dataPoints
where a.Index < b.Index // Avoid duplicates and self-comparison
select (a.Index, b.Index, metric(a.Vector, b.Vector));
return query;
}
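The deferred/immediate split noted in the comments above can be observed directly. The following minimal sketch uses a side-effecting counter purely as a probe (exactly the kind of side effect the next note warns against in real pipelines) to show that a Select body does not run until enumeration:

```csharp
using System;
using System.Linq;

// Illustrative only: the counter is a probe to observe execution timing,
// not a pattern to use in production LINQ pipelines.
int evaluations = 0;
var numbers = new[] { 1, 2, 3 };

var query = numbers.Select(n => { evaluations++; return n * n; });
Console.WriteLine(evaluations); // 0 - the query is only a blueprint so far

var materialized = query.ToList();
Console.WriteLine(evaluations); // 3 - ToList() forced the enumeration
```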
Critical Note on Side Effects: When using LINQ for AI pipelines (e.g., normalizing data), we must avoid side effects.
- Bad: modifying a global counter inside a .Select.
- Good: returning a new normalized vector from the .Select.
Example: Normalization Pipeline (Z-Score)
Normalization is crucial for distance algorithms. If one feature ranges from 0-1 and another from 0-1000, the Euclidean distance will be dominated by the second feature. We normalize each dimension by subtracting its mean and dividing by its standard deviation.
public static IEnumerable<Vector> NormalizeDataset(IEnumerable<Vector> dataset)
{
// Materialize to calculate statistics (Immediate)
var matrix = dataset.ToList();
// Calculate Mean for each dimension
int dimensions = matrix.First().Dimension;
var means = Enumerable.Range(0, dimensions)
.Select(dim => matrix.Average(v => v.Coordinates[dim]))
.ToArray();
// Calculate Standard Deviation for each dimension
var stdDevs = Enumerable.Range(0, dimensions)
.Select(dim =>
{
var dimValues = matrix.Select(v => v.Coordinates[dim]).ToArray();
var mean = means[dim];
var sumSquares = dimValues.Sum(val => Math.Pow(val - mean, 2));
return Math.Sqrt(sumSquares / dimValues.Length);
})
.ToArray();
// Apply transformation (Deferred until enumeration)
return matrix.Select(v =>
{
var normalizedCoords = new double[v.Dimension];
for (int i = 0; i < v.Dimension; i++)
{
// Avoid division by zero
normalizedCoords[i] = stdDevs[i] == 0 ? 0 : (v.Coordinates[i] - means[i]) / stdDevs[i];
}
return new Vector(normalizedCoords);
});
}
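A minimal, self-contained sketch of the per-dimension z-score logic on a single feature shows why the transformation works: after normalization the feature has mean 0 and (population) standard deviation 1, so no dimension can dominate the distance by scale alone.

```csharp
using System;
using System.Linq;

// Z-score normalization of one feature, mirroring the per-dimension
// statistics computed in NormalizeDataset above.
double[] feature = { 2.0, 4.0, 6.0, 8.0 };

double mean = feature.Average(); // 5
double std  = Math.Sqrt(feature.Sum(v => Math.Pow(v - mean, 2)) / feature.Length);

var normalized = feature.Select(v => (v - mean) / std).ToArray();
Console.WriteLine(string.Join(", ", normalized.Select(v => v.ToString("F2"))));
```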
5. Parallelization with PLINQ
For large-scale AI datasets, sequential processing is inefficient. PLINQ (Parallel LINQ) allows us to utilize multiple cores by simply adding .AsParallel().
Visualizing the Pipeline: The following diagram illustrates the flow of data through a functional pipeline, highlighting where parallelization occurs.
Optimized Distance Calculation:
By applying .AsParallel(), we distribute the distance calculations across available CPU cores. This is particularly effective for the \(O(N^2)\) pairwise comparison problem.
public static List<(int IdA, int IdB, double Distance)> ComputeParallelDistances(
IEnumerable<Vector> dataset)
{
var dataPoints = dataset.Select((v, i) => new { Id = i, Vector = v }).ToList();
// Parallelize the outer loop
var results = dataPoints
.AsParallel() // Enables multi-threading
.WithDegreeOfParallelism(Environment.ProcessorCount)
.SelectMany(a =>
dataPoints.Where(b => b.Id > a.Id) // Inner loop
.Select(b => (a.Id, b.Id, DistanceMetrics.Euclidean(a.Vector, b.Vector)))
)
.ToList(); // Immediate execution to capture results
return results;
}
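One caveat worth knowing when moving from LINQ to PLINQ: by default PLINQ does not guarantee that results arrive in source order. A minimal sketch (independent of the distance code above) showing AsOrdered(), which restores source ordering at some cost in parallel efficiency:

```csharp
using System;
using System.Linq;

// AsOrdered() preserves the source sequence order; without it,
// a parallel query may emit results in any order.
var squares = Enumerable.Range(1, 10)
    .AsParallel()
    .AsOrdered()
    .Select(n => n * n)
    .ToList();

Console.WriteLine(string.Join(", ", squares)); // 1, 4, 9, 16, 25, 36, 49, 64, 81, 100
```

For the pairwise-distance result list above, ordering is usually irrelevant (the tuples carry their own indices), so the unordered default is the faster choice there.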
Summary
In this section, we established the theoretical foundations of distance metrics. We moved from concrete vector representations to abstract distance calculations using functional LINQ pipelines. We distinguished between Euclidean (L2) and Manhattan (L1) norms, recognizing that Euclidean emphasizes magnitude (squared differences) while Manhattan emphasizes rank (absolute differences). Finally, we utilized deferred execution to build efficient pipelines and PLINQ to parallelize the heavy computational load, ensuring our AI applications scale effectively.
Basic Code Example
Let's start with a relatable problem: recommending a movie to a user based on their viewing history. We have a small dataset of users and their ratings for two genres: Sci-Fi and Romance. Each user is represented as a 2-dimensional vector [Sci-Fi Rating, Romance Rating]. Our goal is to find the user most similar to a target user (e.g., "Alice") using distance metrics.
We will implement Euclidean and Manhattan distance calculations using a functional LINQ pipeline. This example emphasizes deferred execution (defining the query) versus immediate execution (materializing the results) and avoids side effects.
The Code Example
using System;
using System.Collections.Generic;
using System.Linq;
public static class DistanceCalculator
{
// 1. Define a simple record to hold user data (Vector representation).
// Using 'record' for immutability aligns with functional programming principles.
public record UserVector(string Name, double SciFiRating, double RomanceRating);
public static void Main()
{
// 2. The Dataset: A collection of user vectors.
// In a real scenario, this might come from a database or CSV file.
var users = new List<UserVector>
{
new UserVector("Alice", 9.0, 2.0), // Target user: loves Sci-Fi, hates Romance
new UserVector("Bob", 8.5, 2.5), // Very similar to Alice
new UserVector("Charlie", 3.0, 9.0), // Opposite taste
new UserVector("Diana", 8.0, 3.0) // Similar to Alice
};
// 3. Identify the target user (Alice).
// We use First() which is an immediate execution to fetch the specific object.
var targetUser = users.First(u => u.Name == "Alice");
// 4. Define the Distance Calculation Functions (Pure Functions).
// These calculate the distance between two vectors (u1 and u2).
// Euclidean Distance: Straight-line distance (Pythagorean theorem).
// Formula: sqrt(sum((u1_i - u2_i)^2))
Func<UserVector, UserVector, double> euclideanDistance = (u1, u2) =>
{
// We project the differences into a sequence, square them, sum them, and sqrt.
var differences = new[]
{
u1.SciFiRating - u2.SciFiRating,
u1.RomanceRating - u2.RomanceRating
};
// Immediate Execution: .Sum() processes the collection immediately.
double sumOfSquares = differences.Select(d => d * d).Sum();
return Math.Sqrt(sumOfSquares);
};
// Manhattan Distance: Sum of absolute differences (Grid-like path).
// Formula: sum(|u1_i - u2_i|)
Func<UserVector, UserVector, double> manhattanDistance = (u1, u2) =>
{
var differences = new[]
{
Math.Abs(u1.SciFiRating - u2.SciFiRating),
Math.Abs(u1.RomanceRating - u2.RomanceRating)
};
// Immediate Execution: .Sum() processes the collection immediately.
return differences.Sum();
};
// 5. Build the LINQ Pipeline (Deferred Execution).
// We define the query here, but it does NOT execute yet.
// This pipeline filters out the target user and calculates distances.
var similarityQuery = users
.Where(u => u.Name != targetUser.Name) // Filter: Exclude Alice herself
.Select(u => new
{
Name = u.Name,
// Calculate both distances for comparison
Euclidean = euclideanDistance(u, targetUser),
Manhattan = manhattanDistance(u, targetUser)
});
// 6. Materialize the results (Immediate Execution).
// The query executes here because .ToList() forces iteration.
var results = similarityQuery.ToList();
// 7. Display results using a functional pipeline (we avoid an imperative foreach and project to strings instead).
// We use string.Join to aggregate the output without side effects in the query.
var outputLines = results
.Select(r => $"User: {r.Name,-8} | Euclidean: {r.Euclidean:F2} | Manhattan: {r.Manhattan:F2}")
.ToList();
Console.WriteLine($"Comparing to Target: {targetUser.Name} [Sci-Fi: {targetUser.SciFiRating}, Romance: {targetUser.RomanceRating}]");
Console.WriteLine(string.Join(Environment.NewLine, outputLines));
}
}
Step-by-Step Explanation
1. Data Representation (UserVector): We define a record named UserVector. In functional programming, immutability is key, and a record provides value-based equality and immutability by default. This vector contains two dimensions (features): the Sci-Fi rating and the Romance rating.
2. Dataset Initialization: We create a List<UserVector> containing four users. Notice that "Alice" has a high Sci-Fi and a low Romance rating. "Bob" and "Diana" are close to her, while "Charlie" is far away.
3. Target Selection: We use users.First(u => u.Name == "Alice"). This is an Immediate Execution method: it iterates the list immediately to find the first match. We need the actual object (or its values) to pass into our distance calculations.
4. Defining Pure Functions: We define two Func delegates: euclideanDistance and manhattanDistance.
   - Euclidean: represents the straight-line distance. It squares the differences, sums them, and takes the square root.
   - Manhattan: represents the distance along axes at right angles (like navigating city blocks). It sums the absolute differences.
   - Critical note: these functions are pure. They take inputs and return an output without modifying any external state (no side effects).
5. The LINQ Pipeline (Deferred Execution): The variable similarityQuery is defined using standard LINQ operators (Where, Select).
   - Deferred Execution: at this exact moment, no calculations are performed. The code merely constructs a "blueprint" of how to process the data when requested.
   - Where(u => u.Name != targetUser.Name): filters the list to remove "Alice" so she isn't compared to herself.
   - Select(...): projects the data. For every remaining user, it creates a new anonymous object containing the name and the calculated distances.
6. Materializing Results: The line var results = similarityQuery.ToList(); triggers the execution. ToList() is an Immediate Execution operator: it iterates through the source, runs the Where filter, executes the Select projection (calling our distance functions), and stores the results in a new list.
7. Output Generation: To avoid an imperative foreach (which often involves side effects such as mutating a counter), we use another LINQ chain: we project the calculation results into formatted strings and then join them with string.Join for display. This keeps the entire flow declarative.
Visualizing the Data Flow
The following diagram illustrates how data moves through the LINQ pipeline, highlighting the split between Deferred and Immediate execution.
Common Pitfalls
1. Confusing Deferred and Immediate Execution
A common mistake is defining a LINQ query inside a loop without materializing it.
- The Mistake:
foreach (var u in users)
{
    var query = users.Where(x => x.Name != u.Name); // Deferred
    var count = query.Count();                      // Immediate
}
While Count() forces execution immediately, if you chain more operators or reuse IEnumerable variables carelessly, you may re-evaluate the query repeatedly, causing performance issues. Always be aware of when the data is actually being processed.
2. Side Effects in Select
- The Mistake:
int counter = 0;
var badQuery = users.Select(u =>
{
    counter++; // SIDE EFFECT! Dangerous in deferred execution.
    return u.Name;
});
With deferred execution, counter might not be updated when you expect, or it might update multiple times if the query is enumerated more than once. In PLINQ (parallel execution), this causes race conditions. Never modify external variables inside a LINQ lambda.
3. Forgetting Normalization The example uses raw ratings (0-10). In real-world vector math, features often have different scales (e.g., Age [0-100] vs. Income [0-100,000]). Without normalization (scaling values to a common range like 0-1), the feature with the larger magnitude (Income) will dominate the distance calculation, making Age irrelevant. Always scale your vectors before comparing distances.
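A minimal sketch of the min-max scaling mentioned above, mapping a large-scale feature into [0, 1] so it no longer drowns out smaller-scale features in a distance computation (an alternative to the z-score pipeline shown earlier):

```csharp
using System;
using System.Linq;

// Min-max scaling: (v - min) / (max - min) maps values into [0, 1].
double[] income = { 20_000, 50_000, 100_000 };

double min = income.Min(), max = income.Max();
var scaled = income.Select(v => (v - min) / (max - min)).ToArray();

Console.WriteLine(string.Join(", ", scaled)); // 0, 0.375, 1
```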
The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.
Code License: All code examples are released under the MIT License. Github repo.
Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.