Chapter 1: Lists, Dictionaries, and HashSets in Depth
Theoretical Foundations: Hashing, Buckets, and Collision Resolution
In Book 2, we established that arrays and List<T> provide contiguous memory storage with \(O(1)\) access by index, but \(O(n)\) search time when looking for a specific value. This linear search becomes a bottleneck in AI applications where we might need to check if a token exists in a vocabulary of 50,000 entries. To bridge the gap from linear lists to the vector embeddings used in modern AI, we must first master hash-based data structures.
These structures utilize a mathematical transformation (hashing) to map data directly to a memory address, effectively bypassing the need for iteration in most cases.
The Real-World Analogy: The Library Index System
Imagine a physical library containing 100,000 books.
- List Approach: To find a specific book, you start at shelf 1, scan every book, move to shelf 2, and repeat. This is \(O(n)\).
- Dictionary/HashSet Approach: You walk to the central index card catalog. You look up the book title (the "Key"), which directs you to a specific shelf and slot number (the "Hash"). You walk directly there. This is \(O(1)\) (average case).
The "Index Catalog" is the Buckets array. The "Title" is the Key. The process of converting the title to a shelf number is the Hash Function.
1. The Hash Function: Mapping Data to Integers
At the core of Dictionary<TKey, TValue> and HashSet<T> lies the GetHashCode() method, defined on the System.Object base class and overridable by custom types.
A hash function takes an arbitrary amount of data (a string, an object, a number) and maps it to a fixed-size integer (a hash code).
Properties of a Good Hash Function:
- Deterministic: The same input must always produce the same output.
- Uniform Distribution: Inputs should map to hash codes as evenly as possible across the integer range to minimize collisions.
- Fast Computation: It must be calculated quickly.
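On a custom type, these properties follow from overriding Equals and GetHashCode together. The sketch below is illustrative (the TokenId struct is hypothetical) and uses HashCode.Combine, available since .NET Core 2.1, which mixes the fields into a well-distributed hash:

```csharp
using System;

// Hypothetical immutable key type for illustration.
public readonly struct TokenId : IEquatable<TokenId>
{
    public int VocabIndex { get; }
    public int LayerId { get; }

    public TokenId(int vocabIndex, int layerId)
    {
        VocabIndex = vocabIndex;
        LayerId = layerId;
    }

    // Equality and hashing must agree: equal values -> equal hashes.
    public bool Equals(TokenId other) =>
        VocabIndex == other.VocabIndex && LayerId == other.LayerId;

    public override bool Equals(object obj) => obj is TokenId other && Equals(other);

    // Deterministic, fast, and well-distributed across the int range.
    public override int GetHashCode() => HashCode.Combine(VocabIndex, LayerId);
}
```

Because the struct is readonly, its hash can never change after the value has been used as a key, which sidesteps the mutable-key pitfall discussed later in this chapter.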
How C# Uses Hash Codes:
When you insert an item into a HashSet<int> or Dictionary<string, int>, the collection does not store the item in the order of insertion. Instead, it calculates the hash code of the key, performs a modulo operation against the current internal array size, and places the item in that specific "bucket."
using System;
using System.Collections.Generic;

// Conceptual demonstration of how a bucket index is calculated
public class HashMechanism
{
    public static int GetBucketIndex(int key, int bucketCount)
    {
        // In C#, the actual implementation uses bit masking for performance,
        // but mathematically it is equivalent to modulo (%).
        // Masking off the sign bit is safer than Math.Abs, which throws
        // an OverflowException for int.MinValue.
        return (key.GetHashCode() & 0x7FFFFFFF) % bucketCount;
    }
}
2. Internal Mechanics: Buckets and Linked Lists
The internal structure of a Dictionary or HashSet is often visualized as an array of "buckets." However, a single bucket might contain multiple items. This leads us to the concept of Collisions.
Collision Resolution: Chaining
A collision occurs when two different keys produce the same hash code (or map to the same bucket index). Since the array size is finite, this is mathematically inevitable.
C# uses Chaining to resolve collisions. Each bucket in the internal array is not just a single slot for data; it is the head of a linked list. If two keys collide, the second item is appended to the linked list at that bucket.
Visualizing the Bucket Structure:
Imagine a Dictionary<string, int> where the internal array has size 8.
- Key "Apple" hashes to index 2.
- Key "Banana" hashes to index 5.
- Key "Cherry" hashes to index 2 (Collision!).
Performance Implication:
- Best Case (No Collisions): \(O(1)\). The hash maps directly to a bucket with one item.
- Worst Case (All Collisions): \(O(n)\). If every key maps to the same bucket, the structure degenerates into a linked list, requiring linear search.
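The worst case can even be forced deliberately. The sketch below uses an illustrative IEqualityComparer<string> that hashes every key to the same value, so all entries chain into a single bucket; lookups remain correct (Equals still distinguishes keys) but degrade to \(O(n)\):

```csharp
using System;
using System.Collections.Generic;

// Illustrative comparer that forces the worst case: every key hashes to
// the same value, so all entries chain into one bucket.
public class WorstCaseComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y) => x == y;
    public int GetHashCode(string obj) => 42; // deliberate: every key collides
}

public class CollisionDemo
{
    public static bool LookupStillCorrect()
    {
        var set = new HashSet<string>(new WorstCaseComparer());
        for (int i = 0; i < 1000; i++)
            set.Add($"token_{i}");

        // Lookups remain correct because Equals walks the chain,
        // but each one is now an O(n) scan instead of O(1).
        return set.Contains("token_999") && !set.Contains("missing");
    }
}
```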
3. Resizing and Rehashing
A fixed-size array is inefficient; if we allocate 4 buckets and try to insert 5 items, we guarantee collisions. To maintain \(O(1)\) performance, the collection must grow.
The Resizing Process:
- Threshold: Dictionary and HashSet maintain a "load factor" (effectively 1.0). When the number of elements exceeds Capacity * LoadFactor (with a factor of 1.0, when Count reaches the bucket count), the structure is considered full.
- Allocation: A new array is allocated, typically doubling the size of the previous array (e.g., 8 -> 16; in practice .NET rounds up to a prime near twice the old size).
- Rehashing: Crucially, the bucket indices of all existing elements must be recalculated. Because the modulo operand (the array size) has changed, hash % 16 yields a different result than hash % 8.
- Migration: Every item is iterated over and placed into the new array.
Cost: Resizing is an expensive \(O(n)\) operation because it involves rehashing every element. However, because it happens geometrically (doubling), the amortized cost of insertion remains \(O(1)\).
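One practical consequence: when the final element count is known up front, the resize cost can be avoided entirely by pre-sizing the collection. A minimal sketch (the vocabulary-building scenario is assumed):

```csharp
using System;
using System.Collections.Generic;

public class PreSizingDemo
{
    // When the final size is known (e.g., a fixed vocabulary), passing the
    // capacity up front allocates the bucket array once, so no O(n)
    // resize-and-rehash passes happen during the insert loop.
    public static Dictionary<string, int> BuildVocabulary(int size)
    {
        var vocab = new Dictionary<string, int>(capacity: size);
        for (int i = 0; i < size; i++)
            vocab.Add($"token_{i}", i);
        return vocab;
    }
}
```

On newer runtimes, Dictionary and HashSet also expose EnsureCapacity for the same purpose after construction.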
4. Comparing Collections: List vs. HashSet
In AI applications, specifically when building tokenizers (mapping text tokens to integer IDs), the choice of collection dictates performance.
List<T> (Linear Search)
- Structure: Contiguous memory block.
- Lookup (Contains): Scans elements sequentially.
- Complexity: \(O(n)\).
- Use Case: When order matters and lookups are rare.
HashSet<T> (Hash Lookup)
- Structure: Hash table with buckets and chaining.
- Lookup (Contains): Computes the hash, finds the bucket, checks a short chain.
- Complexity: \(O(1)\) average.
- Use Case: When uniqueness is required and fast lookups are critical.
Code Comparison:
using System;
using System.Collections.Generic;
using System.Diagnostics;

public class PerformanceComparison
{
    public static void CompareLookups()
    {
        int size = 100000;
        int iterations = 1000; // a single lookup is too fast to time reliably
        var list = new List<int>();
        var hashSet = new HashSet<int>();

        // Populate
        for (int i = 0; i < size; i++)
        {
            list.Add(i);
            hashSet.Add(i);
        }

        // Worst case for the list: the target sits at the very end
        int target = size - 1;

        // Measure List
        var sw = Stopwatch.StartNew();
        bool listContains = false;
        for (int i = 0; i < iterations; i++)
            listContains = list.Contains(target);
        sw.Stop();
        Console.WriteLine($"List.Contains x{iterations}: {sw.ElapsedMilliseconds} ms (O(n))");

        // Measure HashSet
        sw.Restart();
        bool hashContains = false;
        for (int i = 0; i < iterations; i++)
            hashContains = hashSet.Contains(target);
        sw.Stop();
        Console.WriteLine($"HashSet.Contains x{iterations}: {sw.ElapsedMilliseconds} ms (O(1))");
    }
}
List<T> is backed by a contiguous array, while HashSet<T> is backed by a sparse array of buckets. List is faster for very small collections thanks to CPU cache locality, but HashSet scales far better as the data grows.
5. Generics and Iterators (yield return)
While Dictionary and HashSet are not linear sequences, C# allows us to iterate over them using IEnumerable<T>. This is where yield return becomes powerful for memory efficiency.
When you iterate a Dictionary, you are not accessing elements by index. Instead, the internal enumerator traverses the bucket array, skipping empty buckets, and yielding items from the linked lists.
Manual Implementation of a Bucket Iterator:
To understand the internal mechanics, let's simulate how a Dictionary enumerator might work using yield return.
using System;
using System.Collections;
using System.Collections.Generic;

// A simplified conceptual model of a hash table bucket
public class SimpleBucket<T>
{
    public T Item;
    public SimpleBucket<T> Next;
}

public class SimpleHashTable<T> : IEnumerable<T>
{
    private SimpleBucket<T>[] _buckets = new SimpleBucket<T>[8];

    public void Add(T item)
    {
        // EqualityComparer avoids a NullReferenceException for null items,
        // and the mask avoids the Math.Abs(int.MinValue) overflow.
        int hash = EqualityComparer<T>.Default.GetHashCode(item) & 0x7FFFFFFF;
        int index = hash % _buckets.Length;

        // Add to the front of the linked list (Chaining)
        var newBucket = new SimpleBucket<T>
        {
            Item = item,
            Next = _buckets[index]
        };
        _buckets[index] = newBucket;
    }

    // Using 'yield return' to traverse the sparse array efficiently
    // without creating a separate list in memory.
    public IEnumerator<T> GetEnumerator()
    {
        for (int i = 0; i < _buckets.Length; i++)
        {
            var current = _buckets[i];
            while (current != null)
            {
                yield return current.Item;
                current = current.Next;
            }
        }
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}
Why yield return matters here:
If we didn't use yield return, we would have to allocate a new List<T> to hold all items before returning them, doubling memory usage. yield return creates a state machine that pauses execution after each item, allowing us to iterate over massive hash tables without loading them entirely into memory.
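The pause-and-resume behavior is easy to observe with a counter. In this sketch (LazyDemo and its static counter are illustrative), the consumer pulls only five items from an iterator declared over a million, and the iterator body runs exactly five times:

```csharp
using System.Collections.Generic;

public class LazyDemo
{
    public static int ItemsProduced; // counts how many items the iterator actually yielded

    // An iterator over a large range. Thanks to yield return, items are
    // produced one at a time, only when the consumer asks for them.
    public static IEnumerable<int> Numbers(int count)
    {
        for (int i = 0; i < count; i++)
        {
            ItemsProduced++;
            yield return i;
        }
    }

    public static int TakeFirst(int howMany)
    {
        ItemsProduced = 0;
        int last = -1;
        foreach (int n in Numbers(1_000_000))
        {
            last = n;
            if (n == howMany - 1) break; // stop early: the rest is never computed
        }
        return last;
    }
}
```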
6. AI Application: Tokenizers and Graph Traversal
How does this theoretical foundation apply to building AI systems?
A. Efficient Lookups for Tokenizers
Modern Large Language Models (LLMs) like GPT or Llama use Vocabulary Mappings. A tokenizer converts text ("Hello") into integers (15496).
- Encoding (Text -> Int): We need a Dictionary<string, int> (the Vocabulary). Lookup is \(O(1)\).
- Decoding (Int -> Text): We need an array or List<string> (the Inverse Vocabulary). Lookup is \(O(1)\) by index.
Without a Dictionary, encoding a sequence of 4096 tokens would require scanning the vocabulary list 4096 times (roughly \(O(n \cdot m)\) for \(m\) tokens over a vocabulary of size \(n\)), making tokenization unusably slow.
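Putting both directions together, a minimal two-way vocabulary might look like this. The class name SimpleVocabulary and its grow-on-first-sight Encode behavior are illustrative, not a real tokenizer API:

```csharp
using System;
using System.Collections.Generic;

// Minimal sketch: Dictionary for O(1) encoding, List for O(1) decoding.
public class SimpleVocabulary
{
    private readonly Dictionary<string, int> _tokenToId = new Dictionary<string, int>();
    private readonly List<string> _idToToken = new List<string>();

    // O(1) average: hash lookup, adding the token on first sight.
    public int Encode(string token)
    {
        if (!_tokenToId.TryGetValue(token, out int id))
        {
            id = _idToToken.Count;   // next free ID
            _tokenToId[token] = id;
            _idToToken.Add(token);
        }
        return id;
    }

    // O(1): direct index into the inverse vocabulary.
    public string Decode(int id) => _idToToken[id];
}
```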
B. Graph Traversal (Knowledge Graphs)
In AI, we often represent knowledge as graphs (nodes and edges). Traversing these graphs (e.g., finding the shortest path between concepts) relies heavily on HashSet<T>.
When performing a Breadth-First Search (BFS) or Depth-First Search (DFS), we must track visited nodes to avoid infinite loops in cyclic graphs.
- Bad Approach: Using a List<T> to store visited nodes. Checking if (visited.Contains(node)) takes \(O(n)\).
- Good Approach: Using a HashSet<T>. Checking if (visited.Contains(node)) takes \(O(1)\).
For a graph with 1 million nodes, List would require ~500 billion operations for a full search, whereas HashSet requires ~1 million operations.
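As a sketch, here is a BFS over an adjacency-list graph with a HashSet<int> visited set (the adjacency-list representation is an assumption):

```csharp
using System;
using System.Collections.Generic;

public class GraphSearch
{
    // Breadth-first search over an adjacency-list graph. The HashSet makes
    // the visited check O(1) and guards against cycles.
    public static List<int> Bfs(Dictionary<int, List<int>> adjacency, int start)
    {
        var visited = new HashSet<int> { start };
        var queue = new Queue<int>();
        queue.Enqueue(start);
        var order = new List<int>();

        while (queue.Count > 0)
        {
            int node = queue.Dequeue();
            order.Add(node);
            if (!adjacency.TryGetValue(node, out var neighbors)) continue;
            foreach (int next in neighbors)
            {
                if (visited.Add(next)) // Add returns false if already visited: O(1)
                    queue.Enqueue(next);
            }
        }
        return order;
    }
}
```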
C. Handling Collisions in Embedding Lookups
As we transition to Book 3's focus on embeddings, consider Vector Quantization. We map high-dimensional vectors to discrete codes (hashes). If two similar vectors map to the same code (collision), we need to resolve it.
- Inverted Index: Search engines (and AI vector databases) use a Dictionary<K, List<V>> where K is a hash of the vector (or a quantized bucket) and V is the list of vector IDs falling into that bucket.
- This allows approximate nearest neighbor search by only comparing vectors within the same bucket, reducing complexity from \(O(N)\) to \(O(K)\), where \(K \ll N\).
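A toy version of such an inverted index is sketched below. The sign-pattern quantizer is purely illustrative; production systems use techniques like locality-sensitive hashing or product quantization:

```csharp
using System;
using System.Collections.Generic;

public class InvertedIndex
{
    // Maps a quantized bucket code -> IDs of the vectors in that bucket.
    private readonly Dictionary<int, List<int>> _buckets = new Dictionary<int, List<int>>();

    // Toy quantizer (an assumption for illustration): bucket a vector by the
    // sign pattern of its first few dimensions.
    public static int Quantize(double[] vector)
    {
        int code = 0;
        for (int i = 0; i < Math.Min(vector.Length, 8); i++)
            if (vector[i] >= 0) code |= 1 << i;
        return code;
    }

    public void Add(int vectorId, double[] vector)
    {
        int code = Quantize(vector);
        if (!_buckets.TryGetValue(code, out var ids))
            _buckets[code] = ids = new List<int>();
        ids.Add(vectorId);
    }

    // Candidate set for a query: only the vectors sharing its bucket,
    // O(K) instead of O(N).
    public List<int> Candidates(double[] query) =>
        _buckets.TryGetValue(Quantize(query), out var ids) ? ids : new List<int>();
}
```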
Summary
Dictionary and HashSet are not merely "fast lists." They are complex data structures utilizing arrays, linked lists, and hashing algorithms to trade memory for speed. Understanding the mechanics of hashing and collision resolution is the prerequisite for understanding how AI systems perform real-time lookups in massive vocabularies and navigate complex knowledge graphs.
Basic Code Example
Here is a "Hello World" level code example demonstrating the fundamental differences in performance and structure between List<T>, Dictionary<TKey, TValue>, and HashSet<T>, specifically tailored for a real-world Tokenizer vocabulary lookup scenario.
using System;
using System.Collections.Generic;
using System.Diagnostics;

namespace Book3_Chapter1_BasicExample
{
    class Program
    {
        static void Main(string[] args)
        {
            // CONTEXT: In Natural Language Processing (NLP), a Tokenizer breaks text into words (tokens).
            // Each token is mapped to a unique integer ID. We need to check if a token exists in our vocabulary.
            // We will compare three data structures for this lookup task.
            int vocabularySize = 100000; // 100k tokens
            List<string> tokenVocabulary = GenerateVocabulary(vocabularySize);
            List<string> testTokens = new List<string> { "token_50000", "token_99999", "unknown_token" };

            Console.WriteLine($"Vocabulary Size: {vocabularySize} tokens");
            Console.WriteLine("--------------------------------------------------");

            // 1. Using List<T> (Linear Search)
            // Complexity: O(N) - Time increases linearly with vocabulary size.
            // Memory: Low overhead, stores only data.
            Console.WriteLine("1. Testing List<string> (Linear Search):");
            SearchWithList(tokenVocabulary, testTokens);
            Console.WriteLine("--------------------------------------------------");

            // 2. Using HashSet<T> (Hash-based Lookup)
            // Complexity: O(1) - Constant time on average.
            // Memory: Higher overhead due to internal buckets and hash storage.
            Console.WriteLine("2. Testing HashSet<string> (Hash-based Lookup):");
            SearchWithHashSet(tokenVocabulary, testTokens);
            Console.WriteLine("--------------------------------------------------");

            // 3. Using Dictionary<TKey, TValue> (Key-Value Lookup)
            // Complexity: O(1) - Constant time on average.
            // Used when we need to retrieve an ID (Value) given a Token (Key).
            Console.WriteLine("3. Testing Dictionary<string, int> (Key-Value Lookup):");
            SearchWithDictionary(tokenVocabulary, testTokens);
        }

        // Generates a list of tokens: "token_0", "token_1", ...
        static List<string> GenerateVocabulary(int size)
        {
            List<string> vocab = new List<string>(size);
            for (int i = 0; i < size; i++)
            {
                vocab.Add($"token_{i}");
            }
            return vocab;
        }

        // --- LIST IMPLEMENTATION ---
        static void SearchWithList(List<string> vocab, List<string> queries)
        {
            // ALGORITHM: Linear Search
            // We iterate through the list from start to finish until we find a match.
            // If the item is at the end or not present, we scan the entire list.
            foreach (string query in queries)
            {
                bool found = false;
                // Manual iteration (No LINQ shortcuts allowed for deep dive)
                for (int i = 0; i < vocab.Count; i++)
                {
                    if (vocab[i] == query)
                    {
                        found = true;
                        break; // Found it, stop searching
                    }
                }
                Console.WriteLine($" Query '{query}': {(found ? "Found" : "Not Found")}");
            }
            // PERFORMANCE NOTE:
            // If we searched for "token_99999", the loop runs 100,000 times.
            // If we searched for "token_0", the loop runs 1 time.
            // This inconsistency is why Lists are slow for lookups.
        }

        // --- HASHSET IMPLEMENTATION ---
        static void SearchWithHashSet(List<string> vocab, List<string> queries)
        {
            // OPTIMIZATION: Convert List to HashSet once.
            // This operation takes O(N) time, but subsequent lookups are O(1).
            HashSet<string> hashSet = new HashSet<string>(vocab);
            foreach (string query in queries)
            {
                // ALGORITHM: Hash Lookup
                // 1. Calculate HashCode of the string.
                // 2. Map HashCode to a specific "Bucket" index.
                // 3. Check only the items in that bucket.
                bool found = hashSet.Contains(query);
                Console.WriteLine($" Query '{query}': {(found ? "Found" : "Not Found")}");
            }
            // PERFORMANCE NOTE:
            // Whether we search for "token_99999" or "token_0", the time taken is roughly the same.
            // This is O(1) average case.
        }

        // --- DICTIONARY IMPLEMENTATION ---
        static void SearchWithDictionary(List<string> vocab, List<string> queries)
        {
            // REAL-WORLD CONTEXT: We usually need the ID of the token, not just a boolean.
            // Dictionary maps Key (Token) -> Value (ID).
            Dictionary<string, int> dictionary = new Dictionary<string, int>();
            // Building the dictionary: O(N)
            for (int i = 0; i < vocab.Count; i++)
            {
                dictionary.Add(vocab[i], i);
            }
            foreach (string query in queries)
            {
                // ALGORITHM: Hash Lookup with Value Retrieval
                if (dictionary.TryGetValue(query, out int tokenId))
                {
                    Console.WriteLine($" Query '{query}': Found ID {tokenId}");
                }
                else
                {
                    Console.WriteLine($" Query '{query}': Not Found");
                }
            }
        }
    }
}
Explanation of Internal Mechanics
1. List<T> (Linear Search):
- Internal Structure: A contiguous array in memory. When you add items, it checks whether the array is full. If full, it creates a larger array (usually double the size) and copies the elements over (amortized \(O(1)\) insertion, but \(O(N)\) worst case).
- Lookup Mechanism: List.Contains or manual iteration performs a linear scan, comparing the target value with every element in the array sequentially.
- Complexity: \(O(N)\). If the vocabulary grows to 1 million tokens, the lookup time grows proportionally. This is inefficient for large-scale data.

2. HashSet<T> (Hash-based Lookup):
- Internal Structure: An array of "buckets." When an item is added, the CLR calculates the GetHashCode() of the object.
- Hashing: The hash code is processed to determine an index in the internal array (the bucket).
- Collisions: If two different objects generate the same hash (or map to the same index), they are stored in a linked list or similar structure within that bucket.
- Lookup: When checking Contains, it calculates the hash, finds the bucket, and only compares the items in that specific bucket.
- Complexity: \(O(1)\) average case. Even with 1 million tokens, finding a token requires calculating the hash and checking one bucket.

3. Dictionary<TKey, TValue> (Key-Value Storage):
- Internal Structure: Similar to HashSet<T>, but it stores a struct containing both the Key and the Value in the bucket.
- Use Case: Essential for tokenizers where Token -> ID mapping is required. It is also the backbone of graph traversal algorithms (e.g., storing Node -> VisitedStatus).
- Complexity: \(O(1)\) average case for lookup, insertion, and deletion.
Visualizing the Data Structures
The following diagram illustrates how token_5 might be stored differently in a List vs. a HashSet/Dictionary.
Diagram Interpretation:
- Left (List): To find token_5, you must scan from token_0 through token_5. This is a linear path.
- Right (HashSet/Dictionary): The system calculates the hash of token_5 and immediately maps it to Bucket 4. Even though token_12 shares the same bucket (a collision), the search is confined to that small bucket.
Common Pitfalls
1. Mutable Keys in Dictionaries/HashSets
A critical mistake is using mutable objects (like a List<int> or a custom class with changing fields) as keys.
- Why it fails: If you add an item to a dictionary using a specific key, the hash code is calculated and the item is placed in a specific bucket. If you modify the object after it's added, its hash code might change.
- Result: The dictionary will look in the new bucket for the key, find nothing, and return false even though the item exists in the old bucket. The item becomes "lost" in the structure.
- Fix: Always use immutable types (strings, ints, records) or ensure the object's hash code never changes once it is used as a key.
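The failure can be reproduced in a few lines. MutableKey below is a deliberately broken key type, written only to demonstrate the pitfall:

```csharp
using System;
using System.Collections.Generic;

// A deliberately bad key type: its hash depends on a mutable field.
public class MutableKey
{
    public int Value;
    public override int GetHashCode() => Value;
    public override bool Equals(object obj) => obj is MutableKey k && k.Value == Value;
}

public class MutableKeyPitfall
{
    // Returns true if the key becomes "lost" after mutation.
    public static bool DemonstrateLostKey()
    {
        var set = new HashSet<MutableKey>();
        var key = new MutableKey { Value = 1 };
        set.Add(key);            // stored under the hash for Value = 1

        key.Value = 2;           // hash changes; the entry stays where it was

        // The set now searches with the new hash and finds nothing...
        bool foundAfterMutation = set.Contains(key);
        // ...yet the entry still occupies memory inside the set.
        return !foundAfterMutation && set.Count == 1;
    }
}
```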
2. Ignoring Collision Handling
While HashSet and Dictionary are \(O(1)\), a "Denial of Service" style attack can occur if an attacker knows your hash function and provides keys that all hash to the same bucket.
- Result: The structure degrades into a \(O(N)\) linked list scan.
- Context: In .NET, strings have a randomized hash code per process to prevent this, but be cautious with custom structs.
The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.
Code License: All code examples are released under the MIT License. Github repo.
Content Copyright: Copyright © 2026 Edgar Milvus. All rights reserved.