
Chapter 1: Lists, Dictionaries, and HashSets in Depth

Theoretical Foundations: Hashing, Buckets, and Collision Resolution

In Book 2, we established that arrays and List<T> provide contiguous memory storage with \(O(1)\) access by index, but \(O(n)\) search time when looking for a specific value. This linear search becomes a bottleneck in AI applications where we might need to check if a token exists in a vocabulary of 50,000 entries. To bridge the gap from linear lists to the vector embeddings used in modern AI, we must first master hash-based data structures.

These structures utilize a mathematical transformation (hashing) to map data directly to a memory address, effectively bypassing the need for iteration in most cases.

The Real-World Analogy: The Library Index System

Imagine a physical library containing 100,000 books.

  • List Approach: To find a specific book, you start at shelf 1, scan every book, move to shelf 2, and repeat. This is \(O(n)\).
  • Dictionary/HashSet Approach: You walk to the central index card catalog. You look up the book title (the "Key"), which directs you to a specific shelf and slot number (the "Hash"). You walk directly there. This is \(O(1)\) (average case).

The "Index Catalog" is the Buckets array. The "Title" is the Key. The process of converting the title to a shelf number is the Hash Function.

1. The Hash Function: Mapping Data to Integers

At the core of Dictionary<TKey, TValue> and HashSet<T> lies the GetHashCode() method, defined on the base System.Object class and overridden by custom types.

A hash function takes an arbitrary amount of data (a string, an object, a number) and maps it to a fixed-size integer (a hash code).

Properties of a Good Hash Function:

  1. Deterministic: The same input must always produce the same output.
  2. Uniform Distribution: Inputs should map to hash codes as evenly as possible across the integer range to minimize collisions.
  3. Fast Computation: It must be calculated quickly.

How C# Uses Hash Codes: When you insert an item into a HashSet<int> or Dictionary<string, int>, the collection does not store the item in the order of insertion. Instead, it calculates the hash code of the key, performs a modulo operation against the current internal array size, and places the item in that specific "bucket."

using System;
using System.Collections.Generic;

// Conceptual demonstration of how a bucket index is calculated
public class HashMechanism
{
    public static int GetBucketIndex(int key, int bucketCount)
    {
        // The real .NET implementation uses bit masking and prime-sized
        // tables for performance, but mathematically this is a modulo.
        // Masking off the sign bit avoids the OverflowException that
        // Math.Abs(int.MinValue) would throw.
        return (key.GetHashCode() & 0x7FFFFFFF) % bucketCount;
    }
}
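The same GetHashCode() contract applies to custom types. As a sketch (the TokenSpan struct below is invented for illustration, not a BCL type), an immutable struct can combine its fields with HashCode.Combine to satisfy the determinism and distribution requirements:

```csharp
using System;

// Hypothetical immutable key type; name and fields are illustrative.
public readonly struct TokenSpan : IEquatable<TokenSpan>
{
    public readonly string Text;
    public readonly int Position;

    public TokenSpan(string text, int position)
    {
        Text = text;
        Position = position;
    }

    // Deterministic: the same field values always produce the same hash.
    public override int GetHashCode() => HashCode.Combine(Text, Position);

    public bool Equals(TokenSpan other) =>
        Text == other.Text && Position == other.Position;

    public override bool Equals(object obj) =>
        obj is TokenSpan other && Equals(other);
}
```

Because Equals and GetHashCode agree (equal values always share a hash), two equal TokenSpan instances will land in the same bucket, which is exactly what Dictionary and HashSet require of a key type.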

2. Internal Mechanics: Buckets and Linked Lists

The internal structure of a Dictionary or HashSet is often visualized as an array of "buckets." However, a single bucket might contain multiple items. This leads us to the concept of Collisions.

Collision Resolution: Chaining

A collision occurs when two different keys produce the same hash code (or map to the same bucket index). Since the array size is finite, this is mathematically inevitable.

C# uses Chaining to resolve collisions. Each bucket in the internal array is not just a single slot for data; it is the head of a linked list. If two keys collide, the second item is appended to the linked list at that bucket.

Visualizing the Bucket Structure: Imagine a Dictionary<string, int> where the internal array has size 8.

  • Key "Apple" hashes to index 2.
  • Key "Banana" hashes to index 5.
  • Key "Cherry" hashes to index 2 (Collision!).

This diagram illustrates a hash table where the key Banana maps directly to index 5, while the key Cherry maps to the same index 2 as another item, resulting in a collision that requires resolution.

Performance Implication:

  • Best Case (No Collisions): \(O(1)\). The hash maps directly to a bucket with one item.
  • Worst Case (All Collisions): \(O(n)\). If every key maps to the same bucket, the structure degenerates into a linked list, requiring linear search.
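The worst case can be demonstrated directly. The sketch below forces it with ConstantHashComparer, a deliberately pathological comparer invented for this demo: every key hashes to the same bucket, so the HashSet degenerates into one long chain.

```csharp
using System;
using System.Collections.Generic;

// A deliberately bad comparer: every key hashes to the same value,
// so every item lands in the same bucket (the worst case described above).
public class ConstantHashComparer : IEqualityComparer<int>
{
    public bool Equals(int x, int y) => x == y;
    public int GetHashCode(int obj) => 42; // every key collides
}

public static class CollisionDemo
{
    public static void Run()
    {
        var degenerate = new HashSet<int>(new ConstantHashComparer());
        var healthy = new HashSet<int>();

        for (int i = 0; i < 5_000; i++)
        {
            degenerate.Add(i);
            healthy.Add(i);
        }

        // Both sets hold the same items, but lookups in 'degenerate'
        // must walk one long chain instead of checking a short bucket.
        Console.WriteLine(degenerate.Contains(4_999)); // True, but O(n)
        Console.WriteLine(healthy.Contains(4_999));    // True, and O(1)
    }
}
```

Correctness is unaffected (both sets answer Contains identically); only performance collapses, which is why collision quality is a performance concern rather than a correctness one.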

3. Resizing and Rehashing

A fixed-size array is inefficient; if we allocate 4 buckets and try to insert 5 items, we guarantee collisions. To maintain \(O(1)\) performance, the collection must grow.

The Resizing Process:

  1. Threshold: Dictionary and HashSet maintain a "load factor" (effectively 1.0 in .NET). When Count exceeds Capacity × LoadFactor — that is, when there are about as many elements as buckets — the structure is considered full.
  2. Allocation: A new array is allocated, roughly doubling the size of the previous one (in .NET the new bucket count is actually the smallest prime at least twice the old count, but "8 -> 16" is the right mental model).
  3. Rehashing: Crucially, the indices of all existing elements must be recalculated. Because the modulo operand (array size) has changed, hash % 16 yields a different result than hash % 8.
  4. Migration: Every item is iterated over and placed into the new array.

Cost: Resizing is an expensive \(O(n)\) operation because it involves rehashing every element. However, because it happens geometrically (doubling), the amortized cost of insertion remains \(O(1)\).
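Dictionary's internal bucket count is not public, but List<T> exposes its Capacity, so the geometric-growth pattern is easy to observe there; the same doubling principle drives hash-table resizing. A minimal sketch (the exact capacity sequence is a .NET implementation detail, not a contract):

```csharp
using System;
using System.Collections.Generic;

public static class GrowthDemo
{
    public static void Run()
    {
        var list = new List<int>();
        int lastCapacity = -1;

        // Capacity only changes on a resize; between resizes, Add is O(1).
        for (int i = 0; i < 100; i++)
        {
            list.Add(i);
            if (list.Capacity != lastCapacity)
            {
                lastCapacity = list.Capacity;
                Console.WriteLine($"Count={list.Count}, Capacity={list.Capacity}");
            }
        }
        // On current .NET, capacities typically go 4, 8, 16, 32, 64, 128:
        // geometric doubling, so only ~log2(n) resizes for n insertions.
    }
}
```

Because each resize doubles the capacity, the total copying work across n insertions is about 2n element moves, which is where the amortized \(O(1)\) insertion cost comes from.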

4. Comparing Collections: List vs. HashSet

In AI applications, specifically when building tokenizers (mapping text tokens to integer IDs), the choice of collection dictates performance.

List<T> (Linear Search)

  • Structure: Contiguous memory block.
  • Lookup (Contains): Scans elements sequentially.
  • Complexity: \(O(n)\).
  • Use Case: When order matters and lookups are rare.

HashSet<T> (Hash Lookup)

  • Structure: Hash table with buckets and chaining.
  • Lookup (Contains): Computes hash, finds bucket, checks short chain.
  • Complexity: \(O(1)\) average.
  • Use Case: When uniqueness is required and fast lookups are critical.

Code Comparison:

using System;
using System.Collections.Generic;
using System.Diagnostics;

public class PerformanceComparison
{
    public static void CompareLookups()
    {
        int size = 100000;
        var list = new List<int>();
        var hashSet = new HashSet<int>();

        // Populate
        for (int i = 0; i < size; i++)
        {
            list.Add(i);
            hashSet.Add(i);
        }

        // Target
        int target = size - 1;

        // Measure List
        var sw = Stopwatch.StartNew();
        bool listContains = list.Contains(target);
        sw.Stop();
        Console.WriteLine($"List.Contains: {sw.ElapsedTicks} ticks (O(n))");

        // Measure HashSet
        sw.Restart();
        bool hashContains = hashSet.Contains(target);
        sw.Stop();
        Console.WriteLine($"HashSet.Contains: {sw.ElapsedTicks} ticks (O(1))");
    }
}
Note: While List<T> is backed by a dense array, HashSet<T> is backed by a sparse array of buckets. List<T> is faster for very small collections due to CPU cache locality, but HashSet<T> scales far better as data grows. (A single Contains call is too fast for Stopwatch to time reliably; repeat each lookup in a loop for meaningful numbers.)

5. Generics and Iterators (yield return)

While Dictionary and HashSet are not linear sequences, C# allows us to iterate over them using IEnumerable<T>. This is where yield return becomes powerful for memory efficiency.

When you iterate a Dictionary, you are not accessing elements by index. Instead, the internal enumerator traverses the bucket array, skipping empty buckets, and yielding items from the linked lists.

Manual Implementation of a Bucket Iterator: To understand the internal mechanics, let's simulate how a Dictionary enumerator might work using yield return.

using System;
using System.Collections;
using System.Collections.Generic;

// A simplified conceptual model of a hash table bucket
public class SimpleBucket<T>
{
    public T Item;
    public SimpleBucket<T> Next;
}

public class SimpleHashTable<T> : IEnumerable<T>
{
    private SimpleBucket<T>[] _buckets = new SimpleBucket<T>[8];

    public void Add(T item)
    {
        // Mask the sign bit instead of Math.Abs to avoid overflow on int.MinValue
        int index = (item.GetHashCode() & 0x7FFFFFFF) % _buckets.Length;

        // Add to the front of the linked list (Chaining)
        var newBucket = new SimpleBucket<T> 
        { 
            Item = item, 
            Next = _buckets[index] 
        };
        _buckets[index] = newBucket;
    }

    // Using 'yield return' to traverse the sparse array efficiently
    // without creating a separate list in memory.
    public IEnumerator<T> GetEnumerator()
    {
        for (int i = 0; i < _buckets.Length; i++)
        {
            var current = _buckets[i];
            while (current != null)
            {
                yield return current.Item;
                current = current.Next;
            }
        }
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}

Why yield return matters here: If we didn't use yield return, we would have to allocate a new List<T> to hold all items before returning them, doubling memory usage. yield return creates a state machine that pauses execution after each item, allowing us to iterate over massive hash tables without loading them entirely into memory.

6. AI Application: Tokenizers and Graph Traversal

How does this theoretical foundation apply to building AI systems?

A. Efficient Lookups for Tokenizers

Modern Large Language Models (LLMs) like GPT or Llama use Vocabulary Mappings. A tokenizer converts text ("Hello") into integers (15496).

  • Encoding (Text -> Int): We need a Dictionary<string, int> (Vocabulary). Lookup is \(O(1)\).
  • Decoding (Int -> Text): We need an array or List<string> (Inverse Vocabulary). Lookup is \(O(1)\) by index.

Without a Dictionary, encoding a sequence of 4096 tokens would require a linear scan of the vocabulary for each token — roughly \(4096 \times n\) comparisons in total — making inference unusably slow.
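A minimal sketch of this two-structure design (class and member names are illustrative, not taken from any real tokenizer library): a Dictionary handles the text-to-ID direction, a List handles the ID-to-text direction, and both lookups are \(O(1)\).

```csharp
using System;
using System.Collections.Generic;

// Minimal vocabulary sketch: Dictionary for encoding, List for decoding.
public class TinyVocabulary
{
    private readonly Dictionary<string, int> _tokenToId = new();
    private readonly List<string> _idToToken = new();

    // Returns the existing ID, or assigns the next free ID to a new token.
    public int GetOrAddId(string token)
    {
        if (_tokenToId.TryGetValue(token, out int id))   // O(1) average
            return id;

        id = _idToToken.Count;        // IDs are dense: 0, 1, 2, ...
        _tokenToId[token] = id;
        _idToToken.Add(token);
        return id;
    }

    public string Decode(int id) => _idToToken[id];      // O(1) by index
}
```

Because IDs are assigned densely, the List doubles as a perfect inverse mapping with no extra bookkeeping.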

B. Graph Traversal (Knowledge Graphs)

In AI, we often represent knowledge as graphs (nodes and edges). Traversing these graphs (e.g., finding the shortest path between concepts) relies heavily on HashSet<T>.

When performing a Breadth-First Search (BFS) or Depth-First Search (DFS), we must track visited nodes to avoid infinite loops in cyclic graphs.

  • Bad Approach: Using a List<T> to store visited nodes. Checking if (visited.Contains(node)) takes \(O(n)\).
  • Good Approach: Using a HashSet<T>. Checking if (visited.Contains(node)) takes \(O(1)\).

For a graph with 1 million nodes, List would require ~500 billion operations for a full search, whereas HashSet requires ~1 million operations.
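A compact BFS sketch using HashSet<T> for the visited set (the adjacency-list shape and names are illustrative):

```csharp
using System;
using System.Collections.Generic;

public static class GraphSearch
{
    // BFS over an adjacency list; the HashSet makes the visited check O(1).
    public static List<int> Bfs(Dictionary<int, List<int>> adjacency, int start)
    {
        var visited = new HashSet<int> { start };
        var queue = new Queue<int>();
        var order = new List<int>();
        queue.Enqueue(start);

        while (queue.Count > 0)
        {
            int node = queue.Dequeue();
            order.Add(node);

            if (!adjacency.TryGetValue(node, out var neighbors)) continue;
            foreach (int next in neighbors)
            {
                // HashSet.Add returns false if the node was already present:
                // one O(1) call both checks and marks it, so cycles terminate.
                if (visited.Add(next))
                    queue.Enqueue(next);
            }
        }
        return order;
    }
}
```

Note the idiom: visited.Add(next) doubles as the membership test, so each node is enqueued at most once even in a cyclic graph.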

C. Handling Collisions in Embedding Lookups

As we transition to Book 3's focus on embeddings, consider Vector Quantization. We map high-dimensional vectors to discrete codes (hashes). If two similar vectors map to the same code (collision), we need to resolve it.

  • Inverted Index: Search engines (and AI vector databases) use a Dictionary<K, List<V>> where K is a hash of the vector (or a quantized bucket) and V is the list of vector IDs falling into that bucket.
  • This allows approximate nearest neighbor search by only comparing vectors within the same bucket, reducing complexity from \(O(N)\) to \(O(K)\) where \(K \ll N\).
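A rough sketch of such an inverted index, with a stand-in quantizer (real vector databases use far more sophisticated quantization than the 1-D floor shown here; all names are illustrative):

```csharp
using System;
using System.Collections.Generic;

// Sketch of an inverted index: a quantized "code" maps to the IDs of all
// vectors that fall into that bucket.
public class InvertedIndex
{
    private readonly Dictionary<int, List<int>> _buckets = new();

    // Hypothetical coarse quantizer: bucket a 1-D value by its integer floor.
    public static int Quantize(double value) => (int)Math.Floor(value);

    public void Add(int vectorId, double value)
    {
        int code = Quantize(value);
        if (!_buckets.TryGetValue(code, out var ids))
        {
            ids = new List<int>();
            _buckets[code] = ids;
        }
        ids.Add(vectorId);
    }

    // Candidate search: only vectors sharing the query's bucket are returned,
    // shrinking the comparison set from N (all vectors) to K (one bucket).
    public IReadOnlyList<int> Candidates(double query) =>
        _buckets.TryGetValue(Quantize(query), out var ids)
            ? (IReadOnlyList<int>)ids
            : Array.Empty<int>();
}
```

The Dictionary<int, List<int>> here is exactly the Dictionary<K, List<V>> pattern described above: the hash structure finds the bucket in \(O(1)\), and the expensive distance comparisons happen only within it.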

Summary

Dictionary and HashSet are not merely "fast lists." They are complex data structures utilizing arrays, linked lists, and hashing algorithms to trade memory for speed. Understanding the mechanics of hashing and collision resolution is the prerequisite for understanding how AI systems perform real-time lookups in massive vocabularies and navigate complex knowledge graphs.

Basic Code Example

Here is a "Hello World" level code example demonstrating the fundamental differences in performance and structure between List<T>, Dictionary<TKey, TValue>, and HashSet<T>, specifically tailored for a real-world Tokenizer vocabulary lookup scenario.

using System;
using System.Collections.Generic;
using System.Diagnostics;

namespace Book3_Chapter1_BasicExample
{
    class Program
    {
        static void Main(string[] args)
        {
            // CONTEXT: In Natural Language Processing (NLP), a Tokenizer breaks text into words (tokens).
            // Each token is mapped to a unique integer ID. We need to check if a token exists in our vocabulary.
            // We will compare three data structures for this lookup task.

            int vocabularySize = 100000; // 100k tokens
            List<string> tokenVocabulary = GenerateVocabulary(vocabularySize);
            List<string> testTokens = new List<string> { "token_50000", "token_99999", "unknown_token" };

            Console.WriteLine($"Vocabulary Size: {vocabularySize} tokens");
            Console.WriteLine("--------------------------------------------------");

            // 1. Using List<T> (Linear Search)
            // Complexity: O(N) - Time increases linearly with vocabulary size.
            // Memory: Low overhead, stores only data.
            Console.WriteLine("1. Testing List<string> (Linear Search):");
            SearchWithList(tokenVocabulary, testTokens);

            Console.WriteLine("--------------------------------------------------");

            // 2. Using HashSet<T> (Hash-based Lookup)
            // Complexity: O(1) - Constant time on average.
            // Memory: Higher overhead due to internal buckets and hash storage.
            Console.WriteLine("2. Testing HashSet<string> (Hash-based Lookup):");
            SearchWithHashSet(tokenVocabulary, testTokens);

            Console.WriteLine("--------------------------------------------------");

            // 3. Using Dictionary<TKey, TValue> (Key-Value Lookup)
            // Complexity: O(1) - Constant time on average.
            // Used when we need to retrieve an ID (Value) given a Token (Key).
            Console.WriteLine("3. Testing Dictionary<string, int> (Key-Value Lookup):");
            SearchWithDictionary(tokenVocabulary, testTokens);
        }

        // Generates a list of tokens: "token_0", "token_1", ...
        static List<string> GenerateVocabulary(int size)
        {
            List<string> vocab = new List<string>(size);
            for (int i = 0; i < size; i++)
            {
                vocab.Add($"token_{i}");
            }
            return vocab;
        }

        // --- LIST IMPLEMENTATION ---
        static void SearchWithList(List<string> vocab, List<string> queries)
        {
            // ALGORITHM: Linear Search
            // We iterate through the list from start to finish until we find a match.
            // If the item is at the end or not present, we scan the entire list.
            foreach (string query in queries)
            {
                bool found = false;
                // Manual iteration (No LINQ shortcuts allowed for deep dive)
                for (int i = 0; i < vocab.Count; i++)
                {
                    if (vocab[i] == query)
                    {
                        found = true;
                        break; // Found it, stop searching
                    }
                }

                Console.WriteLine($"  Query '{query}': {(found ? "Found" : "Not Found")}");
            }

            // PERFORMANCE NOTE:
            // If we searched for "token_99999", the loop runs 100,000 times.
            // If we searched for "token_0", the loop runs 1 time.
            // This inconsistency is why Lists are slow for lookups.
        }

        // --- HASHSET IMPLEMENTATION ---
        static void SearchWithHashSet(List<string> vocab, List<string> queries)
        {
            // OPTIMIZATION: Convert List to HashSet once.
            // This operation takes O(N) time, but subsequent lookups are O(1).
            HashSet<string> hashSet = new HashSet<string>(vocab);

            foreach (string query in queries)
            {
                // ALGORITHM: Hash Lookup
                // 1. Calculate HashCode of the string.
                // 2. Map HashCode to a specific "Bucket" index.
                // 3. Check only the items in that bucket.
                bool found = hashSet.Contains(query);

                Console.WriteLine($"  Query '{query}': {(found ? "Found" : "Not Found")}");
            }

            // PERFORMANCE NOTE:
            // Whether we search for "token_99999" or "token_0", the time taken is roughly the same.
            // This is O(1) average case.
        }

        // --- DICTIONARY IMPLEMENTATION ---
        static void SearchWithDictionary(List<string> vocab, List<string> queries)
        {
            // REAL-WORLD CONTEXT: We usually need the ID of the token, not just a boolean.
            // Dictionary maps Key (Token) -> Value (ID).
            Dictionary<string, int> dictionary = new Dictionary<string, int>();

            // Building the dictionary: O(N)
            for (int i = 0; i < vocab.Count; i++)
            {
                dictionary.Add(vocab[i], i);
            }

            foreach (string query in queries)
            {
                // ALGORITHM: Hash Lookup with Value Retrieval
                if (dictionary.TryGetValue(query, out int tokenId))
                {
                    Console.WriteLine($"  Query '{query}': Found ID {tokenId}");
                }
                else
                {
                    Console.WriteLine($"  Query '{query}': Not Found");
                }
            }
        }
    }
}

Explanation of Internal Mechanics

  1. List<T> (Linear Search):

    • Internal Structure: A contiguous array of memory. When you add items, it checks if the array is full. If full, it creates a larger array (usually double the size) and copies the elements over (amortized O(1) insertion, but O(N) worst case).
    • Lookup Mechanism: List.Contains or manual iteration performs a linear scan. It compares the target value with every element in the array sequentially.
    • Complexity: \(O(N)\). If the vocabulary grows to 1 million tokens, the lookup time grows proportionally. This is inefficient for large-scale data.
  2. HashSet<T> (Hash-based Lookup):

    • Internal Structure: An array of "buckets." When an item is added, the CLR calculates the GetHashCode() of the object.
    • Hashing: The hash code is processed to determine an index in the internal array (the bucket).
    • Collisions: If two different objects generate the same hash (or map to the same index), they are stored in a linked list or similar structure within that bucket.
    • Lookup: When checking Contains, it calculates the hash, finds the bucket, and only compares the items in that specific bucket.
    • Complexity: \(O(1)\) average case. Even with 1 million tokens, finding a token requires calculating the hash and checking one bucket.
  3. Dictionary<TKey, TValue> (Key-Value Storage):

    • Internal Structure: Similar to HashSet<T>, but it stores a struct containing both the Key and the Value in the bucket.
    • Use Case: Essential for tokenizers where Token -> ID mapping is required. It is also the backbone of Graph Traversal algorithms (e.g., storing Node -> VisitedStatus).
    • Complexity: \(O(1)\) average case for lookup, insertion, and deletion.

Visualizing the Data Structures

The following diagram illustrates how token_5 might be stored differently in a List vs. a HashSet/Dictionary.

This diagram contrasts how token_5 is stored in a List versus a HashSet/Dictionary, showing the List as a simple sequential array of values while the HashSet/Dictionary utilizes a hashed index to locate the value within a more complex internal structure.

Diagram Interpretation:

  • Left (List): To find token_5, you must scan from token_0 through token_5. This is a linear path.
  • Right (HashSet/Dictionary): The system calculates the hash of token_5 and immediately maps it to Bucket 4. Even though token_12 shares the same bucket (a collision), the search is confined to that small bucket.

Common Pitfalls

1. Mutable Keys in Dictionaries/HashSets A critical mistake is using mutable objects (like a List<int> or a custom class with changing fields) as keys.

  • Why it fails: If you add an item to a dictionary using a specific key, the hash code is calculated and the item is placed in a specific bucket. If you modify the object after it's added, its hash code might change.
  • Result: The dictionary will look in the new bucket for the key, find nothing, and return false even though the item exists in the old bucket. The item becomes "lost" in the structure.
  • Fix: Always use immutable types (strings, ints, records) or ensure the object's hash code never changes once it is used as a key.
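The pitfall is easy to reproduce. In this sketch, MutableKey is deliberately broken (it hashes a mutable field) to show the failure mode:

```csharp
using System;
using System.Collections.Generic;

// Deliberately broken key type: the hash depends on a mutable field.
public class MutableKey
{
    public int Value; // mutating this after insertion "loses" the entry

    public override int GetHashCode() => Value;
    public override bool Equals(object obj) =>
        obj is MutableKey other && other.Value == Value;
}

public static class LostKeyDemo
{
    public static void Run()
    {
        var key = new MutableKey { Value = 1 };
        var map = new Dictionary<MutableKey, string> { [key] = "payload" };

        Console.WriteLine(map.ContainsKey(key)); // True

        // The hash code changes, but the entry stays in the old bucket.
        key.Value = 2;

        Console.WriteLine(map.ContainsKey(key)); // False - the item is "lost"
        Console.WriteLine(map.Count);            // 1 - yet it still uses memory
    }
}
```

The entry is unreachable by key but never removed, so repeated mutation of keys can also masquerade as a memory leak.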

2. Ignoring Collision Handling While HashSet and Dictionary are \(O(1)\), a "Denial of Service" style attack can occur if an attacker knows your hash function and provides keys that all hash to the same bucket.

  • Result: The structure degrades into an \(O(n)\) linked-list scan.
  • Context: In .NET, strings have a per-process randomized hash code to prevent exactly this attack, but be cautious with custom structs and classes.

The chapter continues with advanced code samples, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.



Code License: All code examples are released under the MIT License. Github repo.

Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.

All textual explanations, original diagrams, and illustrations are the intellectual property of the author. Copying, redistribution, or reproduction is strictly prohibited.