
Chapter 2: Custom Collections and Enumerators

Theoretical Foundations: Internal Mechanics of Core Collections

The previous chapter established the IEnumerable/IEnumerator pattern, allowing us to traverse custom data structures. However, traversal is only half the story. The efficiency of data manipulation—specifically insertion, deletion, and lookup—is dictated by the underlying data structure. In AI applications, where we often deal with massive vocabularies (tokenizers) or sparse feature vectors, understanding these internal mechanics is not an optimization; it is a requirement for feasibility.

The Hashing Engine: Dictionary<TKey, TValue> and HashSet<T>

To understand Dictionary and HashSet, we must first understand hashing. A hash function is a deterministic algorithm that maps data of arbitrary size to a fixed-size integer (the hash code).

Real-World Analogy: The Library Index System

Imagine a library with millions of books (data). To find a specific book, you do not scan every shelf (an O(n) linear search). Instead, you look up the book's title in a card catalog (the hash function). The catalog gives you a specific aisle and shelf number (the bucket). This is an O(1) lookup, assuming the catalog is well-organized.

Internal Mechanics: The Bucket Array

In C#, both Dictionary<TKey, TValue> and HashSet<T> rely on an internal array of "buckets."

  1. Hashing: When you insert a key, GetHashCode() is called.
  2. Bucket Indexing: The hash code is mapped to a specific index in the internal array (conceptually hashCode % bucketCount — a modulo operation; the runtime uses an optimized equivalent).
  3. Collision Resolution: What if two different keys map to the same bucket? This is a collision. C# uses Chaining. Each bucket points to a linked list of entries. If a collision occurs, the new entry is appended to the list in that bucket.
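These three steps can be sketched in a few lines. The bucket count of 7 and the sign-bit masking below are illustrative simplifications, not the actual BCL implementation; also note that string hash codes are randomized per process in modern .NET, so the exact indices vary between runs:

```csharp
using System;

class BucketDemo
{
    static void Main()
    {
        const int bucketCount = 7; // pretend the internal bucket array has 7 slots
        string[] keys = { "Apple", "Banana", "Cherry" };

        foreach (var key in keys)
        {
            int hash = key.GetHashCode();                  // Step 1: hashing
            int index = (hash & 0x7FFFFFFF) % bucketCount; // Step 2: mask sign bit, map to a bucket
            Console.WriteLine($"{key} -> bucket {index}");
        }
        // Step 3: any two keys that print the same bucket index would be
        // chained together in that bucket's linked list.
    }
}
```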

Resizing and Performance

The initial bucket array is small (in .NET, a small prime such as 3). As you add items, the "load factor" (items/buckets) increases. When it reaches the resize threshold (for .NET's Dictionary, when the entry array is full), the structure resizes.

  1. A new, larger array is allocated (roughly double the size; .NET rounds up to a nearby prime).
  2. Rehashing: Every single item must be re-inserted into the new array, because the bucket index (hashCode % bucketCount) changes with the array size.

Algorithmic Complexity:

  • Average Case (O(1)): The hash function distributes keys evenly. Lookup is instant.
  • Worst Case (O(n)): A "hash flood" attack or poor hash function causes all keys to collide into one bucket. Lookup degrades to scanning a linked list.
  • Amortized Insertion (O(1)): Resizing is expensive (O(n)), but it happens rarely, so the average cost remains low.
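You can observe the resizing indirectly with Dictionary<TKey, TValue>.EnsureCapacity (available since .NET Core 2.0): calling it with 0 reports the current capacity without requesting growth. A minimal sketch — the exact capacity values printed are runtime-specific:

```csharp
using System;
using System.Collections.Generic;

class ResizeDemo
{
    static void Main()
    {
        var dict = new Dictionary<int, int>();
        int lastCapacity = -1;

        for (int i = 0; i < 100; i++)
        {
            dict[i] = i;
            int capacity = dict.EnsureCapacity(0); // query capacity; 0 requests no growth
            if (capacity != lastCapacity)
            {
                // Each new line marks a resize: a larger array plus a full O(n) rehash
                Console.WriteLine($"count = {dict.Count,3}, capacity = {capacity}");
                lastCapacity = capacity;
            }
        }
    }
}
```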

Comparison: List.Contains vs HashSet.Contains

  • List<string>.Contains("token"): Iterates through the list linearly. O(n). If you have 100,000 tokens, it performs 100,000 comparisons.
  • HashSet<string>.Contains("token"): Calculates hash, jumps to bucket, checks small chain. O(1). 100,000 tokens require roughly the same effort as 10.

AI Application: Tokenizer Vocabularies

In Natural Language Processing (NLP), a tokenizer maps strings (words/subwords) to integer IDs.

// BAD: Using a List for a vocabulary
List<string> vocab = new List<string>(); // O(n) lookup
int GetTokenId(string word) {
    return vocab.IndexOf(word); // Scans entire list every time
}

// GOOD: Using a Dictionary for a vocabulary
Dictionary<string, int> vocabMap = new Dictionary<string, int>(); // O(1) lookup
int GetTokenId(string word) {
    return vocabMap[word]; // O(1) hash lookup; throws KeyNotFoundException for unknown words
}
When processing a document with millions of tokens, the Dictionary approach is orders of magnitude faster.
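In practice, a tokenizer must also handle out-of-vocabulary words; TryGetValue keeps the lookup O(1) while avoiding the exception. A minimal sketch — the sentinel ID of -1 for unknown tokens is an assumption for illustration, not a standard:

```csharp
using System;
using System.Collections.Generic;

class VocabDemo
{
    static void Main()
    {
        var vocabMap = new Dictionary<string, int> { ["hello"] = 0, ["world"] = 1 };

        // Known word: TryGetValue fills 'id' and returns true
        if (vocabMap.TryGetValue("hello", out int id))
            Console.WriteLine($"hello -> {id}"); // hello -> 0

        // Out-of-vocabulary word: fall back to a sentinel ID
        int unkId = vocabMap.TryGetValue("qux", out int known) ? known : -1;
        Console.WriteLine($"qux -> {unkId}"); // qux -> -1
    }
}
```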

The Queue: Queue<T> (FIFO)

Real-World Analogy: The Assembly Line

Items enter one end (tail) and exit the other (head). You cannot skip the line; you process items in the exact order they arrived.

Internal Mechanics: Circular Buffer

Queue<T> is typically implemented using a circular array (ring buffer).

  • Head and Tail Pointers: Instead of shifting elements (which is O(n)), the queue maintains integer indices pointing to the head and tail.
  • Circular Logic: When the tail pointer reaches the end of the array, it wraps around to index 0 (if space is available).
  • Resizing: If the tail catches up to the head (full queue), it resizes to a larger array and copies elements linearly.

Algorithmic Complexity:

  • Enqueue/Dequeue: O(1) (amortized).
  • Peek: O(1).
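The wrap-around logic can be sketched with a minimal fixed-capacity ring buffer. This is an illustration of the mechanism only: the real Queue<T> resizes when full instead of rejecting the item, and its internals differ in detail:

```csharp
using System;

// Minimal fixed-capacity ring buffer (illustrative sketch, not Queue<T>'s actual code)
class RingBuffer<T>
{
    private readonly T[] _items;
    private int _head, _tail, _count;

    public RingBuffer(int capacity) => _items = new T[capacity];

    public bool Enqueue(T item)
    {
        if (_count == _items.Length) return false; // full; the real Queue<T> would resize here
        _items[_tail] = item;
        _tail = (_tail + 1) % _items.Length;       // wrap around: O(1), no shifting
        _count++;
        return true;
    }

    public T Dequeue()
    {
        if (_count == 0) throw new InvalidOperationException("Queue empty");
        T item = _items[_head];
        _head = (_head + 1) % _items.Length;       // wrap around
        _count--;
        return item;
    }
}
```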

AI Application: Graph Traversal (BFS)

Graph traversal algorithms like Breadth-First Search (BFS) are fundamental in AI for pathfinding (e.g., GPS routing, game AI).

using System.Collections.Generic;

// Minimal node type assumed by this example
public class GraphNode {
    public List<GraphNode> Neighbors { get; } = new List<GraphNode>();
}

void BFS(GraphNode start) {
    Queue<GraphNode> queue = new Queue<GraphNode>();
    HashSet<GraphNode> visited = new HashSet<GraphNode>(); // O(1) check to avoid cycles

    queue.Enqueue(start);
    visited.Add(start);

    while (queue.Count > 0) {
        GraphNode current = queue.Dequeue();

        // Process current node (e.g., check if target)
        foreach (var neighbor in current.Neighbors) {
            if (!visited.Contains(neighbor)) {
                visited.Add(neighbor);
                queue.Enqueue(neighbor);
            }
        }
    }
}
Here, Queue ensures we explore the graph layer by layer, while HashSet prevents infinite loops in cyclic graphs.

The Linked List: LinkedList<T>

Real-World Analogy: A Treasure Hunt

Each clue (node) contains the location of the next clue. To get to the 10th clue, you must physically traverse the first nine. However, if you are already holding the 5th clue and want to insert a new clue between 5 and 6, you simply update the pointers—you don't need to move the clues 6 through 10.

Internal Mechanics: Doubly Linked Nodes

LinkedList<T> is not a contiguous array. It is a collection of LinkedListNode<T> objects.

  • Node Structure: Each node contains Value, Next, and Previous pointers.
  • The List: Maintains references to Head and Tail.
  • Memory: Nodes are scattered in the heap. This causes cache misses (poor locality of reference) compared to arrays, but offers unique flexibility.

Algorithmic Complexity:

  • Insertion (at known position): O(1). (e.g., list.AddAfter(node, value)).
  • Lookup (by index): O(n). (Must traverse from head).
  • Lookup (by value): O(n).
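The O(1) insertion is easy to see with LinkedListNode<T>: once you hold a node reference, AddAfter only rewires pointers:

```csharp
using System;
using System.Collections.Generic;

class LinkedListDemo
{
    static void Main()
    {
        var list = new LinkedList<int>();
        LinkedListNode<int> node = list.AddLast(5); // keep a handle to this node
        list.AddLast(6);

        list.AddAfter(node, 99); // O(1): pointer updates only, nothing is shifted

        Console.WriteLine(string.Join(", ", list)); // 5, 99, 6
    }
}
```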

AI Application: Efficient Sequence Modification

In AI, specifically in Reinforcement Learning or Time-Series analysis, we often maintain a "rolling buffer" of the last N events. If we use a List, removing the oldest item (index 0) requires shifting all remaining items (O(n)). If we use a LinkedList, removing the head is O(1).

// Rolling buffer for recent game states
LinkedList<GameFrame> recentFrames = new LinkedList<GameFrame>();

void AddFrame(GameFrame frame) {
    recentFrames.AddFirst(frame);
    if (recentFrames.Count > 100) {
        // Removing the tail is O(1) in a Doubly Linked List
        recentFrames.RemoveLast(); 
    }
}

Visualizing the Bucket Collision

To visualize how a Dictionary handles collisions, consider a scenario where keys "Apple" and "Banana" hash to the same bucket index (e.g., index 2).


When Dictionary["Banana"] is requested:

  1. Hash "Banana" → Index 2.
  2. Go to Bucket 2.
  3. Traverse the linked list: Compare "Banana" with "Apple" (first node).
  4. Match found? No. Move to Next.
  5. Match found? Yes. Return Value 20.

Generics and Iterators in Data Structures

The previous chapter discussed IEnumerable. Custom collections can implement it with iterators (yield return), which lets us decouple the storage mechanism from the traversal logic.

Consider a hypothetical OrderedHashSet that combines a HashSet for O(1) lookups with a LinkedList that preserves insertion order with O(1) appends. We can expose it as an IEnumerable without allocating a separate list.

using System.Collections;
using System.Collections.Generic;

public class OrderedHashSet<T> : IEnumerable<T>
{
    private HashSet<T> _lookup = new HashSet<T>();
    private LinkedList<T> _order = new LinkedList<T>();

    public void Add(T item)
    {
        if (_lookup.Add(item)) // Returns true if added
        {
            _order.AddLast(item);
        }
    }

    // The Iterator Pattern
    public IEnumerator<T> GetEnumerator()
    {
        foreach (var item in _order)
        {
            yield return item; // Streams data without creating a new collection
        }
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}

Why this matters for AI: When building a tokenizer, you often need to iterate over the vocabulary in the order they were added (to maintain consistency) but look them up by string instantly. Using yield return allows you to stream this data to a model training loop without duplicating memory.
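A short usage sketch of the OrderedHashSet defined above, showing duplicate rejection and insertion-order iteration:

```csharp
using System;

class OrderedHashSetDemo
{
    static void Main()
    {
        var vocab = new OrderedHashSet<string>(); // the class defined above
        vocab.Add("the");
        vocab.Add("cat");
        vocab.Add("the"); // duplicate: the HashSet rejects it, so order is unchanged

        foreach (var token in vocab)
            Console.Write(token + " "); // prints: the cat
    }
}
```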

Summary of Mechanics

Collection    Primary Use Case       Key Internal Structure             Lookup (Key/Index)      Insertion
Dictionary    Key-Value Mapping      Hash Table (Buckets + Chaining)    O(1) / O(n) worst       O(1)
HashSet       Unique Item Storage    Hash Table (Buckets + Chaining)    O(1) / O(n) worst       O(1)
Queue         FIFO Processing        Circular Buffer                    O(1) (Head/Tail)        O(1)
LinkedList    Frequent Insertions    Doubly Linked Nodes                O(n)                    O(1) (at pos)

Understanding these mechanics allows the AI engineer to choose the right tool. Using a List for a vocabulary lookup is a systemic failure waiting to happen as data scales. Using a Dictionary provides the O(1) backbone required for high-throughput inference.

Basic Code Example

Here is a simple, 'Hello World'-level code example contrasting the linear lookup of a List with the hash-based lookup of a HashSet.

using System;
using System.Collections.Generic;

public class DictionaryVsListLookup
{
    public static void Main()
    {
        // REAL-WORLD CONTEXT:
        // Imagine a system processing a stream of unique tokens (e.g., words in a book or user IDs).
        // We need to check if a specific token has already been processed.
        // This is a critical operation in Tokenizers for AI models or Graph Traversal algorithms.

        // DATA SETUP:
        // We will populate a List and a HashSet with 10,000 integers.
        const int dataSize = 10000;
        List<int> listData = new List<int>(dataSize);
        HashSet<int> hashSetData = new HashSet<int>();

        for (int i = 0; i < dataSize; i++)
        {
            listData.Add(i);
            hashSetData.Add(i);
        }

        int searchTarget = 9999; // The value we are looking for

        // ---------------------------------------------------------
        // 1. LIST LOOKUP (Linear Search)
        // ---------------------------------------------------------
        // Algorithm: Iterate through the list until the item is found.
        // Complexity: O(n) - In the worst case, we check every element.
        bool foundInList = false;
        foreach (int number in listData)
        {
            if (number == searchTarget)
            {
                foundInList = true;
                break; // Found it, stop searching
            }
        }

        // ---------------------------------------------------------
        // 2. HASHSET LOOKUP (Hash Table Lookup)
        // ---------------------------------------------------------
        // Algorithm: 
        // 1. Calculate HashCode of searchTarget.
        // 2. Map HashCode to a specific "Bucket" (index).
        // 3. Check the bucket. If collision exists, verify exact value.
        // Complexity: O(1) - Average case. Time remains constant regardless of size.
        bool foundInHashSet = hashSetData.Contains(searchTarget);

        // OUTPUT RESULTS
        Console.WriteLine($"Searching for value: {searchTarget}");
        Console.WriteLine($"List Lookup Result: {foundInList}");
        Console.WriteLine($"HashSet Lookup Result: {foundInHashSet}");

        // PERFORMANCE COMPARISON DEMONSTRATION
        // We will now perform the lookup 1000 times to measure the difference.

        System.Diagnostics.Stopwatch stopwatch = new System.Diagnostics.Stopwatch();

        // Test List Performance
        stopwatch.Start();
        for (int i = 0; i < 1000; i++)
        {
            // We search for a value that is at the end of the list (worst case for List)
            // This forces the loop to iterate almost 10,000 times per check.
            listData.Contains(searchTarget); 
        }
        stopwatch.Stop();
        Console.WriteLine($"\nTime taken for 1000 lookups (List): {stopwatch.ElapsedMilliseconds} ms");

        // Test HashSet Performance
        stopwatch.Restart();
        for (int i = 0; i < 1000; i++)
        {
            // HashSet uses the hash code to jump directly to the bucket.
            // It does not iterate through previous elements.
            hashSetData.Contains(searchTarget);
        }
        stopwatch.Stop();
        Console.WriteLine($"Time taken for 1000 lookups (HashSet): {stopwatch.ElapsedMilliseconds} ms");
    }
}

Explanation of Internal Mechanics

Here is the breakdown of how these data structures function under the hood:

  1. The List (Array List):

    • Storage: Internally uses a contiguous array.
    • Lookup (Contains): Performs a linear scan. It starts at index 0 and compares the value at each index until it finds a match or reaches the end.
    • Implication: As the data size n grows, the lookup time grows linearly. For 1 million items, it may need to check up to 1 million elements.
  2. The HashSet (Hash Table):

    • Storage: Uses an array of "buckets."
    • Hashing: When you call Contains(value), the structure computes a hash code for that value (an integer).
    • Indexing: It uses modulo arithmetic (hash % arrayLength) to determine exactly which bucket index to check.
    • Collision Handling: If two different values produce the same hash (a collision), they are stored in the same bucket as a linked list (or similar structure).
    • Implication: In the average case, the bucket contains only one item. The lookup is effectively instant, regardless of the dataset size.

Visualizing the Data Structure

[Diagram: a linear scan through a contiguous array versus a direct jump to a hash bucket.]

Common Pitfalls

1. Relying on List.Contains for Large Datasets

A frequent mistake is using a List<T> to store data that requires frequent existence checks (e.g., checking if a user ID is in a blacklist).

  • The Mistake: if (blacklist.Contains(userId))
  • Why it fails: If the blacklist grows to 100,000 users, the system must perform up to 100,000 comparisons for every single request.
  • The Fix: Convert the data to a HashSet<T> immediately. The cost of building the HashSet (O(n)) is paid once, making subsequent lookups (O(1)) extremely fast.
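The fix is a one-line conversion; the blacklist values below are made up for illustration:

```csharp
using System;
using System.Collections.Generic;

class BlacklistDemo
{
    static void Main()
    {
        List<int> blacklistSource = new List<int> { 42, 7, 1001 };

        // Pay the O(n) build cost once...
        HashSet<int> blacklist = new HashSet<int>(blacklistSource);

        // ...then every membership check is O(1) on average
        Console.WriteLine(blacklist.Contains(7));   // True
        Console.WriteLine(blacklist.Contains(500)); // False
    }
}
```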

2. Mutable Keys in Dictionaries

While this example uses simple integers, using mutable objects (like a class with changing properties) as keys in a Dictionary or HashSet is dangerous.

  • The Mistake: Adding an object to a HashSet, then changing a property that affects its GetHashCode().
  • Why it fails: The object is placed in a specific bucket based on its original hash. If the hash changes, the object is now "lost" because looking for it using the new hash will point to the wrong bucket.
  • The Fix: Always use immutable types (like string, int, Guid) or immutable structs as keys. If you must use a class, ensure its hash code never changes after insertion.
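The pitfall is easy to reproduce. The MutableKey class below is a deliberately bad design used only to demonstrate the failure: the set stores the hash computed at insertion time, so after the mutation the lookup computes a different hash and misses the entry:

```csharp
using System;
using System.Collections.Generic;

// Deliberately bad: the hash code depends on a mutable field
class MutableKey
{
    public string Name;
    public override int GetHashCode() => Name.GetHashCode();
    public override bool Equals(object obj) => obj is MutableKey other && other.Name == Name;
}

class MutableKeyDemo
{
    static void Main()
    {
        var set = new HashSet<MutableKey>();
        var key = new MutableKey { Name = "alpha" };
        set.Add(key); // stored under the hash of "alpha"

        key.Name = "beta"; // hash code changes AFTER insertion

        // The lookup hashes "beta", lands in the wrong bucket, and misses:
        Console.WriteLine(set.Contains(key)); // False: the entry is effectively lost
        Console.WriteLine(set.Count);         // 1: yet it still occupies memory
    }
}
```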

The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.



Code License: All code examples are released under the MIT License. Github repo.

Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.

All textual explanations, original diagrams, and illustrations are the intellectual property of the author. To support the maintenance of this site via AdSense, please read this content exclusively online. Copying, redistribution, or reproduction is strictly prohibited.