
Chapter 4: Immutable Collections for Thread-Safe Data

Theoretical Foundations

The theoretical foundation of immutable collections rests on a fundamental trade-off: sacrificing in-place mutability for guaranteed thread safety and predictable state management. Unlike mutable collections, where operations like Add, Remove, or Update modify the internal memory of the collection object directly, immutable collections treat data as values. Just as the integer 5 cannot be changed into 6 (you can only assign a new variable the value 6), an ImmutableList<T> cannot be altered. Instead, every modification operation returns a new instance of the collection that reflects the change, while sharing as much of the underlying memory structure as possible with the original.

Structural Sharing: The Backbone of Efficiency

To understand how immutable collections remain performant despite creating new instances on every modification, we must look at Structural Sharing. This concept relies on persistent data structures, specifically tree-based structures, rather than flat arrays.

Imagine a List<T> backed by a standard array. If you want to add an item to an immutable version of this list, you would theoretically need to copy the entire array to a new memory location, append the item, and leave the old array for garbage collection. This results in \(O(N)\) time complexity for additions, which is unacceptable for large datasets.

Immutable collections in .NET (specifically System.Collections.Immutable) use a tree structure (often a B-Tree variant or a relaxed AVL tree) to solve this. When a modification occurs, only the nodes along the path from the root to the modified element are copied. The rest of the tree is shared between the old and new instances.

Visualizing Structural Sharing

Consider an ImmutableList with a branching factor (width) of 4. We have 16 items.

A diagram illustrating structural sharing in an ImmutableList with a branching factor of 4 and 16 items would show a tree where multiple leaf nodes reference common internal nodes, demonstrating how updates reuse existing structure to save memory.

In such a diagram, the new root A' is a fresh allocation, but it points to the existing child nodes B1, B2, and B3. Only the branch leading to the new item L17 (via a new node B4') occupies new memory; the rest of the tree is shared with the original instance. The operation is therefore \(O(\log N)\) rather than \(O(N)\).
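These value semantics can be observed directly. The following minimal sketch (class and variable names are illustrative) shows that Add returns a new list while the original is untouched:

```csharp
using System;
using System.Collections.Immutable;

public class SharingDemo
{
    public static void Main()
    {
        ImmutableList<int> original = ImmutableList.Create(1, 2, 3);

        // Add does not mutate 'original'; it returns a new list that
        // shares most of its internal tree nodes with the old one.
        ImmutableList<int> extended = original.Add(4);

        Console.WriteLine(original.Count); // 3 -- unchanged
        Console.WriteLine(extended.Count); // 4
    }
}
```

Both instances remain valid and usable; the structural sharing happens invisibly inside the tree.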

The Builder Pattern for Batch Updates

While structural sharing optimizes individual updates, creating a new immutable collection for every single addition in a loop (e.g., adding 10,000 items) is inefficient. Each operation involves allocation and tree balancing overhead.

To address this, immutable collections utilize the Builder Pattern. A Builder is a mutable, temporary object used to construct the immutable collection in a highly optimized way. Once the batch operations are complete, the Builder creates the final immutable snapshot.

This mirrors the construction of complex objects in previous chapters (like StringBuilder vs. String). In AI applications, this is crucial when loading a vocabulary from a file. You wouldn't want to create a new immutable dictionary for every token read; you would use a Builder to construct the dictionary in memory and then "freeze" it into an immutable state for concurrent access.

using System;
using System.Collections.Generic;
using System.Collections.Immutable;

public class VocabularyBuilderExample
{
    public static void BuildVocabulary()
    {
        // 1. Create a mutable builder. This is O(1) allocation.
        var builder = ImmutableDictionary.CreateBuilder<string, int>();

        // 2. Perform batch updates. This is efficient as it uses standard
        //    mutable resizing logic internally (similar to List<T>).
        builder.Add("the", 1);
        builder.Add("quick", 2);
        builder.Add("brown", 3);

        // Simulating a large batch load
        foreach (var token in new[] { "fox", "jumps", "over" })
        {
            builder[token] = builder.Count + 1;
        }

        // 3. Freeze the collection. This creates the immutable tree structure
        //    and discards the mutable builder.
        ImmutableDictionary<string, int> vocabulary = builder.ToImmutable();

        Console.WriteLine($"Vocabulary size: {vocabulary.Count}");
        // vocabulary.Add("new", 100); // Note: this compiles, but Add returns a
        // NEW dictionary that is discarded here; 'vocabulary' itself never changes.
    }
}

Deep Dive: Internal Mechanics of ImmutableDictionary<TKey, TValue>

To understand the performance of immutable collections, we must compare them to their mutable counterparts, specifically Dictionary<TKey, TValue>.

Mutable Dictionary Mechanics

A standard Dictionary<TKey, TValue> in .NET is backed by an array of "buckets." When you add a key-value pair:

  1. Hashing: The GetHashCode() method of the key is called.
  2. Bucket Indexing: The hash code is processed (modulo the array size) to find an index in the internal array.
  3. Collision Handling: If multiple keys map to the same bucket (a collision), the dictionary uses chaining (a linked list) at that bucket.

Resizing: When the dictionary becomes full (load factor reached), it allocates a new, larger array (usually doubling the size), re-hashes every existing key, and re-inserts them. This is an \(O(N)\) operation but happens infrequently, keeping the amortized complexity at \(O(1)\).
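The bucket-indexing step can be sketched in isolation. This is a simplified model for illustration, not the actual Dictionary<TKey, TValue> source: it maps a hash code to a bucket index and shows why resizing forces a re-hash of every entry.

```csharp
using System;

public static class BucketMath
{
    // Simplified bucket index. The mask clears the sign bit so that
    // negative hash codes still produce a valid array index.
    public static int GetBucketIndex(int hashCode, int bucketCount)
    {
        return (hashCode & 0x7FFFFFFF) % bucketCount;
    }

    public static void Main()
    {
        int hash = "hello".GetHashCode();

        // The same key can land in a DIFFERENT bucket after a resize,
        // which is why every entry must be re-hashed into the new array.
        Console.WriteLine(GetBucketIndex(hash, 8));
        Console.WriteLine(GetBucketIndex(hash, 16));
    }
}
```

(In modern .NET, string hash codes are also randomized per process, so the printed indices vary from run to run.)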

Immutable Dictionary Mechanics

ImmutableDictionary<TKey, TValue> cannot simply resize an array because that would require mutating the existing collection. Instead, it uses a Hash Array Mapped Trie (HAMT).

A HAMT is a tree structure where:

  • Leaves contain key-value pairs.
  • Nodes contain bitmasks and references to child nodes.
  • Branching: Instead of a fixed array size, the bitmask determines which child nodes exist.

The Lookup Process:

  1. Calculate the hash code of the key.
  2. Extract the first few bits of the hash to determine an index (0-31).
  3. Check the bitmask at the current node: if the bit at that index is 0, the key does not exist. If 1, follow the pointer to the child node.
  4. Repeat using the next chunk of bits from the hash until a leaf is reached.
  5. Compare keys for equality (handling hash collisions).

Complexity:

  • Lookup: \(O(\log N)\) (specifically \(\log_{32} N\)). This is effectively constant time for realistic dataset sizes, though slightly slower than a mutable dictionary due to pointer chasing.
  • Update: \(O(\log N)\). Only the path from the root to the specific leaf is copied.
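The chunk-extraction step in the lookup above can be sketched as follows. This is a simplified model of HAMT indexing, not the actual library internals: each level of the trie consumes 5 bits of the hash code, yielding a child index in the range 0-31.

```csharp
using System;

public static class HamtIndexing
{
    private const int BitsPerLevel = 5;                    // 2^5 = 32-way branching
    private const int LevelMask = (1 << BitsPerLevel) - 1; // 0b11111

    // Index into the children of the node at the given trie depth.
    public static int ChildIndex(int hashCode, int depth)
    {
        return (hashCode >> (depth * BitsPerLevel)) & LevelMask;
    }

    public static void Main()
    {
        int hash = 0b10110_00011_01101; // three 5-bit chunks, depths 0..2

        Console.WriteLine(HamtIndexing.ChildIndex(hash, 0)); // 13 (0b01101)
        Console.WriteLine(HamtIndexing.ChildIndex(hash, 1)); // 3  (0b00011)
        Console.WriteLine(HamtIndexing.ChildIndex(hash, 2)); // 22 (0b10110)
    }
}
```

With 32-way branching, a trie of depth 4 already addresses over a million slots, which is why \(\log_{32} N\) behaves like a small constant in practice.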

AI Context: Tokenizers and Graph Traversal

In AI and Machine Learning, specifically in Natural Language Processing (NLP), Tokenizers map text strings to integer IDs. This is essentially a lookup table (Dictionary).

Scenario: A BPE (Byte Pair Encoding) tokenizer vocabulary might contain 50,000 tokens.

  • Mutable Dictionary: Fast lookups (\(O(1)\)), but if the tokenizer needs to be shared across multiple threads (e.g., a web server handling concurrent inference requests), you must implement locking mechanisms (lock(_vocab)). This introduces contention and can block threads.
  • Immutable Dictionary: Lookups are slightly slower (\(O(\log N)\)), but the collection is inherently thread-safe. Multiple threads can read the vocabulary simultaneously without locks. If the vocabulary needs to be updated (rare, but possible in dynamic learning scenarios), the update produces a new version without disrupting readers of the old version.
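The read/update pattern described above is often implemented as a lock-free snapshot swap. The class and member names below are illustrative, a sketch rather than a production design:

```csharp
using System.Collections.Immutable;
using System.Threading;

public class TokenizerVocabulary
{
    // Readers take a snapshot of this reference; they never observe a
    // half-updated dictionary because each version is immutable.
    private ImmutableDictionary<string, int> _vocab =
        ImmutableDictionary<string, int>.Empty;

    public bool TryGetId(string token, out int id)
    {
        // Lock-free read: grab the current snapshot and query it.
        var snapshot = Volatile.Read(ref _vocab);
        return snapshot.TryGetValue(token, out id);
    }

    public void AddToken(string token, int id)
    {
        // Retry loop: build a new version, then publish it atomically.
        // Concurrent readers keep using whichever snapshot they hold.
        ImmutableDictionary<string, int> current, updated;
        do
        {
            current = Volatile.Read(ref _vocab);
            updated = current.SetItem(token, id);
        }
        while (Interlocked.CompareExchange(ref _vocab, updated, current) != current);
    }
}
```

No reader is ever blocked: an inference thread that is mid-lookup simply finishes against the old snapshot while new requests see the updated one.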

Graph Traversal: In graph algorithms (e.g., Breadth-First Search), we often maintain a visited set to avoid cycles.

  • Mutable Set: We modify the set in place as we traverse.
  • Immutable Set: If we implement a recursive DFS (Depth-First Search) using yield return iterators, an immutable collection allows us to pass the "visited" state down the recursion stack cleanly. However, because of the \(O(\log N)\) overhead, for performance-critical graph traversals (like pathfinding in game AI), mutable collections are often preferred unless the specific algorithm requires persistent snapshots of the graph state.
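The recursive pattern mentioned above can be sketched as follows. Because each recursive call receives its own immutable "visited" snapshot, sibling branches cannot interfere with one another and no backtracking cleanup is needed (the graph representation here is an assumption for illustration):

```csharp
using System.Collections.Generic;
using System.Collections.Immutable;

public static class ImmutableDfs
{
    // Lazily yields nodes along a DFS from 'node'. The immutable
    // 'visited' set guards each path against cycles.
    public static IEnumerable<int> Traverse(
        int node,
        IReadOnlyDictionary<int, int[]> edges,
        ImmutableHashSet<int> visited)
    {
        if (visited.Contains(node))
            yield break;

        yield return node;

        // 'visited.Add' returns a NEW set; the caller's set is untouched.
        var nextVisited = visited.Add(node);

        if (edges.TryGetValue(node, out var neighbors))
        {
            foreach (var neighbor in neighbors)
            {
                foreach (var reached in Traverse(neighbor, edges, nextVisited))
                    yield return reached;
            }
        }
    }
}
```

For the cyclic graph 1 → 2 → 1, calling Traverse(1, edges, ImmutableHashSet<int>.Empty) yields 1 then 2 and terminates despite the cycle.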

Comparison: List<T> vs. HashSet<T> vs. Immutable

Let's look at the algorithmic complexity of containment checks, a fundamental operation in AI (e.g., "Does this word exist in the stop-word list?").

1. List<T> (Mutable Array)

  • Structure: Contiguous memory block.
  • Contains Operation: Linear search \(O(N)\). The engine must compare the target item with every element in the list until a match is found.
  • Use Case: Only when the collection is small or rarely searched.
// O(N) Complexity
public bool ContainsInList<T>(List<T> list, T item)
{
    // Internally iterates using a for-loop or pointer arithmetic
    return list.Contains(item); 
}

2. HashSet<T> (Mutable Hash Table)

  • Structure: Buckets based on hash codes.
  • Contains Operation: Average \(O(1)\), Worst Case \(O(N)\) (rare, requires all keys to collide).
  • Use Case: General purpose fast lookups.
// O(1) Complexity
public bool ContainsInSet<T>(HashSet<T> set, T item)
{
    // Calculates hash, finds bucket, checks equality
    return set.Contains(item); 
}

3. ImmutableHashSet<T> (HAMT)

  • Structure: Hash Array Mapped Trie.
  • Contains Operation: \(O(\log N)\) (specifically \(\log_{32} N\)).
  • Use Case: Thread-safe lookups where concurrency is required without locking.
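For symmetry with the two snippets above, a containment-check sketch for the immutable case (the wrapper class name is illustrative; the call itself descends the trie rather than jumping to a bucket):

```csharp
using System.Collections.Immutable;

// O(log N) Complexity
public static class ImmutableLookup
{
    public static bool ContainsInImmutableSet<T>(ImmutableHashSet<T> set, T item)
    {
        // Walks the HAMT: a chunk of the hash per level, then an
        // equality check at the leaf (handling collisions).
        return set.Contains(item);
    }
}
```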

Iterators and Lazy Evaluation

While yield return is a C# language feature, it plays a significant role in how we consume immutable collections efficiently. Because immutable collections are tree-based, iterating over them requires traversing the tree structure.

In a mutable List<T>, iteration is a simple pointer increment over contiguous memory. In an ImmutableList<T>, iteration traverses the tree depth-first.

However, yield return allows us to create lazy iterators. This is vital in AI data pipelines where datasets may not fit entirely in memory. We can define an iterator that yields items from an immutable collection without creating a new copy of the data.

using System.Collections.Generic;
using System.Collections.Immutable;

public class DataStreamProcessor
{
    // Using yield return to traverse an immutable structure lazily
    public static IEnumerable<T> TraverseImmutableList<T>(ImmutableList<T> list)
    {
        // In a real immutable list implementation, this would traverse
        // the tree nodes recursively or iteratively.
        foreach (var item in list)
        {
            // 'yield return' suspends execution, allowing the caller
            // to process one item at a time without loading the whole list.
            yield return item;
        }
    }
}

Summary of Mechanics

  1. Hashing & Buckets: The foundation of \(O(1)\) lookups in mutable dictionaries. Immutable dictionaries replace fixed buckets with a tree of nodes (HAMT) to allow structural sharing.
  2. Resizing: Mutable collections resize by allocating a larger array and rehashing. Immutable collections grow by adding new nodes to the tree, preserving the old structure.
  3. Collisions: Handled via chaining in mutable dictionaries. In immutable dictionaries, hash collisions are resolved by checking equality at the leaf nodes of the tree.
  4. Thread Safety: Achieved not by locking, but by immutability. If a thread has a reference to an immutable collection, it guarantees that no other thread can modify that specific version of the data.

Basic Code Example

Here is a simple "Hello World" level example demonstrating the internal mechanics of Dictionary<TKey, TValue> and HashSet<T>, focusing on hashing, buckets, and collision resolution.

The Scenario: Efficient Vocabulary Lookup for a Tokenizer

In Natural Language Processing (NLP), a tokenizer breaks text into individual words (tokens). To process text efficiently, we need to map these string tokens to unique integer IDs (e.g., "Hello" -> 5, "World" -> 10). This mapping must support extremely fast lookups because we might process millions of words per second.

We will compare two approaches:

  1. List<T>: Scans elements one by one (\(O(N)\)).
  2. HashSet<T> / Dictionary<K,V>: Uses hashing to jump directly to the correct bucket (\(O(1)\) average case).

Code Example: Implementing a Simple Vocabulary

using System;
using System.Collections.Generic;

namespace DataStructuresDeepDive
{
    class Program
    {
        static void Main(string[] args)
        {
            // ---------------------------------------------------------
            // SCENARIO: Building a vocabulary for a simple tokenizer.
            // ---------------------------------------------------------

            // 1. The "Slow" Way: Using a List (Linear Search)
            // ---------------------------------------------------------
            List<string> vocabularyList = new List<string>();

            // Simulating adding words to our vocabulary
            vocabularyList.Add("hello");
            vocabularyList.Add("world");
            vocabularyList.Add("data");
            vocabularyList.Add("structures");

            // We need to find the index (ID) of a specific word.
            string wordToFind = "structures";

            // MANUAL SEARCH ALGORITHM (No LINQ shortcuts allowed!)
            // We iterate through the list until we find a match.
            int foundIndex = -1;
            for (int i = 0; i < vocabularyList.Count; i++)
            {
                if (vocabularyList[i] == wordToFind)
                {
                    foundIndex = i;
                    break; // Found it, stop searching.
                }
            }

            Console.WriteLine($"[List] Word '{wordToFind}' found at index: {foundIndex}");
            // Complexity: O(N) - In the worst case, we check every single item.


            // 2. The "Fast" Way: Using a HashSet (Hashing)
            // ---------------------------------------------------------
            // In a real tokenizer, we map string -> int, but for this 
            // "Hello World" example, we will use HashSet<string> to focus 
            // purely on the lookup mechanism without value overhead.
            HashSet<string> vocabularySet = new HashSet<string>();

            // Adding words. Internally, this calculates a hash code 
            // and places the string in a specific "bucket".
            vocabularySet.Add("hello");
            vocabularySet.Add("world");
            vocabularySet.Add("data");
            vocabularySet.Add("structures");

            // The Lookup Operation
            // The HashSet doesn't scan. It computes the hash of "structures"
            // and jumps directly to the bucket where it should be.
            bool exists = vocabularySet.Contains(wordToFind);

            Console.WriteLine($"[HashSet] Does '{wordToFind}' exist? {exists}");
            // Complexity: O(1) - Average case. Constant time regardless of list size.


            // 3. The Dictionary Way: Key-Value Mapping (Tokenizers use this)
            // ---------------------------------------------------------
            Dictionary<string, int> tokenToIdMap = new Dictionary<string, int>();

            // Populate with IDs
            tokenToIdMap["hello"] = 0;
            tokenToIdMap["world"] = 1;
            tokenToIdMap["data"] = 2;
            tokenToIdMap["structures"] = 3;

            // Retrieve the ID
            // Again, this uses the hash of the key to find the bucket instantly.
            if (tokenToIdMap.TryGetValue("structures", out int tokenId))
            {
                Console.WriteLine($"[Dictionary] Token 'structures' maps to ID: {tokenId}");
            }
        }
    }
}

Step-by-Step Explanation

  1. Initialization (List<string>): We create a dynamic array. When we add "hello", "world", etc., the list stores these strings contiguously in memory. To find a word, the computer must start at index 0, compare the string, move to index 1, compare, and so on.

  2. The Linear Search (for loop): In the vocabularyList section, we manually implemented a search loop. This is exactly what happens under the hood if you call List.IndexOf() or List.Contains().

    • Why it's slow: If your vocabulary grows to 1 million words, and the word you want is at the end, you perform 1 million string comparisons.
  3. The Hashing Mechanism (HashSet<string>): When we call vocabularySet.Add("structures"), the following happens internally:

    • Hash Code: The .GetHashCode() method of the string "structures" is called. This returns an integer (e.g., -123456789).
    • Bucket Mapping: The HashSet maps that integer to a specific index in its internal array of "buckets", typically via a modulo by the bucket count (or a bitwise AND with size − 1 when the size is a power of two).
    • Storage: The string is placed in that bucket.
  4. The Lookup (Contains): When we ask vocabularySet.Contains("structures"):

    • It calculates the hash code for "structures" again.
    • It jumps directly to the calculated bucket index.
    • It checks the item in that bucket. If the bucket is empty, the item is absent; if it holds a different item (a collision), the chain in that bucket is followed (.NET's Dictionary and HashSet resolve collisions via chaining).
    • Result: It doesn't look at "hello", "world", or "data". It ignores 99.9% of the data.
  5. Dictionary Implementation: The Dictionary<string, int> works exactly like HashSet<string>, but the bucket stores a struct containing both the Key (string) and the Value (int). The lookup speed is identical to HashSet.

Visualizing the Buckets

Imagine the internal storage of the HashSet or Dictionary as an array of buckets. Hashing distributes items evenly across these buckets.

Internal Array (Buckets)
Index 0: [Empty]
Index 1: ["world"] -> (Hash of "world" mapped here)
Index 2: ["data"]  -> (Hash of "data" mapped here)
Index 3: [Empty]
Index 4: ["hello"] -> (Hash of "hello" mapped here)
Index 5: ["structures"] -> (Hash of "structures" mapped here)
...

When you look for "structures", the computer calculates the hash, finds it belongs to Index 5, and retrieves it immediately without checking Indexes 0-4.

Complexity Comparison

Operation         | List<T>                 | Dictionary<K,V> / HashSet<T>
------------------|-------------------------|-----------------------------
Add               | \(O(1)\) (Amortized)*   | \(O(1)\) (Average)
Remove            | \(O(N)\)                | \(O(1)\) (Average)
Contains / Lookup | \(O(N)\) (Linear Scan)  | \(O(1)\) (Hash Lookup)

*List Add is \(O(1)\) unless the internal array resizes, which is \(O(N)\).

Common Pitfalls

1. Poor Hash Distribution (The "Bad Key" Problem) If you create a custom class for a Dictionary and implement GetHashCode() poorly (e.g., returning the same integer for every object), all items will map to the same bucket. This turns the Dictionary into a linked list inside that bucket, degrading performance from \(O(1)\) to \(O(N)\).

  • Rule: A good hash function distributes values uniformly across the integer range.

2. Modifying Keys While in the Collection Never modify a field used in the hash calculation while the object is in a Dictionary or HashSet.

// BAD EXAMPLE
// (Assume MyObject bases its equality on a mutable Id field.)
public class MyObject
{
    public int Id { get; set; }
    public override int GetHashCode() => Id;
    public override bool Equals(object obj) => obj is MyObject o && o.Id == Id;
}

var set = new HashSet<MyObject>();
var obj = new MyObject { Id = 1 };
set.Add(obj);

// If we change obj.Id, its hash code changes.
// The HashSet still looks in the OLD bucket for it.
obj.Id = 2; 

// This typically returns false, because the object is 
// effectively lost in the internal structure.
bool found = set.Contains(obj); 

3. Ignoring Resizing Costs While lookups are \(O(1)\), adding items triggers resizing when the internal array is full. This involves allocating a new larger array and re-hashing every existing item (copying to new buckets). This specific operation is \(O(N)\), though it happens infrequently. For high-performance scenarios requiring predictable latency, initialize the Dictionary or HashSet with a specific capacity if you know the approximate number of items upfront.
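Pre-sizing as suggested above is a one-line change; the capacity values and method names below are illustrative:

```csharp
using System.Collections.Generic;

public static class PreSizedCollections
{
    public static Dictionary<string, int> CreateVocabularyMap()
    {
        // Reserving capacity up front avoids intermediate resize/rehash
        // passes while the ~50,000 tokens are loaded.
        return new Dictionary<string, int>(capacity: 50_000);
    }

    public static HashSet<string> CreateStopWordSet()
    {
        // HashSet<T> also accepts a capacity (available since .NET Core 2.0).
        return new HashSet<string>(capacity: 1_000);
    }
}
```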

The chapter continues with advanced code, exercises and solutions with analysis, you can find them on the ebook on Leanpub.com or Amazon



Code License: All code examples are released under the MIT License. Github repo.

Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.

All textual explanations, original diagrams, and illustrations are the intellectual property of the author. To support the maintenance of this site via AdSense, please read this content exclusively online. Copying, redistribution, or reproduction is strictly prohibited.