Chapter 3: Introduction to Vectors - Arrays vs Tensors
Theoretical Foundations
In Book 2, we explored Generics and how they allow us to write type-safe, reusable code. We saw how List<T> provides a dynamic array that grows as needed. While List<T> is excellent for ordered storage and iteration, it is not optimized for membership tests or keyed lookups. As we move into building AI systems, specifically tokenizers and graph traversals, we require data structures that offer average constant-time lookups, insertions, and deletions.
This section introduces the core collection types that serve as the building blocks for high-performance vector operations and AI pipelines: Dictionaries, HashSets, Queues, and Linked Lists. We will dissect their internal mechanics, focusing on how they manage memory and collisions, and analyze their algorithmic complexity to understand why they are preferred over linear structures in specific contexts.
The Dictionary: Hashing and Buckets
The Dictionary<TKey, TValue> is the cornerstone of efficient lookups. Unlike a List<T>, which requires scanning elements sequentially (O(n) complexity), a Dictionary aims for O(1) average-case complexity for lookups, insertions, and deletions. This is achieved through hashing.
Real-World Analogy
Imagine a massive library containing millions of books. If you want to find a specific book by its title, walking through every aisle (linear scan) is inefficient. A library uses a catalog system: you look up the title, get a call number (e.g., "Aisle 4, Shelf 2"), and go directly there. The Dictionary works similarly. It takes a key (the book title), calculates a hash code (the call number), and uses that to find the exact location in memory (the shelf).
Internal Mechanics: Hashing and Buckets
A Dictionary internally uses an array of "buckets." When you add a key-value pair:
- Hash Calculation: The key's GetHashCode() method is called, returning an integer (the hash code).
- Index Mapping: The hash code is mapped to a specific bucket index using the formula index = hashCode % bucketCount.
- Storage: The key-value pair is stored in that bucket.
However, different keys can produce the same index (a "collision"). To handle this, buckets typically contain a linked list of entries. If two keys map to the same bucket, the new entry is appended to the list in that bucket.
Collision Resolution
Collisions are inevitable. If too many keys map to the same bucket, the performance degrades to O(n) (scanning the linked list). To mitigate this:
- Resizing: When the number of elements exceeds a load factor (usually 70-90% of the bucket capacity), the dictionary resizes. It creates a new, larger array (usually double the size) and re-hashes all existing keys into the new buckets. This is an expensive O(n) operation, but it happens infrequently, maintaining the O(1) average amortized complexity.
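The index-mapping step can be sketched directly. A minimal example (the bucket count of 8 is arbitrary, and note that string hash codes in modern .NET are randomized per process, so the printed values differ between runs):

```csharp
using System;

class BucketIndexDemo
{
    static void Main()
    {
        // Arbitrary small bucket count, chosen only for illustration.
        int bucketCount = 8;
        string[] keys = { "the", "quick", "brown", "fox" };

        foreach (var key in keys)
        {
            int hashCode = key.GetHashCode();
            // Mask off the sign bit so the index is never negative,
            // then map the hash into the bucket array's bounds.
            int index = (hashCode & 0x7FFFFFFF) % bucketCount;
            Console.WriteLine($"{key,-6} -> hash {hashCode,12} -> bucket {index}");
        }
    }
}
```

Run it twice and you will likely see different hash codes but always a bucket index in the range 0 to 7, which is exactly the mapping the formula above describes.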
C# Implementation and Complexity
Here is a simplified conceptual implementation of a dictionary entry and usage in an AI context (e.g., a tokenizer vocabulary):
using System;
using System.Collections.Generic;
public class TokenizerVocabulary
{
// A dictionary mapping tokens (strings) to their integer IDs
private Dictionary<string, int> _vocab;
public TokenizerVocabulary()
{
// Initialize with a reasonable capacity to minimize resizing
_vocab = new Dictionary<string, int>(capacity: 10000);
}
public void AddToken(string token, int id)
{
// O(1) average, O(n) worst case (if resizing or many collisions)
_vocab[token] = id;
}
public int GetTokenId(string token)
{
// O(1) average lookup
if (_vocab.TryGetValue(token, out int id))
{
return id;
}
// Return unknown token ID (standard in NLP)
return -1;
}
}
Complexity Analysis:
- Lookup/Insert/Delete: O(1) average case. O(n) worst case (if all keys collide).
- Memory: O(n) to store keys and values.
- Comparison vs List:
  - List<string>.Contains("token"): O(n), must scan every element.
  - Dictionary<string, int>.ContainsKey("token"): O(1) average, direct index access via hashing.
In AI, this is critical. A tokenizer vocabulary might contain 50,000 tokens. Checking whether a word exists using a List means scanning up to 50,000 elements per lookup, which is far too slow for real-time inference, whereas a Dictionary resolves the same check in constant time on average.
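A quick way to feel this difference is a rough Stopwatch comparison. This is a sketch, not a rigorous benchmark, and absolute timings vary by machine; the point is only the gap between the two approaches:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

class LookupComparison
{
    static void Main()
    {
        // Build a 50,000-token vocabulary both ways.
        var list = new List<string>();
        var dict = new Dictionary<string, int>();
        for (int i = 0; i < 50_000; i++)
        {
            string token = "token" + i;
            list.Add(token);
            dict[token] = i;
        }

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 1_000; i++)
            list.Contains("token49999");          // O(n): scans the whole list
        sw.Stop();
        Console.WriteLine($"List.Contains x1000:          {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        for (int i = 0; i < 1_000; i++)
            dict.ContainsKey("token49999");       // O(1) average: hash lookup
        sw.Stop();
        Console.WriteLine($"Dictionary.ContainsKey x1000: {sw.ElapsedMilliseconds} ms");
    }
}
```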
The HashSet: Mathematical Set Theory in Code
HashSet<T> is a specialized collection that contains no duplicate elements and provides high-performance set operations (Union, Intersection, Difference). It is essentially a Dictionary<TKey, TValue> where the values are ignored (or a dummy value is used).
Real-World Analogy
Think of a security checkpoint scanning for banned items. You have a list of banned item signatures. As an item passes, you check if it matches any signature. You don't care about the order, and you don't need to store a "value" associated with the item—only whether it exists in the set. This is a HashSet.
Internal Mechanics
Like Dictionary, HashSet uses a hash table structure. It stores only the keys (the elements). When you call Add(item):
- The hash code of item is computed.
- The bucket index is calculated.
- If the bucket is empty, the item is added.
- If the bucket contains items, the HashSet checks for equality. If the item already exists (based on GetHashCode and Equals), the operation returns false and no change is made.
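That false return on a duplicate can be observed directly:

```csharp
using System;
using System.Collections.Generic;

class HashSetAddDemo
{
    static void Main()
    {
        var seen = new HashSet<string>();

        // The first insertion succeeds...
        Console.WriteLine(seen.Add("fox"));   // True
        // ...the duplicate is rejected and the set is unchanged.
        Console.WriteLine(seen.Add("fox"));   // False
        Console.WriteLine(seen.Count);        // 1
    }
}
```

This return value is often used directly, e.g. `if (visited.Add(nodeId)) { /* first visit */ }`, which combines the membership check and the insertion in a single O(1) operation.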
AI Context: Efficient Lookups for Tokenizers
In Natural Language Processing (NLP), we often need to filter out "stop words" (common words like "the", "is", "and") or check if a token is in a specific vocabulary subset. Using a List for this requires nested loops or repeated linear scans. A HashSet allows O(1) filtering.
using System.Collections.Generic;
public class TextPreprocessor
{
// Using HashSet for O(1) membership checking
private HashSet<string> _stopWords;
public TextPreprocessor()
{
_stopWords = new HashSet<string>
{
"the", "is", "at", "which", "on", "a", "an", "and"
};
}
public List<string> FilterStopWords(List<string> tokens)
{
var filtered = new List<string>();
foreach (var token in tokens)
{
// O(1) check. If we used List<string>, this would be O(n) per token.
if (!_stopWords.Contains(token))
{
filtered.Add(token);
}
}
return filtered;
}
}
Complexity Analysis:
- Add/Remove/Contains: O(1) average.
- Union/Intersection: O(n + m), where n and m are the sizes of the two sets. This is significantly faster than the nested loops over lists, which cost O(n × m).
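The built-in set operations mutate the receiving set in place, so copy first when you need to keep the original. A brief sketch:

```csharp
using System;
using System.Collections.Generic;

class SetOpsDemo
{
    static void Main()
    {
        var a = new HashSet<int> { 1, 2, 3, 4 };
        var b = new HashSet<int> { 3, 4, 5, 6 };

        // IntersectWith modifies the set it is called on,
        // so work on a copy to preserve 'a'.
        var intersection = new HashSet<int>(a);
        intersection.IntersectWith(b);          // leaves { 3, 4 }

        var union = new HashSet<int>(a);
        union.UnionWith(b);                     // leaves { 1, 2, 3, 4, 5, 6 }

        Console.WriteLine(intersection.Count);  // 2
        Console.WriteLine(union.Count);         // 6
    }
}
```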
The Queue: FIFO and Buffering
Queue<T> represents a First-In-First-Out (FIFO) collection. Elements are added at the "rear" and removed from the "front."
Real-World Analogy
A line of people waiting for a bus. The first person to arrive is the first to board. New arrivals join the back of the line.
Internal Mechanics
A Queue is typically implemented using a circular array (a ring buffer).
- Enqueue (Add): Adds an element at the tail index. If the array is full, it resizes (doubles capacity) and copies the elements.
- Dequeue (Remove): Removes the element at the head index and increments the head.
- Circular Nature: When the head or tail reaches the end of the array, it wraps around to index 0. This avoids shifting all elements (which is what happens in a List if you remove from the front).
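The FIFO contract in action, as a minimal sketch:

```csharp
using System;
using System.Collections.Generic;

class QueueDemo
{
    static void Main()
    {
        var queue = new Queue<string>();
        queue.Enqueue("first");
        queue.Enqueue("second");
        queue.Enqueue("third");

        // Peek inspects the front without removing it.
        Console.WriteLine(queue.Peek());     // first
        // Dequeue removes in arrival (FIFO) order.
        Console.WriteLine(queue.Dequeue());  // first
        Console.WriteLine(queue.Dequeue());  // second
        Console.WriteLine(queue.Count);      // 1
    }
}
```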
AI Context: Graph Traversal (BFS)
In AI, graphs are fundamental (e.g., knowledge graphs, pathfinding). Breadth-First Search (BFS) is a standard algorithm for traversing graphs layer by layer. A Queue is the required data structure for BFS.
using System.Collections.Generic;
public class GraphNode
{
public string Id { get; set; }
public List<GraphNode> Neighbors { get; set; } = new List<GraphNode>();
}
public class GraphTraversal
{
public List<string> BFS(GraphNode startNode)
{
var visited = new HashSet<string>(); // Track visited nodes to avoid cycles
var queue = new Queue<GraphNode>(); // The queue for BFS
var result = new List<string>();
queue.Enqueue(startNode);
visited.Add(startNode.Id);
while (queue.Count > 0)
{
// Dequeue is O(1)
var current = queue.Dequeue();
result.Add(current.Id);
foreach (var neighbor in current.Neighbors)
{
if (!visited.Contains(neighbor.Id))
{
visited.Add(neighbor.Id);
queue.Enqueue(neighbor); // Add to back of line
}
}
}
return result;
}
}
Complexity Analysis:
- Enqueue/Dequeue: O(1) (amortized, due to resizing).
- BFS Traversal: O(V + E), where V is the number of vertices and E the number of edges. The Queue ensures we process nodes in the order they are discovered.
The LinkedList: Dynamic Insertion and Deletion
LinkedList<T> is a doubly linked list. Each element (a LinkedListNode) contains the value, a reference to the Next node, and a reference to the Previous node.
Real-World Analogy
A scavenger hunt where each clue points to the location of the next clue. You don't need to know where the entire list is stored in memory; you just follow the pointers. If you want to insert a new clue between clue #2 and #3, you simply update the pointers of #2 and #3 to point to the new clue. You don't need to shift physical objects around.
Internal Mechanics
Unlike arrays, LinkedList nodes are scattered in memory (heap). The list maintains references only to the Head and Tail.
- Insertion: To insert X between A and B: X.Next = B, X.Prev = A, A.Next = X, B.Prev = X. This is O(1) if you already have references to A and B.
- Access: To access the 50th element, you must traverse from the Head (O(n)). Therefore, LinkedList is poor for random access but excellent for frequent insertions/deletions in the middle.
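With a LinkedListNode reference in hand, the pointer surgery described above is a single O(1) call:

```csharp
using System;
using System.Collections.Generic;

class LinkedListInsertDemo
{
    static void Main()
    {
        var list = new LinkedList<string>();
        LinkedListNode<string> a = list.AddLast("A"); // keep the node reference
        list.AddLast("B");

        // Inserting after a known node only rewires four references;
        // no elements are shifted, regardless of list length.
        list.AddAfter(a, "X");

        Console.WriteLine(string.Join(" -> ", list)); // A -> X -> B
    }
}
```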
AI Context: Implementing Iterators
In AI pipelines, data flows through transformations. C#'s yield return keyword builds on the iterator pattern. While List and Array have built-in iterators, understanding LinkedList helps clarify how iterators manage state manually.
yield return creates a state machine. Conceptually, it's like traversing a linked list where the "current" pointer is saved between calls.
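To make the state-machine idea concrete, here is a hand-written iterator that keeps its "current" position in a field between MoveNext calls, roughly what the compiler generates for a yield return method. The names are illustrative, not the compiler's actual generated ones:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// A manual iterator over the range [0, end). The saved field _current
// plays the role of the paused execution point in a 'yield return' method.
class ManualRangeIterator : IEnumerator<int>
{
    private readonly int _end;
    private int _current;

    public ManualRangeIterator(int end) { _end = end; _current = -1; }

    public int Current => _current;
    object IEnumerator.Current => _current;

    // Each call advances the saved state; execution "resumes" here
    // whenever the caller asks for the next item.
    public bool MoveNext() => ++_current < _end;

    public void Reset() => _current = -1;
    public void Dispose() { }
}

class Program
{
    static void Main()
    {
        var it = new ManualRangeIterator(3);
        while (it.MoveNext())
            Console.WriteLine(it.Current); // prints 0, 1, 2
    }
}
```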
using System.Collections.Generic;
public class DataStream<T>
{
private LinkedList<T> _data = new LinkedList<T>();
public void Add(T item) => _data.AddLast(item);
// Custom iterator using 'yield return'
// This allows processing data one by one without loading all into memory
public IEnumerable<T> GetProcessedStream()
{
foreach (var item in _data)
{
// Simulate heavy processing
// yield return pauses execution here, returns the value,
// and resumes from here when the next item is requested.
yield return ProcessItem(item);
}
}
private T ProcessItem(T item)
{
// Complex logic here
return item;
}
}
Complexity Analysis:
- Insert/Delete (given node reference): O(1).
- Insert/Delete (by index): O(n).
- Access (by index): O(n).
- Memory Overhead: Higher than arrays (stores pointers + node objects).
Summary of Mechanics and AI Application
The choice of data structure dictates the performance profile of an AI system.
- Dictionary:
  - Mechanism: Hash table with buckets and collision resolution via chaining.
  - AI Use Case: Tokenizer vocabularies (Token -> ID mapping). Fast retrieval is mandatory for real-time text generation.
  - Big O: O(1) average lookup.
- HashSet:
  - Mechanism: Hash table storing only keys.
  - AI Use Case: Deduplicating datasets, filtering stop words, or tracking graph node visitation (as seen in BFS).
  - Big O: O(1) membership check.
- Queue:
  - Mechanism: Circular buffer (array-based) managing head and tail pointers.
  - AI Use Case: Graph traversal (BFS), managing batch-processing queues in parallel computing.
  - Big O: O(1) Enqueue/Dequeue.
- LinkedList:
  - Mechanism: Doubly linked nodes scattered in memory.
  - AI Use Case: Efficiently managing dynamic sequences where insertion order matters and random access is rare (e.g., the history of chat messages in a session).
  - Big O: O(1) insertion/deletion if the node reference is known.
By mastering these internal mechanics, we move from simply using collections to architecting systems that handle the massive scale of data required in modern AI.
Visualizing the Bucket Collision
The following diagram illustrates how a Dictionary handles two different keys (KeyA and KeyB) that hash to the same bucket index, while KeyC hashes to a different index.
Basic Code Example
Here is a simple 'Hello World' level code example demonstrating a custom Hash Table implementation using Dictionary<K,V>, focusing on internal mechanics like hashing, buckets, and collision handling.
using System;
using System.Collections.Generic;
namespace DataStructuresBasics
{
// A simplified custom HashTable to demonstrate internal mechanics.
// In production, use System.Collections.Generic.Dictionary<TKey, TValue>.
public class CustomHashTable<TKey, TValue>
{
// Internal bucket structure to handle collisions via chaining (linked lists).
private class Bucket
{
public TKey Key;
public TValue Value;
public Bucket Next; // Pointer to the next item in this bucket (collision chain)
public Bucket(TKey key, TValue value)
{
Key = key;
Value = value;
Next = null;
}
}
private Bucket[] _buckets;
private int _capacity;
private int _count;
public CustomHashTable(int initialCapacity = 16)
{
_capacity = initialCapacity;
_buckets = new Bucket[_capacity];
_count = 0;
}
// 1. Hashing: Converts a key into an array index.
// Complexity: O(1) for calculation.
private int GetBucketIndex(TKey key)
{
// Get hash code from the key.
int hashCode = key.GetHashCode();
// Mask the sign bit to guarantee a non-negative index.
// (Math.Abs would throw OverflowException for int.MinValue.)
return (hashCode & 0x7FFFFFFF) % _capacity;
}
// 2. Insertion: Adds a key-value pair.
// Complexity: Average O(1), Worst O(N) if resizing is needed or many collisions occur.
public void Insert(TKey key, TValue value)
{
if (key == null) throw new ArgumentNullException(nameof(key));
int index = GetBucketIndex(key);
Bucket current = _buckets[index];
// Check for existing key (Update scenario)
while (current != null)
{
if (current.Key.Equals(key))
{
current.Value = value; // Update existing value
return;
}
current = current.Next;
}
// Insert new node at the head of the linked list (bucket)
// This handles collisions: multiple keys hash to the same index.
Bucket newBucket = new Bucket(key, value);
newBucket.Next = _buckets[index]; // Point to the old head
_buckets[index] = newBucket; // Update head to new node
_count++;
// 3. Resizing: If load factor is high, resize array to maintain O(1) performance.
// Load Factor = Count / Capacity. Threshold usually ~0.75.
if (_count > _capacity * 0.75)
{
Resize();
}
}
// 4. Lookup: Retrieves a value by key.
// Complexity: Average O(1), Worst O(N) (if all keys collide into one bucket).
public TValue Get(TKey key)
{
if (key == null) throw new ArgumentNullException(nameof(key));
int index = GetBucketIndex(key);
Bucket current = _buckets[index];
// Traverse the linked list in the bucket
while (current != null)
{
if (current.Key.Equals(key))
{
return current.Value;
}
current = current.Next;
}
throw new KeyNotFoundException($"Key '{key}' not found.");
}
// 5. Resizing Mechanism: Doubles capacity and rehashes all items.
// This is the most expensive operation (O(N)), but amortized over many inserts, it remains O(1).
private void Resize()
{
int newCapacity = _capacity * 2;
Bucket[] newBuckets = new Bucket[newCapacity];
// Rehash all existing items into the new array
for (int i = 0; i < _capacity; i++)
{
Bucket current = _buckets[i];
while (current != null)
{
// Recalculate the index based on the new capacity
// (mask the sign bit; Math.Abs throws for int.MinValue)
int newIndex = (current.Key.GetHashCode() & 0x7FFFFFFF) % newCapacity;
// Insert into new bucket (chaining)
Bucket newNode = new Bucket(current.Key, current.Value);
newNode.Next = newBuckets[newIndex];
newBuckets[newIndex] = newNode;
current = current.Next;
}
}
_buckets = newBuckets;
_capacity = newCapacity;
Console.WriteLine($"DEBUG: Resized to capacity {_capacity}");
}
// 6. Iteration: Using 'yield return' to expose items without exposing internal structure.
public IEnumerable<KeyValuePair<TKey, TValue>> Iterate()
{
for (int i = 0; i < _capacity; i++)
{
Bucket current = _buckets[i];
while (current != null)
{
yield return new KeyValuePair<TKey, TValue>(current.Key, current.Value);
current = current.Next;
}
}
}
}
class Program
{
static void Main(string[] args)
{
// Real-world context: A simple Tokenizer Vocabulary used in NLP.
// We map unique words (string) to their integer IDs (int).
// Efficient lookups are critical here (O(1) vs O(N) for Lists).
var vocab = new CustomHashTable<string, int>();
Console.WriteLine("--- Inserting Tokens ---");
vocab.Insert("the", 1);
vocab.Insert("quick", 2);
vocab.Insert("brown", 3);
vocab.Insert("fox", 4);
// Demonstrate Collision: Force a collision if possible (depends on hash codes).
// In a real scenario, this happens naturally with large datasets.
vocab.Insert("jumps", 5);
vocab.Insert("over", 6);
vocab.Insert("lazy", 7);
vocab.Insert("dog", 8);
Console.WriteLine("\n--- Retrieving Tokens ---");
Console.WriteLine($"ID of 'fox': {vocab.Get("fox")}");
Console.WriteLine($"ID of 'dog': {vocab.Get("dog")}");
Console.WriteLine("\n--- Iterating via Yield Return ---");
foreach (var pair in vocab.Iterate())
{
Console.WriteLine($"Token: {pair.Key,-10} ID: {pair.Value}");
}
// Performance Comparison Context
Console.WriteLine("\n--- Complexity Comparison ---");
Console.WriteLine("List.Contains: O(N) - Must scan every element.");
Console.WriteLine("HashSet/Dictionary.Contains: O(1) - Direct index access via hashing.");
}
}
}
Explanation of Mechanics
- Internal Structure (Buckets & Chaining): The _buckets array acts as the primary storage. Each index in the array is a "bucket." To handle collisions (when two different keys produce the same hash index), we use chaining: each bucket holds a linked list (the Bucket class with its Next reference). If two keys hash to index 5, the second one is simply linked into the chain at index 5 (the code above inserts at the head of the chain).
- Hashing (GetBucketIndex): The GetHashCode() method converts an object into an integer. We apply the modulo operator (% _capacity) to fit this large integer into the array's bounds. This is the deterministic step that lets us jump directly to a specific bucket without searching the entire collection.
- Resizing (Resize): As items are added, the load factor (count / capacity) increases. When it exceeds a threshold (0.75 in this example), the array doubles in size. Crucially, all items must be rehashed because the modulo divisor (the capacity) has changed. This is why resizing is an expensive O(N) operation, but it happens rarely enough that the average cost of insertion remains O(1).
- Iterators (yield return): The Iterate method uses yield return. This creates a state machine that traverses the complex internal structure (an array of linked lists) and returns items one by one. The caller sees a simple sequence of pairs without needing to understand the internal bucket logic.
Common Pitfalls
- Mutating Keys After Insertion: Never use mutable objects as keys if their hash code depends on mutable fields. If you change a field in the key object after inserting it into the hash table, GetHashCode() will return a different value. The lookup will calculate a new bucket index and fail to find the item, effectively "losing" it in the table. Always use immutable types (like string, int, or immutable structs) as keys.
- Infinite Recursion in GetHashCode(): When overriding GetHashCode() for custom classes, ensure it does not call a method that might eventually call GetHashCode() again. Also, ensure that if a.Equals(b) is true, a.GetHashCode() must equal b.GetHashCode(). Violating this contract breaks the hash table's logic.
- Ignoring the Load Factor: In a custom implementation, forgetting to resize leads to O(N) performance degradation. As the bucket array fills up, collision chains get longer, effectively turning the hash table into a linked list for lookups. Always monitor the count relative to the capacity.
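The first pitfall is easy to reproduce. The MutableKey class below is a deliberately bad key type invented for this demonstration, its hash depends on a mutable field:

```csharp
using System;
using System.Collections.Generic;

// Deliberately bad: the hash code depends on a mutable field.
class MutableKey
{
    public int Id; // mutable on purpose, to demonstrate the pitfall

    public override int GetHashCode() => Id;
    public override bool Equals(object obj) => obj is MutableKey k && k.Id == Id;
}

class Program
{
    static void Main()
    {
        var map = new Dictionary<MutableKey, string>();
        var key = new MutableKey { Id = 1 };
        map[key] = "value";

        Console.WriteLine(map.ContainsKey(key)); // True

        // Mutating the key changes its hash code; lookups now probe
        // a different bucket (and compare against the stored hash),
        // so the entry is effectively lost.
        key.Id = 2;
        Console.WriteLine(map.ContainsKey(key)); // False
        Console.WriteLine(map.Count);            // 1 (still stored, but unreachable)
    }
}
```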
The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.
Code License: All code examples are released under the MIT License.
Content Copyright: Copyright © 2026 Edgar Milvus. All rights reserved.