Chapter 11: GC Internals - Generations, LOH, and Concurrent Collection
Theoretical Foundations
The .NET Garbage Collector (GC) is often treated as a "black box"—a background process that magically reclaims memory. For high-performance AI applications, treating it as such is a recipe for unpredictable latency spikes and throttled throughput. To build systems that process millions of tokens per second, we must understand the physics of the GC: how it partitions memory, why it pauses execution, and how to manipulate object lifetimes to stay out of its way.
The Heap as a Nursery: Generational Collection
The fundamental premise of the .NET GC is the Generational Hypothesis: objects in a managed heap fall into two categories—short-lived and long-lived. Most objects die young; few survive long enough to become old.
Imagine a high-end restaurant kitchen. When a new order comes in (a new object allocation), the ingredients are prepped on the main counter (Generation 0). If the dish is served quickly and the plate is cleared (the object goes out of scope), the counter remains free for the next order. However, if a dish sits on the counter too long, it is moved to a staging shelf (Generation 1) to clear the counter. Only dishes that remain on the staging shelf for an extended period are moved to the pantry (Generation 2) for long-term storage.
The .NET heap is physically divided into these three generations:
- Gen 0: The nursery. This is where all new objects are allocated. It is relatively small (typically 256KB–4MB depending on the workload). Collection here is extremely fast because it involves scanning a tiny memory region.
- Gen 1: The adolescence phase. This acts as a buffer between Gen 0 and Gen 2. It is roughly the same size as Gen 0. Objects that survive a Gen 0 collection are promoted to Gen 1. Gen 1 collections are still relatively fast but cover more ground than Gen 0.
- Gen 2: The long-lived region. Objects that survive a Gen 1 collection are promoted here. Gen 2 can grow very large (gigabytes). A Gen 2 collection is a "Full GC," which is expensive and often triggers a "stop-the-world" pause, freezing the application until memory is reclaimed.
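The promotion mechanics described above can be observed directly with GC.GetGeneration and GC.CollectionCount. A minimal sketch (exact generation numbers can vary with GC mode and build, but the promotion trend holds):

```csharp
using System;

// A freshly allocated object starts its life in Gen 0.
object survivor = new object();
Console.WriteLine($"After allocation: Gen {GC.GetGeneration(survivor)}");

// Force a blocking collection. Because 'survivor' is still rooted
// (referenced by a live local), it is promoted rather than reclaimed.
GC.Collect();
Console.WriteLine($"After 1st collection: Gen {GC.GetGeneration(survivor)}");

GC.Collect();
Console.WriteLine($"After 2nd collection: Gen {GC.GetGeneration(survivor)}");

// CollectionCount tracks how many times each generation has been collected.
Console.WriteLine($"Gen 0: {GC.CollectionCount(0)}, Gen 1: {GC.CollectionCount(1)}, Gen 2: {GC.CollectionCount(2)}");
GC.KeepAlive(survivor); // keep the JIT from treating 'survivor' as dead early
```

Note that GC.Collect() defaults to a full, blocking collection of all generations, which is why the Gen 2 count rises with every call; production code should almost never invoke it manually.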
Why This Matters for AI Token Processing
In the context of AI, specifically when processing tokens (words, sub-words, or embeddings), object lifetimes vary wildly.
- Short-lived: When a tokenizer splits a sentence, it creates numerous string instances, one per token. These are transient; once the token is converted to an ID or an embedding vector, the string is discarded. These objects die in Gen 0.
- Long-lived: Model weights, configuration objects, and cached embeddings often live for the duration of the application. These reside in Gen 2.
The Critical Problem: If you inadvertently promote a short-lived object to Gen 2 (due to a Gen 0 collection not happening frequently enough, or the object surviving a Gen 0 collection), you create "memory pollution." In AI processing, if you are batching 1000 requests, and intermediate tensors (matrices) are promoted to Gen 2, a subsequent Full GC (Gen 2 collection) will pause the entire application. For a real-time chatbot, a 500ms pause is unacceptable.
The Large Object Heap (LOH): The Abyss of Performance
There is a distinct partition in the .NET memory manager called the Large Object Heap (LOH). The LOH holds objects of 85,000 bytes or more.
The Analogy: Think of the LOH as a specialized warehouse for oversized furniture. In a standard apartment (Normal Heap), furniture is moved around easily; compacting the space is simple because everything fits through the doors. However, oversized items (like a pool table) cannot be moved easily. If you remove the pool table, you leave a massive gap in the warehouse that cannot be filled by smaller boxes. This is fragmentation.
Technical Mechanics of the LOH
- Threshold: Any object of 85,000 bytes or more is allocated directly on the LOH.
- No Compaction: By default, the LOH is not compacted at all; it is swept (not slid together) as part of a Gen 2 collection, and it is never touched by Gen 0 or Gen 1 collections. Compaction is expensive because it requires moving large blocks of memory. The GC assumes that large objects are rare and long-lived, so the cost of moving them outweighs the benefit.
- Fragmentation: When a large object on the LOH is freed, it leaves a hole. If subsequent large allocations cannot fit into this hole, they are placed at the end of the heap. Over time, this leads to a fragmented LOH where the committed memory is high, but the usable contiguous memory is low.
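The 85,000-byte threshold, and the fact that the LOH logically belongs to Gen 2, are easy to verify empirically: GC.GetGeneration reports a large array as belonging to the oldest generation immediately after allocation.

```csharp
using System;

// 84,000 bytes: below the threshold, lands in Gen 0 of the Small Object Heap.
byte[] small = new byte[84_000];

// 85,000 bytes: at the threshold, goes straight to the Large Object Heap.
// LOH objects are collected only with Gen 2, so GetGeneration reports
// the maximum generation right away.
byte[] large = new byte[85_000];

Console.WriteLine($"small: Gen {GC.GetGeneration(small)}"); // typically 0
Console.WriteLine($"large: Gen {GC.GetGeneration(large)}"); // 2 (GC.MaxGeneration)
GC.KeepAlive(small);
GC.KeepAlive(large);
```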
The AI Bottleneck: Tensors and Buffers
In modern C# AI development (using libraries like ML.NET or ONNX Runtime), we frequently deal with tensors. A tensor is a multi-dimensional array used to store numerical data (weights, activations, inputs).
Consider a batch size of 64 with an embedding dimension of 4096 (a hidden size common in modern Transformer models; LLaMA 7B, for example, uses 4096).
A float is 4 bytes, so:
64 * 4096 * 4 = 1,048,576 bytes (approximately 1 MB).
If we allocate this as a single contiguous array, it immediately lands on the LOH. If we process a stream of requests, we might allocate and deallocate thousands of these tensors. The LOH will rapidly become fragmented.
The Consequence:
In a high-throughput AI inference server, you might see OutOfMemoryException errors even when the total available RAM is sufficient. This is because the GC cannot find a contiguous block of memory large enough for the next tensor, despite having plenty of fragmented free space. This is the "LOH Fragmentation Trap."
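One escape hatch from the fragmentation trap: since .NET Framework 4.5.1 (and in all modern .NET versions), you can request a one-time LOH compaction through GCSettings. A sketch; in production you would schedule this during a quiet period, because the compacting Full GC is expensive:

```csharp
using System;
using System.Runtime;

// Simulate fragmentation: allocate large arrays, then drop every other one.
var tensors = new float[16][];
for (int i = 0; i < tensors.Length; i++)
    tensors[i] = new float[262_144]; // 1 MB each -> allocated on the LOH
for (int i = 0; i < tensors.Length; i += 2)
    tensors[i] = null;               // punch holes in the LOH

// Request that the NEXT blocking Gen 2 collection compact the LOH.
GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
GC.Collect();                        // triggers the compacting Full GC

// The setting automatically reverts to Default once the compaction has run.
Console.WriteLine(GCSettings.LargeObjectHeapCompactionMode);
GC.KeepAlive(tensors);
```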
Concurrent (Background) GC: The Janitor's Schedule
To minimize pauses, .NET uses a Concurrent GC (also known as Background GC). This allows the application to continue allocating objects while a Gen 2 collection is in progress.
The Analogy: Imagine a busy library (the application) being cleaned.
- Blocking GC (Non-Concurrent): The janitor locks the doors, kicks everyone out, moves all the books, vacuums, and unlocks the doors. The library is closed (paused) for the duration. (Blocking vs. concurrent is a separate axis from Workstation vs. Server GC; both flavors support background collection.)
- Concurrent GC: The janitor works at night (or in the background). Patrons can still check out books (allocate memory), but the janitor tracks which books are moved. If a patron needs a book that is currently being moved, the janitor hands it over immediately.
How it Works
- Marking Phase: The GC identifies live objects. In a concurrent cycle, the GC runs on a separate thread. The application threads (mutators) continue to run.
- Allocation Contexts: When the application allocates memory during a concurrent GC, it uses a special buffer. If the buffer fills up, the allocation pauses briefly until the GC catches up.
- The "Breadth" vs. "Depth" Trade-off: Concurrent GC is not a silver bullet. It uses more CPU cycles overall to manage the coordination between the GC thread and the application threads. It is designed to reduce latency (pause times), not necessarily to increase throughput.
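These modes can be inspected and influenced at runtime through GCSettings. A sketch that checks which GC flavor the process is running under and requests a low-latency window for a latency-sensitive burst (such as streaming tokens to a client):

```csharp
using System;
using System.Runtime;

// Which flavor of GC is this process running under?
Console.WriteLine($"Server GC:    {GCSettings.IsServerGC}");
Console.WriteLine($"Latency mode: {GCSettings.LatencyMode}");

// Ask the GC to avoid blocking Gen 2 collections while the
// latency-sensitive work is in progress.
GCLatencyMode previous = GCSettings.LatencyMode;
try
{
    GCSettings.LatencyMode = GCLatencyMode.SustainedLowLatency;
    // ... serve the token stream ...
}
finally
{
    GCSettings.LatencyMode = previous; // restore the default behavior
}
```

SustainedLowLatency is a hint, not a guarantee: it suppresses blocking full collections where possible, at the cost of a larger heap while the mode is active.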
AI Implications
For AI applications, specifically those using async/await patterns (heavily used in Book 9: Async Streams for Real-Time AI), the Concurrent GC is vital.
When an AI model generates a token stream (Server-Sent Events), the application is constantly allocating new string objects for the tokens and Task objects for the async state machines.
If a blocking Gen 2 collection occurred, the stream would stutter. The user would see a "loading..." spinner for seconds. Concurrent GC ensures that while the model is crunching numbers on the GPU (or CPU), the managed heap is being cleaned in the background without halting the token delivery stream.
Theoretical Foundations: Reachability
To understand how the GC decides what to keep, we must define Reachability. An object is considered reachable if it is accessible from a "Root."
Roots include:
- Local variables on the stack.
- Static variables.
- CPU registers holding references.
- Objects in the GC handle table (pinned objects, GCHandles, and interop handles).
The Analogy: Imagine a massive spiderweb (the object graph). The roots are the anchor points nailed to the ceiling. The GC starts at the anchors and follows every thread (reference) to see what is attached. Anything not connected to an anchor, directly or indirectly, is considered "garbage" and is swept away.
The Mark-and-Sweep Algorithm
- Mark Phase: The GC pauses execution (or runs concurrently) and walks the graph starting from the roots. Every object it visits is marked as "live."
- Sweep Phase: The GC scans the heap. Unmarked objects are deleted. The memory is reclaimed.
- Compaction (Optional): To solve fragmentation, the GC slides live objects together to create a contiguous block of free space.
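Reachability can be observed with WeakReference, which tracks an object without acting as a root. In this sketch, a rooted object survives full collections while an unrooted one is usually reclaimed (the "usually" matters: debug builds and JIT lifetime extension can keep temporaries alive longer, which is why only the rooted case is asserted):

```csharp
using System;
using System.Runtime.CompilerServices;

WeakReference rooted = new WeakReference(CreateRooted(out object keepAlive));
WeakReference unrooted = CreateUnrooted();

GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();

Console.WriteLine($"Rooted still alive:   {rooted.IsAlive}");   // true: reachable via 'keepAlive'
Console.WriteLine($"Unrooted still alive: {unrooted.IsAlive}"); // usually false: no path from any root
GC.KeepAlive(keepAlive);

// NoInlining keeps the JIT from extending the temporary's lifetime into the caller.
[MethodImpl(MethodImplOptions.NoInlining)]
static object CreateRooted(out object root) => root = new object();

[MethodImpl(MethodImplOptions.NoInlining)]
static WeakReference CreateUnrooted() => new WeakReference(new object());
```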
Practical Tuning for AI: Object Lifetime Management
In the context of high-performance C# for AI, we must actively manage lifetimes to keep the GC out of the critical path.
1. The Perils of Boxing
Boxing occurs when a value type (such as an int or a user-defined struct) is wrapped into an object and placed on the heap.
AI Context: When calculating loss functions or metrics, you might use a List<object> or a generic collection that isn't optimized. If you box a float (4 bytes) into an object, the overhead includes the object header (16 bytes on 64-bit) + the value. This creates garbage rapidly.
Solution: Use List<float> or Span<float>. By avoiding boxing, you keep data on the stack or in contiguous arrays, preventing Gen 0 pollution.
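The cost of boxing is directly measurable with GC.GetAllocatedBytesForCurrentThread (available since .NET Core 3.0). A sketch comparing a loop that accumulates through an object variable (forcing a box per iteration) against a plain float loop:

```csharp
using System;

const int N = 100_000;
float[] values = new float[N];

// Warm up both paths so JIT/tiering work doesn't skew the numbers.
SumBoxed(values); SumUnboxed(values);

long before = GC.GetAllocatedBytesForCurrentThread();
SumUnboxed(values);
long unboxedBytes = GC.GetAllocatedBytesForCurrentThread() - before;

before = GC.GetAllocatedBytesForCurrentThread();
SumBoxed(values);
long boxedBytes = GC.GetAllocatedBytesForCurrentThread() - before;

// Each boxed iteration allocates a ~24-byte object on a 64-bit heap.
Console.WriteLine($"Unboxed: {unboxedBytes:N0} bytes allocated");
Console.WriteLine($"Boxed:   {boxedBytes:N0} bytes allocated");

static float SumUnboxed(float[] data)
{
    float sum = 0f;
    foreach (float f in data) sum += f; // stays in registers/stack: no allocation
    return sum;
}

static float SumBoxed(float[] data)
{
    object sum = 0f;                    // boxed float
    foreach (float f in data)
        sum = (float)sum + f;           // unbox, add, re-box: one heap object per iteration
    return (float)sum;
}
```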
2. Array Pooling and ArrayPool<T>
Allocating large arrays (like tensors) is the primary cause of LOH pressure. ArrayPool<T>, from the System.Buffers namespace (built into .NET Core and later, available to .NET Framework via NuGet), is the standard tool for reusing them.
Concept: Instead of new float[100000], you rent an array from a pool. When done, you return it.
Why: This prevents the GC from seeing the allocation and deallocation. The memory stays "warm" in the pool, reducing the frequency of Gen 0/Gen 1 collections.
```csharp
using System.Buffers;

public void ProcessTokenEmbedding(int tokenCount, int embeddingSize)
{
    // Rent from the pool instead of allocating new.
    // This avoids repeated LOH allocations when the size exceeds 85,000 bytes.
    float[] buffer = ArrayPool<float>.Shared.Rent(tokenCount * embeddingSize);
    try
    {
        // Process the AI model inference using the buffer...
        // Since the buffer is rented, it might be larger than requested.
        // We must track the actual usage.
        Span<float> slice = buffer.AsSpan(0, tokenCount * embeddingSize);
        // ... perform tensor operations ...
    }
    finally
    {
        // CRITICAL: Return the buffer to the pool.
        ArrayPool<float>.Shared.Return(buffer);
    }
}
```
3. Structs vs. Classes (Value vs. Reference Types)
In AI, we often process small, immutable data points (e.g., a Token struct containing an ID and a log probability).
- Class: Allocates on the heap. Every token creates garbage.
- Struct: Allocates on the stack (if local) or inline within an array. No GC pressure.
Architectural Implication:
When designing a high-throughput tokenizer, representing tokens as a readonly struct Token rather than a class Token can reduce GC pressure by orders of magnitude. Be wary, however, of large structs passed by value (copying overhead). For hot paths, a ref struct (available since C# 7.2) guarantees stack allocation and cannot be boxed, ensuring zero GC interaction, at the cost of restrictions: ref structs cannot be stored in class fields, used as array elements, or captured in async methods.
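A sketch of the struct-based representation; the Token layout here (an int ID plus a float log probability) is a hypothetical example, not a library type:

```csharp
using System;
using System.Runtime.CompilerServices;

// One contiguous allocation holds ALL tokens inline: the GC sees a single
// array object, not 1,000 individual Token objects.
Token[] window = new Token[1_000];
for (int i = 0; i < window.Length; i++)
    window[i] = new Token(i, -0.5f);  // no heap allocation per token

Console.WriteLine($"Token size: {Unsafe.SizeOf<Token>()} bytes");
Console.WriteLine($"First token: Id={window[0].Id}, LogProb={window[0].LogProb}");

// 8 bytes, no object header. Equivalent class instances would cost roughly
// 24 bytes each on 64-bit, plus an 8-byte reference slot in the array.
readonly struct Token
{
    public readonly int Id;
    public readonly float LogProb;
    public Token(int id, float logProb) { Id = id; LogProb = logProb; }
}
```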
Visualizing the Heap Layout
The following diagram illustrates the memory layout during a high-load AI inference scenario, showing the relationship between the Small Object Heap (SOH) and the Large Object Heap (LOH).
The Role of Span<T> in GC Optimization
In previous chapters, we discussed Span<T> for memory safety and SIMD. In the context of GC, Span<T> is a powerful tool because it provides a type-safe view over memory without allocating.
The Analogy: Imagine a large buffet (the heap). Span<T> is like a pair of tongs. It allows you to access and manipulate the food (data) without taking the whole plate (allocating a new object). You can pass the tongs around, but the food stays on the buffet.
AI Application: When processing a stream of tokens, we often need to slice arrays (e.g., extracting a sub-sequence of embeddings).
- Traditional: var subArray = bigArray.Skip(10).Take(5).ToArray(); -> This allocates a new array on the heap (Gen 0 garbage).
- Optimized: Span<float> subView = bigArray.AsSpan().Slice(10, 5); -> This allocates nothing. It simply adjusts a pointer and a length. The GC sees no new objects, and therefore no collection is triggered.
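A runnable sketch of the allocation-free slice; because the span is only a view, writes through it land in the backing array:

```csharp
using System;

float[] embeddings = new float[20];
for (int i = 0; i < embeddings.Length; i++)
    embeddings[i] = i * 0.1f;

// A view over elements 10..14: no copy, no heap allocation,
// just a (pointer, length) pair on the stack.
Span<float> subView = embeddings.AsSpan().Slice(10, 5);

Console.WriteLine($"Length: {subView.Length}"); // 5
Console.WriteLine($"First:  {subView[0]}");     // same value as embeddings[10]

// Mutations through the span write straight into the backing array.
subView[0] = 42f;
Console.WriteLine($"embeddings[10] is now {embeddings[10]}");
```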
Theoretical Foundations in Practice
Understanding these foundations allows us to predict the behavior of our AI application.
- Generations dictate that we must minimize the lifetime of temporary objects (tokens, intermediate calculations) to prevent Gen 2 pollution.
- The LOH dictates that we must avoid large contiguous allocations (tensors) or use pooling to prevent the fragmentation that leads to OOM exceptions.
- Concurrent GC dictates that while we can rely on it for UI responsiveness, high-throughput server AI (like batch processing) often benefits more from manual memory management (pooling, Span<T>) to reduce the overall GC workload.
By mastering these concepts, we move from "hoping the GC handles it" to "orchestrating the GC," ensuring that our C# AI applications meet the stringent latency and throughput requirements of modern workloads.
Basic Code Example
```csharp
using System;
using System.Buffers;
using System.Diagnostics;
using System.Text;

namespace HighPerformanceAI.GCInternals
{
    public class TokenProcessingSimulator
    {
        // Configuration for the simulation
        private const int TotalTokensToProcess = 1_000_000;
        private const int BatchSize = 10_000;

        public static void Main(string[] args)
        {
            Console.WriteLine($"Starting Token Processing Simulation (Total: {TotalTokensToProcess:N0})");
            Console.WriteLine("-------------------------------------------------------------");

            // 1. Baseline: Naive approach causing heavy GC pressure (Gen 0/1 collections)
            RunNaiveTokenProcessing();
            Console.WriteLine();

            // 2. Optimized approach: Using ArrayPool to reduce allocations
            RunOptimizedTokenProcessing();

            Console.WriteLine("\nSimulation Complete. Press any key to exit.");
            Console.ReadKey();
        }

        /// <summary>
        /// Simulates processing tokens by allocating a new container array for every batch.
        /// This causes frequent Gen 0 collections.
        /// </summary>
        private static void RunNaiveTokenProcessing()
        {
            Console.WriteLine(">>> SCENARIO 1: Naive Allocation (High GC Pressure)");
            var sw = Stopwatch.StartNew();
            long initialMemory = GC.GetTotalMemory(true);
            int gen0 = GC.CollectionCount(0), gen1 = GC.CollectionCount(1), gen2 = GC.CollectionCount(2);

            for (int i = 0; i < TotalTokensToProcess; i += BatchSize)
            {
                // Create a batch of tokens (simulating strings from an AI model).
                // In a real scenario, these might be generated by a tokenizer.
                string[] tokenBatch = new string[BatchSize];
                for (int j = 0; j < BatchSize; j++)
                {
                    // Simulate a token string. (A string lands directly on the LOH
                    // only once it reaches roughly 42,500 chars, i.e., 85,000 bytes.)
                    tokenBatch[j] = $"Token_{i + j}_Data_Payload_{Guid.NewGuid()}";
                }

                // Simulate some processing (e.g., concatenation or model state update)
                var sb = new StringBuilder();
                foreach (var token in tokenBatch)
                {
                    sb.Append(token);
                }
                string _ = sb.ToString(); // Force creation of the final string

                if (i % (BatchSize * 10) == 0)
                {
                    long currentMemory = GC.GetTotalMemory(false);
                    Console.WriteLine($"  [Naive] Processed {i + BatchSize:N0} tokens | Memory: {currentMemory / 1024.0 / 1024.0:F2} MB");
                }
            }

            sw.Stop();
            long finalMemory = GC.GetTotalMemory(true);
            Console.WriteLine($"  [Naive] Completed in {sw.ElapsedMilliseconds}ms");
            Console.WriteLine($"  [Naive] Memory Delta: {(finalMemory - initialMemory) / 1024.0 / 1024.0:F2} MB");
            Console.WriteLine($"  [Naive] Gen 0 Collections: {GC.CollectionCount(0) - gen0}");
            Console.WriteLine($"  [Naive] Gen 1 Collections: {GC.CollectionCount(1) - gen1}");
            Console.WriteLine($"  [Naive] Gen 2 Collections: {GC.CollectionCount(2) - gen2}");
        }

        /// <summary>
        /// Simulates processing tokens using ArrayPool to reuse the container buffers.
        /// This minimizes allocations and reduces Gen 0 pressure.
        /// </summary>
        private static void RunOptimizedTokenProcessing()
        {
            Console.WriteLine(">>> SCENARIO 2: Optimized Allocation (Low GC Pressure via ArrayPool)");
            var sw = Stopwatch.StartNew();
            long initialMemory = GC.GetTotalMemory(true);
            int gen0 = GC.CollectionCount(0), gen1 = GC.CollectionCount(1), gen2 = GC.CollectionCount(2);

            // We rent the string[] container from ArrayPool<string>.Shared.
            // This avoids the 'new string[]' allocation on the heap for every batch.
            var tokenPool = ArrayPool<string>.Shared;

            for (int i = 0; i < TotalTokensToProcess; i += BatchSize)
            {
                // Rent an array from the pool instead of allocating a new one.
                // If the pool is empty, it will allocate one (same cost as naive),
                // but subsequent iterations will reuse that memory.
                // Note: the rented array may be LARGER than BatchSize.
                string[] tokenBatch = tokenPool.Rent(BatchSize);
                try
                {
                    // Initialize the batch
                    for (int j = 0; j < BatchSize; j++)
                    {
                        // We still allocate the token strings themselves on the heap;
                        // we only avoid allocating the container array.
                        // To optimize further, we would also pool the character buffers,
                        // but that requires custom types (e.g., Span<char> handling).
                        tokenBatch[j] = $"Token_{i + j}_Data_Payload_{Guid.NewGuid()}";
                    }

                    // Simulate processing
                    var sb = new StringBuilder();
                    for (int j = 0; j < BatchSize; j++)
                    {
                        sb.Append(tokenBatch[j]);
                    }
                    string _ = sb.ToString();

                    if (i % (BatchSize * 10) == 0)
                    {
                        long currentMemory = GC.GetTotalMemory(false);
                        Console.WriteLine($"  [Optimized] Processed {i + BatchSize:N0} tokens | Memory: {currentMemory / 1024.0 / 1024.0:F2} MB");
                    }
                }
                finally
                {
                    // CRITICAL: Return the array to the pool so it can be reused.
                    // clearArray: true drops the string references so the pooled
                    // array does not keep dead tokens alive.
                    tokenPool.Return(tokenBatch, clearArray: true);
                }
            }

            sw.Stop();
            long finalMemory = GC.GetTotalMemory(true);
            Console.WriteLine($"  [Optimized] Completed in {sw.ElapsedMilliseconds}ms");
            Console.WriteLine($"  [Optimized] Memory Delta: {(finalMemory - initialMemory) / 1024.0 / 1024.0:F2} MB");
            Console.WriteLine($"  [Optimized] Gen 0 Collections: {GC.CollectionCount(0) - gen0}");
            Console.WriteLine($"  [Optimized] Gen 1 Collections: {GC.CollectionCount(1) - gen1}");
            Console.WriteLine($"  [Optimized] Gen 2 Collections: {GC.CollectionCount(2) - gen2}");
        }
    }
}
```
Code Explanation
Here is a line-by-line breakdown of the concepts demonstrated in the code.
1. Setup and Configuration
- using System.Buffers;: This namespace is essential for accessing ArrayPool<T>, the core mechanism for memory reuse in .NET.
- TotalTokensToProcess and BatchSize: We define constants to simulate a realistic workload. Processing 1 million tokens in batches of 10,000 allows us to observe GC behavior over time without waiting for hours.
- Main method: The entry point orchestrates two distinct scenarios to contrast their performance characteristics.
2. Scenario 1: Naive Allocation
This method represents the standard, unoptimized approach often seen in high-level code.
- string[] tokenBatch = new string[BatchSize];:
  - What happens: This allocates a contiguous block of memory on the managed heap to hold references to string objects.
  - GC Impact: In a tight loop, this object is allocated in Gen 0. Once the loop iteration finishes and tokenBatch goes out of scope (conceptually), this memory becomes garbage. If the Gen 0 budget is exceeded, a GC pause occurs.
- tokenBatch[j] = $"Token_...";:
  - What happens: String interpolation creates new string objects on the heap.
  - LOH Risk: If a string grows beyond roughly 42,500 characters (85,000 bytes at 2 bytes per char), it lands directly on the Large Object Heap (LOH). The LOH is collected only during a Gen 2 collection (a "Full GC"), which is significantly more expensive and causes longer pauses.
- StringBuilder usage: We simulate the aggregation of tokens (common in AI context windows). This creates yet another large string object at the end (sb.ToString()), further stressing the heap.
3. Scenario 2: Optimized Allocation (ArrayPool)
This method demonstrates high-performance memory management.
- ArrayPool<string>.Shared:
  - What happens: We access the global shared pool. This pool is thread-safe and managed by the runtime to balance memory usage and reuse.
- tokenPool.Rent(BatchSize);:
  - What happens: Instead of new, we request an array. The pool checks whether an array of at least this size is currently idle in its internal store.
  - Performance Win: If an array is available, the cost is essentially a pointer assignment (extremely fast). If not, the pool allocates a new one (same cost as the naive approach). Crucially, the container array (the string[]) is reused across iterations, drastically reducing Gen 0 pressure.
- try { ... } finally { tokenPool.Return(tokenBatch); }:
  - What happens: This is the most critical block. The finally block guarantees that the array is returned to the pool, even if an exception occurs during processing.
  - Why it matters: If you fail to Return an array, the pool assumes it is still in use and will allocate a new one next time, defeating the purpose. If you keep using an array after returning it, you risk corrupting data in whichever consumer rents it next.
Common Pitfalls
- Forgetting to Return to the Pool: The most common mistake is renting an array, using it, and failing to return it. This causes a "memory leak" within the pool context: the pool thinks the memory is busy and keeps allocating new arrays, eventually leading to high memory usage and OutOfMemoryExceptions. Always use a try/finally block to ensure the return happens.
- Holding References to Rented Arrays: Once you call pool.Return(array), you lose exclusive ownership. The runtime may hand this same memory to another thread immediately.
  - Bad: Storing the rented array in a class field after returning it.
  - Bad: Accessing the array indices after the finally block.
  - Consequence: This leads to race conditions and data corruption where one thread overwrites data being read by another.
- Clearing Arrays (Security/Safety): ArrayPool does not automatically zero out memory, for performance reasons. If you rent an array containing sensitive data (e.g., token embeddings or user IDs) and return it to the pool, the next consumer might read that residual data.
  - Fix: If the data is sensitive, manually clear the array (e.g., Array.Clear) before returning it, or use the pool.Return(array, clearArray: true) overload.
- Pooling Large Objects (LOH) Blindly: While ArrayPool helps, it has limits. Arrays larger than a certain threshold (usually 1MB or 2MB depending on the implementation) might not be pooled effectively, and large rentals still live on the LOH. For AI token processing, if your individual buffers are massive (e.g., embedding vectors), you should profile specifically for LOH fragmentation rather than relying solely on the default pool.
Visualizing the Heap
The following diagram illustrates the difference in memory layout between the Naive and Optimized approaches.
Analysis of the Visualization:
- Left (Naive): Every iteration creates a new object on the heap. The GC must identify and sweep these "dead" objects frequently. In a high-throughput AI scenario, this creates a "sawtooth" memory pattern, causing frequent pauses.
- Right (Optimized): The ArrayPool maintains a reusable buffer. The Rent and Return operations simply move a pointer. No new heap allocations occur for the container array, allowing the GC to stay idle while the application processes millions of tokens.
Real-World Context: AI Token Processing
In a high-performance AI application (like a chatbot backend), the bottleneck is often not the GPU computation, but the CPU memory management during pre-processing and post-processing.
- The Problem: An AI model generates tokens one by one. To feed the model, you need to maintain a "context window" (a list of previous tokens). If you use a standard List<string>, every time the list grows it might resize, allocating a new, larger array and copying the old data. If you process batches of requests, allocating new arrays for every request creates massive GC pressure.
- The Consequence: Frequent Gen 0/1 collections introduce latency spikes (jitter). If the allocations are large (e.g., storing large embedding vectors or long text prompts), you risk filling the Large Object Heap (LOH). Once the LOH is fragmented, the application may pause for seconds during a Gen 2 collection, or even crash with OutOfMemoryException despite having physical RAM available.
- The Solution: By using ArrayPool<T> (or Span<T>/Memory<T> for allocation-free views), we decouple the data (the token strings) from the container (the array holding references). We reuse the container memory, keeping the GC quiet and the CPU cache hot, ensuring sustained high throughput.
The chapter continues with advanced code, exercises and solutions with analysis, you can find them on the ebook on Leanpub.com or Amazon
Code License: All code examples are released under the MIT License. Github repo.
Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.
All textual explanations, original diagrams, and illustrations are the intellectual property of the author. To support the maintenance of this site via AdSense, please read this content exclusively online. Copying, redistribution, or reproduction is strictly prohibited.