Chapter 8: Managing Context Windows Locally
Theoretical Foundations
The fundamental challenge of running Large Language Models locally is not computational power, but the scarcity of high-bandwidth memory (VRAM) relative to the model's appetite. In a cloud environment, we abstract away memory constraints by scaling horizontally. Locally, we are bound by the physical limits of our GPU. This section establishes the theoretical bedrock for managing these constraints, focusing on the interplay between context windows, token budgeting, and semantic pruning.
The Finite Journal: A Tale of Context and Memory
Imagine you are a historian (the LLM) writing a collaborative book with a scribe (the user). Your desk is incredibly small—this is your VRAM. You can only keep a few pages of the current chapter open at once.
If the conversation (the book) goes on for too long, the desk becomes cluttered. You cannot fit the new page the scribe hands you because the old pages are taking up all the space. This is a Context Overflow.
To solve this, you have a few strategies:
- The Sliding Window: You slide the old pages off the desk into a box (system RAM/disk) to make room for the new ones. You only remember the last few pages vividly.
- The Executive Summary: Instead of keeping every page, you take a moment to summarize the previous pages into a single, dense paragraph (vector embeddings). You pin this summary to the top of the desk. It takes up very little space but retains the "gist" of the history.
- The Token Budget: You strictly ration how many words the scribe is allowed to write per turn to ensure you never run out of desk space.
In C# with ONNX Runtime, we are the Scribe's Manager. We must implement these strategies programmatically to ensure the conversation flows without the GPU throwing an out-of-memory exception.
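The sliding-window strategy above can be sketched in a few lines of C#. This is a minimal illustration under the chapter's rough "~4 characters per token" heuristic; the `SlidingWindowDemo` and `Append` names are invented for this example, not part of the chapter's later `ContextManager`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal sliding-window "desk": oldest pages slide off first.
public static class SlidingWindowDemo
{
    // Rough heuristic used throughout this chapter: ~4 characters per token.
    static int EstimateTokens(string text) => Math.Max(1, text.Length / 4);

    public static Queue<string> Append(Queue<string> window, string page, int budget)
    {
        window.Enqueue(page);
        // Evict the oldest pages until the window fits the budget again,
        // but always keep at least the newest page.
        while (window.Sum(EstimateTokens) > budget && window.Count > 1)
            window.Dequeue();
        return window;
    }
}
```

In a real pipeline the evicted pages would be boxed away (summarized or embedded) rather than discarded, which is exactly what the semantic-pruning section below addresses.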
The Architecture of Scarcity: Token Budgeting and Sliding Windows
The core unit of memory in an LLM is the Token. It is not a word; it is a chunk of data. A model's "Context Window" is the maximum number of tokens it can attend to in a single forward pass. This includes the System Prompt, the User Input, and the Model's Previous Output.
When we run locally, we cannot simply rely on the model's native context window (e.g., 4096 or 8192 tokens) because the weights of the model itself already occupy the majority of the VRAM. The activations (the intermediate calculations) consume the rest.
Token Budgeting
Token Budgeting is the act of strictly enforcing a limit on the total tokens processed in a single inference step. This is not just about truncating text; it is about intelligent allocation.
In a modern C# architecture, we treat the context window as a finite resource pool. We use Interfaces to abstract the source of tokens, allowing us to inject budgeting logic transparently. This relates back to Book 7, Chapter 4, where we discussed the IModelConnector interface. Just as that interface allowed us to swap between OpenAI and Local Llama implementations, we will use an IContextManager interface here to wrap our model inference, ensuring that no call reaches the ONNX runtime without passing through our budgeting logic.
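One possible shape for that abstraction is sketched below. The member names (`TokenBudget`, `FitToBudget`, `BudgetedConnector`, `Infer`) are assumptions made for illustration; nothing here comes from the ONNX Runtime API, and the inference call is a placeholder.

```csharp
using System.Collections.Generic;

// Conceptual sketch: every inference call is forced through budgeting logic.
// Member names are illustrative, not from any library.
public interface IContextManager
{
    int TokenBudget { get; }

    // Trims/compresses the prompt so it never exceeds TokenBudget.
    IReadOnlyList<string> FitToBudget(IReadOnlyList<string> messages);
}

public sealed class BudgetedConnector
{
    private readonly IContextManager _context;
    public BudgetedConnector(IContextManager context) => _context = context;

    public string Infer(IReadOnlyList<string> messages)
    {
        var pruned = _context.FitToBudget(messages); // budgeting is not optional
        return RunOnnxInference(pruned);             // placeholder for the real ONNX call
    }

    private static string RunOnnxInference(IReadOnlyList<string> messages)
        => $"[model output for {messages.Count} messages]";
}
```

Because callers only see `BudgetedConnector`, there is no code path that reaches the model with an unbudgeted prompt.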
Sliding Window Attention
Standard Transformer attention is \(O(n^2)\) with respect to sequence length. If we have 10,000 tokens of history, the model attempts to calculate the relationship of every token to every other token. This is computationally expensive and memory-intensive.
The Sliding Window Attention mechanism restricts the attention head to only look at a fixed "window" of previous tokens (e.g., the last 512 tokens). However, we still need to preserve the information of the tokens that have slid out of the window. This is where the distinction between "Raw History" and "Compressed Context" becomes critical.
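A quick back-of-envelope calculation makes the savings concrete: full attention over \(n\) tokens materializes \(n^2\) query-key scores, while a window of size \(W\) scores roughly \(n \cdot W\). The sketch below ignores boundary effects at the start of the sequence.

```csharp
using System;

public static class AttentionCost
{
    // Number of query-key score entries full attention materializes.
    public static long FullPairs(long n) => n * n;

    // With a sliding window of size W, each token attends to at most
    // W predecessors, so the count grows linearly in n.
    public static long WindowedPairs(long n, long w) => n * Math.Min(n, w);
}
```

For 10,000 tokens of history and a 512-token window, that is 100 million pairs versus about 5 million: a ~20x reduction in attention memory, before any other optimization.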
Semantic Pruning: The Librarian's Index
When the Sliding Window forces us to discard old tokens, we lose nuance. To mitigate this, we employ Semantic Pruning using Vector Embeddings.
The Analogy: Imagine a Librarian (our C# application) managing a vast archive (the conversation history). The user asks a question about a topic mentioned 50 turns ago. The Librarian cannot read every past page to find the answer (too slow). Instead, the Librarian has an index (Vector Database) where every past conversation is converted into a mathematical vector (a set of coordinates in high-dimensional space).
- The user's new question is converted into a vector.
- The Librarian looks for the closest vectors in the index (Cosine Similarity).
- The Librarian retrieves the actual text of the closest matches and injects them into the context window as "past context."
This allows us to "recall" information from gigabytes of past conversation while only using a few kilobytes of VRAM for the current inference.
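The librarian's three steps can be sketched directly. The cosine-similarity math below mirrors what the chapter's full example uses later; the `SemanticRecall` type and its in-memory tuple index are simplifications standing in for a real vector store and embedding model.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SemanticRecall
{
    public static float Cosine(float[] a, float[] b)
    {
        float dot = 0, ma = 0, mb = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            ma  += a[i] * a[i];
            mb  += b[i] * b[i];
        }
        return (ma == 0 || mb == 0) ? 0f : dot / (MathF.Sqrt(ma) * MathF.Sqrt(mb));
    }

    // Steps 1-3: embed the query (done by the caller), rank the stored
    // vectors by similarity, and return the text of the top-K matches.
    public static IEnumerable<string> Retrieve(
        float[] query, IReadOnlyList<(float[] Vec, string Text)> index, int topK)
        => index.OrderByDescending(e => Cosine(query, e.Vec))
                .Take(topK)
                .Select(e => e.Text);
}
```

A production version would use an approximate-nearest-neighbor index instead of the linear scan, but the contract is the same: vectors in, relevant text out.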
The C# ONNX Runtime Pipeline: Structuring the Flow
To implement this in C#, we rely heavily on Records and Pattern Matching. Records provide immutability, which is crucial when dealing with conversation history to prevent accidental state corruption.
We define the state of our context manager as a record containing:
- The Active Window: A List<Token> of the immediate history required for the sliding attention mechanism.
- The Semantic Index: A collection of vector embeddings representing the "forgotten" history.
- The Token Budget: An integer representing the maximum capacity.
Here is the conceptual structure of the data we manage:
using System.Collections.Generic;
// Using records for immutable state management.
// Embeddings are stored as float[]: System.Numerics.Vector<float> is a
// fixed hardware SIMD width and cannot hold a 384-dimension embedding.
public record ConversationState(
    List<string> ActiveContext,
    List<float[]> HistoricalEmbeddings,
    int TokenBudget
);
// The abstraction for our vector storage (conceptual)
public interface IVectorStore
{
    void Store(float[] embedding, string text);
    IEnumerable<string> Retrieve(float[] query, int topK);
}
The Dynamic Pruning Logic
The "Pruning" logic is where we apply our constraints. We simulate the process of evaluating the context window. When the sum of tokens in ActiveContext exceeds the TokenBudget, we must perform a merge or eviction.
We utilize Pattern Matching to decide the fate of a token block:
- If the block is a System Instruction: It is high priority. We keep it, reducing the budget for user history.
- If the block is User History: We calculate its semantic density. If it is "fluff" (low information density), it is pruned immediately.
- If the block is High-Value History: We generate an embedding for it, store it in our IVectorStore, and remove the raw text from the active window.
This process ensures that the ActiveContext passed to the ONNX Runtime is always lean and within the hardware limits, while the HistoricalEmbeddings serve as a compressed long-term memory.
Visualizing the Context Management Pipeline
The following diagram illustrates how data flows through the pruning and budgeting system before reaching the ONNX Runtime engine.
Architectural Implications for C# Developers
When building this, we must be aware of the Garbage Collector (GC) pressure. VRAM management in C# ONNX Runtime is distinct from managed memory. If we are constantly creating new arrays for tokens or embeddings, we might cause Gen 2 GC collections, which pause the application.
To mitigate this, we use Span<T> and ArrayPool<T> (standard in modern C#). We rent buffers for tokenization and vector operations, use them, and return them immediately. This keeps the managed heap clean and focuses the memory pressure on the VRAM, which is the actual bottleneck.
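The rent/use/return pattern looks like this in practice. The whitespace "tokenizer" inside is a stand-in invented for the example; only the `ArrayPool<T>.Shared` usage reflects the real API.

```csharp
using System;
using System.Buffers;

public static class PooledTokenization
{
    public static int CountTokens(string text)
    {
        // Rent a reusable buffer instead of allocating a fresh int[] per call.
        int[] buffer = ArrayPool<int>.Shared.Rent(Math.Max(1, text.Length));
        try
        {
            // Stand-in tokenizer: one "token" per whitespace-separated word.
            int count = 0;
            foreach (var part in text.Split(' ', StringSplitOptions.RemoveEmptyEntries))
                buffer[count++] = part.Length; // pretend these are token ids
            return count;
        }
        finally
        {
            // Always return the buffer, even if tokenization throws,
            // so the pool does not leak arrays under error conditions.
            ArrayPool<int>.Shared.Return(buffer);
        }
    }
}
```

The try/finally shape matters: a rented buffer that is never returned silently degrades the pool back to plain allocation.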
Furthermore, we must handle Streaming. In a local setting, we often stream tokens to the UI to reduce perceived latency. Our context management must be thread-safe. We will likely use ConcurrentQueue<T> for the active context window to allow the background inference thread to push generated tokens while the UI thread might be adding user inputs, all coordinated by SemaphoreSlim to ensure we don't re-enter the inference engine while it's already running.
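A minimal sketch of that coordination, assuming the shape described above (a `ConcurrentQueue<T>` for context and a `SemaphoreSlim(1, 1)` as the inference gate); the `SerializedInference` class and its `Task.Delay` "model call" are placeholders, not ONNX Runtime code.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Sketch: one inference at a time, while any thread may enqueue tokens.
public sealed class SerializedInference
{
    private readonly SemaphoreSlim _gate = new(1, 1);   // single entry
    public ConcurrentQueue<string> ActiveContext { get; } = new();

    public async Task<int> InferAsync()
    {
        await _gate.WaitAsync();                        // block re-entry
        try
        {
            // Placeholder for the real ONNX Runtime forward pass:
            await Task.Delay(10);
            return ActiveContext.Count;                 // tokens visible to this run
        }
        finally
        {
            _gate.Release();
        }
    }
}
```

The queue lets the UI thread keep appending user input while generation is in flight; the semaphore guarantees the engine itself is never entered twice concurrently.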
Summary
We have established that:
- VRAM is the bottleneck, not compute.
- Token Budgeting is a hard constraint we must enforce programmatically.
- Sliding Windows manage immediate attention scope.
- Semantic Pruning (via Vector Embeddings) allows us to maintain long-term coherence without storing raw text in VRAM.
- Modern C# features like Records, Pattern Matching, and Span<T> are essential for building a performant, memory-safe context manager.
In the next subsection, we will begin implementing the ContextManager class, focusing on the tokenization and budgeting logic using Microsoft.ML.OnnxRuntime.
Basic Code Example
Here is a simple, self-contained example demonstrating how to manage a limited context window in a local LLM scenario using a sliding window approach and vector embeddings for semantic relevance.
using System;
using System.Collections.Generic;
using System.Linq;
// Note: this self-contained example simulates embeddings and inference,
// so no Microsoft.ML.OnnxRuntime usings are required to compile it.
namespace LocalContextManagement
{
// Represents a single message in the conversation history
public record ConversationMessage(string Role, string Content, float[] Embedding);
public class ContextWindowManager
{
private readonly int _maxTokenBudget;
private readonly List<ConversationMessage> _history = new();
// In a real scenario, we would use a dedicated embedding model (e.g., All-MiniLM-L6-v2)
// For this example, we simulate embeddings as simple numeric vectors.
public ContextWindowManager(int maxTokenBudget)
{
_maxTokenBudget = maxTokenBudget;
}
public void AddMessage(string role, string content)
{
// Simulate generating an embedding for the content
// In reality, this would be an ONNX inference call
var embedding = GenerateMockEmbedding(content);
var message = new ConversationMessage(role, content, embedding);
_history.Add(message);
}
// Core logic: Prune history based on token budget and semantic relevance
public List<ConversationMessage> GetOptimizedContext(string currentQuery)
{
Console.WriteLine($"[System] Current History Size: {_history.Count} messages");
// 1. Calculate approximate tokens (1 token ~= 4 chars for English text)
int currentTokens = _history.Sum(m => m.Content.Length / 4);
if (currentTokens <= _maxTokenBudget)
{
Console.WriteLine("[System] Context fits within budget. Returning full history.");
return _history;
}
Console.WriteLine($"[System] Context exceeds budget ({currentTokens} > {_maxTokenBudget}). Pruning...");
// 2. Generate embedding for the current query to find relevance
var queryEmbedding = GenerateMockEmbedding(currentQuery);
// 3. Score messages by relevance to the current query (Cosine Similarity)
var scoredHistory = _history
.Select(msg => new
{
Message = msg,
Score = CalculateCosineSimilarity(queryEmbedding, msg.Embedding)
})
.OrderByDescending(x => x.Score) // Keep most relevant
.ToList();
// 4. Sliding Window: Reconstruct context until budget is met
var optimizedContext = new List<ConversationMessage>();
int accumulatedTokens = 0;
foreach (var item in scoredHistory)
{
int msgTokens = item.Message.Content.Length / 4;
if (accumulatedTokens + msgTokens <= _maxTokenBudget)
{
optimizedContext.Add(item.Message);
accumulatedTokens += msgTokens;
}
else
{
// We stop adding once the budget is full, prioritizing by score
break;
}
}
// 5. Sort by original chronological order for the LLM to understand flow
// (Optional, but usually preferred for chat continuity)
var finalContext = optimizedContext
.OrderBy(m => _history.IndexOf(m))
.ToList();
Console.WriteLine($"[System] Pruned context size: {finalContext.Count} messages. Estimated tokens: {accumulatedTokens}");
return finalContext;
}
// --- Helper Methods ---
// Simulates an embedding vector (e.g., 384 dimensions)
private float[] GenerateMockEmbedding(string text)
{
var rnd = new Random(text.GetHashCode()); // Deterministic based on text
var vector = new float[384];
for (int i = 0; i < vector.Length; i++)
{
vector[i] = (float)rnd.NextDouble();
}
return vector;
}
// Calculates Cosine Similarity between two vectors (-1 to 1, where 1 is identical)
private float CalculateCosineSimilarity(float[] vecA, float[] vecB)
{
if (vecA.Length != vecB.Length) return 0;
float dotProduct = 0;
float magnitudeA = 0;
float magnitudeB = 0;
for (int i = 0; i < vecA.Length; i++)
{
dotProduct += vecA[i] * vecB[i];
magnitudeA += vecA[i] * vecA[i];
magnitudeB += vecB[i] * vecB[i];
}
magnitudeA = (float)Math.Sqrt(magnitudeA);
magnitudeB = (float)Math.Sqrt(magnitudeB);
if (magnitudeA == 0 || magnitudeB == 0) return 0;
return dotProduct / (magnitudeA * magnitudeB);
}
}
class Program
{
static void Main(string[] args)
{
// Simulate a constrained environment (e.g., 512 tokens)
var contextManager = new ContextWindowManager(maxTokenBudget: 512);
// 1. Populate history with various topics
// Note: We add enough data to exceed the token budget
contextManager.AddMessage("system", "You are a helpful assistant specialized in C# and AI.");
contextManager.AddMessage("user", "What is the capital of France?");
contextManager.AddMessage("assistant", "The capital of France is Paris.");
// Add a long technical discussion to fill the context
contextManager.AddMessage("user", "Can you explain how sliding window attention works in ONNX Runtime?");
contextManager.AddMessage("assistant", "Sliding window attention restricts the attention mechanism to a fixed-size window of previous tokens, reducing memory complexity from quadratic to linear.");
// Add noise/unrelated history
contextManager.AddMessage("user", "What did I have for breakfast?");
contextManager.AddMessage("assistant", "I don't have access to your personal data unless you tell me.");
contextManager.AddMessage("user", "Tell me more about the linear complexity part.");
contextManager.AddMessage("assistant", "In standard attention, every token attends to every other token. With a window size W, each token only attends to W neighbors, significantly lowering VRAM usage.");
// 2. Simulate a new user query
string newQuery = "How does sliding window attention affect the context length in ONNX?";
Console.WriteLine($"\n--- Processing Query: \"{newQuery}\" ---\n");
// 3. Retrieve optimized context
var context = contextManager.GetOptimizedContext(newQuery);
// 4. Simulate passing to the LLM (Local Inference)
Console.WriteLine("\n--- Final Context Sent to ONNX Model ---");
foreach (var msg in context)
{
Console.WriteLine($"[{msg.Role.ToUpper()}]: {msg.Content}");
}
}
}
}
Detailed Explanation
1. The Problem: Hardware Constraints in Local LLMs
Running Large Language Models (LLMs) locally on consumer hardware (like a laptop with an integrated GPU) presents a significant challenge: VRAM (Video RAM) limitations.
Unlike cloud servers with 80GB+ GPUs, a local device might only have 4GB or 8GB of shared memory. LLMs store the "Context Window" (the conversation history) in memory. If the history grows too large, the application crashes or slows down drastically.
This code solves the problem by implementing a Context Pruning Manager. It acts as a gatekeeper, ensuring that only the most relevant information is passed to the ONNX runtime model, keeping memory usage strictly within the hardware budget.
2. Code Breakdown
Step 1: Data Structures and Setup
- The Record: We define a ConversationMessage record. This is a modern C# feature providing immutability.
- The Embedding: Notice the float[] Embedding. This is crucial. We aren't just storing text; we are storing a numerical representation (vector) of that text. This allows us to mathematically calculate "semantic relevance" later.
Step 2: The ContextWindowManager Class
public class ContextWindowManager
{
private readonly int _maxTokenBudget;
private readonly List<ConversationMessage> _history = new();
// ...
}
- Token Budget: We define _maxTokenBudget. In LLMs, tokens are the units of processing (roughly 4 characters of English text = 1 token). This variable acts as our hard ceiling.
- History: We maintain a list of all messages ever sent. In a real production app, this might be stored in a database, but for local inference, we keep it in memory for speed.
Step 3: Simulating Embeddings (The "Mock" Logic)
private float[] GenerateMockEmbedding(string text)
{
var rnd = new Random(text.GetHashCode());
// ... generates vector ...
}
- Why Embeddings? Real-world context management uses Vector Embeddings (generated by a small model like all-MiniLM-L6-v2). These convert text into high-dimensional points (e.g., 384 dimensions).
- Simulation: Since we cannot load a heavy embedding model in this simple snippet, we use text.GetHashCode() to seed a Random number generator. This ensures that the same text always produces the same vector within a single run (string hash codes vary between .NET processes), allowing us to simulate similarity checks reliably.
Step 4: Calculating Semantic Relevance
private float CalculateCosineSimilarity(float[] vecA, float[] vecB)
{
// ... Dot product / Magnitude calculation ...
}
- The Math: To decide what to keep and what to discard, we need to know which historical messages are relevant to the current query.
- Cosine Similarity: This is the standard metric for vector embeddings. It measures the angle between two vectors.
  - 1.0: Identical meaning.
  - 0.0: Completely unrelated.
  - -1.0: Opposite meaning.
- We use this to rank the conversation history.
Step 5: The Pruning Logic (The Sliding Window)
public List<ConversationMessage> GetOptimizedContext(string currentQuery)
{
// 1. Check if we even need to prune
int currentTokens = _history.Sum(m => m.Content.Length / 4);
if (currentTokens <= _maxTokenBudget) return _history;
// 2. Score messages
var scoredHistory = _history
.Select(msg => new { Message = msg, Score = CalculateCosineSimilarity(...) })
.OrderByDescending(x => x.Score) // Most relevant first
.ToList();
// 3. Fill the bucket
var optimizedContext = new List<ConversationMessage>();
int accumulatedTokens = 0;
foreach (var item in scoredHistory)
{
if (accumulatedTokens + item.Message.Content.Length / 4 <= _maxTokenBudget)
{
optimizedContext.Add(item.Message);
accumulatedTokens += item.Message.Content.Length / 4;
}
}
// ...
}
- Logic Flow:
- Estimate Cost: We approximate the token count (Length / 4). If it fits, we return the full history.
- Ranking: If it's too big, we calculate the similarity of every historical message against the new query.
- Greedy Selection: We iterate through the sorted list (most relevant first) and add messages to optimizedContext until the _maxTokenBudget is reached.
- Discarding: Any message that doesn't fit is dropped. This is the "sliding window" in action: we are sliding a window of attention over the most important parts of the history.
Step 6: Chronological Sorting
- Why? LLMs understand conversations best in chronological order. Even though we selected messages based on relevance (scattered across time), we must re-order them back to the original sequence before sending them to the model. Otherwise, the model might see the "Answer" before the "Question."
Step 7: Execution in Main
- We populate the history with technical questions about C#/ONNX and a random question about breakfast.
- We issue a new query: "How does sliding window attention affect the context length in ONNX?"
- Result: The system detects the budget is exceeded. It calculates that the "breakfast" message has low relevance to the new query (low cosine similarity) but the technical messages have high relevance. It keeps the technical ones and discards the breakfast one.
Visualizing the Flow
Common Pitfalls
- Token Estimation Error:
  - The Mistake: Using text.Length as an exact token count.
  - Why it fails: Tokenization (e.g., using BPE or WordPiece) is language-dependent and non-linear. A 100-character string in English might be 25 tokens, but in Chinese, it might be 50 tokens.
  - Solution: In production, always use the specific tokenizer library (e.g., Microsoft.ML.Tokenizers) to get an accurate count. The Length / 4 heuristic is only for rough estimation in prototypes.
- Forgetting Chronological Re-sorting:
  - The Mistake: Passing the context to the LLM in the order of "Relevance Score" rather than "Time."
  - Why it fails: LLMs are autoregressive; they predict the next token based on previous tokens. If the model sees an answer before the question, the attention mechanism breaks, leading to hallucinations or nonsensical replies.
  - Solution: Always perform the secondary sort by timestamp (or list index) after selecting the relevant items.
- Ignoring System Prompt Overhead:
  - The Mistake: Treating the System Prompt as "free" memory.
  - Why it fails: System prompts (e.g., "You are a helpful assistant...") consume tokens just like user messages. If your system prompt is 500 tokens and your budget is 512, you have almost no room for conversation.
  - Solution: Account for system prompt tokens in your accumulatedTokens calculation. If the budget is tight, consider summarizing the system prompt or hard-coding it into the model session options if the ONNX runtime allows.
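The third pitfall can be guarded against with a tiny budget split: reserve the system prompt's cost up front and hand the remainder to history. The `BudgetSplit` helper below is a sketch using the chapter's Length / 4 heuristic; in production the estimate would come from a real tokenizer.

```csharp
using System;

public static class BudgetSplit
{
    // Reserve the system prompt's token cost first; history gets the rest.
    public static int HistoryBudget(int totalBudget, string systemPrompt)
    {
        int systemTokens = Math.Max(1, systemPrompt.Length / 4); // chapter's heuristic
        int remaining = totalBudget - systemTokens;
        if (remaining <= 0)
            throw new InvalidOperationException(
                $"System prompt (~{systemTokens} tokens) exhausts the {totalBudget}-token budget.");
        return remaining;
    }
}
```

Failing loudly when the system prompt alone exceeds the budget is deliberate: silently truncating the instructions is usually worse than refusing the call.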
The chapter continues with advanced code, plus exercises and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.
Code License: All code examples are released under the MIT License. Github repo.
Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.
All textual explanations, original diagrams, and illustrations are the intellectual property of the author. To support the maintenance of this site via AdSense, please read this content exclusively online. Copying, redistribution, or reproduction is strictly prohibited.