Chapter 10: Stateful Chat Sessions in Local Memory
Theoretical Foundations
In-memory state management for local Edge AI applications is the architectural discipline of maintaining conversational context directly within the application's Random Access Memory (RAM), acting as a volatile, high-speed intermediary between the user's input and the deterministic inference engine. Unlike stateless HTTP requests where every interaction is treated as a fresh, context-free query, stateful sessions allow the Large Language Model (LLM) to "remember" previous turns, enabling coherent dialogue, multi-step reasoning, and personalized interactions. This is particularly critical when running models like Microsoft's Phi-2 or Meta's Llama 2 locally via ONNX Runtime, where hardware constraints (VRAM/Compute) necessitate strict control over the data fed into the model.
The "Why": The Necessity of Context in Local Inference
To understand the necessity of state, consider the "Amnesiac Librarian" analogy. Imagine walking into a library and asking a librarian, "Where can I find books on quantum physics?" The librarian points you to the correct aisle. You then ask, "Do they have anything by Richard Feynman?" If the librarian has no memory of the previous question, they cannot logically deduce that "they" refers to the quantum physics section. They would treat the second question in isolation, potentially directing you to a biography section instead of the physics section.
In a local Edge AI scenario, the model (the librarian) is often loaded entirely into GPU VRAM or system RAM. The inference pipeline (the request processing) is computationally expensive. Without state management:
- Context Collapse: The model loses the thread of conversation, leading to generic, non-contextual responses.
- Token Inefficiency: You would have to re-send the entire conversation history with every single request, which is inefficient for local hardware with limited context windows (e.g., 2048 or 4096 tokens).
- Latency Spikes: Re-processing the full history for every turn increases the time-to-first-token (TTFT), which is detrimental in real-time edge applications.
State management solves this by acting as a rolling buffer. It retains the conversation history, intelligently truncating or summarizing it to fit within the model's maximum context window (the "context length") defined during the model's training and quantization.
Data Structures and Constraints
From a data structure perspective, a chat session is best visualized as a Directed Acyclic Graph (DAG) or, more simply for linear chats, a Time-Series Sequence. However, because we are dealing with complex logic and potential branching (e.g., editing a previous message), the underlying structure in C# must be robust.
The Data Structure: ChatMessage and ConversationHistory
In modern C#, we utilize record types for immutable data transfer objects (DTOs) representing individual messages. A record provides value equality, which is crucial when comparing message content for caching or deduplication strategies, and it supports init accessors for immutability, ensuring that once a message is added to the history, it cannot be mutated by downstream processes (preventing prompt injection attacks or history corruption).
A ChatMessage typically contains:
- Role: (System, User, Assistant)
- Content: The text payload.
- Timestamp: For auditing and potential TTL (Time-To-Live) eviction.
- Metadata: Token count, processing latency, etc.
The ConversationHistory acts as a manager. It is not merely a List<T>; it is a stateful entity that enforces constraints. In the context of ONNX Runtime, where the input is a tensor of token IDs, the history manager must bridge the gap between human-readable strings and the model's vocabulary.
The Token Limit Constraint: The "Suitcase" Analogy
A critical constraint in local inference is the Context Window. This is often visualized as a Suitcase.
- The Suitcase: The model's maximum context length (e.g., 4096 tokens).
- The Items: The tokens representing the system prompt, user messages, and assistant responses.
- The Packing Strategy: How we fit items into the suitcase.
If you try to pack too many items (tokens) into the suitcase, it won't close (inference fails). If you pack only the newest items, you might leave behind essential context (the "Amnesiac Librarian" problem). Therefore, the theoretical foundation of state management relies on Eviction Policies.
Common eviction strategies include:
- First-In-First-Out (FIFO): Removing the oldest messages once the token limit is reached. This is simple but risks losing critical early context.
- Summarization: Using a secondary, smaller model (or the same model) to summarize the oldest conversation turns into a single "summary token" or "context blob," effectively compressing history.
- Sliding Window: Keeping the most recent N tokens while keeping the System Prompt fixed at the beginning.
In C#, this requires dynamic calculation of token counts. Since ONNX models often use different tokenizers (e.g., GPT2Tokenizer vs. LlamaTokenizer), the state manager must be agnostic to the specific tokenizer, relying on an abstraction to estimate or calculate token usage.
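To make the tokenizer-agnostic design concrete, here is a minimal sketch. The interface and class names (ITokenizer, WhitespaceTokenizer) are illustrative, not part of ONNX Runtime; the whitespace "tokenizer" is a stand-in for local testing only.

```csharp
using System;
using System.Linq;

// Illustrative abstraction: the state manager depends only on this
// interface, never on a concrete tokenizer implementation.
public interface ITokenizer
{
    int[] Encode(string text);
    int CountTokens(string text);
}

// Naive whitespace stand-in for local testing. A real model needs its
// own tokenizer (BPE for GPT-style models, SentencePiece for Llama).
public sealed class WhitespaceTokenizer : ITokenizer
{
    public int[] Encode(string text) =>
        text.Split(' ', StringSplitOptions.RemoveEmptyEntries)
            .Select(word => word.GetHashCode()) // fake IDs, demo only
            .ToArray();

    public int CountTokens(string text) => Encode(text).Length;
}
```

Swapping in the real tokenizer later then only requires registering a different ITokenizer implementation; the state manager's code does not change.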
The Inference Pipeline Integration
The state management layer sits between the UI (or API endpoint) and the ONNX Runtime inference engine. When a user sends a message, the flow is:
- Input Reception: The new user message is received.
- State Retrieval: The ConversationHistory retrieves the relevant context window.
- Serialization/Tokenization: The history is serialized into a prompt string. This string is passed to a Tokenizer (a concept introduced in Book 8: Optimizing Local Models), which converts text into ReadOnlyMemory<int> (token IDs).
- Attention Mechanism: The tokenized sequence is fed into the ONNX model. The model's Self-Attention mechanism uses the entire sequence to calculate the probability distribution for the next token. The "state" is effectively the KV (Key-Value) cache in the transformer architecture, which is built from the input sequence.
Visualizing the Stateful Inference Flow
The following diagram illustrates how the in-memory state interacts with the local ONNX runtime. Note that the state is volatile; if the application restarts, the state is lost (unless persisted to disk, which is outside the scope of this subsection).
C# Architectural Patterns for State Management
To implement this robustly in C#, we leverage specific modern features that enforce safety and performance.
1. Immutability with record and init
Using record for message entities ensures that the conversation history is not accidentally mutated. This is vital for debugging and thread safety in asynchronous environments.
public enum ChatRole { System, User, Assistant }
public record ChatMessage(
ChatRole Role,
string Content,
DateTimeOffset Timestamp,
int TokenCount = 0
);
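To illustrate the value-equality and non-destructive mutation guarantees just described, here is a small self-contained sketch. RecordEqualityDemo is an illustrative name, and the enum and record are repeated so the listing compiles on its own.

```csharp
using System;

public enum ChatRole { System, User, Assistant }

public record ChatMessage(
    ChatRole Role,
    string Content,
    DateTimeOffset Timestamp,
    int TokenCount = 0
);

public static class RecordEqualityDemo
{
    // Two records with the same positional values are equal, even though
    // they are distinct object instances: handy for deduplication.
    public static bool AreDuplicates(ChatMessage first, ChatMessage second)
        => first == second;

    // 'with' produces a modified copy; the original message is untouched,
    // so downstream code can never corrupt the stored history entry.
    public static ChatMessage WithTokenCount(ChatMessage message, int count)
        => message with { TokenCount = count };
}
```

A class would compare by reference here; the record's compiler-generated value equality is what makes content-based caching and deduplication cheap to implement.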
2. Interfaces for Abstraction (The "Swappable" Concept)
As referenced from Book 5: Abstraction Layers for AI, we define an interface for state management. This allows us to swap between an in-memory implementation (for fast local testing) and a persistent implementation (e.g., SQLite for long-term edge storage) without changing the inference pipeline.
public interface IConversationState
{
IReadOnlyList<ChatMessage> History { get; }
int CurrentTokenCount { get; }
void AddMessage(ChatRole role, string content);
void Truncate(int maxTokens); // Eviction policy
string BuildPrompt(); // Serializes history for the model
}
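A minimal in-memory implementation of this interface might look as follows. InMemoryConversationState and the per-word token estimate are illustrative choices, and the enum, record, and interface are repeated so the listing compiles on its own; BuildPrompt uses the plain "Role: Content" format used later in this chapter.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public enum ChatRole { System, User, Assistant }

public record ChatMessage(
    ChatRole Role, string Content, DateTimeOffset Timestamp, int TokenCount = 0);

public interface IConversationState
{
    IReadOnlyList<ChatMessage> History { get; }
    int CurrentTokenCount { get; }
    void AddMessage(ChatRole role, string content);
    void Truncate(int maxTokens);
    string BuildPrompt();
}

public class InMemoryConversationState : IConversationState
{
    private readonly List<ChatMessage> _messages = new();

    public IReadOnlyList<ChatMessage> History => _messages;

    public int CurrentTokenCount => _messages.Sum(m => m.TokenCount);

    public void AddMessage(ChatRole role, string content)
    {
        // Naive word-count estimate; a real implementation would cache
        // the exact count from the model's tokenizer (see below).
        int tokens = content.Split(' ', StringSplitOptions.RemoveEmptyEntries).Length;
        _messages.Add(new ChatMessage(role, content, DateTimeOffset.UtcNow, tokens));
    }

    public void Truncate(int maxTokens)
    {
        // FIFO eviction: drop the oldest messages until the history fits.
        while (CurrentTokenCount > maxTokens && _messages.Count > 0)
            _messages.RemoveAt(0);
    }

    public string BuildPrompt()
    {
        var sb = new StringBuilder();
        foreach (var m in _messages)
            sb.AppendLine($"{m.Role}: {m.Content}");
        return sb.ToString();
    }
}
```

A SQLite-backed implementation would expose exactly the same members, so the inference pipeline never learns which storage backs the session.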
3. Span<T> and Memory<T> for Token Handling
When dealing with local inference, performance is paramount. Converting strings to tokens and back involves heavy allocation. We use Span<T> and Memory<T> (as discussed in Book 2: High-Performance C#) to slice arrays of token IDs without creating new memory allocations. This is critical when managing the "context window" buffer.
For example, when we need to truncate the history to fit the model's limit, we don't want to create new lists. We want to calculate the slice of the existing token array that represents the valid context.
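A minimal sketch of that idea (the method name and window policy are illustrative): slicing keeps only the most recent tokens without copying the array.

```csharp
using System;

public static class ContextWindow
{
    // Returns a view over the most recent tokens that fit the context
    // window. No new array is allocated: the ReadOnlyMemory<int> simply
    // points into the existing buffer.
    public static ReadOnlyMemory<int> SlideWindow(int[] tokens, int maxContext)
    {
        if (maxContext <= 0)
            throw new ArgumentOutOfRangeException(nameof(maxContext));

        if (tokens.Length <= maxContext)
            return tokens; // already fits, nothing to trim

        return tokens.AsMemory(tokens.Length - maxContext, maxContext);
    }
}
```

Note that a production policy would also pin the system-prompt tokens at the front; a single contiguous slice cannot do that, which is why the chapter's full example truncates at the message level instead.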
Edge Cases and Architectural Implications
- The "System Prompt" Anchor: The system prompt (e.g., "You are a helpful assistant") is usually static. In the "Suitcase" analogy, this is the bottom of the suitcase: it stays there, and we pack user/assistant messages on top. The state manager must ensure the system prompt is always included in the token count calculation, even during truncation. If the system prompt itself exceeds the context window (unlikely but possible with complex instructions), the application must throw an exception, as the model cannot function without its core directives.
- Asynchronous Concurrency: In a server-side Blazor application or a multi-threaded desktop app, multiple requests might try to modify the conversation history simultaneously. List<T> is not thread-safe, so the state manager should encapsulate history access. Using SemaphoreSlim or ReaderWriterLockSlim ensures that the history is read and written atomically. For a single-user local edge application, a simple ConcurrentQueue might suffice for appending messages, but care must be taken when calculating the total token count (which requires iterating the collection).
- Token Estimation vs. Exact Counting: Before the actual tokenizer runs, the state manager often needs a "quick estimate" to decide if truncation is needed. Relying on simple character-count heuristics (e.g., 1 token ≈ 4 characters) is dangerous because different models (Phi vs. Llama) have different tokenization efficiencies. The best practice is to cache the token count at the moment of tokenization: when AddMessage is called, tokenize the content immediately to get the exact count and store it in the ChatMessage record. This prevents the "double tokenization" cost later when building the prompt.
- State Persistence (The "What If"): While this subsection focuses on in-memory state, the architecture must be prepared for serialization. Since ChatMessage is a record, it is easily serializable to JSON. In an edge scenario (e.g., a disconnected tablet), the state manager should be able to dump the history to local storage (using System.Text.Json) and reload it upon restart, restoring the session context.
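The persistence edge case can be sketched with System.Text.Json. SessionPersistence is an illustrative name, and the enum and record are repeated so the sample stands alone.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

public enum ChatRole { System, User, Assistant }

public record ChatMessage(
    ChatRole Role, string Content, DateTimeOffset Timestamp, int TokenCount = 0);

public static class SessionPersistence
{
    private static readonly JsonSerializerOptions Options =
        new() { WriteIndented = true };

    // Dump the volatile in-memory history to local storage, e.g. before
    // the edge device shuts down or the app is suspended.
    public static void Save(string path, IReadOnlyList<ChatMessage> history) =>
        File.WriteAllText(path, JsonSerializer.Serialize(history, Options));

    // Restore the session context on restart; an absent file simply
    // yields an empty history rather than an error.
    public static List<ChatMessage> Load(string path) =>
        File.Exists(path)
            ? JsonSerializer.Deserialize<List<ChatMessage>>(
                  File.ReadAllText(path)) ?? new List<ChatMessage>()
            : new List<ChatMessage>();
}
```

Because the record exposes its data through its positional constructor, System.Text.Json can round-trip it without any custom converters.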
The Role of the Inference Engine Abstraction
Referencing the architectural patterns from Book 4: Dependency Injection in .NET, the state management should be decoupled from the inference engine via an interface. This allows the same stateful session logic to drive both a local ONNX model and a cloud-based API (like OpenAI) during development phases.
using System;
using System.Threading;
using System.Threading.Tasks;
// Conceptual interface representing the inference engine
public interface ILocalInferenceEngine
{
Task<string> InferAsync(ReadOnlyMemory<int> tokens, CancellationToken ct);
}
// The state manager orchestrates the flow
public class StatefulChatSession
{
private readonly IConversationState _state;
private readonly ILocalInferenceEngine _engine;
private readonly ITokenizer _tokenizer;
private readonly int _maxContextLength;
public StatefulChatSession(IConversationState state, ILocalInferenceEngine engine, ITokenizer tokenizer, int maxContextLength)
{
_state = state;
_engine = engine;
_tokenizer = tokenizer;
_maxContextLength = maxContextLength;
}
public async Task<string> ChatAsync(string userInput)
{
// 1. Update State
_state.AddMessage(ChatRole.User, userInput);
// 2. Prepare Input (Tokenization)
string prompt = _state.BuildPrompt();
var tokens = _tokenizer.Encode(prompt);
// 3. Enforce Constraints (Truncation if needed)
if (tokens.Length > _maxContextLength)
{
_state.Truncate(_maxContextLength); // Evict oldest messages until the history fits
// Re-encode after truncation
prompt = _state.BuildPrompt();
tokens = _tokenizer.Encode(prompt);
}
// 4. Inference
string response = await _engine.InferAsync(tokens, CancellationToken.None);
// 5. Update State with Response
_state.AddMessage(ChatRole.Assistant, response);
return response;
}
}
Summary
The theoretical foundation of stateful chat sessions in local memory rests on the interplay between data structures (immutable records), resource management (token limits/context windows), and algorithmic efficiency (eviction policies). It transforms the LLM from a simple text completion engine into a conversational agent.
By managing the "Conversation History" as a first-class citizen in the application memory, we bridge the gap between the static, stateless nature of the ONNX Runtime execution and the dynamic, continuous nature of human dialogue. This requires a rigorous understanding of how tokens are consumed, how memory is allocated in C#, and how to architect systems that remain responsive even when hardware resources are constrained.
Basic Code Example
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;
// ---------------------------------------------------------
// REAL-WORLD CONTEXT
// ---------------------------------------------------------
// Imagine you are building a local, offline customer support chatbot
// for a smart appliance (e.g., a washing machine). The user asks:
// 1. "How do I clean the filter?" (Context: Washing Machine)
// 2. "What temperature should I use?" (Context: Washing Machine)
//
// Without state management, the AI treats question #2 as a generic query.
// With state management, the AI remembers we are talking about washing machines.
//
// This code demonstrates a "Hello World" of stateful chat:
// 1. It maintains a conversation history in memory.
// 2. It automatically truncates history to fit hardware constraints (token limits).
// 3. It prepares the state for an ONNX Runtime inference session.
// ---------------------------------------------------------
public class StatefulChatSession
{
// ---------------------------------------------------------
// CONFIGURATION & STATE DEFINITIONS
// ---------------------------------------------------------
// In a real ONNX model (like Phi-3 or Llama), the context window is limited.
// For this example, we simulate a hard limit of 20 tokens to demonstrate truncation logic.
private const int MaxContextTokens = 20;
// A simple struct to represent a chat message.
// In production, this would include roles (User, Assistant, System) and metadata.
public struct ChatMessage
{
public string Role { get; set; } // "user" or "assistant"
public string Content { get; set; }
public int TokenCount { get; set; } // Estimated token count
public override string ToString() => $"{Role}: {Content}";
}
// The in-memory buffer holding the conversation history.
// We use a List for dynamic resizing.
private readonly List<ChatMessage> _conversationHistory = new List<ChatMessage>();
// ---------------------------------------------------------
// CORE LOGIC: ADDING MESSAGES & MANAGING TOKENS
// ---------------------------------------------------------
/// <summary>
/// Adds a message to the session and ensures the total token count
/// stays within the hardware constraints (MaxContextTokens).
/// </summary>
public void AddMessage(string role, string content)
{
// 1. Estimate tokens (Simplified: 1 word ~= 1 token for demo purposes)
int estimatedTokens = EstimateTokenCount(content);
var message = new ChatMessage
{
Role = role,
Content = content,
TokenCount = estimatedTokens
};
// 2. Add to history
_conversationHistory.Add(message);
// 3. Enforce Token Limit (Truncation Strategy)
// We remove oldest messages until we fit within the limit.
// This is a "Sliding Window" approach common in Edge AI.
EnforceTokenLimit();
}
/// <summary>
/// Simulates a tokenizer. In real scenarios, use the specific model's tokenizer (e.g., Tiktoken).
/// </summary>
private int EstimateTokenCount(string text)
{
if (string.IsNullOrWhiteSpace(text)) return 0;
// Naive estimation: Split by spaces and punctuation
return text.Split(new[] { ' ', '.', ',', '!', '?' }, StringSplitOptions.RemoveEmptyEntries).Length;
}
/// <summary>
/// Removes the oldest messages if the total token count exceeds MaxContextTokens.
/// </summary>
private void EnforceTokenLimit()
{
int totalTokens = _conversationHistory.Sum(m => m.TokenCount);
// Iterate from the start (oldest messages) to remove excess
while (totalTokens > MaxContextTokens && _conversationHistory.Count > 0)
{
var removed = _conversationHistory[0];
_conversationHistory.RemoveAt(0);
totalTokens -= removed.TokenCount;
Console.WriteLine($"[System] Truncated history. Removed: '{removed.Content}'");
}
}
// ---------------------------------------------------------
// ONNX RUNTIME INTEGRATION PREPARATION
// ---------------------------------------------------------
/// <summary>
/// Prepares the stateful prompt for the ONNX model.
/// This method formats the history into a single string (prompt engineering).
/// </summary>
public string GetFormattedPrompt(string newQuery)
{
var sb = new StringBuilder();
// 1. System Instruction (Context Anchor)
sb.AppendLine("System: You are a helpful assistant for a washing machine.");
// 2. Append Conversation History
foreach (var msg in _conversationHistory)
{
sb.AppendLine($"{msg.Role}: {msg.Content}");
}
// 3. Append the new user query
sb.AppendLine($"user: {newQuery}");
return sb.ToString();
}
/// <summary>
/// Mocks the creation of ONNX Runtime input tensors using the session state.
/// In a real app, this converts the text string into Integer IDs (InputIDs).
/// </summary>
public void RunMockInference(string userQuery)
{
Console.WriteLine($"\n--- Processing Query: '{userQuery}' ---");
// 1. Prepare the prompt with history
string fullPrompt = GetFormattedPrompt(userQuery);
Console.WriteLine($"[Prompt Prepared]\n{fullPrompt}");
// 2. Convert to Input Tensor (Mock Logic)
// Real ONNX models expect 'input_ids' (LongTensor) and 'attention_mask'.
// Here we simulate the tokenization process.
var inputIds = TokenizeToIds(fullPrompt);
// 3. Create Tensor (Simulated ONNX Input)
// Dimensions: [BatchSize (1), SequenceLength (variable)]
var tensor = new DenseTensor<long>(inputIds, new[] { 1, inputIds.Length });
Console.WriteLine($"[Tensor Created] Shape: [1, {inputIds.Length}], Tokens: {inputIds.Length}");
// 4. Inference (Simulated)
// In a real app: using var session = new InferenceSession("model.onnx");
// var outputs = session.Run(new List<NamedOnnxValue> { ... });
Console.WriteLine("[Inference] Simulated ONNX execution complete.");
// 5. Update State (Simulated Assistant Response)
// In a real app, the model generates this response. We add BOTH the
// user query and the answer to history, so the NEXT turn has the
// full exchange in context.
AddMessage("user", userQuery);
string mockResponse = "Based on our previous conversation, I recommend checking the manual.";
AddMessage("assistant", mockResponse);
Console.WriteLine($"[State Updated] User query and assistant response added to memory.");
}
private long[] TokenizeToIds(string text)
{
// Extremely simplified tokenizer for demonstration
// Maps characters to long IDs just to show tensor creation.
return text.Select(c => (long)c).ToArray();
}
// ---------------------------------------------------------
// MAIN EXECUTION FLOW
// ---------------------------------------------------------
public static void Main()
{
Console.WriteLine("=== Local Edge AI: Stateful Chat Demo ===\n");
var session = new StatefulChatSession();
// SCENARIO 1: Initial Context
// User asks about the washing machine.
session.RunMockInference("How do I clean the filter?");
// SCENARIO 2: Context Retention
// User asks a follow-up. The system MUST remember the washing machine context.
// If we didn't manage state, the AI would lose the topic.
session.RunMockInference("What temperature should I use?");
// SCENARIO 3: Token Limit Enforcement
// We will flood the chat with long messages to trigger the truncation logic.
Console.WriteLine("\n=== Testing Token Limit Enforcement ===");
string longText = "This is a very long sentence designed to exceed the token limit we set. " +
"We are testing the sliding window mechanism.";
// Add multiple long messages to fill the buffer
for (int i = 0; i < 5; i++)
{
session.AddMessage("user", $"Message {i}: {longText}");
}
Console.WriteLine("\n=== Final History State ===");
// Note: Only the most recent messages that fit in the 20-token window remain.
foreach (var msg in session._conversationHistory)
{
Console.WriteLine(msg);
}
}
}
Detailed Line-by-Line Explanation
1. Setup and Data Structures
- using Directives: We include System, System.Collections.Generic, System.Linq, and System.Text for standard operations. Crucially, we also include Microsoft.ML.OnnxRuntime and Microsoft.ML.OnnxRuntime.Tensors. While this example mocks the actual inference execution, these namespaces represent the real-world environment where the state management logic lives.
- MaxContextTokens: Set to 20. On a hardware-constrained Edge AI device (like a Raspberry Pi), RAM is limited, and the configured context length directly drives memory consumption: a model like Phi-3 Mini needs far more RAM at a 128k context than at a 4k context. We artificially restrict the limit to 20 tokens to make the truncation logic visible in the console output.
- ChatMessage Struct: A lightweight value type that stores the role (User/Assistant), the text content, and the estimated token count. Using a struct is a performance optimization suitable for high-frequency message handling in memory-constrained environments.
2. State Management Logic (AddMessage & EnforceTokenLimit)
- AddMessage: This is the entry point for any interaction. It calculates the "cost" (token count) of the new message and adds it to the _conversationHistory list.
- EstimateTokenCount: In a production environment, you would use the specific model's tokenizer (e.g., tiktoken for GPT-family models, or the SentencePiece tokenizer for Llama). For this "Hello World" example, we use a naive heuristic (splitting by whitespace) so the code runs without external dependencies.
- EnforceTokenLimit (The Sliding Window): This is the critical logic for Edge AI. It calculates the totalTokens of the entire history. If totalTokens > MaxContextTokens, it enters a while loop that removes the oldest message (_conversationHistory[0]) and subtracts its token count from the total. Why this matters: it ensures the model never receives an input larger than the hardware can handle, preventing Out-Of-Memory (OOM) crashes, which are fatal on edge devices.
3. ONNX Integration Preparation (GetFormattedPrompt)
- Prompt Engineering: ONNX models don't understand "chat" natively; they understand text strings. GetFormattedPrompt reconstructs the raw text history.
- Formatting: We use a standard format: System: ..., user: ..., assistant: .... This structure helps the model distinguish between instructions and dialogue.
- RunMockInference:
  - Step 1: It calls GetFormattedPrompt to get the state-aware string.
  - Step 2: It converts the string to a DenseTensor<long>. In a real app, the tokenizer converts "Hello" into integers like [15496, 13], and the ONNX Runtime requires these integers in a Tensor format.
  - Step 3: It simulates the execution. In a real scenario, session.Run(inputs) would block until the model generates a response.
  - Step 4 (Crucial): It writes the turn back into history via AddMessage. This closes the loop: without it, the AI would answer correctly in isolation but forget the exchange in the next turn.
4. Execution Flow (Main)
- Scenario 1 & 2: We demonstrate that the AI retains context. The second query ("What temperature?") is ambiguous on its own, but the system prompt and history provide the necessary context (the washing machine).
- Scenario 3 (Stress Test): We generate 5 long messages that, individually or combined, exceed MaxContextTokens (20). The EnforceTokenLimit logic aggressively trims the history, keeping only the most recent data that fits within the window.
Visualizing the State Flow
The following diagram illustrates how data moves through the system, specifically highlighting where the "State" is updated and where the "Truncation" occurs before reaching the ONNX model.
Common Pitfalls
- Forgetting to Append the Assistant Response:
  - The Mistake: Developers run the inference and display the output to the user, but fail to add that output back into the _conversationHistory list.
  - The Consequence: The next user query is processed with the user's previous question in context, but the AI has no memory of its own previous answer. This breaks the conversational flow.
  - The Fix: Always ensure the pipeline includes AddMessage("assistant", modelOutput) immediately after inference.
- Using Naive Tokenizers in Production:
  - The Mistake: Using the EstimateTokenCount method provided in this example (splitting by spaces) in a real application.
  - The Consequence: Different models use different tokenization algorithms (BPE, WordPiece, SentencePiece). A word might be one token in one model and two in another. If your count is inaccurate, you might exceed the model's context window, causing the ONNX Runtime to throw a runtime exception (often a cryptic "Invalid Input" error).
  - The Fix: Integrate a tokenizer that matches your specific model, such as the Microsoft.ML.Tokenizers library, which provides Tiktoken and SentencePiece implementations for .NET.
- Ignoring System Prompt Overhead:
  - The Mistake: Calculating token limits based only on user/assistant messages.
  - The Consequence: The System prompt (e.g., "You are a helpful assistant...") also consumes tokens. If the limit is tight, the system prompt can push the total context over the limit, causing silent truncation of actual conversation history.
  - The Fix: Include the system prompt token count in the totalTokens calculation within EnforceTokenLimit.
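The fix for the third pitfall can be sketched as follows: a variant of the truncation logic that reserves the system prompt's token cost up front. BudgetedSession is an illustrative class, not the chapter's StatefulChatSession, and the pre-computed system prompt count is assumed to come from the model's real tokenizer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class BudgetedSession
{
    private const int MaxContextTokens = 20;
    private readonly int _systemPromptTokens;
    private readonly List<(string Role, string Content, int TokenCount)> _history = new();

    public BudgetedSession(int systemPromptTokens)
    {
        // Fail fast if the system prompt alone cannot fit: the model
        // cannot function without its core directives.
        if (systemPromptTokens >= MaxContextTokens)
            throw new InvalidOperationException(
                "System prompt alone exceeds the context window.");
        _systemPromptTokens = systemPromptTokens;
    }

    public void AddMessage(string role, string content, int tokenCount)
    {
        _history.Add((role, content, tokenCount));
        EnforceTokenLimit();
    }

    public int RemainingBudget =>
        MaxContextTokens - _systemPromptTokens - _history.Sum(m => m.TokenCount);

    private void EnforceTokenLimit()
    {
        // Reserve the system prompt's tokens first, then trim the history
        // (oldest first) until the remainder fits the leftover budget.
        int budget = MaxContextTokens - _systemPromptTokens;
        int total = _history.Sum(m => m.TokenCount);
        while (total > budget && _history.Count > 0)
        {
            total -= _history[0].TokenCount;
            _history.RemoveAt(0);
        }
    }
}
```

The key difference from the chapter's EnforceTokenLimit is the reduced budget: the window available to conversation turns is MaxContextTokens minus the system prompt's cost, never the full window.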
The chapter continues with advanced code examples, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.
Code License: All code examples are released under the MIT License. Github repo.
Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.