Chapter 18: Offline RAG - Querying Local Files

Theoretical Foundations

The central promise of Edge AI is the transformation of a general-purpose computing device into a specialized, intelligent agent that serves you alone. To achieve this in the realm of information retrieval—specifically for local files—we must move beyond simple keyword matching and embrace semantic understanding. This requires a sophisticated architectural pattern known as Retrieval-Augmented Generation (RAG), implemented entirely offline.

The Problem: The Limits of Context and Knowledge

A Large Language Model (LLM), no matter how sophisticated, is a static snapshot of the world frozen in time at the moment of its training. It knows about the world in general, but it knows nothing about your specific documents, your private notes, or your proprietary codebase. Furthermore, LLMs have a finite "context window"—the amount of text they can consider at one time. You cannot simply paste a 500-page technical manual into a chat prompt and expect a coherent answer.

The Analogy of the Archivist: Imagine you have a brilliant colleague with an eidetic memory of everything published before 2023, but they are forbidden from looking at any documents in the room. You ask them, "What is the specific clause in the merger agreement regarding intellectual property?" They cannot answer because they lack the source material. If you handed them the entire 500-page agreement and asked the same question five seconds later, they wouldn't have time to read it. They would be overwhelmed.

Offline RAG is the solution. It acts as a hyper-efficient librarian. It doesn't read the entire book to you; it runs to the shelf, pulls the exact book, opens it to the exact page, and hands that specific paragraph to your brilliant colleague (the LLM) to summarize.

The Architecture of Offline RAG

In our C# environment, building this "librarian" requires a pipeline composed of three distinct phases: Ingestion (Indexing), Retrieval, and Generation. Since we are operating strictly offline, every component of this pipeline must run locally on the user's machine using ONNX models.

1. Ingestion: From Raw Bytes to Semantic Vectors

The first challenge is making local files understandable to a machine that only understands numbers. We cannot simply pass a PDF binary blob to an LLM. We must convert the information into a format that preserves meaning.

Text Chunking: Documents are rarely linear. A PDF might have headers, footers, tables, and multiple columns. To handle this, we employ Chunking. This is the process of breaking a large document into smaller, semantically meaningful segments.

  • Why? If chunks are too large, the embedding averages over many topics and specific details become hard to pinpoint. If chunks are too small, we lose the surrounding context needed to interpret a sentence.
  • The Strategy: In C#, we might use System.Text.RegularExpressions to split by semantic boundaries (like double newlines for paragraphs) rather than arbitrary character counts. This ensures that a single "thought" or "paragraph" remains intact.
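A minimal sketch of this paragraph-first strategy (the ParagraphChunker name and the minChars threshold are illustrative, not a prescribed API; a production chunker would budget by tokens rather than characters):

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ParagraphChunker
{
    // Split on blank lines (paragraph boundaries), then fold fragments that
    // are too short to stand alone into the previous chunk, so each chunk
    // remains a complete "thought". minChars is a crude stand-in for a real
    // token budget.
    public static List<string> Chunk(string text, int minChars = 40)
    {
        var chunks = new List<string>();
        foreach (string raw in Regex.Split(text, @"(?:\r?\n){2,}"))
        {
            string para = raw.Trim();
            if (para.Length == 0) continue;

            if (chunks.Count > 0 && para.Length < minChars)
                chunks[^1] += "\n\n" + para;   // too small: merge with previous
            else
                chunks.Add(para);
        }
        return chunks;
    }
}
```

Note the non-capturing group `(?:...)` in the pattern: with a capturing group, Regex.Split would also return the matched newlines as extra entries.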

Vector Embeddings (The "Meaning" Layer): This is the most critical step. We need to convert text into a high-dimensional vector (an array of floating-point numbers). We use a lightweight model specifically trained for this task (typically a BERT variant such as all-MiniLM-L6-v2), often called an "Embedding Model."

  • The Concept: Imagine a 3D graph. The word "King" might be located at coordinates (x, y, z). The word "Queen" will be located nearby. The word "Car" will be far away. In reality, these models use hundreds or thousands of dimensions, creating a "semantic space."
  • Why this matters: In this space, mathematical distance equals semantic similarity. "How to fix a leak" and "repairing a plumbing issue" will have vectors that are mathematically close, even though they share no keywords.

2. The Local Vector Database

Once we have these vectors, we need to store them in a way that allows for extremely fast searching. A standard SQL database is terrible at this; it excels at exact matches, not "find the vector closest to this one."

We use a Vector Database (like a local instance of Qdrant, Milvus, or a file-based index like LanceDB).

  • The Mechanism: These databases use algorithms like HNSW (Hierarchical Navigable Small World) or IVF (Inverted File Index).
  • Analogy: Think of a library card catalog. Instead of scanning every book on every shelf (Brute Force search), the database creates a map. It groups "Math" books together, "History" books together, and within "Math," it groups "Calculus" together. When you ask for a topic, it navigates this hierarchy instantly to find the nearest match.
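For contrast, here is what those index structures replace: an exact brute-force scan that scores every stored vector against the query. It is correct but O(n·d) per query, which is exactly the cost HNSW and IVF are designed to avoid (BruteForceSearch is a hypothetical helper name):

```csharp
using System;
using System.Linq;

public static class BruteForceSearch
{
    // Exact nearest-neighbour search: touch every stored vector on every
    // query. HNSW and IVF navigate a precomputed structure instead, returning
    // approximately the same neighbours in a fraction of the comparisons.
    public static int[] TopK(float[][] vectors, float[] query, int k) =>
        vectors
            .Select((v, i) => (Index: i, Score: Dot(v, query)))
            .OrderByDescending(x => x.Score)
            .Take(k)
            .Select(x => x.Index)
            .ToArray();

    static float Dot(float[] a, float[] b)
    {
        float sum = 0f;
        for (int i = 0; i < a.Length; i++) sum += a[i] * b[i];
        return sum; // equals cosine similarity when all vectors are L2-normalized
    }
}
```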

3. Retrieval and Dynamic Prompt Engineering

When the user asks a question, the pipeline executes the Query Phase:

  1. The user's question ("What is the warranty policy?") is converted into a vector using the same embedding model used for ingestion.
  2. This query vector is sent to the local Vector Database.
  3. The database returns the top k most relevant text chunks (e.g., the top 3 paragraphs from your warranty PDF).

The Prompt Construction: We now have the user's question and the relevant context. We cannot just throw the context at the LLM. We must construct a Dynamic Prompt.

This is where C# string interpolation and templating shine. We construct a prompt that looks like this:

string systemPrompt = "You are an AI assistant. Answer the question based EXACTLY on the provided context. If the context does not contain the answer, say 'I don't know'.";
string context = string.Join("\n\n", retrievedChunks); // The data from the Vector DB
string userQuestion = "What is the warranty policy?";

string finalPrompt = $"{systemPrompt}\n\nContext:\n{context}\n\nQuestion:\n{userQuestion}";

This prompt is then fed into our local ONNX LLM (e.g., Llama 3.2 or Phi-3) running via the Microsoft.ML.OnnxRuntime C# bindings.

Why This Must Be Offline (The Privacy & Latency Imperative)

Using this architecture locally offers distinct advantages over cloud-based RAG:

  1. Data Sovereignty: Your documents never leave the machine. For legal, medical, or proprietary data, this is non-negotiable.
  2. Latency: Once the vectors are indexed, local retrieval typically takes only milliseconds. There is no network round-trip to an API.
  3. Cost: No token costs. You can query your documents 10,000 times a day for free.

The Role of C# and Modern .NET Features

In building this, we rely heavily on modern C# features to manage the complexity and performance.

Interfaces for Abstraction: We use Interfaces (IEmbeddingModel, IVectorStore, ITextChunker) to decouple the implementation from the contract. This is crucial because the underlying technology might change. You might switch from a BERT model to a Phi embedding model, or from a file-based vector store to an in-memory store for testing.

  • Previous Chapter Reference: Recall in Book 8 where we discussed the IOnnxModel interface. We are applying that same pattern here, but extending it to the RAG pipeline components.
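One plausible shape for those contracts is sketched below. The method signatures are illustrative assumptions, not a prescribed API; the LineChunker class is a trivial implementation included only to show the contracts in use:

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Contracts decouple the pipeline from any one model or storage technology.
public interface ITextChunker
{
    IReadOnlyList<string> Chunk(string document);
}

public interface IEmbeddingModel
{
    Task<float[]> EmbedAsync(string text, CancellationToken ct = default);
}

public interface IVectorStore
{
    Task AddAsync(string id, float[] vector, string payload, CancellationToken ct = default);
    Task<IReadOnlyList<(string Payload, float Score)>> SearchAsync(
        float[] query, int topK, CancellationToken ct = default);
}

// A trivial implementation used to exercise the ITextChunker contract;
// swapping it for a smarter chunker requires no change to the callers.
public sealed class LineChunker : ITextChunker
{
    public IReadOnlyList<string> Chunk(string document) => document.Split('\n');
}
```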

Span<T> and Memory<T>: Text processing and vector manipulation are memory-intensive. When chunking large files or converting embeddings (arrays of floats) into the format required by the ONNX runtime, we avoid allocating massive strings or arrays. We use Span<T> to slice and dice data without memory copies. This is essential for keeping the application responsive while ingesting gigabytes of local data.
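A small illustration of the idea, assuming a hypothetical helper: scanning a document line by line with ReadOnlySpan<char> slices, so no per-line substring is ever allocated:

```csharp
using System;

public static class SpanDemo
{
    // Finds the longest line without allocating a substring per line:
    // Slice on a ReadOnlySpan<char> is just a pointer adjustment, not a copy.
    public static int LongestLineLength(string text)
    {
        ReadOnlySpan<char> remaining = text.AsSpan();
        int longest = 0;
        while (!remaining.IsEmpty)
        {
            int nl = remaining.IndexOf('\n');
            ReadOnlySpan<char> line = nl < 0 ? remaining : remaining.Slice(0, nl);
            if (line.Length > longest) longest = line.Length;
            remaining = nl < 0 ? ReadOnlySpan<char>.Empty : remaining.Slice(nl + 1);
        }
        return longest;
    }
}
```

The same pattern applies when handing float buffers to the ONNX runtime: slice an existing array rather than copying it into a new one.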

Async/Await: Ingesting files is I/O bound. Vector search is CPU bound. The UI must remain responsive. We use async/await to pipeline these operations. The user can continue to use the app while the background service ingests the file system.
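A minimal sketch of that ingestion loop (IngestionPipeline and the embedAsync delegate are hypothetical stand-ins for the real embedding call):

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class IngestionPipeline
{
    // Reads each file (I/O-bound) and embeds it (CPU-bound) without blocking
    // the calling thread, so a UI stays responsive during ingestion.
    public static async Task<float[][]> IngestAsync(
        string[] paths, Func<string, Task<float[]>> embedAsync)
    {
        var vectors = new float[paths.Length][];
        for (int i = 0; i < paths.Length; i++)
        {
            string text = await File.ReadAllTextAsync(paths[i]); // I/O-bound
            vectors[i] = await embedAsync(text);                 // CPU-bound
        }
        return vectors;
    }
}
```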

Dependency Injection (DI): We configure our pipeline using the .NET DI container. This allows us to inject different strategies for different environments (e.g., a mock vector store for unit tests, a real file-based store for production).

Summary of the Pipeline

The theoretical foundation of Offline RAG is the transformation of unstructured data into structured, searchable knowledge.

This diagram illustrates how an Offline RAG pipeline transforms unstructured data into structured, searchable knowledge, flowing from ingestion and embedding through the local vector store to retrieval and generation.

By mastering this theoretical flow, you are preparing to implement a system that is not just an AI, but a personalized knowledge engine, running entirely within the safety of your own C# application.

Basic Code Example

Here is a complete, self-contained C# example demonstrating the shape of a local offline RAG pipeline. It queries local text files, generates embeddings, and performs semantic search without internet access. To keep the example runnable without model files, the ONNX Runtime inference steps are simulated; comments mark where a real InferenceSession would be used.

// using Microsoft.ML.OnnxRuntime;         // needed only for a real ONNX
// using Microsoft.ML.OnnxRuntime.Tensors; // implementation (inference is simulated here)
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

namespace LocalOfflineRag
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("=== Local Offline RAG: Hello World ===\n");

            // 1. Setup: Define local file paths (Simulated local documents)
            string localDocsDir = Path.Combine(Path.GetTempPath(), "LocalRagDocs");
            Directory.CreateDirectory(localDocsDir);

            // Create dummy local files if they don't exist
            string doc1Path = Path.Combine(localDocsDir, "doc1.txt");
            string doc2Path = Path.Combine(localDocsDir, "doc2.txt");

            if (!File.Exists(doc1Path))
                File.WriteAllText(doc1Path, "The capital of France is Paris. It is known for the Eiffel Tower.");
            if (!File.Exists(doc2Path))
                File.WriteAllText(doc2Path, "The capital of Japan is Tokyo. It is known for its bustling streets and technology.");

            // 2. Load Local Documents
            var documents = LoadDocuments(localDocsDir);
            Console.WriteLine($"Loaded {documents.Count} documents from local storage.");

            // 3. Initialize Embedding Engine (Simulated ONNX Inference)
            // In a real scenario, you would load a specific ONNX model like 'all-MiniLM-L6-v2.onnx'
            var embeddingEngine = new LocalEmbeddingEngine();

            // 4. Generate Embeddings for Documents (Offline)
            // We convert text into vector representations
            var documentVectors = new List<VectorEntry>();
            foreach (var doc in documents)
            {
                var vector = embeddingEngine.GetEmbedding(doc.Content);
                documentVectors.Add(new VectorEntry { Text = doc.Content, Vector = vector });
                Console.WriteLine($"Generated embedding for: {doc.FileName}");
            }

            // 5. User Query
            string userQuery = "What is the capital of France?";
            Console.WriteLine($"\nUser Query: \"{userQuery}\"");

            // 6. Generate Query Embedding
            var queryVector = embeddingEngine.GetEmbedding(userQuery);

            // 7. Perform Semantic Search (Cosine Similarity)
            var relevantContext = SemanticSearch(documentVectors, queryVector, topK: 1);

            // 8. Construct Prompt for Local LLM
            string prompt = $@"
You are a helpful assistant. Answer the question based ONLY on the provided context.

Context:
{relevantContext}

Question: {userQuery}
Answer:";

            // 9. Run Local LLM Inference (Simulated ONNX Generation)
            // In a real scenario, you would use a decoder model like Phi-2 or Llama
            var llmEngine = new LocalLlmEngine();
            string response = llmEngine.Generate(prompt);

            Console.WriteLine($"\nLocal LLM Response:\n{response}");
        }

        // --- Helper Methods & Classes ---

        static List<Document> LoadDocuments(string directoryPath)
        {
            var docs = new List<Document>();
            foreach (var file in Directory.GetFiles(directoryPath))
            {
                docs.Add(new Document
                {
                    FileName = Path.GetFileName(file),
                    Content = File.ReadAllText(file)
                });
            }
            return docs;
        }

        static string SemanticSearch(List<VectorEntry> documentVectors, float[] queryVector, int topK)
        {
            // Calculate Cosine Similarity between query and each document
            var scores = documentVectors.Select(doc => new
            {
                Text = doc.Text,
                Score = CosineSimilarity(doc.Vector, queryVector)
            })
            .OrderByDescending(x => x.Score)
            .Take(topK);

            // Combine top results into context
            StringBuilder contextBuilder = new StringBuilder();
            foreach (var item in scores)
            {
                contextBuilder.AppendLine($"- {item.Text} (Relevance: {item.Score:P})");
            }

            return contextBuilder.ToString();
        }

        static float CosineSimilarity(float[] vecA, float[] vecB)
        {
            float dotProduct = 0f;
            float magnitudeA = 0f;
            float magnitudeB = 0f;

            for (int i = 0; i < vecA.Length; i++)
            {
                dotProduct += vecA[i] * vecB[i];
                magnitudeA += vecA[i] * vecA[i];
                magnitudeB += vecB[i] * vecB[i];
            }

            magnitudeA = (float)Math.Sqrt(magnitudeA);
            magnitudeB = (float)Math.Sqrt(magnitudeB);

            if (magnitudeA == 0 || magnitudeB == 0) return 0;
            return dotProduct / (magnitudeA * magnitudeB);
        }
    }

    // --- Simulated ONNX Embedding Engine ---
    // Real implementation would use: InferenceSession.Run() on an ONNX embedding model
    public class LocalEmbeddingEngine
    {
        // Simulating a 384-dimensional embedding model (e.g., all-MiniLM-L6-v2)
        private const int Dimension = 384;
        private Random _rng = new Random(); // re-seeded from the text hash in GetEmbedding

        public float[] GetEmbedding(string text)
        {
            // In a real ONNX implementation:
            // 1. Tokenize text
            // 2. Create InputTensor
            // 3. session.Run(inputs)
            // 4. Extract output tensor

            // SIMULATION: We generate a deterministic vector based on string hashing
            // to simulate semantic similarity for this demo.
            float[] vector = new float[Dimension];
            int hash = text.GetHashCode();

            // Fill vector with pseudo-random values based on hash
            _rng = new Random(hash); 
            for (int i = 0; i < Dimension; i++)
            {
                vector[i] = (float)_rng.NextDouble();
            }

            // Normalize vector (L2 norm)
            double sumSquares = 0;
            foreach (var val in vector) sumSquares += val * val;
            double norm = Math.Sqrt(sumSquares);
            for (int i = 0; i < Dimension; i++)
            {
                vector[i] = (float)(vector[i] / norm);
            }

            return vector;
        }
    }

    // --- Simulated ONNX LLM Engine ---
    // Real implementation would use: InferenceSession.Run() with past_key_values
    public class LocalLlmEngine
    {
        public string Generate(string prompt)
        {
            // In a real ONNX implementation:
            // 1. Tokenize prompt
            // 2. Initialize empty input_ids
            // 3. Loop: Run model, get logits, sample next token, append to input_ids
            // 4. Detokenize output

            // SIMULATION: Rule-based response for the demo context
            if (prompt.Contains("capital of France") || prompt.Contains("Paris"))
            {
                return "Based on the retrieved context, the capital of France is Paris.";
            }
            return "I do not have enough context to answer that question.";
        }
    }

    // --- Data Models ---
    public class Document
    {
        public string FileName { get; set; }
        public string Content { get; set; }
    }

    public class VectorEntry
    {
        public string Text { get; set; }
        public float[] Vector { get; set; }
    }
}

Line-by-Line Explanation

1. Setup and Document Loading

  • Main Method: The entry point of the application. It orchestrates the entire RAG pipeline.
  • localDocsDir: Defines a directory in the system's temporary folder to simulate a local file system. This ensures the code runs without needing specific folder permissions.
  • File I/O: The code checks if dummy text files exist. If not, it creates them. This mimics a user having a folder of local documents (e.g., PDFs converted to text) that they want to query.
  • LoadDocuments: A helper method that reads all files from the directory. It returns a list of Document objects containing the file name and content.

2. Embedding Generation (Offline)

  • LocalEmbeddingEngine: This class simulates an ONNX Runtime session for embeddings.
    • Real World Context: In a production app, you would instantiate InferenceSession with a .onnx file (e.g., text-embedding-v2.onnx). You would tokenize the input using a library like Microsoft.ML.Tokenizers and pass the token IDs to the ONNX model.
    • Simulation Logic: To make this "Hello World" runnable without external model files, we simulate vector generation. We hash the text string to seed a random number generator, giving each string a repeatable vector within the same run, and then normalize the vector (L2 norm) so that Cosine Similarity works correctly. Be aware that these hash-derived vectors carry no real semantics, so which document ranks highest is effectively arbitrary; the demo still answers correctly because the simulated LLM keys off keywords that appear in the question itself.
  • documentVectors: A list storing the text and its corresponding vector representation. This acts as our local vector database.

3. Query Processing

  • User Query: We define a static query: "What is the capital of France?".
  • Query Embedding: We pass the user's text through the same LocalEmbeddingEngine. Crucially, the query must be embedded using the exact same model (or simulation logic) as the documents to ensure vector space alignment.

4. Semantic Search (Vector Math)

  • SemanticSearch: This method implements the retrieval step.
  • CosineSimilarity: This function calculates the cosine of the angle between two vectors.
    • Why Cosine? It measures orientation rather than magnitude. In text embeddings, this captures semantic similarity regardless of document length.
    • Calculation: It computes the dot product divided by the product of the magnitudes (norms).
  • Ranking: The documents are ordered by their similarity score (descending). We take the topK (here, 1) most relevant documents.
  • Context Construction: The text of the top-ranked document is formatted into a string. This string will be injected into the LLM prompt.

5. Local LLM Inference

  • LocalLlmEngine: This class simulates a decoder-only model (like Phi-2 or Llama 2) running locally via ONNX.
    • Real World Context: A real implementation involves:
      1. Tokenization: Converting the prompt string into integer IDs.
      2. Inference Loop: Running the ONNX model in a loop to generate tokens one by one (autoregressive generation).
      3. Sampling: Using techniques like Greedy Sampling or Top-P (Nucleus) Sampling to select the next token.
      4. Detokenization: Converting generated IDs back to text.
  • Simulation Logic: For this demo, the engine checks if the prompt contains specific keywords (derived from our context). This proves that the RAG pipeline successfully passed the retrieved context to the LLM.
  • Prompt Engineering: The prompt is explicitly constructed with instructions ("Answer based ONLY on the provided context") and the [Context] block retrieved from the search step.

6. Execution Flow

  1. Load: Read local text files.
  2. Index: Convert files to vectors (offline).
  3. Query: Convert user question to vector.
  4. Retrieve: Find the document vector closest to the query vector.
  5. Generate: Feed the document text + question to the local LLM.
  6. Output: Display the grounded answer.

Common Pitfalls

  1. Model Mismatch in Embeddings:

    • The Mistake: Generating document embeddings with one model (e.g., OpenAI's text-embedding-ada-002) and query embeddings with a different local model (e.g., all-MiniLM-L6-v2).
    • Why it fails: Embedding models map text to unique vector spaces. Mixing models results in vectors that are mathematically incompatible. Cosine similarity will return random noise, not semantic similarity.
    • Solution: Ensure the exact same model architecture and weights are used for both indexing (offline) and querying (online).
  2. Context Window Overflow:

    • The Mistake: Retrieving too many documents (topK) or entire large files and pasting them all into the LLM prompt.
    • Why it fails: LLMs have a fixed context window (e.g., 4k or 128k tokens). Exceeding this causes truncation (losing information) or errors. Even if it fits, "lost in the middle" phenomena can degrade performance.
    • Solution: Implement strict chunking strategies (e.g., split text into 512-token chunks) and limit topK to 3-5 highly relevant chunks.
  3. Ignoring Tokenization Overhead:

    • The Mistake: Treating text length as character count rather than token count.
    • Why it fails: LLMs process tokens, not characters. A "Hello World" prompt might be 2 tokens in one model but 4 in another. If you calculate context limits based on characters, you will unexpectedly overflow the context window.
    • Solution: Always use the specific tokenizer associated with your ONNX model to count tokens before inference.
  4. File Locking in Local RAG:

    • The Mistake: Reading a local file stream while another process is writing to it (e.g., a user actively editing a document).
    • Why it fails: This throws IOException (file in use) or returns incomplete/partial data, leading to corrupted embeddings or hallucinations.
    • Solution: Implement file locking checks or copy files to a temporary processing directory before indexing. Use FileShare.ReadWrite carefully if real-time updates are required.
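The copy-to-temp mitigation from Pitfall 4 can be sketched as follows (SafeReader is a hypothetical helper name):

```csharp
using System;
using System.IO;

public static class SafeReader
{
    // Copy-then-read: index a snapshot in the temp directory so a concurrent
    // writer can never hand us a half-written document.
    public static string ReadForIndexing(string path)
    {
        string snapshot = Path.Combine(
            Path.GetTempPath(), Guid.NewGuid().ToString("N") + Path.GetExtension(path));
        File.Copy(path, snapshot, overwrite: true);
        try
        {
            return File.ReadAllText(snapshot);
        }
        finally
        {
            File.Delete(snapshot); // always clean up the snapshot
        }
    }
}
```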

Visualizing the Pipeline

This diagram shows how a pipeline that needs real-time updates can open the file stream with FileShare.ReadWrite, allowing simultaneous access by the writing application and the indexing process.

The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.





Code License: All code examples are released under the MIT License. Github repo.

Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.

All textual explanations, original diagrams, and illustrations are the intellectual property of the author. To support the maintenance of this site via AdSense, please read this content exclusively online. Copying, redistribution, or reproduction is strictly prohibited.