Chapter 10: Private RAG - Local Embeddings & Vector Stores
Theoretical Foundations
In the previous chapter, we explored the mechanics of LangGraph, where we learned about the Checkpointer as a state management mechanism. A Checkpointer acts like a meticulous librarian, saving the complete state of a conversation (graph state) after every step, allowing us to pause, resume, or rewind a session. We are now moving from the orchestration of agents to the infrastructure of data. In this chapter, we build a Private RAG (Retrieval-Augmented Generation) pipeline, which is essentially a self-contained, local library system.
Imagine a library that exists entirely within your computer. You have a massive collection of books (your documents), and you want to ask a question like, "What is the capital of France?" A traditional search engine would look for keywords, but a RAG system works like a knowledgeable librarian who understands the meaning of your question, walks to the correct shelf, pulls out the relevant book, and hands it to an AI to read and answer.
In a Private RAG pipeline, this entire process happens offline. No data leaves your machine. We achieve this by combining three pillars: 1. Local Embeddings: Converting text into numerical vectors (the "understanding" of meaning). 2. Local Vector Stores: Storing these vectors in a database optimized for mathematical similarity (the "shelving system"). 3. Local Generation: Using a local LLM (via Ollama) to synthesize the answer based on the retrieved context.
The Embedding Model: The Librarian's Filing System
At the heart of RAG is the Embedding Model. If you recall from our discussion on WebGPU Compute Shaders, we learned that GPUs excel at parallel matrix multiplication. Embedding models are essentially neural networks that perform massive matrix multiplications to compress text into a list of floating-point numbers (a vector).
The Analogy: Think of an embedding as a Hash Map in web development. In TypeScript, a Hash Map allows us to map a key (a string) to a value (a specific address). However, unlike a standard Hash Map where a slight change in the key (e.g., "Apple" vs. "Aple") results in a completely different hash, embeddings utilize Semantic Hashing.
- Standard Hash Map: `hash("cat") -> 0x5f4`, `hash("dog") -> 0x9a2`. The outputs are unrelated.
- Semantic Embedding: `embed("cat") -> [0.1, 0.8, -0.2, ...]`, `embed("dog") -> [0.12, 0.78, -0.15, ...]`. These vectors are mathematically close in high-dimensional space.
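This contrast can be sketched in a few lines of TypeScript. The vectors below are hand-picked toy values for illustration, not real model output:

```typescript
// Toy 3-dimensional "embeddings"; real models emit 384 to 1024 dimensions.
// These numbers are illustrative, not output from an actual embedding model.
const embeddings: Record<string, number[]> = {
  cat: [0.10, 0.80, -0.20],
  dog: [0.12, 0.78, -0.15],
  car: [-0.70, 0.05, 0.60],
};

// Euclidean distance: smaller means the meanings are closer.
function distance(a: number[], b: number[]): number {
  return Math.sqrt(a.reduce((sum, v, i) => sum + (v - b[i]) ** 2, 0));
}

const catDog = distance(embeddings.cat, embeddings.dog);
const catCar = distance(embeddings.cat, embeddings.car);
console.log(catDog < catCar); // true: "cat" sits far closer to "dog" than to "car"
```

A hash function would give no such guarantee: nothing about `0x5f4` and `0x9a2` tells you that cats and dogs are related concepts.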
In our local library, the embedding model acts as the Chief Librarian. When you feed it a document, it doesn't just catalog the title; it reads the book and assigns it a coordinate in a 3D (or 1024D) map based on its meaning. A book about "feline behavior" will be placed on the shelf right next to a book about "kittens," even if the word "cat" never appears in the text.
Why Local Embeddings?
Privacy is the primary driver. Sending proprietary documents to a cloud API (like OpenAI's) for embedding creates a data leakage vector. By running models like nomic-embed-text via Ollama, we utilize the same WebGPU acceleration discussed earlier to generate these vectors in real-time on the user's machine.
Vector Stores: The Semantic Shelf System
Once the Chief Librarian (Embedding Model) has assigned coordinates to our text, we need a place to store them. This is the Vector Store (e.g., ChromaDB, LanceDB).
The Analogy: Imagine a warehouse where books are not arranged by the Dewey Decimal System, but by their conceptual meaning. If you have a book on "Quantum Physics" and another on "Advanced Mechanics," they are physically placed next to each other, regardless of their publication date or author.
A Vector Store is not a standard SQL database. A SQL database is like a spreadsheet: exact matches only. If you query SELECT * FROM books WHERE title = 'Physics', you only get exact matches. A Vector Store, however, performs Approximate Nearest Neighbor (ANN) search.
The "Why" of ANN: In high-dimensional space (where embeddings live, often 768 or 1024 dimensions), calculating exact distances between every vector is computationally prohibitive (\(O(N^2)\)). ANN algorithms (like HNSW - Hierarchical Navigable Small World graphs) allow us to traverse the "shelf" efficiently. They build a graph structure where vectors are nodes, and edges connect similar vectors.
Visualizing the Vector Space: picture the vectors clustered in a local store: "Technology" and "Science" form distinct yet neighboring clusters, while "Cooking" sits isolated far from both.
The Upsert Operation: Dynamic Knowledge Management
In a static library, books are added once. In a digital RAG system, data is fluid. We need a mechanism to add new documents or update existing ones without rebuilding the entire index. This is the Upsert Operation.
The Analogy: Imagine a library card catalog. When a new edition of a book arrives, you don't throw away the whole cabinet. You find the card for the old edition, remove it, and insert the new one. If no card exists, you add a new one. This is an "Update or Insert" (Upsert) operation.
In vector databases, Upsert is critical for maintaining a Live Index. It takes a unique ID (like a file path or a hash of the content) and the corresponding vector. If the ID exists, the vector is updated (useful if the document changes). If not, it is inserted.
Why is this critical for Private RAG? In a local environment, documents might change frequently (e.g., a user editing a local markdown file). The Upsert operation ensures the vector store remains synchronized with the source of truth without the overhead of a full re-indexing, which is computationally expensive.
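The upsert contract can be sketched with a plain `Map` keyed by document ID. The class and method names here are illustrative; real stores such as ChromaDB or LanceDB expose their own upsert APIs:

```typescript
type VectorRecord = { id: string; text: string; embedding: number[] };

// A Map gives us "update or insert" for free: set() overwrites an
// existing key and adds a missing one, which is exactly the upsert contract.
class UpsertStore {
  private records = new Map<string, VectorRecord>();

  upsert(record: VectorRecord): void {
    this.records.set(record.id, record); // replaces the old vector if the id exists
  }

  get(id: string): VectorRecord | undefined {
    return this.records.get(id);
  }

  get size(): number {
    return this.records.size;
  }
}

const store = new UpsertStore();
store.upsert({ id: 'notes.md', text: 'v1', embedding: [0.1, 0.2] });
store.upsert({ id: 'notes.md', text: 'v2', embedding: [0.3, 0.4] }); // update, not a duplicate
console.log(store.size); // 1: the edited file replaced its old vector
```

Using a stable ID (file path or content hash) is the key design choice: it is what lets a re-run of the ingestion pipeline converge on one vector per document instead of accumulating duplicates.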
The Retrieval Mechanism: The Query Vector
When a user asks a question, the pipeline reverses the embedding process.
- Query Embedding: The user's question ("How do I optimize WebGPU shaders?") is passed through the same embedding model, generating a query vector.
- Similarity Search: The Vector Store calculates the distance (usually Cosine Similarity or Euclidean Distance) between the query vector and all stored document vectors.
- Top-K Retrieval: The system retrieves the \(K\) closest vectors (e.g., the top 3 most relevant documents).
The Math Behind It: Cosine Similarity measures the cosine of the angle between two vectors. If the angle is 0 degrees, the vectors are identical (similarity = 1). If they are 90 degrees apart, they are unrelated (similarity = 0). $$ \text{Cosine Similarity} = \frac{A \cdot B}{||A|| \times ||B||} $$ This calculation runs massively in parallel on the GPU, leveraging the WebGPU Compute Shaders we discussed earlier.
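A quick numeric check of the formula, with toy 2-D vectors chosen so the results are easy to verify by hand:

```typescript
// Cosine similarity: dot(A, B) / (||A|| * ||B||)
function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((sum, v, i) => sum + v * b[i], 0);
  const normA = Math.sqrt(a.reduce((sum, v) => sum + v * v, 0));
  const normB = Math.sqrt(b.reduce((sum, v) => sum + v * v, 0));
  return dot / (normA * normB);
}

console.log(cosine([1, 0], [1, 0])); // 1: identical direction
console.log(cosine([1, 0], [0, 1])); // 0: orthogonal, unrelated
console.log(cosine([1, 1], [2, 2])); // ~1: same direction; magnitude is ignored
```

The last line is the important property: `[1, 1]` and `[2, 2]` point the same way, so scaling a vector does not change its similarity. This is why cosine similarity is preferred over raw distance when documents vary in length.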
Integration: The RAG Chain
Finally, we stitch this together. The retrieved documents (context) are injected into the prompt sent to the local LLM (via Ollama).
The Web Development Analogy (Microservices): Think of the RAG pipeline as a microservices architecture: * Service A (Embedding): Takes raw text, returns vectors. (Stateless, compute-heavy). * Service B (Vector DB): Takes vectors, stores/retrieves them. (Stateful, I/O heavy). * Service C (LLM): Takes context + query, returns natural language. (Stateless, heavy compute).
The Checkpointer (from the previous chapter) manages the state of the entire workflow. If the LLM fails to generate a response, the Checkpointer allows us to resume exactly after the Vector DB retrieval step, avoiding the need to re-embed or re-search the documents.
Performance Optimization: WebGPU and Quantization
Running these models locally introduces latency challenges. Two techniques solve this:
- Quantization: We reduce the precision of the model weights (e.g., from 16-bit floating point to 4-bit integers). This is like compressing a high-res image to a web-optimized JPG. It drastically reduces memory usage and increases speed with minimal loss in accuracy.
- WebGPU Compute Shaders: As defined in our glossary, these are programs executed on the GPU. For vector stores, calculating the distance between a query vector and 10,000 stored vectors is a massive parallel task. WebGPU allows us to offload this from the CPU, making local RAG feel instantaneous.
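The quantization idea can be sketched with a naive symmetric 8-bit scheme. Real runtimes use more elaborate block-wise formats, so treat this purely as an illustration of the principle:

```typescript
// Naive symmetric int8 quantization: map floats in [-max, +max] to [-127, 127].
function quantize(weights: Float32Array): { q: Int8Array; scale: number } {
  let max = 0;
  for (const w of weights) max = Math.max(max, Math.abs(w));
  const scale = max / 127 || 1; // avoid division by zero for an all-zero tensor
  const q = new Int8Array(weights.length);
  for (let i = 0; i < weights.length; i++) {
    q[i] = Math.round(weights[i] / scale);
  }
  return { q, scale };
}

// Dequantize back to floats; a small rounding error is the price of 4x less memory.
function dequantize(q: Int8Array, scale: number): Float32Array {
  return Float32Array.from(q, (v) => v * scale);
}

const weights = new Float32Array([0.52, -1.3, 0.07, 0.9]);
const { q, scale } = quantize(weights);
const restored = dequantize(q, scale);
// Each int8 weight occupies 1 byte instead of 4; the restored values are
// close to the originals but not bit-identical.
```

Dropping from 16-bit floats to 4-bit integers follows the same principle with a much coarser grid (16 levels instead of 255), which is why 4-bit models trade a little accuracy for a large memory reduction.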
Summary of the Private RAG Flow
To visualize the end-to-end flow without code, consider this data lifecycle:
- Ingestion: User drops a PDF. -> Ollama Embedding Model (via WebGPU) -> Vector created.
- Storage: Vector + ID -> Upsert -> Local Vector Store (Chroma/Lance).
- Retrieval: User types query. -> Embedding Model -> Query Vector -> Similarity Search -> Top-K Context.
- Generation: Context + Query -> Ollama LLM -> Natural Language Response.
- Persistence: The state of this interaction is saved by the Checkpointer, allowing the user to ask follow-up questions ("What about the second point?") without losing the context of the retrieved documents.
This architecture ensures that your data never leaves your local network, providing the security of an air-gapped system with the intelligence of a cloud-based AI.
Basic Code Example
This "Hello World" example demonstrates a fully local, browser-based RAG pipeline. We will simulate a scenario where a user queries a small knowledge base (a few sentences about AI) to get context-aware answers. The entire process (embedding a query, searching a local vector store, and retrieving relevant context) happens in the client's browser, using Transformers.js for embeddings and a simple in-memory vector store for demonstration.
This approach ensures end-to-end privacy (no data leaves the device) and low latency, leveraging WebGPU acceleration where available.
Prerequisites
To run this example, you would typically use a modern bundler (like Vite or Next.js). For simplicity, we assume a standard web environment with TypeScript support.
// Import necessary libraries from Transformers.js
// Note: In a real project, you'd install this via npm: `npm install @xenova/transformers`
// A dynamic import() could alternatively be used to defer loading and handle browser compatibility.
import { pipeline, env } from '@xenova/transformers';
// Configure environment for local execution
// Allowing remote models lets the browser download the weights once; after
// they are cached, `allowRemoteModels = false` forces strict offline mode.
env.allowRemoteModels = true; // Set to false for strict offline mode after initial download
env.useBrowserCache = true;
The Code Example
/**
* ============================================================================
* PART 1: DATA STRUCTURES & STATE MANAGEMENT
* ============================================================================
*/
/**
* Represents a single vector embedding with its associated metadata.
* We use Immutable State Management: once created, the object properties are read-only.
*/
type DocumentChunk = {
id: string;
text: string;
embedding: number[]; // The high-dimensional vector
};
/**
* A simple in-memory vector store for demonstration.
* In production, this would be replaced by ChromaDB, LanceDB, or pgvector.
*/
class LocalVectorStore {
private documents: DocumentChunk[] = [];
/**
* Adds a document to the store.
* Returns a new array (immutability) rather than mutating the existing one.
*/
async addDocument(text: string, embedding: number[]): Promise<DocumentChunk[]> {
const newDoc: DocumentChunk = {
id: `doc_${Date.now()}_${Math.random().toString(36).slice(2, 11)}`,
text,
embedding,
};
// Create a new array to preserve immutability
this.documents = [...this.documents, newDoc];
return this.documents;
}
/**
* Performs a similarity search (Cosine Similarity).
* @param queryEmbedding - The vector representation of the user's query.
* @param topK - Number of top results to return.
* @returns The top K matching documents.
*/
async search(queryEmbedding: number[], topK: number = 1): Promise<DocumentChunk[]> {
// Calculate cosine similarity for each document
const scoredDocs = this.documents.map((doc) => {
const similarity = cosineSimilarity(queryEmbedding, doc.embedding);
return { ...doc, score: similarity };
});
// Sort by score descending and slice top K
return scoredDocs
.sort((a, b) => b.score - a.score)
.slice(0, topK)
.map(({ score, ...rest }) => rest); // Remove score from final output
}
}
/**
* ============================================================================
* PART 2: MATH UTILITIES (UNDER THE HOOD)
* ============================================================================
*/
/**
* Calculates the cosine similarity between two vectors.
* Formula: dot(A, B) / (||A|| * ||B||)
* Used to measure how "close" two vectors are in semantic space.
*/
function cosineSimilarity(vecA: number[], vecB: number[]): number {
if (vecA.length !== vecB.length) {
throw new Error('Vectors must be of the same dimension');
}
let dotProduct = 0;
let normA = 0;
let normB = 0;
for (let i = 0; i < vecA.length; i++) {
dotProduct += vecA[i] * vecB[i];
normA += vecA[i] * vecA[i];
normB += vecB[i] * vecB[i];
}
return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}
/**
* ============================================================================
* PART 3: MAIN APPLICATION LOGIC
* ============================================================================
*/
/**
* The main entry point of the RAG pipeline.
*/
async function runRagPipeline() {
console.log('Initializing Local RAG Pipeline...');
// 1. Initialize the Embedding Pipeline
// We use 'Xenova/all-MiniLM-L6-v2' - a small, fast model perfect for browser environments.
// This downloads the model weights (via WASM) on first run.
const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
// 2. Initialize the Local Vector Store
const vectorStore = new LocalVectorStore();
// 3. Ingest Knowledge Base (Simulated Database)
const knowledgeBase = [
"Artificial Intelligence is the simulation of human intelligence processes by machines.",
"Machine Learning is a subset of AI that focuses on training algorithms to learn patterns.",
"Deep Learning uses neural networks with many layers to analyze various factors of data.",
"The weather today is sunny and warm, perfect for hiking."
];
console.log('Ingesting documents into local vector store...');
// Generate embeddings for each document and store them
// Note: We process sequentially for simplicity, but parallel processing is faster.
for (const text of knowledgeBase) {
// The embedding model returns a Tensor; we extract the data array.
const output = await embedder(text, { pooling: 'mean', normalize: true });
const embedding = Array.from(output.data); // Convert TypedArray to standard Array
await vectorStore.addDocument(text, embedding);
}
// 4. User Query
const userQuery = "What is AI?";
console.log(`\nUser Query: "${userQuery}"`);
// 5. Generate Query Embedding
console.log('Generating query embedding...');
const queryOutput = await embedder(userQuery, { pooling: 'mean', normalize: true });
const queryEmbedding = Array.from(queryOutput.data);
// 6. Retrieve Relevant Context
console.log('Searching vector store...');
const relevantDocs = await vectorStore.search(queryEmbedding, 1);
// 7. Output Results
console.log('\nRetrieved Context:');
relevantDocs.forEach((doc, idx) => {
console.log(` [${idx + 1}] ${doc.text}`);
});
// In a real app, you would now pass this context to a local LLM (via Ollama)
// to generate the final answer.
console.log('\nPipeline Complete. Ready for LLM inference.');
}
// Execute the pipeline
// We wrap this in a try-catch to handle network/model loading errors gracefully.
runRagPipeline().catch(console.error);
Detailed Line-by-Line Explanation
- `type DocumentChunk`:
  - What: Defines the shape of our data.
  - Why: TypeScript type aliases ensure type safety. We store the raw text, a unique ID, and the numerical vector (embedding).
  - Immutability: Note that we do not expose methods to modify `text` or `embedding` after creation. This aligns with Immutable State Management principles, preventing accidental data corruption during concurrent search operations.
- `LocalVectorStore` class:
  - `private documents`: Encapsulation. The internal array is protected from direct external access.
  - `addDocument`: Instead of `push()` (which mutates the array), we use the spread operator `[...this.documents, newDoc]`. This creates a new array reference. In a React context, this triggers re-renders correctly; in Node.js, it prevents race conditions.
  - `search`: This is the retrieval engine. It maps over every document to calculate a similarity score.
- `cosineSimilarity`:
  - What: A geometric function to measure the angle between two vectors.
  - Why: Embeddings are points in high-dimensional space. Semantic similarity corresponds to geometric proximity. A score of 1.0 means identical meaning; 0.0 means unrelated.
  - Under the Hood: It calculates the dot product (sum of multiplied pairs) and divides by the product of the vectors' magnitudes (Euclidean lengths). This normalizes for document length, focusing purely on semantic direction.
- `pipeline('feature-extraction', ...)`:
  - What: Initializes the Transformers.js pipeline.
  - Why: This abstracts away the complexity of loading model weights, tokenizing input, and running inference.
  - WASM & WebGPU: Under the hood, Transformers.js uses WASM SIMD (Single Instruction, Multiple Data) instructions to accelerate matrix multiplications on the CPU. If WebGPU is enabled in the browser, it offloads these calculations to the GPU for massive parallelism.
- Ingestion Loop:
  - We iterate through our "database" (the `knowledgeBase` array).
  - `pooling: 'mean'`: Transformer models output a tensor of shape `[batch_size, sequence_length, hidden_dim]`. Mean pooling averages the token vectors to create a single vector representing the entire sentence.
  - `normalize: true`: L2-normalizes the vector (unit length). This is crucial because cosine similarity is mathematically equivalent to the dot product of L2-normalized vectors.
- Query Processing:
  - We repeat the embedding process for the user's query. The model must be the same one used for the documents; otherwise the vectors live in different mathematical spaces and similarity calculations will fail.
- Retrieval:
  - The `vectorStore.search` method compares the query vector against all stored document vectors.
  - It sorts the results by score (highest similarity first) and returns the top result.
Common Pitfalls
- Model Dimension Mismatch:
  - Issue: Using `all-MiniLM-L6-v2` for documents but `all-mpnet-base-v2` for the query.
  - Result: The vectors will have different dimensions (384 vs. 768), causing a runtime error in the cosine similarity function or returning garbage results.
  - Fix: Always use the exact same model architecture for both indexing and querying.
- Async/Await Loops in JavaScript:
  - Issue: Using `forEach` with async callbacks.
  - Example: `knowledgeBase.forEach(async (text) => { await embedder(text); })`
  - Result: The loop does not wait for the embeddings to finish. The code proceeds to the next step immediately, resulting in an empty vector store.
  - Fix: Use a standard `for...of` loop (as done in the example) or `Promise.all` if concurrency is desired.
- Vercel/Serverless Timeouts:
  - Issue: If you deploy this logic to a serverless function (such as Vercel Edge), downloading the model weights (often 50 MB+) on every cold start will exceed the timeout limit (usually 10 s).
  - Fix: For a SaaS app, pre-cache models in the build step or use a persistent server (such as a Docker container) for the embedding service. For a purely client-side app (like this example), the download happens in the user's browser, bypassing serverless limits.
- Hallucinated JSON in Dynamic Imports:
  - Issue: When dynamically importing model configurations or vector data from external JSON files, network errors can return HTML error pages instead of JSON.
  - Result: `JSON.parse()` throws a syntax error.
  - Fix: Always wrap dynamic imports or fetch calls in try-catch blocks and validate the response structure before parsing.
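The async-loop pitfall above is worth seeing in runnable form. The `embed` function here is a stand-in for the real Transformers.js call, returning a dummy one-element vector:

```typescript
// Stand-in for the real async embedding call (returns a dummy 1-D "vector").
const embed = async (text: string): Promise<number[]> => [text.length];

const texts = ['alpha', 'beta', 'gamma'];

// BROKEN: forEach fires the async callbacks and returns immediately,
// so any code after it runs while the results are still empty.
// texts.forEach(async (t) => { results.push(await embed(t)); });

// CORRECT (sequential): for...of awaits each iteration in turn.
async function embedSequential(): Promise<number[][]> {
  const results: number[][] = [];
  for (const t of texts) {
    results.push(await embed(t));
  }
  return results;
}

// CORRECT (concurrent): Promise.all starts every embedding at once
// and resolves only when all of them have finished.
async function embedConcurrent(): Promise<number[][]> {
  return Promise.all(texts.map((t) => embed(t)));
}

embedConcurrent().then((vectors) => console.log(vectors.length)); // 3
```

The sequential form keeps memory and compute pressure predictable; `Promise.all` is faster but can saturate the CPU/GPU if the batch is large, so pick based on your document count.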
Visualization of the Data Flow
The chapter continues with advanced code, exercises and solutions with analysis, you can find them on the ebook on Leanpub.com or Amazon
Code License: All code examples are released under the MIT License. Github repo.
Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.
All textual explanations, original diagrams, and illustrations are the intellectual property of the author. To support the maintenance of this site via AdSense, please read this content exclusively online. Copying, redistribution, or reproduction is strictly prohibited.