
Chapter 14: Chat with your Data - Building the Full RAG App

Theoretical Foundations

At its heart, Retrieval-Augmented Generation (RAG) is an architectural pattern designed to overcome the fundamental limitation of Large Language Models (LLMs): their static knowledge cutoff. While an LLM like GPT-4 possesses vast general knowledge, it cannot access real-time, private, or proprietary information unless that data is explicitly provided as context during the inference process. RAG bridges this gap by dynamically retrieving relevant information from a custom data source and injecting it into the prompt before the LLM generates a response.

Imagine a brilliant but amnesiac historian. You ask them a question about a specific, obscure event in your company's history. Without access to the company archives, they might hallucinate or provide a generic, incorrect answer. RAG acts as the historian's assistant who instantly fetches the relevant pages from the company archives and places them on the desk before the historian begins to speak. The historian (the LLM) uses these documents to formulate an accurate, context-aware response.

This process is not merely a keyword search. It is semantic search, meaning it finds information based on the conceptual meaning of the query, not just matching words. This is the "Intelligent" part of the app we are building.

The RAG Pipeline: A High-Level View

The RAG process can be broken down into two distinct phases:

  1. Indexing (Ingestion): This is the offline process of preparing your data. You take your raw documents (PDFs, text files, web pages), split them into manageable chunks, convert these chunks into numerical representations (embeddings), and store them in a vector database.
  2. Retrieval & Generation (Query Time): This is the online process. When a user asks a question, the system converts the question into an embedding, searches the vector database for the most semantically similar document chunks, retrieves them, and constructs a prompt that includes both the original question and the retrieved context. This augmented prompt is then sent to the LLM.

The "Why": Beyond Simple Prompting

Why not just dump all your data into the context window of a single prompt? The reasons are both technical and practical:

  • Context Window Limits: LLMs have finite context windows (e.g., 128k tokens for GPT-4 Turbo). You cannot fit an entire knowledge base, a product catalog, or years of documentation into a single prompt. RAG allows you to selectively pull in only the most relevant information.
  • Cost and Latency: Sending massive amounts of data with every query is expensive (more tokens = higher cost) and slow (more data to process = higher latency). RAG ensures you only send what's necessary.
  • Data Freshness: To update a fine-tuned model, you need to retrain it—a costly and time-consuming process. With RAG, you simply update your vector database with new documents, and the system immediately has access to the latest information.
  • Reducing Hallucinations: By grounding the LLM's response in retrieved, factual documents, you significantly reduce the likelihood of it making up information. The model is constrained to answer based on the provided context.

The Web Development Analogy: RAG as a Dynamic API Gateway

Think of a modern web application with a microservices architecture. You have multiple services: User Service, Product Service, Order Service. Instead of a client (e.g., a React frontend) making a separate, direct API call to each service for every piece of data, you often use an API Gateway or a Backend-for-Frontend (BFF).

The API Gateway acts as a central orchestrator:

  1. It receives a request from the client.
  2. It inspects the request and determines which underlying services are needed.
  3. It makes concurrent calls to those specific services (e.g., GET /users/123 and GET /products/456).
  4. It aggregates the responses, formats them, and sends a single, cohesive response back to the client.

RAG is the API Gateway for an LLM.

  • The Client: The user's query ("What is our company's policy on remote work?").
  • The API Gateway: The RAG orchestrator (e.g., a LangChain Runnable or LangGraph workflow).
  • The Microservices: The vector database (e.g., Pinecone, Weaviate) and the LLM.
  • The Aggregated Response: The final, context-aware answer generated by the LLM, grounded in the retrieved documents.

The RAG system doesn't just pass the query to the LLM. It first "calls" the vector database microservice to fetch relevant data, then "calls" the LLM microservice with an augmented prompt. This orchestration is what makes the system intelligent and context-aware.
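This orchestration can be sketched in a few lines of TypeScript. The service functions below are hypothetical stand-ins (vectorDbService, llmService, and ragGateway are illustrative names, not real APIs); a real implementation would call a vector database and an LLM provider.

```typescript
// The gateway analogy as code: a minimal orchestrator with mocked "services".
async function vectorDbService(query: string): Promise<string[]> {
  // Stand-in for a real similarity search against a vector database.
  return [`[doc] Remote work is allowed 3 days/week (matched: "${query}")`];
}

async function llmService(prompt: string): Promise<string> {
  // Stand-in for a real LLM API call; echoes the grounding context.
  return `Answer grounded in: ${prompt.split("\n")[0]}`;
}

async function ragGateway(userQuery: string): Promise<string> {
  const docs = await vectorDbService(userQuery);                // "call" the retrieval service
  const prompt = `${docs.join("\n")}\nQuestion: ${userQuery}`;  // augment the prompt
  return llmService(prompt);                                    // "call" the LLM service
}

ragGateway("What is our remote work policy?").then(console.log);
```

Note the ordering: retrieval completes before the LLM call begins, exactly as a gateway resolves upstream services before composing its response.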

Deep Dive: The Indexing Phase (The "Preparation")

This is where we prepare our "knowledge base" for fast, semantic retrieval. This phase is crucial and often the most complex part of building a robust RAG system.

1. Document Loading and Splitting

Raw data comes in many formats: PDFs, Markdown, HTML, JSON, etc. The first step is to load this data into a standardized Document object, which typically contains the text content and metadata (e.g., source URL, page number).

However, you cannot embed an entire 500-page book as a single unit. It's too large and loses semantic granularity. We must split (or "chunk") the documents into smaller, overlapping pieces.

Why Overlap? Imagine splitting a paragraph right in the middle of a sentence. The context is lost. Overlapping chunks ensure that a single sentence or idea isn't broken across two separate embeddings, preserving context. A common strategy is to use a RecursiveCharacterTextSplitter, which tries to split by paragraphs, then by sentences, then by words, ensuring chunks remain semantically coherent.
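The overlap idea can be illustrated with a plain-TypeScript sliding-window chunker. This is a simplification: LangChain's RecursiveCharacterTextSplitter is more sophisticated, preferring paragraph and sentence boundaries, but the overlap mechanic is the same.

```typescript
// Sliding-window chunker: each chunk shares `overlap` characters with the
// previous one, so text split at a boundary still appears whole in at
// least one chunk.
function chunkText(text: string, chunkSize: number, overlap: number): string[] {
  if (overlap >= chunkSize) throw new Error("overlap must be smaller than chunkSize");
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
    start += chunkSize - overlap; // step forward, keeping `overlap` chars of context
  }
  return chunks;
}

const doc = "RAG retrieves relevant chunks. Overlap preserves context at boundaries.";
console.log(chunkText(doc, 40, 10));
```

Each chunk's first 10 characters repeat the tail of the previous chunk, which is exactly the redundancy that keeps a sentence intact in at least one embedding.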

2. Embedding Generation

This is the most critical transformation: converting text chunks into vector embeddings.

An embedding is a list of floating-point numbers (a vector) that represents the semantic meaning of the text in a high-dimensional space. Texts with similar meanings will have vectors that are "close" to each other in this space, even if they don't share the same keywords.

Analogy: Hash Maps vs. Embeddings

In web development, a Hash Map (or Map in TypeScript) is a data structure that maps a unique key to a value. It's deterministic and exact. If you hash the string "apple", you will always get the same hash value. Searching for "apple" means computing its hash and looking up that exact key. This is lexical search.

An Embedding is like a "semantic hash map," but it's probabilistic and fuzzy. Instead of mapping a key to an exact value, it maps a concept to a region in a high-dimensional space.

  • Hash Map: hash("king") -> 0x7a3b1c and hash("queen") -> 0x9d2e4f. These are completely different and unrelated.
  • Embedding: embed("king") -> [0.12, -0.45, 0.88, ...] and embed("queen") -> [0.15, -0.42, 0.85, ...]. These vectors are numerically very close to each other.

When you perform a semantic search, you're not looking for an exact match. You're looking for vectors that are "close" to your query vector, typically measured by cosine similarity. This is how the system understands that "What is the company's policy on remote work?" is semantically similar to a document titled "Employee Handbook: Telecommuting Guidelines," even though the words are different.

3. Vector Store Integration

Once you have your embeddings, you need to store them somewhere for efficient retrieval. This is the role of a Vector Database (e.g., Pinecone, Weaviate, Chroma, or a local vector store like hnswlib).

A vector database is optimized for one primary operation: Approximate Nearest Neighbor (ANN) search. It organizes vectors in a way that allows for incredibly fast lookups of the "closest" vectors to a given query vector, even with millions or billions of entries. This is far more efficient than calculating the similarity against every single document in your collection (which would be a brute-force "k-nearest neighbors" search).
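To see what an ANN index is optimizing away, here is the brute-force alternative: score every stored vector against the query, sort, and take the top k. The Entry type and document IDs below are illustrative; this is roughly what an in-memory store does, and it is O(n) per query.

```typescript
// Brute-force k-nearest-neighbor search over a tiny in-memory "store".
type Entry = { id: string; vector: number[] };

function dot(a: number[], b: number[]): number {
  return a.reduce((sum, v, i) => sum + v * b[i], 0);
}

function topK(query: number[], store: Entry[], k: number): Entry[] {
  return [...store]
    .map((e) => ({
      entry: e,
      // Cosine similarity between the query and this stored vector.
      score: dot(query, e.vector) / (Math.hypot(...query) * Math.hypot(...e.vector)),
    }))
    .sort((x, y) => y.score - x.score) // highest similarity first
    .slice(0, k)
    .map((s) => s.entry);
}

const store: Entry[] = [
  { id: "telecommuting-policy", vector: [0.9, 0.1, 0.0] },
  { id: "parking-rules",        vector: [0.0, 0.2, 0.9] },
  { id: "wfh-equipment",        vector: [0.8, 0.3, 0.1] },
];

// A query vector "near" the remote-work documents:
console.log(topK([0.85, 0.2, 0.05], store, 2).map((e) => e.id));
```

An ANN index (HNSW, IVF, etc.) trades a small amount of recall for sub-linear lookup, which is why it scales to millions of vectors where this loop cannot.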

Deep Dive: The Retrieval & Generation Phase (The "Runtime")

This is the live interaction where the magic happens.

  1. Query Embedding: The user's question ("What is our remote work policy?") is sent to the same embedding model used during indexing, converting it into a query vector.
  2. Similarity Search: The query vector is sent to the vector store, which returns the top k most similar document chunks (e.g., k=4). These are the most relevant pieces of text from your knowledge base.
  3. Prompt Engineering (The Augmentation): This is where the retrieved context is injected. A well-crafted prompt template is essential. It's not just "stuffing" the context in. A typical RAG prompt looks like this:

    You are a helpful assistant. Use the following context to answer the user's question.
    If the answer is not in the context, say you don't know.
    
    Context:
    {retrieved_document_1}
    {retrieved_document_2}
    ...
    {retrieved_document_k}
    
    Question: {user_question}
    
  4. LLM Inference: The augmented prompt is sent to the LLM (e.g., gpt-4-turbo). The LLM processes the entire prompt, focusing heavily on the provided context to generate a grounded, accurate answer.
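The augmentation step above can be reproduced with plain string templating. This hypothetical buildRagPrompt helper is essentially what a prompt template library does for you:

```typescript
// Assemble the augmented prompt by hand: instructions, then retrieved
// context, then the user's question.
function buildRagPrompt(question: string, retrievedDocs: string[]): string {
  return [
    "You are a helpful assistant. Use the following context to answer the user's question.",
    "If the answer is not in the context, say you don't know.",
    "",
    "Context:",
    ...retrievedDocs,
    "",
    `Question: ${question}`,
  ].join("\n");
}

const prompt = buildRagPrompt("What is our remote work policy?", [
  "Employee Handbook: Telecommuting Guidelines. Employees may work remotely up to 3 days a week.",
]);
console.log(prompt);
```

The "If the answer is not in the context, say you don't know" line is doing real work: it gives the model a sanctioned escape hatch, which measurably reduces fabricated answers when retrieval comes back empty or off-topic.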

The Role of LangChain.js and the Vercel AI SDK

In this chapter, we will use LangChain.js as the orchestration framework. LangChain provides the abstractions and integrations to build this entire pipeline with composable, reusable components called Runnables.

Think of a Runnable as a pure function in a data processing pipeline. A RunnableSequence chains these together: document_loader | text_splitter | embeddings | vector_store.

For the user-facing application, we will integrate with the Vercel AI SDK. This is where our UIState definition becomes critical. The Vercel AI SDK separates the application's UI logic from the AI's logic. The AIState holds the conversation history and the results from our RAG pipeline (e.g., the retrieved documents, the final answer). The UIState then decides how to render this state. For example, it might render a "Thinking..." message while the RAG pipeline is running, and then switch to rendering the final answer and even the source documents used, creating a dynamic, streamable user experience.

Visualizing the RAG Pipeline

The entire flow can be visualized as a two-stage process. The first stage (Indexing) is typically done offline. The second stage (Query) is the real-time user interaction.

Diagram: RAG_Pipeline

Advanced Concept: Parallel Tool Execution in RAG

While the standard RAG pipeline is sequential, we can introduce Parallel Tool Execution for more complex scenarios. Imagine a user asks: "What was our revenue last quarter, and what was the main topic of the all-hands meeting?"

A naive RAG system might struggle. It might retrieve revenue documents and meeting notes, but the context could become noisy, and the LLM might confuse the two.

A more advanced agent-based approach, using a framework like LangGraph, would treat the vector store as a "tool." The agent would:

  1. Parse the user's query into two independent sub-questions.
  2. In parallel, call the vector store tool twice: once with the query "revenue last quarter" and once with "all-hands meeting topic."
  3. Receive both sets of retrieved documents.
  4. Synthesize a final answer using the combined context.

This is analogous to a web frontend making parallel fetch requests to different API endpoints to load a dashboard, rather than waiting for one request to finish before starting the next. It dramatically reduces latency and can improve answer quality by providing cleaner, more focused context for each part of a compound question.
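The parallel pattern can be sketched with a mocked retriever. Here mockRetrieve is a stand-in for a real vector-store call (e.g., similaritySearch); Promise.all fires both sub-queries concurrently, so total latency is the slowest call rather than the sum.

```typescript
type Doc = { content: string };

// Hypothetical stand-in for a real vector-store lookup.
async function mockRetrieve(subQuery: string): Promise<Doc[]> {
  const knowledgeBase: Record<string, Doc[]> = {
    "revenue last quarter": [{ content: "Q3 revenue was $4.2M, up 12% QoQ." }],
    "all-hands meeting topic": [{ content: "The all-hands focused on the EU launch." }],
  };
  return knowledgeBase[subQuery] ?? [];
}

async function answerCompoundQuestion(subQueries: string[]): Promise<Doc[]> {
  // Both retrievals run concurrently; results come back in input order.
  const results = await Promise.all(subQueries.map(mockRetrieve));
  return results.flat(); // combined, focused context for the final LLM call
}

answerCompoundQuestion(["revenue last quarter", "all-hands meeting topic"]).then(
  (docs) => console.log(docs.map((d) => d.content))
);
```

This mirrors a dashboard frontend issuing parallel fetch() calls: each sub-question gets a clean, focused retrieval, and only the synthesis step sees them together.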

Summary

RAG is not just a technique; it's a paradigm shift for building AI applications. It transforms LLMs from static encyclopedias into dynamic, knowledgeable assistants that can leverage your specific, up-to-the-minute data. By mastering the components of the RAG pipeline—loading, splitting, embedding, retrieving, and generating—you can build powerful, context-aware applications that are both accurate and scalable. The theoretical foundation lies in understanding the interplay between semantic search (embeddings) and generative power (LLMs), orchestrated by a robust framework like LangChain.js.

Basic Code Example

This example demonstrates a self-contained, "Hello World" style Retrieval-Augmented Generation (RAG) pipeline using Node.js and TypeScript. We will simulate a SaaS web application scenario where a user queries a chatbot to retrieve information from a small, hardcoded dataset of product documentation.

This code is designed to be run directly in a Node.js environment. It does not require a frontend or a database; it focuses purely on the logic of the RAG process: Ingestion (loading and splitting data), Retrieval (vector search), and Generation (answering based on context).

Prerequisites

To run this code, you will need:

  1. Node.js (v18 or higher).
  2. An OpenAI API key.
  3. The following dependencies installed:
     • @langchain/openai
     • @langchain/core
     • langchain

You can install them via:

npm install @langchain/openai @langchain/core langchain

The Code

/**
 * Minimal RAG Pipeline Example
 * 
 * This script demonstrates a basic Retrieval-Augmented Generation (RAG) flow.
 * It ingests a small set of documents, creates vector embeddings, retrieves
 * relevant context based on a user query, and generates an answer using OpenAI.
 * 
 * Usage: npx ts-node rag-minimal.ts
 */

import { OpenAIEmbeddings, ChatOpenAI } from "@langchain/openai";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { Document } from "@langchain/core/documents";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";
import { StringOutputParser } from "@langchain/core/output_parsers";

// 1. CONFIGURATION
// ---------------------------------------------------------
// In a real SaaS app, this would come from environment variables (.env file).
// We use a placeholder here for demonstration. 
// WARNING: Never hardcode API keys in production code.
const OPENAI_API_KEY = process.env.OPENAI_API_KEY || "sk-...";

// 2. DATA INGESTION (Simulated)
// ---------------------------------------------------------
/**
 * Represents a raw document in our knowledge base.
 * In a real app, this might come from a database, PDF, or Markdown file.
 */
const rawDocuments: string[] = [
    "The 'Acme Rocket' product has a top speed of Mach 5 and uses liquid hydrogen.",
    "The 'Acme Rocket' requires a launch pad clearance of at least 100 meters.",
    "The 'Acme Drone' model X1 can fly for 45 minutes on a single charge.",
    "The 'Acme Drone' is waterproof up to IP67 standards, suitable for rain.",
];

// 3. MAIN RAG FUNCTION
// ---------------------------------------------------------
async function runRAGPipeline(query: string) {
    console.log(`\n[1] Processing Query: "${query}"`);

    // A. Initialize the LLM (Large Language Model)
    // ---------------------------------------------------------
    // We use gpt-3.5-turbo for this example due to its speed and cost efficiency.
    const llm = new ChatOpenAI({
        apiKey: OPENAI_API_KEY,
        modelName: "gpt-3.5-turbo",
        temperature: 0, // Deterministic output for factual retrieval
    });

    // B. Initialize Embeddings and Vector Store
    // ---------------------------------------------------------
    // Embeddings convert text into numerical vectors (semantic meaning).
    // MemoryVectorStore simulates a vector database (like Pinecone or Weaviate) in RAM.
    const embeddings = new OpenAIEmbeddings({
        apiKey: OPENAI_API_KEY,
    });

    // C. Ingest Documents into the Vector Store
    // ---------------------------------------------------------
    // We convert raw strings into LangChain Document objects (with metadata).
    const documents = rawDocuments.map(
        (text, index) =>
            new Document({
                pageContent: text,
                metadata: { source: "product_manuals", id: index },
            })
    );

    // Create the vector store from documents. This performs the embedding generation.
    const vectorStore = await MemoryVectorStore.fromDocuments(
        documents,
        embeddings
    );

    console.log("[2] Documents embedded and stored in memory.");

    // D. Semantic Search (Retrieval)
    // ---------------------------------------------------------
    // We perform a similarity search to find the top 2 most relevant documents.
    // This is the "Retrieval" step of RAG.
    const retrievedDocs = await vectorStore.similaritySearch(query, 2);

    console.log("[3] Retrieved Context:");
    retrievedDocs.forEach((doc, i) => {
        console.log(`   ${i + 1}. ${doc.pageContent}`);
    });

    // E. Contextual Generation (Augmentation)
    // ---------------------------------------------------------
    // We construct a prompt that instructs the LLM to answer strictly 
    // based on the provided context.
    const prompt = ChatPromptTemplate.fromMessages([
        ["system", "You are a helpful assistant for Acme Corp. Answer the user's question strictly based on the following context:\n\n{context}"],
        ["human", "{input}"],
    ]);

    // The "Stuff" chain takes the context documents and "stuffs" them into the prompt.
    const documentChain = await createStuffDocumentsChain({
        llm,
        prompt,
        outputParser: new StringOutputParser(),
    });

    // F. Execute the Chain
    // ---------------------------------------------------------
    // We pass the user query and the retrieved documents to the chain.
    const response = await documentChain.invoke({
        input: query,
        context: retrievedDocs, // Injecting the retrieved data
    });

    return response;
}

// 4. EXECUTION
// ---------------------------------------------------------
(async () => {
    try {
        // Define a user query that requires external knowledge
        const userQuery = "What is the water resistance rating of the Acme Drone?";

        const answer = await runRAGPipeline(userQuery);

        console.log("\n[4] Final Answer Generated by LLM:");
        console.log("------------------------------------------------");
        console.log(answer);
        console.log("------------------------------------------------");

    } catch (error) {
        console.error("Pipeline failed:", error);
    }
})();

Detailed Line-by-Line Explanation

Here is the breakdown of the logic, numbered according to the execution flow.

1. Configuration and Imports

  • Lines 11-15: We import the necessary LangChain modules.
    • OpenAIEmbeddings: Responsible for converting text into vector representations (arrays of numbers) that capture semantic meaning.
    • ChatOpenAI: The interface for interacting with OpenAI's chat models (like GPT-3.5 or GPT-4).
    • MemoryVectorStore: A simple in-memory vector database. It stores vectors and allows for similarity searches. In a production SaaS app, you would swap this for PineconeVectorStore or WeaviateVectorStore.
    • Document: The standard data structure in LangChain, containing pageContent (the text) and metadata (tags, source info).
  • Lines 22-24: We set up the API key. In a real web app, this is loaded from environment variables (e.g., process.env.OPENAI_API_KEY) to ensure security.

2. Data Ingestion (Simulated)

  • Lines 29-35: We define rawDocuments. In a real scenario, this data would be fetched from a database or parsed from files (PDFs, Markdown). We simulate a knowledge base containing specs for "Acme Rocket" and "Acme Drone".
  • Why this matters: RAG relies on the quality of the source data. If the data isn't chunked or structured correctly, the retrieval step will fail to find relevant context.

3. The RAG Pipeline Function (runRAGPipeline)

This is the core asynchronous function that orchestrates the entire process.

  • Lines 43-47 (Initialize LLM):
    • We instantiate ChatOpenAI. Setting temperature: 0 is crucial for RAG applications. It forces the model to be factual and deterministic, reducing the likelihood of hallucinating answers when it doesn't know the answer.
  • Lines 51-53 (Initialize Embeddings):
    • We prepare the tool that will convert our text into vectors. This uses the text-embedding-ada-002 model by default, which is efficient for semantic search.
  • Lines 57-65 (Ingest Documents):
    • We map over our raw strings and convert them into Document objects.
    • Metadata: We attach source and id. This is vital for production apps to trace where an answer came from and to filter searches (e.g., "search only in the 'Billing' collection").
    • MemoryVectorStore.fromDocuments: This method performs two actions internally:
      1. Calls the embeddings model to generate vectors for every document.
      2. Stores those vectors in RAM.
  • Lines 69-76 (Semantic Search / Retrieval):
    • vectorStore.similaritySearch(query, 2): This is the "Retrieval" step. It converts the user's query into a vector and finds the documents with the closest vector matches (using cosine similarity).
    • We request the top 2 results. This context is passed to the LLM to help it answer.
  • Lines 80-87 (Prompt Engineering):
    • We define a ChatPromptTemplate. This is a structured message list.
    • ["system", "..."]: Tells the model its role ("Acme Corp assistant") and rules ("Answer strictly based on context"). The {context} placeholder inside the system message is where the chain injects the retrieved documents; createStuffDocumentsChain requires the prompt to expose this input variable.
    • ["human", "{input}"]: The user's actual question.
  • Lines 89-94 (The Chain):
    • createStuffDocumentsChain: This is a helper chain that takes the prompt, the LLM, and an output parser.
    • "Stuffing": The term "stuffing" refers to the method of taking all retrieved documents and concatenating them into the context window of the LLM. This is the simplest form of RAG. (Alternative methods include Map-Reduce or Refine for very long documents).
  • Lines 98-100 (Execution):
    • We invoke the chain with the query and the specific context (the retrieved documents). The LLM generates the final string response.

Visualization of the RAG Flow

The following diagram illustrates the data flow in this application.

This diagram illustrates the Retrieval-Augmented Generation (RAG) flow, where a Large Language Model (LLM) synthesizes a final string response by integrating context retrieved from an external knowledge base.

Common Pitfalls in JavaScript/TypeScript RAG Apps

When moving from this "Hello World" example to a production SaaS application, watch out for these specific issues:

  1. Vercel/AWS Lambda Timeouts (Async/Await Loops)

    • The Issue: Serverless functions (like Vercel Server Actions) often have strict timeouts (e.g., 10s or 30s). Embedding generation and LLM calls are network-heavy and slow.
    • The Fix: Never generate embeddings on the fly during a user request if the dataset is large. Perform ingestion asynchronously in the background (e.g., via a queue like BullMQ or Upstash QStash). The user request should only perform the Retrieval and Generation steps, which are faster.
  2. Hallucinated JSON / Output Parsing Errors

    • The Issue: When asking an LLM to return structured data (like JSON for a UI component), it might return valid JSON 99% of the time, but occasionally add trailing commas, comments, or text outside the code block.
    • The Fix: Never trust raw string output. Always use LangChain's JsonOutputParser or Zod-validated parsers. If using raw OpenAI function calling, ensure you handle the function_call object correctly rather than parsing the text message.
  3. Context Window Overflow

    • The Issue: The "Stuffing" method used in this example (createStuffDocumentsChain) concatenates all retrieved documents. If you retrieve 10 large documents, you might exceed the model's token limit (e.g., 4096 tokens for older GPT models), causing the API to throw an error or truncate data.
    • The Fix: Implement a "Map-Reduce" or "Refine" chain (available in LangChain) for large datasets. These chains process documents in batches and summarize them iteratively to fit within the token limit.
  4. Non-Blocking I/O and Event Loop Blocking

    • The Issue: While Node.js is non-blocking by default, heavy synchronous CPU operations (like parsing a massive 100MB PDF synchronously or calculating embeddings locally without offloading) will block the Event Loop. This freezes the entire server, preventing other users from accessing the app.
    • The Fix: Always use asynchronous file reading (fs.promises.readFile) and ensure heavy computations (like embedding generation) are handled via external APIs (OpenAI) or worker threads. Do not perform synchronous loops over large arrays on the main thread.
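As a concrete illustration of pitfall 2, here is a defensive, hand-rolled parser for LLM output that is supposed to be JSON. It is a sketch; in production you would prefer a structured output parser or Zod schema validation over trusting the raw string.

```typescript
// Defensive parsing of LLM output that should be JSON: tolerate a markdown
// code fence around the payload, and return null (never throw) on garbage.
function parseLlmJson<T>(raw: string): T | null {
  // Strip a ```json ... ``` fence if the model wrapped its answer in one.
  const fenced = raw.match(/```(?:json)?\s*([\s\S]*?)```/);
  const candidate = (fenced ? fenced[1] : raw).trim();
  try {
    return JSON.parse(candidate) as T;
  } catch {
    return null; // caller decides: retry the LLM call, or fall back
  }
}

console.log(parseLlmJson<{ answer: string }>('```json\n{"answer": "IP67"}\n```'));
console.log(parseLlmJson("Sure! Here is the JSON: oops")); // null, not a crash
```

Returning null instead of throwing pushes the failure decision to the caller, where a retry with a stricter prompt is usually the right move.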

The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.





Code License: All code examples are released under the MIT License. Github repo.

Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.

All textual explanations, original diagrams, and illustrations are the intellectual property of the author. To support the maintenance of this site via AdSense, please read this content exclusively online. Copying, redistribution, or reproduction is strictly prohibited.