Chapter 15: Multi-modal RAG - Indexing Images and Text
Theoretical Foundations
In our previous explorations of Retrieval-Augmented Generation (RAG), we operated almost exclusively within the domain of text. We learned to chunk documents, generate text embeddings, and retrieve relevant passages based on semantic similarity. This process is analogous to building a highly sophisticated, semantic search engine for a library of books. You can ask a question in natural language, and the system finds the most relevant paragraphs across thousands of volumes.
However, the real world is not a library of pure text. It is a rich, interconnected tapestry of images, diagrams, charts, and unstructured documents. A technical manual might contain schematic diagrams, a financial report will be dense with charts, and a marketing brief is incomplete without its visual assets. To build a truly intelligent enterprise search system, we must teach our models to "see" and understand these multimodal assets, not just read their alt-text or captions.
This is the fundamental goal of Multi-modal RAG. We are expanding our retrieval system from a single-dimensional library of text to a multi-dimensional, unified semantic space where an image of a "faulty capacitor" and a text document describing "capacitor failure modes" can exist as conceptually related entities.
The "Why" - The Business Imperative: Imagine an automotive company's internal knowledge base. A junior engineer encounters a strange corrosion pattern on a battery terminal. They could search for text: "corrosion on battery terminal." The results might be service bulletins and forum posts. But what if the most relevant information is a high-resolution photograph in a quality control report from three years ago? A text-only RAG system would likely miss this. A multi-modal system, however, can take the engineer's photograph as the query itself and retrieve the exact documentation containing similar visual evidence, even if the text descriptions are sparse. This dramatically accelerates problem-solving, reduces errors, and unlocks the value trapped in unstructured visual data.
From Textual Embeddings to Multimodal Embeddings: The Vector Bridge
The core mechanism enabling this is the Multimodal Embedding Model. To understand this, let's first recall what a standard text embedding is. As we discussed in Book 2, a text embedding model (like text-embedding-ada-002) is a neural network that has been trained to map text snippets into a high-dimensional vector space. The key property of this space is that semantically similar texts are located close to each other. The phrase "king" is near "queen," and "cat" is far from "automobile."
A multimodal embedding model, such as OpenAI's CLIP (Contrastive Language-Image Pre-training), is a more powerful, dual-encoder architecture. It is not one model, but two tightly coupled encoders — one for images, one for text — trained together on hundreds of millions of image-text pairs scraped from the web. The training process is brilliant in its simplicity:
- The model is shown an image and a batch of text captions.
- Its goal is to identify which caption belongs to which image.
- Through this contrastive learning process, the model learns to create a single, shared vector space.
In this shared space, the vector for a photograph of a golden retriever is mathematically close to the vector for the text "a golden retriever playing in a field." It is also close to the vector for the text "a happy dog." However, it is far from the vector for "a skyscraper at night."
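This "closeness" is typically measured with cosine similarity. The sketch below uses tiny, made-up 4-dimensional vectors (real CLIP embeddings have 512 or more dimensions) purely to illustrate the comparison:

```typescript
// Cosine similarity: 1.0 means identical direction, values near 0 mean unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical embeddings in a shared image-text space (illustrative values only).
const dogPhoto = [0.9, 0.1, 0.8, 0.2];      // photograph of a golden retriever
const dogCaption = [0.85, 0.15, 0.75, 0.25]; // "a golden retriever playing in a field"
const skyscraper = [0.1, 0.9, 0.05, 0.95];   // "a skyscraper at night"

console.log(cosineSimilarity(dogPhoto, dogCaption)); // high: close in the space
console.log(cosineSimilarity(dogPhoto, skyscraper)); // low: far apart
```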
Analogy: The Universal Hash Map for Concepts
Think of a standard text embedding as a hash map where the keys are words and the values are vectors. It's a one-to-one mapping: {"king": [0.1, 0.8, ...]}.
A multimodal embedding is like a universal hash map for concepts, where the keys can be any type of data—an image, a sentence, a paragraph, or even a sound wave—and the values are the same conceptual vectors. The hash function is the multimodal model itself. It doesn't care about the input format; it cares about the underlying meaning. This is the bridge that allows us to compare apples and oranges, or in our case, images and text.
The Multi-modal RAG Pipeline: A Step-by-Step Breakdown
Now, let's translate this theory into a practical pipeline. The process can be broken down into two primary phases: Indexing and Retrieval.
1. The Indexing Phase: Ingesting and Vectorizing Visual Data
This is where we prepare our data for retrieval. It's analogous to the ETL (Extract, Transform, Load) process in traditional data warehousing, but tailored for multimodal content.
- Step A: Ingestion and Preprocessing
- What: We start with our source documents, which are often PDFs containing both text and images, or directories of image files. The first step is to separate these elements. For a PDF, we need to parse it to extract text blocks and extract images as binary data (e.g., JPEGs, PNGs).
- Why: The embedding model requires a clean input. A blurry, oversized, or improperly formatted image will produce a poor-quality embedding. This is where Node.js libraries like `sharp` become essential. We can use `sharp` to:
  - Resize images to a standard dimension required by the model (e.g., 224x224 or 512x512 pixels).
  - Normalize pixel values (scaling them to a specific range like 0-1).
  - Convert formats to ensure compatibility.
- Under the Hood: This is a data cleaning and standardization step. Just as we would remove stop words or normalize casing for text, we must standardize our visual data. This ensures that the embedding model focuses on the semantic content of the image (the objects, scenes, and concepts) rather than incidental artifacts like size or format.
- Step B: Feature Extraction and Embedding Generation
- What: The preprocessed image (and any associated text) is fed into the multimodal embedding model. The model processes the input and outputs a high-dimensional vector (e.g., a list of 512 or 1024 floating-point numbers). This vector is the image's "embedding" or "fingerprint" in the unified semantic space.
- Why: This is the core transformation. We are converting an unstructured, high-dimensional pixel grid into a structured, lower-dimensional vector that captures its semantic meaning. This vector is what we will perform similarity searches on.
- Under the Hood: The image passes through the convolutional layers of the model's vision encoder (e.g., a Vision Transformer or a ResNet). These layers act as hierarchical feature extractors, identifying edges, textures, shapes, and eventually complex objects. The final output of the encoder, before the classification head, is the embedding vector. For text, a similar process occurs using a transformer-based text encoder. The magic of CLIP is that both encoders are trained to map to the same vector space.
- Step C: Metadata Structuring and Storage
- What: The generated vector is useless in isolation. It must be stored in a vector database (like Pinecone or Qdrant) alongside rich metadata. This metadata is the crucial link that provides context.
- Why: When we retrieve an image, we need to know what it is, where it came from, and how to display it. The metadata provides this essential information.
- Structure of Metadata: A well-structured metadata object for an indexed image might look like this:
```json
{
  "source_document": "Q3_Fault_Analysis_Report.pdf",
  "page_number": 42,
  "image_filename": "capacitor_C7_bulge.jpg",
  "generated_caption": "A close-up photograph of an electrolytic capacitor on a circuit board, showing significant bulging at the top, indicating failure.",
  "document_section": "Power Supply Unit Analysis",
  "ingestion_timestamp": "2023-10-27T10:00:00Z"
}
```
- Under the Hood: The vector database stores the vector and its associated metadata as a single record. When a query comes in, the database performs a similarity search (e.g., using cosine similarity or dot product) to find the vectors closest to the query vector. It then returns the top-k results, including the metadata, which your application can use to reconstruct the answer or display the image.
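In TypeScript, a record like this might be typed and assembled as follows. The field names mirror the example metadata; `buildImageRecord` is a hypothetical helper for illustration, not a specific database client API:

```typescript
// Strictly typed metadata mirroring the example JSON above.
interface IndexedImageMetadata {
  source_document: string;
  page_number: number;
  image_filename: string;
  generated_caption: string;
  document_section: string;
  ingestion_timestamp: string; // ISO 8601 string, not a Date object
}

// A vector DB record: the embedding plus its contextual metadata.
interface ImageRecord {
  id: string;
  vector: number[];
  metadata: IndexedImageMetadata;
}

// Hypothetical helper: stamps the ingestion time and bundles the record.
function buildImageRecord(
  id: string,
  vector: number[],
  meta: Omit<IndexedImageMetadata, 'ingestion_timestamp'>
): ImageRecord {
  return {
    id,
    vector,
    metadata: { ...meta, ingestion_timestamp: new Date().toISOString() }
  };
}
```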
2. The Retrieval Phase: Cross-Modal Search
This is the phase where the power of the unified semantic space becomes apparent.
- Step A: Query Processing
- What: A user provides a query. This query can be purely textual ("Show me examples of capacitor failure") or it can be an image (the photo the engineer took of the faulty component).
- Why: This flexibility is the primary user-facing benefit of multi-modal RAG. It meets the user where they are, allowing them to ask questions in the most natural format for their problem.
- Under the Hood: If the query is text, it is passed through the text encoder of the same multimodal model used for indexing to generate a query vector. If the query is an image, it is passed through the vision encoder. The key is that the exact same model and vector space must be used for both indexing and querying to ensure the vectors are comparable.
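The routing logic can be sketched as follows. `embedText` and `embedImage` are mocks standing in for the two encoders of one multimodal model; the essential point is that both return vectors in the same shared space:

```typescript
type Query = string | Buffer;

// Mock text encoder (stand-in for the model's text tower).
async function embedText(text: string): Promise<number[]> {
  return Array.from({ length: 4 }, (_, i) => Math.sin(text.length + i));
}

// Mock vision encoder (stand-in for the model's image tower).
async function embedImage(image: Buffer): Promise<number[]> {
  return Array.from({ length: 4 }, (_, i) => Math.sin(image.length + i));
}

// Route the query to the matching encoder; the outputs are comparable
// because both encoders map into the same shared vector space.
async function embedQuery(query: Query): Promise<number[]> {
  return typeof query === 'string' ? embedText(query) : embedImage(query);
}
```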
- Step B: Similarity Search
- What: The query vector is sent to the vector database. The database calculates the similarity between the query vector and all the indexed image vectors.
- Why: This is the core retrieval mechanism. It finds the "nearest neighbors" of the query in the semantic space. For a text query about "bulging capacitors," the database will return vectors for images that visually represent bulging capacitors, even if the extracted text from the source document didn't explicitly use those words.
- Under the Hood: The database uses efficient algorithms like HNSW (Hierarchical Navigable Small World) to perform an approximate nearest neighbor search. This allows it to find the most similar vectors in milliseconds, even across millions of indexed items.
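For intuition, here is the exact, brute-force search that HNSW approximates: score every indexed vector against the query and keep the top k. Real vector databases avoid this O(n) scan, but the ranking idea is the same:

```typescript
interface IndexedVector {
  id: string;
  vector: number[];
}

// Dot product as the similarity score (assumes normalized vectors).
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, v, i) => sum + v * b[i], 0);
}

// Exact nearest-neighbor search: scores the entire index, O(n).
function bruteForceTopK(query: number[], index: IndexedVector[], k: number): IndexedVector[] {
  return [...index]
    .sort((a, b) => dot(query, b.vector) - dot(query, a.vector))
    .slice(0, k);
}
```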
- Step C: Result Synthesis and Presentation
- What: The database returns the top-k results, each containing the similarity score and the associated metadata. Your application logic then uses this metadata to present the answer to the user.
- Why: The raw vector and similarity score are not user-friendly. The metadata allows you to display the actual image, the source document, the page number, and any generated captions, providing a rich, contextual answer.
- Example Workflow:
- User uploads an image of a faulty capacitor.
- Your Node.js backend generates a vector for this image.
- The vector is sent to Pinecone for a similarity search.
- Pinecone returns a result: `{ id: 'vec_123', score: 0.92, metadata: { ... } }`.
- Your application uses the `metadata.source_document` and `metadata.page_number` to generate a link to the original PDF, and the `metadata.image_filename` to display the stored image next to the user's query, alongside the generated caption.
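The synthesis step can be sketched like this, assuming the metadata fields from the workflow above (the field names are illustrative):

```typescript
interface SearchHit {
  id: string;
  score: number;
  metadata: {
    source_document: string;
    page_number: number;
    image_filename: string;
    generated_caption: string;
  };
}

// Turn a raw vector-DB hit into a user-facing answer with its source link.
function renderHit(hit: SearchHit): string {
  const m = hit.metadata;
  return `${m.generated_caption} (source: ${m.source_document}, p. ${m.page_number}; image: ${m.image_filename}; score: ${hit.score.toFixed(2)})`;
}
```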
Analogy: The Web Developer's Toolkit
To solidify this, let's map the multi-modal RAG pipeline to a familiar web development architecture.
- Multimodal Embedding Model (CLIP) = The Universal Transpiler (e.g., Babel/TypeScript Compiler): Just as a transpiler converts code from different languages (TypeScript, JSX, SCSS) into a single, standardized format (JavaScript), the embedding model converts diverse inputs (images, text) into a single, standardized format (vectors). This allows different "languages" of data to be understood by the same system.
- Vector Database (Pinecone/Qdrant) = A Semantic Key-Value Store (like a specialized Redis): Instead of storing `key: string -> value: data`, it stores `key: vector -> value: metadata`. The retrieval isn't based on exact string matching but on semantic similarity, much like how you'd use a hash map for fast lookups, but here the "hash" is a conceptual similarity.
- The Retrieval Process = API Gateway with a Service Mesh: When a request (query) comes in, the API gateway (your application logic) doesn't know which service (image or text) can handle it. It sends the request to the service mesh (the vector space), which intelligently routes it to the most relevant service (the closest image or text vector) based on the request's "intent" (its vector representation).
- Metadata = The `href` and `src` Attributes: The vector itself is just a coordinate. The metadata provides the actionable links: the `href` to the source document and the `src` to the image file, allowing the frontend to render a meaningful response.
By building this unified semantic space, we move from a text-centric search tool to a holistic information retrieval system that mirrors the way humans think—connecting ideas, images, and concepts based on their underlying meaning, not just their surface-level keywords. This is the foundational shift that multi-modal RAG enables for enterprise search and knowledge management.
Basic Code Example
In a multi-modal RAG system, the indexing phase is where we convert raw visual data into a format that a vector database can understand. Unlike text, an image is a grid of pixels. To perform semantic search, we must transform this grid into a high-dimensional vector (an array of numbers) that captures the semantic meaning of the image.
For this "Hello World" example, we will simulate a SaaS workflow where a user uploads an image to a web application. We will then:
1. Process the image (using sharp for resizing and standardization).
2. Generate a vector embedding (using a mock function that simulates an API call to a multimodal model like OpenAI's CLIP, e.g., the clip-vit-base-patch32 checkpoint).
3. Store the vector and metadata in a vector database (simulated here, but applicable to Pinecone or Qdrant).
This process creates the "index" that allows us to later query "Find images of a sunset" and retrieve the correct visual data.
Implementation: Image Indexing Pipeline
Below is a fully self-contained TypeScript script. It uses sharp for image manipulation and simulates the embedding generation step to keep the example runnable without external API keys.
// index.ts
import sharp from 'sharp';
import fs from 'fs';
import path from 'path';
/**
* @description Represents the structure of a vector database record.
* @template T - The type of the metadata (e.g., file path, user ID).
*/
interface VectorRecord<T> {
id: string;
vector: number[];
metadata: T;
}
/**
* @description Metadata specific to our image upload scenario.
*/
interface ImageMetadata {
filePath: string;
uploadedAt: Date;
originalName: string;
}
/**
* @description Mock interface for a Vector Database (e.g., Pinecone, Qdrant).
* In a real app, this would be an API client.
*/
interface VectorDatabase {
upsert(record: VectorRecord<ImageMetadata>): Promise<void>;
query(vector: number[], topK: number): Promise<VectorRecord<ImageMetadata>[]>;
}
/**
* @description A simulated Vector Database implementation.
* It stores data in memory using a Map.
*/
class MockVectorDB implements VectorDatabase {
private store: Map<string, VectorRecord<ImageMetadata>> = new Map();
async upsert(record: VectorRecord<ImageMetadata>): Promise<void> {
this.store.set(record.id, record);
console.log(`[DB] Indexed vector for ID: ${record.id}`);
}
async query(vector: number[], topK: number): Promise<VectorRecord<ImageMetadata>[]> {
// Calculate Euclidean distance (simplified for demo)
const scores = Array.from(this.store.values()).map((record) => {
const distance = Math.sqrt(
vector.reduce((sum, val, i) => sum + Math.pow(val - record.vector[i], 2), 0)
);
return { ...record, score: distance };
});
// Sort by lowest distance (closest match)
return scores.sort((a, b) => a.score - b.score).slice(0, topK);
}
}
/**
* @description Simulates a Multimodal Embedding Model (e.g., CLIP).
* In production, this would call an API (OpenAI, Replicate) or run a local ONNX model.
* It converts an image buffer into a 512-dimensional vector.
*/
async function generateImageEmbedding(imageBuffer: Buffer): Promise<number[]> {
console.log('[Model] Generating embedding from image buffer...');
// SIMULATION: In reality, we would pass the buffer to a model.
// Here, we generate a deterministic pseudo-random vector based on the buffer length
// to simulate a unique vector for every unique image size/content.
const seed = imageBuffer.length;
const vector: number[] = [];
// Generate a 512-dimension vector
for (let i = 0; i < 512; i++) {
// Pseudo-random generator based on seed
const x = Math.sin(seed + i) * 10000;
vector.push(x - Math.floor(x));
}
return vector;
}
/**
* @description Main processing pipeline.
* 1. Reads image from disk.
* 2. Preprocesses (resizes) using Sharp.
* 3. Generates embedding.
* 4. Upserts to Vector DB.
*/
async function indexImage(
imagePath: string,
db: VectorDatabase
): Promise<void> {
try {
console.log(`\n--- Starting Indexing for: ${path.basename(imagePath)} ---`);
// 1. Image Preprocessing (Sharp)
// We resize to ensure consistent input size for the model and reduce memory usage.
// 'fit: cover' maintains aspect ratio while filling the dimensions.
const processedImageBuffer = await sharp(imagePath)
.resize(224, 224, { fit: 'cover' })
.png() // Normalize format to PNG
.toBuffer();
console.log(`[Sharp] Image processed. Buffer size: ${processedImageBuffer.length} bytes`);
// 2. Embedding Generation
const embeddingVector = await generateImageEmbedding(processedImageBuffer);
// 3. Prepare Metadata
const metadata: ImageMetadata = {
filePath: imagePath,
uploadedAt: new Date(),
originalName: path.basename(imagePath)
};
// 4. Vector DB Upsert
// We use the file path as a unique ID for this demo.
const record: VectorRecord<ImageMetadata> = {
id: path.basename(imagePath, path.extname(imagePath)), // e.g., 'sunset'
vector: embeddingVector,
metadata: metadata
};
await db.upsert(record);
console.log(`--- Indexing Complete ---\n`);
} catch (error) {
console.error('Error during indexing pipeline:', error);
throw error;
}
}
/**
* @description Example usage: Simulating a SaaS app processing uploads.
*/
(async () => {
// Initialize DB
const vectorDB = new MockVectorDB();
// Create valid dummy image files for demonstration purposes.
// In a real app, these would come from an HTTP request (e.g., Multer in Express).
// Note: sharp cannot decode plain text, so we use sharp itself to generate
// small solid-color images that the pipeline can actually process.
const mockImages = [
{ name: 'sunset.jpg', background: { r: 255, g: 140, b: 40 } },
{ name: 'mountain.png', background: { r: 90, g: 110, b: 140 } },
{ name: 'city_night.jpg', background: { r: 15, g: 15, b: 60 } }
];
// Write dummy image files to disk
for (const img of mockImages) {
const image = sharp({
create: { width: 320, height: 240, channels: 3, background: img.background }
});
const buffer = img.name.endsWith('.png')
? await image.png().toBuffer()
: await image.jpeg().toBuffer();
fs.writeFileSync(img.name, buffer);
}
// Run the indexing pipeline for each image
for (const img of mockImages) {
await indexImage(img.name, vectorDB);
}
// Cleanup dummy files
mockImages.forEach(img => fs.unlinkSync(img.name));
})();
Line-by-Line Explanation
- Imports and Interfaces:
- We import `sharp` for high-performance image processing, `fs` for file system access, and `path` for handling file paths.
- `VectorRecord<T>` defines the schema for our database entries. It requires a unique `id`, the numerical `vector`, and flexible `metadata`.
- `ImageMetadata` specifies what contextual data we store with the image (useful for displaying results in a UI later).
- The Mock Vector Database (`MockVectorDB`):
- Why: To make this code runnable without signing up for Pinecone or Qdrant.
- How: It uses a JavaScript `Map` to store data in memory.
- `upsert`: Adds a vector record to the map.
- `query`: Calculates the Euclidean distance between a query vector and all stored vectors. This is a simplified, exact version of the search that vector databases approximate with Approximate Nearest Neighbor (ANN) algorithms.
- Embedding Simulation (`generateImageEmbedding`):
- Why: Real multimodal models (like CLIP) are large and require API keys or heavy local dependencies.
- How: We simulate the model's behavior. A real model takes an image and outputs an array of 512 or 1024 floating-point numbers. Our mock generates a deterministic array based on the image buffer length to ensure that the same image produces the same vector every time.
- The Pipeline (`indexImage`):
- Step 1 (Preprocessing): `sharp(imagePath).resize(224, 224)...` Neural networks expect inputs of specific dimensions (usually square, e.g., 224x224 pixels). We resize the image to standardize it. We also convert it to a Buffer to pass it to the model.
- Step 2 (Embedding): We await the result of our mock model function.
- Step 3 (Upsert): We construct the final `VectorRecord` and send it to our database. In a production app, this is where you would handle batching (processing multiple images at once) to optimize network latency.
- Execution Flow:
- We create dummy files to simulate a user upload.
- We loop through them, running the `indexImage` pipeline for each.
- We clean up the file system afterward to keep the script clean.
Graph State Visualization
In a production environment (like a LangGraph application), the image processing pipeline is often a node in a larger graph. The Graph State would carry the image buffer, the generated vector, and metadata through the workflow.
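As a rough sketch, assuming a LangGraph-style pattern where each node returns a partial state update (the field names are illustrative, not LangGraph's actual API):

```typescript
// The state carried between graph nodes during image indexing.
interface ImageIndexingState {
  imageBuffer?: Buffer;
  embedding?: number[];
  metadata?: { filePath: string; originalName: string };
  error?: string;
}

// An embedding node: reads the buffer from state, writes the vector back.
function embeddingNode(state: ImageIndexingState): Partial<ImageIndexingState> {
  if (!state.imageBuffer) {
    return { error: 'embeddingNode: no image buffer in state' };
  }
  // Mock vector; a real node would call the multimodal model here.
  return { embedding: [0.1, 0.2, 0.3] };
}
```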
Common Pitfalls in JS/TS Multimodal RAG
- Vercel/AWS Lambda Timeouts:
- Issue: Image processing and model inference are CPU-intensive. Serverless functions (like Vercel Edge or AWS Lambda) have strict timeouts (e.g., 10 seconds).
- Solution: Do not process images synchronously in API routes. Offload indexing to a background job queue (e.g., BullMQ, Inngest) or a dedicated worker process. Use `sharp` with `sharp.cache({ files: 0 })` to manage memory usage.
- Async/Await Loops (The "Waterfall" Trap):
- Issue: Using `await` inside a `for...of` loop processes images one by one; if you have 100 images, this takes 100x the time of a single image. (Inside a `forEach` callback, the promises are not awaited at all, which is a different bug.)
- Solution: Use `Promise.all` for concurrency (if your API rate limits allow) or a queue system for controlled concurrency.
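The trap and its fix can be sketched as follows. `indexOne` is a hypothetical stand-in for a real per-image indexing call:

```typescript
// Hypothetical per-image indexing call (simulated I/O latency).
async function indexOne(name: string): Promise<string> {
  await new Promise((resolve) => setTimeout(resolve, 5));
  return `indexed:${name}`;
}

// Bad: sequential. Total time grows linearly with the number of images.
async function indexSequential(names: string[]): Promise<string[]> {
  const results: string[] = [];
  for (const name of names) {
    results.push(await indexOne(name)); // each call waits for the previous one
  }
  return results;
}

// Better: concurrent. All calls run in parallel (mind your API rate limits).
async function indexConcurrent(names: string[]): Promise<string[]> {
  return Promise.all(names.map((name) => indexOne(name)));
}
```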
- Hallucinated JSON in Metadata:
- Issue: When storing metadata in vector databases (especially Pinecone), if you try to store complex objects or non-stringified JSON, the database client might throw errors or silently fail.
- Solution: Always strictly type your metadata interface (as shown in `ImageMetadata`). Ensure dates are converted to strings (ISO 8601) before storage, as JSON does not support native Date objects.
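A small sketch of that date conversion, using the `ImageMetadata` shape from the example code (the flattened output shape is an assumption about what a JSON-only metadata store accepts):

```typescript
interface ImageMetadata {
  filePath: string;
  uploadedAt: Date;
  originalName: string;
}

// Flatten metadata into JSON-safe string primitives before upserting.
function toStorableMetadata(meta: ImageMetadata): Record<string, string> {
  return {
    filePath: meta.filePath,
    uploadedAt: meta.uploadedAt.toISOString(), // Date -> ISO 8601 string
    originalName: meta.originalName
  };
}
```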
- Image Format & Buffer Handling:
- Issue: Passing raw file streams or incorrect MIME types to `sharp` or the embedding model can cause crashes.
- Solution: Always convert uploads to a Buffer using `.toBuffer()` before processing. Explicitly handle formats (e.g., `.png()` or `.jpeg()`) to ensure the model receives consistent data.
The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.
Code License: All code examples are released under the MIT License.
Content Copyright: Copyright © 2026 Edgar Milvus. All rights reserved.