Chapter 9: Query Expansion & Hypothetical Document Embeddings (HyDE)
Theoretical Foundations
At its heart, a standard Retrieval-Augmented Generation (RAG) system operates on a principle of semantic similarity. When a user asks a question, we convert that question into a Query Vector, a numerical representation capturing its meaning. We then search our vector database for the document vectors that are "closest" to this query vector. This process, as established in previous chapters, relies on the mathematical proximity of vectors in a high-dimensional space to find relevant context.
However, this approach has a fundamental limitation: semantic ambiguity and the "single point of failure." A user's query is a single, often brief, representation of a complex intent. The resulting query vector is a single point in that high-dimensional space. If this point doesn't perfectly align with the region where the most relevant document vectors reside, the retrieval will fail, or it will retrieve documents that are only tangentially related.
This is analogous to searching for a file on your computer using a single, overly specific keyword. If you search for "quarterly_report_final_v3.pdf", you might find it instantly. But if you search for "financial performance summary", you might miss the file named "Q3_2024_Profit_Analysis.docx" because the embedding model, despite its sophistication, might not map these semantically similar concepts to exactly the same point in space. The original query vector is a high-stakes bet on a single interpretation of the user's intent.
Query Expansion and Hypothetical Document Embeddings (HyDE) are two advanced techniques designed to mitigate this risk. They don't replace the core vector search; they enrich the input to make it more robust and representative of the true user intent.
Query Expansion: Casting a Wider, More Intelligent Net
What it is: Query Expansion is the process of algorithmically generating multiple variations or interpretations of the user's original query. Instead of creating a single query vector, we create several. We then perform a vector search for each of these variations and combine the results. This process casts a wider net, increasing the probability of capturing relevant documents that a single, literal query might have missed.
The "Why": Overcoming Lexical and Semantic Gaps
Imagine you're a librarian. A student asks, "I need materials on the French Revolution." A naive search might only find books with "French Revolution" in the title. But a good librarian knows that this topic also involves concepts like "Bastille," "Marie Antoinette," "Jacobins," "guillotine," and "Enlightenment." Query Expansion is the process of the librarian proactively thinking of these related concepts and searching for them on the student's behalf.
In the context of vector embeddings, this addresses two key issues:
- Semantic Drift: The vector for "French Revolution" might be very close to "18th-century European history," but perhaps not close enough to a document whose primary topic is "The Reign of Terror" if that document's vector was generated with a different emphasis.
- Terminology Mismatch: The user might use informal language ("What caused the uprising in France in the late 1700s?"), while the source documents use formal terminology ("Causes of the French Revolution"). A single vector might not bridge this gap effectively.
The "How": Multiple Strategies
Query Expansion isn't just about adding random words. It's a strategic process. Common methods include:
- Pseudo-Relevance Feedback (PRF): This is a classic IR technique adapted for RAG. We perform an initial search with the raw query vector. We then assume the top k results are relevant. We extract key terms or phrases from these retrieved documents and use them to augment the original query before re-embedding and searching again. This is like doing a quick "scout" search to find the right vocabulary before the main expedition.
- LLM-Generated Variations: A more modern and powerful approach is to use a Large Language Model (LLM) itself to generate semantically similar but syntactically diverse query variations. We can prompt the LLM with: "Generate 3 alternative phrasings for the following question that capture the same intent: [User's Question]". This leverages the LLM's vast knowledge of language to create a more robust set of query vectors. For example:
- Original: "What are the primary benefits of using a vector database?"
- Variation 1: "Why should a company choose a vector database for their search needs?"
- Variation 2: "List the advantages of vector-based indexing over traditional methods."
- Variation 3: "How does a vector database improve semantic search performance?"
- Hypothetical Document Embeddings (HyDE): This is a more sophisticated form of query expansion. Instead of just generating alternative queries, we ask the LLM to generate a hypothetical answer or a relevant document snippet. We then embed this hypothetical document and use it as the query vector.
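The LLM-generated-variations strategy can be sketched in a few lines. The prompt wording and the numbered-list reply format below are assumptions; the `callLLM` stub stands in for a real client (OpenAI, Cohere, a self-hosted model, etc.):

```typescript
// Sketch: building an expansion prompt and parsing the LLM's reply into
// query variations. The prompt text and reply format are assumptions.

const buildExpansionPrompt = (question: string, n = 3): string =>
  `Generate ${n} alternative phrasings for the following question ` +
  `that capture the same intent:\n${question}`;

// Parse a reply like "1. ...\n2. ...\n3. ..." into individual variations.
const parseVariations = (reply: string): string[] =>
  reply
    .split('\n')
    .map(line => line.replace(/^\s*\d+[.)]\s*/, '').trim())
    .filter(line => line.length > 0);

// Stand-in for a real LLM call, returning a canned numbered list.
const callLLM = async (_prompt: string): Promise<string> =>
  '1. Why should a company choose a vector database for their search needs?\n' +
  '2. List the advantages of vector-based indexing over traditional methods.\n' +
  '3. How does a vector database improve semantic search performance?';

async function expandQuery(question: string): Promise<string[]> {
  const reply = await callLLM(buildExpansionPrompt(question));
  // Always keep the original query alongside its variations.
  return [question, ...parseVariations(reply)];
}
```

Searching with every returned string (the original plus its variations) and merging the hits is what "casts the wider net" described above.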
Hypothetical Document Embeddings (HyDE): Answering Before You Search
What it is: HyDE flips the script on traditional retrieval. Instead of embedding the question to find an answer, you first ask an LLM to generate a hypothetical answer based on its internal knowledge (without access to your specific documents). You then embed this generated answer and use that embedding to search your vector database.
The "Why": Bridging the Question-Answer Representation Gap
The core problem HyDE solves is the fundamental difference in representation between a question and an answer.
- A question is often phrased as an inquiry, a request for information. It might be incomplete or speculative. (e.g., "What is the effect of X on Y?")
- An answer is declarative, factual, and information-dense. It states facts, describes processes, and provides explanations. (e.g., "X inhibits the function of Y by binding to its active site, which leads to...")
In the vector space, the embedding for a question and the embedding for a relevant answer might be quite far apart, even though they are semantically linked. A vector search based on the question might retrieve other questions or high-level conceptual documents, but not the dense, factual paragraphs containing the specific answer.
HyDE bridges this gap by creating a "pseudo-answer" vector. It generates a document that looks like the kind of document that would contain the answer. The embedding of this hypothetical document is therefore much closer in the vector space to the embeddings of your actual, real-world documents that contain the answers.
Analogy: The Expert Consultant
Imagine you need to solve a complex engineering problem. You have two options:
1. Standard RAG: You describe the problem to a search engine (the query vector). The search engine looks for documents that discuss similar problems.
2. HyDE: You first ask a brilliant consultant (the LLM) to draft a hypothetical solution based on their general expertise. This draft might not be 100% accurate for your specific context, but it's structured like a real solution. You then take this draft and search your company's internal project archives for documents that are stylistically and structurally similar to this draft. This is far more likely to surface the specific, detailed reports and blueprints you need than just searching for the problem statement.
Visualizing the Vector Space
To understand why these techniques work, we must visualize the high-dimensional vector space. Let's consider a simplified 2D representation of the semantic space for documents about a company's technology stack.
In this simplified picture:
- The Original Query Vector (Q1) is a general question. It's semantically close to introductory documents (D2) and process documents (D4), but it might miss the highly technical deployment manifest (D3) because the vector for "deploy the API" is not as close to the vector for a Kubernetes YAML file as it could be.
- Query Expansion (Q2, Q3) creates new vectors that probe different parts of the document space. Q3 ("API production rollout") is now much closer to the "API Gateway Config" (D1). This increases the chance of retrieving relevant technical documents.
- HyDE (HyDE_Q) generates a hypothetical answer about Kubernetes deployment. This vector is now located very close to the actual "K8s Deployment Manifest" (D3), which is the most precise and valuable document for the user's intent. It successfully bridges the gap between the question and the highly specific answer format.
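The geometry can be made concrete with toy 2D vectors. The coordinates below are invented purely for illustration (real embeddings have hundreds of dimensions), but they show why the HyDE vector lands closer to the target document than the raw query vector:

```typescript
// Toy 2D illustration of the vector-space picture above. Coordinates are
// invented for illustration only; real embeddings are high-dimensional.
const cosine = (a: number[], b: number[]): number => {
  const dot = a.reduce((sum, v, i) => sum + v * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
};

const q1 = [1.0, 0.2];    // original query: "how do we deploy the API?"
const hydeQ = [0.3, 1.0]; // hypothetical answer, written like a deployment doc
const d3 = [0.25, 0.97];  // the K8s Deployment Manifest (the ideal document)

console.log(cosine(q1, d3).toFixed(2));    // moderate similarity
console.log(cosine(hydeQ, d3).toFixed(2)); // near 1: HyDE_Q sits next to D3
```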
Integration in a JavaScript RAG Pipeline: The Asynchronous Workflow
In a Node.js environment, these techniques are not just theoretical; they are operational workflows that leverage the non-blocking I/O model. The process is inherently asynchronous and parallelizable.
Standard RAG Pipeline (Recap):
1. User sends a query.
2. The server (e.g., Express.js) receives the query.
3. An embedding model (e.g., via a library like onnxruntime-node or an API call) converts the query into a single vector. This is a CPU-bound or network-bound operation.
4. The server makes a non-blocking call to the vector database (e.g., Pinecone) to perform a similarity search with this single vector.
5. The database returns the top k documents.
6. These documents are passed to an LLM to generate the final answer.
Enhanced Pipeline with Query Expansion & HyDE:
This pipeline introduces more steps, but they can be managed efficiently with promises and async/await.
- Input: User sends a query: "What is the process for escalating a critical bug in production?"
- Expansion/HyDE Generation (Parallelizable):
- For Query Expansion: We can use an LLM call to generate variations. This is an I/O operation (network request to an LLM API).
- For HyDE: We use an LLM call to generate a hypothetical answer. This is also an I/O operation.
- Crucially, these calls can be made in parallel. We don't need to wait for the expansion variations to be generated before generating the HyDE answer. We can fire off both requests simultaneously.
- Embedding (Parallelizable):
- Once we have the original query, the expansion variations, and the HyDE hypothetical document, we embed them all. If we're using a local embedding model, this can be batched. If we're using an API, we can send them in a single batch request or make concurrent requests. This is a key advantage of non-blocking I/O; we don't block the main thread while waiting for the embeddings.
- Vector Search (Parallelizable):
- We now have multiple query vectors. We can perform multiple, concurrent vector searches against our database. For example, we might search with the HyDE vector and the best expansion vector simultaneously. Again, these are non-blocking I/O operations.
- Result Aggregation:
- The server receives results from multiple searches. These results need to be combined. A common strategy is to take the union of all retrieved documents and then re-rank them, for example, using a score based on the similarity from each vector search or using a cross-encoder for more precise scoring.
- Context Synthesis & Generation:
- The final, aggregated, and re-ranked list of documents is passed to the LLM for answer generation, along with the original user query.
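The fan-out/fan-in shape of this pipeline can be sketched with Promise.all. Every service function below is a deterministic stub standing in for a real LLM, embedding, or vector-database client; only the concurrency structure and the union-and-re-rank aggregation are the point:

```typescript
// Sketch of the enhanced pipeline's concurrency shape. All services are stubs.
type Doc = { id: string; content: string; score: number };

const generateVariations = async (q: string): Promise<string[]> => [q + " (rephrased)"];
const generateHydeAnswer = async (q: string): Promise<string> => "Hypothetical answer to: " + q;
// Stub "embedding": one number per text, derived from its length.
const embed = async (texts: string[]): Promise<number[][]> => texts.map(t => [t.length]);
// Stub "search": returns one deterministic hit per query vector.
const search = async (v: number[]): Promise<Doc[]> => [
  { id: "d" + (v[0] % 5), content: "...", score: (v[0] % 10) / 10 },
];

async function enhancedRetrieve(query: string): Promise<Doc[]> {
  // Fire expansion and HyDE generation concurrently (both are I/O-bound).
  const [variations, hydeDoc] = await Promise.all([
    generateVariations(query),
    generateHydeAnswer(query),
  ]);
  // Embed everything in one batch: original + variations + HyDE document.
  const vectors = await embed([query, ...variations, hydeDoc]);
  // One concurrent vector search per query vector.
  const resultSets = await Promise.all(vectors.map(v => search(v)));
  // Aggregate: union by id, keep each document's best score, then re-rank.
  const best = new Map<string, Doc>();
  for (const doc of resultSets.flat()) {
    const prev = best.get(doc.id);
    if (!prev || doc.score > prev.score) best.set(doc.id, doc);
  }
  return [...best.values()].sort((a, b) => b.score - a.score);
}
```

In production, the final re-ranking step would typically use the real similarity scores returned by the database, or a cross-encoder for more precise scoring, rather than the stub scores shown here.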
This enhanced pipeline demonstrates a sophisticated use of the Node.js event loop. The main thread is never blocked. It orchestrates a series of asynchronous I/O operations (LLM calls, database queries, embedding requests) and efficiently combines their results to produce a far more accurate and contextually relevant output. The complexity is managed not by blocking, but by composing promises and managing the flow of data through an asynchronous pipeline.
Basic Code Example
This example demonstrates a simplified, self-contained implementation of Hypothetical Document Embeddings (HyDE) within a Node.js/TypeScript environment. We will simulate a SaaS web application where a user asks a question, and the system uses HyDE to generate a hypothetical answer, embed that answer, and retrieve relevant context from a vector database.
The core idea of HyDE is to transform a raw user query (which might be vague or abstract) into a concrete, "ideal" document that contains the answer. This hypothetical document is then embedded, and its vector is used for retrieval. This often bridges the semantic gap between a question and its answer more effectively than embedding the question itself.
The HyDE Logic Flow
- User Query: The user submits a natural language question (e.g., "What are the benefits of using a vector database?").
- Hypothetical Document Generation: A Large Language Model (LLM) is prompted to generate a short, factual document that answers the question, as if it were written by an expert.
- Embedding the Hypothesis: The generated hypothetical document is passed through an embedding model to create a high-dimensional vector.
- Vector Search (KNN): This vector is used to query the vector database for the k most similar real documents.
- Response Generation: The retrieved real documents are passed to the LLM to generate the final, accurate answer for the user.
TypeScript Implementation
This code simulates the entire pipeline. In a production environment, you would replace the mock services (e.g., mockLLM, mockEmbedding) with actual API calls to services like OpenAI, Cohere, or a self-hosted model. The vector database logic is also simulated for clarity.
// main.ts
// ============================================================================
// 1. TYPE DEFINITIONS
// ============================================================================
/**
* Represents a document stored in our vector database.
* @property id - Unique identifier for the document.
* @property content - The actual text content of the document.
* @property embedding - The vector representation of the content (number array).
*/
interface Document {
id: string;
content: string;
embedding: number[];
}
/**
* Represents the core dependencies for our HyDE pipeline.
* This dependency injection pattern makes the code more testable and modular.
*/
interface Dependencies {
generateHypotheticalDoc: (query: string) => Promise<string>;
generateEmbedding: (text: string) => Promise<number[]>;
searchVectorDB: (queryVector: number[], k: number) => Promise<Document[]>;
generateFinalAnswer: (query: string, context: Document[]) => Promise<string>;
}
// ============================================================================
// 2. MOCK SERVICES (SIMULATING REAL-WORLD APIs)
// ============================================================================
/**
* MOCK: Simulates an LLM call to generate a hypothetical document.
* In a real app, this would be an API call to OpenAI's GPT-4, etc.
* It's designed to produce a factual-sounding answer to the query.
*/
const mockLLMGenerateHypotheticalDoc = async (query: string): Promise<string> => {
console.log(`[LLM] Generating hypothetical document for query: "${query}"`);
// Simulate network delay
await new Promise(resolve => setTimeout(resolve, 100));
// A simple, deterministic response for this example
if (query.toLowerCase().includes("vector database")) {
return "A vector database is a specialized database that stores data as high-dimensional vectors, which are numerical representations of unstructured data like text, images, or audio. It is optimized for performing fast similarity searches using algorithms like K-Nearest Neighbors (KNN). This makes it ideal for applications like semantic search, recommendation systems, and Retrieval-Augmented Generation (RAG).";
}
return "This is a hypothetical answer generated by an LLM for the given query.";
};
/**
* MOCK: Simulates an embedding model (e.g., text-embedding-ada-002).
* It converts text into a vector of a fixed dimension (e.g., 768).
* In reality, this is a complex neural network process.
*/
const mockGenerateEmbedding = async (text: string): Promise<number[]> => {
console.log(`[Embedding Model] Generating embedding for text of length ${text.length}...`);
await new Promise(resolve => setTimeout(resolve, 50));
// For this example, we create a deterministic but "random-looking" vector
// seeded from the text's character codes, so different texts of the same
// length still get different vectors. This is NOT how real embeddings work.
// A real embedding captures semantic meaning.
const dimension = 768;
const vector: number[] = [];
let seed = 0;
for (let i = 0; i < text.length; i++) {
seed = (seed + text.charCodeAt(i)) % 233280;
}
for (let i = 0; i < dimension; i++) {
seed = (seed * 9301 + 49297) % 233280;
vector.push(seed / 233280); // Normalize to 0-1 range
}
return vector;
};
/**
* MOCK: Simulates a vector database (e.g., Pinecone, Weaviate, Qdrant).
* It stores documents and performs a K-Nearest Neighbors (KNN) search.
*/
const mockVectorDB: Document[] = [
{ id: "doc1", content: "Vector databases are designed to handle high-dimensional data, making them perfect for AI applications.", embedding: [] },
{ id: "doc2", content: "Traditional relational databases are not optimized for similarity search on unstructured data.", embedding: [] },
{ id: "doc3", content: "K-Nearest Neighbors (KNN) is the algorithm used to find the most similar vectors in a database.", embedding: [] },
{ id: "doc4", content: "What is the capital of France? The capital of France is Paris.", embedding: [] },
];
// Pre-populate embeddings for our mock database to make the search realistic
(async () => {
for (const doc of mockVectorDB) {
doc.embedding = await mockGenerateEmbedding(doc.content);
}
})();
const mockSearchVectorDB = async (queryVector: number[], k: number): Promise<Document[]> => {
console.log(`[Vector DB] Performing KNN search for k=${k}...`);
await new Promise(resolve => setTimeout(resolve, 50));
// Simple cosine similarity calculation
const cosineSimilarity = (vecA: number[], vecB: number[]): number => {
let dotProduct = 0;
let normA = 0;
let normB = 0;
for (let i = 0; i < vecA.length; i++) {
dotProduct += vecA[i] * vecB[i];
normA += vecA[i] * vecA[i];
normB += vecB[i] * vecB[i];
}
return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
};
const scoredDocs = mockVectorDB.map(doc => ({
doc,
score: cosineSimilarity(queryVector, doc.embedding),
}));
// Sort by score in descending order and take the top k
scoredDocs.sort((a, b) => b.score - a.score);
return scoredDocs.slice(0, k).map(item => item.doc);
};
/**
* MOCK: Simulates the final LLM call that uses the retrieved context.
* It synthesizes an answer based on the provided documents.
*/
const mockGenerateFinalAnswer = async (query: string, context: Document[]): Promise<string> => {
console.log(`[LLM] Generating final answer using ${context.length} context documents...`);
await new Promise(resolve => setTimeout(resolve, 100));
if (context.length === 0) {
return "I couldn't find any relevant information to answer your question.";
}
const contextText = context.map(doc => doc.content).join('\n\n');
return `Based on the following context:\n\n${contextText}\n\n---\n\nAnswer to the user's question: "${query}"`;
};
// ============================================================================
// 3. CORE HYPOTHETICAL DOCUMENT EMBEDDINGS (HyDE) PIPELINE
// ============================================================================
/**
* Executes the HyDE retrieval pipeline.
*
* @param query - The user's natural language question.
* @param deps - The dependency-injected services (LLM, Embedding, DB).
* @returns A promise that resolves to the final generated answer string.
*/
async function runHyDEPipeline(query: string, deps: Dependencies): Promise<string> {
console.log(`\nStarting HyDE Pipeline for query: "${query}"\n`);
// --- Step 1: Generate Hypothetical Document ---
// The LLM creates a "fake" document that contains the answer.
// This document is semantically richer than the raw query.
const hypotheticalDoc = await deps.generateHypotheticalDoc(query);
console.log(`Hypothetical Document Generated:\n"${hypotheticalDoc}"\n`);
// --- Step 2: Embed the Hypothetical Document ---
// We convert the generated text into a vector that captures its meaning.
const queryVector = await deps.generateEmbedding(hypotheticalDoc);
console.log(`Query Vector Generated (Dimension: ${queryVector.length})\n`);
// --- Step 3: Retrieve Real Documents from Vector DB ---
// We use the vector of the *hypothetical answer* to find the most relevant *real documents*.
const retrievedDocs = await deps.searchVectorDB(queryVector, 2); // k=2
console.log(`Retrieved ${retrievedDocs.length} Relevant Documents:`);
retrievedDocs.forEach(doc => console.log(` - [${doc.id}]: ${doc.content.substring(0, 50)}...`));
console.log('');
// --- Step 4: Generate Final Answer ---
// The LLM uses the retrieved real documents as context to answer the original query.
const finalAnswer = await deps.generateFinalAnswer(query, retrievedDocs);
console.log(`Final Answer Generated:\n"${finalAnswer}"\n`);
return finalAnswer;
}
// ============================================================================
// 4. APPLICATION EXECUTION (SIMULATING A WEB APP ROUTE)
// ============================================================================
/**
* Main entry point to run the example.
* This simulates an API endpoint being called in a web application.
*/
async function main() {
// Assemble the dependencies for our pipeline
const dependencies: Dependencies = {
generateHypotheticalDoc: mockLLMGenerateHypotheticalDoc,
generateEmbedding: mockGenerateEmbedding,
searchVectorDB: mockSearchVectorDB,
generateFinalAnswer: mockGenerateFinalAnswer,
};
// Example 1: A query where HyDE is particularly useful (abstract concept)
const userQuery1 = "What are the main advantages of using a vector database?";
await runHyDEPipeline(userQuery1, dependencies);
// Example 2: A more direct query
const userQuery2 = "How does KNN search work in a vector database?";
await runHyDEPipeline(userQuery2, dependencies);
}
// Run the main function to execute the example
main().catch(console.error);
Detailed Line-by-Line Explanation
1. Type Definitions
- interface Document: This defines the structure of our data. In a real-world application, this would map to a row in a database or a document in a search index. The embedding property is the crucial vector representation.
- interface Dependencies: This is a key architectural pattern for building robust and testable applications. Instead of hard-coding API calls inside our main logic, we define an interface for the services we need. This allows us to easily swap out mock implementations for real ones and makes our code unit-testable.
2. Mock Services
- mockLLMGenerateHypotheticalDoc: This function simulates the first core step of HyDE. It takes a user query and returns a string that answers the query. In a real application, this would be a call to an LLM like GPT-4 with a carefully crafted prompt (e.g., "Write a short, factual paragraph that answers the following question: [query]"). Our mock provides a deterministic response for the example's purpose.
- mockGenerateEmbedding: This simulates converting text into a vector. Real embedding models (like text-embedding-ada-002) are complex neural networks that map semantic meaning to a high-dimensional space (e.g., 1536 dimensions). Our mock creates a pseudo-random but deterministic vector based on the input text, which is sufficient for demonstrating the flow.
- mockVectorDB & mockSearchVectorDB: This simulates a vector database. It holds a set of pre-indexed documents (with pre-computed embeddings). The mockSearchVectorDB function implements a simplified K-Nearest Neighbors (KNN) search by calculating the cosine similarity between the query vector and each document's vector, then returning the top k most similar documents. Cosine similarity measures the angle between two vectors, which is a standard metric for text similarity.
3. Core HyDE Pipeline (runHyDEPipeline)
This function orchestrates the entire process, breaking it down into four distinct, numbered steps.
- Step 1: Generate Hypothetical Document: The raw query is passed to the LLM service. The result, hypotheticalDoc, is a text string that is semantically richer and more concrete than the original query.
- Step 2: Embed the Hypothetical Document: The hypotheticalDoc is passed to the embedding service. The result, queryVector, is the numerical representation of the answer, not the question. This is the key insight of HyDE.
- Step 3: Retrieve Real Documents: The queryVector is used to query the vector database. The system finds real documents whose vectors are closest to the vector of the hypothetical answer. This is highly effective because the hypothetical answer is semantically aligned with the content of the real documents that contain the answer.
- Step 4: Generate Final Answer: The retrieved real documents are passed as context to the LLM, along with the original user query. The LLM's job is now simplified: it has the relevant information and just needs to synthesize a clear, concise answer.
4. Application Execution (main)
- The main function acts as the entry point, simulating a web server route handler (e.g., in Next.js or Express).
- It assembles the dependencies object, injecting our mock services.
- It calls runHyDEPipeline with a sample query and logs the entire process, making it easy to follow the data flow.
Common Pitfalls
When implementing HyDE and RAG pipelines in a JavaScript/TypeScript environment, especially in serverless platforms like Vercel or in production web apps, be aware of these common issues:
- LLM Hallucination in Hypothetical Document Generation:
- Problem: The LLM used to generate the hypothetical document might "hallucinate" or invent facts. If the generated document contains incorrect information, the embedding will be based on that falsehood, potentially leading the retrieval step to pull irrelevant or misleading documents.
- Mitigation: For critical applications, you may need to add a "fact-checking" or "grounding" step, or use a more constrained prompt for the LLM. For many RAG use cases, the benefit of richer semantics outweighs the risk of minor hallucinations.
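One lightweight grounding guard is to fall back to embedding the raw query whenever the generated document looks degenerate. In the sketch below, the length threshold and the refusal phrases are arbitrary assumptions to be tuned for your domain, not part of HyDE itself:

```typescript
// Sketch: fall back to the raw query when the hypothetical document looks
// degenerate. The threshold and refusal phrases are arbitrary assumptions.
const REFUSAL_PHRASES = ["i don't know", "i cannot", "as an ai"];

function chooseEmbeddingText(query: string, hypotheticalDoc: string): string {
  const doc = hypotheticalDoc.trim().toLowerCase();
  const tooShort = doc.length < 40;                            // likely a non-answer
  const refused = REFUSAL_PHRASES.some(p => doc.includes(p));  // model declined
  return tooShort || refused ? query : hypotheticalDoc;
}
```

This keeps the pipeline no worse than standard RAG in the failure case, at the cost of losing HyDE's benefit for that query.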
- Sequential Async Calls and Timeouts in Serverless Environments:
- Problem: Vercel and other serverless platforms have strict execution timeouts (e.g., 10-30 seconds). A full HyDE pipeline involves multiple sequential API calls (LLM -> Embedding -> DB -> LLM), each adding latency. If any service is slow, the entire function can time out.
- Mitigation:
- Parallelize where possible: While the core HyDE pipeline is sequential, you can parallelize other tasks. For example, if you need to fetch user data alongside the RAG call, do it concurrently.
- Optimize API calls: Use streaming responses from LLMs if possible to reduce time-to-first-token latency.
- Increase Timeout: Configure the function timeout in your vercel.json or platform settings, but be mindful of costs.
- Consider Background Jobs: For very long-running pipelines, consider offloading the work to a background job queue (e.g., Vercel Background Functions, AWS SQS) and notifying the user upon completion.
- Vector Dimension Mismatch:
- Problem: This is a critical and often subtle bug. The dimension of the vector generated by your embedding model (e.g., 1536 for text-embedding-ada-002) must exactly match the dimension configured in your vector database index. A mismatch will cause the database to reject the vector or, worse, return incorrect search results.
- Mitigation: Always store the embedding model's dimension as a constant or configuration variable. Use this same value when creating your vector index and when generating embeddings. Validate this during your CI/CD pipeline.
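The single-source-of-truth mitigation can be sketched as a shared constant plus a guard applied to every vector before it reaches the index. The constant's value here assumes text-embedding-ada-002; substitute your model's dimension:

```typescript
// Sketch: centralize the embedding dimension and validate every vector
// before it reaches the index. The value is model-specific (assumed here).
const EMBEDDING_DIM = 1536; // e.g., text-embedding-ada-002

function assertDimension(vector: number[]): number[] {
  if (vector.length !== EMBEDDING_DIM) {
    throw new Error(
      `Embedding dimension mismatch: got ${vector.length}, expected ${EMBEDDING_DIM}`
    );
  }
  return vector; // pass-through on success, so it composes in a pipeline
}
```

Using the same constant when creating the index and when embedding turns a silent retrieval-quality bug into a loud, immediate error.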
- Cost Management:
- Problem: HyDE doubles your LLM API calls per query (once for generating the hypothetical doc, once for the final answer). This can significantly increase costs compared to standard RAG.
- Mitigation:
- Analyze Value vs. Cost: Only use HyDE for queries where it provides a measurable lift in retrieval quality. Simple keyword-style queries might not need it.
- Caching: Cache the hypothetical document embeddings for common queries to avoid re-generating them.
- Use Smaller Models: Consider using a smaller, cheaper model for the hypothetical document generation step if it proves effective for your domain.
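The caching mitigation can be sketched with an in-memory map keyed by a normalized form of the query. The normalization rule is an assumption; a production system would likely use Redis or similar, with TTLs and size limits:

```typescript
// Sketch: an in-memory cache for HyDE query vectors, keyed by a normalized
// query. Normalization and storage choices are assumptions for illustration.
const hydeCache = new Map<string, number[]>();

const normalizeQuery = (q: string): string =>
  q.trim().toLowerCase().replace(/\s+/g, " ");

async function cachedHydeEmbedding(
  query: string,
  compute: (q: string) => Promise<number[]>, // runs LLM + embedding on a miss
): Promise<number[]> {
  const key = normalizeQuery(query);
  const hit = hydeCache.get(key);
  if (hit) return hit; // skip both the LLM and embedding calls on a hit
  const vector = await compute(query);
  hydeCache.set(key, vector);
  return vector;
}
```

For common queries this avoids the extra LLM call that makes HyDE expensive, at the cost of serving a slightly stale hypothetical-answer vector.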
- Inadequate Prompt Engineering:
- Problem: The quality of the hypothetical document is entirely dependent on the prompt used to generate it. A generic prompt can lead to vague or unhelpful documents.
- Mitigation: Invest time in crafting and testing your LLM prompts. The prompt for HyDE should be explicit: "You are an expert. Write a short, factual paragraph that directly and completely answers the following question. Do not add conversational filler." Provide examples (few-shot prompting) to guide the model's output style.
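A prompt along those lines, with one few-shot example, might be assembled as below. The exact wording and the example pair are assumptions to be tuned and tested against your own documents:

```typescript
// Sketch: an explicit HyDE prompt with one few-shot example. The wording
// and the example pair are assumptions, not a canonical HyDE prompt.
const FEW_SHOT_EXAMPLE =
  `Question: What is a vector database?\n` +
  `Paragraph: A vector database stores data as high-dimensional vectors and ` +
  `is optimized for fast similarity search over them.`;

function buildHydePrompt(question: string): string {
  return (
    `You are an expert. Write a short, factual paragraph that directly and ` +
    `completely answers the following question. Do not add conversational filler.\n\n` +
    `${FEW_SHOT_EXAMPLE}\n\nQuestion: ${question}\nParagraph:`
  );
}
```

The trailing "Paragraph:" cue nudges the model to answer in the declarative, document-like register that HyDE depends on.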
The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.
Code License: All code examples are released under the MIT License. Github repo.
Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.
All textual explanations, original diagrams, and illustrations are the intellectual property of the author. To support the maintenance of this site via AdSense, please read this content exclusively online. Copying, redistribution, or reproduction is strictly prohibited.