
Chapter 7: Using pgvector with PostgreSQL and EF Core

Theoretical Foundations

The theoretical foundation of integrating pgvector with PostgreSQL and EF Core rests on bridging the gap between the mathematical abstraction of vector embeddings and the concrete, transactional world of relational databases. To understand this integration, we must first dissect the nature of vector embeddings, the mechanics of similarity search, and the architectural implications of storing high-dimensional data within a row-based database system.

The Nature of Vector Embeddings

At the core of modern AI applications lies the vector embedding. An embedding is a dense numerical representation of data—text, images, audio—mapped into a high-dimensional vector space. In this space, semantic meaning is encoded as geometric proximity.

Consider the analogy of a Library of Babel. Imagine a library containing every possible book of a certain length. In this library, books that are semantically similar (e.g., two books about "quantum physics") are not placed on the same shelf based on the Dewey Decimal System, but rather are stored in physical locations that are geometrically close to one another. A vector embedding is the coordinate of a book within this infinite library.

In C#, this is represented as a primitive array of floating-point numbers:

float[] embedding = new float[] { 0.1f, -0.4f, 0.9f, ... }; // Typically 384, 768, or 1536 dimensions
However, standard relational databases (like PostgreSQL without extensions) treat arrays as unstructured blobs or require complex normalization (storing each dimension in a separate column or row). This is inefficient for the mathematical operations required to find "nearby" books.

The Mathematical Core: Similarity Metrics

To find the closest book in our Library of Babel, we need a measuring stick. In vector space, this measuring stick is a similarity metric. The three most critical metrics are:

  1. Euclidean Distance (L2): The straight-line distance between two points. It is sensitive to magnitude. If one vector represents a concept with high intensity and another with low intensity, they may be far apart even if they represent the same concept.
  2. Cosine Similarity: The cosine of the angle between two vectors. This measures orientation, not magnitude. It answers: "Are these two vectors pointing in the same direction?" This is the standard for text embeddings, where the length of the vector often depends on word frequency rather than semantic meaning.
  3. Inner Product: The dot product of two vectors. For normalized (unit-length) vectors, ranking by inner product is equivalent to ranking by cosine similarity, since the magnitude terms are all 1.
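A minimal sketch in plain C# makes the contrast between the three metrics concrete (the helper and class names here are our own, not part of any library):

using System;

public static class Metrics
{
    // Euclidean (L2) distance: sqrt(sum((a_i - b_i)^2))
    public static double L2(double[] a, double[] b)
    {
        double sum = 0;
        for (int i = 0; i < a.Length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.Sqrt(sum);
    }

    // Inner (dot) product: sum(a_i * b_i)
    public static double Dot(double[] a, double[] b)
    {
        double sum = 0;
        for (int i = 0; i < a.Length; i++) sum += a[i] * b[i];
        return sum;
    }

    // Cosine similarity: dot(a, b) / (||a|| * ||b||)
    public static double Cosine(double[] a, double[] b) =>
        Dot(a, b) / (Math.Sqrt(Dot(a, a)) * Math.Sqrt(Dot(b, b)));
}

public class MetricsDemo
{
    public static void Main()
    {
        // Same direction, different magnitude: cosine says "identical", L2 disagrees.
        double[] a = { 1, 2, 2 };
        double[] b = { 2, 4, 4 }; // b = 2a
        Console.WriteLine(Metrics.Cosine(a, b)); // 1 (same orientation)
        Console.WriteLine(Metrics.L2(a, b));     // 3 (magnitude gap)
    }
}

The pair {1,2,2} vs {2,4,4} illustrates why cosine similarity is preferred for text embeddings: the two vectors point the same way, so cosine reports a perfect match, while Euclidean distance penalizes the difference in length.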

The "Why" of pgvector: Standard SQL is ill-equipped to calculate these metrics over high-dimensional arrays efficiently. A query like "Find the row where the vector is closest to X" requires iterating over every row, extracting N dimensions, and performing N-dimensional arithmetic. This is an \(O(N \times M)\) operation (where \(N\) is the number of rows and \(M\) is the dimensionality), which is computationally prohibitive.

pgvector introduces a new data type (vector) and specialized C functions to perform these calculations at the database level, pushing the heavy lifting of math close to the data storage.

The Relational-Vector Hybrid: Why Not Just Use a Vector Database?

A common architectural question is: Why force vector capabilities into a relational database when dedicated vector databases (like Pinecone, Milvus, or Weaviate) exist?

The answer lies in Data Locality and Consistency.

In a standard RAG (Retrieval-Augmented Generation) application, you often need to perform a hybrid search. You might need to find documents that are semantically similar to a query AND belong to a specific user AND are marked as active.

Scenario:

  • Pure Vector DB: You query for the top 10 similar vectors. You receive IDs. You then must query a separate relational database (SQL Server/Postgres) to fetch metadata (User ID, Date, Status) and filter out unauthorized or inactive documents. This introduces network latency, eventual consistency issues, and complex transaction management.
  • pgvector (Hybrid): You perform a vector search and a SQL filter in a single atomic query.

Analogy: The Supermarket vs. The Specialty Shop

  • Dedicated Vector DB: A specialty spice shop. It has an incredible, indexed catalog of spices (vectors) and knows exactly where everything is. However, if you need a spice and a specific brand of olive oil (metadata), you must go to the grocery store (relational DB) next door, pay twice, and coordinate the trip.
  • pgvector: A well-organized Supermarket. The spices are indexed by flavor profile (vectors), but they are physically located on the same aisle as the olive oil. You can query for "spicy and smoky" (vector search) and filter by "organic" (SQL WHERE clause) simultaneously.

Mapping Vectors in EF Core: The Type Mapping Challenge

In the previous chapter, we discussed Value Conversions and Value Comparers in EF Core. These concepts are paramount when working with pgvector.

In C#, a vector is an array (float[]). In PostgreSQL with pgvector, it is a distinct vector type. EF Core does not natively understand this mapping. We must instruct the DbContext how to translate a C# array into a PostgreSQL vector and vice versa.

This is not merely a serialization format; it is a semantic translation.

  • Storage: When saving an entity, EF Core must convert the float[] into a string or binary format that PostgreSQL understands (e.g., [0.1,0.2,0.3]).
  • Retrieval: When reading, the raw data from the database driver (Npgsql) must be converted back into a managed float[].
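Conceptually, the two directions amount to a round-trip between a float[] and pgvector's textual literal. The sketch below (class and method names are our own; the real provider uses binary protocol handlers rather than strings) shows the idea:

using System;
using System.Globalization;
using System.Linq;

public static class VectorLiteral
{
    // Storage direction: float[] -> pgvector text literal, e.g. "[0.1,0.2,0.3]"
    public static string ToLiteral(float[] v) =>
        "[" + string.Join(",", v.Select(x => x.ToString(CultureInfo.InvariantCulture))) + "]";

    // Retrieval direction: text literal -> a freshly allocated managed float[]
    public static float[] FromLiteral(string s) =>
        s.Trim('[', ']')
         .Split(',', StringSplitOptions.RemoveEmptyEntries)
         .Select(x => float.Parse(x, CultureInfo.InvariantCulture))
         .ToArray();
}

Note how FromLiteral allocates a new array on every call — exactly the GC-pressure concern discussed next.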

The Performance Implication: If this mapping is naive, every read operation requires allocating a new array on the managed heap. In high-throughput AI applications, this causes Garbage Collection (GC) pressure. Modern C# features like Span<T> and Memory<T> allow for low-copy parsing, and the Npgsql provider handles much of this internally through its binary protocol handlers.

Under the Hood: Query Execution and Indexing

To understand how pgvector works under the hood, we must look at the execution plan of a similarity query.

1. The Sequential Scan (Brute Force)

Without an index, a query using the <=> (cosine distance) or <-> (Euclidean distance) operator performs a sequential scan.

  • Process: For every row, calculate the distance between the stored vector and the query vector.
  • Cost: \(O(N \times D)\), where \(N\) is rows and \(D\) is dimensions.
  • Analogy: Searching for a specific grain of sand on a beach by picking up every grain and comparing it to a reference grain.

2. The HNSW Index (Hierarchical Navigable Small World)

To make vector search viable at scale, pgvector supports the HNSW index. This is a graph-based indexing algorithm.

The Analogy: The Highway System Imagine trying to walk from one city to another (finding the nearest neighbor).

  • No Index: You walk in a straight line, crossing every field and river (Sequential Scan).
  • HNSW: You start on a local road (coarse graph layer), merge onto a highway (intermediate layer), and finally take an expressway (top layer) to get close to the destination, then descend back to local roads to pinpoint the exact location.

Architectural Nuance: HNSW is a graph where nodes are vectors. It constructs multiple layers of graphs. The top layers have fewer connections (long-range links), allowing for rapid traversal across the vector space. The bottom layers have dense connections for precise refinement.

Why HNSW in pgvector? It provides \(O(\log N)\) search complexity, drastically reducing the search space. However, it comes with trade-offs:

  • Build Time: Index creation is computationally expensive and memory-intensive.
  • Memory Usage: The index lives in RAM (shared buffers).
  • Write Amplification: Inserting new vectors requires updating the graph structure.

EF Core Integration: The Abstraction Layer

When using EF Core with pgvector, we are leveraging the Specification Pattern (conceptually) to define our queries. EF Core translates LINQ expressions into SQL.

The Critical Concept: Provider-Specific Functions Standard LINQ (x => x.Id == 1) translates to standard SQL. However, vector similarity is not part of the SQL standard. We must rely on provider-specific functions and operators that the EF Core provider knows how to translate.

In the Npgsql provider, these functions are mapped to C# extension methods. When EF Core encounters a method like vector.L2Distance(target), it doesn't execute C# code; it translates it into the SQL function l2_distance(vector_column, target_vector).

Hybrid Search Strategy: The power of EF Core is combining these provider-specific functions with standard LINQ.

Conceptual Query Logic:

  1. Vector Filter: Calculate distance to target embedding.
  2. Relational Filter: Apply WHERE Category = 'Legal' AND IsPublic = true.
  3. Ordering: Order by the calculated distance.
  4. Pagination: Skip and Take.

This creates a SQL query that looks roughly like:

SELECT "Id", "Content", "Embedding" <=> @p0 AS "Distance"
FROM "Documents"
WHERE "Category" = 'Legal' AND "IsPublic" = true
ORDER BY "Distance"
LIMIT 10;
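In LINQ, the same hybrid query can be sketched as follows. This is a hedged sketch: Documents, Category, IsPublic, and Embedding are hypothetical names, and CosineDistance is the extension method from the Pgvector.EntityFrameworkCore package (it translates to the <=> operator):

// Hedged sketch: entity and property names are hypothetical.
var results = await context.Documents
    .Where(d => d.Category == "Legal" && d.IsPublic)           // relational filter
    .OrderBy(d => d.Embedding!.CosineDistance(queryEmbedding)) // vector ordering
    .Take(10)                                                  // pagination
    .ToListAsync();

Because the filter and the ordering live in one expression tree, EF Core emits a single SQL statement and PostgreSQL can evaluate both atomically, which is the whole point of the hybrid approach.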

RAG (Retrieval-Augmented Generation) Patterns

In the context of RAG, pgvector serves as the application's external knowledge store: the long-term memory from which relevant snippets are retrieved to fill the LLM's limited context window.

The Workflow:

  1. Ingestion: Documents are chunked. Each chunk is sent to an embedding model (e.g., text-embedding-ada-002) to generate a float[].
  2. Storage: The chunk text and the float[] are saved via EF Core into PostgreSQL.
  3. Retrieval: User asks a question. The question is embedded.
  4. Similarity Search: EF Core queries pgvector for the top \(K\) chunks closest to the question vector.
  5. Augmentation: These chunks are injected into the LLM prompt as context.
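Steps 3–5 of the workflow above can be sketched end-to-end. This is a hedged sketch: Book and LibraryContext refer to the basic example later in this chapter, while embedAsync and completeAsync are hypothetical delegates wrapping an embedding model and an LLM respectively:

using System;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Pgvector;

public static class RagPipeline
{
    public static async Task<string> AnswerAsync(
        string question,
        LibraryContext db,
        Func<string, Task<Vector>> embedAsync,     // hypothetical embedding-model wrapper
        Func<string, Task<string>> completeAsync)  // hypothetical LLM wrapper
    {
        Vector questionVector = await embedAsync(question);    // 3. Retrieval: embed the question

        var topChunks = await db.Books                         // 4. Similarity Search: top K = 5
            .OrderBy(b => b.Embedding!.L2Distance(questionVector))
            .Take(5)
            .ToListAsync();

        // 5. Augmentation: inject the retrieved chunks into the prompt
        string contextBlock = string.Join("\n---\n", topChunks.Select(c => c.Title));
        string prompt = $"Context:\n{contextBlock}\n\nQuestion: {question}";

        return await completeAsync(prompt);
    }
}

In a production system the entity would carry the chunk text rather than just a title, and the K value, prompt template, and error handling would all be tuned to the application.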

The "What If" - Handling Drift: What happens when the vector space shifts? If new data is added, the distribution of vectors changes. HNSW indexes handle incremental updates, but over time, the index efficiency might degrade (fragmentation). In a purely relational world, we would REINDEX. With pgvector, we rely on the database's maintenance capabilities, but the theoretical implication is that vector indexes are dynamic structures that require different maintenance strategies than B-Tree indexes.

Visualizing the Architecture

The following diagram illustrates the flow of data and the abstraction layers involved in a hybrid search using EF Core and pgvector.

This diagram illustrates how EF Core's abstraction layers translate a hybrid search query into a vector-based similarity operation that is dynamically maintained and executed by the pgvector extension within the PostgreSQL database.

Deep Dive: The HNSW Index Configuration

When defining the index theoretically, we must consider the parameters exposed by pgvector that affect the trade-off between recall (accuracy) and speed.

  1. m (The number of connections per node):

    • Higher m increases the connectivity of the graph.
    • Effect: Higher recall (accuracy) but slower build time and larger index size.
    • Analogy: In a city, having more streets (connections) makes it easier to find a route, but the city map becomes more complex.
  2. ef_construction (Size of the dynamic candidate list):

    • This controls the number of neighbors evaluated during index construction.
    • Effect: Higher values create a higher quality graph but significantly increase build time.
  3. ef_search (Size of the dynamic candidate list at query time):

    • This controls the breadth of the search during retrieval.
    • Effect: Higher values increase recall at the cost of speed.

In EF Core, these parameters are typically configured via Fluent API, allowing developers to tune the index without writing raw SQL migrations.
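For instance, with the Npgsql EF Core provider's index-builder extensions, an HNSW index could be declared as below. This is a sketch: HasMethod, HasOperators, and HasStorageParameter are the provider's methods, while the Book entity and the parameter values are our assumptions:

// Inside OnModelCreating — hedged sketch assuming the Npgsql EF Core provider.
modelBuilder.Entity<Book>()
    .HasIndex(b => b.Embedding)
    .HasMethod("hnsw")                           // index type
    .HasOperators("vector_l2_ops")               // operator class (L2 distance here)
    .HasStorageParameter("m", 16)                // connections per node
    .HasStorageParameter("ef_construction", 64); // build-time candidate list size

Note that ef_search is a query-time setting, not an index property; it is set per session in SQL (SET hnsw.ef_search = 100;) rather than through the Fluent API.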

The Role of Memory Storage

While pgvector stores vectors on disk, the performance of vector search is heavily dependent on how much of the index fits into Shared Buffers (PostgreSQL's RAM cache).

Theoretical Implication: If the HNSW index is larger than available RAM, the database must swap pages from disk. This introduces I/O latency, which is the enemy of real-time AI applications.

  • Strategy: For high-performance RAG, the vector database (pgvector) should ideally be provisioned with enough RAM to hold the entire index (or at least the active working set) in memory.
  • Caching: Since EF Core is stateless (per request), it relies on the database's internal caching mechanisms. Unlike an in-memory vector store (like Redis with Vector Search), pgvector persists data, ensuring durability and ACID compliance.

Synthesis: Two Paradigms Converge

The integration of pgvector with EF Core is not merely a technical implementation of a data type; it is a convergence of two paradigms:

  1. Geometric Search (Vector): Based on high-dimensional mathematics and graph theory (HNSW).
  2. Relational Integrity (SQL): Based on set theory, ACID transactions, and structured metadata.

By using EF Core, we abstract the complexity of the underlying SQL and vector operations, allowing us to treat vector similarity as just another property of our domain entities. This enables the construction of complex AI applications where semantic search is not an isolated feature but deeply woven into the fabric of the application's data model.

Basic Code Example

Here is a basic "Hello World" example for using pgvector with EF Core. We will simulate a scenario where a user wants to find a specific book in a library based on its title and a semantic description of its content, using vector similarity.

Prerequisites

Before running this code, ensure you have:

  1. A PostgreSQL database with the pgvector extension enabled (CREATE EXTENSION IF NOT EXISTS vector;).
  2. The following NuGet packages installed:

    • Microsoft.EntityFrameworkCore
    • Npgsql.EntityFrameworkCore.PostgreSQL
    • Pgvector.EntityFrameworkCore (the pgvector mapping support for EF Core, which also pulls in the base Pgvector package)

The Code Example

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Npgsql;
using Pgvector;
using Pgvector.EntityFrameworkCore;

// 1. Define the Entity
// This represents a book in our library. We store the title and the vector embedding.
public class Book
{
    public int Id { get; set; }
    public string Title { get; set; } = string.Empty;

    // The vector column. Pgvector's Vector type wraps a float[] and maps to vector(n).
    public Vector? Embedding { get; set; }
}

// 2. Define the DbContext
// This manages the connection and mapping to the database.
public class LibraryContext : DbContext
{
    public DbSet<Book> Books { get; set; } = null!;

    protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
    {
        // CRITICAL: We must register the vector type mapping.
        // Without this, EF Core won't know how to translate Vector to PostgreSQL vector.
        var dataSourceBuilder = new NpgsqlDataSourceBuilder("Host=localhost;Database=library_db;Username=postgres;Password=your_password");

        // This registers the pgvector type handler on the ADO.NET data source...
        dataSourceBuilder.UseVector();

        // ...and this registers the EF Core plugin that translates LINQ into vector SQL.
        optionsBuilder.UseNpgsql(dataSourceBuilder.Build(), o => o.UseVector());
    }

    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        // Ensure the extension is created by migrations / EnsureCreated.
        modelBuilder.HasPostgresExtension("vector");

        // Map the entity to the table
        modelBuilder.Entity<Book>(entity =>
        {
            entity.ToTable("books");
            entity.HasKey(e => e.Id);
            entity.Property(e => e.Title).IsRequired();

            // Explicitly configure the column as a vector.
            // We define the dimension here (e.g., 384 for a small embedding model).
            entity.Property(e => e.Embedding)
                  .HasColumnType("vector(384)");
        });
    }
}

// 3. The "Hello World" Execution Logic
public class Program
{
    public static async Task Main()
    {
        // Initialize the database and seed dummy data
        await InitializeDatabaseAsync();

        // Define a search query (e.g., user input)
        string userQuery = "A story about a wizard in a magical forest";

        // In a real app, you would generate the embedding for this text using an AI model.
        // For this example, we simulate a vector representing the query.
        var queryEmbedding = new Vector(GenerateDummyEmbedding(384, 100));

        // Perform the Semantic Search
        using (var context = new LibraryContext())
        {
            Console.WriteLine($"Searching for: '{userQuery}'");
            Console.WriteLine("--------------------------------------------------");

            // THE CORE QUERY: order by L2 (Euclidean) distance, which translates
            // to pgvector's <-> operator. For cosine distance (<=>), use
            // CosineDistance instead; on normalized embeddings both metrics
            // produce the same ranking.
            var similarBooks = await context.Books
                .OrderBy(b => b.Embedding!.L2Distance(queryEmbedding))
                .Take(3)
                .ToListAsync();

            foreach (var book in similarBooks)
            {
                // In a real scenario, you might also project the distance itself
                // (e.g., via Select) to display a similarity score.
                Console.WriteLine($"ID: {book.Id} | Title: {book.Title}");
            }
        }
    }

    // Helper to seed data (Simulating a pre-populated database)
    private static async Task InitializeDatabaseAsync()
    {
        using var context = new LibraryContext();
        // Ensure database is created
        await context.Database.EnsureDeletedAsync(); // Clean slate for demo
        await context.Database.EnsureCreatedAsync();

        if (!await context.Books.AnyAsync())
        {
            // Add dummy books with dummy embeddings
            var books = new List<Book>
            {
                new() { Title = "The Hobbit", Embedding = new Vector(GenerateDummyEmbedding(384, 10)) },
                new() { Title = "Advanced Calculus", Embedding = new Vector(GenerateDummyEmbedding(384, 50)) },
                new() { Title = "Harry Potter and the Sorcerer's Stone", Embedding = new Vector(GenerateDummyEmbedding(384, 12)) }, // Close to query
                new() { Title = "Cooking for Beginners", Embedding = new Vector(GenerateDummyEmbedding(384, 90)) }
            };

            context.Books.AddRange(books);
            await context.SaveChangesAsync();
        }
    }

    // Helper to generate a random vector (simulating an AI model output)
    private static float[] GenerateDummyEmbedding(int dimensions, int seed)
    {
        var random = new Random(seed);
        var embedding = new float[dimensions];
        for (int i = 0; i < dimensions; i++)
        {
            embedding[i] = (float)random.NextDouble();
        }
        return embedding;
    }
}

Detailed Explanation

Here is the line-by-line breakdown of how the code works.

1. Entity Definition (Book Class)

public class Book
{
    public int Id { get; set; }
    public string Title { get; set; } = string.Empty;
    public Vector? Embedding { get; set; }
}
  • Id & Title: Standard properties for relational data.
  • Embedding: This is the crucial property. .NET has no native vector type, so the Pgvector package supplies a Vector class (a thin wrapper around a float[]) that the provider maps to PostgreSQL's vector column type.

2. DbContext Configuration (LibraryContext)

protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
{
    var dataSourceBuilder = new NpgsqlDataSourceBuilder("...");
    dataSourceBuilder.UseVector();
    optionsBuilder.UseNpgsql(dataSourceBuilder.Build(), o => o.UseVector());
}
  • NpgsqlDataSourceBuilder: In modern Npgsql (the PostgreSQL driver for .NET), data sources are the preferred way to configure connections.
  • UseVector(): This is the magic switch, applied twice: once on the data source, registering the low-level type handlers that translate between PostgreSQL's vector type and the .NET Vector class, and once on the EF Core options, registering the plugin that translates LINQ methods into vector SQL. Without these, EF Core will throw an exception saying it doesn't know how to map the column.
protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    modelBuilder.HasPostgresExtension("vector");

    modelBuilder.Entity<Book>(entity =>
    {
        entity.Property(e => e.Embedding)
              .HasColumnType("vector(384)");
    });
}
  • HasPostgresExtension("vector"): Instructs EF Core to emit CREATE EXTENSION IF NOT EXISTS vector when the database schema is created.
  • HasColumnType("vector(384)"): We explicitly tell EF Core to create the column as a vector with a fixed dimension of 384. This is important because pgvector requires dimensions to be defined for indexing and storage optimization.

3. The Search Logic (Main Method)

var similarBooks = await context.Books
    .OrderBy(b => b.Embedding!.L2Distance(queryEmbedding))
    .Take(3)
    .ToListAsync();
  • L2Distance: This translates to the <-> operator in PostgreSQL (Euclidean distance). It calculates the straight-line distance between two vectors.
    • Note: While cosine similarity is often preferred for text embeddings (it ignores magnitude), ordering by L2 distance on normalized (unit-length) vectors yields exactly the same ranking, since ‖a − b‖² = 2(1 − cos θ) for unit vectors. If your embeddings are not normalized, use CosineDistance, which maps to the <=> operator.
  • OrderBy: This sorts the results by the calculated distance. The smallest distance (closest to 0) indicates the highest similarity.
  • Take(3): We only want the top 3 most relevant results.

4. Database Interaction

When this code runs, EF Core generates SQL similar to this:

SELECT b."Id", b."Title", b."Embedding"
FROM books AS b
ORDER BY b."Embedding" <-> '[0.1, 0.5, ...]'::vector
LIMIT 3
The pgvector extension handles the vector math directly on the database server, which is significantly faster than pulling all data into the application and calculating distances in C#.

Visualizing the Data Flow

The diagram illustrates the data flow where vector search queries are processed directly on the database server via the pgvector extension, bypassing the need to transfer raw data to the C# application for distance calculations.

Common Pitfalls

  1. Missing UseVector() Call

    • The Issue: You configure the connection string correctly and define the entity, but you get a runtime error: No Npgsql type mapping exists for 'vector'.
    • The Fix: Ensure dataSourceBuilder.UseVector(); is called before Build(), and pass o => o.UseVector() to UseNpgsql. Together these register the custom type handler and the EF Core translation plugin.
  2. Dimension Mismatch

    • The Issue: You create a vector column with vector(1536) (e.g., for OpenAI's text-embedding-ada-002, which emits 1536 dimensions), but your C# code tries to insert an array of length 384.
    • The Fix: PostgreSQL will throw an error during insertion. Ensure your embedding length matches the dimension defined in HasColumnType("vector(n)").
  3. Calculating Distance in Application (Performance Killer)

    • The Issue: A developer might fetch all records (context.Books.ToList()) and then use C# System.Numerics to calculate distances in a loop.
    • The Fix: Always perform the distance calculation inside the LINQ query (OrderBy(b => b.Embedding!.L2Distance(...))). This pushes the computation to the database, leveraging indexes (like HNSW) and minimizing network overhead.
  4. Forgetting to Normalize for Cosine Similarity

    • The Issue: Using L2Distance (Euclidean) on raw, non-normalized text embeddings can lead to suboptimal results if the magnitude of the vectors varies significantly.
    • The Fix: Either normalize your embeddings (scale them to length 1) before storing them, after which L2 ordering and cosine ordering agree, or use CosineDistance, which maps to pgvector's <=> operator and ignores magnitude by construction.
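A minimal normalization helper (our own sketch, not part of any library) that scales an embedding to unit length before storage might look like:

using System;
using System.Linq;

public static class EmbeddingMath
{
    // Scale a vector to unit (L2) length; zero vectors are returned unchanged.
    public static float[] Normalize(float[] v)
    {
        double norm = Math.Sqrt(v.Sum(x => (double)x * x));
        if (norm == 0) return v;
        return v.Select(x => (float)(x / norm)).ToArray();
    }
}

Many embedding APIs already return unit-length vectors, so check your model's documentation before normalizing twice (a second pass is harmless but wasteful).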

The chapter continues with advanced code samples, exercises, and analyzed solutions; you can find them in the ebook on Leanpub.com or Amazon.





Code License: All code examples are released under the MIT License. Github repo.

Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.

All textual explanations, original diagrams, and illustrations are the intellectual property of the author. To support the maintenance of this site via AdSense, please read this content exclusively online. Copying, redistribution, or reproduction is strictly prohibited.