Chapter 18: Seeding Databases with Synthetic Data
Theoretical Foundations
The generation of synthetic data is not merely a convenience; it is a foundational discipline for building robust, scalable, and secure AI-integrated applications. In the context of modern C# development using Entity Framework Core, the ability to programmatically generate high-fidelity, domain-specific data bridges the gap between static database schemas and the dynamic, probabilistic nature of Large Language Models (LLMs) and Vector Databases.
The Imperative of Synthetic Data in AI Architectures
To understand why synthetic data seeding is critical, consider the architecture of a modern Retrieval-Augmented Generation (RAG) system. A RAG pipeline relies on the semantic richness of a vector database (like Azure AI Search, Pinecone, or PostgreSQL with pgvector) to provide context to an LLM. However, the quality of the retrieval is directly proportional to the quality and volume of the data embedded.
If you rely solely on production data for development and testing, you encounter three immediate bottlenecks:
- Privacy & Compliance (GDPR/CCPA): Production data often contains PII (Personally Identifiable Information). Feeding this into external LLM APIs or storing it in development vector stores creates significant legal liability.
- Scarcity: Early-stage development requires volume. You cannot test the scalability of a vector search algorithm with five records. You need thousands, perhaps millions, of contextually diverse records.
- Edge Case Coverage: Production data reflects historical patterns. To test an AI's resilience, you need synthetic data that simulates rare events, outliers, and adversarial inputs.
Synthetic data generation allows us to create a "parallel universe" of data that mirrors the statistical properties and relational integrity of production without containing a single byte of actual user information.
The Role of EF Core and the Challenge of Referential Integrity
Entity Framework Core (EF Core) acts as the domain gatekeeper. It enforces the rules of our application through C# models. When seeding data, we are not simply writing SQL INSERT statements; we are instantiating C# objects that must satisfy the constraints defined in our DbContext.
The primary challenge in synthetic data generation for relational databases is Referential Integrity. Consider a standard e-commerce domain model: Customer, Order, and OrderItem. You cannot generate an Order without a valid CustomerID. If you generate data randomly, you risk violating foreign key constraints, causing the seeding process to fail.
This is where the distinction between Random generation and Contextual generation becomes vital. Random generation (e.g., picking a random integer for a foreign key) is brittle. Contextual generation ensures that child entities are generated in the context of their parents.
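Contextual generation can be sketched in a few lines of plain C# (simplified stand-in entities here, so no Bogus or EF Core is required to run it):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Contextual generation: children are created only from parents that
// actually exist. Plain C# sketch; in a real app these would be EF Core entities.
var rng = new Random(42);

// 1. Generate parents first and keep their IDs.
var customers = Enumerable.Range(1, 100)
    .Select(id => (Id: id, Name: $"Customer {id}"))
    .ToList();

// 2. Generate children by picking a *real* parent, never a random integer.
var orders = Enumerable.Range(1, 500)
    .Select(id => (Id: id, CustomerId: customers[rng.Next(customers.Count)].Id))
    .ToList();

// Every order satisfies the foreign-key constraint by construction.
var validIds = customers.Select(c => c.Id).ToHashSet();
Console.WriteLine(orders.All(o => validIds.Contains(o.CustomerId))); // True
```

The random-generation equivalent (`rng.Next(1, 10_000)` for `CustomerId`) would routinely produce IDs with no matching parent row, which is exactly the brittleness described above.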
Analogy: The Film Set Prop Department
Imagine a film set for a complex historical drama. The set requires thousands of props: letters, furniture, costumes, and background extras.
- Production Data is like using real historical artifacts. It is authentic but fragile, expensive to acquire, and risky to handle (insurance, damage).
- Synthetic Data is the work of the prop department.
The prop department (our seeding logic) does not simply grab random items from a warehouse. They build items that look authentic (statistical fidelity) and fit the specific scene (referential integrity). If the script calls for a letter written by a specific character to another, the prop department ensures the letter exists and is handed to the correct actor. They cannot hand a letter to an actor who hasn't been cast yet.
In our C# application, the DbContext is the Director, and the Synthetic Data Generator is the Prop Master. The Prop Master must know the script (the domain model) to ensure that every prop is valid for the scene (the database state).
The Evolution of Data Generation: From Faker to LLMs
Historically, libraries like Bogus (a C# port of the Faker.js library) have been the standard for generating dummy data. They excel at creating localized, structured data (names, addresses, dates) based on predefined rules.
However, with the rise of AI, the requirements have shifted. We no longer just need structured data; we need semantic data. For a RAG system, a synthetic user review that says "Product is good" is useless. A review that says, "The battery life is exceptional, lasting over 12 hours, though the screen brightness struggles in direct sunlight," provides rich vector embeddings.
This introduces a hybrid approach:
- Structured Seeding (Bogus/Faker): Used for foreign keys, dates, and scalar values.
- Unstructured/Textual Seeding (LLMs): Used for generating domain-specific text (reviews, support tickets, code comments) that serves as the "chunk" data for vectorization.
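A minimal sketch of this hybrid approach, with the LLM call mocked by a hypothetical `GenerateReviewText` helper so the snippet runs without an API key:

```csharp
using System;
using System.Collections.Generic;

// Hybrid seeding sketch: scalar fields come from rule-based generation,
// text fields from an LLM. GenerateReviewText is a stand-in for a real
// chat-completion call.
var rng = new Random(1);

string GenerateReviewText(string productName) =>
    // In production this would call an LLM with a prompt such as:
    // $"Write a realistic 2-sentence customer review for {productName}."
    $"The {productName} exceeded expectations; battery life is excellent, " +
    "though the screen struggles in direct sunlight.";

var reviews = new List<(int ProductId, int Rating, string Text)>();
for (int productId = 1; productId <= 3; productId++)
{
    reviews.Add((
        ProductId: productId,                               // structured: FK from a known range
        Rating: rng.Next(1, 6),                             // structured: 1-5 stars
        Text: GenerateReviewText($"Product {productId}"))); // semantic: LLM (mocked here)
}

Console.WriteLine($"{reviews.Count} reviews generated");
```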
Managing Relationships and Bulk Performance
When seeding complex graphs of objects, the order of operations matters. EF Core tracks changes in its Change Tracker. While convenient for small datasets, the overhead of change tracking becomes a bottleneck when inserting thousands of entities.
The N+1 Problem in Seeding: If you generate 1,000 customers, 10 orders per customer, and 5 items per order, you are creating over 60,000 entities. Calling SaveChanges inside the loop turns this into thousands of database round-trips, while a single SaveChanges at the end leaves the Change Tracker holding a snapshot of every one of those entities.
The Solution:
- Disabling Change Tracking: For bulk inserts, we must instruct EF Core to stop tracking entities after they are added to the context. This prevents the memory overhead of maintaining a snapshot of the original state for every object.
- Bulk Extensions: While EF Core's AddRange is optimized, it still generates individual INSERT statements (or a massive batched SQL script). For truly high-performance seeding (millions of rows), we often look toward extensions like EFCore.BulkExtensions or raw SqlBulkCopy to bypass EF Core's SQL generation entirely and stream data directly into the database engine.
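The batching half of this pattern can be sketched in plain C#. The `context` variable in the comments is assumed to exist in a real seeding routine; AddRange, SaveChangesAsync, and ChangeTracker.Clear() are genuine EF Core APIs:

```csharp
using System;
using System.Linq;

// Bulk-seeding pattern: insert in fixed-size batches and clear tracking
// between batches so memory stays flat. Shown here with plain data; the
// EF Core calls are indicated in comments.
var entities = Enumerable.Range(1, 10_000).ToArray();
const int batchSize = 1_000;
int batches = 0;

foreach (var batch in entities.Chunk(batchSize))
{
    // context.Customers.AddRange(batch);
    // await context.SaveChangesAsync();
    // context.ChangeTracker.Clear();   // drop tracking snapshots before the next batch
    batches++;
}

Console.WriteLine($"{batches} batches of {batchSize}");
```

Clearing the tracker between batches (rather than disabling it globally) keeps each SaveChangesAsync call small and predictable.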
Visualizing the Data Graph
To visualize how synthetic data respects the hierarchical structure of a domain model, consider the following relationship graph. The generation process must traverse from the root (Customer) down to the leaves (OrderItems), ensuring that every child has a reference to its parent before the context is saved.
The Bridge to Vector Databases and RAG
Finally, the theoretical foundation of this chapter rests on the downstream application of this data. Once seeded, the text fields generated by our synthetic process (e.g., Product.Description, Customer.Review) are typically passed through an embedding model (like OpenAI's text-embedding-ada-002 or a local model).
This transforms the unstructured text into Vectors (arrays of floating-point numbers). These vectors are stored in a Vector Database.
Why Synthetic Data is Superior for RAG Testing: In a production environment, user queries are unpredictable. In a synthetic environment, we can control the "Ground Truth." We can generate a question like, "What is the battery life of the X-200 laptop?" and ensure the synthetic review containing that answer is present in the database. This allows us to rigorously test the Recall and Precision of our vector search algorithms. We can artificially inject "noise" (irrelevant reviews) to ensure the vector search filters them out effectively.
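The ground-truth idea can be sketched with toy vectors: plant one review whose embedding is known to match the query, add noise, and assert that retrieval finds the planted answer. The 3-dimensional vectors below are hand-made stand-ins for real embeddings:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Ground-truth recall check with toy "embeddings". Real embeddings have
// hundreds of dimensions; these are hand-made so the expected nearest
// neighbour is known in advance.
double Cosine(double[] a, double[] b)
{
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return dot / (Math.Sqrt(na) * Math.Sqrt(nb));
}

// Seeded corpus: one planted answer plus deliberate noise.
var corpus = new List<(string Text, double[] Vec)>
{
    ("X-200 battery lasts 12 hours", new[] { 0.9, 0.1, 0.0 }),  // ground truth
    ("Great shipping experience",    new[] { 0.0, 0.2, 0.9 }),  // noise
    ("Keyboard feels mushy",         new[] { 0.1, 0.9, 0.1 }),  // noise
};

var query = new[] { 0.95, 0.05, 0.0 };  // "What is the X-200 battery life?"
var best = corpus.OrderByDescending(d => Cosine(query, d.Vec)).First();
Console.WriteLine(best.Text); // X-200 battery lasts 12 hours
```

Because we planted the answer ourselves, a retrieval miss here is unambiguously a bug in the search pipeline, not an ambiguity in the data.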
The Seeding Workflow
- Domain Modeling: Define strict C# classes with EF Core configurations.
- Rule Definition: Determine the statistical distribution of data (e.g., 80% of customers are active, 20% are inactive).
- Contextual Generation: Use a generator (like Bogus) to create a graph of objects, ensuring parent-child relationships are preserved.
- LLM Augmentation: For text fields requiring semantic depth, prompt an LLM to generate realistic paragraphs based on the entity's context.
- Bulk Ingestion: Disable change tracking and use bulk insert mechanisms to persist the data to the SQL database.
- Vectorization: Process the seeded text into embeddings and populate the vector database.
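As a sketch of the rule-definition step above, an 80/20 active/inactive split can be enforced deterministically so even small seed sets match the target distribution exactly (status values here are illustrative):

```csharp
using System;
using System.Linq;

// Rule definition: enforce an exact 80/20 active/inactive split rather
// than relying on raw chance, which drifts on small seed sets.
const int total = 1000;
var statuses = Enumerable.Range(0, total)
    .Select(i => i < total * 0.8 ? "Active" : "Inactive")
    .OrderBy(_ => Guid.NewGuid())   // shuffle so row order carries no signal
    .ToArray();

int active = statuses.Count(s => s == "Active");
Console.WriteLine($"{active} active, {total - active} inactive"); // 800 active, 200 inactive
```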
By mastering synthetic data seeding, you move from being a passive consumer of data to an active architect of your data ecosystem, enabling rapid iteration and rigorous testing of AI capabilities.
Basic Code Example
Here is a basic 'Hello World' example for seeding an EF Core database with synthetic data generated by an LLM.
using System;
using System.ComponentModel;
using System.ComponentModel.DataAnnotations;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Microsoft.SemanticKernel;

// 1. Define the Domain Model
public class BlogPost
{
    [Key]
    public int Id { get; set; }
    public string Title { get; set; } = string.Empty;
    public string Content { get; set; } = string.Empty;
    public string Author { get; set; } = string.Empty;
    public DateTime PublishedDate { get; set; }
}

// 2. Define the EF Core DbContext
public class BlogContext : DbContext
{
    public DbSet<BlogPost> BlogPosts { get; set; }

    protected override void OnConfiguring(DbContextOptionsBuilder options)
        => options.UseSqlite("Data Source=blog.db");
}

// 3. Define the LLM Plugin for Data Generation
public class DataGeneratorPlugin
{
    [KernelFunction("generate_blog_post")]
    [Description("Generates a single, realistic blog post object with title, content, author, and date.")]
    public BlogPost GenerateBlogPost()
    {
        // In a real scenario, this function would call an LLM. For this
        // "Hello World" example, we simulate the LLM response so the code
        // is runnable without external API keys.
        return new BlogPost
        {
            Title = "The Future of AI in .NET",
            Content = "Artificial Intelligence is rapidly changing how we write software...",
            Author = "Jane Doe",
            PublishedDate = DateTime.UtcNow
        };
    }
}

// 4. Main Execution Logic
public class Program
{
    public static async Task Main(string[] args)
    {
        // A. Setup the Kernel (The Orchestrator)
        var kernel = Kernel.CreateBuilder()
            .AddOpenAIChatCompletion("gpt-4o-mini", Environment.GetEnvironmentVariable("OPENAI_API_KEY") ?? "demo-key")
            .Build();

        // B. Register the Plugin
        var generatorPlugin = new DataGeneratorPlugin();
        kernel.Plugins.AddFromObject(generatorPlugin, "DataGen");

        // C. Setup the Database
        using var context = new BlogContext();
        await context.Database.EnsureDeletedAsync(); // Clean slate for demo
        await context.Database.EnsureCreatedAsync();

        // D. Generate Data via Kernel
        Console.WriteLine("Generating synthetic data via LLM...");

        // Invoking the specific function defined in our plugin
        var result = await kernel.InvokeAsync("DataGen", "generate_blog_post");

        // E. Parse and Seed
        // The Kernel wraps the return value; GetValue<T> unwraps it into our domain object
        var syntheticPost = result.GetValue<BlogPost>()!;
        context.BlogPosts.Add(syntheticPost);
        await context.SaveChangesAsync();

        // F. Verify
        var savedPost = await context.BlogPosts.FirstAsync();
        Console.WriteLine($"Seeded: '{savedPost.Title}' by {savedPost.Author}");
    }
}
Code Explanation
- Domain Model Definition (BlogPost): This class represents the entity we want to populate. It uses standard C# properties, with [Key] (from System.ComponentModel.DataAnnotations) defining the primary key for Entity Framework. This is the structure that will eventually be mapped to a database table.
- Database Context (BlogContext): This class inherits from DbContext and defines a DbSet<BlogPost>, which tells EF Core that we want a table named BlogPosts to store our objects. OnConfiguring sets up a lightweight SQLite database file, making the example self-contained and runnable without a heavy SQL Server installation.
- The LLM Plugin (DataGeneratorPlugin): In modern .NET AI development, we encapsulate logic into "Plugins". This class acts as a wrapper around the LLM. The GenerateBlogPost method is marked with [KernelFunction]; in a production environment, the Semantic Kernel uses this metadata to send a prompt to an LLM (like GPT-4) to generate the text. Simulation: to make this "Hello World" code runnable without an API key, the method returns a hardcoded BlogPost object. In a real scenario, the LLM would return raw text or JSON, which the Kernel would parse into this object.
- Kernel Initialization: Kernel.CreateBuilder() initializes the orchestrator, and .AddOpenAIChatCompletion registers the connection to the AI model. Even though the generation is mocked in the plugin for this specific snippet, we still initialize the Kernel to follow the architectural pattern. kernel.Plugins.AddFromObject registers our C# class so the Kernel can see it and invoke it.
- Database Preparation: EnsureDeletedAsync and EnsureCreatedAsync are helper methods that reset the database state for this demonstration, ensuring you always start with a fresh, empty table.
- Invocation and Seeding: kernel.InvokeAsync is the bridge between the AI layer and the application logic; it triggers the specific function in our plugin. The result is converted back to a BlogPost. context.BlogPosts.Add tracks the entity, and SaveChangesAsync executes the SQL INSERT command against the SQLite database.
Common Pitfalls
- Inconsistent Data Types (LLM vs. Database):
  - The Mistake: LLMs are probabilistic. If you ask an LLM for a "Date", it might return "Yesterday", "2023-10-27", or "Next Tuesday". If your C# property is DateTime but the LLM returns such a string, the seeding process will crash.
  - The Fix: Always use Structured Output (JSON Schemas) when prompting LLMs for database seeding, or use System.Text.Json to parse defensively. In the example above, the DateTime is hardcoded to guarantee type safety for the "Hello World" demo.
- Ignoring Referential Integrity:
  - The Mistake: Generating a BlogPost with an AuthorId of 999 when no User with ID 999 exists in the database. This causes foreign key constraint violations.
  - The Fix: Always seed "Parent" entities (Authors/Categories) first, retrieve their generated IDs, and pass those IDs to the LLM prompt when generating "Child" entities (Blog Posts).
- Hallucinated Primary Keys:
  - The Mistake: Allowing the LLM to generate an Id field. LLMs have no concept of your database's current auto-increment state.
  - The Fix: Explicitly instruct the LLM to omit the ID field, and let the database (or EF Core) handle ID generation.
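A defensive-parsing sketch for the first pitfall, using System.Text.Json and a date fallback (the JSON string simulates a model response containing a non-ISO date):

```csharp
using System;
using System.Text.Json;

// Defensive parsing of LLM output: read fields as strings first, then
// validate each one before it touches a typed entity.
string llmJson = "{\"Title\":\"AI in .NET\",\"PublishedDate\":\"Next Tuesday\"}";

using var doc = JsonDocument.Parse(llmJson);
string title = doc.RootElement.GetProperty("Title").GetString() ?? "Untitled";
string rawDate = doc.RootElement.GetProperty("PublishedDate").GetString() ?? "";

// Fall back to a safe default instead of crashing the seeding run.
bool fallback = !DateTime.TryParse(rawDate, out var parsed);
DateTime published = fallback ? DateTime.UtcNow : parsed;

Console.WriteLine($"{title} (fallback date: {fallback})"); // AI in .NET (fallback date: True)
```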
Visualizing the Flow
The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.
Code License: All code examples are released under the MIT License. GitHub repo.
Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.
All textual explanations, original diagrams, and illustrations are the intellectual property of the author. To support the maintenance of this site via AdSense, please read this content exclusively online. Copying, redistribution, or reproduction is strictly prohibited.