Chapter 18: Seeding Databases with Synthetic Data
Theoretical Foundations
The generation of synthetic data is not merely a convenience; it is a foundational discipline for building robust, scalable, and secure AI-integrated applications. In the context of modern C# development using Entity Framework Core, the ability to programmatically generate high-fidelity, domain-specific data bridges the gap between static database schemas and the dynamic, probabilistic nature of Large Language Models (LLMs) and Vector Databases.
The Imperative of Synthetic Data in AI Architectures
To understand why synthetic data seeding is critical, consider the architecture of a modern Retrieval-Augmented Generation (RAG) system. A RAG pipeline relies on the semantic richness of a vector database (like Azure AI Search, Pinecone, or PostgreSQL with pgvector) to provide context to an LLM. However, the quality of the retrieval is directly proportional to the quality and volume of the data embedded.
If you rely solely on production data for development and testing, you encounter three immediate bottlenecks:
- Privacy & Compliance (GDPR/CCPA): Production data often contains PII (Personally Identifiable Information). Feeding this into external LLM APIs or storing it in development vector stores creates significant legal liability.
- Scarcity: Early-stage development requires volume. You cannot test the scalability of a vector search algorithm with five records. You need thousands, perhaps millions, of contextually diverse records.
- Edge Case Coverage: Production data reflects historical patterns. To test an AI's resilience, you need synthetic data that simulates rare events, outliers, and adversarial inputs.
Synthetic data generation allows us to create a "parallel universe" of data that mirrors the statistical properties and relational integrity of production without containing a single byte of actual user information.
The Role of EF Core and the Challenge of Referential Integrity
Entity Framework Core (EF Core) acts as the domain gatekeeper. It enforces the rules of our application through C# models. When seeding data, we are not simply writing SQL INSERT statements; we are instantiating C# objects that must satisfy the constraints defined in our DbContext.
The primary challenge in synthetic data generation for relational databases is Referential Integrity. Consider a standard e-commerce domain model: Customer, Order, and OrderItem. You cannot generate an Order without a valid CustomerID. If you generate data randomly, you risk violating foreign key constraints, causing the seeding process to fail.
This is where the distinction between Random generation and Contextual generation becomes vital. Random generation (e.g., picking a random integer for a foreign key) is brittle. Contextual generation ensures that child entities are generated in the context of their parents.
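Contextual generation can be sketched in a few lines of plain C# (simplified stand-in entities here, so no Bogus or EF Core is required to run it):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Contextual generation: children are created only from parents that
// actually exist. Plain C# sketch; in a real app these would be EF Core entities.
var rng = new Random(42);

// 1. Generate parents first and keep their IDs.
var customers = Enumerable.Range(1, 100)
    .Select(id => (Id: id, Name: $"Customer {id}"))
    .ToList();

// 2. Generate children by picking a *real* parent, never a random integer.
var orders = Enumerable.Range(1, 500)
    .Select(id => (Id: id, CustomerId: customers[rng.Next(customers.Count)].Id))
    .ToList();

// Every order satisfies the foreign-key constraint by construction.
var validIds = customers.Select(c => c.Id).ToHashSet();
Console.WriteLine(orders.All(o => validIds.Contains(o.CustomerId))); // True
```

The random-generation equivalent (`rng.Next(1, 10_000)` for `CustomerId`) would routinely produce IDs with no matching parent row, which is exactly the brittleness described above.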
Analogy: The Film Set Prop Department
Imagine a film set for a complex historical drama. The set requires thousands of props: letters, furniture, costumes, and background extras.
- Production Data is like using real historical artifacts. It is authentic but fragile, expensive to acquire, and risky to handle (insurance, damage).
- Synthetic Data is the work of the prop department.
The prop department (our seeding logic) does not simply grab random items from a warehouse. They build items that look authentic (statistical fidelity) and fit the specific scene (referential integrity). If the script calls for a letter written by a specific character to another, the prop department ensures the letter exists and is handed to the correct actor. They cannot hand a letter to an actor who hasn't been cast yet.
In our C# application, the DbContext is the Director, and the Synthetic Data Generator is the Prop Master. The Prop Master must know the script (the domain model) to ensure that every prop is valid for the scene (the database state).
The Evolution of Data Generation: From Faker to LLMs
Historically, libraries like Bogus (a C# port of the Faker.js library) have been the standard for generating dummy data. They excel at creating localized, structured data (names, addresses, dates) based on predefined rules.
However, with the rise of AI, the requirements have shifted. We no longer just need structured data; we need semantic data. For a RAG system, a synthetic user review that says "Product is good" is useless. A review that says, "The battery life is exceptional, lasting over 12 hours, though the screen brightness struggles in direct sunlight," provides rich vector embeddings.
This introduces a hybrid approach:
- Structured Seeding (Bogus/Faker): Used for foreign keys, dates, and scalar values.
- Unstructured/Textual Seeding (LLMs): Used for generating domain-specific text (reviews, support tickets, code comments) that serves as the "chunk" data for vectorization.
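A minimal sketch of this hybrid approach, with the LLM call mocked by a hypothetical `GenerateReviewText` helper so the snippet runs without an API key:

```csharp
using System;
using System.Collections.Generic;

// Hybrid seeding sketch: scalar fields come from rule-based generation,
// text fields from an LLM. GenerateReviewText is a stand-in for a real
// chat-completion call.
var rng = new Random(1);

string GenerateReviewText(string productName) =>
    // In production this would call an LLM with a prompt such as:
    // $"Write a realistic 2-sentence customer review for {productName}."
    $"The {productName} exceeded expectations; battery life is excellent, " +
    "though the screen struggles in direct sunlight.";

var reviews = new List<(int ProductId, int Rating, string Text)>();
for (int productId = 1; productId <= 3; productId++)
{
    reviews.Add((
        ProductId: productId,                               // structured: FK from a known range
        Rating: rng.Next(1, 6),                             // structured: 1-5 stars
        Text: GenerateReviewText($"Product {productId}"))); // semantic: LLM (mocked here)
}

Console.WriteLine($"{reviews.Count} reviews generated");
```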
Managing Relationships and Bulk Performance
When seeding complex graphs of objects, the order of operations matters. EF Core tracks changes in its Change Tracker. While convenient for small datasets, the overhead of change tracking becomes a bottleneck when inserting thousands of entities.
The N+1 Problem in Seeding: If you generate 1,000 customers, 10 orders per customer, and 5 items per order, you are creating over 60,000 entities. Calling SaveChanges inside the loop turns this into thousands of database round-trips, while a single SaveChanges at the end leaves the Change Tracker holding a snapshot of every one of those entities.
The Solution:
- Disabling Change Tracking: For bulk inserts, we must instruct EF Core to stop tracking entities after they are added to the context. This prevents the memory overhead of maintaining a snapshot of the original state for every object.
- Bulk Extensions: While EF Core's AddRange is optimized, it still generates individual INSERT statements (or a massive batched SQL script). For truly high-performance seeding (millions of rows), we often look toward extensions like EFCore.BulkExtensions or raw SqlBulkCopy to bypass EF Core's SQL generation entirely and stream data directly into the database engine.
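The batching half of this pattern can be sketched in plain C#. The `context` variable in the comments is assumed to exist in a real seeding routine; AddRange, SaveChangesAsync, and ChangeTracker.Clear() are genuine EF Core APIs:

```csharp
using System;
using System.Linq;

// Bulk-seeding pattern: insert in fixed-size batches and clear tracking
// between batches so memory stays flat. Shown here with plain data; the
// EF Core calls are indicated in comments.
var entities = Enumerable.Range(1, 10_000).ToArray();
const int batchSize = 1_000;
int batches = 0;

foreach (var batch in entities.Chunk(batchSize))
{
    // context.Customers.AddRange(batch);
    // await context.SaveChangesAsync();
    // context.ChangeTracker.Clear();   // drop tracking snapshots before the next batch
    batches++;
}

Console.WriteLine($"{batches} batches of {batchSize}");
```

Clearing the tracker between batches (rather than disabling it globally) keeps each SaveChangesAsync call small and predictable.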
Visualizing the Data Graph
To visualize how synthetic data respects the hierarchical structure of a domain model, consider the following relationship graph. The generation process must traverse from the root (Customer) down to the leaves (OrderItems), ensuring that every child has a reference to its parent before the context is saved.
The Bridge to Vector Databases and RAG
Finally, the theoretical foundation of this chapter rests on the downstream application of this data. Once seeded, the text fields generated by our synthetic process (e.g., Product.Description, Customer.Review) are typically passed through an embedding model (like OpenAI's text-embedding-ada-002 or a local model).
This transforms the unstructured text into Vectors (arrays of floating-point numbers). These vectors are stored in a Vector Database.
Why Synthetic Data is Superior for RAG Testing: In a production environment, user queries are unpredictable. In a synthetic environment, we can control the "Ground Truth." We can generate a question like, "What is the battery life of the X-200 laptop?" and ensure the synthetic review containing that answer is present in the database. This allows us to rigorously test the Recall and Precision of our vector search algorithms. We can artificially inject "noise" (irrelevant reviews) to ensure the vector search filters them out effectively.
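The ground-truth idea can be sketched with toy vectors: plant one review whose embedding is known to match the query, add noise, and assert that retrieval finds the planted answer. The 3-dimensional vectors below are hand-made stand-ins for real embeddings:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Ground-truth recall check with toy "embeddings". Real embeddings have
// hundreds of dimensions; these are hand-made so the expected nearest
// neighbour is known in advance.
double Cosine(double[] a, double[] b)
{
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return dot / (Math.Sqrt(na) * Math.Sqrt(nb));
}

// Seeded corpus: one planted answer plus deliberate noise.
var corpus = new List<(string Text, double[] Vec)>
{
    ("X-200 battery lasts 12 hours", new[] { 0.9, 0.1, 0.0 }),  // ground truth
    ("Great shipping experience",    new[] { 0.0, 0.2, 0.9 }),  // noise
    ("Keyboard feels mushy",         new[] { 0.1, 0.9, 0.1 }),  // noise
};

var query = new[] { 0.95, 0.05, 0.0 };  // "What is the X-200 battery life?"
var best = corpus.OrderByDescending(d => Cosine(query, d.Vec)).First();
Console.WriteLine(best.Text); // X-200 battery lasts 12 hours
```

Because we planted the answer ourselves, a retrieval miss here is unambiguously a bug in the search pipeline, not an ambiguity in the data.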
The Seeding Workflow
- Domain Modeling: Define strict C# classes with EF Core configurations.
- Rule Definition: Determine the statistical distribution of data (e.g., 80% of customers are active, 20% are inactive).
- Contextual Generation: Use a generator (like Bogus) to create a graph of objects, ensuring parent-child relationships are preserved.
- LLM Augmentation: For text fields requiring semantic depth, prompt an LLM to generate realistic paragraphs based on the entity's context.
- Bulk Ingestion: Disable change tracking and use bulk insert mechanisms to persist the data to the SQL database.
- Vectorization: Process the seeded text into embeddings and populate the vector database.
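As a sketch of the rule-definition step above, an 80/20 active/inactive split can be enforced deterministically so even small seed sets match the target distribution exactly (status values here are illustrative):

```csharp
using System;
using System.Linq;

// Rule definition: enforce an exact 80/20 active/inactive split rather
// than relying on raw chance, which drifts on small seed sets.
const int total = 1000;
var statuses = Enumerable.Range(0, total)
    .Select(i => i < total * 0.8 ? "Active" : "Inactive")
    .OrderBy(_ => Guid.NewGuid())   // shuffle so row order carries no signal
    .ToArray();

int active = statuses.Count(s => s == "Active");
Console.WriteLine($"{active} active, {total - active} inactive"); // 800 active, 200 inactive
```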
By mastering synthetic data seeding, you move from being a passive consumer of data to an active architect of your data ecosystem, enabling rapid iteration and rigorous testing of AI capabilities.
Basic Code Example
Here is a basic 'Hello World' example for seeding an EF Core database with synthetic data generated by an LLM.
using System;
using System.ComponentModel;
using System.ComponentModel.DataAnnotations;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Microsoft.SemanticKernel;

// 1. Define the Domain Model
public class BlogPost
{
    [Key]
    public int Id { get; set; }
    public string Title { get; set; } = string.Empty;
    public string Content { get; set; } = string.Empty;
    public string Author { get; set; } = string.Empty;
    public DateTime PublishedDate { get; set; }
}

// 2. Define the EF Core DbContext
public class BlogContext : DbContext
{
    public DbSet<BlogPost> BlogPosts { get; set; }

    protected override void OnConfiguring(DbContextOptionsBuilder options)
        => options.UseSqlite("Data Source=blog.db");
}

// 3. Define the LLM Plugin for Data Generation
public class DataGeneratorPlugin
{
    [KernelFunction("generate_blog_post")]
    [Description("Generates a single, realistic blog post object with title, content, author, and date.")]
    public BlogPost GenerateBlogPost()
    {
        // In a real scenario, this function would call an LLM. For this
        // "Hello World" example, we simulate the LLM response so the code
        // is runnable without external API keys.
        return new BlogPost
        {
            Title = "The Future of AI in .NET",
            Content = "Artificial Intelligence is rapidly changing how we write software...",
            Author = "Jane Doe",
            PublishedDate = DateTime.UtcNow
        };
    }
}

// 4. Main Execution Logic
public class Program
{
    public static async Task Main(string[] args)
    {
        // A. Setup the Kernel (The Orchestrator)
        var kernel = Kernel.CreateBuilder()
            .AddOpenAIChatCompletion("gpt-4o-mini", Environment.GetEnvironmentVariable("OPENAI_API_KEY") ?? "demo-key")
            .Build();

        // B. Register the Plugin
        var generatorPlugin = new DataGeneratorPlugin();
        kernel.Plugins.AddFromObject(generatorPlugin, "DataGen");

        // C. Setup the Database
        using var context = new BlogContext();
        await context.Database.EnsureDeletedAsync(); // Clean slate for demo
        await context.Database.EnsureCreatedAsync();

        // D. Generate Data via Kernel
        Console.WriteLine("Generating synthetic data via LLM...");

        // Invoking the specific function defined in our plugin
        var result = await kernel.InvokeAsync("DataGen", "generate_blog_post");

        // E. Parse and Seed
        // The Kernel wraps the return value; GetValue<T> unwraps it into our domain object
        var syntheticPost = result.GetValue<BlogPost>()!;
        context.BlogPosts.Add(syntheticPost);
        await context.SaveChangesAsync();

        // F. Verify
        var savedPost = await context.BlogPosts.FirstAsync();
        Console.WriteLine($"Seeded: '{savedPost.Title}' by {savedPost.Author}");
    }
}
Code Explanation
- Domain Model Definition (BlogPost): This class represents the entity we want to populate. It uses standard C# properties, with [Key] (from System.ComponentModel.DataAnnotations) defining the primary key for Entity Framework. This is the structure that will eventually be mapped to a database table.
- Database Context (BlogContext): This class inherits from DbContext and defines a DbSet<BlogPost>, which tells EF Core that we want a table named BlogPosts to store our objects. OnConfiguring sets up a lightweight SQLite database file, making the example self-contained and runnable without a heavy SQL Server installation.
- The LLM Plugin (DataGeneratorPlugin): In modern .NET AI development, we encapsulate logic into "Plugins". This class acts as a wrapper around the LLM. The GenerateBlogPost method is marked with [KernelFunction]; in a production environment, the Semantic Kernel uses this metadata to send a prompt to an LLM (like GPT-4) to generate the text. Simulation: to make this "Hello World" code runnable without an API key, the method returns a hardcoded BlogPost object. In a real scenario, the LLM would return raw text or JSON, which the Kernel would parse into this object.
- Kernel Initialization: Kernel.CreateBuilder() initializes the orchestrator, and .AddOpenAIChatCompletion registers the connection to the AI model. Even though the generation is mocked in the plugin for this specific snippet, we still initialize the Kernel to follow the architectural pattern. kernel.Plugins.AddFromObject registers our C# class so the Kernel can see it and invoke it.
- Database Preparation: EnsureDeletedAsync and EnsureCreatedAsync are helper methods that reset the database state for this demonstration, ensuring you always start with a fresh, empty table.
- Invocation and Seeding: kernel.InvokeAsync is the bridge between the AI layer and the application logic; it triggers the specific function in our plugin. The result is converted back to a BlogPost. context.BlogPosts.Add tracks the entity, and SaveChangesAsync executes the SQL INSERT command against the SQLite database.
Common Pitfalls
- Inconsistent Data Types (LLM vs. Database):
  - The Mistake: LLMs are probabilistic. If you ask an LLM for a "Date", it might return "Yesterday", "2023-10-27", or "Next Tuesday". If your C# property is DateTime but the LLM returns such a string, the seeding process will crash.
  - The Fix: Always use Structured Output (JSON Schemas) when prompting LLMs for database seeding, or use System.Text.Json to parse defensively. In the example above, the DateTime is hardcoded to guarantee type safety for the "Hello World" demo.
- Ignoring Referential Integrity:
  - The Mistake: Generating a BlogPost with an AuthorId of 999 when no User with ID 999 exists in the database. This causes foreign key constraint violations.
  - The Fix: Always seed "Parent" entities (Authors/Categories) first, retrieve their generated IDs, and pass those IDs to the LLM prompt when generating "Child" entities (Blog Posts).
- Hallucinated Primary Keys:
  - The Mistake: Allowing the LLM to generate an Id field. LLMs have no concept of your database's current auto-increment state.
  - The Fix: Explicitly instruct the LLM to omit the ID field, and let the database (or EF Core) handle ID generation.
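A defensive-parsing sketch for the first pitfall, using System.Text.Json and a date fallback (the JSON string simulates a model response containing a non-ISO date):

```csharp
using System;
using System.Text.Json;

// Defensive parsing of LLM output: read fields as strings first, then
// validate each one before it touches a typed entity.
string llmJson = "{\"Title\":\"AI in .NET\",\"PublishedDate\":\"Next Tuesday\"}";

using var doc = JsonDocument.Parse(llmJson);
string title = doc.RootElement.GetProperty("Title").GetString() ?? "Untitled";
string rawDate = doc.RootElement.GetProperty("PublishedDate").GetString() ?? "";

// Fall back to a safe default instead of crashing the seeding run.
bool fallback = !DateTime.TryParse(rawDate, out var parsed);
DateTime published = fallback ? DateTime.UtcNow : parsed;

Console.WriteLine($"{title} (fallback date: {fallback})"); // AI in .NET (fallback date: True)
```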
Visualizing the Flow
The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.
Code License: All code examples are released under the MIT License. GitHub repo.
Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.
All textual explanations, original diagrams, and illustrations are the intellectual property of the author. To support the maintenance of this site via AdSense, please read this content exclusively online. Copying, redistribution, or reproduction is strictly prohibited.