
Chapter 10: Seed Data Generation using LLMs

Theoretical Foundations

In the foundational chapters of our boilerplate, we established the AIState as a critical server-side construct. We defined it as the model's persistent memory, a structured representation of its understanding of a task, complete with tool calls and intermediate data. This concept is the cornerstone of building stateful AI applications. Now we pivot: instead of using the AI to react to user input, we use it as a proactive engine for a different, equally vital purpose—populating our application's database.

Traditionally, seeding a database involves writing static JSON or SQL fixture files. You manually craft a handful of users, products, and orders. This process is brittle, time-consuming, and, most importantly, lacks realism and scale. A developer might create user_01 through user_10, but these entities lack believable relationships, varied personalities, or the complex, messy data patterns found in a real-world production environment. How do you test a recommendation engine with only ten products? How do you simulate a user's evolving profile with just one or two posts?

This is where we introduce the concept of a Generative Data Engine. Instead of manually writing data, we define a schema and a set of rules, and then delegate the tedious work of data creation to a Large Language Model. The LLM becomes our tireless, infinitely creative data entry clerk. But this is not a simple "generate 100 users" command. It's a sophisticated, multi-step process that mirrors the architectural patterns we've already built.

To understand this shift, consider the analogy of building a city. The traditional method is like a film set: you build a few facades that look good from one angle. Our Generative Data Engine is like a simulation game (think Cities: Skylines or SimCity). You don't place every single citizen and car. You set the rules—zoning, population density, economic policies—and the simulation generates a living, breathing city with emergent complexity. Similarly, we provide the schema (the zoning laws) and the LLM generates the citizens (the data) who interact according to those laws, creating a rich, believable world for our application to operate in.

The Challenge of Relational Integrity

The primary difficulty in using LLMs for data generation is not generating text; it's maintaining referential integrity. Our SaaS boilerplate, as established in the database chapter, is not a simple key-value store. It's a relational database, possibly with vector extensions. This means tables are interconnected. A Post must belong to a User. An Order must be linked to a Customer. A Comment needs both a Post and a User.

If we were to naively prompt an LLM: "Generate 50 users and 200 posts," we would face chaos. The model might generate users named "Alice" and "Bob," but the posts it generates might reference "Charlie" and "Diana," who don't exist in our user table. The foreign keys would be broken, and our database would be in an inconsistent state.
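The "Charlie and Diana" problem above can be caught mechanically before any database write. As a minimal sketch (the `GeneratedUser`/`GeneratedPost` types and `findOrphanedPosts` helper are illustrative, not part of the boilerplate), a single pass over the generated data reveals every broken foreign key:

```typescript
// Hypothetical sketch: validate foreign keys in LLM-generated data
// before attempting any database writes.
type GeneratedUser = { id: string; name: string };
type GeneratedPost = { id: string; userId: string; title: string };

// Returns the posts whose userId does not match any generated user --
// exactly the broken-foreign-key scenario described above.
function findOrphanedPosts(users: GeneratedUser[], posts: GeneratedPost[]): GeneratedPost[] {
  const userIds = new Set(users.map((u) => u.id));
  return posts.filter((p) => !userIds.has(p.userId));
}

// Example: "charlie" was never generated as a user.
const users: GeneratedUser[] = [
  { id: 'alice', name: 'Alice' },
  { id: 'bob', name: 'Bob' },
];
const posts: GeneratedPost[] = [
  { id: 'p1', userId: 'alice', title: 'Hello' },
  { id: 'p2', userId: 'charlie', title: 'Broken FK' },
];
const orphans = findOrphanedPosts(users, posts);
console.log(orphans.map((p) => p.id)); // ["p2"]
```

A check like this makes a useful last line of defense even once the workflow itself enforces integrity, since it costs one set lookup per record.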

This is where the principles of LangGraph and state management become essential. We cannot treat data generation as a single, monolithic task. It must be a structured workflow, a graph of specialized nodes, where each node has a single responsibility and the state is carefully managed and passed between them. We are essentially building a dedicated agent-based system for data creation.

A Workflow of Symbiotic Nodes

Our Generative Data Engine will be a LangGraph that orchestrates a sequence of operations. Each node in this graph performs a distinct part of the data creation process, and they work together to ensure a coherent final dataset.

  1. The Schema Node: This is the entry point. It doesn't generate data itself but rather sets the stage. It consumes your database schema (perhaps introspected from your ORM) and formats it into a clear, machine-readable prompt. This is the "blueprint" for all subsequent generations.

  2. The Parent Generator Node: This node takes the schema and generates a set of "parent" entities. These are the top-level records that other records will depend on, such as Users or Products. The output of this node is a list of structured JSON objects. Crucially, these are now our "source of truth" for IDs.

  3. The Child Generator Node: This is where the magic of relational integrity happens. This node receives the schema and the generated parent data. Its prompt is more complex: "Generate 200 Posts, but ensure the userId field in each post corresponds to a real id from the provided list of users." It acts like a function that generates data within a specific, constrained context. It might also be responsible for creating "sibling" entities that share a common parent, like Comments for a Post.

  4. The Vector & Embedding Node (Specialized Task): For tables that require vector support (e.g., for semantic search), a standard JSON generator is insufficient. This specialized node takes the generated text content (like a Post's body or a Product's description) and calls an embedding model to convert that text into a vector embedding. This vector is then appended to the record before it's stored. This step is what enables the "diversity" mentioned in our chapter outline. By generating varied, descriptive text, we create a rich semantic space in our vector database, allowing for meaningful similarity searches later.

  5. The Persister Node: The final node in the chain. It takes the fully-formed, validated, and vector-enriched data objects and writes them to the database using our ORM. This is the "commit" phase of our data transaction.
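The five nodes above can be sketched as typed functions over a shared state object. The names here (`PipelineState`, `schemaNode`, and so on) are our own illustrations, not LangGraph's actual API—the real graph would register these as LangGraph nodes—but the shape of the data flow is the same: each node returns a partial update that is merged back into the state the next node reads.

```typescript
// Illustrative sketch of the pipeline state and node signatures.
// Embedding and persistence are elided; the point is the state hand-off.
type PipelineState = {
  schemaPrompt: string;           // output of the Schema Node
  parents: { id: string }[];      // output of the Parent Generator
  children: { userId: string }[]; // output of the Child Generator
};

// Each node takes the state and returns a partial update, mirroring
// how graph nodes merge their results back into shared state.
const schemaNode = (_s: PipelineState) => ({
  schemaPrompt: 'User(id), Post(userId -> User.id)',
});
const parentNode = (_s: PipelineState) => ({
  parents: [{ id: 'u1' }, { id: 'u2' }], // in reality: an LLM call
});
const childNode = (s: PipelineState) => ({
  // Constrained generation: children only reference existing parent IDs.
  children: s.parents.map((p) => ({ userId: p.id })),
});

// A linear "graph": fold each node's partial update into the state.
let state: PipelineState = { schemaPrompt: '', parents: [], children: [] };
for (const node of [schemaNode, parentNode, childNode]) {
  state = { ...state, ...node(state) };
}
console.log(state.children); // every child references a real parent id
```

Because `childNode` only ever reads IDs out of `state.parents`, referential integrity is guaranteed by construction rather than by hoping the LLM behaves.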

Visualizing the Data Generation Pipeline

This workflow can be visualized as a linear graph where the state flows from a general schema definition to specific, relational, and vector-enriched data.

A linear pipeline diagram traces the flow from a general schema definition through relational generation to vector-enriched data, ending with the Persister node's commit phase.

This pipeline ensures that each step builds upon the last. The state (the generated parent IDs, the text content for embedding) is passed forward, creating a chain of custody for the data. This is a direct application of the Conditional Edge concept, though in this case, our graph is largely linear. The predicate for the ChildGen node's incoming edge is "Has parent data been generated?" and the Persist node's incoming edge is "Has all data been generated and vectorized?".
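Those two edge predicates can be written as plain boolean functions over the state. The helper names below are our own (not LangGraph's API); each gate simply checks that the state carries what the next node needs before the graph advances.

```typescript
// Sketch of the edge predicates described above: "Has parent data been
// generated?" gates the ChildGen node, and "Has all data been generated
// and vectorized?" gates the Persist node.
type SeedState = {
  parents: { id: string }[];
  children: { userId: string; embedding?: number[] }[];
};

const canRunChildGen = (s: SeedState) => s.parents.length > 0;
const canPersist = (s: SeedState) =>
  s.children.length > 0 && s.children.every((c) => c.embedding !== undefined);

// Before any generation: both gates are closed.
const empty: SeedState = { parents: [], children: [] };
// After generation and embedding: both gates open.
const ready: SeedState = {
  parents: [{ id: 'u1' }],
  children: [{ userId: 'u1', embedding: [0.1, 0.2] }],
};
console.log(canRunChildGen(empty), canPersist(empty)); // false false
console.log(canRunChildGen(ready), canPersist(ready)); // true true
```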

Analogy: The AI-Powered Construction Crew

Let's solidify this with a final, comprehensive web development analogy. Imagine you are building a complex e-commerce site.

  • Traditional Seeding: You are the sole architect. You manually write a list of 10 customers, 20 products, and 50 orders in a spreadsheet, then write a script to import it. It's slow, and if you change the product schema, you have to rewrite the whole spreadsheet.

  • Our Generative Data Engine: You hire a foreman (the LangGraph Orchestrator).

    1. You give the foreman the architectural blueprints (the Schema Node).
    2. The foreman hires a team of specialized subcontractors:
      • A "Customer Team" (Parent Generator) that builds 100 realistic customer profiles, complete with names, addresses, and preferences. They hand you a list of customer IDs.
      • A "Product Team" (Parent Generator) that designs 500 unique products with detailed descriptions and prices.
      • A "Sales Team" (Child Generator). You give them the customer and product lists. Their job is to generate 5,000 realistic orders. They know they can't create an order for a product that doesn't exist, because they are working directly from the lists provided by the other teams. This is referential integrity.
      • A "Marketing Team" (Vector & Embedding Node). They take all the product descriptions and customer reviews, write compelling, varied marketing copy for them, and then run it through their "SEO analysis machine" (the embedding model) to generate keyword vectors for the site's search engine.
    3. Finally, a "Data Entry Clerk" (Persister Node) takes all the finalized work from all teams and meticulously enters it into the live database.

By orchestrating this process, you have moved from being a manual laborer to a project manager. You define the requirements and the structure, and the AI-powered crew builds a complex, realistic, and fully relational dataset for you. This not only accelerates development but also provides a testing environment that is a far more accurate representation of production reality, allowing you to build and test features like advanced search, recommendation algorithms, and data analytics with confidence.

Basic Code Example

This example demonstrates a minimal, self-contained TypeScript script that generates relational seed data for a SaaS application's user and profile tables using the OpenAI API. We will focus on enforcing referential integrity (ensuring generated profile.userId matches an existing user.id) and using schema-aware prompts to produce structured JSON output. This pipeline is designed to be integrated into a database seeding script, such as one using Prisma or Drizzle ORM.

The core logic involves three steps:

  1. Generate Users: Create a list of synthetic users with unique IDs.
  2. Generate Profiles: For each user, generate a corresponding profile, explicitly referencing the user's ID to maintain the relational link.
  3. Upsert Data: Insert the generated data into a database (simulated here with console logs, but adaptable to an ORM).

The Code

/**
 * @file seed.ts
 * @description A self-contained script to generate relational seed data using LLMs.
 *              Simulates a SaaS boilerplate with User and Profile tables.
 *              Uses OpenAI's GPT-4o-mini for structured data generation.
 */

// --- IMPORTS ---
import OpenAI from 'openai';

// --- CONFIGURATION ---
// In a real app, load from environment variables (e.g., process.env.OPENAI_API_KEY)
const OPENAI_API_KEY = process.env.OPENAI_API_KEY ?? 'sk-...'; // Placeholder fallback: set the env var for execution
const openai = new OpenAI({ apiKey: OPENAI_API_KEY });

// --- TYPE DEFINITIONS ---
/**
 * Represents a User entity in the SaaS database.
 * @property id - Unique identifier (UUID string).
 * @property email - User's email address.
 * @property name - User's full name.
 */
type User = {
  id: string;
  email: string;
  name: string;
};

/**
 * Represents a Profile entity linked to a User.
 * @property id - Unique identifier (UUID string).
 * @property userId - Foreign key referencing User.id.
 * @property bio - User's biography text.
 * @property avatarUrl - URL to a placeholder avatar image.
 */
type Profile = {
  id: string;
  userId: string; // Foreign key for referential integrity
  bio: string;
  avatarUrl: string;
};

/**
 * Structured JSON response format for LLM generation.
 * Enforces a schema to ensure reliable parsing.
 */
type LLMResponse<T> = {
  data: T[];
};

// --- HELPER FUNCTIONS ---

/**
 * Generates a list of synthetic users using an LLM.
 * The prompt is engineered to produce structured JSON matching the `User` type.
 * @param count - Number of users to generate.
 * @returns A Promise resolving to an array of User objects.
 */
async function generateUsers(count: number): Promise<User[]> {
  const prompt = `
    You are a data generator for a SaaS application.
    Generate ${count} unique, realistic user records.
    Return ONLY valid JSON, no markdown code blocks.
    The JSON must be an object with a "data" key, containing an array of users.
    Each user must have: "id" (UUID v4 string), "email" (realistic), "name" (full name).
    Example: { "data": [{ "id": "123e4567-e89b-12d3-a456-426614174000", "email": "user@example.com", "name": "John Doe" }] }
  `;

  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: 'You are a structured JSON data generator.' },
      { role: 'user', content: prompt },
    ],
    response_format: { type: 'json_object' }, // Enforce JSON output
    temperature: 0.7, // Introduce some variability
  });

  const content = response.choices[0].message.content;
  if (!content) throw new Error('LLM returned empty content for users.');

  const parsed = JSON.parse(content) as LLMResponse<User>;
  return parsed.data;
}

/**
 * Generates profiles for existing users, enforcing referential integrity.
 * The LLM is given the list of users to ensure profile.userId matches a real user.id.
 * @param users - Array of existing users to generate profiles for.
 * @returns A Promise resolving to an array of Profile objects.
 */
async function generateProfiles(users: User[]): Promise<Profile[]> {
  const userContext = JSON.stringify(users, null, 2);
  const prompt = `
    You are a data generator for a SaaS application.
    Generate a profile for EACH user in the provided list.
    Return ONLY valid JSON, no markdown code blocks.
    The JSON must be an object with a "data" key, containing an array of profiles.
    For each user, create a profile with:

      - "id": A new UUID v4 string.
      - "userId": MUST be the exact "id" of the user from the input list. This is critical for database integrity.
      - "bio": A short, realistic biography (1-2 sentences).
      - "avatarUrl": A placeholder image URL (e.g., from https://api.dicebear.com).
    Input Users:
    ${userContext}
  `;

  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: 'You are a structured JSON data generator.' },
      { role: 'user', content: prompt },
    ],
    response_format: { type: 'json_object' },
    temperature: 0.7,
  });

  const content = response.choices[0].message.content;
  if (!content) throw new Error('LLM returned empty content for profiles.');

  const parsed = JSON.parse(content) as LLMResponse<Profile>;
  return parsed.data;
}

/**
 * Simulates an upsert operation into a database.
 * In a real SaaS boilerplate, this would use an ORM like Prisma or Drizzle.
 * @param users - Array of users to seed.
 * @param profiles - Array of profiles to seed.
 */
async function seedDatabase(users: User[], profiles: Profile[]): Promise<void> {
  console.log('--- SEEDING DATABASE ---');

  // Simulate User Upsert
  console.log(`Seeding ${users.length} users...`);
  for (const user of users) {
    // In a real app: await prisma.user.upsert({ where: { id: user.id }, update: user, create: user });
    console.log(`  [USER] ID: ${user.id} | Email: ${user.email} | Name: ${user.name}`);
  }

  // Simulate Profile Upsert
  console.log(`Seeding ${profiles.length} profiles...`);
  for (const profile of profiles) {
    // In a real app: await prisma.profile.upsert({ where: { id: profile.id }, update: profile, create: profile });
    console.log(`  [PROFILE] ID: ${profile.id} | User: ${profile.userId} | Bio: ${profile.bio.substring(0, 30)}...`);
  }

  console.log('--- SEEDING COMPLETE ---');
}

// --- MAIN EXECUTION LOGIC ---

/**
 * Orchestrates the seed data generation pipeline.
 * 1. Generates users.
 * 2. Generates profiles linked to users.
 * 3. Seeds the simulated database.
 */
async function main() {
  try {
    console.log('Starting seed data generation...');

    const USER_COUNT = 3; // Small number for "Hello World" demonstration

    // Step 1: Generate Users
    const users = await generateUsers(USER_COUNT);

    // Step 2: Generate Profiles (with referential integrity)
    const profiles = await generateProfiles(users);

    // Step 3: Seed Database
    await seedDatabase(users, profiles);

  } catch (error) {
    console.error('Error during seed generation:', error);
    // In a real SaaS app, this should trigger alerts or logging (e.g., Sentry)
  }
}

// Execute if run directly (assumes CommonJS output; in an ESM build, check import.meta.url instead)
if (require.main === module) {
  // Check for API Key warning
  if (OPENAI_API_KEY === 'sk-...') {
    console.warn('WARNING: OPENAI_API_KEY is not set. The script will fail if executed.');
    console.warn('Please set the OPENAI_API_KEY environment variable or update the placeholder.');
  }
  main();
}

export { generateUsers, generateProfiles, seedDatabase };

Detailed Explanation

1. The Core Concept: Schema-Aware Generation

The fundamental challenge with LLM-generated data is structure. LLMs are probabilistic; without constraints, they might return free text, malformed JSON, or fields that don't match your database schema.

How it works:

  • Prompt Engineering: We explicitly define the output format in the prompt (e.g., "Return ONLY valid JSON... with a 'data' key"). This guides the model to produce a predictable structure.
  • JSON Mode: The OpenAI SDK parameter response_format: { type: 'json_object' } forces the model to output valid JSON. This is a technical guardrail that complements the prompt.
  • Type Safety: We use TypeScript interfaces (User, Profile) to define the expected shape. While TypeScript doesn't enforce this at runtime, it ensures our code is type-safe and helps us structure the prompt correctly.
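Because TypeScript types are erased at runtime, the parsed LLM output must be checked by hand (or with a library like Zod, discussed under Common Pitfalls). A minimal hand-rolled sketch—`isUser` is our own illustrative guard, not part of any library—looks like this:

```typescript
// TypeScript's `User` type exists only at compile time; this guard
// re-checks the shape at runtime, after JSON.parse on the LLM output.
type User = { id: string; email: string; name: string };

function isUser(value: unknown): value is User {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.id === 'string' &&
    typeof v.email === 'string' &&
    typeof v.name === 'string'
  );
}

// A hallucinated record missing "email" is rejected before it reaches the DB.
const good = JSON.parse('{"id":"u1","email":"a@b.com","name":"Ada"}');
const bad = JSON.parse('{"id":"u2","name":"NoEmail"}');
console.log(isUser(good), isUser(bad)); // true false
```

Filtering the parsed `data` array through a guard like this turns a crash deep in the seeding script into a clean, early rejection.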

Under the Hood: When the LLM processes the prompt, it doesn't "know" TypeScript. It predicts the next token based on patterns in its training data. By providing examples and explicit instructions ("each user must have 'id', 'email', 'name'"), we steer the model's probability distribution toward the desired schema. The temperature: 0.7 setting introduces creativity (e.g., varied names), while temperature: 0.0 would produce near-deterministic, repetitive data.

2. Enforcing Referential Integrity

In a relational database, foreign keys (like profile.userId) must reference existing primary keys (user.id). If the LLM generates profiles independently of users, the IDs likely won't match, causing foreign key constraint violations.

How it works:

  • Two-Step Generation: We generate users first and store them in memory.
  • Context Injection: When generating profiles, we serialize the generated users array and inject it into the prompt (userContext). The prompt explicitly instructs: "userId: MUST be the exact 'id' of the user from the input list."
  • LLM Reasoning: The model now has the specific id values it generated for users. It uses this context to ensure the profile.userId field matches one of those IDs. This mimics the logic of a database join but happens entirely within the LLM's context window.

Visualizing the Data Flow:

This diagram illustrates how an LLM performs an in-context join by matching a userId from one data source to corresponding records in another, all processed within the model's context window.

3. The Upsert Operation

In a SaaS boilerplate, seeding data should be idempotent—meaning you can run it multiple times without creating duplicates or errors. The Upsert (Update or Insert) operation handles this.

How it works in this example: The seedDatabase function simulates an upsert. In a real-world scenario using an ORM like Prisma, the code would look like this:

// Real-world Prisma Upsert Example
await prisma.user.upsert({
  where: { id: user.id }, // Look for existing record by ID
  update: { name: user.name }, // If found, update fields
  create: { ...user }, // If not found, create new record
});

Why this matters for SaaS:

  • Development: You can run npm run db:seed repeatedly during development without manual cleanup.
  • Testing: Test suites can seed a known state before each test run.
  • CI/CD: Deployment pipelines can safely seed staging environments.
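The idempotency claim above can be demonstrated without a database. In this sketch a `Map` keyed by primary key stands in for the user table; running the same seed twice leaves exactly the same rows, which is the property `npm run db:seed` relies on:

```typescript
// In-memory sketch of why upsert is idempotent: the Map plays the role
// of the user table, keyed by primary key.
type Row = { id: string; name: string };
const table = new Map<string, Row>();

function upsert(row: Row): void {
  // Insert if absent, overwrite if present -- no duplicates either way.
  table.set(row.id, row);
}

const seed: Row[] = [
  { id: 'u1', name: 'Alice' },
  { id: 'u2', name: 'Bob' },
];
for (const row of seed) upsert(row); // first run
for (const row of seed) upsert(row); // second run: safe to repeat
console.log(table.size); // 2
```

A plain INSERT would have thrown a duplicate-key error (or created four rows) on the second pass; upsert converges to the same state no matter how many times it runs.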

Common Pitfalls

  1. Hallucinated JSON & Parsing Errors:

    • Issue: Even with response_format: 'json_object', LLMs can occasionally produce invalid JSON (e.g., trailing commas, unescaped strings). Attempting JSON.parse() on this will throw a runtime error.
    • Solution: Wrap parsing in a try-catch block. For production, use a schema validation library like Zod to parse the LLM output before using it. Zod can coerce types and validate strict structure, catching hallucinations where the LLM invents fields.
  2. Vercel/Serverless Timeouts:

    • Issue: Generating large datasets (e.g., 1000+ records) via sequential LLM calls can exceed serverless function timeouts (typically 10-60 seconds on Vercel).
    • Solution: Batch generation. Instead of generating 1 profile per user, generate 50 profiles in a single LLM call by providing a list of 50 user IDs in the prompt. This reduces the number of network round-trips.
  3. Async/Await Loops:

    • Issue: Using await inside a forEach loop is a common anti-pattern. forEach does not wait for promises to resolve; it fires them all simultaneously. This can lead to rate limits (e.g., OpenAI API rate limits) or race conditions if order matters.
    • Solution: Use for...of loops for sequential execution or Promise.all for parallel execution (with caution regarding rate limits).
    • Bad:

      users.forEach(async (user) => {
        await generateProfile(user); // Runs in parallel, uncontrolled
      });
      

    • Good (Sequential):

      for (const user of users) {
        await generateProfile(user); // Waits for each to finish
      }
      

  4. Context Window Limits:

    • Issue: If you try to generate profiles for 10,000 users in one go, the serialized user list might exceed the LLM's context window (e.g., 128k tokens for GPT-4o).
    • Solution: Chunk the data. Process users in batches of 50-100, generating profiles for each batch separately. This ensures the input data always fits within the model's context limit.
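The batching strategy from pitfalls 2 and 4 reduces to one generic helper: split a large user list into batches that each fit a single LLM call and a single context window. `chunk` below is our own helper, not a library function:

```typescript
// Generic chunking helper: splits `items` into consecutive batches of
// at most `size` elements, preserving order.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// 120 user IDs in batches of 50 -> 3 LLM calls instead of 120,
// and each prompt stays comfortably inside the context window.
const userIds = Array.from({ length: 120 }, (_, i) => `user-${i}`);
const batches = chunk(userIds, 50);
console.log(batches.length, batches[2].length); // 3 20
```

Each batch would then be passed to a function like `generateProfiles` from the main example, with the results concatenated before seeding.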

The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.





Code License: All code examples are released under the MIT License. Github repo.

Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.

All textual explanations, original diagrams, and illustrations are the intellectual property of the author. To support the maintenance of this site via AdSense, please read this content exclusively online. Copying, redistribution, or reproduction is strictly prohibited.