Chapter 9: Generating Synthetic Data for Fine-Tuning
Theoretical Foundations
At its core, synthetic data generation for fine-tuning is the process of using a large, powerful model (a "teacher") to create a dataset that trains a smaller, more efficient model (a "student") to perform a specific task. This is not merely about data augmentation; it is about knowledge distillation at the dataset level. The goal is to compress the vast, generalized knowledge of a massive transformer architecture into a lightweight model capable of running locally on consumer hardware via WebGPU or Transformers.js.
To understand this, we must look back at the fundamental architecture of Large Language Models (LLMs) introduced in earlier chapters. We previously discussed how the Attention Mechanism allows models to weigh the importance of different words in a sequence. In synthetic data generation, we are essentially curating the inputs and outputs to the Attention Mechanism to reinforce specific pathways in the smaller model's neural network.
The Teacher-Student Dynamic
Imagine a renowned university professor (the Teacher Model, e.g., GPT-4 or a large open-source model like Llama 3 70B) teaching a class of first-year students (the Student Model, a smaller 7B or 3B parameter model). The professor possesses deep, generalized knowledge across all disciplines. The students, however, have limited capacity and need to learn a specific skill—say, writing Python code for data analysis—efficiently.
If the students only read general textbooks, they will learn slowly and unevenly. However, if the professor generates a curated set of practice problems and solutions (the Synthetic Data), the students can focus their limited cognitive load on exactly what matters. This is the essence of synthetic data generation: The Teacher generates the curriculum.
The Web Development Analogy: Microservices vs. Serverless Functions
In modern web development terms, a large LLM is like a monolithic backend service handling authentication, billing, analytics, and user management all at once. It is powerful but resource-heavy and slow to respond.
Synthetic data generation is the process of extracting a specific microservice. We are taking the logic for "billing" from the monolith and refactoring it into a lightweight, serverless function (the Student Model).
- The Teacher (Monolith): Handles any request but requires significant infrastructure (GPU clusters).
- The Synthetic Data: The API logs and request/response pairs that define exactly how the billing service should behave.
- The Student (Serverless Function): A stripped-down, highly optimized piece of code that only does billing. It runs instantly on a small edge device (like a local browser via WebGPU).
By generating synthetic data, we are essentially creating the "API documentation" that allows the small model to mimic the behavior of the large model without needing its complexity.
Under the hood, synthetic data generation relies on distribution matching. A language model is essentially a probability distribution over sequences of tokens. When we fine-tune a model, we are adjusting the weights of its neural network to shift its probability distribution to match a target distribution.
In a standard dataset, the target distribution is derived from human-written text. In synthetic generation, the target distribution is derived from the Teacher's outputs. The Student learns to minimize the divergence (the difference) between its predicted probability distribution and the Teacher's distribution.
This is often formalized using Kullback-Leibler (KL) Divergence, a concept from information theory that measures how one probability distribution diverges from a second, expected probability distribution. By minimizing KL divergence during fine-tuning, the Student Model learns to mimic the Teacher's "reasoning style" and factual accuracy.
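For reference, the KL divergence between the Teacher's distribution $P$ and the Student's distribution $Q$ over token sequences $x$ is defined as:

```latex
D_{\mathrm{KL}}(P \parallel Q) \;=\; \sum_{x} P(x) \, \log \frac{P(x)}{Q(x)}
```

Because the Student's expected negative log-likelihood on Teacher samples equals $H(P) + D_{\mathrm{KL}}(P \parallel Q)$, and the Teacher's entropy $H(P)$ is fixed, fine-tuning on Teacher outputs with the standard cross-entropy loss minimizes this divergence implicitly.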
The Role of System Prompts in Data Generation
The System Prompt is the architect of this process. As defined in our context, the System Prompt is the hidden instruction that sets the rules. In synthetic data generation, the System Prompt given to the Teacher is critical. It does not just ask a question; it instructs the Teacher on how to generate the data itself.
For example, a System Prompt might look like this:
"You are a data curator. Your task is to generate high-quality instruction-response pairs for training a coding assistant. You must generate diverse scenarios, include edge cases, and provide step-by-step reasoning in the response."
This prompt transforms the Teacher from a conversational partner into a data factory. It ensures that the generated data adheres to the specific formatting requirements of the target environment (e.g., Ollama's chat template or a specific JSON structure for Transformers.js).
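As a sketch, the "data factory" System Prompt above might be wired into a chat-style request payload. The model name and message shape here follow Ollama's `/api/chat` convention; treat the exact payload as an assumption to verify against your runtime.

```typescript
// Chat-style message shape, as used by Ollama's /api/chat endpoint.
interface ChatMessage {
  role: 'system' | 'user';
  content: string;
}

// The System Prompt that turns the Teacher into a "data factory".
const DATA_CURATOR_SYSTEM_PROMPT =
  'You are a data curator. Your task is to generate high-quality ' +
  'instruction-response pairs for training a coding assistant. ' +
  'You must generate diverse scenarios, include edge cases, and ' +
  'provide step-by-step reasoning in the response.';

// Build the request body for one generation round on a given topic.
function buildTeacherRequest(topic: string): { model: string; messages: ChatMessage[] } {
  return {
    model: 'llama3', // hypothetical local model name
    messages: [
      { role: 'system', content: DATA_CURATOR_SYSTEM_PROMPT },
      { role: 'user', content: `Generate one instruction-response pair about ${topic}.` },
    ],
  };
}
```

The key design point is that the System Prompt is fixed across the whole generation run, while only the user message varies per topic.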
The Pipeline: From Raw Text to Structured Pairs
The generation process is rarely a single step. It typically involves a multi-stage pipeline:
- Seed Generation: The Teacher creates a broad list of topics or questions.
- Response Generation: The Teacher answers these questions, often using Chain-of-Thought (CoT) reasoning to produce detailed outputs.
- Filtering and Validation: This is crucial. The Teacher's output is not perfect. We often use a "Critic" model (another instance of the Teacher or a specialized reward model) to evaluate the quality of the generated data. This is similar to a code linter in web development—it checks for syntax errors, logical fallacies, or formatting issues before the data is committed to the dataset.
- Formatting: The data is structured into the specific format required by the fine-tuning library. For local models, this often means formatting into ChatML or Alpaca templates.
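The final Formatting stage can be sketched as a pure mapping into the Alpaca template. The field names follow the common instruction/input/output convention; verify the exact schema against your fine-tuning library.

```typescript
// A raw pair as produced by the Teacher after filtering.
interface RawPair {
  instruction: string;
  response: string;
}

// The Alpaca-style record commonly expected by fine-tuning tooling.
interface AlpacaRecord {
  instruction: string;
  input: string;
  output: string;
}

// Map one validated pair into the Alpaca shape.
function toAlpaca(pair: RawPair): AlpacaRecord {
  return { instruction: pair.instruction.trim(), input: '', output: pair.response.trim() };
}

// Datasets are typically stored as JSONL: one JSON object per line.
function toJsonl(pairs: RawPair[]): string {
  return pairs.map((p) => JSON.stringify(toAlpaca(p))).join('\n');
}
```

Keeping this stage a pure function makes it trivial to unit-test, which matters because a single malformed record can break a fine-tuning run.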
Visualization of the Synthetic Data Pipeline
The data flows from the Teacher (seed and response generation) through filtering, validation, and formatting into the fine-tuning step, which produces the Student Model deployed in the local environment.
Performance Considerations for Local Environments
When generating data specifically for local runtimes like WebGPU or Transformers.js, the theoretical approach must account for hardware constraints. A model running on a browser via WebGPU has a limited context window and memory bandwidth compared to a cloud-based Teacher.
Therefore, the synthetic data must be optimized for token efficiency. We generate data that teaches the model to be concise yet accurate. We also ensure the data covers the specific vocabulary and syntax required by the target environment. For instance, if the Student Model is intended to generate JSON objects for a web application, the synthetic data must heavily feature valid JSON structures in the responses.
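A minimal token-efficiency filter for the generated pairs might look like the sketch below. The 4-characters-per-token ratio is a rough heuristic, not a real tokenizer; swap in your model's actual tokenizer for production use.

```typescript
// One generated instruction-response pair.
interface DataPoint {
  instruction: string;
  response: string;
}

// Rough token estimate: ~4 characters per token for English text.
function approxTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep only pairs that fit the Student's context budget.
function fitsBudget(point: DataPoint, maxTokens: number): boolean {
  return approxTokens(point.instruction) + approxTokens(point.response) <= maxTokens;
}

// Example: drop anything that would not fit in a 512-token budget.
const dataset: DataPoint[] = [
  { instruction: 'Return a JSON object with a "name" key.', response: '{"name": "demo"}' },
];
const filtered = dataset.filter((p) => fitsBudget(p, 512));
console.log(`kept ${filtered.length} of ${dataset.length} pairs`);
```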
The "Why" of Synthetic Data
Why go through this trouble? Why not just use human data?
- Scarcity: High-quality, domain-specific human data is expensive and rare. Synthetic data allows us to scale data creation infinitely.
- Control: We can generate data for edge cases that rarely occur in the wild, ensuring the Student Model is robust.
- Privacy: Synthetic data can be generated to mimic sensitive data patterns without containing actual private information.
- Alignment: We can use the Teacher to generate data that aligns with specific safety guidelines or behavioral constraints, baking these into the Student Model from the start.
In summary, synthetic data generation is the bridge between the massive capabilities of cloud-based LLMs and the practical, constrained reality of local AI. It is a process of knowledge compression, guided by precise System Prompts, resulting in a tailored dataset that empowers small models to punch far above their weight class.
Basic Code Example
This example demonstrates a minimal SaaS-style backend endpoint that uses a local LLM (simulated via Ollama) to generate synthetic instruction-response pairs. It focuses on the core mechanics of streaming data generation, parsing the stream, and structuring it for fine-tuning. We will use Node.js with Express and TypeScript, simulating an API endpoint that a frontend client would hit to generate data.
The Core Logic
The process involves three distinct stages:
1. Initiation: The client sends a request with a specific topic or seed prompt.
2. Streaming & Parsing: The server queries the local LLM (via Ollama's API), receives a raw text stream, and parses it in real-time to extract structured JSON objects.
3. Response: The server pipes the parsed, structured data back to the client using Server-Sent Events (SSE).
```typescript
import express, { Request, Response } from 'express';
import axios, { AxiosResponse } from 'axios';

// --- Types & Interfaces ---

/**
 * Represents a single synthetic data point for fine-tuning.
 * @property instruction - The prompt or question generated by the teacher model.
 * @property response - The answer generated by the teacher model.
 * @property category - The topic category (e.g., 'coding', 'creative_writing').
 */
interface SyntheticDataPoint {
  instruction: string;
  response: string;
  category: string;
}

/**
 * Configuration for the Ollama API request.
 * @property model - The local model name (e.g., 'llama2').
 * @property prompt - The prompt sent to the model.
 * @property format - Requested response format (json).
 * @property stream - Whether to stream tokens as they are generated.
 */
interface OllamaRequestConfig {
  model: string;
  prompt: string;
  format: 'json';
  stream: boolean;
}

// --- Constants ---
const app = express();
const PORT = 3000;
const OLLAMA_API_URL = 'http://localhost:11434/api/generate';

// --- Helper Functions ---

/**
 * Parses a raw streamed line from Ollama.
 * Ollama streams JSON objects line-by-line.
 * @param chunk - The raw string chunk received from the stream.
 * @returns Parsed object, or null if the chunk is empty or incomplete.
 */
function parseOllamaStream(chunk: string): Partial<SyntheticDataPoint> | null {
  // Remove the "data: " prefix if present (standard SSE format)
  const cleanChunk = chunk.replace(/^data: /, '').trim();
  if (!cleanChunk) return null;

  try {
    // Ollama returns partial JSON objects.
    // For this simplified example, we assume the final chunk contains the
    // full JSON. In production, you would accumulate chunks until a valid
    // JSON object is formed.
    const parsed = JSON.parse(cleanChunk);

    // Map Ollama's response structure to our interface.
    // Note: in a real scenario, we'd use a prompt to force Ollama to output
    // specific keys. Here we simulate extracting keys from the response text.
    if (parsed.response) {
      return {
        instruction: parsed.response, // Simulating an extracted instruction
        response: 'Simulated detailed answer based on the instruction.', // Placeholder
        category: 'general',
      };
    }
    return null;
  } catch (error) {
    // Type narrowing: check whether error is an instance of Error
    if (error instanceof Error) {
      console.warn(`Failed to parse chunk: ${error.message}`);
    }
    return null;
  }
}

/**
 * Generates a synthetic data point by querying the local LLM.
 * This function simulates the "Teacher Model" behavior.
 * @param topic - The seed topic for generation.
 * @returns The generated data point.
 */
async function generateSyntheticData(topic: string): Promise<SyntheticDataPoint> {
  const prompt = `
Generate a high-quality synthetic instruction and response pair about ${topic}.
Output ONLY raw JSON in the following format:
{
  "instruction": "The generated question or task",
  "response": "The detailed answer or solution",
  "category": "${topic}"
}
`;

  const config: OllamaRequestConfig = {
    model: 'llama2', // Assumes a local model is running
    prompt: prompt,
    format: 'json',
    stream: false, // Streaming is disabled in this helper to keep it simple
  };

  try {
    // Call the Ollama API. In a real streaming scenario, we would use axios
    // with `onDownloadProgress` or a fetch stream reader.
    const response: AxiosResponse = await axios.post(OLLAMA_API_URL, config);

    // Type narrowing: ensure data exists and is an object
    if (response.data && typeof response.data === 'object') {
      const data = response.data;

      // Validate the structure (basic runtime check)
      if (typeof data.response === 'string') {
        // Parse the stringified JSON inside the 'response' field. This is a
        // common pattern when forcing LLMs to output JSON via text prompts.
        const parsedContent = JSON.parse(data.response);
        return {
          instruction: parsedContent.instruction,
          response: parsedContent.response,
          category: parsedContent.category,
        };
      }
    }
    throw new Error('Invalid response structure from LLM');
  } catch (error) {
    if (error instanceof Error) {
      console.error(`Generation failed: ${error.message}`);
    }
    // Fallback data for demonstration purposes
    return {
      instruction: `What is ${topic}?`,
      response: `This is a simulated response about ${topic}.`,
      category: topic,
    };
  }
}

// --- Main Application Logic ---
app.use(express.json());

/**
 * Endpoint: GET /generate-stream
 *
 * Streams synthetic data back to the client using Server-Sent Events (SSE).
 * This simulates a real-time data generation pipeline.
 */
app.get('/generate-stream', (req: Request, res: Response) => {
  const { topic } = req.query;

  // 1. Validate input
  if (!topic || typeof topic !== 'string') {
    res.status(400).json({ error: "A 'topic' query parameter is required." });
    return;
  }

  // 2. Set up SSE headers
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');
  res.flushHeaders(); // Flush headers immediately to establish the stream

  // 3. Async loop for data generation (three data points for this demo)
  (async () => {
    try {
      for (let i = 0; i < 3; i++) {
        // Generate data using the helper function
        const dataPoint = await generateSyntheticData(topic);

        // Format as an SSE message: "event: update\ndata: {json}\n\n"
        const sseMessage = `event: update\ndata: ${JSON.stringify(dataPoint)}\n\n`;

        // Write to the response stream
        res.write(sseMessage);

        // Simulate processing delay
        await new Promise((resolve) => setTimeout(resolve, 500));
      }
      // End the stream
      res.write('event: end\ndata: {"status": "completed"}\n\n');
      res.end();
    } catch (err) {
      // Handle errors within the stream
      const errorMessage = err instanceof Error ? err.message : 'Unknown error';
      res.write(`event: error\ndata: {"error": "${errorMessage}"}\n\n`);
      res.end();
    }
  })();
});

// Start server
app.listen(PORT, () => {
  console.log(`Synthetic Data Generator running at http://localhost:${PORT}`);
});
```
Detailed Line-by-Line Explanation
1. Imports and Type Definitions
* Why: We import express for the web server framework and axios for making HTTP requests to the local Ollama instance.
* Under the Hood: Node.js uses CommonJS or ES Modules. We are strictly typing our imports to ensure TypeScript understands the shape of the request and response objects.
2. Interface Definitions
* Why: This defines the "shape" of our synthetic data. Fine-tuning datasets (like the Alpaca or ShareGPT formats) rely on strict schemas.
* Type Narrowing: By defining this interface, we enable TypeScript to catch errors if we try to assign a number to instruction or miss a required field.
3. The parseOllamaStream Function
```typescript
function parseOllamaStream(chunk: string): Partial<SyntheticDataPoint> | null {
  const cleanChunk = chunk.replace(/^data: /, '').trim();
  // ...
}
```
* Why: Streamed chunks may arrive wrapped in SSE framing (data: { ... }). This function strips the wrapper to get the raw JSON.
* SSE Context: In Server-Sent Events, lines starting with data: carry the payload. We strip this prefix to isolate the JSON.
* Error Handling: The try...catch block is crucial. LLMs are non-deterministic; they might hallucinate invalid JSON. If parsing fails, we return null rather than crashing the server.
4. The generateSyntheticData Function (Teacher Model Simulation)
```typescript
async function generateSyntheticData(topic: string): Promise<SyntheticDataPoint> {
  // ... prompt engineering ...
  const response: AxiosResponse = await axios.post(OLLAMA_API_URL, config);
  // ...
}
```
* Type Narrowing: The guard if (response.data && typeof response.data === 'object') is a form of Type Narrowing. Even though TypeScript types response.data as any (from the library), we narrow it down to an object before accessing properties.
* Parsing: We parse data.response because Ollama often wraps the actual generation in a metadata object.
5. The Express Endpoint (/generate-stream)
```typescript
app.get('/generate-stream', (req: Request, res: Response) => {
  // ...
  res.setHeader('Content-Type', 'text/event-stream');
  res.flushHeaders();
  // ...
});
```
* Headers: Setting Content-Type to text/event-stream tells the browser to listen for a stream rather than a single JSON response. res.flushHeaders() sends the headers immediately, preventing the client from timing out while waiting for the first byte.
* Async Generator Pattern: We use an immediately invoked async function (async () => { ... })(); to handle the loop. This allows us to use await inside the loop, simulating the time it takes for the LLM to "think" and generate tokens.
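On the client side, the frames this endpoint emits can be decoded with a small parser. This is a sketch assuming the exact `event:`/`data:` framing produced above; a real client would use EventSource or accumulate bytes from a fetch ReadableStream and split on the blank-line delimiter.

```typescript
// One decoded Server-Sent Events frame.
interface SseFrame {
  event: string;
  data: string;
}

// Parse a single raw frame such as "event: update\ndata: {...}\n\n".
function parseSseFrame(raw: string): SseFrame | null {
  let event = 'message'; // SSE default event name
  const dataLines: string[] = [];
  for (const line of raw.split('\n')) {
    if (line.startsWith('event:')) {
      event = line.slice(6).trim();
    } else if (line.startsWith('data:')) {
      // Multi-line data fields are joined with newlines per the SSE spec.
      dataLines.push(line.slice(5).trim());
    }
  }
  if (dataLines.length === 0) return null; // comment or keep-alive frame
  return { event, data: dataLines.join('\n') };
}
```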
6. Streaming the Response
* SSE Protocol: The protocol requires double newlines (\n\n) to terminate a message. We construct a string adhering to this standard.
* Backpressure: res.write() handles backpressure automatically in Node.js streams. If the client is slow to read, the buffer fills up, and the server pauses generation (though in this simple loop, we rely on Node's internal buffering).
Common Pitfalls
1. Vercel/AWS Lambda Timeouts (The "Cold Start" Trap)
- Issue: Serverless functions often have strict timeouts (e.g., 10 seconds on Vercel Hobby plans). Generating synthetic data via LLMs is computationally expensive and slow.
- Why it happens: If you try to generate a batch of 50 data points in a single serverless request, the function will likely time out before finishing.
- Solution: Use a background job queue (like BullMQ or Inngest) or generate data in smaller chunks (e.g., 5 points per request) and let the client handle pagination.
2. LLM Hallucination & Invalid JSON
- Issue: Even with strict prompts, LLMs occasionally output malformed JSON (e.g., missing commas, trailing commas, or unescaped quotes).
- Why it happens: LLMs are probabilistic text generators, not JSON validators.
- Solution: Never trust the raw output. Always wrap JSON.parse() in a try/catch block (as shown in the code). If parsing fails, log the error and either retry the request or discard that specific chunk. Use a library like zod for runtime schema validation.
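A hand-rolled runtime type guard provides the same safety net without a dependency; zod would express the same check declaratively. This is a sketch, not the chapter's canonical validator; the point is that compile-time types say nothing about what JSON.parse() actually returned at runtime.

```typescript
// The target schema for one synthetic data point.
interface SyntheticDataPoint {
  instruction: string;
  response: string;
  category: string;
}

// Runtime type guard: narrows unknown to SyntheticDataPoint.
function isSyntheticDataPoint(value: unknown): value is SyntheticDataPoint {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.instruction === 'string' &&
    typeof v.response === 'string' &&
    typeof v.category === 'string'
  );
}

// Parse and validate in one step; null means "discard this sample".
function tryParseDataPoint(raw: string): SyntheticDataPoint | null {
  try {
    const parsed: unknown = JSON.parse(raw);
    return isSyntheticDataPoint(parsed) ? parsed : null;
  } catch {
    return null; // malformed JSON from the model: discard, don't crash
  }
}
```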
3. Async/Await in Loops (Performance Bottlenecks)
- Issue: Using await inside a for loop processes requests sequentially. If one request takes 5 seconds, a loop of 10 requests takes 50 seconds.
- Why it happens: Each iteration waits for the previous one to complete before starting.
- Solution: For independent data generation tasks, use Promise.all() to run generations in parallel. Note: be careful not to overload your local GPU/CPU when running local LLMs in parallel.
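Plain Promise.all() launches everything at once, which can overwhelm a local GPU; a concurrency-capped variant is a common middle ground between fully sequential and fully parallel. The helper name below is our own, not a library API.

```typescript
// Run `fn` over `items` with at most `limit` calls in flight at once.
// Results are returned in input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each worker repeatedly claims the next unclaimed index and processes it.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // safe: JS is single-threaded between awaits
      results[i] = await fn(items[i]);
    }
  }

  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}
```

Usage with the chapter's generator might look like `await mapWithConcurrency(topics, 2, generateSyntheticData)`, keeping at most two requests in flight against the local model.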
4. Memory Leaks in SSE
- Issue: If the client disconnects (closes the browser tab) while the server is generating data, the server continues processing until the loop finishes or an error is thrown.
- Why it happens: The res object remains open, and the async function keeps running.
- Solution: Listen for the close event on the response object: res.on('close', () => { /* cleanup logic */ }). This lets you abort the generation process immediately, saving resources.
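One way to wire this up is to tie an AbortController to the 'close' event; a sketch, assuming any EventEmitter-like response object (which Express's res is):

```typescript
import { EventEmitter } from 'node:events';

// Return a signal that flips to aborted as soon as the response closes.
function abortOnClose(res: EventEmitter): AbortSignal {
  const controller = new AbortController();
  res.once('close', () => controller.abort());
  return controller.signal;
}

// Inside the generation loop:
//   const signal = abortOnClose(res);
//   for (let i = 0; i < 3; i++) {
//     if (signal.aborted) break; // client gone: stop generating
//     ...
//   }
```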
Visualizing the Data Flow
The request lifecycle runs from the client's initial GET request, through the Express endpoint and the Ollama API call, and back to the client as a stream of SSE events.
The chapter continues with advanced code examples, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.
Code License: All code examples are released under the MIT License. Github repo.
Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.