
Chapter 3: Streaming AI Responses through APIs

Theoretical Foundations

To understand streaming AI responses, we must first dismantle a common misconception: that Large Language Models (LLMs) generate answers instantly. In reality, an LLM is an autoregressive model; it generates text one token (roughly three to four characters, or about three-quarters of an English word) at a time. It cannot generate the entire response in parallel because the probability of the next token depends entirely on the sequence of tokens that came before it.

If we treat the API request as a standard synchronous HTTP call, the client sends a request and waits. The server waits for the LLM to finish generating the entire response (which might take 10–30 seconds for complex reasoning), packages it into a single JSON object, and sends it back. This is the "waterfall" of latency—a blocking, binary transaction that feels sluggish and unresponsive.

Streaming changes the architecture from a Batch Process to a Real-Time Pipeline. Instead of waiting for the bucket to fill before carrying it, we use a hose to deliver the water drop by drop the moment it leaves the tap.

The Analogy: The Chef vs. The Sushi Conveyor Belt

Imagine a client (a hungry diner) ordering a complex meal (a detailed AI response) from a kitchen (the backend server).

  • The Synchronous Model (The Chef): The diner places an order. The chef prepares the entire meal—appetizer, main course, and dessert—without serving a single plate. Only when the last garnish is placed does the chef walk out and serve the whole meal at once. If the diner asked for a modification halfway through, the chef would have to restart from scratch. The diner sits staring at an empty table, impatient and disengaged.
  • The Streaming Model (The Sushi Conveyor Belt): The chef prepares the first piece of sushi (the first token) and immediately places it on a conveyor belt (the ReadableStream). The diner sees the sushi moving toward them and can start eating immediately. While the diner is chewing the first piece, the chef is already preparing the second. The diner is engaged, receiving value continuously, and the perceived wait time is reduced to the time it takes to receive the first piece, not the last.

In technical terms, the "Conveyor Belt" is the Server-Sent Events (SSE) protocol over HTTP. Unlike WebSockets, which are bidirectional (full-duplex) and complex to maintain, SSE is unidirectional (simplex). It is designed specifically for this scenario: a continuous stream of data from server to client.

Under the hood, streaming relies on the HTTP/1.1 Transfer-Encoding: chunked header (and HTTP/2/3 equivalents). When a server sets this header, it promises to send the response in a series of "chunks," each preceded by its size.

However, standard HTTP chunking is often too low-level for modern AI applications. It just sends raw bytes. We need structure. This is where Server-Sent Events (SSE) comes in. SSE is a standard that wraps these chunks in a specific text-based format:

data: {"token": "The"}

data: {"token": " quick"}

data: {"token": " brown"}

data: [DONE]

The client listens to this stream, parses the data: fields, and reconstructs the message incrementally.
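A minimal sketch of that incremental reconstruction, assuming the hypothetical {"token": ...} payload shape shown above (a real client would also handle events split across network chunks):

```typescript
// Splits raw SSE text into `data:` payloads (events are delimited by a
// blank line in the SSE format), then rebuilds the message token by token.
function parseSSEData(raw: string): string[] {
  const payloads: string[] = [];
  for (const frame of raw.split("\n\n")) {
    for (const line of frame.split("\n")) {
      if (line.startsWith("data:")) {
        payloads.push(line.slice("data:".length).trimStart());
      }
    }
  }
  return payloads;
}

// Concatenates the `token` fields until the [DONE] sentinel appears.
function reconstructMessage(raw: string): string {
  let message = "";
  for (const payload of parseSSEData(raw)) {
    if (payload === "[DONE]") break;
    message += (JSON.parse(payload) as { token: string }).token;
  }
  return message;
}
```

Feeding it the four events above yields the string "The quick brown".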

Why Edge Runtime is Non-Negotiable for Streaming

This brings us to the Edge Runtime. In a traditional Node.js backend (Serverless Functions), there is a "cold start" penalty—loading the runtime, dependencies, and initializing the environment before code execution begins. For a standard API that takes 200ms to run, a 500ms cold start is annoying but acceptable.

For a streaming response that might last 30 seconds, a 500ms cold start at the beginning of the stream is catastrophic. It delays the arrival of the very first token, destroying the user experience.

Edge Runtime (based on V8 Isolates or similar lightweight runtimes) eliminates this. It is designed for:

  1. Near-Zero Cold Starts: The environment is already "warm" or initializes in single-digit milliseconds.
  2. Global Distribution: The Edge function runs physically close to the user, reducing the round-trip time (RTT) for the initial connection.
  3. Standard Web APIs: It uses the Request and Response objects native to the web, making it ideal for manipulating streams.
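To make point 3 concrete, here is a minimal sketch of an Edge-style handler built only on standard Web APIs. The hard-coded token list stands in for a real model stream; no framework-specific API is assumed:

```typescript
// Edge-style handler: a standard Request in, a streamed Response out.
// The `tokens` array is a stand-in for a real LLM token stream.
function handler(_req: Request): Response {
  const encoder = new TextEncoder();
  const tokens = ["Hello", " from", " the", " edge"];

  const stream = new ReadableStream<Uint8Array>({
    start(controller) {
      for (const token of tokens) {
        // Frame each token as an SSE event before enqueueing it as bytes.
        controller.enqueue(
          encoder.encode(`data: ${JSON.stringify({ token })}\n\n`)
        );
      }
      controller.enqueue(encoder.encode("data: [DONE]\n\n"));
      controller.close();
    },
  });

  return new Response(stream, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache",
    },
  });
}
```

Because Request, Response, and ReadableStream are web standards, the same handler shape runs unchanged on most Edge platforms.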

The Architecture: tRPC, Edge Functions, and the ReadableStream

In the context of a Full-Stack TypeScript application, we often use tRPC for end-to-end type safety. However, tRPC traditionally relies on HTTP POST/GET requests with a single JSON response. To support streaming, we must bridge the gap between tRPC's type-safe procedures and the raw ReadableStream of the Edge Runtime.

We treat the streaming endpoint not as a standard tRPC query, but as a Gateway. The tRPC client initiates the request, but the server response bypasses the standard tRPC JSON serialization. Instead, it returns a raw Response object containing a ReadableStream.
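On the client side, the consumer of that gateway therefore works with the raw Response body rather than a deserialized tRPC payload. A framework-free sketch; the onToken callback is an assumed UI hook, not part of any library:

```typescript
// Drains a raw streamed Response chunk by chunk, forwarding each decoded
// piece of text to the UI layer as it arrives.
async function consumeStream(
  res: Response,
  onToken: (text: string) => void,
): Promise<void> {
  if (!res.body) throw new Error("Response has no body stream");
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // `stream: true` keeps multi-byte characters intact across chunks.
    onToken(decoder.decode(value, { stream: true }));
  }
}
```

The helper is deliberately agnostic about where the Response came from: a fetch to the gateway route, or a Response constructed locally in a test.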

The Data Transformation Layer: Parsing Partial JSON

One of the biggest challenges with streaming LLMs is that the data arriving is often a stream of raw text tokens, or a stream of partial JSON objects. You cannot simply concatenate these tokens and parse them at the end; you must handle them as they arrive.

Consider the LLM outputting a JSON object representing a user profile: {"name": "Alex", "age": 30, "skills": ["TypeScript", "React"]}

In a stream, this might arrive as:

  1. Chunk 1: {"name": "Alex", "age": 3
  2. Chunk 2: 0, "skills": ["TypeScript", "Re
  3. Chunk 3: act"]}

If you try to JSON.parse Chunk 1, it will fail. We need a Stateful Stream Transformer. This is a middleware layer within the Edge Function that maintains a "buffer" of characters. As new chunks arrive, it appends them to the buffer and attempts to parse complete JSON objects. If successful, it yields the parsed object; if not, it waits for more data.

This is analogous to a Packet Reassembler in networking. IP packets arrive out of order or fragmented; the reassembler holds them in a buffer until it can reconstruct the original message.
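Here is a minimal sketch of such a reassembler. It tracks brace depth and string state to find the first complete top-level object in the buffer; a production version would use a streaming JSON parser, and this simplified form ignores top-level arrays and stray closing braces:

```typescript
// Stateful partial-JSON extractor: returns the first complete top-level
// object found in `buffer` (or null), plus the unconsumed remainder.
function parsePartialJSON(buffer: string): {
  parsedData: unknown | null;
  remainingBuffer: string;
} {
  let depth = 0;
  let inString = false;
  let escaped = false;

  for (let i = 0; i < buffer.length; i++) {
    const ch = buffer[i];
    if (escaped) { escaped = false; continue; }
    if (inString) {
      if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === "{") depth++;
    else if (ch === "}" && --depth === 0) {
      // Balanced object found: parse it and hand back the unread tail.
      return {
        parsedData: JSON.parse(buffer.slice(0, i + 1)),
        remainingBuffer: buffer.slice(i + 1),
      };
    }
  }
  // No complete object yet — keep everything buffered for the next chunk.
  return { parsedData: null, remainingBuffer: buffer };
}
```

Called on Chunk 1 alone it returns null and keeps the buffer; called once all three chunks have been appended, it yields the full profile object.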

Visualizing the Stream Flow

The following diagram illustrates the lifecycle of a streaming request from the client, through the Edge Runtime, to the LLM, and back to the UI.

The Hierarchical Agentic Workflow Connection

Streaming is not just about displaying text faster; it is the backbone of modern Hierarchical Agentic Workflows.

In previous chapters, we discussed agents as autonomous entities. In a hierarchical workflow, a "Supervisor" agent decides which "Executor" agent to call. If the Supervisor needs to call an Executor that takes 5 minutes to process data, the user needs feedback immediately.

Without streaming, the UI would freeze. With streaming, the Supervisor can emit "thoughts" or "tool selection steps" as they happen.

  • Step 1: Supervisor decides to use a Calculator tool.
  • Step 2: Supervisor emits: Thinking: I need to calculate 2 + 2...
  • Step 3: Executor runs and streams the result.

By using Edge Functions to stream these agent steps, we create a UI that feels "alive." The user sees the chain of thought unfolding in real-time, which builds trust and transparency—critical factors when working with probabilistic AI models.
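One lightweight way to sketch those agent steps on the wire is with named SSE events, so the client can route "thought" and "result" frames to different UI components. The event names and payload shape here are illustrative, not a standard:

```typescript
// Illustrative agent-step frames: the `event:` field selects which
// client-side listener fires; `data:` carries the JSON payload.
type AgentStep =
  | { kind: "thought"; text: string }
  | { kind: "result"; text: string };

function toSSEFrame(step: AgentStep): string {
  return `event: ${step.kind}\ndata: ${JSON.stringify({ text: step.text })}\n\n`;
}
```

The client would then subscribe with eventSource.addEventListener("thought", ...) rather than the default onmessage handler.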

Under the Hood: The ReadableStream Controller

When we construct a Response in the Edge Runtime, we pass it a ReadableStream. This stream is initialized with a function that receives a controller.

// Conceptual representation of the stream controller logic.
// `fetchLLMStream` and `parsePartialJSON` are application-level helpers.
const encoder = new TextEncoder();

const stream = new ReadableStream<Uint8Array>({
  async start(controller) {
    // 1. Open connection to LLM
    const llmStream = await fetchLLMStream();

    // 2. Create a reader for the LLM's output
    const reader = llmStream.getReader();
    const decoder = new TextDecoder();
    let buffer = '';

    // 3. The Pump Loop
    while (true) {
      const { done, value } = await reader.read();
      if (done) {
        // LLM finished, close the stream to client
        controller.close();
        break;
      }

      // 4. Decode bytes to text
      const chunk = decoder.decode(value, { stream: true });
      buffer += chunk;

      // 5. Parse buffer (extract complete JSON objects)
      const { parsedData, remainingBuffer } = parsePartialJSON(buffer);

      // 6. Send parsed data to client (a Response body carries bytes,
      //    so the JSON must be re-encoded before enqueueing)
      if (parsedData) {
        controller.enqueue(encoder.encode(JSON.stringify(parsedData)));
      }

      // 7. Keep the leftover for the next chunk
      buffer = remainingBuffer;
    }
  }
});

Summary of the "Why"

  1. Perceived Performance: Human perception is relative. A 10-second wait feels longer than 1 second of wait followed by 9 seconds of continuous delivery.
  2. Error Handling: If the connection drops in a synchronous request, you lose the entire result. In a stream, you can persist what has already arrived and potentially resume.
  3. Complexity Management: By offloading the rendering of tokens to the client, the server can process the next token immediately without waiting for the client to acknowledge receipt of the previous one. This decouples the generation speed from the network latency of the response payload.

Basic Code Example

In a modern SaaS application, waiting for a complex AI generation (like a long report or code block) to complete before showing any result creates a poor user experience. Server-Sent Events (SSE) allow the backend to stream data incrementally over a single HTTP connection. Unlike WebSockets, which are bidirectional, SSE is unidirectional (server-to-client), making it lightweight and ideal for AI text generation.

In this example, we will build a simple backend endpoint that simulates an AI generating a JSON object token-by-token. We will then build a frontend that consumes this stream, parses the partial JSON on the fly, and updates the UI in real-time.

The Architecture

We will use a standard Node.js/Express backend for the API and a vanilla TypeScript frontend. This ensures the concepts are universally applicable, regardless of the specific framework (though we will note where frameworks like Next.js or tRPC would differ).

This diagram illustrates how core TypeScript concepts remain framework-agnostic, while highlighting specific implementation differences in frameworks like Next.js or tRPC.

Backend Implementation (Node.js + Express)

This code sets up an Express server with a single endpoint /api/chat. It simulates an AI generating a JSON object by sending chunks of text with a slight delay.

// server.ts
import express, { Request, Response } from 'express';
import cors from 'cors';

const app = express();
const PORT = 3000;

// Middleware to handle CORS (necessary for local dev)
app.use(cors());
// Middleware to parse JSON bodies
app.use(express.json());

/**
 * @route   GET /api/chat
 * @desc    Streams a simulated AI response as Server-Sent Events (SSE).
 * @returns text/event-stream
 */
app.get('/api/chat', (req: Request, res: Response) => {
  // 1. Set SSE headers
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  // 2. Simulate a sequence of JSON tokens an LLM might produce
  const jsonTokens = [
    '{\n  "response": "Hello', 
    ' World', 
    '! This', 
    ' is', 
    ' a', 
    ' stream', 
    '",\n  "metadata": {\n    "temperature": 0.7,\n    "tokens": 6\n  }\n}'
  ];

  // 3. Helper function to send data in SSE format
  // SSE format: "data: <content>\n\n"
  const sendToken = (token: string) => {
    res.write(`data: ${token}\n\n`);
  };

  // 4. Stream tokens with a delay to simulate network latency and LLM processing
  let index = 0;
  const intervalId = setInterval(() => {
    if (index < jsonTokens.length) {
      sendToken(jsonTokens[index]);
      index++;
    } else {
      // 5. End the stream when done
      clearInterval(intervalId);
      res.end(); 
    }
  }, 100); // 100ms delay between tokens

  // 6. Handle client disconnect (cleanup)
  req.on('close', () => {
    clearInterval(intervalId);
    res.end();
  });
});

app.listen(PORT, () => {
  console.log(`SSE Server running on http://localhost:${PORT}`);
});

Frontend Implementation (TypeScript)

This frontend uses the native EventSource API to connect to the backend. It accumulates the incoming chunks and performs "Type Narrowing" to validate and parse the partial JSON structure safely.

// client.ts

/**
 * Interface for the expected final JSON structure.
 */
interface AIResponse {
  response: string;
  metadata: {
    temperature: number;
    tokens: number;
  };
}

/**
 * Simulates a UI update function (e.g., updating a DOM element).
 * @param content - The accumulated text to display
 */
const updateUI = (content: string) => {
  const uiElement = document.getElementById('stream-output');
  if (uiElement) {
    uiElement.textContent = content;
  }
  console.log("UI Updated:", content);
};

/**
 * Connects to the SSE endpoint and handles the stream.
 */
function startStream() {
  // 1. Initialize EventSource
  // Note: In a real app, this URL would come from your backend API route.
  const eventSource = new EventSource('http://localhost:3000/api/chat');

  // 2. Accumulator to hold partial JSON chunks
  let accumulatedData = '';

  // 3. Listen for the 'message' event (default event for SSE)
  eventSource.onmessage = (event: MessageEvent) => {
    // 4. Append the new chunk to the accumulator
    const chunk = event.data;
    accumulatedData += chunk;

    // 5. Visual Feedback: Update UI with raw text immediately
    // This shows the "streaming" effect before full JSON validity.
    updateUI(accumulatedData);

    // 6. Attempt Type Narrowing / Validation
    // We try to parse the accumulated data. If it fails, we wait for more chunks.
    try {
      // JSON.parse throws if the string is incomplete/invalid
      const parsedData: unknown = JSON.parse(accumulatedData);

      // Type Guard: Check if the parsed object matches our interface
      if (isAIResponse(parsedData)) {
        console.log("Full JSON Validated:", parsedData);
        // Here you might trigger a specific action upon completion
        // For example, lock the UI or save to database.
      }
    } catch (error) {
      // Expected behavior: JSON.parse fails until the stream finishes.
      // We silently ignore this in the UI to allow the stream to continue.
    }
  };

  // 7. Handle errors
  eventSource.onerror = (error) => {
    console.error("EventSource failed:", error);
    eventSource.close();
  };
}

/**
 * Type Guard function.
 * Checks if the unknown object conforms to the AIResponse interface.
 * This is the "Type Narrowing" concept in action.
 */
function isAIResponse(obj: any): obj is AIResponse {
  return (
    typeof obj === 'object' &&
    obj !== null &&
    typeof obj.response === 'string' &&
    typeof obj.metadata === 'object' &&
    obj.metadata !== null && // typeof null is 'object', so guard against it
    typeof obj.metadata.temperature === 'number' &&
    typeof obj.metadata.tokens === 'number'
  );
}

// Start the process
startStream();

Line-by-Line Explanation

Backend (server.ts)

  1. Headers Setup:

    • res.setHeader('Content-Type', 'text/event-stream'): This is crucial. It tells the browser that this is not a standard JSON response but a continuous stream.
    • res.setHeader('Cache-Control', 'no-cache'): Prevents intermediate proxies or the browser from caching the response.
    • res.setHeader('Connection', 'keep-alive'): Keeps the TCP connection open.
  2. Data Simulation:

    • jsonTokens: We define an array of strings. Notice that the strings are split arbitrarily (e.g., one chunk ends mid-string at 'Hello' and the next begins with ' World'). This simulates how an LLM (like GPT) outputs tokens. The split points do not respect JSON syntax rules (like closing quotes or braces) until the very end.
  3. The Streamer (sendToken):

    • res.write(`data: ${token}\n\n`): This is the strict SSE protocol.
      • data: is the specific field name for the payload.
      • \n\n (two newlines) marks the end of the event frame. Without the double newline, the browser will buffer the data indefinitely and never fire the onmessage event.
  4. The Interval:

    • We use setInterval to mock the time an LLM takes to generate the next token. In a real scenario, this would be asynchronous iteration over the LLM provider's response stream (e.g., for await (const chunk of llmStream)).
  5. Cleanup:

    • req.on('close'): If the user closes the tab or navigates away, the connection drops. We must clear the interval on the server to prevent memory leaks (zombie processes trying to write to a closed response).

Frontend (client.ts)

  1. EventSource Initialization:

    • new EventSource(url): The browser handles the connection management automatically. It will automatically reconnect if the connection drops (until you explicitly call eventSource.close()).
  2. Accumulation Strategy:

    • accumulatedData += chunk: Because the server sends partial JSON, we cannot parse every chunk individually. We must concatenate them into a single string until the JSON is valid or the stream ends.
  3. UI Rendering (The "Why" of Streaming):

    • updateUI(accumulatedData): We update the UI inside the onmessage loop. This creates the "typewriter" effect. The user sees text immediately, even if the underlying data structure (JSON) isn't complete yet.
  4. Type Narrowing & Validation (isAIResponse):

    • The Problem: TypeScript types exist only at compile-time. At runtime, the value returned by JSON.parse is untyped, which is why we annotate parsedData as unknown rather than trusting a cast.
    • The Solution: We use a Type Guard (obj is AIResponse). Inside this function, we perform runtime checks (e.g., typeof obj.metadata.temperature === 'number').
    • The Result: If this function returns true, TypeScript knows that parsedData is of type AIResponse and allows us to access properties safely without compile-time errors.
  5. Error Handling:

    • try/catch: JSON.parse throws an error if the string is incomplete (e.g., {"key": "val). This is expected behavior during streaming. We catch it to prevent the app from crashing, allowing the stream to continue appending data until the JSON is valid.

Common Pitfalls

  1. The "Invalid JSON" Trap:

    • Issue: LLMs often output malformed JSON, especially if they are interrupted or if the temperature is high. They might close a bracket ] without opening one [.
    • Fix: Never rely solely on JSON.parse inside the stream loop for validation. Use a streaming JSON parser library (like jsonstream or stream-json) or implement a buffer that only attempts parsing when a valid top-level object is detected.
  2. Vercel/AWS Lambda Timeouts:

    • Issue: Serverless functions often have strict timeouts (e.g., 10s on Vercel Hobby plans). A long AI generation might exceed this, causing the connection to drop abruptly.
    • Fix: For long streams, use a persistent server (like a Node.js container on ECS/Kubernetes) or a specialized streaming service (like Vercel's AI SDK with Edge Runtime, which handles keep-alives differently).
  3. Blocking the Event Loop:

    • Issue: Developers sometimes run heavy synchronous work between writes, which stalls every connection the single-threaded server is handling.
    • Fix: Awaiting an asynchronous source (e.g., for await...of over the LLM stream) is fine; res.write itself does not block. Avoid long-running synchronous computation inside the streaming loop.
  4. Missing event: Field:

    • Issue: The default event type is message. If you send event: custom\n in the stream, the frontend must listen to eventSource.addEventListener('custom', ...) instead of onmessage.
    • Fix: Stick to the default data: format unless you need multiple event types (e.g., distinguishing between "text" and "status" updates).
  5. CORS Configuration:

    • Issue: EventSource is subject to CORS. If your backend doesn't explicitly allow the frontend's origin, the browser will block the stream silently.
    • Fix: Ensure Access-Control-Allow-Origin is set correctly in the backend headers.
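On pitfall 3 in particular: when the setInterval mock above is replaced by a real token source, the idiomatic shape is asynchronous iteration over that source. A sketch, with SSEWriter as a minimal stand-in for the parts of an Express res it uses:

```typescript
// Minimal writer interface so the helper stays framework-agnostic;
// an Express `res` object satisfies it.
interface SSEWriter {
  write(chunk: string): void;
  end(): void;
}

// Pumps an async iterable of tokens to the client as SSE frames.
async function pipeToSSE(
  tokens: AsyncIterable<string>,
  res: SSEWriter,
): Promise<void> {
  for await (const token of tokens) {
    // Awaiting the *source* is fine; res.write itself does not block.
    res.write(`data: ${token}\n\n`);
  }
  res.write("data: [DONE]\n\n");
  res.end();
}
```

Each token is flushed the moment the source yields it, so generation speed, not network batching, sets the pace of the stream.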

The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.





Code License: All code examples are released under the MIT License. Github repo.

Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.

All textual explanations, original diagrams, and illustrations are the intellectual property of the author. To support the maintenance of this site via AdSense, please read this content exclusively online. Copying, redistribution, or reproduction is strictly prohibited.