Chapter 9: Skeleton Loaders & Suspense for AI Streams
Theoretical Foundations
When we built the chat interface in the previous chapter, we used the Vercel AI SDK's useChat hook to manage streaming responses. The SDK handled the complexities of Server-Sent Events (SSE), allowing us to simply map over the messages array and render the content as it arrived. While functional, this approach exposes a raw technical reality: AI generation is slow. Unlike a traditional database query that might take 50ms, a Large Language Model (LLM) can take several seconds to produce a coherent response. If we simply wait for the first token to arrive before showing anything, the user is left staring at a static, unresponsive UI. This is a failure of perceived performance.
Perceived performance is the subjective experience of speed. It is not merely the measure of time from request to response (absolute latency), but rather how responsive and alive the application feels to the user. In the context of generative AI, where absolute latency is inherently high due to model inference, we must employ strategies to decouple the user's perception from the backend processing time. We achieve this by shifting from a "loading state" mindset to a "progressive rendering" mindset.
The core problem is the uncanny valley of waiting. When a user clicks "Send," they expect immediate feedback. If the interface freezes for 2 seconds and then suddenly dumps a block of text, the experience feels jarring and disconnected. We need to bridge the gap between the click and the final output. This is where Suspense and Skeleton Loaders come into play, but they must be adapted for the unique characteristics of AI streams.
The Web Development Analogy: The Restaurant Kitchen vs. The Assembly Line
To understand the necessity of these techniques, let's use an analogy of a restaurant.
In a traditional web application (like a standard CRUD app), the user orders a dish (makes a request), and the kitchen (server) prepares the entire dish before passing it to the waiter (API response) to serve to the customer. The customer waits, but the wait is for a complete product. This is an Assembly Line model: work is done in stages, but the product is not released until it is finished.
In a generative AI application, the kitchen is an LLM. It doesn't just cook the dish; it invents the recipe, sources the ingredients, and cooks them one spoonful at a time. If the waiter waits for the entire pot of soup to be finished before bringing the first spoon, the customer is waiting unnecessarily long. Furthermore, if the soup is terrible, the customer has wasted time waiting for the whole thing.
Model Streaming changes this dynamic. It is like the chef ladling soup into the bowl as it is cooked, handing the bowl to the waiter, who runs it to the table immediately. The customer can start tasting the soup (reading the response) while the rest of it is still being cooked.
However, there is still a delay between the order and the first spoonful (the time to generate the first token). If the waiter stands at the kitchen door staring at the chef until the first spoon is ready, the customer sees nothing happening. This is the "frozen UI" problem.
Suspense and Skeleton Loaders are the equivalent of the waiter bringing a bread basket and a menu immediately upon receiving the order. It signals to the customer, "We have received your order, we are working on it, and here is something structurally similar to what you will receive to keep you engaged." The bread basket (Skeleton) mimics the shape of the meal (the final UI structure) but is not the meal itself. It occupies the space, manages the layout, and reduces the perceived wait time.
The Mechanics of Non-Blocking UIs: React Suspense
In React, Suspense is a mechanism that allows components to "wait" for something before rendering. In the context of data fetching, it enables us to declaratively specify a loading state (a fallback) that is displayed while a component is waiting for its data.
In the previous chapter, we rendered the stream by mapping over an array of message objects. The content property of the current message was a string that grew in length as tokens arrived. While this works, it is a "pull" model where the UI is driven by the data arriving.
For the "Theoretical Foundations" of streaming, we must understand the "Push" model of Server-Sent Events (SSE) and how Suspense boundaries interact with it.
When we use the Vercel AI SDK's streamUI function (or useChat), we are establishing a persistent HTTP connection. The server pushes data chunks (tokens) to the client. The client's browser receives these chunks and updates the state.
However, React's rendering cycle is synchronous. If we try to render a component that depends on a stream that hasn't started yet, we block. Suspense solves this by suspending the rendering tree until the data is ready.
The Critical Distinction: In traditional Suspense for data fetching, the fallback is displayed until the entire data request is resolved. In AI Streaming, we don't want to wait for the entire stream. We want to show the fallback only for the initial connection latency, and then seamlessly transition to the streaming content.
This requires a nuanced approach:
1. Initial State: The component is waiting for the stream to start (e.g., waiting for the LLM to acknowledge the prompt).
2. Streaming State: The stream has started, and tokens are arriving.
We cannot use a standard Suspense boundary around the entire stream renderer, because that would hide the streaming content until the stream finishes (which might be never, or until the user closes the connection).
Instead, we use Suspense to handle the initial latency—the time from the user clicking "Send" to the arrival of the first token. Once the first token arrives, we "resolve" the suspense boundary and render the streaming component.
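The idea of "resolving the boundary on the first token" can be sketched in plain TypeScript: a helper whose promise settles as soon as the first chunk arrives, leaving the rest of the stream for incremental rendering. The helper name untilFirstChunk is ours, not a Vercel AI SDK export; a Suspense integration would await (or throw) this promise during the fallback phase.

```typescript
// Sketch: a promise that settles as soon as the FIRST chunk of a stream
// arrives. `untilFirstChunk` is our own illustrative helper, not an SDK API.
async function untilFirstChunk<T>(
  stream: ReadableStream<T>
): Promise<{ first: T | undefined; rest: ReadableStreamDefaultReader<T> }> {
  const reader = stream.getReader();
  // This await covers only the initial latency (time to first token)...
  const { value, done } = await reader.read();
  // ...after which the caller takes over and drains `rest` incrementally.
  return { first: done ? undefined : value, rest: reader };
}
```

The fallback (skeleton) is shown while this single read is pending; once it resolves, the streaming renderer takes ownership of the reader.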
Visualizing the Flow
The following diagram illustrates the timeline of a generative UI request, comparing a blocking UI to a non-blocking UI with Suspense and Skeletons.
Designing Effective Skeleton Loaders
A skeleton loader is not just a generic spinner. It is a structural placeholder. For generative UI, the skeleton must mimic the shape of the expected response.
If the AI is generating a chat response, the skeleton should look like a chat bubble. If the AI is generating code, the skeleton should look like lines of code. If the AI is generating a table, the skeleton should be a grid of gray bars.
Why is this important?
1. Layout Stability: It reserves the exact space the content will occupy. This prevents Cumulative Layout Shift (CLS), a Core Web Vital metric. When the text finally streams in, the UI doesn't jump around.
2. Cognitive Expectation: It tells the user what is coming. A user asking for a code snippet expects to see code. Seeing a spinning loader gives no context. Seeing a code-shaped skeleton reinforces the user's intent.
The "Mimicry" Principle:
In the context of the Vercel AI SDK, we often use streamUI to generate React components on the server. For example, we might ask the AI to generate a weather card. The streamUI function can stream back a React component (JSX).
The skeleton loader for this should be a static React component that matches the dimensions and hierarchy of the expected output.
// Example of a Skeleton Component designed to mimic a specific output structure
// This is a conceptual representation of how we structure skeletons for AI streams.
const WeatherCardSkeleton = () => {
  return (
    <div className="border rounded-lg p-4 shadow-sm bg-gray-50 animate-pulse">
      {/* Header Skeleton */}
      <div className="flex justify-between items-center mb-4">
        <div className="h-6 w-1/3 bg-gray-300 rounded"></div> {/* City Name */}
        <div className="h-4 w-1/6 bg-gray-300 rounded"></div> {/* Date */}
      </div>
      {/* Main Content Skeleton */}
      <div className="flex items-center justify-between">
        <div className="h-12 w-12 bg-gray-300 rounded-full"></div> {/* Icon placeholder */}
        <div className="h-8 w-20 bg-gray-300 rounded"></div> {/* Temperature */}
      </div>
      {/* Details Grid Skeleton */}
      <div className="grid grid-cols-3 gap-2 mt-4">
        <div className="h-4 bg-gray-300 rounded"></div>
        <div className="h-4 bg-gray-300 rounded"></div>
        <div className="h-4 bg-gray-300 rounded"></div>
      </div>
    </div>
  );
};
This skeleton uses the animate-pulse utility (common in Tailwind CSS) to create a gentle shimmer effect, indicating activity. The structure (header, main content, details) is preserved, so when the actual WeatherCard component streams in, the swap is visually smooth.
Server-Sent Events (SSE) and the Streaming Pipeline
To understand how Suspense and Skeletons integrate, we must look under the hood at Server-Sent Events (SSE).
SSE is a standard that allows a server to push data to a client over a single HTTP connection. Unlike WebSockets, which are bidirectional, SSE is unidirectional (server-to-client). This is perfect for AI generation because the flow is one-way: the model generates tokens, and the client consumes them.
The Pipeline:
1. Client Request: The user sends a message. The Next.js route handler (API endpoint) receives the request.
2. Model Connection: The server connects to the LLM provider (e.g., OpenAI).
3. Stream Initiation: The LLM begins generating tokens. The server receives these tokens in real-time.
4. The "Bridge": The server does not wait for the response to finish. It creates a ReadableStream (a web standard API for streaming data).
5. SSE Encoding: The server writes tokens into the stream, often formatted as SSE messages (e.g., data: {"content": "token"}\n\n).
6. Client Consumption: The browser's fetch API (or the Vercel AI SDK's internal logic) reads this stream. It parses the SSE events and updates the React state.
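Step 5's framing can be made concrete. The sketch below shows only the wire format; encodeSSE and decodeSSE are illustrative names, not SDK internals, and a production parser must also buffer events split across network chunks (the SDK does this for you).

```typescript
// Sketch of SSE framing: encode a token as a "data:" event, and parse raw
// SSE text back into tokens. Helper names are illustrative, not SDK APIs.
function encodeSSE(token: string): string {
  // Each SSE event is a "data:" line terminated by a blank line.
  return `data: ${JSON.stringify({ content: token })}\n\n`;
}

function decodeSSE(raw: string): string[] {
  return raw
    .split('\n\n')                                  // one frame per event
    .filter((frame) => frame.startsWith('data: '))  // ignore comments/empties
    .map((frame) => JSON.parse(frame.slice('data: '.length)).content);
}
```

Concatenating the decoded tokens reproduces the model's output in order, which is exactly what the client-side state update in step 6 does incrementally.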
The Role of Suspense in the Stream:
When we wrap a component in <Suspense fallback={<Skeleton />}>, React pauses the rendering of that component's children. However, for a stream, we don't want to pause the entire stream once it starts.
We use a technique called Streaming Components. In Next.js (App Router), we can pass a stream as a prop or use a reader pattern.
Consider the flow of a streamUI response. The server returns a stream of "renderable" chunks. These might be raw text tokens or serialized React components.
- Initial Render: The parent component renders and encounters a Suspense boundary.
- Suspense Trigger: The child component (the one fetching the stream) is not ready. React shows the fallback (the Skeleton).
- Stream Start: The server sends the first chunk. The client receives it. The promise associated with the stream resolves (or the state updates).
- Suspense Resolution: React detects that the data is available. It unmounts the fallback and mounts the child component.
- Streaming Rendering: The child component now renders the content. As subsequent chunks arrive (via SSE), the component updates its internal state and re-renders the text incrementally.
This creates a seamless transition:
Skeleton (0ms) -> [Suspense Boundary] -> First Token (1500ms) -> Streaming Text (1500ms - 5000ms)
Deep Dive: streamUI and Custom Loading States
The Vercel AI SDK's streamUI function is a powerful abstraction that combines generation and rendering. Unlike useChat, which primarily streams text, streamUI allows the model to generate React components directly.
How streamUI works:
It accepts a render function. As the model generates tokens, the SDK attempts to parse them into a specific format (often JSON representing the component). When a valid component definition is parsed, the render function is called.
The Loading State Challenge with streamUI:
Because streamUI can generate any component, a single static skeleton might not suffice if the AI's output structure is unpredictable.
To handle this, we implement a progressive refinement strategy for the loading state itself.
- Phase 1: Text Skeleton. Initially, we don't know what the AI will return. We show a generic text skeleton.
- Phase 2: Component Detection. As the stream arrives, we might detect that the AI is generating a structured response (e.g., a list of items).
- Phase 3: Dynamic Skeleton Swap. We can swap the generic text skeleton for a more specific layout skeleton (e.g., a list skeleton) once the structure is inferred.
This requires managing the loading state manually or using React's use hook (in experimental versions) or state management to track the "shape" of the incoming data.
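A minimal sketch of that shape-tracking, assuming we classify the partial stream buffer with simple heuristics (the function and its rules are ours, not an SDK feature):

```typescript
// Illustrative heuristic: infer which skeleton to show from the partial
// streamed text. A sketch of "dynamic skeleton swap", not an SDK API.
type SkeletonKind = 'code' | 'list' | 'component' | 'text';

function inferSkeleton(partial: string): SkeletonKind {
  const trimmed = partial.trimStart();
  if (trimmed.startsWith('```')) return 'code';      // fenced code block
  if (trimmed.startsWith('{') || trimmed.startsWith('[')) {
    return 'component';                              // JSON / tool-call payload
  }
  if (/^[-*]\s/m.test(trimmed)) return 'list';       // markdown bullets
  return 'text';
}
```

In a component, you would re-run this on each delta and swap the skeleton only when the inferred kind changes, to avoid flicker.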
Under the Hood: The ReadableStream Interface
When we implement a custom stream reader in a Next.js Server Component or a Route Handler, we are working with the Web Streams API.
// Conceptual representation of reading a stream in a React Server Component
// This is NOT executable code, but illustrates the underlying mechanism.
// Note: ReadableStream is a Web Streams API global (in Node it also lives in
// 'node:stream/web'); it is not exported by the legacy 'stream' module.

// 1. Create a stream from the AI response
const aiStream = await openai.chat.completions.create({ ... });

// 2. Convert to a Web Stream
const stream = new ReadableStream({
  async start(controller) {
    // Read from the AI stream and enqueue chunks
    for await (const chunk of aiStream) {
      const content = chunk.choices[0]?.delta?.content || '';
      if (content) {
        controller.enqueue(content);
      }
    }
    controller.close();
  },
});

// 3. In the Client Component, we read this stream
// The Vercel AI SDK abstracts this, but conceptually:
const reader = stream.getReader();
const { value, done } = await reader.read();

// 4. Updating State
// If we are using Suspense, the initial read might be wrapped in a promise
// that resolves only when the first chunk arrives.
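For contrast, here is a runnable version of the same bridge, with the OpenAI response replaced by a mock async iterable (mockModelStream), so it needs no network access or API key:

```typescript
// Runnable sketch of the "bridge": pipe an async iterable of model chunks
// into a web ReadableStream, then drain it the way a client would.
// `mockModelStream` is a stand-in for the real OpenAI response stream.
async function* mockModelStream() {
  for (const content of ['Hel', 'lo ', 'world']) {
    yield { choices: [{ delta: { content } }] };
  }
}

function toReadableStream(
  aiStream: AsyncIterable<{ choices: { delta: { content?: string } }[] }>
): ReadableStream<string> {
  return new ReadableStream<string>({
    async start(controller) {
      for await (const chunk of aiStream) {
        const content = chunk.choices[0]?.delta?.content ?? '';
        if (content) controller.enqueue(content); // push each token downstream
      }
      controller.close();
    },
  });
}

async function drain(stream: ReadableStream<string>): Promise<string> {
  const reader = stream.getReader();
  let out = '';
  for (;;) {
    const { value, done } = await reader.read();
    if (done) return out;
    out += value; // a real client would setState here instead of concatenating
  }
}
```

The only difference from production code is the mock source; the enqueue/read mechanics are identical.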
Key Takeaways
The integration of Skeleton Loaders and Suspense for AI streams is a deliberate architectural choice to combat the high absolute latency of LLMs. By understanding the mechanics of Model Streaming via SSE, we can design UIs that remain responsive.
- Perceived Performance is the goal; absolute speed is the constraint.
- Suspense manages the initial latency gap (0 to first token).
- Skeleton Loaders provide structural stability and cognitive cues.
- streamUI allows for dynamic rendering, requiring adaptive loading strategies.
In the following sections, we will move from theory to practice, implementing these concepts using the Vercel AI SDK and React Server Components to build a UI that feels instant, even when the underlying computation is heavy.
Basic Code Example
In a modern SaaS application, we often need to generate dynamic content, such as a personalized marketing email or a structured report, using a large language model (LLM). The generation process can take several seconds. If we wait for the entire response before showing anything to the user, the application feels sluggish and unresponsive.
To solve this, we use Streaming. The AI model sends back tokens (words, punctuation, code) as soon as they are generated. On the client, we can display these tokens immediately, creating the illusion of real-time generation.
However, there is a brief moment before the first token arrives where the network request is being established. To provide a polished user experience, we use Suspense Boundaries and Skeleton Loaders. The Skeleton Loader mimics the layout of the expected content, preventing the UI from jumping or shifting once the data arrives (known as Cumulative Layout Shift or CLS).
The following example demonstrates a Next.js Server Component that streams a generated "User Report" using the Vercel AI SDK. It wraps the streaming logic in a React Suspense boundary to show a skeleton loader immediately.
Code Example
This example uses the Next.js App Router and the Vercel AI SDK.
// app/actions/generateReport.tsx
// (Note the .tsx extension: this file returns JSX.)
'use server';

import { streamUI } from 'ai/rsc';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

/**
 * Server Action to generate a user report stream.
 * This runs on the server and streams tokens to the client.
 */
export async function generateReportStream(userId: string) {
  // We use streamUI to generate a specific React component based on the AI's output.
  // This allows us to stream structured UI, not just raw text.
  const result = await streamUI({
    model: openai('gpt-4o-mini'),
    // Prompt tailored for the SaaS context
    prompt: `Generate a brief status report for user ID: ${userId}.
      Include a summary of their activity and a recommendation.
      Format it clearly.`,
    // Fallback renderer for plain text tokens.
    text: ({ content }) => <div>{content}</div>,
    // Define a custom component that the AI can invoke to render specific UI blocks.
    // The Zod schema ensures the AI outputs valid JSON matching our component props.
    tools: {
      report_card: {
        description: 'A card displaying a user status report.',
        parameters: z.object({
          username: z.string().describe('The name of the user'),
          status: z.enum(['active', 'inactive', 'pending']).describe('Current status'),
          summary: z.string().describe('A brief summary of activity'),
          recommendation: z.string().describe('Suggested next step'),
        }),
        // This function renders the UI on the server when the AI provides valid JSON.
        generate: async function* ({ username, status, summary, recommendation }) {
          // Yield a loading state while generating the inner content (if needed)
          yield <div className="animate-pulse">Generating report card...</div>;
          // Return the final component
          return (
            <div className="border rounded-lg p-4 shadow-sm bg-white">
              <h3 className="font-bold text-lg text-gray-800">Report for {username}</h3>
              <div className="mt-2 space-y-2">
                <p className="text-sm text-gray-600">
                  <span className="font-semibold">Status:</span> {status}
                </p>
                <p className="text-sm text-gray-700">{summary}</p>
                <div className="mt-3 p-2 bg-blue-50 border-l-4 border-blue-500">
                  <p className="text-sm font-semibold text-blue-700">Recommendation</p>
                  <p className="text-sm text-blue-600">{recommendation}</p>
                </div>
              </div>
            </div>
          );
        },
      },
    },
  });
  return result;
}
// app/components/ReportGenerator.tsx
'use client';

import { Suspense, useState, useTransition } from 'react';
import { readStreamableValue } from 'ai/rsc';
import { generateReportStream } from '@/app/actions/generateReport';

/**
 * Client Component to trigger the stream and handle the UI state.
 */
export default function ReportGenerator() {
  const [isPending, startTransition] = useTransition();
  const [uiState, setUiState] = useState<React.ReactNode | null>(null);

  const handleClick = () => {
    startTransition(async () => {
      // 1. Call the Server Action
      const result = await generateReportStream('user_12345');
      // 2. Read the streamable value
      // The readStreamableValue utility allows us to consume the stream token-by-token
      for await (const delta of readStreamableValue(result.value)) {
        // 3. Update state with the latest UI component from the stream
        if (delta) {
          setUiState(delta);
        }
      }
    });
  };

  return (
    <div className="max-w-md mx-auto p-6 space-y-6">
      <button
        onClick={handleClick}
        disabled={isPending}
        className="w-full py-2 px-4 bg-blue-600 text-white rounded hover:bg-blue-700 disabled:opacity-50 transition"
      >
        {isPending ? 'Generating...' : 'Generate User Report'}
      </button>
      {/*
        The Suspense Boundary handles the "loading" state.
        While the stream is initializing (or if we were fetching initial data),
        the fallback UI (Skeleton) is shown immediately.
        Once the stream starts sending components, the children render.
      */}
      <Suspense fallback={<ReportSkeleton />}>
        <div className="min-h-[200px]">
          {/* Render the latest streamed UI; keep it visible after the stream ends. */}
          {uiState}
        </div>
      </Suspense>
    </div>
  );
}

/**
 * A Skeleton Loader that mimics the structure of the final Report Card.
 * This prevents layout shift and provides immediate visual feedback.
 */
function ReportSkeleton() {
  return (
    <div className="border rounded-lg p-4 shadow-sm bg-white animate-pulse">
      <div className="h-6 bg-gray-200 rounded w-1/2 mb-4"></div>
      <div className="space-y-2">
        <div className="h-4 bg-gray-200 rounded w-1/3"></div>
        <div className="h-4 bg-gray-200 rounded w-full"></div>
        <div className="h-4 bg-gray-200 rounded w-2/3"></div>
        <div className="mt-3 p-2 bg-gray-100 border-l-4 border-gray-300">
          <div className="h-3 bg-gray-200 rounded w-1/3 mb-1"></div>
          <div className="h-3 bg-gray-200 rounded w-3/4"></div>
        </div>
      </div>
    </div>
  );
}
Line-by-Line Explanation
1. Server Action (app/actions/generateReport.tsx)
- 'use server';: This directive marks the file's exports as Server Actions. It allows the client component to call this function directly, like an RPC, without manually creating an API route.
- import { streamUI }: Imports the core function from the Vercel AI SDK. streamUI is designed specifically for streaming React Server Components (RSC) to the client.
- export async function generateReportStream(...): Defines the server-side logic. It accepts a userId (simulating a SaaS context).
- const result = await streamUI({ ... });: We call the streamUI function. It communicates with the AI provider (OpenAI in this case) and handles the streaming protocol.
- model: openai('gpt-4o-mini'): Specifies the model. We use a smaller, faster model suitable for structured text generation.
- prompt: ...: A string instruction telling the model what to generate. We include the userId to make the request dynamic.
- text: ({ content }) => <div>{content}</div>: Defines how to handle plain text tokens if the AI doesn't invoke a tool. It wraps the text in a simple div.
- tools: { report_card: { ... } }: This is the most powerful part of streamUI.
  - Description: We tell the AI it has access to a tool named report_card and describe what it's for.
  - Parameters (Zod): We use Zod to strictly define the data structure the AI must return (username, status, summary, recommendation). The SDK validates the AI's output against this schema.
  - Generate: This is an async generator function (async function*). It runs on the server when the AI successfully provides valid JSON matching the Zod schema.
- yield <div>Generating report card...</div>: Inside the generator, we can yield intermediate UI. This is useful if the tool generation takes time. In this example, it shows a quick loading state inside the card itself.
- return <div className="border...">...</div>: The final return value is a fully formed React component. This component is serialized and streamed to the client.
- return result;: The function returns the result object, which contains a value property. This value is a streamable state that the client can consume.
2. Client Component (app/components/ReportGenerator.tsx)
- 'use client';: Marks this as a Client Component because it uses hooks (useState, useTransition) and event handlers (onClick).
- const [isPending, startTransition] = useTransition();: A React hook to handle state updates that are non-blocking. isPending becomes true while the server action is running.
- const [uiState, setUiState] = useState<React.ReactNode | null>(null);: State to hold the current chunk of UI received from the stream.
- startTransition(async () => { ... }): When the user clicks the button, we wrap the async call in a transition. This keeps the UI responsive.
- const result = await generateReportStream('user_12345');: We call the server action. Note that await here only waits for the initial connection to be established; the SDK handles the streaming logic internally.
- for await (const delta of readStreamableValue(result.value)) { ... }: We use the readStreamableValue utility provided by the SDK. This converts the raw stream from the server into an async iterable. We loop through every "delta" (update) sent from the server.
- if (delta) { setUiState(delta); }: As each new UI chunk arrives (e.g., first the text "User Report", then the card component), we update the local state. This triggers a re-render, and the user sees the content appearing progressively.
- <Suspense fallback={<ReportSkeleton />}>: This wraps the area where the content will appear.
  - Why is it here? While isPending is true (during the initial click), the Suspense boundary catches the loading state. If we had fetched initial data using use or fetch inside a child component, Suspense would pause rendering that child until the data is ready, showing the fallback (ReportSkeleton) instead.
  - Note: In this specific streaming pattern, the Suspense boundary primarily handles the initial "hanging" state before the stream starts emitting tokens. Once the stream emits the first token, the children (the div containing uiState) are rendered.
- <ReportSkeleton />: This component is purely presentational. It uses gray bars (bg-gray-200) and animate-pulse to mimic the shape of the final report card. This ensures that when the real content loads, the layout doesn't shift, providing a smooth transition.
Visualizing the Data Flow
The following diagram illustrates the lifecycle of a streaming request in this architecture.
Common Pitfalls
When implementing streaming UI with React Suspense and the Vercel AI SDK, developers often encounter specific issues related to JavaScript/TypeScript behavior and server-client boundaries.
1. Hallucinated JSON / Schema Validation Errors
* The Issue: LLMs are probabilistic. Even with strict prompting, the model might return text that isn't valid JSON, or JSON that doesn't match your Zod schema (e.g., returning "Status: Active" instead of { "status": "active" }).
* The Consequence: If you don't validate the output, the generate function will throw an error, breaking the stream and potentially crashing the UI.
* The Fix: Always use a schema validator like Zod (as shown in the example). The streamUI function handles validation automatically. If the AI returns invalid data, the SDK will retry or handle the error gracefully, preventing your application code from receiving malformed data.
2. Vercel/AWS Timeouts (504 Gateway Timeout)
* The Issue: Serverless functions (like Vercel Edge or Node.js functions) have strict execution time limits (e.g., 10s for Hobby, up to 15s for Pro on Vercel). If the AI model takes too long to generate the first token or the full response, the serverless platform may terminate the connection.
* The Consequence: The client receives a 504 error, and the stream cuts off abruptly.
* The Fix:
* Keep Prompts Concise: Reduce the context sent to the model.
* Optimize Model Choice: Use faster models (like gpt-4o-mini instead of gpt-4).
* Client-Side Resilience: Implement error boundaries on the client to catch broken streams and offer a "Retry" button.
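As a sketch of that client-side resilience, assuming nothing beyond standard promises: a deadline wrapper plus a retry loop (withTimeout and retry are our own helper names, not SDK utilities). A real app would surface the final failure to an error boundary with a "Retry" button.

```typescript
// Sketch: retry an async operation that may time out (e.g., a stream that
// never produces its first token). Helper names are illustrative only.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error('timeout')), ms);
    promise.then(
      (v) => { clearTimeout(timer); resolve(v); },
      (e) => { clearTimeout(timer); reject(e); }
    );
  });
}

async function retry<T>(fn: () => Promise<T>, attempts: number): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn(); // success: stop retrying
    } catch (e) {
      lastError = e;     // failure: remember the error and try again
    }
  }
  throw lastError;       // all attempts exhausted
}
```

A call like retry(() => withTimeout(startStream(), 10_000), 3) gives each attempt ten seconds to produce a result before giving up.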
3. Async/Await Loops Blocking the Stream
* The Issue: Inside the generate function of a tool, developers might be tempted to perform heavy synchronous computations or await non-streaming database calls.
// BAD: This blocks the stream until the query and processing finish
generate: async ({ id }) => {
  const data = await db.query('SELECT * FROM users WHERE id = ?', [id]);
  // ... heavy processing ...
  return <Component data={data} />;
}
* The Fix: Use yield to send intermediate states immediately. If you must fetch data, ensure it's done efficiently. For very heavy computations, consider moving them to a background job and polling for status on the client, rather than blocking the streaming response.
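The non-blocking pattern can be sketched without JSX: an async generator that yields a placeholder immediately and only then awaits the slow work. Here strings stand in for the components the real generate function would produce, and slowQuery is a hypothetical stand-in for the database call.

```typescript
// GOOD (sketch): yield an intermediate state immediately, then do the slow
// work. Strings stand in for JSX; `slowQuery` is a stand-in for a DB call.
const slowQuery = (id: string) =>
  new Promise<{ id: string }>((resolve) =>
    setTimeout(() => resolve({ id }), 50)
  );

async function* generate({ id }: { id: string }) {
  yield 'Loading user...';          // streamed to the client right away
  const data = await slowQuery(id); // slow work happens AFTER the placeholder
  return `User ${data.id} loaded`;  // final "component"
}
```

The client sees the placeholder within milliseconds, then the final value whenever the query finishes, instead of a frozen stream.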
4. Misplacing Suspense Boundaries
* The Issue: Placing Suspense too high up in the component tree (e.g., in layout.tsx) or not wrapping the specific streaming component.
* The Consequence: If Suspense is missing, the browser might hang waiting for the stream to complete before rendering anything. If placed too high, an error in one streaming component might take down the entire page layout.
* The Fix: Wrap the streaming component specifically. Use granular Suspense boundaries for different parts of the page. Remember that Suspense only works for async operations that are designed to "suspend" (like reading a streamable value or fetching data in a Server Component).
5. Forgetting 'use client' or 'use server'
* The Issue: Mixing client and server code without the directives.
* The Consequence:
* Calling a Server Action without 'use server' results in a network error (404).
* Using useState in a file without 'use client' causes a build error.
* The Fix: Be explicit. Server Actions (logic that touches the DB or AI) belong in server files. UI interaction logic (clicks, form state) belongs in client files.
The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.
Code License: All code examples are released under the MIT License.
Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.