
Chapter 1: Modeling SaaS Plans (Tokens vs Seats)

Theoretical Foundations

In our previous exploration of the Agentic Workflow Orchestration pattern, we established how complex tasks are decomposed into a sequence of discrete, manageable steps executed by specialized agents. This concept of breaking down a larger process into fundamental units of work is the philosophical bridge to understanding our first SaaS pricing model: the Token.

A Token is the atomic unit of the Large Language Model universe. It is the currency of computation. When you send a prompt to an LLM, you are not sending a string of characters; you are sending a sequence of tokens. When the model responds, it generates a sequence of new tokens. Every single aspect of the interaction—from the cost of the API call to the model's ability to remember the conversation—is governed by the token.

To understand a token, let's use an analogy. Imagine you are a typesetter in the age of Gutenberg. You don't think in terms of sentences or paragraphs, but in individual letters, spaces, and punctuation marks. These are your raw materials. A token is the LLM's equivalent of a letter or a syllable. It's the smallest piece of text the model can understand and work with. For English text, a good rule of thumb is that 100 tokens is approximately 75 words. The word "unbelievably" might be broken into two tokens: "un" and "believably". The model doesn't see the word as a single entity, but as a combination of these fundamental building blocks.

This granularity has profound implications. It means that every single character you type into a prompt has a cost and a computational weight. This is the bedrock of consumption-based billing.
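To make this concrete, here is a rough back-of-the-envelope token estimator. This is a heuristic sketch only — real tokenizers use learned subword vocabularies — but the roughly-four-characters-per-token approximation is a common budgeting shortcut for English text, and the function names are illustrative:

```typescript
// Rough token estimator — a budgeting heuristic, not a real tokenizer.
// Real tokenizers split text on learned subword units; ~4 characters per
// token is a common approximation for English prose.
export function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Estimate the dollar cost of a piece of text at a given per-million-token rate.
export function estimateCostUSD(text: string, pricePerMillionTokens: number): number {
  return (estimateTokens(text) / 1_000_000) * pricePerMillionTokens;
}
```

At $10 per million tokens, a 300-word prompt (roughly 400 tokens) costs a fraction of a cent — which is exactly why the low-barrier-to-entry argument below works.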

The Context Window: The Model's Working Memory

If tokens are the alphabet, the Context Window is the size of the LLM's short-term memory. It is a fixed-size buffer, measured in tokens, that the model can hold in its "consciousness" at any one time. This window includes both the input (your prompt, including any previous conversation history) and the output (the model's generated response).

Think of the Context Window as a whiteboard in a meeting room. The whiteboard has a finite amount of space. You can write your questions on it (input), and the model writes its answers (output). As the meeting progresses, the whiteboard fills up. If you try to write more than the board can hold, something must be erased. In the world of LLMs, the oldest information (the beginning of the conversation) is truncated, or "forgotten," to make space for new tokens. This is why a model might "forget" your initial instructions in a very long conversation—its whiteboard simply ran out of space.

This constraint is absolute. You cannot have a conversation that exceeds the model's Context Window. Therefore, managing token usage is not just about cost; it's about maintaining the coherence and effectiveness of the AI interaction.
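The "whiteboard" truncation strategy described above can be sketched in a few lines. This is a simplified illustration that assumes per-message token counts are already known; production systems typically also pin the system prompt so it is never erased:

```typescript
// Sketch of context-window truncation: drop the oldest messages until the
// conversation fits the model's window. Token counts are assumed to come
// from a tokenizer and are taken as given here.
interface Message {
  role: 'system' | 'user' | 'assistant';
  content: string;
  tokens: number;
}

export function fitToContextWindow(history: Message[], maxTokens: number): Message[] {
  const fitted = [...history];
  let total = fitted.reduce((sum, m) => sum + m.tokens, 0);
  // Erase from the front of the whiteboard (the oldest turns) until we fit.
  while (total > maxTokens && fitted.length > 0) {
    total -= fitted.shift()!.tokens;
  }
  return fitted;
}
```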

The "Token" SaaS Model: A Deep Dive into Consumption Billing

The "Token" SaaS model, therefore, is a direct reflection of this underlying technical reality. It is a consumption-based or pay-as-you-go pricing structure. The customer is billed for the exact amount of computational resource they consume, measured in tokens.

The "What": A customer purchases a "bucket" of tokens, for example, 1,000,000 tokens for $10. Every time they make an API call to your service, which in turn calls an LLM, the input and output tokens are counted and deducted from their balance. When they run out, they either stop receiving service or must purchase more.
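The bucket mechanics can be sketched as a small class. This is an illustrative in-memory model, not a production billing engine (which would need persistence and atomic updates):

```typescript
// Minimal sketch of a prepaid token bucket: the customer buys a balance,
// and each request deducts input + output tokens. Names are illustrative.
export class TokenBucket {
  constructor(private balance: number) {}

  get remaining(): number {
    return this.balance;
  }

  // Deducts the request's total tokens; returns false (and deducts
  // nothing) if the remaining balance is insufficient.
  charge(inputTokens: number, outputTokens: number): boolean {
    const cost = inputTokens + outputTokens;
    if (cost > this.balance) return false;
    this.balance -= cost;
    return true;
  }
}
```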

The "Why" - The Strategic Implications:

  1. For the Customer (The Startup): This model offers a low barrier to entry. A developer can sign up and start building an application for pennies. They are not locked into a large monthly contract. Their costs scale directly with their usage and, by extension, their revenue. If they have a quiet month, their bill is low. If they have a viral hit, their bill goes up, but it's funded by their success. This is incredibly attractive for businesses with unpredictable or spiky workloads.

  2. For the Provider (You): This model aligns your revenue directly with your infrastructure costs. Your primary expense is the API bill from the LLM provider (e.g., OpenAI, Anthropic). When a customer consumes 1M tokens, your cost is predictable, and you've priced it to include your margin. It feels "fair" and transparent. However, the revenue is unpredictable. You cannot forecast your monthly recurring revenue (MRR) with certainty, which makes it difficult to plan for growth, hire, or secure investment.

Under the Hood: From a systems perspective, implementing this requires a robust tracking mechanism. Every single API request that passes through your backend must be intercepted. The system needs to capture the number of input and output tokens returned by the LLM provider's API. This count must be logged against the user's account in real-time. This is where a system like Stripe's Usage-Based Billing becomes critical, as it can ingest these usage records and handle the metering, aggregation, and invoicing automatically.
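A minimal sketch of that interception layer might look like the following. The in-memory array stands in for a database, and forwarding the records to a billing provider such as Stripe is deliberately left abstract — the exact ingestion API is not shown here:

```typescript
// Sketch of the metering layer described above. Every LLM call is routed
// through recordUsage, which logs tokens against the user's account.
// A real system would persist these records and forward them to a billing
// provider (e.g., Stripe usage-based billing) for aggregation and invoicing.
interface UsageRecord {
  userId: string;
  inputTokens: number;
  outputTokens: number;
  timestamp: number;
}

const usageLog: UsageRecord[] = [];

export function recordUsage(userId: string, inputTokens: number, outputTokens: number): void {
  usageLog.push({ userId, inputTokens, outputTokens, timestamp: Date.now() });
}

// Aggregate a user's total consumption (e.g., for the current billing period).
export function totalTokensFor(userId: string): number {
  return usageLog
    .filter((r) => r.userId === userId)
    .reduce((sum, r) => sum + r.inputTokens + r.outputTokens, 0);
}
```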

The "Seat" SaaS Model: A Deep Dive into Fixed Licensing

In stark contrast to the granular, unpredictable nature of token billing is the "Seat" model. This is a classic fixed-fee or subscription-based structure.

The "What": A customer pays a flat fee per user (or "seat") per month, regardless of how much they use the service. For example, $20 per user per month. If a company buys 10 seats, their bill is $200/month, whether they make 10 API calls or 10,000.

The "Why" - The Strategic Implications:

  1. For the Customer (The Enterprise): This model provides cost predictability. A CFO can budget for it precisely. It also encourages adoption and "land-and-expand" strategies within an organization. Once a team has 10 seats, they can try out the service with other departments without worrying about a sudden spike in the bill. It removes the "meter anxiety" that can stifle experimentation and usage.

  2. For the Provider (You): This model provides revenue predictability. You know exactly what your MRR will be, which is the lifeblood of a stable, growing SaaS business. It simplifies financial planning and makes your business more valuable to investors. The trade-off is that you absorb the risk of variable infrastructure costs. If a "power user" in one of your $20/month seats uses the service 100x more than an average user, your profit margin on that seat can be severely eroded or even become a loss.

Under the Hood: The technical implementation is simpler on the billing side but more complex on the access control side. You need a system to manage user invitations, role-based access, and seat allocation. The primary challenge is preventing "seat leakage"—where a company pays for seats for employees who no longer use the service. This requires robust user activity monitoring and de-provisioning workflows. The billing integration with a system like Stripe is straightforward: a recurring subscription for a fixed quantity.
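The "seat leakage" check described above can be sketched as a simple filter over last-activity timestamps. Field names here are illustrative, not taken from any particular product:

```typescript
// Sketch of a seat-leakage check: flag seats whose last activity is older
// than a threshold so they can be reviewed or de-provisioned.
interface Seat {
  email: string;
  lastActiveAt: number; // epoch milliseconds
}

export function findStaleSeats(seats: Seat[], now: number, maxIdleMs: number): Seat[] {
  return seats.filter((s) => now - s.lastActiveAt > maxIdleMs);
}
```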

The Hybrid Approach: The Best of Both Worlds

Many successful SaaS companies don't choose one model; they blend them. This is often called a "hybrid" or "tiered usage" model.

The "What": A customer subscribes to a plan that includes a generous allowance of tokens for a fixed monthly fee (a "Seat"). For example, the "Pro" plan at $50/month includes 5 million tokens. If they exceed that allowance, they are charged an overage fee based on a pre-defined price per additional token (consumption).

The "Why": This model provides the predictability of the Seat model for the customer's baseline usage, while still capturing the value of high consumption through the overage fees. It gives the provider a predictable revenue floor while protecting margins from runaway power users. It's a powerful way to satisfy both CFOs and power users.
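The hybrid bill is straightforward arithmetic. The sketch below uses the $50 fee and 5-million-token allowance from the example above; the overage rate is an illustrative assumption, not a figure from the text:

```typescript
// Sketch of the hybrid plan: a flat fee covers an included token allowance,
// and usage beyond it is billed per additional token. The default overage
// rate ($5 per million tokens) is an illustrative assumption.
export function hybridBill(
  tokensUsed: number,
  baseFee = 50,
  includedTokens = 5_000_000,
  overagePerToken = 0.000005
): number {
  const overage = Math.max(0, tokensUsed - includedTokens);
  return baseFee + overage * overagePerToken;
}
```

A customer who stays under the allowance pays exactly the flat fee; a power user who burns 6 million tokens pays the fee plus the metered overage, which is precisely the "revenue floor plus margin protection" trade-off described above.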

Visualizing the Models

To crystallize these relationships, consider the following data flow for a token-based request.


This diagram illustrates that the billing logic is not an afterthought; it is an integrated part of the request-handling pipeline. The system must count tokens before sending them to the model (to check limits) and after receiving the response (to bill for usage).

Web Development Analogy: API Calls vs. Hosting Tiers

To ground this in a more familiar web development context, think of the Token model as the API Call pricing model of a service like AWS S3 or Twilio. You pay per gigabyte of storage or per SMS message sent. Your bill is a direct result of your application's activity. It's usage-based, raw, and variable.

The Seat model, on the other hand, is like a managed hosting plan or a GitHub Teams subscription. You pay a flat monthly fee for a defined set of resources (e.g., server specs, number of team members) and capabilities. You don't get a bill for every single HTTP request your server handles or every git push your developer makes. The cost is fixed and predictable, abstracting away the underlying resource consumption.

Choosing between these models is one of the most critical strategic decisions you will make. It defines your relationship with your customers, your own revenue stability, and the fundamental architecture of your monetization engine.

Basic Code Example

This example demonstrates a "Hello World" implementation of streamable-ui. We will build a simple Next.js API route that streams a generated response and a clickable button component back to the client in real-time. This illustrates how SaaS platforms can move beyond static text to deliver interactive, dynamic interfaces directly from the AI model's output stream.

We will use the Vercel AI SDK (ai package) to handle the streaming transport and Server Actions to serialize the React component.

The Architecture

The logic flows through three distinct stages:

  1. Generation: The LLM generates a text response.
  2. Transformation: The server intercepts the stream, identifies a specific trigger (e.g., [ACTION]), and injects a React component definition into the stream.
  3. Hydration: The client receives the stream, renders the text as it arrives, and when the component definition is received, it instantiates the interactive React component in the DOM.
The diagram illustrates the hydration process, where the client receives an HTML stream, progressively renders the text, and upon receiving the component definition, instantiates the interactive React component within the DOM.

The Code

This is a fully self-contained Next.js example. It assumes a standard App Router setup.

// File: app/actions.tsx (the .tsx extension is required because the file contains JSX)
'use server';

import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';
import { createStreamableValue } from 'ai/rsc';
import React from 'react';

/**
 * @description A simple React Button component that will be streamed to the client.
 * Note: In a real app, this might be a complex interactive chart or form.
 */
const StreamableButton = ({ text }: { text: string }) => {
  return (
    <button 
      onClick={() => alert(`Clicked: ${text}`)}
      style={{ 
        padding: '10px 20px', 
        marginTop: '10px', 
        backgroundColor: '#0070f3', 
        color: 'white', 
        border: 'none', 
        borderRadius: '5px', 
        cursor: 'pointer' 
      }}
    >
      {text}
    </button>
  );
};

/**
 * @description Server Action to generate text and inject a UI component.
 * This function streams both raw text tokens and a serializable component definition.
 */
export async function generateInteractiveResponse(prompt: string) {
  // 1. Initialize a streamable value. This acts as the transport mechanism.
  const stream = createStreamableValue<string>();

  (async () => {
    // 2. Generate text from the LLM.
    // We use generateText for simplicity, but in a real streaming UI scenario, 
    // we would likely use streamText to process tokens as they arrive.
    const { text } = await generateText({
      model: openai('gpt-3.5-turbo'),
      prompt: `Generate a response to: "${prompt}". 
               If the user asks for an action, include the text "[ACTION]" in your response.`,
    });

    // 3. Simulate streaming logic.
    // In a real streamText implementation, we would iterate over the token stream.
    // Here, we simulate the flow by splitting the text and injecting the component.
    const parts = text.split('[ACTION]');

    // Stream the first part of the text
    if (parts[0]) {
      stream.update(parts[0]);
    }

    // 4. The Magic: Inject a React Component into the stream.
    // We serialize the component definition. The client will hydrate this.
    // We pass the component type and its props.
    if (parts.length > 1) {
      stream.update(
        // @ts-ignore - We are passing a component definition to the stream
        <StreamableButton text="Confirm Action" />
      );

      // Stream the remaining text if any
      if (parts[1]) {
        stream.update(parts[1]);
      }
    }

    // 5. Close the stream
    stream.done();
  })();

  return stream.value;
}
// File: app/page.tsx
'use client';

import React, { useState } from 'react';
import { readStreamableValue } from 'ai/rsc';
import { generateInteractiveResponse } from './actions';

export default function Page() {
  const [input, setInput] = useState('');
  const [components, setComponents] = useState<React.ReactNode[]>([]);

  const handleSubmit = async (e: React.FormEvent) => {
    e.preventDefault();

    // 1. Call the Server Action to get the stream
    const stream = await generateInteractiveResponse(input);

    // 2. Read the stream
    for await (const chunk of readStreamableValue(stream)) {
      // 3. Check if the chunk is a React Element (Component) or a String
      if (React.isValidElement(chunk)) {
        // If it's a component, add it to our state to render it
        setComponents((prev) => [...prev, chunk]);
      } else if (typeof chunk === 'string') {
        // If it's text, we typically append it to a text state.
        // For simplicity here, we just log it or append to a display string.
        console.log('Text Token:', chunk);
      }
    }
  };

  return (
    <div style={{ padding: '20px', fontFamily: 'sans-serif' }}>
      <h1>Streamable UI Demo</h1>
      <form onSubmit={handleSubmit}>
        <input 
          value={input}
          onChange={(e) => setInput(e.target.value)}
          placeholder="Ask for an action..."
          style={{ padding: '10px', width: '300px' }}
        />
        <button type="submit" style={{ marginLeft: '10px', padding: '10px' }}>
          Send
        </button>
      </form>

      <div style={{ marginTop: '20px', border: '1px solid #eee', padding: '10px' }}>
        <strong>Streamed Components:</strong>
        {/* 4. Render the components that were streamed from the server */}
        <div style={{ marginTop: '10px' }}>
          {components.map((comp, index) => (
            <div key={index}>{comp}</div>
          ))}
        </div>
      </div>
    </div>
  );
}

Line-by-Line Explanation

1. The Server Action (actions.ts)

  • 'use server';: This directive marks the file as containing Server Actions. It ensures the code runs only on the server, keeping API keys (like OpenAI) secure.
  • createStreamableValue<string>(): This initializes a mutable stream object provided by the Vercel AI SDK. It acts as a wrapper around a standard Web ReadableStream. We type it as string initially, though it will eventually hold mixed types (strings + React elements).
  • const { text } = await generateText(...): We perform a standard LLM call. In a purely streaming architecture, we would use streamText instead, which returns a textStream (an AsyncIterable). For this simplified example, we generate the full text first to demonstrate the logic of parsing and injecting components.
  • text.split('[ACTION]'): We simulate a "trigger" logic. In a production environment, the LLM might output a structured JSON or a specific token that signals the backend to inject a UI component.
  • stream.update(parts[0]): This pushes the first chunk of text into the stream. The Vercel AI SDK serializes this and sends it to the client via Server-Sent Events (SSE).
  • stream.update(<StreamableButton ... />): This is the core concept. We are passing a React Element directly into the stream. The SDK handles the serialization of this object (turning it into a format the client can understand) and transmits it over the network.
  • stream.done(): Signals the end of the stream. The client's for await loop will terminate here.

2. The Client Component (page.tsx)

  • 'use client';: Marks this as a client-side component in the Next.js App Router, allowing the use of hooks (useState) and event listeners.
  • readStreamableValue(stream): This utility function consumes the stream coming from the server. It returns an AsyncIterable that yields chunks of data.
  • for await (const chunk of ...): We iterate over the stream as data arrives. This is crucial for low latency; the UI updates incrementally without waiting for the full response.
  • React.isValidElement(chunk): This is the client-side "hydration" check. When the server sends a React component, the client receives it as a serialized object. readStreamableValue automatically reconstructs it into a valid React Element. We check if the chunk is an element (a component) or a primitive (string).
  • setComponents(...): If we detect a component, we add it to the local state. React automatically renders this component into the DOM, making it interactive (clickable, etc.) immediately.

Common Pitfalls

When implementing streamable-ui, developers often encounter specific TypeScript and architectural challenges:

  1. Hallucinated JSON / Serialization Errors

    • Issue: If you ask an LLM to "output a React component," it will often output JSX code as a string (e.g., "<button>Click me</button>"). This is just text; it is not a functional React component and cannot be interactive.
    • Fix: The server must explicitly construct the React element (e.g., React.createElement(...)) and pass the object to the stream, not the stringified code. Never trust the LLM to write its own UI code directly into the DOM; use predefined components that the LLM can trigger via logic.
  2. Vercel Timeout Limits

    • Issue: Vercel Serverless Functions have a strict execution timeout (usually 10s on Hobby, 60s on Pro). If the LLM generation is slow or the logic is complex, the stream might be cut off.
    • Fix: Use streamText from the AI SDK rather than generateText. streamText returns the response token-by-token, which keeps the connection active and responsive, even if the full generation takes longer than the timeout window (as the stream remains open as long as the connection holds).
  3. Async/Await Loop Blocking

    • Issue: Awaiting the full generation before starting the stream defeats the purpose of streaming. Developers might accidentally buffer tokens.
    • Fix: Ensure you are using for await on the generator (the stream), not on the finished result. The stream must be processed asynchronously as it arrives.
  4. TypeScript Strict Mode Issues

    • Issue: readStreamableValue returns string | React.ReactNode. TypeScript might complain when you try to use React-specific methods on a generic string type.
    • Fix: Use type guards rigorously. Always check typeof chunk === 'string' or React.isValidElement(chunk) before manipulating the data. Do not assume the type of the stream chunk.

The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.





Code License: All code examples are released under the MIT License. Github repo.

Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.

All textual explanations, original diagrams, and illustrations are the intellectual property of the author. To support the maintenance of this site via AdSense, please read this content exclusively online. Copying, redistribution, or reproduction is strictly prohibited.