
Chapter 4: Rate Limiting AI Endpoints (Token Buckets)

Theoretical Foundations

In the previous chapters, we established that modern AI-powered applications are built on a foundation of Agents. We conceptualized these Agents as sophisticated Microservices, where each service is responsible for a discrete, high-value task: retrieving a document, parsing a query, or synthesizing a response. We also explored the Query Vector, the mathematical fingerprint of a user's intent, which allows these microservices to perform semantic search rather than simple keyword matching.

However, there is a critical economic and technical constraint that governs all distributed systems, especially those involving Large Language Models (LLMs): Resource Scarcity. The "Edge-First Deployment Strategy" pushes computation closer to the user to reduce latency, but the final, most expensive computation—the LLM inference—often resides in centralized cloud infrastructure or is metered by third-party API providers (like OpenAI or Anthropic). These providers charge not just per request, but per token (the atomic unit of text processed by the model). A single complex query can consume thousands of tokens, and a malicious actor or a buggy client can generate thousands of requests, leading to astronomical costs and service degradation.

To manage this, we implement Rate Limiting. But standard rate limiting (e.g., "100 requests per minute") is insufficient for AI endpoints because requests are not equal. A request to summarize a short paragraph is cheap; a request to generate a 2,000-word essay is expensive. We need a system that understands the cost of work, not just the volume of requests.

This brings us to the Token Bucket Algorithm.

The Token Bucket: A Water Reservoir Analogy

Imagine every user of your application has a personal water reservoir. This reservoir has two key properties:

  1. Capacity (Bucket Size): The maximum volume of water the reservoir can hold. This represents the user's burst allowance. A user can accumulate tokens over time up to this limit, allowing them to make a sudden burst of expensive requests without being blocked.
  2. Refill Rate: The speed at which water flows into the reservoir. This represents the user's sustained rate limit. It is measured in tokens per second (or per minute). (A constant outflow rate, by contrast, is the hallmark of the related leaky bucket algorithm; in the token bucket, water flows in at a constant rate.)

When a user makes a request to an AI endpoint (e.g., generateReport), we attempt to draw a specific amount of water (tokens) from the reservoir.

  • Sufficient Water: If the reservoir has enough water, the request is processed, and the water is removed. The user's "budget" decreases.
  • Insufficient Water: If the reservoir is empty (or doesn't have enough water for the requested draw), the request is rejected with a 429 Too Many Requests error. The user must wait for the reservoir to refill.

This model elegantly solves the "bursty" nature of AI traffic. A user might not use the app for an hour, accumulating a full bucket of tokens. When they return, they can perform several expensive operations in quick succession before hitting the sustained rate limit. Conversely, a user spamming the API will drain their bucket quickly and be forced to wait, protecting the system from abuse.
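The reservoir mechanics above can be sketched as a minimal in-memory token bucket. This is for illustration only: as discussed later in this chapter, a real deployment needs shared state (e.g., Redis) rather than a single process's memory.

```typescript
// A minimal in-memory token bucket, mirroring the reservoir analogy:
// `capacity` is the bucket size (burst allowance), `refillRate` is the
// sustained rate in tokens per second.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private readonly capacity: number, // burst allowance (bucket size)
    private readonly refillRate: number, // tokens added per second
    now: number = Date.now() / 1000,
  ) {
    this.tokens = capacity; // the reservoir starts full
    this.lastRefill = now;
  }

  /** Try to draw `cost` tokens; returns true if the request may proceed. */
  tryConsume(cost: number, now: number = Date.now() / 1000): boolean {
    // Refill based on elapsed time, capped at capacity.
    const elapsed = now - this.lastRefill;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillRate);
    this.lastRefill = now;

    if (this.tokens >= cost) {
      this.tokens -= cost;
      return true;
    }
    return false;
  }
}
```

With a capacity of 5 and a refill rate of 1 token/second, a returning user can burst five requests at once, is then denied, and regains one request per second of waiting.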

The Web Development Analogy: API Gateway vs. Edge Middleware

In traditional web development, rate limiting is often handled at the API Gateway level. This is like a bouncer at the front door of a nightclub. Every person (request) is counted as they enter. If the count exceeds a threshold, the bouncer stops letting people in. This is simple and effective for standard HTTP endpoints.

However, in an Edge-First architecture using tRPC, we are not just counting requests at the door; we are inspecting the intent and cost of each request inside the application logic, right at the edge.

Think of tRPC procedures as Microservices (as discussed in Book 6). Each procedure is a specialized endpoint. The generateReport procedure is a microservice that performs a heavy, expensive LLM call. The searchVector procedure is a microservice that performs a lighter, faster vector search.

Placing a rate limiter at the API Gateway is like having a bouncer who only counts people but doesn't know what they plan to do inside. One person might just order a water (cheap), while another might order a bottle of champagne (expensive). The bouncer treats them equally.

By integrating the Token Bucket algorithm directly into the tRPC middleware pipeline, we move the "bouncer" inside each microservice. We can now apply different rate limits based on the type of work the procedure performs. The searchVector procedure might have a generous bucket (e.g., 1,000 tokens per minute), while the generateReport procedure has a strict bucket (e.g., 10 tokens per minute). This is intelligent rate limiting.

Under the Hood: Scalability and State

The challenge with any stateful rate limiter is state management. Where do we store the current water level (token count) and the last refill time for millions of users?

  • In-Memory (e.g., Node.js Map): Fast but not scalable. In a serverless or edge function environment, instances are ephemeral. A user's request might hit a different server instance on the next call, losing their token count. This is unsuitable for a global application.
  • Database (e.g., PostgreSQL): Persistent but slow. The round-trip time to query a database for every single request would add significant latency, defeating the purpose of an Edge-First strategy.
  • Distributed Cache (e.g., Redis): The ideal solution. Redis is an in-memory data store that is incredibly fast (microsecond latency) and can be distributed globally. It acts as the shared "reservoir" that all edge function instances can access simultaneously.

This is why we use Upstash Redis. It provides a serverless Redis database that can be deployed at the edge, ensuring low-latency access to token counts from anywhere in the world. The token bucket state is stored in Redis, making it persistent and consistent across all edge function invocations.

Visualizing the Token Bucket Flow

The following diagram illustrates how a request flows through the tRPC middleware and the token bucket check.

This diagram visualizes how an incoming request is processed by the tRPC middleware, which queries the Redis-stored token bucket state to determine if the request has sufficient tokens for processing.

The Role of Redis in the Edge-First Strategy

In the context of our architecture, the Edge-First Deployment Strategy dictates that we minimize latency by performing computations as close to the user as possible. The token bucket check is a lightweight computation, but it requires state. By using Upstash Redis, we bridge the gap between stateless edge functions and stateful rate limiting.

  1. Atomic Operations: Redis executes every command atomically and supports Lua scripting, which lets us read a user's token count, update it, and set a time-to-live (TTL) on the key as a single indivisible operation. This ensures that even under high concurrency, token counts remain accurate and race conditions are avoided.
  2. Global Consistency: A user in Tokyo and a user in New York both interact with the same Redis instance (or a globally replicated one). This means a user cannot bypass rate limits by switching regions or making concurrent requests from different locations.
  3. Cost Control: By enforcing limits at the edge, we prevent expensive LLM calls from ever being executed. This is crucial for cost control. It is far cheaper to reject a request with a 429 error than to process it and discover it exceeds a budget.

Integrating with tRPC Middleware

In tRPC, middleware is a function that runs before the main procedure resolver. It has access to the context (like user information) and can modify the request flow. By placing our token bucket logic here, we create a reusable, declarative way to protect our AI endpoints.

The flow is as follows:

  1. The client calls a tRPC procedure (e.g., ai.generateReport).
  2. The tRPC router invokes the middleware chain.
  3. The rate-limiting middleware extracts the user's ID from the context (e.g., from a JWT).
  4. It constructs a unique key for the user and the specific procedure (e.g., rate_limit:user_123:generateReport).
  5. It queries Redis to get the current token count and the last refill timestamp.
  6. It calculates the new token count based on the elapsed time since the last request.
  7. It checks if the new count is greater than or equal to the cost of the current request.
  8. If yes, it decrements the token count in Redis and calls next() to proceed to the procedure resolver.
  9. If no, it throws a tRPC error with code TOO_MANY_REQUESTS, which is automatically converted to an HTTP 429 response.

This approach ensures that the rate-limiting logic is completely decoupled from the business logic of the AI procedures. We can apply the same middleware to any procedure, simply by configuring the bucket size and refill rate.

Summary

The Token Bucket algorithm is not just a technical implementation detail; it is a fundamental economic model for managing scarce computational resources in a distributed, Edge-First architecture. By comparing it to a water reservoir, we can intuitively understand how it handles both bursty traffic and sustained usage. By integrating it directly into the tRPC middleware pipeline, we move from a simple "request counter" to an intelligent, cost-aware gatekeeper that protects our expensive AI microservices, ensuring fair usage and preventing abuse. This foundational concept will be implemented in the subsequent sections using Upstash Redis and tRPC's powerful middleware system.

Basic Code Example

This example demonstrates a minimal, self-contained token bucket rate limiter using Upstash Redis. It's designed for a serverless environment (like Vercel Edge Functions) and integrates directly into a tRPC middleware pipeline. The goal is to protect a hypothetical AI endpoint (/api/trpc/ai.query) that performs expensive LLM transformations.

We will use the @upstash/redis library, which is ideal for serverless runtimes due to its HTTP-based nature. The logic is broken down into a reusable middleware function that checks the user's token balance before allowing the request to proceed to the expensive AI operation.

// File: src/server/ratelimit.ts
import { Redis } from "@upstash/redis";
import { TRPCError } from "@trpc/server";
import { middleware } from "./trpc"; // Assuming a central tRPC instance

// 1. CONFIGURATION
// We define a strict interface for our rate limit configuration.
// This prevents typos and makes the configuration self-documenting.
interface RateLimitConfig {
  /**
   * The maximum number of tokens the user can hold in their bucket.
   * This represents the burst capacity.
   * @example 10
   */
  maxTokens: number;
  /**
   * The number of tokens to add to the bucket per second (refill rate).
   * This is the sustained throughput.
   * @example 0.1 (1 token every 10 seconds)
   */
  refillRate: number;
  /**
   * The cost of a single request.
   * @example 1
   */
  tokenCost: number;
}

// 2. REDIS CLIENT INITIALIZATION
// In a real application, these would be environment variables.
// Declared at module scope, the client is reused across invocations of a warm
// serverless instance; being HTTP-based, it needs no persistent connection.
const redis = new Redis({
  url: process.env.UPSTASH_REDIS_REST_URL!,
  token: process.env.UPSTASH_REDIS_REST_TOKEN!,
});

// 3. THE CORE TOKEN BUCKET LOGIC
/**
 * Checks if a user has enough tokens for a request and deducts them if so.
 * This implements the core token bucket algorithm.
 *
 * @param userId - A unique identifier for the user (e.g., from JWT or API key).
 * @param config - The rate limit configuration for this specific endpoint.
 * @returns A promise that resolves to `true` if the request is allowed, `false` otherwise.
 */
async function checkTokenBucket(
  userId: string,
  config: RateLimitConfig
): Promise<boolean> {
  // We use a Redis key specific to the user and the endpoint to avoid collisions.
  // Format: `ratelimit:<endpointName>:<userId>`
  const key = `ratelimit:ai_endpoint:${userId}`;

  // We need to perform an atomic operation to avoid race conditions.
  // Redis Lua scripts are perfect for this. They run as a single transaction.
  const luaScript = `
    local key = KEYS[1]
    local max_tokens = tonumber(ARGV[1])
    local refill_rate = tonumber(ARGV[2])
    local token_cost = tonumber(ARGV[3])
    local now = tonumber(ARGV[4])

    -- Get current state from Redis
    local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')

    local current_tokens = tonumber(bucket[1])
    local last_refill = tonumber(bucket[2])

    -- Initialize if this is the first request
    if current_tokens == nil then
      current_tokens = max_tokens
      last_refill = now
    end

    -- Calculate time elapsed and refill tokens
    local time_elapsed = now - last_refill
    local tokens_to_add = time_elapsed * refill_rate
    current_tokens = math.min(max_tokens, current_tokens + tokens_to_add)

    -- Update the last refill time
    last_refill = now

    -- Check if user has enough tokens for the request
    if current_tokens >= token_cost then
      -- Deduct the cost and save the new state
      current_tokens = current_tokens - token_cost
      redis.call('HMSET', key, 'tokens', current_tokens, 'last_refill', last_refill)
      -- Set an expiry to automatically clean up old keys
      redis.call('EXPIRE', key, 86400) -- 24 hours
      return 1 -- Allowed
    else
      -- Not enough tokens, update the last_refill time anyway to be accurate for next check
      redis.call('HMSET', key, 'tokens', current_tokens, 'last_refill', last_refill)
      redis.call('EXPIRE', key, 86400)
      return 0 -- Denied
    end
  `;

  const now = Math.floor(Date.now() / 1000); // Current time in seconds

  // Execute the Lua script
  const result = await redis.eval(
    luaScript,
    [key],
    [config.maxTokens, config.refillRate, config.tokenCost, now]
  );

  // The script returns 1 for allowed, 0 for denied
  return result === 1;
}

// 4. INTEGRATION WITH tRPC MIDDLEWARE
/**
 * tRPC middleware that enforces the token bucket rate limit.
 * This middleware should be applied to any procedure that needs protection.
 */
export const rateLimitMiddleware = middleware(async ({ ctx, next, path }) => {
  // In a real app, `ctx.user` would come from an auth middleware
  const userId = ctx.user?.id || "anonymous";

  // Configuration for the AI endpoint. You could have different configs per route.
  const config: RateLimitConfig = {
    maxTokens: 10, // User can burst 10 requests
    refillRate: 0.1, // Refill 1 token every 10 seconds (6 per minute)
    tokenCost: 1, // Each AI query costs 1 token
  };

  const isAllowed = await checkTokenBucket(userId, config);

  if (!isAllowed) {
    // If denied, throw a specific TRPCError. The client can catch this.
    throw new TRPCError({
      code: "TOO_MANY_REQUESTS",
      message: "Rate limit exceeded. Please wait and try again.",
    });
  }

  // If allowed, proceed to the actual procedure logic.
  return next();
});

Line-by-Line Explanation

This section breaks down the code block by block, explaining the purpose and underlying mechanics of each part.

1. Configuration and Setup

  • interface RateLimitConfig: We start by defining a TypeScript interface. This is a best practice for type safety. It ensures that any configuration we pass to our rate limiter has the required properties (maxTokens, refillRate, tokenCost) with the correct types. This prevents runtime errors caused by typos or incorrect values.
  • Redis Client: We initialize the Upstash Redis client. In a serverless context, it's crucial to use an HTTP-based client like this, as it doesn't maintain a persistent connection pool, which aligns with the ephemeral nature of serverless functions. The url and token are typically stored in environment variables for security.

2. The Core Token Bucket Logic (checkTokenBucket)

This function is the heart of the rate limiter. It encapsulates the entire token bucket algorithm within a single, atomic Redis operation.

  • Lua Scripting: We use a Lua script for a critical reason: atomicity. If we were to execute these commands individually (get, calculate, set), another request could modify the data in between our reads and writes, leading to race conditions and inaccurate rate limiting. By sending the entire script to Redis, it executes as a single, uninterruptible transaction.
  • Key Structure: ratelimit:ai_endpoint:${userId}. This key naming convention is vital for scalability. It isolates rate limits per user and per endpoint, allowing you to have different limits for different API methods.
  • Script Breakdown:
    • Initialization: The script first checks if a bucket for the user exists. If not (current_tokens == nil), it initializes the bucket with the max_tokens capacity and sets the last_refill timestamp to the current time.
    • Refill Calculation: It calculates how much time has passed since the last request (time_elapsed) and multiplies it by the refillRate to determine how many new tokens to add. It then caps the total at max_tokens using math.min, enforcing the burst limit.
    • Deduction and Check: The script checks if the current_tokens are sufficient for the token_cost of the request.
    • Atomic Update: If there are enough tokens, it deducts the cost, updates the tokens and last_refill fields in the Redis hash, and sets a TTL (Time To Live) on the key to automatically clean up inactive users. This prevents Redis from being cluttered with keys for one-time visitors.
    • Return Value: The script returns 1 (true) if the request is allowed and 0 (false) otherwise. This simple integer result is efficient to transmit and parse.

3. tRPC Middleware Integration (rateLimitMiddleware)

Middleware is the perfect place to enforce cross-cutting concerns like rate limiting, authentication, and logging.

  • Context (ctx): The middleware receives a ctx object, which is shared across the request lifecycle. We extract a userId from it (e.g., from a JWT set by a previous authentication middleware). If no user is authenticated, we default to "anonymous", which allows you to apply a (stricter) rate limit to unauthenticated traffic.
  • Path-Specific Configuration: The middleware also receives the path of the tRPC procedure, which you could use to select a different rate limit configuration per route. This allows a very strict limit on a costly AI endpoint (/api/trpc/ai.query) alongside a more lenient limit on a simple data-fetching endpoint.
  • Error Handling: If checkTokenBucket returns false, we throw a TRPCError with the code TOO_MANY_REQUESTS. tRPC automatically translates this to the standard HTTP status code 429 for the client, allowing it to implement retry logic (e.g., an exponential backoff strategy).
  • next() Call: If the check passes, we call next(), which passes control to the next middleware in the chain, eventually reaching the procedure's resolver. This ensures the expensive AI transformation only runs for authorized, rate-limited requests.
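The exponential backoff retry mentioned above can be sketched generically on the client. The error shape here is deliberately simplified; a real tRPC client would inspect `error.data?.code` on a `TRPCClientError` rather than a bare `code` property.

```typescript
// Generic exponential-backoff retry for an operation that may be rate
// limited. Retries only on a rate-limit error, doubling the wait each time.
async function withBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const rateLimited = (err as { code?: string }).code === "TOO_MANY_REQUESTS";
      if (!rateLimited || attempt >= maxRetries) throw err;
      // Double the wait on each retry: base, 2x base, 4x base, ...
      const delayMs = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

Because the server refills tokens continuously, a client that backs off for a few seconds will usually find enough tokens in its bucket on the next attempt.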

The Flow of Logic

  1. A client sends a request to a protected tRPC procedure (e.g., ai.query).
  2. The rateLimitMiddleware is invoked first.
  3. It extracts the userId from the request context.
  4. It calls checkTokenBucket with the userId and a specific RateLimitConfig.
  5. checkTokenBucket constructs a Redis key and executes the Lua script atomically.
  6. The script calculates the current token balance, checks against the cost, and updates the bucket state in Redis.
  7. The script returns 1 (allowed) or 0 (denied).
  8. If denied, the middleware throws a TOO_MANY_REQUESTS error, and the request terminates with a 429 status code.
  9. If allowed, the middleware calls next(), and the request proceeds to the expensive AI logic.

The middleware executes `next()` to pass control to the subsequent route handler, where the expensive AI processing occurs.

Common Pitfalls

  1. Race Conditions in Client-Side Logic: A common mistake is to implement the rate limit logic on the client-side (e.g., in a React useEffect hook). This is fundamentally insecure. A malicious user can simply bypass your JavaScript and send unlimited requests directly to your API endpoint. The rate limiter must always live on the server.

  2. Vercel/AWS Lambda Timeouts: Serverless functions have strict execution time limits (e.g., 10 seconds on Vercel's Hobby plan). If your rate limiter's Redis client is slow to respond, it can eat into the time budget of your main AI function. The Upstash Redis client is generally very fast, but you should always wrap your Redis calls in Promise.race with a timeout to prevent a slow Redis instance from blocking your entire function.

    // Example of a timeout wrapper: the timeout rejects so Promise.race fails fast
    const timeout = (ms: number) =>
      new Promise<never>((_, reject) =>
        setTimeout(() => reject(new Error("Redis timeout")), ms)
      );
    try {
      await Promise.race([checkTokenBucket(userId, config), timeout(5000)]); // 5s timeout
    } catch (e) {
      // Handle timeout - maybe fail open or closed depending on your policy
    }
    

  3. Incorrect Refill Rate Calculation: The refill rate is often misunderstood. If you want to allow 60 requests per minute, your refillRate should be 1 (one token per second), and maxTokens could be 60 (allowing a full minute's worth of requests in a burst). A common bug is swapping the two — setting refillRate to 60 and maxTokens to 1 — which caps the bucket at a single token, permitting at most one request per refill tick and eliminating bursts entirely: the opposite of the intended behavior.
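The arithmetic from this pitfall, as a quick sanity check (the 60-requests-per-minute target is the example from the text):

```typescript
// Target: 60 requests per minute, each costing 1 token.
const requestsPerMinute = 60;
const tokenCost = 1;

// Sustained limit: the refill must supply 60 tokens over 60 seconds.
const refillRate = (requestsPerMinute * tokenCost) / 60; // = 1 token/second

// Burst limit: a full minute's worth of requests available at once.
const maxTokens = requestsPerMinute * tokenCost; // = 60

// Back-derive the sustained throughput the bucket actually permits.
const sustainedPerMinute = (refillRate * 60) / tokenCost; // = 60
```

If the back-derived throughput doesn't match your intended limit, the refill rate and bucket size have likely been swapped.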

  4. Stateless vs. Stateful Middleware: In some edge function runtimes, you might be tempted to store the token bucket in a global variable. This is a critical error. Serverless instances are ephemeral and can be spun up or down at any moment. A user's next request might hit a different instance with no knowledge of the previous state. Always use a centralized, external data store like Redis.

  5. Not Handling Anonymous Users: Failing to rate limit unauthenticated requests ("anonymous") leaves your API vulnerable to DDoS attacks from bots that don't bother with authentication. Always apply a (potentially stricter) rate limit to all incoming requests, regardless of authentication status.
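One way to apply that stricter anonymous limit is to choose both the bucket configuration and the Redis key based on authentication status. This is a hypothetical sketch: the key prefixes, the per-IP strategy, and the specific numbers are illustrative choices, not prescribed by any library.

```typescript
// Hypothetical sketch: pick a stricter bucket and a different Redis key
// for unauthenticated traffic, keyed by client IP instead of user ID.
interface Limit {
  maxTokens: number;
  refillRate: number; // tokens per second
}

function limitAndKeyFor(
  userId: string | null,
  clientIp: string,
): { key: string; limit: Limit } {
  if (userId) {
    // Authenticated users: the chapter's standard AI-endpoint budget.
    return { key: `ratelimit:user:${userId}`, limit: { maxTokens: 10, refillRate: 0.1 } };
  }
  // Anonymous traffic: a much smaller per-IP bucket.
  return { key: `ratelimit:ip:${clientIp}`, limit: { maxTokens: 3, refillRate: 0.02 } };
}
```

Keying anonymous traffic by IP is imperfect (shared NATs, rotating proxies), but it ensures bots that skip authentication still hit a hard ceiling.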

The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.





Code License: All code examples are released under the MIT License. GitHub repo.

Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.
