Chapter 4: Rate Limiting AI Endpoints (Token Buckets)
Theoretical Foundations
In the previous chapters, we established that modern AI-powered applications are built on a foundation of Agents. We conceptualized these Agents as sophisticated Microservices, where each service is responsible for a discrete, high-value task: retrieving a document, parsing a query, or synthesizing a response. We also explored the Query Vector, the mathematical fingerprint of a user's intent, which allows these microservices to perform semantic search rather than simple keyword matching.
However, there is a critical economic and technical constraint that governs all distributed systems, especially those involving Large Language Models (LLMs): Resource Scarcity. The "Edge-First Deployment Strategy" pushes computation closer to the user to reduce latency, but the final, most expensive computation—the LLM inference—often resides in centralized cloud infrastructure or is metered by third-party API providers (like OpenAI or Anthropic). These providers charge not just per request, but per token (the atomic unit of text processed by the model). A single complex query can consume thousands of tokens, and a malicious actor or a buggy client can generate thousands of requests, leading to astronomical costs and service degradation.
To manage this, we implement Rate Limiting. But standard rate limiting (e.g., "100 requests per minute") is insufficient for AI endpoints because requests are not equal. A request to summarize a short paragraph is cheap; a request to generate a 2,000-word essay is expensive. We need a system that understands the cost of work, not just the volume of requests.
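To make "cost of work" concrete, a limiter can price each request before admitting it. The heuristic below is purely illustrative (the scaling numbers are assumptions, not provider pricing):

```typescript
// Illustrative heuristic only: price a request in "rate-limit tokens"
// based on how much LLM output it is expected to generate. A real system
// would use the provider's actual token counts.
function requestCost(expectedOutputWords: number): number {
  const base = 1; // every request costs at least one token
  const perHundredWords = 1; // heavier generations cost proportionally more
  return base + Math.floor(expectedOutputWords / 100) * perHundredWords;
}
```

Under this scheme, summarizing a short paragraph costs 1 token while a 2,000-word essay costs 21, so the limiter throttles heavy users sooner even when their request counts are identical.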
This brings us to the Token Bucket Algorithm.
The Token Bucket: A Water Reservoir Analogy
Imagine every user of your application has a personal water reservoir. This reservoir has two key properties:
- Capacity (Bucket Size): The maximum volume of water the reservoir can hold. This represents the user's burst allowance. A user can accumulate tokens over time up to this limit, allowing them to make a sudden burst of expensive requests without being blocked.
- Refill Rate (Leak Rate): The speed at which water flows into the reservoir. This represents the user's sustained rate limit. It is measured in tokens per second (or per minute).
When a user makes a request to an AI endpoint (e.g., generateReport), we attempt to draw a specific amount of water (tokens) from the reservoir.
- Sufficient Water: If the reservoir has enough water, the request is processed, and the water is removed. The user's "budget" decreases.
- Insufficient Water: If the reservoir is empty (or doesn't have enough water for the requested draw), the request is rejected with a `429 Too Many Requests` error. The user must wait for the reservoir to refill.
This model elegantly solves the "bursty" nature of AI traffic. A user might not use the app for an hour, accumulating a full bucket of tokens. When they return, they can perform several expensive operations in quick succession before hitting the sustained rate limit. Conversely, a user spamming the API will drain their bucket quickly and be forced to wait, protecting the system from abuse.
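The reservoir mechanics can be sketched as a small in-memory class. This is illustrative only; as discussed later in the chapter, production state must live in a shared store like Redis, not in process memory:

```typescript
// Minimal in-memory token bucket (illustrative only; production state
// belongs in a shared store such as Redis).
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private readonly capacity: number, // burst allowance (bucket size)
    private readonly refillRate: number, // tokens added per second
    private readonly now: () => number = () => Date.now() / 1000
  ) {
    this.tokens = capacity; // start with a full reservoir
    this.lastRefill = this.now();
  }

  // Try to draw `cost` tokens; returns true if the request is allowed.
  tryConsume(cost: number): boolean {
    const t = this.now();
    // Refill according to elapsed time, capped at capacity.
    this.tokens = Math.min(
      this.capacity,
      this.tokens + (t - this.lastRefill) * this.refillRate
    );
    this.lastRefill = t;
    if (this.tokens >= cost) {
      this.tokens -= cost;
      return true;
    }
    return false;
  }
}
```

With a capacity of 10 and a refill rate of 1 token per second, a returning user can burst 10 requests at once, then sustain one request per second thereafter.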
The Web Development Analogy: API Gateway vs. Edge Middleware
In traditional web development, rate limiting is often handled at the API Gateway level. This is like a bouncer at the front door of a nightclub. Every person (request) is counted as they enter. If the count exceeds a threshold, the bouncer stops letting people in. This is simple and effective for standard HTTP endpoints.
However, in an Edge-First architecture using tRPC, we are not just counting requests at the door; we are inspecting the intent and cost of each request inside the application logic, right at the edge.
Think of tRPC procedures as Microservices (as discussed in Book 6). Each procedure is a specialized endpoint. The generateReport procedure is a microservice that performs a heavy, expensive LLM call. The searchVector procedure is a microservice that performs a lighter, faster vector search.
Placing a rate limiter at the API Gateway is like having a bouncer who only counts people but doesn't know what they plan to do inside. One person might just order a water (cheap), while another might order a bottle of champagne (expensive). The bouncer treats them equally.
By integrating the Token Bucket algorithm directly into the tRPC middleware pipeline, we move the "bouncer" inside each microservice. We can now apply different rate limits based on the type of work the procedure performs. The searchVector procedure might have a generous bucket (e.g., 1,000 tokens per minute), while the generateReport procedure has a strict bucket (e.g., 10 tokens per minute). This is intelligent rate limiting.
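Per-procedure limits like these can be expressed as a configuration map keyed by procedure name. The names and numbers mirror the examples above; the default fallback is an assumption of ours:

```typescript
// Hypothetical per-procedure bucket configuration: cheap procedures get
// generous buckets, expensive LLM procedures get strict ones.
interface BucketConfig {
  maxTokens: number; // burst capacity
  refillPerMinute: number; // sustained rate
}

const limits: Record<string, BucketConfig> = {
  searchVector: { maxTokens: 1000, refillPerMinute: 1000 }, // light vector search
  generateReport: { maxTokens: 10, refillPerMinute: 10 }, // heavy LLM call
};

// Resolve the bucket for a tRPC procedure path, with a default fallback
// for procedures that have no explicit entry.
function configFor(path: string): BucketConfig {
  return limits[path] ?? { maxTokens: 60, refillPerMinute: 60 };
}
```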
Under the Hood: Scalability and State
The challenge with any stateful rate limiter is state management. Where do we store the current water level (token count) and the last refill time for millions of users?
- In-Memory (e.g., a Node.js `Map`): Fast but not scalable. In a serverless or edge function environment, instances are ephemeral. A user's next request might hit a different server instance, losing their token count. This is unsuitable for a global application.
- Database (e.g., PostgreSQL): Persistent but slow. The round-trip time to query a database on every single request would add significant latency, defeating the purpose of an Edge-First strategy.
- Distributed Cache (e.g., Redis): The ideal solution. Redis is an in-memory data store that is incredibly fast (microsecond latency) and can be distributed globally. It acts as the shared "reservoir" that all edge function instances can access simultaneously.
This is why we use Upstash Redis. It provides a serverless Redis database that can be deployed at the edge, ensuring low-latency access to token counts from anywhere in the world. The token bucket state is stored in Redis, making it persistent and consistent across all edge function invocations.
Visualizing the Token Bucket Flow
The following diagram illustrates how a request flows through the tRPC middleware and the token bucket check.
The Role of Redis in the Edge-First Strategy
In the context of our architecture, the Edge-First Deployment Strategy dictates that we minimize latency by performing computations as close to the user as possible. The token bucket check is a lightweight computation, but it requires state. By using Upstash Redis, we bridge the gap between stateless edge functions and stateful rate limiting.
- Atomic Operations: Redis provides atomic operations like `INCRBY` and `EXPIRE`, and Lua scripts for multi-step transactions. We can update a user's token count and set a time-to-live (TTL) on the key in a single atomic operation, so token counts stay accurate under high concurrency and race conditions are avoided.
- Global Consistency: A user in Tokyo and a user in New York both interact with the same Redis instance (or a globally replicated one). This means a user cannot bypass rate limits by switching regions or making concurrent requests from different locations.
- Cost Control: By enforcing limits at the edge, we prevent expensive LLM calls from ever being executed. This is crucial for cost control. It is far cheaper to reject a request with a `429` error than to process it and discover it exceeds a budget.
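When rejecting a request, it is also good practice to tell the client how long to wait. A small helper of our own (not part of the chapter's later example) derives a Retry-After hint from the bucket's deficit:

```typescript
// Derive a Retry-After hint (in whole seconds) from the bucket's deficit:
// how long until enough tokens have refilled to afford the request.
function retryAfterSeconds(
  currentTokens: number,
  cost: number,
  refillRate: number // tokens per second
): number {
  const deficit = Math.max(0, cost - currentTokens);
  return Math.ceil(deficit / refillRate);
}
```

For example, an empty bucket facing a cost-1 request at a refill rate of 0.1 tokens/second should be retried in 10 seconds.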
Integrating with tRPC Middleware
In tRPC, middleware is a function that runs before the main procedure resolver. It has access to the context (like user information) and can modify the request flow. By placing our token bucket logic here, we create a reusable, declarative way to protect our AI endpoints.
The flow is as follows:
- The client calls a tRPC procedure (e.g., `ai.generateReport`).
- The tRPC router invokes the middleware chain.
- The rate-limiting middleware extracts the user's ID from the context (e.g., from a JWT).
- It constructs a unique key for the user and the specific procedure (e.g., `rate_limit:user_123:generateReport`).
- It queries Redis for the current token count and the last refill timestamp.
- It calculates the new token count based on the time elapsed since the last request.
- It checks whether the new count is greater than or equal to the cost of the current request.
- If yes, it decrements the token count in Redis and calls `next()` to proceed to the procedure resolver.
- If no, it throws a tRPC error with code `TOO_MANY_REQUESTS`, which is automatically converted to an HTTP 429 response.
This approach ensures that the rate-limiting logic is completely decoupled from the business logic of the AI procedures. We can apply the same middleware to any procedure, simply by configuring the bucket size and refill rate.
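The decoupling described above can be sketched without any tRPC-specific API: a generic wrapper that guards an arbitrary async handler. The names and shape here are illustrative, not the actual tRPC middleware signature:

```typescript
// Generic sketch (not the tRPC API): wrap any async handler with a
// cost-aware guard, keeping business logic unaware of rate limiting.
type Handler<I, O> = (input: I) => Promise<O>;

function withRateLimit<I, O>(
  tryConsume: (cost: number) => boolean, // e.g., backed by a token bucket
  cost: number,
  handler: Handler<I, O>
): Handler<I, O> {
  return async (input: I) => {
    if (!tryConsume(cost)) {
      // tRPC would surface this as TOO_MANY_REQUESTS / HTTP 429.
      throw new Error("TOO_MANY_REQUESTS");
    }
    return handler(input);
  };
}
```

The handler itself never mentions rate limiting; the same wrapper can protect any procedure by varying only the cost and the bucket behind `tryConsume`.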
Summary
The Token Bucket algorithm is not just a technical implementation detail; it is a fundamental economic model for managing scarce computational resources in a distributed, Edge-First architecture. By comparing it to a water reservoir, we can intuitively understand how it handles both bursty traffic and sustained usage. By integrating it directly into the tRPC middleware pipeline, we move from a simple "request counter" to an intelligent, cost-aware gatekeeper that protects our expensive AI microservices, ensuring fair usage and preventing abuse. This foundational concept will be implemented in the subsequent sections using Upstash Redis and tRPC's powerful middleware system.
Basic Code Example
This example demonstrates a minimal, self-contained token bucket rate limiter using Upstash Redis. It's designed for a serverless environment (like Vercel Edge Functions) and integrates directly into a tRPC middleware pipeline. The goal is to protect a hypothetical AI endpoint (/api/trpc/ai.query) that performs expensive LLM transformations.
We will use the @upstash/redis library, which is ideal for serverless runtimes due to its HTTP-based nature. The logic is broken down into a reusable middleware function that checks the user's token balance before allowing the request to proceed to the expensive AI operation.
// File: src/server/ratelimit.ts
import { Redis } from "@upstash/redis";
import { TRPCError } from "@trpc/server";
import { middleware } from "./trpc"; // Assuming a central tRPC instance
// 1. CONFIGURATION
// We define a strict interface for our rate limit configuration.
// This prevents typos and makes the configuration self-documenting.
interface RateLimitConfig {
/**
* The maximum number of tokens the user can hold in their bucket.
* This represents the burst capacity.
* @example 10
*/
maxTokens: number;
/**
* The number of tokens to add to the bucket per second (refill rate).
* This is the sustained throughput.
* @example 0.1 (1 token every 10 seconds)
*/
refillRate: number;
/**
* The cost of a single request.
* @example 1
*/
tokenCost: number;
}
// 2. REDIS CLIENT INITIALIZATION
// In a real application, these would be environment variables.
// For serverless, we create the client once per function invocation (or use a singleton).
const redis = new Redis({
url: process.env.UPSTASH_REDIS_REST_URL!,
token: process.env.UPSTASH_REDIS_REST_TOKEN!,
});
// 3. THE CORE TOKEN BUCKET LOGIC
/**
* Checks if a user has enough tokens for a request and deducts them if so.
* This implements the core token bucket algorithm.
*
* @param userId - A unique identifier for the user (e.g., from JWT or API key).
* @param config - The rate limit configuration for this specific endpoint.
* @returns A promise that resolves to `true` if the request is allowed, `false` otherwise.
*/
async function checkTokenBucket(
userId: string,
config: RateLimitConfig
): Promise<boolean> {
  // We use a Redis key specific to the endpoint and the user to avoid collisions.
  // Format: `ratelimit:<endpointName>:<userId>`
  const key = `ratelimit:ai_endpoint:${userId}`;
// We need to perform an atomic operation to avoid race conditions.
// Redis Lua scripts are perfect for this. They run as a single transaction.
const luaScript = `
local key = KEYS[1]
local max_tokens = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local token_cost = tonumber(ARGV[3])
local now = tonumber(ARGV[4])
-- Get current state from Redis
local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local current_tokens = tonumber(bucket[1])
local last_refill = tonumber(bucket[2])
-- Initialize if this is the first request
if current_tokens == nil then
current_tokens = max_tokens
last_refill = now
end
-- Calculate time elapsed and refill tokens
local time_elapsed = now - last_refill
local tokens_to_add = time_elapsed * refill_rate
current_tokens = math.min(max_tokens, current_tokens + tokens_to_add)
-- Update the last refill time
last_refill = now
-- Check if user has enough tokens for the request
if current_tokens >= token_cost then
-- Deduct the cost and save the new state
current_tokens = current_tokens - token_cost
redis.call('HMSET', key, 'tokens', current_tokens, 'last_refill', last_refill)
-- Set an expiry to automatically clean up old keys
redis.call('EXPIRE', key, 86400) -- 24 hours
return 1 -- Allowed
else
-- Not enough tokens, update the last_refill time anyway to be accurate for next check
redis.call('HMSET', key, 'tokens', current_tokens, 'last_refill', last_refill)
redis.call('EXPIRE', key, 86400)
return 0 -- Denied
end
`;
const now = Math.floor(Date.now() / 1000); // Current time in seconds
// Execute the Lua script
const result = await redis.eval(
luaScript,
[key],
[config.maxTokens, config.refillRate, config.tokenCost, now]
);
// The script returns 1 for allowed, 0 for denied
return result === 1;
}
// 4. INTEGRATION WITH tRPC MIDDLEWARE
/**
* tRPC middleware that enforces the token bucket rate limit.
* This middleware should be applied to any procedure that needs protection.
*/
export const rateLimitMiddleware = middleware(async ({ ctx, next, path }) => {
// In a real app, `ctx.user` would come from an auth middleware
const userId = ctx.user?.id || "anonymous";
// Configuration for the AI endpoint. You could have different configs per route.
const config: RateLimitConfig = {
maxTokens: 10, // User can burst 10 requests
refillRate: 0.1, // Refill 1 token every 10 seconds (6 per minute)
tokenCost: 1, // Each AI query costs 1 token
};
const isAllowed = await checkTokenBucket(userId, config);
if (!isAllowed) {
// If denied, throw a specific TRPCError. The client can catch this.
throw new TRPCError({
code: "TOO_MANY_REQUESTS",
message: "Rate limit exceeded. Please wait and try again.",
});
}
// If allowed, proceed to the actual procedure logic.
return next();
});
Line-by-Line Explanation
This section breaks down the code block by block, explaining the purpose and underlying mechanics of each part.
1. Configuration and Setup
- `interface RateLimitConfig`: We start by defining a TypeScript interface. This is a best practice for type safety. It ensures that any configuration we pass to our rate limiter has the required properties (`maxTokens`, `refillRate`, `tokenCost`) with the correct types, preventing runtime errors caused by typos or incorrect values.
- Redis client: We initialize the Upstash Redis client. In a serverless context, it's crucial to use an HTTP-based client like this one, as it doesn't maintain a persistent connection pool, which aligns with the ephemeral nature of serverless functions. The `url` and `token` are typically stored in environment variables for security.
2. The Core Token Bucket Logic (checkTokenBucket)
This function is the heart of the rate limiter. It encapsulates the entire token bucket algorithm within a single, atomic Redis operation.
- Lua Scripting: We use a Lua script for a critical reason: atomicity. If we were to execute these commands individually (get, calculate, set), another request could modify the data in between our reads and writes, leading to race conditions and inaccurate rate limiting. By sending the entire script to Redis, it executes as a single, uninterruptible transaction.
- Key Structure: `ratelimit:ai_endpoint:${userId}`. This key naming convention is vital for scalability. It isolates rate limits per user and per endpoint, allowing you to have different limits for different API methods.
- Script Breakdown:
  - Initialization: The script first checks whether a bucket for the user exists. If not (`current_tokens == nil`), it initializes the bucket at `max_tokens` capacity and sets the `last_refill` timestamp to the current time.
  - Refill Calculation: It calculates how much time has passed since the last request (`time_elapsed`) and multiplies it by the `refill_rate` to determine how many new tokens to add. It then caps the total at `max_tokens` using `math.min`, enforcing the burst limit.
  - Deduction and Check: The script checks whether `current_tokens` is sufficient for the `token_cost` of the request.
  - Atomic Update: If there are enough tokens, it deducts the cost, updates the `tokens` and `last_refill` fields in the Redis hash, and sets a TTL (Time To Live) on the key to automatically clean up inactive users. This prevents Redis from being cluttered with keys for one-time visitors.
  - Return Value: The script returns `1` (true) if the request is allowed and `0` (false) otherwise. This simple integer result is efficient to transmit and parse.
3. tRPC Middleware Integration (rateLimitMiddleware)
Middleware is the perfect place to enforce cross-cutting concerns like rate limiting, authentication, and logging.
- Context (`ctx`): The middleware receives a `ctx` object, which is shared across the request lifecycle. We extract a `userId` from it (e.g., from a JWT set by a previous authentication middleware). If no user is authenticated, we default to `"anonymous"`, which allows you to apply a (stricter) rate limit to unauthenticated traffic.
- Path-Specific Configuration: The middleware also receives the `path` of the tRPC procedure, which you could use to select a different configuration per route. This allows a very strict limit on a costly AI endpoint (`/api/trpc/ai.query`) while keeping a more lenient limit on a simple data-fetching endpoint.
- Error Handling: If `checkTokenBucket` returns `false`, we throw a `TRPCError` with the code `TOO_MANY_REQUESTS`. tRPC automatically translates this to the standard HTTP status code 429 for the client, allowing it to implement retry logic (e.g., an exponential backoff strategy).
- `next()` Call: If the check passes, we call `next()`, which passes control to the next middleware in the chain, eventually reaching the procedure's resolver. This ensures the expensive AI transformation only runs for authorized, rate-limited requests.
The Flow of Logic
- A client sends a request to a protected tRPC procedure (e.g., `ai.query`).
- The `rateLimitMiddleware` is invoked first.
- It extracts the `userId` from the request context.
- It calls `checkTokenBucket` with the `userId` and a specific `RateLimitConfig`.
- `checkTokenBucket` constructs a Redis key and executes the Lua script atomically.
- The script calculates the current token balance, checks it against the cost, and updates the bucket state in Redis.
- The script returns `1` (allowed) or `0` (denied).
- If denied, the middleware throws a `TOO_MANY_REQUESTS` error, and the request terminates with a 429 status code.
- If allowed, the middleware calls `next()`, and the request proceeds to the expensive AI logic.
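On the client side, a 429 is a signal to back off and retry. A minimal exponential-backoff wrapper might look like this (a generic sketch of our own; a real client would retry only on `TOO_MANY_REQUESTS` and honor any Retry-After hint from the server):

```typescript
// Client-side companion sketch: retry a failed call with exponential
// backoff, doubling the delay between attempts.
async function withBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 500,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise((resolve) => setTimeout(resolve, ms))
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err; // out of retries, give up
      await sleep(baseDelayMs * 2 ** attempt); // 500ms, 1s, 2s, ...
    }
  }
}
```

Injecting `sleep` keeps the helper testable; production code simply uses the default timer.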
Common Pitfalls
- Race Conditions in Client-Side Logic: A common mistake is to implement the rate limit logic on the client side (e.g., in a React `useEffect` hook). This is fundamentally insecure: a malicious user can simply bypass your JavaScript and send unlimited requests directly to your API endpoint. The rate limiter must always live on the server.
- Vercel/AWS Lambda Timeouts: Serverless functions have strict execution time limits (e.g., 10 seconds on Vercel's Hobby plan). If your rate limiter's Redis client is slow to respond, it eats into the time budget of your main AI function. The Upstash Redis client is generally very fast, but you should wrap your Redis calls in `Promise.race` with a timeout to prevent a slow Redis instance from blocking your entire function.
- Incorrect Refill Rate Calculation: The refill rate is often misunderstood. If you want to allow 60 requests per minute, your `refillRate` should be `1` (one token per second), and `maxTokens` could be `60` (allowing a full minute's worth of requests in a burst). A common bug is swapping the two values (`refillRate` of `60`, `maxTokens` of `1`), which eliminates bursting entirely and distorts the sustained rate, since 60 tokens per second is 3,600 per minute, the opposite of the intended behavior.
- Stateless vs. Stateful Middleware: In some edge function runtimes, you might be tempted to store the token bucket in a global variable. This is a critical error. Serverless instances are ephemeral and can be spun up or torn down at any moment; a user's next request might hit a different instance with no knowledge of the previous state. Always use a centralized, external data store like Redis.
- Not Handling Anonymous Users: Failing to rate limit unauthenticated (`"anonymous"`) requests leaves your API vulnerable to denial-of-service abuse from bots that don't bother with authentication. Always apply a (potentially stricter) rate limit to all incoming requests, regardless of authentication status.
The chapter continues with advanced code, exercises and solutions with analysis, you can find them on the ebook on Leanpub.com or Amazon
Code License: All code examples are released under the MIT License. Github repo.
Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.
All textual explanations, original diagrams, and illustrations are the intellectual property of the author. To support the maintenance of this site via AdSense, please read this content exclusively online. Copying, redistribution, or reproduction is strictly prohibited.