Chapter 17: Custom Domains & SSL

Theoretical Foundations

Imagine your multi-tenant SaaS application as a massive, modern office building. Each tenant is a company that has leased a floor in this building. However, they don't want to use a generic address like "Building 1, Floor 5." They want their own prestigious address: "Acme Corp Headquarters, 123 Innovation Way." Your job is to make sure that when a visitor (a user) types "acme.com" into their browser, they are seamlessly and securely guided to the correct floor, the correct office suite, and the correct desk, all without ever knowing the complex routing happening behind the scenes.

This is the fundamental role of the Custom Domains & SSL layer in your SaaS boilerplate. It acts as the building's master switchboard operator. This operator doesn't just connect a call; they verify the caller's identity, ensure the line is secure, and route the call to the exact extension needed.

Let's break this down with a more technical analogy. Think of your entire infrastructure as a Distributed Hash Table (DHT), a concept you'll recall from our deep dive into vector databases in Chapter 12. In a DHT, data is sharded across many nodes, and a consistent hashing algorithm determines which node is responsible for which piece of data. Our custom domain routing system is a DHT for HTTP requests.

The Key: The incoming request's Host header (e.g., acme.your-saas.com or www.acme.com).
The Value (or rather, the Pointer): The specific backend container, database instance, or application service that handles requests for that tenant.
The Hashing Algorithm: The routing logic within our reverse proxy.

Just as a vector database uses embeddings to map complex data to a point in a high-dimensional space for efficient retrieval, our reverse proxy maps a domain name to a specific, isolated execution environment. This mapping is not just a simple lookup; it's a dynamic, secure, and automated process that forms the backbone of tenant isolation and branding.

The "Why": Beyond Simple DNS

Why can't we just point a DNS A record to a single server IP and be done with it? This is the approach of a single-tenant application, and it fails spectacularly in a multi-tenant SaaS environment for several critical reasons:

Tenant Isolation and Security: A primary promise of SaaS is that one customer's data and operations are completely isolated from another's. If all tenants are served by a single application instance, a bug or a malicious exploit could potentially leak data between tenants. By routing acme.com to Container A (with Database A) and beta.com to Container B (with Database B), we create hard boundaries. This is the architectural equivalent of giving each tenant a physical key to their own floor, not just a key to the main building entrance.
Branding and Professionalism: For B2B SaaS, allowing customers to use their own domain (e.g., portal.acme.com) is a non-negotiable feature. It reinforces their brand identity and provides a seamless experience for their users. It signals trust and professionalism. A your-saas.com/customer-acme URL feels like a generic, off-the-shelf solution; a portal.acme.com feels like a bespoke, integrated tool.
Scalability and Resource Management: In a containerized environment (like Docker or Kubernetes), each tenant's application might run in its own container or pod. The reverse proxy is the load balancer and traffic cop that directs traffic to these ephemeral containers. As tenants are onboarded or offboarded, containers are created and destroyed. The routing table in the proxy must be updated dynamically, without any downtime or manual intervention. This is a level of agility that a static DNS record could never provide.
Centralized SSL/TLS Termination: Security is paramount. All traffic must be encrypted in transit (HTTPS). Managing individual SSL certificates for hundreds or thousands of custom domains manually is an operational nightmare. We need a centralized system that can automatically provision, renew, and manage certificates for any domain a tenant brings. This is the "SSL termination" part of the equation—the proxy handles the cryptographic handshake, decrypts the traffic, and forwards it securely over an internal network to the correct backend service.

The "How": The Three Pillars of the Reverse Proxy

Our reverse proxy (using a tool like Traefik or Caddy) is not just a simple router; it's an intelligent gatekeeper built on three pillars that work in concert.

Pillar 1: Dynamic Configuration & Service Discovery

In a static web server like traditional Nginx, you'd have to manually edit a configuration file every time you add a new tenant domain and then reload the server. This is brittle and doesn't scale. Our system must be dynamic.

The proxy needs to be aware of our application's state. When a new tenant signs up and configures their custom domain, this information is stored in a central database (e.g., a tenants table). The proxy needs a way to "discover" this change and update its routing rules automatically.

This is where the concept of an Entry Point Node from our LangGraph workflows becomes relevant. Think of the tenant onboarding process as a workflow. The final step of this workflow isn't just "provision database." It's "publish routing configuration." This action updates the source that our reverse proxy's configuration provider reads (such as a Consul catalog, a Docker label, or a Kubernetes Ingress resource), signaling that a new mapping is required. The proxy, which is constantly watching this source of truth, updates its internal routing table as soon as it sees the change. This is analogous to how an agent in a LangGraph workflow transitions from one node to the next based on a defined state; the proxy transitions its routing state based on the application's state.
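The publish-and-watch pattern described above can be sketched in a few lines. This is a minimal in-process illustration under stated assumptions, not how Traefik or Caddy implement their providers: `RouteStore`, `ProxyRoutingTable`, and the `Route` shape are hypothetical names invented for this sketch, and a real provider would sit on top of a database, Consul, or the Kubernetes API rather than an in-memory listener list.

```typescript
// Hypothetical route shape: one custom domain mapped to one internal backend.
interface Route {
  domain: string;
  backend: string;
}

type RouteEvent =
  | { type: "upsert"; route: Route }
  | { type: "remove"; domain: string };

type Listener = (event: RouteEvent) => void;

// Stand-in for the configuration provider (database, Consul, Docker labels, ...).
class RouteStore {
  private listeners: Listener[] = [];

  // The proxy registers a listener; the returned function unsubscribes it.
  watch(listener: Listener): () => void {
    this.listeners.push(listener);
    return () => {
      this.listeners = this.listeners.filter((l) => l !== listener);
    };
  }

  // Called as the last onboarding step ("publish routing configuration").
  publish(event: RouteEvent): void {
    for (const listener of this.listeners) listener(event);
  }
}

// Stand-in for the proxy's in-memory routing table.
class ProxyRoutingTable {
  private routes = new Map<string, string>();

  constructor(store: RouteStore) {
    // Apply every change as it is published; no restart or reload is involved.
    store.watch((event) => {
      if (event.type === "upsert") {
        this.routes.set(event.route.domain, event.route.backend);
      } else {
        this.routes.delete(event.domain);
      }
    });
  }

  resolve(domain: string): string | undefined {
    return this.routes.get(domain);
  }
}

// Usage: onboarding publishes a route, and the table can resolve it right away.
const store = new RouteStore();
const table = new ProxyRoutingTable(store);
store.publish({
  type: "upsert",
  route: { domain: "portal.acme.com", backend: "http://tenant-acme-app:8080" },
});
```

The design point is that the onboarding workflow never talks to the proxy directly. It only changes the source of truth, and the proxy reacts to that change.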

Pillar 2: Automated SSL Certificate Management (ACME)

This is the magic of modern web infrastructure. The Automated Certificate Management Environment (ACME) protocol, created for the Let's Encrypt project and standardized as RFC 8555, allows for fully automated certificate issuance and renewal.

Here’s the under-the-hood process:

Challenge: When a tenant adds portal.acme.com, the proxy detects this new domain. It communicates with the Let's Encrypt servers via the ACME protocol. To prove it controls the domain, Let's Encrypt issues a challenge. The most common is the HTTP-01 challenge, where Let's Encrypt hands the proxy a token and then requests http://portal.acme.com/.well-known/acme-challenge/<token> on port 80, expecting a response derived from that token and the proxy's ACME account key. This only succeeds once the tenant's DNS already points portal.acme.com at the proxy.
Response: The proxy, being a web server, can easily serve this response. It answers the challenge, proving control of the domain.
Issuance: Upon successful validation, Let's Encrypt issues a trusted SSL certificate for portal.acme.com.
Storage & Renewal: The proxy stores this certificate securely (often in memory or a mounted volume) and configures itself to use it for all incoming HTTPS connections to that domain. Crucially, it sets a timer. Long before the certificate expires (Let's Encrypt certificates are valid for 90 days), the proxy will automatically repeat this entire process in the background to renew it.
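The challenge and response steps above boil down to a small path handler that runs before normal routing. The sketch below is illustrative only: in practice the proxy's built-in ACME client does this for you, and `pendingChallenges`, `handleAcmeChallenge`, and the example token values are hypothetical. The path prefix is the one defined for HTTP-01 challenges.

```typescript
// Path prefix used by the ACME HTTP-01 challenge (RFC 8555).
const ACME_PREFIX = "/.well-known/acme-challenge/";

// Pending challenges: token -> key authorization (the body the CA expects).
// A real ACME client fills this in while an order is in progress.
const pendingChallenges = new Map<string, string>();

interface ChallengeResponse {
  status: number;
  body: string;
}

// Returns a response for ACME challenge paths, or null so that normal
// tenant routing can handle the request.
function handleAcmeChallenge(path: string): ChallengeResponse | null {
  if (!path.startsWith(ACME_PREFIX)) return null;

  const token = path.slice(ACME_PREFIX.length);
  const keyAuthorization = pendingChallenges.get(token);
  if (keyAuthorization === undefined) {
    return { status: 404, body: "unknown challenge" };
  }
  return { status: 200, body: keyAuthorization };
}

// Made-up values; a real key authorization has the form
// "<token>.<thumbprint of the ACME account key>".
pendingChallenges.set("example-token", "example-token.example-thumbprint");

const challengeHit = handleAcmeChallenge(ACME_PREFIX + "example-token");
const normalRequest = handleAcmeChallenge("/dashboard");
```

Because the handler returns null for every other path, it can sit in front of the tenant router without affecting ordinary traffic.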

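Once configured, this process is hands-off. You, the SaaS provider, never have to manually handle certificate files, track expiration dates, or write renewal scripts. It is not failure-proof, though: issuance and renewal can still fail (for example, when a tenant's DNS no longer points at the proxy, or when the CA's rate limits are hit), so monitor certificate expiry rather than assuming uninterrupted HTTPS.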
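The renewal timer mentioned above can be reduced to a single comparison. The sketch below assumes a policy of renewing once a third of the certificate's lifetime remains, which works out to 30 days before expiry for a 90-day certificate; that threshold is an assumption for illustration, and `CertificateInfo` and `shouldRenew` are hypothetical names. Real proxies apply their own schedules.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

interface CertificateInfo {
  domain: string;
  notBefore: Date; // start of validity
  notAfter: Date;  // end of validity
}

// Renew once no more than `remainingFraction` of the total lifetime is left.
function shouldRenew(
  cert: CertificateInfo,
  now: Date,
  remainingFraction: number = 1 / 3,
): boolean {
  const lifetimeMs = cert.notAfter.getTime() - cert.notBefore.getTime();
  const remainingMs = cert.notAfter.getTime() - now.getTime();
  return remainingMs <= lifetimeMs * remainingFraction;
}

// A 90-day certificate issued on an arbitrary example date.
const issued = new Date("2024-01-01T00:00:00Z");
const cert: CertificateInfo = {
  domain: "portal.acme.com",
  notBefore: issued,
  notAfter: new Date(issued.getTime() + 90 * DAY_MS),
};

// Day 30: 60 days remain, above the 30-day threshold, so no renewal yet.
const onDay30 = shouldRenew(cert, new Date(issued.getTime() + 30 * DAY_MS));
// Day 61: 29 days remain, below the threshold, so renewal is due.
const onDay61 = shouldRenew(cert, new Date(issued.getTime() + 61 * DAY_MS));
```

Expressing the threshold as a fraction rather than a fixed number of days keeps the check valid if the certificate lifetime changes.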

Pillar 3: Intelligent Routing Logic

Once a request arrives at the proxy, encrypted and validated, the routing logic kicks in. This is where the "DHT for HTTP requests" analogy becomes concrete.

The proxy inspects the Host header of the incoming request.

Subdomain Routing (e.g., acme.your-saas.com): This is the simplest form. The proxy has a rule: "Any request with a Host header matching the pattern *.your-saas.com should be routed to the main application load balancer." But how does it know which tenant acme is? It doesn't; the application itself must be tenant-aware. The proxy passes the original Host header through, and the application extracts the subdomain and looks up the tenant in its database. This is less isolated but easier to manage for smaller deployments.
Fully Custom Domains (e.g., www.acme.com): This is the more powerful and complex scenario. The proxy's routing table contains explicit mappings:
- Host: www.acme.com -> Service: tenant-acme-app
- Host: portal.beta.com -> Service: tenant-beta-app

The Service here is an internal target, like a Docker container name, a Kubernetes service, or an IP:port combination. This is where true isolation happens. The request for www.acme.com is routed to a different set of backend resources than a request for portal.beta.com. The proxy might also add headers to the forwarded request, such as X-Tenant-ID: acme, so the backend application doesn't need to perform another database lookup to identify the tenant. If you rely on such a header, the proxy must also strip any X-Tenant-ID value sent by the client; otherwise a caller could impersonate another tenant.
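The two routing modes can be sketched as one function: an exact lookup for custom domains, then a wildcard fallback for shared subdomains. This is a simplified model under stated assumptions, not a real proxy configuration. The `main-app-lb` service name, the `normalizeHost` and `routeByHost` helpers, and the decision to overwrite a client-supplied X-Tenant-ID are choices made for this example.

```typescript
interface RouteTarget {
  service: string;   // internal target, e.g. a container name
  tenantId?: string; // known only for explicit custom-domain routes
}

// Explicit mappings for fully custom domains (values from the examples above).
const customDomainRoutes = new Map<string, RouteTarget>([
  ["www.acme.com", { service: "tenant-acme-app", tenantId: "acme" }],
  ["portal.beta.com", { service: "tenant-beta-app", tenantId: "beta" }],
]);

// Hypothetical shared service behind the *.your-saas.com wildcard rule.
const SHARED_SUFFIX = ".your-saas.com";
const SHARED_SERVICE = "main-app-lb";

// Lowercase the host and drop any port or trailing dot.
function normalizeHost(hostHeader: string): string {
  return hostHeader.trim().toLowerCase().replace(/:\d+$/, "").replace(/\.$/, "");
}

interface ForwardDecision {
  service: string;
  headers: Record<string, string>;
}

function routeByHost(
  hostHeader: string,
  incoming: Record<string, string>,
): ForwardDecision | null {
  const host = normalizeHost(hostHeader);

  // Never trust a client-supplied tenant header: drop it before forwarding.
  const headers: Record<string, string> = {};
  for (const name of Object.keys(incoming)) {
    if (name.toLowerCase() !== "x-tenant-id") headers[name] = incoming[name];
  }

  // Fully custom domain: explicit mapping, tenant known at the proxy.
  const exact = customDomainRoutes.get(host);
  if (exact) {
    if (exact.tenantId) headers["X-Tenant-ID"] = exact.tenantId;
    return { service: exact.service, headers };
  }

  // Wildcard subdomain: the shared app resolves the tenant from the Host itself.
  if (host.endsWith(SHARED_SUFFIX) && host.length > SHARED_SUFFIX.length) {
    return { service: SHARED_SERVICE, headers };
  }

  return null; // unknown host: the caller should answer with a 404
}

const customDecision = routeByHost("WWW.ACME.COM:443", { "X-Tenant-ID": "evil", Accept: "*/*" });
const sharedDecision = routeByHost("acme.your-saas.com", { "x-tenant-id": "evil" });
```

Normalizing the host first matters because Host headers are case-insensitive and may carry a port, while the routing table stores a single canonical form.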

Visualizing the Request Flow

Let's visualize the complete journey of a request from a user's browser to the correct tenant's application container.

This diagram illustrates how a proxy injects a tenant identification header into an incoming request, allowing the backend application container to immediately process the request without performing an additional database lookup.

This flow highlights the separation of concerns. The Public DNS handles the initial, coarse-grained direction to our proxy. The proxy, with its dynamic configuration and SSL capabilities, handles the secure, fine-grained routing to the correct isolated environment.

The "Under the Hood": Configuration as Data

The true power of this system lies in treating routing configuration as data, not as static code. In a well-architected boilerplate, the moment a tenant completes their domain setup in your application's UI, a record is created or updated in a central configuration store. This store is the "single source of truth" for the reverse proxy.

For example, a configuration entry might look like this (in a pseudo-JSON format):

{
  "domain": "portal.acme.com",
  "tenant_id": "tenant_acme_123",
  "backend_service": "http://tenant-acme-app:8080",
  "ssl_policy": "letsencrypt-prod",
  "security_headers": {
    "X-Frame-Options": "DENY",
    "Content-Security-Policy": "default-src 'self'"
  }
}

The reverse proxy is configured to watch this data store. When a new entry is added, it dynamically creates a new router in its own configuration. This is a form of Infrastructure as Code (IaC), but applied dynamically at runtime rather than statically at deployment time. It's the same principle we apply in our LangGraph workflows where the state object dictates the execution path; here, the configuration data dictates the network path.
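To make "configuration as data" concrete, the sketch below turns entries like the one above into a router definition. The output is modeled on the general shape of Traefik's dynamic configuration (routers, services, and a headers middleware), but treat the field names as an approximation to check against the version you run. `toDynamicConfig`, the router naming scheme, and the mapping of `ssl_policy` onto a certificate resolver name are assumptions made for this example.

```typescript
// Shape of the configuration entry shown above.
interface DomainConfigEntry {
  domain: string;
  tenant_id: string;
  backend_service: string;
  ssl_policy: string;
  security_headers: Record<string, string>;
}

// Subset of a Traefik-style dynamic configuration.
interface DynamicConfig {
  http: {
    routers: Record<string, {
      rule: string;
      service: string;
      middlewares: string[];
      tls: { certResolver: string };
    }>;
    services: Record<string, { loadBalancer: { servers: { url: string }[] } }>;
    middlewares: Record<string, { headers: { customResponseHeaders: Record<string, string> } }>;
  };
}

function toDynamicConfig(entries: DomainConfigEntry[]): DynamicConfig {
  const config: DynamicConfig = { http: { routers: {}, services: {}, middlewares: {} } };

  for (const entry of entries) {
    // Accept only a conservative character set before embedding the domain in a rule.
    if (!/^[a-z0-9.-]+$/.test(entry.domain)) {
      throw new Error("Refusing to route invalid domain: " + entry.domain);
    }

    // One router per domain, so a tenant with several domains does not collide.
    const name = entry.domain.replace(/\./g, "-");

    config.http.routers[name] = {
      rule: "Host(`" + entry.domain + "`)",
      service: name,
      middlewares: [name + "-headers"],
      tls: { certResolver: entry.ssl_policy },
    };
    config.http.services[name] = {
      loadBalancer: { servers: [{ url: entry.backend_service }] },
    };
    config.http.middlewares[name + "-headers"] = {
      headers: { customResponseHeaders: { ...entry.security_headers } },
    };
  }

  return config;
}

const exampleEntry: DomainConfigEntry = {
  domain: "portal.acme.com",
  tenant_id: "tenant_acme_123",
  backend_service: "http://tenant-acme-app:8080",
  ssl_policy: "letsencrypt-prod",
  security_headers: { "X-Frame-Options": "DENY" },
};

const generated = toDynamicConfig([exampleEntry]);
```

Validating the domain before it reaches the rule string matters here because tenants supply it, so it has to be treated as untrusted input.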

By building this system, you are not just implementing a feature; you are creating a foundational platform that is secure, scalable, and brandable. It's the silent, invisible engine that makes a multi-tenant SaaS feel like a dedicated, enterprise-grade solution for every single customer.

Basic Code Example

This example demonstrates a simplified, self-contained routing engine for a multi-tenant SaaS. It simulates the logic required to map incoming requests (identified by a custom domain or subdomain) to the correct tenant's database connection or application context. We will use Zod for runtime validation to ensure data integrity, a critical security measure when handling external inputs like domain headers or query parameters.

The code is written in TypeScript and focuses on the core routing logic, ignoring the complexities of actual HTTP servers or SSL termination for clarity.

// TypeScript source: type checking is always on, so no `// @ts-check` directive is needed
// (that directive only applies to plain JavaScript files).

/**
 * ============================================================================
 * 1. DEPENDENCIES & SETUP
 * ============================================================================
 * We simulate a database lookup and validation. In a real app, you would
 * import `z` from 'zod' and connect to a database client.
 */

// Mocking Zod for this self-contained example (simulating 'zod' library behavior)
// In a real project: import { z } from 'zod';
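const z = {
  object: (schema: Record<string, string>) => ({
    parse: (data: any) => {
      // Simple mock validation: every schema key must be present
      // and have the expected primitive type.
      const errors: string[] = [];
      for (const key in schema) {
        if (data == null || data[key] === undefined || data[key] === null) {
          errors.push(`Missing field: ${key}`);
        } else if (typeof data[key] !== schema[key]) {
          errors.push(`Invalid type for field: ${key}`);
        }
      }
      if (errors.length > 0) {
        throw new Error(`Validation Error: ${errors.join(', ')}`);
      }
      return data;
    }
  }),
  string: () => 'string', // The `typeof` tag the mock parser checks against
};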

/**
 * ============================================================================
 * 2. TYPE DEFINITIONS
 * ============================================================================
 * Strict typing ensures we know exactly what shape our data is in.
 */

/**
 * Represents the raw incoming request data.
 * In a real server (Next.js/Express), this would come from headers or query params.
 */
interface IncomingRequest {
  hostname: string; // e.g., "acme.saas-app.com" or "localhost:3000"
  path: string;     // e.g., "/dashboard"
}

/**
 * Represents a Tenant record in the database.
 * This includes the custom domain mapping and the specific database connection string.
 */
interface TenantRecord {
  id: string;
  customDomain: string; // The domain the tenant owns
  dbConnectionUrl: string; // Isolated database instance for this tenant
}

/**
 * Represents the resolved context after routing.
 * This object is passed to the application logic to handle the request.
 */
interface TenantContext {
  tenantId: string;
  databaseUrl: string;
  requestedPath: string;
}

/**
 * ============================================================================
 * 3. MOCK DATABASE LAYER
 * ============================================================================
 * Simulates a database lookup. In production, this would be a SQL query
 * or a Redis cache lookup.
 */

const mockTenantDb: TenantRecord[] = [
  {
    id: 'tenant_123',
    customDomain: 'acme.saas-app.com',
    dbConnectionUrl: 'postgresql://user:pass@db.acme.internal/tenant_123',
  },
  {
    id: 'tenant_456',
    customDomain: 'globex.saas-app.com',
    dbConnectionUrl: 'postgresql://user:pass@db.globex.internal/tenant_456',
  },
];

/**
 * Simulates fetching tenant data by their custom domain.
 * @param domain - The domain extracted from the request.
 * @returns Promise<TenantRecord | null>
 */
async function fetchTenantByDomain(domain: string): Promise<TenantRecord | null> {
  // Simulate async database call
  return new Promise((resolve) => {
    setTimeout(() => {
      const tenant = mockTenantDb.find((t) => t.customDomain === domain);
      resolve(tenant || null);
    }, 50); // Simulate network latency
  });
}

/**
 * ============================================================================
 * 4. ROUTING LOGIC
 * ============================================================================
 * The core logic that maps a request to a specific tenant context.
 */

/**
 * Schema used to validate the shape of the incoming request before routing.
 *
 * Why Zod? Even though we have TypeScript interfaces, those are erased at runtime.
 * If a malicious user sends a malformed request, we need runtime validation
 * to prevent crashes or security vulnerabilities.
 */
const RequestSchema = z.object({
  hostname: z.string(),
  path: z.string(),
});

/**
 * Main routing function.
 * 1. Validates input.
 * 2. Looks up tenant in DB.
 * 3. Returns enriched context or throws an error.
 */
async function routeRequest(request: IncomingRequest): Promise<TenantContext> {
  // Step 1: Runtime Validation
  // We parse the request to ensure 'hostname' and 'path' exist and are strings.
  const validatedRequest = RequestSchema.parse(request);

  // Step 2: Domain Extraction & Lookup
  // Behind a reverse proxy (like Traefik), the original hostname typically reaches the app
  // in the Host header, or in X-Forwarded-Host when the proxy rewrites Host.
  // Here, we use it directly.
  const { hostname, path } = validatedRequest;

  console.log(`[Router] Received request for host: ${hostname}`);

  // Step 3: Database Resolution
  const tenant = await fetchTenantByDomain(hostname);

  if (!tenant) {
    // If no tenant is found, we throw an error to be caught by the global error handler.
    throw new Error(`Tenant not found for domain: ${hostname}`);
  }

  // Step 4: Construct Context
  // This object represents the "State" for the duration of the request.
  const context: TenantContext = {
    tenantId: tenant.id,
    databaseUrl: tenant.dbConnectionUrl,
    requestedPath: path,
  };

  return context;
}

/**
 * ============================================================================
 * 5. SIMULATION & EXECUTION
 * ============================================================================
 * Running the logic with valid and invalid inputs to demonstrate behavior.
 */

async function runSimulation() {
  console.log('--- Starting SaaS Router Simulation ---\n');

  // Scenario A: Valid Custom Domain
  // The user owns "acme.saas-app.com" and points it to our SaaS.
  try {
    const validRequest: IncomingRequest = {
      hostname: 'acme.saas-app.com',
      path: '/dashboard',
    };

    const context = await routeRequest(validRequest);

    console.log('✅ Success (Valid Domain):');
    console.log(`   Tenant ID: ${context.tenantId}`);
    console.log(`   Database: ${context.databaseUrl}`);
    console.log(`   Routing to: ${context.requestedPath}\n`);
  } catch (error) {
    console.error('❌ Error:', (error as Error).message);
  }

  // Scenario B: Invalid/Unregistered Domain
  // A user tries to access a domain not configured in the system.
  try {
    const invalidRequest: IncomingRequest = {
      hostname: 'hacker.saas-app.com',
      path: '/admin',
    };

    await routeRequest(invalidRequest);
  } catch (error) {
    console.error('❌ Error (Unregistered Domain):');
    console.error(`   ${(error as Error).message}\n`);
  }

  // Scenario C: Malformed Input (Runtime Validation Failure)
  // Simulating a corrupted request object (e.g., missing hostname).
  try {
    const malformedRequest: any = {
      // hostname is missing
      path: '/profile',
    };

    await routeRequest(malformedRequest);
  } catch (error) {
    console.error('❌ Error (Validation Failure):');
    console.error(`   ${(error as Error).message}\n`);
  }
}

// Execute the simulation
runSimulation();

Line-by-Line Explanation

1. Dependencies & Setup

Mocking Zod: In a real-world application, you would install zod via npm. Since this is a self-contained example, we create a minimal z object that mimics Zod's object and string methods. The parse method in our mock checks if required fields exist, simulating the strict validation Zod provides.
Why this matters: Even in TypeScript, data coming from the outside world (HTTP requests, database rows) is inherently untrusted. any types are dangerous. Zod enforces a schema at runtime, preventing the application from crashing or behaving unpredictably when it receives malformed data.

2. Type Definitions

IncomingRequest: Defines the shape of the data we receive from the reverse proxy (e.g., Nginx, Traefik). In a real Next.js app, hostname is often derived from req.headers.host.
TenantRecord: Represents a row in your tenants table. Crucially, it includes dbConnectionUrl. This highlights the multi-tenant architecture where data is physically or logically separated.
TenantContext: This is the "resolved" object. Once the router determines who is asking, it bundles the necessary credentials and identifiers into this object. This context is often attached to the request object (e.g., req.tenant) so downstream controllers don't need to re-query the database.

3. Mock Database Layer

mockTenantDb: An array acting as our database.
fetchTenantByDomain: An asynchronous function that simulates a database query. In production, this would be await db.select().from(tenants).where(eq(tenants.customDomain, domain)).
Async/Await: We use async/await because database calls are I/O bound and non-blocking. This ensures the Node.js event loop isn't frozen while waiting for the database response.

4. Routing Logic

RequestSchema.parse: This is the first line of defense. It validates the request object against the defined schema. If validation fails, Zod throws an error, which stops execution immediately. This prevents undefined errors later in the code.
fetchTenantByDomain: We look up the tenant. If the domain isn't found, we throw a specific error. In a real app, you might redirect to a generic 404 page or a "Claim Domain" page.
Context Construction: We create a TenantContext object. This is the "payload" that the rest of the application uses. By passing this object around, we ensure that every part of the app has access to the correct database connection string, ensuring strict tenant isolation.

5. Simulation & Execution

Scenario A (Success): Demonstrates the happy path. The domain exists, validation passes, and the context is returned.
Scenario B (Tenant Not Found): Demonstrates error handling for unknown domains. The catch block intercepts the error thrown by the router.
Scenario C (Validation Failure): Demonstrates the power of runtime validation. Because malformedRequest is typed as any, TypeScript compiles the call even though hostname is missing. Our Zod mock catches the missing field at runtime and fails with a clear validation error instead of an obscure failure further down the call chain.

Visualizing the Flow

The following diagram illustrates the logic flow within the routeRequest function.

The diagram visually traces the execution path of the routeRequest function, highlighting how the Zod schema validation intercepts a partial object to prevent a runtime crash.

Common Pitfalls

When implementing custom domain routing in a production SaaS environment, watch out for these specific issues:

The "localhost" Trap (Vercel/Netlify Preview Deployments):
- Issue: When developing locally or using Vercel preview branches, the hostname might be localhost:3000 or my-app-git-branch.vercel.app. If your database only contains acme.saas-app.com, your router will fail.
- Solution: Implement a "Default Tenant" or "Dev Tenant" logic. If the environment is development, map unknown domains to a hardcoded test tenant ID. Never hardcode production logic that assumes specific domains.
Uncached Tenant Lookups Under High Traffic:
- Issue: If you perform a database lookup for every request without caching, you will hit database connection limits under load.
- Solution: Use Redis or an in-memory store (like Node-Cache) to cache the domain -> tenantId mapping. The flow should be: Check Cache -> If missing, Query DB -> Update Cache. Because domain mappings change rarely, most requests are then served from the cache. Give entries a TTL, or invalidate them when a tenant changes its domain, so stale mappings don't linger.
Insecure Direct Object References (IDOR) via Subdomains:
- Issue: If you route based on subdomains (e.g., tenant-id.saas-app.com), the subdomain only identifies which tenant is being requested. It does not prove that the caller belongs to that tenant.
- Solution: Never trust the subdomain alone for data isolation. Always verify the tenantId against the authenticated user's session. A user from Tenant A should never be able to access Tenant B's data, even if they manually change the URL.
Incomplete Zod Validation:
- Issue: Developers often forget to validate nested objects. If your IncomingRequest has a headers object, you must validate that too. Unvalidated header values can carry SQL injection or XSS payloads into code that trusts them.
- Solution: Use strict Zod schemas. Calling .strict() on an object schema makes parsing fail when the input contains unknown keys, which helps against parameter pollution attacks.
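Several of these pitfalls share one fix: a small resolution layer in front of the database. The sketch below combines a cache-aside lookup, an opt-in development fallback, and a session check. It is a minimal in-memory illustration, and `TtlCache`, `resolveTenantId`, and `sessionMatchesTenant` are hypothetical names; in production the cache would more likely be Redis, and the fallback tenant must only be supplied in development.

```typescript
// Small TTL cache with an injectable clock so expiry can be tested deterministically.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (!hit) return undefined;
    if (hit.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: evict and report a miss
      return undefined;
    }
    return hit.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

type TenantLookup = (domain: string) => Promise<string | null>;

// Cache-aside: check the cache, fall back to the database, then populate the cache.
async function resolveTenantId(
  domain: string,
  cache: TtlCache<string>,
  lookup: TenantLookup,
  devFallbackTenantId?: string, // pass only when running in development
): Promise<string | null> {
  const cached = cache.get(domain);
  if (cached !== undefined) return cached;

  const tenantId = await lookup(domain);
  if (tenantId !== null) {
    cache.set(domain, tenantId);
    return tenantId;
  }

  // Unknown domain: use the dev tenant if one was provided, otherwise give up.
  return devFallbackTenantId ?? null;
}

// The tenant resolved from the hostname must match the tenant in the user's session.
function sessionMatchesTenant(sessionTenantId: string, resolvedTenantId: string): boolean {
  return sessionTenantId === resolvedTenantId;
}
```

Note that the fallback result is deliberately not cached, so a domain that gets registered later is picked up on the next request.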

The chapter continues with advanced code, exercises and solutions with analysis, you can find them on the ebook on Leanpub.com or Amazon

Code License: All code examples are released under the MIT License. Github repo.

All textual explanations, original diagrams, and illustrations are the intellectual property of the author. To support the maintenance of this site via AdSense, please read this content exclusively online. Copying, redistribution, or reproduction is strictly prohibited.