
Chapter 23: Architecting Distributed Inference: Dynamic Batching and Stateful Orchestration

Theoretical Foundations

Architecting distributed inference pipelines requires a fundamental shift in perspective. We are no longer building monolithic applications where data flows linearly through a single process. Instead, we are composing a distributed system where intelligent agents—containerized microservices—collaborate to perform complex AI tasks. The core challenge lies in managing the tension between stateless computation (the actual model inference) and stateful orchestration (the logic that directs the flow of data and context between agents).

The Dichotomy: Stateless Inference vs. Stateful Orchestration

To understand this architecture, consider the analogy of a high-end restaurant kitchen versus a fast-food assembly line.

A monolithic AI application is like a single chef in a kitchen. They receive an order, gather ingredients, cook the dish, plate it, and serve it. If the restaurant gets busy, you can’t just hire a fraction of a chef; you must hire a whole new chef who duplicates the entire process. This is inefficient and scales poorly.

In a distributed inference pipeline, we separate concerns. The stateless inference workloads are the specialized cooking stations (grill, fryer, sauté). They have no memory of previous orders. A grill receives raw ingredients (input data), applies a fixed set of rules (the model weights), and outputs a cooked item (inference result). It doesn't care if the next order is for a burger or a steak; it just processes what it's given. This is the GPU-bound workload.

The stateful orchestration logic is the expediter (or sous-chef) calling out orders. They maintain the state of the table: who ordered what, what modifications were requested, and the sequence of courses. The expediter doesn't cook, but they direct the flow. If the steak needs to be ready before the salad, the expediter ensures the grill station gets the steak order first and holds the salad station. This is the CPU-bound, latency-sensitive logic.

In our C# ecosystem, this separation is not just a pattern; it is enforced by the containerization boundary. The inference service runs in a container optimized for GPU drivers (e.g., NVIDIA CUDA), while the orchestration service runs in a standard .NET runtime container.

The Role of C# Interfaces in Decoupling

The glue that holds this distributed system together, allowing us to swap inference backends without breaking the orchestration logic, is the C# Interface. In the context of AI applications, interfaces are crucial for defining the contract between the orchestrator and the actual model execution.

Imagine we have an orchestrator managing a RAG (Retrieval-Augmented Generation) pipeline. It needs to call an LLM to generate a response. We don't want the orchestrator to know if we are using OpenAI's GPT-4, a local Llama 3 instance, or a fine-tuned model running on Azure ML.

We define a generic interface:

using System.Threading.Tasks;
using System.Collections.Generic;

namespace DistributedInference.Contracts
{
    // This interface represents the contract for any AI model that can generate text.
    // It is agnostic of the underlying implementation (cloud, local, edge).
    public interface IInferenceEngine
    {
        // The method accepts a prompt and a context dictionary (for state management)
        // and returns a structured response.
        Task<InferenceResult> GenerateAsync(string prompt, Dictionary<string, object> context);
    }

    public record InferenceResult(string Content, Dictionary<string, object> Metadata);
}

By depending on IInferenceEngine rather than a concrete class like OpenAiClient or LlamaSharpClient, the orchestration logic remains stable even if we swap the underlying inference provider. This adheres to the Dependency Inversion Principle (from SOLID principles, referenced in Book 2, Chapter 4: "Design Patterns for Scalable Services"). The high-level orchestration module does not depend on low-level inference details; both depend on an abstraction.
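To make the contract concrete, here is a minimal sketch of a swappable implementation. LocalEchoEngine is a hypothetical stand-in that echoes the prompt instead of calling a real model — useful for unit-testing the orchestration logic without a GPU. The contracts are repeated so the snippet compiles on its own:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Contracts from the listing above, repeated for a self-contained snippet.
public interface IInferenceEngine
{
    Task<InferenceResult> GenerateAsync(string prompt, Dictionary<string, object> context);
}

public record InferenceResult(string Content, Dictionary<string, object> Metadata);

// Hypothetical stand-in: satisfies the contract without any model backend.
public class LocalEchoEngine : IInferenceEngine
{
    public Task<InferenceResult> GenerateAsync(string prompt, Dictionary<string, object> context)
    {
        var metadata = new Dictionary<string, object> { ["engine"] = "local-echo" };
        return Task.FromResult(new InferenceResult($"Echo: {prompt}", metadata));
    }
}
```

Swapping this for an OpenAI- or LlamaSharp-backed class is then a one-line change at the DI registration site (e.g., builder.Services.AddSingleton<IInferenceEngine, LocalEchoEngine>()); the orchestration code never changes.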

Dynamic Batching: Optimizing the "Grill Station"

In the restaurant analogy, if every single order went to the grill station immediately, the chef would spend most of their time flipping one burger at a time. This is inefficient. Instead, the expediter groups orders: "Three burgers, two well-done, one medium." The grill station processes them simultaneously.

In AI inference, this is Dynamic Batching. GPUs are massively parallel processors. Sending one inference request at a time leaves the GPU underutilized; the overhead of kernel launches and memory transfers dominates the execution time. Dynamic batching aggregates multiple incoming requests into a single batch tensor, executes them in parallel on the GPU, and then demultiplexes the results.

This technique is critical for maximizing throughput. However, it introduces latency trade-offs. If the batch waits too long to fill up, the latency for the first request increases. If it's too small, utilization drops.

The orchestration layer must implement this logic. It is a stateful operation because the orchestrator must track which requests belong to which client while waiting for the batch to fill or a timeout to occur.

Service Meshes: The Nervous System

When we have multiple agents (e.g., a "Text Summarizer," a "Sentiment Analyzer," and a "Named Entity Recognizer") communicating in a pipeline, we need a reliable communication layer. In a monolith, this is a method call. In a distributed system, it's a network call.

A Service Mesh (like Istio or Linkerd) acts as the nervous system for our microservices. It handles:

  1. Service Discovery: How does the Summarizer find the Sentiment Analyzer?
  2. Traffic Management: How do we route 10% of traffic to a new version of the Sentiment Analyzer (Canary Deployment)?
  3. Resilience: If the Sentiment Analyzer is slow, the mesh can enforce timeouts or retry policies so the Summarizer doesn't hang.

In C#, while the service mesh is typically an infrastructure component (sidecar proxies), we interact with it via standard HTTP/gRPC calls. The mesh ensures that even though our agents are physically separated in the cluster, they can communicate as if they were local.
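From the C# side, calling a sibling agent through the mesh looks like an ordinary HTTP call. The sketch below assumes a Kubernetes Service named sentiment-analyzer (a hypothetical name for this example); cluster DNS resolves it, and the sidecar proxy transparently adds mTLS, retries, and timeouts, so no resilience code appears in the client:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public class SentimentClient
{
    private readonly HttpClient _http;

    public SentimentClient(HttpClient http) => _http = http;

    // Building the relative URI separately keeps the routing logic testable offline.
    public static Uri BuildAnalyzeUri(string text) =>
        new Uri($"/analyze?text={Uri.EscapeDataString(text)}", UriKind.Relative);

    // The mesh sidecar intercepts this call; the client stays plain HTTP.
    public Task<string> AnalyzeAsync(string text) =>
        _http.GetStringAsync(BuildAnalyzeUri(text));
}
```

Registered as a typed client whose base address is the mesh-managed service name — e.g., builder.Services.AddHttpClient<SentimentClient>(c => c.BaseAddress = new Uri("http://sentiment-analyzer")) — the calling code never knows where the analyzer physically runs.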

Kubernetes CRDs: Declarative Orchestration

Managing complex inference workflows (e.g., "Summarize this document, then translate the summary, then detect PII") manually is error-prone. We need a way to define the desired state of the pipeline and let the system figure out how to achieve it.

This is where Kubernetes Custom Resource Definitions (CRDs) come in. Standard Kubernetes resources (Pods, Deployments) are generic. CRDs allow us to extend the Kubernetes API to understand our specific AI domain.

We can define a CRD called InferencePipeline. Instead of writing imperative code to chain services, we declare the pipeline in YAML:

apiVersion: ai.example.com/v1
kind: InferencePipeline
metadata:
  name: document-processing-pipeline
spec:
  steps:
    - name: summarizer
      model: "llama-3-70b"
      inputFrom: "root"
    - name: translator
      model: "m2m100-1.2b"
      inputFrom: "summarizer"
    - name: pii-detector
      model: "bert-finetuned-pii"
      inputFrom: "translator"

A Custom Controller (written in C# using the Kubernetes Client Library) watches for these CRDs. When a user creates this resource, the controller doesn't just run a script; it calculates the dependency graph and provisions the necessary Kubernetes Jobs or Services to satisfy the topology.

This approach separates the intent (the pipeline definition) from the execution (the container orchestration). It allows developers to define complex workflows without deep knowledge of the underlying infrastructure.
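The dependency-graph step the controller performs can be sketched in plain C#. The snippet below models the steps of the YAML above and derives an execution order from their inputFrom references — a simplified sketch: a real controller would also detect cycles and provision Kubernetes resources per step:

```csharp
using System.Collections.Generic;
using System.Linq;

public record PipelineStep(string Name, string Model, string InputFrom);

public static class PipelinePlanner
{
    // Depth-first resolution: each step is emitted after the step it reads from.
    // "root" marks the pipeline's external input and has no producing step.
    public static List<string> ResolveOrder(IEnumerable<PipelineStep> steps)
    {
        var byName = steps.ToDictionary(s => s.Name);
        var ordered = new List<string>();
        var visited = new HashSet<string>();

        void Visit(PipelineStep step)
        {
            if (!visited.Add(step.Name)) return;
            if (step.InputFrom != "root" && byName.TryGetValue(step.InputFrom, out var dep))
                Visit(dep);
            ordered.Add(step.Name);
        }

        foreach (var step in steps) Visit(step);
        return ordered;
    }
}
```

For the document-processing-pipeline above, ResolveOrder yields summarizer, translator, pii-detector regardless of the order in which the steps were declared.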

Autoscaling: Balancing Latency and Cost

The final piece of the puzzle is scaling. In a restaurant, you don't keep 50 chefs on standby through the mid-afternoon lull; you staff based on demand. In AI inference, GPU resources are expensive. We cannot afford to have a cluster of A100s sitting idle.

We use Horizontal Pod Autoscalers (HPA) or Kubernetes Event-driven Autoscaling (KEDA).

  • Standard HPA scales based on CPU/Memory usage. However, for AI, CPU usage is often low while the GPU is saturated. This is misleading.
  • KEDA is superior here. It scales based on the length of a message queue (e.g., RabbitMQ or Azure Service Bus). If the queue length grows, it means our inference agents are overwhelmed, and KEDA spins up more replicas.

The orchestration strategy must account for this. We decouple the entry point (API Gateway) from the processing agents using a queue. The API Gateway accepts the request, drops it into a queue, and immediately returns a 202 Accepted. This prevents client timeouts during high load. The agents pick up messages from the queue, process them, and push results to an output queue or database.
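A KEDA configuration for this pattern might look like the following sketch. The ScaledObject schema is KEDA's; the deployment name, queue name, connection-string variable, and thresholds are assumptions for this example:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-agent-scaler
spec:
  scaleTargetRef:
    name: inference-agent   # the Deployment running our C# inference agents
  minReplicaCount: 1        # keep one warm replica to avoid cold starts
  maxReplicaCount: 10
  triggers:
    - type: rabbitmq
      metadata:
        queueName: inference-requests
        mode: QueueLength
        value: "20"         # scale out at ~20 pending messages per replica
        hostFromEnv: RABBITMQ_CONNECTION_STRING
```

KEDA polls the queue depth and adjusts replicas, so the GPU fleet tracks actual demand rather than a misleading CPU metric.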

Visualizing the Architecture

The following diagram illustrates the flow of data and control in a distributed inference pipeline using C# microservices and Kubernetes.

The diagram illustrates how C# microservices deployed on Kubernetes orchestrate a distributed inference pipeline, where agents retrieve tasks from an input queue, process them, and push results to an output queue or database.

Architectural Implications and Edge Cases

When designing these systems, several edge cases must be considered:

  1. Idempotency: Since we use queues for resilience, a message might be delivered twice. The inference agents must be idempotent. Processing the same input twice should yield the same result without side effects (e.g., double billing).
  2. Model Warm-up: GPU models have a "cold start" penalty. Loading a 70B-parameter model into VRAM can take minutes, so autoscaling from zero to one replica is slow. We mitigate this with pre-warming and a nonzero minimum replica count, keeping at least one loaded instance ready at all times.
  3. Data Gravity: Moving large tensors (e.g., video frames) over the network is expensive. In the restaurant analogy, this is like moving a heavy roast from the loading dock to the kitchen. We try to keep data locality high. If Agent A and Agent B are in the same pipeline, Kubernetes scheduling should ideally place them on the same node (using Affinity rules) to utilize shared memory or high-speed interconnects (NVLink).
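The idempotency requirement from point 1 can be sketched with a result cache keyed by message ID. This in-process version is illustrative only — in production the processed-ID store would live in Redis or a database so it survives restarts and is shared across replicas:

```csharp
using System;
using System.Collections.Concurrent;

public class IdempotentProcessor
{
    // Maps message IDs to their results; a redelivered message hits the cache
    // instead of being processed (and billed) a second time.
    private readonly ConcurrentDictionary<Guid, string> _results = new();

    public string Process(Guid messageId, string input)
    {
        return _results.GetOrAdd(messageId, _ => $"Processed: {input.ToUpper()}");
    }
}
```

Note that ConcurrentDictionary.GetOrAdd may invoke the factory more than once under contention, though only one result is ever stored; genuinely side-effecting work (billing, writes) should still be guarded by a per-ID lock or a database uniqueness constraint.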

Summary of Concepts

  • Interfaces (IInferenceEngine): The abstraction layer allowing the orchestration logic to remain decoupled from specific AI model implementations.
  • Stateless Agents: Containerized services that perform the heavy lifting (inference) without retaining memory of previous requests.
  • Stateful Orchestrator: The logic (often a Kubernetes Operator) that manages the workflow state, dependencies, and execution order.
  • CRDs: The declarative API that allows developers to define complex AI workflows as Kubernetes resources.
  • Dynamic Batching: The technique of grouping requests to maximize GPU parallelism, managed by the inference service or a sidecar proxy.

By combining these elements, we move from simple model deployment to a robust, scalable, and maintainable AI platform capable of handling enterprise-grade workloads.

Basic Code Example

Here is a self-contained C# example demonstrating a basic orchestration strategy for a distributed inference pipeline. This example simulates a stateless inference microservice using ASP.NET Core, dynamic batching, and a mock stateful orchestrator.

using System.Collections.Concurrent;
using System.Diagnostics;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

// ---------------------------------------------------------
// 1. Domain Models
// ---------------------------------------------------------
public record InferenceRequest(Guid Id, string InputData);
public record InferenceResult(Guid Id, string OutputData, long ProcessingTimeMs);

// ---------------------------------------------------------
// 2. Stateful Orchestrator (The "Brain")
// ---------------------------------------------------------
public class InferenceOrchestrator
{
    private readonly ILogger<InferenceOrchestrator> _logger;
    // Thread-safe collection to act as a dynamic batch buffer
    private readonly ConcurrentQueue<InferenceRequest> _batchQueue = new();
    private readonly Timer _batchProcessorTimer;
    private const int BatchSize = 4; // Optimize for GPU parallelism
    private const int MaxWaitTimeMs = 50; // Max latency budget for batching

    public InferenceOrchestrator(ILogger<InferenceOrchestrator> logger)
    {
        _logger = logger;
        // Start the background batching loop
        _batchProcessorTimer = new Timer(ProcessBatchAsync, null, 0, MaxWaitTimeMs);
    }

    public Task<InferenceResult> ProcessRequestAsync(InferenceRequest request)
    {
        var tcs = new TaskCompletionSource<InferenceResult>();
        // Records are immutable, so we track the completion source by request ID
        // (via RequestTracker below) instead of mutating the request object.
        request.RegisterCompletion(tcs);
        _batchQueue.Enqueue(request);
        return tcs.Task;
    }

    private void ProcessBatchAsync(object? state)
    {
        if (_batchQueue.IsEmpty) return;

        var batch = new List<InferenceRequest>();
        while (batch.Count < BatchSize && _batchQueue.TryDequeue(out var req))
        {
            batch.Add(req);
        }

        if (batch.Count > 0)
        {
            // Offload to the stateless inference worker
            _ = Task.Run(() => ExecuteInferenceBatch(batch));
        }
    }

    private async Task ExecuteInferenceBatch(List<InferenceRequest> batch)
    {
        _logger.LogInformation("Processing batch of {Count} requests", batch.Count);

        // SIMULATION: In a real scenario, this sends data to a GPU-backed container
        // or a sidecar process (e.g., via gRPC or shared memory).
        var stopwatch = Stopwatch.StartNew();
        await Task.Delay(20); // Simulate GPU compute latency

        foreach (var req in batch)
        {
            // Simulate AI inference logic
            var result = new InferenceResult(
                req.Id, 
                $"Processed: {req.InputData.ToUpper()}", 
                stopwatch.ElapsedMilliseconds
            );

            // Complete the waiting task, correlated by request ID
            if (RequestTracker.PendingRequests.TryRemove(req.Id, out var tcs))
                tcs.TrySetResult(result);
        }
    }
}

// Records are immutable, so instead of attaching the TaskCompletionSource to the
// request directly, we map request IDs to completion sources in a static,
// thread-safe dictionary.
public static class RequestTracker
{
    public static readonly ConcurrentDictionary<Guid, TaskCompletionSource<InferenceResult>> PendingRequests = new();
}

// ---------------------------------------------------------
// 3. Stateless Inference Microservice (The "Muscle")
// ---------------------------------------------------------
public class InferenceService
{
    private readonly InferenceOrchestrator _orchestrator;

    public InferenceService(InferenceOrchestrator orchestrator)
    {
        _orchestrator = orchestrator;
    }

    public async Task<IResult> HandleInference(HttpRequest request)
    {
        // 1. Deserialization
        var inputData = request.Query["data"].ToString();
        if (string.IsNullOrWhiteSpace(inputData)) 
            return Results.BadRequest("Missing 'data' query parameter.");

        var inferenceRequest = new InferenceRequest(Guid.NewGuid(), inputData);

        // 2. Orchestration Call
        // The service itself is stateless; it delegates stateful batching to the orchestrator.
        var result = await _orchestrator.ProcessRequestAsync(inferenceRequest);

        // 3. Response
        return Results.Ok(new 
        { 
            RequestId = result.Id, 
            Output = result.OutputData,
            LatencyMs = result.ProcessingTimeMs 
        });
    }
}

// ---------------------------------------------------------
// 4. Main Application Entry Point
// ---------------------------------------------------------
// NOTE: In a single-file program, these top-level statements must appear
// before any type declarations; in a real project they live in Program.cs.
var builder = WebApplication.CreateBuilder(args);

// Register the stateful orchestrator as a Singleton (lifecycle matches the app)
builder.Services.AddSingleton<InferenceOrchestrator>();
builder.Services.AddSingleton<InferenceService>();

var app = builder.Build();

// Map the endpoint
app.MapGet("/infer", (InferenceService service, HttpRequest request) => 
    service.HandleInference(request));

app.Run();

// ---------------------------------------------------------
// 5. Helper Class for the Immutable Record Workaround
// ---------------------------------------------------------
public static class InferenceRequestExtensions
{
    // Since records are immutable, we use a static tracker for the demo
    public static void RegisterCompletion(this InferenceRequest request, TaskCompletionSource<InferenceResult> tcs)
    {
        RequestTracker.PendingRequests.TryAdd(request.Id, tcs);
    }
}

Line-by-Line Explanation

1. Domain Models (InferenceRequest, InferenceResult)

  • Lines 9-10: We define immutable records (record) for our data transfer objects (DTOs). In a distributed system, immutability prevents side effects when objects are passed between microservices or threads.
    • InferenceRequest: Contains a unique ID and the raw input (e.g., text for an LLM).
    • InferenceResult: Contains the ID (for correlation), the processed output, and the processing time for metrics.

2. The Stateful Orchestrator (InferenceOrchestrator)

This class represents the "Orchestration Logic," keeping it separate from the "Inference Workload."

  • Lines 15-17 (Buffer): We use a ConcurrentQueue<T>. This is thread-safe, allowing multiple HTTP requests (threads) to enqueue requests simultaneously without locking, which is crucial for high-throughput systems.
  • Lines 18-20 (Configuration):
    • BatchSize = 4: This mimics optimizing for GPU parallelism. GPUs process matrices in parallel; sending 1 item at a time leaves the GPU underutilized.
    • MaxWaitTimeMs = 50: This is a latency trade-off. We wait up to 50ms to fill a batch. If we wait too long, latency suffers; if we don't wait, throughput drops.
  • Lines 22-27 (Constructor & Timer):
    • Instead of a complex background worker service, we use a System.Threading.Timer. This ticks every 50ms (MaxWaitTimeMs).
    • Architectural Note: This timer drives the "Dynamic Batching" mechanism. It checks if we have enough data to justify a GPU invocation.
  • Lines 29-34 (Public API):
    • ProcessRequestAsync receives a request. It creates a TaskCompletionSource (TCS). This allows the HTTP thread to await a result that hasn't been computed yet. The TCS is the bridge between the request and the eventual batch processing.
  • Lines 36-50 (Batching Logic):
    • The timer callback (ProcessBatchAsync) tries to drain the queue.
    • It fills a local list batch up to the BatchSize.
    • It offloads the actual processing to a Task.Run. This is critical: we don't want the batch processing to block the timer thread, which would prevent subsequent ticks from firing.
  • Lines 52-70 (Execution Simulation):
    • ExecuteInferenceBatch simulates the heavy lifting. In a real Kubernetes deployment, this is where you would call a separate Inference Server (e.g., TensorFlow Serving, TorchServe) or a GPU-accelerated sidecar.
    • RequestTracker.PendingRequests maps each request ID to its TaskCompletionSource; resolving it "unblocks" the specific HTTP request waiting for that ID.

3. The Stateless Microservice (InferenceService)

  • Lines 72-88: This class represents the containerized API endpoint. It is stateless; it holds no data about the batch. It simply receives an HTTP request, passes it to the orchestrator, and awaits the result.
  • Line 82: await _orchestrator.ProcessRequestAsync(inferenceRequest). This is the hand-off. The HTTP request lifecycle is now tied to the orchestrator's batch processing.

4. Main Application (Program)

  • Lines 92-96: We use the Minimal API pattern (ASP.NET Core 6+).
  • Line 94: builder.Services.AddSingleton<InferenceOrchestrator>();. This is a critical architectural decision. The orchestrator must be a Singleton so that the batching queue and timer persist across all incoming HTTP requests. If it were Scoped or Transient, the batch queue would reset for every request, breaking the batching logic.

Visualizing the Architecture

The following diagram illustrates the flow of data through the stateless microservice and the stateful orchestrator.

This diagram illustrates how the stateful orchestrator maintains a persistent batch queue across requests, whereas a stateless microservice would reset the queue with each request, breaking the batching logic.

Common Pitfalls

  1. Blocking the Event Loop with Synchronous GPU Calls

    • Mistake: Calling a synchronous library (like an unconfigured ONNX runtime or a blocking file I/O) directly inside the ExecuteInferenceBatch method.
    • Consequence: Timer callbacks run on thread-pool threads, and a synchronous block ties those threads up. Under load the pool is exhausted, the batch queue fills, latency skyrockets, and the application eventually stops accepting new connections.
    • Fix: Always await non-blocking I/O or offload CPU/GPU intensive work to a separate process or use Task.Run as shown.
  2. Treating the Orchestrator as Transient/Scoped

    • Mistake: Registering InferenceOrchestrator as AddScoped.
    • Consequence: In ASP.NET Core, a scope is created per HTTP request. This means every HTTP request gets its own orchestrator with its own empty queue. Batching becomes impossible because the queues never accumulate items across requests.
    • Fix: Use AddSingleton for the orchestrator or use a distributed cache/queue (like Redis or RabbitMQ) if the API is scaled across multiple pods.
  3. Ignoring Backpressure

    • Mistake: Allowing the ConcurrentQueue to grow infinitely if the GPU worker is slow or down.
    • Consequence: Memory usage will spike and eventually crash the container (OOMKilled).
    • Fix: Add a SemaphoreSlim or check _batchQueue.Count before enqueuing. If the queue is too large, return a 503 Service Unavailable immediately to the client.
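The backpressure fix from Pitfall 3 can be sketched as a bounded buffer in front of the batch queue — a simplified illustration of the depth check, not drop-in code for the orchestrator above:

```csharp
using System.Collections.Concurrent;

public class BoundedBatchBuffer<T>
{
    private readonly ConcurrentQueue<T> _queue = new();
    private readonly int _maxDepth;

    public BoundedBatchBuffer(int maxDepth) => _maxDepth = maxDepth;

    // Returns false when the buffer is full; the caller should respond
    // 503 Service Unavailable instead of letting memory grow unbounded.
    public bool TryEnqueue(T item)
    {
        // Note: ConcurrentQueue<T>.Count takes a snapshot and is not O(1);
        // for very hot paths, maintain an Interlocked counter instead.
        if (_queue.Count >= _maxDepth) return false;
        _queue.Enqueue(item);
        return true;
    }

    public bool TryDequeue(out T item) => _queue.TryDequeue(out item);
}
```

Rejecting early keeps latency bounded and lets KEDA's queue-length scaling absorb the spike instead of the container's memory.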

The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.





Code License: All code examples are released under the MIT License. Github repo.

Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.

All textual explanations, original diagrams, and illustrations are the intellectual property of the author. To support the maintenance of this site via AdSense, please read this content exclusively online. Copying, redistribution, or reproduction is strictly prohibited.