
Chapter 20: Orchestrating Agent Swarms: High-Throughput Inference with KEDA

Theoretical Foundations

The theoretical foundation of deploying autonomous agent swarms within a Kubernetes environment rests on a fundamental paradigm shift: moving from monolithic, synchronous request-response cycles to distributed, asynchronous, event-driven architectures. In the context of high-throughput AI inference, this shift is not merely an optimization; it is a prerequisite for scalability and resilience. To understand this, we must dissect the architectural components that enable agent swarms to function as cohesive, intelligent systems rather than disjointed microservices.

The Swarm as a Distributed Neural Network

Imagine a biological brain. It is not a single, massive neuron processing every thought sequentially. Instead, it is a vast network of interconnected neurons firing in parallel, passing signals through synapses, and coordinating to produce complex behaviors. An autonomous agent swarm operating in Kubernetes mirrors this structure. Each agent is a "neuron"—a discrete, stateless unit of computation. The Kubernetes cluster acts as the "glial support structure," managing the lifecycle and environment of these neurons. The service mesh and message queues form the "synapses," transmitting signals (tasks) and ensuring the network's integrity.

In this analogy, high-throughput inference is the process of the brain reacting to a stimulus. If every sensory input required a single, monolithic neuron to process it, the brain would be incredibly slow and fragile. By distributing the load across millions of neurons, the brain can process vast amounts of information simultaneously and continue functioning even if individual neurons fail. Similarly, a single monolithic AI service handling all inference requests is a bottleneck. It has a single point of failure, scales vertically (which is expensive and has hard limits), and cannot adapt to fluctuating loads. An agent swarm, however, scales horizontally. When inference demand spikes, Kubernetes can spawn more agent pods (neurons), and when demand wanes, it can terminate them, optimizing resource utilization.

The Role of C# Interfaces in Agent Abstraction

Before an agent can be deployed, it must be defined in code. In C#, the most critical tool for building flexible, swappable AI agents is the interface. An interface defines a contract without implementation. In the context of AI, this is paramount because the underlying Large Language Model (LLM) or inference engine might change. You might start with OpenAI's GPT-4, switch to a self-hosted Llama 3 model for cost control, or use a specialized fine-tuned model for a specific task.

Consider the following interface definition. It abstracts the core capability of an AI agent: processing a prompt and generating a response.

using System.Threading.Tasks;

namespace AgentSwarm.Core.Abstractions
{
    /// <summary>
    /// Defines the contract for any AI inference engine.
    /// This allows swapping between different models (OpenAI, Local Llama, etc.)
    /// without changing the agent's orchestration logic.
    /// </summary>
    public interface IInferenceEngine
    {
        /// <summary>
        /// Asynchronously processes a prompt and returns a structured response.
        /// </summary>
        /// <param name="prompt">The input text for the model.</param>
        /// <param name="context">Optional context for the inference session.</param>
        /// <returns>A task representing the asynchronous operation, containing the model's response.</returns>
        Task<string> InferAsync(string prompt, InferenceContext context);
    }

    public class InferenceContext
    {
        public string ModelVersion { get; set; }
        public float Temperature { get; set; } = 0.7f;
        public int MaxTokens { get; set; } = 1024;
    }
}

By programming against the IInferenceEngine interface, the agent's core logic remains decoupled from the specific AI provider. This is a direct application of the Dependency Inversion Principle, the 'D' in SOLID, which we discussed in Book 5, Chapter 3. We inject the concrete implementation (e.g., OpenAiEngine or LlamaEngine) into the agent at runtime, typically via the Dependency Injection (DI) container provided by frameworks like ASP.NET Core. This architectural choice is foundational for building resilient systems: it allows A/B testing, canary deployments of new models, and fallback mechanisms without rewriting the agent's orchestration code.
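As a minimal sketch of that wiring, the host's composition root might select the engine by configuration. (OpenAiEngine and LlamaEngine are hypothetical implementations of IInferenceEngine; the "Inference:Provider" key is illustrative, not a standard setting.)

```csharp
// Program.cs of an ASP.NET Core host -- illustrative only.
// OpenAiEngine and LlamaEngine are hypothetical IInferenceEngine implementations.
var builder = WebApplication.CreateBuilder(args);

if (builder.Configuration["Inference:Provider"] == "OpenAI")
{
    builder.Services.AddSingleton<IInferenceEngine, OpenAiEngine>();
}
else
{
    builder.Services.AddSingleton<IInferenceEngine, LlamaEngine>();
}

var app = builder.Build();
app.Run();
```

Because consumers only depend on IInferenceEngine, swapping the registered implementation requires no change to the agent code.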

Asynchronous Communication via Distributed Task Queues

In a synchronous architecture, an API gateway receives a request, forwards it to an agent, waits for the agent to complete its inference task (which can take seconds), and then returns the response. This model is brittle and inefficient for long-running tasks. If the connection drops, the work is lost. If many requests arrive simultaneously, the agents are overwhelmed, and requests are queued at the API level, leading to timeouts.

The solution is a distributed task queue. This is the central nervous system of the swarm. When a task is generated, it is not sent directly to an agent. Instead, it is published as a message to a queue. Agents, running as independent consumers, pull tasks from this queue and process them. This decouples the task producer from the consumer, providing several key benefits:

  1. Resilience: If an agent crashes while processing a task, the message remains in the queue and can be picked up by another agent.
  2. Load Leveling: Bursts of traffic are absorbed by the queue, and agents process tasks at a steady, sustainable rate.
  3. Scalability: The number of agents can be scaled independently of the number of message producers.

A common implementation in the .NET ecosystem uses a library like MassTransit or Rebus on top of a message broker like RabbitMQ or Azure Service Bus. Conceptually, an agent is simply a message consumer with a specific logic handler.

using System;
using System.Threading.Tasks;
using AgentSwarm.Core.Abstractions;
using MassTransit;

namespace AgentSwarm.Agents.Consumers
{
    // This class represents an agent that listens for tasks on a message queue.
    public class InferenceAgent : IConsumer<InferenceTask>
    {
        private readonly IInferenceEngine _inferenceEngine;

        // The IInferenceEngine is injected, demonstrating the power of the interface defined earlier.
        public InferenceAgent(IInferenceEngine inferenceEngine)
        {
            _inferenceEngine = inferenceEngine;
        }

        public async Task Consume(ConsumeContext<InferenceTask> context)
        {
            var task = context.Message;

            // The agent's core logic: process the task using the injected engine.
            var result = await _inferenceEngine.InferAsync(task.Prompt, task.Context);

            // Publish a result event for other parts of the system to consume.
            await context.Publish(new InferenceResult { TaskId = task.Id, Result = result });
        }
    }

    // A simple message contract for the task queue.
    public record InferenceTask
    {
        public Guid Id { get; init; }
        public string Prompt { get; init; }
        public InferenceContext Context { get; init; }
    }

    public record InferenceResult
    {
        public Guid TaskId { get; init; }
        public string Result { get; init; }
    }
}

This pattern transforms the agent from a passive HTTP endpoint into an active, autonomous worker. It continuously polls the queue, processes tasks as they arrive, and reports its results asynchronously. This is the essence of a "cloud-native" design pattern, where applications are built to leverage the dynamic and distributed nature of the cloud environment.

Kubernetes as the Orchestrator

Kubernetes is the platform that gives this distributed system of agents a home. It manages the lifecycle of the agent pods, ensuring they are running, healthy, and accessible. However, the true power of Kubernetes in this context is its ability to perform dynamic scaling.

In a static deployment, you might run 10 agent pods. If 100 inference tasks arrive simultaneously, 90 of them will be stuck in the queue, waiting for an agent to become free. This leads to high latency. Conversely, if you provision 100 agents to handle peak load, you are paying for idle resources during off-peak hours.

This is where KEDA (Kubernetes Event-driven Autoscaling) comes in. KEDA is a CNCF project that scales Kubernetes deployments based on the number of events in an external system, such as the length of a message queue. It acts as the "thermostat" for our agent swarm.

The Analogy: Think of a busy restaurant kitchen. The head chef (KEDA) doesn't decide how many line cooks (agent pods) to have based on the time of day. Instead, they look at the order rail (the message queue). If the rail is full of tickets, the head chef calls in more line cooks. If the rail is empty, they send cooks home. This is precisely what KEDA does.

KEDA monitors the message queue (e.g., RabbitMQ). It has a "Scaler" that checks the queue length. If the queue length exceeds a defined threshold (e.g., 10 messages per agent), KEDA instructs the Kubernetes Horizontal Pod Autoscaler (HPA) to increase the number of replicas for the agent deployment. When the queue length drops, KEDA scales the replicas back down. This ensures that the number of agents is always proportional to the current workload, providing both high throughput during peak times and cost efficiency during lulls.
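In manifest form, this policy is expressed as a KEDA ScaledObject. The sketch below targets a hypothetical `inference-agent` Deployment and an `inference-tasks` RabbitMQ queue; the queue name, threshold, and authentication reference are illustrative values you would adapt to your cluster.

```yaml
# Illustrative KEDA ScaledObject: scale the agent Deployment on queue depth.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-agent-scaler
spec:
  scaleTargetRef:
    name: inference-agent      # hypothetical Deployment running the agents
  minReplicaCount: 1           # never drop traffic entirely
  maxReplicaCount: 50          # cost ceiling
  triggers:
    - type: rabbitmq
      metadata:
        queueName: inference-tasks
        mode: QueueLength
        value: "10"            # target messages per replica
      authenticationRef:
        name: rabbitmq-auth    # hypothetical TriggerAuthentication with the broker URI
```

With `value: "10"`, a backlog of 120 messages drives the desired replica count to 12; when the queue drains, KEDA scales the Deployment back toward `minReplicaCount`.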

Service Mesh for Secure and Resilient Communication

While the task queue handles asynchronous communication between agents and the outside world, agents often need to communicate with each other synchronously. For example, a "Planner" agent might break down a complex query into sub-tasks and assign them to specialized "Worker" agents. This inter-agent communication must be secure, reliable, and observable.

A service mesh (e.g., Istio, Linkerd) provides this layer of control. It injects a lightweight proxy (a "sidecar") next to each agent pod. All network traffic to and from the agent is routed through this proxy, which is managed centrally by the service mesh control plane.

The service mesh provides several critical capabilities for an agent swarm:

  1. Secure Communication (mTLS): The sidecar proxies can automatically encrypt all traffic between agents using mutual TLS (mTLS). This is crucial when agents are handling sensitive data. The agents themselves don't need to manage certificates; the service mesh handles the encryption and identity verification transparently.
  2. Resilience and Fault Tolerance: The service mesh can implement patterns like retries, circuit breakers, and timeouts at the network level. If a Worker agent is failing, the service mesh can prevent the Planner agent from repeatedly sending requests to it (circuit breaking), preventing cascading failures across the swarm.
  3. Observability: The sidecar proxies capture detailed telemetry data for every request—latency, error rates, and throughput. This data is aggregated and can be visualized in dashboards (e.g., Grafana), giving you a real-time view of the health and performance of the entire swarm. This is far more powerful than logging from individual agents, as it provides a holistic, network-level perspective.
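The mTLS guarantee in point 1 is typically a one-line policy rather than application code. With Istio, for example, a namespace-wide PeerAuthentication resource enforces encrypted agent-to-agent traffic; the `agent-swarm` namespace below is an assumed name for where the swarm runs.

```yaml
# Illustrative Istio policy: require mTLS for all pods in the swarm's namespace.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: agent-swarm   # hypothetical namespace hosting the agent pods
spec:
  mtls:
    mode: STRICT           # reject any plaintext traffic between sidecars
```

The agents themselves remain unchanged; the sidecars negotiate certificates and identities on their behalf.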

The Complete Architectural Flow

Let's synthesize these components into a single, coherent flow for a high-throughput inference request:

  1. Task Ingestion: An external client sends a request to an API Gateway. The gateway validates the request and publishes an InferenceTask message to a RabbitMQ queue. The gateway immediately returns a correlation ID to the client, acknowledging receipt.
  2. Dynamic Scaling: KEDA is constantly monitoring the RabbitMQ queue. It detects a new message and, if necessary, scales up the number of InferenceAgent pods in the Kubernetes cluster.
  3. Task Processing: An available InferenceAgent pod pulls the InferenceTask message from the queue. The agent's Consume method is invoked.
  4. Inference Execution: The agent uses its injected IInferenceEngine (e.g., a call to an OpenAI model or a local Llama instance) to process the prompt. This is the most resource-intensive step.
  5. Inter-Agent Communication (Optional): If the task is complex, the agent might need to query a specialized service (e.g., a database agent for retrieving context). This communication is routed through the service mesh sidecar, which ensures it is secure and resilient.
  6. Result Publication: Once the inference is complete, the agent publishes an InferenceResult message to a results queue or a dedicated event stream (like Azure Event Hubs or Kafka).
  7. Asynchronous Retrieval: The original client can either poll an endpoint with the correlation ID or listen to a WebSocket for the final result, which is delivered once the InferenceResult is processed by a results handler.

This architecture is visualized below using a Graphviz diagram.

The diagram illustrates an asynchronous workflow where a client initiates a request, receives a correlation ID, and either polls an endpoint or listens via WebSocket to retrieve the final InferenceResult processed by a results handler.

Conclusion

The theoretical foundation for containerizing agents and scaling inference is a synthesis of several architectural patterns. It is not about a single technology but about how these technologies—C# interfaces for abstraction, distributed task queues for decoupling, Kubernetes for orchestration, KEDA for event-driven scaling, and service meshes for network resilience—work in concert. By building on these principles, we create an agent swarm that is not only capable of high-throughput inference but is also resilient, observable, and cost-effective, mirroring the robust and adaptive nature of a biological neural network.

Basic Code Example

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

namespace AgentSwarmInference
{
    // ============================================================================
    // 1. CORE DATA MODELS
    // ============================================================================

    /// <summary>
    /// Represents an inference task dispatched to the agent swarm.
    /// In a real-world scenario (e.g., LLM inference), this might contain
    /// tokenized prompts, image tensors, or structured JSON payloads.
    /// </summary>
    public record InferenceTask(Guid Id, string Payload, DateTime CreatedAt);

    /// <summary>
    /// Represents the result of an inference operation.
    /// </summary>
    public record InferenceResult(Guid TaskId, string Output, TimeSpan ProcessingTime);

    // ============================================================================
    // 2. DISTRIBUTED TASK QUEUE (SIMULATED)
    // ============================================================================

    /// <summary>
    /// Simulates a distributed message broker (like RabbitMQ, Kafka, or Azure Service Bus).
    /// In a Kubernetes environment, this would be an external dependency, but for this
    /// self-contained example, we use a thread-safe Channel<T>.
    /// </summary>
    public class DistributedTaskQueue
    {
        // Channel<T> is a modern, high-performance concurrency primitive for passing data
        // between producers and consumers. It handles buffering and backpressure automatically.
        private readonly Channel<InferenceTask> _channel;

        public DistributedTaskQueue()
        {
            // Bounded capacity prevents memory exhaustion during traffic spikes.
            // In a real K8s deployment, this bound would be dictated by the queue's retention policy.
            var options = new BoundedChannelOptions(capacity: 1000)
            {
                FullMode = BoundedChannelFullMode.Wait // Blocks producer if queue is full
            };
            _channel = Channel.CreateBounded<InferenceTask>(options);
        }

        /// <summary>
        /// Dispatches a task to the swarm.
        /// </summary>
        public async Task EnqueueAsync(InferenceTask task, CancellationToken ct = default)
        {
            await _channel.Writer.WriteAsync(task, ct);
        }

        /// <summary>
        /// Reads a task from the swarm (simulating a worker pulling from a queue).
        /// </summary>
        public async Task<InferenceTask> DequeueAsync(CancellationToken ct = default)
        {
            return await _channel.Reader.ReadAsync(ct);
        }
    }

    // ============================================================================
    // 3. AUTONOMOUS AGENT IMPLEMENTATION
    // ============================================================================

    /// <summary>
    /// Represents a single "Pod" or container in the Kubernetes cluster.
    /// It encapsulates the logic to process tasks independently.
    /// </summary>
    public class AutonomousAgent
    {
        private readonly string _agentId;
        private readonly Random _random = new();

        public AutonomousAgent(string agentId)
        {
            _agentId = agentId;
        }

        /// <summary>
        /// Performs the actual inference work. 
        /// In a real scenario, this would call an ONNX runtime or an LLM API.
        /// </summary>
        public async Task<InferenceResult> ProcessTaskAsync(InferenceTask task)
        {
            var startTime = DateTime.UtcNow;

            // Simulate compute-intensive work (e.g., matrix multiplication).
            // We use Task.Delay to simulate network latency or GPU processing time.
            var processingDelay = _random.Next(50, 200); 
            await Task.Delay(processingDelay);

            var duration = DateTime.UtcNow - startTime;

            // Simulate a transformation of the input payload.
            var output = $"[Agent {_agentId}] Processed: {task.Payload.ToUpperInvariant()} (Latency: {processingDelay}ms)";

            return new InferenceResult(task.Id, output, duration);
        }
    }

    // ============================================================================
    // 4. KEDA-STYLE SCALING LOGIC (SIMULATED)
    // ============================================================================

    /// <summary>
    /// Simulates the logic of KEDA (Kubernetes Event-driven Autoscaling).
    /// KEDA calculates the 'Desired Replica Count' based on queue metrics.
    /// Formula: DesiredReplicas = ceil( QueueLength / TargetQueueLength )
    /// </summary>
    public class ScalingController
    {
        private readonly int _targetQueueLengthPerAgent;

        public ScalingController(int targetQueueLengthPerAgent)
        {
            _targetQueueLengthPerAgent = targetQueueLengthPerAgent;
        }

        /// <summary>
        /// Calculates how many agents (pods) should be running.
        /// </summary>
        /// <param name="currentQueueLength">Current messages in the queue.</param>
        /// <param name="currentReplicas">Currently running agents.</param>
        /// <returns>The new target replica count.</returns>
        public int CalculateDesiredReplicas(int currentQueueLength, int currentReplicas)
        {
            // Avoid division by zero
            if (_targetQueueLengthPerAgent <= 0) return currentReplicas;

            // KEDA Formula Implementation
            int desired = (int)Math.Ceiling((double)currentQueueLength / _targetQueueLengthPerAgent);

            // Safety Clamps (Standard in production autoscalers)
            // 1. Minimum replicas (keep the service alive).
            if (desired < 1) desired = 1;

            // 2. Maximum replicas (prevent cost explosion).
            if (desired > 50) desired = 50; 

            return desired;
        }
    }

    // ============================================================================
    // 5. ORCHESTRATOR (THE SWARM MANAGER)
    // ============================================================================

    /// <summary>
    /// Manages the lifecycle of the agent pool.
    /// In Kubernetes, this logic is split:
    /// - KEDA ScaledObject: Calculates the desired count.
    /// - Kubernetes HPA: Adjusts the Deployment replica count.
    /// - K8s Scheduler: Places Pods on Nodes.
    /// </summary>
    public class SwarmOrchestrator
    {
        private readonly DistributedTaskQueue _queue;
        private readonly ScalingController _controller;
        private readonly List<AutonomousAgent> _activeAgents;
        private readonly ConcurrentDictionary<Guid, Task<InferenceResult>> _processingTasks;

        // Cancellation token to stop the swarm
        private CancellationTokenSource _cts;

        public SwarmOrchestrator(DistributedTaskQueue queue, ScalingController controller)
        {
            _queue = queue;
            _controller = controller;
            _activeAgents = new List<AutonomousAgent>();
            _processingTasks = new ConcurrentDictionary<Guid, Task<InferenceResult>>();
            _cts = new CancellationTokenSource();
        }

        /// <summary>
        /// Starts the control loop.
        /// In K8s, this is analogous to the KEDA operator reconciling state.
        /// </summary>
        public async Task StartControlLoopAsync()
        {
            Console.WriteLine("🚀 Starting Swarm Control Loop...");

            // Start the scaling monitor (simulates KEDA metrics server)
            var scalingTask = Task.Run(async () =>
            {
                while (!_cts.Token.IsCancellationRequested)
                {
                    try { await Task.Delay(2000, _cts.Token); } // Check metrics every 2 seconds
                    catch (OperationCanceledException) { break; }

                    // Get current metrics. In a real deployment the queue length is
                    // fetched from the broker's API; for this simulation, the count of
                    // in-flight tasks serves as a proxy for load.

                    int currentReplicas = _activeAgents.Count;
                    int desiredReplicas = _controller.CalculateDesiredReplicas(_processingTasks.Count, currentReplicas);

                    if (desiredReplicas != currentReplicas)
                    {
                        Console.WriteLine($"[KEDA Simulator] Queue Load: {_processingTasks.Count} tasks. " +
                                          $"Current Replicas: {currentReplicas}. " +
                                          $"Scaling to: {desiredReplicas}");

                        AdjustAgentPool(desiredReplicas);
                    }
                }
            }, _cts.Token);

            // Start the task dispatcher (simulates K8s Service Mesh routing)
            var dispatchTask = Task.Run(async () =>
            {
                while (!_cts.Token.IsCancellationRequested)
                {
                    try
                    {
                        // Simulate incoming traffic
                        await Task.Delay(500, _cts.Token);

                        // Only dispatch if we have agents
                        if (_activeAgents.Count > 0)
                        {
                            var task = new InferenceTask(Guid.NewGuid(), $"Request_{DateTime.Now.Ticks}", DateTime.UtcNow);
                            await _queue.EnqueueAsync(task, _cts.Token);
                            Console.WriteLine($"[Dispatcher] Enqueued task {task.Id}");
                        }
                    }
                    catch (OperationCanceledException) { break; }
                }
            }, _cts.Token);

            // Start the worker loop (simulates Pod execution)
            // In a real K8s setup, every Pod runs this loop independently.
            // Here, we simulate multiple agents running on the shared orchestrator context.
            var workerTask = Task.Run(async () =>
            {
                while (!_cts.Token.IsCancellationRequested)
                {
                    // If no agents are available, wait
                    if (_activeAgents.Count == 0)
                    {
                        try { await Task.Delay(100, _cts.Token); }
                        catch (OperationCanceledException) { break; }
                        continue;
                    }

                    try
                    {
                        // Pull task from queue
                        var task = await _queue.DequeueAsync(_cts.Token);

                        // Random selection of an agent (simulates a load balancer's distribution)
                        var agent = _activeAgents.OrderBy(a => Guid.NewGuid()).First();

                        // Start processing asynchronously (parallel execution)
                        var processingTask = agent.ProcessTaskAsync(task);

                        // Track the task
                        _processingTasks.TryAdd(task.Id, processingTask);

                        // Fire and forget cleanup
                        _ = processingTask.ContinueWith(t =>
                        {
                            _processingTasks.TryRemove(task.Id, out _);
                            if (t.IsCompletedSuccessfully)
                            {
                                Console.WriteLine($"[Result] {t.Result.Output}");
                            }
                        });
                    }
                    catch (OperationCanceledException) { break; }
                }
            }, _cts.Token);

            await Task.WhenAll(scalingTask, dispatchTask, workerTask);
        }

        /// <summary>
        /// Adjusts the number of active agents.
        /// In K8s, this triggers a Deployment scale event.
        /// </summary>
        private void AdjustAgentPool(int targetCount)
        {
            int currentCount = _activeAgents.Count;

            if (targetCount > currentCount)
            {
                // Scale Up
                for (int i = currentCount; i < targetCount; i++)
                {
                    var newAgent = new AutonomousAgent($"agent-pod-{i + 1}");
                    _activeAgents.Add(newAgent);
                    Console.WriteLine($"  + Added Agent: agent-pod-{i + 1}");
                }
            }
            else if (targetCount < currentCount)
            {
                // Scale Down (Graceful Shutdown)
                // In K8s, this corresponds to receiving a SIGTERM signal.
                for (int i = currentCount - 1; i >= targetCount; i--)
                {
                    Console.WriteLine($"  - Removing Agent: agent-pod-{i + 1} (Graceful Shutdown)");
                    _activeAgents.RemoveAt(i);
                }
            }
        }

        public void Stop()
        {
            Console.WriteLine("🛑 Stopping Swarm...");
            _cts.Cancel();
        }
    }

    // ============================================================================
    // 6. MAIN PROGRAM EXECUTION
    // ============================================================================

    class Program
    {
        static async Task Main(string[] args)
        {
            // 1. Initialize Infrastructure
            var queue = new DistributedTaskQueue();

            // KEDA Configuration: Target 5 concurrent tasks per Agent Pod
            var kedaConfig = new ScalingController(targetQueueLengthPerAgent: 5);

            var orchestrator = new SwarmOrchestrator(queue, kedaConfig);

            // 2. Run the Swarm
            // We run for a limited time to demonstrate the scaling behavior
            var runTask = orchestrator.StartControlLoopAsync();

            // Let it run for 10 seconds to observe scaling
            await Task.Delay(10000);

            orchestrator.Stop();

            // Await the control loop so cancellation completes cleanly.
            try { await runTask; } catch (OperationCanceledException) { }

            Console.WriteLine("Simulation Complete.");
        }
    }
}

Detailed Line-by-Line Explanation

1. Core Data Models (InferenceTask, InferenceResult)

  • public record: We use C# 9+ record types here. Records are immutable by default, which is critical for distributed systems. When you pass data between agents or queues, you want to ensure that the data isn't modified unexpectedly by another thread. This prevents race conditions common in concurrent inference pipelines.
  • Guid Id: Unique identifiers are essential for tracking tasks in a distributed environment. If a request fails and needs to be retried, the ID ensures idempotency (processing the same task twice doesn't corrupt state).

2. Distributed Task Queue (DistributedTaskQueue)

  • Channel<T>: This is a modern .NET concurrency primitive (introduced in .NET Core 2.1). It replaces older patterns like BlockingCollection or manual lock statements with Monitor.
    • Why? Channels are optimized for high-throughput scenarios. They handle thread safety internally and support async/await natively, which is vital for non-blocking I/O in microservices.
  • BoundedChannelOptions: In a real Kubernetes pod, memory is limited (e.g., 512Mi). If the queue fills up faster than the agents can process, you risk an OutOfMemory exception. Bounding the channel forces the producer (the entry point) to Wait (backpressure) until space is available, stabilizing the system.
  • FullMode = Wait: This mimics TCP flow control. It prevents the "bufferbloat" phenomenon where a fast producer overwhelms a slow consumer.

3. Autonomous Agent (AutonomousAgent)

  • ProcessTaskAsync: This represents the "Inference" step. In a real Cloud-Native AI scenario, this method would likely wrap a call to an ONNX Runtime session or a call to an external LLM API (like Azure OpenAI).
  • Task.Delay: We simulate latency here. Inference isn't instant; it takes time (GPU compute). By simulating this delay, we allow the ScalingController to observe a backlog of tasks and trigger scaling before the queue overflows.

4. Scaling Controller (ScalingController)

  • The Math: Math.Ceiling((double)currentQueueLength / _targetQueueLengthPerAgent). This is the core logic of KEDA's ScaledObject. It answers: "If I want each agent to handle 5 tasks, how many agents do I need for 12 pending tasks?" (Answer: 3).
  • Clamping: We enforce min: 1 and max: 50. In Kubernetes, you would define minReplicas and maxReplicas in your YAML manifest. This prevents scaling to zero (which would drop traffic) or scaling infinitely (which would bankrupt the company).

5. Swarm Orchestrator (SwarmOrchestrator)

  • Control Loop: This is the heart of Kubernetes-style orchestration. Instead of imperative code (do A, then B), we run a continuous loop that compares the desired state (calculated by KEDA) with the actual state (running agents) and reconciles them.
  • Async Processing: Note the line var processingTask = agent.ProcessTaskAsync(task);. We do not await this immediately inside the loop. If we did, the orchestrator would process tasks one by one. By firing and forgetting (and tracking the task in _processingTasks), we simulate parallel processing across multiple agents.
  • Graceful Shutdown: The AdjustAgentPool method handles scale-down. In Kubernetes, when a Deployment scales down, it sends a SIGTERM signal to the Pod. The Pod has a grace period (default 30s) to finish current requests. Our code simulates this by simply removing the agent from the list, but in a real app, you would hook into AppDomain.CurrentDomain.ProcessExit or use IHostApplicationLifetime.

6. Execution Flow

  • The Main method starts the orchestrator.
  • The StartControlLoopAsync launches three concurrent tasks:
    1. Scaling Monitor: Checks the load every 2 seconds.
    2. Dispatcher: Injects fake traffic every 500ms.
    3. Worker: Pulls tasks and assigns them to agents.
  • This separation of concerns mimics the separation of the Kubernetes Control Plane (Scaling) and the Data Plane (Inference).

Visualizing the Architecture

The following diagram illustrates the flow of data and control signals within the simulated environment.

This diagram illustrates the separation of concerns between the Kubernetes Control Plane (Scaling) and the Data Plane (Inference), showing how control signals manage resources independently from the flow of data during model execution.

Common Pitfalls

  1. Blocking the Control Loop

    • Mistake: Awaiting a long-running inference task directly inside the main orchestrator loop (e.g., await agent.ProcessTaskAsync(task)).
    • Consequence: The orchestrator cannot accept new requests or check scaling metrics while waiting for the current task to finish. This effectively turns your parallel swarm into a single-threaded pipeline, destroying throughput.
    • Fix: Always await tasks in parallel or use Task.WhenAll. In the example, we store the Task in a dictionary to track completion without blocking the dispatcher.
  2. Ignoring Graceful Shutdown

    • Mistake: Immediately terminating agents when scaling down.
    • Consequence: If an agent is writing a result to a database or updating a state store, a hard kill leaves data in an inconsistent state.
    • Fix: In Kubernetes, configure terminationGracePeriodSeconds and handle CancellationToken in your C# code to stop accepting new work but finish current work before exiting.
  3. Over-Scaling (The "Thundering Herd")

    • Mistake: Setting the targetQueueLengthPerAgent too low (e.g., 1).
    • Consequence: A small burst of traffic causes KEDA to scale from 1 to 50 pods instantly. All 50 pods start up, consume resources, and then sit idle once the burst passes. This wastes money and stresses the cluster.
    • Fix: Use a higher target value (like 5 or 10) and configure stabilization windows in your KEDA ScaledObject to prevent rapid oscillation (flapping).
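Pitfall 2's fix can be sketched as a hosted worker built on the .NET generic host, which maps SIGTERM onto the `stoppingToken`. This is a sketch under assumptions: `DequeueNextAsync` is a hypothetical stand-in for whatever queue client the agent uses, and `IInferenceEngine`, `InferenceTask`, and `InferenceContext` are the types defined earlier in the chapter.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;

// Sketch only: stop ACCEPTING work on SIGTERM, but finish in-flight work
// within Kubernetes' terminationGracePeriodSeconds.
public class AgentWorker : BackgroundService
{
    private readonly IInferenceEngine _engine;

    public AgentWorker(IInferenceEngine engine) => _engine = engine;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        // The generic host signals stoppingToken when the pod receives SIGTERM.
        while (!stoppingToken.IsCancellationRequested)
        {
            InferenceTask task;
            try { task = await DequeueNextAsync(stoppingToken); }
            catch (OperationCanceledException) { break; } // shutdown requested

            // Deliberately NOT passed the token: let the current inference
            // run to completion before the process exits.
            await _engine.InferAsync(task.Prompt, task.Context);
        }
    }

    // Hypothetical queue client call; a real implementation would pull from
    // RabbitMQ, Azure Service Bus, etc.
    private Task<InferenceTask> DequeueNextAsync(CancellationToken ct)
        => throw new NotImplementedException();
}
```

The key design choice is which awaits observe the token: the dequeue does (so no new work starts), while the inference call does not (so accepted work completes).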

The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.





Code License: All code examples are released under the MIT License. Github repo.

Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.

All textual explanations, original diagrams, and illustrations are the intellectual property of the author. Copying, redistribution, or reproduction is strictly prohibited.