
Chapter 2: Your First Agent: Containerizing a Simple C# Microservice

Theoretical Foundations

The theoretical foundation for containerizing AI agents as microservices rests on a fundamental shift in perspective: moving from monolithic, stateful AI applications to ephemeral, stateless, and independently scalable computational units. This paradigm is not merely an operational convenience; it is an architectural necessity for building robust, multi-agent systems that can handle the unpredictable, bursty nature of generative AI workloads. To understand this, we must dissect the lifecycle of an AI agent, the nature of inference, and the constraints imposed by modern hardware.

The Agent as a Stateless Computational Process

At its core, an AI agent—whether a simple chatbot or a complex reasoning engine—can be modeled as a stateless function. It accepts a context (a prompt, a set of tools, conversation history) and returns a response (text, a tool call, a structured object). The critical word here is stateless. While a conversation has state (history), the agent's processing logic itself should not hold persistent state between requests. This is the first principle of microservice design applied to AI.

Analogy: The Specialized Kitchen Station

Imagine a high-end restaurant kitchen. A monolithic AI application is like a single chef who tries to do everything: chop vegetables, sear the steak, and plate the dessert. This chef is slow, hard to scale, and if they fall ill, the entire kitchen stops. A microservices architecture, however, is a series of specialized stations: a vegetable station, a grill station, a pastry station. Each station is an independent unit. If the grill station gets overwhelmed with orders, you can add more grill chefs (scale out) without affecting the vegetable station. Each station is stateless; it doesn't remember the previous order, it just executes the current one perfectly. This is exactly how we treat an AI agent: a specialized station that takes an order (prompt) and produces a result (inference).

Containerization: The Immutable Artifact

The "container" in containerized AI is the standardized packaging unit for our specialized kitchen station. It encapsulates the agent's logic, its dependencies (like the PyTorch or TensorFlow runtime, or in our case, the .NET runtime and ML.NET or ONNX Runtime), and its configuration into a single, immutable artifact.

Why is this critical for AI?

  1. Dependency Hell: AI frameworks evolve rapidly. One agent might require a specific version of CUDA for GPU acceleration, while another might be optimized for CPU-only inference using a different version of the ONNX runtime. Containers isolate these environments, preventing conflicts.
  2. Reproducibility: In scientific computing and AI, reproducibility is paramount. A container ensures that the agent runs identically on a developer's laptop, a staging server, and a production Kubernetes cluster. This eliminates the "it works on my machine" problem.
  3. Portability: The container abstracts away the underlying host OS and hardware specifics. This is crucial for hybrid cloud strategies where you might want to run lightweight CPU-bound agents on-premise and heavy GPU-bound agents in the cloud.

The .NET Context: In C#, we leverage Dockerfile to define this immutable artifact. The base image might be mcr.microsoft.com/dotnet/aspnet:8.0 for a web API agent or a custom image pre-loaded with CUDA drivers and the ONNX Runtime for GPU-accelerated inference. The key is that the agent's code is compiled into a self-contained executable, bundled, and deployed as this atomic unit.

Orchestration and the Kubernetes Control Plane

Once we have containerized agents, we need a way to manage their lifecycle, networking, and scaling. This is the role of an orchestrator, and Kubernetes is the de facto standard. Kubernetes provides a declarative API where we define the desired state of our system, and the control plane works tirelessly to make the reality match that state.

Analogy: The Air Traffic Control System

Kubernetes is the air traffic control (ATC) for our containerized agents. We don't tell individual planes (containers) where to fly. Instead, we file a flight plan (a Kubernetes manifest). The ATC (Kubernetes control plane) ensures that:

  • There are always enough planes of a certain type running (ReplicaSets).
  • Planes are routed to the correct runway (Service discovery).
  • If a plane crashes, a new one is automatically dispatched (Self-healing).
  • During peak travel times, more planes are added to the sky (Horizontal Pod Autoscaling).
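To make the flight-plan analogy concrete, here is a minimal sketch of a Kubernetes Deployment manifest for an agent service. The image name, labels, and replica count are illustrative assumptions, not values from a real cluster:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: greeting-agent
spec:
  replicas: 3                 # desired state: "keep three planes in the air"
  selector:
    matchLabels:
      app: greeting-agent
  template:
    metadata:
      labels:
        app: greeting-agent
    spec:
      containers:
        - name: agent
          image: myregistry/greeting-agent:1.0   # hypothetical image name
          ports:
            - containerPort: 80
```

Applying this manifest (`kubectl apply -f deployment.yaml`) declares the desired state; the control plane then creates, replaces, and reschedules pods as needed to maintain it.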

Multi-Agent Workflows: The Service Mesh

When we have multiple agents—e.g., a "Router Agent" that decides which specialist agent to call, a "Retrieval Agent" for fetching data, and a "Generation Agent" for synthesizing the final answer—we create a distributed system. The communication between these agents is the nervous system of the application.

A Service Mesh (like Istio or Linkerd) is the dedicated infrastructure layer for this communication. It handles service discovery, load balancing, retries, and circuit breaking without polluting the agent's business logic.

Why is this non-negotiable for AI? AI agents are notoriously flaky. An LLM might hallucinate, a network call to a vector database might time out, or a GPU might be under heavy load. A service mesh provides the resilience patterns needed:

  • Retries with Exponential Backoff: If an agent fails to respond, the mesh can retry the request a few times, waiting longer between each attempt.
  • Circuit Breakers: If an agent is consistently failing, the mesh can "trip the circuit" and stop sending traffic to it, preventing a cascade failure across the entire system.
  • Traffic Splitting: We can deploy a new version of an agent and gradually shift traffic to it (canary deployment) to test its performance and accuracy in production without risking the entire workflow.
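As a sketch of how these resilience policies are declared rather than coded, here is what a retry policy could look like in an Istio VirtualService. The service name is a placeholder, and the specific thresholds are illustrative assumptions:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: generation-agent
spec:
  hosts:
    - generation-agent          # hypothetical in-mesh service name
  http:
    - route:
        - destination:
            host: generation-agent
      retries:
        attempts: 3             # retry a failed inference call up to 3 times
        perTryTimeout: 2s       # each attempt gets its own deadline
        retryOn: 5xx,reset,connect-failure
```

Note that the agent's C# code contains none of this logic; the sidecar proxy enforces it transparently, which is exactly the separation of concerns described above.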

Dynamic Scaling: The Art of Resource Management

AI inference is resource-intensive and highly variable. A request to a large language model can consume gigabytes of GPU memory and take seconds to complete, while a simple classification task might finish in microseconds on a CPU. Static provisioning is inefficient and costly.

Dynamic Scaling is the ability to adjust the number of running agent instances based on real-time demand and resource utilization. This is where the concept of Horizontal Pod Autoscaling (HPA) in Kubernetes becomes critical.

The Scaling Triggers:

  1. CPU/Memory Utilization: The classic metrics. If an agent's CPU usage exceeds 80%, Kubernetes spins up more replicas.
  2. Custom Metrics (The AI-Specific Case): For AI, we often need more sophisticated metrics. We might scale based on:
    • GPU VRAM Utilization: A direct measure of how much memory the model is using.
    • Inference Latency (P99): If the 99th percentile latency of inference requests crosses a threshold (e.g., 500ms), we scale out to distribute the load.
    • Queue Length: The number of pending requests waiting for an agent to be free.
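A minimal HPA manifest using the classic CPU trigger might look like the following sketch (the target Deployment name and the replica bounds are illustrative; custom-metric triggers such as queue length require a metrics adapter and are configured analogously under `metrics`):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: generation-agent-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: generation-agent     # hypothetical Deployment to scale
  minReplicas: 2               # keep warm replicas to soften cold starts
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80   # scale out above 80% average CPU
```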

The "What If" Scenario: The Cold Start Problem

A critical edge case in AI scaling is the "cold start." Loading a multi-gigabyte model into GPU memory can take 30-60 seconds. If we scale from 0 to 1 replica on a sudden traffic spike, the first few requests will time out. The solution involves pre-warming or over-provisioning (keeping a minimum number of replicas always ready) and using readiness probes in Kubernetes to ensure traffic is only sent to a pod once the model is fully loaded and ready to infer.
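A readiness probe for a slow-loading model could be sketched as the following container fragment. The `/healthz/ready` endpoint is an assumption: the agent itself would need to expose a route that returns HTTP 200 only after the model is loaded:

```yaml
containers:
  - name: agent
    image: myregistry/llm-agent:1.0     # hypothetical image
    readinessProbe:
      httpGet:
        path: /healthz/ready            # assumed endpoint: 200 only after model load
        port: 80
      initialDelaySeconds: 30           # allow time for a multi-gigabyte model load
      periodSeconds: 10
      failureThreshold: 6               # tolerate up to ~60s of additional warm-up
```

Until the probe succeeds, Kubernetes keeps the pod out of the Service's endpoint list, so no traffic reaches a half-loaded model.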

Architectural Patterns: The Sidecar and the Init Container

In Kubernetes, we can use specific patterns to enhance our AI agents.

  • The Sidecar Pattern: Imagine a small, secondary container that runs alongside your main agent container in the same Pod (the smallest deployable unit in Kubernetes). This sidecar can handle tasks like:

    • Metrics Collection: Scraping inference latency and throughput from the agent's logs and exporting them to Prometheus.
    • Model Warm-up: A sidecar could periodically send "ping" requests to the main agent to keep the model loaded in GPU memory, preventing it from being swapped out.
    • Security: A sidecar can handle mutual TLS (mTLS) encryption for all traffic leaving the agent, ensuring secure communication between microservices.
  • The Init Container Pattern: An init container runs before the main application container starts. For an AI agent, this is the perfect place to perform one-time setup tasks, such as:

    • Downloading the model weights from a remote storage (e.g., Azure Blob Storage, S3) if they aren't already present.
    • Pre-processing and caching data.
    • Validating that the required GPU drivers are present and functional.
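The model-download case can be sketched as a pod spec fragment in which an init container populates a shared volume before the agent starts. The image names and the `download-model.sh` script are placeholders for whatever fetch logic your environment uses:

```yaml
spec:
  initContainers:
    - name: model-downloader
      image: myregistry/model-fetcher:1.0        # hypothetical downloader image
      command: ["sh", "-c", "download-model.sh /models"]   # placeholder script
      volumeMounts:
        - name: model-cache
          mountPath: /models
  containers:
    - name: agent
      image: myregistry/llm-agent:1.0            # hypothetical agent image
      volumeMounts:
        - name: model-cache                      # agent reads weights from here
          mountPath: /models
  volumes:
    - name: model-cache
      emptyDir: {}
```

Because init containers must exit successfully before the main container starts, the agent can assume at startup that the weights are already on disk.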

The Role of C# and Modern .NET in this Ecosystem

While the orchestration layer is language-agnostic, the choice of C# for building these agents is strategic, especially when leveraging modern features.

Interfaces for Abstraction and Swapping: Interfaces are the cornerstone of a flexible agent architecture. An IAgent interface allows you to decouple the agent's contract from its implementation. This is vital for swapping between different AI backends.

// The contract for any agent in our system
public interface IAgent
{
    Task<AgentResponse> ProcessAsync(AgentRequest request, CancellationToken cancellationToken);
}

// A concrete implementation for an OpenAI-based agent
public class OpenAIAgent : IAgent
{
    private readonly IOpenAIClient _client;
    // ... constructor injection

    public Task<AgentResponse> ProcessAsync(AgentRequest request, CancellationToken cancellationToken)
    {
        // Logic to call the OpenAI API goes here (elided in this sketch)
        throw new NotImplementedException();
    }
}

// A concrete implementation for a local ONNX-based agent
public class OnnxAgent : IAgent
{
    private readonly InferenceSession _session;
    // ... constructor injection

    public Task<AgentResponse> ProcessAsync(AgentRequest request, CancellationToken cancellationToken)
    {
        // Logic for local ONNX inference goes here (elided in this sketch)
        throw new NotImplementedException();
    }
}
This pattern allows a dependency injection container to resolve the correct implementation at runtime, enabling A/B testing or region-specific deployments (e.g., using a local model in regions with strict data sovereignty laws).
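As a minimal sketch of how such a swap could be wired in the composition root, the implementation can be chosen from configuration at startup. The `Agent:UseLocalInference` key, the endpoint route, and the types `IAgent`, `AgentRequest`, `OnnxAgent`, and `OpenAIAgent` are the chapter's illustrative names, not a real library's API:

```csharp
// Program.cs — choose the IAgent implementation at startup.
// "Agent:UseLocalInference" is a hypothetical configuration key.
var builder = WebApplication.CreateBuilder(args);

if (builder.Configuration.GetValue<bool>("Agent:UseLocalInference"))
{
    builder.Services.AddSingleton<IAgent, OnnxAgent>();    // local ONNX backend
}
else
{
    builder.Services.AddSingleton<IAgent, OpenAIAgent>();  // hosted API backend
}

var app = builder.Build();

// Endpoints depend only on IAgent, never on a concrete backend
app.MapPost("/api/agent", (AgentRequest request, IAgent agent, CancellationToken ct) =>
    agent.ProcessAsync(request, ct));

app.Run();
```

Because the endpoint resolves `IAgent` through the container, switching backends (or running an A/B split) is a configuration change, not a code change.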

Reference to Previous Concepts: This builds upon the foundational concepts of dependency injection and inversion of control discussed in Book 2: Architectural Patterns for Scalable Systems. We are applying those same enterprise-grade principles to the domain of AI, ensuring that our agents are not tightly coupled to specific AI providers or hardware, which is a common pitfall in monolithic AI projects.

Async/Await and Cancellation Tokens: Inference is an I/O-bound and often long-running operation. Modern C#'s async/await pattern is essential for building non-blocking agents that can handle thousands of concurrent requests without exhausting the thread pool. Furthermore, CancellationToken is a first-class citizen in .NET and is critical for managing request lifecycles. If a client disconnects while waiting for a long inference, the token can be triggered to cancel the underlying GPU operation, freeing up valuable resources immediately.

public async Task<InferenceResult> GenerateTextAsync(string prompt, CancellationToken ct)
{
    // The cancellation token is passed down to the underlying HTTP call or native library call
    // If the request is cancelled, the GPU work is aborted, and resources are released.
    return await _inferenceEngine.RunAsync(prompt, ct);
}
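In a minimal API, ASP.NET Core supplies this token automatically: a `CancellationToken` parameter on an endpoint handler is bound to `HttpContext.RequestAborted`, so a client disconnect flows straight into the inference call. A sketch, assuming a hypothetical injected `IInferenceEngine` service and route:

```csharp
app.MapGet("/api/generate", async (string prompt, IInferenceEngine engine, CancellationToken ct) =>
{
    // ct is bound to HttpContext.RequestAborted: it is signalled when the client disconnects,
    // which propagates down and aborts the in-flight inference work
    var result = await engine.RunAsync(prompt, ct);
    return Results.Ok(result);
});
```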

Visualization of the Architecture

The following diagram illustrates the flow of a request through a containerized, multi-agent system, highlighting the separation of concerns between the application logic and the orchestration infrastructure.

Conclusion: The Synthesis

In summary, the theoretical foundation for containerizing AI agents is a synthesis of established software engineering principles—statelessness, immutability, and loose coupling—adapted to the unique constraints of AI workloads. By treating agents as independent, containerized microservices, we unlock the ability to:

  • Scale with Precision: Allocate GPU resources only when and where they are needed.
  • Build with Resilience: Isolate failures and prevent them from cascading through a multi-agent workflow.
  • Innovate with Agility: Swap models, frameworks, and hardware backends without rewriting the entire application.

This architectural approach transforms AI from a monolithic, fragile black box into a dynamic, resilient, and scalable distributed system, ready for the demands of modern enterprise applications. The use of modern C# features like interfaces, async/await, and robust dependency injection provides the language-level support needed to implement these patterns cleanly and effectively.

Basic Code Example

We are creating a simple "Hello World" example of a containerized AI agent as a microservice. The example will be a basic HTTP service that returns a greeting. We'll use ASP.NET Core for the web API and Docker for containerization.

The real-world context: Imagine you are building a chatbot for an e-commerce site. One of the agents in the system is the "GreetingAgent" that welcomes users. We want to deploy this agent as an independent microservice so it can be scaled and managed separately.

We'll create:

  1. A simple ASP.NET Core Web API with one endpoint that returns a greeting.
  2. A Dockerfile to containerize the application.
  3. A brief explanation of how to build and run the container.

Note: We are not including Kubernetes or service mesh in this basic example, but we'll mention how they would fit in.

Let's write the code.

First, the C# code for the ASP.NET Core Web API.

using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using System;
using System.Collections.Generic;

namespace GreetingAgentMicroservice
{
    public class Program
    {
        public static void Main(string[] args)
        {
            var builder = WebApplication.CreateBuilder(args);

            // Register the greeting service for dependency injection
            builder.Services.AddSingleton<IGreetingService, GreetingService>();

            var app = builder.Build();

            // Define a single endpoint for our greeting agent
            app.MapGet("/api/greet/{userName}", (string userName, IGreetingService greetingService) =>
            {
                var greeting = greetingService.GenerateGreeting(userName);
                return Results.Ok(new { Message = greeting, Timestamp = DateTime.UtcNow });
            });

            // Show detailed error pages only during development
            if (app.Environment.IsDevelopment())
            {
                app.UseDeveloperExceptionPage();
            }

            // Run the application
            app.Run();
        }
    }

    // Interface for the greeting service (dependency inversion)
    public interface IGreetingService
    {
        string GenerateGreeting(string userName);
    }

    // Concrete implementation of the greeting service
    public class GreetingService : IGreetingService
    {
        private readonly List<string> _greetingTemplates = new()
        {
            "Hello, {0}! Welcome to our AI-powered platform.",
            "Hi {0}, great to see you today!",
            "Greetings, {0}! How can our AI assist you?"
        };

        public string GenerateGreeting(string userName)
        {
            if (string.IsNullOrWhiteSpace(userName))
            {
                throw new ArgumentException("User name cannot be empty", nameof(userName));
            }

            // Random.Shared is thread-safe; creating a new Random per call can repeat values under load
            var template = _greetingTemplates[Random.Shared.Next(_greetingTemplates.Count)];

            // Format the greeting with the user's name
            return string.Format(template, userName);
        }
    }
}

Dockerfile for Containerization

# Use the official .NET 8 runtime image for production
FROM mcr.microsoft.com/dotnet/aspnet:8.0 AS base
WORKDIR /app
# The .NET 8 images listen on port 8080 by default; bind Kestrel to port 80 instead
ENV ASPNETCORE_HTTP_PORTS=80
EXPOSE 80

# Use the .NET 8 SDK image for building
FROM mcr.microsoft.com/dotnet/sdk:8.0 AS build
WORKDIR /src
COPY ["GreetingAgentMicroservice.csproj", "."]
RUN dotnet restore "GreetingAgentMicroservice.csproj"
COPY . .
RUN dotnet build "GreetingAgentMicroservice.csproj" -c Release -o /app/build

FROM build AS publish
RUN dotnet publish "GreetingAgentMicroservice.csproj" -c Release -o /app/publish

# Final stage: create the production image
FROM base AS final
WORKDIR /app
COPY --from=publish /app/publish .
ENTRYPOINT ["dotnet", "GreetingAgentMicroservice.dll"]
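Assuming the Dockerfile sits in the same directory as GreetingAgentMicroservice.csproj, the image can be built and run locally as follows (the tag and host port are just examples; these commands require a running Docker daemon):

```shell
# Build the image from the directory containing the Dockerfile
docker build -t greeting-agent:latest .

# Run it, mapping host port 5000 to container port 80
docker run --rm -p 5000:80 greeting-agent:latest

# In another terminal, call the endpoint
curl http://localhost:5000/api/greet/Alice
```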

Docker Compose File (Optional, for local testing)

version: '3.8'
services:
  greeting-agent:
    build: .
    ports:
      - "5000:80"
    environment:
      - ASPNETCORE_ENVIRONMENT=Development
    restart: unless-stopped

Step-by-Step Explanation

  1. ASP.NET Core Application Setup:
    • We create a minimal API using WebApplication.CreateBuilder(args). This sets up the host, configuration, and dependency injection container.
    • The GreetingService is registered as a singleton in the dependency injection container. This means one instance will be shared across all requests, which is efficient for stateless services.
    • We define a single HTTP GET endpoint at /api/greet/{userName} that takes a path parameter for the user's name.
  2. Business Logic Implementation:
    • The IGreetingService interface defines a contract for generating greetings. This allows for easy swapping of implementations (e.g., for testing or different environments).
    • The GreetingService implements this interface with a simple random selection from a list of greeting templates. It also includes basic validation (checking for empty user names).
  3. Endpoint Logic:
    • The endpoint uses dependency injection to get an instance of IGreetingService. It calls GenerateGreeting with the provided user name and returns a JSON response containing the greeting message and a timestamp.
    • The response is wrapped in an Ok result (HTTP 200) with a structured JSON object.
  4. Containerization with Docker:
    • Base Image: We use mcr.microsoft.com/dotnet/aspnet:8.0 for the runtime, which is optimized for running ASP.NET Core applications in production.
    • Build Stage: We use the SDK image to restore dependencies, build the project, and publish the application. This stage includes all necessary build tools.
    • Final Stage: We copy the published output from the build stage into a clean runtime image. This keeps the final image small and secure (no build tools).
    • The ENTRYPOINT specifies the command to run the application.
  5. Docker Compose (Optional):
    • This file simplifies local testing by defining a service that builds the Dockerfile and maps port 5000 on the host to port 80 in the container.
    • The environment variable ASPNETCORE_ENVIRONMENT is set to Development to enable developer-friendly error pages.

Common Pitfalls

  1. Missing Dependency Injection Registration: If you forget to register GreetingService with builder.Services.AddSingleton<IGreetingService, GreetingService>(), the application will throw an exception when trying to resolve the service. Always ensure your dependencies are registered.
  2. Incorrect Dockerfile Paths: The COPY commands in the Dockerfile rely on the project file name matching the directory structure. If your project file is named differently, update the paths accordingly. Also, ensure the Dockerfile is in the same directory as your project file.
  3. Port Conflicts in Containerization: When running the container, if the host port you map is already in use, you'll get an error. Use -p <host_port>:80 to map to a different host port (e.g., -p 5000:80).
  4. State Management in Services: The GreetingService is registered as a singleton, which means it's shared across all requests. If you add any state (like a counter), it will be shared, which could lead to race conditions in a multi-threaded environment. For stateless services, this is fine, but be cautious.
  5. Security in Production: This example uses HTTP and no authentication. In a real-world scenario, you would need to add HTTPS, authentication, and authorization. Also, consider rate limiting to prevent abuse.

Real-World Context

In an e-commerce platform, the GreetingAgent might be one of many microservices. It could be deployed to a Kubernetes cluster where it scales based on incoming traffic. Other agents (like ProductRecommendationAgent or OrderProcessingAgent) would communicate with this service via HTTP or gRPC. By containerizing each agent, you can update, scale, and manage them independently, improving resilience and development velocity.

Visualization of Microservice Architecture

Diagram: microservice_architecture

This diagram shows how the user's request flows through an API gateway to the GreetingAgent microservice, which internally uses the GreetingService. The entire agent is containerized, allowing it to be deployed and scaled independently.

The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.





Code License: All code examples are released under the MIT License. GitHub repo.

Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.

All textual explanations, original diagrams, and illustrations are the intellectual property of the author. To support the maintenance of this site via AdSense, please read this content exclusively online. Copying, redistribution, or reproduction is strictly prohibited.