
Chapter 26: Capstone Project: Building a Distributed Retrieval-Augmented Generation (RAG) Pipeline

Introduction

Throughout this book, we have explored the theoretical foundations and practical patterns for building containerized AI agents. We have discussed everything from dependency injection and asynchronous communication to Kubernetes orchestration and service mesh resilience. In this final chapter, we will synthesize these concepts into a complete, end-to-end project: a distributed Retrieval-Augmented Generation (RAG) pipeline.

A RAG system is a perfect example of a modern AI application. It enhances a Large Language Model (LLM) by providing it with relevant, up-to-date information from an external knowledge base. This mitigates model hallucinations and allows the agent to answer questions about specific, private data.

Our RAG pipeline will be a multi-agent system composed of four distinct microservices, demonstrating a realistic hybrid C# and Python environment.

The Goal: Build a scalable, containerized system that can answer a user's question by:

  1. Converting the question into a vector embedding.
  2. Searching a vector database for relevant document chunks.
  3. Passing the question and the retrieved context to an LLM to generate a final answer.

This project will serve as a practical demonstration of the principles of decoupling, state management, service discovery, and orchestration that are central to Cloud-Native AI.

Architecture Overview

Our system will consist of four microservices orchestrated by Docker Compose for local development and Kubernetes for production.

  1. Orchestrator Service (C# - ASP.NET Core): The user-facing API gateway. It receives the user's query and orchestrates the calls to the other services in the pipeline. This represents the "Control Plane" logic discussed in previous chapters.
  2. Embedding Service (Python - FastAPI): A stateless worker responsible for converting text into vector embeddings using a transformer model. This is a classic GPU-accelerated workload.
  3. Vector Store Service (Python - FastAPI with ChromaDB): A stateful service that stores and searches vector embeddings. It uses a persistent volume to store its database.
  4. Generation Service (Python - FastAPI): A stateless worker that takes a prompt (user query + context) and generates a final text response using an LLM. This is another GPU-accelerated workload.

The following diagram illustrates the flow of a request through the system:

Diagram: Architecture

Visualizing the Architecture: The RAG Pipeline Diagram

The diagram above represents the logical topology and the data flow of our distributed application. It visualizes how a user request traverses the Kubernetes cluster, transforming from raw text into a sophisticated, context-aware AI response.

We can break this diagram down into three specific layers:

1. The Boundary Layers

  • The User: Represents any external client (a web browser, a mobile app, or a curl command) initiating a request.
  • The Kubernetes Cluster: The dotted line represents the trust boundary. Everything inside this box is running within our managed container environment.
  • Ingress Controller: The entry point. It acts as the traffic warden, accepting external HTTP traffic on port 80 or 443 and routing it to the internal Orchestrator Service. This abstracts the internal network topology from the outside world.

2. The Control Plane (The Brain)

  • Orchestrator Service (C#): Located at the center of the flow, this is the only service that communicates with all other components. It implements the "Mediator" pattern. It does not perform any heavy AI computation; its job is to manage the workflow logic, handle errors, and format data for the specialized workers.

3. The Data Plane (The Muscle)

These are the specialized workers, grouped together because they represent the computationally intensive or stateful parts of the system.

  • Embedding Service: A stateless, GPU-accelerated worker. It converts human language into mathematical vectors (embeddings).
  • Vector Store Service: The system's "Long-Term Memory." It stores embeddings and performs similarity searches.
  • PVC (Persistent Volume Claim): Notice the cylinder connected to the Vector Store. This represents the physical storage on disk. Because the Vector Store is stateful, it must write data here to ensure knowledge persists even if the pod crashes or restarts.
  • Generation Service: A stateless, GPU-accelerated worker running the Large Language Model (LLM). It synthesizes the final answer.
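The "Nearest Neighbor" search performed by the Vector Store can be illustrated without ChromaDB. The sketch below is plain Python with made-up three-dimensional vectors; the real store applies the same cosine-similarity ranking at much higher dimensionality:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(query: list[float],
                      store: dict[str, list[float]],
                      n_results: int = 2) -> list[str]:
    """Return the ids of the n_results stored vectors most similar to the query."""
    ranked = sorted(store,
                    key=lambda doc_id: cosine_similarity(query, store[doc_id]),
                    reverse=True)
    return ranked[:n_results]

# Tiny in-memory "store": three documents, three dimensions, for illustration only.
store = {
    "gpu-doc":  [0.9, 0.1, 0.0],
    "k8s-doc":  [0.1, 0.9, 0.0],
    "misc-doc": [0.0, 0.1, 0.9],
}
print(nearest_neighbors([1.0, 0.0, 0.0], store))  # "gpu-doc" ranks first
```

A query vector pointing in the same direction as a stored vector scores near 1.0, which is why semantically similar texts (which real embedding models place near each other) are retrieved first.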

The Request Lifecycle (Steps 1-9)

The numbered arrows in the diagram correspond to the sequential flow of a single Retrieval-Augmented Generation request:

  1. Ingestion (Steps 1-2): The user sends a JSON payload (POST /query). The Ingress routes this to the Orchestrator.
  2. Vectorization (Steps 3-4): The Orchestrator cannot search the database with raw text. It sends the user's query to the Embedding Service, which returns a float array (the vector).
  3. Retrieval (Steps 5-6): The Orchestrator takes that vector and sends it to the Vector Store. The store performs a "Nearest Neighbor" search to find document chunks that are mathematically similar to the query. It returns this text context to the Orchestrator.
  4. Synthesis (Steps 7-8): The Orchestrator constructs a new prompt that combines the user's original question with the context retrieved in Step 6. It sends this enriched prompt to the Generation Service. The LLM generates the answer.
  5. Completion (Step 9): The Orchestrator streams the final answer back to the user, completing the cycle.
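The lifecycle above can be condensed into a few lines of Python. The sketch below replaces each HTTP call with a local stand-in function; the payload shapes mirror the project's mocked services, but none of this is the real orchestrator code:

```python
# Stand-ins for the three downstream services. In the real system each of
# these is an HTTP POST to a separate pod behind a Kubernetes Service.
def embed(text: str) -> list[float]:
    return [0.1] * 384  # mirrors the mocked /embed endpoint

def search(vector: list[float]) -> list[str]:
    return ["The T4 GPU is a cost-effective choice for inference workloads."]

def generate(prompt: str) -> str:
    return f"[Mocked AI Answer for '{prompt[:50]}...']"

def run_rag_pipeline(query: str) -> str:
    vector = embed(query)                # Steps 3-4: vectorization
    context = "\n".join(search(vector))  # Steps 5-6: retrieval
    prompt = f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"
    return generate(prompt)              # Steps 7-8: synthesis

print(run_rag_pipeline("What is the T4 GPU good for?"))
```

The orchestrator's value is exactly this sequencing logic; swapping a stand-in for a real service changes only the body of one function, not the pipeline.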

This visual representation highlights the decoupling achieved in this chapter. The Orchestrator doesn't know how to calculate a vector or how to run an LLM; it simply knows who to ask. This allows us to scale the Generation Service (the most expensive component) independently of the Vector Store (the memory component).

Component Deep Dive & Code

The project is organized with a separate folder for each service. We present all the files first, then explain each one and how the components interact.

1. Orchestrator Service (C#)

This service is the brain. It's written in C# to leverage its strong typing, performance, and excellent support for building resilient HTTP clients.

File: orchestrator/Program.cs

using System.Net.Http.Json;

var builder = WebApplication.CreateBuilder(args);

// Configure HttpClients for downstream services
// In K8s, these URLs would be service names resolved by DNS.
builder.Services.AddHttpClient("Embedding", client => 
    client.BaseAddress = new Uri(builder.Configuration["Services:Embedding"] ?? "http://localhost:8001"));
builder.Services.AddHttpClient("VectorStore", client => 
    client.BaseAddress = new Uri(builder.Configuration["Services:VectorStore"] ?? "http://localhost:8002"));
builder.Services.AddHttpClient("Generation", client => 
    client.BaseAddress = new Uri(builder.Configuration["Services:Generation"] ?? "http://localhost:8003"));

var app = builder.Build();

// The main orchestration endpoint
app.MapPost("/query", async (QueryRequest request, IHttpClientFactory clientFactory) =>
{
    var logger = app.Logger;

    try
    {
        logger.LogInformation("1. Starting RAG pipeline for query: '{Query}'", request.Query);

        // Step 1: Get query embedding
        var embedClient = clientFactory.CreateClient("Embedding");
        var embedResponse = await embedClient.PostAsJsonAsync("/embed", new { text = request.Query });
        if (!embedResponse.IsSuccessStatusCode) return Results.Problem("Embedding service failed.");
        var queryVector = await embedResponse.Content.ReadFromJsonAsync<EmbeddingResponse>();
        if (queryVector is null) return Results.Problem("Embedding service returned an empty body.");

        // Step 2: Search for context in vector store
        var vectorClient = clientFactory.CreateClient("VectorStore");
        var searchResponse = await vectorClient.PostAsJsonAsync("/search", new { vector = queryVector.Embedding });
        if (!searchResponse.IsSuccessStatusCode) return Results.Problem("Vector store search failed.");
        var searchResults = await searchResponse.Content.ReadFromJsonAsync<SearchResponse>();
        if (searchResults is null) return Results.Problem("Vector store returned an empty body.");
        var context = string.Join("\n", searchResults.Results);

        // Step 3: Generate the final answer
        var generationClient = clientFactory.CreateClient("Generation");
        var finalPrompt = $"Context: {context}\n\nQuestion: {request.Query}\n\nAnswer:";
        var generationResponse = await generationClient.PostAsJsonAsync("/generate", new { prompt = finalPrompt });
        if (!generationResponse.IsSuccessStatusCode) return Results.Problem("Generation service failed.");

        logger.LogInformation("4. RAG pipeline complete.");
        return Results.Stream(await generationResponse.Content.ReadAsStreamAsync());
    }
    catch (Exception ex)
    {
        logger.LogError(ex, "An error occurred during the RAG pipeline.");
        return Results.Problem("An internal error occurred.");
    }
});

// Endpoint to add documents to the knowledge base
app.MapPost("/add-document", async (AddDocumentRequest request, IHttpClientFactory clientFactory) =>
{
    var embedClient = clientFactory.CreateClient("Embedding");
    var vectorClient = clientFactory.CreateClient("VectorStore");

    var embedResponse = await embedClient.PostAsJsonAsync("/embed", new { text = request.Content });
    if (!embedResponse.IsSuccessStatusCode) return Results.Problem("Embedding service failed.");
    var embedding = await embedResponse.Content.ReadFromJsonAsync<EmbeddingResponse>();
    if (embedding is null) return Results.Problem("Embedding service returned an empty body.");

    await vectorClient.PostAsJsonAsync("/add", new { text = request.Content, vector = embedding.Embedding });

    return Results.Ok(new { status = "Document added" });
});


app.Run("http://0.0.0.0:8080");

// DTOs
public record QueryRequest(string Query);
public record AddDocumentRequest(string Content);
public record EmbeddingResponse(float[] Embedding);
public record SearchResponse(string[] Results);

File: orchestrator/Dockerfile

FROM mcr.microsoft.com/dotnet/sdk:8.0 AS build
WORKDIR /src
COPY . .
RUN dotnet publish -c Release -o /app/publish

FROM mcr.microsoft.com/dotnet/aspnet:8.0
WORKDIR /app
COPY --from=build /app/publish .
EXPOSE 8080
ENTRYPOINT ["dotnet", "orchestrator.dll"]

2. AI Services (Python)

These Python services use FastAPI for the API layer and mock the ML logic for simplicity.

File: shared/requirements.txt

fastapi
uvicorn
sentence-transformers
torch
chromadb
numpy

File: embedding-service/app.py

from fastapi import FastAPI
from pydantic import BaseModel
# from sentence_transformers import SentenceTransformer # In a real scenario

app = FastAPI()
# model = SentenceTransformer('all-MiniLM-L6-v2') # In a real scenario

class EmbedRequest(BaseModel):
    text: str

@app.post("/embed")
def embed(request: EmbedRequest):
    # embedding = model.encode(request.text).tolist() # Real logic
    embedding = [0.1] * 384 # Mock embedding
    return {"embedding": embedding}

File: embedding-service/Dockerfile

FROM python:3.11-slim
WORKDIR /app
COPY shared/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY embedding-service/app.py .
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

File: vector-store/app.py

from fastapi import FastAPI
from pydantic import BaseModel
import chromadb
import hashlib

app = FastAPI()
# This will store data in the /app/chroma_db directory inside the container
client = chromadb.PersistentClient(path="/app/chroma_db")
collection = client.get_or_create_collection(name="knowledge_base")

class AddRequest(BaseModel):
    text: str
    vector: list[float]

class SearchRequest(BaseModel):
    vector: list[float]

@app.post("/add")
def add_document(request: AddRequest):
    # Use a deterministic content hash as the ID. Python's built-in hash()
    # is randomized per process, so it would produce different IDs for the
    # same document across restarts and allow duplicates.
    doc_id = hashlib.sha256(request.text.encode()).hexdigest()
    collection.add(
        embeddings=[request.vector],
        documents=[request.text],
        ids=[doc_id]
    )
    return {"status": "added"}

@app.post("/search")
def search(request: SearchRequest):
    results = collection.query(
        query_embeddings=[request.vector],
        n_results=2
    )
    return {"results": results['documents'][0]}
(The Dockerfile for vector-store is similar to embedding-service)

File: generation-service/app.py

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str

@app.post("/generate")
def generate(request: GenerateRequest):
    # Mock LLM response
    return {"response": f"Based on the provided context, the answer to your question is: [Mocked AI Answer for '{request.prompt[:50]}...']"}
(The Dockerfile for generation-service is similar to embedding-service)

Local Orchestration with Docker Compose

This docker-compose.yml file allows you to run the entire distributed system on your local machine.

File: docker-compose.yml

version: '3.8'

services:
  orchestrator:
    build:
      context: .
      dockerfile: orchestrator/Dockerfile
    ports:
      - "8080:8080"
    environment:
      - Services__Embedding=http://embedding-service:8000
      - Services__VectorStore=http://vector-store:8000
      - Services__Generation=http://generation-service:8000
    networks:
      - ai-net
    depends_on:
      - embedding-service
      - vector-store
      - generation-service

  embedding-service:
    build:
      context: .
      dockerfile: embedding-service/Dockerfile
    networks:
      - ai-net

  vector-store:
    build:
      context: .
      dockerfile: vector-store/Dockerfile
    volumes:
      - chroma_data:/app/chroma_db
    networks:
      - ai-net

  generation-service:
    build:
      context: .
      dockerfile: generation-service/Dockerfile
    networks:
      - ai-net

networks:
  ai-net:
    driver: bridge

volumes:
  chroma_data:
    driver: local

To run locally: docker-compose up --build

Production Orchestration with Kubernetes

These manifests deploy our RAG pipeline to a Kubernetes cluster.

File: kubernetes-manifests.yaml

apiVersion: v1
kind: Namespace
metadata:
  name: rag-pipeline

---
# Persistent Storage for the Vector Database
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: chroma-db-pvc
  namespace: rag-pipeline
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi

---
# Vector Store Service (Stateful)
apiVersion: apps/v1
kind: Deployment # Using Deployment for simplicity, StatefulSet is better for production
metadata:
  name: vector-store
  namespace: rag-pipeline
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vector-store
  template:
    metadata:
      labels:
        app: vector-store
    spec:
      containers:
      - name: vector-store
        image: your-registry/vector-store:latest
        ports:
        - containerPort: 8000
        volumeMounts:
        - name: db-data
          mountPath: /app/chroma_db
      volumes:
      - name: db-data
        persistentVolumeClaim:
          claimName: chroma-db-pvc

---
apiVersion: v1
kind: Service
metadata:
  name: vector-store
  namespace: rag-pipeline
spec:
  selector:
    app: vector-store
  ports:
  - port: 8000
    targetPort: 8000

---
# Embedding Service (Stateless)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: embedding-service
  namespace: rag-pipeline
spec:
  replicas: 2
  selector:
    matchLabels:
      app: embedding-service
  template:
    metadata:
      labels:
        app: embedding-service
    spec:
      # This service requires a GPU
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-t4 # Example for GKE
      containers:
      - name: embedding-service
        image: your-registry/embedding-service:latest
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1

---
apiVersion: v1
kind: Service
metadata:
  name: embedding-service
  namespace: rag-pipeline
spec:
  selector:
    app: embedding-service
  ports:
  - port: 8000
    targetPort: 8000

# Generation Service (Stateless)
# (Manifest is very similar to Embedding Service)

---
# Orchestrator Service (C#)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orchestrator
  namespace: rag-pipeline
spec:
  replicas: 2
  selector:
    matchLabels:
      app: orchestrator
  template:
    metadata:
      labels:
        app: orchestrator
    spec:
      containers:
      - name: orchestrator
        image: your-registry/orchestrator:latest
        ports:
        - containerPort: 8080
        env:
        - name: Services__Embedding
          value: "http://embedding-service.rag-pipeline.svc.cluster.local:8000"
        - name: Services__VectorStore
          value: "http://vector-store.rag-pipeline.svc.cluster.local:8000"
        - name: Services__Generation
          value: "http://generation-service.rag-pipeline.svc.cluster.local:8000"

---
apiVersion: v1
kind: Service
metadata:
  name: orchestrator
  namespace: rag-pipeline
spec:
  selector:
    app: orchestrator
  ports:
  - port: 80
    targetPort: 8080

---
# Ingress to expose the Orchestrator
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: rag-ingress
  namespace: rag-pipeline
spec:
  rules:
  - host: rag.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: orchestrator
            port:
              number: 80

Step-by-Step Deployment and Testing

  1. Build & Push Images: Build each service's Docker image and push it to your container registry (e.g., Docker Hub, GCR, ACR).
    docker build -t your-registry/orchestrator:latest -f orchestrator/Dockerfile .
    # Repeat for other services
    docker push your-registry/orchestrator:latest
    
  2. Deploy to Kubernetes:
    kubectl apply -f kubernetes-manifests.yaml
    
  3. Verify Pods:
    kubectl get pods -n rag-pipeline -w
    
  4. Add a Document to the Knowledge Base:
    kubectl port-forward svc/orchestrator -n rag-pipeline 8080:80 &
    curl -X POST http://localhost:8080/add-document -H "Content-Type: application/json" -d '{"content": "The T4 GPU is a cost-effective choice for inference workloads."}'
    
  5. Query the Pipeline:
    curl -X POST http://localhost:8080/query -H "Content-Type: application/json" -d '{"query": "What is the T4 GPU good for?"}'
    
    You should receive a response generated by the LLM, informed by the document you just added.

A Guided Tour of the RAG Pipeline Project Files

To fully understand the capstone project, it is essential to dissect each file and its role within the distributed system. This section provides a comprehensive breakdown of the project's structure, explaining the purpose of each service, the logic within its code, and how it communicates with other components.

Project Directory Structure

The project is organized into a monorepo structure, where each microservice resides in its own directory. This is a common pattern for managing multi-service applications.

rag-pipeline-project/
├── orchestrator/                # C# ASP.NET Core Service (The "Brain")
│   ├── Program.cs
│   └── Dockerfile
├── embedding-service/           # Python FastAPI Service (Stateless GPU Worker)
│   ├── app.py
│   └── Dockerfile
├── vector-store/                # Python FastAPI Service (Stateful Worker)
│   ├── app.py
│   └── Dockerfile
├── generation-service/          # Python FastAPI Service (Stateless GPU Worker)
│   ├── app.py
│   └── Dockerfile
├── shared/                      # Shared dependencies for Python services
│   └── requirements.txt
├── docker-compose.yml           # Orchestration for Local Development
└── kubernetes-manifests.yaml    # Orchestration for Production

1. The Orchestrator Service (C#)

This service is the user-facing entry point and the "control plane" of our RAG pipeline. It doesn't perform any AI inference itself; its sole responsibility is to orchestrate the workflow by calling the specialized Python agents.

File: orchestrator/Program.cs

  • Purpose: This file contains the complete C# ASP.NET Core application. It defines the API endpoints, configures HTTP clients for communicating with downstream services, and implements the core orchestration logic.
  • Key Logic:

    • Dependency Injection: builder.Services.AddHttpClient(...) is used to register and configure HttpClient instances. This is a best practice in .NET for managing HTTP connections efficiently. We give each client a name ("Embedding", "VectorStore", etc.) for easy retrieval.
    • /query Endpoint: This is the heart of the orchestrator. It executes the RAG pipeline step-by-step:
      1. It receives the user's QueryRequest.
      2. It calls the embedding-service to convert the query into a vector.
      3. It calls the vector-store with that vector to search for relevant context.
      4. It constructs a final, context-rich prompt and sends it to the generation-service.
      5. It streams the final response back to the user.
    • /add-document Endpoint: This utility endpoint allows us to populate our knowledge base. It follows a similar pattern of calling the embedding-service first, then the vector-store.
    • Configuration: The URLs for downstream services are read from configuration (builder.Configuration["Services:Embedding"]). This allows us to change the service addresses without recompiling the code, a key principle of cloud-native design.
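The /add-document endpoint shown here stores each document as a single chunk. In a production ingestion path you would normally split long documents into overlapping chunks before embedding them, so retrieval returns focused passages rather than whole files. A minimal sketch of such a splitter (a hypothetical helper, not part of the project code):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap, so a
    sentence cut at one boundary still appears intact in the next chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far each window advances
    return [text[i:i + chunk_size]
            for i in range(0, len(text), step)
            if text[i:i + chunk_size]]

chunks = chunk_text("a" * 500, chunk_size=200, overlap=50)
print(len(chunks))  # 500 chars in windows of 200, advancing 150 at a time -> 4
```

Each resulting chunk would then be posted to /add-document (or directly to the embedding and vector-store services) as its own entry. Real pipelines often split on sentence or token boundaries instead of raw characters, but the overlap idea is the same.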

File: orchestrator/Dockerfile

  • Purpose: This file provides the instructions to build a container image for the C# orchestrator.
  • Key Logic (Multi-Stage Build):
    1. FROM ... AS build: The first stage uses the .NET SDK image, which contains all the necessary tools to compile the C# code.
    2. FROM ...: The final stage uses the lightweight ASP.NET runtime image.
    3. COPY --from=build ...: This crucial step copies only the compiled application from the build stage into the final runtime stage. This ensures the final image is small, secure, and free of build tools and source code.

2. The AI Agent Services (Python)

These three services represent the "data plane"—the specialized, computationally-intensive workers. They are written in Python using FastAPI, a modern, high-performance web framework ideal for building AI microservices.

File: shared/requirements.txt

  • Purpose: This file lists all the Python dependencies required by the AI services. A shared file ensures that all our Python services install from the same dependency list, preventing drift between services; in production you would also pin exact versions (e.g., fastapi==x.y.z) to guarantee reproducible builds.

File: embedding-service/app.py

  • Purpose: A stateless worker dedicated to one task: converting text to vector embeddings. In a real scenario, this would load a sentence-transformer model onto a GPU.
  • Key Logic: It exposes a single /embed endpoint. It receives text and returns a vector. For this project, the model loading and encoding are mocked to keep the example lightweight.
  • Interaction: It is called by the Orchestrator service.
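If you want the mock to behave a little more like a real model while staying dependency-free, you can derive a deterministic pseudo-embedding from a hash of the input, so identical texts always map to identical vectors. This is a stand-in for illustration only; it has none of the semantic properties of a real transformer embedding:

```python
import hashlib

def fake_embedding(text: str, dim: int = 384) -> list[float]:
    """Deterministically map text to a dim-sized vector with values in [0, 1].
    Unlike the constant [0.1] * 384 mock, different texts get different vectors."""
    values: list[float] = []
    counter = 0
    while len(values) < dim:
        # Each SHA-256 digest yields 32 bytes -> 32 floats; repeat until full.
        digest = hashlib.sha256(f"{counter}:{text}".encode()).digest()
        values.extend(b / 255.0 for b in digest)
        counter += 1
    return values[:dim]

v1 = fake_embedding("hello")
v2 = fake_embedding("world")
print(len(v1), v1 == fake_embedding("hello"), v1 == v2)
```

Swapping this in for the `[0.1] * 384` mock makes the /search endpoint's similarity ranking at least text-sensitive during local testing, without pulling in torch.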

File: vector-store/app.py

  • Purpose: A stateful worker that manages the knowledge base. It uses ChromaDB, a popular open-source vector database.
  • Key Logic:
    • chromadb.PersistentClient(path="/app/chroma_db"): This line is critical. It tells ChromaDB to store its data on the local filesystem at the /app/chroma_db path. When we run this in a container, this path will be mapped to a persistent volume, ensuring our data survives pod restarts.
    • /add Endpoint: Receives a document and its pre-computed vector, adding it to the collection.
    • /search Endpoint: Receives a query vector and performs a similarity search to find the most relevant documents.
  • Interaction: It is called by the Orchestrator service.

File: generation-service/app.py

  • Purpose: A stateless worker that acts as a wrapper for a Large Language Model (LLM).
  • Key Logic: It exposes a /generate endpoint that receives a full prompt (including context) and returns a final, human-readable answer. The actual LLM call is mocked.
  • Interaction: It is called by the Orchestrator service as the final step in the pipeline.

File: *-service/Dockerfile

  • Purpose: These Dockerfiles are nearly identical. They provide instructions for building the Python services.
  • Key Logic:
    1. Start with a python:3.11-slim base image.
    2. Copy the requirements.txt and install dependencies. This step is cached by Docker, so it only runs again if the dependencies change.
    3. Copy the application code (app.py).
    4. Use uvicorn, an ASGI server, to run the FastAPI application.

3. Orchestration for Local Development

File: docker-compose.yml

  • Purpose: This file defines and configures our multi-container application for local development. It allows us to spin up the entire distributed system with a single command.
  • Key Logic:
    • services:: Defines each of our four microservices.
    • build:: Tells Docker Compose how to build the image for each service (pointing to its context and Dockerfile).
    • ports:: Exposes the orchestrator's port 8080 to the host machine so we can interact with it. The other services are not exposed, as they only need to communicate internally.
    • environment:: Injects the URLs of the downstream services into the orchestrator. Note that we use the service names (embedding-service, vector-store), which Docker Compose automatically resolves to the correct container IP addresses.
    • networks:: Creates a shared virtual network (ai-net) that allows the containers to discover and communicate with each other by name.
    • volumes:: The vector-store service uses a named volume, chroma_data. This instructs Docker to persist the container's /app/chroma_db directory in Docker-managed storage on the host, ensuring our knowledge base is not lost when we stop and restart the containers.
    • depends_on:: Ensures that the downstream AI services are started before the orchestrator. Note that this guarantees start order only, not readiness; for readiness gating you would add healthchecks and use depends_on with condition: service_healthy.

4. Orchestration for Production

File: kubernetes-manifests.yaml

  • Purpose: This file contains the declarative definitions for deploying our application to a production Kubernetes cluster. It is the "source of truth" for our deployment.
  • Key Resources:
    • Namespace:: Creates an isolated environment (rag-pipeline) for our application.
    • PersistentVolumeClaim (PVC):: Requests a piece of persistent storage from the cluster for our vector database. This is the Kubernetes equivalent of the named volume in Docker Compose.
    • Deployment:: Defines the desired state for each microservice (e.g., "I want 2 replicas of the embedding-service running"). Kubernetes' control plane will work to ensure this state is always met.
      • nodeSelector & resources: The embedding-service manifest includes these to ensure it's scheduled on a node with a GPU and that it is allocated one GPU device.
      • volumeMounts:: The vector-store deployment mounts the chroma-db-pvc into the /app/chroma_db directory in the container.
      • env:: The orchestrator deployment injects the service URLs using Kubernetes' internal DNS format (e.g., http://<service-name>.<namespace>.svc.cluster.local:<port>).
    • Service:: Creates a stable network endpoint (a virtual IP address) for each set of pods. This allows the orchestrator to call http://embedding-service without needing to know the individual IP addresses of the embedding-service pods, which can change at any time.
    • Ingress:: Exposes the orchestrator service to the outside world, typically through a public IP address managed by a cloud load balancer. It routes external traffic (e.g., from rag.example.com) to the internal orchestrator service.

How They Interact: A Step-by-Step Walkthrough

Let's trace a user query: POST /query with {"query": "What is the T4 GPU good for?"}

  1. User to Ingress: The request from the user's browser or application hits the public domain (rag.example.com). The Kubernetes Ingress resource receives this traffic.
  2. Ingress to Orchestrator: The Ingress rules route the request to the orchestrator Service. The Service then load-balances the request to one of the healthy orchestrator pods.
  3. Orchestrator to Embedding Service: The Program.cs in the orchestrator pod handles the request. Its C# code creates an HTTP client and sends a POST request to http://embedding-service.rag-pipeline.svc.cluster.local:8000/embed. Kubernetes DNS resolves this name to the IP of the embedding-service Service, which routes the request to an embedding-service pod running on a GPU node. The app.py in that pod generates the embedding and returns it.
  4. Orchestrator to Vector Store: The orchestrator C# code receives the embedding. It then sends a new POST request to http://vector-store.rag-pipeline.svc.cluster.local:8000/search with the vector. Kubernetes routes this to the vector-store pod. The app.py in that pod uses ChromaDB to search its data, which is stored on the Persistent Volume Claim. It returns the most relevant text chunks.
  5. Orchestrator to Generation Service: The orchestrator's C# code combines the original query with the retrieved context into a single, large prompt. It sends this prompt in a POST request to http://generation-service.rag-pipeline.svc.cluster.local:8000/generate. Kubernetes routes this to a generation-service pod on a GPU node. The app.py in that pod mock-generates the final answer.
  6. Response to User: The orchestrator receives the final answer and streams it back through its own HTTP response, up through the Service and Ingress, and finally back to the user.

This distributed, decoupled architecture ensures that each component can be scaled, updated, and managed independently, creating a system that is far more resilient, scalable, and maintainable than a single monolithic application.

Connecting to Theoretical Foundations

This capstone project is the practical culmination of the theories discussed throughout the book:

  • Decoupling (Chapters 3, 6): The C# orchestrator is completely decoupled from the Python-based AI logic via a clean HTTP API contract. We used interfaces (conceptually) and HttpClientFactory (practically) to achieve this.
  • Containerization (Chapter 2, 4): Each service is an immutable, portable container, isolating dependencies (Python vs .NET, different ML libraries).
  • State Management (Chapter 7): The Vector Store is a stateful service, managed with a PersistentVolumeClaim to ensure data survives pod restarts, while the other services are stateless.
  • Orchestration (Chapter 8, 10): Kubernetes manages the lifecycle, networking (via Service DNS), and resource allocation (nvidia.com/gpu) of all agents.
  • Scaling (Chapter 5, 15): The stateless embedding-service and generation-service can be scaled horizontally by simply increasing the replicas in their Deployment, a concept that can be automated with HPA/KEDA as discussed in previous chapters.

This project provides a robust, scalable, and maintainable foundation for building even more complex distributed AI systems.

The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.





Code License: All code examples are released under the MIT License. Github repo.

Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.
