
Chapter 17: Logging & Telemetry with OpenTelemetry

Theoretical Foundations

In the realm of high-performance AI services, particularly those serving Large Language Models (LLMs) or diffusion models, the application code often feels like a black box. A request goes in, and a token stream comes out. When latency spikes or error rates climb, the immediate questions are: Where did the time go? Which model version caused the issue? How does the internal pipeline behave under load? This is where OpenTelemetry (OTel) becomes the nervous system of your ASP.NET Core application.

The Conceptual Shift: From Logs to Signals

Traditionally, logging in .NET has been about writing text lines to a file or console. While useful for debugging, these logs are unstructured and lack context. In a distributed AI system—where a single HTTP request might trigger a database lookup, a cache check, a call to an external AI provider (like OpenAI), and a tokenization process—we need a holistic view.

OpenTelemetry is not just logging; it is a unified standard for Observability. It comprises three distinct signals:

  1. Traces: The lifecycle of a request as it moves through the system.
  2. Metrics: Aggregated numerical data about system performance (e.g., requests per second, GPU memory usage).
  3. Logs: Discrete events with high-fidelity context.

In the context of AI, these signals are critical. An AI model inference is often the most expensive operation in your stack. Understanding its behavior is not a luxury; it is a requirement for cost management and user experience.

The Air Traffic Control Analogy

Imagine an airport. Without a control tower, planes land and take off blindly. If a plane is delayed, no one knows why. Was it the weather? A mechanical issue? A shortage of fuel trucks?

  • Traditional Logging is like a pilot writing "Delayed" in a physical logbook after landing. It tells you something happened, but not the chain of events.
  • OpenTelemetry Tracing is the radar system. It tracks the plane (the request) from the moment it enters the airspace (API Gateway) to taxiing (Model Loading), taking off (Inference), and landing (Response). It shows the exact duration of each phase.
  • OpenTelemetry Metrics are the dashboard in the control tower showing the number of planes landing per hour (throughput) and the average runway occupancy time (latency).

In an AI Web API, a single user prompt might trigger a complex chain: Authentication → Input Validation → Semantic Search (RAG) → Model Inference → Output Filtering. OpenTelemetry allows us to visualize this entire journey as a single, cohesive unit.

Distributed Tracing in AI Pipelines

In Chapter 15, we discussed Dependency Injection (DI) to manage the lifecycle of our IChatService implementations. This architectural pattern is foundational for OpenTelemetry.

When a request arrives at your ASP.NET Core controller, OpenTelemetry automatically generates a TraceContext. This context contains a TraceId (unique identifier for the entire operation) and a SpanId (identifier for the current operation).

In an AI application, we care deeply about propagation. If your API calls an external vector database (like Pinecone) or an external LLM provider, the trace context must be injected into the HTTP headers. This ensures that the latency observed in the external service is linked back to the specific request in your application.
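As a small, hedged illustration of what actually gets propagated (the activity name here is hypothetical): when an `Activity` uses the W3C ID format, its `Id` property is exactly the `traceparent` header value that `HttpClient` instrumentation injects into outgoing requests.

```csharp
using System;
using System.Diagnostics;

// Sketch: the W3C trace context that instrumentation propagates downstream.
var activity = new Activity("VectorDb.Query");
activity.SetIdFormat(ActivityIdFormat.W3C); // the default on modern .NET
activity.Start();

// With the W3C format, Activity.Id is the exact `traceparent` header value:
// 00-<trace-id>-<span-id>-<trace-flags>
Console.WriteLine($"traceparent: {activity.Id}");

activity.Stop();
```

The external service (or the OpenTelemetry SDK on its side) parses this header and creates its spans under the same TraceId, which is what links its latency back to your request.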

Consider the flow of a chat completion request:

  1. Root Span: The HTTP request hits the controller.
  2. Child Span 1: The application queries a vector database for context (RAG). The duration of this query is isolated.
  3. Child Span 2: The application constructs the prompt and sends it to the AI model. This is often the longest span.
  4. Child Span 3: The application processes the stream of tokens.

Without tracing, you might see a 5-second total latency but have no idea that 4.5 seconds were spent waiting for the database. With tracing, the visualization makes this bottleneck obvious immediately.
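The four steps above can be sketched as nested spans. This is a BCL-only console sketch: an `ActivityListener` stands in for the OpenTelemetry SDK registration (normally `AddSource`) so that `StartActivity` returns live activities; the span and tag names are illustrative.

```csharp
using System;
using System.Diagnostics;

var source = new ActivitySource("AI.Chat.Pipeline");

// A listener is required for StartActivity to return non-null outside the SDK.
ActivitySource.AddActivityListener(new ActivityListener
{
    ShouldListenTo = s => s.Name == "AI.Chat.Pipeline",
    Sample = (ref ActivityCreationOptions<ActivityContext> _) =>
        ActivitySamplingResult.AllDataAndRecorded
});

// Root span: the HTTP request (normally created by ASP.NET Core instrumentation).
using (var root = source.StartActivity("POST /chat"))
{
    // Child span 1: vector database lookup (RAG).
    using (var rag = source.StartActivity("VectorDb.Query"))
    {
        rag?.SetTag("db.system", "pinecone"); // assumption: a Pinecone backend
    }

    // Child span 2: model inference, usually the longest span.
    using (var inference = source.StartActivity("Model.Inference"))
    {
        inference?.SetTag("gen_ai.request.model", "gpt-4-turbo");
    }

    // Child span 3: token stream processing.
    using (var stream = source.StartActivity("Tokens.Process"))
    {
    }

    Console.WriteLine($"TraceId shared by all spans: {root?.TraceId}");
}
```

Because each child is started while its parent is `Activity.Current`, all four spans share one TraceId, which is what lets a trace viewer draw them as a single waterfall.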

A visual trace diagram highlights a database query as the bottleneck, clearly showing that five seconds were spent waiting for the database response.

Structured Logging vs. Telemetry

In previous chapters, we utilized ILogger<T> for logging. OpenTelemetry enhances this by turning logs into structured data correlated with traces.

When an AI model fails (e.g., a safety filter triggers or the model hallucinates), a simple error log is insufficient. We need context: What was the input prompt? What was the temperature setting? Which model version was active?

OpenTelemetry allows us to attach Attributes (or Tags) to logs and traces. In an AI context, these attributes are vital for debugging:

  • gen_ai.request.model: The specific model name (e.g., "gpt-4-turbo").
  • gen_ai.request.max_tokens: The limit set for the response.
  • gen_ai.response.finish_reason: Why the model stopped generating (e.g., "stop", "length").
  • gen_ai.usage.prompt_tokens: The cost of the input.

By structuring this data, we can query our observability platform (like Jaeger or Prometheus) to answer questions like: "What is the average latency for Model Version 'v1.2' compared to 'v1.3'?" or "How often does the 'safety_filter' error occur?"
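A minimal sketch of attaching these attributes to an inference span (the tag values are illustrative; `GetTagItem` is used only to show that the values round-trip):

```csharp
using System;
using System.Diagnostics;

// Constructing an Activity directly always starts it, even without a listener.
var activity = new Activity("Model.Inference").Start();

activity.SetTag("gen_ai.request.model", "gpt-4-turbo");
activity.SetTag("gen_ai.request.max_tokens", 1024);
activity.SetTag("gen_ai.response.finish_reason", "stop");
activity.SetTag("gen_ai.usage.prompt_tokens", 87);

Console.WriteLine(activity.GetTagItem("gen_ai.request.model")); // gpt-4-turbo
activity.Stop();
```

When the span is exported, each tag becomes a queryable attribute in the backend rather than text buried inside a log line.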

Metrics: The Pulse of the Model

While traces tell the story of a single request, metrics tell the story of the system over time. In AI applications, metrics are often more critical than logs because AI workloads are bursty and resource-intensive.

We categorize metrics into four Golden Signals:

  1. Latency: The time required to service a request. In AI, this is heavily influenced by GPU memory bandwidth and model size.
  2. Traffic: The number of requests per second (RPS). This helps in scaling decisions.
  3. Errors: The rate of failed requests (e.g., HTTP 5xx or 4xx).
  4. Saturation: How "full" your resource is. For AI, this is often GPU utilization or VRAM usage.

OpenTelemetry provides instruments to create these metrics:

  • Counter: Increments by a value (e.g., total number of tokens generated).
  • Histogram: Records a distribution of values (e.g., request latency). This is crucial for AI because latency is rarely linear; it depends heavily on input length.
  • ObservableCounter: A counter whose value is read via a callback at collection time rather than incremented inline (e.g., cumulative GPU busy time). For point-in-time readings such as GPU temperature, an ObservableGauge is the appropriate instrument.
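These instruments map onto the System.Diagnostics.Metrics API that the OpenTelemetry SDK collects from. A minimal sketch, in which the meter and instrument names are illustrative and the GPU reading is a placeholder:

```csharp
using System.Diagnostics.Metrics;

var meter = new Meter("AI.Chat.API");

// Counter: a monotonically increasing total.
var tokensGenerated = meter.CreateCounter<long>(
    "gen_ai.tokens.generated", unit: "tokens");

// Histogram: a distribution of per-request values, e.g. inference latency.
var inferenceDuration = meter.CreateHistogram<double>(
    "gen_ai.inference.duration", unit: "ms");

// ObservableCounter: a cumulative value computed in a callback at collection time.
var gpuBusySeconds = meter.CreateObservableCounter(
    "gpu.busy_time", () => ReadGpuBusySeconds());

tokensGenerated.Add(42);        // e.g. after a completion finishes
inferenceDuration.Record(123.4); // e.g. measured around the model call

static double ReadGpuBusySeconds() => 0.0; // placeholder: would query the GPU driver
```

The histogram is the right choice for latency precisely because of the point above: request duration varies with input length, so you need the distribution (P50/P95/P99), not a single average.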

The Correlation of Telemetry and Model Versions

One of the most powerful features of OpenTelemetry in AI is the ability to correlate telemetry with specific model versions. In the context of Chapter 15, where we discussed interfaces for different AI providers, we often switch models dynamically.

Imagine you are performing a Canary Deployment. You route 5% of traffic to a new model version (v2) and 95% to v1. Without telemetry, you are flying blind. With OpenTelemetry, you tag every trace with model.version.

In your observability dashboard, you can overlay two graphs:

  1. Latency (P95) for v1 vs. v2.
  2. Token Throughput for v1 vs. v2.

If v2 introduces a regression that causes a memory leak, you will see the saturation metrics (GPU memory) spike specifically for traces tagged with v2. This allows for instant rollback before the entire system crashes.
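A hedged sketch of the routing side of this canary: the split and version names mirror the scenario above, and the router function is hypothetical.

```csharp
using System;
using System.Diagnostics;

// Route ~5% of traffic to the canary model version.
static string RouteModelVersion(Random rng) =>
    rng.NextDouble() < 0.05 ? "v2" : "v1";

var rng = new Random();
var version = RouteModelVersion(rng);

// Every trace carries model.version, so dashboards can split latency
// and saturation metrics by version and catch a v2 regression early.
Activity.Current?.SetTag("model.version", version);
Console.WriteLine($"Routed request to model {version}");
```

The essential point is that the tag is set per request at routing time, so the observability backend can group every downstream signal by the version that actually served it.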

The Architecture of the OpenTelemetry SDK

The OpenTelemetry .NET SDK operates on a pipeline architecture. Understanding this is crucial for optimizing performance in high-throughput AI APIs.

  1. Instrumentation: This is the code that captures the data. In ASP.NET Core, we use automatic instrumentation (middleware) to capture HTTP requests. For custom AI operations, we use manual instrumentation.
  2. Processor: Data goes through a processor. The BatchSpanProcessor is essential here. It collects spans in memory and sends them in batches to avoid overwhelming the network or the observability backend. In an AI API generating thousands of spans per second (especially with streaming responses), the processor configuration dictates the overhead of telemetry.
  3. Exporter: The component that sends data to a backend, either directly (Jaeger, Zipkin, Prometheus) or via the OTLP protocol to an OpenTelemetry Collector.
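As a configuration sketch of the processor stage (assuming the OpenTelemetry.Extensions.Hosting and OpenTelemetry.Exporter.OpenTelemetryProtocol NuGet packages, and `builder` being the WebApplicationBuilder; the specific numbers are illustrative):

```csharp
// Tuning the batch processor behind the OTLP exporter.
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddOtlpExporter(options =>
        {
            // A larger queue absorbs span bursts from streaming endpoints;
            // a longer delay batches more spans per network call.
            options.BatchExportProcessorOptions.MaxQueueSize = 4096;
            options.BatchExportProcessorOptions.ScheduledDelayMilliseconds = 2000;
        }));
```

If the queue fills faster than the exporter drains it, the batch processor drops spans rather than blocking request threads, which is the trade-off that keeps telemetry from slowing inference.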

A critical architectural consideration for AI is Sampling. In a high-traffic AI chat application, you might generate millions of tokens per minute. Exporting telemetry for every request can be prohibitively expensive and slow down the model inference itself.

OpenTelemetry supports Head-based Sampling (the keep/drop decision is made when the trace starts) and Tail-based Sampling (the decision is made after the request completes, typically in the OpenTelemetry Collector, since the in-process SDK cannot know the outcome in advance). For AI workloads, tail-based sampling is often preferred: you might keep 100% of requests that result in an error or exceed a latency threshold, while sampling only 1% of fast, successful requests. This ensures you capture the "bad" signals without drowning in data.
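Head-based sampling can be sketched with only the BCL: the keep/drop decision is derived from the TraceId at span creation. With the OpenTelemetry SDK you would instead configure this declaratively, e.g. `tracing.SetSampler(new ParentBasedSampler(new TraceIdRatioBasedSampler(0.01)))`; the names below are otherwise illustrative.

```csharp
using System;
using System.Diagnostics;

var source = new ActivitySource("AI.Chat.Sampled");
const double sampleRatio = 0.01; // keep ~1% of traces

ActivitySource.AddActivityListener(new ActivityListener
{
    ShouldListenTo = s => s.Name == "AI.Chat.Sampled",
    Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
    {
        // Map the first 8 hex chars of the (random) TraceId onto a bucket.
        var bucket = Convert.ToUInt32(options.TraceId.ToHexString().Substring(0, 8), 16);
        return bucket < uint.MaxValue * sampleRatio
            ? ActivitySamplingResult.AllDataAndRecorded
            : ActivitySamplingResult.None;
    }
});

// Spans whose TraceId falls outside the 1% bucket are never even created.
using var maybeSampled = source.StartActivity("Model.Inference");
Console.WriteLine(maybeSampled is null ? "dropped" : "sampled");
```

Deriving the decision from the TraceId (rather than a random roll) keeps the decision consistent across all services in the same trace, which is why ratio-based samplers work this way.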

The "What If": Failure Modes and Telemetry

Let's consider edge cases where telemetry saves the day.

Scenario 1: The Silent Timeout. An AI model inference might hang indefinitely due to a deadlock in the underlying C++ bindings or a network partition with the GPU cluster. Without telemetry, the request hangs until the server's timeout kills it. With OpenTelemetry, the trace remains "active" but never completes. Observability tools can detect "stale" traces and alert you to a stuck process.

Scenario 2: The "Noisy Neighbor" Effect. In a multi-tenant AI API, one user might send a massive prompt (e.g., 50,000 tokens) that monopolizes the GPU, causing high latency for everyone else.

  • Without Telemetry: You see high average latency but don't know why.
  • With Telemetry: You look at the http.request.body.size attribute on your traces. You realize that the 99th percentile latency correlates perfectly with the top 1% of request sizes. You can then implement rate limiting based on token count.

Integration with Modern C# Features

In modern C#, we interact with OpenTelemetry through the Activity class from System.Diagnostics; an Activity represents a span in the trace.

When building an AI service, we use the ActivitySource to create custom spans. This is particularly useful when dealing with asynchronous streams (IAsyncEnumerable), which are common in AI chat APIs.

For example, when streaming tokens back to the client, the standard HTTP middleware might mark the request as complete as soon as the first byte is sent. However, the actual model inference might still be running. By manually creating an Activity around the streaming loop, we can accurately measure the total time the model took to generate the response, including the time between tokens.
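A BCL-only sketch of that pattern, with a simulated token source standing in for a real model client: the Activity opened before the `await foreach` loop spans the full generation, including the time between tokens. (Without an SDK or listener registered for the source, `StartActivity` returns null and the tag call is a no-op, so the sketch still runs.)

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;

var source = new ActivitySource("AI.Chat.Streaming");

// Hypothetical token stream standing in for a real model client.
static async IAsyncEnumerable<string> GenerateTokensAsync()
{
    foreach (var token in new[] { "Hello", ", ", "world" })
    {
        await Task.Delay(10); // simulated time-between-tokens
        yield return token;
    }
}

// Wrap the entire streaming loop in one Activity so the span covers
// full generation time, not just time-to-first-byte.
async Task<int> StreamResponseAsync()
{
    using var activity = source.StartActivity("Model.Inference.Stream");
    var tokenCount = 0;
    await foreach (var token in GenerateTokensAsync())
    {
        tokenCount++;
        // In a real endpoint, each token would be flushed to the HTTP response here.
    }
    activity?.SetTag("gen_ai.usage.completion_tokens", tokenCount);
    return tokenCount;
}

var count = await StreamResponseAsync();
Console.WriteLine($"Streamed {count} tokens");
```

Because the `using` block only disposes the Activity after the loop finishes, the recorded duration includes every inter-token delay, which is exactly what the HTTP middleware's span misses.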

Furthermore, the DiagnosticListener class (also in System.Diagnostics) allows us to hook into internal .NET events. This is how the automatic instrumentation for HttpClient works: it listens for outgoing HTTP requests and automatically creates child spans, ensuring that a call to an external AI provider is seamlessly linked to the incoming request.

Summary

In summary, OpenTelemetry provides the visibility required to run production-grade AI systems. It transforms the opaque process of model inference into a transparent, measurable, and debuggable pipeline.

  • Traces provide the narrative of the request lifecycle.
  • Metrics provide the statistical health of the system.
  • Logs provide the granular details of specific events.

By correlating these signals with model versions and input characteristics, we move from reactive firefighting to proactive optimization. We can answer not just "Is it broken?" but "How efficient is it?" and "Why is it behaving this way?". This theoretical understanding sets the stage for implementing the practical instrumentation in the subsequent sections.

Basic Code Example

Here is a basic code example demonstrating how to integrate OpenTelemetry into an ASP.NET Core application serving a simple AI chat endpoint. This setup focuses on capturing traces for HTTP requests and logging structured telemetry for model inference.

Real-World Context

Imagine you have deployed an AI chat API that generates responses using a large language model. In production, users report occasional slowness, but you lack visibility into where the latency occurs—is it the network, the model inference, or database lookups? This code solves that by instrumenting the application to emit telemetry data (traces and logs) that can be visualized in tools like Jaeger or Prometheus.

Code Example

using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;
using OpenTelemetry;
using OpenTelemetry.Logs;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;
using System;
using System.Diagnostics;
using System.IO;
using System.Threading.Tasks;

var builder = WebApplication.CreateBuilder(args);

// 1. Configure OpenTelemetry Resources
// Define the service name and version to identify this application in telemetry backends.
var serviceName = "AI-Chat-API";
var serviceVersion = "1.0.0";

builder.Services.AddOpenTelemetry()
    .ConfigureResource(resource => resource
        .AddService(serviceName: serviceName, serviceVersion: serviceVersion))

    // 2. Add Tracing (Distributed Tracing)
    // Tracks the lifecycle of a request across services.
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation() // Automatically traces incoming HTTP requests
        .AddSource("AI.Chat.API") // Register our custom ActivitySource so its spans are exported
        .AddConsoleExporter()) // Export traces to the console for this demo (replace with Jaeger/OTLP in prod)

    // 3. Add Metrics
    // Collects quantitative data like request counts and latency histograms.
    .WithMetrics(metrics => metrics
        .AddAspNetCoreInstrumentation() // Collect HTTP request metrics
        .AddConsoleExporter());

// 4. Configure Logging
// We need to hook OpenTelemetry into the standard ILogger system.
builder.Logging.ClearProviders();
builder.Logging.AddOpenTelemetry(options =>
{
    options.IncludeScopes = true; // Include scope information (e.g., request ID)
    options.ParseStateValues = true; // Parse log state into structured attributes
    options.AddConsoleExporter(); // Export logs to console
});

var app = builder.Build();

// 5. Create the Chat Endpoint
app.MapPost("/chat", async (HttpContext context) =>
{
    // Read the prompt from the request body
    using var reader = new StreamReader(context.Request.Body);
    var prompt = await reader.ReadToEndAsync();

    // Start a manual span for the model inference process
    using var activity = TelemetryConstants.ActivitySource.StartActivity("Model.Inference");

    // Add tags (attributes) to the span for better filtering in observability tools
    activity?.SetTag("model.version", "v1.2");
    activity?.SetTag("prompt.length", prompt.Length);

    // Simulate AI Model Inference
    var logger = context.RequestServices.GetRequiredService<ILogger<Program>>();

    // Structured Logging: Log the inference start with context
    logger.LogInformation("Starting model inference for prompt length: {PromptLength}", prompt.Length);

    // Simulate latency
    await Task.Delay(100);

    // Simulate an error scenario for demonstration
    if (prompt.Contains("error"))
    {
        // Mark the span as failed so alerting systems can detect the error rate
        activity?.SetStatus(ActivityStatusCode.Error, "Simulated inference failure");

        // Structured Logging: Log the error
        logger.LogError("Model inference failed for prompt: {Prompt}", prompt);

        return Results.Problem("Model inference failed.");
    }

    // Record success
    activity?.SetStatus(ActivityStatusCode.Ok);
    logger.LogInformation("Model inference completed successfully.");

    return Results.Ok(new { response = "This is a generated AI response." });
});

app.Run();

// 6. Define a Custom Activity Source for Manual Tracing
// This allows us to create spans for specific operations (e.g., model inference).
// Note: in a top-level-statement program, type declarations must appear after
// all top-level statements, so this class is declared at the end of the file.
static class TelemetryConstants
{
    public static readonly ActivitySource ActivitySource = new("AI.Chat.API");
}

Detailed Line-by-Line Explanation

  1. Namespace Imports: We import necessary namespaces. OpenTelemetry.* contains the core APIs for tracing, metrics, and logging. System.Diagnostics is required for the ActivitySource class used in manual instrumentation.

  2. Builder Initialization: var builder = WebApplication.CreateBuilder(args); initializes the ASP.NET Core host builder. This provides access to services and logging configuration.

  3. Resource Configuration:

    • builder.Services.AddOpenTelemetry(): The entry point for configuring the OpenTelemetry SDK.
    • .ConfigureResource(...): Defines metadata about the application (Service Name, Version, Environment). This metadata is attached to every span, metric, and log emitted, allowing observability backends to group data by service.
  4. Tracing Pipeline:

    • .WithTracing(tracing => ...): Configures the tracing SDK.
    • .AddAspNetCoreInstrumentation(): Middleware that automatically creates a span for every incoming HTTP request. It captures standard details like HTTP method, route, and status code.
    • .AddConsoleExporter(): For this "Hello World" example, we export data to the console. In a real production environment, you would use .AddOtlpExporter() to send data via OTLP to an OpenTelemetry Collector or to a backend such as Jaeger.
  5. Metrics Pipeline:

    • .WithMetrics(metrics => ...): Configures the metrics SDK.
    • .AddAspNetCoreInstrumentation(): Automatically collects metrics like http.server.duration (how long requests take) and http.server.active_requests (current load).
    • .AddConsoleExporter(): Periodically prints metric snapshots to the console.
  6. Logging Configuration:

    • builder.Logging.ClearProviders(): Removes default logging providers (like the console logger) to avoid duplicate output and ensure OpenTelemetry handles logging.
    • builder.Logging.AddOpenTelemetry(...): Routes all ILogger calls through the OpenTelemetry SDK.
    • IncludeScopes and ParseStateValues: These settings ensure that log context (like the Request ID) and structured data (like {PromptLength}) are preserved as distinct fields in the telemetry backend, rather than just plain text strings.
  7. Activity Source Definition:

    • TelemetryConstants.ActivitySource: We create a static ActivitySource. This acts as a factory for creating custom spans (Activities). It is crucial for instrumenting specific business logic, such as the AI model inference, which isn't covered by the ASP.NET Core instrumentation.
  8. The /chat Endpoint:

    • app.MapPost(...): Defines an HTTP POST endpoint.
    • using var activity = ...: We manually start an Activity named "Model.Inference". The using statement ensures the span is disposed (ended) when the logic completes, capturing the duration.
    • activity?.SetTag(...): We attach metadata (Tags) to the span. In a dashboard, you can filter traces by model.version="v1.2" to compare performance between model updates.
    • logger.LogInformation(...): We use the injected ILogger to emit a structured log. Because we configured OpenTelemetry logging, this log entry is correlated with the current trace (it shares the same Trace ID).
    • await Task.Delay(100): Simulates the time taken to process the prompt in a real AI model.
    • activity?.SetStatus(...): Explicitly marks the span as OK or Error. This is vital for alerting systems to detect failure rates.

Common Pitfalls

  1. Missing using Statements for Activities: A frequent mistake is creating an Activity but forgetting to wrap it in a using block. If you don't dispose of the activity, the span will never be marked as finished, resulting in "hanging" traces in your observability dashboard and inaccurate duration metrics.

  2. Forgetting Resource Configuration: Without setting the Service Name, all telemetry data might appear under a generic name (e.g., "unknown_service"), making it impossible to distinguish between different microservices in a distributed system.

  3. Console Exporter in Production: The AddConsoleExporter() is strictly for development. It blocks the application thread to write to standard output and generates massive I/O load. In production, always use an exporter that sends data asynchronously to a remote backend (e.g., OTLP, Jaeger, Prometheus).

  4. Incompatible Log Levels: OpenTelemetry captures logs based on the standard ILogger configuration. If your appsettings.json sets the log level to Warning, the LogInformation calls in the example will be ignored. Ensure your log level is set appropriately (e.g., Information or Debug) to capture the telemetry you need.

Visualizing the Telemetry Flow

The following diagram illustrates how data flows from the application code to the observability backend.

The diagram visualizes the telemetry flow from application code emitting logs and metrics to the observability backend for analysis.

The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.



Code License: All code examples are released under the MIT License. Github repo.

Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.

All textual explanations, original diagrams, and illustrations are the intellectual property of the author. To support the maintenance of this site via AdSense, please read this content exclusively online. Copying, redistribution, or reproduction is strictly prohibited.