Chapter 3: Dependency Injection (DI) Service Lifetimes
Theoretical Foundations
Dependency Injection (DI) is the architectural backbone of modern ASP.NET Core applications, acting as the central nervous system that manages the lifecycle and dependencies of your software components. In the context of building AI Web APIs, where we serve computationally expensive models like ONNX or ML.NET inference engines, understanding DI lifetimes is not merely a best practice—it is a critical requirement for ensuring thread safety, efficient memory management, and stable performance under load. When an AI model consumes gigabytes of RAM and requires significant CPU/GPU resources to load, a misconfigured service lifetime can lead to memory leaks, race conditions, or catastrophic application crashes.
To understand the three primary lifetimes—Transient, Scoped, and Singleton—we must first visualize the DI container as a sophisticated factory and inventory manager. Imagine a high-end restaurant kitchen (the application) that prepares specialized dishes (responses). The DI container is the head chef who decides when to prep ingredients and how to distribute them to the line cooks (request handlers). The lifetimes dictate whether the chef prepares a fresh ingredient for every single order, prepares it once per shift, or prepares it once for the entire lifetime of the restaurant.
The Analogy: The Restaurant Kitchen
Consider three types of kitchen resources:
- Fresh Herbs (Transient): These are added individually to each plate just before serving. They are cheap to acquire, perishable, and specific to the dish. If two cooks need basil, they each grab a fresh handful. No sharing occurs.
- The Cutting Board (Scoped): This is a durable tool assigned to a specific station for the duration of a single service (e.g., the dinner rush). One cook uses it to chop vegetables for a specific table's order. Once the table leaves, the board is washed and can be reused for the next table, but it is never shared between two different tables simultaneously.
- The Industrial Oven (Singleton): This is a massive, expensive piece of hardware that takes hours to heat up and consumes significant energy. It is built once and serves the entire kitchen for the duration of the restaurant's existence. Every cook shares this single instance. If the oven is not thread-safe (e.g., two cooks try to set different temperatures simultaneously), the kitchen burns down.
In our AI API, the "Industrial Oven" is the ONNX model inference engine. Loading a 4GB model into memory takes time and RAM. We cannot afford to load it for every request (Transient), nor do we want to dispose of it after a single user's session (Scoped). We need it to be a Singleton.
Transient Lifetime
Definition:
A service registered as Transient is created anew every time it is requested from the container or injected into a consuming class. If the same Transient service is injected multiple times into a single request (e.g., into a Controller and a Service it calls), each injection receives a distinct instance.
Theoretical Implications: In a standard stateless web application, Transient is often the default choice because it guarantees isolation. No state is carried over between requests, eliminating side effects. However, in AI applications, Transient lifetimes introduce specific risks and inefficiencies.
- Memory Churn: If a Transient service holds a reference to an unmanaged resource (like a pointer to GPU memory allocated by a native ML library), creating and destroying it for every HTTP request can cause excessive garbage collection pressure and memory fragmentation.
- Thread Safety: Because every thread gets its own instance, you generally don't need to worry about locking mechanisms within the service itself. However, if the Transient service depends on a shared resource (like a database connection or a static logger), you must ensure those external dependencies are thread-safe.
AI Context:
Imagine a TokenizerService that converts raw text into tokens for an LLM. Tokenizers are usually lightweight, stateless, and fast. Registering this as Transient is ideal. If you have 100 concurrent requests, you have 100 tokenizer instances, ensuring that one request's tokenization logic doesn't block another's.
// Conceptual Registration (No Execution Code)
services.AddTransient<ITokenizerService, GptTokenizerService>();
Scoped Lifetime
Definition:
A service registered as Scoped is created once per client request (in a web application) or per logical operation scope. Within a single HTTP request, if multiple components request the same Scoped service, they all receive the exact same instance.
Theoretical Implications: Scoped lifetimes are the bridge between the ephemeral nature of Transient and the permanence of Singleton. They are essential for maintaining consistency during a single operation.
- The Unit of Work Pattern: This is the primary use case. In AI applications, you often need to log metadata, track token usage, or save conversation history to a database. You don't want to open and close a database connection for every micro-interaction within a single API call. You want one connection (or connection context) that persists for the duration of that request.
- Stateful Contexts: If you are building a multi-step AI agent (e.g., a chain-of-thought reasoning process), you might need a Scoped ConversationContext that accumulates state as the agent reasons through a problem. This state must not leak into the next user's request, nor should it be destroyed halfway through the current request.
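A minimal sketch of such a per-request context, assuming a hypothetical ConversationContext class (in a real app it would be registered with services.AddScoped<ConversationContext>() so every component in the same request sees the same instance):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical scoped service: accumulates intermediate reasoning steps
// for the duration of ONE request, then is discarded with its scope.
public class ConversationContext
{
    private readonly List<string> _steps = new();

    public void AddStep(string step) => _steps.Add(step);

    public IReadOnlyList<string> Steps => _steps;
}

public static class Demo
{
    public static void Main()
    {
        // One "request": the same scoped instance accumulates state.
        var ctx = new ConversationContext();
        ctx.AddStep("parse question");
        ctx.AddStep("retrieve supporting documents");
        Console.WriteLine(ctx.Steps.Count); // prints 2
    }
}
```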
AI Context:
Consider DbContext (Entity Framework Core). In a standard web app, DbContext is registered as Scoped. In an AI app, you might use it to store the inputs and outputs of an AI generation request. If you registered it as Transient, you would create a new database context for every log entry within a single request, leading to connection exhaustion. If you registered it as Singleton, all users would share the same database context, leading to concurrency conflicts and memory leaks as tracked entities pile up.
// Conceptual Registration
services.AddScoped<IChatSessionRepository, ChatSessionRepository>();
services.AddScoped<DbContext, AiDbContext>();
Singleton Lifetime
Definition:
A service registered as Singleton is created only once for the entire lifetime of the application. The DI container creates the instance upon the first request and then maintains it in memory until the application shuts down. Every subsequent request for that service receives the same instance.
Theoretical Implications: Singletons are powerful but dangerous. They are the most efficient regarding memory and initialization overhead but impose the strictest constraints on thread safety.
- Thread Safety (The Critical Constraint): Since a Singleton is shared across all threads handling concurrent HTTP requests, the class must be thread-safe. If a Singleton service has mutable state (fields or properties that change), you must use synchronization primitives (locks, semaphores, concurrent collections) to prevent race conditions.
- Memory Footprint: A Singleton lives forever. If it holds a large object graph (like a loaded AI model), that memory is occupied until the app restarts. This is usually desirable for AI models to avoid reloading costs, but it requires careful memory management to ensure no unmanaged resources are leaked.
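To illustrate the thread-safety constraint, here is a minimal sketch of a shared, Singleton-style object that stays correct under concurrency by using Interlocked instead of a plain mutable field (the InferenceMetrics name is hypothetical, not from any library):

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical singleton-style metrics holder. A bare `_requestCount++`
// would be a read-modify-write race; Interlocked makes it atomic.
public class InferenceMetrics
{
    private long _requestCount;

    public void RecordRequest() => Interlocked.Increment(ref _requestCount);

    public long RequestCount => Interlocked.Read(ref _requestCount);
}

public static class Demo
{
    public static async Task Main()
    {
        var metrics = new InferenceMetrics(); // one instance, shared like a Singleton

        // 100 concurrent "requests" all incrementing the shared counter.
        var tasks = Enumerable.Range(0, 100)
            .Select(_ => Task.Run(metrics.RecordRequest));
        await Task.WhenAll(tasks);

        Console.WriteLine(metrics.RequestCount); // prints 100
    }
}
```

With a plain `long` increment, the final count could be less than 100 under contention; the atomic increment guarantees it.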
AI Context (The "Industrial Oven"): This is where AI Web APIs differ significantly from standard CRUD apps. Loading an ONNX model or an ML.NET pipeline is expensive. It involves reading megabytes/gigabytes from disk, parsing the model architecture, and allocating memory on the CPU or GPU.
If you register an IInferenceEngine as Transient, the application would load the model from disk for every single user request. This would render the API unusable under load. If you register it as Scoped, it would load once per user session (HTTP request), which is still too heavy for high-throughput scenarios.
Therefore, the AI inference engine must be a Singleton. It is loaded once and then serves thousands of requests. However, the engine itself must be designed to be thread-safe. It cannot hold temporary state (like the current input text) in instance fields; that state must be passed as method parameters or exist in Scoped/Transient objects.
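One common way to get "loaded once, shared by all" semantics is Lazy<T> with thread-safe publication, so the expensive load is deferred to first use and can never run twice. The sketch below uses a string as a stand-in for a real InferenceSession, and the load counter exists only to prove the point:

```csharp
using System;
using System.Threading;

// Hypothetical model holder. In a real app the factory delegate would
// run something like `new InferenceSession("model.onnx")`.
public class ModelHolder
{
    private static int _loads;

    private readonly Lazy<string> _model = new(
        () => { Interlocked.Increment(ref _loads); return "model-bytes"; },
        LazyThreadSafetyMode.ExecutionAndPublication); // at most one load, ever

    public string Model => _model.Value;

    public static int LoadCount => _loads;
}

public static class Demo
{
    public static void Main()
    {
        var holder = new ModelHolder(); // one instance, shared like a Singleton
        _ = holder.Model;               // first access triggers the load
        _ = holder.Model;               // second access reuses the cached value
        Console.WriteLine(ModelHolder.LoadCount); // prints 1
    }
}
```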
Architectural Implications and Dependency Chains
The complexity arises when dependencies have mismatched lifetimes. This is known as the "Captive Dependency" problem.
Scenario:
IInferenceEngine (Singleton) depends on IDatabaseLogger (Scoped) to log model usage. IDatabaseLogger depends on DbContext (Scoped).
The Problem:
The IInferenceEngine is created once. It captures a reference to the IDatabaseLogger (or DbContext) that was created during the application's startup. However, DbContext is designed to be short-lived. If the Singleton engine tries to use that captured DbContext later during a user request, the DbContext may have already been disposed of by the container (since its scope ended), or worse, it might be serving multiple requests simultaneously, causing data corruption.
The Solution:
Singletons cannot depend on Scoped services directly. Instead, they should depend on IServiceProvider (the factory) or use a Func<T> delegate to resolve the Scoped service at runtime when needed.
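The factory approach can be sketched without a container: the Singleton holds a Func<T> and resolves a fresh dependency per call instead of capturing one instance forever. In ASP.NET Core this role is typically played by IServiceScopeFactory; RequestLogger and the other names here are hypothetical:

```csharp
using System;

// Hypothetical short-lived dependency (stands in for a Scoped service).
public class RequestLogger
{
    public Guid Id { get; } = Guid.NewGuid();
}

// Would be registered as Singleton: it captures a *factory*, never an instance.
public class InferenceEngine
{
    private readonly Func<RequestLogger> _loggerFactory;

    public InferenceEngine(Func<RequestLogger> loggerFactory)
        => _loggerFactory = loggerFactory;

    public Guid Predict(string input)
    {
        var logger = _loggerFactory(); // fresh instance per call, never captive
        return logger.Id;
    }
}

public static class Demo
{
    public static void Main()
    {
        var engine = new InferenceEngine(() => new RequestLogger());
        // Each call gets its own logger, so the IDs differ.
        Console.WriteLine(engine.Predict("a") != engine.Predict("b")); // prints True
    }
}
```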
Visualizing the Lifetimes: The following diagram illustrates how instances are distributed across time and requests.
Practical Application in AI APIs
When building an AI API, we often deal with a specific hierarchy of services:
- Singletons:
  - Model Containers: Classes that hold InferenceSession (ONNX) or a PredictionEnginePool (ML.NET's thread-safe pooling wrapper). These are heavy and designed for concurrent use.
  - Configuration Wrappers: Strongly typed configuration objects loaded at startup.
  - HTTP Client Factories: While HttpClient is often registered via IHttpClientFactory (which manages its own lifetime), the factory itself is effectively a singleton.
- Scoped Services:
  - DbContext: For persisting chat history or audit logs.
  - User Context Providers: Services that extract user identity from the HTTP context and make it available to other services.
  - Conversation State: Objects that track the current turn of a chat session.
- Transient Services:
  - Helper Classes: Any lightweight class that processes data for a single operation (DTOs themselves are usually instantiated directly rather than resolved from the container).
  - Validators: FluentValidation validators are typically transient.
  - Prompt Templating Engines: If they are stateless and lightweight.
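Under these assumptions, a composition root for such a hierarchy might look like the sketch below. The implementation class names (OnnxInferenceEngine, ChatSessionRepository, GptTokenizerService) are illustrative, not from a specific library:

```csharp
var builder = WebApplication.CreateBuilder(args);

// Singleton: heavy, thread-safe, loaded once for the app lifetime
builder.Services.AddSingleton<IInferenceEngine, OnnxInferenceEngine>();

// Scoped: one instance per HTTP request
builder.Services.AddScoped<IChatSessionRepository, ChatSessionRepository>();
builder.Services.AddDbContext<AiDbContext>(); // AddDbContext registers Scoped by default

// Transient: cheap, stateless, a fresh instance per injection
builder.Services.AddTransient<ITokenizerService, GptTokenizerService>();
```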
Thread Safety Deep Dive
For the Singleton IInferenceEngine, thread safety is paramount. Let's analyze the internal structure of a Singleton AI service.
Unsafe Singleton:
public class UnsafeOnnxEngine : IInferenceEngine
{
private float[] _outputBuffer; // Mutable state
public float[] Predict(float[] input)
{
// RACE CONDITION: Multiple threads might overwrite _outputBuffer simultaneously
_outputBuffer = new float[1024];
// ... inference logic ...
return _outputBuffer;
}
}
Safe Singleton (Stateless Operation): The Singleton should act as a stateless gateway to the underlying model. Any state required for a specific prediction must be passed in and returned, or allocated locally within the method (stack or heap, but isolated per call).
public class SafeOnnxEngine : IInferenceEngine
{
// The underlying session is thread-safe (per ONNX documentation)
private readonly InferenceSession _session;
public SafeOnnxEngine(InferenceSession session)
{
_session = session;
}
public float[] Predict(float[] input)
{
// No instance fields are modified here.
// The 'input' and 'output' are local to the thread.
var inputs = new List<NamedOnnxValue> { ... };
using var results = _session.Run(inputs);
// Copy the first output tensor into a new, call-local array
return results.First().AsEnumerable<float>().ToArray();
}
}
Summary of Selection Criteria
To choose the correct lifetime, apply this decision matrix:
- Is the service expensive to create (e.g., loads a file, opens a network connection, initializes an AI model)?
- Yes: Singleton (ensure thread safety).
- No: Proceed to next question.
- Does the service need to maintain state across multiple method calls within a single request?
- Yes: Scoped.
- No: Proceed to next question.
- Is the service stateless and lightweight?
- Yes: Transient.
By mastering these lifetimes, you ensure that your AI API remains responsive. You prevent the "cold start" penalty of reloading models for every user while avoiding the memory bloat of retaining unnecessary data. You create a robust architecture where the heavy lifting (Singleton Inference) is separated from the request-specific logic (Scoped/Transient), mirroring the efficient operation of a well-run industrial kitchen.
Basic Code Example
Here is the 'Hello World' level code example for Basic Dependency Injection Service Lifetimes in ASP.NET Core, focusing on AI model inference components.
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using System;
using System.Threading;
using System.Threading.Tasks;
// --- 1. Domain Models ---
// Represents the input data for our AI model.
public record TextInput(string Text);
// Represents the output prediction from our AI model.
public record TextPrediction(string Label, float Confidence);
// --- 2. AI Inference Service Interfaces & Implementations ---
// Interface defining the contract for our AI inference engine.
public interface IInferenceEngine
{
Task<TextPrediction> PredictAsync(TextInput input);
}
// SCOPED IMPLEMENTATION:
// Simulates an inference engine that holds state (like a DB context or a non-thread-safe ML.NET pipeline).
// In a real scenario, this might wrap an ML.NET PredictionEngine<TData, TPrediction>.
// WARNING: PredictionEngine<TData, TPrediction> is NOT thread-safe. It must be instantiated per scope.
public class ScopedInferenceEngine : IInferenceEngine
{
private readonly Guid _instanceId = Guid.NewGuid(); // Simulating state/identity
private readonly Random _random = new Random(); // Simulating internal state
public ScopedInferenceEngine()
{
Console.WriteLine($"[Scoped] InferenceEngine created. Instance ID: {_instanceId}");
}
public Task<TextPrediction> PredictAsync(TextInput input)
{
// Simulate AI processing delay
Thread.Sleep(50);
// Simulate a prediction result
string label = input.Text.Contains("error", StringComparison.OrdinalIgnoreCase) ? "Negative" : "Positive";
float confidence = (float)_random.NextDouble() * (1.0f - 0.5f) + 0.5f; // 0.5 to 1.0
return Task.FromResult(new TextPrediction(label, confidence));
}
}
// SINGLETON IMPLEMENTATION:
// Simulates a heavy, thread-safe ONNX Runtime inference session.
// This loads the model into memory once and serves all requests.
public class SingletonInferenceEngine : IInferenceEngine
{
private readonly Guid _instanceId = Guid.NewGuid();
public SingletonInferenceEngine()
{
Console.WriteLine($"[Singleton] InferenceEngine created. Instance ID: {_instanceId}");
// In a real app, heavy model loading (e.g., OnnxRuntime.InferenceSession) happens here.
}
public Task<TextPrediction> PredictAsync(TextInput input)
{
// Simulate AI processing delay
Thread.Sleep(50);
// Simulate a prediction result
string label = input.Text.Contains("error", StringComparison.OrdinalIgnoreCase) ? "Negative" : "Positive";
float confidence = 0.99f; // High confidence for singleton demo
return Task.FromResult(new TextPrediction(label, confidence));
}
}
// --- 3. Application Logic (Simulating Request Handling) ---
public class RequestSimulator
{
private readonly IInferenceEngine _engine;
// The dependency is injected here. The lifetime of 'engine' depends on how it was registered.
public RequestSimulator(IInferenceEngine engine)
{
_engine = engine;
}
public async Task ProcessRequestAsync(string requestText)
{
Console.WriteLine($" -> Processing request: '{requestText}'");
var input = new TextInput(requestText);
var prediction = await _engine.PredictAsync(input);
Console.WriteLine($" -> Prediction: {prediction.Label} (Confidence: {prediction.Confidence:F2})");
}
}
// --- 4. Main Program Execution ---
class Program
{
static async Task Main(string[] args)
{
Console.WriteLine("=== DEMONSTRATING SERVICE LIFETIMES ===\n");
// --- SCENARIO A: SCOPED LIFETIME ---
// Services are created once per client request (scope).
Console.WriteLine("--- SCENARIO 1: SCOPED LIFETIME (Simulating Web Request) ---");
var scopedProvider = new ServiceCollection()
.AddScoped<IInferenceEngine, ScopedInferenceEngine>()
.AddTransient<RequestSimulator>() // RequestSimulator must also be registered to be resolvable
.BuildServiceProvider();
// Simulate Request 1
using (var scope1 = scopedProvider.CreateScope())
{
var processor1 = scope1.ServiceProvider.GetRequiredService<RequestSimulator>();
await processor1.ProcessRequestAsync("Hello AI");
}
// Simulate Request 2 (New Scope = New Instance)
using (var scope2 = scopedProvider.CreateScope())
{
var processor2 = scope2.ServiceProvider.GetRequiredService<RequestSimulator>();
await processor2.ProcessRequestAsync("Another Request");
}
Console.WriteLine("Notice: Two different InferenceEngine instances were created.\n");
// --- SCENARIO B: SINGLETON LIFETIME ---
// Service is created once and shared throughout the application lifetime.
Console.WriteLine("--- SCENARIO 2: SINGLETON LIFETIME (Simulating Shared Resource) ---");
var singletonProvider = new ServiceCollection()
// Note: We register RequestSimulator as Transient here so we can resolve it multiple times
// in the main flow, but the IInferenceEngine it consumes is Singleton.
.AddSingleton<IInferenceEngine, SingletonInferenceEngine>()
.AddTransient<RequestSimulator>()
.BuildServiceProvider();
// Simulate Request 1
var processor1 = singletonProvider.GetRequiredService<RequestSimulator>();
await processor1.ProcessRequestAsync("Request A");
// Simulate Request 2
var processor2 = singletonProvider.GetRequiredService<RequestSimulator>();
await processor2.ProcessRequestAsync("Request B");
Console.WriteLine("Notice: The SAME InferenceEngine instance was reused for both requests.\n");
// Keep console open
Console.WriteLine("Press any key to exit...");
Console.ReadKey();
}
}
Detailed Explanation
Here is the line-by-line breakdown of the code example, explaining how dependency injection lifetimes function within an AI service context.
1. Domain and Service Definitions
- TextInput / TextPrediction: Simple records representing the data contract for our AI model. In a production ML.NET or ONNX scenario, these would map to input tensor structures or schema classes.
- IInferenceEngine: The abstraction. This is crucial for DI. It decouples the application logic from the specific implementation (e.g., ONNX Runtime vs. ML.NET).
- ScopedInferenceEngine:
  - The field private readonly Guid _instanceId = Guid.NewGuid(); creates a unique ID for every instantiation. This is the visual proof of the lifetime.
  - The constructor writes to the console. In a real web app, this happens when the scope is created (usually at the start of an HTTP request).
  - Why Scoped? ML.NET's PredictionEngine<TData, TPrediction> is not thread-safe. If you register it as a Singleton, concurrent HTTP requests will corrupt its internal state, leading to incorrect predictions or crashes. Scoped ensures one engine per request.
- SingletonInferenceEngine:
  - The constructor runs only once. In a real app, this is where new InferenceSession("model.onnx") is called. Loading a 500MB model on every request is inefficient; Singletons solve this.
  - Thread Safety: This class is designed to be thread-safe. It holds no mutable state (the _instanceId is readonly). It simply executes the model inference.
2. The Request Simulator (Consumer)
- RequestSimulator:
  - The constructor accepts IInferenceEngine. This is Constructor Injection.
  - Crucial Nuance: The RequestSimulator consumer must itself be resolvable from the container; typically it is registered as Transient (created every time it's needed) or Scoped (created once per request).
  - ProcessRequestAsync calls _engine.PredictAsync. We don't know (or care) if _engine is a fresh instance or a reused one; the DI container handles that based on the registration.
3. Execution Flow (Scenario A: Scoped)
- Registration: AddScoped<IInferenceEngine, ScopedInferenceEngine>() tells the container: "Create one instance of this engine per scope."
- Scope 1 Creation: scopedProvider.CreateScope() simulates the start of an HTTP request.
- Resolution: GetRequiredService<RequestSimulator> is called. The container sees RequestSimulator needs IInferenceEngine. Since we are inside Scope 1, it creates a new ScopedInferenceEngine (Instance ID: Guid A).
- Scope 1 Disposal: The using block ends. The scope is disposed. In a real app, ScopedInferenceEngine (and its underlying DbContext or non-thread-safe ML object) is disposed of here, freeing memory.
- Scope 2 Creation: A new HTTP request arrives.
- Resolution: A new ScopedInferenceEngine (Instance ID: Guid B) is created. This prevents data leakage between users.
4. Execution Flow (Scenario B: Singleton)
- Registration: AddSingleton<IInferenceEngine, SingletonInferenceEngine>() tells the container: "Create one instance when first requested, and keep it alive until the app shuts down."
- First Resolution: GetRequiredService<RequestSimulator> is called. The container creates SingletonInferenceEngine (Instance ID: Guid C) and holds onto this instance.
- Second Resolution: GetRequiredService<RequestSimulator> is called again. The container does not create a new engine. It retrieves the existing instance (Guid C) and injects it into the new RequestSimulator.
- Visual Proof: The constructor message [Singleton] InferenceEngine created appears only once in the console output, while the ProcessRequestAsync logic runs twice.
Common Pitfalls
1. Injecting Scoped Services into Singleton Services (Captive Dependency)
This is the most dangerous mistake in AI API development.
- The Scenario: You register your heavy ONNX/ML.NET engine as Singleton (to save memory/loading time). You also register a DbContext (database context) as Scoped (standard practice).
- The Mistake: You inject the DbContext into the SingletonInferenceEngine.
- The Consequence: The Singleton service captures the Scoped service instance from the first resolution. That DbContext instance is now "trapped" inside the Singleton and never disposed of correctly. As subsequent requests arrive, they all use the Singleton, which uses the trapped, possibly disposed, or stale DbContext. This causes memory leaks, concurrency exceptions, and data corruption.
- The Fix:
  - Option A: Do not use Singleton for services that depend on Scoped services. Use Scoped for the inference engine as well (sacrificing some performance for safety).
  - Option B: Use the IServiceScopeFactory inside the Singleton service. Create a scope manually when inference is needed:

public class SingletonInferenceEngine : IInferenceEngine
{
    private readonly IServiceScopeFactory _scopeFactory;
    public SingletonInferenceEngine(IServiceScopeFactory scopeFactory)
    {
        _scopeFactory = scopeFactory;
    }
    public async Task<TextPrediction> PredictAsync(TextInput input)
    {
        // Create a fresh scope for this specific operation
        using var scope = _scopeFactory.CreateScope();
        var dbContext = scope.ServiceProvider.GetRequiredService<MyDbContext>();
        // Use the dbContext here...
        return await Task.FromResult(new TextPrediction("Result", 1.0f));
    }
}
2. Storing Mutable State in Singletons
- The Scenario: You register an InferenceEngine as a Singleton, but inside the class you have a private List<float> _cache or a mutable counter.
- The Consequence: Since the Singleton instance is shared across all threads (HTTP requests), Request A and Request B will race to modify _cache. This leads to race conditions, NullReferenceException, or corrupted data.
- The Fix: Singletons must be stateless (or hold immutable state). If you need caching, use a thread-safe collection like ConcurrentDictionary or MemoryCache.
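As a sketch of that fix, the hypothetical cache below uses ConcurrentDictionary.GetOrAdd so concurrent requests cannot corrupt the collection (the embedding computation is a trivial stand-in):

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical singleton-safe cache. GetOrAdd makes the check-then-insert
// race-free: the value factory may run more than once under contention,
// but only one value is ever stored per key.
public class EmbeddingCache
{
    private readonly ConcurrentDictionary<string, float[]> _cache = new();

    public float[] GetOrCompute(string text) =>
        _cache.GetOrAdd(text, t => new float[] { t.Length }); // stand-in for a real embedding

    public int Count => _cache.Count;
}

public static class Demo
{
    public static async Task Main()
    {
        var cache = new EmbeddingCache(); // one instance, shared like a Singleton

        // 50 concurrent requests for the same key: exactly one entry results.
        var tasks = Enumerable.Range(0, 50)
            .Select(_ => Task.Run(() => cache.GetOrCompute("hello")));
        await Task.WhenAll(tasks);

        Console.WriteLine(cache.Count); // prints 1
    }
}
```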
3. Over-Registration of HttpClient
- The Context: AI APIs often call external services or other internal microservices.
- The Mistake: Creating new HttpClient() instances inside a service for every call.
- The Consequence: This leads to socket exhaustion. HttpClient is designed to be reused.
- The Fix: Use IHttpClientFactory. Register clients via AddHttpClient (the factory itself is effectively a singleton that manages handler lifetimes internally) and inject HttpClient via Typed Clients or Named Clients.
Visualizing Lifetimes
The following diagram illustrates the flow of object creation and disposal in the Scoped vs. Singleton scenarios demonstrated above.
The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.
Code License: All code examples are released under the MIT License. Github repo.
Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.