
Chapter 13: Object Detection with YOLO and ONNX

Theoretical Foundations

Object detection represents one of the most computationally intensive yet transformative tasks in Edge AI. Unlike simple image classification, which assigns a single label to an entire image, object detection must simultaneously localize (where is it?) and classify (what is it?) multiple entities within a single frame. This requires a fundamental shift in how we process visual data, moving from a single prediction to a complex output tensor containing coordinates, confidence scores, and class probabilities.

The You Only Look Once (YOLO) architecture revolutionized this field by reframing object detection as a single regression problem to bounding boxes and class probabilities directly from full images in one evaluation. Unlike its predecessors, which treated object detection as a classification problem on thousands of potential regions (R-CNN, Fast R-CNN), YOLO divides the image into a grid and, for each grid cell, predicts multiple bounding boxes and class probabilities simultaneously. This "single shot" approach is what makes YOLO uniquely suited for real-time Edge AI applications where latency is critical.

The YOLO Architecture: A Grid-Based Perspective

Imagine looking at a busy intersection through a window pane. Traditional detection methods would be like taking thousands of snapshots of individual small squares, analyzing each one separately, and then trying to piece together the bigger picture. YOLO, conversely, looks at the entire intersection once and immediately identifies where cars, pedestrians, and cyclists are located relative to the grid lines of the window.

In technical terms, YOLO divides the input image into an \( S \times S \) grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object. Each grid cell predicts \( B \) bounding boxes and \( C \) class probabilities. Each bounding box is defined by five values: \( (x, y, w, h, \mathrm{confidence}) \). Here, \( (x, y) \) represents the center coordinates relative to the grid cell bounds, while \( (w, h) \) represents the width and height relative to the whole image. The confidence score, defined in the original YOLO paper as \( \Pr(\mathrm{Object}) \times \mathrm{IoU}^{\mathrm{truth}}_{\mathrm{pred}} \), reflects both the model's certainty that the box contains an object and how accurately it thinks the box fits that object.

The output tensor dimensions follow directly from this grid scheme: \( S \times S \times (B \times 5 + C) \). The original YOLOv1, trained on PASCAL VOC (20 classes) with \( B = 2 \), therefore produced a \( 7 \times 7 \times 30 \) tensor (\( 2 \times 5 + 20 = 30 \)). For the COCO dataset (80 classes), YOLOv3 predicts at three scales, each outputting \( 3 \times (5 + 80) = 255 \) channels, giving tensors from \( 13 \times 13 \times 255 \) (coarse grid) to \( 52 \times 52 \times 255 \) (fine grid). Modern anchor-based versions such as YOLOv5 often export a concatenated tensor of shape \( [batch, num\_anchors, grid\_h, grid\_w, 85] \) (4 box coordinates + 1 objectness score + 80 class scores), while the anchor-free YOLOv8 exports \( [batch, 84, 8400] \) (4 box coordinates + 80 class scores, with no separate objectness score).

The ONNX Runtime: The Universal Neural Network Translator

To run these models in C#, we rely on the Open Neural Network Exchange (ONNX) format. ONNX is the lingua franca of deep learning. Just as JSON or XML standardized data exchange between disparate systems, ONNX standardizes the representation of neural networks. A model trained in Python using PyTorch or TensorFlow can be exported to ONNX and then loaded into a C# application via the ONNX Runtime.

In the context of Edge AI, the ONNX Runtime is not merely a loader; it is a high-performance inference engine. It abstracts away the underlying hardware specifics (CPU, GPU, NPU) while providing a unified API for execution. For C#, the Microsoft.ML.OnnxRuntime NuGet package provides the necessary bindings. The critical architectural decision here is the separation of the model graph (the "what") from the execution provider (the "how"). This allows the same C# code to switch seamlessly between running on a CPU (using the CPU execution provider) or a GPU (using the CUDA or DirectML execution provider) without changing a single line of inference logic.

The Inference Pipeline: A Multi-Stage Assembly Line

Running a YOLO model locally is not a single function call; it is a pipeline resembling a manufacturing assembly line. Each stage transforms the data until a final product (a list of detected objects) is ready.

  1. Input Pre-processing (The Standardization Phase): Neural networks are mathematical functions that expect inputs within a specific range and dimension. A raw image from a webcam is typically a 3D tensor of integers (0-255) with dimensions [Height, Width, Channels]. YOLO expects a float tensor of shape [Batch, Channels, Height, Width] normalized to 0.0–1.0.

    • Resizing: The image must be resized to the model's input resolution (e.g., 640x640). This introduces aspect ratio distortion unless letterboxing (adding gray bars) is used.
    • Normalization: Pixel values are divided by 255.0.
    • Transposition: The channel dimension is moved from the last (HWC) to the second (CHW) position, which is the standard for PyTorch-trained models exported to ONNX.
    • Batching: Even for a single image, ONNX Runtime expects a batch dimension (e.g., [1, 3, 640, 640]).
  2. Inference (The Black Box): The pre-processed tensor is fed into the ONNX Runtime session. Internally, the runtime traverses the ONNX graph, executing operators (Conv, MaxPool, Concat) layer by layer. For Edge AI, the efficiency of this step depends heavily on the execution provider. On a CPU, the runtime uses highly optimized, SIMD-vectorized linear algebra kernels (its built-in MLAS library). On a GPU, it offloads matrix multiplications to the GPU's massively parallel architecture.

    • Analogy: Think of the ONNX Runtime as a universal translator for a supercomputer. It takes the C# instructions, translates them into the specific dialect of the hardware (CPU or GPU), and retrieves the result back into a C#-friendly format.
  3. Post-processing (The Interpretation Phase): The output of the ONNX model is raw data—a massive array of floating-point numbers. To a human, this looks like noise. To the post-processor, it is a map of potential objects.

    • Sigmoid Activation: YOLO uses sigmoid functions to bound the center coordinates (forcing them between 0 and 1 relative to the grid cell). Class probabilities used softmax in early versions (YOLOv1/v2), but from YOLOv3 onward they use independent sigmoids, which permits multi-label predictions.
    • Confidence Thresholding: We filter out predictions with a confidence score below a certain threshold (e.g., 0.5). This eliminates the vast majority of false positives.
    • Non-Maximum Suppression (NMS): This is the most critical step for accuracy. YOLO often predicts multiple overlapping bounding boxes for the same object (e.g., one tight box and one slightly looser box). NMS looks at these overlapping boxes, sorts them by confidence score, and keeps only the best one while suppressing others that have a high Intersection over Union (IoU) with the selected box.
    • Coordinate Mapping: Finally, the normalized coordinates (0–1) must be mapped back to the original image dimensions. If the image was resized or letterboxed, we must reverse those transformations to place the bounding box correctly on the original video feed.
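The letterboxing transform mentioned in step 1, and its reversal in the coordinate-mapping step, reduce to a few lines of arithmetic. A minimal sketch (the 640×640 target and the type names are illustrative, not from the chapter's final implementation):

```csharp
using System;

public readonly record struct Letterbox(float Scale, float PadX, float PadY);

public static class LetterboxMath
{
    // Compute the uniform scale and the padding that centers the resized
    // image inside a square model input (e.g. 640x640).
    public static Letterbox Fit(int srcW, int srcH, int target = 640)
    {
        float scale = Math.Min((float)target / srcW, (float)target / srcH);
        float padX = (target - srcW * scale) / 2f;
        float padY = (target - srcH * scale) / 2f;
        return new Letterbox(scale, padX, padY);
    }

    // Map a point from model-input space back to original-image space by
    // undoing the padding first, then the scaling.
    public static (float X, float Y) ToOriginal(Letterbox lb, float x, float y)
        => ((x - lb.PadX) / lb.Scale, (y - lb.PadY) / lb.Scale);
}
```

For example, a 1280×720 frame fits a 640×640 input at scale 0.5 (becoming 640×360) with 140-pixel bars above and below; a detection at (320, 320) in input space maps back to (640, 360), the center of the original frame.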

The Role of Modern C# in Edge AI

C# has evolved significantly, making it a first-class citizen for high-performance AI workloads. The language features introduced in recent versions (C# 8.0 through 12) allow for expressive, safe, and efficient code that rivals Python in readability while offering superior execution speed in many scenarios.

Memory Management and Span: In Python, memory management is often abstracted away, leading to frequent garbage collection pauses. In C#, we have granular control. For real-time video processing, allocating and deallocating large byte arrays for every frame can cause GC pressure, leading to stuttering. Modern C# utilizes Span<T> and Memory<T> to work with slices of memory without allocating new objects.

  • Analogy: Imagine a chef (the CPU) preparing a meal (processing video frames). If the chef has to run to the pantry (heap allocation) for every single ingredient, the cooking is slow. Span<T> is like having all ingredients laid out on the counter within arm's reach. It allows the chef to slice and dice data in contiguous memory blocks without the overhead of memory allocation.
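As a concrete sketch of this idea, a Span<T> can view one row of a frame stored in a flat byte buffer without copying anything (the buffer layout and helper name here are illustrative assumptions):

```csharp
using System;

public static class FrameSlicer
{
    // Return a zero-copy view over one row of a frame stored as a flat
    // interleaved byte buffer (e.g. BGR24). Slicing allocates nothing:
    // the Span is just a pointer-and-length view over the existing array.
    public static Span<byte> GetRow(byte[] frame, int width, int bytesPerPixel, int row)
        => frame.AsSpan(row * width * bytesPerPixel, width * bytesPerPixel);
}
```

Because the span aliases the original array, writing through it mutates the frame in place, which is exactly what a per-frame normalization pass needs without generating garbage for the collector.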

Async/Await for Non-Blocking Pipelines: Edge AI applications often involve capturing video frames, processing them, and rendering them to a UI simultaneously. If the inference takes 50ms (20 FPS), a synchronous approach would freeze the UI. Modern C# async/await allows us to overlap these operations.

  • Concept Reference: In Book 8, we discussed asynchronous streams (IAsyncEnumerable<T>). This is crucial here. We can create an asynchronous stream that yields detected objects as they are processed, allowing the UI to consume them incrementally without waiting for the entire batch to finish.
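A sketch of that shape, with a simulated delay standing in for the real inference call (the method and type names are illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class DetectionStream
{
    // Yields one result per frame as soon as it is ready; a UI can
    // `await foreach` over this and stay responsive while later frames
    // are still being processed.
    public static async IAsyncEnumerable<string> DetectAllAsync(IEnumerable<string> frames)
    {
        foreach (var frame in frames)
        {
            // Stand-in for the real async inference step (e.g. session.Run
            // dispatched via Task.Run); here we only simulate latency.
            await Task.Delay(1);
            yield return $"detections for {frame}";
        }
    }
}
```

The consumer side is simply `await foreach (var result in DetectionStream.DetectAllAsync(frames)) Render(result);` — each item arrives incrementally instead of after the whole batch.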

Records and Pattern Matching for Configuration: Configuring YOLO models (thresholds, model paths, execution providers) is complex. C# Records (public record ModelConfig(string Path, float ConfidenceThreshold, float IoUThreshold);) provide immutable data structures that are perfect for passing configuration through the pipeline. Combined with Pattern Matching, we can dynamically switch execution providers based on hardware availability:

using Microsoft.ML.OnnxRuntime;

// Note: ONNX Runtime's C# API selects providers through SessionOptions
// rather than standalone provider objects. HardwareCapabilities is a
// hypothetical record describing the detected hardware; the GPU branches
// require the corresponding GPU-enabled ONNX Runtime packages.
public SessionOptions GetSessionOptions(HardwareCapabilities hardware)
{
    return hardware switch
    {
        { HasNvidiaGpu: true } => SessionOptions.MakeSessionOptionWithCudaProvider(0),
        { HasAmdGpu: true } => SessionOptions.MakeSessionOptionWithRocmProvider(0),
        _ => new SessionOptions() // CPU execution provider (always available)
    };
}

Visualizing the Data Flow

To understand the transformation of data from raw pixels to detected objects, visualize the pipeline. The data moves from a 2D spatial representation to a high-dimensional tensor, and back to a 2D spatial representation.

This diagram illustrates the AI inference pipeline, where raw pixel data is first transformed into a high-dimensional tensor for processing by a hardware-specific execution provider (such as CUDA, ROCm, or CPU) and then converted back into a 2D spatial representation to visualize detected objects.

The "Why" of Edge Inference

Why go through the trouble of running YOLO locally on a C# application instead of sending images to a cloud API?

  1. Latency: Cloud inference introduces network round-trip time. For autonomous drones or industrial robotics, 50ms of network latency can be the difference between a successful maneuver and a crash. Local inference via ONNX Runtime is limited only by the hardware's compute capability.
  2. Privacy: Video data processed locally never leaves the device. This is non-negotiable for applications in healthcare, security, or private residences.
  3. Cost and Reliability: Cloud inference costs scale with usage. Local inference has a fixed hardware cost. Furthermore, an Edge AI application works offline; it does not fail if the internet connection drops.

Architectural Implications of ONNX in C#

When building this in C#, we must design for the lifecycle of the ONNX session. The InferenceSession object is heavy; it loads the model into memory and prepares the execution graph. It should be instantiated once and reused, not created for every frame.
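The create-once lifecycle described above can be enforced with Lazy<T>. In this sketch the factory delegate stands in for `new InferenceSession(modelPath)`, and the SharedResource name is a hypothetical helper, not an ONNX Runtime type:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// A minimal create-once holder for a heavy resource such as an ONNX
// InferenceSession. LazyThreadSafetyMode.ExecutionAndPublication guarantees
// the factory runs exactly once, even if several threads race to touch
// Instance during startup.
public sealed class SharedResource<T> where T : class
{
    private readonly Lazy<T> _lazy;

    public SharedResource(Func<T> factory) =>
        _lazy = new Lazy<T>(factory, LazyThreadSafetyMode.ExecutionAndPublication);

    public T Instance => _lazy.Value;
}
```

Usage would look like `var session = new SharedResource<InferenceSession>(() => new InferenceSession("yolov8n.onnx"));`, with every frame handler calling `session.Instance.Run(...)` instead of constructing its own session.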

Furthermore, we must consider thread safety. The ONNX Runtime is thread-safe for inference execution, meaning multiple threads can call session.Run() concurrently using different input tensors. This allows for parallel processing of video frames if the hardware has multiple cores, maximizing throughput.

However, the post-processing (NMS) is inherently sequential if implemented naively. Modern C# patterns allow us to parallelize the sorting and filtering steps using PLINQ (Parallel LINQ), but we must be careful with shared state. Immutable collections or ConcurrentBag are often used here to accumulate results without race conditions.
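For instance, the confidence-filtering and ranking pass over the raw candidates parallelizes cleanly with PLINQ, because no shared state is mutated (the Candidate record shape is illustrative):

```csharp
using System;
using System.Linq;

public readonly record struct Candidate(int ClassId, float Score);

public static class CandidateFilter
{
    // Filter and rank candidates in parallel. AsParallel() partitions the
    // work across cores; the query touches no shared mutable state, so this
    // stage needs neither locks nor a ConcurrentBag.
    public static Candidate[] TopByScore(Candidate[] all, float threshold, int k)
        => all.AsParallel()
              .Where(c => c.Score >= threshold)
              .OrderByDescending(c => c.Score)
              .Take(k)
              .ToArray();
}
```

The subsequent greedy NMS loop still consumes the sorted survivors sequentially, but the expensive scan over thousands of raw candidates has already been spread across cores.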

Summary

In this section, we established that object detection via YOLO is a grid-based regression problem. We identified ONNX as the bridge between Python-trained models and C# production environments. We detailed the three-stage pipeline: Pre-processing (standardization), Inference (execution), and Post-processing (interpretation). Finally, we highlighted how modern C# features—specifically Span<T> for memory efficiency and async/await for responsiveness—are essential for building high-performance Edge AI applications that can process video streams in real-time on local hardware.

The next section will move from theory to practice, demonstrating how to load the ONNX model and prepare the input tensor using the specific APIs provided by the Microsoft.ML.OnnxRuntime NuGet package.

Basic Code Example

using System;
using System.Collections.Generic;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;
using System.Linq;
using System.Numerics;
using System.Runtime.InteropServices;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

namespace YoloHelloWorld
{
    class Program
    {
        static void Main(string[] args)
        {
            // 1. SETUP: Define paths and parameters
            // In a real scenario, these would be command-line arguments or config files.
            string modelPath = "yolov8n.onnx"; // A small, fast YOLOv8 Nano model
            string imagePath = "input.jpg";    // A sample image to detect objects in
            string outputPath = "output.jpg";  // Where to save the annotated image

            Console.WriteLine($"Loading YOLO model from: {modelPath}");

            // 2. LOAD MODEL: Initialize the ONNX Runtime Inference Session
            // We use the CPU execution provider for simplicity on any machine.
            var sessionOptions = new SessionOptions();
            sessionOptions.LogSeverityLevel = OrtLoggingLevel.ORT_LOGGING_LEVEL_WARNING;

            using var session = new InferenceSession(modelPath, sessionOptions);

            // 3. PRE-PROCESS IMAGE: Convert image to the tensor format YOLO expects
            // YOLOv8 expects a 640x640 RGB image normalized to 0-1.
            var inputTensor = PreProcessImage(imagePath, out int originalWidth, out int originalHeight);

            // 4. PREPARE INPUTS: Create the NamedOnnxValue for the session
            // Note: YOLOv8 uses 'images' as the input name. Check your specific model with Netron.
            var inputs = new List<NamedOnnxValue>
            {
                NamedOnnxValue.CreateFromTensor("images", inputTensor)
            };

            Console.WriteLine("Running inference...");

            // 5. RUN INFERENCE: Execute the model
            using var results = session.Run(inputs);

            // 6. POST-PROCESS: Extract raw output and apply Non-Maximum Suppression (NMS)
            // YOLOv8 output shape is (1, 4 + num_classes, 8400); we read it directly in that layout.
            var detections = PostProcessYoloV8(results, originalWidth, originalHeight);

            // 7. RENDER: Draw bounding boxes on the original image
            RenderBoxes(imagePath, outputPath, detections);

            Console.WriteLine($"Detection complete. Saved to {outputPath}");
            Console.WriteLine($"Found {detections.Count} objects.");
        }

        // --- HELPER METHODS ---

        static DenseTensor<float> PreProcessImage(string imagePath, out int originalWidth, out int originalHeight)
        {
            using Bitmap bitmap = new Bitmap(imagePath);
            originalWidth = bitmap.Width;
            originalHeight = bitmap.Height;

            // YOLOv8 standard input size
            const int targetSize = 640;

            // Resize image while preserving aspect ratio (letterbox padding)
            // For simplicity in this "Hello World", we will resize directly without padding
            // to keep the code concise, though padding is preferred for accuracy.
            using Bitmap resized = new Bitmap(bitmap, new Size(targetSize, targetSize));

            // Lock bits for fast memory access
            var data = resized.LockBits(new Rectangle(0, 0, targetSize, targetSize), 
                ImageLockMode.ReadOnly, PixelFormat.Format24bppRgb);

            try
            {
                // Create the tensor: Shape (1, 3, 640, 640) -> Batch, Channels, Height, Width
                var tensor = new DenseTensor<float>(new[] { 1, 3, targetSize, targetSize });

                // Copy pixels and normalize to 0-1
                // Note: In production, use Span<T> for maximum performance.
                // This unsafe block requires <AllowUnsafeBlocks> in the project file.
                unsafe
                {
                    byte* ptr = (byte*)data.Scan0.ToPointer();
                    int bytesPerPixel = 3; // 24bpp

                    for (int y = 0; y < targetSize; y++)
                    {
                        for (int x = 0; x < targetSize; x++)
                        {
                            int rowOffset = y * data.Stride;
                            int colOffset = x * bytesPerPixel;

                            // Bitmap is usually BGR, but YOLO expects RGB.
                            // We will just grab R, G, B and normalize.
                            byte b = ptr[rowOffset + colOffset];
                            byte g = ptr[rowOffset + colOffset + 1];
                            byte r = ptr[rowOffset + colOffset + 2];

                            // Normalize (0-255 -> 0.0-1.0) in RGB order.
                            // YOLOv8 expects exactly this 0-1 scaling; some other
                            // model families additionally subtract an ImageNet mean
                            // and divide by a std -- check your model's export config.
                            tensor[0, 0, y, x] = r / 255.0f;
                            tensor[0, 1, y, x] = g / 255.0f;
                            tensor[0, 2, y, x] = b / 255.0f;
                        }
                    }
                }
                return tensor;
            }
            finally
            {
                resized.UnlockBits(data);
            }
        }

        static List<Detection> PostProcessYoloV8(
            IDisposableReadOnlyCollection<DisposableNamedOnnxValue> results,
            int originalWidth, int originalHeight)
        {
            // 1. Extract the tensor data from the first (and only) output.
            // Output shape for YOLOv8 is (1, 84, 8400) -> (Batch, Coords+Classes, Anchors)
            var tensor = results.First().AsTensor<float>();
            int dimensions = tensor.Dimensions[1]; // Should be 84 (4 coords + 80 classes)
            int candidates = tensor.Dimensions[2]; // Should be 8400

            var detections = new List<Detection>();
            float confidenceThreshold = 0.25f; // Minimum confidence to consider a detection
            float iouThreshold = 0.45f;        // Intersection over Union threshold for NMS

            // 2. Iterate over all 8400 candidates
            for (int i = 0; i < candidates; i++)
            {
                // Find the class with the highest score
                float maxScore = 0;
                int maxClass = -1;

                // Skip the first 4 coordinates (cx, cy, w, h)
                for (int c = 4; c < dimensions; c++)
                {
                    float score = tensor[0, c, i];
                    if (score > maxScore)
                    {
                        maxScore = score;
                        maxClass = c - 4; // Shift index back to 0-based class ID
                    }
                }

                // Filter by confidence
                if (maxScore < confidenceThreshold) continue;

                // Extract coordinates (in pixels relative to the 640x640 input)
                float cx = tensor[0, 0, i];
                float cy = tensor[0, 1, i];
                float w = tensor[0, 2, i];
                float h = tensor[0, 3, i];

                // Convert center-x, center-y, width, height to a top-left box,
                // scaling from 640x640 input space back to the original image size.
                // If you used letterboxing, you must also undo the padding here.
                float scaleX = originalWidth / 640f;
                float scaleY = originalHeight / 640f;
                float x = (cx - w / 2) * scaleX;
                float y = (cy - h / 2) * scaleY;
                w *= scaleX;
                h *= scaleY;

                detections.Add(new Detection
                {
                    Box = new RectangleF(x, y, w, h),
                    ClassId = maxClass,
                    Score = maxScore
                });
            }

            // 3. Apply Non-Maximum Suppression (NMS)
            // This removes overlapping boxes for the same object.
            return ApplyNMS(detections, iouThreshold);
        }

        static List<Detection> ApplyNMS(List<Detection> detections, float iouThreshold)
        {
            // Sort by score descending
            var sorted = detections.OrderByDescending(d => d.Score).ToList();
            var result = new List<Detection>();

            while (sorted.Count > 0)
            {
                // Take the highest scoring detection
                var best = sorted[0];
                result.Add(best);
                sorted.RemoveAt(0);

                // Remove any remaining detections that overlap significantly with the best one
                sorted.RemoveAll(d => CalculateIoU(best.Box, d.Box) > iouThreshold);
            }

            return result;
        }

        static float CalculateIoU(RectangleF a, RectangleF b)
        {
            float areaA = a.Width * a.Height;
            float areaB = b.Width * b.Height;

            if (areaA == 0 || areaB == 0) return 0;

            float intersectionX = Math.Max(0, Math.Min(a.Right, b.Right) - Math.Max(a.Left, b.Left));
            float intersectionY = Math.Max(0, Math.Min(a.Bottom, b.Bottom) - Math.Max(a.Top, b.Top));
            float intersectionArea = intersectionX * intersectionY;

            float unionArea = areaA + areaB - intersectionArea;
            return unionArea == 0 ? 0 : intersectionArea / unionArea;
        }

        static void RenderBoxes(string inputPath, string outputPath, List<Detection> detections)
        {
            using Bitmap bitmap = new Bitmap(inputPath);
            using Graphics g = Graphics.FromImage(bitmap);

            // Load COCO class names (truncated for brevity)
            var classNames = new[] { "person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat", "traffic light" }; // ... and 70 more

            using Pen pen = new Pen(Color.Red, 3);
            using Font font = new Font("Arial", 12);
            using SolidBrush brush = new SolidBrush(Color.Red);

            foreach (var det in detections)
            {
                // Draw bounding box
                g.DrawRectangle(pen, det.Box.X, det.Box.Y, det.Box.Width, det.Box.Height);

                // Draw label
                string label = classNames.Length > det.ClassId ? classNames[det.ClassId] : $"Class {det.ClassId}";
                string text = $"{label}: {det.Score:P0}";
                g.DrawString(text, font, brush, det.Box.X, det.Box.Y - 20);
            }

            bitmap.Save(outputPath, ImageFormat.Jpeg);
        }
    }

    // Data structure for a single detection
    public class Detection
    {
        public RectangleF Box { get; set; }
        public int ClassId { get; set; }
        public float Score { get; set; }
    }
}

Line-by-Line Explanation

1. Setup and Model Loading

  1. using Directives: We import standard libraries (System, System.Drawing for image manipulation) and the Microsoft.ML.OnnxRuntime package, which is the standard NuGet package for running ONNX models in .NET.
  2. Main Method: The entry point of our console application.
  3. modelPath: Specifies the location of the .onnx file. In this example, we assume yolov8n.onnx (YOLOv8 Nano) is in the same directory. This is a lightweight model suitable for edge devices.
  4. SessionOptions: Configures the ONNX Runtime. Here, we set the logging level to suppress verbose info, keeping the console output clean. We default to the CPU execution provider, which requires no additional drivers.
  5. InferenceSession: This is the core object that loads the ONNX model into memory and prepares it for execution. It parses the model graph and optimizes it for the target hardware.

2. Image Pre-Processing

  1. PreProcessImage: A dedicated method to handle image transformations. Neural networks require specific input formats (tensors), not raw image files.
  2. Bitmap Loading: We load the image using System.Drawing.Bitmap. This is a standard, albeit older, .NET library. For high-performance production apps, libraries like ImageSharp or SkiaSharp are preferred, but System.Drawing is built-in and sufficient for this example.
  3. Resizing: YOLO models are trained on fixed image sizes (e.g., 640x640). We resize the input image to match this dimension. Note: In a production environment, you must implement "Letterboxing" (adding gray padding) to maintain the aspect ratio, otherwise, objects may look distorted, affecting accuracy.
  4. LockBits: Accessing pixel data via GetPixel is extremely slow. LockBits gives us direct memory access to the image buffer (a pointer), allowing for rapid iteration over pixels.
  5. DenseTensor<float>: We create a tensor with shape [1, 3, 640, 640]. This corresponds to [BatchSize, Channels, Height, Width].
  6. Normalization: Neural networks work best with small floating-point numbers. We divide pixel values (0-255) by 255.0f to normalize them to the range 0.0-1.0.
  7. Channel Ordering: Images are typically stored as BGR (Blue-Green-Red) in memory. YOLO models expect RGB. We manually map the channels in the loop (r, g, b).

3. Inference Execution

  1. NamedOnnxValue: The ONNX Runtime expects inputs as a dictionary of named values. The key "images" must match the input layer name defined in the ONNX model (check using Netron tool).
  2. session.Run: This triggers the actual computation. The model executes forward propagation, passing data through layers (Convolution, Pooling, etc.) to produce the output tensor. This is the most computationally expensive line.

4. Post-Processing (YOLO Logic)

  1. Output Shape: YOLOv8 outputs a tensor of shape (1, 84, 8400).
    • 84: 4 coordinates (cx, cy, w, h) + 80 class probabilities (for COCO dataset).
    • 8400: The number of anchor boxes/predictions across the grid.
  2. Candidate Iteration: We loop through all 8400 potential detections. This is a "brute force" approach suitable for the example, though optimized implementations use vectorized operations.
  3. Score Calculation: For each anchor box, we look at classes 4 through 83 (skipping the coordinate data). We find the class with the highest probability.
  4. Confidence Filtering: We discard detections where the highest class score is below 0.25. This removes weak predictions (noise).
  5. Coordinate Conversion: The model outputs pixel coordinates relative to the 640x640 input, not the original image. We scale by originalWidth/640 and originalHeight/640 to map back to original pixel coordinates, and we convert the Center-X/Center-Y format to Top-Left X/Y format for drawing.

5. Non-Maximum Suppression (NMS)

  1. Why NMS?: A single object (e.g., a car) often triggers multiple overlapping bounding boxes. NMS ensures we keep only the best one. (The implementation here is class-agnostic for brevity; production systems usually run NMS per class so that overlapping objects of different classes are not suppressed.)
  2. ApplyNMS:
    • We sort all detections by score (highest to lowest).
    • We pick the highest scoring box and add it to our final results.
    • We calculate the Intersection over Union (IoU) between this box and all remaining boxes.
    • If the IoU is greater than the threshold (e.g., 0.45), it means the boxes are overlapping significantly (likely the same object), so we discard the lower-scoring one.
  3. CalculateIoU: A geometric utility that calculates the ratio of the area of overlap to the area of union between two rectangles.

6. Rendering

  1. RenderBoxes: We reload the original image (to get the full resolution) and draw on it using System.Drawing.Graphics.
  2. Drawing: We iterate through our filtered detections list, draw a Rectangle for the bounding box, and overlay a text string with the class name and confidence score.
  3. Saving: The final image is saved to disk.

Common Pitfalls

  1. Input Tensor Shape Mismatch:

    • Issue: The most common error. ONNX models are strict about input shapes. If the model expects (1, 3, 640, 640) and you provide (1, 640, 640, 3), the runtime will throw an exception.
    • Fix: Always verify the input shape using a tool like Netron. In C#, ensure your DenseTensor dimensions match exactly (Batch, Channels, Height, Width).
  2. Normalization Inconsistency:

    • Issue: The model expects pixels in the range [0, 1] but you feed in [0, 255], or vice versa. This results in predictions that are completely wrong (usually all zeros or garbage values).
    • Fix: Check the model's training configuration. Most modern models (like YOLOv8) use float32 inputs normalized to 0-1. If the model was trained with ImageNet stats (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]), you must apply those specific subtractions and divisions.
  3. Incorrect Output Parsing:

    • Issue: Assuming the output tensor is flat or in a different layout. YOLOv8 transposes the output to (1, 84, 8400). Beginners often try to read it as (1, 8400, 84) without realizing the memory layout is different.
    • Fix: Use tensor.Dimensions to debug the shape at runtime. Print the dimensions to the console before looping.
  4. Forgetting to Dispose:

    • Issue: Bitmap and OrtValue wrap unmanaged memory. Failing to dispose of them can lead to memory leaks, especially in video processing loops.
    • Fix: Use using statements for Bitmap, Graphics, InferenceSession, and OrtValue.
  5. Aspect Ratio Distortion:

    • Issue: Simply resizing an image to 640x640 stretches or squashes objects. A tall person might look short, confusing the model.
    • Fix: Implement "Letterboxing". Resize the image so the longest side fits 640, then pad the shorter side with gray pixels (128) to reach 640. When converting coordinates back, you must account for this padding.

Visualizing the Data Flow

The diagram illustrates the preprocessing pipeline where an input image is resized to maintain its aspect ratio with the longest side at 640 pixels, padded with gray pixels (128) to form a square 640x640 input, and the subsequent mapping of coordinates back to the original image space by subtracting the padding offsets.

The chapter continues with advanced code, exercises and solutions with analysis, you can find them on the ebook on Leanpub.com or Amazon





Code License: All code examples are released under the MIT License. Github repo.

Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.

All textual explanations, original diagrams, and illustrations are the intellectual property of the author. To support the maintenance of this site via AdSense, please read this content exclusively online. Copying, redistribution, or reproduction is strictly prohibited.