
Chapter 12: Running BERT for Text Classification

Theoretical Foundations

The theoretical foundation of running BERT for text classification locally in C# rests on the convergence of three distinct domains: the mathematical architecture of Transformer models, the standardized representation of data via ONNX (Open Neural Network Exchange), and the high-performance execution environment provided by the .NET runtime and the ONNX Runtime (ORT). To understand how we classify text using DistilBERT in C#, we must first deconstruct the journey of a raw string—like a user review—through a series of transformations until it becomes a prediction vector.

The Transformer Architecture: Attention as Context

At its core, DistilBERT is a distilled version of BERT (Bidirectional Encoder Representations from Transformers). Unlike traditional Recurrent Neural Networks (RNNs) that process text sequentially (word by word, left to right), the Transformer architecture processes the entire sequence simultaneously. This parallelism is made possible by the Self-Attention Mechanism.

Imagine you are reading the sentence: "The bank of the river was flooded."

If you were an RNN, you would read "The," then "bank," and hold that in memory. By the time you reach "river," you might still be associating "bank" with a financial institution, creating confusion. You would have to wait until the end of the sentence to correct your understanding.

A Transformer, however, looks at every word at once. The attention mechanism calculates a "relevance score" between every pair of words in the sentence. When the model looks at the word "bank," it simultaneously attends strongly to "river" and "flooded," while paying less attention to "The." This creates a dynamic context vector where the meaning of "bank" is weighted heavily by its neighbors.

In the context of C# and ONNX, we are not training this mechanism; we are utilizing a pre-calibrated set of weights. These weights represent the statistical understanding of language learned from billions of sentences. The "Attention" is not a piece of code we write, but a series of matrix multiplications (dot products) executed by the ONNX Runtime.
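Those matrix multiplications can be sketched with toy numbers. The vectors below are invented purely for illustration (real models use learned, much higher-dimensional query/key projections), but the mechanics — dot products scored against one word, then softmax — are the same:

```csharp
using System;
using System.Linq;

class AttentionSketch
{
    static void Main()
    {
        // Toy 2-dimensional vectors standing in for "bank", "river", "flooded".
        double[][] vectors =
        {
            new[] { 0.9, 0.1 },  // "bank"
            new[] { 0.8, 0.3 },  // "river"
            new[] { 0.7, 0.4 }   // "flooded"
        };

        // Relevance of every word to "bank" (index 0), via dot products.
        double[] scores = vectors
            .Select(v => v[0] * vectors[0][0] + v[1] * vectors[0][1])
            .ToArray();

        // Softmax turns raw scores into attention weights that sum to 1.
        double max = scores.Max();
        double[] exp = scores.Select(s => Math.Exp(s - max)).ToArray();
        double sum = exp.Sum();
        double[] weights = exp.Select(e => e / sum).ToArray();

        Console.WriteLine(string.Join(", ", weights.Select(w => w.ToString("F3"))));
    }
}
```

The word with the highest dot product against "bank" receives the largest weight; in a trained model, those weights decide how much of each neighbor's representation flows into "bank"'s context vector.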

Tokenization: The Bridge Between Human and Machine

Before a sentence enters the neural network, it must be translated into numbers. This process is called Tokenization. DistilBERT uses a specific vocabulary (usually 30,000+ tokens) consisting of sub-words.

Analogy: Think of tokenization as a librarian cataloging books. Instead of cataloging by full titles (which would require a unique catalog for every possible book title), the librarian breaks titles down into common words and sub-words. "Unbelievable" might become "un" + "##believ" + "##able". This allows the model to handle unknown words by breaking them into known components.

In our C# pipeline, the tokenizer is the gatekeeper. It performs three critical operations:

  1. Mapping: Converting strings to integer IDs based on a vocabulary file.
  2. Special Tokens: Adding [CLS] (Classification) at the start and [SEP] (Separator) at the end. The [CLS] token's final hidden state is specifically designed to aggregate the entire sequence's meaning for classification tasks.
  3. Padding and Truncation: Neural networks require fixed-size inputs. If we batch sentences, they must all be the same length. We pad shorter sentences with zeros and truncate longer ones.
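The three operations produce a fixed-length layout that can be sketched directly. The token IDs below are hypothetical placeholders, not real vocabulary entries:

```csharp
using System;

class EncodingLayout
{
    static void Main()
    {
        int maxLen = 8;
        // [CLS], two hypothetical word IDs, [SEP] — IDs invented for illustration.
        int[] tokenIds = { 101, 7279, 2003, 102 };

        int[] inputIds = new int[maxLen];
        int[] attentionMask = new int[maxLen];

        for (int i = 0; i < tokenIds.Length; i++)
        {
            inputIds[i] = tokenIds[i];   // real token
            attentionMask[i] = 1;        // attend to it
        }
        // Remaining slots stay 0: the [PAD] id, flagged with attentionMask = 0.

        Console.WriteLine($"input_ids:      [{string.Join(", ", inputIds)}]");
        Console.WriteLine($"attention_mask: [{string.Join(", ", attentionMask)}]");
    }
}
```

The attention mask is what lets the model distinguish the four real positions from the four padding positions.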

The ONNX Standard: The Universal Format

This is where the concept from Book 8 (Interoperability) becomes critical. BERT models are typically trained in Python frameworks like PyTorch or TensorFlow. To run them efficiently in a C# application, we cannot simply import the raw Python code.

We convert the trained model into the ONNX format (Open Neural Network Exchange). ONNX is the "PDF" of AI models—a universal file format that describes the computational graph.

The Computational Graph: An ONNX model is a Directed Acyclic Graph (DAG). It defines:

  • Inputs: The tensors (multi-dimensional arrays) entering the model (e.g., input_ids, attention_mask, token_type_ids).
  • Nodes: The mathematical operations (Add, MatMul, Softmax, LayerNorm).
  • Outputs: The resulting tensors (e.g., the logits for "Positive" and "Negative").

By using ONNX, we decouple the training environment (Python/Linux) from the inference environment (C#/Windows/Edge Device). This is the essence of Edge AI: taking a massive, cloud-trained model and shrinking it into a portable file that runs locally on a user's machine without internet connectivity.

The C# Execution Environment: ONNX Runtime

In C#, we utilize the Microsoft.ML.OnnxRuntime NuGet package. This package provides the InferenceSession class, which is the engine that executes the ONNX graph.

Why C# and .NET for this?

  1. Span<T> and Memory Management: When processing text, we deal with arrays of integers (token IDs). Modern C# allows us to use Span<T> and Memory<T> to manipulate these arrays without unnecessary heap allocations. This is vital for low-latency inference on edge devices where Garbage Collection (GC) pressure can cause frame drops or UI freezes.
  2. IDisposable Pattern: The ONNX Runtime manages unmanaged resources (GPU memory, C++ execution providers). C#'s using statements ensure that the model session and tensor memory are released immediately after inference, preventing memory leaks in long-running applications.
  3. Async/Await: While inference itself is synchronous (blocking) for a single prediction, the pre-processing (tokenization) and post-processing (interpreting logits) can be offloaded. However, for real-time edge inference, we often use synchronous execution to minimize thread context switching overhead.
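The disposal pattern from point 2 looks like this in practice. This is a minimal sketch: "model.onnx" is a placeholder path, and the file is assumed to exist on disk.

```csharp
using Microsoft.ML.OnnxRuntime;

class DisposalSketch
{
    static void Main()
    {
        // Both of these wrap unmanaged (native C++) resources.
        using var options = new SessionOptions();
        using var session = new InferenceSession("model.onnx", options);

        // ... build input tensors and call session.Run(...) here ...

    } // Both native handles are released deterministically at end of scope.
}
```

Without the `using` statements, the native memory would linger until a garbage collection finalizer ran — unpredictable on a long-running edge device.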

The Inference Pipeline

Let us visualize the flow of data through the system.

This diagram illustrates a synchronous inference pipeline where data flows sequentially through stages—preprocessing, model execution, and post-processing—to minimize thread context switching overhead for efficient real-time edge processing.

1. The Input Tensor Construction

The ONNX model expects specific input names, usually input_ids, attention_mask, and sometimes token_type_ids.

  • input_ids: The sequence of integers representing tokens.
  • attention_mask: A binary tensor (0s and 1s) telling the model which tokens are real data and which are padding. Without this, the model would try to derive meaning from the padding zeros, diluting the result.

Analogy: Imagine a lecture hall. The input_ids are the students sitting in chairs. The attention_mask is a roll call. If a chair is empty (padding), the attention_mask tells the professor (the model) to ignore that chair entirely when taking attendance.

2. The Execution Providers

When we create an InferenceSession in C#, we can specify an ExecutionProvider. This is the bridge to hardware acceleration.

  • CPU: The default. Uses the CPU's AVX instructions. Good for compatibility but slower.
  • CUDA (NVIDIA): Offloads matrix multiplications to the GPU. Essential for large models or batch processing.
  • DirectML (Windows): Hardware-agnostic GPU acceleration for Windows devices (AMD, Intel, NVIDIA).
  • OpenVINO (Intel): Optimized for Intel CPUs/NPUs.

In an Edge AI scenario (e.g., a C# WPF application on a laptop), we might dynamically select the provider based on available hardware. If an NVIDIA GPU is detected, we use CUDA; otherwise, we fall back to CPU.
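One way to sketch that dynamic fallback, using the `Microsoft.ML.OnnxRuntime` SessionOptions API. The exact failure mode of `AppendExecutionProvider_CUDA` depends on the installed package (it throws at call time if the CUDA provider is not available), so the try/catch here is a simplification:

```csharp
using System;
using Microsoft.ML.OnnxRuntime;

class ProviderSelection
{
    static SessionOptions CreateOptions()
    {
        var options = new SessionOptions();
        try
        {
            // Request the CUDA provider for GPU device 0.
            // Throws if the CUDA provider/native libraries are unavailable.
            options.AppendExecutionProvider_CUDA(0);
            Console.WriteLine("Using CUDA execution provider.");
        }
        catch (Exception)
        {
            // No provider appended: ONNX Runtime falls back to the default CPU provider.
            Console.WriteLine("CUDA not available; falling back to CPU.");
        }
        return options;
    }

    static void Main()
    {
        using var options = CreateOptions();
        // using var session = new InferenceSession("model.onnx", options);
    }
}
```

A production application might extend the chain (CUDA → DirectML → CPU) or read the preferred provider from configuration.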

3. The Inference Step

Once the input tensors are created from our C# arrays and wrapped for the runtime (as NamedOnnxValue or OrtValue objects), we call session.Run(). The ONNX Runtime (written in C++ for speed) takes over.

  • It traverses the graph.
  • It performs the matrix multiplications defined by the model weights.
  • It applies activation functions (GELU, Tanh).
  • If the exported graph ends with a Softmax node, it converts the raw scores (logits) into probabilities (0 to 1) at the final layer; otherwise the model emits raw logits and we apply Softmax ourselves in post-processing.

4. Post-Processing

The model returns a tensor of shape [1, 2] (Batch Size 1, Classes 2). We extract these values in C#.

  • Index 0: Logit for Negative
  • Index 1: Logit for Positive

We apply the Softmax function if it wasn't baked into the model output: $$ \text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}} $$

This gives us a confidence score. We then map this back to a human-readable label.

Why This Matters for Edge AI

The theoretical importance of this pipeline in the context of "Edge AI" is Latency and Privacy.

  1. Latency: In a cloud-based scenario, sending text to an API, waiting for the server to process it, and receiving the response introduces network latency (often 100ms-500ms). Local inference via ONNX Runtime on a CPU can complete BERT inference in <50ms.
  2. Privacy: Since the text never leaves the user's device, sensitive data (medical notes, private messages) remains secure.
  3. Offline Capability: The application functions without an internet connection, crucial for industrial IoT or remote field work.

Summary of Concepts

To summarize the theoretical foundation:

  1. DistilBERT uses the Transformer architecture and Self-Attention to understand context bidirectionally.
  2. Tokenization converts natural language into a fixed-size numerical representation, handling sub-words and padding.
  3. ONNX serves as the universal container, abstracting the model architecture from the training framework.
  4. ONNX Runtime in C# provides the engine to execute these graphs, leveraging hardware acceleration (CPU/GPU) via ExecutionProviders.
  5. C# Features like Span<T> and IDisposable ensure that memory management is efficient and safe, which is paramount for maintaining high frame rates in UI applications or handling high throughput in services.

This theoretical framework sets the stage for the practical implementation, where we will translate these concepts into code, loading the model and processing real-world text data.

Basic Code Example

using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.Json;

namespace BertSentimentAnalysis
{
    // This example demonstrates a complete, self-contained BERT-based sentiment analysis pipeline.
    // We will classify a short text snippet as either "Positive" or "Negative" using a DistilBERT model.
    class Program
    {
        static void Main(string[] args)
        {
            // 1. Define the input text we want to analyze.
            string inputText = "The movie was absolutely fantastic! The acting was superb.";

            // 2. Load the ONNX model. 
            // NOTE: For this example to run, you must download a DistilBERT model for sequence classification
            // (e.g., from HuggingFace converted to ONNX) and place it in the execution directory.
            // We will assume the filename is "distilbert-base-uncased-finetuned-sst-2-english.onnx".
            string modelPath = "distilbert-base-uncased-finetuned-sst-2-english.onnx";

            // 3. Initialize the tokenizer. 
            // In a production environment, we would use a dedicated tokenizer library. 
            // Here, we implement a basic WordPiece tokenizer logic manually to keep the example self-contained.
            var tokenizer = new BasicBertTokenizer();

            // 4. Tokenize the input text.
            // This converts raw text into numerical IDs and generates necessary attention masks.
            var encodedInput = tokenizer.Encode(inputText);

            // 5. Create the ONNX Runtime Inference Session.
            // We wrap this in a 'using' statement to ensure proper disposal of unmanaged resources.
            using var session = new InferenceSession(modelPath);

            // 6. Prepare the inputs for the model.
            // BERT-family models typically require:
            // - Input IDs: The tokenized text.
            // - Attention Mask: Tells the model which tokens to pay attention to (1 for real tokens, 0 for padding).
            // - Token Type IDs: Used for sentence pairs. DistilBERT exports usually omit this input,
            //   so we add it only if the model graph actually declares it.
            var inputs = new List<NamedOnnxValue>
            {
                NamedOnnxValue.CreateFromTensor("input_ids", encodedInput.InputIds),
                NamedOnnxValue.CreateFromTensor("attention_mask", encodedInput.AttentionMask)
            };

            if (session.InputMetadata.ContainsKey("token_type_ids"))
            {
                inputs.Add(NamedOnnxValue.CreateFromTensor("token_type_ids", encodedInput.TokenTypeIds));
            }

            // 7. Run inference.
            // The session.Run method executes the forward pass of the neural network.
            using var results = session.Run(inputs);

            // 8. Extract and process the output.
            // The output is usually a tensor of logits (raw scores) for each class.
            var outputTensor = results.First().AsTensor<float>();

            // 9. Convert logits to probabilities using Softmax.
            var probabilities = Softmax(outputTensor.ToArray());

            // 10. Determine the predicted label.
            // Assuming index 0 is Negative and index 1 is Positive (standard for SST-2).
            string predictedLabel = probabilities[1] > 0.5 ? "Positive" : "Negative";

            // 11. Display the results.
            Console.WriteLine($"Input Text: \"{inputText}\"");
            Console.WriteLine($"Predicted Sentiment: {predictedLabel}");
            Console.WriteLine($"Confidence Scores: Negative={probabilities[0]:P2}, Positive={probabilities[1]:P2}");
        }

        // Helper method to calculate Softmax probabilities.
        // Softmax converts raw logits into a probability distribution summing to 1.
        static float[] Softmax(float[] logits)
        {
            var max = logits.Max();
            var exp = logits.Select(x => Math.Exp(x - max)).ToArray();
            var sum = exp.Sum();
            return exp.Select(x => (float)(x / sum)).ToArray();
        }
    }

    /// <summary>
    /// A simplified implementation of a BERT tokenizer for demonstration purposes.
    /// Real-world applications should use libraries like Microsoft.ML.Tokenizers or HuggingFace tokenizers.
    /// </summary>
    public class BasicBertTokenizer
    {
        // Basic vocabulary mapping for demonstration (truncated for brevity).
        // In a real scenario, this would be loaded from a 'vocab.txt' file.
        // Note: dictionary initializers throw on duplicate keys, so each word appears once.
        private readonly Dictionary<string, int> _vocab = new()
        {
            { "[CLS]", 101 }, { "[SEP]", 102 }, { "[PAD]", 0 }, { "[UNK]", 100 },
            { "the", 1996 }, { "movie", 3185 }, { "was", 2001 },
            { "absolutely", 4593 }, { "fantastic", 11025 }, { "!", 999 },
            { "acting", 3724 }, { "superb", 11344 }
        };

        public EncodedInput Encode(string text, int maxLen = 128)
        {
            // 1. Normalize and split text into words.
            var words = text.ToLower().Split(new[] { ' ', '!', '.', '?' }, StringSplitOptions.RemoveEmptyEntries);

            // 2. Convert words to token IDs.
            var tokenIds = new List<int>();
            tokenIds.Add(_vocab["[CLS]"]); // Start of sequence token

            foreach (var word in words)
            {
                // Simple lookup (in reality, WordPiece tokenization splits unknown words)
                if (_vocab.TryGetValue(word, out int id))
                {
                    tokenIds.Add(id);
                }
                else
                {
                    // Handle unknown words by mapping to a generic ID or [UNK]
                    // For this example, we'll skip or map to a placeholder
                    tokenIds.Add(100); // Assuming 100 is [UNK]
                }
            }

            tokenIds.Add(_vocab["[SEP]"]); // End of sequence token

            // 3. Padding and Attention Mask creation.
            // Note: HuggingFace ONNX exports typically declare these inputs as Int64 (long),
            // so we build long[] arrays; passing Int32 tensors would cause a type-mismatch error.
            var inputIds = new long[maxLen];
            var attentionMask = new long[maxLen];
            var tokenTypeIds = new long[maxLen];

            // Copy values (truncating if the sentence exceeds maxLen)
            for (int i = 0; i < Math.Min(tokenIds.Count, maxLen); i++)
            {
                inputIds[i] = tokenIds[i];
                attentionMask[i] = 1; // 1 indicates a real token
                tokenTypeIds[i] = 0;  // 0 indicates the first sentence
            }

            // Fill the rest with Padding IDs (0)
            for (int i = Math.Min(tokenIds.Count, maxLen); i < maxLen; i++)
            {
                inputIds[i] = _vocab["[PAD]"];
                attentionMask[i] = 0; // 0 indicates padding
                tokenTypeIds[i] = 0;
            }

            // 4. Convert to DenseTensors for ONNX Runtime.
            // We reshape to [1, sequence_length] because BERT expects a batch dimension.
            var inputIdsTensor = new DenseTensor<long>(inputIds, new[] { 1, maxLen });
            var attentionMaskTensor = new DenseTensor<long>(attentionMask, new[] { 1, maxLen });
            var tokenTypeIdsTensor = new DenseTensor<long>(tokenTypeIds, new[] { 1, maxLen });

            return new EncodedInput
            {
                InputIds = inputIdsTensor,
                AttentionMask = attentionMaskTensor,
                TokenTypeIds = tokenTypeIdsTensor
            };
        }
    }

    // Container for the encoded inputs
    public class EncodedInput
    {
        public Tensor<long> InputIds { get; set; }
        public Tensor<long> AttentionMask { get; set; }
        public Tensor<long> TokenTypeIds { get; set; }
    }
}

Detailed Line-by-Line Explanation

  1. Namespaces and Setup:

    • Microsoft.ML.OnnxRuntime: This is the core namespace for running ONNX models. It provides the InferenceSession class, which is the entry point for loading and executing models.
    • Microsoft.ML.OnnxRuntime.Tensors: This namespace provides tensor data structures. ONNX models expect inputs and outputs in the form of multi-dimensional arrays (tensors).
    • System.Text.Json: Included for potential JSON handling. It is not strictly used in this minimal example, but it is standard for loading configuration in larger applications.
  2. The Main Method:

    • Input Definition: We define a raw string inputText. This represents a user review or comment in a real-world app (e.g., an e-commerce site or social media monitor).
    • Model Path: We specify the path to the .onnx file. Crucial Note: ONNX Runtime does not download models; it only executes them. You must acquire a pre-trained BERT model (converted to ONNX format) separately.
    • Tokenizer Initialization: We instantiate BasicBertTokenizer. Raw text cannot be fed directly into a neural network; it must be converted to numbers. This class handles that conversion.
  3. Tokenization (tokenizer.Encode):

    • The Encode method performs the critical preprocessing steps:
      • Normalization: Lowercasing and splitting text.
      • Tokenization: Breaking words into sub-words (WordPiece). In our simplified code, we use a dictionary lookup.
      • Special Tokens: Adding [CLS] (Classification) at the start and [SEP] (Separator) at the end. BERT requires these to understand the structure of the input.
    • The result is an EncodedInput object containing dense tensors with a shape of [1, sequence_length] (Batch Size 1, Sequence Length 128).
  4. Inference Session:

    • using var session = new InferenceSession(modelPath);: This loads the ONNX model from disk into memory and prepares the computational graph for execution. By default it runs on the CPU; GPU execution providers (e.g., CUDA, DirectML, TensorRT) must be requested explicitly via SessionOptions. The using keyword ensures that when the scope ends, the unmanaged memory is released.
  5. Input Preparation:

    • List<NamedOnnxValue>: ONNX Runtime expects inputs as a list of named values. The names ("input_ids", "attention_mask", etc.) must match the input names defined in the ONNX model graph exactly.
    • Tensors: We pass the tensors created by the tokenizer. These contain the numerical representation of our text.
  6. Execution:

    • session.Run(inputs): This is the "Forward Pass." The data flows through the neural network layers (Embeddings, Attention Heads, Feed-Forward Networks) to produce the output logits.
  7. Post-Processing:

    • Logits: The raw output is a tensor of floating-point numbers (logits). These are unnormalized scores.
    • Softmax: We apply the Softmax function to convert these logits into probabilities (0 to 1, summing to 1). This makes the result interpretable (e.g., "85% confident this is Positive").
    • Thresholding: We compare the probability of the "Positive" class (index 1) against a threshold (0.5) to make a final decision.

Graphviz Visualization of the Pipeline

A diagram illustrating a pipeline where a model's raw probability output is fed into a thresholding step, which outputs a binary decision based on a comparison to a set value (e.g., 0.5).

Common Pitfalls

  1. Mismatched Input Names: ONNX models are exported with specific input names (e.g., input_ids vs ids). If the NamedOnnxValue keys don't match exactly, session.Run will throw an exception at runtime. Always inspect the model metadata using tools like Netron or session.InputMetadata.

  2. Shape Mismatch: BERT models are strict about input shapes. If the model expects [Batch_Size, Sequence_Length] (e.g., [1, 128]) but you provide a flat array or a different dimension, the inference will fail. Ensure your tensors are reshaped correctly.

  3. Missing Tokenizer Files: The vocabulary file (usually vocab.txt or tokenizer.json) is essential. Our example uses a hardcoded dictionary, but in reality, loading the wrong vocab file results in completely different token IDs, leading to garbage predictions.

  4. Data Type Mismatch: ONNX models usually expect Int64 (long) or Int32 for input IDs. If you pass float tensors for input IDs, the runtime will throw an error. Conversely, output logits are almost always Float.

  5. Memory Leaks: Failing to dispose of InferenceSession or IDisposable tensors can lead to memory leaks, especially critical in edge devices with limited RAM. Always use using statements.
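Pitfalls 1, 2, and 4 can all be caught up front by inspecting the model's declared inputs before building any tensors. A defensive sketch (the model path is the one assumed earlier in this chapter; names, shapes, and element types vary per export):

```csharp
using System;
using Microsoft.ML.OnnxRuntime;

class MetadataInspection
{
    static void Main()
    {
        using var session = new InferenceSession(
            "distilbert-base-uncased-finetuned-sst-2-english.onnx");

        // Print each declared input's name, element type, and shape.
        // Dimensions of -1 indicate dynamic axes (e.g., batch size).
        foreach (var kvp in session.InputMetadata)
        {
            var meta = kvp.Value;
            Console.WriteLine(
                $"{kvp.Key}: {meta.ElementType}, dims=[{string.Join(", ", meta.Dimensions)}]");
        }
    }
}
```

Run this once against any new model file: the printed names tell you exactly which NamedOnnxValue keys to use, and the element types tell you whether to build Int32 or Int64 tensors.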

Real-World Context

Imagine you are building a Customer Support Dashboard for a streaming service. Thousands of user comments arrive every minute via social media and support tickets.

  • The Problem: You cannot manually read every comment to determine if the user is happy or angry.
  • The Solution: You deploy this BERT inference code on a local server (or even an edge device near the data center) to process comments in real-time.
  • Why Local Inference?:
    • Latency: Sending data to a cloud API (like Azure Cognitive Services) adds network latency. Local inference is sub-100ms.
    • Privacy: User comments might contain sensitive data. Processing locally ensures data never leaves your private network.
    • Cost: For high volume, cloud API costs per request add up. Local inference has a fixed hardware cost.

This code snippet is the foundational building block for that system. In a production environment, you would wrap this logic in a high-throughput web API (ASP.NET Core) or a background worker service.

The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.





Code License: All code examples are released under the MIT License. Github repo.

Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.

All textual explanations, original diagrams, and illustrations are the intellectual property of the author. To support the maintenance of this site via AdSense, please read this content exclusively online. Copying, redistribution, or reproduction is strictly prohibited.