Chapter 3: Quantization Explained (FP16, INT8, INT4)
Theoretical Foundations
At the heart of modern Edge AI lies a fundamental tension: the computational hunger of neural networks versus the resource constraints of local hardware. To bridge this gap, we must move beyond naive implementations and embrace Quantization. This process is not merely a compression technique; it is a mathematical re-mapping of the model's parameter space, trading infinite precision for finite, practical efficiency. In the context of C# and ONNX Runtime, this allows us to deploy Large Language Models (LLMs) like Llama or Phi on devices ranging from Raspberry Pis to high-end desktops without requiring massive GPU VRAM.
The Mathematical Landscape of Precision
To understand quantization, we must first visualize the "cost" of a number. In standard Deep Learning, models are trained and inferred using 32-bit floating-point (FP32) numbers.
An FP32 number consists of:
- 1 bit for the sign (\(s\))
- 8 bits for the exponent (\(e\))
- 23 bits for the mantissa (\(m\))
This structure allows for a massive dynamic range (roughly \(10^{-38}\) to \(10^{38}\)) and high precision. However, storing a single weight requires 4 bytes. A 7-billion-parameter model (like Llama 7B) requires roughly 28 GB of memory just for the weights in FP32. This is prohibitive for edge devices.
Quantization maps these continuous FP32 values to a discrete, lower-bit integer representation (e.g., INT8 or INT4). The relationship is linear (affine):
\(X_{quant} = \mathrm{round}(X_{float} / scale) + zero\_point\), and conversely \(X_{float} \approx scale \times (X_{quant} - zero\_point)\)
Where:
- \(X_{float}\) is the original FP32 weight and \(X_{quant}\) its integer counterpart.
- \(scale\) is a floating-point multiplier determining the step size between integer values.
- \(zero\_point\) is an integer offset (often 0 for signed integers, or 128 for unsigned) ensuring that the value 0 in the floating-point domain maps exactly onto an integer in the quantized domain.
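This affine mapping can be sketched directly in C#. The class below is our own illustration (not ONNX Runtime's actual kernel): it calibrates a scale and zero_point from a tensor's observed range, then quantizes to signed INT8 and back.

```csharp
using System;
using System.Linq;

// Minimal per-tensor affine quantization sketch (illustrative only).
public static class AffineQuant
{
    // Compute scale and zero_point so that [min, max] maps onto [-128, 127].
    public static (float Scale, int ZeroPoint) Calibrate(float[] values)
    {
        float min = Math.Min(0f, values.Min());   // range must include 0
        float max = Math.Max(0f, values.Max());
        float scale = (max - min) / 255f;
        if (scale == 0f) scale = 1f;              // degenerate all-zero tensor
        int zeroPoint = (int)Math.Round(-128 - min / scale);
        return (scale, Math.Clamp(zeroPoint, -128, 127));
    }

    public static sbyte Quantize(float x, float scale, int zeroPoint) =>
        (sbyte)Math.Clamp((int)Math.Round(x / scale) + zeroPoint, -128, 127);

    public static float Dequantize(sbyte q, float scale, int zeroPoint) =>
        scale * (q - zeroPoint);
}
```

Note that the round trip is lossy: every dequantized value lands on a grid whose spacing is exactly `scale`, which is the "step size" referred to above, and 0.0 is always reproduced exactly thanks to the zero point.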
The Analogy: The Architect's Blueprint
Imagine an architect designing a complex curve for a building facade.
- FP32 (The Master Blueprint): The architect uses a high-resolution vector plotter. The curve is smooth, infinitely detailed, and mathematically perfect. However, the blueprint is so large and heavy that only the main office can handle it. Transporting it to the construction site (the edge device) is impossible.
- FP16 (The Detailed Sketch): The architect reduces the resolution. The curve is still recognizable, with minor jagged edges. It fits in a standard tube and can be carried to the site.
- INT8 (The Lego Instructions): The curve is now defined by a grid of discrete blocks. You can still recognize the shape, but the smoothness is gone. It is incredibly fast to assemble because the blocks are standard sizes.
- INT4 (The Pixel Art): The curve is reduced to a 4x4 pixel grid. The general shape is visible, but fine details are lost. It is extremely compact and fast to process, but if the curve was subtle, it might look like a staircase.
In Edge AI, we are moving from the Master Blueprint (FP32) to Lego Instructions (INT8) or Pixel Art (INT4). We sacrifice the mathematical perfection of the curve to ensure we can actually build the structure on the construction site.
Precision Tiers: FP16, INT8, and INT4
1. FP16 (Half Precision)
FP16 reduces the bit count by half (1 sign bit, 5 exponent bits, 10 mantissa bits).
- Why use it? It cuts memory usage by 50% and, on hardware with native FP16 support (like modern GPUs or NPUs), doubles compute throughput.
- The Trade-off: The reduced exponent range means overflow is common. Large activations can explode into Infinity, and tiny values can underflow to 0. In C# ONNX Runtime, FP16 is often used for storage compression, but the runtime may still convert it back to FP32 for computation on CPUs that lack native FP16 ALUs (Arithmetic Logic Units).
2. INT8 (8-bit Integer)
INT8 maps values to the range \([-128, 127]\) (signed; symmetric schemes often restrict this to \([-127, 127]\)) or \([0, 255]\) (unsigned).
- Why use it? Memory footprint drops by 75% compared to FP32. Integer arithmetic is significantly faster than floating-point math on almost all CPUs, reducing power consumption—a critical metric for battery-powered edge devices.
- The Trade-off: The "dynamic range" is tiny. A weight of value \(0.0001\) and a weight of \(10.0\) must both fit into this small integer bucket. This requires careful calibration (usually per-channel quantization) to prevent loss of information.
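We can see why per-channel calibration matters with a small sketch (the helper names are ours, not a library API). A symmetric per-tensor scale sized for a 10.0 weight has a step of roughly 0.08, which completely swamps a 0.0001 weight; a per-channel scale gives the small channel its own fine grid.

```csharp
using System;
using System.Linq;

// Sketch: per-tensor vs. per-channel scales. One outsized channel would
// otherwise force a coarse step size onto every other channel.
public static class ChannelCalibration
{
    // Symmetric scale mapping max |value| in the slice to 127.
    public static float ScaleFor(float[] channel) =>
        channel.Max(v => Math.Abs(v)) / 127f;

    // Worst-case rounding error is half a quantization step.
    public static float MaxRoundingError(float scale) => scale / 2f;
}
```

With one shared scale, the tiny channel quantizes to 0 and its information is lost; with its own scale it survives. This is the intuition behind per-channel quantization of weight matrices.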
3. INT4 (4-bit Integer)
INT4 maps values to \([-8, 7]\) (signed; symmetric schemes use \([-7, 7]\)) or \([0, 15]\) (unsigned).
- Why use it? This is the extreme end of compression. A 7B model drops from 28 GB (FP32) to just 3.5 GB. That fits in the RAM of modest edge devices, and because each weight is only half a byte, far fewer bytes cross the memory bus per token, drastically reducing the latency caused by RAM access.
- The Trade-off: This is a "lossy" compression. The quantization error is high. To mitigate this, INT4 is rarely used for the entire network. Typically, the first and last layers (which handle input embedding and output projection) remain in INT8 or FP16 to preserve the model's interface with the real world, while the dense inner layers are quantized to INT4.
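Because 4 bits is smaller than any CLR type, INT4 weights are stored two to a byte. The following is a bit-level sketch of that packing; real kernels (such as ONNX Runtime's MatMulNBits operator) layer block-wise scales on top of this idea, so treat this as the concept only.

```csharp
using System;

// Sketch of 4-bit packing: two signed nibbles per byte (values in [-8, 7]).
public static class Int4Pack
{
    public static byte Pack(int lo, int hi)
    {
        if (lo < -8 || lo > 7 || hi < -8 || hi > 7)
            throw new ArgumentOutOfRangeException();
        return (byte)((lo & 0xF) | ((hi & 0xF) << 4));
    }

    public static (int Lo, int Hi) Unpack(byte packed)
    {
        // Shift left, then arithmetic-shift right, to sign-extend each nibble.
        int lo = (sbyte)(packed << 4) >> 4;
        int hi = (sbyte)packed >> 4;
        return (lo, hi);
    }
}
```

The unpacking cost is why INT4 wins only when memory bandwidth, not ALU throughput, is the bottleneck: the kernel must shift and mask before it can multiply.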
The Role of C# and ONNX Runtime
In previous chapters, we discussed how to load ONNX models using the Microsoft.ML.OnnxRuntime NuGet package. When we instantiate an InferenceSession, we pass a SessionOptions object. This object is the gateway to quantization.
In C#, we don't manually quantize the math operations. Instead, we leverage ONNX Runtime's Graph Optimization together with pre-quantized graphs. When dynamic quantization is applied to a model (typically offline, using ONNX Runtime's quantization tooling), "Quantize" and "Dequantize" nodes are inserted around the linear operations (like MatMul and Conv); at load time, the runtime analyzes the computational graph and fuses these nodes into fast integer kernels.
Here is the conceptual flow in C#:
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;
// 1. Define session options
var options = new SessionOptions();
// 2. Enable full graph optimization (constant folding, operator fusion).
// Note: this accelerates quantized operators already present in the graph;
// it does NOT quantize an FP32 model by itself.
options.GraphOptimizationLevel = GraphOptimizationLevel.ORT_ENABLE_ALL;
// 3. Configure the execution provider (CPU)
// Note: For INT4, we often rely on specific kernels or
// quantization tools (like Olive) to prepare the model beforehand.
options.AppendExecutionProvider_CPU();
// 4. Load the model
// The runtime parses the ONNX graph. If the model was pre-quantized
// (e.g., via the onnxruntime.quantization tools), it detects the
// INT8/INT4 tensors and dispatches to the corresponding integer kernels.
using var session = new InferenceSession("model.onnx", options);
Dynamic vs. Static Quantization
This is a crucial architectural decision in Edge AI.
1. Static Quantization (Calibration):
Before deploying the C# application, we run a calibration process using a representative dataset. We observe the distribution of activations (the outputs of neurons) and calculate the optimal scale and zero_point for every layer.
- Analogy: A tailor measures you once, creates a custom suit, and you wear it.
- Pros: Highest accuracy, fastest inference (quantization math is baked in).
- Cons: Requires calibration data; less flexible if the input data distribution changes.
2. Dynamic Quantization: The weights are quantized ahead of time, but the quantization parameters for the activations are calculated on the fly, batch by batch, during inference.
- Analogy: A stretchy elastic suit. It adjusts its shape every time you move.
- Pros: No calibration data needed; adapts to varying input distributions.
- Cons: Slight overhead to compute the range of inputs before the matrix multiplication.
- C# Implementation: Note that ONNX Runtime does not dynamically quantize an arbitrary FP32 model at load time. The model must first be converted offline (e.g., with the onnxruntime.quantization Python tooling); the resulting graph contains the dynamic-quantization operators that the C# runtime then executes on the CPU.
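The per-batch overhead of dynamic quantization can be sketched as follows. This mirrors the semantics of ONNX's DynamicQuantizeLinear operator in simplified managed code (ours, not the native implementation): scan the incoming activations and derive a scale and zero point whose range includes zero.

```csharp
using System;
using System.Linq;

// Sketch of the per-batch step dynamic quantization adds before each MatMul:
// scan the activations to pick that batch's scale and zero point.
public static class DynamicRange
{
    public static (float Scale, byte ZeroPoint) ForBatch(float[] activations)
    {
        // Asymmetric uint8 mapping; the range is forced to include zero.
        float min = Math.Min(0f, activations.Min());
        float max = Math.Max(0f, activations.Max());
        float scale = (max - min) / 255f;
        if (scale == 0f) scale = 1f;
        byte zeroPoint = (byte)Math.Clamp((int)Math.Round(-min / scale), 0, 255);
        return (scale, zeroPoint);
    }
}
```

The Min/Max scan is exactly the "slight overhead" listed in the cons above: it is one extra pass over the activations before every quantized matrix multiplication.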
Visualizing the Quantization Process
The following diagram illustrates how a standard FP32 model is transformed into a quantized version suitable for Edge AI deployment. Note how the "Weights" node changes from a dense float array to a compressed integer representation, while the "Scale" and "Zero Point" are introduced as metadata.
The Impact on LLM Inference (Llama/Phi)
When running Llama or Phi models locally in C#, the bottleneck is rarely the matrix multiplication speed itself, but rather memory bandwidth—moving the weights from RAM to the CPU cache.
- FP32: The CPU constantly fetches 4 bytes per weight. The cache fills up quickly, causing "cache misses" and stalling the CPU while it waits for data from RAM.
- INT4: The CPU fetches only 0.5 bytes per weight, an 8x reduction in traffic. Far more of the model stays resident in the cache hierarchy at any moment, and the memory bus delivers eight times as many weights per second, so the CPU is fed data fast enough to perform inference in near real-time.
This is why quantization is the "killer feature" for Edge AI. It transforms an I/O-bound problem (waiting for RAM) into a compute-bound problem (doing math), which modern CPUs handle efficiently.
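The arithmetic behind these figures is worth making explicit. A one-line helper (ours, for illustration) reproduces the chapter's numbers: bytes per weight multiplied by parameter count, reported in decimal gigabytes.

```csharp
using System;

// Back-of-envelope check of the memory figures used in this chapter.
// "Gigabytes" here means decimal GB (1e9 bytes), matching the 28 GB figure.
public static class WeightMemory
{
    public static double Gigabytes(long parameters, double bytesPerWeight) =>
        parameters * bytesPerWeight / 1e9;
}
```

For a 7B model: FP32 gives 28 GB, FP16 gives 14 GB, INT8 gives 7 GB, and INT4 gives 3.5 GB, which is why INT4 is the only tier that comfortably fits alongside the OS on an 8 GB device.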
Theoretical Risks
While quantization is powerful, it introduces specific theoretical risks:
- Outlier Sensitivity: Neural networks rely on a few "outlier" weights with large magnitudes to propagate information. In INT4, these outliers are squashed into the same range as normal weights, potentially destroying the signal. Modern quantization techniques (like GPTQ or AWQ) preserve these outliers by keeping them in higher precision (e.g., INT8) while quantizing the rest to INT4.
- Accumulation Precision: When multiplying INT8 values, the result is often accumulated in INT32 to prevent overflow before being requantized. In C#, if we implement custom operators, we must ensure we don't overflow the accumulator.
- Zero Point Issues: For symmetric quantization, the zero point is 0. However, for asymmetric distributions (common after activation functions like ReLU), the zero point shifts; since it must itself be rounded to an integer, small rounding errors are introduced.
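The accumulation rule above can be sketched in C#. Note that sbyte multiplied by sbyte is promoted to int automatically, so the language gives us the INT32 accumulator for free; the combined scales then map the integer sum back to floating point. This is an illustration of the principle, not a production kernel.

```csharp
using System;

// Sketch of an INT8 dot product with an int (INT32) accumulator: a single
// sbyte*sbyte product can reach 16 bits, and summing many of them would
// overflow a 16-bit accumulator, so we widen before accumulating.
public static class Int8Kernel
{
    public static float DotProduct(sbyte[] a, sbyte[] b, float scaleA, float scaleB)
    {
        int acc = 0; // INT32 accumulator
        for (int i = 0; i < a.Length; i++)
            acc += a[i] * b[i]; // sbyte * sbyte promotes to int in C#
        // Dequantize: the combined scale maps the integer sum back to float.
        return acc * scaleA * scaleB;
    }
}
```

In a custom C# operator, forgetting this widening (accumulating in short, for example) silently wraps around after a few dozen elements, which is exactly the overflow risk the text warns about.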
Summary of Decision Making
In the context of building a C# AI application, the choice of quantization is a decision about the deployment environment:
- Use FP16/FP32: If you are deploying on a server with a powerful GPU or if the model is small enough that precision is critical (e.g., medical analysis).
- Use INT8: The "sweet spot" for general-purpose Edge AI on CPUs. It offers 4x memory reduction and significant speedup with minimal accuracy loss.
- Use INT4: When deploying to resource-constrained devices (IoT, mobile) or when running very large models (7B, 13B parameters) locally where memory is the primary bottleneck.
By understanding these theoretical foundations, you can now look at the C# code in the next sections not just as syntax, but as a direct manipulation of the mathematical precision of the neural network.
Basic Code Example
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
namespace EdgeAI_QuantizationDemo
{
class Program
{
// --- CONFIGURATION ---
// For this "Hello World" example, we simulate the existence of a quantized model.
// In a real scenario, you would download a pre-quantized ONNX model (e.g., Phi-3-mini-4k-instruct-q4.onnx)
// or use the ONNX Runtime quantization tools to convert a PyTorch/TensorFlow model.
private const string ModelPath = "phi-3-mini-4k-instruct-q4.onnx";
private const string VocabularyPath = "tokenizer_vocab.json"; // Simplified for demonstration
static async Task Main(string[] args)
{
Console.WriteLine("🚀 Edge AI: Local Inference with Quantized Models (INT4)");
Console.WriteLine("--------------------------------------------------------");
// 1. Prepare the Environment
// In a real app, we would download the model if it doesn't exist.
// For this example, we will check if the file exists, and if not, create a dummy one
// just to allow the code to run without crashing.
await EnsureModelExistsAsync(ModelPath);
// 2. Define the Inference Session Options
// This is where the magic happens. We configure the execution provider.
// For Edge AI we target the default CPU provider, which is backed by
// ONNX Runtime's optimized MLAS kernels.
var sessionOptions = new SessionOptions();
// Enable CPU optimizations specific for ML workloads (AVX, etc.)
sessionOptions.AppendExecutionProvider_CPU();
// Set graph optimization level to enable constant folding and other optimizations
sessionOptions.GraphOptimizationLevel = GraphOptimizationLevel.ORT_ENABLE_ALL;
// 3. Load the Quantized Model
// We wrap this in a try-catch block because file paths can be tricky.
try
{
Console.WriteLine($"[1] Loading model from: {Path.GetFullPath(ModelPath)}");
// The InferenceSession is the core engine.
// Even though the model is quantized to INT4, ONNX Runtime handles the
// unpacking and execution transparently.
using var session = new InferenceSession(ModelPath, sessionOptions);
Console.WriteLine($"[2] Model loaded successfully.");
Console.WriteLine($" Input Node: {session.InputMetadata.Keys.First()}");
Console.WriteLine($" Output Node: {session.OutputMetadata.Keys.First()}");
// 4. Prepare Input Data (Simulated Tokenization)
// In a real scenario, you would use a tokenizer (like Microsoft.ML.Tokenizers).
// Here, we simulate the input_ids tensor expected by the LLM.
// Shape: [BatchSize, SequenceLength] -> [1, 8]
var inputIds = new long[] { 1, 15043, 29879, 29901, 2057, 29915, 29879, 29901 }; // placeholder IDs; a real tokenizer produces these
var attentionMask = new long[] { 1, 1, 1, 1, 1, 1, 1, 1 };
// 5. Create Tensors
// We use DenseTensor for memory efficiency on Edge devices.
// We must reshape to [1, 8] to match the model's expected batch dimension.
var inputTensor = new DenseTensor<long>(inputIds, [1, inputIds.Length]);
var maskTensor = new DenseTensor<long>(attentionMask, [1, attentionMask.Length]);
// 6. Create NamedOnnxValue Inputs
var inputs = new List<NamedOnnxValue>
{
NamedOnnxValue.CreateFromTensor("input_ids", inputTensor),
NamedOnnxValue.CreateFromTensor("attention_mask", maskTensor)
};
Console.WriteLine($"[3] Input tensor created. Shape: {string.Join(",", inputTensor.Dimensions)}");
// 7. Run Inference
Console.WriteLine($"[4] Running inference (Local CPU)...");
var watch = System.Diagnostics.Stopwatch.StartNew();
// This is the synchronous execution call.
// Despite the model being INT4, the inputs are usually FP32 or INT64,
// and the outputs are FP32 (logits).
using var results = session.Run(inputs);
watch.Stop();
Console.WriteLine($" Execution time: {watch.ElapsedMilliseconds} ms");
// 8. Process Results
// The output is typically a tensor of logits (raw scores) for the next token.
var outputTensor = results.First().AsTensor<float>();
Console.WriteLine($"[5] Output received. Shape: {string.Join(",", outputTensor.Dimensions)}");
// Find the token with the highest score (Greedy Decoding)
int maxIndex = 0;
float maxVal = float.MinValue;
// For next-token prediction we want the logits of the LAST position
int lastPos = outputTensor.Dimensions[1] - 1;
for (int i = 0; i < outputTensor.Dimensions[2]; i++)
{
float val = outputTensor[0, lastPos, i];
if (val > maxVal)
{
maxVal = val;
maxIndex = i;
}
}
Console.WriteLine($"[6] Predicted Token ID: {maxIndex} (Score: {maxVal:F4})");
Console.WriteLine($" *Note: In a real app, this ID maps back to text via the tokenizer.*");
}
catch (Exception ex)
{
Console.ForegroundColor = ConsoleColor.Red;
Console.WriteLine($"[ERROR] {ex.Message}");
Console.ResetColor();
Console.WriteLine("\nNote: This code requires the actual ONNX model file to run fully.");
}
Console.WriteLine("\nPress any key to exit...");
Console.ReadKey();
}
// Helper to simulate model existence for the code snippet
static async Task EnsureModelExistsAsync(string path)
{
if (!File.Exists(path))
{
Console.WriteLine($"[!] Model file not found at '{path}'.");
Console.WriteLine(" Creating a dummy file for demonstration purposes...");
// In a real app, we would download:
// await DownloadModelAsync("https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx/resolve/main/cpu-int4-rtn-block-32/phi-3-mini-4k-instruct-cpu-int4-rtn-block-32.onnx");
// Creating a dummy file just to let the InferenceSession constructor succeed
// (it checks file existence). In reality, this file would be invalid ONNX.
// We catch the specific ONNX runtime error in Main if the format is wrong.
await File.WriteAllTextAsync(path, "dummy");
}
}
}
}
Visualizing the Quantization Pipeline
Here is a high-level visualization of how the data flows through a quantized model on an Edge device.
Line-by-Line Explanation
1. Setup and Configuration
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;
// ... other usings
namespace EdgeAI_QuantizationDemo
{
class Program
{
private const string ModelPath = "phi-3-mini-4k-instruct-q4.onnx";
- using Microsoft.ML.OnnxRuntime: This namespace contains the core classes for interacting with ONNX models. It is the bridge between your C# code and the underlying native (C++) execution engine.
- using Microsoft.ML.OnnxRuntime.Tensors: Provides DenseTensor<T> and other tensor types. While .NET has System.Numerics, ONNX Runtime has its own tensor abstraction to handle multi-dimensional data efficiently without unnecessary memory copying.
- ModelPath: We define the path to the model. Note the filename suffix q4 (Quantized, 4-bit). In Edge AI, we typically download these pre-quantized models from hubs like Hugging Face rather than quantizing at runtime, as runtime (dynamic) quantization often yields smaller performance gains than static quantization.
2. Execution Provider Configuration
var sessionOptions = new SessionOptions();
sessionOptions.AppendExecutionProvider_CPU();
sessionOptions.GraphOptimizationLevel = GraphOptimizationLevel.ORT_ENABLE_ALL;
- SessionOptions: This object configures the environment in which the model runs.
- AppendExecutionProvider_CPU(): Explicitly tells ONNX Runtime to use the CPU. On Windows, this often utilizes the MLAS (Microsoft Linear Algebra Subroutines) library, which is optimized for the AVX2/AVX-512 instructions common in modern Edge CPUs.
- GraphOptimizationLevel.ORT_ENABLE_ALL: This is critical for performance. It tells the runtime to perform "constant folding" (pre-calculating static operations) and to fuse operators (e.g., combining Conv2D + ReLU into a single operation). This reduces the overhead of the execution loop, which is vital when running large models locally.
3. Loading the Model
- InferenceSession: This is the main class. When instantiated, it parses the ONNX file (a Protocol Buffer format).
- Quantization Handling: Even though the model weights are stored as int4 (or uint8 in some implementations), InferenceSession abstracts this away. It loads the quantized tensors into memory. Note that int4 is not a native CLR type; ONNX Runtime packs two 4-bit values into each byte and handles the unpacking during the matrix multiplication operations.
4. Input Preparation (Tokenization)
var inputIds = new long[] { 1, 15043, 29879, 29901, 2057, 29915, 29879, 29901 };
var inputTensor = new DenseTensor<long>(inputIds, [1, inputIds.Length]);
- Token IDs: LLMs don't process strings directly; they process integers representing tokens. 15043 might map to "Hello", 29879 to ",", and so on.
- DenseTensor<long>: We create a contiguous block of memory representing the tensor.
- Shape [1, 8]: The first dimension is the Batch Size (1 for single-user inference on Edge). The second is the Sequence Length. Most ONNX models expect a fixed shape or a dynamic axis. If the model expects [Batch, Sequence], we must match this exactly.
5. Running Inference
- session.Run(): This is the synchronous execution call. It passes the input data to the underlying C++ engine.
- The Quantization Magic: Inside this call, the engine performs matrix multiplications. Since the weights are INT4, the hardware uses integer arithmetic units (ALUs), which are significantly faster and more power-efficient than floating-point units (FPUs).
- Math: \(Y = X \cdot W + B\). If \(W\) is INT4, the multiplication is integer-based, resulting in an accumulation that is then scaled and shifted (dequantized) back to a floating-point representation for the next layer or the final output.
6. Output Processing
- Logits: The output is usually a tensor of shape [Batch, SequenceLength, VocabularySize]. These are "logits" (raw, unnormalized scores).
- AsTensor<float>(): Even though the internal computation used integers, the output is typically cast back to float (FP32) for the final result, as the softmax function (used to turn scores into probabilities) requires floating-point precision to avoid numerical instability.
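The numerical-stability point is easy to demonstrate. Below is the standard max-subtraction softmax, sketched by us as a plain helper: without subtracting the maximum first, MathF.Exp(1000f) would overflow to infinity and every probability would become NaN.

```csharp
using System;
using System.Linq;

// Numerically stable softmax: subtract the max logit before exponentiating
// so MathF.Exp never overflows, then normalize.
public static class Softmax
{
    public static float[] Compute(float[] logits)
    {
        float max = logits.Max();
        var exps = logits.Select(l => MathF.Exp(l - max)).ToArray();
        float sum = exps.Sum();
        return exps.Select(e => e / sum).ToArray();
    }
}
```

Because subtraction shifts all logits by the same constant, the resulting probabilities are unchanged; only the intermediate magnitudes are tamed.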
Common Pitfalls
- Mismatched Input Shapes (The "Batch" Dimension): A common error when moving from Python (PyTorch/TensorFlow) to C# ONNX Runtime is forgetting the batch dimension. A model trained on inputs of shape [SequenceLength] often expects shape [BatchSize, SequenceLength] in ONNX. Even if you are processing a single sentence, you must reshape your tensor to [1, SequenceLength]. Failing to do so results in a runtime exception such as: Invalid rank for input: ... Expected: 2.
- Data Type Mismatch (Int64 vs Int32): LLM input IDs are usually int64 (long) in ONNX, but sometimes they are int32. If you create a tensor with float or int when the model expects long, the session.Run() call will throw an exception stating that the input types do not match the model's input metadata. Always check session.InputMetadata.
- Missing Execution Providers: By default, InferenceSession uses the CPU provider. However, if you are on a device with a Neural Processing Unit (NPU) or GPU (e.g., Intel OpenVINO, NVIDIA CUDA, or DirectML), simply loading the model on the CPU will miss out on massive speedups. You must explicitly append the correct provider (e.g., sessionOptions.AppendExecutionProvider_DML() for Windows GPU) and ensure the corresponding native libraries are deployed with your app.
- Static vs. Dynamic Quantization:
  - Static Quantization: Calibrates using a dataset before inference. It is faster but requires representative data.
  - Dynamic Quantization: Quantizes weights statically but activations dynamically at runtime. This is easier to implement but usually slightly slower than fully static quantization.
  - Pitfall: Assuming that simply loading an FP32 model into InferenceSession will quantize it. It will not. You must use the ONNX Runtime quantization tooling (a Python library) to generate the quantized .onnx file before deploying to C#.
The chapter continues with advanced code, exercises and solutions with analysis, you can find them on the ebook on Leanpub.com or Amazon
Code License: All code examples are released under the MIT License. Github repo.
Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.
All textual explanations, original diagrams, and illustrations are the intellectual property of the author. To support the maintenance of this site via AdSense, please read this content exclusively online. Copying, redistribution, or reproduction is strictly prohibited.