Chapter 5: Setting up the Local Environment
Theoretical Foundations
The theoretical foundation for running Large Language Models (LLMs) like Llama and Phi locally on a .NET stack relies on a specific architectural convergence: the separation of the model's mathematical logic from its execution hardware. In the cloud, this abstraction is often hidden behind REST APIs. Locally, we must explicitly manage how the mathematical graph of the neural network interacts with the underlying silicon (CPU or GPU).
The ONNX Runtime: The Universal Translator
At the heart of local inference is the Open Neural Network Exchange (ONNX) format. To understand why this is critical, imagine a complex architectural blueprint for a skyscraper. This blueprint is designed in specific software (like AutoCAD). If you want to build the skyscraper using a different construction company (a different hardware vendor or execution engine), you cannot hand them the proprietary AutoCAD file; you need a vendor-neutral exchange format like PDF or DXF.
ONNX is that universal standard for neural networks. Llama and Phi are originally defined in frameworks like PyTorch or TensorFlow. Before they can run in C#, they are exported to ONNX. This format represents the neural network as a Directed Acyclic Graph (DAG) of operators (nodes) and data flows (edges).
In C#, we do not interact with the raw ONNX file directly. We interact with the ONNX Runtime (ORT). The ORT is the "construction company" that reads the blueprint and executes it. It is a high-performance inference engine that is highly optimized for specific hardware backends.
The Execution Providers (EPs)
The "Why" behind the ONNX Runtime is its ability to swap Execution Providers (EPs) without changing your C# application code. This is the architectural equivalent of an Interface in object-oriented programming, a concept we explored in Book 2: Design Patterns.
Just as an IEnumerable<T> allows you to iterate over a list or a database stream without knowing the underlying storage, the InferenceSession in ONNX Runtime uses an abstraction layer for hardware acceleration.
- CPU EP: The default fallback. It uses highly optimized linear algebra libraries (like MKL or BLAS) to run on the CPU. It is universal but slower for large matrix multiplications.
- CUDA EP (NVIDIA): This delegates the graph execution to the NVIDIA GPU. It utilizes cuDNN and cuBLAS libraries. This is crucial because LLMs are essentially massive matrix multiplications; GPUs excel at parallelizing these operations.
- DirectML EP (Windows): This is a DirectX-based API that allows ONNX to run on any DirectX 12 compatible GPU (AMD, Intel, NVIDIA). It is the bridge between the generic world of ML and the specific graphics pipeline of Windows.
- CoreML EP (Apple Silicon): For Mac users, this delegates execution to Apple's Core ML framework, which can target the CPU, GPU, or Neural Engine.
Analogy: Think of the ONNX Runtime as a professional driver. The ONNX model file is the route plan. The driver can choose to drive a car (CPU), a Formula 1 race car (CUDA/DirectML), or a boat (CoreML). The route plan (the model) remains identical; only the vehicle (execution provider) changes the speed.
The .NET Integration: Microsoft.ML.OnnxRuntime
In C#, we interact with the ONNX Runtime via the Microsoft.ML.OnnxRuntime NuGet package. This package provides the managed wrapper around the native C++ runtime. It is the bridge between the managed .NET garbage-collected world and the unmanaged, high-performance native world.
The Session and the Graph
The central object is the InferenceSession. When you load a model (e.g., llama-3-2-1b-instruct.onnx), the session parses the ONNX graph. This graph is static. Unlike dynamic execution in Python, the ONNX graph is "frozen" to maximize performance.
In a previous chapter, we discussed Memory Management in .NET. It is vital here. The InferenceSession holds unmanaged memory allocated by the native C++ runtime. If the .NET Garbage Collector (GC) moves the object, it could corrupt the native pointer. Therefore, the IDisposable pattern is strictly enforced. The session must be wrapped in a using block or explicitly disposed.
// Conceptual representation of the session lifecycle
using var session = new InferenceSession("model.onnx", new SessionOptions());
// The session holds the graph definition and pre-allocated memory pools.
Tokenization: The Bridge to the Model
LLMs do not process text; they process numbers. The process of converting raw text into a sequence of integers (tokens) is called Tokenization. This is a non-neural step that must happen before the ONNX graph executes.
In the cloud, tokenization is often handled by the API server. Locally, we must handle it explicitly. This introduces the concept of Vocabulary Mappings.
- Vocabulary: The set of all possible tokens (words, sub-words, characters) the model knows.
- Tokenizer Model: A separate file (often tokenizer.json or tokenizer.model) that contains the rules for splitting text (e.g., Byte-Pair Encoding).
Analogy: Imagine you are sending a coded telegram to a spy (the LLM). You cannot just write English sentences; you must translate them into a specific codebook (the tokenizer). If you send the wrong code (an unknown token), the spy will not understand. The tokenizer ensures the input matches the model's expected integer vocabulary.
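To illustrate what the "rules for splitting text" look like, here is a toy sketch of Byte-Pair Encoding (Python for brevity; the MERGES table is made up, and real BPE picks the lowest-rank pair iteratively rather than applying rules one pass each, so treat this as a simplification):

```python
# Toy BPE: a learned merge table, applied to a word character by character.
MERGES = [("l", "o"), ("lo", "w"), ("e", "r")]  # hypothetical merge rules

def bpe(word):
    symbols = list(word)
    for a, b in MERGES:                      # apply merges in training order
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]   # fuse the adjacent pair
            else:
                i += 1
    return symbols

print(bpe("lower"))  # ['low', 'er']
```

Each resulting sub-word then gets looked up in the vocabulary to produce the integer IDs the model actually consumes.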
The Inference Loop: The Forward Pass
Once the text is tokenized, the integers are fed into the InferenceSession. This triggers the Forward Pass.
- Input Tensor Creation: The tokens are wrapped in a NamedOnnxValue. This is a tensor (a multi-dimensional array), usually shaped [1, SequenceLength].
- Graph Execution: The runtime traverses the graph. It starts with the input embedding layer, moves through the attention mechanisms (Self-Attention), and feeds forward through the dense layers.
- Output Logits: The result is a tensor of floating-point numbers called logits. These are the raw scores from which the probability distribution of the next token is derived.
The "Why" of Output Processing: The logits tensor is usually huge (e.g., 32,000 dimensions for Llama 2, matching the vocabulary size). We must apply a Softmax function to convert these raw scores into probabilities. Then, we use a Sampling Strategy (Greedy, Top-K, Top-P) to select the next token.
This is where the Microsoft.ML.OnnxRuntime.Extensions package often comes in, providing helper methods to handle these post-processing steps efficiently.
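Softmax and sampling are simple enough to sketch directly (Python, stdlib only; the function names are ours, not a library API):

```python
import math, random

def softmax(logits):
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(probs):
    # Greedy strategy: always pick the single most likely token.
    return max(range(len(probs)), key=probs.__getitem__)

def top_k(probs, k, rng):
    # Top-K strategy: keep the k most likely tokens, renormalize, sample.
    best = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    total = sum(probs[i] for i in best)
    r, acc = rng.random() * total, 0.0
    for i in best:
        acc += probs[i]
        if r <= acc:
            return i
    return best[-1]

logits = [1.0, 3.0, 0.5, 2.0]             # raw scores over a 4-token vocabulary
probs = softmax(logits)
print(greedy(probs))                       # 1 (the highest logit wins)
print(top_k(probs, k=2, rng=random.Random(0)))  # sampled from tokens {1, 3}
```

Greedy decoding is deterministic but repetitive; Top-K and Top-P trade determinism for variety, which is why chat applications expose them as "temperature-style" knobs.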
GPU Acceleration: CUDA vs. DirectML
For local AI, the choice of hardware acceleration is the single biggest factor in performance.
CUDA (NVIDIA)
CUDA is the industry standard for deep learning. It requires:
- NVIDIA GPU: Hardware with Compute Capability 6.0+.
- CUDA Toolkit: Installed at the system level (not via NuGet).
- cuDNN: The Deep Neural Network library.
When the InferenceSession detects the CUDA Execution Provider, it copies the input tensors from CPU RAM to GPU VRAM. The entire graph execution happens on the GPU. The output logits are then copied back to CPU RAM (or kept on GPU for the next step).
DirectML (Windows)
DirectML is Microsoft's answer to hardware agnosticism. It allows C# developers to run AI on AMD, Intel, or NVIDIA GPUs without installing specific CUDA drivers.
- Architecture: It sits on top of DirectX 12.
- Benefit: It is pre-installed on Windows 10/11.
- Trade-off: While highly optimized, it sometimes lags slightly behind the absolute bleeding-edge performance of the latest CUDA libraries for specific niche operations.
The Model Lifecycle: Download and Verification
Before code execution, the model must be acquired. In the context of local AI, we often download models from Hugging Face.
However, a raw PyTorch checkpoint (.bin) cannot be loaded by C#. It must be converted to ONNX. This conversion is a separate, offline step involving:
- Export: Using torch.onnx.export to trace the model execution.
- Optimization: Using tools like onnxruntime-tools to fuse operators (e.g., combining LayerNorm and Add into a single kernel) to reduce latency.
Verification:
Once the ONNX file is downloaded, we must verify its integrity. The ONNX Runtime provides a method to check the model's input and output names. This is crucial because if the input name in code (e.g., "input_ids") does not match the input name in the model graph (e.g., "tokens"), the inference will fail.
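A minimal sketch of that verification step, assuming a plain dictionary standing in for the graph metadata (in the C# API the same information comes from inferenceSession.InputMetadata; verify_inputs is a hypothetical helper):

```python
# Fail fast if the names our code will feed do not exist in the model graph.
def verify_inputs(expected_names, input_metadata):
    missing = [name for name in expected_names if name not in input_metadata]
    if missing:
        raise ValueError(f"Model graph is missing expected inputs: {missing}")
    return True

# Stand-in for the metadata a loaded session would expose.
graph_inputs = {"input_ids": "int64[1,seq]", "attention_mask": "int64[1,seq]"}

verify_inputs(["input_ids", "attention_mask"], graph_inputs)  # OK

try:
    verify_inputs(["tokens"], graph_inputs)  # wrong name: caught immediately
except ValueError as e:
    print(e)
```

Checking names at startup turns a confusing mid-inference exception into a clear configuration error.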
Architectural Visualization
The following diagram illustrates the flow of data from C# to the hardware and back. Note how the InferenceSession acts as the gatekeeper.
Advanced Runtime Concepts
1. SessionOptions and Threading
The SessionOptions class is where we configure the execution environment. It allows us to set intra-op and inter-op thread counts.
- Intra-op: Parallelism within a single operator (e.g., matrix multiplication).
- Inter-op: Parallelism across different nodes in the graph.
- Why it matters: On a CPU, setting these incorrectly can lead to thread thrashing. On a GPU, these settings are often ignored because the GPU handles parallelism internally.
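Intra-op parallelism can be pictured as one operator split across a thread pool. A rough sketch, assuming a dot product as the "operator" (parallel_dot and partial_dot are invented names; the real runtime does this in native code):

```python
import concurrent.futures

def partial_dot(args):
    a, b = args
    return sum(x * y for x, y in zip(a, b))

def parallel_dot(a, b, num_threads=4):
    # Split one operation into chunks and hand them to the intra-op pool.
    chunk = (len(a) + num_threads - 1) // num_threads
    pieces = [(a[i:i + chunk], b[i:i + chunk]) for i in range(0, len(a), chunk)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as pool:
        return sum(pool.map(partial_dot, pieces))

a = list(range(1000))
b = [2.0] * 1000
print(parallel_dot(a, b))  # 999000.0
```

Setting num_threads higher than the physical core count is the "thread thrashing" failure mode mentioned above: more workers only add scheduling overhead.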
2. Memory Mapping (Memory Mapped Files)
Loading a 4GB Llama model into managed RAM is inefficient. The ONNX Runtime supports loading models via memory-mapped files. This allows the OS to page parts of the file into physical RAM only when needed, reducing startup time and memory footprint. This is handled transparently by the InferenceSession constructor if the file path is provided, but understanding this helps in optimizing large model loading.
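The mechanism is the same one the OS exposes directly. A small stdlib demonstration (the file here is a tiny stand-in for a multi-gigabyte weights file):

```python
import mmap, os, tempfile

# Create a 4 KiB stand-in "weights" file whose byte at offset i is i % 256.
path = os.path.join(tempfile.gettempdir(), "fake_weights.bin")
with open(path, "wb") as f:
    f.write(bytes(range(256)) * 16)

# Memory-map it and touch only one 8-byte slice; the OS pages in just
# the region we access rather than loading the whole file into RAM.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        chunk = mm[1024:1032]

print(list(chunk))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

This is why a memory-mapped model can "load" almost instantly: nothing is read from disk until an operator first touches its weights.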
3. KV Caching (Key-Value Cache)
In a generative model, we calculate the attention for previous tokens repeatedly. To avoid this, we use a KV Cache.
- Concept: We store the intermediate Key and Value matrices for the processed tokens.
- Implementation: In ONNX, this often requires a stateful model or specific inputs/outputs designed to carry this state over multiple inference calls.
- C# Implication: We must manage the lifecycle of these tensors across multiple calls to session.Run(), ensuring we don't allocate new memory for every generated token (which would be slow).
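The payoff of the cache is easy to quantify. A sketch that counts how many per-token Key/Value computations a generation loop performs with and without caching (generate is an invented toy, not a real inference loop):

```python
def generate(steps, use_cache):
    # Count how many token positions have their K/V matrices computed.
    cache, work = [], 0
    for t in range(1, steps + 1):
        if use_cache:
            cache.append(f"kv_{t}")   # compute K/V for the new token only
            work += 1
        else:
            work += t                 # recompute K/V for every token so far
    return work

print(generate(100, use_cache=False))  # 5050 (quadratic: 1+2+...+100)
print(generate(100, use_cache=True))   # 100 (linear)
```

Without the cache the cost grows quadratically with sequence length, which is why long generations become unusably slow on stateless models.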
4. Quantization (The Trade-off)
To run LLMs on consumer hardware (like a laptop without a high-end GPU), we use Quantization. This reduces the precision of the model weights from 16-bit floating point (FP16) to 8-bit integers (INT8) or even 4-bit integers (INT4).
- Analogy: It’s like taking a high-resolution photo (FP16) and converting it to a compressed JPEG (INT8). The file size is smaller, and it loads faster, but there is a slight loss in detail.
- ONNX Support: The ONNX Runtime automatically handles the de-quantization if the model is quantized using specific ONNX-compatible tools (like onnxruntime-tools).
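The core arithmetic of symmetric INT8 quantization fits in a few lines. A sketch (quantize/dequantize are our own toy functions; real quantizers work per-channel or per-block and handle zero-points):

```python
def quantize(weights):
    # Map the largest magnitude onto 127, then round every weight to an int.
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.5, -1.27, 1.0, 0.0]
q, scale = quantize(weights)
restored = dequantize(q, scale)
print(q)         # small integers in [-127, 127], stored in 1/4 the space of FP32
print(restored)  # close to, but not exactly, the original weights
```

The rounding error (at most half a quantization step per weight) is the "JPEG artifact" of the analogy above: usually invisible, occasionally noticeable on sensitive layers.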
Summary: The Local Inference Pipeline
To build a local AI application in C#, we are essentially building a pipeline:
- Acquisition: Downloading the ONNX model (the static graph).
- Tokenization: Converting text to integers using a vocabulary map.
- Session Initialization: Creating the bridge to the native runtime and selecting the Execution Provider (GPU or CPU).
- Inference: Passing the input tensor through the graph to get logits.
- Sampling: Converting logits back to a token, then back to text.
This architecture decouples the model (data) from the execution engine (logic), allowing C# developers to leverage the full power of local hardware without writing complex CUDA kernels themselves. The Microsoft.ML.OnnxRuntime package is the key that unlocks this door, providing a managed, type-safe interface to the unmanaged, high-performance world of AI inference.
Basic Code Example
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
namespace EdgeAIHelloWorld
{
class Program
{
static async Task Main(string[] args)
{
// 1. SETUP: Define the model path and ensure the directory exists.
// In a real app, this would be a configurable setting.
string modelDirectory = Path.Combine(Environment.CurrentDirectory, "models");
string modelPath = Path.Combine(modelDirectory, "phi-3-mini-4k-instruct-onnx", "cpu_and_mobile", "cpu-int4-rtn-block-32-acc-level-4.onnx");
Directory.CreateDirectory(modelDirectory);
Console.WriteLine($"[1] Checking for model at: {modelPath}");
// 2. MODEL LOADING: Verify the model exists or download it if missing.
// This ensures the example is self-contained and runnable immediately.
if (!File.Exists(modelPath))
{
Console.WriteLine("[2] Model not found. Starting download...");
// NOTE: In a production environment, you would cache this model locally
// or use a secure artifact repository.
await DownloadModelAsync(modelPath);
}
else
{
Console.WriteLine("[2] Model found locally.");
}
// 3. SESSION CREATION: Initialize the ONNX Runtime Inference Session.
// We use 'using' statements to ensure proper disposal of unmanaged resources.
// We explicitly target the CPU execution provider for maximum compatibility.
Console.WriteLine("[3] Initializing ONNX Runtime Session...");
var sessionOptions = new SessionOptions();
sessionOptions.LogSeverityLevel = OrtLoggingLevel.ORT_LOGGING_LEVEL_WARNING;
// For GPU acceleration (CUDA on NVIDIA, or DirectML on Windows),
// you would configure the options differently, e.g.:
// var sessionOptions = SessionOptions.MakeSessionOptionWithCudaProvider();
// OR (requires the Microsoft.ML.OnnxRuntime.DirectML package):
// sessionOptions.AppendExecutionProvider_DML();
using var inferenceSession = new InferenceSession(modelPath, sessionOptions);
// 4. TOKENIZATION: Convert text input into numerical tokens.
// LLMs do not process strings directly; they process integers (token IDs).
// We will simulate a tokenizer for this "Hello World" example.
// Real-world usage requires a dedicated tokenizer library (e.g., Microsoft.ML.Tokenizers).
string prompt = "Explain the concept of Edge AI in one sentence.";
Console.WriteLine($"\n[4] Input Prompt: \"{prompt}\"");
var tokenizer = new SimpleTokenizer();
var inputTokens = tokenizer.Encode(prompt);
// 5. PREPARE TENSORS: Convert token lists into ONNX Runtime tensors.
// ONNX Runtime expects specific input shapes (dimensions).
// For this model, inputs are usually: input_ids (1, sequence_length) and attention_mask (1, sequence_length).
var inputIds = new DenseTensor<long>(inputTokens.Select(t => (long)t).ToArray(), [1, inputTokens.Length]);
var attentionMask = new DenseTensor<long>(Enumerable.Repeat(1L, inputTokens.Length).ToArray(), [1, inputTokens.Length]);
// 6. BUILD INPUT CONTAINER: Map input names to tensors.
// The input names (e.g., "input_ids") must match the ONNX model's graph definition exactly.
var inputs = new List<NamedOnnxValue>
{
NamedOnnxValue.CreateFromTensor("input_ids", inputIds),
NamedOnnxValue.CreateFromTensor("attention_mask", attentionMask)
};
// 7. INFERENCE: Execute the forward pass (local inference).
Console.WriteLine("[5] Running inference...");
using var results = inferenceSession.Run(inputs);
// 8. POST-PROCESSING: The model outputs 'logits' (raw scores) shaped
// [1, sequence_length, vocab_size]. We read them as a float tensor.
var logits = results.First().AsTensor<float>();
// 9. DECODING: Greedy decoding. We take the highest-scoring token at the
// last position; a full chat loop would append it to the input and re-run.
int lastPosition = logits.Dimensions[1] - 1;
int nextTokenId = 0;
float bestScore = float.NegativeInfinity;
for (int v = 0; v < logits.Dimensions[2]; v++)
{
    if (logits[0, lastPosition, v] > bestScore)
    {
        bestScore = logits[0, lastPosition, v];
        nextTokenId = v;
    }
}
string decodedResponse = tokenizer.Decode(new[] { nextTokenId });
Console.WriteLine($"\n>>> RESULT: {decodedResponse}");
}
static async Task DownloadModelAsync(string destinationPath)
{
// Hugging Face URL for a standard ONNX Phi-3 Mini model (CPU)
// Note: URLs change. In production, use a versioned link or hash verification.
string url = "https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx/resolve/main/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4.onnx";
using var httpClient = new HttpClient();
// Important: Set a reasonable timeout for large downloads
httpClient.Timeout = TimeSpan.FromMinutes(10);
using var response = await httpClient.GetAsync(url, HttpCompletionOption.ResponseHeadersRead);
response.EnsureSuccessStatusCode();
// Create the directory structure if it doesn't exist
var directory = Path.GetDirectoryName(destinationPath);
if (!string.IsNullOrEmpty(directory)) Directory.CreateDirectory(directory);
await using var fileStream = new FileStream(destinationPath, FileMode.Create, FileAccess.Write, FileShare.None);
await response.Content.CopyToAsync(fileStream);
Console.WriteLine($"[Download Complete] Model saved to {destinationPath}");
}
}
/// <summary>
/// A minimalistic tokenizer for demonstration purposes.
/// In a real application, use Microsoft.ML.Tokenizers or HuggingFace tokenizers.
/// This class simulates the mapping between text and integers.
/// </summary>
public class SimpleTokenizer
{
private readonly Dictionary<string, int> _vocab;
private readonly Dictionary<int, string> _invVocab;
public SimpleTokenizer()
{
// A tiny vocabulary for demonstration.
// Real models have 32,000+ tokens.
_vocab = new Dictionary<string, int>
{
{ "<s>", 1 }, { "</s>", 2 }, { "<unk>", 3 },
{ "Explain", 100 }, { "the", 101 }, { "concept", 102 },
{ "of", 103 }, { "Edge", 104 }, { "AI", 105 },
{ "in", 106 }, { "one", 107 }, { "sentence", 108 },
{ ".", 109 }, { "is", 110 }, { "a", 111 },
{ "technology", 112 }, { "that", 113 }, { "processes", 114 },
{ "data", 115 }, { "locally", 116 }, { "on", 117 },
{ "devices", 118 }, { "rather", 119 }, { "than", 120 },
{ "cloud", 121 }, { "servers", 122 }
};
_invVocab = _vocab.ToDictionary(kvp => kvp.Value, kvp => kvp.Key);
}
public int[] Encode(string text)
{
// Simple whitespace tokenization for demo
var tokens = text.Split(new[] { ' ', '\t', '\n', '\r' }, StringSplitOptions.RemoveEmptyEntries);
var ids = new List<int> { 1 }; // Start with <s>
foreach (var token in tokens)
{
// Clean punctuation for matching
string cleanToken = token.Trim('.', ',', '!', '?');
if (_vocab.TryGetValue(cleanToken, out int id))
{
ids.Add(id);
}
else
{
ids.Add(3); // <unk>
}
}
ids.Add(2); // End with </s>
return ids.ToArray();
}
public string Decode(int[] ids)
{
var words = new List<string>();
foreach (var id in ids)
{
if (id == 1 || id == 2) continue; // Skip start/end tokens
if (_invVocab.TryGetValue(id, out string? word))
{
words.Add(word);
}
else
{
words.Add("[UNK]");
}
}
return string.Join(" ", words);
}
}
}
Detailed Line-by-Line Explanation
1. Setup and Namespace Imports
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
- Microsoft.ML.OnnxRuntime: This is the core namespace containing the InferenceSession class, which is the engine that loads the model and executes computations.
- Microsoft.ML.OnnxRuntime.Tensors: Provides the DenseTensor<T> class. ONNX Runtime does not use standard arrays for inputs; it requires data wrapped in a Tensor structure that defines dimensions (shape) and data type.
- System.Net.Http: Used here to download the model file dynamically, ensuring the code runs even if the user hasn't manually downloaded the model yet.
2. The Main Execution Flow
string modelDirectory = Path.Combine(Environment.CurrentDirectory, "models");
string modelPath = Path.Combine(modelDirectory, "phi-3-mini-4k-instruct-onnx", "cpu_and_mobile", "cpu-int4-rtn-block-32-acc-level-4.onnx");
Directory.CreateDirectory(modelDirectory);
- Context: We define a path for the ONNX model. In this example, we target the Phi-3 Mini model quantized to INT4 (4-bit integers). This quantization drastically reduces memory usage (crucial for Edge AI) while maintaining reasonable accuracy.
Directory.CreateDirectory: Ensures the folder structure exists before attempting to write files.
3. Model Download Logic
- Why this matters: In Edge AI, models are often large (hundreds of MBs). We check locally first to avoid unnecessary network traffic.
DownloadModelAsync: A helper method usingHttpClientto stream the file from Hugging Face directly to disk. This keeps the application memory efficient by not loading the entire model into RAM during the download.
4. Session Initialization (The Core)
var sessionOptions = new SessionOptions();
sessionOptions.LogSeverityLevel = OrtLoggingLevel.ORT_LOGGING_LEVEL_WARNING;
using var inferenceSession = new InferenceSession(modelPath, sessionOptions);
- SessionOptions: This object configures the runtime behavior.
- Execution Providers (EPs): By default, ONNX Runtime uses the CPU. However, for Edge AI, performance is critical. To use a GPU (NVIDIA) or NPU (Neural Processing Unit), you would configure SessionOptions to use the CUDA or DirectML execution provider (Windows).
- using statement: InferenceSession manages native (C++) resources. The using keyword ensures these resources are released immediately when the scope ends, preventing memory leaks on resource-constrained edge devices.
5. Tokenization (Text to Numbers)
- The Problem: LLMs cannot understand strings like "Hello". They only understand numbers (Token IDs).
- The Solution: We use a SimpleTokenizer. In a real-world scenario, you would use the Microsoft.ML.Tokenizers NuGet package, which handles complex sub-word splitting (Byte-Pair Encoding).
- Note: The SimpleTokenizer here is a dictionary lookup for demonstration. It maps words to IDs (e.g., "Edge" -> 104).
6. Tensor Construction
var inputIds = new DenseTensor<long>(inputTokens.Select(t => (long)t).ToArray(), [1, inputTokens.Length]);
var attentionMask = new DenseTensor<long>(Enumerable.Repeat(1L, inputTokens.Length).ToArray(), [1, inputTokens.Length]);
- Shape [1, sequence_length]:
  - 1: The Batch Size. We are processing 1 prompt at a time.
  - sequence_length: The number of tokens in our prompt.
- input_ids: The actual token IDs.
- attention_mask: Tells the model which tokens to pay attention to. We use 1 for every token (meaning "attend to this") and 0 for padding (if we had padding).
- Data Type: ONNX models are strict about types. Phi-3 expects long (Int64) for inputs, not int (Int32).
7. Creating the Input Container
var inputs = new List<NamedOnnxValue>
{
NamedOnnxValue.CreateFromTensor("input_ids", inputIds),
NamedOnnxValue.CreateFromTensor("attention_mask", attentionMask)
};
- Mapping: ONNX models have named inputs. You must know these names (usually found in the model documentation or by inspecting the model graph).
- "input_ids": Standard name for token IDs.
- "attention_mask": Standard name for the mask.
- NamedOnnxValue: This class acts as a bridge, linking the C# tensor object to the specific input node in the ONNX graph.
8. Inference Execution
- Run(): This is the synchronous execution call. It performs the "Forward Pass": data flows from the input layer, through the hidden layers (matrix multiplications, activation functions), to the output layer.
- On an Edge device (like a Raspberry Pi or Jetson Nano), this single line consumes the most CPU/GPU cycles.
- results: Contains the output tensors defined by the model.
9. Output Processing
- Output Interpretation: The model outputs Logits. These are raw, unnormalized scores for every possible token in the vocabulary.
- Greedy Decoding: For this simple example, we take the argmax of the logits at the last position to pick the single most likely next token. In a full implementation, you would iterate this process: predict the next token, append it to the input, and run inference again.
10. Decoding
- We convert the generated integer IDs back into human-readable text using the inverse vocabulary of our tokenizer.
Common Pitfalls
1. Input Name Mismatch
   - The Mistake: Passing a tensor to an input named "tokens" when the model expects "input_ids".
   - The Result: OnnxRuntimeException: Invalid Argument: Node (node_name) ... Missing required input: 'input_ids'.
   - The Fix: Always verify input names using Netron (a model visualization tool) or by inspecting inferenceSession.InputMetadata.Keys.

2. Data Type Mismatch (Long vs Int)
   - The Mistake: Creating a DenseTensor<int> when the model expects DenseTensor<long> (Int64).
   - The Result: A runtime error stating that the data type is incompatible.
   - The Fix: Check the model's input schema. LLMs almost always require Int64 for token IDs.

3. Forgetting the Execution Provider
   - The Mistake: Running a large model (like Llama 2 or Phi-3) on the CPU without specific optimizations.
   - The Result: Extremely slow inference (e.g., 1 token per second).
   - The Fix: If you have a compatible NVIDIA GPU, install the CUDA packages and use SessionOptions.MakeSessionOptionWithCudaProvider(). For Windows laptops with Intel/AMD integrated graphics, use DirectML.

4. Memory Leaks in Loops
   - The Mistake: Creating InferenceSession or Tensor objects inside a while loop (e.g., a chat loop) without disposing of them.
   - The Result: The application consumes gigabytes of RAM and crashes on edge devices.
   - The Fix: Ensure IDisposable objects (Sessions, Tensors, Results) are disposed of properly using using blocks or explicit .Dispose() calls.
Visualizing the Inference Flow
The following diagram illustrates the data flow within the C# application and how it interacts with the ONNX Runtime engine.
Flow Description:
- Text Input: The user provides a raw string.
- Tokenization: The string is split into sub-word tokens and mapped to integers.
- Tensorization: Integers are wrapped in a Tensor structure compatible with the ONNX engine.
- ONNX Runtime: The engine executes the mathematical operations defined by the .onnx file.
- Decoding: The resulting integers are mapped back to strings for the user.
The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.
Code License: All code examples are released under the MIT License. Github repo.
Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.