
Chapter 14: Text-to-Speech (TTS) with Local Models

Theoretical Foundations

The transition from processing text to generating human-like speech represents a fundamental shift in how applications interact with users. While local Large Language Models (LLMs) like Phi or Llama excel at understanding and generating text, they remain silent—literally. To bridge this gap, we implement Text-to-Speech (TTS) systems that run entirely on the edge, ensuring privacy, low latency, and offline capability. This section dissects the theoretical architecture of local TTS, focusing on the conversion of text into acoustic features and subsequently into raw audio waveforms using the ONNX Runtime within a C# environment.

The TTS Pipeline: A Multi-Stage Transformation

At its core, a modern TTS system is not a single monolithic model but a pipeline of specialized components. Unlike older concatenative methods that stitched together pre-recorded phonemes, modern neural TTS utilizes a sequence-to-sequence approach. The process generally follows this flow:

  1. Text Normalization & Tokenization: Raw text is cleaned and converted into a sequence of phonemes or characters.
  2. Acoustic Model (Text-to-Spectrogram): This model (often based on architectures like FastSpeech or VITS) converts the token sequence into a visual representation of sound—the spectrogram.
  3. Vocoder (Spectrogram-to-Waveform): This model takes the abstract spectrogram and reconstructs the raw audio waveform (the actual sound waves).

Analogy: The Architect and the Builder

Imagine constructing a building. The Acoustic Model acts as the architect. It takes the blueprint (the text) and creates a detailed schematic diagram (the spectrogram). This diagram doesn't look like a building; it shows the structure, the height of the walls, and the placement of windows, but it is abstract. The Vocoder is the master builder. It looks at the schematic and uses physical materials (mathematical synthesis) to construct the actual physical building (the audio waveform). One cannot exist without the other if you want a functional result.

Deep Dive 1: The Acoustic Model (The Architect)

In the context of local TTS, we often utilize models like Piper or VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech). These models are typically trained on massive datasets of audio and text pairs.

The Spectrogram: Visualizing Sound

Before understanding the model, we must understand its output: the Mel Spectrogram.

Sound is a wave. In a digital system, we represent it as a sequence of amplitude values over time (the waveform). However, raw waveforms are high-dimensional and noisy for a neural network to process directly. A spectrogram breaks the audio into short time windows and analyzes the frequency content of each window.

A Mel Spectrogram is a variation where the frequency axis is converted to the Mel scale, which mimics human hearing sensitivity (we perceive differences between lower frequencies better than between higher ones). To a neural network, a Mel Spectrogram looks like an image. The horizontal axis is time, the vertical axis is frequency (the mel bands), and the value at each point is the intensity of that frequency at that moment.
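The Mel conversion itself is a short formula. A common variant (the HTK formula, shown here as an illustration; libraries differ slightly in the constants they use) can be sketched as:

```csharp
using System;

class MelScale
{
    // Convert a frequency in Hz to the Mel scale (HTK formula).
    static double HzToMel(double hz) => 2595.0 * Math.Log10(1.0 + hz / 700.0);

    // Inverse: Mel back to Hz.
    static double MelToHz(double mel) => 700.0 * (Math.Pow(10.0, mel / 2595.0) - 1.0);

    static void Main()
    {
        // Equal steps in Mel are perceptually similar, but cover
        // ever-wider ranges in Hz as frequency increases.
        Console.WriteLine($"1000 Hz = {HzToMel(1000):F1} mel");
        Console.WriteLine($"8000 Hz = {HzToMel(8000):F1} mel");
    }
}
```

Note how 1000 Hz lands near 1000 mel (the scale is anchored there), while the octaves above it compress into ever-smaller mel ranges.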

Why is this critical for C# developers? When we load these models in ONNX Runtime, we are not dealing with "audio" initially. We are dealing with tensors of floating-point numbers representing this spectral image. In C#, we manage these tensors using the Microsoft.ML.OnnxRuntime library. The input to the TTS model is a tensor of integers (text tokens), and the output is a 3D tensor representing the Mel Spectrogram (Batch Size × Mel Channels × Time Frames).
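A sketch of what those tensors look like in code, assuming a hypothetical model file `acoustic_model.onnx` whose input is named `input` (real models differ; inspect yours with Netron before writing this code):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

// Token IDs for the acoustic model; many exported models expect Int64.
long[] tokenIds = { 12, 5, 19, 19, 26 };
var inputTensor = new DenseTensor<long>(tokenIds, new[] { 1, tokenIds.Length });

using var session = new InferenceSession("acoustic_model.onnx"); // hypothetical path
var inputs = new List<NamedOnnxValue>
{
    NamedOnnxValue.CreateFromTensor("input", inputTensor) // input name is an assumption
};
using var results = session.Run(inputs);

// Expected output shape: [Batch, MelChannels, TimeFrames], e.g. [1, 80, T].
var mel = results.First().AsTensor<float>();
Console.WriteLine($"Mel shape: [{string.Join(", ", mel.Dimensions.ToArray())}]");
```

The key point is the shape discipline: a 2D integer tensor goes in, a 3D float tensor comes out, and every downstream step depends on reading those dimensions correctly.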

The Architecture: VITS and Monotonic Alignment

Models like VITS utilize a Variational Autoencoder (VAE) combined with Normalizing Flows. Without getting lost in the calculus, the goal is to map the discrete text input to a continuous latent space that represents speech.

A critical concept here is Monotonic Alignment Search (MAS). When we speak, the timing of phonemes is rigid; "Hello" always flows in a specific order. The acoustic model must learn this alignment without explicit timing labels during training. It ensures that as the text tokens progress, the corresponding audio frames progress linearly.
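The effect of a monotonic alignment is easiest to see as duration expansion: each token is assigned a number of output frames, and the frame-to-token mapping never moves backwards. A toy sketch (the durations here are invented, not model output):

```csharp
using System;
using System.Collections.Generic;

class AlignmentDemo
{
    // Expand per-token frame durations into a monotonic token-per-frame map.
    static int[] ExpandDurations(int[] durations)
    {
        var frames = new List<int>();
        for (int token = 0; token < durations.Length; token++)
            for (int d = 0; d < durations[token]; d++)
                frames.Add(token); // each frame "points back" at its token
        return frames.ToArray();
    }

    static void Main()
    {
        // Token 0 lasts 2 frames, token 1 lasts 3, token 2 lasts 1.
        int[] durations = { 2, 3, 1 };
        Console.WriteLine(string.Join(" ", ExpandDurations(durations)));
        // Prints: 0 0 1 1 1 2 — the token index only ever increases (monotonic).
    }
}
```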

In C#, when we invoke the inference session for the acoustic model, we are essentially asking the model: "Given this sequence of token IDs, generate the sequence of Mel frames that statistically resemble the training data of a human voice."

Deep Dive 2: The Vocoder (The Builder)

The spectrogram generated by the acoustic model is a magnitude representation. It lacks phase information (the "shape" of the wave). Converting a spectrogram back to audio is an ill-posed problem: infinitely many phase configurations produce the same magnitude.

Modern vocoders, such as HiFi-GAN (often used with Piper), are Generative Adversarial Networks (GANs). They are trained to generate waveforms that are indistinguishable from real audio.

How HiFi-GAN Works

HiFi-GAN uses a generator to create audio from the spectrogram and a discriminator to judge if the audio is real or fake. The generator learns to upsample the spectrogram (which has a low time resolution) to the high-resolution audio waveform (typically 22,050 Hz or 44,100 Hz).
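The upsampling factor ties frame count directly to audio duration. Assuming a hop length of 256 samples at 22,050 Hz (typical values for Piper-style models; verify against your model's config file), the arithmetic looks like this:

```csharp
using System;

int melFrames = 300;     // time frames produced by the acoustic model
int hopLength = 256;     // audio samples generated per mel frame
int sampleRate = 22050;  // samples per second of output audio

int audioSamples = melFrames * hopLength;            // 300 * 256 = 76,800 samples
double seconds = audioSamples / (double)sampleRate;  // ~3.48 seconds of speech

Console.WriteLine($"{melFrames} frames -> {audioSamples} samples ({seconds:F2} s)");
```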

The C# Implication: This is computationally expensive. Generating audio frame-by-frame in real time requires efficient memory management. In C#, we cannot afford to allocate massive arrays on the heap for every millisecond of audio. We must use Span<T> and Memory<T> to manage buffers for the audio chunks.

Furthermore, the vocoder takes the Mel Spectrogram tensor and outputs a raw audio tensor (1D array of floats). This raw audio must be converted to a standard format (like PCM 16-bit) to be playable.
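That float-to-PCM conversion is mechanical but easy to get wrong: forgetting to clamp produces harsh clipping artifacts, because vocoder output can slightly overshoot the [-1, 1] range. A minimal sketch:

```csharp
using System;

class PcmConvert
{
    // Convert normalized float samples [-1, 1] to 16-bit PCM bytes (little-endian).
    static byte[] FloatToPcm16(ReadOnlySpan<float> samples)
    {
        var bytes = new byte[samples.Length * 2];
        for (int i = 0; i < samples.Length; i++)
        {
            // Clamp first: vocoder output can slightly exceed [-1, 1].
            float clamped = Math.Clamp(samples[i], -1f, 1f);
            short pcm = (short)(clamped * short.MaxValue);
            bytes[i * 2] = (byte)(pcm & 0xFF);
            bytes[i * 2 + 1] = (byte)((pcm >> 8) & 0xFF);
        }
        return bytes;
    }

    static void Main()
    {
        float[] samples = { 0f, 0.5f, -1f, 2f }; // 2f overshoots and must be clamped
        byte[] pcm = FloatToPcm16(samples);
        Console.WriteLine($"{samples.Length} floats -> {pcm.Length} bytes");
    }
}
```

In a hot loop you would rent the byte buffer from ArrayPool<byte> instead of allocating it, but the conversion logic is the same.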

ONNX Runtime: Hardware Abstraction

The beauty of using ONNX (Open Neural Network Exchange) is hardware abstraction. Whether you are running on an Intel CPU, an AMD GPU, or an ARM-based Raspberry Pi, the ONNX Runtime optimizes the execution graph.

The Execution Graph

When we load a TTS model in C#, we are loading a directed acyclic graph (DAG). Each node in the graph represents an operation (e.g., Matrix Multiplication, Layer Normalization, Convolution).

The diagram illustrates the sequence of fundamental neural network operations—Matrix Multiplication, Layer Normalization, and Convolution—processing input data to generate an output feature map.

Memory Management and Async/Await

TTS inference is blocking by nature (math takes time). However, in a responsive UI or a server application, we cannot freeze the main thread while the GPU crunches numbers. This is where C#'s async and await keywords become architectural pillars.

We wrap the ONNX inference calls in Task.Run() or use the native async capabilities of the ONNX Runtime (if supported by the execution provider). This allows the application to remain responsive while the "architect" and "builder" are working in the background.
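A sketch of such a wrapper, assuming a synchronous Synthesize method that performs the two ONNX inference calls (the method and class names here are illustrative, not a fixed API):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class TtsService
{
    // Synchronous, CPU-bound inference (acoustic model + vocoder).
    private float[] Synthesize(string text)
    {
        // ... the InferenceSession.Run() calls would go here ...
        return new float[22050]; // placeholder: one second of silence
    }

    // Offload the blocking work so the UI/event loop stays responsive.
    public Task<float[]> SynthesizeAsync(string text, CancellationToken ct = default)
        => Task.Run(() =>
        {
            ct.ThrowIfCancellationRequested();
            return Synthesize(text);
        }, ct);
}
```

The caller simply writes `var audio = await tts.SynthesizeAsync(text, ct);` and the UI thread never blocks on the math.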

Real-World Analogy: The Orchestra Conductor

Think of the TTS system as an orchestra.

  • The Text: The musical score.
  • The Acoustic Model (VITS): The conductor. They don't play an instrument but dictate the tempo, volume, and pitch (the spectrogram) for every section.
  • The Vocoder (HiFi-GAN): The musicians. They take the conductor's abstract directions and produce the actual sound waves from their instruments.

In our C# application, the InferenceSession object is the stage manager. It loads the sheet music (the .onnx file), allocates the rehearsal space (memory buffers), and ensures the conductor and musicians stay in sync (tensor shapes and dimensions).

Edge Cases and Nuances

  1. Voice Conditioning (Speaker Embeddings): Many local TTS models support multiple voices. This is handled via Speaker Embeddings (vectors of floats). In the C# implementation, we must manage these embeddings as additional inputs to the ONNX graph. If the tensor shape does not match the expected embedding dimension (e.g., 256 floats), the inference will fail.

    • Concept Reference: As discussed in Book 8, Chapter 12 regarding vector databases, we treat these embeddings similarly—static vectors that modify the context of the inference.
  2. Length Scales and Noise: Neural networks are sensitive to input scaling. The acoustic model outputs a spectrogram with a specific time dimension. If the text is long, the time dimension grows. The vocoder must handle variable-length inputs. In C#, this means we cannot pre-allocate fixed-size arrays. We must size buffers dynamically (for example, by renting from ArrayPool<float>), carefully managing heap allocations to prevent Garbage Collection (GC) spikes, which would cause audio stuttering.

  3. Streaming vs. Batch: For real-time conversation (like the LLM chatbots in previous chapters), waiting for the full text to be generated before starting TTS introduces unacceptable latency.

    • Strategy: We implement Streaming TTS. As the LLM generates text chunks, we send them to the TTS pipeline immediately.
    • C# Feature: We use System.Threading.Channels or IAsyncEnumerable<T> to create a producer/consumer pipeline. The LLM produces text tokens (producer), and the TTS system consumes them to generate audio chunks (consumer). This decouples the generation speed from the playback speed.
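The producer/consumer shape described above can be sketched with System.Threading.Channels; the LLM and the synthesis step are stubbed out here with placeholders:

```csharp
using System;
using System.Threading.Channels;
using System.Threading.Tasks;

class StreamingTtsDemo
{
    static async Task Main()
    {
        // Unbounded for simplicity; a bounded channel would add back-pressure
        // if the LLM produces text faster than the TTS engine can synthesize.
        var channel = Channel.CreateUnbounded<string>();

        // Producer: stands in for the LLM emitting text chunks.
        var producer = Task.Run(async () =>
        {
            foreach (var chunk in new[] { "Hello, ", "world. ", "Goodbye." })
                await channel.Writer.WriteAsync(chunk);
            channel.Writer.Complete();
        });

        // Consumer: stands in for the TTS engine synthesizing each chunk.
        await foreach (var chunk in channel.Reader.ReadAllAsync())
            Console.WriteLine($"Synthesizing: \"{chunk}\"");

        await producer;
    }
}
```

Because the writer and reader run independently, the first audio chunk can start playing while the LLM is still generating the rest of the sentence.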

The Role of Modern C# Features

To build a robust local TTS system, we leverage specific C# capabilities:

  • ref struct and Span<T>: When processing raw audio buffers (converting float arrays to byte arrays for playback), using Span<T> allows us to work with stack-allocated memory or shared memory slices, avoiding heap allocations entirely. This is vital for high-frequency audio processing loops.
  • record types: For configuration management (e.g., record TtsConfig(string ModelPath, int SampleRate, float LengthScale)), records provide immutability and value semantics, ensuring that configuration changes propagate predictably through the pipeline.
  • Dependency Injection (DI): We abstract the TTS engine behind an interface, just as we did with the LLMs.
    public interface ITtsEngine
    {
        Task<ReadOnlyMemory<byte>> SynthesizeAsync(string text, CancellationToken ct);
    }
    
    This allows us to swap between a high-quality but slow vocoder and a faster but lower-quality one based on the device capabilities (e.g., desktop vs. mobile edge device).
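Registration then becomes a one-line swap, sketched here with Microsoft.Extensions.DependencyInjection; the two engine class names are hypothetical stand-ins for whatever implementations you write, and processor count is only one crude proxy for device capability:

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;

var services = new ServiceCollection();

// Pick the implementation once, based on device capability;
// the rest of the app only ever sees ITtsEngine.
if (Environment.ProcessorCount >= 8)
    services.AddSingleton<ITtsEngine, HifiGanTtsEngine>();     // high quality, slower
else
    services.AddSingleton<ITtsEngine, FastVocoderTtsEngine>(); // lower quality, faster

using var provider = services.BuildServiceProvider();
var tts = provider.GetRequiredService<ITtsEngine>();
```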

The End-to-End Pipeline

  1. Input: A string of text enters the C# application.
  2. Tokenization: The text is converted into a sequence of integers (tokens) using a phonemizer.
  3. Acoustic Inference: The tokens are passed to the ONNX Acoustic Model. The runtime executes the graph, producing a Mel Spectrogram tensor.
  4. Vocoder Inference: The spectrogram tensor is passed to the ONNX Vocoder model. The runtime executes the upsampling graph, producing a raw audio tensor (floats).
  5. Post-Processing: The float audio is converted to PCM 16-bit integers (standard audio format).
  6. Playback: The PCM data is fed into an audio buffer (e.g., WaveOutEvent in NAudio) for real-time playback.

By understanding this pipeline, we move beyond simple API calls and gain full control over the speech synthesis process, enabling us to create truly private, offline, and responsive voice-enabled applications.

Basic Code Example

Here is a self-contained, "Hello World" level example of running a Text-to-Speech (TTS) model locally using C# and ONNX Runtime.

This example simulates the core logic of a TTS pipeline: converting a text string into a sequence of audio tokens (acoustic features) and then synthesizing those tokens into a raw audio waveform using a vocoder.

Prerequisites

To run this code, you need the ONNX Runtime NuGet package. You can install it via the .NET CLI:

dotnet add package Microsoft.ML.OnnxRuntime

The Code Example

This example uses a mock "TTS" model and a "Vocoder" model to demonstrate the pipeline. In a real-world scenario, you would download pre-trained ONNX models (like Piper or VITS) and replace the dummy inference logic with actual InferenceSession.Run() calls.

using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

namespace LocalTtsDemo
{
    class Program
    {
        static void Main(string[] args)
        {
            // 1. Define the input text
            string textInput = "Hello, world! This is local TTS.";
            Console.WriteLine($"Input Text: \"{textInput}\"");

            // 2. Tokenize: Convert text to phonemes/IDs (Simulated)
            // In a real app, this uses a tokenizer model or a dictionary.
            var tokenIds = TokenizeText(textInput);
            Console.WriteLine($"Generated Tokens: {string.Join(", ", tokenIds)}");

            // 3. Acoustic Model Inference: Tokens -> Spectrogram (Mel-Spectrogram)
            // This is the "Brain" of the TTS model (e.g., VITS or Piper encoder).
            var melSpectrogram = RunAcousticModel(tokenIds);
            Console.WriteLine($"Generated Spectrogram Shape: {melSpectrogram.Dimensions[1]}x{melSpectrogram.Dimensions[2]}");

            // 4. Vocoder Inference: Spectrogram -> Raw Audio Waveform
            // This converts spectral features into time-domain audio (e.g., WaveRNN or HiFi-GAN).
            var audioWaveform = RunVocoder(melSpectrogram);
            Console.WriteLine($"Generated Audio Samples: {audioWaveform.Length}");

            // 5. Save/Play Audio (Simulated)
            // In a real app, you would write these floats to a .wav file.
            SaveAudioToFile(audioWaveform, "output.wav");
            Console.WriteLine("Audio saved to 'output.wav' (Simulated).");
        }

        // --- Step 1: Text Tokenization ---
        static List<int> TokenizeText(string text)
        {
            // SIMULATION: A real TTS system uses a tokenizer (like BPE or phonemizer).
            // Here, we map characters to dummy IDs for demonstration.
            // IDs 1-26 might represent 'a'-'z', 27 is space, etc.
            var tokens = new List<int>();
            foreach (char c in text.ToLower())
            {
                if (c >= 'a' && c <= 'z') tokens.Add(c - 'a' + 1);
                else if (c == ' ') tokens.Add(27);
                else tokens.Add(28); // punctuation
            }
            return tokens;
        }

        // --- Step 2: Acoustic Model (Text -> Mel-Spectrogram) ---
        static DenseTensor<float> RunAcousticModel(List<int> tokenIds)
        {
            // SIMULATION: In reality, we load an ONNX model file.
            // var session = new InferenceSession("tts_model.onnx");

            // Create dummy input tensor based on token count
            // Shape: [BatchSize=1, SequenceLength=N]
            var inputTensor = new DenseTensor<float>(new[] { 1, tokenIds.Count });
            for (int i = 0; i < tokenIds.Count; i++)
                inputTensor[0, i] = tokenIds[i];

            // SIMULATION: Run Inference
            // var inputs = new List<NamedOnnxValue> { NamedOnnxValue.CreateFromTensor("input", inputTensor) };
            // var results = session.Run(inputs);
            // var outputTensor = results.First().AsTensor<float>();

            // MOCK RESULT: Generate a dummy Mel-Spectrogram
            // Shape: [BatchSize=1, MelChannels=80, TimeSteps=TokenCount * 2]
            int melChannels = 80;
            int timeSteps = tokenIds.Count * 2; 
            var mockMelSpectrogram = new DenseTensor<float>(new[] { 1, melChannels, timeSteps });

            // Fill with dummy data (simulating learned features)
            for (int t = 0; t < timeSteps; t++)
            {
                for (int m = 0; m < melChannels; m++)
                {
                    // Create a simple sine wave pattern to simulate audio features
                    mockMelSpectrogram[0, m, t] = (float)Math.Sin(t * 0.2 + m * 0.1);
                }
            }

            return mockMelSpectrogram;
        }

        // --- Step 3: Vocoder (Mel-Spectrogram -> Audio) ---
        static float[] RunVocoder(DenseTensor<float> melSpectrogram)
        {
            // SIMULATION: In reality, load the vocoder ONNX model (e.g., HiFi-GAN).
            // var session = new InferenceSession("vocoder.onnx");

            // MOCK RESULT: Generate dummy audio samples
            // Input Shape: [1, 80, TimeSteps]
            // Output Shape: [1, AudioLength]

            int timeSteps = melSpectrogram.Dimensions[2];
            int audioLength = timeSteps * 256; // Upsampling factor (e.g., 256x for HiFi-GAN)

            var audioBuffer = new float[audioLength];

            // Generate a dummy waveform based on the mel input
            for (int i = 0; i < audioLength; i++)
            {
                // Modulate amplitude by the mel features
                float feature = melSpectrogram[0, i % 80, i / 256];
                audioBuffer[i] = (float)Math.Sin(i * 0.05) * feature * 0.5f;
            }

            return audioBuffer;
        }

        // --- Step 4: Save Audio (Mock) ---
        static void SaveAudioToFile(float[] audioData, string filename)
        {
            // In a real application, you would use a library like NAudio or 
            // manually construct a WAV header to write the raw PCM data.
            // This is just a placeholder to show where the data goes.
            Console.WriteLine($"[Mock] Writing {audioData.Length} samples to {filename}...");
        }
    }
}

Detailed Line-by-Line Explanation

  1. using Microsoft.ML.OnnxRuntime; and ...Tensors;:

    • These namespaces provide the necessary classes to load ONNX models (InferenceSession) and manipulate data structures (Tensor, DenseTensor). ONNX Runtime is highly optimized for CPU and GPU inference.
  2. Main Method:

    • textInput: Defines the string we want to convert to speech.
    • TokenizeText(textInput): Raw text cannot be fed directly into neural networks. It must be converted into numerical IDs. This step is critical for any NLP task. In a production TTS system (like Piper), this involves a complex tokenizer that handles phonemes, punctuation, and special symbols.
    • RunAcousticModel(tokenIds): This is the first inference stage. The Acoustic Model (often a Transformer or FastSpeech variant) takes token IDs and predicts a Mel-Spectrogram. A Mel-Spectrogram is a visual representation of the audio's frequency content over time, but it is not yet audio.
    • RunVocoder(melSpectrogram): The Vocoder (like HiFi-GAN or WaveRNN) takes the Mel-Spectrogram and "vocalizes" it into a time-domain waveform (raw audio samples). This is the second inference stage.
    • SaveAudioToFile: Writes the raw float data to disk. In a real app, you must wrap this data in a WAV header (RIFF format) for standard media players to read it.
  3. TokenizeText Method:

    • Logic: Iterates through characters and assigns a dummy integer ID. This simulates the vocabulary lookup process.
    • Why: Neural networks operate on fixed-size vocabularies. Mapping 'a' -> 1, 'b' -> 2 allows the embedding layer to learn vector representations for these tokens.
  4. RunAcousticModel Method:

    • DenseTensor<float>: We create a tensor to hold the input data. The shape [1, N] corresponds to [Batch_Size, Sequence_Length].
    • Mock Inference: The code block containing // SIMULATION is where session.Run(inputs) would be called. We manually generate a sine wave pattern to represent the Mel-Spectrogram.
    • Shape [1, 80, T]: Standard TTS models output 80 Mel frequency bands (channels) across T time steps. This 3D tensor is the input for the Vocoder.
  5. RunVocoder Method:

    • Upsampling: The Vocoder converts the spectral features (frequency domain) into audio samples (time domain). This involves a significant increase in data size (e.g., 80 channels * T steps -> T * 256 audio samples).
    • Waveform Generation: The mock logic generates a sine wave modulated by the input features to simulate how a Vocoder reconstructs audio based on spectral envelopes.

Common Pitfalls

  1. Missing ONNX Runtime Dependencies:

    • Issue: The code compiles but fails at runtime with DllNotFoundException (e.g., onnxruntime or libonnxruntime.so).
    • Solution: Ensure you have installed the correct NuGet package (Microsoft.ML.OnnxRuntime). If deploying to Linux (like a Raspberry Pi for Edge AI), ensure the native runtime libraries are present in the execution folder or globally installed.
  2. Incorrect Tensor Shapes:

    • Issue: TTS models are extremely sensitive to input dimensions. Feeding a tensor of shape [N] when the model expects [1, N] will throw an OnnxRuntimeException.
    • Solution: Always verify the input/output names and shapes using a tool like Netron to visualize the .onnx model file before writing inference code.
  3. Audio Format Misinterpretation:

    • Issue: Saving the raw float array to a file and trying to play it results in static or silence.
    • Solution: Raw inference output is usually PCM 16-bit or Float32 data. You must add a WAV header (RIFF chunk) to the file so media players know the sample rate (e.g., 22050 Hz or 24000 Hz), bit depth, and channel count.
  4. Blocking the UI Thread:

    • Issue: TTS inference (especially Vocoder) is computationally expensive. Running this on the main thread of a GUI app will freeze the interface.
    • Solution: Wrap the inference calls in Task.Run() or use async/await to offload processing to a background thread.
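Pitfall 3 above can be fixed with a minimal RIFF header. A sketch for 16-bit mono PCM (44 header bytes followed by the raw data; the field layout follows the canonical WAVE format):

```csharp
using System;
using System.IO;

class WavWriter
{
    // Write 16-bit mono PCM data with a minimal 44-byte RIFF/WAVE header.
    static void WriteWav(string path, byte[] pcm16, int sampleRate)
    {
        using var fs = new FileStream(path, FileMode.Create);
        using var w = new BinaryWriter(fs);
        short channels = 1, bitsPerSample = 16;
        int byteRate = sampleRate * channels * bitsPerSample / 8;

        w.Write("RIFF"u8);                 // chunk ID
        w.Write(36 + pcm16.Length);        // total file size minus 8
        w.Write("WAVE"u8);
        w.Write("fmt "u8);                 // format sub-chunk (note the space)
        w.Write(16);                       // fmt chunk size for PCM
        w.Write((short)1);                 // audio format 1 = PCM
        w.Write(channels);
        w.Write(sampleRate);
        w.Write(byteRate);
        w.Write((short)(channels * bitsPerSample / 8)); // block align
        w.Write(bitsPerSample);
        w.Write("data"u8);                 // data sub-chunk
        w.Write(pcm16.Length);
        w.Write(pcm16);
    }
}
```

With this header in place, the output of the earlier float-to-PCM conversion plays in any standard media player.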

Visualizing the Pipeline

The data flow in this example follows a specific architecture common in modern Edge AI TTS systems (like VITS or Piper).

This diagram illustrates the asynchronous pipeline of an Edge AI Text-to-Speech system, where the Run() method or async/await keywords offload audio synthesis tasks from the main UI thread to a background thread to maintain responsiveness.

The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.





Code License: All code examples are released under the MIT License and are available in the book's GitHub repo.

Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.

All textual explanations, original diagrams, and illustrations are the intellectual property of the author. To support the maintenance of this site via AdSense, please read this content exclusively online. Copying, redistribution, or reproduction is strictly prohibited.