Chapter 19: Text-to-Speech - Building a Talking Assistant
Theoretical Foundations
The fundamental challenge in building a "talking assistant" is bridging the gap between the digital, asynchronous world of Large Language Models (LLMs) and the continuous, real-time nature of human auditory perception. An LLM does not think in sentences; it thinks in tokens. As discussed in previous chapters regarding the useCompletion hook, an LLM generates text incrementally, one token at a time. If we were to wait for the model to finish generating a complete paragraph before converting it to speech, the user experience would suffer from significant latency—a "stop-and-wait" interaction that feels unnatural and sluggish.
To solve this, we must implement a streaming audio synthesis pipeline. This pipeline treats the stream of text tokens from the LLM not as static data to be stored, but as a live feed of instructions for an audio synthesizer. The theoretical goal is to achieve Zero-Latency Auditory Feedback, where the audio output begins milliseconds after the first text token is generated, creating the illusion of a thinking, speaking entity rather than a pre-recorded response.
The Analogy: The Orchestra Conductor and the Sheet Music
To understand this architecture, imagine an orchestra performing a symphony.
- The LLM (The Composer): The model generates the musical score (the text). It writes note by note (token by token), deciding the melody and rhythm.
- The Vercel AI SDK (The Conductor): The SDK manages the flow of the performance. It reads the notes as they are written and cues the appropriate sections of the orchestra immediately. It doesn't wait for the entire movement to be composed before starting the performance.
- The Web Speech API (The Instrumentalist): This is the musician holding the instrument (the browser's audio synthesizer). When the conductor (SDK) gives the cue, the musician plays the note instantly.
- React Server Components (The Stage Manager): They ensure the sheet music is delivered securely to the conductor before the performance begins, handling the logistics of the initial setup without blocking the musicians.
In this analogy, the "text-to-speech" process is not a batch job; it is a live performance. The critical technical component enabling this is the Web Speech API, specifically the SpeechSynthesis interface, which allows JavaScript to programmatically generate audio output in the browser.
The Web Speech API: The Browser's Vocal Cords
The Web Speech API is the standard interface for converting text to speech in the client-side environment. It abstracts the complex Digital Signal Processing (DSP) required to turn text into audible waveforms.
How It Works Under the Hood
When window.speechSynthesis.speak() is called, the browser performs a multi-stage pipeline:
- Text Normalization: The raw text is processed to handle abbreviations, numbers, and punctuation (e.g., "Dr." becomes "Doctor").
- Grapheme-to-Phoneme (G2P) Conversion: The text is mapped to phonetic representations (IPA - International Phonetic Alphabet). This is crucial for pronunciation accuracy.
- Prosody Modeling: The engine applies intonation, stress, and rhythm based on punctuation and context.
- Synthesis: A pre-trained acoustic model (a concatenative synthesizer or, in modern engines, a neural TTS model) generates the audio waveform.
The Challenge of Streaming:
The standard speak() method is designed for complete utterances. If you pass a long string, it synthesizes the whole thing before playing. To achieve the streaming effect required for an AI assistant, we must treat the incoming token stream as a series of short, overlapping utterances. We need to buffer tokens and intelligently decide when to flush the buffer to the synthesizer to maintain natural speech flow without cutting words in half.
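To make the "short, overlapping utterances" idea concrete, here is a minimal sketch of a helper that splits a run of text into sentence-sized chunks, each of which could be queued as its own `SpeechSynthesisUtterance`. The function name and chunking rules are illustrative assumptions, not part of the Web Speech API:

```typescript
// Split text into sentence-sized chunks so each can be queued as its own
// utterance, instead of passing one long string to speak().
// (Hypothetical helper; the chunking rule is an illustration only.)
function splitIntoSpeakableChunks(text: string): string[] {
  const chunks: string[] = [];
  let buffer = '';
  for (const char of text) {
    buffer += char;
    // Flush on sentence-ending punctuation to keep prosody natural.
    if (char === '.' || char === '?' || char === '!') {
      chunks.push(buffer.trim());
      buffer = '';
    }
  }
  // Keep any trailing fragment that has no closing punctuation yet.
  if (buffer.trim().length > 0) chunks.push(buffer.trim());
  return chunks;
}
```

In the browser, each returned chunk would be wrapped in a `new SpeechSynthesisUtterance(chunk)` and passed to `speechSynthesis.speak()`, letting the engine's internal queue play them back-to-back.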
The Architecture: Integrating the Stream
The integration of the Vercel AI SDK's useCompletion with the Web Speech API requires a specific architectural pattern. We are essentially creating a bridge between two asynchronous streams: the text stream from the server and the audio synthesis queue on the client.
The Data Flow
- Server-Side (RSC): The initial request is handled by a React Server Component. This component securely fetches the initial prompt or context. It does not render the audio; it renders the initial state of the UI.
- Client-Side (Hook): The useCompletion hook initiates the request to the AI API and subscribes to the stream of tokens.
- The Bridge (Custom Hook): We create a custom hook (e.g., useSpeechSynthesisStream) that listens to the useCompletion state. As new tokens arrive in the completion string, we push them into a buffer.
- The Synthesizer Queue: The Web Speech API operates on an internal utterance queue, fed via speechSynthesis.speak() and cleared via speechSynthesis.cancel(). To prevent audio overlap and ensure continuity, we must manage this queue deliberately. We cannot simply call speak() for every token; that would create a cacophony of overlapping syllables.
Managing State: The "Boundary" Problem
A critical theoretical hurdle in text-to-speech streaming is word boundary detection. The Web Speech API provides an event called onboundary, which fires when the synthesizer finishes reading a word or sentence. This event is essential for synchronizing the visual UI (highlighting text) with the audio.
However, in a streaming context, the token stream is chunked arbitrarily by the LLM (often sub-word tokens). If we simply feed these chunks to the synthesizer, the engine might mispronounce them because it lacks the full context of the word.
The Solution: Lookahead Buffering
We implement a buffering strategy that waits for a "natural break" before synthesizing.
* Punctuation Trigger: If a token contains ., ?, !, or ,, we flush the buffer immediately.
* Whitespace Trigger: If a token ends with a space, we consider it a potential word boundary.
* Timeout Fallback: If the buffer exceeds a certain size (e.g., 10 tokens) or a time threshold (e.g., 200ms) without a break, we flush the buffer to prevent latency buildup.
This ensures that the synthesizer receives intelligible phrases rather than disjointed syllables.
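The three triggers above can be expressed as a single decision function. The shape of `BufferState` and the thresholds (10 tokens, 200 ms) mirror the text but are tunable assumptions, not constants of any API:

```typescript
// State of the lookahead buffer between flushes.
interface BufferState {
  tokens: string[];         // tokens accumulated since the last flush
  msSinceLastFlush: number; // elapsed time since the last flush
}

// Decide whether the buffer should be flushed to the synthesizer
// when a new token arrives. Pure function, so it is easy to unit-test.
function shouldFlush(state: BufferState, incomingToken: string): boolean {
  // Punctuation trigger: flush immediately on sentence/clause breaks.
  if (/[.?!,]/.test(incomingToken)) return true;
  // Whitespace trigger: a trailing space marks a word boundary, but only
  // flush once the buffer is large enough to be worth speaking.
  if (incomingToken.endsWith(' ') && state.tokens.length >= 10) return true;
  // Timeout fallback: prevent latency build-up when no break arrives.
  if (state.msSinceLastFlush >= 200) return true;
  return false;
}
```

A streaming loop would call `shouldFlush` for each token and, on `true`, hand the accumulated text to the synthesizer and reset the buffer.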
Browser Compatibility and Voice Selection
The Web Speech API is not uniform across browsers. Chrome uses Google's cloud-based synthesis (often requiring an internet connection for high-quality voices), while Safari uses on-device synthesis (offline capable but potentially lower quality).
In the context of the useCompletion architecture, we must abstract this away. We expose a Voice Selection mechanism. The API allows us to query available voices (speechSynthesis.getVoices()). These voices are objects containing metadata (name, lang, localService).
When building the assistant, we treat voice selection as a configuration parameter passed down from the UI to the synthesizer instance. This allows the user to switch between a "Standard" voice (high latency, high quality) and a "Local" voice (low latency, robotic quality) based on their network conditions.
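The selection itself can be kept as pure data logic. The sketch below ranks candidate voices by the user's latency/quality preference; `VoiceMeta` mirrors the `name`, `lang`, and `localService` fields of `SpeechSynthesisVoice`, and the ranking heuristic is an assumption for illustration:

```typescript
// Minimal view of the metadata exposed by SpeechSynthesisVoice.
interface VoiceMeta {
  name: string;
  lang: string;
  localService: boolean; // true = on-device (low latency), false = network-backed
}

// Pick a voice matching the requested language, preferring the user's
// local-vs-network choice but falling back to any matching-language voice.
function pickVoice(
  voices: VoiceMeta[],
  opts: { lang: string; preferLocal: boolean }
): VoiceMeta | undefined {
  const matching = voices.filter(v => v.lang === opts.lang);
  if (matching.length === 0) return undefined;
  return matching.find(v => v.localService === opts.preferLocal) ?? matching[0];
}
```

In the browser this would be driven by `pickVoice(speechSynthesis.getVoices(), { lang: 'en-US', preferLocal: true })`, ideally after the `voiceschanged` event has fired (see Common Pitfalls).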
Bringing the Layers Together
To build a talking assistant using Next.js and the Vercel AI SDK, we are orchestrating three distinct layers:
- The Data Layer (RSC): Securely delivers the initial prompt context.
- The Logic Layer (useCompletion): Streams tokens from the LLM in real-time.
- The Presentation Layer (Web Speech API): Converts the stream of tokens into an audible waveform, managed by a custom hook that handles buffering, state synchronization, and browser compatibility.
By decoupling the generation of text from the synthesis of audio, we allow the user to perceive the AI's response as fast as the network and the model allow, creating a seamless, conversational experience.
Basic Code Example
In a modern SaaS application, providing an audio feedback loop enhances accessibility and user experience. Instead of waiting for the entire AI response to generate, we stream tokens (words) and immediately synthesize speech. This creates a real-time "talking assistant" effect.
We will build a Client Component that consumes a React Server Component (RSC) stream. The client uses the native Web Speech API (window.speechSynthesis) to convert text tokens into audio as they arrive.
The architecture relies on the Vercel AI SDK's useCompletion hook (or the RSC streaming helpers) to handle the HTTP streaming connection and token parsing.
The Code Example
This example is a self-contained Next.js Client Component (TalkingAssistant.tsx). It assumes data is streamed from a Server Action.
'use client';
import React, { useState, useEffect, useRef } from 'react';
// Define the shape of the streamable UI or text token
// In a real app, this comes from `@ai-sdk/react` or a custom stream parser.
type StreamToken = {
type: 'text';
content: string;
};
/**
* TalkingAssistant Component
*
* A client-side component that listens to a text stream and synthesizes
* speech using the Web Speech API in real-time.
*/
export default function TalkingAssistant() {
// State to hold the visual text output (for users who prefer reading)
const [displayText, setDisplayText] = useState<string>('');
// State to manage the speech synthesis status
const [isSpeaking, setIsSpeaking] = useState<boolean>(false);
// Ref to store the SpeechSynthesisUtterance to allow pausing/resuming
const utteranceRef = useRef<SpeechSynthesisUtterance | null>(null);
// Ref to store the accumulated text buffer to ensure continuity
const textBufferRef = useRef<string>('');
/**
* Simulates a Server Action stream.
* In a real app, this would be replaced by `useStreamableUI` or `useChat`.
* We mock it here to make the example fully self-contained.
*/
const simulateStream = async (): Promise<void> => {
const mockTokens = [
'Hello, ',
'this is a ',
'real-time ',
'audio stream ',
'from your Next.js app.',
' We are using ',
'React Server Components ',
'and the Web Speech API.'
];
for (const token of mockTokens) {
// Simulate network latency
await new Promise(resolve => setTimeout(resolve, 300));
// 1. Update Visual UI
setDisplayText(prev => prev + token);
// 2. Trigger Audio Synthesis
speakText(token);
}
};
/**
* Handles Text-to-Speech synthesis using the Web Speech API.
* @param text - The text string to speak.
*/
const speakText = (text: string) => {
// Check browser support (speechSynthesis is browser-only)
if (typeof window === 'undefined' || !window.speechSynthesis) {
console.error('Web Speech API not supported in this environment.');
return;
}
// Create a new utterance for the specific text token
const utterance = new SpeechSynthesisUtterance(text);
// Optional: Select a specific voice (e.g., English US)
const voices = window.speechSynthesis.getVoices();
const preferredVoice = voices.find(v => v.name.includes('Google US English') || v.lang === 'en-US');
if (preferredVoice) utterance.voice = preferredVoice;
// Configure utterance settings
utterance.rate = 1.0; // Speed
utterance.pitch = 1.0; // Tone
utterance.volume = 1.0; // Volume
// Event: When speech starts
utterance.onstart = () => setIsSpeaking(true);
// Event: When speech ends
utterance.onend = () => {
// Only set speaking to false if there are no more pending utterances
if (window.speechSynthesis.pending === false && window.speechSynthesis.speaking === false) {
setIsSpeaking(false);
}
};
// CRITICAL: Queue the utterance.
// The browser handles the queue automatically.
window.speechSynthesis.speak(utterance);
};
/**
* Pauses the audio playback.
*/
const pauseSpeech = () => {
if (window.speechSynthesis.speaking) {
window.speechSynthesis.pause();
setIsSpeaking(false);
}
};
/**
* Resumes the audio playback.
*/
const resumeSpeech = () => {
if (window.speechSynthesis.paused) {
window.speechSynthesis.resume();
setIsSpeaking(true);
}
};
/**
* Stops and clears the audio queue.
*/
const stopSpeech = () => {
window.speechSynthesis.cancel();
setIsSpeaking(false);
};
// Clean up on unmount
useEffect(() => {
return () => {
window.speechSynthesis.cancel();
};
}, []);
return (
<div style={{ padding: '20px', fontFamily: 'sans-serif', maxWidth: '600px', margin: '0 auto' }}>
<h2>AI Talking Assistant</h2>
{/* Visual Output Area */}
<div style={{
border: '1px solid #ddd',
padding: '15px',
minHeight: '100px',
marginBottom: '20px',
borderRadius: '8px',
backgroundColor: '#f9f9f9'
}}>
<p style={{ color: '#333', lineHeight: '1.6' }}>
{displayText || <span style={{ color: '#999' }}>Waiting for stream...</span>}
</p>
</div>
{/* Controls */}
<div style={{ display: 'flex', gap: '10px', flexWrap: 'wrap' }}>
<button
onClick={simulateStream}
disabled={isSpeaking}
style={{
padding: '10px 20px',
backgroundColor: '#0070f3',
color: 'white',
border: 'none',
borderRadius: '5px',
cursor: 'pointer'
}}
>
Start Stream
</button>
<button
onClick={pauseSpeech}
disabled={!isSpeaking}
style={{ padding: '10px 20px', backgroundColor: '#f59e0b', color: 'white', border: 'none', borderRadius: '5px', cursor: 'pointer' }}
>
Pause
</button>
<button
onClick={resumeSpeech}
style={{ padding: '10px 20px', backgroundColor: '#10b981', color: 'white', border: 'none', borderRadius: '5px', cursor: 'pointer' }}
>
Resume
</button>
<button
onClick={stopSpeech}
style={{ padding: '10px 20px', backgroundColor: '#ef4444', color: 'white', border: 'none', borderRadius: '5px', cursor: 'pointer' }}
>
Stop
</button>
</div>
<div style={{ marginTop: '20px', fontSize: '0.85rem', color: '#666' }}>
Status: {isSpeaking ? 'Speaking...' : 'Idle'}
</div>
</div>
);
}
Line-by-Line Explanation
1. Client Directive and Imports
* 'use client': This is a Next.js 13+ directive. It marks this component as a Client Component, meaning it executes in the browser. This is mandatory because the Web Speech API (window.speechSynthesis) is a browser-only API and does not exist in the Node.js server environment.
* Hooks: We import standard React hooks. useRef is particularly important here to maintain a reference to the speech synthesis instance without triggering re-renders.
2. State Management
const [displayText, setDisplayText] = useState<string>('');
const [isSpeaking, setIsSpeaking] = useState<boolean>(false);
const utteranceRef = useRef<SpeechSynthesisUtterance | null>(null);
const textBufferRef = useRef<string>('');
* displayText: Stores the accumulated text tokens for visual rendering. While the audio plays, the user sees the text appearing word-by-word.
* isSpeaking: A boolean flag used to toggle UI controls (e.g., disabling the "Start" button while audio is active).
* utteranceRef: Holds the current SpeechSynthesisUtterance object. This allows us to interact with the specific speech instance (pause/resume) if needed, though the global window.speechSynthesis queue is often sufficient.
* textBufferRef: In a high-speed stream, tokens might arrive faster than the speech synthesis can process them. We use this ref to accumulate text internally before passing it to the synthesizer, or to track the total history.
3. The Stream Simulation (Mocking the Backend)
const simulateStream = async (): Promise<void> => {
const mockTokens = [ ... ];
for (const token of mockTokens) {
await new Promise(resolve => setTimeout(resolve, 300));
setDisplayText(prev => prev + token);
speakText(token);
}
};
In production, this function is replaced by the useCompletion hook (or another streaming helper) from the Vercel AI SDK. Since we need a standalone example, we simulate the server response.
* Loop: We iterate through an array of strings (tokens).
* Latency: setTimeout simulates network delay between receiving tokens from the LLM.
* Dual Updates: For every token, we update the visual state (setDisplayText) and trigger the audio (speakText). This ensures the audio and visual outputs stay synchronized.
4. The Speech Synthesis Logic
const speakText = (text: string) => {
if (!window || !window.speechSynthesis) { ... }
const utterance = new SpeechSynthesisUtterance(text);
// ... voice selection ...
utterance.onstart = () => setIsSpeaking(true);
utterance.onend = () => { /* check queue */ };
window.speechSynthesis.speak(utterance);
};
* SpeechSynthesisUtterance: This is the object that represents a speech request. We create a new instance for each token (word or phrase).
* Voice Selection: window.speechSynthesis.getVoices() returns a list of installed voices. We filter for a preferred voice. Note: Voices often load asynchronously, so you might need an onvoiceschanged event listener in a production app.
* The Queue: window.speechSynthesis.speak(utterance) adds the utterance to the browser's internal queue. The browser automatically plays them sequentially.
* Event Handlers:
* onstart: Updates UI to "Speaking".
* onend: Checks if the queue is empty. If the browser is no longer speaking and nothing is pending, we set isSpeaking to false.
5. Controls and Cleanup
const pauseSpeech = () => window.speechSynthesis.pause();
const resumeSpeech = () => window.speechSynthesis.resume();
const stopSpeech = () => window.speechSynthesis.cancel();
useEffect(() => {
return () => window.speechSynthesis.cancel();
}, []);
* The Singleton Queue: The speechSynthesis object is a singleton. Calling pause() pauses the entire queue; cancel() clears the queue entirely.
* Cleanup: The useEffect cleanup function ensures that if the user navigates away from the component, the audio stops immediately, preventing "ghost audio" playing in the background.
Visualizing the Data Flow
The following diagram illustrates how tokens flow from the AI model to the user's ears and eyes.
+----------------+ +------------------+ +-------------------+
| Next.js Server| | Client Browser | | User Perception |
| (RSC/Action) | | (React State) | | |
+----------------+ +------------------+ +-------------------+
| | |
| 1. LLM Generates | |
| "Hello" | |
|----------------------->| |
| | |
| 2. Stream Token | |
| "Hello" | |
|----------------------->| |
| | |
| | 3. Update Visual State |
| | (displayText) |
| |--------------------------->| (Sees "Hello")
| | |
| | 4. Create Utterance |
| | "Hello" |
| |--------------------------->| (Hears "Hello")
| | |
| 5. Next Token "World" | |
|----------------------->| |
| | (Repeat steps 3 & 4) |
+------------------------+----------------------------+
Common Pitfalls
When implementing text-to-speech in a web application, especially with streaming data, developers often encounter the following issues:
- Voice Loading Race Conditions
  - Issue: window.speechSynthesis.getVoices() returns an empty array immediately after page load. Voices load asynchronously in the background.
  - Fix: Attach an event listener for speechSynthesis.onvoiceschanged and only attempt to select a voice after this event fires.
- Token Fragmentation (Stuttering)
  - Issue: If the LLM streams character-by-character (e.g., "H", "e", "l", "l", "o") and you create a new Utterance for every character, the speech sounds robotic and choppy.
  - Fix: Buffer tokens. Accumulate text in a ref and only trigger speech when a sentence boundary (period, comma) is detected or a time threshold (e.g., 200ms) has passed without new tokens.
- Vercel/AI SDK Timeouts
  - Issue: If the audio synthesis takes longer than the serverless function timeout (e.g., 10s on Vercel Hobby), the connection might close before the audio finishes generating.
  - Fix: The Web Speech API runs entirely on the client; the server only streams text. The serverless function's execution time therefore depends only on how fast the LLM generates text, not how fast the audio plays. Still, ensure your stream does not hang waiting for the audio to finish.
- Mobile Browser Restrictions
  - Issue: iOS Safari and Chrome for Android often block window.speechSynthesis.speak() from executing automatically; it requires a direct user interaction (click) to initialize.
  - Fix: Ensure the "Start" button is the first thing the user clicks. Do not attempt to auto-play speech on component mount.
- Memory Leaks with Utterances
  - Issue: Creating thousands of SpeechSynthesisUtterance objects in a long stream without clearing the queue can consume memory.
  - Fix: Call window.speechSynthesis.cancel() when the user pauses or stops the stream to purge the queue.
The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.
Code License: All code examples are released under the MIT License. Github repo.
Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.
All textual explanations, original diagrams, and illustrations are the intellectual property of the author. Copying, redistribution, or reproduction is strictly prohibited.