Chapter 11: LLM Observability - LangSmith & Helicone
Theoretical Foundations
Imagine running a complex, high-performance local LLM inference server—like the ones we built using Ollama and WebGPU in the previous chapters. You have a model loading into VRAM, a tokenizer converting text into tensor-friendly IDs, and a transformer architecture performing matrix multiplications at blistering speeds. But once the model starts generating responses, you enter a "black box" scenario. You see the output text, but you have no insight into the process that created it. How long did the prompt processing take versus the token generation? How many tokens were consumed? Did the model hallucinate? Did an agent chain fail silently?
This is the problem LLM Observability solves.
In traditional software engineering, observability is the ability to understand the internal state of a system by examining its external outputs. For local LLMs, this is not just about logging errors; it is about instrumenting the entire lifecycle of a tensor's journey from input to output. We treat the local model not as a monolithic block, but as a distributed system of components: the Tokenizer, the Inference Engine (Ollama/WebGPU), and the Post-processor.
To understand this, we must look back at Book 5, Chapter 9, where we discussed WebGPU Shaders and Memory Management. We learned that moving data between the CPU (System RAM) and the GPU (VRAM) is the primary bottleneck. Observability tools like LangSmith and Helicone act as the telemetry dashboard for this memory pipeline. They don't just measure time; they measure the cost of movement and the efficiency of computation.
The Analogy: The Local LLM as a Microservice Architecture
In web development, we often build applications using Microservices. An API Gateway routes a request to a User Service, which then calls a Payment Service, and finally a Notification Service. If the user complains that "the checkout is slow," you cannot simply look at the final "Success" message. You need Distributed Tracing (like Jaeger or OpenTelemetry) to see exactly how long the request spent in the Payment Service versus the Notification Service.
LLM Observability applies this exact paradigm to local model inference:
- The API Gateway (The Prompt): This is the entry point. It receives the user's query. In observability terms, we need to track the Input Token Count here. Just as an API gateway checks payload size to reject oversized requests, we monitor prompt length to estimate VRAM usage.
- The Microservices (The Transformer Layers): As the prompt moves through the model's layers (the self-attention and feed-forward networks), it transforms. In a distributed system, we trace the request ID across service boundaries. In an LLM, we trace the Inference Step.
- The Database (The KV Cache): In Chapter 9, we discussed the Key-Value (KV) Cache—the memory buffer that stores previous computations to avoid redundant work. In a web app, this is like a Redis cache. Observability must track the cache hit rate or, in LLM terms, the context window utilization. If the cache grows too large, performance degrades (just like a bloated Redis instance).
- The Load Balancer (The Scheduler): When running local models, especially with WebGPU, the scheduler decides which operations run on the GPU and which wait in the queue. Observability here measures Queue Latency—the time a request waits before the GPU actually starts processing it.
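The scheduler bullet above can be sketched as a tiny in-process queue that records both wait time and execution time. Everything here is illustrative (a one-slot scheduler standing in for the GPU queue), not part of any real WebGPU API:

```typescript
// A one-slot scheduler that records how long each job waited in the
// queue before "GPU" execution began. All names are illustrative.
type JobMetrics = { queuedMs: number; execMs: number };

class SingleSlotScheduler {
  private tail: Promise<void> = Promise.resolve();

  submit<T>(job: () => Promise<T>): Promise<{ result: T; metrics: JobMetrics }> {
    const enqueuedAt = performance.now();
    const run = this.tail.then(async () => {
      const startedAt = performance.now();
      const result = await job();
      const endedAt = performance.now();
      return {
        result,
        metrics: {
          queuedMs: startedAt - enqueuedAt, // time spent waiting behind other jobs
          execMs: endedAt - startedAt,      // time actually computing
        },
      };
    });
    this.tail = run.then(() => undefined, () => undefined);
    return run;
  }
}

// Two jobs submitted back to back: the second one's queuedMs reflects
// roughly the first one's execution time.
const sched = new SingleSlotScheduler();
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));
const jobA = sched.submit(async () => { await sleep(50); return 'A'; });
const jobB = sched.submit(async () => { await sleep(10); return 'B'; });
Promise.all([jobA, jobB]).then(([a, b]) => {
  console.log(a.metrics.queuedMs.toFixed(0), b.metrics.queuedMs.toFixed(0));
});
```

High queuedMs with low execMs is the signature of a saturated queue rather than a slow model, which is exactly the distinction Queue Latency exists to surface.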
Without observability, optimizing a local LLM is like trying to tune a race car engine while blindfolded. You might increase the batch size (processing multiple requests at once) to improve throughput, but without metrics, you won't know that you've spiked the VRAM usage and caused out-of-memory (OOM) errors on specific hardware configurations.
The "What": Key Metrics in Local LLM Observability
When we implement observability using tools like LangSmith or Helicone (even locally), we are capturing three specific categories of data. Let's break these down with the rigor of a systems architect.
1. Latency: The Perception of Speed
Latency in LLMs is non-linear. It is not a single number but a composite of distinct phases. We must differentiate between:
- Time to First Token (TTFT): The duration from the moment the prompt is sent to the moment the first generated token is received. This is dominated by Prompt Processing (tokenization + processing the initial context).
  - Under the Hood: This is where the KV Cache is built. If you are using a local model with a large context window (e.g., 32k tokens), the matrix multiplication required to process that initial prompt is massive. WebGPU helps here, but the latency is still bound by the compute shader's dispatch time.
- Inter-Token Latency (ITL): The time between consecutive generated tokens. This represents the "streaming" speed.
  - Under the Hood: This is bound by the Autoregressive Loop. The model generates token \(T\), appends it to the context, and runs the forward pass for token \(T+1\). On local hardware, this is heavily influenced by memory bandwidth (moving the KV cache) rather than raw compute.
- Total Generation Time: The sum of TTFT and (ITL \(\times\) Output Tokens).
Analogy: Think of a video streaming service (like Netflix). TTFT is the time it takes to start the video after you press play (buffering the initial segment). ITL is the buffering that occurs if your internet connection slows down. If you only measure "Total Time," you won't know if the slowness is due to the initial load (network) or the playback (bandwidth).
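All three phases can be derived from a stream of per-token arrival timestamps. A minimal sketch, where the timestamps are hypothetical data rather than real Ollama output:

```typescript
// Deriving TTFT, average ITL, and total generation time from per-token
// arrival timestamps (all values are milliseconds since an arbitrary epoch).
type LatencyBreakdown = {
  ttftMs: number;   // Time to First Token
  avgItlMs: number; // mean Inter-Token Latency
  totalMs: number;  // total generation time
};

function analyzeLatency(requestStartMs: number, tokenTimestampsMs: number[]): LatencyBreakdown {
  if (tokenTimestampsMs.length === 0) {
    throw new Error('No tokens were generated');
  }
  // TTFT: gap between sending the prompt and receiving the first token.
  const ttftMs = tokenTimestampsMs[0] - requestStartMs;
  // ITL: gaps between consecutive tokens.
  const gaps: number[] = [];
  for (let i = 1; i < tokenTimestampsMs.length; i++) {
    gaps.push(tokenTimestampsMs[i] - tokenTimestampsMs[i - 1]);
  }
  const avgItlMs = gaps.length > 0
    ? gaps.reduce((a, b) => a + b, 0) / gaps.length
    : 0;
  const totalMs = tokenTimestampsMs[tokenTimestampsMs.length - 1] - requestStartMs;
  return { ttftMs, avgItlMs, totalMs };
}

// Example: request at t=0, tokens arrive at 500, 550, 600, 650 ms.
// TTFT = 500 ms (prompt processing), avg ITL = 50 ms, total = 650 ms.
const breakdown = analyzeLatency(0, [500, 550, 600, 650]);
console.log(breakdown);
```

A large TTFT with a small ITL points at prompt processing (the initial load), while a small TTFT with a large ITL points at the autoregressive loop (bandwidth), mirroring the streaming analogy above.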
2. Token Usage: The Currency of LLMs
Even on local hardware, tokens represent compute cycles and electricity.
- Input Tokens: The length of the prompt.
- Output Tokens: The length of the completion.
- Total Tokens: The sum, which often dictates the size of the KV Cache.
Why this matters locally: In Chapter 9, we optimized memory by quantizing models (e.g., 4-bit vs. 8-bit). A 7B parameter model in 4-bit uses roughly 3.5GB of VRAM. However, if you feed it a 10,000-token context, the KV Cache can consume an additional 2GB+ of VRAM. Observability tools track this relationship. If you see a spike in VRAM usage, correlating it with "Input Tokens" helps you identify if the model is choking on large contexts.
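The token-count-to-VRAM relationship can be approximated with a back-of-the-envelope formula. The sketch below uses illustrative Llama-2-7B-style dimensions (32 layers, 32 KV heads, head dimension 128), which are assumptions for the sake of the example, not measured values:

```typescript
// Rough KV-cache size estimate for a decoder-only transformer:
// bytes ≈ 2 (K and V) × layers × tokens × numKvHeads × headDim × bytesPerElement
function estimateKvCacheBytes(opts: {
  numLayers: number;
  numKvHeads: number;
  headDim: number;
  contextTokens: number;
  bytesPerElement: number; // 2 for fp16, 1 for int8-quantized cache
}): number {
  const { numLayers, numKvHeads, headDim, contextTokens, bytesPerElement } = opts;
  return 2 * numLayers * contextTokens * numKvHeads * headDim * bytesPerElement;
}

// A 7B-class model with an fp16 cache and a 10,000-token context:
const bytes = estimateKvCacheBytes({
  numLayers: 32,
  numKvHeads: 32,
  headDim: 128,
  contextTokens: 10_000,
  bytesPerElement: 2,
});
console.log(`${(bytes / 1024 ** 3).toFixed(2)} GiB`); // → "4.88 GiB"
```

Correlating this estimate with the observed "Input Tokens" metric is how an observability dashboard turns a mysterious VRAM spike into an explainable one.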
3. Token Probability and Logits: The Model's Confidence
Observability isn't just about performance; it's about quality. Tools like LangSmith capture the logits (raw output scores before softmax) for every generated token.
- Perplexity: A measure of how "surprised" the model is by the correct next token. High perplexity indicates the model is guessing.
- Entropy: The randomness of the distribution. High entropy means the model is unsure, often leading to hallucinations.
Analogy: This is like a spell-checker that doesn't just underline words in red but shows you the probability curve of why it chose "receive" over "recieve." If the probability is 51% vs 49%, the model is uncertain. Observability highlights these "weak" tokens, allowing you to adjust sampling parameters (like Temperature or Top-P) dynamically.
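Flagging such "weak" tokens boils down to computing the Shannon entropy of each token's probability distribution. A minimal sketch:

```typescript
// Shannon entropy in bits: H = -Σ p·log2(p).
// High entropy ⇒ the probability mass is spread out ⇒ the model is unsure.
function entropyBits(probs: number[]): number {
  return -probs
    .filter((p) => p > 0) // log2(0) is undefined; zero-probability terms contribute nothing
    .reduce((sum, p) => sum + p * Math.log2(p), 0);
}

// A confident token: 95% of the mass on one candidate.
const confident = entropyBits([0.95, 0.03, 0.02]);
// The 51% vs 49% spell-checker case from the analogy: nearly a coin flip.
const uncertain = entropyBits([0.51, 0.49]);

console.log(confident.toFixed(3)); // low entropy
console.log(uncertain.toFixed(3)); // ≈ 1 bit — the model is effectively guessing
```

An observability layer can threshold this value per token and, for example, lower Temperature or tighten Top-P when too many high-entropy tokens appear in a row.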
The "Why": Debugging Agentic Workflows on Local Hardware
In Book 5, Chapter 10, we introduced Agents—programs that use LLMs to reason and act (e.g., calling a calculator or searching a local vector database). An agent is essentially a loop: Think -> Act -> Observe.
When these agents run locally, they become exercises in Exhaustive Asynchronous Resilience. An agent might trigger a WebAssembly (WASM) function to perform a calculation, wait for the result, and feed it back into the LLM. If any step fails, the chain breaks.
The Problem: Without observability, an agent failure looks like a generic "Generation Failed" error. Was the failure in the WASM calculation? Did the LLM refuse to output a valid JSON schema for the tool call? Did the local server run out of memory?
The Solution (Tracing): Tracing visualizes the execution path of an agent. In a web development analogy, this is Request Tracing across microservices. In an LLM agent, a trace looks like this:
- Root Span: Agent receives query.
- Child Span 1: LLM generates a "Tool Call" request.
- Child Span 2: The system invokes a local WASM tool (e.g., a math solver).
- Child Span 3: The tool returns a result.
- Child Span 4: The LLM processes the result and generates the final answer.
If the agent hangs or fails, the trace shows exactly which span is "open" (hanging) or which span returned an error. This is critical for local development because local environments are less stable than cloud environments—background processes, GPU driver timeouts, and memory fragmentation are common.
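A trace like the one above can be modeled as a tree of spans. The sketch below is a minimal, illustrative recorder, not the LangSmith API; the span names mirror the agent example:

```typescript
// A span is a named, timed unit of work; a trace is a tree of spans.
type Span = {
  name: string;
  startMs: number;
  endMs?: number;  // undefined ⇒ the span is still "open" (hanging)
  error?: string;
  children: Span[];
};

function startSpan(name: string, parent?: Span): Span {
  const span: Span = { name, startMs: performance.now(), children: [] };
  parent?.children.push(span);
  return span;
}

function endSpan(span: Span, error?: string): void {
  span.endMs = performance.now();
  if (error) span.error = error;
}

// Find spans that never closed or closed with an error — the first
// question we ask when an agent hangs or fails.
function findProblemSpans(span: Span, out: Span[] = []): Span[] {
  if (span.endMs === undefined || span.error) out.push(span);
  span.children.forEach((c) => findProblemSpans(c, out));
  return out;
}

// The agent trace from the list above:
const root = startSpan('agent:handle_query');
const toolCall = startSpan('llm:tool_call', root);
endSpan(toolCall); // Child Span 1 succeeded
const wasm = startSpan('tool:wasm_math_solver', root);
endSpan(wasm, 'WASM module timed out'); // Child Span 2 failed
// Root span is intentionally left open — the agent appears to hang.

console.log(findProblemSpans(root).map((s) => s.name));
```

Running this prints the open root span and the errored WASM span, which is exactly the diagnostic a tracing UI surfaces visually.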
The "How": Integrating Type Guards and Resilience
When building the observability layer in TypeScript (as we do for web-based local LLMs), we must enforce Type Guards and Exhaustive Asynchronous Resilience.
Type Guards in Observability Data
Observability data is heterogeneous. A "Log" event looks different from a "Trace" event. A "Metric" event is just a number. When sending data to LangSmith or Helicone, we must ensure the payload matches the expected schema.
Consider the ObservabilityEvent type. We use Type Guards to narrow the union type before serialization. This prevents runtime errors when the observability backend rejects malformed JSON.
// Theoretical Type Definitions for Observability Events
type BaseEvent = {
  timestamp: Date;
  runId: string;
};

type LLMCallEvent = BaseEvent & {
  type: 'llm_call';
  prompt: string;
  model: string;
};

type MetricEvent = BaseEvent & {
  type: 'metric';
  name: 'latency' | 'token_count';
  value: number;
};

type ObservabilityEvent = LLMCallEvent | MetricEvent;

// A Type Guard function
function isLLMCallEvent(event: ObservabilityEvent): event is LLMCallEvent {
  return event.type === 'llm_call';
}

// Usage in the observability pipeline
function logEvent(event: ObservabilityEvent) {
  // We cannot access 'prompt' safely without narrowing the union first.
  // An inline check like `if (event.type === 'llm_call')` also narrows,
  // but the named Type Guard makes the check reusable and self-documenting.
  if (isLLMCallEvent(event)) {
    // Inside this block, TypeScript knows 'event' is LLMCallEvent
    console.log(`Sending prompt to LangSmith: ${event.prompt}`);
    sendToBackend(event);
  }
}
Exhaustive Asynchronous Resilience
When instrumenting local LLMs, we are dealing with async/await operations that interact with hardware. A WebGPU inference call is a Promise. A file write for logs is a Promise. If the GPU driver crashes, the Promise rejects.
Exhaustive Resilience mandates that we never let an async error go unhandled, and we always clean up resources.
Analogy: Imagine a bank vault (the GPU memory). You request to open it (async call). If the request fails (power outage), you must ensure the door is locked again (finally block) and the security guard is notified (catch block). If you don't, the vault remains vulnerable (memory leak) or the system state becomes inconsistent.
// Theoretical Resilient Observability Wrapper
async function performInferenceWithObservability(
  prompt: string,
  model: LocalModel
): Promise<string> {
  const startTime = performance.now();
  let tokensUsed = 0;
  let succeeded = false;

  try {
    // The critical async operation
    const result = await model.generate(prompt);
    tokensUsed = result.tokenCount;
    succeeded = true;
    return result.text;
  } catch (error) {
    // Mandatory error handling: log the specific failure context.
    // This prevents silent failures in agentic loops.
    const errorMessage = error instanceof Error ? error.message : 'Unknown inference error';
    await logToObservabilityTool({
      level: 'ERROR',
      message: `Inference failed: ${errorMessage}`,
      prompt
    });
    // Re-throw to allow the agent loop to handle the failure (e.g., retry logic)
    throw new Error(`Inference Error: ${errorMessage}`);
  } finally {
    // Mandatory cleanup: runs regardless of success or failure.
    // Release GPU locks or temporary VRAM buffers here.
    const endTime = performance.now();
    const latency = endTime - startTime;
    // Send metrics even if the inference failed (partial data is valuable)
    await sendMetrics({
      latency,
      tokensUsed,
      status: succeeded ? 'completed' : 'failed'
    });
  }
}
Visualizing the Observability Pipeline
To visualize how data flows from a local WebGPU inference to an observability tool like Helicone, we can map the architecture. This pipeline ensures that even though the model runs locally, the insights are centralized.
Theoretical Foundations: Recap
In this section, we established that LLM Observability is the translation of distributed systems monitoring to the domain of local model inference. We utilized the analogy of Microservices to explain how an LLM pipeline consists of distinct, measurable components (Tokenizer, Inference Engine, KV Cache).
We defined the critical metrics—Latency (TTFT/ITL), Token Usage, and Logit Probabilities—and explained their physical implications on local hardware (VRAM bandwidth, compute cycles). Finally, we grounded the implementation in TypeScript best practices, specifically using Type Guards to ensure data integrity during serialization and Exhaustive Asynchronous Resilience (try/catch/finally) to maintain system stability during hardware-level operations. This theoretical framework is the prerequisite for implementing the practical monitoring tools discussed in the subsequent sections.
Basic Code Example
This example demonstrates a minimal, self-contained Node.js TypeScript application that interacts with a local LLM (via Ollama) while integrating with Helicone for observability. We will build a simple "Chat Service" that adheres to the Single Responsibility Principle (SRP) by separating the core LLM inference logic from the observability layer.
Scenario: A SaaS backend endpoint that accepts a user prompt, sends it to a local model, and captures the latency and token usage for analysis.
Prerequisites:
1. Ollama running locally (e.g., ollama serve).
2. Node.js installed (v18+ recommended).
3. A Helicone API Key (available at helicone.ai).
The Code
/**
* @fileoverview Basic Observability Example for Local LLM using Helicone.
*
* This script demonstrates:
* 1. Separation of concerns (SRP) using dedicated modules.
* 2. Type-safe inference using TypeScript's type inference.
* 3. Observability integration via a proxy (Helicone).
* 4. Function calling schema definition.
*/
import { createClient } from '@helicone/helicone'; // Assuming a hypothetical Helicone TS SDK
import { setTimeout } from 'timers/promises';
// ==========================================
// 1. Domain Definitions & Function Calling Schema
// ==========================================
/**
* Represents the structure of a chat message.
* TypeScript Type Inference infers the shape automatically when we use these objects.
*/
type ChatMessage = {
role: 'system' | 'user' | 'assistant';
content: string;
};
/**
* Function Calling Schema (JSON Schema).
* Defines an external tool the LLM can invoke.
* In a real app, this would be passed to the LLM's `tools` parameter.
*/
const weatherToolSchema = {
name: 'get_current_weather',
description: 'Get the current weather in a given location',
parameters: {
type: 'object',
properties: {
location: {
type: 'string',
description: 'The city and state, e.g. San Francisco, CA',
},
unit: { type: 'string', enum: ['celsius', 'fahrenheit'] },
},
required: ['location'],
},
};
// ==========================================
// 2. Observability Module (SRP: Logging & Tracing)
// ==========================================
/**
* @class ObservabilityManager
* @description Handles logging and tracing for LLM interactions.
* Strictly separated from inference logic (SRP).
*/
class ObservabilityManager {
private heliconeClient: any; // Typing simplified for brevity
private startTime: number | null = null;
constructor(apiKey: string) {
// In a real scenario, we would initialize the Helicone client here.
// For this example, we simulate the logging output.
console.log(`[Observability] Initialized with API Key: ${apiKey.substring(0, 5)}...`);
}
/**
* Starts a timing trace for latency measurement.
*/
startTrace(): void {
this.startTime = performance.now();
console.log(`[Observability] Trace started at ${new Date().toISOString()}`);
}
/**
* Ends the trace and calculates latency.
* @param responseTokens - The number of tokens generated by the LLM.
*/
endTrace(responseTokens: number): void {
if (!this.startTime) {
console.warn('[Observability] Trace ended without a start time.');
return;
}
const endTime = performance.now();
const latencyMs = endTime - this.startTime;
// Simulate sending metrics to Helicone dashboard
console.log(`[Observability] --- TRACE REPORT ---`);
console.log(`[Observability] Latency: ${latencyMs.toFixed(2)}ms`);
console.log(`[Observability] Output Tokens: ${responseTokens}`);
console.log(`[Observability] Estimated Cost: $${(responseTokens * 0.00003).toFixed(6)}`); // Mock pricing
console.log(`[Observability] -------------------`);
}
/**
* Logs specific events (e.g., errors, tool calls).
*/
logEvent(event: string, metadata: Record<string, any>): void {
console.log(`[Observability] Event: ${event}`, JSON.stringify(metadata));
}
}
// ==========================================
// 3. Inference Module (SRP: Communicating with LLM)
// ==========================================
/**
* @class LocalLLMService
* @description Handles direct communication with the local Ollama instance.
* Does not know about the business logic or external dashboards.
*/
class LocalLLMService {
private baseUrl: string;
constructor(modelName: string = 'llama2') {
this.baseUrl = `http://localhost:11434/api/chat`;
console.log(`[LLM Service] Configured for model: ${modelName}`);
}
/**
* Sends a prompt to the local Ollama API.
* Note: In a production app, we would wrap this request with the Helicone proxy URL.
*
* @param messages - Array of chat messages.
* @returns The raw text response.
*/
async generate(messages: ChatMessage[]): Promise<{ content: string; tokenCount: number }> {
try {
// Simulate network latency to Ollama
await setTimeout(100);
// --- REAL IMPLEMENTATION WOULD LOOK LIKE THIS ---
// const response = await fetch(this.baseUrl, {
// method: 'POST',
// headers: { 'Content-Type': 'application/json' },
// body: JSON.stringify({
// model: 'llama2',
// messages: messages,
// stream: false
// })
// });
// const data = await response.json();
// return { content: data.message.content, tokenCount: data.eval_count };
// ------------------------------------------------
// SIMULATION FOR EXAMPLE:
const mockResponse = "Hello! I am your local LLM. I see you are asking about observability.";
return {
content: mockResponse,
tokenCount: mockResponse.split(' ').length // Rough estimate
};
} catch (error) {
console.error('[LLM Service] Error contacting local Ollama:', error);
throw error;
}
}
}
// ==========================================
// 4. Main Application Logic (Orchestrator)
// ==========================================
/**
* Main entry point for the SaaS Chat API.
* Orchestrates the observability and inference services.
*/
async function main() {
// Configuration (In a real app, use environment variables)
const HELICONE_API_KEY = process.env.HELICONE_API_KEY || 'sk-helicone-test-key';
// 1. Initialize Services (Dependency Injection)
const observability = new ObservabilityManager(HELICONE_API_KEY);
const llmService = new LocalLLMService('llama2');
// 2. Prepare Input
const userPrompt: ChatMessage = { role: 'user', content: 'Tell me a short hello world.' };
const conversationHistory: ChatMessage[] = [
{ role: 'system', content: 'You are a helpful assistant.' },
userPrompt
];
console.log(`\n[App] Received prompt: "${userPrompt.content}"\n`);
// 3. Start Observability Trace
observability.startTrace();
try {
// 4. Execute Inference
const result = await llmService.generate(conversationHistory);
// 5. End Observability Trace
observability.endTrace(result.tokenCount);
// 6. Return Response (Simulating API Response)
console.log(`\n[App] Final Response: "${result.content}"`);
} catch (error) {
observability.logEvent('LLM_Error', { error: String(error) });
console.error('[App] Request failed.');
}
}
// Execute the main function
if (require.main === module) {
main();
}
Line-by-Line Explanation
1. Domain Definitions & Function Calling Schema
- type ChatMessage: We define a TypeScript type for our data. TypeScript's Type Inference allows us to use this type throughout the code without manually annotating every variable, but defining it ensures strict validation of the data structure (role must be one of the specific string literals).
- weatherToolSchema: This object defines a Function Calling Schema. It tells the LLM exactly what parameters are needed for a specific tool (e.g., get_current_weather). In a production environment, this JSON schema is passed to the model so it can decide when to invoke external APIs.
2. Observability Module (SRP)
- class ObservabilityManager: This class embodies the Single Responsibility Principle. Its only job is to measure and log performance. It does not know how to fetch data from the LLM.
- startTrace(): Captures the current high-resolution timestamp (performance.now()). This is the "Start" marker for latency calculation.
- endTrace(): Captures the "End" marker, calculates the difference (latency), and formats a log report. In a real Helicone integration, this is where the heliconeClient.log() method would be called to send data to the cloud dashboard.
- logEvent(): A generic method for capturing non-performance data, such as errors or specific tool invocations.
3. Inference Module (SRP)
- class LocalLLMService: This class has a single responsibility: communicating with the LLM. It abstracts the underlying API (Ollama).
- generate():
  - It accepts a structured array of messages.
  - Simulation: Since we cannot run a live Ollama instance in this text environment, we use await setTimeout(100) to simulate network latency and return a mock response.
  - Real-World Note: To integrate Helicone with a local model, you typically route requests through the Helicone proxy URL (e.g., https://oai.helicone.ai/v1/chat/completions) or inject the Helicone headers into the local request if using a local proxy setup.
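To make the Real-World Note concrete, here is a hedged sketch of building such a proxied request. The URL and the Helicone-Auth header follow Helicone's proxy convention, but treat the exact values as assumptions to verify against Helicone's current documentation; the function name and keys are illustrative:

```typescript
// Building (not yet sending) a Helicone-proxied request for an
// OpenAI-compatible chat endpoint. Keys shown are placeholders.
function buildHeliconeRequest(modelApiKey: string, heliconeKey: string, body: unknown) {
  return {
    url: 'https://oai.helicone.ai/v1/chat/completions',
    init: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${modelApiKey}`,   // upstream model provider key
        'Helicone-Auth': `Bearer ${heliconeKey}`, // tells the proxy who to log for
      },
      body: JSON.stringify(body),
    },
  };
}

const req = buildHeliconeRequest('sk-model-key', 'sk-helicone-key', {
  model: 'gpt-4o-mini',
  messages: [{ role: 'user', content: 'ping' }],
});
// Later, in real code: await fetch(req.url, req.init);
console.log(req.init.headers['Helicone-Auth']);
```

Separating request construction from dispatch like this also makes the observability headers trivially unit-testable without touching the network.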
4. Main Application Logic
- Dependency Injection: We instantiate the ObservabilityManager and LocalLLMService separately. This loose coupling allows us to swap out the observability provider (e.g., to LangSmith) without changing the inference code.
- main() Flow:
  - Start Trace: We mark the beginning of the operation.
  - Execute: We call llmService.generate(). This is an async operation.
  - End Trace: We calculate the duration as soon as the promise resolves.
  - Result: We log the final output to the console, simulating a web server returning JSON to a client.
Visualizing the Data Flow
The following diagram illustrates how data moves between the Application, the Observability layer, and the Local LLM.
Common Pitfalls
When implementing LLM observability in a TypeScript SaaS environment, watch out for these specific issues:
- Vercel/AWS Lambda Timeouts vs. LLM Latency
  - Issue: Serverless functions (like Vercel) often have strict timeouts (e.g., 10 seconds). Local LLMs can be slow, especially on CPU-only hardware.
  - Consequence: If the LLM takes 15 seconds to respond but the serverless timeout is 10s, the connection is severed. The observability tool might never receive the "End Trace" signal, resulting in incomplete data.
  - Fix: Use background processing (queues like BullMQ or Inngest) for long-running inference tasks. Do not await the LLM directly in the API route handler.
- Async/Await Loops in Node.js
  - Issue: While await is non-blocking for the event loop, a poorly written loop over large arrays of logs still hurts: awaiting each log write sequentially (for example, a for loop that awaits send(log) one entry at a time) serializes work that could run concurrently, and heavy synchronous processing inside the loop blocks the main thread.
  - Fix: Use Promise.all() for concurrent processing or stream APIs for handling large volumes of observability data.
- Hallucinated JSON / Schema Drift
  - Issue: When passing observability data to an LLM for evaluation (e.g., "Rate the quality of this response"), the LLM might return malformed JSON, crashing your parsing logic.
  - Fix: Always use Zod or Yup to validate incoming data (even from LLMs) before processing it.
- Type Inference Failures with External APIs
  - Issue: When fetching data from Ollama or Helicone, TypeScript might infer a type as any if you don't provide explicit generics to your HTTP client (like Axios or Fetch).
  - Fix: Always explicitly type the response interface and cast the result, or use a library like tRPC to ensure end-to-end type safety. Do not rely solely on implicit inference for network responses.
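As a dependency-free illustration of the "Hallucinated JSON" fix, the sketch below validates LLM-produced JSON with a hand-rolled guard before trusting it; in production you would express the same schema with Zod's safeParse as the pitfall suggests. The Evaluation shape is hypothetical:

```typescript
// Validate LLM-produced JSON before processing it. Never assume the
// model returned well-formed output matching your schema.
type Evaluation = { score: number; reason: string };

function parseEvaluation(raw: string): Evaluation | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // malformed JSON — a classic hallucination failure mode
  }
  if (
    typeof data === 'object' && data !== null &&
    typeof (data as Record<string, unknown>).score === 'number' &&
    typeof (data as Record<string, unknown>).reason === 'string'
  ) {
    return data as Evaluation;
  }
  return null; // valid JSON, wrong shape (schema drift)
}

console.log(parseEvaluation('{"score": 4, "reason": "accurate"}')); // parsed object
console.log(parseEvaluation('{"score": "four"}'));                  // null (drift)
console.log(parseEvaluation('Sure! Here is the JSON: {'));          // null (malformed)
```

Returning null instead of throwing lets the calling agent loop decide between a retry, a re-prompt, or a fallback, which keeps one bad generation from crashing the pipeline.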
The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.
Code License: All code examples are released under the MIT License. Github repo.
Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.