Chapter 18: Chunking Strategies for PDF/Text
Theoretical Foundations
The fundamental challenge in building robust Retrieval-Augmented Generation (RAG) systems lies not in the sophistication of the Large Language Model (LLM) used, but in the structural integrity of the data fed into it. LLMs possess a finite context window—a hard limit on the amount of text they can process in a single interaction. When dealing with unstructured documents like PDFs, which can span hundreds of pages, naive ingestion (dumping the entire document into the context) is impossible. Furthermore, LLMs exhibit degraded performance when the relevant information is buried in the middle of a long, cluttered context, a phenomenon known as the "Lost in the Middle" problem.
Chunking is the architectural discipline of decomposing large, unstructured text streams into discrete, semantically coherent units. This process is the bedrock of the "Indexing" phase in a RAG pipeline. Without effective chunking, retrieval mechanisms cannot locate precise answers; with poor chunking, the retrieved context may be too fragmented to provide a complete answer or too broad to fit within the model's token limit.
The Tokenization Barrier and Context Windows
To understand chunking, one must first understand the atomic unit of an LLM: the token. Modern tokenizers (like Byte-Pair Encoding) break text into sub-word units. A rough heuristic is that 100 tokens equate to approximately 75 words. When we process a PDF, we are converting a visual and structural document into a stream of tokens.
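The 100-tokens-per-75-words heuristic can be turned into a quick estimator. The helper below is our own sketch (the `TokenEstimator` name is not from any library) and is useful only for ballpark sizing before a real tokenizer is wired in.

```csharp
using System;

public static class TokenEstimator
{
    // Rough heuristic: ~100 tokens per 75 words, i.e. tokens ≈ words / 0.75.
    // This is an approximation only; real BPE tokenizers can differ significantly.
    public static int EstimateTokens(string text)
    {
        if (string.IsNullOrWhiteSpace(text)) return 0;
        int words = text.Split((char[]?)null, StringSplitOptions.RemoveEmptyEntries).Length;
        return (int)Math.Ceiling(words / 0.75);
    }
}
```

For example, a 750-word page estimates to roughly 1,000 tokens—enough accuracy to decide whether a document even fits a context window before paying for a precise count.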
Consider a 500-page technical manual. If we attempt to ingest this whole document, we might exceed the 128,000 token limit of a model like GPT-4o. Even if the document fits, the model's attention mechanism must sift through 500 pages to find a specific answer to "What is the torque specification for the alternator bolt?" This forces the model to dilute its focus across irrelevant chapters, increasing the likelihood of hallucination or omission.
Chunking acts as a pre-emptive filter. By splitting the manual into chapters, sections, or paragraphs, we create a library of smaller, indexed documents. When a user asks a question, we use a vector search to find only the relevant chunk (e.g., the "Alternator Installation" section) and feed only that chunk to the LLM. This ensures the model has 100% relevant context within its limited window.
The Analogy: The Encyclopedia and the Index Card
Imagine you must answer a single question—say, "What causes lightning?"—using a 30-volume encyclopedia.

* Naive Approach (No Chunking): You carry all 30 volumes to your desk and flip through every page of every book to find the answer. This is physically exhausting (computationally expensive) and slow (latency).
* Fixed-Size Chunking: You tear every page of the encyclopedia into strips of exactly 10 lines each. You stack these strips randomly. To find your answer, you read strip 1, strip 2, strip 3... This preserves the text but destroys the logical flow. The answer might be cut in half between two strips.
* Semantic Chunking (The Index Card System): You read the encyclopedia and extract key concepts. For the entry on "Lightning," you write a summary on an index card. You file this card alphabetically under "L". When asked the question, you go directly to the "L" drawer, pull out the "Lightning" card, and read the concise, complete summary.
In AI engineering, we want to build the Index Card System. We want chunks that are small enough to be read quickly but large enough to contain a complete thought.
The Spectrum of Granularity
1. Character Level: The most granular. Rarely used directly, but serves as the base for other methods.
2. Word Level: Splitting by whitespace. Simple, but often ignores semantic boundaries.
3. Sentence Level: Splitting by punctuation (periods, exclamation marks). Good for narrative text.
4. Paragraph/Section Level: Splitting by newlines or structural markers (headings). Ideal for technical documentation where a single concept is explained in a block.
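The four levels can be demonstrated with plain string operations, no library involved. The `GranularityDemo` helpers below are our own illustration; the naive sentence split, in particular, would need real work (abbreviations, decimals) in production.

```csharp
using System;
using System.Linq;

public static class GranularityDemo
{
    // Word level: split on any whitespace.
    public static int CountWords(string text) =>
        text.Split((char[]?)null, StringSplitOptions.RemoveEmptyEntries).Length;

    // Sentence level: naive split on terminal punctuation.
    // Real splitters must handle abbreviations ("Dr."), decimals ("3.14"), etc.
    public static int CountSentences(string text) =>
        text.Split(new[] { '.', '!', '?' }, StringSplitOptions.RemoveEmptyEntries)
            .Count(s => !string.IsNullOrWhiteSpace(s));

    // Paragraph level: split on blank lines.
    public static int CountParagraphs(string text) =>
        text.Split(new[] { "\n\n" }, StringSplitOptions.RemoveEmptyEntries).Length;
}
```

Running these on the same text yields very different unit counts, which is exactly the granularity trade-off: more units means finer retrieval but more fragmented context.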
In Microsoft Semantic Kernel, the static TextChunker class provides the tools to navigate this spectrum. It abstracts the raw string manipulation into high-level splitting methods.
Strategy 1: Fixed-Size Chunking
Fixed-size chunking is the "brute force" method. It splits text based purely on token or character count, regardless of semantic meaning.
The Mechanics: The algorithm iterates through the text, counting tokens. When it reaches the limit (e.g., 1000 tokens), it cuts the text. It usually includes an "overlap" mechanism. If we cut at token 1000, the next chunk might start at token 900. This ensures that a sentence split by the cut appears in both chunks, preserving context.
The Analogy: Imagine cutting a long roll of wallpaper into strips of exactly 12 inches. You don't care if you cut through the middle of a flower pattern; you just measure 12 inches and cut. To ensure the pattern isn't lost, you overlap the next strip by 2 inches over the previous one.
Advantages:
* Predictable Token Usage: You know exactly how many tokens will be sent to the model.
* Simplicity: No complex parsing logic required.
* Uniformity: All chunks are the same size, which can simplify database schema design.

Disadvantages:
* Context Fragmentation: A single logical idea might be split across two chunks.
* Noise: A chunk might contain a header, a footer, and a random page number, diluting the semantic value.
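Stripped of any library, fixed-size chunking with overlap is just a sliding window. The sketch below (our own `FixedSizeChunker`, not a library class) works in characters for simplicity; a token-based version would count tokens instead. The `step = size - overlap` line is where the wallpaper strips are made to overlap.

```csharp
using System;
using System.Collections.Generic;

public static class FixedSizeChunker
{
    // Splits text into windows of `size` characters, each starting
    // `size - overlap` characters after the previous one, so the last
    // `overlap` characters of chunk N reappear at the start of chunk N+1.
    public static List<string> Split(string text, int size, int overlap)
    {
        if (size <= 0 || overlap < 0 || overlap >= size)
            throw new ArgumentException("Require size > 0 and 0 <= overlap < size.");

        var chunks = new List<string>();
        int step = size - overlap;
        for (int start = 0; start < text.Length; start += step)
        {
            int length = Math.Min(size, text.Length - start);
            chunks.Add(text.Substring(start, length));
            if (start + length >= text.Length) break; // last window reached the end
        }
        return chunks;
    }
}
```

Splitting `"0123456789"` with size 4 and overlap 1 produces `"0123"`, `"3456"`, `"6789"`—note how the seam character is duplicated into the next chunk.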
Implementation Concept:
In Semantic Kernel, the closest built-in tools are the static TextChunker.SplitPlainTextLines and TextChunker.SplitPlainTextParagraphs methods; the paragraph splitter accepts a maximum token count per chunk and an overlap.
using Microsoft.SemanticKernel.Text;
// Conceptual usage of Fixed-Size Splitting
// We define a target size (e.g., 1000 tokens) and an overlap (e.g., 100 tokens)
// to mitigate the "seam" issue where context is cut off.
var text = "A very long document text...";
var lines = TextChunker.SplitPlainTextLines(text, maxTokensPerLine: 1000);
var chunks = TextChunker.SplitPlainTextParagraphs(
lines,
maxTokensPerParagraph: 1000,
overlapTokens: 100
);
Strategy 2: Recursive Chunking (Hierarchical)
Recursive chunking is a refinement of fixed-size chunking that attempts to respect structural boundaries before falling back to arbitrary cuts. It is often called "hierarchical" because it tries to split by larger units first, then smaller ones if the larger unit exceeds the size limit.
The Mechanics:
1. Try to split by double newlines (paragraphs).
2. If a paragraph is still too large (> max tokens), split it by single newlines (lines).
3. If a line is still too large, split by sentences.
4. If a sentence is still too large, split by words (fixed-size fallback).
The Analogy: Imagine packing a suitcase. You first try to pack whole outfits (paragraphs). If an outfit is too bulky, you separate the shirt and pants (sentences). If the shirt is still too big, you fold it smaller (words). You only cut the fabric (arbitrary split) if absolutely necessary.
Why This Matters: This strategy preserves the author's intent. In a PDF, a paragraph usually represents a coherent thought. Recursive chunking ensures that, whenever possible, a chunk contains complete paragraphs.
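The fallback mechanics can be sketched in plain C# with no library at all. `RecursiveChunker` below is our own simplified illustration—not the Semantic Kernel implementation—measuring size in words for readability and dropping separator characters for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RecursiveChunker
{
    // Separators tried in order: paragraphs, then lines, then sentences.
    private static readonly string[][] Separators =
    {
        new[] { "\n\n" },              // level 0: paragraphs
        new[] { "\n" },                // level 1: lines
        new[] { ". ", "! ", "? " },    // level 2: sentences (naive)
    };

    public static List<string> Split(string text, int maxWords, int level = 0)
    {
        // A piece that already fits is a chunk; no further splitting needed.
        if (CountWords(text) <= maxWords)
            return new List<string> { text };

        // All structural separators exhausted: fixed-size fallback by words.
        if (level >= Separators.Length)
        {
            var words = text.Split((char[]?)null, StringSplitOptions.RemoveEmptyEntries);
            return words.Chunk(maxWords).Select(w => string.Join(' ', w)).ToList();
        }

        // Split at the current structural level, then recurse one level deeper
        // into any piece that is still too large.
        var result = new List<string>();
        foreach (var piece in text.Split(Separators[level], StringSplitOptions.RemoveEmptyEntries)
                                  .Where(p => !string.IsNullOrWhiteSpace(p)))
        {
            result.AddRange(Split(piece, maxWords, level + 1));
        }
        return result;
    }

    private static int CountWords(string text) =>
        text.Split((char[]?)null, StringSplitOptions.RemoveEmptyEntries).Length;
}
```

The suitcase analogy maps directly: each recursion level is a gentler way to shrink the piece, and the word-level fallback is the "cut the fabric" last resort.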
Architectural Implication:
This is the default recommendation for general-purpose RAG. It balances the strictness of fixed-size limits with the fluidity of natural language structure. In Semantic Kernel, SplitPlainTextLines and SplitPlainTextParagraphs are the building blocks for this strategy.
Strategy 3: Semantic Chunking
This is the most advanced and computationally expensive strategy. Instead of splitting by character count or punctuation, semantic chunking splits based on the meaning of the text. It groups sentences that are semantically similar and splits them when the topic shifts.
The Mechanics:
1. Embedding Generation: The text is broken into sentences. Each sentence is converted into a vector embedding (a list of floating-point numbers representing semantic meaning).
2. Similarity Calculation: The algorithm calculates the cosine similarity between adjacent sentences.
3. Thresholding: If the similarity score drops below a threshold (e.g., 0.5), it indicates a shift in topic. A boundary is inserted there.
The Analogy: Imagine a radio broadcast. The DJ talks about a band, then plays a song, then talks about the weather. Semantic chunking is like a smart transcriber that inserts a chapter marker only when the topic changes from "Music" to "Weather," rather than inserting a marker every 30 seconds regardless of content.
Advantages:
* Maximum Coherence: Chunks contain highly related information.
* Retrieval Accuracy: The vector search finds these chunks easily because the chunk boundaries align with semantic clusters.

Disadvantages:
* Latency: Requires generating embeddings for every sentence during the indexing phase.
* Variable Size: Chunks can be very short (if the topic changes rapidly) or very long (if the text is monotonous), making token management harder.
Implementation Concept:
While Semantic Kernel provides text splitters, true semantic chunking often involves a custom pipeline, using an embedding service (e.g., one registered via the Microsoft.SemanticKernel.Connectors.OpenAI package) to generate embeddings for comparison.
using Microsoft.SemanticKernel.Embeddings;
using Microsoft.SemanticKernel.Text;
// Conceptual flow for Semantic Chunking
// 1. Split raw text into sentence-sized lines (structural split)
var sentences = TextChunker.SplitPlainTextLines(rawText, maxTokensPerLine: 50);
// 2. Generate embeddings for each sentence to capture meaning
// (This requires an ITextEmbeddingGenerationService)
var embeddings = await embeddingGenerator.GenerateEmbeddingsAsync(sentences);
// 3. Compare adjacent embeddings to detect topic shifts
// If similarity < threshold, create a new chunk boundary.
// This logic is custom and not directly in the core TextChunker,
// but relies on the same underlying principles.
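That custom comparison step can be sketched without any Semantic Kernel dependency. The vectors below are toy values standing in for real embeddings, and `FindBoundaries` is our own illustrative helper implementing steps 2 and 3.

```csharp
using System;
using System.Collections.Generic;

public static class SemanticBoundaries
{
    // Cosine similarity: dot(a, b) / (|a| * |b|). Close to 1 means "same topic".
    public static double CosineSimilarity(float[] a, float[] b)
    {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Returns the indices where a new chunk should begin: sentence i starts a
    // chunk when its similarity to sentence i-1 drops below the threshold.
    public static List<int> FindBoundaries(IReadOnlyList<float[]> embeddings, double threshold)
    {
        var boundaries = new List<int>();
        for (int i = 1; i < embeddings.Count; i++)
            if (CosineSimilarity(embeddings[i - 1], embeddings[i]) < threshold)
                boundaries.Add(i);
        return boundaries;
    }
}
```

With three toy embeddings where the first two point the same way and the third is orthogonal, a 0.5 threshold places a single boundary before the third sentence—the "Music" to "Weather" transition from the radio analogy.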
The Role of C# Features in Chunking Architecture
In building these systems using C#, specific language features enable robust and maintainable chunking pipelines.
1. Records and Immutability (record):
When we split a document, we generate metadata. A chunk is not just a string; it is a data structure containing the text, the source document ID, the page number, and the chunk index. Using C# record types ensures that these data structures are immutable. Once a chunk is created and indexed into a vector database, it should not change. Immutability prevents accidental modification of the chunk during processing.
// Defining a robust data structure for a chunk
public record DocumentChunk(
string Id,
string Content,
int SourcePage,
float[]? Embedding = null
);
2. IEnumerable<T> and Yield Return:
Chunking large documents requires memory efficiency. We should not load a 100MB PDF into a single string and then split it. Instead, we use C# iterators (yield return). This allows us to stream the document, process it chunk by chunk, and emit chunks one at a time without consuming massive amounts of RAM.
public IEnumerable<string> StreamChunks(string filePath, int maxChars = 2000)
{
    using var reader = new StreamReader(filePath);
    var buffer = new StringBuilder();
    string? line;
    while ((line = reader.ReadLine()) != null)
    {
        // Accumulate lines until we hit a threshold, then yield the chunk.
        // This prevents loading the whole file into memory.
        buffer.AppendLine(line);
        if (buffer.Length >= maxChars)
        {
            yield return buffer.ToString();
            buffer.Clear();
        }
    }
    if (buffer.Length > 0)
        yield return buffer.ToString(); // emit the final partial chunk
}
3. Pattern Matching: When parsing unstructured text (like PDFs converted to raw text), we often encounter mixed content—headers, footers, body text. C# pattern matching allows us to declaratively filter and categorize text segments before chunking.
public bool IsContentRelevant(string textSegment)
{
return textSegment switch
{
var s when s.StartsWith("Page") || s.StartsWith("Confidential") => false,
var s when string.IsNullOrWhiteSpace(s) => false,
_ => true
};
}
Architectural Implications: The "What If" Scenarios
What if the chunk is too small? You risk destroying the "Chain of Thought." If an answer requires three logical steps to derive, and each step is in a separate chunk, the LLM might only receive step 2. The retrieval system must be smart enough to fetch adjacent chunks (windowing) or use metadata to group related chunks.
What if the chunk is too large? You hit the token limit, incurring truncation costs (paying for tokens you can't use) and losing the end of the document. More critically, the "noise-to-signal" ratio increases. If a chunk contains 50% boilerplate legal text and 50% relevant technical specs, the LLM's attention is diluted, reducing the accuracy of the response.
The Hybrid Approach:
1. First Pass (Retrieval): Use small, semantic chunks (e.g., 256 tokens) for high-precision vector search.
2. Second Pass (Context Augmentation): Once the relevant small chunk is found, retrieve its neighbors (parent/child hierarchy) or the full section to provide surrounding context to the LLM.
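The two-pass pattern amounts to keeping a map from each small chunk back to its parent section. Everything below (the `ParentChildIndex` and `ChildChunk` types) is an illustrative sketch of ours, with a trivial keyword-overlap score standing in for the vector search of the first pass.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public record ChildChunk(string Id, string Text, string ParentId);

public class ParentChildIndex
{
    private readonly List<ChildChunk> _children = new();
    private readonly Dictionary<string, string> _parents = new(); // parentId -> full section text

    // Index a full section alongside its small, retrieval-friendly chunks.
    public void AddSection(string parentId, string sectionText, IEnumerable<string> smallChunks)
    {
        _parents[parentId] = sectionText;
        int i = 0;
        foreach (var c in smallChunks)
            _children.Add(new ChildChunk($"{parentId}:{i++}", c, parentId));
    }

    // Pass 1: find the best small chunk (keyword overlap stands in for vector search).
    // Pass 2: return its parent section so the LLM sees the surrounding context.
    public string? Retrieve(string query)
    {
        var queryWords = query.Split(' ', StringSplitOptions.RemoveEmptyEntries);
        var best = _children
            .OrderByDescending(c => queryWords.Count(w =>
                c.Text.Contains(w, StringComparison.OrdinalIgnoreCase)))
            .FirstOrDefault();
        return best is null ? null : _parents[best.ParentId];
    }
}
```

The design choice is that precision and context are decoupled: the small chunk decides *where* to look, while the parent section decides *what* the model actually reads.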
The Connection to Microsoft Semantic Kernel
In the context of the Semantic Kernel, chunking is the prerequisite for the TextSearch capabilities. The Kernel's ITextSearch interface relies on a pre-processed index. By utilizing the Microsoft.SemanticKernel.Text library, we standardize how we break down documents.
This standardization is crucial for Plugin Development. A plugin designed to query a manual expects a consistent input format. If one manual is chunked by paragraph and another by sentence, the plugin's logic (and the underlying vector embeddings) will be inconsistent. By enforcing a unified chunking strategy via the Kernel's text splitters, we ensure that our AI plugins behave predictably regardless of the underlying document format.
Visualization of Chunking Strategies
The following diagram illustrates how a single document flows through different chunking strategies, resulting in distinct index structures.
Summary
Chunking is not merely a text manipulation task; it is a semantic architectural decision. It dictates the upper bound of your RAG system's accuracy. Fixed-size chunking offers control, recursive chunking offers structural integrity, and semantic chunking offers conceptual coherence. In the Microsoft Semantic Kernel ecosystem, these strategies are implemented using the TextChunker utilities and C# features like Records and Iterators to create scalable, memory-efficient pipelines. The choice of strategy depends on the document type and the query profile, but the goal remains constant: to feed the LLM the exact context it needs, in the exact format it can digest.
Basic Code Example
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Connectors.OpenAI;
using System.Text;
using System.Text.Json;
// ==========================================
// SCENARIO: The "Legal Brief" Aggregator
// ==========================================
// Problem: A junior associate needs to quickly understand the key clauses in a 50-page PDF contract.
// Manual reading is time-consuming. We want to feed the text into an AI, but the entire document
// exceeds the context window of standard LLMs (e.g., 4k/8k/128k tokens).
// Solution: We use Semantic Kernel's TextChunker to split the document into manageable pieces
// that preserve context, allowing an AI agent to analyze each section individually.
namespace ChunkingDemo
{
class Program
{
static async Task Main(string[] args)
{
// 1. SETUP: Initialize the Semantic Kernel
// We use a dummy key here because we aren't calling an LLM, just using the Kernel's utilities.
var kernel = Kernel.CreateBuilder()
.AddOpenAIChatCompletion("gpt-4", "fake-api-key")
.Build();
// 2. DATA: Simulate a large PDF text extraction
// In a real app, this would come from a PDF parser like PdfPig or Azure Document Intelligence.
// We create a long string with paragraphs to demonstrate chunking boundaries.
string rawText = GenerateMockLegalText();
Console.WriteLine("--- ORIGINAL TEXT (Simulated PDF Extraction) ---");
Console.WriteLine(rawText.Substring(0, Math.Min(500, rawText.Length)) + "...\n");
Console.WriteLine($"Total Length: {rawText.Length} characters\n");
// 3. CHUNKING: Apply the TextChunker
// TextChunker is a static utility class. Its paragraph splitter behaves like a
// recursive splitter: it keeps paragraphs and lines intact where possible and
// only falls back to harder cuts when a unit exceeds the token budget.
// Configuration:
// - maxTokensPerChunk: 100 (Small for demonstration, usually 500-1000 for RAG)
// - overlapTokens: 15 (To prevent cutting sentences in half)
int maxTokensPerChunk = 100;
int overlapTokens = 15;
// TextChunker accepts a token-counting callback. In production, you'd use the
// specific model's tokenizer (e.g., a Tiktoken implementation); for this demo,
// we use a simple whitespace/punctuation approximation.
var tokenizer = new SimpleTokenizer();
// First split into sentence-sized lines, then group lines into overlapping chunks.
var lines = Microsoft.SemanticKernel.Text.TextChunker.SplitPlainTextLines(
rawText, maxTokensPerChunk, tokenizer.CountTokens);
var chunks = Microsoft.SemanticKernel.Text.TextChunker.SplitPlainTextParagraphs(
lines, maxTokensPerChunk, overlapTokens, tokenCounter: tokenizer.CountTokens);
// 4. OUTPUT: Display the chunks
Console.WriteLine($"--- CHUNKING RESULTS (Strategy: Recursive, MaxTokens: {maxTokensPerChunk}, Overlap: {overlapTokens}) ---\n");
for (int i = 0; i < chunks.Count; i++)
{
Console.WriteLine($"[CHUNK {i + 1}]");
Console.WriteLine($"Length: {chunks[i].Length} chars");
Console.WriteLine("Content:");
Console.WriteLine(chunks[i]);
Console.WriteLine(new string('-', 40));
}
// 5. AGENTIC PATTERN: Simulate processing chunks
// In a real agentic workflow, these chunks would be passed to a loop of AI calls.
Console.WriteLine("\n--- AGENTIC SIMULATION ---");
await ProcessChunksWithAgent(chunks);
}
// Helper: Generates a long text simulating a legal contract
static string GenerateMockLegalText()
{
var sb = new StringBuilder();
sb.AppendLine("SECTION 1: DEFINITIONS");
sb.AppendLine("1.1 'Agreement' refers to this Master Service Agreement between the Parties.");
sb.AppendLine("1.2 'Confidential Information' means any data disclosed that is marked confidential.");
sb.AppendLine("1.3 'Effective Date' is the date first written above.");
sb.AppendLine();
sb.AppendLine("SECTION 2: SERVICES");
sb.AppendLine("2.1 The Provider agrees to deliver the services specified in Exhibit A.");
sb.AppendLine("2.2 Service levels are guaranteed at 99.9% uptime, excluding scheduled maintenance.");
sb.AppendLine("2.3 The Client shall provide necessary access to systems within 5 business days.");
sb.AppendLine();
sb.AppendLine("SECTION 3: PAYMENT");
sb.AppendLine("3.1 Fees are due net 30 days from invoice date.");
sb.AppendLine("3.2 Late payments incur a 1.5% monthly interest charge.");
sb.AppendLine("3.3 All fees are non-refundable unless termination is caused by Provider breach.");
sb.AppendLine();
sb.AppendLine("SECTION 4: TERMINATION");
sb.AppendLine("4.1 Either party may terminate with 60 days written notice.");
sb.AppendLine("4.2 Immediate termination is allowed for breach of confidentiality.");
sb.AppendLine("4.3 Upon termination, all Confidential Information must be returned or destroyed.");
sb.AppendLine();
sb.AppendLine("SECTION 5: LIABILITY");
sb.AppendLine("5.1 Provider liability is capped at the total fees paid in the preceding 12 months.");
sb.AppendLine("5.2 Neither party is liable for indirect or consequential damages.");
sb.AppendLine("5.3 This limitation applies even if a remedy fails of its essential purpose.");
sb.AppendLine();
// Repeat sections to simulate a long document
for (int i = 0; i < 5; i++)
{
sb.AppendLine($"APPENDIX {i + 1}: TECHNICAL SPECIFICATIONS");
sb.AppendLine($"Specification {i + 1} requires compliance with ISO 27001 standards.");
sb.AppendLine($"Audit logs must be retained for a minimum of {365 + (i * 10)} days.");
sb.AppendLine("Data encryption at rest must use AES-256 bit algorithms.");
sb.AppendLine();
}
return sb.ToString();
}
// Helper: Simulates an Agent processing chunks one by one
static async Task ProcessChunksWithAgent(List<string> chunks)
{
foreach (var chunk in chunks)
{
// In a real scenario, we would call:
// var result = await kernel.InvokeAsync("SummarizePlugin", "Summarize", new KernelArguments { ["input"] = chunk });
// For this demo, we just simulate the delay and output.
Console.Write($"Processing chunk of {chunk.Length} chars... ");
await Task.Delay(200); // Simulate network latency
Console.WriteLine("Done.");
}
Console.WriteLine("All chunks processed by Agent.");
}
}
// ==========================================
// CUSTOM TOKEN COUNTER (Simple Implementation)
// ==========================================
// TextChunker accepts a token-counting callback to measure chunk sizes.
// For this 'Hello World' example, we approximate tokens by splitting on
// whitespace and punctuation. In production, use a real BPE tokenizer
// (e.g., the Tiktoken support in the Microsoft.ML.Tokenizers package).
public class SimpleTokenizer
{
    public int CountTokens(string text)
    {
        if (string.IsNullOrWhiteSpace(text)) return 0;
        // Split by whitespace and punctuation to approximate tokens
        var tokens = text.Split(new[] { ' ', '\t', '\n', '\r', '.', ',', ';', ':', '!', '?' },
            StringSplitOptions.RemoveEmptyEntries);
        return tokens.Length;
    }
}
}
Line-by-Line Explanation
This section breaks down the code logic, architectural decisions, and the specific role of Semantic Kernel components.
1. Setup and Initialization
```csharp
var kernel = Kernel.CreateBuilder()
    .AddOpenAIChatCompletion("gpt-4", "fake-api-key")
    .Build();
```
* **Context**: Even though we are only performing chunking (a utility task), we initialize the `Kernel`. In a full Agentic application, the chunker is often used immediately before or after an AI call.
* **Dependency Injection**: We register an `IChatCompletionService`. The `TextChunker` is part of the core Semantic Kernel libraries and doesn't strictly require an LLM to function, but initializing the Kernel ensures all standard services and logging are available.
2. Data Simulation
```csharp
string rawText = GenerateMockLegalText();
```
* **Real-world mapping**: In a production environment, `rawText` would be the output of a PDF parser (e.g., `PdfPig` or `Azure.AI.FormRecognizer`). PDFs are binary blobs; text extraction is a separate step that often results in unstructured text with page headers, footers, and noise.
* **Structure**: The mock text creates distinct sections (Definitions, Services, Payment) to visually demonstrate how the chunker preserves logical blocks.
3. The Chunking Logic
```csharp
var lines = TextChunker.SplitPlainTextLines(rawText, maxTokensPerChunk, tokenizer.CountTokens);
var chunks = TextChunker.SplitPlainTextParagraphs(lines, maxTokensPerChunk, overlapTokens, tokenCounter: tokenizer.CountTokens);
```
* **The Tool**: `TextChunker` is a static utility class provided by `Microsoft.SemanticKernel.Text`.
* **Parameters**:
  * `rawText` / `lines`: The input text, first split into sentence-sized lines, then regrouped into chunks.
  * `maxTokensPerChunk`: The hard limit (e.g., 100). This prevents context window overflow.
  * `overlapTokens`: Crucial for RAG (Retrieval-Augmented Generation). By overlapping chunks (e.g., 15 tokens), we ensure that if a sentence is cut off at the end of Chunk A, it appears at the beginning of Chunk B. This prevents losing context at the boundaries.
  * `tokenCounter`: The chunker needs to know how many "tokens" a string occupies, not just characters. Since different models (GPT-4 vs. Llama) tokenize text differently, passing a token-counting callback is required for accuracy.
* **Strategy**: `SplitPlainTextParagraphs` behaves like a recursive character text splitter: it attempts to split by paragraphs, then lines, then smaller units, ensuring the most meaningful splits happen first.
4. The Custom Tokenizer (SimpleTokenizer)
```csharp
public class SimpleTokenizer
{
    public int CountTokens(string text) { ... }
}
```
* **Role**: Supplies the token-counting callback that `TextChunker` uses to measure chunk sizes.
* **Logic**: For this "Hello World" example, we avoid external dependencies (like a `tiktoken` library) to keep the code self-contained. We approximate tokens by splitting on whitespace and punctuation.
* **Production Note**: In a real application, you would use the actual tokenizer for your model. For OpenAI models, a BPE (Byte-Pair Encoding) implementation such as the Tiktoken tokenizer in `Microsoft.ML.Tokenizers` counts tokens accurately.
5. Agentic Simulation
```csharp
static async Task ProcessChunksWithAgent(List<string> chunks)
```
* **The Pattern**: The agent loop processes chunks sequentially. In a real workflow, each chunk would be passed to an LLM call (e.g., a summarization plugin), and the per-chunk results would then be aggregated into a final answer—a map-reduce style pattern.
Visualizing the Chunking Flow
The following diagram illustrates how a raw document flows through the chunking process before reaching the AI Agent.
Common Pitfalls
1. The "Sentence Fracture" Problem
   - Mistake: Using a fixed-size chunker (e.g., every 500 characters) without regard for sentence structure.
   - Consequence: A semantic unit (a sentence or paragraph) is cut exactly in half. The first half belongs to Chunk A, the second to Chunk B. When the AI retrieves only Chunk B, it lacks the subject or context of the sentence.
   - Fix: Use recursive chunking (as shown in the code), which prioritizes splitting by newlines and spaces before falling back to character limits. Always use overlap to carry the tail end of a sentence into the next chunk.

2. Ignoring Tokenization Variance
   - Mistake: Assuming 1 word equals 1 token, or using a generic tokenizer for a specific model.
   - Consequence: You calculate that a chunk is 400 words and fits within a 500-token limit. However, the model's tokenizer (e.g., GPT-4's) uses sub-word tokens, resulting in 600 tokens. This causes runtime errors or silent truncation of data.
   - Fix: Always provide a token-counting callback backed by the specific tokenizer for the model you are using.

3. Context Isolation (The "Lost in the Middle" Effect)
   - Mistake: Chunking too small or without overlap, causing the AI to miss the "big picture."
   - Consequence: If a contract defines a term in Section 1.1, and that term is used in Section 5.3, a small chunk containing only Section 5.3 might hallucinate the meaning of the term.
   - Fix: For complex documents, consider semantic chunking (grouping sentences by embedding similarity) rather than just text splitting. Alternatively, ensure your agentic pattern includes a "Summary" step where high-level summaries of chunks are combined.
The chapter continues with advanced code, exercises and solutions with analysis, you can find them on the ebook on Leanpub.com or Amazon
Code License: All code examples are released under the MIT License. Github repo.
Content Copyright: Copyright © 2026 Edgar Milvus | Privacy & Cookie Policy. All rights reserved.