Chapter 23: Text Processing - Splitting Strings and Basic Tokenization

Theoretical Foundations

Imagine you have a long sentence like "Alice, Bob, and Charlie are working on an AI project together." You want to extract the individual names to process them separately. This is similar to a librarian sorting a large pile of books onto individual shelves. In C#, we use the Split() method to break a single string into an array of smaller strings, called tokens, based on a delimiter (like a comma or a space). This is the foundation of text processing and basic tokenization.

This chapter builds directly on Chapter 11: Arrays, where you learned to store fixed-size collections of data. When we split a string, the result is an array of strings (string[]), which you can iterate over with the foreach loop from Chapter 12.

The Core Concept: The Split() Method

The Split() method is a powerful tool found on every string instance. It divides a string into substrings based on a specified character or array of characters (the delimiter) and returns a string[].

Syntax:

string[] pieces = originalString.Split(delimiter);

  • originalString: The text you want to break down.
  • delimiter: The character (or characters) where the split occurs. This is not included in the resulting pieces.
  • pieces: An array of strings containing the segments.

Real-World Analogy

Think of a CSV (Comma-Separated Values) file. A line in such a file looks like this: "John,Doe,30,Engineer". The comma is the delimiter. The Split() method acts like a pair of scissors, cutting the string at every comma to produce four separate pieces: "John", "Doe", "30", and "Engineer".
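
As a minimal sketch of this analogy, the snippet below cuts that one record into fields and converts the third field to a number. The names CsvRecordDemo and ParseRecord are illustrative, and the code assumes plain fields with no quoted or embedded commas, which real CSV files may contain.

```csharp
using System;

static class CsvRecordDemo
{
    // Hypothetical helper: splits one simple comma-separated record into fields.
    public static string[] ParseRecord(string record)
    {
        return record.Split(',');
    }

    static void Main()
    {
        string[] fields = ParseRecord("John,Doe,30,Engineer");

        // Fields arrive as text, so numeric values need converting before use.
        int age = int.Parse(fields[2]);

        Console.WriteLine("First name: " + fields[0]);
        Console.WriteLine("Last name: " + fields[1]);
        Console.WriteLine("Age next year: " + (age + 1));
        Console.WriteLine("Job: " + fields[3]);
    }
}
```

Every piece comes back as a string, even "30", which is why the conversion step is needed before doing arithmetic.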

Basic Usage: Splitting by a Single Character

Let's start with the simplest case: splitting a string by a single comma.

using System;

class TextProcessor
{
    static void Main()
    {
        // A string containing a list of names separated by commas
        string namesList = "Alice,Bob,Charlie";

        // Split the string using the comma as the delimiter
        string[] nameTokens = namesList.Split(',');

        // The result is an array of strings
        // nameTokens[0] is "Alice"
        // nameTokens[1] is "Bob"
        // nameTokens[2] is "Charlie"

        Console.WriteLine("There are " + nameTokens.Length + " names in the list.");

        // We can iterate over the array using a foreach loop (Chapter 12)
        foreach (string name in nameTokens)
        {
            Console.WriteLine("Name: " + name);
        }
    }
}

Output:

There are 3 names in the list.
Name: Alice
Name: Bob
Name: Charlie

Handling Multiple Delimiters

Sometimes text is messy. You might have a sentence with commas, spaces, and periods. Split() can accept an array of characters to split on any of them.

using System;

class TextProcessor
{
    static void Main()
    {
        // A string with various separators
        string rawData = "Apple/Banana, Orange Pear";

        // Define an array of delimiter characters
        char[] delimiters = { '/', ',', ' ' };

        // Split using all delimiters
        string[] fruits = rawData.Split(delimiters);

        // The result will include empty entries if delimiters are adjacent
        // For "Apple/Banana, Orange Pear", the result is:
        // "Apple", "Banana", "", "Orange", "Pear"
        // Note the empty string between the comma and space

        foreach (string fruit in fruits)
        {
            Console.WriteLine($"Fruit: '{fruit}'");
        }
    }
}

Output:

Fruit: 'Apple'
Fruit: 'Banana'
Fruit: ''
Fruit: 'Orange'
Fruit: 'Pear'
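
Delimiters do not have to be single characters. As a hedged sketch, the Split() overload that takes a string[] and a StringSplitOptions value treats each multi-character separator as one unit, so a separator like ", " is consumed whole and leaves no empty entry behind. The names SeparatorDemo and SplitOnSeparators are illustrative.

```csharp
using System;

static class SeparatorDemo
{
    // Hypothetical helper: splits on whole separator strings rather than single characters.
    public static string[] SplitOnSeparators(string text)
    {
        string[] separators = { ", ", "; " };
        return text.Split(separators, StringSplitOptions.None);
    }

    static void Main()
    {
        foreach (string fruit in SplitOnSeparators("Apple, Banana; Orange"))
        {
            Console.WriteLine($"Fruit: '{fruit}'");
        }
    }
}
```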

Handling Empty Entries and Whitespace

When splitting, you often get empty strings ("") or strings that are just whitespace. This happens when delimiters are adjacent or at the start/end of the string.

Example of Empty Entries:

string data = "first,,second";
string[] parts = data.Split(',');
// parts[0] = "first"
// parts[1] = ""  (empty)
// parts[2] = "second"

Example of Whitespace:

string sentence = "  Hello   World  ";
string[] words = sentence.Split(' ');
// This results in many entries: "", "", "Hello", "", "", "World", "", ""
// The spaces create empty strings between words.

Solution: Removing Empty Entries

To clean up the array, you can use the Split() overload that accepts a StringSplitOptions parameter. This allows you to remove empty entries automatically.

using System;

class TextProcessor
{
    static void Main()
    {
        string sentence = "  Hello   World  ";

        // Split by space and remove empty entries
        string[] cleanWords = sentence.Split(' ', StringSplitOptions.RemoveEmptyEntries);

        foreach (string word in cleanWords)
        {
            Console.WriteLine(word);
        }
    }
}

Output:

Hello
World
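
The options can also be combined. Assuming .NET 5 or later, where StringSplitOptions.TrimEntries is available, the sketch below trims each piece and drops any that end up empty. SplitClean is a hypothetical helper name.

```csharp
using System;

static class CleanSplitDemo
{
    // Hypothetical helper. Assumes .NET 5+ for StringSplitOptions.TrimEntries.
    public static string[] SplitClean(string text, char delimiter)
    {
        return text.Split(
            delimiter,
            StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries);
    }

    static void Main()
    {
        string[] names = SplitClean(" Alice , Bob ,, Charlie ", ',');

        // Quotes make any leftover whitespace visible.
        foreach (string name in names)
        {
            Console.WriteLine($"'{name}'");
        }
    }
}
```

On older frameworks without TrimEntries, calling Trim() on each piece after splitting gives a similar result.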

Tokenization Patterns for AI Applications

In AI applications, tokenization is the process of converting raw text into a sequence of tokens (words, subwords, or characters) that a model can understand. While modern AI uses advanced tokenizers, the basic principle starts with simple splitting.
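
To make the idea concrete, here is a minimal word-level tokenizer sketch: it lowercases the text and splits on spaces and a few punctuation marks. The delimiter set and the name Tokenize are assumptions chosen for illustration; tokenizers used by real language models work quite differently.

```csharp
using System;

static class SimpleTokenizer
{
    // Hypothetical helper: lowercases the text, then splits on spaces and basic punctuation.
    public static string[] Tokenize(string text)
    {
        char[] delimiters = { ' ', ',', '.', '!', '?' };
        return text.ToLowerInvariant()
                   .Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
    }

    static void Main()
    {
        string[] tokens = Tokenize("Hello, world! Is this tokenization?");

        Console.WriteLine("Token count: " + tokens.Length);
        foreach (string token in tokens)
        {
            Console.WriteLine(token);
        }
    }
}
```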

Use Case: Preprocessing User Input for a Chatbot

Imagine a user types a command: "send email to alice@example.com and bob@example.com". To extract the email addresses, you might split the string by spaces and then filter for strings containing "@".

using System;

class Chatbot
{
    static void Main()
    {
        string userInput = "send email to alice@example.com and bob@example.com";

        // Step 1: Split by spaces to get individual words/tokens
        string[] tokens = userInput.Split(' ');

        Console.WriteLine("Extracted Emails:");

        // Step 2: Iterate through tokens and check for the '@' symbol
        foreach (string token in tokens)
        {
            // We use the Contains method (available on strings) to check
            if (token.Contains("@"))
            {
                Console.WriteLine(token);
            }
        }
    }
}

Output:

Extracted Emails:
alice@example.com
bob@example.com

Note: Checking for "@" alone is a weak test. A token such as "@home" or "name@" would also pass, so in a real scenario you'd use stricter logic (like checking for a dot after the @). This illustrates why basic tokenization is a starting point.
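
A slightly stricter filter can be sketched as follows. LooksLikeEmail is a hypothetical helper, and its rule (some text before the @, then a dot later on that is not the last character) is a rough heuristic, not real email validation.

```csharp
using System;

static class EmailFilter
{
    // Hypothetical heuristic: requires text before '@', and the first '.' after it
    // must be neither directly after the '@' nor the last character.
    public static bool LooksLikeEmail(string token)
    {
        int at = token.IndexOf('@');
        if (at <= 0)
        {
            return false; // no '@' at all, or nothing before it
        }

        int dot = token.IndexOf('.', at + 1);
        return dot > at + 1 && dot < token.Length - 1;
    }

    static void Main()
    {
        string userInput = "send email to alice@example.com and @bob or carol@";

        foreach (string token in userInput.Split(' ', StringSplitOptions.RemoveEmptyEntries))
        {
            if (LooksLikeEmail(token))
            {
                Console.WriteLine(token);
            }
        }
    }
}
```

With this input, only alice@example.com passes the check; "@bob" and "carol@" contain an @ but are rejected.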

Performance Considerations for Large Text

When processing large text blocks (e.g., reading a 1MB log file), splitting the entire string into an array at once can consume significant memory. This is because Split() creates a new array and copies all substrings into it.

Memory Concept (Chapter 15: Stack vs Heap)

  • The original string is stored on the Heap.
  • The resulting array of strings is also on the Heap.
  • For very large text, this can roughly double memory usage for as long as both the original string and the split pieces are in use.

Best Practice for Large Files: If you are processing a massive file, it is often better to read it line by line (using StreamReader) and process each line individually, rather than loading the entire file into a string and splitting it.

// Conceptual example for future reference (not executable here)
// using (StreamReader reader = new StreamReader("largefile.txt"))
// {
//     string line;
//     while ((line = reader.ReadLine()) != null)
//     {
//         string[] tokens = line.Split(',');
//         // Process tokens for this line
//     }
// }
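
A runnable variant of the same idea is sketched below. To stay self-contained it reads from a StringReader holding a few sample lines instead of a file; a StreamReader over a real path would follow the same loop. CountTokens and the sample data are hypothetical.

```csharp
using System;
using System.IO;

static class LineByLineDemo
{
    // Hypothetical helper: counts comma-separated tokens one line at a time,
    // so only the current line's tokens are held in memory.
    public static int CountTokens(TextReader reader)
    {
        int total = 0;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string[] tokens = line.Split(',');
            total += tokens.Length;
        }
        return total;
    }

    static void Main()
    {
        // Stand-in for the contents of a large file.
        string sample = "id,level,message\n1,INFO,Started\n2,ERROR,Disk full";

        using (StringReader reader = new StringReader(sample))
        {
            Console.WriteLine("Total tokens: " + CountTokens(reader));
        }
    }
}
```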

The Join() Method: The Reverse of Split

Sometimes you need to do the opposite: combine an array of strings back into a single string. This is where string.Join() comes in, which was introduced in the chapter on string methods.

Syntax:

string combined = string.Join(separator, array);

Example:

using System;

class TextProcessor
{
    static void Main()
    {
        string[] parts = { "Error", "404", "Page not found" };

        // Join with a pipe separator
        string logEntry = string.Join(" | ", parts);

        Console.WriteLine(logEntry);
    }
}

Output:

Error | 404 | Page not found
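
Split() and Join() are often used together. The sketch below, built around a hypothetical helper named NormalizeList, splits a loosely formatted list, discards empty pieces, trims the rest, and joins them with a consistent separator.

```csharp
using System;

static class RoundTripDemo
{
    // Hypothetical helper: turns "red, green ,,blue" style input into "red | green | blue".
    public static string NormalizeList(string raw)
    {
        string[] pieces = raw.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries);

        // Trim each piece in place so stray spaces do not survive the round trip.
        for (int i = 0; i < pieces.Length; i++)
        {
            pieces[i] = pieces[i].Trim();
        }

        return string.Join(" | ", pieces);
    }

    static void Main()
    {
        Console.WriteLine(NormalizeList("red, green ,,blue")); // red | green | blue
    }
}
```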

Summary

  1. Splitting: Breaks a string into an array (string[]) based on delimiters.
  2. Delimiters: Can be a single character or an array of characters.
  3. Empty Entries: Adjacent delimiters create empty strings in the result.
  4. Cleaning: Use StringSplitOptions.RemoveEmptyEntries to filter out empty entries.
  5. Tokenization: Splitting is the first step in breaking text into analyzable pieces for AI or data processing.
  6. Memory: Splitting large strings creates new arrays; be mindful of memory usage.
  7. Reversing: string.Join() combines an array back into a single string.

This foundation allows you to parse user commands, read configuration files, and prepare data for more complex operations in later chapters.

[Figure: The Join() method reassembles a parsed array of strings back into a single, cohesive string, ready for user command processing or data configuration.]

Basic Code Example

Let's solve a common problem: processing a list of names provided by a user. Imagine you are building a simple application where a user types in a list of names separated by commas. Your program needs to read that single string, break it apart into individual names, and then greet each person one by one.

The Code Example

Here is the complete code to achieve this. We will use the Split() method to break the string into pieces.

using System;

class Program
{
    static void Main()
    {
        // 1. Prompt the user for input.
        Console.WriteLine("Please enter a list of names separated by commas (e.g., Alice,Bob,Charlie):");

        // 2. Read the single line of text entered by the user.
        string inputLine = Console.ReadLine();

        // 3. Define the character we want to use to split the string.
        //    A single char is enough because we split on exactly one character.
        char delimiter = ',';

        // 4. Use the Split method to break the string into an array of strings.
        //    Every time it sees a comma, it creates a new entry in the array.
        string[] names = inputLine.Split(delimiter);

        // 5. Print a separator line for clarity.
        Console.WriteLine("-----------------");
        Console.WriteLine("Processing names...");

        // 6. Loop through the array of names using a foreach loop.
        foreach (string name in names)
        {
            // 7. Greet each individual name found.
            Console.WriteLine($"Hello, {name}!");
        }
    }
}

Step-by-Step Explanation

  1. Reading the Input: We use Console.ReadLine() to capture the user's text. This stores the entire input, including commas, as one single string variable.
  2. Setting the Delimiter: We define a char variable named delimiter and set it to ','. This tells our program what symbol marks the boundary between the data we want.
  3. The Split() Method: This is the core of text processing. inputLine.Split(delimiter) looks at the string, finds every instance of the comma, and chops the string at those points. It returns a string[] (an array of strings).
    • If the user entered "Alice,Bob", the resulting array contains ["Alice", "Bob"].
  4. Iterating the Result: We use a foreach loop (Chapter 12) to go through the names array. This loop automatically runs once for every item found in the array.
  5. Output: Inside the loop, we use string interpolation to print a personalized message for each name.

Visualizing the Split Process

The Split operation conceptually looks like this:

[Figure: The Split method takes a single string of names separated by commas and divides it into an array of individual strings, which are then processed one by one in a loop to print personalized messages.]

Common Pitfalls

1. Whitespace Issues

The Split method is literal. If the user types "Alice, Bob" (with a space after the comma), your array will contain ["Alice", " Bob"]. Notice the space before "Bob". This can cause issues if you try to compare names later (e.g., " Bob" == "Bob" is false).

  • Solution: In later chapters, you will learn to trim whitespace. For now, be aware that the delimiter must be exact.
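
As a preview, one possible approach is to call Trim() on each piece after splitting. The sketch below assumes that approach; SplitAndTrim is a hypothetical helper name.

```csharp
using System;

static class TrimDemo
{
    // Hypothetical helper: splits on commas, then removes leading and
    // trailing whitespace from each piece.
    public static string[] SplitAndTrim(string input)
    {
        string[] names = input.Split(',');
        for (int i = 0; i < names.Length; i++)
        {
            names[i] = names[i].Trim();
        }
        return names;
    }

    static void Main()
    {
        string[] names = SplitAndTrim("Alice, Bob");

        // The comparison succeeds once the leading space is gone.
        Console.WriteLine(names[1] == "Bob"); // True
    }
}
```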

2. Empty Entries

If the user types "Alice,,Bob" (two commas in a row), the Split method will create an empty string in the array at that position. The array will look like ["Alice", "", "Bob"]. Your loop will still run three times, and the second greeting will read "Hello, !" because the name is empty.

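3. Empty or Null Input

If the user simply presses Enter without typing anything, inputLine will be an empty string "". Calling Split on an empty string returns an array with one empty element rather than an empty array, so the loop runs once with no data and prints "Hello, !". Separately, Console.ReadLine() returns null when there is no more input to read (for example, when input is redirected from a file that has ended), and calling Split on null throws a NullReferenceException.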
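
One defensive pattern, sketched here under the assumption that blank input should yield no names at all, is to check the input before splitting. ParseNames is a hypothetical helper.

```csharp
using System;

static class SafeInputDemo
{
    // Hypothetical helper: returns an empty array for null, empty,
    // or whitespace-only input instead of calling Split on it.
    public static string[] ParseNames(string input)
    {
        if (string.IsNullOrWhiteSpace(input))
        {
            return new string[0];
        }
        return input.Split(',');
    }

    static void Main()
    {
        Console.WriteLine(ParseNames("").Length);          // 0
        Console.WriteLine(ParseNames("Alice,Bob").Length); // 2
    }
}
```

Because the helper returns an empty array for blank input, a foreach loop over the result simply does not run.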

The chapter continues with advanced code, exercises, and solutions with analysis; you can find them in the ebook on Leanpub.com or Amazon.



Code License: All code examples are released under the MIT License. Github repo.

Content Copyright: Copyright © 2026 Edgar Milvus. All rights reserved.

All textual explanations, original diagrams, and illustrations are the intellectual property of the author. To support the maintenance of this site via AdSense, please read this content exclusively online. Copying, redistribution, or reproduction is strictly prohibited.