Tokenization in NLP

Introduction to Tokenization

Tokenization is the foundational step in Natural Language Processing (NLP) that breaks down text into smaller units called tokens. These tokens can be words, subwords, or characters, depending on the requirements of the task.

Why is Tokenization Important?

  • Converts unstructured text into structured data
  • Enables feature extraction for ML models
  • Helps in vocabulary creation
  • Essential for subsequent NLP tasks like POS tagging and parsing

Types of Tokenization

1. Word Tokenization

Splitting text into individual words. This is the most common approach, but handling complex cases such as contractions, hyphenation, and punctuation requires extra rules.

2. Subword Tokenization

Breaking words into smaller meaningful units. Handles rare and out-of-vocabulary words effectively.

3. Character Tokenization

Splitting text into individual characters. Useful for languages without clear word boundaries and for character-level deep learning models, at the cost of much longer sequences.
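
The difference between word-level and character-level splitting can be seen with plain Python; this is a minimal sketch (the example sentence is illustrative), while subword tokenization needs a learned vocabulary and is shown later:

text = "NLP rocks."

# Word-level: split on whitespace
print(text.split())
['NLP', 'rocks.']

# Character-level: every character (including spaces) becomes a token
print(list(text))
['N', 'L', 'P', ' ', 'r', 'o', 'c', 'k', 's', '.']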

Common Tokenization Techniques

Whitespace Tokenization

The simplest method: split on whitespace characters such as spaces, tabs, and newlines.

text = "Tokenization is the first step in NLP."
tokens = text.split()
print(tokens)
['Tokenization', 'is', 'the', 'first', 'step', 'in', 'NLP.']

Rule-Based Tokenization

Using regular expressions and language-specific rules.

import re

text = "Mr. O'Neil's dog isn't friendly."
# Match a word, optionally followed by an apostrophe and more word characters
tokens = re.findall(r"\b\w+(?:'\w+)?\b", text)
print(tokens)
['Mr', "O'Neil", 's', 'dog', "isn't", 'friendly']

Statistical Tokenization

Uses models trained on text corpora, rather than fixed rules, to learn where token or sentence boundaries fall.
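
As an illustration, NLTK's Punkt sentence tokenizer is trained in an unsupervised, statistical way to detect sentence boundaries; the output below assumes the standard pre-trained English Punkt model:

import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # pre-trained Punkt models (recent NLTK versions may also ask for "punkt_tab")

text = "Dr. Smith lives in New York. He is a researcher."
print(sent_tokenize(text))
['Dr. Smith lives in New York.', 'He is a researcher.']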

NLTK (Natural Language Toolkit)

from nltk.tokenize import TreebankWordTokenizer

text = "Let's explore NLTK tokenizer! It handles contractions well."
tokenizer = TreebankWordTokenizer()  # Penn Treebank rules: splits contractions and punctuation
tokens = tokenizer.tokenize(text)

print(tokens)
['Let', "'s", 'explore', 'NLTK', 'tokenizer', '!', 'It', 'handles', 'contractions', 'well', '.']

spaCy

import spacy

nlp = spacy.blank("en")  # Load only a tokenizer without models

text = "spaCy's tokenizer excels at: punctuation, URLs - https://example.com"
doc = nlp(text)
tokens = [token.text for token in doc]

print(tokens)
['spaCy', "'s", 'tokenizer', 'excels', 'at', ':', 'punctuation', ',', 'URLs', '-', 'https://example.com']

Hugging Face Transformers (BERT Tokenizer)

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Tokenization with subwords: unpredictable ##ably"
# Note: "##" typed in the input is just two literal "#" characters;
# in the output, a leading "##" marks a piece that continues the previous word.
tokens = tokenizer.tokenize(text)
print(tokens)
['token', '##ization', 'with', 'sub', '##words', ':', 'unpredictable', '#', '#', 'ab', '##ly']

Subword Tokenization Methods

Byte-Pair Encoding (BPE)

Byte-Pair Encoding (BPE) is a subword tokenization algorithm that iteratively merges the most frequent pairs of characters or bytes to create subword units. It is widely used in models like GPT-2.
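
To make the merge step concrete, here is a toy sketch of the core BPE loop in plain Python; the corpus and symbol notation are illustrative assumptions, not tied to any particular library:

from collections import Counter

# Toy corpus: each word is a tuple of symbols, mapped to its frequency
vocab = {
    ("l", "o", "w"): 5,
    ("l", "o", "w", "e", "s", "t"): 2,
    ("n", "e", "w", "e", "r"): 6,
    ("n", "e", "w", "e", "s", "t"): 3,
}

def most_frequent_pair(vocab):
    # Count adjacent symbol pairs, weighted by word frequency
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(pair, vocab):
    # Rewrite every word, replacing each occurrence of the pair with one merged symbol
    merged = {}
    for word, freq in vocab.items():
        new_word, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

pair = most_frequent_pair(vocab)
print(pair)
('w', 'e')

vocab = merge_pair(pair, vocab)  # one merge step; real BPE repeats this until the vocabulary budget is reached

The tokenizers library example below applies the same idea at the byte level.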

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
text = "low lowest newer newest"
# With such a tiny corpus and vocabulary budget, few (if any) merges are learned,
# so words are encoded mostly as byte/character-level pieces.
tokenizer.train_from_iterator([text], vocab_size=20)

print(tokenizer.encode("lowest").tokens)
['l', 'o', 'w', 'e', 's', 't']

WordPiece (Used in BERT)

WordPiece is the subword tokenization algorithm used in BERT and related models. It is similar to BPE, but instead of merging the most frequent pair it chooses merges that most increase the likelihood of the training data. WordPiece handles out-of-vocabulary words effectively by breaking them into known subwords.

from transformers import BertTokenizer

# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Tokenize a sentence
text = "WordPiece handles subwords like unpredictably."
tokens = tokenizer.tokenize(text)
print(tokens)
['word', '##piece', 'handles', 'sub', '##words', 'like', 'un', '##pre', '##dict', '##ably', '.']

SentencePiece (Used in XLNet and T5)

SentencePiece is a subword tokenization algorithm that treats the input text as a sequence of Unicode characters and does not rely on whitespace for tokenization. It is used in models like XLNet and T5. SentencePiece can operate in two modes:

  • Unigram: Probabilistically selects subwords.
  • BPE: Similar to Byte-Pair Encoding.

from sentencepiece import SentencePieceProcessor

# Load a pre-trained SentencePiece model ("spm_model.model" is a placeholder path to a trained model file)
spm = SentencePieceProcessor(model_file="spm_model.model")

# Tokenize a sentence
text = "SentencePiece is language-agnostic."
tokens = spm.encode_as_pieces(text)
print(tokens)
['▁SentencePiece', '▁is', '▁language', '-', 'agnostic', '.']

Challenges in Tokenization

Language-Specific Issues

  • Agglutinative languages, where one word can carry many morphemes (e.g., Turkish)
  • Scripts written without spaces between words (e.g., Chinese, Japanese); see the example below
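
A minimal illustration of why whitespace splitting breaks down for such scripts (the Chinese sentence means "I love natural language processing"):

text = "我爱自然语言处理"

# Whitespace splitting returns the whole sentence as a single token
print(text.split())
['我爱自然语言处理']

# Character-level splitting at least recovers individual characters
print(list(text))
['我', '爱', '自', '然', '语', '言', '处', '理']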

Ambiguity in Boundaries

  • “gumball” vs “gum ball”
  • “N.Y.” vs “New York”

Handling Special Cases

  • URLs and email addresses
  • Hashtags and mentions
  • Emojis and Unicode characters

import spacy

nlp = spacy.blank("en")  # Lightweight tokenizer without a full model

text = "Check https://example.com & email@domain.com 😀 #NLP @user"
doc = nlp(text)

print([token.text for token in doc])
['Check', 'https://example.com', '&', 'email@domain.com', '😀', '#', 'NLP', '@user']

Choosing the Right Tokenizer

Considerations:

  • Language requirements
  • Domain-specific vocabulary
  • Downstream tasks
  • Computational efficiency
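
In practice, a quick way to weigh the last two considerations is to run a representative text sample through candidate tokenizers and compare token counts, a rough proxy for sequence length and compute cost. A minimal sketch, assuming the bert-base-uncased tokenizer from the earlier examples:

from transformers import AutoTokenizer

sample = "Tokenization choices affect sequence length and therefore cost."

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Compare how many tokens each approach produces for the same text
print("whitespace tokens:", len(sample.split()))
print("wordpiece tokens: ", len(bert_tokenizer.tokenize(sample)))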