Tokenization in NLP
Introduction to Tokenization
Tokenization is the foundational step in Natural Language Processing (NLP) that breaks down text into smaller units called tokens. These tokens can be words, subwords, or characters, depending on the requirements of the task.
Why is Tokenization Important?
- Converts unstructured text into structured data
- Enables feature extraction for ML models
- Helps in vocabulary creation
- Essential for subsequent NLP tasks like POS tagging and parsing
Types of Tokenization
1. Word Tokenization
Splitting text into individual words. This is the most common approach, but it struggles with complex cases such as contractions, hyphenated words, and punctuation.
2. Subword Tokenization
Breaking words into smaller meaningful units. Handles unknown words effectively.
3. Character Tokenization
Splitting text into individual characters. Useful for character-level deep learning models and for languages without clear word boundaries; a short sketch contrasting the granularities follows this list.
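A minimal sketch of the first and third granularities using only the Python standard library (subword tokenization needs a trained vocabulary and is covered with real libraries below); the sample sentence is my own:
text = "Tokenizers split text"
word_tokens = text.split()   # word level: split on whitespace
char_tokens = list(text)     # character level: one token per character
print(word_tokens)           # ['Tokenizers', 'split', 'text']
print(char_tokens[:5])       # ['T', 'o', 'k', 'e', 'n']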
Common Tokenization Techniques
Whitespace Tokenization
The simplest method: split on whitespace (spaces, tabs, and newlines). Punctuation stays attached to neighboring words, as the trailing period on 'NLP.' in the output shows.
text = "Tokenization is the first step in NLP."
tokens = text.split()
print(tokens)
['Tokenization', 'is', 'the', 'first', 'step', 'in', 'NLP.']
Rule-Based Tokenization
Using regular expressions and language-specific rules. In the output below, the contraction "isn't" is kept intact while the possessive "'s" is split off from "O'Neil", a typical trade-off of hand-written patterns.
import re
text = "Mr. O'Neil's dog isn't friendly."
tokens = re.findall(r"\b\w+(?:'\w+)?\b", text)
print(tokens)
['Mr', "O'Neil", 's', 'dog', "isn't", 'friendly']
Statistical Tokenization
Uses machine learning models to identify token boundaries.
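One concrete example is NLTK's Punkt model, which learns sentence boundaries from unannotated text. The sketch below assumes NLTK is installed and downloads the pre-trained Punkt resource (newer NLTK versions name it "punkt_tab"); the sample sentence is my own.
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # pre-trained statistical Punkt model for boundary detection
text = "Dr. Smith arrived at 5 p.m. He was late."
print(sent_tokenize(text))  # the learned model keeps "Dr. Smith" together rather than splitting at every period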
Popular NLP Libraries for Tokenization
NLTK (Natural Language Toolkit)
from nltk.tokenize import TreebankWordTokenizer
text = "Let's explore NLTK tokenizer! It handles contractions well."
tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize(text)
print(tokens)
['Let', "'s", 'explore', 'NLTK', 'tokenizer', '!', 'It', 'handles', 'contractions', 'well', '.']
spaCy
import spacy
nlp = spacy.blank("en") # Load only a tokenizer without models
text = "spaCy's tokenizer excels at: punctuation, URLs - https://example.com"
doc = nlp(text)
tokens = [token.text for token in doc]
print(tokens)
['spaCy', "'s", 'tokenizer', 'excels', 'at', ':', 'punctuation', ',', 'URLs', '-', 'https://example.com']
Hugging Face Transformers (BERT Tokenizer)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Tokenization with subwords: unpredictable ##ably"
tokens = tokenizer.tokenize(text)
print(tokens)
['token', '##ization', 'with', 'sub', '##words', ':', 'unpredictable', '#', '#', 'ab', '##ly']
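As a follow-up sketch using the same tokenizer object (this step is not in the original snippet): calling the tokenizer directly produces model-ready input IDs, with BERT's [CLS] and [SEP] special tokens added.
ids = tokenizer(text)["input_ids"]           # map text to vocabulary IDs, adding special tokens
print(tokenizer.convert_ids_to_tokens(ids))  # round-trip: IDs back to tokens, wrapped in [CLS] ... [SEP]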
Subword Tokenization Methods
Byte-Pair Encoding (BPE)
Byte-Pair Encoding (BPE) is a subword tokenization algorithm that iteratively merges the most frequent pairs of characters or bytes to create subword units. It is widely used in models like GPT-2.
from tokenizers import ByteLevelBPETokenizer
tokenizer = ByteLevelBPETokenizer()
text = "low lowest newer newest"
# With one short sentence and vocab_size=20 (smaller than the byte-level base alphabet),
# effectively no merges are learned, so "lowest" below falls back to single characters;
# larger corpora and vocabularies produce multi-character subwords.
tokenizer.train_from_iterator([text], vocab_size=20)
print(tokenizer.encode("lowest").tokens)
['l', 'o', 'w', 'e', 's', 't']
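For intuition, here is a toy sketch of the merge loop itself in plain Python (illustrative only, not the Hugging Face implementation); the word frequencies are invented for the example:
from collections import Counter

# toy corpus: each word is a tuple of symbols with an (invented) frequency
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "s", "t"): 2,
          ("n", "e", "w", "e", "r"): 6, ("n", "e", "w", "e", "s", "t"): 3}

def best_pair(corpus):
    # count adjacent symbol pairs, weighted by word frequency
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(corpus, pair):
    # rewrite every word with the chosen pair fused into a single symbol
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

for _ in range(3):
    pair = best_pair(corpus)
    print("merging", pair)  # the first merge here is ('w', 'e'), the most frequent pair
    corpus = merge(corpus, pair)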
WordPiece (Used in BERT)
WordPiece is a subword tokenization algorithm used in BERT and related models. It is similar to BPE, but instead of merging the most frequent pair it chooses merges that maximize the likelihood of the training data. WordPiece handles out-of-vocabulary words effectively by breaking them into known subwords.
from transformers import BertTokenizer
# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Tokenize a sentence
text = "WordPiece handles subwords like unpredictably."
tokens = tokenizer.tokenize(text)
print(tokens)
['word', '##piece', 'handles', 'sub', '##words', 'like', 'un', '##pre', '##dict', '##ably', '.']
SentencePiece (Used in XLNet)
SentencePiece is a subword tokenization algorithm that treats the input text as a sequence of Unicode characters and does not rely on whitespace for tokenization. It is used in models like XLNet and T5. SentencePiece can operate in two modes:
- Unigram: Probabilistically selects subwords.
- BPE: Similar to Byte-Pair Encoding.
from sentencepiece import SentencePieceProcessor
# Load a pre-trained SentencePiece model (the file name below is a placeholder for your own trained model)
spm = SentencePieceProcessor(model_file="spm_model.model")
# Tokenize a sentence
text = "SentencePiece is language-agnostic."
tokens = spm.encode_as_pieces(text)
print(tokens)
['▁SentencePiece', '▁is', '▁language', '-', 'agnostic', '.']
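If no local .model file is at hand, a pre-trained SentencePiece vocabulary can also be loaded through Hugging Face; this sketch assumes the t5-small checkpoint (my choice, not mentioned above), whose tokenizer is SentencePiece-based:
from transformers import AutoTokenizer

# Downloads T5's SentencePiece-based tokenizer from the Hugging Face Hub
t5_tokenizer = AutoTokenizer.from_pretrained("t5-small")
print(t5_tokenizer.tokenize("SentencePiece is language-agnostic."))  # pieces carry the ▁ word-boundary marker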
Challenges in Tokenization
Language-Specific Issues
- Agglutinative languages (e.g., Turkish)
- No space between words (e.g., Chinese)
Ambiguity in Boundaries
- “gumball” vs “gum ball”
- “N.Y.” vs “New York”
Handling Special Cases
- URLs and email addresses
- Hashtags and mentions
- Emojis and unicode characters
import spacy
nlp = spacy.blank("en") # Lightweight tokenizer without a full model
text = "Check https://example.com & email@domain.com 😀 #NLP @user"
doc = nlp(text)
print([token.text for token in doc])
['Check', 'https://example.com', '&', 'email@domain.com', '😀', '#', 'NLP', '@user']
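Continuing with the same doc object, spaCy also exposes per-token lexical flags such as like_url and like_email, which help when filtering these special tokens downstream (a small sketch using standard spaCy attributes):
for token in doc:
    if token.like_url or token.like_email:
        print(token.text, "->", "URL" if token.like_url else "email")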
Choosing the Right Tokenizer
Considerations:
- Language requirements
- Domain-specific vocabulary
- Downstream tasks
- Computational efficiency
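To tie these considerations together, here is a short comparison sketch (the sample sentence and library choices are mine) that runs the same text through three of the tokenizers shown above:
import spacy
from transformers import AutoTokenizer

text = "Tokenization isn't one-size-fits-all."

whitespace_tokens = text.split()                          # fastest; punctuation stays attached
spacy_tokens = [t.text for t in spacy.blank("en")(text)]  # rule-based and language-aware
bert_tokens = AutoTokenizer.from_pretrained("bert-base-uncased").tokenize(text)  # subword with a fixed vocabulary

print("whitespace:", whitespace_tokens)
print("spaCy:", spacy_tokens)
print("BERT:", bert_tokens)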