Stemming in NLP
Introduction to Stemming
Stemming is a linguistic normalization process that reduces inflectional/derivational forms to a common base form through heuristic rules. Formally, for a word \( w \), stemming produces a stem \( s \) such that:
\[ \text{stem}(w) = s \quad \text{where} \quad s \in \text{Stems}(L) \]
where \( L \) is the target language.
Mathematical Foundation
Stemming algorithms typically employ suffix-stripping operations based on pattern matching. The general form can be expressed as:
\[ \text{stem}(w) = \begin{cases} w - \delta & \text{if } \exists \delta \in D \text{ where } w = \alpha\delta \\ w & \text{otherwise} \end{cases} \]
Where:
- \( D \): Set of known suffixes
- \( \delta \): Suffix to remove
- \( \alpha \): Remaining stem candidate
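The piecewise rule above can be sketched in a few lines of Python. The suffix set \( D \) below is a toy assumption for illustration, not any real stemmer's rule table:

```python
# Toy suffix set D -- an illustrative assumption, not a real stemmer's table
D = ["sses", "ness", "ing", "ed", "s"]

def strip_suffix(word: str) -> str:
    """Return w - delta for the longest matching suffix delta in D, else w."""
    for delta in sorted(D, key=len, reverse=True):
        if word.endswith(delta) and len(word) > len(delta):
            return word[: -len(delta)]
    return word

print(strip_suffix("walking"))    # → walk
print(strip_suffix("happiness"))  # → happi
print(strip_suffix("the"))        # → the (no suffix matches)
```

Matching the longest suffix first matters: otherwise "classes" would lose only its final "s" instead of "sses".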
Types of Stemming Algorithms
1. Porter Stemmer (Algorithm Details)
The most widely used English stemmer. It applies five steps of suffix-rewriting rules in sequence.
Algorithm Phases
- Step 1a: Plural removal (e.g., “sses” → “ss”)
- Step 1b: Past-participle and “-ing” handling
- Step 2: Double-suffix reduction (e.g., “ization” → “ize”)
- Step 3: “-icate”, “-ful”, “-ness”, etc.
- Step 4: Removal of remaining suffixes (e.g., “-ment”, “-ance”)
- Step 5: Final cleanup (trailing “e”, double consonants)
Rule Structure
Each rule follows the format: \[ (\text{condition}) \quad S_1 \rightarrow S_2 \] Where:
- \( S_1 \): Suffix pattern to match
- \( S_2 \): Replacement suffix
- Condition: Constraints on stem length/vowel presence
Example from Step 1b: \[ (m > 0) \quad \text{eed} \rightarrow \text{ee} \] where the measure \( m \) counts the vowel–consonant sequences in the candidate stem (so “agreed” → “agree”, but “feed” is left unchanged).
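The measure condition can be computed directly from a word's consonant/vowel pattern (with “y” treated as a vowel when it follows a consonant). A minimal sketch:

```python
def is_vowel(word: str, i: int) -> bool:
    """a/e/i/o/u are vowels; 'y' is a vowel when it follows a consonant."""
    if word[i] in "aeiou":
        return True
    return word[i] == "y" and i > 0 and not is_vowel(word, i - 1)

def measure(stem: str) -> int:
    """Porter's m: the number of V-to-C transitions in the stem's CV pattern."""
    pattern = ["V" if is_vowel(stem, i) else "C" for i in range(len(stem))]
    return sum(1 for i in range(1, len(pattern))
               if pattern[i - 1] == "V" and pattern[i] == "C")

# Examples from Porter's original paper:
print(measure("tree"))     # → 0
print(measure("trouble"))  # → 1
print(measure("oaten"))    # → 2
```

A rule like \( (m > 0)\ \text{eed} \rightarrow \text{ee} \) then simply checks `measure(stem) > 0` on the part of the word before the suffix.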
Working Example
For “happiness”:
- Step 3 removes “ness” (condition \( m > 0 \)): “happiness” → “happi”
- No later rule applies, so the final stem is “happi” — stems need not be dictionary words
2. Lancaster Stemmer (Aggressive Approach)
Uses an iterative rule table with a priority weighting system. It is more aggressive than Porter because rules are re-applied until none match:
\[ \text{stem}(w) = f^n(w) \] Where \( f \) = rule application function, applied iteratively \( n \) times until no more rules match.
Rule Format
Each rule has: \[ \text{priority} : \text{pattern} \rightarrow \text{replacement} \]
Example rule: \[ 5 : \text{“ment”} \rightarrow \text{""} \] (Remove “ment” with priority 5)
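Priority-ordered, iterative rule application in this style can be sketched as follows; the three rules and the minimum-length guard below are illustrative assumptions, not the real Lancaster table:

```python
# Illustrative (priority, suffix, replacement) rules -- not the real Lancaster table
RULES = [(5, "ment", ""), (4, "ing", ""), (3, "s", "")]

def iterative_stem(word: str, min_len: int = 3) -> str:
    """Apply the highest-priority matching rule, then repeat until none match."""
    while True:
        for _, suffix, repl in sorted(RULES, key=lambda r: r[0], reverse=True):
            if word.endswith(suffix):
                candidate = word[: len(word) - len(suffix)] + repl
                if len(candidate) >= min_len:
                    word = candidate
                    break
        else:
            # No rule fired on this pass: stemming is complete
            return word

print(iterative_stem("shipments"))  # → ship  ("s" removed, then "ment")
```

The length guard is what keeps an aggressive iterative stemmer from reducing short words to nothing.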
Overstemming Example
Overstemming occurs when too much is stripped and distinct words collapse to one stem: Lancaster reduces “maximum” to “maxim”, colliding with the unrelated word “maxim” (a short saying).
3. Snowball Stemmer (Multilingual)
Generalization of Porter algorithm supporting multiple languages through Snowball scripting language.
For language \( L \), stemmer defined as: \[ S_L = \Phi(L) \] Where \( \Phi \) is the Snowball compiler generating language-specific stemmers.
French Example
For “mangeaient” (“they were eating”):
- Strip the third-person-plural ending: “mangeaient” → “mangeait”
- Strip the imperfect-tense marker: “mangeait” → “mange”
- Strip the final vowel: final stem “mang”
Algorithm Comparison
| Feature | Porter | Lancaster | Snowball |
|---|---|---|---|
| Rule Type | Context-sensitive | Weighted rules | Language-specific |
| Aggressiveness | Moderate | High | Configurable |
| Language Support | English | English | 15+ languages |
| Rule Application | Sequential | Iterative | Script-defined |
Detailed Code Implementation
Porter Stemmer with Context Checks
```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()
complex_words = ["consignment", "consigned", "consigning"]

print("Porter Stemming with Context Awareness:")
for word in complex_words:
    stem = porter.stem(word)
    print(f"{word.ljust(12)} → {stem} (Length: {len(stem)})")
```

Output:

```
Porter Stemming with Context Awareness:
consignment  → consign (Length: 7)
consigned    → consign (Length: 7)
consigning   → consign (Length: 7)
```
Lancaster Weighted Rules
```python
from nltk.stem.lancaster import LancasterStemmer

lancaster = LancasterStemmer()
words = ["maximum", "presumably", "provision"]

print("Lancaster Aggressive Stemming:")
for word in words:
    stem = lancaster.stem(word)
    print(f"{word.ljust(12)} → {stem}")
```

Output:

```
Lancaster Aggressive Stemming:
maximum      → maxim
presumably   → presum
provision    → provid
```
Snowball Stemmer
```python
from nltk.stem import SnowballStemmer

snowball_en = SnowballStemmer('english')
print(snowball_en.stem("running"))  # → run

# French
snowball_fr = SnowballStemmer('french')
print(snowball_fr.stem("manger"))   # → mang
```

Output:

```
run
mang
```
Quantitative Evaluation
Stemming performance can be measured using:
\[ \text{Accuracy} = \frac{\text{Correct Stems}}{\text{Total Words}} \] \[ \text{Overstemming Rate} = \frac{\text{Overstems}}{\text{Total Stems}} \] \[ \text{Understemming Rate} = \frac{\text{Understems}}{\text{Total Stems}} \]
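These rates are straightforward to compute from a labeled sample. The gold stems and the length-based error heuristic below are illustrative assumptions, not a standard benchmark:

```python
# Illustrative sample: (word, produced_stem, gold_stem) -- assumed toy data
sample = [
    ("running",   "run",     "run"),        # correct
    ("universal", "univers", "universal"),  # overstemmed (too much removed)
    ("alumnus",   "alumnus", "alumn"),      # understemmed (too little removed)
    ("cats",      "cat",     "cat"),        # correct
]

correct = sum(1 for _, s, g in sample if s == g)
over    = sum(1 for _, s, g in sample if len(s) < len(g))  # stripped too much
under   = sum(1 for _, s, g in sample if len(s) > len(g))  # stripped too little

total = len(sample)
print(f"Accuracy:           {correct / total:.2f}")  # → 0.50
print(f"Overstemming rate:  {over / total:.2f}")     # → 0.25
print(f"Understemming rate: {under / total:.2f}")    # → 0.25
```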
Limitations and Solutions
Overstemming/Understemming
Formal definition of errors, writing \( w_1 \sim w_2 \) for “morphologically related”: \[ \text{Overstemming: } \exists\, w_1 \not\sim w_2 \text{ such that } \text{stem}(w_1) = \text{stem}(w_2) \] \[ \text{Understemming: } \exists\, w_1 \sim w_2 \text{ such that } \text{stem}(w_1) \neq \text{stem}(w_2) \]
Hybrid Approach
Combine stemming with lemmatization: \[ \text{processed}(w) = \begin{cases} \text{stem}(w) & \text{if } len(w) > 7 \\ \text{lemma}(w) & \text{otherwise} \end{cases} \]
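A sketch of this hybrid rule, using a toy dictionary lookup in place of a real lemmatizer and a simple suffix strip in place of a full stemmer (both are assumptions for illustration):

```python
# Toy components -- assumptions for illustration, not production NLP tools
LEMMAS = {"ran": "run", "mice": "mouse", "was": "be"}  # hypothetical lookup

def stem(word: str) -> str:
    """Crude suffix strip standing in for a real stemmer."""
    for suffix in ("ness", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def lemma(word: str) -> str:
    """Dictionary lookup standing in for a real lemmatizer."""
    return LEMMAS.get(word, word)

def processed(word: str) -> str:
    """Stem long words; lemmatize short ones, per the rule above."""
    return stem(word) if len(word) > 7 else lemma(word)

print(processed("happiness"))  # len 9 > 7 → stem → happi
print(processed("mice"))       # len 4 → lemma → mouse
```

The length threshold routes long, regularly inflected words to the cheap stemmer and short, often irregular words to the more precise lemmatizer.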
Advanced Topics
Finite State Stemmers
For morphologically rich languages like Arabic:
\[ \mathcal{A} = (Q, \Sigma, \delta, q_0, F) \] Where:
- \( Q \): Finite set of states
- \( \Sigma \): Input alphabet
- \( \delta \): Transition function
- \( q_0 \): Initial state
- \( F \): Final states
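A minimal finite-state sketch: a tiny automaton that accepts words ending in “ing” by reading the input in reverse. The state names and transition table are illustrative assumptions:

```python
# Tiny DFA over reversed input that accepts words ending in "ing".
# States: q0 (start) -> q1 (saw 'g') -> q2 (saw 'ng') -> q3 (saw 'ing', accepting)
def accepts_ing(word: str) -> bool:
    state = "q0"
    transitions = {("q0", "g"): "q1", ("q1", "n"): "q2", ("q2", "i"): "q3"}
    for ch in reversed(word):
        state = transitions.get((state, ch))
        if state is None:       # no transition: reject
            return False
        if state == "q3":       # reached the final state F = {q3}
            return True
    return False

print(accepts_ing("running"))  # → True
print(accepts_ing("rind"))     # → False
```

A production finite-state stemmer composes many such machines as transducers that also emit the stem, which is what makes the approach tractable for morphologically rich languages.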
Statistical Stemming
Using co-occurrence probabilities: \[ P(s|w) = \frac{\text{count}(w, s)}{\text{count}(w)} \] Where \( s \) is the candidate stem for word \( w \).
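Estimating \( P(s|w) \) from counts, assuming a toy table of (word, candidate stem) observations:

```python
from collections import Counter

# Illustrative (word, candidate_stem) observations -- assumed toy data
observations = [
    ("running", "run"), ("running", "run"), ("running", "runn"),
    ("jumped", "jump"), ("jumped", "jump"),
]

pair_counts = Counter(observations)
word_counts = Counter(w for w, _ in observations)

def p_stem_given_word(s: str, w: str) -> float:
    """P(s|w) = count(w, s) / count(w)."""
    return pair_counts[(w, s)] / word_counts[w]

print(p_stem_given_word("run", "running"))  # 2 of 3 observations → ~0.667
print(p_stem_given_word("jump", "jumped"))  # 2 of 2 observations → 1.0
```

A statistical stemmer would then pick the candidate \( s \) maximizing this probability for each word.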
Conclusion
Modern stemming combines rule-based and statistical approaches. While Porter remains popular for English, Snowball provides multilingual support, and Lancaster offers aggressive compression. Choice depends on:
- Language complexity
- Required precision
- Processing constraints
Future directions incorporate neural networks for contextual stemming: \[ \text{stem}(w) = \underset{s}{\text{argmax}} \ P_\theta(s|w, C) \] Where \( C \) is the context window.