Stemming in NLP
Introduction to Stemming
Stemming is a linguistic normalization process that reduces inflectional/derivational forms to a common base form through heuristic rules. Formally, for a word \( w \), stemming produces a stem \( s \) such that:
\[ \text{stem}(w) = s \quad \text{where} \quad s \in \text{Stems}(L) \]
where \( L \) is the target language.
Mathematical Foundation
Stemming algorithms typically employ suffix-stripping operations based on pattern matching. The general form can be expressed as:
\[ \text{stem}(w) = \begin{cases} w - \delta & \text{if } \exists \delta \in D \text{ where } w = \alpha\delta \\ w & \text{otherwise} \end{cases} \]
Where:
- \( D \): Set of known suffixes
- \( \delta \): Suffix to remove
- \( \alpha \): Remaining stem candidate
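The piecewise rule above can be sketched in a few lines of Python. The suffix set \( D \) below is a toy assumption for illustration, not any real stemmer's rule table:

```python
# Toy suffix set D -- an illustrative assumption, not a real stemmer's table
D = ["sses", "ness", "ing", "ed", "s"]

def strip_suffix(word: str) -> str:
    """Return w - delta for the longest matching suffix delta in D, else w."""
    for delta in sorted(D, key=len, reverse=True):
        if word.endswith(delta) and len(word) > len(delta):
            return word[: -len(delta)]
    return word

print(strip_suffix("walking"))    # → walk
print(strip_suffix("happiness"))  # → happi
print(strip_suffix("the"))        # → the (no suffix matches)
```

Matching the longest suffix first matters: otherwise "classes" would lose only its final "s" instead of "sses".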
Types of Stemming Algorithms
1. Porter Stemmer (Algorithm Details)
The most widely used English stemmer. It applies five steps of suffix-rewriting rules in sequence.
Algorithm Phases
- Step 1a: Plural removal (e.g., “sses” → “ss”)
- Step 1b: Past-participle and “-ing” handling
- Step 2: Double-suffix reduction (e.g., “ization” → “ize”)
- Step 3: “-icate”, “-ful”, “-ness”, etc.
- Step 4: Removal of remaining suffixes (e.g., “-ment”, “-ance”)
- Step 5: Final cleanup (trailing “e”, double consonants)
Rule Structure
Each rule follows the format: \[ (\text{condition}) \quad S_1 \rightarrow S_2 \] Where:
- \( S_1 \): Suffix pattern to match
- \( S_2 \): Replacement suffix
- Condition: Constraints on stem length/vowel presence
Example from Step 1b: \[ (m > 0) \quad \text{eed} \rightarrow \text{ee} \] where the measure \( m \) counts the vowel–consonant sequences in the candidate stem (so “agreed” → “agree”, but “feed” is left unchanged).
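The measure condition can be computed directly from a word's consonant/vowel pattern (with “y” treated as a vowel when it follows a consonant). A minimal sketch:

```python
def is_vowel(word: str, i: int) -> bool:
    """a/e/i/o/u are vowels; 'y' is a vowel when it follows a consonant."""
    if word[i] in "aeiou":
        return True
    return word[i] == "y" and i > 0 and not is_vowel(word, i - 1)

def measure(stem: str) -> int:
    """Porter's m: the number of V-to-C transitions in the stem's CV pattern."""
    pattern = ["V" if is_vowel(stem, i) else "C" for i in range(len(stem))]
    return sum(1 for i in range(1, len(pattern))
               if pattern[i - 1] == "V" and pattern[i] == "C")

# Examples from Porter's original paper:
print(measure("tree"))     # → 0
print(measure("trouble"))  # → 1
print(measure("oaten"))    # → 2
```

A rule like \( (m > 0)\ \text{eed} \rightarrow \text{ee} \) then simply checks `measure(stem) > 0` on the part of the word before the suffix.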
Working Example
For “happiness”:
- Step 3 removes “ness” (condition \( m > 0 \)): “happiness” → “happi”
- No later rule applies, so the final stem is “happi” — stems need not be dictionary words
2. Lancaster Stemmer (Aggressive Approach)
Uses an iterative rule table with a priority weighting system. It is more aggressive than Porter because rules are re-applied until none match:
\[ \text{stem}(w) = f^n(w) \] Where \( f \) = rule application function, applied iteratively \( n \) times until no more rules match.
Rule Format
Each rule has: \[ \text{priority} : \text{pattern} \rightarrow \text{replacement} \]
Example rule: \[ 5 : \text{“ment”} \rightarrow \text{""} \] (Remove “ment” with priority 5)
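Priority-ordered, iterative rule application in this style can be sketched as follows; the three rules and the minimum-length guard below are illustrative assumptions, not the real Lancaster table:

```python
# Illustrative (priority, suffix, replacement) rules -- not the real Lancaster table
RULES = [(5, "ment", ""), (4, "ing", ""), (3, "s", "")]

def iterative_stem(word: str, min_len: int = 3) -> str:
    """Apply the highest-priority matching rule, then repeat until none match."""
    while True:
        for _, suffix, repl in sorted(RULES, key=lambda r: r[0], reverse=True):
            if word.endswith(suffix):
                candidate = word[: len(word) - len(suffix)] + repl
                if len(candidate) >= min_len:
                    word = candidate
                    break
        else:
            # No rule fired on this pass: stemming is complete
            return word

print(iterative_stem("shipments"))  # → ship  ("s" removed, then "ment")
```

The length guard is what keeps an aggressive iterative stemmer from reducing short words to nothing.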
Overstemming Example
Overstemming occurs when too much is stripped and distinct words collapse to one stem: Lancaster reduces “maximum” to “maxim”, colliding with the unrelated word “maxim” (a short saying).
3. Snowball Stemmer (Multilingual)
Generalization of Porter algorithm supporting multiple languages through Snowball scripting language.
For language \( L \), stemmer defined as: \[ S_L = \Phi(L) \] Where \( \Phi \) is the Snowball compiler generating language-specific stemmers.
French Example
For “mangeaient” (“they were eating”):
- Strip the third-person-plural ending: “mangeaient” → “mangeait”
- Strip the imperfect-tense marker: “mangeait” → “mange”
- Strip the final vowel: final stem “mang”
Algorithm Comparison
| Feature | Porter | Lancaster | Snowball |
|---|---|---|---|
| Rule Type | Context-sensitive | Weighted rules | Language-specific |
| Aggressiveness | Moderate | High | Configurable |
| Language Support | English | English | 15+ languages |
| Rule Application | Sequential | Iterative | Script-defined |
Detailed Code Implementation
Porter Stemmer with Context Checks
```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()
complex_words = ["consignment", "consigned", "consigning"]

print("Porter Stemming with Context Awareness:")
for word in complex_words:
    stem = porter.stem(word)
    print(f"{word.ljust(12)} → {stem} (Length: {len(stem)})")
```

Output:

```
Porter Stemming with Context Awareness:
consignment  → consign (Length: 7)
consigned    → consign (Length: 7)
consigning   → consign (Length: 7)
```
Lancaster Weighted Rules
```python
from nltk.stem.lancaster import LancasterStemmer

lancaster = LancasterStemmer()
words = ["maximum", "presumably", "provision"]

print("Lancaster Aggressive Stemming:")
for word in words:
    stem = lancaster.stem(word)
    print(f"{word.ljust(12)} → {stem}")
```

Output:

```
Lancaster Aggressive Stemming:
maximum      → maxim
presumably   → presum
provision    → provid
```
Snowball Stemmer
```python
from nltk.stem import SnowballStemmer

snowball_en = SnowballStemmer('english')
print(snowball_en.stem("running"))  # → run

# French
snowball_fr = SnowballStemmer('french')
print(snowball_fr.stem("manger"))   # → mang
```

Output:

```
run
mang
```
Quantitative Evaluation
Stemming performance can be measured using:
\[ \text{Accuracy} = \frac{\text{Correct Stems}}{\text{Total Words}} \] \[ \text{Overstemming Rate} = \frac{\text{Overstems}}{\text{Total Stems}} \] \[ \text{Understemming Rate} = \frac{\text{Understems}}{\text{Total Stems}} \]
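These rates are straightforward to compute from a labeled sample. The gold stems and the length-based error heuristic below are illustrative assumptions, not a standard benchmark:

```python
# Illustrative sample: (word, produced_stem, gold_stem) -- assumed toy data
sample = [
    ("running",   "run",     "run"),        # correct
    ("universal", "univers", "universal"),  # overstemmed (too much removed)
    ("alumnus",   "alumnus", "alumn"),      # understemmed (too little removed)
    ("cats",      "cat",     "cat"),        # correct
]

correct = sum(1 for _, s, g in sample if s == g)
over    = sum(1 for _, s, g in sample if len(s) < len(g))  # stripped too much
under   = sum(1 for _, s, g in sample if len(s) > len(g))  # stripped too little

total = len(sample)
print(f"Accuracy:           {correct / total:.2f}")  # → 0.50
print(f"Overstemming rate:  {over / total:.2f}")     # → 0.25
print(f"Understemming rate: {under / total:.2f}")    # → 0.25
```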
Limitations and Solutions
Overstemming/Understemming
Formal definition of errors, writing \( w_1 \sim w_2 \) for “morphologically related”: \[ \text{Overstemming: } \exists\, w_1 \not\sim w_2 \text{ such that } \text{stem}(w_1) = \text{stem}(w_2) \] \[ \text{Understemming: } \exists\, w_1 \sim w_2 \text{ such that } \text{stem}(w_1) \neq \text{stem}(w_2) \]
Hybrid Approach
Combine stemming with lemmatization: \[ \text{processed}(w) = \begin{cases} \text{stem}(w) & \text{if } len(w) > 7 \\ \text{lemma}(w) & \text{otherwise} \end{cases} \]
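A sketch of this hybrid rule, using a toy dictionary lookup in place of a real lemmatizer and a simple suffix strip in place of a full stemmer (both are assumptions for illustration):

```python
# Toy components -- assumptions for illustration, not production NLP tools
LEMMAS = {"ran": "run", "mice": "mouse", "was": "be"}  # hypothetical lookup

def stem(word: str) -> str:
    """Crude suffix strip standing in for a real stemmer."""
    for suffix in ("ness", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def lemma(word: str) -> str:
    """Dictionary lookup standing in for a real lemmatizer."""
    return LEMMAS.get(word, word)

def processed(word: str) -> str:
    """Stem long words; lemmatize short ones, per the rule above."""
    return stem(word) if len(word) > 7 else lemma(word)

print(processed("happiness"))  # len 9 > 7 → stem → happi
print(processed("mice"))       # len 4 → lemma → mouse
```

The length threshold routes long, regularly inflected words to the cheap stemmer and short, often irregular words to the more precise lemmatizer.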
Advanced Topics
Finite State Stemmers
For morphologically rich languages like Arabic:
\[ \mathcal{A} = (Q, \Sigma, \delta, q_0, F) \] Where:
- \( Q \): Finite set of states
- \( \Sigma \): Input alphabet
- \( \delta \): Transition function
- \( q_0 \): Initial state
- \( F \): Final states
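A minimal finite-state sketch: a tiny automaton that accepts words ending in “ing” by reading the input in reverse. The state names and transition table are illustrative assumptions:

```python
# Tiny DFA over reversed input that accepts words ending in "ing".
# States: q0 (start) -> q1 (saw 'g') -> q2 (saw 'ng') -> q3 (saw 'ing', accepting)
def accepts_ing(word: str) -> bool:
    state = "q0"
    transitions = {("q0", "g"): "q1", ("q1", "n"): "q2", ("q2", "i"): "q3"}
    for ch in reversed(word):
        state = transitions.get((state, ch))
        if state is None:       # no transition: reject
            return False
        if state == "q3":       # reached the final state F = {q3}
            return True
    return False

print(accepts_ing("running"))  # → True
print(accepts_ing("rind"))     # → False
```

A production finite-state stemmer composes many such machines as transducers that also emit the stem, which is what makes the approach tractable for morphologically rich languages.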
Statistical Stemming
Using co-occurrence probabilities: \[ P(s|w) = \frac{\text{count}(w, s)}{\text{count}(w)} \] Where \( s \) is the candidate stem for word \( w \).
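Estimating \( P(s|w) \) from counts, assuming a toy table of (word, candidate stem) observations:

```python
from collections import Counter

# Illustrative (word, candidate_stem) observations -- assumed toy data
observations = [
    ("running", "run"), ("running", "run"), ("running", "runn"),
    ("jumped", "jump"), ("jumped", "jump"),
]

pair_counts = Counter(observations)
word_counts = Counter(w for w, _ in observations)

def p_stem_given_word(s: str, w: str) -> float:
    """P(s|w) = count(w, s) / count(w)."""
    return pair_counts[(w, s)] / word_counts[w]

print(p_stem_given_word("run", "running"))  # 2 of 3 observations → ~0.667
print(p_stem_given_word("jump", "jumped"))  # 2 of 2 observations → 1.0
```

A statistical stemmer would then pick the candidate \( s \) maximizing this probability for each word.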
Conclusion
Modern stemming combines rule-based and statistical approaches. While Porter remains popular for English, Snowball provides multilingual support, and Lancaster offers aggressive compression. Choice depends on:
- Language complexity
- Required precision
- Processing constraints
Future directions incorporate neural networks for contextual stemming: \[ \text{stem}(w) = \underset{s}{\text{argmax}} \ P_\theta(s|w, C) \] Where \( C \) is the context window.