Lemmatization in NLP
Introduction to Lemmatization
Lemmatization is the process of reducing words to their base dictionary form (lemma) through morphological analysis. Unlike stemming, it considers:
- Part-of-speech (POS) tagging
- Morphological structure
- Semantic context
Formally defined as: \[ \text{lemma}(w, p) = l \quad \text{where} \begin{cases} w \in \text{Words}(L) \\ p \in \text{POS\_Tags} \\ l \in \text{Lexicon}(L) \end{cases} \]
Mathematical Foundation
Lemma Transformation
For word \( w \) with POS tag \( p \):
\[ \text{lemmatize}(w, p) = \underset{l \in \mathcal{L}}{\text{argmax}} \ P(l|w, p) \]
Where \( \mathcal{L} \) is the set of possible lemmas.
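The argmax above can be made concrete with a toy candidate table. The candidate sets and probabilities below are invented purely for illustration; a real lemmatizer would estimate them from annotated corpora:

```python
# Toy illustration of lemma selection as an argmax over candidates.
# Probabilities here are invented for illustration only.
CANDIDATES = {
    ("saw", "VERB"): {"see": 0.9, "saw": 0.1},
    ("saw", "NOUN"): {"saw": 0.95, "see": 0.05},
}

def lemmatize(word, pos):
    """Return argmax over l of P(l | w, p) for the known candidate set."""
    dist = CANDIDATES.get((word, pos), {word: 1.0})  # unknown words map to themselves
    return max(dist, key=dist.get)

print(lemmatize("saw", "VERB"))  # see
print(lemmatize("saw", "NOUN"))  # saw
```

Note how the same surface form "saw" resolves to different lemmas depending on the POS tag, which is exactly why lemmatization conditions on \( p \).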
Contextual Disambiguation
Neural lemmatizers use:
\[ P(l|w, C) = \prod_{i=1}^n P(l_i|w_i, C_i) \]
Where \( C \) is the context window.
Lemmatization vs Stemming
| Feature | Lemmatization | Stemming |
|---|---|---|
| Basis | Dictionary lookup | Heuristic rules |
| Output | Valid lemma | Potential non-word |
| POS awareness | Required | Ignored |
| Accuracy | Higher | Lower |
| Speed | Slower (often 10-100x) | Faster |
| Complexity | O(log n) dictionary lookup | O(1) rule application |
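The first two rows of the table can be demonstrated with two toy implementations: a heuristic suffix-stripper (stemming) against a dictionary lookup (lemmatization). Both are minimal sketches, not NLTK's algorithms:

```python
# Toy contrast: heuristic stemming vs. dictionary-based lemmatization.
# LEMMA_DICT entries are illustrative, not a real lexicon.
LEMMA_DICT = {"studies": "study", "better": "good", "leaves": "leaf"}

def toy_stem(word):
    # Heuristic rules: blindly strip common suffixes; may yield non-words.
    for suffix in ("ies", "es", "s", "ing"):
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    # Dictionary lookup: returns a valid lemma whenever the word is known.
    return LEMMA_DICT.get(word, word)

print(toy_stem("studies"), toy_lemmatize("studies"))  # stud study
```

The stemmer produces the non-word "stud", while the lookup returns the valid lemma "study", matching the "Output" row of the table.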
Lemmatization Approaches
1. Rule-Based Lemmatization
Combine morphological rules with exception dictionaries:
\[ l = \begin{cases} \text{exception\_dict}[w] & \text{if } w \in \mathcal{D} \\ \text{apply\_rules}(w, p) & \text{otherwise} \end{cases} \]
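A direct sketch of this piecewise definition: consult the exception dictionary first, then fall back to suffix rules. All entries below are illustrative:

```python
# Rule-based lemmatization: exception dictionary first, suffix rules second.
# Dictionary and rule entries are illustrative examples.
EXCEPTIONS = {"went": "go", "feet": "foot", "mice": "mouse"}
RULES = {"VERB": [("ing", ""), ("ed", "")], "NOUN": [("s", "")]}

def rule_lemmatize(word, pos):
    if word in EXCEPTIONS:                    # the "w in D" branch
        return EXCEPTIONS[word]
    for suffix, repl in RULES.get(pos, []):   # the "apply_rules(w, p)" branch
        if word.endswith(suffix):
            return word[:-len(suffix)] + repl
    return word

print(rule_lemmatize("went", "VERB"))     # go
print(rule_lemmatize("walking", "VERB"))  # walk
```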
2. Statistical Lemmatization
Use Hidden Markov Models (HMMs):
\[ \hat{l} = \underset{l}{\text{argmax}} \ P(p|l)P(l|w) \]
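A toy sketch of this scoring scheme: choose the lemma that maximizes \( P(p|l)\,P(l|w) \) over a small candidate set. All probability tables are invented for illustration:

```python
# Toy noisy-channel scoring: argmax over l of P(p | l) * P(l | w).
# Probability tables are invented for illustration only.
P_POS_GIVEN_L = {"see": {"VERB": 0.9}, "saw": {"NOUN": 0.7, "VERB": 0.1}}
P_L_GIVEN_W = {"saw": {"see": 0.5, "saw": 0.5}}

def hmm_lemmatize(word, pos):
    cands = P_L_GIVEN_W.get(word, {word: 1.0})
    # Score each candidate lemma by P(p | l) * P(l | w)
    return max(cands, key=lambda l: P_POS_GIVEN_L.get(l, {}).get(pos, 0.0) * cands[l])

print(hmm_lemmatize("saw", "VERB"))  # see
print(hmm_lemmatize("saw", "NOUN"))  # saw
```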
3. Neural Lemmatization
Transformer-based approach:
\[ \text{Lemma} = \text{Decoder}(\text{Encoder}(w, C)) \]
Implementation in Python
NLTK WordNet Lemmatizer
```python
import nltk
nltk.download(['wordnet', 'omw-1.4', 'averaged_perceptron_tagger'])
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
words = ["running", "better", "worst", "leaves"]
pos_tags = ['v', 'a', 'a', 'n']  # WordNet tags: v=verb, a=adjective, n=noun

for word, pos in zip(words, pos_tags):
    lemma = lemmatizer.lemmatize(word, pos=pos)
    print(f"{word} ({pos}) → {lemma}")
```
```
running (v) → run
better (a) → good
worst (a) → bad
leaves (n) → leaf
```
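The single-letter POS tags above were supplied by hand. In practice they usually come from a tagger that emits Penn Treebank tags (e.g. "VBD", "NNS"), which must be mapped to WordNet's tag set before calling the lemmatizer. A common helper uses the first letter of the Penn tag:

```python
# Map a Penn Treebank tag to a WordNet POS tag (n/v/a/r),
# defaulting to noun, which is also WordNetLemmatizer's default.
def penn_to_wordnet(penn_tag):
    return {"J": "a", "V": "v", "N": "n", "R": "r"}.get(penn_tag[0], "n")

print(penn_to_wordnet("VBD"))  # v
print(penn_to_wordnet("NNS"))  # n
```

The result can then be passed straight to the lemmatizer, e.g. `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`.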
spaCy’s Linguistic Lemmatization
```python
import spacy

nlp = spacy.load("en_core_web_sm")
text = "The striped bats were hanging on their feet and ate flies"
doc = nlp(text)

print(f"{'Token':<10} {'Lemma':<10} {'POS':<8}")
print("-" * 30)
for token in doc:
    print(f"{token.text:<10} {token.lemma_:<10} {token.pos_:<8}")
```
```
Token      Lemma      POS
------------------------------
The        the        DET
striped    striped    ADJ
bats       bat        NOUN
were       be         AUX
hanging    hang       VERB
on         on         ADP
their      their      PRON
feet       foot       NOUN
and        and        CCONJ
ate        eat        VERB
flies      fly        NOUN
```
Advanced Techniques
Morphological Analysis
Decompose words using paradigm tables:
\[ \text{lemma} = \text{root} + \text{morpheme\_sequence} \]
For German verb “gegangen”:
\[ \text{ge + gang + en} \rightarrow \text{gehen} \]
Hybrid Approaches
Combine rules with machine learning:
\[ \text{Final Lemma} = \begin{cases} \text{ML\_Lemma} & \text{if } P > 0.95 \\ \text{Rule\_Lemma} & \text{otherwise} \end{cases} \]
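The confidence gate above can be sketched as follows. `ml_lemmatize` is a stub standing in for any statistical model that returns a lemma together with its probability; its scores are made up so the control flow is runnable:

```python
# Confidence-gated hybrid: trust the ML lemma above a threshold,
# otherwise fall back to deterministic rules. Scores are illustrative stubs.
def ml_lemmatize(word):
    scores = {"running": ("run", 0.99), "bass": ("bass", 0.60)}
    return scores.get(word, (word, 0.0))

def rule_fallback(word):
    return word[:-3] if word.endswith("ing") else word

def hybrid_lemmatize(word, threshold=0.95):
    lemma, p = ml_lemmatize(word)
    return lemma if p > threshold else rule_fallback(word)

print(hybrid_lemmatize("running"))  # run  (ML path, p = 0.99 > 0.95)
print(hybrid_lemmatize("bass"))     # bass (falls back to rules)
```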
Challenges in Lemmatization
Language Complexity
For morphologically rich languages (e.g., Finnish):
\[ \text{number\_of\_forms} = \prod_{i=1}^n \text{case\_options}_i \]
Ambiguity Resolution
Requires context modeling:
\[ P(l|w) = \sum_{p \in POS} P(l|w, p)P(p|C) \]
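A numeric sketch of this marginalization for the ambiguous word "leaves" ("leaf" as a noun, "leave" as a verb). The probabilities below are invented for illustration:

```python
# Marginalize over POS: P(l | w) = sum over p of P(l | w, p) * P(p | C).
# All probabilities are invented for illustration.
P_L_GIVEN_WP = {  # P(l | w, p)
    ("leaves", "NOUN"): {"leaf": 1.0},
    ("leaves", "VERB"): {"leave": 1.0},
}
P_P_GIVEN_C = {"NOUN": 0.8, "VERB": 0.2}  # P(p | C) from the context model

def p_lemma(word, lemma):
    return sum(
        P_L_GIVEN_WP.get((word, pos), {}).get(lemma, 0.0) * p_pos
        for pos, p_pos in P_P_GIVEN_C.items()
    )

print(p_lemma("leaves", "leaf"))   # 0.8
print(p_lemma("leaves", "leave"))  # 0.2
```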
Performance Metrics
\[ \text{Accuracy} = \frac{\text{Correct Lemmas}}{\text{Total Words}} \]
\[ \text{Ambiguity Resolution Rate} = 1 - \frac{\text{Multiple Options}}{\text{Total Words}} \]
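Both metrics are straightforward to compute on an evaluation set; here "Multiple Options" counts words whose candidate set still contains more than one lemma after disambiguation. The evaluation data below is a toy example:

```python
# The two metrics above on a toy evaluation set.
def accuracy(pred, gold):
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def ambiguity_resolution_rate(candidate_sets):
    # Fraction of words NOT left with multiple lemma options.
    multiple = sum(1 for s in candidate_sets if len(s) > 1)
    return 1 - multiple / len(candidate_sets)

pred = ["run", "good", "leaf", "leave"]
gold = ["run", "good", "leaf", "leaf"]
print(accuracy(pred, gold))  # 0.75
print(ambiguity_resolution_rate([{"run"}, {"good"}, {"leaf", "leave"}]))
```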
Custom Lemmatizer Class
```python
class AdvancedLemmatizer:
    def __init__(self):
        # Suffix -> replacement rules per POS tag
        self.lemma_rules = {
            'NOUN': {'s': '', 'ses': 's', 'ves': 'f'},
            'VERB': {'ing': '', 'ed': 'e', 's': ''}
        }

    def lemmatize(self, word, pos):
        # Try the longest matching suffix first
        for suffix in sorted(self.lemma_rules.get(pos, {}), key=len, reverse=True):
            if word.endswith(suffix):
                return word[:-len(suffix)] + self.lemma_rules[pos][suffix]
        return word

lem = AdvancedLemmatizer()
print(lem.lemmatize("leaves", "NOUN"))  # leaf
```
Modern Approaches
BERT-style Lemmatization
\[ \text{Lemma} = \underset{l}{\text{argmax}} \ \text{Softmax}(W_h h_{[CLS]}) \]
Where \( h_{[CLS]} \) is the BERT contextual embedding.
Sequence-to-Sequence Models
\[ P(l|w) = \prod_{t=1}^T P(l_t|l_{<t}, w) \]
Implemented using LSTM or Transformer architectures.
When to Use Lemmatization
- Sentiment analysis
- Question answering systems
- Machine translation
- Text generation
- Semantic analysis
Conclusion
Lemmatization provides more linguistically valid results than stemming but requires:
- POS tagging infrastructure
- Language-specific resources
- Computational resources
Modern NLP pipelines (like spaCy) integrate lemmatization with POS tagging for best results. Although lemmatization is often 10-100x slower than stemming, the accuracy gains justify the cost in semantics-focused applications.
Future directions include:
- Zero-shot lemmatization using LLMs
- Cross-lingual lemma transfer
- Morphologically-aware neural models