Stopwords and NER in NLP
Introduction to Text Normalization
Modern NLP pipelines require careful text processing. Two critical components:
- Stop Word Removal - Eliminating common function words
- Named Entity Recognition - Identifying proper nouns and specialized terms
Part 1: Stop Words Analysis
Mathematical Definition
For a document \( D \) with vocabulary \( V \), the stop-word set \( S \) satisfies: \[ S = \{w_i \in V \mid P_{\text{lang}}(w_i) > \theta\} \] where \( P_{\text{lang}}(w_i) \) is the word's relative frequency in a language-level reference corpus and \( \theta \) is the frequency threshold.
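A minimal sketch of this definition in Python, using a toy corpus and an illustrative threshold \( \theta = 0.2 \) (both are assumptions for demonstration, not canonical values):

from collections import Counter

# Toy corpus standing in for a language-level reference corpus
corpus = "the cat sat on the mat and the dog sat on the rug".split()
counts = Counter(corpus)
total = sum(counts.values())

theta = 0.2  # hypothetical relative-frequency threshold
stop_words = {w for w, c in counts.items() if c / total > theta}
print(stop_words)  # {'the'} — "the" appears 4/13 ≈ 0.31 > theta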
TF-IDF Relevance
Stop words have high TF but low IDF: \[ \text{TF-IDF}(w, D) = \underbrace{f_{w,D}}_{\text{High}} \times \underbrace{\log\frac{N}{n_w}}_{\text{Low}} \approx 0 \]
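To see this concretely, here is a minimal sketch with a hypothetical two-document corpus; the ubiquitous word scores zero because its IDF term vanishes:

import math

# Two toy documents; "the" appears in both, "fox" in only one
docs = [["the", "quick", "fox"], ["the", "lazy", "dog"]]
N = len(docs)

def tf_idf(word, doc):
    tf = doc.count(word)                # raw term frequency f_{w,D}
    n_w = sum(word in d for d in docs)  # document frequency
    return tf * math.log(N / n_w)

print(tf_idf("the", docs[0]))  # 1 * log(2/2) = 0.0
print(tf_idf("fox", docs[0]))  # 1 * log(2/1) ≈ 0.693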
Implementation Techniques
import nltk
from nltk.corpus import stopwords
import spacy

nltk.download("stopwords")  # one-time download of NLTK's stop-word lists

text = "The quick brown fox jumps over the lazy dog"

# NLTK approach: filter tokens against the English stop-word list
nltk_stop = set(stopwords.words("english"))
filtered = [w for w in text.split() if w.lower() not in nltk_stop]
print("NLTK:", filtered)

# spaCy approach: list the tokens spaCy flags as stop words
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
spacy_stop = [token.text for token in doc if token.is_stop]
print("spaCy Stop Words:", spacy_stop)
NLTK: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
spaCy Stop Words: ['The', 'over', 'the']
Custom Stop Words
Create domain-specific stop words using Z-score:
\[ z(w) = \frac{f_w - \mu_f}{\sigma_f} > 2.5 \] where \( f_w \) is the word's frequency and \( \mu_f \), \( \sigma_f \) are the mean and standard deviation of word frequencies in the domain corpus; words exceeding the cutoff are candidate domain stop words.
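A minimal sketch of this rule using Python's statistics module; the domain snippet and the use of the population standard deviation are illustrative assumptions:

from collections import Counter
import statistics

# Made-up "legal domain" snippet: "the" dominates the frequency distribution
domain_corpus = ("the " * 10 + "contract party court clause term hereby notice").split()
counts = Counter(domain_corpus)

mu = statistics.mean(counts.values())
sigma = statistics.pstdev(counts.values())  # population standard deviation

custom_stop = {w for w, f in counts.items() if (f - mu) / sigma > 2.5}
print(custom_stop)  # {'the'} — z('the') ≈ 2.65 > 2.5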
When to Keep Stop Words
- Sentiment analysis (“not”, “but”)
- Phrase detection (“New York”)
- Language modeling
Part 2: Named Entity Recognition (NER)
Formal Definition
For a token sequence \( T = (t_1,\dots,t_n) \), find the set of labeled entity spans:
\[ E = \{(t_i,\dots,t_j, c_k) \mid c_k \in C\} \] where \( C = \{\text{PER}, \text{ORG}, \text{LOC},\dots\} \) is the inventory of entity types and \( (t_i,\dots,t_j) \) is a contiguous span of tokens.
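As a concrete example (a sketch with hand-assigned, illustrative span indices), the sentence "Jeff Bezos works at Amazon in Seattle" yields:

# Entity spans as (start_token, end_token, type) triples, 0-indexed, inclusive
tokens = ["Jeff", "Bezos", "works", "at", "Amazon", "in", "Seattle"]
E = {(0, 1, "PER"), (4, 4, "ORG"), (6, 6, "LOC")}

for i, j, label in sorted(E):
    print(" ".join(tokens[i:j + 1]), "->", label)
# Jeff Bezos -> PER, Amazon -> ORG, Seattle -> LOC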
Common Architectures
- Rule-based: Regular expressions + gazetteers (see the sketch after this list)
- Statistical: Conditional random fields (CRFs) over hand-crafted features
\[ P(y|x) = \frac{1}{Z(x)}\exp\left(\sum_i \lambda_i f_i(y,x)\right) \] where \( Z(x) \) is the partition function
- Neural: BiLSTM-CRF
\[ h_t = \text{BiLSTM}(e_t, h_{t-1}) \] \[ P(y|x) = \text{CRF}(\mathbf{H}) \] where \( e_t \) is the embedding of token \( t \) and \( \mathbf{H} = (h_1,\dots,h_n) \) stacks the hidden states
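The rule-based approach is simple enough to sketch directly. Below is a minimal illustration combining a regular expression (for years) with a tiny gazetteer lookup; the word lists and pattern are illustrative assumptions, not a production gazetteer:

import re

# Hypothetical gazetteers; a real system would load curated lists
ORG_GAZETTEER = {"Apple Inc.", "Amazon", "Microsoft Corp"}
LOC_GAZETTEER = {"Paris", "Seattle"}
YEAR_PATTERN = re.compile(r"\b(19|20)\d{2}\b")

def rule_based_ner(text):
    entities = []
    for name in ORG_GAZETTEER:
        if name in text:
            entities.append((name, "ORG"))
    for name in LOC_GAZETTEER:
        if name in text:
            entities.append((name, "LOC"))
    entities += [(m.group(), "DATE") for m in YEAR_PATTERN.finditer(text)]
    return entities

print(rule_based_ner("Apple Inc. plans to open a new store in Paris by 2024."))
# [('Apple Inc.', 'ORG'), ('Paris', 'LOC'), ('2024', 'DATE')]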
Advanced Models
Transformer-based (BERT):
\[ \mathbf{H} = \text{Transformer}(E(w_1),…,E(w_n)) \] \[ y_i = \text{Softmax}(W_c h_i) \]
spaCy Implementation
import spacy
nlp = spacy.load("en_core_web_sm")
text = "Apple Inc. plans to open a new store in Paris by 2024."
doc = nlp(text)
print(f"{'Text':<15} {'Entity':<10} {'Label':<10}")
print("-" * 35)
for ent in doc.ents:
print(f"{ent.text:<15} {ent.label_:<10} {spacy.explain(ent.label_):<30}")
Text            Label      Description
-------------------------------------------------------
Apple Inc.      ORG        Companies, agencies, institutions, etc.
Paris           GPE        Countries, cities, states
2024            DATE       Absolute or relative dates or periods
Custom NER with Transformers
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
inputs = tokenizer("Jeff Bezos works at Amazon in Seattle", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predictions = logits.argmax(dim=-1).squeeze().tolist()  # class id per subword
labels = [model.config.id2label[p] for p in predictions]  # ids -> BIO tags
Word-level labels (after dropping the [CLS]/[SEP] special tokens and merging subword pieces):
['B-PER', 'I-PER', 'O', 'O', 'B-ORG', 'O', 'B-LOC']
Combined Pipeline
Text Processing Workflow
\[ \text{Raw Text} \rightarrow \text{Stop-Word Removal} \rightarrow \text{Lemmatization/NER} \rightarrow \text{Application} \]
Impact on Downstream Tasks
| Task | Stop-Word Removal Effect | NER Benefit |
|---|---|---|
| Search Engine | 85% index size reduction | 40% relevance boost |
| Chatbot | 30% speed increase | 55% intent accuracy |
| Sentiment | 15% accuracy loss | 25% context capture |
Advanced Topics
Cross-Lingual Challenges
- Stop Words: Japanese particles (は, が)
- NER: Chinese person names (毛泽东 → PER)
Context-Aware Stop Words
Dynamic filtering using attention weights, keeping tokens that receive substantial attention from the rest of the sequence: \[ \text{Keep } w_i \text{ if } \sum_j \alpha_{ji} > \tau \]
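A minimal sketch of this idea with Hugging Face Transformers, averaging attention over all layers and heads and keeping tokens whose received attention exceeds \( \tau \) (the model choice and threshold are illustrative assumptions):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The quick brown fox jumps over the lazy dog", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# attentions: one (batch, heads, seq, seq) tensor per layer;
# average over layers and heads, then sum over query positions
# to get the total attention each token receives
att = torch.stack(outputs.attentions).mean(dim=(0, 2)).squeeze(0)  # (seq, seq)
received = att.sum(dim=0)

tau = 1.0  # hypothetical threshold
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
kept = [t for t, a in zip(tokens, received) if a > tau]
print(kept)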
NER Evaluation Metrics
\[ \text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \] where, counting exact (span, type) matches as true positives:
- Precision = True Positives / (True Positives + False Positives)
- Recall = True Positives / (True Positives + False Negatives)
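A minimal sketch of span-level (exact-match) evaluation, treating entities as (start, end, type) triples; the gold and predicted sets below are made up for illustration:

def ner_f1(gold, pred):
    # Exact-match evaluation over (start, end, type) spans
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {(0, 1, "PER"), (4, 4, "ORG"), (6, 6, "LOC")}
pred = {(0, 1, "PER"), (4, 4, "LOC")}  # wrong type for one span
print(ner_f1(gold, pred))  # tp=1, P=0.5, R=1/3 -> F1=0.4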
Code: Custom Pipeline
import spacy

class AdvancedNLPipeline:
    def __init__(self):
        self.nlp = spacy.load("en_core_web_sm")
        self.custom_stop = {"inc", "ltd", "corp"}  # domain-specific additions

    def process(self, text):
        doc = self.nlp(text)
        # Lemmatize while dropping built-in and custom stop words
        filtered = [token.lemma_ for token in doc
                    if not token.is_stop and token.text.lower() not in self.custom_stop]
        entities = [(ent.text, ent.label_) for ent in doc.ents]
        return filtered, entities

pipeline = AdvancedNLPipeline()
text = "Microsoft Corp announced $5B acquisition of GitHub in 2018"
tokens, ents = pipeline.process(text)
print("Filtered:", tokens)
print("Entities:", ents)
Filtered: ['Microsoft', 'announce', '$', '5b', 'acquisition', 'GitHub', '2018']
Entities: [('Microsoft Corp', 'ORG'), ('5B', 'MONEY'), ('GitHub', 'ORG'), ('2018', 'DATE')]
Conclusion
Stop Words Tradeoffs
- Pros: Reduces dimensionality (40-60%), speeds processing
- Cons: May lose contextual signals, language-dependent
NER Challenges
- Ambiguous entities (“Apple” → fruit vs company)
- Emerging entities (new startups, slang)
- Cross-domain generalization
Modern Approaches
- Contextual stop word detection using BERT embeddings
- Few-shot NER with large language models
- Multimodal NER combining text and visual cues