Stopwords and NER in NLP

Introduction to Text Normalization

Modern NLP pipelines require careful text normalization, with two critical components:

  1. Stop Word Removal - Eliminating common function words
  2. Named Entity Recognition - Identifying proper nouns and specialized terms

Part 1: Stop Words Analysis

Mathematical Definition

For document \( D \) with vocabulary \( V \), stop words \( S \) satisfy: \[ S = \{w_i \in V \mid P_{\text{lang}}(w_i) > \theta\} \] where \( P_{\text{lang}}(w_i) \) is the word's relative frequency in a language-wide reference corpus and \( \theta \) is the frequency threshold.
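A minimal sketch of this definition over a toy corpus (the threshold value is illustrative; production systems use curated lists or much larger corpora):

from collections import Counter

def frequency_stop_words(tokens, theta=0.2):
    """Return words whose relative frequency exceeds theta."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w for w, c in counts.items() if c / total > theta}

corpus = "the cat sat on the mat and the dog sat on the log".split()
print(frequency_stop_words(corpus))  # {'the'}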

TF-IDF Relevance

Stop words have high TF but low IDF: \[ \text{TF-IDF}(w, D) = \underbrace{f_{w,D}}_{\text{High}} \times \underbrace{\log\frac{N}{n_w}}_{\text{Low}} \approx 0 \]
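A quick numeric check of this effect, computing the formula directly on a toy corpus:

import math
from collections import Counter

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "the bird flew over the house"]
N = len(docs)
df = Counter(w for d in docs for w in set(d.split()))  # document frequency n_w

tf = Counter(docs[0].split())  # term frequencies f_{w,D} for the first document
for w, f in tf.items():
    print(f"{w}: {f * math.log(N / df[w]):.3f}")
# 'the' occurs in every document, so log(N / n_w) = log(1) = 0 and its
# TF-IDF is exactly 0 despite having the highest term frequency.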

Implementation Techniques

from nltk.corpus import stopwords
import spacy

# NLTK Approach
nltk_stop = set(stopwords.words("english"))
text = "The quick brown fox jumps over the lazy dog"
filtered = [w for w in text.split() if w.lower() not in nltk_stop]
print("NLTK:", filtered)

# spaCy Approach
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
spacy_stop = [token.text for token in doc if token.is_stop]
print("spaCy Stop Words:", spacy_stop)
Output:

NLTK: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
spaCy Stop Words: ['The', 'over', 'the']

Custom Stop Words

Domain-specific stop words can be derived with a z-score filter:

\[ z(w) = \frac{f_w - \mu_f}{\sigma_f} > 2.5 \] where \( \mu_f \) and \( \sigma_f \) are the mean and standard deviation of word frequencies in the domain corpus.
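A minimal sketch of this filter using only the standard library (the corpus and threshold are placeholders):

import statistics
from collections import Counter

def domain_stop_words(tokens, z_threshold=2.5):
    """Flag words whose frequency z-score exceeds the threshold."""
    freq = Counter(tokens)
    mu = statistics.mean(freq.values())
    sigma = statistics.stdev(freq.values())
    return {w for w, f in freq.items() if (f - mu) / sigma > z_threshold}

# On a real domain corpus (e.g. clinical notes), boilerplate terms such as
# "patient" or "report" typically exceed the threshold.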

When to Keep Stop Words

  • Sentiment analysis (“not”, “but”)
  • Phrase detection (“New York”)
  • Language modeling

Part 2: Named Entity Recognition (NER)

Formal Definition

For token sequence \( T = \{t_1,…,t_n\} \), find entity spans:

\[ E = \{(t_i,…,t_j, c_k) \mid c_k \in C\} \] where \( C = \{\text{PER}, \text{ORG}, \text{LOC},…\} \) is the set of entity types.

Common Architectures

  1. Rule-based: Regular expressions + Gazetteers
  2. Statistical: CRF with hand-crafted features (see the sketch after this list)

\[ P(y|x) = \frac{1}{Z}\exp\left(\sum_i \lambda_i f_i(y,x)\right) \]

  3. Neural: BiLSTM-CRF

\[ h_t = \text{BiLSTM}(e_t, h_{t-1}) \] \[ P(y|x) = \text{CRF}(\mathbf{H}) \]
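For the statistical approach, here is a minimal sketch with the sklearn-crfsuite library (an assumed dependency; the feature set and training data are illustrative). Each per-token feature dict plays the role of the feature functions \( f_i \) in the CRF equation above:

import sklearn_crfsuite

def word_features(sent, i):
    w = sent[i]
    return {
        "lower": w.lower(),
        "is_title": w.istitle(),  # capitalization is a strong NER cue
        "is_upper": w.isupper(),
        "prev": sent[i - 1].lower() if i > 0 else "<BOS>",
        "next": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
    }

train_sents = [["Jeff", "Bezos", "founded", "Amazon"]]
train_labels = [["B-PER", "I-PER", "O", "B-ORG"]]

X = [[word_features(s, i) for i in range(len(s))] for s in train_sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, train_labels)
print(crf.predict(X))  # on this toy corpus, should recover the training tags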

Advanced Models

Transformer-based (BERT):

\[ \mathbf{H} = \text{Transformer}(E(w_1),…,E(w_n)) \] \[ y_i = \text{Softmax}(W_c h_i) \]

spaCy Implementation

import spacy
nlp = spacy.load("en_core_web_sm")
text = "Apple Inc. plans to open a new store in Paris by 2024."
doc = nlp(text)

print(f"{'Text':<15} {'Entity':<10} {'Label':<10}")
print("-" * 35)
for ent in doc.ents:
    print(f"{ent.text:<15} {ent.label_:<10} {spacy.explain(ent.label_):<30}")
Text            Entity     Label
-----------------------------------
Apple Inc.      ORG        Companies, agencies, institutions, etc.
Paris           GPE        Countries, cities, states
2024            DATE       Absolute or relative dates or periods

Custom NER with Transformers

from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

inputs = tokenizer("Jeff Bezos works at Amazon in Seattle", return_tensors="pt")
predictions = model(**inputs).logits.argmax(dim=-1).squeeze().tolist()

# Map class indices to tag names, keeping the first sub-token of each
# word and skipping special tokens (whose word_id is None)
labels = [model.config.id2label[p] for p in predictions]
word_ids = inputs.word_ids()
print([labels[i] for i, w in enumerate(word_ids)
       if w is not None and w != word_ids[i - 1]])

Output:

['B-PER', 'I-PER', 'O', 'O', 'B-ORG', 'O', 'B-LOC']

Combined Pipeline

Text Processing Workflow

\[ \text{Raw Text} \rightarrow \text{Stop Word Removal} \rightarrow \text{Lemmatization/NER} \rightarrow \text{Application} \]

Impact on Downstream Tasks

Task            Effect of Stop Word Removal    Effect of NER
Search Engine   85% size reduction             40% relevance boost
Chatbot         30% speed increase             55% intent accuracy
Sentiment       15% accuracy loss              25% context capture

Advanced Topics

Cross-Lingual Challenges

  1. Stop Words: Japanese particles (は, が)
  2. NER: Chinese person names (毛泽东 “Mao Zedong” → PER)

Context-Aware Stop Words

Dynamic filtering keeps a token only if the attention mass it receives exceeds a threshold: \[ \text{Keep } w_i \text{ if } \sum_j \alpha_{ji} > \tau \] (the sum runs over incoming weights \( \alpha_{ji} \); each row of a softmax attention matrix sums to 1, so the outgoing sum would be uninformative).
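A hedged sketch of this idea (the model choice and threshold \( \tau \) are illustrative assumptions): score each token by the attention mass it receives in the last layer of a pretrained BERT and keep tokens above the threshold.

import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tok("The quick brown fox jumps over the lazy dog", return_tensors="pt")
with torch.no_grad():
    attn = model(**inputs).attentions[-1]  # last layer: (1, heads, seq, seq)

received = attn.mean(dim=1)[0].sum(dim=0)  # attention each token receives
tau = received.mean()                      # illustrative threshold
tokens = tok.convert_ids_to_tokens(inputs["input_ids"][0])
print([t for t, s in zip(tokens, received) if s > tau])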

NER Evaluation Metrics

\[ \text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \] Where:

  • Precision: True Positives / (True Positives + False Positives)
  • Recall: True Positives / (True Positives + False Negatives)
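As a concrete check, the seqeval library (an assumed dependency) computes exactly these metrics at the entity level, the standard for NER evaluation:

from seqeval.metrics import f1_score, precision_score, recall_score

y_true = [["B-PER", "I-PER", "O", "B-ORG"]]  # gold: one PER, one ORG
y_pred = [["B-PER", "I-PER", "O", "O"]]      # predicted: the PER only

print(precision_score(y_true, y_pred))  # 1.0  (1 TP, 0 FP)
print(recall_score(y_true, y_pred))     # 0.5  (1 TP, 1 FN)
print(f1_score(y_true, y_pred))         # 2 * (1.0 * 0.5) / 1.5 ≈ 0.667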

Code: Custom Pipeline

import spacy

class AdvancedNLPipeline:
    def __init__(self):
        self.nlp = spacy.load("en_core_web_sm")
        self.custom_stop = {"inc", "ltd", "corp"}  # domain-specific stop words

    def process(self, text):
        doc = self.nlp(text)
        filtered = [token.lemma_ for token in doc
                    if not token.is_stop and token.text.lower() not in self.custom_stop]
        entities = [(ent.text, ent.label_) for ent in doc.ents]
        return filtered, entities

pipeline = AdvancedNLPipeline()
text = "Microsoft Corp announced $5B acquisition of GitHub in 2018"
tokens, ents = pipeline.process(text)
print("Filtered:", tokens)
print("Entities:", ents)
Output:

Filtered: ['Microsoft', 'announce', '$', '5b', 'acquisition', 'GitHub', '2018']
Entities: [('Microsoft Corp', 'ORG'), ('5B', 'MONEY'), ('GitHub', 'ORG'), ('2018', 'DATE')]

Conclusion

Stop Words Tradeoffs

  • Pros: Reduces dimensionality (40-60%), speeds processing
  • Cons: May lose contextual signals, language-dependent

NER Challenges

  • Ambiguous entities (“Apple” → fruit vs company)
  • Emerging entities (new startups, slang)
  • Cross-domain generalization

Modern Approaches

  • Contextual stop word detection using BERT embeddings
  • Few-shot NER with large language models
  • Multimodal NER combining text and visual cues