Word Embedding Techniques
Word Embeddings
Alright, let’s start with: what is an embedding? In NLP, a word embedding is a representation of a word for text analysis, typically a real-valued vector that encodes the word’s meaning so that words closer together in the vector space are expected to be similar in meaning.
There are two different types of word embedding techniques available: count/frequency based and deep learning trained. Let’s start with the first family.
Count/Frequency
ONE HOT ENCODING
One-hot encoding is a technique to represent categorical variables (e.g., words) as binary vectors. Each word in a vocabulary is assigned a unique index, and its vector has 1 at that index and 0 elsewhere. This method converts discrete tokens into a format usable by machine learning algorithms.
For a vocabulary \( V \) of size \( |V| \), the one-hot encoding of a word \( w_i \) is a vector \( \mathbf{o}_i \in \mathbb{R}^{|V|} \) defined as: \[ \mathbf{o}_i[j] = \begin{cases} 1 & \text{if } j = i \quad (\text{index of } w_i \text{ in } V) \\ 0 & \text{otherwise} \end{cases} \]
Consider a small vocabulary: \[ V = \{\text{cat, dog, bird}\} \quad (|V| = 3) \] The one-hot encodings are:
“cat” → \([1, 0, 0]\)
“dog” → \([0, 1, 0]\)
“bird” → \([0, 0, 1]\)
Given a corpus, extract unique words and assign indices. For example:
corpus = ["I love NLP", "NLP is fun"]
vocab = sorted(set(" ".join(corpus).lower().split()))
print(vocab)
['fun', 'i', 'is', 'love', 'nlp']
Each word is mapped to a binary vector of length \( |V| \). For the vocabulary \( V = \{\text{fun, i, is, love, nlp}\} \):
Word | One-Hot Vector |
---|---|
fun | \([1, 0, 0, 0, 0]\) |
i | \([0, 1, 0, 0, 0]\) |
is | \([0, 0, 1, 0, 0]\) |
love | \([0, 0, 0, 1, 0]\) |
nlp | \([0, 0, 0, 0, 1]\) |
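To make this concrete, here is a minimal sketch (assuming NumPy is available) that builds these vectors; the helper names `word_to_index` and `one_hot` are just illustrative.

import numpy as np

vocab = ['fun', 'i', 'is', 'love', 'nlp']
word_to_index = {w: i for i, w in enumerate(vocab)}  # word -> index in V

def one_hot(word):
    # Vector of length |V| with a single 1 at the word's index
    vec = np.zeros(len(vocab), dtype=int)
    vec[word_to_index[word]] = 1
    return vec

for w in vocab:
    print(w, one_hot(w))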
OK! Everything seems alright, but something is missing, right?
Although it’s easy to implement using libraries like scikit-learn, what if the vocabulary is larger than we thought, something like \( 10^{5} - 10^{6} \) words?
As the vocabulary grows, the vectors become increasingly sparse, which leads to computational complexity and overfitting issues.
Machine learning algorithms also need a fixed input size, but here the representation of a sentence depends on its length, so the input size is not fixed. And what if a new word appears at test time that was never seen in training? Then we hit the OOV (Out of Vocabulary) problem.
One more major limitation is that semantic meaning is lost entirely: words with similar meanings (e.g., “king” and “queen”) are treated as completely unrelated, while unrelated words (e.g., “king” and “apple”) are exactly as dissimilar.
For example, let’s consider:
Cosine similarity between two one-hot vectors \( \mathbf{o}_i \) and \( \mathbf{o}_j \): \[ \text{cosine-sim}(\mathbf{o}_i, \mathbf{o}_j) = \frac{\mathbf{o}_i \cdot \mathbf{o}_j}{\|\mathbf{o}_i\| \|\mathbf{o}_j\|} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases} \]
Consider three words and their one-hot encodings:
- “king” → \([1, 0, 0, 0]\)
- “queen” → \([0, 1, 0, 0]\)
- “apple” → \([0, 0, 1, 0] \)
- “orange” → \([0, 0, 0, 1]\)
import numpy as np

words = ["king", "queen", "apple", "orange"]
vectors = np.eye(4)  # One-hot vectors

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print("| Word 1 | Word 2 | Cosine Similarity |")
print("|--------+--------+-------------------|")
for i in range(4):
    for j in range(i + 1, 4):
        sim = cosine_sim(vectors[i], vectors[j])
        print(f"| {words[i]} | {words[j]} | {sim:.2f} |")
| Word 1 | Word 2 | Cosine Similarity |
|--------+--------+-------------------|
| king | queen | 0.00 |
| king | apple | 0.00 |
| king | orange | 0.00 |
| queen | apple | 0.00 |
| queen | orange | 0.00 |
| apple | orange | 0.00 |
Difference between the two semantic representations:
[Diagram: in the one-hot semantic space, “king”, “queen”, “apple”, and “orange” are all equally far apart, with no semantic relationships; in an ideal semantic space, “king”/“queen” and “apple”/“orange” would cluster by meaning.]
We can see that even related words show no similarity at all. To reduce these limitations, a new method called *Bag of Words* was introduced.
BAG OF WORDS
The Bag-of-Words model represents text as an unordered collection of words, ignoring grammar and word order. Documents are encoded as vectors of word frequencies or presence indicators.
For a vocabulary \( V = \{w_1, w_2, \dots, w_{|V|}\} \), a document \( D \) is represented as: \[ \text{BoW}(D) = \left[ f(w_1, D), f(w_2, D), \dots, f(w_{|V|}, D) \right] \] where \( f(w_i, D) \) is the frequency (or binary indicator) of word \( w_i \) in \( D \).
Let’s consider a corpus:
- “I love machine learning”
- “AI is fascinating and powerful”
- “Machine learning and AI are powerful”
\[ V = \{\text{I, love, machine, learning, AI, is, fascinating, and, powerful, are}\} \quad(|V|=10)\]
Document | Vector Representation (Frequency) |
---|---|
Doc 1 | \([1, 1, 1, 1, 0, 0, 0, 0, 0, 0]\) |
Doc 2 | \([0, 0, 0, 0, 1, 1, 1, 1, 1, 0]\) |
Doc 3 | \([0, 0, 1, 1, 1, 0, 0, 1, 1, 1]\) |
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I love machine learning",
    "AI is fascinating and powerful",
    "Machine learning and AI are powerful"
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus).toarray()
vocab = vectorizer.get_feature_names_out()

print("| Word | Doc 1 | Doc 2 | Doc 3 |")
print("|-------------+-------+-------+-------|")
for i, word in enumerate(vocab):
    print(f"| {word.ljust(10)} | {X[0][i]} | {X[1][i]} | {X[2][i]} |")
| Word | Doc 1 | Doc 2 | Doc 3 |
|-------------+-------+-------+-------|
| ai | 0 | 1 | 1 |
| and | 0 | 1 | 1 |
| are | 0 | 0 | 1 |
| fascinating | 0 | 1 | 0 |
| is | 0 | 1 | 0 |
| learning | 1 | 0 | 1 |
| love | 1 | 0 | 0 |
| machine | 1 | 0 | 1 |
| powerful | 0 | 1 | 1 |
One advantage over one-hot encoding is that every document is represented by a fixed-size vector of length \( |V| \), which solves the input-size problem for machine learning models.
Limitations:
BoW treats documents as independent word counts, ignoring:
Word order and context
Synonyms (e.g., “AI” vs “machine learning”)
Semantic relationships between terms
Let’s see it with cosine similarity:
- Doc 1: “I love machine learning”
- Doc 3: “Machine learning and AI are powerful”
- Actual similarity (BoW): Partial overlap
- Ideal similarity: High (both discuss ML/AI)
 | Doc 1 | Doc 2 | Doc 3 |
---|---|---|---|
Doc 1 | 1.00 | 0.00 | 0.45 |
Doc 2 | 0.00 | 1.00 | 0.52 |
Doc 3 | 0.45 | 0.52 | 1.00 |
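As a quick check, here is a minimal sketch using scikit-learn’s `cosine_similarity` on the BoW vectors (note that `CountVectorizer` drops single-character tokens such as “I”, so the exact numbers can differ slightly from a hand computation over the ten-word vocabulary above).

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "I love machine learning",
    "AI is fascinating and powerful",
    "Machine learning and AI are powerful"
]

X = CountVectorizer().fit_transform(corpus)
print(cosine_similarity(X))  # 3x3 matrix of document-document similarities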
[Diagram: in the BoW semantic space, the distances between Doc1 (ML), Doc2 (AI), and Doc3 (ML+AI) depend only on word overlap; in an ideal semantic space, documents would be placed by meaning, with Doc1 and Doc3 close together.]
- The OOV problem still persists with BoW as well
TF-IDF
TF-IDF is a statistical measure to evaluate the importance of a word in a document relative to a corpus. It balances:
- Term Frequency (TF): How often a word appears in a document.
For a word \( w \) in document \( D \): \[ \text{TF}(w, D) = \frac{\text{Count of } w \text{ in } D}{\text{Total words in } D} \]
- Inverse Document Frequency (IDF): How rare/important the word is across documents.
For a word \( w \) in corpus \( C \): \[ \text{IDF}(w, C) = \log\left(\frac{\text{Total documents in } C}{\text{Documents containing } w}\right) \]
- TF-IDF Weight
\[ \text{TF-IDF}(w, D, C) = \text{TF}(w, D) \times \text{IDF}(w, C) \]
Let’s see it with an example:
- Document 1: “the cat sat on the mat”
- Document 2: “the dog sat on the log”
- Document 3: “the cat and the dog”
- Compute TF (Documents 1 and 2 contain 6 words each; Document 3 contains 5)
Word | Doc 1 | Doc 2 | Doc 3 |
---|---|---|---|
the | 2/6 | 2/6 | 2/5 |
cat | 1/6 | 0 | 1/5 |
sat | 1/6 | 1/6 | 0 |
on | 1/6 | 1/6 | 0 |
mat | 1/6 | 0 | 0 |
dog | 0 | 1/6 | 1/5 |
log | 0 | 1/6 | 0 |
and | 0 | 0 | 1/5 |
- Compute IDF (using base-10 logarithms)
\[ \text{IDF}(w) = \log\left(\frac{3}{\text{df}(w)}\right) \]
Word | Document Frequency (df) | IDF |
---|---|---|
the | 3 | \(\log(3/3) = 0 \) |
cat | 2 | \(\log(3/2) \approx 0.176 \) |
sat | 2 | \(\log(3/2) \approx 0.176 \) |
on | 2 | \(\log(3/2) \approx 0.176 \) |
mat | 1 | \(\log(3/1) \approx 0.477 \) |
dog | 2 | \(\log(3/2) \approx 0.176 \) |
log | 1 | \(\log(3/1) \approx 0.477 \) |
and | 1 | \(\log(3/1) \approx 0.477 \) |
- Compute TF-IDF
Word | Doc 1 | Doc 2 | Doc 3 |
---|---|---|---|
the | \( 0 \times 2/6 = 0 \) | 0 | 0 |
cat | \(0.176 \times 1/6 \approx 0.029\) | 0 | \(0.176 \times 1/5 \approx 0.035\) |
sat | \(0.176 \times 1/6 \approx 0.029\) | \(0.176 \times 1/6 \approx 0.029\) | 0 |
on | \(0.176 \times 1/6 \approx 0.029\) | \(0.176 \times 1/6 \approx 0.029\) | 0 |
mat | \(0.477 \times 1/6 \approx 0.080\) | 0 | 0 |
dog | 0 | \(0.176 \times 1/6 \approx 0.029\) | \(0.176 \times 1/5 \approx 0.035\) |
log | 0 | \(0.477 \times 1/6 \approx 0.080\) | 0 |
and | 0 | 0 | \(0.477 \times 1/5 \approx 0.095\) |
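To verify these numbers, here is a small hand-rolled sketch of the textbook formulas above; the helpers `tf` and `idf` are illustrative names, and base-10 logarithms are used to match the tables.

import math

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat and the dog",
]
tokenized = [d.split() for d in docs]
vocab = sorted(set(w for doc in tokenized for w in doc))

def tf(word, doc):
    # Term frequency: count of the word divided by the document length
    return doc.count(word) / len(doc)

def idf(word, corpus):
    # Inverse document frequency over the whole corpus (base-10 log)
    df = sum(word in doc for doc in corpus)
    return math.log10(len(corpus) / df)

for word in vocab:
    weights = [round(tf(word, doc) * idf(word, tokenized), 3) for doc in tokenized]
    print(word, weights)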
We can also implement it using scikit-learn:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat and the dog"
]

vectorizer = TfidfVectorizer(norm=None)  # Disable L2 normalization
X = vectorizer.fit_transform(corpus).toarray()
features = vectorizer.get_feature_names_out()

print("| Word | Doc 1 | Doc 2 | Doc 3 |")
print("|------|-------|-------|-------|")
for i, word in enumerate(features):
    print(f"| {word.ljust(4)} | {X[0][i]:.3f} | {X[1][i]:.3f} | {X[2][i]:.3f} |")
Note that scikit-learn’s `TfidfVectorizer` uses raw term counts for TF and a smoothed IDF, \( \ln\frac{1 + N}{1 + \text{df}(w)} + 1 \), so the weights differ from the hand-computed table above:
Word | Doc 1 | Doc 2 | Doc 3 |
---|---|---|---|
and | 0.000 | 0.000 | 1.693 |
cat | 1.288 | 0.000 | 1.288 |
dog | 0.000 | 1.288 | 1.288 |
log | 0.000 | 1.693 | 0.000 |
mat | 1.693 | 0.000 | 0.000 |
on | 1.288 | 1.288 | 0.000 |
sat | 1.288 | 1.288 | 0.000 |
the | 2.000 | 2.000 | 2.000 |
Key advantages
Easy to implement
Fixed-size vectors, just like BoW
Word importance is captured, which is not the case with BoW or one-hot encoding. But how? If we look at the full TF-IDF weight table, the words that are genuinely important to a particular document, the ones that change its context, get non-zero weights, while words that appear everywhere are pushed toward zero. So TF-IDF captures word importance, which the earlier methods ignore.
Limitations
The matrix is still sparse, and its size grows with the vocabulary/corpus.
The OOV problem still persists.
Deep Learning Trained
This second family of word embedding techniques is trained with deep learning and solves most of the limitations of the count/frequency methods. The best-known technique in this family is Word2Vec.
- Word2Vec: Word2vec is a technique for natural language processing published in 2013. The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. As the name implies, word2vec represents each distinct word with a particular list of numbers called a vector.
Key Architectures:
1. Continuous Bag of Words (CBOW)
Predicts a target word given its context words. Ideal for small datasets.
Mathematical Formulation
- Input: Context words \( \{w_{t-k}, …, w_{t-1}, w_{t+1}, …, w_{t+k}\} \)
- Output: Target word \( w_t \)
Let:
- \( \mathbf{V} \in \mathbb{R}^{|V|\times d} \): Input embedding matrix
- \( \mathbf{U} \in \mathbb{R}^{d\times |V|} \): Output embedding matrix
- \( \mathcal{C}_t \): Context window around \( w_t \)
Objective function:
$$ P(w_t \mid \mathcal{C}_t) = \frac{\exp(\mathbf{u}_{w_t}^{\top} \bar{\mathbf{v}})}{\sum_{w=1}^{|V|} \exp(\mathbf{u}_w^{\top} \bar{\mathbf{v}})} $$
where \( \bar{\mathbf{v}} = \frac{1}{2k} \sum_{w \in \mathcal{C}_t} \mathbf{v}_w \)
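As a rough illustration (not a full training loop), here is a minimal NumPy sketch of a single CBOW forward pass; the matrices, dimensions, and `context_ids` below are made-up toy values.

import numpy as np

rng = np.random.default_rng(0)
V, d, k = 10, 8, 2  # toy vocabulary size, embedding dimension, half-window

V_in = rng.normal(scale=0.1, size=(V, d))   # input embedding matrix  (|V| x d)
U_out = rng.normal(scale=0.1, size=(d, V))  # output embedding matrix (d x |V|)

context_ids = [1, 2, 4, 5]              # indices of the 2k context words
v_bar = V_in[context_ids].mean(axis=0)  # average context embedding

scores = v_bar @ U_out                         # one score per vocabulary word
probs = np.exp(scores) / np.exp(scores).sum()  # softmax -> P(w_t | context)
print(probs.argmax(), probs.max())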
2. Skip-Gram
Predicts context words given a target word. Better for large datasets and rare words.
Mathematical Formulation
- Input: Target word \( w_t \)
- Output: Context words \( \{w_{t-k}, …, w_{t+k}\} \)
Objective function (assuming independence):
$$
\prod_{w_c \in \mathcal{C}_t} P(w_c \mid w_t) = \prod_{w_c \in \mathcal{C}_t} \frac{\exp(\mathbf{u}_{w_c}^\top \mathbf{v}_{w_t})}{\sum_{w=1}^{|V|} \exp(\mathbf{u}_w^\top \mathbf{v}_{w_t})}
$$
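In practice both architectures are available in the gensim library. A minimal training sketch on a toy corpus might look like the following; the tiny corpus and the parameter values are purely illustrative, and real training needs far more data.

from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "cat", "chases", "the", "mouse"],
]

# sg=0 -> CBOW, sg=1 -> Skip-Gram
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=50)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(skipgram.wv["king"][:5])           # first few dimensions of the "king" vector
print(skipgram.wv.most_similar("king"))  # nearest words by cosine similarity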
Core Mathematical Components
- Softmax Function
Converts dot products into probability distributions:
$$ P(w_j | w_i) = \frac{\exp(\mathbf{u}_j^\top \mathbf{v}_i)}{\sum_{k=1}^{|V|} \exp(\mathbf{u}_k^\top \mathbf{v}_i)} $$
where \( \mathbf{v}_i \) = input embedding of \( w_i \), \( \mathbf{u}_j \) = output embedding of \( w_j \)
- Negative Log-Likelihood Loss
Loss function for training: \[ \mathcal{L} = -\sum_{t=1}^T \sum_{w_c \in \mathcal{C}_t} \log P(w_c | w_t) \]
- Hierarchical Softmax
Efficient approximation using binary Huffman tree:
- Each leaf node represents a word \( w \)
- Path from root to \( w \) defines binary decisions
$$ P(w | w_i) = \prod_{n=1}^{\text{path}(w)} \sigma(\llbracket \text{dir}(n) \rrbracket \cdot \mathbf{u}_n^\top \mathbf{v}_i) $$
where \( \sigma \) = sigmoid function, \( \llbracket \cdot \rrbracket \) = +1/-1 indicator
- Negative Sampling
Approximates softmax by sampling negative examples: $$ \mathcal{L} = -\log \sigma(\mathbf{u}_c^\top \mathbf{v}_i) - \sum_{k=1}^K \mathbb{E}_{w_k \sim P_n(w)} \log \sigma(-\mathbf{u}_k^\top \mathbf{v}_i) $$
where \( K \) = number of negative samples, \( P_n(w) \propto \text{freq}(w)^{3/4} \)
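For intuition, here is a tiny NumPy sketch that evaluates this loss for one (word, context) pair with randomly initialized vectors; the dimensions and variable names are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
d, K = 50, 5  # embedding dimension, number of negative samples (toy values)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v_i = rng.normal(size=d)         # input embedding of the center word
u_pos = rng.normal(size=d)       # output embedding of the observed context word
u_neg = rng.normal(size=(K, d))  # output embeddings of K sampled negative words

# Negative-sampling loss: pull the true pair together, push sampled pairs apart
loss = -np.log(sigmoid(u_pos @ v_i)) - np.sum(np.log(sigmoid(-(u_neg @ v_i))))
print(loss)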
Training Dynamics
- Embedding Learning
- Input embeddings \( \mathbf{V} \) capture word meaning
- Output embeddings \( \mathbf{U} \) capture contextual usage
- Gradient Updates
For parameter \( \theta \in \{\mathbf{V}, \mathbf{U}\} \): \[ \theta \leftarrow \theta - \eta \frac{\partial \mathcal{L}}{\partial \theta} \] Where gradients are computed through backpropagation.
- Context Window
Sliding window defines the local context:
- A typical window spans 5 words (2 before + 2 after + the target), i.e. \( k = 2 \) in the notation above
- Larger windows capture more topical similarity
Vector Space Properties
- Additive Compositionality
Linear analogies emerge in embedding space: \[ \mathbf{v}_{\text{king}} - \mathbf{v}_{\text{man}} + \mathbf{v}_{\text{woman}} \approx \mathbf{v}_{\text{queen}} \]
- Cosine Similarity
Semantic similarity measured by: \[ \text{sim}(w_i, w_j) = \cos(\theta) = \frac{\mathbf{v}_i \cdot \mathbf{v}_j}{\|\mathbf{v}_i\| \|\mathbf{v}_j\|} \]
- Dimensionality
Typical embedding dimensions:
- \( d = 300 \) (balance between expressiveness and efficiency)
- Higher \( d \): Better performance but risk of overfitting
- Example: Semantic Relationships
Relationship Type | Example | Vector Operation |
---|---|---|
Gender | king - man + woman = queen | \( \mathbf{v}_w - \mathbf{v}_m + \mathbf{v}_f \) |
Plurality | apple - apples + cars = car | \( \mathbf{v}_s - \mathbf{v}_p + \mathbf{v}_t \) |
Verb-Noun | fly - bird + fish = swim | \( \mathbf{v}_a - \mathbf{v}_b + \mathbf{v}_c \) |
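These analogies are easy to try with pretrained vectors. A minimal sketch using gensim’s downloader (this fetches the "glove-wiki-gigaword-50" vectors on first use, and the exact top results can vary with the model):

import gensim.downloader as api

# Small pretrained GloVe vectors, downloaded on first use
wv = api.load("glove-wiki-gigaword-50")

# king - man + woman ~= queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Plain cosine similarity between two word vectors
print(wv.similarity("king", "queen"))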
Limitations
- Static Embeddings
- Single vector per word (fails to handle polysemy):
\[ \text{“bank”} = \begin{cases} \text{financial} \\ \text{river} \end{cases} \rightarrow \text{single vector} \]
- Context Window Limitation
- Fixed window size ignores global document context
- Fails to capture long-range dependencies
- Corpus Dependency
- Embeddings reflect biases in training data
\[ \mathbf{v}_{\text{doctor}} \approx \mathbf{v}_{\text{he}} \quad \text{(gender bias)} \]
- Out-of-Vocabulary Words
- Cannot handle new words not in training corpus
Comparison with Other Methods
Property | Word2Vec | GloVe | BERT |
---|---|---|---|
Context | Local (window) | Global (co-occur) | Full sentence |
Polysemy Handling | ❌ | ❌ | ✅ |
Training | Shallow NN | Matrix factorization | Deep Transformer |
Embedding Type | Static | Static | Contextual |
Theoretical Insights
- Connection to Matrix Factorization
Word2Vec (with negative sampling) implicitly factorizes a shifted PMI matrix, entrywise: \[ \mathbf{v}_i^\top \mathbf{u}_j = \log \frac{P(w_i, w_j)}{P(w_i)\,P(w_j)} - \log k \] where \( k \) = number of negative samples
- Energy-Based Models
Word2Vec can be viewed as energy minimization: \[ E(w_i, w_j) = -\mathbf{v}_i^\top \mathbf{u}_j \] Minimize energy for observed word-context pairs.
- Skip-Gram as Neural Network
- Input layer: One-hot encoded word
- Hidden layer: Embedding lookup (no activation)
- Output layer: Softmax over context words