Contents

Word Embedding Techniques

Word Embeddings

Alright, let’s start with the basics: what is an embedding? In NLP, word embedding is a term used for the representation of words for text analysis, typically in the form of a real-valued vector that encodes the meaning of a word such that words that are closer in the vector space are expected to be similar in meaning.

There are two different types of word embedding techniques available:

  • Count/Frequency

  • Deep Learning Trained

ONE HOT ENCODING

One-hot encoding is a technique to represent categorical variables (e.g., words) as binary vectors. Each word in a vocabulary is assigned a unique index, and its vector has 1 at that index and 0 elsewhere. This method converts discrete tokens into a format usable by machine learning algorithms.

For a vocabulary \( V \) of size \( |V| \), the one-hot encoding of a word \( w_i \) is a vector \( \mathbf{o}_i \in \mathbb{R}^{|V|} \) defined as: \[ \mathbf{o}_i[j] = \begin{cases} 1 & \text{if } j = i \quad (\text{index of } w_i \text{ in } V) \\ 0 & \text{otherwise} \end{cases} \]

Consider a small vocabulary: \[ V = \{\text{cat, dog, bird}\} \quad (|V| = 3) \] The one-hot encodings are:

  • “cat” → \([1, 0, 0]\)

  • “dog” → \([0, 1, 0]\)

  • “bird” → \([0, 0, 1]\)

    Given a corpus, extract unique words and assign indices. For example:

corpus = ["I love NLP", "NLP is fun"]
vocab = sorted(set(" ".join(corpus).lower().split()))
print(vocab)
['fun', 'i', 'is', 'love', 'nlp']

Each word is mapped to a binary vector of length \( |V| \). For the vocabulary \( V = \{\text{fun, i, is, love, nlp}\} \):

| Word | One-Hot Vector      |
|------+---------------------|
| fun  | \([1, 0, 0, 0, 0]\) |
| i    | \([0, 1, 0, 0, 0]\) |
| is   | \([0, 0, 1, 0, 0]\) |
| love | \([0, 0, 0, 1, 0]\) |
| nlp  | \([0, 0, 0, 0, 1]\) |
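Here is a minimal sketch of building these vectors (using plain NumPy; scikit-learn's OneHotEncoder would work just as well):

import numpy as np

corpus = ["I love NLP", "NLP is fun"]
vocab = sorted(set(" ".join(corpus).lower().split()))  # ['fun', 'i', 'is', 'love', 'nlp']

# Row i of the identity matrix is the one-hot vector for vocab[i]
one_hot = {word: np.eye(len(vocab), dtype=int)[i] for i, word in enumerate(vocab)}

for word, vector in one_hot.items():
    print(word, vector)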

OK! Everything seems alright, but a few things are missing, right?

  • Although it’s easy to implement using libraries like scikit-learn, what if the vocabulary is larger than we expected, something like \( 10^{5} - 10^{6} \) words? Every vector then has that many dimensions.

  • With an ever-increasing vocabulary, the sparsity of the matrices also increases, leading to computational complexity and overfitting issues.

  • Machine learning algorithms need a fixed input size, but here the input size cannot be fixed in advance: it depends on the sentence length.

  • What if a word appears at test time that was not seen during training? Then we hit the out-of-vocabulary (OOV) problem and cannot represent it at all.

  • One more major limitation is that semantic meaning is lost: words with similar meanings (e.g., “king” and “queen”) are treated as completely unrelated, while unrelated words (e.g., “king” and “apple”) are equally dissimilar.

    For example, let’s consider:

    Cosine similarity between two one-hot vectors \( \mathbf{o}_i \) and \( \mathbf{o}_j \): \[ \text{cosine-sim}(\mathbf{o}_i, \mathbf{o}_j) = \frac{\mathbf{o}_i \cdot \mathbf{o}_j}{\|\mathbf{o}_i\| \|\mathbf{o}_j\|} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases} \]

Consider three words and their one-hot encodings:

  • “king” → \([1, 0, 0, 0]\)
  • “queen” → \([0, 1, 0, 0]\)
  • “apple” → \([0, 0, 1, 0] \)
  • “orange” → \([0, 0, 0, 1]\)
import numpy as np

words = ["king", "queen", "apple", "orange"]
vectors = np.eye(4)  # One-hot vectors

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print("| Word 1 | Word 2 | Cosine Similarity |")
print("|--------+--------+-------------------|")
for i in range(4):
    for j in range(i+1, 4):
        sim = cosine_sim(vectors[i], vectors[j])
        print(f"| {words[i]} | {words[j]} | {sim:.2f} |")
| Word 1 | Word 2 | Cosine Similarity |
|--------+--------+-------------------|
| king | queen | 0.00 |
| king | apple | 0.00 |
| king | orange | 0.00 |
| queen | apple | 0.00 |
| queen | orange | 0.00 |
| apple | orange | 0.00 |

Difference between the two semantic representations:

Semantic Space (One-Hot Encoding)      Ideal Semantic Space
      ▲                                     ▲
      |                                     |
      o king                                o king
      |                                     | \
      |                                     |  o queen
      o queen                               | /
      |                                     o apple
      o apple                               \
      |                                       \
      o orange                                 o orange
      └────────────────►                      └───────────────►
       No semantic relationships              Clustered by meaning

We can see that there is no similarity even between related words. To reduce these limitations, a new method called *BAG OF WORDS* was introduced.

BAG OF WORDS

The Bag-of-Words model represents text as an unordered collection of words, ignoring grammar and word order. Documents are encoded as vectors of word frequencies or presence indicators.

For a vocabulary \( V = \{w_1, w_2, \dots, w_{|V|}\} \), a document \( D \) is represented as: \[ \text{BoW}(D) = \left[ f(w_1, D), f(w_2, D), \dots, f(w_{|V|}, D) \right] \] where \( f(w_i, D) \) is the frequency (or binary indicator) of word \( w_i \) in \( D \).

Let’s consider a corpus:

  1. “I love machine learning”
  2. “AI is fascinating and powerful”
  3. “Machine learning and AI are powerful”

\[ V = \{\text{I, love, machine, learning, AI, is, fascinating, and, powerful, are}\} \quad(|V|=10)\]

| Document | Vector Representation (Frequency)  |
|----------+------------------------------------|
| Doc 1    | \([1, 1, 1, 1, 0, 0, 0, 0, 0, 0]\) |
| Doc 2    | \([0, 0, 0, 0, 1, 1, 1, 1, 1, 0]\) |
| Doc 3    | \([0, 0, 1, 1, 1, 0, 0, 1, 1, 1]\) |
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I love machine learning",
    "AI is fascinating and powerful",
    "Machine learning and AI are powerful"
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus).toarray()
vocab = vectorizer.get_feature_names_out()

print("| Word        | Doc 1 | Doc 2 | Doc 3 |")
print("|-------------+-------+-------+-------|")
for i, word in enumerate(vocab):
    print(f"| {word.ljust(10)} | {X[0][i]}     | {X[1][i]}     | {X[2][i]}     |")
| Word        | Doc 1 | Doc 2 | Doc 3 |
|-------------+-------+-------+-------|
| ai         | 0     | 1     | 1     |
| and        | 0     | 1     | 1     |
| are        | 0     | 0     | 1     |
| fascinating | 0     | 1     | 0     |
| is         | 0     | 1     | 0     |
| learning   | 1     | 0     | 1     |
| love       | 1     | 0     | 0     |
| machine    | 1     | 0     | 1     |
| powerful   | 0     | 1     | 1     |

One advantage over one-hot encoding is that the representation size is fixed (always \( |V| \)), which solves the input-size problem for machine learning models.

Limitations:

BoW treats documents as independent word counts, ignoring:

  • Word order and context

  • Synonyms (e.g., “AI” vs “machine learning”)

  • Semantic relationships between terms

    Let’s see it with cosine similarity (a short code sketch for reproducing these document similarities appears after the diagram below):

    • Doc 1: “I love machine learning”
    • Doc 3: “Machine learning and AI are powerful”
    • Actual similarity (BoW): Partial overlap
    • Ideal similarity: High (both discuss ML/AI)
|       | Doc 1 | Doc 2 | Doc 3 |
|-------+-------+-------+-------|
| Doc 1 | 1.00  | 0.00  | 0.45  |
| Doc 2 | 0.00  | 1.00  | 0.52  |
| Doc 3 | 0.45  | 0.52  | 1.00  |
BoW Semantic Space                  Ideal Semantic Space
      ▲                                     ▲
      |                                     |
      o Doc1 (ML)                          o Doc1
      |                                     | \
      |                                     |  o Doc3
      o Doc2 (AI)                          | /
      |                                     o Doc2
      o Doc3 (ML+AI)                       |
      └────────────────►                   └───────────────►
      Distance based on word overlap        Distance based on meaning
  • The OOV problem still persists with BoW as well.
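As promised, here is a minimal sketch for computing the document similarities above (assuming scikit-learn; note that CountVectorizer's default tokenizer drops the single-character token "I", so the exact values may differ slightly from the table):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "I love machine learning",
    "AI is fascinating and powerful",
    "Machine learning and AI are powerful"
]

# Bag-of-Words count matrix: 3 documents x |V| vocabulary terms
X = CountVectorizer().fit_transform(corpus)

# Pairwise cosine similarity between the document vectors
sims = cosine_similarity(X)

for row in sims:
    print(" ".join(f"{value:.2f}" for value in row))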

TF-IDF

TF-IDF is a statistical measure to evaluate the importance of a word in a document relative to a corpus. It balances:

  1. Term Frequency (TF): How often a word appears in a document.

For a word \( w \) in document \( D \): \[ \text{TF}(w, D) = \frac{\text{Count of } w \text{ in } D}{\text{Total words in } D} \]

  2. Inverse Document Frequency (IDF): How rare/important the word is across documents.

For a word \( w \) in corpus \( C \): \[ \text{IDF}(w, C) = \log\left(\frac{\text{Total documents in } C}{\text{Documents containing } w}\right) \]

  • TF-IDF Weight

\[ \text{TF-IDF}(w, D, C) = \text{TF}(w, D) \times \text{IDF}(w, C) \]

Let’s see it with an example:

Document 1: “the cat sat on the mat”
Document 2: “the dog sat on the log”
Document 3: “the cat and the dog”

  • Compute TF
(Doc 1 and Doc 2 each contain 6 words; Doc 3 contains 5.)

| Word | Doc 1 | Doc 2 | Doc 3 |
|------+-------+-------+-------|
| the  | 2/6   | 2/6   | 2/5   |
| cat  | 1/6   | 0     | 1/5   |
| sat  | 1/6   | 1/6   | 0     |
| on   | 1/6   | 1/6   | 0     |
| mat  | 1/6   | 0     | 0     |
| dog  | 0     | 1/6   | 1/5   |
| log  | 0     | 1/6   | 0     |
| and  | 0     | 0     | 1/5   |
  • Compute IDF

\[ \text{IDF}(w) = \log\left(\frac{3}{\text{df}(w)}\right) \]

| Word | Document Frequency (df) | IDF                         |
|------+-------------------------+-----------------------------|
| the  | 3                       | \(\log(3/3) = 0\)           |
| cat  | 2                       | \(\log(3/2) \approx 0.176\) |
| sat  | 2                       | \(\log(3/2) \approx 0.176\) |
| on   | 2                       | \(\log(3/2) \approx 0.176\) |
| mat  | 1                       | \(\log(3/1) \approx 0.477\) |
| dog  | 2                       | \(\log(3/2) \approx 0.176\) |
| log  | 1                       | \(\log(3/1) \approx 0.477\) |
| and  | 1                       | \(\log(3/1) \approx 0.477\) |
  • Compute TF-IDF
| Word | Doc 1                              | Doc 2                              | Doc 3                              |
|------+------------------------------------+------------------------------------+------------------------------------|
| the  | \(0 \times 2/6 = 0\)               | 0                                  | 0                                  |
| cat  | \(0.176 \times 1/6 \approx 0.029\) | 0                                  | \(0.176 \times 1/5 \approx 0.035\) |
| sat  | \(0.176 \times 1/6 \approx 0.029\) | \(0.176 \times 1/6 \approx 0.029\) | 0                                  |
| on   | \(0.176 \times 1/6 \approx 0.029\) | \(0.176 \times 1/6 \approx 0.029\) | 0                                  |
| mat  | \(0.477 \times 1/6 \approx 0.080\) | 0                                  | 0                                  |
| dog  | 0                                  | \(0.176 \times 1/6 \approx 0.029\) | \(0.176 \times 1/5 \approx 0.035\) |
| log  | 0                                  | \(0.477 \times 1/6 \approx 0.080\) | 0                                  |
| and  | 0                                  | 0                                  | \(0.477 \times 1/5 \approx 0.095\) |
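Here is a minimal sketch of this hand computation in plain Python (using a base-10 logarithm to match the values above; the scikit-learn implementation shown next uses a smoothed variant of IDF, so its numbers will differ):

import math

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat and the dog"
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})
N = len(docs)

def tf(word, tokens):
    # Term frequency: count of the word divided by total words in the document
    return tokens.count(word) / len(tokens)

def idf(word):
    # Inverse document frequency: log10(N / number of documents containing the word)
    df = sum(word in tokens for tokens in tokenized)
    return math.log10(N / df)

for word in vocab:
    weights = [round(tf(word, tokens) * idf(word), 3) for tokens in tokenized]
    print(word, weights)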

We can implement it using scikit-learn

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat and the dog"
]

vectorizer = TfidfVectorizer(norm=None)  # Disable L2 normalization
X = vectorizer.fit_transform(corpus).toarray()
features = vectorizer.get_feature_names_out()

print("| Word | Doc 1 | Doc 2 | Doc 3 |")
print("|------|-------|-------|-------|")
for i, word in enumerate(features):
    print(f"| {word.ljust(4)} | {X[0][i]:.3f} | {X[1][i]:.3f} | {X[2][i]:.3f} |")
| Word | Doc 1 | Doc 2 | Doc 3 |
|------|-------|-------|-------|
| and  | 0.000 | 0.000 | 1.693 |
| cat  | 1.288 | 0.000 | 1.288 |
| dog  | 0.000 | 1.288 | 1.288 |
| log  | 0.000 | 1.693 | 0.000 |
| mat  | 1.693 | 0.000 | 0.000 |
| on   | 1.288 | 1.288 | 0.000 |
| sat  | 1.288 | 1.288 | 0.000 |
| the  | 2.000 | 2.000 | 2.000 |

Note that these values differ from the hand-computed table above: scikit-learn's TfidfVectorizer uses raw term counts for TF and a smoothed, natural-log IDF, \( \text{IDF}(w) = \ln\frac{1 + N}{1 + \text{df}(w)} + 1 \), so even a word that appears in every document (like “the”) keeps a non-zero weight.
  • Key advantages

    • Easy to implement

    • Fixed representation size, as with BoW

    • Word importance is captured, which is not the case with one-hot encoding or BoW. How? Looking at the full TF-IDF weight table, the words that are genuinely informative and change the context of a particular sentence receive non-zero weights, while words that appear everywhere get weights close to zero. In other words, TF-IDF gives us word importance, unlike the other two methods.

  • Limitations

    • The matrix is still sparse, and its size depends on the corpus/vocabulary size.

    • The OOV problem still persists.

Deep Learning Trained

This second type of word embedding comprises deep-learning-trained techniques, which solve most of the limitations of the count/frequency techniques. A well-known technique in this family is Word2Vec.

  • Word2Vec: Word2vec is a technique for natural language processing published in 2013. The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. As the name implies, word2vec represents each distinct word with a particular list of numbers called a vector.

Key Architectures:

1. Continuous Bag of Words (CBOW)

Predicts a target word given its context words. Ideal for small datasets.

  • Mathematical Formulation

    • Input: Context words \( \{w_{t-k}, …, w_{t-1}, w_{t+1}, …, w_{t+k}\} \)
    • Output: Target word \( w_t \)

    Let:

    • \( \mathbf{V} \in \mathbb{R}^{|V|\times d} \): Input embedding matrix
    • \( \mathbf{U} \in \mathbb{R}^{d\times |V|} \): Output embedding matrix
    • \( \mathcal{C}_t \): Context window around \( w_t \)

    Objective function:

    $$ P(w_t \mid \mathcal{C}_t) = \frac{\exp(\mathbf{u}_{w_t}^{\top} \bar{\mathbf{v}})}{\sum_{w=1}^{|V|} \exp(\mathbf{u}_w^{\top} \bar{\mathbf{v}})} $$

    where \( \bar{\mathbf{v}} = \frac{1}{2k} \sum_{w \in \mathcal{C}_t} \mathbf{v}_w \)

2. Skip-Gram

Predicts context words given a target word. Better for large datasets and rare words.

  • Mathematical Formulation

    • Input: Target word \( w_t \)
    • Output: Context words \( \{w_{t-k}, …, w_{t+k}\} \)

    Objective function (assuming independence):

    $$ \prod_{w_c \in \mathcal{C}_t} P(w_c \mid w_t) = \prod_{w_c \in \mathcal{C}_t} \frac{\exp(\mathbf{u}_{w_c}^{\top} \mathbf{v}_{w_t})}{\sum_{w=1}^{|V|} \exp(\mathbf{u}_{w}^{\top} \mathbf{v}_{w_t})} $$
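Both architectures can be trained with the gensim library; here is a minimal sketch (assuming gensim 4.x; the toy corpus is purely illustrative, since meaningful embeddings require a large corpus):

from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "cat", "chases", "the", "mouse"],
]

# sg=0 selects CBOW, sg=1 selects Skip-Gram
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=100)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

print(skipgram.wv["king"].shape)         # (50,) dense vector for "king"
print(skipgram.wv.most_similar("king"))  # nearest neighbours in the embedding space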

Core Mathematical Components

    1. Softmax Function

    Converts dot products into probability distributions:

    $$ P(w_j | w_i) = \frac{\exp(\mathbf{u}_j^\top \mathbf{v}_i)}{\sum_{k=1}^{|V|} \exp(\mathbf{u}_k^\top \mathbf{v}_i)} $$

    where \( \mathbf{v}_i \) = input embedding of \( w_i \), \( \mathbf{u}_j \) = output embedding of \( w_j \)

    2. Negative Log-Likelihood Loss

    Loss function for training: \[ \mathcal{L} = -\sum_{t=1}^T \sum_{w_c \in \mathcal{C}_t} \log P(w_c | w_t) \]

    3. Hierarchical Softmax

    Efficient approximation using binary Huffman tree:

    • Each leaf node represents a word \( w \)
    • Path from root to \( w \) defines binary decisions

    $$ P(w | w_i) = \prod_{n=1}^{\text{path}(w)} \sigma(\llbracket \text{dir}(n) \rrbracket \cdot \mathbf{u}_n^\top \mathbf{v}_i) $$

    where \( \sigma \) = sigmoid function, \( \llbracket \cdot \rrbracket \) = +1/-1 indicator

    4. Negative Sampling

    Approximates softmax by sampling negative examples: $$ \mathcal{L} = -\log \sigma(\mathbf{u}_c^\top \mathbf{v}_i) - \sum_{k=1}^K \mathbb{E}_{w_k \sim P_n(w)} \log \sigma(-\mathbf{u}_k^\top \mathbf{v}_i) $$

    where \( K \) = number of negative samples, \( P_n(w) \propto \text{freq}(w)^{3/4} \)
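To make the negative-sampling loss concrete, here is a tiny NumPy sketch with random toy embeddings (purely illustrative and untrained; the \( K = 5 \) negative words are assumed to have been sampled already):

import numpy as np

rng = np.random.default_rng(0)
d = 4                            # toy embedding dimension

v_i = rng.normal(size=d)         # input embedding of the target word
u_pos = rng.normal(size=d)       # output embedding of an observed context word
U_neg = rng.normal(size=(5, d))  # output embeddings of K = 5 sampled negative words

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Negative-sampling loss for a single (target, context) pair:
# -log sigma(u_c . v_i) - sum_k log sigma(-u_k . v_i)
loss = -np.log(sigmoid(u_pos @ v_i)) - np.sum(np.log(sigmoid(-(U_neg @ v_i))))
print(loss)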

Training Dynamics

    1. Embedding Learning
    • Input embeddings \( \mathbf{V} \) capture word meaning
    • Output embeddings \( \mathbf{U} \) capture contextual usage
    2. Gradient Updates

    For parameter \( \theta \in \{\mathbf{V}, \mathbf{U}\} \): \[ \theta \leftarrow \theta - \eta \frac{\partial \mathcal{L}}{\partial \theta} \] Where gradients are computed through backpropagation.

    3. Context Window

    Sliding window of size \( k \) defines local context:

    • A typical window size is \( k = 5 \) (5 context words before and 5 after the target)
    • Larger windows capture more topical similarity

Vector Space Properties

    1. Additive Compositionality

    Linear analogies emerge in embedding space: \[ \mathbf{v}_{\text{king}} - \mathbf{v}_{\text{man}} + \mathbf{v}_{\text{woman}} \approx \mathbf{v}_{\text{queen}} \]

    2. Cosine Similarity

    Semantic similarity measured by: \[ \text{sim}(w_i, w_j) = \cos(\theta) = \frac{\mathbf{v}_i \cdot \mathbf{v}_j}{\|\mathbf{v}_i\| \|\mathbf{v}_j\|} \]

    3. Dimensionality

    Typical embedding dimensions:

    • \( d = 300 \) (balance between expressiveness and efficiency)
    • Higher \( d \): Better performance but risk of overfitting
  • Example: Semantic Relationships
    | Relationship Type | Example                     | Vector Operation                                 |
    |-------------------+-----------------------------+--------------------------------------------------|
    | Gender            | king - man + woman = queen  | \( \mathbf{v}_w - \mathbf{v}_m + \mathbf{v}_f \) |
    | Plurality         | apple - apples + cars = car | \( \mathbf{v}_s - \mathbf{v}_p + \mathbf{v}_t \) |
    | Verb-Noun         | fly - bird + fish = swim    | \( \mathbf{v}_a - \mathbf{v}_b + \mathbf{v}_c \) |
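Analogies like those in the table can be checked with pretrained vectors. A short sketch using gensim's downloader follows (it assumes the gensim-data package and network access; it loads GloVe vectors for convenience, but a pretrained word2vec model such as "word2vec-google-news-300" would work the same way):

import gensim.downloader as api

# Download (once) and load 100-dimensional pretrained vectors as KeyedVectors
wv = api.load("glove-wiki-gigaword-100")

# king - man + woman is closest to ... ?
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Cosine similarity between semantically related words
print(wv.similarity("king", "queen"))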

Limitations

    1. Static Embeddings
    • Single vector per word (fails to handle polysemy):

    \[ \text{“bank”} = \begin{cases} \text{financial} \\ \text{river} \end{cases} \rightarrow \text{single vector} \]

    2. Context Window Limitation
    • Fixed window size ignores global document context
    • Fails to capture long-range dependencies
    3. Corpus Dependency
    • Embeddings reflect biases in training data

    \[ \mathbf{v}_{\text{doctor}} \approx \mathbf{v}_{\text{he}} \quad \text{(gender bias)} \]

    4. Out-of-Vocabulary Words
    • Cannot handle new words not in training corpus

Comparison with Other Methods

| Property          | Word2Vec       | GloVe                | BERT             |
|-------------------+----------------+----------------------+------------------|
| Context           | Local (window) | Global (co-occur)    | Full sentence    |
| Polysemy Handling | No             | No                   | Yes              |
| Training          | Shallow NN     | Matrix factorization | Deep Transformer |
| Embedding Type    | Static         | Static               | Contextual       |

Theoretical Insights

    1. Connection to Matrix Factorization

    Word2Vec (skip-gram with negative sampling) implicitly factorizes a shifted PMI matrix: \[ (\mathbf{V}\mathbf{U}^\top)_{ij} = \log \frac{P(w_i, w_j)}{P(w_i)P(w_j)} - \log k \] where \( k \) = number of negative samples

    2. Energy-Based Models

    Word2Vec can be viewed as energy minimization: \[ E(w_i, w_j) = -\mathbf{v}_i^\top \mathbf{u}_j \] Minimize energy for observed word-context pairs.

    3. Skip-Gram as Neural Network
    • Input layer: One-hot encoded word
    • Hidden layer: Embedding lookup (no activation)
    • Output layer: Softmax over context words