Contents

Word Embedding Techniques

Word Embeddings

Alright, let’s start with the basics: what is an embedding? In NLP, word embedding is a term used for the representation of words for text analysis, typically in the form of a real-valued vector that encodes the meaning of a word such that words that are closer in the vector space are expected to be similar in meaning.

There are two different types of word embedding techniques available:

  • Count/Frequency

  • Deep Learning Trained

ONE HOT ENCODING

One-hot encoding is a technique to represent categorical variables (e.g., words) as binary vectors. Each word in a vocabulary is assigned a unique index, and its vector has 1 at that index and 0 elsewhere. This method converts discrete tokens into a format usable by machine learning algorithms.

For a vocabulary \( V \) of size \( |V| \), the one-hot encoding of a word \( w_i \) is a vector \( \mathbf{o}_i \in \mathbb{R}^{|V|} \) defined as: \[ \mathbf{o}_i[j] = \begin{cases} 1 & \text{if } j = i \quad (\text{index of } w_i \text{ in } V) \\ 0 & \text{otherwise} \end{cases} \]

Consider a small vocabulary: \[ V = \{\text{cat, dog, bird}\} \quad (|V| = 3) \] The one-hot encodings are:

  • “cat” → \([1, 0, 0]\)

  • “dog” → \([0, 1, 0]\)

  • “bird” → \([0, 0, 1]\)

    Given a corpus, extract unique words and assign indices. For example:

corpus = ["I love NLP", "NLP is fun"]
vocab = sorted(set(" ".join(corpus).lower().split()))
print(vocab)
['fun', 'i', 'is', 'love', 'nlp']

Each word is mapped to a binary vector of length \( |V| \). For the vocabulary \( V = \{\text{fun, i, is, love, nlp}\} \):

| Word | One-Hot Vector      |
|------+---------------------|
| fun  | \([1, 0, 0, 0, 0]\) |
| i    | \([0, 1, 0, 0, 0]\) |
| is   | \([0, 0, 1, 0, 0]\) |
| love | \([0, 0, 0, 1, 0]\) |
| nlp  | \([0, 0, 0, 0, 1]\) |
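Here is a minimal sketch of building these vectors (using plain NumPy; scikit-learn's OneHotEncoder would work just as well):

import numpy as np

corpus = ["I love NLP", "NLP is fun"]
vocab = sorted(set(" ".join(corpus).lower().split()))  # ['fun', 'i', 'is', 'love', 'nlp']

# Row i of the identity matrix is the one-hot vector for vocab[i]
one_hot = {word: np.eye(len(vocab), dtype=int)[i] for i, word in enumerate(vocab)}

for word, vector in one_hot.items():
    print(word, vector)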

OK! Everything seems alright, but a few things are missing, right?

  • Although it’s easy to implement using libraries like scikit-learn, what if the vocabulary is larger than we expected, something like \( 10^{5} - 10^{6} \) words? Every vector then has that many dimensions.

  • With an ever-increasing vocabulary, the sparsity of the matrices also increases, leading to computational complexity and overfitting issues.

  • Machine learning algorithms need a fixed input size, but here the input size cannot be fixed in advance: it depends on the sentence length.

  • What if a word appears at test time that was not seen during training? Then we hit the out-of-vocabulary (OOV) problem and cannot represent it at all.

  • One more major limitation is that semantic meaning is lost: words with similar meanings (e.g., “king” and “queen”) are treated as completely unrelated, while unrelated words (e.g., “king” and “apple”) are equally dissimilar.

    For example, let’s consider:

    Cosine similarity between two one-hot vectors \( \mathbf{o}_i \) and \( \mathbf{o}_j \): \[ \text{cosine-sim}(\mathbf{o}_i, \mathbf{o}_j) = \frac{\mathbf{o}_i \cdot \mathbf{o}_j}{\|\mathbf{o}_i\| \|\mathbf{o}_j\|} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases} \]

Consider three words and their one-hot encodings:

  • “king” → \([1, 0, 0, 0]\)
  • “queen” → \([0, 1, 0, 0]\)
  • “apple” → \([0, 0, 1, 0] \)
  • “orange” → \([0, 0, 0, 1]\)
import numpy as np

words = ["king", "queen", "apple", "orange"]
vectors = np.eye(4)  # One-hot vectors

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print("| Word 1 | Word 2 | Cosine Similarity |")
print("|--------+--------+-------------------|")
for i in range(4):
    for j in range(i+1, 4):
        sim = cosine_sim(vectors[i], vectors[j])
        print(f"| {words[i]} | {words[j]} | {sim:.2f} |")
| Word 1 | Word 2 | Cosine Similarity |
|--------+--------+-------------------|
| king | queen | 0.00 |
| king | apple | 0.00 |
| king | orange | 0.00 |
| queen | apple | 0.00 |
| queen | orange | 0.00 |
| apple | orange | 0.00 |

Difference between the two semantic representations:

Semantic Space (One-Hot Encoding)      Ideal Semantic Space
      ▲                                     ▲
      |                                     |
      o king                                o king
      |                                     | \
      |                                     |  o queen
      o queen                               | /
      |                                     o apple
      o apple                               \
      |                                       \
      o orange                                 o orange
      └────────────────►                      └───────────────►
       No semantic relationships              Clustered by meaning

We can see that there is no similarity even between related words. To reduce these limitations, a new method called *BAG OF WORDS* was introduced.

BAG OF WORDS

The Bag-of-Words model represents text as an unordered collection of words, ignoring grammar and word order. Documents are encoded as vectors of word frequencies or presence indicators.

For a vocabulary \( V = \{w_1, w_2, \dots, w_{|V|}\} \), a document \( D \) is represented as: \[ \text{BoW}(D) = \left[ f(w_1, D), f(w_2, D), \dots, f(w_{|V|}, D) \right] \] where \( f(w_i, D) \) is the frequency (or binary indicator) of word \( w_i \) in \( D \).

Let’s consider a corpus:

  1. “I love machine learning”
  2. “AI is fascinating and powerful”
  3. “Machine learning and AI are powerful”

\[ V = \{\text{I, love, machine, learning, AI, is, fascinating, and, powerful, are}\} \quad(|V|=10)\]

| Document | Vector Representation (Frequency)  |
|----------+------------------------------------|
| Doc 1    | \([1, 1, 1, 1, 0, 0, 0, 0, 0, 0]\) |
| Doc 2    | \([0, 0, 0, 0, 1, 1, 1, 1, 1, 0]\) |
| Doc 3    | \([0, 0, 1, 1, 1, 0, 0, 1, 1, 1]\) |
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I love machine learning",
    "AI is fascinating and powerful",
    "Machine learning and AI are powerful"
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus).toarray()
vocab = vectorizer.get_feature_names_out()

print("| Word        | Doc 1 | Doc 2 | Doc 3 |")
print("|-------------+-------+-------+-------|")
for i, word in enumerate(vocab):
    print(f"| {word.ljust(10)} | {X[0][i]}     | {X[1][i]}     | {X[2][i]}     |")
| Word        | Doc 1 | Doc 2 | Doc 3 |
|-------------+-------+-------+-------|
| ai         | 0     | 1     | 1     |
| and        | 0     | 1     | 1     |
| are        | 0     | 0     | 1     |
| fascinating | 0     | 1     | 0     |
| is         | 0     | 1     | 0     |
| learning   | 1     | 0     | 1     |
| love       | 1     | 0     | 0     |
| machine    | 1     | 0     | 1     |
| powerful   | 0     | 1     | 1     |

One advantage over one-hot encoding is that the representation size is fixed (always \( |V| \)), which solves the input-size problem for machine learning models.

Limitations:

BoW treats documents as independent word counts, ignoring:

  • Word order and context

  • Synonyms (e.g., “AI” vs “machine learning”)

  • Semantic relationships between terms

    Let’s see it with cosine similarity (a short code sketch for reproducing these document similarities appears after the diagram below):

    • Doc 1: “I love machine learning”
    • Doc 3: “Machine learning and AI are powerful”
    • Actual similarity (BoW): Partial overlap
    • Ideal similarity: High (both discuss ML/AI)
|       | Doc 1 | Doc 2 | Doc 3 |
|-------+-------+-------+-------|
| Doc 1 | 1.00  | 0.00  | 0.45  |
| Doc 2 | 0.00  | 1.00  | 0.52  |
| Doc 3 | 0.45  | 0.52  | 1.00  |
BoW Semantic Space                  Ideal Semantic Space
      ▲                                     ▲
      |                                     |
      o Doc1 (ML)                          o Doc1
      |                                     | \
      |                                     |  o Doc3
      o Doc2 (AI)                          | /
      |                                     o Doc2
      o Doc3 (ML+AI)                       |
      └────────────────►                   └───────────────►
      Distance based on word overlap        Distance based on meaning
  • The OOV problem still persists with BoW as well.
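As promised, here is a minimal sketch for computing the document similarities above (assuming scikit-learn; note that CountVectorizer's default tokenizer drops the single-character token "I", so the exact values may differ slightly from the table):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "I love machine learning",
    "AI is fascinating and powerful",
    "Machine learning and AI are powerful"
]

# Bag-of-Words count matrix: 3 documents x |V| vocabulary terms
X = CountVectorizer().fit_transform(corpus)

# Pairwise cosine similarity between the document vectors
sims = cosine_similarity(X)

for row in sims:
    print(" ".join(f"{value:.2f}" for value in row))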

TF-IDF

TF-IDF is a statistical measure to evaluate the importance of a word in a document relative to a corpus. It balances:

  1. Term Frequency (TF): How often a word appears in a document.

For a word \( w \) in document \( D \): \[ \text{TF}(w, D) = \frac{\text{Count of } w \text{ in } D}{\text{Total words in } D} \]

  2. Inverse Document Frequency (IDF): How rare/important the word is across documents.

For a word \( w \) in corpus \( C \): \[ \text{IDF}(w, C) = \log\left(\frac{\text{Total documents in } C}{\text{Documents containing } w}\right) \]

  • TF-IDF Weight

\[ \text{TF-IDF}(w, D, C) = \text{TF}(w, D) \times \text{IDF}(w, C) \]

Let’s see it with an example:

Document 1: “the cat sat on the mat”
Document 2: “the dog sat on the log”
Document 3: “the cat and the dog”

  • Compute TF
(Doc 1 and Doc 2 each contain 6 words; Doc 3 contains 5.)

| Word | Doc 1 | Doc 2 | Doc 3 |
|------+-------+-------+-------|
| the  | 2/6   | 2/6   | 2/5   |
| cat  | 1/6   | 0     | 1/5   |
| sat  | 1/6   | 1/6   | 0     |
| on   | 1/6   | 1/6   | 0     |
| mat  | 1/6   | 0     | 0     |
| dog  | 0     | 1/6   | 1/5   |
| log  | 0     | 1/6   | 0     |
| and  | 0     | 0     | 1/5   |
  • Compute IDF

\[ \text{IDF}(w) = \log\left(\frac{3}{\text{df}(w)}\right) \]

| Word | Document Frequency (df) | IDF                         |
|------+-------------------------+-----------------------------|
| the  | 3                       | \(\log(3/3) = 0\)           |
| cat  | 2                       | \(\log(3/2) \approx 0.176\) |
| sat  | 2                       | \(\log(3/2) \approx 0.176\) |
| on   | 2                       | \(\log(3/2) \approx 0.176\) |
| mat  | 1                       | \(\log(3/1) \approx 0.477\) |
| dog  | 2                       | \(\log(3/2) \approx 0.176\) |
| log  | 1                       | \(\log(3/1) \approx 0.477\) |
| and  | 1                       | \(\log(3/1) \approx 0.477\) |
  • Compute TF-IDF
| Word | Doc 1                              | Doc 2                              | Doc 3                              |
|------+------------------------------------+------------------------------------+------------------------------------|
| the  | \(0 \times 2/6 = 0\)               | 0                                  | 0                                  |
| cat  | \(0.176 \times 1/6 \approx 0.029\) | 0                                  | \(0.176 \times 1/5 \approx 0.035\) |
| sat  | \(0.176 \times 1/6 \approx 0.029\) | \(0.176 \times 1/6 \approx 0.029\) | 0                                  |
| on   | \(0.176 \times 1/6 \approx 0.029\) | \(0.176 \times 1/6 \approx 0.029\) | 0                                  |
| mat  | \(0.477 \times 1/6 \approx 0.080\) | 0                                  | 0                                  |
| dog  | 0                                  | \(0.176 \times 1/6 \approx 0.029\) | \(0.176 \times 1/5 \approx 0.035\) |
| log  | 0                                  | \(0.477 \times 1/6 \approx 0.080\) | 0                                  |
| and  | 0                                  | 0                                  | \(0.477 \times 1/5 \approx 0.095\) |
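Here is a minimal sketch of this hand computation in plain Python (using a base-10 logarithm to match the values above; the scikit-learn implementation shown next uses a smoothed variant of IDF, so its numbers will differ):

import math

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat and the dog"
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})
N = len(docs)

def tf(word, tokens):
    # Term frequency: count of the word divided by total words in the document
    return tokens.count(word) / len(tokens)

def idf(word):
    # Inverse document frequency: log10(N / number of documents containing the word)
    df = sum(word in tokens for tokens in tokenized)
    return math.log10(N / df)

for word in vocab:
    weights = [round(tf(word, tokens) * idf(word), 3) for tokens in tokenized]
    print(word, weights)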

We can implement it using scikit-learn

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat and the dog"
]

vectorizer = TfidfVectorizer(norm=None)  # Disable L2 normalization
X = vectorizer.fit_transform(corpus).toarray()
features = vectorizer.get_feature_names_out()

print("| Word | Doc 1 | Doc 2 | Doc 3 |")
print("|------|-------|-------|-------|")
for i, word in enumerate(features):
    print(f"| {word.ljust(4)} | {X[0][i]:.3f} | {X[1][i]:.3f} | {X[2][i]:.3f} |")
| Word | Doc 1 | Doc 2 | Doc 3 |
|------|-------|-------|-------|
| and  | 0.000 | 0.000 | 1.693 |
| cat  | 1.288 | 0.000 | 1.288 |
| dog  | 0.000 | 1.288 | 1.288 |
| log  | 0.000 | 1.693 | 0.000 |
| mat  | 1.693 | 0.000 | 0.000 |
| on   | 1.288 | 1.288 | 0.000 |
| sat  | 1.288 | 1.288 | 0.000 |
| the  | 2.000 | 2.000 | 2.000 |

Note that these values differ from the hand-computed table above: scikit-learn's TfidfVectorizer uses raw term counts for TF and a smoothed, natural-log IDF, \( \text{IDF}(w) = \ln\frac{1 + N}{1 + \text{df}(w)} + 1 \), so even a word that appears in every document (like “the”) keeps a non-zero weight.
  • Key advantages

    • Easy to implement

    • Fixed representation size, as with BoW

    • Word importance is captured, which is not the case with one-hot encoding or BoW. How? Looking at the full TF-IDF weight table, the words that are genuinely informative and change the context of a particular sentence receive non-zero weights, while words that appear everywhere get weights close to zero. In other words, TF-IDF gives us word importance, unlike the other two methods.

  • Limitations

    • The matrix is still sparse, and its size depends on the corpus/vocabulary size.

    • The OOV problem still persists.

Deep Learning Trained

This second type of word embedding comprises deep-learning-trained techniques, which solve most of the limitations of the count/frequency techniques. A well-known technique in this family is Word2Vec.

  • Word2Vec: Word2vec is a technique for natural language processing published in 2013. The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. As the name implies, word2vec represents each distinct word with a particular list of numbers called a vector.

Key Architectures:

1. Continuous Bag of Words (CBOW)

Predicts a target word given its context words. Ideal for small datasets.

  • Mathematical Formulation

    • Input: Context words \( \{w_{t-k}, …, w_{t-1}, w_{t+1}, …, w_{t+k}\} \)
    • Output: Target word \( w_t \)

    Let:

    • \( \mathbf{V} \in \mathbb{R}^{|V|\times d} \): Input embedding matrix
    • \( \mathbf{U} \in \mathbb{R}^{d\times |V|} \): Output embedding matrix
    • \( \mathcal{C}_t \): Context window around \( w_t \)

    Objective function:

    $$ P(w_t \mid \mathcal{C}_t) = \frac{\exp(\mathbf{u}_{w_t}^{\top} \bar{\mathbf{v}})}{\sum_{w=1}^{|V|} \exp(\mathbf{u}_w^{\top} \bar{\mathbf{v}})} $$

    where \( \bar{\mathbf{v}} = \frac{1}{2k} \sum_{w \in \mathcal{C}_t} \mathbf{v}_w \)

2. Skip-Gram

Predicts context words given a target word. Better for large datasets and rare words.

  • Mathematical Formulation

    • Input: Target word \( w_t \)
    • Output: Context words \( \{w_{t-k}, …, w_{t+k}\} \)

    Objective function (assuming independence):

    $$ \prod_{w_c \in \mathcal{C}_t} P(w_c \mid w_t) = \prod_{w_c \in \mathcal{C}_t} \frac{\exp(\mathbf{u}_{w_c}^{\top} \mathbf{v}_{w_t})}{\sum_{w=1}^{|V|} \exp(\mathbf{u}_{w}^{\top} \mathbf{v}_{w_t})} $$
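Both architectures can be trained with the gensim library; here is a minimal sketch (assuming gensim 4.x; the toy corpus is purely illustrative, since meaningful embeddings require a large corpus):

from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "cat", "chases", "the", "mouse"],
]

# sg=0 selects CBOW, sg=1 selects Skip-Gram
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=100)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

print(skipgram.wv["king"].shape)         # (50,) dense vector for "king"
print(skipgram.wv.most_similar("king"))  # nearest neighbours in the embedding space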

Core Mathematical Components

    1. Softmax Function

    Converts dot products into probability distributions:

    $$ P(w_j | w_i) = \frac{\exp(\mathbf{u}_j^\top \mathbf{v}_i)}{\sum_{k=1}^{|V|} \exp(\mathbf{u}_k^\top \mathbf{v}_i)} $$

    where \( \mathbf{v}_i \) = input embedding of \( w_i \), \( \mathbf{u}_j \) = output embedding of \( w_j \)

    2. Negative Log-Likelihood Loss

    Loss function for training: \[ \mathcal{L} = -\sum_{t=1}^T \sum_{w_c \in \mathcal{C}_t} \log P(w_c | w_t) \]

    3. Hierarchical Softmax

    Efficient approximation using binary Huffman tree:

    • Each leaf node represents a word \( w \)
    • Path from root to \( w \) defines binary decisions

    $$ P(w | w_i) = \prod_{n=1}^{\text{path}(w)} \sigma(\llbracket \text{dir}(n) \rrbracket \cdot \mathbf{u}_n^\top \mathbf{v}_i) $$

    where \( \sigma \) = sigmoid function, \( \llbracket \cdot \rrbracket \) = +1/-1 indicator

    4. Negative Sampling

    Approximates softmax by sampling negative examples: $$ \mathcal{L} = -\log \sigma(\mathbf{u}_c^\top \mathbf{v}_i) - \sum_{k=1}^K \mathbb{E}_{w_k \sim P_n(w)} \log \sigma(-\mathbf{u}_k^\top \mathbf{v}_i) $$

    where \( K \) = number of negative samples, \( P_n(w) \propto \text{freq}(w)^{3/4} \)
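To make the negative-sampling loss concrete, here is a tiny NumPy sketch with random toy embeddings (purely illustrative and untrained; the \( K = 5 \) negative words are assumed to have been sampled already):

import numpy as np

rng = np.random.default_rng(0)
d = 4                            # toy embedding dimension

v_i = rng.normal(size=d)         # input embedding of the target word
u_pos = rng.normal(size=d)       # output embedding of an observed context word
U_neg = rng.normal(size=(5, d))  # output embeddings of K = 5 sampled negative words

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Negative-sampling loss for a single (target, context) pair:
# -log sigma(u_c . v_i) - sum_k log sigma(-u_k . v_i)
loss = -np.log(sigmoid(u_pos @ v_i)) - np.sum(np.log(sigmoid(-(U_neg @ v_i))))
print(loss)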

Training Dynamics

    1. Embedding Learning
    • Input embeddings \( \mathbf{V} \) capture word meaning
    • Output embeddings \( \mathbf{U} \) capture contextual usage
    2. Gradient Updates

    For parameter \( \theta \in \{\mathbf{V}, \mathbf{U}\} \): \[ \theta \leftarrow \theta - \eta \frac{\partial \mathcal{L}}{\partial \theta} \] Where gradients are computed through backpropagation.

    3. Context Window

    Sliding window of size \( k \) defines local context:

    • A typical window size is \( k = 5 \) (5 context words before and 5 after the target)
    • Larger windows capture more topical similarity

Vector Space Properties

    1. Additive Compositionality

    Linear analogies emerge in embedding space: \[ \mathbf{v}_{\text{king}} - \mathbf{v}_{\text{man}} + \mathbf{v}_{\text{woman}} \approx \mathbf{v}_{\text{queen}} \]

    2. Cosine Similarity

    Semantic similarity measured by: \[ \text{sim}(w_i, w_j) = \cos(\theta) = \frac{\mathbf{v}_i \cdot \mathbf{v}_j}{\|\mathbf{v}_i\| \|\mathbf{v}_j\|} \]

    3. Dimensionality

    Typical embedding dimensions:

    • \( d = 300 \) (balance between expressiveness and efficiency)
    • Higher \( d \): Better performance but risk of overfitting
  • Example: Semantic Relationships
    | Relationship Type | Example                     | Vector Operation                                 |
    |-------------------+-----------------------------+--------------------------------------------------|
    | Gender            | king - man + woman = queen  | \( \mathbf{v}_w - \mathbf{v}_m + \mathbf{v}_f \) |
    | Plurality         | apple - apples + cars = car | \( \mathbf{v}_s - \mathbf{v}_p + \mathbf{v}_t \) |
    | Verb-Noun         | fly - bird + fish = swim    | \( \mathbf{v}_a - \mathbf{v}_b + \mathbf{v}_c \) |
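Analogies like those in the table can be checked with pretrained vectors. A short sketch using gensim's downloader follows (it assumes the gensim-data package and network access; it loads GloVe vectors for convenience, but a pretrained word2vec model such as "word2vec-google-news-300" would work the same way):

import gensim.downloader as api

# Download (once) and load 100-dimensional pretrained vectors as KeyedVectors
wv = api.load("glove-wiki-gigaword-100")

# king - man + woman is closest to ... ?
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Cosine similarity between semantically related words
print(wv.similarity("king", "queen"))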

Limitations

    1. Static Embeddings
    • Single vector per word (fails to handle polysemy):

    \[ \text{“bank”} = \begin{cases} \text{financial} \\ \text{river} \end{cases} \rightarrow \text{single vector} \]

    2. Context Window Limitation
    • Fixed window size ignores global document context
    • Fails to capture long-range dependencies
    3. Corpus Dependency
    • Embeddings reflect biases in training data

    \[ \mathbf{v}_{\text{doctor}} \approx \mathbf{v}_{\text{he}} \quad \text{(gender bias)} \]

    4. Out-of-Vocabulary Words
    • Cannot handle new words not in training corpus

Comparison with Other Methods

| Property          | Word2Vec       | GloVe                | BERT             |
|-------------------+----------------+----------------------+------------------|
| Context           | Local (window) | Global (co-occur)    | Full sentence    |
| Polysemy Handling | No             | No                   | Yes              |
| Training          | Shallow NN     | Matrix factorization | Deep Transformer |
| Embedding Type    | Static         | Static               | Contextual       |

Theoretical Insights

    1. Connection to Matrix Factorization

    Word2Vec (skip-gram with negative sampling) implicitly factorizes a shifted PMI matrix: \[ (\mathbf{V}\mathbf{U}^\top)_{ij} = \log \frac{P(w_i, w_j)}{P(w_i)P(w_j)} - \log k \] where \( k \) = number of negative samples

    2. Energy-Based Models

    Word2Vec can be viewed as energy minimization: \[ E(w_i, w_j) = -\mathbf{v}_i^\top \mathbf{u}_j \] Minimize energy for observed word-context pairs.

    3. Skip-Gram as Neural Network
    • Input layer: One-hot encoded word
    • Hidden layer: Embedding lookup (no activation)
    • Output layer: Softmax over context words