Word2Vec

Try Interactive Demo

Teaching Machines to Understand Language: A Deep Dive into Word2Vec

What is “King” minus “Man” plus “Woman”? If your answer is “Queen,” congratulations—you’ve grasped the essence of Word2Vec!

In 2013, Mikolov and colleagues at Google proposed Word2Vec, a milestone in natural language processing. It maps words to mathematical vectors so that semantically similar words sit close to each other in vector space, and it even enables “semantic arithmetic.”

Why Do We Need Word Vectors?

In traditional NLP, words are usually represented using One-Hot Encoding:

  • “cat” = [1, 0, 0, 0, …]
  • “dog” = [0, 1, 0, 0, …]
  • “apple” = [0, 0, 1, 0, …]

This representation has two serious problems:

  1. Dimension explosion: If the vocabulary has 100,000 words, each word becomes a 100,000-dimensional vector
  2. No semantics: Any two word vectors are orthogonal, so the representation cannot express that “cat” and “dog” are more similar than “cat” and “apple”

Word2Vec solves both problems: it represents each word with a low-dimensional dense vector (typically 100-300 dimensions) and gives semantically similar words similar vectors.
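To make the contrast concrete, here is a minimal numpy sketch (the dense vectors are made-up 4-dimensional values, purely for illustration): one-hot vectors are always orthogonal, while dense vectors can place “cat” closer to “dog” than to “apple”.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# One-hot vectors: every pair of distinct words is orthogonal.
cat_onehot   = np.array([1., 0., 0., 0.])
dog_onehot   = np.array([0., 1., 0., 0.])
apple_onehot = np.array([0., 0., 1., 0.])
print(cosine(cat_onehot, dog_onehot))    # 0.0 -- "cat" is no closer to "dog" than to "apple"
print(cosine(cat_onehot, apple_onehot))  # 0.0

# Dense vectors (invented values) can encode graded similarity.
cat_dense   = np.array([0.8, 0.1, 0.6, 0.2])
dog_dense   = np.array([0.7, 0.2, 0.5, 0.3])
apple_dense = np.array([0.1, 0.9, 0.0, 0.7])
print(cosine(cat_dense, dog_dense))      # ~0.98, very similar
print(cosine(cat_dense, apple_dense))    # ~0.26, much less similar
```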

Core Idea: Distributional Hypothesis

Word2Vec is based on a simple yet profound hypothesis:

“A word’s meaning is determined by its context”

If two words often appear in similar contexts, their meanings should be similar. For example:

  • “I have a cat, it’s very cute”
  • “I have a dog, it’s very cute”

“Cat” and “dog” appear in almost identical contexts, so they should have similar word vectors.

Two Architectures

Word2Vec provides two training architectures:

1. CBOW (Continuous Bag of Words)

Predict the center word from context.

Input: [I, have, a, ___, it's, very, cute]
Predict: cat

CBOW averages the context word vectors and then predicts the center word. It works well for learning high-frequency words.
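A minimal numpy sketch of the CBOW forward pass, with a toy vocabulary and randomly initialized weights (for illustration only, not the original implementation): the context embeddings are averaged into a hidden vector, projected to one score per vocabulary word, and normalized with a softmax over candidate center words.

```python
import numpy as np

np.random.seed(0)
vocab = ["I", "have", "a", "cat", "it's", "very", "cute"]
word2id = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8                       # vocabulary size, embedding dimension

W_in = np.random.randn(V, D) * 0.1         # input (context) embedding matrix
W_out = np.random.randn(D, V) * 0.1        # output (center-word) weight matrix

def cbow_probs(context_words):
    """Return P(center word | context) for every word in the vocabulary."""
    ids = [word2id[w] for w in context_words]
    h = W_in[ids].mean(axis=0)             # hidden layer = average of context vectors
    scores = h @ W_out                     # one score per vocabulary word
    exp = np.exp(scores - scores.max())    # numerically stable softmax
    return exp / exp.sum()

context = ["I", "have", "a", "it's", "very", "cute"]
probs = cbow_probs(context)
print(vocab[int(np.argmax(probs))])        # the (untrained) model's guess for the blank
```

Training adjusts W_in and W_out so that the true center word (“cat”) gets high probability.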

2. Skip-gram

Predict context from the center word.

Input: cat
Predict: [I, have, a, it's, very, cute]

Skip-gram uses one word to predict the surrounding words. It works better for rare words and small datasets.
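Both architectures get their training data by sliding a context window over the corpus. Here is a small illustrative sketch (window size 2, hypothetical helper name) of how Skip-gram turns a sentence into (center word, context word) pairs:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) pairs within a fixed-size window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["I", "have", "a", "cat", "it's", "very", "cute"]
for center, context in skipgram_pairs(sentence):
    print(center, "->", context)
# e.g. the center word "cat" pairs with "have", "a", "it's" and "very"
```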

Training Process

Using Skip-gram as an example:

  1. Build training samples: For each word, generate (center word, context word) pairs
  2. Define model: Two-layer neural network (input→hidden→output)
  3. Optimization objective: Maximize the probability of context words appearing
$$\max \sum_{t=1}^{T} \sum_{-c \leq j \leq c,\ j \neq 0} \log P(w_{t+j} \mid w_t)$$
  4. Compute probability: Using softmax (sketched in code below)
$$P(w_O \mid w_I) = \frac{\exp(v'_{w_O} \cdot v_{w_I})}{\sum_{w=1}^{W} \exp(v'_w \cdot v_{w_I})}$$
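A short numpy sketch of the softmax in step 4, using the same notation as the formula: v holds the input (center-word) vectors, v_out the output vectors v', and the denominator sums over the whole vocabulary of size W (random toy matrices, for illustration only).

```python
import numpy as np

np.random.seed(1)
W, D = 10, 16                            # vocabulary size, embedding dimension
v = np.random.randn(W, D) * 0.1          # input embeddings  v_w
v_out = np.random.randn(W, D) * 0.1      # output embeddings v'_w

def p_context_given_center(o, i):
    """P(w_O | w_I) = exp(v'_O . v_I) / sum_w exp(v'_w . v_I)"""
    scores = v_out @ v[i]                # dot product of v_I with every output vector
    exp = np.exp(scores - scores.max())  # numerically stable softmax
    return exp[o] / exp.sum()

print(p_context_given_center(o=3, i=7))  # probability that word 3 appears near word 7
```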

Optimization Tricks

Computing the softmax directly is too slow because it requires summing over the entire vocabulary, so Word2Vec uses two key optimizations:

1. Negative Sampling

Instead of computing probabilities for every word, the model only learns to distinguish the true context word from a few randomly sampled “negatives.”

$$\log \sigma(v'_{w_O} \cdot v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-v'_{w_i} \cdot v_{w_I}) \right]$$
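A hedged numpy sketch of this objective (toy random embeddings; parameter updates are omitted): the true (center, context) pair gets a log-sigmoid reward, and k words drawn from a noise distribution get the log-sigmoid of the negated score.

```python
import numpy as np

np.random.seed(2)
W, D, k = 10, 16, 5                      # vocab size, dimension, negatives per pair
v = np.random.randn(W, D) * 0.1          # input embeddings  v_w
v_out = np.random.randn(W, D) * 0.1      # output embeddings v'_w

def log_sigmoid(x):
    return -np.log1p(np.exp(-x))

def neg_sampling_loss(center, context, noise_dist):
    """Negative of the objective: -[log s(v'_O.v_I) + sum_k log s(-v'_neg.v_I)]"""
    negatives = np.random.choice(W, size=k, p=noise_dist)
    pos = log_sigmoid(v_out[context] @ v[center])              # reward the true pair
    neg = log_sigmoid(-(v_out[negatives] @ v[center])).sum()   # penalize sampled noise words
    return -(pos + neg)

# Noise distribution P_n(w): unigram counts raised to the 3/4 power, as in the paper.
counts = np.random.randint(1, 100, size=W).astype(float)
noise = counts ** 0.75
noise /= noise.sum()
print(neg_sampling_loss(center=7, context=3, noise_dist=noise))
```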

2. Hierarchical Softmax

Organize the vocabulary with a Huffman tree, reducing the softmax complexity from O(V) to O(log V).
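In practice you rarely implement this by hand. A sketch of training with the gensim library (assuming gensim ≥ 4.0; sg selects Skip-gram vs CBOW, negative sets the number of negative samples, and hs=1 would switch to hierarchical softmax):

```python
from gensim.models import Word2Vec

# Tokenized toy corpus; real training needs far more text.
sentences = [
    ["i", "have", "a", "cat", "it", "is", "very", "cute"],
    ["i", "have", "a", "dog", "it", "is", "very", "cute"],
]

model = Word2Vec(
    sentences,
    vector_size=100,   # dimensionality of the word vectors
    window=5,          # context window size
    min_count=1,       # keep rare words in this tiny corpus
    sg=1,              # 1 = Skip-gram, 0 = CBOW
    negative=5,        # negative samples per positive pair (0 disables)
    workers=4,         # training threads
)

vec = model.wv["cat"]                      # the learned vector for "cat"
print(model.wv.similarity("cat", "dog"))   # cosine similarity between two words
```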

The Magic of Word Vectors

Trained word vectors exhibit amazing properties:

Semantic Similarity

  • cos(king, queen) ≈ 0.75
  • cos(cat, dog) ≈ 0.70
  • cos(cat, car) ≈ 0.15

Semantic Arithmetic

  • King - Man + Woman ≈ Queen
  • Paris - France + Japan ≈ Tokyo
  • Swimming - Swimmer + Athlete ≈ Running

This shows word vectors capture semantic relationships between words!
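These queries are one-liners with pre-trained vectors. A sketch using gensim's downloader (assuming the “word2vec-google-news-300” dataset name; the first call downloads a large model):

```python
import gensim.downloader as api

# Pre-trained Google News vectors (large download on first use).
wv = api.load("word2vec-google-news-300")

# Semantic arithmetic: king - man + woman ≈ ?
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Typically returns "queen" as the top match.

# Semantic similarity
print(wv.similarity("cat", "dog"))   # relatively high
print(wv.similarity("cat", "car"))   # much lower
```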

Applications

1. Downstream NLP Tasks

  • Text classification
  • Sentiment analysis
  • Named entity recognition

2. Similarity Computation

  • Finding synonyms
  • Document similarity
  • Recommendation systems

3. Transfer Learning

  • Pre-trained word vectors as the input layer of a neural network (see the sketch after this list)
  • Improve model performance on small datasets
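A minimal PyTorch sketch of this pattern (the pre-trained matrix here is a random stand-in aligned with a hypothetical vocabulary): the vectors initialize an nn.Embedding layer, which can be frozen or fine-tuned.

```python
import torch
import torch.nn as nn

# Stand-in for a real pre-trained matrix: one row per vocabulary word.
vocab_size, dim = 5000, 300
pretrained = torch.randn(vocab_size, dim)    # in practice, loaded from Word2Vec/gensim

# freeze=True keeps the vectors fixed; set freeze=False to fine-tune them.
embedding = nn.Embedding.from_pretrained(pretrained, freeze=True)

token_ids = torch.tensor([[12, 7, 431, 9]])  # a batch with one 4-token sentence
features = embedding(token_ids)              # shape (1, 4, 300), fed into the rest of the model
print(features.shape)
```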

4. Knowledge Discovery

  • Discover implicit relationships between words
  • Build semantic networks

Limitations of Word2Vec

  1. One vector per word: Cannot handle polysemy (is “apple” a fruit or a company?)
  2. Static representation: Vector doesn’t change with context
  3. Ignores word order: CBOW treats the context as a bag of words, so sequence information is lost
  4. OOV problem: Cannot handle words not seen during training

These problems were later solved by context-aware models like BERT and GPT.

From Word2Vec to Modern Models

| Model | Characteristics |
| --- | --- |
| Word2Vec | Static word vectors, fast and efficient |
| GloVe | Incorporates global statistics |
| ELMo | Context-dependent, LSTM-based |
| BERT | Bidirectional Transformer, pre-trained |
| GPT | Autoregressive, strong generation capability |

Although modern models are more powerful, Word2Vec’s core idea—capturing semantics with distributed representations—remains the foundation of NLP. Understanding Word2Vec is the first step to understanding modern NLP.