Week 6

MACS 30405: Exploring Cultural Space
University of Chicago

Announcements

  • Readings for week 7 have been released.

If you haven’t discussed your final project with me, now is a good time.

Saussure recap

Sign: the arbitrary relationship between the signifier and the signified.

Language does not have concrete units

  • Language conditions thought: Thought without language is a “shapeless and indistinct mass.”

  • Language delimits units.

Signs as a system

  • Words do not have preexisting meaning. Language carves out meaning.
  • Signs are both arbitrary and differential: a sign is the counterpart of other signs.

Meaning of words

  • Words acquire value based on co-occurrence with other words that come before and after.
  • Words that have something in common are associated.

Associative words

Discussion questions

  • Does Saussurean structural linguistics fit into dimensional thinking?
    • If yes, what are the dimensions?
  • What is missing in this structuralist way of thinking about language?
    • If Saussure is right that there is no meaning before sounds, how does meaning arise? How can a synchronic model account for a process that turns nothing into something?
    • Is learning the conjunction of words (and predicting words given contexts) all we need in order to acquire meaning?
    • How does a novel word acquire its meaning when it is first introduced?

Break

Word-embedding models

  • Word co-occurrences can be thought of as a matrix.
  • As demonstrated in class, you can try to apply PCA or CA to a word-document or word-author matrix.
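
Below is a minimal scikit-learn sketch of this idea on a made-up toy corpus: CountVectorizer builds the word-document counts and TruncatedSVD stands in for the PCA/LSI step (the corpus, component count, and variable names are illustrative assumptions, not course materials).

    # LSI-style dimension reduction on a word-document matrix (toy example).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = [
        "the king rules the kingdom",
        "the queen rules the kingdom",
        "the farmer plows the field",
        "the farmer harvests the field",
    ]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)   # documents x words (sparse counts)
    word_doc = X.T                       # words x documents

    # TruncatedSVD on the sparse count matrix is the classic LSI recipe.
    svd = TruncatedSVD(n_components=2, random_state=0)
    word_vectors = svd.fit_transform(word_doc)   # words x 2 latent dimensions

    for word, idx in vectorizer.vocabulary_.items():
        print(f"{word:>10s}: {word_vectors[idx].round(2)}")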

Word-embedding 1.0: Latent Semantic Indexing (LSI)

Vector algebra can then be performed in word-embedding spaces

  • \(\text{Cosine Similarity}(w_1, w_2) = \frac{\vec{w_1}\cdot\vec{w_2}}{||\vec{w_1}||\,||\vec{w_2}||}\)

  • \(\overrightarrow{king} + \overrightarrow{woman} - \overrightarrow{man} \approx \overrightarrow{queen}\)
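
A small numpy illustration of both operations, with hand-set three-dimensional vectors standing in for learned embeddings (real embeddings are estimated from data, not assigned like this):

    import numpy as np

    def cosine_similarity(w1, w2):
        return w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2))

    vectors = {
        "king":  np.array([0.9, 0.8, 0.1]),
        "queen": np.array([0.9, 0.1, 0.8]),
        "man":   np.array([0.1, 0.9, 0.1]),
        "woman": np.array([0.1, 0.1, 0.9]),
    }

    # king - man + woman should land closest to queen.
    analogy = vectors["king"] - vectors["man"] + vectors["woman"]
    best = max(vectors, key=lambda w: cosine_similarity(analogy, vectors[w]))
    print(best, cosine_similarity(analogy, vectors["queen"]))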

Modern variant: Language models

Neural network models

Word2Vec

  • Invented by Tomáš Mikolov’s team in 2013
  • The goal is not prediction but to learn high-quality word embeddings that have good geometric properties.
  • Fast, efficient and shallow.
  • Variants: CBOW and skip-gram
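
A minimal training sketch with gensim follows (assuming gensim >= 4.0, where the dimensionality argument is named vector_size; the toy corpus and hyperparameter values are placeholders):

    from gensim.models import Word2Vec

    sentences = [
        ["the", "king", "rules", "the", "kingdom"],
        ["the", "queen", "rules", "the", "kingdom"],
        ["the", "farmer", "plows", "the", "field"],
    ]

    model = Word2Vec(
        sentences,
        vector_size=100,   # number of embedding dimensions
        window=5,          # context window size k
        min_count=1,       # keep every word in this tiny corpus
        sg=1,              # 1 = skip-gram, 0 = CBOW
        epochs=50,
    )

    print(model.wv["king"].shape)          # (100,)
    print(model.wv.most_similar("king"))   # nearest neighbors by cosine similarity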

Skip-gram

Task: find vector representation of words to maximize \[\begin{equation} \frac{1}{N} \sum _{i=1}^{N} \sum_{j=-k}^{k} \log p(w_{i+j}|w_{i}) \end{equation}\] where \(k\) is a window size.

\[\begin{equation} p(w_{i}|w_{j}) = \frac{\exp(\vec{u_{w_{i}}} \cdot \vec{v_{w_{j}}})}{\sum_{l=1}^{|\mathcal{V}|} \exp(\vec{u_{w_l}} \cdot \vec{v_{w_{j}}})}, \end{equation}\]

where \(v_{w_{j}}, u_{w_{i}} \in \mathbb{R}^d\) are the input (context) and output (word) vector representations of the context word \(w_j\) and focal word \(w_i\), and \(\mathcal{V}\) is the vocabulary, i.e., the set of all unique words in the corpus.

  • The softmax function is essentially the same link function that is used in a multinomial logistic regression.
    • You can think of the task as a multi-class prediction problem: given a context word, among all possible words, which one has the highest chance of occurring next to the given word?
  • However, in practice, Mikolov’s skip-gram model never actually computes the full softmax during training because doing so is very costly.
    • The sum in the denominator runs over the entire vocabulary, so it contains far too many terms (see the sketch below).
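
The softmax in the equation above, written out directly in numpy for a toy vocabulary. U and V_in are stand-in names for the output (word) and input (context) matrices, and the sizes are arbitrary; note that the denominator already requires one dot product per vocabulary word, which is the cost the approximations below avoid.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, d = 10_000, 100
    U = rng.normal(scale=0.1, size=(vocab_size, d))      # output (word) vectors u_w
    V_in = rng.normal(scale=0.1, size=(vocab_size, d))   # input (context) vectors v_w

    def p_word_given_context(i, j):
        """p(w_i | w_j): softmax over the whole vocabulary."""
        scores = U @ V_in[j]           # one dot product per vocabulary word
        scores -= scores.max()         # numerical stability (cancels in the ratio)
        exp_scores = np.exp(scores)
        return exp_scores[i] / exp_scores.sum()

    print(p_word_given_context(42, 7))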

Approximation

  • Word2vec implements two ways to approximate the softmax cheaply: hierarchical softmax and negative sampling (see the sketch below).
  • Hierarchical softmax employs a Huffman tree to reduce the computation required for the softmax function.
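
In gensim's implementation, the choice between the two approximations is made with the hs and negative arguments (parameter names as in gensim >= 4.0; the toy corpus below is a placeholder):

    from gensim.models import Word2Vec

    sentences = [["the", "king", "rules", "the", "kingdom"],
                 ["the", "queen", "rules", "the", "kingdom"]]

    # Hierarchical softmax: hs=1, with negative sampling turned off.
    model_hs = Word2Vec(sentences, sg=1, hs=1, negative=0,
                        vector_size=50, min_count=1)

    # Negative sampling: hs=0 and k noise words per positive example.
    model_ns = Word2Vec(sentences, sg=1, hs=0, negative=5,
                        vector_size=50, min_count=1)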

Estimation

  • The model has two layers, an input (context) layer and an output (word) layer.

    • Each layer is represented as a matrix. The rows are words (either as context words or as focal words). The columns are dimensions.
    • The dimensions are not assumed to be interpretable. Because language is inherently high-dimensional, we usually set the number of dimensions to a moderately large number (100 to 400).
  • At initialization, words are given random locations in the input and output vector spaces. Then, through stochastic gradient descent (SGD), their locations are updated to maximize the objective function. (This is where learning takes place; see the sketch after this list.)

  • Analysis and interpretation are usually performed on the output (word) layer.
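
A schematic numpy sketch of the estimation step: two randomly initialized layers (matrices) and a single SGD update for one (context word, focal word) pair using negative sampling. The sizes, learning rate, and uniform noise draw are simplifications for illustration, not word2vec's exact settings.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, d, lr, n_neg = 5_000, 100, 0.025, 5

    V_in = rng.uniform(-0.5 / d, 0.5 / d, size=(vocab_size, d))   # input (context) layer
    U_out = rng.uniform(-0.5 / d, 0.5 / d, size=(vocab_size, d))  # output (word) layer

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sgd_step(context_id, word_id):
        """Push the observed word toward its context; push sampled noise words away."""
        neg_ids = rng.integers(0, vocab_size, size=n_neg)  # simplified noise sampling
        v = V_in[context_id]
        grad_v = np.zeros(d)
        for idx, label in [(word_id, 1.0)] + [(n, 0.0) for n in neg_ids]:
            g = lr * (label - sigmoid(U_out[idx] @ v))     # gradient of the log-likelihood
            grad_v += g * U_out[idx]
            U_out[idx] += g * v
        V_in[context_id] += grad_v

    sgd_step(context_id=7, word_id=42)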

Evaluation

  • Because prediction is not our goal (and there is no reason to believe our shallow neural nets can outcompete more sophisticated deep neural-network models, such as Transformers, on prediction tasks), embeddings learned from word2vec models are usually evaluated by their performance on analogy tests (see the sketch below).

  • These shallow neural nets are best suited for geometric interpretation.
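
One common way to run these analogy tests is sketched below with gensim. This assumes a trained model named model, as in the training sketch above, fit on a corpus large enough for the test words to appear in the vocabulary; questions-words.txt is the standard Google analogy set bundled with gensim's test utilities.

    from gensim.test.utils import datapath

    # Solve one analogy: king - man + woman ≈ ?
    print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

    # Score the full analogy test set; returns overall accuracy plus per-section results.
    accuracy, sections = model.wv.evaluate_word_analogies(datapath("questions-words.txt"))
    print(accuracy)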