MACS 30405: Exploring Cultural Space
University of Chicago
Sign: the arbitrary relationship between signifier and signified.
Language conditions thought: Thought without language is a “shapeless and indistinct mass.”
Language delimits units.
\(\text{Cosine Similarity}(w_1, w_2) = \frac{\vec{w_1}\cdot\vec{w_2}}{||\vec{w_1}||\,||\vec{w_2}||}\); subtracting this quantity from 1 gives the cosine *distance*.
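Cosine similarity is the dot product of two vectors normalized by their lengths, so it measures direction rather than magnitude. A minimal sketch with toy vectors (the vectors here are illustrative, not learned embeddings):

```python
import numpy as np

def cosine_similarity(w1, w2):
    """Cosine of the angle between two word vectors (1 = same direction, 0 = orthogonal)."""
    return np.dot(w1, w2) / (np.linalg.norm(w1) * np.linalg.norm(w2))

# Toy 2-d vectors, chosen by hand for illustration.
a = np.array([1.0, 0.0])
b = np.array([1.0, 0.0])
c = np.array([0.0, 1.0])

print(cosine_similarity(a, b))  # 1.0 (identical direction)
print(cosine_similarity(a, c))  # 0.0 (orthogonal)
```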
\(\overrightarrow{king} + \overrightarrow{woman} - \overrightarrow{man} \approx \overrightarrow{queen}\)
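The analogy is usually solved by computing the target vector \(\overrightarrow{king} + \overrightarrow{woman} - \overrightarrow{man}\) and returning its nearest neighbor by cosine similarity, excluding the three query words (standard practice in analogy tests). A sketch with hypothetical toy embeddings constructed so the analogy holds; real word2vec vectors are learned from a corpus:

```python
import numpy as np

# Hypothetical toy embeddings (hand-picked, not trained).
emb = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, 0.1]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, 0.1]),
}

def nearest(target, exclude):
    """Word whose embedding has the highest cosine similarity to `target`."""
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return max((w for w in emb if w not in exclude),
               key=lambda w: cos(emb[w], target))

target = emb["king"] + emb["woman"] - emb["man"]
print(nearest(target, exclude={"king", "woman", "man"}))  # queen
```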
Task: find vector representation of words to maximize \[\begin{equation} \frac{1}{N} \sum _{i=1}^{N} \sum_{j=-k}^{k} \log p(w_{i+j}|w_{i}) \end{equation}\] where \(k\) is a window size.
\[\begin{equation} p(w_{i}|w_{j}) = \frac{\exp(\vec{u_{w_{i}}} \cdot \vec{v_{w_{j}}})}{\sum_{l=1}^{|\mathcal{V}|} \exp(\vec{u_{w_l}} \cdot \vec{v_{w_{j}}})}, \end{equation}\]
where \(v_{w_{j}}, u_{w_{i}} \in \mathbb{R}^d\) are the input (context) and output (word) vector representations of the context word \(w_j\) and focal word \(w_i\), and \(\mathcal{V}\) is the vocabulary, i.e., the set of all unique words in the corpus.
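The softmax above turns the dot products between one context vector and every output vector into a probability distribution over the vocabulary. A minimal sketch, with small random matrices standing in for learned parameters (the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 5, 3
U = rng.normal(size=(vocab_size, d))  # output (word) vectors u_w
V = rng.normal(size=(vocab_size, d))  # input (context) vectors v_w

def p_word_given_context(i, j):
    """Softmax probability of word i given context word j."""
    scores = U @ V[j]                 # u_{w_l} . v_{w_j} for every l
    scores -= scores.max()            # subtract max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[i]

probs = [p_word_given_context(i, 0) for i in range(vocab_size)]
print(sum(probs))  # ~1.0: a proper distribution over the vocabulary
```

The denominator sums over the whole vocabulary, which is exactly why the full softmax is expensive and motivates the negative-sampling objective below.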
\[\begin{equation} p(y = 1|w_{i},w_{j}) = \text{sigmoid}(\vec{u_{w_{i}}}\cdot\vec{v_{w_{j}}}) \end{equation}\]
where \(y\) is a binary variable indicating whether \(w_i\) and \(w_j\) co-occur in the training corpus. For each positive pair, the model also draws a fixed number of negative pairs, randomly sampled from the full vocabulary, and treats them as \(y = 0\) examples (negative sampling).
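The negative-sampling objective can be sketched for a single positive pair. Random vectors stand in for learned parameters, and \(k = 5\) negatives is an assumed (common) choice:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
d = 3
u_i = rng.normal(size=d)  # output vector of the focal word w_i
v_j = rng.normal(size=d)  # input vector of the context word w_j

# Probability the pair is a true co-occurrence (y = 1).
p_pos = sigmoid(u_i @ v_j)

# Negative sampling: draw k random words and treat them as y = 0 examples.
k = 5
negatives = rng.normal(size=(k, d))  # stand-ins for the sampled output vectors
p_neg = sigmoid(-(negatives @ v_j))  # probability each sampled pair is noise

# Per-pair negative log-likelihood that training minimizes.
loss = -(np.log(p_pos) + np.log(p_neg).sum())
print(loss)
```

Replacing the vocabulary-wide softmax with one sigmoid per (positive or sampled) pair is what makes training tractable on large corpora.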
The model has two layers, an input (context) layer and an output (word) layer.
At initialization, words are given random locations in the input and output vector spaces. Then through stochastic gradient descent (SGD), their locations are updated and optimized to maximize the objective function. (This is where learning takes place.)
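One SGD update for a single (focal, context) pair under negative sampling can be sketched as follows. This is a simplified illustration with hand-set sizes and learning rate; real implementations loop over the whole corpus and decay the learning rate:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
vocab_size, d, lr = 10, 4, 0.05
# Random initialization of both layers (this is where learning starts).
U = rng.normal(scale=0.1, size=(vocab_size, d))  # output (word) layer
V = rng.normal(scale=0.1, size=(vocab_size, d))  # input (context) layer

def sgd_step(i, j, neg_ids):
    """One gradient step for positive pair (w_i, w_j) plus sampled negatives."""
    g = sigmoid(U[i] @ V[j]) - 1.0      # gradient of -log sigmoid(u_i . v_j)
    grad_v = g * U[i]
    U[i] -= lr * g * V[j]               # pull the positive pair together
    for n in neg_ids:
        g = sigmoid(U[n] @ V[j])        # gradient of -log sigmoid(-u_n . v_j)
        grad_v += g * U[n]
        U[n] -= lr * g * V[j]           # push sampled negatives apart
    V[j] -= lr * grad_v

before = sigmoid(U[1] @ V[0])
sgd_step(i=1, j=0, neg_ids=[3, 4])
after = sigmoid(U[1] @ V[0])
print(before, "->", after)  # co-occurrence score of the positive pair
```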
Analysis and interpretation are usually performed on the output (word) layer.
Because prediction is not our goal (and there is no reason to believe these shallow neural nets can out-compete more sophisticated deep neural-net models, like Transformers, on prediction tasks), embeddings learned from word2vec models are usually evaluated by their performance on analogy tests.
These shallow neural nets are best suited for geometric interpretation.