Week 2

MACS 30405: Exploring Cultural Space
University of Chicago

From the last class

Given variables \(\vec{x_1}, \vec{x_2}, \dots, \vec{x_p}\), find a linear combination of the variables that has maximal variance, subject to the condition that the coefficients of the linear combination form a unit vector. More precisely, we want a vector \(\vec{a_1} = [a_{11}, \dots, a_{1p}]\) such that the new variable \(a_{11}\vec{x_1} + a_{12}\vec{x_2} + \dots + a_{1p}\vec{x_p}\) has maximal variance subject to \(\|\vec{a_1}\|_2 = 1\). \(\vec{a_1}\) is called the first principal component.

Then, we want to find a second unit vector \(\vec{a_2}\) (the second principal component) that again maximizes variance, with the added condition that the newly formed variable \(a_{21}\vec{x_1} + a_{22}\vec{x_2} + \dots + a_{2p}\vec{x_p}\) is uncorrelated with the first new variable.
Following the same logic, the \(k\)th principal component is the \(k\)th unit vector that maximizes the variance of the newly formed variable, subject to the condition that \(a_{k1}\vec{x_1} + a_{k2}\vec{x_2} + \dots + a_{kp}\vec{x_p}\) is uncorrelated with all previously transformed variables.
The procedure can be iteratively performed until \(k\) reaches \(\text{min}(n,p)\).
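
The definition above can be checked numerically. The following is a minimal sketch, assuming synthetic data and using the eigendecomposition of the sample covariance matrix as the solver; the names `X`, `A`, and `Z` are illustrative and not part of the course material.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed; the data are synthetic
n, p = 200, 4
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))  # correlated toy variables

Xc = X - X.mean(axis=0)                   # center each variable (column)
S = np.cov(Xc, rowvar=False)              # p x p sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)      # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]         # reorder to descending variance
eigvals, A = eigvals[order], eigvecs[:, order]  # column k of A holds the (k+1)th component

Z = Xc @ A                                # the newly formed variables

unit_norms = np.linalg.norm(A, axis=0)    # each coefficient vector has length 1
score_cov = np.cov(Z, rowvar=False)       # diagonal: new variables are uncorrelated

print(np.round(unit_norms, 6))
print(np.round(score_cov, 6))
```

The off-diagonal entries of `score_cov` are zero up to rounding, and its diagonal reproduces the eigenvalues in descending order, which is the "maximal variance, uncorrelated with earlier variables" property stated above.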

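The variance of a variable is not scale-free: \(\text{var}(a\vec{x}) = a^2\text{var}(\vec{x})\). A variable recorded in larger numbers therefore has a larger variance and can dominate the first components, so maximal variance on the raw scale may not be what we want. To address this problem, we often standardize each variable by transforming \(x\) into \(x' = \frac{x - \bar{x}}{s_x}\), where \(\bar{x}\) is the sample mean and \(s_x\) is the sample standard deviation.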
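
A small sketch of the scale problem, assuming synthetic data. The helper names `first_pc` and `standardize` are hypothetical, introduced only for this illustration. Multiplying one column by 100 makes the first principal component point almost entirely at that column, while standardizing first removes the effect.

```python
import numpy as np

rng = np.random.default_rng(1)

# var(a * x) = a^2 * var(x)
x = rng.normal(size=500)
a = 10.0
var_ratio = np.var(a * x, ddof=1) / np.var(x, ddof=1)

def first_pc(X):
    """Unit vector of the first principal component (covariance-based)."""
    vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
    return vecs[:, np.argmax(vals)]

def standardize(X):
    """Subtract column means and divide by column standard deviations."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Three correlated toy variables (the mixing matrix is an arbitrary choice)
M = np.array([[1.0, 0.5, 0.2],
              [0.0, 1.0, 0.3],
              [0.0, 0.0, 1.0]])
X = rng.normal(size=(500, 3)) @ M
X_rescaled = X * np.array([1.0, 1.0, 100.0])   # third variable in units 100 times smaller

pc_raw = first_pc(X_rescaled)                  # dominated by the rescaled column
pc_std_a = first_pc(standardize(X))            # standardized, original units
pc_std_b = first_pc(standardize(X_rescaled))   # standardized, rescaled units

print(var_ratio)
print(np.round(pc_raw, 3))
print(np.round(np.abs(pc_std_a), 3), np.round(np.abs(pc_std_b), 3))
```

Absolute values are compared in the last line because a principal component is only defined up to sign.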

PCA has a closed-form solution that makes use of eigenvalue decomposition (EVD). You do not need to understand EVD in order to use PCA, but it is better to know how it works.
(Further complication) However, most statistical software does not solve PCA by EVD. Rather, the preferred solution uses singular value decomposition (SVD).

SVD is preferred for several reasons. First, it produces exactly the same solution that EVD gives and is computationally more efficient.
Second, it comes with the benefit of performing sample PCA and variable PCA at once.
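
The agreement between the two routes can be sketched as follows, assuming synthetic centered data and the sample covariance with divisor \(n - 1\). The singular values \(d_k\) of the centered data matrix relate to the eigenvalues \(\lambda_k\) of the covariance matrix by \(\lambda_k = d_k^2 / (n - 1)\), and the right singular vectors match the eigenvectors up to sign.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 5
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # synthetic correlated data
Xc = X - X.mean(axis=0)

# Route 1: eigenvalue decomposition of the covariance matrix
S = Xc.T @ Xc / (n - 1)
evals, evecs = np.linalg.eigh(S)
evals, evecs = evals[::-1], evecs[:, ::-1]              # descending order

# Route 2: singular value decomposition of the centered data matrix
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_vars = d**2 / (n - 1)

same_variances = np.allclose(evals, svd_vars)
# Eigenvectors are defined only up to sign, so compare absolute inner products
same_directions = np.allclose(np.abs(evecs.T @ Vt.T), np.eye(p), atol=1e-6)

scores = U * d                                          # sample coordinates, equal to Xc @ V

print(same_variances, same_directions)
```

In this sketch `Vt` carries the coefficient vectors over the variables and `U * d` carries the coordinates of the samples, so one decomposition returns both sides.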

By default, we treat samples (respondents) as rows and variables as columns. But we can also do it the other way around. And sometimes, it is not obvious what the samples are and what the variables are.

	a	the	…	power	…
Document 1	928	824	…	8	…
…	…	…	…	…	…
Document \(n\)	451	552	…	5	…

	Importance of Family	Importance of Friends	…	Acceptance of Divorce	…
Argentina	…	…	…	…	…
…	…	…	…	…	…
Zimbabwe	…	…	…	…	…

In variable PCA, variables are treated as rows and samples as columns; we simply perform PCA on the variables rather than the samples.
Variable PCA and sample PCA are two sides of the same coin: they share the same reduced space (explanation on board).
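
One way to see the "two sides of the same coin" claim numerically is sketched below. It assumes that the same centered matrix `Xc` is used in both directions (no re-centering after transposing), which may differ from the treatment given on the board. Under that assumption, \(X_c^\top X_c\) and \(X_c X_c^\top\) have the same nonzero eigenvalues, and transposing the matrix simply swaps the roles of the left and right singular vectors.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 8, 3
X = rng.normal(size=(n, p))                  # synthetic data: 8 samples, 3 variables
Xc = X - X.mean(axis=0)

U, d, Vt = np.linalg.svd(Xc, full_matrices=False)

# Eigenvalues of the p x p and n x n cross-product matrices, descending
small = np.linalg.eigvalsh(Xc.T @ Xc)[::-1]  # p eigenvalues
big = np.linalg.eigvalsh(Xc @ Xc.T)[::-1]    # n eigenvalues, the last n - p are ~0

# SVD of the transposed matrix (variables as rows, samples as columns)
U2, d2, Vt2 = np.linalg.svd(Xc.T, full_matrices=False)

print(np.round(small, 4))
print(np.round(big, 4))
print(np.round(d**2, 4))
```

The left singular vectors of the transposed matrix (`U2`) coincide, up to sign, with the right singular vectors of the original (`Vt.T`), and the singular values are identical.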

One of the most useful applications of PCA is to visualize the reduced dimensions in scatter plots.
There are even ways to project samples and variables onto the same dimensions (a biplot), and the projections are interpretable.
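
A minimal sketch of the coordinates behind a biplot, assuming synthetic data and one common scaling convention (samples as \(UD\), variables as \(V\)); other scalings exist, and plotting is left out. Under this convention the inner product of a sample point and a variable arrow approximates the centered data value, which is one sense in which the projections are interpretable.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 50, 4
# Toy data driven by two latent dimensions plus small noise (an assumption)
X = rng.normal(size=(n, 2)) @ rng.normal(size=(2, p)) + 0.1 * rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)

U, d, Vt = np.linalg.svd(Xc, full_matrices=False)

sample_points = (U * d)[:, :2]       # n x 2 coordinates for the samples
variable_arrows = Vt.T[:, :2]        # p x 2 coordinates for the variables

# Inner products of points and arrows give the rank-2 approximation of Xc
approx = sample_points @ variable_arrows.T
error = np.linalg.norm(Xc - approx)            # Frobenius norm of what is lost
discarded = np.sqrt(np.sum(d[2:] ** 2))        # size of the dropped dimensions

print(sample_points.shape, variable_arrows.shape)
print(error, discarded)
```

The approximation error equals the combined size of the discarded singular values, so the two-dimensional picture is faithful to the extent that the remaining dimensions are small.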

Remember that the eigenvalues are the explained variances of the reduced dimensions, in descending order.
A scree plot is a plot of the eigenvalues (explained variances) of the dimensions that helps you decide how many dimensions you want to keep.
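
The numbers behind a scree plot can be sketched as follows, assuming synthetic standardized data with two strong latent dimensions. The eigenvalues, their share of the total variance, and the cumulative share are printed rather than plotted; plotting `eigenvalues` against the dimension index would give the scree plot itself.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 300, 6
# Toy data: two latent dimensions plus noise (an assumption for illustration)
latent = rng.normal(size=(n, 2))
loadings = rng.normal(size=(2, p))
X = latent @ loadings + 0.3 * rng.normal(size=(n, p))

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize each variable
d = np.linalg.svd(Z, compute_uv=False)             # singular values, descending
eigenvalues = d**2 / (n - 1)                       # explained variances
share = eigenvalues / eigenvalues.sum()            # proportion of total variance
cumulative = np.cumsum(share)

for k, (ev, s, c) in enumerate(zip(eigenvalues, share, cumulative), start=1):
    print(f"dimension {k}: eigenvalue {ev:.3f}, share {s:.3f}, cumulative {c:.3f}")
```

Because the variables are standardized, the eigenvalues sum to the number of variables \(p\), so each share can be read as a fraction of \(p\).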

Apply PCA to a dataset of your choice. Visualize your results using scatterplots and biplots. Try to interpret the results. Possible data sources:
- World Values Survey
- General Social Survey
Turn in your code with results and annotations by April 18.

What is the so-called “Blau space”?
What is the benefit of using a spatial model to understand the diversity of human society?
Where does the social space come from? What are its dimensions?

What do we lose? What is not captured by the Blau space? What remains unanswered?