Week 2

MACS 30405: Exploring Cultural Space
University of Chicago

From the last class

Sample PCA

  1. Given variables \(\vec{x_1}, \vec{x_2}, \dots, \vec{x_p}\), find a linear combination of the variables that has maximal variance, subject to the condition that the coefficients of the linear combination form a unit vector. More precisely, we want a vector \(\vec{a_1} = [a_{11}, \dots, a_{1p}]\) with \(\|\vec{a_1}\|_2 = 1\) such that the new variable \(a_{11}\vec{x_1} + a_{12}\vec{x_2} + \dots + a_{1p}\vec{x_p}\) has maximal variance. \(\vec{a_1}\) is called the first principal component.
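
The maximization above can be written compactly. As a sketch, assume the variables are centered and collected as the columns of an \(n \times p\) matrix \(X\) with sample covariance matrix \(S\); the symbols \(X\), \(S\), and \(\lambda\) are introduced here for illustration and are not part of the slides' notation.

\[
\max_{\vec{a_1}} \ \text{var}(X\vec{a_1}) = \vec{a_1}^{\top} S \vec{a_1}
\quad \text{subject to} \quad \vec{a_1}^{\top}\vec{a_1} = 1,
\qquad S = \frac{1}{n-1} X^{\top} X
\]
\[
\frac{\partial}{\partial \vec{a_1}} \left[ \vec{a_1}^{\top} S \vec{a_1} - \lambda \left( \vec{a_1}^{\top}\vec{a_1} - 1 \right) \right] = 0
\ \Longrightarrow \ S\vec{a_1} = \lambda \vec{a_1},
\qquad \text{var}(X\vec{a_1}) = \vec{a_1}^{\top} S \vec{a_1} = \lambda
\]

Under these assumptions \(\vec{a_1}\) is an eigenvector of \(S\), and the variance of the new variable is largest when \(\lambda\) is the largest eigenvalue.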

Sample PCA - geometric interpretation

  1. Then, we want to find a second unit coefficient vector \(\vec{a_2}\) (the second principal component) that does the same, except that the newly formed variable \(a_{21}\vec{x_1} + a_{22}\vec{x_2} + \dots + a_{2p}\vec{x_p}\) must be uncorrelated with the first new variable.

  2. Following the same logic, the \(k\)th principal component is the \(k\)th coefficient vector that maximizes the variance of the newly formed variable, subject to the condition that \(a_{k1}\vec{x_1} + a_{k2}\vec{x_2} + \dots + a_{kp}\vec{x_p}\) is uncorrelated with all previously formed variables.

  3. The procedure can be repeated until \(k\) reaches \(\min(n, p)\). (With centered data, at most \(\min(n-1, p)\) of these components have nonzero variance.)

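  1. The variance of a variable is not scale-free: \(\text{var}(a\vec{x}) = a^2\,\text{var}(\vec{x})\). A variable measured in larger units therefore tends to dominate the first components, so maximal variance on the raw scale may not be what we want. To address this problem, we often standardize each variable by transforming \(x\) into \(x' = \frac{x - \bar{x}}{s_x}\), where \(\bar{x}\) is the sample mean and \(s_x\) is the sample standard deviation.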
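
To make the scale problem concrete, here is a minimal sketch on simulated data. The variables `height_m` and `income`, and the helper `first_component`, are hypothetical names made up for this illustration. It compares the first principal component before and after standardization.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated (hypothetical) variables on very different scales
height_m = rng.normal(1.7, 0.1, size=n)         # metres
income = rng.normal(50_000, 10_000, size=n)     # dollars
X = np.column_stack([height_m, income])


def first_component(data):
    """Return the unit vector a_1 that maximizes var(data @ a_1)."""
    S = np.cov(data, rowvar=False)           # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)     # eigenvalues in ascending order
    return eigvecs[:, -1]                    # eigenvector of the largest eigenvalue


# Rescaling a variable by a multiplies its variance by a^2
a = 100.0
variance_ratio = np.var(a * height_m, ddof=1) / np.var(height_m, ddof=1)

# On the raw scale, the first component is almost entirely income
a1_raw = first_component(X)

# After standardization, each variable has mean 0 and standard deviation 1
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
a1_std = first_component(Z)

print("var(a*x) / var(x):", variance_ratio)
print("first component, raw data:         ", np.round(a1_raw, 4))
print("first component, standardized data:", np.round(a1_std, 4))
```

On the raw data the weight on `income` should be close to 1 in absolute value, simply because income has the larger numerical variance. After standardization both variables enter with equal weight, which is what always happens with two standardized variables.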

Coding in class

Solutions

  • PCA has a closed-form solution based on the eigenvalue decomposition (EVD) of the covariance (or correlation) matrix. You do not need to understand EVD in order to use PCA, but it is better to know how it works.

  • (Further complication) However, most statistical software does not solve PCA by EVD. Rather, the preferred solution uses singular value decomposition (SVD).
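
As a minimal sketch of the EVD route, assuming NumPy and simulated data (the function name `pca_evd` is made up for this example), the eigenvectors of the sample covariance matrix give the principal components and the eigenvalues give the variances of the new variables.

```python
import numpy as np


def pca_evd(X):
    """PCA via eigendecomposition of the sample covariance matrix.

    X: (n, p) array with samples as rows and variables as columns.
    Returns eigenvalues (descending), components (as columns), and scores.
    """
    Xc = X - X.mean(axis=0)                  # center each variable
    S = np.cov(Xc, rowvar=False)             # (p, p) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)     # eigh is meant for symmetric matrices
    order = np.argsort(eigvals)[::-1]        # sort from largest to smallest
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs                    # the new (transformed) variables
    return eigvals, eigvecs, scores


rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))   # simulated correlated data

eigvals, components, scores = pca_evd(X)
score_cov = np.cov(scores, rowvar=False)

print("eigenvalues:", np.round(eigvals, 3))
print("covariance of the new variables:")
print(np.round(score_cov, 3))
```

The covariance matrix of the new variables should be diagonal, with the eigenvalues on the diagonal: the new variables are mutually uncorrelated and their variances appear in descending order.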

SVD

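  • SVD is preferred for several reasons. First, it produces the same solution as EVD (up to the sign of each component) and is generally more efficient and numerically stable, because it works on the data matrix directly instead of forming the covariance matrix first.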

  • Second, it comes with the benefit of performing sample PCA and variable PCA at once.
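
The equivalence between the two routes can be checked numerically. The sketch below, on simulated data, assumes the data matrix is centered and the covariance uses the \(n-1\) denominator; in that case the squared singular values divided by \(n-1\) match the eigenvalues, and the right singular vectors match the eigenvectors up to sign.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 3))   # simulated data
Xc = X - X.mean(axis=0)                                  # center each variable
n = Xc.shape[0]

# EVD route: eigendecomposition of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]       # descending order

# SVD route: decompose the centered data matrix itself
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_eigvals = s ** 2 / (n - 1)

# Eigenvalues agree; components agree up to a sign flip per column
same_eigvals = np.allclose(eigvals, svd_eigvals)
same_components = np.allclose(np.abs(eigvecs), np.abs(Vt.T))

print("same eigenvalues:", same_eigvals)
print("same components (up to sign):", same_components)
```

The sign ambiguity is harmless: flipping the sign of a component flips the corresponding new variable but does not change its variance.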

Variable PCA

  • By default, we treat samples (respondents) as rows and variables as columns. But we can also do it the other way around. And sometimes, it is not obvious what the samples are and what the variables are.

Example

                 a     the   power
Document 1       928   824   8
Document \(n\)   451   552   5

Example

        Document 1   Document \(n\)
a       928          451
the     824          552
power   8            5

Another example

            Importance of Family   Importance of Friends   Acceptance of Divorce
Argentina
Zimbabwe

Another example

                        Argentina   Zimbabwe
Importance of Family
Importance of Friends
Acceptance of Divorce

Variable PCA

  • In variable PCA, variables are treated as rows and samples as columns; we then perform PCA on the variables rather than on the samples.

  • Variable PCA and sample PCA are two sides of the same coin: they share the same reduced space (explanation on board).
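
One way to see the shared reduced space numerically is through the SVD of a single centered matrix. This is a sketch under the assumption that both analyses start from the same column-centered matrix; re-centering the transposed data would give a different result. The two cross-product matrices then have the same nonzero eigenvalues, and each set of axes can be computed from the other.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(6, 3))        # simulated: 6 samples (rows), 3 variables (columns)
Xc = X - X.mean(axis=0)            # assumption: the same centered matrix for both views

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Eigenvalues of the two cross-product matrices, in descending order
ev_cols = np.linalg.eigvalsh(Xc.T @ Xc)[::-1]   # (p, p): columns as the objects
ev_rows = np.linalg.eigvalsh(Xc @ Xc.T)[::-1]   # (n, n): rows as the objects

# Both share the nonzero eigenvalues s^2; the larger matrix only adds zeros
shared = np.allclose(ev_cols, s ** 2) and np.allclose(ev_rows[:3], s ** 2)
extra_zero = np.allclose(ev_rows[3:], 0.0)

# Each set of axes can be recovered from the other
U_from_V = Xc @ Vt.T / s
V_from_U = Xc.T @ U / s

print("shared nonzero eigenvalues:", shared)
print("remaining eigenvalues are zero:", extra_zero)
print("U recovered from V:", np.allclose(U_from_V, U))
print("V recovered from U:", np.allclose(V_from_U, Vt.T))
```

Because one decomposition yields both `U` (the row side) and `Vt` (the column side), a single SVD is enough to obtain both analyses.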

Scatterplots and biplots

  • One of the most useful applications of PCA is to visualize the reduced dimensions in scatterplots.
  • There are even ways to project samples and variables onto the same dimensions (a biplot), and the projections are interpretable.
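
As a sketch of where biplot coordinates can come from, the example below computes two-dimensional coordinates for samples and variables from one SVD of simulated, standardized data. The scaling used here (samples get `U * s`, variables get the rows of `Vt`) is one common convention among several, and the names `sample_coords` and `variable_coords` are made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 5)) @ rng.normal(size=(5, 5))   # simulated data
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)         # standardized variables

U, s, Vt = np.linalg.svd(Z, full_matrices=False)

k = 2                                   # number of dimensions to display
sample_coords = U[:, :k] * s[:k]        # one point per sample
variable_coords = Vt[:k].T              # one arrow per variable

# The inner product of a sample point and a variable arrow approximates
# that sample's standardized value on that variable.
Z_approx = sample_coords @ variable_coords.T
error_2d = np.linalg.norm(Z - Z_approx)             # Frobenius norm of the residual
expected_error = np.sqrt(np.sum(s[k:] ** 2))        # what the discarded dimensions hold

print("sample coordinates shape:  ", sample_coords.shape)
print("variable coordinates shape:", variable_coords.shape)
print("approximation error:", round(float(error_2d), 4))
```

To draw the biplot, plot `sample_coords` as points and `variable_coords` as arrows from the origin with any plotting library; the arrows are often rescaled for legibility.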

Coding examples

Scree plot

  • Remember that the eigenvalues are the variances explained by the reduced dimensions, sorted in descending order.
  • A scree plot shows the eigenvalues (explained variances) against the dimension number and helps you decide how many dimensions to keep.

Coding examples

Heuristics for deciding the number of dimensions

  1. Simplicity (convenience)
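  2. Elbow method (keep the dimensions before the point where the scree plot flattens out)
  3. Eigenvalue > 1 (with standardized variables, keep the dimensions that explain more variance than a single original variable)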
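
The last two heuristics can be sketched in code. The example assumes simulated data generated from two latent factors plus noise, and it uses the largest drop between consecutive eigenvalues as a crude stand-in for locating the elbow by eye; that rule is an assumption of this sketch, not a standard definition.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 300, 6

# Simulated (hypothetical) data: two latent factors plus noise
latent = rng.normal(size=(n, 2))
weights = rng.normal(size=(2, p))
X = latent @ weights + 0.5 * rng.normal(size=(n, p))

# The correlation matrix is the covariance matrix of the standardized variables
R = np.corrcoef(X, rowvar=False)
eigvals = np.linalg.eigvalsh(R)[::-1]            # descending order

# Quantities usually shown in a scree plot
explained_ratio = eigvals / eigvals.sum()
cumulative = np.cumsum(explained_ratio)

# Eigenvalue > 1 rule
n_keep_rule = int(np.sum(eigvals > 1))

# Crude elbow proxy: keep the dimensions before the largest drop
drops = eigvals[:-1] - eigvals[1:]
n_keep_elbow = int(np.argmax(drops)) + 1

print("eigenvalues:        ", np.round(eigvals, 3))
print("cumulative variance:", np.round(cumulative, 3))
print("eigenvalue > 1 rule keeps:", n_keep_rule)
print("largest-drop rule keeps:  ", n_keep_elbow)
```

With standardized variables the eigenvalues sum to \(p\), so their average is 1, which is why 1 serves as the cutoff in the third heuristic. The heuristics need not agree, and the final choice usually also depends on interpretability.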

Exercise

  • Apply PCA to a dataset of your choice. Visualize your results using scatterplots and biplots. Try to interpret the results. Possible data sources:
    • World Values Survey
    • General Social Survey
  • Turn in your code with results and annotations by April 18.

Break

McPherson (2004)

  • What is the so-called “Blau space?”
  • What is the benefit of using a spatial model to understand the diversity of human society?
  • Where does the social space come from? What are its dimensions?

How would you interpret this figure?

McPherson (2004)

  • What do we lose? What is not captured by the Blau space? What remains unanswered?

Bourdieu

  • Why is cultural taste correlated with education/class?
  • How is Bourdieu’s account different from McPherson’s?
  • What is a cultural field? How is it different from the space?
  • What is habitus? What role does it play in the field?