Big Data Research Strategies: Forecasting and Approximating Experiments

MACS 30000: Perspectives on Computational Analysis
University of Chicago

This week’s participation grade

  • Every group will present on Thursday.
  • We will suspend the voting model for one week.
  • You earn a check by speaking up at least once in class on either Tuesday or Thursday.

Let’s conduct a poll first.

  • Do you prefer to have group discussion
    1. in class OR
    2. only before class?
  • How much time do you prefer to have for in-class group discussion?
    1. 5 minutes
    2. 10 minutes
  • Submit your answer here: https://pollev.com/shilinjia752

Major goals of science

  • Description
  • Explanation
  • Prediction

Example - attitudes towards gay marriage

Background

In the United States, gay marriage, or allowing same-sex couples to marry, has been a contentious issue. Throughout the late 1990s and 2000s, the public was heavily divided on whether gays and lesbians should be allowed to marry. The issue was further complicated because, prior to the 2015 U.S. Supreme Court ruling (Obergefell v. Hodges) that legalized gay marriage across the country, the decision to allow or forbid gay marriage was left to each individual state. Traditional public opinion surveys used national samples, so they were not very useful for this state-level decision-making process.

Possible questions

  • Description - what percentage of the public is in favor of allowing gay marriage?

  • Explanation - why do some people support/oppose gay marriage?

  • Prediction - do specific people support or oppose gay marriage?

  • Who is most likely to care about description?

  • Who is most likely to care about explanation?

  • Who is most likely to care about prediction?

Let’s reconsider the Amazon co-purchase study conducted by Shi et al. (2017)

  • Topic: political polarization in science
  • Data: book co-purchase links on Amazon and Barnes & Noble
  • What would be a good description question to ask?
  • What would be a good explanation question to ask?
  • What would be a good prediction question to ask?

Value of Forecasting

Why should we care about prediction?

  • Interesting in its own right
  • Can be used to validate an explanation

See: Schrodt, P. A. (2014). Seven deadly sins of contemporary quantitative political analysis. Journal of Peace Research, 51(2), 287-300.

When should we care about prediction and when should we care about explanation?

  • In social sciences, prediction and explanation used to be closely tied together.
  • Think about linear regression: it is both a prediction and an explanation model.
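To make the dual role of linear regression concrete, here is a minimal sketch using a hand-written one-predictor least-squares fit. The variables (years of schooling, hourly wage) and all numbers are made up for illustration; the point is only that the same fitted line can be read as an explanation (the slope) or used for prediction (plugging in a new case).

```python
# Made-up illustrative data: years of schooling and hourly wage.
schooling = [10, 12, 12, 14, 16, 16, 18, 20]
wage = [11.0, 14.0, 13.0, 17.0, 19.0, 21.0, 22.0, 27.0]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = fit_line(schooling, wage)

# Explanation-style reading: the slope is the estimated association
# between one additional year of schooling and the hourly wage.
print(f"slope: {slope:.2f} per year of schooling")

# Prediction-style reading: plug a new case into the fitted line.
predicted_wage = intercept + slope * 15
print(f"predicted wage at 15 years of schooling: {predicted_wage:.2f}")
```

Whether the slope deserves a causal reading is a separate question from whether the line predicts well on new cases.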

Which one is a better model?

  • Which model fits better?
  • Which model explains better?
  • Which model predicts better?
  • To explain is to predict?

But with the machine-learning revolution, prediction and explanation are now more distinct.

  • Do you have a “rain dance-like” question, or an “umbrella-like” question?

How do you measure success?

Always choose an appropriate baseline
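A minimal sketch of why the baseline matters, assuming a binary outcome (1 = supports, 0 = opposes) and made-up labels. The majority-class baseline always predicts the most common label seen in training; a model is only interesting if it beats that.

```python
from collections import Counter

# Hypothetical labels: what was seen in training, and a held-out test set.
train_labels = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
y_true = [1, 1, 1, 1, 1, 1, 1, 0, 0, 1]
# Hypothetical predictions from some fitted model on the test set.
y_pred = [1, 1, 1, 1, 1, 1, 0, 0, 1, 1]


def accuracy(truth, pred):
    """Share of cases where the prediction equals the true label."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)


def majority_baseline(labels, n):
    """Predict the most common training label for all n test cases."""
    most_common_label = Counter(labels).most_common(1)[0][0]
    return [most_common_label] * n


model_accuracy = accuracy(y_true, y_pred)
baseline_accuracy = accuracy(y_true, majority_baseline(train_labels, len(y_true)))

print(f"model accuracy:    {model_accuracy:.2f}")
print(f"baseline accuracy: {baseline_accuracy:.2f}")
```

In this toy case the model's accuracy looks respectable in isolation but is no better than always guessing the majority class.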

Nowcasting

  • Predict the current state of something
  • Example: Google Flu Trends: Using Google’s search records to predict influenza prevalence.
    • Use data to predict the present instead of the future
    • Faster measurements of the flu
    • Enables the CDC and other agencies to mobilize more quickly in response to outbreaks
    • A model using the 45 highest-scoring queries appeared to perform well in predicting flu prevalence

Prediction as explanation 2.0

  • From necessary but not sufficient explanations (e.g. Max Weber: Protestant ethics → birth of Capitalism)
  • To sufficient but not necessary explanations

See: Evans, J. (2020). Social computing unhinged. Journal of Social Computing, 1(1), 1-13.

Example

  • Seguin, Zhang & Gorman (2023): Is cultural aesthetics internally coherent?
    • Data and subject of study: U.S. baby names (1880-2016) and their gender
    • Coherence hypothesis: Parents give boys and girls names based on some enduring gendered aesthetics.
    • Alternative (null) hypothesis: Parents give names by just looking at the names other parents give.
    • Imagine how difficult it is to rule out the alternative hypothesis.

Their strategy

  • RQ: Can a machine predict baby names’ gender solely based on the spelling and pronunciation in the absence of social contexts?
    • decompose names into many sonic and orthographic features
    • build a random forest classifier to predict gender given features
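As a rough illustration of the first step, the sketch below turns a name into a handful of spelling-based features. The feature set is hypothetical and far smaller than anything a real study would use; it is not the authors' feature list. Feature dicts like these could then be passed to a classifier such as a random forest.

```python
VOWELS = set("aeiouy")  # treating "y" as a vowel is a simplifying assumption


def name_features(name):
    """Return a small dict of orthographic features for one name."""
    lower = name.lower()
    return {
        "length": len(lower),
        "last_letter": lower[-1],
        "last_two": lower[-2:],
        "ends_in_vowel": lower[-1] in VOWELS,
        "n_vowels": sum(ch in VOWELS for ch in lower),
    }


for example_name in ["Olivia", "Liam"]:
    print(example_name, name_features(example_name))
```

Sonic features (syllable stress, final phoneme, and so on) would need a pronunciation dictionary and are left out of this sketch.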

Out-of-sample validation

  • Can the machine predict
    • gender of black babies’ names by only learning white babies’ names?
    • gender of baby names given by high-SES mothers by only learning baby names given by low-SES mothers?
    • gender of novel baby names by only learning pre-1950 baby names?
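The logic of this kind of out-of-sample test can be sketched with a deliberately simple learner: train on one group of names, then score on a group the learner never saw. The names and labels below are made up for illustration, and the "last letter majority" rule is a stand-in for the much richer classifier used in the paper.

```python
from collections import Counter, defaultdict

# Made-up toy data, not the authors' data: (name, gender).
pre_1950 = [
    ("Mary", "F"), ("Anna", "F"), ("Emma", "F"), ("Clara", "F"),
    ("Helen", "F"), ("Sarah", "F"), ("John", "M"), ("Steven", "M"),
    ("James", "M"), ("Robert", "M"), ("Edward", "M"),
]
novel = [
    ("Jayden", "M"), ("Nevaeh", "F"), ("Brayden", "M"), ("Kayla", "F"),
    ("Aaliyah", "F"), ("Logan", "M"), ("Madison", "F"),
]


def last_letter(name):
    return name.lower()[-1]


def train(examples):
    """Learn the majority gender for each final letter, plus a fallback."""
    by_letter = defaultdict(Counter)
    overall = Counter()
    for name, gender in examples:
        by_letter[last_letter(name)][gender] += 1
        overall[gender] += 1
    rules = {letter: counts.most_common(1)[0][0]
             for letter, counts in by_letter.items()}
    default = overall.most_common(1)[0][0]
    return rules, default


def predict(model, name):
    """Apply the learned rule; fall back to the overall majority."""
    rules, default = model
    return rules.get(last_letter(name), default)


# Train only on the older names, evaluate only on the novel ones.
model = train(pre_1950)
correct = sum(predict(model, name) == gender for name, gender in novel)
print(f"out-of-sample accuracy: {correct}/{len(novel)}")
```

The key design choice is that the evaluation names never enter training, so a good score cannot come from memorizing which names parents around the learner happened to use.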

Their answer

  • Yes, it can
    • with an average F1 score around or above 0.80.
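For reference, the F1 score is the harmonic mean of precision and recall for a chosen positive class. A minimal sketch with made-up labels:

```python
def f1_score(y_true, y_pred, positive):
    """Harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up true and predicted genders for eight names.
y_true = ["F", "F", "F", "M", "M", "F", "M", "F"]
y_pred = ["F", "F", "M", "M", "F", "F", "M", "F"]

f1_female = f1_score(y_true, y_pred, positive="F")
print(f"F1 for class F: {f1_female:.2f}")
```

Unlike plain accuracy, F1 penalizes a classifier that does well only by favoring the more common class.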

They also found

  • Names that switch gender convention are more likely to conform to gender aesthetics

Now, let’s turn our attention to the problem of causation.

Imagine the following study:

  • You suspected that big cities made their dwellers more productive.
  • You analyzed a nationally representative survey and found that people living in bigger cities were indeed more productive than people living in smaller cities. Would this confirm your hypothesis?

What about

  • You analyzed a nationally representative longitudinal study and found that people became more productive after they moved to bigger cities in comparison with people who stayed in the countryside. Would this finding confirm your hypothesis?

But

  • People who chose to move to bigger cities might be inherently different from people who chose to stay.
  • There might still be a selectivity issue.

Classical experimental setup: Randomized controlled trials

  • Randomly assign respondents into two groups
  • One group receives the treatment, and the other group does not (or takes a placebo).
  • Assess the average difference in outcomes between the two groups (the average treatment effect).
  • Random assignment ensures that the treatment and control groups are, on average, similar in everything except the treatment. (We’ll talk about it more in the week we discuss digital experiments.)
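The simulation below is a sketch of why randomization matters, using the big-city example. It assumes a made-up unobserved trait (`ability`) that raises productivity and also makes people more likely to move. The effect size, sample size, and distributions are all arbitrary assumptions.

```python
import random

rng = random.Random(0)
TRUE_EFFECT = 2.0  # assumed effect of living in a big city on productivity
N = 20_000


def outcome(ability, treated):
    """Productivity = unobserved ability + treatment effect + noise."""
    return ability + TRUE_EFFECT * treated + rng.gauss(0, 1)


def difference_in_means(rows):
    """Mean outcome of the treated minus mean outcome of the controls."""
    treated = [y for t, y in rows if t]
    control = [y for t, y in rows if not t]
    return sum(treated) / len(treated) - sum(control) / len(control)


abilities = [rng.gauss(0, 1) for _ in range(N)]

# Observational world: higher-ability people are more likely to move.
self_selected = []
for a in abilities:
    moved = a + rng.gauss(0, 1) > 0
    self_selected.append((moved, outcome(a, moved)))

# Experimental world: a coin flip decides who moves.
randomized = []
for a in abilities:
    moved = rng.random() < 0.5
    randomized.append((moved, outcome(a, moved)))

selected_estimate = difference_in_means(self_selected)
randomized_estimate = difference_in_means(randomized)
print(f"true effect:             {TRUE_EFFECT:.2f}")
print(f"self-selected estimate:  {selected_estimate:.2f}")
print(f"randomized estimate:     {randomized_estimate:.2f}")
```

Under self-selection the naive comparison mixes the city effect with pre-existing differences in ability, so it overstates the effect; under random assignment the two groups have the same ability on average and the difference in means lands close to the assumed effect.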

But

  • In observational studies, the treatments we are interested in are most likely not assigned randomly.
  • Possible ways out:
    • Matching
    • Regression discontinuity design (RDD)
    • Natural experiments
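As a sketch of the first option, exact matching compares treated and untreated units only within groups that share the same observed covariate value. The records below are made up: `education` is the matching covariate, and the within-group effect is set to 2 by construction.

```python
from collections import defaultdict

# Made-up records: (education, treated, outcome).
records = [
    ("BA", 1, 22), ("BA", 1, 22), ("BA", 1, 22), ("BA", 0, 20),
    ("HS", 1, 12), ("HS", 0, 10), ("HS", 0, 10), ("HS", 0, 10),
]


def mean(values):
    return sum(values) / len(values)


def naive_difference(rows):
    """Treated mean minus control mean, ignoring covariates."""
    treated = [y for _, t, y in rows if t]
    control = [y for _, t, y in rows if not t]
    return mean(treated) - mean(control)


def exact_matching_estimate(rows):
    """Average within-stratum difference, weighted by treated count."""
    strata = defaultdict(lambda: {"treated": [], "control": []})
    for covariate, t, y in rows:
        strata[covariate]["treated" if t else "control"].append(y)
    weighted_sum, n_treated = 0.0, 0
    for groups in strata.values():
        # Strata lacking treated or control units cannot be compared.
        if not groups["treated"] or not groups["control"]:
            continue
        diff = mean(groups["treated"]) - mean(groups["control"])
        weighted_sum += diff * len(groups["treated"])
        n_treated += len(groups["treated"])
    return weighted_sum / n_treated


naive = naive_difference(records)
matched = exact_matching_estimate(records)
print(f"naive difference:   {naive:.1f}")
print(f"matched difference: {matched:.1f}")
```

Matching only removes bias from covariates that are observed and matched on; anything unobserved that drives both treatment and outcome remains a problem.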

Hersh 2013 “Long-term effect of September 11 on the political behavior of victims’ families and neighbors”

  • Is this a valid natural experiment? Can we assume assignment to treatment/condition is “as-if” random?
  • How plausible are the assumptions of the matching approach the author uses? Do they seem to allow for causal inference, or not? Why?
  • Evaluate the strengths and limitations of the research design overall.

Regression discontinuity design (RDD), a real-world example

  • Hypothesis: In political elections, incumbent candidates hold an advantage over their challengers.
  • Findings: In the United States, incumbent parties win at a consistently high rate in elections to the U.S. House (>90% win rate).
  • Problem: Incumbent candidates might differ from their challengers in many ways other than incumbency itself.

An RDD approach

Source: Lee, D. S. (2008). Randomized experiments from non-random selection in U.S. House elections. Journal of Econometrics, 142(2), 675-697.
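The idea can be sketched with simulated data: candidates who barely win and barely lose should be similar, so a jump in the outcome right at a vote margin of zero can be attributed to winning. Everything below is an assumption made for illustration (the data-generating process, the size of the jump, the bandwidth); it is not Lee's data or estimate.

```python
import random

rng = random.Random(1)
TRUE_JUMP = 0.08   # assumed incumbency effect used to generate fake data
BANDWIDTH = 0.25   # only races within this margin of the cutoff are used


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Simulate: running variable = vote margin in election t (cutoff at 0),
# outcome = vote share in election t+1.
data = []
for _ in range(4000):
    margin = rng.uniform(-0.5, 0.5)
    won = margin >= 0
    next_share = 0.5 + 0.4 * margin + TRUE_JUMP * won + rng.gauss(0, 0.05)
    data.append((margin, next_share))


def rd_estimate(rows, bandwidth):
    """Difference between the two fitted lines at the cutoff (margin = 0)."""
    left = [(m, y) for m, y in rows if -bandwidth <= m < 0]
    right = [(m, y) for m, y in rows if 0 <= m <= bandwidth]
    left_intercept, _ = fit_line(*zip(*left))
    right_intercept, _ = fit_line(*zip(*right))
    return right_intercept - left_intercept


estimate = rd_estimate(data, BANDWIDTH)
print(f"assumed jump at cutoff:   {TRUE_JUMP:.3f}")
print(f"estimated jump at cutoff: {estimate:.3f}")
```

Fitting a separate line on each side and comparing intercepts at the cutoff is one common way to set this up; the choice of bandwidth trades off bias (far-from-cutoff races are less comparable) against noise (fewer observations).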

On Thursday

Legewie 2016 “Racial Profiling and Use of Force in Police Stops”

  • Is this a valid approximation of an experiment? Can we assume assignment to treatment/condition is “as-if” random?
  • How plausible are the assumptions of the matching approach the author uses? Do they seem to allow for causal inference, or not? Why?
  • Evaluate the use of a Regression Discontinuity Design. What are some strengths and weaknesses of this alternative estimate?
  • Evaluate the strengths and limitations of the research design overall.

Sourati & Evans 2023 “Accelerating science with human-aware artificial intelligence”

  • We will try to digest this paper as much as possible without digging into the math.
  • I will assign each group of three a figure from the paper. Please try your best to make sense of it and present the figure in front of the class.
  • How do the authors make use of prediction to answer their research questions?

Discussion groups

Week 3 discussion groups and assigned figures

Group   Members                       Figure
0       Lorena / Anny / Max           2
1       Abbey / Tianle / Tian         3
2       Huanrui / Jiazheng / Ertong   4
3       Daniela / Pritam / Erica      5
4       Kuang / Cosmo / Thomas        6
5       Andy / Zhuojun / Yue          7
6       Emma / Agnes / Kexin          8

I will do Figure 1.

Can you make sense of it?