Big Data Research Strategies: Forecasting and Approximating Experiments

MACS 30000: Perspectives on Computational Analysis
University of Chicago

This week’s participation grade

Every group will present on Thursday.
We will suspend the voting model for one week.
You earn a check by speaking up at least once in class on either Tuesday or Thursday.

Let’s conduct a poll first.

Do you prefer to have group discussion
1. in class OR
2. only before class?
How much time do you prefer to have for in-class group discussion?
1. 5 minutes
2. 10 minutes
Submit your answer here: https://pollev.com/shilinjia752

Major goals of science

Description
Explanation
Prediction

Example - attitudes towards gay marriage

Background

In the United States, gay marriage, or allowing same-sex couples to marry, has been a contentious issue. Throughout the late 1990s and 2000s, the public was heavily divided on whether gays and lesbians should be allowed to marry. This was made more complicated because, prior to the 2015 U.S. Supreme Court ruling (Obergefell v. Hodges) that legalized gay marriage across the country, the decision to allow or forbid gay marriage was left to each individual state. Traditional public opinion surveys were conducted using national samples, so they were not very useful for this state-level decision-making process.

Possible questions

Description - what percentage of the public is in favor of allowing gay marriage?
Explanation - why do some people support/oppose gay marriage?
Prediction - do specific people support or oppose gay marriage?
Who is most likely to care about description?
Who is most likely to care about explanation?
Who is most likely to care about prediction?

Let’s reconsider the Amazon co-purchase study conducted by Shi et al. (2017)

Topic: political polarization in science
Data: book co-purchase links on Amazon and Barnes & Noble

What would be a good description question to ask?
What would be a good explanation question to ask?
What would be a good prediction question to ask?

Value of Forecasting

Why should we care about prediction?

Interesting in its own right
Can be used to validate an explanation

See: Schrodt, P. A. (2014). Seven deadly sins of contemporary quantitative political analysis. Journal of Peace Research, 51(2), 287-300.

When should we care about prediction and when should we care about explanation?

In social sciences, prediction and explanation used to be closely tied together.
Think about linear regression: it is both a prediction and an explanation model.

Which one is a better model?

Source: https://en.wikipedia.org/wiki/Overfitting

Which model fits better?
Which model explains better?
Which model predicts better?
To explain is to predict?
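
One way to make these questions concrete is a small sketch, under toy assumptions, that fits two models to the same noisy, roughly linear data: a straight line (ordinary least squares) and a degree-9 polynomial that passes through every training point (Lagrange interpolation). The data, the alternating +1/-1 noise, and all function names are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns a predict function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x


def fit_interpolating_polynomial(xs, ys):
    """Lagrange polynomial through every training point (perfect in-sample fit)."""
    def predict(x):
        total = 0.0
        for k, (xk, yk) in enumerate(zip(xs, ys)):
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != k:
                    weight *= (x - xj) / (xk - xj)
            total += yk * weight
        return total
    return predict


def mse(model, xs, ys):
    """Mean squared error of a predict function on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy training data (assumed): true trend y = 2x plus alternating +1/-1 noise.
train_x = [float(i) for i in range(10)]
train_y = [2 * x + (1 if i % 2 == 0 else -1) for i, x in enumerate(train_x)]

# Held-out points between the training points, taken from the true trend.
test_x = [i + 0.5 for i in range(9)]
test_y = [2 * x for x in test_x]

line = fit_line(train_x, train_y)
poly = fit_interpolating_polynomial(train_x, train_y)

line_train, line_test = mse(line, train_x, train_y), mse(line, test_x, test_y)
poly_train, poly_test = mse(poly, train_x, train_y), mse(poly, test_x, test_y)

print(f"line: train MSE={line_train:.3f}, test MSE={line_test:.3f}")
print(f"poly: train MSE={poly_train:.3f}, test MSE={poly_test:.3f}")
```

In this toy setup the polynomial reproduces the training data exactly but does far worse than the line on the held-out points, which is the sense in which the better-fitting model need not be the better-predicting one.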

But with the machine-learning revolution, prediction and explanation are now more distinct.

Do you have a “rain dance-like” question, or an “umbrella-like” question?

When developing predictions (also known as forecasting), you should think carefully about a couple of things. First, is it important that your predictive model be explainable (i.e. do you need to know why it works)? Or does interpretability not matter for your application, so that you can use a black-box model?


Salganik identifies this as the rain dance or umbrella question.

Consider Anna. She faces a drought and wants to know whether or not she should hire a shaman to perform a rain dance. For her, the key question is “does a rain dance cause rain?” This is a question about explanation. If rain dances do not cause rain, then hiring a shaman is a waste of money.

Compare her to Bob. Bob needs to decide whether to bring an umbrella to work to avoid getting wet. For him, the important question is “is it going to rain today?” This is a question about prediction. He doesn’t care what factors cause rain to occur; he just wants to know with a reasonable degree of confidence whether rain will occur or not.

So if you are thinking about developing a predictive model, consider first if you have a rain dance-like question or an umbrella-like question. In CSS, this might be the difference between fitting a deep learning model that has high predictive accuracy but low interpretability (e.g. to label text with human-level accuracy in a content analysis context) and a standard regression model that might have lower predictive accuracy but higher interpretability (to see the effect size of particular variables on an outcome variable). Not better or worse – just different considerations for different problems!

How do you measure success?

Always choose an appropriate baseline

Accepting that you have a predictive question, how do you measure success? In a vacuum, big data can be impressive. Using search trends to predict sales or real-world performance is pretty cool. But what do we measure that against?
A naive strategy considers each outcome equally likely. To exceed expectations, a model should be correct more often than a random guess. For binary outcomes, this would mean that if I can predict correctly more than 50% of the time, my model is a success.
But sometimes even basic knowledge will outperform a coin flip. For example, what if you wanted to predict weather in Chicago during the winter? Will it be hot or cold that day? A prediction of cold every day would be correct 90% of the time. But that model would not necessarily be useful.
For a social scientific example, think about predicting decisions from the Supreme Court of the United States. Due to how the Court selects its cases, historically its published decisions have reversed the lower court ruling approximately sixty-six percent of the time. So if my model has a seventy-two percent accuracy rate, is that noteworthy? If we compare it to a random guess with an accuracy of fifty percent, then it does seem better. But if we compare it to the more reasonable baseline of predicting the modal outcome for every observation, we see only a six percentage point increase in accuracy.
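
The Supreme Court comparison can be written out as a short calculation. This is a minimal sketch using the figures from the example (a 66% reversal rate and a 72% model accuracy); the toy record of 100 decisions and the function and variable names are illustrative assumptions.

```python
from collections import Counter


def majority_baseline_accuracy(outcomes):
    """Accuracy (in %) of always predicting the most common (modal) outcome."""
    modal_count = Counter(outcomes).most_common(1)[0][1]
    return 100 * modal_count / len(outcomes)


# Toy record of 100 decisions with the reversal rate from the example.
outcomes = ["reverse"] * 66 + ["affirm"] * 34

model_accuracy = 72.0        # stated model accuracy, in percent
coin_flip_accuracy = 50.0    # naive baseline for a binary outcome
modal_accuracy = majority_baseline_accuracy(outcomes)

gain_over_coin_flip = model_accuracy - coin_flip_accuracy
gain_over_modal = model_accuracy - modal_accuracy

print(f"gain over coin flip: {gain_over_coin_flip:.0f} percentage points")
print(f"gain over modal baseline: {gain_over_modal:.0f} percentage points")
```

The same model looks much less impressive once the baseline is the modal outcome rather than a coin flip.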


Bottom line: Always choose an appropriate baseline to decide if a predictive model is useful.

Nowcasting

Predict current state of something
Example: Google Flu Trends: Using Google’s search records to predict influenza prevalence.
- Use data to predict the present instead of the future
- Faster measurements of the flu
- Enables the CDC and other agencies to mobilize more quickly in response to outbreaks
- Model using the top 45 highest-scoring queries seems to perform well in predicting the flu

Not just limited to forecasting things far in the future – another application is to predict what’s going on right now – “nowcasting”

Nowcasting has the potential to be especially useful to governments and companies that require timely and accurate measures of the world.
- Especially useful in cases where official estimates of something take a long time to come out

For instance, Google Flu Trends used influenza-related search queries to predict current measurements of the flu. More recently, many researchers have done the same with COVID-19.
- Faster measurements of the flu/COVID than CDC reports, which are more delayed (potentially allows policy makers to make decisions faster and more effectively)
- Later versions were better, but early versions of this kind of nowcasting didn’t adequately assess themselves against a lagged baseline (e.g. a lagged CDC indicator produced basically the same result), nor did they fully take into account the shifting nature of Google’s algorithms:
  - Google changed how searches are conducted
  - Auto-complete feature
  - Increases the volume of influenza-related queries, even if influenza is not more prevalent
  - Nothing in place to correct for this system drift
  - Also, potentially overfitting the data
    - 50 million search queries, yet only 1152 observations
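
The lagged-baseline check can be sketched as follows. The weekly numbers are invented toy values, not Google Flu Trends or CDC data, and the one-week lag and the function names are assumptions for illustration.

```python
def mean_absolute_error(predicted, actual):
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def lagged_baseline(series, lag):
    """Predict each week's value with the official value from `lag` weeks earlier.

    Assumes lag >= 1.
    """
    return series[:-lag]


# Hypothetical official weekly measurements (only available after a delay).
official = [20, 21, 23, 24, 26, 27, 26, 25, 23, 22]
# Hypothetical nowcast produced from search data for the same weeks.
nowcast = [19, 22, 22, 25, 25, 28, 27, 24, 24, 21]

lag = 1
actual = official[lag:]      # weeks for which a lagged value exists
baseline_mae = mean_absolute_error(lagged_baseline(official, lag), actual)
nowcast_mae = mean_absolute_error(nowcast[lag:], actual)

# Share of the baseline's error that the nowcast removes (0 means no gain).
relative_improvement = 1 - nowcast_mae / baseline_mae
print(f"baseline MAE={baseline_mae:.2f}, nowcast MAE={nowcast_mae:.2f}, "
      f"improvement={relative_improvement:.0%}")
```

Reporting the nowcast's error next to the lagged baseline's error, rather than on its own, is what makes the comparison meaningful.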

Critiques of the Google Flu Trends study

The model does not perform much better than using lagged CDC data
Concerns that the queries don’t actually predict influenza, but instead predict seasonality (e.g. “high school football”)
Also, beware of algorithmic confounding and drift!
Is this a rain dance question or umbrella question?

Prediction as explanation 2.0

From necessary but not sufficient explanations (e.g. Max Weber: Protestant ethics \(\rightarrow\) birth of Capitalism)
To sufficient but not necessary explanations

See: Evans, J. (2020). Social computing unhinged. Journal of Social Computing, 1(1), 1-13.

Example

Seguin, Zhang & Gorman (2023): Is cultural aesthetics internally coherent?
- Data and subject of study: U.S. baby names (1880-2016) and their gender
- Coherence hypothesis: Parents give boys and girls names based on some enduring gendered aesthetics.
- Alternative (null) hypothesis: Parents give names by just looking at the names other parents give.
- Imagine how difficult it is to rule out the alternative hypotheses.

Their strategy

RQ: Can a machine predict baby names’ gender solely based on the spelling and pronunciation in the absence of social contexts?
- decompose names into many sonic and orthographic features
- build a random forest classifier to predict gender given features

Out-of-sample validation

Can the machine predict
- gender of black babies’ names by only learning white babies’ names?
- gender of baby names given by high-SES mothers by only learning baby names given by low-SES mothers?
- gender of novel baby names by only learning pre-1950 baby names?

Their answer

Yes, it can
- with an average F1 score around or above 0.80.
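
The train-on-one-group, test-on-another logic can be illustrated with a deliberately simplified sketch. It swaps the authors' random forest and rich sonic/orthographic features for a last-letter lookup with a vowel-ending fallback, and the two tiny name lists are invented toy data, so the resulting score says nothing about the paper's results.

```python
from collections import Counter, defaultdict

VOWELS = set("aeiouy")


def features(name):
    """Two crude orthographic features: last letter and whether it is a vowel."""
    last = name.lower()[-1]
    return last, last in VOWELS


def train(labeled_names):
    """Count gender labels per last letter and per vowel/consonant ending."""
    by_letter, by_vowel = defaultdict(Counter), defaultdict(Counter)
    for name, gender in labeled_names:
        last, ends_in_vowel = features(name)
        by_letter[last][gender] += 1
        by_vowel[ends_in_vowel][gender] += 1
    return by_letter, by_vowel


def predict(model, name):
    by_letter, by_vowel = model
    last, ends_in_vowel = features(name)
    # Prefer the last-letter vote; fall back to the vowel-ending vote if unseen.
    # (Assumes the training data contains both vowel and consonant endings.)
    votes = by_letter[last] if last in by_letter else by_vowel[ends_in_vowel]
    return votes.most_common(1)[0][0]


def f1_score(predictions, truths, positive):
    tp = sum(p == positive and t == positive for p, t in zip(predictions, truths))
    fp = sum(p == positive and t != positive for p, t in zip(predictions, truths))
    fn = sum(p != positive and t == positive for p, t in zip(predictions, truths))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Hypothetical training group and a separate, never-seen evaluation group.
group_a = [("Mary", "F"), ("Anna", "F"), ("Emma", "F"), ("Clara", "F"),
           ("Alice", "F"), ("John", "M"), ("William", "M"), ("James", "M"),
           ("Robert", "M"), ("Frank", "M")]
group_b = [("Olivia", "F"), ("Sophia", "F"), ("Harper", "F"), ("Mason", "M"),
           ("Ethan", "M"), ("Noah", "M"), ("Luca", "M")]

model = train(group_a)
predictions = [predict(model, name) for name, _ in group_b]
truths = [gender for _, gender in group_b]
f1_female = f1_score(predictions, truths, positive="F")
print(f"out-of-sample F1 for 'F': {f1_female:.2f}")
```

The key design choice is that the evaluation group never enters training, so a high score cannot come from memorizing which names that group happens to use.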

They also found

Names that switch gender convention are more likely to conform to gender aesthetics

Now, let’s turn our attention to the problem of causation.

Imagine the following study:

You suspected that big cities made their dwellers more productive.
You analyzed a nationally representative survey and found that people living in bigger cities were indeed more productive than people living in smaller cities. Would this confirm your hypothesis?

What about

You analyzed a nationally representative longitudinal study and found that people became more productive after they moved to bigger cities in comparison with people who stayed in the countryside. Would this finding confirm your hypothesis?

But

People who selected to move to bigger cities might be inherently different from people who chose to stay.
There might still be a selectivity issue.

Classical experimental setup: Randomized controlled trials

Randomly assign respondents into two groups
One group receives the treatment, and the other group does not (or takes a placebo).
Assess the average difference in outcomes between the two groups (average treatment effects).
The setup ensures that the treatment and control groups are otherwise similar except for the treatment. (We’ll talk about it more in the week we discuss digital experiments.)
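
A small simulation, under assumed numbers, shows why random assignment matters for the big-city example. Every person has a hypothetical baseline productivity, and the true effect of the treatment is fixed at 2.0; the effect size, the selection rule, and the names below are invented for illustration.

```python
import random


def difference_in_means(outcomes, treated):
    """Mean outcome of treated units minus mean outcome of control units."""
    treated_vals = [y for y, t in zip(outcomes, treated) if t]
    control_vals = [y for y, t in zip(outcomes, treated) if not t]
    return sum(treated_vals) / len(treated_vals) - sum(control_vals) / len(control_vals)


TRUE_EFFECT = 2.0                        # assumed treatment effect
n = 1000
baseline = [i / n for i in range(n)]     # hypothetical baseline productivity


def observed(treated):
    """Observed outcome: baseline plus the effect for treated units only."""
    return [b + TRUE_EFFECT * t for b, t in zip(baseline, treated)]


# Self-selection: people with above-median baseline choose the treatment.
self_selected = [b >= 0.5 for b in baseline]
naive_estimate = difference_in_means(observed(self_selected), self_selected)

# Randomization: a seeded shuffle assigns exactly half to treatment.
rng = random.Random(0)
randomized = [True] * (n // 2) + [False] * (n // 2)
rng.shuffle(randomized)
rct_estimate = difference_in_means(observed(randomized), randomized)

print(f"true effect={TRUE_EFFECT}, naive={naive_estimate:.2f}, "
      f"randomized={rct_estimate:.2f}")
```

Under self-selection the comparison mixes the treatment effect with pre-existing differences between the groups; under randomization those differences average out, so the difference in means lands close to the assumed effect.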

But

In observational studies, the treatments we are interested in are most likely not assigned randomly.
Possible ways out:
- Matching
- Regression discontinuity design (RDD)
- Natural experiments

Hersh 2013 “Long-term effect of September 11 on the political behavior of victims’ families and neighbors”

Is this a valid natural experiment? Can we assume assignment to treatment/condition is “as-if” random?
How plausible are the assumptions of the matching approach the author uses? Do they seem to allow for causal inference, or not? Why?
Evaluate the strengths and limitations of the research design overall.

Uses 9/11 as a natural experiment to assess how punctuating events influence political activity
- Interesting in the historical context
- Also interesting in the policymaking context as 9/11 families become more politically involved and influential over government policy
- Could apply to action after other tragedies
Compares families and neighbors of victims to similar New Yorkers not directly affected by the tragedy, and assesses their levels of political activity and intensity after the attacks over the short and long term (i.e. “matching”)
Major findings
- Post-9/11, victims’ families and neighbors are more likely to vote in general elections
- Families’ campaign donations increase after the attack relative to the control, but not the neighbors
- Participation in primary elections also increases
- Shift towards Republican party registration - fits the expectations that after a violent attack, individuals are more likely to become politically conservative
Uses matching to find comparable non-victims (controls)
Is this a valid natural experiment? Can we assume assignment to treatment/condition is “as-if” random?
- Comparing families of victims to regular New Yorkers
- Comparing neighbors of victims to regular New Yorkers
- Recognizes that all New Yorkers were affected
  - If anything, biases towards null results because effect size is forced to be smaller
Uses always-on data to observe behavior pre-9/11 (good use of big data)
Pools data sources
- Voter registration files
- Campaign contribution records
- Potentially identifiable data, but all part of the public record
- IRB exempted it from review because the data is public
- Minimal ethical concerns
Use of matching
- Want to draw causal inferences from differences in behavior
- Cannot just observe all New Yorkers and run a regression model
  - Lots of other unobservable differences among the population
  - Change in political behavior could be the result of some third, unknown factor
- Compare victims’ families to similar individuals not directly affected by the attack
- Compare neighbors of victims to similar individuals not directly affected by the attack
- Matching characteristics
  - Demographics
  - Prior political activities
  - Family/neighborhood characteristics
  - Attempts to find controls who are the same as the victims’ families and neighbors, EXCEPT they did not suffer a direct fatality in 9/11
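
The matching logic can be sketched in a few lines. The records below are invented toy data, not Hersh's; the sketch assumes exact matching on prior voting history, nearest-neighbor matching on age, and matching with replacement, which may differ from the procedure used in the paper.

```python
# Hypothetical records: (age, prior elections voted in, voted after the event)
treated = [(40, 3, 1), (60, 5, 1), (30, 0, 0)]
controls = [(41, 3, 1), (59, 5, 0), (29, 0, 0), (75, 1, 1), (22, 3, 0)]


def nearest_control(unit, pool):
    """Closest control by age among those with identical prior voting history."""
    age, prior, _ = unit
    candidates = [c for c in pool if c[1] == prior]
    if not candidates:
        return None          # no comparable control: the unit is dropped
    return min(candidates, key=lambda c: abs(c[0] - age))


pairs = [(t, nearest_control(t, controls)) for t in treated]
pairs = [(t, c) for t, c in pairs if c is not None]

# Average outcome gap between each treated unit and its matched control.
att_estimate = sum(t[2] - c[2] for t, c in pairs) / len(pairs)
print(f"matched pairs: {len(pairs)}, "
      f"estimated effect on turnout: {att_estimate:.2f}")
```

The estimate is only as credible as the assumption that matched pairs differ in nothing relevant besides the treatment, which is exactly what the discussion questions ask you to evaluate.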

Regression discontinuity design (RDD), a real-world example

Hypothesis: In political elections, incumbent candidates hold an advantage over their challengers.
Findings: In the United States, incumbent parties win at a consistently high rate in elections to the U.S. House (>90% win rate).
Problem: Incumbent candidates might differ from non-incumbent challengers in many ways other than the fact that they are incumbent.

An RDD design

Source: Randomized experiments from non-random selection in U.S. House elections. Lee (2008).
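
The idea behind a regression discontinuity can be sketched by fitting a separate line on each side of the cutoff and comparing the two fitted values at the cutoff itself. The noise-free data, the jump of 0.3, and the bandwidth below are invented for illustration and are not estimates from Lee (2008).

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def rdd_jump(margins, outcomes, cutoff=0.0, bandwidth=0.1):
    """Right-side minus left-side fitted value at the cutoff."""
    left = [(m, y) for m, y in zip(margins, outcomes)
            if cutoff - bandwidth <= m < cutoff]
    right = [(m, y) for m, y in zip(margins, outcomes)
             if cutoff < m <= cutoff + bandwidth]
    a_left, b_left = fit_line(*zip(*left))
    a_right, b_right = fit_line(*zip(*right))
    return (a_right + b_right * cutoff) - (a_left + b_left * cutoff)


# Hypothetical vote margins in election t (positive = the party narrowly won).
margins = [m / 100 for m in range(-10, 11) if m != 0]
# Hypothetical chance of winning election t+1: a smooth trend plus a jump of
# 0.3 for parties that won election t (i.e. that are now the incumbent party).
outcomes = [0.4 + 0.5 * m + (0.3 if m > 0 else 0.0) for m in margins]

jump = rdd_jump(margins, outcomes)
print(f"estimated discontinuity at the cutoff: {jump:.2f}")
```

The comparison is restricted to narrow winners and narrow losers because, close to the cutoff, which side a party lands on is plausibly as-if random.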

On Thursday

Legewie 2016 “Racial Profiling and Use of Force in Police Stops”

Is this a valid approximation of an experiment? Can we assume assignment to treatment/condition is “as-if” random?
How plausible are the assumptions of the matching approach the author uses? Do they seem to allow for causal inference, or not? Why?
Evaluate the use of a Regression Discontinuity Design. What are some strengths and weaknesses of this alternative estimate?
Evaluate the strengths and limitations of the research design overall.

Sourati & Evans 2023 “Accelerating science with human-aware artificial intelligence”

We will try to digest this paper as much as possible without digging into the math.
I will assign each group of three a figure from the paper. Please try your best to make sense of it and present the figure in front of the class.
How do the authors make use of prediction to answer their research questions?

Discussion groups

Week 3 Discussion Groups	Figure
Lorena / Anny / Max	2
Abbey / Tianle / Tian	3
Huanrui / Jiazheng / Ertong	4
Daniela / Pritam / Erica	5
Kuang / Cosmo / Thomas	6
Andy / Zhuojun / Yue	7
Emma / Agnes / Kexin	8