Digital Experimental Design

MACS 30000: Perspectives on Computational Analysis
University of Chicago

Assignment 1

Format

  • Submit your assignment via Gradescope
    • (Optional) Complete the test assignment by Friday
  • Stay within the specified word-count range
  • Cite properly and include a bibliography

Advice

  • Make sure to answer each question with sufficient explanations
  • Try to employ concepts from Bit by Bit
  • Do not dismiss strengths too quickly
  • Think twice (especially about weaknesses) before writing
  • Start early and bring your questions to next class

Experiments

Topics this week

  • Digital Experimental Design
  • Running Digital Experiments

When to use experiments

  • Useful for causal inference: estimating the causal effect of a treatment
  • Rules out confounders or alternative explanations…
    • … when designed intelligently:
      randomized controlled trial \(>\) “perturb and observe”

Importance of control group

  • Random assignment of the treatment ensures that, on average, the only thing that differs between the two groups is the treatment (see the sketch below)
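
A minimal simulation sketch of this logic (all numbers and variable names hypothetical): under random assignment, a background trait is balanced across groups on average, so a simple difference in means recovers the true treatment effect.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    # A background trait (potential confounder) and coin-flip assignment.
    age = rng.normal(40, 10, n)
    treated = rng.integers(0, 2, n).astype(bool)

    # The outcome depends on the trait plus a true treatment effect of 2.0.
    outcome = 0.5 * age + 2.0 * treated + rng.normal(0, 1, n)

    # Randomization balances the trait across groups (difference near 0),
    # so the raw difference in mean outcomes lands near 2.0.
    print(age[treated].mean() - age[~treated].mean())
    print(outcome[treated].mean() - outcome[~treated].mean())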

Basic experimental designs

Source: Bit by Bit Figure 4.5

When experiments are not appropriate

Smith and Pell (2003) presented a satirical meta-analysis of studies on the effectiveness of parachutes. They concluded:

As with many interventions intended to prevent ill health, the effectiveness of parachutes has not been subjected to rigorous evaluation by using randomised controlled trials. Advocates of evidence based medicine have criticised the adoption of interventions evaluated by using only observational data. We think that everyone might benefit if the most radical protagonists of evidence based medicine organised and participated in a double blind, randomised, placebo controlled, crossover trial of the parachute.

Major features of experiments

Two dimensions of experiments

Source: Bit by Bit Figure 4.1

Validity

  • The extent to which the results of a particular experiment support some more general conclusion
    • Statistical conclusion validity
    • Construct validity
    • Internal validity
    • External validity

Importance of validity in experiments

  • Traditional experiments were most concerned with internal validity
    • Not generally concerned with external validity
    • WEIRD data (Western, Educated, Industrialized, Rich, Democratic)
  • … but digital experiments typically have larger, more diverse pools of participants and increased capacity for assessing external validity

Heterogeneity of treatment effects

  • Assessing differential effects of the treatment on sub-groups in the study
  • Made possible in digital experiments by larger sample sizes and the low variable cost of adding more participants (see the sketch below)
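
A toy pandas sketch (data, column names, and subgroups all hypothetical) of the difference between one average treatment effect and subgroup-specific effects:

    import pandas as pd

    # Hypothetical experiment data: one row per participant.
    df = pd.DataFrame({
        "treated":  [1, 0, 1, 0, 1, 0, 1, 0],
        "subgroup": ["renter", "renter", "owner", "owner",
                     "renter", "renter", "owner", "owner"],
        "usage":    [9.1, 10.0, 7.5, 9.8, 9.3, 10.2, 7.2, 9.6],
    })

    def diff_in_means(g):
        # Treated minus control mean outcome within a set of rows.
        return (g.loc[g["treated"] == 1, "usage"].mean()
                - g.loc[g["treated"] == 0, "usage"].mean())

    print(diff_in_means(df))                            # average effect
    print(df.groupby("subgroup").apply(diff_in_means))  # effect by subgroup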

Power consumption study

Source: Bit by Bit Figure 4.6 (adapted from Allcott 2011)

Power consumption study

Source: Bit by Bit Figure 4.8 (adapted from Allcott 2011)

Mechanisms

Source: Bit by Bit Figure 4.10

Mechanisms

  • Mechanisms tell us why or how a treatment caused an effect
  • As with prediction, it isn't necessarily enough to know that \(X\) causes \(Y\); we want to know why
  • Digital experiments allow us to test the process or mediating variables (see the toy sketch below)
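
A toy simulation (purely illustrative; real mediation analysis is statistically much harder) of a treatment whose effect runs entirely through a mediator:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 50_000

    # The treatment moves a mediator (say, appliance upgrades), and the
    # mediator moves the outcome; no direct path in this toy model.
    treated = rng.integers(0, 2, n).astype(bool)
    mediator = 0.4 * treated + rng.normal(0, 1, n)
    outcome = -1.5 * mediator + rng.normal(0, 1, n)

    # Total effect of treatment on the outcome: about 0.4 * -1.5 = -0.6.
    print(outcome[treated].mean() - outcome[~treated].mean())

    # Effect of treatment on the mediator: evidence about the pathway.
    print(mediator[treated].mean() - mediator[~treated].mean())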

Detecting Mechanisms: Collecting Process Data

How did Home Energy Reports cause people to lower their electricity usage?

  • Linking data from a power company rebate program: records of consumer appliance upgrades (more energy-efficient or not?); a hypothetical linkage sketch follows
    • Allcott and Rogers 2014: more people who received Home Energy Reports upgraded their appliances
    • These upgrades account for only about 2% of the decrease in energy use in treated households
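
A sketch of that kind of record linkage (tables, IDs, and column names invented for illustration):

    import pandas as pd

    # Experiment roster and rebate-program records share a household ID.
    households = pd.DataFrame({
        "household_id": [1, 2, 3, 4],
        "treated":      [True, True, False, False],
    })
    rebates = pd.DataFrame({
        "household_id": [1, 3],
        "upgraded":     [True, True],
    })

    # A left join keeps every experimental household, matched or not.
    linked = households.merge(rebates, on="household_id", how="left")
    linked["upgraded"] = linked["upgraded"].eq(True)  # no record -> False

    # Compare appliance-upgrade rates by treatment status.
    print(linked.groupby("treated")["upgraded"].mean())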

Revisiting the Home Energy Report

Source: Bit by Bit Figure 4.6 (adapted from Allcott 2011)

Detecting Mechanisms: Running Additional Experiments

Test four conditions to assess the role of tips alone in lowering water usage (Ferraro, Miranda, and Price 2011):

  • a group that received tips on saving water
  • a group that received tips on saving water plus a moral appeal to save water
  • a group that received tips on saving water plus a moral appeal to save water plus information about their water use relative to their peers
  • a control group

Detecting Mechanisms: Running Additional Experiments

Source: Bit by Bit Figure 4.11 (Adapted from Ferraro, Miranda, and Price 2011)

Better: Full Factorial Design

  Treatment   Characteristics
  1           Control
  2           Tips
  3           Appeal
  4           Peer Information
  5           Tips + Appeal
  6           Tips + Peer Information
  7           Appeal + Peer Information
  8           Tips + Appeal + Peer Information
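
These are the \(2^3 = 8\) on/off combinations of the three message components; a quick sketch of enumerating them (the ordering here is arbitrary, not the table's):

    from itertools import product

    # All 2^3 combinations of the three message components.
    factors = ["Tips", "Appeal", "Peer Information"]
    for flags in product([False, True], repeat=3):
        on = [f for f, used in zip(factors, flags) if used]
        print(" + ".join(on) if on else "Control")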

Voter mobilization on Facebook (Bond et al. 2012)

In groups, evaluate the following:

  • Where does this experiment fall on the Digital-Analog continuum?
  • Validity: Do you have any concerns about the four types of validity?
  • Heterogeneity of effects:
    • How does the digital experiment allow researchers to assess different levels of effect?
    • What do we learn from this aspect that we would not know if we just learned the average treatment effect?
  • Causal mechanism:
    • What causes the increase in reported and validated voting?
    • How does the study identify the causal mechanism? Or does it?
    • How can this knowledge be useful to researchers?
  • Ethics of running such a study

How do I run a digital experiment?

Partner with the powerful

Partnerships with big organizations

  • Advantages: reduced costs, increased scale
  • Disadvantages: limited control over the kinds of participants, treatments, and outcomes you can use
  • Examples: the Home Energy Report studies from last class (partnering with utility companies), Bond et al. 2012 (partnering with Facebook)

Build your own experiment

Creating your own digital “lab”

Costly, but can run exactly the experiment you want to run.

How do you recruit participants, though?

Digital ads

Source: Salganik 2007, Figure 2.12

Amazon Mechanical Turk (MTurk)

Source: mturk.com

Benefits of MTurk for experiments

  • Larger potential subject pool
  • Can integrate with external survey platforms to easily run survey experiments
  • Cheaper than traditional subject recruitment
  • 4000+ MTurk academic studies and counting!

Threats to Validity?

  • External Validity
  • Internal Validity

Alternatives beyond just MTurk

  • CloudResearch
  • Prolific
  • Qualtrics
  • Dynata
  • and more…

Build a product

MovieLens

Use existing environments

On Thursday

  • van de Rijt et al. 2014: Success-breeds-success experiments
  • Bail et al. 2018: political polarization on social media

Best practices in running your own experiment

  • Run multiple experiments that reinforce one another
  • Aim for zero variable cost
  • Build ethics into your design: replace, refine, reduce
  • See Bit by Bit 4.6.2 for more detail

Also on Thursday

  • Compare and contrast Salganik et al. (2006) and van de Rijt et al. (2014)
  • How do these studies fall on the lab/field continuum?
  • What are the relative strengths/weaknesses of the lab design vs. field design in terms of
    • statistical validity
    • construct validity
    • internal validity
    • external validity

Discussion Groups for Week 5

0 Daniela / Tian / Max
1 Pritam / Tianle / Yue
2 Emma / Anny / Zhuojun
3 Jiazheng / Cosmo / Yuhan
4 Andy / Abbey / Ertong
5 Huanrui / Kuang / Kexin
6 Lorena / Agnes / Thomas