Observing Behavior: Digital Trace (Big) Data

MACS 30000: Perspectives on Computational Analysis
University of Chicago

What is digital trace data?

In aggregate: often called “Big Data”

We have finally found our telescope!

[J]ust as the invention of the telescope revolutionized the study of the heavens, so too by rendering the unmeasurable measurable, the technological revolution in mobile, Web, and Internet communications has the potential to revolutionize our understanding of ourselves and how we interact … . [T]hree hundred years after Alexander Pope argued that the proper study of mankind should lie not in the heavens but in ourselves, we have finally found our telescope. Let the revolution begin.

Watts, D.J. (2011:266). Everything is Obvious: Why Common Sense is Nonsense. United Kingdom: Atlantic Books.

Observational data

Any data that results from observing a social system without intervening in some way

Digital trace

Digital data that captures “traces” of human behavior

  • Social media sites
  • Internet search data
  • Blogs and Websites
  • Administrative records
  • Historical archives
  • Geospatial data
  • and much more…

Mapping social distancing

April 28, 2020

September 13, 2020

Source: https://www.safegraph.com/dashboard/covid19-shelter-in-place

Geography of intergenerational mobility

Chetty, R., Hendren, N., Kline, P., & Saez, E. (2014)

Traits of “Big Data”

The 3 Vs of Big Data

  • Volume
  • Variety
  • Velocity

Strengths

Big

Source: Bit by Bit

Always on

Non-reactive

See also: Stephens-Davidowitz, Seth. 2014. “The Cost of Racial Animus on a Black Candidate: Evidence Using Google Search Data.” Journal of Public Economics 118:26–40. https://doi.org/10.1016/j.jpubeco.2014.04.010.

Weaknesses

Inaccessible

Non-representative

Positivity bias

Drifting

Algorithmically confounded

See also: Ugander, Johan, Brian Karrer, Lars Backstrom, and Cameron Marlow. 2011. The Anatomy of the Facebook Social Graph. arXiv:1111.4503 [Physics], November. http://arxiv.org/abs/1111.4503.

Unstructured

Sensitive

https://aboutmyinfo.org/identity

Incomplete

Characteristics of Big Data that are…

  • …generally helpful for research: big, always-on, and non-reactive
  • …generally problematic for research: incomplete, inaccessible, non-representative, drifting, algorithmically confounded, dirty, and sensitive

The future of digital trace data

Source: https://commons.wikimedia.org/wiki/File:Gartner_Hype_Cycle.svg

Discussion: repurposing social media data

While some public health advocates consider e-cigarettes an effective aid for smoking cessation, others warn about the potential risks, such as the high levels of nicotine. Imagine that a researcher decides to study public opinion toward e-cigarettes by collecting e-cigarettes-related Twitter posts and conducting sentiment analysis.

Thinking about the data in particular, what are three possible biases that you are most worried about in this study? Three possible strengths?

Clark et al. 2016 “Vaporous Marketing: Uncovering Pervasive Electronic Cigarette Advertisements on Twitter.”

  • Collected 850,000 tweets that used e-cigarette-related keywords from January 2012 through December 2014
  • 80% of tweets were from bots (i.e. non-humans)
  • Tweets from bots were more positive than organic tweets (i.e. higher average sentiment score)
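The bot findings above imply that a pooled sentiment average mostly reflects bots rather than people. A minimal sketch of that arithmetic follows. The 80% bot share comes from the slide; the two mean sentiment values are hypothetical numbers chosen only for illustration, not figures from Clark et al.

```python
def pooled_mean(bot_share, bot_mean, organic_mean):
    """Mean sentiment over all tweets, as a share-weighted average of two groups."""
    return bot_share * bot_mean + (1 - bot_share) * organic_mean


BOT_SHARE = 0.80      # share of tweets from bots (from the slide)
BOT_MEAN = 0.6        # hypothetical mean sentiment of bot tweets
ORGANIC_MEAN = 0.1    # hypothetical mean sentiment of organic tweets

pooled = pooled_mean(BOT_SHARE, BOT_MEAN, ORGANIC_MEAN)
bias = pooled - ORGANIC_MEAN  # how far the pooled mean overstates human sentiment

print(f"pooled mean: {pooled:.2f}, organic mean: {ORGANIC_MEAN:.2f}, bias: {bias:.2f}")
```

Under these assumed values, a researcher who skips bot filtering would report public opinion as far more positive than the organic tweets suggest.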

Review

Review: To test a social scientific theory…

  • …we must operationalize theoretical constructs in concrete, measurable terms
  • But construct validity is hard to ensure when working with Big Data!

A Theory: People who are more intelligent earn more money.

Two studies that make this claim based on their findings …

  • Study 1: People on Twitter who use longer words are more likely to mention luxury brands
    • Longer words \(\stackrel{?}{\sim}\) more intelligent
    • Luxury brand \(\stackrel{?}{\sim}\) earn more money
    • Weak construct validity
  • Study 2: People who score well on the Raven Progressive Matrices Test have higher reported incomes on their tax returns
    • Raven Progressive Matrices Test \(\sim\) Intelligence (Carpenter, Just, and Shell 1990)
    • Reported Income on Tax Returns \(\sim\) Earnings
    • Much stronger construct validity!

Example digital traces used to operationalize theoretical constructs

Table 2.2, Bit by Bit
Data source | Theoretical construct | References
Email logs from a university (meta-data only) | Social relationships | Kossinets and Watts (2006), Kossinets and Watts (2009), De Choudhury et al. (2010)
Social media posts on Weibo | Civic engagement | Zhang (2016)
Email logs from a firm (meta-data and complete text) | Cultural fit in an organization | Srivastava et al. (2017)

Counting things

  • How frequently does \(x\) occur?
  • How does that frequency compare to that of \(y\)?
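The two counting questions above can be sketched in a few lines of Python. The event log here is a hypothetical toy list; in practice each entry would be one observed trace, such as a search query or a taxi trip.

```python
from collections import Counter


def relative_frequencies(events):
    """Return each event's share of all observed events."""
    counts = Counter(events)
    total = sum(counts.values())
    return {event: n / total for event, n in counts.items()}


# Hypothetical log of observed events
events = ["x", "y", "x", "x", "y", "z", "x", "y"]

counts = Counter(events)               # how frequently does each event occur?
freqs = relative_frequencies(events)   # the same counts as shares of the total
ratio_x_to_y = counts["x"] / counts["y"]  # frequency of x compared to y

print(counts["x"], counts["y"], round(ratio_x_to_y, 2))
```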

Farber 2015: Comparing competing theories in labor economics

  • Do NYC taxi drivers work more hours on days with higher hourly wages, or on days with lower hourly wages?
  • Overall: drivers work more hours on days with higher wages
  • But: this is learned behavior; newer drivers gradually learn to work more hours on high-wage days
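One way to frame the comparison above is the sign of the relationship between daily wages and hours worked: a positive slope means more hours on high-wage days, a negative slope means fewer. The sketch below estimates a simple log-log slope on hypothetical shift data. It is an illustration of the idea only, not Farber's actual specification or data.

```python
import math


def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Hypothetical shifts: (average hourly wage in dollars, hours worked)
shifts = [(20.0, 7.0), (22.0, 7.5), (25.0, 8.0), (28.0, 8.5), (30.0, 9.0)]

log_wage = [math.log(wage) for wage, _ in shifts]
log_hours = [math.log(hours) for _, hours in shifts]

# Slope of log(hours) on log(wage): an elasticity-style summary of the relationship
elasticity = ols_slope(log_wage, log_hours)
print("positive slope" if elasticity > 0 else "non-positive slope")
```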

Counting things 2.0: deriving measurements

  • Measurements: Operational definition of a theoretical construct
  • Example: Measuring “Happiness” of members of the U.S. Congress as a function of multiple variables (Wojcik et al. 2015):
    • Sentiment of tweets (number of positive vs. negative words)
    • Intensity of smile in professional photographs
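A word-count sentiment measure like the first component above can be sketched as follows. The word lists and the scoring formula are hypothetical stand-ins; Wojcik et al. may have used a different lexicon and scoring rule.

```python
import re

# Hypothetical mini-lexicons; a real study would use a validated dictionary
POSITIVE = {"great", "happy", "proud", "love"}
NEGATIVE = {"sad", "angry", "terrible", "hate"}


def sentiment_score(text):
    """Score in [-1, 1]: (positive - negative) / (positive + negative) word counts."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    if pos + neg == 0:
        return 0.0  # no lexicon words found: treat as neutral
    return (pos - neg) / (pos + neg)


print(sentiment_score("Proud and happy to vote today"))
print(sentiment_score("A terrible, sad day, but I love this town"))
```

Note the construct validity question this raises: a score of 0.0 can mean either balanced sentiment or no lexicon words at all, and neither is obviously "happiness."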

Reading Exercise

Chan et al. 2021 “Can psychological traits explain mobility behavior during the COVID-19 pandemic?”

  • What is the primary research question?
  • What data do the authors use to answer this question?
  • How do they attempt to bridge the gap between the theoretical constructs in their research question and the data they use in the paper? Assess the construct validity of this study.
  • How do the authors answer their research question? What methods do they use?
  • What are the authors’ main findings?
  • What are some strengths and limitations of this research (design)? Consider both the data and the methods used. (In other words, why did this paper get published? What is its main contribution, its special ingredient? What is left unanswered?)

In-Class On Thursday

Shi et al. 2017 “Millions of online book co-purchases reveal partisan differences in the consumption of science”

  • What is the primary research question?
  • What data do the authors use to answer this question?
  • How do they attempt to bridge the gap between the theoretical constructs in their research question and the data they use in the paper? Assess the construct validity of this study.
  • How do the authors answer their research question? What methods do they use?
  • What are the authors’ main findings?
  • What are some strengths and limitations of this research (design)? Consider both the data and the methods used.

Mental Exercise

Group Assignment

Week 2 Discussion Groups
0 Andy / Abbey / Kuang
1 Jiazheng / Cosmo / Kexin
2 Pritam Rashtrapal / Agnes / Tianle
3 Huanrui / Tian / Max
4 Daniela / Ertong / Zhuojun
5 Emma / Lorena / Yue
6 Anny / Yuhan / Thomas

  • Read Stephens-Davidowitz (2014) and understand how the author used Google’s search records to measure racial animus in the United States. Critically evaluate the strengths and weaknesses of the measure.

  • Play around with Google Trends to see what functionalities it offers and try to figure out Google’s methodology by checking out information that Google publicly releases (such as its FAQ page).

  • Then, consider the following scenario:

Suppose you are asked to conduct a study on how the U.S. public’s attitude toward mask-wearing varied across regions during the COVID-19 pandemic. You think you might obtain good measurements from Google’s search records. How would you construct your measure(s)? What search queries (or combination of search queries) would you use? What research question(s) does your measure best answer? (That is, exactly what public attitude or interest does your query measure?) What does it NOT measure? What are possible measurement errors and biases?
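As a starting point for thinking about such a measure, here is a minimal sketch of a regional search-interest index. It assumes the index is a query's share of all searches in a region, rescaled so the top region equals 100, which is roughly how Google's public help pages describe Trends. The exact methodology is not public, and all counts and region names below are hypothetical.

```python
def scaled_search_interest(query_counts, total_counts):
    """Share of searches matching the query per region, rescaled so the max is 100."""
    shares = {region: query_counts[region] / total_counts[region]
              for region in query_counts}
    top_share = max(shares.values())
    return {region: round(100 * share / top_share)
            for region, share in shares.items()}


# Hypothetical counts of mask-related queries and of all queries, by region
query_counts = {"Region A": 500, "Region B": 300, "Region C": 50}
total_counts = {"Region A": 100_000, "Region B": 30_000, "Region C": 20_000}

interest = scaled_search_interest(query_counts, total_counts)
print(interest)
```

In this toy data, Region A has the most raw queries but only half the relative interest of Region B, which is why raw counts and normalized shares can answer different research questions.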

  • For future weeks, I will announce group assignments ahead of time by email.
  • In each Thursday’s class, I will give you 10 minutes to decide on your group response. Hopefully, this will leave more time for presentations and Q&A.
  • I encourage you to start your group discussions before class.