Research Proposal

MACS 30000: Perspectives on Computational Analysis
University of Chicago

Announcements

  • The final assignment has been released
  • (Optional) re-submission of Assignment 1 is due this Wednesday
    • Please provide a cover letter highlighting any differences
    • Please stay within the word count limit

Structure of the remaining weeks

  • Tuesday, week 8
    • Mass collaboration: human computation
    • Identifying a research question
  • Thursday, week 8
    • Research design
    • Workshopping your research questions

  • Thanksgiving week
    • Brainstorm your ideas and draft your research proposal
    • Submit whatever you have by Sun Nov 26
    • Participation graded on a complete/incomplete basis
  • Tuesday, week 9
    • Open calls and distributed data collection
    • Research design, continued
  • Thursday, week 9
    • Proposal workshop
  • Final proposal due on Friday of final week

Mass collaboration

  • Drawing on large-scale human collaboration to conduct research
  • Useful when computers are not (yet) smart enough to complete the tasks you need done, or when you are looking for outside-the-box solutions

Major types

  • Human computation (this week)
  • Open call (week 9)
  • Distributed data collection (week 9)

Human computation

Split-apply-combine problems

  • Split data into pieces
  • Apply some function to each piece
  • Combine the results back together again (see the sketch below)
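A minimal sketch of the pattern in Python, using pandas; the data and column names are hypothetical:

```python
# Split-apply-combine: hypothetical ratings, three workers per image.
import pandas as pd

ratings = pd.DataFrame({
    "image_id": ["a", "a", "a", "b", "b", "b"],
    "worker":   [1, 2, 3, 1, 2, 3],
    "score":    [7, 8, 6, 3, 4, 3],
})

# Split by image, apply the mean to each piece, combine into one table.
mean_scores = ratings.groupby("image_id")["score"].mean()
print(mean_scores)
# image_id
# a    7.000000
# b    3.333333
# Name: score, dtype: float64
```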

Data quality

  • Relies on the quality of volunteers/workers
  • Implement redundancy: assign each task to multiple workers and aggregate their answers (see the sketch below)
  • Aim for a “good enough” standard rather than perfection
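One common aggregation rule is a simple majority vote, flagging items where workers disagree. A minimal sketch with hypothetical labels (not any particular platform's API):

```python
# Majority vote over redundant labels; flag low-agreement items for review.
from collections import Counter

labels = {
    "doc1": ["spam", "spam", "ham"],
    "doc2": ["ham", "ham", "ham"],
    "doc3": ["spam", "ham", "ham", "spam", "ham"],
}

for item, votes in labels.items():
    winner, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    note = "" if agreement >= 2 / 3 else "  <- low agreement, review manually"
    print(f"{item}: {winner} (agreement {agreement:.0%}){note}")
```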

How to recruit workers

  • Free!
  • Micro-task labor markets
    • Example: Amazon Mechanical Turk

Mechanical Turk

Streetscore

  • Measures the perceived safety of a street based on how it looks to a human
  • Computer-assisted human computation system
    • 3,000 human-scored images used for training
    • 300,000 images scored for New York City alone
  • Field of urban science and urban studies

Potential limitations of this computation method?

  • Only used Boston and NYC images.
  • Street View images are a single snapshot in time.
    • At what time of day does Google drive its cars around city streets capturing data, and how might that bias the images?
  • What does it mean to be “safe”?
  • Always remember the source of the data

Manifesto Project

  • Content analysis of political party policy positions
    • 1,000 parties in over 50 countries, from 1945 to the present
  • Large-scale comparative analysis
  • Relies on MTurk workers
  • Extremely challenging to perform in a traditional research design

Manifesto Project

[Figure 5.5 from Bit by Bit]

[Figure 5.6 from Bit by Bit]

Manifesto Project

  • Wisdom of the crowds
    • Individual raters not as good as expert coders
    • Average of crowd ratings is pretty close to experts
    • Far cheaper to hire MTurkers, and much easier to scale
  • Application of natural language processing?
    • Lots of languages
    • Coding on multiple dimensions
    • Could use crowd-sourced data as a training set (see the sketch below)
  • manifestoR: an R package for accessing Manifesto Project data
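A hedged sketch of that last idea: fit a simple bag-of-words classifier on crowd-coded sentences, then predict codes for new text. The sentences, labels, and codes below are invented for illustration; this is not the Manifesto Project's actual pipeline.

```python
# Train on crowd-coded sentences (hypothetical), predict codes for new text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

crowd_sentences = [
    "We will cut taxes for working families",
    "Raise the minimum wage to a living wage",
    "Reduce regulation on small businesses",
    "Expand public healthcare coverage",
]
crowd_labels = ["economic-right", "economic-left",
                "economic-right", "economic-left"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(crowd_sentences, crowd_labels)

print(model.predict(["We will cut taxes on small businesses"]))
# -> ['economic-right'] with these toy data
```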

Zooniverse

  • People-powered research platform
  • Allows anyone to host a project and anyone to volunteer on a project
  • Benefits/drawbacks to microtasks vs. volunteers?

What is human computation best for?

Best for

  • Easy tasks, big scale
  • Tasks not easily solved by computers (yet)
  • Tasks can be done by non-experts
  • Classification is not subjective; avoid questions like:
    • Is this news story biased?
    • Is this bad policy?
  • Augmented and scaled by computer-assisted human computation

Open Calls

  • Pose a problem asking for specific, measurable solutions from other people
  • Offer a reward/incentive for participation
  • Compare and evaluate the solutions using a consistent and measurable metric
  • Generate broad participation from a wide range of researchers

Netflix prize

  • Netflix needed to predict which movies customers would enjoy
  • Internal research had plateaued
  • Released an anonymized dataset of 100 million movie ratings; contestants predicted 3 million held-out ratings
  • Anyone who could create an algorithm that improved on Netflix’s existing system by 10% or more would win $1 million
  • Clear and unbiased evaluation criterion: root mean squared error (RMSE) on the held-out ratings (see the sketch below)
  • Solicited over 40,000 solutions
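A minimal sketch of the RMSE computation with made-up numbers:

```python
# RMSE on held-out ratings (numbers invented for illustration).
import math

actual    = [4, 3, 5, 2, 4]           # true held-out ratings
predicted = [3.8, 3.2, 4.5, 2.5, 4.1]

rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                 / len(actual))
print(f"RMSE: {rmse:.3f}")  # RMSE: 0.344
```

A winning entry had to achieve an RMSE roughly 10% lower than Netflix's own system on the same held-out set.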

Discussion

  • The best predictive models in the Netflix Prize open call were hybrids of multiple models (ensemble methods). What characteristic of one model relative to the others made it improve the overall prediction when blended with them? (A toy illustration follows this list.)
  • In your opinion, what kind of tasks are better suited for open call contests? What kind of tasks are not?
  • What are the benefits to the researchers proposing the problem?
  • What are the benefits to the participants proposing the solutions?
  • Are open calls better tailored to questions of prediction or questions of explanation? How might we utilize open calls to tackle explanations?
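On the first question, a toy illustration with made-up numbers: two individually mediocre models whose errors point in opposite directions blend into a much better prediction. What matters is that a model's errors are weakly correlated with the other models' errors, not that the model is strong on its own.

```python
# Toy blend: the two models' errors are anti-correlated and cancel out.
import math

def rmse(truth, pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth))

truth   = [4.0, 3.0, 5.0, 2.0]
model_a = [4.6, 2.4, 5.5, 1.5]   # errs high, low, high, low
model_b = [3.4, 3.6, 4.5, 2.5]   # errs in exactly the opposite direction
blend   = [(a + b) / 2 for a, b in zip(model_a, model_b)]

print(rmse(truth, model_a))  # ~0.55
print(rmse(truth, model_b))  # ~0.55
print(rmse(truth, blend))    # 0.0 here; real gains are smaller but real
```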

Distributed Data Collection

  • Enlist volunteers as data collectors
  • Enables work on a scale otherwise impossible

OpenStreetMap

[Map image. Source: OpenStreetMap]

Designing your own mass collaboration

  • Motivate participants
  • Leverage heterogeneity
  • Focus attention
  • Be ethical
  • What happens if nobody comes?

Identifying research questions

Identifying research topics

  • Different strategies:
    • Event-driven
      • Avoid me-search
    • Theory-driven
    • Data-driven
    • Don’t start with a method
  • Start broad, then narrow down

Sources for a research topic

  • Social science seminars
  • Workshops
  • Relevant academic journals
  • Conference presentations
  • Research labs
  • Google (and library searching)
  • Ongoing world events

Turn into a research question

  • Should be focused
    • You’re writing a paper, not a book
  • Needs to be answerable
  • Types of questions:
    • What? (descriptive)
    • How? (descriptive/causal)
    • Why? (causal)
      • Requires a puzzle

Tips

  • Sometimes, “what” questions are interesting on their own; oftentimes, they are not.
  • Event/data-driven project: Which questions are answerable? Which variables are measurable? What do they measure?
  • It is not a bad idea to start with a descriptive/how question, but aim to turn it into a causal/why question later on
    • A literature review helps.

Tips

  • Can derive dozens of questions from even a specific topic
  • Pick and choose between them - what is interesting/relevant?
  • Event/data-driven project:
    • Okay to start with a single case/event/platform/site
    • Focus on internal variation
    • Worry about generalizability later

Evaluate your questions

  • There’s no such thing as a dumb question, but not all questions are equally good
  • Avoid settled facts
  • Event/theory-driven project: What data exists?
  • Theory-driven project: Do you have a “fair” sample?

Turn into a research problem

  • “Nobody has studied it before”
    • could be your starting point
    • is almost never a good ending point
  • Your research must be valuable: Determine your potential audience - what will concern them?
  • Pure vs. applied research
    • Science is okay with pure research problems
    • We want understanding
    • In other contexts, application is far more important

Mental exercise for Thursday

  1. Identify ONE research topic (as defined in Booth et al. [2016]) within the social sciences that you find personally interesting. Briefly explain the research topic and your interest. If it helps, feel free to use the framing device the authors suggest at the end of chapter 3 (e.g. what you are writing about -> what you don’t know about it -> why you want your reader to know and care about it).

  2. Based on your proposed research topic, identify up to THREE specific research questions. You should be capable of answering these questions within a single research paper. For each question, identify the problem and target audience of the question. Finally, explain how a computationally enhanced research design could assist you in answering this question.

In class on Thursday

  • Bring your research topic and questions to class
  • Group discussion (up to 30 mins)
  • Present your topic and questions to class

Discussion groups

0 Abbey / Yuhan / Ertong
1 Emma / Jiazheng / Thomas
2 Andy / Kuang / Agnes
3 Anny / Zhuojun / Max
4 Daniela / Pritam / Cosmo
5 Huanrui / Lorena / Yue
6 Tianle / Kexin / Tian

Example

  • Topic: Polarization in consumption of science
  • Bigger picture
    • Theory-driven: Is science beyond politics?
    • Data-driven: What can we do with millions of online book co-purchases?

Research questions from general to specific

  • What/which science books do liberals and conservatives read?
  • Are liberals or conservatives more interested in reading science books?
  • Do liberals and conservatives read the same (or similar) science books?
  • How are liberals and conservatives different in their selection of science books?
    • Which disciplines are read more by liberals/conservatives?
    • What are the characteristics of the disciplines favored by liberals and conservatives?
    • When a scientific discipline attracts equal attention from both sides, is there any internal division within the discipline?
    • Within each discipline, what is the breadth of books read by liberals and conservatives?
  • What explains the difference? Applied vs. pure?

Answerable questions given co-purchase data

  • What/which science books are co-purchased with liberal/conservative political books?
  • Are science books co-purchased more with liberal or conservative books?
  • Which disciplines of books are more co-purchased with liberal/conservative books?
  • What percentages of books within each discipline have shared links with liberal/conservative books?
  • Within each discipline, what is the breadth of books co-purchased with liberal/conservative books?
  • Are political alignment/shared interest/difference in breadth correlated with any characteristics of disciplines?

Operationalization

  • What constitutes a co-purchase link?
    • Should links be weighted?
  • Given a set of linked books, how do we define their breadth? (see the sketch below)
  • How do we measure whether a discipline is more pure or applied?
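A hedged sketch of what two of these choices might look like in code; the books, disciplines, and weight threshold are all invented for illustration:

```python
# One possible operationalization: weight links by co-purchase counts,
# drop weak links, and define breadth as the number of distinct disciplines.
copurchases = [
    # (political_book, science_book, co-purchase count)
    ("lib_book_1", "bio_book_1", 12),
    ("lib_book_1", "bio_book_2", 3),
    ("con_book_1", "chem_book_1", 9),
    ("con_book_1", "eng_book_1", 7),
]
discipline = {"bio_book_1": "biology", "bio_book_2": "biology",
              "chem_book_1": "chemistry", "eng_book_1": "engineering"}

MIN_WEIGHT = 5  # an arbitrary cutoff for counting a link at all
linked = {}
for pol, sci, weight in copurchases:
    if weight >= MIN_WEIGHT:
        linked.setdefault(pol, set()).add(discipline[sci])

for pol, fields in linked.items():
    print(pol, "breadth:", len(fields), sorted(fields))
# lib_book_1 breadth: 1 ['biology']
# con_book_1 breadth: 2 ['chemistry', 'engineering']
```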

Class exercise

  • As a class, identify a focused topic within #MeToo
  • In small groups and then as a class, identify
    • Up to three research questions
    • Identify the significance of the question - that is, what problem is it trying to solve?

Research design

Literature review: Identifying sources

  • What constitutes an academic source?
    • Published journal article
    • Scholarly book/chapter
    • Working paper
  • What is not an appropriate source?
    • Policy report (maybe)
    • Blog post
    • Wikipedia

How to find sources

  • Google Scholar

  • Library research guides

  • Articles in annual review journals, e.g. Annual Review of Sociology/Political Science/Psychology

  • Handbooks & annotated bibliography

  • Skim the works cited of a relevant paper

  • Use a citation index (e.g. “Cited by” in Google Scholar) to find other papers that cite this work

Evaluating source quality

  • Is the publisher reputable?
  • Is it peer-reviewed?
  • Is it current?
  • What is the citation count?

Manage your bibliography

  • Record the bibliographical information for sources you think are noteworthy
  • Far better to do this up front than to wait until you write the paper - you will forget some of your sources
  • Citation management

BibTeX

  • Store bibliographic information in plain-text .bib files
  • Easily incorporate citations into LaTeX and R Markdown documents
  • I believe it can be used with Microsoft Word as well
  • Automatically generates your works cited page (see the example entry below)
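For instance, a minimal .bib entry for the course textbook might look like this (the key salganik2018bit is an arbitrary label of my choosing; any unique key works):

```bibtex
@book{salganik2018bit,
  author    = {Salganik, Matthew J.},
  title     = {Bit by Bit: Social Research in the Digital Age},
  publisher = {Princeton University Press},
  address   = {Princeton, NJ},
  year      = {2018}
}
```

Cite it with \cite{salganik2018bit} in LaTeX or [@salganik2018bit] in R Markdown, and the works cited entry is generated for you.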

A note on citation style

  • I don’t care which style you use, just be consistent
  • Using BibTeX or other citation managers ensures consistency in formatting
  • If you still have trouble understanding how to integrate your sources into your writing (e.g. when to cite, how to paraphrase) read chapter 14 in Booth or ask me.

Developing a research design

  • Is your question a what/how/why question?
  • If it is purely a descriptive/what question
    • choose an observational design
    • design a survey
  • If you want to study heterogeneity/mechanisms
    • go big: you need enough data to compare subgroups
  • If you want to test a causal claim
    • come up with a quasi-experimental design
    • or conduct a real experiment

Where do you get data?

  • Does the data in principle already exist?
  • If yes, is it analog or digital?
    • If analog, how do you turn it into digital?
      • Machine coding
      • Human computation
    • If digital, how hard is it to get?

Where do you get data?

  • Does the data in principle already exist?
  • If no, how would you create it?
    • Survey
    • Mass collaboration
    • Simulation
  • Does the data allow you to answer your question?
    • If no, could you change your question?