Research Proposal

MACS 30000: Perspectives on Computational Analysis
University of Chicago

Announcements

  • The final assignment has been released
  • (Optional) re-submission of Assignment 1 is due this Wednesday
    • Please provide a cover letter highlighting any differences
    • Please stay within the word count limit

Structure of the remaining weeks

  • Tuesday, week 8
    • Mass collaboration: human computation
    • Identifying a research question
  • Thursday, week 8
    • Research design
    • Workshopping your research questions

  • Thanksgiving week
    • Brainstorm your ideas and draft your research proposal
    • Submit whatever you have by Sun Nov 26
    • Participation graded on a complete/incomplete basis
  • Tuesday, week 9
    • Open calls and distributed data collection
    • Research design, continued
  • Thursday, week 9
    • Proposal workshop
  • Final proposal due on Friday of final week

Mass collaboration

  • Drawing on large-scale human collaboration to conduct research
  • Useful when computers are not (yet) smart enough to complete the tasks you need done, or when you are looking for outside-the-box solutions

Major types

  • Human computation (this week)
  • Open call (week 9)
  • Distributed data collection (week 9)

Human computation

Split-apply-combine problems

  • Split data into pieces
  • Apply some function to each piece
  • Combine the results back together again (see the sketch below)
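A minimal sketch of the pattern in Python, using pandas; the data and column names are hypothetical:

```python
# Split-apply-combine: hypothetical ratings, three workers per image.
import pandas as pd

ratings = pd.DataFrame({
    "image_id": ["a", "a", "a", "b", "b", "b"],
    "worker":   [1, 2, 3, 1, 2, 3],
    "score":    [7, 8, 6, 3, 4, 3],
})

# Split by image, apply the mean to each piece, combine into one table.
mean_scores = ratings.groupby("image_id")["score"].mean()
print(mean_scores)
# image_id
# a    7.000000
# b    3.333333
# Name: score, dtype: float64
```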

Data quality

  • Relies on the quality of volunteers/workers
  • Implement redundancy: assign each task to multiple workers and aggregate their answers (see the sketch below)
  • Aim for a “good enough” standard rather than perfection
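One common aggregation rule is a simple majority vote, flagging items where workers disagree. A minimal sketch with hypothetical labels (not any particular platform's API):

```python
# Majority vote over redundant labels; flag low-agreement items for review.
from collections import Counter

labels = {
    "doc1": ["spam", "spam", "ham"],
    "doc2": ["ham", "ham", "ham"],
    "doc3": ["spam", "ham", "ham", "spam", "ham"],
}

for item, votes in labels.items():
    winner, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    note = "" if agreement >= 2 / 3 else "  <- low agreement, review manually"
    print(f"{item}: {winner} (agreement {agreement:.0%}){note}")
```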

How to recruit workers

  • Free!
  • Micro-task labor markets
    • Example: Amazon Mechanical Turk

Mechanical Turk

Streetscore

  • Measures the perceived safety of a street based on how it looks to a human
  • Computer-assisted human computation system
    • 3,000 human-scored images used for training
    • 300,000 images scored for New York City alone
  • Field of urban science and urban studies

Potential limitations of this computation method?

  • Only used Boston and NYC images.
  • Street View images are a single snapshot in time.
    • At what time of day does Google drive its cars around city streets capturing data, and how might that bias the images?
  • What does it mean to be “safe”?
  • Always remember the source of the data

Manifesto Project

  • Content analysis of political party policy positions
    • 1,000 parties in over 50 countries, from 1945 to the present
  • Large-scale comparative analysis
  • Relies on MTurk workers
  • Extremely challenging to perform in a traditional research design

Manifesto Project

[Figure 5.5 from Bit by Bit]

[Figure 5.6 from Bit by Bit]

Manifesto Project

  • Wisdom of the crowds
    • Individual raters not as good as expert coders
    • Average of crowd ratings is pretty close to experts
    • Far cheaper to hire MTurkers, and much easier to scale
  • Application of natural language processing?
    • Lots of languages
    • Coding on multiple dimensions
    • Could use crowd-sourced data as a training set (see the sketch below)
  • manifestoR: an R package for accessing Manifesto Project data
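A hedged sketch of that last idea: fit a simple bag-of-words classifier on crowd-coded sentences, then predict codes for new text. The sentences, labels, and codes below are invented for illustration; this is not the Manifesto Project's actual pipeline.

```python
# Train on crowd-coded sentences (hypothetical), predict codes for new text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

crowd_sentences = [
    "We will cut taxes for working families",
    "Raise the minimum wage to a living wage",
    "Reduce regulation on small businesses",
    "Expand public healthcare coverage",
]
crowd_labels = ["economic-right", "economic-left",
                "economic-right", "economic-left"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(crowd_sentences, crowd_labels)

print(model.predict(["We will cut taxes on small businesses"]))
# -> ['economic-right'] with these toy data
```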

Zooniverse

  • People-powered research platform
  • Allows anyone to host a project and anyone to volunteer on a project
  • Benefits/drawbacks to microtasks vs. volunteers?

What is human computation best for?

Best for

  • Easy tasks, big scale
  • Tasks not easily solved by computers (yet)
  • Tasks can be done by non-experts
  • Classification is not subjective; avoid questions like:
    • Is this news story biased?
    • Is this bad policy?
  • Augmented and scaled by computer-assisted human computation

Open Calls

  • Pose a problem asking for specific, measurable solutions from other people
  • Offer a reward/incentive for participation
  • Compare and evaluate the solutions using a consistent and measurable metric
  • Generate broad participation from a wide range of researchers

Netflix prize

  • Netflix needed to predict which movies customers would enjoy
  • Internal research had plateaued
  • Released an anonymized dataset of 100 million movie ratings; contestants predicted 3 million held-out ratings
  • Anyone who could create an algorithm that improved on Netflix’s existing system by 10% or more would win $1 million
  • Clear and unbiased evaluation criterion: root mean squared error (RMSE) on the held-out ratings (see the sketch below)
  • Solicited over 40,000 solutions
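A minimal sketch of the RMSE computation with made-up numbers:

```python
# RMSE on held-out ratings (numbers invented for illustration).
import math

actual    = [4, 3, 5, 2, 4]           # true held-out ratings
predicted = [3.8, 3.2, 4.5, 2.5, 4.1]

rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                 / len(actual))
print(f"RMSE: {rmse:.3f}")  # RMSE: 0.344
```

A winning entry had to achieve an RMSE roughly 10% lower than Netflix's own system on the same held-out set.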

Discussion

  • The best predictive models in the Netflix Prize open call were hybrids of multiple models (ensemble methods). What characteristic of one model relative to the others made it improve the overall prediction when blended with them? (A toy illustration follows this list.)
  • In your opinion, what kind of tasks are better suited for open call contests? What kind of tasks are not?
  • What are the benefits to the researchers proposing the problem?
  • What are the benefits to the participants proposing the solutions?
  • Are open calls better tailored to questions of prediction or questions of explanation? How might we utilize open calls to tackle explanations?
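On the first question, a toy illustration with made-up numbers: two individually mediocre models whose errors point in opposite directions blend into a much better prediction. What matters is that a model's errors are weakly correlated with the other models' errors, not that the model is strong on its own.

```python
# Toy blend: the two models' errors are anti-correlated and cancel out.
import math

def rmse(truth, pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth))

truth   = [4.0, 3.0, 5.0, 2.0]
model_a = [4.6, 2.4, 5.5, 1.5]   # errs high, low, high, low
model_b = [3.4, 3.6, 4.5, 2.5]   # errs in exactly the opposite direction
blend   = [(a + b) / 2 for a, b in zip(model_a, model_b)]

print(rmse(truth, model_a))  # ~0.55
print(rmse(truth, model_b))  # ~0.55
print(rmse(truth, blend))    # 0.0 here; real gains are smaller but real
```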

Distributed Data Collection

  • Enlist volunteers as data collectors
  • Enables work on a scale otherwise impossible

OpenStreetMap

[Map image. Source: OpenStreetMap]

Designing your own mass collaboration

  • Motivate participants
  • Leverage heterogeneity
  • Focus attention
  • Be ethical
  • What happens if nobody comes?

Identifying research questions

Identifying research topics

  • Different strategies:
    • Event-driven
      • Avoid me-search
    • Theory-driven
    • Data-driven
    • Don’t start with a method
  • Start broad, then narrow down

Sources for a research topic

  • Social science seminars
  • Workshops
  • Relevant academic journals
  • Conference presentations
  • Research labs
  • Google (and library searching)
  • Ongoing world events

Turn into a research question

  • Should be focused
    • You’re writing a paper, not a book
  • Needs to be answerable
  • Types of questions:
    • What? (descriptive)
    • How? (descriptive/causal)
    • Why? (causal)
      • Requires a puzzle

Tips

  • Sometimes, “what” questions are interesting on their own; oftentimes, they are not.
  • Event/data-driven project: Which questions are answerable? Which variables are measurable? What do they measure?
  • It is not a bad idea to start with a descriptive/how question, but aim to turn it into a causal/why question later on
    • A literature review helps.

Tips

  • Can derive dozens of questions from even a specific topic
  • Pick and choose between them - what is interesting/relevant?
  • Event/data-driven project:
    • Okay to start with a single case/event/platform/site
    • Focus on internal variation
    • Worry about generalizability later

Evaluate your questions

  • There’s no such thing as a dumb question, but not all questions are equally good
  • Avoid settled facts
  • Event/theory-driven project: What data exists?
  • Theory-driven project: Do you have a “fair” sample?

Turn into a research problem

  • “Nobody has studied it before”
    • could be your starting point
    • is almost never a good ending point
  • Your research must be valuable: Determine your potential audience - what will concern them?
  • Pure vs. applied research
    • Science is okay with pure research problems
    • We want understanding
    • In other contexts, application is far more important

Mental exercise for Thursday

  1. Identify ONE research topic (as defined in Booth et al. [2016]) within the social sciences that you find personally interesting. Briefly explain the research topic and your interest. If it helps, feel free to use the framing device the authors suggest at the end of chapter 3 (e.g. what you are writing about -> what you don’t know about it -> why you want your reader to know and care about it).

  2. Based on your proposed research topic, identify up to THREE specific research questions. You should be capable of answering these questions within a single research paper. For each question, identify the problem and target audience of the question. Finally, explain how a computationally enhanced research design could assist you in answering this question.

In class on Thursday

  • Bring your research topic and questions to class
  • Group discussion (up to 30 mins)
  • Present your topic and questions to class

Discussion groups

0 Abbey / Yuhan / Ertong
1 Emma / Jiazheng / Thomas
2 Andy / Kuang / Agnes
3 Anny / Zhuojun / Max
4 Daniela / Pritam / Cosmo
5 Huanrui / Lorena / Yue
6 Tianle / Kexin / Tian

Example

  • Topic: Polarization in consumption of science
  • Bigger picture
    • Theory-driven: Is science beyond politics?
    • Data-driven: What can we do with millions of online book co-purchases?

Research questions from general to specific

  • What/which science books do liberals and conservatives read?
  • Are liberals or conservatives more interested in reading science books?
  • Do liberals and conservatives read the same (or similar) science books?
  • How are liberals and conservatives different in their selection of science books?
    • Which disciplines are read more by liberals/conservatives?
    • What are the characteristics of the disciplines favored by liberals and conservatives?
    • When a scientific discipline attracts equal attention from both sides, is there any internal division within the discipline?
    • Within each discipline, what is the breadth of books read by liberals and conservatives?
  • What explains the difference? Applied vs. pure?

Answerable questions given co-purchase data

  • What/which science books are co-purchased with liberal/conservative political books?
  • Are science books co-purchased more with liberal or conservative books?
  • Which disciplines of books are more co-purchased with liberal/conservative books?
  • What percentages of books within each discipline have shared links with liberal/conservative books?
  • Within each discipline, what is the breadth of books co-purchased with liberal/conservative books?
  • Are political alignment/shared interest/difference in breadth correlated with any characteristics of disciplines?

Operationalization

  • What constitutes a co-purchase link?
    • Should links be weighted?
  • Given a set of linked books, how do we define their breadth? (see the sketch below)
  • How do we measure whether a discipline is more pure or applied?
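A hedged sketch of what two of these choices might look like in code; the books, disciplines, and weight threshold are all invented for illustration:

```python
# One possible operationalization: weight links by co-purchase counts,
# drop weak links, and define breadth as the number of distinct disciplines.
copurchases = [
    # (political_book, science_book, co-purchase count)
    ("lib_book_1", "bio_book_1", 12),
    ("lib_book_1", "bio_book_2", 3),
    ("con_book_1", "chem_book_1", 9),
    ("con_book_1", "eng_book_1", 7),
]
discipline = {"bio_book_1": "biology", "bio_book_2": "biology",
              "chem_book_1": "chemistry", "eng_book_1": "engineering"}

MIN_WEIGHT = 5  # an arbitrary cutoff for counting a link at all
linked = {}
for pol, sci, weight in copurchases:
    if weight >= MIN_WEIGHT:
        linked.setdefault(pol, set()).add(discipline[sci])

for pol, fields in linked.items():
    print(pol, "breadth:", len(fields), sorted(fields))
# lib_book_1 breadth: 1 ['biology']
# con_book_1 breadth: 2 ['chemistry', 'engineering']
```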

Class exercise

  • As a class, identify a focused topic within #MeToo
  • In small groups and then as a class, identify
    • Up to three research questions
    • Identify the significance of the question - that is, what problem is it trying to solve?

Research design

Literature review: Identifying sources

  • What constitutes an academic source?
    • Published journal article
    • Scholarly book/chapter
    • Working paper
  • What is not an appropriate source?
    • Policy report (maybe)
    • Blog post
    • Wikipedia

How to find sources

  • Google Scholar

  • Library research guides

  • Articles in annual review journals, e.g. Annual Review of Sociology/Political Science/Psychology

  • Handbooks & annotated bibliography

  • Skim the works cited of a relevant paper

  • Use a citation index (e.g. “Cited by” in Google Scholar) to find other papers that cite this work

Evaluating source quality

  • Is the publisher reputable?
  • Is it peer-reviewed?
  • Is it current?
  • What is the citation count?

Manage your bibliography

  • Record the bibliographical information for sources you think are noteworthy
  • Far better to do this up front than to wait until you write the paper - you will forget some of your sources
  • Citation management

BibTeX

  • Store bibliographic information in plain-text .bib files
  • Easily incorporate citations into LaTeX and R Markdown documents
  • I believe it can be used with Microsoft Word as well
  • Automatically generates your works cited page (see the example entry below)
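For instance, a minimal .bib entry for the course textbook might look like this (the key salganik2018bit is an arbitrary label of my choosing; any unique key works):

```bibtex
@book{salganik2018bit,
  author    = {Salganik, Matthew J.},
  title     = {Bit by Bit: Social Research in the Digital Age},
  publisher = {Princeton University Press},
  address   = {Princeton, NJ},
  year      = {2018}
}
```

Cite it with \cite{salganik2018bit} in LaTeX or [@salganik2018bit] in R Markdown, and the works cited entry is generated for you.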

A note on citation style

  • I don’t care which style you use, just be consistent
  • Using BibTeX or other citation managers ensures consistency in formatting
  • If you still have trouble understanding how to integrate your sources into your writing (e.g. when to cite, how to paraphrase) read chapter 14 in Booth or ask me.

Developing a research design

  • Is your question a what/how/why question?
  • If it is purely a descriptive/what question
    • choose an observational design
    • design a survey
  • If you want to study heterogeneity/mechanisms
    • go big: you need enough data to compare subgroups
  • If you want to test a causal claim
    • come up with a quasi-experimental design
    • or conduct a real experiment

Where do you get data?

  • Does the data in principle already exist?
  • If yes, is it analog or digital?
    • If analog, how do you turn it into digital?
      • Machine coding
      • Human computation
    • If digital, how hard is it to get?

Where do you get data?

  • Does the data in principle already exist?
  • If no, how would you create it?
    • Survey
    • Mass collaboration
    • Simulation
  • Does the data allow you to answer your question?
    • If no, could you change your question?