Using Open Data to Find Social Inequality and Aid the Disadvantaged

SBP 2015 Grand Data Challenge
March 31 – April 3, 2015, UCDC Center, Washington DC, USA


This year's SBP Grand Challenge problem asks participants to consider the following question: "How can we use publicly available data on the web and elsewhere to find social inequality and to aid the disadvantaged?"

From the Arab Spring to the recent Gamergate scandal, the use of social media in understanding and mitigating social inequalities and prejudice has increased at a rapid pace.  At the same time, data used for decades to study the ways in which social inequalities permeate every facet of social structure have become increasingly accessible. While many have taken advantage of these resources to produce new and interesting approaches to understanding social inequalities and ways to prevent them, there is much interesting and useful work still to be done. For example, the following questions may be of interest:

  • How are stereotypes of disadvantaged individuals perpetuated in social media?
  • How do differing levels of Internet access affect the presence and attitude of individuals online?
  • How has the distribution of poverty changed over time as American cities have grown, and how has this affected the impoverished population in a negative or positive way?

These are by no means the only questions of interest, and are only intended to give a rough idea of what might be an interesting topic to explore for this challenge problem.

Winning Entry

  • Title: The Origins of Regional Discrimination in China
  • Authors: Yang Zhang and Xi Wang, University of Iowa
  • Abstract: In this research, we intend to explore the origins of regional discrimination in China. We will analyze two sources of big data, examining more than 2000 Chinese counties and districts. One data source is the most recent Chinese census, which describes the demographic composition of each county/district. And the other data source is the social media data crawled from Sina Weibo that reflect the public sentiment toward non-local population. Through text analysis, we expect that regional discrimination in China results from conflicts of realistic interests and threats to local identities, and local governments are able to shape the public opinion toward the migrant population.

Runner-Up Entry

  • Title: Rich People Don't Have More Followers! Overcoming Social Inequality With Social Media
  • Authors: Hemank Lamba, Momin M. Malik, Constantine Nakos and Juergen Pfeffer, Carnegie Mellon University
  • Abstract: Previous work on personal networks has shown that higher socioeconomic status results in larger and more powerful networks. With the Internet, in particular with social media, it has become easier to establish and maintain relationships, suggesting an equalizing effect. However, people of different socioeconomic status use these new resources in different ways creating a digital divide. In this article we study popularity on Twitter based on estimated socioeconomic status in real life. We collect 1 billion geo-coded Tweets from the United States and connect the geographic position of the sender with socioeconomic data at the level of Census block groups. We show that people tweeting from higher income areas do not have more followers. Rather, there is a small negative correlation between income and number of followers.


  • Don Adjeroh, West Virginia University
  • Ross Maciejewski, Arizona State University
  • Terrill Frantz, Peking University
  • Xun Zhou, University of Iowa


This year's winners received $1,000 in travel funding to present their work at the conference, a $400 cash prize, and plaques declaring their victory in the competition. The runners-up received a travel stipend and a $200 cash prize.

Example Datasets

We have provided some sample datasets to get contestants started on their submissions. These datasets are merely intended to provide a starting point, and are not required for the submission. Contestants are encouraged to provide their own datasets for the community. All of the datasets that follow are available on the SBP Grand Challenge website (

  • Ferguson Protests - Tweets pertaining to 2014 protest activity in Ferguson, Missouri. Contains 1.1M Tweets 7-14 days after the first protests. The full list of Tweet IDs, as well as the Tweet crawler can be found here:
  • Census Data - The US Census Bureau provides an API ( to quickly access large volumes of census data.
  • Gamergate - Tweets pertaining to the Gamergate scandal collected by Andy Baio for his article about the incident (
  • Social Computing Repository - Contains data from a collection of social media sites including Digg, Foursquare, and Twitter. Data can be obtained from (


Please direct all questions to the SBP Grand Challenge Committee at



