Computing Human Society through the Global News

SBP 2014 Grand Data Challenge
April 2 – April 4, 2014, UCDC Center, Washington DC, USA

We invite participants to explore the Global Database of Events, Language, and Tone (GDELT), an expansive dataset of a quarter-billion geo-referenced political news events covering the whole world since 1979 and describing who did what to whom, when, and where, and to use GDELT to

  • Demonstrate the applications of spatial, temporal and network methodologies and their interactions,
  • Identify the latent “influencers” of social movement and media-based political competition,
  • Experimentally validate and improve models for social phenomena,
  • Visualize social movements of different types on all levels at a glance,
  • Propose solutions to societal problems including health care and public safety using data- and model-based reasoning, and
  • Suggest new creative applications using the GDELT data.

 

Using News Media to Understand our Media-Mediated Society

The global news media has long been used as a key archive of human society, as a catalog both of societal activity and of the beliefs and narratives surrounding, causing, and contextualizing it. The transition of the news industry over the past quarter century to digital publication and the corresponding rise of internet publishing have brought with them an exponential rise in the accessibility of news media from around the world. The constant stream of daily life that flows across the world's news media provides a vast archive of societal-scale activities, from riots and military strikes to peace appeals and diplomatic exchanges, together with rich contextual background information that provides detailed narratives of each region and culture. Few of us have visited Iraq, Afghanistan, or Syria recently, yet through the vast interconnected web of the world's news media we can, with just a mouse click, instantly access the latest developments in those countries or almost any other corner of the globe and understand how the world is reacting in real time.

The vast volume of information captured in the world's news at once presents great opportunities, giving ever-increasing access to the real-time behavior and beliefs of the world's citizenry, and requires a fundamental shift in how this information is understood and employed. The concept of quantifying human behavior into discrete "occurrences" known as events has not been widely used outside of the political science and economics disciplines, while latent co-occurrence graph structures over people, organizations, locations, themes, and events have not previously been available at scale for academic research. New methodologies are needed to translate the approaches made possible by the quantification of the world's news media into the theoretical constructs and methodologies most common in these fields. For example, leadership "influencer" networks constructed from outside an organization, using its media coverage, differ in fundamental ways from those constructed internally. How does one adapt the literature of social network analysis (SNA) metrics, theories, and analytic processes to take these differences into account? How does one translate the way in which the geographic affiliation of a particular topic changes over time in the news into a map of the spread of that idea or belief across communities? The media itself presents an imperfect record of society, with key media effects, from selection bias and media fatigue to agenda setting and helicopter journalism, having profound impacts on what is recorded and how it is represented. How can the function of these processes be modeled at scale? Can latent dimensions such as emotion, thematic context, and structural contextualization be used as surrogates for national stability or as forecasting signals of future behavior?

The greatest challenge today no longer lies in the acquisition of data, but rather in how best to synthesize and integrate, in a timely and actionable way, the rapidly increasing pace and volume of available material using analysis and interaction methodologies developed in the bygone era of information scarcity. For example, few mapping applications today support real-time exploration of even millions of points, while most spatial analytic methods fail to scale beyond hundreds of thousands of points in tractable time. Few spatial methods support non-spatial connectivity (such as through parallel semantic connectedness of locations), few network methods incorporate spatial knowledge, and both network and spatial methodologies largely lack consistent support for change over time other than through fixed snapshots. No major global network dataset has existed thus far that attempts to quantify the picture of society presented by the global news media over a longitudinal period of time. In all, our present analytic methodologies were built to make sense of scarcity, yet now must make sense of a deluge of actors, attributes, and the relationships among them. How do we take the first steps towards computing at a "societal scale"?

The Global Database of Events, Language, and Tone (GDELT) is an initiative to translate the textual narrative of human society, as captured by the global news media, into a quantitative catalog of human societal-scale behavior and beliefs across all countries of the world over the last quarter-century, down to the city level globally; to make all of this data freely available for open research; and to provide daily updates to create the first "real time social sciences earth observatory." Today nearly a quarter-billion geo-referenced dyadic "event records" capture global behavior in more than 300 categories covering 1979 to present with daily updates (coverage is to be extended back to 1800 in 2014). Each event record is a dyadic entry that captures who did what to whom, where, and when. These records are combined with a massive array of thematic, spatial, entity, and network indicators capturing both what is happening around the world and the underlying actors, beliefs, and contextualization. GDELT's historical back file and daily updates expand current understanding of social systems beyond small static snapshots of particular locations or time periods towards global longitudinal time series that incorporate the nonlinear behavior and feedback effects that define human interaction, and they greatly enrich fragility indexes, early warning systems, and forecasting efforts.

 

The Dataset

The GDELT database (http://gdelt.utdallas.edu/) is divided into three core data streams, which capture physical activity, counts of key incidents such as deaths, and a graph structure that combines the latent and physical aspects of the global news into a single computable network:

Event Data: With coverage beginning January 1, 1979, this dataset consists of a quarter-billion geo-referenced dyadic “event records” covering all countries in the world and capturing who did what to whom, when, and where. All events are coded in the CAMEO taxonomy of roughly 300 categories of global political behavior, from riots and protests to diplomatic exchanges and peace appeals, and daily updates add around 100,000 new events each day. Special emphasis has been placed on enhanced coverage of Africa and Latin America, producing one of the first cross-national datasets for South America and the most extensive database for Africa. The base CAMEO taxonomy has been enriched with the new CAMEO Religious and Ethnic taxonomies, and all events are geo-referenced to the city level. Each record contains 58 fields capturing a range of information about the event and the actors involved.
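The dyadic "who did what to whom, when, and where" structure can be illustrated with a minimal Python sketch. The five fields and their names below are hypothetical simplifications chosen for illustration (actual records contain 58 fields, which are not reproduced here), and the actor names and event codes are placeholders rather than real CAMEO codes.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class EventRecord:
    """Hypothetical, simplified dyadic event record (not the real schema)."""
    day: date        # when
    actor1: str      # who (source actor)
    actor2: str      # to whom (target actor)
    event_code: str  # what (placeholder for a CAMEO category code)
    place: str       # where (city-level location name)


def dyad_counts(events):
    """Count events per directed (actor1, actor2) pair."""
    return Counter((e.actor1, e.actor2) for e in events)


# Made-up sample data for illustration only.
events = [
    EventRecord(date(2013, 10, 1), "GOV_A", "GOV_B", "CODE_X", "City P"),
    EventRecord(date(2013, 10, 1), "GOV_B", "GOV_A", "CODE_Y", "City Q"),
    EventRecord(date(2013, 10, 2), "GOV_A", "GOV_B", "CODE_Y", "City P"),
]

counts = dyad_counts(events)
for (source, target), n in counts.most_common():
    print(f"{source} -> {target}: {n} event(s)")
```

Treating each pair as a directed edge weighted by its event count is one simple way to turn event records into a network; other weightings (for example, by event category) are equally plausible.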

Count Data: Beginning October 1, 2013, this stream records mentions of counts with respect to a set of predefined categories, such as the number of protesters, the number killed, or the number displaced or sickened. Eleven categories are currently supported: Affected, Arrested, Displaced, Evacuation, Kidnapped, Killed, Protested, Refugees, Seized, Sickened, and Wounded. Such counts may occur independently of events in the Event Data stream, such as mentions of those killed in industrial accidents (which are not recorded in the CAMEO event taxonomy) or those displaced by a natural disaster or sickened by a disease epidemic. In this way, the Count Data can be used, for example, to produce a daily “Death Tracker” that maps all mentions of death across the world each day, or an “Affected Tracker” that indicates how many persons were sickened, displaced, or stranded each day (at least as recorded in the global news media). As with the Event Data, each Count is geo-referenced to the city level globally.
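A tracker of this kind could be sketched as a simple aggregation. The tuple layout (day, category, number, place) and the function name `daily_tracker` are assumptions made for illustration and do not reflect the actual Count Data file format; the sample values are invented.

```python
from collections import defaultdict
from datetime import date

# The eleven categories listed in the text above.
CATEGORIES = {
    "Affected", "Arrested", "Displaced", "Evacuation", "Kidnapped", "Killed",
    "Protested", "Refugees", "Seized", "Sickened", "Wounded",
}


def daily_tracker(mentions, category):
    """Sum the reported numbers per (day, place) for one category.

    Each mention is a hypothetical (day, category, number, place) tuple.
    """
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    totals = defaultdict(int)
    for day, cat, number, place in mentions:
        if cat == category:
            totals[(day, place)] += number
    return dict(totals)


# Made-up sample data for illustration only.
mentions = [
    (date(2013, 10, 1), "Killed", 12, "City P"),
    (date(2013, 10, 1), "Killed", 3, "City P"),
    (date(2013, 10, 1), "Displaced", 400, "City Q"),
    (date(2013, 10, 2), "Killed", 5, "City Q"),
]

deaths = daily_tracker(mentions, "Killed")
for (day, place), total in sorted(deaths.items()):
    print(day.isoformat(), place, total)
```

Because these are mentions in the news rather than verified incident records, a plain sum may count the same incident more than once when several articles report it; any real tracker would need to decide how to handle such duplicates.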

Global Knowledge Graph: Beginning October 1, 2013, this data expands GDELT’s ability to quantify global human society beyond cataloging physical occurrences towards actually representing the latent dimensions, geography, and network structure of the global news. To sum up the Global Knowledge Graph in a single sentence, it is an attempt to connect the people, organizations, locations, counts, themes, emotions, news sources, and events appearing in the news media across the globe each day into a single massive network that captures what’s happening around the world, what its context is and who’s involved, and how the world is feeling about it. This network structure also captures the cultural narratives that envelop the global information stream.

Each morning the Global Knowledge Graph system processes each news article from the previous day (regardless of whether it contained any events in the Event Data file) and compiles a list of all person names, organization names, locations, overall emotion, any mention of a count of something (the Count Data file above), any mention of an event (the Event Data file above), and any mention of a theme from a predefined catalog (over 150 themes as of this writing) within that article. It then constructs a network structure over all of these features based on co-occurrence, allowing a vast array of spatial, temporal, and network analyses, from mapping themes, people, and groups over space to examining the connections among people and analyzing media-derived influencer networks around themes and space. The Global Knowledge Graph records any mention of a major terror group, major political party around the world, major infectious disease, etc., irrespective of any connection with an event. This can be used, for example, to create a media-based proxy of political competition by measuring how often each political party receives media coverage and how the parties are contextualized with one another, including across countries. Other applications include mapping a theme over space to capture change over time and how locations are connected, plotting movements of political candidates during an election (along with the themes they are associated with at each location), and constructing influencer networks around the people and organizations associated with specific topics and locations.
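The co-occurrence idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Global Knowledge Graph system's actual procedure: each article is represented as a plain list of feature strings, the "type:name" labels and the function names are hypothetical, and every pair of features appearing in the same article is counted as one co-occurrence.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(articles):
    """Count how many articles mention each unordered pair of features."""
    edges = Counter()
    for features in articles:
        # De-duplicate within an article and sort so that each pair
        # always appears in the same (a, b) order.
        for a, b in combinations(sorted(set(features)), 2):
            edges[(a, b)] += 1
    return edges


def weighted_degree(edges):
    """Sum the co-occurrence weights attached to each feature."""
    degree = Counter()
    for (a, b), weight in edges.items():
        degree[a] += weight
        degree[b] += weight
    return degree


# Made-up sample data for illustration only.
articles = [
    ["person:Alice Example", "org:Party Red", "theme:ELECTION", "loc:City P"],
    ["person:Alice Example", "org:Party Blue", "theme:ELECTION"],
    ["org:Party Red", "org:Party Blue", "theme:PROTEST", "loc:City Q"],
]

edges = cooccurrence_edges(articles)
degree = weighted_degree(edges)
for feature, weight in degree.most_common(3):
    print(feature, weight)
```

Weighted degree is only a crude first proxy for prominence in a media-derived network; restricting the edges to organization nodes, for instance, would give a rough view of how often political parties are covered together.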

Suggested Topics for Submissions

The SBP 2014 challenge is open-ended; submissions will be evaluated based on theoretical grounding, use of evidence, creativity, and impact. We invite researchers to submit entries of all types. Some suggested topics include

  • New spatial, temporal, and network analytic methodologies and algorithms that can cope with the vast scale of the GDELT catalog. Upwards of a billion location mentions exist in the event database, while just a small slice of the knowledge graph can contain tens or even hundreds of millions of connections over space, time, and context.
  • New spatial analytic methodologies that can better take into account change over time and non-spatial distances (such as co-occurrences and semantic similarity between locations).
  • New network methodologies that better incorporate spatio-temporal information and can reason in spatial, temporal, and semantic dimensions.
  • New network methodologies that better incorporate the diversity of actor and relationship types in the data.
  • New network methodologies for constructing edges from the data and for distributing actor and edge attributes onto the graph in ways that support novel analytic approaches.
  • New methods of identifying “influencers” and deriving physical influencer networks from media influencer networks. Calibration of the physical and media-derived pictures of society.
  • Expanding our understanding of critical media effects like Agenda Setting, Selection Bias, Media Fatigue, and Helicopter Journalism through cross-comparison of global discussion.
  • Multi-signal forecasting incorporating structural and latent dimensions.
  • Translating the quantitative methodologies needed to work with GDELT into the often qualitative research questions of many social sciences and humanities disciplines, and connecting these methodologies to the core theoretical foundations of those fields.
  • Testing hypotheses and models for social phenomena, including those developed during the age of information scarcity, on a larger scale using the GDELT data and proposing improvements to such models.
  • New visualization methodologies that support real-time and actionable interaction with the massive size and density of the GDELT data, especially approaches that allow both macro-level pattern identification and micro-level exploration within the same interface, and systems that incorporate spatial, temporal, thematic, and role-based filtering.
  • Forecasting and predictive modeling.

 

List of Submissions

Evaluation of Submissions

Submissions were evaluated on novelty, scientific rigor, quality of presentation, and expected impact on the SBP community by an interdisciplinary panel of judges who perform research in SBP or closely related areas.

  • Don Adjeroh, Professor, Lane Department of Computer Science and Electrical Engineering, West Virginia University
  • Nathan Bos, Senior Research Scientist, Applied Physics Laboratory, Johns Hopkins University
  • Jürgen Pfeffer, Assistant Research Professor, School of Computer Science, Carnegie Mellon University
  • Winter Mason, Data Scientist, Facebook

Winners of the SBP 2014 Grand Data Challenge

Friends in Joy and Sorrow: an Analysis of 2007-2012 Global Financial Crisis via Bayesian Nonparametric Dynamic Networks
Entry, Presentation Slides

Daniele Durante
University of Padua

David B. Dunson
Duke University

Word Cloud Durante & Dunson

Discovering Bilateral and Multilateral Causal Events in GDELT
Entry, Presentation Slides

Lei Jiang
Tapjoy, Inc.

Fan Mai
University of Virginia

Word Cloud Jiang & Mai