3 Challenges for behavioral research in the age of multimedia big data
By Soubhik Barari, Ph.D. student in Political Science and an affiliate of the Institute for Quantitative Social Science at Harvard University.
Today, each of us is capable of producing terabytes of highly personal data across very distinct media: text documents, video streams, geographic coordinates, and even waveforms capturing our every bodily pulse. Examining each of these media over a cross-section of millions (even billions) of people can unlock new discoveries about human behavior in microscopic resolution. We’re living in an age of multimedia big data, and social scientists are taking the helm of discovery.
To be exact, I define multimedia big data as “the study of societal and behavioral trends by combining large-scale insights across distinct media (e.g. text, audio, video, GPS, body sensors), each involving its own particular analytical methodology.” This definition is an offshoot of the National Science Foundation’s quasi-official 2012 definition of big data. I place deliberate emphasis on the medium, which, in practice, demands its own set of methodological tools separate from the analysis of the phenomenon it captures. Furthermore, in this definition I exclude instruments of up-front data collection such as surveys or experiments. No need to ask us questions about our activities; we’re automatically generating the answers ourselves!
Perhaps nowhere is the spirit of this multimedia big data revolution more apparent than in political science. For example, political scientists are currently designing a method to computationally classify “affect” from political speech. Others are developing computer vision techniques to extract visual cues communicated in political imagery. Some of my own work aims to evaluate governments’ information transparency by extracting textual evidence on their websites. Still others have introduced models of radio signal propagation that can be used to measure media broadcasting behavior. These lines of research all offer perspectives on how humans navigate diverse forms of political media in the 21st century; more than that, they offer other researchers the tools to coax, simplify, and parse the media itself into more interpretable and useful formats.
What broad challenges might researchers encounter in multimedia big data? I outline three especially important ones: ethical data collection, careful inference of causal relationships, and effective operationalization of research findings.
Ethically collecting high-resolution behavioral data. Many high-dimensional media can capture complex, nuanced patterns not usually within the reach of up-front instruments such as surveys. Such media can often reveal more about individuals than they themselves know. For example, your Instagram photos may be better at diagnosing depression than your own doctor. In other instances, such media can be reverse-engineered to identify private information. It is therefore not surprising that flagrant dissemination of personal data can cause public uproar.
In light of such breaches, securing public trust in big data research should be a high priority in both academia and industry. First, we must support institutions that regulate the harvesting of personal data. Where the Institutional Review Board (IRB) acts as this screening device for studies in academia, regulations like the GDPR in the E.U. will act as a similar institution to protect citizen “data rights” in the private sector. Second, practices that preserve individual-level privacy should be baked into data analysis. This includes de-identifying records and, when appropriate, removing variables like race or gender that might produce undesirable downstream bias. Lastly, teams of data scientists should hold themselves accountable to a set of guiding ethical principles. The Manifesto for Data Practices, signed online by thousands of data scientists, is one such synthesis of principles. O’Reilly Media’s recently launched “Doing good data science” series is another. Altogether, efforts at all three levels can help make big data research more equitable for its most important stakeholders: those whose behavior is being studied.
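As a minimal sketch of what de-identification can look like in practice, consider the plain-Python snippet below. The records, field names, and salt are entirely hypothetical; a real pipeline would also handle quasi-identifiers (like ZIP code) that can re-identify people in combination.

```python
import hashlib

# Hypothetical survey-style records -- every name and field here is illustrative.
records = [
    {"name": "Ada", "email": "ada@example.com", "zip": "02138", "response": 4},
    {"name": "Ben", "email": "ben@example.com", "zip": "02139", "response": 2},
]

SALT = "project-specific-secret"  # in practice, keep this out of version control

def deidentify(record, drop=("name", "email")):
    """Drop direct identifiers, keeping a salted one-way hash for linkage."""
    pseudonym = hashlib.sha256((SALT + record["email"]).encode()).hexdigest()[:12]
    cleaned = {k: v for k, v in record.items() if k not in drop}
    cleaned["subject_id"] = pseudonym  # stable across survey waves, not reversible
    return cleaned

deidentified = [deidentify(r) for r in records]
```

The same pattern extends to dropping or coarsening sensitive attributes (race, gender, fine-grained location) before they can feed downstream models.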
Developing methods to carefully analyze causality. It is not enough to know what predicts viral content sharing on social networks or democratization in authoritarian regimes. Finding causation, not just correlation, has become the gold standard in social science research. However, by using off-the-shelf methods to measure cause-and-effect in new media, one risks overstating effects.
To avoid this, when building models to interpret and represent different media, one should carefully identify moving pieces that can be exploited for causal inference. What are sources of true randomness in data generated from this particular medium? How could someone — or something — randomly transform targeted components of the data? Most importantly, for such interventions, how could effects on the treated parts of the data spill over into the untreated parts? These questions call for due diligence on the media of interest. For careful causal inference, social science researchers should qualitatively understand the technicalities of the medium, be it broadcast radio or video transmission. We need to understand the structure of media just as rigorously as the phenomena they capture.
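To make the first of those questions concrete, here is a toy simulation (all numbers invented) of why true randomness in treatment assignment licenses a simple difference-in-means estimate of a causal effect:

```python
import random

random.seed(0)

# Toy randomized intervention on a hypothetical medium: each "user" has a
# baseline engagement level; treatment adds a true effect of 2.0 units.
TRUE_EFFECT = 2.0
n = 10_000
baseline = [random.gauss(10, 3) for _ in range(n)]
treated = [random.random() < 0.5 for _ in range(n)]  # the source of true randomness
outcome = [b + (TRUE_EFFECT if t else 0.0) for b, t in zip(baseline, treated)]

def mean(xs):
    return sum(xs) / len(xs)

# Because assignment is random, treated and control groups are comparable,
# so the raw difference in mean outcomes estimates the causal effect.
ate_hat = (mean([y for y, t in zip(outcome, treated) if t])
           - mean([y for y, t in zip(outcome, treated) if not t]))
```

If treatment were instead correlated with baseline engagement (self-selection), the same difference in means would confound the effect with pre-existing differences, which is exactly the overstatement risk described above.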
Effectively operationalizing research findings. The end goal of big data research is often obvious. In industry, it is usually to improve products in service of their consumers; in academia, it is usually to inform society at large. What is not trivial is how to translate research insights into these end goals. For example, research on photo and video streams on a social media site might prompt platform managers to automate the flagging or removal of pernicious content. Studies of societal crises that use multimedia data feeds could be operationalized by deploying public alert systems that ingest real-time information and respond to probabilistically “tail-end” events. Insights about social bias could be disseminated via visually engaging web pamphlets. In practice, none of these is easy to do (well), and the complexity of the media involved only makes them harder.
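As a toy sketch of the alert-system idea (the readings and thresholds are entirely invented), a stream monitor might flag incoming values that fall far out in the tail of the historical distribution:

```python
import random
import statistics

random.seed(1)

# Invented historical readings from some real-time feed (e.g., a sensor or
# social-media signal), roughly normal around 100 with spread 10.
history = [random.gauss(100, 10) for _ in range(5_000)]
mu = statistics.mean(history)
sigma = statistics.stdev(history)

def is_tail_event(x, z=3.0):
    """Flag readings more than z standard deviations from the historical mean."""
    return abs(x - mu) > z * sigma

# Only the extreme reading should trigger an alert.
incoming = [98, 104, 170]
alerts = [x for x in incoming if is_tail_event(x)]
```

A deployed system would of course use a model suited to the medium (and heavier-tailed distributions), but the shape — historical baseline, threshold, real-time trigger — is the same.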
Engineering provides exactly the tools that can make this task easier for social scientists. When it comes to deploying production-grade databases, the Apache suite offers powerful options: there is Kafka, the real-time data streaming platform, and Presto, the multi-format database engine. Developing accessible and visually pleasing front-end websites is easier than ever thanks to lightweight Python libraries like Flask and web design frameworks like Bootstrap. Finally, to build interactive applications for external stakeholders to explore social science data, look no further than R’s Shiny package or JavaScript’s D3 library. In my own recent experience, I’ve built visualization dashboards for LobbyView, an open-source database of firm-level political lobbying (see associated paper, Kim and Kunisky 2018). From the parsed texts of nearly 1,000 lobbying reports, one can visualize the temporal and spatial dynamics of how Google lobbies the United States Congress. Many of these tools have a high technical barrier to entry, so, realistically, social scientists need not become experts themselves. They should, however, become familiar enough with such modern engineering tools to interface with, and even lead, teams of experts.
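To give a flavor of what “serving research data” amounts to, here is a minimal JSON endpoint in the spirit of a Flask microservice, written with only the Python standard library so it runs anywhere; the route and the report counts are made up for illustration.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical per-year counts of parsed lobbying reports -- illustrative only.
TOTALS = {"2016": 15, "2017": 18}

class ReportHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/api/reports":
            body = json.dumps(TOTALS).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):  # silence per-request console logging
        pass

# Serve on an OS-assigned port in a background thread and fetch the endpoint once.
server = HTTPServer(("127.0.0.1", 0), ReportHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]
with urllib.request.urlopen(f"http://127.0.0.1:{port}/api/reports") as resp:
    data = json.loads(resp.read())
server.shutdown()
```

In a real deployment, Flask (or a Shiny/D3 front end on top of it) would replace this hand-rolled handler, but the request/response shape that a dashboard consumes is exactly this.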
With so much diverse data to dig into, the future of quantitative social science is exciting, particularly for those studying the granularities of individual-level behavior. In doing so, we must make sure that this research is ethical, robust and ultimately useful. This demands that we not only look inward to our own social science subfields, but outward to tools in engineering disciplines. Although it will take effort to tackle these challenges, it is worth the result: bringing social science to a new frontier by learning from multimedia big data.
Soubhik Barari is a computational social scientist and an affiliate of the Institute for Quantitative Social Science at Harvard University. The views expressed here are his own.