What do researchers need to know about using datasets?
By Janet Salmons, PhD, Research Community Manager, Sage Methodspace
Want to learn more about research with datasets? This curated collection of open-access articles can help you understand the defining characteristics of Big Data and develop the data literacy skills needed to work with large datasets and the machine learning tools used to manage Big Data sources.
Characteristics of Big Data
Kitchin, R., & McArdle, G. (2016). What makes Big Data, Big Data? Exploring the ontological characteristics of 26 datasets. Big Data & Society. https://doi.org/10.1177/2053951716631130
Big Data has been variously defined in the literature. In the main, definitions suggest that Big Data possess a suite of key traits: volume, velocity and variety (the 3Vs), but also exhaustivity, resolution, indexicality, relationality, extensionality and scalability. However, these definitions lack ontological clarity, with the term acting as an amorphous, catch-all label for a wide selection of data. In this paper, we consider the question ‘what makes Big Data, Big Data?’, applying Kitchin’s taxonomy of seven Big Data traits to 26 datasets drawn from seven domains, each of which is considered in the literature to constitute Big Data. The results demonstrate that only a handful of datasets possess all seven traits, and some do not possess either volume and/or variety. Instead, there are multiple forms of Big Data. Our analysis reveals that the key definitional boundary markers are the traits of velocity and exhaustivity. We contend that Big Data as an analytical category needs to be unpacked, with the genus of Big Data further delineated and its various species identified. It is only through such ontological work that we will gain conceptual clarity about what constitutes Big Data, formulate how best to make sense of it, and identify how it might be best used to make sense of the world.
Lupton, D. (2018). How do data come to matter? Living and becoming with personal data. Big Data & Society. https://doi.org/10.1177/2053951718786314
Humans have become increasingly datafied with the use of digital technologies that generate information with and about their bodies and everyday lives. The onto-epistemological dimensions of human–data assemblages and their relationship to bodies and selves have yet to be thoroughly theorised. In this essay, I draw on key perspectives espoused in feminist materialism, vital materialism and the anthropology of material culture to examine the ways in which these assemblages operate as part of knowing, perceiving and sensing human bodies. I draw particularly on scholarship that employs organic metaphors and concepts of vitality, growth, making, articulation, composition and decomposition. I show how these metaphors and concepts relate to and build on each other, and how they can be applied to think through humans’ encounters with their digital data. I argue that these theoretical perspectives work to highlight the material and embodied dimensions of human–data assemblages as they grow and are enacted, articulated and incorporated into everyday lives.
Resnyansky, L. (2019). Conceptual frameworks for social and cultural Big Data analytics: Answering the epistemological challenge. Big Data & Society. https://doi.org/10.1177/2053951718823815
This paper aims to contribute to the development of tools to support an analysis of Big Data as manifestations of social processes and human behaviour. Such a task demands both an understanding of the epistemological challenge posed by the Big Data phenomenon and a critical assessment of the offers and promises coming from the area of Big Data analytics. This paper draws upon the critical social and data scientists’ view on Big Data as an epistemological challenge that stems not only from the sheer volume of digital data but, predominantly, from the proliferation of the narrow-technological and the positivist views on data. Adoption of the social-scientific epistemological stance presupposes that digital data was conceptualised as manifestations of the social. In order to answer the epistemological challenge, social scientists need to extend the repertoire of social scientific theories and conceptual frameworks that may inform the analysis of the social in the age of Big Data. However, an ‘epistemological revolution’ discourse on Big Data may hinder the integration of the social scientific knowledge into the Big Data analytics.
Stewart, R. (2021). Big data and Belmont: On the ethics and research implications of consumer-based datasets. Big Data & Society, 8(2). https://doi.org/10.1177/20539517211048183
Consumer-based datasets are the products of data brokerage firms that agglomerate millions of personal records on the adult US population. This big data commodity is purchased by both companies and individual clients for purposes such as marketing, risk prevention, and identity searches. The sheer magnitude and population coverage of available consumer-based datasets and the opacity of the business practices that create these datasets pose emergent ethical challenges within the computational social sciences that have begun to incorporate consumer-based datasets into empirical research. To directly engage with the core ethical debates around the use of consumer-based datasets within social science research, I first consider two case study applications of consumer-based dataset-based scholarship. I then focus on three primary ethical dilemmas within consumer-based datasets regarding human subject research, participant privacy, and informed consent in conversation with the principles of the seminal Belmont Report.
Data Literacy
Corrall, S. (2019). Repositioning data literacy as a mission-critical competence. In ACRL 2019: Recasting the Narrative, April 10-13, 2019, Cleveland, OH.
With data rapidly replacing information as the currency of research, business, government, and healthcare, is it time for librarians to make data literacy central to their professional mission, take on roles as interdisciplinary mediators, and lead the data literacy movement on campus? Join the data literacy debate and discuss what librarians can do to cut across the disciplinary and professional silos now threatening the development of lifewide data literacy. Investigate and critique diverse conceptions and pedagogies for data literacy, and experiment with the MAW theory of stakeholder saliency to identify individuals and groups to target in your data literacy initiatives.
Gray, J., Gerlitz, C., & Bounegru, L. (2018). Data infrastructure literacy. Big Data & Society. https://doi.org/10.1177/2053951718786316
A recent report from the UN makes the case for “global data literacy” in order to realise the opportunities afforded by the “data revolution”. Here and in many other contexts, data literacy is characterised in terms of a combination of numerical, statistical and technical capacities. In this article, we argue for an expansion of the concept to include not just competencies in reading and working with datasets but also the ability to account for, intervene around and participate in the wider socio-technical infrastructures through which data is created, stored and analysed – which we call “data infrastructure literacy”. We illustrate this notion with examples of “inventive data practice” from previous and ongoing research on open data, online platforms, data journalism and data activism. Drawing on these perspectives, we argue that data literacy initiatives might cultivate sensibilities not only for data science but also for data sociology, data politics as well as wider public engagement with digital data infrastructures. The proposed notion of data infrastructure literacy is intended to make space for collective inquiry, experimentation, imagination and intervention around data in educational programmes and beyond, including how data infrastructures can be challenged, contested, reshaped and repurposed to align with interests and publics other than those originally intended.
Koltay, T. (2017). Data literacy for researchers and data librarians. Journal of Librarianship and Information Science, 49(1), 3–14. https://doi.org/10.1177/0961000615616450
This paper describes data literacy and emphasizes its importance. Data literacy is vital for researchers who need to become data literate science workers and also for (potential) data management professionals. Its important characteristic is a close connection and similarity to information literacy. To support this argument, a review of literature was undertaken on the importance of data, the data-intensive paradigm of scientific research, researchers’ expected and real behaviour, the nature of research data management, the possible roles of the academic library, data quality and data citation. Besides describing the nature of data literacy and enumerating the related skills, the application of phenomenographic approaches to data literacy and its relationship to the digital humanities have been identified as subjects for further investigation.
Nguyen, D. (2021). Mediatisation and datafication in the global COVID-19 pandemic: on the urgency of data literacy. Media International Australia, 178(1), 210–214. https://doi.org/10.1177/1329878X20947563
In the COVID-19 pandemic, societal discourses and social interaction are subject to rapid mediatisation and digitalisation, which accelerate datafication. This indicates urgency for increasing data literacy: individual abilities in understanding and critically assessing datafication and its social implications. Immediate challenges concern misconceptions about the crisis, data misuses, widening (social) divides and (new) data biases. Citizens need to be on guard in respect to the crisis’ impact on the next stages of the digital transformation.
Poirier, L. (2021). Reading datasets: Strategies for interpreting the politics of data signification. Big Data & Society, 8(2). https://doi.org/10.1177/20539517211029322
All datasets emerge from and are enmeshed in power-laden semiotic systems. While emerging data ethics curriculum is supporting data science students in identifying data biases and their consequences, critical attention to the cultural histories and vested interests animating data semantics is needed to elucidate the assumptions and political commitments on which data rest, along with the externalities they produce. In this article, I introduce three modes of reading that can be engaged when studying datasets—a denotative reading (extrapolating the literal meaning of values in a dataset), a connotative reading (tracing the socio-political provenance of data semantics), and a deconstructive reading (seeking what gets Othered through data semantics and structure). I then outline how I have taught students to engage these methods when analyzing three datasets in Data and Society—a course designed to cultivate student competency in politically aware data analysis and interpretation. I show how combined, the reading strategies prompt students to grapple with the double binds of perceiving contemporary problems through systems of representation that are always situated, incomplete, and inflected with diverse politics. While I introduce these methods in the context of teaching, I argue that the methods are integral to any data practice in the conclusion.
Machine Learning Tools
Denton, E., Hanna, A., Amironesei, R., Smart, A., & Nicole, H. (2021). On the genealogy of machine learning datasets: A critical history of ImageNet. Big Data & Society, 8(2). https://doi.org/10.1177/20539517211035955
In response to growing concerns of bias, discrimination, and unfairness perpetuated by algorithmic systems, the datasets used to train and evaluate machine learning models have come under increased scrutiny. Many of these examinations have focused on the contents of machine learning datasets, finding glaring underrepresentation of minoritized groups. In contrast, relatively little work has been done to examine the norms, values, and assumptions embedded in these datasets. In this work, we conceptualize machine learning datasets as a type of informational infrastructure, and motivate a genealogy as method in examining the histories and modes of constitution at play in their creation. We present a critical history of ImageNet as an exemplar, utilizing critical discourse analysis of major texts around ImageNet’s creation and impact. We find that assumptions around ImageNet and other large computer vision datasets more generally rely on three themes: the aggregation and accumulation of more data, the computational construction of meaning, and making certain types of data labor invisible. By tracing the discourses that surround this influential benchmark, we contribute to the ongoing development of the standards and norms around data development in machine learning and artificial intelligence research.
Fournier-Tombs, E., & MacKenzie, M. K. (2021). Big data and democratic speech: Predicting deliberative quality using machine learning techniques. Methodological Innovations. https://doi.org/10.1177/20597991211010416
This article explores techniques for using supervised machine learning to study discourse quality in large datasets. We explain and illustrate the computational techniques that we have developed to facilitate a large-scale study of deliberative quality in Canada’s three northern territories: Yukon, Northwest Territories, and Nunavut. This larger study involves conducting comparative analyses of hundreds of thousands of parliamentary speech acts since the creation of Nunavut 20 years ago. Without computational techniques, we would be unable to conduct such an ambitious and comprehensive analysis of deliberative quality. The purpose of this article is to demonstrate the machine learning techniques that we have developed with the hope that they might be used and improved by other communications scholars who are interested in conducting textual analyses using large datasets. Other possible applications of these techniques might include analyses of campaign speeches, party platforms, legislation, judicial rulings, online comments, newspaper articles, and television or radio commentaries.
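To make the approach described above concrete, here is a minimal, hypothetical sketch (not the authors' code) of how a supervised text classifier might be trained on hand-coded speech acts and then applied to a much larger corpus. The labels, example speeches, and the TF-IDF plus logistic regression pipeline are illustrative assumptions, not the study's actual model.

```python
# Hypothetical sketch: a minimal supervised classifier for coding speech acts.
# The training examples and the "respectful" label are invented for illustration.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny invented training set: speech acts hand-coded as respectful (1) or not (0).
speeches = [
    "I thank the honourable member for raising this important concern.",
    "That suggestion is nonsense and a waste of this assembly's time.",
    "We welcome the committee's careful review of the proposed budget.",
    "The member opposite clearly has no idea what they are talking about.",
]
labels = [1, 0, 1, 0]

# Fit a simple TF-IDF + logistic regression pipeline on the hand-coded sample.
model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(speeches, labels)

# Apply the trained classifier to unlabelled speech acts at scale.
new_speeches = ["I appreciate the minister's thoughtful answer."]
print(model.predict(new_speeches))
```

In practice, a study of this kind would train on thousands of hand-coded examples and validate the classifier against held-out human coding before scoring hundreds of thousands of speeches.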
Gray, J. E., & Suzor, N. P. (2020). Playing with machines: Using machine learning to understand automated copyright enforcement at scale. Big Data & Society. https://doi.org/10.1177/2053951720919963
This article presents the results of methodological experimentation that utilises machine learning to investigate automated copyright enforcement on YouTube. Using a dataset of 76.7 million YouTube videos, we explore how digital and computational methods can be leveraged to better understand content moderation and copyright enforcement at a large scale. We used the BERT language model to train a machine learning classifier to identify videos in categories that reflect ongoing controversies in copyright takedowns. We use this to explore, in a granular way, how copyright is enforced on YouTube, using both statistical methods and qualitative analysis of our categorised dataset. We provide a large-scale systematic analysis of removal rates from Content ID’s automated detection system and the largely automated, text search based, Digital Millennium Copyright Act notice and takedown system. These are complex systems that are often difficult to analyse, and YouTube only makes available data at high levels of abstraction. Our analysis provides a comparison of different types of automation in content moderation, and we show how these different systems play out across different categories of content. We hope that this work provides a methodological base for continued experimentation with the use of digital and computational methods to enable large-scale analysis of the operation of automated systems.
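As a rough illustration of the kind of BERT-based classification the abstract describes, the sketch below embeds video metadata text with a pretrained BERT model and trains a lightweight classifier on those embeddings. This is a hypothetical, simplified setup assuming the Hugging Face transformers and scikit-learn libraries; the category labels and example titles are invented and the study's actual pipeline may differ.

```python
# Hypothetical sketch: embed video titles with pretrained BERT, then train a
# small classifier on the embeddings. Labels and examples are invented.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    """Mean-pool BERT's last hidden states into one vector per text."""
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = bert(**enc).last_hidden_state        # (batch, tokens, hidden)
    mask = enc["attention_mask"].unsqueeze(-1)     # ignore padding tokens
    return ((out * mask).sum(1) / mask.sum(1)).numpy()

# Invented hand-labelled examples: 1 = commentary/critique, 0 = other.
titles = [
    "Movie review and critique of the final scene",
    "Full album upload high quality",
    "Reaction and analysis of the trailer",
    "Complete episode no cuts",
]
labels = [1, 0, 1, 0]

clf = LogisticRegression(max_iter=1000).fit(embed(titles), labels)
print(clf.predict(embed(["Commentary on the new music video"])))
```

At research scale, such a classifier would be fine-tuned or trained on far larger labelled samples and evaluated carefully before being applied across tens of millions of videos.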
Hansen, K. B. (2020). The virtue of simplicity: On machine learning models in algorithmic trading. Big Data & Society. https://doi.org/10.1177/2053951720926558
Machine learning models are becoming increasingly prevalent in algorithmic trading and investment management. The spread of machine learning in finance challenges existing practices of modelling and model use and creates a demand for practical solutions for how to manage the complexity pertaining to these techniques. Drawing on interviews with quants applying machine learning techniques to financial problems, the article examines how these people manage model complexity in the process of devising machine learning-powered trading algorithms. The analysis shows that machine learning quants use Ockham’s razor – things should not be multiplied without necessity – as a heuristic tool to prevent excess model complexity and secure a certain level of human control and interpretability in the modelling process. I argue that understanding the way quants handle the complexity of learning models is a key to grasping the transformation of the human’s role in contemporary data and model-driven finance. The study contributes to social studies of finance research on the human–model interplay by exploring it in the context of machine learning model use.
Jacobsen, B. N. (2023). Machine learning and the politics of synthetic data. Big Data & Society, 10(1). https://doi.org/10.1177/20539517221145372
Machine-learning algorithms have become deeply embedded in contemporary society. As such, ample attention has been paid to the contents, biases, and underlying assumptions of the training datasets that many algorithmic models are trained on. Yet, what happens when algorithms are trained on data that are not real, but instead data that are ‘synthetic’, not referring to real persons, objects, or events? Increasingly, synthetic data are being incorporated into the training of machine-learning algorithms for use in various societal domains. There is currently little understanding, however, of the role played by and the ethicopolitical implications of synthetic training data for machine-learning algorithms. In this article, I explore the politics of synthetic data through two central aspects: first, synthetic data promise to emerge as a rich source of exposure to variability for the algorithm. Second, the paper explores how synthetic data promise to place algorithms beyond the realm of risk. I propose that an analysis of these two areas will help us better understand the ways in which machine-learning algorithms are envisioned in the light of synthetic data, but also how synthetic training data actively reconfigure the conditions of possibility for machine learning in contemporary society.
Jaton, F. (2021). Assessing biases, relaxing moralism: On ground-truthing practices in machine learning design and application. Big Data & Society. https://doi.org/10.1177/20539517211013569
This theoretical paper considers the morality of machine learning algorithms and systems in the light of the biases that ground their correctness. It begins by presenting biases not as a priori negative entities but as contingent external referents—often gathered in benchmarked repositories called ground-truth datasets—that define what needs to be learned and allow for performance measures. I then argue that ground-truth datasets and their concomitant practices—that fundamentally involve establishing biases to enable learning procedures—can be described by their respective morality, here defined as the more or less accounted experience of hesitation when faced with what pragmatist philosopher William James called “genuine options”—that is, choices to be made in the heat of the moment that engage different possible futures. I then stress three constitutive dimensions of this pragmatist morality, as far as ground-truthing practices are concerned: (I) the definition of the problem to be solved (problematization), (II) the identification of the data to be collected and set up (databasing), and (III) the qualification of the targets to be learned (labeling). I finally suggest that this three-dimensional conceptual space can be used to map machine learning algorithmic projects in terms of the morality of their respective and constitutive ground-truthing practices. Such techno-moral graphs may, in turn, serve as equipment for greater governance of machine learning algorithms and systems.
Thylstrup, N. B., Hansen, K. B., Flyverbom, M., & Amoore, L. (2022). Politics of data reuse in machine learning systems: Theorizing reuse entanglements. Big Data & Society, 9(2). https://doi.org/10.1177/20539517221139785
Policy discussions and corporate strategies on machine learning are increasingly championing data reuse as a key element in digital transformations. These aspirations are often coupled with a focus on responsibility, ethics and transparency, as well as emergent forms of regulation that seek to set demands for corporate conduct and the protection of civic rights. Protective measures include methods of traceability and assessments of ‘good’ and ‘bad’ datasets and algorithms that are considered to be traceable, stable and contained. However, these ways of thinking about both technology and ethics obscure a fundamental issue, namely that machine learning systems entangle data, algorithms and more-than-human environments in ways that challenge a well-defined separation. This article investigates the fundamental fallacy of most data reuse strategies, as well as their regulation and mitigation strategies: that data can somehow be followed, contained and controlled in machine learning processes. Instead, the article argues that we need to understand the reuse of data as an inherently entangled phenomenon. To examine this tension between the discursive regimes and the realities of data reuse, we advance the notion of reuse entanglements as an analytical lens. The main contribution of the article is the conceptualization of reuse that places entanglements at its core and the articulation of its relevance using empirical illustrations. This is important, we argue, for our understanding of the nature of data and algorithms, for the practical uses of data and algorithms and our attitudes regarding ethics, responsibility and regulation.