Collecting social media data for research

By Jason Radford, CEO of Volunteer Science and visiting researcher at Northeastern University.

Human social behavior has rapidly shifted to digital media services, whether Gmail for email, Skype for phone calls, Twitter and Facebook for micro-blogging, or WhatsApp and SMS for private messaging. This digitalization of social life offers researchers an unprecedented wealth of data with which to study human life and social systems. However, accessing this data has become increasingly difficult.

Concerns about privacy have led to strong government regulations around how data can be accessed and stored. Digital media companies are erecting barriers to data in response to these new regulations as well as scandals involving data sharing and data breaches. Today, access to application programming interfaces (APIs) is becoming increasingly curated and the data available via APIs is shrinking. Gone are the days when these companies offered their data for free to the public through an open API.

How can researchers use social media data in their research? Full disclosure: I run the online study and subject recruitment platform Volunteer Science. I've been collecting digital data online for over a decade and have seen the ecosystem of digital platforms, APIs, and regulations evolve first-hand. We've developed a solution for collecting digital data, but it's not right for everyone. My purpose here is to help you figure out which approach is right for you.

The manual approach

The new regulations erected by companies, governments, and universities' institutional review boards largely relate to automated access to data. Companies' terms of service specifically regulate what automated computer scripts may access. For example, Facebook's API no longer provides the comments on a user's post. This is a matter of technocratic policy: Facebook does not want to provide automated access to this data. However, you are still largely free to record the same data manually.

As a researcher, you are free to observe and record the comments on posts, assuming you have permission from the users. Observing and recording interactions on Facebook is no different from taking notes on interactions in a school, bar, or office. For private data like personal messages, emails, or private social media posts, you need the consent of the users involved. For public data, like most Tweets, you are free to make observations as if you were standing on the street or in any other public space. In either case, you are still bound by the privacy policies of the digital platform as well as the policies of your institutional review board.

If you want to collect digital information, be clear about whether you actually need an automated approach to data collection. Automation is certainly faster than collecting by hand if you are gathering data in bulk, and APIs can provide metadata that is not easy to observe directly, like user ID numbers. However, it may not be worth the administrative and technical headaches: you have to get permission from these companies, write your own automated tool, and clean and analyze the data. Moreover, there is substantial value in observing human social interaction directly rather than through the lens of an algorithm.

The automated approach

If you want to collect digital data using an automated system, there are three ways to go about it. The first involves using web services like Volunteer Science or ExportTweet, which provide the technical and legal tools to access data legitimately. Second, you can build your own tool to access the APIs. Finally, and most controversially, you can scrape raw content from websites using programming tools like Selenium.

Web services

Web services provide user-friendly access to APIs. At Volunteer Science, you can request access to and download users' Facebook photos or Twitter posts with a single button. ExportTweet lets you enter a Twitter handle and outputs the text, videos, and images from all of that user's Tweets.

With web services, you don’t need to worry about applying for access to an API. For example, Facebook requires you to create videos demonstrating how your app will use each piece of data you request access to. You also don’t have to worry about keeping up with the latest standards. Many services began to require apps to use OAuth authentication and accept cookies before there were good, well-documented libraries for implementing them. Finally, these services build in consent and transparency. When you request access to someone’s data, they see the request and can sign off on it, providing you with proof of consent.

However, there are several shortcomings to web services. They may not connect to the website you need, and they may not provide access to the data you want. Sometimes they are not scalable, meaning they might be difficult to use if you are looking to scrape public data from thousands or hundreds of thousands of users. They also cost money. ExportTweet charges $29.99* to download the Tweets from a single account. Volunteer Science includes the social media service as part of its base package, which costs $29/month. If you do not have funding, want data that is not served by a web service, or want to collect a lot of data for which you do not need users' permission, the next best option is to build the tool yourself.

Do it yourself

Programming languages like Python and R, and scientific software like SAS, offer packages for interacting with APIs, including libraries for most major digital platforms. For example, Tweepy for Python and twitteR for R have become standard tools for downloading Twitter data.

The strength of using programmable software like Python and SAS to build your own data collector is that it enables you to collect lots of data quickly. For example, I maintain a list of Twitter handles for members of the United States Congress and use Python to regularly fetch their latest Tweets.
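As an illustration, here is a minimal sketch of that kind of collector using Tweepy's 3.x interface, which was current when this piece was written. The API keys are placeholders you would obtain by registering an app with Twitter, and the handles are hypothetical stand-ins rather than the actual list described above.

```python
import tweepy

# Placeholder credentials -- obtained by registering an app with Twitter
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

# wait_on_rate_limit pauses the script when Twitter's rate limits are hit
api = tweepy.API(auth, wait_on_rate_limit=True)

# Hypothetical handles standing in for a real list of accounts
handles = ["RepExampleOne", "SenExampleTwo"]

for handle in handles:
    # user_timeline returns the most recent Tweets from one account
    for tweet in api.user_timeline(screen_name=handle, count=200,
                                   tweet_mode="extended"):
        print(tweet.id, tweet.created_at, tweet.full_text)
```

Run on a schedule (with a cron job, say), a script like this accumulates each account's latest Tweets over time; in practice you would write the results to a file or database rather than print them.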

The downside, as I've mentioned, is that you generally have to apply for access to APIs and maintain your code as APIs and standards change. You also have to know how to program in one of these languages, a skill that not many social scientists yet have.

Web-scraping

Most websites do not have APIs for accessing their data. In that case, you have to use web-scraping, which involves writing a computer program to fetch data from a web address and parse its content into structured data. Many websites serve information statically, meaning you get the whole webpage when you request a URL; most programming languages let you fetch this raw HTML directly. However, most social media websites serve information dynamically, showing you more content as you scroll, swipe, or tap. For these, you must use special tools like Selenium that simulate those actions. In either case, once you have the raw HTML, you need to parse it with a library that can interpret HTML, like Beautiful Soup.
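For the static case, a minimal sketch of the fetch-and-parse workflow using the requests and Beautiful Soup libraries might look like this; the URL is a placeholder, and what you extract will depend on the structure of the page you are scraping.

```python
import requests
from bs4 import BeautifulSoup

# Fetch a statically served page (placeholder URL)
response = requests.get("https://example.com")
response.raise_for_status()  # stop early if the request failed

# Parse the raw HTML into a navigable tree
soup = BeautifulSoup(response.text, "html.parser")

# Pull out structured data -- here, the text and target of every link
for link in soup.find_all("a"):
    print(link.get_text(strip=True), link.get("href"))
```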

Web-scraping offers you the most power and flexibility and can allow you to put together datasets that are otherwise impossible to create. For example, you cannot gather all of the replies to a Tweet through the API, because it provides no way of searching for who replied to a Tweet. You must either have access to the entirety of Twitter's data or build a web-scraper that simulates clicking through all the replies.
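The dynamic case is where Selenium comes in. The sketch below, with a placeholder URL, shows the generic pattern of scrolling until no new content loads; a real site like Twitter adds complications such as logins, changing page structure, and anti-automation measures, so treat this as the skeleton of the technique rather than a working Tweet-reply scraper.

```python
import time
from selenium import webdriver

# Launch a browser Selenium can drive (requires geckodriver installed)
driver = webdriver.Firefox()
driver.get("https://example.com/dynamic-page")  # placeholder URL

# Scroll repeatedly so the page loads more content, as a user would
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the page time to fetch and render new items
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:  # nothing new appeared; we are done
        break
    last_height = new_height

# Hand the fully rendered HTML to a parser such as Beautiful Soup
html = driver.page_source
driver.quit()
```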

Web-scraping also comes with the highest degree of complexity and risk. It is often explicitly prohibited in websites' terms of service. You, as the researcher, are responsible for ensuring you are permitted to access the data you are collecting, and you must verify that the code you build collects the data you think it does.

Wrapping up

There are many ways to collect the social data people generate on a daily basis. Once you have your empirical question in mind, you still must consider the technical, ethical, legal, and financial dimensions of answering that question. Hopefully you can use this guide to navigate these issues and find the approach that’s right for you.

*Prices are correct as of April 2019

About

Jason Radford is the CEO of Volunteer Science, a visiting researcher at Northeastern University, and a PhD candidate in sociology at the University of Chicago. He develops computational methods for social research with a focus on networks, teams, and organizations.
