The Tedium of Transcription: Who's Codifying the Process?
There is no right way of transcribing interviews or other types of recordings for research in the social sciences. If I were interviewed, I would definitely ask the researcher to remove all the ‘ers’ and ‘uhms’ from the transcript. This is exactly what a few researchers asked British sociologist Harry Collins to do after he transcribed entire conversations on gravitational wave detection verbatim. Collins subsequently decided to correct the broken English of certain Italian physicists in a debate they were having with American physicists, because in written form those expressions would greatly reduce the perceived validity of their arguments or their credibility as experts. Such decisions, along with removing or leaving in non-linguistic sounds like laughs and sighs, could influence both the direction of the study and its results.
Arguably, the easiest way to transcribe would be to leave in all the sounds, delaying any decision on ‘cleaning’ the data until the transcription is complete. This turns transcription into a relatively mechanical process: you write down everything you hear. Historically, this was the responsibility of the graduate student or junior researcher, almost like an induction process into the field. Although the work is extremely tedious, the school of thought for any in-depth qualitative research in the social sciences was, and still remains, that transcribing an interview yourself helps you understand the data better and pick up on insights and patterns.
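To illustrate what delaying the ‘cleaning’ decision can look like in practice, here is a minimal Python sketch. It assumes the verbatim transcript is plain text and that non-linguistic sounds are written in square brackets, such as [laughs]; the filler list, the bracket notation and the function name are all hypothetical choices, not a standard.

```python
import re

# Hypothetical filler list; a real project would agree on its own conventions.
FILLERS = ("er", "erm", "uh", "uhm", "um")

# A filler as a whole word, plus an optional trailing comma and the whitespace after it.
# Deliberately naive: hyphenated forms such as "uh-huh" are not handled.
_FILLER_RE = re.compile(r"\b(?:" + "|".join(FILLERS) + r")\b,?\s*", flags=re.IGNORECASE)

# Bracketed non-linguistic sounds such as [laughs] or [sighs] (an assumed notation).
_SOUND_RE = re.compile(r"\[[^\]]*\]\s*")


def clean_transcript(verbatim, drop_fillers=True, drop_sounds=False):
    """Return a cleaned copy of a verbatim transcript; the input is never modified."""
    text = verbatim
    if drop_fillers:
        text = _FILLER_RE.sub("", text)
    if drop_sounds:
        text = _SOUND_RE.sub("", text)
    # Collapse any double spaces left behind by the removals.
    return re.sub(r"\s{2,}", " ", text).strip()


verbatim = "Well, uhm, the detector er was [laughs] not ready."
print(clean_transcript(verbatim))
print(clean_transcript(verbatim, drop_sounds=True))
```

Because the verbatim text is kept and each cleaning choice is a separate switch, the decision about what to remove stays reversible and can be documented alongside the analysis.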
AI startups disrupting transcription
Tight for time and not very motivated by the meticulousness and length of this type of work, a few students at Dublin City University decided to automate the process. They quickly studied the latest developments in speech recognition and came up with HappyScribe. Two years later, HappyScribe supports more than 100 languages and can import and export multiple file formats. It is used by more than seventy thousand researchers and journalists. Trint, Verbit, Otter and Voicea have similar technology for transcription and are popular with both industry and academia, allowing for live transcriptions, notes and sometimes even producing a summary of the content. All of them have recently raised at least a few million dollars in funding, an indication that the size of this market and the demand for transcription are growing. If you are looking for an automated transcription tool, I’d start with Otter, because you get 600 minutes per month with the free plan. For something a bit more sophisticated, use Verbit, as it has a multi-proofing layer and does something quite extraordinary: you can upload your own dictionary, which helps when, for example, your interviews are filled with jargon.
Transcribe by Wreally Studios (which has around 200 citations already) is another tool in this space developing its algorithms for speech recognition. Its website claims to convert audio to text in minutes, and you also have the option to do your own transcribing with its integrated media player. Similarly, established qualitative analysis software solutions like NVIVO from QSR International are also adding features to automatically transcribe, search and annotate the transcript. NVIVO transcription supports 26 languages and a variety of English accents, while also being GDPR and HIPAA compliant. Silvana di Gregorio described the features of NVIVO transcription and what it can do for researchers in a recent blog post.
If you want to crowdsource your transcribing project, TranscribeMe (about 300 citations) would be your best bet. Alternatively, Dictate2us (about 15 articles currently cite it) works with 300k transcribers, similar to the popular Rev Voice recorder, which also employs professional transcribers and is available as an app that you can use to record your interviews. Or if you want an app that lets you transcribe more seamlessly yourself, then oTranscribe would do, and it’s also free.
Challenges
The market for academic researchers transcribing interviews is small, or at least not big enough to drive the explosion of well-funded startups focusing on transcription. Rather, the development of these tools is driven by two big trends: a desire to improve the accessibility of video and audio material, for example by captioning lectures in academia; and a steady move towards voice, with the number of Amazon Echos deployed doubling in under four years, and towards video, with 300 hours uploaded every minute on YouTube alone.
Thus, speech recognition developed by the big tech companies that host and record these large volumes of audio and video has gotten really good at long blocks of monologues, or conversations between two people. They’ve dealt with background noise and even improved on accents and dialects. As such, even bootstrapping your transcription in Google Docs (for free), or with Microsoft’s speech recognition services (included with an Azure subscription) is almost on par with a human transcriber and saves time and effort when transcribing a handful of interviews.
There is a catch though: the algorithms still need to improve, and they do so with more data. Recent investigations into the practices of Amazon, Apple, Google and Microsoft only confirm something we already knew – human transcribers are still employed. The big challenge when using automated transcription services thus has to do with data protection, especially when your interviews contain sensitive and person-identifiable data and there is little if any indication of what happens to your files or the transcribed output once you’re done. Whether you end up employing your graduate students, crowd-sourcing your transcription, or fully automating it, your audio files could be listened to by third parties not involved in the interview, and that is a point to include on your consent forms.
More recently, a growing number of speech recognition and transcription startups like Voicea are serving commercial uses, such as recording meetings within workplaces. This drives improvements in transcriptions for focus-group-type research or where multiple people are involved and it is harder to reliably track individuals in conversations. This also remains a challenge, though a smaller one, for professional transcribers (or even graduate students), who would normally slow down the speech or listen to the recording multiple times.
Although many of the current automated transcription tools cover more than 50 languages and a variety of English dialects, accents and even speech impediments, they are useless when your interviewees mix English with another language. Moreover, most tools still struggle to reliably predict the words they cannot clearly detect, which would make those words easier for you to find and edit. And of course the quality of the sound is a challenge for both human and robot-transcribers, but more so for the latter.
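As a rough illustration of how uncertain words could be surfaced for editing, the Python sketch below marks every word whose confidence falls under a threshold. The input format (a list of word and confidence pairs), the threshold value and the [?word?] marker are assumptions made for this example; they do not describe the output of any particular tool.

```python
def mark_uncertain(words, threshold=0.8):
    """Join words into a sentence, wrapping low-confidence ones in [?...?].

    `words` is a list of (word, confidence) pairs with confidence between 0 and 1.
    This input format is hypothetical, chosen only for the example.
    """
    marked = []
    for word, confidence in words:
        marked.append(f"[?{word}?]" if confidence < threshold else word)
    return " ".join(marked)


recognised = [("the", 0.99), ("interferometer", 0.42), ("works", 0.95)]
print(mark_uncertain(recognised))  # the [?interferometer?] works
```

A reviewer could then search the transcript for the marker and check only those passages against the audio.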
Looking ahead at research with video and audio recordings
In 1976, computer scientist Raj Reddy predicted that a speech recognition algorithm would be built within 10 years. It took longer than that, but the algorithms available today are on par with human transcribers, with error rates that keep falling and a cost-to-performance ratio that keeps improving. With these tools, researchers can transcribe large audio corpora, while also turning the dial up on the detail – anything from a verbatim transcription, with a focus on both words and non-textual elements, to the edited and polished text outputs that focus on content, converting speech to well-written and intelligible text, with some tools offering bonus options to create summaries and highlights.
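The error rate of a speech recognition system is commonly reported as the word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the automatic transcript into a reference transcript, divided by the number of words in the reference. The Python sketch below is a minimal version under simple assumptions (lowercasing and whitespace splitting, no handling of punctuation); the function name is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between two transcripts, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis (deletion)
                curr[j - 1] + 1,     # extra word in hypothesis (insertion)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


human = "the waves were detected in september"
machine = "the waves was detected september"
print(round(word_error_rate(human, machine), 3))  # 0.333
```

Here one substitution ("were" to "was") and one deletion ("in") against six reference words give a WER of 2/6. Comparing a small, hand-checked sample of your own recordings this way is one way to judge whether a tool is good enough for your material.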
Silvana di Gregorio of QSR International remarks that “automatic transcription is taking the ‘heavy lifting’ out of transcribing – allowing the researcher more time to reflect and start the analysis process as they review the transcript (while hearing the audio)”. And indeed, all tools make editing and annotation available directly on their platform, a feature that is commonly used by social science researchers for further and more in-depth analysis. However, researchers always return to the audio file, listening again and again, while also annotating the transcribed output in incredible detail.
Yet the question remains: how much do we need to focus on transcription? Puzzled by the level of detail that social scientists currently require when transcribing interviews, Collins, the same British sociologist who edited the broken English of Italian physicists, ran an empirical study and demonstrated that, in written form, non-verbal elements “indicate far more uncertainty in the speaker when transmitted in text than when transmitted in recorded sound.”
In a recent book (2019) that further inquires into the practices and methods of sociology, Collins argues that “knowing what counts as a good illustration or convincing exhibit requires interpretative skill—it cannot be done by counting or classifying words or utterances.” Nicholas Loubere, an interdisciplinary social scientist, would agree. In fact, he developed a new method which he calls Systematic and Reflexive Interviewing and Reporting, or SRIR for short. Using this method, the researcher skips the transcription process altogether and instead conducts the interview alongside a colleague. Together, they take the time right after the interview to engage in a reflexive conversation and pick out the key content there and then, writing reports independently afterwards. Perhaps the real disruptors of the transcription process are not the automated transcription tools, but rather new methods and tools that enable researchers to annotate video and audio files directly and engage in further in-depth analysis of the audio corpora.