GailBot: The Challenges of Internationalizing, Scaling Up, and Working With Other People’s Data
GailBot is a new transcription software system that combines automatic speech recognition (ASR), machine learning (ML) models, and research-informed heuristics to capture often-overlooked paralinguistic features such as overlaps, sound stretches, speed shifts, and laughter.
Over the last six months, the GailBot development team (based at Tufts University, Boston, and Loughborough University, UK), with the support of the Sage Concept Grant, has been rebuilding GailBot into something that can be rolled out to researchers more widely. GailBot has a new graphical user interface (GUI), a range of new features based on the needs of our beta users, and (crucially) a new plugin architecture that should allow GailBot to scale up and grow sustainably as ML models, annotated training data, and ML tools continue to become ever more accessible for researchers.
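To give a flavour of what we mean by a plugin architecture, here is a minimal sketch in Python. The class names, method signatures, and the toy overlap-marking logic are illustrative assumptions rather than GailBot's actual API: the idea is simply that each plugin consumes the ASR output (plus any upstream plugin results) and returns its own annotations, so new models and heuristics can be slotted in without touching the core.

```python
# Hypothetical sketch of a GailBot-style plugin interface (not the actual API).
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class Plugin(ABC):
    """Base class for a transcription post-processing plugin."""

    name: str = "base"
    depends_on: List[str] = []  # names of plugins whose output this one needs

    @abstractmethod
    def apply(self, utterances: List[Dict[str, Any]],
              upstream: Dict[str, Any]) -> Any:
        """Annotate the word-level utterance list and return the result."""


class OverlapPlugin(Plugin):
    """Toy example: mark words whose time spans overlap across speakers."""

    name = "overlaps"

    def apply(self, utterances, upstream):
        marked = []
        for i, word in enumerate(utterances):
            overlaps = any(
                other["speaker"] != word["speaker"]
                and other["start"] < word["end"]
                and word["start"] < other["end"]
                for j, other in enumerate(utterances) if j != i
            )
            marked.append({**word, "overlap": overlaps})
        return marked
```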
In this blog post, we report on the team’s recent week-long hack session at UCLA, outline how we are adapting GailBot in response to what users have told us, and describe our longer-term aims to scale GailBot up.
A Hack-session at the Birthplace of Conversation Analysis
In April 2024, five members of the GailBot team (Muhammad Umair, Vivian Li, Saul Albert, Julia Mertens, and JP de Ruiter) gathered in the basement of Haines Hall at UCLA for an intense week of work on GailBot, collaborating with researchers in UCLA’s Conversation Analysis Working Group. GailBot is named after Gail Jefferson, who invented the highly detailed transcription system used in the field of Conversation Analysis (CA) while she was an undergraduate at UCLA, so this was a very fitting venue! The hack session ran through the weekend and involved a highly committed group of staff, graduate students, and visiting scholars from around the world.
After some initial presentations about GailBot, outlining the process, challenges, and potential of automating the detailed transcription of dialogue for research, we began the task of installing and configuring GailBot on everyone’s computers. Predictably, given the range of operating systems, processors, and configurations present, this was a significant challenge. Nonetheless, we did eventually get GailBot running for everyone.
Working With Other People’s Data
As hard as it was to work with other people’s computers, it was much harder (and much more interesting) to work with other people’s data. GailBot was initially developed to solve a research problem in the Human Interaction Lab at Tufts: how to transcribe hundreds of hours of high-quality, neatly diarized video recordings of dialogue in US English without spending a fortune or going mad? Most people’s data was far more challenging to work with. Some had data recorded on head-mounted microphones, so we had to deal with cross-talk using audio preprocessing steps before GailBot could produce a decent transcript. Others brought data recorded on a single channel using an omnidirectional mic with up to 15 different speakers. We quickly learned GailBot’s limitations, but we could also see the need to create instructions that help users handle different kinds of data with different workflows. Thankfully, even without speaker diarization, users could still generate a useful, fast transcript by doing some processing tasks manually.
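As a minimal sketch of the kind of preprocessing we mean (using pydub, with hypothetical file names), a multi-channel recording from head-mounted mics can be split into one mono file per speaker, so each channel can be cleaned up and transcribed separately before the results are merged:

```python
from pydub import AudioSegment

# Load a recording where each head-mounted mic occupies its own channel.
recording = AudioSegment.from_file("dyad_headset_recording.wav")

for i, channel in enumerate(recording.split_to_mono()):
    # Cross-talk bleeding in from the other mic can be attenuated here,
    # e.g. with a gain threshold or an external noise gate, before ASR.
    channel.export(f"speaker_{i + 1}.wav", format="wav")
```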
Internationalization
We were also now working with experts on Japanese, Cantonese, and Mandarin conversational data. ASR for these languages often does not work as well, and in the case of minority languages such as Cantonese, models are not available in all four of the ASR engines GailBot currently uses (IBM Watson, OpenAI Whisper, WhisperX, and Google). We had to try many combinations and do quite a bit of hacking around to get results.
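A small sketch of the kind of experimentation this involved, using the openai-whisper package with a hypothetical file name: forcing a language code rather than relying on auto-detection, and falling back to other engines when a language is not supported at all.

```python
import whisper

model = whisper.load_model("large")

# Mandarin: pass the language code explicitly instead of trusting detection.
result = model.transcribe("mandarin_dyad.wav", language="zh",
                          word_timestamps=True)

# Cantonese ("yue") is only listed in more recent Whisper releases/models,
# so for some languages we had to switch engines entirely.
for segment in result["segments"]:
    print(f'{segment["start"]:.2f}-{segment["end"]:.2f}: {segment["text"]}')
```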
GailBot’s Next Steps
Since returning from UCLA, we have set up a Slack community to stay in touch with our growing group of beta testers, which you can join via the GailBot site, where you can also download beta versions of GailBot for macOS, read the documentation, and watch how-to videos. The hack session also left us with three clear priorities for GailBot’s development.
Firstly, to create multiple output representations of the transcripts, including orthographic-only versions (for searching), and exports in multiple file formats for use in specialist transcription software such as CLAN, ELAN, EXMARaLDA, and D:OTE, as well as in more ‘basic’ presentation formats like RTF/MS Word. This feature was so widely requested that we hacked together a provisional CLAN/ELAN exporter that very week.
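To illustrate the sort of export involved, here is a deliberately simplified sketch that writes utterances out in CLAN’s CHAT (.cha) format. The speaker codes, headers, and utterance structure are illustrative only; the provisional exporter (and CHAT itself) handle timing bullets and far more metadata than this.

```python
from typing import Dict, List


def export_cha(utterances: List[Dict[str, str]], path: str,
               language: str = "eng") -> None:
    """Write a minimal CHAT file from a list of {'speaker', 'text'} dicts."""
    speakers = sorted({u["speaker"] for u in utterances})
    participants = ", ".join(f"{s} Speaker{i + 1}"
                             for i, s in enumerate(speakers))

    lines = [
        "@Begin",
        f"@Languages:\t{language}",
        f"@Participants:\t{participants}",
    ]
    for u in utterances:
        lines.append(f"*{u['speaker']}:\t{u['text']}")
    lines.append("@End")

    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")


export_cha(
    [{"speaker": "SPA", "text": "did you want coffee ?"},
     {"speaker": "SPB", "text": "yeah please ."}],
    "example.cha",
)
```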
Secondly, we realized that some ASR models, such as Google’s, Whisper, and WhisperX, automatically remove disfluencies (um/er/uh), word repeats, and other features that we really need to include in our transcripts. This is a major issue that we will need to find workarounds for in future GailBot versions.
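One partial workaround that is sometimes suggested for Whisper specifically (not something GailBot does yet, and with a hypothetical file name) is to bias the decoder towards keeping fillers via an initial prompt; results vary a lot between models and recordings, so this is only a stopgap.

```python
import whisper

model = whisper.load_model("medium.en")
result = model.transcribe(
    "interview_excerpt.wav",
    # A prompt full of fillers can nudge the decoder to transcribe them
    # rather than silently dropping them.
    initial_prompt="So, um, yeah, uh, I mean, you know, er...",
    condition_on_previous_text=False,
)
print(result["text"])
```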
Thirdly, we really saw the potential of enabling users to further customize how GailBot interacts with the ASR systems it uses to produce its orthographic transcript and word timings, giving users access to all of the languages, features, and fine-tuning capabilities of ASR (and other) systems that are constantly improving. Eventually we will generate interfaces dynamically to let users fully harness ASR providers’ APIs from within GailBot.
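As a sketch of what this pass-through customization could look like (the function and option names here are hypothetical, not GailBot’s actual architecture), a per-engine options dictionary supplied by the user would be forwarded verbatim to whichever ASR engine is selected, so new provider parameters become usable without GailBot having to wrap each one explicitly.

```python
from typing import Any, Callable, Dict


def transcribe_with(engine: Callable[..., Dict[str, Any]],
                    audio_path: str,
                    engine_options: Dict[str, Any]) -> Dict[str, Any]:
    """Run an ASR engine, passing user-supplied options straight through."""
    return engine(audio_path, **engine_options)


# Example: forwarding Whisper-specific settings that GailBot knows nothing about.
# import whisper
# model = whisper.load_model("large")
# result = transcribe_with(
#     lambda path, **opts: model.transcribe(path, **opts),
#     "mandarin_meeting.wav",
#     {"language": "zh", "beam_size": 5, "word_timestamps": True},
# )
```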