Interview with Daniela Duca on creating SAGE Texti: A free tool for cleaning and pre-processing textual data
As part of a three-month focus on data analysis, Methodspace is featuring original writings, podcasts, and videos on all phases of the process for both qualitative and quantitative studies. In this interview, learn about a free tool, SAGE Texti.
Daniela Duca, Senior Product Manager, SAGE Publishing
Q: Hi Daniela, can you tell us a little about your background?
I have a multi-disciplinary background, starting off in biochemistry, then moving into economics. I wrote my PhD in corporate innovation, while working in fintech to support the development of data crawlers and algorithms. This experience stimulated me to think about research automation and how technology can improve the academic lifecycle, and I really wanted to be part of it. I joined SAGE several years ago because I am deeply passionate about research and learning.
Q: Tell us a little about your role at SAGE and your work in the research software space?
At SAGE, I research, review, and track software tools that social science researchers develop or use to support their data collection and analysis. I share this knowledge openly with the research community in our white paper, The Ecosystem of Technologies for Social Science Research, and in an open, comprehensive list of research tools. In my recent article, I describe how SAGE developed the list and how researchers could use it when teaching the next generation of students.
We look at tools from the perspective of different research methods, such as tools for social media research, surveys, or experiments. In doing this, we identified text mining as a huge area for software tools, with an incredible variety. There are tools that require little to no coding experience for statistical analysis, right through to very cutting-edge transformers for those with programming skills and computing resources. While there is a lot of choice, researchers told us they still have to spend more time than they would want on cleaning and pre-processing their text corpora. This is why we came up with the idea behind our new text cleaning and pre-processing tool, SAGE Texti.
Q: What is the idea behind SAGE’s new software tool, SAGE Texti? What problem does it solve for researchers?
We started with the problem. Many researchers report that cleaning and even pre-processing corpora is a huge challenge. The challenge isn’t that they can’t clean text themselves (or together with their graduate students); it is the time it takes to get to a good output that they can use in whatever environment they choose for their analysis. Cleaning text is not the part of the process that is publishable or produces any outputs for researchers, but it remains critical for the analysis and for ensuring reproducible results, and it is incredibly time-consuming.
For example, if a researcher wants to do a simple word cloud but their corpus is several thousand pdf documents, they need to find a converter to extract the text, hoping for no OCR, and then figure out how to remove all the junk that they don’t need, like the word ‘Page’ at the bottom of the document.
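To make the ‘Page’ example concrete, here is a minimal Python sketch of that kind of cleaning step. It assumes the text has already been extracted from the PDF into a plain string. The function name `remove_page_footers` and the footer pattern are illustrative assumptions and are not taken from SAGE Texti itself.

```python
import re

# Assumed footer shape: a line holding only "Page 3" or "Page 3 of 10".
# Real documents vary, so this pattern would need adjusting per corpus.
PAGE_FOOTER = re.compile(r"^\s*Page\s+\d+(\s+of\s+\d+)?\s*$", re.IGNORECASE)


def remove_page_footers(text: str) -> str:
    """Drop lines that contain nothing but a page marker."""
    kept = [line for line in text.splitlines() if not PAGE_FOOTER.match(line)]
    return "\n".join(kept)


sample = (
    "Results were mixed.\n"
    "Page 1 of 2\n"
    "Page limits applied to submissions.\n"
    "Page 2 of 2"
)
cleaned = remove_page_footers(sample)
```

Matching whole lines only means a sentence that merely begins with the word "Page" is left alone, while standalone footer lines are removed.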
SAGE Texti aims to speed up this cleaning and pre-processing of text corpora. The idea has been brewing for a while, but we kicked off the development work and testing in the summer of 2020, and we’ve already had so many brilliant people involved in bootstrapping and continuously reinventing it.
Q: A free beta version of SAGE Texti is now available for researchers – what does it do?
SAGE Texti is an interface to help social science researchers quickly test and build workflows to clean and pre-process their corpora.
It’s a web application where researchers can:
Upload their documents and extract the text.
Pick and choose from a growing list of cleaners and transformers to build a workflow for cleaning and pre-processing the extracted text.
Preview pages from their documents live to see the workflow steps in action.
Download the extracted, cleaned and pre-processed text.
The cleaners and transformers are written in the Python programming language and were developed by three very bright PhD students on a SAGE Ocean Fellowship (read about their experiences here and here). The cleaners are all available on GitHub, and you can either open an issue to request a new cleaner or submit one you’ve developed. These will automatically go onto SAGE Texti once tested by us.
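As a rough illustration of what a workflow of cleaners could look like in Python, the sketch below chains a few small text functions in order. It is an assumption about the general idea only: the names `strip_urls`, `lowercase`, `collapse_whitespace`, and `run_workflow` are hypothetical and are not the cleaners or API published in the SAGE Texti GitHub repository.

```python
import re
from typing import Callable, List

# A cleaner is modelled here as any function from text to text (an assumption).
Cleaner = Callable[[str], str]


def strip_urls(text: str) -> str:
    """Remove http(s) URLs."""
    return re.sub(r"https?://\S+", "", text)


def lowercase(text: str) -> str:
    """Convert all characters to lower case."""
    return text.lower()


def collapse_whitespace(text: str) -> str:
    """Replace runs of whitespace with a single space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def run_workflow(text: str, steps: List[Cleaner]) -> str:
    """Apply each cleaner in order, feeding the output of one into the next."""
    for step in steps:
        text = step(text)
    return text


workflow = [strip_urls, lowercase, collapse_whitespace]
result = run_workflow("Visit  https://example.org for the   FULL report.", workflow)
```

Keeping each cleaner as a small, independent function means steps can be reordered, previewed one at a time, or swapped out without touching the others.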
SAGE Texti is currently free and in beta, which means it works at a basic level but there will be some bugs. We are continuously building it and are very interested in what works and what doesn’t.
Q: What are the future plans for SAGE Texti?
Right now, we support pdf and txt documents, and we have enabled basic scraping of html pages and sharing of the workflow the user creates. In the future, we hope to enable different file formats and more advanced scraping capabilities, as well as saving and sharing entire corpora and projects. We have many other ideas for the app thanks to the fabulous users who have been excited about its development and have tested the application so far, and we would love to hear from others who will find the tool useful for their own research projects.
Q: Who do you think SAGE Texti will most benefit? What type of research in particular?
SAGE Texti will benefit any researcher, and in particular we hope it will be useful for social scientists at the beginning of their text mining project, who have some basic coding skills. We are also hearing from our users that they are interested in adopting SAGE Texti for their students, to help them pick up on the nuances of cleaning and pre-processing.
Q: Where can researchers find out more or try it out?
You can learn about SAGE Texti on our webpage. The application is free and you can go ahead and try it out now here. If you want to check out the Python code behind the cleaners, you can always access it on GitHub.
I am always happy to speak with anyone who has ideas and feedback, and you can connect with me directly using the details below:
Twitter: @danielagduca
LinkedIn: linkedin.com/danielagduca
Email: Daniela.Duca@sagepub.co.uk