Swahili Lexicon for Sentiment Analysis
By Dr. Aloyce R. Kaliba, Professor of Economics and Statistics, College of Business, Southern University and A&M College, Baton Rouge, Louisiana. This study was funded by SAGE Publishing under the Concept Grant Program.
Spoken by more than 220 million people and growing, Swahili is becoming the language of the future. Currently, Kiswahili is a leading language in the East African Community, which includes Burundi, the Democratic Republic of Congo (DRC), Kenya, Rwanda, Tanzania, South Sudan, and Uganda. Other countries with Swahili-speaking people are Oman and Yemen in the Middle East, Malawi, Mozambique, Zambia, and the southern tip of Somalia. Universities in China, the U.S., and Europe offer Swahili curricula for students interested in foreign languages, and these programs are expanding rapidly. The "Swahili Lexicon for Sentiment Analysis" project aims to develop and test a Swahili Lexicon annotated by native Swahili speakers for text mining, particularly sentiment analysis.
Over the last year, in collaboration with Swahili teachers in Zanzibar and Tanzania Mainland, we created a Standard Swahili (Swahili Sanifu) Lexicon (SWAHILILex.01) with about 20,000 words tagged as negative, neutral, or positive. Due to time differences and schedules, the main challenge was organizing virtual meetings. We therefore decided to meet in Dar es Salaam for one week while the schools were in recess. After compiling SWAHILILex.01, we compared its performance in polarity analysis against other Swahili lexicons, i.e., the NRC Word-Emotion Association Lexicon (EmoLex) and the CSLex created by Chen and Skiena as an exercise in building sentiment lexicons for all major languages. We also compared it against the classification results of supervised machine learning models and the Davlan/bert-base-multilingual-cased-finetuned-Swahili pre-trained transformer. The reference was two annotated datasets: a standard Swahili dataset and a tweets dataset. The performance of the new SWAHILILex.01 lexicon was similar to that of supervised machine learning and better than the other methods when classifying the standard Swahili dataset, but it underperformed when classifying the tweets dataset. The results emphasize the need for domain-specific lexicons or new modeling techniques that account for multiple domains, since social media topics span many domains. The paper has been submitted for publication consideration and will soon appear on the TechRxiv website (https://www.techrxiv.org/).
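To make the idea of lexicon-based polarity analysis concrete, here is a minimal Python sketch. It is not the method or code from the paper: the five-word lexicon, the function names (`tokenize`, `polarity`, `accuracy`), and the simple sum-of-scores rule are all assumptions chosen for illustration, and the entries are not taken from SWAHILILex.01.

```python
import re

# Toy lexicon for illustration only; these entries are NOT from SWAHILILex.01.
# Each word maps to -1 (negative), 0 (neutral), or +1 (positive).
TOY_LEXICON = {
    "nzuri": 1,    # good
    "furaha": 1,   # happiness
    "mbaya": -1,   # bad
    "huzuni": -1,  # sadness
    "kitabu": 0,   # book
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def polarity(text, lexicon=TOY_LEXICON):
    """Label a text by summing the scores of the words found in the lexicon.

    Words missing from the lexicon are treated as neutral (score 0).
    """
    score = sum(lexicon.get(token, 0) for token in tokenize(text))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def accuracy(labelled_examples, lexicon=TOY_LEXICON):
    """Share of (text, label) pairs where the lexicon label matches the annotation."""
    correct = sum(
        1 for text, label in labelled_examples if polarity(text, lexicon) == label
    )
    return correct / len(labelled_examples)


if __name__ == "__main__":
    print(polarity("Habari nzuri"))     # contains one positive word
    print(polarity("Hali mbaya sana"))  # contains one negative word
```

Comparing lexicons then amounts to running `accuracy` with each lexicon on the same annotated dataset. A sketch like this also hints at why tweets are harder: slang, code-switching, and misspellings produce tokens that a Standard Swahili lexicon does not contain, so they fall back to neutral.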
I came across the SAGE Concept Grant while searching for a potential sponsor for the Swahili Lexicon project. The application was simple and smooth, and the funding helped us achieve the first objective of developing a Lexicon for Standard Swahili. The tools will be available by the end of October 2023, as we are currently conducting the final reviews of the Lexicon with Swahili experts and creating a GitHub repository to host the Lexicon and the tools used in the above-mentioned paper.
Have an idea? Apply for a Sage Concept Grant!
Sage’s Concept Grant program has been running since 2018 and aims to fund innovative software solutions that support research in the social sciences. We are seeking proposals for new technological solutions that support the adoption, development and application of established and emerging research methods, including quantitative, qualitative, mixed, and computational methods. We offer £2k seed grants to develop ideas and £15k grants for scaling prototypes. Explore previous winners. Apply for this year’s round and find out more here. Applications are open to anyone, regardless of location or affiliation, until the 20th of June.
More Methodspace Posts from Sage Concept Grant Winners
Annotiva, a 2023 Sage Concept Grant recipient, gives an update on the development of their AI-assisted tool to enhance qualitative coding by predicting ambiguities and improving collaboration among novice researchers.
Catch up with Gorilla’s Shop Builder 2. Developed after receiving the 2023 Sage Concept Grant, it features a modular design, mobile compatibility, skinnability, and improved customization, enhancing researchers' flexibility and user experience.
BigKnowledge's BoKMap platform transforms mapping principles into a "GPS for knowledge," visualizing unstructured text data in a map-like form, aiding in document exploration and discovery across various domains. Read more for an update on this 2023 Concept Grant winning project.
We check in with GailBot, the £15k Concept Grant winner. Combining ASR and ML to capture paralinguistic features, GailBot has been revamped for wider researcher use. After a week-long UCLA hack session, we’re addressing internationalization and user-driven feature requests, ensuring scalable and versatile transcription. Join our Slack Community for updates!
Qualitative researchers face ongoing challenges in consistently identifying and addressing potential ambiguity in collaborative data annotation processes. Annotiva can help.
Swara is a web-based/mobile self-completion survey software/application that utilizes voice user interface technology.
GailBot is an automated dialogue transcription tool that uses machine learning and heuristics to annotate essential paralinguistic details of talk such as overlaps, sound stretches, speed changes, and timed silences.
The "Swahili Lexicon for Sentiment Analysis" project, funded by a Sage Concept Grant, aims to develop and test a Swahili Lexicon annotated by native Swahili speakers for text mining, particularly sentiment analysis.
Learn about a Sage Concept Grant winner: Automated Video Analysis software (or AUVANA for short) is an open-source annotation tool for social scientists whose research involves analyzing and annotating videos.
Causal Map is a web app dedicated to causal qualitative data analysis. Learn more about this open-access tool, winner of a Sage Concept Grant.
SAGE Publishing is inviting applications for the 2022 SAGE Concept Grant, which provides funding for new software tools for social science research.
Learn about new technologies for researchers!
Interested in applying for a SAGE Concept Grant? We’ve put together the following advice for applicants based on our feedback for previous years’ applicants, and the criteria we’ll be using to judge this year’s applications.
Watch the webinar recording and read the follow-up blog from our webinar with SAGE 2020 Concept Grant winner, Andrew Lovett-Barron, on how to design trust relationships with participants in research using Knowsi.
There’s a validity problem with automated content analysis. In this post, Dr. Chung-hong Chan introduces a new tool that provides a set of simple and standardized tests for frequently used text analytic tools and gives examples of validity tests you can apply to your research right away.
SAGE Concept Grant winner Dr Nechama Brodie introduces the Homicide Media Tracker, a tool currently under development that will enable the collection, curation and analysis of crime data in media. Why is it needed? What kind of data will the tool be able to collect? And what insights can this data afford researchers?
Media reports are a valuable source of crime data that can be used to supplement police and judicial records. Nechama Brodie explains the challenges and opportunities of working with this type of data, and introduces a new tool concept for the collection and analysis of homicide data.
Virtual reality offers a realistic and controlled research environment, presenting an opportunity for the future of valid and reliable research in the social sciences. This blog introduces a new tool that enables real-time experience measurement in VR, under development with support from the SAGE Concept Grant.
Last month we announced the winners of the 2020 SAGE Concept Grant, which supports the development of new software tools for social science research. In the latest edition of our newsletter we introduce our six winners: Read it here and sign up for future updates.
The SAGE Concept Grants form an integral part of our mission at SAGE Ocean to support the use of computational methods and big data in social science research. Read interviews with the winners to find out how their tools will benefit social research, and how the funding from SAGE will help them.
SAGE has announced the 2020 winners of its Concept Grant program, which provides funding for innovative software solutions that support social science research. In this blog we interview Andrew Lovett-Barron, the creator of the winning tool, Knowsi, a portal for researchers and participants to manage their consent relationship.
The 2020 SAGE Ocean Concept Grant program drew over 140 applications from all over the world. In this blog post, we’re giving you an insight into our judging criteria and sharing the most common reasons why applications did not progress further, to serve as feedback for this year’s applicants and guidance for future applicants.
Text Wash uses machine learning and natural language processing to unlock previously untapped crime data that has so far been inaccessible to research due to the need to anonymize the personally identifiable information it contains.
Text is everywhere, and everything is text. More textual data than ever before are available to computational social scientists—be it in the form of digitized books, communication traces on social media platforms, or digital scientific articles. Researchers in academia and industry increasingly use text data to understand human behavior and to measure patterns in language. Techniques from natural language processing have created a fertile soil to perform these tasks and to make inferences based on text data on a large scale.
Following the launch of the SAGE Ocean initiative in February 2018, the inaugural winners of the SAGE Concept Grant program were announced in March of the same year. As we build up to this year’s winner announcement we’ve caught up with the three winners from 2018 to see what they’ve been up to and how the seed funding has helped in the development of their tools.
In this post we chatted to MiniVan, a project of the Public Data Lab.
The Shop Builder is a unique product that allows researchers to easily create a simulated and interactive online shop to study consumer behaviour.