Swahili Lexicon for Sentiment Analysis

By Dr. Aloyce R. Kaliba, Professor of Economics and Statistics, College of Business, Southern University and A&M College, Baton Rouge, Louisiana. This study was funded by SAGE Publishing under the Concept Grant Program.


Spoken by more than 220 million people and growing, Swahili (Kiswahili) is becoming the language of the future. It is currently a leading language in the East African Community, whose members include Burundi, the Democratic Republic of Congo (DRC), Kenya, Rwanda, Tanzania, South Sudan, and Uganda. Swahili speakers are also found in Oman and Yemen in the Middle East, and in Malawi, Mozambique, Zambia, and the southern tip of Somalia. Universities in China, the U.S., and Europe offer Swahili curricula for students interested in foreign languages, and these programs are expanding rapidly. The "Swahili Lexicon for Sentiment Analysis" project aims to develop and test a Swahili lexicon annotated by native Swahili speakers for text mining, particularly sentiment analysis.

Over the last year, in collaboration with Swahili teachers in Zanzibar and Tanzania Mainland, we created a Standard Swahili (Swahili Sanifu) lexicon (SWAHILILex.01) with about 20,000 words tagged as negative, neutral, or positive. The main challenge was organizing virtual meetings across time zones and conflicting schedules, so we decided to meet in Dar es Salaam for one week while the schools were in recess.

After compiling SWAHILILex.01, we compared its performance in polarity analysis against other Swahili lexicons, namely the NRC Word-Emotion Association Lexicon (EmoLex) and CSLex, which Chen and Skiena created as an exercise in building sentiment lexicons for all major languages. We also compared it against the classification results of supervised machine learning models and the Davlan/bert-base-multilingual-cased-finetuned-Swahili pre-trained transformer. Two annotated datasets served as the reference: a standard Swahili dataset and a tweets dataset. SWAHILILex.01 performed similarly to supervised machine learning and outperformed the other methods on the standard Swahili dataset, but it underperformed on the tweets dataset. The results emphasize the need for domain-specific lexicons or new modeling techniques that account for multiple domains, since social media topics span many domains. The paper has been submitted for publication consideration and will soon appear on the TechRxiv website (https://www.techrxiv.org/).
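For readers new to lexicon-based polarity analysis, the general idea can be sketched in a few lines of Python. This is only a minimal illustration and not the method used in the paper: the five-word `TOY_LEXICON`, the `tokenize` and `classify_polarity` helpers, and the sum-of-scores rule are all assumptions made for this example, and none of the entries are taken from SWAHILILex.01.

```python
import re

# Hypothetical toy lexicon for illustration only; these entries are
# NOT taken from SWAHILILex.01. Scores: 1 positive, 0 neutral, -1 negative.
TOY_LEXICON = {
    "nzuri": 1,    # good
    "furaha": 1,   # joy
    "mbaya": -1,   # bad
    "huzuni": -1,  # sadness
    "habari": 0,   # news
}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def classify_polarity(text, lexicon=TOY_LEXICON):
    """Sum the lexicon scores of the tokens and label the text by the sign.

    Words missing from the lexicon contribute 0, the same as neutral words.
    """
    score = sum(lexicon.get(token, 0) for token in tokenize(text))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for sentence in ["Habari nzuri", "Habari mbaya na huzuni", "Habari"]:
        print(sentence, "->", classify_polarity(sentence))
```

Because unknown words score zero, a text written in slang or a different domain tends to fall back to "neutral" under this simple rule, which is one way to picture why a general-purpose lexicon can struggle on tweets.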

I came across the SAGE Concept Grant while searching for a potential sponsor for the Swahili Lexicon project. The application was simple and smooth, and the funding helped us achieve the first goal of developing a lexicon for Standard Swahili. The tools will be available by the end of October 2023: Swahili experts are currently conducting the final reviews of the lexicon, and we are creating a GitHub repository to host the lexicon and the tools used in the above-mentioned paper.

Have an idea? Apply for a Sage Concept Grant!

Sage’s Concept Grant program has been running since 2018 and aims to fund innovative software solutions that support research in the social sciences. We are seeking proposals for new technological solutions that support the adoption, development, and application of established and emerging research methods, including quantitative, qualitative, mixed, and computational methods. We offer £2k seed grants to develop ideas and £15k grants for scaling prototypes. Explore previous winners. Apply for this year’s round and find out more here. Applications are open to anyone, regardless of location or affiliation, until the 20th of June.

