Research Article Open Access

EthioLSocMDMTLM: Exploring Application of Topic Modeling for Building Ethiopian Language Social Media Data-Based Multilingual Transformer Language Models for Multilingual Hateful Content Detection

Naol Bakala Defersha¹, Kula Kekeba Tune¹ and Solomon Teferra Abate²
  • ¹ Software Engineering, College of Engineering, Addis Ababa Science and Technology University, Addis Ababa, Ethiopia
  • ² School of Information Science, College of Natural and Computational Science, Addis Ababa University, Addis Ababa, Ethiopia

Abstract

This study proposes using topic modeling techniques to develop Ethiopian Language Social Media Data-Based Multilingual Transformer Language Models for multilingual hateful content detection. We modified various multilingual pre-trained models, investigated the challenges of using pre-trained transformer language models, and built multilingual hateful content detection models. Topic words comprising 1561, 70, and 1044 rows, extracted from Afaan Oromo, Amharic, and Tigrigna respectively, were used to train the transformers. The proposed models were also tested by developing a multilingual hateful content detection model for low-resource Ethiopian languages using deep learning techniques. A total of 45522, 59529, and 48882 text documents of Amharic, Afaan Oromo, and Tigrigna were collected, and three annotators labeled the data into binary classes; inter-annotator agreement was 87% for Amharic, 82% for Tigrigna, and 84% for Afaan Oromo. LSTM, CNN, and BiLSTM deep learning algorithms were applied, integrating EthioLan_mBERT, EthioLan_BERT, and EthioLan_XLM-Roberta contextual embeddings. Among the applied techniques, LSTM + EthioLan_mBERT performed best, with an F1 score of 81%. We publicly release the modified pre-trained models, dataset, and related code.
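
For readers unfamiliar with the F1 score reported above, the following is a minimal sketch of how it can be computed for a binary hateful/non-hateful task. The abstract does not state which averaging scheme the authors used, so treating "hateful" as the positive class is an assumption, and the function name `binary_f1` and the toy labels are hypothetical illustrations rather than the authors' evaluation code.

```python
from typing import Sequence


def binary_f1(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """F1 score for one positive class (assumed here to be the 'hateful' label)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    # True positives: hateful items correctly flagged.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: non-hateful items flagged as hateful.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: hateful items that were missed.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    if tp == 0:
        # Precision and recall are both zero (or undefined); report 0.0.
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy labels for illustration only: 1 = hateful, 0 = not hateful.
    gold = [1, 1, 1, 0, 0, 1]
    predicted = [1, 0, 1, 0, 1, 1]
    print(f"F1 = {binary_f1(gold, predicted):.2f}")
```

With the toy labels, precision and recall are both 3/4, so the harmonic mean is also 0.75. A multi-class or macro-averaged variant would average this quantity over classes.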

Journal of Computer Science
Volume 21 No. 2, 2025, 250-262

DOI: https://doi.org/10.3844/jcssp.2025.250.262

Submitted On: 29 October 2024
Published On: 31 December 2024

How to Cite: Defersha, N. B., Tune, K. K. & Abate, S. T. (2025). EthioLSocMDMTLM: Exploring Application of Topic Modeling for Building Ethiopian Language Social Media Data-Based Multilingual Transformer Language Models for Multilingual Hateful Content Detection. Journal of Computer Science, 21(2), 250-262. https://doi.org/10.3844/jcssp.2025.250.262

Keywords

  • Afaan Oromo
  • Low Resource Languages
  • Amharic
  • Hateful Content
  • EthioLSocMDMTLM
  • Transformer Language Model
  • Tigrigna