Dr. Idris Abdulmumin

Our 2026 publications so far, and where African NLP is heading

2026-05-29T00:00:00+00:00

Research Associate, University of Pretoria | Co-Founder, HausaNLP & ArewaDS | Member, MasaKhaneNLP

Publicly available work I’ve been part of this year in translation, sentiment, speech, language ID, and benchmarking for African languages:

AfriScience-MT: Towards Decolonizing Science in Africa through Text Translation. Idris Abdulmumin et al. (2026). Parallel corpus across six African languages, confronting the terminology gap that blocks native-language access to science. https://lnkd.in/dUPPbHck

Temporal Simultaneity Predicts Annotation Quality in Sentiment Corpora. Idris Abdulmumin et al. (2026). Setswana sentiment dataset analyzing how inter-annotator agreement decays over long annotation campaigns. https://lnkd.in/dnADuScY

Beyond Majority Voting: Agreement-Based Clustering to Model Annotator Perspectives in Subjective NLP Tasks. Tadesse Destaw Belay et al. (2026). Scalable alternative to majority voting: cluster annotators by agreement to preserve diverse perspectives. https://lnkd.in/dFPQQnaR

NaijaS2ST: A Multi-Accent Benchmark for Speech-to-Speech Translation in Low-Resource Nigerian Languages. Marie Maltais et al. (2026). Parallel speech translation dataset for Igbo, Hausa, Yoruba, and Nigerian Pidgin paired with English. https://lnkd.in/dYeWemMT

SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis (DimABSA). Liang-Chih Yu et al. (2026). Shared task moving ABSA from categorical polarity to valence-arousal modeling, extended to public-issue discourse. https://lnkd.in/dk7KwCAq

SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization. Usman Naseem et al. (2026). 22 languages and 110K instances for detecting online polarization, its type, and its manifestation. https://lnkd.in/dzM7KvSM

DimStance: Multilingual Datasets for Dimensional Stance Analysis. Jonas Becker et al. (2026). Stance datasets modeling attitudes along continuous valence-arousal dimensions instead of Favor/Neutral/Against bins. https://lnkd.in/d8MVBBm2

CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data. Pedro Ortiz Suarez et al. (2026). Community-built LID benchmark across 109 languages for the noisy web domain, where current LID still fails. https://lnkd.in/dBgW6b6P

Afri-MCQA: Multimodal Cultural Question Answering for African Languages. Atnafu Lambebo Tonja et al. (2026). First multilingual cultural QA benchmark for 15 African languages: 7.5K parallel pairs across text and speech, built by native speakers. https://lnkd.in/dd-SsvKn

Swivuriso: The South African Next Voices Multilingual Speech Dataset. Vukosi Marivate et al. (2026). 3,000-hour ASR dataset covering seven South African languages across agriculture, healthcare, and general domains. https://lnkd.in/dMcxXDXR

Gratitude to every co-author, annotator, and native-speaker reviewer who makes this work possible. More in the pipeline.

#AfricanNLP #LowResourceNLP #ResearchHighlights

African languages reach record web presence in Common Crawl data

2026-02-05T00:00:00+00:00

African languages just hit their highest-ever representation on the web!!!

The latest Common Crawl (Jan 2026) shows African languages at 0.057% of all crawled pages, an all-time record. That’s 18.5% higher than the previous peak and 343,000+ more pages than the last crawl.

Some standout growth in a single month:
→ Igbo: +124%
→ Sango: +259%
→ Tswana: +279%
→ Swahili: +45% (now at 294K pages)

For context, English sits at 42% — roughly 728x more pages than all 29 detected African languages combined. There’s still a massive gap, but the direction is right. We believe projects like AfriCC are contributing to this shift by actively increasing the volume and diversity of African language content available for web crawlers.
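For anyone who wants to sanity-check the comparison, here is a minimal Python sketch of the arithmetic. It uses only the rounded percentages quoted in this post, and the function names are made up for illustration. It also assumes the 18.5% increase is relative to the previous peak share, which the post does not spell out. Because the inputs are rounded, the ratio it prints will not exactly match the 728x figure, which presumably comes from the raw page counts.

```python
def share_ratio(larger_pct: float, smaller_pct: float) -> float:
    """Return how many times larger one share is than another (both in percent)."""
    if smaller_pct <= 0:
        raise ValueError("smaller_pct must be positive")
    return larger_pct / smaller_pct


def implied_previous_share(current_pct: float, increase_pct: float) -> float:
    """Back out the earlier share, assuming increase_pct is relative to it."""
    return current_pct / (1 + increase_pct / 100)


# Rounded figures quoted in the post (not the raw Common Crawl counts).
english_pct = 42.0
african_pct = 0.057

ratio = share_ratio(english_pct, african_pct)
previous_peak_pct = implied_previous_share(african_pct, 18.5)

print(f"English vs. African languages: about {ratio:.0f}x")
print(f"Implied previous peak share: about {previous_peak_pct:.3f}%")
```

With these rounded inputs the ratio comes out near 737x and the implied previous peak near 0.048%, in the same ballpark as the figures above.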

The full data is open — Common Crawl publishes language stats for every monthly crawl: https://lnkd.in/dE-WAUeZ

What’s your take — what else can we do to close this gap?

#AfriCC #AfricanLanguages #NLP #CommonCrawl #DigitalInclusion #OpenData

CommonLID: Re-evaluating State-of-the-Art Language Identification…

2026-01-29T00:00:00+00:00

Introducing our work:

CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data. arXiv: https://lnkd.in/dBgW6b6P

Some months ago, I invited the community to participate in a hackathon to annotate a language identification dataset, with authorship on the resulting dataset description paper as an incentive. We are deeply grateful to everyone who participated, as well as to the Common Crawl team led by Pedro Ortiz Suarez, my boss Vukosi Marivate, and our collaborators Shamsuddeen H. Muhammad, PhD, and Atnafu Lambebo Tonja.

We hope this resource will inspire a wide range of NLP research and applications, and contribute meaningfully to advancing African NLP.