<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://abumafrim.com/feed.xml" rel="self" type="application/atom+xml"/><link href="https://abumafrim.com/" rel="alternate" type="text/html" hreflang="en"/><updated>2026-06-16T09:04:59+00:00</updated><id>https://abumafrim.com/feed.xml</id><title type="html">Dr. Idris Abdulmumin</title><subtitle>A simple, whitespace theme for academics. Based on [*folio](https://github.com/bogoli/-folio) design. </subtitle><entry><title type="html">Our 2026 publications so far, and where African NLP is heading</title><link href="https://abumafrim.com/blog/2026/our-2026-publications-so-far-and-where-african-nlp-is-heading/" rel="alternate" type="text/html" title="Our 2026 publications so far, and where African NLP is heading"/><published>2026-05-29T00:00:00+00:00</published><updated>2026-05-29T00:00:00+00:00</updated><id>https://abumafrim.com/blog/2026/our-2026-publications-so-far-and-where-african-nlp-is-heading</id><content type="html" xml:base="https://abumafrim.com/blog/2026/our-2026-publications-so-far-and-where-african-nlp-is-heading/"><![CDATA[<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>            Research Associate, University of Pretoria | Co-Founder, HausaNLP &amp; ArewaDS | Member, MasaKhaneNLP
        Our 2026 publications so far, and where African NLP is heading
</code></pre></div></div> <p>Publicly available work I’ve been part of this year in translation, sentiment, speech, language ID, and benchmarking for African languages:</p> <p>AfriScience-MT: Towards Decolonizing Science in Africa through Text Translation. Idris Abdulmumin et al. (2026). Parallel corpus across six African languages, confronting the terminology gap that blocks native-language access to science. https://lnkd.in/dUPPbHck</p> <p>Temporal Simultaneity Predicts Annotation Quality in Sentiment Corpora. Idris Abdulmumin et al. (2026). Setswana sentiment dataset analyzing how inter-annotator agreement decays over long annotation campaigns. https://lnkd.in/dnADuScY</p> <p>Beyond Majority Voting: Agreement-Based Clustering to Model Annotator Perspectives in Subjective NLP Tasks. Tadesse Destaw Belay et al. (2026). Scalable alternative to majority voting: cluster annotators by agreement to preserve diverse perspectives. https://lnkd.in/dFPQQnaR</p> <p>NaijaS2ST: A Multi-Accent Benchmark for Speech-to-Speech Translation in Low-Resource Nigerian Languages. Marie Maltais et al. (2026). Parallel speech translation dataset for Igbo, Hausa, Yoruba, and Nigerian Pidgin paired with English. https://lnkd.in/dYeWemMT</p> <p>SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis (DimABSA). Liang-Chih Yu et al. (2026). Shared task moving ABSA from categorical polarity to valence-arousal modeling, extended to public-issue discourse. https://lnkd.in/dk7KwCAq</p> <p>SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization. Usman Naseem et al. (2026). 22 languages and 110K instances for detecting online polarization, its type, and its manifestation. https://lnkd.in/dzM7KvSM</p> <p>DimStance: Multilingual Datasets for Dimensional Stance Analysis. Jonas Becker et al. (2026). Stance datasets modeling attitudes along continuous valence-arousal dimensions instead of Favor/Neutral/Against bins. https://lnkd.in/d8MVBBm2</p> <p>CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data. Pedro Ortiz Suarez et al. (2026). Community-built LID benchmark across 109 languages for the noisy web domain, where current LID still fails. https://lnkd.in/dBgW6b6P</p> <p>Afri-MCQA: Multimodal Cultural Question Answering for African Languages. Atnafu Lambebo Tonja et al. (2026). First multilingual cultural QA benchmark for 15 African languages: 7.5K parallel pairs across text and speech, built by native speakers. https://lnkd.in/dd-SsvKn</p> <p>Swivuriso: The South African Next Voices Multilingual Speech Dataset. Vukosi Marivate et al. (2026). 3,000-hour ASR dataset covering seven South African languages across agriculture, healthcare, and general domains. https://lnkd.in/dMcxXDXR</p> <p>Gratitude to every co-author, annotator, and native-speaker reviewer who makes this work possible. More in the pipeline.</p> <p>#AfricanNLP #LowResourceNLP #ResearchHighlights</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Publicly available work I've been part of this year in translation, sentiment, speech, language ID, and benchmarking for African languages.]]></summary></entry><entry><title type="html">African languages reach record web presence in Common Crawl data</title><link href="https://abumafrim.com/blog/2026/african-languages-reach-record-web-presence-in-common-crawl-data/" rel="alternate" type="text/html" title="African languages reach record web presence in Common Crawl data"/><published>2026-02-05T00:00:00+00:00</published><updated>2026-02-05T00:00:00+00:00</updated><id>https://abumafrim.com/blog/2026/african-languages-reach-record-web-presence-in-common-crawl-data</id><content type="html" xml:base="https://abumafrim.com/blog/2026/african-languages-reach-record-web-presence-in-common-crawl-data/"><![CDATA[<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>            Research Associate, University of Pretoria | Co-Founder, HausaNLP &amp; ArewaDS | Member, MasaKhaneNLP
        African languages just hit their highest-ever representation on the web!!!
</code></pre></div></div> <p>The latest Common Crawl (Jan 2026) shows African languages at 0.057% of all crawled pages, an all-time record. That’s 18.5% higher than the previous peak and 343,000+ more pages than the last crawl.</p> <p>Some standout growth in a single month: → Igbo: +124% → Sango: +259% → Tswana: +279% → Swahili: +45% (now at 294K pages)</p> <p>For context, English sits at 42% — roughly 728x more pages than all 29 detected African languages combined. There’s still a massive gap, but the direction is right. We believe projects like AfriCC are contributing to this shift by actively increasing the volume and diversity of African language content available for web crawlers.</p> <p>The full data is open — Common Crawl publishes language stats for every monthly crawl: https://lnkd.in/dE-WAUeZ</p> <p>What’s your take — what else can we do to close this gap?</p> <p>#AfriCC #AfricanLanguages #NLP #CommonCrawl #DigitalInclusion #OpenData</p>]]></content><author><name></name></author><summary type="html"><![CDATA[African languages just hit their highest-ever representation on the web!!! The latest Common Crawl (Jan 2026) shows African languages at 0.057% of all crawled pages, an all-time record. That's 18.5% higher than the previous peak and 343,000+ more pages than the last crawl. Some standout growth in a single month: → Igbo: +124% → Sango: +259% → Tswana: +279% → Swahili: +45% (now at 294K pages) For context, English sits at 42% — roughly 728x more pages than all 29 detected African languages combined. There's still a massive gap, but the direction is right. We believe projects like AfriCC are contributing to this shift by actively increasing the volume and diversity of African language content available for web crawlers. The full data is open — Common Crawl publishes language stats for every monthly crawl: https://lnkd.in/dE-WAUeZ What's your take — what else can we do to close this gap? 
#AfriCC #AfricanLanguages #NLP #CommonCrawl #DigitalInclusion #OpenData]]></summary></entry><entry><title type="html">CommonLID: Re-evaluating State-of-the-Art Language Identification…</title><link href="https://abumafrim.com/blog/2026/commonlid-re-evaluating-state-of-the-art-language-identification/" rel="alternate" type="text/html" title="CommonLID: Re-evaluating State-of-the-Art Language Identification…"/><published>2026-01-29T00:00:00+00:00</published><updated>2026-01-29T00:00:00+00:00</updated><id>https://abumafrim.com/blog/2026/commonlid-re-evaluating-state-of-the-art-language-identification</id><content type="html" xml:base="https://abumafrim.com/blog/2026/commonlid-re-evaluating-state-of-the-art-language-identification/"><![CDATA[<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>            Research Associate, University of Pretoria | Co-Founder, HausaNLP &amp; ArewaDS | Member, MasaKhaneNLP
        Introducing our work
</code></pre></div></div> <p>CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data ArXiv: https://lnkd.in/dBgW6b6P</p> <p>Some months ago, I invited the community to participate in a hackathon to annotate a language identification dataset, with authorship on the resulting dataset description paper as an incentive. We are deeply grateful to everyone who participated, as well as to the Common Crawl team led by Pedro Ortiz Suarez, my boss Vukosi Marivate, and our collaborators Shamsuddeen H. Muhammad, PhD, and Atnafu Lambebo Tonja.</p> <p>We hope this resource will inspire a wide range of NLP research and applications, and contribute meaningfully to advancing African NLP.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Introducing our work CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data ArXiv: https://lnkd.in/dBgW6b6P Some months ago, I invited the community to participate in a hackathon to annotate a language identification dataset, with authorship on the resulting dataset description paper as an incentive. We are deeply grateful to everyone who participated, as well as to the Common Crawl team led by Pedro Ortiz Suarez, my boss Vukosi Marivate, and our collaborators Shamsuddeen H. Muhammad, PhD, and Atnafu Lambebo Tonja. We hope this resource will inspire a wide range of NLP research and applications, and contribute meaningfully to advancing African NLP.]]></summary></entry></feed>