Blog
Posts and updates, including highlights shared on LinkedIn.
Our 2026 publications so far, and where African NLP is heading
Publicly available work I've been part of this year in translation, sentiment, speech, language ID, and benchmarking for African languages.
African languages reach record web presence in Common Crawl data
African languages just hit their highest-ever representation on the web!
The latest Common Crawl (Jan 2026) shows African languages at 0.057% of all crawled pages, an all-time record. That's 18.5% higher than the previous peak and 343,000+ more pages than the last crawl.
Some standout growth in a single month:
→ Igbo: +124%
→ Sango: +259%
→ Tswana: +279%
→ Swahili: +45% (now at 294K pages)
For context, English sits at 42%, roughly 728x more pages than all 29 detected African languages combined. There's still a massive gap, but the direction is right.
We believe projects like AfriCC are contributing to this shift by actively increasing the volume and diversity of African language content available for web crawlers.
The full data is open. Common Crawl publishes language stats for every monthly crawl: https://lnkd.in/dE-WAUeZ
What's your take? What else can we do to close this gap?
#AfriCC #AfricanLanguages #NLP #CommonCrawl #DigitalInclusion #OpenData
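The figures in this post can be sanity-checked with a few lines of arithmetic. This is a minimal sketch that uses only the rounded percentages quoted in the post. The function and variable names are illustrative and do not come from Common Crawl's tooling.

```python
# Shares of all crawled pages, in percent, as rounded in the post.
ENGLISH_SHARE_PCT = 42.0
AFRICAN_SHARE_PCT = 0.057
# "18.5% higher than the previous peak", expressed as a fraction.
PEAK_INCREASE = 0.185


def share_ratio(larger_pct: float, smaller_pct: float) -> float:
    """How many times larger one page share is than another."""
    return larger_pct / smaller_pct


def implied_previous_peak(current_pct: float, relative_increase: float) -> float:
    """Back out the earlier share from the current share and its relative growth."""
    return current_pct / (1.0 + relative_increase)


ratio = share_ratio(ENGLISH_SHARE_PCT, AFRICAN_SHARE_PCT)
previous_peak = implied_previous_peak(AFRICAN_SHARE_PCT, PEAK_INCREASE)

print(f"English vs. African languages: about {ratio:.0f}x")
print(f"Implied previous peak: about {previous_peak:.3f}% of pages")
```

With the rounded inputs, the ratio comes out near 737x rather than the 728x quoted in the post. The small difference is presumably because the post's figure was computed from unrounded shares, but that is an assumption. The implied previous peak is about 0.048% of crawled pages.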
CommonLID: Re-evaluating State-of-the-Art Language Identification…
Introducing our work, CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data.
ArXiv: https://lnkd.in/dBgW6b6P
Some months ago, I invited the community to participate in a hackathon to annotate a language identification dataset, with authorship on the resulting dataset description paper as an incentive. We are deeply grateful to everyone who participated, as well as to the Common Crawl team led by Pedro Ortiz Suarez, my boss Vukosi Marivate, and our collaborators Shamsuddeen H. Muhammad, PhD, and Atnafu Lambebo Tonja.
We hope this resource will inspire a wide range of NLP research and applications, and contribute meaningfully to advancing African NLP.