Overview
| Language | ISO 639-3 | Speakers | Region |
| --- | --- | --- | --- |
| Hausa | hau | ~80 million | Nigeria, Niger, Ghana, Cameroon |
| Fongbe | fon | ~2 million | Benin, Togo, Nigeria |
This repository catalogs 60 datasets for Hausa and 18 for Fongbe, covering NLP tasks such as machine translation, named entity recognition, sentiment analysis, speech recognition, and question answering.
Hausa Resources
Hausa Text Corpora
| Resource | Description | Size | License | Links |
| --- | --- | --- | --- | --- |
| CC-100 (Hausa) | Monolingual data from Common Crawl for language modeling | Large | - | Download |
| AfriBERTa Corpus | Multilingual corpus including Hausa from BBC news and Common Crawl | - | Apache 2.0 | GitHub, HuggingFace |
| Naijaweb Dataset | 270k+ documents from Nigerian web sources | ~230M tokens | MIT | HuggingFace |
| BloomLM | Bloom Library data for language modeling (~400 languages) | Varies | CC-BY variants | HuggingFace |
| African Storybook | Multilingual children's stories | ~6,700 stories | - | GitHub |
Hausa Parallel Corpora & Machine Translation
| Resource | Description | Size | License | Links |
| --- | --- | --- | --- | --- |
| MAFAND-MT | News domain parallel corpus (English-Hausa) | - | CC-BY-NC-4.0 | GitHub |
| FLORES-200 | Evaluation benchmark for MT (200 languages) | - | CC-BY-SA | GitHub |
| NLLB Data | Parallel data from No Language Left Behind project | Large | Custom | GitHub |
| Gamayun 5k/10k | Everyday language parallel sentences (English-Hausa) | 5k-10k sentences | - | 5k, 10k |
| Hausa Visual Genome | Multimodal English-Hausa dataset with images | 32,923 sentences | - | LINDAT |
| TICO-19 | COVID-19 translation data | - | - | Website |
| CrossSum | Cross-lingual summarization dataset | - | CC-BY-NC-SA-4.0 | GitHub, HuggingFace |
| OPUS Corpora | Various parallel corpora (KDE4, GNOME, Ubuntu, ParaCrawl, Tatoeba, Tanzil, QED) | Varies | Varies | OPUS |
| Hausa-English Code-Switched | Social media code-switched comments | - | CC-BY-4.0 | Mendeley |
| hausa-corpus | Parallel Hausa-English textual datasets | - | - | GitHub |
| TED Talks | IWSLT TED talk translations | - | CC-BY-NC-4.0 | Website |
| Microsoft Terminology | IT terminology glossary | - | - | Microsoft |
Hausa Named Entity Recognition
| Resource | Description | Size | License | Links |
| --- | --- | --- | --- | --- |
| MasakhaNER | NER dataset for 10 African languages | - | CC-BY-NC-4.0 | GitHub |
| MasakhaNER 2.0 | Extended NER dataset for 20 African languages | - | CC-BY-NC-4.0 | GitHub |
| Hausa VOA NER | NER from Voice of America Hausa news | - | - | GitHub |
| Wikidata Names | Name lists for African languages from Wikidata | ~1.9M names | - | GitHub |
Hausa Sentiment Analysis
| Resource | Description | Size | License | Links |
| --- | --- | --- | --- | --- |
| AfriSenti | Twitter sentiment dataset (14 African languages) | 110k+ tweets | Custom | HuggingFace, GitHub |
| NaijaSenti | Nigerian Twitter sentiment corpus | - | Custom | GitHub |
| NollySenti | Nigerian movie review sentiment | - | CC-BY-4.0 | GitHub |
| BRIGHTER | Emotion recognition dataset (28 languages) | - | CC-BY-4.0 | HuggingFace, GitHub |
Hausa Speech & ASR
| Resource | Description | Size | License | Links |
| --- | --- | --- | --- | --- |
| ALFFA | Hausa ASR corpus with Kaldi recipes | - | - | GitHub |
| Bible TTS | High-quality TTS dataset | 86.6 hours aligned | CC-BY-SA | OpenSLR, Website |
| CMU Wilderness | Multilingual speech dataset from Bible recordings | ~20 hours | - | GitHub |
| ML Spoken Words | Multilingual keyword spotting dataset | - | CC-BY-4.0 | Website, HuggingFace |
| Hausa Speech Corpus | Baseline ASR dataset | - | CC-BY-4.0 | Mendeley |
| Studios Tamani-Kalangou | Radio broadcast audio collection | - | - | Website |
| Synthetic Voice Data | Synthetic ASR training data | 2,500+ hours | - | arXiv, HuggingFace |
| Ajami HTR Dataset | Handwritten text recognition for Ajami manuscripts | 400 pages, 6,132 lines | - | Zenodo |
Hausa Question Answering & Reasoning
| Resource | Description | Size | License | Links |
| --- | --- | --- | --- | --- |
| BLEnD | Cultural everyday knowledge benchmark | 52.6k QA pairs | CC-BY-SA-4.0 | HuggingFace, GitHub |
| Global PIQA | Physical commonsense reasoning (100+ languages) | - | - | HuggingFace |
| AfriCLIRMatrix | Cross-lingual information retrieval | - | Apache 2.0 | GitHub, HuggingFace |
| Fikira Dataset | Multilingual reasoning dataset | 50k examples | MIT | HuggingFace |
Hausa Other Resources
| Resource | Description | Links |
| --- | --- | --- |
| Hausa VOA Topics | News headline topic classification | GitHub |
| NaijaHate | Hate speech detection on Nigerian Twitter | HuggingFace |
| TaTa | Table-to-text generation dataset | GitHub |
| MassiveSumm | News summarization dataset | GitHub |
| AfriTeVa-Keji | T5 pre-training data | GitHub |
| Stopwords | Hausa stopword lists | GitHub, Kaggle |
| The 200 Word Project | Visual and audio vocabulary tool | Website |
| Aya Dataset | Multilingual instruction-following dataset | HuggingFace |
| Goloka | African language dataset hub | Website |
| masakhane-wazobia | Nigerian parallel corpora | GitHub |
| World Wide Dishes | Food/culture dataset | GitHub |
| PanLex | Multilingual vocabulary database | Website |
Fongbe Resources
Fongbe Text Corpora
| Resource | Description | Size | License | Links |
| --- | --- | --- | --- | --- |
| BloomLM | Bloom Library data for language modeling | Varies | CC-BY variants | HuggingFace |
| Niger-Volta LTI Fon Text | Training text for NLP, ASR, and TTS | - | GPL-3.0 | GitHub |
Fongbe Parallel Corpora & Machine Translation
| Resource | Description | Size | License | Links |
| --- | --- | --- | --- | --- |
| FFR Dataset | Fon-French parallel corpus | 117k+ sentences | - | GitHub |
| Fon-French Daily Dialogues | Daily conversation parallel data | 25,377 sentences | CC-BY-4.0 | Zenodo |
| AI4D Takwimu Lab Dataset | French-Fongbe/Ewe MT challenge data | 53k Fr-Fon, 23k Fr-Ewe | - | Zindi, Zenodo |
| MAFAND-MT | News domain parallel corpus (French-Fongbe) | - | CC-BY-NC-4.0 | GitHub |
| MMTAfrica | Multilingual MT for African languages | - | - | GitHub, Demo |
| FLORES-200 | Evaluation benchmark for MT | - | CC-BY-SA | GitHub |
| NLLB Data | No Language Left Behind parallel data | Large | Custom | GitHub |
Fongbe Speech & ASR
| Resource | Description | Size | License | Links |
| --- | --- | --- | --- | --- |
| ALFFA | Fongbe ASR corpus with Kaldi recipes | - | - | GitHub |
| pyFongbe ASR Data | Fongbe ASR training data | - | - | GitHub |
| CMU Wilderness | Multilingual speech from Bible recordings | ~20 hours | - | GitHub |
Fongbe Other Resources
| Resource | Description | Links |
| --- | --- | --- |
| AfriVEC | Word embeddings for Fon and Nobiin | GitHub |
| Wikidata Names | Name lists from Wikidata | GitHub |
| Aya Dataset | Multilingual instruction-following dataset | HuggingFace |
| PanLex | Multilingual vocabulary database | Website |
| UDHR | Universal Declaration of Human Rights translations | Website |
Multilingual Resources
These resources cover both Hausa and Fongbe alongside many other languages:
| Resource | Languages | Task | Links |
| --- | --- | --- | --- |
| NLLB | 200+ languages | Machine Translation | GitHub |
| FLORES-200 | 200 languages | MT Evaluation | GitHub |
| Aya Dataset | 65+ languages | Instruction Following | HuggingFace |
| PanLex | 5,000+ languages | Lexical Resources | Website |
| CMU Wilderness | 700+ languages | Speech | GitHub |
How to Contribute
We welcome contributions to expand this catalog! You can help by:
- Adding new resources: Submit a pull request with new datasets following the existing format
- Updating information: Fix broken links or add missing details
- Reporting issues: Open an issue if you find errors or know of unlisted resources
Contribution Guidelines
- Include resource name, description, size (if known), license, and direct links
- Verify that resources are publicly accessible
- Follow the table format used in this README
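
To check a new entry against the guidelines above before opening a pull request, a small script can help. The following is a minimal sketch, not part of this repository: the `CatalogEntry` class, the `validate` and `to_table_row` helpers, and the `example.org` URL are all hypothetical, and the row layout assumes the five-column tables are pipe-delimited Markdown with `-` marking an unknown size or license.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CatalogEntry:
    """One row of a five-column resource table (hypothetical helper type)."""
    name: str
    description: str
    links: dict[str, str]  # link label -> direct URL
    size: str = "-"        # "-" means size not known
    license: str = "-"     # "-" means license not listed


def validate(entry: CatalogEntry) -> list[str]:
    """Return a list of problems; an empty list means the entry looks complete."""
    problems = []
    if not entry.name.strip():
        problems.append("missing resource name")
    if not entry.description.strip():
        problems.append("missing description")
    if not entry.links:
        problems.append("at least one direct link is required")
    for label, url in entry.links.items():
        if not url.startswith(("http://", "https://")):
            problems.append(f"link {label!r} is not an http(s) URL")
    return problems


def to_table_row(entry: CatalogEntry) -> str:
    """Render the entry as one pipe-delimited table row (assumed format)."""
    links = ", ".join(f"[{label}]({url})" for label, url in entry.links.items())
    cells = [entry.name, entry.description, entry.size or "-", entry.license or "-", links]
    # Escape literal pipes so they cannot break the column layout.
    return "| " + " | ".join(cell.replace("|", "\\|") for cell in cells) + " |"


# Hypothetical entry used only to illustrate the helpers.
example = CatalogEntry(
    name="Example Corpus",
    description="Hypothetical Hausa news text",
    links={"GitHub": "https://example.org/repo"},
    license="CC-BY-4.0",
)
print(validate(example))
print(to_table_row(example))
```

The script does not check that a link is publicly reachable; that step still needs to be done by hand or with a separate link checker.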
Citation
If you use this resource catalog in your research, please cite our survey paper:
@inproceedings{title2026,
title={A Survey of NLP Resources for Hausa and Fongbe Languages},
author={[Authors]},
booktitle={[Conference]},
year={2026}
}
License
This catalog is released under CC-BY-4.0. Individual datasets have their own licenses as indicated in the tables above.
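
Because each dataset carries its own license, it can be useful to triage the License column before combining resources. The sketch below is a rough heuristic under stated assumptions, not an authoritative classification: the `license_note` function is hypothetical, it only inspects the license strings as they appear in the tables, and the dataset's own license text remains the source of truth.

```python
import re


def license_note(license_str: str):
    """Return a short review note for a License-column value, or None if no flag applies."""
    text = license_str.strip().upper()
    if text in {"", "-"}:
        return "no license listed"
    # Split on whitespace and hyphens: "CC-BY-NC-4.0" -> ["CC", "BY", "NC", "4.0"].
    tokens = re.split(r"[\s\-]+", text)
    if any(tok in {"CUSTOM", "VARIES", "VARIANTS"} for tok in tokens):
        return "terms vary; read the dataset's own license"
    if "NC" in tokens:
        return "non-commercial use only"
    return None


# License values copied from the tables above.
catalog_licenses = {
    "MAFAND-MT": "CC-BY-NC-4.0",
    "Fikira Dataset": "MIT",
    "NLLB Data": "Custom",
    "ALFFA": "-",
}
for name, lic in catalog_licenses.items():
    print(f"{name}: {license_note(lic) or 'no flag raised'}")
```

A `None` result only means this heuristic raised no flag; licenses such as CC-BY-SA or GPL-3.0 still carry share-alike or copyleft conditions that the function does not report.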