Overview

| Language | ISO 639-3 | Speakers | Region |
| --- | --- | --- | --- |
| Hausa | hau | ~80 million | Nigeria, Niger, Ghana, Cameroon |
| Fongbe | fon | ~2 million | Benin, Togo, Nigeria |

This repository catalogs 60 datasets for Hausa and 18 datasets for Fongbe, covering NLP tasks such as machine translation, named entity recognition, sentiment analysis, and speech recognition.


Hausa Resources

Hausa Text Corpora

| Resource | Description | Size | License | Links |
| --- | --- | --- | --- | --- |
| CC-100 (Hausa) | Monolingual data from Common Crawl for language modeling | Large | - | Download |
| AfriBERTa Corpus | Multilingual corpus including Hausa from BBC news and Common Crawl | - | Apache 2.0 | GitHub, HuggingFace |
| Naijaweb Dataset | 270k+ documents from Nigerian web sources | ~230M tokens | MIT | HuggingFace |
| BloomLM | Bloom Library data for language modeling (~400 languages) | Varies | CC-BY variants | HuggingFace |
| African Storybook | Multilingual children's stories | ~6,700 stories | - | GitHub |

Hausa Parallel Corpora & Machine Translation

| Resource | Description | Size | License | Links |
| --- | --- | --- | --- | --- |
| MAFAND-MT | News domain parallel corpus (English-Hausa) | - | CC-BY-NC-4.0 | GitHub |
| FLORES-200 | Evaluation benchmark for MT (200 languages) | - | CC-BY-SA | GitHub |
| NLLB Data | Parallel data from No Language Left Behind project | Large | Custom | GitHub |
| Gamayun 5k/10k | Everyday language parallel sentences (English-Hausa) | 5k-10k sentences | - | 5k, 10k |
| Hausa Visual Genome | Multimodal English-Hausa dataset with images | 32,923 sentences | - | LINDAT |
| TICO-19 | COVID-19 translation data | - | - | Website |
| CrossSum | Cross-lingual summarization dataset | - | CC-BY-NC-SA-4.0 | GitHub, HuggingFace |
| OPUS Corpora | Various parallel corpora (KDE4, GNOME, Ubuntu, ParaCrawl, Tatoeba, Tanzil, QED) | Varies | Varies | OPUS |
| Hausa-English Code-Switched | Social media code-switched comments | - | CC-BY-4.0 | Mendeley |
| hausa-corpus | Parallel Hausa-English textual datasets | - | - | GitHub |
| TED Talks IWSLT | TED talk translations | - | CC-BY-NC-4.0 | Website |
| Microsoft Terminology | IT terminology glossary | - | - | Microsoft |

Hausa Named Entity Recognition

| Resource | Description | Size | License | Links |
| --- | --- | --- | --- | --- |
| MasakhaNER | NER dataset for 10 African languages | - | CC-BY-NC-4.0 | GitHub |
| MasakhaNER 2.0 | Extended NER dataset for 20 African languages | - | CC-BY-NC-4.0 | GitHub |
| Hausa VOA NER | NER from Voice of America Hausa news | - | - | GitHub |
| Wikidata Names | Name lists for African languages from Wikidata | ~1.9M names | - | GitHub |

Hausa Sentiment Analysis

| Resource | Description | Size | License | Links |
| --- | --- | --- | --- | --- |
| AfriSenti | Twitter sentiment dataset (14 African languages) | 110k+ tweets | Custom | HuggingFace, GitHub |
| NaijaSenti | Nigerian Twitter sentiment corpus | - | Custom | GitHub |
| NollySenti | Nigerian movie review sentiment | - | CC-BY-4.0 | GitHub |
| BRIGHTER | Emotion recognition dataset (28 languages) | - | CC-BY-4.0 | HuggingFace, GitHub |

Hausa Speech & ASR

| Resource | Description | Size | License | Links |
| --- | --- | --- | --- | --- |
| ALFFA Hausa | ASR corpus with Kaldi recipes | - | - | GitHub |
| Bible TTS | High-quality TTS dataset | 86.6 hours aligned | CC-BY-SA | OpenSLR, Website |
| CMU Wilderness | Multilingual speech dataset from Bible recordings | ~20 hours | - | GitHub |
| ML Spoken Words | Multilingual keyword spotting dataset | - | CC-BY-4.0 | Website, HuggingFace |
| Hausa Speech Corpus | Baseline ASR dataset | - | CC-BY-4.0 | Mendeley |
| Studios Tamani-Kalangou | Radio broadcast audio collection | - | - | Website |
| Synthetic Voice Data | Synthetic ASR training data | 2,500+ hours | - | arXiv, HuggingFace |
| Ajami HTR Dataset | Handwritten text recognition for Ajami manuscripts | 400 pages, 6,132 lines | - | Zenodo |

Hausa Question Answering & Reasoning

| Resource | Description | Size | License | Links |
| --- | --- | --- | --- | --- |
| BLEnD | Cultural everyday knowledge benchmark | 52.6k QA pairs | CC-BY-SA-4.0 | HuggingFace, GitHub |
| Global PIQA | Physical commonsense reasoning (100+ languages) | - | - | HuggingFace |
| AfriCLIRMatrix | Cross-lingual information retrieval | - | Apache 2.0 | GitHub, HuggingFace |
| Fikira Dataset | Multilingual reasoning dataset | 50k examples | MIT | HuggingFace |

Hausa Other Resources

| Resource | Description | Links |
| --- | --- | --- |
| Hausa VOA Topics | News headline topic classification | GitHub |
| NaijaHate | Hate speech detection on Nigerian Twitter | HuggingFace |
| TaTa | Table-to-text generation dataset | GitHub |
| MassiveSumm | News summarization dataset | GitHub |
| AfriTeVa-Keji | T5 pre-training data | GitHub |
| Stopwords | Hausa stopword lists | GitHub, Kaggle |
| The 200 Word Project | Visual and audio vocabulary tool | Website |
| Aya Dataset | Multilingual instruction-following dataset | HuggingFace |
| Goloka | African language dataset hub | Website |
| masakhane-wazobia | Nigerian parallel corpora | GitHub |
| World Wide Dishes | Food/culture dataset | GitHub |
| PanLex | Multilingual vocabulary database | Website |

Fongbe Resources

Fongbe Text Corpora

| Resource | Description | Size | License | Links |
| --- | --- | --- | --- | --- |
| BloomLM | Bloom Library data for language modeling | Varies | CC-BY variants | HuggingFace |
| Niger-Volta LTI Fon Text | Training text for NLP, ASR, and TTS | - | GPL-3.0 | GitHub |

Fongbe Parallel Corpora & Machine Translation

| Resource | Description | Size | License | Links |
| --- | --- | --- | --- | --- |
| FFR Dataset | Fon-French parallel corpus | 117k+ sentences | - | GitHub |
| Fon-French Daily Dialogues | Daily conversation parallel data | 25,377 sentences | CC-BY-4.0 | Zenodo |
| AI4D Takwimu Lab Dataset | French-Fongbe/Ewe MT challenge data | 53k Fr-Fon, 23k Fr-Ewe | - | Zindi, Zenodo |
| MAFAND-MT | News domain parallel corpus (French-Fongbe) | - | CC-BY-NC-4.0 | GitHub |
| MMTAfrica | Multilingual MT for African languages | - | - | GitHub, Demo |
| FLORES-200 | Evaluation benchmark for MT | - | CC-BY-SA | GitHub |
| NLLB Data | No Language Left Behind parallel data | Large | Custom | GitHub |

Fongbe Speech & ASR

| Resource | Description | Size | License | Links |
| --- | --- | --- | --- | --- |
| ALFFA Fongbe | ASR corpus with Kaldi recipes | - | - | GitHub |
| pyFongbe ASR Data | Fongbe ASR training data | - | - | GitHub |
| CMU Wilderness | Multilingual speech from Bible recordings | ~20 hours | - | GitHub |

Fongbe Other Resources

| Resource | Description | Links |
| --- | --- | --- |
| AfriVEC | Word embeddings for Fon and Nobiin | GitHub |
| Wikidata Names | Name lists from Wikidata | GitHub |
| Aya Dataset | Multilingual instruction-following dataset | HuggingFace |
| PanLex | Multilingual vocabulary database | Website |
| UDHR | Universal Declaration of Human Rights translations | Website |

Multilingual Resources

These resources cover both Hausa and Fongbe along with other African languages:

| Resource | Languages | Task | Links |
| --- | --- | --- | --- |
| NLLB | 200+ languages | Machine Translation | GitHub |
| FLORES-200 | 200 languages | MT Evaluation | GitHub |
| Aya Dataset | 65+ languages | Instruction Following | HuggingFace |
| PanLex | 5,000+ languages | Lexical Resources | Website |
| CMU Wilderness | 700+ languages | Speech | GitHub |

How to Contribute

We welcome contributions to expand this catalog! You can help by:

  1. Adding new resources: Submit a pull request with new datasets following the existing format
  2. Updating information: Fix broken links or add missing details
  3. Reporting issues: Open an issue if you find errors or know of unlisted resources
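
For contributors preparing a new entry, the following is a minimal sketch of how a row could be formatted and sanity-checked before opening a pull request. It assumes the five-column tables (Resource, Description, Size, License, Links) are written as pipe-delimited rows and that missing values are shown as `-`, as in the existing entries. The `CatalogEntry` class and its `to_row` method are hypothetical helpers for illustration and are not part of this repository.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical helper: one row of a five-column catalog table."""

    resource: str
    description: str
    size: str = "-"      # "-" marks an unknown size, as in existing rows
    license: str = "-"   # "-" marks an unknown license
    links: list[str] = field(default_factory=list)  # link labels, e.g. "GitHub"

    def to_row(self) -> str:
        """Return the entry as a pipe-delimited table row."""
        if not self.resource.strip() or not self.description.strip():
            raise ValueError("resource and description are required")
        cells = [
            self.resource,
            self.description,
            self.size or "-",
            self.license or "-",
            ", ".join(self.links) or "-",
        ]
        # Escape literal pipes so a cell cannot break the table layout.
        cells = [cell.strip().replace("|", "\\|") for cell in cells]
        return "| " + " | ".join(cells) + " |"


# Example with placeholder values (not a real dataset).
entry = CatalogEntry(
    resource="Example Corpus",
    description="Hypothetical Hausa news text",
    links=["GitHub", "HuggingFace"],
)
print(entry.to_row())
```

Keeping unknown fields as `-` instead of leaving them blank matches the existing rows and makes gaps easy to spot when someone later updates an entry.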

Contribution Guidelines


Citation

If you use this resource catalog in your research, please cite our survey paper:

@inproceedings{title2026,
  title={A Survey of NLP Resources for Hausa and Fongbe Languages},
  author={[Authors]},
  booktitle={[Conference]},
  year={2026}
}

License

This catalog is released under CC-BY-4.0. Individual datasets have their own licenses as indicated in the tables above.


Acknowledgments