BAYƐLƐMABAGA: Parallel French - Bambara Dataset for Machine Learning

HuggingFace: Bayelemabaga


The Bayelemabaga dataset is a collection of 46976 aligned, machine-translation-ready Bambara-French lines originating from the Corpus Bambara de Reference by INALCO's LLACAN Lab. The dataset consists of text extracted from 264 text files, ranging from periodicals, books, short stories, and blog posts to parts of the Bible and the Quran.

Snapshot: 46976

Lines 46976
French Tokens 691312
Bambara Tokens 660732
French Types 32018
Bambara Types 29382
Avg. Fr line length 77.6
Avg. Bam line length 61.69
Number of text sources 264
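
As a rough illustration of how statistics like these can be derived, here is a minimal sketch. It assumes whitespace tokenization and measures line length in characters; the card does not state which tokenizer or length unit was actually used, so the numbers above may have been computed differently. The function name `corpus_stats` and the toy lines are hypothetical.

```python
from collections import Counter


def corpus_stats(lines):
    """Compute line, token, and type counts plus mean line length.

    Assumptions (not confirmed by the dataset card): tokens are
    whitespace-separated, and line length is counted in characters.
    """
    token_counts = Counter()
    total_chars = 0
    for line in lines:
        token_counts.update(line.split())
        total_chars += len(line)
    n_lines = len(lines)
    return {
        "lines": n_lines,
        "tokens": sum(token_counts.values()),  # running words
        "types": len(token_counts),            # distinct words
        "avg_line_length": total_chars / n_lines if n_lines else 0.0,
    }


# Toy input for illustration only; not taken from the dataset.
toy_french = ["bonjour le monde", "le monde est grand"]
stats = corpus_stats(toy_french)
print(stats)
```

Running the same function over each side of the parallel corpus would give one row of figures per language.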

Data Splits

Train 80% 37580
Valid 10% 4698
Test 10% 4698
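
The split sizes above are consistent with an 80/10/10 partition of 46976 lines. The sketch below shows one way to arrive at the same counts; it is an assumption about the arithmetic, not the authors' documented procedure, and `split_sizes` is a hypothetical helper.

```python
def split_sizes(n_lines, train_percent=80):
    """Return (train, valid, test) sizes for a train/valid/test partition.

    The train share is rounded down; the remainder is divided evenly
    between validation and test, with any odd line going to test.
    """
    train = n_lines * train_percent // 100
    remainder = n_lines - train
    valid = remainder // 2
    test = remainder - valid
    return train, valid, test


sizes = split_sizes(46976)
print(sizes)  # (37580, 4698, 4698)
```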





To note:




@misc{bayelemabagamldataset2022,
  title={Machine Learning Dataset Development for Manding Languages},
  author={Valentin Vydrin and Jean-Jacques Meric and Kirill Maslinsky and Andrij Rovenchak and Allashera Auguste Tapo and Sebastien Diarra and Christopher Homan and Marco Zampieri and Michael Leventhal},
  howpublished={\url{}},
  year={2022}
}
For any questions, contact

This dataset is distributed under the CC-BY-SA-4.0 license.