The Bayelemabaga dataset is a collection of 46976 aligned,
machine-translation-ready Bambara-French lines, originating from the
Corpus Bambara de Reference by INALCO's LLACAN Lab. The dataset consists of text extracted from
264 text files, ranging from periodicals, books,
short stories, and blog posts to parts of the Bible and the Quran.
Snapshot: 46976

| Statistic | Value |
|---|---|
| Lines | 46976 |
| French tokens | 691312 |
| Bambara tokens | 660732 |
| French types | 32018 |
| Bambara types | 29382 |
| Avg. French line length | 77.6 |
| Avg. Bambara line length | 61.69 |
| Number of text sources | 264 |
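
The snapshot figures are the kind of counts that can be recomputed from one side of the parallel corpus. The sketch below is only an illustration: the card does not say how tokens were counted or in what unit line length is measured, so whitespace tokenization and character-based length are assumptions, and `corpus_stats` is a hypothetical helper name.

```python
from collections import Counter


def corpus_stats(lines):
    """Return line, token, and type counts plus mean line length.

    Assumptions (not stated in the dataset card): tokens are
    whitespace-separated, types are distinct tokens, and line
    length is measured in characters.
    """
    vocab = Counter()
    total_chars = 0
    for line in lines:
        vocab.update(line.split())  # count every whitespace token
        total_chars += len(line)
    n_lines = len(lines)
    return {
        "lines": n_lines,
        "tokens": sum(vocab.values()),
        "types": len(vocab),
        "avg_line_length": total_chars / n_lines if n_lines else 0.0,
    }


# Toy input, not taken from the dataset.
sample = ["a b a", "c d"]
stats = corpus_stats(sample)
```

Running it separately on the French and the Bambara lines would give one column of numbers per language; a different tokenizer will give different token and type counts.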
Data Splits

| Split | Share | Lines |
|---|---|---|
| Train | 80% | 37580 |
| Valid | 10% | 4698 |
| Test | 10% | 4698 |
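
The split sizes are consistent with rounding 10% of 46976 lines for the validation and test sets and giving the remainder to training. The sketch below reproduces that arithmetic; the card does not describe how the split was actually made, so the shuffling, the seed, and the helper names `split_sizes` and `split_pairs` are assumptions.

```python
import random


def split_sizes(n, valid_frac=0.10, test_frac=0.10):
    """Round the valid/test shares; train takes whatever is left."""
    n_valid = round(n * valid_frac)
    n_test = round(n * test_frac)
    return n - n_valid - n_test, n_valid, n_test


def split_pairs(pairs, seed=0):
    """Shuffle aligned pairs and cut them into train/valid/test.

    Shuffling with a fixed seed is an assumption; the released
    splits may have been produced differently.
    """
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_train, n_valid, _ = split_sizes(len(pairs))
    train = pairs[:n_train]
    valid = pairs[n_train:n_train + n_valid]
    test = pairs[n_train + n_valid:]
    return train, valid, test


sizes = split_sizes(46976)  # (train, valid, test) line counts
```

Each element passed to `split_pairs` should be a full Bambara-French pair, so that a line and its translation always land in the same split.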
Remarks
We are working on resolving some last-minute misalignment issues.
Maintenance
This dataset is intended to be actively maintained.
ʃ => (sh/shy) sound: this symbol is left in the dataset, although it is
not part of either Bambara or French orthography.
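
Since ʃ (U+0283) remains in the text, users may want to locate or normalize it before training. The sketch below is one way to do that; the replacement string `"sh"` is an assumption drawn from the sound described above, not a rule given by the dataset authors, and both helper names are hypothetical.

```python
ESH = "\u0283"  # the character ʃ


def lines_with_esh(lines):
    """Return the indices of lines that contain ʃ."""
    return [i for i, line in enumerate(lines) if ESH in line]


def replace_esh(line, replacement="sh"):
    """Replace every ʃ in a line; "sh" is an assumed default."""
    return line.replace(ESH, replacement)


# Toy strings, not drawn from the dataset.
sample = ["ʃi", "ba", "a ʃɛ"]
hits = lines_with_esh(sample)
cleaned = [replace_esh(s) for s in sample]
```

Inspecting the affected lines first is the safer option, because any automatic replacement changes the source text.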
License
Version
1.0.1
Citation
@misc{bayelemabagamldataset2022,
title={Machine Learning Dataset Development for Manding Languages},
author={
Valentin Vydrin and
Jean-Jacques Meric and
Kirill Maslinsky and
Andrij Rovenchak and
Allashera Auguste Tapo and
Sebastien Diarra and
Christopher Homan and
Marco Zampieri and
Michael Leventhal
},
howpublished = {\url{https://github.com/robotsmali-ai/datasets}},
year={2022}
}
For any questions, contact sdiarra@robotsmali.org.
This dataset is distributed under the CC-BY-SA-4.0 license.