oussamaahmia/TED-dataset

TED-dataset

The two sub-datasets, fd-TED and par-TED, will be updated on a regular basis to keep track of new calls for tender published by EU member states.

  • par-TED is a multilingual (24 languages) aligned corpus: a set of unique parallel sentences, each translated into at least 23 languages.

  • fd-TED is built from the full content of the documents extracted from the TED (Tenders Electronic Daily) platform. It can be used as a benchmark for supervised classification or for training machine learning models applied to business intelligence applications. We also provide a filtered version of fd-TED, created by ignoring administrative information.
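To make the idea of an aligned corpus concrete, here is a minimal sketch of how sentence pairs for one language pair could be pulled out of par-TED-style records. It assumes each aligned unit has already been loaded as a mapping from language code to sentence. That in-memory shape, the `extract_pairs` helper, and the sample sentences are all hypothetical illustrations; the actual file layout of the dataset is not described here.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical in-memory form of one aligned unit: language code -> sentence.
# This is an assumption for illustration, not the dataset's on-disk format.
AlignedSentence = Dict[str, str]


def extract_pairs(
    records: Iterable[AlignedSentence], src: str, tgt: str
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs for one language pair.

    Records that lack either language are skipped, since a sentence may be
    available in fewer than all 24 languages.
    """
    for record in records:
        if src in record and tgt in record:
            yield record[src], record[tgt]


# Made-up sample records, not taken from the corpus.
sample = [
    {"en": "Construction work", "fr": "Travaux de construction", "de": "Bauarbeiten"},
    {"en": "Office supplies", "de": "Bürobedarf"},
]

pairs = list(extract_pairs(sample, "en", "fr"))
```

The second sample record has no French sentence, so it contributes nothing to the English-French pairs but would still be usable for English-German.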

For further information, please refer to the article cited below.

Citation:
@inproceedings{ahmia-etal-2018-two,
    title = "Two Multilingual Corpora Extracted from the Tenders Electronic Daily for Machine Learning and Machine Translation Applications.",
    author = "Ahmia, Oussama and B{\'e}chet, Nicolas and Marteau, Pierre-Fran{\c{c}}ois",
    booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation ({LREC} 2018)",
    month = may,
    year = "2018",
    address = "Miyazaki, Japan",
    publisher = "European Language Resources Association (ELRA)",
    url = "https://www.aclweb.org/anthology/L18-1583",
}
