TED-dataset

The two sub-datasets, fd-TED and par-TED, will be updated on a regular basis to keep track of new calls for tender published by EU member states.

The par-TED corpus is a multilingual (24 languages) aligned corpus: a set of unique parallel sentences, each translated into at least 23 languages.
The fd-TED corpus is built from the full content of the documents extracted from the TED (Tenders Electronic Daily) platform. This dataset can be used as a benchmark for supervised classification or for training machine learning models applied to business intelligence applications. We also propose a filtered version of fd-TED created by ignoring administrative information.
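
As a rough illustration of the par-TED structure described above (unique sentences, each available in a minimum number of languages), here is a minimal Python sketch. The on-disk layout of the corpus is not described in this README, so the sketch works on a hypothetical in-memory representation: one dictionary per aligned sentence, mapping a language code to the sentence text. The function name `filter_aligned` and the `min_languages` parameter are assumptions made for this example and are not part of the dataset tooling.

```python
from typing import Dict, List

# Hypothetical representation (an assumption, not the corpus file format):
# one dict per aligned sentence, mapping a language code to its text.
AlignedSentence = Dict[str, str]


def filter_aligned(records: List[AlignedSentence],
                   min_languages: int = 23) -> List[AlignedSentence]:
    """Keep unique aligned sentences covering at least `min_languages` languages."""
    seen = set()
    kept = []
    for record in records:
        # Ignore languages whose translation is missing or blank.
        present = {lang: text.strip()
                   for lang, text in record.items()
                   if text and text.strip()}
        if len(present) < min_languages:
            continue
        # Two records are duplicates if every language/text pair is identical.
        key = tuple(sorted(present.items()))
        if key in seen:
            continue
        seen.add(key)
        kept.append(present)
    return kept


# Toy data with three languages; the threshold is lowered to 3 for the demo.
toy = [
    {"en": "Supply of office furniture",
     "fr": "Fourniture de mobilier de bureau",
     "de": "Lieferung von Büromöbeln"},
    {"en": "Supply of office furniture",
     "fr": "Fourniture de mobilier de bureau",
     "de": "Lieferung von Büromöbeln"},   # exact duplicate, dropped
    {"en": "Road works", "fr": "", "de": "Straßenarbeiten"},  # only 2 languages
]
unique_sentences = filter_aligned(toy, min_languages=3)
```

The default threshold of 23 mirrors the description above; whether the source sentence counts toward that number is not specified here, so adjust `min_languages` to match the actual data.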

For further information please refer to this article.

Citation:
@inproceedings{ahmia-etal-2018-two,
    title = "Two Multilingual Corpora Extracted from the Tenders Electronic Daily for Machine Learning and Machine Translation Applications.",
    author = "Ahmia, Oussama and B{\'e}chet, Nicolas and Marteau, Pierre-Fran{\c{c}}ois",
    booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation ({LREC} 2018)",
    month = may,
    year = "2018",
    address = "Miyazaki, Japan",
    publisher = "European Language Resources Association (ELRA)",
    url = "https://www.aclweb.org/anthology/L18-1583",
}
