release3_inspection

This repository contains data, code and documentation related to manual inspection of HPLT v3.

Purpose of inspection

We want a rough picture of what the cleaned version of the third data release actually contains. More specifically, for each language L we want to estimate the proportion of documents that:

are not in language L,
contain undesirable artifacts,
are fully undesirable because they are mostly unnatural,
are undesirable porn texts.
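
As an illustration of the estimate described above, here is a minimal sketch that computes per-category proportions for one language. The label names and the input layout (one set of labels per inspected document) are assumptions made for this example; they are not taken from the repository's annotation files or from merge.py.

```python
from collections import Counter

# Hypothetical label names, one per category listed above.
LABELS = ("wrong_language", "artifacts", "unnatural", "porn")


def proportions(annotations):
    """Return the share of documents carrying each label.

    annotations: list of sets of labels, one set per inspected document.
    A document may carry several labels or none at all.
    """
    total = len(annotations)
    if total == 0:
        return {label: 0.0 for label in LABELS}
    counts = Counter(label for doc in annotations for label in doc)
    return {label: counts[label] / total for label in LABELS}
```

Because a document can carry more than one label, the proportions need not sum to 1.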

Data for round 1 of HPLT 3.0 (cleaned) inspection:

samples stratified by language,
5 batches of random documents per language,
200 documents per batch,
full text for texts shorter than 1500 characters, otherwise the first 500 characters, the last 500 characters and 500 characters from the middle of the text.
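
The sampling and truncation rules above can be sketched as follows. This is not the project's actual sampling code: the function names, the seed handling, the choice to centre the middle excerpt, and the choice to draw all batches without replacement are assumptions for illustration.

```python
import random


def excerpt(text, full_limit=1500, part=500):
    """Return the parts of a document shown to an annotator.

    Texts shorter than full_limit are returned whole. Longer texts are
    reduced to three excerpts of `part` characters each, returned in
    document order: beginning, middle, end. Centring the middle excerpt
    is an assumption.
    """
    if len(text) < full_limit:
        return [text]
    mid_start = (len(text) - part) // 2
    return [text[:part], text[mid_start:mid_start + part], text[-part:]]


def sample_batches(documents, n_batches=5, batch_size=200, seed=0):
    """Draw n_batches disjoint random batches from one language's documents.

    Raises ValueError if there are fewer than n_batches * batch_size documents.
    """
    rng = random.Random(seed)
    picked = rng.sample(documents, n_batches * batch_size)
    return [picked[i * batch_size:(i + 1) * batch_size] for i in range(n_batches)]
```

Running sample_batches once per language gives the stratification by language; a language with fewer than 1000 documents would need separate handling, which this sketch leaves out.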

Inspection

Inspection is performed by volunteers, most of whom are members of the HPLT project. Following the guidelines, volunteers inspect the languages of which they are native or fluent speakers.

Results

Data:

Analysis:

Acknowledgements

This project has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10052546].

