We lose frequency information in deduplication

https://github.com/paracrawl/cirrus-scripts/blob/61765e3bb1da3d580bc72f48b34634cf8c79ea45/bitextor-buildTMX.py#L180-L184

Martin Popel pointed out a problem with doing it this way. Say the data contains 10,000 pairs of `Yes -> Ja` and one `Yes -> Fuck off`: each of them makes it into the TMX as a single entry. If someone later wants to deduplicate on the source side of the sentence pairs and has to decide which pair to keep, having the frequency information would be quite helpful.

	elif prev_hash == line_hash and options.dedup:
		urls1.update(fieldsdict['url1'].split(' '))
		urls2.update(fieldsdict['url2'].split(' '))
		if 'collection' in fieldsdict.keys():
			collections.add(fieldsdict['collection'])
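
One possible way to keep the frequency is to count how many input lines are merged into each entry while the URLs are being collected. The following is only a rough sketch, not the actual `bitextor-buildTMX.py` code: it assumes the input is already sorted so that identical pairs are adjacent (as the `prev_hash == line_hash` check suggests), and the record keys `src` and `tgt`, the `count` field, and both function names are hypothetical. Only `url1` and `url2` are taken from the excerpt above.

```python
from itertools import groupby


def dedup_with_counts(records):
    """Merge adjacent records with the same (src, tgt) pair and count them.

    `records` is an iterable of dicts with the hypothetical keys
    'src', 'tgt', 'url1' and 'url2'; URL fields are space-separated.
    """
    merged = []
    for (src, tgt), group in groupby(records, key=lambda r: (r["src"], r["tgt"])):
        urls1, urls2, count = set(), set(), 0
        for rec in group:
            urls1.update(rec["url1"].split(" "))
            urls2.update(rec["url2"].split(" "))
            count += 1  # the information that is currently dropped
        merged.append(
            {"src": src, "tgt": tgt, "urls1": urls1, "urls2": urls2, "count": count}
        )
    return merged


def most_frequent_target(merged):
    """For each source sentence, keep the merged entry with the highest count."""
    best = {}
    for entry in merged:
        current = best.get(entry["src"])
        if current is None or entry["count"] > current["count"]:
            best[entry["src"]] = entry
    return best
```

With a count like this stored next to each entry, a downstream user deduplicating on the source side could prefer `Yes -> Ja` over a one-off mistranslation instead of picking arbitrarily. How the count would be written to the TMX (for example as an extra property on the translation unit) is left open here.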

We lose frequency information in deduplication #30