    elif prev_hash == line_hash and options.dedup:
        urls1.update(fieldsdict['url1'].split(' '))
        urls2.update(fieldsdict['url2'].split(' '))
        if 'collection' in fieldsdict.keys():
            collections.add(fieldsdict['collection'])

Martin Popel pointed out that if we do it this way and the data contains, say, 10,000 pairs of `Yes -> Ja` and one `Yes -> Fuck off`, both make it into the TMX as a single entry each. When someone later wants to deduplicate on the source side of the sentence pairs and has to decide which pair to keep, having the frequency information might be quite helpful.
cirrus-scripts/bitextor-buildTMX.py
Lines 180 to 184 in 61765e3
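
To make the suggestion concrete, here is a minimal sketch of how a frequency count could be kept while collapsing duplicates. It is not the code in `bitextor-buildTMX.py`: the function names `count_duplicates` and `most_frequent_target` are hypothetical, and it assumes the input is sorted so that identical pairs are adjacent, as the `prev_hash == line_hash` comparison suggests.

```python
from itertools import groupby


def count_duplicates(pairs):
    """Collapse runs of identical (source, target) pairs into one entry
    each, keeping the number of occurrences.

    Assumes identical pairs are adjacent (e.g. the input is sorted).
    """
    return [(src, tgt, sum(1 for _ in group))
            for (src, tgt), group in groupby(pairs)]


def most_frequent_target(counted):
    """For each source sentence, keep the target seen most often.

    This is the kind of downstream decision the count makes possible.
    """
    best = {}
    for src, tgt, freq in counted:
        if src not in best or freq > best[src][1]:
            best[src] = (tgt, freq)
    return best


# Hypothetical data: three identical pairs and one rare alternative.
pairs = [("Yes", "Ja")] * 3 + [("Yes", "Nein")]
counted = count_duplicates(pairs)
print(counted)
print(most_frequent_target(counted))
```

In the actual script the count would presumably be accumulated in the `elif` branch alongside `urls1`, `urls2` and `collections`, and written out with the entry; how it would be stored in the TMX is left open here.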