Efficient de-duplication method #9

A Bloom filter is overkill, but assigning duplicates by start position alone is perhaps not ideal either. A hybrid approach seems likely to offer a middle ground:

| method | mapped data | unmapped data | CPU | memory | specificity | complexity |
| --- | --- | --- | --- | --- | --- | --- |
| Bloom filter | ✓ | ✓ | +++ | +++ | ++ | +++ |
| Start position | ✓ | - | + | + | + | + |
| Positional hash | ✓ | - | ++ | ++ | +++ | ++ |

The "positional hash" can be anchored by a genomic location (perhaps a single linear genomic coordinate rather than a `chrom, pos` tuple), and the hash function can be weak and truncated to its first few significant characters. Because the hash only has to distinguish entries that share the same anchor, collisions should be rare as long as the hash function is roughly uniform.
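To make the idea concrete, here is a minimal sketch of how a positional hash key might be built. It rests on several assumptions that are not part of the proposal above: the hashed payload is the read sequence, the weak hash is CRC32 truncated to four hex characters, and the linear coordinate comes from concatenating chromosomes using made-up lengths. All names (`CHROM_LENGTHS`, `linear_coordinate`, `positional_key`, `mark_duplicates`) are hypothetical.

```python
import zlib
from itertools import accumulate

# Hypothetical chromosome lengths, for illustration only.
CHROM_LENGTHS = {"chr1": 1000, "chr2": 800, "chr3": 500}

# Offset of each chromosome in a concatenated (linear) coordinate system.
CHROM_OFFSETS = dict(
    zip(CHROM_LENGTHS, accumulate([0] + list(CHROM_LENGTHS.values())[:-1]))
)


def linear_coordinate(chrom, pos):
    """Map a (chrom, pos) pair to a single integer coordinate."""
    return CHROM_OFFSETS[chrom] + pos


def positional_key(chrom, pos, sequence, hash_chars=4):
    """Build a key from the anchor position plus a truncated weak hash.

    Assumption: the hash covers the read sequence; CRC32 stands in for
    "a weak hash" and only its first `hash_chars` hex digits are kept.
    """
    digest = format(zlib.crc32(sequence.encode("ascii")), "08x")
    return (linear_coordinate(chrom, pos), digest[:hash_chars])


def mark_duplicates(reads):
    """Return one flag per read: True if its key has been seen before."""
    seen = set()
    flags = []
    for chrom, pos, sequence in reads:
        key = positional_key(chrom, pos, sequence)
        flags.append(key in seen)
        seen.add(key)
    return flags
```

Calling `mark_duplicates` on a list of `(chrom, pos, sequence)` tuples flags every entry whose position and truncated hash match an earlier one. Entries at the same start position with different sequences are kept apart unless their truncated hashes collide, which is where the extra specificity over a start-position-only approach would come from.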

