Efficient de-duplication method #9

@mdshw5

A Bloom filter is overkill, but position-based duplicate assignment is perhaps not ideal either. A hybrid approach seems likely to offer a middle ground:

method mapped data unmapped data CPU memory specificity complexity
Bloom filter +++ +++ ++ +++
Start position - + + + +
Positional hash - ++ ++ +++ ++

The "positional hash" can be anchored by a genomic location (perhaps a single linear genomic coordinate rather than a (chrom, pos) tuple), and the hash function can be weak and truncated to its first few significant characters. Because the hash only has to distinguish reads that share the same anchor, collisions should be rare as long as the hash function is uniform.
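
To make the idea concrete, here is a rough sketch of what a positional hash could look like. Everything in it is an assumption for illustration, not an existing implementation: the function names (`chrom_offsets`, `positional_key`, `mark_duplicates`) are hypothetical, the hashed payload is assumed to be the read sequence, CRC32 stands in for "a weak hash", and reads are assumed to arrive as `(chrom, pos, seq)` tuples.

```python
import zlib
from itertools import accumulate


def chrom_offsets(chrom_lengths):
    """Map each chromosome to its start in one linear coordinate space.

    chrom_lengths: dict of name -> length, in a fixed (insertion) order.
    """
    names = list(chrom_lengths)
    ends = list(accumulate(chrom_lengths[n] for n in names))
    starts = [0] + ends[:-1]
    return dict(zip(names, starts))


def positional_key(offsets, chrom, pos, seq, n_chars=4):
    """Build the (anchor, truncated hash) key for one read.

    The anchor is a single linear genomic coordinate; the hash is a
    CRC32 of the sequence, truncated to its first n_chars hex digits.
    """
    anchor = offsets[chrom] + pos
    digest = format(zlib.crc32(seq.encode("ascii")), "08x")
    return anchor, digest[:n_chars]


def mark_duplicates(reads, offsets, n_chars=4):
    """Return one bool per read: True if its key was already seen."""
    seen = set()
    flags = []
    for chrom, pos, seq in reads:
        key = positional_key(offsets, chrom, pos, seq, n_chars)
        flags.append(key in seen)
        seen.add(key)
    return flags


if __name__ == "__main__":
    offsets = chrom_offsets({"chr1": 1000, "chr2": 500})
    reads = [
        ("chr1", 100, "ACGT"),
        ("chr1", 100, "ACGT"),  # same anchor, same sequence
        ("chr2", 100, "ACGT"),  # same sequence, different anchor
    ]
    print(mark_duplicates(reads, offsets))
```

In this sketch `n_chars` is the knob that trades memory for specificity: fewer hex digits give smaller keys but a higher chance that two different reads at the same anchor are called duplicates.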
