A Bloom filter is overkill, but position-based duplicate assignment is perhaps not ideal either. A hybrid approach seems likely to offer a middle ground:
| method | mapped data | unmapped data | CPU | memory | specificity | complexity |
| --- | --- | --- | --- | --- | --- | --- |
| Bloom filter | ✓ | ✓ | +++ | +++ | ++ | +++ |
| Start position | ✓ | - | + | + | + | + |
| Positional hash | ✓ | - | ++ | ++ | +++ | ++ |
The "positional hash" can be anchored by a genomic location (maybe an actual genomic coordinate rather than a `(chrom, pos)` tuple), and the hash function can be weak and truncated to its first few significant characters. Hash collisions will be rare as long as the hash function is uniform.
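To make the idea concrete, here is a minimal sketch under some assumptions that the description above does not settle. It assumes the hashed value is the read sequence, that the "weak" hash is CRC32, and that a single genomic coordinate is obtained by adding a per-chromosome offset to the position. The names `linear_coordinate`, `short_hash`, `mark_duplicates` and `chrom_offsets` are hypothetical and only serve the illustration.

```python
import zlib
from collections import defaultdict


def linear_coordinate(chrom, pos, chrom_offsets):
    """Collapse (chrom, pos) into one genome-wide coordinate.

    chrom_offsets is an assumed lookup: cumulative start of each chromosome.
    """
    return chrom_offsets[chrom] + pos


def short_hash(sequence, chars=4):
    """Weak hash (CRC32), truncated to its first `chars` hex characters."""
    digest = format(zlib.crc32(sequence.encode("ascii")), "08x")
    return digest[:chars]


def mark_duplicates(reads, chrom_offsets, chars=4):
    """Return one boolean per read: True if an earlier read at the same
    anchor produced the same truncated hash.

    reads is an iterable of (chrom, pos, sequence) tuples.
    """
    seen = defaultdict(set)  # anchor coordinate -> truncated hashes seen there
    flags = []
    for chrom, pos, sequence in reads:
        anchor = linear_coordinate(chrom, pos, chrom_offsets)
        digest = short_hash(sequence, chars)
        flags.append(digest in seen[anchor])
        seen[anchor].add(digest)
    return flags


if __name__ == "__main__":
    offsets = {"chr1": 0, "chr2": 1000}
    reads = [
        ("chr1", 100, "ACGT"),
        ("chr1", 100, "ACGT"),  # same anchor, same sequence
        ("chr1", 100, "ACGA"),  # same anchor, different sequence
        ("chr2", 100, "ACGT"),  # same sequence, different anchor
    ]
    print(mark_duplicates(reads, offsets))
```

Because hashes are only compared among reads sharing an anchor, each set stays small, which is why a short, cheap hash may be enough in this sketch.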