Skip to content

Fingerprint matching #1

@lnicola

Description

@lnicola

Hi, I've stumbled upon your blog post and skimmed the code a little. If I'm not mistaken, you've been optimizing the comparison between two fingerprints. In practice, however, you'll be comparing on the order of N^2 pairs of fingerprints, not just two, which still grows proportionally to the square of the number of songs (10 times more songs will take 100 time longer).

So I propose another approach, in the form of a a memory-time-complexity trade-off:

  • for each song, extract the fingerprint and add it into a data structure that supports fuzzy searching
  • for each song, compute its fingerprint and look up matches in the data structure built previously

Choosing and implementing a data structure is the hard part, of course. One that might work is a Levenshtein automaton. These have the advantage of naturally supporting insertions and deletions, so you wouldn't have to insert 200 fingerprints for each song. I never tried to implement one, but this post gives a nice introduction.

One limitation, however, is that the way these are usually built, they allow insertions both at the ends and in the middle. For fingerprint comparisons you'd probably want to allow insertions and deletions only at the ends, and only changes in the middle. But you'd probably have to implement it yourself in that case. An alternative is to build a DFA that only allows substitutions, and also insert the shifted fingerprints into it.

Anyway, I don't know if you'll ever get to try it, but it might be worth keeping in mind.

EDIT: After thinking a little more about it, you'd want this to accept large edit distances (you mention a 55% match in the blog post). That would probably blow up the size of the automaton, so this approach might not work for you.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions