Feature: duplicate detection algorithm #145
Conversation
Thanks very much for taking the time to provide patches. I presume you're not hitting a bottleneck from md5+sha1, since those combined would be less than the I/O bottleneck, especially on spinning rust. The shortcut of only checking the first 1MiB of each file could of course save a lot.
Right, md5sum_approx is only used to quickly exclude potential duplicates. The current sieve order is:

hard_links -> file_size -> md5(512) -> md5(all) -> sha1(all)

You're proposing:

hard_links -> file_size -> md5(1MiB)

BTW, if there are many hardlinked files but you're sure they're always in disparate groups, you could enable merge_early in the findup script to improve sieving to a single member of each hardlink group.
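For illustration only, here is a minimal Python sketch of that sieving order (fslint's findup is actually a shell script; `digest`, `sieve`, and `find_dupes` are hypothetical names, and the hard_links stage is omitted for brevity):

```python
import hashlib
import os
from collections import defaultdict

def digest(path, algo, limit=None):
    """Hash a file with `algo`; if `limit` is set, hash only the first `limit` bytes."""
    h = hashlib.new(algo)
    remaining = limit
    with open(path, "rb") as f:
        while True:
            size = 65536 if remaining is None else min(65536, remaining)
            if size <= 0:
                break
            chunk = f.read(size)
            if not chunk:
                break
            h.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
    return h.hexdigest()

def sieve(paths, key):
    """Keep only files whose key collides with at least one other file."""
    groups = defaultdict(list)
    for p in paths:
        groups[key(p)].append(p)
    return [p for group in groups.values() if len(group) > 1 for p in group]

def find_dupes(paths):
    # hard_links stage (grouping by st_dev/st_ino) omitted for brevity
    paths = sieve(paths, lambda p: os.stat(p).st_size)     # file_size
    paths = sieve(paths, lambda p: digest(p, "md5", 512))  # md5 of first 512 bytes
    paths = sieve(paths, lambda p: digest(p, "md5"))       # md5 of whole file
    return sieve(paths, lambda p: digest(p, "sha1"))       # sha1 of whole file
```

Each stage only ever hashes files that survived the cheaper test before it, which is why a cheap early exclusion like md5sum_approx pays off.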
Yes, that is my proposal in this pull request: to add 2 modes.
The original fslint behaviour would be preserved as the default: after the changes from this PR, the md5+sha1 pass is still the default mode of duplicate verification.
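A rough sketch of the two verification passes being contrasted, assuming a 1 MiB cutoff for the unsafe mode (function names are hypothetical, not the PR's code):

```python
import hashlib

ONE_MIB = 1024 * 1024

def full_md5_sha1(path):
    """Default mode: md5 and sha1 over the whole file, computed in one read pass."""
    md5, sha1 = hashlib.md5(), hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()

def partial_md5(path):
    """Unsafe mode: md5 of only the first 1 MiB. Files identical in their
    first MiB but differing later would be misreported as duplicates."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        md5.update(f.read(ONE_MIB))
    return md5.hexdigest()
```

The trade-off is I/O: the default mode reads every candidate file in full, while the partial pass reads at most 1 MiB per file, at the cost of possible false positives.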

Resolves #141
This change introduces a "Duplicate detection" radio control in which the user may choose a suitable algorithm, each option with a warning tooltip (screenshots omitted):

- fast test
- default md5&sha1
- unsafe md5 of first 1M
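For context, a minimal GTK 3 sketch of such a radio group with warning tooltips (the widget names and tooltip texts here are assumptions for illustration, not the PR's actual code):

```python
import gi
gi.require_version("Gtk", "3.0")
from gi.repository import Gtk

box = Gtk.Box(orientation=Gtk.Orientation.VERTICAL, spacing=4)

# One radio button per algorithm; each carries a warning tooltip.
fast = Gtk.RadioButton.new_with_label_from_widget(None, "fast test")
fast.set_tooltip_text("Quick exclusion pass only; read the warning before relying on it.")

default = Gtk.RadioButton.new_with_label_from_widget(fast, "default md5&sha1")
default.set_tooltip_text("Full md5 + sha1 pass over file contents (safest).")
default.set_active(True)  # original fslint behaviour stays the default

unsafe = Gtk.RadioButton.new_with_label_from_widget(fast, "unsafe md5 of first 1M")
unsafe.set_tooltip_text("Hashes only the first 1 MiB; files that differ "
                        "beyond that point are misreported as duplicates.")

for button in (fast, default, unsafe):
    box.pack_start(button, False, False, 0)

win = Gtk.Window(title="Duplicate detection")
win.add(box)
win.connect("destroy", Gtk.main_quit)
win.show_all()
Gtk.main()
```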
