The current algorithm does not accommodate `recordedBy` values that list multiple collectors.
For example, `recordedBy` is not considered overlapping between a record containing `recordedBy=Tim Robertson|Nicky Nicolson` and another containing `recordedBy=Tim Robertson`.
@nickynicolson has previous work that attempts to parse `recordedBy` into tokens, accommodating the variety of delimiters in use (`,`, `|`, etc.). It is written in Python, so it is not easily portable to Java.
To determine whether this approach is worth exploring, we could create a new table that tokenises the `recordedBy` string into an array of names, then use a SQL JOIN to create a new occurrence table containing this field (e.g. `tokenizedRecordedBy`). The clustering could then be modified to use this field in both the blocking and the compare stages, and a report of the impact generated.
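As a rough illustration of the tokenisation idea, here is a minimal Python sketch. It is not @nickynicolson's parser and not part of the existing clustering code. The function names `tokenize_recorded_by` and `collectors_overlap` are hypothetical, and the delimiter set (`|`, `;`, `&`, and the word "and") is an assumption made for the example. Commas are deliberately left alone in this sketch because they are ambiguous in names written as "Robertson, Tim"; a real parser would need to handle them.

```python
import re

# Assumed delimiter set for this sketch: pipe, semicolon, ampersand and the
# word "and". Commas are NOT split on, since "Robertson, Tim" would break.
_DELIMITERS = re.compile(r"\s*(?:[|;&]|\band\b)\s*", flags=re.IGNORECASE)


def tokenize_recorded_by(recorded_by):
    """Split a raw recordedBy string into a list of normalised names."""
    if not recorded_by:
        return []
    tokens = []
    for part in _DELIMITERS.split(recorded_by):
        # Collapse internal whitespace and lower-case for comparison.
        name = " ".join(part.split()).lower()
        if name and name not in tokens:
            tokens.append(name)
    return tokens


def collectors_overlap(a, b):
    """Return True when two recordedBy strings share at least one token."""
    return bool(set(tokenize_recorded_by(a)) & set(tokenize_recorded_by(b)))
```

With this sketch, `collectors_overlap("Tim Robertson|Nicky Nicolson", "Tim Robertson")` evaluates to `True`, which is the kind of link the proposed `tokenizedRecordedBy` field is meant to surface. Exact matching on lower-cased tokens is a simplification; variations such as initials ("T. Robertson") would still not match.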
If this identifies useful links, we could then explore how best to incorporate it into the clustering.