Topic 3: Explore tokenizing the recordedBy #28

The current algorithm does not accommodate variation in `recordedBy` that includes multiple collectors.
For example, `recordedBy` is not treated as overlapping between a record with `recordedBy=Tim Robertson|Nicky Nicolson` and another with `recordedBy=Tim Robertson`.

@nickynicolson has previous work that parses `recordedBy` into tokens, accommodating the variety of delimiters in use (`,`, `|`, etc.). It is written in Python, so it is not easily portable to Java.
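To make the idea concrete, here is a minimal sketch of delimiter-based tokenizing. It is not @nickynicolson's code; the function name `tokenize_recorded_by`, the delimiter set, and the normalisation (lowercasing, whitespace collapsing) are all assumptions for illustration. Commas are deliberately not treated as delimiters in this sketch, because a value such as `Nicolson, Nicky` would otherwise be split into two names.

```python
import re

# Assumed delimiter set: pipe, semicolon, ampersand and the word "and".
# Commas are left out on purpose ("Nicolson, Nicky" is ambiguous).
_DELIMITERS = re.compile(r"\s*(?:[|;&]|\band\b)\s*", flags=re.IGNORECASE)


def tokenize_recorded_by(recorded_by):
    """Split a raw recordedBy value into a list of normalised names."""
    if not recorded_by:
        return []
    names = []
    seen = set()
    for part in _DELIMITERS.split(recorded_by):
        # Collapse internal whitespace and lowercase so that trivial
        # formatting differences do not prevent a match.
        name = " ".join(part.split()).lower()
        if name and name not in seen:
            seen.add(name)
            names.append(name)
    return names


tokens_a = tokenize_recorded_by("Tim Robertson|Nicky Nicolson")
tokens_b = tokenize_recorded_by("Tim Robertson")
shared = set(tokens_a) & set(tokens_b)
```

With this sketch, the two example records share the token `tim robertson`, so they would be seen as overlapping even though the raw strings differ.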

To determine whether this approach is worth exploring, we could create a new table that tokenizes the `recordedBy` string into an array of names, and then add a SQL JOIN to create a new occurrence table containing this field (e.g. `tokenizedRecordedBy`). The clustering could then be modified to use this field in both the blocking and the compare stages, and a report generated on the impact.
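As a rough illustration of how a tokenized field could feed both stages, the sketch below blocks on individual collector tokens and then compares token sets. It is not the existing clustering code: the function names `candidate_pairs` and `collectors_overlap`, the record ids, and the in-memory dictionary standing in for the occurrence table are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records):
    """Blocking stage: pair up records that share at least one collector token.

    `records` maps a record id to its (already tokenized) list of names.
    """
    blocks = defaultdict(set)
    for record_id, tokens in records.items():
        for token in tokens:
            blocks[token].add(record_id)

    pairs = set()
    for ids in blocks.values():
        # Sort so each pair is emitted once, in a stable order.
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


def collectors_overlap(tokens_a, tokens_b):
    """Compare stage: true if the two records share any collector name."""
    return bool(set(tokens_a) & set(tokens_b))


# Hypothetical records, keyed by a made-up occurrence id.
records = {
    "occ1": ["tim robertson", "nicky nicolson"],
    "occ2": ["tim robertson"],
    "occ3": ["another collector"],
}
pairs = candidate_pairs(records)
```

Here `occ1` and `occ2` end up in the same block through the shared token `tim robertson`, while `occ3` is not paired with either. Counting how many such pairs appear that the whole-string comparison misses would be one way to produce the impact report.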

If this identifies useful links, we could then explore how best to incorporate it into the clustering.
