Skip to content

RFC: Harvesting kit v. 2 #140

@Dziolas

Description

@Dziolas

We all know that harvesting kit is not perfect. Here are some problems that it has:

  • code replication,
  • week and unstandardised error handling,
  • different output (record structure) per publisher,
  • no easy way to add a new publisher/harvester to the kit,
  • each package is a mix of connection/package retrieval and and data extraction/data manipulation.

Solution

  • Convert current harvesting packages into plugins which are going to follow interfaces defined by Harvester class,
  • Introduce one Harvester class that will take care of harvesting and record generation process (Unified Record Creation Process ;p):
    • Loads a plugin (elsevier, iop, etc.) which has defined some utility functions that Harvester can use to get metadata from XML file or to correctly connect to the content provider,
    • Connection and download of data from content provider/publisher can be defined as workflow,
    • Utility functions from plugin will returndata in internal (intermediary) data model defined by Harvester class:
      • types,
      • sanitization,
      • cleaning of metadata,
      • all of those can be defined as part of data model or Harvester so it will not need to implemented each time
    • Record creation process managed and defined in Harvester - all records will always keep the same pattern of fields and field content regardless harvester (now the same configuration needs to be provided for every package
    • Logs and errors will be better structured
    • JSON
    • unified error handling and logs

Other ideas:

  • pacakges will become auto-detected plugins as in other modules
  • CLI will stay like it looks now harvestingkit elsevier but parameters that can be passed to scripts need to be standardized (or something)

Some pseudo-code sample:

har = Harvester("elsevier", "contrast-out")
har.full_harvest()

##########
class Harvester(object):
    def __init__(self, plugin_name, *args):
        self.harvest_plugin = load_harvest_plugin(plugin_name, args)

    def self.source_updated(self):
        return self.harvest_plugin.source_updated()

    def self.full_harvest(self):
        if self.source_updated():
            self.run_download_workflow()
            self.extract_records()
            self.upload()

   def self.extract_records(self):
        rec = create_empty_internal_record()
        rec.doi = self.harvester_plugin.get_doi()
        rec.title = self.harvester_plugin.get_title()
        ....
        self.validate(self.record) # something to make sure that record is complete, it can be done on the fly while creating record.

Let me know what do you think about this idea.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions