RFC: Harvesting kit v. 2

We all know that harvesting kit is not perfect. Here are some problems that it has:
- code replication,
- weak and unstandardised error handling,
- different output (record structure) per publisher,
- no easy way to add a new publisher/harvester to the kit,
- each package is a mix of connection/package retrieval and data extraction/data manipulation.

Solution
- Convert the current harvesting packages into plugins that follow interfaces defined by the Harvester class,
- Introduce one Harvester class that will take care of the harvesting and record generation process (Unified Record Creation Process ;p):
  - Loads a plugin (elsevier, iop, etc.) which defines utility functions that Harvester can use to get metadata from an XML file or to correctly connect to the content provider,
  - Connection and download of data from the content provider/publisher can be defined as a workflow,
  - Utility functions from the plugin will return data in an internal (intermediary) data model defined by the Harvester class:
    - types,
    - sanitization,
    - cleaning of metadata,
    - all of those can be defined as part of the data model or Harvester, so they will not need to be implemented each time
  - The record creation process is managed and defined in Harvester: all records will always keep the same pattern of fields and field content regardless of the harvester (currently the same configuration needs to be provided for every package)
  - Logs and errors will be better structured
  - JSON
  - unified error handling and logs
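
As a rough illustration of the internal data model idea, here is a minimal sketch in which types, sanitization and cleaning live in one place instead of in every plugin. Everything in it is an assumption made for illustration: the `InternalRecord` class, its field names, the `clean_text` rule and the choice of required fields are hypothetical and not part of the current kit.

```python
from dataclasses import dataclass, field
import re


def clean_text(value):
    """Collapse runs of whitespace and strip the ends (hypothetical cleaning rule)."""
    return re.sub(r"\s+", " ", value).strip()


@dataclass
class InternalRecord:
    """Hypothetical intermediary record; field names are illustrative only."""
    doi: str = ""
    title: str = ""
    authors: list = field(default_factory=list)

    def __setattr__(self, name, value):
        # Sanitize every string once, here, instead of in each plugin.
        if isinstance(value, str):
            value = clean_text(value)
        elif isinstance(value, list):
            value = [clean_text(v) if isinstance(v, str) else v for v in value]
        super().__setattr__(name, value)

    def missing_fields(self):
        """Return the names of required fields that are still empty."""
        return [name for name in ("doi", "title") if not getattr(self, name)]


rec = InternalRecord()
rec.title = "  A   study of\n harvesting "
rec.authors = [" Doe,  J. "]
print(rec.title)             # A study of harvesting
print(rec.missing_fields())  # ['doi']
```

With something like this, a plugin only has to hand over raw values; the cleaning rules and the completeness check are written once and apply to every publisher.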

Other ideas:
- packages will become auto-detected plugins, as in other modules
- the CLI will stay as it looks now (`harvestingkit elsevier`), but the parameters that can be passed to scripts need to be standardized (or something)
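
To make the two ideas above a bit more concrete, here is a minimal sketch of plugins that register themselves and a CLI that offers one subcommand per detected plugin with a shared set of options. It does not use the auto-detection mechanism of the other modules; it swaps in a simple in-process registry built on `__init_subclass__`. The names `PLUGINS`, `HarvestPlugin`, `build_parser` and the `--out-dir` / `--since` options are all hypothetical.

```python
import argparse

PLUGINS = {}  # hypothetical registry: plugin name -> plugin class


class HarvestPlugin:
    """Hypothetical base class; subclasses register themselves on definition."""
    name = None

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        if cls.name:
            PLUGINS[cls.name] = cls


class ElsevierPlugin(HarvestPlugin):
    name = "elsevier"


class IopPlugin(HarvestPlugin):
    name = "iop"


def build_parser():
    """One subcommand per detected plugin, all sharing the same options."""
    parser = argparse.ArgumentParser(prog="harvestingkit")
    sub = parser.add_subparsers(dest="plugin", required=True)
    for name in sorted(PLUGINS):
        cmd = sub.add_parser(name)
        # Illustrative standardized options, not the current CLI flags.
        cmd.add_argument("--out-dir", default=".")
        cmd.add_argument("--since", default=None)
    return parser


args = build_parser().parse_args(["elsevier", "--out-dir", "contrast-out"])
print(args.plugin, args.out_dir)  # elsevier contrast-out
```

The point of the design is that adding a publisher means adding one class; the CLI picks it up without any change to the parser.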

Some pseudo-code sample:

``` python
class Harvester(object):
    def __init__(self, plugin_name, *args):
        # load_harvest_plugin is a placeholder for the plugin loader.
        self.harvest_plugin = load_harvest_plugin(plugin_name, args)

    def source_updated(self):
        return self.harvest_plugin.source_updated()

    def full_harvest(self):
        if self.source_updated():
            self.run_download_workflow()
            self.extract_records()
            self.upload()

    def extract_records(self):
        # create_empty_internal_record is a placeholder for the data model.
        rec = create_empty_internal_record()
        rec.doi = self.harvest_plugin.get_doi()
        rec.title = self.harvest_plugin.get_title()
        # ....
        # Something to make sure that the record is complete; it can also
        # be done on the fly while creating the record.
        self.validate(rec)


##########
har = Harvester("elsevier", "contrast-out")
har.full_harvest()
```
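
The pseudo-code above leaves error handling open. A minimal sketch of how unified, JSON-structured errors could look is shown below, using a stand-in plugin so that it runs on its own. The `HarvestError` class, its JSON fields, the `FakePlugin` stand-in, the placeholder DOI and the dictionary-based record are all assumptions made for illustration, not a proposal for the final API.

```python
import json


class HarvestError(Exception):
    """Hypothetical single error type raised for any plugin failure."""

    def __init__(self, plugin, step, message):
        super().__init__(message)
        self.plugin, self.step, self.message = plugin, step, message

    def to_json(self):
        return json.dumps(
            {"level": "error", "plugin": self.plugin,
             "step": self.step, "message": self.message},
            sort_keys=True,
        )


class FakePlugin:
    """Stand-in for a real plugin such as elsevier; returns canned metadata."""
    name = "fake"

    def source_updated(self):
        return True

    def get_doi(self):
        return "10.1000/example"  # placeholder value, not a real record

    def get_title(self):
        return ""  # deliberately empty to trigger validation


class Harvester:
    def __init__(self, plugin):
        self.plugin = plugin

    def extract_record(self):
        rec = {"doi": self.plugin.get_doi(), "title": self.plugin.get_title()}
        missing = sorted(k for k, v in rec.items() if not v)
        if missing:
            # Every plugin fails in the same way, with the same structure.
            raise HarvestError(self.plugin.name, "extract",
                               "missing fields: " + ", ".join(missing))
        return rec

    def full_harvest(self):
        if not self.plugin.source_updated():
            return None
        return self.extract_record()


try:
    Harvester(FakePlugin()).full_harvest()
except HarvestError as err:
    log_line = err.to_json()
    print(log_line)
```

Because validation and error reporting sit in Harvester, a log consumer sees the same keys whichever publisher the record came from.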

Let me know what you think about this idea.

