-
Notifications
You must be signed in to change notification settings - Fork 19
Open
Description
We all know that harvesting kit is not perfect. Here are some problems that it has:
- code replication,
- week and unstandardised error handling,
- different output (record structure) per publisher,
- no easy way to add a new publisher/harvester to the kit,
- each package is a mix of connection/package retrieval and and data extraction/data manipulation.
Solution
- Convert current harvesting packages into plugins which are going to follow interfaces defined by Harvester class,
- Introduce one Harvester class that will take care of harvesting and record generation process (Unified Record Creation Process ;p):
- Loads a plugin (elsevier, iop, etc.) which has defined some utility functions that Harvester can use to get metadata from XML file or to correctly connect to the content provider,
- Connection and download of data from content provider/publisher can be defined as workflow,
- Utility functions from plugin will returndata in internal (intermediary) data model defined by Harvester class:
- types,
- sanitization,
- cleaning of metadata,
- all of those can be defined as part of data model or Harvester so it will not need to implemented each time
- Record creation process managed and defined in Harvester - all records will always keep the same pattern of fields and field content regardless harvester (now the same configuration needs to be provided for every package
- Logs and errors will be better structured
- JSON
- unified error handling and logs
Other ideas:
- pacakges will become auto-detected plugins as in other modules
- CLI will stay like it looks now
harvestingkit elsevierbut parameters that can be passed to scripts need to be standardized (or something)
Some pseudo-code sample:
har = Harvester("elsevier", "contrast-out")
har.full_harvest()
##########
class Harvester(object):
def __init__(self, plugin_name, *args):
self.harvest_plugin = load_harvest_plugin(plugin_name, args)
def self.source_updated(self):
return self.harvest_plugin.source_updated()
def self.full_harvest(self):
if self.source_updated():
self.run_download_workflow()
self.extract_records()
self.upload()
def self.extract_records(self):
rec = create_empty_internal_record()
rec.doi = self.harvester_plugin.get_doi()
rec.title = self.harvester_plugin.get_title()
....
self.validate(self.record) # something to make sure that record is complete, it can be done on the fly while creating record.Let me know what do you think about this idea.
Metadata
Metadata
Assignees
Labels
No labels