History tracking, provenance and states #303

@pablo-de-andres

In GitLab by @pablo-de-andres on May 5, 2020, 10:58

Keeping track of different versions of an object has been a recurring topic in our discussions for a long time. This issue will group the motivation, approaches and decisions regarding this topic.

Relates to #127

Motivation

Wrappers

The behaviour of the wrappers could follow two main paradigms: modifying data or creating data.

1. Modify data

Advocate: @pablo-de-andres

Example: Current implementation of SimLammps

Description: The wrapper takes some input data, runs for a number of steps, and overwrites the input data with the latest values. The wrapper here behaves as a process generating an output from an input, but it has no memory, and there is no direct way of tracing back the changes.

History tracking: Has to be implemented outside of the wrapper, and controlled by the user. The user decides what to store in a snapshot and when to store it.

Pros:

  • No need to keep track of the changes. This simplifies the handling of uids at the wrapper level.
  • Less memory consumption. No initial or intermediate states are stored.

Cons:

  • Requires an external history tracking implementation.
  • In principle, the user is responsible for controlling the history.
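
As a rough illustration of option 1, the sketch below pairs a wrapper that overwrites its data with a snapshot store kept outside of it. `InPlaceWrapper`, `ExternalHistory` and the toy "engine" are hypothetical names invented for this example; they are not the SimLammps API.

```python
import copy


class InPlaceWrapper:
    """Hypothetical stand-in for a wrapper that overwrites its input."""

    def __init__(self, data):
        self.data = data  # uid -> property value

    def run(self, steps):
        # Toy "engine": advance every value by one per step, in place.
        for uid in self.data:
            self.data[uid] += steps


class ExternalHistory:
    """User-controlled snapshot store living outside the wrapper."""

    def __init__(self):
        self.snapshots = []

    def store(self, label, data):
        # Deep copy so later in-place changes do not alter the snapshot.
        self.snapshots.append((label, copy.deepcopy(data)))


wrapper = InPlaceWrapper({"atom-1": 0.0, "atom-2": 10.0})
history = ExternalHistory()

history.store("initial", wrapper.data)  # the user decides when to snapshot
wrapper.run(steps=5)
wrapper.run(steps=5)  # the intermediate state is not kept
history.store("after 10 steps", wrapper.data)
```

Only the states the user explicitly stored can be recovered; the state after the first run is gone, which is the trade-off listed under the cons.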

2. Create data

Advocate: @urbanmatthias

Example: Current implementation of Gromacs Wrapper

Description: The wrapper has some input data (connected through a hasInput or similar relationship) and generates output data. When the user queries the wrapper, the data will be stored under a hasOutput (or equivalent) relationship. If multiple runs are called sequentially, the output of one simulation becomes the input of the next one. This means loading the full output state of the engine into the output data, so it is available as an input. The behaviour of the wrapper would more closely resemble a full workflow, where every run is a process with its own input and output.

History tracking: An inherent part of the wrapper. Requires keeping a connection between all the states of an entity (possibly through a relationship).

Pros:

  • No data is lost.
  • Avoids conflicts when multiple users work on the same data.

Cons:

  • It would still require an external History Tracking to encompass more complex scenarios.
  • The uid changes its meaning. Now it doesn't refer to an entity, but to an entity in a specific state. This would make tracking the first and last states of an entity after multiple runs a bit cumbersome.
  • Higher memory impact, which could be unnecessary (depending on the use case).
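
Option 2 could look roughly like the sketch below, where every run leaves its input untouched and creates a new state linked to the previous one. The `State` class, its `previous` link and the toy `run` function are assumptions made for illustration; they do not reflect the Gromacs wrapper or any existing relationship in the ontology.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class State:
    """One immutable state of an entity; the uid identifies the state."""

    value: float
    previous: Optional["State"] = None  # hypothetical "previous state" link
    uid: uuid.UUID = field(default_factory=uuid.uuid4)


def run(input_state, steps):
    # Toy "engine": the input is left untouched and a new output is created.
    return State(value=input_state.value + steps, previous=input_state)


def first_state(state):
    # Recovering the initial state means walking the whole chain.
    while state.previous is not None:
        state = state.previous
    return state


initial = State(value=0.0)
after_first = run(initial, steps=5)
after_second = run(after_first, steps=5)  # output of run 1 is input of run 2
```

Each state gets its own uid, so the uid no longer identifies the entity itself, and finding the first state requires a traversal such as `first_state`.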

Bonus option: Internal engine output file storage

Advocate: @ahashibon

Description: A hybrid of the first option, where the engine is asked to generate output files every step (or every fixed number of steps) and these files are stored internally. The files could be parsed on demand if the data is required.

History tracking: Done through the tracking of the files generated by the engine and kept internally.

Pros:

  • No data is lost.
  • Extra processing is only required when the files have to be parsed.

Cons:

  • Requires file handling. This would also become more complex when we want to persist the data.
  • Requires parsers for all engines. However, some of them might not generate files.
  • The history tracking becomes specific to each wrapper.
  • It would still require an external History Tracking to encompass more complex scenarios.
  • Higher memory impact, which could be unnecessary (depending on the use case).
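
A minimal sketch of the file-based idea follows. It assumes, purely for illustration, that the engine output can be represented as one JSON file per step; real engines use their own formats and would each need a dedicated parser. `FileBackedHistory` and its methods are hypothetical names.

```python
import json
import tempfile
from pathlib import Path


class FileBackedHistory:
    """Tracks engine output files and parses them only on demand."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.files = {}  # step -> path of the output file

    def register_dump(self, step, data):
        # Stand-in for the engine writing its own output file.
        path = self.directory / f"step_{step}.json"
        path.write_text(json.dumps(data))
        self.files[step] = path

    def load(self, step):
        # The parsing cost is paid only when a state is requested.
        return json.loads(self.files[step].read_text())


with tempfile.TemporaryDirectory() as tmp:
    history = FileBackedHistory(tmp)
    positions = {"atom-1": 0.0}
    for step in range(1, 4):
        positions["atom-1"] += 1.0
        history.register_dump(step, positions)
    state_at_step_2 = history.load(2)
```

The temporary directory is removed at the end of the `with` block, which hints at the persistence problem mentioned in the cons: the history only lives as long as the files do.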

Decision

Standardise option 1.

This requires the design and implementation of an external history tracking mechanism.

A desirable requirement following from this is that the user can easily set a few parameters to enable an optional, automatic history tracking similar to option 2. This means the tracking should become a part of the semantic or the interoperability (session class) layer.
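
One way to picture this requirement is a session-level switch that takes a snapshot before every run. The sketch below is only an assumption about how such a switch might look: `TrackingSession`, `track_history` and `toy_engine` are hypothetical and not part of OSP-core.

```python
import copy


class TrackingSession:
    """Hypothetical session layer with optional automatic snapshots."""

    def __init__(self, wrapper_run, data, track_history=False):
        self._run = wrapper_run  # callable that modifies data in place
        self.data = data
        self.track_history = track_history
        self.history = []

    def run(self, steps):
        if self.track_history:
            # Keep the state as it was before this run overwrites it.
            self.history.append(copy.deepcopy(self.data))
        self._run(self.data, steps)


def toy_engine(data, steps):
    # Toy "engine" following option 1: values are overwritten in place.
    for uid in data:
        data[uid] += steps


session = TrackingSession(toy_engine, {"atom-1": 0.0}, track_history=True)
session.run(steps=5)
session.run(steps=5)
```

With `track_history=False` the session behaves exactly like option 1, so the extra memory is only spent when the user asks for it.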

Implementation considerations

  • History tracking could be independent of the ontology, and become an intrinsic part of OSP-core (@yoavnash).
  • It should provide a way to keep multiple instances of objects with the same uid in the same place. (@pablo-de-andres)

Implementation ideas

  • Create a pseudo database wrapper that keeps one table per state, allowing objects with the same uid to exist in different tables.
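
The table-per-state idea could be prototyped with an in-memory SQLite database, as in the sketch below. The `StateStore` class, the table naming scheme and the single `value` column are assumptions chosen to keep the example small; they are not an agreed design.

```python
import sqlite3
import uuid


class StateStore:
    """Hypothetical pseudo database wrapper: one table per state."""

    def __init__(self):
        self.connection = sqlite3.connect(":memory:")
        self.states = []  # state names, in the order they were stored

    def _table(self, state):
        # Derive the table name from the position, never from user input.
        return f"state_{self.states.index(state)}"

    def store(self, state, objects):
        if state in self.states:
            raise ValueError(f"state {state!r} already stored")
        self.states.append(state)
        table = self._table(state)
        self.connection.execute(
            f"CREATE TABLE {table} (uid TEXT PRIMARY KEY, value REAL)"
        )
        self.connection.executemany(
            f"INSERT INTO {table} VALUES (?, ?)",
            [(str(uid), value) for uid, value in objects.items()],
        )

    def load(self, state, uid):
        row = self.connection.execute(
            f"SELECT value FROM {self._table(state)} WHERE uid = ?",
            (str(uid),),
        ).fetchone()
        return None if row is None else row[0]


atom = uuid.uuid4()
store = StateStore()
store.store("initial", {atom: 0.0})
store.store("after run", {atom: 10.0})  # same uid, different table
```

Because the uid is only unique within a table, the same entity keeps a single uid across all its states, which avoids the change of meaning described for option 2.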

Labels: 🏗️ software architecture, 🌱 new feature, 💬 discussion