History tracking, provenance and states #303

In GitLab by @pablo-de-andres on May 5, 2020, 10:58

Keeping track of different versions of an object has been a recurring topic in our discussions for a long time. This issue will group the motivation, approaches and decisions regarding this topic. 

Relates to #127

# Motivation
## Wrappers
The behaviour of the wrappers could follow two main paradigms: **modify** data or **create** data.

### 1. Modify data
_Advocate:_ @pablo-de-andres 

_Example:_ Current implementation of SimLammps

_Description:_ The wrapper takes some input data, runs for a number of steps, and overwrites the input data with the latest values. The wrapper here behaves as a _process_ generating an output from an input. It has no memory of earlier states, and there is no direct way of tracing back the changes.

_History tracking:_ Has to be implemented outside of the wrapper, and controlled by the user. The user sets what and when to store a snapshot.

_Pros:_
 - No need to keep track of the changes. This simplifies the handling of uids in the wrapper level.
 - Less memory consumption. No initial or intermediate states are stored.

_Cons:_
 - Requires external history tracking implementation
 - Theoretically, the user has to control the history.
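
A minimal sketch of this paradigm, assuming a toy engine that just shifts positions. `ModifyingWrapper` and `History` are hypothetical names for illustration, not OSP-core or SimLammps classes:

```python
import copy
import uuid


class ModifyingWrapper:
    """Hypothetical wrapper that overwrites its input data in place."""

    def __init__(self, positions):
        self.uid = uuid.uuid4()  # identifies the entity and never changes
        self.positions = list(positions)

    def run(self, steps):
        # Stand-in for the engine: shift every position by `steps`.
        self.positions = [p + steps for p in self.positions]


class History:
    """Hypothetical external history: the user decides what and when to store."""

    def __init__(self):
        self.snapshots = []

    def snapshot(self, wrapper, label):
        # Deep copy, so later runs cannot alter the stored snapshot.
        self.snapshots.append(
            (label, wrapper.uid, copy.deepcopy(wrapper.positions))
        )


wrapper = ModifyingWrapper([0.0, 1.0])
history = History()

history.snapshot(wrapper, "initial")  # explicit, user-controlled
wrapper.run(steps=10)                 # overwrites the input data
wrapper.run(steps=5)                  # the state after 10 steps is lost
history.snapshot(wrapper, "final")
```

Only the two states the user asked for survive, and both are stored under the same uid.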



### 2. Create data
_Advocate:_ @urbanmatthias 

_Example:_ Current implementation of Gromacs Wrapper

_Description:_ The wrapper has some input data (connected through a `hasInput` or similar relationship) and generates output data. When the user queries the wrapper, the data will be stored under a `hasOutput` (or equivalent) relationship. If multiple runs are called sequentially, the output of one simulation becomes the input of the next one. This requires loading the full state of the engine into the output, so that it is available as the next input. The wrapper would then behave more like a full _workflow_, where every run is a process with its own input and output.

_History tracking:_ An inherent part of the wrapper. Requires keeping a connection between all the states of an entity (possibly through a relationship).

_Pros:_
 - No data is lost.
 - Avoids conflicts with multiple users working on the same data.

_Cons:_
 - It would still require an external History Tracking to encompass more complex scenarios.
 - The uid changes its meaning. Now it doesn't refer to an entity, but to an entity in a specific state. This would make tracking the first and last states of an entity after multiple runs a bit cumbersome.
 - Higher memory impact, which could be unnecessary (depending on the use case).
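
A minimal sketch of this paradigm, again with a toy engine. `State`, `Run` and `CreatingWrapper` are hypothetical names, and the `has_input` / `has_output` attributes only stand in for the `hasInput` / `hasOutput` relationships; this is not the Gromacs wrapper's actual code:

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class State:
    """Hypothetical immutable state of an entity; every state has its own uid."""
    positions: tuple
    uid: uuid.UUID = field(default_factory=uuid.uuid4)


@dataclass
class Run:
    """One process of the workflow, linking an input state to an output state."""
    has_input: State
    has_output: State


class CreatingWrapper:
    def __init__(self, initial):
        self.runs = []
        self.latest = initial

    def run(self, steps):
        # Stand-in for the engine; the input state is never modified.
        output = State(tuple(p + steps for p in self.latest.positions))
        self.runs.append(Run(has_input=self.latest, has_output=output))
        self.latest = output  # the output of this run is the next input
        return output


initial = State((0.0, 1.0))
wrapper = CreatingWrapper(initial)
first = wrapper.run(steps=10)
second = wrapper.run(steps=5)
```

No state is lost, but `initial`, `first` and `second` each carry a different uid, so finding the first and last state of "the same" entity means walking the chain of runs.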

### Bonus option: Internal engine output file storage
_Advocate:_ @ahashibon 

_Description:_ A hybrid of the first option, where the engine is asked to generate output files every step (or every fixed number of steps) and these files are stored internally. The files can be parsed on demand if the data is required.

_History tracking:_ Done through the tracking of the files generated by the engine and kept internally.

_Pros:_
 - No data is lost.
 - Extra processing is only required when the files have to be parsed.

_Cons:_
 - Requires file handling. This would also become more complex when we want to persist the data.
 - Requires parsers for all engines. However, some of them might not generate files.
 - The history tracking becomes specific to each wrapper.
 - It would still require an external History Tracking to encompass more complex scenarios.
 - Higher memory impact, which could be unnecessary (depending on the use case).
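
A minimal sketch of this option, assuming a toy engine that dumps its state as JSON. `FileBackedWrapper`, `dump_every` and the file naming are hypothetical; a real engine would write its own format and need its own parser:

```python
import json
import tempfile
from pathlib import Path


class FileBackedWrapper:
    """Hypothetical wrapper whose engine dumps a file every `dump_every` steps."""

    def __init__(self, positions, workdir, dump_every=5):
        self.positions = list(positions)
        self.workdir = Path(workdir)
        self.dump_every = dump_every
        self.step = 0
        self.files = {}  # step number -> path of the file written at that step

    def run(self, steps):
        for _ in range(steps):
            self.step += 1
            # Stand-in for one engine step; the data is overwritten in place.
            self.positions = [p + 1 for p in self.positions]
            if self.step % self.dump_every == 0:
                path = self.workdir / f"dump_{self.step}.json"
                path.write_text(json.dumps(self.positions))
                self.files[self.step] = path

    def load(self, step):
        # The parsing cost is only paid here, when the data is requested.
        return json.loads(self.files[step].read_text())


with tempfile.TemporaryDirectory() as workdir:
    wrapper = FileBackedWrapper([0, 1], workdir, dump_every=5)
    wrapper.run(12)
    dumped_steps = sorted(wrapper.files)
    state_at_5 = wrapper.load(5)
```

The history here is just the `files` mapping, which is why it ends up specific to each wrapper and tied to file handling.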

### Decision
Standardise option 1.

This requires the design and implementation of an external history tracking mechanism.

A desirable requirement following from this decision is to integrate the history tracking so that the user can set a few parameters and optionally get automatic tracking similar to option 2. This means the tracking should become part of the semantic layer or the interoperability (session class) layer.

# Implementation considerations
- History tracking could be independent of the ontology, and become an intrinsic part of OSP-core (@yoavnash).
- It should provide a way to keep multiple instances of objects with the same uid in the same place. (@pablo-de-andres)

# Implementation ideas
- Create a pseudo database wrapper that keeps a table per state, allowing objects with the same uuid in different tables.
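
A rough sketch of the table-per-state idea, using plain dictionaries as tables. `StateTables` and its methods are hypothetical names, not an existing OSP-core wrapper:

```python
import copy
import uuid


class StateTables:
    """Hypothetical pseudo database: one table (a dict) per state label."""

    def __init__(self):
        self.tables = {}  # state label -> {uid: stored copy of the object}

    def store(self, state, uid, obj):
        # Deep copy, so later in-place changes do not leak into the table.
        self.tables.setdefault(state, {})[uid] = copy.deepcopy(obj)

    def load(self, state, uid):
        return self.tables[state][uid]

    def states_of(self, uid):
        # Dicts preserve insertion order, so states come back in storage order.
        return [state for state, table in self.tables.items() if uid in table]


db = StateTables()
uid = uuid.uuid4()
atom = {"position": [0.0, 0.0, 0.0]}

db.store("initial", uid, atom)
atom["position"] = [1.0, 0.5, 0.0]  # a wrapper following option 1 overwrites the data
db.store("after_run_1", uid, atom)
```

The same uid now appears in two tables, so the uid keeps referring to the entity while the table name identifies the state.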
