Extended Attributes at variable level: processing_level, comment, creator_name, project, date_modified, date_metadata_modified #40

@DocOtak

netCDF allows for much more information than exchange files can hold, and with the CCHDO documentation metadata extraction project under way, we will eventually need a place to put that metadata. For a while now, I have wanted to store the information contained in the "Bob Headers" in a more structured way. The following ACDD attributes, when pushed down from the global to the variable level, should enable the creation of "Bob Headers": processing_level, comment, creator_name, project, date_modified, date_metadata_modified. Each of these is examined further below:

  • processing_level
    In ACDD the processing level is a freeform string. We should use it to indicate the following statuses, which very roughly correspond to the satellite community's L0 through L4 processing levels:
    • collected - water was taken but the data have not been received
    • raw - used for CTD but not discrete
    • preliminary - data are in the file but may not have had final calibration applied
    • final - data for which no further updates are expected
    • product - we probably won't use this, but included since that is what L4 tends to be
      We should look for an existing controlled vocabulary for these.
  • comment
    Free-text notes, usually very short for each parameter. This is the "notes" part of the Bob Headers.
  • creator_name
    This is the PI for the parameter in question; we should use an array of strings for multiple PIs in our at-rest data files. This is the "who" part of the Bob Headers. There is also a creator_url attribute in which we might consider storing ORCID iDs.
  • project
    We need a way to tie together multiple variables with the same PI/status, e.g. nutrients are usually 3 to 5 variables. In the ACDD docs, a program (GO-SHIP) is made up of multiple projects (Total Carbon, pH, Nutrients, CTD, etc.). Variables that have the same project value would be grouped into the "includes" list in the Bob Headers; their comment and creator_name values would need to be the same to avoid ambiguity.
    There is probably no single controlled vocabulary for these project names; they would also likely benefit from some coordination with GO-SHIP.
  • date_modified
    If the data itself is changed, this would be updated to the date the change was made in the data file. The merge_fq accessor already updates this.
  • date_metadata_modified
    If only the metadata were modified, this attribute would be updated to the date the change was made. The merge_fq accessor already updates this if the print format is different.
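
To make the grouping idea concrete, here is a minimal sketch of how variable-level attributes might be turned into Bob-Header-like entries. It uses plain dicts rather than a real netCDF or xarray object, and everything in it is hypothetical: the variable names, attribute values, the `build_headers` helper, and the choice to also require a shared processing_level within a project. The status vocabulary is simply the list proposed above, not an established standard.

```python
# Assumed vocabulary, copied from the list above; not an established standard.
PROCESSING_LEVELS = ("collected", "raw", "preliminary", "final", "product")

# Hypothetical attributes shared by all nutrient variables.
nutrient_attrs = {
    "processing_level": "final",
    "comment": "Calibrated against reference material",
    "creator_name": ["A. Example"],
    "project": "Nutrients",
}

# Hypothetical per-variable attributes; names and values are made up.
variable_attrs = {
    "nitrate": dict(nutrient_attrs),
    "phosphate": dict(nutrient_attrs),
    "silicate": dict(nutrient_attrs),
    "ph_total": {
        "processing_level": "preliminary",
        "comment": "Awaiting final calibration",
        "creator_name": ["B. Example", "C. Example"],
        "project": "pH",
    },
}


def build_headers(variable_attrs):
    """Group variables by project into Bob-Header-like entries."""
    headers = {}
    for name, attrs in variable_attrs.items():
        if attrs["processing_level"] not in PROCESSING_LEVELS:
            raise ValueError(f"{name}: unknown processing_level")
        entry = {
            "who": list(attrs["creator_name"]),
            "status": attrs["processing_level"],
            "notes": attrs["comment"],
        }
        project = attrs["project"]
        existing = headers.setdefault(project, {**entry, "includes": []})
        # Variables sharing a project must agree, otherwise the header is ambiguous.
        if any(existing[key] != value for key, value in entry.items()):
            raise ValueError(f"{name}: conflicts with other {project} variables")
        existing["includes"].append(name)
    return headers


headers = build_headers(variable_attrs)
```

Raising on a mismatch is one possible policy; a real implementation might instead report the conflicting variables so they can be fixed by hand.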

The only non-standard ACDD usages in the above are placing the attributes at the variable level rather than the global level, and the possible use of arrays of strings. We could define combining rules that put all this information into global attributes that fully conform to ACDD, but this would likely be one-way (update the globals from the variables, not the other way around). For example, the global date_modified would be set to the most recent date_modified found among the variables that have one.
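
A rough sketch of that one-way combining rule, again on plain dicts. The `roll_up_latest` helper and the sample values are hypothetical, and applying the same rule to date_metadata_modified is my assumption rather than something decided here. It also assumes every timestamp is an ISO 8601 string with an explicit UTC offset, since Python cannot compare offset-aware and naive datetimes.

```python
from datetime import datetime


def roll_up_latest(variable_attrs, attr_name):
    """Return the most recent ISO 8601 value of attr_name across variables, or None."""
    dates = [
        datetime.fromisoformat(attrs[attr_name])
        for attrs in variable_attrs.values()
        if attr_name in attrs  # variables without the attribute are skipped
    ]
    return max(dates).isoformat() if dates else None


# Hypothetical variable-level attributes.
variable_attrs = {
    "oxygen": {"date_modified": "2021-03-04T00:00:00+00:00"},
    "nitrate": {
        "date_modified": "2022-01-15T00:00:00+00:00",
        "date_metadata_modified": "2022-02-01T00:00:00+00:00",
    },
    "pressure": {},
}

# One-way update: globals are derived from variables, never the reverse.
global_attrs = {}
for attr_name in ("date_modified", "date_metadata_modified"):
    latest = roll_up_latest(variable_attrs, attr_name)
    if latest is not None:
        global_attrs[attr_name] = latest
```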

Things this might make possible:

  • A list of files updated since (or even between or before) a certain date could be built at the per-variable level by examining the date_modified attribute. We could even exclude simple metadata updates that didn't change the values used in science.
  • Find all the preliminary data, or exclude preliminary data from a result set.
  • Know who has not turned in their data yet by examining the processing_level attribute for "collected" and the creator_name attribute. This can also be done for bottle data with flag 1.
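
As a sketch of what those queries might look like once the attributes exist, assuming the variable attributes of one file have been read into a plain dict. The helper names and sample values are hypothetical, and timestamps are again assumed to be ISO 8601 strings with an explicit UTC offset.

```python
from datetime import datetime, timezone

# Hypothetical variable-level attributes for a single file.
variable_attrs = {
    "oxygen": {
        "processing_level": "final",
        "creator_name": ["A. Example"],
        "date_modified": "2022-01-15T00:00:00+00:00",
    },
    "nitrate": {
        "processing_level": "preliminary",
        "creator_name": ["B. Example"],
        "date_modified": "2021-03-04T00:00:00+00:00",
    },
    "cfc_11": {
        "processing_level": "collected",
        "creator_name": ["C. Example"],
    },
}


def modified_since(variable_attrs, since):
    """Variables whose data (not just metadata) changed on or after `since`."""
    return [
        name
        for name, attrs in variable_attrs.items()
        if "date_modified" in attrs
        and datetime.fromisoformat(attrs["date_modified"]) >= since
    ]


def without_preliminary(variable_attrs):
    """Variables whose processing_level is anything other than preliminary."""
    return [
        name
        for name, attrs in variable_attrs.items()
        if attrs.get("processing_level") != "preliminary"
    ]


def outstanding(variable_attrs):
    """Map each creator to the variables still marked as collected."""
    waiting = {}
    for name, attrs in variable_attrs.items():
        if attrs.get("processing_level") == "collected":
            for creator in attrs.get("creator_name", []):
                waiting.setdefault(creator, []).append(name)
    return waiting


recent = modified_since(variable_attrs, datetime(2022, 1, 1, tzinfo=timezone.utc))
```

Because only date_modified is consulted, a change that touched date_metadata_modified alone would not show up in `recent`.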
