
OldClippings

Jefferson Smith edited this page Jun 2, 2021 · 2 revisions

These notes were taken from an earlier draft of this site and are kept here in case any of the material is needed again.

Table of Contents

Interact IDs

  1. Initial generation scheme
  2. TrekSoft generation scheme
  3. Harmonizer integration of IDs across partners

SenseDoc Telemetry

Migration

  1. SenseDoc devices are collected from participants by regional coordinator
  2. Telemetry data is extracted by coordinator from each device and stored on local computer or institutional data server
  3. From time to time during the collection period, a batch of these local data files is uploaded to ComputeCanada over SSH
  4. Incoming files are placed in /projects/def-dfuller/interact/incoming_data/{CITYNAME}/Wave{WAVENUM}/
  5. Each batch includes the telemetry data itself, plus an MD5 checksum for each file, computed on the local machine prior to upload
  6. After the data files are received on ComputeCanada, the data manager verifies those MD5 checksums; if any problems are found, the affected files are re-uploaded
  • checksums: produced using the platform-independent ExactFile tool, although any MD5 tool will work
  • ProvLog: on ComputeCanada, provenance monitoring is done using our own ProvLog tool, in conjunction with the native crontab system
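The checksum verification in step 6 can be sketched in Python. The manifest layout assumed here (`<md5hex>  <filename>`, one file per line) is the common md5sum-style format; ExactFile's actual output may differ, and the function names are illustrative only.

```python
import hashlib
from pathlib import Path

def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_batch(batch_dir: Path, checksum_file: Path) -> list[str]:
    """Return names of files whose on-disk MD5 does not match the manifest.

    Assumes a ``<md5hex>  <filename>`` manifest line format; adjust the
    parsing if the real checksum files are laid out differently.
    """
    failures = []
    for line in checksum_file.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        target = batch_dir / name.strip()
        if not target.exists() or md5_of(target) != expected.lower():
            failures.append(name.strip())
    return failures
```

Any names returned by `verify_batch` would be the files to request for re-upload.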

Ingest Prep

  1. Once the files have been fully verified, they need to be normalized for ingest:
    1. All files are copied to /projects/def-dfuller/interact/permanent_archive/{CITYNAME}/Wave{WAVENUM}/SenseDoc
    2. This permanent_archive is then organized into a canonical file hierarchy, with data from each contributor stored in its own folder, named {IID}_{DEVICEID}
    3. Within each user-device folder, verify that the crucial file for ingest is present: SD{DEVID}fw{OSVER}_{TIMESTAMP}.sdb
  2. The normalized permanent_archive files are then added to our ProvLog system, which scans every night to ensure that all files still match the checksums they were uploaded with and have not been deleted or altered on disk during the course of working with them
    1. This is done by running: `provlog -m "Adding to ProvLog after validating upload checksum" -T DIRNAME`
    2. If any changes are detected to logged files on disk, a message is sent to the data manager, who investigates and either restores the data files from backup, or updates the ProvLog record to explain the change, thus ensuring a complete manifest of data changes
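The folder- and filename-layout check from the normalization step could look roughly like this. The regular expressions are guesses inferred from the naming conventions above ({IID}_{DEVICEID} folders and SD{DEVID}fw{OSVER}_{TIMESTAMP}.sdb files); the real IID and device-id formats may be stricter.

```python
import re
from pathlib import Path

# Hypothetical patterns inferred from the naming conventions described above.
FOLDER_RE = re.compile(r"^(?P<iid>[^_]+)_(?P<devid>[^_]+)$")
SDB_RE = re.compile(r"^SD(?P<devid>.+)fw(?P<osver>.+)_(?P<ts>.+)\.sdb$")

def check_archive(sensedoc_root: Path) -> list[str]:
    """Return a list of problems found in a permanent_archive SenseDoc tree."""
    problems = []
    for folder in sorted(p for p in sensedoc_root.iterdir() if p.is_dir()):
        m = FOLDER_RE.match(folder.name)
        if not m:
            problems.append(f"{folder.name}: not named as IID_DEVICEID")
            continue
        sdb_files = [f for f in folder.iterdir() if SDB_RE.match(f.name)]
        if not sdb_files:
            problems.append(f"{folder.name}: no SD...fw..._....sdb file found")
            continue
        # The device id embedded in the .sdb filename should match the folder name.
        for f in sdb_files:
            if SDB_RE.match(f.name).group("devid") != m.group("devid"):
                problems.append(f"{folder.name}/{f.name}: device id mismatch")
    return problems
```

An empty result from `check_archive` would indicate the tree is ready for the ProvLog registration step.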

Ingest

Actual ingest is a semi-interactive process, governed by a Jupyter notebook that walks the data manager through some preliminary data validation steps and then performs the ingest itself, loading the user linkage information and telemetry files into the relevant tables in the DB.

  1. Create a new copy of the notebook (called?) and rename it as (format?)
  2. Edit the parameter assignments in the first code block to set the wave, city, folder paths, etc.
  3. The first few blocks simply declare variables and functions that will be used lower down
  4. The next section of blocks performs a series of data validation tests:
    1. All expected files are present and named correctly
    2. The incoming account linkage data is well-formed
    3. All linked user accounts have corresponding telemetry data in the permanent_archive folder
    4. All telemetry found in the permanent_archive folder is expected, and has a corresponding user account in the linkage table
    5. Any unmatched, unexpected, or ill-formed data found at this stage must be corrected, usually through consultation with the coordinator
      1. Linkage table records for non-legitimate users (such as test accounts, coordinator accounts, etc.) must be skipped during ingest, which can be accomplished by putting the word 'ignore' in the data_disposition field for that user record (which schema.table?)
  5. The last few sections perform the actual ingest
    1. In the first pass, the raw telemetry files are loaded into a temporary DB table
    2. In the next block, that temporary table is cross-linked with the proper IID, based on the mapping from the device id found in the linkage table
    3. Finally, the cross-linked telemetry data is added to the final telemetry tables (sd_gps, sd_accel, and others?)
  6. Once the ingest has completed successfully, a few housekeeping tasks are required:
    1. Delete the temporary tables (called?)
    2. Export the Jupyter notebook as a PDF, which provides a complete record of the ingest process as it happened.
    3. If any substantive code was changed in the notebook (aside from setting parameters), clear all the output blocks, save the notebook, and commit the changes to the git repo, describing what improvements or corrections were made to the code
  7. Congratulations, you have now completed an ingest cycle.
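The cross-validation in step 4 (linked accounts vs. telemetry on disk, honoring the 'ignore' disposition) can be sketched with plain Python sets. In the real notebook the linkage table comes from the database; here the inputs are in-memory mappings, and all names are hypothetical.

```python
def cross_validate(linkage, archive_devices):
    """Compare linkage records against device folders found on disk.

    linkage: dict mapping device id -> record dict; a record whose
             data_disposition is 'ignore' is skipped (test/coordinator accounts).
    archive_devices: set of device ids found in the permanent_archive tree.
    Returns (missing_telemetry, unexpected_telemetry) as sorted lists.
    """
    expected = {dev for dev, rec in linkage.items()
                if rec.get("data_disposition") != "ignore"}
    missing = sorted(expected - archive_devices)      # linked user, no data on disk
    unexpected = sorted(archive_devices - expected)   # data on disk, no linked user
    return missing, unexpected
```

Both lists should be empty before proceeding to the actual ingest blocks; anything else goes back to the coordinator for correction.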

SenseDoc Participant Metadata

Wave-1-Metadata-Ingest

Ethica Telemetry

Ethica Participant Metadata

Ethica Survey Data

TrekSoft Participant Metadata

TrekSoft Survey Data


From the old Intro page

This wiki is organized into four major sections:

  1. Introduction, which documents the overall goals and ambitions of the INTERACT project and the context in which it is being conducted;
  2. Data Sources, which describe the types of raw data that were collected from the various data capture partners and systems;
  3. Data Sinks, which are the various servers and environments in which the data resided or through which it passed; and
  4. Protocols, which are the scripts, processes, and procedures used to move the data between the various sources and sinks.
