Wave 1 Metadata Ingest

zoeps edited this page Jun 2, 2021 · 14 revisions

During Wave 1, the process of collecting and ingesting metadata was still being developed and evolved as work progressed. As new issues were identified, the process was updated to handle the new edge cases, and in some cases scripts were written to help identify or correct those issues in the future.

The result was a fairly manual process that took place in several distinct steps, outlined below.

Receive Excel file from coordinator

Files were uploaded to ComputeCanada by the city coordinators, each accompanied by a companion MD5 checksum file. Upon being notified that a new upload was ready, the data manager first verified the checksum, requesting a new upload if it did not match.

In most cases, the checksum was produced on a Windows desktop computer using a GUI tool called ExactFile, but since the checksum itself is a simple MD5 fingerprint, other tools were acceptable so long as they conformed to the MD5 protocol. (The upload protocol document is not available online, but a Word version exists.)

Since the data manager was working on Linux servers and had no way to run Windows-based GUI apps, the checksums were validated using the standard Linux md5sum tool.
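The verification step amounts to recomputing the file's MD5 fingerprint and comparing it against the digest in the companion file. A minimal stdlib sketch (the `verify_upload` helper and the `.md5` filename convention are illustrative assumptions, not the actual tooling used):

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_upload(data_path, checksum_path):
    """Compare a file against the hex digest in its companion checksum file.

    Both ExactFile and md5sum emit lines of the form "<hexdigest>  <filename>",
    so taking the first whitespace-separated token covers either tool.
    """
    with open(checksum_path) as fh:
        expected = fh.read().split()[0].lower()
    return md5_of_file(data_path) == expected
```

This is equivalent to running `md5sum -c` against the companion file, which is what the data manager did in practice.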

Convert XLSX to CSV

Conversion of .xls and .xlsx files into CSV was done inconsistently. Three main methods were used, but which method was applied to which file was not documented.

  • Some coordinators exported CSV from Excel and uploaded that directly
  • Some were converted by the data manager by first loading them into LibreOffice and then exporting CSV from there
  • Some were converted on the command line using xlsx2csv, version 0.7.6.

Normalize CSV

As manually created and maintained documents, the tracking spreadsheets often contained inconsistencies that had to be resolved before the data could be ingested into a database table. Dates and serial numbers were not always entered in a consistent format. In some columns, some entries were in string format while others were numeric. When unusual circumstances arose with a particular entry, this was sometimes documented by adding explanatory text within the data field. Some fields were accidentally omitted, digits were transposed, and in a couple of cases, user-ids were accidentally swapped between two different users. Additionally, the column names used in the documents were not consistent between cities, so these too had to be normalized.

Unpacking all of these into rectangular data records was a largely manual process. After converting the Excel file into a CSV file, the data manager went through each CSV and executed the required changes by hand. Not every individual change was documented, but with each revision pass, a new copy of the CSV file was made, so the history of each file can still be examined.
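The column-name normalization and type coercion described above can be sketched with the stdlib csv module. The alias table and the serial-number cleanup rule here are hypothetical examples; the actual per-city column names and cleanup rules are not documented in this page:

```python
import csv
import io
import re

# Hypothetical header aliases; the real per-city column names varied.
HEADER_ALIASES = {
    "participant id": "interact_id",
    "interact id": "interact_id",
    "sd serial": "sensedoc_serial",
    "serial": "sensedoc_serial",
    "start date": "started_wearing",
    "end date": "stopped_wearing",
}


def normalize_rows(raw_csv):
    """Map per-city column names onto canonical ones and coerce obvious types."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    out = []
    for row in reader:
        rec = {}
        for key, value in row.items():
            canon = HEADER_ALIASES.get(key.strip().lower(), key.strip().lower())
            value = value.strip()
            # Serial numbers sometimes arrived as text (e.g. "SD-0042"):
            # keep only the digits.
            if canon == "sensedoc_serial":
                value = re.sub(r"\D", "", value)
            rec[canon] = value
        out.append(rec)
    return out
```

In practice most of this cleanup was done by hand, as described above; a script like this only covers the mechanical cases, not the free-text annotations or swapped user-ids.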

Validate CSV

Once the metadata was in a consistent format, it was then verified against the information contained in the telemetry files using a number of different techniques.

Inconsistent Date Formats

A date like 2017-21-06 is very clearly in YYYY-DD-MM format, but what about 2017-02-03? Is that February 3 or March 2? Unfortunately, it was discovered that spreadsheet entries had not all been made with the same date scheme, so they had to be corrected. To do this, a script called linkage_date_correct was created to compare the dates encoded into the CSV file against those actually recorded in the participant's telemetry data. If those dates differed, the script allowed the data manager to quickly correct the CSV record.
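The core of this check is that an ambiguous date string has at most two valid readings, and the telemetry pins down which one was meant. A sketch of that disambiguation logic (the function names are illustrative; the actual linkage_date_correct implementation is not shown here):

```python
from datetime import date


def candidate_dates(text):
    """All valid readings of a YYYY-?-? date string.

    "2017-21-06" can only be YYYY-DD-MM; "2017-02-03" yields two candidates.
    """
    y, a, b = (int(part) for part in text.split("-"))
    out = set()
    for month, day in ((a, b), (b, a)):  # try YYYY-MM-DD, then YYYY-DD-MM
        try:
            out.add(date(y, month, day))
        except ValueError:
            pass  # e.g. month 21 is impossible, so that reading is dropped
    return out


def resolve_date(csv_value, telemetry_date):
    """Accept the reading that matches the telemetry; None means manual review."""
    return telemetry_date if telemetry_date in candidate_dates(csv_value) else None
```

When neither reading matched the telemetry, the record could not be corrected automatically and was left for the data manager to resolve by hand.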

Incomplete and/or Spurious Metadata

In some cases, users who contributed telemetry data to the study were omitted from the tracking spreadsheet. Or conversely, some users who were registered for the study contributed no data. Additionally, some records were created by staff as a way to test the system, but needed to be filtered out before ingesting the telemetry, as they were not actually part of the study.

To identify these situations, a script called linkage_verify was created. It takes two inputs: the path to a CSV metadata file, and the path to a directory that contains the telemetry data for all the users in the study. The script reviews the records in both locations and compiles a mismatch list, identifying which user records are missing telemetry and which telemetry records have no registered users. This list is then presented to the data manager, who consults with the city coordinator to classify these gaps. In cases where legitimate users had been inadvertently omitted from the metadata list, the coordinator was able to supply the missing information, which the data manager then added to the metadata CSV. In cases where the telemetry data was missing, the coordinator's information allowed the data manager to mark those records as either "did not complete" or "test user", which allowed the ingest process to handle them accordingly.
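At its heart this is a pair of set differences between the IDs registered in the metadata and the IDs that actually have telemetry on disk. A stdlib sketch, assuming one telemetry subdirectory per participant named by interact_id (the real on-disk layout is not documented here):

```python
import csv
from pathlib import Path


def find_mismatches(metadata_csv, telemetry_dir):
    """Compare registered participants against telemetry present on disk."""
    with open(metadata_csv, newline="") as fh:
        registered = {row["interact_id"] for row in csv.DictReader(fh)}
    on_disk = {p.name for p in Path(telemetry_dir).iterdir() if p.is_dir()}
    return {
        # Registered in the spreadsheet but no telemetry found
        "missing_telemetry": sorted(registered - on_disk),
        # Telemetry found for an ID with no spreadsheet record
        "unregistered_telemetry": sorted(on_disk - registered),
    }
```

Neither difference is automatically an error: as described above, some are legitimate dropouts or test users, which is why the resulting list went to the coordinator for classification rather than being acted on directly.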

Also, in a few cases, it was discovered that either the spreadsheet or the telemetry data filenames contained incorrect information, and wayward telemetry data was reunited with the appropriate participant by correcting the erroneous entry.

Ingesting the Metadata Table

Once all the data had been cleaned, checked, and verified, actually loading the metadata CSV into the DB was straightforward. A script called load_sdlinkage reads each row of the CSV file; for rows not flagged as test users or otherwise ignorable, the record is written to the portal_dev.sensedoc_assignments table in the DB. The table has the following columns:

  • interact_id integer
  • sensedoc_serial integer
  • city_id integer
  • wave_id integer
  • started_wearing text (YYYY-MM-DD)
  • stopped_wearing text (YYYY-MM-DD)
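The load step can be sketched as a row-by-row filtered insert. Only the target column names come from the schema above; the "status" flag column is an assumption, and sqlite3 stands in here for the project's actual database so the sketch stays self-contained:

```python
import csv
import io
import sqlite3


def load_assignments(conn, csv_text):
    """Insert non-flagged metadata rows into a sensedoc_assignments table.

    Rows flagged "test user" or "did not complete" are skipped, mirroring
    how load_sdlinkage is described as ignoring flagged records.
    """
    conn.execute("""CREATE TABLE IF NOT EXISTS sensedoc_assignments (
        interact_id INTEGER, sensedoc_serial INTEGER, city_id INTEGER,
        wave_id INTEGER, started_wearing TEXT, stopped_wearing TEXT)""")
    loaded = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row.get("status") in ("test user", "did not complete"):
            continue  # flagged records are not ingested
        conn.execute(
            "INSERT INTO sensedoc_assignments VALUES (?, ?, ?, ?, ?, ?)",
            (int(row["interact_id"]), int(row["sensedoc_serial"]),
             int(row["city_id"]), int(row["wave_id"]),
             row["started_wearing"], row["stopped_wearing"]),
        )
        loaded += 1
    conn.commit()
    return loaded
```

Because all the validation happened in the earlier steps, the loader itself can assume clean, well-typed input, which is what keeps this final step simple.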
