SenseDocIngest/Protocol.md at master · TeamINTERACT/SenseDocIngest

What follows is a step-by-step protocol for ingesting a city/wave of data.

MIGRATION STEPS - [ ] VERIFY checksums valid for each uploaded SDB batch dir script: exactfile-verify if problem: work with coordinator to get proper files/checksums

MERGING DISTINCT BATCHES INTO A COMMON HIERARCHY
    Usually, coordinators upload SD data in batches, which end up
    being in mulitple folder hierarchies, with a variety of 
    different topologies. But our ingest scripts require a known
    and consistent structure, so after everything has been uploaded.
    we need to go through and enforce that expected structure.

    But we do NOT want to risk damaging the data that has been uploaded.
    So we do this by first making a copy of the entire uploaded tree.

    - [ ] CREATE a new empty directory for working with SDB files
            cmd: mkdir <SDBROOT>_copy

    - [ ] COPY and MERGE the SDB file hierarchy into a duplicate tree
            for each SDBDIR in batch dir
                cp -rf <BATCHDIR/<SDBDIR> <SDBROOT>_copy/

    MERGE all SDB directories/batches into single hierarchy
        Once the SDB directories have been put into a common folder,
        it is also common to find them organized by IID, rather than
        by IID_DEVID pairs. In such cases, data from multiple devices
        will be in a common IID folder, which is not how the scripts 
        expect to find them. So we need to break them up into different
        folders.

        - [ ] Run helper script: sdb_directory_refactor on <SDBROOT>_copy

    VERIFY refactoring
        After doing all this mucking about with these files, we want to 
        verify that we haven't lost or mangled anything in the process.
        So we run a script that will compare the original batch trees to 
        the reorganized tree and report any files that are missing or that
        have been altered. This is done by comparing MD5 checksums, not
        filenames and sizes, because filenames and paths have been changed.

        - [ ] Run helper script: verify_refactor on <SDBROOT>_copy and <SDBROOT>

VERIFY participant folders are named in the form <IID>_<deviceid>
    Not all uploads will need extensive refactoring, but all of them still need 
    to be sure that the folders are named in the canonical form.

    ALTER if necessary
    - [ ] helper script: sdb_generate_fname

VERIFY all SDB files have properly formed filenames
    Similarly, the SDB files themselves need to conform to the expected naming convention.

        SD<DEVID>fw<REVNO>_DATESTAMP(_rtcN).sdb

    Some sdb files may not encode their deviceid/revno
    If the sdb_generate_fname script can't figure it out from the supplementary files, we may have a problem.

    - [ ] Run helper script: sdb_generate_fname

        for f in `find <SDBROOT>_copy -name "SDfw*"` 
        do 
            d=`dirname $f` s=`sdb_generate_fname -f $d`
            echo "$f -> $d/$s"
            # add a mv cmd here
            # EXCEPT THAT THIS merges the foo.sdb and foo_rtc1.sdb variants. So use the 'find' to locate them, but then
            change the fnames manually?
        done

VERIFY new refactored directory
    This is another good place to verify that we haven't mangled any of the datafiles.

    - [ ] Run: verify_refactor OLD_UNIFIED_DIR NEW_UNIFIED_DIR

CREATE a new, unified checksum file for <SDBROOT>_copy
    Since we know the SDBROOT passed its checksums, and the verify_refactor works by comparing file checksums between the original and refactored directories, we know that the files in SDBROOT_copy are all good. So we can now build a new checksum file for the unified directory which we'll use from now on.

    - [ ] Run helper script: exactfile-generate

NORMALIZE the linkage file
    In addition to the SD data folders, the other crucial upload from the coordinators
    (at least, until the linkage portal is operational) are the linkage files that
    provide a mapping between the userids used by our various data partners.
    The linkage file also serves to link users with specific SD devices, and provide
    start and end dates during which the devices were worn.

    Note: the linkage file needs to be in a particular format, with particular column
    names, but coordinators tend to use their own terms and language for this. So the
    first step is to create a normalized copy of the linkage file (never modify the
    original) and make your adjustments there.

    - [ ] Run linkage_verify CSVFILE ROOTDIR
        This will examine all the uploaded files and compare them to the information in
        the linkage file to be sure that we have all the data we're expecting, that 
        the data folders have actual data in them, that the linkage file has proper wear
        dates, and that there are no unexpected data folders. This step is crucial to 
        ensuring that we have all our ducks in a row before we try ingesting anything.

        The linkage_verify script produces a log that will itemize all the problems it 
        found. You'll now have to read that log and work with the coordinators to
        correct any problems. This usually results in the creation of an updated linkage
        file. 

        Repeat this step with the new linkage file, and then again, as many times as 
        necessary until the verify runs cleanly.


MOVE the restructured hierarchy into the permanent archive
    - [ ] COPY <SDBROOT>_copy hierarchy to permanent_archive
    - [ ] COPY unified checksum file to permanent_archive
    - [ ] COPY linkage file to permanent_archive

INGEST PREPARATION STEPS [Steps not yet play-tested] CONFIRM SDB FILES NORMALIZED, CHECKSUMMED, AND ARCHIVED VIC confirmed - see logs 2019-09-* MTL confirmed - see log 2019-12-17 VAN confirmed - see log 2020-01-22 SSK confirmed - see log 2020-02-05 CONFIRM TREKSOFT LINKAGE VIC present MTL present SSK present VAN present CONFIRM SENSEDOC LINKAGE VIC present MTL present SSK present - extracted from xlsx file VAN present CONFIRM SURVEY VIEWS REFRESHED VIC present MTL present SSK missing VAN missing

VERIFY linkage.csv file has column names expected by verifier
    ALTER as necessary
        - also ensure one assignment window per row
        - some coordinators track multiple windows per row by duplicating the start/end columns


VERIFY all assignments in linkage file have associated data dirs
VERIFY all sdb data dirs are associated with known participants

ALTER sdb filenames to 

INGEST linkage file into sensedoc_assignments
    - [ ] Run load_sdlinkage

INGEST SDB folders

If linkage was managed in an external file,
    
    verify linkage 
        test that all user/device assignments listed in linkage have data and that all data has a corresponding user/device record

        cmd: linkage_verify LINKAGE.CSV SDBDIR

    verify filename format 
        test that all user data files in target dir are formed correctly, and encode the serial number data referenced in the linkage file

        Also handled by the above cmd.

ingest linkage table
    psql interact_db -c "\copy foo bar blat..."

INGEST LEVEL_0 SENSEDOC TABLES [Steps not yet play-tested] run sensedoc_ingest

examine log file for reported problems

    once problems found and corrected, make a record in the data_disposition col of the linkage table (CSV?) so they won't be reported in future ingests and need to be re-investigated

    run some basic stats on the log file to examine record retention rates. (Low rates may signal a problem with the data.)
        grep -o -E "RATE=[0-9.]+" logfile | cut -d=  -f 2 | jdescribe

also run: ingest_spotcheck SDBDIR
    This will give a report of how many users are in the SDB directory but did not end up with records in the table.

    It will also give a report of any users with low record counts

    Either of these factors may indicate problems with the ingest.

if log is clean, put it in the Logs directory, named to reflect city/wave/date of ingest
if not, figure out what went wrong, fix it, and then reingest

Digest data into ToP tables Refresh materialized views of survey for city being ingested

Confirm that Kole has access to:
    level_0.* tables
    survey.* tables
    portal_dev.* tables

    Some of these tables get dropped and rebuilt during the
    ingest process and that resets all their permissions.

Technically, ToP would be considered a Digest script, as it is a transformation of existing tables, but since it needs to be run at ingest time, I'm including it here.

Run the script in:
    tableofpower/top_generation.py
    or have Kole run it

Ingest the ToP.csv file from wherever Kole put it.
(Currently in projects/def-dfuller/testing/<cityname>_<waveid>_table_of_power_<YYYY-MM-DD>.csv

    run: Digest/load_ToP fname
            Note: it's in the Digest scripts, not Ingest

Move CSV file to $PRJ/data_snapshots/<CITY)_W<WAVE>_SD_ToP_1sec_YYYYMMDD.zip

Dump level_0 tables to archive format: Be sure to dump only the records for that particular wave/city (Just in case there are other wave/cities in the table.) copy (select * from TABLENAME where city=1 and wave=1) into 'FNAME' with csv delimiter = ','; From the command line, run a command like this:

        psql interact_db -c "\copy level_0.sd_gps to 'montreal_gps_wave1_level0.csv' with (format csv, delimiter ',', header 1, quote '\"');"

    Then do the same thing for the sd_accel file.

    Then tar them both together with:
        tar cvjf montreal_wave1_level0.tar.bz2 montreal_*_wave1_level0.csv

    Compare the row counts of the two tables with the 
    counts in PSQL. 

    And if the row counts match:
        move the tar file to the permanent archive
        delete the .csv files
        drop the sd_gps and sd_accel tables

The compress saved file:
    permanent_archive/CITYNAME/WaveN/Level0_SD_GPS.csv.bz2
    permanent_archive/CITYNAME/WaveN/Level0_SD_ACCEL.csv.bz2
These can then be reloaded far more easily than re-ingesting

Drop level_0 tables or DELETE FROM SD_TABLENAME; for both tables. Best method: TRUNCATE TABLENAME; This is best because delete leaves ghosts that need to be garbage collected later. Truncate does not.

If you DO forget and use DELETE, you can then clean up
the ghost records by running REINDEX TABLE TABLENAME; afterward.

Update portal dashboard status

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FilesExpand file tree

Protocol.md

Latest commit

History

Protocol.md

File metadata and controls