Data Management

Katie Mason edited this page Mar 29, 2021 · 7 revisions

It is important to keep track of the data we currently store on Tufts. We have a storage quota, and some of the file types we work with are large. On the Tufts cluster login node, running showquota will tell you how much space the group currently has.
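
To see which folders account for the space, the standard du utility can complement showquota. The following is a minimal sketch, not part of the group's tooling: the function name largest_dirs and its arguments are hypothetical, and it assumes you have read access to the folders you point it at.

```shell
# List the immediate subdirectories of a directory by size, largest first.
# Sizes are in kilobytes, as reported by `du -sk`.
# Usage: largest_dirs <parent directory> [how many to show, default 10]
largest_dirs() {
    parent="${1:-.}"
    # -s: one total per argument; -k: report sizes in 1K blocks
    du -sk "$parent"/*/ 2>/dev/null | sort -rn | head -n "${2:-10}"
}
```

For example, largest_dirs /cluster/tufts/wongjiradlab/larbys/data/mcc9 5 would list the five largest sample folders under the path used later on this page.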

We also want to keep track of which files are already available on Tufts and whether they are up to date. Generally, the files you transfer over will fall into one of three categories.

  1. We only need the final merged file(s). All individual files can be deleted.

    • Many specialty samples, such as detvar samples with no strange behavior, fall into this category.
  2. We want to keep the individual files in case of debugging.

    • Move samples to this category if you see strange features in your output files.
  3. We are actively working on, or planning to work on, the individual files for code development.

    • This category contains our core samples (e.g., real data, bnb_overlay samples, extbnb).

    • These should not have to be updated often, but when they are, it is important to remove the old samples to avoid confusion.

When you first transfer a sample, you should already have an idea of whether it falls into category 1/2 or category 3. For category 1/2:

  • Once the files have finished transferring and you have finished merging and calculating the POT, make a new folder with a very obvious name. I use:

    mkdir deleteme
    
  • Move everything you don't think you need to keep into this folder (e.g., copies of scripts, your transfer container, pathlists, and the whole data/data_stripped folder).

  • Set an alarm/reminder for yourself to come back here in ~2 weeks.

  • In the meantime, run any checks you feel are necessary to validate the samples. (A note from Katie: for my pi0 analysis, I plot distributions of variables such as reco shower energy and compare their shapes with my expectations.)

  • After two weeks, go back to the folder. If your checks found strange behavior, move the individual data files back out of the deleteme folder to keep for now. It's better to be safe than sorry here. Two weeks is an estimate of how long checks might take; take longer if needed.
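
The staging step above can be sketched as a small shell function. This is only an illustration: the function name stage_for_deletion and the example file names in the comments are hypothetical; only the deleteme folder name comes from this page. It is meant to be run from inside the sample folder.

```shell
# Move the given files/folders into ./deleteme, skipping any that do not exist.
# Nothing is deleted here; items can be moved back out later if checks fail.
stage_for_deletion() {
    mkdir -p deleteme || return 1
    for item in "$@"; do
        if [ -e "$item" ]; then
            mv -- "$item" deleteme/ && echo "staged: $item"
        else
            echo "skipped (not found): $item" >&2
        fi
    done
}

# Example call (hypothetical names):
# stage_for_deletion my_script_copy.sh pathlist.txt data_stripped
```

If the checks later turn up strange behavior, moving a folder back is a plain mv, for example mv deleteme/data_stripped . from the sample folder.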

Once you have finished your checks and decided whether your sample falls into category 1 or 2:

  1. If you are ready to delete, for now email [tts-research@tufts.edu] to submit a folder delete request. Include the full path to the deleteme folder. They will take care of it for you.

  2. If you are saving the individual data folders, you can still clean up the folder by removing the container files and any script copies you made.

  3. If it is one of the core data sets (category 3), you can also clean up the folder at this point, as for category 2.
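
As a complement to the calendar reminder, find can list deleteme folders that have sat untouched past the two-week window. A minimal sketch under stated assumptions: the function name stale_deleteme_dirs is hypothetical, and a directory's modification time only changes when entries are added to or removed from it, so the age is an approximation of when you last staged something.

```shell
# Print every folder named "deleteme" under a starting directory whose
# modification time is more than N days old (default 14).
# Usage: stale_deleteme_dirs <start directory> [days]
stale_deleteme_dirs() {
    # -prune stops find from descending into a matching deleteme folder
    find "${1:-.}" -type d -name deleteme -mtime +"${2:-14}" -prune -print
}
```

Passing an absolute start directory makes the output full paths, which is the form the delete request email needs.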

Cloud Storage

We have set up a SharePoint repository that is accessible through your Tufts account. Email [taritree.wongjirad@tufts.edu] or [Tom.Phimmasen@tufts.edu] (cc'ing Taritree) if you don't already have access. SharePoint site

At the time of writing this wiki, it is used to store final merged files. Eventually, the final merged files that are not currently used in analyses on Tufts will only be available on SharePoint. You can also use it to store other documents that may be of interest to the whole group (e.g., PDFs of presentations or papers).

From the SharePoint site, click the sync button on the toolbar. This will sync the drive with OneDrive on your computer. If you don't already have OneDrive, you can install it for free through Tufts: https://access.tufts.edu/office-365

Once you have synced, open the OneDrive folder in your local file explorer. From there you can download any files you wish to work with locally.

To upload, add local files to your OneDrive folder, then sync again to move them to cloud storage.

Moving merged files to SharePoint

The final step of transferring is to put a copy of any merged files you made on SharePoint, both to let others work with them easily and to preserve them. This applies regardless of whether you need to keep the individual files on Tufts. Once you have performed the first-time SharePoint setup, this step is relatively straightforward:

  • Log on to the xfer node of the Tufts cluster.

    ssh <username>@xfer.cluster.tufts.edu
    
  • Start a screen session again. This is only necessary if you are moving large files over, so that the transfer survives if the terminal dies.

    screen -S <screen name>
    
  • Edit your .bashrc file (only necessary the first time).

    vim ~/.bashrc
    # add this as the last line:
    alias rclone=/cluster/tufts/rt/tphimm01/rclone-v1.53.1-linux-amd64/rclone
    
    # reload the file so the alias is available in the current shell
    source ~/.bashrc
    
  • That's it! Now you can use rclone to move files over. Example:

    rclone copyto "<full path of file/folder to save>" WongjiradLab-01:"<full path on sharepoint>" -P
    
  • If you have been using the file structure in the tutorial and have cleaned up your directory so that only the merged files remain, the command will be as follows (in this case we are just copying over the whole remaining folder):

    rclone copyto "/cluster/tufts/wongjiradlab/larbys/data/mcc9/<samplename>" WongjiradLab-01:"larbys/data/mcc9/<samplename>" -P
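
In the example above, the SharePoint path is the cluster path with the /cluster/tufts/wongjiradlab/ prefix removed. The sketch below automates that mapping. It assumes SharePoint mirrors the cluster layout below that prefix, which this page shows only for the mcc9 example; the function name sharepoint_dest and the CLUSTER_PREFIX variable are hypothetical.

```shell
# Assumption: SharePoint mirrors the cluster layout below this prefix,
# as in the larbys/data/mcc9 example on this page.
CLUSTER_PREFIX="/cluster/tufts/wongjiradlab/"

# Print the SharePoint path for a cluster path; fail if it is outside the prefix.
sharepoint_dest() {
    case "$1" in
        "$CLUSTER_PREFIX"*) printf '%s\n' "${1#"$CLUSTER_PREFIX"}" ;;
        *) echo "not under $CLUSTER_PREFIX: $1" >&2; return 1 ;;
    esac
}

# Example (commented out; requires rclone and the WongjiradLab-01 remote):
# src="/cluster/tufts/wongjiradlab/larbys/data/mcc9/<samplename>"
# dest="$(sharepoint_dest "$src")"
# rclone copyto "$src" WongjiradLab-01:"$dest" -P --dry-run   # preview only
# rclone copyto "$src" WongjiradLab-01:"$dest" -P             # real copy
# rclone check  "$src" WongjiradLab-01:"$dest"                # compare both sides
```

Running with --dry-run first shows what would be transferred without copying anything. Note that bash does not expand aliases in non-interactive scripts by default, so inside a script the full rclone path from the .bashrc alias may be needed.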
    
