16 changes: 16 additions & 0 deletions R/sysmeta_all.R
@@ -0,0 +1,16 @@
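sysmeta_all <- function(mn, pid){
  # PIDs of every version in the object's version chain
  allPIDS <- get_all_versions(mn, pid)

  # allocate one slot per version, then fill each with that version's system metadata
  allSysmeta <- vector("list", length(allPIDS))
  for(i in seq_along(allPIDS)){
    allSysmeta[[i]] <- getSystemMetadata(mn, allPIDS[i])
  }

  return(allSysmeta)
}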


156 changes: 156 additions & 0 deletions codeChunks.Rmd
@@ -0,0 +1,156 @@
---
title: "Frequently Used Code Chunks"
author: "Vivian Tran"
date: "3/7/2018"
output: github_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

These are some code chunks that I frequently come back to when processing data for the Arctic Data Center.

#Reading in raw data
Collaborator

@isteves isteves Mar 20, 2018

GitHub is finicky about spaces after # for the headers, so make sure to include them! RStudio will preview it just fine, but GitHub won't. (#Reading--> # Reading)

##Single data file
```{r eval=FALSE}
df <- read.table("path/to/data",
Collaborator

Any reason you use read.table rather than read.csv or read_csv? I'm curious, but it might also be worth adding that those other options exist.

header=T,
Collaborator

There should be spaces around the = sign. Doesn't affect the code at all, but it makes it more readable (especially once your code gets long/complicated). This is our go-to reference for style: http://style.tidyverse.org/

fill = T, # blank fields are added for rows that have unequal length,
sep= ",", # put "," for .csv file, "\t" for files with values separated by tabs
na.strings = c('','NA')) # blank fields and "NA" are read in as NA
```
Some things to note: I specify fill=T because without it I often get an error when rows have unequal numbers of fields:

```{r echo=FALSE}
cat("Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 1 did not have 10 elements")
```

This usually happens in .txt files.
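
As a rough illustration of what fill = T changes, the sketch below writes a small made-up file whose last row is one field short and reads it both ways. The temporary file and its contents are invented for this example and do not come from a real data set.

```r
# hypothetical ragged file: the last row is missing its third field
tmp <- tempfile(fileext = ".txt")
writeLines(c("a,b,c", "1,2,3", "4,5"), tmp)

# without fill, read.table() stops with an error; capture the message instead of failing
errMsg <- tryCatch(
  read.table(tmp, header = TRUE, sep = ","),
  error = function(e) conditionMessage(e)
)

# with fill = TRUE, the missing field is padded with NA
df <- read.table(tmp, header = TRUE, fill = TRUE, sep = ",", na.strings = c("", "NA"))
df
```

Padding with NA keeps the read from failing, but it can also hide a genuinely malformed row, so it is worth checking which rows were filled.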


##Multiple data files
I use these chunks when I want to read in several data files that share the same column names and formatting. I usually group my data files into different folders according to the type of data and/or formatting to facilitate reading in the data. This will help automate reformatting later on.
```{r eval=FALSE}
# grab data paths from the folder that data is stored in
# "path" specifies the name of the folder where the data files are stored
# full.names = T produces full paths to files instead of just file names

rawPaths <- dir(path = "path/to/folder", full.names = T)
```

Read in data using a for loop. Remember to initialize all variables that you will be using outside of the for loop.
```{r eval=FALSE}
dataList <- vector("list", length(rawPaths)) # makes an empty list with same length as file paths vector
Collaborator

Great job initializing a list! I always have to stop myself from growing vectors.

i=0
for(i in 1:length(rawPaths)){
Collaborator

It's generally better practice to use seq_along(rawPaths), rather than 1:length(x) (which I also do all the time). It allows the code to fail more gracefully. See the discussion here: https://stackoverflow.com/questions/24917228/proper-way-to-loop-over-the-length-of-a-dataframe-in-r

dataList[[i]] <- read.table(rawPaths[i],
na.strings = c("", "NA"),
header=T)
Collaborator

It looks like the indentation is a little bit off here (though maybe it's GitHub, I'm not sure). A neat trick I learned from Bryce is to highlight your code and then use Cmd + I to fix the indentation!

}
```
Note: list() creates an empty list of length 0, which then has to grow every time the for loop iterates. vector("list", length(rawPaths)) instead allocates all of the slots up front. With a small number of iterations, the difference in run time is not noticeable. However, for a large number of iterations, not allocating space can cause the code to run very slowly.
Collaborator

Perhaps this reference (or something similar) is worth including in here: https://paulvanderlaken.com/2017/10/13/functional-programming-and-why-you-should-not-grow-vectors-in-r/
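
The two points discussed above (loop indices and allocation) can be sketched as follows. This is only an illustration under assumptions: the folder, file names, and file contents are made up, and lapply() with read.csv() is one possible alternative to the for loop, not a replacement the author proposed.

```r
# hypothetical case: the folder turned out to be empty
emptyPaths <- character(0)
badIndex <- 1:length(emptyPaths)    # counts down from 1 to 0, so a loop would run twice
goodIndex <- seq_along(emptyPaths)  # zero-length, so a loop would not run at all

# made-up folder with two small files that share the same columns
folder <- tempfile()
dir.create(folder)
write.csv(data.frame(x = 1:2, y = c("a", NA)), file.path(folder, "site1.csv"), row.names = FALSE)
write.csv(data.frame(x = 3:4, y = c("b", "c")), file.path(folder, "site2.csv"), row.names = FALSE)

# lapply() returns a list with one element per path, so no manual allocation is needed
rawPaths <- dir(path = folder, full.names = TRUE)
dataList <- lapply(rawPaths, read.csv, na.strings = c("", "NA"))
names(dataList) <- basename(rawPaths)  # label each data frame with its file name
```

Naming the list elements after their files makes it easier to trace a problem row back to its source later on.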



#Removing Extraneous Rows and Columns

##Rows

Iterate through all the rows in a data frame.
allRows is a logical vector (TRUE/FALSE). Each element corresponds to a row in dataFrame.
is.na(dataFrame[i,]) outputs one value per cell in row i: "TRUE" if that cell is blank (NA), and "FALSE" otherwise.
Collaborator

You can use ` to indicate code within sentences in Rmarkdown (like we do in slack)

all(is.na(dataFrame[i,])) outputs "TRUE" if all cells in that row are blank, and "FALSE" otherwise.
```{r eval=F}
allRows <- logical(nrow(dataFrame)) # initialize a logical vector with one slot per row
for(i in seq_len(nrow(dataFrame))){
  allRows[i] <- all(is.na(dataFrame[i,])) # store each output into allRows
}

blankRows <- which(allRows) # outputs indices of rows that contain "TRUE" (rows with all NA's)
dataFrame <- dataFrame[!allRows,] # keep the other rows; unlike dataFrame[-blankRows,], this is safe when blankRows is empty
```

Alternatively, you can use apply() to iterate through all rows. You can use this for a single data frame or a list of multiple data frames using a for loop.
```{r eval=F}
# outputs indices of rows with all NA's
blankRows <- which(apply(dataFrame,1,function(x)all(is.na(x))))
```

##Columns
```{r eval=F}

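allCols <- logical(length(dataFrame)) # length(dataFrame) gives us # of cols
for(i in seq_along(dataFrame)){
  allCols[i] <- all(is.na(dataFrame[,i])) # notice that we switch where the i goes
}

blankCols <- which(allCols) # indices of columns with all NA's
dataFrame <- dataFrame[,!allCols, drop = FALSE] # keep the other columns; safe when blankCols is empty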
```

Alternatively:
```{r eval=F}
blankCols <- which(apply(dataFrame,2,function(x)all(is.na(x))))
```
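
Another option, shown here only as a sketch on a made-up data frame, is to count the NA's per row and per column with rowSums() and colSums(), which avoids both the loop and apply(). The example data and the name cleaned are assumptions for illustration.

```r
# made-up data frame: row 2 and column c contain only NA's
dataFrame <- data.frame(a = c(1, NA, 3), b = c("x", NA, "z"), c = NA)

# a row (column) is blank when its NA count equals the number of columns (rows)
blankRows <- rowSums(is.na(dataFrame)) == ncol(dataFrame)
blankCols <- colSums(is.na(dataFrame)) == nrow(dataFrame)

# logical indexing keeps everything when nothing is blank;
# drop = FALSE keeps the result a data frame even if one column remains
cleaned <- dataFrame[!blankRows, !blankCols, drop = FALSE]
cleaned
```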


#Searching Through Strings - Dates

Use the grepl() function to search for a particular string. Since we often have to reformat dates in our data sets, searching for particular dates or times could be useful.
Collaborator

Perhaps this would be a good place to introduce some helpful resources. I personally like this cheatsheet: https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf


```{r}
# an example of common date/time scenarios
# this is usually a column within a data frame
dates <- c("3/4/2016", "3/4/16", "3-4-2016", "3-4-16","3-4-16 12:30",
"3/4/2016", "3/4/16", "3-4-2016", "3-4-16","3-4-16 12:30",
"3/4/2016", "3/4/16", "3-4-2016", "3-4-16","3-4-16 12:30")
```

Run unique() to see what kind of formats there are.
```{r}
unique(dates)
Collaborator

Discovered the get_dupes function yesterday. Could be interesting to add! (or at least link to) https://cran.r-project.org/web/packages/janitor/vignettes/introduction.html

```
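
When a real column has many distinct dates, unique() on the raw values can return a very long list. One possible trick, sketched here on made-up values, is to replace every digit with a placeholder so that only the distinct layouts remain. The names exampleDates and shapes are assumptions for this example.

```r
# made-up values with different dates but only a few layouts
exampleDates <- c("3/4/2016", "12/25/2016", "3-4-16", "3-4-16 12:30")

# replace each digit with "9" so values that share a layout collapse together
shapes <- gsub("[0-9]", "9", exampleDates)
unique(shapes)
table(shapes)  # how many values follow each layout
```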
The international standard (ISO 8601) formats for dates and times are YYYY-MM-DD and hh:mm:ss respectively, while the combined date-time standard is YYYY-MM-DDThh:mm:ss. Researchers' data often contain dates and times in varying formats because the values may have been entered by different people.

None of these are in the standard format, so we'll have to do some reformatting.


The following code gives us the indices of the elements that contain "/2016".
```{r}
indDates <- which(grepl("/2016",dates))
indDates
```

Use as.POSIXct() to specify what our original date format is.
Use format() to specify the format that we want.
Store values back into dates vector.
```{r}
dates[indDates] <- format(as.POSIXct(dates[indDates], tz = "", format="%m/%d/%Y"), format = "%Y-%m-%d")
dates[indDates]
```
This same process works for all of the formats in our dates vector.
Note:
"-16" is ambiguous because it could also match the day in a date that is already in the standard format (e.g. 2018-05-16). Always check what a pattern matches before reformatting.
We reformat the combined date/time items (matched by "-16 ", with a trailing space) before the date-only items, because both contain "-16" and the date-only pattern would otherwise pick up the date/time items too.

```{r}
indDates1 <- which(grepl("/16",dates))
dates[indDates1] <- format(as.POSIXct(dates[indDates1], tz = "", format="%m/%d/%y"), format = "%Y-%m-%d")
Collaborator

I like to use the lubridate package to work with dates. If you haven't tried it, I'd definitely recommend checking it out! https://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html There are also some other date/time packages, but I'm not as familiar with them. tibbletime is another one that seems promising.


indDates2 <- which(grepl("-2016",dates))
dates[indDates2] <- format(as.POSIXct(dates[indDates2], tz = "", format="%m-%d-%Y"), format = "%Y-%m-%d")

indDates3 <- which(grepl("-16 ",dates))
dates[indDates3] <- format(as.POSIXct(dates[indDates3], tz = "", format="%m-%d-%y %H:%M"), format = "%Y-%m-%dT%H:%M:%S")

indDates4 <- which(grepl("-16",dates))
dates[indDates4] <- format(as.POSIXct(dates[indDates4], tz = "", format="%m-%d-%y"), format = "%Y-%m-%d")
```

Our final dates vector now looks like this:
```{r echo=FALSE}
dates
```
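
The ambiguity of "-16" noted earlier can be reduced by anchoring the pattern to the whole string, so that the order of the reformatting steps matters less. The following is a minimal sketch on made-up values; the regular expression and the names exampleDates and isShort are assumptions for illustration, and the month-day-year order is assumed as in the examples above.

```r
# made-up examples; the last one is already in the standard format
exampleDates <- c("3-4-16", "3-4-16 12:30", "3-4-2016", "2018-05-16")

# ^ and $ anchor the pattern to the whole string:
# one or two digits, "-", one or two digits, "-", exactly two digits
isShort <- grepl("^[0-9]{1,2}-[0-9]{1,2}-[0-9]{2}$", exampleDates)

# only the date-only, two-digit-year values are converted
exampleDates[isShort] <- format(as.Date(exampleDates[isShort], format = "%m-%d-%y"), "%Y-%m-%d")
exampleDates
```

Because the pattern must match from the first character to the last, neither the date/time value nor the already standard date is touched.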