# Getting data into R {#getting_data_into_R}

Probably the first question you will ask when approaching R for the first time is 'how do I get my data into R?'

R can read directly from Excel spreadsheets using a number of packages, but these usually take some tweaking. In this tutorial we will restrict ourselves to importing data from 'comma separated values' (.csv) files and 'tab delimited' text files (.txt). You can create files in these formats using the 'Save As' menu in Excel, or whichever spreadsheet software you are using.

## Small variables

Sometimes you will need to enter a small amount of data into R that is not stored in an external file. There are two easy ways to do this.

Note that in the following (and all subsequent) code chunks, anything on a line after a `#` is ignored by R and is referred to as a 'comment'.

It is a good idea to get into the habit of commenting every few lines of code, explaining why you have written a certain line and not necessarily how it works (which should be self-evident). Comments make code much easier to read, especially if you didn't write it.

In the following example we use the `c()` function, which combines its arguments into a single vector. Note that this is not the same as the CONCATENATE() function in Excel, which joins pieces of text into one string: `c()` keeps each value as a separate element. In the example that follows, we call the function `c()` and specify a number of 'arguments', in this case a series of numbers that we want to combine.

Get used to the idea of calling functions like this with a number of arguments following - we will do this a lot!

```{r}

# Create an object called short_variable, and assign a series of numbers to it.

# We use '<-' to create the object. This is called 'gets'.

short_variable <- c(1,5,6,7,9,2,10)

# To see what is contained within an object, simply input the name of the object.

short_variable

# Note that this is the equivalent in R of writing print(short_variable).
# This is not particularly important in this course, but it becomes important
# as you begin to write your own functions.

print(short_variable)
```
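
As a small additional sketch (the object names here are made up for illustration), `c()` can also combine existing objects, not just individual numbers:

```{r}
# Two short vectors, entered separately
first_part <- c(1, 5, 6)
second_part <- c(7, 9, 2, 10)

# c() joins them end to end into one longer vector
combined <- c(first_part, second_part)

combined

# length() tells us how many values the vector holds
length(combined)
```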

<div class="ex">
A more convenient way of entering a series of numbers is to use the `scan()` function.
You must still assign a name to the object you are creating as before, but `scan()`
will allow you to enter the data more easily.

Try that now.
</div>
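
In an interactive session, calling `scan()` with no arguments lets you type values at the prompt and finish with a blank line. The sketch below instead passes the numbers through the `text` argument, purely so that the chunk can run without anyone typing; the object name `scanned_variable` is made up for illustration.

```{r}
# Interactive use (type the numbers, then press Enter on an empty line):
# scanned_variable <- scan()

# Non-interactive equivalent: scan() reads the numbers from the supplied text
scanned_variable <- scan(text = "1 5 6 7 9 2 10")

scanned_variable
```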


## Importing tables

Obviously you won't want to input all your data manually; it is much easier to import it from a file which you have prepared in a spreadsheet. As mentioned, in this course we will work exclusively with the simple formats '.csv' and '.txt'.

When importing data from files, there are a few rules that must be adhered to, otherwise R will throw an error.


*  Data must be complete. Any missing values should be replaced with NA.
*  There must be no spaces in text (e.g. column titles). Either use underscores, e.g. my_variable, or join words using capitals to separate them, e.g. myVariable.
*  Data should be arranged in 'long' format and be 'tidy'. This is required for many types of analysis in R - a good paper on this can be found [here](http://vita.had.co.nz/papers/tidy-data.pdf) - more on this later.


Let's assume we've done that already:

To import from a '.txt' file we use the `read.table()` function. As before, you must assign the result to a named object. In this case we will import a data file referred to in 'The R Book'.
Notice that we are able to import it directly from the internet, but the location could just as easily be a local folder, e.g. "C:/data/".

```{r}

rats <- read.table(
    "http://www.bio.ic.ac.uk/research/mjcraw/therbook/data/rats.txt",
    header = TRUE  # the first line of the file contains the column names
)
```
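
The same pattern works for '.csv' files with `read.csv()`. The following is a minimal sketch under stated assumptions: the table `example_data` and its column names are invented for illustration, and the file is written to a temporary location with `tempfile()` so that the chunk runs without any real data file. In practice you would give the path to your own file instead.

```{r}
# A small made-up table, including one missing value recorded as NA
example_data <- data.frame(
    plant_id = 1:3,
    height_cm = c(12.5, NA, 9.8),
    treatment = c("A", "B", "A")
)

# Write it to a temporary .csv file (stands in for a file saved from a spreadsheet)
csv_path <- tempfile(fileext = ".csv")
write.csv(example_data, csv_path, row.names = FALSE)

# Import the file from its local path, using the first line as column names
example_csv <- read.csv(csv_path, header = TRUE)

example_csv

# The missing value is recognised as NA
is.na(example_csv$height_cm)
```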

Now we can call the data to examine it. Note that this data set is dealt with on p475 of 'The R Book'.

```{r}
rats

```

This isn't a particularly big table so we can view the whole thing at once, but if it were longer we might want to summarise the data or look at smaller chunks of it.
We can look at the first six rows with `head()`:

```{r}
head(rats)
```
Or the last six rows with `tail()`:

```{r}
tail(rats)
```
We can select individual columns or rows with square brackets.

```{r}
# rats[1,1] will give you the value on the first row of the first column:

rats[1,1]

# The first number denotes row, the second column. So to see the whole first row, we do:

rats[1,]

# We can also specify a series by using a colon:

rats[1:10,]

# If we wanted every third row, we could use the seq() function:

rats[seq(3,36,3),]

# In this case the three arguments in the seq function mean: start at value 3, end at value 36, and jump 3 rows at a time.

# We can also compute some summary statistics from the data:

summary(rats)

# Or look at the structure of the object

str(rats)

```
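
The same square bracket rules apply to any table. As a minimal sketch, here is a small made-up data frame (the name `demo` and its columns are assumptions for illustration, not part of the rats data), so the chunk runs without an internet connection:

```{r}
demo <- data.frame(
    id = 1:12,
    score = seq(10, 120, by = 10)
)

# Every third row, as with the rats example above
demo[seq(3, 12, 3), ]

# A column can be selected by name as well as by position
demo[, "score"]
demo[, 2]

# Check how R has stored each column
str(demo)
```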

`str()` is useful, because it lets us know how R is seeing the parts of this table. `int`, for instance, means integer. If we wanted to conduct an ANOVA on the rats data, we would need to tell R that Treatment, Rat, and Liver are categorical variables, not continuous ones, otherwise we would be doing the wrong analysis. Note that R will always assume integers to be continuous, unless we explicitly tell it otherwise by labelling the variable as a `factor`.


```{r}
# To solve this, either use characters - A, B, C instead of 1, 2, 3 for factors, or convert them manually:

rats$Treatment <- factor(rats$Treatment)

# By using the factor command, we in effect convert a continuous variable into a categorical one:

str(rats)

```
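
To see the difference this makes without relying on the rats data, here is a small sketch using a made-up vector (`dose` is a hypothetical name). `summary()` treats the integer version as a continuous measurement, but simply counts the observations in each level once it is a factor:

```{r}
# Three groups coded as integers, two observations in each
dose <- c(1L, 2L, 3L, 1L, 2L, 3L)

# As integers, R summarises the codes as if they were measurements
summary(dose)

# Converted to a factor, each code becomes a category (a 'level')
dose_f <- factor(dose)

levels(dose_f)

# Now summary() counts the observations per level
summary(dose_f)
```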

Now convert Rat and Liver into categorical variables in the same way - we will need this data later.

```{r,eval=TRUE,include=FALSE}

rats$Rat <- factor(rats$Rat)
rats$Liver <- factor(rats$Liver)

```

The `plyr` package has a dedicated function, `revalue()`, to convert numerical factor levels to character-based levels. This can save you the hassle of using `factor()` each time you open a dataset (although you could of course do this in your spreadsheet).

```{r}

# Install the package plyr if you don't have it already. You can check this on the
# 'Packages' tab on the right-hand side of RStudio, but it doesn't hurt to run the
# command again, even if it is installed!

install.packages('plyr',repos="http://cran.rstudio.com/")

# Load the package:

library(plyr)

rats$Treatment1 <- revalue(rats$Treatment, c("1" = "A", "2" = "B", "3" = "C"))

rats$Treatment
rats$Treatment1

```
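
If you would rather not install an extra package, base R can do something similar. This is an alternative to `revalue()`, not the `plyr` approach itself: the `labels` argument of `factor()` assigns a name to each level at the moment the factor is created. The vector `treatment_codes` below is made up for illustration.

```{r}
# Treatment groups coded as numbers
treatment_codes <- c(1, 2, 3, 1, 2, 3)

# levels lists the codes to expect; labels gives the name for each, in the same order
treatment_f <- factor(
    treatment_codes,
    levels = c(1, 2, 3),
    labels = c("A", "B", "C")
)

treatment_f
```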

Note that in these two examples we refer to the individual columns of the table with a `$`. RStudio is great for this. Type `rats$` in the console and then press the tab key. RStudio will then automatically complete the variable name, or give you the options available. Try it.

We could also have referred to the columns by number: `rats[ ,1]`.

Another option, which gives you a much more familiar spreadsheet-like view, is `fix(rats)` - try this too.

If you have followed the above instructions, you should have something that looks like this:

```{r}
str(rats)

rats
```

Note that a longer discussion of getting data from Excel files is available here: <http://www.r-bloggers.com/read-excel-files-from-r/>.