# Working with data structures
In our second lesson, we start to look at two **data structures**, **vectors** and **dataframes**, that can handle a large amount of data.
Before we jump into these bigger things, we introduce a new kind of operator:
## Comparison Operations
Sometimes, we want to compare values, such as whether one is bigger than the other, or whether they are the same.
```{r}
age = 35
age > 18
```
We asked whether `age` is greater than 18, and the answer is `TRUE` because `age` is 35. We can follow up to ask whether `age` is less than or equal to 65:
```{r}
age <= 65
```
Besides comparing numbers, we can ask whether a character is equal to a specific value:
```{r}
building_name = "Arnold"
building_name == "Weintraub"
```
We asked whether `building_name` is "Weintraub" via the `==` comparison operator (extremely easy to confuse with `=`), and the answer is `FALSE` because `building_name` is "Arnold". We can follow up to ask whether `building_name` is *not equal* to "Weintraub":
```{r}
building_name != "Weintraub"
```
## Full list of Comparison Operations
`<` less than
`<=` less than or equal to
`>` greater than
`>=` greater than or equal to
`==` equal to
`!=` not equal to
You can also write out multiple comparisons at once, which you will see more in your exercise this week...
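One common way to combine comparisons is with R's logical operators `&` ("and") and `|` ("or"). The chunk below is a minimal sketch and may not match exactly what the exercise uses; `example_age` is a hypothetical variable introduced only for this illustration.

```{r}
# example_age is a hypothetical variable used only for this illustration
example_age = 35
# & gives TRUE only if both comparisons are TRUE
example_age > 18 & example_age <= 65
# | gives TRUE if at least one comparison is TRUE
example_age < 18 | example_age > 65
```

The first expression is `TRUE` because 35 is between 18 and 65, and the second is `FALSE` because 35 is neither below 18 nor above 65.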
## Vectors
In the first exercise, you started to explore **data structures**, which organize and store values of the data types we have seen. You played around with **vectors**, which are ordered collections of a single data type. Each *element* of a vector contains a value, and all elements of a vector must be the same data type, such as numeric, character, or logical.
We often create vectors using the combine function, `c()` :
```{r}
staff = c("chris", "sonu", "sean")
chrNum = c(2, 3, 1)
```
If we try to create a vector with mixed data types, R will try to convert the elements to a single data type, or give an error:
```{r}
staff = c("chris", "shasta", 123)
staff
```
Our numeric got converted to character so that the entire vector is all characters.
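To check what data type a vector ended up with, one option is the `class()` function. The chunk below is a small sketch; `mixed_vec` is a hypothetical name chosen so that it does not overwrite `staff`.

```{r}
# mixed_vec is a hypothetical name used only for this illustration
mixed_vec = c("chris", "shasta", 123)
class(mixed_vec)        # "character": 123 became "123"
# Mixing numerics with a logical converts the logical to a number
class(c(1, 2, TRUE))    # "numeric": TRUE became 1
```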
### Using operations on vectors
Recall from the first class:
- Expressions are built out of **operations** or **functions**.
- Operations and functions combine **data types** to return another data type.
Now that we are working with data structures, the same principle applies:
- Operations and functions combine **data structures** to return another data structure (or data type!).
What happens if we use some familiar operations we used for numerics on a numerical vector? If we multiply a numerical vector by a numeric, what do we get?
```{r}
chrNum = c(2, 3, 1)
chrNum = chrNum * 3
chrNum
```
All of `chrNum`'s elements tripled! Our multiplication operation, when used on a *numeric vector with a numeric*, has a *new* meaning: it multiplied all the elements by 3. Here's another example: numeric vector multiplied by another numeric vector:
```{r}
chrNum * c(2, 2, 0)
```
Or how about comparison operators?
```{r}
chrNum > 2
```
But there are also limits: adding a numeric vector to a character vector creates an error:
```{r}
# chrNum + staff  # uncomment to see the error: non-numeric argument to binary operator
```
When we work with operations and functions, we must be mindful of what inputs the operation or function takes in, and what outputs it gives, no matter how "intuitive" the operation or function name is.
Lastly, here's a function you can use on vectors: `length()` gives you the length of the vector:
```{r}
length(chrNum)
```
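One way to stay mindful of inputs and outputs is to check the length and data type of a result. The chunk below is a sketch of that habit; the `example_` variables are hypothetical names, kept separate from `chrNum`.

```{r}
# Hypothetical vectors used only for this illustration
example_num = c(2, 3, 1)
example_product = example_num * c(2, 2, 0)   # element by element: 4 6 0
example_compare = example_num > 2            # element by element: FALSE TRUE FALSE
# The results have the same length as the input vector...
length(example_product)
length(example_compare)
# ...but not necessarily the same data type
class(example_product)   # "numeric"
class(example_compare)   # "logical"
```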
### Subsetting vectors explicitly
In the exercise this past week, you looked at a new operation to subset elements of a vector using brackets. Let's look carefully at the main ways to subset vectors:
We subset vectors using the bracket `[ ]` operation.
Inside the bracket can be:
1. A single numeric value
```{r}
staff[2]
```
which returns the second value of `staff`.
2. A **numerical indexing vector** containing numerical values. They dictate which elements of the vector to subset.
```{r}
staff[c(1, 2)]
```
Alternatively, you can store the subsetted vector as a new variable:
```{r}
small_staff = staff[c(1, 2)]
small_staff
```
3. A **logical indexing vector** with the same length as the vector to be subsetted. The `TRUE` values indicate which elements to keep, the `FALSE` values indicate which elements to drop.
If we want the first element:
```{r}
staff[c(TRUE, FALSE, FALSE)]
```
If we want the first and second elements:
```{r}
staff[c(TRUE, TRUE, FALSE)]
```
If we want the first and second elements and store the result as a variable:
```{r}
small_staff = staff[c(TRUE, TRUE, FALSE)]
small_staff
```
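A logical indexing vector does not have to be typed out by hand. Because a comparison on a vector returns a logical vector of the same length, it can be used directly inside the brackets. The chunk below is a minimal sketch; `example_ages` and `is_adult` are hypothetical names.

```{r}
# Hypothetical vector used only for this illustration
example_ages = c(12, 35, 70, 18)
# The comparison returns a logical vector: FALSE TRUE TRUE TRUE
is_adult = example_ages >= 18
# Keep only the elements where is_adult is TRUE: 35 70 18
example_ages[is_adult]
# The same thing in one line
example_ages[example_ages >= 18]
```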
### A trick: When subsetting large vectors
Suppose you have a large vector `age` with 100 elements:
```{r}
set.seed(123) #don't worry about this function
age = round(runif(100, 1, 100)) #don't worry about these functions
age
```
Suppose you want the first 20 elements of this vector using a numerical indexing vector. Writing out `c(1, 2, 3, 4, …` for the numerical indexing vector is a pain. We can generate a numerical vector from 1 to 20 via the following trick:
```{r}
1:20
```
Then, you just use it to help subset:
```{r}
age[1:20]
```
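The same trick works for ranges that do not start at 1, and it can be combined with `length()` to reach the end of a vector. The chunk below is a sketch; `example_vec` and `example_n` are hypothetical names.

```{r}
# Hypothetical vector used only for this illustration
example_vec = c(10, 20, 30, 40, 50, 60)
# Elements 2 through 4: 20 30 40
example_vec[2:4]
# The last two elements: 50 60
example_n = length(example_vec)
# The parentheses matter: without them, R evaluates 1:example_n first
example_vec[(example_n - 1):example_n]
```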
## Dataframes
Before we dive into dataframes, check that the `tidyverse` package is properly installed by loading it in your R Console:
```{r, message=F}
library(tidyverse)
```
Here is the data structure you have been waiting for: the **Dataframe**. A dataframe is like a spreadsheet in which each column holds a single data type, although different columns can have different data types. Think of a bunch of vectors of the same length organized as columns, and you get a dataframe.
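To make the "vectors as columns" idea concrete, the chunk below builds a tiny dataframe by hand with the base R function `data.frame()`. This is only a sketch; the lesson itself loads a prepared dataframe instead, and the `example_` names and column names are hypothetical.

```{r}
# Two hypothetical vectors of the same length
example_staff = c("chris", "sonu", "sean")
example_chr = c(2, 3, 1)
# Each vector becomes one column; the names on the left become column names
example_df = data.frame(name = example_staff, chrNum = example_chr)
example_df
class(example_df)   # "data.frame"
```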
Below is some code to load in a Dataframe. Notice that the file extension here is `.RData`, which is a format specifically for R. In the last week of class we will talk about how to load and save spreadsheets from CSVs or Excel.
```{r}
load(url("https://github.com/fhdsl/S1_Intro_to_R/raw/main/classroom_data/CCLE.RData"))
```
### Using functions and operations on Dataframes
We can run some functions on dataframes to look up their basic properties, similar to how we used `length()` for vectors:
```{r}
nrow(metadata)
ncol(metadata)
dim(metadata)
colnames(metadata)
```
The last function, `colnames()`, returns a character vector of the column names of the dataframe. This is an important property of dataframes, because we will use column names to subset them.
We introduce an operation for dataframes: the `dataframe$column_name` operation selects a column by its column name and returns the column as a vector. For instance:
```{r}
metadata$OncotreeLineage[1:5]
metadata$Age[1:5]
```
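Because `$` returns a vector, everything from the vectors section applies to the result. The chunk below is a sketch on a small hand-made dataframe rather than `metadata`, so that it runs on its own; `toy_df`, `toy_ages`, and the column names are hypothetical.

```{r}
# Hypothetical dataframe used only for this illustration
toy_df = data.frame(
  name = c("chris", "sonu", "sean"),
  age = c(35, 28, 41)
)
# Pull out one column as a vector
toy_ages = toy_df$age
length(toy_ages)             # 3
# Comparison on the column: TRUE FALSE TRUE
toy_ages > 30
# Use the comparison to subset another column: chris and sean
toy_df$name[toy_ages > 30]
```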
The bracket operation `[ ]` on a dataframe can also be used for subsetting rows and columns at once. `dataframe[row_idx, col_idx]` subsets the dataframe by a row indexing vector `row_idx`, and a column indexing vector `col_idx`.
```{r}
metadata[1:5, c(1, 3)]
```
We can refer to the column names directly:
```{r}
metadata[1:5, c("ModelID", "CellLineName")]
```
We can leave the column index or row index empty to just subset columns or rows.
```{r}
metadata[1:5, ]
```
```{r}
head(metadata[, c("ModelID", "CellLineName")])
```
The bracket operation on a dataframe can be difficult to interpret, because expressions for both the row and column indices are a lot of information for one line of code. You will see easier-to-read functions for dataframe subsetting in the next lesson.
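In the meantime, one way to keep bracket subsetting readable is to store the row and column indexing vectors as variables first. The chunk below is a sketch on a small hand-made dataframe; `toy_table`, `keep_rows`, `keep_cols`, and the column names are hypothetical.

```{r}
# Hypothetical dataframe used only for this illustration
toy_table = data.frame(
  name = c("chris", "sonu", "sean"),
  age = c(35, 28, 41),
  building = c("Arnold", "Weintraub", "Arnold")
)
# Everything on one line: rows with age over 30, two columns by name
toy_table[toy_table$age > 30, c("name", "building")]
# The same subset, broken into steps
keep_rows = toy_table$age > 30          # logical row indexing vector
keep_cols = c("name", "building")       # column names to keep
toy_table[keep_rows, keep_cols]
```

Both expressions return the rows for chris and sean, with only the `name` and `building` columns.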
Lastly, try running `View(metadata)` in the RStudio Console... whew, a nice way to examine your dataframe like a spreadsheet program!
## Exercises
You can find [exercises and solutions on Posit Cloud](https://posit.cloud/content/8245357), or on [GitHub](https://github.com/fhdsl/Intro_to_R_Exercises).