Bio720/R_exercises/R_activities_InClass_Sept9_2024.Rmd at master · DworkinLab/Bio720 · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
---
title: "BIO720_Week2"
author: "Ian Dworkin"
date: "`r Sys.Date()`"
output:
  pdf_document:
    toc: true
  html_document:
    toc: true
    number_sections: true
editor_options:
  chunk_output_type: console
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, eval = FALSE)
```


# In class activities for week 2

Now that you have had a bit of an introduction to the fundamentals of R, let's start working with some examples of how we can use these skills with real data.


We are going to work with the count data of RNAseq sequence reads from a large data set from a paper we recently submitted. It was part of a study of the genomic basis of the evolutionary of sociability in *Drosophila melanogaster".

The [link to the data and script repository is here](https://github.com/DworkinLab/DrosophilaSociabilityTranscriptomics)


But for the moment, all we need is the counts that were generated. For this particular analysis this was done by aligning the sequence reads to the *Drosophila melanogaster* genome using a *splice aware* alignment tool, [STAR](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3530905/). While it is not relevant for the moment, if you want to read about this approach, you can [take a look at this tutorial](https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/03_alignment.html).


In any case, the count data can be downloaded like this, directly from the github repository.

```{r}
rna_counts <- read.table("https://raw.githubusercontent.com/DworkinLab/DrosophilaSociabilityTranscriptomics/main/data/star/star_count_matrix.txt",
                         header = TRUE)
```


## What we did yesterday (Sept 10th, 2024)

We first thought through some things we should check after we downloaded the data

### what should we look at first?

- does the data look like it is expected
    - size of the data. Dimensions of the data?
    - names of variables?
    - how is missing data encoded?
    - what object class are we working with? Is it what we expected?
    - Is all the data there? how do we check
    - are column headers being read as headers?


      - Is the data possible?


We realized there were a variety of ways to check the number of rows (and columns of the data

Probably the most concise
```{r}
dim(rna_counts)
```

Has lots of other useful information, but maybe too much information in this case!
```{r}
str(rna_counts)
```

We can ask about how many rows and how many columns, *per se*
```{r}
nrow(rna_counts)
ncol(rna_counts)
```


We then checked to see if there was any missing data (encoded by `NA`)

```{r}
anyNA(rna_counts)
```


## Let's look at the names of the variables we have


So what are we looking at?


```{r}
class(rna_counts)
mode(rna_counts)
```

### A small but important change

Typically in R and bioconductor for genomics experiments, we have columns as samples and rows as features (transcripts, genes, SNPs, taxa, etc). How is it currently set up?

How would we make a new object (generally don't over-write the old one!) where it corresponds to this format? Write down the steps you need to take in order to do so (the pseudo-code).

Row names need to be the feature names (in this case the genes)

What steps do we need?

1. extract out gene names as a variable
2. generate a copy of the data, with a new name, without first column
3. use gene names as rownames


or


1. generate a copy of the data, with a new name, without first column
2. use gene names as rownames via extracting the gene identifier from the original data set


```{r}
rna_counts2 <- rna_counts[, 2:143] # or [, -1]

rownames(rna_counts2) <- rna_counts$gene

rna_counts2[1:6, 1:6]
```


### extract the information about the experimental design from the column names for each sample

We used underscore "_" for our delimiter in the files

- AS: Initials of who collected the samples
- the next is for the evolutionary treatments
- Replicate lineage
- Sex
- Environment
- Unique sample
- Lane of Illumina flowcell


## Pseudo code

Write pseudo-code to take the sample names and generate an object (but what kind of object?) to store all of the experimental meta-data.

- such that rows represent individual samples
- and columns experimental variables.


1. make a new variable storing all samples names as a vector.
2a. initialize a data frame to store our new information. Thankfully in R, it does it for us.
2b. separate based on the underscore delimiter
3. label the variables (columns)
4. Label the rows by their unique sample names


```{r}
sample_names <- colnames(rna_counts2)
class(sample_names)
```


we are going to use a function in the stringr library
```{r}
#install.packages("stringr") # only install once!
```

To use the library...
```{r}
library("stringr")
```


```{r}
sample_names_matrix <- str_split_fixed(string = sample_names,
                                       pattern = "_",
                                       n = 7)

# alternative (but can cause issues if there are problems with the delimiters)
sample_names_matrix2 <- str_split(string = sample_names,
                                       pattern = "_",
                                       simplify = TRUE)
```


For the metadata, make sure we have each experimental variable (columns) labelled, along with the sample identifiers (along each row).

```{r}
colnames(sample_names_matrix) <- c("initials", "treatment", "lineage", "sex", "environment", "sample", "lane")

rownames(sample_names_matrix) <- sample_names
```


## let's look at the data!


### How many mapped reads do we have in each sample?

pseudo-code
1. Compute sum of each columns (i.e. compute sum of column 1, then column 2...)

For column 1 you could compute the sum like:

```{r}
sum(rna_counts2[,1])
```

I don't want to do that for each column!


We could write a for loop (which I do below), but there are two "Rish" ways that are worth knowing about.

First is `colSums` which computes the sum for each column. This is generally the fastest way (both coding and computing)

#### colSums
```{r}
read_sums <- colSums(rna_counts2)

length(read_sums)

read_sums
```


#### apply
Alternatively, you could use a function like `apply` to do the same thing, by applying the `sum` function to each column. apply functions tend to be somewhat slower (but not for a tiny problem like this)

```{r}
read_sums_alt1 <- apply(X = rna_counts2, MARGIN = 2, sum)
```


#### for loop
In most other languages, a `for` loop would be fine and you could use it like this:

First we initialize the variable and make it the "right size" vector to pre-allocate memory to store the data we are going to generate (the sums from each column). In this case I am creating a vector with `NA`, which we will fill with the sums we compute in the next step.
```{r}
read_sums_alt2 <- rep(x = NA,
                      times = ncol(rna_counts2))


names(read_sums_alt2) <- colnames(rna_counts2) # just so we have the sample names
length(read_sums_alt2)

head(read_sums_alt2)
```

Then the for loop
```{r}
for (i in 1:ncol(rna_counts2)) {
  read_sums_alt2[i] <- sum(rna_counts2[, i])
}

rm(i) # a bit of cleanup
```


All of these achieve the same thing. A small note, for some reason using `colSums` stores it as numeric not integers.
```{r}
head(read_sums)
head(read_sums_alt1)
head(read_sums_alt2)
```