Skip to content

Latest commit

 

History

History
514 lines (444 loc) · 23.7 KB

File metadata and controls

514 lines (444 loc) · 23.7 KB

02 Metacoder heat trees

Daniel 19/09/2022

This script takes the previously merged and formatted data from mBrave (produced through script 01) and does preliminary analyses with Metacoder. The idea of this script is to have a preliminary estimate of diversity within the metabarcoding samples. Due to the nature of mBrave’s classification associated with BOLD databases, there are a number of BINs which have multiple, uncertain, identifications. I have (manually) cleaned the data further, now we are only including species name for the 1st BOLD BIN hit. This is normally either, the species form the BCI database (the first dataset used to classify through mBRAVE), or the species most frequently ‘named’ that name. E.g. Agelenopsis pennsylvanica, potteri, emertoni, katsoni, acutosa, is now Agelenopsis pennsylvanica. The file used is the final dataset with ~3500 BINs (finaldat_merged.csv).

rm(list=ls()) #I always start scritps this way to make sure I have a clean R environment
library('tidyr')
library('dplyr')
library('metacoder')
dat<- read.csv('data/finaldat_merged.csv', header = TRUE)
sample <- read.csv('data/location_ctrl.csv')
obj <- parse_tax_data(dat,
                      class_cols = "classification",
                      class_sep = ";",
                      class_regex = "^([a-z]{0,1})_{0,2}(.*)$",
                      class_key = c("tax_rank" = "taxon_rank", "name" = "taxon_name"))
print(obj)
## <Taxmap>
##   3091 taxa: aab. Animalia, aac. Arthropoda ... eox. nitens
##   3091 edges: NA->aab, aab->aac, aac->aad ... bzn->eow, bzo->eox
##   2 data sets:
##     tax_data:
##       # A tibble: 3,173 x 43
##         taxon_id bin_uri      classi~1 total~2 total~3 total~4 total~5
##         <chr>    <chr>        <chr>      <int>   <int>   <int>   <int>
##       1 bzp      BOLD:AAA0009 k__Anim~       0       0       0       0
##       2 bzq      BOLD:AAA0012 k__Anim~       0       0       0       0
##       3 bzr      BOLD:AAA0016 k__Anim~       0       0       0       0
##       # ... with 3,170 more rows, 36 more variables:
##       #   total39868 <int>, total39869 <int>, total39870 <int>,
##       #   total39871 <int>, total39872 <int>, total39873 <int>,
##       #   total39874 <int>, total39875 <int>, total39876 <int>,
##       #   total39877 <int>, ..., and abbreviated variable names
##       #   1: classification, 2: total39864, 3: total39865,
##       #   4: total39866, 5: total39867
##     class_data:
##       # A tibble: 22,211 x 5
##         taxon_id input_index tax_rank name       regex_match  
##         <chr>          <int> <chr>    <chr>      <chr>        
##       1 aab                1 k        Animalia   k__Animalia  
##       2 aac                1 p        Arthropoda p__Arthropoda
##       3 aad                1 c        Insecta    c__Insecta   
##       # ... with 22,208 more rows
##   0 functions:

Above is the print out of a taxmap object. The R console output shows that there are 3,651 unique taxa and lists their ID, but this includes many NA’s (unassigned taxonomies), Better to look at the tbl_df which shows a ‘tibble’ of 3,173 rows (assigned classifications) in 43 columns (40 samples + taxon_id, bin_uri and classification). Each row shows how many reads for that taxon was in each sample.

obj$data$tax_data <- obj$data$tax_data[c("taxon_id","bin_uri", sample$sampleID)]
obj$data$tax_data <- zero_low_counts(obj, data = "tax_data", min_count = 10, cols= sample$sampleID, other_cols = TRUE)
## Zeroing 5811 of 126920 counts less than 10.

## Warning: The following columns will be replaced in the output:
##    total39883, total39882, total39881 ... ZET1B, ZET2A, ZET2B
#zeroing 5811 of 139800 counts with less than 10 reads (depending how stringent this can be changed in the min_count flag above
no_reads <- rowSums(obj$data$tax_data[, sample$sampleID]) == 0
sum(no_reads) 
## [1] 937
#there are 1259 taxon_id's with less than 10 reads which we will remove in the next step
obj <- filter_obs(obj, data = "tax_data", ! no_reads, drop_taxa = TRUE)
print(obj) #2236 taxa in 40 samples (it says 43 columns because of columns taxon, OTU_ID and bin_uri, the classification column is not really needed right now)
## <Taxmap>
##   2572 taxa: aab. Animalia, aac. Arthropoda ... eox. nitens
##   2572 edges: NA->aab, aab->aac, aac->aad ... bzm->eov, bzo->eox
##   2 data sets:
##     tax_data:
##       # A tibble: 2,236 x 42
##         taxon_id bin_uri      total3~1 total~2 total~3 total~4 total~5
##         <chr>    <chr>           <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##       1 bzp      BOLD:AAA0009        0       0       0       0       0
##       2 bzr      BOLD:AAA0016        0       0       0       0       0
##       3 bzv      BOLD:AAA0100        0       0       0       0       0
##       # ... with 2,233 more rows, 35 more variables:
##       #   total39878 <dbl>, total39877 <dbl>, total39876 <dbl>,
##       #   total39875 <dbl>, total39874 <dbl>, total39873 <dbl>,
##       #   total39872 <dbl>, total39871 <dbl>, total39870 <dbl>,
##       #   total39869 <dbl>, ..., and abbreviated variable names
##       #   1: total39883, 2: total39882, 3: total39881, 4: total39880,
##       #   5: total39879
##     class_data:
##       # A tibble: 21,590 x 5
##         taxon_id input_index tax_rank name       regex_match  
##         <chr>          <int> <chr>    <chr>      <chr>        
##       1 aab                1 k        Animalia   k__Animalia  
##       2 aac                1 p        Arthropoda p__Arthropoda
##       3 aad                1 c        Insecta    c__Insecta   
##       # ... with 21,587 more rows
##   0 functions:

We now have cleaned data set (with our chosen minimum number of reads to >10) We can add new ‘tibbles’ to obj$data such as proportion of total reads, or abundance

obj$data$tax_props <- calc_obs_props(obj, "tax_data", cols= sample$sampleID, other_cols = TRUE)
## Calculating proportions from counts for 40 columns for 2236 observations.

## Warning: The following columns will be replaced in the output:
##    total39883, total39882, total39881 ... ZET1B, ZET2A, ZET2B
obj$data$tax_abund <- calc_taxon_abund(obj, "tax_data", cols = sample$sampleID)
## Summing per-taxon counts from 40 columns for 2572 taxa
print(obj)
## <Taxmap>
##   2572 taxa: aab. Animalia, aac. Arthropoda ... eox. nitens
##   2572 edges: NA->aab, aab->aac, aac->aad ... bzm->eov, bzo->eox
##   4 data sets:
##     tax_data:
##       # A tibble: 2,236 x 42
##         taxon_id bin_uri      total3~1 total~2 total~3 total~4 total~5
##         <chr>    <chr>           <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##       1 bzp      BOLD:AAA0009        0       0       0       0       0
##       2 bzr      BOLD:AAA0016        0       0       0       0       0
##       3 bzv      BOLD:AAA0100        0       0       0       0       0
##       # ... with 2,233 more rows, 35 more variables:
##       #   total39878 <dbl>, total39877 <dbl>, total39876 <dbl>,
##       #   total39875 <dbl>, total39874 <dbl>, total39873 <dbl>,
##       #   total39872 <dbl>, total39871 <dbl>, total39870 <dbl>,
##       #   total39869 <dbl>, ..., and abbreviated variable names
##       #   1: total39883, 2: total39882, 3: total39881, 4: total39880,
##       #   5: total39879
##     class_data:
##       # A tibble: 21,590 x 5
##         taxon_id input_index tax_rank name       regex_match  
##         <chr>          <int> <chr>    <chr>      <chr>        
##       1 aab                1 k        Animalia   k__Animalia  
##       2 aac                1 p        Arthropoda p__Arthropoda
##       3 aad                1 c        Insecta    c__Insecta   
##       # ... with 21,587 more rows
##     tax_props:
##       # A tibble: 2,236 x 42
##         taxon_id bin_uri      total3~1 total~2 total~3 total~4 total~5
##         <chr>    <chr>           <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##       1 bzp      BOLD:AAA0009        0       0       0       0       0
##       2 bzr      BOLD:AAA0016        0       0       0       0       0
##       3 bzv      BOLD:AAA0100        0       0       0       0       0
##       # ... with 2,233 more rows, 35 more variables:
##       #   total39878 <dbl>, total39877 <dbl>, total39876 <dbl>,
##       #   total39875 <dbl>, total39874 <dbl>, total39873 <dbl>,
##       #   total39872 <dbl>, total39871 <dbl>, total39870 <dbl>,
##       #   total39869 <dbl>, ..., and abbreviated variable names
##       #   1: total39883, 2: total39882, 3: total39881, 4: total39880,
##       #   5: total39879
##     tax_abund:
##       # A tibble: 2,572 x 41
##         taxon_id total39883 total39882 total~1 total~2 total~3 total~4
##         <chr>         <dbl>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##       1 aab           57486      74134   95845   87200   63183   70228
##       2 aac           57486      74134   95845   87200   63183   70228
##       3 aad           57486      74080   95797   87200   63045   70228
##       # ... with 2,569 more rows, 34 more variables:
##       #   total39877 <dbl>, total39876 <dbl>, total39875 <dbl>,
##       #   total39874 <dbl>, total39873 <dbl>, total39872 <dbl>,
##       #   total39871 <dbl>, total39870 <dbl>, total39869 <dbl>,
##       #   total39868 <dbl>, ..., and abbreviated variable names
##       #   1: total39881, 2: total39880, 3: total39879, 4: total39878
##   0 functions:

Notice we now have 3 tibbles, one for counts (tax_data - this could be changed with something like: >names(obj$data) <- “otu_counts”), one for proportion (tax_props) and one for abundance (tax_abund). We can, and will, add more refined data sets afterwards, all linked to the original parsing so we never loose the information of what taxonid actually means.

we can now view distribution of reads in the samples:

hist(colSums(obj$data$tax_data[ , sample$sampleID]))

Sampling is quite uneven with one samples with < 100,000 reads. We will rarefy our data to simulate even number of reads per sample (18914 is the minimum depth) keep in mind that this discards a lot of data - we could probably instead remove some of the samples with the lowest reads instead. We can add the rarefied data set as a separate data frame as we did above with tax_props and tax_abund. If you want, you can print(obj$data$tax_rarefied) to see what it looks like. We can also make a rarefaction curve with metacoder, but look at script 03_diversity_and_ordination where we do rarefaction curves with ggplot instead.

obj$data$tax_rarefied <- rarefy_obs(obj, "tax_data", cols = sample$sampleID, other_cols = TRUE)
## Rarefying to 18914 since that is the lowest sample total.

## Warning: The following columns will be replaced in the output:
##    total39883, total39882, total39881 ... ZET1B, ZET2A, ZET2B
library(vegan)
## Warning: package 'vegan' was built under R version 4.1.2

## Loading required package: permute

## Warning: package 'permute' was built under R version 4.1.3

## Loading required package: lattice

## This is vegan 2.5-7
rarecurve(t(obj$data$tax_data[, sample$sampleID]), step = 1000,
          sample = min(colSums(obj$data$tax_data[, sample$sampleID])),
          col = "blue", cex = 0.8)

Below an OPTIONAL chunk to visualize a very rudimentary heat tree for all counts for metabaroded data. It is meant for illustration purposes only and its not very informative, it also takes a long time to run. See figure allcounts.pdf in the project.

set.seed(99) # This makes the plot appear the same each time it is run
heat_tree(obj,
          node_label = taxon_names,
          node_size = n_obs,
          node_color = n_obs,
          node_size_axis_label = "BIN count",
          node_color_axis_label = "Samples with reads",
          node_label_size_range = c( 0.005, 0.03),
          layout = "davidson-harel", # The primary layout algorithm
          initial_layout = "reingold-tilford") # The layout algorithm that initializes node locations

# this takes > 30 minutes to run

Far more informative would be to see the differences between seasons in this case. We can easily calculate the number of samples that have reads for each taxon according to different groups. We can also compare statistically using Wilcoxon Rank Sum test, significant differences between taxa samples in wet or dry season.

obj$data$tax_occ <- calc_n_samples(obj, "tax_abund", groups = sample$SEASON, cols = sample$sampleID)
## Calculating number of samples with a value greater than 0 for 40 columns in 2 groups for 2572 observations
print(obj$data$tax_occ)
## # A tibble: 2,572 x 3
##    taxon_id   wet   dry
##    <chr>    <int> <int>
##  1 aab         20    20
##  2 aac         20    20
##  3 aad         20    20
##  4 aae         11     0
##  5 aaf          5     0
##  6 aag         20    20
##  7 aah         20    19
##  8 aai         20    20
##  9 aaj         20    19
## 10 aak         20    20
## # ... with 2,562 more rows
obj$data$diff_table <- compare_groups(obj, data = "tax_abund",
                                      cols = sample$sampleID, # What columns of sample data to use
                                      groups = sample$SEASON) # What category each sample is assigned to
obj <- mutate_obs(obj, "diff_table",
                  wilcox_p_value = p.adjust(wilcox_p_value, method = "fdr"))
obj$data$diff_table$log2_median_ratio[obj$data$diff_table$wilcox_p_value > 0.05] <- 0 #get significant values for group comparison

And we can plot these differences in a metacoder heat tree.

heat_tree(obj,
          node_label = taxon_names,
          node_size = n_obs, # n_obs is a function that calculates the number of OTUs per taxon
          node_color = log2_median_ratio, # A column from `obj$data$diff_table`
          node_color_interval = c(-0.5, 0.5), # The range of `mean_diff` to display
          node_color_range = c("goldenrod", "gray", "steelblue"), # The color palette used
          node_color_digits = 1,
          node_size_axis_label = "BIN count",
          node_color_axis_label = "Mean difference in sample proportion",
          node_label_size_range = c( 0.005, 0.03),
          layout = "davidson-harel", # The primary layout algorithm
          initial_layout = "reingold-tilford")

#>30 minutes to run

To understand the colouring scheme, read this paragraph carefully from the metacoder tutorial:

What color corresponds to each group depends on the order they were given in the compare_groups function. Since “leaf” is “treatment_1” in the “diff_table”, and “log2_median_ratio” is defined as “log2(treatment_1 / treatment_2)”, when a taxon has more counts in leaf samples, the ratio is positive, therefore taxa more abundant in leafs are colored magenta in this case.

their code has ’node_color_range = c(“cyan”, “gray”, “magenta”)###

If we look at print(obj$data$diff_table) above the plot, we can see that in our case, treatment_1 is ‘WET’. The log2 median ratio is defined as”log2(wet / dry). When a taxon has more counts in the wet season, the ratio is positive, therefore taxa more abundant in the wet season are coloured ‘steelblue’ in our case (not magenta).

But more interesting for us, to separate them according to focal groups. In orange for dry season, in blue for wet season. The following are the differences between seasons using wilcox_p_vlue function we ran on the diff_table above.

obj %>%
  metacoder::filter_taxa(taxon_names == "Coleoptera",#here is to fliter the figure by groups
              subtaxa = TRUE) %>%
  heat_tree(node_label = taxon_names,
            node_size = n_obs, # n_obs is a function that calculates the number of OTUs per taxon
            node_color = log2_median_ratio, # A column from `obj$data$diff_table`
            node_color_trans = 'linear',
            node_color_interval = c(-0.5, 0.5), # The range of `mean_diff` to display
            node_color_range = c("goldenrod", "gray", "steelblue"), # The color palette used
            node_color_digits = 1,
            node_size_axis_label = "BIN count",
            node_color_axis_label = "log 2 ratio of median counts",
            node_label_size_range = c( 0.005, 0.03),
            layout = "davidson-harel", # The primary layout algorithm
            initial_layout = "reingold-tilford")

obj %>%
  metacoder::filter_taxa(taxon_names =="Lepidoptera",#here is to fliter the figure by groups
              subtaxa = TRUE) %>%
  heat_tree(node_label = taxon_names,
            node_size = n_obs, # n_obs is a function that calculates the number of OTUs per taxon
            node_color = log2_median_ratio, # A column from `obj$data$diff_table`
            node_color_trans = 'linear',
            node_color_interval = c(-0.5, 0.5), # The range of `mean_diff` to display
            node_color_range = c("goldenrod", "gray", "steelblue"), # The color palette used
            node_color_digits = 1,
            node_size_axis_label = "BIN count",
            node_color_axis_label = "log 2 ratio of median counts",
            node_label_size_range = c( 0.005, 0.03),
            layout = "davidson-harel", # The primary layout algorithm
            initial_layout = "reingold-tilford")

obj %>%
  metacoder::filter_taxa(taxon_names == "Hemiptera",#here is to fliter the figure by groups
              subtaxa = TRUE) %>%
  heat_tree(node_label = taxon_names,
            node_size = n_obs, # n_obs is a function that calculates the number of OTUs per taxon
            node_color = log2_median_ratio, # A column from `obj$data$diff_table`
            node_color_trans = 'linear',
            node_color_interval = c(-0.5, 0.5), # The range of `mean_diff` to display
            node_color_range = c("goldenrod", "gray", "steelblue"), # The color palette used
            node_color_digits = 1,
            node_size_axis_label = "BIN count",
            node_color_axis_label = "log 2 ratio of median counts",
            node_label_size_range = c( 0.005, 0.03),
            layout = "davidson-harel", # The primary layout algorithm
            initial_layout = "reingold-tilford")

obj %>%
  metacoder::filter_taxa(taxon_names == "Hymenoptera",#here is to fliter the figure by groups
              subtaxa = TRUE) %>%
  heat_tree(node_label = taxon_names,
            node_size = n_obs, # n_obs is a function that calculates the number of OTUs per taxon
            node_color = log2_median_ratio, # A column from `obj$data$diff_table`
            node_color_trans = 'linear',
            node_color_interval = c(-0.5, 0.5), # The range of `mean_diff` to display
            node_color_range = c("goldenrod", "gray", "steelblue"), # The color palette used
            node_color_digits = 1,
            node_size_axis_label = "BIN count",
            node_color_axis_label = "log 2 ratio of median counts",
            node_label_size_range = c( 0.005, 0.03),
            layout = "davidson-harel", # The primary layout algorithm
            initial_layout = "reingold-tilford")

obj %>%
  metacoder::filter_taxa(taxon_names == "Diptera",#here is to fliter the figure by groups
              subtaxa = TRUE) %>%
  heat_tree(node_label = taxon_names,
            node_size = n_obs, # n_obs is a function that calculates the number of OTUs per taxon
            node_color = log2_median_ratio, # A column from `obj$data$diff_table`
            node_color_trans = 'linear',
            node_color_interval = c(-0.5, 0.5), # The range of `mean_diff` to display
            node_color_range = c("goldenrod", "gray", "steelblue"), # The color palette used
            node_color_digits = 1,
            node_size_axis_label = "BIN count",
            node_color_axis_label = "log 2 ratio of median counts",
            node_label_size_range = c( 0.005, 0.03),
            layout = "davidson-harel", # The primary layout algorithm
            initial_layout = "reingold-tilford")

obj %>%
  metacoder::filter_taxa(taxon_names == "Trichoptera",#here is to fliter the figure by groups
              subtaxa = TRUE) %>%
  heat_tree(node_label = taxon_names,
            node_size = n_obs, # n_obs is a function that calculates the number of OTUs per taxon
            node_color = log2_median_ratio, # A column from `obj$data$diff_table`
            node_color_trans = 'linear',
            node_color_interval = c(-0.5, 0.5), # The range of `mean_diff` to display
            node_color_range = c("goldenrod", "gray", "steelblue"), # The color palette used
            node_color_digits = 1,
            node_size_axis_label = "BIN count",
            node_color_axis_label = "log 2 ratio of median counts",
            node_label_size_range = c( 0.005, 0.03),
            layout = "davidson-harel", # The primary layout algorithm
            initial_layout = "reingold-tilford")

obj %>%
  metacoder::filter_taxa(taxon_names == "Blattodea",#here is to fliter the figure by groups
              subtaxa = TRUE) %>%
  heat_tree(node_label = taxon_names,
            node_size = n_obs, # n_obs is a function that calculates the number of OTUs per taxon
            node_color = log2_median_ratio, # A column from `obj$data$diff_table`
            node_color_trans = 'linear',
            node_color_interval = c(-0.5, 0.5), # The range of `mean_diff` to display
            node_color_range = c("goldenrod", "gray", "steelblue"), # The color palette used
            node_color_digits = 1,
            node_size_axis_label = "BIN count",
            node_color_axis_label = "log 2 ratio of median counts",
            node_label_size_range = c( 0.005, 0.03),
            layout = "davidson-harel", # The primary layout algorithm
            initial_layout = "reingold-tilford")

Now we can see some clear differences between seasons. As expected, the wet season has higher diversity compared to the dry season, and here we can see where the differences are concentrated (branches in blue indicate that those taxa are significantly more abundant in the wet season when compared to the dry). Gray branches indicate no significant difference between number of reads for that taxa in wet or dry season.

These are visually interesting results but we still need to compare these data statistically through diversity and ordination analyses which will be covered in the following script.

saveRDS(object = obj, file = "data/taxmap_object.rds") #this will save the clean, filtered taxmap object for future uses