(20-30 min)
Next-generation sequencing (NGS) of 16S rRNA gene amplicons generates large amounts of raw sequence data that need to be processed before being annotated against a reference database. Processing reads into comprehensible biological information consists in reducing the complexity of these huge sequence datasets down to annotation tables, using high-performance algorithms integrated in bioinformatics pipelines (Leipzig 2017). Basically, what we try to do when analysing 16S rRNA gene amplicon NGS data is to reverse the whole process: from sequencing –> library prep –> DNA extraction –> original community –> back to single bacteria. The analysis of 16S rRNA gene amplicon NGS data usually includes the following steps (not necessarily all of them, nor in this order):
Following this pipeline you will end up with the so-called OTU table (with annotation), often clustered at a 97% similarity threshold. OTU stands for Operational Taxonomic Unit and represents one cluster of similar sequences (>97%) belonging to one species or taxon. Sequences are clustered into OTUs mainly for two reasons:
However, to get an approximation of roughly 1 OTU ≈ 1 species, the appropriate threshold proved to be higher than 97% - close to 100% for the V4 region of the 16S rRNA gene (Edgar 2018). Therefore, if we could eliminate sequencing errors, we would not need to cluster sequences into OTUs.
Denoising is the step of removing sequencing noise, i.e., removing sequencing errors from reads. There are several denoising algorithms. This course will focus on the Divisive Amplicon Denoising Algorithm (aka DADA2), an algorithm for correcting errors in Illumina-sequenced amplicons (Rosen et al. 2012; Callahan et al. 2016). DADA2 removes the sequencing errors from reads, yielding true biological sequences; thus it does not need to cluster sequences into OTUs to mask sequencing errors. These unique partitions of sequences are different from OTUs, since they can differ from each other by as little as 1 nucleotide. Therefore, these unique partitions of sequences were designated Amplicon Sequence Variants - ASVs (aka exact sequences or sub-OTUs). We chose DADA2 for several reasons, one being its flexibility to process several types of data, such as 454, Ion Torrent and single- as well as paired-end Illumina amplicon NGS data, with different amplicon sizes, e.g., ITS, 18S and 16S rRNA genes, and full-length 16S rRNA genes (from PacBio), on all operating systems (Ubuntu, macOS and Windows). The two main advantages of applying denoising-based instead of OTU-based approaches, and of working with ASVs vs OTUs, are:
DADA2 is a denoising algorithm but also a pipeline per se, implemented in the R programming language, that is able to perform the following bioinformatic steps:
Despite these differences between denoising- and clustering-based pipelines, the diversity patterns of a given 16S rRNA gene amplicon NGS study should be broadly similar regardless of the kind of pipeline applied (Glassman and Martiny 2018).
(10-15 min)
This tutorial is a replication of the original DADA2 tutorial made by Benjamin Callahan. The dada2 pipeline and R package were developed by Benjamin Callahan and colleagues (Callahan et al. 2016). We will use the same subset of 16S rRNA gene amplicon data from fecal mouse samples published by the Schloss lab (Schloss et al. 2012; Kozich et al. 2013). This dataset was also analyzed with the mothur MiSeq SOP, which runs an OTU-clustering-based pipeline. This subset comprises fecal mouse samples collected daily during the first 150 days post-weaning, of which we use the early (0-9 days) and late (141-150 days) time points. So, one of the questions you can try to answer is: how did the composition of the mouse gut microbiome change from the early days to the late days after weaning?
The 16S rRNA gene amplicon - upstream data analysis course assumes that you will start with paired-end V4 16S rRNA Illumina fastq files:
all the samples came from the same sequencing run (otherwise you need to run each subset from a different sequencing run independently);
demultiplexed (most of the sequencing companies deliver fastq files demultiplexed; however, if you need to demultiplex your samples you can download, install and use sabre);
without primers/adapters (in the future you may need to remove primers/adapters from your samples, if so, you can download, install and use cutadapt (Martin 2011));
Warning: all the requirements needed to follow the first part of the course were already satisfied for you, for convenience, although you may need to go through these steps if you are setting this up on your own computer.
Download 16S rRNA gene amplicon NGS data and metadata (Schloss et al. 2012; Kozich et al. 2013):
Download the Silva reference database (Quast et al. 2012) into a folder named database:
Download and install R version 3.6.2 and Rstudio (R Core Team 2019)
Download and install dada2 v.1.14.0 R package and dependencies (Callahan et al. 2016)
Download and install biom v.2.1.7 command-line python program (McDonald et al. 2012)
Download and install Biome-Shiny: A Shiny R app for microbiome visualization - developed by Henrique Costa from Biodata.pt/IGC
Text: tips and important messages appear indented like this.
Text in bold is a note with auxiliary information that may be useful. Usually it contains bioinformatic terms, explanations and information about R code and syntax.
Answers to questions appear in italic.
This document that you open in RStudio is an R Markdown document (with the file extension .Rmd). A markdown document is a plain text file with some special syntax that can be easily compiled into an html file. An R Markdown document is a markdown document with R code embedded. Therefore, R Markdown is a very useful resource for writing reports, since you can combine text, R code and output, e.g., tables and plots, in one single document.
(5-10 min)
First we will run the R chunk code below to import/load dada2 and its dependency packages. When you import a package, you are importing/loading all its functions, and perhaps data, if it contains any, into your current R session (and memory).
An R code chunk is the special gray text box that allows you to write and run R code inside the R Markdown document. This particular chunk will import the packages required to follow the upstream data analysis of 16S rRNA gene amplicon data. The default R chunk code opened in RStudio uses color-coded text:
green: text, aka strings in programming (note that there are two types of green code - (1) comments, which start with '#', and (2) strings, which always appear in R code between quotation marks, like this: "string")
blue: special functions(), like library() to import packages/libraries, logicals (TRUE/FALSE) and numbers
black: most functions(), signs, like + and x, and parameters (function options that the user sets when calling a function; e.g., below we passed the argument "dada2" to the parameter package)
### Import package and check the package version
library(package = "dada2") # import package 'dada2'
packageVersion(pkg = "dada2") # prints the 'dada2' version that you have imported - see below after running this function
## [1] '1.14.0'
Often you will use functions that involve randomness, such as bootstrapping in statistics (i.e., random sampling with replacement). Since we usually re-analyse the same data countless times, this randomness would cause slight changes in the output, say a number from some calculation, even if you run exactly the same script. To avoid this we will do what is known as setting the seed; that is, we will keep the seed the same no matter how many times we need to run the script. How does this guarantee reproducibility? Computers cannot generate truly random numbers; they rely on a seed to generate pseudorandom numbers. If you know the seed, i.e., if you set the seed, you can predict which pseudorandom number the computer will generate.
Run the R chunk code below to set the seed to ‘1024’.
### Set seed
set.seed(1024) # set seed to the number '1024' - you can use any number
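A minimal demonstration (not part of the pipeline) of why this works - drawing pseudorandom numbers twice with the same seed yields exactly the same result:
### Demonstrate seed-based reproducibility (illustration only)
set.seed(1024); sample(x = 1:100, size = 3) # draw 3 'random' numbers
set.seed(1024); sample(x = 1:100, size = 3) # resetting the same seed reproduces exactly the same 3 numbers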
Below you can find, for the first time in this document, the sign '<-', known as the assignment operator (equivalent to '='); as the name suggests, it assigns something to something else. In the first line of code below it assigns the string "./MiSeq_SOP" to the variable fastqPath. Therefore the variable, or, as it is known in R, the object fastqPath contains the string "./MiSeq_SOP".
Run the R chunk code below to assign the folder MiSeq_SOP to the R object fastqPath.
### set the relative path to the 16S rRNA gene amplicon fastq files directory
fastqPath <- "./MiSeq_SOP" # set the path to the 16S rRNA gene amplicon fastq files that you downloaded into 'MiSeq_SOP' (the './' notation means 'in the current directory')
list.files(fastqPath) # list all the files under the directory 'fastqPath == ./MiSeq_SOP'
## [1] "F3D0_S188_L001_R1_001.fastq" "F3D0_S188_L001_R2_001.fastq"
## [3] "F3D1_S189_L001_R1_001.fastq" "F3D1_S189_L001_R2_001.fastq"
## [5] "F3D141_S207_L001_R1_001.fastq" "F3D141_S207_L001_R2_001.fastq"
## [7] "F3D142_S208_L001_R1_001.fastq" "F3D142_S208_L001_R2_001.fastq"
## [9] "F3D143_S209_L001_R1_001.fastq" "F3D143_S209_L001_R2_001.fastq"
## [11] "F3D144_S210_L001_R1_001.fastq" "F3D144_S210_L001_R2_001.fastq"
## [13] "F3D145_S211_L001_R1_001.fastq" "F3D145_S211_L001_R2_001.fastq"
## [15] "F3D146_S212_L001_R1_001.fastq" "F3D146_S212_L001_R2_001.fastq"
## [17] "F3D147_S213_L001_R1_001.fastq" "F3D147_S213_L001_R2_001.fastq"
## [19] "F3D148_S214_L001_R1_001.fastq" "F3D148_S214_L001_R2_001.fastq"
## [21] "F3D149_S215_L001_R1_001.fastq" "F3D149_S215_L001_R2_001.fastq"
## [23] "F3D150_S216_L001_R1_001.fastq" "F3D150_S216_L001_R2_001.fastq"
## [25] "F3D2_S190_L001_R1_001.fastq" "F3D2_S190_L001_R2_001.fastq"
## [27] "F3D3_S191_L001_R1_001.fastq" "F3D3_S191_L001_R2_001.fastq"
## [29] "F3D5_S193_L001_R1_001.fastq" "F3D5_S193_L001_R2_001.fastq"
## [31] "F3D6_S194_L001_R1_001.fastq" "F3D6_S194_L001_R2_001.fastq"
## [33] "F3D7_S195_L001_R1_001.fastq" "F3D7_S195_L001_R2_001.fastq"
## [35] "F3D8_S196_L001_R1_001.fastq" "F3D8_S196_L001_R2_001.fastq"
## [37] "F3D9_S197_L001_R1_001.fastq" "F3D9_S197_L001_R2_001.fastq"
## [39] "filtered" "HMP_MOCK.v35.fasta"
## [41] "Mock_S280_L001_R1_001.fastq" "Mock_S280_L001_R2_001.fastq"
## [43] "mouse.dpw.metadata" "mouse.time.design"
## [45] "stability.batch" "stability.files"
As you can see we have 6 types of data inside the directory/folder ‘./MiSeq_SOP’:
fastq: the fastq files, with the exception of the files with the prefix Mock, are from mouse fecal samples. The prefix F3 means Female number 3 and D followed by a number denotes the post-weaning day. Since we have paired-end 16S rRNA gene amplicon Illumina MiSeq data (250 bp-long reads spanning the V4 region), the tag R1 denotes forward fastq files and the tag R2 denotes reverse fastq files.
HMP_MOCK.v35.fasta: plain text file with 32 fasta DNA sequences (with some redundant sequences) representing the 21 strains used in the mock community (check the 21 strains that appear in the mock community - Materials and Methods section, Community DNA subsection, from Kozich et al. (2013)) - although only 20 strains were identified in the corresponding fastq sequencing files (Mock_S280_L001_R1_001.fastq and Mock_S280_L001_R2_001.fastq). This will be used later to evaluate the accuracy and possible contamination in this study.
mouse.time.design: plain text file with the experimental design of this study. It contains two columns: (1) a group column with the sample names (the same names as you will see below) and (2) a time categorical/factor column assuming two possible levels, Early (from day 0 to 9) or Late (from day 141 to 150), depending on the post-weaning day.
mouse.dpw.metadata, stability.files and stability.batch: plain text files with metadata and commands needed to run the mothur MiSeq SOP. We will not use these!
Run the R chunk code below to create a path to each forward fastq file - with the pattern _R1_001.fastq - and each reverse fastq file - with the pattern _R2_001.fastq. Notice that after specifying the relative paths of all the forward and reverse fastq files we used the function sort(). This is really important, because it orders forward and reverse fastq files in the same way, as long as they share exactly the same prefix name (with the exception, of course, of the R1 and R2 tags). All downstream steps assume that forward and reverse fastq files are ordered. What does this mean? It means that the 1st element of the vector fastqFwdPath corresponds to the same sample as the 1st element of the vector fastqRevPath, and so on.
Then, retrieve the sampleNames from one of the paths specified before (the sample names are the same for forward and reverse files; only the suffix - _R1_001.fastq/_R2_001.fastq - changes between forward/reverse).
Below you can find, for the first time, in this document the application of a nested function in R. Notice that the function list.files() is given as input to the argument x of the function sort(). This is known as nesting functions - calling functions inside functions (there is no limit to how many functions we can call inside other functions, since the output of a function can be used as the input of the outermost function).
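As a minimal illustration of nesting (not part of the pipeline):
sort(x = sample(x = 1:10, size = 5)) # the output of sample() - 5 random numbers - is given directly as the input x of sort()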
An R vector is a 1-D data structure comprising elements of the same class, e.g., numeric or character, built with the c() function, which stands for combine - a <- c(1, 2, 3, 45, 56) assigns a vector of class numeric, containing the values 1, 2, 3, 45, 56, to the variable name a.
# create a path to each forward and reverse fastq file
fastqFwdPath <- sort(x = list.files(path = fastqPath, pattern = "_R1_001.fastq", full.names = TRUE)) # sort fwd fastq file paths
fastqRevPath <- sort(x = list.files(path = fastqPath, pattern = "_R2_001.fastq", full.names = TRUE)) # sort rev fastq file paths
# extract the sample names
sampleNames <- sapply(X = strsplit(x = basename(path = fastqFwdPath), split = "_"), FUN = `[`, 1) # extract sample names from fastqFwdPath
sampleNames # print the sample names
## [1] "F3D0" "F3D1" "F3D141" "F3D142" "F3D143" "F3D144" "F3D145" "F3D146"
## [9] "F3D147" "F3D148" "F3D149" "F3D150" "F3D2" "F3D3" "F3D5" "F3D6"
## [17] "F3D7" "F3D8" "F3D9" "Mock"
Are you sure that the file paths of forward and reverse fastq files are sorted in the same way? Below you can use an in-house-developed function to check whether fastqFwdPath and fastqRevPath are sorted (sanity check!).
### Compare if fwd and rev fastq file lists correspond (if are sorted)
source("./scripts/biodataPtCrashCourse.R") # import R script with in-house-developed functions
## compareSampleNames(): in-house-developed function to compare the vectors of fwd and rev sample names
compareSampleNames(fwdPathList = fastqFwdPath, revPathList = fastqRevPath, splitSymbol = "_", pickElement = 1) # function that takes 4 arguments: (1-2) `fastqFwdPath` and `fastqRevPath`, the fwd and rev path vectors; (3) `splitSymbol` (in our case sample names are separated by the symbol "_"); and (4) `pickElement = 1`, meaning that after splitting the names by "_" we pick the first element (which contains the sample name)
## [1] "./MiSeq_SOP/F3D0_S188_L001_R1_001.fastq and ./MiSeq_SOP/F3D0_S188_L001_R2_001.fastq are sorted!"
## [1] "./MiSeq_SOP/F3D1_S189_L001_R1_001.fastq and ./MiSeq_SOP/F3D1_S189_L001_R2_001.fastq are sorted!"
## [1] "./MiSeq_SOP/F3D141_S207_L001_R1_001.fastq and ./MiSeq_SOP/F3D141_S207_L001_R2_001.fastq are sorted!"
## [1] "./MiSeq_SOP/F3D142_S208_L001_R1_001.fastq and ./MiSeq_SOP/F3D142_S208_L001_R2_001.fastq are sorted!"
## [1] "./MiSeq_SOP/F3D143_S209_L001_R1_001.fastq and ./MiSeq_SOP/F3D143_S209_L001_R2_001.fastq are sorted!"
## [1] "./MiSeq_SOP/F3D144_S210_L001_R1_001.fastq and ./MiSeq_SOP/F3D144_S210_L001_R2_001.fastq are sorted!"
## [1] "./MiSeq_SOP/F3D145_S211_L001_R1_001.fastq and ./MiSeq_SOP/F3D145_S211_L001_R2_001.fastq are sorted!"
## [1] "./MiSeq_SOP/F3D146_S212_L001_R1_001.fastq and ./MiSeq_SOP/F3D146_S212_L001_R2_001.fastq are sorted!"
## [1] "./MiSeq_SOP/F3D147_S213_L001_R1_001.fastq and ./MiSeq_SOP/F3D147_S213_L001_R2_001.fastq are sorted!"
## [1] "./MiSeq_SOP/F3D148_S214_L001_R1_001.fastq and ./MiSeq_SOP/F3D148_S214_L001_R2_001.fastq are sorted!"
## [1] "./MiSeq_SOP/F3D149_S215_L001_R1_001.fastq and ./MiSeq_SOP/F3D149_S215_L001_R2_001.fastq are sorted!"
## [1] "./MiSeq_SOP/F3D150_S216_L001_R1_001.fastq and ./MiSeq_SOP/F3D150_S216_L001_R2_001.fastq are sorted!"
## [1] "./MiSeq_SOP/F3D2_S190_L001_R1_001.fastq and ./MiSeq_SOP/F3D2_S190_L001_R2_001.fastq are sorted!"
## [1] "./MiSeq_SOP/F3D3_S191_L001_R1_001.fastq and ./MiSeq_SOP/F3D3_S191_L001_R2_001.fastq are sorted!"
## [1] "./MiSeq_SOP/F3D5_S193_L001_R1_001.fastq and ./MiSeq_SOP/F3D5_S193_L001_R2_001.fastq are sorted!"
## [1] "./MiSeq_SOP/F3D6_S194_L001_R1_001.fastq and ./MiSeq_SOP/F3D6_S194_L001_R2_001.fastq are sorted!"
## [1] "./MiSeq_SOP/F3D7_S195_L001_R1_001.fastq and ./MiSeq_SOP/F3D7_S195_L001_R2_001.fastq are sorted!"
## [1] "./MiSeq_SOP/F3D8_S196_L001_R1_001.fastq and ./MiSeq_SOP/F3D8_S196_L001_R2_001.fastq are sorted!"
## [1] "./MiSeq_SOP/F3D9_S197_L001_R1_001.fastq and ./MiSeq_SOP/F3D9_S197_L001_R2_001.fastq are sorted!"
## [1] "./MiSeq_SOP/Mock_S280_L001_R1_001.fastq and ./MiSeq_SOP/Mock_S280_L001_R2_001.fastq are sorted!"
Question: How many samples are part of this study? (1 min)
Tip: try to give the sampleNames object to the function length().
There are 20 samples: 19 fecal mouse samples plus one mock community sample.
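So, the solution is:
length(sampleNames) # returns 20 - the no. of samples in this study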
(10 min)
One of the preliminary tasks in the analysis of 16S rRNA gene amplicon NGS data is the inspection of the quality scores (aka Phred (Q) scores) of both forward and reverse fastq files.
Take a look at the following example:
It is based on the interpretation of these plots that you can later make a supported decision about the parameters you will choose to filter and trim the fastq files. Therefore, apply the dada2 function plotQualityProfile() in order to get the quality profiles of forward and reverse files. Give as input the vectors of forward and reverse fastq file paths: fastqFwdPath and fastqRevPath.
### Plot the quality profiles
plotQualityProfile(fl = fastqFwdPath) # plot the fastq profiles for the forward fastq files
plotQualityProfile(fl = fastqRevPath) # plot the fastq profiles for the reverse fastq files
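Plotting all 20 samples at once can be slow and cluttered; since plotQualityProfile() accepts any vector of fastq file paths, you can also inspect just a subset, for example:
plotQualityProfile(fl = fastqFwdPath[1:2]) # quality profiles of the first two forward fastq files only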
Question: Based on the interpretation of these plots, would you say that forward and reverse fastq files are similar in terms of quality? (1 min)
No! Reverse (R2) reads are sequenced after the forward reads, by which point the efficiency of the enzymatic reactions underlying sequencing-by-synthesis (the method used by Illumina technology) has decayed, so their quality is lower. Enzyme efficiency also drops at the beginning and, especially, towards the end of each read, no matter the direction, but this drop is often much more pronounced in reverse reads.
(15 min)
Filtering and trimming fastq files based on quality parameters is an essential step in the analysis of 16S rRNA gene amplicon NGS data. It is really important to use stringent parameters to try to filter and trim out sequencing errors. Later we will perform the denoising step, which tries to correct sequencing errors in reads, but the more errors your reads contain, the more sequences will be discarded. It is well known in NGS data analysis in general that removing errors improves the performance of downstream data analysis.
Truncation: refers to cutting the reads at a specific bp position so that all reads have the same length. If you truncate your reads at 200 bp, reads longer than 200 bp are cut down to 200 bp, and no read shorter than 200 bp is kept.
Trimming: refers to cutting the 5'- and/or 3'-ends of your reads. This includes trimming adapters, barcodes and primers from your reads.
Q score - truncQ: here we will use a conservative quality score of 2, meaning that if a read has at some position a Q score ≤ 2 (i.e., a base-call accuracy below 50%), the read is truncated at that position. If the read length after truncation is shorter than the truncation length (truncLen, see below), the read is discarded. Set truncQ = 2 for both forward and reverse reads (if you give only one number, it is applied to both forward and reverse reads; use truncQ = c(2,3) if you want to set a different Q score for forward and reverse reads).
Read-length - truncLen: the Q scores start to drop close to the 250 bp end of the forward reads and after 150 bp for the reverse reads. Thus a good choice is to truncate forward and reverse reads at the 240 and 160 bp positions, respectively, i.e., truncLen = c(240, 160).
Other important parameters to take into consideration are the number of ambiguous bases, aka N bases (the maxN parameter of the function filterAndTrim() below), and the expected errors (maxEE). Usually the no. of ambiguous bases allowed is zero, because you do not want to keep reads with bases whose identity you do not know; any read with ambiguous bases will be discarded - maxN = 0. The maximum expected errors will be the same, 2, for forward and reverse reads - maxEE = 2 - because the sequencing error rate in Illumina is around 1%, so we allow roughly 2 errors along a 200 bp read.
Since phiX bacteriophage DNA is widely used as a control in sequencing facilities, the parameter rm.phix = TRUE will search for reads that align against the phiX genome and discard them. The option compress = TRUE is given to compress the filtered fastq files (that is why they have the file extension '.gz'). The parameter multithread of the function filterAndTrim() parallelizes the jobs, but it does not work on Windows; so if you want to run this pipeline on a Windows operating system, just set multithread = FALSE.
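To build intuition for maxEE, note that the expected errors (EE) of a read are the sum of the per-base error probabilities implied by its Q scores, EE = sum(10^(-Q/10)). A minimal sketch with hypothetical Q scores (not taken from our data):
### Expected errors from Q scores (illustration only)
qScores <- c(rep(38, 195), rep(20, 5)) # hypothetical 200 bp read: 195 bases at Q38 and 5 bases at Q20
sum(10 ^ (-qScores / 10)) # expected errors ~0.08, well below maxEE = 2, so this read would pass the filter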
Run the next R chunk code to apply the filtering and trimming with the dada2 function filterAndTrim(). The filterAndTrim() function performs the filtering and trimming as specified by the parameters, writes the output, i.e., the filtered fastq files, to the directories specified below - filtFastqFwdPath (for forward fastq files) and filtFastqRevPath (for reverse fastq files) - and also reports the number of sequences kept after filtering/trimming relative to the initial number of reads.
### Filter and trim reads
filtFastqFwdPath <- file.path(fastqPath, "filtered", paste0(sampleNames, "_fwd_filt.fastq.gz")) # relative file path for fwd filtered reads that will be created below
filtFastqRevPath <- file.path(fastqPath, "filtered", paste0(sampleNames, "_rev_filt.fastq.gz")) # relative file path for rev filtered reads that will be created below
## assign to each file path the sample name
names(filtFastqFwdPath) <- sampleNames
names(filtFastqRevPath) <- sampleNames
## filter and trim fwd and rev fastq files writing the new filtered files in compressed - '.gz' - format to the directories specified above
filterTrimReads <- filterAndTrim(fwd = fastqFwdPath, filt = filtFastqFwdPath, rev = fastqRevPath, filt.rev = filtFastqRevPath, truncLen = c(240,160), maxEE = 2, truncQ = 2, maxN = 0, rm.phix = TRUE, compress = TRUE, verbose = FALSE, multithread = TRUE)
knitr::kable(filterTrimReads) # fancy way to print the matrix, you could just do 'filterTrimReads'
| | reads.in | reads.out |
|---|---|---|
F3D0_S188_L001_R1_001.fastq | 7793 | 7113 |
F3D1_S189_L001_R1_001.fastq | 5869 | 5299 |
F3D141_S207_L001_R1_001.fastq | 5958 | 5463 |
F3D142_S208_L001_R1_001.fastq | 3183 | 2914 |
F3D143_S209_L001_R1_001.fastq | 3178 | 2941 |
F3D144_S210_L001_R1_001.fastq | 4827 | 4312 |
F3D145_S211_L001_R1_001.fastq | 7377 | 6741 |
F3D146_S212_L001_R1_001.fastq | 5021 | 4560 |
F3D147_S213_L001_R1_001.fastq | 17070 | 15637 |
F3D148_S214_L001_R1_001.fastq | 12405 | 11413 |
F3D149_S215_L001_R1_001.fastq | 13083 | 12017 |
F3D150_S216_L001_R1_001.fastq | 5509 | 5032 |
F3D2_S190_L001_R1_001.fastq | 19620 | 18075 |
F3D3_S191_L001_R1_001.fastq | 6758 | 6250 |
F3D5_S193_L001_R1_001.fastq | 4448 | 4052 |
F3D6_S194_L001_R1_001.fastq | 7989 | 7369 |
F3D7_S195_L001_R1_001.fastq | 5129 | 4765 |
F3D8_S196_L001_R1_001.fastq | 5294 | 4871 |
F3D9_S197_L001_R1_001.fastq | 7070 | 6504 |
Mock_S280_L001_R1_001.fastq | 4779 | 4314 |
Question: How many reads did you lose during the filter and trim step? (2.5 min)
Tip: the following command assigns the percentage of reads lost during this step to the filterTrimReadsDiscard R object: filterTrimReadsDiscard <- (filterTrimReads[,"reads.in"] - filterTrimReads[,"reads.out"]) / filterTrimReads[,"reads.in"] * 100. Run the previous command and give the output filterTrimReadsDiscard to the function summary() to summarise the no. of reads lost.
So, the solution is: filterTrimReadsDiscard <- (filterTrimReads[,"reads.in"] - filterTrimReads[,"reads.out"]) / filterTrimReads[,"reads.in"] * 100
summary(filterTrimReadsDiscard)
Let's look again at the quality of the forward and reverse reads after applying the filtering/trimming.
Question: Plot the quality profiles of your forward and reverse good-quality/filtered reads with the function plotQualityProfile(). Which objects do you need to give to this function? Are the quality profiles better than before filtering/trimming? (2 min)
You need to pass the objects that contain the paths to the filtered forward or reverse fastq files to the function plotQualityProfile(): filtFastqFwdPath (object with the list of relative paths to the filtered forward fastq files) and filtFastqRevPath (object with the list of relative paths to the filtered reverse fastq files).
# forward quality plots
plotQualityProfile(fl = filtFastqFwdPath) # list of paths to the forward filtered fastq files
# reverse quality plots
plotQualityProfile(fl = filtFastqRevPath) # list of paths to the reverse filtered fastq files
Yes, the quality of the filtered forward and reverse reads looks much better now.
(10 min)
This step tries to learn/estimate the error rates, which are specific to each amplicon NGS run. The most common sequencing errors in Illumina are substitutions (Metzker 2010). Therefore, during this step dada2 tries to estimate/infer error rates for all possible transition and transversion point mutations. The figure below describes how the table of transition probabilities is generated.
Run the R chunk code below to estimate the error rates for forward and reverse reads with the dada2 function learnErrors(). This function takes as input the forward or reverse (so, independently) filtered fastq files: filtFastqFwdPath or filtFastqRevPath. This step is more computationally intensive than any of the previous ones.
### Learn error rates
errFwd <- learnErrors(fls = filtFastqFwdPath, multithread = TRUE) # model/learn the fwd error rates for the filtered fastq files
## 33514080 total bases in 139642 reads from 20 samples will be used for learning the error rates.
errRev <- learnErrors(fls = filtFastqRevPath, multithread = TRUE) # model/learn the rev error rates for the filtered fastq files
## 22342720 total bases in 139642 reads from 20 samples will be used for learning the error rates.
Run the R chunk code below to plot the expected and observed error rate for forward and reverse reads.
points: observed error rate for each consensus Q score
black line: estimated error rates
red line: error rates expected under the nominal definition of the Q score
## Plot errors
plotErrors(dq = errFwd, nominalQ = TRUE) # for fwd
plotErrors(dq = errRev, nominalQ = TRUE) # for rev
(15 min)
The next step in the pipeline is the core algorithm of DADA2, which performs the inference of Amplicon Sequence Variants (ASVs) based on the previously estimated error rates. In addition, it dereplicates reads into unique sequences (100% identical) and removes singletons. Have a look at the figure below:
Run the R chunk code below in order to dereplicate, denoise (= sample inference) and exclude singletons. We will give as input to the dada2 function that performs this step - dada() - the filtered fastq sequences and the error models for forward and reverse reads that we determined before.
The output is a list of dada-class sample objects that contain the Amplicon Sequence Variants (among other useful information: "denoised", "clustering", "sequence", "quality", "birth_subs", "trans", "map", "err_in", "err_out", "opts", "pval"). Do ?dada to learn more about this function, its parameters, and the output and its meaning.
### Denoising
dadaFwd <- dada(derep = filtFastqFwdPath, err = errFwd, multithread = TRUE) # denoise fwd seqs
## Sample 1 - 7113 reads in 1979 unique sequences.
## Sample 2 - 5299 reads in 1639 unique sequences.
## Sample 3 - 5463 reads in 1477 unique sequences.
## Sample 4 - 2914 reads in 904 unique sequences.
## Sample 5 - 2941 reads in 939 unique sequences.
## Sample 6 - 4312 reads in 1267 unique sequences.
## Sample 7 - 6741 reads in 1756 unique sequences.
## Sample 8 - 4560 reads in 1438 unique sequences.
## Sample 9 - 15637 reads in 3590 unique sequences.
## Sample 10 - 11413 reads in 2762 unique sequences.
## Sample 11 - 12017 reads in 3021 unique sequences.
## Sample 12 - 5032 reads in 1566 unique sequences.
## Sample 13 - 18075 reads in 3707 unique sequences.
## Sample 14 - 6250 reads in 1479 unique sequences.
## Sample 15 - 4052 reads in 1195 unique sequences.
## Sample 16 - 7369 reads in 1832 unique sequences.
## Sample 17 - 4765 reads in 1183 unique sequences.
## Sample 18 - 4871 reads in 1382 unique sequences.
## Sample 19 - 6504 reads in 1709 unique sequences.
## Sample 20 - 4314 reads in 897 unique sequences.
dadaRev <- dada(derep = filtFastqRevPath, err = errRev, multithread = TRUE) # denoise rev seqs
## Sample 1 - 7113 reads in 1660 unique sequences.
## Sample 2 - 5299 reads in 1349 unique sequences.
## Sample 3 - 5463 reads in 1335 unique sequences.
## Sample 4 - 2914 reads in 853 unique sequences.
## Sample 5 - 2941 reads in 880 unique sequences.
## Sample 6 - 4312 reads in 1286 unique sequences.
## Sample 7 - 6741 reads in 1803 unique sequences.
## Sample 8 - 4560 reads in 1265 unique sequences.
## Sample 9 - 15637 reads in 3414 unique sequences.
## Sample 10 - 11413 reads in 2522 unique sequences.
## Sample 11 - 12017 reads in 2771 unique sequences.
## Sample 12 - 5032 reads in 1415 unique sequences.
## Sample 13 - 18075 reads in 3290 unique sequences.
## Sample 14 - 6250 reads in 1390 unique sequences.
## Sample 15 - 4052 reads in 1134 unique sequences.
## Sample 16 - 7369 reads in 1635 unique sequences.
## Sample 17 - 4765 reads in 1084 unique sequences.
## Sample 18 - 4871 reads in 1161 unique sequences.
## Sample 19 - 6504 reads in 1502 unique sequences.
## Sample 20 - 4314 reads in 732 unique sequences.
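If you want to inspect the result for a single sample, you can index the returned list by sample name (a minimal sketch, assuming the list kept the sample names we assigned to filtFastqFwdPath):
dadaFwd[["F3D0"]] # print a summary of the dada-class object for sample F3D0
head(dadaFwd[["F3D0"]]$clustering$abundance) # abundances of the first inferred sequence variants of that sample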
Now, it would be interesting to know how many denoised sequences we obtained per sample. With this in mind, run the next R chunk code.
## get the no. of unique and ASVs for fwd
source("./scripts/biodataPtCrashCourse.R") # import the in-house-developed functions
uniqFwd <- countUniqueFromDadaObjcList(dadaFwd)[["unique"]] ## count unique Fwd seqs
denoisedFwd <- countUniqueFromDadaObjcList(dadaFwd)[["denoised"]] ## count denoised Fwd seqs
## get the no. of unique and ASVs for rev
uniqRev <- countUniqueFromDadaObjcList(dadaRev)[["unique"]]
denoisedRev <- countUniqueFromDadaObjcList(dadaRev)[["denoised"]]
## total reads
total <- filterTrimReads[,"reads.in"] # retrieve the total no. of seqs
names(total) <- sampleNames # rename samples
## now lets join these two vectors to the previous matrix (with absolute abundance)
denoisedMtx <- rbind(total, uniqFwd, uniqRev, denoisedFwd, denoisedRev) # you need to be careful when joining vectors indiscriminately
# here we are joining the vectors assuming that the order of elements is the same across all of them and the matrix; since we sorted the samples at the beginning, the sample order should, in principle, remain the same
## Barplot with total, unique and denoised fwd and rev sequences
#barplot(denoisedMtx, main = "Comparison of the no. of total, unique and denoised sequences", xlab = "Samples", ylab = "Absolute no. of sequences", col = c("#EFF3FF", "#BDD7E7", "#6BAED6", "#3182BD", "#08519C"), legend = rownames(x = denoisedMtx), beside = TRUE)
## Barplot only denoised fwd and rev sequences
barplot(denoisedMtx[4:5,], main = "Comparison of the no. of denoised sequences: forward vs. reverse", xlab = "Samples", ylab = "Absolute no. of sequences", col = c("#3182BD", "#08519C"), legend = rownames(x = denoisedMtx[4:5,]), beside = TRUE) # notice that we are running the same plot as before, but this time we indexed the matrix to plot only the last two rows, the ones that contain the no. of denoised forward and reverse reads
On average there are 110.9 and 98.15 denoised sequences in the forward and reverse sample files, respectively. The mock sample has 21 (forward) and 19 (reverse) denoised sequences, which is pretty much the number of strains in this sample: 20.
Question: Why can we not call these denoised sequences ASVs yet? (1 min)
These denoised sequences still need to be merged (and we can lose some of these true sequences during that step) and may still include chimeric sequences. Therefore, even though they are denoised (free of sequencing errors), chimeric sequences - which arise before sequencing, during PCR, and constitute spurious/artificial sequences - may still be among these true biological sequences.
(5-10 min)
After denoising, we will merge each forward read with the reverse-complement of its respective reverse read (have a look at the figure below).
The forward read and the reverse-complement of the reverse read should overlap over a minimum of 12 bases without any mismatch (the maxMismatch allowed in the minOverlap region is 0 by default). By default, if the forward read and the reverse-complement of the reverse read do not overlap, the pair is discarded, as are forward and reverse reads without a mate.
Run the following R chunk code in order to merge the paired-end reads with the dada2 function mergePairs(). Notice that we are giving as input the denoised forward and reverse objects - dadaFwd and dadaRev - as well as the filtered forward and reverse fastq file paths - filtFastqFwdPath and filtFastqRevPath. The output is a list of data frames.
### Merge paired-end reads
mergePE <- mergePairs(dadaF = dadaFwd, derepF = filtFastqFwdPath, dadaR = dadaRev, derepR = filtFastqRevPath, verbose = TRUE) # merge PE reads
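Each element of mergePE is a data frame with one row per merged sequence; a quick peek at a few of its columns (column names as documented for mergePairs()):
head(mergePE[["F3D0"]][, c("abundance", "nmatch", "nmismatch")]) # abundance and overlap statistics of the first merged sequences for sample F3D0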
(5 min)
Now, we will build our Amplicon Sequence Variant table (note that it still contains chimeras!).
For that, we will use the dada2 function makeSequenceTable(), giving as input the previous R object with the merged reads: mergePE. The output is an ASV matrix/table with the distribution of ASVs (columns) across samples (rows).
### Make an ASV table
asvTbl <- makeSequenceTable(samples = mergePE) # tabulate ASVs
At this point it is also interesting to check the length distribution of our reads after merging. This is important because after merging you may get weird sequence lengths that do not correspond to the length of our target region - around 250 bp (the V4 region of the 16S rRNA gene).
histSeqLen <- table(nchar(getSequences(asvTbl))) # the dada2 function `getSequences()` retrieves the sequences (the column names), `nchar()` counts the length of each sequence, and `table()` tabulates the frequency of each read-length
Question: What do you think about the read-length of these ASVs? How many ASVs do we have in asvTbl? (2.5 min)
Tip: try to plot a histogram of the vector nchar(getSequences(asvTbl)) using the function hist() to check the read-lengths. Try to use the function dim() or ncol() with the asvTbl R object to get the number of ASVs in our dataset.
If you just print the R object histSeqLen you will see that our ASVs have a proper read-length of around 250 bp, as expected. dim(asvTbl) gives you the dimensions of the asvTbl matrix, and by extension the dimensions of our dataset: 20 samples and 293 ASVs.
Solution to plot the histogram:
hist(nchar(getSequences(asvTbl)))
Let's imagine that for some reason you found sequences longer than 255 bp, some of them around 300 bp. In that case you would want to remove them, keeping only ASV sequences between 250-256 bp long. For that you could just do: asvTbl2 <- asvTbl[, nchar(colnames(asvTbl)) %in% 250:256]. This takes the previous ASV table and keeps only the ASV columns (colnames(asvTbl)) with a read-length (nchar(colnames(asvTbl))) between 250-256 bp (%in% 250:256). Keep in mind that if you work with other amplicon regions or genes, it may be beneficial to keep sequences that vary in length, because of natural interspecific variation due to indels - for instance, if you work with the ITS region (Internal Transcribed Spacer - ITS1 and ITS2 vary between 200-600 bp), often used as a marker in fungi.
(15 min)
The next step in the dada2 pipeline is the identification and removal of ASVs that may represent chimeric sequences.
For this purpose, we will use the dada2 function removeBimeraDenovo() with method = "consensus". This method is based on the function isBimera(), which assumes that chimeras arise during PCR when incomplete amplification products anneal to other templates, giving origin to chimeric sequences composed of two incomplete true/parent sequences. The assumptions are: chimeric sequences arise from two pre-existing true/parent sequences, and, probabilistically, if this happens it most likely involves the most abundant sequences being amplified. Thus, this method ranks all the ASV sequences and compares the less abundant sequences against the more abundant ones: if a less abundant ASV sequence can be reconstructed from two more abundant ASV sequences, it is flagged as chimeric. method = "consensus" does this on a per-sample basis, i.e., it searches for chimeras in each sample and then removes a flagged chimeric ASV only if the same sequence was flagged in a sufficiently high no. of samples. It outputs an ASV table without chimeras.
Run the R chunk code below to remove chimeras from the ASV table. Give the asvTbl R object created before as input to the dada2 function removeBimeraDenovo() with the argument method = "consensus".
### Remove chimeras from the ASV table
asvTblNoChim <- removeBimeraDenovo(unqs = asvTbl, method = "consensus", multithread = TRUE, verbose = TRUE)
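Chimeric ASVs are often numerous, but they usually account for only a small fraction of the total reads; a quick sanity check, as suggested in the original DADA2 tutorial:
sum(asvTblNoChim) / sum(asvTbl) # fraction of merged reads kept after chimera removal - should be close to 1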
Question: How many ASVs did you get? How many ASVs did you lose during this step? (1.5 min)
Tip: try to use the ncol() function, which gives you the number of columns of a matrix or data frame, with the asvTbl (before removing chimeras) and asvTblNoChim (after removing chimeras) R objects (don't forget that in our case the columns of these ASV tables contain the ASVs found in the mouse fecal dataset).
The asvTblNoChim ASV table contains 232 ASVs. During the chimera-removal step we lost 61 ASVs flagged as chimeric sequences.
Since we now have the final ASV table, we can summarize the number of sequences kept at each step of the pipeline.
For that, run the following piece of R code. Basically, we are creating a function to count the no. of sequences in the previous R objects - getN() - and then building a new matrix with cbind() that contains the no. of sequences kept per sample, with different columns representing distinct steps of the pipeline. Then, we rename the columns and rows of the matrix we created - summaryTblSeq. Finally, we save this matrix with the function write.table() inside the folder output, created just before with the dir.create() function.
### Summarize the no. of sequences kept in each pipeline step
getN <- function(x) sum(getUniques(x)) # function that sums the abundances of the unique sequences returned by `getUniques(x)`, i.e., the total no. of reads
## build a matrix with all the sequences kept in each pipeline step
summaryTblSeq <- cbind(filterTrimReads, # initial reads and filtered/trimmed reads
sapply(dadaFwd, getN), sapply(dadaRev, getN), # denoised sequences
sapply(mergePE, getN), # merged PE sequences
rowSums(asvTblNoChim)) # non-chimeric sequences
## rename the column and row names
colnames(summaryTblSeq) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(summaryTblSeq) <- sampleNames
## create a second summary table seq with one column for the samples
summaryTblSeq2 <- cbind("Samples" = sampleNames, summaryTblSeq)
dir.create("output") # let's create 'output' folder
write.table(x = summaryTblSeq2, file = "./output/summaryTblSeq.tsv", sep = "\t", row.names = FALSE)
knitr::kable(summaryTblSeq)
| | input | filtered | denoisedF | denoisedR | merged | nonchim |
|---|---|---|---|---|---|---|
F3D0 | 7793 | 7113 | 6976 | 6979 | 6540 | 6528 |
F3D1 | 5869 | 5299 | 5227 | 5239 | 5028 | 5017 |
F3D141 | 5958 | 5463 | 5331 | 5357 | 4986 | 4863 |
F3D142 | 3183 | 2914 | 2799 | 2830 | 2595 | 2521 |
F3D143 | 3178 | 2941 | 2822 | 2867 | 2552 | 2518 |
F3D144 | 4827 | 4312 | 4151 | 4228 | 3627 | 3488 |
F3D145 | 7377 | 6741 | 6592 | 6627 | 6079 | 5820 |
F3D146 | 5021 | 4560 | 4450 | 4470 | 3968 | 3879 |
F3D147 | 17070 | 15637 | 15433 | 15507 | 14231 | 13006 |
F3D148 | 12405 | 11413 | 11250 | 11268 | 10529 | 9935 |
F3D149 | 13083 | 12017 | 11857 | 11898 | 11154 | 10653 |
F3D150 | 5509 | 5032 | 4880 | 4925 | 4349 | 4240 |
F3D2 | 19620 | 18075 | 17907 | 17939 | 17431 | 16835 |
F3D3 | 6758 | 6250 | 6149 | 6176 | 5853 | 5491 |
F3D5 | 4448 | 4052 | 3930 | 3994 | 3716 | 3716 |
F3D6 | 7989 | 7369 | 7231 | 7294 | 6865 | 6679 |
F3D7 | 5129 | 4765 | 4646 | 4673 | 4428 | 4217 |
F3D8 | 5294 | 4871 | 4786 | 4802 | 4576 | 4547 |
F3D9 | 7070 | 6504 | 6341 | 6442 | 6092 | 6015 |
Mock | 4779 | 4314 | 4287 | 4288 | 4269 | 4269 |
Question: How many sequences remained through the pipeline? (1 min)
Tip: the expression summaryTblSeq[,6] / summaryTblSeq[,1] * 100 gives you the percentage of sequences that remained since the beginning. Use this expression with the summary() function to get some statistics about the sequences kept through the pipeline.
The number of sequences that remained through the whole pipeline was on average 81.5% per sample, ranging from 72.3% (F3D144) to 89.3% (Mock).
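So, the solution is:
keptSeqPerc <- summaryTblSeq[,6] / summaryTblSeq[,1] * 100 # percentage of input sequences kept per sample
summary(keptSeqPerc) # summary statistics of the percentages of sequences kept through the pipeline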
Let's plot the no. of sequences kept at each step since the beginning. Run the R chunk code below to get a barplot with the absolute number of sequences obtained at each step of the pipeline.
Basically, what the R chunk code is doing is transposing the matrix built above, summaryTblSeq, turning rows into columns and columns into rows with the t() function. We are doing this because, to plot the sample names along the x-axis, the sample names need to be spread across columns and not rows (as before). Then, we give the transposed summaryTblSeqTrans matrix (with samples as columns) to the function barplot() to draw a barplot.
### Barplot with the abs. abundance of sequences
summaryTblSeqTrans <- t(summaryTblSeq) # transpose
barplot(summaryTblSeqTrans, main = "Absolute no. of sequences kept through the pipeline", ylab = "Absolute no. of sequences", xlab = "Samples", col = c("gray", "#EFF3FF", "#BDD7E7", "#6BAED6", "#3182BD", "#08519C"), legend = rownames(summaryTblSeqTrans), beside = TRUE)
Now, we will do the same as before, but this time we will plot the percentage of sequences kept through the pipeline (run the R chunk below).
For that, we need to determine the percentage of sequences kept at each step of the pipeline. We can use the function apply(), which applies another function - in this case function(x) x / summaryTblSeq[,1] * 100 (to get the percentage of sequences per sample) - across the rows or columns of a matrix (in our case across the columns, setting MARGIN = 2). Then, we transpose with t() and plot with barplot(). Run the R code chunk below to get the barplot.
### Barplot in percentage
summaryTblSeqPerc <- apply(X = summaryTblSeq, MARGIN = 2, function(x) x / summaryTblSeq[,1] * 100) # get the correspondent percentage table
summaryTblSeqPercTrans <- t(summaryTblSeqPerc) # transpose
barplot(summaryTblSeqPercTrans, main = "Percentage of sequences kept through the pipeline", ylab = "Percentage of sequences (%)", xlab = "Samples", col = c("gray", "#EFF3FF", "#BDD7E7", "#6BAED6", "#3182BD", "#08519C"), legend = rownames(summaryTblSeqPercTrans), beside = TRUE) # plot it
(15 min)
The last step in the pipeline is the taxonomic classification of the ASV sequences. This is also one of the most computationally intensive steps in the pipeline (check the figure below).
First, we classify the ASVs against the non-redundant SILVA database (version 132) (Quast et al. 2012) (keep in mind that there is a newer version of the SSU SILVA database - 138 - but it is not formatted for dada2 yet) with the RDP Naive Bayesian Classifier algorithm (Wang et al. 2007), implemented in R through the dada2 function assignTaxonomy() (with a minimum confidence of 50%, minBoot = 50). Second, we try to assign species names based on 100% exact matches with the dada2 function addSpecies() (retrieving only unambiguous matches). This outputs the taxonomy matrix table for our ASVs.
Run the R chunk code below to classify taxonomically your ASVs.
### naive Bayes classifier
taxTbl <- assignTaxonomy(seqs = asvTblNoChim, refFasta = "./database/silva_nr_v132_train_set.fa.gz", multithread = TRUE) # assign taxonomy against the SILVA NR database (version 132)
## add species
taxTbl <- addSpecies(taxtab = taxTbl, refFasta = "./database/silva_species_assignment_v132.fa.gz") # add species to the previous assignment, based on 100% exact matches
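To inspect the first assignments without printing the long DNA-sequence row names, you can display a copy with the row names removed (the same trick used in the original DADA2 tutorial):
taxPrint <- taxTbl # copy the taxonomy table just for display
rownames(taxPrint) <- NULL # drop the DNA-sequence row names
head(taxPrint) # show the first taxonomic assignments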
Question: What are the rownames(taxTbl) and the colnames(asvTblNoChim)? (1 min)
The row names of the taxonomy matrix table taxTbl and the column names of the ASV matrix table asvTblNoChim are the ASV DNA sequences. This offers all the convenience of being a consistent label, but it is hard to manipulate, e.g., if you need to know the taxonomy or abundance of the 3rd most abundant ASV (you will not want to write out the full DNA sequence to index/query the taxonomy and ASV tables).
Next, save the asvTblNoChim and taxTbl R objects in rds format (R format) to the output folder in our working environment. Later, if you need these untouched/unmanipulated objects, you can use the inverse function readRDS() to import/read them back into R.
### Save ASV and taxonomy tables in R format
# this can be important if you just need these tables in R format to import later, instead of repeating the whole tutorial
saveRDS(object = asvTblNoChim, file = "./output/asvTblNoChim.rds") # save the ASV table
saveRDS(object = taxTbl, file = "./output/taxTbl.rds") # save the ASV taxonomy
Well, as you saw above, the rownames(taxTbl) and colnames(asvTblNoChim) are the unique denoised DNA sequences of the ASVs found in this study. But it would be much better and easier to manipulate shorter ASV IDs of the form ASV_nrSeq, where nrSeq is just the number of the sequence in the table. We can then save the DNA sequences under these tags/ids to keep track of the true unique identifiers of the ASVs (have a look at the figure).
With this purpose in mind, we will run the R chunk code below. First, we create a new taxonomy matrix table from the previous taxTbl R object, adding a new column called ASV with the cbind() (column bind) function. The new column contains the ASV_nrSeq labels, starting at 1, i.e., pasting ASV_ to 1 - ASV_1 - with paste0(), up to the number of rows in taxTbl, i.e., nrow(taxTbl), giving ASV_232. Then, we set these same ASV_nrSeq labels as the row names of the new matrix with rownames(taxTbl2) <- taxTbl2[,8]. Next, we save the ASV sequences to the fasta file asvFastaDNASequences.fasta, inside the folder output, with the dada2 uniquesToFasta() function, where the fasta headers/identifiers are the same ASV_nrSeq labels created before - ids = taxTbl2[,8] - (to keep trackability!). Finally, we substitute the ASV identifiers with the new ASV_nrSeq labels in the ASV table - asvTblNoChim - as we did for the taxonomy table: first we copy asvTblNoChim to asvTblNoChim2, then rename the ASV column names with the new ASV_nrSeq labels with colnames(asvTblNoChim2) <- taxTbl2[,8], transpose columns into rows and vice-versa to have samples as columns (more convenient) - asvTblNoChim2 <- t(asvTblNoChim2) -, convert the matrix ASV table into a data frame - asvTblNoChim2 <- as.data.frame(asvTblNoChim2) -, add a new column with the new ASV labels - asvTblNoChim2[,"ASV_ID"] <- rownames(asvTblNoChim2) -, and remove the mock community (a synthetic/control community that we want to exclude from downstream analyses), which is the 20th column, while moving the 21st column - with the ASV labels - to the first position - asvTblNoChim2 <- asvTblNoChim2[, c(21,1:19)]. Save the taxonomy (taxTbl.txt) and ASV (asvTblNoChim.txt) tables inside the output folder in tab-delimited format.
Run the R chunk code below.
## keep the trackability of your ASVs
taxTbl2 <- cbind(taxTbl, "ASV" = paste0("ASV_", 1:nrow(taxTbl))) # add a new column with the new ASV labels/ids to the taxonomy table
rownames(taxTbl2) <- taxTbl2[,8] # substitute the DNA sequences in rownames by the new identifiers/tags/ids "ASV_nrSeq" in the taxonomy table
## retrieve the DNA sequences
uniquesToFasta(asvTblNoChim, "./output/asvFastaDNASequences.fasta", ids = taxTbl2[,8])
## do the same for the ASV table (with the distribution)
asvTblNoChim2 <- asvTblNoChim # copy ASV table
colnames(asvTblNoChim2) <- taxTbl2[,8] # substitute column DNA sequences names by "ASV_nrSeq" in the ASV table
asvTblNoChim2 <- t(asvTblNoChim2) # transpose the ASV matrix table
asvTblNoChim2 <- as.data.frame(asvTblNoChim2)
asvTblNoChim2[,"ASV_ID"] <- rownames(asvTblNoChim2)
asvTblNoChim2 <- asvTblNoChim2[, c(21,1:19)] # remove the mock community and put the last as the first column
## Let's save these 2 R objs
write.table(x = taxTbl2, file = "./output/taxTbl.txt", sep = "\t", row.names = FALSE, quote = FALSE) # save taxonomy table
write.table(x = asvTblNoChim2, file = "./output/asvTblNoChim.txt", sep = "\t", row.names = FALSE, quote = FALSE) # save ASV table
# the code above assumes that the rownames(taxTbl) are in the same order as the colnames(asvTblNoChim)
# if you have doubts about it you can run `rownames(taxTbl) == colnames(asvTblNoChim)`
# this will compare the entries of both vectors element-wise, returning TRUE or FALSE depending on whether they match
In order to use the Biome-Shiny R app later, we need to create a biom file that contains the distribution (i.e., abundance) of ASVs across samples together with their taxonomy. To convert the ASV + taxonomy table into the biom file format later, the taxonomy needs to follow some standards (see below).
For this purpose, an in-house-developed function, tax2biom(), was created in the biodataPtCrashCourse.R R script; it performs the following taxonomy formatting: it aggregates all taxonomic levels, separated by ';', each with the prefix 'R__' (R being substituted by the initial of the taxonomic rank, e.g., 'p' for Phylum, 'c' for Class, and so on). It returns a data frame with 2 columns: "ASV_ID" (gene/ASV) and "taxonomy" (the full taxonomy path as it appears in the Greengenes database - convenient to convert to biom format and import into the Biome-Shiny app).
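For instance, one row of the returned data frame might look like this (a hypothetical lineage, shown only to illustrate the format):
## ASV_ID  taxonomy
## ASV_1   k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Muribaculaceae; g__; s__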
Run the following R chunk code to get the ASV table merged with the newly formatted taxonomy, saving it inside the output folder with the name asvTaxTbl.txt (tab-delimited format). The file asvTaxTbl.txt will later be converted into biom format in order to use it with Biome-Shiny.
## put the taxonomy in a compatible format to convert it later to biom format
source("./scripts/biodataPtCrashCourse.R") # import R script with in-house-developed functions
taxTbl2 <- tax2biom(taxTbl2)
## Join ASV and Taxonomy tables into one
asvTaxTbl <- cbind(asvTblNoChim2, "taxonomy" = taxTbl2[,-1]) # exclude the first "ID" column from "taxTbl2" because "asvTblNoChim2" already has this information
write.table(x = asvTaxTbl, file = "./output/asvTaxTbl.txt", sep = "\t", row.names = FALSE, quote = FALSE) # save ASV-taxonomy tables
Below we import and format the mouse.time.design file, which contains information about the fecal mouse samples used in this course and will be used later with the Biome-Shiny R app.
Run the R code chunk below to create the formatted metadata.csv file, to be used later, inside the output directory.
### Import metadata and put it in a biom format too
metadata <- read.table("./MiSeq_SOP/mouse.time.design", header = TRUE)
rownames(metadata) <- metadata[,1]
colnames(metadata) <- c("SampleID", "Condition")
write.csv(x = metadata, file = "./output/metadata.csv", quote = FALSE, row.names = TRUE)
(5 min)
Now we will assess the accuracy of DADA2 and possible contamination. For this purpose we will check whether DADA2 was able to identify all the strains in the mock community correctly. If we identify a bacterial ASV that was not present in the original mock community, that means that something went wrong somewhere between library prep and sequencing and your samples got contaminated. However, this simple comparison does not rule out contamination before library prep and sequencing.
First, select only the Mock sample from the previous/untouched ASV table - mockAsvTbl <- asvTbl["Mock",] -, then filter the vector R object mockAsvTbl to ASVs with abundance higher than 0 and order them from most to least abundant - mockAsvTbl <- sort(x = mockAsvTbl[mockAsvTbl>0], decreasing = TRUE). Then, import the reference mock sequences from HMP_MOCK.v35.fasta - refMockSeq <- getSequences(file.path(fastqPath, "HMP_MOCK.v35.fasta")). Finally, we compare the ASV sequences obtained in the mock community, names(mockAsvTbl), with the reference sequences, refMockSeq, and sum the number of matches. The total number of ASVs identified in the mock community is given by length(mockAsvTbl). If the no. of mock ASVs matching the reference, i.e., compareRefAsvMock, equals the total no. of mock ASVs, that means an accuracy of 100% and no contamination.
Run the R chunk code below to get the number of mock ASV sequences identified against the reference mock strain sequences - HMP_MOCK.v35.fasta.
### Assess the DADA2 accuracy
mockAsvTbl <- asvTbl["Mock",] # retrieve the sample "Mock" from the ASV table
mockAsvTbl <- sort(x = mockAsvTbl[mockAsvTbl>0], decreasing = TRUE) # retrieve only ASVs higher than 0
refMockSeq <- getSequences(file.path(fastqPath, "HMP_MOCK.v35.fasta")) # import reference mock sequences from "HMP_MOCK.v35.fasta"
compareRefAsvMock <- sum(sapply(names(mockAsvTbl), function(x) any(grepl(x, refMockSeq)))) # compare our mock ASV sequences with the reference
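A one-line report of the comparison (a sketch):
cat("DADA2 inferred", length(mockAsvTbl), "ASVs in the Mock sample, of which", compareRefAsvMock, "match the reference mock sequences exactly\n")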
Question: How many mock strains were identified in our dataset? Were we able to identify all the strains in the mock community? Was there any contamination? (1 min)
Tip: the R object compareRefAsvMock gives you the number of strains identified in the mock community/sample.
In total, 20 strains were identified in the mock community/sample. Yes, we identified all 20 mock strains, as expected; therefore the accuracy of DADA2 for this sample was 100%. No, there was no contamination based on this mock comparison. However, we do not have a negative control and, therefore, we cannot rule out contamination before library prep and sequencing.
(2.5 min)
biom: you may need to install biom on your computer (this was already done for you for convenience). If you need to install the biom program, please follow the instructions at: http://biom-format.org/
Finally, we will use the biom program to convert the plain-text ASV table with taxonomy into a Biological Observation Matrix, aka biom, file format (McDonald et al. 2012).
A biom file is a biological observation matrix file that can contain an OTU table, or another gene/feature table, as well as other metadata fields, such as taxonomy data, known as observation-metadata. Usually, the biom program is called from the command line; however, for convenience, we will use the in-house-developed function convertTab2Biom() from the R script biodataPtCrashCourse.R, which calls the biom program directly from the R console. Thus, to convert the ASV table with taxonomy, run the following R chunk code. Give as input the file path (inFile) of the ASV table with taxonomy, ./output/asvTaxTbl.txt, and as output the file path of the new biom file that will be created (outFile) - ./output/asvTable.biom - with the file extension .biom.
Run the R chunk code below.
## Convert ASV table with taxonomy in tab-delimited format into biom format
source("./scripts/biodataPtCrashCourse.R") # import R script with in-house-developed functions
convertTab2Biom(inFile = "./output/asvTaxTbl.txt", outFile = "./output/asvTable.biom")
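If you ever need to perform the conversion yourself outside the wrapper, the biom-format documentation describes a convert command along these lines (a sketch, assuming the biom program is on your PATH and the table header is in the expected form):
## hypothetical direct call to the biom CLI from R; flags as documented at http://biom-format.org/
system('biom convert -i ./output/asvTaxTbl.txt -o ./output/asvTable.biom --table-type="OTU table" --to-hdf5 --process-obs-metadata taxonomy')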
endTime <- Sys.time()
endTime - startTime # running time
## Time difference of 6.151535 mins
sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=pt_PT.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=pt_PT.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=pt_PT.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=pt_PT.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=pt_PT.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dada2_1.14.0 Rcpp_1.0.3
##
## loaded via a namespace (and not attached):
## [1] lattice_0.20-38 png_0.1-7
## [3] Rsamtools_2.0.3 Biostrings_2.52.0
## [5] assertthat_0.2.1 digest_0.6.23
## [7] R6_2.4.1 GenomeInfoDb_1.20.0
## [9] plyr_1.8.5 ShortRead_1.42.0
## [11] stats4_3.6.2 evaluate_0.14
## [13] highr_0.8 ggplot2_3.2.1
## [15] pillar_1.4.3 zlibbioc_1.30.0
## [17] rlang_0.4.2 lazyeval_0.2.2
## [19] S4Vectors_0.22.1 Matrix_1.2-18
## [21] rmarkdown_2.0 labeling_0.3
## [23] BiocParallel_1.18.1 stringr_1.4.0
## [25] RCurl_1.95-4.12 munsell_0.5.0
## [27] DelayedArray_0.10.0 compiler_3.6.2
## [29] xfun_0.11 pkgconfig_2.0.3
## [31] BiocGenerics_0.30.0 htmltools_0.4.0
## [33] tidyselect_0.2.5 SummarizedExperiment_1.14.1
## [35] tibble_2.1.3 GenomeInfoDbData_1.2.1
## [37] IRanges_2.18.3 matrixStats_0.55.0
## [39] crayon_1.3.4 dplyr_0.8.3
## [41] GenomicAlignments_1.20.1 bitops_1.0-6
## [43] grid_3.6.2 gtable_0.3.0
## [45] lifecycle_0.1.0 magrittr_1.5
## [47] scales_1.1.0 RcppParallel_4.4.4
## [49] stringi_1.4.3 farver_2.0.1
## [51] XVector_0.24.0 hwriter_1.3.2
## [53] reshape2_1.4.3 latticeExtra_0.6-29
## [55] RColorBrewer_1.1-2 tools_3.6.2
## [57] Biobase_2.44.0 glue_1.3.1
## [59] purrr_0.3.3 jpeg_0.1-8.1
## [61] parallel_3.6.2 yaml_2.2.0
## [63] colorspace_1.4-1 GenomicRanges_1.36.1
## [65] knitr_1.26
This tutorial is a replication of the original DADA2 tutorial made and maintained by Dr. Benjamin Callahan using the same subset of 16S rRNA gene amplicon data from fecal mouse samples published by the Schloss lab (Schloss et al. 2012; Kozich et al. 2013).
License: CC-BY 4.0
Callahan, Benjamin J, Paul J McMurdie, Michael J Rosen, Andrew W Han, Amy Jo A Johnson, and Susan P Holmes. 2016. “DADA2: High-Resolution Sample Inference from Illumina Amplicon Data.” Nature Methods 13 (7). Nature Publishing Group: 581.
Glassman, Sydney I, and Jennifer BH Martiny. 2018. “Broadscale Ecological Patterns Are Robust to Use of Exact Sequence Variants Versus Operational Taxonomic Units.” MSphere 3 (4). Am Soc Microbiol: e00148–18.
Kozich, James J, Sarah L Westcott, Nielson T Baxter, Sarah K Highlander, and Patrick D Schloss. 2013. “Development of a Dual-Index Sequencing Strategy and Curation Pipeline for Analyzing Amplicon Sequence Data on the Miseq Illumina Sequencing Platform.” Appl. Environ. Microbiol. 79 (17). Am Soc Microbiol: 5112–20.
Leipzig, Jeremy. 2017. “A Review of Bioinformatic Pipeline Frameworks.” Briefings in Bioinformatics 18 (3). Oxford University Press: 530–36.
Martin, Marcel. 2011. “Cutadapt Removes Adapter Sequences from High-Throughput Sequencing Reads.” EMBnet. Journal 17 (1): 10–12.
McDonald, Daniel, Jose C Clemente, Justin Kuczynski, Jai Ram Rideout, Jesse Stombaugh, Doug Wendel, Andreas Wilke, et al. 2012. “The Biological Observation Matrix (Biom) Format or: How I Learned to Stop Worrying and Love the Ome-Ome.” GigaScience 1 (1). BioMed Central: 7.
Metzker, Michael L. 2010. “Sequencing Technologies—the Next Generation.” Nature Reviews Genetics 11 (1). Nature Publishing Group: 31.
Quast, Christian, Elmar Pruesse, Pelin Yilmaz, Jan Gerken, Timmy Schweer, Pablo Yarza, Jörg Peplies, and Frank Oliver Glöckner. 2012. “The Silva Ribosomal Rna Gene Database Project: Improved Data Processing and Web-Based Tools.” Nucleic Acids Research 41 (D1). Oxford University Press: D590–D596.
R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Rosen, Michael J, Benjamin J Callahan, Daniel S Fisher, and Susan P Holmes. 2012. “Denoising Pcr-Amplified Metagenome Data.” BMC Bioinformatics 13 (1). BioMed Central: 283.
Wang, Qiong, George M Garrity, James M Tiedje, and James R Cole. 2007. “Naive Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy.” Appl. Environ. Microbiol. 73 (16). Am Soc Microbiol: 5261–7.