3 16s Analysis
In this tutorial we are going to see/use different commands and tools to perform the 16s analysis on long reads NanoPore. However, the practical analysis will cover only some of these steps as data has been prepared in advance due to their computation and time consuming limits.
This guide has been created with the purpose of a practical crush course and it is not intended as a complete reference but rather as a beginners pipeline to analyze NanoPore generated data.
3.1 Commands and features that we will use in our practice
3.1.2 echo
The echo command is a built-in Linux feature that prints out arguments as the standard output. echo is commonly used to display text strings or command results as messages.
Example: echo "Hello World"
3.1.5 chmod
Change the permissions of a file or directory and make it executable.
- use the “chmod +x” command on a system file to give permission to all users to execute it.
- use the “chmod u+x” for made the file executable for your user.
Example: chmod u+x s01_filtering.sh
3.1.6 basename
It removes the path from a file string, providing only its filename and trailing suffix from given file names.
Example:
Output:
filename.txt
3.1.7 Create a shell scripts
Sometime you don’t need to run a command at a time, we can pre-think, organize series of actions (a program) that you can then execute within Bash.
For example we can write a shell script that runs a series of commands and we can run the script from the terminal to execute all the steps that we have integrated in the script.
For example, lets assume that we want to visualize the first four reads from a FASTQ and redirect to a file. For this task we are going to create an empty file and write in it the following text:
#!/bin/bash
cat /SERVER/16s_data/mysample.FASTQ | head -n 4 > first_read.fastq
var="Hello World"
echo $varNow we give the ‘executable’ permission to our script in order to be executed:
chmod u+x myscript.sh
Now we execute our script:
./myscript.sh
- The first row must be
#!/bin/bashto allows the shell to interpret your code with bash - The symbol
|is the PIPE, it lets you connect actions: the output of a command is the input of the next command - The command
headallows to print a specified number of rows from an input. - Sometime different actions cannot be linked with a PIPE, in this case we use to write the series of actions in each line, as we did in the last two rows of our scripts.
3.2 Package Manager
3.2.1 Conda
Conda is a powerful command line tool for package and environment management that runs on Windows, macOS, and Linux.
https://conda.io/projects/conda/en/latest/user-guide/getting-started.html
Within conda, you can create, export, list, remove, and update environments that have different versions of Python and/or packages installed in them. Switching or moving between environments is called activating the environment. You can also share an environment file.
3.2.2 Mamba (Recommended)
Mamba is a reimplementation of the conda package manager in C++, so it is faster and more convenient due to its faster dependencies solving.
https://mamba.readthedocs.io/en/latest/user_guide/mamba.html
3.2.2.2 Installing tools
Most of the bioinformatics packages can be searched in Bioconda at this link https://bioconda.github.io, but also in conda-forge channel and then we can install them by executing
mamba install -c bioconda -c conda-forge nanofilt
For our practical analysis, conda environments have already been created.
The following environment were created:
mamba activate /home/irsa/miniconda3/envs/ONTppmamba activate /home/irsa/miniconda3/envs/emu
3.3 Base Calling
Base calling is the process of translating the electronic raw signal of the sequencer into bases, i.e., ATCG and converting the raw files (FAST5) to a FASTQ files (human-readable), which contains the nucleotide sequences of the reads.
Raw data are huge in terms of storage, and since basecalling is computationally and time demanding, the fastq files are already provided.
However, to perform this step we suggest to use one of the two following tools:
3.3.1 Guppy
Guppy usage:
guppy_basecaller \
--num_callers 4 \
--cpu_threads_per_caller 64 \
--input_path \
--save_path \
--flowcell FLO-MIN106 \
--kit SQK-RBK004sAs we can see from the command line this tool requires the flowcell model and eventually barcoding kit information.
3.3.2 Dorado
Recently released by Nanopore, Dorado is a high-performance, easy-to-use, open source basecaller for Oxford Nanopore reads, with the options of super, high and low, accuracy.
https://github.com/nanoporetech/dorado
First download the flowcell model kit (You need to know which flowcell was used):
Dorado usage:
3.4 Filtering and Trimming
The fastq file containing the 16S sequence need to be filtered based on quality and/or read length, and optional trimmed after passing filter
3.4.1 NanoFilt
NanoFilt - filtering and trimming of long read sequencing data
https://github.com/wdecoster/nanofilt
Requirement
To execute this tool, you need to activate the conda environment
mamba activate /home/irsa/miniconda3/envs/ONTpp
STEP TIPS:
- Your input data are available in the following path:
/SERVER/16s_data/ - Your task should be to use NanoFilt on each fastq file, using the following options:
--length LENGTHFilter on a minimum read length--maxlength MAXLENGTHFilter on a maximum read length--quality QUALITYFilter on a minimum average read quality score
- Create the folder for your output files (to use in your script)
- To perform this step you should create a script (Recommended name:
s01_nanofilt.sh), and give it permission to be executed:chmod u+x s01_nanofilt.sh - Execute your script as follow:
./s01_nanofilt.sh - Look at the results
[SPOILER] - Scripts that we will use
We create an empty file called s01_nanofilt.sh
touch s01_nanofilt.sh
We can write our actions in the scripts as follows:
#!/bin/bash
for i in /SERVER/16s_data/*fastq
do
f=$(basename "$i" .fastq)
echo "$f"
echo "$i"
cat $i | NanoFilt -q 9 -l 1200 --maxlength 1800 > /home/irsa/analisi_16s/output_s01/"$f"-nf.fastq
doneCreate output directory for this script (Change irsa with your utenteX name)
mkdir -p /home/irsa/analisi_16s/output_s01/
Change its permission:
chmod u+x s01_nanofilt.sh
Execute it:
./s01_nanofilt.sh
3.5 Subsetting
We are going to reduce the number of reads in order to use less resources for our practical analysis. It is important to note that this step is not part of a common pipeline.
3.5.1 BBMAP Tools
BBMap - short read aligner for DNA/RNAseq, and other bioinformatic tools, including BBMap.
https://github.com/BioInfoTools/BBMap
From this step, we use a script provided by BBMAP tools collection, called: reformat.sh, which reformats reads to change ASCII quality encoding, interleaving, file format, or compression format.
Requirement
To execute this tool, you need to activate the conda environment:
mamba activate /home/irsa/miniconda3/envs/ONTpp
STEP TIPS:
- Your input data should be available from the output folder of the previous step
- Your task should be to use
reformat.shon each output files obtained from the previous step, using the following options:in=<file>Input fileout=<outfile>Ouput filesamplereadstarget=10000Exact number of OUTPUT reads (or pairs) desired.
- Create the folder for your output files (to use in your script)
- To perform this step you should create a script (Recommended name:
s02_subsampling.sh), and give it permission to be executed:chmod u+x s02_subsampling.sh - Execute your script as follow:
./s02_subsampling.sh - Look at the results
[SPOILER] - Scripts that we will use
We create an empty file called s02_subsampling.sh
touch s02_subsampling.sh
We can write our actions in the scripts as follows:
#!/bin/bash
# Create output directory for this script
mkdir -p /home/irsa/analisi_16s/output_s02/
for i in /home/irsa/analisi_16s/output_s01/*-nf.fastq
do
f=$(basename "$i" -nf.fastq)
echo "$f"
echo "$i"
reformat.sh in="$i" out=/home/irsa/analisi_16s/output_s02/"$f"-ss-nf.fastq samplereadstarget=10000
doneCreate output directory for this script (Change irsa with your utenteX name)
mkdir -p /home/irsa/analisi_16s/output_s02/
Change its permission:
chmod u+x s02_subsampling.sh
3.6 Taxonomic Assignment
Last step, the sequences are compared to a reference database for taxonomic assignment and a relative abundance estimator for 16S genomic sequences.
3.6.1 EMU
Emu - species-level taxonomic abundance for full-length 16S reads.
This tool use a method optimized for error-prone full-length reads. However, it can be used for short-reads.
https://github.com/treangenlab/emu
To perform this annotation, we need a reference database that contains all the taxonomic information, which was already been downloaded in the following path: /SERVER/emu_database
Requirement
To execute this tool, you need to activate the conda environment:
mamba activate /home/irsa/miniconda3/envs/emu
STEP TIPS:
- Your input data should be available from the previous step
- Your task should be to use
emu abundanceon each output files obtained from the previous step, using the following options:--type map-ontdenote sequencer [short-read:sr, Pac-Bio:map-pb, ONT:map-ont]--keep-countsinclude estimated read counts for each species in output--output-dir <output_dir>directory for output results (to be created in advance)--output-basename <basename_files>basename of all output files saved in output-dir; default utilizes basename from input file(s)
- Create the folder for your output files (to use in your script)
- To perform this step you should create a script (Recommended name:
s03_abundance.sh), and give it permission to be executed:chmod u+x s03_abundance.sh- Specify the path of the EMU database with the following line as an action:
export EMU_DATABASE_DIR=/SERVER/emu_database - Last line of your script, should be an action that performs the command
emu combine-outputsto create a single table containing all Emu output relative abundances in a single directory. Note this function will select all the .tsv files in the provided directory that contain ‘rel-abundance’ in the filename. Use the following options in youremu combine-outputscommand:--counts outputestimated counts rather than relative abundance percentage in combined table. Only includes Emu relative abundance outputs that already have ‘estimated counts’tax_idto get results for the most specific taxa level
- Specify the path of the EMU database with the following line as an action:
- Execute your script as follow:
./s03_abundance.sh - Look at the results
[SPOILER] - Scripts that we will use
We create an empty file called s03_abundance.sh
touch s03_abundance.sh
We can write our actions in the scripts as follows:
#!/bin/bash
# mamba activate /home/irsa/miniconda3/envs/emu
export EMU_DATABASE_DIR=/SERVER/emu_database
for i in /home/irsa/analisi_16s/output_s02/*-ss-nf.fastq
do
f=$(basename "$i" -ss-nf.fastq)
echo "$f"
echo "$i"
emu abundance "$i" --type map-ont --output-basename "$f" --keep-counts --output-dir /home/irsa/analisi_16s/output_s03/ --threads 4
done
emu combine-outputs --counts /home/irsa/analisi_16s/output_s03/ tax_id
Create output directory for this script (Change irsa with your utenteX name)
mkdir -p /home/irsa/analisi_16s/output_s03/
Change its permission:
chmod u+x s03_abundance.sh