3 16s Analysis

In this tutorial we are going to see/use different commands and tools to perform the 16s analysis on long reads NanoPore. However, the practical analysis will cover only some of these steps as data has been prepared in advance due to their computation and time consuming limits.

This guide has been created with the purpose of a practical crush course and it is not intended as a complete reference but rather as a beginners pipeline to analyze NanoPore generated data.

3.1 Commands and features that we will use in our practice

3.1.1 touch

Create a new empty file.

Example: touch prova.txt

3.1.2 echo

The echo command is a built-in Linux feature that prints out arguments as the standard output. echo is commonly used to display text strings or command results as messages.

Example: echo "Hello World"

3.1.3 cat

Concatenate or print the contents of a file

Example: cat prova.txt

3.1.4 $

Append $ to the variable name to access the variable value.

Example:

var="Hello World"
echo $var

Output:

Hello World

3.1.5 chmod

Change the permissions of a file or directory and make it executable.

  • use the “chmod +x” command on a system file to give permission to all users to execute it.
  • use the “chmod u+x” for made the file executable for your user.

Example: chmod u+x s01_filtering.sh

3.1.6 basename

It removes the path from a file string, providing only its filename and trailing suffix from given file names.

Example:

basename /path/to/filename.txt

Output:

filename.txt

3.1.7 Create a shell scripts

Sometime you don’t need to run a command at a time, we can pre-think, organize series of actions (a program) that you can then execute within Bash.

For example we can write a shell script that runs a series of commands and we can run the script from the terminal to execute all the steps that we have integrated in the script.

For example, lets assume that we want to visualize the first four reads from a FASTQ and redirect to a file. For this task we are going to create an empty file and write in it the following text:

#!/bin/bash

cat /SERVER/16s_data/mysample.FASTQ | head -n 4 > first_read.fastq

var="Hello World"
echo $var

Now we give the ‘executable’ permission to our script in order to be executed:

chmod u+x myscript.sh

Now we execute our script:

./myscript.sh

  • The first row must be #!/bin/bash to allows the shell to interpret your code with bash
  • The symbol | is the PIPE, it lets you connect actions: the output of a command is the input of the next command
  • The command head allows to print a specified number of rows from an input.
  • Sometime different actions cannot be linked with a PIPE, in this case we use to write the series of actions in each line, as we did in the last two rows of our scripts.

3.2 Package Manager

3.2.1 Conda

Conda is a powerful command line tool for package and environment management that runs on Windows, macOS, and Linux.

https://conda.io/projects/conda/en/latest/user-guide/getting-started.html

Within conda, you can create, export, list, remove, and update environments that have different versions of Python and/or packages installed in them. Switching or moving between environments is called activating the environment. You can also share an environment file.

3.3 Base Calling

Base calling is the process of translating the electronic raw signal of the sequencer into bases, i.e., ATCG and converting the raw files (FAST5) to a FASTQ files (human-readable), which contains the nucleotide sequences of the reads.

Raw data are huge in terms of storage, and since basecalling is computationally and time demanding, the fastq files are already provided.

However, to perform this step we suggest to use one of the two following tools:

3.3.1 Guppy

https://community.nanoporetech.com/docs/prepare/library_prep_protocols/Guppy-protocol/v/gpb_2003_v1_revax_14dec2018/guppy-software-overview

Guppy usage:

guppy_basecaller \
  --num_callers 4 \
  --cpu_threads_per_caller 64 \
  --input_path \
  --save_path \
  --flowcell FLO-MIN106 \
  --kit SQK-RBK004s

As we can see from the command line this tool requires the flowcell model and eventually barcoding kit information.

3.3.2 Dorado

Recently released by Nanopore, Dorado is a high-performance, easy-to-use, open source basecaller for Oxford Nanopore reads, with the options of super, high and low, accuracy.

https://github.com/nanoporetech/dorado

First download the flowcell model kit (You need to know which flowcell was used):

dorado download --model dna_r10.4.1_e8.2_400bps_hac@v4.1.0

Dorado usage:

dorado basecaller \
  -b 36 \
  --device cpu \
  --emit-fastq
  mysample.FAST5

3.4 Filtering and Trimming

The fastq file containing the 16S sequence need to be filtered based on quality and/or read length, and optional trimmed after passing filter

3.4.1 NanoFilt

NanoFilt - filtering and trimming of long read sequencing data

https://github.com/wdecoster/nanofilt

Requirement

To execute this tool, you need to activate the conda environment

mamba activate /home/irsa/miniconda3/envs/ONTpp

STEP TIPS:

  • Your input data are available in the following path: /SERVER/16s_data/
  • Your task should be to use NanoFilt on each fastq file, using the following options:
    • --length LENGTH Filter on a minimum read length
    • --maxlength MAXLENGTH Filter on a maximum read length
    • --quality QUALITY Filter on a minimum average read quality score
  • Create the folder for your output files (to use in your script)
  • To perform this step you should create a script (Recommended name: s01_nanofilt.sh), and give it permission to be executed: chmod u+x s01_nanofilt.sh
  • Execute your script as follow: ./s01_nanofilt.sh
  • Look at the results
[SPOILER] - Scripts that we will use

We create an empty file called s01_nanofilt.sh

touch s01_nanofilt.sh

We can write our actions in the scripts as follows:

#!/bin/bash

for i in /SERVER/16s_data/*fastq
do
    f=$(basename "$i" .fastq)
    echo "$f"
    echo "$i"
    cat $i | NanoFilt -q 9 -l 1200 --maxlength 1800 > /home/irsa/analisi_16s/output_s01/"$f"-nf.fastq
done

Create output directory for this script (Change irsa with your utenteX name)

mkdir -p /home/irsa/analisi_16s/output_s01/

Change its permission: chmod u+x s01_nanofilt.sh

Execute it: ./s01_nanofilt.sh

3.5 Subsetting

We are going to reduce the number of reads in order to use less resources for our practical analysis. It is important to note that this step is not part of a common pipeline.

3.5.1 BBMAP Tools

BBMap - short read aligner for DNA/RNAseq, and other bioinformatic tools, including BBMap.

https://github.com/BioInfoTools/BBMap

From this step, we use a script provided by BBMAP tools collection, called: reformat.sh, which reformats reads to change ASCII quality encoding, interleaving, file format, or compression format.

Requirement

To execute this tool, you need to activate the conda environment:

mamba activate /home/irsa/miniconda3/envs/ONTpp

STEP TIPS:

  • Your input data should be available from the output folder of the previous step
  • Your task should be to use reformat.sh on each output files obtained from the previous step, using the following options:
    • in=<file> Input file
    • out=<outfile> Ouput file
    • samplereadstarget=10000 Exact number of OUTPUT reads (or pairs) desired.
  • Create the folder for your output files (to use in your script)
  • To perform this step you should create a script (Recommended name: s02_subsampling.sh), and give it permission to be executed: chmod u+x s02_subsampling.sh
  • Execute your script as follow: ./s02_subsampling.sh
  • Look at the results
[SPOILER] - Scripts that we will use

We create an empty file called s02_subsampling.sh

touch s02_subsampling.sh

We can write our actions in the scripts as follows:

#!/bin/bash

# Create output directory for this script
mkdir -p /home/irsa/analisi_16s/output_s02/

for i in /home/irsa/analisi_16s/output_s01/*-nf.fastq
do
    f=$(basename "$i" -nf.fastq)
    echo "$f"
    echo "$i"
    reformat.sh in="$i" out=/home/irsa/analisi_16s/output_s02/"$f"-ss-nf.fastq samplereadstarget=10000
done

Create output directory for this script (Change irsa with your utenteX name)

mkdir -p /home/irsa/analisi_16s/output_s02/

Change its permission: chmod u+x s02_subsampling.sh

3.6 Taxonomic Assignment

Last step, the sequences are compared to a reference database for taxonomic assignment and a relative abundance estimator for 16S genomic sequences.

3.6.1 EMU

Emu - species-level taxonomic abundance for full-length 16S reads.

This tool use a method optimized for error-prone full-length reads. However, it can be used for short-reads.

https://github.com/treangenlab/emu

To perform this annotation, we need a reference database that contains all the taxonomic information, which was already been downloaded in the following path: /SERVER/emu_database

Requirement

To execute this tool, you need to activate the conda environment:

mamba activate /home/irsa/miniconda3/envs/emu

STEP TIPS:

  • Your input data should be available from the previous step
  • Your task should be to use emu abundance on each output files obtained from the previous step, using the following options:
    • --type map-ont denote sequencer [short-read:sr, Pac-Bio:map-pb, ONT:map-ont]
    • --keep-counts include estimated read counts for each species in output
    • --output-dir <output_dir> directory for output results (to be created in advance)
    • --output-basename <basename_files> basename of all output files saved in output-dir; default utilizes basename from input file(s)
  • Create the folder for your output files (to use in your script)
  • To perform this step you should create a script (Recommended name: s03_abundance.sh), and give it permission to be executed: chmod u+x s03_abundance.sh
    • Specify the path of the EMU database with the following line as an action: export EMU_DATABASE_DIR=/SERVER/emu_database
    • Last line of your script, should be an action that performs the command emu combine-outputs to create a single table containing all Emu output relative abundances in a single directory. Note this function will select all the .tsv files in the provided directory that contain ‘rel-abundance’ in the filename. Use the following options in your emu combine-outputs command:
      • --counts output estimated counts rather than relative abundance percentage in combined table. Only includes Emu relative abundance outputs that already have ‘estimated counts’
      • tax_id to get results for the most specific taxa level
  • Execute your script as follow: ./s03_abundance.sh
  • Look at the results
[SPOILER] - Scripts that we will use

We create an empty file called s03_abundance.sh

touch s03_abundance.sh

We can write our actions in the scripts as follows:

#!/bin/bash

# mamba activate /home/irsa/miniconda3/envs/emu

export EMU_DATABASE_DIR=/SERVER/emu_database

for i in /home/irsa/analisi_16s/output_s02/*-ss-nf.fastq
do
    f=$(basename "$i" -ss-nf.fastq)
    echo "$f"
    echo "$i"
    emu abundance "$i" --type map-ont --output-basename "$f" --keep-counts --output-dir /home/irsa/analisi_16s/output_s03/ --threads 4
done


emu combine-outputs --counts /home/irsa/analisi_16s/output_s03/ tax_id

Create output directory for this script (Change irsa with your utenteX name)

mkdir -p /home/irsa/analisi_16s/output_s03/

Change its permission: chmod u+x s03_abundance.sh