4 Metagenomics
The metagenomics sequence are produced using a shotgun approach in which each all the gene present inside the samples were sequenced. This sequencing not only extends taxonomic resolution to the species- or strain-level but also provides potential functional information
4.1 Reads Quality Check
After the basecalling, on the fastq file the “quality check” of the sequence can be performed using stats.sh, a tool inc luded in BBMap/BBTools
4.1.1 BBMap / stats.sh
stats.sh - Generates basic assembly statistics such as scaffold count, N50, L50, GC content, gap percent, etc. Works with fasta and fastq only (gzipped is fine).
https://github.com/BioInfoTools/BBMap/blob/master/sh/stats.sh
4.2 Assembly
One of the most important step for the metagenomic analysis is the Assembly, in which the reads were compared, aligned and overlapped in order to create longer sequenced called “CONTIGS”.
4.2.1 Flye (Metaflye)
Setup Conda Env
mamba create -n flye -c conda-forge -c bioconda flye
Flye is a de novo assembler for single molecule sequencing reads using repeat graphs as core data structure. Compared to de Bruijn graphs (which require exact k-mer matches), repeat graphs are built using approximate sequence matches.
https://github.com/fenderglass/Flye
Options used
--nano-raw ONTregular reads--metaenables the mode for metagenome/uneven coverage assembly--out-dir <outputdir>Output directory
4.3 Read mapping
To quantify the coverage, we map the original input reads to the contig
4.3.1 Minimap2
A versatile pairwise aligner for genomic and spliced nucleotide sequences, used for evaluate the coverage of the original reads on each previously created contig using pairwise alignment.
A report and two charts are generated with complementary information, showing a summary of the DNA-Seq Alignment results.
This page contains information about the reference genome sequences, the input FASTQ files, and a results overview.
The last section is divided into several subsections: globals, paired information, ACTG content, coverage, mapping quality, insert size, mismatches, and indels.
Minimap2 rates an alignment by the score of the max-scoring sub-segment, excluding introns, and marks the best alignment as primary in SAM.
Sequence Alignment Map (SAM) is a text-based format originally for storing biological sequences aligned to a reference sequence. Practically, SAM is a TAB-delimited text format consisting of a header section and an alignment section. Header lines start with ‘@’, while alignment lines do not.
https://github.com/lh3/minimap2
Example command line
Options:
- map-ont Align noisy long reads of ~10% error rate to a reference genome. This is the default mode
4.3.2 samtools
Samtools is a package for reading/writing/editing/indexing/viewing SAM/BAM/CRAM format.
A BAM file (*.bam) is a compressed binary version (BGZF format) of a SAM file that is used to represent aligned sequences. This file can be created starting from a SAM (using samtools) file in order to reduce its size.
4.4 Binning
In metagenomics, binning is the process of grouping reads or contigs and assigning them to individual genome, called “MAGs” (Metagenome Assembled Genomes).
4.4.1 SemiBin2
Setup Conda Env
mamba create -n semibin -c conda-forge -c bioconda semibin
SemiBin is a command line tool for metagenomic binning with semi-supervised siamese neural network using additional information from reference genomes and contigs themselves. It supports single sample, co-assembly, and multi-samples binning modes.
https://github.com/BigDataBiology/SemiBin
3 Options are available for this tool:
single_easy_bin: Running with single-sample binningmulti_easy_bin: Running with multi-sample binningco-assembly: samples are co-assembled first (as if the pool of samples were a single sample) and then bins are constructed from this pool of co-assembled contigs.
You will need the following inputs:
- A contig file (contig.fa in the example below)
- BAM file(s) from mapping short reads to the contigs, sorted (mapped_reads.sorted.bam in the example below)
The single_easy_bin command can be used to produce results in a single step
SemiBin2 single_easy_bin --sequencing-type=long_read --input-fasta assembly.fasta --input-bam aln.bam --output output_folderAlternatively, you can train a new model for that sample, by not passing in the --environment flag.
This is the fastest option and should work the best if you have metagenomes from one of our prebuilt habitats (alternatively, you can use the global “habitat” which combines all of them).
4.5 MAGs Quality Check
As well as the original reads and contigs, also MAGs can be checked to extract the ones that can be considered High Quality.
4.5.1 CheckM2
CheckM2 - Rapid assessment of genome bin quality using machine learning
https://github.com/chklovski/CheckM2
CheckM2 has universally trained machine learning models it applies regardless of taxonomic lineage to predict the completeness and contamination of genomic bins. Completeness is the percentage of the mapped genome that were covered by each mag. Contamination is the inclusion of foreign sequences on the mags
The strain heterogeneity (SH) index indicates the proportion of the contamination that appears to be from the same or similar strains (as determined with an AAI threshold).
In order to extract the completeness and contamination for each Mags, You will also need to download and install the external DIAMOND database that CheckM2 relies on for rapid annotation.
Setup Conda Env and install CheckM2 Database
mamba create -n checkm2 -c bioconda -c conda-forge checkm2
mamba activate checkm2
pip install CheckM2
mkdir -p /path/to/checkm2_database
checkm2 database --download --path /path/to/checkm2_database/As we can see in the following command, the database path can also be set by using the environmental variable.
The main use of CheckM2 is to predict the completeness and contamination of metagenome-assembled genomes (MAGs)
4.6 Taxonomic Classification
4.6.1 GTDB-Tk
4.6.1.1 Setup Conda Env
mamba create -n gtdbtk -c conda-forge -c bioconda gtdbtk=2.3.2
GTDB-tk - assigning taxonomic classifications
https://github.com/Ecogenomics/GTDBTk
GTDB-Tk is the software toolkit used for assigning objective taxonomic classifications to bacterial and archaeal genomes based on the Genome Database Taxonomy (GTDB). It is designed to work with recent advances that allow hundreds or thousands of metagenome-assembled genomes (MAGs) to be obtained directly from environmental samples.
GTDB-Tk requires an external database that needs to be downloaded and unarchived:
https://ecogenomics.github.io/GTDBTk/installing/index.html#gtdb-tk-reference-data
To perform the taxonomic classification, we use the workflow function classify_wf which consists (internally) of four steps: ani_screen, identify, align, and classify
gtdbtk classify_wf --genome_dir selected_genomes/ --out_dir classify_wf_out --extension fa --force --cpus 32 --skip_ani_screen --pplacer_cpus 32Options used:
--genome_dir <directory>directory containing genome files in FASTA format
--out_dir <directory>directory to output files--extension faextension of files to process, gz = gzipped--forcecontinue processing if an error occurs on a single genome--skip_ani_screenSkip the ani_screening step to classify genomes using mash and FastANI--pplacer_cpus 32number of CPUs to use during pplacer placement
List of output files:
- summary.tsv: Classifications for bacterial and archaeal genomes (see the GTDB-Tk documentation for details). Here we will find:
- fastani_reference: indicates the accession number of the reference genome (species) to which a user genome was assigned based on ANI and AF. ANI values are only calculated when a query genome is placed within a defined genus and are evaluated for all reference genomes in that genus.
- fastani_reference_radius: indicates the species-specific ANI circumscription radius of the reference genomes used to determine if a query genome should be classified to the same species as the reference. https://ecogenomics.github.io/GTDBTk/files/summary.tsv.html
- fastani_af: indicates the alignment fraction (AF) between the query and above reference genome.
- closest_placement_reference: indicates the accession number of the reference genome when a genome is placed on a terminal branch.
- classification_method: indicates the rule used to classify the genome.
- classify.tree.gz: Reference tree in Newick format containing query genomes placed with pplacer.
- markers_summary.tsv: A summary of unique, duplicated, and missing markers within the 120 bacterial marker set, or the 122 archaeal marker set for each submitted genome.
- msa.fasta.gz: FASTA file containing MSA of submitted and reference genomes.
- filtered.tsv: A list of genomes with an insufficient number of amino acids in MSA.
- log: Log files.
- failed_genomes.tsv: A list of genomes for which the GTDB-Tk analysis failed, e.g. because Prodigal could not detect any genes.
- gtdbtk_summary.tsv: A summary table of the GTDB-Tk classification results for all bins
The taxonomic classification of each bacterial and archaeal genome is contained in the [prefix].[domain].summary.tsv output files.
A strain identifier is used as a placeholder for the genus name when there is no existing genus name and no binomially named representative genome. This placeholder genus name is generally derived from the oldest representative genome within the lineage and formed from NCBI organism name or NCBI infraspecific/strain ID.
4.7 Functional Annotation
4.7.1 Anvi’o
For the final and detailed analysis of the MAGs (both from short and long reads) several specific tool were developed in the last years (i.e. is Anvi’o).
Anvi’o is a comprehensive platform that brings together many aspects of today’s cutting-edge computational strategies of data-enabled microbiology, including genomics, metagenomics, metatranscriptomics, pangenomics, metapangenomics, phylogenomics, and microbial population genetics in an integrated and easy-to-use fashion through extensive interactive visualization capabilities.
https://github.com/merenlab/anvio
Setup anvio and download databases
mamba create -y --name anvio-8 python=3.10
mamba activate anvio-8
mamba install -y -c conda-forge -c bioconda python=3.10 \
sqlite prodigal idba mcl muscle=3.8.1551 famsa hmmer diamond \
blast megahit spades bowtie2 bwa graphviz "samtools>=1.9" \
trimal iqtree trnascan-se fasttree vmatch r-base r-tidyverse \
r-optparse r-stringi r-magrittr bioconductor-qvalue meme ghostscript
mamba install -y -c bioconda fastani
curl -L https://github.com/merenlab/anvio/releases/download/v8/anvio-8.tar.gz \
--output anvio-8.tar.gz
sudo apt install build-essential
pip install anvio-8.tar.gz
mkdir -p /SERVER/anvio-db/pfam /SERVER/anvio-db/kegg /SERVER/anvio-db/cazy /SERVER/anvio-db/scg
anvi-setup-kegg-data --mode KOfam --only-download --kegg-data-dir /SERVER/anvio-db/kegg/
anvi-setup-pfams --pfam-data-dir /SERVER/anvio-db/pfam/
anvi-setup-cazymes --cazyme-data-dir /SERVER/anvio-db/cazy/
anvi-setup-scg-taxonomy --scgs-taxonomy-data-dir /SERVER/anvio-db/scg/ --resetthe following workflow is useful for converting a bunch of genomes into an anvi’o-compatible format. It generates contigs databases from each input FASTA file, and subsequently runs a variety of annotation programs of your choice to populate these databases with some useful information for your downstream work (i.e. functions, single-copy-core genes, taxonomy, etc).
To start things going with this workflow, first ask anvi’o to give you a default workflow-config file for the contigs workflow:
Used options:
- -w contigs select the different type of “workflow (i.e. contig, metagenome …)
- --get-default-config contigs.json generate a default config file (.json) for a given workflow
https://anvio.org/help/main/workflows/contigs/
Here we create a file_list.tsv
echo -e "name\tpath" > file_list.tsv && find path/to/selected_genomes -type f -exec sh -c 'echo -e "$(basename "{}" .fa)\t{}"' \; >> file_list.tsvIf everything looks alright, you can run this workflow the following way from the same folder of file_list.tsv:
Finally, perform the annotation using one of the pre-loaded database
for i in 02_CONTIGS/*.db; # 02_CONTIGS viene generata da anvi-run-workflow
do
echo "$i";
anvi-run-cazymes -c $i --cazyme-data-dir path/to/CAZYdb/;
doneand export the annotation