
In shotgun metagenomics, we sequence all DNA from a microbial community and analyze who is present (taxonomic composition) and what they are capable of (functional potential)[2][3]. Below is a step-by-step pipeline for whole-genome shotgun metagenomic analysis, tailored for graduate students and researchers.

Preprocessing: Quality Filtering and Host Removal

Quality Filtering of Raw Reads

High-throughput sequencing reads (FASTQ files) often contain adapter sequences and low-quality regions that can confound analysis[4][5]. We use a tool like Fastp to perform adapter trimming, quality filtering, and basic QC in one step[6]. Fastp processes paired-end reads and outputs cleaned reads plus a quality report.

Command example: Trim adapters and low-quality bases with Fastp:

fastp -i sample_R1.fastq -I sample_R2.fastq \
      -o sample_trimmed_R1.fastq -O sample_trimmed_R2.fastq \
      -q 20 -u 30 -n 5 -l 50 \
      -h fastp_report.html -j fastp_report.json

  • -i and -I: Input paired raw reads (read1 and read2 FASTQ files)[7].
  • -o and -O: Output filenames for trimmed, filtered read1 and read2 FASTQs[7].
  • -q 20: Quality threshold; bases with Phred score <20 are considered low quality (Q20 ≈ 99% base-call accuracy). This threshold is used by the per-read quality filter (see -u) and by any quality-trimming options you enable.
  • -u 30: Maximum percent of bases allowed below quality threshold. Here if >30% of a read’s bases are <Q20, the read is discarded (default in fastp is 40%[8][9]).
  • -n 5: Remove reads with more than 5 undetermined bases (Ns).
  • -l 50: Minimum length 50 bp; reads shorter after trimming are dropped.
  • -h fastp_report.html -j fastp_report.json: Generate an HTML report and JSON log with QC statistics before/after filtering. These reports include read length distribution, quality score distribution, adapter trimming info, etc.

Fastp will automatically detect and trim Illumina adapter sequences (unless disabled)[10], apply the quality, N-content, and length filters described above (sliding-window quality trimming is also available via its --cut_front/--cut_tail/--cut_right options), and discard reads that fail. After this step, sample_trimmed_R1.fastq and …R2.fastq contain high-quality reads ready for downstream analysis, and fastp_report.html can be opened in a browser to verify improvements in read quality.

Removing Host DNA Contamination

If the sample is from a host-associated environment (e.g., human gut, plant root), a substantial portion of reads may come from host DNA. These should be removed to avoid skewing results[11]. We can align reads to the host reference genome and discard aligned reads. A common approach is using the Bowtie2 aligner in very sensitive local mode to capture even partial alignments[12], outputting unmapped reads that remain for analysis.

Preparation: Obtain or build a Bowtie2 index of the host genome (e.g., human GRCh38). For example, using a pre-built index GRCh38_noalt_as for human[13].

Command example: Remove human reads with Bowtie2:

bowtie2 -x GRCh38_noalt_as -1 sample_trimmed_R1.fastq -2 sample_trimmed_R2.fastq \
  --very-sensitive-local --threads 8 \
  --un-conc-gz sample_host_removed \
  -S /dev/null

  • -x GRCh38_noalt_as: Base name of the Bowtie2 index for the host genome (here human GRCh38).
  • -1/-2: Paired trimmed reads as input. Bowtie2 will align these reads to the host reference.
  • --very-sensitive-local: Bowtie2 preset for very sensitive local alignment, allowing soft-clipping of ends. This finds contaminant matches even if adapters or low-quality bases remain at read ends[14]. Local mode (--local) avoids penalizing reads that only partially align to the host (common when adapters or low-quality tails are present)[15].
  • --threads 8: Use 8 CPU threads for faster alignment.
  • --un-conc-gz sample_host_removed: Write read pairs that fail to align concordantly to the host into gzipped FASTQ files. Bowtie2 will produce two files: sample_host_removed.1.gz and .2.gz, containing read1 and read2 from pairs that did not align to the host[16][17]. These are our host-depleted reads. (We rename them to sample_host_removed_R1.fastq.gz/_R2.fastq.gz for clarity after the run[18]; see the rename sketch after this list.)
  • -S /dev/null: We discard the SAM output of alignments, since we only care about unmapped reads.
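
For example, the rename can be done with a quick mv (a minimal sketch, assuming the output files came out as sample_host_removed.1.gz and .2.gz as described above; check the actual names your Bowtie2 version produced):

mv sample_host_removed.1.gz sample_host_removed_R1.fastq.gz
mv sample_host_removed.2.gz sample_host_removed_R2.fastq.gz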

After this, sample_host_removed_R1.fastq.gz and …R2.fastq.gz contain the high-quality, host-screened reads for downstream analysis. Typically, only a small fraction (e.g., <2% for human gut stool[19]) of reads are host if the sample was well-prepared, but this step is crucial for host-rich samples (e.g., oral or skin microbiome)[11]. Removing host reads greatly reduces irrelevant data and speeds up downstream metagenomic analysis.

Note: An alternative to manual Bowtie2 steps is using pipelines like KneadData or BBMap for host filtration[20], but the principle is the same: map to host reference and discard mapped reads.

Taxonomic Profiling of the Microbial Community

Goal: Identify the taxonomic composition of the metagenome – which organisms are present and in what relative abundances. Two major approaches exist: k-mer-based whole-genome classification and marker gene-based profiling.

  • K-mer (Whole-Genome) methods: Classify reads by matching DNA k-mers to reference genomes. Kraken2 is a popular tool that uses a k-mer database of known genomes for ultra-fast classification[21]. It assigns each read to the lowest common ancestor (LCA) of organisms sharing those k-mers[22][23]. This approach can detect a wide range of taxa (including novel species) but may produce some false positives if genomes share k-mers[24].
  • Marker gene methods: Search reads for conserved marker genes (e.g., 16S rRNA, or hundreds of clade-specific protein markers). Tools like MetaPhlAn use curated marker genes to profile taxa[25]. This is computationally efficient and tends to be more precise (fewer false positives) since it ignores non-informative sequences[25]. However, it might under-report taxa that lack or have divergent markers, and it requires markers to be known in advance[26][27].

In practice, Kraken2 (k-mer based) is widely used for shotgun taxonomic profiling, often followed by Bracken for abundance estimation. We will focus on Kraken2 for classification, then use Bracken to refine species abundance counts.

Taxonomic Classification with Kraken2

Kraken2 requires a database of reference genomes or sequences, pre-built into a specialized index. Common choices are the Standard Kraken2 database (RefSeq bacteria/archaea/viral) or custom databases (you can build one including specific libraries like fungi, protozoa, etc.[28][29]). Ensure you have the database path ready (e.g., KRAKEN2_DB environment variable or --db path).

Command example: Classify reads with Kraken2 and generate a report:

kraken2 --db /path/to/Kraken2_DB --paired sample_host_removed_R1.fastq.gz sample_host_removed_R2.fastq.gz \
        --report sample.kraken.report --output sample.kraken.labels \
        --use-names --threads 4

  • --db /path/to/Kraken2_DB: Path to Kraken2 database directory. This example uses a pre-built database. (Building a custom DB can be done with kraken2-build commands[28][30], but that is outside scope here.)
  • --paired …R1.fastq.gz …R2.fastq.gz: Provide paired-end read files. Kraken2 will process paired reads together, increasing classification accuracy.
  • --report sample.kraken.report: Output a summary report of read counts and percentages at each taxonomic level (domain to species) in a human-readable table[31][32]. This report is very useful for a quick community composition overview.
  • --output sample.kraken.labels: Output classification results for each read (tab-delimited: read ID, taxon ID, and classification info)[22]. (Optional, used if you need per-read assignments for further analysis).
  • --use-names: Print taxon names in the report (otherwise it prints only tax IDs).
  • --threads 4: Use 4 CPU threads for faster classification.

After running, sample.kraken.report is a text file listing taxa with the number and percentage of reads classified to each (and each taxon’s rank code and NCBI taxID)[32][33]. For example, you might see lines for Bacteria, specific phyla, genera, etc., indented by rank. Unclassified reads (no match in DB) are also reported as a percentage. The classification output file (sample.kraken.labels) contains one line per read with a “C” or “U” (classified or unclassified) and the taxon it was assigned to[34], but this is often not needed unless you plan to extract reads for a particular taxon or feed results to Bracken.

Abundance Estimation with Bracken: Kraken2 classifies reads to taxa but does not by itself adjust for database composition biases or assess abundances at a specific rank. Bracken uses Kraken’s output to re-estimate species or genus abundances more accurately. It accounts for the fact that Kraken might assign some reads to higher ranks when exact matches aren’t found, by probabilistically redistributing read counts to the species level. Ensure you have Bracken’s companion database files for your Kraken2 database (they are generated per read length with bracken-build; see the sketch below).
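
If those companion files are missing, they can be generated once per database and read length with bracken-build; a minimal sketch, assuming 150 bp reads, the default Kraken2 k-mer length of 35, and the same database path as above:

bracken-build -d /path/to/Kraken2_DB -t 8 -k 35 -l 150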

Command example: Refine Kraken report to species abundance with Bracken (assuming 150 bp reads):

bracken -d /path/to/Kraken2_DB -i sample.kraken.report -o sample.bracken_species.txt \
        -r 150 -l S

  • -d /path/to/Kraken2_DB: Path to Kraken2 database (same DB used for classification)[35]. Bracken needs database information to compute k-mer distributions.
  • -i sample.kraken.report: Input the Kraken2 report file for the sample[36].
  • -o sample.bracken_species.txt: Output file with Bracken-adjusted abundances at the chosen level.
  • -r 150: Read length used in sequencing (in bp). This is important as Bracken’s model depends on k-mer coverage of genomes for a given read length (common values are 100, 150, or 250 bp).
  • -l S: Taxonomic level to report (“S” = species level)[37]. (Use -l G for genus, etc., if needed.)

Bracken will output a table (sample.bracken_species.txt) listing species, taxonomy IDs, the number of reads Kraken initially assigned, the number of reads Bracken estimates actually belong to that species, and the fraction of the total reads[38][39]. This gives a refined community profile at species level. For example, if Kraken2 overestimated a species due to reads from a genus-level match, Bracken will adjust those counts downwards, allocating reads appropriately. The new_est_reads column is the key Bracken estimate of reads per species[40][41].

Tip: You can feed multiple Kraken reports to Bracken (or run separately per sample) and then combine results for comparative analysis. Visualization tools like Krona or Pavian can also be used on Kraken or Bracken outputs for interactive exploration of community composition[42][43].
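
As an illustration of combining per-sample results, the combine_bracken_outputs.py helper distributed with Bracken can merge Bracken tables into one matrix (a sketch, assuming two hypothetical samples and that the script is on your PATH):

combine_bracken_outputs.py --files sampleA.bracken_species.txt sampleB.bracken_species.txt \
    --names sampleA,sampleB -o all_samples.bracken_species.tsv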

Marker vs k-mer method note: Marker gene profilers like MetaPhlAn can be an alternative or complement to Kraken. They use a curated set of clade-specific marker genes (e.g., MetaPhlAn4 uses ~1M markers) to profile the community[44]. This often requires less memory and can ignore spurious matches, yielding fewer false positives[24]. However, marker methods might miss organisms not well-represented by the chosen markers or in low abundance[45]. Kraken2, by using all genomic k-mers, is more sensitive to detecting rare taxa but at the cost of potentially including false positives if the database contains near neighbors[24]. In practice, one might use Kraken2/Bracken for a broad overview and MetaPhlAn as a consistency check for major taxa[24]. Always interpret results with caution, and consider running negative controls or using mock communities to assess false discovery rates[24].

Functional Profiling of the Metagenome

Goal: Determine the functional potential of the microbial community – what genes are present and which metabolic pathways or processes are encoded. There are two main strategies:

  • Read-based functional profiling: Bypass assembly; directly map reads to reference gene databases to quantify gene families and pathways. HUMAnN3 (The BioBakery’s tool) is a leading method for this. It takes cleaned reads and outputs gene family abundances and pathway abundances (e.g., MetaCyc pathways)[46]. HUMAnN3 uses a two-phase approach: (a) nucleotide-level alignment of reads to a database of pangenomes (ChocoPhlAn) to identify known gene families in known species, and (b) translated search (DIAMOND) for reads with no nucleotide hits, mapping them against a protein database (UniRef)[47][46]. The gene abundances are then regrouped into pathway abundances (using MetaCyc pathway definitions by default)[48].
  • Assembly-based functional analysis: Assemble contigs first, then predict ORFs (genes) on contigs, then annotate those genes (via e.g., eggNOG-mapper, KEGG, or Prokka/DRAM for metabolic functions). We cover assembly in the next section; here we focus on read-based approach with HUMAnN3, which is often effective when high strain-level diversity makes assembly difficult.

Functional Profiling with HUMAnN3

HUMAnN3 (the HMP Unified Metabolic Analysis Network, version 3) is designed to quantify gene families and pathways from metagenomic data[48]. It integrates MetaPhlAn for taxonomic profiling (to guide gene search per species) and uses large reference databases (UniRef for proteins, MetaCyc for pathways). Ensure you have downloaded the HUMAnN databases (ChocoPhlAn nucleotide and UniRef protein databases) as per the documentation[49][50].

Since HUMAnN expects a single input file, paired reads need to be concatenated into one FASTQ (simply one after the other)[51][52]. This treats them as independent reads for functional mapping, which HUMAnN’s developers note has minimal impact on results[53][54]. Alternatively, you could run HUMAnN on each FASTQ separately and merge results, but concatenation is simpler.

Command example: Run HUMAnN3 on one sample’s reads:

# First, concatenate paired reads (two gzipped FASTQs can be concatenated directly)
cat sample_host_removed_R1.fastq.gz sample_host_removed_R2.fastq.gz > sample.fastq.gz

# Run HUMAnN (assuming humann is in PATH and databases are configured)
humann -i sample.fastq.gz -o humann_sample_out --threads 8

  • cat … > sample.fastq.gz: Joins R1 and R2 reads into one file. (Concatenating two gzip files yields a valid gzip, so the compressed outputs from the host-removal step can be joined directly; just make sure both inputs are either compressed or both uncompressed. HUMAnN reads gzipped FASTQ directly, and the --input-format flag can be given if auto-detection fails.)[51][54].
  • humann -i sample.fastq.gz -o humann_sample_out: Run HUMAnN on the input FASTQ. humann_sample_out is an output directory that will be created to store results.
  • --threads 8: Use 8 CPU threads to speed up the alignment steps (Bowtie2 and DIAMOND).
    (We used default behavior: HUMAnN will first run MetaPhlAn internally to get a taxonomic profile unless provided one, then map reads to the nucleotide database for identified species, then map unmapped reads in translated mode. Additional options like --bypass-translated-search can skip the slow protein search if not needed, etc.)

Outputs: In humann_sample_out/, you’ll find files like:
– sample_genefamilies.tsv – Table of gene families (UniRef IDs or custom IDs) and their abundances in the sample (often in reads per kilobase units).
– sample_pathabundance.tsv – Table of pathway abundances (relative abundance or coverage of MetaCyc pathways)[48]. Each pathway is identified by a MetaCyc ID and name, with a value reflecting the reconstructed abundance of that pathway in the community.
– sample_pathcoverage.tsv – (if enabled) pathway coverage values indicating completeness of each pathway.
– (The gene family abundances are stratified by species by default; HUMAnN also produces an “unstratified” summary by collapsing species contributions).

For example, sample_pathabundance.tsv might show entries like: PWY-7221\tPropanoate Biosynthesis\t12.4 (meaning pathway PWY-7221 has an abundance of 12.4 in the sample in normalized units). The gene families file contains rows of UniRef90 IDs (e.g., UniRef90_Q8ZDN2) with counts, and possibly stratified rows like UniRef90_Q8ZDN2|g__Bacteroides if a gene is assigned to the Bacteroides genus in the MetaPhlAn step. After HUMAnN, one often performs post-processing: renormalizing tables to relative abundances, and regrouping UniRef IDs to broader categories (e.g., KEGG Orthogroups or Enzyme EC numbers). HUMAnN provides utility scripts: humann_renorm_table, humann_regroup_table, humann_join_tables for combining multiple samples, etc.[55][56]. For instance, you can normalize gene families to copies per million (CPM) or relative abundance, and regroup UniRefs to KEGG KOs or COGs if needed using the provided mappings (see the sketch below).
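
A sketch of those post-processing steps (assuming the utility_mapping database has been downloaded via humann_databases for the regrouping step, and that per-sample output tables have been collected into a hypothetical humann_results/ directory for joining):

# Normalize gene family abundances to copies per million (CPM)
humann_renorm_table --input humann_sample_out/sample_genefamilies.tsv \
    --output sample_genefamilies_cpm.tsv --units cpm

# Regroup UniRef90 gene families to KEGG Orthologs (needs the utility_mapping database)
humann_regroup_table --input sample_genefamilies_cpm.tsv \
    --groups uniref90_ko --output sample_ko_cpm.tsv

# Join tables from multiple samples into a single matrix
humann_join_tables --input humann_results/ --output all_samples_genefamilies.tsv \
    --file_name genefamilies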

Functional Annotation via Assembly (Alternative)

Instead of (or in addition to) HUMAnN’s read-based approach, you can functionally annotate after assembly and binning. The general process would be: assemble contigs, bin them into MAGs, then predict genes on contigs (using a tool like Prodigal), and annotate genes using databases. Tools like eggNOG-mapper can take predicted protein sequences and assign COG/KOG categories, KEGG Orthologs, enzyme EC numbers, GO terms, etc., based on the eggNOG database (covering many functional systems). Similarly, KEGG’s own tools or BlastKOALA could map genes to KEGG pathways. Another specialized tool is DRAM (for microbial and viral genomes) which annotates MAGs for metabolic functions and even calculates summary metabolism tables. For the sake of focus, we will cover assembly and binning next, then demonstrate annotation on bins using Prokka as an example. Keep in mind that read-based (HUMAnN) and assembly-based functional analyses are often complementary: read-based gives a quick overview of pathway potentials, while assembly-based allows deeper exploration of specific genomes and high-confidence functions.

Assembly-Based Analysis of the Metagenome

Goal: Assemble the metagenomic reads into longer contigs (contiguous sequences) to reconstruct genomes or genomic fragments. Assembly can reveal gene context, operons, and larger genomic elements not discernible from individual reads. Metagenome assembly is challenging due to mixed species, varying abundances, and strain heterogeneity. However, specialized assemblers like MEGAHIT and metaSPAdes are optimized for metagenomic data (using strategies like succinct de Bruijn graphs and iterative k-mer sizes)[57][58]. Here we use MEGAHIT for its speed and low memory usage on large datasets[57].

Assembling Reads into Contigs with MEGAHIT

Ensure you have enough memory and compute for assembly – metagenome assemblies can be intensive. MEGAHIT can scale to large datasets efficiently, but assembly time increases with data size. Provide it the host-filtered, high-quality reads from preprocessing.

Command example: Assemble paired reads with MEGAHIT:

megahit -1 sample_host_removed_R1.fastq.gz -2 sample_host_removed_R2.fastq.gz \
        -t 8 -m 0.5 -o sample_megahit_out

  • -1 and -2: FASTQ files for paired-end reads (R1 and R2). MEGAHIT accepts gzipped files.
  • -t 8: Use 8 threads.
  • -m 0.5: Use up to 50% of available memory for assembly (by default MEGAHIT uses up to 90% if not specified)[59]. Here we limit to 50%, but you can adjust or omit this parameter.
  • -o sample_megahit_out: Output directory name. MEGAHIT will create this directory to store assembly results. (If omitted, it defaults to megahit_out.)

MEGAHIT will automatically choose a range of k-mer sizes (21 to 141 by default) and assemble contigs. Upon completion, the primary output is sample_megahit_out/final.contigs.fa – a FASTA file of assembled contigs[60]. It also logs assembly statistics.

Check the assembly summary at the end of MEGAHIT’s log: it reports the number of contigs and N50 (the contig length at which half the assembly is in contigs of that length or longer). If you have multiple samples and want to assemble each separately, run MEGAHIT per sample. Alternatively, co-assembly can be done by providing all FASTQs together to MEGAHIT (concatenate files or use comma-separated lists with -1 and -2 for multiple libraries) – but be cautious: co-assembly merges samples and could mix populations unless they are from the same environment (see the sketch below).
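
If you do choose to co-assemble, MEGAHIT takes comma-separated read lists; a minimal sketch with two hypothetical samples:

megahit -1 sampleA_host_removed_R1.fastq.gz,sampleB_host_removed_R1.fastq.gz \
        -2 sampleA_host_removed_R2.fastq.gz,sampleB_host_removed_R2.fastq.gz \
        -t 8 -o coassembly_megahit_out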

Assessing Assembly Quality with QUAST

After assembly, it’s important to evaluate the quality of the contigs: How many contigs were produced? What is the total assembly length? N50? Are there very short contigs (which might be spurious)? QUAST (Quality Assessment Tool for Genome Assemblies) can generate such metrics quickly[61][62]. For isolate genomes, QUAST can also compare to a reference genome if provided, but for metagenomes, usually no single reference exists, so we run QUAST in de novo mode (no reference) to get basic stats.

Command example: Run QUAST on the assembled contigs:

quast.py sample_megahit_out/final.contigs.fa -o sample_quast_out -t 4 -m 1000

  • sample_megahit_out/final.contigs.fa: The assembly FASTA to evaluate.
  • -o sample_quast_out: Output directory for QUAST results (will contain an HTML report, a text summary, and plots).
  • -t 4: Use 4 threads for QUAST (it can parallelize some computations).
  • -m 1000: Minimum contig length to include in the report = 1000 bp. (This filters out contigs <1kb, which is often sensible since very short contigs might be ignored in analysis and including them skews N50/counts.)

After running, QUAST outputs a report (text and HTML). Key metrics in report.txt include[63]:
– Number of contigs (>= 1kb, as specified).
– Total length of contigs (sum of all contig lengths).
– N50 (the length such that 50% of the total assembly is in contigs of that length or longer).
– L50 (the number of contigs that constitute half the genome length).
– Largest contig length.
– GC percentage, etc.

For example, you might see “# contigs (>= 1000 bp) = 2000”, “Total length = 250,000,000 bp”, “N50 = 20,000 bp”, “Largest contig = 150,000 bp”. Higher N50 and larger contigs indicate a more contiguous assembly, but remember metagenome assemblies are typically fragmented because of varying coverage and complexity.

QUAST’s HTML report sample_quast_out/report.html provides interactive plots and tables. If a reference genome or multiple assemblies are provided, it can also show misassembly counts and alignment dotplots (for single genomes). For a metagenome, you might not have a reference, so QUAST will just report basic stats[62]. (QUAST also has a MetaQUAST mode to compare to multiple reference genomes if you have a set of reference sequences for expected organisms[64], but this is optional.) After assessment, you can decide if the assembly quality is sufficient. For instance, if N50 is extremely small (e.g., a few kb), the downstream binning may be challenging – you might consider deeper sequencing or better assembly parameters.
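
If reference genomes for the expected community members are available, the MetaQUAST mode mentioned above can be run instead; a sketch, assuming two hypothetical reference FASTAs in a refs/ directory:

metaquast.py sample_megahit_out/final.contigs.fa \
    -r refs/organismA.fasta,refs/organismB.fasta \
    -o sample_metaquast_out -t 4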

Binning and MAG (Metagenome-Assembled Genome) Reconstruction

Goal: Group contigs into bins that ideally represent individual microbial genomes (MAGs). Binning leverages the patterns in sequence composition and read coverage to cluster contigs belonging to the same species or strain.

After assembly, you have a set of contigs from all species mixed together. Binning algorithms attempt to separate these contigs by organism. This is possible because contigs from the same genome tend to have similar GC content and k-mer frequencies (composition), and similar coverage depth across multiple samples or within a sample (if uneven coverage, e.g., abundant genomes have high coverage)[65][66].

Popular binning tools include MetaBAT2, MaxBin2, and CONCOCT, among others. Each uses different algorithms (graph-based vs. probabilistic mixture, etc.). Combining multiple binners and then taking the consensus (via tools like DASTool) can improve results. To keep things simple, we’ll use MetaBAT2 – a widely used, composition+coverage based binning tool that often gives good results with default settings[67][68].

Before binning: It’s highly recommended to compute coverage depth of contigs by mapping the original reads back to the assembly. MetaBAT2 can use coverage from single or multiple samples to inform binning (differential coverage helps separate organisms). For one sample, coverage is still useful: contigs from very abundant genomes have higher read depth than low-abundance ones.

To get coverage, map the filtered reads to the contigs (using Bowtie2 or BWA), then calculate average depth per contig. MetaBAT2 provides a script jgi_summarize_bam_contig_depths for this. For example:

# Build a Bowtie2 index of the contigs, then map the reads back to them
bowtie2-build sample_megahit_out/final.contigs.fa sample_contigs_index
bowtie2 -x sample_contigs_index -1 sample_host_removed_R1.fastq.gz -2 sample_host_removed_R2.fastq.gz \
  --threads 8 | samtools sort -o sample_vs_contigs.bam
samtools index sample_vs_contigs.bam

# Calculate coverage depth per contig
jgi_summarize_bam_contig_depths --outputDepth sample_depth.txt sample_vs_contigs.bam

This yields sample_depth.txt with columns: contig, length, total bases, depth in sample, etc.

Now we run MetaBAT2:

Command example: Bin contigs using MetaBAT2:

metabat2 -i sample_megahit_out/final.contigs.fa -a sample_depth.txt -o bins/bin

  • -i final.contigs.fa: Input assembly FASTA. Only contigs above a minimum length will be binned[69][70]. (MetaBAT2’s --minContig defaults to 2500 bp and should not be set below 1500 bp; you can adjust it via the -m flag if needed[69].)
  • -a sample_depth.txt: Contig depth file generated earlier[71]. This provides mean coverage for each contig to MetaBAT2. If you omit this, MetaBAT2 will attempt binning using composition alone, but including coverage improves accuracy. For multi-sample binning, you would provide a depth file with multiple columns (one per sample).
  • -o bins/bin: Output file prefix for bins. MetaBAT2 will create multiple files like bins/bin.1.fa, bins/bin.2.fa, etc., each containing contigs assigned to a bin (these are the draft genomes).[72][73] It also writes a summary of bin composition.

MetaBAT2 clusters contigs by iteratively building a graph of contig connections based on tetranucleotide frequency and coverage, then partitioning that graph. It automatically tries to optimize parameters internally[74], so default settings usually work well. Contigs shorter than the --minContig cutoff remain unbinned, as they lack sufficient compositional signal[69]. You can collect unbinned sequences in a file by using the --unbinned option (e.g., bin.unbinned.fa).

After running, check how many bins were formed and their sizes. For example, you might get 20 bins (bin.1.fa … bin.20.fa). The number doesn’t directly equal species; some bins may be partial or mixed. That’s why the next step (check bin quality) is crucial.

Tip: For better binning, you could also run other tools like MaxBin2 or CONCOCT (or run MetaBAT2 with different minimum contig lengths), and then use DASTool to merge results. DASTool takes the bin sets and uses a scoring algorithm to resolve disagreements, often producing higher-quality bins than any single tool (a sketch is given below). This is advanced usage; for now, we proceed with the MetaBAT2 bins.
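
As an illustration of that consensus step, here is a sketch assuming MaxBin2 has also been run (hypothetical bins in maxbin_bins/ with .fasta extension) and that the Fasta_to_Contig2Bin.sh helper shipped with recent DAS Tool versions is available (older releases call it Fasta_to_Scaffolds2Bin.sh):

# Convert each binner's output into a contig-to-bin table
Fasta_to_Contig2Bin.sh -i bins/ -e fa > metabat2_contigs2bin.tsv
Fasta_to_Contig2Bin.sh -i maxbin_bins/ -e fasta > maxbin2_contigs2bin.tsv

# Compute a consensus bin set from both binners
mkdir -p dastool_out
DAS_Tool -i metabat2_contigs2bin.tsv,maxbin2_contigs2bin.tsv -l metabat2,maxbin2 \
         -c sample_megahit_out/final.contigs.fa -o dastool_out/sample \
         -t 4 --write_bins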

Evaluating Bins: Completeness and Contamination with CheckM

Not all bins are good – some may be incomplete (missing chunks of the genome) or contaminated (contain contigs from different organisms). CheckM is a tool to assess genome bin quality by looking for universal single-copy marker genes[75][76]. It estimates completeness (what fraction of expected single-copy genes are present) and contamination (if multiple copies of those single-copy genes are found, implying mixture)[75][77].

Command example: Run CheckM on all bins:

checkm lineage_wf -x fa bins/ checkm_out -t 4

  • lineage_wf: This is CheckM’s lineage workflow – it will place each genome bin on a reference tree, find marker genes, and compute completeness/contamination[75][76].
  • -x fa: Bins have extension .fa (fasta). CheckM will scan bins/ directory for all .fa files (each bin)[78].
  • bins/: Input directory containing your bin FASTA files (from MetaBAT2 output).
  • checkm_out: Output directory for CheckM results.
  • -t 4: Use 4 threads for HMM searches (speeds up the marker gene finding).

CheckM will output a table (in checkm_out/ and also print to terminal) with each bin and metrics like:

Bin_Id    Completeness    Contamination    Strain_heterogeneity   Taxonomy (if identified)
bin.1     95.3%           2.1%            0%                     d__Bacteria; p__Firmicutes; … 
bin.2     40.0%           5.0%            –                      (no marker lineage)

Completeness is the percentage of expected lineage-specific marker genes found[76]. Contamination is inferred by how many marker genes are present in >1 copy (e.g., if a single-copy gene appears twice, likely two genomes’ content got mixed)[79]. Strain heterogeneity indicates if multiple strains might be merged in one bin (CheckM can sometimes detect subtle multi-population bins).

We generally want high completeness (e.g. >90%) and low contamination (<5%) for a high-quality MAG (as per MIMAG standards). Medium-quality drafts might have >50% completeness and <10% contamination. If a bin is very incomplete (e.g. 40%) or very contaminated, it might need refining: you can try re-binning that subset of contigs with different parameters, or manually curate (remove contigs that clearly don’t belong based on GC or taxonomy).
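
To apply such thresholds programmatically, CheckM can re-export the estimates it already computed during lineage_wf as a machine-readable table; a sketch, assuming CheckM v1’s standard column order (verify the column numbers against the header of your output):

# Re-export quality estimates as a tab-separated table
checkm qa --tab_table -f checkm_out/quality.tsv -t 4 checkm_out/lineage.ms checkm_out

# Keep bins with >=90% completeness and <5% contamination
# (completeness and contamination are columns 12 and 13 in the default table)
awk -F'\t' 'NR>1 && $12>=90 && $13<5 {print $1}' checkm_out/quality.tsv > high_quality_bins.txt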

CheckM also places each bin within its internal reference genome tree and reports the marker lineage used (not a final taxonomy, but a rough placement, often around genus to phylum level). This is useful to know what each bin likely is (e.g., “Bin.1 is likely a Firmicute, Bin.2 an Actinobacteria, etc.”). At this stage, you might decide to discard bins that are too low-quality (e.g., <50% complete or >10-15% contaminated), and keep the better ones for further analysis. You can also merge/split bins if you suspect issues – sometimes two bins are actually one genome split, or one bin might contain two species mixed. Tools like Blobology or using coverage differential can help, but that’s beyond this overview.

Taxonomic Assignment of Bins with GTDB-Tk

Goal: Assign a standardized taxonomy to each MAG (bin), typically using the Genome Taxonomy Database (GTDB) which is designed for bacterial/archaeal genomes. GTDB-Tk is a toolkit that places your genomes into the GTDB reference tree and gives each a taxonomy (e.g., “genus X, species Y”)[80].

GTDB-Tk uses a set of marker genes and phylogenetic inference to classify genomes, which is more consistent and up-to-date than relying on 16S alone. It will report the closest matching known species or label as novel (e.g., “UBAxxxx” for new taxa) if no close match.

Command example: Classify bins with GTDB-Tk:

gtdbtk classify_wf --genome_dir bins/ --extension fa --out_dir gtdbtk_out --cpus 4

  • classify_wf: Runs GTDB-Tk’s full classification workflow (which includes identifying markers, aligning, placing in tree, and classifying) in one go.
  • --genome_dir bins/: Directory of genome FASTA files (bins from MetaBAT2). GTDB-Tk will process every file matching --extension fa (we specify this because the default extension is fna).
  • --out_dir gtdbtk_out: Directory to output results (it will create this).
  • --cpus 4: Use 4 CPUs. (The process is CPU and memory intensive; more CPUs can speed it up, but ensure you have enough RAM, as GTDB-Tk with the large reference data can use tens of GB of memory.)

GTDB-Tk requires the GTDB reference data (a folder typically set via GTDBTK_DATA_PATH env variable). Make sure that’s available (e.g., the release207_v2 or latest release data).

Once run, check the output files in gtdbtk_out/. The main result is gtdbtk.bac120.summary.tsv (and …ar122.summary.tsv for archaeal, since GTDB has separate bacterial and archaeal trees). In the summary, each input genome is listed with columns: user genome, classification, FastANI reference, classification method, and notes. For example, you might see:

user_genome    classification                  closest_ref          other_info…
bin.1          d__Bacteria;p__Firmicutes;…;s__Lactobacillus reuteri    GCF_000008865.1   …
bin.2          d__Bacteria;p__Proteobacteria;…;s__Escherichia coli     GCF_000005845.2   …
bin.3          d__Bacteria;p__Firmicutes;…;s__unclassified Paenibacillus   (no close species, novel)

GTDB taxonomy uses prefixes like d__ (domain), p__ (phylum), c__, o__, f__, g__, s__. If a genome doesn’t confidently match an existing species, it may get a placeholder name (e.g., s__unclassified or a GTDB placeholder for novel species).

This gives you a name for each MAG. It’s more rigorous than using BLAST on a single marker gene, as it considers whole-genome marker sets and evolutionary placement[80]. Now you have reconstructed draft genomes (MAGs) with quality metrics and taxonomic labels. These can be used to link specific organisms to functions or to compare with isolates. For example, you might say “We recovered 5 high-quality MAGs, including one classified as Lactobacillus reuteri (95% complete, 2% contamination) and another novel Paenibacillus species.”

Genome Annotation of Bins (Prokka Example)

Goal: Annotate the MAGs to identify genes and their functions (e.g., coding sequences, rRNAs, tRNAs, and assign product names or functions to genes). This provides insight into the metabolic capabilities of each genome.

Tools for genome annotation include Prokka (which is for prokaryotic genomes), eggNOG-mapper (functional annotation using ortholog databases), and DRAM (focused on metabolic annotation of MAGs, outputting metabolic summary tables). We will demonstrate Prokka as a straightforward pipeline for annotating a single prokaryotic genome bin, producing standard outputs (GenBank file, GFF, protein fasta, etc.).

Prokka comes with its own databases for typical bacterial genes (based on Uniprot, RefSeq, etc.), and assigns locus tags to each gene. It also detects rRNA/tRNA genes using bundled tools like Aragorn (tRNA) and Barrnap (rRNA).

Important: Since MAGs may be incomplete or fragmented, we use the --metagenome option in Prokka, which tunes gene prediction for fragmented, draft-quality input rather than a single contiguous genome[81].

We also provide a useful prefix for loci (so gene IDs are unique per bin and identifiable).

Command example: Annotate a bin (e.g., bin.1.fa) with Prokka:

prokka --outdir bin1_annotation --prefix BIN1 --locustag BIN1 --metagenome --cpus 4 bins/bin.1.fa

  • --outdir bin1_annotation: Directory to place output files. (Prokka will create it; we use a separate folder per genome).
  • --prefix BIN1 and --locustag BIN1: Prefix for output file names and for locus tags, respectively. For example, genes will get locus tags like BIN1_00001, BIN1_00002, etc., and output files will start with BIN1. This helps trace which MAG the annotations belong to.
  • --metagenome: Informs Prokka that this is a metagenome-assembled sequence, not a polished complete genome; gene prediction is switched to a mode better suited to short, fragmented contigs[81].
  • --cpus 4: Use 4 threads for faster processing.
  • bins/bin.1.fa: The input FASTA of contigs for bin 1.

Prokka will output multiple files in bin1_annotation/:
– BIN1.gff: GFF3 file with all annotations (features on contigs, coordinates, and functional annotations).
– BIN1.faa: protein sequences of predicted genes (amino acid FASTA).
– BIN1.ffn: nucleotide sequences of coding sequences (DNA FASTA).
– BIN1.fna: the input contig sequences (just copied).
– BIN1.gbk: GenBank format file (can be viewed in genome browsers, contains annotations in rich format).
– BIN1.tsv: Tab-delimited summary of features (locus tag, gene, product, etc.).
– BIN1.txt: A log/report of the annotation run (statistics like how many CDS, how many rRNA/tRNA found, etc.).

During the run, Prokka uses Prodigal to find open reading frames (genes)[82] and annotates them by comparing to its databases (by default protein family HMMs, and BLAST against curated protein sets such as Swiss-Prot). It also identifies rRNAs and tRNAs: you will see in the output whether it found 5S/16S/23S rRNAs or any tRNAs (for MAGs, some rRNAs are often missing due to assembly issues). The --norrna and --notrna options can skip those searches (used in some cases to speed things up[81]), but by default Prokka will run them unless told not to.

Once finished, examine the BIN1.txt summary. For example, it may report: “Predicted genes: 3405 CDS, 1 rRNA, 20 tRNA.” And it will list how many got annotations (e.g., via protein database hits) versus labeled “hypothetical protein”. The GFF/Gbk files contain the detailed annotations per gene (EC numbers, product names, etc., if identified)[83][84].

Repeat Prokka for each bin of interest (you can loop through bins directory or run jobs in parallel). Each MAG will then have its own annotated genome.
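
A simple loop accomplishes this (a sketch, assuming the MetaBAT2 bins are named bins/bin.1.fa, bins/bin.2.fa, and so on):

for bin in bins/bin.*.fa; do
  name=$(basename "$bin" .fa | tr -d '.')   # e.g. bin.1 -> bin1
  prokka --outdir "${name}_annotation" --prefix "$name" --locustag "$name" \
         --metagenome --cpus 4 "$bin"
done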

Alternative annotation approaches:

  • eggNOG-mapper: If you prefer a broader functional labeling, you could take Prokka’s protein FASTA (.faa) and run eggnog-mapper (with DIAMOND or HMM mode) to get COG category assignments, KEGG Orthologs, etc.; see the sketch after this list. EggNOG-mapper is slower but very comprehensive, providing GO terms and pathway info from a huge database of orthologous groups.
  • DRAM: A newer tool that reads in MAGs and annotates them specifically for metabolism (using KEGG, UniRef, Pfam, dbCAN for CAZymes, etc.), outputting summaries like which MAGs can do methanogenesis, which have certain vitamin synthesis pathways, etc. Useful for environmental metagenomes where you want a quick metabolic summary of each bin.
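
As a sketch of the eggNOG-mapper route (assuming its reference data have already been downloaded with download_eggnog_data.py and that the Prokka protein FASTA from above is used as input):

mkdir -p bin1_eggnog_out
emapper.py -i bin1_annotation/BIN1.faa -o BIN1_eggnog \
           --output_dir bin1_eggnog_out --cpu 8 -m diamond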

However, using Prokka gives a quick, standard GenBank annotation which is often sufficient for many purposes and can be refined later.