4.1. Analysis Pipeline Overview

List

October 24, 2025 | rr19488 | Uncategorized

Raw Data

Input Files

FASTQ sequences from sequencer

Format: .fastq.gz

Metadata

Sample information and covariates

Format: .tsv, .csv

Initial QC

Read counts and quality scores

Tools: FastQC, MultiQC

Preprocessing

Quality Filtering

Remove low-quality reads

Tools: Trimmomatic, fastp

Adapter Removal

Trim technical sequences

Tools: cutadapt, BBDuk

Host Removal

Filter contaminating DNA

Tools: Bowtie2, BMTagger

Analysis

Taxonomic Profiling

Identify microbial species

Tools: Kraken2, MetaPhlAn

Functional Analysis

Predict metabolic pathways

Tools: HUMAnN, PICRUSt2

Assembly

Reconstruct genomes

Tools: MEGAHIT, metaSPAdes

Results

Statistical Analysis

Diversity and differential abundance

Tools: R phyloseq, QIIME2

Visualization

PCoA, heatmaps, networks

Tools: ggplot2, Cytoscape

Integration

Multi-omics correlation

Tools: mixOmics, MOFA

16S rRNA Amplicon Analysis Pipeline Overview

Overview: In this lecture, we will walk through a complete 16S rRNA gene amplicon sequencing data analysis pipeline using QIIME 2 (Quantitative Insights Into Microbial Ecology 2). This pipeline starts from raw sequencing data and ends with publication-quality analyses and visualizations. We assume an intermediate background in command-line tools and microbial ecology. Our focus is conceptual clarity and QIIME 2 command usage (no R scripting). We will use a publicly available dataset (e.g. the “Moving Pictures” human microbiome time series dataset from Qiita) to illustrate each step. The pipeline is structured into five key stages:

Quality Control (QC) – Sequence demultiplexing, trimming, denoising; OTUs vs ASVs.
Taxonomic Classification – Assigning taxonomy with a Naïve Bayes classifier (reference databases: SILVA, Greengenes 2, RDP).
Diversity Analyses – Alpha diversity (within-sample diversity) and beta diversity (between-sample diversity), including ordination methods (PCA, PCoA, NMDS).
Statistical Testing – Hypothesis testing on community differences (PERMANOVA, ANOSIM) and differential abundance analysis (ANCOM, mention of DESeq2).
Visualization – Generating plots and interactive visualizations in QIIME 2 (taxonomic barplots, PCoA plots with Emperor, etc.).

Quality Control (QC) and Processing

Goal: Convert raw sequencing reads into high-quality amplicon sequence variants (ASVs) while removing errors and artifacts. We will also explain the difference between traditional OTU clustering and the modern ASV approach.

Data Import & Demultiplexing: Start with raw reads (e.g., Illumina paired-end FASTQ files). In QIIME 2, import data using qiime tools import (specify format, e.g., EMP paired-end) to get a QIIME artifact (.qza). If reads are multiplexed, perform demultiplexing (sorting sequences by sample using barcodes). For example:

qiime demux emp-paired \
–m-barcodes-file sample-metadata.tsv \
–m-barcodes-column BarcodeSequence \
–i-seqs emp-paired-end-sequences.qza \
–o-per-sample-sequences demux.qza \
–o-error-correction-details demux-details.qza

Inputs: raw reads + metadata with barcodes. Output: demux.qza (demultiplexed sequences per sample). We can visualize read quality profiles with: qiime demux summarize –i-data demux.qza –o-visualization demux.qzv (to decide on truncation length for denoising).

OTUs vs ASVs: Traditionally, sequences were clustered into Operational Taxonomic Units (OTUs) at a fixed similarity threshold (often 97%), meaning sequences ≥97% identical are grouped as one “species” proxy. This clustering addressed sequencing errors but could lump distinct species if they were similar[1]. Modern pipelines use Amplicon Sequence Variants (ASVs) – unique error-corrected sequences exact to single-nucleotide resolution. ASVs are generated by denoising algorithms (like DADA2 or Deblur) which model and remove sequencing errors instead of clustering. ASVs provide higher resolution than OTUs (distinguishing sequences differing by 1 nucleotide) and are reproducible across studies. In QIIME 2, we will use DADA2 to produce ASVs; this inherently performs quality filtering, denoising, chimera removal, and singleton filtering. (Note: QIIME 2 still allows OTU clustering via the q2-vsearch plugin, but we focus on ASVs as best practice.)
Denoising with DADA2 (to get ASVs): QIIME 2’s DADA2 plugin will filter low-quality reads, correct errors, remove chimeras, and output representative sequences (ASVs) and their feature table. Use qiime dada2 denoise-paired for paired-end reads (or denoise-single for single-end). Key parameters include trimming/truncation lengths based on quality plots. Example command:

qiime dada2 denoise-paired \
–i-demultiplexed-seqs demux.qza \
–p-trim-left-f 0 –p-trim-left-r 0 \
–p-trunc-len-f 250 –p-trunc-len-r 240 \
–o-table table.qza \
–o-representative-sequences rep-seqs.qza \
–o-denoising-stats denoising-stats.qza

Inputs: demux.qza (demultiplexed reads). Parameters: trim/trunc lengths (example values; adjust according to quality). Outputs: – table.qza: feature table (samples × ASVs counts). – rep-seqs.qza: representative sequences for each ASV (the unique DNA sequences). – denoising-stats.qza: summary of how many reads were filtered, denoised, chimera-removed, etc.

After this step, we have high-quality ASVs. We can summarize the feature table with qiime feature-table summarize (to see per-sample read counts) and visualize length/frequency of ASVs with qiime feature-table tabulate-seqs (which also allows BLAST searching representative sequences).

Why ASVs: Unlike OTUs at 97% (which could group sequences differing by up to ~3%, i.e. tens of base pairs), ASVs retain fine-scale variation, enabling detection of subtle ecological patterns and true strain-level differences if present. We essentially get error-corrected exact sequences per distinct organism rather than fuzzy OTU bins.

(Optional) Sequence Quality Filtering: DADA2 already includes quality-based filtering. If not using DADA2, one might do separate QC (e.g., qiime quality-filter q-score for simple quality filtering or qiime deblur denoise for Deblur). Since we use DADA2, separate filtering is not needed. DADA2 also removes chimeras by default (via consensus method). The denoising-stats.qza can be visualized (qiime metadata tabulate) to evaluate how many reads survived filtering and chimera removal per sample.
Outcome of QC: At this point, we have a feature table of samples × ASV abundances and a representative sequence set of all ASVs. This is the foundation for downstream steps. We have minimized sequencing errors and eliminated artifacts, ensuring that differences observed between samples are biological and not due to random errors.

To create a valid sample-metadata.tsv file for QIIME 2, follow these steps:

✅ 1. Format Requirements

QIIME 2 expects a tab-separated values (TSV) file with:

A header row (column names)
A first column named #SampleID (exactly that, with the #)
One row per sample
No empty rows or columns
All samples listed must match sample IDs in your sequence data

2. Example Structure

#SampleID	BodySite	Subject	Treatment
sample1	gut	subj1	placebo
sample2	tongue	subj1	placebo
sample3	gut	subj2	drug
sample4	tongue	subj2	drug

#SampleID: must match sample names in your .fastq.gz filenames or in manifest.csv
Other columns (BodySite, Subject, Treatment) can include any metadata you want for grouping, coloring plots, running stats, etc.

3. How to Create It

Option A: Create with Excel or Google Sheets

Make a spreadsheet using the structure above
Save or download as TSV (tab-delimited text)
- In Excel: Save As → Text (Tab delimited) (*.txt) → rename to .tsv
- In Google Sheets: File → Download → Tab-separated values

Option B: Use a text editor (Notepad++, VS Code)

Paste the template shown above
Make sure tabs (not spaces) separate columns
Save with UTF-8 encoding as sample-metadata.tsv

4. Validate Metadata (Strongly Recommended)

Before using the file, validate it using QIIME 2:

qiime metadata tabulate \
  --m-input-file sample-metadata.tsv \
  --o-visualization metadata.qzv

Then view metadata.qzv with qiime tools view or https://view.qiime2.org. This checks for formatting errors and shows a preview.

Taxonomic Classification of ASVs

Goal: Assign each ASV to a taxonomic lineage (e.g., Kingdom, Phylum, … Genus/Species) using a reference database. QIIME 2 uses a Naïve Bayes machine learning classifier for taxonomy assignment, which is trained on reference sequences with known taxonomy.

Reference Databases: Common 16S rRNA reference databases include SILVA (high-quality, comprehensive, latest version SILVA 138 or 139), Greengenes 13_8 (older but commonly used; Greengenes2 is an updated effort), and RDP (Ribosomal Database Project). SILVA and Greengenes provide sequences aligned to taxonomy. We need a reference FASTA of 16S sequences and a mapping of those sequences to taxonomy strings to train the classifier. QIIME 2 also offers pre-trained classifiers for certain regions (e.g., a SILVA 138 classifier for V3–V4 reads of 250 bp). For this example, assume we use SILVA.
Training a Classifier (if needed): If a suitable pre-trained classifier is not available for our specific amplicon region, we train our own. For instance, using SILVA database:

qiime feature-classifier fit-classifier-naive-bayes \
–i-reference-reads silva138_99otus_16S.qza \
–i-reference-taxonomy silva138_taxonomy.qza \
–o-classifier silva-v34-classifier.qza

Input: Reference sequences and their taxonomy. Output: a QIIME 2 .qza classifier artifact. This step can be time-consuming, so pre-trained classifiers are convenient.

Taxonomy Assignment: With a classifier (pre-trained or newly trained), classify our representative sequences:

qiime feature-classifier classify-sklearn \
–i-classifier silva-v34-classifier.qza \
–i-reads rep-seqs.qza \
–p-confidence 0.7 \
–o-classification taxonomy.qza

Here we use the Naïve Bayes sklearn-based classifier (classify-sklearn). Inputs: the classifier and our ASV sequences (rep-seqs.qza). Parameters: –p-confidence 0.7 (confidence threshold for assignments; QIIME’s default is 0.7). Output: taxonomy.qza – contains taxonomic assignments for each ASV (each sequence gets labels like “Kingdom;Phylum;Class;…”). We can inspect it with qiime metadata tabulate –m-input-file taxonomy.qza –o-visualization taxonomy.qzv.

Result: Each ASV now has an assigned taxonomy (often to genus level if reads are short; species level requires exact or near-exact matches and high confidence). It’s common to see some ASVs labeled “Unclassified” at certain levels if the sequence has no close match. Using a comprehensive database like SILVA helps maximize classification rates.
Taxonomy-Based Analyses: We can now compute community composition summaries. A typical visualization is a stacked bar plot of relative abundances of taxa:

qiime taxa barplot \
–i-table table.qza \
–i-taxonomy taxonomy.qza \
–m-metadata-file sample-metadata.tsv \
–o-visualization taxa-barplot.qzv

Inputs: our feature table and taxonomy, plus sample metadata to group samples. Output: taxa-barplot.qzv – an interactive bar chart (each sample is a bar, colored by taxonomic groups). In QIIME 2 View, one can choose taxonomic level (Phylum, Family, etc.) and even collapse groups by metadata (e.g., average composition per treatment group).

Interpretation: The taxonomic classification allows us to report which microbes are present. For example, we might find that Phylum Firmicutes and Bacteroidota dominate human gut samples, or detect specific genera shifts. This is descriptive; further statistical tests (later section) can quantify significance of composition differences.

Note on databases: Ensure the reference database 16S region overlaps your amplicon region (e.g., if primers target V3–V4, a classifier trained on full-length 16S is okay, but if using a very short region, a region-specific classifier can improve accuracy[1]). Greengenes (last updated 2013) is somewhat dated; SILVA (regularly updated) is preferred for most current analyses. RDP is another option (has its own classifier tool as well). In QIIME 2, the Naïve Bayes approach (Bokulich et al. 2018) has been shown to be robust and is the recommended method

Diversity Analyses (Alpha & Beta Diversity, Ordination)

With each sample’s ASV abundances and taxonomies in hand, we analyze diversity – within-sample diversity (alpha) and between-sample diversity (beta). We also use ordination plots to visualize complex multi-sample patterns.

Alpha Diversity: Measures of species richness and evenness within a single sample. Common alpha diversity indices:
1.Shannon Diversity Index (H’): accounts for both richness and evenness (higher if many species with balanced abundances).
2.Simpson’s Index (1–D or 1/D forms): emphasizes dominant taxa; the probability that two randomly chosen individuals are from the same species. Often we use Inverse Simpson so that higher = more diversity. 3.Observed ASVs (richness): simply the count of unique ASVs observed in the sample (a measure of species richness).
4.Faith’s Phylogenetic Diversity (PD): (if a phylogenetic tree is available) – the total branch length of the phylogeny covered by the sample’s organisms, reflecting biodiversity in a phylogenetic context.

Beta Diversity: Measures differences between samples in community composition. Calculated as pairwise distances. Common metrics:
1.Bray–Curtis dissimilarity: a non-phylogenetic, abundance-based metric (ranges 0–1, 0 if communities identical). It considers the proportion of counts not shared between two samples.
2.Jaccard index: a presence/absence version (fraction of species not shared).
3.UniFrac (unweighted and weighted): phylogenetic distances that incorporate evolutionary distances:
3.1.Unweighted UniFrac: considers presence/absence along a phylogenetic tree (effectively, fraction of branch length unique to each community). 3.2.Weighted UniFrac: incorporates relative abundance along the tree (weighting branches by abundance). Both UniFrac require a phylogenetic tree of the ASVs.

Preparing for Diversity Analysis: Beta diversity metrics like UniFrac need a phylogenetic tree of the ASVs. We can generate a quick approximate tree using QIIME 2’s phylogeny pipeline (multiple sequence alignment with MAFFT, phylogeny with FastTree):

qiime phylogeny align-to-tree-mafft-fasttree \
–i-sequences rep-seqs.qza \
–o-alignment aligned-rep-seqs.qza \
–o-masked-alignment masked-alignment.qza \
–o-tree unrooted-tree.qza \
–o-rooted-tree rooted-tree.qza

Output: rooted-tree.qza (a rooted phylogenetic tree for ASVs). This uses an arbitrary mid-root or best root automatically. Now we’re set to compute diversity metrics.

Core Diversity Metrics: QIIME 2 provides a convenient shortcut command to compute a suite of diversity metrics in one go: qiime diversity core-metrics-phylogenetic. It will: – Rarefy (subsample) the feature table to a specified depth (to normalize sequencing depth across samples). – Compute several alpha diversities (Shannon, Observed, evenness, Faith’s PD). – Compute several beta diversity distance matrices (Bray-Curtis, Jaccard, UniFrac). – Compute principal coordinate analyses (PCoA) for each distance matrix. – Generate ready-to-visualize Emperor plots.

For example:

qiime diversity core-metrics-phylogenetic \
–i-table table.qza \
–i-phylogeny rooted-tree.qza \
–p-sampling-depth 10000 \
–m-metadata-file sample-metadata.tsv \
–output-dir core_metrics_results

Inputs: feature table, phylogenetic tree, and sample metadata. Parameter: –p-sampling-depth 10000 (subsample each sample to 10,000 reads; choose a depth that is low enough that all (or most) samples have at least that many reads to include them, but high enough to retain diversity).
Output: A directory core_metrics_results containing many files: – shannon_vector.qza, observed_otus_vector.qza, evenness_vector.qza, faith_pd_vector.qza (alpha diversity results per sample). – Distance matrices: bray_curtis_distance_matrix.qza, jaccard_distance_matrix.qza, unweighted_unifrac_distance_matrix.qza, weighted_unifrac_distance_matrix.qza. – PCoA results: e.g. bray_curtis_pcoa_results.qza etc. – Emperor plots: e.g. unweighted_unifrac_emperor.qzv (an interactive 3D PCoA plot of samples colored by metadata groups).

We can examine alpha diversity distributions across groups with a statistical test:

qiime diversity alpha-group-significance \
–i-alpha-diversity core_metrics_results/shannon_vector.qza \
–m-metadata-file sample-metadata.tsv \
–o-visualization shannon-group-significance.qzv

This generates a boxplot of Shannon index per category (e.g., treatment or body site) and performs a Kruskal-Wallis test (non-parametric one-way ANOVA) to see if diversity differs by group.

For beta diversity, QIIME 2 can statistically test whether communities differ by some category using PERMANOVA or ANOSIM:

qiime diversity beta-group-significance \
–i-distance-matrix core_metrics_results/bray_curtis_distance_matrix.qza \
–m-metadata-file sample-metadata.tsv \
–m-metadata-column BodySite \
–p-method permanova \
–p-pairwise \
–o-visualization braycurtis-permanova.qzv

Here, for example, we test if Bray-Curtis distances between samples differ by body site. –p-method permanova chooses PERMANOVA (default)[1]; we also add –p-pairwise to get pairwise PERMANOVA between each level of the category. The output braycurtis-permanova.qzv will show an pseudo-F statistic and p-value for overall difference (and pairwise p-values if requested). If method was anosim, it would output ANOSIM R statistic and significance. PERMANOVA (Permutational ANOVA) tests whether the centroids of groups in multivariate space are significantly different, accounting for within-group variation[1]. ANOSIM (Analysis of Similarities) is another non-parametric test that uses rank correlation of distances (yielding an R statistic) to evaluate whether between-group differences are greater than within-group differences (common in ecology)[2]. Both tests use permutation to assess significance.

Ordination: To interpret beta diversity, we often use ordination plots: 1.Principal Coordinates Analysis (PCoA): This takes a distance matrix (e.g., UniFrac) and finds major axes that capture the variance. The QIIME 2 Emperor outputs are PCoA plots.
2.Principal Component Analysis (PCA): Similar concept but works on raw data or covariance (in QIIME, one could use qiime diversity pca if needed on feature table or on centered log-ratio transformed data).
3.Multi-Dimensional Scaling (NMDS): Another ordination method that tries to represent rank order of distances in fewer dimensions. (QIIME 2 doesn’t have a direct NMDS plugin, but one can use scikit-bio externally or R. Emperor/PCoA is more common in QIIME for beta diversity.)

In practice, PCoA on a distance matrix (like unweighted UniFrac) is very common. QIIME 2’s Emperor allows interactive exploration: points (samples) can be colored/shaped by metadata to see clustering patterns (e.g., samples cluster by body site or treatment if those factors are strong drivers of community composition).

Interpreting Diversity Results: Alpha diversity results (Shannon, Simpson) might show, for example, that treatment A has significantly lower diversity than treatment B (perhaps due to an antibiotic reducing richness). Beta diversity PERMANOVA could show that gut samples are significantly different from tongue samples (p < 0.001), indicating distinct community compositions in different habitats. The PCoA plot might show tight clustering of replicates and separation between conditions, which provides a visual confirmation of statistical results

Statistical Testing and Differential Abundance

Beyond overall diversity comparisons, we often want to pinpoint which taxa differ or whether community differences are statistically significant in more complex experimental designs.

PERMANOVA (Permutational Multivariate ANOVA): We described this above in context of beta diversity. Use PERMANOVA to test hypotheses like “Do samples cluster by treatment?” on distance matrices[1]. The output includes an F-value (pseudo-F) and a p-value from permutation. A significant PERMANOVA (e.g., p=0.01) indicates the groups (e.g., Control vs Treated) have significantly different community compositions. Keep in mind PERMANOVA can be influenced by differences in within-group dispersion; always check group dispersions (QIIME 2’s output includes a test for homogeneity of dispersions).
ANOSIM (Analysis of Similarities): An alternative to PERMANOVA, also non-parametric. It yields an R statistic (close to 1 if differences between groups are much larger than within groups, ~0 if no difference). ANOSIM is also available via beta-group-significance –p-method anosim. In practice, PERMANOVA is more widely used in recent microbiome studies, but ANOSIM can corroborate findings[2].
Alpha Diversity Significance: We mentioned alpha-group-significance (Kruskal-Wallis). For example, if comparing 3 sites, a KW p<0.05 for Shannon suggests at least one site differs. Post-hoc pairwise comparisons (Dunn’s test) are shown in the QZV output to see which pairs differ.
Differential Abundance Testing: These methods identify specific features (ASVs or taxa) that differ in abundance between sample groups:
ANCOM (Analysis of Composition of Microbiomes): A popular compositionally-aware test implemented in QIIME 2. It operates on center log-ratio (CLR) transformed data and identifies features whose log-ratios are significantly different across groups. ANCOM accounts for the compositional nature of microbiome data (constant-sum constraint).
- Preprocess: ANCOM in QIIME 2 requires adding a pseudocount (since zero counts can’t be log-transformed). We first do:
- qiime composition add-pseudocount \
  –i-table table.qza \
  –o-composition-table comp-table.qza
- Then run ANCOM, specifying the metadata grouping:
- qiime composition ancom \
  –i-table comp-table.qza \
  –m-metadata-file sample-metadata.tsv \
  –m-metadata-column Treatment \
  –o-visualization ancom-results.qzv
- Output: ancom-results.qzv shows which features (ASVs) differ significantly in abundance between categories (W-statistics and volcano plot). If many ASVs, often we collapse to higher taxonomic level before ANCOM for interpretability (QIIME 2 allows qiime taxa collapse to sum counts at genus, family, etc., then run ANCOM on those groups).

DESeq2: A differential abundance method (based on negative binomial models) originally in R. QIIME 2 does not have a native DESeq2 plugin in the core distribution, but one can export the feature table to a BIOM or CSV and use the DESeq2 package in R for a more traditional analysis (contrasts, fold-changes, etc.). For this lecture, we note DESeq2 as an option (especially for parametric testing considering library size normalization), but we avoid writing R code here. Instead, we focus on ANCOM as an in-platform approach.
Important: Both ANCOM and DESeq2 assume proper normalization (ANCOM works on log-ratios, DESeq2 has its internal size factor normalization). One must also be mindful of multiple hypothesis testing correction (ANCOM’s W is effectively doing that by requiring feature to be significant in many pairwise log-ratio tests; DESeq2 uses adjusted p-values by default).
Example Interpretation: Suppose ANCOM reveals that an ASV (or a genus) is significantly more abundant in diseased patients than healthy controls (W statistic high, identified as different). This points to a candidate organism associated with disease. DESeq2 might provide a similar result with an estimated fold-change (e.g., that taxa is 5× higher in disease, p<0.01). Always examine effect sizes and biological relevance, not just statistical significance.

Note on Compositionality: Because microbial count data are compositional (relative abundances sum to 100%), specialized methods like ANCOM or ALDEx2 are used to avoid false discoveries due to that constraint. It’s good to mention to students that t-tests on relative abundances can be misleading. QIIME 2’s choice of ANCOM addresses this.

Posts

October 25th, 2025

Rozaimi Razali

Department of Biomedical Sciences, College of Health Sciences, QU-Health, Qatar University