Preprocessing: long-read amplicon

Author

Mariam El Khattabi

Published

March 1, 2026

Description

This pipeline processes long-read amplicon sequencing data from Oxford Nanopore or PacBio to generate taxonomic profiles of microbial communities.

Starting from raw FASTQ files, the workflow performs several preprocessing steps including primer trimming, adapter removal, quality filtering, and chimera detection, followed by taxonomic classification using EMU.

The pipeline automatically generates quality control statistics (NanoStat, MultiQC) and produces taxonomic abundance tables as well as a phyloseq object for downstream microbiome analysis in R.

Pipeline overview

The pipeline processes long-read amplicon sequencing data through a series of preprocessing and analysis steps to obtain taxonomic profiles and downstream microbiome analysis objects.

Workflow

FASTQ → Trim → Filter → Chimera removal → Taxonomy → Reports → Phyloseq
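
The arrow diagram above maps onto one tool per step. The following is a rough single-sample sketch of those invocations, not the pipeline itself: filenames, primer variables, and database paths are placeholders, and the exact flags the pipeline script uses may differ.

```shell
# Trim: remove primers and reorient reads (Cutadapt, linked-adapter mode;
# FWD_PRIMER / REV_PRIMER_RC are placeholder primer sequences)
cutadapt -g "${FWD_PRIMER}...${REV_PRIMER_RC}" --revcomp \
    -o trimmed.fastq raw.fastq

# Remove sequencing adapters (Porechop ABI, ab initio mode)
porechop_abi -abi -i trimmed.fastq -o no_adapters.fastq

# Filter by average quality and length (SeqKit)
seqkit seq --min-qual 20 --min-len 1000 --max-len 2000 \
    no_adapters.fastq > filtered.fastq

# Detect and drop chimeras (VSEARCH, reference-based)
seqkit fq2fa filtered.fastq > filtered.fasta
vsearch --uchime_ref filtered.fasta --db "${VSEARCH_DB}" \
    --nonchimeras clean.fasta

# Taxonomic classification (EMU)
emu abundance clean.fasta --db "${EMU_DB}" --output-dir EMU_RESULT

# Reports (NanoStat per sample, MultiQC aggregate)
NanoStat --fastq filtered.fastq > nanostat.txt
multiqc .
```

In practice you only run the wrapper script described under Usage; this sketch is just to show what each arrow stands for.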

Summary of steps

  1. Primer trimming
    Primers are removed and reads are reoriented using Cutadapt.

  2. Adapter removal
    Sequencing adapters are removed using Porechop ABI.

  3. Quality filtering
    Reads are filtered by quality score and length using SeqKit.

  4. Chimera filtering
    Chimeric sequences are detected and removed using VSEARCH.

  5. Taxonomic classification
    Clean reads are classified using EMU against a reference database.

  6. Quality reporting
    Read statistics are generated using NanoStat and aggregated with MultiQC.

  7. Phyloseq object creation
    Taxonomy, abundance tables, and metadata are combined to produce a phyloseq object for downstream analysis in R.

  8. Mock community evaluation (optional)
    If a mock community is provided, performance metrics and visualizations are generated.

Installation

1. Clone the repository

git clone https://github.com/mariamelk1/preprocessing_omics.git
cd preprocessing_omics

2. Activate the Conda environment

All required dependencies are installed in a pre-configured Conda environment. Simply activate the environment before running the pipeline:

conda activate lr16s_pipeline

Usage

Run the pipeline from the repository root.

For the example dataset, the FASTQ files are located in the data/example folder, and the mapfile points to them.
For your own datasets, place your FASTQ files in the data/ folder and update the mapfile accordingly.

💡 Note for mock analysis: If you enable --mock, the pipeline will look for FASTQ files in the data/mock/ folder.
Make sure to create this folder and place your mock community FASTQ files there.

# Run the full pipeline
./preprocessing_emu.sh --map <path_to_mapfile>

Mandatory arguments:

--map <path_to_mapfile>       #Path to your sample map file (required)

Optional arguments:
  --platform <nanopore|pacbio>  #Select sequencing platform (default: nanopore)
  --no-cutadapt                 #Skip primer removal
  --no-filter-seqkit            #Skip quality filtering
  --min-qual <score>            #Set minimum quality score for SeqKit (default: 20)
  --no-filter-adapt             #Skip adapter removal
  --no-vsearch                  #Skip chimera removal
  --no-EMU                      #Skip EMU taxonomic classification
  --mock                        #Enable mock community analysis
  --mock-scale <lineal|log|both> #Scale for mock metrics (default: log)
  --help                        #Show this help message

Examples

Run the full pipeline on Nanopore data

./preprocessing_emu.sh --map mapfile.tsv

Run the full pipeline on PacBio data

./preprocessing_emu.sh --platform pacbio --map mapfile.tsv

Run pipeline with quality filtering only (Nanopore)

./preprocessing_emu.sh --platform nanopore --map mapfile.tsv --no-cutadapt --no-filter-adapt --min-qual 20

Run pipeline including mock community metrics on both scales (PacBio)

./preprocessing_emu.sh --platform pacbio --map mapfile.tsv --mock --mock-scale both

Run the example dataset provided in data (Nanopore, mock community)

./preprocessing_emu.sh --platform nanopore --map ./data/mapfile.tsv --mock --mock-scale both

Output structure

All pipeline results are stored in the RESULTS directory with the following structure:

RESULTS/
├── ADAPTATOR_FILTERING/ # Adapter-trimmed reads (Porechop)
├── CHIMERA_FILTERING/ # Chimera-filtered reads (VSEARCH)
├── EMU_RESULT/ # Taxonomy and abundance tables (EMU)
├── JSON/ # Cutadapt JSON reports
├── LOGS/ # Logs for each processing step
├── MULTIQC/ # Multi-sample QC report
├── NANOSTATS/ # Read statistics generated by NanoStat
├── nanostat_stats.tsv # Combined NanoStat table
├── OTU_CLUSTERING/ # OTU clustering results (if run)
├── physeq.Rdata # Phyloseq object for downstream R analysis
├── QUALITY_FILTERING/ # Quality-filtered reads (SeqKit)
├── repseqs_from_combine.fasta # Reference sequences from EMU outputs
├── TRIM/ # Reads with primers removed
├── UNTRIM/ # Original reads
└── METRICS/ # Mock community metrics (if --mock)
    ├── <scale>/FIGURES/ # Plots for mock metrics
    └── metrics_by_qscore_<scale>.tsv

💡 Notes:
- <scale> can be log, lineal, or both, depending on the mock analysis options.
- All directories are automatically created by the pipeline if they do not exist.
- The folder structure mirrors the workflow steps for easy navigation.
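
After a run, the two key tables can be inspected straight from the shell (paths taken from the tree above; `column` just aligns the tab-separated fields for reading):

```shell
# Combined per-sample read statistics
column -t -s $'\t' RESULTS/nanostat_stats.tsv | head

# Per-sample taxonomy and abundance tables produced by EMU
ls RESULTS/EMU_RESULT/
```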

Configuration

The pipeline uses a configuration file (config.sh) to define paths, primers, and execution options.
This file is automatically loaded at the start of the pipeline, so you normally do not need to modify the main script.

What you can customize

  • Paths: Output directory (OUTPUT_DIR), EMU database (EMU_DATABASE_DIR), VSEARCH database (DATABASE_VSEARCH), temporary folder (TMP_DIR)
  • Primers: Nanopore (PRIMER_FWD_NANO / PRIMER_REV_NANO) or PacBio (PRIMER_FWD_PAC / PRIMER_REV_PAC)
  • Execution options: Enable/disable steps like primer removal, quality filtering, adapter trimming, chimera removal, EMU classification, mock analysis
  • Filtering parameters: Minimum/maximum read length (MINLEN / MAXLEN), Qscore (QUAL)
  • Threads: Number of CPU cores used (CPU)
  • R scripts: Paths for phyloseq creation and mock metrics (R_CREATE_PHYLOSEQ, R_METRICS)

All changes are made in config.sh; no modification of the main pipeline script is required.
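
For orientation, config.sh might look like the following. This is an illustrative sketch using the variable names listed above; every value (paths, primers, script locations) is a placeholder to be replaced with your own.

```shell
# config.sh -- illustrative sketch, all values are placeholders

# Paths
OUTPUT_DIR="RESULTS"
TMP_DIR="tmp"
EMU_DATABASE_DIR="/path/to/emu_database"
DATABASE_VSEARCH="/path/to/vsearch_reference.fasta"

# Primers (replace with your actual primer sequences)
PRIMER_FWD_NANO="<forward primer>"
PRIMER_REV_NANO="<reverse primer>"
PRIMER_FWD_PAC="<forward primer>"
PRIMER_REV_PAC="<reverse primer>"

# Filtering parameters
MINLEN=1000
MAXLEN=2000
QUAL=20

# Threads
CPU=4

# R scripts (placeholder locations)
R_CREATE_PHYLOSEQ="scripts/create_phyloseq.R"
R_METRICS="scripts/mock_metrics.R"
```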

Notes: Mapfile structure

The pipeline requires a tab-delimited mapfile to link FASTQ files to sample metadata. The structure depends on whether you are analyzing mock community samples or regular samples.


1. Mock community samples

  • If using --mock --mock-scale both, the mapfile must include a scale column specifying the scale for each sample (log or lineal).
  • If using only log or lineal, the mapfile should contain only the sample name and any relevant info.

Example (mock, both scales):

sample_name     scale       other_info
mock1           log         replicate1
mock2           lineal      replicate2

Example (mock, single scale):

sample_name         other_info
mock1               replicate1
mock2               replicate2

2. Non-mock samples

  • The mapfile should contain the sample name followed by any metadata columns (e.g., treatment, timepoint, batch).

Example (non-mock):

sample_name         treatment       timepoint    batch
sample1             control            0h        batch1
sample2             treated            24h       batch1

Important notes

  • Always use tabs to separate columns; spaces will break the pipeline.
  • Sample names in the mapfile should match the FASTQ filenames (without extensions).
  • For quality-specific outputs, the pipeline appends _Q<score> to the sample name, e.g., sample1_Q27.
  • Ensure consistent naming for mock scales if using both.
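
Since space-separated columns break the pipeline, a one-line awk check (a hypothetical helper, not part of the pipeline) can flag bad lines before a run. The demo below builds a small mapfile with one deliberately broken line to show the output:

```shell
# Demo mapfile: line 3 mistakenly uses spaces instead of a tab.
printf 'sample_name\ttreatment\nsample1\tcontrol\nsample2 treated\n' > demo_mapfile.tsv

# Flag any line that does not contain at least one tab.
awk -F'\t' 'NF < 2 { printf "line %d has no tab: %s\n", NR, $0 }' demo_mapfile.tsv
```

Run the same awk line against your real mapfile before launching the pipeline; silence means every line has at least one tab.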