# Preprocessing: long-read amplicon

## Description
This pipeline processes long-read amplicon sequencing data from Oxford Nanopore or PacBio to generate taxonomic profiles of microbial communities.
Starting from raw FASTQ files, the workflow performs several preprocessing steps including primer trimming, adapter removal, quality filtering, and chimera detection, followed by taxonomic classification using EMU.
The pipeline automatically generates quality control statistics (NanoStat, MultiQC) and produces taxonomic abundance tables as well as a phyloseq object for downstream microbiome analysis in R.
## Pipeline overview
The pipeline processes long-read amplicon sequencing data through a series of preprocessing and analysis steps to obtain taxonomic profiles and downstream microbiome analysis objects.
### Workflow
FASTQ → Trim → Filter → Chimera removal → Taxonomy → Reports → Phyloseq
### Summary of steps
1. **Primer trimming:** primers are removed and reads are reoriented using Cutadapt.
2. **Adapter removal:** sequencing adapters are removed using Porechop ABI.
3. **Quality filtering:** reads are filtered by quality score and length using SeqKit.
4. **Chimera filtering:** chimeric sequences are detected and removed using VSEARCH.
5. **Taxonomic classification:** clean reads are classified using EMU against a reference database.
6. **Quality reporting:** read statistics are generated using NanoStat and aggregated with MultiQC.
7. **Phyloseq object creation:** taxonomy, abundance tables, and metadata are combined to produce a phyloseq object for downstream analysis in R.
8. **Mock community evaluation (optional):** if a mock community is provided, performance metrics and visualizations are generated.

## Installation
1. Clone the repository:

   ```bash
   git clone https://github.com/mariamelk1/preprocessing_omics.git
   cd preprocessing_omics
   ```

2. Activate the Conda environment. All required dependencies are installed in a pre-configured Conda environment; simply activate it before running the pipeline:

   ```bash
   conda activate lr16s_pipeline
   ```

## Usage
Run the pipeline from the repository root.
For the example dataset, the FASTQ files are located in the data/example folder, and the mapfile points to them.
For your own datasets, place your FASTQ files in the data/ folder and update the mapfile accordingly.
💡 Note for mock analysis: if you enable `--mock`, the pipeline will look for FASTQ files in the `data/mock/` folder. Make sure to create this folder and place your mock community FASTQ files there.
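For reference, preparing that folder is a one-liner (the FASTQ file name below is a placeholder for your own files):

```shell
# Create the folder the pipeline expects when --mock is enabled
mkdir -p data/mock

# Then copy your mock community FASTQ files into it, e.g.:
# cp /path/to/mock_sample1.fastq.gz data/mock/
```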
```bash
# Run the full pipeline
./preprocessing_emu.sh --map <path_to_mapfile>
```

Mandatory arguments:

```
--map <path_to_mapfile>         # Path to your sample map file (required)
```

Optional arguments:

```
--platform <nanopore|pacbio>    # Select sequencing platform (default: nanopore)
--no-cutadapt                   # Skip primer removal
--no-filter-seqkit              # Skip quality filtering
--min-qual <score>              # Set minimum quality score for SeqKit (default: 20)
--no-filter-adapt               # Skip adapter removal
--no-vsearch                    # Skip chimera removal
--no-EMU                        # Skip EMU taxonomic classification
--mock                          # Enable mock community analysis
--mock-scale <lineal|log|both>  # Scale for mock metrics (default: log)
--help                          # Show this help message
```

### Examples
#### Run the full pipeline on Nanopore data

```bash
./preprocessing_emu.sh --map mapfile.tsv
```

#### Run the full pipeline on PacBio data

```bash
./preprocessing_emu.sh --platform pacbio --map mapfile.tsv
```

#### Run pipeline with quality filtering only (Nanopore)

```bash
./preprocessing_emu.sh --platform nanopore --map mapfile.tsv --no-cutadapt --no-filter-adapt --min-qual 20
```

#### Run pipeline including mock community metrics on both scales (PacBio)

```bash
./preprocessing_emu.sh --platform pacbio --map mapfile.tsv --mock --mock-scale both
```

#### Run the example dataset provided in data (Nanopore, mock community)

```bash
./preprocessing_emu.sh --platform nanopore --map ./data/mapfile.tsv --mock --mock-scale both
```
## Output structure
All pipeline results are stored in the RESULTS directory with the following structure:
```
RESULTS/
├── ADAPTATOR_FILTERING/          # Adapter-trimmed reads (Porechop)
├── CHIMERA_FILTERING/            # Chimera-filtered reads (VSEARCH)
├── EMU_RESULT/                   # Taxonomy and abundance tables (EMU)
├── JSON/                         # Cutadapt JSON reports
├── LOGS/                         # Logs for each processing step
├── MULTIQC/                      # Multi-sample QC report
├── NANOSTATS/                    # Read statistics generated by NanoStat
├── nanostat_stats.tsv            # Combined NanoStat table
├── OTU_CLUSTERING/               # OTU clustering results (if run)
├── physeq.Rdata                  # Phyloseq object for downstream R analysis
├── QUALITY_FILTERING/            # Quality-filtered reads (SeqKit)
├── repseqs_from_combine.fasta    # Reference sequences from EMU outputs
├── TRIM/                         # Reads with primers removed
├── UNTRIM/                       # Original reads
└── METRICS/                      # Mock community metrics (if --mock)
    ├── <scale>/FIGURES/          # Plots for mock metrics
    └── metrics_by_qscore_<scale>.tsv
```
💡 Notes:
- `<scale>` can be `log`, `lineal`, or `both`, depending on the mock analysis options.
- All directories are automatically created by the pipeline if they do not exist.
- The folder structure mirrors the workflow steps for easy navigation.
## Configuration
The pipeline uses a configuration file (config.sh) to define paths, primers, and execution options.
This file is automatically loaded at the start of the pipeline, so you normally do not need to modify the main script.
### What you can customize

- **Paths:** output directory (`OUTPUT_DIR`), EMU database (`EMU_DATABASE_DIR`), VSEARCH database (`DATABASE_VSEARCH`), temporary folder (`TMP_DIR`)
- **Primers:** Nanopore (`PRIMER_FWD_NANO`/`PRIMER_REV_NANO`) or PacBio (`PRIMER_FWD_PAC`/`PRIMER_REV_PAC`)
- **Execution options:** enable/disable steps such as primer removal, quality filtering, adapter trimming, chimera removal, EMU classification, and mock analysis
- **Filtering parameters:** minimum/maximum read length (`MINLEN`/`MAXLEN`), quality score (`QUAL`)
- **Threads:** number of CPU cores used (`CPU`)
- **R scripts:** paths for phyloseq creation and mock metrics (`R_CREATE_PHYLOSEQ`, `R_METRICS`)

All changes are made in `config.sh`; no modification of the main pipeline script is required.
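As an illustration, a hypothetical `config.sh` excerpt using the variable names listed above. All values are placeholders, not the repository's actual defaults; adapt paths and primers to your own setup:

```shell
# Hypothetical excerpt of config.sh -- values are placeholders, not defaults.
OUTPUT_DIR="RESULTS"                      # Where all pipeline results are written
EMU_DATABASE_DIR="/path/to/emu_db"        # EMU reference database
DATABASE_VSEARCH="/path/to/vsearch_db"    # VSEARCH chimera reference
TMP_DIR="/tmp/lr16s"                      # Temporary working folder

PRIMER_FWD_NANO="AGAGTTTGATCMTGGCTCAG"    # Example 27F primer; check your protocol
PRIMER_REV_NANO="CGGTTACCTTGTTACGACTT"    # Example 1492R primer; check your protocol

MINLEN=1200                               # Minimum read length kept by SeqKit
MAXLEN=1800                               # Maximum read length kept by SeqKit
QUAL=20                                   # Minimum quality score
CPU=8                                     # Number of CPU cores
```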
## Notes: Mapfile structure
The pipeline requires a tab-delimited mapfile to link FASTQ files to sample metadata. The structure depends on whether you are analyzing mock community samples or regular samples.
### 1. Mock community samples

- If using `--mock --mock-scale both`, the mapfile must include a `scale` column specifying the scale for each sample (`log` or `lineal`).
- If using only `log` or `lineal`, the mapfile should contain only the sample name and any relevant info.
Example (mock, both scales):

```
sample_name    scale     other_info
mock1          log       replicate1
mock2          lineal    replicate2
```

Example (mock, single scale):

```
sample_name    other_info
mock1          replicate1
mock2          replicate2
```
### 2. Non-mock samples
- The mapfile should contain the sample name followed by any metadata columns (e.g., treatment, timepoint, batch).
Example (non-mock):

```
sample_name    treatment    timepoint    batch
sample1        control      0h           batch1
sample2        treated      24h          batch1
```
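Because the mapfile must be tab-delimited, it is safest to generate it with `printf` rather than a text editor, which may silently insert spaces. A minimal sketch reproducing the non-mock example above:

```shell
# Build the non-mock example mapfile; \t guarantees real tab characters.
printf 'sample_name\ttreatment\ttimepoint\tbatch\n'  > mapfile.tsv
printf 'sample1\tcontrol\t0h\tbatch1\n'             >> mapfile.tsv
printf 'sample2\ttreated\t24h\tbatch1\n'            >> mapfile.tsv
```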
### Important notes
- Always use tabs to separate columns; spaces will break the pipeline.
- Sample names in the mapfile should match the FASTQ filenames (without extensions).
- For quality-specific outputs, the pipeline appends `_Q<score>` to the sample name, e.g., `sample1_Q27`.
- Ensure consistent naming for mock scales if using `both`.
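A quick way to catch space-delimited rows before launching the pipeline is to check that every row has the same tab-separated field count as the header. This is an illustrative one-liner, not part of the pipeline itself (the demo mapfile below is a placeholder):

```shell
# Demo mapfile (tab-delimited, as the pipeline requires)
printf 'sample_name\ttreatment\nsample1\tcontrol\nsample2\ttreated\n' > mapfile.tsv

# Fail if any row's tab-separated field count differs from the header's
awk -F'\t' 'NR==1 {n=NF} NF!=n {print "line " NR ": " NF " fields, expected " n; exit 1}' mapfile.tsv \
  && echo "mapfile OK"
```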