Pre-defined pipelines
Pre-defined pipelines in PipeCraft2 provide automated workflows for processing amplicon sequencing data. These pipelines include options for generating ASVs with DADA2, ASVs with UNOISE3, OTUs with vsearch, and specialized pipelines like NextITS and OptimOTU. Each pipeline is carefully configured with sensible defaults while still allowing customization of key parameters to suit different experimental needs.
Use at least 2 samples per sequencing run for the pre-defined pipelines.
Example data analyses
See the example data analyses with the pre-compiled pipelines here: Example data analyses
Working with multiple sequencing runs
Applicable to: DADA2 ASVs, UNOISE ASVs, vsearch OTUs pre-defined pipelines.
When working with multiple sequencing runs, the pre-defined pipelines can automatically process each sequencing run separately and then merge the results into a single output OTU/ASV table. Processing each sequencing run separately is necessary for appropriate handling of run-specific error profiles and tag-jump filtering.
Identical sequences from different runs will be recognized as the same ASV, and therefore merged into a single ASV.
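The merging rule can be illustrated with a minimal Python sketch. It is not PipeCraft2's implementation: the function name `merge_run_tables` and the nested-dictionary table layout are hypothetical, chosen only to show that rows are keyed by the sequence itself, so identical sequences from different runs end up in one row.

```python
from collections import defaultdict


def merge_run_tables(run_tables):
    """Merge per-run count tables keyed by sequence.

    run_tables: {run_name: {sequence: {sample: count}}} (hypothetical layout).
    Returns {sequence: {sample: count}} covering all runs.
    """
    merged = defaultdict(dict)
    for table in run_tables.values():
        for sequence, sample_counts in table.items():
            for sample, count in sample_counts.items():
                # Identical sequences share one row; counts for the same
                # sample name are summed.
                merged[sequence][sample] = merged[sequence].get(sample, 0) + count
    return dict(merged)


runs = {
    "Run1": {"ACGT": {"sample1": 10}, "TTGA": {"sample2": 3}},
    "Run2": {"ACGT": {"sample10": 7}},
}
merged = merge_run_tables(runs)
```

Here `ACGT` occurs in both runs, so the merged table holds a single `ACGT` row with counts for `sample1` and `sample10`.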
Directory structure
Important
To combine samples from multiple sequencing runs, follow the directory structure below, with **multiRunDir** being the mandatory directory name (names of the nested sequencing run directories can be changed).
When specifying a working directory in PipeCraft2 for processing multiple sequencing runs, select the parent directory of multiRunDir (e.g. my_sequencing_runs in the example below).
my_sequencing_runs/               # SELECT THIS FOLDER AS WORKING DIRECTORY
└── multiRunDir/                  # name here MUST BE multiRunDir
    ├── Run1/                     # name here can be anything (without spaces)
    │   ├── sample1_R1.fastq
    │   ├── sample1_R2.fastq
    │   ├── sample2_R1.fastq
    │   ├── sample2_R2.fastq
    │   └── ...
    ├── Run2/                     # name here can be anything (without spaces)
    │   ├── sample10_R1.fastq
    │   ├── sample10_R2.fastq
    │   ├── sample11_R1.fastq
    │   ├── sample11_R2.fastq
    │   └── ...
    ├── skip_Run3/                # this dir will be skipped
    │   ├── sample20_R1.fastq
    │   ├── sample20_R2.fastq
    │   ├── sample21_R1.fastq
    │   ├── sample21_R2.fastq
    │   └── ...
    └── merged_runs/              # this is the dir where the merged ASV/OTU table will be saved
        ├── ASVs.fasta
        ├── ASV_table.txt
        └── ...
skip_Run3 will be skipped.
The merged_runs directory will contain the merged ASV/OTU table; avoid naming your sequencing run directories **merged_runs**!
Merge sequencing runs
When working with multiple sequencing runs, you can merge the results into a single ASV/OTU table by enabling the MERGE SEQUENCING RUNS option in the DADA2 ASVs, UNOISE ASVs and vsearch OTUs pre-defined pipelines.
Note that NextITS and OptimOTU pipelines also support merging of sequencing runs, but require slightly different directory structure (see here for NextITS: NextITS and for OptimOTU: OptimOTU).
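Before selecting a working directory for the DADA2 ASVs, UNOISE ASVs or vsearch OTUs pipelines, the multiRunDir layout can be checked with a short script. The Python sketch below is only an illustration: the function `list_sequencing_runs` is hypothetical, and treating every directory whose name starts with `skip_` as skipped is an assumption based on the `skip_Run3` example.

```python
from pathlib import Path


def list_sequencing_runs(workdir):
    """Return the run directories under <workdir>/multiRunDir that would be processed."""
    multi_run_dir = Path(workdir) / "multiRunDir"
    if not multi_run_dir.is_dir():
        raise FileNotFoundError(f"expected a 'multiRunDir' directory inside {workdir}")
    runs = []
    for entry in sorted(multi_run_dir.iterdir()):
        if not entry.is_dir():
            continue
        # 'merged_runs' is reserved for the merged output table.
        # ASSUMPTION: a 'skip_' prefix marks a run that is ignored.
        if entry.name == "merged_runs" or entry.name.startswith("skip_"):
            continue
        if " " in entry.name:
            raise ValueError(f"run directory name contains spaces: {entry.name!r}")
        runs.append(entry.name)
    return runs
```

Calling `list_sequencing_runs("my_sequencing_runs")` on the example layout would list `Run1` and `Run2` only.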
DADA2 ASVs
This pre-defined workflow is based on the DADA2 tutorial to form ASVs and an ASV table. The input is the directory that contains per-sample fastq files (demultiplexed data).
The CUT PRIMERS step is not part of the DADA2 tutorial. Nevertheless, it is advisable to remove primers before proceeding with ASV generation with DADA2.
The DADA2 pipeline implemented here has three modes:
| DADA2 mode | When to use |
|---|---|
| PAIRED-END FORWARD | For paired-end Illumina data where amplicons are expected to be in 5’-3’ orientation. If you use PAIRED-END FORWARD mode but have sequences in mixed orientation, the reverse complement reads are not detected and are discarded. |
| PAIRED-END MIXED | For paired-end Illumina data where amplicons are expected to be in both 5’-3’ (forward) and 3’-5’ (reverse) orientation. In this mode, CUT PRIMERS is mandatory and generates separate directories for forward and reverse oriented sequences, which pass through the DADA2 pipeline individually. After merging the paired ends, the reverse oriented sequences are reverse complemented and aggregated with the forward reads for chimera filtering and ASV table generation. The output ASVs are all 5’-3’ oriented. If you use PAIRED-END MIXED mode, be sure your data is in mixed orientation (i.e. both 5’-3’ and 3’-5’ oriented sequences in samples); if this is not the case, PAIRED-END MIXED mode will report an ERROR after the quality filtering step (no output files generated after quality filtering). |
| SINGLE-END | For single-end PacBio data. The CUT PRIMERS step for single-end data will reorient all reads to 5’-3’ (forward) orientation. DADA2 denoising is performed with PacBioErrfun (errorEstFun = PacBioErrfun). |
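The reverse-complement-and-aggregate step of PAIRED-END MIXED mode can be sketched in Python as follows. This is a conceptual illustration only (the pipeline itself runs DADA2 in R); the function names and the `{sequence: count}` layout are hypothetical.

```python
# Complement table for nucleotides, including IUPAC ambiguity codes.
_COMPLEMENT = str.maketrans(
    "ACGTRYKMBDHVNacgtrykmbdhvn",
    "TGCAYRMKVHDBNtgcayrmkvhdbn",
)


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def aggregate_orientations(fwd_counts, rev_counts):
    """Add reverse oriented sequences to the forward ones after reverse complementing.

    Both arguments are {sequence: count} dictionaries (hypothetical layout).
    """
    combined = dict(fwd_counts)
    for seq, count in rev_counts.items():
        rc = reverse_complement(seq)
        combined[rc] = combined.get(rc, 0) + count
    return combined


combined = aggregate_orientations({"AACG": 5}, {"CGTT": 2})
```

Because `CGTT` is the reverse complement of `AACG`, both inputs collapse into one 5’-3’ oriented entry.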
Important
Working directory must contain at least 2 samples for DADA2 pipeline.
Default options:
| Analyses step | Default setting | Output directory |
|---|---|---|
| CUT PRIMERS | Mandatory for paired-end mixed mode, to obtain the fwd and rev oriented sequences | primersCut_out |
| QUALITY FILTERING | maxEE = 2; maxN = 0; minLen = 20; truncQ = 2; truncLen = 0; truncLen_R2 = 0 (for paired-end data); maxLen = 9999; minQ = 2; matchIDs = TRUE | qualFiltered_out |
| DENOISE | pool = FALSE; selfConsist = FALSE; qualityType = Auto | denoised_assembled.dada2 |
| MERGE PAIRS | minOverlap = 12 (for paired-end data); maxMismatch = 0; trimOverhang = FALSE; justConcatenate = FALSE | denoised_assembled.dada2 |
| CHIMERA FILTERING | method = consensus | chimeraFiltered_out; ASVs in ASVs_out.dada2 |
| | Filter tag-jumps and ASVs that are shorter/longer than expected length. f_value = 0.01 [defines the expected tag-jumps rate]; p_value = 1 [severity of tag-jump removal]; min_length = 32 [minimum length of OTU sequence]; max_length = 0 [max length of OTU sequence; 0 means no filtering] | ASVs_out.dada2/curated |
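To make the maxEE, maxN and minLen defaults more concrete, the Python sketch below computes the expected number of errors of a read from its Phred quality string (sum of 10^(-Q/10) over all bases) and applies the three thresholds. It is a simplified illustration, not DADA2's filtering code: truncation (truncQ, truncLen), maxLen and minQ are not modelled, and a Phred+33 encoding is assumed.

```python
def expected_errors(quality_string, phred_offset=33):
    """Sum of per-base error probabilities, 10^(-Q/10), for a FASTQ quality string."""
    return sum(10 ** (-(ord(ch) - phred_offset) / 10) for ch in quality_string)


def passes_quality_filter(seq, quality_string, max_ee=2.0, max_n=0, min_len=20):
    """Apply simplified maxEE / maxN / minLen checks to a single read."""
    if len(seq) < min_len:
        return False
    if seq.upper().count("N") > max_n:
        return False
    return expected_errors(quality_string) <= max_ee


good = passes_quality_filter("A" * 30, "I" * 30)   # Q40 at every base
noisy = passes_quality_filter("A" * 30, "+" * 30)  # Q10 at every base
```

A 30-base read at Q40 has about 0.003 expected errors and is kept, while the same read at Q10 has about 3 expected errors and exceeds maxEE = 2.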
UNOISE ASVs
UNOISE3 pipeline for making ASVs (zOTUs). Uses UNOISE3 algorithm in vsearch.
This automated workflow is mostly based on vsearch (Rognes et al. 2016) to form zOTUs and a zOTU table (herein also referred to as ASVs).
The input is the directory that contains per-sample fastq files (demultiplexed data).
| Analyses step | Default setting | Output directory |
|---|---|---|
| CUT PRIMERS (optional) | – | |
| | min_overlap = 12; min_length = 32; allow_merge_stagger = TRUE; include only R1 = FALSE; max_diffs = 20; max_Ns = 0; max_len = 600; keep_disjoined = FALSE; fastq_qmax = 41 | assembled_out |
| | maxEE = 1; maxN = 0; minLen = 32; max_length = undefined; qmax = 41; qmin = 0; maxee_rate = undefined | qualFiltered_out |
| | organisms = all; regions = all; partial = 50; region_for_clustering = ITS2; e_value = 1e-2; scores = 0; domains = 2; complement = TRUE; only_full = FALSE; truncate = TRUE | ITSx_out |
| | strands = both; minsize = 8; denoise_level = global; remove_chimeras = TRUE; unoise_alpha = 2; similarity_type = 2; maxaccepts = 1; maxrejects = 32; abskew = 16; mask = dust | clustering_out |
| | Filter tag-jumps and ASVs that are shorter/longer than expected length. f_value = 0.01 [defines the expected tag-jumps rate]; p_value = 1 [severity of tag-jump removal]; min_length = 32 [minimum length of OTU sequence]; max_length = 0 [max length of OTU sequence; 0 means no filtering] | clustering_out/curated |
vsearch OTUs
This automated workflow is mostly based on vsearch (Rognes et al. 2016) to form OTUs and an OTU table. The input is the directory that contains per-sample fastq files (demultiplexed data).
The final outputs of the pipeline are in the clustering_out directory, but a separate output directory is also created for each process (e.g. primersCut_out, chimeraFiltered_out etc.).
| Analyses step | Default setting | Output directory |
|---|---|---|
| CUT PRIMERS (optional) | – | |
| | min_overlap = 12; min_length = 32; allow_merge_stagger = TRUE; include only R1 = FALSE; max_diffs = 20; max_Ns = 0; max_len = 600; keep_disjoined = FALSE; fastq_qmax = 41 | assembled_out |
| | maxEE = 1; maxN = 0; minLen = 32; max_length = undefined; qmax = 41; qmin = 0; maxee_rate = undefined | qualFiltered_out |
| | pre_cluster = 0.98; min_unique_size = 1; denovo = TRUE; reference_based = undefined; abundance_skew = 2; min_h = 0.28 | chimeraFiltered_out |
| | organisms = all; regions = all; partial = 50; region_for_clustering = ITS2; cluster_full_and_partial = TRUE; e_value = 1e-2; scores = 0; domains = 2; complement = TRUE; only_full = FALSE; truncate = TRUE | ITSx_out |
| | OTU_type = centroid; similarity_threshold = 0.97; strands = both; remove_singletons = false; similarity_type = 2; sequence_sorting = cluster_size; centroid_type = similarity; max_hits = 1; mask = dust; dbmask = dust | clustering_out |
| | Filter tag-jumps and ASVs that are shorter/longer than expected length. f_value = 0.01 [defines the expected tag-jumps rate]; p_value = 1 [severity of tag-jump removal]; min_length = 32 [minimum length of OTU sequence]; max_length = 0 [max length of OTU sequence; 0 means no filtering] | clustering_out/curated |
NextITS
NextITS is an automated pipeline for analysing full-length ITS reads obtained via PacBio sequencing.
Note
Please see other details here: https://next-its.github.io
Please note that the NextITS pipeline accepts only a single primer pair, i.e., one forward and one reverse primer in STEP_1!
Important
NextITS in pipecraft v1.0.0 requires that your PC has at least 8 cores (and Docker has access to those cores).
NextITS requires your data and folders to be structured in a specific way (see below)!
Directory my_dir_for_NextITS contains Input [hard-coded requirement here] and one or multiple sequencing runs.
In the below example, the sequencing runs [RunID] are named as Run1, Run2 and Run3 (but naming can be different).
Although native NextITS requires multiplexed data as an input, the PipeCraft2’s implementation requires demultiplexed data. So, if you have multiplexed data, then first use the DEMULTIPLEX QuickTool.
In PipeCraft2, following the examples below, select my_dir_for_NextITS as a WORKDIR.
Single sequencing run
Select my_dir_for_NextITS as a WORKDIR in PipeCraft2.
my_dir_for_NextITS/               # SELECT THIS FOLDER AS WORKING DIRECTORY
└── Input/
    └── Run1/                     # name here can be anything (without spaces)
        ├── sample1.fastq.gz
        ├── sample2.fastq.gz
        ├── sample3.fastq.gz
        └── sample4.fastq.gz
Input data for this pipeline must be demultiplexed. If your data is multiplexed, use the demultiplexer from QuickTools before running the pipeline.
Sample naming
Please avoid non-ASCII symbols in SampleID,
and do not use the period symbol (.), as it represents the wildcard character in regular expressions.
Also, it is preferable not to start the sample name with a number.
Multiple sequencing runs
Select my_dir_for_NextITS as a WORKDIR in PipeCraft2.
my_dir_for_NextITS/               # SELECT THIS FOLDER AS WORKING DIRECTORY
└── Input/
    ├── Run1/                     # name here can be anything (without spaces)
    │   ├── Run1__sample1.fastq.gz
    │   ├── Run1__sample2.fastq.gz
    │   ├── Run1__sample3.fastq.gz
    │   └── Run1__sample4.fastq.gz
    ├── Run2/                     # name here can be anything (without spaces)
    │   ├── Run2__sample5.fastq.gz
    │   ├── Run2__sample6.fastq.gz
    │   ├── Run2__sample7.fastq.gz
    │   └── Run2__sample8.fastq.gz
    └── Run3/                     # name here can be anything (without spaces)
        ├── Run3__sample9.fastq.gz
        └── Run3__sample10.fastq.gz
Input data for this pipeline must be demultiplexed. If your data is multiplexed, use the demultiplexer from QuickTools before running the pipeline.
Sample naming
Please avoid non-ASCII symbols in RunID and SampleID,
and do not use the period symbol (.), as it represents the wildcard character in regular expressions.
Also, it is preferable not to start the sample name with a number.
NextITS uses the SequencingRunID__SampleID naming convention (please note the double underscore separating RunID and SampleID parts).
This naming scheme makes it easy to trace sequences back to their source, especially if the same sample was sequenced several times and is present in multiple sequencing runs.
In the later steps, extracting the SampleID part and summarizing read counts for such samples is easy.
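The naming rules above can be checked, and the SampleID part extracted, with a few lines of Python. The sketch below is not part of NextITS or PipeCraft2; the function names are hypothetical, and the `.fastq.gz` suffix is taken from the directory examples.

```python
from collections import Counter


def split_run_and_sample(filename):
    """Split 'RunID__SampleID.fastq.gz' into (RunID, SampleID)."""
    suffix = ".fastq.gz"
    if not filename.endswith(suffix):
        raise ValueError(f"not a .fastq.gz file: {filename!r}")
    stem = filename[: -len(suffix)]
    run_id, separator, sample_id = stem.partition("__")
    if not separator or not run_id or not sample_id:
        raise ValueError(f"expected RunID__SampleID naming: {filename!r}")
    return run_id, sample_id


def naming_problems(identifier):
    """List violations of the naming recommendations for a RunID or SampleID."""
    problems = []
    if not identifier.isascii():
        problems.append("contains non-ASCII symbols")
    if "." in identifier:
        problems.append("contains a period")
    if identifier[:1].isdigit():
        problems.append("starts with a number")
    return problems


def reads_per_sample(read_counts):
    """Sum read counts ({filename: n_reads}) over runs for each SampleID."""
    totals = Counter()
    for filename, n_reads in read_counts.items():
        _, sample_id = split_run_and_sample(filename)
        totals[sample_id] += n_reads
    return dict(totals)


totals = reads_per_sample({
    "Run1__sample1.fastq.gz": 100,
    "Run2__sample1.fastq.gz": 50,
    "Run2__sample5.fastq.gz": 7,
})
```

In this example `sample1` was sequenced in two runs, so its read counts are summed to 150.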
Default settings:
| Analyses step | Default setting |
|---|---|
| | primer_mismatch = 2; its_region = full; qc_maxhomopolymerlen = 25; qc_maxn = 4; ITSx_evalue = 1e-2; ITSx_partial = 0; ITSx_tax = all; chimera_rescue_occurrence = 2; tj f = 0.01; tj p = 1; hp = TRUE |
| | otu_id = 0.98; swarm_d = 1; lulu = TRUE; unoise = FALSE; otu_id_def = 2; otu_qmask = dust; swarm_fastidious = TRUE; unoise_alpha = 2; unoise_minsize = 8; max_MEEP = 0.5; max_chimera_score = 0.5; lulu_match = 95; lulu_ratio = 1; lulu_ratiotype = min; lulu_relcooc = 0.95; lulu_maxhits = 0 |
Cut primers
Please note that NextITS pipeline accepts only a single primer pair, i.e., one forward and one reverse primer!
| Setting | Tooltip |
|---|---|
| | Specify forward primer, IUPAC codes allowed |
| | Specify reverse primer, IUPAC codes allowed |
| | Specify allowed number of mismatches for primers |
Quality filtering
Filter sequences based on expected errors per sequence and per base, compress and correct homopolymers.
| Setting | Tooltip |
|---|---|
| | Maximum number of expected errors |
| | Maximum number of expected errors per base |
| | Discard sequences with more than the specified number of ambiguous nucleotides (N’s) |
| | Threshold for a homopolymer region length in a sequence |
| | Enable or disable homopolymer correction |
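The homopolymer length threshold and homopolymer compression can be illustrated with a minimal Python sketch. How NextITS actually corrects homopolymers is not described here, so the code only shows run-length detection and compression; the function names are hypothetical.

```python
from itertools import groupby


def longest_homopolymer(seq):
    """Length of the longest run of identical bases (0 for an empty sequence)."""
    return max((sum(1 for _ in run) for _, run in groupby(seq.upper())), default=0)


def compress_homopolymers(seq):
    """Collapse every run of identical bases to a single base."""
    return "".join(base for base, _ in groupby(seq.upper()))


seq = "ACCCGTTTTTA"
longest = longest_homopolymer(seq)       # run of five T's
compressed = compress_homopolymers(seq)
```

A sequence whose longest run exceeds the threshold (qc_maxhomopolymerlen = 25 in the default settings) would be flagged by a check such as `longest_homopolymer(seq) > 25`.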
ITS extraction
these conserved regions don’t offer species-level differentiation.
random errors in these areas can disrupt sequence clustering.
chimeric breakpoints, which are common in these regions, are hard to detect in short fragments ranging from 10 to 70 bases.
NextITS deploys the ITSx software (Bengtsson-Palme et al. 2013) for extracting the ITS sequence.
| Setting | Tooltip |
|---|---|
| | ITS part selector (ITS1, ITS2 or full) |
| | Taxonomy profile for ITSx can be used to restrict the search to only taxon(s) of interest. |
| | E-value cutoff threshold for ITSx |
| | Keep partial ITS sequences (specify a minimum length cutoff) |
Chimera filtering
An additional step in NextITS is the “rescue” of sequences that were initially flagged as chimeric but occur in at least 2 samples (which represent independent PCR reactions) and are thus likely false-positive chimeric sequences. The chimeric sequence occurrence frequency can be edited using the --chimera_rescue_occurrence parameter.
| Setting | Tooltip |
|---|---|
| | Database for reference based chimera removal (UDB) |
| | A minimum occurrence of initially flagged chimeric sequences required to rescue them |
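The rescue rule can be expressed in a few lines of Python. This is a conceptual sketch rather than the NextITS code; `rescue_chimeras` and its input layout (a set of flagged sequence IDs and a mapping from sequence ID to the samples it occurs in) are hypothetical.

```python
def rescue_chimeras(flagged, occurrences, min_occurrence=2):
    """Split flagged sequences into (rescued, still_chimeric).

    flagged: set of sequence IDs initially flagged as chimeric.
    occurrences: {sequence ID: set of samples in which it was observed}.
    """
    rescued = {
        seq_id for seq_id in flagged
        if len(occurrences.get(seq_id, ())) >= min_occurrence
    }
    return rescued, flagged - rescued


rescued, removed = rescue_chimeras(
    {"seqA", "seqB"},
    {"seqA": {"sample1", "sample2"}, "seqB": {"sample1"}},
)
```

`seqA` occurs in two samples and is rescued, whereas `seqB` occurs in one sample only and stays flagged.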
Tag-jump correction
Tag-jumps, sometimes referred to as index-switches or index cross-talk, may represent a significant concern in high-throughput sequencing (HTS) data. They can cause technical cross-contamination between samples, potentially distorting estimates of community composition. Here, tag-jump events are evaluated using the UNCROSS2 algorithm (Edgar 2018) and removed.
| Setting | Tooltip |
|---|---|
| | UNCROSS parameter f for tag-jump filtering |
| | UNCROSS parameter p for tag-jump filtering |
UNOISE denoising
The UNOISE algorithm (Edgar 2016) focuses on error-correction (or denoising) of amplicon reads. Essentially, UNOISE operates on the principle that if a sequence with low abundance closely resembles another sequence with high abundance, the former is probably an error. This helps differentiate between true biological variation and sequencing errors. It’s important to note that UNOISE was initially designed and optimized for Illumina data. Because of indel errors stemming from inaccuracies in homopolymeric regions, UNOISE might not work well with data that hasn’t undergone homopolymer correction.
| Setting | Tooltip |
|---|---|
| | Enable or disable denoising with UNOISE algorithm |
| | Alpha parameter for UNOISE |
| | Minimum sequence abundance |
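The role of the alpha parameter can be sketched with the abundance-skew rule described for UNOISE: a sequence with d differences from a more abundant centroid is treated as a probable error when its abundance relative to the centroid does not exceed β(d) = 1 / 2^(α·d + 1). The Python code below is a minimal illustration of that rule only (function names are hypothetical); it does not reproduce the full denoising procedure in vsearch.

```python
def unoise_beta(d, alpha=2.0):
    """Maximum abundance skew for a sequence with d differences to be called an error."""
    return 1.0 / 2 ** (alpha * d + 1)


def looks_like_error(abundance, centroid_abundance, d, alpha=2.0):
    """True if a low-abundance sequence is likely a noisy read of the centroid."""
    return abundance / centroid_abundance <= unoise_beta(d, alpha)


one_diff_threshold = unoise_beta(1)        # alpha = 2, one difference
likely_error = looks_like_error(10, 100, d=1)
likely_real = not looks_like_error(20, 100, d=1)
```

With alpha = 2 and one difference, the threshold is 1/8: a sequence at 10% of the centroid abundance is treated as an error, one at 20% is kept. Sequences below the minimum abundance are discarded before this comparison.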
Clustering
NextITS supports 3 different clustering methods:
vsearch: this employs greedy clustering using a fixed sequence similarity threshold with VSEARCH (Rognes et al. 2016);
swarm: dynamic sequence similarity threshold for clustering with SWARM (Mahé et al. 2021);
unoise: creates zero-radius OTUs (zOTUs) based on the UNOISE3 algorithm (Edgar 2016).
| Setting | Tooltip |
|---|---|
| | Sequence clustering method (choose from: vsearch, swarm, unoise) |
| | Sequence similarity threshold |
| | Sequence similarity definition (applied to UNOISE as well) |
| | Method to mask low-complexity sequences (applied to UNOISE as well) |
| | SWARM clustering resolution (d) |
| | Link nearby low-abundance swarms (fastidious option) |
Post-clustering with LULU
The purpose of LULU is to reduce the number of erroneous OTUs in OTU tables to achieve more realistic biodiversity metrics. By evaluating the co-occurrence patterns of OTUs among samples, LULU identifies OTUs that consistently satisfy some user-selected criteria for being errors of more abundant OTUs and merges these OTUs.
| Setting | Tooltip |
|---|---|
| | Enable or disable post-clustering curation with lulu |
| | Minimum similarity threshold |
| | Minimum abundance ratio |
| | Abundance ratio type - “min” or “avg” |
| | Relative co-occurrence |
| | Maximum number of hits (0 = unlimited) |
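How the similarity, abundance ratio and co-occurrence settings interact can be illustrated with a simplified Python sketch. It reflects one reading of the LULU criteria and is not the LULU implementation: the function `lulu_is_error` and the `{sample: abundance}` layout are hypothetical, and the real tool additionally ranks candidate parent OTUs and limits the number of hits.

```python
def lulu_is_error(child, parent, similarity, min_match=95.0,
                  min_ratio=1.0, ratio_type="min", min_relcooc=0.95):
    """Decide whether a 'child' OTU looks like an error of a 'parent' OTU.

    child, parent: {sample: abundance}; similarity: percent identity between them.
    """
    if similarity < min_match:
        return False
    child_samples = [s for s, n in child.items() if n > 0]
    if not child_samples:
        return False
    # Relative co-occurrence: share of the child's samples that also contain the parent.
    shared = sum(1 for s in child_samples if parent.get(s, 0) > 0)
    if shared / len(child_samples) < min_relcooc:
        return False
    # Parent/child abundance ratio in every sample where the child is present.
    ratios = [parent.get(s, 0) / child[s] for s in child_samples]
    summary = min(ratios) if ratio_type == "min" else sum(ratios) / len(ratios)
    return summary > min_ratio


child = {"a": 2, "b": 1}
merged_into_parent = lulu_is_error(child, {"a": 20, "b": 5}, similarity=97.0)
```

Here the parent is present in every sample that contains the child and is always more abundant, so the child would be merged into the parent under the default settings listed earlier (lulu_match = 95, lulu_ratio = 1, lulu_ratiotype = min, lulu_relcooc = 0.95).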
OptimOTU
Note
Compared with the other pre-defined pipelines in PipeCraft, OptimOTU requires a lot of resources (CPU, RAM), so please allocate sufficient resources when running this pipeline. Due to the many optimized steps in the pipeline, a local run of OptimOTU takes comparatively more time.
Note
PipeCraft2’s implementation of OptimOTU in v1.1.0 is currently restricted to Fungi (ITS3-ITS4 and g/fITS7-ITS4 amplicons); the Metazoa COI amplicon mode is a beta version and is not available in the macOS version.
Docker env built based on optimotu_targets v5.1.0 (https://github.com/brendanf/optimotu_targets/releases/tag/v5.1.0) with optimotu=0.9.3 and optimotu.pipeline=0.5.2.
Important
OptimOTU requires a specific directory structure for input data (see below). Note that if you are analysing a single sequencing run, you still need to follow the directory structure, with just a single directory in “01_raw” (e.g. “Run1”, but you can name it as you want).
my_dir/
└── sequences/                    # SELECT THIS FOLDER AS WORKING DIRECTORY (name here can be anything)
    └── 01_raw/
        ├── Run1/                 # name here can be anything (without spaces)
        │   ├── sample1_R1.fastq.gz
        │   ├── sample1_R2.fastq.gz
        │   ├── sample2_R1.fastq.gz
        │   └── sample2_R2.fastq.gz
        ├── Run2/                 # name here can be anything (without spaces)
        │   ├── sample3_R1.fastq.gz
        │   ├── sample3_R2.fastq.gz
        │   ├── sample4_R1.fastq.gz
        │   └── sample4_R2.fastq.gz
        └── Run3/                 # name here can be anything (without spaces)
            ├── sample5_R1.fastq.gz
            └── sample5_R2.fastq.gz
When starting the OptimOTU pipeline in PipeCraft, the PROCESSING ... message is displayed in the upper left corner of the screen
(in the place where SELECT WORKDIR was). The whole OptimOTU pipeline is executed in the background with a
single R command, so the GUI gives no specific feedback on which exact process is running and which are completed.
Output files will be saved in the my_dir_for_optimotu/output directory.
Intermediate files will be saved in the my_dir_for_optimotu/sequences/02_trim etc directories.
Target taxa and sequence orientation
Specify whether the target taxa are fungi or metazoa, and whether the provided sequences are expected to be in forward, reverse or mixed orientation.
| Setting | Tooltip |
|---|---|
| | specify if target taxa is fungi or metazoa |
| seq orientation | specify if provided sequences are forward (fwd), reverse (rev) or mixed (mixed) |
Control sequences
Two types of control sequences are supported:
spike-in sequences: sequences that are added to the samples before PCR. These sequences are expected to be present in every sample, even in most types of negative control.
positive control sequences: sequences that are added to only a few specific positive control samples. These sequences are expected to be present only in the positive control samples, and their presence in other samples is indicative of cross-contamination (either in the lab or through “tag-switching”).
In practice both types are treated the same by the pipeline, they are just reported separately.
The sequences should be in a fasta file. Specifying either or both type of control sequences is optional.
| Setting | Tooltip |
|---|---|
| | (optional) specify a file with spike-in sequences |
| | (optional) specify a file with positive control sequences |
Cut primers and trim reads
Cut primers and trim reads according to the specified parameters (using cutadapt).
| Setting | Tooltip |
|---|---|
| | specify forward primer sequence (supports only single primer) |
| | specify reverse primer sequence (supports only single primer) |
| | maximum allowed error rate in the primer search |
| | truncate ends (3’) of R1 at first base with quality score <= N |
| | truncate ends (3’) of R2 at first base with quality score <= N |
| | minimum length of the trimmed sequence |
| | remove N bases from start of R1 |
| | remove N bases from start of R2 |
| action | trim = trim the primers from the reads; retain = retain the primers after they have been found |
| custom_sample_table | custom primer trimming parameters per sample can be given as columns in the sample table. See example below. |
Custom sample table
Example of custom primer trimming parameters per sample (tab-delimited):
| seqrun | samples | fastq_R1 | fastq_R2 | orient |
|---|---|---|---|---|
| run1 | sample1 | sample1_R1.fq.gz | sample1_R2.fq.gz | fwd |
| run1 | sample2 | sample2_R1.fq.gz | sample2_R2.fq.gz | fwd |
| run2 | sample3 | sample3_R1.fq.gz | sample3_R2.fq.gz | rev |
| run2 | sample4 | sample4_R1.fq.gz | sample4_R2.fq.gz | rev |
| run3 | sample5 | sample5_R1.fq.gz | sample5_R2.fq.gz | mixed |
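A tab-delimited table with these columns can be written with Python's standard `csv` module. The sketch below is only one way to produce such a file: the function `sample_table_tsv` is hypothetical, the column names are copied from the example above, and restricting `orient` to fwd, rev or mixed is an assumption based on the orientation setting.

```python
import csv
import io

COLUMNS = ["seqrun", "samples", "fastq_R1", "fastq_R2", "orient"]
ALLOWED_ORIENTATIONS = {"fwd", "rev", "mixed"}  # ASSUMPTION: taken from the orientation setting


def sample_table_tsv(rows):
    """Render sample rows (list of dicts) as tab-delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in rows:
        if row["orient"] not in ALLOWED_ORIENTATIONS:
            raise ValueError(f"unexpected orientation: {row['orient']!r}")
        writer.writerow(row)
    return buffer.getvalue()


rows = [
    {"seqrun": "run1", "samples": "sample1", "fastq_R1": "sample1_R1.fq.gz",
     "fastq_R2": "sample1_R2.fq.gz", "orient": "fwd"},
    {"seqrun": "run2", "samples": "sample3", "fastq_R1": "sample3_R1.fq.gz",
     "fastq_R2": "sample3_R2.fq.gz", "orient": "rev"},
]
table_text = sample_table_tsv(rows)
```

The resulting text can be saved to a file and supplied through the custom_sample_table setting.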
Quality filtering
Quality filtering settings; performed using DADA2. Sequences with ambiguous nucleotides (N’s) are discarded.
| Setting | Tooltip |
|---|---|
| | discard sequences with more than the specified number of expected errors in R1 reads |
| | discard sequences with more than the specified number of expected errors in R2 reads |
Denoising and merging paired-end reads
There are no adjustable settings for denoising. The denoising steps are performed using the DADA2 package (Callahan et al. 2016). Error profiles are learned separately for each sequencing run, read, and orientation using the learnErrors() function. Sequences with binned quality scores, as produced by newer Illumina sequencers, are automatically detected, and the error model is adjusted accordingly. Denoising is then performed using the dada() function, and read pairs are merged using the mergePairs() function.
Chimera filtering
Chimera filtering is performed using the consensus algorithm implemented in DADA2’s isBimeraDenovoTable() function.
The additional database provided in the PROTAX CLASSIFICATION step (with_outgroup file) is used for reference-based chimera filtering (vsearch --uchime_ref).
Filter tag-jumps
Filter potential cases of tag-switching with UNCROSS2 algorithm (Edgar 2018).
| Setting | Tooltip |
|---|---|
| f value | f-parameter of UNCROSS2, which defines the expected tag-jumps rate. Default is 0.03 (equivalent to 3%). A higher value enforces stricter filtering |
| p value | p-parameter, which controls the severity of tag-jump removal. It adjusts the exponent in the UNCROSS formula. Default is 1. Opt for 0.5 or 0.3 to steepen the curve |
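The effect of f and p can be explored with a small Python sketch of an UNCROSS2-style score. The scoring form used here, score = 2 / (1 + exp(x / z)^p) with z = f · N / n (x = abundance of an OTU in one sample, N = total abundance of that OTU, n = number of samples), and the cutoff `t_min = 0.1` are assumptions made for illustration; they are not taken from this documentation, and the pipeline's own implementation may differ.

```python
import math


def uncross_score(x, total, n_samples, f=0.03, p=1.0):
    """UNCROSS2-style score (ASSUMED form); values near 1 suggest a tag-jump."""
    z = f * total / n_samples          # expected cross-talk abundance per sample
    exponent = p * x / z               # exp(x / z) ** p == exp(p * x / z)
    if exponent > 700:                 # avoid float overflow in math.exp
        return 0.0
    return 2.0 / (1.0 + math.exp(exponent))


def is_tag_jump(x, total, n_samples, f=0.03, p=1.0, t_min=0.1):
    """Flag an occurrence as a tag-jump if its score reaches t_min (ASSUMED cutoff)."""
    return uncross_score(x, total, n_samples, f, p) >= t_min


low = is_tag_jump(1, total=1000, n_samples=10)     # 1 read of an OTU with 1000 reads
high = is_tag_jump(100, total=1000, n_samples=10)  # 100 reads of the same OTU
```

Under these assumptions, a single read of an abundant OTU is flagged, while an occurrence of 100 reads is not. Raising f or lowering p increases the score for a given abundance, so more occurrences are flagged.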
Amplicon model setting
| Setting | Tooltip |
|---|---|
| model type | statistical sequence model type for aligning ASVs prior to use of protaxA and/or NuMt detection and for filtering ASVs to remove spurious sequences. |
| model file | inbuilt ITS3_ITS4.cm and gITS7_ITS4.cm files are optimized for ITS3-ITS4 and gITS7-ITS4 amplicons for fungi. COI.hmm is an HMM model for COI amplicons. A custom model may be supplied. |
| | filter out sequences that are likely to be NUMTs (mitochondrial coding amplicon genes) |
| max model start | maximum start position of the model (the match must start at this point in the model or earlier) |
| | minimum end position of the model (the match must end at this point in the model or later) |
| | minimum bit score threshold for model matches |
ProTAX classification
| Setting | Tooltip |
|---|---|
| location | directory where protax is located. Default is protaxFungi for fungi and protaxAnimal for metazoa (included in the PipeCraft2 container) |
| UNITE_SHs | additional database which also contains outgroup (non-target) sequences from the same locus. For fungi, default is UNITE_SHs, which is sh_matching_data_0_5_v9 sequences (included in the PipeCraft2 container) |
Clustering
| Setting | Tooltip |
|---|---|
| cluster thresholds | select file with clustering thresholds. Default is pre-calculated thresholds for Fungi from Global Spore Sampling Project (Ovaskainen et al. 2024) |