Pre-compiled pipelines 

Pre-compiled pipelines in PipeCraft2 provide automated workflows for processing amplicon sequencing data. These pipelines include options for generating ASVs with DADA2, ASVs with UNOISE3, OTUs with vsearch, and specialized pipelines like NextITS and OptimOTU. Each pipeline is carefully configured with sensible defaults while still allowing customization of key parameters to suit different experimental needs.

See the Example data analyses page for worked examples that use the pre-compiled pipelines.

Use at least 2 samples per sequencing run for the pre-compiled pipelines.

Working with multiple sequencing runs

Applicable to: DADA2 ASVs, UNOISE ASVs, vsearch OTUs pre-compiled pipelines.

When working with multiple sequencing runs, the pre-compiled pipelines can automatically process each sequencing run separately and then merge the results into a single output OTU/ASV table. Processing each sequencing run separately is necessary for appropriate handling of run-specific error profiles and tag-jump filtering.

Identical sequences from different runs are recognized and merged into a single ASV (or OTU, in the vsearch OTUs pipeline).

Note

If the total number of samples exceeds 5000, the output of the UNOISE ASVs and vsearch OTUs pipelines is a long-format feature (ASV/OTU) table.
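The merging behaviour described above can be illustrated with a minimal Python sketch. It is not PipeCraft2's own code: the function names `merge_runs` and `to_long_format` and the in-memory table layout are assumptions made for illustration. It shows identical sequences collapsing into one feature, counts of same-named samples being summed, and the conversion of the merged table to long format.

```python
from collections import defaultdict


def merge_runs(run_tables):
    """Merge per-run feature tables keyed by sequence.

    run_tables: list of dicts {sequence: {sample_name: read_count}}.
    Identical sequences collapse into one feature; counts of samples
    that share a name across runs are summed.
    """
    merged = defaultdict(lambda: defaultdict(int))
    for table in run_tables:
        for seq, counts in table.items():
            for sample, n in counts.items():
                merged[seq.upper()][sample] += n
    return {seq: dict(counts) for seq, counts in merged.items()}


def to_long_format(merged):
    """Return sorted (sequence, sample, count) rows, omitting zero counts."""
    return sorted(
        (seq, sample, n)
        for seq, counts in merged.items()
        for sample, n in counts.items()
        if n > 0
    )


# Hypothetical toy tables from two sequencing runs
run1 = {"ACGT": {"sample1": 10, "sample2": 3}, "TTGA": {"sample1": 5}}
run2 = {"ACGT": {"sample10": 7, "sample1": 2}, "GGCC": {"sample11": 4}}

merged = merge_runs([run1, run2])
for row in to_long_format(merged):
    print(*row, sep="\t")
```

A long-format table stores one row per non-zero sequence/sample combination, which stays compact when thousands of samples would otherwise make a wide table mostly zeros.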

Directory structure

Important

To combine samples from multiple sequencing runs, follow the directory structure below. **multiRunDir** is the mandatory directory name (the nested sequencing run directories can be named freely).

When specifying a working directory in PipeCraft2 for processing multiple sequencing runs, select the parent directory of multiRunDir (e.g. my_sequencing_runs in the example below).

Working with multiple sequencing runs

 my_sequencing_runs/             # SELECT THIS FOLDER AS WORKING DIRECTORY
 └── multiRunDir/                # name here MUST BE multiRunDir
     ├── Run1/                   # name here can be anything (without spaces)
     │   ├── sample1_R1.fastq
     │   ├── sample1_R2.fastq
     │   ├── sample2_R1.fastq
     │   ├── sample2_R2.fastq
     │   └── ...
     ├── Run2/                   # name here can be anything (without spaces)
     │   ├── sample10_R1.fastq
     │   ├── sample10_R2.fastq
     │   ├── sample11_R1.fastq
     │   ├── sample11_R2.fastq
     │   └── ...
     ├── skip_Run3/               # this dir will be skipped
     │   ├── sample20_R1.fastq
     │   ├── sample20_R2.fastq
     │   ├── sample21_R1.fastq
     │   ├── sample21_R2.fastq
     │   └── ...
     └── merged_runs/             # this is the dir where the merged ASV/OTU table will be saved
         ├── ASVs.fasta
         ├── ASV_table.txt
         └── ...

Note that you can **skip** processing any sequencing run by adding a skip_ prefix to the directory name. In this example here, sequencing run skip_Run3 will be skipped.

The merged_runs directory will contain the merged ASV/OTU table, so do not name any of your sequencing run directories **merged_runs**!

Fastq files with the **same name** will be considered as the same sample and will be merged in the final ASV/OTU table.
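Before starting a run, the layout rules above can be checked with a short script. The following is a sketch under stated assumptions, not part of PipeCraft2: the helper names `plan_runs` and `scan_workdir` are hypothetical, and R1 files are assumed to be recognisable by `_R1` in the file name, as in the example tree.

```python
from pathlib import Path


def plan_runs(run_dirs):
    """Classify the sequencing run directories found inside multiRunDir.

    run_dirs: dict {directory name: list of file names in that directory}.
    Returns (runs_to_process, skipped_runs, problems).
    """
    to_process, skipped, problems = [], [], []
    for name, files in sorted(run_dirs.items()):
        if name == "merged_runs":
            continue  # reserved for the merged ASV/OTU table
        if name.startswith("skip_"):
            skipped.append(name)  # skip_ prefix: run is not processed
            continue
        if " " in name:
            problems.append(f"{name!r}: directory name contains a space")
        # Assumption: one '_R1' file per sample
        samples = {f.split("_R1")[0] for f in files if "_R1" in f}
        if len(samples) < 2:
            problems.append(f"{name!r}: fewer than 2 samples")
        to_process.append(name)
    return to_process, skipped, problems


def scan_workdir(workdir):
    """Read <workdir>/multiRunDir from disk and pass it to plan_runs()."""
    multi = Path(workdir) / "multiRunDir"
    if not multi.is_dir():
        raise FileNotFoundError(f"{multi} not found")
    return plan_runs({
        d.name: [f.name for f in d.iterdir() if f.is_file()]
        for d in multi.iterdir() if d.is_dir()
    })


# The example tree from above, written as a plain dictionary
layout = {
    "Run1": ["sample1_R1.fastq", "sample1_R2.fastq",
             "sample2_R1.fastq", "sample2_R2.fastq"],
    "Run2": ["sample10_R1.fastq", "sample10_R2.fastq",
             "sample11_R1.fastq", "sample11_R2.fastq"],
    "skip_Run3": ["sample20_R1.fastq", "sample20_R2.fastq"],
    "merged_runs": ["ASVs.fasta", "ASV_table.txt"],
}
print(plan_runs(layout))
```

For a real data set, `scan_workdir("my_sequencing_runs")` would apply the same checks to the directories on disk.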

Merge sequencing runs

When working with multiple sequencing runs, you can merge the results into a single ASV/OTU table by enabling the MERGE SEQUENCING RUNS option in the DADA2 ASVs, UNOISE ASVs and vsearch OTUs pre-compiled pipelines.

Note that the NextITS and OptimOTU pipelines also support merging of sequencing runs, but require a slightly different directory structure (see the NextITS and OptimOTU sections).

DADA2 ASVs

This pre-defined workflow is based on the DADA2 tutorial to form ASVs and an ASV table. The input is the directory that contains per-sample fastq files (demultiplexed data).

Note that the CUT PRIMERS step is not part of the DADA2 tutorial. Nevertheless, it is advisable to remove primers before proceeding with ASV generation with DADA2.

The DADA2 pipeline implemented here has three modes:

DADA2 mode	When to use
`PAIRED-END FORWARD`	For paired-end Illumina data where amplicons are expected to be in 5’-3’ orientation. If you use `PAIRED-END FORWARD` mode on sequences in mixed orientation, the reverse-complement reads are not detected and are discarded.
`PAIRED-END MIXED`	For paired-end Illumina data where amplicons are expected in both 5’-3’ (forward) and 3’-5’ (reverse) orientation. In this mode, `CUT PRIMERS` is mandatory and generates separate directories for forward- and reverse-oriented sequences, which pass through the DADA2 pipeline individually. After the paired ends are merged, the reverse-oriented sequences are reverse complemented and aggregated with the forward reads for chimera filtering and ASV table generation. The output ASVs are all 5’-3’ oriented. Use `PAIRED-END MIXED` mode only if your data really are in mixed orientation (i.e. samples contain both 5’-3’ and 3’-5’ oriented sequences); otherwise the mode reports an ERROR after the quality filtering step (no output files are generated after quality filtering).
`SINGLE-END`	For single-end PacBio data. The `CUT PRIMERS` step for single-end data reorients all reads to 5’-3’ (forward) orientation. DADA2 denoising uses the PacBio error function (errorEstFun = PacBioErrfun).
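The reverse-complement-and-aggregate step of `PAIRED-END MIXED` mode can be pictured with the following Python sketch. The pipeline itself runs DADA2 in R; this is only an illustration, and the names `reverse_complement` and `pool_orientations` as well as the count dictionaries are assumptions.

```python
from collections import Counter

# IUPAC-aware complement table (upper and lower case)
_COMPLEMENT = str.maketrans(
    "ACGTURYSWKMBDHVNacgturyswkmbdhvn",
    "TGCAAYRSWMKVHDBNtgcaayrswmkvhdbn",
)


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def pool_orientations(forward_counts, reverse_counts):
    """Combine counts of forward- and reverse-oriented sequences.

    Reverse-oriented sequences are reverse complemented first, so every
    sequence in the pooled result is in 5'-3' orientation.
    """
    pooled = Counter(forward_counts)
    for seq, n in reverse_counts.items():
        pooled[reverse_complement(seq)] += n
    return dict(pooled)


# Hypothetical toy counts
fwd = {"AACCGT": 5}
rev = {"ACGGTT": 3, "TTTTGG": 2}  # ACGGTT is the reverse complement of AACCGT
print(pool_orientations(fwd, rev))
```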

Important

Working directory must contain at least 2 samples for DADA2 pipeline.

Pipeline workflow with default options:

Analyses step	Default setting	output directory
CUT PRIMERS (only mandatory for `paired-end mixed` mode)	`forward primers` = NULL `reverse primers` = NULL `mismatches` = 1 `min overlap` = 21 `seqs to keep` = keep_all `pair filter` = both `no indels` = TRUE	`primersCut_out`
QUALITY FILTERING	`maxEE` = 2 `maxN` = 0 `minLen` = 20 `truncQ` = 2 `truncLen` = 0 `truncLen_R2` = 0 (for paired-end data) `maxLen` = 9999 `minQ` = 2 `matchIDs` = TRUE `trimLeft` = 0 `trimRight` = 0	`qualFiltered_out`
DENOISE	`BAND SIZE` = 16 (32 for PacBio) `nbases` = 1e+8 `randomize` = TRUE `OMEGA_A` = 1e-40 `OMEGA_P` = 1e-4 `OMEGA_C` = 0 `Homopoly gap penalty` = NULL `DETECT_SINGLETONS` = FALSE	`denoised_assembled.dada2` `denoised.dada2` for single-end data
MERGE PAIRS	`minOverlap` = 12 (for paired-end data) `maxMismatch` = 0 `trimOverhang` = FALSE `justConcatenate` = FALSE	`denoised_assembled.dada2`
CHIMERA FILTERING	`method` = consensus	`chimeraFiltered_out` ASVs in `ASVs_out.dada2`
CURATE ASV TABLE (optional)	filter tag-jumps and ASVs that are shorter/longer than expected length. `f_value` = 0.01 [defines the expected tag-jumps rate] `p_value` = 1 [severity of tag-jump removal] `min_length` = 32 [minimum length of OTU sequence] `max_length` = 0 [max length of OTU sequence; 0 means no filtering]	`ASVs_out.dada2/curated`
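To make the `maxEE`, `maxN`, `minLen` and `truncQ` defaults in the table more concrete, here is a simplified Python sketch of how such a filter treats a single read. It is not the DADA2 implementation (which is an R function with more options); the function names are hypothetical and Phred+33 quality encoding is assumed. The expected number of errors is the sum of the per-base error probabilities, 10^(-Q/10).

```python
def expected_errors(quals):
    """Sum of per-base error probabilities for a list of Phred scores."""
    return sum(10 ** (-q / 10) for q in quals)


def passes_filter(seq, qual_string, max_ee=2.0, max_n=0, min_len=20, trunc_q=2):
    """Simplified single-read quality filter (illustration only).

    1. Truncate the read at the first base with quality <= trunc_q.
    2. Discard reads shorter than min_len.
    3. Discard reads with more than max_n ambiguous bases (N).
    4. Discard reads with more than max_ee expected errors.
    """
    quals = [ord(c) - 33 for c in qual_string]  # assumption: Phred+33
    for i, q in enumerate(quals):
        if q <= trunc_q:
            seq, quals = seq[:i], quals[:i]
            break
    if len(seq) < min_len:
        return False
    if seq.upper().count("N") > max_n:
        return False
    return expected_errors(quals) <= max_ee


# 30 bases at Q40: expected errors = 30 * 0.0001, so the read is kept
print(passes_filter("A" * 30, "I" * 30))
# 30 bases at Q10: expected errors = 30 * 0.1, so the read is discarded
print(passes_filter("A" * 30, "+" * 30))
```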

UNOISE ASVs

UNOISE3 pipeline for making ASVs (zOTUs). Uses the UNOISE3 algorithm as implemented in vsearch.

This automated workflow is mostly based on vsearch (Rognes et al. 2016) to form zOTUs and a zOTU table (herein also referred to as ASVs).

The input is the directory that contains per-sample fastq files (demultiplexed data).

Pipeline workflow with default options:

Analyses step	Default setting	output directory
CUT PRIMERS (optional)	`forward primers` = NULL `reverse primers` = NULL `mismatches` = 1 `min overlap` = 21 `seqs to keep` = keep_all `pair filter` = both `no indels` = TRUE	`primersCut_out`
QUALITY FILTERING with vsearch	`maxEE` = 1 `maxN` = 0 `min length` = 32 `trunc length` `qmax` = 41 `max_length` = undefined `qmin` = 0 `maxee_rate` = undefined `truncqual` = 0 `truncee` = 0 `strip left` = 0 `strip right` = 0	`qualFiltered_out`
MERGE READS	`min_overlap` = 12 `min_length` = 32 `allow_merge_stagger` = TRUE `include only R1` = FALSE `max_diffs` = 20 `max_Ns` = 0 `max_len` = 600 `keep_disjoined` = FALSE `fastq_qmax` = 41	`assembled_out`
ITS Extractor (optional)	`organisms` = all `regions` = all `partial` = 50 `region_for_clustering` = ITS2 `e_value` = 1e-2 `scores` = 0 `domains` = 2 `complement` = TRUE `only_full` = FALSE `truncate` = TRUE	`ITSx_out`
CLUSTERING with UNOISE3	`strands` = both `minsize` = 8 `denoise_level` = global `remove_chimeras` = TRUE `unoise_alpha` = 2 `similarity_type` = 2 `maxaccepts` = 1 `maxrejects` = 32 `abskew` = 16 `mask` = dust	`clustering_out`
CURATE ASV TABLE (optional)	filter tag-jumps and ASVs that are shorter/longer than expected length. `f_value` = 0.01 [defines the expected tag-jumps rate] `p_value` = 1 [severity of tag-jump removal] `min_length` = 32 [minimum length of OTU sequence] `max_length` = 0 [max length of OTU sequence; 0 means no filtering]	`clustering_out/curated`

vsearch OTUs

This automated workflow is mostly based on vsearch (Rognes et al. 2016) to form OTUs and an OTU table. The input is the directory that contains per-sample fastq files (demultiplexed data).

Final pipeline outputs are in the clustering_out directory, but each process also creates its own output directory (e.g. primersCut_out, chimeraFiltered_out etc.).

Pipeline workflow with default options:

Analyses step	Default setting	output directory
CUT PRIMERS (optional)	`forward primers` = NULL `reverse primers` = NULL `mismatches` = 1 `min overlap` = 21 `seqs to keep` = keep_all `pair filter` = both `no indels` = TRUE	`primersCut_out`
QUALITY FILTERING with vsearch	`maxEE` = 1 `maxN` = 0 `min length` = 32 `trunc length` `qmax` = 41 `max_length` = undefined `qmin` = 0 `maxee_rate` = undefined `truncqual` = 0 `truncee` = 0 `strip left` = 0 `strip right` = 0	`qualFiltered_out`
MERGE READS	`min_overlap` = 12 `min_length` = 32 `allow_merge_stagger` = TRUE `include only R1` = FALSE `max_diffs` = 20 `max_Ns` = 0 `max_len` = 600 `keep_disjoined` = FALSE `fastq_qmax` = 41	`assembled_out`
CHIMERA FILTERING with uchime_denovo	`pre_cluster` = 0.98 `min_unique_size` = 1 `denovo` = TRUE `reference_based` = undefined `abundance_skew` = 2 `min_h` = 0.28	`chimeraFiltered_out`
ITS Extractor (optional)	`organisms` = all `regions` = all `partial` = 50 `region_for_clustering` = ITS2 `cluster_full_and_partial` = TRUE `e_value` = 1e-2 `scores` = 0 `domains` = 2 `complement` = TRUE `only_full` = FALSE `truncate` = TRUE	`ITSx_out`
CLUSTERING	`OTU_type` = centroid `similarity_threshold` = 0.97 `strands` = both `remove_singletons` = false `similarity_type` = 2 `sequence_sorting` = cluster_size `centroid_type` = similarity `max_hits` = 1 `mask` = dust `dbmask` = dust	`clustering_out`
CURATE ASV TABLE (optional)	filter tag-jumps and ASVs that are shorter/longer than expected length. `f_value` = 0.01 [defines the expected tag-jumps rate] `p_value` = 1 [severity of tag-jump removal] `min_length` = 32 [minimum length of OTU sequence] `max_length` = 0 [max length of OTU sequence; 0 means no filtering]	`clustering_out/curated`

NextITS

NextITS is an automated pipeline for analysing full-length ITS reads obtained via PacBio sequencing.

This pipeline implements:
* primer trimming
* quality filtering
* full-length ITS region extraction
* correction of homopolymer errors
* chimera filtering (get database for reference-based chimera filtering here)
* recovery of sequences false-positively annotated as chimeric
* detection of tag-switching artifacts per sequencing run
* multiple options for sequence clustering
* post-clustering with LULU

Note

Please see other details here: https://next-its.github.io

Please note that the NextITS pipeline accepts only a single primer pair, i.e., one forward and one reverse primer in STEP_1!

Important

NextITS in pipecraft v1.0.0 requires that your PC has at least 8 cores (and Docker has access to those cores; see here).

NextITS requires your data and folders to be structured in a specific way (see below)! Directory my_dir_for_NextITS contains Input [hard-coded requirement here] and one or multiple sequencing runs. In the below example, the sequencing runs [RunID] are named as Run1, Run2 and Run3 (but naming can be different).

Although native NextITS requires multiplexed data as input, the PipeCraft2 implementation requires demultiplexed data. So, if you have multiplexed data, first use the DEMULTIPLEX QuickTool.

In PipeCraft2, following the examples below, select my_dir_for_NextITS as a WORKDIR.

Download example data set here

Single sequencing run

Select my_dir_for_NextITS as a WORKDIR in PipeCraft2.

Directory structure for analysing a single sequencing run:

Required directory structure for NextITS

 my_dir_for_NextITS/   # SELECT THIS FOLDER AS WORKING DIRECTORY
 └── Input/
     ├── Run1/      # name here can be anything (without spaces)
     │   ├── sample1.fastq.gz
     │   ├── sample2.fastq.gz
     │   ├── sample3.fastq.gz
     │   └── sample4.fastq.gz

Input data for this pipeline must be demultiplexed. If your data is multiplexed, use the demultiplexer from QuickTools before running the pipeline.

Sample naming

Please avoid non-ASCII symbols in SampleID, and do not use the period symbol (.), as it represents the wildcard character in regular expressions. Also, it is preferable not to start the sample name with a number.

Multiple sequencing runs

Select my_dir_for_NextITS as a WORKDIR in PipeCraft2.

Directory structure for analysing multiple sequencing runs:

Required directory structure for NextITS

 my_dir_for_NextITS/   # SELECT THIS FOLDER AS WORKING DIRECTORY
 └── Input/
     ├── Run1/      # name here can be anything (without spaces)
     │   ├── Run1__sample1.fastq.gz
     │   ├── Run1__sample2.fastq.gz
     │   ├── Run1__sample3.fastq.gz
     │   └── Run1__sample4.fastq.gz
     ├── Run2/      # name here can be anything (without spaces)
     │   ├── Run2__sample5.fastq.gz
     │   ├── Run2__sample6.fastq.gz
     │   ├── Run2__sample7.fastq.gz
     │   └── Run2__sample8.fastq.gz
     └── Run3/      # name here can be anything (without spaces)
         ├── Run3__sample9.fastq.gz
         └── Run3__sample10.fastq.gz

Input data for this pipeline must be demultiplexed. If your data is multiplexed, use the demultiplexer from QuickTools before running the pipeline.

Sample naming

Please avoid non-ASCII symbols in RunID and SampleID, and do not use the period symbol (.), as it represents the wildcard character in regular expressions. Also, it is preferable not to start the sample name with a number.

NextITS uses the SequencingRunID__SampleID naming convention (please note the double underscore separating the RunID and SampleID parts). This naming scheme makes it easy to trace sequences back to their source, especially if the same sample was sequenced several times and is present in multiple sequencing runs. In later steps, it is then straightforward to extract the SampleID part and summarize read counts for such samples.
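The naming rules and the per-sample summarizing mentioned above can be sketched in a few lines of Python. This is an illustration, not NextITS code: the function names are hypothetical, and names are assumed to be passed without the `.fastq.gz` extension.

```python
from collections import Counter


def split_name(name):
    """Split 'RunID__SampleID' at the first double underscore."""
    run_id, sep, sample_id = name.partition("__")
    if not sep or not run_id or not sample_id:
        raise ValueError(f"{name!r} does not follow RunID__SampleID")
    return run_id, sample_id


def naming_problems(name):
    """List violations of the sample naming recommendations."""
    problems = []
    if not name.isascii():
        problems.append("non-ASCII character")
    if "." in name:
        problems.append("period in name")
    sample_id = name.partition("__")[2] or name
    if sample_id[:1].isdigit():
        problems.append("sample name starts with a digit")
    return problems


def counts_per_sample(read_counts):
    """Sum read counts over runs for samples sharing a SampleID."""
    totals = Counter()
    for name, n in read_counts.items():
        totals[split_name(name)[1]] += n
    return dict(totals)


# Hypothetical read counts; sample1 was sequenced in two runs
reads = {"Run1__sample1": 100, "Run2__sample1": 50, "Run2__sample5": 7}
print(counts_per_sample(reads))
print(naming_problems("Run1__1sample"))
```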

Default settings in the NextITS pipeline panels:

Analyses step	Default setting
STEP 1: QUALITY CONTROL, ARTEFACT REMOVAL	`primer_mismatch` = 2 `its_region` = full `qc_maxhomopolymerlen` = 25 `qc_maxn` = 4 `ITSx_evalue` = 1e-2 `ITSx_partial` = 0 `ITSx_tax` = all `chimera_rescue_occurrence` = 2 `tj f` = 0.01 `tj p` = 1 `hp` = TRUE
STEP 2: DATA AGGREGATION, CLUSTERING	`otu_id` = 0.98 `swarm_d` = 1 `lulu` = TRUE `unoise` = FALSE `otu_id_def` = 2 `otu_qmask` = dust `swarm_fastidious` = TRUE `unoise_alpha` = 2 `unoise_minsize` = 8 `max_MEEP` = 0.5 `max_chimera_score` = 0.5 `lulu_match` = 95 `lulu_ratio` = 1 `lulu_ratiotype` = min `lulu_relcooc` = 0.95 `lulu_maxhits` = 0

Cut primers

Note

The NextITS pipeline accepts only a single primer pair, i.e., one forward and one reverse primer!

Setting	Tooltip
`primer_forward`	Specify forward primer, IUPAC codes allowed
`primer_reverse`	Specify reverse primer, IUPAC codes allowed
`primer_mismatch`	Specify allowed number of mismatches for primers

Quality filtering

Filter sequences based on expected errors per sequence and per base, compress and correct homopolymers.

Setting	Tooltip
`qc_maxee`	Maximum number of expected errors
`qc_maxeerate`	Maximum number of expected errors per base
`qc_maxn`	Discard sequences with more than the specified number of ambiguous nucleotides (N’s)
`qc_maxhomopolymerlen`	Threshold for the length of a homopolymer region in a sequence
`hp`	Enable or disable homopolymer correction

ITS extraction

When performing ITS metabarcoding, it may be beneficial to trim the flanking 18S and 28S rRNA genes, because:
* these conserved regions don’t offer species-level differentiation;
* random errors in these areas can disrupt sequence clustering;
* chimeric breakpoints, which are common in these regions, are hard to detect in short fragments ranging from 10 to 70 bases.

NextITS deploys the ITSx software (Bengtsson-Palme et al. 2013) for extracting the ITS sequence.

Setting	Tooltip
`its_region`	ITS part selector (ITS1, ITS2 or full)
`ITSx_tax`	Taxonomy profile for ITSx can be used to restrict the search to only taxon(s) of interest.
`ITSx_evalue`	E-value cutoff threshold for ITSx
`ITSx_partial`	Keep partial ITS sequences (specify a minimum length cutoff)

Chimera filtering

NextITS employs a two-pronged strategy to detect chimeras: de novo and reference-based chimera filtering.

A reference database for chimera filtering of full-length ITS data is accessible here. This database is based on the EUKARYOME database.

An additional step in NextITS is the “rescue” of sequences that were initially flagged as chimeric but occur in at least 2 samples (which represent independent PCR reactions) and are thus likely false-positive chimeras. The minimum occurrence required for rescue can be changed with the --chimera_rescue_occurrence parameter.

Setting	Tooltip
`chimera_database` (optional)	Database for reference based chimera removal (UDB)
`chimera_rescue_occurrence`	Minimum occurrence of an initially flagged chimeric sequence required to rescue it

Tag-jump correction

Tag-jumps, sometimes referred to as index-switches or index cross-talk, may represent a significant concern in high-throughput sequencing (HTS) data. They can cause technical cross-contamination between samples, potentially distorting estimates of community composition. Here, tag-jump events are evaluated with the UNCROSS2 algorithm (Edgar 2018) and removed.

Setting	Tooltip
`tj_f`	UNCROSS parameter f for tag-jump filtering
`tj_p`	UNCROSS parameter p for tag-jump filtering

UNOISE denoising

The UNOISE algorithm (Edgar 2016) focuses on error-correction (or denoising) of amplicon reads. Essentially, UNOISE operates on the principle that if a sequence with low abundance closely resembles another sequence with high abundance, the former is probably an error. This helps differentiate between true biological variation and sequencing errors. It’s important to note that UNOISE was initially designed and optimized for Illumina data. Because of indel errors stemming from inaccuracies in homopolymeric regions, UNOISE might not work well with data that hasn’t undergone homopolymer correction.

Setting	Tooltip
`unoise`	Enable or disable denoising with UNOISE algorithm
`unoise_alpha`	Alpha parameter for UNOISE
`unoise_minsize`	Minimum sequence abundance
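The principle behind `unoise_alpha` and `unoise_minsize` can be illustrated with a toy Python sketch. It assumes the abundance-skew rule from the published UNOISE description, in which a sequence at edit distance d from a more abundant centroid is treated as an error when the abundance ratio is at most 1 / 2^(alpha * d + 1). This is a didactic simplification, not the vsearch or NextITS implementation; the function names are hypothetical and the way error reads are added to the centroid count is a choice made here for illustration.

```python
def levenshtein(a, b):
    """Edit distance (substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def beta(d, alpha=2.0):
    """Maximum abundance skew for a sequence to count as an error."""
    return 1.0 / 2 ** (alpha * d + 1)


def denoise(abundances, alpha=2.0, minsize=8):
    """Greedy toy denoiser: returns {centroid sequence: pooled abundance}."""
    centroids = {}
    # Process sequences from most to least abundant
    ordered = sorted(abundances.items(), key=lambda kv: (-kv[1], kv[0]))
    for seq, size in ordered:
        if size < minsize:
            continue  # too rare to be considered
        for centroid in centroids:
            d = levenshtein(seq, centroid)
            if size / abundances[centroid] <= beta(d, alpha):
                centroids[centroid] += size  # treated as an error of centroid
                break
        else:
            centroids[seq] = size  # sufficiently abundant: new centroid
    return centroids


# Hypothetical unique sequences with their abundances
toy = {"ACGTACGT": 1000, "ACGTACGA": 20, "ACGTTCGA": 400, "TTTTACGT": 5}
print(denoise(toy))
```

With the default alpha of 2, a one-difference variant is absorbed only if it is at least 8 times rarer than the centroid, and a two-difference variant only if it is at least 32 times rarer.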

Clustering

NextITS supports 3 different clustering methods:
* vsearch: greedy clustering using a fixed sequence similarity threshold with VSEARCH (Rognes et al. 2016);
* swarm: clustering with a dynamic sequence similarity threshold using SWARM (Mahé et al. 2021);
* unoise: creates zero-radius OTUs (zOTUs) based on the UNOISE3 algorithm (Edgar 2016).

Setting	Tooltip
`clustering_method`	Sequence clustering method (choose from: vsearch, swarm, unoise)
`otu_id`	Sequence similarity threshold
`otu_iddef`	Sequence similarity definition (applied to UNOISE as well)
`otu_qmask`	Method to mask low-complexity sequences (applied to UNOISE as well)
`swarm_d`	SWARM clustering resolution (d)
`swarm_fastidious`	Link nearby low-abundance swarms (fastidious option)

Post-clustering with LULU

The purpose of LULU is to reduce the number of erroneous OTUs in OTU tables to achieve more realistic biodiversity metrics. By evaluating the co-occurrence patterns of OTUs among samples, LULU identifies OTUs that consistently satisfy some user-selected criteria for being errors of more abundant OTUs, and merges these OTUs.

Setting	Tooltip
`lulu`	Enable or disable post-clustering curation with lulu
`lulu_match`	Minimum similarity threshold
`lulu_ratio`	Minimum abundance ratio
`lulu_ratiotype`	Abundance ratio type - “min” or “avg”
`lulu_relcooc`	Relative co-occurrence
`lulu_maxhits`	Maximum number of hits (0 = unlimited)

OptimOTU

OptimOTU is a full metabarcoding data analysis pipeline for paired-end Illumina data (arXiv:2502.10350).

OptimOTU uses taxonomically identified reference sequences to determine optimal genetic distance thresholds for clustering ancestor taxa into groups that best match their descendant taxa (taxonomically aware OTU clustering).

Note

Compared with the other pre-compiled pipelines in PipeCraft, OptimOTU requires a lot of resources (CPU, RAM), so please allocate sufficient resources when running this pipeline. Because of the many optimized steps in the pipeline, a local run of OptimOTU takes comparatively more time.

For testing with small data, use CPU = 1; otherwise the pipeline may fail when it tries to divide the dataset into chunks.

Note

The PipeCraft2 implementation of OptimOTU (in v1.1.0) is currently restricted to Fungi (ITS3-ITS4 and g/fITS7-ITS4 amplicons); the Metazoa COI amplicon mode is a beta version and is not available in the macOS version.

Docker env built based on optimotu_targets v5.1.0 (https://github.com/brendanf/optimotu_targets/releases/tag/v5.1.0) with optimotu=0.9.3 and optimotu.pipeline=0.5.2.

Important

OptimOTU requires a specific directory structure for input data (see below). Note that if you are analysing a single sequencing run, you still need to follow the directory structure, with a single directory in “01_raw” (e.g. “Run1”, but you can name it as you want).

Required directory structure for OptimOTU

 my_dir/
 └── sequences/         # SELECT THIS FOLDER AS WORKING DIRECTORY (name here can be anything)
     └── 01_raw/
         ├── Run1/      # name here can be anything (without spaces)
         │   ├── sample1_R1.fastq.gz
         │   ├── sample1_R2.fastq.gz
         │   ├── sample2_R1.fastq.gz
         │   └── sample2_R2.fastq.gz
         ├── Run2/      # name here can be anything (without spaces)
         │   ├── sample3_R1.fastq.gz
         │   ├── sample3_R2.fastq.gz
         │   ├── sample4_R1.fastq.gz
         │   └── sample4_R2.fastq.gz
         └── Run3/      # name here can be anything (without spaces)
             ├── sample5_R1.fastq.gz
             └── sample5_R2.fastq.gz

When starting the OptimOTU pipeline in PipeCraft, the PROCESSING ... message is displayed in the upper left corner of the screen (in the place where SELECT WORKDIR was). The whole OptimOTU pipeline is executed in the background with a single R command, so the GUI gives no specific feedback on which exact process is running and which have completed.

Output files will be saved in the my_dir_for_optimotu/output directory. Intermediate files will be saved in the my_dir_for_optimotu/sequences/02_trim etc directories.

Target taxa and sequence orientation

Specify whether the target taxa are fungi or metazoa, and whether the provided sequences are expected to be in forward, reverse or mixed orientation.

“fwd” = all sequences are expected to be in 5’-3’ orientation.
“rev” = all sequences are expected to be in 3’-5’ orientation.
“mixed” = the orientation of sequences is expected to be mixed (5’-3’ and 3’-5’).
“custom” = the orientation of different files is given in a custom sample table (see custom sample table).
If sequences are “mixed” but the “fwd” setting is used, some valid sequences (or samples) will be lost.
If sequences are “fwd” but the “mixed” setting is used, the pipeline reports an ERROR.

Setting	Tooltip
`target taxa`	specify if target taxa is fungi or metazoa
`seq orientation`	specify if provided sequences are forward (fwd), reverse (rev) or mixed (mixed)

Control sequences

Two types of control sequences are supported:

spike-in sequences: sequences that are added to the samples before PCR. These sequences are expected to be present in every sample, even in most types of negative control.
positive control sequences: sequences that are added to only a few specific positive control samples. These sequences are expected to be present only in the positive control samples, and their presence in other samples indicates cross-contamination (either in the lab or through “tag-switching”).

In practice both types are treated the same by the pipeline, they are just reported separately.

The sequences should be in a fasta file. Specifying either or both type of control sequences is optional.

Setting	Tooltip
`spike in`	(optional) specify a file with spike-in sequences
`positive control`	(optional) specify a file with positive control sequence

Cut primers and trim reads

Cut primers and trim reads according to the specified parameters (using cutadapt).

Setting	Tooltip
`forward primer`	specify forward primer sequence (supports only single primer)
`reverse primer`	specify reverse primer sequence (supports only single primer)
`max error rate`	maximum allowed error rate in the primer search
`truncQ_R1`	truncate ends (3’) of R1 at first base with quality score <= N
`truncQ_R2`	truncate ends (3’) of R2 at first base with quality score <= N
`min_length`	minimum length of the trimmed sequence
`cut_R1`	remove N bases from start of R1
`cut_R2`	remove N bases from start of R2
`action`	trim = trim the primers from the reads; retain = retain the primers after they have been found
`custom_sample_table`	custom primer trimming parameters per sample can be given as columns in the sample table. See example below.

custom sample table

Example of custom primer trimming parameters per sample (tab-delimited):

seqrun	samples	fastq_R1	fastq_R2	orient
run1	sample1	sample1_R1.fq.gz	sample1_R2.fq.gz	fwd
run1	sample2	sample2_R1.fq.gz	sample2_R2.fq.gz	fwd
run2	sample3	sample3_R1.fq.gz	sample3_R2.fq.gz	rev
run2	sample4	sample4_R1.fq.gz	sample4_R2.fq.gz	rev
run3	sample5	sample5_R1.fq.gz	sample5_R2.fq.gz	mixed
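A table with these columns can be assembled by hand or generated with a small script. The Python sketch below is one possible way to do it and is not part of OptimOTU: the helper names are hypothetical, and it assumes that R1/R2 mates differ only by `_R1.` versus `_R2.` in the file name and that orientation is assigned per sequencing run.

```python
import csv
import io

COLUMNS = ["seqrun", "samples", "fastq_R1", "fastq_R2", "orient"]


def sample_table_rows(run_files, orient_by_run, default_orient="fwd"):
    """Build one row per sample from {run name: [fastq file names]}."""
    rows = []
    for run in sorted(run_files):
        files = run_files[run]
        for r1 in sorted(f for f in files if "_R1." in f):
            r2 = r1.replace("_R1.", "_R2.")
            if r2 not in files:
                raise ValueError(f"missing R2 mate for {r1} in {run}")
            rows.append({
                "seqrun": run,
                "samples": r1.split("_R1.")[0],
                "fastq_R1": r1,
                "fastq_R2": r2,
                "orient": orient_by_run.get(run, default_orient),
            })
    return rows


def to_tsv(rows):
    """Serialize rows as tab-delimited text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical file listing for two runs
files = {
    "run1": ["sample1_R1.fq.gz", "sample1_R2.fq.gz"],
    "run2": ["sample3_R1.fq.gz", "sample3_R2.fq.gz"],
}
print(to_tsv(sample_table_rows(files, {"run2": "rev"})), end="")
```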

Quality filtering

Quality filtering settings; performed using DADA2. Sequences with ambiguous nucleotides (N’s) are discarded.

Setting	Tooltip
`maxEE_R1`	discard sequences with more than the specified number of expected errors in R1 reads
`maxEE_R2`	discard sequences with more than the specified number of expected errors in R2 reads

Denoising and merging paired-end reads

There are no adjustable settings for denoising. The denoising steps are performed using the DADA2 package (Callahan et al. 2016). Error profiles are learned separately for each sequencing run, read, and orientation using the learnErrors() function. Sequences with binned quality scores, as produced by newer Illumina sequencers, are automatically detected, and the error model is adjusted accordingly. Denoising is then performed using the dada() function, and read pairs are merged using the mergePairs() function.

Chimera filtering

Chimera filtering is performed using the consensus algorithm implemented in DADA2’s isBimeraDenovoTable() function. The additional database provided in the PROTAX CLASSIFICATION step (with_outgroup file) is used for reference-based chimera filtering (vsearch --uchime_ref).

Filter tag-jumps

Filter potential cases of tag-switching with UNCROSS2 algorithm (Edgar 2018).

Setting	Tooltip
`f value`	f-parameter of UNCROSS2, which defines the expected tag-jumps rate. Default is 0.03 (equivalent to 3%). A higher value enforces stricter filtering
`p value`	p-parameter, which controls the severity of tag-jump removal. It adjusts the exponent in the UNCROSS formula. Default is 1. Opt for 0.5 or 0.3 to steepen the curve

Amplicon model setting

Setting	Tooltip
`model type`	statistical sequence model type for aligning ASVs prior to use of protaxA and/or NuMt detection and for filtering ASVs to remove spurious sequences.
`model file`	inbuilt ITS3_ITS4.cm and gITS7_ITS4.cm files are optimized for ITS3-ITS4 and gITS7-ITS4 amplicons for fungi. COI.hmm is HMM model for COI amplicons. A custom model may be supplied.
`numt filter`	filter out sequences that are likely to be NUMTs (mitochondrial coding amplicon genes)
`max model start`	maximum start position of the model (the match must start at this point in the model or earlier)
`min model end`	minimum end position of the model (the match must end at this point in the model or later)
`min model score`	minimum bit score threshold for model matches

ProTAX classification

Setting	Tooltip
`location`	directory where protax is located. The default is protaxFungi for fungi and protaxAnimal for metazoa (included in the PipeCraft2 container)
`UNITE_SHs`	additional database which contains also outgroup (non-target) sequences from the same locus. For fungi, default is UNITE_SHs, which is sh_matching_data_0_5_v9 sequences (included in the PipeCraft2 container)

Clustering

Setting	Tooltip
`cluster thresholds`	select a file with clustering thresholds. The default is the pre-calculated thresholds for Fungi from the Global Spore Sampling Project (Ovaskainen et al. 2024)

FunBarONT

FunBarONT is an automated pipeline for processing Oxford Nanopore Technologies (ONT) fungal barcoding data, specifically targeting the ITS rRNA gene region.

This pipeline processes Oxford Nanopore sequencing data through quality filtering, clustering, consensus polishing, ITS extraction, and taxonomic assignment to generate high-confidence fungal identifications.

See example data analysis tutorial here.

Note

FunBarONT requires single-end demultiplexed Oxford Nanopore data as input. The pipeline automatically handles the higher error rates typical of ONT sequencing through consensus polishing with racon and medaka.

Directory structure

Required directory structure for FunBarONT

 my_fungal_barcoding/   # SELECT THIS FOLDER AS WORKING DIRECTORY
 └── sequences/
     ├── sample1.fastq
     ├── sample2.fastq
     ├── sample3.fastq.gz
     └── ...

Input data must be demultiplexed with one fastq file per sample.

Default settings:

Analyses step	Default setting	output directory
QUALITY CONTROL (NanoPlot)	generates quality reports per sample	`01_quality_reports`
QUALITY FILTERING (chopper)	`chopper quality` = 10 `chopper min read length` = 150 `chopper max read length` = 1000	`02_filtered_sequences`
CLUSTERING (VSEARCH)	`vsearch cluster id` = 0.95 `vsearch cluster strand` = both	`03_clusters`
READ MAPPING (minimap2)	maps reads to cluster centroids (intermediate step for polishing)	no separate output (used for polishing)
SEQUENCE POLISHING (racon + medaka)	`medaka model` = r1041_e82_400bps_hac_variant_v4.3.0 `racon quality threshold` = 20 `racon window length` = 100	`04_polished_sequences`
ITS EXTRACTION (ITSx)	`use itsx` = TRUE	`05_its_extracted`
TAXONOMY ASSIGNMENT (BLAST)	`strands` = both `e value` = 10 `word size` = 11	`06_blast_results`
FINAL RESULTS	`run id` = funbaront_run `rel abu threshold` = 10 `output all polished seqs` = FALSE	`07_json_results`

Pipeline options

Setting	Tooltip
`use ITSx`	set to FALSE to skip ITS extraction (useful for non-ITS sequences)
`output all polished seqs`	output all polished sequences even those without database hits
`rel abu threshold`	output only clusters with relative abundance above this value (0-100%)
`cpu threads`	number of CPU threads to use for processing
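The effect of `rel abu threshold` can be illustrated with a short Python sketch. It is not FunBarONT code: the function name is hypothetical, and it assumes that relative abundance is the share of a sample's clustered reads belonging to each cluster, expressed in percent.

```python
def filter_by_relative_abundance(cluster_sizes, threshold_percent=10.0):
    """Keep clusters whose relative abundance exceeds the threshold.

    cluster_sizes: dict {cluster id: number of reads in the cluster}.
    Returns {cluster id: relative abundance in percent}.
    """
    total = sum(cluster_sizes.values())
    if total == 0:
        return {}
    kept = {}
    for cluster_id, n in cluster_sizes.items():
        rel_abu = 100.0 * n / total
        if rel_abu > threshold_percent:
            kept[cluster_id] = rel_abu
    return kept


# Hypothetical cluster sizes for one sample
clusters = {"c1": 800, "c2": 150, "c3": 50}
print(filter_by_relative_abundance(clusters))
```

In this toy sample, the cluster holding 5% of the reads falls below the default threshold of 10 and is not reported.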

Quality filtering (chopper)

Setting	Tooltip
`chopper quality`	minimum read quality score (Phred). Reads below this threshold are discarded
`chopper min read length`	minimum read length in bp. Shorter reads are removed
`chopper max read length`	maximum read length in bp. Longer reads are removed
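As a rough illustration of these three settings, the sketch below applies a length window and a minimum mean quality to one read. It is not chopper's code: the function names are hypothetical, and the mean quality is assumed to be computed by averaging per-base error probabilities and converting the result back to a Phred score, which is a common convention for nanopore reads.

```python
import math


def mean_phred(quals):
    """Mean read quality computed via the average error probability."""
    mean_p = sum(10 ** (-q / 10) for q in quals) / len(quals)
    return -10 * math.log10(mean_p)


def keep_read(quals, min_quality=10, min_len=150, max_len=1000):
    """Return True if the read passes the length and quality thresholds."""
    if not min_len <= len(quals) <= max_len:
        return False
    return mean_phred(quals) >= min_quality


# Hypothetical reads given as lists of per-base Phred scores
print(keep_read([20] * 200))   # within the length window, Q20
print(keep_read([20] * 100))   # shorter than 150 bp
print(keep_read([5] * 200))    # mean quality below 10
```

Averaging error probabilities rather than Phred scores means that a few low-quality bases pull the mean down more strongly than a plain arithmetic mean of the scores would.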

Sequence polishing

Setting	Tooltip
`medaka model`	medaka inference model for consensus polishing. Select based on your flowcell, kit, and basecaller model (e.g., r1041_e82_400bps_hac_variant_v4.3.0)
`racon quality threshold`	minimum average base quality for windows used by racon (default: 20)
`racon window length`	window length used by racon for polishing (default: 100)

VSEARCH clustering

Setting	Tooltip
`similarity threshold`	clustering identity threshold (0-1). Sequences above this similarity are clustered
`strands`	check both strands or plus strand only during clustering

Taxonomy assignment (BLAST)

Setting	Tooltip
`database file`	reference database file in FASTA format (e.g., UNITE database). Automatically converted to BLAST database format
`run id`	unique identifier for this analysis run. Used for naming output files
`task`	BLAST search settings according to blastn or megablast
`strands`	query strand to search against database. Both = search also reverse complement
`e value`	a parameter that describes the number of hits one can expect to see by chance when searching a database of a particular size. The lower the e-value the more ‘significant’ the match is
`word size`	the size of the initial word that must be matched between the database and the query sequence
`reward`	reward for a match
`penalty`	penalty for a mismatch
`gap open`	cost to open a gap
`gap extend`	cost to extend a gap

Output files

The pipeline produces the following output structure:

Output	Description
`<run_id>.results.xlsx`	Excel spreadsheet with all results (taxonomy, quality)
`README.txt`	summary of the pipeline run with parameters and citations
`01_quality_reports/`	NanoPlot quality reports per sample
`02_filtered_sequences/`	chopper-filtered sequences (*.chopper.fasta.gz)
`03_clusters/`	VSEARCH clustering centroids (*.centroids.fasta.gz)
`04_polished_sequences/`	racon and medaka polished sequences
`05_its_extracted/`	ITSx extracted ITS sequences (*.its.fasta)
`06_blast_results/`	BLAST taxonomy results (*.blast.tsv)
`07_json_results/`	JSON formatted results per sample