Pre-defined pipelines
vsearch OTUs
This automated workflow is mostly based on vsearch (Rognes et al. 2016) [manual]
Analyses step | Default setting |
---|---|
DEMULTIPLEX (optional) | – |
REORIENT (optional) | – |
REMOVE PRIMERS (optional) | – |
MERGE READS | read_R1 = \.R1; min_overlap = 12; min_length = 32; allow_merge_stagger = TRUE; include only R1 = FALSE; max_diffs = 20; max_Ns = 0; max_len = 600; keep_disjoined = FALSE; fastq_qmax = 41 |
QUALITY FILTERING | maxEE = 1; maxN = 0; minLen = 32; max_length = undefined; qmax = 41; qmin = 0; maxee_rate = undefined |
CHIMERA FILTERING | pre_cluster = 0.98; min_unique_size = 1; denovo = TRUE; reference_based = undefined; abundance_skew = 2; min_h = 0.28 |
ITS Extractor (optional) | organisms = all; regions = all; partial = 50; region_for_clustering = ITS2; cluster_full_and_partial = TRUE; e_value = 1e-2; scores = 0; domains = 2; complement = TRUE; only_full = FALSE; truncate = TRUE |
CLUSTERING | OTU_type = centroid; similarity_threshold = 0.97; strands = both; remove_singletons = false; similarity_type = 2; sequence_sorting = cluster_size; centroid_type = similarity; max_hits = 1; mask = dust; dbmask = dust |
ASSIGN TAXONOMY with BLAST (optional) | database_file = select a database; task = blastn; strands = both |
DADA2 ASVs
ASVs workflow panel (with DADA2)
Note
The working directory must contain at least 2 samples for the DADA2 pipeline.
- This automated workflow is based on the DADA2 tutorial
- Note that the demultiplexing, reorienting, and primer removal steps are optional and are not part of the DADA2 tutorial. Nevertheless, it is advisable to remove primers before proceeding with ASV generation with DADA2.
Default options:
Analyses step | Default setting |
---|---|
DEMULTIPLEX (optional) | – |
REORIENT (optional) | – |
REMOVE PRIMERS (optional) | – |
QUALITY FILTERING | read_R1 = \.R1; read_R2 = \.R2; maxEE = 2; maxN = 0; minLen = 20; truncQ = 2; truncLen = 0; maxLen = 9999; minQ = 2; matchIDs = TRUE |
DENOISING | pool = FALSE; selfConsist = FALSE; qualityType = Auto |
MERGE PAIRS | minOverlap = 12; maxMismatch = 0; trimOverhang = FALSE; justConcatenate = FALSE |
CHIMERA FILTERING | method = consensus |
Filter ASV table (optional) | collapseNoMismatch = TRUE; by_length = 250; minOverlap = 20; vec = TRUE |
ASSIGN TAXONOMY (optional) | minBoot = 50; tryRC = FALSE; dada2 database = select a database |
QUALITY FILTERING [ASVs workflow]
The DADA2 filterAndTrim function performs quality filtering on input FASTQ files based on user-selected criteria. Outputs include filtered FASTQ files located in the qualFiltered_out directory.
Quality profiles may be examined using the QualityCheck module.
Setting | Tooltip |
---|---|
read_R1 | applies only for paired-end data. Identifier string that is common for all R1 reads (e.g. when all R1 files have the '.R1' string, then enter '\.R1'. Note that the backslash is only needed to escape the dot in the regex; e.g. when all R1 files have the '_R1' string, then enter '_R1') |
read_R2 | applies only for paired-end data. Identifier string that is common for all R2 reads (e.g. when all R2 files have the '.R2' string, then enter '\.R2'. Note that the backslash is only needed to escape the dot in the regex; e.g. when all R2 files have the '_R2' string, then enter '_R2') |
maxEE | discard sequences with more than the specified number of expected errors |
maxN | discard sequences with more than the specified number of N's (ambiguous bases) |
minLen | remove reads with length less than minLen. minLen is enforced after all other trimming and truncation |
truncQ | truncate reads at the first instance of a quality score less than or equal to truncQ |
truncLen | truncate reads after truncLen bases (applies to R1 reads when working with paired-end data). Reads shorter than this are discarded. Explore quality profiles (with the QualityCheck module) to see whether poor-quality ends need to be truncated |
truncLen (R2) | applies only for paired-end data. Truncate R2 reads after truncLen bases. Reads shorter than this are discarded. Explore quality profiles (with the QualityCheck module) to see whether poor-quality ends need to be truncated |
maxLen | remove reads with length greater than maxLen. maxLen is enforced on the raw reads. In dada2, the default = Inf, but here it is set to 9999 |
minQ | after truncation, reads containing a quality score below minQ will be discarded |
matchIDs | applies only for paired-end data. If TRUE, double-check (with seqkit pair) that only paired reads that share IDs are outputted. Note that 'seqkit' is used for this process because, when using e.g. SRA FASTQ files where the original FASTQ headers have been replaced, dada2 does not recognize those FASTQ ID strings |
see default settings
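The maxEE criterion can be made concrete with a small sketch. Expected errors are the sum of per-base error probabilities implied by the Phred scores, EE = Σ 10^(−Q/10). This illustrates the idea behind filterAndTrim's maxEE only; dada2 is an R package, and the helper names below (`expected_errors`, `passes_maxee`) are illustrative, not dada2 functions.

```python
def expected_errors(quals):
    """Sum of per-base error probabilities: EE = sum(10^(-Q/10))."""
    return sum(10 ** (-q / 10) for q in quals)

def passes_maxee(quals, max_ee=2.0):
    """Mimic the maxEE criterion: keep a read only if its EE <= maxEE."""
    return expected_errors(quals) <= max_ee

# A 10-base read with uniform Q20 (1% error per base) has EE ~= 0.1
print(expected_errors([20] * 10))          # ~0.1
print(passes_maxee([20] * 10, max_ee=2))   # True
```

Note that maxEE depends on read length: a long read of uniformly decent quality can still exceed the threshold, which is why truncation (truncLen/truncQ) before filtering often rescues reads.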
DENOISING [ASVs workflow]
The DADA2 dada function removes sequencing errors. Outputs filtered fasta files into the denoised_assembled.dada2 directory.
Setting | Tooltip |
---|---|
pool | if TRUE, the algorithm will pool together all samples prior to sample inference. Pooling improves the detection of rare variants but is computationally more expensive. If pool = 'pseudo', the algorithm will perform pseudo-pooling between individually processed samples |
selfConsist | if TRUE, the algorithm will alternate between sample inference and error rate estimation until convergence |
qualityType | 'Auto' means attempt to auto-detect the FASTQ quality encoding. This may fail for PacBio files with uniformly high quality scores, in which case use 'FastqQuality' |
see default settings
MERGE PAIRS [ASVs workflow]
The DADA2 mergePairs function merges paired-end reads. Outputs merged fasta files into the denoised_assembled.dada2 directory.
Setting | Tooltip |
---|---|
minOverlap | the minimum length of the overlap required for merging the forward and reverse reads |
maxMismatch | the maximum number of mismatches allowed in the overlap region |
trimOverhang | if TRUE, overhangs in the alignment between the forward and reverse reads are trimmed off. Overhangs occur when the reverse read extends past the start of the forward read, and vice versa, as can happen when reads are longer than the amplicon and read into the other-direction primer region |
justConcatenate | if TRUE, the forward and reverse-complemented reverse reads are concatenated rather than merged, with a NNNNNNNNNN (10 Ns) spacer inserted between them |
see default settings
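A minimal sketch of the merging idea (not dada2's implementation): reverse-complement the reverse read, find the best suffix/prefix overlap of at least minOverlap bases with at most maxMismatch mismatches, and either join at the overlap or, with justConcatenate, insert the 10-N spacer. The function names are illustrative.

```python
COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    return seq.translate(COMP)[::-1]

def merge_pairs(fwd, rev, min_overlap=12, max_mismatch=0, just_concatenate=False):
    """Toy mergePairs-style join of a forward read and a raw reverse read."""
    rc = revcomp(rev)
    if just_concatenate:
        # justConcatenate = TRUE: join with a 10-N spacer instead of merging
        return fwd + "N" * 10 + rc
    # try the longest overlap first, down to min_overlap
    for k in range(min(len(fwd), len(rc)), min_overlap - 1, -1):
        mismatches = sum(a != b for a, b in zip(fwd[-k:], rc[:k]))
        if mismatches <= max_mismatch:
            return fwd + rc[k:]
    return None  # no acceptable overlap found

amplicon = "ACGTTGCAGGTCCTAA"
fwd = amplicon[:12]           # forward read covers the 5' end
rev = revcomp(amplicon[4:])   # reverse read covers the 3' end, other strand
print(merge_pairs(fwd, rev, min_overlap=8))  # ACGTTGCAGGTCCTAA
```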
CHIMERA FILTERING [ASVs workflow]
The DADA2 removeBimeraDenovo function removes chimeras. Outputs filtered fasta files into the chimeraFiltered_out.dada2 directory and final ASVs into the ASVs_out.dada2 directory.
Setting | Tooltip |
---|---|
method | 'consensus' - the samples are independently checked for chimeras, and a consensus decision on each sequence variant is made. If 'pooled', the samples are all pooled together for chimera identification. If 'per-sample', the samples are independently checked for chimeras |
see default settings
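The 'consensus' decision can be sketched as follows. This is a toy illustration, not dada2's exact rule; the 0.9 fraction mirrors the default of the minSampleFraction parameter in dada2's table-based chimera check, and the function name is illustrative.

```python
def consensus_is_chimera(per_sample_flags, min_sample_fraction=0.9):
    """per_sample_flags: one boolean per sample in which the variant occurs,
    TRUE where the per-sample check flagged it as chimeric. The variant is
    removed only when a large-enough fraction of samples agree."""
    if not per_sample_flags:
        return False
    frac = sum(per_sample_flags) / len(per_sample_flags)
    return frac >= min_sample_fraction

print(consensus_is_chimera([True, True, False]))  # False: 2/3 < 0.9
print(consensus_is_chimera([True, True, True]))   # True
```

This is why 'consensus' is more conservative than 'per-sample': a variant flagged in only some samples is kept everywhere.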
filter ASV table [ASVs workflow]
DADA2 collapseNoMismatch function to collapse identical ASVs, and ASV filtering based on minimum accepted sequence length (custom R functions). Outputs the filtered ASV table and fasta files into the ASVs_out.dada2/filtered directory.
Setting | Tooltip |
---|---|
collapseNoMismatch | collapses ASVs that are identical up to shifts or length variation, i.e. that have no mismatches or internal indels |
by_length | discard ASVs from the ASV table that are shorter than the specified value (in base pairs). Value 0 means OFF, no filtering by length |
minOverlap | collapseNoMismatch setting. Default = 20. The minimum overlap of base pairs between ASV sequences required to collapse them together |
vec | collapseNoMismatch setting. Default = TRUE. Use the vectorized aligner. Should be turned off if sequences exceed 2 kb in length |
see default settings
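The by_length filter is straightforward to sketch. Assuming an ASV table represented as a `{sequence: {sample: count}}` mapping (an illustrative structure; the pipeline's custom R functions operate on dada2 sequence tables):

```python
def filter_by_length(asv_table, min_len=250):
    """Drop ASVs shorter than min_len base pairs from the table.
    min_len = 0 disables the filter ('value 0 means OFF')."""
    if min_len == 0:
        return dict(asv_table)
    return {seq: counts for seq, counts in asv_table.items() if len(seq) >= min_len}

table = {"A" * 260: {"s1": 5}, "A" * 100: {"s1": 9}}
print(len(filter_by_length(table, 250)))  # 1 -- only the 260-bp ASV remains
```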
ASSIGN TAXONOMY [ASVs workflow]
The DADA2 assignTaxonomy function classifies ASVs. Outputs classified fasta files into the taxonomy_out.dada2 directory.
Setting | Tooltip |
---|---|
minBoot | the minimum bootstrap confidence for assigning a taxonomic level |
tryRC | the reverse-complement of each sequence will be used for classification if it is a better match to the reference sequences than the forward sequence |
dada2 database | select a reference database fasta file for taxonomy annotation |
see default settings
UNOISE ASVs
UNOISE3 pipeline for making ASVs (zOTUs). Optionally, the ASVs (zOTUs) can be automatically clustered into OTUs by specifying a similarity threshold < 1. Uses the UNOISE3 and clustering algorithms in vsearch.
NextITS
NextITS is an automated pipeline for analysing full-length ITS reads obtained via PacBio sequencing.
Please see other details here: https://next-its.github.io
Important
NextITS requires your data and folders to be structured in a specific way (see below)!
The directory my_dir_for_NextITS contains an Input folder (a hard-coded requirement) and one or multiple sequencing runs. In the example below, the sequencing runs [RunID] are named Run1, Run2 and Run3 (but the naming can be different). In PipeCraft2, following the examples below, select my_dir_for_NextITS as the WORKDIR.
Single sequencing run
Select my_dir_for_NextITS as the WORKDIR in PipeCraft2. Input data for this pipeline must be demultiplexed; if your data is multiplexed, use the demultiplexer from QuickTools before running the pipeline.
Sample naming
Please avoid non-ASCII symbols in SampleID, and do not use the period symbol (.), as it represents the wildcard character in regular expressions. Also, it is preferable not to start the sample name with a number.
Multiple sequencing runs
Select my_dir_for_NextITS as the WORKDIR in PipeCraft2. Input data for this pipeline must be demultiplexed; if your data is multiplexed, use the demultiplexer from QuickTools before running the pipeline.
Sample naming
Please avoid non-ASCII symbols in RunID and SampleID, and do not use the period symbol (.), as it represents the wildcard character in regular expressions. Also, it is preferable not to start the sample name with a number.
NextITS uses the SequencingRunID__SampleID naming convention (please note the double underscore separating the RunID and SampleID parts). This naming scheme makes it easy to trace back sequences, especially if the same sample was sequenced several times and is present in multiple sequencing runs. In the later steps, extracting the SampleID part and summarizing read counts for such samples is easy.
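That extraction step can be sketched in a few lines. Splitting on the double underscore keeps any single underscores inside RunID or SampleID intact (the function names and the toy counts are illustrative):

```python
def split_run_sample(name):
    """Split 'RunID__SampleID' on the double underscore; single underscores
    inside either part are preserved."""
    run_id, sample_id = name.split("__", 1)
    return run_id, sample_id

def per_sample_totals(read_counts):
    """Sum read counts over sequencing runs for each SampleID."""
    totals = {}
    for name, n in read_counts.items():
        sample = split_run_sample(name)[1]
        totals[sample] = totals.get(sample, 0) + n
    return totals

counts = {"Run1__soil_A": 1000, "Run2__soil_A": 800, "Run1__soil_B": 500}
print(per_sample_totals(counts))  # {'soil_A': 1800, 'soil_B': 500}
```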
Default settings:
Analyses step | Default setting |
---|---|
Step-1 (per-run quality control) | primer_mismatch = 2; its_region = full; qc_maxhomopolymerlen = 25; qc_maxn = 4; ITSx_evalue = 1e-2; ITSx_partial = 0; ITSx_tax = all; chimera_rescue_occurrence = 2; tj_f = 0.01; tj_p = 1; hp = TRUE |
Step-2 (aggregation and clustering) | otu_id = 0.98; swarm_d = 1; lulu = TRUE; unoise = FALSE; otu_id_def = 2; otu_qmask = dust; swarm_fastidious = TRUE; unoise_alpha = 2; unoise_minsize = 8; max_MEEP = 0.5; max_chimera_score = 0.5; lulu_match = 95; lulu_ratio = 1; lulu_ratiotype = min; lulu_relcooc = 0.95; lulu_maxhits = 0 |
Cut primers
Setting | Tooltip |
---|---|
forward primer | Specify forward primer, IUPAC codes allowed |
reverse primer | Specify reverse primer, IUPAC codes allowed |
primer_mismatch | Specify allowed number of mismatches for primers |
Quality filtering
Filter sequences based on expected errors per sequence and per base, compress and correct homopolymers.
Setting | Tooltip |
---|---|
 | Maximum number of expected errors |
 | Maximum number of expected errors per base |
qc_maxn | Discard sequences with more than the specified number of ambiguous nucleotides (N's) |
qc_maxhomopolymerlen | Threshold for the length of a homopolymer region in a sequence |
hp | Enable or disable homopolymer correction |
ITS extraction
The conserved flanking regions are removed before clustering because:
- these conserved regions don't offer species-level differentiation;
- random errors in these areas can disrupt sequence clustering;
- chimeric breakpoints, which are common in these regions, are hard to detect in short fragments ranging from 10 to 70 bases.
NextITS deploys the ITSx software (Bengtsson-Palme et al. 2013) for extracting the ITS sequence.
Setting | Tooltip |
---|---|
its_region | ITS part selector (ITS1, ITS2 or full) |
ITSx_tax | Taxonomy profile for ITSx; can be used to restrict the search to only the taxon(s) of interest |
ITSx_evalue | E-value cutoff threshold for ITSx |
ITSx_partial | Keep partial ITS sequences (specify a minimum length cutoff) |
Chimera filtering
An additional step in NextITS is the "rescue" of sequences that were initially flagged as chimeric but occur in at least 2 samples (which represent independent PCR reactions), and are thus likely false-positive chimeras. The chimeric sequence occurrence threshold can be edited with the --chimera_rescue_occurrence parameter.
Setting | Tooltip |
---|---|
 | Database for reference-based chimera removal (UDB) |
chimera_rescue_occurrence | The minimum occurrence of an initially flagged chimeric sequence required to rescue it |
Tag-jump correction
Tag-jumps, sometimes referred to as index-switches or index cross-talk, may represent a significant concern in high-throughput sequencing (HTS) data. They can cause technical cross-contamination between samples, potentially distorting estimates of community composition. Here, tag-jump events are identified with the UNCROSS2 algorithm (Edgar 2018) and removed.
Setting | Tooltip |
---|---|
tj_f | UNCROSS parameter f for tag-jump filtering |
tj_p | UNCROSS parameter p for tag-jump filtering |
UNOISE denoising
The UNOISE algorithm (Edgar 2016 ) focuses on error-correction (or denoising) of amplicon reads. Essentially, UNOISE operates on the principle that if a sequence with low abundance closely resembles another sequence with high abundance, the former is probably an error. This helps differentiate between true biological variation and sequencing errors. It’s important to note that UNOISE was initially designed and optimized for Illumina data. Because of indel errors stemming from inaccuracies in homopolymeric regions, UNOISE might not work well with data that hasn’t undergone homopolymer correction.
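The core UNOISE decision rule can be sketched as follows, assuming the abundance-skew threshold β(d) = 1 / 2^(αd + 1) from the UNOISE2 preprint, where d is the number of differences between the two sequences and α corresponds to the unoise_alpha parameter. The function names are illustrative:

```python
def beta(d, alpha=2.0):
    """Abundance-skew threshold: beta(d) = 1 / 2^(alpha*d + 1)."""
    return 1.0 / 2 ** (alpha * d + 1)

def is_error_of(abund_low, abund_high, d, alpha=2.0):
    """A low-abundance sequence at d differences from a high-abundance one
    is treated as a sequencing error of it when its abundance skew
    (low/high) does not exceed beta(d)."""
    return abund_low / abund_high <= beta(d, alpha)

print(beta(1))                     # 0.125
print(is_error_of(10, 100, d=1))   # True: skew 0.1 <= 0.125
print(is_error_of(30, 100, d=1))   # False: too abundant to be an error
```

Larger α makes the threshold stricter per difference, so fewer low-abundance sequences are merged into their abundant neighbors.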
Setting | Tooltip |
---|---|
unoise | Enable or disable denoising with the UNOISE algorithm |
unoise_alpha | Alpha parameter for UNOISE |
unoise_minsize | Minimum sequence abundance |
Clustering
NextITS supports 3 different clustering methods:
- vsearch: greedy clustering using a fixed sequence similarity threshold with VSEARCH (Rognes et al., 2016);
- swarm: a dynamic sequence similarity threshold for clustering with SWARM (Mahé et al., 2021);
- unoise: creates zero-radius OTUs (zOTUs) based on the UNOISE3 algorithm (Edgar, 2016).
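The greedy, abundance-sorted clustering behind the vsearch method can be sketched in a toy form. Here simple positional identity stands in for real pairwise alignment (vsearch aligns sequences and supports several identity definitions, cf. otu_id_def); all names are illustrative:

```python
def similarity(a, b):
    """Fraction of identical positions (toy stand-in for alignment identity)."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_cluster(seqs_by_abundance, threshold=0.97):
    """Greedy centroid clustering: process sequences from most to least
    abundant; join the first centroid at or above the threshold, else
    found a new cluster. A sketch of the idea, not vsearch's code."""
    centroids = []  # list of (centroid_sequence, member_list)
    for seq, _abund in sorted(seqs_by_abundance.items(), key=lambda kv: -kv[1]):
        for centroid, members in centroids:
            if similarity(seq, centroid) >= threshold:
                members.append(seq)
                break
        else:
            centroids.append((seq, [seq]))
    return centroids

seqs = {"AAAA": 10, "AAAT": 5, "TTTT": 3}
print(len(greedy_cluster(seqs, threshold=0.75)))  # 2 clusters
```

Because assignment is greedy, the input order (by abundance) matters: the most abundant sequences become centroids and attract their less abundant variants.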
Setting | Tooltip |
---|---|
 | Sequence clustering method (choose from: vsearch, swarm, unoise) |
otu_id | Sequence similarity threshold |
otu_id_def | Sequence similarity definition (applied to UNOISE as well) |
otu_qmask | Method to mask low-complexity sequences (applied to UNOISE as well) |
swarm_d | SWARM clustering resolution (d) |
swarm_fastidious | Link nearby low-abundance swarms (fastidious option) |
Post-clustering with LULU
The purpose of LULU is to reduce the number of erroneous OTUs in OTU tables to achieve more realistic biodiversity metrics. By evaluating the co-occurrence patterns of OTUs among samples, LULU identifies OTUs that consistently satisfy user-selected criteria for being errors of more abundant OTUs, and merges them.
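The merge criteria can be made concrete with a toy check (a sketch of the idea, not the lulu package): a candidate "child" OTU is merged into a more abundant "parent" when they are similar enough (lulu_match), the parent co-occurs in most samples where the child is present (lulu_relcooc), and the parent is sufficiently more abundant there (lulu_ratio, using lulu_ratiotype = "min"). The function name and toy tables are illustrative:

```python
def should_merge(child_counts, parent_counts, similarity_pct,
                 lulu_match=95, lulu_ratio=1, lulu_relcooc=0.95):
    """Decide whether a child OTU should be merged into a parent OTU.
    child_counts/parent_counts map sample -> read count."""
    if similarity_pct < lulu_match:
        return False
    child_samples = [s for s, n in child_counts.items() if n > 0]
    if not child_samples:
        return False
    # relative co-occurrence: fraction of the child's samples with the parent
    cooc = sum(parent_counts.get(s, 0) > 0 for s in child_samples) / len(child_samples)
    if cooc < lulu_relcooc:
        return False
    # minimum parent/child abundance ratio (lulu_ratiotype = "min")
    ratios = [parent_counts[s] / child_counts[s]
              for s in child_samples if parent_counts.get(s, 0) > 0]
    return min(ratios) >= lulu_ratio

child = {"s1": 10, "s2": 5}
parent = {"s1": 100, "s2": 40}
print(should_merge(child, parent, similarity_pct=97))  # True
```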
Setting | Tooltip |
---|---|
lulu | Enable or disable post-clustering curation with LULU |
lulu_match | Minimum similarity threshold |
lulu_ratio | Minimum abundance ratio |
lulu_ratiotype | Abundance ratio type - "min" or "avg" |
lulu_relcooc | Relative co-occurrence |
lulu_maxhits | Maximum number of hits (0 = unlimited) |