
User guide


The interface

The startup panel:

[Screenshot: the startup panel]


Glossary

List of terms that you may encounter in this user guide.

working directory

the directory (folder) that contains the files for the analyses.
The outputs will be written into this directory

paired-end data

obtained by sequencing two ends of the same DNA fragment,
which results in read 1 (R1) and read 2 (R2) files per library or per sample

single-end data

only one sequencing file per library or per sample.
Herein, it may also refer to assembled paired-end data.

demultiplexed data

sequences are sorted into separate files, representing individual samples

multiplexed data

file(s) that represent a pool of sequences from different samples

read/sequence

DNA sequence; herein, reads and sequences are used interchangeably

Docker images

The initial PipeCraft2 installation does not contain any software for sequence data processing. All processes are run through Docker, where PipeCraft2's GUI simply mediates the information exchange. Therefore, whenever a process is initiated for the first time, the relevant Docker image (containing the required software for that analysis step) will be pulled from Docker Hub.

Example: running DEMULTIPLEXING for the first time [screenshot]

Thus, a working Internet connection is initially required. Once the Docker images are pulled, PipeCraft2 can work without an Internet connection.

Docker images vary in size, and the runtime of the first process is extended by the Docker image download time.


Save workflow

Note

Starting from version 0.1.4, PipeCraft2 will automatically save the settings into the selected WORKDIR prior to starting the analyses.

Once the workflow settings are selected, save the workflow by pressing the SAVE WORKFLOW button on the right ribbon. For saving, the working directory ( SELECT WORKDIR ) does not have to be selected.

Important

When saving workflow settings in Linux, specify the file extension as JSON (e.g. my_16S_ASVs_pipe.JSON). When loading a workflow, only .JSON files will be permitted as input. Windows and Mac OS automatically append the JSON extension (so you may just save “my_16S_ASVs_pipe”).


Load workflow

Note

Prior to loading the workflow, make sure that the saved workflow configuration has a .JSON extension. Note also that workflows saved in an older PipeCraft2 version might not run in a newer version; nevertheless, the selected options will be visible for reproducibility.

Press the LOAD WORKFLOW button on the right ribbon and select the appropriate JSON file. The configuration will be loaded; SELECT WORKDIR and run the analyses.


Quality and basic statistics screening of the data

Quality and basic statistics screening of the data can be done via the QualityCheck panel, which implements FastQC and MultiQC to screen the input fastq files.



To start:

  1. Select a folder (working directory) that contains only the fastq (fastq/fq) files you aim to inspect.

  2. Press CREATE REPORT to start MultiQC

  3. “LOADING …” will be displayed while the report is being generated


  4. Click VIEW REPORT. An html file (multiqc_report.html) will open in your default web browser.

    If the summary does not open, check your working folder for the presence of multiqc_report.html and try to open it with another web browser. Something went wrong if the file multiqc_report.html does not exist (report generation may fail when the number of fastq files in the folder is extremely large, >10 000).

  5. Check out “using MultiQC reports” on the MultiQC web page.

Note

Note that ‘_fastqc.zip’ and ‘_fastqc.html’ are generated for each fastq file in the ‘quality_check’ directory. These are summarized in multiqc_report.html, so you may delete all individual ‘_fastqc.zip’ and ‘_fastqc.html’ files.
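Under the hood, the QualityCheck panel runs FastQC on each fastq file and summarizes the individual reports with MultiQC. A roughly equivalent manual run from the command line (a sketch, assuming FastQC and MultiQC are installed, and using the ‘quality_check’ output folder name mentioned above) would be:

# run FastQC on every fastq file in the working directory
mkdir -p quality_check
fastqc *.fastq --outdir quality_check

# summarize all individual FastQC reports into a single multiqc_report.html
multiqc quality_check --outdir quality_check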



Select workdir and run analyses

1. Open your working directory by pressing the SELECT WORKDIR button. E.g., if working with FASTQ files, then be sure that the working directory contains only relevant FASTQ files because the selected process will be applied to all FASTQ files in the working directory!

Note

When using Windows OS, the selection window might not display the files while browsing through the directories.

After selecting a working directory, PipeCraft needs you to specify whether the working directory contains

  • multiplexed or demultiplexed data

  • paired-end or single-end data

  • and the extension of the data (fastq or fasta)

multiplexed –> only one file (or a pair of files, R1 and R2) per sequencing data (library)
demultiplexed –> multiple per-sample sequencing files per library
paired-end data –> such as data from Illumina or MGI-Tech platforms (R1 and R2 files). Be sure to have ‘R1’ and ‘R2’ strings in the paired-end file names (not simply _1 and _2)
single-end data –> such as data from PacBio, or assembled paired-end data (single file per library or per sample)

2. Select ASV or OTU workflow panel or press ADD STEP button to select relevant step [or load the PipeCraft settings file]; edit settings if needed (SAVE the settings for later use) and start running the analyses by pressing the RUN WORKFLOW button.

Note

Step-by-step analyses: after RUN WORKFLOW is finished, press SELECT WORKDIR to specify the inputs for the next process

Note

The output files will be overwritten if running the same analysis step multiple times in the same working directory

3. Each process creates a separate output directory with the processed files inside the selected working directory. A README file about the process and sequence count summary statistics are included in the output directory.


FULL PIPELINE PANELS

ASVs workflow panel (with DADA2)

Note

The current ASV workflow supports only PAIRED-END reads! The working directory must contain paired-end reads for at least 2 samples.


ASV workflow is active (green icon); ASV workflow is off.

This automated workflow is based on the DADA2 tutorial.
Note that the demultiplexing, reorienting, and primer removal steps are optional and are not part of the DADA2 tutorial. Nevertheless, it is advisable to reorient your reads (to 5’-3’) and remove primers before proceeding with ASV generation with DADA2.
The official DADA2 manual is available here

Default options:

Analyses step

Default setting

DEMULTIPLEX (optional)

REORIENT (optional)

REMOVE PRIMERS (optional)

QUALITY FILTERING

read_R1 = \.R1
read_R2 = \.R2
samp_ID = \.
maxEE = 2
maxN = 0
minLen = 20
truncQ = 2
truncLen = 0
maxLen = 9999
minQ = 2
matchIDs = TRUE

DENOISE

pool = FALSE
selfConsist = FALSE
qualityType = Auto

MERGE PAIRED-END READS

minOverlap = 12
maxMismatch = 0
trimOverhang = FALSE
justConcatenate = FALSE

CHIMERA FILTERING

method = consensus

Filter ASV table (optional)

collapseNoMismatch = TRUE
by_length = 250
minOverlap = 20
vec = TRUE

ASSIGN TAXONOMY (optional)

minBoot = 50
tryRC = FALSE
dada2 database = select a database

QUALITY FILTERING [ASVs workflow]

DADA2 filterAndTrim function performs quality filtering on input FASTQ files based on user-selected criteria. Outputs include filtered FASTQ files located in the qualFiltered_out directory.

Quality profiles may be examined using the QualityCheck module.

Setting

Tooltip

read_R1

applies only for paired-end data.
Identifier string that is common for all R1 reads
(e.g. when all R1 files have the ‘.R1’ string, then enter ‘\.R1’.
Note that the backslash is only needed to escape the dot regex; e.g.
when all R1 files have the ‘_R1’ string, then enter ‘_R1’).

read_R2

applies only for paired-end data.
Identifier string that is common for all R2 reads
(e.g. when all R2 files have the ‘.R2’ string, then enter ‘\.R2’.
Note that the backslash is only needed to escape the dot regex; e.g.
when all R2 files have the ‘_R2’ string, then enter ‘_R2’).

samp_ID

applies only for paired-end data.
Identifier string that separates the sample name from redundant
characters (e.g. if the file name = sample1.R1.fastq, then
‘\.’ would be the ‘identifier string’ (sample name = sample1));
note that the backslash is only needed to
escape the dot regex (e.g. when the file name = sample1_R1.fastq, then specify as ‘_’)

maxEE

discard sequences with more than the specified number of expected errors

maxN

discard sequences with more than the specified number of N’s (ambiguous bases)

minLen

remove reads with length less than minLen. minLen is enforced
after all other trimming and truncation

truncQ

truncate reads at the first instance of a quality score less than or equal to truncQ

truncLen

truncate reads after truncLen bases
(applies to R1 reads when working with paired-end data).
Reads shorter than this are discarded.
Explore quality profiles (with QualityCheck module) and
see whether poor quality ends need to be truncated

truncLen_R2

applies only for paired-end data.
Truncate R2 reads after truncLen bases.
Reads shorter than this are discarded.
Explore quality profiles (with QualityCheck module) and
see whether poor quality ends need to be truncated

maxLen

remove reads with length greater than maxLen.
maxLen is enforced on the raw reads.
In dada2, the default = Inf, but here set as 9999

minQ

after truncation, reads that contain a quality score below minQ will be discarded

matchIDs

applies only for paired-end data.
If TRUE, then double-checking (with seqkit pair) that only paired reads
that share ids are outputted.
Note that ‘seqkit’ will be used for this process, because when
using e.g. SRA fastq files where original fastq headers have been
replaced, dada2 does not recognize those fastq id strings

see default settings
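For orientation, this quality-filtering step corresponds roughly to the following dada2 call, shown here as a minimal sketch run via Rscript (the file names are hypothetical and the parameter values mirror the defaults listed above):

Rscript -e '
  library(dada2)
  # hypothetical input/output files for one sample
  fnF   <- "sample1.R1.fastq"; fnR <- "sample1.R2.fastq"
  filtF <- "qualFiltered_out/sample1.R1.fastq"
  filtR <- "qualFiltered_out/sample1.R2.fastq"
  # defaults as in the table above
  filterAndTrim(fnF, filtF, fnR, filtR,
                maxEE = 2, maxN = 0, minLen = 20, truncQ = 2,
                truncLen = 0, maxLen = 9999, minQ = 2, matchIDs = TRUE)
'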


DENOISING [ASVs workflow]

DADA2 dada function to remove sequencing errors. Outputs filtered fasta files into denoised_assembled.dada2 directory.

Setting

Tooltip

pool

if TRUE, the algorithm will pool together all samples prior to sample inference.
Pooling improves the detection of rare variants, but is computationally more expensive.
If pool = ‘pseudo’, the algorithm will perform pseudo-pooling between individually
processed samples.

selfConsist

if TRUE, the algorithm will alternate between sample inference and error rate estimation
until convergence

qualityType

‘Auto’ means to attempt to auto-detect the fastq quality encoding.
This may fail for PacBio files with uniformly high quality scores,
in which case use ‘FastqQuality’

see default settings


MERGE PAIRS [ASVs workflow]

DADA2 mergePairs function to merge paired-end reads. Outputs merged fasta files into denoised_assembled.dada2 directory.

Setting

Tooltip

minOverlap

the minimum length of the overlap required for merging the forward and reverse reads

maxMismatch

the maximum mismatches allowed in the overlap region

trimOverhang

if TRUE, overhangs in the alignment between the forwards and reverse read are
trimmed off. Overhangs are when the reverse read extends past the start of
the forward read, and vice-versa, as can happen when reads are longer than the
amplicon and read into the other-direction primer region

justConcatenate

if TRUE, the forward and reverse-complemented reverse read are concatenated
rather than merged, with a NNNNNNNNNN (10 Ns) spacer inserted between them

see default settings
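A minimal sketch of the denoising and merging steps as plain dada2 calls (assuming the filtered files from the previous step; object and file names are hypothetical, and PipeCraft may organize these calls differently):

Rscript -e '
  library(dada2)
  filtF <- "qualFiltered_out/sample1.R1.fastq"
  filtR <- "qualFiltered_out/sample1.R2.fastq"
  # learn error rates, dereplicate and denoise each read file
  errF <- learnErrors(filtF); errR <- learnErrors(filtR)
  derepF <- derepFastq(filtF); derepR <- derepFastq(filtR)
  ddF <- dada(derepF, err = errF, pool = FALSE)
  ddR <- dada(derepR, err = errR, pool = FALSE)
  # merge read pairs with the defaults listed above
  merged <- mergePairs(ddF, derepF, ddR, derepR,
                       minOverlap = 12, maxMismatch = 0,
                       trimOverhang = FALSE, justConcatenate = FALSE)
'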


CHIMERA FILTERING [ASVs workflow]

DADA2 removeBimeraDenovo function to remove chimeras. Outputs filtered fasta files into chimeraFiltered_out.dada2 and final ASVs to ASVs_out.dada2 directory.

Setting

Tooltip

method

‘consensus’ - the samples are independently checked for chimeras, and a consensus
decision on each sequence variant is made.
If ‘pooled’, the samples are all pooled together for chimera identification.
If ‘per-sample’, the samples are independently checked for chimeras

see default settings


filter ASV table [ASVs workflow]

DADA2 collapseNoMismatch function to collapse identical ASVs; and ASVs filtering based on minimum accepted sequence length (custom R functions). Outputs filtered ASV table and fasta files into ASVs_out.dada2/filtered directory.

Setting

Tooltip

collapseNoMismatch

collapses ASVs that are identical up to shifts or
length variation, i.e. that have no mismatches or internal indels

by_length

discard ASVs from the ASV table that are shorter than specified
value (in base pairs). Value 0 means OFF, no filtering by length

minOverlap

collapseNoMismatch setting. Default = 20. The minimum overlap of
base pairs between ASV sequences required to collapse them together

vec

collapseNoMismatch setting. Default = TRUE. Use the vectorized
aligner. Should be turned off if sequences exceed 2kb in length

see default settings


ASSIGN TAXONOMY [ASVs workflow]

DADA2 assignTaxonomy function to classify ASVs. Outputs classified fasta files into taxonomy_out.dada2 directory.

Setting

Tooltip

minBoot

the minimum bootstrap confidence for assigning a taxonomic level

tryRC

the reverse-complement of each sequence will be used for classification
if it is a better match to the reference sequences than the forward sequence

dada2 database

select a reference database fasta file for taxonomy annotation

see default settings


OTUs workflow panel

Note

This OTU workflow works with paired-end (e.g. Illumina, MGI-Tech) as well as single-end reads (e.g. PacBio, assembled Illumina reads)


OTU workflow is active (green icon); OTU workflow is off.

This automated workflow is mostly based on vsearch (Rognes et al. 2016) [manual]
Note that the demultiplexing, reorienting and primer removal steps are optional. Nevertheless, it is advisable to reorient your reads (to 5’-3’) and remove primers before proceeding.
Default options:
click on analyses step for more info

Analyses step

Default setting

DEMULTIPLEX (optional)

REORIENT (optional)

REMOVE PRIMERS (optional)

MERGE READS

read_R1 = \.R1
min_overlap = 12
min_length = 32
allow_merge_stagger = TRUE
include only R1 = FALSE
max_diffs = 20
max_Ns = 0
max_len = 600
keep_disjoined = FALSE
fastq_qmax = 41

QUALITY FILTERING with vsearch

maxEE = 1
maxN = 0
minLen = 32
max_length = undefined
qmax = 41
qmin = 0
maxee_rate = undefined

CHIMERA FILTERING with uchime_denovo

pre_cluster = 0.98
min_unique_size = 1
denovo = TRUE
reference_based = undefined
abundance_skew = 2
min_h = 0.28

ITS Extractor (optional)

organisms = all
regions = all
partial = 50
region_for_clustering = ITS2
cluster_full_and_partial = TRUE
e_value = 1e-2
scores = 0
domains = 2
complement = TRUE
only_full = FALSE
truncate = TRUE

CLUSTERING with vsearch

OTU_type = centroid
similarity_threshold = 0.97
strands = both
remove_singletons = false
similarity_type = 2
sequence_sorting = cluster_size
centroid_type = similarity
max_hits = 1
mask = dust
dbmask = dust

ASSIGN TAXONOMY with BLAST (optional)

database_file = select a database
task = blastn
strands = both

ANALYSES PANELS

DEMULTIPLEX

If the data is multiplexed, the first step would be demultiplexing (using cutadapt (Martin 2011)). This is done based on a user-specified indexes file, which includes molecular identifier sequences (so-called indexes/tags/barcodes) per sample. Note that reverse complementary matches will also be searched.

Fastq/fasta formatted paired-end and single-end data are supported.
Outputs are fastq/fasta files per sample in demultiplexed_out directory. Indexes are truncated from the sequences.
Paired-end samples get .R1 and .R2 read identifiers.
unknown.fastq file(s) contain sequences where specified index combinations were not found.

Note

If found, sequences with any index combination will be outputted when using paired indexes. That means that if, for example, your sample_1 is indexed with indexFwd_1-indexRev_1 and sample_2 with indexFwd_2-indexRev_2, then files with indexFwd_1-indexRev_2 and indexFwd_2-indexRev_1 are also written (although the latter index combinations were not used in the lab to index any sample, i.e. they represent tag-switches). Simply remove those files if not needed, or use them to estimate the tag-switching error if relevant.

Setting

Tooltip

index file

select your fasta formatted indexes file for demultiplexing (see guide here),
where fasta headers are sample names, and sequences are sample
specific index or index combination

index mismatch

allowed mismatches during the index search

overlap

number of overlap bases with the index.
The recommended overlap is the maximum length of the index for
confident sequence assignment to samples

min seq length

minimum length of the output sequence

no indels

do not allow insertions or deletions in the index search.
Mismatches are the only type of errors accounted for in the error rate parameter

Note

Heterogeneity spacers or any redundant base pairs attached to index sequences do not affect demultiplexing. Indexes are trimmed from the best matching position.

Indexes file example (fasta formatted)

Note

Only IUPAC codes are allowed in the sequences. Avoid using ‘.’ in the sample names (e.g. instead of sample.1, use sample_1)

  1. Demultiplexing using single indexes:

>sample1
AGCTGCACCTAA
>sample2
AGCTGTCAAGCT
>sample3
AGCTTCGACAGT
>sample4
AGGCTCCATGTA
>sample5
AGGCTTACGTGT
>sample6
AGGTACGCAATT
  2. Demultiplexing using dual (paired) indexes:

Note

IMPORTANT! reverse indexes will be automatically oriented to 5’-3’ (for the search); so you can simply copy-paste the indexes from your lab protocol.

>sample1
AGCTGCACCTAA…AGCTGCACCTAA
>sample2
AGCTGTCAAGCT…AGCTGTCAAGCT
>sample3
AGCTTCGACAGT…AGCTTCGACAGT
>sample4
AGGCTCCATGTA…AGGCTCCATGTA
>sample5
AGGCTTACGTGT…AGGCTTACGTGT
>sample6
AGGTACGCAATT…AGGTACGCAATT

Note

Anchored indexes (https://cutadapt.readthedocs.io/en/stable/guide.html#anchored-5adapters) with ^ symbol are not supported in PipeCraft demultiplex GUI panel.

DO NOT USE, e.g.

>sample1
^AGCTGCACCTAA

>sample1
^AGCTGCACCTAA…AGCTGCACCTAA

How to compose indexes.fasta

In Excel (or any alternative program), the first column represents sample names and the second (and third) column represents the indexes (or index combinations) per sample:

Examples:

sample1    AGCTGCACCTAA
sample2    AGCTGTCAAGCT
sample3    AGCTTCGACAGT
sample4    AGGCTCCATGTA
sample5    AGGCTTACGTGT
sample6    AGGTACGCAATT

or

sample1    AGCTGCACCTAA    AGCTGCACCTAA
sample2    AGCTGTCAAGCT    AGCTGTCAAGCT
sample3    AGCTTCGACAGT    AGCTTCGACAGT
sample4    AGGCTCCATGTA    AGGCTCCATGTA
sample5    AGGCTTACGTGT    AGGCTTACGTGT
sample6    AGGTACGCAATT    AGGTACGCAATT

Copy those two (or three) columns to a text editor that supports regular expressions, such as NotePad++ or Sublime Text. If using PAIRED indexes (three columns), proceed to the ‘Only for paired-indexes’ steps below.

  • single-end indexes:

    1. Open ‘find & replace’: find ^ (which denotes the beginning of each line); replace with > (and DELETE THE LAST > at the beginning of the empty row).

    2. Find \t (which denotes tab). Replace with \n (which denotes the new line).

      FASTA FORMATTED (single-end indexes) indexes.fasta file is ready; SAVE the file.

  • Only for paired-indexes:

    1. Open ‘find & replace’: find ^ (denotes the beginning of each line); replace with > (and DELETE THE LAST > at the beginning of the empty row).

    2. Find .*\K\t (which captures the second tab); replace with … (to mark the linked paired-indexes).

    3. Find \t (denotes the tab); replace with \n (denotes the new line).

      FASTA FORMATTED (paired indexes) indexes.fasta file is ready; SAVE the file.
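Once indexes.fasta is ready, a single-end demultiplexing run with cutadapt is conceptually similar to the sketch below (file names and thresholds are illustrative; PipeCraft's exact command, especially for paired indexes, may differ):

# demultiplex reads by 5' index; one output file per sample
# ({name} is replaced by the fasta header, i.e. the sample name, from indexes.fasta)
# -e = allowed error rate in the index match, -O = minimum overlap with the index
cutadapt -g file:indexes.fasta -e 0.1 -O 12 --no-indels \
  --minimum-length 32 \
  --untrimmed-output unknown.fastq \
  -o "{name}.fastq" \
  multiplexed.fastq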


REORIENT

Sequences are often (if not always) in both 5’-3’ and 3’-5’ orientations in the raw sequencing data sets. If the data still contains the PCR primers that were used to generate the amplicons, then by specifying these PCR primers, this panel will perform reorientation of all sequences.

For reorienting, first the forward primer will be searched (using fqgrep), and if detected, the read is considered to be in the forward (5’-3’) orientation. Then the reverse primer will be searched (using fqgrep) in the same input data, and if detected, the read is considered to be in the reverse complementary orientation (3’-5’). The latter reads will be transformed to 5’-3’ orientation and merged with the other 5’-3’ reads. Note that for paired-end data, R1 files will be reoriented to 5’-3’ but R2 reads will be reoriented to 3’-5’ in order to merge paired-end reads.

At least one of the PCR primers must be found in the sequence. For example, a read will be kept if the forward primer was found even though the reverse primer was not (and vice versa). A sequence is discarded if neither of the PCR primers is found.

Sequences that contain multiple forward or reverse primers (multi-primer artefacts) are discarded as it is highly likely that these are chimeric sequences. Reorienting sequences will not remove primer strings from the sequences.

Note

For single-end data, sequences will be reoriented also during the ‘cut primers’ process (see below); therefore this step may be skipped when working with single-end data (such as data from PacBio machines OR already assembled paired-end data).

Reorienting reads may be relevant for generating ASVs with DADA2, as reverse complement sequences will represent separate ASVs. In the clustering step of an OTU pipeline, both strands of the sequences can be compared prior to forming OTUs; thus this step may be skipped in the OTU pipeline.

For paired-end input data, only fastq is supported; for single-end data, fasta is also supported. Outputs are fastq/fasta files in the reoriented_out directory. Primers are not truncated from the sequences; this can be done using the CUT PRIMERS panel

Setting

Tooltip

mismatches

allowed mismatches in the primer search

forward_primers

specify forward primer (5’-3’); IUPAC codes allowed;
add up to 13 primers

reverse_primers

specify reverse primer (3’-5’); IUPAC codes allowed;
add up to 13 primers

CUT PRIMERS

If the input data contains PCR primers (or e.g. adapters), these can be removed in the CUT PRIMERS panel. The CUT PRIMERS process mostly relies on cutadapt (Martin 2011).

For generating OTUs or ASVs, it is recommended to truncate the primers from the reads (unless the ITS Extractor is used later to remove the flanking primer binding regions from ITS1/ITS2/full ITS; in that case keep the primers for better detection of the 18S, 5.8S and/or 28S regions). Sequences where PCR primer strings were not detected are discarded by default (but stored in the ‘untrimmed’ directory). Reverse complementary search of the primers in the sequences is also performed. Thus, primers are clipped from both 5’-3’ and 3’-5’ oriented reads. However, note that paired-end reads will not be reoriented to 5’-3’ during this process, but single-end reads will be reoriented to 5’-3’ (thus no extra reorient step is needed for single-end data).

Note

For paired-end data, the seqs_to_keep option should be left as default (‘keep_all’). This will output sequences where at least one primer has been clipped. ‘keep_only_linked’ option outputs only sequences where both the forward and reverse primers are found (i.e. 5’-forward…reverse-3’). ‘keep_only_linked’ may be used for single-end data to keep only full-length amplicons.

Fastq/fasta formatted paired-end and single-end data are supported.
Outputs are fastq/fasta files in primersCut_out directory. Primers are truncated from the sequences.

Setting

Tooltip

forward primers

specify forward primer (5’-3’); IUPAC codes allowed;
add up to 13 primers

reverse primers

specify reverse primer (3’-5’); IUPAC codes allowed;
add up to 13 primers

mismatches

allowed mismatches in the primer search

min overlap

number of overlap bases with the primer sequence.
Partial matches are allowed, but short matches may occur by chance,
leading to erroneously clipped bases.
Specifying a higher overlap than the length of the primer sequence
will still clip the primer (e.g. if the primer length is 22 bp
but the overlap is specified as 25, this does not affect the
identification and clipping of the primer as long as the match is
in the specified mismatch error range)

seqs to keep

keep sequences where at least one primer was found (fwd or rev);
recommended when cutting primers from paired-end data (unassembled),
when individual R1 or R2 read lengths are shorter than the expected
amplicon length. ‘keep_only_linked’ = keep sequences only if primers are found
in both ends (fwd…rev); the read is discarded if both primers were not found
in this read

pair filter

applies only for paired-end data.
‘both’ means that a read pair is discarded only if both the corresponding R1 and R2
reads do not contain primer strings (i.e. a pair is kept if R1 contains a
primer string but no primer string is found in R2). Option ‘any’ discards
the read pair unless primers are found in both the R1 and R2 reads

min seq length

minimum length of the output sequence

no indels

do not allow insertions or deletions in the primer search. Mismatches are the
only type of errors accounted for in the error rate parameter
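As a point of reference, clipping primers from paired-end reads with cutadapt could look roughly like the sketch below (primer sequences and file names are placeholders; PipeCraft's exact command may differ):

# clip the forward primer from R1 and the reverse primer from R2;
# --pair-filter both + --discard-untrimmed keeps pairs where at least one
# primer was found (the 'keep_all' behaviour described above)
cutadapt -g GTGYCAGCMGCCGCGGTAA -G GGACTACNVGGGTWTCTAAT \
  -e 0.1 --overlap 15 --no-indels --minimum-length 32 \
  --pair-filter both --discard-untrimmed \
  -o primersCut_out/sample1.R1.fastq -p primersCut_out/sample1.R2.fastq \
  sample1.R1.fastq sample1.R2.fastq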


QUALITY FILTERING

Quality filter and trim sequences.

Fastq formatted paired-end and single-end data are supported.
Outputs are fastq files in qualFiltered_out directory.

vsearch

vsearch setting

Tooltip

maxEE

maximum number of expected errors per sequence (see here).
Sequences with higher error rates will be discarded

maxN

discard sequences with more than the specified number of Ns

minLen

minimum length of the filtered output sequence

max_length

discard sequences with more than the specified number of bases.
Note that if the ‘trunc length’ setting is specified, then ‘max length’
SHOULD NOT be lower than ‘trunc length’ (otherwise all reads are discarded)
[empty field = no action taken].
Note that if the ‘trunc length’ setting is specified, then ‘min length’
SHOULD BE lower than ‘trunc length’ (otherwise all reads are discarded)

qmax

specify the maximum quality score accepted when reading FASTQ files.
The default is 41, which is usual for recent Sanger/Illumina 1.8+ files.
For PacBio data use 93

trunc_length

truncate sequences to the specified length. Shorter sequences are discarded;
thus if specified, check that ‘min length’ setting is lower than ‘trunc length’
(‘min length’ therefore has basically no effect) [empty field = no action taken]

qmin

the minimum quality score accepted for FASTQ files. The default is 0, which is
usual for recent Sanger/Illumina 1.8+ files.
Older formats may use scores between -5 and 2

maxee_rate

discard sequences with more than the specified number of expected errors per base

minsize

discard sequences with an abundance lower than the specified value
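A single-sample vsearch quality-filtering command corresponding to these settings might look like the following sketch (file names are placeholders; values mirror the defaults above):

# filter one fastq file by expected errors, Ns, length and quality score range
vsearch --fastq_filter sample1.fastq \
  --fastq_maxee 1 \
  --fastq_maxns 0 \
  --fastq_minlen 32 \
  --fastq_qmax 41 \
  --fastq_qmin 0 \
  --fastqout qualFiltered_out/sample1.fastq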

trimmomatic

trimmomatic setting

Tooltip

window_size

the number of bases over which to average base qualities.
Scanning starts at the 5’-end of a sequence, and the read is trimmed once the
average quality (required_quality) within the window size falls
below the threshold

required_quality

the average quality required for selected window size

min_length

minimum length of the filtered output sequence

leading_qual_threshold

quality score threshold to remove low quality bases from the beginning of the read.
As long as a base has a value below this threshold the base is removed and
the next base will be investigated

trailing_qual_threshold

quality score threshold to remove low quality bases from the end of the read.
As long as a base has a value below this threshold the base is removed and
the next base will be investigated

phred

phred quality score encoding.
Use phred64 if working with data from older Illumina (Solexa) machines

fastp

fastp setting

Tooltip

window_size

the window size for calculating mean quality

required_qual

the mean quality requirement per sliding window (window_size)

min_qual

the quality value for a base to be considered qualified. Default 15 means
phred quality >=Q15 is qualified

min_qual_thresh

the percentage of bases that are allowed to be unqualified (0-100)

maxNs

discard sequences with more than the specified number of Ns

min_length

minimum length of the filtered output sequence. Shorter sequences are discarded

max_length

reads longer than ‘max length’ will be discarded, default 0 means no limitation

trunc_length

truncate sequences to specified length. Shorter sequences are discarded;
thus check that ‘min length’ setting is lower than ‘trunc length’

aver_qual

if one read’s average quality score <’aver_qual’, then this read/pair is discarded.
Default 0 means no requirement

low_complexity_filter

enables the low complexity filter and specifies its threshold.
The complexity is defined as the percentage of bases that differ from the
next base (base[i] != base[i+1]).
E.g. a value of 30 means that 30% complexity is required.
Not specified = filter not applied

cores

number of cores to use
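The corresponding fastp call could be sketched as follows (single-end example; file names and values are illustrative). Note that --cut_right scans the read with a sliding window from the 5' end and trims everything from the first low-quality window onwards:

fastp --in1 sample1.fastq --out1 qualFiltered_out/sample1.fastq \
  --cut_right --cut_right_window_size 5 --cut_right_mean_quality 20 \
  --qualified_quality_phred 15 --unqualified_percent_limit 40 \
  --n_base_limit 0 --length_required 32 \
  --thread 4 --html qualFiltered_out/fastp_report.html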

DADA2 (‘filterAndTrim’ function)

DADA2 setting

Tooltip

read_R1

applies only for paired-end data.
Identifier string that is common for all R1 reads
(e.g. when all R1 files have the ‘.R1’ string, then enter ‘\.R1’.
Note that the backslash is only needed to escape the dot regex; e.g.
when all R1 files have the ‘_R1’ string, then enter ‘_R1’).

read_R2

applies only for paired-end data.
Identifier string that is common for all R2 reads
(e.g. when all R2 files have the ‘.R2’ string, then enter ‘\.R2’.
Note that the backslash is only needed to escape the dot regex; e.g.
when all R2 files have the ‘_R2’ string, then enter ‘_R2’).

samp_ID

applies only for paired-end data.
Identifier string that separates the sample name from redundant
characters (e.g. if the file name = sample1.R1.fastq, then
‘\.’ would be the ‘identifier string’ (sample name = sample1));
note that the backslash is only needed to
escape the dot regex (e.g. when the file name = sample1_R1.fastq, then specify as ‘_’)

maxEE

discard sequences with more than the specified number of expected errors

maxN

discard sequences with more than the specified number of N’s (ambiguous bases)

minLen

remove reads with length less than minLen. minLen is enforced
after all other trimming and truncation

truncQ

truncate reads at the first instance of a quality score less than or equal to truncQ

truncLen

truncate reads after truncLen bases
(applies to R1 reads when working with paired-end data).
Reads shorter than this are discarded.
Explore quality profiles (with QualityCheck module) and
see whether poor quality ends need to be truncated

truncLen_R2

applies only for paired-end data.
Truncate R2 reads after truncLen bases.
Reads shorter than this are discarded.
Explore quality profiles (with QualityCheck module) and
see whether poor quality ends need to be truncated

maxLen

remove reads with length greater than maxLen.
maxLen is enforced on the raw reads.
In dada2, the default = Inf, but here set as 9999

minQ

after truncation, reads that contain a quality score below minQ will be discarded

matchIDs

applies only for paired-end data.
if TRUE, then double-checking (with seqkit pair) that only paired reads
that share ids are outputted


ASSEMBLE PAIRED-END reads

Assemble paired-end sequences (such as those from Illumina or MGI-Tech platforms).

include_only_R1 represents an additional built-in module. If TRUE, unassembled R1 reads will be included in the set of assembled reads per sample. This may be relevant when working with e.g. ITS2 sequences, because the ITS2 region in some taxa is too long for paired-end assembly using current short-read sequencing technology; therefore longer ITS2 amplicon sequences are discarded completely after the assembly process. By including also the unassembled R1 reads (include_only_R1 = TRUE), partial ITS2 sequences for these taxa will be represented in the final output. But when using ITSx, keep only_full = FALSE and include partial = 50.

Fastq formatted paired-end data is supported. Outputs are fastq files in assembled_out directory.

vsearch

Setting

Tooltip

read_R1

applies only for paired-end data. Identifier string that is common
for all R1 reads (e.g. when all R1 files have the ‘.R1’ string, then
enter ‘\.R1’. Note that the backslash is only needed to escape the dot
regex; e.g. when all R1 files have the ‘_R1’ string, then enter ‘_R1’)

min_overlap

minimum overlap between the merged reads

min_length

minimum length of the merged sequence

allow_merge_stagger

allow to merge staggered read pairs. Staggered pairs are pairs
where the 3’ end of the reverse read has an overhang to the left
of the 5’ end of the forward read. This situation can occur when a
very short fragment is sequenced

include_only_R1

include unassembled R1 reads in the set of assembled reads per sample

max_diffs

the maximum number of non-matching nucleotides allowed in the overlap region

max_Ns

discard sequences with more than the specified number of Ns

max_len

maximum length of the merged sequence

keep_disjoined

write the reads that could not be merged into separate FASTQ files

fastq_qmax

maximum quality score accepted when reading FASTQ files.
The default is 41, which is usual for recent Sanger/Illumina 1.8+ files
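For one sample, the merging step corresponds roughly to the following vsearch command (file names are placeholders; values mirror the defaults above):

# merge paired-end reads of one sample
vsearch --fastq_mergepairs sample1.R1.fastq --reverse sample1.R2.fastq \
  --fastq_minovlen 12 \
  --fastq_minmergelen 32 \
  --fastq_allowmergestagger \
  --fastq_maxdiffs 20 \
  --fastq_maxns 0 \
  --fastq_maxmergelen 600 \
  --fastq_qmax 41 \
  --fastqout assembled_out/sample1.fastq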

DADA2

Important

Here, dada2 will also perform denoising (function ‘dada’) before assembling the paired-end data. Because of that, the input sequences (in fastq format) must consist only of A/T/C/Gs.

Setting

Tooltip

read_R1

identifier string that is common for all R1 reads
(e.g. when all R1 files have the ‘.R1’ string, then enter ‘\.R1’.
Note that the backslash is only needed to escape the dot regex; e.g.
when all R1 files have the ‘_R1’ string, then enter ‘_R1’)

read_R2

identifier string that is common for all R2 reads
(e.g. when all R2 files have the ‘.R2’ string, then enter ‘\.R2’.
Note that the backslash is only needed to escape the dot regex; e.g.
when all R2 files have the ‘_R2’ string, then enter ‘_R2’)

samp_ID

identifier string that separates the sample name from redundant
characters (e.g. if the file name = sample1.R1.fastq, then
‘\.’ would be the ‘identifier string’ (sample name = sample1));
note that the backslash is only needed to escape the dot regex
(e.g. when the file name = sample1_R1.fastq, then specify as ‘_’)

minOverlap

the minimum length of the overlap required for merging the forward and
reverse reads

maxMismatch

the maximum mismatches allowed in the overlap region

trimOverhang

if TRUE, overhangs in the alignment between the forwards and reverse read are
trimmed off. Overhangs are when the reverse read extends past the start of
the forward read, and vice-versa, as can happen when reads are longer than the
amplicon and read into the other-direction primer region

justConcatenate

if TRUE, the forward and reverse-complemented reverse read are concatenated
rather than merged, with a NNNNNNNNNN (10 Ns) spacer inserted between them

pool

denoising setting. If TRUE, the algorithm will pool together all samples
prior to sample inference. Pooling improves the detection of rare variants,
but is computationally more expensive.
If pool = ‘pseudo’, the algorithm will perform pseudo-pooling between
individually processed samples.

selfConsist

denoising setting. If TRUE, the algorithm will alternate between sample
inference and error rate estimation until convergence

qualityType

‘Auto’ means to attempt to auto-detect the fastq quality encoding.
This may fail for PacBio files with uniformly high quality scores,
in which case use ‘FastqQuality’


CHIMERA FILTERING

Perform de-novo and reference database based chimera filtering.

Chimera filtering is performed with a sample-wise approach (i.e. each sample (input file) is treated separately).

Fastq/fasta formatted single-end data is supported [fastq inputs will be converted to fasta].
Outputs are fasta files in chimera_Filtered_out directory.

uchime_denovo

Perform chimera filtering with uchime_denovo and uchime_ref algorithms in vsearch

Setting

Tooltip

pre_cluster

identity percentage when performing ‘pre-clustering’ with --cluster_size
for de novo chimera filtering with --uchime_denovo

min_unique_size

minimum abundance of unique sequences in a fasta file. If value = 1, then
no sequences are discarded after dereplication; if value = 2, then sequences
that are represented only once in a given file are discarded; and so on

denovo

if TRUE, then perform de novo chimera filtering with --uchime_denovo

reference_based

perform reference database based chimera filtering with --uchime_ref.
Select a fasta formatted reference database (e.g. UNITE for ITS reads).
If denovo = TRUE, then reference based chimera filtering will be performed
after the de novo step.

abundance_skew

the abundance skew is used to distinguish in a three-way alignment which
sequence is the chimera and which are the parents. The assumption is that
chimeras appear later in the PCR amplification process and are therefore
less abundant than their parents. The default value is 2.0, which means that
the parents should be at least 2 times more abundant than their chimera.
Any positive value equal to or greater than 1.0 can be used

min_h

minimum score (h). Increasing this value tends to reduce the number of false
positives and to decrease sensitivity. Values from 0.0 to 1.0 (inclusive)
are accepted
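Conceptually, per-sample de novo chimera filtering with vsearch follows the steps sketched below (file names are placeholders and the exact commands used by PipeCraft may differ):

# 1. dereplicate and pre-cluster at 98% identity to get abundance-annotated centroids
vsearch --derep_fulllength sample1.fasta --sizeout --output sample1.derep.fasta
vsearch --cluster_size sample1.derep.fasta --id 0.98 --sizein --sizeout \
  --centroids sample1.preclustered.fasta

# 2. de novo chimera detection on the pre-clustered sequences
vsearch --uchime_denovo sample1.preclustered.fasta \
  --abskew 2 --minh 0.28 \
  --nonchimeras chimeraFiltered_out/sample1.fasta \
  --chimeras sample1.chimeras.fasta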

uchime3_denovo

Perform chimera filtering with the uchime3_denovo algorithm in vsearch.
Designed for denoised amplicons.
uchime3_denovo can also be applied in UNOISE3 clustering

Setting

Tooltip

pre_cluster

identity percentage when performing ‘pre-clustering’ with --cluster_size
for de novo chimera filtering with --uchime_denovo

min_unique_size

minimum abundance of unique sequences in a fasta file. If value = 1, then
no sequences are discarded after dereplication; if value = 2, then sequences
that are represented only once in a given file are discarded; and so on

denovo

if TRUE, then perform de novo chimera filtering with --uchime_denovo

reference_based

perform reference database based chimera filtering with --uchime_ref.
Select a fasta formatted reference database (e.g. UNITE for ITS reads).
If denovo = TRUE, then reference based chimera filtering will be performed
after the de novo step.

abundance_skew

the abundance skew is used to distinguish in a three-way alignment which
sequence is the chimera and which are the parents. The assumption is that
chimeras appear later in the PCR amplification process and are therefore
less abundant than their parents. The default value is 2.0, which means that
the parents should be at least 2 times more abundant than their chimera.
Any positive value equal to or greater than 1.0 can be used

min_h

minimum score (h). Increasing this value tends to reduce the number of false
positives and to decrease sensitivity. Values from 0.0 to 1.0 (inclusive)
are accepted


ITS Extractor

When working with ITS amplicons, extract the ITS regions with the ITS Extractor (Bengtsson-Palme et al. 2013)

Note

Note that for better detection of the 18S, 5.8S and/or 28S regions, keep the primers (i.e. do not use ‘CUT PRIMERS’)

Fastq/fasta formatted single-end data is supported [fastq inputs will be converted to fasta].
Outputs are fasta files in ITSx_out directory.

Note

To START, specify the working directory under SELECT WORKDIR and the sequence files extension; the read types (single-end or paired-end) and data format (demultiplexed or multiplexed) do not matter here (just click ‘Next’).

Setting

Tooltip

organisms

set of profiles to use for the search. Can be used to restrict the search to
only a few organism groups to save time, if one or more of the origins
are not relevant to the dataset under study

regions

ITS regions to output (note that ‘all’ will output also full ITS region [ITS1-5.8S-ITS2])

partial

if larger than 0, ITSx will save additional FASTA files for full and partial ITS sequences
longer than the specified cutoff value. If this setting is left at 0 (zero),
it means OFF

e-value

domain e-value cutoff a sequence must obtain in the HMMER-based step to be
included in the output

scores

domain score cutoff that a sequence must obtain in the HMMER-based step to
be included in the output

domains

the minimum number of domains (different HMM gene profiles) that must match
a sequence for it to be included in the output (detected as an ITS sequence).
Setting the value lower than two will increase the number of false positives,
while increasing it above two will decrease ITSx detection abilities
on fragmentary data

complement

if TRUE, ITSx checks both DNA strands for matches to HMM-profiles

only full

If TRUE, the output is limited to full-length ITS1 and ITS2 regions only

truncate

removes ends of ITS sequences if they are outside of the ITS region.
If FALSE, the whole input sequence is saved
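A typical ITSx command reflecting these settings could look like the sketch below (file names are placeholders):

# extract ITS regions; here ITS2 is saved for clustering and
# partial ITS2 sequences longer than 50 bp are kept as well
ITSx -i sample1.fasta -o ITSx_out/sample1 \
  -t all --save_regions ITS2 --partial 50 \
  -E 1e-2 --complement T --only_full F --truncate T \
  --cpu 4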


CLUSTERING

Cluster sequences, generate OTUs or zOTUs (with UNOISE3)

Supported file format for the input data is fasta.
Outputs are OTUs.fasta, OTU_table.txt and OTUs.uc files in clustering_out directory.

Note

The output OTU table is a tab-delimited text file.

vsearch

Setting

Tooltip

OTU_type

“centroid” = output centroid sequences; “consensus” = output
consensus sequences

similarity_threshold

define OTUs based on the sequence similarity threshold; 0.97 = 97%
similarity threshold

strands

when comparing sequences with the cluster seed, check both strands
(forward and reverse complementary) or the plus strand only

remove_singletons

if TRUE, then singleton OTUs will be discarded (OTUs with only one sequence)

similarity_type

pairwise sequence identity definition (--iddef)

sequence_sorting

“size” = sort the sequences by decreasing abundance;
“length” = sort the sequences by decreasing length (--cluster_fast);
“no” = do not sort sequences (--cluster_smallmem --usersort)

centroid_type

“similarity” = assign representative sequence to the closest (most similar)
centroid (distance-based greedy clustering);
“abundance” = assign representative sequence to the most abundant centroid
(abundance-based greedy clustering; --sizeorder), max_hits should be > 1

max_hits

maximum number of hits to accept before stopping the search
(should be > 1 for abundance-based selection of centroids [centroid type])

mask

mask regions in sequences using the “dust” method, or do not mask (“none”)

dbmask

prior to OTU table creation, mask regions in sequences using the
“dust” method, or do not mask (“none”)
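In essence, the clustering step dereplicates the data globally and then clusters the unique sequences at the selected similarity threshold, e.g. (a sketch with the defaults above; file names are placeholders, and the OTU table is subsequently built by mapping the reads back to the centroids):

# dereplicate all quality-filtered sequences, keeping abundance annotations
vsearch --derep_fulllength all_samples.fasta --sizein --sizeout \
  --output derep.fasta --uc derep.uc

# cluster at 97% similarity; centroids become the OTU representative sequences
vsearch --cluster_size derep.fasta --id 0.97 --iddef 2 --strand both \
  --sizein --sizeout --mask dust \
  --centroids clustering_out/OTUs.fasta --uc clustering_out/OTUs.uc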

UNOISE3, with vsearch

Setting

Tooltip

zOTUs_thresh

sequence similarity threshold for zOTU table creation;
1 = 100% similarity threshold for zOTUs

similarity_threshold

optionally cluster zOTUs to OTUs based on the sequence similarity threshold;
if id = 1, no OTU clustering will be performed

similarity_type

pairwise sequence identity definition for OTU clustering

maxaccepts

maximum number of hits to accept before stopping the search

maxrejects

maximum number of non-matching target sequences to consider before stopping the search

mask

mask regions in sequences using the “dust” method, or do not mask (“none”)

strands

when comparing sequences with the cluster seed,
check both strands (forward and reverse complementary) or the plus strand only

minsize

minimum abundance of sequences for denoising

unoise_alpha

alpha parameter for the vsearch --cluster_unoise command.
Default = 2.0

denoise_level

at which level to perform denoising; global = by pooling samples,
individual = independently for each sample
(if samples are denoised individually, reducing minsize to 4 may
be more reasonable for higher sensitivity)

remove_chimeras

perform chimera removal with the uchime3_denovo algorithm

abskew

the abundance skew of chimeric sequences in comparison with
parental sequences (by default, parents should be at least
16 times more abundant than their chimera)

cores

number of cores to use for clustering
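A minimal sketch of zOTU generation with vsearch in the global denoising mode (file names are placeholders; PipeCraft's exact commands may differ):

# dereplicate, denoise with UNOISE (minsize = minimum abundance),
# then remove chimeras with the uchime3 de novo algorithm
vsearch --derep_fulllength all_samples.fasta --sizein --sizeout --output derep.fasta
vsearch --cluster_unoise derep.fasta --minsize 8 --unoise_alpha 2.0 \
  --centroids zOTUs_raw.fasta
vsearch --uchime3_denovo zOTUs_raw.fasta --abskew 16 \
  --nonchimeras clustering_out/zOTUs.fasta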

POSTCLUSTERING

Perform OTU post-clustering. Merge co-occurring ‘daughter’ OTUs.

LULU

LULU description from the LULU repository: the purpose of LULU is to reduce the number of erroneous OTUs in OTU tables to achieve more realistic biodiversity metrics. By evaluating the co-occurrence patterns of OTUs among samples, LULU identifies OTUs that consistently satisfy some user-selected criteria for being errors of more abundant OTUs and merges these. It has been shown that curation with LULU consistently results in more realistic diversity metrics.

Additional information:
Input data is a tab-delimited OTU table (table) and OTU sequences (rep_seqs) in fasta format (see input examples below).
EXAMPLE table here (from LULU repository)
EXAMPLE fasta here (from LULU repository)

Note

To START, specify working directory under SELECT WORKDIR, but the file formats do not matter here (just click ‘Next’).

Output files in lulu_out directory:
# lulu_out_table.txt = curated table in tab delimited txt format
# lulu_out_RepSeqs.fasta = fasta file for the molecular units (OTUs or ASVs) in the curated table
# match_list.lulu = match list file that was used by LULU to merge ‘daughter’ molecular units
# discarded_units.lulu = molecular units (OTUs or ASVs) that were merged with other units based on the specified thresholds

Setting

Tooltip

table

select OTU/ASV table. If no file is selected, then PipeCraft will
look for OTU_table.txt or ASV_table.txt in the working directory.

rep_seqs

select fasta formatted sequence file containing your OTU/ASV reads.

min_ratio_type

sets whether a potential error must have lower abundance than the parent
in all samples ‘min’ (default), or if an error just needs to have lower
abundance on average ‘avg’

min_ratio

set the minimum abundance ratio between a potential error and a
potential parent to be identified as an error

min_match

specify minimum threshold of sequence similarity for considering
any OTU as an error of another

min_rel_cooccurence

minimum co-occurrence rate. Default = 0.95 (meaning that 1 in 20 samples
are allowed to have no parent presence)

match_list_soft

use either ‘blastn’ or ‘vsearch’ to generate match list for LULU.
Default is ‘vsearch’ (much faster)

vsearch_similarity_type

applies only when ‘vsearch’ is used as ‘match_list_soft’.
Pairwise sequence identity definition (--iddef)

perc_identity

percent identity cutoff for match list. Excluding pairwise comparisons
with lower sequence identity percentage than specified threshold

coverage_perc

percent query coverage per hit. Excluding pairwise comparisons with
lower sequence coverage than specified threshold

strands

query strand to search against database. Both = search also reverse complement

cores

number of cores to use for generating match list for LULU
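When ‘vsearch’ is used as match_list_soft, the match list is generated essentially as described in the LULU documentation, along the lines of the sketch below (thresholds correspond to the perc_identity and coverage_perc settings; file names are placeholders):

# all-vs-all comparison of OTU representative sequences;
# the output columns (query, target, percent identity) form the LULU match list
vsearch --usearch_global OTUs.fasta --db OTUs.fasta --self \
  --id 0.84 --iddef 1 --strand plus \
  --userout match_list.lulu --userfields query+target+id \
  --query_cov 0.8 --maxaccepts 0 --maxhits 10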

DADA2 collapse ASVs

DADA2 collapseNoMismatch function to collapse identical ASVs; and ASVs filtering based on minimum accepted sequence length (custom R functions).

To START, specify working directory under SELECT WORKDIR, but the file formats do not matter here (just click ‘Next’).

Output files in filtered_table directory:
# ASVs_table_collapsed.txt = ASV table after collapsing identical ASVs
# ASVs_collapsed.fasta = ASV sequences after collapsing identical ASVs
# ASV_table_collapsed.rds = ASV table in RDS format after collapsing identical ASVs.
If length filtering was applied (if the ‘by_length’ setting > 0) [performed after collapsing identical ASVs]:
# ASV_table_lenFilt.txt = ASV table after filtering out ASVs shorter than the specified sequence length
# ASVs_lenFilt.fasta = ASV sequences after filtering out ASVs shorter than the specified sequence length

Setting

Tooltip

DADA2 table

select the RDS file (ASV table), output from DADA2 workflow;
usually in ASVs_out.dada2/ASVs_table.denoised-merged.rds

collapseNoMismatch

collapses ASVs that are identical up to shifts or
length variation, i.e. that have no mismatches or internal indels

by_length

discard ASVs from the ASV table that are shorter than specified
value (in base pairs). Value 0 means OFF, no filtering by length

minOverlap

collapseNoMismatch setting. Default = 20. The minimum overlap of
base pairs between ASV sequences required to collapse them together

vec

collapseNoMismatch setting. Default = TRUE. Use the vectorized
aligner. Should be turned off if sequences exceed 2kb in length


ASSIGN TAXONOMY

Implemented tools for taxonomy annotation:

BLAST (Camacho et al. 2009)

BLAST search sequences against the selected database.

Important

BLAST database needs to be an unzipped fasta file in a separate folder (fasta will be automatically converted to BLAST database files). If converted BLAST database files (.ndb, .nhr, .nin, .not, .nsq, .ntf, .nto) already exist, then just SELECT one of those files as BLAST database in ‘ASSIGN TAXONOMY’ panel.

Supported file format for the input data is fasta.

Output files in the taxonomy_out directory:
# BLAST_1st_best_hit.txt = BLAST results for the 1st best hit in the used database.
# BLAST_10_best_hits.txt = BLAST results for the 10 best hits in the used database.

Note

To START, specify the working directory under SELECT WORKDIR and the sequence files extension (to look for the input OTUs/ASVs fasta file); the read types (single-end or paired-end) and data format (demultiplexed or multiplexed) do not matter here (just click ‘Next’).

Note

The BLAST values field separator is ‘+’. When pasting the taxonomy results into e.g. Excel, first set ‘+’ as the field separator to align the columns.

Setting

Tooltip

database_file

select a database file in fasta format.
Fasta format will be automatically converted to BLAST database

task

BLAST search settings according to blastn or megablast

strands

query strand to search against database. Both = search also reverse complement

e_value

a parameter that describes the number of hits one can expect to see
by chance when searching a database of a particular size.
The lower the e-value the more ‘significant’ the match is

word_size

the size of the initial word that must be matched between the database
and the query sequence

reward

reward for a match

penalty

penalty for a mismatch

gap_open

cost to open a gap

gap_extend

cost to extend a gap
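Roughly, the BLAST annotation step first converts the database fasta into BLAST format and then runs blastn against it, as in the sketch below (file names, thresholds and output columns are placeholders; PipeCraft's exact command may differ):

# build a nucleotide BLAST database from the reference fasta
makeblastdb -in database.fasta -dbtype nucl

# query the OTU/ASV sequences against the database (tabular output)
blastn -task blastn -query OTUs.fasta -db database.fasta \
  -strand both -evalue 0.001 -word_size 11 \
  -reward 2 -penalty -3 -gapopen 5 -gapextend 2 \
  -max_target_seqs 10 \
  -outfmt "6 qseqid sseqid pident length evalue" \
  -out taxonomy_out/BLAST_hits.txt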


DADA2 classifier

Classify sequences with the DADA2 RDP naive Bayesian classifier (function assignTaxonomy) against the selected database.
Supported file format for the input data is fasta.

Output files in the taxonomy_out.dada2 directory:
# taxonomy.txt = classifier results with bootstrap values.

Note

To START, specify the working directory under SELECT WORKDIR and the sequence files extension (to look for the input OTUs/ASVs fasta file); the read types (single-end or paired-end) and data format (demultiplexed or multiplexed) do not matter here (just click ‘Next’).

Setting

Tooltip

dada2_database

select a reference database fasta file for taxonomy annotation

minBoot

the minimum bootstrap confidence for assigning a taxonomic level

tryRC

the reverse-complement of each sequence will be used for classification
if it is a better match to the reference sequences than the forward sequence
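A minimal dada2 classifier call corresponding to these settings, run via Rscript (the database and file paths are placeholders):

Rscript -e '
  library(dada2)
  # read the OTU/ASV sequences into a named character vector
  seqs <- as.character(Biostrings::readDNAStringSet("OTUs.fasta"))
  # classify against the selected dada2-formatted reference database
  tax <- assignTaxonomy(seqs, "dada2_formatted_database.fasta",
                        minBoot = 50, tryRC = FALSE)
  write.table(tax, "taxonomy.txt", sep = "\t", quote = FALSE)
'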


Sequence databases

A (noncomprehensive) list of public databases available for taxonomy annotation

  • UNITE (version 8.3)

  • SILVA (version 138.1)

  • SILVA 99% (version 138.1)

  • MIDORI (version 246)

  • CO1 Classifier (version 4)

  • DADA2-formatted reference databases

  • DIAT.BARCODE database


POSTPROCESSING

Post-processing tools. See this page

Expert-mode (PipeCraft2 console)

Bioinformatic tools used by PipeCraft2 are stored on Dockerhub as Docker images. These images can be used to launch any tool with the Docker CLI to utilize the compiled software. This is especially useful on Windows OS, where the majority of the implemented modules are not natively compatible.

See list of docker images with implemented software here

Show a list of all images in your system (using e.g. Expert-mode):

docker images

Download an image if required (from Dockerhub):

docker pull pipecraft/IMAGE:TAG
docker pull pipecraft/vsearch:2.18

Delete an image

docker rmi IMAGE
docker rmi pipecraft/vsearch:2.18

Run a docker container in your working directory to access the files. Outputs will be generated in the specified working directory. Specify the working directory with the -v flag:

docker run -i --tty -v users/Tom/myFiles/:/Files pipecraft/vsearch:2.18

Once inside the container, move to the /Files directory, which represents your working directory in the container, and run the analyses:

cd Files
vsearch --help
vsearch *--whateversettings*

Exit from the container:

exit