PipeCraft2 Logo
  • Installation
  • QuickStart
  • Pre-compiled pipelines
  • Individual steps (Quick Tools)
  • Post-processing tools
  • Example data analyses
    • Inspect quality profiles
    • DADA2 ASVs pipeline, 16S
    • DADA2 ASVs pipeline, COI
    • DADA2 ASVs pipeline, ITS2
      • Starting point
      • Workflow mode
      • Cut primers
      • Quality filtering
        • Denoise and merge pairs
      • Chimera filtering
      • Curate ASV table
      • Save workflow
      • Start the workflow
      • Outputs of the DADA2 ASVs workflow
      • Extract ITS2
        • Input data
        • Outputs of ITSx
        • ITS2 ASVs
      • Cluster ASVs into OTUs
      • LULU post-clustering
      • Assign taxonomy
    • vsearch OTUs pipeline, ITS2
    • OptimOTU pipeline, ITS2
    • NextITS pipeline, full-length ITS
    • FunBarONT pipeline, ITS
  • Troubleshooting
  • Licence
  • How to cite
  • Releases
  • Docker images
  • Contact and Acknowledgements
  • For Developers
PipeCraft2
  • Example data analyses
  • DADA2 ASVs pipeline, ITS2
  • View page source

DADA2 ASVs pipeline, ITS2 PipeCraft2_logo

This example data analyses follows vsearch OTUs workflow as implemented in PipeCraft2’s pre-compiled pipelines panel.

Download example data set here (15.1 Mb) and unzip it.
This is ITS2 Illumina MiSeq dataset.

For this example, we are using EUKARYOME database in the taxonomy annotation process (SINTAX). Download the SINTAX_EUK_ITS file from here and unzip it (note: use 7-Zip software for unzipping files in Windows).


Starting point

This example dataset consists of ITS2 rRNA gene amplicon sequences; targeting fungi:

  • paired-end Illumina MiSeq data;

  • demultiplexed set (per-sample fastq files);

  • primers are not removed;

  • sequences in this set are 5’-3’ oriented.

when working with your own data …

… then please check that the paired-end data file names contain R1 and R2 strings (not just _1 and _2), so that PipeCraft can correctly identify the paired-end reads.

Example:
sample1_R1.fastq.gz
sample1_R2.fastq.gz

At least 2 samples (2x R1 + 2x R2 files) are required for this workflow! Otherwise ERROR in the denoising step:

DADA2_2samples_needed


To select DADA2 pipeline, press
SELECT PIPELINE –> DADA" ASVs.

DADA2_select_pipeline

To select input data, press SELECT WORKDIR
and specify
sequence files extension as *.fastq.gz;
sequencing read types as paired-end.

Workflow mode

Because we are working with sequences that are 5’-3’ oriented, we are selecting hte PAIRED-END FORWARD mode of the pipeline.

DADA2_PE_FWD

if sequences are in mixed orientation

If some sequences in your library are in 5’-3’ and some as 3’-5’ orientation, then with the ‘PAIRED-END FORWARD’ mode exactly the same ASV may be reported twice, where one ASV is just the reverse complementary of another. To avoid that, select PAIRED-END MIXED mode. Sequences have mixed orientation in libraries where sequenceing adapters have been ligated, rather than attached to amplicons during PCR.

Specifying primers (for CUT PRIMERS) is mandatory for the PAIRED-END MIXED mode. Based on the priemr sequences, the library will be split into two: 1) fwd oriented sequences, and 2) rev oriented sequences. Both batches are processed independently to produce ASVs, after which the rev oriented batch ASVs are reverse complemented and merged with the fwd oriented ASVs. Identical ASVs are merged to form a final data set. This is a reccomended workflow for accurate denoising compared with first reorienting all sequences to 5’-3’, and then performing a standard ‘PAIRED-END FORWARD’ workflow.


Cut primers

The example dataset contains primer sequences. Generally, we need to remove these to proceed the analyses only with the variable metabarcode of interest. If there are some additional sequence fragments, from eg. sequencing adapters or poly-G tails, then clipping the primers will remove those fragments as well.

Tick the box for CUT PRIMERS and specify forward and reverse primers. For the example data, the forward primer is GTGARTCATCGAATCTTTG (fITS7) and reverse primer is TCCTCCGCTTATTGATATGC (ITS4).

cut_primers_expand_example

Forward primer has 19 bp and reverse 20 bp - to keep a bit of flexibility in the primer search, we are requesting the min overlap of 18 bp and are allowing maximum of 2 mismatches . Note that too low min overlap may lead to random matches. Check other CUT PRIMER options here.

Note

You may specify add up to 13 primer pairs.

A consideration when working with ITS sequences and plan to use ITS Extractor

Since fITS7 and ITS4 primer binding sites are >50 bp from ITS2 region from the 5.8S side and >40 bp from the 28S side, we can clip the primers in order to safely use ITS Extractor to remove the flanking regions from ITS2 reads.

However, when the primer binding sites are very close to the ITS region (< 25 bp), then you may want to keep the primers for better detection of the 18S, 5.8S and 28S regions.


Quality filtering

Quality filtering removes sequences, which do not meet the specified quality threshold.

Click on QUALITY FILTERING to expand the panel

COI_ex_qFilt

Here, we can leave the settings as DEFAULT by discarding sequences with maximum error rate of >2 and with ambiguous bases of >0. But based on your own data quality profile, you may need to adjust the settings. See also remove low-quality ends/starts of reads section.

Output directory output_icon qualFiltered_out

*.fq.gz

quality filtered sequences per sample in FASTQ format

*.rds

R objects for the following DADA2 workflow processes

seq_count_summary.csv

summary of sequence counts per sample


Denoise and merge pairs

This step performs desiosing (as implemented in DADA2), which first forms ASVs per R1 and R2 files. Then during merging/assembling process the paired ASV mates are assembled to output full amplicon length ASV.

DADA2_denoise_expand

Here, we are working with Illumina data, so let’s make sure that the errorEstFun setting is loessErrfun. We can leave all settings as DEFAULT.

Output directory output_icon denoised_assembled.dada2

*.fasta

denoised and assembled ASVs per sample in FASTA format

*.rds

R objects for the following DADA2 workflow processes

Error_rates_R*.pdf

plots for estimated R1/R2 error rates

seq_count_summary.csv

summary of sequence counts per sample


Chimera filtering

This step performs chimera filtering according to DADA2 removeBimeraDenovo function. During this step, the ASV table is also generated.

Important

make sure that primers have been removed from your amplicons; otherwise many false-positive chimeras may be filtered out from your dataset.

Here, we filter chimeras using the consensus method.

Output directory output_icon chimeraFiltered_out.dada2

*.fasta

chimera filtered ASVs per sample

seq_count_summary.csv

summary of sequence counts per sample

‘chimeras’ dir

ASVs per sample identified as chimeras

Output directory output_icon ASVs_out.dada2

ASVs_table.txt

denoised and chimera filtered ASV-by-sample table

ASVs.fasta

corresponding FASTA formated ASV Sequences

ASVs per sample identified as chimeras

rds formatted denoised and chimera filtered ASV table


Curate ASV table

This process first removes putative tag jumps and then collapses the ASVs that are identical up to shifts or length variation, i.e. ASVs that have no internal mismatches (PipeCraft2 uses vsearch usearch_global –id 1 for that); and finally filters out ASVs that are shorter/longer than specified length (in base pairs).

Here, we are enabling this process by checking the box for CURATE ASV TABLE in the DADA2 ASV workflow panel.

DADA2_filter_table_expand_ITS

The f_value and p_value settings are used to filter out putative tag jumps (using UNCROSS2 algorithm). Generally, we recommend to use p_value of 1 (default), and f_value of 0.03 when using combinational indexing strategy; f_value of 0.05 when using single-indexes, and f_value of 0.01 when using unique dual-indexes.

We are setting the collapseNoMismatch to TRUE, to collapse identical ASVs. This is basically equivalent to 100% clustering by ignoring the end gaps.

Since the ITS2 amplicon length is highly variable, we are keeping min length and max length settings as default. max length is set to 0, meaning no filtering by maximum sequence length.

Output directory output_icon ASVs_out.dada2/curated

ASVs_table_TagJumpFilt.txt

only tag-jump filtered ASV-by-sample table

ASVs.fasta

corresponding ASV Sequences with ASVs_table_TagJumpFilt.txt table

ASVs_collapsed.fasta

tag-jump filtered and collapsed and size filtered
ASV Sequences. Present only if some ASVs were collapsed.
ASVs_table_collapsed.txt

corresponding ASV-by-sample table.
Present only if some ASVs were collapsed.

TagJump_stats.txt

summary of tag-jump filtering results

If there is nothing to collapse or filter out based on the length

then there are no corresponding files in the ASVs_out.dada2/curated directory, and only ASVs_table_TagJumpFilt.txt and ASVs.fasta files will be generated (even when there is nothing to tag-jump filter - in which case ASVs_table_TagJumpFilt.txt is the same ASVs_table.txt in the ASVs_out.dada2 directory).

Note

The pre-compiled pipeline ends here. Outputs ITS2 ASVs can be further filtered to remove flanking gene fragments from ITS2 sequences (via ITS Extractor), and optionally ASVs can be clustered into OTUs. See below.


Save workflow

Once we have decided about the settings in our workflow, we can save the configuration file by pressing save workflow button on the right-ribbon.

save

If you forget the save, then no worries, a pipecraft2_last_run_configuration.json file will be generated for you upon starting the workflow. As the file name says, it is the workflow configuration file for your last PipeCraft run in this working directory. If the file name (pipecraft2_last_run_configuration.json) is not changed, then the file is overwritten with the new configuration if running a new job in the same working directory.

This JSON file can be loaded into PipeCraft2 to automatically configure your next runs exactly the same way.


Start the workflow

Press START on the left ribbon to start the analyses.

when running the module for the first time …

… a docker image will be first pulled to start the process.

For example: pulling_image

When you need to STOP the workflow, press STOP button stop_workflow

When the workflow has completed …

… a message window will be displayed.

workflow_finished



Outputs of the DADA2 ASVs workflow

Several process-specific output folders are generated output_icon

primersCut_out

paired-end fastq files per sample, primers clipped

qualFiltered_out

quality filtered paired-end fastq files per sample

denoised_assembled.dada2

denoised and assembled fasta files per sample

chimeraFiltered_out.dada2

chimera filtered fasta files per sample

ASVs_out.dada2

ASVs table, and ASV sequences files

ASVs_out.dada2/curated

curated ASVs table, and ASV sequences files

Each folder (except ASVs_out.dada2) contain summary of the sequence counts (seq_count_summary.csv). Examine those to track the read counts throughout the pipeline.

For example, from the seq_count_summary.txt file in qualFiltered_out we see that all sequences passed the quality filtering step for the first two samples, while most of the sequences were discarded from the last three samples (note that this is an example dataset, with intentionally lower quality sequences in the last three samples).

input

qualFiltered

sample1

22781

22781

sample2

13748

13748

sample3

11639

4383

sample4

11476

401

sample5

9431

55


Final outputs of the pipeline

Here, we applied also “CURATE ASV TABLE” process. Therefore, our final outputs of the pipeline are in the ASVs_out.dada2/curated directory, which contans:

Output directory output_icon ASVs_out.dada2/curated

ASVs_table_TagJumpFilt.txt

tag-jump filtered ASV-by-sample table

ASVs.fasta

corresponding ASV Sequences with ASVs_table_TagJumpFilt.txt table

TagJump_stats.txt

summary of tag-jump filtering results

Let’s check the README.txt file in the ASVs_out.dada2/curated directory:

there, we see that:
  • Number of Features = 21

  • Number of sequences in the Feature table = 40832

  • Number of samples in the Feature table = 4

Note

Note that even though we applied the collapseNoMismatch setting, no ASVs were collapsed as stated in the README.txt file: “Output has the same number of Features (ASVs/OTUs) as input. No new files outputted.” Therefore, no ASVs_table_collapsed.txt and ASVs_collapsed.fasta files were generated.

Our final ASV table from this pipeline in ASVs_table_TagJumpFilt.txt, which represents the ASV table after the tag-jump filtering, where the 1st column represents ASV identifiers (sha1 encoded), 2nd column is the sequence of an ASV, and all the following columns represent number of sequences in the corresponding samples (sample name is taken from the file name). This is tab delimited text file.

ASVs_table_TagJumpFilt.txt; first 4 ASVs:

OTU

Sequence

sample1

sample2

sample3

sample4

d35ef6…

AACGCA…

3824

4151

840

0

b260a9…

AACGCA…

3379

2167

783

0

52f069…

AACGCA…

2889

1393

407

0

25989c…

AACGCA…

2067

750

404

0

Note: even though the ASVs column header is “OTU”, it represents ASVs as we preformed an ASVs workflow!

Tag-jump filtering does not remove ASVs. Instead, it adjusts the ASV table by removing (setting to zero) low-abundance occurrences of an ASV in specific samples where they are likely due to tag-jumps. Therefore, the ASVs_out.dada2/curated/ASVs.fasta file is the same as ASVs_out.dada2/ASVs.fasta file.

When checking the TagJump_stats.txt file in the ASVs_out.dada2/curated directory, we see that based on our settings, 13 tag-jump events were detected, and total of 316 reads were removed as putative tag-jumps.

You can track the tag-jump filtering by comparing the ASVs_out.dada2/ASVs_table.txt and ASVs_out.dada2/curated/ASVs_table_TagJumpFilt.txt files.

ASVs_out.dada2/ASVs_table.txt

ASV table produced by DADA2

ASVs_out.dada2/curated/ASVs_table_TagJumpFilt.txt

ASV table after tag-jump filtering

First thing, that we notice is that ASVs_out.dada2/ASVs_table.txt has 5 samples, but ASVs_out.dada2/curated/ASVs_table_TagJumpFilt.txt has only 4 samples. This is because tag-jump filtering removed the only ASV d35ef67e34f96f725d342aedac2bc2a69cbf078a from sample5, that was represented by 15 sequences. Based on the sequence distribution of that ASV, it was considered as a tag-jump, thus removed from sample5.

Tag-jump filtering can be relaxed by decreasing the f_value to 0.01. Generally, we recommend to use p_value of 1 (default), and f_value of 0.03 when using combinational indexing strategy; f_value of 0.05 when using single-indexes, and f_value of 0.01 when using unique dual-indexes.


Extract ITS2

Here, in this example dataset, we are working with ITS2 amplicons, and since fungi may have multiple different ITS copies per genome and may exhibit size polymorphism, then we aim to cluster the ASVs into OTUs.

But for that, first we want to remove the flanking gene fragments from ITS2 sequences that may affect clustering. For that, we are using ITSx (QuickTools --> ITS Extractor).

ITSx_quicktools

Since we are interesed only in fungi, we can limit the organisms to only fungi and keep the regions as ITS2. Universal ITS2 primers amplify also other eukaryotes besides fungi, thus, this step helps also to discard off-target sequences when limiting the organisms to only fungi. However, note that this increase in specificity may lead to some decrease in sensitivity; i.e., discarding some Fungal OTUs. For real-world applications, you may set organism to “all”, and discard off-target OTUs after the taxonomy assignment step.

Note

If you are working with only 5’-3’ oriented amplicons, then turn off complement setting under TOGGLE ADVANCE OPTIONS to skip the reverse complementary search; and possibly add more cores to speed things up (see here).

The partial setting is set to 50, which means that if at least one of the 5.8S or 28S motif is found in the sequence and that sequence is at least 50 bp long (after cutting the motif), then it will be classified as partial ITS2 sequence and outputted in the ITS2/full_and_partial directory.

Note

For better detection of the 18S, 5.8S and/or 28S regions by ITSx, you may not want to CUT PRIMERS in your own dataset. With this example dataset, ITSx works fine even when primers were clipped.

Input data

Via SELECT WORKDIR button, we specify the ASVs_out.dada2/curated directory as a working directory. The output directory, ITSx_out, will be created there (ASVs_out.dada2/curated/ITSx_out).

Automatically detected Sequence files extension may be .txt, but we need to set it to .fasta, so ITSx can work with the FASTA files in the working directory. The Sequencing read types do not matter here, just click ‘Confirm’.

Before starting ITSx

Note that all fasta files in the folder will be subjected to ITSx, so include to the WORKING DIRECTORY only ONE FASTA FILE within this workflow ASVs.fasta or ASVs_collapsed.fasta if there were any collapses with your own data.

The WORKING DIRECTORY can contain multiple FASTA files when the aim is to extract ITS2 sequences from all of them.

Outputs of ITSx

Output directory output_icon ITSx_out

ITS2/*.fasta

ITS2 sequences (without flanking gene fragments) per sample

ITS2/full_and_partial/*.fasta

full, but also partial ITS2 sequences per sample

seq_count_summary.txt

summary of sequence counts per sample

Checking the ITSx_out/ITS2/seq_count_summary.txt file, we see that ITS2 region was identified in all 21 ASVs.

File

Reads_in

Reads_out

ASVs.fasta

21

21

For the demonstration purposes, let’s examine the lenghts of the ASV sequences before and after ITSx:

Easy way to do this is to use seqkit stats tool (QuickTools --> Utilities).

seqkit stats output (seqkit_stats.fasta.txt):

File

format

type

num_seqs

sum_len

min_len

avg_len

max_len

ASVs.fasta

FASTA

DNA

21

6614

272

315.0

366

ASVs.ITS2.fasta

FASTA

DNA

21

4619

177

220.0

271

The sequences in the ASVs.fasta still contain the flanking 5.8S and LSU regions, while the sequences in the ASVs.ITS2.fasta only contain the ITS2 region. Therefore, the avg_len of the ASVs.ITS2.fasta is shorter than the avg_len of the ASVs.fasta.

The spread between min_len and max_len reflects true biological length variation of ITS2 among taxa.

Working with only ITS2 reads (without flanking gene fragments) may be useful as it standardizes what part of the rDNA amplicon you cluster (so clustering is driven by the ITS barcode region).

ITS2 ASVs

We can further proceed to cluster ASVs into OTUs, but if the aim is to work with ITS2 ASVs, then we need to match the ASV table to the ITS2 ASVs. In this example, ITSx did not remove any ASVs, therefore, the ASV table to work with is ASVs_table_TagJumpFilt.txt file in the ASVs_out.dada2/curated directory.

In case ITSx removed some ASVs, then you can use ASV TO OTU tool (QuickTools --> Postprocessing --> ASV TO OTU) with similarity threshold of 1 to match the input ASV table to the ITS2 ASVs. ASV TO OTU tool allows the ASVs in the input ASV fasta file to be a subset of the input ASV table.

Applying ASV TO OTU module with similarity threshold of 1 is the same thing as collapseNoMismatch = TRUE in the CURATE ASV TABLE process. It collapses identical ASVs by ignoring the end gaps, but if there are no identical ASVs, then it just matches the input ASV table to the ITS2 ASVs.


Cluster ASVs into OTUs

The ASVs approach may not accurately reflect species composition in the community of as fungi may have multiple different ITS copies per genome and may exhibit size polymorphism. Thus, here we aim to cluster the ASVs into OTUs (QuickTools --> Postprocessing --> ASV TO OTU).

ASVs_to_OTUs_ITS

Input fasta file is in the ITSx_out/ITS2/ASVs.ITS2.fasta and input ASV table is in the ASVs_out.dada2/curated/ASVs_table_TagJumpFilt.txt directory.

As a similarity threshold we are using the commonly used 97% (0.97, default setting).

Note

2nd column of the input ASV table must be ‘Sequences’ (1st column is ASV IDs; this is the default PipeCraft2 output table, so you don’t need to worry about this if you have followed this pipeline). For clustering, the ASV size annotation is obtained from the ASV table.

If the ASV table does not contain ‘Sequence’ column, then add those with QuickTools -> Utilities -> Add sequences to table see here.

Via SELECT WORKDIR button, we specify the ASVs_out.dada2/curated directory as a working directory (Sequence files extension and Sequencing read types do not matter here, just click ‘Confirm’).

The output directory, ASVs2OTUs_out, will be created there (ASVs_out.dada2/curated/ASVs2OTUs_out).

Outputs in ASVs2OTUs_out directory:

OTUs.fasta

FASTA formated representative OTU sequences

OTU_table.txt

OTU table (tab delimited file)

OTUs.uc

uclust-like formatted clustering results


LULU post-clustering

Additionally, we can perform LULU post-clustering to merge co-occurring ‘daughter’ OTUs.

LULU description from the LULU repository: the purpose of LULU is to reduce the number of erroneous OTUs in OTU tables to achieve more realistic biodiversity metrics. By evaluating the co-occurence patterns of OTUs among samples LULU identifies OTUs that consistently satisfy some user selected criteria for being errors of more abundant OTUs and merges these. It has been shown that curation with LULU consistently result in more realistic diversity metrics.

Here, we are performing LULU post-clustering via QuickTools panel (on the right ribbon).

select_LULU

The input data are OTU_table.txt and OTUs.fasta files in the ASVs2OTUs_out directory. Here, we are using the default settings (which are suitable for most cases), but feel free to experiment with various settings to see the effect on the results.

LULU

To START

To START, specify working directory under SELECT WORKDIR (outputs will be written here), but the following requests about Sequence files extension and Sequencing read types do not matter here, just click ‘Confirm’.

The outputs of the process are in the lulu_out directory. But if we examine the README.txt file in that directory, then we see that “Total of 0 Features (OTUs/ASVs) were merged”, and therefore we do not have any OTU table or fasta file on the lulu_out folder. Note that this is a small example dataset, but with larger datasets postclustering merges many ‘daughter’ OTUs into ‘parent’ OTUs.

Thus, with this example dataset, our final OTU table and fasta file are OTU_table.txt and OTUs.fasta files in the ASVs2OTUs_out directory.

See this example of applying LULU post-clustering where LULU merged some OTUs.


Assign taxonomy

Assign taxonomy is not the part of the full per-defined pipeline, but can be run as a separate step in QuickTools.

Here, we are using SINTAX. See other assign taxonomy options here.

select_SINTAX_classifier

We need to specify the location of the reference DATABASE for the taxonomic classification of our OTUs. For this example, we are using EUKARYOME database, which is a comprehensive database of eukaryotic ITS sequences; thus containing also other eukaryotic sequences besides Fungi. “Outgroups” are important in the reference database in order not to overclassify non-Fungal sequences as Fungi. Download the SINTAX_EUK_ITS file from here and unzip it (note: use 7-Zip software for unzipping files in Windows).

See other databases available for taxonomy annotation here.

select_SINTAX_classifier_ITS

Specify the location of your downloaded database and also the fasta file with ASVs (fasta file) to be classified. Herein, we use OTUs.fasta file in the ASVs2OTUs_out directory (since we applied also ASV TO OTU process).

Important

Make sure you do not have any other SINTAX database files is the same directory as the database you are using. That is, use dedicated directory for the SINTAX database.

To START

To START, specify working directory under SELECT WORKDIR (outputs will be written here), but the following requests about Sequence files extension and Sequencing read types do not matter here, just click ‘Confirm’.

Note

First time usage of the fasta formatted database requires conversion to the SINTAX database format (.udb). This conversion is performed automatically by PipeCraft2, and will take some time, depending on the size of the database.

Result from the taxonomy annotation process - taxonomy table (taxonomy.sintax.txt) - is located at the taxonomy_out.sintax directory.

Output directory

taxonomy_out.sintax

taxonomy.sintax.txt

classifier results with bootstrap values

Taxonomy results for the first 4 OTUs

52f069d5552dbeaa8fe7dc379bd33c111bdc958f

d:Fungi(1.00),p:Basidiomycota(1.00),c:Agaricomycetes(1.00),o:Russulales(1.00),f:Albatrellaceae(1.00),g:Byssoporia(1.00),s:terrestris(0.53)

+

d:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Russulales,f:Albatrellaceae,g:Byssoporia

99a50398b16a99aa01a932e3cfbf072c1313e8bd

d:Fungi(1.00),p:Basidiomycota(1.00),c:Agaricomycetes(1.00),o:Russulales(1.00),f:Russulaceae(1.00),g:Russula(1.00),s:chloroides(0.85)

+

d:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Russulales,f:Russulaceae,g:Russula,s:chloroides

ed8775e608760e3c55fd15cf33eeed6690db89a2

d:Fungi(1.00),p:Ascomycota(1.00),c:Pezizomycetes(1.00),o:Pezizales(1.00),f:Pyronemataceae(1.00),g:Humaria(1.00),s:hemisphaerica(0.70)

+

d:Fungi,p:Ascomycota,c:Pezizomycetes,o:Pezizales,f:Pyronemataceae,g:Humaria

12a6e094e33bc2c9cf286bfde56fc73b90c26847

d:Fungi(1.00),p:Basidiomycota(1.00),c:Agaricomycetes(1.00),o:Russulales(1.00),f:Russulaceae(1.00),g:Lactarius(Fungi)(1.00)

+

d:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Russulales,f:Russulaceae,g:Lactarius(Fungi)

taxonomy.sintax.txt is a tab-delimited text file without the initial header row:

  • 1st column: OTU identifier (sha1 encoded)

  • 2nd column: SINTAX classification result with bootstrap values in parentheses

  • 3rd column: “+” sign

  • 4th column: SINTAX classification result when considering the cutoff value (minimum bootstrap of 0.8)

In most cases, the 4th column of the taxonomy.sintax.txt is needed for further downstream ecological analyses. However, when using comma (,) as a field separator, the taxonomy in the 4th column has a variable number of fields (depending on the deepest rank SINTAX assigned at the cutoff).

Below, you can find a R-script to parse sintax taxonomy table (when EUKARYOME database is used). The R-script reads the SINTAX table, takes the 4th column, and writes taxonomy.sintax.formatted.txt with eight fixed columns for each rank. Missing ranks are filled with unclassified_<previous_rank> (for example, Species = unclassified_Lactarius when the deepest rank in column 4 is genus). EUKARYOME s: values are species epithets only; the script writes the binomial as Genus_epithet. Additional clarifications in genus names, specific to EUKARYOME database, such as Lactarius(Fungi) are reduced to Lactarius for the Genus and species columns.

For species level assignment, the R-script has an optional species_threshold argument, for requiring a higher bootstrap value for species assignment (stricter way for assigning species names.) When species_threshold in the script is set to 0.8 or higher, any row that has s: (species epithet) in the 4th column must also have a species assignment in the 2nd column with bootstrap ≥ species_threshold; otherwise the epithet is dropped and species name becomes unclassified_<Genus>. Set species_threshold to 0 to disable this check.

Parse taxonomy resulting from SINTAX run against EUKARYOME database
  1#!/usr/bin/env Rscript
  2### Format SINTAX column 4 for EUKARYOME.
  3### No-hit rows: all ranks output as "unclassified"
  4
  5# specify input
  6taxtab <- "taxonomy.sintax.txt"
  7
  8# (optional) species bootstrap filter: 0 = OFF.
  9species_threshold <- 0.9 # [allowed values: 0 or >= 0.8]
 10
 11# specify output file name
 12outfile <- "taxonomy.sintax.formatted.txt"
 13#--------------------------------------#
 14
 15# if running on CLI, use specified args
 16args <- commandArgs(trailingOnly = TRUE)
 17if (length(args) >= 1L) taxtab <- args[[1L]]
 18if (length(args) >= 2L) outfile <- args[[2L]]
 19if (length(args) >= 3L) species_threshold <- as.numeric(args[[3L]])
 20
 21## Ranks in output order (used for NA fill and unclassified_* propagation)
 22TAX_RANKS <- c("Kingdom", "Phylum",
 23              "Class", "Order",
 24              "Family", "Genus",
 25              "Species")
 26
 27## Strip trailing parenthetical qualifiers from genus
 28# when not a numeric bootstrap (e.g. Lactarius(Fungi)).
 29strip_genus_qualifier <- function(genus) {
 30  if (is.na(genus) || !nzchar(genus)) return(NA_character_)
 31  x <- sub("\\([^)]*[0-9][^)]*\\)$", "", genus)
 32  sub("\\([A-Za-z_][A-Za-z_ ]*\\)$", "", x)
 33}
 34
 35## spaces -> underscores; empty -> NA.
 36sanitize_tokens <- function(x) {
 37  x <- trimws(as.character(x))
 38  keep <- !is.na(x) & nzchar(x)
 39  x[!keep] <- NA_character_
 40  x[keep] <- gsub("\\s+", "_", x[keep], perl = TRUE)
 41  x
 42}
 43
 44## Species bootstrap from SINTAX column 2
 45extract_species_bootstrap <- function(col2_string) {
 46  s <- trimws(as.character(col2_string))
 47  if (is.na(s) || !nzchar(s)) return(NA_real_)
 48  m <- regmatches(s, regexec("s:[^(]+\\(([0-9.]+)\\)", s, perl = TRUE))[[1L]]
 49  if (length(m) < 2L) return(NA_real_)
 50  as.numeric(m[[2L]])
 51}
 52
 53## No hits
 54is_no_hit_col4 <- function(s) {
 55  x <- trimws(as.character(s))
 56  is.na(x) | !nzchar(x) | x == "*"
 57}
 58
 59## Parse one column-4 string into one row
 60parse_sintax_col4 <- function(s) {
 61  na_row <- function() {
 62    m <- matrix(
 63      NA_character_,
 64      nrow = 1L,
 65      ncol = 7L,
 66      dimnames = list(
 67        NULL,
 68        c("Kingdom", "Phylum",
 69          "Class", "Order",
 70          "Family", "Genus",
 71          "species_epithet")
 72      )
 73    )
 74    as.data.frame(m, stringsAsFactors = FALSE)
 75  }
 76
 77  s <- trimws(as.character(s))
 78  if (is.na(s) || !nzchar(s) || s == "*") return(na_row())
 79
 80  d_val <- k_val <- NA_character_
 81  row <- na_row()
 82
 83  for (part in strsplit(s, ",", fixed = TRUE)[[1L]]) {
 84    part <- trimws(part)
 85    if (!nzchar(part) || !grepl("^[dkpcofgs]:", part)) next
 86    code <- substr(part, 1L, 1L)
 87    val <- substr(part, 3L, nchar(part))
 88    if (!nzchar(val)) next
 89    switch(code,
 90          "d" = { d_val <- val },
 91          "k" = { k_val <- val },
 92          "p" = { row$Phylum <- val },
 93          "c" = { row$Class <- val },
 94          "o" = { row$Order <- val },
 95          "f" = { row$Family <- val },
 96          "g" = { row$Genus <- val },
 97          "s" = { row$species_epithet <- val },
 98          NULL
 99    )
100  }
101
102  if (!is.na(k_val)) {
103    row$Kingdom <- k_val
104  } else if (!is.na(d_val)) {
105    row$Kingdom <- d_val
106  }
107  row
108}
109
110replace_na_with_unclassified <- function(df) {
111  out <- df
112  km <- out[["Kingdom"]]
113  out[["Kingdom"]] <- ifelse(is.na(km) | km == "",
114                            "UnknownKingdom", km)
115  for (i in 2L:length(TAX_RANKS)) {
116    cur <- TAX_RANKS[i]
117    prev <- TAX_RANKS[i - 1L]
118    v <- out[[cur]]
119    prev_v <- out[[prev]]
120    out[[cur]] <- ifelse(is.na(v) | v == "",
121                        paste0("unclassified_",
122                                prev_v), v)
123  }
124  out
125}
126
127collapse_unclassified_prefix <- function(df) {
128  out <- df
129  for (nm in TAX_RANKS) {
130    out[[nm]] <- gsub("^(unclassified_)+",
131                      "unclassified_", out[[nm]],
132                      perl = TRUE)
133  }
134  out
135}
136
137## Build Species = Genus_epithet where both present
138species_binomial <- function(genus, epithet) {
139  gen <- sanitize_tokens(genus)
140  ep <- sanitize_tokens(epithet)
141  ok <- !is.na(ep) & !is.na(gen)
142  out <- rep(NA_character_, length(gen))
143  out[ok] <- paste0(gen[ok], "_", ep[ok])
144  out
145}
146
147## Load input and run parse
148raw <- read.table(
149  taxtab,
150  sep = "\t",
151  header = FALSE,
152  stringsAsFactors = FALSE,
153  quote = "",
154  comment.char = "",
155  fill = TRUE,
156  check.names = FALSE
157)
158
159while (ncol(raw) < 4L) {
160  raw <- cbind(raw, rep(NA_character_, nrow(raw)),
161              stringsAsFactors = FALSE)
162}
163
164if (ncol(raw) < 1L || nrow(raw) < 1L) {
165  stop("Expected at least one column and one row in ",
166      taxtab)
167}
168
169feat_id <- as.character(raw[[1L]])
170col4v <- as.character(raw[[4L]])
171no_hit <- vapply(col4v, is_no_hit_col4, FUN.VALUE = NA)
172
173parsed_df <- do.call(rbind, lapply(seq_len(nrow(raw)), function(i) {
174  if (isTRUE(no_hit[i])) {
175    data.frame(
176      Kingdom = "unclassified",
177      Phylum = "unclassified",
178      Class = "unclassified",
179      Order = "unclassified",
180      Family = "unclassified",
181      Genus = "unclassified",
182      species_epithet = NA_character_,
183      stringsAsFactors = FALSE
184    )
185  } else {
186    parse_sintax_col4(col4v[i])
187  }
188}))
189rownames(parsed_df) <- feat_id
190
191parsed_df$Genus <- vapply(parsed_df$Genus,
192                          strip_genus_qualifier,
193                          FUN.VALUE = character(1L))
194
195## species_threshold
196st_num <- suppressWarnings(as.numeric(species_threshold))
197if (!is.na(st_num) && st_num >= 0.8) {
198  col2 <- as.character(raw[[2L]])
199  sp_bt <- vapply(col2, extract_species_bootstrap, FUN.VALUE = NA_real_)
200  epit <- parsed_df$species_epithet
201  has_s4 <- !is.na(epit) & nzchar(trimws(as.character(epit)))
202  fail <- !no_hit & has_s4 & (is.na(sp_bt) | sp_bt < st_num)
203  parsed_df$species_epithet[fail] <- NA_character_
204}
205
206out <- data.frame(
207  parsed_df[, c("Kingdom", "Phylum",
208                "Class", "Order",
209                "Family", "Genus"),
210            drop = FALSE],
211  Species = species_binomial(parsed_df$Genus,
212                            parsed_df$species_epithet),
213  stringsAsFactors = FALSE,
214  row.names = rownames(parsed_df)
215)
216
217out <- replace_na_with_unclassified(out)
218out <- collapse_unclassified_prefix(out)
219
220## No-hit rows: every rank is "unclassified"
221if (any(no_hit)) {
222  ii <- which(no_hit)
223  for (nm in TAX_RANKS) {
224    out[ii, nm] <- "unclassified"
225  }
226}
227
228out <- cbind(feature_id = feat_id, out,
229            stringsAsFactors = FALSE)
230
231# write output file
232write.table(out, file = outfile,
233            sep = "\t",
234            quote = FALSE,
235            row.names = FALSE,
236            col.names = TRUE)
237
238message("Wrote ", nrow(out), " rows to ", outfile)

You can copy-paste the script to a file and name it e.g., format_SINTAX_tax.R to run it on terminal (or run directly in RStudio).

 # Run with species_threshold of 0.9
 Rscript format_SINTAX_tax.R path/to/taxonomy.sintax.txt path/to/taxonomy.sintax.formatted.txt 0.9

# Run with species_threshold disabled
Rscript format_SINTAX_tax.R path/to/taxonomy.sintax.txt path/to/taxonomy.sintax.formatted.txt 0

The output file taxonomy.sintax.formatted.txt (with species_threshold of 0.9):

feature_id

Kingdom

Phylum

Class

Order

Family

Genus

Species

52f069d5552dbeaa8fe7dc379bd33c111bdc958f

Fungi

Basidiomycota

Agaricomycetes

Russulales

Albatrellaceae

Byssoporia

unclassified_Byssoporia

99a50398b16a99aa01a932e3cfbf072c1313e8bd

Fungi

Basidiomycota

Agaricomycetes

Russulales

Russulaceae

Russula

unclassified_Russula

ed8775e608760e3c55fd15cf33eeed6690db89a2

Fungi

Ascomycota

Pezizomycetes

Pezizales

Pyronemataceae

Humaria

unclassified_Humaria

12a6e094e33bc2c9cf286bfde56fc73b90c26847

Fungi

Basidiomycota

Agaricomycetes

Russulales

Russulaceae

Lactarius

unclassified_Lactarius

Previous Next

© Copyright 2026, Sten Anslan.

Built with Sphinx using a theme provided by Read the Docs.