.. |PipeCraft2_logo| image:: _static/PipeCraft2_icon_v2.png :width: 50 :target: https://github.com/pipecraft2/user_guide .. |output_icon| image:: _static/output_icon.png :width: 50 .. raw:: html .. role:: red .. raw:: html .. role:: green .. meta:: :description lang=en: PipeCraft manual. tutorial .. _postprocessingtools: ======================================= Post-processing tools |PipeCraft2_logo| ======================================= .. note:: All post-processing tools accessible under **ADD STEP** -> **POSTPROCESSING** .. contents:: Contents :depth: 2 .. important:: Note that the input ASV/OTU table must contain "Sequence" column for some postprocessing modules. Read the tooltips too see which ones. ____________________________________________________ Filter tag-jumps ---------------- Filter out putative tag-jumps with `UNCROSS2 `_. | Input data is tab delimited **OTU table** and corresponding **fasta file** (representative sequences of ASV/OTUs). .. note:: To **START**, specify working directory under ``SELECT WORKDIR``, but the file formats do not matter here (just click 'Next'). +-----------------------+-------------------------------------------------------------------------------+ | Outputs | | +=======================+===============================================================================+ | ``*_TagJumpFilt.txt`` | output table where tag-jumps have been filtered out | +-----------------------+-------------------------------------------------------------------------------+ | TagJump_plot.pdf | illustration about the presence of tag-jumps based on the selected parameters | +-----------------------+-------------------------------------------------------------------------------+ || TagJump_stats.txt || tag-jump statistics (Total_reads, Number_of_TagJump_Events, | || || TagJump_reads, ReadPercent_removed) | +-----------------------+-------------------------------------------------------------------------------+ +-----------------+------------------------------------------------------------------------------------------+ | Setting | Tooltip | +=================+==========================================================================================+ || ``table`` || select tab-delimited OTU/ASV table, where the 1st column is the OTU/ASV IDs and the | || || following columns represent samples; 2nd column may be Sequence column, with the | || || colName 'Sequence' [**file must be in the SELECT WORKDIR directory**] | +-----------------+------------------------------------------------------------------------------------------+ || ``fasta file`` || select corresponding fasta file for OTU/ASV table [**fasta file must be in the SELECT** | || || **WORKDIR directory**] | +-----------------+------------------------------------------------------------------------------------------+ || ``f value`` || f-parameter of UNCROSS2, which defines the expected tag-jumps rate. Default is 0.03 | || || (equivalent to 3%). A higher value enforces stricter filtering | +-----------------+------------------------------------------------------------------------------------------+ || ``p value`` || p-parameter, which controls the severity of tag-jump removal. It adjusts the exponent | || || in the UNCROSS formula. Default is 1. Opt for 0.5 or 0.3 to steepen the curve | +-----------------+------------------------------------------------------------------------------------------+ ___________________________________________________ ASV to OTU ---------- Cluster ASVs (zOTUs) to OTUs using vsearch. | **Input data** is tab delimited **ASV table** and **ASV sequences** in fasta format. | 2nd column of **ASV table** MUST BE 'Sequences' (1st column is ASV IDs; default pipecraft output table). | For clustering, the ASV size annotation is obtained from the ASV table. +---------------------------+--------------------------------------------------------+ | Outputs ``ASVs2OTUs_out`` | | +===========================+========================================================+ | OTUs.fasta | FASTA formated representative OTU sequences | +---------------------------+--------------------------------------------------------+ | OTU_table.txt | OTU distribution table per sample (tab delimited file) | +---------------------------+--------------------------------------------------------+ | OTUs.uc | uclust-like formatted clustering results for OTUs | +---------------------------+--------------------------------------------------------+ | ASVs.size.fasta | size annotated input sequences | +---------------------------+--------------------------------------------------------+ .. _postclustering_lulu: ___________________________________________________ LULU post-clustering --------------------- Perform OTU post-clustering with `LULU `_ to merge co-occurring 'daughter' OTUs. LULU description from the `LULU repository `_: the purpose of LULU is to reduce the number of erroneous OTUs in OTU tables to achieve more realistic biodiversity metrics. By evaluating the co-occurence patterns of OTUs among samples LULU identifies OTUs that consistently satisfy some user selected criteria for being errors of more abundant OTUs and merges these. It has been shown that curation with LULU consistently result in more realistic diversity metrics. Additional information: - `LULU repository `_ - `LULU paper `_ | Input data is tab delimited **OTU table** (``table``) and **OTU sequences** (``rep_seqs``) in fasta format (see input examples below). | `EXAMPLE table here `_ *(from LULU repository)* | `EXAMPLE fasta here `_ *(from LULU repository)* .. note:: To **START**, specify working directory under ``SELECT WORKDIR``, but the file formats do not matter here (just click 'Next'). +-----------------------+----------------------------------------------------------------------------+ | Outputs ``lulu_out`` | | +=======================+============================================================================+ | OTU_table.lulu.txt | curated table in tab delimited txt format | +-----------------------+----------------------------------------------------------------------------+ | OTUs.lulu.fasta | fasta file for the molecular units (OTUs or ASVs) in the curated table | +-----------------------+----------------------------------------------------------------------------+ | match_list.lulu | match list file that was used by LULU to merge 'daughter' molecular units | +-----------------------+----------------------------------------------------------------------------+ || discarded_units.lulu || molecular units (OTUs or ASVs) that were merged with other units based on | || || specified thresholds | +-----------------------+----------------------------------------------------------------------------+ =============================================== ========================= `Setting `_ Tooltip =============================================== ========================= ``table`` | select OTU/ASV table. If no file is selected, then PipeCraft will | look OTU_table.txt or ASV_table.txt in the working directory. | `EXAMPLE table here `_ ``fasta_file`` | select fasta formatted sequence file containing your OTU/ASV reads. | `EXAMPLE file here `_ ``min_ratio_type`` | sets whether a potential error must have lower abundance than the parent | in all samples 'min' (default), or if an error just needs to have lower | abundance on average 'avg' ``min_ratio`` | set the minimim abundance ratio between a potential error and a | potential parent to be identified as an error ``min_match`` | specify minimum threshold of sequence similarity for considering | any OTU as an error of another ``min_rel_cooccurence`` | minimum co-occurrence rate. Default = 0.95 (meaning that 1 in 20 samples | are allowed to have no parent presence) ``match_list_soft`` | use either 'blastn' or 'vsearch' to generate match list for LULU. | Default is 'vsearch' (much faster) ``vsearch_similarity_type`` | applies only when 'vsearch' is used as 'match_list_soft'. | Pairwise sequence identity definition (--iddef) ``perc_identity`` | percent identity cutoff for match list. Excluding pairwise comparisons | with lower sequence identity percentage than specified threshold ``coverage_perc`` | percent query coverage per hit. Excluding pairwise comparisons with | lower sequence coverage than specified threshold ``strands`` | query strand to search against database. Both = search also reverse complement ``cores`` | number of cores to use for generating match list for LULU =============================================== ========================= .. _postclustering_dada2_table_filtering: ____________________________________________________ DADA2 collapse ASVs ------------------- DADA2 `collapseNoMismatch `_ function collapses identical ASVs with no internal mismatches (~greedy 100% clustering with end-gapping ignored). Representative sequence of a collapsed ASV will be the most abundant one. and ASVs filtering based on minimum accepted sequence length (custom R functions). To **START**, specify working directory under ``SELECT WORKDIR``, but the file formats do not matter here (just click 'Next'). +---------------------------------+-------------------------------------------------------------------------------------+ | Outputs ``filtered_table`` | | +=================================+=====================================================================================+ | ASVs_table_collapsed.txt | ASV table after collapsing identical ASVs | +---------------------------------+-------------------------------------------------------------------------------------+ | ASVs_collapsed.fasta | ASV sequences after collapsing identical ASVs | +---------------------------------+-------------------------------------------------------------------------------------+ | ASV_table_collapsed.rds | ASV table in RDS format after collapsing identical ASVs | +---------------------------------+-------------------------------------------------------------------------------------+ | If length filtering was applied | | +---------------------------------+-------------------------------------------------------------------------------------+ | ASV_table_lenFilt.tx | ASV table after filtering out ASVs with shorther than specified sequence length | +---------------------------------+-------------------------------------------------------------------------------------+ | ASVs_lenFilt.fasta | ASV sequences after filtering out ASVs with shorther than specified sequence length | +---------------------------------+-------------------------------------------------------------------------------------+ ========================== ============ Setting Tooltip ========================== ============ ``DADA2 table`` | select the RDS file (ASV table), output from DADA2 workflow; | usually in ASVs_out.dada2/ASVs_table.denoised-merged.rds ``collapseNoMismatch`` | collapses ASVs that are identical up to shifts or | length variation, i.e. that have no mismatches or internal indels ``by_length`` | discard ASVs from the ASV table that are shorter than specified | value (in base pairs). Value 0 means OFF, no filtering by length ``minOverlap`` | collapseNoMismatch setting. Default = 20. The minimum overlap of | base pairs between ASV sequences required to collapse them together ``vec`` | collapseNoMismatch setting. Default = TRUE. Use the vectorized | aligner. Should be turned off if sequences exceed 2kb in length ========================== ============ __________________________________________________ metaMATE -------- Determine and filter out putative NUMTs (from mitochondrial coding amplicon genes) and and other erroneous sequences based on relative read abundance thresholds within libraries, phylogenetic clades and/or taxonomic groupings. Additional information: - `metaMATE repository `_ - `metaMATE paper `_ ___________________________________________________ ORF-Finder ---------- Filter out putative pseudogenes (NUMTs) from protein coding amplicon dataset (such as COI, rbcL) using NCBI's ORFfinder `(Sayers et al 2022) `_. This process translates sequences to open reading frames (ORFs) and retaines the longest ORF per sequence if the length of the ORF is between the specified range of ``min length`` and ``max length``. .. _assign_taxonomy: ____________________________________________________ .. _postprocessing_deicode: `DEICODE `_ ----------------------------------------------- DEICODE (`Martino et al., 2019 `_) is used to perform beta diversity analysis by applying robust Aitchison PCA on the OTU/ASV table. To consider the compositional nature of data, it preprocesses data with rCLR transformation (centered log-ratio on only non-zero values, without adding pseudo count). As a second step, it performs dimensionality reduction of the data using robust PCA (also applied only to the non-zero values of the data), where sparse data are handled through matrix completion. Additional information: - `DEICODE repository `_ - `DEICODE paper `_ | Input data is tab delimited **OTU table** and optionally **subset of OTU ids** to generate results also for the selected subset (see input examples below). .. note:: To **START**, specify working directory under ``SELECT WORKDIR``, but the file formats do not matter here (just click 'Next'). +-------------------------------------------------------------------+------------------------------------------------------------------------+ | Output directory |output_icon| ``DEICODE_out`` | +===================================================================+========================================================================+ | otutab.biom | full OTU table in BIOM format | +-------------------------------------------------------------------+------------------------------------------------------------------------+ | rclr_subset.tsv | rCLR-transformed subset of OTU table \* | +-------------------------------------------------------------------+------------------------------------------------------------------------+ | ``full``/distance-matrix.tsv | distance matrix between the samples, based on full OTU table | +-------------------------------------------------------------------+------------------------------------------------------------------------+ | ``full``/ordination.txt | ordination scores for samples and OTUs, based on full OTU table | +-------------------------------------------------------------------+------------------------------------------------------------------------+ | ``full``/rclr.tsv | rCLR-transformed OTU table | +-------------------------------------------------------------------+------------------------------------------------------------------------+ | ``subs``/distance-matrix.tsv | distance matrix between the samples, based on a subset of OTU table \* | +-------------------------------------------------------------------+------------------------------------------------------------------------+ | ``subs``/ordination.txt | ordination scores for samples and OTUs, based a subset of OTU table \* | +-------------------------------------------------------------------+------------------------------------------------------------------------+ | \* files are present only if 'subset_IDs' variable was specified | | +-------------------------------------------------------------------+------------------------------------------------------------------------+ =============================================== ========================= Setting Tooltip =============================================== ========================= ``table`` | select OTU/ASV table. If no file is selected, then PipeCraft will | look OTU_table.txt or ASV_table.txt in the working directory. | See OTU table example below ``subset_IDs`` | select list of OTU/ASV IDs for analysing a subset from the full table | see subset_IDs file example below ``min_otu_reads`` | cutoff for reads per OTU/ASV. OTUs/ASVs with lower reads then specified | cutoff will be excluded from the analysis ``min_sample_reads`` | cutoff for reads per sample. Samples with lower reads then | specified cutoff will be excluded from the analysis =============================================== ========================= Example of input ``table`` (tab delimited text file): ================== ============== ============== ============== ============== OTU_id sample1 sample2 sample3 sample4 ================== ============== ============== ============== ============== 00fc1569196587dde 106 271 584 20 02d84ed0175c2c79e 81 44 88 14 0407ee3bd15ca7206 3 4 3 0 042e5f0b5e38dff09 20 83 131 4 07411b848fcda497f 1 0 2 0 07e7806a732c67ef0 18 22 83 7 0836d270877aed22c 1 1 0 0 0aa6e7da5819c1197 1 4 5 0 0c1c219a4756bb729 18 17 40 7 ================== ============== ============== ============== ============== Example of input ``subset_IDs``: .. code-block:: 07411b848fcda497f 042e5f0b5e38dff09 0836d270877aed22c 0c1c219a4756bb729 ... | **PERMANOVA and PERMDISP example using the robust Aitchison distance** .. code-block:: library(vegan) ## Load distance matrix dd <- read.table(file = "distance-matrix.tsv") ## You will also need to load the sample metadata ## However, for this example we will create a dummy data meta <- data.frame( SampleID = rownames(dd), TestData = rep(c("A", "B", "C"), each = ceiling(nrow(dd)/3))[1:nrow(dd)]) ## NB! Ensure that samples in distance matrix and metadata are in the same order meta <- meta[ match(x = meta$SampleID, table = rownames(dd)), ] ## Convert distance matrix into 'dist' class dd <- as.dist(dd) ## Run PERMANOVA adon <- adonis2(formula = dd ~ TestData, data = meta, permutations = 1000) adon ## Run PERMDISP permdisp <- betadisper(dd, meta$TestData) plot(permdisp) Example of plotting the ordination scores .. code-block:: library(ggplot2) ## Load ordination scores ord <- readLines("ordination.txt") ## Skip PCA summary ord <- ord[ 8:length(ord) ] ## Break the data into sample and species scores breaks <- which(! nzchar(ord)) ord <- ord[1:(breaks[2]-1)] # Skip biplot scores ord_sp <- ord[1:(breaks[1]-1)] # species scores ord_sm <- ord[(breaks[1]+2):length(ord)] # sample scores ## Convert scores to data.frames ord_sp <- as.data.frame( do.call(rbind, strsplit(x = ord_sp, split = "\t")) ) colnames(ord_sp) <- c("OTU_ID", paste0("PC", 1:(ncol(ord_sp)-1))) ord_sm <- as.data.frame( do.call(rbind, strsplit(x = ord_sm, split = "\t")) ) colnames(ord_sm) <- c("Sample_ID", paste0("PC", 1:(ncol(ord_sm)-1))) ## Convert PCA to numbers ord_sp[colnames(ord_sp)[-1]] <- sapply(ord_sp[colnames(ord_sp)[-1]], as.numeric) ord_sm[colnames(ord_sm)[-1]] <- sapply(ord_sm[colnames(ord_sm)[-1]], as.numeric) ## At this step, sample and OTU metadata could be added to the data.frame ## Example plot ggplot(data = ord_sm, aes(x = PC1, y = PC2)) + geom_point() ___________________________________________________ .. HIDDEN SECTION BlasCh ---------- **False positive chimera detection and recovery module** for metabarcoding and environmental DNA (eDNA) datasets. BlasCh (BLAST-based Chimera detection) uses BLAST alignment analysis to identify, classify, and recover sequences that were incorrectly flagged as chimeric during initial chimera detection steps. .. important:: **Workflow compatibility requirements:** - BlasCh **cannot be run as part of a full pipeline (for now)** - it is a standalone post-processing tool - Must be used **after** chimera filtering has been completed - Requires **manual workflow**: run chimera filtering → run BlasCh → run clustering - **Rescued sequences** must be later merged with non-chimeric sequences from original samples - Not compatible with automated pipeline workflows that include clustering steps **How BlasCh Works:** BlasCh employs a sophisticated BLAST-based approach to re-evaluate chimeric sequences through multiple analysis steps: 1. **Database Creation**: Creates BLAST databases from both sample FASTA files (self-databases) and reference sequences 2. **BLAST Analysis**: Performs nucleotide BLAST searches against both self-databases and reference database 3. **Hit Analysis**: Examines BLAST alignments for multiple High-scoring Segment Pairs (HSPs), taxonomic diversity, and alignment quality 4. **Classification**: Applies multi-tier thresholds to classify sequences into distinct categories based on identity and coverage metrics 5. **Recovery**: Rescues sequences that meet quality criteria for inclusion in downstream analyses The module implements smart rerun capabilities, automatically detecting and reusing existing BLAST XML files to enable parameter optimization without re-running computationally expensive BLAST searches. | Input data is **chimeric sequences** in FASTA format (`.chimeras.fasta`, `.chimeras.fa`, `.chimeras.fas` files) and a **reference database** (FASTA file or existing BLAST database). .. important:: **File organization requirements:** - Reference database files must be stored in a **separate directory** from input chimera files - Input chimera files should be in the working directory - Sample FASTA files (for self-database creation) should also be in the working directory - Do not place reference database files in the same folder as input files to avoid conflicts .. note:: To **START**, specify working directory under ``SELECT WORKDIR``, but the file formats do not matter here (just click 'Next'). **Output folder structure:** BlasCh creates a well-organized output directory structure to separate rescued sequences from detailed analysis results: +--------------------------------------+-------------------------------------------------------------------------+ | Output directory |output_icon| ``BlasCh_out`` | +======================================+=========================================================================+ | **RESCUED SEQUENCES (main results)** | +--------------------------------------+-------------------------------------------------------------------------+ | ``non_chimeric``/*_non_chimeric.fasta | recovered non-chimeric sequences (high confidence rescue) | +--------------------------------------+-------------------------------------------------------------------------+ | ``borderline``/*_borderline.fasta | borderline sequences (moderate confidence rescue) | +--------------------------------------+-------------------------------------------------------------------------+ | **SUMMARY AND REPORTS** | +--------------------------------------+-------------------------------------------------------------------------+ | chimera_recovery_report.txt | summary statistics and classification results | +--------------------------------------+-------------------------------------------------------------------------+ | README.txt | documentation of analysis parameters and results | +--------------------------------------+-------------------------------------------------------------------------+ | **DETAILED ANALYSIS RESULTS** | +--------------------------------------+-------------------------------------------------------------------------+ | ``detailed_results``/*_chimeric.fasta | confirmed chimeric sequences that remain excluded | +--------------------------------------+-------------------------------------------------------------------------+ | ``detailed_results``/*_multiple_alignments.fasta | sequences with multiple HSPs and low coverage | +--------------------------------------+-------------------------------------------------------------------------+ | ``detailed_results``/*_sequence_details.csv | detailed classification results for each sequence | +--------------------------------------+-------------------------------------------------------------------------+ | **TECHNICAL FILES** | +--------------------------------------+-------------------------------------------------------------------------+ | ``xml``/blast_results.zip | compressed BLAST XML output files (can be used for reanalysis with different thresholds) | +--------------------------------------+-------------------------------------------------------------------------+ **Folder organization explanation:** - **Rescued Sequences**: The ``non_chimeric`` and ``borderline`` folders contain sequences that can be included in downstream analyses - **Detailed Results**: The ``detailed_results`` folder contains sequences that remain excluded along with analysis details - **Summary Files**: Report files provide overview statistics and complete documentation of the analysis - **Technical Files**: Compressed XML files allow reanalysis with different parameters without re-running BLAST =============================================== ========================= Setting Tooltip =============================================== ========================= ``reference_db`` | path to reference database (FASTA file or existing BLAST database). | **Required** - must be provided and stored in separate folder from input files ``high_identity_threshold`` | identity threshold for high-quality matches (default: 99.0%) ``high_coverage_threshold`` | coverage threshold for high-quality matches (default: 99.0%) ``borderline_identity_threshold`` | identity threshold for borderline recovery (default: 80.0%) ``borderline_coverage_threshold`` | coverage threshold for borderline recovery (default: 89.0%) =============================================== ========================= **Detailed classification logic:** BlasCh uses a sophisticated multi-tier classification system with the following decision tree: 1. **Multiple alignments**: Sequences with multiple HSPs in the first non-self BLAST alignment and ≤85% coverage → classified as multiple alignments 2. **Self-hits only**: Sequences that only match to their own sample without reference database matches → confirmed chimeric 3. **High-Quality matches**: Identity ≥threshold AND coverage ≥threshold against reference database → rescued as non-chimeric 4. **Borderline recovery**: Identity ≥threshold AND coverage ≥threshold against reference database → rescued as non-chimeric 5. **Taxonomic diversity**: Multiple different taxonomies in top hits without meeting quality thresholds → confirmed chimeric **Smart rerun capability:** - Automatically detects existing BLAST XML files from previous runs - Extracts XML files from compressed archives when needed - Skips database creation and BLAST steps if XML files are complete - Enables testing different classification thresholds without re-running BLAST - Handles mixed scenarios (some samples have XML, others don't) **Expected Results:** - **Rescued sequences** (non-chimeric and borderline) can be included in downstream analyses - **Detailed analysis results** provide transparency about why certain sequences were confirmed as chimeric - **CSV reports** contain per-sequence classification details and summary statistics - **Documentation** ensures reproducibility and parameter tracking **Post-BlasCh Workflow:** 1. **Merge rescued sequences** with original non-chimeric sequences from each sample 2. **Run clustering manually** on the combined sequence sets 3. **Proceed with downstream analyses** using the updated sequence data 4. **Document** which sequences were rescued for transparency in results .. note:: BlasCh automatically detects `.chimeras` files with various extensions (.fasta, .fa, .fas) in the working directory and creates self-databases from available sample FASTA files. Original sample files are prioritized over .chimeras files for database creation. .. warning:: **Important usage notes:** - Ensure chimera detection has been run prior to BlasCh analysis - Reference database must be provided and in FASTA format or valid BLAST database format - Reference database files must be stored in a separate directory from input files