.. |PipeCraft2_logo| image:: _static/PipeCraft2_icon_v2.png
  :width: 50
  :target: https://github.com/pipecraft2/pipecraft

.. raw:: html

.. role:: red

.. raw:: html

.. role:: green

.. |workflow_finished| image:: _static/workflow_finished.png
  :width: 300
  :class: center

.. |stop_workflow| image:: _static/stop_workflow.png
  :width: 200

.. |output_icon| image:: _static/output_icon.png
  :width: 50

.. |save| image:: _static/save.png
  :width: 50

.. |pulling_image| image:: _static/pulling_image.png
  :width: 280

.. |NextITS_pipeline| image:: _static/nextits_pipeline.png
  :width: 200

.. |NextITS_step1_settings| image:: _static/nextits_step1_settings.png
  :width: 800

.. |NextITS_step2_settings| image:: _static/nextits_step2_settings.png
  :width: 800

.. meta::
    :description lang=en: PipeCraft manual. NextITS tutorial

.. _example_analyses_NextITS:

NextITS pipeline, full-length ITS |PipeCraft2_logo|
-------------------------------------------------------

This example data analysis follows the **NextITS** pipeline as implemented in PipeCraft2's pre-compiled pipelines panel.
NextITS is a specialized pipeline for analyzing **full-length ITS** reads obtained via **PacBio** sequencing.

| `Download example data set here `_ (1 MB) and **unzip** it.
| This is a **Full-length ITS dataset, PacBio sequencing**.

____________________________________________________

Starting point
~~~~~~~~~~~~~~

The example dataset consists of **PacBio full-length ITS sequences** from **two sequencing runs**.

**Key features of the data:**

- **Demultiplexed** fastq files.
- **Two sequencing runs** (Run_01 and Run_02).
- Files follow the ``RunID__SampleID`` naming convention.

**Directory structure:**

To process data with NextITS in PipeCraft2, your input directory must follow a specific structure:

1. A main folder (e.g., ``my_NextITS_project``).
2. Inside that, a folder named **exactly** ``Input``.
3. Inside ``Input``, subfolders for each sequencing run (e.g., ``Run_01``, ``Run_02``).
4. Inside run folders, your demultiplexed fastq files.

.. code-block:: text

    my_NextITS_project/          <-- SELECT THIS AS WORKING DIRECTORY
    └── Input/
        ├── Run_01/
        │   ├── Run01__Sample101.fastq.gz
        │   ├── Run01__Sample49.fastq.gz
        │   └── Run01__Sample72.fastq.gz
        └── Run_02/
            ├── Run02__Sample26.fastq.gz
            ├── Run02__Sample61.fastq.gz
            └── Run02__Sample87.fastq.gz

.. note::

    The double underscore ``__`` in filenames (e.g., ``Run01__Sample101``) is important!
    It allows the pipeline to parse the Run ID and Sample ID correctly, which is crucial for tracking samples across runs and for tag-jump filtering.

____________________________________________________

Select pipeline and input
~~~~~~~~~~~~~~~~~~~~~~~~~

| **To select the NextITS pipeline**, press:
| ``SELECT PIPELINE`` --> ``NextITS``.

| **To select input data**, press ``SELECT WORKDIR``
| and select the **main folder** (e.g., ``my_NextITS_project``) that contains the ``Input`` directory.

|NextITS_pipeline|

____________________________________________________

Step 1: Quality control and artefact removal
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NextITS processes data in **two distinct steps**. Step 1 is performed **per sequencing run** to handle run-specific errors and artefacts.

|NextITS_step1_settings|

Step 1 includes:

* **Primer trimming**: Removing primers and filtering reads that don't contain them.
* **Quality filtering**: Removing low-quality reads and correcting homopolymer errors (common in PacBio data).
* **ITS extraction**: Using ITSx to extract the full ITS region (ITS1-5.8S-ITS2), removing flanking 18S/28S parts.
* **Chimera filtering**: De novo and reference-based chimera detection, with a "rescue" step for likely false positives.
* **Tag-jump correction**: Removing sequences that likely jumped between samples during library prep.

**Key Settings for Step 1:**

1. **Trim Primers**: NextITS requires **exactly one forward and one reverse primer**.

   * ``primer_forward``: Specify your forward primer (IUPAC codes allowed).
   * ``primer_reverse``: Specify your reverse primer.
   * ``primer_mismatch``: Allowed mismatches (default 2).

2. **ITS Extraction**:

   * ``its_region``: Generally set to **full** for PacBio data to keep the entire ITS region.
   * ``ITSx_tax``: Can be set to **all**, **Fungi**, or other groups to restrict ITSx search.

3. **Chimera Filtering**:

   * Uses a built-in or custom reference database.
   * ``chimera_rescue_occurrence``: Sequences initially flagged as chimeric but appearing in at least this many samples (default 2) are "rescued" (considered valid). This protects against false positives in multi-sample datasets.

4. **Tag-jump Correction**:

   * Important for removing erroneous reads (those assigned to the wrong samples).
   * ``tj_f`` (f-value): Expected tag-jump rate. Default **0.01** is often appropriate for dual-indexed libraries. Use **0.03** or higher for combinatorial dual indexing if tag-jumping is suspected to be higher.

____________________________________________________

Step 2: Aggregation and clustering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

After Step 1 processes each run individually, **Step 2 pools all valid sequences** from all runs and clusters them into OTUs.

|NextITS_step2_settings|

**Clustering Options:**

You can choose between three clustering strategies via ``clustering_method``:

* **vsearch** (default): Greedy clustering at a fixed threshold (e.g., 0.98 for 98% similarity). Fast and widely used.
* **swarm**: Exact-sequence based clustering that forms OTUs by chaining sequences differing by at most *d* nucleotides. Good for high-resolution analysis.
* **unoise**: Denoising algorithm (produces zero-radius OTUs, or zOTUs, which are analogous to ASVs).

**Post-clustering LULU:**

* ``lulu`` = **TRUE** (default).
* LULU merges "daughter" OTUs (errors) into "parent" OTUs based on co-occurrence patterns, producing a cleaner final OTU table.

____________________________________________________

Save and start
~~~~~~~~~~~~~~

Once settings are configured:

1. **Save the configuration**: Click the **Save Workflow** button |save| on the right ribbon. This creates a ``pipecraft2_last_run_configuration.json`` file for reproducibility.
2. **Start the pipeline**: Click **START**.

.. admonition:: First time run

    When running NextITS for the first time, PipeCraft will pull the necessary Docker images. This may take a few minutes.

    |pulling_image|

____________________________________________________

Examine the outputs
~~~~~~~~~~~~~~~~~~~

NextITS organizes outputs into ``Step1_Results`` and ``Step2_Results``.

.. note::

    Both Step 1 and Step 2 output directories contain a ``pipeline_info`` folder.
    This folder includes ``execution_trace_*.txt`` with the detailed log of the pipeline execution (e.g., duration and resources used per process),
    as well as ``README_Step{1,2}_Methods.txt`` with a human-readable description of the pipeline steps (suitable for the materials and methods section of a publication), including references to the software tools used.

Step 1 outputs (per run)
^^^^^^^^^^^^^^^^^^^^^^^^

Located in ``Step1_Results/Run_XX/``. Key folders include:

* ``02_PrimerCheck``: Sequences after primer trimming.
* ``03_ITSx``: Results from ITS extraction (ITS sequences, coordinates).
* ``05_Chimera``: Chimera filtering results (chimeras found vs. non-chimeras).
* ``06_TagJumpFiltration``: Results after removing tag-jump artefacts.
* ``07_SeqTable``: **Final processed sequences for this run**. These are used as input for Step 2.
* ``08_RunSummary``: Contains ``Run_summary.xlsx`` with read counts per sample at each step. **Check this to evaluate sample quality and dropout.**

Step 2 outputs (pooled)
^^^^^^^^^^^^^^^^^^^^^^^

Located in ``Step2_Results/``. This is where your final results are stored.

* ``01.Dereplicated``: Pooled and dereplicated sequences from all runs.
* ``03.Clustered_VSEARCH`` (or Swarm/UNOISE): Raw clusters before LULU curation.
* ``04.PooledResults``:

  * ``OTUs.fa.gz``: Representative OTU sequences.
  * ``OTU_table_wide.txt.gz``: OTU table (OTUs x Samples).

* ``05.LULU`` (if LULU was enabled):

  * **OTUs_LULU.fa.gz**: **Final curated OTU sequences.**
  * **OTU_table_LULU.txt.gz**: **Final curated OTU table.**
  * ``LULU_merging_statistics.txt.gz``: Info on which OTUs were merged.

If required, RData files can be loaded into R using the ``readRDS`` function, e.g.:

.. code-block:: R

    OTU_table <- readRDS("Step2_Results/04.PooledResults/OTU_table_long.RData")

.. important::

    For downstream analysis (taxonomy assignment, statistics), use the files in the **05.LULU** folder (if enabled) or **04.PooledResults** (if LULU was disabled).

____________________________________________________

Taxonomy assignment
~~~~~~~~~~~~~~~~~~~

Taxonomy assignment **is not** part of the core NextITS pipeline but can be run subsequently using **QuickTools**. :ref:`See here `.
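
Before passing results on to taxonomy assignment, it can help to confirm which Step 2 files are the final ones. The R sketch below applies the rule stated under *Examine the outputs*: it returns the ``05.LULU`` files when that folder exists and the ``04.PooledResults`` files otherwise. The helper name ``pick_final_outputs`` and the use of folder existence as a proxy for "LULU was enabled" are assumptions made for this illustration; they are not part of NextITS or PipeCraft2.

.. code-block:: R

    # Hypothetical helper (not part of NextITS/PipeCraft2):
    # choose the final OTU fasta and OTU table from a Step 2 results folder.
    pick_final_outputs <- function(step2_dir = "Step2_Results") {
      lulu_dir   <- file.path(step2_dir, "05.LULU")
      pooled_dir <- file.path(step2_dir, "04.PooledResults")

      if (dir.exists(lulu_dir)) {
        # Assumption: the folder exists only when LULU curation was run
        out <- list(
          fasta = file.path(lulu_dir, "OTUs_LULU.fa.gz"),
          table = file.path(lulu_dir, "OTU_table_LULU.txt.gz")
        )
      } else {
        # LULU disabled: fall back to the pooled, uncurated results
        out <- list(
          fasta = file.path(pooled_dir, "OTUs.fa.gz"),
          table = file.path(pooled_dir, "OTU_table_wide.txt.gz")
        )
      }

      # Stop early if an expected file is missing
      paths   <- unlist(out)
      missing <- !file.exists(paths)
      if (any(missing)) {
        stop("Missing output file(s): ", paste(paths[missing], collapse = ", "))
      }
      out
    }

For example, ``pick_final_outputs("Step2_Results")$fasta`` gives the path of the OTU sequence file to carry into downstream steps, and ``$table`` gives the matching OTU table.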