.. |PipeCraft2_logo| image:: _static/PipeCraft2_icon_v2.png :width: 50 :target: https://github.com/pipecraft2/pipecraft .. raw:: html .. role:: red .. raw:: html .. role:: green .. |multiQC_main| image:: _static/multiQC_main.png :width: 1000 .. |multiQC_1-3| image:: _static/multiQC_1-3.png :width: 550 .. |multiQC_view_report| image:: _static/multiQC_view_report.png :width: 550 .. |multiQC_stats| image:: _static/multiQC_stats.png :width: 650 .. |multiQC_plot| image:: _static/multiQC_plot.png :width: 650 .. _qualitycheck: Inspect quality profiles ------------------------ Quality scores are assigned to each nucleotide during base calling to indicate errors/accuarcies of the recorded base. Generally, the **quality score of 30** indicates a trustfuly recorded base (99.9% accuracy). +---------------+------------------------------------+--------------------+ | Quality score | Probability of incorrect base call | Base call accuracy | +===============+====================================+====================+ | 10 | 1 in 10 | 90% | +---------------+------------------------------------+--------------------+ | 20 | 1 in 100 | 99% | +---------------+------------------------------------+--------------------+ | 30 | 1 in 1000 | 99.9% | +---------------+------------------------------------+--------------------+ | 40 | 1 in 10000 | 99.99% | +---------------+------------------------------------+--------------------+ | 50 | 1 in 100000 | 99.999% | +---------------+------------------------------------+--------------------+ | 60 | 1 in 1000000 | 99.9999% | +---------------+------------------------------------+--------------------+ .. admonition:: Average quality score of a sequence However, an **average quality score of 30** may be misleading in determining the reliability of a sequence. For example, the average quality score of a 100 base pairs (bp) sequence which has quality score of 37 for 80 bases and quality score of 6 for 20 bases, is **30.8**; but the **expected number of errors for that sequence is 5** (80 x 0.00020 + 20 x 0.25119 = 5.0398; :ref:`see table for the error probabilities `). I.e., the sequence may host errors accountable for 5% variation in that amplicon and thus is not reliable, and should be discarded during quality filtering. Therefore, generally, the quality filtering based on the expected number of errors (sum of the error probabilities) is preferred over the average quality score threshold. `See here for the additional information about expected errors `_. ____________________________________________________ In PipeCraft2, **examine the quality profiles and basic statistics** of the your FASTQ files using the ``QualityCheck module``, which implements `FastQC `_ and `MultiQC `_. |multiQC_main| **TO START:** | 1. ``SELECT FOLDER`` (a working directory) which contains **fastq**/fq files that you aim to inspect. **Files can be gz compressed**. *Note that Windows OS does not display containing files; so double-check you are selecting the correct dir.)* | | 2. Press ``CREATE REPORT`` to start MultiQC. | | 3. **"LOADING ...** will be displayed while the report is being generated. | | 4. Once DONE, then ``VIEW REPORT`` is displayed. Click on ``VIEW REPORT`` and a html file (multiqc_report.html) will open in your default web browser. *If the summary does not open, check your working floder for the presence of* **multiqc_report.html** *in a* **quality_check** directory *and try to open with some other web browser.* |multiQC_1-3| |multiQC_view_report| .. error:: Something went wrong if the file multiqc_report.html does not exist (may fail e.g. when maximum number of fastq files in the folder is extremely large, >10 000). ____________________________________________________ MultiQC report ~~~~~~~~~~~~~~ MultiQC report allows to interactively examine the basic statistics and quality profiles of your input data. **Example plots generated by MultiQC:** |multiQC_stats| |multiQC_plot| More info about `"using MultiQC reports" in MultiQC docs page `_. .. note:: Note that '_fastqc.zip' and '_fastqc.html' are generated for each fastq file in the **'quality_check'** directory. These are summarized in **multiqc_report.html**, so you may **examine or delete** all individual '_fastqc.zip' and '_fastqc.html' files if those are of no interest. .. _remove_low_quality_ends: ____________________________________________________ Remove low-quality ends/starts of reads ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If the quality profile of the sequences is poor at the **starts** or **ends** of reads, then applying sequence truncation/stripping settings in the **quality filtering** step may be helpful. Reads that would otherwise fail ``maxee`` (sum of per-base error probabilities derived from Phred scores) because of a noisy tail can then **pass filtering and be retained** in the dataset. For **paired-end** data, cleaner 3′ and 5′ ends also tend to improve :ref:`read merging `: overlap regions may be easier to align when error-prone tails are trimmed, and base calls in the overlap are more trustworthy for conflict resolution. After all truncation, reads **shorter than** ``minLen`` / ``min length`` are **discarded**, and ``maxEE`` is evaluated on the truncated sequence. In :ref:`vsearch quality filtering `: - ``trunc length`` (vsearch *--fastq_trunclen*) cuts every read to at most specified length (bases counted from the 5′ end). Reads **shorter than the specified length** are **discarded**. Use this when quality or noise rises predictably toward the **3′ end** and you want a **fixed** read length. If the specified length is too small you may reduce overlap for merging paired-end reads. - ``truncqual`` (vsearch *--fastq_truncqual*) scans each read from the **5′ end toward the 3′ end** and **truncates at the first base** whose Phred quality is **less than or equal to** the threshold you set. The kept sequence part is the longest 5′ segment where **every** remaining quality score is **strictly greater** than the threshold. - ``truncee`` (vsearch *--fastq_truncee*) **shortens the read from the 3′ end** until the **total expected errors** (sum of per-base error probabilities from Phred scores over the **kept** read part) is **≤** the value you set. Unlike ``trunc length``, the cut length **differs per read**. It removes noisy tails while keeping as much length as the quality scores allow under specified expected errors threshold. - ``strip_left`` (vsearch *--fastq_stripleft*) removes a **fixed number of bases from the 5′ end** of every read (independent of per-base quality scores). - ``strip_right`` (vsearch *--fastq_stripright*) removes a **fixed number of bases from the 3′ end** of every read (independent of per-base quality scores). .. |trunc_seqs_qFilt| image:: _static/trunc_seqs_qFilt.png :width: 600 |trunc_seqs_qFilt| In :ref:`DADA2 quality filtering ` (*filterAndTrim*): - ``truncQ`` scans each read from the **5′ toward the 3′ end** and **cuts at the first position** whose Phred quality is **less than or equal to** the specified quality score value. - ``truncLen`` — after any ``trimLeft`` / ``trimRight``, **truncate R1** (single-end: the only read) to **exactly** specified length (number of bases) by keeping the **5′ segment**. Reads **shorter than** the specified length after trimming are **removed**. This is a **fixed length** cut (not quality score based); use it when the quality profile shows a clear point to drop the noisy 3′ tail for **R1**. ``maxEE`` is applied **after** this truncation. - ``truncLen_R2`` — **paired-end only.** Same idea as ``truncLen`` but applied to **R2** only. R1 and R2 may need **different** values because R2 quality commonly drops faster than R1. - ``trimLeft`` — remove a **fixed** number of bases from the **5′ end** of each read (**R1** and **R2** in paired-end) **before** ``truncLen`` / ``truncLen_R2`` and **before** ``maxEE``. - ``trimRight`` — remove a **fixed** number of bases from the **3′ end** of each read before length truncation and ``maxEE``. Use when there is a known **uniform** artefact at the read end (e.g. after adapter removal) or a short low-quality tail that is easier to cut with a constant length than with ``truncQ`` / ``truncLen`` alone. .. note :: Truncation/stripping reduces read length, thus the effective amplicon length you keep. That can be desirable for quality, but if trimming is too aggressive you may lose overlap between R1 and R2 (harder or impossible merging) or retain only a truncated fragment of the true amplicon. Balance trimming against insert length and the overlap you need for merging paired-end reads. ____________________________________________________ .. _quality_scores_table: Quality scores table ~~~~~~~~~~~~~~~~~~~~ The table of quality (Phred) scores and corresponding probabilities of base calling errors. The ASCII column denotes the quality score representations in the fastq file *(note that old Illumina fastq files have different Phred score encoding)*. +---------------+-------------------+-------+---------------+-------------------+-------+ | Quality score | Error probability | ASCII | Quality score | Error probability | ASCII | +===============+===================+=======+===============+===================+=======+ | **0** | 1.00000 | ! | **22** | 0.00631 | 7 | +---------------+-------------------+-------+---------------+-------------------+-------+ | **1** | 0.79433 | \" | **23** | 0.00501 | 8 | +---------------+-------------------+-------+---------------+-------------------+-------+ | **2** | 0.63096 | # | **24** | 0.00398 | 9 | +---------------+-------------------+-------+---------------+-------------------+-------+ | **3** | 0.50119 | $ | **25** | 0.00316 | : | +---------------+-------------------+-------+---------------+-------------------+-------+ | **4** | 0.39811 | % | **26** | 0.00251 | ; | +---------------+-------------------+-------+---------------+-------------------+-------+ | **5** | 0.31623 | & | **27** | 0.00200 | \< | +---------------+-------------------+-------+---------------+-------------------+-------+ | **6** | 0.25119 | \' | **28** | 0.00158 | = | +---------------+-------------------+-------+---------------+-------------------+-------+ | **7** | 0.11953 | ( | **29** | 0.00126 | \> | +---------------+-------------------+-------+---------------+-------------------+-------+ | **8** | 0.15849 | ) | **30** | 0.00100 | ? | +---------------+-------------------+-------+---------------+-------------------+-------+ | **9** | 0.12589 | \* | **31** | 0.00079 | @ | +---------------+-------------------+-------+---------------+-------------------+-------+ | **10** | 0.10000 | \+ | **32** | 0.00063 | A | +---------------+-------------------+-------+---------------+-------------------+-------+ | **11** | 0.07943 | , | **33** | 0.00050 | B | +---------------+-------------------+-------+---------------+-------------------+-------+ | **12** | 0.06310 | \- | **34** | 0.00040 | C | +---------------+-------------------+-------+---------------+-------------------+-------+ | **13** | 0.05012 | . | **35** | 0.00032 | D | +---------------+-------------------+-------+---------------+-------------------+-------+ | **14** | 0.03981 | / | **36** | 0.00025 | E | +---------------+-------------------+-------+---------------+-------------------+-------+ | **15** | 0.03162 | 0 | **37** | 0.00020 | F | +---------------+-------------------+-------+---------------+-------------------+-------+ | **16** | 0.02512 | 1 | **38** | 0.00016 | G | +---------------+-------------------+-------+---------------+-------------------+-------+ | **17** | 0.01995 | 2 | **39** | 0.00013 | H | +---------------+-------------------+-------+---------------+-------------------+-------+ | **18** | 0.01585 | 3 | **40** | 0.00010 | I | +---------------+-------------------+-------+---------------+-------------------+-------+ | **19** | 0.01259 | 4 | **41** | 0.00008 | J | +---------------+-------------------+-------+---------------+-------------------+-------+ | **20** | 0.01000 | 5 | | | | +---------------+-------------------+-------+---------------+-------------------+-------+ | **21** | 0.00794 | 6 | | | | +---------------+-------------------+-------+---------------+-------------------+-------+ So, such a sequence with associated encoded quality scores ... .. code-block:: @M02459:45:000000000-ATN9N:1:1101:9884:1029 1:N:0:247 ATGAATCATCGAATCTTTGAACGCA + &8BCCGGGGGAAGGGGGG,CFFGGG | ... would translate into sequence with the following quality scores: .. code-block:: @M02459:45:000000000-ATN9N:1:1101:9884:1029 1:N:0:247 A T G A A T C A T C G A A T C T T T G A A C G C A 5 23 33 34 34 38 38 38 38 38 32 32 38 38 38 38 38 38 11 34 37 37 38 38 38