Walkthrough
Some example analyses pipelines.
Note
When samples of interest are distributed between different sequencing libraries, then first demultiplex (if needed) libraries separately and place samples of interest into separate working directory.
Inspect quality profiles
Examine the quality profiles and basic statistics of the your data set using QualityCheck module. See here.
Paired-end Illumina (or MGI-Tech) data
Example analyses of paired-end data. Starting with raw paired-end fastq files, finishing with ASV/OTU table and taxonomy table.
Demultiplexed paired-end data; ASV workflow with DADA2
Note
This tutorial follows DADA2 Pipeline Tutorial.
Here, we perform example analyses of paired-end data using mothur MiSeq SOP example data set. Download example data set here (35.1 Mb) and unzip it. This data set represents demultiplexed set (per-sample fastq files) of 16S rRNA gene V4 amplicon sequences where sample indexes and primers have already been removed.
If working with multiplexed data, see here.
If you need to reorient reads (based on primer sequences), see here.
If you need to trim the primers/adapters, see here.
Warning
Be sure that all sequences have same orientation (5’-3’ or 3’-5’) in your input data set(s)! If sequences are in mixed orientation
(i.e. some sequences are recorded as 5’-3’ and some as 3’-5’; as usually in the raw data),
then exactly the same ASV may be reported twice, where one is just the reverse complementary ASV: 1) ASV with sequence orientation of 5’-3’; and 2) ASV with sequence orientation of 3’-5’. Reorient sequences based on primer sequene using REORIENT
panel; see here.
Important
When working with your own data, then please check that the paired-end data file names contain “R1” and “R2” strings (to correctly identify the paired-end reads by PipeCraft).
1. Select working directory by pressing the 'SELECT WORKDIR' button.
sequencing data format
as demultiplexed;sequence files extension
as *.fastq;sequencing read types
as paired-end.2. Select 'ASVs workflow' panel (right-ribbon) and check that docker is running (green icon);
Here, working with demultiplexed data, where primers have already been removed; so do not tick
DEMULTIPLEX
,REORIENT
,CUT PRIMERS
(see here to analyse multiplexed data, and here if you need to cut primers/adapters).
3. 'QUALITY FILTERING'
Before adjusting quality filtering settings, let’s have a look on the quality profile of our example data set. Below quality profile plot was generated using QualityCheck
panel (see here).
In this case, all R1 files are represented with green lines, indicating good average quality per file. However, all R2 files are either yellow or red, indicating a drop in quality scores. Lower qualities of R2 reads are characteristic for Illumina sequencing data, and is not too alarming. DADA2 algoritm is robust to lower quality sequences, but removing the low quality read parts will improve the DADA2 sensitivity to rare sequence variants.
Click on
QUALITY FILTERING
to expand the panelspecify identifier strings for
read R1
andread R2
. Here, fastq file names = F3D0_S188_L001_R1_001.fastq, F3D0_S188_L001_R2_001.fastq etc…; _R1 and _R2 are common identifiers for all files.specify
samp ID
(sample identifier). Here _ (underscore), which denotes that sample name is a string before the first _ in the fastq file name.trim reads to specified length to remove low quality ends. Set
truncLen
to 240 for trimming R1 reads andtruncLen R2
to 160 to trim R2 reads. Latter positions represent the approximate positions where sequence quality drps notably (quality profile figure above). Be sure to consider the amplicon length before applyingtruncLen
options, so that R1 and R2 reads would still overlap for theMERGE PAIRS
process.other settings as default.
(click on the image for enlargement)
qualFiltered_out
:4. Here, we use default 'DENOISE' and 'MERGE PAIRS' settings
denoised_assembled.dada2
.5. Default settings for 'CHIMERA FILTERING'
(method = consensus)
chimeraFiltered_out.dada2
:ASVs_out.dada2
:6. 'ASSIGN TAXONOMY'
Click on ‘ASSIGN TAXONOMY’ to expand the panel
press
DOWNLOAD DATABASES
which direct you to the DADA2-formatted reference databases web page.download SILVA (silva_nr99_v138.1_wSpecies_train_set.fa.gz) database for assigning taxonomy to our 16S ASVs. Download link here
specify the location of your downloaded DADA2 database by pressing
SELECT FASTA
since primers were already removed from this data set, we could not reorient all sequences to uniform orientation as based on primers. Therefore, swithc ON
tryRC
to include reverse-complement search.
taxonomy_out.dada2
:6.1. Save the workflow by pressing ``SAVE WORKFLOW`` button on the right-ribbon.
7. Press** 'RUN WORKFLOW' **to start the analyses.
Note
When running the panel for the first time, a docker image will be pulled first to start the process.
When done, 'workflow finished' window will be displayed.
->
Examine the outputs
Several process-specific output folders are generated:
qualFiltered_out
-> quality filtered paired-end fastq files per sampledenoised_assembled.dada2
-> denoised and assembled fasta files per sample (and error rate plots)chimeraFiltered_out.dada2
–> chimera filtered fasta files per sampleASVs_out.dada2
–> ASVs table (ASVs_table.txt), and ASV sequences (ASVs.fasta) filetaxonomy_out.dada2
–> ASVs taxonomy table (taxonomy.txt)Each folder (except ASVs_out.dada2 and taxonomy_out.dada2) contain summary of the sequence counts (seq_count_summary.txt). Examine those to track the read counts throughout the pipeline.
For example, merging the seq_count_summary.txt file in
qualFiltered_out
with the seq_count_summary.txt file fromchimeraFiltered_out.dada2
forms a table for examining sequence counts throughout the pipeline and number of ASVs per sample.
sample |
input |
qualFiltered |
merged |
chimeraFiltered |
no.of ASVs |
---|---|---|---|---|---|
F3D0 |
7793 |
7113 |
6540 |
6528 |
106 |
F3D141 |
5958 |
5463 |
4986 |
4863 |
74 |
F3D142 |
3183 |
2914 |
2595 |
2521 |
48 |
F3D143 |
3178 |
2941 |
2552 |
2518 |
56 |
F3D144 |
4827 |
4312 |
3627 |
3488 |
47 |
F3D145 |
7377 |
6741 |
6079 |
5820 |
72 |
F3D146 |
5021 |
4560 |
3968 |
3879 |
84 |
F3D147 |
17070 |
15637 |
14231 |
13006 |
103 |
F3D148 |
12405 |
11413 |
10529 |
9935 |
97 |
F3D149 |
13083 |
12017 |
11154 |
10653 |
112 |
F3D150 |
5509 |
5032 |
4349 |
4240 |
78 |
F3D1 |
5869 |
5299 |
5028 |
5017 |
100 |
F3D2 |
19620 |
18075 |
17431 |
16835 |
134 |
F3D3 |
6758 |
6250 |
5853 |
5491 |
68 |
F3D5 |
4448 |
4052 |
3716 |
3716 |
86 |
F3D6 |
7989 |
7369 |
6865 |
6679 |
90 |
F3D7 |
5129 |
4765 |
4428 |
4217 |
61 |
F3D8 |
5294 |
4871 |
4576 |
4547 |
99 |
F3D9 |
7070 |
6504 |
6092 |
6015 |
106 |
Mock |
4779 |
4314 |
4269 |
4269 |
20 |
ASVs_out.dada2
directory contains ASVs table (ASVs_table.txt), where the 1st column represents ASV identifiers,
2nd column representative sequences of ASVs,
and all following columns represent samples (number of sequences per ASV in a sample). This is tab delimited text file.
ASVs_table.txt; first 4 samples
ASV |
Sequence |
F3D0 |
F3D141 |
F3D142 |
F3D143 |
---|---|---|---|---|---|
ASV_1 |
TACGGAGGATG… |
579 |
444 |
289 |
228 |
ASV_2 |
TACGGAGGATG… |
345 |
362 |
304 |
176 |
ASV_3 |
TACGGAGGATG… |
449 |
345 |
158 |
204 |
ASV_4 |
TACGGAGGATG… |
430 |
502 |
164 |
231 |
ASV_5 |
TACGGAGGATC… |
154 |
189 |
180 |
130 |
ASV_6 |
TACGGAGGATG… |
470 |
331 |
181 |
244 |
ASV_7 |
TACGGAGGATG… |
282 |
243 |
163 |
152 |
ASV_8 |
TACGGAGGATT… |
184 |
321 |
89 |
83 |
ASV_9 |
TACGGAGGATG… |
45 |
167 |
89 |
109 |
The ASV sequences are representad also in the fasta file (ASVs.fasta) in ASVs_out.dada2
directory.
Result from the taxonomy annotation process - taxonomy table (taxonomy.txt) - is located at the taxonomy_out.dada2
directory.
“NA” denotes that the ASV was not assigned to corresponding taxonomic unit.
Last columns with integers (for ‘Kingdom’ to ‘Species’) represent bootstrap values for the correspoinding taxonomic unit.
taxonomy.txt; first 10 ASVs
ASV |
Sequence |
Kingdom |
Phylum |
Class |
Order |
Family |
Genus |
Species |
Kingdom |
Phylum |
Class |
Order |
Family |
Genus |
Species |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ASV_1 |
TACGGAG… |
Bacteria |
Bacteroidota |
Bacteroidia |
Bacteroidales |
Muribaculaceae |
NA |
NA |
100 |
100 |
100 |
100 |
100 |
100 |
100 |
ASV_2 |
TACGGAG… |
Bacteria |
Bacteroidota |
Bacteroidia |
Bacteroidales |
Muribaculaceae |
NA |
NA |
100 |
100 |
100 |
100 |
100 |
100 |
100 |
ASV_3 |
TACGGAG… |
Bacteria |
Bacteroidota |
Bacteroidia |
Bacteroidales |
Muribaculaceae |
NA |
NA |
100 |
100 |
100 |
100 |
100 |
100 |
100 |
ASV_4 |
TACGGAG… |
Bacteria |
Bacteroidota |
Bacteroidia |
Bacteroidales |
Rikenellaceae |
Alistipes |
NA |
100 |
100 |
100 |
100 |
100 |
100 |
100 |
ASV_5 |
TACGGAG… |
Bacteria |
Bacteroidota |
Bacteroidia |
Bacteroidales |
Muribaculaceae |
NA |
NA |
100 |
100 |
100 |
100 |
100 |
100 |
100 |
ASV_6 |
TACGGAG… |
Bacteria |
Bacteroidota |
Bacteroidia |
Bacteroidales |
Muribaculaceae |
NA |
NA |
100 |
100 |
100 |
100 |
100 |
95 |
95 |
ASV_7 |
TACGTAG… |
Bacteria |
Firmicutes |
Clostridia |
Lachnospirales |
Lachnospiraceae |
Lachnospiraceae NK4A136 group |
NA |
100 |
100 |
100 |
100 |
100 |
100 |
99 |
ASV_8 |
TACGGAG… |
Bacteria |
Bacteroidota |
Bacteroidia |
Bacteroidales |
Muribaculaceae |
NA |
NA |
100 |
100 |
100 |
100 |
100 |
100 |
100 |
ASV_9 |
TACGGAG… |
Bacteria |
Bacteroidota |
Bacteroidia |
Bacteroidales |
Bacteroidaceae |
Bacteroides |
caecimuris |
100 |
100 |
100 |
100 |
100 |
100 |
77 |
ASV_10 |
TACGGAG… |
Bacteria |
Bacteroidota |
Bacteroidia |
Bacteroidales |
Muribaculaceae |
NA |
NA |
100 |
100 |
100 |
100 |
100 |
99 |
99 |
Demultiplexed paired-end data; OTU workflow
Note
Built-in panel for OTU workflow with (mostly) vsearch.
Here, we perform example analyses of paired-end data using mothur MiSeq SOP example data set. Download example data set here (35.1 Mb) and unzip it. This data set represents demultiplexed set (per-sample fastq files) of 16S rRNA gene V4 amplicon sequences where sample indexes and primers have already been removed.
Note
When working with your own data, then consider reorienting reads; see here. Although, in the OTU formation step (clustering), both sequence strands will be compared to generate OTUs, the time for BLAST (taxonomy annotation step) can be reduced when there is no need to search reverse complementary matches.
Important
When working with your own data, then please check that the paired-end data file names contain “R1” and “R2” strings (to correctly identify the paired-end reads by PipeCraft).
1. Select working directory by pressing the 'SELECT WORKDIR' button.
sequencing data format
as demultiplexed;sequence files extension
as *.fastq;sequencing read types
as paired-end.2. Select 'OTU workflow' panel (right-ribbon) and check that docker is running (green icon);
Here, working with demultiplexed data, where primers have already been removed; so do not tick
DEMULTIPLEX
,REORIENT
,CUT PRIMERS
(see here to analyse multiplexed data, and here if you need to cut primers/adapters).
Before proceeding, let’s have a look on the quality profile of our example data set. Below quality profile plot was generated using QualityCheck
panel (see here).
In this case, all R1 files are represented with green lines, indicating good average quality per file. However, all R2 files are either yellow or red, indicating a drop in quality scores. Lower qualities of R2 reads are characteristic for Illumina sequencing data, and is not too alarming. Nevertheless, we need to quality filter the data set.
3. 'MERGE PAIRS'
Here, we use default settings.
Note
If include_only_R1
option = TRUE,
then unassembled R1 reads will be included to the set of assembled reads per sample.
This may be useful when working with e.g. ITS2 sequences, because the ITS2 region in some
taxa is too long for paired-end assembly using current short-read sequencing technology.
Therefore longer ITS2 amplicon sequences are discarded completely after the assembly process.
Thus, including also unassembled R1 reads (include_only_R1
= TRUE), partial ITS2 sequences for
these taxa will be represented in the final output. But when using ITSx
, keep only_full
= FALSE and include partial
= 50.
|
If include only R1 option = TRUE, then other specified options (lenght, max error rate etc.) have not been
applied to R1 reads in the ‘assembled’ file. Thus, additional quality filtering (if this was done before assembling)
should be run on the ‘assembled’ data. But in this built-in OTU workflow, the quality filtering step is anyway performed after merge pairs step.
assembled_out
.4. 'QUALITY FILTERING'
Click on
QUALITY FILTERING
to expand the panelspecify
maxee
(maximum number of expected errors per sequence), here we use 1 (see here what is maxee).specify
maxNs
(maximum number of Ns in the sequences). Here, we will discard any sequence that contains N (ambiguously recorded nucleotide) by setting the value to 0.other settings as default.
qualFiltered_out
5. 'CHIMERA FILTERING'
Click on
CHIMERA FILTERING
to expand the panelspecify
pre cluster
threshold as 0.97 (that is 97%; when planning to use 97% sequence similarity threshold also for clustering reads into OTUs).here, we perform only
denovo
chimera filteringother settings as default.
Note
Tick reference based
if there is appropriate database for reference based chimera filtering
(such as e.g. UNITE for ITS reads).
chimeraFiltered_out
6. Consideration when working with ITS data
Identify and extract the ITS regions using ITSx; see here
Note
because ITSx outputs multiple directories for different ITS sub-regions
CLUSTERING
and ASSIGN TAXONOMY
will be disabled after ‘ITS EXTRACTOR’.
Select appropriate ITSx output folder for CLUSTERING after the process is finished
[‘ADD STEP’ -> CLUSTERING
(vsearch)].
7. 'CLUSTERING'
Here, we use default settings by clustering the reads using 97% similarity threshold
clustering_out
8. 'ASSIGN TAXONOMY'
Tick
ASSIGN TAXONOMY
to perform taxonomy assignment with BLASTdownload SILVA 99% database here (SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz)
unzip the downloaded database and place this into separete folder (to automatically make blast database from that fasta file)
specify the location of your downloaded SILVA database by pressing
SELECT FILE
under ‘database file’ optionsince primers were already removed from this data set, we could not reorient all sequences to uniform orientation as based on primers. Therefore, keep ON the
strands
= both to include reverse-complement search.
taxonomy_out
8.1. Save the workflow by pressing ``SAVE WORKFLOW`` button on the right-ribbon.
1. Press** 'RUN WORKFLOW' **to start the analyses.
Note
When running the panel for the first time, a docker image will be pulled first to start the process.
When done, 'workflow finished' window will be displayed.
->
Examine the outputs
Several process-specific output folders are generated:
assembled_out
–> assembled fastq files per samplequalFiltered_out
–> quality filtered fastq files per samplechimeraFiltered_out
–> chimera filtered fasta files per sampleclustering_out
–> OTU table (OTU_table.txt), and OTU sequences (OTUs.fasta) filetaxonomy_out
–> BLAST hits for the OTUs (BLAST_1st_best_hit.txt and BLAST_10_best_hits.txt)Each folder (except clustering_out and taxonomy_out) contain summary of the sequence counts (seq_count_summary.txt). Examine those to track the read counts throughout the pipeline (example here)
clustering_out
directory contains OTUs table (OTUs_table.txt), where the 1st column represents OTU identifiers,
and all following columns represent samples (number of sequences per OTU in a sample).
The OTU sequences are representad in the fasta file (OTUs.fasta) in clustering_out
directory.
OTUs_table.txt; first 4 samples
OTU_id |
F3D0_S188_L001 |
F3D1_S189_L001 |
F3D2_S190_L001 |
F3D3_S191_L001 |
---|---|---|---|---|
00fc1569196587dde0462c7d806cc05774f61bfa |
106 |
271 |
584 |
20 |
02d84ed0175c2c79e8379a99cffb6dbc7f6a6bd9 |
81 |
44 |
88 |
14 |
0407ee3bd15ca7206a75d02bb41732516adaaa88 |
3 |
4 |
3 |
0 |
042e5f0b5e38dff09f7ad58b6849fb17ec5503b9 |
20 |
83 |
131 |
4 |
07411b848fcda497fd29944d351b8a2ec7dc2bd4 |
1 |
0 |
2 |
0 |
07e7806a732c67ef090b6b279b74a87fefad9e8e |
18 |
22 |
83 |
7 |
0836d270877aed22cd247f7e703b9247fb339127 |
1 |
1 |
0 |
0 |
0aa6e7da5819c11973f186cb35b3f4f58275fb04 |
1 |
4 |
5 |
0 |
0c1c219a4756bb729e5f0ceb7d82d932bbfa0c5e |
18 |
17 |
40 |
7 |
Results from the taxonomy annotation process (BLAST) are located at the taxonomy_out
directory (BLAST_1st_best_hit.txt and BLAST_10_best_hits.txt).
Blast values are separated by +
and tab
[be sure to specify the delimiter when aligning columns in e.g. LibreOffice or Excel].
“NO_BLAST_HIT” denotes that the OTU sequence did not get any match againt the selected database.
blast values |
|
---|---|
score |
blast score |
e-value |
blast e-value |
query len |
query (i.e. OTU/ASV) sequence length |
query start |
start position of match in the query seq |
query end |
end position of match in the query seq |
target len |
target seq length in the database |
target start |
start position of match in the target seq |
target end |
end position of match in the target seq |
align len |
alignment length of query and target |
identities |
number of identical matches |
gaps |
number of gaps in the alignment |
coverage |
query coverage percentage against the target sequence
(100 percent is full-length match;
low coverage may indicate presence of chimeric sequence/OTU/ASV)
|
id |
identity percentage against the target sequence |
Multiplexed library
Working with paired-end raw multiplexed data.
1. Select working directory by pressing the 'SELECT WORKDIR' button.
sequencing data format
as multiplexed;sequence files extension
as may be fastq or fasta formatted files;sequencing read types
as paired-end.2. 'DEMULTIPLEX'
2.1 Press ADD STEP
-> DEMULTIPLEX
or
2.2. Select ASVs workflow
or OTUs workflow
panel
tick
DEMULTIPLEX
,REORIENT
andCUT PRIMERS
;check that the docker is running (green icon [red = not running])
(click on the image for enlargement)
3. Click on 'DEMULTIPLEX' to expand the panel
select your FASTA formatted index_file.fasta (general index file guide here)
adjust
overlap
setting to fully match the length (in base pairs) of the indexes in the index_file.fasta.
(click on the image for enlargement)
demultiplex_out
:1. 'REORIENT'
specify allowed
mismatches
during the primer search; >2 not recommended.specify
forward primer
: 5’-GTGYCAGCMGCCGCGGTAA-3’ (example)specify
reverse primer
: 3’-GGCCGYCAATTYMTTTRAGTTT-5’ (example)
(click on the image for enlargement)
reorient_out
5. Click on 'CUT PRIMERS' to expand the panel
specify
forward primer
: 5’-GTGYCAGCMGCCGCGGTAA-3’ (example)specify
reverse primer
: 3’-GGCCGYCAATTYMTTTRAGTTT-5’ (example)specify allowed
mismatches
during the primer search; >2 not recommendedfor paired-end reads keep
seqs to keep
andpair filter
as default (keep_all and both, respectively)
(click on the image for enlargement)
primersCut_out
6. Follow the rest of the ASV workflow or OTU workflow
Single-end (PacBio or assembled paired-end) data
coming soon …