This is the documentation for the Nextflow 2018 Hackathon for beginners. The project covers the implementation of a variant calling analysis pipeline for RNA-seq data, based on the GATK best practices and using Nextflow as the pipeline framework.
It is intended as a tutorial project for new Nextflow users who want to learn how to assemble a real-world genomic pipeline with Nextflow, manage dependencies with container technologies (Docker & Singularity) and deploy the pipeline execution to the AWS cloud in a reproducible manner.
1. Data Description
The input data used to test the pipeline implementation is described below. For the purpose of this project, only a subset of the original data is used for most of the data types.
- Genome assembly (genome.fa)
The human genome assembly hg19 (GRCh37) from GenBank, chromosome 22 only.
- RNA-seq reads (ENCSR000COQ[12]_[12].fastq.gz)
The RNA-seq data comes from the human GM12878 cell line, from whole-cell, cytosolic and nuclear extractions (see table below). The libraries are stranded PE76 Illumina GAIIx RNA-Seq from rRNA-depleted Poly-A+ long RNA (> 200 nucleotides in size). Only reads mapped to the 22q11 locus of the human genome (chr22:16000000-18000000) are used.

ENCODE ID     Cellular fraction   Replicate ID   File names
ENCSR000COQ   Whole cell          1              ENCSR000COQ1_1.fastq.gz ENCSR000COQ1_2.fastq.gz
                                  2              ENCSR000COQ2_1.fastq.gz ENCSR000COQ2_2.fastq.gz
ENCSR000CPO   Nuclear             1              ENCSR000CPO1_1.fastq.gz ENCSR000CPO1_2.fastq.gz
                                  2              ENCSR000CPO2_1.fastq.gz ENCSR000CPO2_2.fastq.gz
ENCSR000COR   Cytosolic           1              ENCSR000COR1_1.fastq.gz ENCSR000COR1_2.fastq.gz
                                  2              ENCSR000COR2_1.fastq.gz ENCSR000COR2_2.fastq.gz
- "Known" variants
-
known_variants.vcf.gz
Known variants come from high confident variant calls for GM12878 from the Illumina Platinum Genomes project. These variant calls were obtained by taking into account pedigree information and the concordance of calls across different methods.
We’re using the subset from chromosome 22 only.
- Blacklisted regions (blacklist.bed)
Blacklisted regions are regions of the genome with anomalous coverage. We use regions for the hg19 assembly, taken from the ENCODE project portal. These regions were identified with DNase and ChIP-seq samples over ~60 human tissues/cell types, and had a very high ratio of multi-mapping to unique-mapping reads and high variance in mappability.

Workflow Description
The aim of the pipeline is to process raw RNA-seq data (in FASTQ format) and obtain the list of small variants, SNVs (SNPs and indels), for downstream analysis. The pipeline is based on the GATK best practices for variant calling with RNA-seq data and includes all major steps. In addition, the pipeline includes SNV post-processing and quantification for allele-specific expression.
Sample processing is done independently for each replicate. This includes mapping the reads, splitting reads at Ns in the CIGAR string, reassigning mapping qualities and recalibrating base qualities.
Variant calling is done simultaneously on the bam files from all replicates of a sample. This improves the coverage of genomic regions and yields more reliable results.
1.1. Software manuals
Documentation for all the software used in the workflow (samtools, Picard, STAR, GATK and vcftools) can be found on the respective project websites.
1.2. Pipeline steps
To give a general idea of the workflow, all of its steps, together with the corresponding commands, are described in the next sections.
1.2.1. Preparing data
This step prepares input files for the analysis. Genome indexes are created and variants overlapping blacklisted regions are filtered out.
Genome indices with samtools and picard are produced first. They will be needed for GATK commands such as Split'N'Trim:

samtools faidx genome.fa

java -jar picard.jar CreateSequenceDictionary R= genome.fa O= genome.dict
The genome index for STAR, needed for mapping the RNA-seq reads, is created next. Index files are written to the folder genome_dir:

STAR --runMode genomeGenerate \
     --genomeDir genome_dir \
     --genomeFastaFiles genome.fa \
     --runThreadN 4
Variants overlapping blacklisted regions are then filtered out in order to reduce false-positive calls (optional):

vcftools --gzvcf known_variants.vcf.gz -c \
         --exclude-bed blacklist.bed \
         --recode | bgzip -c \
         > known_variants.filtered.recode.vcf.gz
1.2.2. Mapping RNA-seq reads to the reference
To align the RNA-seq reads to the genome we use the STAR 2-pass approach. The first pass creates a table of splice junctions that is used to guide the final alignments; both passes use the same alignment settings.
Read group, library and sample information is added to the final bam file at the second mapping step. As a result we do not need to run the Picard processing step from the GATK best practices.
STAR 1-pass:
STAR --genomeDir genome_dir \
     --readFilesIn ENCSR000COQ1_1.fastq.gz ENCSR000COQ1_2.fastq.gz \
     --runThreadN 4 \
     --readFilesCommand zcat \
     --outFilterType BySJout \
     --alignSJoverhangMin 8 \
     --alignSJDBoverhangMin 1 \
     --outFilterMismatchNmax 999
Create a new genome index using the splice-junction table:
STAR --runMode genomeGenerate \
     --genomeDir genome_dir \
     --genomeFastaFiles genome.fa \
     --sjdbFileChrStartEnd SJ.out.tab \
     --sjdbOverhang 75 \
     --runThreadN 4
STAR 2-pass, final alignments:
STAR --genomeDir genome_dir \
     --readFilesIn ENCSR000COQ1_1.fastq.gz ENCSR000COQ1_2.fastq.gz \
     --runThreadN 4 \
     --readFilesCommand zcat \
     --outFilterType BySJout \
     --alignSJoverhangMin 8 \
     --alignSJDBoverhangMin 1 \
     --outFilterMismatchNmax 999 \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattrRGline ID:ENCSR000COQ1 LB:library PL:illumina PU:machine SM:GM12878
Index the resulting bam file:
samtools index final_alignments.bam
1.2.3. Split’N’Trim and reassign mapping qualities
The RNA-seq reads overlapping exon-intron junctions can produce false-positive variants due to inaccurate splicing. To solve this problem the GATK team recommends hard-clipping any sequence that overlaps intronic regions, and developed a special tool for this purpose: SplitNCigarReads. The tool identifies Ns in the CIGAR string of an alignment and splits the read at these positions, so that a few new reads are created.
At this step we also reassign mapping qualities to the alignments. This is important because STAR assigns the value 255 (high quality) to "unknown" mappings, which is meaningless to GATK and to variant calling in general.
This step is done with recommended parameters from the GATK best practices.
java -jar GenomeAnalysisTK.jar -T SplitNCigarReads \
     -R genome.fa -I final_alignments.bam \
     -o split.bam \
     -rf ReassignOneMappingQuality \
     -RMQF 255 -RMQT 60 \
     -U ALLOW_N_CIGAR_READS \
     --fix_misencoded_quality_scores
1.2.4. Base Recalibration
The proposed workflow does not include an indel realignment step, which is optional in the GATK best practices. We excluded it since it is quite time-intensive and does not substantially improve the variant calls.
We instead include a base recalibration step. This step removes possible systematic errors introduced by the sequencing machine when assigning read qualities. To do this, the list of known variants is used as a training set for a machine-learning algorithm that models the possible errors; base quality scores are then adjusted based on the results.
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
     --default_platform illumina \
     -cov ReadGroupCovariate \
     -cov QualityScoreCovariate \
     -cov CycleCovariate \
     -knownSites known_variants.filtered.recode.vcf.gz \
     -cov ContextCovariate \
     -R genome.fa -I split.bam \
     --downsampling_type NONE \
     -nct 4 \
     -o final.rnaseq.grp

java -jar GenomeAnalysisTK.jar -T PrintReads \
     -R genome.fa -I split.bam \
     -BQSR final.rnaseq.grp \
     -nct 4 \
     -o final.bam
1.2.5. Variant Calling and Variant filtering
The variant calling is done only on uniquely aligned reads, in order to reduce the number of false-positive calls:
(samtools view -H final.bam; samtools view final.bam | grep -w 'NH:i:1') \
    | samtools view -Sb - > final.uniq.bam

samtools index final.uniq.bam
For variant calling we use the GATK tool HaplotypeCaller with the settings recommended for RNA-seq data:

ls final.uniq.bam > bam.list

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
     -R genome.fa -I bam.list \
     -dontUseSoftClippedBases \
     -stand_call_conf 20.0 \
     -o output.gatk.vcf.gz
Variant filtering is done as recommended in the GATK best practices:
- flag clusters of at least 3 SNPs within a window of 35 bases
- flag variants with strand bias, estimated with Fisher's Exact Test: FS > 30.0 (Phred-scaled p-value)
- flag variants with a low variant call confidence score: QualByDepth (QD) < 2.0. The QD is the QUAL score normalized by the allele depth (AD) for a variant.

java -jar GenomeAnalysisTK.jar -T VariantFiltration \
     -R genome.fa -V output.gatk.vcf.gz \
     -window 35 -cluster 3 \
     -filterName FS -filter "FS > 30.0" \
     -filterName QD -filter "QD < 2.0" \
     -o final.vcf
1.2.6. Variant Post-processing
For the downstream analysis we consider only sites that pass all filters and are covered by at least 8 reads:
grep -v '#' final.vcf | awk '$7~/PASS/' \
    | perl -ne 'chomp($_); ($dp)=$_=~/DP\=(\d+)\;/; if($dp>=8){print $_."\n"};' > result.DP8.vcf
Filtered RNA-seq variants are compared with those obtained from DNA sequencing (from the Illumina Platinum Genomes project). Variants that are common to the two datasets are considered "known" SNVs, while those present only in the RNA-seq data are considered "novel".
Known SNVs will be used for the allele-specific expression analysis. Novel variants will be used to detect RNA-editing events.
We compare the two variant files to detect common and different sites:

vcftools --vcf result.DP8.vcf --gzdiff known_variants.filtered.recode.vcf.gz --diff-site --out commonSNPs
Here we select sites present in both files ("known" SNVs only):
awk 'BEGIN{OFS="\t"} $4~/B/{print $1,$2,$3}' commonSNPs.diff.sites_in_files > test.bed

vcftools --vcf final.vcf --bed test.bed --recode --keep-INFO-all --stdout > known_snps.vcf
Plot a histogram with allele frequency distribution for "known" SNVs:
grep -v '#' known_snps.vcf | awk -F '\t' '{print $10}' \
    | awk -F ':' '{print $2}' \
    | perl -ne 'chomp($_); @v=split(/\,/,$_); if($v[0]!=0 || $v[1]!=0){ print $v[1]/($v[1]+$v[0])."\n"; }' \
    | awk '$1!=1' > AF.4R

gghist.R -i AF.4R -o AF.histogram.pdf
Calculate per-allele read counts at each "known" SNV, for the allele-specific expression analysis:
java -jar GenomeAnalysisTK.jar -R genome.fa \
     -T ASEReadCounter \
     -o ASE.tsv \
     -I bam.list \
     -sites known_snps.vcf
2. Environment Setup
2.1. Pipeline data
All the files needed for the hands-on activity are stored in this GitHub repository.
Check whether you already have the following directory:
tree $HOME/nf-hack18/hands-on
$HOME/nf-hack18/hands-on
├── bin
│   └── gghist.R
├── data
│   ├── blacklist.bed
│   ├── genome.fa
│   ├── known_variants.vcf.gz
│   └── reads
│       ├── ENCSR000COQ1_1.fastq.gz
│       ├── ENCSR000COQ1_2.fastq.gz
│       ├── ENCSR000COQ2_1.fastq.gz
│       ├── ENCSR000COQ2_2.fastq.gz
│       ├── ENCSR000COR1_1.fastq.gz
│       ├── ENCSR000COR1_2.fastq.gz
│       ├── ENCSR000COR2_1.fastq.gz
│       ├── ENCSR000COR2_2.fastq.gz
│       ├── ENCSR000CPO1_1.fastq.gz
│       ├── ENCSR000CPO1_2.fastq.gz
│       ├── ENCSR000CPO2_1.fastq.gz
│       └── ENCSR000CPO2_2.fastq.gz
├── nextflow.config
└── README.md

3 directories, 19 files
If you don’t have the above directory, clone it with the following command:
cd $HOME
git clone https://github.com/nextflow-io/nf-hack18/
Cloning into 'nf-hack18'...
remote: Counting objects: 71, done.
remote: Total 71 (delta 0), reused 0 (delta 0), pack-reused 71
Unpacking objects: 100% (71/71), done.
Checking connectivity... done.
2.2. Pulling the Docker image
Nextflow can pull Docker images at runtime, but let’s just download it manually to see how Docker works:
docker pull cbcrg/callings-nf@sha256:b65a7d721b9dd2da07d6bdd7f868b04039860f14fa514add975c59e68614c310
You should see the progress of the download:
sha256:b65a7d721b9dd2da07d6bdd7f868b04039860f14fa514add975c59e68614c310: Pulling from cbcrg/callings-nf
915665fee719: Downloading [=============================================>     ]  47.08 MB/51.36 MB
f332de2321e6: Downloading [===========>                                       ]  41.96 MB/187.8 MB
1577a6dd9e43: Downloading [===============================>                   ]  46.72 MB/73.45 MB
7059d9bb5245: Waiting
71863f70269f: Waiting
ce2a2879246d: Waiting
e38ba5d5f9fb: Waiting
90158da87bb2: Waiting
and the following message when the pull is completed:
Digest: sha256:b65a7d721b9dd2da07d6bdd7f868b04039860f14fa514add975c59e68614c310
Status: Downloaded newer image for cbcrg/callings-nf@sha256:b65a7d721b9dd2da07d6bdd7f868b04039860f14fa514add975c59e68614c310
3. Pipeline Implementation
3.1. Data preparation
A first step in any pipeline is to prepare the input data. You will find all the data required to run the pipeline in the folder data within the $HOME/nf-hack18/hands-on repository directory.
There are four data inputs that we will use in this tutorial:

- Genome File (data/genome.fa)
Human chromosome 22 in FASTA file format.

- Read Files (data/reads/)
Sample ENCSR000COQ1: 76bp paired-end reads (ENCSR000COQ1_1.fastq.gz and ENCSR000COQ1_2.fastq.gz).

- Variants File (data/known_variants.vcf.gz)
Known variants, gzipped, in Variant Call Format (VCF).

- Blacklist File (data/blacklist.bed)
Genomic locations which are known to produce artifacts and spurious variants, in Browser Extensible Data (BED) format.
3.2. Input parameters
We can begin writing the pipeline by creating and editing a text file called main.nf in the $HOME/nf-hack18/hands-on repository directory with your favourite text editor. In this example we use nano:
cd $HOME/nf-hack18/hands-on
nano main.nf
Edit this file to specify the input files as script parameters. Using this notation allows you to override them by specifying different values when launching the pipeline execution.
/*
 * Define the default parameters (1)
 */
params.genome = "data/genome.fa"
params.variants = "data/known_variants.vcf.gz"
params.blacklist = "data/blacklist.bed"
params.reads = "data/reads/ENCSR000COQ1_{1,2}.fastq.gz" (2)
params.results = "results" (3)
params.gatk = "/opt/broad/GenomeAnalysisTK.jar" (4)
You can copy the above text using the Cmd+C keys, then move to the terminal window, open nano and paste it using the Cmd+V keyboard shortcut.

(1) The /*, * and */ specify comment lines, which are ignored by Nextflow.
(2) The reads parameter uses a glob pattern to specify that the forward (ENCSR000COQ1_1.fastq.gz) and reverse (ENCSR000COQ1_2.fastq.gz) reads are pairs of the same sample.
(3) The results parameter is used to specify a directory called results.
(4) The gatk parameter specifies the location of the GATK jar file.
Once you have the default parameters in the main.nf file, you can save and run the main script for the first time.
With nano you can save and close the file with Ctrl+O, then Enter, followed by Ctrl+X.
To run the main script use the following command:
nextflow run main.nf
You should see the script execute, print the Nextflow version and pipeline revision, and then exit.

N E X T F L O W ~ version 18.10.1
Launching `main.nf` [romantic_snyder] - revision: 83a0a5415a
3.2.1. Problem #1
Great, now we need to transform the parameters into proper file handlers and channel variables. To do that, open the main.nf file and copy the lines below, pasting them at the end of the file.
In nano you can move to the end of the file using Ctrl+W and then Ctrl+V.
This time you must fill the BLANK spaces with the correct function and parameter.
/*
 * Parse the input parameters
 */
genome_file = file(params.genome)
variants_file = BLANK
blacklist_file = BLANK
reads_ch = BLANK
GATK = params.gatk
The first three should specify file objects containing a single file, as shown in this basic example. For the reads channel, use the fromFilePairs channel factory method. The final one, GATK, simply creates a Nextflow variable specifying the path of the GATK application file and does not need any special function.
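If you get stuck, one possible way to fill in the blanks is shown below. It is a sketch rather than the only valid answer: file() wraps a path in a file object, and fromFilePairs emits tuples made of a pair identifier and the two read files.

/*
 * One possible solution for Problem #1
 */
variants_file = file(params.variants)
blacklist_file = file(params.blacklist)
reads_ch = Channel.fromFilePairs(params.reads)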
Once you think you have the data organised, you can run the pipeline again, this time using the -resume flag.
nextflow run main.nf -resume
See here for more details about using the resume option.
3.3. Process 1A
Create a FASTA genome index
Now that we have our inputs set up, we can move on to the processes. In our first process we will create a genome index using samtools.
You should implement a process having the following structure:
- Name: 1A_prepare_genome_samtools
- Command: create a genome index for the genome fasta with samtools
- Input: the genome fasta file
- Output: the samtools genome index file
3.3.1. Problem #2
Copy the code below and paste it at the end of main.nf.
Your aim is to replace the BLANK placeholder with the correct variable name of the genome file that you defined in the previous problem.
/*
 * Process 1A: Create a FASTA genome index with samtools
 */
process '1A_prepare_genome_samtools' { (1)
  input:
      file genome from BLANK (2)
  output:
      file "${genome}.fai" into genome_index_ch (3)
  script:
  """
  samtools faidx ${genome} (4)
  """
}
In plain English, the process could be written as:

(1) a process called 1A_prepare_genome_samtools
(2) takes as input the genome file from BLANK
(3) and creates as output a genome index file, which goes into the channel genome_index_ch
(4) script: using samtools, create the genome index from the genome file
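For reference, the BLANK here would be the genome_file object created in Problem #1, giving:

input:
    file genome from genome_file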
Now when we run the pipeline, we see that process 1A is submitted:
nextflow run main.nf -resume
N E X T F L O W ~ version 0.24.1
Launching `main.nf` [adoring_wilson] - revision: 89dbc97b8e
[warm up] executor > local
[17/b0eae4] Submitted process > 1A_prepare_genome_samtools
3.4. Process 1B
Create a FASTA genome sequence dictionary with Picard for GATK
Our first process created the genome index for GATK using samtools. For the next process we must do something very similar, this time creating a genome sequence dictionary using Picard.
You should implement a process having the following structure:
- Name: 1B_prepare_genome_picard
- Command: create a genome dictionary for the genome fasta with Picard tools
- Input: the genome fasta file
- Output: the genome dictionary file
3.4.1. Problem #3
Fill in the BLANK words for both the input and output sections.
Copy the code below and paste it at the end of main.nf.
Your aim is to insert the correct qualifiers and channel names into the input and output steps (written as BLANK) of the process, and then run the pipeline.
You can choose any channel output name that makes sense to you.
/*
 * Process 1B: Create a FASTA genome sequence dictionary with Picard for GATK
 */
process '1B_prepare_genome_picard' {
  input:
      file genome BLANK BLANK
  output:
      file "${genome.baseName}.dict" BLANK BLANK
  script:
  """
  PICARD=`which picard.jar`
  java -jar \$PICARD CreateSequenceDictionary R= $genome O= ${genome.baseName}.dict
  """
}
.baseName returns the filename without the file suffix. If "${genome}" is human.fa, then "${genome.baseName}.dict" would be human.dict.
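If you get stuck, one possible solution is shown below; the output channel name genome_dict_ch is an illustrative choice, since you may pick any sensible name:

input:
    file genome from genome_file
output:
    file "${genome.baseName}.dict" into genome_dict_ch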
3.5. Process 1C
Create STAR genome index file
Next we must create a genome index for the STAR mapping software.
You should implement a process having the following structure:
- Name: 1C_prepare_star_genome_index
- Command: create a STAR genome index for the genome fasta
- Input: the genome fasta file
- Output: a directory containing the STAR genome index
3.5.1. Problem #4
This is a similar exercise as problem 3, except this time both input
and output
lines have been left BLANK
and must be completed.
/*
 * Process 1C: Create the genome index file for STAR
 */
process '1C_prepare_star_genome_index' {
  input:
      BLANK_LINE
  output:
      BLANK_LINE
  script:
  """
  mkdir genome_dir

  STAR --runMode genomeGenerate \
       --genomeDir genome_dir \
       --genomeFastaFiles ${genome} \
       --runThreadN ${task.cpus}
  """
}
The output of the STAR genomeGenerate command is specified here as genome_dir.
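One possible solution, with genome_dir_ch as an illustrative output channel name; the output captures the genome_dir directory created by the script:

input:
    file genome from genome_file
output:
    file "genome_dir" into genome_dir_ch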
3.6. Process 1D
Filtered and recoded set of variants
Next, on to something a little more tricky. The next process takes two inputs: the variants file and the blacklist file.
It should output a channel named prepared_vcf_ch which emits a tuple of two files.
In Nextflow, tuples can be defined in the input or output declarations using the set qualifier.
You should implement a process having the following structure:
- Name: 1D_prepare_vcf_file
- Command: create a filtered and recoded set of variants
- Input:
    - the variants file
    - the blacklisted regions file
- Output: a set containing the filtered/recoded VCF file and the tab index (TBI) file
3.6.1. Problem #5
You must fill in the two BLANK_LINE lines in the input and the two BLANK output files.
/*
 * Process 1D: Create a file containing the filtered and recoded set of variants
 */
process '1D_prepare_vcf_file' {
  input:
      BLANK_LINE
      BLANK_LINE
  output:
      set BLANK, BLANK into prepared_vcf_ch
  script:
  """
  vcftools --gzvcf $variantsFile -c \ (1)
           --exclude-bed ${blacklisted} \ (2)
           --recode | bgzip -c \
           > ${variantsFile.baseName}.filtered.recode.vcf.gz (3)

  tabix ${variantsFile.baseName}.filtered.recode.vcf.gz (4)
  """
}
(1) the input variable for the variants file
(2) the input variable for the blacklist file
(3) the first of the two output files
(4) generates the second output file, named "${variantsFile.baseName}.filtered.recode.vcf.gz.tbi"
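One possible solution: the input names match the variables used in the script ($variantsFile and $blacklisted), and the output set lists the two files the script produces:

input:
    file variantsFile from variants_file
    file blacklisted from blacklist_file
output:
    set file("${variantsFile.baseName}.filtered.recode.vcf.gz"), file("${variantsFile.baseName}.filtered.recode.vcf.gz.tbi") into prepared_vcf_ch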
Try running the pipeline from the project directory with:
nextflow run main.nf -resume
Congratulations! Part 1 is now complete. We have all the data prepared and loaded into channels, ready for the more serious steps.
3.7. Process 2
STAR Mapping
In this process, for each sample, we align the reads to our genome using the STAR index we created previously.
You should implement a process having the following structure:
- Name: 2_rnaseq_mapping_star
- Command: mapping of the RNA-Seq reads using STAR
- Input:
    - the genome fasta file
    - the STAR genome index
    - a set containing the replicate id and the paired read files
- Output: a set containing the replicate id, the aligned bam file and the aligned bam file index
3.7.1. Problem #6
Copy the code below and paste it at the end of main.nf.
You must fill in the three BLANK_LINE lines in the input and the one BLANK_LINE line in the output.
/*
 * Process 2: Align RNA-Seq reads to the genome with STAR
 */
process '2_rnaseq_mapping_star' {
  input:
      BLANK_LINE
      BLANK_LINE
      BLANK_LINE
  output:
      BLANK_LINE
  script:
  """
  # ngs-nf-dev Align reads to genome
  STAR --genomeDir $genomeDir \
       --readFilesIn $reads \
       --runThreadN ${task.cpus} \
       --readFilesCommand zcat \
       --outFilterType BySJout \
       --alignSJoverhangMin 8 \
       --alignSJDBoverhangMin 1 \
       --outFilterMismatchNmax 999

  # 2nd pass (improve alignments using table of splice junctions and create a new index)
  mkdir genomeDir
  STAR --runMode genomeGenerate \
       --genomeDir genomeDir \
       --genomeFastaFiles $genome \
       --sjdbFileChrStartEnd SJ.out.tab \
       --sjdbOverhang 75 \
       --runThreadN ${task.cpus}

  # Final read alignments
  STAR --genomeDir genomeDir \
       --readFilesIn $reads \
       --runThreadN ${task.cpus} \
       --readFilesCommand zcat \
       --outFilterType BySJout \
       --alignSJoverhangMin 8 \
       --alignSJDBoverhangMin 1 \
       --outFilterMismatchNmax 999 \
       --outSAMtype BAM SortedByCoordinate \
       --outSAMattrRGline ID:$replicateId LB:library PL:illumina PU:machine SM:GM12878

  # Index the BAM file
  samtools index Aligned.sortedByCoord.out.bam
  """
}
The final command produces a bam index, which is the full filename with an additional .bai suffix.
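One possible solution, assuming the channel names from the earlier sketches (genome_dir_ch from process 1C); aligned_bam_ch is an illustrative name for the output channel, and the output files are those written by STAR and samtools in the script:

input:
    file genome from genome_file
    file genomeDir from genome_dir_ch
    set replicateId, file(reads) from reads_ch
output:
    set replicateId, file('Aligned.sortedByCoord.out.bam'), file('Aligned.sortedByCoord.out.bam.bai') into aligned_bam_ch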
The next step is a filtering step using GATK. For each sample, we split all the reads that contain N characters in their CIGAR string.
3.8. Process 3
GATK Split on N
The process creates k+1 new reads (where k is the number of N cigar elements) that correspond to the segments of the original read beside/between the splicing events represented by the Ns in the original CIGAR.
You should implement a process having the following structure:
- Name: 3_rnaseq_gatk_splitNcigar
- Command: split reads on Ns in the CIGAR string using GATK
- Input:
    - the genome fasta file
    - the genome index made with samtools
    - the genome dictionary made with picard
    - a set containing the replicate id, the aligned bam file and the aligned bam file index from the STAR mapping
- Output: a set containing the replicate id, the split bam file and the split bam index file
3.8.1. Problem #7
Copy the code below and paste it at the end of main.nf.
You must fill in the four BLANK_LINE lines in the input and the one BLANK_LINE line in the output.
There is an optional tag line added to the start of this process. The tag line allows you to assign a name to a specific task (a single execution of a process). This is particularly useful when there are many samples/replicates passing through the same process.
process '3_rnaseq_gatk_splitNcigar' {
  tag OPTIONAL_BLANK
  input:
      BLANK_LINE
      BLANK_LINE
      BLANK_LINE
      BLANK_LINE
  output:
      BLANK_LINE
  script:
  """
  # SplitNCigarReads and reassign mapping qualities
  java -jar $GATK -T SplitNCigarReads \
       -R $genome -I $bam \
       -o split.bam \
       -rf ReassignOneMappingQuality \
       -RMQF 255 -RMQT 60 \
       -U ALLOW_N_CIGAR_READS \
       --fix_misencoded_quality_scores
  """
}
The GATK command above automatically creates a bam index (.bai) of the split.bam output file.
A tag line would also be useful in Process 2.
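One possible solution, assuming the channel names used in the earlier sketches (genome_index_ch, genome_dict_ch, aligned_bam_ch); splitted_bam_ch is an illustrative name, and split.bai is the index that GATK creates automatically:

tag "$replicateId"
input:
    file genome from genome_file
    file index from genome_index_ch
    file genome_dict from genome_dict_ch
    set replicateId, file(bam), file(bai) from aligned_bam_ch
output:
    set replicateId, file('split.bam'), file('split.bai') into splitted_bam_ch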
Next we perform a Base Quality Score Recalibration step using GATK.
3.9. Process 4
GATK Recalibrate
This step uses GATK to detect systematic errors in the base quality scores, select unique alignments and then index the resulting bam file with samtools. You can find details of the specific GATK BaseRecalibrator parameters here.
You should implement a process having the following structure:
- Name: 4_rnaseq_gatk_recalibrate
- Command: recalibrate reads from each replicate using GATK
- Input:
    - the genome fasta file
    - the genome index made with samtools
    - the genome dictionary made with picard
    - a set containing the replicate id, the aligned bam file and the aligned bam file index from process 3
    - a set containing the filtered/recoded VCF file and the tab index (TBI) file from process 1D
- Output: a set containing the sample id, the unique bam file and the unique bam index file
3.9.1. Problem #8
Copy the code below and paste it at the end of main.nf.
You must fill in the five BLANK_LINE lines in the input and the one BLANK_LINE line in the output.
process '4_rnaseq_gatk_recalibrate' {
  tag "$replicateId"
  input:
      BLANK_LINE
      BLANK_LINE
      BLANK_LINE
      BLANK_LINE
      BLANK_LINE
  output:
      BLANK into (final_output_ch, bam_for_ASE_ch) (1)
  script:
  sampleId = replicateId.replaceAll(/[12]$/,'')
  """
  # Indel Realignment and Base Recalibration
  java -jar $GATK -T BaseRecalibrator \
       --default_platform illumina \
       -cov ReadGroupCovariate \
       -cov QualityScoreCovariate \
       -cov CycleCovariate \
       -knownSites ${variants_file} \
       -cov ContextCovariate \
       -R ${genome} -I ${bam} \
       --downsampling_type NONE \
       -nct ${task.cpus} \
       -o final.rnaseq.grp

  java -jar $GATK -T PrintReads \
       -R ${genome} -I ${bam} \
       -BQSR final.rnaseq.grp \
       -nct ${task.cpus} \
       -o final.bam

  # Select only unique alignments, no multimaps
  (samtools view -H final.bam; samtools view final.bam | grep -w 'NH:i:1') \
      | samtools view -Sb - > ${replicateId}.final.uniq.bam (2)

  # Index BAM files
  samtools index ${replicateId}.final.uniq.bam (3)
  """
}
(1) The files resulting from this process will be used in two downstream processes. If a process is executed more than once, and its output channel is consumed by more than one process, we must duplicate the channel. We can do this in the output section by using the into operator with parentheses. See here for more information on using into.
(2) the unique bam file
(3) the index of the unique bam file (bam file name + .bai)
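One possible solution, again assuming the channel names of the earlier sketches. Note that the output tuple starts with sampleId (computed in the script section) rather than replicateId; this is what makes the grouping in the next problem possible:

input:
    file genome from genome_file
    file index from genome_index_ch
    file dict from genome_dict_ch
    set replicateId, file(bam), file(bai) from splitted_bam_ch
    set file(variants_file), file(variants_file_index) from prepared_vcf_ch
output:
    set sampleId, file("${replicateId}.final.uniq.bam"), file("${replicateId}.final.uniq.bam.bai") into (final_output_ch, bam_for_ASE_ch)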
Now we are ready to perform the variant calling with GATK.
3.10. Process 5
GATK Variant Calling
This step calls variants with the GATK HaplotypeCaller. You can find details of the specific GATK HaplotypeCaller parameters here.
You should implement a process having the following structure:
- Name: 5_rnaseq_call_variants
- Command: variant calling of each sample using GATK
- Input:
    - the genome fasta file
    - the genome index made with samtools
    - the genome dictionary made with picard
    - a set containing the sample id, the aligned bam files and the bam index files from process 4 (grouped per sample)
- Output: a set containing the sample id and the resulting variant calling file (VCF)
3.10.1. Problem #9
In this problem we will introduce the use of a channel operator in the input section. The groupTuple operator groups together the tuples emitted by a channel which share a common key.
Note that in process 4, we used the sampleID (not the replicateID) as the first element of the set in the output. Now we combine the replicates by grouping them on the sample ID. It follows that process 4 runs once per replicate and process 5 runs once per sample.
Fill in the BLANKS as before.
process '5_rnaseq_call_variants' {
  tag BLANK
  input:
      BLANK_LINE
      BLANK_LINE
      BLANK_LINE
      BLANK from BLANK.groupTuple()
  output:
      BLANK_LINE
  script:
  """
  echo "${bam.join('\n')}" > bam.list

  # Variant calling
  java -jar $GATK -T HaplotypeCaller \
       -R $genome -I bam.list \
       -dontUseSoftClippedBases \
       -stand_call_conf 20.0 \
       -o output.gatk.vcf.gz

  # Variant filtering
  java -jar $GATK -T VariantFiltration \
       -R $genome -V output.gatk.vcf.gz \
       -window 35 -cluster 3 \
       -filterName FS -filter "FS > 30.0" \
       -filterName QD -filter "QD < 2.0" \
       -o final.vcf
  """
}
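One possible solution: groupTuple collects the tuples emitted by final_output_ch so that each task receives one sample id together with the bam/bai files of all its replicates; vcf_files is an illustrative name for the output channel:

tag "$sampleId"
input:
    file genome from genome_file
    file index from genome_index_ch
    file dict from genome_dict_ch
    set sampleId, file(bam), file(bai) from final_output_ch.groupTuple()
output:
    set sampleId, file('final.vcf') into vcf_files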
3.11. Processes 6A and 6B
ASE & RNA Editing
In the final steps we will create processes for Allele-Specific Expression and RNA Editing Analysis.
We must process the VCF result to prepare the variants file for allele-specific expression (ASE) analysis. We will implement both processes together.
You should implement two processes having the following structure:

- Name: 6A_post_process_vcf
- Command: post-process the variant calling file (VCF) of each sample
- Input:
    - a set containing the sample ID and the VCF file
    - a set containing the filtered/recoded VCF file and the tab index (TBI) file from process 1D
- Output: a set containing the sample id, the variant calling file (VCF) and a file containing the common SNPs

- Name: 6B_prepare_vcf_for_ase
- Command: prepare the VCF for allele-specific expression (ASE) and generate a figure in R
- Input: a set containing the sample id, the variant calling file (VCF) and a file containing the common SNPs
- Output:
    - a set containing the sample ID and the known SNPs in the sample for ASE
    - a figure of the SNPs generated in R as a PDF file
3.11.1. Problem #10
Here we introduce the publishDir directive. This allows us to specify a location for the outputs of the process. See here for more details.
You must have the output of process 6A become the input of process 6B.
process '6A_post_process_vcf' {
  tag BLANK
  publishDir "$params.results/$sampleId" (1)
  input:
      BLANK_LINE
      BLANK_LINE
  output:
      BLANK_LINE
  script:
  '''
  grep -v '#' final.vcf | awk '$7~/PASS/' \
      | perl -ne 'chomp($_); ($dp)=$_=~/DP\\=(\\d+)\\;/; if($dp>=8){print $_."\\n"};' > result.DP8.vcf

  vcftools --vcf result.DP8.vcf --gzdiff filtered.recode.vcf.gz --diff-site --out commonSNPs
  '''
}

process '6B_prepare_vcf_for_ase' {
  tag BLANK
  publishDir BLANK
  input:
      BLANK_LINE
  output:
      BLANK_LINE
      BLANK_LINE
  script:
  '''
  awk 'BEGIN{OFS="\t"} $4~/B/{print $1,$2,$3}' commonSNPs.diff.sites_in_files > test.bed

  vcftools --vcf final.vcf --bed test.bed --recode --keep-INFO-all --stdout > known_snps.vcf

  grep -v '#' known_snps.vcf | awk -F '\\t' '{print $10}' \
      | awk -F ':' '{print $2}' \
      | perl -ne 'chomp($_); @v=split(/\\,/,$_); if($v[0]!=0 || $v[1]!=0){ print $v[1]/($v[1]+$v[0])."\\n"; }' \
      | awk '$1!=1' > AF.4R

  gghist.R -i AF.4R -o AF.histogram.pdf
  '''
}
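One possible solution for the two processes. Quoting a file name in the input (for example file('final.vcf')) stages the incoming file under the fixed name the script expects; vcf_files, vcf_and_snps_ch and vcf_for_ASE are illustrative names, with vcf_for_ASE being the channel picked up again in Problem #11. Note that if prepared_vcf_ch is also consumed by process 4, it would need to be duplicated (for example with into), as explained for the outputs of process 4:

// process '6A_post_process_vcf' (one possible fill-in)
tag "$sampleId"
publishDir "$params.results/$sampleId"
input:
    set sampleId, file('final.vcf') from vcf_files
    set file('filtered.recode.vcf.gz'), file('filtered.recode.vcf.gz.tbi') from prepared_vcf_ch
output:
    set sampleId, file('final.vcf'), file('commonSNPs.diff.sites_in_files') into vcf_and_snps_ch

// process '6B_prepare_vcf_for_ase' (one possible fill-in)
tag "$sampleId"
publishDir "$params.results/$sampleId"
input:
    set sampleId, file('final.vcf'), file('commonSNPs.diff.sites_in_files') from vcf_and_snps_ch
output:
    set sampleId, file('known_snps.vcf') into vcf_for_ASE
    file('AF.histogram.pdf')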
The final step is the GATK ASEReadCounter.
3.11.2. Problem #11
We have seen the basics of using processes in Nextflow. Yet one of the standout features of Nextflow is the operations that can be performed on channels outside of processes. See here for details on the specific operators.
Before we perform the GATK ASEReadCounter process, we must group the data for allele-specific expression. To do this we must combine channels.
The bam_for_ASE_ch channel emits tuples having the following structure, holding the final BAM/BAI files:

( sample_id, file_bam, file_bai )

The vcf_for_ASE channel emits tuples having the following structure:

( sample_id, output.vcf )

In the first operation, the BAMs are grouped together by sample id. Next, the resulting channel is merged with the VCFs (vcf_for_ASE) having the same sample id. Finally, the merged channel is transformed into a channel named grouped_vcf_bam_bai_ch emitting the following tuples:

( sample_id, file_vcf, List[file_bam], List[file_bai] )

Your aim is to fill in the BLANKS below.
bam_for_ASE_ch
    .BLANK (1)
    .phase(vcf_for_ASE) (2)
    .map{ left, right -> (3)
        def sampleId = left[0] (4)
        def bam = left[1] (5)
        def bai = left[2] (6)
        def vcf = right[1] (7)
        tuple(BLANK, vcf, BLANK, BLANK) (8)
    }
    .set { grouped_vcf_bam_bai_ch } (9)
(1) an operator that groups tuples that contain a common first element
(2) the phase operator synchronizes the values emitted by two other channels. See here for more details.
(3) the map operator can apply any function to every item in a channel. In this case we take the tuple from the phase operation, define its separate elements and create a new tuple.
(4) define sampleId to be the first element of left
(5) define bam to be the second element of left
(6) define bai to be the third element of left
(7) define vcf to be the second element of right
(8) create a new tuple made of four elements
(9) rename the resulting channel to grouped_vcf_bam_bai_ch
left and right above are arbitrary names. From the phase operator documentation, we see that phase returns pairs of items. So here left originates from the contents of the bam_for_ASE_ch channel and right originates from the contents of the vcf_for_ASE channel.
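For reference, the completed snippet could look like this, with groupTuple as the operator at (1):

bam_for_ASE_ch
    .groupTuple()
    .phase(vcf_for_ASE)
    .map{ left, right ->
        def sampleId = left[0]
        def bam = left[1]
        def bai = left[2]
        def vcf = right[1]
        tuple(sampleId, vcf, bam, bai)
    }
    .set { grouped_vcf_bam_bai_ch }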
3.12. Process 6C
Allele-Specific Expression analysis with GATK ASEReadCounter
Now we are ready for the final process.
You should implement a process having the following structure:
- Name: 6C_ASE_knownSNPs
- Command: calculate allele counts at a set of positions with GATK tools
- Input:
    - the genome fasta file
    - the genome index file from samtools
    - the genome dictionary file
    - the grouped_vcf_bam_bai_ch channel
- Output: the allele-specific expression file (ASE.tsv)
3.12.1. Problem #12
You should construct the process yourself and run the pipeline in its entirety. The process script is given below:
echo "${bam.join('\n')}" > bam.list

java -jar $GATK -R ${genome} \
     -T ASEReadCounter \
     -o ASE.tsv \
     -I bam.list \
     -sites ${vcf}
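If you get stuck, one possible complete process is sketched below, assuming the channel names used throughout these sketches; the publishDir location mirrors the one used for processes 6A and 6B:

process '6C_ASE_knownSNPs' {
  tag "$sampleId"
  publishDir "$params.results/$sampleId"
  input:
      file genome from genome_file
      file index from genome_index_ch
      file dict from genome_dict_ch
      set sampleId, file(vcf), file(bam), file(bai) from grouped_vcf_bam_bai_ch
  output:
      file "ASE.tsv"
  script:
  """
  echo "${bam.join('\n')}" > bam.list

  java -jar $GATK -R ${genome} \
       -T ASEReadCounter \
       -o ASE.tsv \
       -I bam.list \
       -sites ${vcf}
  """
}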
Congratulations! If you made it this far you now have all the basics to create your own Nextflow workflows.
3.13. Results overview
For each processed sample the pipeline stores the results into a folder named after the sample identifier. These folders are created in the directory specified by the params.results parameter. Result files for this workshop can be found in the folder results within the current folder. There you should see a directory called ENCSR000COQ/ containing the following files:
- Variant calls

final.vcf
This file contains all somatic variants (SNVs) called from the RNA-seq data. You will see variants that pass all filters, with the PASS keyword in the 7th field of the VCF file (filter status), and also those that did not pass one or more filters.

commonSNPs.diff.sites_in_files
Tab-separated file with the comparison between variants obtained from RNA-seq and "known" variants from DNA.
The file is sorted by genomic position and contains 8 fields:

1  CHROM    chromosome name
2  POS1     position of the SNV in file #1 (RNA-seq data)
3  POS2     position of the SNV in file #2 (DNA "known" variants)
4  IN_FILE  flag indicating whether the SNV is present in file #1 ('1'), in file #2 ('2'), or in both files ('B')
5  REF1     reference sequence in file 1
6  REF2     reference sequence in file 2
7  ALT1     alternative sequence in file 1
8  ALT2     alternative sequence in file 2

known_snps.vcf
Variants that are common to the RNA-seq data and the "known" variants from DNA.
- Allele-specific expression quantification

ASE.tsv
Tab-separated file with allele counts at common SNV positions (only SNVs from the file known_snps.vcf). The file is sorted by coordinates and contains 13 fields:
1   contig         contig, scaffold or chromosome name of the variant
2   position       position of the variant
3   variant ID     variant ID in dbSNP
4   refAllele      reference allele sequence
5   altAllele      alternate allele sequence
6   refCount       number of reads that support the reference allele
7   altCount       number of reads that support the alternate allele
8   totalCount     total number of reads at the site that support the reference and alternate alleles and any other alleles present at the site
9   lowMAPQDepth   number of reads that have low mapping quality
10  lowBaseQDepth  number of reads that have low base quality
11  rawDepth       total number of reads at the site that support the reference and alternate alleles and any other alleles present at the site
12  otherBases     number of reads that support bases other than the reference and alternate bases
13  improperPairs  number of reads that have malformed pairs
- Allele frequency histogram

AF.histogram.pdf
This file contains a histogram plot of the allele frequency for SNVs common to the RNA-seq data and "known" variants from DNA.
3.14. Bonus step
Until now the pipeline has been executed using just a single sample (ENCSR000COQ1).
Now we can re-execute the pipeline specifying a large set of samples by using the command shown below:
nextflow run main.nf -resume --reads 'data/reads/ENCSR000C*_{1,2}.fastq.gz'
It will print an output similar to the one below:
N E X T F L O W ~ version 0.24.1
Launching `main.nf` [backstabbing_nightingale] - revision: 1187e44c7a
[warm up] executor > local
[c6/75e3f4] Submitted process > 1A_prepare_genome_samtools (genome)
[7b/44e5d6] Submitted process > 1C_prepare_star_genome_index (genome)
[da/e19bcf] Submitted process > 1B_prepare_genome_picard (genome)
[95/1ad13d] Submitted process > 1D_prepare_vcf_file (known_variants.vcf)
[72/702900] Submitted process > 2_rnaseq_mapping_star (ENCSR000COR1)
[9a/5ca042] Submitted process > 2_rnaseq_mapping_star (ENCSR000CPO1)
[77/03ef01] Submitted process > 2_rnaseq_mapping_star (ENCSR000COR2)
[04/262db9] Submitted process > 2_rnaseq_mapping_star (ENCSR000COQ2)
[a4/64c69e] Submitted process > 2_rnaseq_mapping_star (ENCSR000CPO2)
[9e/ad3621] Submitted process > 2_rnaseq_mapping_star (ENCSR000COQ1)
[a5/cda1b0] Submitted process > 3_rnaseq_gatk_splitNcigar (ENCSR000COQ2)
[42/0565d7] Submitted process > 3_rnaseq_gatk_splitNcigar (ENCSR000COQ1)
[0c/68ce48] Submitted process > 3_rnaseq_gatk_splitNcigar (ENCSR000COR1)
[6b/3843e1] Submitted process > 3_rnaseq_gatk_splitNcigar (ENCSR000COR2)
[1c/8c474b] Submitted process > 3_rnaseq_gatk_splitNcigar (ENCSR000CPO1)
[98/f17992] Submitted process > 3_rnaseq_gatk_splitNcigar (ENCSR000CPO2)
[c2/8cdfca] Submitted process > 4_rnaseq_gatk_recalibrate (ENCSR000COQ1)
[d1/1a6935] Submitted process > 4_rnaseq_gatk_recalibrate (ENCSR000COR1)
[9f/b4c61d] Submitted process > 4_rnaseq_gatk_recalibrate (ENCSR000COR2)
[aa/b43a43] Submitted process > 4_rnaseq_gatk_recalibrate (ENCSR000COQ2)
[46/2d96f0] Submitted process > 4_rnaseq_gatk_recalibrate (ENCSR000CPO1)
[85/6b9527] Submitted process > 4_rnaseq_gatk_recalibrate (ENCSR000CPO2)
[79/a7fb48] Submitted process > 5_rnaseq_call_variants (ENCSR000CPO)
[a5/29c017] Submitted process > 5_rnaseq_call_variants (ENCSR000COQ)
[22/1fdea2] Submitted process > 5_rnaseq_call_variants (ENCSR000COR)
[7d/e1adfb] Submitted process > 6A_post_process_vcf (ENCSR000CPO)
[0a/4d43fc] Submitted process > 6A_post_process_vcf (ENCSR000COQ)
[18/8d486b] Submitted process > 6A_post_process_vcf (ENCSR000COR)
[60/427153] Submitted process > 6B_prepare_vcf_for_ase (ENCSR000CPO)
[32/64eff0] Submitted process > 6B_prepare_vcf_for_ase (ENCSR000COQ)
[31/32ad40] Submitted process > 6B_prepare_vcf_for_ase (ENCSR000COR)
[6f/a5e211] Submitted process > 6C_ASE_knownSNPs (ENCSR000COR)
[ff/989dc1] Submitted process > 6C_ASE_knownSNPs (ENCSR000CPO)
[25/92875a] Submitted process > 6C_ASE_knownSNPs (ENCSR000COQ)
You will notice that this time the pipeline spawns the execution of more tasks, because three samples have been provided instead of one.
This shows the ability of Nextflow to implicitly handle multiple parallel task executions, depending on the specified pipeline input dataset.
A fully functional version of this pipeline is available at the following GitHub repository: CalliNGS-NF.