Launching a pipeline
You can launch a Nextflow pipeline by simply executing this code:
nextflow run main.nf --help
We should see something like:
nextflow run main.nf --help
Picked up JAVA_TOOL_OPTIONS: -Xmx2576m
N E X T F L O W ~ version 21.10.3
Launching `main.nf` [distracted_avogadro] - revision: e434616b3d
BIOCORE@CRG - N F TESTPIPE
=============================================
reads : /workspace/elixir-workshop-21/data/*.fastq.gz
reference : /workspace/elixir-workshop-21/data/chr19.fasta.gz
output : ./output
This is the Biocore's Nextflow test pipeline
Please define reads, reference and output!
Enjoy!
But is really convenient to store your code in one of the sharing platform that are supported, so let’s try a different thing.
Linux containers
This pipeline needs tools that are stored within linux containers. In particular we will use images for Bowtie, FastQC and multiQC that are retrieved from Biocontainers.
For this we need to tell Nextflow which is the right container engine. We can do this by using the Nextflow parameter -with-docker
.
Nextflow supports many more container engines like Singularity, Shifter, Podman and Charliecloud.
Docker |
Singularity |
Shifter |
Podman |
Charliecloud |
---|---|---|---|---|
In this example we will use Docker: the default image is stored in DockerHub and is retrieved on the fly.
nextflow run nextflow-io/elixir-workshop-21 -r master -with-docker
N E X T F L O W ~ version 21.04.3
Pulling nextflow-io/elixir-workshop-21 ...
Launching `nextflow-io/elixir-workshop-21` [determined_minsky] - revision: 711f9f806d [master]
BIOCORE@CRG - N F TESTPIPE ~ version 1.0
=============================================
reads : /Users/lcozzuto/.nextflow/assets/nextflow-io/elixir-workshop-21/data/*.fastq.gz
reference : /Users/lcozzuto/.nextflow/assets/nextflow-io/elixir-workshop-21/data/chr19.fasta.gz
output : /Users/lcozzuto/.nextflow/assets/nextflow-io/elixir-workshop-21/output
executor > local (3)
[2f/6cd1ca] process > fastqc (B7_H3K4me1_s_chr19.fastq.gz) [100%] 2 of 2 ✔
[a2/305aae] process > BOWTIE:Index (chr19.fasta.gz) [ 0%] 0 of 1
[79/f30f70] process > BOWTIE:Align (B7_H3K4me1_s_chr19.fastq.gz) [100%] 2 of 2 ✔
[50/088302] process > multiqc [100%] 1 of 1 ✔
/Users/lcozzuto/work/27/f1cc39c1e01c9ee55684b347c492f5/B7_input_s_chr19.fastq.gz.sam
/Users/lcozzuto/work/27/f1cc39c1e01c9ee55684b347c492f5/B7_input_s_chr19.fastq.gz.log
/Users/lcozzuto/work/79/f30f7041abf5baf7c496a0982906c1/B7_H3K4me1_s_chr19.fastq.gz.sam
/Users/lcozzuto/work/79/f30f7041abf5baf7c496a0982906c1/B7_H3K4me1_s_chr19.fastq.gz.log
Done! Open the following report in your browser --> /Users/lcozzuto/.nextflow/assets/nextflow-io/elixir-workshop-21/output/ouptut_multiQC/multiqc_report.html
Completed at: 24-Nov-2021 15:48:35
Duration : 3m 25s
CPU hours : 0.1
Succeeded : 6
This pipeline can be launched also with Singularity just using the Nextflow parameter -with-singularity
. Nextflow will retrieve and convert the image(s) for you. The image(s) will be then stored so that next time you don’t need to download anything again.
We can inspect the output in the new output
folder generated.
ls -alht output
ls -alht
total 0
drwxr-xr-x 17 lcozzuto staff 544B Nov 24 16:18 ..
drwxr-xr-x 3 lcozzuto staff 96B Nov 24 16:13 ouptut_multiQC
drwxr-xr-x 5 lcozzuto staff 160B Nov 24 16:13 .
drwxr-xr-x 4 lcozzuto staff 128B Nov 24 16:13 ouptut_aln
drwxr-xr-x 6 lcozzuto staff 192B Nov 24 16:11 ouptut_fastqc
Here you can see the report produced by multiQC. You can download the one you just generated by clicking with the righ button on that file.
Work folder structure and process isolation
Once executed, we can see that a folder named work is generated. Nextflow stores in this folder the intermediate files generated by each processes. In case you resume a process that folder is “reused” as cache.
At the start of each row, there is an alphanumeric code:
[a2/305aae] process > BOWTIE:Index (chr19.fasta.gz) [ 0%] 0 of 1
This code indicates the path in which the process is “isolated” and where the corresponding temporary files are kept in the work directory.
Note
Nextflow will randomly generate temporary folders so they will be named differently in your execution.
Let’s have a look inside that folder:
cd work/a2/305aaee297250b0c7a455cab35707c/
ls -alht
-rw-r--r-- 1 lcozzuto staff 20M Nov 24 16:12 chr19.fasta.gz.rev.1.ebwt
-rw-r--r-- 1 lcozzuto staff 6.9M Nov 24 16:12 chr19.fasta.gz.rev.2.ebwt
-rw-r--r-- 1 lcozzuto staff 20M Nov 24 16:11 chr19.fasta.gz.1.ebwt
-rw-r--r-- 1 lcozzuto staff 6.9M Nov 24 16:11 chr19.fasta.gz.2.ebwt
-rw-r--r-- 1 lcozzuto staff 53B Nov 24 16:10 chr19.fasta.gz.3.ebwt
-rw-r--r-- 1 lcozzuto staff 14M Nov 24 16:10 chr19.fasta.gz.4.ebwt
lrwxr-xr-x 1 lcozzuto staff 74B Nov 24 16:10 chr19.fasta.gz -> /Users/lcozzuto/.nextflow/assets/nextflow-io/elixir-workshop-21/data/chr19.fasta.gz
You can see the input files staged as links, the output files and some “hidden” files in which we have different information:
.exitcode, contains 0 if everything is ok, another value if there was a problem.
.command.log, contains the log of the command execution. It is often identical to .command.out
.command.out, contains the standard output of the command execution
.command.err, contains the standard error of the command execution
.command.begin, contains what has to be executed before .command.sh
.command.sh, contains the block of code indicated in the process
.command.run, contains the code made by nextflow for the execution of .command.sh, and contains environmental variables, eventual invocations of linux containers etc.
Resuming and changing parameters
We can copy a fastq files in another place and change the file name:
cp $PATH/.nextflow/assets/nextflow-io/elixir-workshop-21/data/*.gz .
mv B7_H3K4me1_s_chr19.fastq.gz test2.fastq.gz
mv B7_input_s_chr19.fastq.gz test1.fastq.gz
Then we can execute again the pipeline feeding the new input files by using the pipeline parameter --reads ""
Note
Nextflow parameters are indicated by one dash (-). Pipeline parameters by two dahses (--)
You can execute again the pipeline by using the Nextflow parameter -resume
and send it to background with -bg
.
nextflow run nextflow-io/elixir-workshop-21 -with-docker -r master -bg --reads "*.fastq.gz" -resume > log
cat log
N E X T F L O W ~ version 21.10.3
Launching `nextflow-io/elixir-workshop-21` [jolly_visvesvaraya] - revision: 040cd63a79 [master]
NOTE: Your local project version looks outdated - a different revision is available in the remote repository [cf2612db62]
BIOCORE@CRG - N F TESTPIPE ~ version 1.0
=============================================
reads : *.fastq.gz
reference : /Users/lcozzuto/.nextflow/assets/nextflow-io/elixir-workshop-21/data/chr19.fasta.gz
output : ./output
[8b/cfcc4f] Submitted process > fastqc (test1.fastq.gz)
[5b/71ae88] Submitted process > fastqc (test2.fastq.gz)
[6e/1cc3be] Cached process > BOWTIE:Index (chr19.fasta.gz)
[97/2a6a72] Submitted process > BOWTIE:Align (test2.fastq.gz)
[0a/951748] Submitted process > BOWTIE:Align (test1.fastq.gz)
/Users/lcozzuto/work/97/2a6a7245675d7913019aa8983c5e55/test2.fastq.gz.log
/Users/lcozzuto/work/97/2a6a7245675d7913019aa8983c5e55/test2.fastq.gz.sam
/Users/lcozzuto/work/0a/9517481ef43b0e88163ec5f8b4d71f/test1.fastq.gz.log
/Users/lcozzuto/work/0a/9517481ef43b0e88163ec5f8b4d71f/test1.fastq.gz.sam
[f7/1b5746] Submitted process > multiqc
Done! Open the following report in your browser --> ./output/ouptut_multiQC/multiqc_report.html
You can see that the indexing of the genome is cached while the processes that are influenced by the new files are triggered.
Reporting and monitoring
Before going to the code we can have a look to two important features of Nextflow: the ability to produce a comprehensive report and the live monitoring offered by tower.nf web application.
We can go to the tower.nf website
and click on the GitHub authentication.
You can generate your token at https://tower.nf/tokens exporting those environmental variables:
export TOWER_ACCESS_TOKEN=*******YOUR***TOKEN*****HERE*******
Note
You can also store them indefinitely in your .bashrc or .bash_profile file.
We can then launch again the pipeline forcing this time without -resume
and check the live reporting on the tower website adding the parameter -with-tower
.
nextflow run nextflow-io/elixir-workshop-21 -with-docker -r master -bg --reads "*.fastq.gz" -with-tower > log
tail -f log
N E X T F L O W ~ version 21.10.3
Launching `nextflow-io/elixir-workshop-21` [evil_ekeblad] - revision: 040cd63a79 [master]
NOTE: Your local project version looks outdated - a different revision is available in the remote repository [fb23636633]
Downloading plugin nf-tower@1.3.0
BIOCORE@CRG - N F TESTPIPE ~ version 1.0
=============================================
reads : *.fastq.gz
reference : /Users/lcozzuto/.nextflow/assets/nextflow-io/elixir-workshop-21/data/chr19.fasta.gz
output : ./output
Monitor the execution with Nextflow Tower using this url https://tower.nf/user/lucacozzuto/watch/54kIaLzfwIfiLx
[23/b06dda] Submitted process > fastqc (test1.fastq.gz)
[ee/82bce0] Submitted process > fastqc (test2.fastq.gz)
[27/82af32] Submitted process > BOWTIE:Index (chr19.fasta.gz)
[...]
We can check the appearance of a new pipeline and the content
Here you have a real case of a two days run with almost 6 thousands jobs.
When the pipeline is finished you also get a mail. Adding the parameter -with-report
will produce a final html report with all the information that was in the tower.nf website.