Introduction to Nextflow

Enabling scalable and reproducible scientific workflows with Nextflow and software containers.

Tip

We can open our test repository in GitPod by clicking this link. The setup takes a few minutes, so in the meantime we can give a short introduction to Nextflow.

What is Nextflow?

Nextflow is a reactive workflow system and a programming DSL that eases the writing of data-intensive computational pipelines.

It enables scalable and reproducible scientific workflows using software containers that can be deployed in a portable manner across clusters and clouds.

Project documentation is available at this link. Help and support are available through the community Gitter channel and the GitHub discussion group.

The Nextflow project was started at the Centre for Genomic Regulation (CRG) and is currently supported by Seqera Labs.

What is Nextflow for?

Nextflow is for building pipelines without having to take care of parallelization, dependencies, intermediate file names, data structures, exception handling, resuming executions, etc. yourself.

Nextflow has been published in Nature Biotechnology. If you use it in your research, please cite it.

../_images/NF_pub.png

The plot below shows the number of PubMed publications containing the word Nextflow; the original paper has been cited 257 times.

../_images/NF_mentioning.png

There is a curated list of Nextflow pipelines, as well as a collection of pipelines written collaboratively by the nf-core community.

Main advantages

Fast prototyping

You can quickly write a small pipeline and expand it incrementally. Each task is independent and can easily be combined with others. You can reuse existing scripts without rewriting or adapting them.

Reproducibility

Nextflow supports the Docker and Singularity container technologies. Using them makes pipelines reproducible in any Unix environment. Nextflow is also integrated with the GitHub code-sharing platform, so you can directly call a specific version of a pipeline from a repository, then download and use it on the fly.

Portability

Nextflow can be executed on multiple platforms without modifying the code. It supports several schedulers, such as SGE, LSF, SLURM, PBS and HTCondor, and cloud platforms such as Kubernetes, Amazon AWS and Google Cloud.

../_images/executors.png

Scalability

Nextflow is based on the dataflow programming model, which simplifies writing complex pipelines. The tool takes care of parallelizing the processes without any additional code. The resulting applications are inherently parallel and can scale up or scale out transparently; there is no need to adapt them to a specific platform architecture.

Resumable, thanks to continuous checkpoints

All the intermediate results produced during the pipeline execution are automatically tracked. A temporary folder is created for each process execution. When an execution is resumed, the cached results in that folder are reused if nothing has changed; otherwise the task is run again.
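The idea of per-task folders that are either reused or recomputed can be sketched in a few lines of shell. This is only a toy illustration and not how Nextflow is implemented: the function name `run_cached`, the task name `align`, the sample names and the use of `cksum` as a key are all assumptions made for the example. In Nextflow itself, resuming is requested with the `-resume` option of `nextflow run`.

```shell
#!/usr/bin/env bash
# Toy sketch of checkpointing: each (task, input) pair gets its own folder,
# and the task is skipped when that folder already holds a result.
# All names here are hypothetical; Nextflow's real mechanism is more elaborate.

WORK_DIR="$(mktemp -d)"

# run_cached <task_name> <input_string>
# Prints "executed" if the task had to run, "cached" if a result was reused.
run_cached() {
    local name="$1" input="$2"
    local key task_dir
    # Derive a folder name from the task name and its input
    key="$(printf '%s:%s' "$name" "$input" | cksum | cut -d' ' -f1)"
    task_dir="$WORK_DIR/$key"
    if [ -f "$task_dir/output.txt" ]; then
        echo "cached"
    else
        mkdir -p "$task_dir"
        # Stand-in for real work: upper-case the input
        printf '%s' "$input" | tr '[:lower:]' '[:upper:]' > "$task_dir/output.txt"
        echo "executed"
    fi
}

first="$(run_cached align sampleA)"    # first run: nothing cached yet
second="$(run_cached align sampleA)"   # same task and input: reused
third="$(run_cached align sampleB)"    # different input: runs again
echo "$first $second $third"           # prints: executed cached executed
```

Changing the input changes the key, so only the affected task is recomputed while everything else is picked up from its folder.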

Workflow structure

Workflows can be represented as graphs in which the nodes are processes and the edges are channels. Processes are blocks of code that can be executed, such as scripts or programs, while channels are asynchronous queues that connect processes to one another through their inputs and outputs. Methods called operators are provided for reshaping and combining channels.

../_images/wf_example.png

Processes are independent of one another and can run in parallel, depending on the number of elements in a channel. In the previous example, processes A, B and C can run in parallel, and process D is triggered only when all of them have finished. An operator is used to gather all the elements emitted by channels 2, 3 and 4.
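The dependency pattern of the example graph (A, B and C in parallel, then D) can be mimicked with plain shell background jobs. This is a rough analogy under stated assumptions, not Nextflow code: the `process` function and the `result_*` strings are made up for the illustration, and files in a temporary folder play the role of the channels.

```shell
#!/usr/bin/env bash
# Toy analogy of the example graph: A, B and C run concurrently,
# D starts only after all three have ended and gathers their outputs.

OUT_DIR="$(mktemp -d)"

# process <name>: hypothetical stand-in for a real task
process() {
    echo "result_$1" > "$OUT_DIR/$1.txt"
}

# Launch A, B and C as background jobs so they can run in parallel
for name in A B C; do
    process "$name" &
done

# Block until every background job has ended
wait

# "D": gather the three outputs into a single space-separated value
gathered="$(cat "$OUT_DIR/A.txt" "$OUT_DIR/B.txt" "$OUT_DIR/C.txt" | paste -sd ' ' -)"
echo "D received: $gathered"   # prints: D received: result_A result_B result_C
```

In a shell script the `wait` has to be written by hand. In Nextflow the same ordering follows from the way channels connect the processes.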

Practical part

Once the GitPod page is loaded, we can then open a terminal as indicated in the picture:

../_images/gitpod1.png

Installation of Nextflow

Note

You need Java version 8 or later to install Nextflow.

Tip

You can check the version of Java by typing:

java -version
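If you prefer an automated check, the version test can be scripted. The following is a minimal sketch, assuming the usual `java -version` output in which the version appears in double quotes on the first line (for example "1.8.0_292" or "11.0.2"). The helper name `java_major` is hypothetical.

```shell
#!/usr/bin/env bash
# java_major <version-string>: print the major Java version.
# Handles both the legacy scheme (1.8.0_292 -> 8) and the newer one (11.0.2 -> 11).
java_major() {
    local v="$1"
    case "$v" in
        1.*) v="${v#1.}" ;;   # drop the leading "1." of the legacy scheme
    esac
    echo "${v%%[._-]*}"       # keep everything before the first ".", "_" or "-"
}

if command -v java >/dev/null 2>&1; then
    # java -version writes to stderr; extract the quoted version from the first line
    raw="$(java -version 2>&1 | head -n 1 | sed -E 's/.*"([^"]+)".*/\1/')"
    major="$(java_major "$raw")"
    if [ "$major" -ge 8 ] 2>/dev/null; then
        echo "Java $major found: OK"
    else
        echo "Java $raw found: version 8 or later is needed"
    fi
else
    echo "Java not found"
fi
```

The parsing is kept in a separate function so that it can be tried on any version string without Java being installed.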

We can then install Nextflow using this command:

curl -s https://get.nextflow.io | bash

This creates the nextflow executable in the current directory. It can then be moved, for example, to /usr/local/bin:

sudo mv nextflow /usr/local/bin