Introduction to Nextflow
Enabling scalable and reproducible scientific workflows with Nextflow and software containers.
Tip
We can open our test repository in GitPod by clicking this link. Setup takes a few minutes, so in the meantime we can briefly introduce Nextflow.
What is Nextflow?
Nextflow is a reactive workflow system and a programming DSL that eases the writing of data-intensive computational pipelines.
It enables scalable and reproducible scientific workflows using software containers that can be deployed in a portable manner across clusters and clouds.
Project documentation is available at this link. Help and support are available through the community Gitter channel and the GitHub discussion group.
The Nextflow project was started at the Centre for Genomic Regulation (CRG) and is currently supported by Seqera Labs.
What is Nextflow for?
It is for building pipelines without having to take care of parallelization, dependencies, intermediate file names, data structures, exception handling, resuming executions, and so on.
Nextflow has been published in Nature Biotechnology. If you use it in your research, please cite it.
Number of PubMed publications containing the word Nextflow. The original paper has been cited 257 times.
See also a curated list of Nextflow pipelines and the collection of pipelines written collaboratively by the nf-core community.
Main advantages
Fast prototyping
You can quickly write a small pipeline and expand it incrementally. Each task is independent and can easily be combined with others. You can reuse existing scripts without rewriting or adapting them.
Reproducibility
Nextflow supports the Docker and Singularity container technologies. Using them makes pipelines reproducible in any Unix environment. Nextflow is also integrated with the GitHub code sharing platform, so you can directly call a specific version of a pipeline from a repository, and it is downloaded and run on the fly.
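As a minimal sketch of calling a specific pipeline version from GitHub, the snippet below uses the -r option of nextflow run. The pipeline name nextflow-io/hello and the revision master are assumptions chosen for illustration and are not part of this course; replace them with your own repository and a branch, tag or commit ID.

```shell
# Illustrative values (assumptions, not taken from the course material).
PIPELINE="nextflow-io/hello"   # GitHub "owner/repository" of the pipeline
REVISION="master"              # branch, tag or commit ID to run

run_pinned() {
    # -r selects the Git revision of the pipeline to download and run;
    # any extra arguments are passed through to nextflow.
    nextflow run "$PIPELINE" -r "$REVISION" "$@"
}

# Only attempt the run when Nextflow is actually installed.
if command -v nextflow >/dev/null 2>&1; then
    run_pinned
else
    echo "nextflow not found on PATH; skipping the run" >&2
fi
```

Pinning the revision means that everyone who runs the command gets the same pipeline code, independently of later changes in the repository.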
Portability
Nextflow can be executed on multiple platforms without modifying the code. It supports several schedulers, such as SGE, LSF, SLURM, PBS and HTCondor, and cloud platforms such as Kubernetes, AWS and Google Cloud.
Scalability
Nextflow is based on the dataflow programming model, which simplifies writing complex pipelines. The tool takes care of parallelizing the processes without requiring any additional code. The resulting applications are inherently parallel and can scale up or scale out transparently; there is no need to adapt them to a specific platform architecture.
Resumable, thanks to continuous checkpoints
All the intermediate results produced during the pipeline execution are automatically tracked. A temporary folder is created for each process execution. When an execution is resumed, the cached results in that folder are either reused or the process is run again.
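The resume behaviour can be tried with the -resume option of nextflow run. This is a sketch only: the pipeline name nextflow-io/hello is an assumed public demo pipeline, not one used in this course.

```shell
PIPELINE="nextflow-io/hello"   # assumed demo pipeline, for illustration only

run_pipeline() {
    # Extra options such as -resume are passed through unchanged.
    nextflow run "$PIPELINE" "$@"
}

if command -v nextflow >/dev/null 2>&1; then
    # First execution: every task is run and its results are stored
    # in the work directory (./work by default).
    run_pipeline
    # Second execution with -resume: tasks whose inputs and script are
    # unchanged are taken from the cache instead of being executed again.
    run_pipeline -resume
else
    echo "nextflow not found on PATH; skipping the run" >&2
fi
```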
Workflow structure
Workflows can be represented as graphs in which the nodes are the processes and the edges are the channels. Processes are blocks of code that can be executed, such as scripts or programs, while channels are asynchronous queues that connect processes to each other through their inputs and outputs. Some methods, called operators, are provided for reshaping and combining channels.
Processes are independent of one another and can be run in parallel, depending on the number of elements in a channel. In the previous example, processes A, B and C can be run in parallel, and process D is triggered only when ALL of them have finished. An operator is used to gather all the elements emitted by channels 2, 3 and 4.
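The behaviour of this graph can be imitated in plain shell. This is an analogy only, not Nextflow code: the function process, the input sample1 and the .out files are hypothetical stand-ins for processes and channels. In Nextflow the same coordination happens automatically.

```shell
# Plain-shell analogy of the A, B, C -> D graph (not Nextflow code).
workdir="$(mktemp -d)"

# A stand-in "process": takes a name and an input, writes one output file
# that plays the role of its output channel.
process() {
    echo "$1 processed $2" > "$workdir/$1.out"
}

# A, B and C are independent, so they are started concurrently.
process A sample1 &
process B sample1 &
process C sample1 &

# Block until ALL background jobs have finished: D needs all three outputs.
wait

# Stand-in "operator": gather the three outputs into a single input for D.
collected="$(cat "$workdir/A.out" "$workdir/B.out" "$workdir/C.out")"

# D runs once, on the gathered result; here it just counts the lines.
result="$(printf '%s\n' "$collected" | wc -l | tr -d ' ')"
echo "D received $result results"

rm -rf "$workdir"
```

In the shell version the wait call and the file names must be managed by hand. In Nextflow they are replaced by channels and operators.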
Practical part
Once the GitPod page is loaded, we can then open a terminal as indicated in the picture:
Installation of Nextflow
Note
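You need at least Java version 8 to install Nextflow. More recent Nextflow releases require a newer Java version, so check the documentation of the release you install.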
Tip
You can check the version of Java by typing:
java -version
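If you prefer an automated check, the sketch below compares the installed Java major version with a minimum. MIN_JAVA and java_major are hypothetical names introduced here. The parsing assumes the usual output format, in which the version is the first double-quoted string (for example "1.8.0_292" or "17.0.2").

```shell
MIN_JAVA=8   # minimum major version; adjust to what your Nextflow release needs

# Print the major version for strings like "1.8.0_292" (legacy scheme) or "17.0.2".
java_major() {
    v="${1#1.}"                    # drop a legacy "1." prefix: 1.8.0_292 -> 8.0_292
    printf '%s\n' "${v%%[._-]*}"   # keep what precedes the first ".", "_" or "-"
}

if command -v java >/dev/null 2>&1; then
    # java -version writes to stderr, hence the 2>&1 redirection.
    version="$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')"
    major="$(java_major "$version")"
    if [ -n "$major" ] && [ "$major" -ge "$MIN_JAVA" ]; then
        echo "Java $version found: OK"
    else
        echo "Java '$version' is older than $MIN_JAVA or could not be parsed"
    fi
else
    echo "Java not found on PATH"
fi
```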
And we can install Nextflow using this command:
curl -s https://get.nextflow.io | bash
This will create the nextflow executable, which can be moved, for example, to /usr/local/bin.
sudo mv nextflow /usr/local/bin