1. Installation

1.1. Requirements

Nextflow can be used on any POSIX-compatible system (Linux, OS X, etc.). It requires Bash and Java 8 (or later) to be installed.

Optional requirements:

  • Docker engine 1.10.x (or later)

  • Singularity 2.5.x (or later)

  • Conda 4.5 (or later)

  • Graphviz

  • A properly configured AWS Batch computing environment

1.2. Nextflow Installation

Nextflow is distributed as a self-contained executable package, which means that it does not require any special installation procedure.

It only needs a few easy steps:

  1. Download the executable package by using the command shown below. It will create the nextflow main executable file in the current directory.

    curl https://get.nextflow.io | bash
  2. Move the nextflow file to a directory accessible by your $PATH variable (this is only required to avoid having to remember and type the full path to nextflow each time you need to run it).

    mv nextflow ~/bin
  3. Finally, check that everything is fine by running the command:

    nextflow info

Alternatively, Nextflow can also be installed through the Conda package manager using the command:

conda install nextflow

2. Simple RNA-Seq pipeline

During this tutorial you will implement a proof of concept RNA-Seq pipeline which:

  1. Indexes a transcriptome file.

  2. Performs quality control.

  3. Performs quantification.

  4. Creates a MultiQC report.

Change into the course directory with the following command:

cd ~/nf-hack18/course

2.1. Define the pipeline parameters

The script script1.nf defines the pipeline input parameters.

params.reads = "$baseDir/data/ggal/*_{1,2}.fq"
params.transcriptome = "$baseDir/data/ggal/transcriptome.fa"
params.multiqc = "$baseDir/multiqc"

println "reads: $params.reads"

Run it by using the following command:

nextflow run script1.nf

Try to specify a different input parameter, for example:

nextflow run script1.nf --reads this/and/that

2.1.1. Exercise

Modify script1.nf by adding a fourth parameter named outdir and setting it to a default path that will be used as the pipeline output directory.

2.1.2. Exercise

Modify script1.nf to print all the pipeline parameters by using a single println command and a multiline string statement.


2.1.3. Recap

In this step you have learned:

  1. How to define parameters in your pipeline script

  2. How to pass parameters by using the command line

  3. The use of $var and ${var} variable placeholders

  4. How to use multiline strings

2.2. Create transcriptome index file

Nextflow allows the execution of any command or user script by using a process definition.

A process is defined by providing three main declarations: the process inputs, the process outputs and finally the command script.

The second example adds the index process.

/*
 * pipeline input parameters
 */
params.reads = "$baseDir/data/ggal/*_{1,2}.fq"
params.transcriptome = "$baseDir/data/ggal/transcriptome.fa"
params.multiqc = "$baseDir/multiqc"
params.outdir = "results"

println """\
         R N A S E Q - N F   P I P E L I N E
         ===================================
         transcriptome: ${params.transcriptome}
         reads        : ${params.reads}
         outdir       : ${params.outdir}
         """
         .stripIndent()

/*
 * create a transcriptome file object given the transcriptome string parameter
 */
transcriptome_file = file(params.transcriptome)

/*
 * define the `index` process that creates a binary index
 * given the transcriptome file
 */
process index {

    input:
    file transcriptome from transcriptome_file

    output:
    file 'index' into index_ch

    script:
    """
    salmon index --threads $task.cpus -t $transcriptome -i index
    """
}

It takes the transcriptome file as input and creates the transcriptome index by using the salmon tool.

Note how the input declaration defines a transcriptome variable in the process context that is used in the command script to reference that file in the Salmon command line.

Try to run it by using the command:

nextflow run script2.nf

The execution will fail because Salmon is not installed in your environment.

Add the command line option -with-docker to launch the execution through a Docker container as shown below:

nextflow run script2.nf -with-docker

This time it works because it uses the Docker container nextflow/rnaseq-nf defined in the nextflow.config file.

To avoid having to add the option -with-docker every time, add the following line to the nextflow.config file:

docker.enabled = true

2.2.1. Exercise

Enable Docker execution by default by adding the above setting to the nextflow.config file.

2.2.2. Exercise

Print the output of the index_ch channel by using the println operator (do not confuse it with the println statement seen previously).
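
A minimal sketch of one possible solution, added as the last line of script2.nf:

index_ch.println()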

2.2.3. Exercise

Use the command tree work to see how Nextflow organises the process work directory.

2.2.4. Recap

In this step you have learned:

  1. How to define a process executing a custom command

  2. How process inputs are declared

  3. How process outputs are declared

  4. How to access the number of available CPUs

  5. How to print the content of a channel

2.3. Collect read files by pairs

This step shows how to match read files into pairs, so they can be mapped by Salmon.

Edit the script script3.nf and add the following statement as the last line:

read_pairs_ch.println()

Save it and execute it with the following command:

nextflow run script3.nf

It will print an output similar to the one shown below:

[ggal_gut, [/.../data/ggal/gut_1.fq, /.../data/ggal/gut_2.fq]]

The above example shows how the read_pairs_ch channel emits tuples composed of two elements, where the first is the read pair prefix and the second is a list representing the actual files.

Try it again specifying different read files by using a glob pattern:

nextflow run script3.nf --reads 'data/ggal/*_{1,2}.fq'
File paths including one or more wildcards (i.e. *, ?, etc.) MUST be wrapped in single-quote characters to prevent Bash from expanding the glob.

2.3.1. Exercise

Use the set operator in place of = assignment to define the read_pairs_ch channel.
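
A minimal sketch of one possible solution:

Channel
    .fromFilePairs( params.reads )
    .set { read_pairs_ch }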

2.3.2. Exercise

Use the ifEmpty operator to check whether the read_pairs_ch channel contains at least one item.
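
One possible approach, chaining ifEmpty into the channel creation (the error message is illustrative):

Channel
    .fromFilePairs( params.reads )
    .ifEmpty { error "Cannot find any reads matching: ${params.reads}" }
    .set { read_pairs_ch }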

2.3.3. Recap

In this step you have learned:

  1. How to use fromFilePairs to handle read pair files

  2. How to use the set operator to define a new channel variable

  3. How to use the ifEmpty operator to check if a channel is empty

2.4. Perform expression quantification

The script script4.nf adds the quantification process.

In this script, note how the index_ch channel, declared as output in the index process, is now used as a channel in the input section.

Also note how the second input is declared as a set composed of two elements, the pair_id and the reads, in order to match the structure of the items emitted by the read_pairs_ch channel.

Execute it by using the following command:

nextflow run script4.nf -resume

You will see the execution of the quantification process.

The -resume option causes the execution of any step that has already been processed to be skipped.

Try to execute it with more read files as shown below:

nextflow run script4.nf -resume --reads 'data/ggal/*_{1,2}.fq'

You will notice that the quantification process is executed more than once.

Nextflow parallelizes the execution of your pipeline simply by providing multiple input data to your script.

2.4.1. Exercise

Add a tag directive to the quantification process to provide a more readable execution log.
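
For example, the directive might look like the following, assuming the process input declares a pair_id variable as in the previous scripts:

tag "$pair_id"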

2.4.2. Exercise

Add a publishDir directive to the quantification process to store the process results into a directory of your choice.
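
A minimal sketch, reusing the outdir parameter defined earlier (mode: 'copy' is one possible publishing mode):

publishDir params.outdir, mode: 'copy'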

2.4.3. Recap

In this step you have learned:

  1. How to connect two processes by using the channel declarations

  2. How to resume the script execution skipping already computed steps

  3. How to use the tag directive to provide a more readable execution output

  4. How to use the publishDir directive to store process results in a path of your choice

2.5. Quality control

This step implements the quality control of your input reads. The inputs are the same read pairs that are provided to the quantification step.

You can run it by using the following command:

nextflow run script5.nf -resume

The script will report the following error message:

Channel `read_pairs_ch` has been used twice as an input by process `fastqc` and process `quantification`

2.5.1. Exercise

Modify the creation of the read_pairs_ch channel by using an into operator in place of a set.
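
A minimal sketch of one possible solution; the second channel name is illustrative:

Channel
    .fromFilePairs( params.reads )
    .into { read_pairs_ch; read_pairs2_ch }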


2.5.2. Recap

In this step you have learned:

  1. How to use the into operator to create multiple copies of the same channel

2.6. MultiQC report

This step collects the outputs from the quantification and fastqc steps to create a final report by using the MultiQC tool.

Execute the script with the following command:

nextflow run script6.nf -resume --reads 'data/ggal/*_{1,2}.fq'

It creates the final report in the results folder in the current work directory.

In this script, note the use of the mix and collect operators chained together to get all the outputs of the quantification and fastqc processes as a single input.
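
The chained call might look like the following sketch, where quant_ch and fastqc_ch are assumed to be the output channels of the quantification and fastqc processes:

quant_ch
    .mix( fastqc_ch )
    .collect()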

2.6.1. Recap

In this step you have learned:

  1. How to collect many outputs to a single input with the collect operator

  2. How to mix two channels in a single channel

  3. How to chain two or more operators together

2.7. Handle completion event

This step shows how to execute an action when the pipeline completes the execution.

Note that Nextflow processes define the execution of asynchronous tasks, i.e. they are not executed one after another in the order they are written in the pipeline script, as would happen in a common imperative programming language.

The script uses the workflow.onComplete event handler to print a confirmation message when the script completes.

workflow.onComplete {
    println ( workflow.success ? "\nDone! Open the following report in your browser -->  $params.outdir/multiqc_report.html\n" : "Oops .. something went wrong" )
}

Try to run it by using the following command:

nextflow run script7.nf -resume --reads 'data/ggal/*_{1,2}.fq'

2.8. Custom scripts

Real-world pipelines use a lot of custom user scripts (Bash, R, Python, etc.). Nextflow allows you to use and manage all these scripts in a consistent manner. Simply put them in a directory named bin in the pipeline project root. They will be automatically added to the pipeline execution PATH.

For example, create a file named fastqc.sh with the following content:

#!/bin/bash
set -e
set -u

sample_id=${1}
reads=${2}

mkdir fastqc_${sample_id}_logs
fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}

Save it, give it execute permission, and move it into the bin directory as shown below:

chmod +x fastqc.sh
mkdir -p bin
mv fastqc.sh bin

Then, open the script7.nf file and replace the fastqc process' script with the following code:

script:
"""
fastqc.sh "$sample_id" "$reads"
"""

Run it as before:

nextflow run script7.nf -resume --reads 'data/ggal/*_{1,2}.fq'

2.8.1. Recap

In this step you have learned:

  1. How to write or use existing custom scripts in your Nextflow pipeline.

  2. How to avoid the use of absolute paths by keeping your scripts in the project's bin/ folder.

2.9. Mail notification

Send a notification email when the workflow execution completes by using the -N <email address> command line option. Execute the previous example again, specifying your email address:

nextflow run script7.nf -resume --reads 'data/ggal/*_{1,2}.fq' -N <your email>
Your computer must have a pre-configured mail tool installed, such as mail or sendmail.

Alternatively, you can provide the settings of the SMTP server needed to send the mail notification in the Nextflow config file. See the mail documentation for details.
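
A sketch of such a configuration in the nextflow.config file; the host, port and credentials are placeholders to be replaced with your own values:

mail {
    smtp.host = 'smtp.example.com'
    smtp.port = 587
    smtp.user = 'my-user'
    smtp.password = 'my-password'
}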

2.10. More resources

3. Bonus

3.1. Run a pipeline from a GitHub repository

Nextflow allows the execution of a pipeline project directly from a GitHub repository (or similar services, e.g. BitBucket and GitLab).

This simplifies the sharing and the deployment of complex projects and tracking changes in a consistent manner.

The GitHub repository nextflow-io/rnaseq-nf hosts a complete version of the workflow introduced in this tutorial.

You can run it by specifying the project name as shown below:

nextflow run nextflow-io/rnaseq-nf -with-docker

Nextflow automatically downloads the project and stores it in the $HOME/.nextflow folder.

Use the command info to show the project information, e.g.:

nextflow info nextflow-io/rnaseq-nf

Nextflow allows the execution of a specific revision of your project by using the -r command line option. For example:

nextflow run nextflow-io/rnaseq-nf -r dev

Revisions are defined by using Git tags or branches defined in the project repository.

This allows a precise control of the changes in your project files and dependencies over time.

3.2. Metrics and reports

Nextflow is able to produce multiple reports and charts providing several runtime metrics and execution information.

Run the rnaseq-nf pipeline previously introduced as shown below:

nextflow run -r master rnaseq-nf -with-docker -with-report -with-trace -with-timeline -with-dag dag.png

The -with-report option enables the creation of the workflow execution report. Open the file report.html with a browser to see the report created with the above command.

The -with-trace option enables the creation of a tab-separated file containing runtime information for each executed task. Check the content of the file trace.txt for an example.

The -with-timeline option enables the creation of the workflow timeline report showing how processes were executed over time. This may be useful to identify the most time-consuming tasks and bottlenecks. See an example at this link.

Finally, the -with-dag option enables the rendering of the workflow execution directed acyclic graph (DAG) representation. Note: this feature requires Graphviz to be installed on your computer. See here for details.


Note: runtime metrics may be incomplete for short-running tasks, as is the case in this tutorial.

4. Basic language structures and commands

Nextflow is a DSL implemented on top of the Groovy programming language, which in turn is a super-set of the Java programming language. This means that Nextflow can run any Groovy or Java code.

4.1. Printing values

Printing something is as easy as using one of the print or println methods.

println("Hello, World!")

The only difference between the two is that the println method implicitly appends a new line character to the printed string.

Parentheses for function invocations are optional. Therefore the following is also valid syntax.
println "Hello, World!"

4.2. Comments

Comments use the same syntax as in the C-family programming languages:

// comment a single line

/*
   a comment spanning
   multiple lines
 */

4.3. Variables

To define a variable, simply assign a value to it:

x = 1
println x

x = new java.util.Date()
println x

x = -3.1499392
println x

x = false
println x

x = "Hi"
println x

Local variables are defined using the def keyword:

def x = 'foo'

It should always be used when defining variables local to a function or a closure.

4.4. Lists

A List object can be defined by placing the list items in square brackets:

list = [10,20,30,40]

You can access a given item in the list with square-bracket notation (indexes start at 0) or using the get method:

assert list[0] == 10
assert list[0] == list.get(0)

In order to get the length of the list use the size method:

assert list.size() == 4

Lists can also be indexed with negative indexes and reversed ranges.

list = [0,1,2]
assert list[-1] == 2
assert list[-1..0] == list.reverse()

List objects implement all the methods provided by the Java java.util.List interface, plus the extension methods provided by the Groovy API.

assert [1,2,3] << 1 == [1,2,3,1]
assert [1,2,3] + [1] == [1,2,3,1]
assert [1,2,3,1] - [1] == [2,3]
assert [1,2,3] * 2 == [1,2,3,1,2,3]
assert [1,[2,3]].flatten() == [1,2,3]
assert [1,2,3].reverse() == [3,2,1]
assert [1,2,3].collect{ it+3 } == [4,5,6]
assert [1,2,3,1].unique().size() == 3
assert [1,2,3,1].count(1) == 2
assert [1,2,3,4].min() == 1
assert [1,2,3,4].max() == 4
assert [1,2,3,4].sum() == 10
assert [4,2,1,3].sort() == [1,2,3,4]
assert [4,2,1,3].find{it%2 == 0} == 4
assert [4,2,1,3].findAll{it%2 == 0} == [4,2]

4.5. Maps

Maps are like lists that have an arbitrary type of key instead of an integer. Therefore, the syntax is very similar.

map = [a:0, b:1, c:2]

Maps can be accessed in a conventional square-bracket syntax or as if the key was a property of the map.

assert map['a'] == 0        (1)
assert map.b == 1           (2)
assert map.get('c') == 2    (3)
1 Use of the square brackets.
2 Use a dot notation.
3 Use of get method.

To add data or to modify a map, the syntax is similar to adding values to a list:

map['a'] = 'x'           (1)
map.b = 'y'              (2)
map.put('c', 'z')        (3)
assert map == [a:'x', b:'y', c:'z']
1 Use of the square brackets.
2 Use a dot notation.
3 Use of the put method.

Map objects implement all the methods provided by the Java java.util.Map interface, plus the extension methods provided by the Groovy API.

4.6. String interpolation

String literals can be defined by enclosing them in either single-quote or double-quote characters.

Double-quoted strings can contain the value of an arbitrary variable by prefixing its name with the $ character, or the value of any expression by using the ${expression} syntax, similar to Bash/shell scripts:

foxtype = 'quick'
foxcolor = ['b', 'r', 'o', 'w', 'n']
println "The $foxtype ${foxcolor.join()} fox"

x = 'Hello'
println '$x + $y'

This code prints:

The quick brown fox
$x + $y
Note the different use of $ and ${..} syntax to interpolate value expressions in a string literal.

Finally, string literals can also be defined using the / character as a delimiter. They are known as slashy strings and are useful for defining regular expressions and patterns, as there is no need to escape backslashes. As with double-quoted strings, they allow the interpolation of variables prefixed with a $ character.

Try the following to see the difference:

x = /tic\tac\toe/
y = 'tic\tac\toe'

println x
println y

It prints:

tic\tac\toe
tic    ac    oe

4.7. Multi-line strings

A block of text that spans multiple lines can be defined by delimiting it with triple single or double quotes:

text = """
    Hello there James
    how are you today?
    """

Finally, multi-line strings can also be defined with slashy strings. For example:

text = /
    This is a multi-line
    slashy string!
    It's cool, isn't it?!
    /
Like before, multi-line strings inside double quotes and slash characters support variable interpolation, while single-quoted multi-line strings do not.

4.8. If statement

The if statement uses the same syntax common to other programming languages, such as Java, C, JavaScript, etc.

if( < boolean expression > ) {
    // true branch
}
else {
    // false branch
}

The else branch is optional. Also, curly brackets are optional when the branch defines just a single statement.

x = 1
if( x > 10 )
    println 'Hello'
null, empty strings and empty collections are evaluated to false.

Therefore a statement like:

list = [1,2,3]
if( list != null && list.size() > 0 ) {
  println list
}
else {
  println 'The list is empty'
}

can be written as:

if( list )
    println list
else
    println 'The list is empty'

See the Groovy-Truth for details.

In some cases it can be useful to replace the if statement with a ternary expression, aka a conditional expression. For example:
println list ? list : 'The list is empty'

The previous statement can be further simplified using the Elvis operator as shown below:

println list ?: 'The list is empty'

4.9. For statement

The classical for loop syntax is supported as shown here:

for (int i = 0; i < 3; i++) {
   println("Hello World $i")
}

Iteration over list objects is also possible using the syntax below:

list = ['a','b','c']

for( String elem : list ) {
  println elem
}

4.10. Functions

It is possible to define a custom function in a script, as shown here:

int fib(int n) {
    return n < 2 ? 1 : fib(n-1) + fib(n-2)
}

assert fib(10)==89

A function can take multiple arguments, separated by commas. The return keyword can be omitted, and the function implicitly returns the value of the last evaluated expression. Explicit types can also be omitted (though not recommended):

def fact( n ) {
  n > 1 ? n * fact(n-1) : 1
}

assert fact(5) == 120

4.11. Closures

Closures are the Swiss Army knife of Nextflow/Groovy programming. In a nutshell, a closure is a block of code that can be passed as an argument to a function; it can also be thought of as an anonymous function.

More formally, a closure allows the definition of functions as first class objects.

square = { it * it }

The curly brackets around the expression it * it tell the script interpreter to treat this expression as code. The it identifier is an implicit variable that represents the value that is passed to the function when it is invoked.

Once compiled, the function object is assigned to the variable square like any other variable assignment shown previously. To invoke the closure execution, use the special method call or just use round parentheses to specify the closure parameter(s). For example:

assert square.call(5) == 25
assert square(9) == 81

This is not very interesting until we find that we can pass the function square as an argument to other functions or methods. Some built-in functions take a function like this as an argument. One example is the collect method on lists:

x = [ 1, 2, 3, 4 ].collect(square)
println x

It prints:

[ 1, 4, 9, 16 ]

By default, closures take a single parameter called it; to give it a different name, use the -> syntax. For example:

square = { num -> num * num }

It’s also possible to define closures with multiple, custom-named parameters.

For example, the method each(), when applied to a map, can take a closure with two arguments, to which it passes the key-value pair for each entry in the map object:

printMap = { a, b ->
    println "$a with value $b"
}

[ "Yue" : "Wu", "Mark" : "Williams", "Sudha" : "Kumari" ].each(printMap)

It prints:

Yue with value Wu
Mark with value Williams
Sudha with value Kumari

A closure has two other important features. First, it can access and modify variables in the scope where it is defined.

Second, a closure can be defined in an anonymous manner, meaning that it is not given a name, and is defined in the place where it needs to be used.

As an example showing both these features, see the following code fragment:

result = 0  (1)
map = ["China": 1 , "India" : 2, "USA" : 3]   (2)
map.keySet().each { result += map[it] }       (3)
println result
1 Define a global variable.
2 Define a map object.
3 Invoke the each method, passing a closure object which modifies the result variable.

Learn more about closures in the Groovy documentation.

4.12. More resources

The complete Groovy language documentation is available at this link.

A great resource to master Apache Groovy syntax is Groovy in Action.

5. Channels

Channels are a key data structure of Nextflow that allow the implementation of reactive, functional-oriented computational workflows based on the Dataflow programming paradigm.

They are used to logically connect tasks to each other or to implement functional-style data transformations.


5.1. Channel types

Nextflow distinguishes between two different kinds of channels: queue channels and value channels.

5.1.1. Queue channel

A queue channel is an asynchronous unidirectional FIFO queue which connects two processes or operators.

  • What does asynchronous mean? Operations are non-blocking.

  • What does unidirectional mean? Data flows from a producer to a consumer.

  • What does FIFO mean? The data is guaranteed to be delivered in the same order as it is produced.

A queue channel is implicitly created by process output definitions, or by using channel factory methods such as Channel.from or Channel.fromPath.

Try the following snippets:

p = Channel.from(1,2,3)
println(p)      (1)
p.println()     (2)

Note the different use of println:

1 Use the built-in println function to print the p variable.
2 Apply the println method to the p channel, which prints each item emitted by the channel.
Exercise

Try to execute this snippet; it will produce an error message.

p = Channel.from(1,2,3)
p.println()
p.println()
A queue channel can have exactly one producer and exactly one consumer.

5.1.2. Value channels

A value channel, a.k.a. singleton channel, is by definition bound to a single value and can be read an unlimited number of times without consuming its content.

p = Channel.value('Hello')
p.println()
p.println()
p.println()

It prints:

Hello
Hello
Hello

5.2. Channel factories

5.2.1. value

The value factory method is used to create a value channel. An optional non-null argument can be specified to bind the channel to a specific value. For example:

ch1 = Channel.value()                 (1)
ch2 = Channel.value( 'Hello there' )  (2)
ch3 = Channel.value( [1,2,3,4,5] )    (3)
1 Creates an empty value channel.
2 Creates a value channel and binds a string to it.
3 Creates a value channel and binds a list object to it that will be emitted as a sole emission.

5.2.2. from

The factory Channel.from allows the creation of a queue channel with the values specified as argument.

ch = Channel.from( 1, 3, 5, 7 )
ch.println{ "value: $it" }

The first line in this example creates a variable ch which holds a channel object. This channel emits the values specified as a parameter in the from method. Thus the second line will print the following:

value: 1
value: 3
value: 5
value: 7

5.2.3. fromPath

The fromPath factory method creates a queue channel emitting one or more files matching the specified glob pattern.

Channel.fromPath( '/data/big/*.txt' )

This example creates a channel that emits as many items as there are files with a txt extension in the /data/big folder. Each element is a file object implementing the Path interface.

Two asterisks, i.e. **, work like * but cross directory boundaries. This syntax is generally used for matching complete paths. Curly brackets specify a collection of sub-patterns.
Table 1. Available options

  glob          When true interprets the characters *, ?, [] and {} as glob wildcards, otherwise handles them as normal characters (default: true)

  type          Type of paths returned, either file, dir or any (default: file)

  hidden        When true includes hidden files in the resulting paths (default: false)

  maxDepth      Maximum number of directory levels to visit (default: no limit)

  followLinks   When true follows symbolic links during directory tree traversal, otherwise they are managed as files (default: true)

  relative      When true returned paths are relative to the top-most common directory (default: false)

  checkIfExists When true throws an exception if the specified path does not exist in the file system (default: false)

Learn more about the glob pattern syntax at this link.

Exercise

Use the Channel.fromPath method to create a channel emitting all the files with the suffix .fq in the data/ggal/ directory and any of its subdirectories, then print the file name.
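
A minimal sketch of one possible solution:

Channel
    .fromPath( 'data/ggal/**.fq' )
    .println { it.name }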

5.2.4. fromFilePairs

The fromFilePairs method creates a channel emitting the file pairs matching a glob pattern provided by the user. The matching files are emitted as tuples in which the first element is the grouping key of the matching pair and the second element is the list of files (sorted in lexicographical order).

Channel
    .fromFilePairs('/my/data/SRR*_{1,2}.fastq')
    .println()

It will produce an output similar to the following:

[SRR493366, [/my/data/SRR493366_1.fastq, /my/data/SRR493366_2.fastq]]
[SRR493367, [/my/data/SRR493367_1.fastq, /my/data/SRR493367_2.fastq]]
[SRR493368, [/my/data/SRR493368_1.fastq, /my/data/SRR493368_2.fastq]]
[SRR493369, [/my/data/SRR493369_1.fastq, /my/data/SRR493369_2.fastq]]
[SRR493370, [/my/data/SRR493370_1.fastq, /my/data/SRR493370_2.fastq]]
[SRR493371, [/my/data/SRR493371_1.fastq, /my/data/SRR493371_2.fastq]]
The glob pattern must contain at least a star wildcard character.
Table 2. Available options

  type          Type of paths returned, either file, dir or any (default: file)

  hidden        When true includes hidden files in the resulting paths (default: false)

  maxDepth      Maximum number of directory levels to visit (default: no limit)

  followLinks   When true follows symbolic links during directory tree traversal, otherwise they are managed as files (default: true)

  size          Defines the number of files each emitted item is expected to hold (default: 2). Set to -1 for any.

  flat          When true the matching files are produced as sole elements in the emitted tuples (default: false)

  checkIfExists When true throws an exception if the specified path does not exist in the file system (default: false)

Exercise

Use the fromFilePairs method to create a channel emitting all the fastq read pairs in the data/ggal/ directory, and print them.

Then use the flat:true option and compare the output with the previous execution.
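
A sketch of both variants:

Channel
    .fromFilePairs( 'data/ggal/*_{1,2}.fq' )
    .println()

Channel
    .fromFilePairs( 'data/ggal/*_{1,2}.fq', flat: true )
    .println()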

6. Operators

  • Operators are built-in functions applied to channels

  • They transform channel content

  • They can also be used to filter, fork and combine channels

6.1. Basic example

nums = Channel.from(1,2,3,4)         (1)
square = nums.map { it -> it * it }  (2)
square.println()                     (3)
1 Create a queue channel emitting four values.
2 Create a new channel, transforming each number into its square.
3 Print the channel content.

Operators can be chained to implement custom behaviors:

Channel.from(1,2,3,4)
        .map { it -> it * it }
        .println()

Operators can be separated into the following groups:

  • Filtering operators

  • Transforming operators

  • Splitting operators

  • Combining operators

  • Forking operators

  • Maths operators

6.2. Basic operators

6.2.1. println

The println operator prints the items emitted by a channel to the console standard output, appending a new line character to each of them. For example:

Channel
      .from('foo', 'bar', 'baz', 'qux')
      .println()

It prints:

foo
bar
baz
qux

An optional closure parameter can be specified to customise how items are printed. For example:

Channel
      .from('foo', 'bar', 'baz', 'qux')
      .println { "- $it" }

It prints:

- foo
- bar
- baz
- qux
println is a terminal operator, i.e. it does not allow chaining with other operators.

6.2.2. map

The map operator applies a function of your choosing to every item emitted by a channel, and returns the items so obtained as a new channel. The function applied is called the mapping function and is expressed with a closure as shown in the example below:

Channel
    .from( 'hello', 'world' )
    .map { it -> it.reverse() }
    .println()

A map can associate each element with a generic tuple containing any data as needed.

Channel
    .from( 'hello', 'world' )
    .map { word -> [word, word.size()] }
    .println { word, len -> "$word contains $len letters" }
Exercise

Use fromPath to create a channel emitting the fastq files matching the pattern data/ggal/*.fq, then chain with a map to return a pair containing the file name and the path itself. Finally print the resulting channel.

Channel.fromPath('data/ggal/*.fq')
        .map { file -> [ file.name, file ] }
        .println { name, file -> "> file: $name" }

6.2.3. into

The into operator connects a source channel to two or more target channels in such a way that the values emitted by the source channel are copied to the target channels. For example:

Channel
     .from( 'a', 'b', 'c' )
     .into{ foo; bar }

foo.println{ "Foo emits: " + it }
bar.println{ "Bar emits: " + it }
Note the use of curly brackets and of the ; as the channel name separator in this example. This is needed because the actual parameter of into is a closure, which defines the target channels to which the source one is connected.

6.2.4. mix

The mix operator combines the items emitted by two (or more) channels into a single channel.

c1 = Channel.from( 1,2,3 )
c2 = Channel.from( 'a','b' )
c3 = Channel.from( 'z' )

c1.mix(c2,c3).println()
1
2
a
3
b
z
The items in the resulting channel have the same order as in their respective original channels; however, there is no guarantee that the elements of the second channel are appended after the elements of the first. Indeed, in the above example the element a has been printed before 3.

6.2.5. flatten

The flatten operator transforms a channel in such a way that every tuple is flattened so that each single entry is emitted as a sole element by the resulting channel.

foo = [1,2,3]
bar = [4, 5, 6]

Channel
    .from(foo, bar)
    .flatten()
    .println()

The above snippet prints:

1
2
3
4
5
6

6.2.6. collect

The collect operator collects all the items emitted by a channel into a list and returns the resulting object as a sole emission.

Channel
    .from( 1, 2, 3, 4 )
    .collect()
    .println()

It prints a single value:

[1,2,3,4]
The result of the collect operator is a value channel.

6.2.7. groupTuple

The groupTuple operator collects tuples (or lists) of values emitted by the source channel, grouping together the elements that share the same key. Finally, it emits a new tuple object for each distinct key collected.

Try the following example:

Channel
     .from( [1,'A'], [1,'B'], [2,'C'], [3, 'B'], [1,'C'], [2, 'A'], [3, 'D'] )
     .groupTuple()
     .println()

It shows:

[1, [A, B, C]]
[2, [C, A]]
[3, [B, D]]

This operator is useful to process together all the elements that share a common property or grouping key.

Exercise

Use fromPath to create a channel emitting the fastq files matching the pattern data/ggal/*.fq, then use map to associate each file with its name prefix. Finally, group together all the files that share the same prefix.
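
One possible solution, assuming the prefix is the part of the file name before the first underscore (e.g. gut for gut_1.fq):

Channel
    .fromPath( 'data/ggal/*.fq' )
    .map { file -> [ file.name.tokenize('_')[0], file ] }
    .groupTuple()
    .println()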

6.2.8. join

The join operator creates a channel that joins together the items emitted by two channels for which there exists a matching key. The key is defined, by default, as the first element in each item emitted.

left = Channel.from(['X', 1], ['Y', 2], ['Z', 3], ['P', 7])
right= Channel.from(['Z', 6], ['Y', 5], ['X', 4])
left.join(right).println()

The resulting channel emits:

[Z, 3, 6]
[Y, 2, 5]
[X, 1, 4]

6.2.9. subscribe

The subscribe operator allows you to execute a user-defined function each time a new value is emitted by the source channel.

The emitted value is passed implicitly to the specified function. For example:

// define a channel emitting three values
source = Channel.from ( 'alpha', 'beta', 'delta' )

// subscribe a function to the channel printing the emitted values
source.subscribe {  println "Got: $it"  }

It prints:

Got: alpha
Got: beta
Got: delta
subscribe is a terminal operator, i.e. it does not allow chaining with other operators.

6.3. More resources

Check the operators documentation on the Nextflow web site.

7. Processes

A process is the basic Nextflow computing primitive used to execute foreign functions, i.e. custom scripts or tools.

The process definition starts with the keyword process, followed by the process name and finally the process body delimited by curly brackets. The process body must contain a string which represents the command or, more generally, a script that is executed by it.

A basic process looks like the following example:

process sayHello {
  """
  echo 'Hello world!'
  """
}

A process may contain five definition blocks, respectively: directives, inputs, outputs, when clause and finally the process script. The syntax is defined as follows:

process < name > {
  [ directives ]        (1)
  input:                (2)
  < process inputs >
  output:               (3)
  < process outputs >
  when:                 (4)
  < condition >
  [script|shell|exec]:  (5)
  < user script to be executed >
}
1 Zero, one or more process directives
2 Zero, one or more process inputs
3 Zero, one or more process outputs
4 An optional boolean conditional to trigger the process execution
5 The command to be executed

7.1. Script

The script block is a string statement that defines the command that is executed by the process to carry out its task.

A process contains one and only one script block, and it must be the last statement when the process contains input and output declarations.

The script block can be a simple string or a multi-line string. The latter simplifies the writing of non-trivial scripts composed of multiple commands spanning multiple lines. For example:

process example {
    script:
    """
    blastp -db /data/blast -query query.fa -outfmt 6 > blast_result
    cat blast_result | head -n 10 | cut -f 2 > top_hits
    blastdbcmd -db /data/blast -entry_batch top_hits > sequences
    """
}

By default the process command is interpreted as a Bash script. However, any other scripting language can be used by simply starting the script with the corresponding shebang declaration. For example:

process pyStuff {
  script:
  """
  #!/usr/bin/env python

  x = 'Hello'
  y = 'world!'
  print "%s - %s" % (x,y)
  """
}
This allows tasks written in different programming languages, which may better fit a particular job, to be composed in the same workflow script. However, for large chunks of code it is suggested to save them in separate files and invoke them from the process script.

7.1.1. Script variables

A process script can be defined dynamically using variable values, like in any other string.

params.data = 'World'

process foo {
  script:
  """
  echo Hello $params.data
  """
}
Since Nextflow uses the same Bash syntax for variable substitutions in strings, Bash environment variables need to be escaped using the \ character.
process foo {
  script:
  """
  echo \$PATH | tr : '\\n'
  """
}

This can be tricky when the script uses many Bash variables. A possible alternative is to use a script string delimited by single-quote characters.

process bar {
  script:
  '''
  echo $PATH | tr : '\\n'
  '''
}

However this won’t allow any more the usage of Nextflow variables in the command script.

Another alternative is to use a shell statement instead of script, which uses a different syntax for Nextflow variables: !{..}. This allows you to use both Nextflow and Bash variables in the same script.

params.data = 'le monde'

process baz {
  shell:
  '''
  X='Bonjour'
  echo $X !{params.data}
  '''
}
Exercise

Write a Nextflow script, putting together the above examples, that prints both a Nextflow variable and a Bash variable using both the script and the shell syntax.

7.1.2. Conditional script

The process script can also be defined in a completely dynamic manner using an if statement or any other expression evaluating to a string value. For example:

params.aligner = 'kallisto'

process foo {
  script:
  if( params.aligner == 'kallisto' )
    """
    kallisto --reads /some/data.fastq
    """
  else if( params.aligner == 'salmon' )
    """
    salmon --reads /some/data.fastq
    """
  else
    throw new IllegalArgumentException("Unknown aligner $params.aligner")
}
Exercise

Write a custom function that, given the aligner name as a parameter, returns the command string to be executed. Then use this function as the process script body.
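
A minimal sketch of such a function; the name commandFor is illustrative and the command lines follow the previous example:

def commandFor( aligner ) {
    if( aligner == 'kallisto' )
        return "kallisto --reads /some/data.fastq"
    if( aligner == 'salmon' )
        return "salmon --reads /some/data.fastq"
    throw new IllegalArgumentException("Unknown aligner $aligner")
}

process foo {
  script:
  commandFor(params.aligner)
}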

7.2. Inputs

Nextflow processes are isolated from each other but can communicate between themselves by sending values through channels.

The input block defines which channels the process expects to receive input data from. You can only define one input block at a time, and it must contain one or more input declarations.

The input block follows the syntax shown below:

input:
  <input qualifier> <input name> [from <source channel>]

7.2.1. Input values

The val qualifier allows you to receive data of any type as input. It can be accessed in the process script by using the specified input name, as shown in the following example:

num = Channel.from( 1, 2, 3 )

process basicExample {
  input:
  val x from num

  """
  echo process job $x
  """
}

In the above example the process is executed three times; each time a value is received from the channel num and used by the script. Thus, it results in an output similar to the one shown below:

process job 3
process job 1
process job 2
The channel guarantees that items are delivered in the same order as they have been sent; but, since the process is executed in a parallel manner, there is no guarantee that they are processed in the same order as they are received.

7.2.2. Input files

The file qualifier allows the handling of file values in the process execution context. This means that Nextflow will stage the file in the process execution directory, and it can be accessed in the script by using the name specified in the input declaration.

reads = Channel.fromPath( 'data/ggal/*.fq' )

process foo {
    input:
    file 'sample.fastq' from reads
    script:
    """
    your_command --reads sample.fastq
    """
}

The input file name can also be defined using a variable reference as shown below:

reads = Channel.fromPath( 'data/ggal/*.fq' )

process foo {
    input:
    file sample from reads
    script:
    """
    your_command --reads $sample
    """
}

The same syntax can also handle more than one input file in the same execution; only the channel composition needs to change.

reads = Channel.fromPath( 'data/ggal/*.fq' )

process foo {
    input:
    file sample from reads.collect()
    script:
    """
    your_command --reads $sample
    """
}
Exercise

Write a script that merges all the read files matching the pattern data/ggal/*_1.fq into a single file. Then print its first 20 lines.
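
A minimal sketch of one possible solution; the process and channel names are illustrative:

reads_ch = Channel.fromPath( 'data/ggal/*_1.fq' )

process concat {
    input:
    file reads from reads_ch.collect()

    output:
    file 'merged.fq' into merged_ch

    script:
    """
    cat $reads > merged.fq
    """
}

merged_ch.println { it.text.readLines().take(20).join('\n') }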

7.2.3. Combine input channels

A key feature of processes is the ability to handle inputs from multiple channels. However, it's important to understand how the content of the channels and their semantics affect the execution of a process.

Consider the following example:

process foo {
  echo true
  input:
  val x from Channel.from(1,2,3)
  val y from Channel.from('a','b','c')
  script:
   """
   echo $x and $y
   """
}

Both channels emit three values, therefore the process is executed three times, each time with a different pair:

  • (1, a)

  • (2, b)

  • (3, c)

What is happening is that the process waits until there’s a complete input configuration i.e. it receives an input value from all the channels declared as input.

When this condition is verified, it consumes the input values coming from the respective channels, spawns a task execution, then repeats the same logic until one or more channels have no more content.

This means channel values are consumed serially one after another and the first empty channel causes the process execution to stop, even if there are other values in other channels.

What happens when the channels do not have the same cardinality (i.e. they emit a different number of elements)?

For example:

process foo {
  echo true
  input:
  val x from Channel.from(1,2)
  val y from Channel.from('a','b','c','d')
  script:
   """
   echo $x and $y
   """
}

In the above example the process is executed only two times, because when a channel has no more data to be processed it stops the process execution.

Note however that value channels do not affect the process termination.

To better understand this behavior compare the previous example with the following one:

process bar {
  echo true
  input:
  val x from Channel.value(1)
  val y from Channel.from('a','b','c')
  script:
   """
   echo $x and $y
   """
}
Exercise

Write a process that is executed for each read file matching the pattern data/ggal/*_1.fq and uses the same data/ggal/transcriptome.fa in each execution.
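
A sketch of one possible solution, following the pattern used in script2.nf; the file object can be read by every task without being consumed, and your_command is a placeholder:

reads_ch = Channel.fromPath( 'data/ggal/*_1.fq' )
transcriptome_file = file( 'data/ggal/transcriptome.fa' )

process foo {
    echo true

    input:
    file read from reads_ch
    file transcriptome from transcriptome_file

    script:
    """
    echo your_command --reads $read --index $transcriptome
    """
}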

7.2.4. Input repeaters

The each qualifier allows you to repeat the execution of a process for each item in a collection, every time new data is received. For example:

sequences = Channel.fromPath('*.fa')
methods = ['regular', 'expresso', 'psicoffee']

process alignSequences {
  input:
  file seq from sequences
  each mode from methods

  """
  t_coffee -in $seq -mode $mode
  """
}

In the above example every time a file of sequences is received as input by the process, it executes three tasks running a T-coffee alignment with a different value for the mode parameter. This is useful when you need to repeat the same task for a given set of parameters.

Exercise

Extend the previous example so that a task is executed for each read file matching the pattern data/ggal/*_1.fq, repeating the same task with both salmon and kallisto.

7.3. Outputs

The output declaration block allows you to define the channels used by the process to send out the results produced.

At most one output block can be defined, and it can contain one or more output declarations. The output block follows the syntax shown below:

output:
  <output qualifier> <output name> [into <target channel>[,channel,..]]

7.3.1. Output values

The val qualifier allows you to output a value defined in the script context. In a common usage scenario, this is a value which has been defined in the input declaration block, as shown in the following example:

methods = ['prot','dna', 'rna']

process foo {
  input:
  val x from methods

  output:
  val x into receiver

  """
  echo $x > file
  """
}

receiver.println { "Received: $it" }

7.3.2. Output files

The file qualifier allows you to output one or more files, produced by the process, over the specified channel.

process randomNum {

    output:
    file 'result.txt' into numbers

    '''
    echo $RANDOM > result.txt
    '''
}

numbers.println { "Received: " + it.text }

In the above example the process randomNum creates a file named result.txt containing a random number.

Since a file parameter using the same name is declared in the output block, when the task is completed that file is sent over the numbers channel. A downstream process declaring the same channel as input will be able to receive it.

7.3.3. Multiple output files

When an output file name contains a * or ? wildcard character, it is interpreted as a glob path matcher. This allows multiple files to be captured into a list object and output as a sole emission. For example:

process splitLetters {

    output:
    file 'chunk_*' into letters

    '''
    printf 'Hola' | split -b 1 - chunk_
    '''
}

letters
    .flatMap()
    .println { "File: ${it.name} => ${it.text}" }

It prints:

File: chunk_aa => H
File: chunk_ab => o
File: chunk_ac => l
File: chunk_ad => a

Some caveats on glob pattern behavior:

  • Input files are not included in the list of possible matches.

  • A glob pattern matches against both file and directory paths.

  • When a two-star pattern ** is used to recurse through directories, only file paths are matched, i.e. directories are not included in the result list.

Exercise

Remove the flatMap operator and see how the output changes. The documentation for the flatMap operator is available at this link.

7.3.4. Dynamic output file names

When an output file name needs to be expressed dynamically, it is possible to define it using a dynamically evaluated string which references values defined in the input declaration block or in the script global context. For example:

process align {
  input:
  val x from species
  file seq from sequences

  output:
  file "${x}.aln" into genomes

  """
  t_coffee -in $seq > ${x}.aln
  """
}

In the above example, each time the process is executed an alignment file is produced whose name depends on the actual value of the x input.

7.3.5. Composite inputs and outputs

So far we have seen how to declare multiple input and output channels, but each channel was handling only one value at a time. However, Nextflow can handle tuples of values.

When using a channel emitting tuples of values, the corresponding input declaration must use a set qualifier, followed by the definition of each element in the tuple.

In the same manner, an output channel emitting tuples of values can be declared using the set qualifier, followed by the definition of each element in the tuple.

reads_ch = Channel.fromFilePairs('data/ggal/*_{1,2}.fq')

process foo {
  input:
    set val(sample_id), file(sample_files) from reads_ch
  output:
    set val(sample_id), file('sample.bam') into bam_ch
  script:
  """
    your_command_here --reads $sample_id > sample.bam
  """
}

bam_ch.println()
Exercise

Modify the script of the previous exercise so that the bam file is named as the given sample_id.
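
A sketch of the required change; only the output declaration and the output file name in the script need to be modified:

output:
  set val(sample_id), file("${sample_id}.bam") into bam_ch
script:
"""
  your_command_here --reads $sample_id > ${sample_id}.bam
"""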

7.4. When

The when declaration allows you to define a condition that must be verified in order to execute the process. This can be any expression that evaluates to a boolean value.

It is useful to enable/disable the process execution depending on the state of various inputs and parameters. For example:

params.dbtype = 'nr'
params.prot = 'data/prots/*.tfa'
proteins = Channel.fromPath(params.prot)

process find {
  input:
  file fasta from proteins
  val type from params.dbtype

  when:
  fasta.name =~ /^BB11.*/ && type == 'nr'

  script:
  """
  blastp -query $fasta -db nr
  """
}

7.5. Directives

Directive declarations allow the definition of optional settings that affect the execution of the current process without affecting the semantics of the task itself.

They must be entered at the top of the process body, before any other declaration blocks (i.e. input, output, etc).

Directives are commonly used to define the amount of computing resources to be used, or other meta directives that allow the definition of extra information for configuration or logging purposes. For example:

process foo {
  cpus 2
  memory 8.GB
  container 'image/name'

  script:
  """
  your_command --this --that
  """
}

The complete list of directives is available at this link.

Exercise

Modify the script of the previous exercise, adding a tag directive that logs the sample_id in the execution output.