Sample sheet channel manipulation examples
Introduction
Understanding channel structure and manipulation is critical for getting the most out of Nextflow. nf-validation helps initialise your channels from the text inputs to get you started, but further work might be required to fit your exact use case. In this page we run through some common cases for transforming the output of .fromSamplesheet
.
Glossary
- A channel is the Nextflow object, referenced in the code
- An item is each thing passing through the channel, equivalent to one row in the samplesheet
- An element is each thing in the item, e.g., the meta value, fastq_1 etc. It may be a file or value
Default mode
Each item in the channel emitted by .fromSamplesheet()
is a flat tuple, corresponding with each row of the samplesheet. Each item will be composed of a meta value (if present) and any additional elements from columns in the samplesheet, e.g.:
sample,fastq_1,fastq_2,bed
sample1,fastq1.R1.fq.gz,fastq1.R2.fq.gz,sample1.bed
sample2,fastq2.R1.fq.gz,fastq2.R2.fq.gz,
Might create a channel where each element consists of 4 items, a map value followed by three files:
// Columns:
[ val([ sample: sample ]), file(fastq1), file(fastq2), file(bed) ]
// Resulting in:
[ [ id: "sample" ], fastq1.R1.fq.gz, fastq1.R2.fq.gz, sample1.bed]
[ [ id: "sample2" ], fastq2.R1.fq.gz, fastq2.R2.fq.gz, [] ] // A missing value from the samplesheet is an empty list
This channel can be used as input of a process where the input declaration is:
It may be necessary to manipulate this channel to fit your process inputs. For more documentation, check out the Nextflow operator docs, however here are some common use cases with .fromSamplesheet()
.
Using a samplesheet with no headers
Sometimes you only have one possible input in the pipeline samplesheet. In this case it doesn't make sense to have a header in the samplesheet. This can be done by creating a samplesheet with an empty string as input key:
{
"$schema": "http://json-schema.org/draft-07/schema",
"description": "Schema for the file provided with params.input",
"type": "array",
"items": {
"type": "object",
"properties": {
"": {
"type": "string"
}
}
}
}
When using samplesheets like this CSV file:
or this YAML file:
The output of .fromSamplesheet()
will look like this:
Changing the structure of channel items
Each item in the channel will be a flat tuple, but some processes will use multiple files as a list in their input channel, this is common in nf-core modules. For example, consider the following input declaration in a process, where FASTQ could be > 1 file:
The output of .fromSamplesheet()
can be used by default with a process with the following input declaration:
To manipulate each item within a channel, you should use the Nextflow .map()
operator. This will apply a function to each element of the channel in turn. Here, we convert the flat tuple into a tuple composed of a meta and a list of FASTQ files:
Channel.fromSamplesheet("input")
.map { meta, fastq_1, fastq_2 -> tuple(meta, [ fastq_1, fastq_2 ]) }
.set { input }
input.view() // Channel has 2 elements: meta, fastqs
This is now compatible with the process defined above and will not raise a warning about input cardinality:
Removing elements in channel items
For example, to remove the BED file from the channel created above, we could not return it from the map. Note the absence of the bed
item in the return of the closure below:
Channel.fromSamplesheet("input")
.map { meta, fastq_1, fastq_2, bed -> tuple(meta, fastq_1, fastq_2) }
.set { input }
input.view() // Channel has 3 elements: meta, fastq_1, fastq_2
In this way you can drop items from a channel.
Separating channel items
We could perform this twice to create one channel containing the FASTQs and one containing the BED files, however Nextflow has a native operator to separate channels called .multiMap()
. Here, we separate the FASTQs and BEDs into two separate channels using multiMap
. Note, the channels are both contained in input
and accessed as an attribute using dot notation:
Channel.fromSamplesheet("input")
.multiMap { meta, fastq_1, fastq_2, bed ->
fastq: tuple(meta, fastq_1, fastq_2)
bed: tuple(meta, bed)
}
.set { input }
The channel has two attributes, fastq
and bed
, which can be accessed separately.
input.fastq.view() // Channel has 3 elements: meta, fastq_1, fastq_2
input.bed.view() // Channel has 2 elements: meta, bed
Importantly, multiMap
applies to every item in the channel and returns an item to both channels for every input, i.e. input
, input.fastq
and input.bed
all contain the same number of items, however each item will be different.
Separate items based on a condition
You can use the .branch()
operator to separate the channel entries based on a condition. This is especially useful when you can get multiple types of input data.
This example shows a channel which can have entries for WES or WGS data. WES data includes a BED file denoting the target regions, but WGS data does not. These analysis are different so we want to separate the WES and WGS entries from each other. We can separate the two using .branch
based on the presence of the BED file:
// Channel with four elements - see docs for examples
params.input = "samplesheet.csv"
Channel.fromSamplesheet("input")
.branch { meta, fastq_1, fastq_2, bed ->
// If BED does not exist
WGS: !bed
return [meta, fastq_1, fastq_2]
// If BED exists
WES: bed
// The original channel structure will be used when no return statement is used.
}
.set { input }
input.WGS.view() // Channel has 3 elements: meta, fastq_1, fastq_2
input.WES.view() // Channel has 4 elements: meta, fastq_1, fastq_2, bed
Unlike multiMap
, the outputs of .branch()
, the resulting channels will contain a different number of items.
Combining a channel
After splitting the channel, it may be necessary to rejoin the channel. There are many ways to join a channel, but here we will demonstrate the simplest which uses the Nextflow join operator to rejoin any of the channels from above based on the first element in each item, the meta
value.
input.fastq.view() // Channel has 3 elements: meta, fastq_1, fastq_2
input.bed.view() // Channel has 2 elements: meta, bed
input.fastq
.join( input.bed )
.set { input_joined }
input_joined.view() // Channel has 4 elements: meta, fastq_1, fastq_2, bed
Count items with a common value
This example is based on this code from Marcel Ribeiro-Dantas.
It's useful to determine the count of channel entries with similar values when you want to merge them later on (to prevent pipeline bottlenecks with .groupTuple()
).
This example contains a channel where multiple samples can be in the same family. Later on in the pipeline we want to merge the analyzed files so one file gets created for each family. The result will be a channel with an extra meta field containing the count of channel entries with the same family name.
// channel created by fromSamplesheet() previous to modification:
// [[id:example1, family:family1], example1.txt]
// [[id:example2, family:family1], example2.txt]
// [[id:example3, family:family2], example3.txt]
params.input = "samplesheet.csv"
Channel.fromSamplesheet("input")
.tap { ch_raw } // Create a copy of the original channel
.map { meta, txt -> [ meta.family ] } // Isolate the value to count on
.reduce([:]) { counts, family -> // Creates a map like this: [family1:2, family2:1]
counts[family] = (counts[family] ?: 0) + 1
counts
}
.combine(ch_raw) // Add the count map to the original channel
.map { counts, meta, txt -> // Add the counts of the current family to the meta
new_meta = meta + [count:counts[meta.family]]
[ new_meta, txt ]
}
.set { input }
input.view()
// [[id:example1, family:family1, count:2], example1.txt]
// [[id:example2, family:family1, count:2], example2.txt]
// [[id:example3, family:family2, count:1], example3.txt]