Skip to content

Sample sheet channel manipulation examples

Introduction

Understanding channel structure and manipulation is critical for getting the most out of Nextflow. nf-validation helps initialise your channels from the text inputs to get you started, but further work might be required to fit your exact use case. In this page we run through some common cases for transforming the output of .fromSamplesheet.

Glossary

  • A channel is the Nextflow object, referenced in the code
  • An item is each thing passing through the channel, equivalent to one row in the samplesheet
  • An element is each thing in the item, e.g., the meta value, fastq_1 etc. It may be a file or value

Default mode

Each item in the channel emitted by .fromSamplesheet() is a flat tuple, corresponding with each row of the samplesheet. Each item will be composed of a meta value (if present) and any additional elements from columns in the samplesheet, e.g.:

sample,fastq_1,fastq_2,bed
sample1,fastq1.R1.fq.gz,fastq1.R2.fq.gz,sample1.bed
sample2,fastq2.R1.fq.gz,fastq2.R2.fq.gz,

Might create a channel where each element consists of 4 items, a map value followed by three files:

// Columns:
[ val([ sample: sample ]), file(fastq1), file(fastq2), file(bed) ]

// Resulting in:
[ [ id: "sample" ], fastq1.R1.fq.gz, fastq1.R2.fq.gz, sample1.bed]
[ [ id: "sample2" ], fastq2.R1.fq.gz, fastq2.R2.fq.gz, [] ] // A missing value from the samplesheet is an empty list

This channel can be used as input of a process where the input declaration is:

tuple val(meta), path(fastq_1), path(fastq_2), path(bed)

It may be necessary to manipulate this channel to fit your process inputs. For more documentation, check out the Nextflow operator docs, however here are some common use cases with .fromSamplesheet().

Using a samplesheet with no headers

Sometimes you only have one possible input in the pipeline samplesheet. In this case it doesn't make sense to have a header in the samplesheet. This can be done by creating a samplesheet with an empty string as input key:

{
  "$schema": "http://json-schema.org/draft-07/schema",
  "description": "Schema for the file provided with params.input",
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "": {
        "type": "string"
      }
    }
  }
}

When using samplesheets like this CSV file:

test_1
test_2

or this YAML file:

- test_1
- test_2

The output of .fromSamplesheet() will look like this:

[test_1]
[test_2]

Changing the structure of channel items

Each item in the channel will be a flat tuple, but some processes will use multiple files as a list in their input channel, this is common in nf-core modules. For example, consider the following input declaration in a process, where FASTQ could be > 1 file:

process ZCAT_FASTQS {
    input:
        tuple val(meta), path(fastq)

    """
    zcat $fastq
    """
}

The output of .fromSamplesheet() can be used by default with a process with the following input declaration:

val(meta), path(fastq_1), path(fastq_2)

To manipulate each item within a channel, you should use the Nextflow .map() operator. This will apply a function to each element of the channel in turn. Here, we convert the flat tuple into a tuple composed of a meta and a list of FASTQ files:

Channel.fromSamplesheet("input")
    .map { meta, fastq_1, fastq_2 -> tuple(meta, [ fastq_1, fastq_2 ]) }
    .set { input }

input.view() // Channel has 2 elements: meta, fastqs

This is now compatible with the process defined above and will not raise a warning about input cardinality:

ZCAT_FASTQS(input)

Removing elements in channel items

For example, to remove the BED file from the channel created above, we could not return it from the map. Note the absence of the bed item in the return of the closure below:

Channel.fromSamplesheet("input")
    .map { meta, fastq_1, fastq_2, bed -> tuple(meta, fastq_1, fastq_2) }
    .set { input }

input.view() // Channel has 3 elements: meta, fastq_1, fastq_2

In this way you can drop items from a channel.

Separating channel items

We could perform this twice to create one channel containing the FASTQs and one containing the BED files, however Nextflow has a native operator to separate channels called .multiMap(). Here, we separate the FASTQs and BEDs into two separate channels using multiMap. Note, the channels are both contained in input and accessed as an attribute using dot notation:

Channel.fromSamplesheet("input")
    .multiMap { meta, fastq_1, fastq_2, bed ->
        fastq: tuple(meta, fastq_1, fastq_2)
        bed:   tuple(meta, bed)
    }
    .set { input }

The channel has two attributes, fastq and bed, which can be accessed separately.

input.fastq.view() // Channel has 3 elements: meta, fastq_1, fastq_2
input.bed.view()   // Channel has 2 elements: meta, bed

Importantly, multiMap applies to every item in the channel and returns an item to both channels for every input, i.e. input, input.fastq and input.bed all contain the same number of items, however each item will be different.

Separate items based on a condition

You can use the .branch() operator to separate the channel entries based on a condition. This is especially useful when you can get multiple types of input data.

This example shows a channel which can have entries for WES or WGS data. WES data includes a BED file denoting the target regions, but WGS data does not. These analysis are different so we want to separate the WES and WGS entries from each other. We can separate the two using .branch based on the presence of the BED file:

// Channel with four elements - see docs for examples
params.input = "samplesheet.csv"

Channel.fromSamplesheet("input")
    .branch { meta, fastq_1, fastq_2, bed ->
        // If BED does not exist
        WGS: !bed
            return [meta, fastq_1, fastq_2]
        // If BED exists
        WES: bed
            // The original channel structure will be used when no return statement is used.
    }
    .set { input }

input.WGS.view() // Channel has 3 elements: meta, fastq_1, fastq_2
input.WES.view() // Channel has 4 elements: meta, fastq_1, fastq_2, bed

Unlike multiMap, the outputs of .branch(), the resulting channels will contain a different number of items.

Combining a channel

After splitting the channel, it may be necessary to rejoin the channel. There are many ways to join a channel, but here we will demonstrate the simplest which uses the Nextflow join operator to rejoin any of the channels from above based on the first element in each item, the meta value.

input.fastq.view() // Channel has 3 elements: meta, fastq_1, fastq_2
input.bed.view()   // Channel has 2 elements: meta, bed

input.fastq
    .join( input.bed )
    .set { input_joined }

input_joined.view() // Channel has 4 elements: meta, fastq_1, fastq_2, bed

Count items with a common value

This example is based on this code from Marcel Ribeiro-Dantas.

It's useful to determine the count of channel entries with similar values when you want to merge them later on (to prevent pipeline bottlenecks with .groupTuple()).

This example contains a channel where multiple samples can be in the same family. Later on in the pipeline we want to merge the analyzed files so one file gets created for each family. The result will be a channel with an extra meta field containing the count of channel entries with the same family name.

// channel created by fromSamplesheet() previous to modification:
// [[id:example1, family:family1], example1.txt]
// [[id:example2, family:family1], example2.txt]
// [[id:example3, family:family2], example3.txt]

params.input = "samplesheet.csv"

Channel.fromSamplesheet("input")
    .tap { ch_raw }                       // Create a copy of the original channel
    .map { meta, txt -> [ meta.family ] } // Isolate the value to count on
    .reduce([:]) { counts, family ->      // Creates a map like this: [family1:2, family2:1]
        counts[family] = (counts[family] ?: 0) + 1
        counts
    }
    .combine(ch_raw)                     // Add the count map to the original channel
    .map { counts, meta, txt ->          // Add the counts of the current family to the meta
        new_meta = meta + [count:counts[meta.family]]
        [ new_meta, txt ]
    }
    .set { input }

input.view()
// [[id:example1, family:family1, count:2], example1.txt]
// [[id:example2, family:family1, count:2], example2.txt]
// [[id:example3, family:family2, count:1], example3.txt]