Nextflow

What is Nextflow?

Nextflow is an open-source workflow management system that enables scalable and reproducible scientific workflows. It was created by the Centre for Genomic Regulation (CRG) and is particularly popular in bioinformatics, genomics, and data science. Nextflow uses a domain-specific language (DSL) based on Groovy to define workflows as code.

Nextflow is particularly well-suited for:

  • Bioinformatics: Genomics pipelines, sequence analysis, variant calling
  • Data Science: Large-scale data processing, ETL workflows
  • Scientific Computing: Reproducible research workflows, computational experiments
  • HPC Workflows: Workflows that need to run on high-performance computing clusters

Key features of Nextflow include:

  • Reproducibility: Built-in support for containers (Docker, Apptainer/Singularity) and Conda environments
  • Portability: Workflows can run on multiple execution platforms (local, HPC, cloud)
  • Scalability: Automatic parallelization and distributed execution
  • Resume capability: Can resume failed workflows from the last successful step
  • Rich ecosystem: Large collection of pre-built workflows and modules
  • DSL-based: Workflows defined in a readable, maintainable language

Setup

Installation

Install Nextflow using one of the following methods:

Using conda:

conda install -c bioconda nextflow

Using Homebrew (macOS):

brew install nextflow

Manual installation:

curl -s https://get.nextflow.io | bash

Or, using wget, and then moving the launcher onto your PATH:

wget -qO- https://get.nextflow.io | bash
chmod +x nextflow
sudo mv nextflow /usr/local/bin/

Using Docker:

docker pull nextflow/nextflow

Verify Installation

Verify that Nextflow is installed correctly:

nextflow -version

Requirements

Nextflow requires:

  • Java 11 or later (Java 17+ recommended; the exact minimum varies by Nextflow version)
  • Bash shell
  • Common Unix utilities such as curl or wget

Check Java version:

java -version

Configuration

Nextflow configuration is typically stored in ~/.nextflow/config or nextflow.config in your project directory. The configuration file allows you to:

  • Define execution profiles (local, HPC, cloud)
  • Configure compute resources
  • Set up container engines (Docker, Apptainer/Singularity)
  • Configure data storage and caching

Once Nextflow is installed, you can:

  • Create your first workflow (see “Getting Started” section below)
  • Run example workflows from the Nextflow community
  • Check out the example scripts in the examples folder

Getting Started

Understanding Nextflow Workflows

Nextflow workflows are defined using a domain-specific language (DSL) based on Groovy. Key concepts include:

  • Process: A process is a task that executes a command or script. Processes are the building blocks of workflows.
  • Channel: Channels are asynchronous queues that connect processes and pass data between them.
  • Workflow: A workflow defines the execution flow by connecting processes through channels.
  • Operators: Operators transform and manipulate channels (e.g., map, filter, collect).

Processes are defined with:

  • Input channels
  • Output channels
  • Script/command to execute
  • Resource requirements (CPU, memory, etc.)
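Putting those pieces together, a single process might look like the following sketch (the count_lines name and file names are illustrative, not part of any real pipeline):

```groovy
// Hypothetical process illustrating the four components above
process count_lines {
    // Resource requirements, declared as directives
    cpus 2
    memory "2 GB"

    // Input channel: one file per task
    input:
    path input_file

    // Output channel: the file produced by the command
    output:
    path "counts.txt"

    // Command to execute
    script:
    """
    wc -l ${input_file} > counts.txt
    """
}
```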

A Simple Workflow

Assuming that you’re in the top level directory of the cloned GitHub repo, change to the examples folder with this command:

cd examples/nextflow

Here’s a basic Nextflow workflow that processes data:

#!/usr/bin/env nextflow

// Define workflow parameters
params.input = "data.txt"
params.output = "results"

// Print workflow information (a script may contain only one unnamed
// workflow block, so these run as top-level statements)
log.info "Running simple data processing workflow"
log.info "Input: ${params.input}"
log.info "Output: ${params.output}"

// Process 1: Extract data
process extract {
    input:
    path input_file
    
    output:
    path "extracted.txt"
    
    script:
    """
    echo "Extracting data from ${input_file}"
    head -n 10 ${input_file} > extracted.txt
    """
}

// Process 2: Transform data
process transform {
    input:
    path input_file
    
    output:
    path "transformed.txt"
    
    script:
    """
    echo "Transforming data"
    awk '{print \$1 * 2}' ${input_file} > transformed.txt
    """
}

// Process 3: Load data
process load {
    input:
    path input_file
    
    output:
    path "loaded.txt"
    
    script:
    """
    echo "Loading data"
    wc -l ${input_file} > loaded.txt
    """
}

// Define the workflow
workflow {
    // Create input channel
    input_ch = Channel.fromPath(params.input)
    
    // Chain processes
    extract(input_ch)
    transform(extract.out)
    load(transform.out)
    
    // Print results
    load.out.view()
}

Save this file as simple_pipeline.nf. Run it with:

nextflow run simple_pipeline.nf --input data.txt

Key points:

  • Processes are defined with process keyword
  • Inputs and outputs are declared with input: and output: blocks
  • The script: block contains the command to execute
  • Workflow connects processes using channels
  • Parameters can be passed via command line with --param value

Channels and Operators

Channels are fundamental to Nextflow workflows:

// Create channels
ch1 = Channel.from(1, 2, 3, 4, 5)
ch2 = Channel.fromPath("*.txt")
ch3 = Channel.fromFilePairs("data_{1,2}.txt")

// Transform channels
ch1.map { it * 2 }  // Multiply each value by 2
ch2.filter { it.size() > 1000 }  // Keep only files larger than 1000 bytes
ch3.collect()  // Gather all items into a single list, emitted once
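Operators can be chained, and view() is handy for inspecting a channel's contents while developing. Below is a small sketch with arbitrary values (note that recent Nextflow versions prefer Channel.of over Channel.from):

```groovy
// Chain operators and print each emission as it arrives
Channel.of(1, 2, 3, 4, 5)
    .map { it * 2 }       // emits 2, 4, 6, 8, 10
    .filter { it > 4 }    // emits 6, 8, 10
    .view { "value: $it" }
```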

All Example Scripts

You can find the example scripts and notebooks in the examples folder in the Git repository.

In addition, take a look at the examples linked in the Additional Resources section below.

Advanced Topics

Execution Profiles

Nextflow supports execution profiles for different environments:

profiles {
    local {
        process.executor = "local"
        process.cpus = 4
        process.memory = "8 GB"
    }
    
    slurm {
        process.executor = "slurm"
        process.queue = "compute"
        process.cpus = 8
        process.memory = "16 GB"
    }
    
    aws {
        process.executor = "awsbatch"
        process.queue = "my-queue"
    }
}

Run with a specific profile:

nextflow run workflow.nf -profile slurm

Containers

Nextflow supports Docker, Apptainer, and Singularity containers:

process my_process {
    container "biocontainers/fastqc:0.11.9"
    
    script:
    """
    fastqc input.fastq
    """
}
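Containers can also be assigned per process in nextflow.config, which keeps the workflow code engine-agnostic. A sketch, assuming the process names from the earlier example (the container images are illustrative):

```groovy
// nextflow.config: enable Docker and assign containers by process name
docker.enabled = true

process {
    withName: "extract" {
        container = "ubuntu:22.04"
    }
    withName: "transform" {
        container = "biocontainers/fastqc:0.11.9"
    }
}
```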

Enable containers:

nextflow run workflow.nf -with-docker
nextflow run workflow.nf -with-apptainer  # Recommended (Apptainer is the new name for Singularity)
nextflow run workflow.nf -with-singularity  # Legacy option, still supported

Resume Execution

Nextflow can resume failed workflows:

nextflow run workflow.nf -resume

This skips tasks that completed successfully (reusing their cached results from the work directory) and re-runs only failed or modified tasks.

Configuration Files

Create a nextflow.config file in your project:

// Global configuration
workDir = "/path/to/work"

// Process defaults
process {
    cpus = 4
    memory = "8 GB"
    time = "1h"
}

// Executor configuration (the executor is selected via the process scope)
process.executor = "slurm"
process.queue = "compute"

// Container configuration
docker {
    enabled = true
}
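Settings can also target groups of processes via labels, so resource tiers are defined once and reused. A sketch, where the big_mem label name and values are illustrative; a process opts in with the directive label "big_mem":

```groovy
// nextflow.config: apply settings to every process carrying the label
process {
    withLabel: "big_mem" {
        cpus = 16
        memory = "64 GB"
    }
}
```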

Error Handling

Handle errors in processes:

process my_process {
    errorStrategy "retry"
    maxRetries 3
    
    script:
    """
    your_command_here
    """
}
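A common refinement is to escalate resources on each retry using a dynamic directive, so transient out-of-memory failures resolve themselves. A sketch; the base memory value is arbitrary:

```groovy
process my_process {
    // Retry up to 3 times, doubling memory on each attempt
    errorStrategy "retry"
    maxRetries 3
    memory { 2.GB * task.attempt }

    script:
    """
    your_command_here
    """
}
```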

Publishing Results

Publish output files:

process my_process {
    publishDir "${params.output}", mode: "copy"
    
    output:
    path "result.txt"
    
    script:
    """
    echo "result" > result.txt
    """
}
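publishDir accepts further options, for example publishing only matching files or renaming them on the way out. A sketch building on the example above (the "final_" prefix is illustrative):

```groovy
process my_process {
    // Copy only .txt outputs, prefixing the published names
    publishDir "${params.output}", mode: "copy",
        pattern: "*.txt",
        saveAs: { filename -> "final_${filename}" }

    output:
    path "result.txt"

    script:
    """
    echo "result" > result.txt
    """
}
```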

Additional Resources

Official Documentation

  • Nextflow documentation - Official reference at nextflow.io/docs

Learning Resources

  • Nextflow training - Hands-on tutorials at training.nextflow.io

Community

  • nf-core - Community-curated Nextflow workflows for genomics
  • Seqera Platform (formerly Nextflow Tower) - Cloud-based workflow monitoring and management for Nextflow