Sequencer integration

bcbio-nextgen supports processing of samples arriving from sequencing machines. It automates the generation of de-multiplexed fastq files and sample configurations that feed directly into standard bcbio-nextgen pipelines.

This requires an Illumina sequencer and samples annotated using the basic Galaxy nglims. The approach is general and we’d be happy to collaborate or accept contributions for supporting other sequencers or LIMS systems.

Overview

A fully automated setup consists of three components:

  • A front end Galaxy nglims system. Users provide details on samples, including organism information and choice of post-processing pipeline. Sequencing operators annotate multiplexed flowcells with the locations of samples. The automated processing scripts use this information to prepare sample configurations for downstream processing.
  • A sequencer output machine, where the sequencer dumps reads. An hourly scheduled job detects newly finished flowcells and initiates processing of output into demultiplexed fastq.
  • An analysis machine where bcbio-nextgen analysis occurs. The cronjob on the sequencer output machine transfers fastq files and initiates multi-core processing. On completion, the analysis machine uploads results to an attached Galaxy instance. The analysis machine can be the same as the sequencer output machine for systems with shared filesystems.

Sequencer output machine

The sequencer output machine is a Linux-based machine where Illumina writes output directories containing bcl files. Our current experience is with HiSeq output and we welcome contributions from users working with different machines or output setups. Post-sequencing processing, including demultiplexing, is initiated by a cronjob run on the Illumina output machine:

PATH=/usr/bin:/usr/local/bin:/bin:/usr/X11R6/bin
@hourly bcbio_nextgen.py sequencer /opt/bcbio/transfer_info.yaml

transfer_info.yaml is a configuration file specifying locations of output directories where sequencer results appear. It also contains the location of the Galaxy nglims server to retrieve sample details, and the downstream analysis server to transfer files and start bcbio-nextgen pipelines.

Illumina machines produce run directories that include the date, machine identifier, and flowcell ID:

110216_HWI-EAS264_00035_FC638NPAAXX

The cronjob script identifies directories with newly finished output. Unprocessed directories, identified by the date and flowcell ID, continue on for demultiplexing, transfer and analysis.
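The naming convention and the check for unprocessed runs can be sketched in Python as follows. This is a minimal illustration, not the code bcbio-nextgen runs: the names `RunDir`, `parse_run_dir` and `find_unprocessed` are hypothetical, and so is the idea of keeping finished runs in an in-memory set. It assumes four underscore-separated fields, with a run counter between the machine identifier and the flowcell field, as in the example directory above.

```python
from typing import Iterable, List, NamedTuple, Optional, Set, Tuple


class RunDir(NamedTuple):
    """Fields of a run directory name (hypothetical helper type)."""
    date: str
    machine: str
    run_number: str
    flowcell: str


def parse_run_dir(name: str) -> Optional[RunDir]:
    """Split a run directory name into its fields; None if it does not match."""
    parts = name.split("_")
    # Assumption: exactly four fields, the first being a six-digit date.
    if len(parts) != 4 or not (parts[0].isdigit() and len(parts[0]) == 6):
        return None
    return RunDir(*parts)


def find_unprocessed(names: Iterable[str],
                     processed: Set[Tuple[str, str]]) -> List[str]:
    """Return run directories whose (date, flowcell) pair is not yet recorded."""
    todo = []
    for name in names:
        run = parse_run_dir(name)
        if run is not None and (run.date, run.flowcell) not in processed:
            todo.append(name)
    return todo
```

Calling `find_unprocessed` with the directory listing of the sequencer output location and the set of already handled `(date, flowcell)` pairs returns the runs that still need demultiplexing. Directories that do not follow the naming convention are skipped.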

Illumina’s bcl2fastq performs demultiplexing and conversion to fastq. The configureBclToFastq.pl script from bcl2fastq needs to be available on the PATH specified within your cronjob.

Following demultiplexing, the script combines separate fastq files into single or paired fastq files per sample. This includes combining samples multiplexed across multiple lanes in a flowcell; any samples with the same project and sample name get combined.
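The grouping rule can be illustrated with a short Python sketch. It is not the bcbio-nextgen implementation: the record layout `(project, sample, read, lane, path)` and the function names `group_fastqs` and `merge_group` are assumptions made for the example. Files are sorted by lane before merging so that read 1 and read 2 outputs of a paired sample stay in the same order.

```python
import shutil
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (project, sample, read number, lane, path)
FastqRecord = Tuple[str, str, int, int, Path]


def group_fastqs(records: Iterable[FastqRecord]) -> Dict[Tuple[str, str, int], List[Path]]:
    """Group fastq files by (project, sample, read), ordered by lane."""
    groups = defaultdict(list)
    # Sorting the tuples orders files by project, sample, read, then lane.
    for project, sample, read, _lane, path in sorted(records):
        groups[(project, sample, read)].append(path)
    return dict(groups)


def merge_group(paths: List[Path], out_file: Path) -> None:
    """Concatenate the input files, in order, into a single output file."""
    with open(out_file, "wb") as out_handle:
        for path in paths:
            with open(path, "rb") as in_handle:
                shutil.copyfileobj(in_handle, out_handle)
```

Byte-wise concatenation also works for gzipped fastq files, since concatenated gzip streams form a valid gzip file.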

Finally the script prepares a sample configuration file defining processing steps to perform, transfers the configuration and fastq files to the analysis machine, and initiates processing. Transfer between the sequencer output and analysis machines occurs using secure rsync, which requires the ability to securely login between machines without passwords using ssh public key authentication. For shared filesystem setups, rsync will transfer the files to a local directory for processing.
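As a rough sketch of what such a transfer involves, the Python snippet below builds an rsync-over-ssh command. The function names and the choice of flags are assumptions for illustration, not the command bcbio-nextgen actually issues. `BatchMode=yes` makes ssh fail instead of prompting for a password, which is a quick way to confirm that public key authentication is set up.

```python
import subprocess
from typing import List


def rsync_command(src_dir: str, user: str, host: str, dest_dir: str) -> List[str]:
    """Build an rsync command that copies src_dir into dest_dir on the remote host."""
    return [
        "rsync",
        "-az",                        # archive mode, compress during transfer
        "-e", "ssh -o BatchMode=yes",  # never prompt for a password
        src_dir.rstrip("/"),          # no trailing slash: copy the directory itself
        f"{user}@{host}:{dest_dir}",
    ]


def transfer(src_dir: str, user: str, host: str, dest_dir: str) -> None:
    """Run the transfer; raises CalledProcessError if rsync or ssh fails."""
    subprocess.run(rsync_command(src_dir, user, host, dest_dir), check=True)
```

For example, `rsync_command("/seq/out/FLOWCELL", "bcbio_user", "workserver.you.org", "/array1/bcbio")` produces a command that places the flowcell directory under `/array1/bcbio` on the analysis machine. The source path here is hypothetical.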

Analysis machine

We support two approaches for automated analysis processing:

  • A remote analysis server. This is a long running server on a remote machine that does not share a filesystem with the sequencer output machine.
  • A cluster-based analysis server. This server mounts the output directories via a shared filesystem and can submit to an attached cluster for processing.

On analysis completion the pipeline transfers processed files to a Galaxy server based on a pre-specified Upload configuration. This includes the alignment BAM files, quality control, and other pipeline specific files like variant calls or RNA-seq counts. It organizes files in project and sample specific folders within Galaxy’s data libraries, making them available to researchers for additional analysis.

Remote analysis server

The first approach is to initiate processing on a remote server. The process section of your transfer_info.yaml file should look like:

process:
  host: workserver.you.org
  username: bcbio_user
  dir: /array1/bcbio
  storedir: /galaxydata/upload/storage
  server: http://workserver.you.org/bcbio

The analysis server runs bcbio-nextgen pipelines and uploads results to a local Galaxy server. A bcbio-nextgen server receives processing commands and starts them. The following command starts a server on port 8080 using 16 cores for analysis:

bcbio_nextgen.py server -p 8080 -n 16

To make this available outside of the current machine use a proxy server like nginx with the following configuration:

upstream bcbio {
  server localhost:8080;
}

server {
    listen       80 default_server;
    location /bcbio/ {
       proxy_pass http://bcbio/;
       proxy_set_header   X-Forwarded-Host $host;
       proxy_set_header   X-Forwarded-For  $proxy_add_x_forwarded_for;
    }
}

Specify this URL as server: http://workserver.you.org/bcbio in your transfer_info.yaml file to enable the sequencer output machine to communicate with the analysis server.

The remote analysis server currently handles multicore processing. We’d be happy to collaborate on approaches to allow it to automatically start bcbio-nextgen jobs on HPC clusters or other types of distributed environments.

Cluster-based analysis server

The alternative approach for post-processing submits directly to a cluster attached to the output filesystem. This requires a process configuration containing the batch scripts to submit for bcl2fastq processing and bcbio-nextgen analysis:

process:
  dir: /array1/bcbio
  storedir: /galaxydata/upload/storage
  submit_cmd: "qsub {batch_script}"
  bcl2fastq_batch: |
    batch script template for bcl2fastq
  bcbio_batch: |
    batch script template for bcbio-nextgen processing

The example transfer_info.yaml has batch script templates you can customize for your specific system. This provides a way to automatically prepare batch scripts for kicking off analyses, and supplements the IPython cluster integration that bcbio-nextgen provides.
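To show how the `{batch_script}` placeholder in submit_cmd might be filled in, here is a small Python sketch. It is an assumption about the mechanics rather than bcbio-nextgen's own code: `write_batch_script` and `build_submit_command` are hypothetical names, and the sketch only builds the command without running it.

```python
import shlex
from pathlib import Path
from typing import List


def write_batch_script(template: str, out_file: Path) -> Path:
    """Write a batch script template (e.g. the bcl2fastq_batch value) to disk."""
    out_file.write_text(template)
    return out_file


def build_submit_command(submit_cmd: str, batch_script: str) -> List[str]:
    """Fill the {batch_script} placeholder and split into an argument list."""
    # Quote the path so that spaces survive the later split.
    filled = submit_cmd.format(batch_script=shlex.quote(batch_script))
    return shlex.split(filled)
```

With the configuration above, `build_submit_command("qsub {batch_script}", "/array1/bcbio/bcl2fastq.sh")` gives `["qsub", "/array1/bcbio/bcl2fastq.sh"]`, which could then be passed to `subprocess.run`. The script path is made up for the example.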

Debugging

This section contains tips and tricks on restarting processing in case of problems. Flowcell processing occurs under the directory specified by process -> dir in your transfer_info.yaml file. Each flowcell directory contains the sample YAML configuration file, an analysis directory, and demultiplexed fastqs:

140313_SN728_0206_AC3KL2ACXX
├── 140313_SN728_0206_AC3KL2ACXX.csv
├── 140313_SN728_0206_AC3KL2ACXX.yaml
├── analysis
├── Data
├── fastq
├── RunInfo.xml
└── runParameters.xml

The full log file for a processing run is in analysis/log/bcbio-nextgen-debug.log and will contain useful details about why a run failed. You can manually restart processing of a run using the standard bcbio-nextgen command line:

cd FLOWCELL/analysis
bcbio_nextgen.py ../../FLOWCELL ../FLOWCELL.yaml -n 16
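Before restarting, it often helps to look at the end of the debug log. The Python sketch below locates the log under a flowcell directory, using the layout described above, and returns its last lines. The helper names `debug_log_path` and `tail_log` are hypothetical and not part of bcbio-nextgen.

```python
from collections import deque
from pathlib import Path
from typing import List, Union


def debug_log_path(flowcell_dir: Union[str, Path]) -> Path:
    """Path of the debug log inside a flowcell processing directory."""
    return Path(flowcell_dir) / "analysis" / "log" / "bcbio-nextgen-debug.log"


def tail_log(path: Union[str, Path], n: int = 20) -> List[str]:
    """Return the last n lines of a log file without loading all of it."""
    with open(path, errors="replace") as handle:
        # A bounded deque keeps only the final n lines while streaming the file.
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]
```

For instance, `tail_log(debug_log_path("/array1/bcbio/140313_SN728_0206_AC3KL2ACXX"))` returns the final 20 lines of that run's log, assuming the flowcell sits under the process dir from the earlier configuration.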