Pipelines

Germline variant calling

bcbio implements configurable SNP, indel and structural variant calling for germline populations. We include whole genome and exome evaluations against reference calls from the Genome in a Bottle consortium and Illumina Platinum Genomes project, enabling continuous assessment of new alignment and variant calling algorithms. We regularly report on these comparisons and continue to improve approaches as the community makes new tools available. Here is some of the research that contributes to the current implementation:

  • An introduction to the variant evaluation framework. This includes a comparison of the bwa mem and novoalign aligners. We also compared the FreeBayes, GATK HaplotypeCaller and GATK UnifiedGenotyper variant callers.
  • An in-depth evaluation of FreeBayes and BAM post-alignment processing. We found that FreeBayes quality was equal to GATK HaplotypeCaller. Additionally, a lightweight post-alignment preparation method using only de-duplication was equivalent to GATK’s recommended Base Quality Score Recalibration (BQSR) and realignment around indels, when using good quality input datasets and callers that do local realignment.
  • Additional work to improve variant filtering, providing methods to remove low complexity regions (LCRs) that can bias indel results. We also tuned GATK’s Variant Quality Score Recalibrator (VQSR) and compared it with cutoff-based soft filtering. VQSR requires a large number of variants, so bcbio uses it with GATK HaplotypeCaller only when your Algorithm parameters contain high depth samples (coverage_depth is not low) and you are either calling on the whole genome (coverage_interval is genome) or have more than 50 regional or exome samples called concurrently.
  • An evaluation of joint calling with GATK HaplotypeCaller, FreeBayes, Platypus and samtools. This validates the joint calling implementation, allowing scaling of large population germline experiments. It also demonstrates improved performance of new callers: samtools 1.0 and Platypus.
  • Support for build 38 of the human genome, improving precision of detection thanks to the improved genome representation.

bcbio automates post-variant calling annotation to make the outputs easier to feed directly into your biological analysis. We annotate variant effects using snpEff or Variant Effect Predictor (VEP), and prepare a GEMINI database that associates variants with multiple external annotations in a SQL-based query interface. GEMINI databases have the most associated external information for human samples (GRCh37/hg19 and hg38) but are available for any organism with the database populated using the VCF INFO column and predicted effects.
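As a minimal sketch, the effect predictor can be selected per sample through the effects algorithm option; the key name and values shown here follow the standard bcbio configuration, but check the configuration documentation for your version:

- description: Sample1
  algorithm:
    variantcaller: freebayes
    effects: vep    # or snpeff; selects the variant effect annotation tool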

Basic germline calling

The best approach to build a bcbio Configuration for germline calling is to use the Automated sample configuration with one of the default templates:

  • FreeBayes template – Call variants using FreeBayes with a minimal preparation pipeline. This is a freely available unrestricted pipeline fully included in the bcbio installation.
  • GATK HaplotypeCaller template – Run GATK best practices, including Base Quality Score Recalibration, realignment and HaplotypeCaller variant calling. This requires a license from Broad for commercial use. You need to manually install GATK along with bcbio using downloads from the GATK Broad site or Appistry (see Extra software).

You may also want to enable Structural variant calling for detection of larger events, which works with either caller. Another good source of inspiration is the configuration files from the Example pipelines, which may help identify other configuration variables of interest. A more complex setup with multiple callers and resolution of ensemble calls is generally only useful with a small population where you are especially concerned about sensitivity. Single caller detection with FreeBayes or GATK HaplotypeCaller provides good resolution of events.
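As a rough sketch of the kind of sample entry these templates produce (the exact template contents may differ, and the variant2 analysis name, bwa aligner and genome build shown are only the usual defaults), a single germline sample calling with FreeBayes plus optional structural variant callers could look like:

- description: Sample1
  analysis: variant2
  genome_build: GRCh37
  algorithm:
    aligner: bwa
    variantcaller: freebayes
    svcaller: [lumpy, manta, cnvkit]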

Population calling

When calling multiple samples, we recommend calling together to provide improved sensitivity and a fully squared off final callset. To associate samples together in a population, add a metadata batch to the Sample information:

- description: Sample1
  metadata:
    batch: Batch1
- description: Sample2
  metadata:
    batch: Batch1

Batching samples results in output VCFs and GEMINI databases containing all merged sample calls. bcbio has two methods to call samples together:

  • Batch or pooled calling – This calls all samples simultaneously by feeding them to the variant caller. This works for smaller batch sizes (< 100 samples) as memory requirements become limiting in larger pools. This is the default approach taken when you specify a variantcaller in the Variant calling configuration.

  • Joint calling – This calls samples independently, then combines them together into a single callset by integrating the individual calls. This scales to larger population sizes by avoiding the computational bottlenecks of pooled calling. We recommend joint calling with HaplotypeCaller if you have a license for GATK usage, but also support joint calling with FreeBayes using a custom implementation. Specifying a jointcaller along with the appropriate variantcaller in the Variant calling configuration enables this:

    - description: Sample1
      algorithm:
        variantcaller: gatk-haplotype
        jointcaller: gatk-haplotype-joint
      metadata:
        batch: Batch1
    - description: Sample2
      algorithm:
        variantcaller: gatk-haplotype
        jointcaller: gatk-haplotype-joint
      metadata:
        batch: Batch1
    

Cancer variant calling

bcbio supports somatic cancer calling with tumor and optionally matched normal pairs using multiple SNP, indel and structural variant callers. A full evaluation of cancer calling validates callers against synthetic dataset 3 from the ICGC-TCGA DREAM challenge. bcbio uses a majority voting ensemble approach to combining calls from multiple SNP and indel callers, and also flattens structural variant calls into a combined representation.

The example configuration for the Cancer tumor normal validation is a good starting point for setting up a tumor/normal run on your own dataset. The configuration works similarly to population based calling. Supply a consistent batch for tumor/normal pairs and mark them with the phenotype:

- description: your-tumor
  metadata:
    batch: batch1
    phenotype: tumor
- description: your-normal
  metadata:
    batch: batch1
    phenotype: normal

Other Somatic variant calling configuration options allow tweaking of the processing parameters.

Cancer calling handles both tumor-normal paired calls and tumor-only calling. To specify a tumor-only sample, provide a single sample labeled with phenotype: tumor; otherwise the configuration and setup are the same as for paired analyses. For tumor-only samples, bcbio will try to remove likely germline variants that are present in public databases like 1000 genomes and ExAC but not in COSMIC. This runs as long as you have a local GEMINI data installation (--datatarget gemini) and marks likely germline variants with a LowPriority filter. This post has more details on the approach and validation.
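For example, mirroring the paired configuration above, a tumor-only run contains just the single tumor sample in its batch:

- description: your-tumor
  metadata:
    batch: batch1
    phenotype: tumor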

The standard variant outputs (sample-caller.vcf.gz) for tumor calling emphasize somatic differences: variants likely to be unique to the cancer. In addition to this file, we also produce a sample-caller-germline.vcf.gz file containing likely germline mutations. These are useful for identifying pre-existing genomic changes that can contribute to cancer development, or in paired designs, like pre- and post-treatment samples, where you may want to identify mutations maintained after treatment.

We’re actively working on improving calling to better account for the heterogeneity and structural variability that define cancer genomes.

Somatic with germline variants

For tumor/normal somatic samples, bcbio can call both somatic (tumor-specific) and germline (pre-existing) variants. The typical outputs of Cancer variant calling are likely somatic variants acquired by the cancer, but pre-existing germline risk variants are often also diagnostic.

bcbio enables calling both somatic and germline variants within a single pipeline run. Since the algorithms for calling both are different, you specify which callers to use for each step in the Variant calling configuration:

- description: your-normal
  algorithm:
    variantcaller:
      somatic: vardict
      germline: freebayes

bcbio does a single alignment for the normal sample, then splits at the variant calling steps using this normal sample to do germline calling. In this example, you’d get FreeBayes germline calls labeled as your-normal-germline and VarDict somatic calls for the tumor sample linked to this normal. This generates a single set of somatic and germline calls for the tumor and normal pair.

Germline calling supports multiple callers, and other configuration options like ensemble and structural variant calling inherit from the rest of the configuration. For example, to use three callers for somatic and germline calling, create ensemble calls for both, and include germline and somatic events from two structural variant callers:

variantcaller:
  somatic: [vardict, varscan, mutect2]
  germline: [freebayes, gatk-haplotype, platypus]
ensemble:
  numpass: 2
svcaller: [manta, cnvkit]

Tumor-only inputs mix somatic and germline variants, making it difficult to separate events. For tumor-only cases we suggest running standard Cancer variant calling. bcbio will attempt to distinguish somatic and germline mutations using the presence of variants in population databases.

Structural variant calling

bcbio can detect larger structural variants like deletions, insertions, inversions and copy number changes for both germline population and cancer variant calling, based on validation against existing truth sets:

  • Validation of germline structural variant detection using multiple calling methods to validate against deletions in NA12878. This implements a pipeline that works in tandem with SNP and indel calling to detect larger structural variations like deletions, duplications, inversions and copy number variants (CNVs).
  • Validation of tumor/normal calling using the synthetic DREAM validation set. This includes validation of additional callers against duplications, insertions and inversions.

To enable structural variant calling, specify svcaller options in the algorithm section of your configuration:

- description: Sample
  algorithm:
    svcaller: [lumpy, manta, cnvkit]

The best supported callers are Lumpy and Manta for paired-end and split-read calling, CNVkit for read-depth based CNV calling, and WHAM for association testing. We also support DELLY, another excellent paired-end and split-read caller, although it is slow on large whole genome datasets.

In addition to results from individual callers, bcbio can create a summarized ensemble callset using MetaSV. We’re actively working on improved structural variant reporting to highlight potential variants of interest.

RNA-seq

bcbio can also be used to analyze RNA-seq data. It includes steps for quality control, adapter trimming, alignment, variant calling, transcriptome reconstruction and post-alignment quantitation at the gene and isoform level.

We recommend using the STAR aligner for all genomes where there are no alt alleles. For genomes such as hg38 that have alt alleles, hisat2 should be used as it handles the alts correctly and STAR does not yet. Use Tophat2 only if you do not have enough RAM available to run STAR (about 30 GB).
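As a minimal sketch of how this recommendation translates into configuration, reusing the sample keys shown later in the Configuration section, the aligner is set per sample in the algorithm section:

- description: Sample1
  analysis: RNA-seq
  genome_build: GRCh37
  algorithm:
    aligner: star    # switch to hisat2 for genomes with alt alleles such as hg38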

Our current recommendation is to run adapter trimming only if using the Tophat2 aligner. Adapter trimming is very slow, and aligners that soft clip the ends of reads, such as STAR and hisat2, or pseudoalignment-based methods like Sailfish, handle contaminant sequences at the ends of reads properly, making trimming unnecessary. Tophat2 does not perform soft clipping, so if you use Tophat2, trimming must still be done.

Salmon, an extremely fast alignment-free method of quantitation, is run for all experiments. Salmon can accurately quantitate the expression of genes, even ones which are hard to quantitate with other methods (see this paper on Sailfish, which performs similarly to Salmon, for an example). Salmon can also quantitate at the transcript level, which can help gene-level analyses (see this paper for an example). We recommend using the Salmon quantitation rather than the counts from featureCounts for downstream analysis.

Although we do not recommend using the featureCounts-based counts, the alignments are still useful because they give you many more quality metrics than the quasi-alignments from Salmon.

After a bcbio RNA-seq run, the upload directory contains a directory for each sample with a BAM file of the aligned and unaligned reads, a sailfish directory with the output of Salmon, including TPM values, and a qc directory with plots from FastQC and qualimap.

In addition to the directories for each sample, the upload directory contains a project directory with a YAML file describing summary statistics for each sample and some provenance data about the bcbio run. That directory also holds a combined.counts file with the featureCounts-derived counts per gene for each sample.

fast RNA-seq

This mode of bcbio-nextgen quantitates transcript expression using Salmon and does nothing else. It is an order of magnitude or more faster than running the full RNA-seq analysis. The cost of the increased speed is that you will have much less information about your samples at the end of the run, which can make troubleshooting trickier. Invoke it with analysis: fastrna-seq.
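A minimal sketch of a fast RNA-seq sample using the same keys as the Configuration examples below (file paths are placeholders):

- description: Sample1
  files: [/full/path/to/sample_1.fastq, /full/path/to/sample_2.fastq]
  genome_build: GRCh37
  analysis: fastrna-seq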

single-cell RNA-seq

bcbio-nextgen supports unique molecular identifier (UMI) based single-cell RNA-seq analyses. If your single-cell prep does not use UMIs, you can most likely just run the standard RNA-seq pipeline and use the results from that. UMIs are used to discard reads that are likely PCR duplicates, which is very helpful for removing some of the PCR duplicate noise that can dominate single-cell experiments.

Unlike the standard RNA-seq pipeline, the single-cell pipeline expects the FASTQ input files to not be separated by cellular barcode, so each file is a mix of cells identified by a cellular barcode (CB), with unique reads from a transcript identified by a UMI. bcbio-nextgen inspects each read, identifies the cellular barcode and UMI, and puts them in the read name. The reads are then aligned to the transcriptome with RapMap and the number of reads aligning to each transcript is counted for each cellular barcode. The output is a table of counts with transcripts as the rows and the cellular barcodes from each input FASTQ file as the columns.

Optionally, the reads can be quantitated with kallisto to output transcript compatibility counts rather than counts per gene (TCC paper). kallisto is free for academic use, but if you are a commercial entity, you need a license from UC Berkeley.

To extract the UMI and cellular barcodes from the read, bcbio-nextgen needs to know where the UMI and the cellular barcode are expected to be in the read. Currently there is support for two schemes, the inDrop system from the Harvard single-cell core facility and CEL-seq. If bcbio-nextgen does not support your UMI and barcoding scheme, please open up an issue and we will help implement support for it.
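As a sketch only, a single-cell sample is configured like any other, with an analysis type and a key naming the barcode/UMI scheme. Both the scRNA-seq analysis name and the umi_type key below are assumptions, so confirm the exact names and supported values in the configuration documentation or the umis repository:

- description: Sample1
  files: [/full/path/to/cells_1.fastq, /full/path/to/cells_2.fastq]
  genome_build: GRCh37
  analysis: scRNA-seq          # analysis name assumed
  algorithm:
    umi_type: harvard-indrop   # key and scheme value assumed; matches the inDrop prep described above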

Most of the heavy lifting for this part of bcbio-nextgen is implemented in the umis repository.

smallRNA-seq

bcbio-nextgen also implements a configurable best-practices pipeline for smallRNA-seq quality controls, adapter trimming, miRNA/isomiR quantification and other small RNA detection.

  • Adapter trimming
  • Sequence alignment:
    • STAR for genome annotation
    • bowtie, bowtie2 and hisat2 for genome annotation as an option
  • Known small RNA quantification
  • Quality control
  • Other small RNA quantification

The pipeline generates an RMD template file inside the report folder that can be rendered with knitr. An example of the report is available here. The count table for miRBase miRNAs (counts_mirna.tsv) will be inside the mirbase folder or the final project folder. Input files for analysis with the isomiRs package will be inside each sample's mirbase folder. If mirdeep2 can run, the count table for novel miRNAs (counts_mirna_novel.tsv) will be inside the mirdeep2 folder or the final project folder. tdrmapper results will be inside each sample's tdrmapper folder or the final project folder.
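As a sketch, a smallRNA-seq sample follows the same layout as the other pipelines. The smallRNA-seq analysis name follows the standard bcbio analysis types, and the adapter sequence shown is only an illustrative TruSeq small RNA 3' adapter, so adjust both for your setup:

- description: Sample1
  files: [/full/path/to/sample.fastq]
  genome_build: GRCh37
  analysis: smallRNA-seq
  algorithm:
    adapters: ["TGGAATTCTCGGGTGCCAAGG"]   # example 3' adapter; replace with the adapter from your kit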

ChIP-seq

bcbio-nextgen implements the first steps of a ChIP-seq analysis: alignment with bowtie2 and peak calling with MACS2.
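A sketch of a minimal ChIP-seq sample; the chip-seq analysis name and especially the peakcaller key are assumptions here, so verify them against the configuration documentation:

- description: Sample1
  files: [/full/path/to/chip.fastq]
  genome_build: GRCh37
  analysis: chip-seq           # analysis name assumed
  algorithm:
    aligner: bowtie2
    peakcaller: macs2          # key name assumed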

Standard

This pipeline implements alignment and QC tools. It also runs qsignature to detect possible sample duplication or mislabeling, using a SNP signature to create a distance matrix that makes it easy to group samples. The project YAML file will show the total number of samples analyzed, the number of very similar samples, and samples that could be duplicates.
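A sketch of turning this check on for a sample, assuming the mixup_check algorithm key controls qsignature (verify the key name and accepted values in the configuration documentation):

- description: Sample1
  algorithm:
    mixup_check: qsignature    # key name assumed; enables the SNP-signature duplicate/mislabeling check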

Configuration

We will assume that you installed bcbio-nextgen with the automated installer, so your default bcbio_system.yaml file is configured correctly with all of the tools pointing to the right places. If that is the case, to run bcbio-nextgen on a set of samples you just need to set up a YAML file that describes your samples and what you would like to do to them. Let’s say that you have a single paired-end control lane from a human sample, prepared with the Illumina TruSeq kit. Here is what a well-formed sample YAML file for that RNA-seq experiment would look like:

fc_date: '070113'
fc_name: control_experiment
upload:
  dir: final
details:
  - files: [/full/path/to/control_1.fastq, /full/path/to/control_2.fastq]
    description: 'Control_rep1'
    genome_build: GRCh37
    analysis: RNA-seq
    algorithm:
         aligner: tophat2
         quality_format: Standard
         trim_reads: read_through
         adapters: [truseq, polya]
         strandedness: unstranded

fc_date and fc_name will be combined to form a prefix for naming intermediate files; you can set them to whatever you like. upload is explained pretty well in the configuration documentation, and the above will direct bcbio-nextgen to put the output files from the pipeline into the final directory. Under details is a list of sections, each describing a sample to process. You can set many parameters under each section, but most of the time just setting a few, like the above, is all that is necessary. analysis tells bcbio-nextgen to run the best-practice RNA-seq pipeline on this sample.

In the above example, since there are two files, control_1.fastq and control_2.fastq will automatically be run as paired-end data. If you have single-end data, you can just supply one file and it will run as single-end. The description field will be used to eventually rename the files, so make it evocative since you will be looking at it a lot later. genome_build is self-explanatory.

Sometimes you need a little bit more flexibility than the standard pipeline, and the algorithm section has many options to fine-tune the behavior of the pipeline. quality_format tells bcbio-nextgen what quality format your FASTQ inputs are using; if your samples were sequenced any time past 2009 or so, you probably want to set it to Standard. Adapter read-through is a problem in RNA-seq libraries, so we want to trim off possible adapter sequences on the ends of reads. Setting trim_reads to read_through does this and will also trim off poor quality ends. Since your library is an RNA-seq library prepared with the TruSeq kit, the set of adapters to trim off are the TruSeq adapters and possible polyA tails, so adapters is set to both of those. strandedness can be set if your library was prepared in a strand-specific manner and can be set to firststrand, secondstrand or unstranded (the default).

Multiple samples

Let’s say you have a set of mouse samples to analyze, where each sample is a single lane of single-end RNA-seq reads prepared using the Nextera kit. There are two case and two control samples. Here is a sample configuration file for that analysis:

fc_date: '070113'
fc_name: mouse_analysis
upload:
  dir: final
details:
  - files: [/full/path/to/control_rep1.fastq]
    description: 'Control_rep1'
    genome_build: mm10
    analysis: RNA-seq
    algorithm:
         aligner: tophat2
         quality_format: Standard
         trim_reads: read_through
         adapters: [nextera, polya]
  - files: [/full/path/to/control_rep2.fastq]
    description: 'Control_rep2'
    genome_build: mm10
    analysis: RNA-seq
    algorithm:
         aligner: tophat2
         quality_format: Standard
         trim_reads: read_through
         adapters: [nextera, polya]
  - files: [/full/path/to/case_rep1.fastq]
    description: 'Case_rep1'
    genome_build: mm10
    analysis: RNA-seq
    algorithm:
         aligner: tophat2
         quality_format: Standard
         trim_reads: read_through
         adapters: [nextera, polya]
  - files: [/full/path/to/case_rep2.fastq]
    description: 'Case_rep2'
    genome_build: mm10
    analysis: RNA-seq
    algorithm:
         aligner: tophat2
         quality_format: Standard
         trim_reads: read_through
         adapters: [nextera, polya]

More samples are added just by adding more entries under the details section. This is tedious and error prone to do by hand, so there is an automated template system for common experiments. You could set up the previous experiment by making a mouse version of the illumina-rnaseq template file and saving it to a local file such as illumina-mouse-rnaseq.yaml. Then you can set up the sample file using the templating system:

bcbio_nextgen.py -w template illumina-mouse-rnaseq.yaml mouse_analysis
/full/path/to/control_rep1.fastq /full/path/to/control_rep2.fastq
/full/path/to/case_rep1.fastq /full/path/to/case_rep2.fastq

If you have paired-end samples instead of single-end samples, you can still use the template system as long as the forward and reverse read filenames are the same, apart from a _1 and _2. For example, control_1.fastq and control_2.fastq will be detected as paired and combined in the YAML file output by the templating system.
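For example, given control_1.fastq and control_2.fastq, the templating system would emit a single combined entry along these lines (the description shown is illustrative; the other fields come from the template):

- files: [/full/path/to/control_1.fastq, /full/path/to/control_2.fastq]
  description: control
  analysis: RNA-seq
  genome_build: mm10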