Structural variant calling

Overview

bcbio can detect larger (>50bp) structural variants like deletions, insertions, inversions and copy number changes for both germline population and cancer variant calling.

To enable structural variant calling, specify svcaller options in the algorithm section of your configuration:

- description: Sample
  algorithm:
    svcaller: [lumpy, manta, cnvkit]

Split read callers (primary use case - germline WGS sequencing):

Read-depth based CNV callers (primary use case - T/N cancer CNV calling):

  • gatkcnv
  • CNVkit. 2020-05-13: temporarily off until new release 0.9.7.

Workflow

This example runs structural variant calling with multiple callers (Lumpy, Manta and CNVkit), providing a combined output summary file and validation metrics against NA12878 deletions. It uses the same NA12878 input as the whole genome trio example.

To run the analysis do:

mkdir -p NA12878-sv-eval
cd NA12878-sv-eval
wget https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/config/examples/NA12878-sv-getdata.sh
bash NA12878-sv-getdata.sh
cd work
bcbio_nextgen.py ../config/NA12878-sv.yaml -n 16

This is a large whole genome analysis, and the timing and disk space requirements for the NA12878 trio analysis above apply here as well.

Parameters

  • svcaller – List of structural variant callers to use. [lumpy, manta, cnvkit, gatk-cnv, seq2c, purecn, titancna, delly, battenberg]. LUMPY and Manta require paired-end reads. cnvkit and gatk-cnv should not be used on the same sample because their normalization approaches are incompatible; pick one or the other for CNV calling.
  • svprioritize – Produce a tab separated summary file of structural variants in regions of interest. This complements the full VCF files of structural variant calls to highlight changes in known genes. See the paper on cancer genome prioritization for the full details. This can be either the path to a BED file (with chrom start end gene_name, see Input file preparation) or the name of one of the pre-installed prioritization files:
    • cancer/civic (hg19, GRCh37, hg38) – Known cancer associated genes from CIViC.
    • cancer/az300 (hg19, GRCh37, hg38) – 300 cancer associated genes contributed by AstraZeneca oncology.
    • cancer/az-cancer-panel (hg19, GRCh37, hg38) – A text file of genes in the AstraZeneca cancer panel. This is only usable for svprioritize which can take a list of gene names instead of a BED file.
    • actionable/ACMG56 – Medically actionable genes from the American College of Medical Genetics and Genomics.
    • coding/ccds (hg38) – Consensus CDS (CCDS) regions with 2bps added to internal introns to capture canonical splice acceptor/donor sites, and multiple transcripts from a single gene merged into a single all inclusive gene entry.
  • fusion_mode – Enable fusion detection in RNA-seq when using STAR (recommended) or Tophat (not recommended) as the aligner. OncoFuse is used to summarise the fusions but currently only supports hg19 and GRCh37. For explant samples, disambiguate enables disambiguation of STAR output [false, true]. This option is deprecated in favor of fusion_caller.
  • fusion_caller – Specify a standalone fusion caller for fusion mode. Supports oncofuse for STAR/tophat runs, pizzly and ericscript for all runs. If a standalone caller is specified (i.e. pizzly or ericscript), fusion detection will not be performed with the aligner. oncofuse only supports human genome builds GRCh37 and hg19. ericscript supports human genome builds GRCh37, hg19 and hg38 after installing the associated fusion databases (Customizing data installation).
  • known_fusions – A TAB-delimited file of the format gene1<tab>gene2, where gene1 and gene2 are identifiers of genes specified under gene_name in the attributes part of the GTF file.
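
As an illustration of how the svcaller and svprioritize parameters combine, here is a minimal sketch of an algorithm section. It assumes paired-end reads and a genome build that ships the cancer/civic file (hg19, GRCh37 or hg38). The choice of callers is only an example: it pairs Manta with a single read-depth CNV caller, since cnvkit and gatk-cnv should not be used on the same sample.

```yaml
- description: Sample
  algorithm:
    # Manta requires paired-end reads (assumed here).
    # Only one of cnvkit / gatk-cnv is listed, never both.
    svcaller: [manta, gatk-cnv]
    # Name of a pre-installed prioritization file; the path to a BED file
    # with chrom, start, end and gene_name columns can be given instead.
    svprioritize: cancer/civic
```

Replacing cancer/civic with the path to your own BED file restricts the tab separated summary to the regions of interest for your project.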

Validation