Somatic (cancer) variants

For small variants (SNV and indels), bcbio supports the following workflows:

  • tumor-normal calling;

  • tumor only calling;

  • UMIs, including duplex UMIs, which improve precision in many applications including cfDNA analysis; We recommend starting with vardict and mutect2 as variant callers. bcbio also supports a majority voting ensemble approach to combine calls from multiple callers.

For copy number (CNV) detection, bcbio supports T/N and T only calling with a panel of normals with purecn, cnvkit, gatk-cnv and seq2c.

Workflow1 - T/N

This example runs a tumor-normal calling with mutect2 and vardict. Supply a consistent batch for tumor/normal pairs and mark them with the phenotype:

- description: sample_tumor
    variantcaller: [vardict, mutect2]
    batch: batch1
    phenotype: tumor
- description: sample_normal
    variantcaller: [vardict, mutect2]
    batch: batch1
    phenotype: normal

This example calls and validates variants using multiple approaches in a paired tumor/normal cancer sample from the ICGC-TCGA DREAM challenge. It uses synthetic dataset 3 which has multiple subclones, enabling detection of lower frequency variants.

The configuration and data file has downloads for exome only and whole genome analyses. It enables exome by default, but you can use the larger whole genome evaluation by uncommenting the relevant parts of the configuration and retrieval script.

1. Get the data:

mkdir -p cancer-dream-syn3/config cancer-dream-syn3/input cancer-dream-syn3/work
cd cancer-dream-syn3/config
cd ../input

2. Review parameters in the yaml file:

Modify variant_regions for both samples to point to your bcbio installation or copy NGv3.bed from there to project/input and reference it as ../input/NGv3.bed in the yaml.

# Cancer tumor/normal calling evaluation using synthetic dataset 3
# from the ICGC-TCGA DREAM challenge:
- algorithm:
    aligner: bwa
    mark_duplicates: true
    remove_lcr: true
    variantcaller: [mutect2, vardict]
    variant_regions: /path/to/bcbio/genomes/Hsapiens/hg38/coverage/capture_regions/NGv3.bed
  analysis: variant2
  description: syn3-normal
    - ../input/synthetic_challenge_set3_normal_NGv3_1.fq.gz
    - ../input/synthetic_challenge_set3_normal_NGv3_2.fq.gz
  genome_build: hg38
    batch: syn3
    phenotype: normal
- algorithm:
    aligner: bwa
    mark_duplicates: true
    remove_lcr: true
    variantcaller: [mutect2, vardict]
    variant_regions: /path/to/bcbio/genomes/Hsapiens/hg38/coverage/capture_regions/NGv3.bed 
  analysis: variant2
  description: syn3-tumor
    - ../input/synthetic_challenge_set3_tumor_NGv3_1.fq.gz
    - ../input/synthetic_challenge_set3_tumor_NGv3_2.fq.gz
  genome_build: hg38
    batch: syn3
    phenotype: tumor
  dir: ../final

Set remove_lcr parameter to true to remove low complexity regions from variant calling, both germline and somatic (additional information:

3. Run bcbio project

cd ../work ../config/cancer-dream-syn3.yaml -n 8

Cancer calling handles both tumor-normal paired calls and tumor-only calling. To specify a tumor-only sample, provide a single sample labeled with phenotype: tumor. Otherwise the configuration and setup is the same as with paired analyses. For tumor-only samples, bcbio will try to remove likely germline variants present in the public databases like 1000 genomes and ExAC, and not in COSMIC. This runs as long as you have a local GEMINI data installation (--datatarget gemini) and marks likely germline variants with a LowPriority filter. This post has more details on the approach and validation.

The standard variant outputs (sample-caller.vcf.gz) for tumor calling emphasize somatic differences, those likely variants unique to the cancer. If you have a tumor-only sample and GEMINI data installed, it will also output sample-caller-germline.vcf.gz, which tries to identify germline background mutations based on presence in public databases.

We’re actively working on improving calling to better account for the heterogeneity and structural variability that define cancer genomes.

Workflow2: Somatic and germline variants

For tumor/normal somatic samples, bcbio can call both somatic (tumor-specific) and germline (pre-existing) variants.

To option somatic and germline calls for your tumor/normal inputs, specify which callers to use for each step in the configuration:

description: your-normal
   somatic: vardict
   germline: gatk-haplotype

bcbio does a single alignment for the normal sample, then splits at the variant calling steps using this normal sample to do germline calling. In this example, the output files are:

  • your-tumor/your-tumor-vardict.vcf.gz – Somatic calls from the tumor samples using the normal as background to subtract existing calls.

  • your-normal/your-normal-freebayes.vcf.gz – Germline calls on the normal sample.

Germline calling supports multiple callers, and other configuration options like ensemble and structural variant calling inherit from the remainder configuration. For example, to use 3 callers for somatic and germline calling, create ensemble calls for both and include germline and somatic events from two structural variant callers:

   somatic: [vardict, strelka2, mutect2]
   germline: [freebayes, gatk-haplotype, strelka2]
   numpass: 2
svcaller: [manta, cnvkit]

In addition to the somatic and germline outputs attached to the tumor and normal sample outputs as described above, you’ll get:

  • your-tumor/your-tumor-manta.vcf.gz – Somatic structural variant calls for each specified svcaller. These will have genotypes for both the tumor and normal samples, with somatic calls labeled as PASS variants.

  • your-normal/your-normal-manta.vcf.gz – Germline structural variant calls for each specified svcaller. We expect these to be noisier than the somatic calls due to the lack of a reference sample to help remove technical noise.

Workflow3: copy number variants

See also PureCN user story which is a recommended option.

The first bcbio run creates a panel of normals (PON), the second bcbio run uses the PON file to call copy number variants (CNV) in tumor/normal (gatk-cnv) or tumor only (cnvkit) samples. PON samples should use the same gene panel/sequencing technology as tumor only samples. While it is technically possible to call CNVs in T/N pairs without PON, PON approach is preferable. 3 tools support PON in bcbio: gatk-cnv,CNVkit,seq2c. To call CNVs with a PON, this PON file should be created by the same method (not possible to create PON with CNVkit and use it for gatk-cnv calling. It is possible to calculate two PON files simultaneously (for gatk-cnv and seq2c or CNVkit and seqc2). CNVkit and gatk-cnv cannot be run together, because they require different, incompatible normalization schemes.

1. Collect PON samples and create a project structure

Put coverage.bed in pon/config/ and PON input files (bam, fq.gz) to pon/input. One test tumor sample is required to create a PON project, this tumor sample is not included in the PON.

$ mkdir pon
$ cd pon
$ mkdir input config
$ ls config
$ ls -1 input

coverage.bed contains regions from WES or gene panel capture kit provider.

2. Create pon.csv

Mark samples to build the PON with svclass=control


To use fastq input: S_2_N_R1.fq.gz;S_2_N_R2.fq.gz,S_2_N,control,pon_build

3. Create pon_template.yaml:

  - analysis: variant2
    genome_build: hg38
      svcaller: [gatk-cnv, seq2c]
      variant_regions: /path/pon/config/coverage.bed

4. Configure bcbio PON project

Put pon.csv, pon_template.yaml in the same dir as pon. Run automatic sample configuration:

$ ls -1
$ -w template pon_template.yaml pon.csv pon/input/*.bam

5. Run bcbio PON project

$ cd pon/work
$ ../config/pon.yaml -n 15

6. Collect PON file:

  • gatk-cnv: final/project/gatkcnv-pon.hdf5

  • seq2c doesn’t have a default PON file format so we create a bcbio specific one as a concatenation of the read mapping file final/date_project/seq2c-read_mapping.txt and coverage file final/date_project/seq2c-coverage.tsv outputs for the background samples. When fed to future bcbio runs, it will correctly extract and re-use this file as background.

cat seq2c-read_mapping.txt seq2c-coverage.tsv > seqc.pon.txt
  • CNVkit: final/testsample/testsample-cnvkit-background.cnn

Then use PON in a T/N project

7. Create pon_tn project structure

Put coverage.bed, gatk-cnv.pon.hdf5,seq2c.pon.txt in pon_tn/config/, copy or symlink all input files (bam, fq.gz) to pon/input.

mkdir pon_tn
cd pon_tn
mkdir config input
cp /path/coverage.bed /path/gatk-cnv.pon.hdf5 /path/seq2c.pon.txt config
# cp or symlink input files
cp /path/*.bam input
cd ..

8. Prepare a sample sheet pon_tn.csv:


9. Premate a yaml template: pon_tn_template.yaml

  - analysis: variant2
    genome_build: hg38
      svcaller: [gatk-cnv, seq2c]
      variant_regions: /path/pon_tn/config/coverage.bed
      coverage_interval: regional
          gatk-cnv: /path/pon_tn/config/gatk-cnv.pon.hdf5
          seq2c: /path/pon_tn/config/seqc.pon.txt

10. configure pon_tn project:

$ ls -1
$ -w template pon_tn_template.yaml pon_tn.csv pon_tn/input/*.bam

11. Run bcbio pon_tn project

$ cd pon_tn/work
$ ../config/pon_tn.yaml -n 15

Workflow4: Cancer-like mixture with Genome in a Bottle samples

This example simulates somatic cancer calling using a mixture of two Genome in a Bottle samples, NA12878 as the “tumor” mixed with NA24385 as the background. The Hartwig Medical Foundation and Utrecht Medical Center generated this “tumor/normal” pair by physical mixing of samples prior to sequencing. The GiaB FTP directory has more details on the design and truth sets. The sample has variants at 15% and 30%, providing the ability to look at lower frequency mutations.

To get the data:


Then run the analysis with:

cd work ../config/cancer-giab-na12878-na24385.yaml -n 16


  • variantcaller: should be consistent for T/N pairs

  • min_allele_fraction Minimum allele fraction to detect variants in heterogeneous tumor samples, set as the float or integer percentage to resolve (i.e. 10 = alleles in 10% of the sample). Defaults to 10. Specify this in the tumor sample of a tumor/normal pair. It is percentage, not ratio, it is divided /100.0 when calling vardict!

  • use_lowfreq_filter: false. When set, forces vardict to report variants with low allelec frequency, useful to call variants in panels with high coverage (>1000x). The default (option is not set to false in the config) is to use low frequency filter, i.e. variants could be underreported (variant VAF is above min_allele_fraction but rejected by the filter).

  • to use panel of normals (PON) for small variants with mutect2, specify a vcf file as a background for T sample.

    - algorithm:
        background: /path/to/input/1000g_pon.hg38.vcf.gz
        variantcaller: mutect2
        batch: batch1
        phenotype: tumor

    bcbio does not support PON generation at the moment. You may create a PON outside of bcbio or use PONs from Broad Institute

  • also see more parameters in germline_variants user story.

Ensemble variant calling

A simple majority rule ensemble classifier builds a final callset based on the intersection of calls. It selects variants represented in at least a specified number of callers:

variantcaller: [mutect2, varscan, freebayes, vardict]
  numpass: 2
  use_filtered: false

This example selects variants present in 2 out of the 4 callers and does not use filtered calls (the default behavior). Because of the difficulties of producing a unified FORMAT/genotype field across callers, the ensemble outputs contains a mix of outputs from the different callers. It picks a representative sample in the order of specified caller, so in the example above would have a MuTect2 call if present, otherwise a VarScan call if present, otherwise a FreeBayes call. This may require custom normalization scripts during post-processing when using these calls. bcbio.variation.recall implements this approach, which handles speed and file sorting limitations in the bcbio.variation approach.

This older approach uses the bcbio.variation toolkit to perform the consolidation. An example configuration in the algorithm section is:

variantcaller: [gatk, freebayes, samtools, gatk-haplotype, varscan]
  format-filters: [DP < 4]
    type: svm
    balance: [AD, FS, Entropy]
    calling: [ReadPosEndDist, PL, PLratio, Entropy, NBQ]
  trusted-pct: 0.65

The ensemble set of parameters configure how to combine calls from the multiple methods:

  • format-filters A set of filters to apply to variants before combining. The example removes all calls with a depth of less than 4.

  • classifier-params Parameters to configure the machine learning approaches used to consolidate calls. The example defines an SVM classifier.

  • classifiers Groups of classifiers to use for training and evaluating during machine learning. The example defines two set of criteria for distinguishing reads with allele balance issues and those with low calling support.

  • trusted-pct Define threshold of variants to include in final callset. In the example, variants called by more than 65% of the approaches (4 or more callers) pass without being requiring SVM filtering.


Unique molecular identifiers (UMIs) are short random barcodes used to tag individual molecules and avoid amplification biased. Both single cell RNA-seq and variant calling support UMIs. For variant calling, fgbio collapses sequencing duplicates for each UMI into a single consensus read prior to running re-alignment and variant calling. This requires mark_duplicates: true (the default) since it uses position based duplicates and UMI tags for collapsing duplicate reads into consensus sequences.

To help with preparing fastq files with UMIs bcbio provides a script This handles two kinds of UMI barcodes:

  • Separate UMIs: it converts reads output by an Illumina as 3 files (read 1, read 2, and UMIs).

  • Duplex barcodes with tags incorporated at the 5’ end of read 1 and read 2

In both cases, these get converted into paired reads with UMIs in the fastq names, allowing specification of umi_type: fastq_name in your bcbio YAML configuration. The script runs on a single set of files or autopairs an entire directory of fastq files. To convert a directory with separate UMI files: autopair -c <cores_to_use> <list> <of> <fastq> <files>

To convert duplex barcodes present on the ends of read 1 and read 2: autopair -c <cores_to_use> --tag1 5 --tag2 5 <list> <of> <fastq> <files>

If you want to prepare your FASTQ files for use with the umi_type: fastq_name option and they don’t follow the two barcode schemes listed you can pre-transform the FASTQ files yourself by putting :UMI_yourumisequence in the read name. Here is an example:


For more complex UMI schemes please open up an issue and we can help you write a transformation to prepare your files. We use to do these more complex transformations.

Configuration options for UMIs:

  • umi_type The UMI/cellular barcode scheme used for your data. For variant analysis with UMI based consensus calling, supports fastq_name with UMIs in read names, the path to a fastq file with UMIs for each aligned read or dragen which accepts the pre-consensus BAM file from DRAGEN.

  • correct_umis: [path/to/whitelist_umi.txt]. For a restricted set of UMIs specify a text file (one UMI per line). UMIs will be corrected with

You can adjust the fgbio default options by adjusting resources. The most common change is modifying the minimum number of reads as input to consensus sequences. This default to 1 to avoid losing reads, 3 is recommended for high depth panels:

    options: [--min-reads, 3]

Duplex UMI sequencing allows to reduce false positive rate compared to single UMIs, see implementations from TwinStrand or IDT. Duplex UMI data processing example in bcbio


Project directory:

  • grading-summary.csv – Grading details comparing each sample to a reference set of calls. This will only have information when providing a validation callset.

  • BATCH-caller.vcf – Variants called for a population/batch of samples by a particular caller.

  • BATCH-caller.db – A GEMINI database associating variant calls with a wide variety of third party annotations. This provides a queryable framework for assessing variant quality statistics.

Sample directories:

  • SAMPLE-caller.vcf – Variants calls for an individual sample.

  • SAMPLE-gdc-viral-completeness.txt – Optional viral contamination estimates. File is of the format depth, 1x, 5x, 25x. depth is the number of reads aligning to the virus. 1x, 5x, 25x are percentage of the viral sequence covered by reads of 1x, 5x, 25x depth. Real viral contamination will have broad coverage across the entire genome, so high numbers for these values, depending on sequencing depth. High depth and low viral sequence coverage means a likely false positive.


bcbio pre-installs standard truth sets for performing validation, and also allows use of custom local files for assessing reliability of your runs:

  • validate A VCF file of expected variant calls to perform validation and grading of small variants (SNPs and indels) from the pipeline. This provides a mechanism to ensure consistency of calls against a known set of variants, supporting comparisons to genotyping array data or reference materials.

  • validate_regions A BED file of regions to evaluate small variant calls in. This defines specific regions covered by the validate VCF file.

  • svvalidate – Dictionary of call types and pointer to BED file of known regions. For example: DEL: known_deletions.bed does deletion based validation of outputs against the BED file.

Each option can be either the path to a local file, or a partial path to a file in the pre-installed truth sets. For instance, to validate an NA12878 run against the Genome in a Bottle truth set:

validate: giab-NA12878/truth_small_variants.vcf.gz
validate_regions: giab-NA12878/truth_regions.bed
  DEL: giab-NA12878/truth_DEL.bed

For cancer validations:

A full evaluation of cancer calling validates callers against synthetic dataset 3 from the ICGC-TCGA DREAM challenge.