Germline variant calling¶
bcbio implements configurable SNP, indel and structural variant calling for germline populations. We include whole genome and exome evaluations against reference calls from the Genome in a Bottle consortium and Illumina Platinum Genomes project, enabling continuous assessment of new alignment and variant calling algorithms. We regularly report on these comparisons and continue to improve approaches as the community makes new tools available. Here is some of the research that contributes to the current implementation:
- An introduction to the variant evaluation framework. This includes a comparison of the bwa mem and novoalign aligners. We also compared the FreeBayes, GATK HaplotypeCaller and GATK UnifiedGenotyper variant callers.
- An in-depth evaluation of FreeBayes and BAM post-alignment processing. We found that FreeBayes quality was equal to GATK HaplotypeCaller. Additionally, a lightweight post-alignment preparation method using only de-duplication was equivalent to GATK’s recommended Base Quality Score Recalibration (BQSR) and realignment around indels, when using good quality input datasets and callers that do local realignment.
- Additional work to improve variant filtering, providing methods to
remove low complexity regions (LCRs) that can bias indel results. We also
tuned GATK’s Variant Quality Score Recalibrator (VQSR) and compared it with
cutoff-based soft filtering. VQSR requires a large number of variants and we use
it in bcbio with GATK HaplotypeCaller when your Algorithm parameters
contains high depth samples (
coverage_depthis not low) and you are calling on the whole genome (
coverage_intervalis genome) or have more than 50 regional or exome samples called concurrently.
- An evaluation of joint calling with GATK HaplotypeCaller, FreeBayes, Platypus and samtools. This validates the joint calling implementation, allowing scaling of large population germline experiments. It also demonstrates improved performance of new callers: samtools 1.0 and Platypus.
- Support for build 38 of the human genome, improving precision of detection thanks to the improved genome representation.
bcbio automates post-variant calling annotation to make the outputs easier to feed directly into your biological analysis. We annotate variant effects using snpEff or Variant Effect Predictor (VEP), and prepare a GEMINI database that associates variants with multiple external annotations in a SQL-based query interface. GEMINI databases have the most associated external information for human samples (GRCh37/hg19 and hg38) but are available for any organism with the database populated using the VCF INFO column and predicted effects.
Basic germline calling¶
- FreeBayes template – Call variants using FreeBayes with a minimal preparation pipeline. This is a freely available unrestricted pipeline fully included in the bcbio installation.
- GATK HaplotypeCaller template – Run GATK best practices, including Base Quality Score Recalibration, realignment and HaplotypeCaller variant calling. This requires a license from Broad for commercial use. You need to manually install GATK along with bcbio using downloads from the GATK Broad site or Appistry (see Extra software).
You may also want to enable Structural variant calling for detection of larger events, which work with either caller. Another good source of inspiration are the configuration files from the Example pipelines, which may help identify other configuration variables of interest. A more complex setup with multiple callers and resolution of ensemble calls is generally only useful with a small population where you are especially concerned about sensitivity. Single caller detection with FreeBayes or GATK HaplotypeCaller provide good resolution of events.
When calling multiple samples, we recommend calling together to provide improved
sensitivity and a fully squared off final callset. To associate samples together
in a population add a
batch to the Samples from GEO or SRA:
- description: Sample1 metadata: batch: Batch1 - description: Sample2 metadata: batch: Batch1
Batching samples results in output VCFs and GEMINI databases containing all merged sample calls. bcbio has two methods to call samples together:
Batch or pooled calling – This calls all samples simultaneously by feeding them to the variant caller. This works for smaller batch sizes (< 100 samples) as memory requirements become limiting in larger pools. This is the default approach taken when you specify a
variantcallerin the Variant calling configuration.
Joint calling – This calls samples independently, then combines them together into a single callset by integrating the individual calls. This scales to larger population sizes by avoiding the computational bottlenecks of pooled calling. We recommend joint calling with HaplotypeCaller if you have a license for GATK usage, but also support joint calling with FreeBayes using a custom implementation. Specifying a
jointcalleralong with the appropriate
variantcallerin the Variant calling configuration enables this:
- description: Sample1 algorithm: variantcaller: gatk-haplotype jointcaller: gatk-haplotype-joint metadata: batch: Batch1 - description: Sample2 algorithm: variantcaller: gatk-haplotype jointcaller: gatk-haplotype-joint metadata: batch: Batch1
Cancer variant calling¶
bcbio supports somatic cancer calling with tumor and optionally matched normal pairs using multiple SNP, indel and structural variant callers. A full evaluation of cancer calling validates callers against synthetic dataset 3 from the ICGC-TCGA DREAM challenge. bcbio uses a majority voting ensemble approach to combining calls from multiple SNP and indel callers, and also flattens structural variant calls into a combined representation.
The example configuration for the Whole genome trio (50x) - hg38 validation is a good starting point for setting up a tumor/normal run on your own dataset. The configuration works similarly to population based calling. Supply a consistent batch for tumor/normal pairs and mark them with the phenotype:
- description: your-tumor algorithm: variantcaller: [vardict, strelka2, mutect2] metadata: batch: batch1 phenotype: tumor - description: your-normal algorithm: variantcaller: [vardict, strelka2, mutect2] metadata: batch: batch1 phenotype: normal
Other Somatic variant calling configuration options allow tweaking of the
processing parameters. For pairs you want to analyze together, specify a
consistent set of
variantcaller options for both samples.
Cancer calling handles both tumor-normal paired calls and tumor-only calling. To
specify a tumor-only sample, provide a single sample labeled with
tumor. Otherwise the configuration and setup is the same as with paired
analyses. For tumor-only samples, bcbio will try to remove likely germline
variants present in the public databases like 1000 genomes and ExAC, and not in
COSMIC. This runs as long as you have a local GEMINI data installation
--datatarget gemini) and marks likely germline variants with a
LowPriority filter. This post has more details on the approach and validation.
The standard variant outputs (
sample-caller.vcf.gz) for tumor calling
emphasize somatic differences, those likely variants unique to the cancer. If
you have a tumor-only sample and GEMINI data installed, it will also output
sample-caller-germline.vcf.gz, which tries to identify germline background
mutations based on presence in public databases. If you have tumor/normal data
and would like to also call likely germline mutations see the documentation on
specifying a germline caller: Somatic with germline variants.
We’re actively working on improving calling to better account for the heterogeneity and structural variability that define cancer genomes.
Somatic with germline variants¶
For tumor/normal somatic samples, bcbio can call both somatic (tumor-specific) and germline (pre-existing) variants. The typical outputs of Cancer variant calling are likely somatic variants acquired by the cancer, but pre-existing germline risk variants are often also diagnostic.
For tumor-only cases we suggest running standard Cancer variant calling. Tumor-only inputs mix somatic and germline variants, making it difficult to separate events. For small variants (SNPs and indels) bcbio will attempt to distinguish somatic and germline mutations using the presence of variants in population databases.
To option somatic and germline calls for your tumor/normal inputs, specify which callers to use for each step in the Variant calling configuration:
description: your-normal variantcaller: somatic: vardict germline: freebayes
bcbio does a single alignment for the normal sample, then splits at the variant calling steps using this normal sample to do germline calling. In this example, the output files are:
your-tumor/your-tumor-vardict.vcf.gz– Somatic calls from the tumor samples using the normal as background to subtract existing calls.
your-normal/your-normal-freebayes.vcf.gz– Germline calls on the normal sample.
Germline calling supports multiple callers, and other configuration options like ensemble and structural variant calling inherit from the remainder configuration. For example, to use 3 callers for somatic and germline calling, create ensemble calls for both and include germline and somatic events from two structural variant callers:
variantcaller: somatic: [vardict, strelka2, mutect2] germline: [freebayes, gatk-haplotype, strelka2] ensemble: numpass: 2 svcaller: [manta, cnvkit]
In addition to the somatic and germline outputs attached to the tumor and normal sample outputs as described above, you’ll get:
your-tumor/your-tumor-manta.vcf.gz– Somatic structural variant calls for each specified
svcaller. These will have genotypes for both the tumor and normal samples, with somatic calls labeled as PASS variants.
your-normal/your-normal-manta.vcf.gz– Germline structural variant calls for each specified
svcaller. We expect these to be noisier than the somatic calls due to the lack of a reference sample to help remove technical noise.
Somatic tumor only CNVs¶
Copy number variation (CNVs) detection in tumor only samples requires accurately representing the non-somatic capture and sequencing background in the absence of a matched sample. Capture or sequencing specific coverage differences can trigger false positives or negatives. Without a matched normal to remove these artifacts, you can use a process matched set of unrelated samples to build a Panel of Normals (PoN) for the background correction.
To create these, collect all the samples you plan to use for the panel of normals and run through a Automated sample configuration as a single batch with the background samples set as control and any nonbackground as the non-background. An example sample CSV:
samplename,description,svclass,batch background1.bam,background1,control,pon_build background2_R1.fq.gz;background2_R2.fq.gz,background2,control,pon_build testsample.bam,testsample,pon_build
and template YAML:
details: - analysis: variant2 genome_build: hg38 algorithm: svcaller: [gatk-cnv, seq2c] variant_regions: your_regions.bed
After running, collect the panel of normal files from each calling method:
- gatk-cnv: work/structural/testsample/bins/background1-pon_build-pon.hdf5
- seq2c: This doesn’t have a default panel of normals file format so we create a bcbio specific one as a concatenation of the read mapping file (final/date_project/seq2c-read_mapping.txt) and coverage file (final/date_project/seq2c-coverage.tsv) outputs for the background samples. When fed to future bcbio runs, it will correctly extract and re-use this file as background.
- CNVkit: final/testsample/testsample-cnvkit-background.cnn
CNVkit and gatk-cnv cannot be run together, because they require different, incompatible normalization schemes.
Once you have the panel of normals, use them as background in any tumor only project with the same sequencing and capture process in your :ref: variant-config configuration:
svcaller: [gatk-cnv, seq2c] variant_regions: your_regions.bed background: cnv_reference: cnvkit: ../pon/your_regions-cnvkit-pon.cnn gatk-cnv: ../pon/your_regions-gatk-cnv-pon.hdf5 seq2c: ../pon/your_region-seq2c-pon.txt
Structural variant calling¶
bcbio can detect larger structural variants like deletions, insertions, inversions and copy number changes for both germline population and cancer variant calling, based on validation against existing truth sets:
- Validation of germline structural variant detection using multiple calling methods to validate against deletions in NA12878. This implements a pipeline that works in tandem with SNP and indel calling to detect larger structural variations like deletions, duplications, inversions and copy number variants (CNVs).
- Validation of tumor/normal calling using the synthetic DREAM validation set. This includes validation of additional callers against duplications, insertions and inversions.
To enable structural variant calling, specify
svcaller options in the
algorithm section of your configuration:
- description: Sample algorithm: svcaller: [lumpy, manta, cnvkit]
The best supported callers are Lumpy and Manta, for paired end and split read calling, CNVkit for read-depth based CNV calling, and WHAM for association testing. We also support DELLY, another excellent paired end and split read caller, although it is slow on large whole genome datasets.
bcbio can also be used to analyze RNA-seq data. It includes steps for quality control, adapter trimming, alignment, variant calling, transcriptome reconstruction and post-alignment quantitation at the level of the gene and isoform.
We recommend using the STAR aligner for all genomes where there are no alt alleles. For genomes such as hg38 that have alt alleles, hisat2 should be used as it handles the alts correctly and STAR does not yet. Use Tophat2 only if you do not have enough RAM available to run STAR (about 30 GB).
Our current recommendation is to run adapter trimming only if using the Tophat2 aligner. Adapter trimming is very slow, and aligners that soft clip the ends of reads such as STAR and hisat2, or algorithms using pseudoalignments like Sailfish handle contaminant sequences at the ends properly. This makes trimming unnecessary. Tophat2 does not perform soft clipping so if using Tophat2, trimming must still be done.
Salmon, which is an extremely fast alignment-free method of quantitation, is run for all experiments. Salmon can accurately quantitate the expression of genes, even ones which are hard to quantitate with other methods (see this paper for example for Sailfish, which performs similarly to Salmon.). Salmon can also quantitate at the transcript level which can help gene-level analyses (see this paper for example). We recommend using the Salmon quantitation rather than the counts from featureCounts to perform downstream quantification.
Although we do not recommend using the featureCounts based counts, the alignments are still useful because they give you many more quality metrics than the quasi-alignments from Salmon.
After a bcbio RNA-seq run there will be in the
upload directory a directory
for each sample which contains a BAM file of the aligned and unaligned reads, a
sailfish directory with the output of Salmon, including TPM values, and a
qc directory with plots from FastQC and qualimap.
In addition to directories for each sample, in the
upload directory there is
a project directory which contains a YAML file describing some summary
statistics for each sample and some provenance data about the bcbio run. In that
directory is also a
combined.counts file with the featureCounts derived
counts per cell.
This mode of
bcbio-nextgen quantitates transcript expression using Salmon and does nothing else. It is an
order of magnitude faster or more than running the full RNA-seq analysis. The
cost of the increased speed is that you will have much less information about
your samples at the end of the run, which can make troubleshooting trickier.
bcbio-nextgen supports universal molecular identifiers (UMI) based single-cell RNA-seq analyses. If your single-cell prep does not use universal molecular identifiers (UMI), you can most likely just run the standard RNA-seq pipeline and use the results from that. The UMI are used to discard reads which are possibly PCR duplicates and is very helpful for removing some of the PCR duplicate noise that can dominate single-cell experiments.
Unlike the standard RNA-seq pipeline, the single-cell pipeline expects the FASTQ input files to not be separated by cellular barcode, so each file is a mix of cells identified by a cellular barcode (CB), and unique reads from a transcript are identified with a UMI. bcbio-nextgen inspects each read, identifies the cellular barcode and UMI and puts them in the read name. Then the reads are aligned to the transcriptome with RapMap and the number of reads aligning to each transcript is counted for each cellular barcode. The output is a table of counts with transcripts as the rows and columns as the cellular barcodes for each input FASTQ file.
Optionally the reads can be quantitated with
kallisto to output transcript
compatibility counts rather than counts per gene
To extract the UMI and cellular barcodes from the read, bcbio-nextgen needs to know where the UMI and the cellular barcode are expected to be in the read. Currently there is support for two schemes, the inDrop system from the Harvard single-cell core facility and CEL-seq. If bcbio-nextgen does not support your UMI and barcoding scheme, please open up an issue and we will help implement support for it.
Most of the heavy lifting for this part of bcbio-nextgen is implemented in the umis repository.
bcbio-nextgen also implements a configurable best-practices pipeline for smallRNA-seq quality controls, adapter trimming, miRNA/isomiR quantification and other small RNA detection.
- Adapter trimming:
- Sequence alignment:
- Specific small RNAs quantification (miRNA/tRNAs…):
- seqbuster for miRNA annotation
- MINTmap for tRNA fragments annotation
- miRge2 for alternative small RNA quantification. To setup this tool, you need
- install manually miRge2.0, and download the library data for your species.
Read how to install and download the data here.
If you have
/mnt/data/humanthe option to pass to resources will be
/mnt/data. Then setup
- options: [“-lib $PATH_TO_PARENT_SPECIES_LIB”]
- Quality control:
- Other small RNAs quantification:
The pipeline generates a _RMD_ template file inside
that can be rendered with knitr. An example of the report is at here.
Count table (
counts_mirna.tst) from mirbase miRNAs will be
mirbase or final project folder.
Input files for isomiRs package for isomiRs analysis will be
inside each sample in
If mirdeep2 can run, count table (
for novel miRNAs will be inside
mirdeep2 or final project folder.
tdrmapper results will be inside each sample
tdrmapper or final project folder.
The bcbio-nextgen implementation of ChIP-seq aligns, removes multimapping reads, calls peaks with a paired input file using MACS2 and outputs a set of greylist regions for filtering possible false peaks in regions of high depth in the input file.
Whole genome bisulfite sequencing is supported using the bismark2 pipeline. It can be turned on by setting analysis to wgbs-seq.
This pipeline implements
qc tools. Furthermore, it will
run qsignature to detect possible duplicated samples, or mislabeling. It
uses SNPs signature to create a distance matrix that helps easily to create
groups. The project yaml file will show the number of total samples analyzed,
the number of very similar samples, and samples that could be duplicated.
We will assume that you installed bcbio-nextgen with the automated installer, and so your default bcbio_system.yaml file is configured correctly with all of the tools pointing to the right places. If that is the case, to run bcbio-nextgen on a set of samples you just need to set up a YAML file that describes your samples and what you would like to do to them. Let’s say that you have a single paired-end control lane, prepared with the Illumina TruSeq Kit from a human. Here is what a well-formed sample YAML file for that RNA-seq experiment would look like:
fc_date: '070113' fc_name: control_experiment upload: dir: final details: - files: [/full/path/to/control_1.fastq, /full/path/to/control_2.fastq] description: 'Control_rep1' genome_build: GRCh37 analysis: RNA-seq algorithm: aligner: tophat2 quality_format: Standard trim_reads: read_through adapters: [truseq, polya] strandedness: unstranded
fc_name will be combined to form a prefix to name
intermediate files, and can be set to whatever you like.
explained pretty well in the configuration documentation and the above will
direct bcbio-nextgen to put the output files from the pipeine into the
details is a list of sections each describing a sample to
process. You can set many parameters under each section but most of
the time just setting a few like the above is all that is necessary.
analysis tells bcbio-nextgen to run the best-practice RNA-seq pipeline on
In the above, since there are two files,
control_2.fastq will be automatically run as paired-end data. If you have
single end data you can just supply one file and it will run as single-end. The
description field will be used to eventually rename the files, so make it
very evocative since you will be looking at it a lot later.
Sometimes you need a little bit more flexibility than the standard pipeline, and
algorithm section has many options to fine-tune the behavior of the
quality_format tells bcbio-nextgen what quality format your FASTQ
inputs are using, if your samples were sequenced any time past 2009 or so, you
probably want to set it to
Standard. Adapter read-through is a problem in
RNA-seq libraries, so we want to trim off possible adapter sequences on the ends
of reads, so
trim_reads is set to
read_through, which will also trim off
poor quality ends. Since your library is a RNA-seq library prepared with the
TruSeq kit, the set of adapters to trim off are the TruSeq adapters and possible
polyA tails, so
adapters is set to both of those.
can be set if your library was prepared in a strand-specific manner and can
be set to firststrand, secondstrand or unstranded (the default).
Lets say you have a set of mouse samples to analyze and each sample is a single lane of single-end RNA-seq reads prepared using the NextEra kit. There are two case and two control samples. Here is a sample configuration file for that analysis:
fc_date: '070113' fc_name: mouse_analysis upload: dir: final details: - files: [/full/path/to/control_rep1.fastq] description: 'Control_rep1' genome_build: mm10 analysis: RNA-seq algorithm: aligner: tophat2 quality_format: Standard trim_reads: read_through adapters: [nextera, polya] - files: [/full/path/to/control_rep2.fastq] description: 'Control_rep2' genome_build: mm10 analysis: RNA-seq algorithm: aligner: tophat2 quality_format: Standard trim_reads: read_through adapters: [nextera, polya] - files: [/full/path/to/case_rep1.fastq] description: 'Case_rep1' genome_build: mm10 analysis: RNA-seq algorithm: aligner: tophat2 quality_format: Standard trim_reads: read_through adapters: [nextera, polya] - files: [/full/path/to/case_rep2.fastq] description: 'Case_rep2' genome_build: mm10 analysis: RNA-seq algorithm: aligner: tophat2 quality_format: Standard trim_reads: read_through adapters: [nextera, polya]
More samples are added just by adding more entries under the details section.
This is tedious and error prone to do by hand, so there is an automated
template system for common experiments. You could set up the previous
experiment by making a mouse version of the illumina-rnaseq template
file and saving it to a local file such as
you can set up the sample file using the templating system:
bcbio_nextgen.py -w template illumina-mouse-rnaseq.yaml mouse_analysis /full/path/to/control_rep1.fastq /full/path/to/control_rep2.fastq /full/path/to/case_rep1.fastq /full/path/to/case_rep2.fastq
If you had paired-end samples instead of single-end samples, you can still use the template system as long as the forward and reverse read filenames are the same, barring a _1 and _2. For example: control_1.fastq and control_2.fastq will be detected as paired and combined in the YAML file output by the templating system.