Small germline variants¶
Workflow1: validate hg38 calls¶
This workflow validates variant calls using WES data for NA12878 sample.
1. Create project structure¶
mkdir validate_giab cd validate_giab mkdir config input final work
2. Prepare input data¶
cd input wget -c ftp://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/NIST7035_TAAGGCGA_L001_R1_001.fastq.gz wget -c ftp://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/NIST7035_TAAGGCGA_L001_R2_001.fastq.gz wget -c ftp://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/NIST7035_TAAGGCGA_L002_R1_001.fastq.gz wget -c ftp://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/NIST7035_TAAGGCGA_L002_R2_001.fastq.gz wget -c ftp://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/NIST7086_CGTACTAG_L001_R1_001.fastq.gz wget -c ftp://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/NIST7086_CGTACTAG_L001_R2_001.fastq.gz wget -c ftp://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/NIST7086_CGTACTAG_L002_R1_001.fastq.gz wget -c ftp://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/NIST7086_CGTACTAG_L002_R2_001.fastq.gz # hg19 wget -c ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/nexterarapidcapture_expandedexome_targetedregions.bed.gz gunzip nexterarapidcapture_expandedexome_targetedregions.bed.gz cat *R1* > NA12878_1.fq.gz cat *R2* > NA12878_2.fq.gz rm Garvan_NA12878*
3. Convert capture regions file to hg38 coordinates with UCSC liftover.¶
Successfully converted 200993 records: View Conversions Conversion failed on 78 records.
Save file to
4. Set analysis parameters in
details: - files: - /full/path/validate_giab/input/NA12878_1.fq.gz - /full/path/validate_giab/input/NA12878_2.fq.gz description: NA12878 metadata: sex: female analysis: variant2 genome_build: hg38 algorithm: aligner: bwa variantcaller: gatk-haplotype validate: giab-NA12878/truth_small_variants.vcf.gz validate_regions: giab-NA12878/truth_regions.bed variant_regions: /full/path/validate_giab/input/nexterarapidcapture_expandedexome_targetedregions.hg38.bed resources: default: cores: 7 jvm_opts: - -Xms750m - -Xmx7000m memory: 8G upload: dir: ../final
5. Run the project (8cores/64G RAM)¶
cd validate_giab/work bcbio_nextgen.py ../config/NA12878.yaml -n 8
Running time ~ 2.4h
6. Review results:¶
sample,caller,vtype,metric,value NA12878,gatk-haplotype,SNPs,tp,36710 NA12878,gatk-haplotype,Indels,tp,3588 NA12878,gatk-haplotype,SNPs,fp,285 NA12878,gatk-haplotype,Indels,fp,981 NA12878,gatk-haplotype,SNPs,fn,323 NA12878,gatk-haplotype,Indels,fn,611
Here FDR_SNPS = 285 / (285 + 36710) = 0.77%, so precision is 99.23%
FNR_SNPS = 323 / (323 + 36710) = 0.87%, so sensitivity is 99.13%
more improvement is possible with higher coverage depth and custom filters, but overall 99.23% precision / 99.13% sensitivity is a good result for out of the box pipeline.
Workflow2: Basic germline calling¶
The best approach to build a bcbio configuration for germline calling is to use the automated sample configuration with one of the default templates:
- FreeBayes template –Call variants using FreeBayes with a minimal preparation pipeline. This is a freely available unrestricted pipeline fully included in the bcbio installation.
- GATK HaplotypeCaller template –Run GATK best practices, including Base Quality Score Recalibration, realignment and HaplotypeCaller variant calling. This requires a license from Broad for commercial use. You need to manually install GATK along with bcbio using downloads from the GATK Broad site or Appistry.
Workflow3: Population calling¶
When calling multiple samples, we recommend calling together to provide improved
sensitivity and a fully squared off final callset. To associate samples together
in a population add a
batch to the sample configuration:
- description: Sample1 metadata: batch: Batch1 - description: Sample2 metadata: batch: Batch1
Batching samples results in output VCFs and GEMINI databases containing all merged sample calls. bcbio has two methods to call samples together:
- Batch or pooled calling – This calls all samples simultaneously by feeding
them to the variant caller. This works for smaller batch sizes (< 100 samples)
as memory requirements become limiting in larger pools. This is the default approach
taken when you specify a
variantcallerin the variant calling configuration.
- Joint calling – This calls samples independently, then combines them together
into a single callset by integrating the individual calls.
This scales to larger population sizes by avoiding the computational bottlenecks
of pooled calling. We recommend joint calling with HaplotypeCaller
but also support joint calling with FreeBayes using a custom implementation.
jointcalleralong with the appropriate
variantcallerin the variant calling configuration enables this
- description: Sample1 algorithm: variantcaller: gatk-haplotype jointcaller: gatk-haplotype-joint metadata: batch: Batch1 - description: Sample2 algorithm: variantcaller: gatk-haplotype jointcaller: gatk-haplotype-joint metadata: batch: Batch1
Workflow4: Whole genome trio (50x) - hg38¶
This input configuration runs whole genome bwa alignment and GATK variant calling. It uses a father/mother/child trio from the CEPH NA12878 family: NA12891, NA12892, NA12878. Illumina’s Platinum genomes project has 50X whole genome sequencing of the three members. The analysis compares results against a reference NA12878 callset from NIST’s Genome in a Bottle initiative.
To run the analysis do:
mkdir NA12878-trio-eval cd NA12878-trio-eval mkdir config input work cd config wget https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/config/examples/NA12878-trio-wgs-validate.yaml cd ../input wget https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/config/examples/NA12878-trio-wgs-validate-getdata.sh bash NA12878-trio-wgs-validate-getdata.sh cd ../work bcbio_nextgen.py ../config/NA12878-trio-wgs-validate.yaml -n 16
This is a large whole genome analysis and meant to test both pipeline scaling and validation across the entire genome. It can take multiple days to run depending on available cores. It requires 300GB for the input files and 1.3TB for the work directory. Smaller examples below exercise the pipeline with less disk and computational requirements.
Workflow5: Whole genome (10x)¶
An input configuration for running whole gnome variant calling with bwa and GATK, using Illumina’s Platinum genomes project NA12878-illumina.yaml. See this blog post on whole genome scaling for expected run times and more information about the pipeline. To run the analysis:
Create an input directory structure like:
├── config │ └── NA12878-illumina.yaml ├── input └── work
Retrieve inputs and comparison calls:
cd input wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR091/ERR091571/ERR091571_1.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR091/ERR091571/ERR091571_2.fastq.gz
Retrieve configuration input file:
cd config wget https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/config/examples/NA12878-illumina.yaml
Run analysis on 16 core machine:
cd work bcbio_nextgen.py ../config/NA12878-illumina.yaml -n 16
Examine summary of concordance and discordance to comparison calls from the
grading-summary.csvfile in the work directory.
variantcallerVariant calling algorithm. Can be a list of multiple options or false to skip [false, freebayes, gatk-haplotype, haplotyper, platypus, mutect, mutect2, scalpel, tnhaplotyper, tnscope, vardict, varscan, samtools, gatk]
- Paired (typically somatic, tumor-normal) variant calling is currently supported by vardict, freebayes, mutect2, mutect (see disclaimer below), scalpel (indels only), tnhaplotyper (Sentieon), tnscope (Sentieon) and varscan. See somatic variant calling documentation for details on pairing tumor and normal samples.
- You can generate both somatic and germline calls for paired tumor-normal samples using different sets of callers.
The pipeline documentation on calling
Somatic with germline variantsdetails how to do this.
- mutect, a SNP-only caller, can be combined with indels from scalpel or sid. Mutect operates in both tumor-normal and tumor-only modes. In tumor-only mode the indels from scalpel will reflect all indels in the sample, as there is currently no way of separating the germline from somatic indels in tumor-only mode.
indelcallerFor the MuTect SNP only variant caller it is possible to add calls from an indelcaller such as scalpel, pindel and somatic indel detector (for Appistry MuTect users only). Currently an experimental option that adds these indel calls to MuTect’s SNP-only output. Only one caller supported. Omit to ignore. [scalpel, pindel, sid, false]
jointcallerJoint calling algorithm, combining variants called with the specified
variantcaller. Can be a list of multiple options but needs to match with appropriate
variantcaller. Joint calling is only needed for larger input sample sizes (>100 samples), otherwise use standard pooled
gatk-haplotype-jointGATK incremental joint discovery with HaplotypeCaller. Takes individual gVCFs called by
gatk-haploypeand perform combined genotyping.
freebayes-jointCombine freebayes calls using bcbio.variation.recall with recalling at all positions found in each individual sample. Requires
platypus-jointCombine platypus calls using bcbio.variation.recall with squaring off at all positions found in each individual sample. Requires
samtools-jointCombine samtools calls using bcbio.variation.recall with squaring off at all positions found in each individual sample. Requires
joint_group_sizeSpecify the maximum number of gVCF samples to feed into joint calling. Currently applies to GATK HaplotypeCaller joint calling and defaults to the GATK recommendation of 200. Larger numbers of samples will first get combined prior to genotyping.
ploidyPloidy of called reads. Defaults to 2 (diploid). You can also tweak specialty ploidy like mitochondrial calling by setting ploidy as a dictionary. The defaults are:
ploidy: default: 2 mitochondrial: 1 female: 2 male: 1
backgroundProvide pre-calculated files to use as backgrounds for different processes. Organized as a dictionary with individual keys for different components of the pipeline. You can enter as many or few as needed:
variantA VCF file with variants to use as a background reference during variant calling. For tumor/normal paired calling use this to supply a panel of normal individuals.
cnv_referenceBackground reference file for copy number calling. This can be either a single file for one CNV method or a dictionary for multiple methods. Supports CNVkit cnn inputs, GATK4 HDF5 panel of normals and seq2c combined mapping plus coverage files:
background: cnv_reference: cnvkit: /path/to/background.cnn gatk-cnv: /path/to/background_pon.hdf5 seq2c: /path/to/background.tsv
effectsMethod used to calculate expected variant effects; defaults to snpEff. Ensembl variant effect predictor (VEP) is also available when downloaded using
Customizing data installation. [snpeff, vep, false]
effects_transcriptsDefine the transcripts to use for effect prediction annotation. Options
all: Standard Ensembl transcript list (the default);
canonical: Report single canonical transcripts (
canonical_cancerCanonical transcripts with hand curated changes for more common cancer transcripts (effects snpEff only).
vcfannoConfiguration files for vcfanno, allowing the application of additional annotations to variant calls. By default, bcbio will try and apply:
gemini– External population level annotations from GEMINI. This is only run for human samples with gemini data installed.
somatic– Somatic annotations from COSMIC, ClinVar and friends. COSMIC need a custom installation within bcbio. Only added for tumor or tumor/normal somatic calling.
rnaedit– RNA editing sites for RNA-seq variant calling runs. bcbio installs pre-prepared configuration files in
genomes/build/config/vcfannoor you can specify the full path to a
/path/your/anns.confand optionally an equivalently named
/path/your/anns.luafile. This value can be a list for multiple inputs.
See description in somatic_variants.
bcbio pre-installs standard truth sets for performing validation, and also allows use of custom local files for assessing reliability of your runs:
validateA VCF file of expected variant calls to perform validation and grading of small variants (SNPs and indels) from the pipeline. This provides a mechanism to ensure consistency of calls against a known set of variants, supporting comparisons to genotyping array data or reference materials.
validate_regionsA BED file of regions to evaluate small variant calls in. This defines specific regions covered by the
svvalidate– Dictionary of call types and pointer to BED file of known regions. For example:
DEL: known_deletions.beddoes deletion based validation of outputs against the BED file.
Each option can be either the path to a local file, or a partial path to a file in the pre-installed truth sets. For instance, to validate an NA12878 run against the Genome in a Bottle truth set:
validate: giab-NA12878/truth_small_variants.vcf.gz validate_regions: giab-NA12878/truth_regions.bed svvalidate: DEL: giab-NA12878/truth_DEL.bed
follow the same naming schemes for small variants, regions and different structural variant types. bcbio has the following validation materials for germline validations:
giab-NA12878– Genome in a Bottle for NA12878, a Caucasian sample. Truth sets: small_variants, regions, DEL; Builds: GRCh37, hg19, hg38
giab-NA24385– Genome in a Bottle for NA24385, an Ashkenazic Jewish sample. Truth sets: small_variants, regions; Builds: GRCh37, hg19, hg38
giab-NA24631– Genome in a Bottle for NA24631, a Chinese sample. Truth sets: small_variants, regions; Builds: GRCh37, hg19, hg38
giab-NA12878-crossmap– Genome in a Bottle for NA12878 converted to hg38 with CrossMap. Truth sets: small_variants, regions, DEL; Builds: hg38
giab-NA12878-remap– Genome in a Bottle for NA12878 converted to hg38 with Remap. Truth sets: small_variants, regions, DEL; Builds: hg38
platinum-genome-NA12878– Illumina Platinum Genome for NA12878. Truth sets: small_variants, regions; Builds: hg19, hg38
For more information on the hg38 truth set preparation see the work on validation on build 38 and conversion of human build 37 truth sets to build 38. The installation recipes contain provenance details about the origins of the installed files.
- NIST genome in a bottle
- GIAB data index
- Illumina Platinum Genomes
- An introduction to the variant evaluation framework. This includes a comparison of the bwa mem and novoalign aligners. We also compared the FreeBayes, GATK HaplotypeCaller and GATK UnifiedGenotyper variant callers.
- An in-depth evaluation of FreeBayes and BAM post-alignment processing. We found that FreeBayes quality was equal to GATK HaplotypeCaller. Additionally, a lightweight post-alignment preparation method using only de-duplication was equivalent to GATK’s recommended Base Quality Score Recalibration (BQSR) and realignment around indels, when using good quality input datasets and callers that do local realignment.
- Additional work to improve variant filtering,
providing methods to remove low complexity regions (LCRs) that can bias indel results.
We also tuned GATK’s Variant Quality Score Recalibrator
(VQSR) and compared it with cutoff-based soft filtering. VQSR requires
a large number of variants and we use it in bcbio with GATK HaplotypeCaller
algorithm parameterscontain high depth samples (
coverage_depthis not low) and you are calling on the whole genome (
coverage_intervalis genome) or have more than 50 regional or exome samples called concurrently.
- An evaluation of joint calling with GATK HaplotypeCaller, FreeBayes, Platypus and samtools. This validates the joint calling implementation, allowing scaling of large population germline experiments. It also demonstrates improved performance of new callers: samtools 1.0 and Platypus.
- Support for build 38 of the human genome, improving precision of detection thanks to the improved genome representation.
- bcbio automates post-variant calling annotation to make the outputs easier to feed directly into your biological analysis. We annotate variant effects using snpEff or Variant Effect Predictor (VEP), and prepare a GEMINI database that associates variants with multiple external annotations in a SQL-based query interface. GEMINI databases have the most associated external information for human samples (GRCh37/hg19 and hg38) but are available for any organism with the database populated using the VCF INFO column and predicted effects.