Getting started
Workflow
This example calls variants using NA12878 exome data from EdgeBio’s clinical sequencing pipeline, and compares them against reference materials from NIST’s Genome in a Bottle initiative.
1. Install bcbio python package and tools
# Make sure that the bcbio-installed Python comes ahead of the system Python in PATH
export PATH=/bcbio_installation/anaconda/bin:/bcbio_installation/tools/bin:$PATH
wget https://raw.github.com/bcbio/bcbio-nextgen/master/scripts/bcbio_nextgen_install.py
python3 bcbio_nextgen_install.py [bcbio_path] --tooldir=[bcbio_path]/tools --nodata --mamba
2. Install hg38 reference genome and bwa indices
bcbio_nextgen.py upgrade -u skip --genomes hg38 --aligners bwa
See more detailed instructions in the installation user story.
3. Get the input configuration file, fastq reads, reference materials and analysis regions:
mkdir -p NA12878-exome-eval
cd NA12878-exome-eval
wget https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/config/examples/NA12878-exome-methodcmp-getdata.sh
bash NA12878-exome-methodcmp-getdata.sh
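Before moving on, it can be worth confirming that the download produced the files the later steps refer to. The following is a minimal sketch, not part of bcbio: `missing_files` is a hypothetical helper, and the file list is an assumption inferred from the paths used in the run command and the yaml configuration of this example.

```shell
# Hypothetical helper (not part of bcbio): print each path, relative to
# base directory $1, that does not exist; return non-zero if any is missing.
missing_files() {
    local base="$1" rel status=0
    shift
    for rel in "$@"; do
        if [ ! -e "$base/$rel" ]; then
            echo "missing: $rel"
            status=1
        fi
    done
    return "$status"
}

# File list assumed from the paths used in the run command and the yaml
# configuration of this example. Run from inside NA12878-exome-eval.
missing_files . \
    config/NA12878-exome-methodcmp.yaml \
    input/NA12878-NGv3-LAB1360-A_1.fastq.gz \
    input/NA12878-NGv3-LAB1360-A_2.fastq.gz \
    || echo "download looks incomplete"
```

The check only tests for existence; it does not verify that the fastq files were downloaded completely.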
4. Run the analysis, distributed on 8 local cores
First make sure that the PATH variable contains the paths to the bcbio scripts and tools:
$ which bcbio_nextgen.py
/bcbio_installation/anaconda/bin/bcbio_nextgen.py
$ which mosdepth
/bcbio_installation/tools/bin/mosdepth
$ echo $PATH
/bcbio_installation/anaconda/bin:/bcbio_installation/tools/bin:[other-system-bin-dirs]
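The same check can be scripted so that it fails loudly when a tool resolves to a system copy instead of the bcbio installation. This is a sketch under assumptions: `resolves_under` is a hypothetical helper, and `/bcbio_installation` is the example prefix used in this guide, not a fixed location.

```shell
# Hypothetical helper (not part of bcbio): succeed only if command $2 is
# found on PATH and its resolved location lies under directory $1.
resolves_under() {
    local prefix="$1" tool="$2" found
    found=$(command -v "$tool") || return 1
    case "$found" in
        "$prefix"/*) return 0 ;;
        *) return 1 ;;
    esac
}

# /bcbio_installation is the example prefix used in this guide; replace it
# with your own installation path.
for tool in bcbio_nextgen.py mosdepth; do
    if resolves_under /bcbio_installation "$tool"; then
        echo "ok: $tool"
    else
        echo "not under /bcbio_installation: $tool"
    fi
done
```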
Run the project:
cd work
bcbio_nextgen.py ../config/NA12878-exome-methodcmp.yaml -n 8
Parameters of the analysis are specified in the yaml configuration file:
upload:
  dir: ../final
details:
  - files: [../input/NA12878-NGv3-LAB1360-A_1.fastq.gz, ../input/NA12878-NGv3-LAB1360-A_2.fastq.gz]
    description: NA12878
    metadata:
      sex: female
    analysis: variant2
    genome_build: hg38
    algorithm:
      aligner: bwa
      variantcaller: gatk-haplotype
      validate: giab-NA12878/truth_small_variants.vcf.gz
      validate_regions: giab-NA12878/truth_regions.bed
      variant_regions: capture_regions/Exome-NGv3
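To confirm from the command line which settings a run will use, a single value can be pulled out of the configuration. The sketch below is deliberately naive: `yaml_scalar` is a hypothetical helper that matches one `key: value` line with sed and is not a YAML parser, so it only suits flat, unique keys like the ones in this example.

```shell
# Hypothetical helper: naive lookup of a "key: value" line, NOT a YAML parser.
# Prints the value on the first line whose key (after optional leading
# spaces and "- ") is exactly $1, reading from file $2.
yaml_scalar() {
    sed -n "s/^[[:space:]-]*$1:[[:space:]]*//p" "$2" | head -n 1
}

# Path as used in the run command above, relative to the work directory.
config=../config/NA12878-exome-methodcmp.yaml
if [ -f "$config" ]; then
    echo "genome_build: $(yaml_scalar genome_build "$config")"
    echo "variantcaller: $(yaml_scalar variantcaller "$config")"
fi
```

For anything beyond a quick sanity check, such as nested or repeated keys, a real YAML library is the safer choice.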
Running time is ~2h.
5. Explore results in NA12878-exome-eval/final:
date_project/multiqc - quality control
date_project/NA12878-gatk-haplotype-annotated.vcf.gz - annotated variants
NA12878/NA12878-callable.bed - callable regions
final/NA12878/NA12878-ready.bam - bam file
date_project/bcbio-nextgen-commands.log - commands run to produce results
date_project/grading-summary-NA12878.csv - validation results
The False Discovery Rate (FDR) for SNPs here is 3% (i.e. 97% precision for SNPs), which is low for precision. One possible reason is that the NA12878-NGv3-LAB1360 WES dataset was sequenced in 2013 or earlier, so it may be of somewhat lower quality. We keep it here for educational purposes. With a modern NA12878 dataset you can achieve >99% precision and >99% sensitivity using bcbio/gatk; see the germline variants user story. Comparing QC and validation results for the two NA12878 WES datasets illustrates how sequencing quality affects variant calling precision and sensitivity. The comparison also shows that NA12878-NGv3-LAB1360 has a larger target (133,288 SNPs vs 37,033), so the choice of variant_regions directly influences validation results. Including only high-coverage regions and excluding low-complexity regions increases precision. A larger bed file with more regions is a more demanding test of the combination of capture kit, sequencing instrument, aligner, variant caller, and filters.
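The relationship between FDR, precision and sensitivity quoted above follows from the standard definitions: precision = TP / (TP + FP), FDR = FP / (TP + FP) = 1 - precision, and sensitivity = TP / (TP + FN). A small sketch of the arithmetic follows; the function names are hypothetical and the counts are made up purely to reproduce the 3% figure, they are not taken from the grading summary file.

```shell
# Print 100 * a / (a + b) with two decimals. Fails if both counts are zero.
pct() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", 100 * a / (a + b) }'
}

# Hypothetical helpers; arguments are raw counts.
precision_pct()   { pct "$1" "$2"; }   # args: TP FP -> TP / (TP + FP)
fdr_pct()         { pct "$2" "$1"; }   # args: TP FP -> FP / (TP + FP)
sensitivity_pct() { pct "$1" "$2"; }   # args: TP FN -> TP / (TP + FN)

# Made-up counts chosen only to reproduce the 3% FDR quoted above.
precision_pct 9700 300      # prints 97.00
fdr_pct 9700 300            # prints 3.00
```

Feeding in the true positive, false positive and false negative counts for each variant type gives the percentages to compare between datasets or between different variant_regions files.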
What is next?
Bcbio documentation is organized by user stories. We support 22 user stories (extended use cases):
14 data processing user stories corresponding to different types of NGS data and biological questions
8 infrastructural stories.
Data processing stories
Somatic variants
Bulk RNA-seq expression
Single cell RNA-seq
HLA typing
Germline small variants
3' digital gene expression
Structural variants
ChIP/ATAC-seq
Methylation
Bulk RNA-seq variants
Bulk RNA-seq fusion
Fast RNA-seq
Disambiguation
Small RNA-seq
Infrastructural stories
Getting started
Installation
Configuration
Parallel execution
Outputs
Development
Cloud
CWL
A typical user story contains:
workflow - step-by-step description of how to run the pipeline with example data
parameters - describes yaml config file parameters relevant for the user story
output - describes output files
steps - outlines the low-level steps the pipeline performs to process data
validation - validation results, when available
description
references
Try running bcbio with your own data, report issues, and contribute to the codebase and documentation on GitHub.