Getting started¶

Workflow¶

This example calls variants using NA12878 exome data from EdgeBio’s clinical sequencing pipeline, and compares them against reference materials from NIST’s Genome in a Bottle initiative.

1. Install bcbio python package and tools¶

# make sure that the bcbio-installed python comes ahead of the system python on PATH
export PATH=/bcbio_installation/anaconda/bin:/bcbio_installation/tools/bin:$PATH
wget https://raw.github.com/bcbio/bcbio-nextgen/master/scripts/bcbio_nextgen_install.py
python3 bcbio_nextgen_install.py [bcbio_path] --tooldir=[bcbio_path]/tools --nodata --mamba

2. Install hg38 reference genome and bwa indices¶

bcbio_nextgen.py upgrade -u skip --genomes hg38 --aligners bwa

See more detailed instructions in the installation user story.

3. Get the input configuration file, fastq reads, reference materials and analysis regions:¶

mkdir -p NA12878-exome-eval
cd NA12878-exome-eval
wget https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/config/examples/NA12878-exome-methodcmp-getdata.sh
bash NA12878-exome-methodcmp-getdata.sh

4. Run the analysis, distributed on 8 local cores, with:¶

Make sure that the PATH variable contains the paths to the bcbio scripts and tools:

$ which bcbio_nextgen.py
/bcbio_installation/anaconda/bin/bcbio_nextgen.py
$ which mosdepth
/bcbio_installation/tools/bin/mosdepth
$ echo $PATH
/bcbio_installation/anaconda/bin:/bcbio_installation/tools/bin:[other-system-bin-dirs]
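The manual check above can also be scripted so that a long run is not started with the wrong tools on PATH. This is a minimal sketch, not part of bcbio: `check_on_path` is a hypothetical helper name, and `/bcbio_installation` is simply the example prefix used in this guide.

```shell
# Hypothetical helper (not part of bcbio): succeed only if every named
# command resolves to a path under the expected installation prefix.
check_on_path() {
    local prefix=$1
    shift
    local cmd resolved status=0
    for cmd in "$@"; do
        resolved=$(command -v "$cmd" 2>/dev/null) || resolved=""
        case "$resolved" in
            "$prefix"/*) echo "ok: $cmd -> $resolved" ;;
            "") echo "missing: $cmd is not on PATH" >&2; status=1 ;;
            *) echo "wrong location: $cmd -> $resolved" >&2; status=1 ;;
        esac
    done
    return "$status"
}

# Example (prefix is the assumed install location from this guide):
# check_on_path /bcbio_installation bcbio_nextgen.py mosdepth
```

A non-zero return value means at least one command is missing or is being picked up from outside the given prefix.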

Run the project:

cd work
bcbio_nextgen.py ../config/NA12878-exome-methodcmp.yaml -n 8
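The `-n 8` value assumes that 8 cores are available locally. On a smaller machine, one option is to cap the request at the number of online CPUs. The sketch below is only an illustration: `pick_cores` is a hypothetical helper, and it assumes that `getconf _NPROCESSORS_ONLN` is available (falling back to 1 core if it is not).

```shell
# Hypothetical helper: print the smaller of the requested core count and
# the number of CPUs currently online.
pick_cores() {
    local requested=$1 available
    available=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || available=1
    # Fall back to 1 if getconf returned something that is not a number.
    case "$available" in
        ''|*[!0-9]*) available=1 ;;
    esac
    if [ "$available" -lt "$requested" ]; then
        echo "$available"
    else
        echo "$requested"
    fi
}

# Example:
# bcbio_nextgen.py ../config/NA12878-exome-methodcmp.yaml -n "$(pick_cores 8)"
```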

Parameters of the analysis are specified in the yaml configuration file:

upload:
  dir: ../final
details:
  - files: [../input/NA12878-NGv3-LAB1360-A_1.fastq.gz, ../input/NA12878-NGv3-LAB1360-A_2.fastq.gz]
    description: NA12878
    metadata:
      sex: female
    analysis: variant2
    genome_build: hg38
    algorithm:
      aligner: bwa
      variantcaller: gatk-haplotype
      validate: giab-NA12878/truth_small_variants.vcf.gz
      validate_regions: giab-NA12878/truth_regions.bed
      variant_regions: capture_regions/Exome-NGv3

Running time is ~2h.
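Since the run takes a couple of hours, it can be worth confirming beforehand that the FASTQ files listed under `files:` were downloaded completely. The following is a sketch under stated assumptions: `check_inputs` is a hypothetical helper, and it only tests that each file is readable and passes `gzip -t`; it does not validate FASTQ content or any other path in the configuration.

```shell
# Hypothetical pre-flight check: report every listed path that is not a
# readable regular file or that fails the gzip integrity test.
check_inputs() {
    local f bad=0
    for f in "$@"; do
        if [ ! -f "$f" ] || [ ! -r "$f" ]; then
            echo "missing input: $f" >&2
            bad=1
        elif ! gzip -t "$f" 2>/dev/null; then
            echo "corrupt or truncated gzip file: $f" >&2
            bad=1
        fi
    done
    return "$bad"
}

# Example, run from the work directory so that ../input resolves as in the config:
# check_inputs ../input/NA12878-NGv3-LAB1360-A_1.fastq.gz \
#              ../input/NA12878-NGv3-LAB1360-A_2.fastq.gz
```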

5. Explore results in `NA12878-exome-eval/final`:¶

date_project/multiqc - quality control
date_project/NA12878-gatk-haplotype-annotated.vcf.gz - annotated variants
NA12878/NA12878-callable.bed - callable regions
NA12878/NA12878-ready.bam - bam file
date_project/bcbio-nextgen-commands.log - commands run to produce the results
date_project/grading-summary-NA12878.csv - validation results. The False Discovery Rate (FDR) for SNPs here is 3% (i.e. 97% precision for SNPs), so the precision is relatively low. One reason for the low precision could be that the NA12878-NGv3-LAB1360 WES dataset was sequenced in 2013 or earlier, so it could be of somewhat lower quality. We left it here for educational purposes. With a modern NA12878 dataset you can achieve >99% precision and >99% sensitivity using bcbio/gatk; see the germline variants user story. Comparing QC and validations in the two NA12878 WES datasets illustrates how sequencing quality affects variant calling precision and sensitivity. The comparison also shows that the choice of variant_regions directly influences validation results: NA12878-NGv3-LAB1360 has a larger target (133,288 SNPs vs 37,033). Including only regions with high coverage and excluding low-complexity regions increases precision. A larger bed file with more regions included is a more demanding test of the combination of capture kit, sequencing instrument, aligner, variant caller and filters.
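The FDR and precision figures are two views of the same counts: precision = TP / (TP + FP), FDR = 1 - precision, and sensitivity = TP / (TP + FN). The sketch below computes them from counts given as arguments. It is an illustration only: `validation_metrics` is a hypothetical helper, it does not parse grading-summary-NA12878.csv (whose column layout is not described here), and the counts in the example are invented, not results from this run.

```shell
# Hypothetical helper: print precision, FDR and sensitivity (in percent)
# from true-positive, false-positive and false-negative counts.
validation_metrics() {
    awk -v tp="$1" -v fp="$2" -v fneg="$3" 'BEGIN {
        if (tp + fp == 0 || tp + fneg == 0) { print "undefined"; exit 1 }
        precision = tp / (tp + fp)
        printf "precision=%.2f%% fdr=%.2f%% sensitivity=%.2f%%\n",
               100 * precision, 100 * (1 - precision), 100 * tp / (tp + fneg)
    }'
}

# Illustrative counts only, not taken from the NA12878 run:
validation_metrics 970 30 20
# prints: precision=97.00% fdr=3.00% sensitivity=97.98%
```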

What is next?¶

Bcbio documentation is organized by user stories. We support 22 user stories (extended use cases):

14 data processing user stories corresponding to different types of NGS data and biological questions
8 infrastructural stories.

Data processing stories¶

Somatic variants
Bulk RNA-seq expression
Single cell RNA-seq
HLA typing
Germline small variants
3’prime digital gene expression
Structural variants
ChIP/ATAC-seq
Methylation
Bulk RNA-seq variants
Bulk RNA-seq fusion
Fast RNA-seq
Disambiguation
Small RNA-seq

Infrastructural stories¶

Getting started
Installation
Configuration
Parallel execution
Outputs
Development
Cloud
CWL

A typical user story contains:¶

workflow - step-by-step description of how to run the pipeline with example data
parameters - describes yaml config file parameters relevant for the user story
output - describes output files
steps - outlines the low-level steps the pipeline performs to process data
validation - validation results when available
description
references

Try running bcbio with your own data, report issues, and contribute to the codebase and documentation on GitHub.