Presentations

Feel free to reuse any images or text from these talks. The slides are on GitHub.

Abstract

Community Development of Validated Variant Calling Pipelines

Brad Chapman, Rory Kirchner, Oliver Hofmann and Winston Hide Harvard School of Public Health, Bioinformatics Core, Boston, MA, 02115

Translational research relies on accurate identification of genomic variants. However, rapidly changing best practice approaches in alignment and variant calling, coupled with large data sizes, make it a challenge to create reliable and reproducible variant calls. Coordinated community development can help overcome these challenges by sharing testing and updates across multiple groups. We describe bcbio-nextgen, a distributed multi-architecture pipeline that automates variant calling, validation and organization of results for query and visualization. It creates an easily installable, reliable infrastructure from best-practice open source tools with the following goals:

  • Quantifiable: Validates variant calls against known reference materials developed by the Genome in a Bottle consortium. The bcbio.variation toolkit automates scoring and assessment of calls to identify regressions in variant identification as calling pipelines evolve. Incorporation of multiple variant calling approaches from Broad’s GATK best practices and the Marth lab’s gkno software enables informed comparisons between current and future algorithms.

  • Scalable: bcbio-nextgen handles large population studies with hundreds of whole genome samples by parallelizing on a wide variety of schedulers and multicore machines, setting up different ad hoc cluster configurations for each workflow step. Work in progress includes integration with virtual environments, including Amazon Web Services and OpenStack.

  • Accessible: Results automatically feed into tools for query and investigation of variants. The GEMINI framework provides a queryable database associating variants with a wide variety of genome annotations. The o8 web-based tool visualizes the work of variant prioritization and assessment.

  • Community developed: bcbio-nextgen is widely used in multiple sequencing centers and research laboratories. We actively encourage contributors to the code base and make it easy to get started with a fully automated installer and updater that prepares all third party software and reference genomes.