Configuration

Two configuration files, in easy to write YAML format, specify details about your system and samples to run:

  • bcbio_system.yaml High level information about the system, including locations of installed programs like GATK and cores and memory usage (see Tuning core and memory usage). These apply across multiple runs. The automated installer creates a ready to go system configuration file that can be manually edited to match the system. Find the file in the galaxy sub-directory within your installation data location (ie. /usr/local/share/bcbio-nextgen/galaxy). By default, the pipeline uses the standard pre-created configuration file but multiple system configurations can be independently maintained and passed as the first argument to bcbio_nextgen.py commands.
  • bcbio_sample.yaml Details about a set of samples to process, including input files and analysis options. You configure these for each set of samples to process.

Commented system and sample example files are available in the config directory. The Example pipelines section contains additional examples of ready to run sample files.

Automated sample configuration

bcbio-nextgen provides a utility to create configuration files for multiple sample inputs using a base template. Start with one of the best-practice templates, or define your own, then apply to multiple samples using the template workflow command:

bcbio_nextgen.py -w template freebayes-variant project1.csv sample1.bam sample2_1.fq sample2_2.fq
  • freebayes-variant is the name of the standard freebayes-variant.yaml input, which the script fetches from GitHub. This argument can also be a path to a locally customized YAML configuration. In both cases, the script replicates the single sample template configuration to all input samples.

  • project1.csv is a comma separated value file containing sample metadata, descriptions and algorithm tweaks:

    samplename,description,batch,phenotype,sex,variant_regions
    sample1,ERR256785,batch1,normal,female,/path/to/regions.bed
    sample2,ERR256786,batch1,tumor,,/path/to/regions.bed
    

    The first column links the metadata to a specific input file. The template command tries to identify the samplename from read group information in a BAM file, or uses the base filename if no read group information is present. For BAM files, this would be the filename without the extension and path (/path/to/yourfile.bam => yourfile). For FASTQ files, the template functionality will identify pairs using standard conventions (_1 and _2, including Illumina extensions like _R1), so use the base filename without these (/path/to/yourfile_R1.fastq => yourfile). Note that paired-end samples sequentially numbered without leading zeros (e.g., sample_1_1.fastq, sample_1_2.fastq, sample_2_1.fastq, sample_2_2.fastq, etc., will likely not be parsed correctly; see #1919 for more info). As well, . characters could be problematics, it’s better to avoid this character and use it only as separation for the extension file.

    The remaining columns can contain:

    • description Changes the sample description, originally supplied by the file name or BAM read group, to this value. You can also set the lane, although this is less often done as the default sequential numbering works here. See the documentation for Sample information on how these map to BAM read groups.
    • Algorithm parameters specific for this sample. If the column name matches an available Algorithm parameters, then this value substitutes into the sample algorithm, replacing the defaults from the template. You can also change other information in the BAM read group through the algorithm parameters. See Alignment configuration documentation for details on how these map to read group information.
    • Sample information metadata key/value pairs. Any columns not falling into the above cases will go into the metadata section. A ped specification will allow bcbio to read family, gender and phenotype information from a PED input file and use it for batch, sex and phenotype, respectively. The PED inputs supplement information from the standard template file, so if you specify a value in the template CSV the PED information will no overwrite it.

    Individual column items can contain booleans (true or false), integers, or lists (separated by semi-colons). These get converted into the expected time in the output YAML file. For instance, to specify a sample that should go into multiple batches:

    samplename,description,phenotype,batch
    normal.bam,two_normal,normal,Batch1;Batch2
    

    For dictionary inputs like Somatic with germline variants setups, you can separate items in a dictionary with colons and double colons, and also use semicolons for lists:

    samplename,description,phenotype,variantcaller
    tumor.bam,sample1,tumor,germline:freebayes;gatk-haplotype::somatic:vardict;freebayes
    

    The name of the metadata file, minus the .csv extension, is a short name identifying the current project. The script creates a project1 directory containing the sample configuration in project1/config/project1.yaml.

  • The remaining arguments are input BAM or FASTQ files. The script pairs FASTQ files (identified by _1 and _2) and extracts sample names from input BAMs, populating the files and description field in the final configuration file. Specify the full path to sample files on your current machine.

To make it easier to define your own project specific template, an optional first step is to download and edit a local template. First retrieve a standard template:

bcbio_nextgen -w template freebayes-variant project1

This pulls the current GATK best practice variant calling template into your project directory in project1/config/project1-template.yaml. Manually edit this file to define your options, then run the full template creation for your samples, pointing to this custom configuration file:

bcbio_nextgen -w template project1/config/project1-template.yaml project1.csv folder/*

If your sample folder contains additional BAM or FASTQ files you do not wish to include in the sample YAML configuration, you can restrict the output to only include samples in the metadata CSV with --only-metadata. The output will print warnings about samples not present in the metadata file, then leave these out of the final output YAML:

bcbio_nextgen -w template --only-metadata project1/config/project1-template.yaml project1.csv folder/*

Multiple files per sample

In case you have multiple FASTQ or BAM files for each sample you can use bcbio_prepare_samples.py. The main parameters are:

  • --out: the folder where the merged files will be

  • --csv: the CSV file that is exactly the same than described previously, but having as many duplicate lines for each samples as files to be merged:

    samplename,description,batch,phenotype,sex,variant_regions
    file1.fastq,sample1,batch1,normal,female,/path/to/regions.bed
    file2.fastq,sample1,batch1,normal,female,/path/to/regions.bed
    file1.fastq,sample2,batch1,tumor,,/path/to/regions.bed
    

An example of usage is:

bcbio_prepare_samples.py --out merged --csv project1.csv

The script will create the sample1.fastq,sample2.fastq in the merged folder, and a new CSV file in the same folder than the input CSV :project1-merged.csv. Later, it can be used for bcbio:

bcbio_nextgen -w template project1/config/project1-template.yaml project1-merged.csv merged/*fastq

The new CSV file will look like:

samplename,description,batch,phenotype,sex,variant_regions
sample1.fastq,sample1,batch1,normal,female,/path/to/regions.bed
sample2.fastq,sample2,batch1,tumor,,/path/to/regions.bed

It supports parallelization the same way bcbio_nextgen.py does:

python $BCBIO_PATH/scripts/utils/bcbio_prepare_samples.py --out merged --csv project1.csv -t ipython -q queue_name -s lsf -n 1

See more examples at parallelize pipeline.

In case of paired reads, the CSV file should contain all files:

samplename,description,batch,phenotype,sex,variant_regions
file1_R1.fastq,sample1,batch1,normal,female,/path/to/regions.bed
file2_R1.fastq,sample1,batch1,normal,female,/path/to/regions.bed
file1_R2.fastq,sample1,batch1,normal,femela,/path/to/regions.bed
file2_R2.fastq,sample1,batch1,normal,female,/path/to/regions.bed

The script will try to guess the paired files the same way than bcbio_nextgen.py -w template does. It would detect paired files if the difference among two files is only _R1/_R2 or -1/-2 or _1/_2 or .1/.2

The output CSV will look like and is compatible with bcbio:

samplename,description,batch,phenotype,sex,variant_regions
sample1,sample1,batch1,normal,female,/path/to/regions.bed

Sample information

The sample configuration file defines details of each sample to process:

details:
  - analysis: variant2
    lane: 1
    description: Example1
    files: [in_pair_1.fq, in_pair_2.fq]
    genome_build: hg19
    algorithm:
      platform: illumina
    metadata:
      batch: Batch1
      sex: female
      platform_unit: flowcell-barcode.lane
      library: library_type
  • analysis Analysis method to use [variant2, RNA-seq, smallRNA-seq]

  • lane A unique number within the project. Corresponds to the ID parameter in the BAM read group.

  • description Unique name for this sample, corresponding to the SM parameter in the BAM read group. Required.

  • files A list of files to process. This currently supports either a single end or two paired-end FASTQ files, or a single BAM file. It does not yet handle merging BAM files or more complicated inputs.

  • genome_build Genome build to align to, which references a genome keyword in Galaxy to find location build files.

  • algorithm Parameters to configure algorithm inputs. Options described in more detail below:

    • platform Sequencing platform used. Corresponds to the PL parameter in BAM read groups. Optional, defaults to illumina.
  • metadata Additional descriptive metadata about the sample:

    • batch defines a group that the sample falls in. We perform multi-sample variant calling on all samples with the same batch name. This can also be a list, allowing specification of a single normal sample to pair with multiple tumor samples in paired cancer variant calling (batch: [MatchWithTumor1, MatchWithTumor2]).
    • sex specifies the sample gender used to correctly prepare X/Y chromosomes. Use male and female or PED style inputs (1=male, 2=female).
    • phenotype stratifies cancer samples into tumor and normal or case/controls into affected and unaffected. Also accepts PED style specifications (1=unaffected, 2=affected). CNVkit uses case/control status to determine how to set background samples for CNV calling.
    • svclass defines a classification for a sample for use in SV case/control setups. When set as control will put samples into the background samples used for normalization.
    • ped provides a PED phenotype file containing sample phenotype and family information. Template creation uses this to supplement batch, sex and phenotype information provided in the template CSV. GEMINI database creation uses the PED file as input.
    • platform_unit – Unique identifier for sample. Optional, defaults to lane if not specified.
    • library – Name of library preparation used. Optional, empty if not present.
    • validate_batch – Specify a batch name to group samples together for preparing validation plots. This is useful if you want to process samples in specific batches, but include multiple batches into the same validation plot.

Upload

The upload section of the sample configuration file describes where to put the final output files of the pipeline. At its simplest, you can configure bcbio-nextgen to upload results to a local directory, for example a folder shared amongst collaborators or a Dropbox account. You can also configure it to upload results automatically to a Galaxy instance, to Amazon S3 or to iRODS. Here is the simplest configuration, uploading to a local directory:

upload:
  dir: /local/filesystem/directory

General parameters, always required:

  • method Upload method to employ. Defaults to local filesystem. [filesystem, galaxy, s3, irods]
  • dir Local filesystem directory to copy to.

Galaxy parameters:

  • galaxy_url URL of the Galaxy instance to upload to. Upload assumes you are able to access a shared directory also present on the Galaxy machine.
  • galaxy_api_key User API key to access Galaxy: see the Galaxy API documentation.
  • galaxy_library Name of the Galaxy Data Library to upload to. You can specify this globally for a project in upload or for individual samples in the sample details section.
  • galaxy_role Specific Galaxy access roles to assign to the uploaded datasets. This is optional and will default to the access of the parent data library if not supplied. You can specify this globally for a project in upload or for individual samples in the sample details section. The Galaxy Admin documentation has more details about roles.

Here is an example configuration for uploading to a Galaxy instance. This assumes you have a shared mounted filesystem that your Galaxy instance can also access:

upload:
  method: galaxy
  dir: /path/to/shared/galaxy/filesystem/folder
  galaxy_url: http://url-to-galaxy-instance
  galaxy_api_key: YOURAPIKEY
  galaxy_library: data_library_to_upload_to

Your Galaxy universe_wsgi.ini configuration needs to have allow_library_path_paste = True set to enable uploads.

S3 parameters:

  • bucket AWS bucket to direct output.
  • folder A folder path within the AWS bucket to prefix the output.
  • region AWS region name to use. Defaults to us-east-1
  • reduced_redundancy Flag to determine if we should store S3 data with reduced redundancy: cheaper but less reliable [false, true]

For S3 access credentials, set the standard environmental variables, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_DEFAULT_REGION or use IAM access roles with an instance profile on EC2 to give your instances permission to create temporary S3 access.

iRODS parameters:

  • folder Full directory name within iRODS to prefix the output.
  • resource (optional) iRODS resource name, if other than default.

example configuration

upload:
method: irods dir: ../final folder: /irodsZone/your/path/ resource: yourResourceName

Uploads to iRODS depend on a valid installation of the iCommands CLI, and a preconfigured connection through the iinit command.

Globals

You can define files used multiple times in the algorithm section of your configuration in a top level globals dictionary. This saves copying and pasting across the configuration and makes it easier to manually adjust the configuration if inputs change:

globals:
  my_custom_locations: /path/to/file.bed
details:
  - description: sample1
    algorithm:
      variant_regions: my_custom_locations
  - description: sample2
    algorithm:
      variant_regions: my_custom_locations

Algorithm parameters

The YAML configuration file provides a number of hooks to customize analysis in the sample configuration file. Place these under the algorithm keyword.

Alignment

  • platform Sequencing platform used. Corresponds to the PL parameter in BAM read groups. Default ‘Illumina’.

  • aligner Aligner to use: [bwa, bowtie, bowtie2, hisat2, novoalign, snap, star, tophat2, false] To use pre-aligned BAM files as inputs to the pipeline, set to false. Using pre-aligned inputs requires proper assignment of BAM read groups and sorting. The bam_clean argument can often resolve issues with problematic input BAMs.

  • bam_clean Clean an input BAM when skipping alignment step. This handles adding read groups, sorting to a reference genome and filtering problem records that cause problems with GATK. Options:

    • fixrg – only adjust read groups, assuming everything else in BAM file is compatible.
    • picard – Picard/GATK based cleaning. Includes read group changes, fixing of problematic reads and re-ordering chromosome order to match the reference genome. To fix misencoded input BAMs with non-standard scores, set quality_format to illumina.
  • bam_sort Allow sorting of input BAMs when skipping alignment step (aligner set to false). Options are coordinate or queryname. For additional processing through standard pipelines requires coordinate sorted inputs. The default is to not do additional sorting and assume pre-sorted BAMs.

  • disambiguate For mixed or explant samples, provide a list of genome_build identifiers to check and remove from alignment. Currently supports cleaning a single organism. For example, with genome_build: hg19 and disambiguate: [mm10], it will align to hg19 and mm10, run disambiguation and continue with reads confidently aligned to hg19. Affects fusion detection when star is chosen as the aligner. Aligner must be set to a non false value for this to run.

  • trim_reads Can be set to trim low quality ends or to also trim off, in conjunction with the adapters field a set of adapter sequences or poly-A tails that could appear on the ends of reads. Only used in RNA-seq pipelines, not variant calling. [False, read_through]. Default to False, recommended to leave as False unless running Tophat2.

  • min_read_length Minimum read length to maintain when read_through trimming set in trim_reads. Defaults to 20.

  • adapters If trimming adapter read through, trim a set of stock adapter sequences. Allows specification of multiple items in a list, for example [truseq, polya] will trim both TruSeq adapter sequences and polyA tails. Valid items are [truseq, illumina, nextera, polya]. In small RNA pipeline, bcbio’ll try to detect the adapter using DNApi. If you set up this parameter, then bcbio’ll use this value instead.

  • custom_trim A list of sequences to trim from the end of reads, for example: [AAAATTTT, GGGGCCCC]

  • align_split_size: Increase parallelization of alignment. As of 0.9.8, bcbio will try to determine a useful parameter and you don’t need to set this. If you manually set it, bcbio will respect your specification. Set to false to avoid splitting entirely. If set, this defines the number of records to feed into each independent parallel step (for example, 5000000 = 5 million reads per chunk). It converts the original inputs into bgzip grabix indexed FASTQ files, and then retrieves chunks for parallel alignment. Following alignment, it combines all chunks back into the final merged alignment file. This allows parallelization at the cost of additional work of preparing inputs and combining split outputs. The tradeoff makes sense when you have large files and lots of distributed compute. When you have fewer large multicore machines this parameter may not help speed up processing.

  • quality_format Quality format of FASTQ or BAM inputs [standard, illumina]

  • strandedness For RNA-seq libraries, if your library is strand specific, set the appropriate flag from [unstranded, firststrand, secondstrand]. Defaults to unstranded. For dUTP marked libraries, firststrand is correct; for Scriptseq prepared libraries, secondstrand is correct.

  • save_diskspace Remove align prepped bgzip and split BAM files after merging into final BAMs. Helps reduce space on limited filesystems during a run. tools_off: [upload_alignment] may also be useful in conjunction with this. [false, true]

Alignment postprocessing

  • mark_duplicates Mark duplicated reads [true, false]. If true, will perform streaming duplicate marking with biobambam’s bammarkduplicates or bamsormadup. Uses samblaster as an alternative if you have paired reads and specifying lumpy as an svcaller.
  • recalibrate Perform GATK’s base quality score recalibration on the aligned BAM file. Defaults to false, no recalibration. [false, gatk]
  • realign Perform GATK’s realignment around indels on the aligned BAM file. Defaults to no realignment since realigning callers like FreeBayes and GATK HaplotypeCaller handle this as part of the calling process. [false, gatk]

Coverage information

  • coverage_interval Regions covered by sequencing. bcbio calculates this automatically from alignment coverage information, so you only need to specify it in the input configuration if you have specific needs or bcbio does not determine coverage correctly. genome specifies full genome sequencing, regional identifies partial-genome pull down sequencing like exome analyses, and amplicon is partial-genome sequencing from PCR amplicon sequencing. This influences GATK options for filtering: we use Variant Quality Score Recalibration when set to genome, otherwise we apply cutoff-based soft filters. Also affects copy number calling with CNVkit, structural variant calling and deep panel calling in cancer samples, where we tune regional/amplicon analyses to maximize sensitivity. [genome, regional, amplicon]
  • coverage_depth_min Minimum depth of coverage. When calculating regions to call in, bcbio may exclude regions with less than this many reads. It is not a hard filter for variant calling, but rather a guideline for determining callable regions. It’s primarily useful when trying to call on very low depth samples. Defaults to 4. Setting lower than 4 will trigger low-depth calling options for GATK.

Analysis regions

These BED files define the regions of the genome to analyze and report on. variant_regions adjusts regions for small variant (SNP and indel) calling. sv_regions defines regions for structural variant calling if different than variant_regions. For coverage-based quality control metrics, we first use coverage if specified, then sv_regions if specified, then variant_regions. See the section on Input file preparation for tips on ensuring chromosome naming in these files match your reference genome. bcbio pre-installs some standard BED files for human analyses. Reference these using the naming schemes described in the reference data repository.

  • variant_regions BED file of regions to call variants in.
  • sv_regions – A specification of regions to target during structural variant calling. By default, bcbio uses regions specified in variant_regions but this allows custom specification for structural variant calling. This can be a pointer to a BED file or special inputs: exons for only exon regions, transcripts for transcript regions (the min start and max end of exons) or transcriptsXXXX for transcripts plus a window of XXXX size around it. The size can be an integer (transcripts1000) or exponential (transcripts1e5). This applies to CNVkit and heterogeneity analysis.
  • coverage A BED file of regions to check for coverage and completeness in QC reporting. This can also be a shorthand for a BED file installed by bcbio (see Structural variant calling for options).

Variant calling

  • variantcaller Variant calling algorithm. Can be a list of multiple options or false to skip [false, freebayes, gatk-haplotype, platypus, mutect, mutect2, scalpel, vardict, varscan, samtools, gatk]

    • Paired (typically somatic, tumor-normal) variant calling is currently supported by vardict, freebayes, mutect2, mutect (see disclaimer below), scalpel (indels only) and varscan. See the pipeline documentation on Cancer variant calling for details on pairing tumor and normal samples.
    • You can generate both somatic and germline calls for paired tumor-normal samples using different sets of callers. The pipeline documentation on calling Somatic with germline variants details how to do this.
    • mutect, a SNP-only caller, can be combined with indels from scalpel or sid. Mutect operates in both tumor-normal and tumor-only modes. In tumor-only mode the indels from scalpel will reflect all indels in the sample, as there is currently no way of separating the germline from somatic indels in tumor-only mode.
  • indelcaller For the MuTect SNP only variant caller it is possible to add

    calls from an indelcaller such as scalpel, pindel and somatic indel detector (for Appistry MuTect users only). Currently an experimental option that adds these indel calls to MuTect’s SNP-only output. Only one caller supported. Omit to ignore. [scalpel, pindel, sid, false]

  • remove_lcr Remove variants in low complexity regions (LCRs) for human variant calling. Heng Li’s variant artifacts paper provides these regions, which cover ~2% of the genome but contribute to a large fraction of problematic calls due to the difficulty of resolving variants in repetitive regions. Removal can help facilitate comparisons between methods and reduce false positives if you don’t need calls in LCRs for your biological analysis. [false, true]

  • jointcaller Joint calling algorithm, combining variants called with the specified variantcaller. Can be a list of multiple options but needs to match with appropriate variantcaller. Joint calling is only needed for larger input sample sizes (>100 samples), otherwise use standard pooled Population calling:

    • gatk-haplotype-joint GATK incremental joint discovery with HaplotypeCaller. Takes individual gVCFs called by gatk-haploype and perform combined genotyping.
    • freebayes-joint Combine freebayes calls using bcbio.variation.recall with recalling at all positions found in each individual sample. Requires freebayes variant calling.
    • platypus-joint Combine platypus calls using bcbio.variation.recall with squaring off at all positions found in each individual sample. Requires platypus variant calling.
    • samtools-joint Combine samtools calls using bcbio.variation.recall with squaring off at all positions found in each individual sample. Requires samtools variant calling.
  • joint_group_size Specify the maximum number of gVCF samples to feed into joint calling. Currently applies to GATK HaplotypeCaller joint calling and defaults to the GATK recommendation of 200. Larger numbers of samples will first get combined prior to genotyping.

  • ploidy Ploidy of called reads. Defaults to 2 (diploid). You can also tweak specialty ploidy like mitochondrial calling by setting ploidy as a dictionary. The defaults are:

    ploidy:
      default: 2
      mitochondrial: 1
      female: 2
      male: 1
    
  • phasing Do post-call haplotype phasing of variants. Defaults to no phasing [false, gatk]

  • background Provide a VCF file with variants to use as a background reference during variant calling. For tumor/normal paired calling use this to supply a panel of normal individuals.

Somatic variant calling

  • min_allele_fraction Minimum allele fraction to detect variants in heterogeneous tumor samples, set as the float or integer percentage to resolve (i.e. 10 = alleles in 10% of the sample). Defaults to 10. Specify this in the tumor sample of a tumor/normal pair.

Variant annotation

  • effects Method used to calculate expected variant effects; defaults to snpEff. Ensembl variant effect predictor (VEP) is also available with support for dbNSFP and `dbscSNV`_ annotation, when downloaded using Customizing data installation. [snpeff, vep, false]
  • effects_transcripts Define the transcripts to use for effect prediction annotation. Options all: Standard Ensembl transcript list (the default); canonical: Report single canonical transcripts (-canon in snpEff, -pick in VEP); canonical_cancer Canonical transcripts with hand curated changes for more common cancer transcripts (effects snpEff only).
  • vcfanno Configuration files for vcfanno, allowing use of the new vcfanno/vcf2db approach for creating GEMINI databases. The default is [gemini] for all organisms except GRCh37/hg19, which defaults to the older GEMINI loading approach. bcbio installs pre-prepared configuration files in genomes/build/config/vcfanno or you can specify the full path to a /path/your/anns.conf and optionally an equivalently named /path/your/anns.lua file. This value can be a list so you can supplement the existing annotation file with: [gemini, /path/your/anns.conf]. or replace it by only specifying your file. You can run only vcfanno without GEMINI database creation by setting tools_off: [gemini] and explicitly setting vcfanno: [gemini] (or any other configurations you want).

Structural variant calling

  • svcaller – List of structural variant callers to use. [lumpy, manta, cnvkit, seq2c, delly, battenberg]. LUMPY and Manta require paired end reads.

  • svprioritize – Produce a tab separated summary file of structural variants in regions of interest. This complements the full VCF files of structural variant calls to highlight changes in known genes. This can be either the path to a BED file (with chrom start end gene_name, see Input file preparation) or the name of one of the pre-installed prioritization files:

  • fusion_mode Enable fusion detection in RNA-seq when using STAR (recommended) or Tophat (not recommended) as the aligner. OncoFuse is used to summarise the fusions but currently only supports hg19 and GRCh37. For explant samples disambiguate enables disambiguation of STAR output [false, true].

HLA typing

  • hlacaller – Perform identification of highly polymorphic HLAs with human build 38 (hg38). The recommended option is optitype, using the OptiType caller. Also supports using the bwa HLA typing implementation with bwakit

Validation

bcbio pre-installs standard truth sets for performing validation, and also allows use of custom local files for assessing reliability of your runs:

  • validate A VCF file of expected variant calls to perform validation and grading of small variants (SNPs and indels) from the pipeline. This provides a mechanism to ensure consistency of calls against a known set of variants, supporting comparisons to genotyping array data or reference materials.
  • validate_regions A BED file of regions to evaluate small variant calls in. This defines specific regions covered by the validate VCF file.
  • svvalidate – Dictionary of call types and pointer to BED file of known regions. For example: DEL: known_deletions.bed does deletion based validation of outputs against the BED file.

Each option can be either the path to a local file, or a partial path to a file in the pre-installed truth sets. For instance, to validate an NA12878 run against the Genome in a Bottle truth set:

validate: giab-NA12878/truth_small_variants.vcf.gz
validate_regions: giab-NA12878/truth_regions.bed
svvalidate:
  DEL: giab-NA12878/truth_DEL.bed

follow the same naming schemes for small variants, regions and different structural variant types. bcbio has the following validation materials for germline validations:

  • giab-NA12878Genome in a Bottle for NA12878. Truth sets: small_variants, regions, DEL; Builds: GRCh37, hg19, hg38
  • giab-NA12878-crossmapGenome in a Bottle for NA12878 converted to hg38 with CrossMap. Truth sets: small_variants, regions, DEL; Builds: hg38
  • giab-NA12878-remapGenome in a Bottle for NA12878 converted to hg38 with Remap. Truth sets: small_variants, regions, DEL; Builds: hg38
  • platinum-genome-NA12878Illumina Platinum Genome for NA12878. Truth sets: small_variants, regions; Builds: hg19, hg38

and for cancer validations:

For more information on the hg38 truth set preparation see the work on validation on build 38 and conversion of human build 37 truth sets to build 38. The installation recipes contain provenance details about the origins of the installed files.

UMIs

Unique molecular identifiers (UMIs) are short random barcodes used to tag individual molecules and avoid amplification biased. Both single cell RNA-seq and variant calling support UMIs. For variant calling, fgbio collapses sequencing duplicates for each UMI into a single consensus read prior to running re-alignment and variant calling. This requires mark_duplicates: true (the default) since it uses position based duplicates and UMI tags for collapsing duplicate reads into consensus sequences.

To help with preparing fastq files with UMIs bcbio provides a script bcbio_fastq_umi_prep.py which converts reads output by an Illumina as 3 files (read 1, read 2, and UMIs) into paired reads with UMIs in the fastq names. This can run on a single set of files or autopair an entire directory of fastq files:

bcbio_fastq_umi_prep.py autopair -c <cores_to_use> <list> <of> <fastq> <files>

Configuration options for UMIs:

  • umi_type The UMI/cellular barcode scheme used for your data. For single cell RNA sequencing, supports [harvard-indrop, harvard-indrop-v2, cel-seq]. For variant analysis with UMI based consensus calling, supports either fastq_name with UMIs in read names or the path to a fastq file with UMIs for each aligned read.

RNA sequencing

  • transcript_assembler If set, will assemble novel genes and transcripts and merge the results into the known annotation. Can have multiple values set in a list. Supports [‘cufflinks’, ‘sailfish’].
  • transcriptome_align If set to True, will also align reads to just the transcriptome, for use with EBSeq and others.
  • expression_caller A list of optional expression callers to turn on. Supports [‘cufflinks’, ‘express’, ‘stringtie’, ‘sailfish’, ‘dexseq’]. Salmon and count based expression estimation are run by default.
  • variantcaller Variant calling algorithm to call variants on RNA-seq data. Supports [gatk] or [vardict].
  • spikein_fasta A FASTA file of spike in sequences to quantitate.

Fast RNA-seq

  • transcriptome_fasta An optional FASTA file of transcriptome sequences to quantitate rather than using bcbio installed transcriptome sequences.

Single-cell RNA sequencing

  • minimum_barcode_depth Cellular barcodes with less than this many reads assigned to them are discarded (default 100,000).
  • cellular_barcodes A single file or a list of one or two files which have valid cellular barcodes. Provide one file if there is only one barcode and two files if the barcodes are split. If no file is provided, all cellular barcodes passing the minimum_barcode_depth filter are kept.
  • transcriptome_fasta An optional FASTA file of transcriptome sequences to quantitate rather than the bcbio installed version.
  • singlecell_quantifier Quantifier to use for single-cell RNA-sequencing. Non-academic users without a kallisto license should choose rapmap. Supports rapmap or kallisto.
  • cellular_barcode_correction Number of errors to correct in identified cellular barcodes. Requires a set of known barcodes to be passed with the cellular_barcodes option. Defaults to 1. Set to 0 to turn off error correction.
  • sample_barcodes A text file with one barcode per line of expected sample barcodes.

smallRNA sequencing

  • adapter The 3’ end adapter that needs to be remove. For NextFlex protocol you can add resources: cutadapt:options:["-u 4", "-u -4"].
  • species 3 letters code to indicate the species in mirbase classification (i.e. hsa for human).
  • aligner Currently STAR is the only one tested although bowtie can be used as well.
  • expression_caller A list of expression callers to turn on: trna, seqcluster, mirdeep2
  • spikein_fasta A FASTA file of spike in sequences to quantitate.

ChIP sequencing

  • peakcaller bcbio only accepts [macs2]
  • aligner Currently bowtie2 is the only one tested
  • The ```` and batch tags need to be set under metadata in the config YAML file. The phenotype tag will specify the chip (phenotype: chip) and input samples (phenotype: input). The batch tag will specify the input-chip pairs of samples for example, batch: pair1. Same input can be used for different chip samples giving a list of distinct values: batch: [sample1, sample2].
  • chip_method: currently supporting standard CHIP-seq (TF or broad regions using chip) or ATAC-seq (atac). Paramters will change depending on the option to get the best possible results. Only macs2 supported for now.

You can pass different parameters for macs2 adding to Resources:

resources:
  macs2:
    options: ["--broad"]

Quality control

  • mixup_check Detect potential sample mixups. Currently supports qSignature. qsignature_full runs a larger analysis while qsignature runs a smaller subset on chromosome 22. [False, qsignature, qsignature_full]
  • kraken Turn on kraken algorithm to detect possible contamination. You can add kraken: minikraken and it will use a minimal database to detect possible contaminants. As well, you can point to a custom database directory and kraken will use it. You will find the results in the qc directory. This tool only run during rnaseq pipeline.

Post-processing

  • archive Specify targets for long term archival. cram does 8-bin compression of BAM files into CRAM format. Default: [] – no archiving.

Tweaking defaults

bcbio provides some hints to change default behavior be either turning specific defaults on or off, with tools_on and tools_off. Both can be lists with multiple options:

  • tools_off Specify third party tools to skip as part of analysis pipeline. Enables turning off specific components of pipelines if not needed. gemini avoids creation of a GEMINI database of variants for downstream query during variant calling pipelines. Also skips vcfanno annotation unless turned on explicitly with vcfanno in Variant annotation. vardict_somatic_filter disables running a post calling filter for VarDict to remove variants found in normal samples. Without vardict_somatic_filter in paired analyses no soft filtering of germline variants is performed but all high quality variants pass. bwa-mem forces use of original bwa aln alignment. Without this, we use bwa mem with 70bp or longer reads. fastqc turns off quality control FastQC usage. pbgzip turns off use of bgzip with multiple threads. seqcluster turns off use of seqcluster tool in srnaseq pipeline. tumoronly-prioritization turns off attempted removal of germline variants from tumor only calls using external population data sources like ExAC and 1000 genomes. vqsr turns off variant quality score recalibration for all samples. upload_alignment turns off final upload of large alignment files.
  • tools_on Specify functionality to enable that is off by default. svplots adds additional coverage and summary plots for CNVkit and detected ensemble variants. qualimap runs Qualimap (qualimap uses downsampled files and numbers here are an estimation of 1e7 reads.). qualimap_full uses the full bam files but it may be slow. bwa-mem forces use of bwa mem even for samples with less than 70bp reads. bnd-genotype enables genotyping of breakends in Lumpy calls, which improves accuracy but can be slow. gvcf forces gVCF output for callers that support it (GATK HaplotypeCaller, FreeBayes, Platypus). vqsr makes GATK try quality score recalibration for variant filtration, even for smaller sample sizes. vep_splicesite_annotations enables the use of the MaxEntScan and GeneSplicer plugin for VEP. Both optional plugins add extra splice site annotations. gemini_allvariants enables all variants to go into GEMINI, not only those that pass filters. vcf2db_expand decompresses and expands the genotype columns in the vcfanno prepared GEMINI databases, enabling standard SQL queries on genotypes and depths. damage_filter annotates low frequency somatic calls in INFO/DKFZBias for DNA damage artifacts using DKFZBiasFilter. lumpy_usecnv uses input calls from CNVkit as prior evidence to Lumpy calling.

parallelization

  • nomap_split_size Unmapped base pair regions required to split analysis into blocks. Creates islands of mapped reads surrounded by unmapped (or N) regions, allowing each mapped region to run in parallel. (default: 250)
  • nomap_split_targets Number of target intervals to attempt to split processing into. This picks unmapped regions evenly spaced across the genome to process concurrently. Limiting targets prevents a large number of small targets. (default: 200)

Ensemble variant calling

In addition to single method variant calling, we support calling with multiple calling methods and consolidating into a final Ensemble callset.

The recommended method to do this uses a simple majority rule ensemble classifier that builds a final callset based on the intersection of calls. It selects variants represented in at least a specified number of callers:

variantcaller: [mutect2, varscan, freebayes, vardict]
ensemble:
  numpass: 2
  use_filtered: false

This example selects variants present in 2 out of the 4 callers and does not use filtered calls (the default behavior). bcbio.variation.recall implements this approach, which handles speed and file sorting limitations in the bcbio.variation approach.

This older approach uses the bcbio.variation toolkit to perform the consolidation. An example configuration in the algorithm section is:

variantcaller: [gatk, freebayes, samtools, gatk-haplotype, varscan]
ensemble:
  format-filters: [DP < 4]
  classifier-params:
    type: svm
  classifiers:
    balance: [AD, FS, Entropy]
    calling: [ReadPosEndDist, PL, PLratio, Entropy, NBQ]
  trusted-pct: 0.65

The ensemble set of parameters configure how to combine calls from the multiple methods:

  • format-filters A set of filters to apply to variants before combining. The example removes all calls with a depth of less than 4.
  • classifier-params Parameters to configure the machine learning approaches used to consolidate calls. The example defines an SVM classifier.
  • classifiers Groups of classifiers to use for training and evaluating during machine learning. The example defines two set of criteria for distinguishing reads with allele balance issues and those with low calling support.
  • trusted-pct Define threshold of variants to include in final callset. In the example, variants called by more than 65% of the approaches (4 or more callers) pass without being requiring SVM filtering.

Resources

The resources section allows customization of locations of programs and memory and compute resources to devote to them:

resources:
  bwa:
    cores: 12
    cmd: /an/alternative/path/to/bwa
  samtools:
    cores: 16
    memory: 2G
  gatk:
    jvm_opts: ["-Xms2g", "-Xmx4g"]
    dir: /usr/share/java/gatk
  • cmd Location of an executable. By default, we assume executables are on the path.

  • dir For software not distributed as a single executable, like files of Java jars, the location of the base directory.

  • cores Cores to use for multi-proccessor enabled software. This is how many cores will be allocated per job. For example if you are running 10 samples and passed -n 40 to bcbio-nextgen and the step you are running has cores: 8 set, a maximum of five samples will run in parallel, each using 8 cores.

  • jvm_opts Specific memory usage options for Java software. For memory usage on programs like GATK, specify the maximum usage per core. On multicore machines, that’s machine-memory divided by cores. This avoids memory errors when running multiple jobs simultaneously, while the framework will adjust memory up when running multicore jobs.

  • memory Specify the memory per core used by a process. For programs where memory control is available, like samtools sort, this limits memory usage. For other programs this is an estimate of usage, used by Memory management to avoid over-scheduling memory. Always specify this as the memory usage for a single core, and the pipeline handles scaling this when a process uses multiple cores.

  • keyfile Specify the location of a program specific key file or license server, obtained from a third party software tool. Supports licenses for novoalign and Sentieon. For more complex Sentieon setups this can also be a dictionary of environmental variables:

    resources:
      sentieon:
        keyfile:
          SENTIEON_LICENSE_SERVER: 100.100.100.100:8888
          SENTIEON_AUTH_MECH: XXX
          SENTIEON_AUTH_DATA: signature
    

For GATK you can individually control memory for variant calling (which uses the gatk memory target) and for framework usage like merging and variant file preparation (which can optionally use the the gatk-framework target). If you only set gatk, that specification gets used for framework calls as well.

Temporary directory

You also use the resource section to specify system specific parameters like global temporary directories:

resources:
  tmp:
    dir: /scratch

This is useful on cluster systems with large attached local storage, where you can avoid some shared filesystem IO by writing temporary files to the local disk. When setting this keep in mind that the global temporary disk must have enough space to handle intermediates. The space differs between steps but generally you’d need to have 2 times the largest input file per sample and account for samples running simultaneously on multiple core machines.

To handle clusters that specify local scratch space with an environmental variable, bcbio will resolve environmental variables like:

resources:
  tmp:
    dir: $YOUR_SCRATCH_LOCATION

Sample or run specific resources

To override any of the global resource settings in a sample specific manner, you write a resource section within your sample YAML configuration. For example, to create a sample specific temporary directory and pass a command line option to novoalign, write a sample resource specification like:

- description: Example
  analysis: variant2
  resources:
    novoalign:
      options: ["-o", "FullNW", "--rOQ"]
    tmp:
      dir: tmp/sampletmpdir

To adjust resources for an entire run, you can add this resources specification at the top level of your sample YAML:

details:
  - description: Example
resources:
  default:
    cores: 16

Logging directory

By default, bcbio writes the Logging directory to log in the main directory of the run. You can configure this to a different location in your bcbio-system.yaml with:

log_dir: /path/to/logs

Input file preparation

Input files for supplementing analysis, like variant_regions need to match the specified reference genome. A common cause of confusion is the two chromosome naming schemes for human genome build 37: UCSC-style in hg19 (chr1, chr2) and Ensembl/NCBI style in GRCh37 (1, 2). To help avoid some of this confusion, in build 38 we only support the commonly agreed on chr1, chr2 style.

It’s important to ensure that the chromosome naming in your input files match those in the reference genome selected. bcbio will try to detect this and provide helpful errors if you miss it.

To convert chromosome names, you can use Devon Ryan’s collection of chromosome mappings as an input to sed. For instance, to convert hg19 chr-style coordinates to GRCh37:

wget --no-check-certificate -qO- http://raw.githubusercontent.com/dpryan79/ChromosomeMappings/master/GRCh37_UCSC2ensembl.txt \
   | awk '{if($1!=$2) print "s/^"$1"/"$2"/g"}' > remap.sed
sed -f remap.sed original.bed > final.bed

Genome configuration files

Each genome build has an associated buildname-resources.yaml configuration file which contains organism specific naming and resource files. bcbio-nextgen expects a resource file present next to the genome FASTA file. Example genome configuration files are available, and automatically installed for natively supported genomes. Create these by hand to support additional organisms or builds.

The major sections of the file are:

  • aliases – Names for third-party programs used as part of the analysis, since naming expectations can differ between software programs.
  • variation – Supporting data files for variant analysis. For human analyses, the dbSNP and training files are from the GATK resource bundle. These are inputs into the training models for recalibration. The automated CloudBioLinux data scripts will download and install these in the variation subdirectory relative to the genome files.
  • rnaseq – Supporting data files for RNA-seq analysis. The automated installer and updater handles retrieval and installation of these resources for supported genome builds.
  • srnaseq – Supporting data files for smallRNA-seq analysis. Same as in rnaseq, the automated installer and updater handle this for supported genome builds.

By default, we place the buildname-resources.yaml files next to the genome FASTA files in the reference directory. For custom setups, you specify an alternative directory in the ref:config-resources section of your bcbio_system.yaml file:

resources:
  genome:
    dir: /path/to/resources/files

Reference genome files

The pipeline requires access to reference genomes, including the raw FASTA sequence and pre-built indexes for aligners. The automated installer will install reference files and indexes for commonly used genomes (see the Upgrade documentation for command line options). For human, GRCh37 and hg19, we use the 1000 genome references provided in the GATK resource bundle.

You can use pre-existing data and reference indexes by pointing bcbio-nextgen at these resources. We use the Galaxy .loc files approach to describing the location of the sequence and index data, as described in Data requirements. This does not require a Galaxy installation since the installer sets up a minimal set of .loc files. It finds these by identifying the root galaxy directory, in which it expects a tool-data sub-directory with the .loc files. It can do this in two ways:

  • Using the directory of your bcbio-system.yaml. This is the default mechanism setup by the automated installer and requires no additional work.
  • From the path specified by the galaxy_config option in your bcbio-system.yaml. If you’d like to move your system YAML file, add the full path to your galaxy directory here. This is useful if you have a pre-existing Galaxy installation with reference data.

To manually make genomes available to bcbio-nextgen, edit the individual .loc files with locations to your reference and index genomes. You need to edit sam_fa_indices.loc to point at the FASTA files and then any genome indexes corresponding to aligners you’d like to use (for example: bwa_index.loc for bwa and bowtie2_indices.loc for bowtie2). The database key names used (like GRCh37 and mm10) should match those used in the genome_build of your sample input configuration file.

Adding custom genomes

bcbio_setup_genome.py will help you to install a custom genome and apply all changes needed to the configuration files. It needs the genome in FASTA format, and the annotation file in GTF or GFF3 format. It can create index for all aligners used by bcbio. Moreover, it will create the folder rnaseq to allow you run the RNAseq pipeline without further configuration.

bcbio_setup_genome.py -f genome.fa -g annotation.gtf -i bowtie2 star seq -n Celegans -b WBcel135

If you want to add smallRNA-seq data files, you will need to add the 3 letters code of mirbase for your genome (i.e hsa for human) and the GTF file for the annotation of smallRNA data. Here you can use the same file than the transcriptome if no other available.

bcbio_setup_genome.py -f genome.fa -g annotation.gtf -i bowtie2 star seq -n Celegans -b WBcel135 --species cel --srna_gtf another_annotation.gtf

To use that genome just need to configure your YAML files as:

genome_build: WBcel135