Configuration
Project structure
bcbio encourages a project structure:
project/
├── config
├── final
├── input
└── work
with the project.yaml configuration in the config directory, the input files (fastq, bam, bed) in the input directory, the outputs of the pipeline in the final directory, and the actual processing done in the work directory.
A typical bcbio run:
1. Copy or link input files into the input directory.
2. Set pipeline parameters in config/project.yaml.
3. Run the bcbio_nextgen.py script from inside the work directory.
4. Review the results in final.
5. Delete work with its intermediate files.
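A minimal sketch of this workflow on the command line (the project name, input files and core count are hypothetical):
mkdir -p project1/config project1/input project1/work
ln -s /path/to/sample_R1.fastq /path/to/sample_R2.fastq project1/input/   # or copy instead of link
# edit project1/config/project1.yaml to describe the samples, then:
cd project1/work
bcbio_nextgen.py ../config/project1.yaml -n 8   # -n: total cores to use
# review project1/final/, then delete project1/work/ to reclaim space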
System and sample configuration files
Two configuration files, in easy-to-write YAML format, specify details about your system and samples to run:
bcbio_sample.yaml
Details about a set of samples to process, including input files and analysis options. You configure these for each set of samples to process. This will be the main file prepared for each sample run, and the documentation below details techniques to help prepare them.
bcbio_system.yaml
High level information about the system, including locations of installed programs like GATK, and cores and memory usage (see Tuning core and memory usage). These apply across multiple runs. The automated installer creates a ready-to-go system configuration file that can be manually edited to match the system. Find the file in the galaxy sub-directory within your installation data location (i.e. /usr/local/share/bcbio-nextgen/galaxy). To modify system parameters for a specific run, supply sample or run specific resources in your bcbio_sample.yaml file.
Commented system and sample example files are available in the config directory.
Sample configuration
Sample information
The sample configuration file defines details of each sample to process:
details:
  - analysis: variant2
    lane: 1
    description: Example1
    files: [in_pair_1.fq, in_pair_2.fq]
    genome_build: hg19
    algorithm:
      platform: illumina
    metadata:
      batch: Batch1
      sex: female
      platform_unit: flowcell-barcode.lane
      library: library_type
analysis
Analysis method to use [variant2, RNA-seq, smallRNA-seq].
lane
A unique number within the project. Corresponds to the ID parameter in the BAM read group.
description
Unique name for this sample, corresponding to the SM parameter in the BAM read group. Required.
files
A list of files to process. This currently supports either a single end or two paired-end FASTQ files, or a single BAM file. It does not yet handle merging BAM files or more complicated inputs.
genome_build
Genome build to align to, which references a genome keyword in Galaxy to find location build files.
algorithm
Parameters to configure algorithm inputs. Options described in more detail below:
platform
Sequencing platform used. Corresponds to the PL parameter in BAM read groups. Optional, defaults to illumina.
metadata
Additional descriptive metadata about the sample:
batch
Defines a group that the sample falls in. We perform multi-sample variant calling on all samples with the same batch name. This can also be a list, allowing specification of a single normal sample to pair with multiple tumor samples in paired cancer variant calling (batch: [MatchWithTumor1, MatchWithTumor2]).
sex
Specifies the sample sex used to correctly prepare X/Y chromosomes. Use male and female, or PED style inputs (1=male, 2=female).
phenotype
Stratifies cancer samples into tumor and normal, or case/controls into affected and unaffected. Also accepts PED style specifications (1=unaffected, 2=affected). CNVkit uses case/control status to determine how to set background samples for CNV calling.
disease
Identifies a specific disease name for the sample. Used along with svprioritize to help identify gene regions for reporting during analysis with heterogeneity callers like PureCN and TitanCNA. This is primarily for cancer studies and you can narrow genes by disease using inputs like lung, breast or pancreatic for different cancer types.
prep_method
A free text description of the method used in sample prep. Used to group samples together for background during CNV calling. This is not required; when not present, bcbio assumes all samples in an analysis use the same method.
svclass
Defines a classification for a sample for use in SV case/control setups. When set as control, the sample goes into the background set used for normalization.
ped
Provides a PED phenotype file containing sample phenotype and family information. Template creation uses this to supplement batch, sex and phenotype information provided in the template CSV. GEMINI database creation uses the PED file as input.
platform_unit
Unique identifier for the sample. Optional, defaults to lane if not specified.
library
Name of the library preparation used. Optional, empty if not present.
validate_batch
Specify a batch name to group samples together for preparing validation plots. This is useful if you want to process samples in specific batches, but include multiple batches in the same validation plot.
validate_combine
Specify a batch name to combine multiple samples into an additional validation summary. Useful for evaluating larger numbers of small samples together.
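Putting several of these metadata fields together, a tumor/normal pair sharing a batch might be sketched as follows (sample names and file paths are illustrative only):
details:
  - description: Patient1_Tumor
    files: [/path/to/tumor_R1.fq.gz, /path/to/tumor_R2.fq.gz]
    analysis: variant2
    genome_build: hg19
    metadata:
      batch: Patient1
      phenotype: tumor
      sex: male
  - description: Patient1_Normal
    files: [/path/to/normal_R1.fq.gz, /path/to/normal_R2.fq.gz]
    analysis: variant2
    genome_build: hg19
    metadata:
      batch: Patient1
      phenotype: normal
      sex: male
Since both samples share batch: Patient1, bcbio pairs them for somatic calling, using the phenotype values to distinguish tumor from normal.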
Automated sample configuration
bcbio-nextgen provides a utility to create configuration files for multiple sample inputs using a base template. Start with one of the best-practice templates, or define your own, then apply it to multiple samples using the template workflow command:
bcbio_nextgen.py -w template freebayes-variant project1.csv sample1.bam sample2_1.fq sample2_2.fq
freebayes-variant is the name of the standard freebayes-variant.yaml input, which the script fetches from GitHub. This argument can also be a path to a locally customized YAML configuration. In both cases, the script replicates the single sample template configuration to all input samples.
project1.csv is a comma separated value file containing sample metadata, descriptions and algorithm tweaks:
samplename,description,batch,phenotype,sex,variant_regions
sample1,ERR256785,batch1,normal,female,/path/to/regions.bed
sample2,ERR256786,batch1,tumor,,/path/to/regions.bed
The first column links the metadata to a specific input file. The template command tries to identify the samplename from read group information in a BAM file, or uses the base filename if no read group information is present. For BAM files, this would be the filename without the extension and path (/path/to/yourfile.bam => yourfile). For FASTQ files, the template functionality will identify pairs using standard conventions (_1 and _2, including Illumina extensions like _R1), so use the base filename without these (/path/to/yourfile_R1.fastq => yourfile). Note that paired-end samples sequentially numbered without leading zeros (e.g. sample_1_1.fastq, sample_1_2.fastq, sample_2_1.fastq, sample_2_2.fastq, etc.) will likely not be parsed correctly; see issue #1919 for more info. In addition, "." characters can be problematic, so it's better to avoid this character and use it only as the separator for the file extension.
For Common Workflow Language (CWL) inputs, the first samplename column should contain the base filename. For BAM files, this is your_file.bam. For fastqs this is your_file_R1.fastq.gz;your_file_R2.fastq.gz, separating individual files with a semicolon. By putting paths to the actual locations of the inputs in your bcbio_system.yaml input when generating CWL, you can easily move projects between different filesystems.
The remaining columns can contain:
description
Changes the sample description, originally supplied by the file name or BAM read group, to this value. You can also set the lane, although this is less often done as the default sequential numbering works here.
Algorithm parameters specific to this sample. If the column name matches an available algorithm parameter, then this value substitutes into the sample algorithm, replacing the defaults from the template. You can also change other information in the BAM read group through the algorithm parameters. See the alignment configuration documentation for details on how these map to read group information.
Metadata key/value pairs. Any columns not falling into the above cases will go into the metadata section. A ped specification will allow bcbio to read family, gender and phenotype information from a PED input file and use it for batch, sex and phenotype, respectively. The PED inputs supplement information from the standard template file, so if you specify a value in the template CSV the PED information will not overwrite it. Alternatively, ped fields can be specified directly in the metadata as columns. If family_id is specified it will be used as the family_id for that sample, otherwise batch will be used. The description column is used as the individual_id column and the phenotype column will be used as the affected column in the PED format:
samplename,description,phenotype,batch,sex,ethnicity,maternal_id,paternal_id,family_id
NA12878.bam,NA12878,-9,CEPH,female,-9,NA12892,NA12891,NA12878FAM
Individual column items can contain booleans (true or false), integers, or lists (separated by semi-colons). These get converted into the expected type in the output YAML file. For instance, to specify a sample that should go into multiple batches:
samplename,description,phenotype,batch
normal.bam,two_normal,normal,Batch1;Batch2
For dictionary inputs like somatic with germline variants setups, you can separate items in a dictionary with colons and double colons, and also use semicolons for lists:
samplename,description,phenotype,variantcaller
tumor.bam,sample1,tumor,germline:freebayes;gatk-haplotype::somatic:vardict;freebayes
The name of the metadata file, minus the .csv extension, is a short name identifying the current project. The script creates a project1 directory containing the sample configuration in project1/config/project1.yaml.
The remaining arguments are input BAM or FASTQ files. The script pairs FASTQ files (identified by _1 and _2) and extracts sample names from input BAMs, populating the files and description fields in the final configuration file. Specify the full path to sample files on your current machine.
To make it easier to define your own project specific template, an optional first step is to download and edit a local template. First retrieve a standard template:
bcbio_nextgen.py -w template freebayes-variant project1
This pulls the current GATK best practice variant calling template into your project directory in project1/config/project1-template.yaml. Manually edit this file to define your options, then run the full template creation for your samples, pointing to this custom configuration file:
bcbio_nextgen.py -w template project1/config/project1-template.yaml project1.csv folder/*
If your sample folder contains additional BAM or FASTQ files you do not wish to include in the sample YAML configuration, you can restrict the output to only include samples in the metadata CSV with --only-metadata. The output will print warnings about samples not present in the metadata file, then leave these out of the final output YAML:
bcbio_nextgen.py -w template --only-metadata project1/config/project1-template.yaml project1.csv folder/*
Multiple files per sample
If you have multiple FASTQ or BAM files for each sample you can use bcbio_prepare_samples.py. The main parameters are:
--out: the folder where the merged files will be written
--csv: a CSV file in the same format as described previously, but with as many duplicated lines for each sample as there are files to merge:
samplename,description,batch,phenotype,sex,variant_regions
file1.fastq,sample1,batch1,normal,female,/path/to/regions.bed
file2.fastq,sample1,batch1,normal,female,/path/to/regions.bed
file1.fastq,sample2,batch1,tumor,,/path/to/regions.bed
An example of usage is:
bcbio_prepare_samples.py --out merged --csv project1.csv
The script will create sample1.fastq and sample2.fastq in the merged folder, and a new CSV file in the same folder as the input CSV: project1-merged.csv. This can then be used for bcbio:
bcbio_nextgen.py -w template project1/config/project1-template.yaml project1-merged.csv merged/*fastq
The new CSV file will look like:
samplename,description,batch,phenotype,sex,variant_regions
sample1.fastq,sample1,batch1,normal,female,/path/to/regions.bed
sample2.fastq,sample2,batch1,tumor,,/path/to/regions.bed
It supports parallelization the same way bcbio_nextgen.py does:
python $BCBIO_PATH/scripts/utils/bcbio_prepare_samples.py --out merged --csv project1.csv -t ipython -q queue_name -s lsf -n 1
See more examples at parallelize pipeline.
In the case of paired reads, the CSV file should contain all files:
samplename,description,batch,phenotype,sex,variant_regions
file1_R1.fastq,sample1,batch1,normal,female,/path/to/regions.bed
file2_R1.fastq,sample1,batch1,normal,female,/path/to/regions.bed
file1_R2.fastq,sample1,batch1,normal,female,/path/to/regions.bed
file2_R2.fastq,sample1,batch1,normal,female,/path/to/regions.bed
The script will try to guess the paired files the same way that bcbio_nextgen.py -w template does: it detects paired files when the only difference between two filenames is _R1/_R2, -1/-2, _1/_2 or .1/.2.
The output CSV will look like the following and is compatible with bcbio:
samplename,description,batch,phenotype,sex,variant_regions
sample1,sample1,batch1,normal,female,/path/to/regions.bed
Algorithm parameters
Alignment
platform
Sequencing platform used. Corresponds to the PL parameter in BAM read groups. Default: illumina.
aligner
Aligner to use: [bwa, bowtie, bowtie2, hisat2, minimap2, novoalign, snap, star, tophat2, false]. To use pre-aligned BAM files as inputs to the pipeline, set to false, which will also skip duplicate marking by default. Using pre-aligned inputs requires proper assignment of BAM read groups and sorting. The bam_clean argument can often resolve issues with problematic input BAMs.
bam_clean
Clean an input BAM when skipping the alignment step. This handles adding read groups, sorting to a reference genome and filtering records that cause problems with GATK. Options:
remove_extracontigs – Remove non-standard chromosomes (for human, anything that is not chr1-22,X,Y) from the BAM file. This allows compatibility when the BAM reference genome has different contigs from the reference file but consistent ordering for standard chromosomes. Also fixes the read groups in the BAM file as in fixrg. This is faster than the full picard cleaning option.
fixrg – Only adjust read groups, assuming everything else in the BAM file is compatible.
picard – Picard/GATK based cleaning. Includes read group changes, fixing of problematic reads and re-ordering the chromosome order to match the reference genome. To fix misencoded input BAMs with non-standard scores, set quality_format to illumina.
bam_sort
Allow sorting of input BAMs when skipping the alignment step (aligner set to false). Options are coordinate or queryname. Additional processing through standard pipelines requires coordinate-sorted inputs. The default is to not do additional sorting and assume pre-sorted BAMs.
align_split_size
Increase parallelization of alignment. As of 0.9.8, bcbio will try to determine a useful parameter and you don't need to set this. If you manually set it, bcbio will respect your specification. Set to false to avoid splitting entirely. If set, this defines the number of records to feed into each independent parallel step (for example, 5000000 = 5 million reads per chunk). It converts the original inputs into bgzipped, grabix-indexed FASTQ files, and then retrieves chunks for parallel alignment. Following alignment, it combines all chunks back into the final merged alignment file. This allows parallelization at the cost of the additional work of preparing inputs and combining split outputs. The tradeoff makes sense when you have large files and lots of distributed compute. When you have fewer large multicore machines this parameter may not help speed up processing.
quality_format
Quality format of FASTQ or BAM inputs [standard, illumina]. standard means Sanger, illumina means Illumina [1.3, 1.8).
strandedness
For RNA-seq libraries, if your library is strand specific, set the appropriate flag from [unstranded, firststrand, secondstrand]. Defaults to unstranded. For dUTP marked libraries, firststrand is correct; for Scriptseq prepared libraries, secondstrand is correct.
save_diskspace
Remove alignment-prepped bgzip and split BAM files after merging into final BAMs. Helps reduce space on limited filesystems during a run. tools_off: [upload_alignment] may also be useful in conjunction with this. [false, true]
Read trimming
trim_reads
Trims low quality or adapter sequences at the ends of reads. adapters and custom_trim specify the sequences to trim. For RNA-seq, it's recommended to leave this as False unless running Tophat2. For variant calling, we recommend trimming only in special cases where standard soft-clipping does not resolve false positive problems. Supports trimming with atropos or fastp. fastp is currently not compatible with alignment splitting in variant calling and requires align_split_size: false. The old parameter read_through defaults to using atropos trimming. [False, atropos, fastp]. Defaults to False.
adapters
If trimming adapter read-through, trim a set of stock adapter sequences. Allows specification of multiple items in a list; for example, [truseq, polya] will trim both TruSeq adapter sequences and polyA tails. polyg trimming removes high quality G stretches present in NovaSeq and NextSeq data. In the small RNA pipeline, bcbio will try to detect the adapter using DNApi; if you set this parameter, bcbio will use this value instead. Choices: [truseq, illumina, nextera, polya, polyx, polyg, nextera2, truseq2].
nextera2: Illumina NEXTera DNA prep kit from NEB
truseq2: SMARTer Universal Low Input RNA Kit
custom_trim
A list of sequences to trim from the end of reads, for example: [AAAATTTT, GGGGCCCC]
min_read_length
Minimum read length to maintain when read_through trimming is set in trim_reads. Defaults to 25.
trim_ends
Specify values for trimming at the ends of reads, using a fast approach built into fastq preparation. This does not do quality or adapter trimming but will quickly cut off a defined number of bases from either the 5' or 3' end of the first and second reads. Expects a list of 4 values: [5' trim read1, 3' trim read1, 5' trim read2, 3' trim read2]. Set values to 0 if you don't need trimming (e.g. [6, 0, 12, 0] will trim 6bp from the start of read 1 and 12bp from the start of read 2). Only implemented for variant calling pipelines.
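For example, a trimming setup using these options could be sketched as (values illustrative):
algorithm:
  trim_reads: atropos
  adapters: [truseq, polya]
  min_read_length: 25
  # or, instead of adapter trimming, cut fixed lengths from read ends:
  # trim_ends: [6, 0, 12, 0]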
Alignment postprocessing
mark_duplicates
Mark duplicated reads [true, false]. If true, will perform streaming duplicate marking with biobambam's bammarkduplicates or bamsormadup. Uses samblaster as an alternative if you have paired reads and specify lumpy as an svcaller. Defaults to true for variant calling and false for RNA-seq and small RNA analyses. Also defaults to false if you're not doing alignment (aligner: false).
recalibrate
Perform base quality score recalibration on the aligned BAM file, adjusting quality scores to reflect alignments and known variants. Supports both GATK and Sentieon recalibration. Defaults to false, no recalibration. [false, gatk, sentieon]
realign
Perform GATK's realignment around indels on the aligned BAM file. Defaults to no realignment since realigning callers like FreeBayes and GATK HaplotypeCaller handle this as part of the calling process. [false, gatk]
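A sketch combining these postprocessing options (values illustrative):
algorithm:
  mark_duplicates: true
  recalibrate: gatk
  realign: false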
Coverage information
coverage_interval
Regions covered by sequencing. bcbio calculates this automatically from alignment coverage information, so you only need to specify it in the input configuration if you have specific needs or bcbio does not determine coverage correctly. genome specifies full genome sequencing, regional identifies partial-genome pull down sequencing like exome analyses, and amplicon is partial-genome sequencing from PCR amplicon sequencing. This influences GATK options for filtering: we use Variant Quality Score Recalibration when set to genome, otherwise we apply cutoff-based soft filters. Also affects copy number calling with CNVkit, structural variant calling and deep panel calling in cancer samples, where we tune regional/amplicon analyses to maximize sensitivity. [genome, regional, amplicon]
maxcov_downsample
bcbio downsamples whole genome runs with >10x average coverage to a maximum coverage, avoiding slow runtimes in collapsed repeats and poly-A/T/G/C regions. This parameter specifies the multiplier of average coverage to downsample at. For example, 200 downsamples at 6000x coverage for a 30x whole genome. Set to false or 0 to disable downsampling. Currently defaults to false pending runtime improvements.
coverage_depth_min
Minimum depth of coverage. When calculating regions to call in, bcbio may exclude regions with fewer than this many reads. It is not a hard filter for variant calling, but rather a guideline for determining callable regions. It's primarily useful when trying to call on very low depth samples. Defaults to 4. Setting lower than 4 will trigger low-depth calling options for GATK.
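For instance, a whole genome run could set these coverage options as follows (a sketch restating the behavior described above, not a recommendation):
algorithm:
  coverage_interval: genome
  maxcov_downsample: 200
  coverage_depth_min: 4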
Analysis regions
These BED files define the regions of the genome to analyze and report on. variant_regions adjusts regions for small variant (SNP and indel) calling. sv_regions defines regions for structural variant calling if different than variant_regions. For coverage-based quality control metrics, we first use coverage if specified, then sv_regions if specified, then variant_regions. See the section on input file preparation for tips on ensuring chromosome naming in these files matches your reference genome. bcbio pre-installs some standard BED files for human analyses. Reference these using the naming schemes described in the reference data repository.
variant_regions
BED file of regions to call variants in.
sv_regions
A specification of regions to target during structural variant calling. By default, bcbio uses regions specified in variant_regions, but this allows custom specification for structural variant calling. This can be a pointer to a BED file or special inputs: exons for only exon regions, transcripts for transcript regions (the min start and max end of exons) or transcriptsXXXX for transcripts plus a window of XXXX size around them. The size can be an integer (transcripts1000) or exponential (transcripts1e5). This applies to CNVkit and heterogeneity analysis.
coverage
A BED file of regions to check for coverage and completeness in QC reporting. This can also be shorthand for a BED file installed by bcbio (see Structural variant calling for options).
exclude_regions
List of regions to remove as part of analysis. This allows avoidance of slow and potentially misleading regions. This is a list of the following options:
polyx
Avoid calling variants in regions of single nucleotide stretches greater than 50. These can contribute to long variant calling runtimes when errors in polyX stretches align at high depth to these regions and take a lot of work to resolve. Since we don't expect decent resolution through these types of repeats, this helps avoid extra calculations for assessing the noise. This is an alternative to trimming polyX from the 3' ends of reads with trim_reads and adapters. Requires an organism with a defined polyx file in genome resources. For structural variant calling, adding polyx avoids calling small indels for Manta, where these can contribute to long runtimes.
lcr
Avoid calling variants in low complexity regions (LCRs) in the variant2 pipeline (somatic and germline variant calling). Heng Li's variant artifacts paper provides these regions, which cover ~2% of the genome but contribute a large fraction of problematic calls due to the difficulty of resolving variants in repetitive regions. Removal can help facilitate comparisons between methods and reduce false positives if you don't need calls in LCRs for your biological analysis. Requires an organism with a defined lcr file in genome resources. The remove_lcr_regions function looks up the lcr resource in genome resources (for example, genomes/Hsapiens/hg38/seq/hg38-resources.yaml contains lcr: ../coverage/problem_regions/repeats/LCR.bed.gz) and fetches the BED file genomes/Hsapiens/hg38/coverage/problem_regions/repeats/LCR.bed.gz. The file is installed by the cloudbiolinux recipe from https://github.com/lh3/varcmp. Motivation in Li2014. See also the boundary effects discussion.
highdepth
Remove high depth regions during variant calling, identified by collapsed repeats around centromeres in hg19 and GRCh37 as characterized in the ENCODE blacklist. This is on by default for VarDict and FreeBayes whole genome calling to help with slow runtimes in these regions, and also on for whole genome structural variant calling to avoid false positives from high depth repeats.
altcontigs
Skip calling in unplaced contigs (Un), limiting analysis to standard chromosomes – chr1-22,X,Y,MT for human – to avoid slowdowns on the additional contigs. By default bcbio calls variants in unplaced but not in alternative contigs (alleles). Alt contig calling is currently not supported.
remove_lcr: true – deprecated, use exclude_regions: [lcr] instead.
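Combining these region options, a hypothetical exome configuration might look like this (the BED paths are placeholders):
algorithm:
  variant_regions: /path/to/capture_targets.bed
  sv_regions: transcripts1000
  coverage: /path/to/coverage_check.bed
  exclude_regions: [lcr, polyx]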
Standard
This pipeline implements the alignment and qc tools. Furthermore, it runs qsignature to detect possible duplicated or mislabeled samples. It uses a SNP signature to create a distance matrix that makes it easy to group samples. The project yaml file will show the number of total samples analyzed, the number of very similar samples, and samples that could be duplicated.
We will assume that you installed bcbio-nextgen with the automated installer, so your default bcbio_system.yaml file is configured correctly with all of the tools pointing to the right places. If that is the case, to run bcbio-nextgen on a set of samples you just need to set up a YAML file that describes your samples and what you would like to do to them.
Let's say that you have a single paired-end control lane, prepared with the Illumina TruSeq Kit from a human. Here is what a well-formed sample YAML file for that RNA-seq experiment would look like:
fc_date: '070113'
fc_name: control_experiment
upload:
  dir: final
details:
  - files: [/full/path/to/control_1.fastq, /full/path/to/control_2.fastq]
    description: 'Control_rep1'
    genome_build: GRCh37
    analysis: RNA-seq
    algorithm:
      aligner: tophat2
      quality_format: Standard
      trim_reads: read_through
      adapters: [truseq, polya]
      strandedness: unstranded
fc_date and fc_name will be combined to form a prefix to name intermediate files, and can be set to whatever you like. upload is explained pretty well in the configuration documentation; the above will direct bcbio-nextgen to put the output files from the pipeline into the final directory. Under details is a list of sections, each describing a sample to process. You can set many parameters under each section, but most of the time just setting a few like the above is all that is necessary. analysis tells bcbio-nextgen to run the best-practice RNA-seq pipeline on this sample.
In the above, since there are two files, control_1.fastq and control_2.fastq will be automatically run as paired-end data. If you have single-end data you can just supply one file and it will run as single-end. The description field will be used to eventually rename the files, so make it very evocative since you will be looking at it a lot later. genome_build is self-explanatory.
Sometimes you need a little bit more flexibility than the standard pipeline, and the algorithm section has many options to fine-tune the behavior of the algorithm. quality_format tells bcbio-nextgen what quality format your FASTQ inputs are using; if your samples were sequenced any time past 2009 or so, you probably want to set it to Standard. Adapter read-through is a problem in RNA-seq libraries, so we want to trim off possible adapter sequences on the ends of reads, so trim_reads is set to read_through, which will also trim off poor quality ends. Since your library is an RNA-seq library prepared with the TruSeq kit, the set of adapters to trim off are the TruSeq adapters and possible polyA tails, so adapters is set to both of those.
strandedness can be set if your library was prepared in a strand-specific manner and can be set to firststrand, secondstrand or unstranded (the default).
Parallelization
nomap_split_size
Unmapped base pair regions required to split analysis into blocks. Creates islands of mapped reads surrounded by unmapped (or N) regions, allowing each mapped region to run in parallel. (default: 250)
nomap_split_targets
Number of target intervals to attempt to split processing into. This picks unmapped regions evenly spaced across the genome to process concurrently. Limiting targets prevents a large number of small targets which can blow up memory for runs with many samples. (default: 200 for standard runs, 20 for CWL runs)
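As a sketch, these parameters sit in the algorithm section; the values below simply restate the defaults given above:
algorithm:
  nomap_split_size: 250
  nomap_split_targets: 200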
Multiple samples
Let's say you have a set of mouse samples to analyze, where each sample is a single lane of single-end RNA-seq reads prepared using the NextEra kit. There are two case and two control samples. Here is a sample configuration file for that analysis:
fc_date: '070113'
fc_name: mouse_analysis
upload:
  dir: final
details:
  - files: [/full/path/to/control_rep1.fastq]
    description: 'Control_rep1'
    genome_build: mm10
    analysis: RNA-seq
    algorithm:
      aligner: tophat2
      quality_format: Standard
      trim_reads: read_through
      adapters: [nextera, polya]
  - files: [/full/path/to/control_rep2.fastq]
    description: 'Control_rep2'
    genome_build: mm10
    analysis: RNA-seq
    algorithm:
      aligner: tophat2
      quality_format: Standard
      trim_reads: read_through
      adapters: [nextera, polya]
  - files: [/full/path/to/case_rep1.fastq]
    description: 'Case_rep1'
    genome_build: mm10
    analysis: RNA-seq
    algorithm:
      aligner: tophat2
      quality_format: Standard
      trim_reads: read_through
      adapters: [nextera, polya]
  - files: [/full/path/to/case_rep2.fastq]
    description: 'Case_rep2'
    genome_build: mm10
    analysis: RNA-seq
    algorithm:
      aligner: tophat2
      quality_format: Standard
      trim_reads: read_through
      adapters: [nextera, polya]
More samples are added just by adding more entries under the details section.
This is tedious and error prone to do by hand, so there is an automated
template system for common experiments.
You could set up the previous experiment by making a mouse version of the illumina-rnaseq template file and saving it to a local file such as illumina-mouse-rnaseq.yaml. Then you can set up the sample file using the templating system:
bcbio_nextgen.py -w template illumina-mouse-rnaseq.yaml mouse_analysis
/full/path/to/control_rep1.fastq /full/path/to/control_rep2.fastq
/full/path/to/case_rep1.fastq /full/path/to/case_rep2.fastq
If you had paired-end samples instead of single-end samples, you can still use the template system as long as the forward and reverse read filenames are the same, barring a _1 and _2. For example: control_1.fastq and control_2.fastq will be detected as paired and combined in the YAML file output by the templating system.
Quality control
qc
Allows you to specifically assign quality control modules to run. Generally you want to leave this unset and allow bcbio to run the correct QC metrics for your experiment, or remove specific QC steps you don't want using tools_off (Changing bcbio defaults). However, this can allow turning off most of the QC by specifying a single quick running step like picard. Available tools are fastqc, samtools, coverage, picard, contamination (VerifyBamID), peddy, viral, damage, umi, small-rna, atropos, chipqc.
mixup_check
Detect potential sample mixups. Currently supports qSignature. qsignature_full runs a larger analysis, while qsignature runs a smaller subset on chromosome 22. [False, qsignature, qsignature_full]
kraken
Detect bacterial, viral, and archaeal contamination with kraken. First, install the minikraken database: bcbio_nextgen.py upgrade -u skip --datatarget kraken. It gets installed in bcbio/genomes/kraken and takes 4G. Second, turn on the kraken algorithm with kraken: minikraken. Optionally, create a custom kraken database outside of bcbio and use it with kraken: /path/to/custom/kraken. Kraken reports are in project/final/sample/qc/kraken. The kraken option works for the variant2 and RNA-seq pipelines.
preseq
Accepts lc_extrap or c_curve, and runs Preseq, a tool that predicts the yield of future experiments. By default, it runs 300 steps of estimation using a segment length of 100000. The default extrapolation limit for lc_extrap is 3x the number of reads. You can override the parameters seg_len, steps and extrap_fraction using the resources section:
resources:
  preseq:
    extrap_fraction: 5
    steps: 500
    seg_len: 5000
You can also set the extrap and step parameters directly, as well as provide any other command line option via options:
resources:
  preseq:
    extrap: 10000000
    step: 30000
    options: ["-D"]
bcbio uses MultiQC to combine QC output for all samples into a single report file. If you need to tweak configuration settings from bcbio defaults, you can use resources. For instance, to display read counts with full numbers instead of the default millions:
resources:
  multiqc:
    options: ["--cl_config", "'read_count_multiplier: 1'"]
or as thousands:
resources:
  multiqc:
    options: ["--cl_config", "'{read_count_multiplier: 0.001, read_count_prefix: K}'"]
Post-processing
archive
Specify targets for long term archival. cram removes fastq names and does 8-bin compression of BAM files into CRAM format. cram-lossless generates CRAM files without changes to quality scores or fastq names. Default: [] – no archiving. Lossy cram has some issues, lossless cram provides good compression relative to BAM, and many machines output binned values now, so we recommend using cram-lossless.
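For example, to archive final alignments as lossless CRAM, a sample's algorithm section would include:
algorithm:
  archive: [cram-lossless]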
Changing bcbio defaults
bcbio provides some hints to change default behavior by either turning specific defaults on or off, with tools_on and tools_off. Both can be lists with multiple options:
tools_off
Specify third party tools to skip as part of the analysis pipeline. Enables turning off specific components of pipelines if not needed:
gatk4 – Use older GATK versions (3.x) for GATK commands like BQSR, HaplotypeCaller and VQSR. By default bcbio includes GATK4 and uses it.
vqsr – Turns off variant quality score recalibration for all samples.
bwa-mem – Forces use of the original bwa aln alignment. Without this, we use bwa mem for 70bp or longer reads.
lumpy-genotype – Skip genotyping for Lumpy samples, which can be slow in the case of many structural variants.
seqcluster – Turns off use of the seqcluster tool in the srnaseq pipeline.
tumoronly-prioritization – Turns off attempted removal of germline variants from tumor-only calls using external population data sources like ExAC and 1000 genomes.
vardict_somatic_filter – Disables running a post-calling filter for VarDict to remove variants found in normal samples. Without vardict_somatic_filter, paired analyses perform no soft filtering of germline variants and all high quality variants pass.
upload_alignment – Turns off final upload of large alignment files.
pbgzip – Turns off use of bgzip with multiple threads.
For quality control, you can turn off any specific tool by adding it to tools_off. For example, fastqc turns off quality control FastQC usage, and coverage_qc turns off calculation of coverage statistics with samtools-stats and picard. See the Methylation docs for details on tools.
tools_on
Specify functionality to enable that is off by default:
bcbiornaseq – Loads a bcbioRNASeq object for use with bcbioRNASeq.
bnd-genotype – Enables genotyping of breakends in Lumpy calls, which improves accuracy but can be slow.
bwa-mem – Forces use of bwa mem even for samples with less than 70bp reads.
coverage_perbase – Calculates per-base coverage depth for analyzed variant regions.
damage_filter – Annotates low frequency somatic calls in INFO/DKFZBias for DNA damage artifacts using DKFZBiasFilter.
gemini – Create a GEMINI database of variants for downstream query using the new vcfanno and vcf2db approach.
gemini_allvariants – Enables all variants to go into GEMINI, not only those that pass filters.
gemini_orig – Create a GEMINI database of variants using the older GEMINI loader. Only works for GRCh37 and hg19.
gvcf – Forces gVCF output for callers that support it (GATK HaplotypeCaller, FreeBayes, Platypus). For joint calling using a population of samples, please use jointcaller.
lumpy_usecnv – Uses input calls from CNVkit as prior evidence in Lumpy calling.
noalt_calling – Call variants only for chr1-22,X,Y,MT.
qualimap – Runs Qualimap (qualimap uses downsampled files and numbers here are an estimation of 1e7 reads).
qualimap_full – Runs Qualimap with full BAM files, but it may be slow.
svplots – Adds additional coverage and summary plots for CNVkit and detected ensemble variants.
tumoronly_germline_filter – Applies a LowPriority filter to tumor-only calls that match population germline databases. The default is to just apply a tag EPR (external prioritization) that flags variants present in external databases. Anything missing a pass here is a likely germline.
vcf2db_expand – Decompresses and expands the genotype columns in the vcfanno prepared GEMINI databases, enabling standard SQL queries on genotypes and depths.
vqsr – Makes GATK try quality score recalibration for variant filtration, even for smaller sample sizes.
vep_splicesite_annotations – Enables use of the MaxEntScan and SpliceRegion plugins for VEP. Both optional plugins add extra splice site annotations.
vardict_sv – Removes the --nosv option from the vardict call (the default is --nosv ON). This option is experimental. Breaks tools_on: damage_filter and alters VCF output.
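As a sketch, both lists go in a sample's algorithm section; the options below are an arbitrary illustrative combination, not a recommendation:
algorithm:
  tools_off: [vqsr, lumpy-genotype]
  tools_on: [gemini, coverage_perbase]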
Resources
The resources section allows customization of the locations of programs and the memory and compute resources to devote to them:
resources:
  bwa:
    cores: 12
    cmd: /an/alternative/path/to/bwa
  samtools:
    cores: 16
    memory: 2G
  gatk:
    jvm_opts: ["-Xms2g", "-Xmx4g"]
  mutect2_filter:
    options: ["--max-events-in-region", "2"]
cmd
Location of an executable. By default, we assume executables are on the path.
cores
Cores to use for multi-processor enabled software. This is how many cores will be allocated per job. For example, if you are running 10 samples and passed -n 40 to bcbio-nextgen and the step you are running has cores: 8 set, a maximum of five samples will run in parallel, each using 8 cores.
jvm_opts
Specific memory usage options for Java software. For memory usage on programs like GATK, specify the maximum usage per core. On multicore machines, that's machine-memory divided by cores. This avoids memory errors when running multiple jobs simultaneously, while the framework will adjust memory up when running multicore jobs.
memory
Specify the memory per core used by a process. For programs where memory control is available, like samtools sort, this limits memory usage. For other programs this is an estimate of usage, used by memory management to avoid over-scheduling memory. Always specify this as the memory usage for a single core, and the pipeline handles scaling this when a process uses multiple cores.
keyfile
Specify the location of a program specific key file or license server, obtained from a third party software tool. Supports licenses for novoalign and Sentieon. For more complex Sentieon setups this can also be a dictionary of environment variables:
resources:
  sentieon:
    keyfile:
      SENTIEON_LICENSE_SERVER: 100.100.100.100:8888
      SENTIEON_AUTH_MECH: XXX
      SENTIEON_AUTH_DATA: signature
options
Adjust specific command line options for a program. This can be hard to support for many tools due to conflicts with other existing options, but is available for some tools:
mutect2, mutect2_filter – Adjust options for Mutect2 calls and filtering.
Temporary directory
You can also use the resources section to specify system specific parameters like global temporary directories:
resources:
  tmp:
    dir: /scratch
This is useful on cluster systems with large attached local storage, where you can avoid some shared filesystem IO by writing temporary files to the local disk. When setting this, keep in mind that the global temporary disk must have enough space to handle intermediates. The space required differs between steps, but generally you'd need to have 2x the largest input file per sample and account for samples running simultaneously on multiple core machines.
To handle clusters that specify local scratch space with an environment variable, bcbio will resolve environment variables like:
resources:
  tmp:
    dir: $YOUR_SCRATCH_LOCATION
Sample or run specific resources
To override any of the global resource settings in a sample specific manner, add a resources section within your sample YAML configuration. For example, to create a sample specific temporary directory and pass a command line option to novoalign, write a sample resource specification like:
- description: Example
  analysis: variant2
  resources:
    novoalign:
      options: ["-o", "FullNW", "--rOQ"]
    tmp:
      dir: tmp/sampletmpdir
To adjust resources for an entire run, you can add this resources specification at the top level of your sample YAML:
details:
  - description: Example
resources:
  default:
    cores: 16
Input file preparation
Input files for supplementing analysis, like variant_regions, need to match the specified reference genome. A common cause of confusion is the two chromosome naming schemes for human genome build 37: UCSC-style in hg19 (chr1, chr2) and Ensembl/NCBI style in GRCh37 (1, 2). To help avoid some of this confusion, in build 38 we only support the commonly agreed on chr1, chr2 style.
It's important to ensure that the chromosome naming in your input files matches that of the selected reference genome. bcbio will try to detect mismatches and provide helpful errors if you miss this.
To convert chromosome names, you can use Devon Ryan's collection of chromosome mappings as an input to sed. For instance, to convert hg19 chr-style coordinates to GRCh37:
wget --no-check-certificate -qO- https://raw.githubusercontent.com/dpryan79/ChromosomeMappings/master/GRCh37_UCSC2ensembl.txt \
| awk '{if($1!=$2) print "s/^"$1"/"$2"/g"}' > remap.sed
sed -f remap.sed original.bed > final.bed
Genome configuration files
Each genome build has an associated buildname-resources.yaml configuration file which contains organism specific naming and resource files. bcbio-nextgen expects a resource file present next to the genome FASTA file. Example genome configuration files are available, and are automatically installed for natively supported genomes. Create these by hand to support additional organisms or builds.
The major sections of the file are:
aliases – Names for third-party programs used as part of the analysis, since naming expectations can differ between software programs.
variation – Supporting data files for variant analysis. For human analyses, the dbSNP and training files are from the GATK resource bundle. These are inputs into the training models for recalibration. The automated CloudBioLinux data scripts will download and install these in the variation subdirectory relative to the genome files.
rnaseq – Supporting data files for RNA-seq analysis. The automated installer and updater handles retrieval and installation of these resources for supported genome builds.
srnaseq – Supporting data files for smallRNA-seq analysis. As with rnaseq, the automated installer and updater handles this for supported genome builds.
By default, we place the buildname-resources.yaml files next to the genome FASTA files in the reference directory. For custom setups, you can specify an alternative directory in the resources section of your bcbio_system.yaml file:
resources:
  genome:
    dir: /path/to/resources/files
Reference genome files
For human genomes, we recommend using build 38 (hg38). This is fully supported and validated in bcbio, and corrects a lot of issues in the previous build 37. We use the 1000 genomes distribution which includes HLAs and decoy sequences. For human build 37, GRCh37 and hg19, we use the 1000 genomes references provided in the GATK resource bundle. These differ in chromosome naming: hg19 uses chr1, chr2, chr3 style contigs while GRCh37 uses 1, 2, 3. They also differ slightly in content: GRCh37 has masked pseudoautosomal regions on chromosome Y, allowing alignment to these regions on chromosome X.
You can use pre-existing data and reference indexes by pointing bcbio-nextgen at these resources. We use the Galaxy .loc files approach to describing the location of the sequence and index data, as described in Data requirements. This does not require a Galaxy installation since the installer sets up a minimal set of .loc files. It finds these by identifying the root galaxy directory, in which it expects a tool-data sub-directory with the .loc files. It can do this in two ways:
1. Using the directory of your bcbio-system.yaml. This is the default mechanism set up by the automated installer and requires no additional work.
2. From the path specified by the galaxy_config option in your bcbio-system.yaml. If you'd like to move your system YAML file, add the full path to your galaxy directory here. This is useful if you have a pre-existing Galaxy installation with reference data.
To manually make genomes available to bcbio-nextgen, edit the individual .loc files with locations to your reference and index genomes. You need to edit sam_fa_indices.loc to point at the FASTA files and then any genome indexes corresponding to aligners you'd like to use (for example: bwa_index.loc for bwa and bowtie2_indices.loc for bowtie2). The database key names used (like GRCh37 and mm10) should match those used in the genome_build of your sample input configuration file.
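As a rough sketch of what an edited entry looks like, assuming the tab-separated three-column layout the installer writes (the path here is hypothetical):
index	GRCh37	/path/to/genomes/Hsapiens/GRCh37/seq/GRCh37.fa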
To remove a reference genome, delete its directory bcbio/genomes/species/reference and remove all records corresponding to that genome from the bcbio/galaxy/tool-data/*.loc files.
genomes/Hsapiens/hg38/seq/hg38-resources.yaml specifies relative locations of the resources. To determine the absolute path, bcbio fetches a value from bcbio/galaxy/tool-data/sam_fa_indices.loc and uses it as a base directory for all resources. If there are several installations of bcbio data, it is important to have separate tool-data directories as well.
Adding custom genomes
bcbio_setup_genome.py installs a custom genome for variant and bulk RNA-seq analyses and updates the configuration files:
bcbio_setup_genome.py \
  -f genome.fa \
  -g annotation.gtf \
  -i bwa star seq \
  -n Celegans -b WBcel135 --buildversion WormBase_34
Arguments:
Run bcbio_setup_genome.py --help to see all available arguments.
-f genome.fasta – genome in FASTA format
-g annotation.gtf – annotation file in GTF or GFF3 format
-i seq bwa – list of aligner indices to create. seq is a sequence dictionary, always created.
-n Name – name of the species, for example Celegans.
-b Build – genome build, for example WBcel135.
--buildversion annotation_build – annotation build, for example WormBase_34 or an Ensembl build. It is saved in bcbio/genomes/Name/Build/rnaseq/version.txt.
References for many species are available from Ensembl.
If you want to add smallRNA-seq data files, you will need to add the three-letter miRBase code for your genome (i.e. hsa for human) and the GTF file for the annotation of smallRNA data. Here you can use the same file as the transcriptome if no other is available:
bcbio_setup_genome.py -f genome.fa -g annotation.gtf -i bowtie2 star seq -n Celegans -b WBcel135 --species cel --srna_gtf another_annotation.gtf --buildversion WormBase_34
To use that genome, just configure your YAML files with:
genome_build: WBcel135
The GTF file you provide to the bcbio_setup_genome.py script must have the following features:
- each entry must have a transcript_id and a gene_id
- for each transcript there must be entries where the feature field (field 3) is exon, with the coordinates describing the start and end of the exon
For example, this is a snippet from a valid GTF file:
1  pseudogene            gene        11869  14412  .  +  .  gene_source "ensembl_havana"; gene_biotype "pseudogene"; gene_id "ENSG00000223972"; gene_name "DDX11L1";
1  processed_transcript  transcript  11869  14409  .  +  .  transcript_source "havana"; gene_id "ENSG00000223972"; gene_source "ensembl_havana"; transcript_name "DDX11L1-002"; gene_biotype "pseudogene"; transcript_id "ENST00000456328"; gene_name "DDX11L1";
1  processed_transcript  exon        11869  12227  .  +  .  exon_number "1"; transcript_source "havana"; gene_id "ENSG00000223972"; exon_id "ENSE00002234944"; gene_source "ensembl_havana"; transcript_id "ENST00000456328"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; gene_name "DDX11L1";
1  processed_transcript  exon        12613  12721  .  +  .  exon_number "2"; transcript_source "havana"; gene_id "ENSG00000223972"; exon_id "ENSE00003582793"; gene_source "ensembl_havana"; transcript_id "ENST00000456328"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; gene_name "DDX11L1";
1  processed_transcript  exon        13221  14409  .  +  .  exon_number "3"; transcript_source "havana"; gene_id "ENSG00000223972"; exon_id "ENSE00002312635"; gene_source "ensembl_havana"; transcript_id "ENST00000456328"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; gene_name "DDX11L1";
Effects prediction
To perform variant calling and predict effects in a custom genome you’d have to manually download and link this into your installation. First find the snpEff genome build:
$ snpEff databases | grep Lactobacillus | grep pentosus
Lactobacillus_pentosus_dsm_20314 Lactobacillus_pentosus_dsm_20314 ENSEMBL_BFMPP_32_179 http://downloads.sourceforge.net/project/snpeff/databases/v4_3/snpEff_v4_3_ENSEMBL_BFMPP_32_179.zip
Lactobacillus_pentosus_kca1 Lactobacillus_pentosus_kca1 ENSEMBL_BFMPP_32_179 http://downloads.sourceforge.net/project/snpeff/databases/v4_3/snpEff_v4_3_ENSEMBL_BFMPP_32_179.zip
then download to the appropriate location:
$ cd /path/to/bcbio/genomes/Lacto/Lactobacillus_pentosus
$ mkdir snpEff
$ cd snpEff
$ wget http://downloads.sourceforge.net/project/snpeff/databases/v4_3/snpEff_v4_3_ENSEMBL_BFMPP_32_179.zip
$ unzip snpEff_v4_3_ENSEMBL_BFMPP_32_179.zip
$ find . -name "Lactobacillus_pentosus_dsm_20314"
./home/pcingola/snpEff/data/Lactobacillus_pentosus_dsm_20314
$ mv ./home/pcingola/snpEff/data/Lactobacillus_pentosus_dsm_20314 .
finally, add to your genome configuration file (seq/Lactobacillus_pentosus-resources.yaml):
aliases:
  snpeff: Lactobacillus_pentosus_dsm_20314
For adding an organism not present in snpEff, please see this mailing list discussion.
Download data from SRA
Use the SRA toolkit (prefetch/fastq-dump).
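A minimal sketch with the SRA toolkit (the accession number is a placeholder):
prefetch SRR1234567                            # download the .sra archive
fastq-dump --split-files --gzip SRR1234567     # write SRR1234567_1.fastq.gz / _2.fastq.gz for paired-end runs
The resulting FASTQ files can then be used as input files in the sample configuration.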
Upload
The upload section of the sample configuration file describes where to put the final output files of the pipeline. At its simplest, you can configure bcbio-nextgen to upload results to a local directory, for example a folder shared amongst collaborators or a Dropbox account. You can also configure it to upload results automatically to a Galaxy instance, to Amazon S3 or to iRODS. Here is the simplest configuration, uploading to a local directory:
upload:
  dir: /local/filesystem/directory
General parameters, always required:
method – Upload method to employ: [filesystem, galaxy, s3, irods]. Defaults to local filesystem.
dir – Local filesystem directory to copy to.
Galaxy parameters:
galaxy_url – URL of the Galaxy instance to upload to. Upload assumes you are able to access a shared directory also present on the Galaxy machine.
galaxy_api_key – User API key to access Galaxy: see the Galaxy API documentation.
galaxy_library – Name of the Galaxy Data Library to upload to. You can specify this globally for a project in upload or for individual samples in the sample details section.
galaxy_role – Specific Galaxy access roles to assign to the uploaded datasets. This is optional and will default to the access of the parent data library if not supplied. You can specify this globally for a project in upload or for individual samples in the sample details section. The Galaxy Admin documentation has more details about roles.
Here is an example configuration for uploading to a Galaxy instance. This assumes you have a shared mounted filesystem that your Galaxy instance can also access:
upload:
  method: galaxy
  dir: /path/to/shared/galaxy/filesystem/folder
  galaxy_url: http://url-to-galaxy-instance
  galaxy_api_key: YOURAPIKEY
  galaxy_library: data_library_to_upload_to
Your Galaxy universe_wsgi.ini configuration needs to have allow_library_path_paste = True set to enable uploads.
S3 parameters:
bucket – AWS bucket to direct output to.
folder – A folder path within the AWS bucket to prefix the output.
region – AWS region name to use. Defaults to us-east-1.
reduced_redundancy – Flag to determine if we should store S3 data with reduced redundancy: cheaper but less reliable [false, true].
For S3 access credentials, set the standard environment variables AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_DEFAULT_REGION, or use IAM access roles with an instance profile on EC2 to give your instances permission to create temporary S3 access.
iRODS parameters:
folder – Full directory name within iRODS to prefix the output.
resource – (optional) iRODS resource name, if other than default.
Example configuration:
upload:
  method: irods
  dir: ../final
  folder: /irodsZone/your/path/
  resource: yourResourceName
Uploads to iRODS depend on a valid installation of the iCommands CLI, and a preconfigured connection through the iinit command.
Globals
You can define files used multiple times in the algorithm section of your configuration in a top level globals dictionary. This saves copying and pasting across the configuration and makes it easier to manually adjust the configuration if inputs change:
globals:
  my_custom_locations: /path/to/file.bed
details:
  - description: sample1
    algorithm:
      variant_regions: my_custom_locations
  - description: sample2
    algorithm:
      variant_regions: my_custom_locations
Logging
bcbio creates 3 log files:
bcbio-nextgen.log
High level logging information about the analysis. This provides an overview of major processing steps and useful checkpoints for assessing run times.
bcbio-nextgen-debug.log
Detailed information about processes, including stdout/stderr from third party software and error traces for failures. Look here to identify the status of running pipelines or to debug errors. It labels each line with the hostname of the machine it ran on to ease debugging in distributed cluster environments.
bcbio-nextgen-commands.log
Full command lines for all third party software tools run.
The default location for log files is the work/log directory. Two of the logs are also saved in final/project.
Setting log_dir: /path/to/logs in /bcbio/galaxy/bcbio-system.yaml sets the logging destination for all projects.
Persistence
Every pipeline has multiple steps. bcbio saves intermediate results in the work directory. If a step has finished successfully (an alignment BAM file is generated, a variants VCF is calculated, a PureCN normal DB is generated) and the pipeline fails at one of the subsequent steps, then upon re-running the pipeline the finished steps will not be re-calculated. If you'd like to re-generate data for a particular step, simply remove the corresponding work/step folder; for example, remove work/gemini if you'd like to re-generate a GEMINI database or PureCN NormalDB.