Internals

Overview

Variant calling overview

Parallel

bcbio calculates callable regions following alignment using goleft depth. These regions determine breakpoints for analysis, allowing parallelization by genomic regions during variant calling. Defining non-callable regions allows bcbio to select breakpoints for parallelization within chromosomes where we won’t accidentally impact small variant detection. The callable regions also supplement the variant calls to define positions where not called bases are homozygous reference, as opposed to true no-calls with no evidence. The callable regions plus variant calls is an alternative to gVCF output which explicitly enumerates reference calls in the output variant file.

Parallel approaches

Overview of cluster types during parallel execution

Somatic tumor only variant calling pipeline with UMIs step by step

Minor steps (like tabix’ing of vcfs, indexing of bams) and details (full paths) are omitted.

bcbio.yaml config:

details:
- algorithm:
    aligner: bwa
    trim_ends: [2,0,2,0]
    min_allele_fraction: 0.01
    correct_umis: /path/umi.whitelist.txt
    tools_off:
    - gemini
    umi_type: fastq_name
    variantcaller:
    - vardict
    coverage: target.bed
    variant_regions: target.bed
  analysis: variant2
  description: samplex
  files:
  - /path/samplex_1.fq.gz
  - /path/samplex_2.fq.gz
  genome_build: hg38
  metadata:
    phenotype: tumor
resources:
  fgbio:
    options: [--min-reads, 3]
  • step 1: trimming 2p from reads: trim_ends: [2,0,2,0], indexing:

    bgzip --threads 8 -c <(seqtk trimfq -b 2 -e 0 samplex_1.fq.gz) > samplex_1.fq.gz
    bgzip --threads 8 -c <(seqtk trimfq -b 2 -e 0 samplex_2.fq.gz) > samplex_2.fq.gz
    grabix index samplex_2.fq.gz
    grabix index samplex_1.fq.gz
    
  • step 2: alignment with bwa mem, sort and mark duplicates, assign BAM tags (XS=sample, XC=cell, RX=UMI), input: fastq files, output: samplex-sort.bam:

    unset JAVA_HOME && \
    bwa mem -c 250 -M -t 16  -R '@RG\tID:samplex\tPL:illumina\tPU:samplex\tSM:samplex' \
    -v 1 /path/reference/genomes/Hsapiens/hg38/bwa/hg38.fa samplex_1.fq.gz samplex_2.fq.gz  | \
    bamsormadup tmpfile=samplex-sort-sorttmp-markdup inputformat=sam threads=16 \
    outputformat=bam level=0 SO=coordinate | \
    /path/bcbio/anaconda/envs/python2/bin/python /path/bin/umis bamtag - | \
    samtools view -b > samplex-sort.bam
    
  • step 3: correct UMIs with fgbio using the whitelist, input: samplex-sort.bam, output: samplex-sort-umis_corrected.bam:

    unset JAVA_HOME && \
    fgbio -Xms750m -Xmx30g -XX:+UseSerialGC --tmp-dir . --async-io=true --compression=0 \
    CorrectUmis \
    -t XC -m 3 -d 1 -x -U /path/umi.whitelist.txt -i samplex-sort.bam \
    -o samplex-sort-umis_corrected.bam
    
  • step 4: calculate coverage with mosdepth:

    mosdepth -t 16 -F 1804 --no-per-base --by target.bed samplex-rawumi \
    samplex-sort-umis_corrected.bam
    
  • step 5: fgbio GroupReadsByUmi, CallDuplexConsensusReads, FilterConsensusReads with min-reads=3, bam2fastq:

    unset JAVA_HOME && \
    fgbio -Xms750m -Xmx30g -XX:+UseSerialGC --tmp-dir . --async-io=true --compression=0 \
    GroupReadsByUmi \
    --edits=1 --min-map-q=1 -t XC -s paired -i samplex-sort-umis_corrected.bam | \
    fgbio -Xms750m -Xmx30g -XX:+UseSerialGC --tmp-dir . --async-io=true --compression=0 \
    CallDuplexConsensusReads \
    --min-input-base-quality=2 --sort-order=:none: -i /dev/stdin -o /dev/stdout | \
    fgbio -Xms750m -Xmx30g -XX:+UseSerialGC --tmp-dir . --async-io=true --compression=0 \
    FilterConsensusReads --min-reads=3 --min-base-quality=13 --max-base-error-rate=0.1 \
    -r /path/reference/genomes/Hsapiens/hg38/seq/hg38.fa -i /dev/stdin -o /dev/stdout | \
    bamtofastq collate=1 T=samplex-sort-umis_corrected-cumi-1-bamtofastq-tmp \
    F=samplex-sort-umis_corrected-cumi-1.fq.gz F2=samplex-sort-umis_corrected-cumi-2.fq.gz tags=cD,cM,cE gz=1
    
  • step 6: align consensus reads:

    unset JAVA_HOME && bwa mem  -C -c 250 -M -t 16  -R '@RG\tID:samplex\tPL:illumina\tPU:samplex\tSM:samplex' \
    -v 1 /projects/ngs/reference/genomes/Hsapiens/hg38/bwa/hg38.fa \
    samplex-sort-umis_corrected-cumi-1.fq.gz samplex-sort-umis_corrected-cumi-2.fq.gz | \
    samtools sort -@ 16 -m 1G -T samplex-sort-cumi-sorttmp -o samplex-sort-cumi.bam /dev/stdin
    samtools index -@ 16 samplex-sort-cumi.bam samplex-sort-cumi.bam.bai
    
  • step 7: clean variant_regions bed file:

    cat target.bed | grep -v ^track | grep -v ^browser | grep -v ^@ | grep -v ^# | \
    bcbio_python -c 'from bcbio.variation import bedutils; bedutils.remove_bad()' | \
    sort -V -T . -k1,1 -k2,2n > cleaned-target.bed
    cat cleaned-target.bed | bgzip --threads 16 -c > cleaned-target.bed.gz
    tabix -f -p bed cleaned-target.bed.gz
    bedtools merge  -i cleaned-target.bed> cleaned-target-merged.bed
    cat cleaned-target-merged.bed  | bgzip --threads 16 -c > cleaned-target-merged.bed.gz
    
  • step 8: clean coverage bed file (the same in our example):

    cat target.bed | grep -v ^track | grep -v ^browser | grep -v ^@ | grep -v ^# | \
    iconv -c -f utf-8 -t ascii | sed 's/ //g' | \
    bcbio_python -c 'from bcbio.variation import bedutils; bedutils.remove_bad()' | \
    sort -V -T . -k1,1 -k2,2n > cov-target.bed
    cat cov-target.bed  | bgzip --threads 16 -c > cov-target.bed.gz
    tabix -f -p bed cov-target.bed.gz
    bedtools merge -i cov-target.bed > cov-target-merged.bed
    cat cov-target-merged.bed | bgzip --threads 16 -c > cov-target-merged.bed.gz
    
  • step 9: clean sv regions bed file:

    cat cleaned-target.bed | grep -v ^track | grep -v ^browser | grep -v ^@ | grep -v ^# |  \
    /home/kmhr378/local/bin/bcbio_python -c 'from bcbio.variation import bedutils; bedutils.remove_bad()' | \
    sort -V -T . -k1,1 -k2,2n > \
    svregions-cleaned-target.bed
    cat svregions-cleaned-target.bed | bgzip --threads 16 -c > svregions-cleaned-target.bed.gz
    
  • step10: calculate coverage for 3 bed files with MOSDEPTH:

    export MOSDEPTH_Q0=NO_COVERAGE && export MOSDEPTH_Q1=LOW_COVERAGE && \
    export MOSDEPTH_Q2=CALLABLE && \
    mosdepth -t 16 -F 1804 -Q 1 --no-per-base --by cleaned-target.bed \
    --quantize 0:1:4: samplex-variant_regions samplex-sort-cumi.bam
    mosdepth -t 16 -F 1804  --no-per-base --by svregions-cleaned-target.bed  \
    samplex-sv_regions samplex-sort-cumi.bam
    mosdepth -t 16 -F 1804  --no-per-base --by cov-target.bed samplex-coverage \
    samplex-sort-cumi.bam \
    -T 1,5,10,20,50,100,200,500,1000,2000,5000,10000,20000,50000,100000,200000,500000
    
  • step11: hts_nim counts:

    hts_nim_tools count-reads -t 16 -F 1804 /path/samplex/counts/fullgenome.bed samplex-sort-cumi.bam > fullgenome-1804-counts.txt
    hts_nim_tools count-reads -t 16 -F 1804 cleaned-target.bed samplex-sort-cumi.bam > cleaned-target-merged-1804-counts.txt
    
  • step12: samtools read statistics::

    samtools stats -@ 16 samplex-sort-cumi.bam > samplex.txt samtools idxstats samplex-sort-cumi.bam > samplex-idxstats.txt

  • step13: variant calling with vardict (repeated for each alignment chunk):

    unset R_HOME && unset R_LIBS && unset JAVA_HOME && \
    export VAR_DICT_OPTS='-Xms750m -Xmx3500m -XX:+UseSerialGC -Djava.io.tmpdir=.' && \
    vardict-java -G /path/reference/genomes/Hsapiens/hg38/seq/hg38.fa \
    -N samplex -b samplex-sort-cumi.bam -c 1 -S 2 -E 3 -g 4 --nosv --deldupvar -Q 10 -F 0x700 \
    samplex-chr5_0_x-unmerged-regions-regionlimit.bed -f 0.0025 | \
    teststrandbias.R | \
    var2vcf_valid.pl -A -N samplex -E -f 0.0025 | grep -v ^##contig | \
    bcftools annotate -h samplex-chr5_0_x-contig_header.txt | \
    bcftools filter -i 'QUAL >= 0' | \
    bcftools filter --soft-filter 'LowFreqBias' --mode '+' -e  'FORMAT/AF[0:0] < 0.02 && \
    FORMAT/VD[0] < 30 && INFO/SBF < 0.1 && INFO/NM >= 2.0' | \
    awk -F$'\t' -v OFS='\t' '{if ($0 !~ /^#/) gsub(/[KMRYSWBVHDXkmryswbvhdx]/, "N", $4) } {print}' | \
    awk -F$'\t' -v OFS='\t' '{if ($0 !~ /^#/) gsub(/[KMRYSWBVHDXkmryswbvhdx]/, "N", $5) } {print}' |  \
    awk -F$'\t' -v OFS='\t' '$1!~/^#/ && $4 == $5 {next} {print}' | \
    vcfstreamsort | bgzip -c > samplex-chr5_0_x.vcf.gz
    zgrep ^# samplex-chr5_0_x.vcf.gz > samplex-chr5_0_x-fixheader-header.vcf
    unset JAVA_HOME && \
    picard FixVcfHeader HEADER=samplex-chr5_0_x-fixheader-header.vcf \
    INPUT=samplex-chr5_0_x.vcf.gz \
    OUTPUT=samplex-chr5_0_x-fixheader.vcf.gz
    
  • step14: gather vcfs:

    unset JAVA_HOME && \
    gatk --java-options '-Xms681m -Xmx3181m -XX:+UseSerialGC -Djava.io.tmpdir=.\
    GatherVcfs -I samplex-files.list -O samplex.vcf.gz
    
  • step15: annotate with snpEff:

    unset JAVA_HOME && \
    snpEff -Xms750m -Xmx29g -Djava.io.tmpdir=. eff \
    -dataDir /path/reference/genomes/Hsapiens/hg38/snpeff \
    -hgvs -cancer -noLog -i vcf -o vcf -csvStats samplex-effects-stats.csv \
    -s samplex-effects-stats.html GRCh38.86 samplex.vcf.gz | \
    bgzip --threads 16 -c > samplex-effects.vcf.gz
    
  • step16: annotate with vcfanno:

    vcfanno -p 16 dbsnp.conf samplex-effects.vcf.gz | \
    bcftools reheader -h samplex-effects-annotated-sample_header.txt | \
    bcftools view | bgzip -c > samplex-effects-annotated.vcf.gz
    tabix -f -p vcf samplex-effects-annotated.vcf.gz
    vcfanno -p 16 samplex-effects-annotated-annotated-somatic-combine.conf \
    samplex-effects-annotated.vcf.gz | bgzip -c > samplex-effects-annotated-annotated-somatic.vcf.gz
    tabix -f -p vcf samplex-effects-annotated-annotated-somatic.vcf.gz
    cat samplex-effects-annotated-annotated-somatic-priority.tsv  | bgzip --threads 16 -c > \
    samplex-effects-annotated-annotated-somatic-priority.tsv.gz
    tabix -f -0 -c '#' -s 1 -b 2 -e 3 samplex-effects-annotated-annotated-somatic-priority.tsv.gz
    

scRNA-seq: Single cell RNA-seq barcode counting pipeline step by step

Minor steps and exact file locations are omitted. * = repeated for every sample in sample_barcodes.csv

bcbio.yaml config:

details:
- algorithm:
    cellular_barcode_correction: 1
    minimum_barcode_depth: 1000
    sample_barcodes: /path/project/config/barcodes.csv
    transcriptome_fasta: /path/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.fa
    transcriptome_gtf: /path/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.gtf
    umi_type: harvard-indrop-v3
  analysis: scRNA-seq
  description: project
  files:
  - /path/project_1.fq.gz
  - /path/project_2.fq.gz
  - /path/project_3.fq.gz
  - /path/project_4.fq.gz
  genome_build: mm10
fc_name: sc-mouse
upload:
  dir: /path/project/final
  • step1: parse barcode information from reads 2,3,4 to the fastq read name, CELL_[value]:UMI_[value]:sample_[value]. umis supports many protocols, however, the downside is speed - this step can take up to 3-4 days:

    python umis fastqtransform \
    --separate_cb umis/harvard-indrop-v3-transform.json --cores 16  \
    project_1.fq.gz project_2.fq.gz project_3.fq.gz project_4.fq.gz | \
    seqtk seq -L 20 - | gzip > project.umitransformed.fq.gz
    
  • step2: create a fastq file for each sample:

    python umis demultiplex_samples \
    --nedit 1 --barcodes sample_barcodes.csv \
    --out_dir demultiplexed project.umitransformed.fq.gz
    
  • step3*:

    python umis cb_filter \
    --cores 16 --bc1 harvard-indrop-v3-cb1.txt.gz --nedit 1 \
    --bc2 harvard-indrop-v3-cb2.txt.gz demultiplexed/[sample-barcodeAATTTTT].fq  | \
    gzip -c > project-sample_barcode.filtered.fq.gz
    
  • step 4*: create cellular barcode histogram, also creates cb-histogram-filtered.txt for cells with nreads > minimum_barcode_depth:

    python umis cb_histogram project-sample_barcode.filtered.fq.gz > cb-histogram.txt
    
  • step 5: create index genome for rapmap:

    rapmap quasiindex -k 31 -i mm10 -t mm10/ref-transcripts.fa
    
  • step 6*. align reads with rapmap:

    rapmap quasimap -t 16 -i mm10 \
    -r <(gzip -cd project-sample_barcode.filtered.fq.gz) | \
    samtools sort -@ 16 -m 1G  -T project-sample_barcode-sorttmp \
    -o project-sample_barcode.bam /dev/stdin
    samtools index -@ 16 project-sample_barcode.bam project-sample_barcode.bam.bai
    
  • step 7*: count transcripts:

    python umis fasttagcount --cb_cutoff 1000 \
    --genemap ref-transcripts-tx2gene.tsv
    --cb_histogram project-sample_barcode/cb-histogram.txt \
    --umi_matrix project-sample_barcode-dupes.mtx.full \
    project-sample_barcode.bam project-sample_barcode.mtx.full
    
    python umis sparse project-sample_barcode.mtx.full \
    project-sample_barcode.mtx
    
    python umis sparse project-sample_barcode-dupes.mtx.full \
    project-sample_barcode-dupes.mtx
    
  • step 8. Concatenate all cb-histogram-filtered.txt files:

    cat project-[all-barcodes]/cb-histogram-filtered.txt > cb-histogram.txt