Variant calling using bulk RNA-seq data

2020-05-12: works for validation samples, fails for some random samples due to pybedtools issue. To avoid HaplotypeCallerSpark issue, update to the latest development version.

Workflow

This workflow demonstrates how to call variants with GATK3.8 using bulk RNA-seq data of GM12878 cell line (blood).

At least 3 RNA-seq datasets exist for NA12878:

  • SRR307897 is of bad quality - don’t use it;

  • SRR307898 is a bit old (Illumina GAII) but it was used in Piskol2013 article, which is reliable work on RNA-seq variant calling validation;

  • SRR5665260, NextSeq-500.

GATK3.8 requires additional installation step.

1. Project structure

mkdir NA12878_RNA-seq_validation
cd NA12878_RNA-seq_validation
mkdir config input final work

2. Download input data (~6G total)

cd input
wget -c -O NA12878_1.fq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR307/SRR307898/SRR307898_1.fastq.gz
wget -c -O NA12878_2.fq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR307/SRR307898/SRR307898_2.fastq.gz

or (if ebi is not online)

wget https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-1/SRR307898/SRR307898.3
fastq-dump --gzip --split-files SRR307898.3

3. Config file input/NA12878.yaml

Note the use of batch even for a single sample.

details:
- algorithm:
    aligner: star
    strandedness: unstranded
    variantcaller: gatk-haplotype
    tools_off:
    - gatk4
  analysis: RNA-seq
  description: NA12878_SRR307898
  files:
  - /path/NA12878/input/NA12878_SRR307898_1.fq.gz
  - /path/NA12878/input/NA12878_SRR307898_2.fq.gz
  genome_build: hg38
  metadata:
    batch: NA12878
fc_name: NA12878
resources:
  default:
    cores: 4
    jvm_opts:
    - -Xms750m
    - -Xmx7000m
    memory: 15G
upload:
  dir: ../final

4. Run bcbio - assuming 60G/4cores machine or cluster node.

cd work
bcbio_nextgen.py ../config/NA12878.yaml -n 4

Parameters

  • variantcaller: gatk-haplotype or vardict. You can use just one variant caller for RNA-seq data in a bcbio project. If you want calls from two callers, run a separate project or edit variantcaller parameter and re-run.

  • tools_off: [gatk4]: when set, runs gatk3.8, which gives better precision.

  • batch is required:

    metadata:
      batch: NA12878
    

Validation

How to validate calls from bcbio

Use high quality variants (PASS filters, depth>=10 reads), remove potential RNA editing events.

bcftools view -f PASS -e "INFO/DP<10 | INFO/possible_rnaedit==1" NA12878-gatk-haplotype-annotated.vcf.gz |  bgzip -c > NA12878.pass.vcf.gz
tabix NA12878.pass.vcf.gz
wget https://raw.githubusercontent.com/naumenko-sa/cre/master/data/intersect.hg38.bed
wget https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/config/examples/NA12878.validate.sh
NA12878.validate.sh \
NA12878.pass.vcf.gz \
/path/bcbio/genomes/Hsapiens/hg38/validation/giab-NA12878/truth_small_variants.vcf.gz \
intersect.hg38.bed \
/path/bcbio/genomes/Hsapiens/hg38/rtg/hg38.sdf

Results will be in NA12878.pass.vcf.gz.stat:

snp tp-baseline 5720
indels tp-baseline 184
snp fp 1205
indels fp 2141
snp fn 29323
indels fn 2455

Validation results, 2016-2019, GRCh37, SRR307898

date

type

bcbio

gatk

TP

FP

FN

FDR

FNR

Target

Total called

2016-12-08

SNP

1.0.0

3.6

7,892

266

30,851

3%

80%

38,743

8,158

2016-12-08

INDEL

1.0.0

3.6

256

117

2,788

31%

92%

3,044

373

2018-06-06

SNP

1.0.9

4.0.1.2

6,817

2,575

31,926

27%

82%

38,743

9,392

2018-06-06

INDEL

1.0.9

4.0.1.2

228

24,448

2,816

99%

93%

3,044

24,676

2018-06-14

SNP

1.0.9

3.8

7,936

315

30,807

4%

80%

38,743

8,251

2018-06-14

INDEL

1.0.9

3.8

306

280

2,738

48%

90%

3,044

586

2019-03-01

SNP

1.1.3

4.1.0.0

6,380

842

32,363

12%

84%

38,743

7,222

2019-03-01

INDEL

1.1.3

4.1.0.0

374

3,131

2,670

89%

88%

3,044

3,505

2019-03-01

SNP

1.1.3

3.8

6,244

558

32,499

8%

84%

38,743

6,802

2019-03-01

INDEL

1.1.3

3.8

192

1,951

2,852

91%

94%

3,044

2,143

Validation results, 2020, hg38

date

data

type

bcbio

caller

TP

FP

FN

FDR

FNR

Target

Total called

2020-05-12

SRR307898

SNP

1.2.3

gatk,4.1.6.0

5,849

1,570

29,194

21%

83%

35,043

7,419

2020-05-12

SRR307898

INDEL

1.2.3

gatk,4.1.6.0

181

2907

2458

94%

93%

2,639

3,088

2020-05-12

SRR307898

SNP

1.2.3

vardict-java,1.7.0

6,333

916

28,710

13%

82%

35,043

7,249

2020-05-12

SRR307898

INDEL

1.2.3

vardict-java,1.7.0

166

580

2,473

78%

94%

2,639

746

2020-05-13

SRR307898

SNP

1.2.3

gatk3.8-1.0

5720

1205

29,323

17%

83%

35043

6925

2020-05-13

SRR307898

INDEL

1.2.3

gatk3.8-1.0

184

2,141

2,455

92%

93%

2,639

2325

Conclusions

  • It is not surprising to see high False Negative rate FN = FN/(FN+TP) in RNA-seq variant calling, i.e. that we are not calling 83% of variants. We validate against intersect.hg38.bed, which is based on exome capture regions and is representing all protein coding genes. Only a minor fraction of genes are expressed in blood, only these genes have read coverage that makes variant calling possible.

  • Indel calling is not reliable with RNA-seq data.

  • Current recommendation is to use gatk3.8 > vardict > gatk4 for better SNP calling precision in bcbio.

  • Annotation of RNA-editing sites and removal of variants around splice junctions improved precision in bcbio, but still more work is needed to achieve better precision at the level of Piskol2013 (<1% FDR).

TODO

  • update gatk3.8 validation

  • validate with SRR5665260

  • integrate validation into bcbio

  • improve post GATK4 filters for better precision

  • paired DNA/RNA-seq variant calling

  • somatic variant calling with RNA-seq, see discusion

  • track pybedtools release.