Variant calling using bulk RNA-seq data¶

2020-05-12: works for validation samples, fails for some random samples due to pybedtools issue. To avoid HaplotypeCallerSpark issue, update to the latest development version.

Workflow¶

This workflow demonstrates how to call variants with GATK3.8 using bulk RNA-seq data of GM12878 cell line (blood).

At least 3 RNA-seq datasets exist for NA12878:

SRR307897 is of bad quality - don’t use it;
SRR307898 is a bit old (Illumina GAII) but it was used in Piskol2013 article, which is reliable work on RNA-seq variant calling validation;
SRR5665260, NextSeq-500.

GATK3.8 requires additional installation step.

1. Project structure¶

mkdir NA12878_RNA-seq_validation
cd NA12878_RNA-seq_validation
mkdir config input final work

2. Download input data (~6G total)¶

cd input
wget -c -O NA12878_1.fq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR307/SRR307898/SRR307898_1.fastq.gz
wget -c -O NA12878_2.fq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR307/SRR307898/SRR307898_2.fastq.gz

or (if ebi is not online)

wget https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-1/SRR307898/SRR307898.3
fastq-dump --gzip --split-files SRR307898.3

3. Config file `input/NA12878.yaml`¶

Note the use of batch even for a single sample.

details:
- algorithm:
    aligner: star
    strandedness: unstranded
    variantcaller: gatk-haplotype
    tools_off:
    - gatk4
  analysis: RNA-seq
  description: NA12878_SRR307898
  files:
  - /path/NA12878/input/NA12878_SRR307898_1.fq.gz
  - /path/NA12878/input/NA12878_SRR307898_2.fq.gz
  genome_build: hg38
  metadata:
    batch: NA12878
fc_name: NA12878
resources:
  default:
    cores: 4
    jvm_opts:
    - -Xms750m
    - -Xmx7000m
    memory: 15G
upload:
  dir: ../final

4. Run bcbio - assuming 60G/4cores machine or cluster node.¶

cd work
bcbio_nextgen.py ../config/NA12878.yaml -n 4

Parameters¶

variantcaller: gatk-haplotype or vardict. You can use just one variant caller for RNA-seq data in a bcbio project. If you want calls from two callers, run a separate project or edit variantcaller parameter and re-run.
tools_off: [gatk4]: when set, runs gatk3.8, which gives better precision.
batch is required:
```
metadata:
  batch: NA12878
```

Validation¶

How to validate calls from bcbio¶

Use high quality variants (PASS filters, depth>=10 reads), remove potential RNA editing events.

bcftools view -f PASS -e "INFO/DP<10 | INFO/possible_rnaedit==1" NA12878-gatk-haplotype-annotated.vcf.gz |  bgzip -c > NA12878.pass.vcf.gz
tabix NA12878.pass.vcf.gz
wget https://raw.githubusercontent.com/naumenko-sa/cre/master/data/intersect.hg38.bed
wget https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/config/examples/NA12878.validate.sh
NA12878.validate.sh \
NA12878.pass.vcf.gz \
/path/bcbio/genomes/Hsapiens/hg38/validation/giab-NA12878/truth_small_variants.vcf.gz \
intersect.hg38.bed \
/path/bcbio/genomes/Hsapiens/hg38/rtg/hg38.sdf

Results will be in NA12878.pass.vcf.gz.stat:

snp tp-baseline 5720
indels tp-baseline 184
snp fp 1205
indels fp 2141
snp fn 29323
indels fn 2455

Validation results, 2016-2019, GRCh37, SRR307898¶

date	type	bcbio	gatk	TP	FP	FN	FDR	FNR	Target	Total called
2016-12-08	SNP	1.0.0	3.6	7,892	266	30,851	3%	80%	38,743	8,158
2016-12-08	INDEL	1.0.0	3.6	256	117	2,788	31%	92%	3,044	373
2018-06-06	SNP	1.0.9	4.0.1.2	6,817	2,575	31,926	27%	82%	38,743	9,392
2018-06-06	INDEL	1.0.9	4.0.1.2	228	24,448	2,816	99%	93%	3,044	24,676
2018-06-14	SNP	1.0.9	3.8	7,936	315	30,807	4%	80%	38,743	8,251
2018-06-14	INDEL	1.0.9	3.8	306	280	2,738	48%	90%	3,044	586
2019-03-01	SNP	1.1.3	4.1.0.0	6,380	842	32,363	12%	84%	38,743	7,222
2019-03-01	INDEL	1.1.3	4.1.0.0	374	3,131	2,670	89%	88%	3,044	3,505
2019-03-01	SNP	1.1.3	3.8	6,244	558	32,499	8%	84%	38,743	6,802
2019-03-01	INDEL	1.1.3	3.8	192	1,951	2,852	91%	94%	3,044	2,143

Validation results, 2020, hg38¶

date	data	type	bcbio	caller	TP	FP	FN	FDR	FNR	Target	Total called
2020-05-12	SRR307898	SNP	1.2.3	gatk,4.1.6.0	5,849	1,570	29,194	21%	83%	35,043	7,419
2020-05-12	SRR307898	INDEL	1.2.3	gatk,4.1.6.0	181	2907	2458	94%	93%	2,639	3,088
2020-05-12	SRR307898	SNP	1.2.3	vardict-java,1.7.0	6,333	916	28,710	13%	82%	35,043	7,249
2020-05-12	SRR307898	INDEL	1.2.3	vardict-java,1.7.0	166	580	2,473	78%	94%	2,639	746
2020-05-13	SRR307898	SNP	1.2.3	gatk3.8-1.0	5720	1205	29,323	17%	83%	35043	6925
2020-05-13	SRR307898	INDEL	1.2.3	gatk3.8-1.0	184	2,141	2,455	92%	93%	2,639	2325

Conclusions¶

It is not surprising to see high False Negative rate FN = FN/(FN+TP) in RNA-seq variant calling, i.e. that we are not calling 83% of variants. We validate against intersect.hg38.bed, which is based on exome capture regions and is representing all protein coding genes. Only a minor fraction of genes are expressed in blood, only these genes have read coverage that makes variant calling possible.
Indel calling is not reliable with RNA-seq data.
Current recommendation is to use gatk3.8 > vardict > gatk4 for better SNP calling precision in bcbio.
Annotation of RNA-editing sites and removal of variants around splice junctions improved precision in bcbio, but still more work is needed to achieve better precision at the level of Piskol2013 (<1% FDR).

TODO¶

update gatk3.8 validation
validate with SRR5665260
integrate validation into bcbio
improve post GATK4 filters for better precision
paired DNA/RNA-seq variant calling
somatic variant calling with RNA-seq, see discusion
track pybedtools release.