bcbio-nextgen runs in a temporary work directory containing a number of processing intermediates. On pipeline completion, the final useful output files are extracted into a separate directory specified by the upload configuration. This configuration allows upload to local directories, Galaxy, or Amazon S3. Once you have extracted and confirmed the output files, you can delete the temporary directory to save space.
The output directory contains sample-specific output files labeled by sample name, plus a more general project directory. The sample directories contain all of the sample-specific output files, while the project directory contains global files like project summaries or batched population-level variant calls. See the Teaching documentation for a full variant calling example with additional details about configuration settings and the resulting output files.
project-summary.yaml – Top-level YAML-format summary file with statistics on read alignments and duplications, as well as analysis-specific metrics.
programs.txt – Program versions for bcbio-nextgen and the software run in the pipeline. This enables reproduction of analyses.
multiqc – MultiQC report gathering all QC metrics from the different tools (such as cutadapt, featureCounts, samtools, and STAR) into a single HTML report.
metadata.csv – CSV with the metadata from the YAML configuration file.
data_versions.csv – Data versions for bcbio-nextgen and the software run in the pipeline.
SAMPLE/qc – Directory of quality control runs for the sample. These include charts and metrics for assessing the quality of sequencing and analysis.
SAMPLE-ready.bam – A prepared BAM file of the aligned reads. Depending on the analysis used, this may include trimmed, recalibrated, and realigned reads following alignment.
grading-summary.csv – Grading details comparing each sample to a reference set of calls. This will only contain information when a validation callset is provided.
BATCH-caller.vcf – Variants called for a population/batch of samples by a particular caller.
BATCH-caller.db – A GEMINI database associating variant calls with a wide variety of third-party annotations. This provides a queryable framework for assessing variant quality statistics.
SAMPLE-caller.vcf – Variant calls for an individual sample.
SAMPLE-gdc-viral-completeness.txt – Optional viral contamination estimates. The file has the format: depth, 1x, 5x, 25x. depth is the number of reads aligning to the virus; 1x, 5x, and 25x are the percentages of the viral sequence covered by reads at 1x, 5x, and 25x depth. Real viral contamination will have broad coverage across the entire viral genome, so these percentages will be high (depending on sequencing depth). High depth combined with low viral sequence coverage indicates a likely false positive.
Workflow for analysis
For gene-level analyses, we recommend loading the gene-level counts.csv.gz and the metadata.csv.gz files and using DESeq2 to do the analysis. For a more in-depth walkthrough of how to use DESeq2, refer to our DGE_workshop.
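A minimal sketch of that starting point, assuming the counts and metadata files sit in the current directory and that metadata.csv.gz contains a `condition` column (a placeholder; adjust the design formula to your experiment):

```r
library(DESeq2)

# Load the gene-level count matrix and sample metadata from the project directory
counts <- read.csv("counts.csv.gz", row.names = 1, check.names = FALSE)
metadata <- read.csv("metadata.csv.gz", row.names = 1)

# Align metadata rows to count matrix columns, then build the DESeq2 dataset;
# `condition` is an assumed column name, not guaranteed by bcbio
dds <- DESeqDataseq <- DESeqDataSetFromMatrix(
  countData = round(as.matrix(counts)),
  colData = metadata[colnames(counts), , drop = FALSE],
  design = ~ condition)
dds <- DESeq(dds)
res <- results(dds)
```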
For transcript-level analyses, we recommend using sleuth with the bootstrap samples. You can load the abundance.h5 files from Salmon, or, if you set kallisto as an expression caller, use the abundance.h5 files from that.
Another great alternative is to use the Salmon quantification to look at differential transcript usage (DTU) instead of differential transcript expression (DTE). The idea behind DTU is to look for genes that switch from one predominant isoform to another between conditions. The Swimming downstream tutorial has a nice walkthrough of how to do that.
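As a starting point for the DTU route, the Salmon results can be imported at the transcript level with tximport; this is a sketch in the spirit of the Swimming downstream tutorial, and the `samples` vector and quant.sf paths are assumptions you will need to adapt to your project layout:

```r
library(tximport)

# `samples` is an assumed vector of sample names matching your final/ directory
files <- file.path("final", samples, "salmon", "quant.sf")
names(files) <- samples

# txOut = TRUE keeps transcript-level counts (no gene aggregation);
# scaledTPM counts are what DTU workflows like DRIMSeq/DEXSeq expect
txi <- tximport(files, type = "salmon", txOut = TRUE,
                countsFromAbundance = "scaledTPM")
# txi$counts can then feed the DTU analysis described in the tutorial
```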
single cell RNA-Seq
tagcounts.mtx – count matrix compatible with the dgCMatrix type in R.
tagcounts-dupes.mtx – count matrix compatible with the dgCMatrix type in R, but with duplicated reads counted.
tagcounts.mtx.colnames – cell names forming the columns of the matrix.
tagcounts.mtx.rownames – gene names forming the rows of the matrix.
tagcounts.mtx.metadata – metadata matching the colnames of the matrix, combining the barcode.csv file and the metadata given in the YAML config file.
cb-histogram.txt – total number of deduplicated reads assigned to each cell. Comparing colSums(tagcounts.mtx) to these numbers tells you how many reads mapped to genes.
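That comparison can be sketched as below; this assumes cb-histogram.txt is two whitespace-separated columns (barcode, read count), which you should verify against your own output:

```r
library(Matrix)

# Load the sparse count matrix and its cell barcodes
counts <- readMM("tagcounts.mtx")
colnames(counts) <- readLines("tagcounts.mtx.colnames")

# Assumed format: barcode and total deduplicated reads per cell
hist <- read.table("cb-histogram.txt", col.names = c("barcode", "reads"))

# Fraction of each cell's reads that were assigned to genes
mapped <- colSums(counts)
frac_mapped <- mapped / hist$reads[match(colnames(counts), hist$barcode)]
summary(frac_mapped)
```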
To create a Seurat object:
mkdir data
cd data
result_dir=bcbio_project/final/project_dir
cp $result_dir/tagcounts.mtx matrix.mtx
cp $result_dir/tagcounts.mtx.colnames barcodes.tsv
cp $result_dir/tagcounts.mtx.rownames features.tsv
for f in *; do gzip $f; done
cd ..
library(Seurat)
counts <- Read10X(data.dir = "data", gene.column = 1)
seurat_object <- CreateSeuratObject(counts = counts, min.features = 100)
saveRDS(seurat_object, "seurat.bcbio.RDS")
SAMPLE-transcriptome.bam – BAM file aligned to the transcriptome.
SAMPLE-mtx.* – gene counts, as explained in the project directory.
counts_mirna.tsv – miRBase miRNA count matrix.
counts.tsv – miRBase isomiR count matrix. The ID is made up of five tags: miRNA name, SNPs, additions, trimming at the 5' end, and trimming at the 3' end. A detailed explanation of the naming is available in the isomiRs documentation.
counts_mirna_novel.tsv – miRDeep2 miRNA count matrix.
counts_novel.tsv – miRDeep2 isomiR count matrix. See the counts.tsv explanation for more detail.
seqcluster – output of the seqcluster tool. Inside this folder, counts.tsv is a count matrix for all small RNA clusters found over the genome.
seqclusterViz – input file for the interactive browser at https://github.com/lpantano/seqclusterViz
report – Rmd template to help with downstream analyses such as QC metrics, differential expression, and clustering.
Below is an example sample directory for a sample called rep1. For each sample and each peak caller there are four sets of peak files: one set each for the nucleosome-free (NF), mononucleosome (MN), dinucleosome (DN), and trinucleosome (TN) regions. There are BAM files for each of those regions as well.
This section collects useful scripts and tools to do downstream analysis of bcbio-nextgen outputs. If you have pointers to useful tools, please add them to the documentation.