Workflow

Overview¶

RNAseq is one the most common NGS datatypes. It typically comes in two flavors:

total RNAseq
polyA RNAseq

The core of most RNAseq data analysis workflows are mostly identical involving the following steps:

trim fastq reads
align trimmed reads to genome using split aware aligner
count reads-per-gene based on existing gene model
collect QC metrics at various stages

The above steps assume that one already has a pre-built genomic index to align the trimmed fastq reads against. If you do not have such index, the genomic sequence in FASTA format and gene annotations in GTF format are required to generate a index prior to alignment.

The follow graphic depicts the basic workflow that we will attempt to create on AWS:

drawing

The inputs will be read from a s3 bucket and the outputs will be written to the same s3 bucket. Individual dockers will be created for various steps of the pipeline as described below:

Pipeline Step	Tool in Docker	DockerHub Link
Trim	CutAdapt¹	ccrgb_cutadapt_v1.18
Align	STAR²	ccrgb_star_v2.7.10a
BuildIndex	STAR²	ccrgb_star_v2.7.10a
Count	RSEM³	ccrgb_star_v2.7.10a
Report	MultiQC⁴	ccrgb_multiqc_v1.12

Inputs¶

The inputs for the pipeline can be broadly be classified into:

inputs required to create the index for mapping, i.e., genomic FASTA file and its gene annotations file (GTF)
raw FASTQ reads per sample

Fasta¶

This is the genome in FASTA format. We will be using hg38 version of the human genome for our purposes and it can be downloaded from here.

GTF¶

GTF or Gene Transfer Format is a file which includes all the gene annotations or gene models. This includes the information about the location of the genes and their splicing events. We will use the GENCODEs release 38 for this.

FASTQ¶

Paired end (PE) FASTQ files per sample will be provided as input. which dataset to use as test input is TBD

Outputs¶

The pipeline is expected to produce 2 primary outputs:

CountsMatrix¶

This is a tab-delimited file with nsample + 1 number of columns (first column is the gene identifier and it is followed by one column per sample of counts data) and ngenes + 1 number of rows (first row is the header containing sample names and every subsequent row gives counts per gene tab-delimited for each sample).

Report¶

Basic stats are collected to generated a HTML report using MultiQC.

Pipelining frameworks¶

There are 3 or 4 pipelining frameworks which are popular among bioinformatic community, specifically for NGS data analysis. CCBR has successfully used Snakemake⁵ for the past few years to execute reproducible NGS data analysis on the Biowulf HPC cluster. We will be leveraging this extensive experience to build a Snakemake-based pipeline for the above outlined workflow to be run on the AWS cloud specifically using AWS Genomics CLI.

In addition to Snakemake, a Nextflow-based pipeline will also be built to mimic the same tasks achieved by the previously mentioned Snakemake-based workflow. The major reasons for this repetition of efforts are:

AWS Genomic CLI adaptation of Nextflow seems more mature than Snakemake.
Nextflow workflows seemed to be more widely accepted eg. SBG, DNAnexus, etc.
Assess differences in building workflows in Snakemake vs Nextflow. This insight will be valuable in shaping future pipeline-development directions we take.

Snakemake¶

We will be using CCBR's RNAseq repository as reference to build our MVP workflow in Snakemake.

Nextflow¶

We will be using Nfcore's RNAseq pipeline as a reference to build our MVP workflow in Nextflow

References¶

¹ http://dx.doi.org/10.14806/ej.17.1.200

² http://dx.doi.org/10.1093/bioinformatics/bts635

³ https://doi.org/10.1186/1471-2105-12-32

⁴ http://dx.doi.org/10.1093/bioinformatics/btw354

⁵ https://doi.org/10.12688/f1000research.29032.2

Last update: 2022-07-13