Flow Logo

Pipelines

Differential Expression Analysis v1.0

This is a pipeline written by Goodwright, you can find the source code and more information about running from the command line here.

Pipeline Summary

Differential expression analysis is a statistical method used to identify differences in gene expression or other molecular features between two or more groups of samples. This approach is critical in identifying molecular biomarkers associated with disease or other biological conditions, as well as understanding the underlying mechanisms of complex biological processes.

This pipeline is designed to process simple count matrices, which are the output of many count-based workflows such as RNA-seq. Count matrices for other applications can also be easily created as a simple sample versus feature count table and used as input. The count matrix input is paired with a sample sheet that describes metadata associated with each sample. The sample sheet can include any number of attributes that group the sample experimentally.

A set of comparisons based on the supplied statistical design, using reference and contrast factors, as well as more complex comparisons involving blocking factors is then performed. The primary output is a results table for each comparison, which shows the statistical significance of each feature. We also provide a rich set of reporting outputs to help users assess the quality of their data and visualise their results.

The pipeline executes the following main stages:

  • Input sample sheet checking and experimental validation
  • Run differential analysis using DESeq2
  • Generate DESeq2 standard plots
  • Run dimenionality reduction on the DESeq2 output using pcaExplorer
  • Run geneset enrichment analysis on each comparison result set

Tool Usage List

  • DESeq2 - Estimate variance-mean dependence in count data from high-throughput sequencing assays and test for differential expression based on a model using the negative binomial distribution.
  • pcaExplorer - Functionality for visualization of RNA-seq datasets based on Principal Components Analysis.
  • genekitr - A collection of R tools designed to aid in performing and visualising gene set enrichment analysis

Input

Count Matrix

Supplied using the --counts parameter. This is a numeric square matrix file, comma or tab-separated, with a column for every sample against a count of features. If you have run nf-core/rnaseq on Flow you can use the tsv outputted by this pipeline, for example "salmon.merged.gene_counts.tsv". Otherwise, you will need to upload this tsv via the "Upload Data" page and choose "Other data".

For example:

gene_idgene_nameSRX8042381SRX8042382SRX8042383
ENSMUSG00000023978Prph21024
ENSMUSG00000030324Rho1811412
ENSMUSG00000031450Grk12031
ENSMUSG00000056055Sag52310
ENSMUSG00000025900Rp11732
ENSMUSG00000026834Acvr1c104366
ENSMUSG00000025386Pde6g781

At a minmimum an id column must be provided which is always defined as the first column. Additional info columns can be included for context which can be utilised by downstream processes such as gene name. Meta data is seperated from sample columns by the sample ids provided in the samplesheet

Samplesheet

Supplied using the --samplesheet parameter. At a minimum the samplesheet must contain a set of sample ids which match what is in the supplied counts file and at least one column to use as a comparison variable. Any number of additional status columns can be included for more complex experimental designs such as in the example below where an experimental condition and batch number are provided.

You will need to upload this csv via the "Upload Data" page and choose "Other data".

sample_idconditionbatch
WT_REP1A1
WT_REP2A2
RAP1_UNINDUCED_REP1B1
RAP1_UNINDUCED_REP2B2

Experimental Design

The experimental design that defines which comparisons and blocking factors are used can be tuned using three different parameters:

  • contrast_column: This is the column in the samplesheet from which comparisons will be generated. By default this is set to condition.
  • comparisons: This defines which comparisons will be run from the contrast column. By default this is set to all, in this configuration an all against all comparison set of all variables in the contrast column will be performed. In designs that have a large number of variables or where some comparisons would be invalid, users can defined their own comparison set using the following notation C_A:B_C. Comparisons must be delimited by : with the reference factor on the left seperated by an underscore with the contrast value on the right.
  • blocking_factors: This variable allows for multi-factor experimental designs. By default this is blank. Blocking factors must be targeted on specific comparisons and therefore is not compatible with the comparison all setting. The rational behind this is that if a user is constructing complex experimental designs then they will always be using specific comparisons. Comparison targets are seperated by commas; within a target users must specify a target comparison followed by the blocking factor syntax using the : seperator. Inside the blocking factor syntax, multiple blocking factors can be defined using a ; separated list. All blocking factors must exist as a column in the input sample sheet. A_B:,A_C:batch,A_D:batch;time In this example we target several comparisons with different blocking factors. The A_B comparison has no blocking factors set and is a valid setting provided by the user for completness; however, you do not need to provide settings for all comparisons in this parameter. See the DESeq2 docs on multi-factor design here for more information.

Output

Pipeline output will default to the path ./results unless modified with the --outdir parameter.

The deseq2 folder contains output data pertaining to the differential analysis comparisons. The folder will always contain the size factors and normalised counts in its root folder as well as a plots folder that contains experiment-level diagnostic plots such as the dispersion and ma plots.

For each comparison an additional sub-folder will be included that contains the primary output results table and any other comparison specific plots such as Volcano plots.

Additional folders are outputted depending on which additional processing modules are active and are clearly named. For example the geneset enrichment analysis module is names gsea.

Pipeline Parameters

  • --study_name: All output from the pipeline will be placed in the directory $outdir/$study_name. This is so several runs with different parameters can be performed without needing to worry about mixing up results. The default path is results/study.

  • --count_sep: Defines the count separator used in the input count files. Defaults to tab \t

Differential Analysis

This pipeline exposes most key DESeq2 parameters via nextflow parameters with the dsq_ prefix. More information on these parameters can be found in the table below, or in the DESeq2 documentation.

ParameterDefaultDescription
dsq_testWaldOptions: Wald, LRT
dsq_fit_typeparametricOptions: parametric, local, mean, glmGamPoi
dsq_min_replicates_for_replace7The minimum number of replicates required in order to use replaceOutliers on a sample
dsq_sf_typeratioOptions: ratio, poscounts, iterate
dsq_lfc_threshold0A non-negative value which specifies a log2 fold change threshold
dsq_alt_hypothesisgreaterAbsOptions: greaterAbs, lessAbs, greater, less
dsq_alpha0.1The significance cutoff used for optimizing the independent filtering
dsq_minmu0.5Lower bound on the estimated count (used when calculating contrasts)
dsq_lfcshrink_typeashrOptions: apeglm, ashr, normal
dsq_p_thresh1p-value cut off for filtering the deseq2 results table for downstream modules

Gene set Enrichment Analysis

The most important parameter for this module is correct setting of --organism, this informs genekitr which organism to pull data for and must be set by finding the correct species name here. If no organism is suitable or users wish to switch this module off then the --skip_gsea parameter can be used. genekitr parameters are exposed via nextflow parameters with the gsea_ prefix. More information on these parameters can be found in the table below, or in the genekitr documentation.

ParameterDefaultDescription
gsea_p_cutoff0.05A numeric of cutoff for both pvalue and adjusted pvalue
gsea_q_cutoff0.05A numeric of cutoff for both qvalue
gsea_ontologymfBiological Processes (BP), Molecular Functions (MF), Cellular Components (CC)
gsea_min_gset_size10Minimal size of each gene set for analysis
gsea_max_gset_size500Max size of each gene set for analysis
gsea_p_adjust_methodBHChoose from "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none"
gsea_pathway_count10How many pathways to show on the plots
gsea_stats_metricp.adjustStats metric for the plots - p.adjust, pvalue, qvalue
gsea_term_metricFoldEnrichTerm metric for the ora plots - FoldEnrich, GeneRatio, Count, RichFactor
gsea_scale_ratio0.25Plot scale ratio
gsea_main_text_size5Plot text size
gsea_legend_text_size8Plot legend size

Authors and contact

This DSL2 Nextflow pipeline is maintained by Goodwright. To raise any issues or comments with the pipeline you can (in order of preference):

Previous
ChIP-Seq