Flow Logo

Pipelines

Differential Expression Analysis Pipeline

Overview

The Differential Expression Analysis pipeline (v1.0) is a comprehensive workflow for identifying differences in gene expression or other molecular features between experimental conditions. This pipeline is critical for identifying molecular biomarkers associated with disease or other biological conditions, as well as understanding the underlying mechanisms of complex biological processes.

The pipeline is designed to process count matrices from RNA-seq or other count-based assays, perform statistical analysis using DESeq2, and provide rich visualization and interpretation tools including PCA analysis and gene set enrichment analysis.

Pipeline Summary

The differential expression workflow performs the following steps:

  1. Input Validation - Sample sheet checking and experimental design validation
  2. Data Normalization - DESeq2 normalization and size factor calculation
  3. Differential Analysis - Statistical testing using DESeq2 negative binomial model
  4. Quality Control - Diagnostic plots including dispersion and MA plots
  5. Dimensionality Reduction - PCA analysis using pcaExplorer
  6. Functional Analysis - Gene set enrichment analysis for each comparison
  7. Visualization - Volcano plots, heatmaps, and enrichment plots
  8. Report Generation - Comprehensive results tables and summary reports

Input Requirements

Count Matrix

The count matrix is supplied using the --counts parameter. This should be a numeric matrix file (comma or tab-separated) with samples as columns and features as rows.

Format requirements:

  • First column must be feature IDs (e.g., gene IDs)
  • Additional metadata columns allowed (e.g., gene names)
  • Sample columns must match IDs in the sample sheet

If you have run nf-core/rnaseq on Flow, you can use the gene counts output directly (e.g., salmon.merged.gene_counts.tsv).

Example count matrix:

gene_id	gene_name	SRX8042381	SRX8042382	SRX8042383
ENSMUSG00000023978	Prph2	10	2	4
ENSMUSG00000030324	Rho	181	14	12
ENSMUSG00000031450	Grk1	20	3	1
ENSMUSG00000056055	Sag	52	3	10
ENSMUSG00000025900	Rp1	17	3	2

Sample Sheet Format

The sample sheet is supplied using the --samplesheet parameter. It must contain sample IDs matching the count matrix and at least one comparison variable.

ColumnDescriptionRequired
sample_idUnique sample identifier matching count matrixYes
conditionPrimary comparison variable (or custom column)Yes
Additional columnsBatch effects, covariates, blocking factorsNo

Example sample sheet:

sample_id,condition,batch
WT_REP1,control,1
WT_REP2,control,2
TREAT_REP1,treatment,1
TREAT_REP2,treatment,2

Experimental Design Configuration

Configure your experimental design using these parameters:

  • contrast_column: Column name for comparisons (default: condition)
  • comparisons: Comparison strategy
    • all: All pairwise comparisons (default)
    • Custom: A_B:A_C:B_C format (reference_contrast)
  • blocking_factors: Multi-factor designs
    • Format: A_B:batch,A_C:batch;time
    • Not compatible with comparisons=all

Key Parameters

Required Parameters

ParameterDescriptionDefault
--countsPath to count matrix fileRequired
--samplesheetPath to sample sheet CSVRequired
--outdirOutput directory./results
--organismOrganism for GSEA (see species list)Required

General Parameters

ParameterDescriptionDefault
--study_nameStudy name for output organizationstudy
--count_sepCount file separator\t
--skip_gseaSkip gene set enrichment analysisfalse

DESeq2 Parameters

ParameterDescriptionDefault
--dsq_testStatistical test (Wald, LRT)Wald
--dsq_fit_typeDispersion fit typeparametric
--dsq_min_replicates_for_replaceMin replicates for outlier replacement7
--dsq_sf_typeSize factor estimation methodratio
--dsq_lfc_thresholdLog2 fold change threshold0
--dsq_alt_hypothesisAlternative hypothesisgreaterAbs
--dsq_alphaFDR significance cutoff0.1
--dsq_lfcshrink_typeLFC shrinkage methodashr
--dsq_p_threshP-value filter for downstream analysis1

Gene Set Enrichment Parameters

ParameterDescriptionDefault
--gsea_p_cutoffP-value cutoff0.05
--gsea_q_cutoffQ-value cutoff0.05
--gsea_ontologyGO ontology (BP, MF, CC)MF
--gsea_min_gset_sizeMinimum gene set size10
--gsea_max_gset_sizeMaximum gene set size500
--gsea_pathway_countNumber of pathways to plot10

Pipeline Outputs

Directory Structure

results/study/
├── deseq2/
│   ├── size_factors.csv
│   ├── normalized_counts.csv
│   ├── plots/
│   │   ├── dispersion_plot.pdf
│   │   └── ma_plots.pdf
│   └── [comparison_name]/
│       ├── results.csv
│       ├── volcano_plot.pdf
│       └── significant_genes.csv
├── pcaexplorer/
│   ├── pca_plots.pdf
│   └── sample_distances.pdf
└── gsea/
    └── [comparison_name]/
        ├── enrichment_results.csv
        └── pathway_plots.pdf

Output Files

  • DESeq2 Results

    • results.csv: Full differential expression results with statistics
    • significant_genes.csv: Filtered significant genes
    • normalized_counts.csv: Size-factor normalized expression matrix
  • Visualization

    • volcano_plot.pdf: Volcano plots for each comparison
    • dispersion_plot.pdf: Mean-dispersion relationship
    • ma_plots.pdf: MA plots showing fold changes vs expression
  • Enrichment Analysis

    • enrichment_results.csv: Significant pathways and gene sets
    • pathway_plots.pdf: Dot plots and enrichment maps

Example Usage

Basic Analysis

nextflow run goodwright/differential-analysis \
  --counts gene_counts.tsv \
  --samplesheet samples.csv \
  --organism "Homo sapiens" \
  --outdir results \
  -profile docker

Custom Comparisons with Batch Correction

nextflow run goodwright/differential-analysis \
  --counts gene_counts.tsv \
  --samplesheet samples.csv \
  --organism "Mus musculus" \
  --comparisons "control_treatment:control_knockout" \
  --blocking_factors "control_treatment:batch" \
  --outdir results \
  -profile docker

Analysis without GSEA

nextflow run goodwright/differential-analysis \
  --counts gene_counts.tsv \
  --samplesheet samples.csv \
  --skip_gsea \
  --outdir results \
  -profile docker

Tips and Best Practices

Experimental Design

  • Include at least 3 biological replicates per condition
  • Balance experimental groups across batches when possible
  • Record all potential confounding factors in the sample sheet

Data Preparation

  • Use raw counts, not normalized values (TPM, RPKM)
  • Remove genes with very low counts across all samples
  • Verify sample IDs match between count matrix and sample sheet

Parameter Selection

  • Use LRT test for time-series or multi-group comparisons
  • Enable outlier replacement only with ≥7 replicates
  • Choose appropriate organism name for accurate GSEA

Quality Control

  • Check dispersion plots for good fit
  • Review PCA plots for batch effects
  • Verify volcano plots show expected distribution

Troubleshooting

Common Issues

Issue: No differentially expressed genes found

  • Solution: Check if samples cluster by condition in PCA
  • Reduce stringency with higher --dsq_alpha value
  • Verify correct --contrast_column specification

Issue: GSEA fails or returns no results

  • Solution: Verify organism name matches species list
  • Check that gene IDs match expected format (Ensembl, Symbol)
  • Increase --gsea_max_gset_size for broader pathways

Issue: Batch effects dominate results

  • Solution: Include batch in --blocking_factors
  • Consider batch correction before analysis
  • Ensure balanced experimental design

Issue: Memory errors with large datasets

  • Solution: Increase memory allocation in config
  • Filter low-count genes more stringently
  • Process subsets of comparisons separately

Additional Resources

Previous
CLIP-Seq