Differential Expression Analysis Pipeline

Overview

The Differential Expression Analysis pipeline (v1.0) is a comprehensive workflow for identifying differences in gene expression or other molecular features between experimental conditions. This pipeline is critical for identifying molecular biomarkers associated with disease or other biological conditions, as well as understanding the underlying mechanisms of complex biological processes.

The pipeline is designed to process count matrices from RNA-seq or other count-based assays, perform statistical analysis using DESeq2, and provide rich visualization and interpretation tools including PCA analysis and gene set enrichment analysis.

Pipeline Summary

The differential expression workflow performs the following steps:

Input Validation - Sample sheet checking and experimental design validation
Data Normalization - DESeq2 normalization and size factor calculation
Differential Analysis - Statistical testing using DESeq2 negative binomial model
Quality Control - Diagnostic plots including dispersion and MA plots
Dimensionality Reduction - PCA analysis using pcaExplorer
Functional Analysis - Gene set enrichment analysis for each comparison
Visualization - Volcano plots, heatmaps, and enrichment plots
Report Generation - Comprehensive results tables and summary reports

Input Requirements

Count Matrix

The count matrix is supplied using the --counts parameter. This should be a numeric matrix file (comma or tab-separated) with samples as columns and features as rows.

Format requirements:

First column must be feature IDs (e.g., gene IDs)
Additional metadata columns allowed (e.g., gene names)
Sample columns must match IDs in the sample sheet

If you have run nf-core/rnaseq on Flow, you can use the gene counts output directly (e.g., salmon.merged.gene_counts.tsv).

Example count matrix:

gene_id	gene_name	SRX8042381	SRX8042382	SRX8042383
ENSMUSG00000023978	Prph2	10	2	4
ENSMUSG00000030324	Rho	181	14	12
ENSMUSG00000031450	Grk1	20	3	1
ENSMUSG00000056055	Sag	52	3	10
ENSMUSG00000025900	Rp1	17	3	2

Sample Sheet Format

The sample sheet is supplied using the --samplesheet parameter. It must contain sample IDs matching the count matrix and at least one comparison variable.

Column	Description	Required
`sample_id`	Unique sample identifier matching count matrix	Yes
`condition`	Primary comparison variable (or custom column)	Yes
Additional columns	Batch effects, covariates, blocking factors	No

Example sample sheet:

sample_id,condition,batch
WT_REP1,control,1
WT_REP2,control,2
TREAT_REP1,treatment,1
TREAT_REP2,treatment,2

Experimental Design Configuration

Configure your experimental design using these parameters:

contrast_column: Column name for comparisons (default: condition)
comparisons: Comparison strategy
- all: All pairwise comparisons (default)
- Custom: A_B:A_C:B_C format (reference_contrast)
blocking_factors: Multi-factor designs
- Format: A_B:batch,A_C:batch;time
- Not compatible with comparisons=all

Key Parameters

Required Parameters

Parameter	Description	Default
`--counts`	Path to count matrix file	Required
`--samplesheet`	Path to sample sheet CSV	Required
`--outdir`	Output directory	./results
`--organism`	Organism for GSEA (see species list)	Required

General Parameters

Parameter	Description	Default
`--study_name`	Study name for output organization	study
`--count_sep`	Count file separator	\t
`--skip_gsea`	Skip gene set enrichment analysis	false

DESeq2 Parameters

Parameter	Description	Default
`--dsq_test`	Statistical test (Wald, LRT)	Wald
`--dsq_fit_type`	Dispersion fit type	parametric
`--dsq_min_replicates_for_replace`	Min replicates for outlier replacement	7
`--dsq_sf_type`	Size factor estimation method	ratio
`--dsq_lfc_threshold`	Log2 fold change threshold	0
`--dsq_alt_hypothesis`	Alternative hypothesis	greaterAbs
`--dsq_alpha`	FDR significance cutoff	0.1
`--dsq_lfcshrink_type`	LFC shrinkage method	ashr
`--dsq_p_thresh`	P-value filter for downstream analysis	1

Gene Set Enrichment Parameters

Parameter	Description	Default
`--gsea_p_cutoff`	P-value cutoff	0.05
`--gsea_q_cutoff`	Q-value cutoff	0.05
`--gsea_ontology`	GO ontology (BP, MF, CC)	MF
`--gsea_min_gset_size`	Minimum gene set size	10
`--gsea_max_gset_size`	Maximum gene set size	500
`--gsea_pathway_count`	Number of pathways to plot	10

Pipeline Outputs

Directory Structure

results/study/
├── deseq2/
│   ├── size_factors.csv
│   ├── normalized_counts.csv
│   ├── plots/
│   │   ├── dispersion_plot.pdf
│   │   └── ma_plots.pdf
│   └── [comparison_name]/
│       ├── results.csv
│       ├── volcano_plot.pdf
│       └── significant_genes.csv
├── pcaexplorer/
│   ├── pca_plots.pdf
│   └── sample_distances.pdf
└── gsea/
    └── [comparison_name]/
        ├── enrichment_results.csv
        └── pathway_plots.pdf

Output Files

DESeq2 Results
- results.csv: Full differential expression results with statistics
- significant_genes.csv: Filtered significant genes
- normalized_counts.csv: Size-factor normalized expression matrix
Visualization
- volcano_plot.pdf: Volcano plots for each comparison
- dispersion_plot.pdf: Mean-dispersion relationship
- ma_plots.pdf: MA plots showing fold changes vs expression
Enrichment Analysis
- enrichment_results.csv: Significant pathways and gene sets
- pathway_plots.pdf: Dot plots and enrichment maps

Example Usage

Basic Analysis

nextflow run goodwright/differential-analysis \
  --counts gene_counts.tsv \
  --samplesheet samples.csv \
  --organism "Homo sapiens" \
  --outdir results \
  -profile docker

Custom Comparisons with Batch Correction

nextflow run goodwright/differential-analysis \
  --counts gene_counts.tsv \
  --samplesheet samples.csv \
  --organism "Mus musculus" \
  --comparisons "control_treatment:control_knockout" \
  --blocking_factors "control_treatment:batch" \
  --outdir results \
  -profile docker

Analysis without GSEA

nextflow run goodwright/differential-analysis \
  --counts gene_counts.tsv \
  --samplesheet samples.csv \
  --skip_gsea \
  --outdir results \
  -profile docker

Tips and Best Practices

Experimental Design

Include at least 3 biological replicates per condition
Balance experimental groups across batches when possible
Record all potential confounding factors in the sample sheet

Data Preparation

Use raw counts, not normalized values (TPM, RPKM)
Remove genes with very low counts across all samples
Verify sample IDs match between count matrix and sample sheet

Parameter Selection

Use LRT test for time-series or multi-group comparisons
Enable outlier replacement only with ≥7 replicates
Choose appropriate organism name for accurate GSEA

Quality Control

Check dispersion plots for good fit
Review PCA plots for batch effects
Verify volcano plots show expected distribution

Troubleshooting

Common Issues

Issue: No differentially expressed genes found

Solution: Check if samples cluster by condition in PCA
Reduce stringency with higher --dsq_alpha value
Verify correct --contrast_column specification

Issue: GSEA fails or returns no results

Solution: Verify organism name matches species list
Check that gene IDs match expected format (Ensembl, Symbol)
Increase --gsea_max_gset_size for broader pathways

Issue: Batch effects dominate results

Solution: Include batch in --blocking_factors
Consider batch correction before analysis
Ensure balanced experimental design

Issue: Memory errors with large datasets

Solution: Increase memory allocation in config
Filter low-count genes more stringently
Process subsets of comparisons separately

Additional Resources

Pipeline Source: GitHub - goodwright/differential-analysis
DESeq2 Documentation: Bioconductor - DESeq2
pcaExplorer: Bioconductor - pcaExplorer
genekitr: genekitr.fun
Support: Goodwright Slack
Contact: charlotte.capitanchik@goodwright.com