Pipelines
Differential Expression Analysis Pipeline
Overview
The Differential Expression Analysis pipeline (v1.0) is a comprehensive workflow for identifying differences in gene expression or other molecular features between experimental conditions. This pipeline is critical for identifying molecular biomarkers associated with disease or other biological conditions, as well as understanding the underlying mechanisms of complex biological processes.
The pipeline is designed to process count matrices from RNA-seq or other count-based assays, perform statistical analysis using DESeq2, and provide rich visualization and interpretation tools including PCA analysis and gene set enrichment analysis.
Pipeline Summary
The differential expression workflow performs the following steps:
- Input Validation - Sample sheet checking and experimental design validation
- Data Normalization - DESeq2 normalization and size factor calculation
- Differential Analysis - Statistical testing using DESeq2 negative binomial model
- Quality Control - Diagnostic plots including dispersion and MA plots
- Dimensionality Reduction - PCA analysis using pcaExplorer
- Functional Analysis - Gene set enrichment analysis for each comparison
- Visualization - Volcano plots, heatmaps, and enrichment plots
- Report Generation - Comprehensive results tables and summary reports
Input Requirements
Count Matrix
The count matrix is supplied using the --counts
parameter. This should be a numeric matrix file (comma or tab-separated) with samples as columns and features as rows.
Format requirements:
- First column must be feature IDs (e.g., gene IDs)
- Additional metadata columns allowed (e.g., gene names)
- Sample columns must match IDs in the sample sheet
If you have run nf-core/rnaseq on Flow, you can use the gene counts output directly (e.g., salmon.merged.gene_counts.tsv
).
Example count matrix:
gene_id gene_name SRX8042381 SRX8042382 SRX8042383
ENSMUSG00000023978 Prph2 10 2 4
ENSMUSG00000030324 Rho 181 14 12
ENSMUSG00000031450 Grk1 20 3 1
ENSMUSG00000056055 Sag 52 3 10
ENSMUSG00000025900 Rp1 17 3 2
Sample Sheet Format
The sample sheet is supplied using the --samplesheet
parameter. It must contain sample IDs matching the count matrix and at least one comparison variable.
Column | Description | Required |
---|---|---|
sample_id | Unique sample identifier matching count matrix | Yes |
condition | Primary comparison variable (or custom column) | Yes |
Additional columns | Batch effects, covariates, blocking factors | No |
Example sample sheet:
sample_id,condition,batch
WT_REP1,control,1
WT_REP2,control,2
TREAT_REP1,treatment,1
TREAT_REP2,treatment,2
Experimental Design Configuration
Configure your experimental design using these parameters:
contrast_column
: Column name for comparisons (default:condition
)comparisons
: Comparison strategyall
: All pairwise comparisons (default)- Custom:
A_B:A_C:B_C
format (reference_contrast)
blocking_factors
: Multi-factor designs- Format:
A_B:batch,A_C:batch;time
- Not compatible with
comparisons=all
- Format:
Key Parameters
Required Parameters
Parameter | Description | Default |
---|---|---|
--counts | Path to count matrix file | Required |
--samplesheet | Path to sample sheet CSV | Required |
--outdir | Output directory | ./results |
--organism | Organism for GSEA (see species list) | Required |
General Parameters
Parameter | Description | Default |
---|---|---|
--study_name | Study name for output organization | study |
--count_sep | Count file separator | \t |
--skip_gsea | Skip gene set enrichment analysis | false |
DESeq2 Parameters
Parameter | Description | Default |
---|---|---|
--dsq_test | Statistical test (Wald, LRT) | Wald |
--dsq_fit_type | Dispersion fit type | parametric |
--dsq_min_replicates_for_replace | Min replicates for outlier replacement | 7 |
--dsq_sf_type | Size factor estimation method | ratio |
--dsq_lfc_threshold | Log2 fold change threshold | 0 |
--dsq_alt_hypothesis | Alternative hypothesis | greaterAbs |
--dsq_alpha | FDR significance cutoff | 0.1 |
--dsq_lfcshrink_type | LFC shrinkage method | ashr |
--dsq_p_thresh | P-value filter for downstream analysis | 1 |
Gene Set Enrichment Parameters
Parameter | Description | Default |
---|---|---|
--gsea_p_cutoff | P-value cutoff | 0.05 |
--gsea_q_cutoff | Q-value cutoff | 0.05 |
--gsea_ontology | GO ontology (BP, MF, CC) | MF |
--gsea_min_gset_size | Minimum gene set size | 10 |
--gsea_max_gset_size | Maximum gene set size | 500 |
--gsea_pathway_count | Number of pathways to plot | 10 |
Pipeline Outputs
Directory Structure
results/study/
├── deseq2/
│ ├── size_factors.csv
│ ├── normalized_counts.csv
│ ├── plots/
│ │ ├── dispersion_plot.pdf
│ │ └── ma_plots.pdf
│ └── [comparison_name]/
│ ├── results.csv
│ ├── volcano_plot.pdf
│ └── significant_genes.csv
├── pcaexplorer/
│ ├── pca_plots.pdf
│ └── sample_distances.pdf
└── gsea/
└── [comparison_name]/
├── enrichment_results.csv
└── pathway_plots.pdf
Output Files
DESeq2 Results
results.csv
: Full differential expression results with statisticssignificant_genes.csv
: Filtered significant genesnormalized_counts.csv
: Size-factor normalized expression matrix
Visualization
volcano_plot.pdf
: Volcano plots for each comparisondispersion_plot.pdf
: Mean-dispersion relationshipma_plots.pdf
: MA plots showing fold changes vs expression
Enrichment Analysis
enrichment_results.csv
: Significant pathways and gene setspathway_plots.pdf
: Dot plots and enrichment maps
Example Usage
Basic Analysis
nextflow run goodwright/differential-analysis \
--counts gene_counts.tsv \
--samplesheet samples.csv \
--organism "Homo sapiens" \
--outdir results \
-profile docker
Custom Comparisons with Batch Correction
nextflow run goodwright/differential-analysis \
--counts gene_counts.tsv \
--samplesheet samples.csv \
--organism "Mus musculus" \
--comparisons "control_treatment:control_knockout" \
--blocking_factors "control_treatment:batch" \
--outdir results \
-profile docker
Analysis without GSEA
nextflow run goodwright/differential-analysis \
--counts gene_counts.tsv \
--samplesheet samples.csv \
--skip_gsea \
--outdir results \
-profile docker
Tips and Best Practices
Experimental Design
- Include at least 3 biological replicates per condition
- Balance experimental groups across batches when possible
- Record all potential confounding factors in the sample sheet
Data Preparation
- Use raw counts, not normalized values (TPM, RPKM)
- Remove genes with very low counts across all samples
- Verify sample IDs match between count matrix and sample sheet
Parameter Selection
- Use
LRT
test for time-series or multi-group comparisons - Enable outlier replacement only with ≥7 replicates
- Choose appropriate organism name for accurate GSEA
Quality Control
- Check dispersion plots for good fit
- Review PCA plots for batch effects
- Verify volcano plots show expected distribution
Troubleshooting
Common Issues
Issue: No differentially expressed genes found
- Solution: Check if samples cluster by condition in PCA
- Reduce stringency with higher
--dsq_alpha
value - Verify correct
--contrast_column
specification
Issue: GSEA fails or returns no results
- Solution: Verify organism name matches species list
- Check that gene IDs match expected format (Ensembl, Symbol)
- Increase
--gsea_max_gset_size
for broader pathways
Issue: Batch effects dominate results
- Solution: Include batch in
--blocking_factors
- Consider batch correction before analysis
- Ensure balanced experimental design
Issue: Memory errors with large datasets
- Solution: Increase memory allocation in config
- Filter low-count genes more stringently
- Process subsets of comparisons separately
Additional Resources
- Pipeline Source: GitHub - goodwright/differential-analysis
- DESeq2 Documentation: Bioconductor - DESeq2
- pcaExplorer: Bioconductor - pcaExplorer
- genekitr: genekitr.fun
- Support: Goodwright Slack
- Contact: charlotte.capitanchik@goodwright.com