scDownstream

Overview

Flow provides the nf-core/scdownstream v1.0.0 pipeline for comprehensive downstream analysis of single-cell RNA sequencing (scRNA-seq) data. It takes filtered count matrices as input and performs cell type annotation, trajectory inference, differential expression, and integrative analysis across multiple samples or conditions.

This pipeline complements upstream processing pipelines (such as Cell Ranger or STARsolo) by focusing on biological interpretation and advanced computational analysis of single-cell data.

Pipeline Summary

The workflow includes:

Data Integration
- Multiple sample integration
- Batch correction (Harmony, Seurat, scVI)
- Dataset merging and normalization
Quality Control
- Cell filtering metrics
- Gene filtering
- Doublet detection
- Ambient RNA removal
Dimensionality Reduction
- PCA analysis
- UMAP/tSNE visualization
- Feature selection
Clustering & Annotation
- Graph-based clustering
- Automated cell type annotation
- Reference-based mapping
- Manual marker validation
Differential Analysis
- Differential expression
- Gene set enrichment
- Pathway analysis
- Cell-cell communication
Trajectory Analysis
- Pseudotime inference
- Lineage reconstruction
- RNA velocity
- Cell fate prediction

Input Requirements

Count Matrices

Filtered feature-barcode matrices
H5AD format (preferred)
H5, MTX, or CSV formats
Multiple samples supported

Metadata Requirements

Sample sheet with experimental design:

sample,path,condition,batch,species
sample1,/path/to/sample1.h5ad,control,batch1,human
sample2,/path/to/sample2.h5ad,control,batch1,human
sample3,/path/to/sample3.h5ad,treated,batch2,human
sample4,/path/to/sample4.h5ad,treated,batch2,human
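
Before launching a run, it can help to check the sample sheet for missing columns, empty fields, and duplicate sample names. The following is a minimal sketch using only the Python standard library; the function name validate_samplesheet is hypothetical, and the required columns are simply those shown in the example above, not a statement of the pipeline's own validation rules.

```python
import csv
import io

# Columns taken from the example sample sheet above (an assumption,
# not the pipeline's official schema).
REQUIRED_COLUMNS = ["sample", "path", "condition", "batch", "species"]


def validate_samplesheet(text):
    """Return a list of problems found in a sample sheet given as CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]

    problems = []
    seen = set()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        for col in REQUIRED_COLUMNS:
            if not (row[col] or "").strip():
                problems.append(f"line {line_no}: empty value in '{col}'")
        name = (row["sample"] or "").strip()
        if name and name in seen:
            problems.append(f"line {line_no}: duplicate sample '{name}'")
        seen.add(name)
    return problems


example = """sample,path,condition,batch,species
sample1,/path/to/sample1.h5ad,control,batch1,human
sample2,/path/to/sample2.h5ad,control,batch1,human
sample3,/path/to/sample3.h5ad,treated,batch2,human
sample4,/path/to/sample4.h5ad,treated,batch2,human
"""
print(validate_samplesheet(example))
```

An empty list means no problems were found. The check does not verify that the files in the path column exist.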

Reference Data

Cell type reference datasets
Gene signatures for annotation
Pathway databases
Ligand-receptor databases

Key Parameters

Input Configuration

--input: Sample sheet path
--matrix_format: Input format (h5ad, h5, mtx)
--genome: Species (human, mouse)
--transcript_type: mRNA or total RNA

Quality Control

--min_cells: Minimum cells per gene
--min_features: Minimum genes per cell
--max_features: Maximum genes per cell
--max_mito: Maximum mitochondrial percentage
--doublet_detection: Method (scrublet, doubletfinder)
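
To illustrate how these thresholds act together, here is a minimal sketch that applies them to precomputed per-cell and per-gene metrics. The helper names, the dictionary keys (n_features, pct_mito), and the default values are all hypothetical and chosen for illustration; this is not the pipeline's implementation, and suitable thresholds depend on the dataset.

```python
def passes_qc(cell, min_features=200, max_features=6000, max_mito=20.0):
    """Keep a cell if its gene count and mitochondrial percentage are in range.

    The defaults are illustrative values, not pipeline defaults.
    """
    return (
        min_features <= cell["n_features"] <= max_features
        and cell["pct_mito"] <= max_mito
    )


def keep_gene(n_cells_detected, min_cells=3):
    """Keep a gene if it is detected in at least min_cells cells."""
    return n_cells_detected >= min_cells


cells = [
    {"barcode": "AAAC", "n_features": 1500, "pct_mito": 4.2},   # passes
    {"barcode": "AAAG", "n_features": 120, "pct_mito": 3.0},    # too few genes
    {"barcode": "AAAT", "n_features": 2500, "pct_mito": 35.0},  # high mito
]
kept = [c["barcode"] for c in cells if passes_qc(c)]
print(kept)
```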

Integration Methods

--integration_method:
- harmony: Fast batch correction
- seurat: CCA/RPCA integration
- scvi: Deep learning integration
- scanvi: Semi-supervised integration
--integration_features: Number of features

Clustering Parameters

--clustering_resolution: Granularity (0.1-2.0; higher values produce more clusters)
--clustering_algorithm: leiden or louvain
--n_neighbors: Number of neighbors for KNN graph construction
--min_dist: UMAP minimum distance

Cell Type Annotation

--annotation_method:
- celltypist: Automated annotation
- singleR: Reference-based
- manual: Marker genes only
--reference_dataset: Built-in or custom
--confidence_threshold: Annotation confidence

Differential Expression

--de_method: wilcox, t-test, MAST, DESeq2
--min_logfc: Minimum log fold change
--min_pct: Minimum percentage of cells expressing a gene
--comparison_groups: Conditions to compare
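
The log fold change and cell percentage thresholds are typically used to pre-filter genes before testing. The sketch below shows one simplified way to compute both quantities for a single gene. It assumes a pseudocount of 1, treats min_pct as a fraction between 0 and 1, and applies it to whichever group expresses the gene more widely; the function names and default values are hypothetical, and the DE methods listed above each define these quantities in their own way.

```python
import math


def pct_expressing(values):
    """Fraction of cells with non-zero expression of the gene."""
    return sum(v > 0 for v in values) / len(values)


def log2_fold_change(group_a, group_b, pseudocount=1.0):
    """log2 ratio of mean expression; the pseudocount avoids division by zero."""
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    return math.log2((mean_a + pseudocount) / (mean_b + pseudocount))


def passes_prefilter(group_a, group_b, min_logfc=0.25, min_pct=0.1):
    """Keep a gene if it changes enough and is expressed in enough cells."""
    enough_change = abs(log2_fold_change(group_a, group_b)) >= min_logfc
    enough_cells = max(pct_expressing(group_a), pct_expressing(group_b)) >= min_pct
    return enough_change and enough_cells


# Expression of one gene in four treated and four control cells (made-up values).
treated = [0, 3, 5, 4]
control = [0, 0, 1, 0]
print(passes_prefilter(treated, control))
```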

Pipeline Outputs

Processed Data

Integrated Object
- integrated.h5ad: Merged, batch-corrected data
- metadata.csv: Complete cell metadata
- features.csv: Selected features
Quality Reports
- QC metrics summary
- Filtering statistics
- Integration diagnostics

Clustering Results

Cell Clusters
- Cluster assignments
- Cluster markers
- Cluster statistics
- Dendrograms
Visualizations
- UMAP/tSNE plots
- Feature plots
- Violin plots
- Dot plots

Cell Type Analysis

Annotations
- Cell type labels
- Confidence scores
- Marker expression
- Reference mapping
Composition
- Cell type proportions
- Condition comparisons
- Statistical tests
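
As a rough illustration of the composition outputs, cell type proportions per condition can be derived from the per-cell metadata by counting labels within each condition. This is a minimal standard-library sketch; the function name and the input layout of (condition, cell type) pairs are assumptions for illustration, not the pipeline's output format.

```python
from collections import Counter, defaultdict


def proportions_by_condition(cells):
    """Map each condition to the fraction of its cells in each cell type.

    cells is an iterable of (condition, cell_type) pairs.
    """
    counts = defaultdict(Counter)
    for condition, cell_type in cells:
        counts[condition][cell_type] += 1
    return {
        condition: {ct: n / sum(c.values()) for ct, n in c.items()}
        for condition, c in counts.items()
    }


# Made-up labels for eight cells.
cells = [
    ("control", "T cell"), ("control", "T cell"),
    ("control", "B cell"), ("control", "B cell"),
    ("treated", "T cell"), ("treated", "T cell"),
    ("treated", "T cell"), ("treated", "B cell"),
]
props = proportions_by_condition(cells)
print(props)
```

Proportions are compositional: an increase in one cell type lowers the proportions of the others, which is worth keeping in mind when comparing conditions.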

Differential Expression

DE Results
- Gene lists per comparison
- Volcano plots
- MA plots
- Heatmaps
Functional Analysis
- GO enrichment
- KEGG pathways
- Gene set scores
- Network analysis

Trajectory Analysis

Pseudotime
- Cell ordering
- Branch points
- Gene dynamics
- Fate probabilities
RNA Velocity
- Velocity vectors
- Stream plots
- Driver genes
- Terminal states

Interactive Reports

HTML Report: Comprehensive analysis summary
CellxGene: Interactive data browser
UCSC Cell Browser: Web visualization

Analysis Workflows

Standard Single-Sample

--integration_method none
--clustering_resolution 0.6
--annotation_method celltypist
--de_method wilcox

Multi-Sample Integration

--integration_method harmony
--batch_key batch
--clustering_resolution 0.8
--comparison_groups condition
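
When the pipeline is run directly with Nextflow, parameters can also be supplied in a JSON or YAML file passed with -params-file instead of as individual command-line flags. The sketch below writes the multi-sample settings above as JSON. The parameter names are copied from this page and should be checked against the parameter schema of the installed pipeline version; the helper name to_params_json is hypothetical.

```python
import json

# Parameter names and values copied from the multi-sample example above;
# verify them against the pipeline's parameter schema before use.
multi_sample_params = {
    "integration_method": "harmony",
    "batch_key": "batch",
    "clustering_resolution": 0.8,
    "comparison_groups": "condition",
}


def to_params_json(params):
    """Serialize a parameter dictionary as indented JSON text."""
    return json.dumps(params, indent=2, sort_keys=True)


print(to_params_json(multi_sample_params))
```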

Disease vs Control

--integration_method scvi
--de_groups "disease,control"
--pathway_analysis true
--cell_communication true

Developmental Analysis

--trajectory_analysis true
--rna_velocity true
--diffusion_maps true
--annotation_method manual

Best Practices

Data Preparation

Use properly filtered matrices
Include all relevant metadata
Plan batch structure carefully
Consider sequencing depth
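
One concrete way to check batch structure is to test whether condition is confounded with batch, meaning no batch contains more than one condition. In that case batch effects and biological effects cannot be separated by any integration method. The following is a minimal sketch; the function names are hypothetical, and the rows reuse the batch and condition values from the example sample sheet.

```python
from collections import defaultdict


def single_condition_batches(rows):
    """Return the sorted names of batches that contain only one condition."""
    conditions_per_batch = defaultdict(set)
    for row in rows:
        conditions_per_batch[row["batch"]].add(row["condition"])
    return sorted(b for b, conds in conditions_per_batch.items() if len(conds) == 1)


def is_confounded(rows):
    """True if there are several conditions but no batch mixes any of them."""
    conditions = {row["condition"] for row in rows}
    batches = {row["batch"] for row in rows}
    return len(conditions) > 1 and len(single_condition_batches(rows)) == len(batches)


# Batch and condition values from the example sample sheet.
rows = [
    {"sample": "sample1", "condition": "control", "batch": "batch1"},
    {"sample": "sample2", "condition": "control", "batch": "batch1"},
    {"sample": "sample3", "condition": "treated", "batch": "batch2"},
    {"sample": "sample4", "condition": "treated", "batch": "batch2"},
]
print(single_condition_batches(rows), is_confounded(rows))
```

With these values the check flags the design, because each batch holds a single condition. A design in which every batch contains both control and treated samples would not be flagged.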

Quality Control

Set QC thresholds based on data
Examine QC plots before filtering
Remove low-quality batches
Document filtering decisions

Integration Strategy

Choose method based on batch effects
Evaluate integration success
Preserve biological variation
Check marker preservation

Biological Interpretation

Validate cell types with markers
Compare multiple DE methods
Focus on robust findings
Consider technical limitations

Troubleshooting

Common Issues

Over-clustering

Reduce clustering resolution
Increase minimum cluster size
Check for technical effects
Merge similar clusters

Poor Integration

Try different methods
Adjust integration parameters
Remove problematic batches
Use more integration features

Annotation Problems

Update reference datasets
Lower confidence threshold
Use multiple methods
Curate annotations manually where needed

Trajectory Artifacts

Verify starting cells
Check expression dynamics
Remove cell cycle effects
Validate with known markers

Advanced Features

Custom References

--annotation_reference /path/to/reference.h5ad
--reference_markers custom_markers.csv
--transfer_labels true

Multi-Modal Integration

--modality_weights "RNA:0.7,ADT:0.3"
--wnn_integration true
--cross_modality true

Spatial Mapping

--spatial_reference spatial_data.h5ad
--mapping_method tangram
--spatial_plots true

Perturbation Analysis

--perturbation_key treatment
--pseudobulk true
--mixscape_analysis true

Output Interpretation

Key Metrics

nCells: Total cells passing QC
nClusters: Number of cell populations
Integration Score: Batch mixing metric
Annotation Accuracy: Agreement with ground-truth labels, if available

Biological Insights

Cell Type Composition: Population changes
Marker Genes: Cluster-defining features
DE Genes: Condition-specific changes
Pathways: Enriched biological processes

Quality Indicators

Silhouette Score: Cluster separation
ARI (Adjusted Rand Index): Clustering stability, measured as agreement between clusterings
LISI (Local Inverse Simpson's Index): Integration quality
Velocity Coherence: Trajectory confidence
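
To make one of these indicators concrete, the ARI (Adjusted Rand Index) compares two cluster assignments of the same cells, for example from two runs with different settings. A value of 1 means identical partitions, and values near 0 mean agreement no better than chance. The sketch below is a standard-library implementation of the usual pair-counting formula; it is meant for illustration and is not taken from the pipeline.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two items")

    # Count pairs of items that share a cluster in both, in A, and in B.
    sum_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / comb(n, 2)   # agreement expected by chance
    max_index = (sum_a + sum_b) / 2         # agreement of identical partitions
    if max_index == expected:
        # Degenerate case, e.g. a single cluster in both labelings.
        return 1.0
    return (sum_both - expected) / (max_index - expected)


# Same partition with swapped label names: perfect agreement.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

Because the index depends only on which cells are grouped together, renaming the clusters does not change the result.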

Additional Resources

Full documentation: nf-core/scdownstream documentation
Pipeline source code: GitHub - nf-core/scdownstream
Scanpy documentation: scanpy.readthedocs.io
Seurat documentation: satijalab.org/seurat
Support: Join the #scdownstream channel on nf-core Slack
Citation: Wolf et al. (2018) doi.org/10.1186/s13059-017-1382-0
