RNA-seq Pipeline
Flow provides the nf-core/rnaseq v3.19.0 pipeline, a best-practice analysis pipeline for RNA sequencing data.
Overview
The RNA-seq pipeline performs comprehensive analysis of RNA sequencing data, from raw reads to gene expression quantification and quality control. It supports various experimental designs and library preparations.
Key Features
Flexible Input Support
- Single-end and paired-end sequencing
- Multiple lanes per sample (automatic merging)
- Technical replicate handling
- Automatic strandedness detection
Multiple Analysis Routes
- Alignment with STAR or HISAT2
- Pseudoalignment with Salmon or Kallisto
- Transcript quantification and gene expression
Comprehensive Quality Control
- Read quality assessment (FastQC)
- Alignment quality metrics (RSeQC, Qualimap)
- Sample correlation analysis
- MultiQC integrated report
Advanced Features
- UMI-based deduplication
- Ribosomal RNA removal
- Contamination screening
- 3' digital gene expression support
Pipeline Summary
Preprocessing
- Merge re-sequenced FastQ files
- Read quality control (FastQC)
- UMI extraction (if applicable)
- Adapter and quality trimming (Trim Galore!)
- Ribosomal RNA removal (optional)
Alignment
- Genome alignment (STAR/HISAT2)
- Transcriptome pseudoalignment (Salmon/Kallisto)
- Sort and index alignments
- Mark duplicate reads
Quantification
- Gene-level counting (featureCounts/Salmon)
- Transcript assembly (StringTie)
- TPM/FPKM normalization
- Generate expression matrices
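The TPM and FPKM normalizations listed above follow standard definitions. The sketch below is only an illustration of those formulas, not the pipeline's own code (the pipeline takes these values from its quantification tools). The gene names, counts, and lengths are made up, and effective lengths in base pairs are assumed to be known.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    total_rate = sum(rates.values())
    return {gene: rate / total_rate * 1e6 for gene, rate in rates.items()}


def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million counted fragments."""
    total_fragments = sum(counts.values())
    return {
        gene: counts[gene] * 1e9 / (lengths_bp[gene] * total_fragments)
        for gene in counts
    }


# Hypothetical toy data: three genes in one sample.
counts = {"geneA": 100, "geneB": 200, "geneC": 300}
lengths_bp = {"geneA": 1000, "geneB": 2000, "geneC": 1000}

tpm_values = tpm(counts, lengths_bp)
fpkm_values = fpkm(counts, lengths_bp)

for gene in counts:
    print(f"{gene}\tTPM={tpm_values[gene]:.1f}\tFPKM={fpkm_values[gene]:.1f}")
```

TPM values always sum to one million within a sample, which makes proportions comparable across samples; FPKM values do not share that property.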
Quality Control
- Alignment QC (RSeQC, Qualimap)
- Library complexity (Preseq)
- Sample similarity (DESeq2)
- Comprehensive report (MultiQC)
Input Requirements
Sample Sheet Format
Create a CSV file with your sample information:
sample,fastq_1,fastq_2,strandedness
control_rep1,/path/to/control_rep1_R1.fastq.gz,/path/to/control_rep1_R2.fastq.gz,reverse
control_rep2,/path/to/control_rep2_R1.fastq.gz,/path/to/control_rep2_R2.fastq.gz,reverse
treatment_rep1,/path/to/treatment_rep1_R1.fastq.gz,/path/to/treatment_rep1_R2.fastq.gz,reverse
treatment_rep2,/path/to/treatment_rep2_R1.fastq.gz,/path/to/treatment_rep2_R2.fastq.gz,reverse
Columns:
- sample: Unique sample identifier
- fastq_1: Path to R1 FASTQ file
- fastq_2: Path to R2 FASTQ file (leave empty for single-end)
- strandedness: Library strandedness (forward, reverse, unstranded, or auto)
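Before launching a run, it can help to sanity-check the sample sheet against the column rules above. The following is a minimal sketch of such a check, not the pipeline's own validation. The function name check_samplesheet and the example file names are hypothetical. It also counts rows per sample, on the assumption that rows sharing a sample name are the ones merged as multiple lanes or re-sequenced runs.

```python
import csv
import io
from collections import defaultdict

REQUIRED_COLUMNS = ["sample", "fastq_1", "fastq_2", "strandedness"]
VALID_STRANDEDNESS = {"forward", "reverse", "unstranded", "auto"}


def check_samplesheet(handle):
    """Return (errors, rows_per_sample) for a sample sheet opened as text."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"], {}

    errors = []
    rows_per_sample = defaultdict(int)
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        sample = (row["sample"] or "").strip()
        strandedness = (row["strandedness"] or "").strip()
        if not sample:
            errors.append(f"line {line_no}: empty sample name")
        if not (row["fastq_1"] or "").strip():
            errors.append(f"line {line_no}: fastq_1 is required")
        if strandedness not in VALID_STRANDEDNESS:
            errors.append(f"line {line_no}: invalid strandedness {strandedness!r}")
        rows_per_sample[sample] += 1
    return errors, dict(rows_per_sample)


# Hypothetical sheet: one sample sequenced on two lanes, one single-end
# sample with a deliberately invalid strandedness value.
example = io.StringIO(
    "sample,fastq_1,fastq_2,strandedness\n"
    "control_rep1,c1_L001_R1.fastq.gz,c1_L001_R2.fastq.gz,reverse\n"
    "control_rep1,c1_L002_R1.fastq.gz,c1_L002_R2.fastq.gz,reverse\n"
    "treatment_rep1,t1_R1.fastq.gz,,sideways\n"
)
errors, rows_per_sample = check_samplesheet(example)
print(errors)
print(rows_per_sample)
```

For a real file, pass an open handle instead of the in-memory example, for instance open("samplesheet.csv", newline="").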
Reference Requirements
- Genome FASTA: Reference genome sequence
- GTF/GFF: Gene annotation file
- Optional: Pre-built STAR/HISAT2 indices
Key Parameters
Basic Parameters
- --input: Path to sample sheet CSV
- --outdir: Output directory path
- --genome: Reference genome name (e.g., 'GRCh38')
- --gtf: Path to GTF annotation file
- --fasta: Path to genome FASTA file
Alignment Options
- --aligner: Choose aligner (star_salmon, star_rsem, hisat2)
- --pseudo_aligner: Enable pseudoalignment (salmon or kallisto)
- --skip_alignment: Skip all alignment steps
- --save_unaligned: Save unaligned reads
Quantification Options
- --featurecounts_feature_type: Feature type for counting (default: 'exon')
- --featurecounts_group_type: Grouping for features (default: 'gene_id')
- --salmon_quant_libtype: Salmon library type (auto-detected by default)
Quality Control
- --skip_qc: Skip all QC steps
- --skip_fastqc: Skip FastQC
- --skip_rseqc: Skip RSeQC
- --skip_qualimap: Skip Qualimap
- --skip_preseq: Skip Preseq
- --skip_multiqc: Skip MultiQC
Advanced Options
- --with_umi: Enable UMI-based deduplication
- --umitools_extract_method: UMI extraction method
- --remove_ribo_rna: Remove ribosomal RNA reads
- --save_trimmed: Save trimmed FASTQ files
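When many of these parameters are set at once, they can be kept in a file and passed with Nextflow's -params-file option instead of being typed on the command line. The sketch below assumes that each key is the parameter name without its leading dashes. The file name params.json and all values are illustrative.

```python
import json

# Parameter names come from the lists above; the values are placeholders.
params = {
    "input": "samplesheet.csv",
    "outdir": "results",
    "fasta": "genome.fa",
    "gtf": "genes.gtf",
    "aligner": "star_salmon",
    "save_trimmed": True,  # boolean flags become true/false in JSON
}

params_json = json.dumps(params, indent=2)

# Command that would consume the file once params_json is saved as params.json.
command = [
    "nextflow", "run", "nf-core/rnaseq",
    "-r", "3.19.0",
    "-profile", "docker",
    "-params-file", "params.json",
]

print(params_json)
print(" ".join(command))
```

Keeping the file under version control alongside the sample sheet gives a record of exactly how each run was configured.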
Pipeline Outputs
Expression Matrices
The main gene expression results are in results/star_salmon/:
- salmon.merged.gene_counts.tsv: Raw gene counts matrix
- salmon.merged.gene_tpm.tsv: TPM normalized expression
- salmon.merged.gene_counts.rds: R SummarizedExperiment object
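A quick look at the counts matrix can confirm that every sample produced a sensible number of counts and detected genes. The sketch below assumes a tab-separated layout with gene_id and gene_name annotation columns followed by one column per sample; check the header of your own file, since that layout is an assumption here. The function name summarize_counts and the toy data are hypothetical.

```python
import csv
import io


def summarize_counts(handle, annotation_columns=("gene_id", "gene_name")):
    """Return per-sample total counts and number of genes with count > 0."""
    reader = csv.DictReader(handle, delimiter="\t")
    samples = [col for col in reader.fieldnames if col not in annotation_columns]
    totals = {sample: 0.0 for sample in samples}
    detected = {sample: 0 for sample in samples}
    for row in reader:
        for sample in samples:
            value = float(row[sample])  # estimated counts may be fractional
            totals[sample] += value
            if value > 0:
                detected[sample] += 1
    return totals, detected


# Made-up three-gene matrix standing in for the real file.
toy_matrix = io.StringIO(
    "gene_id\tgene_name\tcontrol_rep1\ttreatment_rep1\n"
    "geneA\tA\t10\t0\n"
    "geneB\tB\t5.5\t20\n"
    "geneC\tC\t0\t3\n"
)
totals, detected = summarize_counts(toy_matrix)
print(totals)
print(detected)
```

For a real run, replace the in-memory example with an open handle to the matrix file in the results directory.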
Quality Control Reports
- multiqc/multiqc_report.html: Comprehensive QC report
- fastqc/: Read quality reports
- rseqc/: RNA-seq specific metrics
- qualimap/: Alignment quality assessment
Alignment Files
- star_salmon/: STAR alignments and Salmon quantification
- star_salmon/bigwig/: Coverage tracks for genome browsers
Additional Outputs
- trimgalore/: Trimmed reads (if --save_trimmed)
- unmapped/: Unaligned reads (if --save_unaligned)
- deseq2_qc/: Sample clustering and PCA plots
Example Usage
Basic RNA-seq Analysis
nextflow run nf-core/rnaseq \
--input samplesheet.csv \
--outdir results \
--genome GRCh38 \
-profile docker
Custom Reference with STAR
nextflow run nf-core/rnaseq \
--input samplesheet.csv \
--outdir results \
--fasta genome.fa \
--gtf genes.gtf \
--star_index /path/to/star/index/ \
--aligner star_salmon \
-profile docker
UMI-based Analysis
nextflow run nf-core/rnaseq \
--input samplesheet.csv \
--outdir results \
--genome GRCh38 \
--with_umi \
--umitools_extract_method 'regex' \
--umitools_bc_pattern '^(?P<umi_1>.{6})(?P<discard_1>.{4}).*' \
-profile docker
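The barcode pattern in the command above uses named groups to say which bases are the UMI and which are thrown away. The sketch below applies the same pattern to a made-up read with Python's re module, purely to show how the groups partition the sequence. It does not reproduce how the pipeline invokes UMI-tools.

```python
import re

# Same pattern as in the command above: 6 UMI bases, 4 discarded bases,
# and the rest of the read is kept.
BC_PATTERN = re.compile(r"^(?P<umi_1>.{6})(?P<discard_1>.{4}).*")

read = "ACGTACTTTTGGGCCCAAAT"  # hypothetical 20-base read

match = BC_PATTERN.match(read)
umi = match.group("umi_1")
discarded = match.group("discard_1")
remaining = read[match.end("discard_1"):]

print("UMI:      ", umi)
print("Discarded:", discarded)
print("Remaining:", remaining)
```

Reads shorter than ten bases would not match the pattern at all, so match would be None for them.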
Tips and Best Practices
Library Preparation
- Strand-specific protocols are highly recommended for accurate transcript quantification
- Ribosomal depletion is preferred over poly-A selection when comprehensive transcriptome coverage is needed, since it retains non-polyadenylated transcripts
- Adequate sequencing depth:
  - 20-30 million reads for standard gene expression
  - 50-100 million for transcript discovery
  - 100+ million for allele-specific expression
Experimental Design
- Biological replicates: Minimum 3 per condition (more is better for statistical power)
- Batch effects: Distribute conditions across sequencing batches and process all samples together
- Time points: Consider circadian rhythms for time-course experiments
Analysis Considerations
- Strandedness detection: Set the strandedness column of the sample sheet to auto if unsure, and the pipeline will detect it automatically
- Memory requirements: STAR alignment requires ~30GB RAM for the human genome. Consider HISAT2 when resources are limited
- Reference Quality: Use primary assembly (no patches/haplotypes) and comprehensive gene annotations (GENCODE/Ensembl)
- UMI handling: Essential for low-input samples or single-cell preparations
Quality Thresholds
- Read quality: more than 80% of bases at Q30 or above
- Duplication rate: <50% for standard RNA-seq
- rRNA contamination: <10% (ideally <5%)
- Uniquely mapped: >70% of reads
- Review MultiQC report to identify samples that may need to be excluded
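The thresholds above can be applied mechanically once the relevant percentages have been collected for each sample. The sketch below shows one way to flag samples for review. The metric names are hypothetical labels rather than MultiQC column names, and the sample values are made up.

```python
# Thresholds from the list above; the keys are illustrative names only.
THRESHOLDS = {
    "pct_q30_bases": (">", 80.0),
    "pct_duplication": ("<", 50.0),
    "pct_rrna": ("<", 10.0),
    "pct_uniquely_mapped": (">", 70.0),
}


def failed_checks(metrics):
    """Return the names of the metrics that miss their threshold."""
    failures = []
    for name, (direction, limit) in THRESHOLDS.items():
        value = metrics[name]
        passed = value > limit if direction == ">" else value < limit
        if not passed:
            failures.append(name)
    return failures


# Made-up values for two samples.
samples = {
    "control_rep1": {
        "pct_q30_bases": 93.2,
        "pct_duplication": 35.0,
        "pct_rrna": 2.1,
        "pct_uniquely_mapped": 88.4,
    },
    "treatment_rep2": {
        "pct_q30_bases": 91.0,
        "pct_duplication": 62.5,
        "pct_rrna": 12.0,
        "pct_uniquely_mapped": 81.0,
    },
}

flagged = {name: failed_checks(metrics) for name, metrics in samples.items()}
for name, failures in flagged.items():
    print(name, "OK" if not failures else "review: " + ", ".join(failures))
```

A flagged sample is a prompt to look at the full report, not an automatic exclusion; highly expressed genes, for example, can push the duplication rate up in otherwise good libraries.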
Troubleshooting
Common Issues
Issue: Low alignment rate (<60%)
- Solution: Check if correct genome/annotation version is used
- Verify library preparation method matches analysis parameters
- Check for adapter contamination or poor quality reads
- Consider contamination from other species
Issue: High ribosomal RNA content
- Solution: Verify rRNA depletion was performed during library prep
- Use --remove_ribo_rna to bioinformatically remove rRNA reads
- Check if depletion kit was appropriate for your organism
Issue: Inconsistent strandedness results
- Solution: Verify library prep protocol documentation
- Check if samples were mixed from different batches
- Set strandedness to auto in the sample sheet to detect it automatically
Issue: Low gene detection rate
- Solution: Increase sequencing depth
- Check RNA quality (RIN scores) from extraction
- Verify correct annotation file is used
- Consider tissue-specific expression patterns
Issue: High duplication rates
- Solution: Some duplication is expected for highly expressed genes
- Use UMI-based deduplication if available
- Check if PCR cycles were excessive
- Consider --skip_markduplicates for low-input samples
Issue: Outlier samples in PCA/clustering
- Solution: Check individual sample QC metrics
- Look for batch effects or technical issues
- Verify sample identity and metadata
- Consider removing outliers for downstream analysis
Issue: Memory errors during alignment
- Solution: Reduce --star_max_parallel_jobs
- Increase memory allocation with --max_memory
- Use pre-built genome indices
- Process in smaller batches
Additional Resources
- Full documentation: nf-core/rnaseq documentation
- Usage guide: nf-co.re/rnaseq/3.19.0/docs/usage
- Output description: nf-co.re/rnaseq/3.19.0/docs/output
- Parameter reference: nf-co.re/rnaseq/3.19.0/parameters
- Pipeline source code: GitHub - nf-core/rnaseq
- Support: Join the #rnaseq channel on nf-core Slack
- Citation: doi.org/10.5281/zenodo.1400710