RNA-seq Pipeline

Flow provides the nf-core/rnaseq v3.19.0 pipeline, a best-practice analysis pipeline for RNA sequencing data.

Overview

The RNA-seq pipeline performs comprehensive analysis of RNA sequencing data, from raw reads to gene expression quantification and quality control. It supports various experimental designs and library preparations.

Key Features

Flexible Input Support
- Single-end and paired-end sequencing
- Multiple lanes per sample (automatic merging)
- Technical replicate handling
- Automatic strandedness detection

Multiple Analysis Routes
- Alignment with STAR or HISAT2
- Pseudoalignment with Salmon or Kallisto
- Transcript quantification and gene expression

Comprehensive Quality Control
- Read quality assessment (FastQC)
- Alignment quality metrics (RSeQC, Qualimap)
- Sample correlation analysis
- MultiQC integrated report

Advanced Features
- UMI-based deduplication
- Ribosomal RNA removal
- Contamination screening
- 3' digital gene expression support

Pipeline Summary

Preprocessing
- Merge re-sequenced FastQ files
- Read quality control (FastQC)
- UMI extraction (if applicable)
- Adapter and quality trimming (Trim Galore!)
- Ribosomal RNA removal (optional)

Alignment
- Genome alignment (STAR/HISAT2)
- Transcriptome pseudoalignment (Salmon/Kallisto)
- Sort and index alignments
- Mark duplicate reads

Quantification
- Gene-level counting (featureCounts/Salmon)
- Transcript assembly (StringTie)
- TPM/FPKM normalization
- Generate expression matrices

Quality Control
- Alignment QC (RSeQC, Qualimap)
- Library complexity (Preseq)
- Sample similarity (DESeq2)
- Comprehensive report (MultiQC)

Input Requirements

Sample Sheet Format

Create a CSV file with your sample information:

sample,fastq_1,fastq_2,strandedness
control_rep1,/path/to/control_rep1_R1.fastq.gz,/path/to/control_rep1_R2.fastq.gz,reverse
control_rep2,/path/to/control_rep2_R1.fastq.gz,/path/to/control_rep2_R2.fastq.gz,reverse
treatment_rep1,/path/to/treatment_rep1_R1.fastq.gz,/path/to/treatment_rep1_R2.fastq.gz,reverse
treatment_rep2,/path/to/treatment_rep2_R1.fastq.gz,/path/to/treatment_rep2_R2.fastq.gz,reverse

Columns:

sample: Unique sample identifier
fastq_1: Path to R1 FASTQ file
fastq_2: Path to R2 FASTQ file (leave empty for single-end)
strandedness: Library strandedness (forward, reverse, unstranded, or auto)
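
The column rules above can be checked before launching a run. The following is a minimal sketch, not part of Flow or nf-core: validate_samplesheet is a hypothetical helper, and it only enforces what this section states (the four columns, a required fastq_1, an optional fastq_2, and the four allowed strandedness values).

```python
import csv
import io

REQUIRED_COLUMNS = ["sample", "fastq_1", "fastq_2", "strandedness"]
ALLOWED_STRANDEDNESS = {"forward", "reverse", "unstranded", "auto"}


def validate_samplesheet(handle):
    """Return a list of human-readable problems found in a sample sheet."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]

    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        strandedness = (row["strandedness"] or "").strip()
        if not (row["sample"] or "").strip():
            problems.append(f"line {line_no}: empty sample name")
        if not (row["fastq_1"] or "").strip():
            problems.append(f"line {line_no}: fastq_1 is required")
        # fastq_2 may legitimately be empty (single-end), so it is not checked.
        if strandedness not in ALLOWED_STRANDEDNESS:
            problems.append(f"line {line_no}: invalid strandedness {strandedness!r}")
    return problems


# Made-up sheet: one paired-end row, one single-end row, one invalid row.
example = io.StringIO(
    "sample,fastq_1,fastq_2,strandedness\n"
    "control_rep1,/path/to/control_rep1_R1.fastq.gz,/path/to/control_rep1_R2.fastq.gz,reverse\n"
    "single_end_1,/path/to/single_end_1.fastq.gz,,auto\n"
    "bad_sample,/path/to/bad_R1.fastq.gz,,sense\n"
)
for problem in validate_samplesheet(example):
    print(problem)
```

The single-end row passes because an empty fastq_2 is allowed; only the row with the unrecognised strandedness value is reported.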

Reference Requirements

Genome FASTA: Reference genome sequence
GTF/GFF: Gene annotation file
Optional: Pre-built STAR/HISAT2 indices

Key Parameters

Basic Parameters

--input: Path to sample sheet CSV
--outdir: Output directory path
--genome: Reference genome name (e.g., 'GRCh38')
--gtf: Path to GTF annotation file
--fasta: Path to genome FASTA file

Alignment Options

--aligner: Choose aligner (star_salmon, star_rsem, hisat2)
--pseudo_aligner: Enable pseudoalignment (salmon or kallisto)
--skip_alignment: Skip all alignment steps
--save_unaligned: Save unaligned reads

Quantification Options

--featurecounts_feature_type: Feature type for counting (default: 'exon')
--featurecounts_group_type: Grouping for features (default: 'gene_id')
--salmon_quant_libtype: Salmon library type (auto-detected by default)

Quality Control

--skip_qc: Skip all QC steps except MultiQC
--skip_fastqc: Skip FastQC
--skip_rseqc: Skip RSeQC
--skip_qualimap: Skip Qualimap
--skip_preseq: Skip Preseq
--skip_multiqc: Skip MultiQC

Advanced Options

--with_umi: Enable UMI-based deduplication
--umitools_extract_method: UMI extraction method
--remove_ribo_rna: Remove ribosomal RNA reads
--save_trimmed: Save trimmed FASTQ files
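
For scripted launches, the parameters listed above can be collected in a dictionary and turned into a command line. The sketch below is illustrative only: build_rnaseq_command is a hypothetical helper, not something Flow or nf-core provides, and it uses only flags that appear in this section.

```python
import shlex


def build_rnaseq_command(params, profile="docker"):
    """Assemble a `nextflow run nf-core/rnaseq` argument list from a dict."""
    command = ["nextflow", "run", "nf-core/rnaseq"]
    for name, value in params.items():
        if value is True:
            # Boolean switch such as --save_trimmed: the flag alone enables it.
            command.append(f"--{name}")
        elif value is not False and value is not None:
            # Valued parameter such as --outdir results.
            command += [f"--{name}", str(value)]
        # False/None entries are dropped so the pipeline default applies.
    command += ["-profile", profile]
    return command


params = {
    "input": "samplesheet.csv",
    "outdir": "results",
    "genome": "GRCh38",
    "aligner": "star_salmon",
    "save_trimmed": True,
    "skip_qualimap": False,
}
# shlex.join quotes arguments where needed, giving a copy-pasteable command.
print(shlex.join(build_rnaseq_command(params)))
```

Returning a list rather than a string keeps the result safe to pass to subprocess.run without shell quoting issues.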

Pipeline Outputs

Expression Matrices

With the default star_salmon aligner, the main gene expression results are in results/star_salmon/:

salmon.merged.gene_counts.tsv: Raw gene counts matrix
salmon.merged.gene_tpm.tsv: TPM normalized expression
salmon.merged.gene_counts.rds: R SummarizedExperiment object
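
As a quick sanity check on the counts matrix, per-sample totals can be computed with the standard library alone. This is a sketch under one assumption that should be verified against your own file: that the TSV starts with gene_id and gene_name columns, followed by one column per sample. library_sizes is a hypothetical helper name, and the demo matrix is made up.

```python
import csv
import io


def library_sizes(handle):
    """Sum counts per sample from a tab-separated gene count matrix."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    # Assumed layout: gene_id, gene_name, then one column per sample.
    samples = header[2:]
    totals = dict.fromkeys(samples, 0.0)
    for row in reader:
        for sample, value in zip(samples, row[2:]):
            # Salmon reports estimated counts, so values may be fractional.
            totals[sample] += float(value)
    return totals


# Tiny made-up matrix standing in for salmon.merged.gene_counts.tsv.
demo = io.StringIO(
    "gene_id\tgene_name\tcontrol_rep1\ttreatment_rep1\n"
    "GENE_A\tA\t10\t30.5\n"
    "GENE_B\tB\t5\t0\n"
)
print(library_sizes(demo))
```

For a real run, replace the StringIO object with open("results/star_salmon/salmon.merged.gene_counts.tsv", newline="").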

Quality Control Reports

multiqc/multiqc_report.html: Comprehensive QC report
fastqc/: Read quality reports
rseqc/: RNA-seq specific metrics
qualimap/: Alignment quality assessment

Alignment Files

star_salmon/: STAR alignments and Salmon quantification
star_salmon/bigwig/: Coverage tracks for genome browsers

Additional Outputs

trimgalore/: Trimmed reads (if --save_trimmed)
unmapped/: Unaligned reads (if --save_unaligned)
deseq2_qc/: Sample clustering and PCA plots

Example Usage

Basic RNA-seq Analysis

nextflow run nf-core/rnaseq \
  --input samplesheet.csv \
  --outdir results \
  --genome GRCh38 \
  -profile docker

Custom Reference with STAR

nextflow run nf-core/rnaseq \
  --input samplesheet.csv \
  --outdir results \
  --fasta genome.fa \
  --gtf genes.gtf \
  --star_index /path/to/star/index/ \
  --aligner star_salmon \
  -profile docker

UMI-based Analysis

nextflow run nf-core/rnaseq \
  --input samplesheet.csv \
  --outdir results \
  --genome GRCh38 \
  --with_umi \
  --umitools_extract_method 'regex' \
  --umitools_bc_pattern '^(?P<umi_1>.{6})(?P<discard_1>.{4}).*' \
  -profile docker
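
To see what the barcode pattern in the command above captures, it can be tried on a single read. This is a small illustration using Python's built-in re module on a made-up sequence; UMI-tools itself relies on the third-party regex package, so treat this as an approximation of its matching rather than its implementation.

```python
import re

# Pattern copied from the --umitools_bc_pattern example above.
BC_PATTERN = r"^(?P<umi_1>.{6})(?P<discard_1>.{4}).*"

read = "ACGTAC" + "GGGG" + "TTTTCCCCAAAA"  # made-up read sequence
match = re.match(BC_PATTERN, read)

umi = match.group("umi_1")                  # first 6 bases: the UMI
discarded = match.group("discard_1")        # next 4 bases: dropped
remainder = read[match.end("discard_1"):]   # sequence left after extraction

print(umi, discarded, remainder)
```

The umi_1 group marks the bases treated as the UMI, discard_1 marks bases that are thrown away, and everything after them is the part of the read that remains.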

Tips and Best Practices

Library Preparation

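Strand-specific protocols are highly recommended: they let reads from overlapping genes on opposite strands be assigned correctly, which improves transcript quantification
Ribosomal depletion is preferred over poly-A selection when comprehensive transcriptome coverage is needed, because it also retains non-polyadenylated transcripts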
Adequate sequencing depth:
- 20-30 million reads for standard gene expression
- 50-100 million for transcript discovery
- 100+ million for allele-specific expression

Experimental Design

Biological replicates: Minimum 3 per condition (more is better for statistical power)
Batch effects: Distribute conditions across sequencing batches and process all samples together
Time points: Consider circadian rhythms for time-course experiments
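
One simple way to spread conditions across sequencing batches is to deal samples out round-robin. The sketch below is only an illustration of that idea: assign_batches is a hypothetical helper, and in practice you would usually also shuffle the sample order within each condition first.

```python
from collections import defaultdict
from itertools import cycle


def assign_batches(samples_by_condition, n_batches):
    """Deal each condition's samples round-robin across batches."""
    batches = defaultdict(list)
    # One shared cycle keeps batch sizes even across all conditions.
    slots = cycle(range(n_batches))
    for condition, samples in samples_by_condition.items():
        for sample in samples:
            batches[next(slots)].append((condition, sample))
    return dict(batches)


design = {
    "control": ["control_rep1", "control_rep2"],
    "treatment": ["treatment_rep1", "treatment_rep2"],
}
for batch, members in assign_batches(design, n_batches=2).items():
    print(batch, members)
```

With this design every batch contains one control and one treatment sample, so condition and batch are not confounded.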

Analysis Considerations

Strandedness detection: If unsure, set the strandedness column in the sample sheet to auto and the pipeline will infer it from the data
Memory requirements: STAR alignment needs roughly 30 GB of RAM for the human genome; consider HISAT2 when resources are limited
Reference quality: Use the primary assembly (no patches/haplotypes) and comprehensive gene annotations (GENCODE/Ensembl)
UMI handling: Essential for low-input samples or single-cell preparations

Quality Thresholds

Read quality: at least 80% of bases at Q30 or higher
Duplication rate: <50% for standard RNA-seq
rRNA contamination: <10% (ideally <5%)
Uniquely mapped: >70% of reads
Review the MultiQC report to identify samples that may need to be excluded
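
The thresholds listed above can be applied to a table of per-sample metrics to shortlist samples for closer review. This is a hedged sketch: the metric names (pct_q30 and so on) and the example values are hypothetical, and they would need to be mapped onto whatever columns you export from the MultiQC report.

```python
# Thresholds taken from the list above; metric names are hypothetical.
THRESHOLDS = {
    "pct_q30": (">", 80.0),
    "pct_duplication": ("<", 50.0),
    "pct_rrna": ("<", 10.0),
    "pct_uniquely_mapped": (">", 70.0),
}


def failed_checks(metrics):
    """Return the names of metrics that miss their threshold."""
    failed = []
    for name, (direction, limit) in THRESHOLDS.items():
        value = metrics[name]
        ok = value > limit if direction == ">" else value < limit
        if not ok:
            failed.append(name)
    return failed


# Made-up metrics for two samples.
samples = {
    "control_rep1": {
        "pct_q30": 92.1, "pct_duplication": 35.0,
        "pct_rrna": 2.4, "pct_uniquely_mapped": 88.0,
    },
    "treatment_rep2": {
        "pct_q30": 85.0, "pct_duplication": 61.0,
        "pct_rrna": 12.5, "pct_uniquely_mapped": 74.0,
    },
}
for sample, metrics in samples.items():
    print(sample, failed_checks(metrics) or "ok")
```

A failed check is a prompt to inspect the sample, not an automatic reason to exclude it.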

Troubleshooting

Common Issues

Issue: Low alignment rate (<60%)

Solution: Check that the correct genome/annotation version is used
Verify library preparation method matches analysis parameters
Check for adapter contamination or poor quality reads
Consider contamination from other species

Issue: High ribosomal RNA content

Solution: Verify rRNA depletion was performed during library prep
Use --remove_ribo_rna to bioinformatically remove rRNA reads
Check if depletion kit was appropriate for your organism

Issue: Inconsistent strandedness results

Solution: Verify library prep protocol documentation
Check if samples were mixed from different batches
Set strandedness to auto in the sample sheet to detect it automatically

Issue: Low gene detection rate

Solution: Increase sequencing depth
Check RNA quality (RIN scores) from extraction
Verify correct annotation file is used
Consider tissue-specific expression patterns

Issue: High duplication rates

Solution: Some duplication is expected for highly expressed genes and is not necessarily a problem
Use UMI-based deduplication if available
Check if PCR cycles were excessive
Consider --skip_markduplicates for low-input samples

Issue: Outlier samples in PCA/clustering

Solution: Check individual sample QC metrics
Look for batch effects or technical issues
Verify sample identity and metadata
Consider removing outliers for downstream analysis

Issue: Memory errors during alignment

Solution: Reduce --star_max_parallel_jobs
Increase memory allocation with --max_memory
Use pre-built genome indices
Process in smaller batches

Additional Resources

Full documentation: nf-core/rnaseq documentation
Usage guide: nf-co.re/rnaseq/3.19.0/docs/usage
Output description: nf-co.re/rnaseq/3.19.0/docs/output
Parameter reference: nf-co.re/rnaseq/3.19.0/parameters
Pipeline source code: GitHub - nf-core/rnaseq
Support: Join the #rnaseq channel on nf-core Slack
Citation: doi.org/10.5281/zenodo.1400710