RNA-seq Pipeline

Flow provides nf-core/rnaseq v3.19.0, a best-practice analysis pipeline for RNA sequencing data.


Overview

The RNA-seq pipeline performs comprehensive analysis of RNA sequencing data, from raw reads to gene expression quantification and quality control. It supports various experimental designs and library preparations.

Key Features

  • Flexible Input Support

    • Single-end and paired-end sequencing
    • Multiple lanes per sample (automatic merging)
    • Technical replicate handling
    • Automatic strandedness detection
  • Multiple Analysis Routes

    • Alignment with STAR or HISAT2
    • Pseudoalignment with Salmon or Kallisto
    • Transcript quantification and gene expression
  • Comprehensive Quality Control

    • Read quality assessment (FastQC)
    • Alignment quality metrics (RSeQC, Qualimap)
    • Sample correlation analysis
    • MultiQC integrated report
  • Advanced Features

    • UMI-based deduplication
    • Ribosomal RNA removal
    • Contamination screening
    • 3' digital gene expression support
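The automatic lane merging listed above works from the sample sheet (described under Input Requirements): rows that share a sample name are concatenated before analysis. A minimal sketch of building such rows from file names follows. The naming scheme `<sample>_L<lane>_R<read>.fastq.gz`, the helper `build_rows`, and the example file names are assumptions for illustration, not something the pipeline requires.

```python
import csv
import io
import re
from collections import defaultdict

# Assumed naming scheme (hypothetical): <sample>_L<lane>_R<read>.fastq.gz
NAME = re.compile(r"^(?P<sample>.+)_L(?P<lane>\d+)_R(?P<read>[12])\.fastq\.gz$")


def build_rows(filenames, strandedness="auto"):
    """Group FASTQ files by sample and lane; emit one sample sheet row per lane."""
    lanes = defaultdict(dict)  # (sample, lane) -> {read number: file name}
    for name in filenames:
        m = NAME.match(name)
        if m:
            lanes[(m["sample"], m["lane"])][m["read"]] = name
    rows = []
    for (sample, _lane), reads in sorted(lanes.items()):
        if "1" not in reads:
            continue  # a row needs at least an R1 file
        # fastq_2 stays empty for single-end data
        rows.append([sample, reads["1"], reads.get("2", ""), strandedness])
    return rows


files = [
    "ctrl_L001_R1.fastq.gz", "ctrl_L001_R2.fastq.gz",
    "ctrl_L002_R1.fastq.gz", "ctrl_L002_R2.fastq.gz",
    "treat_L001_R1.fastq.gz",
]
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["sample", "fastq_1", "fastq_2", "strandedness"])
writer.writerows(build_rows(files))
print(out.getvalue())
```

Both `ctrl` rows carry the same sample name, so the two lanes would be merged into one sample.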

Pipeline Summary

  1. Preprocessing

    • Merge re-sequenced FastQ files
    • Read quality control (FastQC)
    • UMI extraction (if applicable)
    • Adapter and quality trimming (Trim Galore!)
    • Ribosomal RNA removal (optional)
  2. Alignment

    • Genome alignment (STAR/HISAT2)
    • Transcriptome pseudoalignment (Salmon/Kallisto)
    • Sort and index alignments
    • Mark duplicate reads
  3. Quantification

    • Gene-level counting (featureCounts/Salmon)
    • Transcript assembly (StringTie)
    • TPM/FPKM normalization
    • Generate expression matrices
  4. Quality Control

    • Alignment QC (RSeQC, Qualimap)
    • Library complexity (Preseq)
    • Sample similarity (DESeq2)
    • Comprehensive report (MultiQC)
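The TPM normalization listed under Quantification can be illustrated with a short calculation: each count is divided by the gene length, and the resulting rates are rescaled to sum to one million. This is a simplified sketch, not the pipeline's implementation (Salmon, for instance, works with effective lengths); the function `tpm` and the gene names, counts, and lengths are made up for illustration.

```python
def tpm(counts, lengths):
    """Convert raw counts to TPM given per-gene lengths in bases."""
    # Reads per base, so that long genes are not over-weighted.
    rates = {gene: counts[gene] / lengths[gene] for gene in counts}
    total = sum(rates.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    # Rescale so that the values of one sample sum to one million.
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths = {"geneA": 1000, "geneB": 1000, "geneC": 3000}
for gene, value in tpm(counts, lengths).items():
    print(f"{gene}\t{value:.1f}")
```

Note that geneC has the most reads but not the highest TPM, because its reads are spread over a transcript three times as long.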

Input Requirements

Sample Sheet Format

Create a CSV file with your sample information:

sample,fastq_1,fastq_2,strandedness
control_rep1,/path/to/control_rep1_R1.fastq.gz,/path/to/control_rep1_R2.fastq.gz,reverse
control_rep2,/path/to/control_rep2_R1.fastq.gz,/path/to/control_rep2_R2.fastq.gz,reverse
treatment_rep1,/path/to/treatment_rep1_R1.fastq.gz,/path/to/treatment_rep1_R2.fastq.gz,reverse
treatment_rep2,/path/to/treatment_rep2_R1.fastq.gz,/path/to/treatment_rep2_R2.fastq.gz,reverse

Columns:

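  • sample: Sample identifier; rows that share the same identifier (for example, one library sequenced across several lanes) are merged into a single sample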
  • fastq_1: Path to R1 FASTQ file
  • fastq_2: Path to R2 FASTQ file (leave empty for single-end)
  • strandedness: Library strandedness (forward, reverse, unstranded, or auto)
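Before launching a run, it can help to check a sample sheet against the column rules above. The following is a minimal sketch using only the standard library; the function name `validate_samplesheet` and the checks it performs are illustrative assumptions, and the pipeline's own validation remains authoritative.

```python
import csv
import io

VALID_STRANDEDNESS = {"forward", "reverse", "unstranded", "auto"}
REQUIRED_COLUMNS = ["sample", "fastq_1", "fastq_2", "strandedness"]


def validate_samplesheet(text):
    """Return a list of human-readable problems found in a sample sheet."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        # Short rows yield None for missing cells; treat those as empty.
        cell = {c: (row.get(c) or "").strip() for c in REQUIRED_COLUMNS}
        if not cell["sample"]:
            problems.append(f"line {line_no}: empty sample name")
        if not cell["fastq_1"]:
            problems.append(f"line {line_no}: fastq_1 is required")
        if cell["strandedness"] not in VALID_STRANDEDNESS:
            problems.append(
                f"line {line_no}: unknown strandedness {cell['strandedness']!r}"
            )
    return problems


example = """sample,fastq_1,fastq_2,strandedness
control_rep1,/path/to/control_rep1_R1.fastq.gz,/path/to/control_rep1_R2.fastq.gz,reverse
control_rep2,/path/to/control_rep2_R1.fastq.gz,,auto
treatment_rep1,/path/to/treatment_rep1_R1.fastq.gz,,stranded
"""
for problem in validate_samplesheet(example):
    print(problem)
```

An empty fastq_2 is accepted because it marks a single-end sample; only the last row is reported, for its unrecognized strandedness value.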

Reference Requirements

  • Genome FASTA: Reference genome sequence
  • GTF/GFF: Gene annotation file
  • Optional: Pre-built STAR/HISAT2 indices

Key Parameters

Basic Parameters

  • --input: Path to sample sheet CSV
  • --outdir: Output directory path
  • --genome: Reference genome name (e.g., 'GRCh38')
  • --gtf: Path to GTF annotation file
  • --fasta: Path to genome FASTA file

Alignment Options

  • --aligner: Choose aligner (star_salmon, star_rsem, hisat2)
  • --pseudo_aligner: Choose pseudoaligner (salmon, kallisto)
  • --skip_alignment: Skip all alignment steps
  • --save_unaligned: Save unaligned reads

Quantification Options

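  • --featurecounts_feature_type: Feature type counted by featureCounts in the biotype QC step (default: 'exon')
  • --featurecounts_group_type: GTF attribute used to group features in the biotype QC step (default: 'gene_biotype')
  • --salmon_quant_libtype: Salmon library type (inferred from each sample's strandedness unless overridden)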

Quality Control

  • --skip_qc: Skip all QC steps except MultiQC
  • --skip_fastqc: Skip FastQC
  • --skip_rseqc: Skip RSeQC
  • --skip_qualimap: Skip Qualimap
  • --skip_preseq: Skip Preseq
  • --skip_multiqc: Skip MultiQC

Advanced Options

  • --with_umi: Enable UMI-based deduplication
  • --umitools_extract_method: UMI extraction method
  • --remove_ribo_rna: Remove ribosomal RNA reads
  • --save_trimmed: Save trimmed FASTQ files

Pipeline Outputs

Expression Matrices

With the default star_salmon aligner and --outdir results, the main gene expression results are written to results/star_salmon/:

  • salmon.merged.gene_counts.tsv: Raw gene counts matrix
  • salmon.merged.gene_tpm.tsv: TPM normalized expression
  • salmon.merged.gene_counts.rds: R SummarizedExperiment object
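The merged matrices are tab-separated text and can be inspected without special tooling. The sketch below assumes the layout is a gene_id column, a gene_name column, then one column per sample; the helper `read_gene_matrix` and the numeric values in the demo are made up for illustration.

```python
import csv
import io


def read_gene_matrix(handle):
    """Parse a merged gene matrix into {sample: {gene_id: value}}."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Every column other than the two annotation columns is a sample.
    samples = [c for c in reader.fieldnames if c not in ("gene_id", "gene_name")]
    matrix = {sample: {} for sample in samples}
    for row in reader:
        for sample in samples:
            matrix[sample][row["gene_id"]] = float(row[sample])
    return matrix


# In-memory stand-in for salmon.merged.gene_counts.tsv (values are invented).
demo = io.StringIO(
    "gene_id\tgene_name\tcontrol_rep1\ttreatment_rep1\n"
    "ENSG00000000003\tTSPAN6\t120\t98.5\n"
    "ENSG00000000005\tTNMD\t0\t3\n"
)
matrix = read_gene_matrix(demo)
# Number of genes with a non-zero count in each sample.
detected = {s: sum(v > 0 for v in genes.values()) for s, genes in matrix.items()}
print(detected)
```

For a real run, pass an open file handle instead of the in-memory demo, for example `open("results/star_salmon/salmon.merged.gene_counts.tsv")`.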

Quality Control Reports

  • multiqc/multiqc_report.html: Comprehensive QC report
  • fastqc/: Read quality reports
  • rseqc/: RNA-seq specific metrics
  • qualimap/: Alignment quality assessment

Alignment Files

  • star_salmon/: STAR alignments and Salmon quantification
  • star_salmon/bigwig/: Coverage tracks for genome browsers

Additional Outputs

  • trimgalore/: Trimmed reads (if --save_trimmed)
  • unmapped/: Unaligned reads (if --save_unaligned)
  • deseq2_qc/: Sample clustering and PCA plots

Example Usage

Basic RNA-seq Analysis

nextflow run nf-core/rnaseq \
  --input samplesheet.csv \
  --outdir results \
  --genome GRCh38 \
  -profile docker

Custom Reference with STAR

nextflow run nf-core/rnaseq \
  --input samplesheet.csv \
  --outdir results \
  --fasta genome.fa \
  --gtf genes.gtf \
  --star_index /path/to/star/index/ \
  --aligner star_salmon \
  -profile docker

UMI-based Analysis

nextflow run nf-core/rnaseq \
  --input samplesheet.csv \
  --outdir results \
  --genome GRCh38 \
  --with_umi \
  --umitools_extract_method 'regex' \
  --umitools_bc_pattern '^(?P<umi_1>.{6})(?P<discard_1>.{4}).*' \
  -profile docker
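To see what the barcode pattern in the UMI example selects, the same regular expression can be applied to a read sequence: the first 6 bases are captured as the UMI (umi_1) and the next 4 bases are discarded (discard_1). This sketch uses Python's built-in `re` module purely for illustration; it does not reproduce how UMI-tools processes reads, and the helper `split_read` and the example sequence are hypothetical.

```python
import re

# Pattern copied from the UMI example above.
BC_PATTERN = r"^(?P<umi_1>.{6})(?P<discard_1>.{4}).*"


def split_read(sequence, pattern=BC_PATTERN):
    """Return (umi, remaining_sequence), or None if the read is too short."""
    match = re.match(pattern, sequence)
    if match is None:
        return None
    umi = match.group("umi_1")
    # Everything after the discarded bases is kept as the read.
    remainder = sequence[match.end("discard_1"):]
    return umi, remainder


read = "ACGTAC" + "GGGG" + "TTTTCCCCAAAA"  # UMI + discarded bases + insert
print(split_read(read))
```

Reads shorter than 10 bases cannot satisfy the pattern, which is why the helper returns None for them.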

Tips and Best Practices

Library Preparation

  • Strand-specific protocols are highly recommended for accurate transcript quantification
  • Ribosomal depletion retains non-polyadenylated transcripts and is preferred over poly-A selection when comprehensive transcriptome coverage is needed
  • Adequate sequencing depth:
    • 20-30 million reads for standard gene expression
    • 50-100 million for transcript discovery
    • 100+ million for allele-specific expression

Experimental Design

  • Biological replicates: Minimum 3 per condition (more is better for statistical power)
  • Batch effects: Distribute conditions across sequencing batches and process all samples together
  • Time points: Consider circadian rhythms for time-course experiments

Analysis Considerations

  • Strandedness detection: If unsure, set the strandedness column of the sample sheet to auto and the pipeline will infer it from the data
  • Memory requirements: STAR alignment requires ~30 GB RAM for the human genome. Consider HISAT2 when memory is limited
  • Reference Quality: Use primary assembly (no patches/haplotypes) and comprehensive gene annotations (GENCODE/Ensembl)
  • UMI handling: Essential for low-input samples or single-cell preparations

Quality Thresholds

  • Read quality: more than 80% of bases at Q30 or higher
  • Duplication rate: <50% for standard RNA-seq
  • rRNA contamination: <10% (ideally <5%)
  • Uniquely mapped: >70% of reads
  • Review MultiQC report to identify samples that may need to be excluded
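The thresholds above can be applied programmatically once per-sample metrics have been collected. The sketch below is one way to flag samples for review; the metric names (pct_q30 and so on) are hypothetical labels chosen here, not MultiQC column names, and the sample values are invented.

```python
import operator

# Thresholds taken from the list above; metric names are hypothetical.
THRESHOLDS = {
    "pct_q30": (">", 80.0),
    "pct_duplication": ("<", 50.0),
    "pct_rrna": ("<", 10.0),
    "pct_uniquely_mapped": (">", 70.0),
}
OPS = {">": operator.gt, "<": operator.lt}


def failed_checks(metrics):
    """Return a description of every threshold a sample does not meet."""
    failures = []
    for name, (op, limit) in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            continue  # metric not collected for this sample
        if not OPS[op](value, limit):
            failures.append(f"{name}={value} (expected {op} {limit})")
    return failures


sample = {
    "pct_q30": 91.2,
    "pct_duplication": 62.0,
    "pct_rrna": 3.1,
    "pct_uniquely_mapped": 84.5,
}
print(failed_checks(sample))
```

A flagged sample is a prompt for closer inspection rather than an automatic exclusion; high duplication, for example, can be expected for low-input libraries.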

Troubleshooting

Common Issues

Issue: Low alignment rate (<60%)

  • Solution: Check if correct genome/annotation version is used
  • Verify library preparation method matches analysis parameters
  • Check for adapter contamination or poor quality reads
  • Consider contamination from other species

Issue: High ribosomal RNA content

  • Solution: Verify rRNA depletion was performed during library prep
  • Use --remove_ribo_rna to bioinformatically remove rRNA reads
  • Check if depletion kit was appropriate for your organism

Issue: Inconsistent strandedness results

  • Solution: Verify library prep protocol documentation
  • Check if samples were mixed from different batches
  • Set strandedness to auto in the sample sheet to detect it automatically

Issue: Low gene detection rate

  • Solution: Increase sequencing depth
  • Check RNA quality (RIN scores) from extraction
  • Verify correct annotation file is used
  • Consider tissue-specific expression patterns

Issue: High duplication rates

  • Solution: Some duplication is expected for highly expressed genes and is not necessarily a problem
  • Use UMI-based deduplication if available
  • Check if PCR cycles were excessive
  • Consider --skip_markduplicates for low-input samples

Issue: Outlier samples in PCA/clustering

  • Solution: Check individual sample QC metrics
  • Look for batch effects or technical issues
  • Verify sample identity and metadata
  • Consider removing outliers for downstream analysis

Issue: Memory errors during alignment

  • Solution: Reduce --star_max_parallel_jobs
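  • Increase the memory available to the pipeline (--max_memory in older releases; recent nf-core releases set limits through Nextflow's process.resourceLimits in a custom config)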
  • Use pre-built genome indices
  • Process in smaller batches
