CLIP-Seq Pipeline

Overview

The CLIP-Seq pipeline (goodwright/clipseq v1.0dev) enables high-resolution, transcriptome-wide studies of interactions between RNA-binding proteins (RBPs) and RNA. Flow supports various forms of single-end CLIP data, including variants of iCLIP (e.g., irCLIP, iCLIP2, iiCLIP) and eCLIP.

The pipeline is maintained by Goodwright in collaboration with the Ule lab and the developers of the nf-core/clipseq pipeline.

Pipeline Summary

The CLIP-Seq pipeline processes your data through these major steps:

  1. Quality Control & Preprocessing - Initial FastQC, adapter trimming and quality filtering
  2. UMI Extraction - Move UMIs from reads to headers for deduplication
  3. Pre-alignment - Map to rRNA/tRNA sequences to filter contaminants
  4. Genome Alignment - Align the reads left unmapped by pre-alignment to the reference genome with STAR
  5. Deduplication - UMI-based deduplication to identify unique crosslinks
  6. Peak Calling - Multiple algorithms (Paraclu, Clippy, iCount) for peak detection
  7. Motif Analysis - PEKA analysis for enriched k-mer motifs
  8. Quality Reports - Comprehensive MultiQC report with CLIP-specific metrics
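
The idea behind step 5 can be illustrated with a minimal sketch. It assumes exact UMI matching and a hypothetical in-memory `Read` record; it is not the pipeline's own implementation, and real deduplication tools typically also tolerate sequencing errors in the UMI.

```python
from collections import namedtuple

# Hypothetical record type for illustration; real data would come from a BAM file.
Read = namedtuple("Read", "name chrom start strand umi")

def dedup_by_umi(reads):
    """Keep the first read seen for each (chrom, start, strand, umi) combination."""
    seen = set()
    unique = []
    for read in reads:
        key = (read.chrom, read.start, read.strand, read.umi)
        if key not in seen:
            seen.add(key)
            unique.append(read)
    return unique

reads = [
    Read("r1", "chr1", 100, "+", "ACGTAC"),
    Read("r2", "chr1", 100, "+", "ACGTAC"),  # same position and UMI as r1: PCR duplicate
    Read("r3", "chr1", 100, "+", "TTGCAA"),  # same position, different UMI: kept
    Read("r4", "chr1", 100, "-", "ACGTAC"),  # same UMI, opposite strand: kept
]
print([r.name for r in dedup_by_umi(reads)])  # ['r1', 'r3', 'r4']
```

Reads that share a position but carry different UMIs are treated as independent crosslink events, which is why r3 survives.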

Input Requirements

Sample Sheet Format

The pipeline requires a CSV sample sheet with the following columns:

Column     Description                Required
sample     Unique sample identifier   Yes
fastq_1    Path to FASTQ file         Yes
fastq_2    Path to paired read file   No (single-end only)
group      Experimental group name    Yes
replicate  Replicate identifier       Yes

Example sample sheet:

sample,fastq_1,fastq_2,group,replicate
WT_rep1,WT_rep1.fastq.gz,,wildtype,1
WT_rep2,WT_rep2.fastq.gz,,wildtype,2
MUT_rep1,MUT_rep1.fastq.gz,,mutant,1
MUT_rep2,MUT_rep2.fastq.gz,,mutant,2
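
A sample sheet can be checked before submission with a short script. The sketch below is not part of the pipeline; the function name and the rules it enforces (required columns non-empty, `sample` unique, `fastq_2` left blank because only single-end data is supported) are assumptions drawn from the column descriptions above.

```python
import csv
import io

REQUIRED_COLUMNS = ("sample", "fastq_1", "group", "replicate")

def validate_samplesheet(text):
    """Return a list of human-readable problems found in sample sheet CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]
    problems = []
    seen = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for col in REQUIRED_COLUMNS:
            if not (row[col] or "").strip():
                problems.append(f"line {line_no}: empty '{col}'")
        if (row.get("fastq_2") or "").strip():
            problems.append(f"line {line_no}: fastq_2 is set, but only single-end data is supported")
        name = (row["sample"] or "").strip()
        if name and name in seen:
            problems.append(f"line {line_no}: duplicate sample '{name}'")
        seen.add(name)
    return problems

example = """sample,fastq_1,fastq_2,group,replicate
WT_rep1,WT_rep1.fastq.gz,,wildtype,1
WT_rep2,WT_rep2.fastq.gz,,wildtype,2
MUT_rep1,MUT_rep1.fastq.gz,,mutant,1
MUT_rep2,MUT_rep2.fastq.gz,,mutant,2
"""
print(validate_samplesheet(example))  # []
```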

Additional Requirements

  • Reference Genome: Soft-masked FASTA file (recommended for PEKA)
  • Gene Annotation: GTF file matching the genome version
  • Small RNA Reference: FASTA file containing rRNA/tRNA sequences

Key Parameters

Required Parameters

Parameter      Description                      Default
--samplesheet  Path to sample sheet CSV file    Required
--fasta        Path to reference genome FASTA   Required
--gtf          Path to gene annotation GTF      Required
--smrna_fasta  Path to small RNA FASTA          Required
--outdir       Output directory for results     ./results

UMI Processing Parameters

Parameter                 Description                      Default
--run_move_umi_to_header  Move UMI from read to header     false
--move_umi                UMI pattern (e.g., 'NNNNNN')     null
--umi_separator           Delimiter for UMI in headers     rbc:
--run_umi_dedup           Enable UMI-based deduplication   true
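
To make the effect of moving a UMI to the header concrete, here is a minimal sketch. It assumes the UMI occupies the first bases of the read (a pattern such as 'NNNNNN' is reduced to its length) and that the UMI is appended directly to the read ID after the separator. The exact header layout the pipeline produces may differ, so treat the function as illustrative only.

```python
def move_umi_to_header(record, umi_length=6, separator="rbc:"):
    """Move the first `umi_length` bases of a FASTQ record into its header.

    `record` is a (header, sequence, plus, quality) tuple of strings.
    """
    header, seq, plus, qual = record
    if len(seq) < umi_length:
        raise ValueError("read is shorter than the UMI")
    umi = seq[:umi_length]
    # Append the UMI to the read ID (first whitespace-delimited token) only,
    # leaving any trailing description untouched.
    read_id, _, description = header.partition(" ")
    new_header = f"{read_id}{separator}{umi}"
    if description:
        new_header += " " + description
    # Trim the UMI bases and their quality scores from the read itself.
    return (new_header, seq[umi_length:], plus, qual[umi_length:])

record = ("@read1 1:N:0", "ACGTACGGGTTTCCC", "+", "IIIIIIFFFFFFFFF")
print(move_umi_to_header(record))
# ('@read1rbc:ACGTAC 1:N:0', 'GGGTTTCCC', '+', 'FFFFFFFFF')
```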

Alignment Parameters

Parameter        Description                          Default
--trim_length    Minimum read length after trimming   10
--bowtie_params  Bowtie pre-mapping parameters        -v 2 -m 100 --norc --best --strata
--star_params    STAR alignment parameters            See docs
Peak Calling Parameters

Parameter              Description                            Default
--paraclu_min_value    Minimum value for Paraclu clustering   10
--clippy_params        Additional Clippy parameters           ""
--icount_peaks_params  Additional iCount parameters           ""
--peka_params          Additional PEKA parameters             ""

Output Control Parameters

Parameter              Description                      Default
--save_reference       Save generated reference files   true
--save_indexes         Save alignment indexes           true
--save_merged_fastq    Save merged FASTQ files          true
--save_trimmed         Save trimmed reads               true
--save_align_intermed  Save intermediate alignments     true

Pipeline Outputs

Directory Structure

results/
├── 00_genome/          # Reference files and indexes
├── 01_prealign/        # Trimming and QC reports
├── 02_alignment/       # Alignment files
├── 03_filt_dedup/      # Filtered and deduplicated BAMs
├── 04_crosslinks/      # Crosslink positions and coverage
├── 05_peak_calling/    # Peak calls from multiple algorithms
└── 06_reports/         # QC and summary reports

Output Files

  • Crosslink Files

    • *.genome.bed: Genomic crosslink positions
    • *.norm.genome.bedgraph: Normalized coverage (CPM) for genome browser
    • *.transcript.bed: Transcriptomic crosslink positions
  • Peak Calls

    • clippy/*.clippy.bed: Clippy peak calls
    • paraclu/*.paraclu.peaks.bed: Paraclu clusters
    • icount/*.significant_sites.bed: iCount significant crosslinks
  • Analysis Summaries

    • icount/*summary_gene.tsv: Gene-level crosslink counts
    • icount/*summary_type.tsv: RNA type distribution
    • icount/*rnamaps.pdf: RNA maps around landmarks
  • Motif Analysis

    • peka/*_enrichment.pdf: PEKA k-mer enrichment plots
    • peka/*_pwm.txt: Position weight matrices
  • Quality Reports

    • multiqc/multiqc_report.html: Comprehensive QC report
    • clipqc/*.json: CLIP-specific quality metrics
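
The CPM normalization mentioned for the bedgraph output amounts to scaling each count by one million over the library total. The sketch below shows that arithmetic on hypothetical in-memory tuples; it assumes each crosslink site carries a raw count and does not reproduce how the pipeline writes its files.

```python
def cpm_normalise(crosslinks):
    """Scale crosslink counts to counts per million (CPM).

    `crosslinks` is a list of (chrom, start, end, strand, count) tuples.
    """
    total = sum(count for *_, count in crosslinks)
    if total == 0:
        return []
    return [
        (chrom, start, end, strand, count * 1_000_000 / total)
        for chrom, start, end, strand, count in crosslinks
    ]

sites = [
    ("chr1", 100, 101, "+", 3),
    ("chr1", 250, 251, "+", 1),
    ("chr2", 500, 501, "-", 4),
]
for site in cpm_normalise(sites):
    print(site)
# ('chr1', 100, 101, '+', 375000.0)
# ('chr1', 250, 251, '+', 125000.0)
# ('chr2', 500, 501, '-', 500000.0)
```

Because CPM values always sum to one million, samples sequenced to different depths can be compared on the same scale in a genome browser.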

Example Usage

Basic CLIP-Seq Analysis

nextflow run goodwright/clipseq \
  --samplesheet samples.csv \
  --fasta genome.fa \
  --gtf annotation.gtf \
  --smrna_fasta rRNA_tRNA.fa \
  --outdir results \
  -profile docker

Analysis with UMI in Reads

nextflow run goodwright/clipseq \
  --samplesheet samples.csv \
  --fasta genome.fa \
  --gtf annotation.gtf \
  --smrna_fasta rRNA_tRNA.fa \
  --run_move_umi_to_header true \
  --move_umi "NNNNNN" \
  --outdir results \
  -profile docker

Analysis without UMI Deduplication

nextflow run goodwright/clipseq \
  --samplesheet samples.csv \
  --fasta genome.fa \
  --gtf annotation.gtf \
  --smrna_fasta rRNA_tRNA.fa \
  --run_umi_dedup false \
  --outdir results \
  -profile docker
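
The three commands above differ only in their UMI options, so when launching many runs it can help to assemble the command from a dictionary of parameters. The helper below is a hypothetical convenience wrapper, not something shipped with the pipeline. It uses only the flags shown in this document and prints the command rather than executing it.

```python
import shlex

def build_clipseq_command(params, profile="docker"):
    """Build the argument list for a goodwright/clipseq run from a dict of parameters."""
    cmd = ["nextflow", "run", "goodwright/clipseq"]
    for name, value in params.items():
        if isinstance(value, bool):
            # Nextflow parameters are given as lowercase true/false on the command line.
            value = "true" if value else "false"
        cmd += [f"--{name}", str(value)]
    cmd += ["-profile", profile]
    return cmd

params = {
    "samplesheet": "samples.csv",
    "fasta": "genome.fa",
    "gtf": "annotation.gtf",
    "smrna_fasta": "rRNA_tRNA.fa",
    "run_umi_dedup": False,
    "outdir": "results",
}
print(shlex.join(build_clipseq_command(params)))
```

The returned list can be passed to `subprocess.run` directly, which avoids shell quoting issues with paths that contain spaces.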

Tips and Best Practices

Sample Preparation

  • Use descriptive sample names: proteinName_cellType_condition_replicate
  • Ensure consistent barcode design across experiments
  • Include appropriate input/control samples when possible

UMI Handling

  • Verify UMI location (in read vs header) for public data
  • Check SRA data carefully - headers may be stripped
  • Use appropriate UMI pattern (e.g., 'NNNNNN' for 6nt UMI)
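
One quick way to see whether UMIs are already in the headers is to look for the separator followed by a run of bases. The sketch below assumes the default `rbc:` separator and takes a list of header lines; the function name is hypothetical.

```python
import re

def fraction_with_header_umi(headers, separator="rbc:"):
    """Return the fraction of FASTQ headers that contain `separator` followed by bases."""
    if not headers:
        return 0.0
    pattern = re.compile(re.escape(separator) + r"[ACGTN]+")
    hits = sum(1 for header in headers if pattern.search(header))
    return hits / len(headers)

headers = [
    "@read1rbc:ACGTAC",
    "@read2rbc:TTGCAA",
    "@SRR000001.3 3 length=50",  # SRA-style header with no UMI
]
print(fraction_with_header_umi(headers))
```

A fraction near zero suggests the UMI is still in the read (or was lost), in which case --run_move_umi_to_header and --move_umi are likely needed.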

Genome Preparation

  • Use soft-masked genomes for better PEKA repeat filtering
  • Ensure GTF annotation matches genome version
  • Include comprehensive rRNA/tRNA sequences for pre-mapping
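
Whether a genome is soft-masked can be checked by measuring the share of lowercase bases. This is a minimal sketch over an iterable of FASTA lines, so an open file handle works too. A result of 0.0 suggests the file is unmasked or hard-masked.

```python
def softmasked_fraction(fasta_lines):
    """Return the fraction of sequence characters that are lowercase (soft-masked)."""
    lower = total = 0
    for line in fasta_lines:
        if line.startswith(">"):
            continue  # skip sequence headers
        seq = line.strip()
        total += len(seq)
        lower += sum(1 for base in seq if base.islower())
    return lower / total if total else 0.0

fasta = [">chr1", "ACGTacgt", "NNNNacgt"]
print(softmasked_fraction(fasta))  # 0.5
```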

Parameter Optimization

  • Adjust --paraclu_min_value for sparse data or low complexity samples
  • Modify --trim_length based on read length distribution
  • Tune alignment stringency via --star_params for difficult samples
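
When choosing --trim_length, it helps to know how many reads fall below a candidate threshold. The sketch below tallies read lengths from uncompressed four-line FASTQ records; the function names are hypothetical, and it assumes that reads shorter than the threshold would be discarded.

```python
from collections import Counter

def read_length_counts(fastq_lines):
    """Count read lengths; the sequence is the second line of each 4-line record."""
    return Counter(
        len(line.strip()) for i, line in enumerate(fastq_lines) if i % 4 == 1
    )

def fraction_below(counts, trim_length):
    """Return the fraction of reads shorter than `trim_length`."""
    total = sum(counts.values())
    short = sum(n for length, n in counts.items() if length < trim_length)
    return short / total if total else 0.0

fastq = [
    "@r1", "ACGTACGT", "+", "IIIIIIII",
    "@r2", "ACGTACGTACGT", "+", "IIIIIIIIIIII",
    "@r3", "A" * 30, "+", "I" * 30,
]
counts = read_length_counts(fastq)
print(fraction_below(counts, 10))  # one of three reads is shorter than 10 nt
```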

On Flow Platform

  • Run "Prepare CLIP-Seq" first if its outputs are not already available for your species
  • Use the CLIP annotation template for multiplexed samples
  • Samples with identical Group and Replicate values will be merged
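
Because merging is driven by the Group and Replicate values, it is worth previewing which samples would be combined before a run. This is a minimal sketch over sample sheet CSV text with the columns described earlier; the function name is hypothetical.

```python
import csv
import io
from collections import defaultdict

def merge_groups(samplesheet_text):
    """Return {(group, replicate): [samples]} for every combination shared by 2+ samples."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(samplesheet_text)):
        groups[(row["group"], row["replicate"])].append(row["sample"])
    return {key: names for key, names in groups.items() if len(names) > 1}

sheet = """sample,fastq_1,fastq_2,group,replicate
WT_lane1,WT_lane1.fastq.gz,,wildtype,1
WT_lane2,WT_lane2.fastq.gz,,wildtype,1
MUT_rep1,MUT_rep1.fastq.gz,,mutant,1
"""
print(merge_groups(sheet))
# {('wildtype', '1'): ['WT_lane1', 'WT_lane2']}
```

An unexpected entry here usually means a replicate number was reused by mistake.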

Troubleshooting

Common Issues

Issue: Low number of unique crosslinks

  • Solution: Check UMI handling parameters match library preparation
  • Verify adapter trimming is removing all adapter sequences
  • Consider adjusting deduplication strategy

Issue: High percentage of reads mapping to rRNA/tRNA

  • Solution: Verify library preparation protocol quality
  • Check if rRNA depletion was performed
  • Ensure comprehensive small RNA reference file

Issue: No peaks detected

  • Solution: Lower --paraclu_min_value threshold
  • Check crosslink file has sufficient coverage
  • Verify correct strand information in analysis
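
Coverage and strand balance can be inspected directly from a crosslink BED file. The sketch below assumes tab-separated BED6 lines with the crosslink count in column 5 and the strand in column 6; confirm this layout against your own output files before relying on it.

```python
def summarise_crosslinks(bed_lines):
    """Return the number of crosslink sites and the summed score on each strand."""
    totals = {"+": 0.0, "-": 0.0}
    sites = 0
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines and BED header lines
        fields = line.rstrip("\n").split("\t")
        score, strand = float(fields[4]), fields[5]
        totals[strand] = totals.get(strand, 0.0) + score
        sites += 1
    return {"sites": sites, "plus": totals["+"], "minus": totals["-"]}

bed = [
    "chr1\t100\t101\t.\t3\t+",
    "chr1\t250\t251\t.\t2\t+",
    "chr2\t500\t501\t.\t2\t-",
]
print(summarise_crosslinks(bed))
# {'sites': 3, 'plus': 5.0, 'minus': 2.0}
```

Very few sites overall points to a coverage problem, while nearly all signal on the strand opposite to annotated genes hints at a strandedness mismatch.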

Issue: Pipeline fails at PEKA analysis

  • Solution: Ensure genome is soft-masked
  • Check for sufficient crosslinks in peak regions
  • Try running without PEKA using skip options

Issue: Memory errors during alignment

  • Solution: Reduce STAR --limitBAMsortRAM parameter
  • Process samples in smaller batches
  • Increase memory allocation in config

