CLIP-Seq Pipeline

Overview

The CLIP-Seq pipeline (goodwright/clipseq v1.0dev) enables high-resolution studies of RNA binding protein (RBP)-RNA interactions at transcriptomic scale. Flow supports various forms of single-end CLIP data including variants of iCLIP (e.g., irCLIP, iCLIP2, iiCLIP) and eCLIP.

The pipeline is maintained by Goodwright in collaboration with the Ule lab and the developers of the nf-core/clipseq pipeline.

Pipeline Summary

The CLIP-Seq pipeline processes your data through these major steps:

Quality Control & Preprocessing - Initial FastQC, adapter trimming and quality filtering
UMI Extraction - Move UMIs from reads to headers for deduplication
Pre-alignment - Map to rRNA/tRNA sequences to filter contaminants
Genome Alignment - Align reads that did not map to rRNA/tRNA to the reference genome with STAR
Deduplication - UMI-based deduplication to identify unique crosslinks
Peak Calling - Multiple algorithms (Paraclu, Clippy, iCount) for peak detection
Motif Analysis - PEKA analysis for enriched k-mer motifs
Quality Reports - Comprehensive MultiQC report with CLIP-specific metrics
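
The deduplication step can be pictured with a simplified sketch. It assumes each aligned read has been reduced to chromosome, strand, 5' start position and UMI (the `AlignedRead` type is hypothetical, not the pipeline's internal format) and it collapses exact duplicates only. Production deduplication tools usually also tolerate sequencing errors in the UMI, which this sketch does not attempt.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class AlignedRead(NamedTuple):
    """Hypothetical minimal view of an aligned read."""
    chrom: str
    strand: str
    start: int  # 5' end of the read on the genome
    umi: str


def deduplicate(reads: Iterable[AlignedRead]) -> list[AlignedRead]:
    """Keep the first read seen for each (chrom, strand, start, UMI) combination."""
    seen = set()
    unique = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            unique.append(read)
    return unique


def crosslink_counts(reads: Iterable[AlignedRead]) -> Counter:
    """Count deduplicated reads per start position."""
    return Counter((r.chrom, r.strand, r.start) for r in deduplicate(reads))


reads = [
    AlignedRead("chr1", "+", 100, "ACGTAC"),
    AlignedRead("chr1", "+", 100, "ACGTAC"),  # same position and UMI: treated as a PCR duplicate
    AlignedRead("chr1", "+", 100, "TTGCAA"),  # same position, different UMI: kept
    AlignedRead("chr1", "-", 250, "ACGTAC"),
]
unique_reads = deduplicate(reads)
counts = crosslink_counts(reads)
```

Two reads at the same position survive only when their UMIs differ, which is why the UMI must be available (in the read or the header) before this step.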

Input Requirements

Sample Sheet Format

The pipeline requires a CSV sample sheet with the following columns:

Column	Description	Required
`sample`	Unique sample identifier	Yes
`fastq_1`	Path to FASTQ file	Yes
`fastq_2`	Path to paired read file	No (leave empty; only single-end data is supported)
`group`	Experimental group name	Yes
`replicate`	Replicate identifier	Yes

Example sample sheet:

sample,fastq_1,fastq_2,group,replicate
WT_rep1,WT_rep1.fastq.gz,,wildtype,1
WT_rep2,WT_rep2.fastq.gz,,wildtype,2
MUT_rep1,MUT_rep1.fastq.gz,,mutant,1
MUT_rep2,MUT_rep2.fastq.gz,,mutant,2
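
Before launching a run, it can help to check the sample sheet against the column rules above. The following is a minimal sketch using only the Python standard library; `validate_samplesheet` is a hypothetical helper, not part of the pipeline, and the checks it performs (required columns present and non-empty, unique `sample` values, empty `fastq_2`) are a reading of the table above rather than the pipeline's own validation logic.

```python
import csv
import io

REQUIRED_COLUMNS = ["sample", "fastq_1", "group", "replicate"]


def validate_samplesheet(handle) -> list[str]:
    """Return a list of problems found; an empty list means no problems were detected."""
    problems = []
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]
    seen = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for col in REQUIRED_COLUMNS:
            if not (row[col] or "").strip():
                problems.append(f"line {line_no}: empty '{col}'")
        if (row.get("fastq_2") or "").strip():
            problems.append(f"line {line_no}: fastq_2 set, but only single-end data is supported")
        if row["sample"] in seen:
            problems.append(f"line {line_no}: duplicate sample '{row['sample']}'")
        seen.add(row["sample"])
    return problems


example = """sample,fastq_1,fastq_2,group,replicate
WT_rep1,WT_rep1.fastq.gz,,wildtype,1
WT_rep2,WT_rep2.fastq.gz,,wildtype,2
"""
issues = validate_samplesheet(io.StringIO(example))

# A deliberately broken sheet: empty fastq_1 and a repeated sample name.
bad = validate_samplesheet(
    io.StringIO("sample,fastq_1,group,replicate\nA,,g,1\nA,a.fq.gz,g,2\n")
)
```

For a file on disk, pass `open("samples.csv", newline="")` instead of the `io.StringIO` object.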

Additional Requirements

Reference Genome: Soft-masked FASTA file (recommended for PEKA)
Gene Annotation: GTF file matching the genome version
Small RNA Reference: FASTA file containing rRNA/tRNA sequences

Key Parameters

Required Parameters

Parameter	Description	Default
`--samplesheet`	Path to sample sheet CSV file	Required
`--fasta`	Path to reference genome FASTA	Required
`--gtf`	Path to gene annotation GTF	Required
`--smrna_fasta`	Path to small RNA FASTA	Required
`--outdir`	Output directory for results	./results

UMI Processing Parameters

Parameter	Description	Default
`--run_move_umi_to_header`	Move UMI from read to header	false
`--move_umi`	UMI pattern (e.g., 'NNNNNN')	null
`--umi_separator`	Delimiter for UMI in headers	rbc:
`--run_umi_dedup`	Enable UMI-based deduplication	true
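
To illustrate what moving a UMI to the header means, here is a minimal sketch for a single FASTQ record. It assumes the UMI occupies the first `len(pattern)` bases of the read and that the separator plus UMI is appended directly to the read ID; the exact header layout the pipeline produces is not specified here, so treat `move_umi_to_header` as a hypothetical illustration only.

```python
def move_umi_to_header(header: str, seq: str, qual: str,
                       pattern: str = "NNNNNN", separator: str = "rbc:"):
    """Strip a 5' UMI of len(pattern) bases and append it to the read ID."""
    n = len(pattern)
    if len(seq) < n:
        raise ValueError("read is shorter than the UMI pattern")
    umi, rest_seq, rest_qual = seq[:n], seq[n:], qual[n:]
    # Keep any comment after the first space in the header unchanged.
    read_id, _, comment = header.partition(" ")
    new_header = f"{read_id}{separator}{umi}"
    if comment:
        new_header += f" {comment}"
    return new_header, rest_seq, rest_qual


header, seq, qual = move_umi_to_header(
    "@read1 1:N:0", "ACGTACGGGTTTAAAC", "IIIIIIFFFFFFFFFF"
)
```

The sequence and quality strings are trimmed by the same number of bases so the record stays valid FASTQ.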

Alignment Parameters

Parameter	Description	Default
`--trim_length`	Minimum read length after trimming	10
`--bowtie_params`	Bowtie pre-mapping parameters	-v 2 -m 100 --norc --best --strata
`--star_params`	STAR alignment parameters	See docs

Peak Calling Parameters

Parameter	Description	Default
`--paraclu_min_value`	Minimum value for Paraclu clustering	10
`--clippy_params`	Additional Clippy parameters	""
`--icount_peaks_params`	Additional iCount parameters	""
`--peka_params`	Additional PEKA parameters	""

Output Control Parameters

Parameter	Description	Default
`--save_reference`	Save generated reference files	true
`--save_indexes`	Save alignment indexes	true
`--save_merged_fastq`	Save merged FASTQ files	true
`--save_trimmed`	Save trimmed reads	true
`--save_align_intermed`	Save intermediate alignments	true

Pipeline Outputs

Directory Structure

results/
├── 00_genome/          # Reference files and indexes
├── 01_prealign/        # Trimming and QC reports
├── 02_alignment/       # Alignment files
├── 03_filt_dedup/      # Filtered and deduplicated BAMs
├── 04_crosslinks/      # Crosslink positions and coverage
├── 05_peak_calling/    # Peak calls from multiple algorithms
└── 06_reports/         # QC and summary reports

Output Files

Crosslink Files
- *.genome.bed: Genomic crosslink positions
- *.norm.genome.bedgraph: Normalized coverage (CPM) for genome browser
- *.transcript.bed: Transcriptomic crosslink positions

Peak Calls
- clippy/*.clippy.bed: Clippy peak calls
- paraclu/*.paraclu.peaks.bed: Paraclu clusters
- icount/*.significant_sites.bed: iCount significant crosslinks

Analysis Summaries
- icount/*summary_gene.tsv: Gene-level crosslink counts
- icount/*summary_type.tsv: RNA type distribution
- icount/*rnamaps.pdf: RNA maps around landmarks

Motif Analysis
- peka/*_enrichment.pdf: PEKA k-mer enrichment plots
- peka/*_pwm.txt: Position weight matrices

Quality Reports
- multiqc/multiqc_report.html: Comprehensive QC report
- clipqc/*.json: CLIP-specific quality metrics
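
The CPM normalization mentioned for the bedgraph files can be sketched as follows. This assumes counts are scaled against the total number of crosslink events in the sample; the denominator the pipeline actually uses is not stated here, and `counts_per_million` is a hypothetical helper.

```python
def counts_per_million(counts: dict) -> dict:
    """Scale raw crosslink counts so that they sum to one million."""
    total = sum(counts.values())
    if total == 0:
        return {pos: 0.0 for pos in counts}
    return {pos: count * 1_000_000 / total for pos, count in counts.items()}


# Keys are (chromosome, position, strand); values are raw crosslink counts.
raw = {("chr1", 100, "+"): 3, ("chr1", 105, "+"): 1, ("chr2", 40, "-"): 6}
cpm = counts_per_million(raw)
```

Scaling by library size makes tracks from samples of different depth comparable in a genome browser.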

Example Usage

Basic CLIP-Seq Analysis

nextflow run goodwright/clipseq \
  --samplesheet samples.csv \
  --fasta genome.fa \
  --gtf annotation.gtf \
  --smrna_fasta rRNA_tRNA.fa \
  --outdir results \
  -profile docker

Analysis with UMI in Reads

nextflow run goodwright/clipseq \
  --samplesheet samples.csv \
  --fasta genome.fa \
  --gtf annotation.gtf \
  --smrna_fasta rRNA_tRNA.fa \
  --run_move_umi_to_header true \
  --move_umi "NNNNNN" \
  --outdir results \
  -profile docker

Analysis without UMI Deduplication

nextflow run goodwright/clipseq \
  --samplesheet samples.csv \
  --fasta genome.fa \
  --gtf annotation.gtf \
  --smrna_fasta rRNA_tRNA.fa \
  --run_umi_dedup false \
  --outdir results \
  -profile docker
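
When launching many runs from a script, the same command lines can be assembled programmatically. The sketch below only builds the argument list from the parameters documented above; `build_command` is a hypothetical helper, and it does not invoke Nextflow itself.

```python
import shlex


def build_command(params: dict, profile: str = "docker") -> list[str]:
    """Assemble an argv list for `nextflow run goodwright/clipseq`."""
    argv = ["nextflow", "run", "goodwright/clipseq"]
    for name, value in params.items():
        if isinstance(value, bool):
            value = "true" if value else "false"  # lowercase, as in the examples above
        argv += [f"--{name}", str(value)]
    argv += ["-profile", profile]
    return argv


argv = build_command({
    "samplesheet": "samples.csv",
    "fasta": "genome.fa",
    "gtf": "annotation.gtf",
    "smrna_fasta": "rRNA_tRNA.fa",
    "run_umi_dedup": False,
    "outdir": "results",
})
command_line = shlex.join(argv)  # shell-safe string, e.g. for logging
```

The list form can be passed to `subprocess.run(argv, check=True)` without going through a shell.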

Tips and Best Practices

Sample Preparation

Use descriptive sample names: proteinName_cellType_condition_replicate
Ensure consistent barcode design across experiments
Include appropriate input/control samples when possible

UMI Handling

Verify UMI location (in read vs header) for public data
Check SRA data carefully: read headers may have been stripped, removing any UMI stored there
Use appropriate UMI pattern (e.g., 'NNNNNN' for 6nt UMI)
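
One way to verify where the UMI sits is to inspect a sample of read headers. The sketch below assumes that a UMI already moved to the header appears at the end of the read ID after the `rbc:` separator (the default shown in the parameter table); other datasets may use a different convention. The helper names and the read IDs are hypothetical.

```python
import re


def header_umi(header: str, separator: str = "rbc:"):
    """Return the UMI found after `separator` at the end of the read ID, or None."""
    read_id = header.split()[0]
    match = re.search(re.escape(separator) + r"([ACGTN]+)$", read_id)
    return match.group(1) if match else None


def fraction_with_umi(headers, separator: str = "rbc:") -> float:
    """Share of headers that already carry a UMI tag."""
    headers = list(headers)
    if not headers:
        return 0.0
    tagged = sum(header_umi(h, separator) is not None for h in headers)
    return tagged / len(headers)


sample_headers = [
    "@SRR0000000.1rbc:ACGTAC 1 length=50",  # UMI present in header
    "@SRR0000000.2 2 length=50",            # no UMI tag
]
umi = header_umi(sample_headers[0])
fraction = fraction_with_umi(sample_headers)
```

A fraction near zero suggests the UMI is still in the read (or was lost), in which case `--run_move_umi_to_header` and `--move_umi` are the relevant parameters.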

Genome Preparation

Use soft-masked genomes for better PEKA repeat filtering
Ensure GTF annotation matches genome version
Include comprehensive rRNA/tRNA sequences for pre-mapping

Parameter Optimization

Lower --paraclu_min_value for sparse data or low-complexity samples
Set --trim_length according to the read length distribution after trimming
Tune alignment stringency via --star_params for difficult samples
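
To choose a sensible `--trim_length`, it can help to look at how many reads a given threshold would retain. This is a minimal sketch assuming a standard four-line-per-record FASTQ stream; both function names are hypothetical.

```python
import io
from collections import Counter


def read_length_histogram(fastq_handle) -> Counter:
    """Count reads per sequence length in an uncompressed FASTQ text stream."""
    lengths = Counter()
    for i, line in enumerate(fastq_handle):
        if i % 4 == 1:  # the second line of each 4-line record is the sequence
            lengths[len(line.strip())] += 1
    return lengths


def fraction_kept(lengths: Counter, trim_length: int) -> float:
    """Fraction of reads at least `trim_length` bases long."""
    total = sum(lengths.values())
    kept = sum(n for length, n in lengths.items() if length >= trim_length)
    return kept / total if total else 0.0


# Three synthetic reads of length 8, 12 and 20.
fastq_text = "".join(
    f"@r{i}\n{'A' * n}\n+\n{'I' * n}\n" for i, n in enumerate([8, 12, 20], start=1)
)
hist = read_length_histogram(io.StringIO(fastq_text))
kept_at_10 = fraction_kept(hist, 10)
```

For gzipped files, `gzip.open(path, "rt")` yields a compatible text stream.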

On Flow Platform

Run "Prepare CLIP-Seq" first if its outputs are not already available for your species
Use the CLIP annotation template for multiplexed samples
Samples with identical Group and Replicate values will be merged
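
Because merging is driven by the Group and Replicate values, it is worth previewing which samples would be combined. The sketch below assumes these correspond to the `group` and `replicate` columns of the sample sheet; `merge_groups` and `merged_only` are hypothetical helpers and the sample names are made up.

```python
import csv
import io
from collections import defaultdict


def merge_groups(handle) -> dict:
    """Map each (group, replicate) pair to the sample names that share it."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle):
        groups[(row["group"], row["replicate"])].append(row["sample"])
    return dict(groups)


def merged_only(groups: dict) -> dict:
    """Keep only the pairs shared by more than one sample."""
    return {key: names for key, names in groups.items() if len(names) > 1}


sheet = """sample,fastq_1,fastq_2,group,replicate
WT_lane1,WT_lane1.fastq.gz,,wildtype,1
WT_lane2,WT_lane2.fastq.gz,,wildtype,1
MUT_rep1,MUT_rep1.fastq.gz,,mutant,1
"""
to_merge = merged_only(merge_groups(io.StringIO(sheet)))
```

An unexpected entry in `to_merge` usually points to a copy-paste error in the replicate column.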

Troubleshooting

Common Issues

Issue: Low number of unique crosslinks

Solution: Check UMI handling parameters match library preparation
Verify adapter trimming is removing all adapter sequences
Consider adjusting deduplication strategy

Issue: High percentage of reads mapping to rRNA/tRNA

Solution: Verify library preparation protocol quality
Check if rRNA depletion was performed
Ensure comprehensive small RNA reference file

Issue: No peaks detected

Solution: Lower --paraclu_min_value threshold
Check crosslink file has sufficient coverage
Verify correct strand information in analysis
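
The coverage and strand checks above can be done quickly on a crosslink BED file. This sketch assumes a tab-separated six-column BED in which the fifth column holds the crosslink count and the sixth the strand; that layout is an assumption about the `*.genome.bed` files, and `summarise_crosslinks` is a hypothetical helper.

```python
import io
from collections import Counter


def summarise_crosslinks(bed_handle) -> dict:
    """Summarise a crosslink BED: number of sites and total counts, split by strand."""
    sites = Counter()
    events = Counter()
    for line in bed_handle:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines and browser/track headers
        fields = line.rstrip("\n").split("\t")
        score, strand = float(fields[4]), fields[5]
        sites[strand] += 1
        events[strand] += score
    return {"sites": dict(sites), "events": dict(events)}


bed_text = (
    "chr1\t100\t101\t.\t3\t+\n"
    "chr1\t105\t106\t.\t1\t+\n"
    "chr2\t40\t41\t.\t6\t-\n"
)
summary = summarise_crosslinks(io.StringIO(bed_text))
```

Very low totals suggest insufficient coverage for peak calling, while a strand with no sites at all may indicate a strand-handling problem upstream.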

Issue: Pipeline fails at PEKA analysis

Solution: Ensure genome is soft-masked
Check for sufficient crosslinks in peak regions
Try running without PEKA using skip options

Issue: Memory errors during alignment

Solution: Adjust the STAR --limitBAMsortRAM parameter so that BAM sorting fits within the memory allocated to the job
Process samples in smaller batches
Increase memory allocation in config

Additional Resources

Pipeline Documentation: GitHub - goodwright/clipseq
nf-core CLIP-seq: nf-co.re/clipseq
Ule Lab Resources: ulelab.info
Support: Contact Flow support for pipeline-specific issues
Citation: Data Science Issues in Studying Protein–RNA Interactions with CLIP Technologies