CLIP-Seq Pipeline
Overview
The CLIP-Seq pipeline (goodwright/clipseq v1.0dev) enables high-resolution, transcriptome-wide studies of interactions between RNA-binding proteins (RBPs) and RNA. Flow supports several forms of single-end CLIP data, including iCLIP variants (e.g., irCLIP, iCLIP2, iiCLIP) and eCLIP.
The pipeline is maintained by Goodwright in collaboration with the Ule lab and the developers of the nf-core/clipseq pipeline.
Pipeline Summary
The CLIP-Seq pipeline processes your data through these major steps:
- Quality Control & Preprocessing - Initial FastQC, adapter trimming and quality filtering
- UMI Extraction - Move UMIs from reads to headers for deduplication
- Pre-alignment - Map to rRNA/tRNA sequences to filter contaminants
- Genome Alignment - Align reads that did not map to rRNA/tRNA to the reference genome with STAR
- Deduplication - UMI-based deduplication to identify unique crosslinks
- Peak Calling - Multiple algorithms (Paraclu, Clippy, iCount) for peak detection
- Motif Analysis - PEKA analysis for enriched k-mer motifs
- Quality Reports - Comprehensive MultiQC report with CLIP-specific metrics
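The deduplication step in the list above can be illustrated with a minimal sketch. It assumes the simplest possible rule: reads that share chromosome, alignment start, strand and UMI are treated as copies of one cDNA. The tuple layout and the function name `dedup_by_umi` are hypothetical, and the pipeline's actual deduplication tool may also tolerate sequencing errors in the UMI, which this sketch does not.

```python
from collections import defaultdict


def dedup_by_umi(reads):
    """Count unique cDNAs per alignment start after exact-match UMI collapsing.

    `reads` is an iterable of (chrom, start, strand, umi) tuples, a
    simplified, hypothetical stand-in for aligned BAM records.
    """
    seen = set()
    counts = defaultdict(int)
    for chrom, start, strand, umi in reads:
        key = (chrom, start, strand, umi)
        if key in seen:
            continue  # same position and UMI: treated as a PCR duplicate
        seen.add(key)
        counts[(chrom, start, strand)] += 1
    return dict(counts)


reads = [
    ("chr1", 100, "+", "ACGTAC"),
    ("chr1", 100, "+", "ACGTAC"),  # duplicate of the first read
    ("chr1", 100, "+", "TTGCAA"),  # same position, different UMI
    ("chr2", 250, "-", "ACGTAC"),
]
unique = dedup_by_umi(reads)
print(unique)
```

Here the two identical reads on chr1 collapse into one, while the read with a different UMI at the same position is kept as a separate event.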
Input Requirements
Sample Sheet Format
The pipeline requires a CSV sample sheet with the following columns:
Column | Description | Required |
---|---|---|
sample | Unique sample identifier | Yes |
fastq_1 | Path to FASTQ file | Yes |
fastq_2 | Path to paired read file; leave empty, as only single-end data is supported | No |
group | Experimental group name | Yes |
replicate | Replicate identifier | Yes |
Example sample sheet:
sample,fastq_1,fastq_2,group,replicate
WT_rep1,WT_rep1.fastq.gz,,wildtype,1
WT_rep2,WT_rep2.fastq.gz,,wildtype,2
MUT_rep1,MUT_rep1.fastq.gz,,mutant,1
MUT_rep2,MUT_rep2.fastq.gz,,mutant,2
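Before launching a run, it can help to check a sample sheet against the column rules in the table above. The following is a minimal sketch, not part of the pipeline: the function name `validate_samplesheet` and the wording of the messages are hypothetical, and the check that `fastq_2` is empty assumes the single-end restriction described in this document.

```python
import csv
import io

REQUIRED = ("sample", "fastq_1", "group", "replicate")


def validate_samplesheet(text):
    """Return a list of problems found in a sample sheet given as CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]

    problems = []
    seen = set()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        for col in REQUIRED:
            if not (row[col] or "").strip():
                problems.append(f"line {line_no}: empty {col}")
        name = row["sample"]
        if name in seen:
            problems.append(f"line {line_no}: duplicate sample {name}")
        seen.add(name)
        if (row.get("fastq_2") or "").strip():
            problems.append(f"line {line_no}: fastq_2 is set, expected single-end data")
    return problems


sheet = """sample,fastq_1,fastq_2,group,replicate
WT_rep1,WT_rep1.fastq.gz,,wildtype,1
WT_rep1,WT_rep2.fastq.gz,,wildtype,2
"""
problems = validate_samplesheet(sheet)
print(problems)
```

An empty list means the sheet passed these basic checks; it does not confirm that the FASTQ paths exist.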
Additional Requirements
- Reference Genome: Soft-masked FASTA file (recommended for PEKA)
- Gene Annotation: GTF file matching the genome version
- Small RNA Reference: FASTA file containing rRNA/tRNA sequences
Key Parameters
Required Parameters
Parameter | Description | Default |
---|---|---|
--samplesheet | Path to sample sheet CSV file | Required |
--fasta | Path to reference genome FASTA | Required |
--gtf | Path to gene annotation GTF | Required |
--smrna_fasta | Path to small RNA FASTA | Required |
--outdir | Output directory for results | ./results |
UMI Processing Parameters
Parameter | Description | Default |
---|---|---|
--run_move_umi_to_header | Move UMI from read to header | false |
--move_umi | UMI pattern (e.g., 'NNNNNN') | null |
--umi_separator | Delimiter for UMI in headers | rbc: |
--run_umi_dedup | Enable UMI-based deduplication | true |
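To make the UMI parameters concrete, here is a minimal sketch of what moving a UMI from the read into the header could look like. It rests on several assumptions that this document does not confirm: the UMI sits at the 5' end of the read, the pattern consists only of `N`, and the UMI is appended to the read ID after the separator. The function `move_umi_to_header` is hypothetical and is not the pipeline's implementation.

```python
def move_umi_to_header(header, seq, qual, pattern="NNNNNN", separator="rbc:"):
    """Move a 5' UMI from the read into the read ID (illustrative only)."""
    if set(pattern) != {"N"}:
        raise ValueError("this sketch only handles patterns made of N")
    n = len(pattern)
    if len(seq) < n:
        raise ValueError("read is shorter than the UMI pattern")

    umi = seq[:n]
    # Keep any comment after the first space; attach the UMI to the read ID.
    read_id, _, comment = header.partition(" ")
    new_header = f"{read_id}{separator}{umi}"
    if comment:
        new_header += " " + comment
    # Trim the UMI bases and their quality scores from the read.
    return new_header, seq[n:], qual[n:]


record = move_umi_to_header("@read1 1:N:0", "ACGTACGGGTTTAAA", "IIIIIIFFFFFFFFF")
print(record)
```

The sequence and quality strings are trimmed by the same number of bases so that they stay the same length.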
Alignment Parameters
Parameter | Description | Default |
---|---|---|
--trim_length | Minimum read length after trimming | 10 |
--bowtie_params | Bowtie pre-mapping parameters | -v 2 -m 100 --norc --best --strata |
--star_params | STAR alignment parameters | See docs |
Peak Calling Parameters
Parameter | Description | Default |
---|---|---|
--paraclu_min_value | Minimum value for Paraclu clustering | 10 |
--clippy_params | Additional Clippy parameters | "" |
--icount_peaks_params | Additional iCount parameters | "" |
--peka_params | Additional PEKA parameters | "" |
Output Control Parameters
Parameter | Description | Default |
---|---|---|
--save_reference | Save generated reference files | true |
--save_indexes | Save alignment indexes | true |
--save_merged_fastq | Save merged FASTQ files | true |
--save_trimmed | Save trimmed reads | true |
--save_align_intermed | Save intermediate alignments | true |
Pipeline Outputs
Directory Structure
results/
├── 00_genome/ # Reference files and indexes
├── 01_prealign/ # Trimming and QC reports
├── 02_alignment/ # Alignment files
├── 03_filt_dedup/ # Filtered and deduplicated BAMs
├── 04_crosslinks/ # Crosslink positions and coverage
├── 05_peak_calling/ # Peak calls from multiple algorithms
└── 06_reports/ # QC and summary reports
Output Files
Crosslink Files
- *.genome.bed: Genomic crosslink positions
- *.norm.genome.bedgraph: Normalized coverage (CPM) for genome browser
- *.transcript.bed: Transcriptomic crosslink positions
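The CPM normalization mentioned for the bedgraph files can be sketched as follows. This assumes the usual definition of counts per million, where each position's count is divided by the sample's total crosslink count and multiplied by one million. The exact denominator used by the pipeline is not stated here, and the function name `to_cpm` is hypothetical.

```python
def to_cpm(counts):
    """Scale raw crosslink counts to counts per million (CPM)."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]  # avoid division by zero for empty samples
    return [c * 1_000_000 / total for c in counts]


raw = [5, 15, 30]  # crosslink counts at three positions
cpm = to_cpm(raw)
print(cpm)
```

Because every value is scaled by the same sample total, CPM tracks can be compared between samples of different sequencing depth.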
Peak Calls
- clippy/*.clippy.bed: Clippy peak calls
- paraclu/*.paraclu.peaks.bed: Paraclu clusters
- icount/*.significant_sites.bed: iCount significant crosslinks
Analysis Summaries
- icount/*summary_gene.tsv: Gene-level crosslink counts
- icount/*summary_type.tsv: RNA type distribution
- icount/*rnamaps.pdf: RNA maps around landmarks
Motif Analysis
- peka/*_enrichment.pdf: PEKA k-mer enrichment plots
- peka/*_pwm.txt: Position weight matrices
Quality Reports
- multiqc/multiqc_report.html: Comprehensive QC report
- clipqc/*.json: CLIP-specific quality metrics
Example Usage
Basic CLIP-Seq Analysis
nextflow run goodwright/clipseq \
--samplesheet samples.csv \
--fasta genome.fa \
--gtf annotation.gtf \
--smrna_fasta rRNA_tRNA.fa \
--outdir results \
-profile docker
Analysis with UMI in Reads
nextflow run goodwright/clipseq \
--samplesheet samples.csv \
--fasta genome.fa \
--gtf annotation.gtf \
--smrna_fasta rRNA_tRNA.fa \
--run_move_umi_to_header true \
--move_umi "NNNNNN" \
--outdir results \
-profile docker
Analysis without UMI Deduplication
nextflow run goodwright/clipseq \
--samplesheet samples.csv \
--fasta genome.fa \
--gtf annotation.gtf \
--smrna_fasta rRNA_tRNA.fa \
--run_umi_dedup false \
--outdir results \
-profile docker
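Long command lines like those above can also be kept in a parameters file, since Nextflow accepts a JSON or YAML file through its `-params-file` option. The sketch below builds such a file's contents from the parameter names documented on this page. The file name `params.json` is a hypothetical choice, and the sketch assumes the pipeline reads these parameters in the standard Nextflow way.

```python
import json

# Parameter names come from the tables on this page, without the leading "--".
params = {
    "samplesheet": "samples.csv",
    "fasta": "genome.fa",
    "gtf": "annotation.gtf",
    "smrna_fasta": "rRNA_tRNA.fa",
    "run_move_umi_to_header": True,
    "move_umi": "NNNNNN",
    "outdir": "results",
}

# Text to save as params.json (hypothetical file name).
params_text = json.dumps(params, indent=2)

# Command that would use the saved file.
command = [
    "nextflow", "run", "goodwright/clipseq",
    "-params-file", "params.json",
    "-profile", "docker",
]

print(params_text)
print(" ".join(command))
```

Keeping the parameters in a file makes a run easier to repeat and to record alongside the results.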
Tips and Best Practices
Sample Preparation
- Use descriptive sample names: proteinName_cellType_condition_replicate
- Ensure consistent barcode design across experiments
- Include appropriate input/control samples when possible
UMI Handling
- Verify UMI location (in read vs header) for public data
- Check SRA data carefully - headers may be stripped
- Use appropriate UMI pattern (e.g., 'NNNNNN' for 6nt UMI)
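One way to check where the UMI sits in public data is to inspect a few FASTQ headers. The sketch below assumes a UMI that was already moved to the header appears at the end of the read ID after the separator (`rbc:` by default, as in the parameter table). This layout and the function name `umi_in_header` are assumptions, so compare against real headers from your data.

```python
import re


def umi_in_header(header, separator="rbc:"):
    """Return the UMI found after `separator` in a FASTQ header, or None."""
    fields = header.split()
    read_id = fields[0] if fields else ""
    # Look for the separator followed by bases at the end of the read ID.
    match = re.search(re.escape(separator) + r"([ACGTN]+)$", read_id)
    return match.group(1) if match else None


print(umi_in_header("@read1rbc:ACGTAC 1:N:0"))
print(umi_in_header("@SRR123456.1 1 length=50"))
```

If no header yields a UMI, the UMI is probably still in the read, or it was lost when the headers were rewritten.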
Genome Preparation
- Use soft-masked genomes for better PEKA repeat filtering
- Ensure GTF annotation matches genome version
- Include comprehensive rRNA/tRNA sequences for pre-mapping
Parameter Optimization
- Adjust --paraclu_min_value for sparse data or low complexity samples
- Modify --trim_length based on read length distribution
- Tune alignment stringency via --star_params for difficult samples
On Flow Platform
- Run "Prepare CLIP-Seq" first if not already available for your species
- Use the CLIP annotation template for multiplexed samples
- Samples with identical Group and Replicate values will be merged
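Because samples that share Group and Replicate values are merged, it is worth listing those combinations before submitting. This minimal sketch groups sample sheet rows by the two columns and reports any combination with more than one sample. The row dictionaries and the function name `find_merged_samples` are hypothetical illustrations, not part of Flow.

```python
from collections import defaultdict


def find_merged_samples(rows):
    """Group sample names by (group, replicate); keep groups with several samples."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[(row["group"], row["replicate"])].append(row["sample"])
    return {key: names for key, names in buckets.items() if len(names) > 1}


rows = [
    {"sample": "WT_lane1", "group": "wildtype", "replicate": "1"},
    {"sample": "WT_lane2", "group": "wildtype", "replicate": "1"},
    {"sample": "MUT_rep1", "group": "mutant", "replicate": "1"},
]
merged = find_merged_samples(rows)
print(merged)
```

An empty result means no samples would be merged. Any entry that was not intended points to a duplicated Group or Replicate value.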
Troubleshooting
Common Issues
Issue: Low number of unique crosslinks
- Solution: Check UMI handling parameters match library preparation
- Verify adapter trimming is removing all adapter sequences
- Consider adjusting deduplication strategy
Issue: High percentage of reads mapping to rRNA/tRNA
- Solution: Verify library preparation protocol quality
- Check if rRNA depletion was performed
- Ensure comprehensive small RNA reference file
Issue: No peaks detected
- Solution: Lower --paraclu_min_value threshold
- Check crosslink file has sufficient coverage
- Verify correct strand information in analysis
Issue: Pipeline fails at PEKA analysis
- Solution: Ensure genome is soft-masked
- Check for sufficient crosslinks in peak regions
- Try running without PEKA using skip options
Issue: Memory errors during alignment
- Solution: Reduce STAR --limitBAMsortRAM parameter
- Process samples in smaller batches
- Increase memory allocation in config
Additional Resources
- Pipeline Documentation: GitHub - goodwright/clipseq
- nf-core CLIP-seq: nf-co.re/clipseq
- Ule Lab Resources: ulelab.info
- Support: Contact Flow support for pipeline-specific issues
- Citation: Data Science Issues in Studying Protein–RNA Interactions with CLIP Technologies