Pipelines
CLIP-Seq in detail
CLIPSeq v1.0 📎
Full source code available on Github. Specific instructions for Flow usage can be found here. This DSL2 CLIP-Seq pipeline is written and maintained by Goodwright in collaboration with Ule lab and the developers of the DSL1 nf-core/clipseq pipeline.
Pipeline Summary
Cross-linking and immunoprecipitation followed by sequencing (CLIP) has allowed high resolution studies of RNA binding protein (RBP)-RNA interactions at transcriptomic scale.
This pipeline enables analysis of various forms of single-end CLIP data including variants of iCLIP (eg. irCLIP, iCLIP2, iiCLIP) and eCLIP (note that we have achieved comparable results using this pipeline to study reformatted paired-end eCLIP also). The pipeline currently doesn't support mutation calling and therefore might not be suitable for PAR-CLIP analysis, but we plan to include this at a future date.
Reads are first 3' end trimmed for adapters and sequence quality before being mapped using Bowtie to a "small RNA" index consisting of rRNA and tRNA sequences. Reads that remain unmapped are then mapped to the chosen species' genome. Aligned reads are then deduplicated according to unique molecular identifier (UMI) and start position. Start position minus one of deduplicated reads is taken as the crosslinking position and crosslink bed files are produced. Peaks are then called using several tools: Paraclu, Clippy and iCount. Summary files reporting transcriptomic region, RNA biotype and gene level crosslink counts are produced. RNA maps of crosslinks are drawn around important transcriptomic landmarks. Crosslinks and Clippy peaks are taken as input to PEKA, to determine postionally enriched kmer motifs. Finally a summary of quality control metrics is generated and presented as a MultiQC report.
For a detailed review of the considerations behind each analysis step you can consult "Data Science Issues in Studying Protein–RNA Interactions with CLIP Technologies".
Required Input
This pipeline requires demultiplexed fastq sample input and genome/annotation files produced as output by the "prepare clipseq" pipeline. We recommend using our demultiplex pipeline to produce these individual samples fastq.
Curated Outputs
These are what we would consider to be the most commonly used outputs of the pipeline, so on the Flow platform we present these as "curated ouputs". All of the outputs are listed in the "Pipeline in Detail" section below.
Normalised genome crosslink bedgraph Genomic crosslinks normalised to total sample crosslinks in bedgraph format for genome browser viewing. Produced by CROSSLINK_NORMCOVERAGE process.
Genomic peaks Genomic peaks generated by Clippy. Produced by CLIPPY_GENOME process.
Crosslink summary Crosslinks summarised by gene, type (eg. CDS, intron) and subtype (eg. lncRNA, mRNA). Produced by ICOUNT_SUMMARY process.
K-mer enrichment K-mer enrichment motif analysis performed by PEKA. PDF file with graphs showing K-mer occurrence distributions around thresholded crosslink sites. Produced by PEKA process.
Quality control report MultiQC report of important quality control metrics. Produced by MULTIQC process.
Commonly used parameters
Moving UMI from fastq reads to read header
If you are analysing demultiplexed sample files, then depending on where you have sourced your fastq from, the UMI and experimental barcode might still be present at the 5' end of reads, which will cause errors in downstream analysis. Note this is common for historical public data downloaded from ArrayExpress, for example.
To enable this option set run_move_umi_to_header = true
and ensure you provide the UMI format to move_umi
, for example move_umi='NNNNNN'
.
Changing the UMI delimiter
Depending on how you demultiplexed your reads, you may need to change the delimiter that UMICollapse uses to search for UMIs in your aligned reads. We default to using "rbc:" which is the output of Ultraplex, the demultiplexer we use in our demultiplexing pipeline.
To change, set umi_separator
.
Turning off deduplication
For older data with very short or no UMIs at all, you may want to skip the UMI deduplication step.
To turn of deduplications, set run_umi_dedup = false
Changing other individual tool parameters
When you are working with data you're already familiar with you might have specific parameters in mind for certain tools, here are your options for changing:
- Minimum length of reads kept by Trim Galore! after trimming,
trim_length
, eg.10
- Bowtie parameters for pre-mapping,
bowtie_params
, eg."-v 2 -m 100 --norc --best --strata"
- STAR parameters for genomic mapping,
star_params
, eg."--outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1 --outSAMattributes All --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --outFilterType BySJout --alignIntronMin 20 --alignIntronMax 1000000 --outFilterScoreMin 10 --alignEndsType Extend5pOfRead1 --twopassMode Basic"
- Clippy peak calling parameters,
clippy_params
, eg."--window_size 50 --width 0.8"
- iCount peak calling parameters,
icount_peak_params
, eg."--dist 20 --slop 3"
- Paraclu minimum cut off value,
paraclu_min_value
, eg.30
- PEKA motif enrichment parameters,
peka_params
, eg."-p 0 -sr genome -sub False"
Pipeline in Detail
This detailed description will present each subworkflow
/module run in the pipeline and give detailed information of inputs and outputs with filenames, and default parameters.
PREPARE_CLIPSEQ
Prepares all of the alignment indexes and annotation files required by the pipeline, this only needs to be run once per genome, after this the output files can be provided to the pipeline to avoid this subworkflow running.
PARSE_FASTQ_INPUT
Checks samplesheet input, then formats it and fastq files into an appropriate channel format for Nextflow.
i. SAMPLE_BASE_SAMPLESHEET_CHECK: This module checks that the samplesheet input meets requirements. It outputs a sanitised samplesheet in a ".csv" file.
ii. CAT_FASTQ: This module concatenates fastq files labelled as being the same sample.
The output of this subworkflow is a Nextflow channel containing a tuple of metadata and fastq files.
(Optional)
UMI_TO_HEADER
FASTQC_TRIMGALORE
i. FASTQC: Checks sequencing quality and adapter content of raw reads.
⚪Input - sample
fastq
. 🟢Output - sample_fastqc.htmlfastqc_html
, sample_fastqc.zipfastqc_zip
.ii. TRIMGALORE: Trims for 3'adapters, read quality at 3' end and removes reads that become too short after this. Manual.
🟣Parameters -
--fastqc --length ${params.trim_length} -q20
Default params.trim_length = 10.⚪Input - sample
fastq
. 🟢Output - sample_trimmed.fq.gzfastq
, sample_trimming_report.txttrim_log
, sample_trimmed_fastqc.htmlfastqc_trim_html
, sample_trimmed_fastqc.zipfastqc_trim_zip
.RNA_ALIGN
i. BOWTIE_ALIGN : Aligns reads to rRNA, tRNA sequences referred to as small RNA, smRNA. Manual.
🟣Parameters -
${params.bowtie_params}
Default params.bowtie_params = "-v 2 -m 100 --norc --best --strata"⚪Input - sample_trimmed.fq.gz
reads
, bowtie (folder with bowtie index)index
. 🟢Output - sample.bambam
, sample.outlog
, sample.unmapped.fastq.gzfastq
.ii. STAR_ALIGN : Aligns unmapped reads from smRNA Bowtie mapping to the genome, also referred to as target. Manual.
🟣Parameters -
--readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --quantMode TranscriptomeSAM ${params.star_params}
Default params.star_params = " --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1 --outSAMattributes All --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --outFilterType BySJout --alignIntronMin 20 --alignIntronMax 1000000 --outFilterScoreMin 10 --alignEndsType Extend5pOfRead1 --twopassMode Basic"⚪Input - sample.unmapped.fastq.gz
reads
, star (folder with star index)index
. 🟢Output - sample.Aligned.sortedByCoord.out.bambam_sorted
, sample.Aligned.toTranscriptome.out.bambam_transcript
, sample.unmapped_1.fastq.gzfastq
, sample.Log.final.outlog_final
, sample.Log.outlog_out
, sample.Log.progress.outlog_progress
i. SAMTOOLS_INDEX_GENOME : Indexes genome BAM file Manual.
⚪Input - sample.Aligned.sortedByCoord.out.bam
bam
, 🟢Output - sample.Aligned.sortedByCoord.out.bam.baibai
.ii. SAMTOOLS_SORT_TRANSCRIPT : Sorts transcript bam file. Manual.
⚪Input - sample.Aligned.toTranscriptome.out.bam
bam
, 🟢Output - sample.Aligned.toTranscriptome.out.bam.baibai
.iii. SAMTOOLS_INDEX_TRANSCRIPT : Indexes transcript bam file. Manual.
⚪Input - sample.Aligned.sortedByCoord.out.bam
bam
, 🟢Output - sample.Aligned.sortedByCoord.out.bam.baibai
.FILTER_TRANSCRIPTS
BAM_SORT_STATS_SAMTOOLS_TRANSCRIPT
GENOME_DEDUP
TRANSCRIPT_DEDUP
CALC_GENOME_CROSSLINKS
CALC_TRANSCRIPT_CROSSLINKS
CLIPPY
PARACLU_ANALYSE
ICOUNT_ANALYSE
This subworkflow contains the following processes:i. ICOUNT_SUMMARY
ii. ICOUNT_RNAMAPS
iii. ICOUNT_SIGXLS
iv. ICOUNT_PEAKS
PEKA
CLIPSEQ_CLIPQC
MULTIQC
Common Issues
Analysis of publicly available data
Analysis of publicly available data, especially older data can present issues as sometimes the barcode structure or structure of the uploaded fastq is not obvious. Things to check are:
- Look at the first few reads in the fastq - is the random barcode (UMI) in the header? This tells us that the 5' barcode and 3' adapter have been removed from the read already, make sure to check what format the random barcode (UMI) is provided in, eg. "rbc:" and change the
umi_separator
parameter if you need to. - If the random barcode is not in the header then you can use the
run_move_umi_to_header = true
parameter. The structure of the barcodes should be provided in the manuscript, but this is not always the case. - Be careful for data in SRA, where the read header is entirely removed - this might have removed the information of random barcode without the manuscript authors noticing.
FAQ
What is the difference between this pipeline and nf-core/clipseq?
This pipeline is based on nf-core/clipseq but is updated in a few ways. For one, the Nextflow code has been updated from DSL1 to the modular DSL2. This pipeline also includes additional tools, namely PEKA for positional motif analysis and Clippy for peak calling.
I would like to contribute to this pipeline, or make a request for a new feature. Where can I do this?
To raise any issues or comments with the pipeline you can (in order of preference):
- Raise an issue in the GitHub repository
- Write to us in the Flow Slack
- Email charlotte.capitanchik@goodwright.com
Authors
This DSL2 CLIP-Seq pipeline is written and maintained by Goodwright in collaboration with Ule lab and the developers of the DSL1 nf-core/clipseq pipeline.