Running CLIP-Seq on Flow
Introduction
CLIP is a method to study protein-RNA interactions, with several possible experimental variables and optimisations that can be discussed on the CLIP forum. New questions can be asked in our Slack workspace in the channel #general-questions-for-clip-forum.
This DSL2 CLIP-Seq pipeline is written and maintained by Goodwright in collaboration with the Ule lab and the developers of the DSL1 nf-core/clipseq pipeline. You can view the source code here. This tutorial explains how to analyse your data using this pipeline on Flow.
There is a separate document containing detailed information about the pipeline itself.
Overview
To run your CLIP-Seq analysis on Flow, you will need to:
- Upload your sample fastq file, paying attention to whether it is still multiplexed or has already been demultiplexed.
- If your sample is multiplexed, you will need to fill in an annotation spreadsheet with metadata and upload this too.
- Check whether a suitable "Prepare Clipseq" run already exists for your species of interest. If not, you will first need to run that pipeline to generate all the genome indexes and annotations required for CLIP-Seq analysis. Its input is simply a genome fasta file and a matching annotation gtf. Note that PEKA can use a soft-masked genome fasta to filter repeats, so you may want to consider this when choosing a genome fasta file.
- If your sample is multiplexed, you will first need to run the demultiplexing pipeline. Note that, unlike on iMaps, the downstream CLIP-Seq pipeline is not triggered automatically afterwards.
- If you have uploaded demultiplexed samples you will need to run the CLIP-Seq pipeline yourself.
Note on running CLIP-Seq v1.0
Clicking on the Samples box opens a table with three columns: Sample, Group and Replicate. Click in the Sample column to select your uploaded samples. Together, Group and Replicate form the name of all data files resulting from your analysis. If two or more samples have identical Group and Replicate values, their fastq files are merged before the pipeline runs.
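The merging rule can be pictured with a small Python sketch. It is only an illustration of the behaviour described above, not Flow's implementation: the function name, the file names and the underscore used to join Group and Replicate are all assumptions.

```python
from collections import defaultdict


def group_fastqs(rows):
    """Collect fastq files that share the same Group and Replicate.

    rows: iterable of (sample_fastq, group, replicate) tuples, mirroring
    the three columns of the Samples table.
    """
    merged = defaultdict(list)
    for fastq, group, replicate in rows:
        # Assumption: the separator Flow uses between Group and Replicate
        # is not stated in this tutorial; "_" is a placeholder.
        merged[f"{group}_{replicate}"].append(fastq)
    return dict(merged)


# Hypothetical file names, for illustration only.
rows = [
    ("run1_a.fastq.gz", "TARDBP_HEK293", 1),
    ("run2_a.fastq.gz", "TARDBP_HEK293", 1),
    ("run1_b.fastq.gz", "TARDBP_HEK293", 2),
]
groups = group_fastqs(rows)
```

Here the first two files share Group and Replicate, so they end up under one name and would be merged, while the third stays on its own.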
Providing Metadata
We highly recommend providing as much detailed metadata as possible when you upload your data. This is a bit annoying, but in our experience people usually thank themselves later, for the following reasons:
- Such details are often freshest in your mind at time of upload.
- You may revisit data years later and find it difficult to identify samples or conditions without metadata.
- Certain metadata is required to make your data publicly available, so it is best to complete it early and avoid stress later, for example when you are submitting a manuscript.
Multiplexed samples
For multiplexed samples, metadata must be provided in an Excel spreadsheet at the time of demultiplexing. You should use the annotation template, available here.
Every field is required except for those labelled as "optional". Some fields have restricted terms to help keep the Flow database clean and easily searchable. For these fields you can find the "Valid Terms" in a separate worksheet in the annotation Excel spreadsheet. If the term you need isn't there, you can request a new one in the #request-new-terms channel of our Slack Workspace and we'll help you right away.
Please find a detailed explanation of the metadata fields below:
Sample Name The name of your sample, which will be propagated to all output files generated by the pipelines. To keep the names simple and informative we suggest (but don't enforce) the following format: "proteinName_CellType_condition_replicate".
Project Name This is used to define the set of experiments that are part of the same study. If the data have already been published, this could be the title of the manuscript. The name needs to be unique (not shared with any other projects on the server), unless you wish the experiments to be added to a previous project. Try to combine experiments into the same project whenever possible, as it keeps data organised. You can combine experiments for multiple proteins, cell types and species, into the same project, as long as they are part of an integrated study. The visibility and sharing of data is defined per project, so make sure that you wish all data in the same project to be shared with the same people. Read more about managing projects.
Scientist (optional) This is the person who generated the data. The name here should be their Flow username. You can find people's usernames by searching their name in the search box on the Flow site.
PI (optional) This is the Principal Investigator (PI) or manager of the project. The name here should be their Flow username. You can find people's usernames by searching their name in the search box on the Flow site.
Organization (optional) This is the organisation where the data was generated. This can help people later to search for data produced at the same institute.
Method You need to choose one of the experimental methods from the Valid Terms list (e.g. iCLIP, eCLIP).
Pipeline Currently the only pipeline available for CLIP analysis is "Primary CLIP Analysis".
Protocol Document (optional) This should be the name of the pdf file containing the protocol used for the experiment. One of the available files (from previous experiments) can be selected, or a new file provided for upload on the server along with the data.
Protein Protein names should be according to the accepted gene symbol on ENSEMBL (check here). For miCLIP, the type of RNA methylation studied is entered here. For mock input, 'input' or 'SMI' can be used. For controls, you need to use one of the terms from the list (for instance, IgG, ctrl), or otherwise the annotation will be rejected.
If this is a control where you don't expect high-quality data for a specific protein, then don't put the protein name first. Instead, choose a control term from the valid term list, or contact us on Slack to add something appropriate for your data.
- No UV conditions: If UV was omitted, write noUV, and if you wish to specify the name of the protein that was purified, separate it by a colon (eg. noUV:TARDBP).
- Protein tags: Any other details (mutations, tags, etc) should be separated by a colon or hyphen. Flow doesn't check the info that is after these signs, so here you are free to add any detail you wish. If the protein has a tag, separate it by a - sign. The terminal position of the tag can be marked by 'n' or 'c' (eg. TARDBP-nGFP).
- Protein mutations: If the protein has mutations, then separate them by a colon, for instance TARDBP:340del346 means deletion between amino acids 340 and 346.
Cell or Tissue You need to choose one of the terms from the list, or if the term is not on the list, contact us. Otherwise your annotation will be rejected by the pipeline. If you have detailed annotation to add to an existing term, this can be done with a colon. For example, the term brain already exists, so if you wish to add details on the region or age, you can add them after the colon and separate the terms with an underscore. The pipeline only checks the term before the colon, so if brain is on the list, the annotation will be accepted regardless of what you add after the colon. Spaces are allowed (eg. brain:P10 cortex).
Condition (optional) A short explanation of the condition (e.g. TARDBP siRNA knockdown). Don't repeat information here that is already in other columns (such as protein, cell line, etc).
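As a rough illustration of the "only the term before the colon is checked" rule for the Cell or Tissue field, here is a minimal Python sketch. The set of valid terms is a made-up subset (the real list is in the "Valid Terms" worksheet), the function name is hypothetical, and case-sensitive matching is an assumption.

```python
# Hypothetical subset of valid terms, for illustration only.
VALID_CELL_OR_TISSUE = {"brain", "HEK293", "HeLa"}


def check_cell_or_tissue(value, valid_terms=VALID_CELL_OR_TISSUE):
    """Split a Cell or Tissue entry at the first colon and check the term.

    Returns (is_valid, term, detail). Only the part before the colon is
    compared against the valid terms; the detail is free text.
    """
    term, _, detail = value.partition(":")
    # Assumption: matching is exact and case-sensitive.
    return term in valid_terms, term, detail


accepted = check_cell_or_tissue("brain:P10 cortex")
rejected = check_cell_or_tissue("cortex:P10")
```

In this sketch "brain:P10 cortex" is accepted because brain is a listed term, while "cortex:P10" is rejected because cortex is not in the illustrative list.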
Species A two-letter code representing a valid species. We currently support 6 species: Dm (Drosophila melanogaster), Dr (Danio rerio), Hs (Homo sapiens), Mm (Mus musculus), Rn (Rattus norvegicus), Sc (Saccharomyces cerevisiae).
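If you prepare annotation spreadsheets programmatically, a lookup like the following Python sketch can catch an unsupported species code before upload. The codes and names are the six listed above; the function name and the case-sensitive comparison are assumptions, not part of Flow.

```python
SUPPORTED_SPECIES = {
    "Dm": "Drosophila melanogaster",
    "Dr": "Danio rerio",
    "Hs": "Homo sapiens",
    "Mm": "Mus musculus",
    "Rn": "Rattus norvegicus",
    "Sc": "Saccharomyces cerevisiae",
}


def species_name(code):
    """Return the full species name for a two-letter code, or raise ValueError."""
    # Assumption: codes are compared exactly as written (e.g. "Hs", not "hs").
    if code not in SUPPORTED_SPECIES:
        raise ValueError(
            f"Unsupported species code {code!r}; "
            f"expected one of {sorted(SUPPORTED_SPECIES)}"
        )
    return SUPPORTED_SPECIES[code]
```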
Comments (optional) Any additional information you'd like to provide about the sample collection or quality.
5' Barcode eg. NNNGGCGNN. This is the 5' barcode at the beginning of the read. N's indicate random nucleotides used as UMIs. During demultiplexing this barcode is removed and the N nucleotides are added to the fastq header to create the unique molecular identifier (UMI).
3' Barcode (optional) eg. NNATG. Sometimes barcodes are also added at the 3' end of the read. If present, during demultiplexing the random nucleotides here will be combined with those at the 5' end to assemble the UMI.
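To make the 5' barcode handling concrete, the Python sketch below shows one way the random (N) positions could be pulled out of a read and moved into the header. It is a simplified illustration and not the demultiplexer used on Flow: it covers the 5' barcode only, requires an exact match on the fixed bases, and the header suffix ":UMI_" and both function names are hypothetical.

```python
import re


def barcode_to_regex(barcode):
    """Turn a barcode such as NNNGGCGNN into a regex capturing each run of N."""
    parts = re.findall(r"N+|[ACGT]+", barcode)
    pattern = "".join(
        # Random positions accept any base call (including N);
        # fixed positions must match exactly.
        f"([ACGTN]{{{len(part)}}})" if part[0] == "N" else part
        for part in parts
    )
    return re.compile("^" + pattern)


def extract_umi(header, sequence, quality, barcode):
    """Strip a 5' barcode from one read and append the UMI to its header.

    Returns (new_header, trimmed_sequence, trimmed_quality), or None if the
    read does not start with this barcode.
    """
    match = barcode_to_regex(barcode).match(sequence)
    if match is None:
        return None
    umi = "".join(match.groups())
    end = match.end()
    # Assumption: the header format here is a placeholder, not Flow's convention.
    return f"{header}:UMI_{umi}", sequence[end:], quality[end:]


result = extract_umi("@read1", "ACGGGCGTTAAACCC", "I" * 15, "NNNGGCGNN")
```

With the example barcode NNNGGCGNN, the first three and last two barcode positions of the read are joined into the UMI, and the nine barcode bases are removed from the sequence and quality strings.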
3' Adapter Sequence The sequence of the sequencing adapter, used as input to read trimming. This allows the adapter sequence to be removed in cases where the cDNAs are so short that the reads also include part of the adapter.
3' Adapter Name (optional) In some labs the 3' barcode and adapter combination is given a shorthand name; this column lets you record it.
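The reason the adapter sequence is needed can be seen in a small Python sketch of 3' adapter trimming. This is a deliberately naive, exact-match version for illustration; dedicated trimming tools also tolerate sequencing errors, and the pipeline's actual trimming step is not reproduced here. The function name, the min_overlap parameter and the adapter string in the example are assumptions.

```python
def trim_3prime_adapter(sequence, quality, adapter, min_overlap=3):
    """Remove a 3' adapter (full or truncated at the read end) from one read."""
    # Case 1: the whole adapter occurs somewhere in the read.
    cut = sequence.find(adapter)
    if cut == -1:
        # Case 2: the read ends part-way through the adapter, so only an
        # adapter prefix of at least min_overlap bases is present.
        longest = min(len(adapter), len(sequence)) - 1
        for overlap in range(longest, min_overlap - 1, -1):
            if sequence.endswith(adapter[:overlap]):
                cut = len(sequence) - overlap
                break
    if cut == -1:
        return sequence, quality  # no adapter found, read is left unchanged
    return sequence[:cut], quality[:cut]


ADAPTER = "AGATCGGAAGAGC"  # placeholder adapter sequence for this example
full = trim_3prime_adapter("ACGTACGTAGATCGGAAGAGCTTT", "I" * 24, ADAPTER)
partial = trim_3prime_adapter("ACGTACGTTTAGATC", "I" * 15, ADAPTER)
```

In both example reads the cDNA is shorter than the read length, so everything from the start of the adapter onwards is discarded and only the cDNA-derived bases remain.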
Sequencer The machine used to sequence the library, eg. HiSeq 2500.
Purification Method (Antibody) Give details of the antibody/nanobody/small molecule used to purify your protein/modification.
RT Primer (optional) eg. 5'-X33NNCGCCNNNAGATCGGAAGAGCGTCGTGgatcCTGAACCGC-3'. The reverse transcription primer.
5' Index (optional) Sequence of 5' index, which is a separate barcode that allows multiplexing of libraries prepared with other protocols (RNAseq, etc).
3' Index (optional) Sequence of 3' index, which is a separate barcode that allows multiplexing of libraries prepared with other protocols (RNAseq, etc).
PubMed ID (optional) If your data is published, provide the PubMed ID here.
ArrayExpress link (optional) If you have also uploaded the data to ArrayExpress, provide the link here.
GEO link (optional) If you have also uploaded the data to GEO, provide the link here.
Demultiplexed samples
All of the above metadata should also be provided for demultiplexed samples, but it needs to be entered on Flow itself rather than via a spreadsheet.