cas9point4
Analysis workflow for ONT data from mouse CRISPR amplicon sequencing
cas9point4 identifies potentially correct amplicons from CRISPR-edited animals by filtering each sample's reads for determinant sequences (e.g. the insertion of a LoxP or Cre site). Any read that does not contain the determinant(s) is removed from the pileup.
The strategy aims to report only those reads that represent correct or close-to-correct mutants containing your desired modifications. Variants are then called from the final filtered BAM file using medaka, sniffles and clair3.
The pipeline should execute to completion, with the multiqc process running last.
Running
Samplesheet
A valid samplesheet is the only required input for the pipeline. The samplesheet describes each input sample with its associated reads/barcode, determinants and ROI references.
The example below shows a typical samplesheet for processing a nanopore run that has already been demultiplexed into fastq files.
<div style="overflow-x: auto;">
sample_name | length_perc | determinant_1 | determinant_2 | determinant_3 | roi_ref | reads | barcode |
---|---|---|---|---|---|---|---|
MPEG1-CRE-CAS-LINE4-B6N_1.1c | 75 | cre_ref.fa | | | Mpeg1_Cre.fa | BC01.fastq.gz | |
MPEG1-CRE-CAS-LINE3-B6N_1.1d | 75 | cre_ref.fa | | | Mpeg1_Cre.fa | BC02.fastq.gz | |
PRO_4274.1h | 75 | cre_ref.fa | | | Mpeg1_Cre.fa | BC03.fastq.gz | |
PRO_4274.4a | 75 | cre_ref.fa | | | Mpeg1_Cre.fa | BC04.fastq.gz | |
CX3CL1-FLOX-CAS-LINE1-B6N_1.1c | 75 | loxp_5prime.fa | loxp_3prime.fa | | Cx3cl1.fa | BC05.fastq.gz | |
PRO_3976.1f | 75 | loxp_5prime.fa | loxp_3prime.fa | | Cx3cl1.fa | BC06.fastq.gz | |
PRO_4264.4a | 75 | loxp_5prime.fa | loxp_3prime.fa | | Pam.fa | BC07.fastq.gz | |
PAM-FLOX-CAS-LINE1-B6N_1.1a | 75 | loxp_5prime.fa | loxp_3prime.fa | | Pam.fa | BC08.fastq.gz | |
PRO_4345.5a | 75 | loxp_5prime.fa | loxp_3prime.fa | | Prdm8.fa | BC09.fastq.gz | |
PRO_4345.3g | 75 | loxp_5prime.fa | loxp_3prime.fa | | Prdm8.fa | BC10.fastq.gz | |
PRO_4282.4e | 75 | loxp_5prime.fa | loxp_3prime.fa | | Hnf1a.fa | BC11.fastq.gz | |
</div>
Note: File paths shown above are simplified. Use full paths in your actual samplesheet (e.g., `path/to/cre_ref.fa`).
sample_name:
- Description: Enter a unique name for each sample.
- Example: `MPEG1-CRE-CAS-LINE4-B6N_1.1c`
length_perc:
- Description: Specify the minimum percentage of determinant overlap required for a read to be valid.
- Example: `75` (indicates 75%)
determinant_1, determinant_2, determinant_3:
- Description: These columns hold the paths to the determinant reference files. Fill in the appropriate path for each determinant. If a determinant is not applicable, leave the column empty.
- Example:
  - determinant_1: `path/to/cre_ref.fa`
  - determinant_2: `path/to/loxp_5prime.fa` (if applicable)
  - determinant_3: Leave blank if not applicable
roi_ref:
- Description: Enter the path to the region of interest (ROI) reference file.
- Example: `path/to/Mpeg1_Cre.fa`
reads:
- Description: If the run uses demultiplexed files, specify the path to the reads file in FASTQ format; otherwise leave blank.
- Example: `path/to/BC01.fastq.gz`
barcode:
- Description: If the pipeline is being run with fast5 files, enter the barcode that targets the sample, such as `BC01`.
- Example: Leave blank or specify a barcode such as `BC01`.
sample_name,length_perc,determinant_1,determinant_2,determinant_3,roi_ref,reads,barcode
MPEG1-CRE-CAS-LINE4-B6N_1.1c,75,tests/data/dataset_1/det/cre_ref.fa,,,tests/data/dataset_1/roi_refs/Mpeg1_Cre.fa,tests/data/dataset_1/reads_all/BC01.fastq.gz,
MPEG1-CRE-CAS-LINE3-B6N_1.1d,75,tests/data/dataset_1/det/cre_ref.fa,,,tests/data/dataset_1/roi_refs/Mpeg1_Cre.fa,tests/data/dataset_1/reads_all/BC02.fastq.gz,
PRO_4274.1h,75,tests/data/dataset_1/det/cre_ref.fa,,,tests/data/dataset_1/roi_refs/Mpeg1_Cre.fa,tests/data/dataset_1/reads_all/BC03.fastq.gz,
PRO_4274.4a,75,tests/data/dataset_1/det/cre_ref.fa,,,tests/data/dataset_1/roi_refs/Mpeg1_Cre.fa,tests/data/dataset_1/reads_all/BC04.fastq.gz,
CX3CL1-FLOX-CAS-LINE1-B6N_1.1c,75,tests/data/dataset_1/det/loxp_5prime.fa,tests/data/dataset_1/det/loxp_3prime.fa,,tests/data/dataset_1/roi_refs/Cx3cl1_Flox.fa,tests/data/dataset_1/reads_all/BC05.fastq.gz,
PRO_3976.1f,75,tests/data/dataset_1/det/loxp_5prime.fa,tests/data/dataset_1/det/loxp_3prime.fa,,tests/data/dataset_1/roi_refs/Cx3cl1_Flox.fa,tests/data/dataset_1/reads_all/BC06.fastq.gz,
PRO_4264.4a,75,tests/data/dataset_1/det/loxp_5prime.fa,tests/data/dataset_1/det/loxp_3prime.fa,,tests/data/dataset_1/roi_refs/Pam_Flox.fa,tests/data/dataset_1/reads_all/BC07.fastq.gz,
PAM-FLOX-CAS-LINE1-B6N_1.1a,75,tests/data/dataset_1/det/loxp_5prime.fa,tests/data/dataset_1/det/loxp_3prime.fa,,tests/data/dataset_1/roi_refs/Pam_Flox.fa,tests/data/dataset_1/reads_all/BC08.fastq.gz,
PRO_4345.5a,75,tests/data/dataset_1/det/loxp_5prime.fa,tests/data/dataset_1/det/loxp_3prime.fa,,tests/data/dataset_1/roi_refs/Prdm8_Flox.fa,tests/data/dataset_1/reads_all/BC09.fastq.gz,
PRO_4345.3g,75,tests/data/dataset_1/det/loxp_5prime.fa,tests/data/dataset_1/det/loxp_3prime.fa,,tests/data/dataset_1/roi_refs/Prdm8_Flox.fa,tests/data/dataset_1/reads_all/BC10.fastq.gz,
PRO_4282.4e,75,tests/data/dataset_1/det/loxp_5prime.fa,tests/data/dataset_1/det/loxp_3prime.fa,,tests/data/dataset_1/roi_refs/Hnf1a_Flox.fa,tests/data/dataset_1/reads_all/BC11.fastq.gz,
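As a rough illustration of how the columns described above fit together, the following Python sketch reads samplesheet text and collects the non-empty determinant columns for each row. The helper name `parse_samplesheet` and the returned dictionary layout are illustrative assumptions; they are not part of the pipeline, whose own parsing is done by PARSE_SAMPLESHEET.

```python
import csv
import io

# Column names as documented in the samplesheet description above.
COLUMNS = [
    "sample_name", "length_perc", "determinant_1", "determinant_2",
    "determinant_3", "roi_ref", "reads", "barcode",
]


def parse_samplesheet(text):
    """Parse samplesheet CSV text into a list of dicts (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    rows = []
    for row in reader:
        # Keep only the determinant columns that were filled in.
        determinants = [row[c] for c in COLUMNS[2:5] if row[c]]
        rows.append({
            "sample_name": row["sample_name"],
            "length_perc": int(row["length_perc"]),
            "determinants": determinants,
            "roi_ref": row["roi_ref"],
            "reads": row["reads"] or None,      # empty cell -> None
            "barcode": row["barcode"] or None,  # empty cell -> None
        })
    return rows


EXAMPLE = """sample_name,length_perc,determinant_1,determinant_2,determinant_3,roi_ref,reads,barcode
PRO_4282.4e,75,path/to/loxp_5prime.fa,path/to/loxp_3prime.fa,,path/to/Hnf1a.fa,path/to/BC11.fastq.gz,
"""

rows = parse_samplesheet(EXAMPLE)
print(rows[0]["determinants"])
```

An empty `determinant_3` cell is simply dropped from the list, which mirrors the "leave the column empty" rule in the column descriptions.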
Pipeline Modes
FASTQ Mode
To run the pipeline in FASTQ mode, leave the `barcode` column blank and enter the paths to the fastq files in the `reads` column.
FAST5 Mode
To run the pipeline against a raw fast5 folder from a nanopore sequencing run, you must leave the `reads` column blank and fill in the `barcode` column with the barcode to which each sample maps. You must also specify the `fast5` parameter, which points to the fast5 folder. This can be either a direct folder location or a compressed `.tar.gz` archive.
The barcode column does not need to be unique per sample. It is possible to have multiple samples in the same barcode as long as their ROIs are different. The unique identifier for each sample is built from `sample_name/roi/barcode`.
Below is an example samplesheet for the pipeline in FAST5 mode:
sample_name | length_perc | determinant_1 | determinant_2 | determinant_3 | roi_ref | reads | barcode |
---|---|---|---|---|---|---|---|
PCDH19-DDCRE | 75 | path/to/pcdh19_ddcre_5prime.fa | | | path/to/PCDH19-DDCRE-5prime.fa | | barcode01 |
PCDH19-DDCRE | 75 | path/to/pcdh19_ddcre_5prime.fa | | | path/to/PCDH19-DDCRE-5prime.fa | | barcode02 |
PCDH19-DDCRE | 75 | path/to/pcdh19_ddcre_5prime.fa | | | path/to/PCDH19-DDCRE-5prime.fa | | barcode03 |
PCDH19-DDCRE | 75 | path/to/pcdh19_ddcre_5prime.fa | | | path/to/PCDH19-DDCRE-5prime.fa | | barcode04 |
PCDH19-DDCRE | 75 | path/to/pcdh19_ddcre_3prime.fa | | | path/to/PCDH19-DDCRE-5prime.fa | | barcode05 |
PCDH19-DDCRE | 75 | path/to/pcdh19_ddcre_3prime.fa | | | path/to/PCDH19-DDCRE-5prime.fa | | barcode06 |
PCDH19-DDCRE | 75 | path/to/pcdh19_ddcre_3prime.fa | | | path/to/PCDH19-DDCRE-5prime.fa | | barcode07 |
PCDH19-DDCRE | 75 | path/to/pcdh19_ddcre_3prime.fa | | | path/to/PCDH19-DDCRE-5prime.fa | | barcode08 |
PCDH19-DDCRE | 99 | path/to/pcdh19_wt.fa | | | path/to/Pcdh19_WT.fa | | barcode09 |
SPIB-MCHERRY | 75 | path/to/mcherry.fa | | | path/to/Spib_mCherry.fa | | barcode10 |
SPIB-MCHERRY | 75 | path/to/mcherry.fa | | | path/to/Spib_mCherry.fa | | barcode11 |
TLR2-FLOX | 75 | path/to/loxp_3prime.fa | path/to/loxp_5prime.fa | | path/to/Tlr2_Flox.fa | | barcode12 |
Kcnc1-Flex | 75 | path/to/kcnc1_flex_a_cassette.fa | | | path/to/Kcnc1-Flex-5PRIME.fa | | barcode01 |
Kcnc1-Flex | 99 | path/to/kcnc1_flex_wt.fa | | | path/to/Kcnc1-Flex-WT.fa | | barcode06 |
Kcnc1-Flex | 75 | path/to/kcnc1_flex_b_cassette.fa | | | path/to/Kcnc1-Flex-3PRIME.fa | | barcode09 |
Kcnc1-Flex | 75 | path/to/kcnc1-6232_insertionem.fa | | | path/to/Kcnc1-6232-InsertionEM.fa | | barcode09 |
sample_name,length_perc,determinant_1,determinant_2,determinant_3,roi_ref,reads,barcode
PCDH19-DDCRE,75,tests/data/dataset_2/det/pcdh19_ddcre_5prime.fa,,,tests/data/dataset_2/roi_refs/PCDH19-DDCRE-5prime.fa,,barcode01
PCDH19-DDCRE,75,tests/data/dataset_2/det/pcdh19_ddcre_5prime.fa,,,tests/data/dataset_2/roi_refs/PCDH19-DDCRE-5prime.fa,,barcode02
PCDH19-DDCRE,75,tests/data/dataset_2/det/pcdh19_ddcre_5prime.fa,,,tests/data/dataset_2/roi_refs/PCDH19-DDCRE-5prime.fa,,barcode03
PCDH19-DDCRE,75,tests/data/dataset_2/det/pcdh19_ddcre_5prime.fa,,,tests/data/dataset_2/roi_refs/PCDH19-DDCRE-5prime.fa,,barcode04
PCDH19-DDCRE,75,tests/data/dataset_2/det/pcdh19_ddcre_3prime.fa,,,tests/data/dataset_2/roi_refs/PCDH19-DDCRE-5prime.fa,,barcode05
PCDH19-DDCRE,75,tests/data/dataset_2/det/pcdh19_ddcre_3prime.fa,,,tests/data/dataset_2/roi_refs/PCDH19-DDCRE-5prime.fa,,barcode06
PCDH19-DDCRE,75,tests/data/dataset_2/det/pcdh19_ddcre_3prime.fa,,,tests/data/dataset_2/roi_refs/PCDH19-DDCRE-5prime.fa,,barcode07
PCDH19-DDCRE,75,tests/data/dataset_2/det/pcdh19_ddcre_3prime.fa,,,tests/data/dataset_2/roi_refs/PCDH19-DDCRE-5prime.fa,,barcode08
PCDH19-DDCRE,99,tests/data/dataset_2/det/pcdh19_wt.fa,,,tests/data/dataset_2/roi_refs/Pcdh19_WT.fa,,barcode09
SPIB-MCHERRY,75,tests/data/dataset_2/det/mcherry.fa,,,tests/data/dataset_2/roi_refs/Spib_mCherry.fa,,barcode10
SPIB-MCHERRY,75,tests/data/dataset_2/det/mcherry.fa,,,tests/data/dataset_2/roi_refs/Spib_mCherry.fa,,barcode11
TLR2-FLOX,75,tests/data/dataset_2/det/loxp_3prime.fa,tests/data/dataset_2/det/loxp_5prime.fa,,tests/data/dataset_2/roi_refs/Tlr2_Flox.fa,,barcode12
Kcnc1-Flex,75,tests/data/dataset_2/det/kcnc1_flex_a_cassette.fa,,,tests/data/dataset_2/roi_refs/Kcnc1-Flex-5PRIME.fa,,barcode01
Kcnc1-Flex,99,tests/data/dataset_2/det/kcnc1_flex_wt.fa,,,tests/data/dataset_2/roi_refs/Kcnc1-Flex-WT.fa,,barcode06
Kcnc1-Flex,75,tests/data/dataset_2/det/kcnc1_flex_b_cassette.fa,,,tests/data/dataset_2/roi_refs/Kcnc1-Flex-3PRIME.fa,,barcode09
Kcnc1-Flex,75,tests/data/dataset_2/det/kcnc1-6232_insertionem.fa,,,tests/data/dataset_2/roi_refs/Kcnc1-6232-InsertionEM.fa,,barcode09
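The two modes and the `sample_name/roi/barcode` identifier can be sketched in a few lines of Python. This is a minimal sketch, not the pipeline's own code: the function names are hypothetical, and deriving the `roi` part from the ROI file name without its extension is an assumption made only for illustration.

```python
from pathlib import Path


def sample_mode(row):
    """Return 'fastq' or 'fast5' for one samplesheet row (dict of column -> str)."""
    has_reads = bool(row.get("reads"))
    has_barcode = bool(row.get("barcode"))
    if has_reads and not has_barcode:
        return "fastq"
    if has_barcode and not has_reads:
        return "fast5"
    raise ValueError(f"{row['sample_name']}: fill in exactly one of reads or barcode")


def sample_id(row):
    """Build a sample_name/roi/barcode identifier for a FAST5-mode row."""
    # Assumption: the roi component is the ROI reference file name without extension.
    roi = Path(row["roi_ref"]).stem
    return f"{row['sample_name']}/{roi}/{row['barcode']}"


# Two samples sharing barcode09, distinguished by their ROI references.
fast5_rows = [
    {"sample_name": "Kcnc1-Flex", "roi_ref": "path/to/Kcnc1-Flex-3PRIME.fa",
     "reads": "", "barcode": "barcode09"},
    {"sample_name": "Kcnc1-Flex", "roi_ref": "path/to/Kcnc1-6232-InsertionEM.fa",
     "reads": "", "barcode": "barcode09"},
]

ids = [sample_id(row) for row in fast5_rows]
print(sample_mode(fast5_rows[0]), ids)
```

Because the ROI references differ, the two rows yield distinct identifiers even though the sample name and barcode are the same.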
Run Steps
The following section outlines the steps of the pipeline execution.
1. Input Checking
- PARSE_SAMPLESHEET: Parses the sample sheet file into metadata and checks for errors.
2. Convert Fast5 to FASTQ (if required)
- DEMULTIPLEX_FAST5: Decompresses, calls bases, and sorts metadata for Fast5 files.
3. ROI and Determinant Processing
- MERGE_DETERMINANTS: Concatenates determinant files into a single reference file.
- BLAST_MAKEBLASTDB: Builds a BLAST database of merged determinant sequences.
- DET_SIZES: Calculates sizes of the determinants.
- SAMTOOLS_FAIDX: Indexes the ROI references.
- ROI_SIZES: Calculates sizes of the ROIs.
- SEQTK_HRUN: Scans ROI reference files for homopolymers.
- FORMAT_BED: Sorts and formats the homopolymer output BED file.
- BEDTOOLS_SLOP: Adds 1 to each end of BED ranges in the homopolymer output file.
4. Initial Quality Control (QC)
- NANOPLOT_FASTQ: Runs nanoplot on reads.
5. Read Filtering
- FILTLONG: Filters long reads.
6. Alignment and Fragment Processing
- MINIMAP2_ALIGN: Aligns to ROI reference using Minimap2.
- SAMTOOLS_INDEX_ALIGN: Indexes the BAM file.
- SAMTOOLS_VIEW: Filters reads for mapping quality.
- FILTER_BAM_RATIO: Filters reads using alignment length and mismatch ratio criteria.
7. BLAST and BLAST-Dependent Processing
- BLAST_ROI_REF: Searches ROI references against determinants database.
- SEQTK_SEQ: Converts FastQ reads to Fasta.
- BLAST_READS: Searches reads against determinants database.
- FILTER_BAM_BLAST: Further filters reads based on BLAST results.
8. Variant Calculation
- SAMTOOLS_DEPTH_PASSED: Calculates depth of passed reads.
- SUBSAMPLE_BAM_COV: Subsamples reads to standardize coverage.
- MEDAKA_VARIANT: Calls variants using Medaka.
- BEDTOOLS_INTERSECT: Filters VCF against homopolymer regions.
- SNIFFLES: Runs Sniffles for structural variant calling.
- CLAIR3_CALL: Runs Clair3 for variant calling.
9. Report Generation
- SAMTOOLS_DEPTH_ALIGNED: Calculates coverage depth from aligned reads.
- PYSAMSTATS: Calculates strand variations.
- GEN_REPORTS: Generates plots and inputs for MultiQC.
- DUMP_SOFTWARE_VERSIONS: Collects software versions.
- MULTIQC: Runs MultiQC for aggregate analysis reports.
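To make the homopolymer steps in stage 3 more concrete, here is a plain-Python sketch of the idea behind SEQTK_HRUN and BEDTOOLS_SLOP: find runs of a single base and widen each interval by one base on either side. It does not call seqtk or bedtools and may differ from them in detail; the function name is hypothetical, and treating a length of 5 as the minimum run length is an assumption based on the `homopol_length` default.

```python
import re


def homopolymer_bed(name, seq, min_len=5, slop=1):
    """Return BED-style (chrom, start, end) tuples for homopolymer runs.

    Coordinates are 0-based and half-open. Each run is extended by `slop`
    bases on both sides and clamped to the sequence bounds.
    """
    # One base followed by at least (min_len - 1) further copies of itself.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1), re.IGNORECASE)
    records = []
    for match in pattern.finditer(seq):
        start = max(0, match.start() - slop)
        end = min(len(seq), match.end() + slop)
        records.append((name, start, end))
    return records


# Toy sequence: a run of six A's and a run of five T's at the 3' end.
seq = "ACGT" + "A" * 6 + "CG" + "T" * 5
for record in homopolymer_bed("roi", seq):
    print("\t".join(map(str, record)))
```

The second interval is clamped at the end of the sequence, so the padding never extends past the reference.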
Pipeline Output
The results of the CAS9POINT4 pipeline are organized into several directories, each containing specific output files. Below is a description of each directory and its contents:
`bam` Directory
This directory contains BAM files at different stages of processing:
- 01_initial_align: Initial alignment BAM files and their indices (`.bai`).
- 02_basic_filt: BAM files after basic filtering and their indices.
- 03_complex_filt: BAM files after more complex filtering, their indices, and various statistics (`flagstat`, `idxstats`, `stats`).
- 04_blast_filt: Contains subdirectories for `failed` and `passed` reads with corresponding BAM files.
`blast` Directory
Contains BLAST result files:
- reads: BLAST results for individual reads.
- roi_ref: BLAST results for regions of interest (ROIs).
`filtlong` Directory
Includes filtered FastQ files and log files from the Filtlong process.
`homopolymers` Directory
Contains BED files identifying homopolymeric regions in various references.
`pipeline_info` Directory
Stores various reports and logs related to pipeline execution, such as HTML reports, execution traces, samplesheet validations, and software versions.
`reports` Directory
A collection of analysis reports and visualizations:
- coverage: Coverage depth plots and a compiled PDF report.
- depth: Depth information in TSV format.
- multiqc: Comprehensive reports from MultiQC analysis including data files and plots.
- nanoplot_fastq: NanoPlot reports and statistics for FastQ files.
- strand_variation: Data files containing strand variation information.
`variation` Directory
Contains output files related to variant analysis:
- clair3: VCF files from Clair3 variant calling.
- depth: Depth information for passed reads in TSV format.
- medaka: Subdirectories for `filtered` and `raw` outputs from Medaka variant calling, including VCF files and HDF5 probability files.
- sniffles: Output from Sniffles for structural variant calling, including `.snf` and `.vcf` files.
- subsample: BAM files and indices from the subsampling step, aiming to standardize coverage across samples.
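After a run, a quick check that the top-level output directories listed above exist can be sketched as follows. The helper name is hypothetical, and whether every directory is created in every mode is not stated here, so a missing directory should be read as a prompt to inspect the logs rather than as proof of failure.

```python
from pathlib import Path

# Top-level result directories as described in this section.
EXPECTED_DIRS = [
    "bam", "blast", "filtlong", "homopolymers",
    "pipeline_info", "reports", "variation",
]


def missing_output_dirs(outdir="./results"):
    """Return the expected result directories that are absent under outdir."""
    root = Path(outdir)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


# "./results" is the documented default for the outdir parameter.
print(missing_output_dirs("./results"))
```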
Pipeline Parameters
This section describes all other available parameters in the pipeline.
Input
samplesheet:
- Description: Path to the samplesheet file.
fast5:
- Default: `null`
- Description: Path to the FAST5 folder or compressed `.tar.gz` archive, if applicable.
Output Options
outdir:
- Default: `./results`
- Description: Directory where the pipeline's results will be saved.
monochrome_logs:
- Default: `false`
- Description: If true, logs will be in monochrome.
debug:
- Default: `false`
- Description: Enables debug mode if set to true.
Max Resource Options
max_memory:
- Default: `128.GB`
- Description: Maximum memory that any single process can be allocated. Processes may request less.
max_cpus:
- Default: `16`
- Description: Maximum number of CPUs that can be used per process. Processes may request less.
max_time:
- Default: `240.h`
- Description: Maximum run time that any single process can be allocated. Processes may request less.
Pipeline Parameters
experiment_name:
- Default: `"exp"`
- Description: Name of the experiment. Used for naming files and formatting output.
dorado_bc_kit:
- Default: `null`
- Description: The specific barcoding (BC) kit name to use for base calling. This parameter is required if running in fast5 mode. For example, `EXP-PBC001`. For more information, see Dorado Basecalling.
- Known possible values:
SQK-16S024
SQK-16S114-24
SQK-PBK004
SQK-LWB001
SQK-PCB109
SQK-PCB110
SQK-PCB111-24
SQK-PCB114-24
EXP-NBD103
EXP-NBD104
EXP-NBD114
SQK-NBD111-24
SQK-NBD114-24
EXP-NBD196
SQK-MLK111-96-XL
SQK-NBD111-96
SQK-NBD114-96
SQK-MLK114-96-XL
EXP-PBC001
EXP-PBC096
SQK-RAB204
SQK-RAB201
SQK-RBK001
SQK-RBK110-96
SQK-RBK111-96
SQK-RBK114-96
SQK-RBK111-24
SQK-RBK114-24
SQK-RBK004
VSK-PTC001
VSK-VPS001
SQK-RPB004
SQK-RLB001
SQK-RPB114-24
VSK-VMK001
VSK-VMK004
dorado_auto_model:
- Default: `"hac"`
- Description: Auto selection of the closest basecalling model based on keywords. `hac` will run the latest compatible `hac` model. See https://github.com/nanoporetech/dorado
dorado_model:
- Description: The specific dorado model to use, e.g. `dna_r9.4.1_e8_hac@v3.3`. Overrides `dorado_auto_model`.
dorado_device:
- Default: `"cuda:all"`
- Description: Device specification for Dorado. Edit to have more fine-grained control over CPU/GPU base calling execution.
dorado_both_ends:
- Default: `true`
- Description: Force barcode detection from both ends of the read.
filtlong_meanqual:
- Default: `90`
- Description: Threshold to keep the top percentage of reads based on quality.
blastn_outfmt:
- Default: `6`
- Description: Output format for BLASTN.
min_mapq:
- Default: `50`
- Description: Minimum mapping quality for alignment.
align_ratio_thresh:
- Default: `0.5`
- Description: Threshold for alignment ratio.
mismatch_ratio_thresh:
- Default: `0.3`
- Description: Threshold for mismatch ratio.
homopol_length:
- Default: `5`
- Description: Length of homopolymer sequences to consider.
var_coverage_thresh:
- Default: `100000`
- Description: Coverage threshold for variant calling.
var_enable_subsampling:
- Default: `true`
- Description: Enable subsampling for variant analysis.
medaka_model:
- Default: `"r941_min_high_g360"`
- Description: Model used for Medaka.
sniffles_min_cov_mult:
- Default: `0.1`
- Description: Minimum coverage multiplier for Sniffles.
clair3_model:
- Default: `'r941_prom_hac_g360+g422'`
- Description: Model used for Clair3.
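The parameters above can be gathered into a single file before a run. The sketch below assembles a parameter dictionary in Python and enforces the one dependency stated in this section, namely that `dorado_bc_kit` is required in fast5 mode. The helper name and the example paths are hypothetical. The launch command is not shown on this page; if the pipeline is launched with Nextflow (an assumption), a JSON file like this can usually be supplied through Nextflow's `-params-file` option.

```python
import json


def build_params(samplesheet, fast5=None, dorado_bc_kit=None, **overrides):
    """Assemble a parameter dict, enforcing the documented fast5 rule."""
    if fast5 is not None and dorado_bc_kit is None:
        raise ValueError("dorado_bc_kit is required when fast5 is set")
    params = {"samplesheet": samplesheet, "outdir": "./results"}
    if fast5 is not None:
        params["fast5"] = fast5
        params["dorado_bc_kit"] = dorado_bc_kit
    # Any other documented parameter can be passed as a keyword override.
    params.update(overrides)
    return params


params = build_params(
    "path/to/samplesheet.csv",      # placeholder path
    fast5="path/to/fast5_folder",   # placeholder path
    dorado_bc_kit="EXP-PBC001",
    min_mapq=50,
)
print(json.dumps(params, indent=2))
```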