cas9point4

Analysis workflow for ONT data from mouse CRISPR amplicon sequencing

cas9point4 selects potentially correct amplicons from CRISPR-edited animals by filtering each sample's reads for determinant sequences (e.g. the insertion of a LoxP or Cre site). Any read that does not contain the determinant(s) is removed from the pileup.

The strategy aims to report only those reads that represent correct or close-to-correct mutants containing your desired modifications. Variants are then called from the final filtered BAM file using medaka, sniffles and clair3.

The pipeline should execute to completion with the multiqc process running last.

Running

Samplesheet

A valid samplesheet is the only required input for the pipeline. The samplesheet describes each input sample with its associated reads/barcode, determinants and ROI references.

The example below shows a typical samplesheet for processing a nanopore run that has already been demultiplexed into fastq files.

sample_name	length_perc	determinant_1	determinant_2	roi_ref	reads
MPEG1-CRE-CAS-LINE4-B6N_1.1c	75	cre_ref.fa		Mpeg1_Cre.fa	BC01.fastq.gz
MPEG1-CRE-CAS-LINE3-B6N_1.1d	75	cre_ref.fa		Mpeg1_Cre.fa	BC02.fastq.gz
PRO_4274.1h	75	cre_ref.fa		Mpeg1_Cre.fa	BC03.fastq.gz
PRO_4274.4a	75	cre_ref.fa		Mpeg1_Cre.fa	BC04.fastq.gz
CX3CL1-FLOX-CAS-LINE1-B6N_1.1c	75	loxp_5prime.fa	loxp_3prime.fa	Cx3cl1.fa	BC05.fastq.gz
PRO_3976.1f	75	loxp_5prime.fa	loxp_3prime.fa	Cx3cl1.fa	BC06.fastq.gz
PRO_4264.4a	75	loxp_5prime.fa	loxp_3prime.fa	Pam.fa	BC07.fastq.gz
PAM-FLOX-CAS-LINE1-B6N_1.1a	75	loxp_5prime.fa	loxp_3prime.fa	Pam.fa	BC08.fastq.gz
PRO_4345.5a	75	loxp_5prime.fa	loxp_3prime.fa	Prdm8.fa	BC09.fastq.gz
PRO_4345.3g	75	loxp_5prime.fa	loxp_3prime.fa	Prdm8.fa	BC10.fastq.gz
PRO_4282.4e	75	loxp_5prime.fa	loxp_3prime.fa	Hnf1a.fa	BC11.fastq.gz

Note: File paths shown above are simplified. Use full paths in your actual samplesheet (e.g., path/to/cre_ref.fa).

sample_name:
- Description: Enter a unique name for each sample.
- Example: MPEG1-CRE-CAS-LINE4-B6N_1.1c
length_perc:
- Description: Specify the minimum percentage of a determinant that a read must overlap to be considered valid.
- Example: 75 (indicates 75%)
determinant_1, determinant_2, determinant_3:
- Description: These columns are for entering the paths to determinant reference files. Fill in the appropriate path for each determinant. If a determinant is not applicable, leave the column empty.
- Example:
  - determinant_1: path/to/cre_ref.fa
  - determinant_2: path/to/loxp_5prime.fa (if applicable)
  - determinant_3: Leave blank if not applicable
roi_ref:
- Description: Enter the path to the region of interest (ROI) reference file.
- Example: path/to/Mpeg1_Cre.fa
reads:
- Description: Specify the path to the reads file in FASTQ format if the run uses demultiplexed files; otherwise leave blank.
- Example: path/to/BC01.fastq.gz
barcode:
- Description: If the pipeline is run on fast5 files, enter the barcode that the sample maps to; otherwise leave blank.
- Example: BC01 (leave blank in FASTQ mode)

sample_name,length_perc,determinant_1,determinant_2,determinant_3,roi_ref,reads,barcode
MPEG1-CRE-CAS-LINE4-B6N_1.1c,75,tests/data/dataset_1/det/cre_ref.fa,,,tests/data/dataset_1/roi_refs/Mpeg1_Cre.fa,tests/data/dataset_1/reads_all/BC01.fastq.gz,
MPEG1-CRE-CAS-LINE3-B6N_1.1d,75,tests/data/dataset_1/det/cre_ref.fa,,,tests/data/dataset_1/roi_refs/Mpeg1_Cre.fa,tests/data/dataset_1/reads_all/BC02.fastq.gz,
PRO_4274.1h,75,tests/data/dataset_1/det/cre_ref.fa,,,tests/data/dataset_1/roi_refs/Mpeg1_Cre.fa,tests/data/dataset_1/reads_all/BC03.fastq.gz,
PRO_4274.4a,75,tests/data/dataset_1/det/cre_ref.fa,,,tests/data/dataset_1/roi_refs/Mpeg1_Cre.fa,tests/data/dataset_1/reads_all/BC04.fastq.gz,
CX3CL1-FLOX-CAS-LINE1-B6N_1.1c,75,tests/data/dataset_1/det/loxp_5prime.fa,tests/data/dataset_1/det/loxp_3prime.fa,,tests/data/dataset_1/roi_refs/Cx3cl1_Flox.fa,tests/data/dataset_1/reads_all/BC05.fastq.gz,
PRO_3976.1f,75,tests/data/dataset_1/det/loxp_5prime.fa,tests/data/dataset_1/det/loxp_3prime.fa,,tests/data/dataset_1/roi_refs/Cx3cl1_Flox.fa,tests/data/dataset_1/reads_all/BC06.fastq.gz,
PRO_4264.4a,75,tests/data/dataset_1/det/loxp_5prime.fa,tests/data/dataset_1/det/loxp_3prime.fa,,tests/data/dataset_1/roi_refs/Pam_Flox.fa,tests/data/dataset_1/reads_all/BC07.fastq.gz,
PAM-FLOX-CAS-LINE1-B6N_1.1a,75,tests/data/dataset_1/det/loxp_5prime.fa,tests/data/dataset_1/det/loxp_3prime.fa,,tests/data/dataset_1/roi_refs/Pam_Flox.fa,tests/data/dataset_1/reads_all/BC08.fastq.gz,
PRO_4345.5a,75,tests/data/dataset_1/det/loxp_5prime.fa,tests/data/dataset_1/det/loxp_3prime.fa,,tests/data/dataset_1/roi_refs/Prdm8_Flox.fa,tests/data/dataset_1/reads_all/BC09.fastq.gz,
PRO_4345.3g,75,tests/data/dataset_1/det/loxp_5prime.fa,tests/data/dataset_1/det/loxp_3prime.fa,,tests/data/dataset_1/roi_refs/Prdm8_Flox.fa,tests/data/dataset_1/reads_all/BC10.fastq.gz,
PRO_4282.4e,75,tests/data/dataset_1/det/loxp_5prime.fa,tests/data/dataset_1/det/loxp_3prime.fa,,tests/data/dataset_1/roi_refs/Hnf1a_Flox.fa,tests/data/dataset_1/reads_all/BC11.fastq.gz,
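
The column rules above can be checked before a run with a few lines of Python. This is a minimal sketch, not the pipeline's own PARSE_SAMPLESHEET step: `load_samplesheet` is a hypothetical helper, and the rules that each row needs at least one determinant and exactly one of reads/barcode are assumptions drawn from the column descriptions. The paths in the example rows are shortened for readability.

```python
import csv
import io

# Column names follow the samplesheet header shown above.
DETERMINANT_COLUMNS = ("determinant_1", "determinant_2", "determinant_3")


def load_samplesheet(handle):
    """Parse samplesheet rows and apply a few basic sanity checks."""
    samples = []
    for line_no, row in enumerate(csv.DictReader(handle), start=2):
        determinants = [row[c] for c in DETERMINANT_COLUMNS if row.get(c)]
        # Assumption: every sample needs at least one determinant.
        if not determinants:
            raise ValueError(f"line {line_no}: no determinant given")
        if not row.get("roi_ref"):
            raise ValueError(f"line {line_no}: roi_ref is empty")
        reads = row.get("reads") or ""
        barcode = row.get("barcode") or ""
        # Assumption: exactly one of reads/barcode is filled in per row.
        if bool(reads) == bool(barcode):
            raise ValueError(f"line {line_no}: set either reads or barcode")
        samples.append({
            "sample_name": row["sample_name"],
            "length_perc": float(row["length_perc"]),
            "determinants": determinants,
            "roi_ref": row["roi_ref"],
            "mode": "fastq" if reads else "fast5",
            "source": reads or barcode,
        })
    return samples


example = io.StringIO(
    "sample_name,length_perc,determinant_1,determinant_2,determinant_3,roi_ref,reads,barcode\n"
    "PRO_4274.1h,75,det/cre_ref.fa,,,roi_refs/Mpeg1_Cre.fa,reads_all/BC03.fastq.gz,\n"
    "PRO_3976.1f,75,det/loxp_5prime.fa,det/loxp_3prime.fa,,roi_refs/Cx3cl1_Flox.fa,reads_all/BC06.fastq.gz,\n"
)
samples = load_samplesheet(example)
for s in samples:
    print(s["sample_name"], s["mode"], len(s["determinants"]))
```

Passing an open file instead of the `io.StringIO` example would check a real samplesheet in the same way.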

Pipeline Modes

FASTQ Mode

To run the pipeline in FASTQ mode, leave the barcode column blank and enter the paths to the fastq files in the reads column.

FAST5 Mode

To run the pipeline against a raw fast5 folder from a nanopore sequencing run, leave the reads column blank and fill in the barcode column with the barcode to which each sample maps. You must also set the fast5 parameter to point to the fast5 folder, which can be either a directory or a compressed .tar.gz archive.

The barcode column does not need to be unique per sample. Multiple samples can share a barcode as long as their ROIs are different. The unique identifier for each sample is built from sample_name/roi/barcode.

Below is an example samplesheet for the pipeline in FAST5 mode:

sample_name	length_perc	determinant_1	determinant_2	roi_ref	barcode
PCDH19-DDCRE	75	path/to/pcdh19_ddcre_5prime.fa		path/to/PCDH19-DDCRE-5prime.fa	barcode01
PCDH19-DDCRE	75	path/to/pcdh19_ddcre_5prime.fa		path/to/PCDH19-DDCRE-5prime.fa	barcode02
PCDH19-DDCRE	75	path/to/pcdh19_ddcre_5prime.fa		path/to/PCDH19-DDCRE-5prime.fa	barcode03
PCDH19-DDCRE	75	path/to/pcdh19_ddcre_5prime.fa		path/to/PCDH19-DDCRE-5prime.fa	barcode04
PCDH19-DDCRE	75	path/to/pcdh19_ddcre_3prime.fa		path/to/PCDH19-DDCRE-5prime.fa	barcode05
PCDH19-DDCRE	75	path/to/pcdh19_ddcre_3prime.fa		path/to/PCDH19-DDCRE-5prime.fa	barcode06
PCDH19-DDCRE	75	path/to/pcdh19_ddcre_3prime.fa		path/to/PCDH19-DDCRE-5prime.fa	barcode07
PCDH19-DDCRE	75	path/to/pcdh19_ddcre_3prime.fa		path/to/PCDH19-DDCRE-5prime.fa	barcode08
PCDH19-DDCRE	99	path/to/pcdh19_wt.fa		path/to/Pcdh19_WT.fa	barcode09
SPIB-MCHERRY	75	path/to/mcherry.fa		path/to/Spib_mCherry.fa	barcode10
SPIB-MCHERRY	75	path/to/mcherry.fa		path/to/Spib_mCherry.fa	barcode11
TLR2-FLOX	75	path/to/loxp_3prime.fa	path/to/loxp_5prime.fa	path/to/Tlr2_Flox.fa	barcode12
Kcnc1-Flex	75	path/to/kcnc1_flex_a_cassette.fa		path/to/Kcnc1-Flex-5PRIME.fa	barcode01
Kcnc1-Flex	99	path/to/kcnc1_flex_wt.fa		path/to/Kcnc1-Flex-WT.fa	barcode06
Kcnc1-Flex	75	path/to/kcnc1_flex_b_cassette.fa		path/to/Kcnc1-Flex-3PRIME.fa	barcode09
Kcnc1-Flex	75	path/to/kcnc1-6232_insertionem.fa		path/to/Kcnc1-6232-InsertionEM.fa	barcode09

sample_name,length_perc,determinant_1,determinant_2,determinant_3,roi_ref,reads,barcode
PCDH19-DDCRE,75,tests/data/dataset_2/det/pcdh19_ddcre_5prime.fa,,,tests/data/dataset_2/roi_refs/PCDH19-DDCRE-5prime.fa,,barcode01
PCDH19-DDCRE,75,tests/data/dataset_2/det/pcdh19_ddcre_5prime.fa,,,tests/data/dataset_2/roi_refs/PCDH19-DDCRE-5prime.fa,,barcode02
PCDH19-DDCRE,75,tests/data/dataset_2/det/pcdh19_ddcre_5prime.fa,,,tests/data/dataset_2/roi_refs/PCDH19-DDCRE-5prime.fa,,barcode03
PCDH19-DDCRE,75,tests/data/dataset_2/det/pcdh19_ddcre_5prime.fa,,,tests/data/dataset_2/roi_refs/PCDH19-DDCRE-5prime.fa,,barcode04
PCDH19-DDCRE,75,tests/data/dataset_2/det/pcdh19_ddcre_3prime.fa,,,tests/data/dataset_2/roi_refs/PCDH19-DDCRE-5prime.fa,,barcode05
PCDH19-DDCRE,75,tests/data/dataset_2/det/pcdh19_ddcre_3prime.fa,,,tests/data/dataset_2/roi_refs/PCDH19-DDCRE-5prime.fa,,barcode06
PCDH19-DDCRE,75,tests/data/dataset_2/det/pcdh19_ddcre_3prime.fa,,,tests/data/dataset_2/roi_refs/PCDH19-DDCRE-5prime.fa,,barcode07
PCDH19-DDCRE,75,tests/data/dataset_2/det/pcdh19_ddcre_3prime.fa,,,tests/data/dataset_2/roi_refs/PCDH19-DDCRE-5prime.fa,,barcode08
PCDH19-DDCRE,99,tests/data/dataset_2/det/pcdh19_wt.fa,,,tests/data/dataset_2/roi_refs/Pcdh19_WT.fa,,barcode09
SPIB-MCHERRY,75,tests/data/dataset_2/det/mcherry.fa,,,tests/data/dataset_2/roi_refs/Spib_mCherry.fa,,barcode10
SPIB-MCHERRY,75,tests/data/dataset_2/det/mcherry.fa,,,tests/data/dataset_2/roi_refs/Spib_mCherry.fa,,barcode11
TLR2-FLOX,75,tests/data/dataset_2/det/loxp_3prime.fa,tests/data/dataset_2/det/loxp_5prime.fa,,tests/data/dataset_2/roi_refs/Tlr2_Flox.fa,,barcode12
Kcnc1-Flex,75,tests/data/dataset_2/det/kcnc1_flex_a_cassette.fa,,,tests/data/dataset_2/roi_refs/Kcnc1-Flex-5PRIME.fa,,barcode01
Kcnc1-Flex,99,tests/data/dataset_2/det/kcnc1_flex_wt.fa,,,tests/data/dataset_2/roi_refs/Kcnc1-Flex-WT.fa,,barcode06
Kcnc1-Flex,75,tests/data/dataset_2/det/kcnc1_flex_b_cassette.fa,,,tests/data/dataset_2/roi_refs/Kcnc1-Flex-3PRIME.fa,,barcode09
Kcnc1-Flex,75,tests/data/dataset_2/det/kcnc1-6232_insertionem.fa,,,tests/data/dataset_2/roi_refs/Kcnc1-6232-InsertionEM.fa,,barcode09
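
The sample_name/roi/barcode identifier described above can be illustrated with a short Python sketch. It assumes that the ROI part is the ROI reference file name without its extension; the pipeline may derive it differently, and `sample_id` is a hypothetical helper. The rows are taken from the Kcnc1-Flex entries in the example samplesheet.

```python
from collections import Counter
from pathlib import PurePosixPath


def sample_id(sample_name, roi_ref, barcode):
    """Join sample name, ROI name and barcode into one identifier."""
    # Assumption: the ROI name is the reference file name without extension.
    roi = PurePosixPath(roi_ref).stem
    return f"{sample_name}/{roi}/{barcode}"


rows = [
    ("Kcnc1-Flex", "path/to/Kcnc1-Flex-WT.fa", "barcode06"),
    ("Kcnc1-Flex", "path/to/Kcnc1-Flex-3PRIME.fa", "barcode09"),
    ("Kcnc1-Flex", "path/to/Kcnc1-6232-InsertionEM.fa", "barcode09"),
]
ids = [sample_id(*row) for row in rows]
duplicates = [i for i, n in Counter(ids).items() if n > 1]
print(ids)
print("duplicates:", duplicates)
```

The last two rows share barcode09 but have different ROIs, so their identifiers still differ.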

Run Steps

The following section outlines the steps of the pipeline execution.

1. Input Checking

PARSE_SAMPLESHEET: Parses the sample sheet file into metadata and checks for errors.

2. Convert Fast5 to FASTQ (if required)

DEMULTIPLEX_FAST5: Decompresses, calls bases, and sorts metadata for Fast5 files.

3. ROI and Determinant Processing

MERGE_DETERMINANTS: Concatenates determinant files into a single reference file.
BLAST_MAKEBLASTDB: Builds a BLAST database of merged determinant sequences.
DET_SIZES: Calculates sizes of the determinants.
SAMTOOLS_FAIDX: Indexes the ROI references.
ROI_SIZES: Calculates sizes of the ROIs.
SEQTK_HRUN: Scans ROI reference files for homopolymers.
FORMAT_BED: Sorts and formats the homopolymer output BED file.
BEDTOOLS_SLOP: Adds 1 to each end of BED ranges in the homopolymer output file.
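
The homopolymer scan and the one-base padding in this step can be pictured with a small pure-Python sketch. It does not call seqtk or bedtools. It assumes that homopol_length is a minimum run length and that intervals are 0-based and half-open, as in BED; `homopolymer_runs` and `slop` are hypothetical helpers.

```python
import re


def homopolymer_runs(seq, min_len=5):
    """Return 0-based half-open (start, end, base) runs of at least min_len."""
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1), re.IGNORECASE)
    return [(m.start(), m.end(), m.group(1).upper()) for m in pattern.finditer(seq)]


def slop(intervals, seq_len, pad=1):
    """Extend each interval by pad bases on both sides, clamped to the sequence."""
    return [(max(0, s - pad), min(seq_len, e + pad), b) for s, e, b in intervals]


seq = "ACGTAAAAAAGCTTTTTC"
runs = homopolymer_runs(seq)
padded = slop(runs, len(seq))
print(runs)    # [(4, 10, 'A'), (12, 17, 'T')]
print(padded)  # [(3, 11, 'A'), (11, 18, 'T')]
```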

4. Initial Quality Control (QC)

NANOPLOT_FASTQ: Runs nanoplot on reads.

5. Read Filtering

FILTLONG: Filters long reads.

6. Alignment and Fragment Processing

MINIMAP2_ALIGN: Aligns to ROI reference using Minimap2.
SAMTOOLS_INDEX_ALIGN: Indexes the BAM file.
SAMTOOLS_VIEW: Filters reads for mapping quality.
FILTER_BAM_RATIO: Filters reads using alignment length and mismatch ratio criteria.
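
A ratio-based read filter of this kind can be sketched as follows. The exact definitions used by FILTER_BAM_RATIO are not given here, so this example assumes that the alignment ratio is aligned length divided by read length and that the mismatch ratio is mismatches divided by aligned length. The default thresholds match align_ratio_thresh and mismatch_ratio_thresh; `passes_ratio_filter` and the read values are hypothetical.

```python
def passes_ratio_filter(aligned_len, read_len, mismatches,
                        align_ratio_thresh=0.5, mismatch_ratio_thresh=0.3):
    """Keep a read when enough of it aligns and few aligned bases mismatch."""
    if read_len <= 0 or aligned_len <= 0:
        return False
    # Assumed definitions; the pipeline may compute these ratios differently.
    align_ratio = aligned_len / read_len
    mismatch_ratio = mismatches / aligned_len
    return (align_ratio >= align_ratio_thresh
            and mismatch_ratio <= mismatch_ratio_thresh)


# name -> (aligned length, read length, mismatches); invented values
reads = {
    "read_a": (900, 1000, 45),
    "read_b": (300, 1000, 10),
    "read_c": (800, 1000, 400),
}
kept = [name for name, vals in reads.items() if passes_ratio_filter(*vals)]
print(kept)  # ['read_a']
```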

7. BLAST and BLAST-Dependent Processing

BLAST_ROI_REF: Searches ROI references against determinants database.
SEQTK_SEQ: Converts FastQ reads to Fasta.
BLAST_READS: Searches reads against determinants database.
FILTER_BAM_BLAST: Further filters reads based on BLAST results.
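
The determinant check behind this step can be sketched from tabular BLAST output (outfmt 6, whose default columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). This is an illustration under stated assumptions, not the FILTER_BAM_BLAST implementation: reads are the queries, determinants are the subjects, a read passes only when every determinant has a hit covering at least length_perc percent of its length, and `reads_with_all_determinants` plus the read and determinant IDs are hypothetical.

```python
import csv
import io
from collections import defaultdict


def reads_with_all_determinants(blast_tsv, det_sizes, length_perc):
    """Return read IDs whose BLAST hits cover every determinant well enough."""
    best = defaultdict(dict)  # read id -> determinant id -> best coverage (%)
    for fields in csv.reader(blast_tsv, delimiter="\t"):
        if not fields:
            continue
        read_id, det_id = fields[0], fields[1]
        if det_id not in det_sizes:
            continue
        # Columns 9 and 10 (sstart, send) give the span on the determinant.
        sstart, send = int(fields[8]), int(fields[9])
        coverage = 100.0 * (abs(send - sstart) + 1) / det_sizes[det_id]
        best[read_id][det_id] = max(coverage, best[read_id].get(det_id, 0.0))
    return sorted(
        read_id for read_id, hits in best.items()
        if all(hits.get(det, 0.0) >= length_perc for det in det_sizes)
    )


blast_hits = io.StringIO(
    "read1\tloxp_5prime\t97.0\t34\t1\t0\t100\t133\t1\t34\t1e-10\t60\n"
    "read1\tloxp_3prime\t100.0\t34\t0\t0\t900\t933\t1\t34\t1e-12\t63\n"
    "read2\tloxp_5prime\t100.0\t20\t0\t0\t50\t69\t1\t20\t1e-5\t37\n"
)
det_sizes = {"loxp_5prime": 34, "loxp_3prime": 34}
passed = reads_with_all_determinants(blast_hits, det_sizes, length_perc=75)
print(passed)  # ['read1']
```

Here read2 is dropped because its only hit covers 20 of 34 bases of one determinant and it has no hit to the other.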

8. Variant Calculation

SAMTOOLS_DEPTH_PASSED: Calculates depth of passed reads.
SUBSAMPLE_BAM_COV: Subsamples reads to standardize coverage.
MEDAKA_VARIANT: Calls variants using Medaka.
BEDTOOLS_INTERSECT: Filters VCF against homopolymer regions.
SNIFFLES: Runs Sniffles for structural variant calling.
CLAIR3_CALL: Runs Clair3 for variant calling.
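
One common way to standardize coverage is to keep a fixed fraction of reads so that depth does not exceed a target. The sketch below shows that arithmetic only. How SUBSAMPLE_BAM_COV actually measures depth and applies var_coverage_thresh is an assumption, and `subsample_fraction` is a hypothetical helper.

```python
def subsample_fraction(observed_depth, target_depth):
    """Fraction of reads to keep so that depth does not exceed the target."""
    if observed_depth <= 0:
        raise ValueError("observed_depth must be positive")
    return min(1.0, target_depth / observed_depth)


# Using the var_coverage_thresh default of 100000 as the target depth.
deep = subsample_fraction(400_000, 100_000)
shallow = subsample_fraction(20_000, 100_000)
print(deep)     # 0.25
print(shallow)  # 1.0
```

Samples already below the target keep all of their reads.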

9. Report Generation

SAMTOOLS_DEPTH_ALIGNED: Calculates coverage depth from aligned reads.
PYSAMSTATS: Calculates strand variations.
GEN_REPORTS: Generates plots and inputs for MultiQC.
DUMP_SOFTWARE_VERSIONS: Collects software versions.
MULTIQC: Runs MultiQC for aggregate analysis reports.

Pipeline Output

The results of the CAS9POINT4 pipeline are organized into several directories, each containing specific output files. Below is a description of each directory and its contents:

`bam` Directory

This directory contains BAM files at different stages of processing:

01_initial_align: Initial alignment BAM files and their indices (.bai).
02_basic_filt: BAM files after basic filtering and their indices.
03_complex_filt: BAM files after more complex filtering, their indices, and various statistics (flagstat, idxstats, stats).
04_blast_filt: Contains subdirectories for failed and passed reads with corresponding BAM files.

`blast` Directory

Contains BLAST result files:

reads: BLAST results for individual reads.
roi_ref: BLAST results for regions of interest (ROIs).

`filtlong` Directory

Includes filtered FastQ files and log files from the Filtlong process.

`homopolymers` Directory

Contains BED files identifying homopolymeric regions in various references.

`pipeline_info` Directory

Stores various reports and logs related to pipeline execution, such as HTML reports, execution traces, samplesheet validations, and software versions.

`reports` Directory

A collection of analysis reports and visualizations:

coverage: Coverage depth plots and a compiled PDF report.
depth: Depth information in TSV format.
multiqc: Comprehensive reports from MultiQC analysis including data files and plots.
nanoplot_fastq: NanoPlot reports and statistics for FastQ files.
strand_variation: Data files containing strand variation information.

`variation` Directory

Contains output files related to variant analysis:

clair3: VCF files from Clair3 variant calling.
depth: Depth information for passed reads in TSV format.
medaka: Subdirectories for filtered and raw outputs from Medaka variant calling, including VCF files and HDF5 probability files.
sniffles: Output from Sniffles for structural variant calling, including .snf and .vcf files.
subsample: BAM files and indices from the subsampling step, aiming to standardize coverage across samples.

Pipeline Parameters

This section describes all other available parameters in the pipeline.

Input

samplesheet
Description: Path to the samplesheet file
fast5
Default: null
Description: Path to the fast5 folder or compressed .tar.gz archive. Only needed in FAST5 mode.

Output Options

outdir
Default: ./results
Description: Directory where the pipeline's results will be saved.
monochrome_logs
Default: false
Description: If true, logs will be in monochrome.
debug
Default: false
Description: Enables debug mode if set to true.

Max Resource Options

max_memory
Default: 128.GB
Description: Maximum memory allocation for the pipeline per process. Processes may request less.
max_cpus
Default: 16
Description: Maximum number of CPUs that can be used per process. Processes may request less.
max_time
Default: 240.h
Description: Maximum time allocation for the pipeline per process. Processes may request less.

Pipeline Parameters

experiment_name
Default: "exp"
Description: Name of the experiment. Used for naming files and formatting output.
dorado_bc_kit
Default: null
Description: The specific BC kit name to use for base calling. This parameter is required if running in fast5 mode. For example, EXP-PBC001. For more information, see Dorado Basecalling.
Known possible values:
- SQK-16S024
- SQK-16S114-24
- SQK-PBK004
- SQK-LWB001
- SQK-PCB109
- SQK-PCB110
- SQK-PCB111-24
- SQK-PCB114-24
- EXP-NBD103
- EXP-NBD104
- EXP-NBD114
- SQK-NBD111-24
- SQK-NBD114-24
- EXP-NBD196
- SQK-MLK111-96-XL
- SQK-NBD111-96
- SQK-NBD114-96
- SQK-MLK114-96-XL
- EXP-PBC001
- EXP-PBC096
- SQK-RAB204
- SQK-RAB201
- SQK-RBK001
- SQK-RBK110-96
- SQK-RBK111-96
- SQK-RBK114-96
- SQK-RBK111-24
- SQK-RBK114-24
- SQK-RBK004
- VSK-PTC001
- VSK-VPS001
- SQK-RPB004
- SQK-RLB001
- SQK-RPB114-24
- VSK-VMK001
- VSK-VMK004
dorado_auto_model
Default: "hac"
Description: Automatic selection of the basecalling model based on keywords. hac will run the latest compatible hac model. See https://github.com/nanoporetech/dorado
dorado_model
Description: The specific dorado model to use, e.g. dna_r9.4.1_e8_hac@v3.3. Overrides dorado_auto_model.
dorado_device
Default: "cuda:all"
Description: Device specification for Dorado. Edit for more fine-grained control over CPU/GPU base calling execution.
dorado_both_ends
Default: true
Description: Force barcode detection from both ends of the read.
filtlong_meanqual
Default: 90
Description: Threshold to keep the top percentage of reads based on quality.
blastn_outfmt
Default: 6
Description: Output format for BLASTN.
min_mapq
Default: 50
Description: Minimum mapping quality for alignment.
align_ratio_thresh
Default: 0.5
Description: Threshold for alignment ratio.
mismatch_ratio_thresh
Default: 0.3
Description: Threshold for mismatch ratio.
homopol_length
Default: 5
Description: Length of homopolymer sequences to consider.
var_coverage_thresh
Default: 100000
Description: Coverage threshold for variant calling.
var_enable_subsampling
Default: true
Description: Enable subsampling for variant analysis.
medaka_model
Default: "r941_min_high_g360"
Description: Model used for Medaka.
sniffles_min_cov_mult
Default: 0.1
Description: Minimum coverage multiplier for Sniffles.
clair3_model
Default: 'r941_prom_hac_g360+g422'
Description: Model used for Clair3.
