cas9point4

Analysis workflow for ONT data from mouse CRISPR amplicon sequencing

cas9point4 identifies reads that are likely to come from correctly edited amplicons in CRISPR-edited animals by filtering each sample's reads for determinant sequences (e.g. the insertion of a LoxP or Cre site). Any read that does not contain the determinant(s) is removed from the pileup.

The strategy aims to report only those reads that represent correct or close-to-correct mutants containing your desired modifications. Variants are then called from the final filtered BAM file using medaka and clair3 (small variants) and sniffles (structural variants).

The pipeline should execute to completion with the multiqc process running last.

Running

Samplesheet

A valid samplesheet is the only required input for the pipeline. The samplesheet describes each input sample with its associated reads/barcode, determinants and ROI references.

The example below shows a typical samplesheet for processing a nanopore run that has already been demultiplexed into fastq files.

<div style="overflow-x: auto;">

| sample_name | length_perc | determinant_1 | determinant_2 | determinant_3 | roi_ref | reads | barcode |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MPEG1-CRE-CAS-LINE4-B6N_1.1c | 75 | cre_ref.fa | | | Mpeg1_Cre.fa | BC01.fastq.gz | |
| MPEG1-CRE-CAS-LINE3-B6N_1.1d | 75 | cre_ref.fa | | | Mpeg1_Cre.fa | BC02.fastq.gz | |
| PRO_4274.1h | 75 | cre_ref.fa | | | Mpeg1_Cre.fa | BC03.fastq.gz | |
| PRO_4274.4a | 75 | cre_ref.fa | | | Mpeg1_Cre.fa | BC04.fastq.gz | |
| CX3CL1-FLOX-CAS-LINE1-B6N_1.1c | 75 | loxp_5prime.fa | loxp_3prime.fa | | Cx3cl1.fa | BC05.fastq.gz | |
| PRO_3976.1f | 75 | loxp_5prime.fa | loxp_3prime.fa | | Cx3cl1.fa | BC06.fastq.gz | |
| PRO_4264.4a | 75 | loxp_5prime.fa | loxp_3prime.fa | | Pam.fa | BC07.fastq.gz | |
| PAM-FLOX-CAS-LINE1-B6N_1.1a | 75 | loxp_5prime.fa | loxp_3prime.fa | | Pam.fa | BC08.fastq.gz | |
| PRO_4345.5a | 75 | loxp_5prime.fa | loxp_3prime.fa | | Prdm8.fa | BC09.fastq.gz | |
| PRO_4345.3g | 75 | loxp_5prime.fa | loxp_3prime.fa | | Prdm8.fa | BC10.fastq.gz | |
| PRO_4282.4e | 75 | loxp_5prime.fa | loxp_3prime.fa | | Hnf1a.fa | BC11.fastq.gz | |

</div>

Note: File paths shown above are simplified. Use full paths in your actual samplesheet (e.g., path/to/cre_ref.fa).

  • sample_name:

    • Description: Enter a unique name for each sample.
    • Example: MPEG1-CRE-CAS-LINE4-B6N_1.1c
  • length_perc:

    • Description: Specify the minimum percentage of a determinant that a read must overlap to be considered valid.
    • Example: 75 (indicates 75%)
  • determinant_1, determinant_2, determinant_3:

    • Description: These columns are for entering the paths to determinant reference files. Fill in the appropriate path for each determinant. If a determinant is not applicable, leave the column empty.
    • Example:
      • determinant_1: path/to/cre_ref.fa
      • determinant_2: path/to/loxp_5prime.fa (if applicable)
      • determinant_3: Leave blank if not applicable
  • roi_ref:

    • Description: Enter the path to the region of interest (ROI) reference file.
    • Example: path/to/Mpeg1_Cre.fa
  • reads:

    • Description: If the run uses already demultiplexed files, specify the path to the reads file in FASTQ format. Otherwise, leave blank.
    • Example: path/to/BC01.fastq.gz
  • barcode:

    • Description: If the pipeline is being run with fast5 files, enter the barcode that the sample maps to, such as BC01. Otherwise, leave blank.
    • Example: BC01

The same samplesheet as a CSV file, with full paths:

sample_name,length_perc,determinant_1,determinant_2,determinant_3,roi_ref,reads,barcode
MPEG1-CRE-CAS-LINE4-B6N_1.1c,75,tests/data/dataset_1/det/cre_ref.fa,,,tests/data/dataset_1/roi_refs/Mpeg1_Cre.fa,tests/data/dataset_1/reads_all/BC01.fastq.gz,
MPEG1-CRE-CAS-LINE3-B6N_1.1d,75,tests/data/dataset_1/det/cre_ref.fa,,,tests/data/dataset_1/roi_refs/Mpeg1_Cre.fa,tests/data/dataset_1/reads_all/BC02.fastq.gz,
PRO_4274.1h,75,tests/data/dataset_1/det/cre_ref.fa,,,tests/data/dataset_1/roi_refs/Mpeg1_Cre.fa,tests/data/dataset_1/reads_all/BC03.fastq.gz,
PRO_4274.4a,75,tests/data/dataset_1/det/cre_ref.fa,,,tests/data/dataset_1/roi_refs/Mpeg1_Cre.fa,tests/data/dataset_1/reads_all/BC04.fastq.gz,
CX3CL1-FLOX-CAS-LINE1-B6N_1.1c,75,tests/data/dataset_1/det/loxp_5prime.fa,tests/data/dataset_1/det/loxp_3prime.fa,,tests/data/dataset_1/roi_refs/Cx3cl1_Flox.fa,tests/data/dataset_1/reads_all/BC05.fastq.gz,
PRO_3976.1f,75,tests/data/dataset_1/det/loxp_5prime.fa,tests/data/dataset_1/det/loxp_3prime.fa,,tests/data/dataset_1/roi_refs/Cx3cl1_Flox.fa,tests/data/dataset_1/reads_all/BC06.fastq.gz,
PRO_4264.4a,75,tests/data/dataset_1/det/loxp_5prime.fa,tests/data/dataset_1/det/loxp_3prime.fa,,tests/data/dataset_1/roi_refs/Pam_Flox.fa,tests/data/dataset_1/reads_all/BC07.fastq.gz,
PAM-FLOX-CAS-LINE1-B6N_1.1a,75,tests/data/dataset_1/det/loxp_5prime.fa,tests/data/dataset_1/det/loxp_3prime.fa,,tests/data/dataset_1/roi_refs/Pam_Flox.fa,tests/data/dataset_1/reads_all/BC08.fastq.gz,
PRO_4345.5a,75,tests/data/dataset_1/det/loxp_5prime.fa,tests/data/dataset_1/det/loxp_3prime.fa,,tests/data/dataset_1/roi_refs/Prdm8_Flox.fa,tests/data/dataset_1/reads_all/BC09.fastq.gz,
PRO_4345.3g,75,tests/data/dataset_1/det/loxp_5prime.fa,tests/data/dataset_1/det/loxp_3prime.fa,,tests/data/dataset_1/roi_refs/Prdm8_Flox.fa,tests/data/dataset_1/reads_all/BC10.fastq.gz,
PRO_4282.4e,75,tests/data/dataset_1/det/loxp_5prime.fa,tests/data/dataset_1/det/loxp_3prime.fa,,tests/data/dataset_1/roi_refs/Hnf1a_Flox.fa,tests/data/dataset_1/reads_all/BC11.fastq.gz,
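Before launching a run, it can help to sanity-check a samplesheet locally. The following is a minimal sketch, not the pipeline's own PARSE_SAMPLESHEET logic: the function name `check_samplesheet` is hypothetical, and the specific rules (an exact header match, `length_perc` as an integer from 1 to 100, non-empty `sample_name` and `roi_ref`) are assumptions drawn from the column descriptions above.

```python
import csv
import io

# Column order as shown in the example samplesheets.
EXPECTED_COLUMNS = [
    "sample_name", "length_perc", "determinant_1", "determinant_2",
    "determinant_3", "roi_ref", "reads", "barcode",
]


def check_samplesheet(text):
    """Return a list of human-readable problems found in samplesheet CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        return [f"unexpected header: {reader.fieldnames}"]

    problems = []
    for line_no, raw in enumerate(reader, start=2):
        # Treat missing trailing fields as empty strings.
        row = {k: (raw.get(k) or "").strip() for k in EXPECTED_COLUMNS}
        if not row["sample_name"]:
            problems.append(f"line {line_no}: sample_name is empty")
        perc = row["length_perc"]
        if not perc.isdigit() or not 0 < int(perc) <= 100:
            problems.append(f"line {line_no}: length_perc must be 1-100, got {perc!r}")
        if not row["roi_ref"]:
            problems.append(f"line {line_no}: roi_ref is empty")
    return problems


if __name__ == "__main__":
    example = ",".join(EXPECTED_COLUMNS) + "\n" + (
        "PRO_4274.1h,75,path/to/cre_ref.fa,,,path/to/Mpeg1_Cre.fa,path/to/BC03.fastq.gz,\n"
    )
    print(check_samplesheet(example) or "no problems found")
```

The check works on text rather than a file path so that it can be run on a samplesheet pasted from anywhere; it does not verify that the referenced files exist.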

Pipeline Modes

FASTQ Mode

To run the pipeline in FASTQ mode, leave the barcode column blank and enter the paths to the fastq files in the reads column.

FAST5 Mode

To run the pipeline against a raw fast5 folder from a nanopore sequencing run, you must leave the reads column blank and fill in the barcode column with the barcode to which each sample maps. You must also specify the fast5 parameter, which points to the fast5 data. This can be either a folder or a compressed .tar.gz archive.
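The two modes can be told apart from the reads and barcode columns alone. The sketch below illustrates that rule under stated assumptions: `detect_mode` is a hypothetical helper, rows are plain dictionaries keyed by column name, and rejecting a sheet that mixes the two styles is an assumption rather than documented pipeline behaviour.

```python
def detect_mode(rows, fast5=None):
    """Classify a samplesheet as 'fastq' or 'fast5' from its reads/barcode columns.

    rows  -- list of dicts with at least 'reads' and 'barcode' keys
    fast5 -- value of the fast5 parameter, or None if it was not given
    """
    if not rows:
        raise ValueError("samplesheet has no samples")

    with_reads = sum(1 for r in rows if r.get("reads"))
    with_barcode = sum(1 for r in rows if r.get("barcode"))

    # FASTQ mode: every row has reads, no row has a barcode.
    if with_reads == len(rows) and with_barcode == 0:
        return "fastq"

    # FAST5 mode: every row has a barcode, no row has reads,
    # and the fast5 parameter must point at the raw data.
    if with_barcode == len(rows) and with_reads == 0:
        if not fast5:
            raise ValueError("FAST5 mode requires the fast5 parameter")
        return "fast5"

    raise ValueError("reads and barcode columns are mixed or incomplete")


if __name__ == "__main__":
    fastq_rows = [{"reads": "path/to/BC01.fastq.gz", "barcode": ""}]
    fast5_rows = [{"reads": "", "barcode": "barcode01"}]
    print(detect_mode(fastq_rows))
    print(detect_mode(fast5_rows, fast5="path/to/fast5.tar.gz"))
```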

The barcode column does not need to be unique per sample. Multiple samples can share the same barcode as long as their ROIs are different. The unique identifier for each sample is built from sample_name/roi/barcode.
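As an illustration of how such an identifier keeps samples that share a barcode apart, the sketch below joins the three fields and reports any collisions. The helper names, the underscore separator, and the use of the ROI file name without its extension as the "roi" part are all assumptions; the pipeline's actual identifier format may differ.

```python
from collections import Counter
from pathlib import PurePosixPath


def sample_id(row):
    """Build an identifier from sample_name, ROI file stem and barcode."""
    roi = PurePosixPath(row["roi_ref"]).stem  # e.g. 'Kcnc1-Flex-WT' from '.../Kcnc1-Flex-WT.fa'
    parts = [row["sample_name"], roi, row.get("barcode") or ""]
    return "_".join(p for p in parts if p)


def duplicated_ids(rows):
    """Return the identifiers that occur more than once, sorted."""
    counts = Counter(sample_id(r) for r in rows)
    return sorted(i for i, n in counts.items() if n > 1)


if __name__ == "__main__":
    # Two samples on the same barcode with different ROIs, as in the example samplesheet.
    rows = [
        {"sample_name": "Kcnc1-Flex", "roi_ref": "path/to/Kcnc1-Flex-3PRIME.fa", "barcode": "barcode09"},
        {"sample_name": "Kcnc1-Flex", "roi_ref": "path/to/Kcnc1-6232-InsertionEM.fa", "barcode": "barcode09"},
    ]
    for r in rows:
        print(sample_id(r))
    print("duplicates:", duplicated_ids(rows))
```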

Below is an example samplesheet for the pipeline in FAST5 mode:

| sample_name | length_perc | determinant_1 | determinant_2 | determinant_3 | roi_ref | reads | barcode |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PCDH19-DDCRE | 75 | path/to/pcdh19_ddcre_5prime.fa | | | path/to/PCDH19-DDCRE-5prime.fa | | barcode01 |
| PCDH19-DDCRE | 75 | path/to/pcdh19_ddcre_5prime.fa | | | path/to/PCDH19-DDCRE-5prime.fa | | barcode02 |
| PCDH19-DDCRE | 75 | path/to/pcdh19_ddcre_5prime.fa | | | path/to/PCDH19-DDCRE-5prime.fa | | barcode03 |
| PCDH19-DDCRE | 75 | path/to/pcdh19_ddcre_5prime.fa | | | path/to/PCDH19-DDCRE-5prime.fa | | barcode04 |
| PCDH19-DDCRE | 75 | path/to/pcdh19_ddcre_3prime.fa | | | path/to/PCDH19-DDCRE-5prime.fa | | barcode05 |
| PCDH19-DDCRE | 75 | path/to/pcdh19_ddcre_3prime.fa | | | path/to/PCDH19-DDCRE-5prime.fa | | barcode06 |
| PCDH19-DDCRE | 75 | path/to/pcdh19_ddcre_3prime.fa | | | path/to/PCDH19-DDCRE-5prime.fa | | barcode07 |
| PCDH19-DDCRE | 75 | path/to/pcdh19_ddcre_3prime.fa | | | path/to/PCDH19-DDCRE-5prime.fa | | barcode08 |
| PCDH19-DDCRE | 99 | path/to/pcdh19_wt.fa | | | path/to/Pcdh19_WT.fa | | barcode09 |
| SPIB-MCHERRY | 75 | path/to/mcherry.fa | | | path/to/Spib_mCherry.fa | | barcode10 |
| SPIB-MCHERRY | 75 | path/to/mcherry.fa | | | path/to/Spib_mCherry.fa | | barcode11 |
| TLR2-FLOX | 75 | path/to/loxp_3prime.fa | path/to/loxp_5prime.fa | | path/to/Tlr2_Flox.fa | | barcode12 |
| Kcnc1-Flex | 75 | path/to/kcnc1_flex_a_cassette.fa | | | path/to/Kcnc1-Flex-5PRIME.fa | | barcode01 |
| Kcnc1-Flex | 99 | path/to/kcnc1_flex_wt.fa | | | path/to/Kcnc1-Flex-WT.fa | | barcode06 |
| Kcnc1-Flex | 75 | path/to/kcnc1_flex_b_cassette.fa | | | path/to/Kcnc1-Flex-3PRIME.fa | | barcode09 |
| Kcnc1-Flex | 75 | path/to/kcnc1-6232_insertionem.fa | | | path/to/Kcnc1-6232-InsertionEM.fa | | barcode09 |

The same samplesheet as a CSV file, with full paths:

sample_name,length_perc,determinant_1,determinant_2,determinant_3,roi_ref,reads,barcode
PCDH19-DDCRE,75,tests/data/dataset_2/det/pcdh19_ddcre_5prime.fa,,,tests/data/dataset_2/roi_refs/PCDH19-DDCRE-5prime.fa,,barcode01
PCDH19-DDCRE,75,tests/data/dataset_2/det/pcdh19_ddcre_5prime.fa,,,tests/data/dataset_2/roi_refs/PCDH19-DDCRE-5prime.fa,,barcode02
PCDH19-DDCRE,75,tests/data/dataset_2/det/pcdh19_ddcre_5prime.fa,,,tests/data/dataset_2/roi_refs/PCDH19-DDCRE-5prime.fa,,barcode03
PCDH19-DDCRE,75,tests/data/dataset_2/det/pcdh19_ddcre_5prime.fa,,,tests/data/dataset_2/roi_refs/PCDH19-DDCRE-5prime.fa,,barcode04
PCDH19-DDCRE,75,tests/data/dataset_2/det/pcdh19_ddcre_3prime.fa,,,tests/data/dataset_2/roi_refs/PCDH19-DDCRE-5prime.fa,,barcode05
PCDH19-DDCRE,75,tests/data/dataset_2/det/pcdh19_ddcre_3prime.fa,,,tests/data/dataset_2/roi_refs/PCDH19-DDCRE-5prime.fa,,barcode06
PCDH19-DDCRE,75,tests/data/dataset_2/det/pcdh19_ddcre_3prime.fa,,,tests/data/dataset_2/roi_refs/PCDH19-DDCRE-5prime.fa,,barcode07
PCDH19-DDCRE,75,tests/data/dataset_2/det/pcdh19_ddcre_3prime.fa,,,tests/data/dataset_2/roi_refs/PCDH19-DDCRE-5prime.fa,,barcode08
PCDH19-DDCRE,99,tests/data/dataset_2/det/pcdh19_wt.fa,,,tests/data/dataset_2/roi_refs/Pcdh19_WT.fa,,barcode09
SPIB-MCHERRY,75,tests/data/dataset_2/det/mcherry.fa,,,tests/data/dataset_2/roi_refs/Spib_mCherry.fa,,barcode10
SPIB-MCHERRY,75,tests/data/dataset_2/det/mcherry.fa,,,tests/data/dataset_2/roi_refs/Spib_mCherry.fa,,barcode11
TLR2-FLOX,75,tests/data/dataset_2/det/loxp_3prime.fa,tests/data/dataset_2/det/loxp_5prime.fa,,tests/data/dataset_2/roi_refs/Tlr2_Flox.fa,,barcode12
Kcnc1-Flex,75,tests/data/dataset_2/det/kcnc1_flex_a_cassette.fa,,,tests/data/dataset_2/roi_refs/Kcnc1-Flex-5PRIME.fa,,barcode01
Kcnc1-Flex,99,tests/data/dataset_2/det/kcnc1_flex_wt.fa,,,tests/data/dataset_2/roi_refs/Kcnc1-Flex-WT.fa,,barcode06
Kcnc1-Flex,75,tests/data/dataset_2/det/kcnc1_flex_b_cassette.fa,,,tests/data/dataset_2/roi_refs/Kcnc1-Flex-3PRIME.fa,,barcode09
Kcnc1-Flex,75,tests/data/dataset_2/det/kcnc1-6232_insertionem.fa,,,tests/data/dataset_2/roi_refs/Kcnc1-6232-InsertionEM.fa,,barcode09

Run Steps

The following section outlines the steps of the pipeline execution.

1. Input Checking

  • PARSE_SAMPLESHEET: Parses the sample sheet file into metadata and checks for errors.

2. Convert Fast5 to FASTQ (if required)

  • DEMULTIPLEX_FAST5: Decompresses, calls bases, and sorts metadata for Fast5 files.

3. ROI and Determinant Processing

  • MERGE_DETERMINANTS: Concatenates determinant files into a single reference file.
  • BLAST_MAKEBLASTDB: Builds a BLAST database of merged determinant sequences.
  • DET_SIZES: Calculates sizes of the determinants.
  • SAMTOOLS_FAIDX: Indexes the ROI references.
  • ROI_SIZES: Calculates sizes of the ROIs.
  • SEQTK_HRUN: Scans ROI reference files for homopolymers.
  • FORMAT_BED: Sorts and formats the homopolymer output BED file.
  • BEDTOOLS_SLOP: Adds 1 to each end of BED ranges in the homopolymer output file.
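To make the homopolymer steps concrete, the sketch below finds single-base runs in a sequence and then pads each interval by one base on either side, clipped to the sequence bounds. It is a pure-Python illustration of the idea behind SEQTK_HRUN and BEDTOOLS_SLOP, not a reimplementation of either tool. The function names, the 0-based half-open coordinates, and treating the minimum run length as inclusive are assumptions.

```python
def homopolymer_runs(seq, min_len=5):
    """Return (start, end, base) for each run of one base at least min_len long.

    Coordinates are 0-based and half-open, as in BED files.
    """
    seq = seq.upper()
    runs = []
    start = 0
    for i in range(1, len(seq) + 1):
        # A run ends at the end of the sequence or when the base changes.
        if i == len(seq) or seq[i] != seq[start]:
            if i - start >= min_len:
                runs.append((start, i, seq[start]))
            start = i
    return runs


def slop(intervals, seq_len, pad=1):
    """Extend each interval by pad bases on both sides, clipped to [0, seq_len]."""
    return [(max(0, s - pad), min(seq_len, e + pad), b) for s, e, b in intervals]


if __name__ == "__main__":
    roi = "ACG" + "TTTTTT" + "GC" + "AAAAA"
    runs = homopolymer_runs(roi, min_len=5)
    for start, end, base in slop(runs, len(roi)):
        print(f"roi\t{start}\t{end}\t{base}")
```

Padding the intervals means that a variant call immediately adjacent to a homopolymer is also caught when calls are later compared against these regions.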

4. Initial Quality Control (QC)

  • NANOPLOT_FASTQ: Runs nanoplot on reads.

5. Read Filtering

  • FILTLONG: Filters long reads.

6. Alignment and Fragment Processing

  • MINIMAP2_ALIGN: Aligns to ROI reference using Minimap2.
  • SAMTOOLS_INDEX_ALIGN: Indexes the BAM file.
  • SAMTOOLS_VIEW: Filters reads for mapping quality.
  • FILTER_BAM_RATIO: Filters reads using alignment length and mismatch ratio criteria.
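The ratio filter can be pictured as two threshold checks per alignment. The source does not define either ratio, so the definitions in this sketch are assumptions made purely for illustration: the alignment ratio is taken as aligned length divided by reference length, and the mismatch ratio as mismatches divided by aligned length. The function name is hypothetical; the default thresholds match the documented defaults of align_ratio_thresh and mismatch_ratio_thresh.

```python
def passes_ratio_filter(aligned_len, reference_len, mismatches,
                        align_ratio_thresh=0.5, mismatch_ratio_thresh=0.3):
    """Return True if an alignment satisfies both ratio thresholds.

    Assumed definitions (not taken from the pipeline source):
      align ratio    = aligned_len / reference_len
      mismatch ratio = mismatches / aligned_len
    """
    if aligned_len <= 0 or reference_len <= 0:
        return False
    align_ratio = aligned_len / reference_len
    mismatch_ratio = mismatches / aligned_len
    return align_ratio >= align_ratio_thresh and mismatch_ratio <= mismatch_ratio_thresh


if __name__ == "__main__":
    # (aligned_len, reference_len, mismatches)
    for aln in [(800, 1000, 40), (400, 1000, 10), (800, 1000, 400)]:
        print(aln, passes_ratio_filter(*aln))
```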

7. BLAST and BLAST-Dependent Processing

  • BLAST_ROI_REF: Searches ROI references against determinants database.
  • SEQTK_SEQ: Converts FastQ reads to Fasta.
  • BLAST_READS: Searches reads against determinants database.
  • FILTER_BAM_BLAST: Further filters reads based on BLAST results.
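A rough sketch of how BLAST hits might be turned into a pass/fail decision per read is shown below. It assumes the standard 12-column tabular output (outfmt 6, with reads as queries and determinants as subjects), that a read passes only if every required determinant is covered to at least length_perc percent by a single hit, and that coverage is measured on the determinant coordinates. The function name and these rules are illustrative assumptions, not the pipeline's confirmed logic.

```python
import csv
import io


def reads_passing(blast_tsv, det_sizes, required, length_perc=75):
    """Return sorted read ids whose hits cover every required determinant.

    blast_tsv -- BLAST outfmt 6 text (qseqid sseqid pident length mismatch
                 gapopen qstart qend sstart send evalue bitscore)
    det_sizes -- dict of determinant name -> length in bases
    required  -- determinant names that must all be present in a read
    """
    best = {}  # (read, determinant) -> best coverage percentage seen
    for fields in csv.reader(io.StringIO(blast_tsv), delimiter="\t"):
        if len(fields) < 12 or fields[1] not in det_sizes:
            continue
        read, det = fields[0], fields[1]
        sstart, send = int(fields[8]), int(fields[9])
        # Subject coordinates may be reversed for minus-strand hits.
        coverage = 100.0 * (abs(send - sstart) + 1) / det_sizes[det]
        best[(read, det)] = max(coverage, best.get((read, det), 0.0))

    reads = {read for read, _ in best}
    return sorted(
        r for r in reads
        if all(best.get((r, d), 0.0) >= length_perc for d in required)
    )


if __name__ == "__main__":
    hit = "read1\tcre_ref\t98.5\t1000\t15\t0\t200\t1199\t1\t1000\t0.0\t1700"
    print(reads_passing(hit, {"cre_ref": 1032}, ["cre_ref"], length_perc=75))
```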

8. Variant Calculation

  • SAMTOOLS_DEPTH_PASSED: Calculates depth of passed reads.
  • SUBSAMPLE_BAM_COV: Subsamples reads to standardize coverage.
  • MEDAKA_VARIANT: Calls variants using Medaka.
  • BEDTOOLS_INTERSECT: Filters VCF against homopolymer regions.
  • SNIFFLES: Runs Sniffles for structural variant calling.
  • CLAIR3_CALL: Runs Clair3 for variant calling.
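The subsampling step can be thought of as choosing a fraction of reads to keep so that depth does not exceed a target. The following is a minimal sketch of that calculation, assuming the target corresponds to var_coverage_thresh and that the fraction is derived from the maximum observed depth. The function name is hypothetical and the pipeline may compute this differently.

```python
def subsample_fraction(max_depth, coverage_thresh=100000):
    """Return the fraction of reads to keep so depth stays at or below the threshold.

    A sample already at or below the threshold is kept whole (fraction 1.0).
    """
    if max_depth <= coverage_thresh:
        return 1.0
    return coverage_thresh / max_depth


if __name__ == "__main__":
    for depth in (50_000, 100_000, 400_000):
        print(depth, subsample_fraction(depth))
```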

9. Report Generation

  • SAMTOOLS_DEPTH_ALIGNED: Calculates coverage depth from aligned reads.
  • PYSAMSTATS: Calculates strand variations.
  • GEN_REPORTS: Generates plots and inputs for MultiQC.
  • DUMP_SOFTWARE_VERSIONS: Collects software versions.
  • MULTIQC: Runs MultiQC for aggregate analysis reports.

Pipeline Output

The results of the CAS9POINT4 pipeline are organized into several directories, each containing specific output files. Below is a description of each directory and its contents:

bam Directory

This directory contains BAM files at different stages of processing:

  • 01_initial_align: Initial alignment BAM files and their indices (.bai).
  • 02_basic_filt: BAM files after basic filtering and their indices.
  • 03_complex_filt: BAM files after more complex filtering, their indices, and various statistics (flagstat, idxstats, stats).
  • 04_blast_filt: Contains subdirectories for failed and passed reads with corresponding BAM files.

blast Directory

Contains BLAST result files:

  • reads: BLAST results for individual reads.
  • roi_ref: BLAST results for regions of interest (ROIs).

filtlong Directory

Includes filtered FastQ files and log files from the Filtlong process.

homopolymers Directory

Contains BED files identifying homopolymeric regions in various references.

pipeline_info Directory

Stores various reports and logs related to pipeline execution, such as HTML reports, execution traces, samplesheet validations, and software versions.

reports Directory

A collection of analysis reports and visualizations:

  • coverage: Coverage depth plots and a compiled PDF report.
  • depth: Depth information in TSV format.
  • multiqc: Comprehensive reports from MultiQC analysis including data files and plots.
  • nanoplot_fastq: NanoPlot reports and statistics for FastQ files.
  • strand_variation: Data files containing strand variation information.

variation Directory

Contains output files related to variant analysis:

  • clair3: VCF files from Clair3 variant calling.
  • depth: Depth information for passed reads in TSV format.
  • medaka: Subdirectories for filtered and raw outputs from Medaka variant calling, including VCF files and HDF5 probability files.
  • sniffles: Output from Sniffles for structural variant calling, including .snf and .vcf files.
  • subsample: BAM files and indices from the subsampling step, aiming to standardize coverage across samples.

Pipeline Parameters

This section describes all other available parameters in the pipeline.

Input

  • samplesheet
    Description: Path to the samplesheet file

  • fast5
    Default: null
    Description: Path to the FAST5 folder or .tar.gz archive, if applicable.

Output Options

  • outdir
    Default: ./results
    Description: Directory where the pipeline's results will be saved.

  • monochrome_logs
    Default: false
    Description: If true, logs will be in monochrome.

  • debug
    Default: false
    Description: Enables debug mode if set to true.

Max Resource Options

  • max_memory
    Default: 128.GB
    Description: Maximum memory allocation for the pipeline per process. Processes may request less.

  • max_cpus
    Default: 16
    Description: Maximum number of CPUs that can be used per process. Processes may request less.

  • max_time
    Default: 240.h
    Description: Maximum time allocation for the pipeline per process. Processes may request less.

Pipeline Parameters

  • experiment_name
    Default: "exp"
    Description: Name of the experiment. Used for naming files and formatting output.

  • dorado_bc_kit
    Default: null
    Description: The barcoding (BC) kit name to use for basecalling. This parameter is required when running in FAST5 mode. Example: EXP-PBC001. For more information, see Dorado Basecalling.
    Known possible values:

    • SQK-16S024
    • SQK-16S114-24
    • SQK-PBK004
    • SQK-LWB001
    • SQK-PCB109
    • SQK-PCB110
    • SQK-PCB111-24
    • SQK-PCB114-24
    • EXP-NBD103
    • EXP-NBD104
    • EXP-NBD114
    • SQK-NBD111-24
    • SQK-NBD114-24
    • EXP-NBD196
    • SQK-MLK111-96-XL
    • SQK-NBD111-96
    • SQK-NBD114-96
    • SQK-MLK114-96-XL
    • EXP-PBC001
    • EXP-PBC096
    • SQK-RAB204
    • SQK-RAB201
    • SQK-RBK001
    • SQK-RBK110-96
    • SQK-RBK111-96
    • SQK-RBK114-96
    • SQK-RBK111-24
    • SQK-RBK114-24
    • SQK-RBK004
    • VSK-PTC001
    • VSK-VPS001
    • SQK-RPB004
    • SQK-RLB001
    • SQK-RPB114-24
    • VSK-VMK001
    • VSK-VMK004
  • dorado_auto_model
    Default: "hac"
    Description: Automatic selection of the closest matching basecalling model based on keywords. hac will run the latest compatible hac model. See https://github.com/nanoporetech/dorado

  • dorado_model
    Description: The specific dorado model to use, e.g. dna_r9.4.1_e8_hac@v3.3. Overrides dorado_auto_model.

  • dorado_device
    Default: "cuda:all"
    Description: Device specification for Dorado. Edit for more fine-grained control over CPU/GPU basecalling execution.

  • dorado_both_ends
    Default: true
    Description: Force barcode detection from both ends of the read.

  • filtlong_meanqual
    Default: 90
    Description: Threshold to keep the top percentage of reads based on quality.

  • blastn_outfmt
    Default: 6
    Description: Output format for BLASTN.

  • min_mapq
    Default: 50
    Description: Minimum mapping quality for alignment.

  • align_ratio_thresh
    Default: 0.5
    Description: Threshold for alignment ratio.

  • mismatch_ratio_thresh
    Default: 0.3
    Description: Threshold for mismatch ratio.

  • homopol_length
    Default: 5
    Description: Length of homopolymer sequences to consider.

  • var_coverage_thresh
    Default: 100000
    Description: Coverage threshold for variant calling.

  • var_enable_subsampling
    Default: true
    Description: Enable subsampling for variant analysis.

  • medaka_model
    Default: "r941_min_high_g360"
    Description: Model used for Medaka.

  • sniffles_min_cov_mult
    Default: 0.1
    Description: Minimum coverage multiplier for Sniffles.

  • clair3_model
    Default: 'r941_prom_hac_g360+g422'
    Description: Model used for Clair3.
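One way to keep a run's settings together is to collect them in a parameters file. The sketch below builds such a file as JSON and enforces the documented rule that dorado_bc_kit is required in FAST5 mode. It assumes the pipeline is launched with Nextflow, whose -params-file option accepts JSON; the helper name `build_params` and the example paths are hypothetical.

```python
import json


def build_params(samplesheet, outdir="./results", fast5=None,
                 dorado_bc_kit=None, **overrides):
    """Assemble a parameter dictionary, checking the FAST5-mode requirement."""
    if fast5 is not None and dorado_bc_kit is None:
        raise ValueError("dorado_bc_kit is required when fast5 is set")

    params = {"samplesheet": samplesheet, "outdir": outdir}
    if fast5 is not None:
        params["fast5"] = fast5
        params["dorado_bc_kit"] = dorado_bc_kit
    # Any other documented parameter can be passed as a keyword override.
    params.update(overrides)
    return params


if __name__ == "__main__":
    params = build_params(
        "path/to/samplesheet.csv",
        fast5="path/to/fast5.tar.gz",
        dorado_bc_kit="EXP-PBC001",
        min_mapq=50,
        homopol_length=5,
    )
    print(json.dumps(params, indent=2))
```

Writing the printed JSON to a file gives a record of exactly which non-default values were used for a run.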
