Sarek

Overview

Flow provides the nf-core/sarek v3.4.0 pipeline for germline and somatic variant calling from Whole Genome Sequencing (WGS), Whole Exome Sequencing (WES), or targeted sequencing data. It follows the Broad Institute's GATK best practices and supports tumor-normal pair analysis.

Sarek can start from various entry points including raw FASTQ files, aligned BAMs, or recalibrated BAMs, making it flexible for different analysis needs and computational resources.


Pipeline Summary

The complete pipeline includes:

  1. Preprocessing

    • Quality control (FastQC)
    • Read alignment (BWA-MEM or BWA-MEM2, selected with --aligner)
    • Duplicate marking (GATK MarkDuplicates)
    • Base quality recalibration (GATK BQSR)
  2. Variant Calling

    • Multiple variant callers supported:
      • GATK HaplotypeCaller (germline)
      • GATK Mutect2 (somatic)
      • Strelka2 (germline/somatic)
      • FreeBayes (germline)
      • VarDict (germline/somatic)
      • DeepVariant (germline)
  3. Annotation

    • VEP (Variant Effect Predictor)
    • snpEff
    • Custom annotation databases
  4. Quality Control

    • MultiQC report
    • Variant calling metrics
    • Coverage analysis

Input Requirements

Sample Data

  • FASTQ files (paired-end recommended)
  • Pre-aligned BAM files (optional entry point)
  • At least 30X coverage recommended for WGS
  • At least 100X coverage recommended for WES
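
To check whether a planned sequencing run is likely to reach these depths, mean coverage can be approximated as total sequenced bases divided by target size. The sketch below is only a planning aid: the function name, the 3.1 Gb genome size, and the read counts are illustrative assumptions, and the estimate ignores duplicates, unmapped reads, and off-target reads.

```python
def estimated_depth(read_pairs: int, read_length: int, target_bases: int) -> float:
    """Approximate mean depth for paired-end data: sequenced bases / target size."""
    total_bases = read_pairs * 2 * read_length  # two reads per pair
    return total_bases / target_bases


HUMAN_GENOME_BASES = 3_100_000_000  # assumed approximate genome size

depth = estimated_depth(
    read_pairs=350_000_000,  # hypothetical run size
    read_length=150,
    target_bases=HUMAN_GENOME_BASES,
)
print(f"Estimated WGS depth: {depth:.1f}X")
```

Real depth after duplicate marking is usually lower, so it is safer to plan with some margin above the recommended minimum.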

Reference Files

  • Reference genome (FASTA)
  • Known variant databases (dbSNP, COSMIC)
  • Target intervals (for WES/panels)
  • PON (Panel of Normals) for somatic calling

Sample Sheet Format

patient,sample,lane,fastq_1,fastq_2,sex,status
PAT001,NORMAL,L001,normal_R1.fastq.gz,normal_R2.fastq.gz,XX,0
PAT001,TUMOR,L001,tumor_R1.fastq.gz,tumor_R2.fastq.gz,XX,1
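
A sample sheet can be sanity-checked before launching a run. The following is a minimal sketch, not Sarek's own validation: the function name and the specific checks are assumptions, and it only covers the FASTQ layout shown above, with status 0 for normal and 1 for tumor.

```python
import csv
import io

REQUIRED_COLUMNS = {"patient", "sample", "lane", "fastq_1", "fastq_2"}


def validate_samplesheet(text: str) -> list:
    """Return a list of problems found; an empty list means none were found."""
    reader = csv.DictReader(io.StringIO(text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [f"missing columns: {sorted(missing)}"]

    problems = []
    seen = set()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        key = (row["patient"], row["sample"], row["lane"])
        if key in seen:
            problems.append(f"line {line_no}: duplicate patient/sample/lane {key}")
        seen.add(key)
        # The status column is optional here; when present it must be 0 or 1.
        if row.get("status") not in (None, "", "0", "1"):
            problems.append(f"line {line_no}: status must be 0 (normal) or 1 (tumor)")
    return problems


example = """patient,sample,lane,fastq_1,fastq_2,sex,status
PAT001,NORMAL,L001,normal_R1.fastq.gz,normal_R2.fastq.gz,XX,0
PAT001,TUMOR,L001,tumor_R1.fastq.gz,tumor_R2.fastq.gz,XX,1
"""
print(validate_samplesheet(example))  # prints []
```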

Key Parameters

Entry Points

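  • --step: Pipeline entry point
    • mapping: Start from FASTQ (default)
    • markduplicates: Start from mapped BAM
    • prepare_recalibration: Start from duplicate-marked BAM
    • recalibrate: Start from duplicate-marked BAM plus its recalibration table
    • variant_calling: Start from recalibrated BAM
    • annotate: Start from VCF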

Analysis Type

  • --tools: Comma-separated list of variant callers (and annotation tools) to run
    • haplotypecaller: GATK germline
    • mutect2: GATK somatic
    • strelka: Strelka2
    • freebayes: FreeBayes
    • deepvariant: DeepVariant

Intervals

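  • --intervals: Target regions (BED/interval_list)
  • --no_intervals: Disable interval-based splitting and process the data as a whole
  • --nucleotides_per_second: Estimated processing speed used to group intervals into jobs of similar runtime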

Annotation

  • --tools: Add vep and/or snpeff to the tool list to annotate variants
  • --vep_cache: VEP cache directory
  • --snpeff_cache: snpEff cache directory
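
When runs are launched from scripts, it can help to assemble the command from a parameter dictionary so that each run is reproducible. This is a minimal sketch under assumptions: `build_sarek_command` is a hypothetical helper, it only concatenates the options documented above, and it does not check that the values are valid for Sarek.

```python
import shlex


def build_sarek_command(params: dict, profile: str = "docker") -> list:
    """Build an argument list for `nextflow run nf-core/sarek`."""
    cmd = ["nextflow", "run", "nf-core/sarek"]
    for name, value in params.items():
        # Pipeline parameters use a double dash.
        cmd += [f"--{name}", str(value)]
    # Nextflow's own options, such as the profile, use a single dash.
    cmd += ["-profile", profile]
    return cmd


cmd = build_sarek_command({
    "input": "samplesheet.csv",
    "genome": "GATK.GRCh38",
    "step": "mapping",
    "tools": "haplotypecaller,vep",
    "outdir": "results",
})
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; keeping it as a list avoids shell quoting problems with paths that contain spaces.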

Pipeline Outputs

Preprocessing Outputs

  1. Aligned BAMs

    • Sorted and indexed BAM files
    • Alignment statistics
    • Insert size metrics
  2. Quality Reports

    • FastQC reports
    • Alignment QC metrics
    • Duplicate rates
    • Coverage statistics

Variant Calling Outputs

  1. VCF Files

    • Raw variant calls
    • Filtered variants
    • Normalized VCFs
  2. Caller-Specific Files

    • HaplotypeCaller: GVCF files
    • Mutect2: Orientation bias model, contamination estimates
    • Strelka: SNV and indel VCFs scored with the Empirical Variant Score (EVS)

Annotation Outputs

  1. Annotated VCFs

    • Functional annotations
    • Population frequencies
    • Clinical significance
  2. Summary Reports

    • Variant statistics
    • Effect summaries
    • Gene-level impacts

Example Usage

Basic Germline Analysis

nextflow run nf-core/sarek \
  --input samplesheet.csv \
  --genome GATK.GRCh38 \
  --tools haplotypecaller,vep \
  --outdir results \
  -profile docker

Somatic Tumor-Normal Analysis

nextflow run nf-core/sarek \
  --input samplesheet.csv \
  --genome GATK.GRCh38 \
  --tools mutect2,strelka,vep \
  --pon panel_of_normals.vcf.gz \
  --outdir results \
  -profile docker

Tumor-Only Analysis

nextflow run nf-core/sarek \
  --input samplesheet.csv \
  --genome GATK.GRCh38 \
  --tools mutect2 \
  --pon panel_of_normals.vcf.gz \
  --germline_resource gnomad.vcf.gz \
  --outdir results \
  -profile docker

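Trio Analysis with Joint Germline Calling

nextflow run nf-core/sarek \
  --input trio_samplesheet.csv \
  --genome GATK.GRCh38 \
  --tools haplotypecaller \
  --joint_germline \
  --outdir results \
  -profile docker

Joint genotyping yields one multi-sample VCF for the family; de novo candidates are then identified from that VCF in a downstream step.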

Tips and Best Practices

Sample Preparation

  • Tumor purity: Aim for >20% tumor content for reliable somatic calling
  • Matched normals: Essential for accurate somatic variant detection
  • Library consistency: Use the same prep method for tumor-normal pairs
  • Technical replicates: Include for validation of key findings
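
The purity recommendation follows from how purity dilutes the variant allele fraction (VAF). The sketch below illustrates the arithmetic under stated assumptions: a clonal variant, diploid normal cells, and by default a copy-neutral tumor with one mutated copy. The function name and defaults are illustrative, not part of Sarek.

```python
def expected_vaf(purity: float, tumor_copies: int = 2, mutated_copies: int = 1) -> float:
    """Expected VAF of a clonal somatic variant, assuming diploid normal cells."""
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be between 0 and 1")
    mutant_alleles = purity * mutated_copies
    total_alleles = purity * tumor_copies + (1.0 - purity) * 2
    return mutant_alleles / total_alleles


for purity in (0.2, 0.5, 0.8):
    vaf = expected_vaf(purity)
    print(f"purity {purity:.0%}: expected VAF {vaf:.2f}, about {vaf * 100:.0f} alt reads at 100X")
```

At 20% purity a heterozygous clonal variant is expected at a VAF of about 0.10, so subclonal variants fall lower still and need correspondingly more depth to detect.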

Variant Calling Strategy

  • Multiple callers: Use consensus approach for high-confidence calls
  • Germline: HaplotypeCaller for sensitivity, DeepVariant for accuracy
  • Somatic: Combine Mutect2 + Strelka2 for comprehensive detection
  • Joint calling: Improves variant detection in family/cohort studies

Resource Optimization

  • Intervals: Use for WES/targeted panels to reduce compute time
  • Scatter-gather: Enable for large cohorts or WGS data
  • Entry points: Start from recalibrated BAMs if re-analyzing
  • Reference caching: Use --save_reference on the first run to keep the generated reference files for reuse

Quality Control

  • Contamination check: Review cross-sample contamination estimates
  • Mutation rates: Verify expected rates for your sample type
  • Caller concordance: Higher confidence for variants called by multiple tools
  • Orthogonal validation: Confirm critical findings with an independent method
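
Caller concordance can be approximated by comparing variant keys between two call sets. This is a simplified sketch: it assumes both VCFs are already normalized to the same representation, it compares only chromosome, position, REF, and ALT, and the records are made up for illustration.

```python
Variant = tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def variant_keys(vcf_lines: list) -> set:
    """Collect one key per ALT allele from VCF data lines, skipping headers."""
    keys: set[Variant] = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        for allele in alt.split(","):  # multi-allelic sites list ALTs comma-separated
            keys.add((chrom, int(pos), ref, allele))
    return keys


# Made-up records for illustration only.
mutect2_calls = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tT\t.\tPASS\t.",
    "chr2\t200\t.\tG\tC\t.\tPASS\t.",
]
strelka_calls = [
    "chr1\t100\t.\tA\tT\t.\tPASS\t.",
    "chr3\t300\t.\tC\tT\t.\tPASS\t.",
]

a, b = variant_keys(mutect2_calls), variant_keys(strelka_calls)
consensus = a & b
print(f"consensus: {sorted(consensus)}")
print(f"concordance (Jaccard): {len(consensus) / len(a | b):.2f}")
```

For real data, a dedicated tool that handles normalization and compressed VCFs is a better fit; the sketch only shows the set logic behind a consensus call.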

Output Interpretation

  • Variant priorities: Focus on PASS filter, high impact, rare variants
  • Multiple tool calls: Higher confidence when detected by multiple callers
  • Functional impact: Prioritize coding and splice site variants
  • Population frequency: Filter against gnomAD/ExAC for rare variants
  • Somatic classification: Consider VAF for clonal vs subclonal mutations
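
The prioritization rules above can be expressed as a simple filter. The sketch below is illustrative only: the `AnnotatedVariant` fields are hypothetical stand-ins for values parsed from an annotated VCF, the impact labels follow VEP's IMPACT categories, and the 1% population frequency cutoff is an assumed threshold, not a pipeline default.

```python
from dataclasses import dataclass


@dataclass
class AnnotatedVariant:
    name: str
    filter_status: str    # VCF FILTER column
    impact: str           # HIGH, MODERATE, LOW, or MODIFIER
    population_af: float  # e.g. gnomAD allele frequency
    ref_depth: int
    alt_depth: int

    @property
    def vaf(self) -> float:
        """Variant allele fraction from reference and alternate read depths."""
        depth = self.ref_depth + self.alt_depth
        return self.alt_depth / depth if depth else 0.0


def is_priority(variant: AnnotatedVariant, max_population_af: float = 0.01) -> bool:
    """PASS filter, HIGH or MODERATE impact, and rare in the population."""
    return (
        variant.filter_status == "PASS"
        and variant.impact in {"HIGH", "MODERATE"}
        and variant.population_af <= max_population_af
    )


# Made-up variants for illustration only.
variants = [
    AnnotatedVariant("var_a", "PASS", "HIGH", 0.0001, 60, 40),
    AnnotatedVariant("var_b", "PASS", "MODIFIER", 0.0, 90, 10),
    AnnotatedVariant("var_c", "weak_evidence", "HIGH", 0.0, 95, 5),
    AnnotatedVariant("var_d", "PASS", "MODERATE", 0.15, 50, 50),
]

for v in variants:
    if is_priority(v):
        print(f"{v.name}: impact={v.impact}, VAF={v.vaf:.2f}")
```

The VAF of the retained variants can then be compared with the value expected for the sample's purity to separate likely clonal from subclonal mutations.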

Troubleshooting

Common Issues

Issue: Low number of variant calls

  • Solution: Check coverage depth meets requirements (30X WGS, 100X WES)
  • Verify tumor purity is sufficient (>20% recommended)
  • Review base quality score distribution
  • Ensure the sample sheet status values and the --tools selection match the intended analysis

Issue: High false positive rate

  • Solution: Use Panel of Normals (PON) for filtering
  • Apply stricter quality filters (increase QUAL threshold)
  • Check for cross-sample contamination
  • Review mapping quality distribution

Issue: Resource problems or out of memory

  • Solution: Split analysis by chromosome using intervals
  • Reduce --max_cpus to limit parallelization
  • Use targeted intervals for WES/panel data
  • Increase memory allocation with --max_memory
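
One way to split an analysis by chromosome is to derive whole-contig BED intervals from the reference FASTA index (.fai). The sketch below assumes the standard tab-separated .fai layout with contig name and length in the first two columns; the function name is hypothetical, and the offset columns in the example are placeholders.

```python
from typing import Optional


def fai_to_bed(fai_text: str, keep: Optional[set] = None) -> list:
    """Turn .fai rows into whole-contig BED lines (0-based, half-open)."""
    bed_lines = []
    for row in fai_text.splitlines():
        if not row.strip():
            continue
        name, length = row.split("\t")[:2]
        if keep is None or name in keep:
            bed_lines.append(f"{name}\t0\t{int(length)}")
    return bed_lines


# Contig lengths are GRCh38 values; the remaining .fai columns are placeholders.
example_fai = (
    "chr1\t248956422\t0\t60\t61\n"
    "chr2\t242193529\t0\t60\t61\n"
    "chrM\t16569\t0\t60\t61\n"
)

for line in fai_to_bed(example_fai, keep={"chr1", "chr2"}):
    print(line)
```

Each line can be written to its own BED file and supplied through --intervals, so that separate runs each cover a single chromosome.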

Issue: Incorrect genome build

  • Solution: Verify sample alignment matches --genome parameter
  • Check VEP cache version matches genome build
  • Ensure intervals file corresponds to reference genome
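
A frequent symptom of a build mismatch is contig naming, for example "1" in the intervals file versus "chr1" in the reference. The following sketch compares the first column of a BED file with a FASTA index; the function names and example contents are assumptions, and matching names alone do not prove the builds are the same.

```python
def contigs(text: str) -> set:
    """First column of each data line in a BED or .fai file."""
    names = set()
    for line in text.splitlines():
        # Skip blank lines, comments, and BED track/browser header lines.
        if line.strip() and not line.startswith(("#", "track", "browser")):
            names.add(line.split("\t")[0])
    return names


def unknown_contigs(bed_text: str, fai_text: str) -> set:
    """Contigs used in the intervals file that the reference does not contain."""
    return contigs(bed_text) - contigs(fai_text)


# Made-up file contents for illustration only.
reference_fai = "chr1\t248956422\t0\t60\t61\nchr2\t242193529\t0\t60\t61\n"
targets_bed = "1\t1000\t2000\nchr2\t500\t900\n"

print(sorted(unknown_contigs(targets_bed, reference_fai)))  # prints ['1']
```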

Issue: Sample mix-ups or contamination

  • Solution: Review the contamination estimates reported by the Mutect2 workflow
  • Review sample clustering in MultiQC
  • Verify sample sheet mappings are correct
  • Check for unexpectedly high heterozygosity
