Sarek - Docs

Overview

Flow provides the nf-core/sarek v3.4.0 pipeline for germline and somatic variant calling from Whole Genome Sequencing (WGS), Whole Exome Sequencing (WES), or targeted sequencing data. It follows the Broad Institute's GATK Best Practices and supports tumor-normal pair analysis.

Sarek can start from various entry points including raw FASTQ files, aligned BAMs, or recalibrated BAMs, making it flexible for different analysis needs and computational resources.

Pipeline Summary

The complete pipeline includes:

Preprocessing
- Quality control (FastQC)
- Read alignment (BWA-MEM by default; BWA-MEM2 or DragMap selectable with --aligner)
- Duplicate marking (GATK MarkDuplicates)
- Base quality recalibration (GATK BQSR)
Variant Calling
- Multiple variant callers supported:
  - GATK HaplotypeCaller (germline)
  - GATK Mutect2 (somatic)
  - Strelka2 (germline/somatic)
  - FreeBayes (germline)
  - DeepVariant (germline)
Annotation
- VEP (Variant Effect Predictor)
- snpEff
- Custom annotation databases
Quality Control
- MultiQC report
- Variant calling metrics
- Coverage analysis

Input Requirements

Sample Data

FASTQ files (paired-end recommended)
Pre-aligned BAM files (optional entry point)
Minimum 30X coverage for WGS
Minimum 100X coverage for WES
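
To check whether a planned run is likely to reach these depths, mean coverage can be approximated as total sequenced bases divided by the size of the target region. The following is a minimal Python sketch; the function name and the example numbers are illustrative assumptions, not part of Sarek. It ignores duplicates, unmapped reads, and off-target reads, so the coverage actually achieved will be lower.

```python
def estimated_coverage(read_pairs: int, read_length: int, target_size_bp: int) -> float:
    """Approximate mean depth as total sequenced bases / target size."""
    if target_size_bp <= 0:
        raise ValueError("target_size_bp must be positive")
    total_bases = read_pairs * 2 * read_length  # two reads per pair
    return total_bases / target_size_bp


if __name__ == "__main__":
    # Hypothetical WGS run: 350 million 150 bp read pairs on a ~3.1 Gb genome.
    depth = estimated_coverage(350_000_000, 150, 3_100_000_000)
    print(f"Estimated mean coverage: {depth:.1f}X")
```

For WES or panels, pass the total size of the target intervals rather than the genome size.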

Reference Files

Reference genome (FASTA)
Known variant databases (dbSNP, known indels) for base quality recalibration
Target intervals (for WES/panels)
PON (Panel of Normals) for somatic calling

Sample Sheet Format

patient,sample,lane,fastq_1,fastq_2,sex,status
PAT001,NORMAL,L001,normal_R1.fastq.gz,normal_R2.fastq.gz,XX,0
PAT001,TUMOR,L001,tumor_R1.fastq.gz,tumor_R2.fastq.gz,XX,1
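
Before launching a run, it can help to sanity-check the sample sheet locally. The sketch below is a hypothetical helper, not the pipeline's own validation: the checks (required columns, status of 0 or 1, unique patient/sample/lane rows, gzipped FASTQ names) are assumptions drawn from the example above.

```python
import csv
import io

REQUIRED_COLUMNS = ["patient", "sample", "lane", "fastq_1", "fastq_2"]


def validate_samplesheet(text: str) -> list[str]:
    """Return a list of problems found; an empty list means none were detected."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]

    problems = []
    seen = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        key = (row["patient"], row["sample"], row["lane"])
        if key in seen:
            problems.append(f"line {line_no}: duplicate patient/sample/lane {key}")
        seen.add(key)

        # Assumption: status is 0 (normal) or 1 (tumor); blank is treated as 0.
        status = (row.get("status") or "0").strip()
        if status not in {"0", "1"}:
            problems.append(f"line {line_no}: status must be 0 or 1, got {status!r}")

        for col in ("fastq_1", "fastq_2"):
            name = row[col] or ""
            if not name.endswith((".fastq.gz", ".fq.gz")):
                problems.append(f"line {line_no}: {col} is not a gzipped FASTQ: {name!r}")
    return problems
```

Calling `validate_samplesheet(open("samplesheet.csv").read())` returns human-readable messages that can be printed or used to abort a submission script.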

Key Parameters

Entry Points

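--step: Pipeline entry point
- mapping: Start from FASTQ (default)
- markduplicates: Start from mapped BAM
- prepare_recalibration: Start from duplicate-marked BAM
- recalibrate: Start from duplicate-marked BAM plus its recalibration table
- variant_calling: Start from recalibrated BAM
- annotate: Start from VCF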

Analysis Type

--tools: Variant callers to use
- haplotypecaller: GATK germline
- mutect2: GATK somatic
- strelka: Strelka2
- freebayes: FreeBayes
- deepvariant: DeepVariant

Intervals

--intervals: Target regions (BED/interval_list)
--no_intervals: Skip interval processing
--nucleotides_per_second: Assumed processing rate, used to estimate runtime per interval when grouping intervals into jobs
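
To illustrate how a processing rate turns interval sizes into runtime estimates, here is a minimal Python sketch that divides each BED interval's length by a nucleotides-per-second value. This is an approximation for intuition only: the function name is hypothetical, the rate of 200,000 in the usage note is an example value, and the pipeline's own grouping logic is not reproduced here.

```python
def estimate_interval_seconds(
    bed_text: str, nucleotides_per_second: float
) -> list[tuple[str, float]]:
    """Estimate processing time per BED interval as length / rate."""
    if nucleotides_per_second <= 0:
        raise ValueError("nucleotides_per_second must be positive")
    estimates = []
    for line in bed_text.splitlines():
        # Skip blank lines and BED header/comment lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        # BED is 0-based, half-open; report the region in 1-based form.
        region = f"{chrom}:{start + 1}-{end}"
        estimates.append((region, (end - start) / nucleotides_per_second))
    return estimates
```

With a rate of 200,000, a 1 Mb interval is estimated at 5 seconds. Intervals with similar estimated durations tend to parallelize more evenly.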

Annotation

--tools: Add vep, snpeff, or merge to the tool list to enable annotation
--vep_cache: VEP cache directory
--snpeff_cache: snpEff cache directory

Pipeline Outputs

Preprocessing Outputs

Aligned BAMs
- Sorted and indexed BAM files
- Alignment statistics
- Insert size metrics
Quality Reports
- FastQC reports
- Alignment QC metrics
- Duplicate rates
- Coverage statistics

Variant Calling Outputs

VCF Files
- Raw variant calls
- Filtered variants
- Normalized VCFs
Caller-Specific Files
- HaplotypeCaller: GVCF files
- Mutect2: Orientation bias model, contamination estimates
- Strelka: Separate SNV and indel VCFs (somatic); variant and genome VCFs (germline)

Annotation Outputs

Annotated VCFs
- Functional annotations
- Population frequencies
- Clinical significance
Summary Reports
- Variant statistics
- Effect summaries
- Gene-level impacts

Example Usage

Basic Germline Analysis

nextflow run nf-core/sarek \
  --input samplesheet.csv \
  --genome GATK.GRCh38 \
  --tools haplotypecaller,vep \
  --outdir results \
  -profile docker

Somatic Tumor-Normal Analysis

nextflow run nf-core/sarek \
  --input samplesheet.csv \
  --genome GATK.GRCh38 \
  --tools mutect2,strelka,vep \
  --pon panel_of_normals.vcf.gz \
  --outdir results \
  -profile docker

Tumor-Only Analysis

nextflow run nf-core/sarek \
  --input samplesheet.csv \
  --genome GATK.GRCh38 \
  --tools mutect2 \
  --pon panel_of_normals.vcf.gz \
  --germline_resource gnomad.vcf.gz \
  --outdir results \
  -profile docker

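Trio Analysis with Joint Germline Calling

Joint genotyping produces a multi-sample VCF for the family. Identifying de novo variants from it requires downstream analysis with pedigree information.

nextflow run nf-core/sarek \
  --input trio_samplesheet.csv \
  --genome GATK.GRCh38 \
  --tools haplotypecaller \
  --joint_germline \
  --outdir results \
  -profile docker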

Tips and Best Practices

Sample Preparation

Tumor purity: Aim for >20% tumor content for reliable somatic calling
Matched normals: Essential for accurate somatic variant detection
Library consistency: Use same prep method for tumor-normal pairs
Technical replicates: Include for validation of key findings

Variant Calling Strategy

Multiple callers: Use consensus approach for high-confidence calls
Germline: HaplotypeCaller for sensitivity, DeepVariant for accuracy
Somatic: Combine Mutect2 + Strelka2 for comprehensive detection
Joint calling: Improves variant detection in family/cohort studies
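
The consensus approach can be sketched as a set intersection over variants that pass filters in two call sets. The Python example below is a simplified illustration with hypothetical function names. It assumes both VCFs describe variants identically (same normalization and multi-allelic representation) and are passed in as uncompressed text. For production use, a dedicated tool such as bcftools isec on normalized VCFs is more robust.

```python
def pass_variant_keys(vcf_text: str) -> set[tuple[str, int, str, str]]:
    """Collect (chrom, pos, ref, alt) for every record whose FILTER is PASS."""
    keys = set()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        if len(fields) < 7 or fields[6] != "PASS":
            continue  # keep only records that passed all filters
        chrom, pos, ref = fields[0], int(fields[1]), fields[3]
        for alt in fields[4].split(","):  # one key per ALT allele
            keys.add((chrom, pos, ref, alt))
    return keys


def consensus(vcf_a: str, vcf_b: str) -> set[tuple[str, int, str, str]]:
    """Variants that pass filters in both call sets."""
    return pass_variant_keys(vcf_a) & pass_variant_keys(vcf_b)
```

Variants found by only one caller are not necessarily false; they may deserve review rather than automatic removal.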

Resource Optimization

Intervals: Use for WES/targeted panels to reduce compute time
Scatter-gather: Enable for large cohorts or WGS data
Entry points: Start from recalibrated BAMs if re-analyzing
Reference caching: Use --save_reference on first run

Quality Control

Contamination check: Review cross-sample contamination estimates
Mutation rates: Verify expected rates for your sample type
Caller concordance: Higher confidence for variants called by multiple tools
Orthogonal validation: Confirm critical findings with independent method

Output Interpretation

Variant priorities: Focus on PASS filter, high impact, rare variants
Multiple tool calls: Higher confidence when detected by multiple callers
Functional impact: Prioritize coding and splice site variants
Population frequency: Filter against gnomAD/ExAC for rare variants
Somatic classification: Consider VAF for clonal vs subclonal mutations
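
Variant allele frequency (VAF) is the fraction of reads at a site that support the alternate allele: VAF = alt depth / (ref depth + alt depths). As a minimal sketch, the Python function below derives it from the AD (allelic depths) value in a VCF sample column. The function name is hypothetical, and it assumes AD is present with integer counts listed reference first. It does not decide clonal versus subclonal status, which also depends on purity and copy number.

```python
def vaf_from_ad(format_field: str, sample_field: str) -> list[float]:
    """Return one VAF per ALT allele, computed from the AD sample value."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    ad = dict(zip(keys, values)).get("AD")
    if ad is None:
        raise ValueError("FORMAT field has no AD entry")
    depths = [int(x) for x in ad.split(",")]  # reference depth first, then ALTs
    total = sum(depths)
    if total == 0:
        return [0.0] * (len(depths) - 1)  # no supporting reads at this site
    return [alt_depth / total for alt_depth in depths[1:]]
```

For example, a FORMAT of GT:AD:AF:DP with the sample value 0/1:30,10:0.25:40 gives a VAF of 10 / 40 = 0.25.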

Troubleshooting

Common Issues

Issue: Low number of variant calls

Solution: Check coverage depth meets requirements (30X WGS, 100X WES)
Verify tumor purity is sufficient (>20% recommended)
Review base quality score distribution
Ensure --tools lists callers suited to the analysis (germline or somatic) and that the status column marks tumor samples as 1

Issue: High false positive rate

Solution: Use Panel of Normals (PON) for filtering
Apply stricter quality filters (increase QUAL threshold)
Check for cross-sample contamination
Review mapping quality distribution

Issue: Resource problems or out of memory

Solution: Split analysis by chromosome using intervals
Reduce --max_cpus to limit parallelization
Use targeted intervals for WES/panel data
Increase memory allocation with --max_memory

Issue: Incorrect genome build

Solution: Verify sample alignment matches --genome parameter
Check VEP cache version matches genome build
Ensure intervals file corresponds to reference genome

Issue: Sample mix-ups or contamination

Solution: Review the contamination estimates produced alongside Mutect2 calls
Review sample clustering in MultiQC
Verify sample sheet mappings are correct
Check for unexpected high heterozygosity

Additional Resources

Full documentation: nf-core/sarek documentation
Pipeline source code: GitHub - nf-core/sarek
GATK Best Practices: gatk.broadinstitute.org
Support: Join the #sarek channel on nf-core Slack
Citation: Garcia et al. (2020) doi.org/10.12688/f1000research.16665.2