Sarek
Overview
Flow provides the nf-core/sarek v3.4.0 pipeline for germline and somatic variant calling from whole-genome sequencing (WGS), whole-exome sequencing (WES), or targeted sequencing data. It implements best practices from the Broad Institute's GATK workflows and supports tumor-normal pair analysis.
Sarek can start from various entry points including raw FASTQ files, aligned BAMs, or recalibrated BAMs, making it flexible for different analysis needs and computational resources.
Pipeline Summary
The complete pipeline includes:
Preprocessing
- Quality control (FastQC)
- Read alignment (BWA-MEM2)
- Duplicate marking (GATK MarkDuplicates)
- Base quality recalibration (GATK BQSR)
Variant Calling
- Multiple variant callers supported:
  - GATK HaplotypeCaller (germline)
  - GATK Mutect2 (somatic)
  - Strelka2 (germline/somatic)
  - FreeBayes (germline)
  - VarDict (germline/somatic)
  - DeepVariant (germline)
Annotation
- VEP (Variant Effect Predictor)
- snpEff
- Custom annotation databases
Quality Control
- MultiQC report
- Variant calling metrics
- Coverage analysis
Input Requirements
Sample Data
- FASTQ files (paired-end recommended)
- Pre-aligned BAM files (optional entry point)
- Minimum 30X coverage for WGS
- Minimum 100X coverage for WES
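To judge whether a sequencing run is likely to reach these depths, the expected mean coverage can be approximated from the read count. The sketch below is not part of Sarek: `estimate_depth` is a hypothetical helper, and the formula ignores duplicates, unmapped reads, and off-target reads, so treat the result as an upper bound.

```shell
# estimate_depth READ_PAIRS READ_LENGTH TARGET_BP
# Hypothetical helper: rough mean depth for paired-end data,
#   depth = (read pairs * 2 * read length) / target size in bp.
# Duplicates, unmapped reads, and off-target reads are ignored, so the
# depth measured after alignment will usually be lower.
estimate_depth() {
  awk -v pairs="$1" -v len="$2" -v target="$3" \
    'BEGIN { printf "%.1f\n", (pairs * 2 * len) / target }'
}

# Example: 320 million 150 bp read pairs over a ~3.1 Gb genome.
estimate_depth 320000000 150 3100000000   # prints 31.0
```

For WES or panels, pass the total size of the target intervals instead of the genome size.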
Reference Files
- Reference genome (FASTA)
- Known variant databases (dbSNP, COSMIC)
- Target intervals (for WES/panels)
- PON (Panel of Normals) for somatic calling
Sample Sheet Format
patient,sample,lane,fastq_1,fastq_2,sex,status
PAT001,NORMAL,L001,normal_R1.fastq.gz,normal_R2.fastq.gz,XX,0
PAT001,TUMOR,L001,tumor_R1.fastq.gz,tumor_R2.fastq.gz,XX,1
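A quick check of the sample sheet before launching can catch formatting mistakes early. The following is a minimal sketch under stated assumptions: `check_samplesheet` is a hypothetical helper (not a Sarek command), it expects exactly the seven-column header shown above, and it only checks the field count and that `status` is 0 (normal) or 1 (tumor).

```shell
# check_samplesheet FILE
# Hypothetical pre-flight check (not part of Sarek): verifies the header,
# the number of fields per row, and that status is 0 or 1.
check_samplesheet() {
  awk -F',' '
    NR == 1 {
      if ($0 != "patient,sample,lane,fastq_1,fastq_2,sex,status") {
        print "line 1: unexpected header: " $0
        bad = 1
      }
      next
    }
    NF != 7 {
      print "line " NR ": expected 7 fields, found " NF
      bad = 1
      next
    }
    $7 != "0" && $7 != "1" {
      print "line " NR ": status must be 0 or 1, found " $7
      bad = 1
    }
    END { exit (bad + 0) }
  ' "$1"
}

# Usage (file name is illustrative):
#   check_samplesheet samplesheet.csv && echo "samplesheet looks OK"
```

The function exits non-zero and prints one message per problem, so it can gate a launch script. Sarek performs its own, more thorough validation at startup.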
Key Parameters
Entry Points
- --step: Pipeline entry point
  - mapping: Start from FASTQ
  - markduplicates: Start from mapped BAM
  - recalibrate: Start from duplicate-marked BAM
  - variant_calling: Start from recalibrated BAM
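In a wrapper script, the entry point can be picked from the kind of input at hand. This is a small sketch that mirrors the mapping listed above; `step_for_input` and its input labels (`fastq`, `mapped_bam`, and so on) are hypothetical names chosen for illustration.

```shell
# step_for_input KIND
# Print the --step value for a kind of input, following the list above.
# The KIND labels are hypothetical names used only in this sketch.
step_for_input() {
  case "$1" in
    fastq)            echo mapping ;;
    mapped_bam)       echo markduplicates ;;
    markdup_bam)      echo recalibrate ;;
    recalibrated_bam) echo variant_calling ;;
    *) echo "unknown input kind: $1" >&2; return 1 ;;
  esac
}

# Usage:
#   nextflow run nf-core/sarek --step "$(step_for_input mapped_bam)" ...
```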
Analysis Type
- --tools: Variant callers to use
  - haplotypecaller: GATK germline
  - mutect2: GATK somatic
  - strelka: Strelka2
  - freebayes: FreeBayes
  - deepvariant: DeepVariant
Intervals
- --intervals: Target regions (BED/interval_list)
- --no_intervals: Skip interval processing
- --nucleotides_per_second: Interval splitting
Annotation
- --annotate_tools: Annotation tools (vep, snpeff)
- --vep_cache: VEP cache directory
- --snpeff_cache: snpEff cache directory
Pipeline Outputs
Preprocessing Outputs
Aligned BAMs
- Sorted and indexed BAM files
- Alignment statistics
- Insert size metrics
Quality Reports
- FastQC reports
- Alignment QC metrics
- Duplicate rates
- Coverage statistics
Variant Calling Outputs
VCF Files
- Raw variant calls
- Filtered variants
- Normalized VCFs
Caller-Specific Files
- GATK: GVCF files, contamination estimates
- Mutect2: Orientation bias, contamination
- Strelka: EVS/SVS files
Annotation Outputs
Annotated VCFs
- Functional annotations
- Population frequencies
- Clinical significance
Summary Reports
- Variant statistics
- Effect summaries
- Gene-level impacts
Example Usage
Basic Germline Analysis
nextflow run nf-core/sarek \
--input samplesheet.csv \
--genome GATK.GRCh38 \
--tools haplotypecaller,vep \
--outdir results \
-profile docker
Somatic Tumor-Normal Analysis
nextflow run nf-core/sarek \
--input samplesheet.csv \
--genome GATK.GRCh38 \
--tools mutect2,strelka,vep \
--pon panel_of_normals.vcf.gz \
--outdir results \
-profile docker
Tumor-Only Analysis
nextflow run nf-core/sarek \
--input samplesheet.csv \
--genome GATK.GRCh38 \
--tools mutect2 \
--pon panel_of_normals.vcf.gz \
--germline_resource gnomad.vcf.gz \
--outdir results \
-profile docker
Trio Analysis for De Novo Variants
nextflow run nf-core/sarek \
--input trio_samplesheet.csv \
--genome GATK.GRCh38 \
--tools haplotypecaller \
--joint_germline \
--ped family.ped \
--outdir results \
-profile docker
Tips and Best Practices
Sample Preparation
- Tumor purity: Aim for >20% tumor content for reliable somatic calling
- Matched normals: Essential for accurate somatic variant detection
- Library consistency: Use same prep method for tumor-normal pairs
- Technical replicates: Include for validation of key findings
Variant Calling Strategy
- Multiple callers: Use consensus approach for high-confidence calls
- Germline: HaplotypeCaller for sensitivity, DeepVariant for accuracy
- Somatic: Combine Mutect2 + Strelka2 for comprehensive detection
- Joint calling: Improves variant detection in family/cohort studies
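The consensus approach can be sketched with standard text tools. The helpers below are hypothetical and deliberately simple: they assume uncompressed VCFs whose records are already normalized and split into one ALT allele per line, and they match variants on CHROM, POS, REF, and ALT only. For production work, a dedicated tool such as `bcftools isec` on normalized VCFs is the more robust choice.

```shell
# vcf_keys FILE
# Print one CHROM:POS:REF:ALT key per record of an uncompressed VCF.
vcf_keys() {
  awk -F'\t' '!/^#/ { print $1 ":" $2 ":" $4 ":" $5 }' "$1" | sort -u
}

# consensus_keys MIN_CALLERS FILE...
# Print the keys found in at least MIN_CALLERS of the given VCF files.
consensus_keys() {
  local min="$1"
  shift
  for f in "$@"; do
    vcf_keys "$f"
  done | sort | uniq -c | awk -v min="$min" '$1 >= min { print $2 }'
}

# Usage (file names are illustrative):
#   consensus_keys 2 mutect2.vcf strelka.vcf
```

Because each file contributes a key at most once, the count from `uniq -c` equals the number of callers that reported the variant.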
Resource Optimization
- Intervals: Use for WES/targeted panels to reduce compute time
- Scatter-gather: Enable for large cohorts or WGS data
- Entry points: Start from recalibrated BAMs if re-analyzing
- Reference caching: Use --save_reference on first run
Quality Control
- Contamination check: Review cross-sample contamination estimates
- Mutation rates: Verify expected rates for your sample type
- Caller concordance: Higher confidence for variants called by multiple tools
- Orthogonal validation: Confirm critical findings with independent method
Output Interpretation
- Variant priorities: Focus on PASS filter, high impact, rare variants
- Multiple tool calls: Higher confidence when detected by multiple callers
- Functional impact: Prioritize coding and splice site variants
- Population frequency: Filter against gnomAD/ExAC for rare variants
- Somatic classification: Consider VAF for clonal vs subclonal mutations
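The variant allele fraction (VAF) mentioned above is the share of reads supporting the alternate allele. A minimal sketch of the arithmetic follows; `vaf` is a hypothetical helper, and the reference and alternate read depths are assumed to have been extracted from the VCF beforehand (for example from the per-sample allelic depth field).

```shell
# vaf REF_DEPTH ALT_DEPTH
# Variant allele fraction = alt reads / (ref reads + alt reads).
# Prints NA when there is no coverage at the site.
vaf() {
  awk -v ref="$1" -v alt="$2" 'BEGIN {
    total = ref + alt
    if (total == 0) { print "NA"; exit }
    printf "%.3f\n", alt / total
  }'
}

vaf 75 25   # prints 0.250
```

What counts as clonal or subclonal depends on tumor purity and local copy number, so no fixed VAF threshold is suggested here.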
Troubleshooting
Common Issues
Issue: Low number of variant calls
- Solution: Check coverage depth meets requirements (30X WGS, 100X WES)
- Verify tumor purity is sufficient (>20% recommended)
- Review base quality score distribution
- Ensure the correct --analysis_type is specified
Issue: High false positive rate
- Solution: Use Panel of Normals (PON) for filtering
- Apply stricter quality filters (increase QUAL threshold)
- Check for cross-sample contamination
- Review mapping quality distribution
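A stricter quality filter can be prototyped on an uncompressed VCF as sketched below. `filter_vcf` is a hypothetical helper, not a Sarek option. Some callers (Mutect2, for example) leave the QUAL column empty, so a QUAL threshold mainly suits germline callers such as HaplotypeCaller or FreeBayes; records with a missing QUAL are dropped by this sketch.

```shell
# filter_vcf MIN_QUAL FILE
# Keep header lines plus records with FILTER == PASS and QUAL >= MIN_QUAL.
# Records whose QUAL is missing (".") are dropped.
filter_vcf() {
  awk -F'\t' -v min="$1" '
    /^#/ { print; next }
    $7 == "PASS" && $6 != "." && ($6 + 0) >= (min + 0)
  ' "$2"
}

# Usage (file names are illustrative):
#   filter_vcf 30 sample.haplotypecaller.vcf > sample.q30.vcf
```

On compressed or indexed files, `bcftools view` with a filter expression does the same job without decompressing by hand.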
Issue: Resource problems or out of memory
- Solution: Split analysis by chromosome using intervals
- Reduce --max_cpus to limit parallelization
- Use targeted intervals for WES/panel data
- Increase memory allocation with --max_memory
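To restrict a run to a single chromosome, a whole-chromosome BED file can be derived from the reference FASTA index and passed to --intervals. This is a sketch under assumptions: `chrom_bed` is a hypothetical helper, and the `.fai` file is assumed to be a standard FASTA index with the sequence name in column 1 and its length in column 2.

```shell
# chrom_bed FAI CHROM
# Print a one-line BED record covering a whole chromosome, using the
# FASTA index (.fai: name<TAB>length<TAB>...). BED starts are 0-based.
# Exits non-zero if the chromosome is not in the index.
chrom_bed() {
  awk -F'\t' -v OFS='\t' -v chrom="$2" '
    ($1 "") == (chrom "") { print $1, 0, $2; found = 1 }  # force string comparison
    END { exit !found }
  ' "$1"
}

# Usage (file names are illustrative):
#   chrom_bed genome.fa.fai chr1 > chr1.bed
#   nextflow run nf-core/sarek --intervals chr1.bed ...
```

Running one chromosome at a time trades wall-clock time for a lower peak resource footprint.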
Issue: Incorrect genome build
- Solution: Verify that the sample alignment matches the --genome parameter
- Check VEP cache version matches genome build
- Ensure intervals file corresponds to reference genome
Issue: Sample mix-ups or contamination
- Solution: Use the --check_contamination flag
- Review sample clustering in MultiQC
- Verify sample sheet mappings are correct
- Check for unexpectedly high heterozygosity
Additional Resources
- Full documentation: nf-core/sarek documentation
- Pipeline source code: GitHub - nf-core/sarek
- GATK Best Practices: gatk.broadinstitute.org
- Support: Join the #sarek channel on nf-core Slack
- Citation: Garcia et al. (2020) doi.org/10.12688/f1000research.16665.2