ChIP-Seq Pipeline

Overview

The ChIP-Seq pipeline is based on nf-core/chipseq v2.0.0, a bioinformatics best-practice analysis pipeline for ChIP-seq data. This pipeline performs quality control, alignment, peak calling, and differential binding analysis for ChIP-seq experiments.

Pipeline Features

  • Multiple alignment options: BWA, Chromap, Bowtie2, STAR
  • Comprehensive quality control: FastQC reports and MultiQC summaries
  • Peak calling: MACS2 for identifying enriched regions
  • Peak annotation: HOMER for genomic feature annotation
  • Differential binding analysis: Compare binding between conditions
  • Reproducible analysis: Docker/Singularity containers ensure consistency

Pipeline Workflow

  1. Raw read QC - FastQC assessment of sequencing quality
  2. Adapter trimming - Trim Galore! removes adapter sequences
  3. Alignment - Map reads to reference genome
  4. Duplicate marking - Identify and mark PCR duplicates
  5. Filtering and QC - Remove low-quality alignments
  6. Peak calling - MACS2 identifies enriched regions
  7. Peak annotation - HOMER annotates peaks with genomic features
  8. Differential analysis - Compare binding across conditions

Input Requirements

Sample Sheet Format

The pipeline requires a CSV sample sheet with the following columns:

Column     Description
sample     Unique sample identifier (spaces are converted to underscores)
fastq_1    Path to the Read 1 FASTQ file (gzipped)
fastq_2    Path to the Read 2 FASTQ file (paired-end data only; leave empty for single-end)
antibody   Name of the ChIP antibody used
control    Name of the corresponding input control sample (leave empty for control samples)

Example sample sheet:

sample,fastq_1,fastq_2,antibody,control
CONTROL_1,control_1.fastq.gz,,input,
CONTROL_2,control_2.fastq.gz,,input,
TREATMENT_1,treatment_1_R1.fastq.gz,treatment_1_R2.fastq.gz,h3k27ac,CONTROL_1
TREATMENT_2,treatment_2_R1.fastq.gz,treatment_2_R2.fastq.gz,h3k27ac,CONTROL_2
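
Before launching a run, it can help to sanity-check the sample sheet. The sketch below is not part of the pipeline: `check_samplesheet` is a hypothetical helper that assumes the five-column layout described above and plain comma-separated fields without quoting. It checks the header, the number of fields per row, and that every non-empty control value names a sample present in the sheet.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of nf-core/chipseq): basic sample sheet checks.
# Assumes plain comma-separated fields with no quoted commas.
check_samplesheet() {
  awk -F',' '
    { sub(/\r$/, "") }                     # tolerate Windows line endings
    NR == 1 {
      if ($0 != "sample,fastq_1,fastq_2,antibody,control") {
        print "ERROR: unexpected header: " $0; bad = 1
      }
      next
    }
    NF == 0 { next }                       # skip blank lines
    NF != 5 {
      print "ERROR: line " NR " has " NF " fields, expected 5"; bad = 1
      next
    }
    {
      samples[$1] = 1                      # remember every sample name
      if ($5 != "") { control_of[NR] = $5 }
    }
    END {
      # Every referenced control must itself be a sample in the sheet.
      for (line in control_of) {
        if (!(control_of[line] in samples)) {
          print "ERROR: line " line " references unknown control " control_of[line]
          bad = 1
        }
      }
      if (!bad) { print "OK" }
      exit (bad ? 1 : 0)
    }
  ' "$1"
}
```

Calling `check_samplesheet samplesheet.csv` prints `OK` and returns 0 when all checks pass; otherwise it prints one line per problem and returns 1.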

Key Parameters

Parameter          Description                                  Default
--input            Path to the input sample sheet CSV file      Required
--outdir           Output directory for results                 Required
--genome           Reference genome (e.g., GRCh37, GRCh38)      Required
--save_reference   Save generated reference genome indices      false
-profile           Configuration profile (docker/singularity)   Required

Running the Pipeline

Basic execution command:

nextflow run nf-core/chipseq \
  --input samplesheet.csv \
  --outdir results \
  --genome GRCh37 \
  -profile docker

Resume an interrupted run:

nextflow run nf-core/chipseq \
  --input samplesheet.csv \
  --outdir results \
  --genome GRCh37 \
  -profile docker \
  -resume
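
For runs with more than a handful of options, Nextflow's `-params-file` option reads pipeline parameters from a YAML or JSON file instead of the command line. The sketch below writes the same three parameters used in the basic command to a file. The helper name `write_params` and the file name `params.yaml` are illustrative choices, and `-r 2.0.0` pins the pipeline release this page is based on.

```shell
#!/usr/bin/env bash
# Illustrative helper: write the parameters from the basic command to a YAML file.
# Usage: write_params <path>
write_params() {
  cat > "$1" <<'EOF'
input: samplesheet.csv
outdir: results
genome: GRCh37
EOF
}

# Typical use (requires Nextflow and Docker to be installed):
#   write_params params.yaml
#   nextflow run nf-core/chipseq -r 2.0.0 -params-file params.yaml -profile docker
```

Keeping parameters in a file makes a run easier to repeat and to review than a long command line. Options with a single dash, such as `-profile` and `-resume`, belong to Nextflow itself and stay on the command line.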

Output Files

The pipeline generates the following key outputs:

  • Quality Control

    • FastQC reports for raw and processed reads
    • MultiQC report summarizing all QC metrics
  • Alignments

    • BAM files with aligned reads
    • BAM index files (.bai)
    • Alignment statistics
  • Peak Files

    • narrowPeak/broadPeak files from MACS2
    • Summit files indicating peak centers
    • Peak annotation files from HOMER
  • Coverage Tracks

    • BigWig files for genome browser visualization
    • Normalized coverage tracks
  • Differential Binding

    • Differential peak analysis results
    • Plots and statistics for differential binding

Tips and Best Practices

Experimental Design

  • Control samples are critical: Always include appropriate input or IgG controls
  • Biological replicates: Include at least two, and ideally three, biological replicates per condition
  • Sequencing depth: Aim for 20-50 million reads for transcription factors, 50-100 million for histone marks
  • Antibody validation: Ensure antibody specificity through western blot or ChIP-qPCR
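
The depth guideline above can be checked before a run by counting reads in each FASTQ file. This is a minimal sketch, assuming standard four-line FASTQ records and gzip-compressed input; `count_reads` and `depth_ok` are hypothetical helper names, and the thresholds are simply the lower bounds quoted in the list. For paired-end data, counting Read 1 gives the number of read pairs.

```shell
#!/usr/bin/env bash
# Count reads in a gzipped FASTQ file, assuming standard 4-line records.
count_reads() {
  gzip -dc "$1" | awk 'END { printf "%d\n", NR / 4 }'
}

# Check a read count against the lower bounds quoted above:
# 20 million for transcription factors (tf), 50 million for histone marks.
# Usage: depth_ok <read_count> <tf|histone>
depth_ok() {
  local reads="$1" target="$2" min
  case "$target" in
    tf)      min=20000000 ;;
    histone) min=50000000 ;;
    *) echo "unknown target type: $target" >&2; return 1 ;;
  esac
  [ "$reads" -ge "$min" ]
}
```

For example, `depth_ok "$(count_reads treatment_1_R1.fastq.gz)" tf` returns 0 when the file holds at least 20 million reads.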

Sample Preparation

  • Cross-linking optimization: Use 1% formaldehyde for most proteins, consider dual cross-linking for larger complexes
  • Fragmentation consistency: Aim for 200-300bp fragments for transcription factors, 200-500bp for histone marks
  • Library complexity: Check library complexity metrics to ensure sufficient unique molecules

Parameter Optimization

  • Peak calling mode: Use narrow for transcription factors and sharp histone marks (H3K4me3, H3K27ac)
  • Broad peak calling: Use broad mode for spreading marks like H3K36me3, H3K27me3
  • Effective genome size: Use appropriate size for your organism (e.g., 2.7e9 for human, 1.87e9 for mouse)
  • Duplicate removal: Keep duplicates for low complexity samples or single-end data
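
The peak-mode and genome-size choices above translate into command-line arguments. The sketch below is a hypothetical helper, `peak_calling_args`, that maps a mark type and organism to arguments. The flag names `--narrow_peak` and `--macs_gsize`, and the assumption that broad peak calling is the default mode, come from the nf-core/chipseq parameter documentation as best understood here; confirm them against the version you run.

```shell
#!/usr/bin/env bash
# Hypothetical helper: choose peak-calling arguments from mark type and organism.
# Flag names (--narrow_peak, --macs_gsize) are assumptions to verify against
# the nf-core/chipseq parameter documentation for your pipeline version.
# Usage: peak_calling_args <narrow|broad> <human|mouse>
peak_calling_args() {
  local mark_type="$1" organism="$2" args=()
  case "$mark_type" in
    narrow) args+=(--narrow_peak) ;;   # transcription factors, H3K4me3, H3K27ac
    broad)  ;;                         # H3K36me3, H3K27me3: assumed default mode
    *) echo "unknown mark type: $mark_type" >&2; return 1 ;;
  esac
  case "$organism" in
    human) args+=(--macs_gsize 2.7e9) ;;
    mouse) args+=(--macs_gsize 1.87e9) ;;
    *) echo "unknown organism: $organism" >&2; return 1 ;;
  esac
  printf '%s\n' "${args[*]}"
}
```

The output can be appended to the basic command, for example `nextflow run nf-core/chipseq ... $(peak_calling_args narrow human)`.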

Quality Assessment

  • FRiP score: Fraction of Reads in Peaks should be >5% for acceptable quality
  • NSC and RSC: Check strand cross-correlation metrics for library quality
  • Peak reproducibility: Use IDR (Irreproducible Discovery Rate) for replicate concordance
  • Signal-to-noise: Review fingerprint plots in MultiQC report
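
The FRiP guideline above is a simple ratio: reads overlapping called peaks divided by total mapped reads. A minimal sketch of the arithmetic follows, assuming the two counts have already been obtained by other means; `frip_percent` and `frip_passes` are hypothetical helper names, and the 5% cutoff is the one quoted in the list.

```shell
#!/usr/bin/env bash
# Compute FRiP (fraction of reads in peaks) as a percentage from two counts.
# Usage: frip_percent <reads_in_peaks> <total_mapped_reads>
frip_percent() {
  awk -v in_peaks="$1" -v total="$2" 'BEGIN {
    if (total <= 0) {
      print "ERROR: total reads must be positive" > "/dev/stderr"
      exit 1
    }
    printf "%.2f\n", 100 * in_peaks / total
  }'
}

# Return 0 when FRiP exceeds the 5% guideline, 1 otherwise.
# Usage: frip_passes <reads_in_peaks> <total_mapped_reads>
frip_passes() {
  awk -v frip="$(frip_percent "$1" "$2")" 'BEGIN { exit !(frip > 5) }'
}
```

For instance, 60,000 reads in peaks out of 1,000,000 mapped reads gives a FRiP of 6.00%, which passes the guideline; 30,000 out of 1,000,000 gives 3.00%, which does not.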

Troubleshooting

Common Issues

Issue: Low number of peaks detected

  • Solution: Check antibody quality and ChIP efficiency
  • Relax the MACS2 q-value threshold (for example, raise it from the default 0.05 to 0.1)
  • Verify sufficient sequencing depth for your target
  • Check fragment size distribution is appropriate

Issue: High background or noisy data

  • Solution: Increase stringency in peak calling (lower q-value)
  • Check for PCR duplicate rates - high rates indicate low complexity
  • Verify proper input/IgG control subtraction
  • Consider additional filtering based on fold enrichment
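
Post-hoc filtering on fold enrichment and q-value can be done directly on the MACS2 narrowPeak file, where column 7 holds the fold enrichment and column 9 holds -log10(q-value). The sketch below uses a hypothetical helper, `filter_peaks`; the thresholds in the usage example are illustrative, not recommendations from the pipeline.

```shell
#!/usr/bin/env bash
# Keep peaks with fold enrichment >= min_fold and q-value <= max_qvalue.
# narrowPeak columns used: 7 = fold enrichment, 9 = -log10(q-value).
# Usage: filter_peaks <peaks.narrowPeak> <min_fold_enrichment> <max_qvalue>
filter_peaks() {
  awk -F'\t' -v OFS='\t' -v min_fe="$2" -v max_q="$3" '
    BEGIN {
      if (max_q <= 0 || max_q > 1) {
        print "ERROR: max_qvalue must be in (0, 1]" > "/dev/stderr"
        exit 1
      }
      # Convert the q-value cutoff to the -log10 scale stored in the file.
      min_neglog_q = -log(max_q) / log(10)
    }
    $7 >= min_fe && $9 >= min_neglog_q
  ' "$1"
}
```

For example, `filter_peaks sample_peaks.narrowPeak 4 0.01 > sample_peaks.filtered.narrowPeak` keeps peaks with at least 4-fold enrichment and a q-value of at most 0.01. Peaks that sit exactly on the q-value cutoff may fall on either side because of floating-point rounding.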

Issue: Peaks not overlapping between replicates

  • Solution: Check if replicates cluster together in PCA/correlation analysis
  • Use IDR to identify reproducible peaks
  • Consider batch effects in sample processing
  • Verify consistent antibody lot numbers

Issue: Memory errors during alignment

  • Solution: Reduce number of parallel alignment jobs
  • Use --max_memory and --max_cpus to cap the resources any single job can request (for example, --max_memory '32.GB' --max_cpus 8)
  • Process samples in smaller batches

Issue: No differential binding detected

  • Solution: Check if samples separate by condition in PCA plots
  • Verify sufficient peak overlap between samples
  • Consider using a more lenient FDR threshold
  • Ensure peak calling was performed consistently

Issue: Adapter contamination in results

  • Solution: Check FastQC reports for adapter content
  • Adjust Trim Galore! parameters for more aggressive trimming
  • Verify correct adapter sequences are specified
