ChIP-Seq Pipeline
Overview
The ChIP-Seq pipeline is based on nf-core/chipseq v2.0.0, a bioinformatics best-practice analysis pipeline for ChIP-seq data. This pipeline performs quality control, alignment, peak calling, and differential binding analysis for ChIP-seq experiments.
Pipeline Features
- Multiple alignment options: BWA, Chromap, Bowtie2, STAR
- Comprehensive quality control: FastQC reports and MultiQC summaries
- Peak calling: MACS2 for identifying enriched regions
- Peak annotation: HOMER for genomic feature annotation
- Differential binding analysis: Compare binding between conditions
- Reproducible analysis: Docker/Singularity containers ensure consistency
Pipeline Workflow
- Raw read QC - FastQC assessment of sequencing quality
- Adapter trimming - Trim Galore! removes adapter sequences
- Alignment - Map reads to reference genome
- Duplicate marking - Identify and mark PCR duplicates
- Filtering and QC - Remove low-quality alignments
- Peak calling - MACS2 identifies enriched regions
- Peak annotation - HOMER annotates peaks with genomic features
- Differential analysis - Compare binding across conditions
Input Requirements
Sample Sheet Format
The pipeline requires a CSV sample sheet with the following columns:
Column | Description |
---|---|
sample | Unique sample identifier (spaces converted to underscores) |
fastq_1 | Path to Read 1 FASTQ file (gzipped) |
fastq_2 | Path to Read 2 FASTQ file (gzipped); required only for paired-end data, leave empty for single-end data |
antibody | Name of the ChIP antibody used |
control | Sample name of the corresponding input control; leave empty for the control samples themselves |
Example sample sheet:
sample,fastq_1,fastq_2,antibody,control
CONTROL_1,control_1.fastq.gz,,input,
CONTROL_2,control_2.fastq.gz,,input,
TREATMENT_1,treatment_1_R1.fastq.gz,treatment_1_R2.fastq.gz,h3k27ac,CONTROL_1
TREATMENT_2,treatment_2_R1.fastq.gz,treatment_2_R2.fastq.gz,h3k27ac,CONTROL_2
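A quick pre-flight check can catch a control name that does not match any sample before the pipeline starts. The following is a minimal sketch and not part of the pipeline: `check_samplesheet` is a hypothetical helper, and it assumes a plain CSV with the five columns above and no quoted fields or embedded commas.

```shell
# check_samplesheet FILE
# Hypothetical helper: verifies the header and that every non-empty
# "control" value names a sample that exists in the same sheet.
# Assumes simple CSV (no quoted fields, no commas inside values).
check_samplesheet() {
    awk -F',' '
        BEGIN { bad = 0 }
        { sub(/\r$/, "") }                 # tolerate Windows line endings
        NR == 1 {
            if ($0 != "sample,fastq_1,fastq_2,antibody,control") {
                print "unexpected header: " $0
                bad = 1
            }
            next
        }
        $0 == "" { next }                  # skip blank lines
        {
            seen[$1] = 1                   # every sample name in the sheet
            if ($5 != "") ctrl[NR] = $5    # control referenced on this line
        }
        END {
            for (n in ctrl)
                if (!(ctrl[n] in seen)) {
                    print "line " n ": control " ctrl[n] " is not a sample in this sheet"
                    bad = 1
                }
            exit bad                       # 0 when the sheet passes, 1 otherwise
        }
    ' "$1"
}
```

Calling `check_samplesheet samplesheet.csv` prints one message per problem and returns a non-zero status if any were found, so it can gate the `nextflow run` command in a wrapper script.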
Key Parameters
Parameter | Description | Default |
---|---|---|
--input | Path to input sample sheet CSV file | Required |
--outdir | Output directory for results | Required |
--genome | Reference genome (e.g., GRCh37, GRCh38) | Required |
--save_reference | Save generated reference genome indices | false |
-profile | Configuration profile (docker/singularity) | Required |
Running the Pipeline
Basic execution command:
nextflow run nf-core/chipseq \
--input samplesheet.csv \
--outdir results \
--genome GRCh37 \
-profile docker
Resume an interrupted run:
nextflow run nf-core/chipseq \
--input samplesheet.csv \
--outdir results \
--genome GRCh37 \
-profile docker \
-resume
Output Files
The pipeline generates the following key outputs:
Quality Control
- FastQC reports for raw and processed reads
- MultiQC report summarizing all QC metrics
Alignments
- BAM files with aligned reads
- BAM index files (.bai)
- Alignment statistics
Peak Files
- Narrowpeak/broadpeak files from MACS2
- Summit files indicating peak centers
- Peak annotation files from HOMER
Coverage Tracks
- BigWig files for genome browser visualization
- Normalized coverage tracks
Differential Binding
- Differential peak analysis results
- Plots and statistics for differential binding
Tips and Best Practices
Experimental Design
- Control samples are critical: Always include appropriate input or IgG controls
- Biological replicates: Include at least 2-3 biological replicates per condition
- Sequencing depth: Aim for 20-50 million reads for transcription factors, 50-100 million for histone marks
- Antibody validation: Ensure antibody specificity through western blot or ChIP-qPCR
Sample Preparation
- Cross-linking optimization: Use 1% formaldehyde for most proteins, consider dual cross-linking for larger complexes
- Fragmentation consistency: Aim for 200-300bp fragments for transcription factors, 200-500bp for histone marks
- Library complexity: Check library complexity metrics to ensure sufficient unique molecules
Parameter Optimization
- Peak calling mode: Use narrow mode for transcription factors and sharp histone marks (H3K4me3, H3K27ac)
- Broad peak calling: Use broad mode for spreading marks like H3K36me3, H3K27me3
- Effective genome size: Use appropriate size for your organism (e.g., 2.7e9 for human, 1.87e9 for mouse)
- Duplicate removal: Keep duplicates for low complexity samples or single-end data
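The mode and genome-size guidance above can be captured in a small lookup. This is a sketch under assumptions: `peak_flags` is a hypothetical helper, and `--narrow_peak` and `--macs_gsize` are parameter names taken from upstream nf-core/chipseq, where broad peak calling is the default and therefore needs no extra flag. Confirm both against the parameter list of the deployed version before relying on it.

```shell
# peak_flags TARGET ORGANISM
# Hypothetical helper: prints the extra flags suggested by the guidance
# above. TARGET is "tf" or one of the listed histone marks; ORGANISM is
# "human" or "mouse". The flag names are assumed from upstream
# nf-core/chipseq and should be verified for your version.
peak_flags() {
    target=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$2" in
        human) gsize=2.7e9 ;;
        mouse) gsize=1.87e9 ;;
        *) echo "unknown organism: $2" >&2; return 1 ;;
    esac
    case "$target" in
        tf|h3k4me3|h3k27ac) mode_flag="--narrow_peak " ;;  # sharp signal
        h3k36me3|h3k27me3)  mode_flag="" ;;                # broad is the assumed default
        *) echo "unknown target: $1 (choose the mode manually)" >&2; return 1 ;;
    esac
    printf '%s--macs_gsize %s\n' "$mode_flag" "$gsize"
}
```

For example, `peak_flags H3K27ac human` prints `--narrow_peak --macs_gsize 2.7e9`, which can be appended to the basic execution command.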
Quality Assessment
- FRiP score: Fraction of Reads in Peaks should be >5% for acceptable quality
- NSC and RSC: Check strand cross-correlation metrics for library quality
- Peak reproducibility: Use IDR (Irreproducible Discovery Rate) for replicate concordance
- Signal-to-noise: Review fingerprint plots in MultiQC report
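To make the FRiP threshold concrete, the score is simply the number of reads overlapping peaks divided by the total number of mapped reads. The sketch below only does that arithmetic; `frip` is a hypothetical helper, and both counts are assumed to have been obtained separately (for example from the filtered BAM and the peak file).

```shell
# frip READS_IN_PEAKS TOTAL_MAPPED_READS
# Hypothetical helper: prints the Fraction of Reads in Peaks to four
# decimal places. How the two counts are produced is outside this sketch.
frip() {
    awk -v inpeaks="$1" -v total="$2" 'BEGIN {
        if (total <= 0) {
            print "total mapped reads must be positive" > "/dev/stderr"
            exit 1
        }
        printf "%.4f\n", inpeaks / total
    }'
}
```

With 1,200,000 reads in peaks out of 20,000,000 mapped reads, `frip 1200000 20000000` prints `0.0600`, which is just above the 5% guideline.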
Troubleshooting
Common Issues
Issue: Low number of peaks detected
- Solution: Check antibody quality and ChIP efficiency
- Relax the MACS2 q-value threshold (the MACS2 default is 0.05; a higher cutoff such as 0.1 admits more peaks)
- Verify sufficient sequencing depth for your target
- Check fragment size distribution is appropriate
Issue: High background or noisy data
- Solution: Increase stringency in peak calling (lower q-value)
- Check for PCR duplicate rates - high rates indicate low complexity
- Verify proper input/IgG control subtraction
- Consider additional filtering based on fold enrichment
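One way to apply the fold-enrichment filter is to threshold the seventh column of a MACS2 narrowPeak file, which holds the fold enrichment at the peak summit. This is a minimal sketch: `filter_by_fold` is a hypothetical helper, and the cutoff of 5 in the usage note is an illustrative value, not a recommendation.

```shell
# filter_by_fold MIN_FOLD < in.narrowPeak > out.narrowPeak
# Hypothetical helper: keeps peaks whose fold enrichment (column 7 of the
# tab-separated narrowPeak format) is at least MIN_FOLD.
filter_by_fold() {
    awk -F'\t' -v min="$1" '$7 + 0 >= min + 0'
}
```

For example, `filter_by_fold 5 < sample_peaks.narrowPeak > sample_peaks.fold5.narrowPeak` writes only peaks with at least 5-fold enrichment; the file names here are placeholders.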
Issue: Peaks not overlapping between replicates
- Solution: Check if replicates cluster together in PCA/correlation analysis
- Use IDR to identify reproducible peaks
- Consider batch effects in sample processing
- Verify consistent antibody lot numbers
Issue: Memory errors during alignment
- Solution: Reduce number of parallel alignment jobs
- Use --max_memory and --max_cpus to limit resources
- Process samples in smaller batches
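A resource-capped rerun might look like the sketch below. `run_capped` is a hypothetical wrapper around the basic execution command; the sample sheet, output directory, genome, and profile are copied from the examples in this document, and the memory value is assumed to use the nf-core style such as `16.GB`. Setting `DRY_RUN=1` prints the command instead of running it.

```shell
# run_capped MAX_MEMORY MAX_CPUS
# Hypothetical wrapper: reruns the pipeline with resource caps and -resume.
# With DRY_RUN=1 the command is printed for review rather than executed.
run_capped() {
    mem="$1"
    cpus="$2"
    # Rebuild the positional parameters as the full command line.
    set -- nextflow run nf-core/chipseq \
        --input samplesheet.csv \
        --outdir results \
        --genome GRCh37 \
        -profile docker \
        --max_memory "$mem" \
        --max_cpus "$cpus" \
        -resume
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

Running `DRY_RUN=1 run_capped 16.GB 4` shows the exact command first; dropping `DRY_RUN=1` executes it. The values 16.GB and 4 are placeholders to adapt to the machine in use.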
Issue: No differential binding detected
- Solution: Check if samples separate by condition in PCA plots
- Verify sufficient peak overlap between samples
- Consider using a more lenient FDR threshold
- Ensure peak calling was performed consistently
Issue: Adapter contamination in results
- Solution: Check FastQC reports for adapter content
- Adjust Trim Galore parameters for more aggressive trimming
- Verify correct adapter sequences are specified
Additional Resources
- Full documentation: nf-core/chipseq documentation
- Pipeline source code: GitHub repository
- Support: Join the #chipseq channel on nf-core Slack
- Citation: 10.5281/zenodo.3240506