Demultiplexing Pipeline
Overview
The demultiplexing pipeline is powered by Ultraplex v1.2.5, an all-in-one software package for processing and demultiplexing FASTQ files. This pipeline is designed to separate multiplexed sequencing data into individual sample files based on barcode sequences. It's particularly optimized for CLIP-seq experiments but can be used for any barcoded sequencing data.
The pipeline can be found under "Utilities" in the "Run analysis" page on Flow.
Pipeline Summary
The demultiplexing workflow performs the following steps:
- Format Conversion - Converts Excel samplesheets to CSV format if needed
- Barcode Processing - Transforms samplesheet information into Ultraplex-compatible barcode format
- Quality Trimming - Removes poor quality bases from reads
- Adapter Removal - Trims sequencing adapters from reads
- Barcode Detection - Identifies 5' and/or 3' barcodes in each read
- Sample Assignment - Assigns reads to samples based on detected barcodes
- UMI Handling - Moves unique molecular identifiers to read headers
- Output Generation - Creates individual FASTQ files for each sample
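The barcode detection, sample assignment, and UMI handling steps above can be pictured with a small sketch. This is not Ultraplex's implementation: it assumes exact matching on the fixed barcode positions and ignores 3' barcodes, quality trimming, and adapter removal. The helper names `match_5prime` and `assign_read` are hypothetical, and the exact position of the `rbc:` tag in the header is an assumption.

```python
def match_5prime(read_seq, barcode):
    """Return (umi, trimmed_seq) if the read starts with the barcode, else None.

    'N' positions in the barcode accept any base and are collected as the UMI.
    """
    if len(read_seq) < len(barcode):
        return None
    umi = []
    for base, expected in zip(read_seq, barcode):
        if expected == "N":
            umi.append(base)
        elif base != expected:
            return None  # a fixed position does not match
    return "".join(umi), read_seq[len(barcode):]


def assign_read(header, seq, qual, barcodes):
    """Assign one FASTQ record to a sample.

    barcodes maps sample id -> 5' barcode. Returns
    (sample_id, header, seq, qual); unmatched reads go to "no_match" unchanged.
    """
    for sample_id, barcode in barcodes.items():
        hit = match_5prime(seq, barcode)
        if hit is not None:
            umi, trimmed = hit
            name = header.split()[0]
            # Assumed header layout: UMI appended to the read name after "rbc:".
            return sample_id, f"{name}rbc:{umi}", trimmed, qual[len(barcode):]
    return "no_match", header, seq, qual


barcodes = {"Control_1": "NNNGGCGNN", "Control_2": "NNNTTGTNN"}
record = assign_read("@read1 1:N:0", "ACGGGCGTTAAAACCCC", "I" * 17, barcodes)
```

The barcode and its UMI are removed from the sequence and quality strings together, so both stay the same length after trimming.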
Input Requirements
Sample Sheet Format
The pipeline requires a sample sheet in CSV or XLSX format with the following columns:
Column | Description | Required | Example |
---|---|---|---|
id | Unique sample identifier | Yes | Sample_1 |
5prime_barcode | 5' barcode sequence (N for variable positions) | Yes | NNNGGCGNN |
3prime_barcode | 3' barcode sequence | No | AACTTGAA |
adapter_sequence | 3' adapter sequence to trim | No | AGATCGGAAGAGCGGTTCAG |
adapter_name | Name/identifier for the adapter | No | TruSeq |
Example sample sheet:
id,5prime_barcode,3prime_barcode,adapter_sequence,adapter_name
Control_1,NNNGGCGNN,,AGATCGGAAGAGCGGTTCAG,TruSeq
Control_2,NNNTTGTNN,,AGATCGGAAGAGCGGTTCAG,TruSeq
Treatment_1,NNNCCGANN,AACTTGAA,AGATCGGAAGAGCGGTTCAG,TruSeq
Treatment_2,NNNAAGCNN,TTGAACTT,AGATCGGAAGAGCGGTTCAG,TruSeq
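Before submitting a sample sheet, it can help to check it against the column table above. The following is a minimal sketch, not part of the pipeline: the function name `validate_samplesheet` is hypothetical, and the rule that barcodes contain only A, C, G, T, and N is an assumption. It handles CSV text only, so an XLSX sheet would need exporting first.

```python
import csv
import io

REQUIRED_COLUMNS = ("id", "5prime_barcode")
BARCODE_COLUMNS = ("5prime_barcode", "3prime_barcode")
ALLOWED_BASES = set("ACGTN")  # assumption: no other IUPAC codes are used


def validate_samplesheet(text):
    """Return a list of problems found in CSV sample sheet text (empty if none)."""
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    for column in REQUIRED_COLUMNS:
        if column not in header:
            problems.append(f"missing required column: {column}")
    if problems:
        return problems

    seen_ids = set()
    # Row 1 is the header, so data rows are numbered from 2.
    for row_number, row in enumerate(reader, start=2):
        sample_id = (row.get("id") or "").strip()
        if not sample_id:
            problems.append(f"row {row_number}: empty id")
        elif sample_id in seen_ids:
            problems.append(f"row {row_number}: duplicate id {sample_id}")
        seen_ids.add(sample_id)

        for column in BARCODE_COLUMNS:
            value = (row.get(column) or "").strip().upper()
            if not value:
                if column in REQUIRED_COLUMNS:
                    problems.append(f"row {row_number}: {column} is empty")
            elif set(value) - ALLOWED_BASES:
                problems.append(f"row {row_number}: {column} has non-ACGTN characters")
    return problems
```

An empty list means the sheet passed these basic checks; it does not guarantee that the barcodes match the library preparation.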
FASTQ Input Files
- Single-end or paired-end FASTQ files (gzipped or uncompressed)
- Multiplexed data containing reads from multiple samples
- Any sequencing platform (Illumina, MGI, etc.)
Key Parameters
Required Parameters
Parameter | Description | Default |
---|---|---|
--fastq | Path to multiplexed FASTQ file(s) | Required |
--samplesheet | Path to sample sheet (CSV/XLSX) | Required |
--outdir | Output directory for results | Required |
Optional Parameters
Parameter | Description | Default |
---|---|---|
--threads | Number of CPU threads to use | All available |
--dont_build_reference | Skip reference building (for barcodes >8nt) | false |
--quality_threshold | Minimum base quality score | 20 |
Pipeline Outputs
Directory Structure
results/
├── demultiplexed/
│   ├── Sample_1.fastq.gz
│   ├── Sample_2.fastq.gz
│   └── ...
├── no_match/
│   └── no_match.fastq.gz
├── logs/
│   └── ultraplex.log
└── pipeline_info/
    └── versions.yml
Output Files
Demultiplexed FASTQ Files
- One file per sample: [sample_id].fastq.gz
- Contains all reads assigned to that sample
- UMIs moved to read headers with "rbc:" prefix
Unassigned Reads
- no_match.fastq.gz: Reads that couldn't be assigned to any barcode
- Useful for troubleshooting barcode issues
Demultiplexing Statistics
- ultraplex.log: Detailed statistics including read counts per sample
- Assignment rates and quality metrics
Software Versions
- versions.yml: Records all software versions used
Example Usage
Basic Demultiplexing
nextflow run demultiplex \
--fastq multiplexed_data.fastq.gz \
--samplesheet samples.csv \
--outdir results
Paired-end Demultiplexing
nextflow run demultiplex \
--fastq "multiplexed_*_{1,2}.fastq.gz" \
--samplesheet samples.xlsx \
--outdir results
Performance Optimization
nextflow run demultiplex \
--fastq lane1.fastq.gz \
--samplesheet barcodes.csv \
--threads 16 \
--outdir results
Tips and Best Practices
Barcode Design
- Keep barcodes the same length for optimal performance
- Use 'N' positions for UMIs or variable sequences
- Ensure barcodes have sufficient edit distance (≥2) between samples
- For barcodes >8nt, use the --dont_build_reference flag
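The distance guideline above can be checked before sequencing. The sketch below is a simplification under stated assumptions: it requires equal-length barcodes and counts mismatches (Hamming distance) only at positions where neither barcode has an N, which for equal-length barcodes agrees with edit distance at the threshold of 2. The function names are hypothetical and not part of Ultraplex.

```python
from itertools import combinations


def fixed_distance(a, b):
    """Count mismatches between two equal-length barcodes, skipping N positions."""
    if len(a) != len(b):
        raise ValueError("barcodes must have the same length")
    return sum(1 for x, y in zip(a, b) if "N" not in (x, y) and x != y)


def close_pairs(barcodes, minimum=2):
    """Return (barcode_a, barcode_b, distance) for every pair below the minimum."""
    flagged = []
    for a, b in combinations(barcodes, 2):
        distance = fixed_distance(a, b)
        if distance < minimum:
            flagged.append((a, b, distance))
    return flagged


# Barcodes from the example sample sheet: no pair is flagged.
example = ["NNNGGCGNN", "NNNTTGTNN", "NNNCCGANN", "NNNAAGCNN"]
flagged = close_pairs(example)
```

Any pair returned by `close_pairs` differs at fewer than `minimum` fixed positions, so a single sequencing error could turn one barcode into the other.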
Performance Optimization
- Ultraplex can process ~400 million reads in 20 minutes
- Use multiple threads for faster processing
- Ensure sufficient disk space for output files
Quality Control
- Check the no_match.fastq.gz file. A high unassigned rate may indicate:
  - Incorrect barcode sequences in the sample sheet
  - Poor sequencing quality
  - Sample contamination
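The unassigned rate can also be computed directly from the output files. This sketch assumes the directory layout shown under Directory Structure and standard four-line FASTQ records; the function names are hypothetical, and the ultraplex.log statistics remain the authoritative source.

```python
import gzip
from pathlib import Path


def count_reads(path):
    """Count records in a FASTQ file, assuming four lines per record."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


def unassigned_fraction(results_dir):
    """Return the fraction of reads that ended up in no_match.fastq.gz."""
    results = Path(results_dir)
    assigned = sum(
        count_reads(path)
        for path in sorted((results / "demultiplexed").glob("*.fastq.gz"))
    )
    no_match = results / "no_match" / "no_match.fastq.gz"
    unassigned = count_reads(no_match) if no_match.exists() else 0
    total = assigned + unassigned
    return unassigned / total if total else 0.0
```

Calling `unassigned_fraction("results")` after a run gives a single number to track between runs; what counts as "high" depends on the library and is not defined here.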
On Flow Platform
- Samples are automatically created during demultiplexing
- Fill out the annotation spreadsheet completely for accurate metadata
- Use descriptive sample IDs as they become part of the filename
Troubleshooting
Common Issues
Issue: High percentage of unassigned reads
- Solution: Verify barcode sequences match library preparation
- Check for barcode mutations or sequencing errors
- Consider allowing mismatches in barcode matching
Issue: Pipeline fails with "ambiguous barcodes" error
- Solution: Ensure all barcodes are unique
- Check that N positions are consistent across barcodes
- Verify no barcode is a subsequence of another
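The first two checks above are easy to automate. The sketch below is illustrative only: whether these exact conditions trigger the "ambiguous barcodes" error is an assumption, and `n_mask` and `barcode_conflicts` are hypothetical helper names.

```python
from collections import Counter


def n_mask(barcode):
    """Reduce a barcode to its N layout, e.g. NNNGGCGNN -> NNN----NN."""
    return "".join("N" if base == "N" else "-" for base in barcode)


def barcode_conflicts(barcodes):
    """Report duplicate barcodes and inconsistent N positions.

    barcodes maps sample id -> 5' barcode. Returns a list of messages.
    """
    problems = []
    counts = Counter(barcodes.values())
    for barcode, uses in counts.items():
        if uses > 1:
            problems.append(f"{barcode} is used by {uses} samples")
    masks = sorted({n_mask(barcode) for barcode in barcodes.values()})
    if len(masks) > 1:
        problems.append("N positions differ between barcodes: " + ", ".join(masks))
    return problems
```

An empty result means every barcode is unique and all barcodes place their N positions identically.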
Issue: Out of memory errors
- Solution: Reduce number of threads
- Process files in smaller batches
- Increase available memory allocation
Issue: Slow performance on large files
- Solution: Use --dont_build_reference for long barcodes
- Ensure fast disk I/O for input/output files
- Use compressed (gzipped) input files
Additional Resources
- Ultraplex GitHub: github.com/ulelab/ultraplex
- Ultraplex Documentation: Ultraplex User Guide
- Support: Contact Flow support for pipeline-specific issues
- Best Practices: Multiplexed Sequencing Guidelines