Demultiplexing Pipeline

Overview

The demultiplexing pipeline is powered by Ultraplex v1.2.5, an all-in-one software package for processing and demultiplexing FASTQ files. It separates multiplexed sequencing data into one file per sample based on barcode sequences. It is optimized for CLIP-seq experiments but works with any barcoded sequencing data.

The pipeline is listed under "Utilities" on the "Run analysis" page in Flow.

Pipeline Summary

The demultiplexing workflow performs the following steps:

1. Format Conversion - Converts Excel samplesheets to CSV format if needed
2. Barcode Processing - Transforms samplesheet information into Ultraplex-compatible barcode format
3. Quality Trimming - Removes poor quality bases from reads
4. Adapter Removal - Trims sequencing adapters from reads
5. Barcode Detection - Identifies 5' and/or 3' barcodes in each read
6. Sample Assignment - Assigns reads to samples based on detected barcodes
7. UMI Handling - Moves unique molecular identifiers to read headers
8. Output Generation - Creates individual FASTQ files for each sample
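
To make the barcode detection, sample assignment, and UMI handling steps concrete, here is a minimal Python sketch. It is not Ultraplex's implementation: it handles 5' barcodes only, requires exact matches at fixed positions, and the function names (`match_5prime_barcode`, `assign_read`) are hypothetical. The exact header layout that Ultraplex writes may differ; the sketch simply appends "rbc:" plus the UMI to the read name.

```python
def match_5prime_barcode(read_seq, barcode):
    """Return the UMI if read_seq starts with barcode (N = any base), else None."""
    if len(read_seq) < len(barcode):
        return None
    umi = []
    for base, code in zip(read_seq, barcode):
        if code == "N":
            umi.append(base)  # variable position: collect the base as UMI
        elif base != code:
            return None  # mismatch at a fixed barcode position
    return "".join(umi)


def assign_read(header, seq, qual, barcodes):
    """Assign one read to a sample.

    barcodes maps sample id -> 5' barcode pattern.
    Returns (sample_id, header, seq, qual); unmatched reads go to "no_match".
    """
    for sample_id, barcode in barcodes.items():
        umi = match_5prime_barcode(seq, barcode)
        if umi is not None:
            n = len(barcode)
            # Strip the barcode from the read and record the UMI in the header
            # (assumed layout: read name followed by "rbc:" and the UMI).
            return sample_id, f"{header}rbc:{umi}", seq[n:], qual[n:]
    return "no_match", header, seq, qual


barcodes = {"Control_1": "NNNGGCGNN", "Control_2": "NNNTTGTNN"}
print(assign_read("@r1", "ACGGGCGTTAAAA", "I" * 13, barcodes))
```

In this toy read, the fixed bases GGCG identify Control_1, and the five bases under the N positions become the UMI.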

Input Requirements

Sample Sheet Format

The pipeline requires a sample sheet in CSV or XLSX format with the following columns:

Column	Description	Required	Example
`id`	Unique sample identifier	Yes	Sample_1
`5prime_barcode`	5' barcode sequence (N for variable positions)	Yes	NNNGGCGNN
`3prime_barcode`	3' barcode sequence	No	AACTTGAA
`adapter_sequence`	3' adapter sequence to trim	No	AGATCGGAAGAGCGGTTCAG
`adapter_name`	Name/identifier for the adapter	No	TruSeq

Example sample sheet:

id,5prime_barcode,3prime_barcode,adapter_sequence,adapter_name
Control_1,NNNGGCGNN,,AGATCGGAAGAGCGGTTCAG,TruSeq
Control_2,NNNTTGTNN,,AGATCGGAAGAGCGGTTCAG,TruSeq
Treatment_1,NNNCCGANN,AACTTGAA,AGATCGGAAGAGCGGTTCAG,TruSeq
Treatment_2,NNNAAGCNN,TTGAACTT,AGATCGGAAGAGCGGTTCAG,TruSeq
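
Before uploading, it can help to check a CSV sample sheet locally. The following Python sketch is one way to do that. It is not part of the pipeline, `validate_samplesheet` is a hypothetical helper, and the rule that sequences contain only A, C, G, T, and N is an assumption based on the examples above.

```python
import csv
import io

REQUIRED_COLUMNS = ("id", "5prime_barcode")
SEQUENCE_COLUMNS = ("5prime_barcode", "3prime_barcode", "adapter_sequence")


def validate_samplesheet(handle):
    """Return a list of human-readable problems found in a CSV sample sheet."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        return ["missing required column(s): " + ", ".join(missing)]

    problems = []
    seen_ids = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        sample_id = (row["id"] or "").strip()
        if not sample_id:
            problems.append(f"line {line_no}: empty id")
        elif sample_id in seen_ids:
            problems.append(f"line {line_no}: duplicate id {sample_id!r}")
        seen_ids.add(sample_id)

        for col in SEQUENCE_COLUMNS:
            value = (row.get(col) or "").strip().upper()
            if not value:
                if col in REQUIRED_COLUMNS:
                    problems.append(f"line {line_no}: {col} is empty")
            elif set(value) - set("ACGTN"):  # assumed alphabet
                problems.append(f"line {line_no}: {col} has non-ACGTN characters")
    return problems


example = io.StringIO(
    "id,5prime_barcode,3prime_barcode,adapter_sequence,adapter_name\n"
    "Control_1,NNNGGCGNN,,AGATCGGAAGAGCGGTTCAG,TruSeq\n"
    "Treatment_1,NNNCCGANN,AACTTGAA,AGATCGGAAGAGCGGTTCAG,TruSeq\n"
)
print(validate_samplesheet(example))
```

An empty list means no problems were found. For an XLSX sheet, export it to CSV first.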

FASTQ Input Files

- Single-end or paired-end FASTQ files (gzipped or uncompressed)
- Multiplexed data containing reads from multiple samples
- Any sequencing platform (Illumina, MGI, etc.)

Key Parameters

Required Parameters

Parameter	Description	Default
`--fastq`	Path to multiplexed FASTQ file(s)	Required
`--samplesheet`	Path to sample sheet (CSV/XLSX)	Required
`--outdir`	Output directory for results	Required

Optional Parameters

Parameter	Description	Default
`--threads`	Number of CPU threads to use	All available
`--dont_build_reference`	Skip reference building (for barcodes >8nt)	false
`--quality_threshold`	Minimum base quality score	20

Pipeline Outputs

Directory Structure

results/
├── demultiplexed/
│   ├── Sample_1.fastq.gz
│   ├── Sample_2.fastq.gz
│   └── ...
├── no_match/
│   └── no_match.fastq.gz
├── logs/
│   └── ultraplex.log
└── pipeline_info/
    └── versions.yml

Output Files

Demultiplexed FASTQ Files
- One file per sample: [sample_id].fastq.gz
- Contains all reads assigned to that sample
- UMIs moved to read headers with "rbc:" prefix

Unassigned Reads
- no_match.fastq.gz: Reads that couldn't be assigned to any barcode
- Useful for troubleshooting barcode issues

Demultiplexing Statistics
- ultraplex.log: Detailed statistics including read counts per sample
- Assignment rates and quality metrics

Software Versions
- versions.yml: Records all software versions used
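
Downstream tools often need the UMI back out of the read header. The Python sketch below shows one way to recover it, assuming the UMI appears in the header as "rbc:" followed by a run of A, C, G, T, or N. That layout is inferred from the description above, and `extract_umi` and `umi_counts` are hypothetical helper names.

```python
import re
from collections import Counter

# Assumed header layout: "...rbc:<UMI bases>"
UMI_PATTERN = re.compile(r"rbc:([ACGTN]+)")


def extract_umi(header):
    """Return the UMI found after "rbc:" in a FASTQ header, or None."""
    match = UMI_PATTERN.search(header)
    return match.group(1) if match else None


def umi_counts(headers):
    """Count how often each UMI occurs across an iterable of headers."""
    return Counter(umi for umi in map(extract_umi, headers) if umi is not None)


print(extract_umi("@read1rbc:ACGTT"))
print(umi_counts(["@read1rbc:ACGTT", "@read2rbc:ACGTT", "@read3"]))
```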

Example Usage

Basic Demultiplexing

nextflow run demultiplex \
  --fastq multiplexed_data.fastq.gz \
  --samplesheet samples.csv \
  --outdir results

Paired-end Demultiplexing

nextflow run demultiplex \
  --fastq "multiplexed_*_{1,2}.fastq.gz" \
  --samplesheet samples.xlsx \
  --outdir results

Multi-threaded Demultiplexing

nextflow run demultiplex \
  --fastq lane1.fastq.gz \
  --samplesheet barcodes.csv \
  --threads 16 \
  --outdir results

Tips and Best Practices

Barcode Design

- Keep barcodes the same length for optimal performance
- Use 'N' positions for UMIs or variable sequences
- Ensure barcodes have sufficient edit distance (≥2) between samples
- For barcodes >8nt, use the --dont_build_reference flag
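
The distance guideline above can be checked before sequencing. The Python sketch below uses Hamming distance over fixed (non-N) positions as a simple stand-in for edit distance, so it counts substitutions only and assumes equal-length barcodes. The helper names `fixed_distance` and `close_pairs` are hypothetical.

```python
from itertools import combinations


def fixed_distance(a, b):
    """Count positions where both barcodes have a fixed base and the bases differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must be the same length")
    return sum(1 for x, y in zip(a, b) if x != "N" and y != "N" and x != y)


def close_pairs(barcodes, min_distance=2):
    """Return (sample_a, sample_b, distance) for pairs closer than min_distance."""
    pairs = []
    for (s1, b1), (s2, b2) in combinations(barcodes.items(), 2):
        distance = fixed_distance(b1, b2)
        if distance < min_distance:
            pairs.append((s1, s2, distance))
    return pairs


barcodes = {
    "Control_1": "NNNGGCGNN",
    "Control_2": "NNNTTGTNN",
    "Treatment_1": "NNNCCGANN",
    "Treatment_2": "NNNAAGCNN",
}
print(close_pairs(barcodes))
```

An empty result means every pair differs in at least `min_distance` fixed positions.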

Performance Optimization

- Ultraplex can process ~400 million reads in about 20 minutes; actual speed depends on thread count and disk I/O
- Use multiple threads for faster processing
- Ensure sufficient disk space for output files

Quality Control

Check the no_match.fastq.gz file. A high unassigned rate may indicate:
- Incorrect barcode sequences in sample sheet
- Poor sequencing quality
- Sample contamination
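
For a quick number, the unassigned rate can be computed from the output files directly. The Python sketch below assumes the directory layout shown under "Directory Structure" and standard four-line FASTQ records; `count_reads` and `unassigned_rate` are hypothetical helpers, and the read counts in ultraplex.log remain the authoritative source.

```python
import gzip
from pathlib import Path


def count_reads(path):
    """Count reads in a FASTQ file, assuming four lines per record."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


def unassigned_rate(results_dir):
    """Return the fraction of reads that ended up in no_match.fastq.gz."""
    results = Path(results_dir)
    assigned = sum(
        count_reads(path) for path in (results / "demultiplexed").glob("*.fastq.gz")
    )
    unassigned = count_reads(results / "no_match" / "no_match.fastq.gz")
    total = assigned + unassigned
    return unassigned / total if total else 0.0
```

Calling `unassigned_rate("results")` on a finished run returns a value between 0 and 1.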

On Flow Platform

- Samples are automatically created during demultiplexing
- Fill out the annotation spreadsheet completely for accurate metadata
- Use descriptive sample IDs as they become part of the filename

Troubleshooting

Common Issues

Issue: High percentage of unassigned reads

Solution:
- Verify barcode sequences match library preparation
- Check for barcode mutations or sequencing errors
- Consider allowing mismatches in barcode matching

Issue: Pipeline fails with "ambiguous barcodes" error

Solution:
- Ensure all barcodes are unique
- Check that N positions are consistent across barcodes
- Verify no barcode is a subsequence of another
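
These checks can be run on the barcode set before starting the pipeline. The Python sketch below is only an approximation of what might trigger the error, not Ultraplex's own test: it flags differing N positions and any pair of barcodes that could match the same read start when N is treated as a wildcard, which covers duplicates and prefix overlaps. The names `n_mask`, `compatible`, and `find_ambiguities` are hypothetical.

```python
from itertools import combinations


def n_mask(barcode):
    """Return the positions of N in a barcode."""
    return tuple(i for i, base in enumerate(barcode) if base == "N")


def compatible(a, b):
    """True if some read start could match both patterns (N = wildcard).

    zip stops at the shorter barcode, so prefix overlaps are flagged too.
    """
    return all(x == y or "N" in (x, y) for x, y in zip(a, b))


def find_ambiguities(barcodes):
    """Return a list of problems for a mapping of sample id -> barcode."""
    issues = []
    if len({n_mask(barcode) for barcode in barcodes.values()}) > 1:
        issues.append("N positions differ between barcodes")
    for (s1, b1), (s2, b2) in combinations(barcodes.items(), 2):
        if compatible(b1, b2):
            issues.append(f"{s1} and {s2} can match the same read")
    return issues


print(find_ambiguities({"a": "NNNGGCGNN", "b": "NNNGGCGNN"}))
```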

Issue: Out of memory errors

Solution:
- Reduce number of threads
- Process files in smaller batches
- Increase available memory allocation

Issue: Slow performance on large files

Solution:
- Use --dont_build_reference for long barcodes
- Ensure fast disk I/O for input/output files
- Use compressed (gzipped) input files

Additional Resources

Ultraplex GitHub: github.com/ulelab/ultraplex
Ultraplex Documentation: Ultraplex User Guide
Support: Contact Flow support for pipeline-specific issues
Best Practices: Multiplexed Sequencing Guidelines