Demultiplexing Pipeline

Overview

The demultiplexing pipeline is powered by Ultraplex v1.2.5, an all-in-one software package for processing and demultiplexing FASTQ files. It separates multiplexed sequencing data into one file per sample based on barcode sequences. It is optimized for CLIP-seq experiments but works with any barcoded sequencing data.

The pipeline can be found under "Utilities" in the "Run analysis" page on Flow.

Pipeline Summary

The demultiplexing workflow performs the following steps:

  1. Format Conversion - Converts Excel samplesheets to CSV format if needed
  2. Barcode Processing - Transforms samplesheet information into Ultraplex-compatible barcode format
  3. Quality Trimming - Removes poor quality bases from reads
  4. Adapter Removal - Trims sequencing adapters from reads
  5. Barcode Detection - Identifies 5' and/or 3' barcodes in each read
  6. Sample Assignment - Assigns reads to samples based on detected barcodes
  7. UMI Handling - Moves unique molecular identifiers to read headers
  8. Output Generation - Creates individual FASTQ files for each sample
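
Steps 5 to 7 can be illustrated with a minimal Python sketch. It is not Ultraplex's implementation: it assumes exact matching against 5' barcodes only, treats every N position as part of the UMI, and the function names (`match_5prime`, `assign_read`) and the exact header layout are hypothetical choices made for illustration.

```python
def match_5prime(read_seq, barcode):
    """Return the UMI (read bases under N positions) if the read starts
    with the barcode, otherwise None. Exact matching only (assumption)."""
    if len(read_seq) < len(barcode):
        return None
    umi = []
    for read_base, barcode_base in zip(read_seq, barcode):
        if barcode_base == "N":
            umi.append(read_base)      # variable position: collect as UMI
        elif read_base != barcode_base:
            return None                # fixed position does not match
    return "".join(umi)


def assign_read(header, seq, qual, barcodes):
    """Assign one read to a sample.

    barcodes maps sample id -> 5' barcode. Returns
    (sample_id, new_header, trimmed_seq, trimmed_qual), or None when no
    barcode matches (the read would go to no_match).
    """
    for sample_id, barcode in barcodes.items():
        umi = match_5prime(seq, barcode)
        if umi is not None:
            n = len(barcode)
            # Header layout is an assumption; only the "rbc:" prefix
            # comes from the pipeline documentation.
            return sample_id, f"{header} rbc:{umi}", seq[n:], qual[n:]
    return None


barcodes = {"Control_1": "NNNGGCGNN", "Control_2": "NNNTTGTNN"}
print(assign_read("@read1", "ACGGGCGTTAAAA", "IIIIIIIIIIIII", barcodes))
```

The barcode is removed from both the sequence and the quality string so the two stay the same length.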

Input Requirements

Sample Sheet Format

The pipeline requires a sample sheet in CSV or XLSX format with the following columns:

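| Column           | Description                                    | Required | Example              |
|------------------|------------------------------------------------|----------|----------------------|
| id               | Unique sample identifier                       | Yes      | Sample_1             |
| 5prime_barcode   | 5' barcode sequence (N for variable positions) | Yes      | NNNGGCGNN            |
| 3prime_barcode   | 3' barcode sequence                            | No       | AACTTGAA             |
| adapter_sequence | 3' adapter sequence to trim                    | No       | AGATCGGAAGAGCGGTTCAG |
| adapter_name     | Name/identifier for the adapter                | No       | TruSeq               |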

Example sample sheet:

id,5prime_barcode,3prime_barcode,adapter_sequence,adapter_name
Control_1,NNNGGCGNN,,AGATCGGAAGAGCGGTTCAG,TruSeq
Control_2,NNNTTGTNN,,AGATCGGAAGAGCGGTTCAG,TruSeq
Treatment_1,NNNCCGANN,AACTTGAA,AGATCGGAAGAGCGGTTCAG,TruSeq
Treatment_2,NNNAAGCNN,TTGAACTT,AGATCGGAAGAGCGGTTCAG,TruSeq
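
Before a run, a sample sheet in this layout can be checked with a short script. The following is a sketch, not part of the pipeline: the function name `validate_samplesheet` is hypothetical, and the rules it applies (unique ids, non-empty 5' barcode, only A/C/G/T/N in barcodes) are assumptions drawn from the column descriptions above.

```python
import csv
import io

REQUIRED_COLUMNS = {"id", "5prime_barcode"}
VALID_BASES = set("ACGTN")


def validate_samplesheet(csv_text):
    """Return a list of human-readable problems found in a CSV sample sheet."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [f"missing required column(s): {', '.join(sorted(missing))}"]

    problems = []
    seen_ids = set()
    for line_no, row in enumerate(reader, start=2):  # header is line 1
        sample_id = (row.get("id") or "").strip()
        if not sample_id:
            problems.append(f"line {line_no}: empty id")
        elif sample_id in seen_ids:
            problems.append(f"line {line_no}: duplicate id {sample_id!r}")
        seen_ids.add(sample_id)

        for column in ("5prime_barcode", "3prime_barcode"):
            barcode = (row.get(column) or "").strip().upper()
            if column == "5prime_barcode" and not barcode:
                problems.append(f"line {line_no}: empty 5prime_barcode")
            elif set(barcode) - VALID_BASES:
                problems.append(f"line {line_no}: invalid characters in {column}")
    return problems


sheet = """id,5prime_barcode,3prime_barcode,adapter_sequence,adapter_name
Control_1,NNNGGCGNN,,AGATCGGAAGAGCGGTTCAG,TruSeq
Treatment_1,NNNCCGANN,AACTTGAA,AGATCGGAAGAGCGGTTCAG,TruSeq
"""
print(validate_samplesheet(sheet) or "sample sheet looks OK")
```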

FASTQ Input Files

  • Single-end or paired-end FASTQ files (gzipped or uncompressed)
  • Multiplexed data containing reads from multiple samples
  • Any sequencing platform (Illumina, MGI, etc.)

Key Parameters

Required Parameters

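| Parameter     | Description                        | Default  |
|---------------|------------------------------------|----------|
| --fastq       | Path to multiplexed FASTQ file(s)  | Required |
| --samplesheet | Path to sample sheet (CSV/XLSX)    | Required |
| --outdir      | Output directory for results       | Required |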

Optional Parameters

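| Parameter              | Description                                  | Default       |
|------------------------|----------------------------------------------|---------------|
| --threads              | Number of CPU threads to use                 | All available |
| --dont_build_reference | Skip reference building (for barcodes >8nt)  | false         |
| --quality_threshold    | Minimum base quality score                   | 20            |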

Pipeline Outputs

Directory Structure

results/
├── demultiplexed/
│   ├── Sample_1.fastq.gz
│   ├── Sample_2.fastq.gz
│   └── ...
├── no_match/
│   └── no_match.fastq.gz
├── logs/
│   └── ultraplex.log
└── pipeline_info/
    └── versions.yml

Output Files

  • Demultiplexed FASTQ Files

    • One file per sample: [sample_id].fastq.gz
    • Contains all reads assigned to that sample
    • UMIs moved to read headers with "rbc:" prefix
  • Unassigned Reads

    • no_match.fastq.gz: Reads that couldn't be assigned to any barcode
    • Useful for troubleshooting barcode issues
  • Demultiplexing Statistics

    • ultraplex.log: Detailed statistics including read counts per sample
    • Assignment rates and quality metrics
  • Software Versions

    • versions.yml: Records all software versions used
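
Downstream tools often need the UMI back out of the read header. A minimal sketch for tallying UMIs in one demultiplexed file is shown below. It assumes only that the UMI follows the "rbc:" prefix somewhere in the header line; the surrounding header layout is not assumed, and the function names are hypothetical.

```python
import gzip
import re
from collections import Counter

# UMI bases directly after the "rbc:" prefix described above.
UMI_PATTERN = re.compile(r"rbc:([ACGTN]+)")


def iter_fastq_headers(path):
    """Yield the header line of every record in a (optionally gzipped) FASTQ."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line_no, line in enumerate(handle):
            if line_no % 4 == 0:  # FASTQ records are 4 lines; header is first
                yield line.rstrip("\n")


def umi_counts(path):
    """Count how often each UMI occurs in the headers of one FASTQ file."""
    counts = Counter()
    for header in iter_fastq_headers(path):
        match = UMI_PATTERN.search(header)
        if match:
            counts[match.group(1)] += 1
    return counts
```

For example, `umi_counts("results/demultiplexed/Sample_1.fastq.gz").most_common(5)` would list the five most frequent UMIs in that sample, which gives a quick look at duplication levels.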

Example Usage

Basic Demultiplexing

nextflow run demultiplex \
  --fastq multiplexed_data.fastq.gz \
  --samplesheet samples.csv \
  --outdir results

Paired-end Demultiplexing

nextflow run demultiplex \
  --fastq "multiplexed_*_{1,2}.fastq.gz" \
  --samplesheet samples.xlsx \
  --outdir results

Performance Optimization

nextflow run demultiplex \
  --fastq lane1.fastq.gz \
  --samplesheet barcodes.csv \
  --threads 16 \
  --outdir results

Tips and Best Practices

Barcode Design

  • Keep barcodes the same length for optimal performance
  • Use 'N' positions for UMIs or variable sequences
  • Ensure barcodes have sufficient edit distance (≥2) between samples
  • For barcodes >8nt, use the --dont_build_reference flag
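
The distance guideline can be checked before sequencing with a short script. This sketch uses Hamming distance (substitutions only) as a simpler stand-in for edit distance. It skips N positions and assumes all barcodes have the same length. The function names are hypothetical.

```python
from itertools import combinations


def barcode_distance(a, b):
    """Hamming distance over fixed positions; N in either barcode is skipped."""
    if len(a) != len(b):
        raise ValueError(f"barcodes differ in length: {a!r} vs {b!r}")
    return sum(1 for x, y in zip(a, b) if "N" not in (x, y) and x != y)


def close_pairs(barcodes, min_distance=2):
    """Return (sample_a, sample_b, distance) for pairs below min_distance.

    barcodes maps sample id -> barcode sequence.
    """
    pairs = []
    for (id_a, bc_a), (id_b, bc_b) in combinations(barcodes.items(), 2):
        distance = barcode_distance(bc_a, bc_b)
        if distance < min_distance:
            pairs.append((id_a, id_b, distance))
    return pairs


barcodes = {
    "Control_1": "NNNGGCGNN",
    "Control_2": "NNNTTGTNN",
    "Treatment_1": "NNNCCGANN",
    "Treatment_2": "NNNAAGCNN",
}
print(close_pairs(barcodes))  # empty list: every pair differs at 3 or more fixed positions
```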

Performance Optimization

  • Ultraplex can process roughly 400 million reads in about 20 minutes; actual throughput depends on thread count and disk speed
  • Use multiple threads for faster processing
  • Ensure sufficient disk space for output files

Quality Control

  • Check the no_match.fastq.gz file. A high unassigned rate may indicate:
    • Incorrect barcode sequences in sample sheet
    • Poor sequencing quality
    • Sample contamination
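
The unassigned rate can also be computed directly from the output files. This is a sketch that assumes the directory layout shown under "Directory Structure" and gzipped, four-line FASTQ records. The function names are hypothetical, and ultraplex.log remains the authoritative source for per-sample statistics.

```python
import gzip
from pathlib import Path


def count_reads(path):
    """Count records in a gzipped FASTQ file (4 lines per record)."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


def unassigned_fraction(results_dir):
    """Fraction of reads that ended up in no_match.fastq.gz."""
    results = Path(results_dir)
    assigned = sum(
        count_reads(path)
        for path in sorted((results / "demultiplexed").glob("*.fastq.gz"))
    )
    no_match = results / "no_match" / "no_match.fastq.gz"
    unassigned = count_reads(no_match) if no_match.exists() else 0
    total = assigned + unassigned
    return unassigned / total if total else 0.0
```

Calling `unassigned_fraction("results")` after a run returns a value between 0 and 1. What counts as "high" depends on the library, so compare against earlier runs of the same protocol.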

On Flow Platform

  • Samples are automatically created during demultiplexing
  • Fill out the annotation spreadsheet completely for accurate metadata
  • Use descriptive sample IDs as they become part of the filename

Troubleshooting

Common Issues

Issue: High percentage of unassigned reads

  • Solution: Verify barcode sequences match library preparation
  • Check for barcode mutations or sequencing errors
  • Consider allowing mismatches in barcode matching

Issue: Pipeline fails with "ambiguous barcodes" error

  • Solution: Ensure all barcodes are unique
  • Check that N positions are consistent across barcodes
  • Verify no barcode is a subsequence of another
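
The first two checks are easy to automate. The sketch below reports duplicated barcodes and inconsistent N layouts. It is an illustration under assumptions, not the test Ultraplex itself applies, and the helper names are hypothetical.

```python
from collections import defaultdict


def n_mask(barcode):
    """Reduce a barcode to its N layout, e.g. NNNGGCGNN -> NNN----NN."""
    return "".join("N" if base == "N" else "-" for base in barcode.upper())


def barcode_conflicts(barcodes):
    """Return a list of problems for a mapping of sample id -> barcode."""
    problems = []

    # Identical barcodes assigned to more than one sample.
    by_sequence = defaultdict(list)
    for sample_id, barcode in barcodes.items():
        by_sequence[barcode.upper()].append(sample_id)
    for barcode, sample_ids in by_sequence.items():
        if len(sample_ids) > 1:
            problems.append(f"{barcode} is shared by {', '.join(sample_ids)}")

    # N positions should sit in the same places in every barcode.
    masks = {n_mask(barcode) for barcode in barcodes.values()}
    if len(masks) > 1:
        problems.append(f"inconsistent N layouts: {', '.join(sorted(masks))}")
    return problems


print(barcode_conflicts({"A": "NNNGGCGNN", "B": "NNNGGCGNN", "C": "NNTTGTNNN"}))
```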

Issue: Out of memory errors

  • Solution: Reduce number of threads
  • Process files in smaller batches
  • Increase available memory allocation

Issue: Slow performance on large files

  • Solution: Use --dont_build_reference for long barcodes
  • Ensure fast disk I/O for input/output files
  • Use compressed (gzipped) input files
