Core Flow Concepts
Executions
Overview
An Execution represents a single run of a pipeline in Flow. It is the complete record of a computational analysis, capturing not just the results but the entire context in which those results were generated: inputs, parameters, software versions, compute resources, and the complete chain of processing steps.
Executions are the foundation of reproducible science in Flow, ensuring that any analysis can be understood, validated, and repeated.
What is an Execution?
An execution encompasses:
- The computational run: Actual processing of data through a pipeline
- Complete provenance: Every detail needed to reproduce the analysis
- Real-time monitoring: Live status updates during processing
- Output management: All generated files and their relationships
- Quality metrics: Performance and validation information
Execution Lifecycle
States and Transitions
Every execution progresses through defined states:
Created → Submitted → Running → [Success/Failed] → Complete
   ↓          ↓          ↓             ↓               ↓
 Draft      Queued   Processing     Finished        Archived
State Descriptions:
- Created: Initial state, configuration in progress
- Submitted: Sent to compute infrastructure
- Queued: Waiting for resources
- Running: Actively processing data
- Success: Completed without errors
- Failed: Encountered unrecoverable errors
- Cancelled: Manually terminated
- Retrying: Attempting recovery from failure
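The states above can be sketched as a small transition table. The state names come from the list above; the exact set of allowed transitions is an illustrative assumption, not Flow's definitive behaviour:

```python
# Hypothetical sketch of the execution state machine. State names follow
# the documentation; the transition sets themselves are assumptions.
VALID_TRANSITIONS = {
    "Created":   {"Submitted", "Cancelled"},
    "Submitted": {"Queued", "Cancelled"},
    "Queued":    {"Running", "Cancelled"},
    "Running":   {"Success", "Failed", "Cancelled"},
    "Failed":    {"Retrying"},
    "Retrying":  {"Queued"},
    "Success":   set(),   # terminal
    "Cancelled": set(),   # terminal
}

def can_transition(current: str, target: str) -> bool:
    """Return True if an execution may move from `current` to `target`."""
    return target in VALID_TRANSITIONS.get(current, set())
```

A guard like this prevents impossible updates, such as a completed execution moving back to Running.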
Timing and Duration
Executions track critical timestamps:
- Created: When execution was initialized
- Submitted: When sent to compute cluster
- Started: When processing began
- Finished: When processing completed
- Duration: Total processing time
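Given the Started and Finished timestamps, Duration is simply their difference. A minimal sketch (the timestamps and function name are illustrative):

```python
from datetime import datetime

def execution_duration(started: str, finished: str) -> float:
    """Elapsed processing time in seconds between two ISO-8601 timestamps."""
    t0 = datetime.fromisoformat(started)
    t1 = datetime.fromisoformat(finished)
    return (t1 - t0).total_seconds()
```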
Execution Components
Input Configuration
Every execution captures its inputs:
Data Inputs: Files and directories
- Raw sequencing data
- Reference genomes
- Annotation files
- Previous execution outputs
Sample Inputs: Biological samples
- Sample metadata
- Associated data files
- Experimental conditions
Parameters: Configuration settings
- Algorithm parameters
- Quality thresholds
- Resource allocations
Process Executions
Within each execution, individual processes run:
Execution
├── Process 1: FastQC
│   ├── Input:  raw_reads.fastq
│   ├── Output: quality_report.html
│   └── Status: Success (2 min)
├── Process 2: Trimming
│   ├── Input:  raw_reads.fastq
│   ├── Output: trimmed_reads.fastq
│   └── Status: Success (5 min)
└── Process 3: Alignment
    ├── Input:  trimmed_reads.fastq
    ├── Output: aligned_reads.bam
    └── Status: Running (10 min elapsed)
Each process tracks:
- Status: success, failed, or running
- Exit code: Process return value
- Duration: Processing time
- Resources: CPU and memory usage
- Logs: stdout and stderr output
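The per-process fields above can be modelled as a small record type. The class and field names here are illustrative, not Flow's actual schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ProcessRecord:
    """Illustrative record of one process within an execution."""
    name: str
    status: str                        # "success", "failed", or "running"
    exit_code: Optional[int] = None    # None until the process finishes
    duration_s: Optional[float] = None
    peak_mem_mb: Optional[float] = None
    log: list = field(default_factory=list)  # captured stdout/stderr lines

    @property
    def succeeded(self) -> bool:
        # A process counts as successful only with a zero exit code.
        return self.status == "success" and self.exit_code == 0
```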
Output Organization
Execution outputs follow a structured hierarchy:
execution_[id]/
├── results/              # Primary outputs
│   ├── processed_data/   # Main results
│   ├── qc_reports/       # Quality control
│   └── logs/             # Processing logs
├── pipeline_info/        # Execution metadata
│   ├── execution_report.html
│   ├── execution_trace.txt
│   └── execution_timeline.html
└── intermediate/         # Optional intermediate files
Monitoring Executions
Real-time Status
During execution, monitor:
- Overall progress: Percentage complete
- Process status: Individual step progress
- Resource usage: CPU, memory, storage
- Queue position: For pending jobs
- Error alerts: Immediate failure notification
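One way to derive the overall-progress percentage is from the individual process statuses. A rough sketch; the real UI may weight steps differently:

```python
def overall_progress(process_statuses: list) -> float:
    """Percentage of processes that have finished (success or failed).
    A simple proxy for the overall-progress figure; assumes equal weighting."""
    if not process_statuses:
        return 0.0
    done = sum(1 for s in process_statuses if s in ("success", "failed"))
    return 100.0 * done / len(process_statuses)
```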
Execution Logs
Multiple log levels provide insights:
Execution Log: High-level pipeline progress
[2024-01-15 10:30:15] Pipeline started
[2024-01-15 10:30:45] FastQC completed (2/10 processes)
[2024-01-15 10:35:20] Alignment started (5/10 processes)
Process Logs: Individual tool output
=== Process: STAR_ALIGN ===
Started: 2024-01-15 10:35:20
Input files: sample1_R1.fastq, sample1_R2.fastq
STAR version 2.7.10a
Loading genome index...
Mapping reads...
Finished successfully in 15 minutes
Error Logs: Debugging information
ERROR: Process failed with exit code 1
STDERR: Insufficient memory for genome index
Suggested fix: Increase memory allocation
Performance Metrics
Executions collect performance data:
- CPU hours: Total compute time
- Peak memory: Maximum RAM usage
- I/O statistics: Read/write operations
- Queue time: Wait before execution
- Process timeline: Gantt chart visualization
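CPU hours are the wall-clock duration multiplied by the number of allocated CPUs. A small helper to illustrate, assuming full utilisation of every allocated core:

```python
def cpu_hours(n_cpus: int, wall_seconds: float) -> float:
    """Total compute time in CPU hours for a run that held `n_cpus`
    cores for `wall_seconds` of wall-clock time."""
    return n_cpus * wall_seconds / 3600.0
```

For example, 8 cores held for 2 hours 15 minutes (8100 seconds) accounts for 18 CPU hours.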
Working with Executions
Creating Executions
Start a new execution by:
- Selecting a pipeline: Choose appropriate workflow
- Configuring inputs: Select data/samples
- Setting parameters: Adjust analysis settings
- Reviewing setup: Verify configuration
- Submitting: Launch the execution
Managing Running Executions
While running, you can:
- Monitor progress: Check status updates
- View logs: Read process output
- Pause/Resume: Temporarily halt processing
- Cancel: Stop and clean up
- Modify resources: Adjust allocations
Handling Failures
When executions fail:
Identify the issue:
- Check error logs
- Review process exit codes
- Examine resource usage
Common failure causes:
- Insufficient memory
- Malformed input files
- Missing dependencies
- Parameter conflicts
Recovery options:
- Retry: Attempt same configuration
- Resume: Continue from last successful step
- Modify: Adjust parameters and retry
- Debug: Run with increased logging
Reusing Executions
Leverage previous executions:
- Clone execution: Copy configuration to new run
- Use outputs: Chain to downstream pipelines
- Share results: Make available to collaborators
- Export configuration: Save for reproducibility
Execution Outputs
Data Objects
Executions create new Data objects:
- Automatically registered: In Flow's database
- Linked to execution: Full provenance
- Categorized by type: Results, QC, logs
- Access controlled: Inherits permissions
Reports and Visualizations
Standard execution reports include:
- Execution report: Overall summary
- MultiQC report: Aggregated quality metrics
- Timeline visualization: Process scheduling
- Resource usage: Computational statistics
Metadata and Provenance
Every execution preserves:
execution_id: exec_12345
pipeline: nf-core/rnaseq
version: 3.12.0
timestamp: 2024-01-15T10:30:15Z
user: researcher@institution.edu
parameters:
  genome: GRCh38
  aligner: star_salmon
  trimmer: trimgalore
inputs:
  - sample1_R1.fastq.gz
  - sample1_R2.fastq.gz
outputs:
  - results/star_salmon/sample1.Aligned.sortedByCoord.out.bam
  - results/star_salmon/sample1.genes.results
compute:
  duration: 2h 15m
  cpu_hours: 18.5
  peak_memory: 32GB
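Because the record preserves every field needed to rerun the analysis, a consumer can check a provenance record for completeness before attempting reproduction. A sketch using a plain dict standing in for the YAML above; the required-field set is an assumption:

```python
# Assumed minimum set of fields needed to reproduce an analysis.
REQUIRED = {"execution_id", "pipeline", "version", "parameters", "inputs", "outputs"}

def is_reproducible(record: dict) -> bool:
    """True if the provenance record carries every field we assume is
    required to rerun the analysis."""
    return REQUIRED.issubset(record)

record = {
    "execution_id": "exec_12345",
    "pipeline": "nf-core/rnaseq",
    "version": "3.12.0",
    "parameters": {"genome": "GRCh38", "aligner": "star_salmon"},
    "inputs": ["sample1_R1.fastq.gz", "sample1_R2.fastq.gz"],
    "outputs": ["results/star_salmon/sample1.genes.results"],
}
```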
Advanced Features
Batch Executions
Process multiple samples together:
- Sample sheets: Define batch inputs
- Parallel processing: Concurrent sample analysis
- Merged outputs: Combined results
- Consistent parameters: Ensure comparability
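A sample sheet for a batch execution is typically a small CSV mapping sample names to their input files. The columns below follow the nf-core convention; the exact schema Flow expects may differ:

```python
import csv
import io

# Hypothetical paired-end sample sheet in nf-core style.
SHEET = """\
sample,fastq_1,fastq_2
sample1,sample1_R1.fastq.gz,sample1_R2.fastq.gz
sample2,sample2_R1.fastq.gz,sample2_R2.fastq.gz
"""

# Each row becomes one sample processed in parallel with shared parameters.
samples = list(csv.DictReader(io.StringIO(SHEET)))
```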
Execution Dependencies
Chain executions for complex workflows:
Genome Prep → RNA-seq → Differential Expression
     ↓           ↓                ↓
  Indices    Alignments   Statistical Results
Retry Strategies
Automatic failure recovery:
- Process-level retry: Retry failed steps
- Resource escalation: Increase memory/time
- Error patterns: Skip known issues
- Checkpoint recovery: Resume from saved state
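Resource escalation is commonly implemented by increasing the request on each retry attempt, much as Nextflow pipelines scale resources with `task.attempt`. A sketch of doubling memory up to a cap; the values are illustrative:

```python
def escalate_memory(base_gb: int, attempt: int, cap_gb: int = 128) -> int:
    """Memory to request on a given retry: double the base request for
    each prior failed attempt (attempt 0 = first try), up to a hard cap."""
    return min(base_gb * (2 ** attempt), cap_gb)
```

So a process that starts at 8 GB would retry at 16 GB, then 32 GB, never exceeding the cap.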
Resource Optimization
Tune execution performance:
- Process-specific resources: Customize per step
- Dynamic allocation: Scale based on input
- Priority queues: Expedite urgent jobs
- Cost optimization: Balance speed vs expense
Best Practices
Before Execution
- Validate inputs: Check file formats and quality
- Test parameters: Run on subset first
- Estimate resources: Plan compute needs
- Document purpose: Note experimental goals
During Execution
- Monitor regularly: Check for early failures
- Review warnings: Address potential issues
- Document observations: Note unexpected behavior
- Save checkpoints: Enable resume capability
After Execution
- Verify outputs: Check completeness and quality
- Archive results: Move to long-term storage
- Document outcomes: Update project notes
- Clean up: Remove temporary files
Troubleshooting
Common Issues
Execution won't start
- Check input file permissions
- Verify compute resources available
- Ensure pipeline is active
- Review parameter validity
Execution fails immediately
- Examine input file formats
- Check parameter syntax
- Verify reference data exists
- Review resource requests
Execution runs but produces no output
- Check process exit codes
- Review log files
- Verify write permissions
- Examine disk space
Execution takes too long
- Review resource allocation
- Check for process bottlenecks
- Consider parameter optimization
- Verify input data size
Debug Strategies
- Enable verbose logging: Increase detail level
- Run single sample: Isolate issues
- Check intermediate files: Verify each step
- Compare to successful runs: Identify differences
Integration with Flow
Data Lineage
Executions maintain complete data lineage:
Raw Data → Execution 1 → Processed Data → Execution 2 → Final Results
                ↓                              ↓
           Parameters                     Parameters
           & Versions                     & Versions
Collaboration
Share executions with teams:
- View permissions: See results and logs
- Clone permissions: Rerun with modifications
- Output access: Use generated data
- Report sharing: Distribute findings
Reproducibility
Ensure reproducible science:
- Version tracking: Pipeline and tool versions
- Parameter records: Complete configuration
- Environment capture: Container definitions
- Compute details: Resource specifications