
Core Flow Concepts

Executions

Overview

An Execution represents a single run of a pipeline in Flow. It is the complete record of a computational analysis, capturing not just the results but the entire context in which they were generated: inputs, parameters, software versions, compute resources, and the complete chain of processing steps.

Executions are the foundation of reproducible science in Flow, ensuring that any analysis can be understood, validated, and repeated.

What is an Execution?

An execution encompasses:

  • The computational run: Actual processing of data through a pipeline
  • Complete provenance: Every detail needed to reproduce the analysis
  • Real-time monitoring: Live status updates during processing
  • Output management: All generated files and their relationships
  • Quality metrics: Performance and validation information

Execution Lifecycle

States and Transitions

Every execution progresses through defined states:

Created → Submitted → Running → [Success/Failed] → Complete
   ↓         ↓           ↓            ↓              ↓
 Draft    Queued    Processing    Finished      Archived

State Descriptions:

  1. Created: Initial state, configuration in progress
  2. Submitted: Sent to compute infrastructure
  3. Queued: Waiting for resources
  4. Running: Actively processing data
  5. Success: Completed without errors
  6. Failed: Encountered unrecoverable errors
  7. Cancelled: Manually terminated
  8. Retrying: Attempting recovery from failure
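The lifecycle above can be modeled as a small state machine. A minimal Python sketch follows; the state names come from the list above, but the transition table itself is an illustrative assumption, not Flow's actual rules:

```python
from enum import Enum

class ExecutionState(Enum):
    CREATED = "created"
    SUBMITTED = "submitted"
    QUEUED = "queued"
    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"
    CANCELLED = "cancelled"
    RETRYING = "retrying"

# Illustrative transition table: which states each state may move to.
TRANSITIONS = {
    ExecutionState.CREATED:   {ExecutionState.SUBMITTED, ExecutionState.CANCELLED},
    ExecutionState.SUBMITTED: {ExecutionState.QUEUED, ExecutionState.CANCELLED},
    ExecutionState.QUEUED:    {ExecutionState.RUNNING, ExecutionState.CANCELLED},
    ExecutionState.RUNNING:   {ExecutionState.SUCCESS, ExecutionState.FAILED,
                               ExecutionState.CANCELLED},
    ExecutionState.FAILED:    {ExecutionState.RETRYING},
    ExecutionState.RETRYING:  {ExecutionState.QUEUED},
}

def can_transition(current: ExecutionState, target: ExecutionState) -> bool:
    """Return True if moving from `current` to `target` is allowed."""
    return target in TRANSITIONS.get(current, set())
```

Note that Success and Cancelled are terminal in this sketch (no outgoing transitions), while Failed can still move to Retrying.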

Timing and Duration

Executions track critical timestamps:

  • Created: When execution was initialized
  • Submitted: When sent to compute cluster
  • Started: When processing began
  • Finished: When processing completed
  • Duration: Total processing time
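Queue time and duration are derived from these timestamps. A short sketch, using hypothetical timestamp values for one execution:

```python
from datetime import datetime

# Hypothetical timestamps for a single execution (ISO 8601).
timestamps = {
    "created":   "2024-01-15T10:28:00+00:00",
    "submitted": "2024-01-15T10:30:15+00:00",
    "started":   "2024-01-15T10:32:40+00:00",
    "finished":  "2024-01-15T12:45:40+00:00",
}

ts = {name: datetime.fromisoformat(value) for name, value in timestamps.items()}

queue_time = ts["started"] - ts["submitted"]   # wait before processing began
duration = ts["finished"] - ts["started"]      # total processing time

print(f"Queued for {queue_time}, ran for {duration}")
```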

Execution Components

Input Configuration

Every execution captures its inputs:

  1. Data Inputs: Files and directories

    • Raw sequencing data
    • Reference genomes
    • Annotation files
    • Previous execution outputs
  2. Sample Inputs: Biological samples

    • Sample metadata
    • Associated data files
    • Experimental conditions
  3. Parameters: Configuration settings

    • Algorithm parameters
    • Quality thresholds
    • Resource allocations

Process Executions

Within each execution, individual processes run:

Execution
├── Process 1: FastQC
│   ├── Input: raw_reads.fastq
│   ├── Output: quality_report.html
│   └── Status: Success (2 min)
├── Process 2: Trimming
│   ├── Input: raw_reads.fastq
│   ├── Output: trimmed_reads.fastq
│   └── Status: Success (5 min)
└── Process 3: Alignment
    ├── Input: trimmed_reads.fastq
    ├── Output: aligned_reads.bam
    └── Status: Running (10 min elapsed)

Each process tracks:

  • Status: success, failed, or running
  • Exit code: Process return value
  • Duration: Processing time
  • Resources: CPU and memory usage
  • Logs: stdout and stderr output

Output Organization

Execution outputs follow a structured hierarchy:

execution_[id]/
├── results/              # Primary outputs
│   ├── processed_data/   # Main results
│   ├── qc_reports/       # Quality control
│   └── logs/             # Processing logs
├── pipeline_info/        # Execution metadata
│   ├── execution_report.html
│   ├── execution_trace.txt
│   └── execution_timeline.html
└── intermediate/         # Optional intermediate files

Monitoring Executions

Real-time Status

During execution, monitor:

  • Overall progress: Percentage complete
  • Process status: Individual step progress
  • Resource usage: CPU, memory, storage
  • Queue position: For pending jobs
  • Error alerts: Immediate failure notification
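A monitoring script typically polls the execution status until it reaches a terminal state. The sketch below simulates this; `get_execution_status` is a hypothetical stand-in for whatever call Flow's API actually provides:

```python
import time

# Hypothetical status source: in practice this would query Flow's API.
# Here we simulate a run that finishes after a few polls.
_fake_states = iter(["queued", "running", "running", "success"])

def get_execution_status(execution_id: str) -> str:
    return next(_fake_states)

def wait_for_execution(execution_id: str, poll_seconds: float = 0.01) -> str:
    """Poll until the execution reaches a terminal state, then return it."""
    terminal = {"success", "failed", "cancelled"}
    while True:
        status = get_execution_status(execution_id)
        print(f"{execution_id}: {status}")
        if status in terminal:
            return status
        time.sleep(poll_seconds)

final = wait_for_execution("exec_12345")
```

In real use the poll interval would be seconds or minutes, not milliseconds, and the loop would also handle transient API errors.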

Execution Logs

Multiple log levels provide insights:

  1. Execution Log: High-level pipeline progress

    [2024-01-15 10:30:15] Pipeline started
    [2024-01-15 10:30:45] FastQC completed (2/10 processes)
    [2024-01-15 10:35:20] Alignment started (5/10 processes)
    
  2. Process Logs: Individual tool output

    === Process: STAR_ALIGN ===
    Started: 2024-01-15 10:35:20
    Input files: sample1_R1.fastq, sample1_R2.fastq
    STAR version 2.7.10a
    Loading genome index...
    Mapping reads...
    Finished successfully in 15 minutes
    
  3. Error Logs: Debugging information

    ERROR: Process failed with exit code 1
    STDERR: Insufficient memory for genome index
    Suggested fix: Increase memory allocation
    

Performance Metrics

Executions collect performance data:

  • CPU hours: Total compute time
  • Peak memory: Maximum RAM usage
  • I/O statistics: Read/write operations
  • Queue time: Wait before execution
  • Process timeline: Gantt chart visualization
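Metrics like total CPU hours and peak memory can be aggregated from a trace file such as the `execution_trace.txt` mentioned above. The sketch below uses a simplified tab-separated trace with assumed column names; real Nextflow traces have more columns and human-readable duration strings:

```python
import csv
import io

# Simplified trace in the spirit of execution_trace.txt (columns assumed).
trace = """\
name\tstatus\tcpu_hours\tpeak_memory_gb
FASTQC\tCOMPLETED\t0.1\t2.0
TRIMGALORE\tCOMPLETED\t0.4\t4.5
STAR_ALIGN\tCOMPLETED\t18.0\t32.0
"""

rows = list(csv.DictReader(io.StringIO(trace), delimiter="\t"))
total_cpu_hours = sum(float(row["cpu_hours"]) for row in rows)
peak_memory = max(float(row["peak_memory_gb"]) for row in rows)

print(f"CPU hours: {total_cpu_hours:.1f}, peak memory: {peak_memory:.0f} GB")
```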

Working with Executions

Creating Executions

Start a new execution by:

  1. Selecting a pipeline: Choose appropriate workflow
  2. Configuring inputs: Select data/samples
  3. Setting parameters: Adjust analysis settings
  4. Reviewing setup: Verify configuration
  5. Submitting: Launch the execution

Managing Running Executions

While running, you can:

  • Monitor progress: Check status updates
  • View logs: Read process output
  • Pause/Resume: Temporarily halt processing
  • Cancel: Stop and clean up
  • Modify resources: Adjust allocations

Handling Failures

When executions fail:

  1. Identify the issue:

    • Check error logs
    • Review process exit codes
    • Examine resource usage
  2. Common failure causes:

    • Insufficient memory
    • Malformed input files
    • Missing dependencies
    • Parameter conflicts
  3. Recovery options:

    • Retry: Attempt same configuration
    • Resume: Continue from last successful step
    • Modify: Adjust parameters and retry
    • Debug: Run with increased logging

Reusing Executions

Leverage previous executions:

  1. Clone execution: Copy configuration to new run
  2. Use outputs: Chain to downstream pipelines
  3. Share results: Make available to collaborators
  4. Export configuration: Save for reproducibility
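Cloning an execution amounts to copying its recorded configuration and overriding only what should change. A minimal sketch with illustrative field names:

```python
import copy

# Recorded configuration of a previous execution (illustrative structure).
previous = {
    "pipeline": "nf-core/rnaseq",
    "version": "3.12.0",
    "parameters": {"genome": "GRCh38", "aligner": "star_salmon"},
}

# Deep-copy so the new run's edits never mutate the original record.
new_run = copy.deepcopy(previous)
new_run["parameters"]["aligner"] = "hisat2"  # try a different aligner

assert previous["parameters"]["aligner"] == "star_salmon"  # original intact
```

The deep copy matters: a shallow copy would share the nested `parameters` dict, so editing the clone would silently rewrite the provenance of the original run.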

Execution Outputs

Data Objects

Executions create new Data objects:

  • Automatically registered: In Flow's database
  • Linked to execution: Full provenance
  • Categorized by type: Results, QC, logs
  • Access controlled: Inherits permissions

Reports and Visualizations

Standard execution reports include:

  • Execution report: Overall summary
  • MultiQC report: Aggregated quality metrics
  • Timeline visualization: Process scheduling
  • Resource usage: Computational statistics

Metadata and Provenance

Every execution preserves:

execution_id: exec_12345
pipeline: nf-core/rnaseq
version: 3.12.0
timestamp: 2024-01-15T10:30:15Z
user: researcher@institution.edu
parameters:
  genome: GRCh38
  aligner: star_salmon
  trimmer: trimgalore
inputs:
  - sample1_R1.fastq.gz
  - sample1_R2.fastq.gz
outputs:
  - results/star_salmon/sample1.Aligned.sortedByCoord.out.bam
  - results/star_salmon/sample1.genes.results
compute:
  duration: 2h 15m
  cpu_hours: 18.5
  peak_memory: 32GB
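Before archiving or sharing an execution, it is worth checking that its provenance record is complete. A minimal sketch, with the record above represented as a Python dict:

```python
# Provenance record (abbreviated from the example above).
record = {
    "execution_id": "exec_12345",
    "pipeline": "nf-core/rnaseq",
    "version": "3.12.0",
    "timestamp": "2024-01-15T10:30:15Z",
    "parameters": {"genome": "GRCh38", "aligner": "star_salmon"},
    "inputs": ["sample1_R1.fastq.gz", "sample1_R2.fastq.gz"],
    "outputs": [],
}

# Fields a record must carry to be reproducible (illustrative list).
REQUIRED = ["execution_id", "pipeline", "version", "timestamp",
            "parameters", "inputs", "outputs"]

missing = [key for key in REQUIRED if key not in record]
assert not missing, f"Provenance record incomplete: {missing}"
print("Provenance record has all required fields")
```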

Advanced Features

Batch Executions

Process multiple samples together:

  • Sample sheets: Define batch inputs
  • Parallel processing: Concurrent sample analysis
  • Merged outputs: Combined results
  • Consistent parameters: Ensure comparability
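Sample sheets are typically plain CSV files, one row per sample. The sketch below parses one in the nf-core/rnaseq style; column names vary by pipeline, so the exact schema here is an assumption:

```python
import csv
import io

# Illustrative sample sheet (nf-core/rnaseq-style columns assumed).
sheet = """\
sample,fastq_1,fastq_2,strandedness
sample1,sample1_R1.fastq.gz,sample1_R2.fastq.gz,auto
sample2,sample2_R1.fastq.gz,sample2_R2.fastq.gz,auto
"""

samples = list(csv.DictReader(io.StringIO(sheet)))
for row in samples:
    paired = bool(row["fastq_2"])  # empty fastq_2 column means single-end
    print(f"{row['sample']}: paired-end={paired}")
```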

Execution Dependencies

Chain executions for complex workflows:

Genome Prep → RNA-seq → Differential Expression
      ↓           ↓              ↓
   Indices    Alignments    Statistical Results

Retry Strategies

Automatic failure recovery:

  • Process-level retry: Retry failed steps
  • Resource escalation: Increase memory/time
  • Error patterns: Skip known issues
  • Checkpoint recovery: Resume from saved state
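Process-level retry with resource escalation can be sketched as a loop that doubles the memory request after each failure. `run_process` below is a simulated stand-in for whatever actually launches the step:

```python
def run_process(name: str, memory_gb: int) -> bool:
    # Simulated behaviour: this step happens to need at least 32 GB.
    return memory_gb >= 32

def run_with_escalation(name: str, memory_gb: int = 8, max_attempts: int = 4):
    """Retry a step, doubling its memory request after each failure."""
    for attempt in range(1, max_attempts + 1):
        if run_process(name, memory_gb):
            return attempt, memory_gb
        print(f"{name} failed with {memory_gb} GB; retrying with more memory")
        memory_gb *= 2
    raise RuntimeError(f"{name} failed after {max_attempts} attempts")

attempts, final_memory = run_with_escalation("STAR_ALIGN")
```

This mirrors the common pattern in workflow engines of scaling a process's resource request with the retry count rather than retrying with identical settings.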

Resource Optimization

Tune execution performance:

  • Process-specific resources: Customize per step
  • Dynamic allocation: Scale based on input
  • Priority queues: Expedite urgent jobs
  • Cost optimization: Balance speed vs expense

Best Practices

Before Execution

  1. Validate inputs: Check file formats and quality
  2. Test parameters: Run on subset first
  3. Estimate resources: Plan compute needs
  4. Document purpose: Note experimental goals

During Execution

  1. Monitor regularly: Check for early failures
  2. Review warnings: Address potential issues
  3. Document observations: Note unexpected behavior
  4. Save checkpoints: Enable resume capability

After Execution

  1. Verify outputs: Check completeness and quality
  2. Archive results: Move to long-term storage
  3. Document outcomes: Update project notes
  4. Clean up: Remove temporary files

Troubleshooting

Common Issues

Execution won't start

  • Check input file permissions
  • Verify compute resources available
  • Ensure pipeline is active
  • Review parameter validity

Execution fails immediately

  • Examine input file formats
  • Check parameter syntax
  • Verify reference data exists
  • Review resource requests

Execution runs but produces no output

  • Check process exit codes
  • Review log files
  • Verify write permissions
  • Examine disk space

Execution takes too long

  • Review resource allocation
  • Check for process bottlenecks
  • Consider parameter optimization
  • Verify input data size

Debug Strategies

  1. Enable verbose logging: Increase detail level
  2. Run single sample: Isolate issues
  3. Check intermediate files: Verify each step
  4. Compare to successful runs: Identify differences

Integration with Flow

Data Lineage

Executions maintain complete data lineage:

Raw Data → Execution 1 → Processed Data → Execution 2 → Final Results
              ↓                               ↓
          Parameters                      Parameters
          & Versions                      & Versions

Collaboration

Share executions with teams:

  • View permissions: See results and logs
  • Clone permissions: Rerun with modifications
  • Output access: Use generated data
  • Report sharing: Distribute findings

Reproducibility

Ensure reproducible science:

  • Version tracking: Pipeline and tool versions
  • Parameter records: Complete configuration
  • Environment capture: Container definitions
  • Compute details: Resource specifications
