
Core Concepts

Overview

Flow is built around a set of core concepts that work together to provide a comprehensive bioinformatics data management and analysis platform. Understanding these concepts and their relationships is essential for effectively using Flow.

Conceptual Model

At its heart, Flow organizes biological research data through these interconnected concepts:

Projects → Samples → Data → Pipelines → Executions → Results
    ↓         ↓        ↓         ↓           ↓          ↓
  Groups    Metadata  Files   Versions    Status    New Data
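The chain above can be sketched as plain data structures. This is an illustrative model only; the class and field names here are hypothetical and do not reflect Flow's actual schema or API.

```python
from dataclasses import dataclass, field

# Illustrative sketch of Flow's conceptual chain.
# All names are hypothetical, not the platform's real schema.

@dataclass
class Data:
    path: str
    category: str  # e.g. "raw", "annotation", "generated"

@dataclass
class Sample:
    name: str
    sample_type: str                            # e.g. "RNA-seq"
    metadata: dict = field(default_factory=dict)
    data: list = field(default_factory=list)    # raw files, via filesets

@dataclass
class Execution:
    pipeline: str
    version: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)  # results become new Data

@dataclass
class Project:
    name: str
    samples: list = field(default_factory=list)
    executions: list = field(default_factory=list)

# Wiring the chain together: Project -> Sample -> Data
project = Project("Mouse Brain Development Study")
sample = Sample("E14_Brain_Rep1", "RNA-seq", {"organism": "mouse"})
sample.data.append(Data("E14_Brain_Rep1_R1.fastq.gz", "raw"))
project.samples.append(sample)
```

Each concept below expands on one link in this chain.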

Core Concepts

🔬 Projects

A Project is the top-level organizational unit in Flow. It represents a research study or experiment and contains all related samples, data, and analyses.

Key characteristics:

  • Groups related biological samples together
  • Controls privacy settings for all contained data
  • Enables collaboration through user and group permissions
  • Links to publications and external databases

Example: A cancer genomics study comparing tumor and normal tissue samples from 50 patients.

Learn more about Projects →

🧪 Samples

A Sample represents a biological specimen that has been or will be sequenced. Each sample contains comprehensive metadata about its origin, preparation, and sequencing.

Key characteristics:

  • Belongs to a project (or replicate group)
  • Has a specific type (RNA-seq, ChIP-seq, etc.)
  • Contains biological metadata (organism, tissue, condition)
  • Links to raw sequencing data through filesets
  • Can be input to multiple pipeline analyses

Example: A liver tissue sample from patient P001, prepared for RNA-seq analysis.

Learn more about Samples →

📁 Data

Data represents files and directories in Flow. This includes both uploaded raw data and files generated by pipeline executions.

Data categories:

  • Raw Data: Uploaded FASTQ files from sequencers
  • Annotation: Reference files (genomes, GTF files)
  • Generated: Output from pipeline executions
  • Multiplexed/Demultiplexed: Special categories for pooled samples

Key characteristics:

  • Immutable once uploaded
  • Tracked with full provenance
  • Can be private or public
  • Linked to samples or executions that created them
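Two of these characteristics, immutability and provenance, can be illustrated with a frozen record that carries a content checksum and a link to whatever created it. The field names and registration helper are made up for this sketch; they are not Flow's implementation.

```python
import hashlib
from dataclasses import dataclass, FrozenInstanceError

# Sketch of "immutable once uploaded": a frozen record with a
# checksum (computed at upload) and a provenance link.
# Field names are illustrative, not Flow's schema.

@dataclass(frozen=True)
class DataRecord:
    filename: str
    sha256: str       # content checksum, fixed at upload time
    created_by: str   # the sample or execution that produced this file

def register(filename: str, content: bytes, created_by: str) -> DataRecord:
    return DataRecord(filename, hashlib.sha256(content).hexdigest(), created_by)

rec = register("sample_R1.fastq.gz", b"@read1\nACGT\n+\nFFFF\n", "sample:P001")

mutated = True
try:
    rec.filename = "renamed.fastq.gz"   # frozen dataclass rejects mutation
except FrozenInstanceError:
    mutated = False
```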

Learn more about Data →

🔄 Pipelines

A Pipeline is a bioinformatics workflow that processes data. Flow provides curated pipelines from nf-core and custom workflows.

Pipeline types:

  • Primary Analysis: RNA-seq, ChIP-seq, variant calling
  • Secondary Analysis: Differential expression, peak calling
  • Utilities: Demultiplexing, file conversion
  • Genome Preparation: Reference genome indexing

Key characteristics:

  • Version controlled with reproducible environments
  • Defined inputs, parameters, and outputs
  • Can depend on other pipelines
  • Validated and tested
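Version control matters because reproducing a run means resolving a pipeline to a concrete version, not "latest". The registry contents and resolution helper below are made up for illustration.

```python
# Sketch of version pinning for reproducibility. The registry and
# its contents are hypothetical, not Flow's pipeline catalog.
REGISTRY = {
    "rnaseq": ["3.10", "3.11", "3.12"],
    "chipseq": ["2.0"],
}

def resolve(pipeline, version=None):
    """Return a concrete version; falls back to the newest when unpinned."""
    versions = REGISTRY[pipeline]
    return version if version in versions else versions[-1]
```

Pinning explicitly (e.g. `resolve("rnaseq", "3.11")`) is what lets a later execution reproduce an earlier one exactly.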

Learn more about Pipelines →

⚙️ Executions

An Execution represents a single run of a pipeline with specific inputs and parameters.

Key characteristics:

  • Tracks complete provenance (inputs, parameters, versions)
  • Real-time status monitoring
  • Generates new data objects
  • Can be retried or resumed
  • Links upstream data to downstream results

Execution lifecycle:

Created → Submitted → Running → [Success/Failed] → Complete
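The lifecycle above amounts to a small state machine: each state permits only certain successors. The state names follow the diagram; the transition table and enforcement logic are an illustrative sketch, not Flow's implementation.

```python
# Legal successor states for each lifecycle state (per the diagram above).
TRANSITIONS = {
    "created":   {"submitted"},
    "submitted": {"running"},
    "running":   {"success", "failed"},
    "success":   {"complete"},
    "failed":    {"complete"},   # failed runs are still finalized
    "complete":  set(),
}

def advance(state, new_state):
    """Move to new_state, rejecting transitions the lifecycle forbids."""
    if new_state not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state

state = "created"
for step in ("submitted", "running", "success", "complete"):
    state = advance(state, step)
```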

Learn more about Executions →

👥 Groups

Groups represent research teams, labs, or organizations. They enable collaborative data management and analysis.

Key characteristics:

  • Users can belong to multiple groups
  • Three permission levels: Member, Admin, Owner
  • Share ownership of projects, samples, and data
  • Integrate with institutional systems
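The three permission levels form an ordered scale: each level includes the privileges of the levels below it. The role names come from the list above; the check itself is an illustrative sketch.

```python
# Member < Admin < Owner: higher roles include lower roles' privileges.
LEVELS = {"member": 1, "admin": 2, "owner": 3}

def can(user_role, required_role):
    """True if user_role grants at least required_role's privileges."""
    return LEVELS[user_role] >= LEVELS[required_role]
```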

Learn more about Groups →

Conceptual Overview Diagram

The following diagram illustrates how all Flow concepts connect and interact:

graph TB
    %% Users and Groups
    U[Users] --> G[Groups]
    U --> P[Projects]
    G --> P

    %% Projects contain Samples
    P --> S[Samples]
    
    %% Samples link to Data
    S --> F[Filesets]
    F --> D1[Raw Data<br/>FASTQ files]
    
    %% Pipelines process data
    PL[Pipelines] --> PV[Pipeline<br/>Versions]
    
    %% Executions connect everything
    S --> E[Executions]
    D1 --> E
    PV --> E
    
    %% Executions produce new data
    E --> PE[Process<br/>Executions]
    PE --> D2[Generated Data<br/>Results, QC, Logs]
    
    %% Data can feed into new executions
    D2 --> E2[New<br/>Executions]
    
    %% Permissions flow
    P -.->|permissions<br/>cascade| S
    S -.->|permissions<br/>cascade| D1
    S -.->|permissions<br/>cascade| D2
    
    %% Styling
    classDef user fill:#e1f5fe,stroke:#01579b
    classDef bio fill:#f3e5f5,stroke:#4a148c
    classDef data fill:#fff3e0,stroke:#e65100
    classDef compute fill:#e8f5e9,stroke:#1b5e20
    classDef output fill:#fce4ec,stroke:#880e4f
    
    class U,G user
    class P,S,F bio
    class D1 data
    class PL,PV,E,PE,E2 compute
    class D2 output

This diagram shows:

  • Blue: User and group entities
  • Purple: Biological organization (projects, samples)
  • Orange: Input data
  • Green: Computational processes
  • Pink: Generated outputs

Relationships and Data Flow

Sample to Data Relationship

Samples are connected to their raw sequencing data through Filesets:

Sample → Fileset → Data Files
         (pairs)    ├── R1.fastq.gz
                    └── R2.fastq.gz
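Grouping paired-end files like this can be sketched by matching the common `_R1`/`_R2` filename convention. The convention is widespread in sequencing data, but the grouping logic below is illustrative, not Flow's implementation.

```python
import re
from collections import defaultdict

def pair_fastqs(filenames):
    """Group FASTQ files into filesets by shared prefix, keyed R1/R2."""
    pairs = defaultdict(dict)
    for name in filenames:
        m = re.match(r"(.+)_R([12])\.fastq\.gz$", name)
        if m:
            pairs[m.group(1)]["R" + m.group(2)] = name
    return dict(pairs)

files = ["E14_Brain_Rep1_R1.fastq.gz", "E14_Brain_Rep1_R2.fastq.gz"]
fileset = pair_fastqs(files)
```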

Pipeline Execution Flow

The typical analysis workflow follows this pattern:

  1. Upload Data → Creates Data objects
  2. Create Sample → Links to Data via Fileset
  3. Run Pipeline → Creates Execution
  4. Execution processes inputs → Generates new Data
  5. New Data available for downstream analysis
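The five steps can be simulated with plain functions to show how outputs of one step become inputs to the next. Every function name here is hypothetical; this mirrors the data flow, not Flow's real API.

```python
# Hypothetical sketch of the upload -> sample -> execution flow.
# None of these functions exist in Flow; they only trace the data flow.

def upload_data(path):
    return {"path": path, "category": "raw"}

def create_sample(name, fileset):
    return {"name": name, "fileset": fileset}

def run_pipeline(pipeline, sample):
    # an execution consumes the sample's fileset and emits new Data
    outputs = [{"path": f"{sample['name']}.counts.tsv", "category": "generated"}]
    return {"pipeline": pipeline, "inputs": sample["fileset"], "outputs": outputs}

raw = [upload_data("P001_R1.fastq.gz"), upload_data("P001_R2.fastq.gz")]
sample = create_sample("P001_liver", raw)
execution = run_pipeline("rnaseq", sample)

# Step 5: the execution's outputs are ordinary Data, ready for reuse.
downstream_inputs = execution["outputs"]
```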

Permission Model

Flow uses a hierarchical permission system:

Project Permissions
    ↓ (inherit)
Sample Permissions
    ↓ (inherit)
Data Permissions

Three levels: Access (view) → Edit → Share
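Inheritance means an object's effective permission is either set on the object itself or taken from the nearest ancestor that has one. The level names follow the text above; the resolution logic is an illustrative sketch.

```python
def effective_permission(explicit, parent_chain):
    """Resolve a permission down the Project -> Sample -> Data hierarchy.

    explicit: level set on the object itself ("access"/"edit"/"share"), or None.
    parent_chain: inherited levels, nearest ancestor first.
    """
    if explicit is not None:
        return explicit            # an explicit setting wins
    for level in parent_chain:     # otherwise the nearest ancestor wins
        if level is not None:
            return level
    return None
```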

Practical Examples

Example 1: RNA-seq Experiment

Project: "Mouse Brain Development Study"
├── Sample: "E14_Brain_Rep1" (embryonic day 14)
│   └── Data: E14_Brain_Rep1_R1.fastq.gz, R2.fastq.gz
├── Sample: "E14_Brain_Rep2"
│   └── Data: E14_Brain_Rep2_R1.fastq.gz, R2.fastq.gz
├── Sample: "P0_Brain_Rep1" (postnatal day 0)
│   └── Data: P0_Brain_Rep1_R1.fastq.gz, R2.fastq.gz
└── Execution: "RNA-seq Pipeline v3.12"
    ├── Input: All samples
    ├── Parameters: genome=mm10, strandedness=reverse
    └── Output: Gene counts, QC reports, BAM files

Example 2: ChIP-seq with Input Controls

Project: "H3K27ac Enhancer Mapping"
├── Sample: "GM12878_H3K27ac" (ChIP)
├── Sample: "GM12878_Input" (Control)
├── Execution: "ChIP-seq Pipeline v2.0"
│   ├── Pairs: H3K27ac with Input control
│   └── Output: Peak calls, coverage tracks
└── Execution: "Differential Binding Analysis"
    └── Compare: Treatment vs Control conditions

Best Practices

  1. Organize by Project: Group related experiments together
  2. Consistent Naming: Use clear, descriptive names for samples
  3. Complete Metadata: Fill in all relevant sample information
  4. Version Control: Note which pipeline versions you use
  5. Document Parameters: Record why you chose specific settings
  6. Share Appropriately: Use groups for team collaboration
