
Core Concepts

Overview

Flow is built around a set of core concepts that work together to provide a comprehensive bioinformatics data management and analysis platform. Understanding these concepts and their relationships is essential for effectively using Flow.

Conceptual Model

At its heart, Flow organizes biological research data through these interconnected concepts:

Projects → Samples → Data → Pipelines → Executions → Results
    ↓         ↓        ↓         ↓           ↓          ↓
  Groups    Metadata  Files   Versions    Status    New Data
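The chain above can be sketched as plain data structures. This is an illustrative model only; the class and field names here are hypothetical and do not reflect Flow's actual schema or API.

```python
from dataclasses import dataclass, field

# Illustrative sketch of Flow's conceptual chain.
# All names are hypothetical, not the platform's real schema.

@dataclass
class Data:
    path: str
    category: str  # e.g. "raw", "annotation", "generated"

@dataclass
class Sample:
    name: str
    sample_type: str                            # e.g. "RNA-seq"
    metadata: dict = field(default_factory=dict)
    data: list = field(default_factory=list)    # raw files, via filesets

@dataclass
class Execution:
    pipeline: str
    version: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)  # results become new Data

@dataclass
class Project:
    name: str
    samples: list = field(default_factory=list)
    executions: list = field(default_factory=list)

# Wiring the chain together: Project -> Sample -> Data
project = Project("Mouse Brain Development Study")
sample = Sample("E14_Brain_Rep1", "RNA-seq", {"organism": "mouse"})
sample.data.append(Data("E14_Brain_Rep1_R1.fastq.gz", "raw"))
project.samples.append(sample)
```

Each concept below expands on one link in this chain.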

Core Concepts

🔬 Projects

A Project is the top-level organizational unit in Flow. It represents a research study or experiment and contains all related samples, data, and analyses.

Key characteristics:

  • Groups related biological samples together
  • Controls privacy settings for all contained data
  • Enables collaboration through user and group permissions
  • Links to publications and external databases

Example: A cancer genomics study comparing tumor and normal tissue samples from 50 patients.

Learn more about Projects →

🧪 Samples

A Sample represents a biological specimen that has been or will be sequenced. Each sample contains comprehensive metadata about its origin, preparation, and sequencing.

Key characteristics:

  • Belongs to a project (or replicate group)
  • Has a specific type (RNA-seq, ChIP-seq, etc.)
  • Contains biological metadata (organism, tissue, condition)
  • Links to raw sequencing data through filesets
  • Can be input to multiple pipeline analyses

Example: A liver tissue sample from patient P001, prepared for RNA-seq analysis.

Learn more about Samples →

📁 Data

Data represents files and directories in Flow. This includes both uploaded raw data and files generated by pipeline executions.

Data categories:

  • Raw Data: Uploaded FASTQ files from sequencers
  • Annotation: Reference files (genomes, GTF files)
  • Generated: Output from pipeline executions
  • Multiplexed/Demultiplexed: Special categories for pooled samples

Key characteristics:

  • Immutable once uploaded
  • Tracked with full provenance
  • Can be private or public
  • Linked to samples or executions that created them
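Two of these characteristics, immutability and provenance, can be illustrated with a frozen record that carries a content checksum and a link to whatever created it. The field names and registration helper are made up for this sketch; they are not Flow's implementation.

```python
import hashlib
from dataclasses import dataclass, FrozenInstanceError

# Sketch of "immutable once uploaded": a frozen record with a
# checksum (computed at upload) and a provenance link.
# Field names are illustrative, not Flow's schema.

@dataclass(frozen=True)
class DataRecord:
    filename: str
    sha256: str       # content checksum, fixed at upload time
    created_by: str   # the sample or execution that produced this file

def register(filename: str, content: bytes, created_by: str) -> DataRecord:
    return DataRecord(filename, hashlib.sha256(content).hexdigest(), created_by)

rec = register("sample_R1.fastq.gz", b"@read1\nACGT\n+\nFFFF\n", "sample:P001")

mutated = True
try:
    rec.filename = "renamed.fastq.gz"   # frozen dataclass rejects mutation
except FrozenInstanceError:
    mutated = False
```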

Learn more about Data →

🔄 Pipelines

A Pipeline is a bioinformatics workflow that processes data. Flow provides curated pipelines from nf-core and custom workflows.

Pipeline types:

  • Primary Analysis: RNA-seq, ChIP-seq, variant calling
  • Secondary Analysis: Differential expression, peak calling
  • Utilities: Demultiplexing, file conversion
  • Genome Preparation: Reference genome indexing

Key characteristics:

  • Version controlled with reproducible environments
  • Defined inputs, parameters, and outputs
  • Can depend on other pipelines
  • Validated and tested
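Version control matters because reproducing a run means resolving a pipeline to a concrete version, not "latest". The registry contents and resolution helper below are made up for illustration.

```python
# Sketch of version pinning for reproducibility. The registry and
# its contents are hypothetical, not Flow's pipeline catalog.
REGISTRY = {
    "rnaseq": ["3.10", "3.11", "3.12"],
    "chipseq": ["2.0"],
}

def resolve(pipeline, version=None):
    """Return a concrete version; falls back to the newest when unpinned."""
    versions = REGISTRY[pipeline]
    return version if version in versions else versions[-1]
```

Pinning explicitly (e.g. `resolve("rnaseq", "3.11")`) is what lets a later execution reproduce an earlier one exactly.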

Learn more about Pipelines →

⚙️ Executions

An Execution represents a single run of a pipeline with specific inputs and parameters.

Key characteristics:

  • Tracks complete provenance (inputs, parameters, versions)
  • Real-time status monitoring
  • Generates new data objects
  • Can be retried or resumed
  • Links upstream data to downstream results

Execution lifecycle:

Created → Submitted → Running → [Success/Failed] → Complete
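The lifecycle above amounts to a small state machine: each state permits only certain successors. The state names follow the diagram; the transition table and enforcement logic are an illustrative sketch, not Flow's implementation.

```python
# Legal successor states for each lifecycle state (per the diagram above).
TRANSITIONS = {
    "created":   {"submitted"},
    "submitted": {"running"},
    "running":   {"success", "failed"},
    "success":   {"complete"},
    "failed":    {"complete"},   # failed runs are still finalized
    "complete":  set(),
}

def advance(state, new_state):
    """Move to new_state, rejecting transitions the lifecycle forbids."""
    if new_state not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state

state = "created"
for step in ("submitted", "running", "success", "complete"):
    state = advance(state, step)
```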

Learn more about Executions →

👥 Groups

Groups represent research teams, labs, or organizations. They enable collaborative data management and analysis.

Key characteristics:

  • Users can belong to multiple groups
  • Three permission levels: Member, Admin, Owner
  • Share ownership of projects, samples, and data
  • Integrate with institutional systems
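The three permission levels form an ordered scale: each level includes the privileges of the levels below it. The role names come from the list above; the check itself is an illustrative sketch.

```python
# Member < Admin < Owner: higher roles include lower roles' privileges.
LEVELS = {"member": 1, "admin": 2, "owner": 3}

def can(user_role, required_role):
    """True if user_role grants at least required_role's privileges."""
    return LEVELS[user_role] >= LEVELS[required_role]
```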

Learn more about Groups →

Conceptual Overview Diagram

The following diagram illustrates how all Flow concepts connect and interact:

graph TB
    %% Users and Groups
    U[Users] --> G[Groups]
    U --> P[Projects]
    G --> P

    %% Projects contain Samples
    P --> S[Samples]
    
    %% Samples link to Data
    S --> F[Filesets]
    F --> D1[Raw Data<br/>FASTQ files]
    
    %% Pipelines process data
    PL[Pipelines] --> PV[Pipeline<br/>Versions]
    
    %% Executions connect everything
    S --> E[Executions]
    D1 --> E
    PV --> E
    
    %% Executions produce new data
    E --> PE[Process<br/>Executions]
    PE --> D2[Generated Data<br/>Results, QC, Logs]
    
    %% Data can feed into new executions
    D2 --> E2[New<br/>Executions]
    
    %% Permissions flow
    P -.->|permissions<br/>cascade| S
    S -.->|permissions<br/>cascade| D1
    S -.->|permissions<br/>cascade| D2
    
    %% Styling
    classDef user fill:#e1f5fe,stroke:#01579b
    classDef bio fill:#f3e5f5,stroke:#4a148c
    classDef data fill:#fff3e0,stroke:#e65100
    classDef compute fill:#e8f5e9,stroke:#1b5e20
    classDef output fill:#fce4ec,stroke:#880e4f
    
    class U,G user
    class P,S,F bio
    class D1 data
    class PL,PV,E,PE,E2 compute
    class D2 output

This diagram shows:

  • Blue: User and group entities
  • Purple: Biological organization (projects, samples)
  • Orange: Input data
  • Green: Computational processes
  • Pink: Generated outputs

Relationships and Data Flow

Sample to Data Relationship

Samples are connected to their raw sequencing data through Filesets:

Sample → Fileset → Data Files
         (pairs)    ├── R1.fastq.gz
                    └── R2.fastq.gz
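Grouping paired-end files like this can be sketched by matching the common `_R1`/`_R2` filename convention. The convention is widespread in sequencing data, but the grouping logic below is illustrative, not Flow's implementation.

```python
import re
from collections import defaultdict

def pair_fastqs(filenames):
    """Group FASTQ files into filesets by shared prefix, keyed R1/R2."""
    pairs = defaultdict(dict)
    for name in filenames:
        m = re.match(r"(.+)_R([12])\.fastq\.gz$", name)
        if m:
            pairs[m.group(1)]["R" + m.group(2)] = name
    return dict(pairs)

files = ["E14_Brain_Rep1_R1.fastq.gz", "E14_Brain_Rep1_R2.fastq.gz"]
fileset = pair_fastqs(files)
```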

Pipeline Execution Flow

The typical analysis workflow follows this pattern:

  1. Upload Data → Creates Data objects
  2. Create Sample → Links to Data via Fileset
  3. Run Pipeline → Creates Execution
  4. Execution processes inputs → Generates new Data
  5. New Data available for downstream analysis
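The five steps can be simulated with plain functions to show how outputs of one step become inputs to the next. Every function name here is hypothetical; this mirrors the data flow, not Flow's real API.

```python
# Hypothetical sketch of the upload -> sample -> execution flow.
# None of these functions exist in Flow; they only trace the data flow.

def upload_data(path):
    return {"path": path, "category": "raw"}

def create_sample(name, fileset):
    return {"name": name, "fileset": fileset}

def run_pipeline(pipeline, sample):
    # an execution consumes the sample's fileset and emits new Data
    outputs = [{"path": f"{sample['name']}.counts.tsv", "category": "generated"}]
    return {"pipeline": pipeline, "inputs": sample["fileset"], "outputs": outputs}

raw = [upload_data("P001_R1.fastq.gz"), upload_data("P001_R2.fastq.gz")]
sample = create_sample("P001_liver", raw)
execution = run_pipeline("rnaseq", sample)

# Step 5: the execution's outputs are ordinary Data, ready for reuse.
downstream_inputs = execution["outputs"]
```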

Permission Model

Flow uses a hierarchical permission system:

Project Permissions
    ↓ (inherit)
Sample Permissions
    ↓ (inherit)
Data Permissions

Three levels: Access (view) → Edit → Share
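Inheritance means an object's effective permission is either set on the object itself or taken from the nearest ancestor that has one. The level names follow the text above; the resolution logic is an illustrative sketch.

```python
def effective_permission(explicit, parent_chain):
    """Resolve a permission down the Project -> Sample -> Data hierarchy.

    explicit: level set on the object itself ("access"/"edit"/"share"), or None.
    parent_chain: inherited levels, nearest ancestor first.
    """
    if explicit is not None:
        return explicit            # an explicit setting wins
    for level in parent_chain:     # otherwise the nearest ancestor wins
        if level is not None:
            return level
    return None
```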

Practical Examples

Example 1: RNA-seq Experiment

Project: "Mouse Brain Development Study"
├── Sample: "E14_Brain_Rep1" (embryonic day 14)
│   └── Data: E14_Brain_Rep1_R1.fastq.gz, R2.fastq.gz
├── Sample: "E14_Brain_Rep2"
│   └── Data: E14_Brain_Rep2_R1.fastq.gz, R2.fastq.gz
├── Sample: "P0_Brain_Rep1" (postnatal day 0)
│   └── Data: P0_Brain_Rep1_R1.fastq.gz, R2.fastq.gz
└── Execution: "RNA-seq Pipeline v3.12"
    ├── Input: All samples
    ├── Parameters: genome=mm10, strandedness=reverse
    └── Output: Gene counts, QC reports, BAM files

Example 2: ChIP-seq with Input Controls

Project: "H3K27ac Enhancer Mapping"
├── Sample: "GM12878_H3K27ac" (ChIP)
├── Sample: "GM12878_Input" (Control)
├── Execution: "ChIP-seq Pipeline v2.0"
│   ├── Pairs: H3K27ac with Input control
│   └── Output: Peak calls, coverage tracks
└── Execution: "Differential Binding Analysis"
    └── Compare: Treatment vs Control conditions

Best Practices

  1. Organize by Project: Group related experiments together
  2. Consistent Naming: Use clear, descriptive names for samples
  3. Complete Metadata: Fill in all relevant sample information
  4. Version Control: Note which pipeline versions you use
  5. Document Parameters: Record why you chose specific settings
  6. Share Appropriately: Use groups for team collaboration
