Core Flow Concepts
Overview
Flow is built around a set of core concepts that work together to provide a comprehensive bioinformatics data management and analysis platform. Understanding these concepts and their relationships is essential for effectively using Flow.
Conceptual Model
At its heart, Flow organizes biological research data through these interconnected concepts:
Projects → Samples → Data  → Pipelines → Executions → Results
   ↓          ↓        ↓         ↓            ↓           ↓
Groups    Metadata   Files   Versions      Status     New Data
Core Concepts
🔬 Projects
A Project is the top-level organizational unit in Flow. It represents a research study or experiment and contains all related samples, data, and analyses.
Key characteristics:
- Groups related biological samples together
- Controls privacy settings for all contained data
- Enables collaboration through user and group permissions
- Links to publications and external databases
Example: A cancer genomics study comparing tumor and normal tissue samples from 50 patients.
🧪 Samples
A Sample represents a biological specimen that has been or will be sequenced. Each sample contains comprehensive metadata about its origin, preparation, and sequencing.
Key characteristics:
- Belongs to a project (or replicate group)
- Has a specific type (RNA-seq, ChIP-seq, etc.)
- Contains biological metadata (organism, tissue, condition)
- Links to raw sequencing data through filesets
- Can be input to multiple pipeline analyses
Example: A liver tissue sample from patient P001, prepared for RNA-seq analysis.
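The project–sample relationship above can be sketched with plain Python dataclasses. This is an illustrative model only, not Flow's actual schema; the field names are assumptions chosen to mirror the characteristics listed above.

```python
from dataclasses import dataclass, field

# Illustrative models only -- not Flow's actual API or schema.
@dataclass
class Sample:
    name: str
    sample_type: str          # e.g. "RNA-seq", "ChIP-seq"
    organism: str
    tissue: str
    condition: str = ""

@dataclass
class Project:
    name: str
    private: bool = True      # privacy setting applies to all contained data
    samples: list = field(default_factory=list)

project = Project(name="Mouse Brain Development Study")
project.samples.append(
    Sample(name="E14_Brain_Rep1", sample_type="RNA-seq",
           organism="Mus musculus", tissue="brain", condition="E14")
)
```

The key point the sketch captures is containment: samples live inside exactly one project, and the project's privacy setting governs everything it contains.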
📁 Data
Data represents files and directories in Flow. This includes both uploaded raw data and files generated by pipeline executions.
Data categories:
- Raw Data: Uploaded FASTQ files from sequencers
- Annotation: Reference files (genomes, GTF files)
- Generated: Output from pipeline executions
- Multiplexed/Demultiplexed: Special categories for pooled samples
Key characteristics:
- Immutable once uploaded
- Tracked with full provenance
- Can be private or public
- Linked to samples or executions that created them
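The data categories and the "immutable once uploaded" rule can be sketched as follows; the class and field names are illustrative assumptions, with a frozen dataclass standing in for immutability.

```python
from enum import Enum
from dataclasses import dataclass

# Illustrative sketch of the categories listed above; not Flow's schema.
class DataCategory(Enum):
    RAW = "raw"                   # uploaded FASTQ files from sequencers
    ANNOTATION = "annotation"     # reference genomes, GTF files
    GENERATED = "generated"       # output from pipeline executions
    MULTIPLEXED = "multiplexed"
    DEMULTIPLEXED = "demultiplexed"

@dataclass(frozen=True)           # frozen mirrors "immutable once uploaded"
class DataObject:
    filename: str
    category: DataCategory
    created_by: str               # provenance: the sample or execution it came from

d = DataObject("P001_R1.fastq.gz", DataCategory.RAW, created_by="sample:P001")
```

Because the dataclass is frozen, any attempt to reassign a field after creation raises an error, which is the behavior the immutability guarantee describes.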
🔄 Pipelines
A Pipeline is a bioinformatics workflow that processes data. Flow provides curated pipelines drawn from nf-core, as well as custom workflows.
Pipeline types:
- Primary Analysis: RNA-seq, ChIP-seq, variant calling
- Secondary Analysis: Differential expression, peak calling
- Utilities: Demultiplexing, file conversion
- Genome Preparation: Reference genome indexing
Key characteristics:
- Version controlled with reproducible environments
- Defined inputs, parameters, and outputs
- Can depend on other pipelines
- Validated and tested
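A versioned pipeline with declared inputs, outputs, and dependencies can be sketched like this. The record shape and field names are assumptions for illustration; the example values echo the RNA-seq example later in this page.

```python
from dataclasses import dataclass

# Illustrative record of a versioned pipeline; not Flow's actual data model.
@dataclass(frozen=True)
class PipelineVersion:
    pipeline: str
    version: str
    inputs: tuple               # declared input types
    outputs: tuple              # declared output types
    depends_on: tuple = ()      # e.g. a genome-preparation pipeline

rnaseq = PipelineVersion(
    pipeline="RNA-seq",
    version="3.12",
    inputs=("fastq", "genome_index"),
    outputs=("counts", "bam", "qc"),
    depends_on=("genome-preparation",),
)
```

Freezing the record reflects the version-control characteristic: a given pipeline version's interface is fixed, so any execution that references it is reproducible.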
⚙️ Executions
An Execution represents a single run of a pipeline with specific inputs and parameters.
Key characteristics:
- Tracks complete provenance (inputs, parameters, versions)
- Real-time status monitoring
- Generates new data objects
- Can be retried or resumed
- Links upstream data to downstream results
Execution lifecycle:
Created → Submitted → Running → [Success/Failed] → Complete
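The lifecycle above can be expressed as a small transition table. The state names follow the diagram; the enforcement logic is an illustrative sketch, not Flow's actual execution engine.

```python
# Valid state transitions, taken from the lifecycle diagram above.
TRANSITIONS = {
    "Created":   {"Submitted"},
    "Submitted": {"Running"},
    "Running":   {"Success", "Failed"},
    "Success":   {"Complete"},
    "Failed":    {"Complete"},   # failed runs can still be finalized (and retried)
    "Complete":  set(),
}

def advance(state: str, new_state: str) -> str:
    """Move an execution to new_state, refusing invalid jumps."""
    if new_state not in TRANSITIONS[state]:
        raise ValueError(f"cannot go from {state} to {new_state}")
    return new_state

state = "Created"
for step in ("Submitted", "Running", "Success", "Complete"):
    state = advance(state, step)
```

Modeling the lifecycle as a table makes the monitoring guarantee concrete: an execution's status is always one of these states, and it can only move along the arrows shown.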
👥 Groups
Groups represent research teams, labs, or organizations. They enable collaborative data management and analysis.
Key characteristics:
- Users can belong to multiple groups
- Three permission levels: Member, Admin, Owner
- Share ownership of projects, samples, and data
- Integrate with institutional systems
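The three permission levels form an ordered hierarchy, which a short sketch can make explicit (the ordering Member < Admin < Owner is taken from the list above; the function is illustrative).

```python
# Ordered from least to most privileged, per the list above.
LEVELS = ["Member", "Admin", "Owner"]

def at_least(user_level: str, required: str) -> bool:
    """True if user_level grants everything `required` does."""
    return LEVELS.index(user_level) >= LEVELS.index(required)
```

So an Owner can do anything an Admin or Member can, but not vice versa.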
Conceptual Overview Diagram
The following diagram illustrates how all Flow concepts connect and interact:
```mermaid
graph TB
    %% Users and Groups
    U[Users] --> G[Groups]
    U --> P[Projects]
    G --> P

    %% Projects contain Samples
    P --> S[Samples]

    %% Samples link to Data
    S --> F[Filesets]
    F --> D1[Raw Data<br/>FASTQ files]

    %% Pipelines process data
    PL[Pipelines] --> PV[Pipeline<br/>Versions]

    %% Executions connect everything
    S --> E[Executions]
    D1 --> E
    PV --> E

    %% Executions produce new data
    E --> PE[Process<br/>Executions]
    PE --> D2[Generated Data<br/>Results, QC, Logs]

    %% Data can feed into new executions
    D2 --> E2[New<br/>Executions]

    %% Permissions flow
    P -.->|permissions<br/>cascade| S
    S -.->|permissions<br/>cascade| D1
    S -.->|permissions<br/>cascade| D2

    %% Styling
    classDef user fill:#e1f5fe,stroke:#01579b
    classDef bio fill:#f3e5f5,stroke:#4a148c
    classDef data fill:#fff3e0,stroke:#e65100
    classDef compute fill:#e8f5e9,stroke:#1b5e20
    classDef output fill:#fce4ec,stroke:#880e4f

    class U,G user
    class P,S,F bio
    class D1 data
    class PL,PV,E,PE,E2 compute
    class D2 output
```
This diagram shows:
- Blue: User and group entities
- Purple: Biological organization (projects, samples)
- Orange: Input data
- Green: Computational processes
- Pink: Generated outputs
Relationships and Data Flow
Sample to Data Relationship
Samples are connected to their raw sequencing data through Filesets:
Sample → Fileset → Data Files
         (pairs)   ├── R1.fastq.gz
                   └── R2.fastq.gz
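Grouping paired-end files into filesets can be sketched with a small pairing function. The `<prefix>_R1/_R2.fastq.gz` naming convention shown above is assumed; real uploads may use other conventions.

```python
import re

# Hedged sketch: group R1/R2 FASTQ files into filesets by shared prefix.
# The naming convention (<prefix>_R1/_R2.fastq.gz) is an assumption.
def pair_fastqs(filenames):
    pairs = {}
    for name in filenames:
        m = re.match(r"(.+)_R([12])\.fastq\.gz$", name)
        if m:
            pairs.setdefault(m.group(1), {})[f"R{m.group(2)}"] = name
    return pairs

files = ["E14_Brain_Rep1_R1.fastq.gz", "E14_Brain_Rep1_R2.fastq.gz"]
filesets = pair_fastqs(files)
```

Each key in the result corresponds to one fileset, holding both reads of a pair, which mirrors how a sample's fileset bundles its raw data files.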
Pipeline Execution Flow
The typical analysis workflow follows this pattern:
1. Upload Data → Creates Data objects
2. Create Sample → Links to Data via Fileset
3. Run Pipeline → Creates Execution
4. Execution processes inputs → Generates new Data
5. New Data available for downstream analysis
Permission Model
Flow uses a hierarchical permission system:
Project Permissions
↓ (inherit)
Sample Permissions
↓ (inherit)
Data Permissions
Three levels: Access (view) → Edit → Share
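The inheritance cascade can be sketched as a resolution function: an object's effective permission is the stronger of its own explicit grant and what it inherits from its parent. The level ordering follows the text above; the function itself is illustrative.

```python
# Level ordering from the text above: Access (view) < Edit < Share.
ORDER = {"none": 0, "access": 1, "edit": 2, "share": 3}

def effective_permission(explicit: str, inherited: str) -> str:
    """Return the stronger of an object's own grant and its parent's grant."""
    return explicit if ORDER[explicit] >= ORDER[inherited] else inherited
```

For example, a user with "edit" on a project gets at least "edit" on its samples and data, even when no explicit grant exists at those lower levels.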
Practical Examples
Example 1: RNA-seq Experiment
Project: "Mouse Brain Development Study"
├── Sample: "E14_Brain_Rep1" (embryonic day 14)
│ └── Data: E14_Brain_Rep1_R1.fastq.gz, R2.fastq.gz
├── Sample: "E14_Brain_Rep2"
│ └── Data: E14_Brain_Rep2_R1.fastq.gz, R2.fastq.gz
├── Sample: "P0_Brain_Rep1" (postnatal day 0)
│ └── Data: P0_Brain_Rep1_R1.fastq.gz, R2.fastq.gz
└── Execution: "RNA-seq Pipeline v3.12"
├── Input: All samples
├── Parameters: genome=mm10, strandedness=reverse
└── Output: Gene counts, QC reports, BAM files
Example 2: ChIP-seq with Input Controls
Project: "H3K27ac Enhancer Mapping"
├── Sample: "GM12878_H3K27ac" (ChIP)
├── Sample: "GM12878_Input" (Control)
├── Execution: "ChIP-seq Pipeline v2.0"
│ ├── Pairs: H3K27ac with Input control
│ └── Output: Peak calls, coverage tracks
└── Execution: "Differential Binding Analysis"
└── Compare: Treatment vs Control conditions
Best Practices
- Organize by Project: Group related experiments together
- Consistent Naming: Use clear, descriptive names for samples
- Complete Metadata: Fill in all relevant sample information
- Version Control: Note which pipeline versions you use
- Document Parameters: Record why you chose specific settings
- Share Appropriately: Use groups for team collaboration
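The "Consistent Naming" practice can be enforced mechanically. The `<condition>_<tissue>_Rep<n>` convention below is one hypothetical scheme (matching the sample names in Example 1), not a Flow requirement.

```python
import re

# Hedged example of a naming check; the convention itself is an assumption.
NAME_RE = re.compile(r"^[A-Za-z0-9]+_[A-Za-z0-9]+_Rep\d+$")

def valid_sample_name(name: str) -> bool:
    """Accept names like E14_Brain_Rep1; reject spaces and stray punctuation."""
    return bool(NAME_RE.match(name))
```

Running a check like this before upload catches inconsistent names early, when they are still cheap to fix.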