Advanced Topics
Architecture guide
Flow is built as a comprehensive bioinformatics data analysis platform using a multi-layered architecture. This guide explains how the different components work together to provide a seamless experience for analyzing biological data.
5-Layer Architecture
Flow operates on five distinct layers, each adding functionality and abstraction over the layer below it. Users can interact with Flow at any layer depending on their needs and technical expertise.
Layer 1: Bioinformatics Tools
The foundation of Flow consists of established command-line bioinformatics tools that perform the actual data analysis:
- STAR and Bowtie for sequence alignment
- MultiQC for quality control reports
- Ultraplex for demultiplexing
- Various specialized tools for RNA-seq, ChIP-seq, CLIP-seq analysis
These tools are typically resource-intensive and are designed to run on high-performance computing (HPC) clusters or cloud infrastructure.
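At this layer, everything is ultimately a command-line invocation. As a rough illustration (not how Flow itself calls these tools, which happens through Nextflow in the next layer), running MultiQC directly from Python might look like this; the input and output paths are placeholders:

import subprocess

# Run MultiQC over a directory of QC outputs; paths are placeholders.
result = subprocess.run(
    ["multiqc", "results/qc", "--outdir", "results/multiqc"],
    capture_output=True,
    text=True,
)

if result.returncode != 0:
    # Tools at this layer signal failure via exit codes and stderr,
    # which the layers above must capture and surface to users.
    print(result.stderr)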
Layer 2: Nextflow Pipelines
Bioinformatics tools are orchestrated using Nextflow, a workflow language designed for scalable and reproducible pipelines:
process GUNZIP {
    label 'process_low'
    container 'biocontainers/biocontainers:v1.2.0_cv1'

    input:
    path(archive)

    output:
    path("$gunzip"), emit: gunzip

    script:
    gunzip = archive.toString() - '.gz'
    """
    gunzip -f $archive
    """
}
Key features:
- Modules: Reusable components wrapping individual tools
- Workflows: Chain modules together with data channels
- Executors: Run on local machines, HPC clusters, or cloud platforms
- Containers: Docker/Singularity support for reproducibility
Layer 3: nextflow.py
A Python library that provides programmatic access to Nextflow pipelines:
import nextflow

# Run a pipeline from Python
execution = nextflow.run(
    pipeline="nf-core/rnaseq",
    params={"input": "samples.csv", "genome": "GRCh38"},
    profile=["docker"]
)

# Access execution details
print(f"Started: {execution.started}")
print(f"Duration: {execution.duration}")
This layer enables Flow to integrate Nextflow pipelines into a larger application framework.
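The same library can also monitor a run while it is in progress, which is how higher layers track execution state. A minimal sketch is below; the run_and_poll function, its sleep argument, and the process_executions attribute are assumptions about nextflow.py's interface rather than something this guide specifies:

import nextflow

# Launch the pipeline and poll it while it runs; each iteration yields a
# fresh snapshot of the execution state (interface names assumed).
for execution in nextflow.run_and_poll(
    pipeline="nf-core/rnaseq",
    params={"input": "samples.csv", "genome": "GRCh38"},
    profile=["docker"],
    sleep=10,
):
    for proc in execution.process_executions:
        # e.g. "FASTQC (sample1): COMPLETED"
        print(f"{proc.name}: {proc.status}")

print(f"Final status: {execution.status}")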
Layer 4: Django API
The core application layer built with Django, providing:
- REST API for comprehensive platform access
- Dedicated endpoints for file uploads, downloads, and health checks
- Database models for users, pipelines, executions, and data
- Celery workers for asynchronous pipeline execution (see the sketch after this list)
- Permission system with multi-tenant support
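The Celery workers mentioned above are the bridge between the web layer and Layer 3: the API records a submission and enqueues a task, and a worker launches the pipeline and writes the results back. A simplified, hypothetical sketch of such a task follows; the Execution model and field names are illustrative, not Flow's actual schema:

import nextflow
from celery import shared_task

@shared_task
def run_pipeline(execution_id):
    """Hypothetical worker task: launch a pipeline run recorded by the API."""
    # Imagined Django model storing the user's submission; not Flow's actual schema.
    from analysis.models import Execution

    record = Execution.objects.get(id=execution_id)
    result = nextflow.run(
        pipeline=record.pipeline_path,
        params=record.params,
        profile=["docker"],
    )

    # Persist the outcome so the API and frontend can report it.
    record.status = result.status
    record.duration = result.duration
    record.save()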
Layer 5: Web Clients
User-facing interfaces that consume the API:
- React Frontend: Modern single-page application at app.flow.bio
- flowbio Python Client: Programmatic access for automation and scripting
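As a quick illustration of the flowbio client, a login-and-upload script might look like the following; the Client, login, and upload_sample names are assumptions about the client's interface and should be checked against the flowbio documentation:

import flowbio

# Authenticate and upload a sequencing file (interface names assumed).
client = flowbio.Client()
client.login("username", "password")

sample = client.upload_sample("reads_R1.fastq.gz")
print(sample)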
Container Architecture
Flow runs as a set of 10 Docker containers working together in a single network:
Core Services
db - PostgreSQL database
- Stores all application data and metadata
- Persisted via Docker volumes
rabbit - RabbitMQ message broker
- Manages communication between API and workers
- Handles task queuing for pipeline execution
api - Django application server
- Serves REST API endpoints
- Handles authentication and authorization
- Manages file uploads and pipeline submissions
Worker Services
celery1 - Execution workers
- Runs Nextflow pipelines
- Monitors execution progress
- Updates database with results
celery2 - Background task workers
- Handles file operations
- Generates bulk downloads
- Performs maintenance tasks
beat2 - Task scheduler
- Triggers periodic tasks
- Manages cleanup operations
Web Services
front - React frontend server
- Serves the single-page application
- No direct container dependencies
files - Static file server
- Serves uploaded data and results
- Validates access permissions via API
Support Services
db_backups - Database backup service
- Creates periodic SQL dumps
- Ensures data persistence
local_backup - File backup service (optional)
- Syncs uploaded and generated files
- Provides additional redundancy
Data Flow
Understanding how data moves through Flow helps explain the architecture:
1. Upload Phase
   User → Frontend → API → Storage Volume
                      ↓
              PostgreSQL (metadata)
2. Execution Phase
   API → RabbitMQ → Celery Worker
                          ↓
                      Nextflow
                          ↓
                   HPC/Cloud Compute
3. Results Phase
   Execution Output → Storage Volume
            ↓
   Celery Worker → PostgreSQL
            ↓
   API → Frontend → User
API Design
REST API
The REST API provides comprehensive access to all platform features with endpoints organized by resource type. See the REST API Reference for complete documentation.
REST Endpoints
Specialized endpoints for file operations and service health:
/api/upload/ - Chunked file uploads
/api/download/ - Secure file downloads
/health/ - Service health checks
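As a hedged sketch of how a script might use these endpoints with the requests library (the base URL, chunk field names, and authorization header format are assumptions, not a documented contract):

import requests

API = "https://api.flow.bio"  # placeholder base URL
HEADERS = {"Authorization": "Bearer <access-token>"}

# Service health check
print(requests.get(f"{API}/health/", timeout=5).status_code)

# Chunked upload: send the file in fixed-size pieces (field names assumed).
CHUNK_SIZE = 5 * 1024 * 1024
with open("reads_R1.fastq.gz", "rb") as handle:
    index = 0
    while chunk := handle.read(CHUNK_SIZE):
        requests.post(
            f"{API}/api/upload/",
            headers=HEADERS,
            data={"filename": "reads_R1.fastq.gz", "chunk": index},
            files={"blob": chunk},
        )
        index += 1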
Authentication
JWT-based authentication flow:
1. User logs in with credentials
2. API returns access and refresh tokens
3. Frontend includes the access token in the Authorization header
4. Tokens are refreshed automatically before expiration
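In code, the same flow looks roughly like this; the token endpoint paths and response field names are assumptions for illustration only:

import requests

API = "https://api.flow.bio"  # placeholder base URL

# 1-2. Log in with credentials; receive access and refresh tokens
#      (endpoint path and field names assumed).
tokens = requests.post(
    f"{API}/auth/login/",
    json={"username": "alice", "password": "secret"},
).json()

# 3. Send the access token in the Authorization header on each request.
profile = requests.get(
    f"{API}/api/me/",
    headers={"Authorization": f"Bearer {tokens['access']}"},
)

# 4. Before the access token expires, trade the refresh token for a new one.
tokens = requests.post(
    f"{API}/auth/refresh/",
    json={"refresh": tokens["refresh"]},
).json()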
Infrastructure
Storage Volumes
Shared volumes enable data persistence and inter-container communication:
- uploads: User-uploaded files (FASTQ, etc.)
- executions: Pipeline execution outputs
- pipelines: Git repositories of pipeline code
- configs: Pipeline configuration files
- media: User avatars and images
- zips: Bulk download archives
Deployment Options
Cloud Platforms
- AWS with managed services (RDS, ECS, S3)
- Google Cloud Platform (Cloud SQL, GKE, Cloud Storage)
- Azure (Database for PostgreSQL, AKS, Blob Storage)
On-Premise Deployments
- Kubernetes clusters for scalability
- Integration with existing infrastructure
- Support for air-gapped environments
HPC Integration
- Slurm executor for job submission
- NFS/SMB mounts for shared storage
- Custom runners for institute-specific setups
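Because pipeline launches go through nextflow.py (Layer 3), targeting an HPC scheduler is largely a matter of selecting the right profile or executor configuration. A hedged sketch, reusing the run call from the Layer 3 example and assuming the pipeline defines a slurm profile:

import nextflow

# Launch on a Slurm cluster instead of local Docker; this assumes the
# pipeline ships a "slurm" profile (or an equivalent custom config) that
# sets process.executor = 'slurm' and points work directories at shared
# NFS/SMB storage.
execution = nextflow.run(
    pipeline="nf-core/rnaseq",
    params={"input": "samples.csv", "genome": "GRCh38"},
    profile=["slurm", "singularity"],
)
print(f"Status: {execution.status}")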
Security Considerations
Network Security
- Services communicate over private networks
- Public endpoints protected by WAF
- TLS termination at load balancer
Data Security
- File access validated through API permissions
- Database connections use SSL
- Sensitive configuration in environment variables
Compute Security
- Pipeline execution in isolated containers
- Resource limits enforced
- User data segregation
Monitoring and Observability
Flow includes built-in monitoring capabilities:
- Health endpoints for each service
- Prometheus metrics for performance tracking
- Grafana dashboards for visualization
- Execution logs preserved for debugging
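As a concrete example of using the health endpoints above, a small watchdog script could poll each service and report failures; the service URLs below are placeholders:

import requests

# Placeholder health-check URLs for the public-facing services.
SERVICES = {
    "api": "https://api.flow.bio/health/",
    "files": "https://files.flow.bio/health/",
}

for name, url in SERVICES.items():
    try:
        healthy = requests.get(url, timeout=5).status_code == 200
    except requests.RequestException:
        healthy = False
    print(f"{name}: {'up' if healthy else 'DOWN'}")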
The architecture is designed to be observable, making it easy to diagnose issues and optimize performance.