

Architecture guide

Flow is built as a comprehensive bioinformatics data analysis platform using a multi-layered architecture. This guide explains how the different components work together to provide a seamless experience for analyzing biological data.


5-Layer Architecture

Flow operates on five distinct layers, each adding functionality and abstraction over the layer below it. Users can interact with Flow at any layer depending on their needs and technical expertise.

Layer 1: Bioinformatics Tools

The foundation of Flow consists of established command-line bioinformatics tools that perform the actual data analysis:

  • STAR and Bowtie for sequence alignment
  • MultiQC for quality control reports
  • Ultraplex for demultiplexing
  • Various specialized tools for RNA-seq, ChIP-seq, and CLIP-seq analysis

These tools typically have intensive resource requirements and are designed to run on high-performance computing (HPC) clusters or cloud infrastructure.
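
At this layer, an analysis step is simply a command-line invocation. As a minimal sketch, a Python wrapper might call STAR like this (the index path, read files, and thread count are illustrative assumptions):

import subprocess

# Align paired-end reads with STAR; all paths are placeholders.
subprocess.run(
    [
        "STAR",
        "--runThreadN", "8",                    # CPU threads
        "--genomeDir", "star_index/",           # pre-built genome index
        "--readFilesIn", "reads_1.fastq.gz", "reads_2.fastq.gz",
        "--readFilesCommand", "zcat",           # decompress gzipped FASTQ on the fly
        "--outSAMtype", "BAM", "SortedByCoordinate",
    ],
    check=True,                                 # raise if STAR exits non-zero
)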

Layer 2: Nextflow Pipelines

Bioinformatics tools are orchestrated using Nextflow, a workflow language designed for scalable and reproducible pipelines:

process GUNZIP {
    label 'process_low'
    container 'biocontainers/biocontainers:v1.2.0_cv1'

    input:
    path(archive)

    output:
    path("$gunzip"), emit: gunzip

    script:
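    // derive the output filename by stripping the .gz suffix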
    gunzip = archive.toString() - '.gz'
    """
    gunzip -f $archive
    """
}

Key features:

  • Modules: Reusable components wrapping individual tools
  • Workflows: Chain modules together with data channels
  • Executors: Run on local machines, HPC clusters, or cloud platforms
  • Containers: Docker/Singularity support for reproducibility

Layer 3: nextflow.py

A Python library that provides programmatic access to Nextflow pipelines:

import nextflow

# Run a pipeline from Python
execution = nextflow.run(
    pipeline_path="nf-core/rnaseq",
    params={"input": "samples.csv", "genome": "GRCh38"},
    profiles=["docker"]
)

# Access execution details
print(f"Started: {execution.started}")
print(f"Duration: {execution.duration}")

This layer enables Flow to integrate Nextflow pipelines into a larger application framework.
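
For long-running pipelines, the library can also poll an execution while it runs. A minimal sketch using run_and_poll (exact keyword arguments may vary between nextflow.py versions):

import nextflow

# Re-run the example above, but yield a fresh Execution snapshot
# every few seconds so progress can be reported as it happens.
for execution in nextflow.run_and_poll(
    pipeline_path="nf-core/rnaseq",
    params={"input": "samples.csv", "genome": "GRCh38"},
    sleep=5
):
    print(execution.status)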

Layer 4: Django API

The core application layer built with Django, providing:

  • REST API providing comprehensive access to platform resources
  • Specialized endpoints for file operations and frontend optimization
  • Database models for users, pipelines, executions, and data
  • Celery workers for asynchronous pipeline execution
  • Permission system with multi-tenant support
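
The hand-off from the API to a worker can be pictured with a small Celery sketch; the app name, broker URL, and task body are illustrative assumptions, not Flow's actual code:

from celery import Celery

import nextflow

# Task queue backed by the rabbit container (see Container Architecture).
app = Celery("flow", broker="amqp://rabbit")

@app.task
def run_pipeline(pipeline_path, params):
    """Launch a Nextflow pipeline on a worker and return its final status."""
    execution = nextflow.run(pipeline_path=pipeline_path, params=params)
    return execution.status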

Layer 5: Web Clients

User-facing interfaces that consume the API:

  • React Frontend: Modern single-page application at app.flow.bio
  • flowbio Python Client: Programmatic access for automation and scripting
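
As a minimal sketch of the scripting route, a program might authenticate and upload data programmatically; the method names and arguments here are assumptions about the flowbio client rather than verified signatures:

import flowbio

# Authenticate against the platform, then drive the same API the frontend uses.
client = flowbio.Client()
client.login("alice", "secret")

# Hypothetical helper: upload paired-end reads as a new sample.
sample = client.upload_sample(
    "My sample",
    "reads_1.fastq.gz",
    "reads_2.fastq.gz",
)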

Container Architecture

Flow runs as a set of 10 Docker containers working together in a single network:

Core Services

db - PostgreSQL database

  • Stores all application data and metadata
  • Persisted via Docker volumes

rabbit - RabbitMQ message broker

  • Manages communication between API and workers
  • Handles task queuing for pipeline execution

api - Django application server

  • Serves REST API endpoints
  • Handles authentication and authorization
  • Manages file uploads and pipeline submissions

Worker Services

celery1 - Execution workers

  • Runs Nextflow pipelines
  • Monitors execution progress
  • Updates database with results

celery2 - Background task workers

  • Handles file operations
  • Generates bulk downloads
  • Performs maintenance tasks

beat2 - Task scheduler

  • Triggers periodic tasks
  • Manages cleanup operations

Web Services

front - React frontend server

  • Serves the single-page application
  • No direct container dependencies

files - Static file server

  • Serves uploaded data and results
  • Validates access permissions via API

Support Services

db_backups - Database backup service

  • Creates periodic SQL dumps
  • Protects against data loss

local_backup - File backup service (optional)

  • Syncs uploaded and generated files
  • Provides additional redundancy

Data Flow

Understanding how data moves through Flow helps explain the architecture:

1. Upload Phase

User → Frontend → API → Storage Volume
                  API → PostgreSQL (metadata)

The frontend streams files to the API, which writes them to the shared storage volume and records their metadata in PostgreSQL.

2. Execution Phase

API → RabbitMQ → Celery Worker → Nextflow → HPC/Cloud Compute

The API queues an execution task in RabbitMQ; a Celery worker picks it up and launches the Nextflow pipeline on the configured compute backend.

3. Results Phase

Execution Output → Storage Volume
Celery Worker → PostgreSQL (results metadata)
API → Frontend → User

Pipeline outputs land on the storage volume, the worker registers them in PostgreSQL, and the API serves them back to the user through the frontend.

API Design

REST API

The REST API provides comprehensive access to all platform features with endpoints organized by resource type. See the REST API Reference for complete documentation.

Specialized Endpoints

Dedicated endpoints handle file transfer and service health:

  • /api/upload/ - Chunked file uploads
  • /api/download/ - Secure file downloads
  • /health/ - Service health checks
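
A chunked upload might look like the following sketch; the multipart field names (blob, filename, offset) are assumptions, not Flow's documented contract:

import requests

API = "https://api.flow.bio"   # assumed host
TOKEN = "..."                  # JWT access token (see Authentication below)
CHUNK_SIZE = 5 * 1024 * 1024   # 5 MiB per request

with open("reads_1.fastq.gz", "rb") as f:
    offset = 0
    while chunk := f.read(CHUNK_SIZE):
        requests.post(
            f"{API}/api/upload/",
            headers={"Authorization": f"Bearer {TOKEN}"},
            files={"blob": ("reads_1.fastq.gz", chunk)},
            data={"filename": "reads_1.fastq.gz", "offset": offset},
        ).raise_for_status()
        offset += len(chunk)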

Authentication

JWT-based authentication flow:

  1. User logs in with credentials
  2. API returns access and refresh tokens
  3. Frontend includes token in Authorization header
  4. Tokens refresh automatically before expiration
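
In code, the flow looks roughly like this; the token endpoint paths follow SimpleJWT's defaults and are assumptions, not confirmed Flow routes:

import requests

API = "https://api.flow.bio"   # assumed host

# Steps 1-2: log in and receive access + refresh tokens.
tokens = requests.post(
    f"{API}/api/token/",
    json={"username": "alice", "password": "secret"},
).json()

# Step 3: include the access token on each request.
headers = {"Authorization": f"Bearer {tokens['access']}"}

# Step 4: refresh the access token before it expires.
tokens["access"] = requests.post(
    f"{API}/api/token/refresh/",
    json={"refresh": tokens["refresh"]},
).json()["access"]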

Infrastructure

Storage Volumes

Shared volumes enable data persistence and inter-container communication:

  • uploads: User-uploaded files (FASTQ, etc.)
  • executions: Pipeline execution outputs
  • pipelines: Git repositories of pipeline code
  • configs: Pipeline configuration files
  • media: User avatars and images
  • zips: Bulk download archives

Deployment Options

Cloud Platforms

  • AWS with managed services (RDS, ECS, S3)
  • Google Cloud Platform (Cloud SQL, GKE, Cloud Storage)
  • Azure (Database for PostgreSQL, AKS, Blob Storage)

On-Premise Deployments

  • Kubernetes clusters for scalability
  • Integration with existing infrastructure
  • Support for air-gapped environments

HPC Integration

  • Slurm executor for job submission
  • NFS/SMB mounts for shared storage
  • Custom runners for institute-specific setups

Security Considerations

Network Security

  • Services communicate over private networks
  • Public endpoints protected by WAF
  • TLS termination at load balancer

Data Security

  • File access validated through API permissions
  • Database connections use SSL
  • Sensitive configuration in environment variables

Compute Security

  • Pipeline execution in isolated containers
  • Resource limits enforced
  • User data segregation

Monitoring and Observability

Flow includes built-in monitoring capabilities:

  • Health endpoints for each service
  • Prometheus metrics for performance tracking
  • Grafana dashboards for visualization
  • Execution logs preserved for debugging

The architecture is designed to be observable, making it easy to diagnose issues and optimize performance.
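
As a quick liveness check, the health endpoints can be probed directly (the host name is an assumption):

import requests

# A 200 response indicates the service is up and reachable.
response = requests.get("https://api.flow.bio/health/", timeout=5)
print(response.status_code)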
