Advanced Topics
Architecture guide
Flow is built as a comprehensive bioinformatics data analysis platform using a multi-layered architecture. This guide explains how the different components work together to provide a seamless experience for analyzing biological data.
5-Layer Architecture
Flow operates on five distinct layers, each adding functionality and abstraction over the layer below it. Users can interact with Flow at any layer depending on their needs and technical expertise.
Layer 1: Bioinformatics Tools
The foundation of Flow consists of established command-line bioinformatics tools that perform the actual data analysis:
- STAR and Bowtie for sequence alignment
- MultiQC for quality control reports
- Ultraplex for demultiplexing
- Various specialized tools for RNA-seq, ChIP-seq, CLIP-seq analysis
These tools are typically resource-intensive and are designed to run on high-performance computing (HPC) clusters or cloud infrastructure.
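At this layer, everything is ultimately a command-line invocation. As a rough illustration (not how Flow itself calls these tools, which happens through Nextflow in the next layer), running MultiQC directly from Python might look like this; the input and output paths are placeholders:

import subprocess

# Run MultiQC over a directory of QC outputs; paths are placeholders.
result = subprocess.run(
    ["multiqc", "results/qc", "--outdir", "results/multiqc"],
    capture_output=True,
    text=True,
)

if result.returncode != 0:
    # Tools at this layer signal failure via exit codes and stderr,
    # which the layers above must capture and surface to users.
    print(result.stderr)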
Layer 2: Nextflow Pipelines
Bioinformatics tools are orchestrated using Nextflow, a workflow language designed for scalable and reproducible pipelines:
process GUNZIP {
    label 'process_low'
    container 'biocontainers/biocontainers:v1.2.0_cv1'

    input:
    path(archive)

    output:
    path("$gunzip"), emit: gunzip

    script:
    gunzip = archive.toString() - '.gz'
    """
    gunzip -f $archive
    """
}
Key features:
- Modules: Reusable components wrapping individual tools
- Workflows: Chain modules together with data channels
- Executors: Run on local machines, HPC clusters, or cloud platforms
- Containers: Docker/Singularity support for reproducibility
Layer 3: nextflow.py
A Python library that provides programmatic access to Nextflow pipelines:
import nextflow

# Run a pipeline from Python
execution = nextflow.run(
    pipeline="nf-core/rnaseq",
    params={"input": "samples.csv", "genome": "GRCh38"},
    profile=["docker"]
)

# Access execution details
print(f"Started: {execution.started}")
print(f"Duration: {execution.duration}")
This layer enables Flow to integrate Nextflow pipelines into a larger application framework.
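The same library can also monitor a run while it is in progress, which is how higher layers track execution state. A minimal sketch is below; the run_and_poll function, its sleep argument, and the process_executions attribute are assumptions about nextflow.py's interface rather than something this guide specifies:

import nextflow

# Launch the pipeline and poll it while it runs; each iteration yields a
# fresh snapshot of the execution state (interface names assumed).
for execution in nextflow.run_and_poll(
    pipeline="nf-core/rnaseq",
    params={"input": "samples.csv", "genome": "GRCh38"},
    profile=["docker"],
    sleep=10,
):
    for proc in execution.process_executions:
        # e.g. "FASTQC (sample1): COMPLETED"
        print(f"{proc.name}: {proc.status}")

print(f"Final status: {execution.status}")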
Layer 4: Django API
The core application layer built with Django, providing:
- REST API for comprehensive platform access
- Dedicated endpoints for file uploads, downloads, and health checks
- Database models for users, pipelines, executions, and data
- Celery workers for asynchronous pipeline execution (see the sketch after this list)
- Permission system with multi-tenant support
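The Celery workers mentioned above are the bridge between the web layer and Layer 3: the API records a submission and enqueues a task, and a worker launches the pipeline and writes the results back. A simplified, hypothetical sketch of such a task follows; the Execution model and field names are illustrative, not Flow's actual schema:

import nextflow
from celery import shared_task

@shared_task
def run_pipeline(execution_id):
    """Hypothetical worker task: launch a pipeline run recorded by the API."""
    # Imagined Django model storing the user's submission; not Flow's actual schema.
    from analysis.models import Execution

    record = Execution.objects.get(id=execution_id)
    result = nextflow.run(
        pipeline=record.pipeline_path,
        params=record.params,
        profile=["docker"],
    )

    # Persist the outcome so the API and frontend can report it.
    record.status = result.status
    record.duration = result.duration
    record.save()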
Layer 5: Web Clients
User-facing interfaces that consume the API:
- React Frontend: Modern single-page application at app.flow.bio
- flowbio Python Client: Programmatic access for automation and scripting
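As a quick illustration of the flowbio client, a login-and-upload script might look like the following; the Client, login, and upload_sample names are assumptions about the client's interface and should be checked against the flowbio documentation:

import flowbio

# Authenticate and upload a sequencing file (interface names assumed).
client = flowbio.Client()
client.login("username", "password")

sample = client.upload_sample("reads_R1.fastq.gz")
print(sample)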
Container Architecture
Flow runs as a set of 10 Docker containers working together in a single network:
Core Services
db - PostgreSQL database
- Stores all application data and metadata
- Persisted via Docker volumes
rabbit - RabbitMQ message broker
- Manages communication between API and workers
- Handles task queuing for pipeline execution
api - Django application server
- Serves REST API endpoints
- Handles authentication and authorization
- Manages file uploads and pipeline submissions
Worker Services
celery1 - Execution workers
- Runs Nextflow pipelines
- Monitors execution progress
- Updates database with results
celery2 - Background task workers
- Handles file operations
- Generates bulk downloads
- Performs maintenance tasks
beat2 - Task scheduler
- Triggers periodic tasks
- Manages cleanup operations
Web Services
front - React frontend server
- Serves the single-page application
- No direct container dependencies
files - Static file server
- Serves uploaded data and results
- Validates access permissions via API
Support Services
db_backups - Database backup service
- Creates periodic SQL dumps
- Ensures data persistence
local_backup - File backup service (optional)
- Syncs uploaded and generated files
- Provides additional redundancy
Data Flow
Understanding how data moves through Flow helps explain the architecture:
1. Upload Phase
   User → Frontend → API → Storage Volume
                      ↓
              PostgreSQL (metadata)
2. Execution Phase
   API → RabbitMQ → Celery Worker
                          ↓
                      Nextflow
                          ↓
                   HPC/Cloud Compute
3. Results Phase
   Execution Output → Storage Volume
            ↓
   Celery Worker → PostgreSQL
            ↓
   API → Frontend → User
API Design
REST API
The REST API provides comprehensive access to all platform features with endpoints organized by resource type. See the REST API Reference for complete documentation.
REST Endpoints
Specialized endpoints for file operations and service health:
/api/upload/ - Chunked file uploads
/api/download/ - Secure file downloads
/health/ - Service health checks
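As a hedged sketch of how a script might use these endpoints with the requests library (the base URL, chunk field names, and authorization header format are assumptions, not a documented contract):

import requests

API = "https://api.flow.bio"  # placeholder base URL
HEADERS = {"Authorization": "Bearer <access-token>"}

# Service health check
print(requests.get(f"{API}/health/", timeout=5).status_code)

# Chunked upload: send the file in fixed-size pieces (field names assumed).
CHUNK_SIZE = 5 * 1024 * 1024
with open("reads_R1.fastq.gz", "rb") as handle:
    index = 0
    while chunk := handle.read(CHUNK_SIZE):
        requests.post(
            f"{API}/api/upload/",
            headers=HEADERS,
            data={"filename": "reads_R1.fastq.gz", "chunk": index},
            files={"blob": chunk},
        )
        index += 1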
Authentication
JWT-based authentication flow:
1. User logs in with credentials
2. API returns access and refresh tokens
3. Frontend includes the access token in the Authorization header
4. Tokens are refreshed automatically before expiration
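In code, the same flow looks roughly like this; the token endpoint paths and response field names are assumptions for illustration only:

import requests

API = "https://api.flow.bio"  # placeholder base URL

# 1-2. Log in with credentials; receive access and refresh tokens
#      (endpoint path and field names assumed).
tokens = requests.post(
    f"{API}/auth/login/",
    json={"username": "alice", "password": "secret"},
).json()

# 3. Send the access token in the Authorization header on each request.
profile = requests.get(
    f"{API}/api/me/",
    headers={"Authorization": f"Bearer {tokens['access']}"},
)

# 4. Before the access token expires, trade the refresh token for a new one.
tokens = requests.post(
    f"{API}/auth/refresh/",
    json={"refresh": tokens["refresh"]},
).json()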
Infrastructure
Storage Volumes
Shared volumes enable data persistence and inter-container communication:
- uploads: User-uploaded files (FASTQ, etc.)
- executions: Pipeline execution outputs
- pipelines: Git repositories of pipeline code
- configs: Pipeline configuration files
- media: User avatars and images
- zips: Bulk download archives
Deployment Options
Cloud Platforms
- AWS with managed services (RDS, ECS, S3)
- Google Cloud Platform (Cloud SQL, GKE, Cloud Storage)
- Azure (Database for PostgreSQL, AKS, Blob Storage)
On-Premise Deployments
- Kubernetes clusters for scalability
- Integration with existing infrastructure
- Support for air-gapped environments
HPC Integration
- Slurm executor for job submission
- NFS/SMB mounts for shared storage
- Custom runners for institute-specific setups
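Because pipeline launches go through nextflow.py (Layer 3), targeting an HPC scheduler is largely a matter of selecting the right profile or executor configuration. A hedged sketch, reusing the run call from the Layer 3 example and assuming the pipeline defines a slurm profile:

import nextflow

# Launch on a Slurm cluster instead of local Docker; this assumes the
# pipeline ships a "slurm" profile (or an equivalent custom config) that
# sets process.executor = 'slurm' and points work directories at shared
# NFS/SMB storage.
execution = nextflow.run(
    pipeline="nf-core/rnaseq",
    params={"input": "samples.csv", "genome": "GRCh38"},
    profile=["slurm", "singularity"],
)
print(f"Status: {execution.status}")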
Security Considerations
Network Security
- Services communicate over private networks
- Public endpoints protected by WAF
- TLS termination at load balancer
Data Security
- File access validated through API permissions
- Database connections use SSL
- Sensitive configuration in environment variables
Compute Security
- Pipeline execution in isolated containers
- Resource limits enforced
- User data segregation
Monitoring and Observability
Flow includes built-in monitoring capabilities:
- Health endpoints for each service
- Prometheus metrics for performance tracking
- Grafana dashboards for visualization
- Execution logs preserved for debugging
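As a concrete example of using the health endpoints above, a small watchdog script could poll each service and report failures; the service URLs below are placeholders:

import requests

# Placeholder health-check URLs for the public-facing services.
SERVICES = {
    "api": "https://api.flow.bio/health/",
    "files": "https://files.flow.bio/health/",
}

for name, url in SERVICES.items():
    try:
        healthy = requests.get(url, timeout=5).status_code == 200
    except requests.RequestException:
        healthy = False
    print(f"{name}: {'up' if healthy else 'DOWN'}")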
The architecture is designed to be observable, making it easy to diagnose issues and optimize performance.