Core Flow Concepts
Data
Data in Flow represents any file used or generated during bioinformatics analysis. From raw sequencing reads to final publication figures, Flow tracks, organizes, and preserves all data while maintaining complete provenance and ensuring reproducibility.
What is Data?
In Flow, "data" encompasses:
- Input files: Raw sequencing data, reference genomes, annotation files
- Intermediate files: Aligned reads, normalized counts, quality reports
- Output files: Analysis results, visualizations, summary statistics
- Metadata files: Sample sheets, configuration files, logs
- Associated files: Publications, protocols, supplementary materials
Every piece of data in Flow is:
- Tracked: With checksums, sizes, and creation timestamps
- Owned: By a specific user or group
- Versioned: Changes create new versions, preserving originals
- Accessible: Through permissions and sharing controls
- Preserved: With backups and long-term storage options
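To make the tracking concrete, the sketch below computes the checksum, size, and timestamp that would accompany a file at registration time. It uses only the Python standard library; the field names are illustrative and not Flow's actual schema.

```python
import hashlib
import os
from datetime import datetime, timezone

def describe_file(path: str, chunk_size: int = 1 << 20) -> dict:
    """Compute illustrative tracking metadata for a file: content checksum,
    size in bytes, and a UTC timestamp. Field names are not Flow's schema."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
    return {
        "path": path,
        "md5": md5.hexdigest(),
        "size_bytes": os.stat(path).st_size,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
```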
Data Types
Sequencing Data
FASTQ Files
- Raw sequencing reads
- Single-end or paired-end
- Compressed formats (.gz, .bz2)
- Quality scores included
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
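Each FASTQ record is four lines: a header beginning with @, the sequence, a + separator, and a quality string the same length as the sequence. A minimal sketch for iterating over records in a gzipped FASTQ file, assuming well-formed input:

```python
import gzip

def iter_fastq(path: str):
    """Yield (header, sequence, quality) tuples from a gzipped FASTQ file.
    Assumes well-formed four-line records."""
    with gzip.open(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip()
            if not header:
                break
            sequence = handle.readline().rstrip()
            handle.readline()  # the '+' separator line
            quality = handle.readline().rstrip()
            yield header, sequence, quality

# Example (file name illustrative): count reads in a file
# n_reads = sum(1 for _ in iter_fastq("sample_001_R1.fastq.gz"))
```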
BAM/SAM Files
- Aligned sequencing reads
- Binary (BAM) or text (SAM) format
- Contains alignment information
- Indexed for fast access
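The index is what makes region queries fast: tools can jump straight to the reads overlapping a locus instead of scanning the whole file. A sketch using pysam, assuming the package is installed and a .bai index sits next to the BAM (file name and coordinates are illustrative):

```python
import pysam

with pysam.AlignmentFile("sample_001.bam", "rb") as bam:
    # fetch() uses the index to seek directly to the requested region
    for read in bam.fetch("chr1", 100_000, 101_000):
        print(read.query_name, read.reference_start, read.mapping_quality)
```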
CRAM Files
- Compressed alignment format
- Reference-based compression
- Smaller than BAM files
- Requires reference genome
Analysis Results
Count Matrices
- Gene expression values
- Peak counts
- Feature quantification
- Usually tab-delimited
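Because count matrices are usually tab-delimited with features as rows and samples as columns, they load directly into a data frame; a short sketch with pandas (file name illustrative):

```python
import pandas as pd

# Rows are genes/peaks/features, columns are samples; first column holds feature IDs.
counts = pd.read_csv("gene_counts.tsv", sep="\t", index_col=0)
print(counts.shape)        # (n_features, n_samples)
print(counts.sum(axis=0))  # total counts per sample
```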
Differential Analysis
- Statistical comparisons
- Fold changes and p-values
- Gene lists
- Pathway enrichments
Visualizations
- Quality control plots
- Heatmaps and clustering
- PCA/t-SNE plots
- Publication figures
Reference Data
Genome Files
- FASTA format sequences
- Chromosome assemblies
- Indexed for alignment
- Species-specific
Annotation Files
- GTF/GFF gene models
- BED format regions
- Gene ontology data
- Pathway databases
Metadata Files
Sample Sheets
- Experimental design
- Sample-to-file mapping
- Batch information
- Clinical data
Configuration Files
- Pipeline parameters
- Tool settings
- Resource allocations
- Custom options
Data Lifecycle
1. Upload Phase
Data enters Flow through several routes:
Direct Upload
- Web interface drag-and-drop
- Progress tracking
- Automatic validation
- Batch uploading
API Upload
- Programmatic transfer
- Integration with LIMS
- Automated workflows
- Bulk operations
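The exact upload routes and authentication scheme depend on your Flow deployment, so the snippet below is only a hedged sketch of what a programmatic transfer might look like with the requests library; the endpoint, token, and form fields are hypothetical.

```python
import requests

# Hypothetical endpoint and token -- consult your Flow API documentation
# for the real upload routes and authentication scheme.
API_URL = "https://flow.example.org/api/uploads"
TOKEN = "your-api-token"

with open("sample_001_R1.fastq.gz", "rb") as handle:
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {TOKEN}"},
        files={"file": handle},
        data={"project": "PRJ-001", "sample": "Sample_001"},
    )
response.raise_for_status()
print(response.json())
```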
External Import
- From sequencing facilities
- Cloud storage (S3, GCS)
- FTP/SFTP servers
- Public databases
2. Validation Phase
Upon upload, Flow validates:
- File integrity: Checksums match
- Format compliance: Valid file structure
- Completeness: All expected files present
- Quality metrics: Basic QC checks
- Metadata: Required fields populated
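As a simplified stand-in for these checks, the sketch below verifies that a gzipped FASTQ file matches an expected checksum and that its first record has the expected four-line structure:

```python
import gzip
import hashlib

def validate_fastq_gz(path: str, expected_md5: str) -> None:
    """Simplified integrity and format check for a gzipped FASTQ file."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            md5.update(chunk)
    if md5.hexdigest() != expected_md5:
        raise ValueError(f"checksum mismatch for {path}")

    with gzip.open(path, "rt") as handle:
        record = [handle.readline().rstrip() for _ in range(4)]
    if not record[0].startswith("@") or not record[2].startswith("+"):
        raise ValueError(f"{path} does not look like FASTQ")
    if len(record[1]) != len(record[3]):
        raise ValueError("sequence and quality strings differ in length")
```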
3. Storage Phase
Data is stored with:
- Redundancy: Multiple copies for safety
- Compression: Space-efficient storage
- Organization: Logical directory structure
- Indexing: Fast retrieval
- Encryption: Security at rest
4. Processing Phase
During analysis:
- Access control: Permission verification
- Staging: Data prepared for compute
- Monitoring: Track usage and progress
- Temporary files: Managed workspace
- Result capture: Automatic collection
5. Archive Phase
Long-term preservation:
- Cold storage: Cost-effective archival
- Metadata preservation: Complete context
- Retrieval options: On-demand access
- Compliance: Regulatory requirements
- Retention policies: Automated management
Data Organization
Hierarchical Structure
Flow organizes data hierarchically:
Project/
├── Samples/
│   ├── Sample_001/
│   │   ├── raw_data/
│   │   │   ├── sample_001_R1.fastq.gz
│   │   │   └── sample_001_R2.fastq.gz
│   │   └── executions/
│   │       ├── rnaseq_run_001/
│   │       │   ├── alignments/
│   │       │   ├── counts/
│   │       │   └── qc_reports/
│   │       └── rnaseq_run_002/
│   └── Sample_002/
└── Shared_Data/
    ├── reference_genome/
    └── annotation_files/
Naming Conventions
Automatic Naming
- System-generated unique IDs
- Timestamp-based versioning
- Pipeline-specific patterns
- Collision avoidance
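As a rough illustration of how collision-free automatic names can be composed (a pipeline-specific prefix, a sortable timestamp, and a random suffix), not Flow's actual scheme:

```python
import uuid
from datetime import datetime, timezone

def auto_name(prefix: str) -> str:
    """Illustrative only: prefix + UTC timestamp + random suffix."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    return f"{prefix}_{stamp}_{uuid.uuid4().hex[:8]}"

print(auto_name("rnaseq_run"))  # e.g. rnaseq_run_20250101T120000_a1b2c3d4
```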
Custom Naming
- User-defined schemes
- Metadata-based patterns
- Batch renaming tools
- Export formatting
Data Relationships
Parent-Child
- Raw data → Processed data
- Sample → Analysis results
- Pipeline → Output files
Versioning
- Original → Modified versions
- Parameter changes → New outputs
- Re-analysis → Updated results
Cross-References
- Multiple samples → Combined analysis
- Shared references → Many analyses
- Integrated results → Multi-omics
Data Access
Permissions Model
Access Levels:
- None: No access to data
- View: Can see and download
- Use: Can use in analyses
- Modify: Can update metadata
- Delete: Can remove data
Inheritance:
- Project permissions flow to data
- Sample permissions apply to results
- Explicit permissions override inherited
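The override rule can be sketched as a small resolution function; the level names below are illustrative, not Flow's internal representation:

```python
def resolve_access(explicit: str | None, inherited: str | None) -> str:
    """An explicit permission on the data item wins; otherwise fall back
    to the permission inherited from the project or sample."""
    if explicit is not None:
        return explicit
    return inherited or "none"

print(resolve_access(None, "view"))      # view (inherited from the project)
print(resolve_access("modify", "view"))  # modify (explicit override)
```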
Sharing Mechanisms
Direct Sharing
- Grant specific user access
- Set expiration dates
- Track access history
- Revoke when needed
Group Sharing
- Share with lab groups
- Institutional access
- Collaborative projects
- Managed centrally
Public Sharing
- Make data publicly accessible
- Generate stable URLs
- DOI assignment
- Citation tracking
Data Transfer
Download Options
- Individual file download
- Bulk download with manifest
- Compressed archives
- Cloud-to-cloud transfer
Streaming Access
- View without downloading
- Partial file access
- Real-time processing
- Bandwidth optimization
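Partial access over HTTP is typically implemented with Range requests; a hedged sketch with the requests library (the URL is hypothetical, and the server must support byte ranges):

```python
import requests

# Hypothetical download URL; real URLs come from the Flow interface or API.
url = "https://flow.example.org/data/12345/sample_001.bam"

# Fetch only the first 64 KiB instead of downloading the whole file.
response = requests.get(url, headers={"Range": "bytes=0-65535"})
response.raise_for_status()
print(response.status_code, len(response.content))  # 206 means partial content was returned
```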
Data Quality
Quality Metrics
Sequencing Quality
- Per-base quality scores
- Read length distribution
- Adapter contamination
- Duplication rates
- GC content
Alignment Quality
- Mapping rates
- Insert size distribution
- Coverage uniformity
- Mapping quality scores
Analysis Quality
- Statistical confidence
- Technical replicates
- Batch effects
- Outlier detection
Quality Control
Automatic QC
- FastQC on uploads
- MultiQC aggregation
- Pipeline-specific checks
- Threshold alerts
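Flow runs these tools automatically, but the same checks can be reproduced locally; a sketch that invokes FastQC and MultiQC from Python, assuming both tools are installed and on the PATH (file names illustrative):

```python
import os
import subprocess

fastq_files = ["sample_001_R1.fastq.gz", "sample_001_R2.fastq.gz"]
os.makedirs("qc", exist_ok=True)

# Per-file QC reports, then a single aggregated report.
subprocess.run(["fastqc", *fastq_files, "--outdir", "qc"], check=True)
subprocess.run(["multiqc", "qc", "--outdir", "qc"], check=True)
```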
Manual Review
- Interactive reports
- Custom visualizations
- Comparative analysis
- Expert annotation
Data Formats
Supported Formats
Sequencing
- FASTQ (.fastq, .fq, .fastq.gz)
- FASTA (.fasta, .fa, .fasta.gz)
- SAM/BAM/CRAM (.sam, .bam, .cram)
Genomic Regions
- BED (.bed, .bed.gz)
- GFF/GTF (.gff, .gtf, .gff3)
- VCF (.vcf, .vcf.gz)
- BigWig/BigBed (.bw, .bb)
Tabular Data
- CSV/TSV (.csv, .tsv, .txt)
- Excel (.xlsx, .xls)
- HDF5 (.h5, .hdf5)
- Parquet (.parquet)
Images
- PNG/JPEG (.png, .jpg, .jpeg)
- SVG (.svg)
- PDF (.pdf)
- TIFF (.tif, .tiff)
Format Conversion
Flow provides automatic conversion between:
- FASTQ ↔ FASTA
- SAM ↔ BAM
- GFF ↔ GTF
- Various compression formats
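The same conversions can be reproduced outside Flow with standard libraries; for example, Biopython's SeqIO.convert handles FASTQ to FASTA, and pysam can rewrite a SAM file as BAM (paths illustrative, both packages assumed installed):

```python
import pysam
from Bio import SeqIO

# FASTQ -> FASTA (quality scores are dropped)
SeqIO.convert("reads.fastq", "fastq", "reads.fasta", "fasta")

# SAM -> BAM: copy the header, then rewrite each record in binary form
with pysam.AlignmentFile("aln.sam", "r") as sam, \
     pysam.AlignmentFile("aln.bam", "wb", template=sam) as bam:
    for read in sam:
        bam.write(read)
```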
Best Practices
Data Upload
Verify before upload
- Check file integrity
- Confirm correct files
- Review file sizes
- Test with small batch
Organize locally first
- Clear naming scheme
- Logical folder structure
- Complete metadata
- Remove duplicates
Use appropriate methods
- Web for small files
- API for automation
- Bulk upload for many files
- Direct transfer for large data
Data Management
Maintain metadata
- Document thoroughly
- Update regularly
- Use controlled vocabularies
- Link related data
Monitor usage
- Track storage costs
- Review access logs
- Clean up temporary files
- Archive completed projects
Plan for growth
- Estimate storage needs
- Budget for computation
- Design scalable organization
- Implement retention policies
Data Sharing
Set permissions carefully
- Minimum necessary access
- Time-limited shares
- Review regularly
- Document decisions
Prepare for publication
- Clean up file names
- Complete all metadata
- Generate DOIs
- Create data packages
Advanced Features
Data Provenance
Complete tracking of data lineage:
- Origin and creation time
- All processing steps
- Parameter settings used
- Software versions
- User actions
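A provenance record can be thought of as a small structured object attached to each file; the dataclass below is an illustrative model of the fields involved, not Flow's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class ProvenanceRecord:
    """Illustrative model of the lineage information tracked per file."""
    file_id: str
    created_at: str        # ISO 8601 timestamp
    created_by: str        # user who triggered the step
    pipeline: str          # e.g. "rnaseq"
    pipeline_version: str
    parameters: dict = field(default_factory=dict)    # settings used for this step
    parent_files: list = field(default_factory=list)  # inputs this file was derived from
```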
Data Integration
Combine data from multiple sources:
- Multi-omics integration
- Cross-project analysis
- External database links
- Federated queries
Automated Workflows
Trigger actions on data events:
- Auto-process on upload
- Quality check pipelines
- Notification rules
- Archival policies
Data APIs
Programmatic access to all data:
- RESTful endpoints
- GraphQL queries
- Streaming protocols
- Bulk operations
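As with uploads, the exact routes depend on your deployment; a hedged sketch of listing data records through a hypothetical REST endpoint and an equivalent hypothetical GraphQL query:

```python
import requests

BASE = "https://flow.example.org"  # hypothetical host
HEADERS = {"Authorization": "Bearer your-api-token"}

# REST-style listing (endpoint and parameters are illustrative)
files = requests.get(f"{BASE}/api/data", headers=HEADERS,
                     params={"project": "PRJ-001", "type": "fastq"}).json()

# GraphQL equivalent against a hypothetical /graphql endpoint
query = """
query ($project: String!) {
  data(project: $project) { id name size created }
}
"""
result = requests.post(f"{BASE}/graphql", headers=HEADERS,
                       json={"query": query, "variables": {"project": "PRJ-001"}}).json()
print(len(files), list(result))
```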
Troubleshooting
Common Issues
Upload failures?
- Check network stability
- Verify file permissions
- Ensure sufficient space
- Try smaller chunks
Can't access data?
- Verify permissions
- Check project membership
- Confirm data location
- Contact data owner
Format not recognized?
- Verify file extension
- Check file contents
- Try manual format specification
- Convert if necessary
Performance issues?
- Use appropriate access method
- Consider data locality
- Optimize query patterns
- Request increased resources
Next Steps
- Upload your data: Get started with data upload
- File formats guide: Detailed format specifications
- Running analyses: Process your data
- Data permissions: Control access
For a comprehensive overview of Flow's data concepts, see the Core Concepts guide.